NOAI China 2024 · Tabular ML · Predicting basketball shooting percentage
Contest: NOAI China 2024 / APOAI 2025 mock · Round: Phase 1, Problem 1 · Category: Classical ML / tabular binary classification.
Official sources: jaredliw/ioai-tsp-2025 — basketball-shooting · open-cu/awesome-ioai-tasks (index). The base data is derived from the Kaggle "Kobe Bryant Shot Selection" public dataset, with features normalised by the organisers.
1. Problem restatement
Given a CSV of historical shots taken by a basketball star, predict whether each shot in the held-out test set was made (1) or missed (0). The features (verbatim from the official notebook):
# columns in data_train.csv
loc_x # normalized horizontal court position at shot time
loc_y # normalized vertical court position at shot time
minutes_remaining # minutes left in current quarter (normalized)
shot_distance # distance from shooter to basket (normalized)
shot_made_flag # TARGET: 1 if made, 0 if missed
shot_id # unique sample id
The task statement specifies only loc_x and loc_y as the input
features — the other numeric columns are "available but optional". The submission must include
a PyTorch model (submission_model.py + submission_dic.pth) and predictions.
Metric: accuracy on the hidden test set.
2. What's being tested
Tabular classification on a low-dimensional dataset where the signal is geometric — shooting percentage drops with distance from the basket and depends on angle. The test is whether you:
- Inspect the data before modelling (basic EDA, distribution of loc_x, loc_y, class imbalance).
- Recognise when feature engineering trivially beats deep models (polar coordinates from Cartesian).
- Know when gradient boosting beats neural networks on small tabular data.
- Use cross-validation properly and don't leak validation-fold statistics via scaler fitting.
Maps onto the Classical ML page (logistic regression, random forest, gradient boosting, calibration) and the Python page (pandas, sklearn pipelines).
3. Data exploration / setup
The training file typically has ~25 000 rows after dropping rows with NaN in
shot_made_flag. Hidden test set: a few thousand rows. Class balance is close to 50/50
(real pro shooters make roughly half their attempts) so plain accuracy is a fair metric.
Critical observation: in basketball-court coordinates, shots cluster on a horseshoe pattern (corners +
arc + paint). A linear classifier in (loc_x, loc_y) space sees an almost rotationally
symmetric structure and underfits. A model that sees distance and angle as
features fits trivially.
Run this first:
import pandas as pd, numpy as np, matplotlib.pyplot as plt
df = pd.read_csv("data_train.csv")
print(df.shape, df.shot_made_flag.value_counts(normalize=True))
# scatter colored by shot result
ax = df.sample(3000).plot.scatter("loc_x", "loc_y",
                                  c="shot_made_flag", cmap="bwr", alpha=0.4)
plt.show()
# make rate (fraction of shots made) as a function of distance;
# this assumes the basket sits at (0, 0) in the normalised coordinates
df["dist"] = np.hypot(df.loc_x, df.loc_y)
print(df.groupby(pd.cut(df.dist, 10), observed=True).shot_made_flag.mean())
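The critical observation above can be checked without the contest files. The sketch below is a minimal illustration on synthetic data, not the organisers' data: the make probability is assumed to fall off with distance from the origin, and the helper fit_logreg is a hypothetical hand-rolled logistic regression. It compares a linear model on raw (x, y) with the same model on distance alone.

```python
import numpy as np


def fit_logreg(F, y, lr=1.0, steps=1000):
    """Gradient-descent logistic regression on standardised features."""
    mu, sd = F.mean(axis=0), F.std(axis=0)
    Z = (F - mu) / sd
    w, b = np.zeros(Z.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(Z @ w + b)))
        w -= lr * (Z.T @ (p - y)) / len(y)
        b -= lr * np.mean(p - y)
    # Return a predictor that reuses the training mean and std.
    return lambda G: ((((G - mu) / sd) @ w + b) > 0).astype(int)


# Synthetic court: make probability depends only on distance from (0, 0).
rng = np.random.default_rng(0)
xy = rng.uniform(-1.0, 1.0, size=(8000, 2))
r = np.hypot(xy[:, 0], xy[:, 1])
p_make = 1.0 / (1.0 + np.exp(10.0 * (r - 0.7)))  # assumed, not fitted to real shots
made = (rng.random(8000) < p_make).astype(int)
tr, te = slice(0, 4000), slice(4000, 8000)

predict_raw = fit_logreg(xy[tr], made[tr])          # linear in (x, y)
predict_radial = fit_logreg(r[tr, None], made[tr])  # linear in distance
acc_raw = np.mean(predict_raw(xy[te]) == made[te])
acc_radial = np.mean(predict_radial(r[te, None]) == made[te])
print(f"raw (x, y): {acc_raw:.3f}   distance only: {acc_radial:.3f}")
```

On data built this way the raw-coordinate model has no usable linear direction, so it falls back to roughly the majority-class rate, while the distance model recovers the radial boundary. Real shot data is noisier, so the gap there will be smaller.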
4. Baseline approach
A 2-layer MLP (two hidden Tanh layers of width 8) on raw (loc_x, loc_y), the exact baseline shown in the official notebook,
hits roughly 56-60% test accuracy.
import numpy as np, pandas as pd
import torch, torch.nn as nn, torch.optim as optim
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
df = pd.read_csv("data_train.csv").dropna(subset=["shot_made_flag"])
X = df[["loc_x", "loc_y"]].to_numpy().astype("float32")
y = df["shot_made_flag"].to_numpy().astype("float32")
Xtr, Xva, ytr, yva = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(2, 8), nn.Tanh(),
                               nn.Linear(8, 8), nn.Tanh(),
                               nn.Linear(8, 1))
    def forward(self, x): return self.f(x).squeeze(-1)  # raw logits, shape (N,)
m = MLP()
opt = optim.Adam(m.parameters(), lr=1e-3)
crit = nn.BCEWithLogitsLoss()
Xtr_t = torch.tensor(Xtr); ytr_t = torch.tensor(ytr)
for epoch in range(50):  # full-batch training: one optimizer step per epoch
    opt.zero_grad()
    loss = crit(m(Xtr_t), ytr_t)
    loss.backward(); opt.step()
m.eval()
with torch.no_grad():
    pred = (torch.sigmoid(m(torch.tensor(Xva))) >= 0.5).numpy().astype(int)
print("val acc:", accuracy_score(yva, pred)) # ~ 0.58 [illustrative]
Score band: ~56-60% accuracy with default seeds. The MLP is barely better than predicting the majority class. [illustrative]
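To judge "barely better than the majority class" it helps to compute that reference number explicitly. The following is a small sketch; majority_baseline_accuracy is a hypothetical helper name and y_demo is a made-up label vector, not the contest labels.

```python
import numpy as np


def majority_baseline_accuracy(y):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = np.bincount(np.asarray(y).astype(int), minlength=2)
    return counts.max() / counts.sum()


# Made-up labels: 55 misses and 45 makes.
y_demo = np.array([0] * 55 + [1] * 45)
baseline = majority_baseline_accuracy(y_demo)
print(baseline)  # 0.55
```

Calling the same helper on the validation labels (yva in the baseline code) gives the number any model has to beat before its accuracy means anything.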
5. Improvements that move the needle
5.1 · Polar features (the single biggest win)
Convert (loc_x, loc_y) to (distance, angle). Distance is roughly monotonically
related to make-probability; angle is informative because the three-point line is closer to the basket
in the corners than at the top of the key, so the same distance can be a corner three or a long two
depending on angle. This single feature engineering step beats any tweak to the MLP
architecture.
def polar_features(X):
    r = np.hypot(X[:, 0], X[:, 1])        # distance from the origin
    theta = np.arctan2(X[:, 1], X[:, 0])  # angle in radians
    return np.column_stack([r, theta, np.sin(theta), np.cos(theta),
                            X[:, 0], X[:, 1]])
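Polar features only mean "distance to the basket" if the basket is at the origin of the coordinates. Because the organisers normalised the features, that may not hold exactly. The variant below is a sketch under that assumption: polar_features_centered and its origin parameter are hypothetical additions, and the right origin value has to be read off the data (for example from the densest cluster of close-range shots).

```python
import numpy as np


def polar_features_centered(X, origin=(0.0, 0.0)):
    """Polar features measured from an assumed basket position `origin`."""
    dx = X[:, 0] - origin[0]
    dy = X[:, 1] - origin[1]
    r = np.hypot(dx, dy)
    theta = np.arctan2(dy, dx)
    # sin/cos are included because theta itself jumps between -pi and +pi.
    return np.column_stack([r, theta, np.sin(theta), np.cos(theta), dx, dy])


# Three toy points with easy-to-check distances 5, 2 and 1.
pts = np.array([[3.0, 4.0], [0.0, 2.0], [-1.0, 0.0]])
feats = polar_features_centered(pts)
print(feats[:, 0])
```

With the default origin this reproduces the six columns of polar_features, so it can be swapped in without changing the downstream model's input size.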
5.2 · Gradient boosting beats MLP here
With ~25k rows, 6 features, and near-balanced labels, a histogram gradient booster reaches ~64-66% accuracy [illustrative] with the lightly tuned settings below, meaningfully above the MLP. This is the canonical "tabular data, small N, use trees" pattern.
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
X_polar = polar_features(X)
gbm = HistGradientBoostingClassifier(
    max_depth=6, learning_rate=0.05, max_iter=600,
    l2_regularization=1.0, early_stopping=True, validation_fraction=0.15,
    random_state=42)
scores = cross_val_score(gbm, X_polar, y, cv=StratifiedKFold(5, shuffle=True, random_state=0))
print("CV acc:", scores.mean(), "+/-", scores.std()) # ~ 0.65 [illustrative]
5.3 · Add the "optional" features back
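The problem statement says the inputs are loc_x, loc_y, but it does not forbid
using shot_distance and minutes_remaining. Read the rules carefully, and confirm that the grader
supplies these columns at inference: if it passes only raw (loc_x, loc_y), as described in section 6,
the model cannot use them at test time. Where they are available,
shot_distance is essentially a denoised version of polar distance and adds another 0.5-1
accuracy point [illustrative].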
5.4 · Wrap the model in a sklearn Pipeline so CV doesn't leak
A classic exam-day trap: StandardScaler().fit_transform(X) on the full data, then
cross-validate. The scaler has seen the val fold. Always wrap:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("gbm", HistGradientBoostingClassifier(max_iter=600, learning_rate=0.05, random_state=42)),
])
# Now cross_val_score refits the scaler inside each fold's training half.
5.5 · For the PyTorch-required submission: use the GBM to pseudo-label and distil
The contest requires a PyTorch model in the submission. If GBM beats your MLP, train the MLP on (X, GBM-predicted-probabilities) with cross-entropy against soft targets instead of (X, hard labels). The MLP can then mimic the GBM's decision boundary and you get the best of both worlds while satisfying the submission format.
# Teacher: GBM probabilities on the training features (in-sample soft targets)
probs_gbm = gbm.fit(X_polar, y).predict_proba(X_polar)[:, 1]
soft_t = torch.tensor(probs_gbm, dtype=torch.float32)
class MLP2(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(d, 32), nn.GELU(),
                               nn.Linear(32, 32), nn.GELU(),
                               nn.Linear(32, 1))
    def forward(self, x): return self.f(x).squeeze(-1)
m = MLP2(X_polar.shape[1])
opt = optim.AdamW(m.parameters(), lr=3e-3, weight_decay=1e-4)
crit = nn.BCEWithLogitsLoss()  # accepts soft targets in [0, 1] and is numerically stable
Xp_t = torch.tensor(X_polar, dtype=torch.float32)
for epoch in range(300):
    opt.zero_grad()
    loss = crit(m(Xp_t), soft_t)  # cross-entropy against the GBM's probabilities
    loss.backward(); opt.step()
6. Submission format & gotchas
- Submission is a ZIP containing submission_model.py (model class definition + a load_model() function), submission_dic.pth (state dict), and any preprocessing constants you need at inference.
- The grader instantiates your model via load_model(), loads the state dict, and calls model(X_tensor) for logits. Your predict must return a tensor, not a numpy array.
- If you use polar features at training, you must compute them inside load_model() / forward(), because the grader passes raw (loc_x, loc_y).
- Don't forget model.eval() in the loader. If you trained with BatchNorm, a model left in training mode normalises with per-batch statistics at inference, which silently changes predictions on small batches and can fail outright on a batch of size 1.
- Seed everything. Reproducibility is part of grading and reruns must produce the same number.
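The checklist above can be sketched as a single submission_model.py. This is a minimal illustration, not the official template: the class name ShotNet, the hidden width, and the path argument of load_model() are assumptions, and the real grader may load the state dict itself or call load_model() with a different signature.

```python
import os
import tempfile

import torch
import torch.nn as nn


class ShotNet(nn.Module):
    """Hypothetical model name; accepts raw (loc_x, loc_y) of shape (N, 2)."""

    def __init__(self, hidden=32):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(6, hidden), nn.GELU(),
                               nn.Linear(hidden, hidden), nn.GELU(),
                               nn.Linear(hidden, 1))

    def forward(self, x):
        # Rebuild the polar features inside the model so raw inputs are enough.
        lx, ly = x[:, 0], x[:, 1]
        r = torch.hypot(lx, ly)
        theta = torch.atan2(ly, lx)
        feats = torch.stack([r, theta, torch.sin(theta), torch.cos(theta), lx, ly], dim=1)
        return self.f(feats).squeeze(-1)  # logits as a tensor of shape (N,)


def load_model(path="submission_dic.pth"):
    # The `path` argument and its default are assumptions about the grader.
    model = ShotNet()
    model.load_state_dict(torch.load(path, map_location="cpu"))
    model.eval()  # switch off training-mode behaviour before inference
    return model


# Round trip: save a state dict, reload it, and score two raw shots.
torch.manual_seed(0)
with tempfile.TemporaryDirectory() as tmp:
    ckpt = os.path.join(tmp, "submission_dic.pth")
    torch.save(ShotNet().state_dict(), ckpt)
    model = load_model(ckpt)

with torch.no_grad():
    logits = model(torch.tensor([[0.1, 0.2], [-0.5, 0.3]]))
print(logits.shape)
```

Keeping the feature transform in forward() means the saved state dict and the class definition are all the grader needs. Any constants such as a basket origin or scaling statistics would have to be stored as buffers or hard-coded in the same file.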
7. What top solutions did
The reference notebook in the jaredliw repo uses the minimal 2-layer Tanh MLP described in the baseline above and achieves only the basic score — it is illustrative of "what gets you 50%", not "what wins". Top scorers in the actual NOAI 2024 selection (per write-ups discussed on the IOAI community contest forum) used (i) polar features, (ii) gradient boosting as an oracle, (iii) MLP distillation to satisfy the PyTorch submission format, and (iv) 5-seed ensembling at inference. Specific numeric leaderboard placements for the China selection are not published in English, so absolute scores are [illustrative].
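Seed ensembling, item (iv) above, amounts to averaging the predicted probabilities of several copies of the same model trained with different seeds. The sketch below shows only the averaging step on made-up logits; ensemble_predict is a hypothetical helper, and how the top solutions actually combined their models is not documented in the source.

```python
import numpy as np


def ensemble_predict(logits_per_seed, threshold=0.5):
    """Average per-seed probabilities and threshold the mean."""
    logits = np.asarray(logits_per_seed, dtype=float)  # (n_seeds, n_samples)
    probs = 1.0 / (1.0 + np.exp(-logits))
    mean_prob = probs.mean(axis=0)
    return mean_prob, (mean_prob >= threshold).astype(int)


# Made-up logits from three seeds for three shots.
demo_logits = [[2.0, -1.0, 1.0],
               [1.5, 0.5, -0.5],
               [2.5, -0.5, 0.5]]
mean_prob, labels = ensemble_predict(demo_logits)
print(labels)  # [1 0 1]
```

The second and third shots show the point of averaging: individual seeds disagree on the label, and the mean probability settles the vote.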
8. Drill
D · Why does a 2-layer MLP underperform a gradient booster on this dataset?
Two reasons. (1) Data scale: with only ~25k rows and 2-6 numeric features, a neural network gets little supervision per parameter, while trees with axis-aligned splits encode the "shots get harder beyond the 3-point line" structure with far fewer effective parameters. (2) Inductive bias: trees naturally handle non-monotone, threshold-like relationships (a three from the corner is suddenly easier, which is a hard threshold in the angle). An MLP with Tanh activations has to learn the same step, and with small data it learns a smoothed version that loses accuracy at the thresholds. Tabular ML competitions have been won by XGBoost / LightGBM / CatBoost for a decade; treat "trees first, NN if necessary" as the default heuristic on a USAAIO tabular question.
D2 · How would you turn this into a regression problem? Would it help?
Each row is a single binary outcome, so regressing on it directly is just logistic regression in
disguise. The interesting regression task is to aggregate by court bin: divide the court into a
grid, compute the mean shot_made_flag per bin, and fit a model that predicts that
empirical FG%. This smooths noise from individual shots and gives a calibrated probability surface,
but it costs you per-shot context (minutes remaining, etc.). Useful as a feature (the model's
input gets an extra column "FG% historically at this location") even if not as the final output.
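The court-bin idea can be sketched as a pair of functions: one learns the per-cell make rate from training rows, the other looks it up for any rows. The helper names, the grid size, and the shrinkage toward the overall make rate (prior_strength) are assumptions added here so that sparse cells do not produce extreme values; the source only describes the plain per-bin mean.

```python
import numpy as np


def _cell_index(X, x_edges, y_edges):
    """Map each (loc_x, loc_y) row to a flat grid-cell index."""
    n_bins = len(x_edges) - 1
    ix = np.clip(np.digitize(X[:, 0], x_edges) - 1, 0, n_bins - 1)
    iy = np.clip(np.digitize(X[:, 1], y_edges) - 1, 0, n_bins - 1)
    return ix * n_bins + iy


def fit_bin_fg(X, y, n_bins=10, prior_strength=20.0):
    """Learn a smoothed make rate per grid cell from training rows only."""
    x_edges = np.linspace(X[:, 0].min(), X[:, 0].max(), n_bins + 1)
    y_edges = np.linspace(X[:, 1].min(), X[:, 1].max(), n_bins + 1)
    cell = _cell_index(X, x_edges, y_edges)
    made = np.bincount(cell, weights=y, minlength=n_bins * n_bins)
    tried = np.bincount(cell, minlength=n_bins * n_bins)
    # Shrink each cell toward the overall rate; empty cells get exactly that rate.
    rate = (made + prior_strength * y.mean()) / (tried + prior_strength)
    return x_edges, y_edges, rate


def bin_fg_feature(X, x_edges, y_edges, rate):
    """Look up the historical FG% of the cell each row falls into."""
    return rate[_cell_index(X, x_edges, y_edges)]


# Tiny check: two shots at (0, 0) (one made), one made shot at (1, 1).
X_toy = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 1.0]])
y_toy = np.array([1.0, 0.0, 1.0])
x_e, y_e, rate = fit_bin_fg(X_toy, y_toy, n_bins=2, prior_strength=2.0)
fg = bin_fg_feature(X_toy, x_e, y_e, rate)
print(fg)
```

Because the feature is built from labels, it should be fitted inside each cross-validation fold's training part, for the same reason the scaler is wrapped in a pipeline in section 5.4.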