NOAI China 2024 · Tabular ML · Predicting basketball shooting percentage

Contest: NOAI China 2024 / APOAI 2025 mock · Round: Phase 1, Problem 1 · Category: Classical ML / tabular binary classification.

Official sources: jaredliw/ioai-tsp-2025 — basketball-shooting · open-cu/awesome-ioai-tasks (index). The base data is derived from the Kaggle "Kobe Bryant Shot Selection" public dataset, with features normalised by the organisers.

1. Problem restatement

Given a CSV of historical shots taken by a basketball star, predict whether each shot in the held-out test set was made (1) or missed (0). The features (verbatim from the official notebook):

# columns in data_train.csv
loc_x              # normalized horizontal court position at shot time
loc_y              # normalized vertical court position at shot time
minutes_remaining  # minutes left in current quarter (normalized)
shot_distance      # distance from shooter to basket (normalized)
shot_made_flag     # TARGET: 1 if made, 0 if missed
shot_id            # unique sample id

The task statement specifies only loc_x and loc_y as the input features — the other numeric columns are "available but optional". The submission must include a PyTorch model (submission_model.py + submission_dic.pth) and predictions. Metric: accuracy on the hidden test set.

Source. Paraphrased from the official problem statement reproduced in the basketball-shooting notebook. The original NOAI problem PDF is hosted via the China national olympiad selection process and is not publicly archived in English; the GitHub mirror is the closest thing to a canonical statement.

2. What's being tested

Tabular classification on a low-dimensional dataset where the signal is geometric — shooting percentage drops with distance from the basket and depends on angle. The test is whether you:

Inspect the data before modelling (basic EDA, distribution of loc_x, loc_y, class imbalance).
Recognise when feature engineering trivially beats deep models (polar coordinates from Cartesian).
Know when gradient boosting beats neural networks on small tabular data.
Use cross-validation properly and don't leak validation-fold statistics via scaler fitting.

Maps onto the Classical ML page (logistic regression, random forest, gradient boosting, calibration) and the Python page (pandas, sklearn pipelines).

3. Data exploration / setup

The training file typically has ~25 000 rows after dropping rows with NaN in shot_made_flag. Hidden test set: a few thousand rows. The classes are close to balanced (pro shooters make somewhat under half their attempts; the value_counts call below prints the exact split), so plain accuracy is a fair metric and the majority-class rate is the floor any model has to beat.

Critical observation: in basketball-court coordinates, shots cluster on a horseshoe pattern (corners + arc + paint). A linear classifier in (loc_x, loc_y) space sees an almost rotationally symmetric structure and underfits. A model that sees distance and angle as features fits trivially.

Run this first:

import pandas as pd, numpy as np, matplotlib.pyplot as plt
df = pd.read_csv("data_train.csv")
print(df.shape, df.shot_made_flag.value_counts(normalize=True))

# scatter colored by shot result
ax = df.sample(3000).plot.scatter("loc_x", "loc_y",
        c="shot_made_flag", cmap="bwr", alpha=0.4)
plt.show()

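# make rate (FG%) as a function of distance from the origin
# (assumes the basket sits at (0, 0) after the organisers' normalisation)
df["dist"] = np.hypot(df.loc_x, df.loc_y)
print(df.groupby(pd.cut(df.dist, 10), observed=True).shot_made_flag.mean())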

4. Baseline approach

A 2-layer MLP on raw (loc_x, loc_y) — the exact baseline shown in the official notebook — hits roughly 56-60% test accuracy.

import pandas as pd, torch, torch.nn as nn, torch.optim as optim
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv("data_train.csv").dropna(subset=["shot_made_flag"])
X = df[["loc_x", "loc_y"]].to_numpy().astype("float32")
y = df["shot_made_flag"].to_numpy().astype("float32")
Xtr, Xva, ytr, yva = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(2, 8), nn.Tanh(),
                               nn.Linear(8, 8), nn.Tanh(),
                               nn.Linear(8, 1))
    def forward(self, x): return self.f(x).squeeze(-1)

m   = MLP()
opt = optim.Adam(m.parameters(), lr=1e-3)
crit = nn.BCEWithLogitsLoss()
Xtr_t = torch.tensor(Xtr); ytr_t = torch.tensor(ytr)

for epoch in range(50):
    opt.zero_grad()
    loss = crit(m(Xtr_t), ytr_t)
    loss.backward(); opt.step()

with torch.no_grad():
    pred = (torch.sigmoid(m(torch.tensor(Xva))) >= 0.5).numpy().astype(int)
print("val acc:", accuracy_score(yva, pred))   # ~ 0.58 [illustrative]

Score band: ~56-60% accuracy with default seeds. The MLP is barely better than predicting the majority class. [illustrative]
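To make "barely better than the majority class" concrete, it helps to compute that floor explicitly before reading any model score. A minimal sketch on a made-up label vector (the 45% make rate is an assumption for illustration, not a figure from the contest data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 45% makes, 55% misses (illustrative only).
y_demo = np.array([1] * 450 + [0] * 550)
X_demo = np.zeros((len(y_demo), 1))  # the dummy model ignores the features

# strategy="most_frequent" always predicts the majority class.
floor = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
majority_acc = floor.score(X_demo, y_demo)
print("majority-class accuracy:", majority_acc)
```

On the real training file the same number is df.shot_made_flag.value_counts(normalize=True).max(); any validation accuracy should be read relative to it.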

5. Improvements that move the needle

5.1 · Polar features (the single biggest win)

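Convert (loc_x, loc_y) to (distance, angle). Make-probability falls off with distance, and angle adds information because corner, wing, and top-of-key shots are made at different rates. This single feature-engineering step typically helps more than any tweak to the MLP architecture. One caveat: the conversion assumes the basket sits at (0, 0) after the organisers' normalisation; if the coordinates were mean-centred, check where the origin landed before trusting r as a distance.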

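def polar_features(X):
    """Map raw (loc_x, loc_y) columns to [r, theta, sin(theta), cos(theta), x, y]."""
    r     = np.hypot(X[:, 0], X[:, 1])        # distance from the origin
    theta = np.arctan2(X[:, 1], X[:, 0])      # angle in radians, in [-pi, pi]
    # sin/cos avoid the jump in theta at +/-pi; raw x, y are kept alongside.
    return np.column_stack([r, theta, np.sin(theta), np.cos(theta),
                             X[:, 0], X[:, 1]])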
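The effect is easy to reproduce on synthetic data. The sketch below is a toy generator, not the contest data: it assumes a basket at the origin and a make-probability that decays with distance, then compares a linear classifier on raw coordinates against the same classifier with polar columns added. The helper name add_polar and all constants are made up for the demonstration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Toy court: basket at the origin, make-probability decays with distance.
xy = rng.normal(size=(4000, 2))
r = np.hypot(xy[:, 0], xy[:, 1])
p_make = 1.0 / (1.0 + np.exp(3.0 * (r - 1.0)))   # high near the basket, low far away
y_toy = (rng.random(4000) < p_make).astype(int)

def add_polar(xy):
    """Append distance and angle encodings to raw (x, y)."""
    r = np.hypot(xy[:, 0], xy[:, 1])
    theta = np.arctan2(xy[:, 1], xy[:, 0])
    return np.column_stack([r, np.sin(theta), np.cos(theta), xy])

clf = LogisticRegression(max_iter=1000)
acc_raw = cross_val_score(clf, xy, y_toy, cv=5).mean()
acc_polar = cross_val_score(clf, add_polar(xy), y_toy, cv=5).mean()
print(f"raw xy: {acc_raw:.3f}   with polar: {acc_polar:.3f}")
```

The raw-coordinate model can only draw a straight line through a radially symmetric pattern, so it stays near the majority-class rate; once r is a column, the same linear model can separate near from far.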

5.2 · Gradient boosting beats MLP here

With ~25k rows, 6 features, and class-balanced labels, a histogram gradient booster reaches ~64-66% accuracy with default hyperparameters — meaningfully above the MLP. This is the canonical "tabular data, small N, use trees" pattern.

from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X_polar = polar_features(X)
gbm = HistGradientBoostingClassifier(
    max_depth=6, learning_rate=0.05, max_iter=600,
    l2_regularization=1.0, early_stopping=True, validation_fraction=0.15,
    random_state=42)
scores = cross_val_score(gbm, X_polar, y, cv=StratifiedKFold(5, shuffle=True, random_state=0))
print("CV acc:", scores.mean(), "+/-", scores.std())  # ~ 0.65 [illustrative]

5.3 · Add the "optional" features back

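The problem statement says the inputs are loc_x, loc_y, but it does not forbid using shot_distance and minutes_remaining. Read the rules carefully, and confirm that the grader actually supplies these columns at inference: if it passes only raw (loc_x, loc_y), as Section 6 describes, the extra columns can help a local model but cannot be fed to the submitted network. shot_distance closely tracks the polar distance computed from loc_x and loc_y and adds another 0.5-1 accuracy point. [illustrative]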
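One way to keep the optional columns easy to switch on and off is a single feature builder that takes the column names as an argument. This is a sketch only: build_features is a hypothetical helper, and the toy frame holds made-up values just to show the resulting shapes.

```python
import numpy as np
import pandas as pd

def build_features(df, extra_cols=()):
    """Polar encodings of (loc_x, loc_y) plus any optional columns that are present."""
    x = df["loc_x"].to_numpy(dtype="float32")
    y = df["loc_y"].to_numpy(dtype="float32")
    theta = np.arctan2(y, x)
    cols = [np.hypot(x, y), np.sin(theta), np.cos(theta), x, y]
    for name in extra_cols:
        if name in df.columns:          # skip columns the file does not contain
            cols.append(df[name].to_numpy(dtype="float32"))
    return np.column_stack(cols)

# Toy frame with made-up values, only to show the shapes.
toy = pd.DataFrame({
    "loc_x": [0.3, -0.4, 0.0],
    "loc_y": [0.4, 0.3, 1.0],
    "shot_distance": [0.5, 0.5, 1.0],
    "minutes_remaining": [0.1, 0.9, 0.5],
})
X_loc = build_features(toy)
X_all = build_features(toy, extra_cols=("shot_distance", "minutes_remaining"))
print(X_loc.shape, X_all.shape)
```

Cross-validating the same model on X_loc and X_all shows directly whether the extra columns are worth the risk of a rules problem.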

5.4 · Wrap the model in a sklearn Pipeline so CV doesn't leak

A classic exam-day trap: StandardScaler().fit_transform(X) on the full data, then cross-validate. The scaler has seen the val fold. Always wrap:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("gbm",   HistGradientBoostingClassifier(max_iter=600, learning_rate=0.05, random_state=42)),
])
# Now cross_val_score refits the scaler inside each fold's training half.

5.5 · For the PyTorch-required submission: use the GBM to pseudo-label and distil

The contest requires a PyTorch model in the submission. If GBM beats your MLP, train the MLP on (X, GBM-predicted probabilities) with cross-entropy against soft targets instead of (X, hard labels). The MLP then learns to mimic the GBM's decision boundary, which recovers most of the booster's accuracy while satisfying the submission format. The snippet below uses in-sample teacher probabilities for brevity; out-of-fold probabilities are a less overfit target.

probs_gbm = gbm.fit(X_polar, y).predict_proba(X_polar)[:, 1]
soft_t = torch.tensor(probs_gbm, dtype=torch.float32)

class MLP2(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(d, 32), nn.GELU(),
                               nn.Linear(32, 32), nn.GELU(),
                               nn.Linear(32, 1))
    def forward(self, x): return self.f(x).squeeze(-1)

m = MLP2(X_polar.shape[1])
opt = optim.AdamW(m.parameters(), lr=3e-3, weight_decay=1e-4)
Xp_t = torch.tensor(X_polar, dtype=torch.float32)

crit = nn.BCEWithLogitsLoss()   # accepts soft targets in [0, 1] and works on logits, so no manual eps
for epoch in range(300):
    opt.zero_grad()
    loss = crit(m(Xp_t), soft_t)
    loss.backward(); opt.step()

6. Submission format & gotchas

Submission is a ZIP containing submission_model.py (model class definition + a load_model() function), submission_dic.pth (state dict), and any preprocessing constants you need at inference.
The grader instantiates your model via load_model(), loads the state dict, and calls model(X_tensor) for logits. The forward pass must return a tensor, not a numpy array.
If you use polar features at training, you must compute them inside the model that load_model() returns (in practice, in forward() with torch ops), because the grader passes raw (loc_x, loc_y).
Don't forget model.eval() in the loader. In training mode BatchNorm normalises with the current batch's statistics and Dropout stays active, so predictions depend on batch composition and are not reproducible.
Seed everything. Reproducibility is part of grading and reruns must produce the same number.
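The points above can be sketched as a submission_model.py. Only the file names and the existence of load_model() come from the task description; the class name ShotNet, the optional path argument, and the exact loader behaviour are assumptions, so check them against the official notebook before submitting.

```python
import torch
import torch.nn as nn

class ShotNet(nn.Module):
    """Hypothetical submission model: takes raw (loc_x, loc_y), builds polar features itself."""
    def __init__(self, hidden=32):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(6, hidden), nn.GELU(),
                               nn.Linear(hidden, hidden), nn.GELU(),
                               nn.Linear(hidden, 1))

    def forward(self, x):
        # x: (N, 2) tensor of raw loc_x, loc_y, as the grader is assumed to pass it.
        r = torch.hypot(x[:, 0], x[:, 1])
        theta = torch.atan2(x[:, 1], x[:, 0])
        feats = torch.stack([r, theta, torch.sin(theta), torch.cos(theta),
                             x[:, 0], x[:, 1]], dim=1)
        return self.f(feats).squeeze(-1)      # logits as a tensor of shape (N,)

def load_model(path=None):
    """Assumed loader: builds the model, optionally loads weights, sets eval mode."""
    model = ShotNet()
    if path is not None:                      # the grader may load the state dict itself
        model.load_state_dict(torch.load(path, map_location="cpu"))
    model.eval()
    return model
```

The feature order in forward() matches polar_features, so weights trained on a network that received precomputed polar columns can be copied into self.f unchanged.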

7. What top solutions did

The reference notebook in the jaredliw repo uses the minimal 2-layer Tanh MLP described in the baseline above and achieves only the basic score — it is illustrative of "what gets you 50%", not "what wins". Top scorers in the actual NOAI 2024 selection (per write-ups discussed on the IOAI community contest forum) used (i) polar features, (ii) gradient boosting as an oracle, (iii) MLP distillation to satisfy the PyTorch submission format, and (iv) 5-seed ensembling at inference. Specific numeric leaderboard placements for the China selection are not published in English, so absolute scores are [illustrative].
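Seed ensembling can be packaged as a single nn.Module so that one state dict still covers every member. The sketch below averages logits across five identically shaped networks; whether the top solutions averaged logits or probabilities is not stated in the source, and make_mlp and SeedEnsemble are hypothetical names. The members here are untrained, to keep the example short.

```python
import torch
import torch.nn as nn

def make_mlp(d_in=6, hidden=32):
    return nn.Sequential(nn.Linear(d_in, hidden), nn.GELU(), nn.Linear(hidden, 1))

class SeedEnsemble(nn.Module):
    """Averages the logits of several identically shaped members."""
    def __init__(self, members):
        super().__init__()
        self.members = nn.ModuleList(members)   # registered, so state_dict() covers all

    def forward(self, x):
        logits = torch.stack([m(x).squeeze(-1) for m in self.members], dim=0)
        return logits.mean(dim=0)

members = []
for seed in range(5):
    torch.manual_seed(seed)          # different initialisation per member
    members.append(make_mlp())       # in practice: train each member here
ensemble = SeedEnsemble(members).eval()

x_demo = torch.randn(8, 6)
with torch.no_grad():
    out = ensemble(x_demo)
print(out.shape)
```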

8. Drill

D1 · Why does a 2-layer MLP underperform a gradient booster on this dataset?

Two reasons. (1) Data scale: with only ~25k rows and 2-6 numeric features, a neural network has a lot of capacity per unit of supervision; trees with axis-aligned splits encode the "shots get harder beyond the 3-point line" structure with far fewer effective parameters. (2) Inductive bias: trees naturally handle non-monotone, threshold-like relationships (a 3-pointer taken from the corner is suddenly easier, which is a hard threshold in the angle). An MLP with Tanh activations has to learn the same step, and with small data it learns a smoothed version that loses accuracy at the thresholds. Gradient-boosted trees (XGBoost / LightGBM / CatBoost) have dominated tabular ML competitions for a decade; treat "trees first, NN if necessary" as the default heuristic on a USAAIO tabular question.

D2 · How would you turn this into a regression problem? Would it help?

Each row is a single binary outcome, so regressing on it directly just estimates the make-probability, which is classification in disguise. The interesting regression task is to aggregate by court bin: divide the court into a grid, compute the mean shot_made_flag per bin, and fit a model that predicts that empirical FG%. This smooths noise from individual shots and gives a probability surface, but it costs you per-shot context (minutes remaining, etc.). Useful as a feature (the model's input gets an extra column "FG% historically at this location") even if not as the final output; compute the bin means on training folds only, or the feature leaks each row's own label.
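A sketch of the per-bin FG% feature, computed out-of-fold so that no row's own label feeds its feature value. The data is synthetic, and the 8x8 grid, the fold count, and the column names bin and bin_fg are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
# Toy data standing in for data_train.csv (values are synthetic).
toy = pd.DataFrame({"loc_x": rng.uniform(-1, 1, 2000),
                    "loc_y": rng.uniform(-1, 1, 2000)})
r = np.hypot(toy["loc_x"], toy["loc_y"])
toy["shot_made_flag"] = (rng.random(2000) < 1 / (1 + np.exp(3 * (r - 0.7)))).astype(int)

# Assign every shot to a cell of an 8x8 grid over the court.
edges = np.linspace(-1, 1, 9)
col = pd.cut(toy["loc_x"], edges, labels=False, include_lowest=True)
row = pd.cut(toy["loc_y"], edges, labels=False, include_lowest=True)
toy["bin"] = col * 8 + row

# Out-of-fold bin means: each validation row is filled from training rows only.
toy["bin_fg"] = np.nan
fg_col = toy.columns.get_loc("bin_fg")
for tr_idx, va_idx in KFold(5, shuffle=True, random_state=0).split(toy):
    train = toy.iloc[tr_idx]
    bin_means = train.groupby("bin")["shot_made_flag"].mean()
    fallback = train["shot_made_flag"].mean()      # for bins unseen in this fold
    mapped = toy.iloc[va_idx]["bin"].map(bin_means).fillna(fallback)
    toy.iloc[va_idx, fg_col] = mapped.to_numpy()
print(toy[["bin", "bin_fg"]].head())
```

For the test set the bin means would be computed once from the full training file, since no test labels are involved.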
