Mock A · Tabular ML — “Exam Outcome Prediction”

Format: Round-1 style · Time: 90 minutes · Compute: CPU only · Submission: predictions.csv · Metric: weighted F1 across 3 classes.

A self-contained tabular classification mock modeled on a USAAIO Round-1 problem. You receive a small synthetic training set and an unlabeled test set; your job is to ship a runnable notebook and a submission CSV inside the 90-minute window, scored by the rubric below.

Constraints.

Time budget: 90 minutes, hard stop. Set a timer before you read the problem.
Compute: CPU only. No GPU, no internet model downloads.
Libraries: anything in the standard scientific-Python stack (numpy, pandas, scikit-learn, lightgbm/xgboost if installed locally). No pretrained foundation models.
Submission: a single file predictions.csv with header id,label, one row per test row, label ∈ {low, mid, high}.
Write-up: a 4–6 sentence markdown cell at the end of the notebook describing your preprocessing, features, model, CV score, and one limitation.

Problem statement

You are given anonymized records for 650 high-school students preparing for a standardized exam (500 for training, 150 held out for testing). Each row contains four numerical features describing the student's study habits and prior performance, plus a 3-class outcome label (low, mid, high) that encodes the final exam tier. The labels are imbalanced (roughly 30 / 45 / 25 percent for low / mid / high), and one feature (attendance) has ~8% of its values removed completely at random.

Train a classifier on the provided training set and predict the tier of every student in the held-out test set. The grader will compute weighted F1 (sklearn: f1_score(y_true, y_pred, average="weighted")) on the hidden test labels. Your raw weighted-F1 score does not directly become your grade — the rubric below scores you on the process you demonstrate (preprocessing, baseline, feature engineering, model choice, write-up), not on a single leaderboard number.
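To make the metric concrete, here is a minimal sketch of what average="weighted" computes: the per-class F1 scores averaged with each class's true-row count (its support) as the weight. The toy labels below are invented for illustration and are not drawn from the dataset; the variable names are arbitrary.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy labels, invented for illustration only (not from train.csv or test.csv).
y_true = np.array(["low", "low", "mid", "mid", "mid", "high"])
y_pred = np.array(["low", "mid", "mid", "mid", "high", "high"])

classes = ["low", "mid", "high"]

# Per-class F1, returned in the order given by `labels`.
per_class = f1_score(y_true, y_pred, labels=classes, average=None)

# Support = number of true rows in each class.
support = np.array([(y_true == c).sum() for c in classes])

# Weighted F1 is the support-weighted mean of the per-class scores.
manual = float((per_class * support).sum() / support.sum())
weighted = float(f1_score(y_true, y_pred, average="weighted"))

print(dict(zip(classes, per_class.round(3))), round(weighted, 3))
```

Because the weights follow class frequency, the majority class (mid) has the most influence on the score, so a model that neglects the smaller classes is penalized less than it would be under macro averaging.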

The dataset is fully synthetic and generated by the snippet in the "Synthetic data generator" section below; cite it as such if you publish your solution.

Data dictionary

Training set: train.csv — 500 rows, 5 columns (4 features + label).
Test set: test.csv — 150 rows, 5 columns (id + 4 features, no label).

| Column | Type | Range / values | Notes |
|---|---|---|---|
| `id` | int | 500–649 | Row identifier; present only in test.csv, used to join predictions back to the test rows. |
| `study_hours` | float | 0.0–12.0 | Daily self-reported study hours, last 30 days. |
| `prior_score` | float | 0–100 | Previous standardized exam percentile, scaled to 0–100. |
| `sleep` | float | 3.0–10.0 | Average hours of sleep per night. |
| `attendance` | float (nullable) | 0.0–1.0 | Fraction of classes attended last semester. ~8% missing (MCAR). |
| `label` | category | `low`, `mid`, `high` | Train-only. Class balance ~30 / 45 / 25 %. |

Synthetic data generator (use to materialize the dataset locally)

import numpy as np
import pandas as pd

rng = np.random.default_rng(2026)
N = 650

study_hours = rng.uniform(0, 12, size=N)
prior_score = rng.uniform(0, 100, size=N)
sleep       = rng.normal(7.0, 1.2, size=N).clip(3, 10)
attendance  = rng.beta(6, 2, size=N)  # left-skewed (mass near 1), in (0, 1)

# latent score drives the label
latent = (
    0.35 * (study_hours / 12)
    + 0.40 * (prior_score / 100)
    + 0.15 * (sleep / 10)
    + 0.10 * attendance
    + rng.normal(0, 0.08, size=N)
)
labels = np.where(latent < 0.45, "low",
         np.where(latent < 0.70, "mid", "high"))

# inject MCAR missingness on attendance
mask = rng.random(N) < 0.08
attendance[mask] = np.nan

df = pd.DataFrame({
    "study_hours": study_hours,
    "prior_score": prior_score,
    "sleep": sleep,
    "attendance": attendance,
    "label": labels,
})

train = df.iloc[:500].reset_index(drop=True)
test  = df.iloc[500:].reset_index(drop=True)
test.insert(0, "id", np.arange(500, 650))
train.to_csv("train.csv", index=False)
test.drop(columns=["label"]).to_csv("test.csv", index=False)
test[["id", "label"]].to_csv("test_labels.csv", index=False)  # for your own scoring
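Before modeling, it may be worth confirming that the files match the data dictionary. The sketch below is one way to do that; summarize_split, FEATURES, and CLASSES are hypothetical names introduced here, and the final lines only run if train.csv and test.csv have already been written by the generator above.

```python
from pathlib import Path

import pandas as pd

FEATURES = ["study_hours", "prior_score", "sleep", "attendance"]
CLASSES = {"low", "mid", "high"}


def summarize_split(train: pd.DataFrame, test: pd.DataFrame) -> dict:
    """Check the columns against the data dictionary and return basic stats."""
    assert list(train.columns) == FEATURES + ["label"], "unexpected train columns"
    assert list(test.columns) == ["id"] + FEATURES, "unexpected test columns"
    assert set(train["label"].unique()) <= CLASSES, "unexpected label value"
    assert test["id"].is_unique, "duplicate ids in test"
    return {
        "n_train": len(train),
        "n_test": len(test),
        # Fraction of rows per class, to compare against the stated balance.
        "class_share": train["label"].value_counts(normalize=True).round(3).to_dict(),
        # Fraction of missing values per feature; only attendance should be non-zero.
        "missing_share": train[FEATURES].isna().mean().round(3).to_dict(),
    }


# Only runs when the generator has already written the files.
if Path("train.csv").exists() and Path("test.csv").exists():
    print(summarize_split(pd.read_csv("train.csv"), pd.read_csv("test.csv")))
```

Comparing class_share and missing_share against the figures quoted in the problem statement is a quick way to spot a mis-seeded or truncated file before the clock starts.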

Submission format

A single CSV named predictions.csv with exactly 150 data rows and the header below.

id,label
500,mid
501,high
502,low
...
649,mid

The id values must match those in test.csv, one row per id (keep the test-set order to be safe); the grader joins on id.
Labels must be one of low, mid, high (case-sensitive).
No extra columns, no index column, UTF-8, Unix newlines.
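The format rules above can be enforced at write time. The following is a minimal sketch using only the standard-library csv module; write_submission and its parameters are hypothetical names, and the placeholder call writes to predictions_example.csv so it cannot overwrite a real submission.

```python
import csv

VALID_LABELS = {"low", "mid", "high"}


def write_submission(ids, labels, path="predictions.csv", expected_rows=150):
    """Validate predictions and write them as id,label with Unix newlines."""
    ids = [int(i) for i in ids]
    labels = [str(label) for label in labels]
    if len(ids) != len(labels):
        raise ValueError("ids and labels differ in length")
    if len(ids) != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, got {len(ids)}")
    if len(set(ids)) != len(ids):
        raise ValueError("duplicate ids")
    bad = set(labels) - VALID_LABELS
    if bad:
        raise ValueError(f"invalid labels: {sorted(bad)}")
    # newline="" stops Python from translating line endings, so the
    # writer's lineterminator="\n" is written as-is on every platform.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, lineterminator="\n")
        writer.writerow(["id", "label"])
        writer.writerows(zip(ids, labels))
    return path


# Placeholder predictions (every row "mid"), only to show the call shape.
write_submission(range(500, 650), ["mid"] * 150, path="predictions_example.csv")
```

Writing through the csv module rather than a DataFrame method keeps the line terminator explicit, which matters if the notebook is run on Windows.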

Scoring rubric (100 points)

| Section | Points | What earns credit |
|---|---|---|
| Preprocessing | 20 | Reasonable train/val split (or k-fold), handling of missing `attendance` (impute or model-native NaN handling), label encoding, no leakage from test into fit. |
| Baseline | 30 | A simple, runnable first model (logistic regression or single decision tree) with a reported CV weighted-F1 before any tuning. Numbers are reproducible (`random_state` pinned). |
| Feature engineering | 25 | At least 2 derived features motivated by EDA (e.g. `study_hours × prior_score`, sleep buckets, attendance imputed + missing-indicator). Improvement over baseline shown via CV. |
| Model choice | 15 | Final model is appropriate for small tabular data with mixed-scale features (gradient boosting, random forest, or stacked ensemble) with a brief justification. |
| Write-up | 10 | 4–6 sentence summary at the end of the notebook describing the pipeline, CV score, and one limitation. |

Note: the rubric scores process. A high leaderboard score with no baseline, no CV, and no write-up caps at ~55. A clean baseline + tidy CV with a mid-tier leaderboard score routinely lands ~85.

Start the timer

Go. Set a 90-minute timer now. When it rings, save the notebook from a clean kernel and stop. Then open the reference solution and score yourself section by section.