USAAIO 2025 Round 1 · Problem 3 · Data exploration & a classical ML pipeline

Contest: 2025 USA-NA-AIO Round 1 · Round: Round 1 (online) · Category: Pandas / EDA / classical ML.

Official sources: usaaio.org/past-problems · P3 Part 1 forum thread. The problem runs 18+ parts on the forum.

1. Problem restatement

Part 1 (paraphrased from the forum thread) hands the contestant a CSV dataset and asks for a single piece of code: load the file into a pandas DataFrame, display the first 10 rows, and write a data_summary(df) function that prints the DataFrame's shape, dtypes, and per-column missing-value count.

The 17 follow-up parts walk through a complete ML pipeline on the same dataset: missing-value imputation, categorical encoding, train/validation split, baseline classifier, feature engineering, cross-validation, model comparison, and finally an analysis write-up. The dataset itself is not named in Part 1, only loaded — later parts reveal what's being predicted.

Source. Part 1 paraphrased from the forum thread. Parts 2–18 reconstructed from the published Round-1 ML syllabus; treat their wording as [verify against source].

2. What's being tested

Pandas fluency. Read a CSV, group, summarise, plot, all without looking anything up.
EDA hygiene. Missing values, dtypes, target distribution, leakage check before any model.
Pipeline literacy. Use sklearn.pipeline.Pipeline + a ColumnTransformer so preprocessing fits on train only.
Honest cross-validation. Tune on validation folds only; keep the test set untouched until the final evaluation.
Write-up. Final parts ask for prose analysis. Practice writing 3-sentence explanations of each modelling choice.
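The "honest cross-validation" point can be sketched as follows. This is a minimal illustration, not the contest's reference solution: the data comes from make_classification as a stand-in for the real CSV, and the model and split sizes are assumptions chosen for brevity.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the contest data (assumption, not the real dataset).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Carve off the test set once; stratify keeps class proportions comparable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000)

# All model selection happens on the training portion only.
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="f1_macro")
print("CV macro-F1:", cv_scores.mean())

# The test set is touched exactly once, after all choices are frozen.
test_score = model.fit(X_train, y_train).score(X_test, y_test)
print("test accuracy:", test_score)
```

If the CV score and the single test score disagree sharply, that gap is itself worth a sentence in the write-up.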

3. Data exploration / setup

import pandas as pd

df = pd.read_csv("dataset.csv")
print(df.head(10))

def data_summary(df: pd.DataFrame) -> None:
    print("shape:", df.shape)
    print("\ndtypes:")
    print(df.dtypes)
    print("\nmissing per column:")
    print(df.isnull().sum().sort_values(ascending=False))

data_summary(df)

This is the literal Part 1 answer. Spend the saved time on what comes next: look at the target column, plot its distribution, run df.describe(include="all"), and identify obvious leakers (columns that perfectly predict the target).
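A rough sketch of those follow-up checks, run on a toy frame because the real dataset is not named in Part 1. The column names ("age", "city", "target") are hypothetical, and the correlation screen is only a first pass for numeric columns, not a full leakage test.

```python
import pandas as pd

# Toy frame standing in for the contest CSV; column names are hypothetical.
toy = pd.DataFrame({
    "age": [25, 40, None, 31, 58, 46],
    "city": ["A", "B", "A", None, "C", "A"],
    "target": [0, 1, 0, 0, 1, 0],
})

# Class balance as fractions; a skewed split argues for macro-F1 and stratification.
class_share = toy["target"].value_counts(normalize=True)
print(class_share)

# Summary over numeric and non-numeric columns together.
print(toy.describe(include="all"))

# Quick leak screen: numeric columns whose absolute correlation with the target is near 1.
corr = toy.corr(numeric_only=True)["target"].drop("target").abs()
print(corr.sort_values(ascending=False))
```

Calling class_share.plot(kind="bar") gives the distribution plot if matplotlib is available in the contest environment.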

4. Baseline approach

Once you've cleaned missing values and encoded categoricals, a logistic regression / random forest baseline is enough to get most of the problem's points.

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

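# "target" is a placeholder for the label column. Take feature columns from X so
# the label cannot end up in num_cols or cat_cols, whatever its dtype.
X = df.drop(columns="target")
num_cols = X.select_dtypes("number").columns
cat_cols = X.select_dtypes("object").columns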

pre = ColumnTransformer([
    ("num", Pipeline([("imp", SimpleImputer(strategy="median")),
                      ("sc",  StandardScaler())]), num_cols),
    ("cat", Pipeline([("imp", SimpleImputer(strategy="most_frequent")),
                      ("oh",  OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
])

clf = Pipeline([("pre", pre), ("rf", RandomForestClassifier(n_estimators=300, random_state=0))])
scores = cross_val_score(clf, df.drop(columns="target"), df.target, cv=5, scoring="f1_macro")
print(scores.mean(), scores.std())

5. Improvements that move the needle

5.1 · Target-leakage check before any modelling

For each feature, compute its single-feature CV score against the target. A feature that scores above 0.95 on its own is a strong leakage suspect (e.g. an "outcome_id" column): check what it means, and drop it if it encodes the outcome. This one check saves teams from posting suspiciously perfect validation scores.
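One way to run this check is sketched below. The data is synthetic, the column names are hypothetical, and the helper single_feature_scores is an illustrative name rather than a library function. A shallow decision tree is used as the per-feature model, which is an assumption; any quick classifier would do. Categorical columns would need encoding first.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, size=n)

# Synthetic features (hypothetical names): two noise columns and one planted leaker.
X = pd.DataFrame({
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
    "outcome_id": y * 1000 + rng.integers(0, 10, size=n),  # encodes the label
})

def single_feature_scores(X, y, cv=5):
    """Return CV macro-F1 for a shallow tree trained on each column alone."""
    scores = {}
    for col in X.columns:
        tree = DecisionTreeClassifier(max_depth=3, random_state=0)
        scores[col] = cross_val_score(
            tree, X[[col]], y, cv=cv, scoring="f1_macro"
        ).mean()
    return pd.Series(scores).sort_values(ascending=False)

scores = single_feature_scores(X, y)
suspects = scores[scores > 0.95].index.tolist()
print(scores)
print("suspected leakers:", suspects)
```

Here only the planted "outcome_id" column clears the threshold; the noise columns score near chance.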

5.2 · Gradient boosting instead of random forest

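HistGradientBoostingClassifier handles missing values natively and often matches or beats RandomForest on tabular problems of Round-1 scale. It can replace the forest in the same pipeline, with one caveat: it does not accept sparse input, so make the preprocessing output dense (for example with sparse_threshold=0 on the ColumnTransformer).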

5.3 · Cross-validate with a stratified split

With an integer cv and a classifier, cross_val_score already uses stratified k-fold, but a hand-written split may not. Imbalanced targets need stratification so that every fold contains the minority class.
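The difference shows up clearly on a small imbalanced example. The sketch below uses made-up labels (90 negatives followed by 10 positives) and compares where the positives land under plain KFold versus StratifiedKFold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Imbalanced toy labels: 90 negatives then 10 positives (sorted, as in many raw files).
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # features are irrelevant for the split itself

plain = KFold(n_splits=5)
strat = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Count positives landing in each validation fold.
plain_pos = [int(y[val].sum()) for _, val in plain.split(X)]
strat_pos = [int(y[val].sum()) for _, val in strat.split(X, y)]

print("KFold positives per fold:          ", plain_pos)   # [0, 0, 0, 0, 10]
print("StratifiedKFold positives per fold:", strat_pos)   # [2, 2, 2, 2, 2]
```

With the unstratified split, four of the five validation folds contain no positives at all, so macro-F1 on those folds says nothing about the minority class.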

5.4 · Feature engineering: ratios and binnings

Compute ratios between numerical columns that physically belong together (e.g. balance / income). Bin continuous variables into deciles for tree models — sometimes helps. Treat each engineered feature as a hypothesis and validate via CV gain.
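Both ideas can be sketched in a few lines of pandas. The frame and the column names ("balance", "income") are hypothetical, taken from the example in the text rather than from the contest data.

```python
import numpy as np
import pandas as pd

# Toy frame; "balance" and "income" are hypothetical column names.
feats = pd.DataFrame({
    "balance": [100.0, 250.0, 0.0, 900.0, 40.0, 310.0, 75.0, 520.0, 660.0, 15.0],
    "income": [1000.0, 500.0, 0.0, 3000.0, 400.0, 620.0, 750.0, 1300.0, 2200.0, 300.0],
})

# Ratio feature; a zero denominator becomes NaN rather than inf.
feats["balance_to_income"] = feats["balance"] / feats["income"].replace(0, np.nan)

# Decile bins as integer codes; duplicates="drop" tolerates tied bin edges.
feats["income_decile"] = pd.qcut(
    feats["income"], q=10, labels=False, duplicates="drop"
)

print(feats)
```

Computing bin edges on the full frame, as done here for brevity, uses information from every row; inside cross-validation the edges are better learned on the training fold only.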

5.5 · Write the analysis up clearly

The final parts often ask "why did X help?" with a 50–100 word answer. Practise writing such explanations on every modelling choice as you make it. Graders reward the model + explanation bundle.

6. Submission format & gotchas

Each part is posted in its own forum thread. Include code and output in fenced blocks.
For prose answers, write in plain English; the graders are humans.
Use random_state=0 (or whatever the problem specifies) so the grader can reproduce your numbers.
Don't import seaborn unless asked; stick to the dependencies the problem provides.

7. What top solutions did

Forum solutions for P3 with full marks consistently: (1) report shape, dtypes, missing-value counts cleanly; (2) plot the target distribution; (3) build a Pipeline + ColumnTransformer so preprocessing is honest; (4) use HistGradientBoosting with stratified CV; (5) write one paragraph per modelling choice. No exotic models needed. [illustrative]

8. Drill

D1 · Your CV macro-F1 is 0.98 on a dataset that "should be hard". What do you check first?

Target leakage. Compute single-feature CV macro-F1 for every column. If one column alone scores above 0.95, it is probably leaking information about the target: drop it and re-run CV. Other common leaks: using future-timestamped features for past-timestamped rows; using row id as a feature when ids are time-correlated. Always sanity-check by reading column meanings before modelling.

D2 · ColumnTransformer vs a manual preprocessing block — why prefer ColumnTransformer?

Because ColumnTransformer composes into a single Pipeline whose .fit() only sees train data. With a manual block, it's easy to compute df.mean() on the full DataFrame (test included) and silently leak. Pipelines force the right separation. The cost is one extra line of code; the benefit is honest CV scores.
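The silent leak is easy to see on a toy example. The sketch below uses one synthetic feature whose last 20 rows (standing in for the test rows) are shifted, then compares the statistics a scaler learns from the full data with those it learns from the training rows only. The data and the 80/20 layout are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# One synthetic feature; the last 20 rows are shifted to mimic unusual test-time values.
values = np.concatenate([rng.normal(0, 1, 80), rng.normal(5, 1, 20)]).reshape(-1, 1)
train, test = values[:80], values[80:]

leaky_mean = StandardScaler().fit(values).mean_[0]   # statistics from train + test
honest_mean = StandardScaler().fit(train).mean_[0]   # statistics from train only

print("mean seen by leaky scaler: ", leaky_mean)
print("mean seen by honest scaler:", honest_mean)
```

A Pipeline passed to cross_val_score repeats the train-only fit inside every fold, which is the behaviour the manual block has to reproduce by hand.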
