USAAIO 2025 Round 1 · Problem 3 · Data exploration & a classical ML pipeline
Contest: 2025 USA-NA-AIO Round 1 · Round: Round 1 (online) · Category: Pandas / EDA / classical ML.
Official sources: usaaio.org/past-problems · P3 Part 1 forum thread. The problem runs 18+ parts on the forum.
1. Problem restatement
Part 1 (verbatim from the forum) hands the contestant a CSV dataset and asks for a single piece of
code: load the file into a pandas DataFrame, display the first 10 rows, and write a
data_summary(df) function that prints the DataFrame's shape, dtypes, and per-column
missing-value count.
The 17 follow-up parts walk through a complete ML pipeline on the same dataset: missing-value imputation, categorical encoding, train/validation split, baseline classifier, feature engineering, cross-validation, model comparison, and finally an analysis write-up. The dataset itself is not named in Part 1, only loaded — later parts reveal what's being predicted.
2. What's being tested
- Pandas fluency. Read a CSV, group, summarise, plot, all without looking up the syntax.
- EDA hygiene. Missing values, dtypes, target distribution, leakage check before any model.
- Pipeline literacy. Use sklearn.pipeline.Pipeline + a ColumnTransformer so preprocessing fits on train only.
- Honest cross-validation. Don't tune on the test set; use a held-out fold.
- Write-up. Final parts ask for prose analysis. Practice writing 3-sentence explanations of each modelling choice.
3. Data exploration / setup
import pandas as pd

df = pd.read_csv("dataset.csv")
print(df.head(10))

def data_summary(df: pd.DataFrame) -> None:
    print("shape:", df.shape)
    print("\ndtypes:")
    print(df.dtypes)
    print("\nmissing per column:")
    print(df.isnull().sum().sort_values(ascending=False))

data_summary(df)
This is the literal Part 1 answer. Spend the saved time on what comes next: look at the target
column, plot its distribution, run df.describe(include="all"), and identify obvious
leakers (columns that perfectly predict the target).
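The follow-up checks above can be sketched roughly as follows. The helper names `inspect_target` and `perfect_predictors`, the column name `"target"`, and the toy DataFrame are assumptions made for illustration; the real dataset and its target column are not named in Part 1.

```python
import pandas as pd


def inspect_target(df: pd.DataFrame, target: str = "target") -> pd.Series:
    """Return the class proportions of the target column, largest first."""
    return df[target].value_counts(normalize=True, dropna=False)


def perfect_predictors(df: pd.DataFrame, target: str = "target") -> list:
    """List columns whose every distinct value maps to exactly one target value."""
    suspects = []
    for col in df.columns.drop(target):
        n_unique = df[col].nunique(dropna=False)
        # Skip constants and all-unique columns (row ids "predict" anything trivially).
        if n_unique <= 1 or n_unique == len(df):
            continue
        targets_per_value = df.groupby(col, dropna=False)[target].nunique()
        if targets_per_value.max() == 1:
            suspects.append(col)
    return suspects


# Hypothetical toy data standing in for the contest CSV.
toy = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "city": ["x", "y", "x", "y", "x", "y"],
    "outcome_code": ["A", "B", "A", "B", "B", "A"],
    "target": [0, 1, 0, 1, 1, 0],
})

print(inspect_target(toy))
print(toy.describe(include="all"))
print("perfect predictors:", perfect_predictors(toy))
```

A flagged column is only a candidate: read its meaning before dropping it, since a small dataset can produce a perfect mapping by chance.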
4. Baseline approach
Once you've cleaned missing values and encoded categoricals, a logistic regression / random forest baseline is enough to get most of the problem's points.
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
X = df.drop(columns="target")
y = df["target"]
# Select feature columns from X so the target can never end up in either list.
num_cols = X.select_dtypes("number").columns
cat_cols = X.select_dtypes("object").columns
pre = ColumnTransformer([
    ("num", Pipeline([("imp", SimpleImputer(strategy="median")),
                      ("sc", StandardScaler())]), num_cols),
    ("cat", Pipeline([("imp", SimpleImputer(strategy="most_frequent")),
                      ("oh", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
])
clf = Pipeline([("pre", pre), ("rf", RandomForestClassifier(n_estimators=300, random_state=0))])
# An integer cv with a classifier gives stratified 5-fold splits.
scores = cross_val_score(clf, X, y, cv=5, scoring="f1_macro")
print(scores.mean(), scores.std())
5. Improvements that move the needle
5.1 · Target-leakage check before any modelling
For each feature, compute the cross-validated score of a model trained on that feature alone. Any feature that scores above 0.95 is almost certainly a leaker (e.g. an "outcome_id" column); drop it. This one check keeps teams from posting suspiciously perfect validation scores.
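One way to run this check is sketched below. The function name `single_feature_scores`, the shallow decision tree used as the probe model, and the synthetic data are all assumptions for illustration; any quick classifier would do.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier


def single_feature_scores(df, target="target", cv=5):
    """Cross-validated macro-F1 of a small tree trained on each column alone."""
    y = df[target]
    scores = {}
    for col in df.columns.drop(target):
        x = df[[col]]
        if not pd.api.types.is_numeric_dtype(x[col]):
            # Diagnostic shortcut: integer category codes (missing values become -1).
            x = x[col].astype("category").cat.codes.to_frame(col)
        model = Pipeline([
            ("imp", SimpleImputer(strategy="median")),
            ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
        ])
        scores[col] = cross_val_score(model, x, y, cv=cv, scoring="f1_macro").mean()
    return pd.Series(scores).sort_values(ascending=False)


# Synthetic data (not the contest dataset): "leaky" is a near copy of the target.
rng = np.random.default_rng(0)
target = rng.integers(0, 2, size=200)
toy = pd.DataFrame({
    "noise": rng.normal(size=200),
    "leaky": target * 10 + rng.normal(scale=0.01, size=200),
    "target": target,
})

leak_scores = single_feature_scores(toy)
suspects = leak_scores[leak_scores > 0.95].index.tolist()
print(leak_scores)
print("suspected leakers:", suspects)
```

The 0.95 cut-off is the threshold quoted above; on an easy dataset a legitimate feature can score that high too, so treat the list as columns to read up on rather than columns to delete blindly.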
5.2 · Gradient boosting instead of random forest
HistGradientBoostingClassifier handles missing values natively and often
out-scores RandomForest on tabular problems of Round 1 scale. It plugs into the same pipeline, provided the preprocessor outputs a dense array.
5.3 · Cross-validate with a stratified split
With an integer cv and a classifier, cross_val_score already uses stratified k-fold, but
a hand-written split may not. Imbalanced targets need stratification.
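If you do write the split yourself, an explicit StratifiedKFold keeps the class ratio in every fold. The sketch below uses a made-up 90/10 imbalanced target to show the effect.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up imbalanced target: 90 negatives, 10 positives.
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Count how many minority rows land in each validation fold.
minority_per_fold = [int(y[val_idx].sum()) for _, val_idx in skf.split(X, y)]
print(minority_per_fold)  # each of the 5 folds holds 2 of the 10 minority rows
```

The same `skf` object can be passed as `cv=skf` to cross_val_score, which also makes the shuffling seed explicit.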
5.4 · Feature engineering: ratios and binnings
Compute ratios between numerical columns that physically belong together (e.g. balance / income). Binning continuous variables into deciles sometimes helps tree models. Treat each engineered feature as a hypothesis and validate it via CV gain.
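Both ideas take a few lines of pandas. The helpers `add_ratio` and `add_decile_bin` and the `balance` / `income` columns are hypothetical names used only for this sketch.

```python
import numpy as np
import pandas as pd


def add_ratio(df: pd.DataFrame, num: str, den: str, name: str) -> pd.DataFrame:
    """Add num / den as a new column; a zero denominator yields NaN, not inf."""
    out = df.copy()
    out[name] = out[num] / out[den].replace(0, np.nan)
    return out


def add_decile_bin(df: pd.DataFrame, col: str, name: str) -> pd.DataFrame:
    """Add an integer decile index (0 to 9) for a continuous column."""
    out = df.copy()
    out[name] = pd.qcut(out[col], q=10, labels=False, duplicates="drop")
    return out


# Hypothetical toy data: 20 rows, the first with zero income.
toy = pd.DataFrame({
    "balance": np.arange(20) * 50 + 100,
    "income": np.arange(20) * 1000,
})
toy = add_ratio(toy, "balance", "income", "balance_to_income")
toy = add_decile_bin(toy, "income", "income_decile")
print(toy.head())
```

Note that `pd.qcut` computes its bin edges from every row it is given. For an honest CV comparison, compute the edges on the training fold only, for example with sklearn's KBinsDiscretizer(strategy="quantile") inside the pipeline.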
5.5 · Write the analysis up clearly
The final parts often ask "why did X help?" with a 50–100 word answer. Practise writing such explanations on every modelling choice as you make it. Graders reward the model + explanation bundle.
6. Submission format & gotchas
- Each part is posted in its own forum thread. Include code + output in fenced blocks.
- For prose answers, write in plain English; the graders are humans.
- Use random_state=0 (or whatever the problem specifies) so the grader can reproduce your numbers.
- Don't import seaborn unless asked; keep the dependency list as the problem hands you.
7. What top solutions did
Forum solutions for P3 with full marks consistently: (1) report shape, dtypes, missing-value counts cleanly; (2) plot the target distribution; (3) build a Pipeline + ColumnTransformer so preprocessing is honest; (4) use HistGradientBoosting with stratified CV; (5) write one paragraph per modelling choice. No exotic models needed. [illustrative]
8. Drill
D1 · Your CV macro-F1 is 0.98 on a dataset that "should be hard". What do you check first?
Target leakage. Compute single-feature CV macro-F1 for every column. If one column alone scores > 0.9, it's leaked information about the target — drop it and re-CV. Other common leaks: using future-timestamped features for past-timestamped rows; using row id as a feature when ids are time-correlated. Always sanity-check by reading column meanings before modelling.
D2 · ColumnTransformer vs a manual preprocessing block — why prefer ColumnTransformer?
Because ColumnTransformer composes into a single Pipeline whose .fit() only sees
train data. With a manual block, it's easy to compute df.mean() on the full DataFrame
(test included) and silently leak. Pipelines force the right separation. The cost is one extra
line of code; the benefit is honest CV scores.
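The difference is easy to see on a tiny made-up example. In the sketch below the last two rows play the role of held-out data, and the imputer inside the Pipeline only ever learns the mean of the rows it was fitted on.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up data: the last two rows stand in for the held-out set.
X = pd.DataFrame({"x": [1.0, 2.0, 3.0, np.nan, 100.0, 200.0]})
y = np.array([0, 1, 0, 1, 0, 1])
X_train, y_train = X.iloc[:4], y[:4]

pipe = Pipeline([("imp", SimpleImputer(strategy="mean")),
                 ("lr", LogisticRegression())])
pipe.fit(X_train, y_train)

# Statistic learned by the pipeline: mean of the training rows only.
train_only_mean = pipe.named_steps["imp"].statistics_[0]
# Statistic a manual block gets from the full DataFrame, held-out rows included.
full_mean = X["x"].mean()

print(train_only_mean)  # 2.0
print(full_mean)        # 61.2
```

Filling the missing value with 61.2 instead of 2.0 means held-out rows have shaped a training feature, which is exactly the silent leak described above.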