Classical machine learning

Before deep learning is on the table, classical ML solves most contest tabular problems faster, with fewer bugs, and with less compute. This page covers the scikit-learn families you must own: regression, classification, ensembles, cross-validation, clustering, and dimensionality reduction.

Syllabus link. Tracks the classical-ML block of the official USAAIO syllabus.

TL;DR. You must be able to (1) pick a sensible model family from the data type and size; (2) split data correctly (stratified, time-aware) and run k-fold CV; (3) reason about bias/variance and pick a regulariser; (4) read precision/recall/F1/ROC-AUC fluently and choose the metric that matches the cost structure; (5) ensemble two models for the easy final boost. USAAIO classical-ML problems are almost always "tabular CSV → predict target → maximise metric": this stack wins.

The contest workflow

Load & inspect. pd.read_csv, .info(), .describe(), scan for NaNs, look at value distributions.
Train/val/test split. Stratified for classification; time-based for time series. Never let validation or test rows influence anything fit on the training set.
Baseline model. Logistic regression or random forest with default params. Whatever you build later must beat this.
Feature engineering. Numerical: standardize / log-transform. Categorical: one-hot or target encode. Date: extract year / month / day-of-week.
Model search. Try 2–3 model families, tune each with cross-validation.
Ensemble. Average top models. Usually +1–3% on the leaderboard.
Reproduce. Fix random seeds, save the trained model, log the val score.
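The first few workflow steps can be sketched end to end as below. This is a minimal illustration, not a contest template: the DataFrame is synthetic and stands in for `pd.read_csv(...)`, and the column names (`income`, `age`, `city`, `target`) are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

# Synthetic stand-in for a contest CSV (hypothetical columns).
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "income": rng.lognormal(mean=10, sigma=1, size=n),
    "age": rng.integers(18, 80, size=n),
    "city": rng.choice(["A", "B", "C"], size=n),
})
signal = np.log(df["income"]) + (df["city"] == "A").astype(float)
df["target"] = ((signal + rng.normal(0, 0.5, size=n)) > 10.5).astype(int)

# Stratified split so both parts keep the class balance.
X, y = df.drop(columns="target"), df["target"]
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Preprocessing lives inside the pipeline, so it is fit on training rows only.
pre = ColumnTransformer([
    ("log_num", make_pipeline(FunctionTransformer(np.log1p), StandardScaler()), ["income"]),
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])
baseline = Pipeline([("pre", pre), ("clf", LogisticRegression(max_iter=1000))])
baseline.fit(X_tr, y_tr)

val_acc = baseline.score(X_va, y_va)
print("baseline val accuracy:", round(val_acc, 3))
```

Anything you build later should beat `val_acc` on the same split.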

1. Regression

Concept

Linear regression fits ŷ = Xw + b by minimising the squared error ‖y − Xw − b‖² (the MSE up to a factor of 1/n). Closed form, with b absorbed into w via a column of ones in X: w = (XᵀX)⁻¹ Xᵀy (when XᵀX is invertible). Adding L2 (ridge) regularisation gives w = (XᵀX + αI)⁻¹ Xᵀy, which shrinks weights and stays numerically stable even when features are correlated (the intercept is usually left unpenalised). L1 (lasso) drives many weights to exactly zero, so it performs feature selection; the cost is a non-smooth objective, typically solved by coordinate descent. Elastic Net blends L1 + L2 with a mixing parameter.
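As a sanity check on the ridge closed form, the sketch below solves it with NumPy and compares against scikit-learn's `Ridge`. It assumes synthetic data and handles the intercept by centring X and y, which leaves the intercept unpenalised (scikit-learn does the same when `fit_intercept=True`).

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.5, -2.0, 0.0, 0.7, 3.0]) + 4.0 + rng.normal(scale=0.1, size=200)
alpha = 1.0

# Centre so the intercept drops out of the penalised problem.
x_mean, y_mean = X.mean(axis=0), y.mean()
Xc, yc = X - x_mean, y - y_mean

# w = (Xc^T Xc + alpha I)^-1 Xc^T yc, via a linear solve rather than an explicit inverse.
w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
b = y_mean - x_mean @ w

sk = Ridge(alpha=alpha).fit(X, y)
print("weights match:", np.allclose(w, sk.coef_, atol=1e-6))
print("intercept matches:", np.isclose(b, sk.intercept_, atol=1e-6))
```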

Pick the metric to match the cost: MAE for "I care equally about every dollar", RMSE when large errors hurt disproportionately, R² for "fraction of variance explained relative to predicting the mean".

Worked example — ridge vs. lasso on the diabetes dataset

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

X, y = load_diabetes(return_X_y=True)
Xtr, Xva, ytr, yva = train_test_split(X, y, test_size=0.2, random_state=0)

for name, model in [("ridge", Ridge(alpha=1.0)),
                    ("lasso", Lasso(alpha=0.1, max_iter=10000))]:
    pipe = Pipeline([("s", StandardScaler()), ("m", model)]).fit(Xtr, ytr)
    pred = pipe.predict(Xva)
    print(name,
          "RMSE", round(np.sqrt(mean_squared_error(yva, pred)), 2),
          "R^2",  round(r2_score(yva, pred), 3),
          "nonzero weights", int(np.sum(pipe.named_steps["m"].coef_ != 0)))

Drills

D1 · When does ridge help?

Your linear regression has 30 features but only 50 training points and the val RMSE is huge. Which of {OLS, ridge, lasso} do you reach for and why?

Solution

Either ridge or lasso. With p = 30 close to n = 50, XᵀX is ill-conditioned and OLS variance explodes. Ridge stabilises the inverse with αI; lasso additionally zeros out the least-useful features (often desirable when p is large). Use CV to pick α.
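Picking α by cross-validation could look like the sketch below. The data are synthetic (n = 50, p = 30, only five informative features), and the α grid is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n, p = 50, 30
X = rng.normal(size=(n, p))
true_w = np.zeros(p)
true_w[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]  # only 5 features carry signal
y = X @ true_w + rng.normal(scale=1.0, size=n)

alphas = np.logspace(-3, 3, 25)  # hypothetical search grid
ridge = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas)).fit(X, y)
lasso = make_pipeline(StandardScaler(),
                      LassoCV(cv=5, random_state=0, max_iter=10000)).fit(X, y)

ridge_alpha = ridge.named_steps["ridgecv"].alpha_
lasso_alpha = lasso.named_steps["lassocv"].alpha_
n_zero = int(np.sum(lasso.named_steps["lassocv"].coef_ == 0))
print("ridge alpha:", ridge_alpha)
print("lasso alpha:", lasso_alpha, "| zeroed weights:", n_zero, "of", p)
```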

D2 · Lasso → 0 weights

Explain geometrically why lasso produces exactly-zero coefficients while ridge does not.

Solution

The L1 ball ‖w‖₁ ≤ t has corners on the coordinate axes. The level sets of the squared error are ellipses; they first touch the L1 ball at a corner with positive probability, sending some weights to 0. The L2 ball is smooth; intersection happens at a generic point with all nonzero coordinates.

D3 · MAE vs RMSE

Errors per sample: [1, 1, 1, 10]. Compute MAE and RMSE. What if the 10 is an outlier you want to ignore?

Solution

MAE = (1+1+1+10)/4 = 3.25. RMSE = √((1+1+1+100)/4) ≈ 5.07. To be robust to outliers, prefer MAE or Huber loss; to penalise big misses, prefer RMSE.
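The arithmetic can be checked in a few lines of NumPy; the second pair of numbers drops the outlier to show how much more RMSE moves than MAE.

```python
import numpy as np

errors = np.array([1.0, 1.0, 1.0, 10.0])

mae = np.mean(np.abs(errors))
rmse = np.sqrt(np.mean(errors ** 2))
print("with outlier:    MAE", mae, "RMSE", round(rmse, 2))

# Same metrics with the outlier removed.
clean = errors[:3]
mae_clean = np.mean(np.abs(clean))
rmse_clean = np.sqrt(np.mean(clean ** 2))
print("without outlier: MAE", mae_clean, "RMSE", rmse_clean)
```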

D4 · Gradient-boosting on tabular

You have a 50k×60 tabular CSV with mixed numeric+categorical features. Which model is your strong baseline and why?

Solution

HistGradientBoostingRegressor (or XGBoost / LightGBM). It handles missing values natively, needs no feature scaling, captures non-linear interactions, and is near state of the art on tabular data with little tuning.

2. Classification

Concept

Logistic regression: p = σ(wᵀx + b), trained by minimising binary cross-entropy −Σ [yᵢ log pᵢ + (1−yᵢ) log(1−pᵢ)]. Linear decision boundary, reasonably calibrated probabilities, hard to overfit. kNN: predict the majority label among the k nearest training points; no training, but brute-force inference costs O(n) distance computations per query and demands feature scaling. SVM: maximises margin; with the RBF kernel, fits non-linear boundaries via the kernel trick but scales poorly past ~10k points. Trees are non-linear, scale-free, and can handle missingness, but a single tree overfits, so bag many (random forest) or boost them sequentially (gradient boosting).

Metrics: accuracy is misleading on imbalanced data. Precision = TP/(TP+FP) is "of the ones I flagged, how many were real". Recall = TP/(TP+FN) is "of the real positives, how many did I catch". F1 is the harmonic mean of precision and recall, useful when you want a single number that collapses if either one is near zero. ROC-AUC measures how well the model ranks positives above negatives and is threshold-independent.

Worked example — logistic regression with a class-imbalance fix

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score

X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.97, 0.03], random_state=0)
Xtr, Xva, ytr, yva = train_test_split(X, y, stratify=y, test_size=0.25, random_state=0)

clf = Pipeline([
    ("s", StandardScaler()),
    ("m", LogisticRegression(max_iter=1000, class_weight="balanced")),
]).fit(Xtr, ytr)

pred  = clf.predict(Xva)
proba = clf.predict_proba(Xva)[:, 1]
print(classification_report(yva, pred, digits=3))
print("ROC-AUC:", round(roc_auc_score(yva, proba), 3))

Drills

D1 · Choose the metric

You build a cancer screening model. False negatives = missed disease (catastrophic). False positives = extra biopsy (annoying). Which metric do you optimise?

Solution

Recall on the positive class, subject to a precision floor. Equivalently, set a low decision threshold so the model errs on the side of flagging — then enforce a minimum precision so the biopsy load is bearable.
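One way to turn "maximise recall subject to a precision floor" into code is to scan the precision-recall curve. The sketch below assumes synthetic data and a hypothetical floor of 0.30; in a real screening setting the floor comes from the cost of a false positive.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=0)
proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_va)[:, 1]

# precision/recall have one more entry than thresholds; drop the final point.
precision, recall, thresholds = precision_recall_curve(y_va, proba)
min_precision = 0.30  # hypothetical precision floor
ok = precision[:-1] >= min_precision

# Among thresholds meeting the floor, take the one with the highest recall.
best = int(np.argmax(np.where(ok, recall[:-1], -1.0)))
threshold = thresholds[best]

flagged = (proba >= threshold).astype(int)
achieved_recall = recall_score(y_va, flagged)
print("threshold:", round(float(threshold), 3),
      "| precision:", round(float(precision[best]), 3),
      "| recall:", round(float(achieved_recall), 3))
```

In a contest, choose the threshold on validation data and apply it unchanged to the test set.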

D2 · k in kNN

What goes wrong at k = 1? At k = N?

Solution

k=1 overfits — zero training error, very noisy decision boundary, high variance. k=N always predicts the global majority class — maximum bias, no signal. Tune via CV; common range 3–25.
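Both extremes are easy to reproduce. The sketch below assumes a synthetic 70/30 dataset and uniform vote weights, the scikit-learn default.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=600, n_features=10,
                           weights=[0.7, 0.3], random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, stratify=y, random_state=0)

# k = 1: every training point is its own nearest neighbour.
k1 = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)
train_acc_k1 = k1.score(X_tr, y_tr)

# k = N: every training point votes, so the majority class always wins.
kN = KNeighborsClassifier(n_neighbors=len(X_tr)).fit(X_tr, y_tr)
majority = np.bincount(y_tr).argmax()
all_majority = bool(np.all(kN.predict(X_va) == majority))

print("k=1 train accuracy:", train_acc_k1, "| k=1 val accuracy:", round(k1.score(X_va, y_va), 3))
print("k=N predicts only the majority class:", all_majority)
```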

D3 · Why SVMs use kernels

Explain in one sentence what the kernel trick buys you.

Solution

It lets you fit a linear decision boundary in an implicit, high-dimensional feature space — and compute everything using only inner products k(x,y) = ϕ(x)ᵀϕ(y), never materialising ϕ itself.
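A small numeric check makes the identity concrete. This sketch uses the degree-2 polynomial kernel k(x, z) = (xᵀz)² in two dimensions, whose explicit feature map is ϕ(x) = (x₁², √2·x₁x₂, x₂²). The RBF kernel works on the same principle, but its feature space is infinite-dimensional, so it cannot be written out this way.

```python
import numpy as np

def phi(v):
    """Explicit feature map for the 2-D degree-2 polynomial kernel."""
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

def poly_kernel(a, b):
    """Same inner product, computed without ever building phi."""
    return (a @ b) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = phi(x) @ phi(z)     # inner product in the 3-D feature space
implicit = poly_kernel(x, z)   # kernel evaluated in the original 2-D space
print(explicit, implicit)
```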

D4 · Confusion matrix interpretation

A binary model on 1000 samples gives confusion matrix [[900, 50], [10, 40]] (rows = true, cols = predicted). Compute accuracy, precision, recall, F1 on the positive class.

Solution

Accuracy = (900+40)/1000 = 0.94. Precision = 40/(40+50) ≈ 0.444. Recall = 40/(40+10) = 0.80. F1 = 2·0.444·0.80/(0.444+0.80) ≈ 0.571.
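The same numbers fall out of a few lines of NumPy, using the row = true, column = predicted layout from the drill (the layout `sklearn.metrics.confusion_matrix` also returns).

```python
import numpy as np

cm = np.array([[900, 50],
               [10, 40]])
tn, fp, fn, tp = cm.ravel()  # row-major order: TN, FP, FN, TP

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print("accuracy", round(accuracy, 3), "precision", round(precision, 3),
      "recall", round(recall, 3), "F1", round(f1, 3))
```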

3. Trees & ensembles

Concept

A decision tree splits the feature space along axis-aligned cuts that maximally reduce impurity (Gini or entropy for classification, variance for regression). Single trees have low bias but huge variance. Bagging trains T trees on bootstrap samples and averages; random forests further decorrelate by sampling features at each split. Boosting trains trees sequentially, each fit to the residuals (gradient boosting) or weighted errors (AdaBoost) of the previous ensemble. Gradient boosting on histograms (LightGBM, XGBoost, scikit-learn's HistGradientBoosting*) is the state of the art for tabular data.

Worked example — Random forest vs. gradient boosting

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import cross_val_score, KFold
from sklearn.ensemble import RandomForestRegressor, HistGradientBoostingRegressor
import numpy as np

X, y = fetch_california_housing(return_X_y=True)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

for name, model in [("RF", RandomForestRegressor(n_estimators=300, random_state=0, n_jobs=-1)),
                    ("HGB", HistGradientBoostingRegressor(max_iter=500, learning_rate=0.05,
                                                          random_state=0))]:
    scores = cross_val_score(model, X, y, cv=cv,
                             scoring="neg_root_mean_squared_error", n_jobs=-1)
    print(name, "RMSE", -scores.mean().round(3), "±", scores.std().round(3))

Drills

D1 · Bagging variance reduction

Why does averaging T i.i.d. tree predictions with variance σ² reduce variance to σ²/T? What if the trees are correlated with correlation ρ?

Solution

For i.i.d., Var(Σ Xᵢ / T) = σ²/T. With pairwise correlation ρ: ρ σ² + (1−ρ) σ²/T. As T → ∞ the variance floors at ρ σ² — which is why random forests randomly sub-sample features per split (it lowers ρ).
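The correlated-average formula can be checked by simulation. The sketch below does not train trees: it draws Gaussian "predictions" with variance σ² and pairwise correlation ρ by mixing one shared noise term with an independent one per tree.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, rho, T, n_trials = 1.0, 0.3, 20, 100_000

# X_i = sigma * (sqrt(rho) * shared + sqrt(1 - rho) * own_i)
# gives Var(X_i) = sigma^2 and Corr(X_i, X_j) = rho for i != j.
shared = rng.normal(size=(n_trials, 1))
own = rng.normal(size=(n_trials, T))
preds = np.sqrt(sigma2) * (np.sqrt(rho) * shared + np.sqrt(1 - rho) * own)

empirical = preds.mean(axis=1).var()
theory = rho * sigma2 + (1 - rho) * sigma2 / T
floor = rho * sigma2  # limit as T grows

print("empirical:", round(float(empirical), 4),
      "| theory:", round(theory, 4), "| floor:", floor)
```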

D2 · Boosting overfits — but how?

You train gradient boosting with 5000 trees and validation loss bottoms out at tree 800 then rises. Give two ways to fix it.

Solution

(1) Early stopping at the iteration with best val loss. (2) Reduce learning_rate (and raise n_estimators) or lower max_depth. Subsampling rows/features per iteration also regularises.

D3 · Feature importance trap

You read "feature_importances_" off a random forest. Why might it mislead you about which features are causally important?

Solution

Impurity-based importance is biased toward high-cardinality and continuous features and splits credit arbitrarily among correlated features. Prefer permutation importance on a held-out set, and remember that all these measures are predictive, not causal.
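Comparing the two importance measures side by side could look like this sketch on the breast-cancer dataset. The top-5 printout is only an illustration; rankings will vary with the seed and the split.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_tr, X_va, y_tr, y_va = train_test_split(
    data.data, data.target, stratify=data.target, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Impurity-based importance: computed from the training data during fitting.
impurity = rf.feature_importances_

# Permutation importance: drop in held-out score when one column is shuffled.
perm = permutation_importance(rf, X_va, y_va, n_repeats=10, random_state=0)

for name, values in [("impurity", impurity), ("permutation", perm.importances_mean)]:
    top = np.argsort(values)[::-1][:5]
    print(name, "top 5:", [data.feature_names[i] for i in top])
```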

D4 · Ensemble for the final +1%

You have RF (val acc 0.86), HGB (0.88), and logistic regression (0.83). Sketch a stacking ensemble.

Solution

Generate out-of-fold predictions from each base model on the training set. Use those three probability vectors as features for a meta-model (often logistic regression). Refit each base model on all training data; the meta-model then predicts from the base models' predictions on val/test. Expect a small gain over the best single model, for example to around 0.89.

4. Cross-validation & bias / variance

Concept

A model's expected error decomposes as bias² (systematic mismatch between model and truth) + variance (sensitivity to the particular training sample) + irreducible noise. Underfitting = high bias; overfitting = high variance. Regularisation, more data, and simpler models reduce variance; richer model families and more features reduce bias.

K-fold cross-validation estimates expected val error: split into k folds, train on k−1, evaluate on the held-out fold, repeat, report mean ± std. Use stratified K-fold for classification (preserves class balance) and TimeSeriesSplit for time-ordered data (never train on the future).
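A leakage-free stratified 5-fold run might look like the sketch below. Because the scaler sits inside the pipeline, `cross_val_score` refits it on the training folds of each split and never sees the held-out fold. The dataset is a stand-in.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scaler and model travel together, so scaling statistics come from training folds only.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy")
print("accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```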

Worked example — bias/variance with a learning curve

from sklearn.model_selection import learning_curve, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
import numpy as np, matplotlib.pyplot as plt

X, y = load_breast_cancer(return_X_y=True)
sizes, tr, va = learning_curve(
    LogisticRegression(max_iter=2000), X, y,
    cv=StratifiedKFold(5, shuffle=True, random_state=0),
    train_sizes=np.linspace(0.1, 1.0, 8), scoring="accuracy")

plt.plot(sizes, tr.mean(1), label="train")
plt.plot(sizes, va.mean(1), label="val")
plt.xlabel("training set size"); plt.ylabel("accuracy"); plt.legend()
# wide gap -> variance problem; both low and flat -> bias problem
plt.show()

Drills

D1 · Reading learning curves

Training accuracy 0.99, val accuracy 0.78. High bias or high variance?

Solution

High variance (overfitting). Fix with more data, more regularisation, or a simpler model.

D2 · Time-series split

You shuffle rows of a stock-price dataset before K-fold. What goes wrong?

Solution

You leak the future into the training fold (a model can see tomorrow's price while predicting yesterday's). Use TimeSeriesSplit: train on [0..t], validate on (t..t+Δ], advance.
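`TimeSeriesSplit` enforces the "train on the past, validate on the future" rule. The sketch below uses twelve dummy time steps and prints the index ranges of each fold.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 time-ordered rows (dummy data)
tscv = TimeSeriesSplit(n_splits=3)

splits = list(tscv.split(X))
for tr, va in splits:
    # Every training index precedes every validation index.
    print("train", tr.min(), "..", tr.max(), "| val", va.min(), "..", va.max())
```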

D3 · CV vs. one val split

You try 200 hyperparameter configurations against a single fixed val split. Why is this dangerous and how do you fix it?

Solution

You're effectively training on that val split through repeated peeks, so the best val score is optimistically biased. Fix: tune with k-fold CV (averaged scores), and hold out a final test set you touch only once.

5. Unsupervised learning

Concept

k-means alternates assigning each point to its nearest centroid and updating each centroid as the mean of its cluster. It minimises within-cluster sum of squares but converges to local minima — use k-means++ initialisation and pick k with the elbow method or silhouette score. Hierarchical (agglomerative) clustering builds a tree of merges and lets you cut at any level; useful when k is unknown and the dataset is small (cost is O(n²) or worse).

PCA projects data onto the top-k eigenvectors of the covariance matrix — the directions of greatest variance. Centred input is mandatory; standardise if features are on different scales. t-SNE and UMAP are nonlinear; they are for visualization, not for downstream features (distances aren't globally meaningful).

Worked example — PCA + k-means on the iris dataset

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
import numpy as np

X, y = load_iris(return_X_y=True)
Xs = StandardScaler().fit_transform(X)

pca = PCA(n_components=2).fit(Xs)
Xp  = pca.transform(Xs)
print("variance explained:", pca.explained_variance_ratio_)  # ~[0.73, 0.23]

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(Xp)
print("ARI vs. true species:", adjusted_rand_score(y, km.labels_))

Drills

D1 · Why centre before PCA?

What happens if you skip centring?

Solution

The first principal component will point toward the mean rather than the direction of greatest variance, polluting all subsequent components. sklearn.decomposition.PCA centres automatically; if you do it by hand with SVD, you must subtract the mean first.
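Both claims can be checked with a hand-rolled SVD on iris, left unstandardised to keep the comparison simple. Component signs are arbitrary, so the comparison uses absolute values.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# PCA by hand: centre, then take the right singular vectors.
Xc = X - X.mean(axis=0)
_, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S ** 2 / np.sum(S ** 2)

pca = PCA(n_components=2).fit(X)
same_dirs = np.allclose(np.abs(Vt[:2]), np.abs(pca.components_), atol=1e-6)
same_var = np.allclose(explained[:2], pca.explained_variance_ratio_, atol=1e-6)
print("centred SVD matches sklearn PCA:", same_dirs and same_var)

# Skip centring: the leading direction lines up with the mean vector instead.
_, _, Vt_raw = np.linalg.svd(X, full_matrices=False)
mean_dir = X.mean(axis=0) / np.linalg.norm(X.mean(axis=0))
cos_to_mean = abs(Vt_raw[0] @ mean_dir)
print("cosine between uncentred PC1 and the mean direction:", round(float(cos_to_mean), 4))
```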

D2 · Pick k

Sketch how the elbow method picks k for k-means.

Solution

Plot within-cluster sum of squares vs. k. It monotonically decreases, but with an "elbow" where additional clusters yield only marginal improvement; pick the k at the elbow. Cross-check with silhouette score (higher is better, range −1..1).
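The numbers behind an elbow plot can be gathered in one loop. This sketch assumes synthetic blobs and prints the values rather than plotting them; `inertia_` is scikit-learn's name for the within-cluster sum of squares.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=600, centers=4, cluster_std=0.6, random_state=0)

inertia, sil = {}, {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertia[k] = km.inertia_                   # within-cluster sum of squares
    sil[k] = silhouette_score(X, km.labels_)   # in [-1, 1], higher is better
    print(k, round(inertia[k], 1), round(sil[k], 3))

best_k = max(sil, key=sil.get)  # silhouette cross-check for the elbow
print("k with the highest silhouette:", best_k)
```

Look for the k where the inertia column stops dropping sharply, then see whether the silhouette column agrees.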

D3 · When PCA fails

Give one example where PCA captures variance but destroys the signal you care about.

Solution

A class-separating direction with small variance but high discriminative power gets discarded because it's not a top variance axis. Use LDA (supervised) or a non-linear method instead.
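A toy version of this failure, assuming two synthetic features: a high-variance axis that is pure noise and a low-variance axis that separates the classes. The features are deliberately left unstandardised so that variance and signal disagree.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, size=n)
noise_axis = rng.normal(scale=10.0, size=n)                        # big variance, no signal
signal_axis = (2 * y - 1) * 0.5 + rng.normal(scale=0.1, size=n)    # small variance, all the signal
X = np.column_stack([noise_axis, signal_axis])

# Reduce to one dimension with each method, then classify.
pca_model = make_pipeline(PCA(n_components=1), LogisticRegression())
lda_model = make_pipeline(LinearDiscriminantAnalysis(n_components=1), LogisticRegression())

pca_acc = cross_val_score(pca_model, X, y, cv=5).mean()
lda_acc = cross_val_score(lda_model, X, y, cv=5).mean()
print("PCA(1) accuracy:", round(pca_acc, 3), "| LDA(1) accuracy:", round(lda_acc, 3))
```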

Pitfalls cheat-sheet

Data leakage. Scaler/feature stats fit on the full data; target-derived features; future info bleeding into past rows.
Overfitting the val set. Hundreds of hyperparameter sweeps against one split.
Class imbalance. A 99%-accurate model on 99/1 data may be predicting "majority" every time. Stratify, use class weights, or resample.
Forgotten seeds. Two identical runs that produce different scores waste hours.
Forgetting to refit on full training data after cross-validation. CV gives you the score estimate; for the final submission, refit on all available training data.

Checkpoint — answer out loud

Can you explain bias vs. variance using one sentence and one example each?
Can you set up a stratified 5-fold CV pipeline that includes scaling without leakage?
Can you compute precision, recall, and F1 from a confusion matrix?
Can you state the difference between bagging and boosting, and pick when to use which?
Can you describe — in geometric terms — why lasso zeroes weights and ridge does not?

Next step

When tabular gradient boosting plateaus, you reach for neural nets — head to Deep learning with PyTorch. Test your ML reasoning on bias–variance, CV bias, information gain, and the SVM dual at the Round 2 theory drills.