Classical machine learning
Before deep learning is on the table, classical ML solves most contest tabular problems faster, with fewer bugs, and with less compute. This page sweeps the scikit-learn families you must own: regression, classification, ensembles, cross-validation, clustering, dimensionality reduction.
The contest workflow
- Load & inspect. pd.read_csv, .info(), .describe(), scan for NaNs, look at value distributions.
- Train/val/test split. Stratified for classification; time-based for time series. Never mix.
- Baseline model. Logistic regression or random forest with default params. Whatever you build later must beat this.
- Feature engineering. Numerical: standardize / log-transform. Categorical: one-hot or target encode. Date: extract year / month / day-of-week.
- Model search. Try 2–3 model families, tune each with cross-validation.
- Ensemble. Average top models. Usually +1–3% on the leaderboard.
- Reproduce. Fix random seeds, save the trained model, log the val score.
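The split, baseline, and seed steps above can be sketched as follows. This is a minimal illustration, not a full pipeline: the breast-cancer dataset, the 80/20 split, and the extra majority-class floor (DummyClassifier) are assumptions chosen for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

SEED = 0  # fix the seed once and reuse it everywhere

X, y = load_breast_cancer(return_X_y=True)

# Stratified split keeps the class ratio the same in both parts.
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=SEED)

# Floor: always predict the majority class.
dummy = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)

# Baseline: scaled logistic regression, otherwise default hyperparameters.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
baseline.fit(X_tr, y_tr)

dummy_acc = dummy.score(X_va, y_va)
baseline_acc = baseline.score(X_va, y_va)
print(f"majority-class acc: {dummy_acc:.3f}")
print(f"baseline acc:       {baseline_acc:.3f}")
```

Log both numbers. Anything you build later has to beat the baseline, and the baseline has to beat the floor.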
1. Regression
Concept
Linear regression fits ŷ = Xw + b by minimising the squared error
‖y − Xw − b‖². With the intercept absorbed (centre X and y, or append a column of
ones), the closed form is w = (XᵀX)⁻¹ Xᵀy (when XᵀX is invertible). Adding L2 (ridge)
regularisation gives w = (XᵀX + αI)⁻¹ Xᵀy, which shrinks weights and is numerically
stable even when features are correlated.
L1 (lasso) drives many weights to exactly zero, so it performs feature
selection; the cost is a non-smooth objective solved by coordinate descent. Elastic Net
blends L1 + L2 with a mixing parameter.
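To make the ridge closed form concrete, here is a small check against scikit-learn's Ridge. It is a sketch on synthetic data (the weights, noise level, and α = 1 are made up for the example), and it assumes the intercept is handled by centring X and y, which leaves the intercept unpenalised.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
w_true = np.array([2.0, -1.0, 0.0, 0.5, 3.0])  # illustrative weights
y = X @ w_true + 4.0 + rng.normal(scale=0.1, size=200)

alpha = 1.0
# Centre X and y so the intercept drops out of the normal equations.
Xc = X - X.mean(axis=0)
yc = y - y.mean()

# Solve (XᵀX + αI) w = Xᵀy instead of forming the inverse explicitly.
w_closed = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
b_closed = y.mean() - X.mean(axis=0) @ w_closed

model = Ridge(alpha=alpha).fit(X, y)
weights_match = np.allclose(w_closed, model.coef_, atol=1e-6)
intercept_match = np.isclose(b_closed, model.intercept_, atol=1e-6)
print(weights_match, intercept_match)
```

Using np.linalg.solve rather than an explicit inverse is the usual choice for numerical stability.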
Pick the metric to match the cost: MAE for "I care equally about every dollar",
RMSE when large errors hurt disproportionately, R² for "fraction of variance
explained relative to predicting the mean".
Worked example — ridge vs. lasso on the diabetes dataset
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
X, y = load_diabetes(return_X_y=True)
Xtr, Xva, ytr, yva = train_test_split(X, y, test_size=0.2, random_state=0)
for name, model in [("ridge", Ridge(alpha=1.0)),
                    ("lasso", Lasso(alpha=0.1, max_iter=10000))]:
    pipe = Pipeline([("s", StandardScaler()), ("m", model)]).fit(Xtr, ytr)
    pred = pipe.predict(Xva)
    print(name,
          "RMSE", round(np.sqrt(mean_squared_error(yva, pred)), 2),
          "R^2", round(r2_score(yva, pred), 3),
          "nonzero weights", int(np.sum(pipe.named_steps["m"].coef_ != 0)))
Drills
D1 · When does ridge help?
Your linear regression has 30 features but only 50 training points and the val RMSE is huge. Which of {OLS, ridge, lasso} do you reach for and why?
Solution
Either ridge or lasso. p ≈ n means XᵀX is near-singular; OLS variance
explodes. Ridge stabilises the inverse with αI; lasso additionally zeros out the
least-useful features (often desirable when p is large). Use CV to pick α.
D2 · Lasso → 0 weights
Explain geometrically why lasso produces exactly-zero coefficients while ridge does not.
Solution
The L1 ball ‖w‖₁ ≤ t has corners on the coordinate axes. The level sets of the squared
error are ellipses; they first touch the L1 ball at a corner with positive probability, sending some
weights to 0. The L2 ball is smooth; intersection happens at a generic point with all nonzero
coordinates.
D3 · MAE vs RMSE
Errors per sample: [1, 1, 1, 10]. Compute MAE and RMSE. What if the 10 is an outlier you
want to ignore?
Solution
MAE = (1+1+1+10)/4 = 3.25. RMSE = √((1+1+1+100)/4) ≈ 5.07. To be robust to
outliers, prefer MAE or Huber loss; to penalise big misses, prefer RMSE.
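The arithmetic is quick to confirm in NumPy. The second half drops the outlier to show how much more RMSE reacts to it than MAE does.

```python
import numpy as np

errors = np.array([1.0, 1.0, 1.0, 10.0])

mae = np.mean(np.abs(errors))
rmse = np.sqrt(np.mean(errors ** 2))
print("with outlier:   ", mae, round(rmse, 2))  # 3.25 5.07

# Same metrics with the outlier removed.
clean = errors[:3]
mae_clean = np.mean(np.abs(clean))
rmse_clean = np.sqrt(np.mean(clean ** 2))
print("without outlier:", mae_clean, rmse_clean)  # 1.0 1.0
```

One bad sample multiplies MAE by 3.25 and RMSE by about 5.07.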
D4 · Gradient-boosting on tabular
You have a 50k×60 tabular CSV with mixed numeric+categorical features. Which model is your strong baseline and why?
Solution
HistGradientBoostingRegressor (or XGBoost / LightGBM). Handles missing values
natively, no scaling needed, handles non-linear interactions, near-SOTA on tabular without tuning.
2. Classification
Concept
Logistic regression: p = σ(wᵀx + b), train by minimising binary
cross-entropy −Σ [yᵢ log pᵢ + (1−yᵢ) log(1−pᵢ)]. Linear decision boundary, calibrated
probabilities, hard to overfit. kNN: predict the majority label among the k nearest
training points; no training, but inference cost O(n) per query and demands feature
scaling. SVM: maximises margin; with the RBF kernel, fits non-linear boundaries via
the kernel trick but scales poorly past ~10k points. Trees are non-linear, scale-free,
handle missingness, but a single tree overfits — bag many (random forest) or boost them sequentially
(gradient boosting).
Metrics: accuracy is misleading on imbalanced data. Precision = TP/(TP+FP) is "of the ones I flagged, how many were real". Recall = TP/(TP+FN) is "of the real positives, how many did I catch". F1 is the harmonic mean of the two, useful when you want a single number that collapses to zero if either one does. ROC-AUC is the probability that a randomly chosen positive is scored above a randomly chosen negative, so it is threshold-independent.
Worked example — logistic regression with a class-imbalance fix
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.97, 0.03], random_state=0)
Xtr, Xva, ytr, yva = train_test_split(X, y, stratify=y, test_size=0.25, random_state=0)
clf = Pipeline([
    ("s", StandardScaler()),
    ("m", LogisticRegression(max_iter=1000, class_weight="balanced")),
]).fit(Xtr, ytr)
pred = clf.predict(Xva)
proba = clf.predict_proba(Xva)[:, 1]
print(classification_report(yva, pred, digits=3))
print("ROC-AUC:", round(roc_auc_score(yva, proba), 3))
Drills
D1 · Choose the metric
You build a cancer screening model. False negatives = missed disease (catastrophic). False positives = extra biopsy (annoying). Which metric do you optimise?
Solution
Recall on the positive class, subject to a precision floor. Equivalently, set a low decision threshold so the model errs on the side of flagging — then enforce a minimum precision so the biopsy load is bearable.
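One way to turn "maximise recall subject to a precision floor" into code is to scan the precision-recall curve. This is a sketch under assumptions: the data is synthetic, the 0.20 floor is a hypothetical tolerance, and the single-cluster, well-separated classes are chosen so that a linear model has some signal to work with.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=5000, n_features=20, n_informative=2,
                           n_redundant=2, n_clusters_per_class=1, class_sep=2.0,
                           weights=[0.97, 0.03], random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=0)

clf = make_pipeline(StandardScaler(),
                    LogisticRegression(max_iter=1000, class_weight="balanced"))
proba = clf.fit(X_tr, y_tr).predict_proba(X_va)[:, 1]

PRECISION_FLOOR = 0.20  # hypothetical tolerance for extra biopsies

precision, recall, thresholds = precision_recall_curve(y_va, proba)
# precision/recall have one more entry than thresholds; drop the final point.
meets_floor = precision[:-1] >= PRECISION_FLOOR
# Among thresholds that meet the floor, keep the one with the highest recall.
best = int(np.argmax(np.where(meets_floor, recall[:-1], -1.0)))
threshold = thresholds[best]

flagged = (proba >= threshold).astype(int)
chosen_precision = precision_score(y_va, flagged)
chosen_recall = recall_score(y_va, flagged)
print("threshold", round(float(threshold), 3),
      "precision", round(chosen_precision, 3),
      "recall", round(chosen_recall, 3))
```

In a real contest you would pick the threshold on a validation fold and report the result on data the threshold never saw.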
D2 · k in kNN
What goes wrong at k = 1? At k = N?
Solution
k=1 overfits — zero training error, very noisy decision boundary, high variance.
k=N always predicts the global majority class — maximum bias, no signal. Tune via CV;
common range 3–25.
D3 · Why SVMs use kernels
Explain in one sentence what the kernel trick buys you.
Solution
It lets you fit a linear decision boundary in an implicit, high-dimensional feature space — and
compute everything using only inner products k(x,y) = ϕ(x)ᵀϕ(y), never materialising
ϕ itself.
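A tiny numeric check makes the identity tangible. The example assumes the degree-2 polynomial kernel k(x, y) = (xᵀy)² on 2-D inputs, whose explicit feature map is ϕ(x) = [x₁², √2·x₁x₂, x₂²]; the two vectors are arbitrary.

```python
import numpy as np

def phi(v):
    """Explicit feature map for the degree-2 polynomial kernel in 2-D."""
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

def kernel(x, y):
    """Same inner product, computed without ever building phi."""
    return (x @ y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

explicit = phi(x) @ phi(y)   # inner product in the 3-D feature space
implicit = kernel(x, y)      # one dot product in the original 2-D space
print(explicit, implicit)
```

Both values equal 1 here (xᵀy = 1). With the RBF kernel the feature space is infinite-dimensional, so the implicit route is the only practical one.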
D4 · Confusion matrix interpretation
A binary model on 1000 samples gives confusion matrix [[900, 50], [10, 40]] (rows = true,
cols = predicted). Compute accuracy, precision, recall, F1 on the positive class.
Solution
Accuracy = (900+40)/1000 = 0.94. Precision = 40/(40+50) ≈ 0.444. Recall
= 40/(40+10) = 0.80. F1 = 2·0.444·0.80/(0.444+0.80) ≈ 0.571.
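The same numbers can be recovered with scikit-learn. The label vectors below are rebuilt purely to reproduce the stated confusion matrix; they are not real data.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# 950 true negatives-or-false-positives, then 50 true positives-or-false-negatives.
y_true = np.array([0] * 950 + [1] * 50)
y_pred = np.array([0] * 900 + [1] * 50 + [0] * 10 + [1] * 40)

cm = confusion_matrix(y_true, y_pred)  # rows = true, cols = predicted
tn, fp, fn, tp = cm.ravel()

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
print(cm.tolist())
print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
```

Note that cm.ravel() returns the counts in the order TN, FP, FN, TP for a binary problem.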
3. Trees & ensembles
Concept
A decision tree splits the feature space along axis-aligned cuts that maximally reduce impurity (Gini
or entropy for classification, variance for regression). Single trees have low bias but huge variance.
Bagging trains T trees on bootstrap samples and averages; random forests
further decorrelate by sampling features at each split. Boosting trains trees
sequentially, each fit to the residuals (gradient boosting) or weighted errors (AdaBoost) of the
previous ensemble. Gradient boosting on histograms (LightGBM, XGBoost, scikit-learn's
HistGradientBoosting*) is the state of the art for tabular data.
Worked example — Random forest vs. gradient boosting
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import cross_val_score, KFold
from sklearn.ensemble import RandomForestRegressor, HistGradientBoostingRegressor
import numpy as np
X, y = fetch_california_housing(return_X_y=True)
cv = KFold(n_splits=5, shuffle=True, random_state=0)
for name, model in [("RF", RandomForestRegressor(n_estimators=300, random_state=0, n_jobs=-1)),
                    ("HGB", HistGradientBoostingRegressor(max_iter=500, learning_rate=0.05,
                                                          random_state=0))]:
    scores = cross_val_score(model, X, y, cv=cv,
                             scoring="neg_root_mean_squared_error", n_jobs=-1)
    print(name, "RMSE", -scores.mean().round(3), "±", scores.std().round(3))
Drills
D1 · Bagging variance reduction
Why does averaging T i.i.d. tree predictions with variance σ² reduce
variance to σ²/T? What if the trees are correlated with correlation ρ?
Solution
For i.i.d., Var(Σ Xᵢ / T) = σ²/T. With pairwise correlation ρ:
ρ σ² + (1−ρ) σ²/T. As T → ∞ the variance floors at ρ σ² — which
is why random forests randomly sub-sample features per split (it lowers ρ).
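The correlated-variance formula can be checked by simulation. This sketch stands in for tree predictions with equicorrelated Gaussians (a shared component plus private noise); T = 10, ρ = 0.3, and σ = 1 are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
T, rho, sigma, n_draws = 10, 0.3, 1.0, 200_000

# Each column is one "tree"; every pair of columns has correlation rho.
shared = rng.normal(size=(n_draws, 1))
private = rng.normal(size=(n_draws, T))
preds = sigma * (np.sqrt(rho) * shared + np.sqrt(1 - rho) * private)

empirical = preds.mean(axis=1).var()
formula = rho * sigma**2 + (1 - rho) * sigma**2 / T
floor = rho * sigma**2  # what remains as T grows without bound
print(round(empirical, 3), round(formula, 3), floor)
```

The formula gives 0.37 here, well above the 0.10 you would get from ten independent trees.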
D2 · Boosting overfits — but how?
You train gradient boosting with 5000 trees and validation loss bottoms out at tree 800 then rises. Two ways to fix.
Solution
(1) Early stopping at the iteration with best val loss. (2) Reduce learning_rate (and
raise n_estimators) or lower max_depth. Subsampling rows/features per
iteration also regularises.
D3 · Feature importance trap
You read "feature_importances_" off a random forest. Why might it mislead you about which features are causally important?
Solution
Impurity-based importance is computed on the training data, is biased toward high-cardinality and continuous features, and splits credit arbitrarily among correlated features, so any one of them can look weak. Prefer permutation importance on a held-out set, and remember that all these metrics are predictive, not causal.
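A sketch of permutation importance on a held-out set, side by side with the built-in ranking. The data is synthetic: with shuffle=False, make_classification places the three informative features in columns 0 to 2 and leaves the rest as noise, which gives a known ground truth to compare against.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=8, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0, n_jobs=-1)
rf.fit(X_tr, y_tr)

# Shuffle one column at a time on held-out data and measure the score drop.
perm = permutation_importance(rf, X_va, y_va, n_repeats=10, random_state=0)

impurity_rank = np.argsort(rf.feature_importances_)[::-1]
perm_rank = np.argsort(perm.importances_mean)[::-1]
print("impurity ranking:   ", impurity_rank)
print("permutation ranking:", perm_rank)
```

On clean synthetic data the two rankings tend to agree; the gap shows up with high-cardinality or strongly correlated columns.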
D4 · Ensemble for the final +1%
You have RF (val acc 0.86), HGB (0.88), and logistic regression (0.83). Sketch a stacking ensemble.
Solution
Generate out-of-fold predictions from each base model on the training set. Use those three probability vectors as features for a meta-model (often logistic regression). Refit each base model on all training data; the meta-model then predicts from the base models' predictions on val/test. Expect a small gain over the best single model, on the order of ~0.89 here, though nothing guarantees it.
4. Cross-validation & bias / variance
Concept
A model's expected error decomposes as bias² (systematic mismatch between model and truth) + variance (sensitivity to the particular training sample) + irreducible noise. Underfitting = high bias; overfitting = high variance. Regularisation, more data, and simpler models reduce variance; richer model families and more features reduce bias.
K-fold cross-validation estimates expected val error: split into k folds, train on
k−1, evaluate on the held-out fold, repeat, report mean ± std. Use stratified
K-fold for classification (preserves class balance) and TimeSeriesSplit for time-ordered
data (never train on the future).
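A minimal leak-free setup puts the scaler inside a pipeline, so each fold fits the scaler on its own training part only. The dataset and the logistic-regression model are placeholders.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scaler and model travel together, so scaling statistics never see the held-out fold.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy")
print(f"accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```

Calling StandardScaler().fit_transform(X) before cross-validation would leak held-out statistics into training, which is the mistake this layout avoids.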
Worked example — bias/variance with a learning curve
from sklearn.model_selection import learning_curve, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
import numpy as np, matplotlib.pyplot as plt
X, y = load_breast_cancer(return_X_y=True)
sizes, tr, va = learning_curve(
    LogisticRegression(max_iter=2000), X, y,
    cv=StratifiedKFold(5, shuffle=True, random_state=0),
    train_sizes=np.linspace(0.1, 1.0, 8), scoring="accuracy")
plt.plot(sizes, tr.mean(1), label="train")
plt.plot(sizes, va.mean(1), label="val")
plt.xlabel("training set size"); plt.ylabel("accuracy"); plt.legend()
# wide gap -> variance problem; both low and flat -> bias problem
Drills
D1 · Reading learning curves
Training accuracy 0.99, val accuracy 0.78. High bias or high variance?
Solution
High variance (overfitting). Fix with more data, more regularisation, or a simpler model.
D2 · Time-series split
You shuffle rows of a stock-price dataset before K-fold. What goes wrong?
Solution
You leak the future into the training fold (a model can see tomorrow's price while predicting
yesterday's). Use TimeSeriesSplit: train on [0..t], validate on
(t..t+Δ], advance.
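Printing the fold indices shows the expanding-window behaviour directly. The 20-row array is a toy stand-in for time-ordered data.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)  # row index doubles as the timestamp

tscv = TimeSeriesSplit(n_splits=4)
folds = list(tscv.split(X))

for train_idx, val_idx in folds:
    print(f"train 0..{train_idx.max():2d}   validate {val_idx.min()}..{val_idx.max()}")
```

Every validation block starts strictly after its training block ends, so no fold trains on the future.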
D3 · CV vs. one val split
You try 200 hyperparameter configurations against a single fixed val split. Why is this dangerous and how do you fix it?
Solution
You're effectively training on that val split through repeated peeks, so the best val score is optimistically biased. Fix: select with k-fold CV (averaged scores), and hold out a final test set you touch only once.
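Both fixes fit in a few lines: search with cross-validation on a development set, then score once on a test set that took no part in the search. The dataset and the small grid over C are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Carve off the final test set first and do not look at it during tuning.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(pipe, grid, cv=5, scoring="accuracy")
search.fit(X_dev, y_dev)  # selection uses averaged 5-fold scores only

test_acc = search.score(X_test, y_test)  # the single touch of the test set
print("best C:", search.best_params_["logisticregression__C"])
print("cv acc:", round(search.best_score_, 3), "test acc:", round(test_acc, 3))
```

GridSearchCV refits the best configuration on all of X_dev by default, so search.score uses a model trained on the full development set.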
5. Unsupervised learning
Concept
k-means alternates assigning each point to its nearest centroid and updating each
centroid as the mean of its cluster. It minimises within-cluster sum of squares but converges to local
minima — use k-means++ initialisation and pick k with the elbow method or silhouette score.
Hierarchical (agglomerative) clustering builds a tree of merges and lets you cut at
any level; useful when k is unknown and the dataset is small (cost is O(n²)
or worse).
PCA projects data onto the top-k eigenvectors of the covariance matrix —
the directions of greatest variance. Centred input is mandatory; standardise if features are on
different scales. t-SNE and UMAP are nonlinear; treat them as
visualisation tools rather than sources of downstream features (distances aren't globally meaningful).
Worked example — PCA + k-means on the iris dataset
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
import numpy as np
X, y = load_iris(return_X_y=True)
Xs = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(Xs)
Xp = pca.transform(Xs)
print("variance explained:", pca.explained_variance_ratio_) # ~[0.73, 0.23]
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(Xp)
print("ARI vs. true species:", adjusted_rand_score(y, km.labels_))
Drills
D1 · Why centre before PCA?
What happens if you skip centring?
Solution
The first principal component will point toward the mean rather than the direction of greatest
variance, polluting all subsequent components. sklearn.decomposition.PCA centres
automatically; if you do it by hand with SVD, you must subtract the mean first.
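The effect is easy to see with a hand-rolled SVD on iris. This sketch compares a centred SVD against sklearn's PCA, then repeats the SVD without centring and measures how closely the first axis lines up with the mean vector.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
mean = X.mean(axis=0)

# Centred SVD: rows of Vt are the principal axes.
_, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
explained_ratio = S**2 / np.sum(S**2)

pca = PCA(n_components=2).fit(X)
# Axes are only defined up to sign, so compare absolute values.
same_axes = np.allclose(np.abs(Vt[:2]), np.abs(pca.components_), atol=1e-6)

# Uncentred SVD: the first axis mostly tracks the mean vector.
_, _, Vt_raw = np.linalg.svd(X, full_matrices=False)
cos_to_mean = abs(Vt_raw[0] @ mean) / np.linalg.norm(mean)

print("matches sklearn PCA:", same_axes)
print("cosine(first uncentred axis, mean):", round(float(cos_to_mean), 4))
```

A cosine close to 1 means the uncentred first component is spending itself on the offset from the origin rather than on the spread of the data.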
D2 · Pick k
Sketch how the elbow method picks k for k-means.
Solution
Plot within-cluster sum of squares vs. k. It monotonically decreases, but with an
"elbow" where additional clusters yield only marginal improvement; pick the k at the
elbow. Cross-check with silhouette score (higher is better, range −1..1).
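Both checks can be run in one loop. The data is synthetic: four well-separated blobs at hand-picked centres, so the right answer is known in advance and the example only illustrates the mechanics.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0], [5.0, 0.0]])
X, _ = make_blobs(n_samples=600, centers=centers, cluster_std=0.6,
                  random_state=0)

inertias, silhouettes = {}, {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_                       # within-cluster sum of squares
    silhouettes[k] = silhouette_score(X, km.labels_)

best_k = max(silhouettes, key=silhouettes.get)
for k in inertias:
    print(k, round(inertias[k], 1), round(silhouettes[k], 3))
print("best k by silhouette:", best_k)
```

Plot inertias against k to see the elbow; on real data the bend is usually softer, which is why the silhouette cross-check helps.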
D3 · When PCA fails
Give one example where PCA captures variance but destroys the signal you care about.
Solution
A class-separating direction with small variance but high discriminative power gets discarded because it's not a top variance axis. Use LDA (supervised) or a non-linear method instead.
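A two-feature toy case shows the failure. The data is constructed for the purpose: one high-variance feature with no class signal and one low-variance feature that separates the classes. The features are assumed to share units, so nothing is standardised, and the projections are fitted on the full dataset for brevity.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 2, size=n)

x_big = rng.normal(scale=10.0, size=n)               # high variance, no signal
x_small = rng.normal(loc=2.0 * y - 1.0, scale=0.3)   # low variance, separates classes
X = np.column_stack([x_big, x_small])

# Keep a single component from each method.
z_pca = PCA(n_components=1).fit_transform(X)
z_lda = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y)

acc_pca = cross_val_score(LogisticRegression(), z_pca, y, cv=5).mean()
acc_lda = cross_val_score(LogisticRegression(), z_lda, y, cv=5).mean()
print("accuracy on PCA component:", round(acc_pca, 3))
print("accuracy on LDA component:", round(acc_lda, 3))
```

PCA keeps the noisy axis because it has the most variance, so the classifier is left near chance, while LDA uses the labels and keeps the separating axis.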
Pitfalls cheat-sheet
- Data leakage. Scaler/feature stats fit on full data; target-derived features; future info bleeding into past rows.
- Overfitting the val set. Hundreds of hyperparameter sweeps against one split.
- Class imbalance. A 99%-accurate model on 99/1 data may be predicting "majority" every time. Stratify, use class weights, or resample.
- Forgotten seeds. Two identical runs that produce different scores waste hours.
- Forgetting to refit on full training data after cross-validation. CV gives you the score estimate; for the final submission, refit on all available training data.
Checkpoint — answer out loud
- Can you explain bias vs. variance using one sentence and one example each?
- Can you set up a stratified 5-fold CV pipeline that includes scaling without leakage?
- Can you compute precision, recall, and F1 from a confusion matrix?
- Can you state the difference between bagging and boosting, and pick when to use which?
- Can you describe — in geometric terms — why lasso zeroes weights and ridge does not?
Next step
When tabular gradient boosting plateaus, you reach for neural nets — head to Deep learning with PyTorch. Test your ML reasoning on bias–variance, CV bias, information gain, and the SVM dual at the Round 2 theory drills.