USAAIO 2024 R1 · Problem 2 · Linear regression & gradient descent [reconstructed]
Contest: 2024 USA-NA-AIO Round 1 · Round: Round 1 (online) · Category: Math / classical ML.
Official sources: usaaio.org/past-problems · USAAIO syllabus.
1. Problem restatement
Multi-part. Derive the normal equations from least-squares loss; implement the closed-form solution in NumPy; implement batch gradient descent for the same problem and verify both converge to the same answer; add L2 regularisation (ridge) and analyse the effect; finally apply to a supplied dataset and report bias-variance / training-loss curves.
2. What's being tested
- Matrix calculus. Derivative of (Xβ − y)^T (Xβ − y) with respect to β.
- Numerical linear algebra. Why
np.linalg.solveis preferred tonp.linalg.inv. - Gradient descent. Step-size selection, convergence diagnostics.
- Regularisation. Bias-variance trade-off.
3. Data exploration / setup
import numpy as np, pandas as pd
df = pd.read_csv("housing.csv") # example tabular regression dataset
X = df.drop(columns="price").values
y = df.price.values
# add intercept column
X = np.hstack([np.ones((X.shape[0], 1)), X])
4. Baseline approach
# closed form
beta = np.linalg.solve(X.T @ X, X.T @ y)
preds = X @ beta
rmse = np.sqrt(np.mean((preds - y) ** 2))
print("closed-form RMSE:", rmse)
5. Improvements that move the needle
5.1 · Standardise features before SGD
Unscaled features cause loss surface ill-conditioning — gradient descent zig-zags and converges 100× slower. Standardise X (mean 0, var 1) before SGD; remember to invert the transform at inference.
5.2 · Ridge regression
def ridge(X, y, lam):
d = X.shape[1]
A = X.T @ X + lam * np.eye(d)
A[0, 0] -= lam # don't penalise intercept
return np.linalg.solve(A, X.T @ y)
Sweep lam ∈ {0.01, 0.1, 1, 10, 100} on a val split. Plot val RMSE vs lam — typically a
U-shape; pick the minimum.
5.3 · Mini-batch SGD with cosine LR
For the "implement gradient descent" parts, use mini-batches (batch=32) and a cosine learning-rate schedule from lr=0.1 to lr=0.001. Plot the loss curve for the write-up.
5.4 · Cross-validate ridge alpha
5-fold CV is more honest than a single val split when the dataset is small.
5.5 · Verify closed-form ≈ SGD result
Compute ‖β_closed − β_sgd‖_2 / ‖β_closed‖_2. Should be < 1e-3 after sufficient SGD epochs. The grader rewards explicit cross-checks.
6. Submission format & gotchas
- Show the matrix-calculus derivation of
∂L/∂β = 2 X^T (Xβ − y)step by step. - Avoid
np.linalg.inv; usenp.linalg.solve. Faster and numerically cleaner. - When you add an intercept column, do not also include a non-zero mean in the standardised features — double-shifts cancel.
- Plot residuals; non-random patterns indicate a non-linear truth and earn analysis points.
7. What top solutions did
[reconstructed] Expected full-marks: derivation shown; closed-form via
solve; SGD with standardised features and a plotted loss curve; ridge with CV; final
write-up commenting on bias-variance trade-off.
8. Drill
D · Why prefer np.linalg.solve over np.linalg.inv for closed-form least squares?
Two reasons. (1) Numerical accuracy: solve(A, b) uses LU decomposition with partial
pivoting and avoids forming A^-1 explicitly. inv(A) then @ b compounds
rounding error. (2) Speed: solve is O(n^3) once; inv is O(n^3) plus O(n^2) multiplication, with
worse constants. The diff is small for n=10 features but real for n=1000.
D2 · Your SGD loss decreases then explodes. What did you do?
Learning rate too high relative to the largest feature scale. Either the features aren't standardised (so one column dominates the gradient) or lr is just too big. Cure: standardise X, and / or reduce lr by 10× and add cosine decay. Loss explosion is almost always a step-size bug, not a model bug.