USAAIO 2024 R1 · Problem 2 · Linear regression & gradient descent [reconstructed]

Contest: 2024 USA-NA-AIO Round 1 · Round: Round 1 (online) · Category: Math / classical ML.

Official sources: usaaio.org/past-problems · USAAIO syllabus.

Reconstruction notice. [reconstructed] from the syllabus. Cross-check when the 2024 PDF publishes.

1. Problem restatement

Multi-part. Derive the normal equations from least-squares loss; implement the closed-form solution in NumPy; implement batch gradient descent for the same problem and verify both converge to the same answer; add L2 regularisation (ridge) and analyse the effect; finally apply to a supplied dataset and report bias-variance / training-loss curves.

2. What's being tested

Matrix calculus. Derivative of (Xβ − y)^T (Xβ − y) with respect to β.
Numerical linear algebra. Why np.linalg.solve is preferred to np.linalg.inv.
Gradient descent. Step-size selection, convergence diagnostics.
Regularisation. Bias-variance trade-off.

3. Data exploration / setup

import numpy as np, pandas as pd
df = pd.read_csv("housing.csv")    # example tabular regression dataset
X = df.drop(columns="price").values
y = df.price.values

# add intercept column
X = np.hstack([np.ones((X.shape[0], 1)), X])

4. Baseline approach

# closed form
beta = np.linalg.solve(X.T @ X, X.T @ y)
preds = X @ beta
rmse = np.sqrt(np.mean((preds - y) ** 2))
print("closed-form RMSE:", rmse)

5. Improvements that move the needle

5.1 · Standardise features before SGD

Unscaled features cause loss surface ill-conditioning — gradient descent zig-zags and converges 100× slower. Standardise X (mean 0, var 1) before SGD; remember to invert the transform at inference.

5.2 · Ridge regression

def ridge(X, y, lam):
    d = X.shape[1]
    A = X.T @ X + lam * np.eye(d)
    A[0, 0] -= lam     # don't penalise intercept
    return np.linalg.solve(A, X.T @ y)

Sweep lam ∈ {0.01, 0.1, 1, 10, 100} on a val split. Plot val RMSE vs lam — typically a U-shape; pick the minimum.

5.3 · Mini-batch SGD with cosine LR

For the "implement gradient descent" parts, use mini-batches (batch=32) and a cosine learning-rate schedule from lr=0.1 to lr=0.001. Plot the loss curve for the write-up.

5.4 · Cross-validate ridge alpha

5-fold CV is more honest than a single val split when the dataset is small.

5.5 · Verify closed-form ≈ SGD result

Compute ‖β_closed − β_sgd‖_2 / ‖β_closed‖_2. Should be < 1e-3 after sufficient SGD epochs. The grader rewards explicit cross-checks.

6. Submission format & gotchas

Show the matrix-calculus derivation of ∂L/∂β = 2 X^T (Xβ − y) step by step.
Avoid np.linalg.inv; use np.linalg.solve. Faster and numerically cleaner.
When you add an intercept column, do not also include a non-zero mean in the standardised features — double-shifts cancel.
Plot residuals; non-random patterns indicate a non-linear truth and earn analysis points.

7. What top solutions did

[reconstructed] Expected full-marks: derivation shown; closed-form via solve; SGD with standardised features and a plotted loss curve; ridge with CV; final write-up commenting on bias-variance trade-off.

8. Drill

D · Why prefer np.linalg.solve over np.linalg.inv for closed-form least squares?

Two reasons. (1) Numerical accuracy: solve(A, b) uses LU decomposition with partial pivoting and avoids forming A^-1 explicitly. inv(A) then @ b compounds rounding error. (2) Speed: solve is O(n^3) once; inv is O(n^3) plus O(n^2) multiplication, with worse constants. The diff is small for n=10 features but real for n=1000.

D2 · Your SGD loss decreases then explodes. What did you do?

Learning rate too high relative to the largest feature scale. Either the features aren't standardised (so one column dominates the gradient) or lr is just too big. Cure: standardise X, and / or reduce lr by 10× and add cosine decay. Loss explosion is almost always a step-size bug, not a model bug.

← USAAIO 2024 Round 1 set