← USAAIO 2025 Round 1 set

USAAIO 2025 Round 1 · Problem 2 · Affine transformations & NN fundamentals

Contest: 2025 USA-NA-AIO Round 1 · Round: Round 1 (online) · Category: Linear algebra / neural network basics.

Official sources: usaaio.org/past-problems · P2 Part 1 forum thread.

1. Problem restatement

The full problem is split across many parts (13+ on the forum). Part 1 is small and verbatim from the thread: given a weight matrix W and bias b, identify the input/output dimensions of the affine transformation y = Wx + b, then compute y by hand for a specific x. Later parts build from this base: compose two affine layers, add a nonlinearity, derive a gradient by hand, implement forward/backward in NumPy, and finally train a tiny classifier on a supplied dataset.

Source. Part 1 is verbatim-paraphrased from the forum thread linked above. Parts 2–13 are reconstructed from the documented progression (linear → nonlinear → composition → gradient → training loop) consistent with the Round-1 syllabus; treat exact wording as [verify against source].

2. What's being tested

Connects directly to Math (matrices, gradients) and Deep Learning (forward/backward, SGD).

3. Data exploration / setup

Most parts of P2 do not need a dataset — they're pencil-and-paper. The late parts use a small synthetic 2D dataset (two-moons, three-class blob, similar). Generate it locally:

from sklearn.datasets import make_moons
X, y = make_moons(n_samples=400, noise=0.2, random_state=0)

4. Baseline approach

Part 1 is a one-liner. Read the matrix shapes off the dimensions, then compute.

import numpy as np
W = np.array([[1, -2, 0], [3, 1, -1]])     # 2x3
b = np.array([0, 1])                        # 2
x = np.array([1, 2, -1])                    # 3
y = W @ x + b
print(y)        # [-3, 5+1]  ->  [-3, 6]

Input dim N = 3, output dim M = 2. The same conventions extend to every later part — keep dimensions written down explicitly.

5. Improvements that move the needle

5.1 · Annotate shapes everywhere

Comment every line of code with shapes: # X: (N, D), W1: (D, H), Z: (N, H). This single habit eliminates 90% of off-by-one transpose bugs that cost partial credit.

5.2 · Implement forward and backward as pure NumPy

def forward(X, W1, b1, W2, b2):
    Z1 = X @ W1 + b1                # (N, H)
    A1 = np.maximum(0, Z1)          # ReLU
    Z2 = A1 @ W2 + b2               # (N, K)
    P  = softmax(Z2)                # (N, K)
    return Z1, A1, Z2, P

def backward(X, Y, Z1, A1, Z2, P, W2):
    N = X.shape[0]
    dZ2 = (P - Y) / N
    dW2 = A1.T @ dZ2
    db2 = dZ2.sum(0)
    dA1 = dZ2 @ W2.T
    dZ1 = dA1 * (Z1 > 0)
    dW1 = X.T @ dZ1
    db1 = dZ1.sum(0)
    return dW1, db1, dW2, db2

5.3 · Cross-check gradient with finite differences

For one weight at a time, perturb by ε = 1e-5, compute (L(+) − L(−))/(2ε), compare to your analytic gradient. If max relative error < 1e-5 you're correct. This is a standard sanity check and the grader rewards code that includes it.

5.4 · Initialise weights properly (Glorot / He)

Zero-initialising weights kills training. For a ReLU network use He init: W ~ N(0, 2/fan_in). Late parts of P2 fail silently with bad init.

5.5 · Show the training curve

Even if the question only asks for final accuracy, plot loss vs epoch. Graders for theory-coding problems often allocate a "write-up quality" portion, and a plot scores it cheaply.

6. Submission format & gotchas

7. What top solutions did

Public solutions on the forum show that perfect scores on P2 come from disciplined documentation: shape annotations on every line, a gradient-check cell included, He-initialised weights, and a saved loss curve. Mathematical write-ups use LaTeX in the forum's MathJax. No clever tricks — the problem rewards mechanical correctness. [illustrative]

8. Drill

D · Two affine layers without a nonlinearity. Why is this a single affine layer?

Let z = W2 (W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2). Define W' = W2 W1, b' = W2 b1 + b2. Then z = W' x + b', exactly one affine map. This is why you need a nonlinearity between linear layers — without it, no matter how deep, the network is a single matrix in disguise.

D2 · Your gradient check fails by a factor of N. What's wrong?

You almost certainly forgot to divide by batch size in the loss or in the gradient. If L = mean(...) then dL/dZ = (P - Y) / N. If L = sum(...) then dL/dZ = P - Y. Match the loss reduction and the gradient reduction. A factor of N is the canonical "off-by-batch-size" bug.

← USAAIO 2025 Round 1 set