The math you actually need

Four areas: linear algebra, probability & statistics, multivariable calculus, convex optimization. You don't need a full undergraduate sequence — you need the parts that make gradient descent, PCA, and the bias / variance tradeoff legible.

Syllabus link. This page tracks the math block of the official USAAIO syllabus. If something here disagrees with the syllabus, the syllabus wins.
TL;DR. By the end of this page you should be able to (1) multiply matrices and read off rank/eigenvalues/SVD shapes, (2) apply Bayes' rule and compute expectations of common distributions, (3) compute gradients and Hessians of small functions and chain-rule a 2-layer MLP, (4) state when gradient descent converges and (5) bound a tail probability with Hoeffding's inequality. USAAIO problems test these as quick MCQ ("what is the rank of this matrix?", "is this loss convex?") and as the derivations underneath the ML/DL pages.

1. Linear algebra

Concept

A vector x ∈ ℝⁿ is an arrow with magnitude and direction; a matrix A ∈ ℝᵐˣⁿ is a linear map taking ℝⁿ → ℝᵐ. The dot product xᵀy = Σ xᵢ yᵢ measures alignment and factors as ‖x‖‖y‖cos θ. Norms quantify size: ‖x‖₁ = Σ|xᵢ| (L1), ‖x‖₂ = √(Σ xᵢ²) (L2), ‖x‖∞ = max|xᵢ|.

The rank of A is the dimension of its column space — the number of linearly independent directions it can output. Eigenvalues satisfy A v = λ v; eigenvectors are directions that don't rotate under A. For a symmetric matrix, eigenvectors are orthogonal and eigenvalues are real — this is what makes PCA work, since the covariance matrix is symmetric.

SVD is the universal factorization: any X ∈ ℝᵐˣⁿ can be written X = U Σ Vᵀ with U, V orthogonal and Σ diagonal with non-negative singular values. SVD underlies low-rank approximation (keep the top-k singular values), PCA (right singular vectors of centered data = principal components), and the pseudo-inverse used in ridge regression.

Worked example — PCA by hand on 5 points

import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0]])

# 1. center
Xc = X - X.mean(axis=0)

# 2. covariance matrix (n=5 samples, so divide by n-1=4)
C = (Xc.T @ Xc) / (Xc.shape[0] - 1)

# 3. eigendecomposition (symmetric -> use eigh)
evals, evecs = np.linalg.eigh(C)
order = np.argsort(evals)[::-1]
evals, evecs = evals[order], evecs[:, order]

# 4. project onto top component
pc1 = evecs[:, 0]
scores = Xc @ pc1
print("eigenvalues:", evals)        # variance per PC
print("PC1 direction:", pc1)        # ~ (0.677, 0.735) — the diagonal
print("variance kept:", evals[0] / evals.sum())

Drills

D1 · Determinant and trace from eigenvalues

A 3×3 matrix has eigenvalues 1, 2, 4. Find det(A) and trace(A).

Solution

det(A) = ∏λᵢ = 1·2·4 = 8. trace(A) = Σλᵢ = 7. These identities hold for any square matrix (over ℂ, counting multiplicity).

D2 · SVD shapes

X is 100×5. Give the shapes of U, Σ, Vᵀ in the thin SVD.

Solution

Thin SVD: U is 100×5, Σ is 5×5 diagonal, Vᵀ is 5×5. Reconstruction: X = U Σ Vᵀ. The "thin" form drops the (100−5) zero singular values.

D3 · Rank of a rank-1 update

Let A = u vᵀ where u ∈ ℝ⁵, v ∈ ℝ³, both non-zero. What is rank(A)?

Solution

rank(A) = 1. Every column of A is a scalar multiple of u, so the column space is the 1-dimensional span of u.

D4 · PCA variance fraction

Covariance eigenvalues are 12, 6, 1.5, 0.5. What fraction of variance is captured by the top two PCs?

Solution

(12 + 6) / (12 + 6 + 1.5 + 0.5) = 18/20 = 0.9090%.

D5 · Orthogonality of symmetric eigenvectors

Show that if A = Aᵀ and A v₁ = λ₁ v₁, A v₂ = λ₂ v₂ with λ₁ ≠ λ₂, then v₁ᵀ v₂ = 0.

Solution

Compute v₁ᵀ A v₂ two ways. From A v₂ = λ₂ v₂: v₁ᵀ A v₂ = λ₂ v₁ᵀ v₂. Using A = Aᵀ: v₁ᵀ A v₂ = (A v₁)ᵀ v₂ = λ₁ v₁ᵀ v₂. So (λ₁ − λ₂) v₁ᵀ v₂ = 0, and since λ₁ ≠ λ₂, v₁ᵀ v₂ = 0.

2. Probability & statistics

Concept

A random variable X maps outcomes to numbers. Expectation E[X] = Σ x P(X=x) (discrete) or ∫ x p(x) dx (continuous) is the long-run average. Variance Var(X) = E[(X − E[X])²] = E[X²] − E[X]² measures spread. Linearity of expectation always holds: E[aX + bY] = a E[X] + b E[Y], even when X and Y are dependent.

Key distributions: Bernoulli(p) has mean p, variance p(1−p); Binomial(n,p) has mean np, variance np(1−p); Gaussian 𝒩(μ, σ²) has density (2πσ²)^(−1/2) exp(−(x−μ)²/(2σ²)); Poisson(λ) has mean and variance both λ.

Bayes' rule: P(A|B) = P(B|A) P(A) / P(B). This is how every probabilistic ML model updates beliefs after seeing data. Maximum-likelihood estimation finds parameters that maximize log P(data | θ); for Gaussian noise that's least squares, for Bernoulli outputs that's logistic regression's cross-entropy.

Hoeffding's inequality: for i.i.d. bounded X₁,…,Xₙ ∈ [a,b] with mean μ, P(|X̄ − μ| ≥ t) ≤ 2 exp(−2 n t² / (b − a)²). This is the generalisation bound underlying "more data → better estimates" — concretely, the sample mean's deviation shrinks like 1/√n.

Worked example — Disease screening with Bayes

Disease prevalence P(D) = 1%. Test sensitivity P(+|D) = 95%, specificity P(−|¬D) = 90% (so P(+|¬D) = 10%). Patient tests positive — what is P(D|+)?

P_D, P_pos_given_D, P_pos_given_notD = 0.01, 0.95, 0.10
P_notD = 1 - P_D
P_pos  = P_pos_given_D * P_D + P_pos_given_notD * P_notD
P_D_given_pos = P_pos_given_D * P_D / P_pos
print(P_D_given_pos)   # 0.0876...  i.e. ~8.8%

The lesson: with a low base rate, even an accurate test produces mostly false positives.

Drills

D1 · Bernoulli MLE

Given i.i.d. observations x₁,…,xₙ ∈ {0,1} from Bernoulli(p), derive the MLE for p.

Solution

Likelihood: L(p) = Π pˣⁱ(1−p)^(1−xᵢ). Log-likelihood: ℓ(p) = (Σxᵢ) log p + (n − Σxᵢ) log(1−p). Set dℓ/dp = 0: (Σxᵢ)/p = (n − Σxᵢ)/(1−p)p̂ = (1/n) Σxᵢ = sample mean.

D2 · Linearity of expectation

Flip a fair coin 10 times. Let X be the number of heads. Compute E[X] and Var(X).

Solution

Write X = Σ Xᵢ with Xᵢ ∼ Bernoulli(0.5). Then E[X] = 10·0.5 = 5; by independence Var(X) = 10·0.25 = 2.5.

D3 · Hoeffding sample size

You estimate a click-through rate (a Bernoulli mean) and want |p̂ − p| ≤ 0.01 with probability ≥ 0.95. How many samples?

Solution

Hoeffding with b−a = 1, t = 0.01: 2 exp(−2n·0.0001) ≤ 0.05n ≥ ln(40)/0.0002 ≈ 18 444. Round up to ~18 500.

D4 · Independent ≠ uncorrelated (in general)

Let X ∼ Uniform(−1,1) and Y = X². Show Cov(X,Y) = 0 but X, Y are dependent.

Solution

E[X] = 0, E[XY] = E[X³] = 0 by symmetry, so Cov(X,Y) = 0. But Y is a deterministic function of X, so they cannot be independent. (For jointly Gaussian variables zero covariance does imply independence; in general it does not.)

D5 · Gaussian MLE for variance

Given i.i.d. x₁,…,xₙ ∼ 𝒩(μ, σ²) with μ known, derive the MLE of σ².

Solution

Log-likelihood (dropping constants): ℓ = −n/2 log σ² − (1/(2σ²)) Σ (xᵢ−μ)². Differentiate w.r.t. σ²: −n/(2σ²) + (1/(2σ⁴)) Σ(xᵢ−μ)² = 0σ̂² = (1/n) Σ(xᵢ−μ)². (This MLE is biased when μ is also estimated; the unbiased version divides by n−1.)

3. Multivariable calculus

Concept

A partial derivative ∂f/∂xᵢ measures the rate of change of f when only xᵢ moves. The gradient ∇f = (∂f/∂x₁, …, ∂f/∂xₙ) is a vector pointing in the direction of steepest ascent; its magnitude is the slope in that direction. Gradient descent is just x ← x − η ∇f(x).

The Jacobian of f: ℝⁿ → ℝᵐ is the m × n matrix of partials Jᵢⱼ = ∂fᵢ/∂xⱼ. The Hessian H of a scalar function is the n × n matrix of second partials; if H ≻ 0 (positive definite) at a critical point, that point is a strict local minimum, and the function is locally convex.

The chain rule for composition z = f(g(x)) is ∂z/∂x = (∂f/∂g)(∂g/∂x). In vector form: J_{f∘g} = J_f · J_g. Backpropagation is nothing more than this chain rule, applied right-to-left through every layer of a neural network. A second-order Taylor expansion at x₀ is f(x₀+δ) ≈ f(x₀) + ∇f(x₀)ᵀ δ + ½ δᵀ H δ; Newton's method uses this to take steps that account for curvature.

Worked example — Backprop on a 2-layer MLP, by hand and by autograd

Network: z = w x + b, h = ReLU(z), ŷ = v h + c, loss L = ½(ŷ − y)². Compute ∂L/∂w.

Chain rule, layer by layer:

∂L/∂ŷ = (ŷ − y); ∂ŷ/∂h = v; ∂h/∂z = 𝟙[z > 0]; ∂z/∂w = x. Therefore ∂L/∂w = (ŷ − y) · v · 𝟙[z > 0] · x.

import torch

w = torch.tensor(0.7, requires_grad=True)
b = torch.tensor(-0.1, requires_grad=True)
v = torch.tensor(1.3, requires_grad=True)
c = torch.tensor(0.2, requires_grad=True)
x = torch.tensor(2.0)
y = torch.tensor(1.5)

z   = w * x + b
h   = torch.relu(z)
y_hat = v * h + c
L   = 0.5 * (y_hat - y) ** 2
L.backward()

# manual:
manual = (y_hat - y).item() * v.item() * (1.0 if z.item() > 0 else 0.0) * x.item()
print(w.grad.item(), manual)    # equal within FP error

Drills

D1 · Gradient and one step

f(x,y) = x² + 4y². Compute ∇f(1,1). With η = 0.1, where do you land after one GD step?

Solution

∇f = (2x, 8y) = (2, 8). Step: (1,1) − 0.1·(2,8) = (0.8, 0.2).

D2 · Hessian and convexity

Is f(x,y) = x² + xy + y² convex?

Solution

Hessian H = [[2,1],[1,2]]. Eigenvalues: 2 ± 1 = 1, 3, both positive. So H ≻ 0 globally and f is strictly convex.

D3 · Chain rule through softmax + cross-entropy

For logits z and one-hot label y, with p = softmax(z) and L = − Σᵢ yᵢ log pᵢ, show ∂L/∂z = p − y.

Solution

∂pᵢ/∂zⱼ = pᵢ(δᵢⱼ − pⱼ). Then ∂L/∂zⱼ = −Σᵢ yᵢ (1/pᵢ)·pᵢ(δᵢⱼ − pⱼ) = −yⱼ + pⱼ Σᵢ yᵢ = pⱼ − yⱼ (using Σyᵢ = 1). This identity is why cross-entropy + softmax has such a clean gradient and is the canonical classifier loss.

D4 · Jacobian of a linear layer

y = Wx + b with W ∈ ℝᵐˣⁿ. What is ∂y/∂x? What about ∂L/∂W for a scalar loss L (express in terms of ∂L/∂y and x)?

Solution

∂y/∂x = W. ∂L/∂W = (∂L/∂y) xᵀ — an outer product of the upstream gradient with the layer's input. This is exactly what PyTorch's autograd computes.

D5 · Taylor expansion of log(1 + x)

Expand log(1+x) to second order around x = 0 and use it to approximate log(1.05).

Solution

log(1+x) ≈ x − x²/2. At x = 0.05: 0.05 − 0.00125 = 0.04875. True value ≈ 0.04879.

4. Convex optimization

Concept

A set is convex if it contains the line segment between any two of its points. A function is convex if its epigraph is convex; equivalently, for any t ∈ [0,1], f(t x + (1−t) y) ≤ t f(x) + (1−t) f(y) (Jensen). A twice-differentiable function is convex iff its Hessian is positive semidefinite everywhere. Squared error, logistic loss, hinge loss, and cross-entropy are all convex in their linear-model weights.

Gradient descent: x_{k+1} = x_k − η ∇f(x_k). For an L-smooth convex f, choosing η ≤ 1/L guarantees monotonic decrease and convergence at rate O(1/k). Momentum adds a velocity term: v ← βv + ∇f, x ← x − ηv; it accelerates along consistent directions and damps oscillation. Adam rescales each coordinate by its running RMS gradient, giving adaptive step sizes.

Constraints via Lagrange duality: to minimise f(x) subject to gᵢ(x) ≤ 0 and hⱼ(x) = 0, form the Lagrangian L(x, λ, ν) = f(x) + Σ λᵢ gᵢ(x) + Σ νⱼ hⱼ(x) with λᵢ ≥ 0. The KKT conditions (stationarity, primal feasibility, dual feasibility, complementary slackness) are necessary at an optimum; for convex problems they are also sufficient. SVMs are derived by writing the margin-maximisation problem and solving its KKT system.

Worked example — Constrained quadratic via Lagrange

Minimize f(x,y) = x² + y² subject to x + y = 1.

Lagrangian: L = x² + y² − λ(x + y − 1). Stationarity: 2x − λ = 0, 2y − λ = 0, so x = y. Constraint: 2x = 1x = y = ½. Minimum value = ½.

import numpy as np

# verify numerically via gradient descent on the unconstrained surrogate
# f(x) + (mu/2)(x+y-1)^2  with increasing penalty mu
x, y = 0.0, 0.0
lr, mu = 0.1, 50.0
for _ in range(500):
    gx = 2*x + mu*(x+y-1)
    gy = 2*y + mu*(x+y-1)
    x -= lr*gx; y -= lr*gy
print(x, y, x+y)     # ~ 0.5, 0.5, 1.0

Drills

D1 · Convexity of logistic loss

Show that ℓ(w) = log(1 + exp(−y wᵀ x)) is convex in w (with y ∈ {−1,+1}, x fixed).

Solution

log(1 + exp(z)) is convex in z (its second derivative is σ(z)(1−σ(z)) ≥ 0). Here z = −y wᵀ x is affine in w. Composition of a convex function with an affine map is convex. Done.

D2 · GD trajectory

Run gradient descent on f(x) = (x − 3)² from x₀ = 0 with η = 0.5. Compute x₁, x₂, x₃.

Solution

f'(x) = 2(x−3). Update x ← x − η·2(x−3) = x − (x−3) = 3. So x₁ = x₂ = x₃ = 3 — one step lands at the optimum because η = 1/L with L = 2.

D3 · Why divergence at large η

For the same f(x) = (x−3)², what is the smallest η that causes divergence?

Solution

Update factor on the deviation x − 3 is (1 − 2η). Divergence when |1 − 2η| > 1, i.e. η > 1. At η = 1 you oscillate around the optimum without converging.

D4 · Adam intuition

In one sentence each, explain why Adam often beats vanilla SGD on transformers but loses to SGD+momentum on ResNets.

Solution

Adam's per-parameter scaling helps when gradients are highly heterogeneous in magnitude (typical of transformer embeddings and layer norms). On convolutional vision nets with relatively uniform gradients and strong weight decay, SGD+momentum reaches a flatter minimum and generalizes better.

D5 · Jensen for the mean

Use Jensen's inequality to show E[X]² ≤ E[X²] for any random variable X with finite second moment.

Solution

g(t) = t² is convex. Jensen: g(E[X]) ≤ E[g(X)]E[X]² ≤ E[X²]. Equivalently, Var(X) ≥ 0.

5. Hoeffding's inequality & generalisation

Concept

Hoeffding's inequality is the cleanest concentration bound on the cards. For independent X₁,…,Xₙ with Xᵢ ∈ [a, b] and sample mean : P(|X̄ − E[X̄]| ≥ t) ≤ 2 exp(−2 n t² / (b − a)²). The exponential decay in n is the formal reason why test-set evaluation is reliable: the empirical accuracy on a held-out set is, with high probability, within O(1/√n) of the true accuracy. It also gives the simplest PAC-learning bound: a hypothesis class with |H| functions can achieve |train − test| ≤ √( (log|H| + log(2/δ)) / (2n) ) with probability 1 − δ.

Worked example — Confidence interval for accuracy

import math
n, acc, delta = 1000, 0.86, 0.05
# Hoeffding with bounded [0,1]:  t = sqrt(log(2/delta)/(2n))
t = math.sqrt(math.log(2/delta) / (2*n))
print(f"true accuracy in [{acc-t:.3f}, {acc+t:.3f}] with prob >= {1-delta}")
# -> roughly [0.817, 0.903]

Drills

D1 · Tighter bound from larger n

Quadruple n. By what factor does the confidence half-width t shrink?

Solution

t ∝ 1/√n; quadrupling n halves t.

D2 · Why "independent" matters

If your test set is constructed by oversampling each correct prediction, why does Hoeffding fail?

Solution

The samples are no longer independent of the model's behavior, and they are no longer i.i.d. with respect to the true data distribution. The bound assumes both. The empirical accuracy in this rigged sample is biased upward and concentrates around the wrong value.

D3 · Hoeffding vs. CLT

One sentence: how does Hoeffding differ from the Central Limit Theorem in what it guarantees?

Solution

CLT is asymptotic ("for large n, the sample mean is approximately Gaussian"). Hoeffding is a finite-sample, distribution-free upper bound on tail probability with explicit constants — usable for any n.

Checkpoint — answer out loud

Next step

Once the math feels mechanical, go install the toolchain on the Python data stack page and start translating the formulas in this section into runnable code. For short-answer practice, the Round 2 theory drills include softmax Jacobians, PCA-via-SVD, Hoeffding bounds, and a worked MLE derivation.