The math you actually need
Four areas: linear algebra, probability & statistics, multivariable calculus, convex optimization. You don't need a full undergraduate sequence — you need the parts that make gradient descent, PCA, and the bias / variance tradeoff legible.
1. Linear algebra
Concept
A vector x ∈ ℝⁿ is an arrow with magnitude and direction; a matrix A ∈ ℝᵐˣⁿ is a
linear map taking ℝⁿ → ℝᵐ. The dot product xᵀy = Σ xᵢ yᵢ measures alignment and
factors as ‖x‖‖y‖cos θ. Norms quantify size: ‖x‖₁ = Σ|xᵢ| (L1),
‖x‖₂ = √(Σ xᵢ²) (L2), ‖x‖∞ = max|xᵢ|.
The rank of A is the dimension of its column space — the number of linearly
independent directions it can output. Eigenvalues satisfy A v = λ v;
eigenvectors are directions that don't rotate under A. For a symmetric matrix,
eigenvectors are orthogonal and eigenvalues are real — this is what makes PCA work, since the covariance
matrix is symmetric.
SVD is the universal factorization: any X ∈ ℝᵐˣⁿ can be written
X = U Σ Vᵀ with U, V orthogonal and Σ diagonal with non-negative
singular values. SVD underlies low-rank approximation (keep the top-k singular values), PCA (right
singular vectors of centered data = principal components), and the pseudo-inverse used in ridge
regression.
Worked example — PCA by hand on 5 points
import numpy as np
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
[1.9, 2.2], [3.1, 3.0]])
# 1. center
Xc = X - X.mean(axis=0)
# 2. covariance matrix (n=5 samples, so divide by n-1=4)
C = (Xc.T @ Xc) / (Xc.shape[0] - 1)
# 3. eigendecomposition (symmetric -> use eigh)
evals, evecs = np.linalg.eigh(C)
order = np.argsort(evals)[::-1]
evals, evecs = evals[order], evecs[:, order]
# 4. project onto top component
pc1 = evecs[:, 0]
scores = Xc @ pc1
print("eigenvalues:", evals) # variance per PC
print("PC1 direction:", pc1) # ~ (0.677, 0.735) — the diagonal
print("variance kept:", evals[0] / evals.sum())
Drills
D1 · Determinant and trace from eigenvalues
A 3×3 matrix has eigenvalues 1, 2, 4. Find det(A) and trace(A).
Solution
det(A) = ∏λᵢ = 1·2·4 = 8. trace(A) = Σλᵢ = 7. These identities hold for any
square matrix (over ℂ, counting multiplicity).
D2 · SVD shapes
X is 100×5. Give the shapes of U, Σ, Vᵀ in the thin SVD.
Solution
Thin SVD: U is 100×5, Σ is 5×5 diagonal, Vᵀ is 5×5.
Reconstruction: X = U Σ Vᵀ. The "thin" form drops the (100−5) zero singular values.
D3 · Rank of a rank-1 update
Let A = u vᵀ where u ∈ ℝ⁵, v ∈ ℝ³, both non-zero. What is rank(A)?
Solution
rank(A) = 1. Every column of A is a scalar multiple of u, so
the column space is the 1-dimensional span of u.
D4 · PCA variance fraction
Covariance eigenvalues are 12, 6, 1.5, 0.5. What fraction of variance is captured by the top two PCs?
Solution
(12 + 6) / (12 + 6 + 1.5 + 0.5) = 18/20 = 0.90 → 90%.
D5 · Orthogonality of symmetric eigenvectors
Show that if A = Aᵀ and A v₁ = λ₁ v₁, A v₂ = λ₂ v₂ with
λ₁ ≠ λ₂, then v₁ᵀ v₂ = 0.
Solution
Compute v₁ᵀ A v₂ two ways. From A v₂ = λ₂ v₂: v₁ᵀ A v₂ = λ₂ v₁ᵀ v₂.
Using A = Aᵀ: v₁ᵀ A v₂ = (A v₁)ᵀ v₂ = λ₁ v₁ᵀ v₂. So
(λ₁ − λ₂) v₁ᵀ v₂ = 0, and since λ₁ ≠ λ₂, v₁ᵀ v₂ = 0.
2. Probability & statistics
Concept
A random variable X maps outcomes to numbers. Expectation
E[X] = Σ x P(X=x) (discrete) or ∫ x p(x) dx (continuous) is the long-run average.
Variance Var(X) = E[(X − E[X])²] = E[X²] − E[X]² measures spread.
Linearity of expectation always holds: E[aX + bY] = a E[X] + b E[Y], even when X and Y are
dependent.
Key distributions: Bernoulli(p) has mean p, variance p(1−p);
Binomial(n,p) has mean np, variance np(1−p); Gaussian
𝒩(μ, σ²) has density (2πσ²)^(−1/2) exp(−(x−μ)²/(2σ²)); Poisson(λ) has mean and
variance both λ.
Bayes' rule: P(A|B) = P(B|A) P(A) / P(B). This is how every probabilistic
ML model updates beliefs after seeing data. Maximum-likelihood estimation finds parameters that maximize
log P(data | θ); for Gaussian noise that's least squares, for Bernoulli outputs that's
logistic regression's cross-entropy.
Hoeffding's inequality: for i.i.d. bounded X₁,…,Xₙ ∈ [a,b] with mean
μ,
P(|X̄ − μ| ≥ t) ≤ 2 exp(−2 n t² / (b − a)²). This is the generalisation bound underlying
"more data → better estimates" — concretely, the sample mean's deviation shrinks like 1/√n.
Worked example — Disease screening with Bayes
Disease prevalence P(D) = 1%. Test sensitivity P(+|D) = 95%, specificity
P(−|¬D) = 90% (so P(+|¬D) = 10%). Patient tests positive — what is P(D|+)?
P_D, P_pos_given_D, P_pos_given_notD = 0.01, 0.95, 0.10
P_notD = 1 - P_D
P_pos = P_pos_given_D * P_D + P_pos_given_notD * P_notD
P_D_given_pos = P_pos_given_D * P_D / P_pos
print(P_D_given_pos) # 0.0876... i.e. ~8.8%
The lesson: with a low base rate, even an accurate test produces mostly false positives.
Drills
D1 · Bernoulli MLE
Given i.i.d. observations x₁,…,xₙ ∈ {0,1} from Bernoulli(p), derive the MLE for p.
Solution
Likelihood: L(p) = Π pˣⁱ(1−p)^(1−xᵢ). Log-likelihood:
ℓ(p) = (Σxᵢ) log p + (n − Σxᵢ) log(1−p). Set dℓ/dp = 0:
(Σxᵢ)/p = (n − Σxᵢ)/(1−p) → p̂ = (1/n) Σxᵢ = sample mean.
D2 · Linearity of expectation
Flip a fair coin 10 times. Let X be the number of heads. Compute E[X] and Var(X).
Solution
Write X = Σ Xᵢ with Xᵢ ∼ Bernoulli(0.5). Then
E[X] = 10·0.5 = 5; by independence Var(X) = 10·0.25 = 2.5.
D3 · Hoeffding sample size
You estimate a click-through rate (a Bernoulli mean) and want |p̂ − p| ≤ 0.01 with
probability ≥ 0.95. How many samples?
Solution
Hoeffding with b−a = 1, t = 0.01:
2 exp(−2n·0.0001) ≤ 0.05 → n ≥ ln(40)/0.0002 ≈ 18 444. Round up to ~18 500.
D4 · Independent ≠ uncorrelated (in general)
Let X ∼ Uniform(−1,1) and Y = X². Show Cov(X,Y) = 0 but X, Y are dependent.
Solution
E[X] = 0, E[XY] = E[X³] = 0 by symmetry, so Cov(X,Y) = 0. But Y
is a deterministic function of X, so they cannot be independent. (For jointly Gaussian variables zero
covariance does imply independence; in general it does not.)
D5 · Gaussian MLE for variance
Given i.i.d. x₁,…,xₙ ∼ 𝒩(μ, σ²) with μ known, derive the MLE of σ².
Solution
Log-likelihood (dropping constants): ℓ = −n/2 log σ² − (1/(2σ²)) Σ (xᵢ−μ)². Differentiate
w.r.t. σ²: −n/(2σ²) + (1/(2σ⁴)) Σ(xᵢ−μ)² = 0 → σ̂² = (1/n) Σ(xᵢ−μ)².
(This MLE is biased when μ is also estimated; the unbiased version divides by n−1.)
3. Multivariable calculus
Concept
A partial derivative ∂f/∂xᵢ measures the rate of change of f
when only xᵢ moves. The gradient ∇f = (∂f/∂x₁, …, ∂f/∂xₙ) is a
vector pointing in the direction of steepest ascent; its magnitude is the slope in that direction.
Gradient descent is just x ← x − η ∇f(x).
The Jacobian of f: ℝⁿ → ℝᵐ is the m × n matrix of partials
Jᵢⱼ = ∂fᵢ/∂xⱼ. The Hessian H of a scalar function is the
n × n matrix of second partials; if H ≻ 0 (positive definite) at a critical
point, that point is a strict local minimum, and the function is locally convex.
The chain rule for composition z = f(g(x)) is
∂z/∂x = (∂f/∂g)(∂g/∂x). In vector form: J_{f∘g} = J_f · J_g. Backpropagation
is nothing more than this chain rule, applied right-to-left through every layer of a neural network.
A second-order Taylor expansion at x₀ is
f(x₀+δ) ≈ f(x₀) + ∇f(x₀)ᵀ δ + ½ δᵀ H δ; Newton's method uses this to take steps that account
for curvature.
Worked example — Backprop on a 2-layer MLP, by hand and by autograd
Network: z = w x + b, h = ReLU(z), ŷ = v h + c, loss L = ½(ŷ − y)².
Compute ∂L/∂w.
Chain rule, layer by layer:
∂L/∂ŷ = (ŷ − y); ∂ŷ/∂h = v;
∂h/∂z = 𝟙[z > 0]; ∂z/∂w = x.
Therefore ∂L/∂w = (ŷ − y) · v · 𝟙[z > 0] · x.
import torch
w = torch.tensor(0.7, requires_grad=True)
b = torch.tensor(-0.1, requires_grad=True)
v = torch.tensor(1.3, requires_grad=True)
c = torch.tensor(0.2, requires_grad=True)
x = torch.tensor(2.0)
y = torch.tensor(1.5)
z = w * x + b
h = torch.relu(z)
y_hat = v * h + c
L = 0.5 * (y_hat - y) ** 2
L.backward()
# manual:
manual = (y_hat - y).item() * v.item() * (1.0 if z.item() > 0 else 0.0) * x.item()
print(w.grad.item(), manual) # equal within FP error
Drills
D1 · Gradient and one step
f(x,y) = x² + 4y². Compute ∇f(1,1). With η = 0.1, where do you land after one GD step?
Solution
∇f = (2x, 8y) = (2, 8). Step: (1,1) − 0.1·(2,8) = (0.8, 0.2).
D2 · Hessian and convexity
Is f(x,y) = x² + xy + y² convex?
Solution
Hessian H = [[2,1],[1,2]]. Eigenvalues: 2 ± 1 = 1, 3, both positive. So
H ≻ 0 globally and f is strictly convex.
D3 · Chain rule through softmax + cross-entropy
For logits z and one-hot label y, with p = softmax(z) and
L = − Σᵢ yᵢ log pᵢ, show ∂L/∂z = p − y.
Solution
∂pᵢ/∂zⱼ = pᵢ(δᵢⱼ − pⱼ). Then
∂L/∂zⱼ = −Σᵢ yᵢ (1/pᵢ)·pᵢ(δᵢⱼ − pⱼ) = −yⱼ + pⱼ Σᵢ yᵢ = pⱼ − yⱼ
(using Σyᵢ = 1). This identity is why cross-entropy + softmax has such a clean gradient
and is the canonical classifier loss.
D4 · Jacobian of a linear layer
y = Wx + b with W ∈ ℝᵐˣⁿ. What is ∂y/∂x? What about ∂L/∂W
for a scalar loss L (express in terms of ∂L/∂y and x)?
Solution
∂y/∂x = W. ∂L/∂W = (∂L/∂y) xᵀ — an outer product of the upstream gradient
with the layer's input. This is exactly what PyTorch's autograd computes.
D5 · Taylor expansion of log(1 + x)
Expand log(1+x) to second order around x = 0 and use it to approximate
log(1.05).
Solution
log(1+x) ≈ x − x²/2. At x = 0.05: 0.05 − 0.00125 = 0.04875.
True value ≈ 0.04879.
4. Convex optimization
Concept
A set is convex if it contains the line segment between any two of its points. A function
is convex if its epigraph is convex; equivalently, for any t ∈ [0,1],
f(t x + (1−t) y) ≤ t f(x) + (1−t) f(y) (Jensen). A twice-differentiable function is convex
iff its Hessian is positive semidefinite everywhere. Squared error, logistic loss, hinge loss, and
cross-entropy are all convex in their linear-model weights.
Gradient descent: x_{k+1} = x_k − η ∇f(x_k). For an L-smooth
convex f, choosing η ≤ 1/L guarantees monotonic decrease and convergence at
rate O(1/k). Momentum adds a velocity term:
v ← βv + ∇f, x ← x − ηv; it accelerates along consistent directions and damps
oscillation. Adam rescales each coordinate by its running RMS gradient, giving adaptive
step sizes.
Constraints via Lagrange duality: to minimise f(x) subject to
gᵢ(x) ≤ 0 and hⱼ(x) = 0, form the Lagrangian
L(x, λ, ν) = f(x) + Σ λᵢ gᵢ(x) + Σ νⱼ hⱼ(x) with λᵢ ≥ 0. The
KKT conditions (stationarity, primal feasibility, dual feasibility, complementary
slackness) are necessary at an optimum; for convex problems they are also sufficient. SVMs are derived
by writing the margin-maximisation problem and solving its KKT system.
Worked example — Constrained quadratic via Lagrange
Minimize f(x,y) = x² + y² subject to x + y = 1.
Lagrangian: L = x² + y² − λ(x + y − 1).
Stationarity: 2x − λ = 0, 2y − λ = 0, so x = y. Constraint:
2x = 1 → x = y = ½. Minimum value = ½.
import numpy as np
# verify numerically via gradient descent on the unconstrained surrogate
# f(x) + (mu/2)(x+y-1)^2 with increasing penalty mu
x, y = 0.0, 0.0
lr, mu = 0.1, 50.0
for _ in range(500):
gx = 2*x + mu*(x+y-1)
gy = 2*y + mu*(x+y-1)
x -= lr*gx; y -= lr*gy
print(x, y, x+y) # ~ 0.5, 0.5, 1.0
Drills
D1 · Convexity of logistic loss
Show that ℓ(w) = log(1 + exp(−y wᵀ x)) is convex in w (with y ∈ {−1,+1}, x fixed).
Solution
log(1 + exp(z)) is convex in z (its second derivative is
σ(z)(1−σ(z)) ≥ 0). Here z = −y wᵀ x is affine in w. Composition
of a convex function with an affine map is convex. Done.
D2 · GD trajectory
Run gradient descent on f(x) = (x − 3)² from x₀ = 0 with η = 0.5.
Compute x₁, x₂, x₃.
Solution
f'(x) = 2(x−3). Update x ← x − η·2(x−3) = x − (x−3) = 3. So
x₁ = x₂ = x₃ = 3 — one step lands at the optimum because η = 1/L with
L = 2.
D3 · Why divergence at large η
For the same f(x) = (x−3)², what is the smallest η that causes divergence?
Solution
Update factor on the deviation x − 3 is (1 − 2η). Divergence when
|1 − 2η| > 1, i.e. η > 1. At η = 1 you oscillate around the
optimum without converging.
D4 · Adam intuition
In one sentence each, explain why Adam often beats vanilla SGD on transformers but loses to SGD+momentum on ResNets.
Solution
Adam's per-parameter scaling helps when gradients are highly heterogeneous in magnitude (typical of transformer embeddings and layer norms). On convolutional vision nets with relatively uniform gradients and strong weight decay, SGD+momentum reaches a flatter minimum and generalizes better.
D5 · Jensen for the mean
Use Jensen's inequality to show E[X]² ≤ E[X²] for any random variable X with finite second moment.
Solution
g(t) = t² is convex. Jensen: g(E[X]) ≤ E[g(X)] → E[X]² ≤ E[X²].
Equivalently, Var(X) ≥ 0.
5. Hoeffding's inequality & generalisation
Concept
Hoeffding's inequality is the cleanest concentration bound on the cards. For independent
X₁,…,Xₙ with Xᵢ ∈ [a, b] and sample mean X̄:
P(|X̄ − E[X̄]| ≥ t) ≤ 2 exp(−2 n t² / (b − a)²). The exponential decay in n is
the formal reason why test-set evaluation is reliable: the empirical accuracy on a held-out set is, with
high probability, within O(1/√n) of the true accuracy. It also gives the simplest
PAC-learning bound: a hypothesis class with |H| functions can achieve
|train − test| ≤ √( (log|H| + log(2/δ)) / (2n) ) with probability 1 − δ.
Worked example — Confidence interval for accuracy
import math
n, acc, delta = 1000, 0.86, 0.05
# Hoeffding with bounded [0,1]: t = sqrt(log(2/delta)/(2n))
t = math.sqrt(math.log(2/delta) / (2*n))
print(f"true accuracy in [{acc-t:.3f}, {acc+t:.3f}] with prob >= {1-delta}")
# -> roughly [0.817, 0.903]
Drills
D1 · Tighter bound from larger n
Quadruple n. By what factor does the confidence half-width t shrink?
Solution
t ∝ 1/√n; quadrupling n halves t.
D2 · Why "independent" matters
If your test set is constructed by oversampling each correct prediction, why does Hoeffding fail?
Solution
The samples are no longer independent of the model's behavior, and they are no longer i.i.d. with respect to the true data distribution. The bound assumes both. The empirical accuracy in this rigged sample is biased upward and concentrates around the wrong value.
D3 · Hoeffding vs. CLT
One sentence: how does Hoeffding differ from the Central Limit Theorem in what it guarantees?
Solution
CLT is asymptotic ("for large n, the sample mean is approximately Gaussian"). Hoeffding is a
finite-sample, distribution-free upper bound on tail probability with explicit constants — usable for
any n.
Checkpoint — answer out loud
- Can you compute eigenvalues of a 2×2 matrix and state what they tell you about a linear map?
- Can you write down Bayes' rule and plug in numbers without looking?
- Can you derive the gradient of MSE loss for linear regression and verify it with autograd?
- Can you state when gradient descent converges and what happens if
ηis too large? - Given a held-out accuracy of 0.84 on 500 samples, can you give a 95% Hoeffding confidence interval?
Next step
Once the math feels mechanical, go install the toolchain on the Python data stack page and start translating the formulas in this section into runnable code. For short-answer practice, the Round 2 theory drills include softmax Jacobians, PCA-via-SVD, Hoeffding bounds, and a worked MLE derivation.