Round 2 theory drills

Round 2 of the USAAIO mixes coding with short-answer reasoning: compute a gradient, justify a design choice, derive a tail bound, count parameters. This page is a drill bank — thirty short-answer problems with full collapsible solutions, organized by syllabus block. Try each one on paper before you open the answer.

How to use this page. Set a timer for 4–6 minutes per problem. Write the answer on paper, including intermediate steps. Then open the solution and check both the final number and the derivation. The hardest part of Round 2 short-answers is partial credit — graders want to see the chain rule applied, not just an unjustified scalar.

30 problems across 6 blocks. All answerable with paper and pencil; no problem requires running code.

Linear algebra & calculus (6 problems)

Q1 · Backprop through softmax

Let s = softmax(z) where z ∈ R^K, so s_i = exp(z_i) / Σ_j exp(z_j). Derive the Jacobian ∂s_i / ∂z_j.

Solution

Differentiate s_i = exp(z_i) / Z with Z = Σ_k exp(z_k). The numerator depends on z_j only when i = j; the denominator always depends on z_j.

Case i = j. Quotient rule gives ∂s_i/∂z_i = (exp(z_i)·Z − exp(z_i)·exp(z_i)) / Z² = s_i − s_i² = s_i(1 − s_i).

Case i ≠ j. The numerator's derivative is zero, so ∂s_i/∂z_j = −exp(z_i)·exp(z_j)/Z² = −s_i s_j.

Compact form: J = diag(s) − s s^T. This matrix is symmetric, rank K−1, and singular (the all-ones vector lies in its null space, which expresses that shifting all logits by a constant does not change the softmax).

Q2 · Gradient of softmax cross-entropy

For one-hot target y and logits z, the loss is L = −Σ_i y_i log s_i with s = softmax(z). Show that ∂L/∂z = s − y.

Solution

Write L = −Σ_i y_i (z_i − log Z) where Z = Σ_k exp(z_k), using log s_i = z_i − log Z.

Differentiate with respect to z_j: the first term gives −y_j. For the second, ∂(log Z)/∂z_j = exp(z_j)/Z = s_j, so ∂(−Σ_i y_i (−log Z))/∂z_j = (Σ_i y_i) s_j.

Since y is one-hot, Σ_i y_i = 1, giving ∂L/∂z_j = s_j − y_j, i.e. ∂L/∂z = s − y.

This is why softmax + cross-entropy is a numerically pleasant pair: the gradient is just the prediction error, with no explicit divide-by-softmax that would blow up when s_i is tiny.

Q3 · Jacobian of an affine layer

For y = Wx + b with W ∈ R^{m×n}, give the Jacobian ∂y/∂x and ∂y/∂W (the second as a fourth-order tensor with explicit entries).

Solution

w.r.t. x. ∂y_i/∂x_j = W_{ij}, so ∂y/∂x = W. Shape m × n.

w.r.t. W. y_i = Σ_k W_{ik} x_k + b_i, so ∂y_i/∂W_{jk} = δ_{ij} x_k (Kronecker delta on i and j times x_k). As a tensor of shape m × m × n, only the diagonal slices in the first two indices are nonzero, and each such slice equals the row vector x^T.

In practice you never materialize this tensor. If L is a downstream scalar with upstream gradient g = ∂L/∂y ∈ R^m, then ∂L/∂W = g x^T (outer product) and ∂L/∂x = W^T g. These two rules cover every fully-connected layer in a backward pass.

Q4 · PCA via SVD

Given centered data X ∈ R^{n×d} with SVD X = U Σ V^T, identify the principal components, the variance explained by component i, and the rank-k approximation that minimizes Frobenius reconstruction error.

Solution

The sample covariance is C = (1/(n−1)) X^T X = (1/(n−1)) V Σ² V^T. So the columns of V are eigenvectors of C — these are the principal component directions in feature space.

The eigenvalue associated with component i is λ_i = σ_i² / (n−1), so the variance explained by component i is σ_i² / Σ_j σ_j².

By the Eckart–Young theorem, the best rank-k approximation in Frobenius norm is X_k = U_k Σ_k V_k^T, keeping the top-k singular values. The projection of the data onto the first k components is X V_k, shape n × k.

The Frobenius error of the rank-k approximation is √(Σ_{i>k} σ_i²), so plotting cumulative σ_i² against k (the scree plot) tells you where to truncate.

Q5 · Eigenvalues of a structured matrix

Let A = I_n + α 11^T where 1 ∈ R^n is the all-ones vector and α ∈ R. Find all eigenvalues and eigenvectors.

Solution

11^T is a rank-1 matrix. Its only nonzero eigenvalue is its trace, n, with eigenvector 1 (since (11^T) 1 = n · 1). Its null space is the (n−1)-dimensional subspace orthogonal to 1, all with eigenvalue 0.

Adding I_n shifts every eigenvalue by 1. So:

  • Eigenvalue 1 + α n with eigenvector 1 (the constant direction).
  • Eigenvalue 1 with multiplicity n−1, eigenvectors any basis of {v : 1^T v = 0}.

Determinant = product of eigenvalues = 1 + α n. Trace = n(1) + α n = n + α n, consistent with tr(A) = tr(I) + α tr(11^T) = n + α n. The matrix is invertible iff α ≠ −1/n.

Q6 · Chain rule through tanh-then-linear

Let h = tanh(Wx + b) and y = Vh. Compute ∂y/∂x in terms of W, V, and h.

Solution

Let z = Wx + b, so h = tanh(z). Then by the chain rule, ∂h/∂z = diag(tanh′(z)) = diag(1 − h²) (elementwise, using the identity tanh′(z) = 1 − tanh²(z)).

And ∂z/∂x = W, ∂y/∂h = V.

Composing: ∂y/∂x = V · diag(1 − h²) · W.

This is a useful object: the spectral norm of V diag(1 − h²) W tells you how much a perturbation of input x can grow at output y. Since |1 − h_i²| ≤ 1, the tanh layer is contractive in its derivative, which is exactly why deep tanh nets suffer vanishing gradients — products of these Jacobians shrink geometrically.

Probability & statistics (6 problems)

Q7 · Bayes update on a diagnostic test

A disease has prevalence 1%. A test has sensitivity 99% and specificity 95%. A randomly drawn person tests positive. What is the posterior probability they have the disease?

Solution

Let D = has disease, T = positive test. Given: P(D) = 0.01, P(T|D) = 0.99, P(T|¬D) = 1 − 0.95 = 0.05.

By Bayes: P(D|T) = P(T|D) P(D) / P(T) with P(T) = P(T|D)P(D) + P(T|¬D)P(¬D) = 0.99·0.01 + 0.05·0.99 = 0.0099 + 0.0495 = 0.0594.

So P(D|T) = 0.0099 / 0.0594 ≈ 0.1667 = 16.67%.

Even with a 99%-sensitive test, the low prevalence keeps the posterior modest — the false-positive mass 0.0495 dwarfs the true-positive mass 0.0099. The takeaway: prior matters as much as likelihood. This is also why a "highly accurate" model on a rare-event task can still be useless without recalibration.

Q8 · Hoeffding's inequality applied

You estimate a coin's bias p by averaging n i.i.d. flips. How many flips guarantee |p̂ − p| ≤ 0.01 with probability at least 0.99 by Hoeffding's bound?

Solution

Hoeffding for [0,1]-valued i.i.d. X_i: P(|X̅ − μ| ≥ ε) ≤ 2 exp(−2 n ε²).

We want 2 exp(−2 n ε²) ≤ δ with ε = 0.01 and δ = 0.01.

Solve: n ≥ ln(2/δ) / (2 ε²) = ln(200) / (2 · 10^{−4}). ln 200 ≈ 5.298, so n ≥ 5.298 / 0.0002 ≈ 26492.

Round up to roughly 26500 flips. Notice the bound is distribution-free — it doesn't use the variance of a Bernoulli (which is p(1−p) ≤ 0.25), so it is loose near p=0.5 compared to a CLT-based normal interval. The exponent −2nε² is what gives the inequality its sub-Gaussian flavor.

Q9 · Expectation and variance of common distributions

State E[X] and Var(X) for: (a) Bernoulli(p), (b) Binomial(n, p), (c) Poisson(λ), (d) Uniform(a,b), (e) Exponential(λ). Then derive (b) from (a).

Solution

(a) Bernoulli: E = p, Var = p(1−p). (b) Binomial: E = np, Var = np(1−p). (c) Poisson: E = λ, Var = λ. (d) Uniform(a,b): E = (a+b)/2, Var = (b−a)²/12. (e) Exponential(λ): E = 1/λ, Var = 1/λ².

Derivation of (b). Write X = Σ_{i=1..n} X_i where X_i are i.i.d. Bernoulli(p). By linearity, E[X] = Σ E[X_i] = np. By independence, variances add: Var(X) = Σ Var(X_i) = n p(1−p).

The fact that variance grows linearly in n (so standard deviation grows like √n) is the same √n that drives the CLT and the O(1/√n) rate of Monte Carlo estimators.

Q10 · MLE for a Gaussian

Given i.i.d. samples x_1, …, x_n from N(μ, σ²), derive the maximum-likelihood estimators of μ and σ².

Solution

Log-likelihood: ℓ(μ, σ²) = −(n/2) log(2πσ²) − (1/(2σ²)) Σ(x_i − μ)².

∂ℓ/∂μ = 0: (1/σ²) Σ(x_i − μ) = 0μ̂ = (1/n) Σ x_i (the sample mean).

∂ℓ/∂σ² = 0: −n/(2σ²) + (1/(2σ⁴)) Σ(x_i − μ̂)² = 0σ̂² = (1/n) Σ(x_i − μ̂)².

Note: the MLE of variance uses the divisor n, which makes it biased downward — E[σ̂²] = ((n−1)/n) σ². The unbiased estimator divides by n−1 (Bessel's correction). USAAIO graders care about whether you state which estimator you used.

Q11 · MAP vs MLE

For a Bernoulli with parameter θ and Beta(α, β) prior, derive the MAP estimator given k successes in n trials. Show what happens as n → ∞.

Solution

The posterior is Beta(α + k, β + n − k) (Beta–Bernoulli conjugacy). Its mode (for shape parameters > 1) is

θ̂_MAP = (α + k − 1) / (α + β + n − 2).

The MLE is θ̂_MLE = k/n.

As n → ∞, both numerator and denominator are dominated by the data: (α + k − 1)/(α + β + n − 2) → k/n. The prior washes out asymptotically. For small n the prior smooths the estimate — choosing α = β = 1 gives a uniform prior and MAP equals MLE; α = β = 2 is Laplace smoothing (add-one), which prevents the pathological estimate θ = 0 after observing zero successes.

Q12 · Central limit theorem application

A model's per-example loss has mean 0.5 and variance 0.25. You evaluate on a test set of n=10000 examples. Approximate the probability that the empirical mean loss exceeds 0.505.

Solution

By the CLT, the sample mean L̅_n is approximately N(μ, σ²/n) = N(0.5, 0.25/10000) = N(0.5, 2.5×10^{−5}), so its standard deviation is √(2.5×10^{−5}) = 0.005.

We want P(L̅_n > 0.505) = P(Z > (0.505 − 0.5)/0.005) = P(Z > 1).

From the standard normal table, P(Z > 1) ≈ 0.1587, so about 16%.

The lesson: a 0.005 difference in mean loss between two models on a 10k test set sits at exactly one standard error — it is not statistically significant. To distinguish 0.500 from 0.505 with high confidence (Z=3), you would need n = (3 · 0.5/0.005)² = 90000 examples. Test-set sizing should always start from the variance, not gut feel.

Classical ML (5 problems)

Q13 · Bias–variance decomposition

Write the expected squared error of a regression estimator f̂(x) as a sum of three terms, and explain which term each model knob (capacity, training-set size, noise floor) controls.

Solution

For a fixed test point x with target y = f(x) + ε, E[ε] = 0, Var(ε) = σ²:

E[(y − f̂(x))²] = (E[f̂(x)] − f(x))² + Var(f̂(x)) + σ².

The expectation is over both training-set randomness and target noise. The three terms are bias² (systematic error from model class being wrong), variance (sensitivity of the fitted model to the training sample), and irreducible noise.

Capacity (model complexity) trades bias for variance: more capacity → lower bias, higher variance. Training-set size shrinks variance roughly as 1/n for parametric models. Irreducible noise is a property of the data-generating process and cannot be reduced by any estimator — the only way to beat it is to add features that explain the previously-noise variance.

Q14 · k-fold CV expected error

Why does k-fold cross-validation produce a slightly pessimistic estimate of test error? When does this bias matter most?

Solution

In k-fold CV, each fold's model is trained on n(k−1)/k examples — fewer than the full n. Since most learners improve with more data, each fold's model is slightly weaker than the final model trained on all n, so its held-out error is slightly higher than the eventual test error. The CV estimate is therefore biased upward (pessimistic).

The bias is small when k is large (fold sizes approach n) — e.g., LOOCV with k=n is nearly unbiased but high-variance. With small k (k=2 or k=3) and small n, the gap can be a few percentage points.

This matters most when (a) the dataset is small enough that learning curves are still steep, (b) you are choosing between models whose true gap is comparable to the CV bias, or (c) you report CV error as the final headline number rather than holding out a separate test set.

Q15 · Why the kernel trick works

State the kernel trick. Why does it allow algorithms like SVM to operate in an infinite-dimensional feature space without ever explicitly computing the feature map?

Solution

The kernel trick: if an algorithm's only dependence on inputs x_i is through inner products <φ(x_i), φ(x_j)> for some feature map φ, then replacing every such inner product with a kernel function K(x_i, x_j) — defined directly without ever evaluating φ — runs the same algorithm in feature space.

This works because Mercer's theorem guarantees that any symmetric positive-semidefinite kernel K corresponds to an inner product in some (possibly infinite-dimensional) Hilbert space. The Gaussian RBF kernel K(x, x′) = exp(−γ ‖x − x′‖²) corresponds to an infinite-dimensional feature space (its Taylor expansion has infinitely many polynomial terms), but evaluating K costs O(d).

Algorithms that admit this — SVM (dual), kernel ridge regression, kernel PCA, Gaussian processes — share the structural property that the optimal solution lies in the span of the training points (the representer theorem), so only pairwise kernel evaluations are needed.

Q16 · Information gain in a decision tree

A node has 14 examples: 9 positive, 5 negative. A proposed split partitions them into a left child (6 pos, 2 neg) and right child (3 pos, 3 neg). Compute the information gain in bits.

Solution

Entropy: H(p) = −p log₂ p − (1−p) log₂ (1−p).

Parent. p = 9/14 ≈ 0.6429. H = −0.6429 log₂ 0.6429 − 0.3571 log₂ 0.3571 ≈ 0.6429·0.6374 + 0.3571·1.4854 ≈ 0.4098 + 0.5305 = 0.9403 bits.

Left (8 examples, 6 pos): p = 0.75. H = −0.75 log₂ 0.75 − 0.25 log₂ 0.25 = 0.75·0.4150 + 0.25·2 = 0.3113 + 0.5 = 0.8113 bits.

Right (6 examples, 3 pos): p = 0.5. H = 1 bit (maximum entropy).

Weighted child entropy: (8/14)·0.8113 + (6/14)·1 = 0.4636 + 0.4286 = 0.8922 bits.

Information gain = 0.9403 − 0.8922 = 0.0481 bits. A small but positive gain — the left child is purer than the parent, but the right child is more uniform.

Q17 · SVM dual formulation

Write the dual of the hard-margin SVM. Explain (a) why the optimal weight vector lies in the span of training inputs, (b) what "support vectors" means, and (c) why the dual is preferred when the kernel trick is used.

Solution

Hard-margin primal: min (1/2)‖w‖² s.t. y_i(w·x_i + b) ≥ 1.

Lagrangian leads to the dual: max_α Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j (x_i·x_j) s.t. α_i ≥ 0 and Σ_i α_i y_i = 0.

(a) Stationarity of the Lagrangian in w gives w = Σ_i α_i y_i x_i, a linear combination of training points (the representer theorem in its simplest form).

(b) KKT complementary slackness says α_i = 0 unless the constraint is active (y_i(w·x_i + b) = 1). Points with α_i > 0 sit on the margin — these are the support vectors. They alone determine w.

(c) The dual depends on inputs only through inner products x_i · x_j, so replacing these with a kernel K(x_i, x_j) lifts the classifier into a high- (or infinite-) dimensional space without ever writing down the feature map.

Deep learning (6 problems)

Q18 · Receptive field of stacked convolutions

A network has three sequential 3×3 convolutional layers, each with stride 1 and no dilation. A fourth layer is a 3×3 conv with stride 2. What is the receptive field at the output (in input pixels)?

Solution

Receptive field recurrence: RF_{l} = RF_{l−1} + (k_l − 1) · S_{l−1}, where S_{l−1} is the cumulative stride before layer l. Start with RF_0 = 1, S_0 = 1.

Layer 1 (3×3, stride 1): RF_1 = 1 + (3−1)·1 = 3. Cumulative stride = 1.

Layer 2 (3×3, stride 1): RF_2 = 3 + 2·1 = 5. Cumulative stride = 1.

Layer 3 (3×3, stride 1): RF_3 = 5 + 2·1 = 7. Cumulative stride = 1.

Layer 4 (3×3, stride 2): RF_4 = 7 + 2·1 = 9. Cumulative stride becomes 2 after this layer.

Receptive field = 9×9 input pixels. The stride-2 layer's effect on RF is the same as a stride-1 here because the multiplier is the cumulative stride from earlier layers, which was still 1. The stride-2 only enlarges RF growth for any subsequent layers.

Q19 · BatchNorm at train vs inference

Describe exactly what BatchNorm computes at training time and at inference time, and why the two differ. What goes wrong if you forget to switch the model into inference mode?

Solution

Training. For each mini-batch and each feature, BN computes the batch mean μ_B and variance σ_B², normalizes x̂ = (x − μ_B)/√(σ_B² + ε), then applies a learned affine y = γ x̂ + β. It also updates running estimates of the population mean/variance via an EMA.

Inference. No batch statistics are computed. Instead BN uses the stored running mean and variance (fixed parameters at inference). This makes the function deterministic and independent of batch composition.

Why they differ. At train time the network learns to expect batch-normalized features. At test time you want a deterministic mapping from a single input to a single output — using a one-example "batch" statistic would be catastrophically noisy. The running stats are an unbiased(ish) approximation of the population distribution accumulated during training.

Forgetting to switch modes. BN keeps using the current batch's statistics, so predictions become dependent on what else is in the batch — accuracy fluctuates wildly with batch size, single-example inference is hosed, and dropout layers stay active too. The classic symptom is "model is great in training, terrible at inference." In PyTorch the fix is calling the model's inference-mode method before forward passes at test time.

Q20 · Dropout expected value

With inverted dropout at probability p of dropping a unit, what scaling is applied to surviving activations during training, and why is no scaling needed at inference?

Solution

Let mask M_i ∼ Bernoulli(1−p) (so M_i = 1 with probability 1−p, meaning the unit survives). Inverted dropout outputs y_i = (M_i / (1−p)) x_i at training time.

Expected value: E[y_i] = E[M_i]/(1−p) · x_i = (1−p)/(1−p) · x_i = x_i.

So in expectation, dropout passes the activation through unchanged. At inference, dropout is turned off — outputs are simply x_i. Since the training-time scaling already corrects the expectation, no additional scaling at inference is required.

(Compare to the older "vanilla" dropout, which used y_i = M_i x_i at train time and multiplied by (1−p) at inference. Inverted dropout shifts the multiplication into the train path so that the inference forward pass is identical to a non-dropout network, which is friendlier for deployment.)

Q21 · Attention complexity

For a self-attention layer with sequence length n and head dimension d, derive the time and memory complexity. Why is the term often the bottleneck for long sequences?

Solution

Computation: Q, K, V ∈ R^{n×d} (one head). Scores S = QK^T ∈ R^{n×n} costs O(n² d). Softmax over rows of S is O(n²). Multiplying S V is O(n² d).

Time: O(n² d). Memory: O(n²) to materialize the attention matrix, plus O(nd) for Q/K/V/output.

For short sequences (e.g., n=512, d=64), the n² d term is comparable to the nd² cost of the projection matmuls. But once n > d, the scaling dominates, and once n is in the tens of thousands the attention matrix alone consumes gigabytes of memory. This motivates linear-attention variants, FlashAttention's IO-aware streaming computation, and sparse-attention patterns like sliding-window or block-local — all of which trade exact softmax for sub-quadratic compute.

Q22 · Why residual connections help

Give two distinct reasons (one optimization-side, one representation-side) that residual connections enable training of very deep networks.

Solution

Optimization (gradient flow). The backward pass through y = x + F(x) has Jacobian I + ∂F/∂x. The identity term guarantees that gradients are always at least the upstream gradient, so they cannot vanish merely by passing through more blocks. Without the skip, the gradient is the product of every ∂F_l/∂x, which is geometric in depth and easily vanishes (or explodes). Residuals turn this product into a sum.

Representation (identity initialization). A fresh residual block with small initial weights computes y ≈ x — it starts as the identity function and the optimizer only has to learn the delta. So adding a block strictly cannot hurt the function class (you can always set F = 0) and gives the network a strong "do no harm" prior. The contrast: a fresh plain block computes a near-random function, so deeper plain networks must un-learn random transformations layer by layer before they can become useful.

Q23 · Diagnosing dead ReLU

A ReLU MLP trains for a few epochs and then loss plateaus far above the achievable minimum. You observe that a large fraction of hidden units output zero on every input in the batch. Explain the mechanism and give two mitigations.

Solution

Mechanism. ReLU's derivative is 1 for positive inputs and 0 for non-positive inputs. If a large gradient update pushes a unit's weights so that w·x + b < 0 for every x in the training distribution, the unit outputs zero everywhere — and crucially, its gradient is also zero everywhere. So the unit never receives an update again. It is permanently "dead." Once enough hidden units die, the network has effectively collapsed to a smaller subnetwork and loses capacity.

Mitigations.

  • Lower the learning rate (or use warmup) so the catastrophic update that pushed the unit into the dead region never happens. Also use He / Kaiming initialization, which is variance-scaled for ReLU specifically.
  • Switch to LeakyReLU or GELU/Swish. LeakyReLU(x) = max(αx, x) with α ≈ 0.01 has a small but nonzero gradient on the negative side, so dead units can recover. GELU and Swish are smooth everywhere.

A diagnostic to confirm: log the fraction of zero activations per layer. If it grows monotonically during training and exceeds ~40% you almost certainly have the dead-ReLU pathology rather than a normal sparsity pattern.

Transformers & attention (4 problems)

Q24 · Why divide attention scores by √d_k?

In scaled dot-product attention, scores are QK^T / √d_k before the softmax. Derive why the √d_k divisor is necessary, starting from the variance of the dot product of two random vectors.

Solution

Assume q, k ∈ R^{d_k} have i.i.d. entries with mean 0 and variance 1. Then their dot product q·k = Σ_{i=1}^{d_k} q_i k_i is a sum of d_k independent zero-mean random variables, each with variance Var(q_i k_i) = Var(q_i) Var(k_i) = 1 (assuming independence).

By the variance-of-sum rule: Var(q·k) = d_k, so its standard deviation is √d_k.

Without scaling, scores at large d_k have magnitudes on the order of √d_k — for d_k=64 they easily reach ±8. The softmax of large-magnitude logits is extremely peaked, with one entry near 1 and the rest near 0. In that regime, softmax's gradient is essentially zero — backpropagation grinds to a halt.

Dividing by √d_k normalizes the variance of the scores back to 1 regardless of d_k, keeping the softmax in a regime with meaningful gradients. Without this scaling, the transformer would fail to train as d_k grows.

Q25 · Why positional encoding?

Self-attention is order-equivariant: permuting the input tokens permutes the output tokens identically, with no other change. Prove this, and explain why positional encodings are therefore essential for sequence tasks.

Solution

Proof of equivariance. Let X ∈ R^{n×d} be the token matrix and P a permutation matrix. Self-attention computes Attn(X) = softmax((XW_Q)(XW_K)^T / √d_k) (XW_V). Substituting PX:

Attn(PX) = softmax((PX W_Q)(PX W_K)^T / √d_k) (PX W_V) = softmax(P (XW_Q)(XW_K)^T P^T / √d_k) P(XW_V).

Softmax is applied row-wise; permuting rows then softmaxing is the same as softmaxing then permuting rows. The P^T on the right permutes columns identically. Both effects compose to P · Attn(X). So attention is permutation-equivariant.

Why this is a problem. Word order carries meaning ("dog bites man" ≠ "man bites dog"). A model that gives the same answer regardless of order cannot learn syntax. Positional encodings — additive sinusoids or learned embeddings, or rotational ones like RoPE — inject distinguishable per-position signals into the token embeddings before attention, breaking the symmetry and letting the model condition on order.

Q26 · Multi-head computational cost

A transformer has h heads, each with key/query/value dimension d_k = d/h. Show that the total compute of multi-head attention is the same as a single-head attention with dimension d.

Solution

Single head, dim d. QK^T costs O(n² d); multiplying by V costs O(n² d). Total: O(n² d).

Multi-head, h heads each of dim d_k = d/h. Each head costs O(n² d_k) = O(n² d/h). There are h heads, so total cost is h · O(n² d/h) = O(n² d).

Identical scaling. The Q/K/V projections also have the same total cost: a single head projects n×d → n×d for each of Q, K, V at cost O(n d²) apiece; h heads each project at cost O(n d · d/h), summing to O(n d²).

What multi-head buys you is not compute reduction — it is representational diversity. Each head can attend to a different subspace of relationships (syntactic dependency, coreference, positional adjacency, etc.) without giving up the global compute budget. The output is concatenated and re-projected, letting the model mix evidence from all heads.

Q27 · KV-cache memory

A transformer has L = 32 layers, h = 32 heads, head dim d_k = 128, batch size b = 1, sequence length n = 8192, and fp16 storage (2 bytes). Compute the memory required for the KV cache.

Solution

Per layer the cache stores K and V tensors of shape (b, h, n, d_k). That's two tensors of b · h · n · d_k elements each.

Total elements: L · 2 · b · h · n · d_k = 32 · 2 · 1 · 32 · 8192 · 128.

Compute step-by-step: 32 · 32 = 1024 heads × layers. 1024 · 8192 = 8388608. 8388608 · 128 = 1073741824 elements per K or V across all layers. Multiply by 2 (K and V): 2147483648 elements.

At 2 bytes each (fp16): 4294967296 bytes = 4 GB.

This is per request at batch 1. Serving 8 concurrent users with 8k context each would need 32 GB just for the cache — which is why production inference systems use paged attention, quantized caches (int8 or int4), or grouped-query attention (sharing K/V across head groups to cut this by 4–8×).

Generative & vision (3 problems)

Q28 · VAE reparameterization trick

A VAE encoder outputs μ(x) and σ(x), and samples z ∼ N(μ, σ²) to feed the decoder. Explain why naively sampling z prevents backpropagation, and how the reparameterization trick fixes it.

Solution

Why naive sampling breaks backprop. Sampling is a non-differentiable operation: there is no useful derivative of "a random draw from a distribution" with respect to the distribution's parameters. Treating z as a sampled value cuts the gradient path from the decoder's loss back through z to μ and σ, so the encoder receives no learning signal about how to shape the latent distribution.

The trick. Rewrite the sample as a deterministic function of the parameters and an external noise source:

z = μ(x) + σ(x) ⊙ ε, where ε ∼ N(0, I).

The randomness is now in ε, which is independent of the parameters. The mapping from μ, σ to z is differentiable. Backpropagation flows through z: ∂z/∂μ = 1, ∂z/∂σ = ε. The encoder gets gradients with respect to μ and σ, while the stochasticity that defines z is preserved.

This trick works for any distribution whose sample admits a location-scale or invertible-CDF reparameterization (Gaussian, Laplace, uniform), but does not work for discrete distributions — Gumbel-softmax is the discrete analogue.

Q29 · DDPM forward-process variance

DDPM defines q(x_t | x_{t−1}) = N(√(1−β_t) x_{t−1}, β_t I). Derive the marginal q(x_t | x_0) and give the closed-form expression for direct sampling at any timestep t.

Solution

Let α_t = 1 − β_t and α̅_t = ∏_{s=1}^t α_s. We want to express x_t in terms of x_0 and a single Gaussian draw.

One step: x_t = √α_t x_{t−1} + √β_t ε_t, ε_t ∼ N(0, I). Recursing:

x_t = √α_t (√α_{t−1} x_{t−2} + √β_{t−1} ε_{t−1}) + √β_t ε_t = √(α_t α_{t−1}) x_{t−2} + (√(α_t β_{t−1}) ε_{t−1} + √β_t ε_t).

The bracketed noise is a sum of independent zero-mean Gaussians, so it is Gaussian with variance α_t β_{t−1} + β_t = α_t(1 − α_{t−1}) + (1 − α_t) = 1 − α_t α_{t−1}.

By induction, after t steps:

q(x_t | x_0) = N(√α̅_t · x_0, (1 − α̅_t) I).

So we can sample x_t = √α̅_t x_0 + √(1 − α̅_t) ε with a single ε ∼ N(0, I) — no chained simulation needed. This is the key efficiency win: DDPM training samples a random t per example and computes x_t in one shot.

Q30 · U-Net skip connections

A U-Net concatenates encoder feature maps with decoder feature maps at matching resolutions. Without these skips, what specifically degrades, and why is concatenation preferred to summation here?

Solution

What degrades without skips. The encoder downsamples, accumulating semantic context but discarding spatial precision — a feature map at 1/16 resolution knows "there is a boundary somewhere in this 16×16 patch" but cannot localize it to a pixel. The decoder must upsample back to full resolution to produce a dense prediction (e.g., segmentation mask, denoised image). Without skips, the decoder has only the heavily-downsampled bottleneck features to work from, so outputs are blurry and edges are misaligned. Skip connections feed high-resolution encoder activations directly to the matching decoder stage, restoring spatial detail that was lost in pooling.

Why concatenation, not summation. Summation forces the encoder and decoder features to share the same channel count and the same semantic meaning per channel — the network must learn to align them. Concatenation imposes no such constraint: the decoder receives a wider tensor with separate channel groups, and a subsequent 1×1 or 3×3 conv learns how to mix encoder and decoder evidence with independent per-channel weights. This is strictly more expressive — summation is a special case where the mixing weights are forced to be uniform. ResNets sum because the skip is across a same-shape residual block; U-Nets concatenate because encoder and decoder branches encode different information at the same resolution.

Next step

Once these feel automatic, run a full timed sitting on the mocks page, then loop back to whichever stratum (math, ML, DL, transformers) the missed problems came from.