USAAIO 2025 Round 1 · Problem 5 · Transformer attention & a tiny encoder [reconstructed]
Contest: 2025 USA-NA-AIO Round 1 · Round: Round 1 (online) · Category: Transformers / attention mechanism.
Official sources: usaaio.org/past-problems · 2025 Round 1 forum.
1. Problem restatement
The problem progresses from the math of scaled dot-product attention through a hand-built single-head attention layer, on to multi-head attention and finally a 2-layer encoder applied to a small text-classification dataset. Sub-parts ask:
- Given Q, K, V (small matrices), compute the attention output by hand.
- Explain the role of the 1/√d_k scaling factor.
- Implement single-head attention in NumPy.
- Extend to multi-head attention in PyTorch.
- Build a 2-layer encoder with residuals + LayerNorm.
- Train on a tiny corpus and report accuracy + attention-map plots.
2. What's being tested
- Attention as a matrix recipe. Softmax-of-(QK^T/√d) · V — drilled with hand computation and code.
- LayerNorm vs BatchNorm. Why transformers use the former.
- Positional encoding. Sinusoidal vs learned, why you need one.
- Residuals. What "pre-norm vs post-norm" means and which is more stable.
3. Data exploration / setup
import torch, torch.nn as nn, torch.nn.functional as F
# tiny synthetic classification dataset
torch.manual_seed(0)
vocab = 1000; seq_len = 24; nc = 5
X = torch.randint(0, vocab, (2000, seq_len))
y = torch.randint(0, nc, (2000,))
4. Baseline approach
Single-head attention in NumPy, by hand:
import numpy as np
def attention(Q, K, V):
d_k = Q.shape[-1]
scores = (Q @ K.T) / np.sqrt(d_k) # (Lq, Lk)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
return weights @ V # (Lq, d_v)
Q = np.array([[1, 0, 0], [0, 1, 0]], dtype=float)
K = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
V = np.array([[1, 2], [3, 4], [5, 6]], dtype=float)
print(attention(Q, K, V))
Then in PyTorch with multi-head:
class MHA(nn.Module):
def __init__(self, d, h):
super().__init__()
self.h, self.d = h, d // h
self.qkv = nn.Linear(d, 3 * d)
self.proj = nn.Linear(d, d)
def forward(self, x):
B, L, _ = x.shape
q, k, v = self.qkv(x).chunk(3, dim=-1)
q, k, v = (t.view(B, L, self.h, self.d).transpose(1, 2) for t in (q, k, v))
att = (q @ k.transpose(-2, -1)) / self.d ** 0.5
att = att.softmax(-1)
out = (att @ v).transpose(1, 2).contiguous().view(B, L, -1)
return self.proj(out)
5. Improvements that move the needle
5.1 · Pre-norm rather than post-norm
Apply LayerNorm before each sublayer rather than after. Trains more stably without learning-rate warmup at the small scale Round-1 problems use.
5.2 · Sinusoidal positional encoding done from scratch
def pos_enc(L, d):
pe = torch.zeros(L, d)
pos = torch.arange(L).unsqueeze(1)
div = torch.exp(torch.arange(0, d, 2) * -(np.log(10000.0) / d))
pe[:, 0::2] = torch.sin(pos * div)
pe[:, 1::2] = torch.cos(pos * div)
return pe
5.3 · Plot the attention weights
After training, take a batch and visualise the attention matrix from layer 2, head 0 as a heatmap. Graders reward visualisations that confirm the model "attends to" the parts you'd expect.
5.4 · Sanity-check on a copy task before the real data
Train your encoder on a "copy" task (output the input sequence). If it converges to near-zero loss, your attention + residual plumbing is correct. If not, fix that before touching the real task.
5.5 · Print parameter counts and per-layer activations
A short paragraph reporting head dim, layer count, FFN expansion ratio, parameter count, and a histogram of attention entropies is the kind of writeup that earns the analysis points.
6. Submission format & gotchas
- Hand-computation parts must show intermediate scores, softmax values, and final weighted sum.
- Code parts: paste the function plus a small test that prints expected output for graders to eyeball.
- If you use
nn.MultiheadAttentionfrom PyTorch in a later part, double-check the "batch_first" flag — wrong-axis bugs are silent. - Visualisations attached as inline images on the forum post.
7. What top solutions did
[reconstructed — verify against published solutions] Top P5 submissions on similar problems: hand-computations done with explicit intermediate matrices; clean from-scratch MHA + pre-norm encoder; copy-task sanity check; attention-map plots; a short paragraph linking the 1/√d scaling to gradient stability. Parameter counts kept under 100k — there's no need for a BERT-sized model.
8. Drill
D · Why divide by √d_k in scaled dot-product attention?
When Q and K are vectors of dimension d_k with entries drawn i.i.d. with variance 1, the dot
product q · k has variance d_k — so its standard deviation grows like √d_k. Without
scaling, softmax over large scores saturates: one position gets ~1 weight, others ~0, gradients
vanish. Dividing by √d_k normalises the variance back to 1 and keeps the softmax in its useful
regime. Try it: implement attention without the scaling, check that for d=64 training stalls
in a few hundred steps.
D2 · Why is LayerNorm preferred over BatchNorm in transformers?
LayerNorm normalises across feature dimensions for each token independently — sequence-length changes don't break it, and there's no batch-statistics machinery. BatchNorm normalises across the batch, which (a) breaks on tiny batches, (b) needs running-mean bookkeeping at inference, and (c) couples tokens in a batch that should be independent. For autoregressive decoders it also leaks information across positions. LayerNorm dodges all three issues.