Deep learning with PyTorch

Tensors, autograd, a multi-layer perceptron from scratch, the standard layer toolkit, training loops, regularization, optimizers — every piece you need to go from raw data to a trained neural network.

PyTorch only. The syllabus specifies PyTorch. Build muscle memory in this API and don't waste cycles on TensorFlow / JAX.
Syllabus link. Mapped to the DL block of the official USAAIO syllabus.
TL;DR. Five non-negotiables: (1) tensors, autograd, and the shape/dtype/device trio; (2) building an MLP and CNN with nn.Module and reading off the parameter count; (3) the canonical training-loop skeleton (zero_grad -> forward -> loss -> backward -> step); (4) picking the right loss + activation + optimizer for the task; (5) diagnosing misbehaving losses (NaN, oscillation, plateau). USAAIO DL questions are usually "small network on a small dataset, report metric + a bug fix", so fluency in the loop wins.

1. Tensors & autograd

Concept

A PyTorch Tensor is an n-dimensional array with three attributes you always check: .shape, .dtype, .device. Operations on tensors with requires_grad=True are recorded into a dynamic computation graph; calling .backward() on a scalar walks that graph in reverse, accumulating gradients into each leaf's .grad. This is just the chain rule from the math page, automated. Gradients accumulate — you must call optimizer.zero_grad() (or model.zero_grad(set_to_none=True)) at the start of every training step.

Worked example — verify a hand derivative with autograd

import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 3 + 2 * x          # dy/dx = 3x^2 + 2 = 14
y.backward()
print(x.grad.item())        # 14.0

# Vector example: gradient of L = ||Wx||^2 w.r.t. W
W = torch.randn(3, 4, requires_grad=True)
xv = torch.randn(4)
L = (W @ xv).pow(2).sum()
L.backward()
# Manual: dL/dW = 2 (Wx) x^T
manual = 2 * (W.detach() @ xv).unsqueeze(1) * xv.unsqueeze(0)
print(torch.allclose(W.grad, manual))    # True

Drills

D1 · Shape on a linear layer

Input batch shape (B, 768); layer nn.Linear(768, 10). What is the output shape and how many parameters does the layer have?

Solution

Output (B, 10). Parameters: weight 768*10 = 7680 + bias 10 = 7690.
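
The count above can be checked directly. A minimal sketch, assuming an arbitrary batch size of B = 4 (the variable names are illustrative, not from the syllabus):

```python
import torch
import torch.nn as nn

layer = nn.Linear(768, 10)
xb = torch.randn(4, 768)                 # B = 4 is an arbitrary choice
out = layer(xb)

# weight is stored as (out_features, in_features) = (10, 768), bias as (10,)
n_params = sum(p.numel() for p in layer.parameters())
print(tuple(out.shape), out.dtype, out.device)
print(n_params)                          # 7690
```

Printing shape, dtype and device together is the habit worth building: most PyTorch errors are a mismatch in one of the three.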

D2 · Why zero_grad?

You forget opt.zero_grad(). What happens after step 3?

Solution

Gradients from steps 1-3 sum into each parameter's .grad, so the update at step 3 uses g1 + g2 + g3 instead of g3 alone. The stale gradients inflate the step and pull it away from the current descent direction; the loss typically oscillates or diverges.
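
To see the accumulation in isolation, here is a small sketch on a single scalar parameter. It uses no optimizer; setting w.grad = None by hand stands in for what zero_grad does for each parameter.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

# Three backward passes with no zeroing in between: d(3w)/dw = 3 each time.
for _ in range(3):
    (3 * w).backward()
accumulated = w.grad.item()      # 3 + 3 + 3

# Same three passes, clearing the gradient before each one.
for _ in range(3):
    w.grad = None                # per-parameter equivalent of zero_grad(set_to_none=True)
    (3 * w).backward()
fresh = w.grad.item()            # only the most recent pass

print(accumulated, fresh)        # 9.0 3.0
```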

D3 · Detach vs no_grad

Difference between x.detach() and the with torch.no_grad(): context?

Solution

x.detach() returns a tensor that shares memory with x but is cut from the graph, so it has requires_grad=False and nothing computed from it backpropagates into x. torch.no_grad() is a context manager that disables graph construction for everything inside it. Use no_grad for evaluation loops; use detach when you need to treat a learned value as a constant (add .clone() if you also need an independent copy, since in-place edits to the detached tensor change x too).
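
A short sketch of both behaviours; the variable names are illustrative only:

```python
import torch

x = torch.ones(3, requires_grad=True)

# detach: same storage, but results computed from it are not tracked
d = x.detach()
shares_memory = d.data_ptr() == x.data_ptr()
detached_tracks = (d * 2).requires_grad

# no_grad: tracking is off only while inside the block
with torch.no_grad():
    inside_tracks = (x * 2).requires_grad
outside_tracks = (x * 2).requires_grad

print(shares_memory, detached_tracks, inside_tracks, outside_tracks)
# True False False True
```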

D4 · Device juggling

You do x = x.to("cuda") but the model still lives on CPU. What error do you get?

Solution

"Expected all tensors to be on the same device". Move both model and data — typically model.to(device) once and xb = xb.to(device) inside the loop.

2. Building an MLP

Concept

A multilayer perceptron is a stack of Linear -> activation blocks ending in a task-specific head. nn.Module defines the architecture via __init__ and the forward pass via forward; parameters are registered automatically. ReLU (max(0, x)) is the default activation — cheap, non-saturating for x > 0. GELU is its smooth cousin used in transformers. Tanh and sigmoid saturate and are mostly reserved for output heads with bounded ranges.

Worked example — MLP for tabular classification

import torch, torch.nn as nn

class MLP(nn.Module):
    def __init__(self, in_dim, hidden, out_dim, p_drop=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, out_dim),
        )
    def forward(self, x):
        return self.net(x)            # raw logits

model = MLP(20, 64, 3)
print(model)
print(sum(p.numel() for p in model.parameters()), "parameters")

Drills

D1 · Dead ReLU

What is a "dead ReLU" and one way to avoid it?

Solution

A neuron whose pre-activation is negative for every training input: its output and gradient are always 0, so it never updates. Typical cause: a too-high LR pushes the weights or bias into a regime where every input lands on the flat side of the ReLU. Mitigations: lower LR, use LeakyReLU/GELU, careful init (Kaiming/He), or BatchNorm.

D2 · Parameter count

Compute the parameter count of MLP(in_dim=20, hidden=64, out_dim=3) with three linear layers (20->64, 64->64, 64->3).

Solution

(20*64+64) + (64*64+64) + (64*3+3) = 1344 + 4160 + 195 = 5699 parameters.

D3 · Sigmoid + BCELoss vs BCEWithLogitsLoss

Why is BCEWithLogitsLoss numerically preferred?

Solution

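It computes the loss directly from the logit using the log-sum-exp trick, so it never evaluates log(sigma(x)) explicitly. In float32, sigma(x) underflows to 0 for very negative x (and rounds to 1 for very positive x), so an explicit log(sigma(x)) or log(1 - sigma(x)) becomes -inf.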
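
The difference shows up with one extreme logit. This is a sketch under an assumed input of -200 with target 1; the naive path is written by hand rather than through nn.BCELoss, so the underflow is visible.

```python
import torch
import torch.nn.functional as F

logit = torch.tensor([-200.0])
target = torch.tensor([1.0])

# Naive path: sigmoid underflows to exactly 0 in float32, so -log(p) is inf.
p = torch.sigmoid(logit)
naive = -torch.log(p)

# Fused path: works on the logit directly and stays finite.
fused = F.binary_cross_entropy_with_logits(logit, target)

print(p.item(), naive.item(), fused.item())   # 0.0 inf 200.0
```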

3. Losses, activations & output heads

Concept

Match the head, activation, and loss to the task:

Remember: CrossEntropyLoss takes raw logits and integer class indices, not one-hot vectors. Returning probabilities from your model and applying log-CE manually is a classic foot-gun.
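
A minimal sketch of the rule and the foot-gun, using made-up logits for a 3-class problem. It checks that CrossEntropyLoss equals log_softmax followed by nll_loss, and shows what happens when probabilities are passed in by mistake.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative logits: each row is confidently correct for its target class.
logits = torch.tensor([[4.0, 0.0, 0.0],
                       [0.0, 4.0, 0.0],
                       [0.0, 0.0, 4.0]])
targets = torch.tensor([0, 1, 2])          # int64 class indices, shape (batch,)

loss_fn = nn.CrossEntropyLoss()
ce = loss_fn(logits, targets)
manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)

# Foot-gun: feeding probabilities means softmax is applied twice.
double = loss_fn(F.softmax(logits, dim=1), targets)

print(torch.allclose(ce, manual), ce.item() < double.item())   # True True
```

The double-softmax loss is much larger even though every prediction is correct. Applying softmax twice flattens the distribution and weakens the training signal.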

4. Optimizers & learning-rate schedules

Concept

SGD + momentum: v <- beta*v + grad(L), theta <- theta - eta*v. Classical, robust, used with cosine LR schedules in vision. Adam / AdamW keep per-parameter running estimates of the first moment (mean, m_hat) and second moment (uncentred variance, v_hat) of the gradients and scale each update as eta * m_hat / (sqrt(v_hat) + eps). AdamW decouples weight decay from the gradient update; prefer it over Adam whenever you use weight decay.
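
The momentum rule can be checked against torch.optim.SGD by hand. This is a sketch on an assumed toy loss, L = sum(theta^2), with illustrative values eta = 0.1 and beta = 0.9:

```python
import torch

theta = torch.tensor([1.0, -2.0], requires_grad=True)
opt = torch.optim.SGD([theta], lr=0.1, momentum=0.9)

manual = theta.detach().clone()     # hand-updated copy of the parameters
v = torch.zeros_like(manual)        # momentum buffer

for _ in range(3):
    opt.zero_grad()
    (theta ** 2).sum().backward()
    g = 2 * manual                  # analytic gradient of sum(theta^2)
    v = 0.9 * v + g                 # v <- beta*v + grad(L)
    manual = manual - 0.1 * v       # theta <- theta - eta*v
    opt.step()

print(torch.allclose(theta.detach(), manual))   # True
```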

LR schedules matter as much as the optimizer. Cosine annealing decays from eta_max to ~0 over training. OneCycle warms up then cools down and is very forgiving for tabular MLPs. Linear warmup + cosine decay is the transformer default.
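
Linear warmup + cosine decay has no single dedicated class in this sketch; one common way to build it is LambdaLR with a hand-written multiplier. The function warmup_cosine and the step counts (10 warmup steps, 100 total) are assumptions for illustration.

```python
import math
import torch
from torch.optim.lr_scheduler import LambdaLR

def warmup_cosine(step, warmup_steps=10, total_steps=100):
    """Multiplier on the base LR: linear ramp to 1, then cosine decay toward 0."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

params = [torch.zeros(1, requires_grad=True)]   # stand-in for model.parameters()
opt = torch.optim.AdamW(params, lr=3e-4)
sched = LambdaLR(opt, lr_lambda=warmup_cosine)

lrs = []
for step in range(100):
    lrs.append(sched.get_last_lr()[0])   # LR used at this step
    opt.step()                           # no gradients here; the parameter is skipped
    sched.step()                         # this schedule is stepped per batch, not per epoch

print(f"{lrs[0]:.1e} {max(lrs):.1e} {lrs[-1]:.1e}")
```

Plotting lrs before a long run is a cheap way to confirm the schedule does what you intended.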

Worked example — AdamW + cosine schedule

import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

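# model, loss_fn and train_dl are assumed to exist already (see the training-loop skeleton in section 7)
opt   = AdamW(model.parameters(), lr=3e-4, weight_decay=1e-2)
sched = CosineAnnealingLR(opt, T_max=20)   # one cosine half-period over 20 epochs

for epoch in range(20):
    for xb, yb in train_dl:
        opt.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        opt.step()
    lr_used = sched.get_last_lr()[0]       # LR that was in effect during this epoch
    sched.step()                           # step the schedule once per epoch, after the optimizer
    print(f"epoch {epoch}  lr {lr_used:.2e}  last-batch loss {loss.item():.4f}")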

Drills

D1 · Pick LR

Loss explodes to NaN at iteration 30. What's your first knob?

Solution

Lower the learning rate (try 10x smaller). Other quick checks: add gradient clipping (nn.utils.clip_grad_norm_), rule out log(0) by replacing sigmoid + BCELoss with BCEWithLogitsLoss, and lower the init scale.
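
A sketch of what gradient clipping does, using a deliberately inflated loss on a toy nn.Linear (both are assumptions for illustration). clip_grad_norm_ returns the total gradient norm measured before clipping and rescales the gradients in place.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)
loss = (model(torch.randn(16, 10)) * 1e3).pow(2).mean()   # artificially huge loss
loss.backward()

# Returns the norm before clipping; gradients are rescaled in place.
norm_before = nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))

print(norm_before.item() > 1.0, norm_after.item() <= 1.0 + 1e-4)   # True True
```

Logging the returned norm every few steps is a cheap early warning: a steadily growing value often precedes a NaN.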

D2 · AdamW vs Adam

One sentence: what does "decoupled" weight decay mean in AdamW?

Solution

The L2 shrinkage theta <- theta - eta*lambda*theta is applied directly to weights rather than added into the gradient, so the adaptive rescaling by v_hat does not weaken the regularisation.
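
The difference is easiest to see when the loss gradient is exactly zero, so weight decay is the only thing moving the weight. A sketch with assumed values lr = 0.1 and weight_decay = 0.1; the helper one_step is hypothetical.

```python
import torch

def one_step(opt_cls):
    """Take a single optimizer step on theta = 1 with a zero loss gradient."""
    theta = torch.tensor([1.0], requires_grad=True)
    opt = opt_cls([theta], lr=0.1, weight_decay=0.1)
    theta.grad = torch.zeros_like(theta)   # pretend the loss gradient is zero
    opt.step()
    return theta.item()

adam_theta = one_step(torch.optim.Adam)    # decay is added to the gradient, then rescaled
adamw_theta = one_step(torch.optim.AdamW)  # decay applied directly: theta * (1 - lr * wd)

print(round(adam_theta, 4), round(adamw_theta, 4))   # 0.9 0.99
```

With Adam the decay term passes through the adaptive normalisation, so the first step has size lr whatever lambda is. With AdamW the shrinkage is exactly eta*lambda*theta.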

5. Regularization & normalization

Concept

Tools, ordered by how often you'll need them: weight decay (set on the optimizer); dropout (zero each unit with probability p during training and scale the survivors by 1/(1-p), which nn.Dropout does automatically); batchnorm (normalize each feature across the batch; accelerates training of CNNs); layernorm (normalize within each sample across the feature dim; standard in transformers); early stopping (monitor val loss; stop when it stops improving); data augmentation (flips, crops, colour jitter for vision; masking and back-translation for NLP). Finally, a smaller model is the most effective regulariser when you are still overfitting and out of patience.
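
The dropout rule in the list above can be observed directly. A minimal sketch with p = 0.5 on a vector of ones (the sizes are arbitrary):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(10_000)

drop.train()
y_train = drop(x)       # each entry is either 0 or 1/(1-p) = 2
drop.eval()
y_eval = drop(x)        # identity at inference time

print(sorted(y_train.unique().tolist()))   # [0.0, 2.0]
print(torch.equal(y_eval, x))              # True
```

The mean of y_train stays close to 1, which is the point of the 1/(1-p) scaling: the expected activation is the same in training and inference.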

Drills

D1 · BatchNorm in inference mode

You forget to call model.eval() (the toggle that turns dropout off and makes BatchNorm use its running statistics) and your val accuracy is awful. Why?

Solution

In train mode BatchNorm normalizes with the current batch's statistics (degenerate on a single-sample validation batch) and keeps updating its running averages; in inference mode it uses the stored running averages. Dropout likewise stays active in train mode. Call model.train() before each training epoch and model.eval() before validation or testing.
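
A sketch of the two BatchNorm behaviours on an assumed toy batch (64 samples, 4 features, mean about 5 and std about 3). After a single training-mode pass the running statistics have barely moved from their initial values, so the same input normalizes very differently in the two modes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(4)
x = torch.randn(64, 4) * 3 + 5

bn.train()
out_train = bn(x)      # uses this batch's mean/var and updates the running stats
bn.eval()
out_eval = bn(x)       # uses the running stats (one momentum=0.1 update from 0 and 1)

print(out_train.mean(0))   # close to 0 for every feature
print(out_eval.mean(0))    # far from 0: running stats have not converged yet
```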

D2 · Dropout p choice

What's a sensible starting p for an MLP on small tabular data? For a transformer encoder?

Solution

MLP on small tabular: 0.2-0.5. Transformer encoder: 0.1 is the canonical starting point. Tune via val loss.

6. CNNs

Concept

Convolutions exploit three priors that make them dominant on grid-structured data: local connectivity (each kernel sees a small spatial neighborhood), weight sharing (the same kernel slides across the whole image), and spatial pooling (downsampling builds larger receptive fields and translation tolerance). A typical block is Conv2d -> BatchNorm -> ReLU -> MaxPool, repeated, then flatten into a classifier head. For dense prediction tasks like segmentation, the U-Net architecture pairs a downsampling encoder with an upsampling decoder, with skip connections that re-inject high-resolution features.

Worked example — small CNN on CIFAR-shape inputs

import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.feat = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2), # 32x16x16
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(), nn.MaxPool2d(2), # 64x8x8
            nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(), nn.AdaptiveAvgPool2d(1), # 128x1x1
        )
        self.head = nn.Sequential(nn.Flatten(), nn.Dropout(0.3), nn.Linear(128, n_classes))
    def forward(self, x):
        return self.head(self.feat(x))

Drills

D1 · Output size of a Conv2d

Input (B, 3, 32, 32), layer nn.Conv2d(3, 16, kernel_size=3, padding=1, stride=2). Output shape?

Solution

Spatial: floor((32 + 2*1 - 3)/2) + 1 = 16. Output: (B, 16, 16, 16).
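
The same formula as a small helper, checked against an actual layer. The function name conv_out is illustrative; the dilation term matches the formula in the nn.Conv2d documentation and reduces to the expression above when dilation is 1.

```python
import torch
import torch.nn as nn

def conv_out(size, kernel, stride=1, padding=0, dilation=1):
    """Output length along one spatial dimension of a convolution."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1

conv = nn.Conv2d(3, 16, kernel_size=3, padding=1, stride=2)
out = conv(torch.randn(2, 3, 32, 32))      # B = 2 is arbitrary

predicted = conv_out(32, kernel=3, stride=2, padding=1)
print(predicted, tuple(out.shape))         # 16 (2, 16, 16, 16)
```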

D2 · Receptive field

Two 3x3 conv layers stacked (stride 1, no pool). What is the receptive field of one output unit?

Solution

5x5. Each layer adds (kernel - 1) = 2 to the receptive-field side, starting from 1x1.
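
With strides or pooling in the stack, each layer's (kernel - 1) is multiplied by the product of the strides before it. A sketch of that bookkeeping; the function receptive_field and its (kernel, stride) input format are assumptions for illustration.

```python
def receptive_field(layers):
    """layers: list of (kernel, stride) pairs, ordered from input to output."""
    rf, jump = 1, 1              # jump = input pixels between adjacent units at this depth
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

two_convs = receptive_field([(3, 1), (3, 1)])             # the drill above
three_convs = receptive_field([(3, 1), (3, 1), (3, 1)])
conv_pool_conv = receptive_field([(3, 1), (2, 2), (3, 1)])
print(two_convs, three_convs, conv_pool_conv)             # 5 7 8
```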

D3 · Why skip connections in U-Net?

One sentence.

Solution

Concatenating encoder features back into the decoder restores high-resolution detail that the pooling layers throw away, yielding pixel-accurate predictions.

7. The canonical training loop

Concept

Memorise this skeleton — it has the same shape for every supervised task you'll write.

import torch, torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = "mps" if torch.backends.mps.is_available() else ("cuda" if torch.cuda.is_available() else "cpu")
model  = MLP(20, 64, 3).to(device)
opt    = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
loss_fn = nn.CrossEntropyLoss()

train_dl = DataLoader(TensorDataset(X_train_t, y_train_t), batch_size=64, shuffle=True)
val_dl   = DataLoader(TensorDataset(X_val_t,   y_val_t),   batch_size=256)

best_val, patience, bad = float("inf"), 5, 0
for epoch in range(50):
    model.train()
    running, n = 0.0, 0
    for xb, yb in train_dl:
        xb, yb = xb.to(device), yb.to(device)
        opt.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        opt.step()
        running += loss.item() * xb.size(0); n += xb.size(0)
    train_loss = running / n

    model.train(False)              # inference mode
    with torch.no_grad():
        vloss, correct, ntot = 0.0, 0, 0
        for xb, yb in val_dl:
            xb, yb = xb.to(device), yb.to(device)
            logits = model(xb)
            vloss   += loss_fn(logits, yb).item() * xb.size(0)
            correct += (logits.argmax(-1) == yb).sum().item()
            ntot    += xb.size(0)
        val_loss, val_acc = vloss / ntot, correct / ntot
    print(f"epoch {epoch:02d}  train {train_loss:.4f}  val {val_loss:.4f}  acc {val_acc:.4f}")

    # early stopping
    if val_loss < best_val:
        best_val, bad = val_loss, 0
        torch.save(model.state_dict(), "best.pt")
    else:
        bad += 1
        if bad >= patience: break

model.train(False) and model.eval() are equivalent; model.eval() is the spelling you will see in most real code.

8. Debugging diverging models

Concept

The five diagnostic questions, in order, when training misbehaves:

  1. Loss is NaN. LR too high, exploding gradients, or log(0). Lower LR; add clip_grad_norm_; switch to BCEWithLogitsLoss; check for NaN or inf values in inputs.
  2. Loss is flat. LR too low, dead activations, or wrong loss/head pairing. Try LR finder (start tiny, ramp up, plot); confirm activations aren't all zero with a histogram.
  3. Train down, val up. Overfitting. Add dropout, weight decay, augmentation, or early stopping; collect more data.
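  4. Val accuracy > train. Check for a data leak first (validation examples that also appear in the training set). A small gap can also be benign: dropout and augmentation are applied on the training pass but not on validation, so training metrics look worse.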
  5. GPU out of memory. Reduce batch size; use gradient accumulation; enable mixed precision (torch.amp.autocast).
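
For item 5, gradient accumulation trades time for memory: several small backward passes stand in for one large one. A sketch on an assumed toy regression, checking that 4 micro-batches of 8 reproduce the gradient of one batch of 32 when each micro-batch loss is divided by the number of accumulation steps.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(8, 1)
loss_fn = nn.MSELoss()
X, y = torch.randn(32, 8), torch.randn(32, 1)

# Reference: one backward pass over the full batch of 32.
model.zero_grad()
loss_fn(model(X), y).backward()
full_grad = model.weight.grad.clone()

# Accumulated: 4 micro-batches of 8, with no zero_grad between them.
accum_steps = 4
model.zero_grad()
for xb, yb in zip(X.chunk(accum_steps), y.chunk(accum_steps)):
    (loss_fn(model(xb), yb) / accum_steps).backward()
accum_grad = model.weight.grad.clone()

same = torch.allclose(full_grad, accum_grad, atol=1e-5)
print(same)   # True
```

In a real loop, opt.step() and opt.zero_grad() run once every accum_steps micro-batches. The equivalence does not hold exactly for layers that depend on batch statistics, such as BatchNorm.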

Drills

D1 · Train accuracy 100%, val 60%

Three fixes, in priority order.

Solution

(1) Augmentation / more data; (2) stronger regularisation (dropout, weight decay); (3) smaller model. Verify with a learning curve.

D2 · Loss decreases then NaNs

You're training with lr=1e-2; loss drops nicely for 200 steps then NaNs. Likely cause?

Solution

Late-stage exploding gradient — once weights grow, the same LR becomes too large. Add gradient clipping at norm 1.0 and/or switch to a cosine LR schedule.

D3 · Forgot the head's activation

You used BCELoss on raw logits without a sigmoid. What goes wrong?

Solution

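BCELoss expects probabilities in [0, 1], and raw logits fall outside that range. Recent PyTorch versions reject such input with a RuntimeError instead of computing a loss; a hand-written BCE would take the log of a negative number and produce NaNs. Either add nn.Sigmoid() to the head, or (preferred) switch to BCEWithLogitsLoss.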

Next step

With training loops in your fingers, head to Attention & transformers, the architecture behind most modern language and vision models. The Round 2 theory drills include receptive-field calculation, BatchNorm at train vs inference, dropout expected value, and the dead-ReLU diagnosis.