Deep learning with PyTorch
Tensors, autograd, a multi-layer perceptron from scratch, the standard layer toolkit, training loops, regularization, optimizers — every piece you need to go from raw data to a trained neural network.
nn.Module and reading off the parameter count; (3) the
canonical training-loop skeleton (zero_grad -> forward -> loss -> backward -> step); (4) picking the right
loss + activation + optimizer for the task; (5) diagnosing diverging losses (NaN, oscillation,
plateau). USAAIO DL questions are usually "small network on a small dataset, report metric + a bug
fix" — fluency in the loop wins.
1. Tensors & autograd
Concept
A PyTorch Tensor is an n-dimensional array with three attributes you always check:
.shape, .dtype, .device. Operations on tensors with
requires_grad=True are recorded into a dynamic computation graph; calling
.backward() on a scalar walks that graph in reverse, accumulating gradients into each
leaf's .grad. This is just the chain rule from the math page,
automated. Gradients accumulate — you must call optimizer.zero_grad() (or
model.zero_grad(set_to_none=True)) at the start of every training step.
Worked example — verify a hand derivative with autograd
import torch
x = torch.tensor(2.0, requires_grad=True)
y = x ** 3 + 2 * x # dy/dx = 3x^2 + 2 = 14
y.backward()
print(x.grad.item()) # 14.0
# Vector example: gradient of L = ||Wx||^2 w.r.t. W
W = torch.randn(3, 4, requires_grad=True)
xv = torch.randn(4)
L = (W @ xv).pow(2).sum()
L.backward()
# Manual: dL/dW = 2 (Wx) x^T
manual = 2 * (W.detach() @ xv).unsqueeze(1) * xv.unsqueeze(0)
print(torch.allclose(W.grad, manual)) # True
Drills
D1 · Shape on a linear layer
Input batch shape (B, 768); layer nn.Linear(768, 10). What is the output
shape and how many parameters does the layer have?
Solution
Output (B, 10). Parameters: weight 768*10 = 7680 + bias 10
= 7690.
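The count above can be checked directly. This is a minimal sketch; the batch size of 4 is an arbitrary stand-in for B.

```python
import torch
import torch.nn as nn

layer = nn.Linear(768, 10)
x = torch.randn(4, 768)          # B = 4 is an arbitrary batch size for the check
out = layer(x)

n_params = sum(p.numel() for p in layer.parameters())
print(tuple(out.shape))          # (4, 10)
print(tuple(layer.weight.shape)) # (10, 768): stored as (out_features, in_features)
print(n_params)                  # 7690
```

Note that the weight is stored transposed relative to the "768 -> 10" reading, which matters when you inspect or initialise it by hand.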
D2 · Why zero_grad?
You forget opt.zero_grad(). What happens after step 3?
Solution
Gradients from steps 1-3 sum into each parameter's .grad, so step 3 effectively uses
g1 + g2 + g3, a sum that includes stale gradients computed at earlier weights. The steps grow too large and point in outdated directions; loss diverges or oscillates.
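The accumulation is easy to see on a single scalar. The sketch below uses a toy loss 3w (an assumption chosen so each backward pass contributes exactly 3) and skips the optimizer entirely.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

# Three "steps" with the same loss and no zeroing: d(3w)/dw = 3 each time.
for _ in range(3):
    loss = 3 * w
    loss.backward()
accumulated = w.grad.item()      # 3 + 3 + 3

w.grad = None                    # what zero_grad(set_to_none=True) does per parameter
(3 * w).backward()
fresh = w.grad.item()            # gradient of a single step

print(accumulated, fresh)        # 9.0 3.0
```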
D3 · Detach vs no_grad
Difference between x.detach() and the with torch.no_grad(): context?
Solution
x.detach() returns a tensor sharing memory with x but cut from the graph
— operations on the returned tensor have requires_grad=False. no_grad() is a
context manager that disables graph construction for everything inside it. Use no_grad
for evaluation loops; use detach when you need a constant copy of a learned value.
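A short sketch of the difference, using a throwaway tensor of ones; the variable names are illustrative only.

```python
import torch

x = torch.ones(3, requires_grad=True)

d = x.detach()                           # same storage, no graph history
shares_memory = d.data_ptr() == x.data_ptr()
print(d.requires_grad, shares_memory)    # False True

with torch.no_grad():
    y = x * 2                            # nothing inside the block is recorded
print(y.requires_grad)                   # False

z = x * 2                                # outside the block, recording resumes
print(z.requires_grad)                   # True
```

Because detach shares memory, an in-place edit of d also changes x; clone it first if you need an independent copy.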
D4 · Device juggling
You do x = x.to("cuda") but the model still lives on CPU. What error do you get?
Solution
"Expected all tensors to be on the same device". Move both model and data — typically
model.to(device) once and xb = xb.to(device) inside the loop.
2. Building an MLP
Concept
A multilayer perceptron is a stack of Linear -> activation blocks ending in a task-specific
head. nn.Module defines the architecture via __init__ and the forward pass via
forward; parameters are registered automatically. ReLU (max(0, x)) is the
default activation — cheap, non-saturating for x > 0. GELU is its smooth cousin used in
transformers. Tanh and sigmoid saturate and are mostly reserved for output heads with bounded ranges.
Worked example — MLP for tabular classification
import torch, torch.nn as nn
class MLP(nn.Module):
    def __init__(self, in_dim, hidden, out_dim, p_drop=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x) # raw logits

model = MLP(20, 64, 3)
print(model)
print(sum(p.numel() for p in model.parameters()), "parameters")
Drills
D1 · Dead ReLU
What is a "dead ReLU" and one way to avoid it?
Solution
A neuron whose pre-activation is always < 0 for all training inputs; its gradient is always 0, so it never updates. Causes: too-high LR pushing weights into a regime that zeros every input. Mitigations: lower LR, use LeakyReLU/GELU, careful init (Kaiming/He), or BatchNorm.
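One way to spot dead units is to check which ReLU outputs are zero for every input in a batch. The sketch below forces the situation with a contrived bias of -100 (an assumption made purely for illustration) and confirms that no gradient reaches the weights.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lin = nn.Linear(4, 8)
with torch.no_grad():
    lin.bias.fill_(-100.0)       # contrived: forces every pre-activation below zero

x = torch.randn(256, 4)
act = torch.relu(lin(x))
dead = (act == 0).all(dim=0)     # True where a unit never fires on this batch
dead_fraction = dead.float().mean().item()

act.sum().backward()
grad_is_zero = bool((lin.weight.grad == 0).all())
print(dead_fraction, grad_is_zero)   # 1.0 True
```

On a real model you would run the same check on a validation batch with a forward hook or by returning intermediate activations.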
D2 · Parameter count
Compute the parameter count of MLP(in_dim=20, hidden=64, out_dim=3) with three linear
layers (20->64, 64->64, 64->3).
Solution
(20*64+64) + (64*64+64) + (64*3+3) = 1344 + 4160 + 195 = 5699 parameters.
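The hand count can be cross-checked against PyTorch. This sketch rebuilds the same layer stack with nn.Sequential so it runs on its own; linear_params is a helper name introduced here.

```python
import torch.nn as nn

def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

dims = [20, 64, 64, 3]           # in_dim, hidden, hidden, out_dim
by_hand = sum(linear_params(a, b) for a, b in zip(dims, dims[1:]))

# Same stack as the worked example; ReLU and Dropout carry no parameters.
net = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(), nn.Dropout(0.2),
    nn.Linear(64, 64), nn.ReLU(), nn.Dropout(0.2),
    nn.Linear(64, 3),
)
counted = sum(p.numel() for p in net.parameters())
print(by_hand, counted)          # 5699 5699
```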
D3 · Sigmoid + BCELoss vs BCEWithLogitsLoss
Why is BCEWithLogitsLoss numerically preferred?
Solution
It fuses sigmoid and binary cross-entropy in log-space, avoiding the explicit
log(sigma(x)) which is -inf for very negative x in float32.
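The failure mode shows up with a single extreme logit. In this sketch the logit of -200 and the positive label are made-up values chosen to trigger the underflow.

```python
import torch
import torch.nn as nn

logit = torch.tensor([-200.0])   # very confident "negative"
target = torch.tensor([1.0])     # but the label is positive

# Naive route: sigmoid underflows to exactly 0 in float32, so the log is -inf.
naive = -torch.log(torch.sigmoid(logit))
# Fused route: the loss is evaluated in log-space and stays finite.
fused = nn.BCEWithLogitsLoss()(logit, target)

print(naive.item())              # inf
print(fused.item())              # 200.0
```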
3. Losses, activations & output heads
Concept
Match the head, activation, and loss to the task:
- Regression: output 1 unit, no final activation; loss nn.MSELoss (or L1Loss / SmoothL1Loss for outlier-resistant variants).
- Binary classification: output 1 logit, no sigmoid in the model; loss nn.BCEWithLogitsLoss.
- Multi-class (single-label): output K logits, no softmax in the model; loss nn.CrossEntropyLoss.
- Multi-label: output K logits; loss nn.BCEWithLogitsLoss applied per label.
Remember: CrossEntropyLoss takes raw logits and integer class
indices, not one-hot vectors. Returning probabilities from your model and applying log-CE
manually is a classic foot-gun.
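The four pairings differ mostly in output width and in the shape and dtype of the target. The sketch below uses random features and random targets, with a hypothetical batch size B, class count K, and hidden width of 16, just to show what each loss accepts.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, K = 8, 5                                  # hypothetical batch size and class count
feats = torch.randn(B, 16)                   # stand-in for the last hidden layer

# Regression: 1 unit, float targets of the same shape.
reg_loss = nn.MSELoss()(nn.Linear(16, 1)(feats), torch.randn(B, 1))

# Binary: 1 logit, float targets in {0, 1}.
bin_loss = nn.BCEWithLogitsLoss()(nn.Linear(16, 1)(feats),
                                  torch.randint(0, 2, (B, 1)).float())

# Multi-class: K logits, int64 class indices of shape (B,).
mc_loss = nn.CrossEntropyLoss()(nn.Linear(16, K)(feats),
                                torch.randint(0, K, (B,)))

# Multi-label: K logits, float multi-hot targets of shape (B, K).
ml_loss = nn.BCEWithLogitsLoss()(nn.Linear(16, K)(feats),
                                 torch.randint(0, 2, (B, K)).float())

losses = {"regression": reg_loss, "binary": bin_loss,
          "multi-class": mc_loss, "multi-label": ml_loss}
for name, loss in losses.items():
    print(name, round(loss.item(), 4))
```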
4. Optimizers & learning-rate schedules
Concept
SGD + momentum: v <- beta*v + grad(L), theta <- theta - eta*v.
Classical, robust, used with cosine LR schedules in vision. Adam / AdamW keeps
per-parameter running estimates of the first moment (mean) and second moment (uncentred variance) of
gradients and rescales each update by 1/(sqrt(v_hat) + eps). AdamW decouples weight decay
from the gradient update — use it over Adam in 2024+.
LR schedules matter as much as the optimizer. Cosine annealing decays from
eta_max to ~0 over training. OneCycle warms up then cools down, which is very
forgiving for tabular MLPs. Linear warmup + cosine decay is the transformer default.
Worked example — AdamW + cosine schedule
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR
opt = AdamW(model.parameters(), lr=3e-4, weight_decay=1e-2)
sched = CosineAnnealingLR(opt, T_max=20) # 20 epochs
for epoch in range(20):
    for xb, yb in train_dl:
        opt.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        opt.step()
    sched.step()
    print(f"epoch {epoch} lr {sched.get_last_lr()[0]:.2e} loss {loss.item():.4f}")
Drills
D1 · Pick LR
Loss explodes to NaN at iteration 30. What's your first knob?
Solution
Lower the learning rate (try 10x smaller). Other quick checks: add gradient clipping
(nn.utils.clip_grad_norm_), rule out log(0) by replacing
BCELoss+sigmoid with BCEWithLogitsLoss, and lower the init scale.
D2 · AdamW vs Adam
One sentence: what does "decoupled" weight decay mean in AdamW?
Solution
The L2 shrinkage theta <- theta - eta*lambda*theta is applied directly to weights
rather than added into the gradient, so the adaptive rescaling by v_hat does not weaken
the regularisation.
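The difference can be isolated by stepping with a zero gradient, so that weight decay is the only force acting. This is a sketch under that artificial assumption; one_step is a helper introduced here and the lr and weight_decay values are deliberately large to make the effect visible.

```python
import torch

def one_step(opt_cls):
    """Take one optimizer step with a zero gradient, so only weight decay acts."""
    p = torch.nn.Parameter(torch.tensor([1.0, 100.0]))
    opt = opt_cls([p], lr=0.1, weight_decay=0.1)
    p.grad = torch.zeros_like(p)
    opt.step()
    return p.detach()

adamw = one_step(torch.optim.AdamW)   # decoupled: theta * (1 - lr * weight_decay)
adam = one_step(torch.optim.Adam)     # coupled: decay term goes through the adaptive rescaling

print(adamw.tolist())   # about [0.99, 99.0]: each weight shrinks by 1%
print(adam.tolist())    # about [0.9, 99.9]: each weight moves by roughly lr, whatever its size
```

With Adam the decay term is normalised like any other gradient, so the large weight is barely regularised relative to its size; AdamW shrinks both weights proportionally.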
5. Regularization & normalization
Concept
Tools, ordered by how often you'll need them: weight decay (set on the optimizer);
dropout (zero each unit with probability p during training, scale by
1/(1-p) — automatic in nn.Dropout); batchnorm (normalize
each feature across the batch — accelerates training of CNNs); layernorm (normalize
within each sample across the feature dim — standard in transformers); early stopping
(monitor val loss; stop when it stops improving); data augmentation (flips, crops,
colour jitter for vision; masking and back-translation for NLP); and finally a smaller model,
often the most effective regulariser when you are overfitting and out of patience.
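The 1/(1-p) scaling is visible on a tensor of ones. A minimal sketch with p = 0.5 and a fixed seed:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(10_000)

drop.train()
y_train = drop(x)                # survivors are scaled by 1 / (1 - p) = 2
drop.train(False)
y_infer = drop(x)                # identity in inference mode

values = sorted(y_train.unique().tolist())
print(values)                            # [0.0, 2.0]
print(y_train.mean().item())             # close to 1.0: the expected value is preserved
print(torch.equal(y_infer, x))           # True
```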
Drills
D1 · BatchNorm in inference mode
You forget to call model's inference toggle (the one that flips dropout off and uses
running statistics in BatchNorm) and your val accuracy is awful. Why?
Solution
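In train mode BatchNorm normalises with the current batch's statistics (degenerate on a single-sample validation batch) and keeps updating its running averages with validation data; in inference mode it uses the stored running averages. Dropout likewise stays active in train mode, so validation predictions are randomly perturbed. Switch to inference mode before validating and back to train mode before the next training epoch.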
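The two modes give visibly different outputs on the same input. In this sketch the features are shifted to mean 10 and std 5 (arbitrary values) and the layer has seen only one training batch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(3)
x = torch.randn(64, 3) * 5 + 10      # features with mean near 10, std near 5

bn.train()
out_train = bn(x)                    # normalised with this batch's mean and variance
bn.train(False)
out_infer = bn(x)                    # normalised with the running averages instead

print(out_train.mean(dim=0))         # each entry close to 0
print(out_infer.mean(dim=0))         # far from 0: running stats have seen only one update
print(bn.running_mean)               # about 0.1 * batch mean (default momentum 0.1)
```

After enough training batches the running averages converge toward the data statistics and the two outputs agree much more closely.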
D2 · Dropout p choice
What's a sensible starting p for an MLP on small tabular data? For a transformer encoder?
Solution
MLP on small tabular: 0.2-0.5. Transformer encoder: 0.1 is the canonical
starting point. Tune via val loss.
6. CNNs
Concept
Convolutions exploit three priors that make them dominant on grid-structured data: local
connectivity (each kernel sees a small spatial neighborhood), weight sharing
(the same kernel slides across the whole image), and spatial pooling (downsampling
builds larger receptive fields and translation tolerance). A typical block is
Conv2d -> BatchNorm -> ReLU -> MaxPool, repeated, then flatten into a classifier head.
For dense prediction tasks like segmentation, the U-Net architecture pairs a
downsampling encoder with an upsampling decoder, with skip connections that re-inject high-resolution
features.
Worked example — small CNN on CIFAR-shape inputs
import torch.nn as nn
class SmallCNN(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.feat = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2), # 32x16x16
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(), nn.MaxPool2d(2), # 64x8x8
            nn.Conv2d(64,128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Sequential(nn.Flatten(), nn.Dropout(0.3), nn.Linear(128, n_classes))

    def forward(self, x):
        return self.head(self.feat(x))
Drills
D1 · Output size of a Conv2d
Input (B, 3, 32, 32), layer nn.Conv2d(3, 16, kernel_size=3, padding=1, stride=2).
Output shape?
Solution
Spatial: floor((32 + 2*1 - 3)/2) + 1 = 16. Output:
(B, 16, 16, 16).
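The formula can be wrapped in a helper and compared with what Conv2d actually produces. conv_out is a name introduced here, and it assumes dilation 1; the batch of 2 is arbitrary.

```python
import torch
import torch.nn as nn

def conv_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output length along one spatial axis (dilation 1 assumed)."""
    return (size + 2 * padding - kernel) // stride + 1

predicted = conv_out(32, kernel=3, stride=2, padding=1)

conv = nn.Conv2d(3, 16, kernel_size=3, padding=1, stride=2)
actual = tuple(conv(torch.randn(2, 3, 32, 32)).shape)

# Parameters: 16 kernels of size 3x3x3, plus 16 biases.
n_params = sum(p.numel() for p in conv.parameters())
print(predicted, actual, n_params)   # 16 (2, 16, 16, 16) 448
```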
D2 · Receptive field
Two 3x3 conv layers stacked (stride 1, no pool). What is the receptive field of one output unit?
Solution
5x5. Each layer adds (kernel - 1) = 2 to the receptive-field side, starting from 1x1.
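With strides or pooling in the stack, each later layer's contribution is multiplied by the product of the earlier strides. A small sketch of that bookkeeping, where receptive_field is a helper introduced here and pooling is treated as a (kernel, stride) layer:

```python
def receptive_field(layers):
    """layers: list of (kernel, stride) pairs, input side first."""
    rf, jump = 1, 1               # jump = input pixels between adjacent units at this depth
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

two_convs = receptive_field([(3, 1), (3, 1)])
conv_pool_conv = receptive_field([(3, 1), (2, 2), (3, 1)])
print(two_convs)        # 5
print(conv_pool_conv)   # 8: the conv after a 2x2 pool adds (3 - 1) * 2
```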
D3 · Why skip connections in U-Net?
One sentence.
Solution
Concatenating encoder features back into the decoder restores high-resolution detail that the pooling layers throw away, yielding pixel-accurate predictions.
7. The canonical training loop
Concept
Memorise this skeleton — it has the same shape for every supervised task you'll write.
import torch, torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
device = "mps" if torch.backends.mps.is_available() else ("cuda" if torch.cuda.is_available() else "cpu")
model = MLP(20, 64, 3).to(device)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
loss_fn = nn.CrossEntropyLoss()
train_dl = DataLoader(TensorDataset(X_train_t, y_train_t), batch_size=64, shuffle=True)
val_dl = DataLoader(TensorDataset(X_val_t, y_val_t), batch_size=256)
best_val, patience, bad = float("inf"), 5, 0
for epoch in range(50):
    model.train()
    running, n = 0.0, 0
    for xb, yb in train_dl:
        xb, yb = xb.to(device), yb.to(device)
        opt.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        opt.step()
        running += loss.item() * xb.size(0); n += xb.size(0)
    train_loss = running / n
    model.train(False) # inference mode
    with torch.no_grad():
        vloss, correct, ntot = 0.0, 0, 0
        for xb, yb in val_dl:
            xb, yb = xb.to(device), yb.to(device)
            logits = model(xb)
            vloss += loss_fn(logits, yb).item() * xb.size(0)
            correct += (logits.argmax(-1) == yb).sum().item()
            ntot += xb.size(0)
    val_loss, val_acc = vloss / ntot, correct / ntot
    print(f"epoch {epoch:02d} train {train_loss:.4f} val {val_loss:.4f} acc {val_acc:.4f}")
    # early stopping
    if val_loss < best_val:
        best_val, bad = val_loss, 0
        torch.save(model.state_dict(), "best.pt")
    else:
        bad += 1
        if bad >= patience: break
In real code you would call model.eval() rather than
model.train(False); both do the same thing.
8. Debugging diverging models
Concept
The five diagnostic questions, in order, when training misbehaves:
- Loss is NaN. LR too high, exploding gradients, or log(0). Lower LR; add clip_grad_norm_; switch to BCEWithLogitsLoss; check for -inf in inputs.
- Loss is flat. LR too low, dead activations, or wrong loss/head pairing. Try LR finder (start tiny, ramp up, plot); confirm activations aren't all zero with a histogram.
- Train down, val up. Overfitting. Add dropout, weight decay, augmentation, or early stopping; collect more data.
- Val accuracy > train. A small gap is often benign: dropout and augmentation make the training pass harder than the validation pass, and train accuracy is averaged over an epoch while the model is still improving. If the gap is large, suspect a data leak (validation rows that also appear in the training set) or a validation set that is too small or too easy.
- GPU out of memory. Reduce batch size; use gradient accumulation; enable mixed precision (torch.amp.autocast).
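Gradient accumulation, mentioned in the out-of-memory item, relies on the fact that .grad sums across backward calls. The sketch below, on a toy linear model with random data (all names and sizes are assumptions), shows that four micro-batches of 8 reproduce the gradient of one batch of 32 when each loss is divided by the number of chunks.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()
X, y = torch.randn(32, 10), torch.randn(32, 1)

# Reference: gradient from the full batch of 32.
model.zero_grad()
loss_fn(model(X), y).backward()
full_grad = model.weight.grad.clone()

# Accumulate over 4 equal micro-batches, scaling each loss by 1 / accum_steps.
accum_steps = 4
model.zero_grad()
for xb, yb in zip(X.chunk(accum_steps), y.chunk(accum_steps)):
    (loss_fn(model(xb), yb) / accum_steps).backward()
accum_grad = model.weight.grad.clone()
# In a real loop, opt.step() and opt.zero_grad() would run here, once per accum_steps.

grads_match = torch.allclose(full_grad, accum_grad, atol=1e-5)
print(grads_match)   # True
```

The equivalence holds for mean-reduced losses with equal chunk sizes; layers that depend on batch statistics, such as BatchNorm, will still see the smaller micro-batch.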
Drills
D1 · Train accuracy 100%, val 60%
Three fixes, in priority order.
Solution
(1) Augmentation / more data; (2) stronger regularisation (dropout, weight decay); (3) smaller model. Verify with a learning curve.
D2 · Loss decreases then NaNs
You're training with lr=1e-2; loss drops nicely for 200 steps then NaNs. Likely cause?
Solution
Late-stage exploding gradient — once weights grow, the same LR becomes too large. Add gradient clipping at norm 1.0 and/or switch to a cosine LR schedule.
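To see what clipping at norm 1.0 does, the sketch below fills the gradients of a small layer with an artificially large value and measures the global norm before and after.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
for p in model.parameters():
    p.grad = torch.full_like(p, 10.0)    # artificially large gradients

# clip_grad_norm_ returns the total norm measured before clipping.
before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))

print(before.item())   # 10 * sqrt(10), about 31.62 (10 gradient entries of 10.0)
print(after.item())    # about 1.0
```

Logging the returned pre-clip norm during training is a cheap way to see when gradients start to blow up.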
D3 · Forgot the head's activation
You used BCELoss on raw logits without a sigmoid. What goes wrong?
Solution
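BCELoss expects probabilities in [0, 1]. With raw logits, log(x) and log(1 - x) are
undefined for values outside that range; recent PyTorch versions typically reject such inputs with an error rather than silently returning NaN. Either add nn.Sigmoid() to the
head, or (preferred) switch to BCEWithLogitsLoss.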
Checkpoint — answer out loud
- Can you write the training-loop skeleton from memory, including zero_grad, clip_grad_norm_, and the train/inference-mode toggle?
- Can you state which loss + activation goes with regression, binary classification, multi-class, multi-label?
- Can you compute the parameter count and output spatial size of a Conv2d layer?
- Can you list four causes of NaN loss and a fix for each?
- Can you describe how BatchNorm differs in train vs. inference mode?
Next step
With training loops in your fingers, head to Attention & transformers — the architecture behind every modern AI system. The Round 2 theory drills include receptive-field calculation, BatchNorm at train vs inference, dropout expected value, and the dead-ReLU diagnosis.