Mock B · Small vision — “Hand-drawn shape classification”

Format: Round-2 style · Time: 180 minutes · Compute: Colab L4 GPU allowed · Submission: predictions.csv + 3 short-answer responses · Metric: macro accuracy.

A self-contained small-scale vision mock modeled on a USAAIO Round-2 problem. You generate a 5 000-image synthetic dataset of hand-drawn shapes, train a CNN on a Colab L4 GPU, predict labels for a held-out test set, and answer three in-line theory questions. Three hours, end to end.

Constraints.

Time budget: 180 minutes, hard stop, including data generation, training, and write-up.
Compute: one Colab L4 GPU (or local equivalent). No pretrained ImageNet weights, no foundation models.
Libraries: PyTorch, torchvision, numpy, pandas, matplotlib, and Pillow (used by the data generator). torchvision.transforms is allowed; timm is not.
Submission: predictions.csv with header id,label, plus a markdown cell answering the 3 theory questions at the end of the notebook.

Problem statement

You receive 5 000 grayscale 64×64 images of hand-drawn shapes in 5 categories: circle, square, triangle, star, arrow. Each shape is rendered by a stochastic procedure that varies size, stroke thickness, in-plane rotation, translation, and additive Gaussian noise. Class balance is exactly uniform (1 000 / class).

You will split the 5 000 images into 4 000 train and 1 000 held-out test images by index (the generator below yields a deterministic split). Train a CNN from scratch and submit predicted class labels for the test images. The grader computes macro accuracy (mean per-class accuracy) on the held-out labels.
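A minimal sketch of the metric, assuming the grader's definition is exactly the one stated above: for each class, the fraction of its images predicted correctly, then an unweighted mean over classes. It uses only numpy, so it stays inside the allowed libraries; `confusion_matrix`, `per_class_accuracy` and `macro_accuracy` are illustrative names, not a grading API. The same matrix and per-class numbers also serve the Evaluation report row of the rubric.

```python
import numpy as np

CLASSES = ["circle", "square", "triangle", "star", "arrow"]

def confusion_matrix(y_true, y_pred, classes=CLASSES):
    """Counts with rows = true class and columns = predicted class."""
    index = {c: i for i, c in enumerate(classes)}
    cm = np.zeros((len(classes), len(classes)), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[index[t], index[p]] += 1  # KeyError if a label is not one of `classes`
    return cm

def per_class_accuracy(cm):
    """Correct / total for each true class (row); NaN if the class never occurs."""
    support = cm.sum(axis=1)
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.diag(cm) / support

def macro_accuracy(y_true, y_pred, classes=CLASSES):
    """Unweighted mean of the per-class accuracies of the classes present in y_true."""
    return float(np.nanmean(per_class_accuracy(confusion_matrix(y_true, y_pred, classes))))

# usage: star is right once out of twice, circle and arrow are always right
y_true = ["star", "star", "arrow", "circle"]
y_pred = ["star", "triangle", "arrow", "circle"]
acc = per_class_accuracy(confusion_matrix(y_true, y_pred))
print(macro_accuracy(y_true, y_pred))   # (1.0 + 0.5 + 1.0) / 3
print(CLASSES[int(np.nanargmin(acc))])  # the hardest class in this toy example: star
```

In the toy example plain accuracy is 3/4, while macro accuracy weights each present class equally, so a single weak class pulls the score down regardless of how many images it has.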

The dataset is fully synthetic and generated by the snippet in the next section; cite it as such if you publish your solution.

Synthetic data generator (run once at the top of the notebook)

```python
import numpy as np
import pandas as pd
from PIL import Image, ImageDraw
from pathlib import Path

CLASSES = ["circle", "square", "triangle", "star", "arrow"]
ROOT = Path("shapes")
ROOT.mkdir(exist_ok=True)
rng = np.random.default_rng(7)

def draw_shape(cls, size=64):
    img = Image.new("L", (size, size), color=0)
    d = ImageDraw.Draw(img)
    # random centre offset (±6 px), radius and stroke width, cast to plain ints for Pillow
    cx = size // 2 + int(rng.integers(-6, 7))
    cy = size // 2 + int(rng.integers(-6, 7))
    r = int(rng.integers(14, 24))
    stroke = int(rng.integers(1, 4))
    # polygon(..., width=...) requires Pillow >= 9.1
    if cls == "circle":
        d.ellipse([cx-r, cy-r, cx+r, cy+r], outline=255, width=stroke)
    elif cls == "square":
        d.rectangle([cx-r, cy-r, cx+r, cy+r], outline=255, width=stroke)
    elif cls == "triangle":
        d.polygon([(cx, cy-r), (cx-r, cy+r), (cx+r, cy+r)], outline=255, width=stroke)
    elif cls == "star":
        pts = []
        for k in range(10):  # alternate outer (r) and inner (0.45 r) vertices
            ang = -np.pi/2 + k * np.pi / 5
            rad = r if k % 2 == 0 else r * 0.45
            pts.append((cx + rad*np.cos(ang), cy + rad*np.sin(ang)))
        d.polygon(pts, outline=255, width=stroke)
    elif cls == "arrow":
        # shaft stops where the 8 px head starts, so the tip (cx + r) stays on the 64 px canvas
        d.line([(cx-r, cy), (cx+r-8, cy)], fill=255, width=stroke)
        d.polygon([(cx+r-8, cy-6), (cx+r-8, cy+6), (cx+r, cy)], outline=255, width=stroke)
    img = img.rotate(rng.uniform(-25, 25), resample=Image.BILINEAR)  # rotates about the image centre
    arr = np.array(img).astype(np.float32)
    arr += rng.normal(0, 8, size=arr.shape)  # additive Gaussian noise, sigma = 8 grey levels
    return np.clip(arr, 0, 255).astype(np.uint8)

# exactly 1 000 images per class, shuffled so the label cannot be read off the id (i % 5)
labels = rng.permutation(np.repeat(CLASSES, 1000))
records = []
for i, cls in enumerate(labels):
    arr = draw_shape(str(cls))
    Image.fromarray(arr).save(ROOT / f"{i:05d}.png")
    records.append((i, str(cls)))

# deterministic 4000/1000 split by index
df = pd.DataFrame(records, columns=["id", "label"])
train = df.iloc[:4000]
test = df.iloc[4000:]
train.to_csv("train.csv", index=False)
test[["id"]].to_csv("test.csv", index=False)                  # no labels at test time
test[["id", "label"]].to_csv("test_labels.csv", index=False)  # self-scoring only: do not open before you submit
```
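The Data loader row of the rubric asks for a validation split carved out of the 4 000 training rows, with no leakage from the test 1 000. A minimal sketch using only numpy and pandas, stratified by label so every class keeps the same share in both halves; `stratified_split`, `val_frac=0.2` and `seed=0` are illustrative choices, not requirements of the task. For the same reason, any normalization statistics (pixel mean/std) should be computed on the `fit_df` images only.

```python
import numpy as np
import pandas as pd

def stratified_split(train_df, val_frac=0.2, seed=0):
    """Hold out val_frac of each class from the training rows; test.csv is never touched."""
    rng = np.random.default_rng(seed)
    val_idx = []
    for _, group in train_df.groupby("label"):
        n_val = int(round(len(group) * val_frac))
        picked = rng.choice(group.index.to_numpy(), size=n_val, replace=False)
        val_idx.extend(picked.tolist())
    val_df = train_df.loc[sorted(val_idx)]
    fit_df = train_df.drop(index=val_idx)
    return fit_df.reset_index(drop=True), val_df.reset_index(drop=True)

# usage in the notebook (assumes train.csv from the generator):
# fit_df, val_df = stratified_split(pd.read_csv("train.csv"))
```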

Submission format

A single CSV named predictions.csv with 1 000 data rows and the header below.

```csv
id,label
4000,star
4001,circle
4002,arrow
...
4999,square
```

id is the image index; the grader joins on id.
Labels must be exactly one of circle, square, triangle, star, arrow (lowercase).
Also include a markdown cell at the end of the notebook answering the 3 questions under Theory short-answer.
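A pre-submission check is cheap insurance against losing points to formatting. This sketch re-checks the rules listed above, assuming the test ids live in the `test.csv` written by the generator; `validate_submission` is a hypothetical helper, not the grader's code, and the real grader may be stricter or more lenient.

```python
import pandas as pd

VALID_LABELS = {"circle", "square", "triangle", "star", "arrow"}

def validate_submission(pred_path="predictions.csv", test_ids_path="test.csv"):
    """Raise ValueError on the first format rule the submission breaks; return True otherwise."""
    expected_ids = set(pd.read_csv(test_ids_path)["id"])
    sub = pd.read_csv(pred_path)
    if list(sub.columns) != ["id", "label"]:
        raise ValueError(f"header must be id,label; got {list(sub.columns)}")
    if len(sub) != len(expected_ids):
        raise ValueError(f"expected {len(expected_ids)} rows; got {len(sub)}")
    if sub["id"].duplicated().any():
        raise ValueError("duplicate ids in submission")
    if set(sub["id"]) != expected_ids:
        raise ValueError("submission ids do not match test.csv")
    bad = set(sub["label"].astype(str)) - VALID_LABELS  # case-sensitive: "Star" is rejected
    if bad:
        raise ValueError(f"labels outside the 5 lowercase classes: {sorted(bad)}")
    return True

# usage (test_ids and preds are hypothetical names for your own arrays):
# pd.DataFrame({"id": test_ids, "label": preds}).to_csv("predictions.csv", index=False)
# validate_submission()
```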

Scoring rubric (100 points)

Section	Points	What earns credit
Data loader	15	Custom `Dataset` + `DataLoader`, correct image-tensor normalization, train/val split (held-out within the 4 000 train), no leakage from the test 1 000.
Baseline CNN	30	A from-scratch CNN (3–5 conv blocks, ≤500 k params) trained for a documented number of epochs, with reported train/val accuracy and a loss curve.
Augmentation	20	At least 2 augmentations (e.g. `RandomAffine`, `ColorJitter`, or `RandomHorizontalFlip` if flipping preserves every class label; otherwise justify leaving it out), with a measured val-accuracy delta vs. no augmentation.
Training stability	15	Pinned seeds, fixed batch size, a sensible LR schedule (step / cosine), early stopping or explicit best-epoch selection on val.
Evaluation report	10	Per-class accuracy + confusion matrix on the val set; brief comment on the hardest class.
Theory short-answer	10	Three questions in Theory short-answer, ~3 sentences each.
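Two rubric rows involve quantities you can check by arithmetic before any training: the ≤500 k parameter budget (Baseline CNN) and the LR schedule plus best-epoch selection (Training stability). The sketch below is pure Python. The channel plan 1 → 32 → 64 → 128 → 128 with bias-free 3×3 convs, BatchNorm, global average pooling and one linear head is a hypothetical example, not a recommended or reference architecture. On a real model, `sum(p.numel() for p in model.parameters())` gives the same count, and `cosine_lr` follows the closed-form curve documented for `torch.optim.lr_scheduler.CosineAnnealingLR`; `EarlyStopper` is an illustrative helper, not a library API.

```python
import math

# --- parameter budget ---
def conv_params(c_in, c_out, k=3, bias=True):
    return c_out * (c_in * k * k + (1 if bias else 0))

def bn_params(c):
    return 2 * c  # learnable scale and shift; running mean/var are buffers, not parameters

def linear_params(n_in, n_out):
    return n_out * (n_in + 1)  # weights + bias

# hypothetical plan: bias-free convs + BN per block, global average pooling, 128 -> 5 head
channels = [1, 32, 64, 128, 128]
total = sum(conv_params(a, b, bias=False) + bn_params(b) for a, b in zip(channels, channels[1:]))
total += linear_params(channels[-1], 5)
print(total)  # 241253, well under 500 k

# --- cosine LR schedule ---
def cosine_lr(epoch, total_epochs, lr_max=1e-3, lr_min=0.0):
    """lr_max at epoch 0, decaying along half a cosine to lr_min at total_epochs."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * epoch / total_epochs))

# --- early stopping with explicit best-epoch selection on val ---
class EarlyStopper:
    def __init__(self, patience=5):
        self.patience = patience
        self.best_score = -math.inf
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, epoch, val_score):
        """Record this epoch's val score; return True when training should stop."""
        if val_score > self.best_score:
            self.best_score, self.best_epoch, self.bad_epochs = val_score, epoch, 0
            return False  # new best: save a checkpoint here
        self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# usage with made-up val accuracies: stops at epoch 3, keeps epoch 1
stopper = EarlyStopper(patience=2)
for epoch, acc in enumerate([0.80, 0.85, 0.84, 0.83, 0.90]):
    if stopper.step(epoch, acc):
        break
print(stopper.best_epoch, stopper.best_score)  # 1 0.85
```

Note that early stopping with patience 2 never sees the 0.90 at epoch 4; patience trades compute for the risk of stopping before a late improvement.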

Theory short-answer (10 pts, 3 questions)

Receptive fields. Your final conv stack is Conv(3×3) → MaxPool(2) → Conv(3×3) → MaxPool(2) → Conv(3×3), with stride-1 convolutions and 2×2 max-pooling at stride 2. What is the receptive field of one output neuron with respect to the 64×64 input, in pixels? Show the arithmetic.
BatchNorm at inference. Why does a BatchNorm layer behave differently during training vs. inference, and what concretely happens if you forget to switch the model into inference mode before predicting on the test set?
Augmentation impact. Your rotation augmentation lifts val accuracy by ~3 percentage points but slows training by ~30%. Give one principled reason rotation specifically helps on this dataset, and one realistic situation where you would remove rotation augmentation.

Start the timer

Go. Set a 180-minute timer now. Reserve the last 20 minutes for the evaluation report and theory answers — do not let training eat your write-up time. When the timer rings, save the notebook from a clean kernel and stop. Then open the reference solution and score yourself section by section.