Contest-day cheatsheets

One page, ten sections, every API and shape rule I want at my fingertips during a proctored USAAIO round. The goal is recognition speed: skim once a day until you don't need it open.

How to use. The contest is on Colab without internet, so practise these from memory. If you can read each table and immediately picture the snippet, you're ready. If a cell makes you pause, that's your next drill.

1. NumPy

Broadcasting rules

Compare shapes right-to-left. For each pair: equal, or one is 1, or one is missing → OK. Otherwise error.

A shape	B shape	Result	Why
`(4, 3)`	`(3,)`	`(4, 3)`	row vector added to each row
`(4, 3)`	`(4, 1)`	`(4, 3)`	column vector added to each column
`(4, 3)`	`(4,)`	error	reshape to `(4, 1)` first
`(n, 1, d)`	`(1, n, d)`	`(n, n, d)`	pairwise broadcast trick
`(b, h, n, d)`	`(d,)`	`(b, h, n, d)`	per-feature scaling along the last axis
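
A quick way to drill the table is to check shapes without building large arrays. The sketch below uses `np.broadcast_shapes` (NumPy 1.20+) on the same shape pairs, with `n` set to 5 and `d` to 2 purely for illustration, and then reproduces the `(4, 3)` vs `(4,)` failure and its fix.

```python
import numpy as np

# The right-to-left rule, applied to shapes only (no data allocated).
row_case = np.broadcast_shapes((4, 3), (3,))          # (4, 3)
col_case = np.broadcast_shapes((4, 3), (4, 1))        # (4, 3)
pairwise = np.broadcast_shapes((5, 1, 2), (1, 5, 2))  # (5, 5, 2)

X = np.arange(12.0).reshape(4, 3)
col = np.array([10.0, 20.0, 30.0, 40.0])   # shape (4,)

try:
    X + col                  # trailing dims are 3 vs 4, so this raises
    failed = False
except ValueError:
    failed = True

shifted = X + col[:, None]   # (4, 3) + (4, 1) -> (4, 3)
```

Indexing with `[:, None]` is the same reshape to `(4, 1)` that the table recommends.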

Shape ops

a.reshape(2, -1)              # -1 means "infer"
a.ravel()                     # flatten, returns view if possible
np.squeeze(a, axis=1)         # drop axis 1 (must be length 1); omit axis to drop all length-1 dims
np.expand_dims(a, axis=0)     # equivalent to a[None]
a.transpose(0, 2, 1)          # permute axes
np.stack([a, b], axis=0)      # new axis, equal shapes
np.concatenate([a, b], axis=1)# join on existing axis
np.tile(a, (2, 3))            # repeat along axes
np.repeat(a, 3, axis=0)       # repeat each element

Axis-aware reductions

Call	Input	Output	Note
`X.mean(axis=0)`	`(N, D)`	`(D,)`	per-column mean
`X.mean(axis=1)`	`(N, D)`	`(N,)`	per-row mean
`X.sum(axis=-1, keepdims=True)`	`(N, D)`	`(N, 1)`	preserve rank for broadcasting
`X.argmax(axis=1)`	`(N, K)`	`(N,)`	predicted class indices
`np.linalg.norm(X, axis=1)`	`(N, D)`	`(N,)`	L2 norm per row
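
The `keepdims=True` row matters most when the reduced result is reused against the original array. As a minimal sketch, here is a row-wise softmax; the helper name `softmax_rows` and the sample logits are illustrative, not part of the sheet.

```python
import numpy as np

def softmax_rows(logits):
    """Row-wise softmax for a 2-D array of shape (N, K)."""
    # keepdims=True yields (N, 1), which broadcasts against (N, K).
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

logits = np.array([[1.0, 2.0, 3.0],
                   [0.0, 0.0, 0.0]])
probs = softmax_rows(logits)     # shape (N, K), each row sums to 1
pred = probs.argmax(axis=1)      # shape (N,), predicted class per row
```

Without `keepdims=True` the reductions would have shape `(N,)`, and the subtraction would fail or silently misalign whenever `N != K`.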

Slicing & indexing

a[1:5:2]                      # start:stop:step
a[::-1]                       # reverse
a[mask]                       # boolean mask; mask.shape must equal a.shape (or a's leading dims), no broadcasting
a[[0, 2, 5]]                  # fancy index — returns a copy
a[np.arange(N), y]            # pick a[i, y[i]] for each i (gather)
a[:, None] - b[None, :]       # pairwise diff via broadcasting
np.where(cond, x, y)          # vectorized ternary
np.clip(a, 0, 1)              # element-wise clip
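
Two of these idioms come up constantly: the gather `a[np.arange(N), y]` and the pairwise difference. The sketch below uses small made-up arrays to show the shapes involved.

```python
import numpy as np

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4]])
y = np.array([0, 1, 2])
N = len(y)

# Gather: probability assigned to the true class of each row.
p_true = probs[np.arange(N), y]          # shape (N,)
nll = -np.log(p_true).mean()             # mean negative log-likelihood

# Pairwise squared Euclidean distances between rows of a and rows of b.
a = np.array([[0.0, 0.0], [3.0, 4.0]])               # (2, 2)
b = np.array([[0.0, 0.0], [6.0, 8.0], [3.0, 0.0]])   # (3, 2)
diff = a[:, None] - b[None, :]           # (2, 1, 2) - (1, 3, 2) -> (2, 3, 2)
sq_dist = (diff ** 2).sum(axis=-1)       # (2, 3)
```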

Random (modern API)

rng = np.random.default_rng(0)
rng.standard_normal((4, 3))   # N(0, 1)
rng.normal(loc=0, scale=1, size=(4, 3))
rng.uniform(0, 1, size=10)
rng.integers(0, 10, size=5)   # exclusive of high
rng.choice(N, size=k, replace=False)
rng.permutation(N)

2. pandas

read_csv common args

Arg	Use
`sep=","`	or `"\t"`, `r"\s+"` for whitespace
`header=0` / `None`	row index of header, or no header
`names=[...]`	supply column names (with `header=None`)
`index_col="id"`	use this column as the row index
`usecols=["a","b"]`	only load these columns (saves memory)
`dtype={"x": "float32"}`	force column dtypes
`parse_dates=["ts"]`	parse to datetime64
`na_values=["?", "NA"]`	extra strings to treat as NaN
`nrows=1000`	peek at file
`chunksize=10000`	iterator of DataFrames for big files

loc vs iloc

df.loc["a":"c", ["x", "y"]]   # LABEL slicing, INCLUSIVE on both ends
df.iloc[0:3, [0, 1]]          # POSITION slicing, exclusive right
df.loc[df["x"] > 0, "y"] = 1  # safe conditional write
df.at["row3", "x"]            # scalar label access (fastest)
df.iat[2, 0]                  # scalar position access
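
The inclusive vs exclusive difference is easy to forget under time pressure. A small sketch on a made-up four-row frame:

```python
import pandas as pd

df = pd.DataFrame(
    {"x": [1, -2, 3, -4], "y": [0, 0, 0, 0]},
    index=["a", "b", "c", "d"],
)

by_label = df.loc["a":"c", ["x"]]   # rows a, b, c (right end included)
by_pos = df.iloc[0:3, [0]]          # rows 0, 1, 2 (right end excluded)
two_rows = df.iloc[0:2]             # rows 0 and 1 only

df.loc[df["x"] > 0, "y"] = 1        # one indexing call, so the write lands in df
```

`"a":"c"` and `0:3` select the same three rows here, which is exactly the off-by-one to keep in mind when converting between the two.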

GroupBy + agg

df.groupby("user").size()                       # count per group
df.groupby("user")["amt"].mean()                 # series result
df.groupby(["user", "day"])["amt"].sum()         # multi-key, MultiIndex

df.groupby("user").agg(
    n=("amt", "size"),
    mean_amt=("amt", "mean"),
    max_amt=("amt", "max"),
).reset_index()                                  # named columns, flat

df.groupby("user")["amt"].transform("mean")      # broadcast back to row shape
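
The difference between `agg` and `transform` is the output shape: one row per group versus one value per original row. A sketch with made-up data (the column `amt_centered` is an illustrative feature, not something the sheet prescribes):

```python
import pandas as pd

df = pd.DataFrame({
    "user": ["u1", "u1", "u2", "u2", "u2"],
    "amt": [10.0, 30.0, 5.0, 5.0, 20.0],
})

per_user = df.groupby("user").agg(
    n=("amt", "size"),
    mean_amt=("amt", "mean"),
).reset_index()                                    # one row per user

# transform returns a Series aligned with df's original rows.
df["user_mean"] = df.groupby("user")["amt"].transform("mean")
df["amt_centered"] = df["amt"] - df["user_mean"]
```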

Merge / join

pd.merge(a, b, on="user_id", how="left")        # how: left/right/inner/outer
pd.merge(a, b, left_on="uid", right_on="user_id")
pd.merge(a, b, on="k", how="outer", indicator=True)  # _merge col shows source
a.join(b, on="user_id", how="left")              # join uses b's index
pd.concat([a, b], axis=0, ignore_index=True)     # stack rows

value_counts & pivot_table

df["label"].value_counts()                       # counts, sorted desc
df["label"].value_counts(normalize=True)         # proportions
pd.crosstab(df["a"], df["b"])                    # 2-way frequency

df.pivot_table(
    index="date", columns="product",
    values="sales", aggfunc="sum", fill_value=0,
)

Datetime

df["ts"] = pd.to_datetime(df["ts"], errors="coerce")
df["hour"] = df["ts"].dt.hour
df["dow"]  = df["ts"].dt.dayofweek
df["month"] = df["ts"].dt.month
df["is_weekend"] = df["dow"] >= 5
df.set_index("ts").resample("1D")["amt"].sum()

3. matplotlib

Figure / axes anatomy

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 4))   # one axes
fig, axes = plt.subplots(2, 3, figsize=(12, 6), sharex=True)
ax = axes[0, 1]                          # row, col indexing
fig.suptitle("Diagnostics")
plt.tight_layout()
fig.savefig("out.png", dpi=150, bbox_inches="tight")

The six plots

Plot	Call	When
line	`ax.plot(x, y, label="train")`	loss curves, time series
scatter	`ax.scatter(x, y, c=labels, s=8, alpha=0.6)`	2-D feature view, residuals
histogram	`ax.hist(x, bins=50, density=True)`	feature distribution
bar	`ax.bar(names, values)`	class counts, model comparison
image	`ax.imshow(img, cmap="gray")`	image samples, weight maps
heatmap	`im = ax.imshow(M); fig.colorbar(im, ax=ax)`	confusion matrix, attention

Style fixes

ax.set_xlabel("epoch"); ax.set_ylabel("loss"); ax.set_title("...")
ax.set_yscale("log")          # long loss curves
ax.set_xlim(0, 100); ax.set_ylim(0, 1)
ax.legend(loc="best"); ax.grid(alpha=0.3)
ax.tick_params(axis="x", rotation=45)
fig.colorbar(im, ax=ax, shrink=0.8)   # im is the return value of ax.imshow(...)

4. scikit-learn

Estimator API

est.fit(X, y)            # learn parameters
est.predict(X)           # supervised: discrete labels or regression values
est.predict_proba(X)     # classifiers (where supported)
est.transform(X)         # transformers (scalers, encoders, PCA)
est.fit_transform(X)     # both in one call (on TRAIN only)
est.score(X, y)          # default metric (accuracy / R^2)

Preprocessing

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

pre = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city", "plan"]),
])
X_train_t = pre.fit_transform(X_train)
X_val_t   = pre.transform(X_val)        # NO refit on val
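
A tiny numeric check of the "fit on train only" rule, using a single made-up feature so the statistics can be verified by hand. This is a sketch with `StandardScaler` alone rather than the full `ColumnTransformer` above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[0.0], [2.0], [4.0]])
X_val = np.array([[2.0], [6.0]])

scaler = StandardScaler()
X_train_t = scaler.fit_transform(X_train)   # learns mean and std from TRAIN only
X_val_t = scaler.transform(X_val)           # reuses the train statistics

train_mean = scaler.mean_[0]                # 2.0, taken from X_train
val_mean_after = X_val_t.mean()             # not 0: val was not refit
```

The validation column does not come out zero-centred, and that is the intended behaviour: refitting on validation data would leak its statistics into preprocessing.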

Pipeline

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ("pre", pre),
    ("clf", LogisticRegression(max_iter=1000, C=1.0)),
])
pipe.fit(X_train, y_train)
pipe.predict(X_val)

model_selection

from sklearn.model_selection import (
    train_test_split, KFold, StratifiedKFold,
    cross_val_score, GridSearchCV,
)

X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y,
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="f1_macro")

grid = GridSearchCV(
    pipe,
    param_grid={"clf__C": [0.1, 1, 10]},
    cv=cv, scoring="f1_macro", n_jobs=-1,
)
grid.fit(X, y)
grid.best_params_, grid.best_score_

Metrics

Task	Metric	Call
classification	accuracy	`accuracy_score(y, yp)`
classification	F1 (macro / binary)	`f1_score(y, yp, average="macro")`
classification	ROC-AUC	`roc_auc_score(y, yp_proba)` (binary: pass the positive-class column, `proba[:, 1]`)
classification	confusion matrix	`confusion_matrix(y, yp)`
regression	MSE / RMSE	`mean_squared_error(y, yp)` / `root_mean_squared_error(y, yp)` (sklearn ≥ 1.4; older versions: `mean_squared_error(y, yp, squared=False)`)
regression	MAE	`mean_absolute_error(y, yp)`
regression	R²	`r2_score(y, yp)`

Common estimators

Estimator	Import	Key defaults
LogisticRegression	`linear_model`	`C=1.0, penalty="l2", max_iter=100` (bump to 1000)
Ridge / Lasso	`linear_model`	`alpha=1.0`
KNeighborsClassifier	`neighbors`	`n_neighbors=5`
DecisionTreeClassifier	`tree`	`max_depth=None, min_samples_split=2`
RandomForestClassifier	`ensemble`	`n_estimators=100, max_features="sqrt"`
GradientBoostingClassifier	`ensemble`	`n_estimators=100, learning_rate=0.1, max_depth=3`
SVC	`svm`	`C=1.0, kernel="rbf", gamma="scale"`
KMeans	`cluster`	`n_clusters=8, n_init="auto"`
PCA	`decomposition`	`n_components=None`

5. PyTorch shape operations

view vs reshape

x.view(B, -1)        # requires contiguous memory; errors if not
x.reshape(B, -1)     # works on non-contiguous (may copy)
x.contiguous().view(B, -1)   # canonical fix after transpose/permute

permute vs transpose

x.transpose(1, 2)            # swap exactly two dims
x.permute(0, 2, 3, 1)        # arbitrary reorder, e.g. NCHW -> NHWC
# Both return a view with strided memory; follow with .contiguous() before .view().
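
The failure mode is worth seeing once. A minimal sketch on a small made-up tensor: `view` raises after `transpose`, while `reshape` and `.contiguous().view(...)` give the same result.

```python
import torch

x = torch.arange(24).reshape(2, 3, 4)   # (B, N, D), contiguous
xt = x.transpose(1, 2)                  # (B, D, N), a strided view of x

try:
    xt.view(2, -1)                      # strides cannot express this shape
    view_failed = False
except RuntimeError:
    view_failed = True

flat_a = xt.reshape(2, -1)              # copies because it has to here
flat_b = xt.contiguous().view(2, -1)    # explicit copy, then a cheap view
```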

unsqueeze / squeeze / expand / repeat

x.unsqueeze(0)               # add axis at position 0:  (D,) -> (1, D)
x.squeeze(1)                 # remove axis 1 if it is length 1
x.expand(B, -1, -1)          # broadcast view of a size-1 dim, no copy; avoid in-place writes (elements share memory)
x.repeat(B, 1, 1)            # physical copy along each dim

matmul broadcasting

A	B	A @ B
`(n, k)`	`(k, m)`	`(n, m)` — plain matmul
`(B, n, k)`	`(B, k, m)`	`(B, n, m)` — batched
`(B, n, k)`	`(k, m)`	`(B, n, m)` — right broadcasts
`(B, H, n, k)`	`(B, H, k, m)`	`(B, H, n, m)` — attention block

einsum patterns

import torch
# batched matmul
torch.einsum("bij,bjk->bik", A, B)

# multi-head attention QK^T:  Q (B, H, N, d), K (B, H, N, d)
attn = torch.einsum("bhid,bhjd->bhij", Q, K) / d**0.5
out  = torch.einsum("bhij,bhjd->bhid", attn.softmax(-1), V)

# dot product of every row pair
torch.einsum("nd,md->nm", X, Y)

# weighted sum across feature dim
torch.einsum("nd,d->n", X, w)
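
When an einsum string is in doubt, comparing it against the equivalent `@` expression settles it. The sketch below does that for the attention pattern, with small made-up sizes for `B`, `H`, `N` and `d`.

```python
import torch

torch.manual_seed(0)
B, H, N, d = 2, 4, 5, 8
Q = torch.randn(B, H, N, d)
K = torch.randn(B, H, N, d)
V = torch.randn(B, H, N, d)

# Scores: contract over the feature index d, keep query i and key j.
scores_einsum = torch.einsum("bhid,bhjd->bhij", Q, K) / d**0.5
scores_matmul = (Q @ K.transpose(-2, -1)) / d**0.5      # (B, H, N, N)

# Output: contract over the key index j.
weights = scores_einsum.softmax(dim=-1)
out_einsum = torch.einsum("bhij,bhjd->bhid", weights, V)
out_matmul = weights @ V                                # (B, H, N, d)
```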

6. PyTorch training loop boilerplate

import torch, random, numpy as np
from torch.utils.data import DataLoader

def set_seed(s=42):
    random.seed(s); np.random.seed(s)
    torch.manual_seed(s); torch.cuda.manual_seed_all(s)

set_seed(42)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

train_loader = DataLoader(train_ds, batch_size=64, shuffle=True,  num_workers=2, pin_memory=True)
val_loader   = DataLoader(val_ds,   batch_size=128, shuffle=False, num_workers=2, pin_memory=True)

model = MyModel().to(device)
optim = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-2)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(optim, T_max=EPOCHS)
loss_fn = torch.nn.CrossEntropyLoss()

best_val = float("inf")
for epoch in range(EPOCHS):
    model.train()
    for xb, yb in train_loader:
        xb, yb = xb.to(device), yb.to(device)
        optim.zero_grad()
        logits = model(xb)
        loss = loss_fn(logits, yb)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optim.step()
    sched.step()

    model.eval()
    tot, n = 0.0, 0
    with torch.no_grad():
        for xb, yb in val_loader:
            xb, yb = xb.to(device), yb.to(device)
            tot += loss_fn(model(xb), yb).item() * xb.size(0)
            n += xb.size(0)
    val_loss = tot / n
    if val_loss < best_val:
        best_val = val_loss
        torch.save({"model": model.state_dict(),
                    "optim": optim.state_dict(),
                    "epoch": epoch}, "best.pt")

Resume: ckpt = torch.load("best.pt", map_location=device); model.load_state_dict(ckpt["model"]); optim.load_state_dict(ckpt["optim"]).

7. Loss / activation / optimizer reference

Losses

Loss	Use	Target shape & dtype
`nn.CrossEntropyLoss`	multi-class, integer labels	`(B,)` long, expects raw logits `(B, K)`
`nn.BCEWithLogitsLoss`	binary / multi-label	`(B,)` or `(B, K)` float in {0,1}, raw logits
`nn.NLLLoss`	after `log_softmax`	same as CE but log-probs input
`nn.MSELoss`	regression	float, same shape as input
`nn.L1Loss` / `SmoothL1Loss`	robust regression	float, same shape
`nn.KLDivLoss`	distillation	input is log-probs, target is probs
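
The "raw logits" and dtype columns are where most loss bugs come from. This sketch, on random made-up inputs, checks two equivalences implied by the table: `CrossEntropyLoss` equals `NLLLoss` applied to `log_softmax`, and `BCEWithLogitsLoss` equals `BCELoss` applied to `sigmoid`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 3)                 # (B, K) raw scores, no softmax applied
target = torch.tensor([0, 2, 1, 2])        # (B,) int64 class indices

ce = nn.CrossEntropyLoss()(logits, target)
nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), target)

bin_logits = torch.randn(4)                        # (B,) raw scores
bin_target = torch.tensor([1.0, 0.0, 1.0, 0.0])    # float targets, not long
bce_logits = nn.BCEWithLogitsLoss()(bin_logits, bin_target)
bce_manual = nn.BCELoss()(torch.sigmoid(bin_logits), bin_target)
```

In practice the fused versions (`CrossEntropyLoss`, `BCEWithLogitsLoss`) are the ones to reach for, since they are more numerically stable than composing the pieces by hand.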

Activations

Activation	When to use
`ReLU`	default hidden activation for CNN/MLP
`GELU`	transformers (BERT, GPT)
`SiLU` / Swish	modern CNNs, EfficientNet
`LeakyReLU`	when ReLU is dying (lots of zeros)
`Tanh`	RNN cells, bounded output
`Sigmoid`	binary head only — never as hidden
`Softmax(dim=-1)`	turn logits into probabilities (manual, not for CE loss)

Optimizers

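Optimizer	Typical settings	Notes
`SGD`	`lr` (tune), `momentum=0.9`	momentum defaults to 0, so set it explicitly; CV from scratch with LR schedule
`Adam`	`lr=1e-3, betas=(0.9, 0.999)`	library defaults; safe starting point, faster early progress
`AdamW`	`lr=3e-4, weight_decay=1e-2`	library default `lr` is 1e-3; transformers / modern default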
`RMSprop`	`lr=1e-2`	RNNs, RL

Schedulers

torch.optim.lr_scheduler.StepLR(optim, step_size=10, gamma=0.1)
torch.optim.lr_scheduler.CosineAnnealingLR(optim, T_max=EPOCHS)
torch.optim.lr_scheduler.OneCycleLR(optim, max_lr=1e-3, total_steps=N)
torch.optim.lr_scheduler.ReduceLROnPlateau(optim, mode="min", patience=3)

8. Layer shape rules

Conv2d

nn.Conv2d(in_channels=C, out_channels=F,
          kernel_size=k, stride=s, padding=p)
# Input  : (B, C, H, W)
# Output : (B, F, H_out, W_out)
# H_out = floor((H + 2p - k) / s) + 1
# Params : F * (C * k * k + 1)  (the +1 is the bias)
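
The two formulas above are simple enough to compute by hand, but a pair of helpers makes drilling faster. The function names and the example layer sizes below are illustrative; dilation is assumed to be 1.

```python
def conv2d_out(size, k, s=1, p=0):
    """Spatial output size of a conv layer along one axis (dilation 1)."""
    return (size + 2 * p - k) // s + 1

def conv2d_params(c_in, c_out, k, bias=True):
    """Learnable parameters of a square-kernel Conv2d."""
    return c_out * (c_in * k * k + (1 if bias else 0))

h_out = conv2d_out(224, k=7, s=2, p=3)    # (224 + 6 - 7) // 2 + 1 = 112
same = conv2d_out(32, k=3, s=1, p=1)      # padding = k // 2 keeps 32
n_params = conv2d_params(3, 64, k=7)      # 64 * (3 * 49 + 1) = 9472
```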

Other layers

Layer	Input → Output	Param count
`Linear(in, out)`	`(*, in)` → `(*, out)`	`in * out + out`
`BatchNorm2d(C)`	`(B, C, H, W)` → same	`2C` (γ, β) + 2C buffers
`LayerNorm(d)`	`(*, d)` → same	`2d`
`MaxPool2d(k, s)`	`(B, C, H, W)` → `(B, C, H/s, W/s)` when `k = s` divides H and W; in general `floor((H + 2p - k) / s) + 1`	0
`AdaptiveAvgPool2d(1)`	`(B, C, H, W)` → `(B, C, 1, 1)`	0
`Dropout(p)`	same shape	0
`Embedding(V, d)`	`(B, L)` long → `(B, L, d)`	`V * d`
`MultiheadAttention(d, h, batch_first=True)`	`(B, L, d)` → `(B, L, d)`	`4 * d * d` weights (Q,K,V,out) + `4 * d` biases

MultiheadAttention call

attn = nn.MultiheadAttention(embed_dim=d, num_heads=h, batch_first=True)
out, weights = attn(query, key, value,
                    key_padding_mask=pad_mask,  # (B, L) bool, True = ignore
                    attn_mask=causal_mask)      # (L, L) bool/float
# query/key/value: (B, L, d). out: (B, L, d). weights: (B, L, L), averaged over heads by default.

Conv output rule of thumb

"Same" padding for odd k: padding = k // 2 with stride=1 keeps H, W. Halving spatial dims is stride=2.

9. Hugging Face datasets & transformers

datasets

from datasets import load_dataset

ds = load_dataset("imdb")                        # DatasetDict {train, test}
ds = load_dataset("csv", data_files="train.csv") # local
small = ds["train"].shuffle(seed=0).select(range(2000))

def add_len(ex):
    ex["n_words"] = len(ex["text"].split())
    return ex
ds = ds.map(add_len)
ds = ds.filter(lambda ex: ex["n_words"] < 256)
ds.set_format("torch", columns=["input_ids", "attention_mask", "label"])  # call after tokenizing; these columns must exist

AutoTokenizer

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    return tok(batch["text"], truncation=True, max_length=256, padding=False)

ds_tok = ds.map(tokenize, batched=True, remove_columns=["text"])

AutoModelForSequenceClassification + Trainer

from transformers import (
    AutoModelForSequenceClassification, TrainingArguments,
    Trainer, DataCollatorWithPadding,
)
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2,
)

args = TrainingArguments(   # note: evaluation_strategy was renamed eval_strategy in transformers >=4.41; older versions need the old name
    output_dir="out",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=2,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)

def metrics(p):
    preds = np.argmax(p.predictions, axis=1)
    return {"accuracy": accuracy_score(p.label_ids, preds),
            "f1": f1_score(p.label_ids, preds, average="macro")}

trainer = Trainer(
    model=model, args=args,
    train_dataset=ds_tok["train"], eval_dataset=ds_tok["test"],
    tokenizer=tok,               # newer transformers releases rename this argument to processing_class
    data_collator=DataCollatorWithPadding(tok),
    compute_metrics=metrics,
)
trainer.train()
results = trainer.predict(ds_tok["test"])

10. Complexity quick reference

Attention & transformers

Quantity	Cost	Note
Self-attention compute	O(n² · d)	n = seq len, d = model dim
Self-attention memory (attn matrix)	O(n²)	per head; FlashAttention avoids materializing the n × n matrix (memory becomes linear in n) but does not reduce the FLOPs
FFN block	O(n · d²)	two linears with hidden 4d
KV cache (autoregressive)	2 · L · n · d values × bytes per value	L layers, one K and one V each (2 · n · d values per layer); fp16 halves the bytes vs fp32
Decode step with cache	O(n · d) per token	vs O(n² · d) without cache
Multi-head Q,K,V,O projections	4 · d² params	independent of head count
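
As a worked instance of the KV-cache row, the sketch below counts `2 · L · n · d` cached values across all `L` layers and multiplies by the bytes per value. The helper name and the BERT-base-sized numbers are illustrative; it assumes standard multi-head attention where K and V each have width `d`.

```python
def kv_cache_bytes(n_layers, seq_len, d_model, bytes_per_value=2, batch=1):
    """Bytes needed to cache K and V for every layer."""
    values = 2 * n_layers * seq_len * d_model * batch   # 2 = one K and one V
    return values * bytes_per_value

fp16 = kv_cache_bytes(12, 512, 768, bytes_per_value=2)   # 18_874_368 bytes
fp32 = kv_cache_bytes(12, 512, 768, bytes_per_value=4)   # twice the fp16 figure
fp16_mib = fp16 / 2**20                                  # 18.0 MiB
```

The cache grows linearly in sequence length and batch size, so those are the two knobs to turn down first when memory is tight.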

CNN layer math

Quantity	Formula
Conv2d params (with bias)	`F · (C · k · k + 1)`
Conv2d FLOPs (per image)	`2 · H_out · W_out · F · C · k · k` (mult + add; drop the 2 to count MACs)
Dense layer params	`in · out + out`
Dense layer FLOPs	`2 · in · out` (mult + add)
Embedding params	`V · d` (usually the biggest matrix)

Classical ML algorithms

Algorithm	Train	Predict (per sample)
Linear / logistic regression	O(n · d² + d³) closed form (linear only); O(n · d) per pass over the data with gradient descent	O(d)
k-NN	O(1) (just store)	O(n · d) brute; O(d · log n) with KD-tree (low d)
Decision tree	O(n · d · log n)	O(depth)
Random forest (T trees)	O(T · n · d · log n)	O(T · depth)
Gradient boosting (T trees)	O(T · n · d · log n) sequential	O(T · depth)
SVM (RBF kernel)	O(n² · d) to O(n³ · d)	O(SV · d)
K-means (k clusters, I iters)	O(I · n · k · d)	O(k · d)
PCA via SVD	O(min(n · d², n² · d))	O(k · d) per sample

Sanity checks for the contest

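BERT-base: 12 layers, d=768, 12 heads, 110M params. Its position embeddings cap inputs at 512 tokens, and attention cost grows as O(n²), so a shorter max_length is much cheaper on Colab CPU.
A 1B-param model in fp32 needs ~4 GB just for weights; fp16 halves that. Training with Adam adds two more fp32 values per weight (first and second moments, ~8 GB), plus ~4 GB of gradients.
Doubling batch size roughly doubles activation memory (weights and optimizer state stay the same). It shortens wall time per epoch only while the accelerator has spare capacity, and it stops helping once you hit OOM.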
If your forward pass uses N GFLOPs, the backward pass uses ~2N. Total training ≈ 3 · forward FLOPs · samples · epochs.
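
The memory arithmetic can be sketched as a small helper. It assumes plain fp32 training with Adam and deliberately ignores activations, so treat the result as a lower bound; the function name is illustrative.

```python
def training_memory_gb(n_params, bytes_per_value=4):
    """Rough training footprint in GB (1 GB = 1e9 bytes), excluding activations."""
    weights = n_params * bytes_per_value
    grads = weights                  # one gradient value per weight
    adam_state = 2 * weights         # first and second moment estimates
    total = weights + grads + adam_state
    parts = [("weights", weights), ("grads", grads),
             ("adam_state", adam_state), ("total", total)]
    return {name: nbytes / 1e9 for name, nbytes in parts}

budget = training_memory_gb(1_000_000_000)
# weights 4.0, grads 4.0, adam_state 8.0, total 16.0
```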

Where to go next

Once this page feels boring, you have the syntax internalised. Run the mock contests closed-book and only open this page when you actually get stuck — that's the contest-day workflow. The end-to-end notebooks show every API here in context.