1. NumPy
Broadcasting rules
Compare shapes right-to-left. For each pair: equal, or one is 1, or one is missing → OK. Otherwise error.
| A shape | B shape | Result | Why |
| (4, 3) | (3,) | (4, 3) | row vector added to each row |
| (4, 3) | (4, 1) | (4, 3) | column vector added to each column |
| (4, 3) | (4,) | error | reshape to (4, 1) first |
| (n, 1, d) | (1, n, d) | (n, n, d) | pairwise broadcast trick |
| (b, h, n, d) | (d,) | (b, h, n, d) | per-feature scaling |
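The rows above can be checked without allocating any arrays. The sketch below is a minimal illustration, assuming NumPy 1.20 or newer (for `np.broadcast_shapes`); the helper name `broadcast_result` is made up for this example.

```python
import numpy as np

def broadcast_result(shape_a, shape_b):
    """Return the broadcast shape, or None if the shapes are incompatible."""
    try:
        return np.broadcast_shapes(shape_a, shape_b)
    except ValueError:
        return None

print(broadcast_result((4, 3), (3,)))          # (4, 3)
print(broadcast_result((4, 3), (4, 1)))        # (4, 3)
print(broadcast_result((4, 3), (4,)))          # None: trailing dims 3 and 4 clash
print(broadcast_result((5, 1, 2), (1, 5, 2)))  # (5, 5, 2)

# Fixing the failing case: give the length-4 vector a trailing axis.
A = np.zeros((4, 3))
v = np.arange(4)
fixed = A + v[:, None]                         # (4, 3) + (4, 1) -> (4, 3)
```

Indexing with `v[:, None]` is the same as `v.reshape(4, 1)` and is usually the shortest way to line a vector up with rows.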
Shape ops
a.reshape(2, -1) # -1 means "infer"
a.ravel() # flatten, returns view if possible
np.squeeze(a, axis=1) # drop length-1 dims
np.expand_dims(a, axis=0) # equivalent to a[None]
a.transpose(0, 2, 1) # permute axes
np.stack([a, b], axis=0) # new axis, equal shapes
np.concatenate([a, b], axis=1) # join on existing axis
np.tile(a, (2, 3)) # repeat along axes
np.repeat(a, 3, axis=0) # repeat each element
Axis-aware reductions
| Call | Input | Output | Note |
| X.mean(axis=0) | (N, D) | (D,) | per-column mean |
| X.mean(axis=1) | (N, D) | (N,) | per-row mean |
| X.sum(axis=-1, keepdims=True) | (N, D) | (N, 1) | preserve rank for broadcasting |
| X.argmax(axis=1) | (N, K) | (N,) | predicted class indices |
| np.linalg.norm(X, axis=1) | (N, D) | (N,) | L2 norm per row |
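Why `keepdims=True` matters is easiest to see when the reduced result is used to rescale the original array. The toy data below is made up for illustration.

```python
import numpy as np

X = np.array([[1.0, 2.0, 3.0],
              [0.0, 0.0, 4.0]])            # (N, D) = (2, 3)

row_sums = X.sum(axis=-1, keepdims=True)   # (2, 1): lines up with rows
probs = X / row_sums                       # (2, 3), each row sums to 1

flat_sums = X.sum(axis=-1)                 # (2,)
try:
    X / flat_sums                          # (2, 3) vs (2,): trailing dims 3 and 2 clash
    failed = False
except ValueError:
    failed = True
```

When N happens to equal D, the version without `keepdims` does not raise. It silently divides column-wise instead, which is a common source of wrong results.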
Slicing & indexing
a[1:5:2] # start:stop:step
a[::-1] # reverse
a[mask] # boolean mask; mask.shape must match a.shape (or a's leading dims), masks do not broadcast
a[[0, 2, 5]] # fancy index — returns a copy
a[np.arange(N), y] # pick a[i, y[i]] for each i (gather)
a[:, None] - b[None, :] # pairwise diff via broadcasting
np.where(cond, x, y) # vectorized ternary
np.clip(a, 0, 1) # element-wise clip
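Two of the patterns above are worth seeing with concrete numbers: the gather `a[np.arange(N), y]` and the pairwise difference. The arrays below are toy values chosen for illustration.

```python
import numpy as np

# Gather: pick one score per row, e.g. the score of each sample's true class.
scores = np.array([[0.1, 0.7, 0.2],
                   [0.8, 0.1, 0.1],
                   [0.3, 0.3, 0.4]])                 # (N, K) = (3, 3)
y = np.array([1, 0, 2])                              # (N,) class index per row
picked = scores[np.arange(len(y)), y]                # (N,) -> [0.7, 0.8, 0.4]

# Pairwise squared distances between two point sets via broadcasting.
a = np.array([[0.0, 0.0], [1.0, 0.0]])               # (2, 2)
b = np.array([[0.0, 0.0], [0.0, 2.0], [3.0, 4.0]])   # (3, 2)
diff = a[:, None, :] - b[None, :, :]                 # (2, 1, 2) - (1, 3, 2) -> (2, 3, 2)
sq_dist = (diff ** 2).sum(axis=-1)                   # (2, 3)
```

Note that `scores[:, y]` would return a (3, 3) array instead, because it selects whole columns rather than one element per row.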
Random (modern API)
rng = np.random.default_rng(0)
rng.standard_normal((4, 3)) # N(0, 1)
rng.normal(loc=0, scale=1, size=(4, 3))
rng.uniform(0, 1, size=10)
rng.integers(0, 10, size=5) # exclusive of high
rng.choice(N, size=k, replace=False)
rng.permutation(N)
2. pandas
read_csv common args
| Arg | Use |
| sep="," | or "\t", r"\s+" for whitespace |
| header=0 / None | row index of header, or no header |
| names=[...] | supply column names (with header=None) |
| index_col="id" | use this column as the row index |
| usecols=["a","b"] | only load these columns (saves memory) |
| dtype={"x": "float32"} | force column dtypes |
| parse_dates=["ts"] | parse to datetime64 |
| na_values=["?", "NA"] | extra strings to treat as NaN |
| nrows=1000 | peek at file |
| chunksize=10000 | iterator of DataFrames for big files |
loc vs iloc
df.loc["a":"c", ["x", "y"]] # LABEL slicing, INCLUSIVE on both ends
df.iloc[0:3, [0, 1]] # POSITION slicing, exclusive right
df.loc[df["x"] > 0, "y"] = 1 # safe conditional write
df.at["row3", "x"] # scalar label access (fastest)
df.iat[2, 0] # scalar position access
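The inclusive versus exclusive slicing difference is easy to get wrong, so here is a small check on a made-up frame with string labels.

```python
import pandas as pd

df = pd.DataFrame(
    {"x": [10, 20, 30, 40], "y": [1, 2, 3, 4]},
    index=["a", "b", "c", "d"],
)

by_label = df.loc["a":"c", "x"]    # rows a, b, c: label slices include the end
by_position = df.iloc[0:2, 0]      # rows a, b: position slices exclude the end

# Conditional write through a single .loc call (avoids chained assignment).
df.loc[df["x"] > 25, "y"] = 0
```

Writing `df[df["x"] > 25]["y"] = 0` instead may modify a temporary copy and leave `df` unchanged, which is why the single `.loc` call is the safer form.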
GroupBy + agg
df.groupby("user").size() # count per group
df.groupby("user")["amt"].mean() # series result
df.groupby(["user", "day"])["amt"].sum() # multi-key, MultiIndex
df.groupby("user").agg(
    n=("amt", "size"),
    mean_amt=("amt", "mean"),
    max_amt=("amt", "max"),
).reset_index() # named columns, flat
df.groupby("user")["amt"].transform("mean") # broadcast back to row shape
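The difference between `agg` (one row per group) and `transform` (one value per original row) is clearer on a tiny example. The data below is made up, and the column names `user_mean` and `amt_centered` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "user": ["u1", "u1", "u2", "u2", "u2"],
    "amt":  [10.0, 30.0, 5.0, 5.0, 20.0],
})

# agg: collapses to one row per group.
summary = df.groupby("user").agg(
    n=("amt", "size"),
    mean_amt=("amt", "mean"),
).reset_index()

# transform: same length as df, aligned to df's index, so it can be assigned back.
df["user_mean"] = df.groupby("user")["amt"].transform("mean")
df["amt_centered"] = df["amt"] - df["user_mean"]
```

`transform` is the usual choice for per-group features (centering, ratios to the group mean) because it avoids a merge back onto the original rows.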
Merge / join
pd.merge(a, b, on="user_id", how="left") # how: left/right/inner/outer
pd.merge(a, b, left_on="uid", right_on="user_id")
pd.merge(a, b, on="k", how="outer", indicator=True) # _merge col shows source
a.join(b, on="user_id", how="left") # join uses b's index
pd.concat([a, b], axis=0, ignore_index=True) # stack rows
value_counts & pivot_table
df["label"].value_counts() # counts, sorted desc
df["label"].value_counts(normalize=True) # proportions
pd.crosstab(df["a"], df["b"]) # 2-way frequency
df.pivot_table(
    index="date", columns="product",
    values="sales", aggfunc="sum", fill_value=0,
)
Datetime
df["ts"] = pd.to_datetime(df["ts"], errors="coerce")
df["hour"] = df["ts"].dt.hour
df["dow"] = df["ts"].dt.dayofweek
df["month"] = df["ts"].dt.month
df["is_weekend"] = df["dow"] >= 5
df.set_index("ts").resample("1D")["amt"].sum()
3. matplotlib
Figure / axes anatomy
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(6, 4)) # one axes
fig, axes = plt.subplots(2, 3, figsize=(12, 6), sharex=True)
ax = axes[0, 1] # row, col indexing
fig.suptitle("Diagnostics")
plt.tight_layout()
fig.savefig("out.png", dpi=150, bbox_inches="tight")
The six plots
| Plot | Call | When |
| line | ax.plot(x, y, label="train") | loss curves, time series |
| scatter | ax.scatter(x, y, c=labels, s=8, alpha=0.6) | 2-D feature view, residuals |
| histogram | ax.hist(x, bins=50, density=True) | feature distribution |
| bar | ax.bar(names, values) | class counts, model comparison |
| image | ax.imshow(img, cmap="gray") | image samples, weight maps |
| heatmap | im = ax.imshow(M); fig.colorbar(im, ax=ax) | confusion matrix, attention |
Style fixes
ax.set_xlabel("epoch"); ax.set_ylabel("loss"); ax.set_title("...")
ax.set_yscale("log") # long loss curves
ax.set_xlim(0, 100); ax.set_ylim(0, 1)
ax.legend(loc="best"); ax.grid(alpha=0.3)
ax.tick_params(axis="x", rotation=45)
fig.colorbar(im, ax=ax, shrink=0.8) # im is the handle returned by ax.imshow(...)
4. scikit-learn
Estimator API
est.fit(X, y) # learn parameters
est.predict(X) # supervised: discrete labels or regression values
est.predict_proba(X) # classifiers (where supported)
est.transform(X) # transformers (scalers, encoders, PCA)
est.fit_transform(X) # both in one call (on TRAIN only)
est.score(X, y) # default metric (accuracy / R^2)
Preprocessing
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
pre = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city", "plan"]),
])
X_train_t = pre.fit_transform(X_train)
X_val_t = pre.transform(X_val) # NO refit on val
Pipeline
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
pipe = Pipeline([
    ("pre", pre),
    ("clf", LogisticRegression(max_iter=1000, C=1.0)),
])
pipe.fit(X_train, y_train)
pipe.predict(X_val)
model_selection
from sklearn.model_selection import (
    train_test_split, KFold, StratifiedKFold,
    cross_val_score, GridSearchCV,
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y,
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="f1_macro")
grid = GridSearchCV(
    pipe,
    param_grid={"clf__C": [0.1, 1, 10]},
    cv=cv, scoring="f1_macro", n_jobs=-1,
)
grid.fit(X, y)
grid.best_params_, grid.best_score_
Metrics
| Task | Metric | Call |
| classification | accuracy | accuracy_score(y, yp) |
| classification | F1 (macro / binary) | f1_score(y, yp, average="macro") |
| classification | ROC-AUC | roc_auc_score(y, yp_proba) (binary: pass the positive-class scores, e.g. proba[:, 1]) |
| classification | confusion matrix | confusion_matrix(y, yp) |
| regression | MSE / RMSE | mean_squared_error(y, yp) / root_mean_squared_error(y, yp) (sklearn >= 1.4; squared=False was removed in 1.6) |
| regression | MAE | mean_absolute_error(y, yp) |
| regression | R² | r2_score(y, yp) |
Common estimators
| Estimator | Import | Key defaults |
| LogisticRegression | linear_model | C=1.0, penalty="l2", max_iter=100 (bump to 1000) |
| Ridge / Lasso | linear_model | alpha=1.0 |
| KNeighborsClassifier | neighbors | n_neighbors=5 |
| DecisionTreeClassifier | tree | max_depth=None, min_samples_split=2 |
| RandomForestClassifier | ensemble | n_estimators=100, max_features="sqrt" |
| GradientBoostingClassifier | ensemble | n_estimators=100, learning_rate=0.1, max_depth=3 |
| SVC | svm | C=1.0, kernel="rbf", gamma="scale" |
| KMeans | cluster | n_clusters=8, n_init="auto" |
| PCA | decomposition | n_components=None |
5. PyTorch shape operations
view vs reshape
x.view(B, -1) # requires contiguous memory; errors if not
x.reshape(B, -1) # works on non-contiguous (may copy)
x.contiguous().view(B, -1) # canonical fix after transpose/permute
permute vs transpose
x.transpose(1, 2) # swap exactly two dims
x.permute(0, 2, 3, 1) # arbitrary reorder, e.g. NCHW -> NHWC
# Both return a view with strided memory; follow with .contiguous() before .view().
unsqueeze / squeeze / expand / repeat
x.unsqueeze(0) # add axis at position 0: (D,) -> (1, D)
x.squeeze(1) # remove axis 1 if it is length 1
x.expand(B, -1, -1) # broadcast view, NO memory copy, READ-ONLY-ish
x.repeat(B, 1, 1) # physical copy along each dim
matmul broadcasting
| A | B | A @ B |
| (n, k) | (k, m) | (n, m), plain matmul |
| (B, n, k) | (B, k, m) | (B, n, m), batched |
| (B, n, k) | (k, m) | (B, n, m), right broadcasts |
| (B, H, n, k) | (B, H, k, m) | (B, H, n, m), attention block |
einsum patterns
import torch
# batched matmul
torch.einsum("bij,bjk->bik", A, B)
# multi-head attention QK^T: Q (B, H, N, d), K (B, H, N, d)
attn = torch.einsum("bhid,bhjd->bhij", Q, K) / d**0.5
out = torch.einsum("bhij,bhjd->bhid", attn.softmax(-1), V)
# dot product of every row pair
torch.einsum("nd,md->nm", X, Y)
# weighted sum across feature dim
torch.einsum("nd,d->n", X, w)
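A quick way to trust an einsum string is to compare it against the equivalent matmul. The sketch below uses NumPy as a stand-in so it runs without a GPU stack; this is an assumption of the example, relying on `np.einsum` accepting the same subscript strings as the `torch.einsum` calls above. Sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
B, H, N, d = 2, 3, 4, 5
Q = rng.standard_normal((B, H, N, d))
K = rng.standard_normal((B, H, N, d))
V = rng.standard_normal((B, H, N, d))

# Scores: contract over the feature index d, keep query index i and key index j.
scores = np.einsum("bhid,bhjd->bhij", Q, K) / d**0.5     # (B, H, N, N)
scores_ref = Q @ K.transpose(0, 1, 3, 2) / d**0.5        # same thing via matmul

# Row-wise softmax over the key axis (subtract the max for numerical stability).
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Output: weighted sum of value vectors, contracting over the key index j.
out = np.einsum("bhij,bhjd->bhid", weights, V)           # (B, H, N, d)
out_ref = weights @ V
```

Reading rule: an index that appears in the inputs but not after `->` is summed over; every other index survives into the output in the order written.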
6. PyTorch training loop boilerplate
import torch, random, numpy as np
from torch.utils.data import DataLoader

def set_seed(s=42):
    random.seed(s); np.random.seed(s)
    torch.manual_seed(s); torch.cuda.manual_seed_all(s)

set_seed(42)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# train_ds, val_ds, MyModel and EPOCHS are yours to define
train_loader = DataLoader(train_ds, batch_size=64, shuffle=True, num_workers=2, pin_memory=True)
val_loader = DataLoader(val_ds, batch_size=128, shuffle=False, num_workers=2, pin_memory=True)
model = MyModel().to(device)
optim = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-2)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(optim, T_max=EPOCHS)
loss_fn = torch.nn.CrossEntropyLoss()

best_val = float("inf")
for epoch in range(EPOCHS):
    model.train()
    for xb, yb in train_loader:
        xb, yb = xb.to(device), yb.to(device)
        optim.zero_grad()
        logits = model(xb)
        loss = loss_fn(logits, yb)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optim.step()
    sched.step()  # once per epoch, matching T_max=EPOCHS

    model.eval()
    tot, n = 0.0, 0
    with torch.no_grad():
        for xb, yb in val_loader:
            xb, yb = xb.to(device), yb.to(device)
            tot += loss_fn(model(xb), yb).item() * xb.size(0)  # undo the batch mean
            n += xb.size(0)
    val_loss = tot / n
    if val_loss < best_val:
        best_val = val_loss
        torch.save({"model": model.state_dict(),
                    "optim": optim.state_dict(),
                    "epoch": epoch}, "best.pt")
Resume: ckpt = torch.load("best.pt", map_location=device); model.load_state_dict(ckpt["model"]); optim.load_state_dict(ckpt["optim"]).
7. Loss / activation / optimizer reference
Losses
| Loss | Use | Target shape & dtype |
| nn.CrossEntropyLoss | multi-class, integer labels | (B,) long, expects raw logits (B, K) |
| nn.BCEWithLogitsLoss | binary / multi-label | (B,) or (B, K) float in {0,1}, raw logits |
| nn.NLLLoss | after log_softmax | same as CE but log-probs input |
| nn.MSELoss | regression | float, same shape as input |
| nn.L1Loss / SmoothL1Loss | robust regression | float, same shape |
| nn.KLDivLoss | distillation | input is log-probs, target is probs |
Activations
| Activation | When to use |
| ReLU | default hidden activation for CNN/MLP |
| GELU | transformers (BERT, GPT) |
| SiLU / Swish | modern CNNs, EfficientNet |
| LeakyReLU | when ReLU is dying (lots of zeros) |
| Tanh | RNN cells, bounded output |
| Sigmoid | binary head only, never as hidden |
| Softmax(dim=-1) | turn logits into probabilities (manual, not for CE loss) |
Optimizers
| Optimizer | Typical settings | Notes |
| SGD | lr, momentum=0.9 | CV from scratch with LR schedule; PyTorch's own default is momentum=0 |
| Adam | lr=1e-3, betas=(0.9, 0.999) | safe default; faster early progress |
| AdamW | lr=3e-4, weight_decay=1e-2 | transformers / modern default; PyTorch's own default is lr=1e-3 |
| RMSprop | lr=1e-2 | RNNs, RL |
Schedulers
torch.optim.lr_scheduler.StepLR(optim, step_size=10, gamma=0.1)
torch.optim.lr_scheduler.CosineAnnealingLR(optim, T_max=EPOCHS)
torch.optim.lr_scheduler.OneCycleLR(optim, max_lr=1e-3, total_steps=N)
torch.optim.lr_scheduler.ReduceLROnPlateau(optim, mode="min", patience=3)
8. Layer shape rules
Conv2d
nn.Conv2d(in_channels=C, out_channels=F,
          kernel_size=k, stride=s, padding=p)
# Input : (B, C, H, W)
# Output : (B, F, H_out, W_out)
# H_out = floor((H + 2p - k) / s) + 1
# Params : F * (C * k * k + 1) (the +1 is the bias)
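The two formulas above are simple enough to keep as helpers for checking a model by hand. The sketch below is plain Python and assumes dilation=1 and groups=1; the function names are made up for this example.

```python
def conv2d_out_size(size, k, s=1, p=0):
    """Output length along one spatial dim: floor((size + 2p - k) / s) + 1."""
    return (size + 2 * p - k) // s + 1

def conv2d_params(c_in, c_out, k, bias=True):
    """Weights are c_out * c_in * k * k; the bias adds one value per output channel."""
    return c_out * (c_in * k * k + (1 if bias else 0))

print(conv2d_out_size(32, k=3, s=1, p=1))    # 32: "same" padding keeps H, W
print(conv2d_out_size(32, k=3, s=2, p=1))    # 16: stride 2 halves H, W
print(conv2d_out_size(224, k=7, s=2, p=3))   # 112
print(conv2d_params(3, 64, k=3))             # 1792 = 64 * (3 * 3 * 3 + 1)
```

Apply `conv2d_out_size` once for H and once for W when the input is not square.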
Other layers
| Layer | Input → Output | Param count |
| Linear(in, out) | (*, in) → (*, out) | in * out + out |
| BatchNorm2d(C) | (B, C, H, W) → same | 2C (γ, β) + 2C buffers |
| LayerNorm(d) | (*, d) → same | 2d |
| MaxPool2d(k, s) | (B, C, H, W) → (B, C, ⌊H/s⌋, ⌊W/s⌋) when k = s | 0 |
| AdaptiveAvgPool2d(1) | (B, C, H, W) → (B, C, 1, 1) | 0 |
| Dropout(p) | same shape | 0 |
| Embedding(V, d) | (B, L) long → (B, L, d) | V * d |
| MultiheadAttention(d, h, batch_first=True) | (B, L, d) → (B, L, d) | 4 * d * d + 4 * d (Q, K, V, out weights and biases) |
MultiheadAttention call
attn = nn.MultiheadAttention(embed_dim=d, num_heads=h, batch_first=True)
out, weights = attn(query, key, value,
                    key_padding_mask=pad_mask, # (B, L) bool, True = ignore
                    attn_mask=causal_mask) # (L, L) bool/float
# query/key/value: (B, L, d). out: (B, L, d). weights: (B, L, L).
Conv output rule of thumb
"Same" padding for odd k: padding = k // 2 with stride=1 keeps H, W. Halving spatial dims is stride=2.
9. Hugging Face datasets & transformers
datasets
from datasets import load_dataset
ds = load_dataset("imdb") # DatasetDict {train, test}
ds = load_dataset("csv", data_files="train.csv") # local
small = ds["train"].shuffle(seed=0).select(range(2000))
def add_len(ex):
    ex["n_words"] = len(ex["text"].split())
    return ex
ds = ds.map(add_len)
ds = ds.filter(lambda ex: ex["n_words"] < 256)
ds.set_format("torch", columns=["input_ids", "attention_mask", "label"]) # call after tokenizing, once these columns exist
AutoTokenizer
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
def tokenize(batch):
    return tok(batch["text"], truncation=True, max_length=256, padding=False)
ds_tok = ds.map(tokenize, batched=True, remove_columns=["text"])
AutoModelForSequenceClassification + Trainer
from transformers import (
    AutoModelForSequenceClassification, TrainingArguments,
    Trainer, DataCollatorWithPadding,
)
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2,
)
args = TrainingArguments( # eval_strategy was named evaluation_strategy before transformers 4.41
    output_dir="out",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=2,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)
def metrics(p):
    preds = np.argmax(p.predictions, axis=1)
    return {"accuracy": accuracy_score(p.label_ids, preds),
            "f1": f1_score(p.label_ids, preds, average="macro")}
trainer = Trainer(
    model=model, args=args,
    train_dataset=ds_tok["train"], eval_dataset=ds_tok["test"],
    tokenizer=tok, # newer transformers releases deprecate this in favor of processing_class=tok
    data_collator=DataCollatorWithPadding(tok),
    compute_metrics=metrics,
)
trainer.train()
results = trainer.predict(ds_tok["test"])
10. Complexity quick reference
Attention & transformers
| Quantity | Cost | Note |
| Self-attention compute | O(n² · d) | n = seq len, d = model dim |
| Self-attention memory (attn matrix) | O(n²) | per head; FlashAttention avoids materializing the full matrix (memory linear in n) but does not reduce the FLOPs |
| FFN block | O(n · d²) | two linears with hidden 4d |
| KV cache (autoregressive) | 2 · L · n · d elements in total | K and V for each of L layers; multiply by bytes per element (fp16 is half of fp32) |
| Decode step with cache | O(n · d) per token | vs O(n² · d) without cache |
| Multi-head Q,K,V,O projections | 4 · d² params | independent of head count |
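To turn these rows into numbers, the sketch below estimates KV-cache and attention-matrix sizes in plain Python. It assumes standard multi-head attention where K and V each have full width d for every layer (no grouped-query or multi-query sharing), and the example configuration is hypothetical.

```python
def kv_cache_bytes(n_layers, seq_len, d_model, bytes_per_elem=2, batch=1):
    """K and V, each (seq_len, d_model), stored for every layer."""
    return 2 * n_layers * seq_len * d_model * bytes_per_elem * batch

def attn_matrix_bytes(seq_len, n_heads, bytes_per_elem=2, batch=1):
    """Full (seq_len x seq_len) score matrix for every head of ONE layer."""
    return batch * n_heads * seq_len * seq_len * bytes_per_elem

MIB = 1024 ** 2

# Hypothetical config: 12 layers, d_model=768, 12 heads, 2048 tokens, fp16.
cache = kv_cache_bytes(n_layers=12, seq_len=2048, d_model=768)
scores = attn_matrix_bytes(seq_len=2048, n_heads=12)

print(cache / MIB)    # 72.0  (grows linearly with seq_len)
print(scores / MIB)   # 96.0  (grows quadratically with seq_len)
```

Doubling the sequence length doubles the cache but quadruples the materialized score matrix, which is the reason long contexts run out of memory in attention first.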
CNN layer math
| Quantity | Formula |
| Conv2d params (with bias) | F · (C · k · k + 1) |
| Conv2d FLOPs (per image) | 2 · H_out · W_out · F · C · k · k (mult + add) |
| Dense layer params | in · out + out |
| Dense layer FLOPs | 2 · in · out (mult + add) |
| Embedding params | V · d (usually the biggest matrix) |
Classical ML algorithms
| Algorithm | Train | Predict (per sample) |
| Linear / logistic regression | O(n · d² + d³) closed form (linear only); O(n · d) per full-batch gradient step | O(d) |
| k-NN | O(1) (just store) | O(n · d) brute; O(d · log n) with KD-tree (low d) |
| Decision tree | O(n · d · log n) | O(depth) |
| Random forest (T trees) | O(T · n · d · log n) | O(T · depth) |
| Gradient boosting (T trees) | O(T · n · d · log n) sequential | O(T · depth) |
| SVM (RBF kernel) | O(n² · d) to O(n³ · d) | O(SV · d) |
| K-means (k clusters, I iters) | O(I · n · k · d) | O(k · d) |
| PCA via SVD | O(min(n · d², n² · d)) | O(k · d) per sample |
Sanity checks for the contest
- BERT-base: 12 layers, d=768, 12 heads, 110M params; its learned position embeddings cap input at 512 tokens, and attention cost grows as O(n²), so long inputs are slow on Colab CPU.
- A 1B-param model in fp32 needs ~4 GB just for weights; fp16 halves that. Adam state adds 2× more.
- Doubling batch size roughly doubles activation memory (weights and optimizer state stay fixed) and can cut wall time per epoch by up to half if the GPU was underutilized, until you hit OOM.
- If your forward pass uses N GFLOPs, the backward pass uses ~2N. Total training ≈ 3 · forward FLOPs · samples · epochs.
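The memory and FLOP rules of thumb above can be wrapped in two small helpers for back-of-envelope checks. This is a rough sketch in plain Python: it assumes weights, gradients and both Adam moments are stored at the same precision, and it ignores activations, which depend on batch size and architecture. The function names and the example workload are made up.

```python
def training_memory_gb(n_params, bytes_per_param=4):
    """Weights + gradients + Adam first and second moments, in GB (1e9 bytes)."""
    weights = n_params * bytes_per_param
    grads = weights                 # one gradient per weight
    adam_state = 2 * weights        # two moment estimates per weight
    return {
        "weights": weights / 1e9,
        "grads": grads / 1e9,
        "adam_state": adam_state / 1e9,
        "total": (weights + grads + adam_state) / 1e9,
    }

def training_flops(forward_flops, n_samples, n_epochs):
    """Backward costs about 2x forward, so a training step is about 3x forward."""
    return 3 * forward_flops * n_samples * n_epochs

mem = training_memory_gb(1_000_000_000)          # 1B params in fp32
print(mem)   # weights 4.0, grads 4.0, adam_state 8.0, total 16.0

# Hypothetical workload: 2 GFLOPs per forward pass, 50k samples, 10 epochs.
total = training_flops(2_000_000_000, 50_000, 10)
print(total)  # 3000000000000000, i.e. 3e15 FLOPs
```

Dividing `total` by a device's sustained FLOP/s gives a lower bound on training time; real runs are slower because of data loading and imperfect utilization.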
Where to go next
Once this page feels boring, you have the syntax internalised. Run the
mock contests closed-book and only open this page when you actually get
stuck — that's the contest-day workflow. The end-to-end notebooks
show every API here in context.