Python data stack

The exact subset of Python you need for USAAIO: NumPy for arrays, pandas for tabular data, matplotlib + seaborn for plots, scikit-learn for ML, and PyTorch for deep learning. Set them up once, learn the idioms, then stop thinking about syntax.

Syllabus link. Mapped to the Python & tooling block of the official USAAIO syllabus.

Practice these 4 end-to-end pipelines on Colab. Once the idioms below feel comfortable, open the end-to-end notebooks — full pipelines from raw data to submission for tabular ML, CIFAR-10, IMDB transformers, and a Bayesian A/B test.

TL;DR. Master five things: (1) NumPy vectorization and broadcasting; (2) pandas DataFrame ops — filtering, groupby, joins, missing-value handling; (3) producing publication-quality plots with matplotlib/seaborn; (4) debugging shape and dtype bugs with .shape/.dtype and timing with %timeit; (5) reproducibility — seeds, deterministic flags, version pinning. USAAIO will hand you a notebook and a CSV; if your fingers know these idioms you keep your brain free for the modeling.

Setup

python3 -m venv .venv
source .venv/bin/activate

pip install numpy pandas matplotlib seaborn scikit-learn
pip install torch torchvision torchaudio
pip install jupyterlab

jupyter lab

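On a Mac with Apple silicon, the default PyTorch wheels include the MPS (Metal) backend, so you don't need CUDA. Tensors still live on the CPU by default; check torch.backends.mps.is_available() and move models and data over explicitly with torch.device("mps").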

1. NumPy — arrays & vectorization

Concept

NumPy's ndarray is a typed, fixed-shape, contiguous block of memory. Operations run in compiled C loops, often with SIMD instructions, so vectorized expressions can run 50–500× faster than equivalent Python loops, depending on the operation and array size. Vectorization means expressing computation as whole-array operations: a + b, np.exp(x), X @ w. Broadcasting is how arrays of different shapes combine: shapes are aligned from the trailing dimension, and length-1 (or missing) dimensions are stretched "virtually" without copying. The rule: compare shapes right-to-left; each pair of dimensions must be equal, or one of them must be 1. So (4, 3) + (3,) works (the (3,) is treated as a row added to every row); (4, 3) + (4,) does not, and you must reshape the second operand to (4, 1).
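The right-to-left rule can be checked without allocating any real data. A minimal sketch, assuming NumPy 1.20 or newer (for np.broadcast_shapes); the array names are illustrative only.

```python
import numpy as np

# Shapes are compared right-to-left; each pair must match or contain a 1.
ok_shape = np.broadcast_shapes((4, 3), (3,))      # (3,) acts as one row
col_shape = np.broadcast_shapes((4, 3), (4, 1))   # (4, 1) acts as one column

# (4, 3) with (4,) fails: the trailing dims 3 and 4 differ and neither is 1.
try:
    np.broadcast_shapes((4, 3), (4,))
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

A = np.ones((4, 3))
v = np.arange(4)
fixed = A + v[:, None]    # v[:, None] has shape (4, 1), so this broadcasts
print(ok_shape, col_shape, mismatch_raised, fixed.shape)
```

Indexing with None (or np.newaxis) is usually the least surprising way to add the length-1 dimension, because the intended axis is visible at the call site.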

Common pitfalls: (a) integer dtypes: np.array([1,2,3]) / 2 returns floats, but // floors toward negative infinity, and assigning a float into an integer array (a[0] = 2.9) silently drops the fractional part; (b) views vs copies: basic slicing returns a view that shares memory, fancy indexing returns a copy; (c) axis=0 reduces over rows (collapses the first dimension, leaving one value per column), axis=1 reduces over columns (leaving one value per row).
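Each of these pitfalls can be reproduced in a few lines. The sketch below uses small made-up arrays; np.shares_memory is used here only as a convenient way to tell a view from a copy.

```python
import numpy as np

# (a) dtype: an integer array keeps its dtype when you assign into it.
a = np.array([1, 2, 3])
a[0] = 2.9                          # stored as 2, the fractional part is dropped
halves = a / 2                      # true division returns floats
floors = np.array([-3, 3]) // 2     # floor division rounds toward -inf

# (b) views vs copies
base = np.arange(6).reshape(2, 3)
view = base[:, :2]                  # basic slice: shares memory with base
copy = base[:, [0, 1]]              # fancy index: independent copy
shares_view = np.shares_memory(base, view)
shares_copy = np.shares_memory(base, copy)

# (c) axis semantics on a (2, 3) array [[0, 1, 2], [3, 4, 5]]
col_sums = base.sum(axis=0)         # collapses rows: one sum per column
row_sums = base.sum(axis=1)         # collapses columns: one sum per row

print(a, halves.dtype, floors)
print(shares_view, shares_copy, col_sums, row_sums)
```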

Worked example — vectorising a pairwise distance matrix

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 8))     # 500 points in 8-D

# slow, explicit double loop  -- O(n^2 d) Python ops
def pairwise_slow(X):
    n = X.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            D[i, j] = np.sqrt(((X[i] - X[j]) ** 2).sum())
    return D

# fast, vectorized via broadcasting:
# X[:, None, :] has shape (n, 1, d); X[None, :, :] has shape (1, n, d)
# difference broadcasts to (n, n, d); sum over last axis, sqrt -> (n, n)
def pairwise_fast(X):
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

# even faster identity:  ||x-y||^2 = ||x||^2 + ||y||^2 - 2 x.y
def pairwise_blas(X):
    sq = (X * X).sum(1)
    return np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0))

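On a typical laptop, the slow version takes on the order of seconds; the BLAS version finishes in roughly 5 ms. Exact timings depend on your hardware and the BLAS library NumPy is linked against.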
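Before trusting a faster rewrite, it helps to confirm that it returns the same numbers. The sketch below repeats the two vectorized functions so it runs on its own, compares them with a tolerance, and times each with time.perf_counter; the best_of helper is a hypothetical convenience, not part of any library.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 8))

def pairwise_fast(X):
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

def pairwise_blas(X):
    sq = (X * X).sum(1)
    return np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0))

def best_of(fn, repeats=5):
    """Smallest wall-clock time over a few runs, in seconds."""
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(X)
        times.append(time.perf_counter() - t0)
    return min(times)

D_fast, D_blas = pairwise_fast(X), pairwise_blas(X)
# The expanded identity loses a little precision to cancellation,
# so the comparison uses a tolerance instead of exact equality.
agree = np.allclose(D_fast, D_blas, atol=1e-6)
print(agree, f"{best_of(pairwise_fast):.4f}s", f"{best_of(pairwise_blas):.4f}s")
```

The np.maximum(..., 0) clamp in the BLAS version exists for the same reason: rounding can push a true zero slightly negative, and np.sqrt of a negative number returns NaN.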

Drills

D1 · Broadcasting shapes

You have X shape (64, 10) and a mean vector μ shape (10,), where μ = X.mean(axis=0). What does X - μ produce, and how would you centre each row instead?

Solution

X - μ broadcasts μ across rows → shape (64, 10); μ is subtracted from every row, so each column (feature) ends up with mean zero. To centre each row instead, subtract a per-row mean of shape (64, 1): X - X.mean(axis=1, keepdims=True).

D2 · Replace a loop

Rewrite for i in range(len(x)): y[i] = max(0, x[i]) without a Python loop.

Solution

y = np.maximum(0, x). (Or y = x * (x > 0).)

D3 · View vs copy gotcha

Predict the output:

a = np.arange(6).reshape(2, 3)
b = a[:, :2]
b[0, 0] = 99
print(a[0, 0])

Solution

Prints 99. Plain slicing returns a view sharing memory with a. To get an independent copy use a[:, :2].copy().

D4 · Axis arithmetic

Given X shape (N, D), write one-liners for (a) per-feature mean, (b) per-sample sum, (c) standardisation to zero mean / unit variance per column.

Solution

X.mean(0); X.sum(1); (X - X.mean(0)) / X.std(0, ddof=0). Use ddof=1 for the sample standard deviation (the one based on the unbiased variance estimate).
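A quick numerical check of the three one-liners, on made-up data with N=200 and D=4:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(200, 4))   # illustrative data only

feat_mean = X.mean(0)               # (a) shape (4,): one mean per feature
sample_sum = X.sum(1)               # (b) shape (200,): one sum per sample
Z = (X - X.mean(0)) / X.std(0)      # (c) ddof=0 is the default for std

# After standardising, every column has mean ~0 and population std ~1.
means_ok = np.allclose(Z.mean(0), 0)
stds_ok = np.allclose(Z.std(0), 1)
print(feat_mean.shape, sample_sum.shape, means_ok, stds_ok)
```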

D5 · One-hot encoding

Turn a vector of class indices y shape (N,) into a one-hot matrix of shape (N, K) with no loops.

Solution

oh = np.zeros((y.size, K)); oh[np.arange(y.size), y] = 1. Or np.eye(K)[y] — concise but allocates a K×K identity.
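Both variants can be compared directly. A small sketch with made-up labels, assuming K (the number of classes) is known in advance:

```python
import numpy as np

y = np.array([2, 0, 1, 2])      # illustrative class indices
K = 3                           # number of classes (assumed known)

oh = np.zeros((y.size, K))
oh[np.arange(y.size), y] = 1    # row i gets a 1 in column y[i]

oh_eye = np.eye(K)[y]           # same result via row lookup in the identity

print(oh)
print(np.array_equal(oh, oh_eye))
```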

2. pandas — tabular data

Concept

A DataFrame is a labelled 2-D table — think Excel + SQL with NumPy underneath. Each column is a Series with its own dtype. The two indexers you must memorise: .loc[rows, cols] uses labels; .iloc[i, j] uses integer positions. Boolean masks (df[df["x"] > 0]) select rows; df.assign(col=...) or df["col"] = ... creates columns.

The three high-value operations are groupby + aggregate for feature engineering (df.groupby("user_id")["amount"].mean()), merge/join for combining tables (pd.merge(a, b, on="key", how="left")), and missing-value handling — decide explicitly between .dropna(subset=[...]) (lose rows) and .fillna(value) or .fillna(df.median(numeric_only=True)) (impute). Never accept silent NaNs flowing into a model.
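The three operations can be tried on a toy table before touching a real CSV. In the sketch below the frames tx and users and their columns are made up for illustration.

```python
import numpy as np
import pandas as pd

tx = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "amount":  [10.0, np.nan, 5.0, 7.0],
})
users = pd.DataFrame({"user_id": [1, 2], "country": ["US", "CA"]})

# 1) groupby + aggregate: mean skips NaN by default
mean_amt = tx.groupby("user_id")["amount"].mean()

# 2) left merge: user 3 has no match in users, so country becomes NaN
joined = pd.merge(tx, users, on="user_id", how="left")

# 3) missing values: choose explicitly between dropping and imputing
dropped = joined.dropna(subset=["amount"])
imputed = joined.fillna({"amount": joined["amount"].median()})

print(mean_amt.to_dict())
print(len(dropped), imputed["amount"].tolist())
```

Passing a dict to fillna restricts imputation to the named column, which leaves the NaN in country visible instead of silently filling it with a number.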

Worked example — feature engineering on a transactions CSV

import pandas as pd
import numpy as np

df = pd.read_csv("transactions.csv", parse_dates=["ts"])
df.info()                                # dtypes & non-null counts
df.describe(include="all").T             # numeric + categorical summary
print(df.isna().mean().sort_values(ascending=False).head())   # missingness

# date features
df["hour"]  = df["ts"].dt.hour
df["dow"]   = df["ts"].dt.dayofweek
df["month"] = df["ts"].dt.month

# per-user aggregates joined back as features
agg = (df.groupby("user_id")
         .agg(n_tx=("amount", "size"),
              mean_amt=("amount", "mean"),
              max_amt=("amount", "max"))
         .reset_index())
df = df.merge(agg, on="user_id", how="left")

# impute and log-transform
df["amount"] = df["amount"].fillna(df["amount"].median())
df["log_amount"] = np.log1p(df["amount"])

Drills

D1 · loc vs iloc

Given a DataFrame indexed by string IDs, what's the difference between df.loc["a":"c"] and df.iloc[0:3]?

Solution

.loc uses label slicing and is inclusive on both ends (returns rows "a","b","c"). .iloc uses positional slicing and is exclusive on the right (returns positions 0, 1, 2). The off-by-one trips up beginners weekly.
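The inclusive/exclusive difference is easiest to see on a four-row frame. A sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"v": [10, 20, 30, 40]}, index=["a", "b", "c", "d"])

by_label = df.loc["a":"c"]    # labels a, b, c: both ends included
by_pos = df.iloc[0:3]         # positions 0, 1, 2: right end excluded
short = df.iloc[0:2]          # "c" sits at position 2, but 0:2 stops before it

print(by_label.index.tolist(), by_pos.index.tolist(), short.index.tolist())
```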

D2 · GroupBy multiple aggregations

For a DataFrame with columns region, product, price, qty, compute per-region total revenue and average price in one expression.

Solution

(df.assign(rev=df["price"] * df["qty"])
   .groupby("region")
   .agg(total_rev=("rev", "sum"), avg_price=("price", "mean")))

D3 · Chained assignment trap

Why does df[df["x"] > 0]["y"] = 1 often fail to update the DataFrame?

Solution

The expression on the left creates a temporary copy (pandas chained indexing); the assignment writes to that copy, not the original. Use a single .loc: df.loc[df["x"] > 0, "y"] = 1.
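A minimal sketch of the single .loc form on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"x": [-1, 2, 3], "y": [0, 0, 0]})

# One indexing call selects rows and column together, so the write
# lands on df itself rather than on an intermediate object.
df.loc[df["x"] > 0, "y"] = 1

print(df["y"].tolist())
```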

D4 · Merge with mismatched keys

pd.merge(a, b, on="user_id", how="left") — what happens to (a) rows in a whose user_id is missing in b; (b) rows in b not in a?

Solution

(a) Kept; the columns from b are filled with NaN. (b) Dropped — left join keeps only keys present in the left table. Use how="outer" to keep both, and pass indicator=True to debug.
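The behaviour is easy to confirm on two tiny tables. In this sketch the frames and the tier column are made up for illustration.

```python
import pandas as pd

a = pd.DataFrame({"user_id": [1, 2, 3], "amount": [10, 20, 30]})
b = pd.DataFrame({"user_id": [2, 3, 4], "tier": ["gold", "free", "gold"]})

left = pd.merge(a, b, on="user_id", how="left")
outer = pd.merge(a, b, on="user_id", how="outer", indicator=True)

# indicator=True adds a _merge column: left_only, right_only, or both
n_missing_tier = int(left["tier"].isna().sum())
n_both = int((outer["_merge"] == "both").sum())
n_left_only = int((outer["_merge"] == "left_only").sum())
n_right_only = int((outer["_merge"] == "right_only").sum())

print(n_missing_tier, n_both, n_left_only, n_right_only)
```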

D5 · Pivot for a report

Reshape a long DataFrame (date, product, sales) into a wide table with one row per date and one column per product.

Solution

df.pivot(index="date", columns="product", values="sales"). Use pivot_table if duplicates need an aggregation function.
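The difference between the two functions shows up as soon as a (date, product) pair repeats. A sketch on made-up data, using sum as the assumed aggregation:

```python
import pandas as pd

long = pd.DataFrame({
    "date":    ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"],
    "product": ["A", "B", "A", "A"],      # A appears twice on 2024-01-02
    "sales":   [5, 3, 4, 6],
})

# pivot refuses duplicate (date, product) pairs
try:
    long.pivot(index="date", columns="product", values="sales")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table resolves duplicates with an aggregation function
wide = long.pivot_table(index="date", columns="product",
                        values="sales", aggfunc="sum")
print(pivot_failed)
print(wide)
```

Combinations that never occur in the long table (here product B on the second date) come out as NaN in the wide one.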

3. Matplotlib & seaborn — visual debugging

Concept

Plots are how you find data bugs and model bugs. Three plots you should make automatically on every project: (1) feature histograms and a pairplot to spot outliers and skew; (2) training/validation loss curves over epochs: a widening gap (training loss still falling while validation loss rises) signals overfitting, and both curves flat at a high loss signals underfitting; (3) predicted vs. true scatter (regression) or a confusion matrix (classification). Matplotlib gives you total control; seaborn is a higher-level layer on top of matplotlib with sensible statistical defaults.

Worked example — loss curves and a confusion matrix

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)           # seeded so the figure is reproducible

epochs = np.arange(1, 21)
train_loss = 1.2 * np.exp(-epochs/8) + 0.05*rng.standard_normal(20)
val_loss   = 1.2 * np.exp(-epochs/10) + 0.1 + 0.05*rng.standard_normal(20)

fig, ax = plt.subplots(1, 2, figsize=(10, 4))
ax[0].plot(epochs, train_loss, label="train")
ax[0].plot(epochs, val_loss,   label="val")
ax[0].set_xlabel("epoch"); ax[0].set_ylabel("loss"); ax[0].legend()

y_true = rng.integers(0, 3, size=200)    # synthetic labels for 3 classes
y_pred = y_true.copy()
flip = rng.random(200) < 0.15            # corrupt roughly 15% of predictions
y_pred[flip] = (y_pred[flip] + 1) % 3
cm = confusion_matrix(y_true, y_pred)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", ax=ax[1])
ax[1].set_xlabel("predicted"); ax[1].set_ylabel("true")

plt.tight_layout()
plt.savefig("diagnostics.png", dpi=120)

Drills

D1 · Reading a loss curve

Training loss keeps falling; validation loss bottoms out at epoch 5 and rises after. Diagnosis? What do you do?

Solution

Overfitting. Apply more regularisation (dropout, weight decay), get more data or augment, reduce model size, or — easiest — early-stop at epoch 5.

D2 · Correct chart type

Pick the right plot: (a) check whether two features are correlated, (b) compare loss across 6 optimizer variants, (c) inspect class imbalance in a label column.

Solution

(a) Scatter plot, or sns.regplot with a fitted line. (b) Bar chart of final val loss with error bars over seeds. (c) df["label"].value_counts().plot.bar().

D3 · Log scale

You plot loss over 100 epochs but the early-epoch losses are 100× larger than the late ones, flattening the curve. Fix?

Solution

ax.set_yscale("log"). Log scale is a good default whenever the loss spans more than an order of magnitude; it requires strictly positive values.
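The effect is clearest with the same curve drawn on both scales side by side. A sketch with a synthetic loss curve; the Agg backend line is an assumption for running outside a notebook and can be dropped in Jupyter.

```python
import matplotlib
matplotlib.use("Agg")              # headless backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np

epochs = np.arange(1, 101)
loss = 100.0 * np.exp(-epochs / 15) + 0.5   # synthetic curve, strictly positive

fig, ax = plt.subplots(1, 2, figsize=(10, 4))
ax[0].plot(epochs, loss)
ax[0].set_title("linear y-axis")
ax[1].plot(epochs, loss)
ax[1].set_yscale("log")            # late-epoch detail becomes visible
ax[1].set_title("log y-axis")
for panel in ax:
    panel.set_xlabel("epoch")
    panel.set_ylabel("loss")
fig.tight_layout()
```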

4. scikit-learn — the ML API

Concept

Every scikit-learn estimator implements .fit(X, y) and either .predict(X) or .transform(X). Pipelines compose them so preprocessing and modelling are atomic — and so that the validation set is preprocessed using statistics from the training fold only (no leakage).

Worked example — pipeline with held-out validation

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler
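from sklearn.pipeline import Pipeline
from sklearn.datasets import load_breast_cancer

# Stand-in data so the example runs as written (an assumption; use your own X, y).
X, y = load_breast_cancer(return_X_y=True)

X_train, X_val, y_train, y_val = train_test_split(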
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf",   LogisticRegression(max_iter=1000, C=1.0)),
])
pipe.fit(X_train, y_train)
pred = pipe.predict(X_val)
print(accuracy_score(y_val, pred))
print(classification_report(y_val, pred))

Always fit the scaler on training data only. Then transform the validation/test sets with that fitted scaler. The Pipeline object does this automatically — that is its main reason to exist.
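The same guarantee carries over to cross-validation. A minimal sketch, using the Iris dataset purely as a stand-in: cross_val_score fits a fresh clone of the whole pipeline on the training part of each fold, so the scaler never sees that fold's held-out rows.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)   # stand-in dataset, chosen for illustration

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf",   LogisticRegression(max_iter=1000)),
])

# 5 folds: scaler and classifier are both refit inside each fold.
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

Scaling X once up front and then cross-validating only the classifier would reintroduce the leak, because every fold's scaler statistics would include its own validation rows.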

5. Reproducibility & debugging

Concept

Two runs of the same notebook should produce the same numbers. Set seeds in every random source you touch. Save the package versions you used (pip freeze > requirements.txt). For training runs that must be bit-for-bit identical, also set the cuDNN deterministic flags and accept a small throughput cost.

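import os
import random

import numpy as np
import torch

def set_seed(seed=42):
    random.seed(seed)                     # Python's built-in RNG
    np.random.seed(seed)                  # NumPy's legacy global RNG
    torch.manual_seed(seed)               # PyTorch's default RNG
    torch.cuda.manual_seed_all(seed)      # every CUDA device; ignored if no GPU
    # Only affects Python processes started after this line. To fix hashing in
    # the current process, set PYTHONHASHSEED before launching Python.
    os.environ["PYTHONHASHSEED"] = str(seed)
    torch.backends.cudnn.deterministic = True   # use deterministic cuDNN kernels
    torch.backends.cudnn.benchmark = False      # no autotuning between runs

set_seed(42)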

Debugging & profiling toolkit

arr.shape, arr.dtype are your two best friends. Print them at every suspicious line.
%timeit expr in Jupyter times a single statement; %%timeit at the top of a cell times the whole cell.
%prun fn() for a function-level profile.
import pdb; pdb.set_trace() drops you into a debugger; in notebooks use %debug after an exception.
"Restart kernel and run all" before claiming a notebook works.

Drills

D1 · Find the leak

scaler = StandardScaler().fit(X)        # fits on ALL data
X_scaled = scaler.transform(X)
X_train, X_val, y_train, y_val = train_test_split(X_scaled, y)

What's wrong and how do you fix it?

Solution

The scaler learned the mean/variance from data that includes the validation set. Fix: split first, fit on X_train only, then transform both halves. Better: wrap in a Pipeline so leakage is structurally impossible.
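The manual fix looks like this. A sketch on synthetic data from make_classification, chosen only so the block runs on its own:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Split first, then fit the scaler on the training rows only.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_val_s = scaler.transform(X_val)      # reuses the training mean and std

# The stored statistics come from X_train, not from all of X.
uses_train_stats = np.allclose(scaler.mean_, X_train.mean(axis=0))
print(uses_train_stats, X_train_s.shape, X_val_s.shape)
```

The validation columns will not have exactly zero mean after scaling, and that is expected: they are measured against the training distribution.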

D2 · Non-deterministic results

Two identical runs of your PyTorch training loop give different val accuracies. List five likely culprits.

Solution

(1) torch.manual_seed was never called; (2) DataLoader with shuffle=True and no seeded generator argument; (3) cuDNN nondeterministic kernels; (4) data-augmentation RNG not seeded; (5) different package versions across runs.
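Culprit (2) can be closed off by giving the DataLoader its own seeded generator. A minimal sketch on a toy dataset; epoch_order is a hypothetical helper written for this illustration, and the setup assumes single-process loading (the default num_workers=0).

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(10))   # toy dataset: the integers 0..9

def epoch_order(seed):
    """Return the sample order of one shuffled epoch for a given seed."""
    g = torch.Generator()
    g.manual_seed(seed)
    loader = DataLoader(dataset, batch_size=5, shuffle=True, generator=g)
    return [int(v) for (batch,) in loader for v in batch]

run_a = epoch_order(0)
run_b = epoch_order(0)
print(run_a == run_b)
```

A dedicated generator keeps the shuffle order stable even if other code draws from the global PyTorch RNG between runs.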

D3 · Notebook executed out of order

Why is "Restart & Run All" the only honest test of a notebook?

Solution

Jupyter cells retain state from any earlier execution. A variable might still reference a stale value from a cell you've since edited. Restart clears state; "Run All" proves the notebook produces its claimed result top-to-bottom on a clean kernel.

Checkpoint — answer out loud

Can you state the broadcasting rule and predict whether (4, 3) + (3,) works without running it?
Can you write a groupby + agg that produces three named columns?
Can you spot a data leak between fitting a scaler and a train/val split?
Can you produce a loss-curve plot and a confusion-matrix heatmap from memory?
Can you list the four seeding calls needed for a reproducible PyTorch run?

Next step

With the toolchain in your fingers, go to Classical machine learning — the fastest path to a competitive USAAIO baseline. For paper-and-pencil drills that test the math under the code, see the Round 2 theory bank.