Python data stack
The exact subset of Python you need for USAAIO: NumPy for arrays, pandas for tabular data, matplotlib + seaborn for plots, scikit-learn for ML, and PyTorch for deep learning. Set them up once, learn the idioms, then stop thinking about syntax.
… .shape/.dtype and timing with %timeit; (5) reproducibility: seeds, deterministic flags, version pinning.
USAAIO will hand you a notebook and a CSV; if your fingers know these idioms you keep your brain free
for the modeling.
Setup
python3 -m venv .venv
source .venv/bin/activate
pip install numpy pandas matplotlib seaborn scikit-learn
pip install torch torchvision torchaudio
pip install jupyterlab
jupyter lab
On a Mac with Apple silicon, the default PyTorch wheels include MPS (Metal) acceleration, so you don't need CUDA; you still have to move tensors and models to the "mps" device explicitly.
1. NumPy — arrays & vectorization
Concept
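NumPy's ndarray is a typed, fixed-shape block of memory (contiguous by default). Operations are
implemented in compiled C, often with SIMD, so vectorized expressions typically run one to two orders of magnitude faster than equivalent Python loops; the exact factor depends on array size and operation.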
Vectorization means expressing computation as whole-array operations:
a + b, np.exp(x), X @ w. Broadcasting is how
arrays of different shapes combine: trailing dimensions are matched, length-1 dimensions are stretched
"virtually" without copying. The rule: compare shapes right-to-left; each pair must be equal or one of
them 1. So (4, 3) + (3,) works (treat the (3,) as a row added to every row);
(4, 3) + (4,) does not — you must reshape to (4, 1).
Common pitfalls: (a) integer dtypes: np.array([1,2,3]) / 2 returns floats (true division),
but // floors, and writing a float into an integer array (a[0] = 1.7) silently truncates it; (b) views vs copies: slicing returns a view
that shares memory, fancy indexing returns a copy; (c) axis=0 reduces over rows (collapses the
first dim, leaving one value per column), axis=1 reduces over columns (one value per row).
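The broadcasting rule and the three pitfalls above can be checked in a few lines. This is a minimal sketch on small made-up arrays (A, row, and col are illustrative names, not from any dataset):

```python
import numpy as np

A = np.arange(12.0).reshape(4, 3)      # shape (4, 3)
row = np.array([10.0, 20.0, 30.0])     # shape (3,)
col = np.array([1.0, 2.0, 3.0, 4.0])   # shape (4,)

# (4, 3) + (3,): trailing dims match (3 vs 3), so row is added to every row.
ok = A + row

# (4, 3) + (4,): trailing dims are 3 vs 4 and neither is 1, so NumPy raises.
try:
    A + col
    broadcast_failed = False
except ValueError:
    broadcast_failed = True

# Reshape to (4, 1) so the length-1 axis is stretched across the columns.
fixed = A + col[:, None]

# Views vs copies: a plain slice shares memory, fancy indexing does not.
view = A[:, :2]
copy = A[:, [0, 1]]
slice_shares = np.shares_memory(A, view)
fancy_shares = np.shares_memory(A, copy)

# Writing a float into an integer array silently truncates it.
ints = np.array([1, 2, 3])
ints[0] = 1.7

# axis=0 collapses the first dimension: one value per column.
col_sums = A.sum(axis=0)   # shape (3,)
row_sums = A.sum(axis=1)   # shape (4,)
```

Printing .shape on each intermediate is usually the quickest way to confirm which axis was stretched or collapsed.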
Worked example — vectorising a pairwise distance matrix
import numpy as np
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 8))  # 500 points in 8-D
# slow, explicit double loop -- O(n^2 d) Python ops
def pairwise_slow(X):
    n = X.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            D[i, j] = np.sqrt(((X[i] - X[j]) ** 2).sum())
    return D
# fast, vectorized via broadcasting:
# X[:, None, :] has shape (n, 1, d); X[None, :, :] has shape (1, n, d)
# difference broadcasts to (n, n, d); sum over last axis, sqrt -> (n, n)
def pairwise_fast(X):
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))
# even faster identity: ||x-y||^2 = ||x||^2 + ||y||^2 - 2 x.y
def pairwise_blas(X):
    sq = (X * X).sum(1)
    return np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0))
On a laptop, the slow version takes seconds; the BLAS version finishes in ~5 ms.
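Before trusting a faster rewrite, it is worth confirming that it returns the same numbers. The sketch below re-declares the three approaches under illustrative names (pairwise_loop, pairwise_broadcast, pairwise_gram) on a smaller array so the loop stays quick. The looser tolerance for the Gram-matrix version is an assumption that allows for floating-point cancellation near zero distances.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 8))   # small n so the Python loop is fast

def pairwise_loop(X):
    n = X.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            D[i, j] = np.sqrt(((X[i] - X[j]) ** 2).sum())
    return D

def pairwise_broadcast(X):
    # Builds an (n, n, d) intermediate, so memory grows as n^2 * d.
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

def pairwise_gram(X):
    sq = (X * X).sum(1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.sqrt(np.maximum(d2, 0))   # clip tiny negatives from rounding

D_loop = pairwise_loop(X)
D_bcast = pairwise_broadcast(X)
D_gram = pairwise_gram(X)

agree_bcast = np.allclose(D_loop, D_bcast)
agree_gram = np.allclose(D_loop, D_gram, atol=1e-6)
```

In a notebook, %timeit pairwise_loop(X) and %timeit pairwise_gram(X) give the speed comparison directly.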
Drills
D1 · Broadcasting shapes
You have X shape (64, 10) and a mean vector μ shape (10,).
Subtract μ from every row of X. What does X - μ produce, and how would you subtract each
row's own mean instead?
Solution
X - μ broadcasts μ across rows → shape (64, 10); each row has
μ subtracted. If μ = X.mean(axis=0), every column (feature) then has zero mean. To centre each row instead, subtract a per-row mean of shape (64, 1):
X - X.mean(axis=1, keepdims=True).
D2 · Replace a loop
Rewrite for i in range(len(x)): y[i] = max(0, x[i]) without a Python loop.
Solution
y = np.maximum(0, x). (Or y = x * (x > 0).)
D3 · View vs copy gotcha
Predict the output:
a = np.arange(6).reshape(2, 3)
b = a[:, :2]
b[0, 0] = 99
print(a[0, 0])
Solution
Prints 99. Plain slicing returns a view sharing memory with a. To get an
independent copy use a[:, :2].copy().
D4 · Axis arithmetic
Given X shape (N, D), write one-liners for (a) per-feature mean, (b)
per-sample sum, (c) standardisation to zero mean / unit variance per column.
Solution
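(a) X.mean(0); (b) X.sum(1);
(c) (X - X.mean(0)) / X.std(0, ddof=0). Use ddof=1 for the sample (Bessel-corrected) standard deviation, and guard against zero-variance columns, which would divide by zero.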
D5 · One-hot encoding
Turn a vector of class indices y shape (N,) into a one-hot matrix of shape
(N, K) with no loops.
Solution
oh = np.zeros((y.size, K)); oh[np.arange(y.size), y] = 1. Or
np.eye(K)[y] — concise but allocates a K×K identity.
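Both one-hot recipes can be compared side by side. This is a small sketch with a made-up label vector; y and K here are illustrative values, and y must have an integer dtype for the indexing to work.

```python
import numpy as np

y = np.array([2, 0, 1, 2])   # class indices (made-up example)
K = 3                        # number of classes, assumed known

# Recipe 1: fancy indexing sets exactly one entry per row.
oh = np.zeros((y.size, K))
oh[np.arange(y.size), y] = 1

# Recipe 2: select rows of the identity matrix.
oh_eye = np.eye(K)[y]

# argmax along axis 1 recovers the original labels.
recovered = oh.argmax(axis=1)
```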
2. pandas — tabular data
Concept
A DataFrame is a labelled 2-D table — think Excel + SQL with NumPy underneath. Each column
is a Series with its own dtype. The two indexers you must memorise: .loc[rows, cols]
uses labels; .iloc[i, j] uses integer positions. Boolean masks (df[df["x"] > 0])
select rows; df.assign(col=...) or df["col"] = ... creates columns.
The three high-value operations are groupby + aggregate for feature engineering
(df.groupby("user_id")["amount"].mean()), merge/join for combining tables
(pd.merge(a, b, on="key", how="left")), and missing-value handling —
decide explicitly between .dropna(subset=[...]) (lose rows) and
.fillna(value) or .fillna(df.median(numeric_only=True)) (impute). Never accept
silent NaNs flowing into a model.
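The drop-versus-impute decision might look like this on a toy table. The column names and values are made up for illustration; the point is that each choice is explicit and the final NaN count is checked.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "amount": [10.0, np.nan, 5.0, 7.0, np.nan],
    "label": [0, 1, 0, np.nan, 1],
})

# Option 1: drop rows whose target is missing (these rows are lost).
labelled = df.dropna(subset=["label"])

# Option 2: impute a feature with its median over the remaining rows.
median_amt = labelled["amount"].median()
imputed = labelled.assign(amount=labelled["amount"].fillna(median_amt))

# Confirm nothing slips through before modelling.
remaining_nans = int(imputed.isna().sum().sum())
```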
Worked example — feature engineering on a transactions CSV
import pandas as pd
import numpy as np
df = pd.read_csv("transactions.csv", parse_dates=["ts"])
df.info() # dtypes & non-null counts
df.describe(include="all").T # numeric + categorical summary
print(df.isna().mean().sort_values(ascending=False).head()) # missingness
# date features
df["hour"] = df["ts"].dt.hour
df["dow"] = df["ts"].dt.dayofweek
df["month"] = df["ts"].dt.month
# per-user aggregates joined back as features
agg = (df.groupby("user_id")
         .agg(n_tx=("amount", "size"),
              mean_amt=("amount", "mean"),
              max_amt=("amount", "max"))
         .reset_index())
df = df.merge(agg, on="user_id", how="left")
# impute and log-transform
df["amount"] = df["amount"].fillna(df["amount"].median())
df["log_amount"] = np.log1p(df["amount"])
Drills
D1 · loc vs iloc
Given a DataFrame indexed by string IDs, what's the difference between df.loc["a":"c"]
and df.iloc[0:3]?
Solution
.loc uses label slicing and is inclusive on both ends (returns rows
"a","b","c"). .iloc uses positional slicing and is exclusive on the right (returns
positions 0, 1, 2). The off-by-one trips up beginners weekly.
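A short sketch makes the inclusive/exclusive difference concrete. The frames are made up. The second one uses the default integer index, which is where the two indexers are easiest to confuse.

```python
import pandas as pd

df = pd.DataFrame({"score": [1, 2, 3, 4]}, index=["a", "b", "c", "d"])

by_label = df.loc["a":"c"]    # label slice, includes "c"
by_position = df.iloc[0:3]    # positional slice, excludes position 3

# With a default RangeIndex the same-looking slice gives different lengths.
df_int = pd.DataFrame({"score": [1, 2, 3, 4]})
n_loc = len(df_int.loc[0:2])    # labels 0, 1, 2
n_iloc = len(df_int.iloc[0:2])  # positions 0, 1
```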
D2 · GroupBy multiple aggregations
For a DataFrame with columns region, product, price,
qty, compute per-region total revenue and average price in one expression.
Solution
df.assign(rev=df["price"]*df["qty"]) \
.groupby("region") \
.agg(total_rev=("rev","sum"), avg_price=("price","mean"))
D3 · Chained assignment trap
Why does df[df["x"] > 0]["y"] = 1 often fail to update the DataFrame?
Solution
The expression on the left creates a temporary copy (pandas chained indexing); the assignment
writes to that copy, not the original. Use a single .loc:
df.loc[df["x"] > 0, "y"] = 1.
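The trap and the fix can be reproduced on a three-row frame. This is a sketch on made-up data. The warning filter is only there to keep the output quiet, since pandas warns about the chained write (the exact warning class varies by pandas version).

```python
import warnings
import pandas as pd

df = pd.DataFrame({"x": [-1, 2, 3], "y": [0, 0, 0]})

with warnings.catch_warnings():
    warnings.simplefilter("ignore")   # silence the chained-assignment warning
    df[df["x"] > 0]["y"] = 1          # writes to a temporary copy
after_chained = df["y"].tolist()      # original is unchanged

df.loc[df["x"] > 0, "y"] = 1          # one indexing call on df itself
after_loc = df["y"].tolist()
```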
D4 · Merge with mismatched keys
pd.merge(a, b, on="user_id", how="left") — what happens to (a) rows in a
whose user_id is missing in b; (b) rows in b not in a?
Solution
(a) Kept; the columns from b are filled with NaN. (b) Dropped — left join keeps only
keys present in the left table. Use how="outer" to keep both, and pass
indicator=True to debug.
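To see both cases at once, here is a minimal sketch with two made-up tables. The _merge column is what indicator=True adds by default.

```python
import pandas as pd

a = pd.DataFrame({"user_id": [1, 2, 3], "amount": [10, 20, 30]})
b = pd.DataFrame({"user_id": [2, 3, 4], "country": ["US", "CA", "MX"]})

left = pd.merge(a, b, on="user_id", how="left", indicator=True)
outer = pd.merge(a, b, on="user_id", how="outer", indicator=True)

# Count how each row was matched.
left_only_in_left = int((left["_merge"] == "left_only").sum())
right_only_in_left = int((left["_merge"] == "right_only").sum())
right_only_in_outer = int((outer["_merge"] == "right_only").sum())

# user_id 1 has no match in b, so its country is NaN after the left join.
unmatched_country_is_nan = bool(left.loc[left["user_id"] == 1, "country"].isna().all())
```

Checking the _merge counts right after a join is a cheap way to catch key mismatches before they turn into NaN features.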
D5 · Pivot for a report
Reshape a long DataFrame (date, product, sales) into a wide
table with one row per date and one column per product.
Solution
df.pivot(index="date", columns="product", values="sales"). Use
pivot_table if duplicates need an aggregation function.
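The difference between the two shows up as soon as a (date, product) pair repeats. A small sketch on made-up sales rows, using "sum" as an illustrative aggregation:

```python
import pandas as pd

long = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"],
    "product": ["A", "B", "A", "B"],
    "sales": [5, 3, 7, 2],
})
wide = long.pivot(index="date", columns="product", values="sales")

# Append a duplicate of the first row: pivot can no longer place it uniquely.
dup = pd.concat([long, long.iloc[[0]]], ignore_index=True)
try:
    dup.pivot(index="date", columns="product", values="sales")
    pivot_raised = False
except ValueError:
    pivot_raised = True

# pivot_table resolves duplicates with an explicit aggregation.
wide_sum = dup.pivot_table(index="date", columns="product",
                           values="sales", aggfunc="sum")
```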
3. Matplotlib & seaborn — visual debugging
Concept
Plots are how you find data bugs and model bugs. Three plots you should make automatically on every project: (1) feature histograms and a pairplot to spot outliers and skew; (2) training/validation loss curves over epochs: a validation curve that rises while training loss keeps falling means overfitting, and both curves flat at a high loss means underfitting; (3) predicted vs. true scatter (regression) or a confusion matrix (classification). Matplotlib gives you total control; seaborn is a higher-level layer on top of Matplotlib with sensible statistical defaults.
Worked example — loss curves and a confusion matrix
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.metrics import confusion_matrix
rng = np.random.default_rng(0)  # seeded so the figure is reproducible
# synthetic loss curves: exponential decay plus noise
epochs = np.arange(1, 21)
train_loss = 1.2 * np.exp(-epochs/8) + 0.05*rng.standard_normal(20)
val_loss = 1.2 * np.exp(-epochs/10) + 0.1 + 0.05*rng.standard_normal(20)
fig, ax = plt.subplots(1, 2, figsize=(10, 4))
ax[0].plot(epochs, train_loss, label="train")
ax[0].plot(epochs, val_loss, label="val")
ax[0].set_xlabel("epoch"); ax[0].set_ylabel("loss"); ax[0].legend()
# synthetic 3-class labels with about 15% of predictions shifted to the next class
y_true = rng.integers(0, 3, size=200)
y_pred = y_true.copy()
flip = rng.random(200) < 0.15
y_pred[flip] = (y_pred[flip] + 1) % 3
cm = confusion_matrix(y_true, y_pred)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", ax=ax[1])
ax[1].set_xlabel("predicted"); ax[1].set_ylabel("true")
plt.tight_layout()
plt.savefig("diagnostics.png", dpi=120)
Drills
D1 · Reading a loss curve
Training loss keeps falling; validation loss bottoms out at epoch 5 and rises after. Diagnosis? What do you do?
Solution
Overfitting. Apply more regularisation (dropout, weight decay), get more data or augment, reduce model size, or — easiest — early-stop at epoch 5.
D2 · Correct chart type
Pick the right plot: (a) check whether two features are correlated, (b) compare loss across 6 optimizer variants, (c) inspect class imbalance in a label column.
Solution
(a) Scatter plot, or sns.regplot with a fitted line. (b) Bar chart of final val loss
with error bars over seeds. (c) df["label"].value_counts().plot.bar().
D3 · Log scale
You plot loss over 100 epochs but the early-epoch losses are 100× larger than the late ones, flattening the curve. Fix?
Solution
ax.set_yscale("log"). A log y-axis is a good default for loss curves that span several
orders of magnitude, provided the loss stays strictly positive.
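A side-by-side comparison shows what the log axis buys you. The curve below is synthetic (an exponential decay chosen so early losses are roughly 100× the late ones), and the Agg backend is selected only so the sketch also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")   # headless backend; drop this line inside Jupyter
import matplotlib.pyplot as plt
import numpy as np

epochs = np.arange(1, 101)
loss = 100.0 * np.exp(-epochs / 15) + 0.5   # synthetic, strictly positive

fig, ax = plt.subplots(1, 2, figsize=(10, 4))
ax[0].plot(epochs, loss)
ax[0].set_title("linear y-axis")
ax[1].plot(epochs, loss)
ax[1].set_yscale("log")                     # late-epoch detail becomes visible
ax[1].set_title("log y-axis")
for panel in ax:
    panel.set_xlabel("epoch")
    panel.set_ylabel("loss")
fig.tight_layout()
```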
4. scikit-learn — the ML API
Concept
Every scikit-learn estimator implements .fit(X, y) and either .predict(X) or
.transform(X). Pipelines compose them so preprocessing and modelling are atomic — and so
that the validation set is preprocessed using statistics from the training fold only (no leakage).
Worked example — pipeline with held-out validation
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
# stand-in dataset (an assumption so the example runs); swap in your own X, y
X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000, C=1.0)),
])
pipe.fit(X_train, y_train)   # scaler statistics come from X_train only
pred = pipe.predict(X_val)   # X_val is scaled with the training statistics
print(accuracy_score(y_val, pred))
print(classification_report(y_val, pred))
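Calling pipe.fit(X_train, y_train) fits the scaler on the training data only, and pipe.predict(X_val)
reuses those training statistics. The Pipeline object does this automatically; that is its main
reason to exist.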
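The same property carries over to cross-validation: when a Pipeline is passed to cross_val_score, the scaler is refit inside every training fold. A minimal sketch, assuming a synthetic dataset from make_classification as a stand-in for real data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption); replace with your own X, y.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Each of the 5 folds clones the pipeline and fits scaler + model on that
# fold's training portion only, then scores on the held-out portion.
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean(), scores.std())
```

Reporting the mean together with the spread across folds gives a more honest picture than a single train/validation split.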
5. Reproducibility & debugging
Concept
Two runs of the same notebook should produce the same numbers. Set seeds in every random source you
touch. Save the package versions you used (pip freeze > requirements.txt). For training
runs that must be exactly bit-equal, also set the cuDNN deterministic flags and accept a small
throughput cost.
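import os, random, numpy as np, torch
def set_seed(seed=42):
    random.seed(seed)                  # Python's built-in RNG
    np.random.seed(seed)               # NumPy's legacy global RNG
    torch.manual_seed(seed)            # PyTorch RNG
    torch.cuda.manual_seed_all(seed)   # all CUDA devices; ignored without CUDA
    # Affects only subprocesses started after this point; to fix hash
    # randomization for this interpreter, export PYTHONHASHSEED before launch.
    os.environ["PYTHONHASHSEED"] = str(seed)
    torch.backends.cudnn.deterministic = True   # deterministic cuDNN kernels
    torch.backends.cudnn.benchmark = False      # disable the cuDNN autotuner
set_seed(42)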
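For code that only touches Python's random module and NumPy, the same idea can be verified without PyTorch. This is a minimal sketch; noisy_run is a hypothetical helper, and it uses a local np.random.default_rng generator rather than the global seed.

```python
import random
import numpy as np

def noisy_run(seed):
    """Hypothetical stand-in for a training run that consumes randomness."""
    random.seed(seed)
    rng = np.random.default_rng(seed)   # local generator, no global state
    order = rng.permutation(10)         # e.g. a shuffled sample order
    noise = rng.standard_normal(3)      # e.g. random initial weights
    return order.tolist(), noise.tolist(), random.random()

run_a = noisy_run(42)
run_b = noisy_run(42)   # same seed: identical results
run_c = noisy_run(7)    # different seed: different results
```

Passing a generator around explicitly makes it obvious which code consumes randomness, which the global np.random.seed approach hides.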
Debugging & profiling toolkit
- arr.shape and arr.dtype are your two best friends. Print them at every suspicious line.
- %timeit expr in Jupyter times a single statement; %%timeit times a whole cell.
- %prun fn() for a function-level profile.
- import pdb; pdb.set_trace() drops you into a debugger; in notebooks use %debug after an exception.
- "Restart kernel and run all" before claiming a notebook works.
Drills
D1 · Find the leak
scaler = StandardScaler().fit(X) # fits on ALL data
X_scaled = scaler.transform(X)
X_train, X_val, y_train, y_val = train_test_split(X_scaled, y)
What's wrong and how do you fix it?
Solution
The scaler learned the mean/variance from data that includes the validation set. Fix: split first,
fit on X_train only, then transform both halves. Better: wrap in a Pipeline
so leakage is structurally impossible.
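The split-first fix can be written out and checked directly. This is a sketch on synthetic data (the array shapes and the 25% validation size are arbitrary choices): the scaler's learned mean should equal the training-set mean, not the mean of all rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))   # synthetic features
y = rng.integers(0, 2, size=200)                    # synthetic labels

# 1) Split first, so the validation rows are never seen during fitting.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 2) Fit the scaler on the training rows only.
scaler = StandardScaler().fit(X_train)

# 3) Transform both halves with the training statistics.
X_train_s = scaler.transform(X_train)
X_val_s = scaler.transform(X_val)
```

After this, X_train_s has zero mean per column by construction, while X_val_s is only approximately centred, which is the expected and honest outcome.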
D2 · Non-deterministic results
Two identical runs of your PyTorch training loop give different val accuracies. List four likely culprits.
Solution
Any four of the following: (1) torch.manual_seed never called; (2) DataLoader with
shuffle=True and no seeded generator argument; (3) nondeterministic cuDNN kernels;
(4) data-augmentation RNG not seeded; (5) different package versions across runs.
D3 · Notebook executed out of order
Why is "Restart & Run All" the only honest test of a notebook?
Solution
Jupyter cells retain state from any earlier execution. A variable might still reference a stale value from a cell you've since edited. Restart clears state; "Run All" proves the notebook produces its claimed result top-to-bottom on a clean kernel.
Checkpoint — answer out loud
- Can you state the broadcasting rule and predict whether (4, 3) + (3,) works without running it?
- Can you write a groupby + agg that produces three named columns?
- Can you spot a data leak between fitting a scaler and a train/val split?
- Can you produce a loss-curve plot and a confusion-matrix heatmap from memory?
- Can you list the four seeding calls needed for a reproducible PyTorch run?
Next step
With the toolchain in your fingers, go to Classical machine learning — the fastest path to a competitive USAAIO baseline. For paper-and-pencil drills that test the math under the code, see the Round 2 theory bank.