MLOps & submission packaging

The boring engineering moat between a notebook that works on your laptop and a tarball that scores points on the grader. Seeds, checkpoints, environments, inference scripts, CSV gotchas, and a 10-item pre-submit checklist.

TL;DR. Contests are won by people whose second submission is deterministic, whose fifth submission resumes cleanly after a kernel crash, and whose tenth submission can be reproduced six weeks later from a git SHA. MLOps for USAAIO/IOAI is five disciplines: (1) seed every RNG (torch, numpy, random, CUDA, cudnn.deterministic); (2) pin the environment with pip freeze or uv pip compile; (3) checkpoint model + optimiser + scheduler + RNG state so resume is bit-exact; (4) write a dumb, side-effect-free inference script that emits a perfectly formatted submission CSV; (5) run a 10-min sanity checklist before every upload. Skip any one and you will lose a placement.

1. The intuition

There is code that works on your laptop and there is code that scores points on a grader. Both compute predictions; only one of them gets credit. The gap is almost never the model — it is the dozen small environmental assumptions your notebook silently inherited from your shell.

Your Colab session has a particular CUDA driver, a particular torch wheel, a Python that imports pandas 2.x, an OMP_NUM_THREADS set by the runtime, and an RNG that was seeded by the first cell you ran two hours ago. The grader has none of that. It runs your script in a fresh container, on a different GPU, with the exact requirements file you handed it, expects a CSV with the exact column order it asked for, and compares bytes — sometimes literally, after sorting by id.

MLOps for contests is the art of removing those hidden inputs. Every randomness gets a seed. Every dependency gets a pinned version. Every output gets a deterministic format. Every artifact gets tagged with a git SHA so "the model that scored 0.87" can be rebuilt on demand. None of it is glamorous. All of it converts variance into points.

2. Reproducibility — seed every RNG

Calling torch.manual_seed(42) is not enough. A modern training loop touches at least five independent random number generators:

random.seed(s) — Python's stdlib, used by libraries like albumentations and by any random.shuffle in your data pipeline.
numpy.random.seed(s) — used by classical ML, by older PyTorch dataloaders, and by every np.random.* call in your augmentation code.
torch.manual_seed(s) — CPU tensor ops, weight init, dropout masks.
torch.cuda.manual_seed_all(s) seeds every visible GPU's RNG. Current PyTorch releases already seed the CUDA generators from torch.manual_seed, so this call is a harmless safeguard rather than a strict requirement.
torch.backends.cudnn.deterministic = True plus torch.backends.cudnn.benchmark = False — forces cuDNN to pick algorithms that produce bit-exact output across runs (at the cost of ~10-30% throughput).

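For full determinism you additionally need torch.use_deterministic_algorithms(True) and the env var CUBLAS_WORKSPACE_CONFIG=:4096:8 (required by some cuBLAS matmul kernels on CUDA 10.2 and later; set it before the process first uses CUDA). DataLoader workers need attention too: each worker's seed derives from a base seed drawn in the main process, so pass the loader a seeded torch.Generator, and use worker_init_fn to re-seed NumPy and random inside each worker (older PyTorch releases left forked workers with identical copies of the parent's NumPy state, so augmentations repeated across workers).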
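A minimal sketch of the DataLoader part, following the pattern in PyTorch's reproducibility notes. The names seed_worker and make_loader are illustrative, and the toy dataset stands in for your real one.

```python
import random

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset


def seed_worker(worker_id: int) -> None:
    """Re-seed NumPy and `random` inside each DataLoader worker process."""
    # torch.initial_seed() is already distinct per worker (base_seed + worker_id).
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)


def make_loader(dataset, seed: int, batch_size: int = 4, num_workers: int = 0) -> DataLoader:
    """Build a shuffling loader whose sample order depends only on `seed`."""
    g = torch.Generator()
    g.manual_seed(seed)
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=num_workers,
        worker_init_fn=seed_worker,  # only called when num_workers > 0
        generator=g,                 # drives the shuffle order and the workers' base seed
    )


dataset = TensorDataset(torch.arange(16))
order_a = [batch[0].tolist() for batch in make_loader(dataset, seed=42)]
order_b = [batch[0].tolist() for batch in make_loader(dataset, seed=42)]
```

Two loaders built with the same seed should yield the same batches in the same order; if they do not, some other RNG is leaking into the pipeline.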

Determinism costs throughput. In practice: train with benchmark = True for speed, then re-run the final epoch with deterministic = True to lock the checkpoint you ship. Or just accept the tax for contest runs — the placement is worth the wall clock.

3. Environment pinning

"Works on my machine" loses points. The grader needs to install the same versions you trained on, or your saved weights will silently load under a different op semantics (PyTorch has changed F.scaled_dot_product_attention defaults more than once). The three common artefacts:

requirements.txt — flat list of name==version pins. Generate with pip freeze > requirements.txt, then hand-edit out junk (system packages, pkg-resources==0.0.0 on Ubuntu).
environment.yml — for conda graders. Specifies channel, Python version, and pinned conda + pip deps. Heavier but captures non-Python deps (cudatoolkit).
uv pip compile pyproject.toml -o requirements.txt — modern, fast resolver from Astral. Produces a fully pinned lockfile from loose constraints. Strongly preferred if the grader accepts it.

Always pin Python itself. python --version goes into the README. A pin like torch==2.3.1+cu121 names a version, but the wheel file that satisfies it is built for one interpreter ABI (cp310 vs cp311), and the +cu121 local version resolves only from the PyTorch wheel index, not from PyPI. If you trained on Python 3.10 and ship a wheelhouse of cp310 wheels to a grader running 3.11, the install fails and your submission is a zero.
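One way to catch drift before the grader does is to compare the pins against what the running interpreter actually has installed. The sketch below uses only the standard library; parse_pins and find_mismatches are hypothetical helper names, and it handles plain name==version lines only (no version ranges, no URL requirements).

```python
import sys
from importlib import metadata


def parse_pins(requirements_text: str) -> dict:
    """Extract {name: version} from `name==version` lines, skipping comments and options."""
    pins = {}
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or line.startswith("-") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        name = name.split("[", 1)[0].strip().lower()      # drop extras such as pkg[extra]
        pins[name] = version.split(";", 1)[0].strip()     # drop environment markers
    return pins


def find_mismatches(pins: dict) -> list:
    """Return (name, pinned, installed) for every pin the running interpreter violates."""
    mismatches = []
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != pinned:
            mismatches.append((name, pinned, installed))
    return mismatches


# The wheel ABI tag of this interpreter, e.g. cp310 for Python 3.10.
python_tag = f"cp{sys.version_info.major}{sys.version_info.minor}"

example = """
# comment line
numpy==1.26.4
not-a-real-package-xyz==0.0.1  # hypothetical name
--extra-index-url https://download.pytorch.org/whl/cu121
"""
pins = parse_pins(example)
```

In practice you would read requirements.txt from disk, call find_mismatches(pins), and refuse to train or predict if the list is non-empty.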

4. Checkpointing

A checkpoint is not "the weights". A checkpoint is enough state to resume training such that the next epoch is bit-identical to what would have happened without the interruption. Minimum payload:

model.state_dict() — weights and buffers (BN running stats).
optimizer.state_dict() — Adam moments, SGD momentum buffers. Without these, restart erases your adaptive learning rates.
scheduler.state_dict() — current LR, step count, warmup phase.
epoch and global_step — where to resume the loop and the LR schedule.
best_metric — so the resumed run does not overwrite a better checkpoint with a worse one on epoch 0 of the resume.
RNG state — torch.get_rng_state(), torch.cuda.get_rng_state_all(), numpy state, random state. Without these, the data order on resume differs from a clean run.

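Save with torch.save(payload, path), load with torch.load(path, map_location="cpu"), then push the state dicts back into the live objects. On PyTorch 2.6 and later, torch.load defaults to weights_only=True and rejects the NumPy RNG state in this payload, so pass weights_only=False for checkpoints you wrote yourself. Always save to a temporary filename in the same directory and os.replace it to the final name; otherwise a crash mid-save corrupts your only good checkpoint. Keep at least two rotating slots (last.pt, best.pt).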
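The two-slot rotation can be sketched as follows. To keep the example runnable without PyTorch, pickle stands in for torch.save here; atomic_write and save_slots are illustrative names, and in a real run you would pass save_fn=torch.save.

```python
import os
import pickle
import shutil
import tempfile
from pathlib import Path


def atomic_write(path, payload, save_fn=None) -> None:
    """Write `payload` to a temp file in the same directory, then rename it over `path`."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_name(path.name + ".tmp")   # same directory, so same filesystem
    if save_fn is None:
        # Stand-in for torch.save(payload, tmp) so the sketch runs without PyTorch.
        with open(tmp, "wb") as f:
            pickle.dump(payload, f)
            f.flush()
            os.fsync(f.fileno())               # push the bytes to disk before renaming
    else:
        save_fn(payload, tmp)
    os.replace(tmp, path)


def save_slots(run_dir, payload, metric, best_metric, higher_is_better=True):
    """Always refresh last.pt; refresh best.pt only on improvement. Returns the new best."""
    run_dir = Path(run_dir)
    atomic_write(run_dir / "last.pt", payload)
    improved = metric > best_metric if higher_is_better else metric < best_metric
    if improved:
        tmp = run_dir / "best.pt.tmp"
        shutil.copyfile(run_dir / "last.pt", tmp)
        os.replace(tmp, run_dir / "best.pt")
        return metric
    return best_metric


run_dir = Path(tempfile.mkdtemp())
best = float("-inf")
for epoch, metric in enumerate([0.80, 0.85, 0.83]):
    best = save_slots(run_dir, {"epoch": epoch, "metric": metric}, metric, best)

with open(run_dir / "best.pt", "rb") as f:
    best_payload = pickle.load(f)
with open(run_dir / "last.pt", "rb") as f:
    last_payload = pickle.load(f)
```

After three epochs, best.pt holds epoch 1 (the highest metric) while last.pt holds epoch 2, and no .tmp files are left behind.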

5. Inference script structure

The grader runs one script: predict.py. It must load weights, iterate a test set, and write a CSV. Nothing else. Rules:

No print spam. Graders sometimes capture stdout as part of the submission; an extra log line corrupts the diff. Use logging at WARNING+ to a file if you must.
Deterministic batching — shuffle=False on the test loader, no augmentation, num_workers=0 if the grader is single-threaded.
Switch the model to evaluation mode with model.eval() (equivalently model.train(False)) before the first forward pass. Forgetting this means BatchNorm uses batch statistics on the test set and dropout stays active, so your predictions change with batch size and from run to run.
Wrap inference in with torch.inference_mode(): to disable autograd tracking and skip view and version-counter bookkeeping. It is at least as fast as torch.no_grad(), and usually slightly faster.
Stable column ordering. The submission spec gives a literal column list; emit exactly those columns in exactly that order. df[["id", "prediction"]], not df.columns.tolist(), which returns whatever insertion order your code happened to produce.
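The rules above might translate into a skeleton like this one. It is a sketch only: predict_rows is a placeholder for the real model forward pass, the --log flag and the hard-coded test ids are assumptions, and the --weights and --out flags simply mirror the names used elsewhere in this guide.

```python
import argparse
import csv
import logging
from pathlib import Path

COLUMNS = ["id", "prediction"]   # copy this list literally from the task spec


def predict_rows(test_ids):
    """Placeholder for the real forward pass: returns one dict per test id."""
    return [{"id": i, "prediction": 0.5} for i in test_ids]


def main(argv=None) -> int:
    parser = argparse.ArgumentParser(description="Write a submission CSV and nothing else.")
    parser.add_argument("--weights", default="weights/best.pt")
    parser.add_argument("--out", default="submission.csv")
    parser.add_argument("--log", default="predict.log")
    args = parser.parse_args(argv)

    # Warnings and errors go to a file; stdout stays empty.
    logging.basicConfig(filename=args.log, level=logging.WARNING, force=True)

    if not Path(args.weights).exists():
        logging.warning("weights file %s not found; using placeholder predictions", args.weights)

    rows = predict_rows(test_ids=[3, 1, 2])          # hypothetical ids
    rows.sort(key=lambda r: r["id"])                 # never trust loader order
    with open(args.out, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS, lineterminator="\n")
        writer.writeheader()
        for row in rows:
            # Fixed six-decimal formatting keeps the bytes stable across runs.
            writer.writerow({"id": row["id"], "prediction": f"{row['prediction']:.6f}"})
    return 0
```

In a real predict.py you would end the file with a guarded call such as raise SystemExit(main()) so the grader can invoke it as python predict.py --weights ... --out ....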

6. Submission CSV format gotchas

A surprising share of lost contest points comes from CSV format mistakes, not from model quality. The pathologies:

Trailing newline. Some graders require exactly one trailing \n; pandas writes one by default, but if you concatenate strings yourself you might forget. Some reject a trailing newline. Read the spec.
Header row. Did the spec say "with header" or "no header"? The default df.to_csv(path) writes one — pass header=False if not. Also pass index=False unless the index is the id column.
Sort order. Most graders compare row-by-row after sorting by the id column. Sort explicitly with df.sort_values("id").reset_index(drop=True) before writing — never rely on whatever order the dataloader produced.
Trailing comma. A row written as "42,0.7," has an empty extra column. to_csv never does this; only buggy hand-rolled writers do. Don't hand-roll CSV.
Encoding. UTF-8 vs UTF-8-BOM. Excel saves with a BOM (\xef\xbb\xbf at the start of the file); Python open(..., encoding="utf-8") reads it as a literal first character on the header. Always write with encoding="utf-8" (no BOM); never round-trip through Excel.
Float formatting. 0.30000000000000004 vs 0.3 can fail an exact-match grader. Pass float_format="%.6f" to to_csv.
Line endings. Windows CRLF (\r\n) vs Unix LF (\n). Write with lineterminator="\n" to force LF (pandas 1.5 and later; earlier releases spell the argument line_terminator, which pandas 2.0 removed).
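Most of these byte-level problems can be detected mechanically before upload. The checker below is a sketch that assumes the spec wants a header row, LF endings, no BOM, and exactly one trailing newline; check_csv_bytes is a hypothetical name, and you should adjust the rules to whatever your task statement says.

```python
import codecs


def check_csv_bytes(data: bytes, expected_header: str) -> list:
    """Return a list of human-readable format problems found in raw CSV bytes."""
    problems = []
    if data.startswith(codecs.BOM_UTF8):
        problems.append("file starts with a UTF-8 BOM")
        data = data[len(codecs.BOM_UTF8):]
    if b"\r" in data:
        problems.append("contains CR characters (CRLF line endings?)")
    if not data.endswith(b"\n"):
        problems.append("no trailing newline")
    elif data.endswith(b"\n\n"):
        problems.append("more than one trailing newline")
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return problems + ["not valid UTF-8"]
    lines = text.splitlines()
    found_header = lines[0] if lines else ""
    if found_header != expected_header:
        problems.append(f"header is {found_header!r}, expected {expected_header!r}")
    for number, line in enumerate(lines[1:], start=2):
        if line.endswith(","):
            problems.append(f"line {number} ends with a comma (empty extra column)")
            break   # one report is enough to flag a broken writer
    return problems


good = b"id,prediction\n1,0.300000\n2,0.700000\n"
bad = codecs.BOM_UTF8 + b"id,prediction\r\n1,0.3,\r\n2,0.7"
```

Read the file with open(path, "rb").read() and pass the bytes in; the bad sample above trips the BOM, CR, trailing-newline, and trailing-comma checks at once.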

7. Dockerfile basics

A six-line Dockerfile is more reproducible than a six-page README. If the grader accepts containers, ship one:

# Dockerfile
FROM pytorch/pytorch:2.3.1-cuda12.1-cudnn8-runtime
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "predict.py", "--weights", "weights/best.pt", "--out", "submission.csv"]

Use the official pytorch/pytorch tag rather than nvidia/cuda + a manual pip install torch — the official image already has matched CUDA and cuDNN. Avoid :latest; pin the full tag (2.3.1-cuda12.1-cudnn8-runtime). runtime images are ~3 GB smaller than devel because they omit the CUDA compiler — fine for inference.

8. Logging during training

Three options, in increasing weight:

A tiny CSV log. Open a file, append one row per epoch with epoch, train_loss, val_loss, val_metric, lr, wall_time. Zero dependencies, survives crashes, plots cleanly in pandas.
TensorBoard. SummaryWriter writes events.out.tfevents.* files. Native PyTorch, no account needed, scalars and images both supported.
Weights & Biases. wandb.init(); wandb.log({...}). Best for team contests and long sweeps; needs an API key and an internet connection during training, which is a problem on Round 2 firewalled boxes.

Whichever you pick, also save the training curves as PNG at the end of the run (plt.savefig("curves.png"), with matplotlib.pyplot imported as plt) and bundle them with the checkpoint. When you compare runs two weeks later, you will not want to spin up tensorboard.

9. Versioning model artifacts

"Which checkpoint scored 0.87 on the public leaderboard?" is the wrong question to be answering at 11pm the night before close. Tag every artifact:

Git SHA. Capture git rev-parse HEAD at training start and store it inside the checkpoint payload as meta["git_sha"].
Config hash. Hash the YAML / JSON config that produced the run (hashlib.sha256(json.dumps(cfg, sort_keys=True).encode()).hexdigest()[:8]). Use it as the run directory name: runs/2026-05-18_abc12345/.
Weight hash. Hash the saved .pt file (hashlib.sha256(open(path,"rb").read()).hexdigest()[:8]) and put it in the submission filename: sub_abc12345_w7f3c2a1.csv.
Training metadata. Inside the checkpoint, save Python version, torch version, CUDA version, command-line args, and start/end timestamps. Future-you will thank present-you.
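A standard-library sketch of the tagging scheme above. The helper names (git_sha, config_hash, build_meta) and the example config are illustrative; in a real run you would also record torch.__version__ and torch.version.cuda in the same dict.

```python
import hashlib
import json
import platform
import subprocess
import sys
from datetime import datetime, timezone


def git_sha(default: str = "unknown") -> str:
    """Current commit hash, or `default` when git or the repository is unavailable."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return default


def config_hash(cfg: dict) -> str:
    """Eight hex chars of sha256 over the key-sorted JSON form of the config."""
    blob = json.dumps(cfg, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:8]


def build_meta(cfg: dict) -> dict:
    """Metadata to store under the checkpoint's "meta" key."""
    return {
        "git_sha": git_sha(),
        "config_hash": config_hash(cfg),
        "python": platform.python_version(),
        "argv": list(sys.argv),
        "started_utc": datetime.now(timezone.utc).isoformat(timespec="seconds"),
    }


cfg = {"lr": 1e-3, "batch_size": 32, "model": "resnet18"}   # hypothetical config
meta = build_meta(cfg)
run_dir = f"runs/{meta['started_utc'][:10]}_{meta['config_hash']}"
```

Sorting the keys before hashing means two configs with the same values hash identically regardless of the order in which they were written.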

10. Pre-submission sanity checks (10-min checklist)

Does python predict.py run on an empty validation file without crashing (writes a header-only CSV)?
Do the output column names match the spec character-for-character (case, spaces, underscores)?
Is the row count exactly equal to the number of test ids?
Does the set of submission ids equal the set of test ids exactly (no duplicates, none missing, none extra)?
Is the prediction range sane? Probabilities in [0, 1]; regression values in the same order of magnitude as the training labels.
Any NaN or inf in the output? df.isna().any().any() must be False, and because isna does not flag inf, np.isfinite(df["prediction"]).all() must also be True.
Is the file encoded UTF-8 without BOM? file submission.csv should not say "UTF-8 Unicode (with BOM)".
Line endings LF only? cat -A submission.csv | head shows $ not ^M$.
File size plausible? A 4 KB submission for a 100 K-row test set means a silent truncation.
Hash the file, write it down, then re-run predict.py from scratch in a fresh Python and confirm the hash matches. If it does not, you have a determinism bug; fix before submitting.
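Several of the content checks in this list can be automated in a few lines. The sketch below uses only the standard library and assumes a two-column id,prediction submission with probabilities in [0, 1]; validate_submission and its parameters are hypothetical names.

```python
import csv
import io
import math


def validate_submission(csv_text, test_ids, columns=("id", "prediction"), lo=0.0, hi=1.0):
    """Return a list of problems; an empty list means the content checks passed."""
    problems = []
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header != list(columns):
        return [f"header {header!r} does not match spec {list(columns)!r}"]
    rows = list(reader)
    if len(rows) != len(test_ids):
        problems.append(f"row count {len(rows)} != test size {len(test_ids)}")
    if any(len(r) != len(columns) for r in rows):
        return problems + ["at least one row has the wrong number of fields"]
    ids = [r[0] for r in rows]
    expected = {str(i) for i in test_ids}
    if len(set(ids)) != len(ids):
        problems.append("duplicate ids")
    if set(ids) != expected:
        missing, extra = len(expected - set(ids)), len(set(ids) - expected)
        problems.append(f"{missing} ids missing, {extra} unexpected")
    for r in rows:
        try:
            value = float(r[1])
        except ValueError:
            problems.append(f"id {r[0]}: prediction {r[1]!r} is not a number")
            continue
        if not math.isfinite(value):
            problems.append(f"id {r[0]}: prediction is NaN or inf")
        elif not lo <= value <= hi:
            problems.append(f"id {r[0]}: prediction {value} outside [{lo}, {hi}]")
    return problems


ok = "id,prediction\n1,0.250000\n2,0.750000\n"
broken = "id,prediction\n1,nan\n1,1.500000\n"
```

The broken sample has a duplicated id, a missing id, a NaN, and an out-of-range value, so it produces four separate problem messages.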

11. Python reference implementation

import csv
import hashlib
import os
import random
import time
from pathlib import Path

import numpy as np
import pandas as pd
import torch


def set_seed(seed: int = 42) -> None:
    """Seed every RNG that can affect a PyTorch training run."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
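    # Stronger but slower: forces deterministic kernels everywhere (warns instead of raising).
    torch.use_deterministic_algorithms(True, warn_only=True)
    # Needed by deterministic cuBLAS matmuls; must be in place before CUDA is first used.
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
    # Only child processes see this; export PYTHONHASHSEED in the shell to affect this interpreter.
    os.environ["PYTHONHASHSEED"] = str(seed)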


def save_checkpoint(path, model, optimizer, scheduler, epoch, best_metric, meta=None):
    """Atomic checkpoint write: includes everything needed for bit-exact resume."""
    payload = {
        "model":     model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict() if scheduler is not None else None,
        "epoch":     epoch,
        "best_metric": best_metric,
        "rng": {
            "torch":      torch.get_rng_state(),
            "torch_cuda": torch.cuda.get_rng_state_all(),
            "numpy":      np.random.get_state(),
            "python":     random.getstate(),
        },
        "meta": meta or {},
    }
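    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_name(path.name + ".tmp")   # same directory, so the rename stays on one filesystem
    torch.save(payload, tmp)
    os.replace(tmp, path)   # atomic on POSIX when tmp and path share a filesystem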


def load_checkpoint(path, model, optimizer=None, scheduler=None, map_location="cpu"):
    """Restore model + optimizer + scheduler + RNG state. Returns (epoch, best_metric)."""
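    # weights_only=False: the payload holds NumPy/Python RNG state, not just tensors.
    # Only load checkpoint files you trust.
    ckpt = torch.load(path, map_location=map_location, weights_only=False)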
    model.load_state_dict(ckpt["model"])
    if optimizer is not None and ckpt.get("optimizer") is not None:
        optimizer.load_state_dict(ckpt["optimizer"])
    if scheduler is not None and ckpt.get("scheduler") is not None:
        scheduler.load_state_dict(ckpt["scheduler"])
    rng = ckpt.get("rng", {})
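    # RNG states must be CPU ByteTensors even if map_location moved other tensors to a GPU.
    if "torch" in rng:
        torch.set_rng_state(rng["torch"].cpu())
    if "torch_cuda" in rng and len(rng["torch_cuda"]) == torch.cuda.device_count():
        torch.cuda.set_rng_state_all([s.cpu() for s in rng["torch_cuda"]])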
    if "numpy"      in rng: np.random.set_state(rng["numpy"])
    if "python"     in rng: random.setstate(rng["python"])
    return ckpt.get("epoch", 0), ckpt.get("best_metric", float("-inf"))


@torch.inference_mode()
def write_submission(model, dataloader, output_csv_path, id_col="id", pred_col="prediction"):
    """Run inference and write a properly formatted submission CSV.

    Rules: model.train(False), no shuffle, stable column order, sort by id, UTF-8 no BOM,
    LF line endings, six-decimal float format, no extra prints.
    """
    model.eval()   # same as model.train(False): BatchNorm uses running stats, dropout is off
    device = next(model.parameters()).device
    ids, preds = [], []
    for batch in dataloader:
        x, batch_ids = batch["x"].to(device), batch["id"]
        y = model(x).float().cpu().numpy().ravel()
        preds.extend(y.tolist())
        ids.extend(batch_ids if isinstance(batch_ids, list) else batch_ids.tolist())
    df = pd.DataFrame({id_col: ids, pred_col: preds})
    df = df.sort_values(id_col).reset_index(drop=True)
    assert not df.isna().any().any(),       "NaN in submission"
    assert np.isfinite(df[pred_col]).all(), "inf in submission"
    df.to_csv(
        output_csv_path,
        index=False,
        encoding="utf-8",        # no BOM
        lineterminator="\n",     # LF
        float_format="%.6f",
        quoting=csv.QUOTE_MINIMAL,
    )


class CSVLogger:
    """Tiny append-only CSV training log. Zero dependencies, crash-safe."""

    def __init__(self, path, fieldnames):
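        self.path = Path(path)
        self.path.parent.mkdir(parents=True, exist_ok=True)   # e.g. "runs/" may not exist yet
        self.fieldnames = list(fieldnames)
        write_header = not self.path.exists() or self.path.stat().st_size == 0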
        self._f = open(self.path, "a", newline="", encoding="utf-8")
        self._w = csv.DictWriter(self._f, fieldnames=self.fieldnames)
        if write_header:
            self._w.writeheader()
            self._f.flush()

    def log(self, **row):
        self._w.writerow({k: row.get(k, "") for k in self.fieldnames})
        self._f.flush()   # survive a KeyboardInterrupt mid-epoch

    def close(self):
        self._f.close()


def sha8(path):
    """First 8 hex chars of a file's sha256 — short, unique enough for filenames."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()[:8]


if __name__ == "__main__":
    set_seed(42)
    logger = CSVLogger("runs/log.csv",
                       ["epoch", "train_loss", "val_loss", "val_metric", "lr", "wall"])
    logger.log(epoch=0, train_loss=0.51, val_loss=0.49, val_metric=0.83, lr=1e-3, wall=time.time())
    logger.close()

12. Common USAAIO / IOAI applications

Round 2 needs deterministic results. Two graders compare your predict.py output across re-runs. If your inference is non-deterministic, even a strong model can disagree with itself and lose tiebreaker points.
Some IOAI tasks compare submission CSVs byte-by-byte after sorting. A stray BOM, a Windows CRLF, or an unsorted id column turns a perfect prediction into a zero. The format is part of the answer.
Competitive Kaggle teams version every run. "Sub 47, weights w7f3c2a1, config abc12345, public 0.872" is a sentence you want to be able to write four weeks into a competition without grepping six notebooks.
Round 2 firewalls block pip install and W&B. Ship requirements that resolve from a local wheelhouse, and use file-based logging instead of cloud loggers.
Compute time caps. IOAI tasks often cap inference at 10-30 minutes. A test loader that still applies training-time augmentation wastes part of that budget and makes outputs vary between runs; deterministic inference settings are also speed insurance.

13. Drills

D1 · Why setting torch.manual_seed alone isn't enough

You set torch.manual_seed(42) at the top of your script and your two runs still produce different validation curves. List four other sources of randomness you forgot.

Solution

(1) numpy.random: your augmentation pipeline probably uses np.random.*. (2) Python's random: used by random.shuffle and many third-party libs. (3) cuDNN's algorithm selection: torch.backends.cudnn.benchmark = True can pick different kernels per run, and with deterministic = False the chosen kernel may itself be non-deterministic. (4) Non-deterministic CUDA kernels outside cuDNN (atomic-add based ops such as scatter_add and index_add), which need torch.use_deterministic_algorithms(True). Also: DataLoader shuffling and workers (pass a seeded generator and a worker_init_fn) and PYTHONHASHSEED for set-order sensitivity. Current PyTorch releases seed the CUDA generators from torch.manual_seed, so a missing torch.cuda.manual_seed_all is rarely the culprit, though calling it anyway is harmless.

D2 · cudnn.deterministic = True vs False

What changes operationally when you flip torch.backends.cudnn.deterministic from False (default) to True?

Solution

cuDNN ships multiple kernels per op (different tiling, reduction order, atomic adds). With deterministic = False, cuDNN picks the fastest available kernel for the current shape, some of which use non-deterministic atomic floating-point reductions — the result varies bit-for-bit across runs. With deterministic = True, cuDNN is restricted to kernels whose output is reproducible across runs; these are typically 10-30% slower. You should also set benchmark = False; benchmark mode re-tunes kernel choice per shape on the first call, which adds cold-start variance.

D3 · Checkpoint scheme that survives KeyboardInterrupt

Design a checkpoint protocol such that hitting Ctrl-C at any point during training leaves you with a usable checkpoint and no corrupted files.

Solution

Save to a temporary path (last.pt.tmp) and os.replace to the final name (last.pt) — atomic on POSIX, so an interrupt either leaves the old last.pt untouched or fully replaces it. Maintain two slots: last.pt (every epoch) and best.pt (only on metric improvement) — losing one to corruption still leaves a fallback. Wrap the training loop in try/except KeyboardInterrupt and run one final save_checkpoint before exit. Flush the CSV log on every row so the training history survives. Never write directly to the final path.
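The interrupt handling can be sketched without any framework. Here run_epoch and save_checkpoint are caller-supplied callables (hypothetical signatures), and the interrupt save goes to a separate slot so the last clean end-of-epoch checkpoint is never overwritten by a partially trained one; that separate slot is a design choice of this sketch, not something the protocol above requires.

```python
def train(num_epochs, run_epoch, save_checkpoint, start_epoch=0):
    """Run epochs; on Ctrl-C, save once more and stop cleanly.

    `run_epoch(epoch)` trains one epoch. `save_checkpoint(epoch, slot)` is expected
    to write atomically to the named slot. Returns (last_completed_epoch, interrupted).
    """
    completed = start_epoch - 1          # index of the last fully finished epoch
    try:
        for epoch in range(start_epoch, num_epochs):
            run_epoch(epoch)
            completed = epoch
            save_checkpoint(epoch, "last")
    except KeyboardInterrupt:
        # The weights now include part of the interrupted epoch, so this save is
        # usable for inspection or a warm restart but not for a bit-exact resume.
        if completed >= start_epoch:
            save_checkpoint(completed, "interrupt")
        return completed, True
    return completed, False


saved = []


def fake_epoch(epoch):
    if epoch == 2:
        raise KeyboardInterrupt          # simulate Ctrl-C in the middle of epoch 2


result = train(5, fake_epoch, lambda epoch, slot: saved.append((epoch, slot)))
```

In the simulated run, epochs 0 and 1 finish and are saved normally, the interrupt arrives during epoch 2, and one extra save is written before the function returns.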

D4 · Validation accuracy differs between Colab and Kaggle

Identical code, identical seed, identical weights. Colab reports val accuracy 0.873; Kaggle reports 0.871. What are the candidate causes?

Solution

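(1) Different PyTorch / CUDA / cuDNN versions: kernel implementations differ in the last few ULP. (2) Different GPU (T4 vs P100 vs A100): different architectures select different kernels and reduction orders, and on Ampere and newer GPUs TF32 is enabled by default for cuDNN convolutions (torch.backends.cudnn.allow_tf32) and, in PyTorch versions before 1.12, for matmuls too (torch.backends.cuda.matmul.allow_tf32). (3) A different batch size or num_workers, if either is derived from the platform (GPU memory, os.cpu_count()): harmless in eval mode, but it changes results whenever batch statistics leak in. (4) Hash-order sensitivity: string hashing is randomised per process unless PYTHONHASHSEED is fixed, so anything built by iterating a set (a class-name to index mapping, for example) can differ between runs and platforms. (5) BatchNorm in training mode during evaluation (forgot model.eval()) makes accuracy depend on batch size, which may differ between platforms. Fix: pin versions, set both allow_tf32 flags to False, force eval mode, fix the batch size, compute the metric on CPU.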

D5 · Design a 10-item pre-submit checklist

You have 10 minutes before the upload window closes. Write the 10 checks you run on your submission.csv.

Solution

(1) Row count equals test-set size. (2) Column names match spec exactly. (3) Ids are unique. (4) The set of ids exactly matches the test ids (none missing, none extra). (5) No NaN/inf. (6) Prediction range plausible (df.describe()). (7) Encoding UTF-8, no BOM (file submission.csv). (8) Line endings LF only (cat -A submission.csv | head). (9) File size in expected order of magnitude. (10) Re-run inference from a fresh Python and confirm SHA-256 matches, which catches determinism regressions. Bonus: diff against your previous submission to see what actually changed.

Next step

MLOps is the engineering layer underneath every other topic on this site. Pair it with engineering survival (the fourteen bugs that eat points), end-to-end Colab notebooks (where these patterns live in runnable form), and contest cheatsheets (one-page recall of the API calls used here).