MLOps & submission packaging

The boring engineering moat between a notebook that works on your laptop and a tarball that scores points on the grader. Seeds, checkpoints, environments, inference scripts, CSV gotchas, and a 10-item pre-submit checklist.

TL;DR. Contests are won by people whose second submission is deterministic, whose fifth submission resumes cleanly after a kernel crash, and whose tenth submission can be reproduced six weeks later from a git SHA. MLOps for USAAIO/IOAI is five disciplines: (1) seed every RNG (torch, numpy, random, CUDA, cudnn.deterministic); (2) pin the environment with pip freeze or uv pip compile; (3) checkpoint model + optimiser + scheduler + RNG state so resume is bit-exact; (4) write a dumb, side-effect-free inference script that emits a perfectly formatted submission CSV; (5) run a 10-min sanity checklist before every upload. Skip any one and you will lose a placement.

1. The intuition

There is code that works on your laptop and there is code that scores points on a grader. Both compute predictions; only one of them gets credit. The gap is almost never the model — it is the dozen small environmental assumptions your notebook silently inherited from your shell.

Your Colab session has a particular CUDA driver, a particular torch wheel, a Python that imports pandas 2.x, an OMP_NUM_THREADS set by the runtime, and an RNG that was seeded by the first cell you ran two hours ago. The grader has none of that. It runs your script in a fresh container, on a different GPU, with the exact requirements file you handed it, expects a CSV with the exact column order it asked for, and compares bytes — sometimes literally, after sorting by id.

MLOps for contests is the art of removing those hidden inputs. Every randomness gets a seed. Every dependency gets a pinned version. Every output gets a deterministic format. Every artifact gets tagged with a git SHA so "the model that scored 0.87" can be rebuilt on demand. None of it is glamorous. All of it converts variance into points.

2. Reproducibility — seed every RNG

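Calling torch.manual_seed(42) is not enough. A modern training loop touches at least five independent sources of randomness: Python's random module, NumPy's global RNG, PyTorch's CPU generator, the per-device CUDA generators, and cuDNN's kernel selection.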

For full determinism you additionally need torch.use_deterministic_algorithms(True) and the env var CUBLAS_WORKSPACE_CONFIG=:4096:8 (some matmul kernels require it). DataLoader workers need their own per-worker seeding via worker_init_fn: PyTorch gives each worker a distinct torch seed, but the state of other libraries such as NumPy can be copied unchanged from the parent, so without re-seeding several workers may draw the same augmentation stream.
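
A minimal sketch of per-worker seeding, following the recipe in PyTorch's reproducibility notes. The names seed_worker and make_loader are illustrative, not part of any library, and the batch size and seed are arbitrary.

```python
import random

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset


def seed_worker(worker_id: int) -> None:
    """Re-seed NumPy and Python's random inside each DataLoader worker."""
    # Inside a worker, torch.initial_seed() is already distinct per worker;
    # fold it into 32 bits because NumPy's legacy seed must fit in uint32.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)


def make_loader(dataset, seed: int, batch_size: int = 4, num_workers: int = 0):
    """Build a shuffling DataLoader whose batch order depends only on `seed`."""
    g = torch.Generator()
    g.manual_seed(seed)  # drives the shuffle and the base seed handed to workers
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=num_workers,
        worker_init_fn=seed_worker,  # only called when num_workers > 0
        generator=g,
    )


if __name__ == "__main__":
    data = TensorDataset(torch.arange(12))
    first = [batch[0].tolist() for batch in make_loader(data, seed=42)]
    second = [batch[0].tolist() for batch in make_loader(data, seed=42)]
    print(first == second)  # compares the batch order of two same-seed loaders
```

Passing a dedicated torch.Generator keeps the shuffle order independent of whatever else consumed the global torch RNG before the loader was built.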

Determinism costs throughput. In practice: train with benchmark = True for speed, then re-run the final epoch with deterministic = True to lock the checkpoint you ship. Or just accept the tax for contest runs — the placement is worth the wall clock.

3. Environment pinning

"Works on my machine" loses points. The grader needs to install the same versions you trained on, or your saved weights will silently load under a different op semantics (PyTorch has changed F.scaled_dot_product_attention defaults more than once). The three common artefacts:

Always pin Python itself. python --version goes into the README. A pin like torch==2.3.1+cu121 names a version, not a file: pip resolves it to a wheel built for one interpreter ABI (cp310 vs cp311), so a wheel file downloaded for one Python cannot be installed on another. If you trained on Colab's Python 3.10 and the grader runs a Python for which that torch release published no wheel, the install fails and your submission is a zero.
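
As a complement to pip freeze, a snapshot of the interpreter and installed packages can be taken from inside the training process itself. The sketch below uses only the standard library; environment_lines and write_environment are hypothetical helper names, and the output file name is an assumption.

```python
import platform
import sys
from importlib import metadata
from pathlib import Path


def environment_lines() -> list:
    """Return a Python-version comment line followed by sorted 'name==version' pins."""
    lines = [f"# python=={platform.python_version()} ({sys.platform})"]
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip broken installs that carry no metadata name
            pins.add(f"{name}=={dist.version}")
    return lines + sorted(pins, key=str.lower)


def write_environment(path="environment.lock.txt") -> Path:
    """Write the snapshot next to the checkpoint so the two travel together."""
    out = Path(path)
    out.write_text("\n".join(environment_lines()) + "\n", encoding="utf-8")
    return out


if __name__ == "__main__":
    print(write_environment())
```

This records what was actually importable during the run, which can differ from the requirements file if a notebook cell installed something by hand.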

4. Checkpointing

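A checkpoint is not "the weights". A checkpoint is enough state to resume training such that the next epoch is bit-identical to what would have happened without the interruption. Minimum payload: the model, optimizer, and scheduler state dicts, the epoch counter, the best metric so far, and the RNG state of torch (CPU and CUDA), NumPy, and Python's random.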

Save with torch.save(payload, path), load with torch.load(path, map_location="cpu") then push state dicts back into the live objects. Always save to a temporary filename and os.replace to the final name — otherwise a crash mid-save corrupts your only good checkpoint. Keep at least two rotating slots (last.pt, best.pt).

5. Inference script structure

The grader runs one script: predict.py. It must load weights, iterate a test set, and write a CSV. Nothing else. Rules:

6. Submission CSV format gotchas

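A surprising share of lost contest points comes from CSV format mistakes, not from model quality. The usual pathologies are wrong column names or order, a wrong row count, duplicate or missing ids, NaN or inf values, a UTF-8 BOM, and CRLF line endings.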

7. Dockerfile basics

A five-line Dockerfile is more reproducible than a five-page README. If the grader accepts containers, ship one:

# Dockerfile
FROM pytorch/pytorch:2.3.1-cuda12.1-cudnn8-runtime
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "predict.py", "--weights", "weights/best.pt", "--out", "submission.csv"]

Use the official pytorch/pytorch tag rather than nvidia/cuda + a manual pip install torch — the official image already has matched CUDA and cuDNN. Avoid :latest; pin the full tag (2.3.1-cuda12.1-cudnn8-runtime). runtime images are ~3 GB smaller than devel because they omit the CUDA compiler — fine for inference.

8. Logging during training

Three options, in increasing weight:

Whichever you pick, also save the training curves as PNG at the end of the run (plt.savefig("curves.png"), with matplotlib.pyplot imported as plt) and bundle them with the checkpoint. When you compare runs two weeks later, you will not want to spin up tensorboard.

9. Versioning model artifacts

"Which checkpoint scored 0.87 on the public leaderboard?" is the wrong question to be answering at 11pm the night before close. Tag every artifact:

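One possible tagging scheme, sketched under assumptions: the file name carries the validation metric, the git commit, and a short content hash. The name layout and the helpers git_short_sha, content_tag, and tagged_name are hypothetical, not a contest requirement.

```python
import hashlib
import subprocess
from pathlib import Path
from typing import Optional


def git_short_sha(default: str = "nogit") -> str:
    """Short SHA of HEAD, or `default` when git or the repository is unavailable."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip() or default
    except (OSError, subprocess.CalledProcessError):
        return default


def content_tag(path, n: int = 8) -> str:
    """First n hex chars of the file's SHA-256 (reads the whole file into memory)."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()[:n]


def tagged_name(path, metric: float, git_sha: Optional[str] = None) -> str:
    """Build '<stem>__val<metric>__git<sha>__<hash><suffix>' for an artifact."""
    p = Path(path)
    sha = git_sha if git_sha is not None else git_short_sha()
    return f"{p.stem}__val{metric:.4f}__git{sha}__{content_tag(p)}{p.suffix}"
```

With this layout, a listing of the weights directory answers the 11pm question directly: the metric and the commit that produced each file are in its name.
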
10. Pre-submission sanity checks (10-min checklist)

  1. Does python predict.py run on an empty validation file without crashing (writes a header-only CSV)?
  2. Do the output column names match the spec character-for-character (case, spaces, underscores)?
  3. Is the row count exactly equal to the number of test ids?
  4. Are all ids in the submission present in the test set (no duplicates, no missing)?
  5. Is the prediction range sane? Probabilities in [0, 1]; regression values in the same order of magnitude as the training labels.
  6. Any NaN or inf in the output? df.isna().any().any() must be False.
  7. Is the file encoded UTF-8 without BOM? file submission.csv should not say "UTF-8 Unicode (with BOM)".
  8. Line endings LF only? cat -A submission.csv | head shows $ not ^M$.
  9. File size plausible? A 4 KB submission for a 100 K-row test set means a silent truncation.
  10. Hash the file, write it down, then re-run predict.py from scratch in a fresh Python and confirm the hash matches. If it does not, you have a determinism bug; fix before submitting.
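
Several of the checks above can be automated. The sketch below is one way to do it with the standard library only; it assumes a two-column submission whose second column is numeric, and check_submission is an illustrative name.

```python
import csv
import io
import math
from pathlib import Path


def check_submission(path, expected_ids, columns=("id", "prediction")):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    raw = Path(path).read_bytes()
    if raw.startswith(b"\xef\xbb\xbf"):
        problems.append("file starts with a UTF-8 BOM")
    if b"\r" in raw:
        problems.append("file contains CR characters (CRLF line endings?)")

    text = raw.decode("utf-8-sig")  # tolerate a BOM so the later checks still run
    rows = [r for r in csv.reader(io.StringIO(text, newline="")) if r]
    if not rows or rows[0] != list(columns):
        problems.append(f"header is {rows[0] if rows else None}, expected {list(columns)}")
        return problems

    body = rows[1:]
    ids = [r[0] for r in body]
    expected = [str(i) for i in expected_ids]
    if len(body) != len(expected):
        problems.append(f"{len(body)} rows, expected {len(expected)}")
    if len(set(ids)) != len(ids):
        problems.append("duplicate ids")
    if set(expected) - set(ids):
        problems.append("some test ids are missing")
    if set(ids) - set(expected):
        problems.append("some ids are not in the test set")

    for r in body:  # report only the first malformed row
        if len(r) != len(columns):
            problems.append(f"row for id {r[0]} has {len(r)} fields")
            break
        try:
            value = float(r[1])
        except ValueError:
            problems.append(f"non-numeric prediction for id {r[0]}")
            break
        if not math.isfinite(value):
            problems.append(f"NaN or inf prediction for id {r[0]}")
            break
    return problems
```

Returning a list instead of raising on the first failure means one run shows every problem at once, which matters when the upload window is about to close.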

11. Python reference implementation

import csv
import hashlib
import os
import random
import time
from pathlib import Path

import numpy as np
import pandas as pd
import torch


def set_seed(seed: int = 42) -> None:
    """Seed every RNG that can affect a PyTorch training run."""
    # cuBLAS reads this when it initialises, so set it before any CUDA work.
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
    # Hash randomisation is fixed at interpreter start-up, so this only reaches
    # child processes. Export PYTHONHASHSEED in the shell to cover this process.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Stronger but slower: forces deterministic kernels everywhere.
    torch.use_deterministic_algorithms(True, warn_only=True)


def save_checkpoint(path, model, optimizer, scheduler, epoch, best_metric, meta=None):
    """Atomic checkpoint write: includes everything needed for bit-exact resume."""
    payload = {
        "model":     model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict() if scheduler is not None else None,
        "epoch":     epoch,
        "best_metric": best_metric,
        "rng": {
            "torch":      torch.get_rng_state(),
            "torch_cuda": torch.cuda.get_rng_state_all(),
            "numpy":      np.random.get_state(),
            "python":     random.getstate(),
        },
        "meta": meta or {},
    }
    tmp = Path(path).with_suffix(".pt.tmp")
    torch.save(payload, tmp)
    os.replace(tmp, path)   # atomic on POSIX; crash-safe.


def load_checkpoint(path, model, optimizer=None, scheduler=None, map_location="cpu"):
    """Restore model + optimizer + scheduler + RNG state. Returns (epoch, best_metric)."""
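    # The payload holds NumPy and Python RNG state, not only tensors, so the
    # weights-only loader (the default from PyTorch 2.6) rejects it. Full
    # unpickling can run arbitrary code: only load checkpoints you wrote yourself.
    ckpt = torch.load(path, map_location=map_location, weights_only=False)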
    model.load_state_dict(ckpt["model"])
    if optimizer is not None and ckpt.get("optimizer") is not None:
        optimizer.load_state_dict(ckpt["optimizer"])
    if scheduler is not None and ckpt.get("scheduler") is not None:
        scheduler.load_state_dict(ckpt["scheduler"])
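    rng = ckpt.get("rng", {})
    # RNG states are moved back to CPU in case map_location put them on a GPU.
    if "torch"      in rng: torch.set_rng_state(rng["torch"].cpu())
    if "torch_cuda" in rng: torch.cuda.set_rng_state_all([s.cpu() for s in rng["torch_cuda"]])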
    if "numpy"      in rng: np.random.set_state(rng["numpy"])
    if "python"     in rng: random.setstate(rng["python"])
    return ckpt.get("epoch", 0), ckpt.get("best_metric", float("-inf"))


@torch.inference_mode()
def write_submission(model, dataloader, output_csv_path, id_col="id", pred_col="prediction"):
    """Run inference and write a properly formatted submission CSV.

    Rules: model.train(False), no shuffle, stable column order, sort by id, UTF-8 no BOM,
    LF line endings, six-decimal float format, no extra prints.
    """
    model.train(False)   # same effect as model.eval(): dropout off, BatchNorm uses running stats
    device = next(model.parameters()).device
    ids, preds = [], []
    for batch in dataloader:
        x, batch_ids = batch["x"].to(device), batch["id"]
        y = model(x).float().cpu().numpy().ravel()
        preds.extend(y.tolist())
        ids.extend(batch_ids if isinstance(batch_ids, list) else batch_ids.tolist())
    df = pd.DataFrame({id_col: ids, pred_col: preds})
    df = df.sort_values(id_col).reset_index(drop=True)
    assert not df.isna().any().any(),       "NaN in submission"
    assert np.isfinite(df[pred_col]).all(), "inf in submission"
    df.to_csv(
        output_csv_path,
        index=False,
        encoding="utf-8",        # no BOM
        lineterminator="\n",     # LF
        float_format="%.6f",
        quoting=csv.QUOTE_MINIMAL,
    )


class CSVLogger:
    """Tiny append-only CSV training log. Zero dependencies, crash-safe."""

    def __init__(self, path, fieldnames):
        self.path = Path(path)
        self.fieldnames = list(fieldnames)
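        self.path.parent.mkdir(parents=True, exist_ok=True)   # e.g. create runs/ on first use
        write_header = not self.path.exists() or self.path.stat().st_size == 0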
        self._f = open(self.path, "a", newline="", encoding="utf-8")
        self._w = csv.DictWriter(self._f, fieldnames=self.fieldnames)
        if write_header:
            self._w.writeheader()
            self._f.flush()

    def log(self, **row):
        self._w.writerow({k: row.get(k, "") for k in self.fieldnames})
        self._f.flush()   # survive a KeyboardInterrupt mid-epoch

    def close(self):
        self._f.close()


def sha8(path):
    """First 8 hex chars of a file's sha256 — short, unique enough for filenames."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()[:8]


if __name__ == "__main__":
    set_seed(42)
    logger = CSVLogger("runs/log.csv",
                       ["epoch", "train_loss", "val_loss", "val_metric", "lr", "wall"])
    logger.log(epoch=0, train_loss=0.51, val_loss=0.49, val_metric=0.83, lr=1e-3, wall=time.time())
    logger.close()

12. Common USAAIO / IOAI applications

13. Drills

D1 · Why setting torch.manual_seed alone isn't enough

You set torch.manual_seed(42) at the top of your script and your two runs still produce different validation curves. List four other sources of randomness you forgot.

Solution

(1) numpy.random: your augmentation pipeline probably uses np.random.*. (2) Python's random: used by random.shuffle and many third-party libs. (3) DataLoader workers: each worker needs re-seeding through worker_init_fn, otherwise non-torch RNG state can be duplicated across workers. (4) cuDNN's algorithm selection: torch.backends.cudnn.benchmark = True picks kernels by timing them, so the choice can differ per run, and with deterministic = False the chosen kernel may itself be non-deterministic. Also: PYTHONHASHSEED for code that depends on set or hash ordering. Calling torch.cuda.manual_seed_all is still worth it for explicitness, but current PyTorch releases already seed the CUDA generators from torch.manual_seed, so it is rarely the culprit.

D2 · cudnn.deterministic = True vs False

What changes operationally when you flip torch.backends.cudnn.deterministic from False (default) to True?

Solution

cuDNN ships multiple kernels per op (different tiling, reduction order, atomic adds). With deterministic = False, cuDNN picks the fastest available kernel for the current shape, some of which use non-deterministic atomic floating-point reductions — the result varies bit-for-bit across runs. With deterministic = True, cuDNN is restricted to kernels whose output is reproducible across runs; these are typically 10-30% slower. You should also set benchmark = False; benchmark mode re-tunes kernel choice per shape on the first call, which adds cold-start variance.

D3 · Checkpoint scheme that survives KeyboardInterrupt

Design a checkpoint protocol such that hitting Ctrl-C at any point during training leaves you with a usable checkpoint and no corrupted files.

Solution

Save to a temporary path (last.pt.tmp) and os.replace to the final name (last.pt) — atomic on POSIX, so an interrupt either leaves the old last.pt untouched or fully replaces it. Maintain two slots: last.pt (every epoch) and best.pt (only on metric improvement) — losing one to corruption still leaves a fallback. Wrap the training loop in try/except KeyboardInterrupt and run one final save_checkpoint before exit. Flush the CSV log on every row so the training history survives. Never write directly to the final path.
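
The protocol can be sketched without any ML dependency. In the sketch below JSON stands in for torch.save, run_epoch is a caller-supplied callable returning the validation metric, and the names atomic_write_json and train are illustrative.

```python
import json
import os
from pathlib import Path


def atomic_write_json(state: dict, path) -> None:
    """Write to '<path>.tmp', then rename over the final path in one step."""
    path = Path(path)
    tmp = path.with_name(path.name + ".tmp")
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # push the bytes to disk before the rename
    os.replace(tmp, path)


def train(num_epochs: int, ckpt_dir, run_epoch) -> int:
    """Save last.json every epoch and best.json on improvement.

    Returns the number of completed epochs, also after Ctrl-C.
    """
    ckpt_dir = Path(ckpt_dir)
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    best, done, state = float("-inf"), 0, None
    try:
        for epoch in range(num_epochs):
            metric = run_epoch(epoch)  # may raise KeyboardInterrupt
            done = epoch + 1
            state = {"epoch": done, "metric": metric}
            atomic_write_json(state, ckpt_dir / "last.json")
            if metric > best:
                best = metric
                atomic_write_json(state, ckpt_dir / "best.json")
    except KeyboardInterrupt:
        # The interrupted epoch is discarded. Re-save the last completed state
        # in case the interrupt landed between computing it and writing it.
        if state is not None:
            atomic_write_json(state, ckpt_dir / "last.json")
    return done
```

The design choice is that an interrupt never produces a half-epoch checkpoint: resume always starts from an epoch boundary, which keeps the resume logic simple.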

D4 · Validation accuracy differs between Colab and Kaggle

Identical code, identical seed, identical weights. Colab reports val accuracy 0.873; Kaggle reports 0.871. What are the candidate causes?

Solution

(1) Different PyTorch / CUDA / cuDNN versions: kernel implementations differ in the last few ULP. (2) Different GPU (T4 vs P100 vs A100): different kernels and reduction orders, and on Ampere or newer TF32 can be active (torch.backends.cudnn.allow_tf32 defaults to True for convolutions; torch.backends.cuda.matmul.allow_tf32 has defaulted to False since PyTorch 1.12). (3) Different num_workers or batch size in each platform's config: with any randomness in the data pipeline, this changes what each batch contains. (4) Different Python version or hash seed: iteration order over sets, or over unsorted directory listings, can differ and reorder the data. (5) BatchNorm in training mode during eval (forgot model.train(False)) makes accuracy depend on batch size, which may differ between platforms. Fix: pin versions, set both allow_tf32 flags to False, force eval mode, fix batch size, compute the metric on CPU.
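
When two platforms disagree, a first step is to print the same fingerprint on both and compare. The sketch below collects the fields named in the solution; runtime_fingerprint and disable_tf32 are hypothetical helper names.

```python
import platform

import torch


def runtime_fingerprint() -> dict:
    """Collect version and hardware facts that often explain small metric drift."""
    cuda_ok = torch.cuda.is_available()
    return {
        "python": platform.python_version(),
        "torch": torch.__version__,
        "cuda": torch.version.cuda,               # None on CPU-only builds
        "cudnn": torch.backends.cudnn.version(),  # None when cuDNN is unavailable
        "gpu": torch.cuda.get_device_name(0) if cuda_ok else None,
        "matmul_tf32": torch.backends.cuda.matmul.allow_tf32,
        "cudnn_tf32": torch.backends.cudnn.allow_tf32,
    }


def disable_tf32() -> None:
    """Ask for full float32 precision in matmuls and convolutions."""
    torch.backends.cuda.matmul.allow_tf32 = False
    torch.backends.cudnn.allow_tf32 = False


if __name__ == "__main__":
    disable_tf32()
    for key, value in runtime_fingerprint().items():
        print(f"{key:12s} {value}")
```

Saving this dictionary into the checkpoint's meta field makes the comparison possible weeks later, after the original runtime is gone.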

D5 · Design a 10-item pre-submit checklist

You have 10 minutes before the upload window closes. Write the 10 checks you run on your submission.csv.

Solution

(1) Row count equals test-set size. (2) Column names match spec exactly. (3) Ids are unique. (4) Ids exactly match the test ids (none missing, none extra). (5) No NaN/inf. (6) Prediction range plausible (df.describe()). (7) Encoding UTF-8, no BOM (file submission.csv). (8) Line endings LF only (cat -A submission.csv | head). (9) File size in expected order of magnitude. (10) Re-run inference from a fresh Python and confirm SHA-256 matches, which catches determinism regressions. Bonus: diff against your previous submission to see what actually changed.
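
The bonus check can be sketched as follows, using only the standard library. It assumes both files use the id and prediction column names from the reference implementation and hold numeric predictions; load_predictions and diff_submissions are illustrative names, and the tolerance is arbitrary.

```python
import csv


def load_predictions(path, id_col="id", pred_col="prediction") -> dict:
    """Read a submission CSV into {id: float prediction}."""
    with open(path, newline="", encoding="utf-8") as f:
        return {row[id_col]: float(row[pred_col]) for row in csv.DictReader(f)}


def diff_submissions(old_path, new_path, tol: float = 1e-6) -> dict:
    """Summarise how a new submission differs from the previous one."""
    old, new = load_predictions(old_path), load_predictions(new_path)
    shared = old.keys() & new.keys()
    deltas = [abs(old[i] - new[i]) for i in shared]
    return {
        "only_in_old": len(old.keys() - new.keys()),
        "only_in_new": len(new.keys() - old.keys()),
        "changed": sum(d > tol for d in deltas),
        "max_abs_delta": max(deltas, default=0.0),
    }
```

A result with zero changed rows after a supposed model change, or thousands of changed rows after a supposed no-op refactor, is a signal to stop and investigate before uploading.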

Next step

MLOps is the engineering layer underneath every other topic on this site. Pair it with engineering survival (the fourteen bugs that eat points), end-to-end Colab notebooks (where these patterns live in runnable form), and contest cheatsheets (one-page recall of the API calls used here).