USAAIO 2024 R1 · Problem 1 · Probability & Bayes [reconstructed]

Contest: 2024 USA-NA-AIO Round 1 · Round: Round 1 (online) · Category: Probability theory / naive Bayes.

Official sources: usaaio.org/past-problems (2024 PDF not yet listed) · USAAIO syllabus.

Reconstruction notice. This walkthrough is [reconstructed] from the USAAIO syllabus topic list. The 2024 Round-1 PDF is not in the public archive at time of writing; cross-check once it publishes.

1. Problem restatement

A multi-part problem testing probability fundamentals via small word problems plus an implementation component. Typical parts: compute conditional probability from a 2×2 table; state and apply Bayes' rule to a medical-test scenario; compute the posterior of a hypothesis given evidence; implement a naive-Bayes spam classifier from scratch; evaluate on a held-out set.

2. What's being tested

Reading word problems. Translating "85% sensitivity, 5% false-positive rate, 1% prevalence" into P(+ | D), P(+ | ¬D), P(D).
Bayes' rule. Both the formula and the intuition — base rates dominate.
Independence assumptions. Why naive Bayes is "naive" and when it still works.
Laplace smoothing. Avoiding zero probabilities for unseen features.

3. Data exploration / setup

The implementation parts use a small text dataset (spam / not-spam, or sentiment). Synthetic stand-in:

import pandas as pd
df = pd.read_csv("messages.csv")    # columns: text, label (0/1)
print(df.label.value_counts())
print(df.text.str.split().str.len().describe())

4. Baseline approach

from collections import Counter
import numpy as np

def fit_nb(texts, labels, alpha=1.0):
    classes = sorted(set(labels))
    vocab = sorted({w for t in texts for w in t.split()})
    word_idx = {w: i for i, w in enumerate(vocab)}
    log_prior = np.zeros(len(classes))
    log_lik = np.zeros((len(classes), len(vocab)))
    for ci, c in enumerate(classes):
        docs = [t for t, l in zip(texts, labels) if l == c]
        log_prior[ci] = np.log(len(docs) / len(texts))
        wc = Counter(w for d in docs for w in d.split())
        total = sum(wc.values()) + alpha * len(vocab)
        for w, i in word_idx.items():
            log_lik[ci, i] = np.log((wc[w] + alpha) / total)
    return classes, word_idx, log_prior, log_lik

def predict_nb(text, classes, word_idx, log_prior, log_lik):
    scores = log_prior.copy()
    for w in text.split():
        if w in word_idx:
            scores += log_lik[:, word_idx[w]]
    return classes[int(np.argmax(scores))]

5. Improvements that move the needle

5.1 · Tune Laplace alpha

Sweep α ∈ {0.01, 0.1, 1.0, 10} on a val split. Smaller α favours rare evidential words; larger α smooths more aggressively. The optimum depends on vocabulary sparsity.

5.2 · Use log-probabilities throughout

Computing in linear space underflows fast. Already done above — make this point in the write-up.

5.3 · Bigrams as features

Most "naive Bayes from scratch" implementations are unigram-only. Adding bigrams (plus α-smoothing on the bigger vocabulary) consistently lifts F1 by 3–5 points.

5.4 · Per-class evaluation

Report precision/recall per class and a confusion matrix. The problem usually allocates analysis points to this.

5.5 · Compare with scikit-learn MultinomialNB

Side-by-side comparison shows your implementation matches the library to within smoothing differences. Earns reproducibility points.

6. Submission format & gotchas

Show work for hand-probability parts. Box the final number.
Use log-probabilities in code; underflow is a common silent bug.
Don't import sklearn for the from-scratch parts; the grader expects your own implementation.
Stratified train/val split — class imbalance breaks unstratified splits.

7. What top solutions did

[reconstructed] Expected full-marks: hand-Bayes parts done with explicit conditional probabilities; multinomial NB from scratch with α-smoothing; bigram extension; comparison against sklearn; per-class evaluation in the write-up.

8. Drill

D · A test for a disease has 95% sensitivity, 5% false-positive rate, disease prevalence 1%. You test positive. What's P(disease | positive)?

Bayes: P(D|+) = P(+|D)·P(D) / [P(+|D)·P(D) + P(+|¬D)·P(¬D)] = 0.95·0.01 / (0.95·0.01 + 0.05·0.99) = 0.0095 / 0.0590 ≈ 0.161. So even after a "positive" result, you're 16% likely to have the disease — the base rate of 1% dominates. This is the canonical Bayes counter-intuitive answer and it shows up on every olympiad-level probability test.

D2 · Why is the "naive" independence assumption usually wrong but the classifier still useful?

Words in text are obviously correlated ("New" and "York" co-occur). Treating them as independent double-counts the evidence. But the decision boundary from naive Bayes is often correct even when the calibrated probabilities are not — for argmax classification you only need the class scores to be in the right order, not on the right absolute scale. So naive Bayes is a strong classifier even when it's a terrible probability estimator.

← USAAIO 2024 Round 1 set