USAAIO 2024 R1 · Problem 1 · Probability & Bayes [reconstructed]
Contest: 2024 USA-NA-AIO Round 1 · Round: Round 1 (online) · Category: Probability theory / naive Bayes.
Official sources: usaaio.org/past-problems (2024 PDF not yet listed) · USAAIO syllabus.
1. Problem restatement
A multi-part problem testing probability fundamentals via small word problems plus an implementation component. Typical parts: compute conditional probability from a 2×2 table; state and apply Bayes' rule to a medical-test scenario; compute the posterior of a hypothesis given evidence; implement a naive-Bayes spam classifier from scratch; evaluate on a held-out set.
2. What's being tested
- Reading word problems. Translating "85% sensitivity, 5% false-positive rate,
1% prevalence" into
P(+ | D),P(+ | ¬D),P(D). - Bayes' rule. Both the formula and the intuition — base rates dominate.
- Independence assumptions. Why naive Bayes is "naive" and when it still works.
- Laplace smoothing. Avoiding zero probabilities for unseen features.
3. Data exploration / setup
The implementation parts use a small text dataset (spam / not-spam, or sentiment). Synthetic stand-in:
import pandas as pd
df = pd.read_csv("messages.csv") # columns: text, label (0/1)
print(df.label.value_counts())
print(df.text.str.split().str.len().describe())
4. Baseline approach
from collections import Counter
import numpy as np
def fit_nb(texts, labels, alpha=1.0):
classes = sorted(set(labels))
vocab = sorted({w for t in texts for w in t.split()})
word_idx = {w: i for i, w in enumerate(vocab)}
log_prior = np.zeros(len(classes))
log_lik = np.zeros((len(classes), len(vocab)))
for ci, c in enumerate(classes):
docs = [t for t, l in zip(texts, labels) if l == c]
log_prior[ci] = np.log(len(docs) / len(texts))
wc = Counter(w for d in docs for w in d.split())
total = sum(wc.values()) + alpha * len(vocab)
for w, i in word_idx.items():
log_lik[ci, i] = np.log((wc[w] + alpha) / total)
return classes, word_idx, log_prior, log_lik
def predict_nb(text, classes, word_idx, log_prior, log_lik):
scores = log_prior.copy()
for w in text.split():
if w in word_idx:
scores += log_lik[:, word_idx[w]]
return classes[int(np.argmax(scores))]
5. Improvements that move the needle
5.1 · Tune Laplace alpha
Sweep α ∈ {0.01, 0.1, 1.0, 10} on a val split. Smaller α favours rare evidential words; larger α smooths more aggressively. The optimum depends on vocabulary sparsity.
5.2 · Use log-probabilities throughout
Computing in linear space underflows fast. Already done above — make this point in the write-up.
5.3 · Bigrams as features
Most "naive Bayes from scratch" implementations are unigram-only. Adding bigrams (plus α-smoothing on the bigger vocabulary) consistently lifts F1 by 3–5 points.
5.4 · Per-class evaluation
Report precision/recall per class and a confusion matrix. The problem usually allocates analysis points to this.
5.5 · Compare with scikit-learn MultinomialNB
Side-by-side comparison shows your implementation matches the library to within smoothing differences. Earns reproducibility points.
6. Submission format & gotchas
- Show work for hand-probability parts. Box the final number.
- Use log-probabilities in code; underflow is a common silent bug.
- Don't import sklearn for the from-scratch parts; the grader expects your own implementation.
- Stratified train/val split — class imbalance breaks unstratified splits.
7. What top solutions did
[reconstructed] Expected full-marks: hand-Bayes parts done with explicit conditional probabilities; multinomial NB from scratch with α-smoothing; bigram extension; comparison against sklearn; per-class evaluation in the write-up.
8. Drill
D · A test for a disease has 95% sensitivity, 5% false-positive rate, disease prevalence 1%. You test positive. What's P(disease | positive)?
Bayes: P(D|+) = P(+|D)·P(D) / [P(+|D)·P(D) + P(+|¬D)·P(¬D)] = 0.95·0.01 / (0.95·0.01 + 0.05·0.99) = 0.0095 / 0.0590 ≈ 0.161. So even after a "positive" result, you're 16% likely to have the disease — the base rate of 1% dominates. This is the canonical Bayes counter-intuitive answer and it shows up on every olympiad-level probability test.
D2 · Why is the "naive" independence assumption usually wrong but the classifier still useful?
Words in text are obviously correlated ("New" and "York" co-occur). Treating them as independent double-counts the evidence. But the decision boundary from naive Bayes is often correct even when the calibrated probabilities are not — for argmax classification you only need the class scores to be in the right order, not on the right absolute scale. So naive Bayes is a strong classifier even when it's a terrible probability estimator.