USAAIO 2024 R1 · Problem 5 · Word embeddings & sentence classification [reconstructed]
Contest: 2024 USA-NA-AIO Round 1 · Round: Round 1 (online) · Category: NLP / word embeddings.
Official sources: usaaio.org/past-problems · USAAIO syllabus.
1. Problem restatement
Multi-part: explain the skip-gram objective; train word2vec-style embeddings on a supplied corpus (or use pre-trained GloVe vectors); build sentence representations by mean-pooling token embeddings; classify sentences with a small MLP; analyse cosine similarities between learned word vectors; discuss limitations vs contextual embeddings.
2. What's being tested
- Skip-gram math. The negative-sampling loss as a binary classification.
- Vector geometry. Cosine similarity, word analogies (king − man + woman ≈ queen).
- Sentence representation. Mean-pooling vs sum-pooling vs first-token.
- Limits of static embeddings. Polysemy ("bank"), out-of-vocab handling.
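A minimal sketch of the vector-geometry items above: cosine similarity and the analogy arithmetic. The 3-d vectors are hand-made for illustration (they are not real GloVe values), and the helper names `cosine` and `analogy` are hypothetical.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors (0.0 if either is all-zero)."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0

def analogy(a, b, c, vectors):
    """Word closest (by cosine) to vec(a) - vec(b) + vec(c),
    excluding the three query words themselves."""
    target = vectors[a] - vectors[b] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Toy vectors (assumption): axis 0 ~ "royalty", axis 1 ~ "male", axis 2 ~ "female".
toy = {
    "king":  np.array([1.0, 1.0, 0.0]),
    "queen": np.array([1.0, 0.0, 1.0]),
    "man":   np.array([0.0, 1.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0]),
    "apple": np.array([0.1, 0.1, 0.1]),
}

result = analogy("king", "man", "woman", toy)
print(result)  # queen
```

With real embeddings the same functions work on the `glove` dictionary; excluding the query words matters there, because the nearest neighbour of the target is often one of the inputs.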
3. Data exploration / setup
import pandas as pd
df = pd.read_csv("sentiment.csv") # text, label
print(df.label.value_counts())
print(df.text.str.split().str.len().describe())
4. Baseline approach
Quickest path: load GloVe (50d or 100d) vectors, mean-pool, train LogReg on top.
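import numpy as np
glove = {}
# Each line: a token followed by its vector components, space-separated.
with open("glove.6B.50d.txt", encoding="utf-8") as f:
    for line in f:
        p = line.split()
        glove[p[0]] = np.array(p[1:], dtype=float)
def sent_vec(text, d=50):
    # Mean of the GloVe vectors of in-vocabulary tokens; OOV tokens are skipped.
    vecs = [glove[w] for w in text.lower().split() if w in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(d)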
E = np.stack([sent_vec(t) for t in df.text])
y = df.label.values
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
print(cross_val_score(LogisticRegression(C=4.0, max_iter=2000), E, y, cv=5, scoring="f1_macro").mean())
5. Improvements that move the needle
5.1 · Train embeddings on the task corpus
The glove.6B vectors are pre-trained on Wikipedia plus Gigaword newswire; in-domain embeddings learned on the task corpus can outperform them on small classification problems, especially when the vocabulary is domain-specific. Use gensim Word2Vec with skip-gram + negative sampling.
5.2 · TF-IDF weighted pooling
Mean-pooling weights stopwords ("the", "a") the same as content words. Weight each token by its IDF instead, so rarer words count for more. Often worth 2–3 points of macro-F1.
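A sketch of IDF-weighted pooling under stated assumptions: the smoothed IDF formula `log((1 + N) / (1 + df)) + 1` is one common choice, and the helper names, the toy documents, and the 2-d vectors are all hypothetical.

```python
import math
from collections import Counter
import numpy as np

def fit_idf(tokenised_docs):
    """Smoothed IDF per token: log((1 + N) / (1 + df)) + 1."""
    n_docs = len(tokenised_docs)
    doc_freq = Counter(tok for doc in tokenised_docs for tok in set(doc))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0
            for tok, df in doc_freq.items()}

def idf_weighted_vec(tokens, vectors, idf, dim, default_idf=1.0):
    """IDF-weighted average of the embeddings of in-vocabulary tokens.
    Tokens missing from `vectors` are skipped; tokens missing from `idf`
    fall back to `default_idf` (an arbitrary choice for this sketch)."""
    pairs = [(vectors[t], idf.get(t, default_idf)) for t in tokens if t in vectors]
    if not pairs:
        return np.zeros(dim)
    vecs, weights = zip(*pairs)
    return np.average(np.stack(vecs), axis=0, weights=weights)

docs = [
    ["the", "film", "was", "great"],
    ["the", "plot", "was", "dull"],
    ["the", "acting", "was", "great"],
]
idf = fit_idf(docs)

# Toy 2-d embeddings (assumption): "the" on axis 0, "film" on axis 1.
toy_vectors = {"the": np.array([1.0, 0.0]), "film": np.array([0.0, 1.0])}
v = idf_weighted_vec(["the", "film", "unknown"], toy_vectors, idf, dim=2)
```

Here "the" appears in every document and gets the minimum weight, so the pooled vector leans toward "film". Fit the IDF table on the training fold only, to avoid leaking test statistics into cross-validation.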
5.3 · Concatenate features
Concatenate (mean embedding, max embedding, TF-IDF-weighted embedding). Three views of the same sentence each carrying different signal.
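A sketch of the concatenation, assuming the token vectors and per-token weights (for example IDF values) have already been looked up. The function name `multi_view` is hypothetical, and empty sentences are not handled here.

```python
import numpy as np

def multi_view(token_vecs, weights):
    """Concatenate mean-, max- and weighted-mean-pooled views of one sentence.

    token_vecs: (n_tokens, d) array-like; weights: (n_tokens,) positive weights.
    Returns a (3 * d,) feature vector.
    """
    token_vecs = np.asarray(token_vecs, dtype=float)
    mean_view = token_vecs.mean(axis=0)
    max_view = token_vecs.max(axis=0)  # element-wise max over tokens
    weighted_view = np.average(token_vecs, axis=0, weights=weights)
    return np.concatenate([mean_view, max_view, weighted_view])

feats = multi_view([[1.0, 0.0], [0.0, 2.0]], weights=[1.0, 3.0])
print(feats)  # [0.5  1.   1.   2.   0.25 1.5 ]
```

The feature dimension triples, so with a linear model it may be worth re-tuning the regularisation strength `C`.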
5.4 · Small MLP head
Swap LogReg for a 2-layer MLP with dropout. Mostly helps when the dataset is > 10k rows.
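One possible shape for such a head, sketched in PyTorch (the source does not name a framework, so this choice is an assumption). Hidden size, dropout rate, learning rate, and the random stand-in data are placeholders.

```python
import torch
from torch import nn

def make_mlp(in_dim, hidden_dim=64, n_classes=2, p_drop=0.3):
    """Two linear layers with ReLU and dropout in between."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim),
        nn.ReLU(),
        nn.Dropout(p_drop),
        nn.Linear(hidden_dim, n_classes),
    )

torch.manual_seed(0)
model = make_mlp(in_dim=50)
X = torch.randn(8, 50)         # stand-in for pooled sentence vectors
y = torch.randint(0, 2, (8,))  # stand-in labels
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

model.train()                  # dropout active during training
for _ in range(5):
    optimiser.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimiser.step()

model.eval()                   # dropout disabled at inference
with torch.no_grad():
    logits = model(X)
```

Forgetting `model.eval()` leaves dropout on at prediction time, which makes the scores noisy.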
5.5 · Discuss contextual alternatives
Write a paragraph on what static embeddings get wrong (polysemy, syntax). Mention that frozen-BERT [CLS] embeddings or sentence-transformers would address it. Earns reasoning points.
6. Submission format & gotchas
- Lowercase + simple tokenisation before GloVe lookup, or you'll miss most matches.
- OOV words: skip rather than zero — zero-vectors pull the mean toward origin.
- For the skip-gram math part, write the loss explicitly:
L = −log σ(v_c · v_w) − Σ_neg log σ(−v_n · v_w).
- Cosine, not Euclidean, for word similarity.
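The negative-sampling loss written out above can be checked numerically. This is a minimal NumPy sketch for a single (centre, context) pair with K sampled negatives; the function names are hypothetical and the symbols follow the formula (v_w centre word, v_c true context, v_n negatives).

```python
import numpy as np

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x)) = -log(1 + exp(-x))."""
    return -np.logaddexp(0.0, -x)

def sgns_loss(v_w, v_c, v_negs):
    """L = -log s(v_c . v_w) - sum_n log s(-v_n . v_w), with s the sigmoid.

    v_w: (d,) centre vector; v_c: (d,) true-context vector;
    v_negs: (K, d) vectors of the K sampled negatives.
    """
    positive = log_sigmoid(v_c @ v_w)
    negative = log_sigmoid(-(v_negs @ v_w)).sum()
    return float(-(positive + negative))

# Sanity check: with all-zero vectors every sigmoid is 0.5,
# so the loss is (1 + K) * log 2.
baseline = sgns_loss(np.zeros(4), np.zeros(4), np.zeros((3, 4)))

# A context aligned with the centre and a negative pointing away give a small loss.
good = sgns_loss(np.array([1.0, 0.0]), np.array([5.0, 0.0]), np.array([[-5.0, 0.0]]))
```

Each term is a binary logistic loss: label 1 for the true context, label 0 for every sampled negative.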
7. What top solutions did
[reconstructed] Full-marks pattern: skip-gram explanation with the negative-sampling loss; in-domain word2vec training; TF-IDF-weighted pooling; analogy demo (king/queen); discussion of what contextual embeddings would buy you.
8. Drill
D1 · Why does negative sampling speed up skip-gram vs full softmax?
Full softmax requires normalising over the entire vocabulary at every step: O(V) per training pair. Negative sampling replaces the V-way softmax with K + 1 binary-classification problems (the true context plus K random non-contexts), so the cost is O(K) per pair, with K = 5–20 typical. For V = 50,000 and K = 10, the output-layer cost drops by roughly V/K = 5,000× (about 4,500× if the positive pair is counted too). The trade-off: negative sampling optimises a slightly different objective, but in practice the learned vectors are comparable for downstream tasks.
D2 · Your sentiment classifier with GloVe scores 75 macro-F1. The training data has 1000 rows. Why might a sentence-transformer beat this by 10 points?
Three reasons. (1) Sentence-transformers are pre-trained on hundreds of millions of pairs with a contrastive objective, so the embedding space is task-aware in a way static word vectors aren't. (2) They produce a single sentence vector by mean-pooling contextual token outputs; mean-pooling static word vectors loses word-order and disambiguation signal. (3) The pre-training already saw the sentiment-relevant vocabulary in many contexts. For 1k rows, this transfer is the bulk of the lift. The trade-off is inference latency: a transformer pass is 10–100× slower than a GloVe lookup.