USAAIO 2024 R1 · Problem 5 · Word embeddings & sentence classification [reconstructed]

Contest: 2024 USA-NA-AIO Round 1 · Round: Round 1 (online) · Category: NLP / word embeddings.

Official sources: usaaio.org/past-problems · USAAIO syllabus.

Reconstruction notice. [reconstructed] from the syllabus.

1. Problem restatement

Multi-part: explain the skip-gram objective; train word2vec-style embeddings on a supplied corpus (or use pre-trained GloVe vectors); build sentence representations by mean-pooling token embeddings; classify sentences with a small MLP; analyse cosine similarities between learned word vectors; discuss limitations vs contextual embeddings.

2. What's being tested

Skip-gram math. The negative-sampling loss as a binary classification.
Vector geometry. Cosine similarity, word analogies (king − man + woman ≈ queen).
Sentence representation. Mean-pooling vs sum-pooling vs first-token.
Limits of static embeddings. Polysemy ("bank"), out-of-vocab handling.

3. Data exploration / setup

import pandas as pd
df = pd.read_csv("sentiment.csv")    # text, label
print(df.label.value_counts())
print(df.text.str.split().str.len().describe())

4. Baseline approach

Quickest path: load GloVe (50d or 100d) vectors, mean-pool, train LogReg on top.

import numpy as np

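# Load GloVe into a dict: token -> 50-d vector.
glove = {}
with open("glove.6B.50d.txt", encoding="utf-8") as f:   # close the file, fix the encoding
    for line in f:
        p = line.rstrip().split(" ")
        glove[p[0]] = np.array(p[1:], dtype=np.float32)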

def sent_vec(text, d=50):
    vecs = [glove[w] for w in text.lower().split() if w in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(d)

E = np.stack([sent_vec(t) for t in df.text])
y = df.label.values

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
print(cross_val_score(LogisticRegression(C=4.0, max_iter=2000), E, y, cv=5, scoring="f1_macro").mean())

5. Improvements that move the needle

5.1 · Train embeddings on the task corpus

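The glove.6B vectors are pre-trained on Wikipedia 2014 plus Gigaword 5. Embeddings trained on the task corpus can outperform them when the domain vocabulary differs and the corpus is large enough to learn from; on very small corpora the pre-trained vectors usually remain the safer choice. Use gensim Word2Vec with skip-gram and negative sampling (sg=1, negative > 0).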

5.2 · TF-IDF weighted pooling

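Mean-pooling weights stopwords ("the", "a") the same as content words. Weight each token's vector by its IDF instead, so rare words contribute more; because a repeated token is counted once per occurrence, this amounts to TF-IDF weighting. Often worth about 2–3 points of macro-F1.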
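The weighting can be sketched as follows. The helper names `fit_idf` and `idf_sent_vec`, the toy embeddings, and the fallback weight of 1.0 for tokens unseen when the IDF table was fitted are all assumptions made for illustration; the smoothed IDF formula is the same form scikit-learn uses by default.

```python
import math
from collections import Counter

import numpy as np

def fit_idf(texts):
    """Smoothed IDF per token: log((1 + N) / (1 + df)) + 1."""
    n_docs = len(texts)
    doc_freq = Counter()
    for t in texts:
        doc_freq.update(set(t.lower().split()))   # count each token once per document
    return {w: math.log((1 + n_docs) / (1 + c)) + 1.0 for w, c in doc_freq.items()}

def idf_sent_vec(text, emb, idf, d):
    """IDF-weighted mean of the embeddings of in-vocabulary tokens."""
    toks = [w for w in text.lower().split() if w in emb]
    if not toks:
        return np.zeros(d)                         # all-OOV sentence
    weights = np.array([idf.get(w, 1.0) for w in toks])
    vecs = np.stack([emb[w] for w in toks])
    return weights @ vecs / weights.sum()

# Toy 2-d embeddings (hypothetical values, for illustration only).
toy_emb = {
    "the": np.array([1.0, 0.0]),
    "plot": np.array([0.0, 1.0]),
    "cast": np.array([0.0, -1.0]),
}
toy_texts = ["the plot", "the cast", "the the plot"]
idf = fit_idf(toy_texts)
vec = idf_sent_vec("the plot", toy_emb, idf, d=2)
print(idf, vec)
```

In the toy corpus "the" appears in every document, so it gets the lowest weight and the pooled vector leans toward "plot".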

5.3 · Concatenate features

Concatenate (mean embedding, max embedding, TF-IDF-weighted embedding). Three views of the same sentence each carrying different signal.
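A small sketch of the concatenation, assuming the token vectors for one sentence are already stacked into an array and that per-token weights (such as IDF values) are available. `concat_views` is a hypothetical helper name, and it expects at least one token.

```python
import numpy as np

def concat_views(vecs, weights):
    """Concatenate mean, element-wise max, and weighted mean of token vectors.

    vecs: (n_tokens, d) array; weights: (n_tokens,) positive weights.
    Returns a (3 * d,) feature vector.
    """
    vecs = np.asarray(vecs, dtype=float)
    weights = np.asarray(weights, dtype=float)
    mean_v = vecs.mean(axis=0)
    max_v = vecs.max(axis=0)
    wmean_v = weights @ vecs / weights.sum()
    return np.concatenate([mean_v, max_v, wmean_v])

# Three toy 2-d token vectors and their weights (illustrative values).
toy_vecs = np.array([[1.0, 0.0], [0.0, 2.0], [-1.0, 1.0]])
toy_weights = np.array([1.0, 2.0, 1.0])
features = concat_views(toy_vecs, toy_weights)
print(features.shape)
```

The feature dimension triples, so with a 50-d embedding the classifier sees 150 inputs; on small datasets a somewhat stronger regularisation may be needed to compensate.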

5.4 · Small MLP head

Swap LogReg for a 2-layer MLP with dropout. Mostly helps when the dataset is > 10k rows.
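A hedged sketch of the swap using scikit-learn. Note the substitution: `MLPClassifier` has no dropout option, so this example regularises with an L2 penalty (`alpha`) and early stopping instead; true dropout would need a framework such as PyTorch. Reading "2-layer" as two hidden layers, the layer sizes, and the synthetic data standing in for `E` and `y` are all assumptions.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
# Synthetic stand-in for the pooled sentence matrix E and labels y.
X_demo = rng.normal(size=(600, 20))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

mlp = MLPClassifier(
    hidden_layer_sizes=(128, 64),  # two hidden layers
    alpha=1e-3,                    # L2 penalty, used here in place of dropout
    early_stopping=True,           # hold out 10% of each training fold
    max_iter=500,
    random_state=0,
)
scores = cross_val_score(mlp, X_demo, y_demo, cv=5, scoring="f1_macro")
print(scores.mean())
```

On the real task, pass `E` and `y` in place of the synthetic arrays and compare the score against the LogReg baseline under the same cross-validation split.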

5.5 · Discuss contextual alternatives

Write a paragraph on what static embeddings get wrong (polysemy, syntax). Mention that frozen-BERT [CLS] embeddings or sentence-transformers would address it. Earns reasoning points.

6. Submission format & gotchas

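Lowercase and tokenise before GloVe lookup, or you'll miss most matches: the glove.6B vocabulary is uncased and stores punctuation as separate tokens, so "Great!" matches nothing until it becomes "great" and "!".
OOV words: skip them rather than inserting zero vectors, which pull the mean toward the origin.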
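For the skip-gram math part, write the loss explicitly: L = −log σ(v_c · v_w) − Σ_neg log σ(−v_n · v_w), where v_w is the centre-word vector, v_c the vector of the true context word, and v_n ranges over the K sampled negative words.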
Cosine, not Euclidean, for word similarity.
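The negative-sampling loss in the list above can be checked numerically. This is a minimal NumPy sketch; the function name `sgns_loss` and the toy vectors are assumptions. It uses the identity −log σ(x) = log(1 + e^(−x)), computed with `np.logaddexp` for numerical stability.

```python
import numpy as np

def sgns_loss(v_w, v_c, V_neg):
    """L = -log s(v_c . v_w) - sum_n log s(-v_n . v_w), with s the sigmoid.

    v_w: (d,) centre vector; v_c: (d,) true-context vector;
    V_neg: (K, d) matrix of sampled negative vectors.
    """
    pos = np.logaddexp(0.0, -(v_c @ v_w))         # -log s(v_c . v_w)
    neg = np.logaddexp(0.0, V_neg @ v_w).sum()    # -sum log s(-v_n . v_w)
    return float(pos + neg)

d, K = 4, 3
# With all-zero vectors every sigmoid is 0.5, so the loss is (K + 1) * log 2.
loss_zero = sgns_loss(np.zeros(d), np.zeros(d), np.zeros((K, d)))

# A context aligned with the centre word and negatives pointing away give a lower loss.
v_w = np.array([1.0, 0.0, 0.0, 0.0])
loss_good = sgns_loss(v_w, 3.0 * v_w, np.tile(-3.0 * v_w, (K, 1)))
print(loss_zero, loss_good)
```

Each term is the log-loss of a binary classifier: label 1 for the observed (centre, context) pair and label 0 for each sampled negative.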

7. What top solutions did

[reconstructed] Full-marks pattern: skip-gram explanation with the negative-sampling loss; in-domain word2vec training; TF-IDF-weighted pooling; analogy demo (king/queen); discussion of what contextual embeddings would buy you.
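The analogy demo could look like the following sketch. The 3-d toy vectors are hand-made assumptions chosen so the arithmetic is easy to follow; with real GloVe or word2vec vectors the same `analogy` helper (a hypothetical name) applies, though the nearest neighbour is not guaranteed to be the expected word.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(emb, a, b, c):
    """Word closest by cosine to emb[a] - emb[b] + emb[c], excluding a, b and c."""
    target = emb[a] - emb[b] + emb[c]
    candidates = (w for w in emb if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(emb[w], target))

# Toy vectors: axis 0 ~ "royalty", axis 1 ~ "gender", axis 2 ~ unrelated.
toy = {
    "king":  np.array([0.9,  0.8, 0.1]),
    "queen": np.array([0.9, -0.8, 0.1]),
    "man":   np.array([0.1,  0.9, 0.0]),
    "woman": np.array([0.1, -0.9, 0.0]),
    "apple": np.array([0.0,  0.0, 1.0]),
}
answer = analogy(toy, "king", "man", "woman")
print(answer)
```

Excluding the three query words is the usual convention; without it the nearest neighbour of the target is often one of the inputs.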

8. Drill

D1 · Why does negative sampling speed up skip-gram vs full softmax?

Full softmax normalises over the entire vocabulary at every step, so each training pair costs O(V) dot products and output-vector updates. Negative sampling replaces the V-way softmax with K + 1 binary classifications (the true context plus K sampled non-contexts), so the cost is O(K) per pair, with K = 5–20 typical. For V = 50,000 and K = 10, that is 11 output vectors touched instead of 50,000, a reduction of roughly 4,500×. The trade-off: negative sampling optimises a different objective from the softmax likelihood, but in practice the learned vectors are comparable for downstream tasks.
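The O(K) cost per pair can be seen in a single update step. This is a minimal NumPy sketch of one SGD step under negative sampling, not a full trainer; the function name `sgns_step`, the learning rate, and the random initialisation are assumptions, and the negative indices are assumed to be distinct.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(W_in, W_out, center, context, negatives, lr=0.05):
    """One SGD step on one (centre, context) pair with K negatives.

    Only row `center` of W_in and K + 1 rows of W_out are read or written.
    """
    idx = np.array([context, *negatives])    # K + 1 output rows
    labels = np.zeros(len(idx))
    labels[0] = 1.0                          # true context = 1, negatives = 0
    v = W_in[center]                         # view of the centre row
    U = W_out[idx]                           # copy of the K + 1 output rows
    err = sigmoid(U @ v) - labels            # dL/dscore for each row
    grad_v = err @ U                         # gradient w.r.t. the centre vector
    W_out[idx] -= lr * np.outer(err, v)      # update output rows first
    W_in[center] -= lr * grad_v              # then the centre row
    return idx

rng = np.random.default_rng(0)
V, d, K = 1000, 16, 5
W_in = rng.normal(scale=0.1, size=(V, d))
W_out = rng.normal(scale=0.1, size=(V, d))
before = W_out.copy()
touched = sgns_step(W_in, W_out, center=3, context=7, negatives=[10, 20, 30, 40, 50])
changed = np.flatnonzero(np.any(W_out != before, axis=1))
print(len(changed), "of", V, "output rows updated")
```

A full-softmax step would compute a score against, and propagate a gradient to, all V rows of `W_out` for the same pair.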

D2 · Your sentiment classifier with GloVe scores 75 macro-F1. The training data has 1000 rows. Why might a sentence-transformer beat this by 10 points?

Three reasons. (1) Sentence-transformers are pre-trained on hundreds of millions of pairs with a contrastive objective, so the embedding space is task-aware in a way static word vectors aren't. (2) They produce a single sentence vector by mean-pooling contextual token outputs; mean-pooling static word vectors loses word-order and disambiguation signal. (3) The pre-training already saw the sentiment-relevant vocabulary in many contexts. For 1k rows, this transfer is the bulk of the lift. The trade-off is inference latency: a transformer pass is 10–100× slower than a GloVe lookup.
