IOAI 2024 · NLP · Fine-tune a language model on a ciphered language
Contest: IOAI 2024 (Bulgaria) · Round: Scientific, at-home stage · Category: NLP · On-site sibling: 7-way extension of the same classifier.
Official sources: ioai-official.org/2024-tasks · at-home problems (zip) · best solutions (zip) · open-cu/awesome-ioai-tasks · on-site notebook mirror.
1. Problem restatement
Organisers release a corpus of text in an unknown ciphered language — natural-language sentences that have been deterministically transformed (substitution cipher plus token-level reshuffling) so that no off-the-shelf tokenizer or pre-trained model has ever seen them. Each sentence carries one of 5 class labels in the at-home task. Contestants get a small training set (~thousands of labelled sentences), a validation set, and a hidden test set. The task: build a classifier that maps a ciphered sentence to its label.
The on-site twist adds two new classes (giving 7 in total) with very few labelled examples per new class, and forbids retraining the original 5-way head — you must extend it.
Hardware budget per the official notebook: solution must train end-to-end in < 1 hour on a single L4 GPU and run inference on 500 samples in < 2 minutes. Allowed: any open-weights model from HuggingFace, any tokenizer, standard PyTorch.
The on-site notebook mirror (IOAI 2024_ NLP Problem on-site.ipynb) confirms the L4 / 1-hour / 2-minute runtime budget.
2. What's being tested
The problem deliberately breaks the pre-training assumption: because the language is invented, word-piece tokenizers fragment ciphered tokens into nonsense subwords, and any pre-trained semantic knowledge in BERT/RoBERTa/Llama is useless. What survives is the distributional structure — how tokens co-occur with each other and with class labels. So the test is really:
- Can you build a useful tokenizer from scratch (BPE/WordPiece) on a small corpus?
- Can you train a small transformer from scratch (or distil one) in < 1 hour?
- Do you know that masked-LM pre-training on the unlabelled corpus is a free lunch before classification?
- Can you handle severe class imbalance (the on-site 2 extra classes have ~10 samples each)?
This maps directly onto the Transformers page (encoder fine-tuning, LayerNorm, AdamW), the Python page (HF datasets, tokenizers), and the Deep Learning page (early stopping, class-weighted cross-entropy).
3. Data exploration / setup
The data ships as three CSVs: train.csv, val.csv, test.csv, each with columns id, text, and (for train/val) label. The text field looks like:
# a row from train.csv (paraphrased example — real ciphered tokens are 5-8 char nonsense)
{
"id": 17,
"text": "zlpu krapa-mer voli kran zlpu kran-gri",
"label": 2,
}
Quick EDA things to look at before you touch a model:
- Vocabulary size after whitespace tokenisation — typically a few thousand unique tokens. Small enough to train BPE from scratch.
- Sentence length distribution: usually 5-30 tokens, so max_len=64 is plenty.
- Class balance: the at-home task is roughly uniform across 5 classes (a few hundred examples per class).
- Train + val + test unlabelled tokens — combined ~50-100k tokens. Enough to pre-train a small encoder via MLM.
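The checks in the list above can be sketched in plain Python. This is a minimal illustration, not the official notebook: the three rows are invented stand-ins for train.csv, and `eda_summary` is a hypothetical helper name.

```python
from collections import Counter

def eda_summary(texts, labels):
    """Whitespace-token vocabulary size, length range, and per-class counts."""
    token_lists = [t.split() for t in texts]
    vocab = Counter(tok for toks in token_lists for tok in toks)
    lengths = [len(toks) for toks in token_lists]
    return {
        "vocab_size": len(vocab),
        "total_tokens": sum(lengths),
        "min_len": min(lengths),
        "max_len": max(lengths),
        "class_counts": dict(Counter(labels)),
    }

# Toy rows (invented for illustration; real rows come from train.csv).
texts = [
    "zlpu krapa-mer voli kran zlpu kran-gri",
    "voli kran zlpu",
    "kran-gri voli",
]
labels = [2, 0, 2]
summary = eda_summary(texts, labels)
print(summary)
```

On the real data, the same summary indicates whether a BPE vocabulary of a few thousand entries and a maximum length of 64 are reasonable choices.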
Metric: macro-F1 on the hidden test set (so per-class recall counts, not just overall accuracy). The leaderboard score is macro-F1 scaled to 0-100.
4. Baseline approach
A baseline that reaches roughly 0.55-0.65 macro-F1 in about 15 lines: treat each whitespace-separated ciphered word as an opaque token, build TF-IDF features over 1-2 grams, and fit a class-weighted logistic regression. No GPU needed.
import pandas as pd, numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
train = pd.read_csv("train.csv")
val = pd.read_csv("val.csv")
# treat ciphered "words" as opaque tokens; TF-IDF with 1-2 grams
vec = TfidfVectorizer(ngram_range=(1, 2), min_df=2, max_df=0.95, sublinear_tf=True)
Xtr = vec.fit_transform(train.text)
Xva = vec.transform(val.text)
clf = LogisticRegression(C=4.0, max_iter=2000, class_weight="balanced")
clf.fit(Xtr, train.label)
pred = clf.predict(Xva)
print("val macro-F1:", f1_score(val.label, pred, average="macro"))
# expected ~ 0.55-0.65 on the at-home dev set [illustrative]
Score band: this baseline scored roughly in the 55-65 range on official-format dev splits in community write-ups; absolute leaderboard numbers are not public so treat the band as [illustrative].
5. Improvements that move the needle
5.1 · Train a small BPE tokenizer and MLM-pretrain a tiny BERT
The ciphered language has its own subword regularities (the same morphemes repeat across roots). Train a BPE tokenizer on train + val + test text (test text is unlabelled, so using its tokens is fair under the rules) and then pre-train a 4-layer encoder with masked-LM for 5-10 minutes on L4. The encoder learns "which token tends to appear near which" — exactly the structure a linear bag-of-words discards.
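import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
from tokenizers import Tokenizer, models, trainers, pre_tokenizers, processors
from transformers import BertConfig, BertForMaskedLM, PreTrainedTokenizerFast

# `train` and `val` are the DataFrames from the baseline; test text is unlabelled
test = pd.read_csv("test.csv")

# 1. learn BPE on all available text
tok = Tokenizer(models.BPE(unk_token="[UNK]"))
tok.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(
    vocab_size=8000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
all_text = list(train.text) + list(val.text) + list(test.text)
tok.train_from_iterator(all_text, trainer)
# wrap every sequence as [CLS] ... [SEP] so that position 0 really is [CLS]
tok.post_processor = processors.TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[("[CLS]", tok.token_to_id("[CLS]")),
                    ("[SEP]", tok.token_to_id("[SEP]"))],
)
tok.save("cipher_bpe.json")
# generic fast-tokenizer wrapper: BertTokenizerFast expects a BERT-style normalizer,
# which this from-scratch BPE tokenizer does not define
hf_tok = PreTrainedTokenizerFast(tokenizer_file="cipher_bpe.json",
                                 cls_token="[CLS]", sep_token="[SEP]",
                                 pad_token="[PAD]", mask_token="[MASK]", unk_token="[UNK]")

# 2. tiny BERT (≈3M params at vocab_size=8000; fits L4 fine-tune in <15 min)
cfg = BertConfig(vocab_size=len(hf_tok), hidden_size=192,
                 num_hidden_layers=4, num_attention_heads=6,
                 intermediate_size=512, max_position_embeddings=64,
                 pad_token_id=hf_tok.pad_token_id)
mlm = BertForMaskedLM(cfg).cuda()
# ... standard MLM pre-training loop with 15% random masking ...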
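The elided loop depends on the masking step, which is easy to get subtly wrong. Below is a pure-Python sketch of BERT-style masking. It assumes the common 80/10/10 split ([MASK] / random token / unchanged), which the source does not specify, and the special-token ids 0-4 are assumed to follow the order given to the BPE trainer. `mask_tokens` is a hypothetical helper; in practice `DataCollatorForLanguageModeling` from transformers performs this step.

```python
import random

def mask_tokens(ids, mask_id, vocab_size, special_ids, p=0.15, rng=random):
    """Return (inputs, labels); labels are -100 wherever no loss is taken."""
    inputs, labels = list(ids), [-100] * len(ids)
    for i, tok in enumerate(ids):
        # never mask special tokens; select other positions with probability p
        if tok in special_ids or rng.random() >= p:
            continue
        labels[i] = tok  # the model must predict the original token here
        r = rng.random()
        if r < 0.8:
            inputs[i] = mask_id  # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: random vocabulary id
        # remaining 10%: leave the input token unchanged
    return inputs, labels

# Assumed ids: [PAD]=0, [UNK]=1, [CLS]=2, [SEP]=3, [MASK]=4.
ids = [2, 17, 45, 45, 301, 9, 3]
inputs, labels = mask_tokens(ids, mask_id=4, vocab_size=8000,
                             special_ids={0, 1, 2, 3, 4}, rng=random.Random(0))
print(inputs, labels)
```

The -100 label value matches the default `ignore_index` of PyTorch's cross-entropy loss, so unselected positions contribute nothing to the MLM loss.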
5.2 · Fine-tune for classification with class-balanced loss
Swap the MLM head for a classification head and fine-tune for 5-10 epochs. Use class_weight = compute_class_weight("balanced", ...) inside cross-entropy so rare classes aren't ignored.
from transformers import BertForSequenceClassification
from sklearn.utils.class_weight import compute_class_weight
mlm.save_pretrained("cipher_mlm")
clf_model = BertForSequenceClassification.from_pretrained(
"cipher_mlm", num_labels=5).cuda()
w = compute_class_weight("balanced", classes=np.arange(5), y=train.label.values)
loss_fn = torch.nn.CrossEntropyLoss(weight=torch.tensor(w, dtype=torch.float, device="cuda"))
opt = torch.optim.AdamW(clf_model.parameters(), lr=3e-5, weight_decay=0.01)
# standard training loop with linear warmup + cosine decay ...
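For intuition about what the "balanced" weights contain: scikit-learn documents the heuristic as n_samples / (n_classes * count_c). The sketch below reproduces that formula in plain Python on made-up label counts. `balanced_weights` is a hypothetical helper name, and it assumes every class occurs at least once.

```python
from collections import Counter

def balanced_weights(labels, n_classes):
    """Weight for class c = n_samples / (n_classes * count_c)."""
    counts = Counter(labels)
    n_samples = len(labels)
    return [n_samples / (n_classes * counts[c]) for c in range(n_classes)]

# Made-up imbalanced labels: 6 of class 0, 3 of class 1, 1 of class 2.
toy_labels = [0] * 6 + [1] * 3 + [2] * 1
weights = balanced_weights(toy_labels, n_classes=3)
print([round(w, 3) for w in weights])
```

The rarest class receives the largest weight, and count_c times weight_c is the same for every class, so each class contributes equally to the summed loss.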
5.3 · Self-training / pseudo-labels on the unlabelled test set
After fine-tuning, predict on the test set, keep predictions with probability > 0.9, fold them into the training data with a 0.5× weight, and re-train. This typically lifts macro-F1 by 2-4 points because the unlabelled corpus is many times larger than the labelled one.
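The selection step described above can be sketched without any framework. This is an illustration under assumptions: `select_pseudo_labels` is a hypothetical helper, the logits are invented, and the 0.9 threshold and 0.5 weight are the values quoted in the text.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def select_pseudo_labels(test_texts, test_logits, threshold=0.9, weight=0.5):
    """Keep test rows whose top softmax probability exceeds the threshold."""
    kept = []
    for text, logits in zip(test_texts, test_logits):
        probs = softmax(logits)
        label = max(range(len(probs)), key=probs.__getitem__)
        if probs[label] > threshold:
            kept.append({"text": text, "label": label, "weight": weight})
    return kept

# Invented logits: the first row is confident, the second is not.
test_texts = ["zlpu kran voli", "kran-gri zlpu"]
test_logits = [[4.0, 0.0, 0.0, 0.0, 0.0], [1.0, 0.9, 0.0, 0.0, 0.0]]
pseudo = select_pseudo_labels(test_texts, test_logits)
print(pseudo)
```

During re-training, the per-row weight would multiply the per-example loss, for instance with `torch.nn.CrossEntropyLoss(reduction="none")`, so that pseudo-labelled rows count half as much as gold-labelled ones.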
5.4 · Ensemble 3 seeds, average logits
Tiny transformers have high variance. Train 3 models with different seeds and average their pre-softmax logits before argmax. Reliable +1-2 macro-F1.
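A minimal sketch of logit averaging, using nested lists in place of tensors. The logits are invented and `average_logits` is a hypothetical helper name. Note that averaging happens before the argmax, so a model that is confidently right can outvote two models that are weakly wrong.

```python
def average_logits(per_model_logits):
    """per_model_logits[m][i][c] is model m's logit for example i, class c."""
    n_models = len(per_model_logits)
    n_examples = len(per_model_logits[0])
    averaged = []
    for i in range(n_examples):
        rows = [per_model_logits[m][i] for m in range(n_models)]
        averaged.append([sum(col) / n_models for col in zip(*rows)])
    return averaged

def argmax(row):
    return max(range(len(row)), key=row.__getitem__)

# Three seeds, two examples, three classes (invented numbers).
seed_a = [[2.0, 1.0, 0.0], [0.2, 0.1, 0.0]]
seed_b = [[1.5, 1.8, 0.0], [0.0, 0.9, 0.1]]
seed_c = [[2.5, 0.5, 0.0], [0.1, 0.6, 0.0]]
avg = average_logits([seed_a, seed_b, seed_c])
preds = [argmax(row) for row in avg]
print(preds)
```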
5.5 · For the on-site 7-way extension: prototype classifier with frozen encoder
The on-site rules forbid retraining the 5-way head and forbid new learned parameters. Solution: encode every training example for the two new classes through the frozen 5-way model's penultimate layer and average the embeddings to get a "prototype" per new class. At inference, compute the cosine similarity between the input's embedding and each of the 2 new prototypes, scale it by a temperature so it is comparable with logits, append it to the 5 existing logits, and pick the argmax over all 7 scores. This is the official permitted move ("compute averages and distances between encodings").
import torch, torch.nn.functional as F

# encoder is the BERT body without the classifier head (e.g. clf_model.bert)
def embed(texts, encoder, tok):
    ids = tok(texts, padding=True, truncation=True, max_length=64,
              return_tensors="pt").to("cuda")
    with torch.no_grad():
        h = encoder(**ids).last_hidden_state[:, 0]  # [CLS]
    return F.normalize(h, dim=-1)

# prototypes for the 2 new classes; new_class_texts is a placeholder dict
# mapping class id -> list of labelled texts for that class
# the mean of unit vectors is not unit-length, so re-normalise for a true cosine
proto_new = {c: F.normalize(embed(new_class_texts[c], encoder, tok).mean(0), dim=-1)
             for c in [5, 6]}

def predict_extended(text, clf_logits_fn, encoder, tok, T=10.0):
    logits_old = clf_logits_fn(text)  # (5,) tensor on the same device
    h = embed([text], encoder, tok)[0]
    # cosine similarity to each prototype, scaled by T to be comparable with logits
    sims_new = torch.stack([h @ proto_new[c] for c in (5, 6)]) * T  # (2,)
    full = torch.cat([logits_old, sims_new])
    return int(full.argmax())
6. Submission format & gotchas
- Submission is submission.csv with columns id, label, and integer labels in {0, …, 4} (or {0, …, 6} on-site).
- Make sure id order matches test.csv: the grader joins on id, but a mismatched length silently fails.
- Macro-F1 punishes "predict majority class". With 5 classes, a 0.95-accuracy submission that only ever predicts the majority class scores about 0.20 macro-F1, because four of the five per-class F1 values are 0.
- Don't accidentally fit your BPE on val labels (only on text). Don't fit any scaler/encoder on the test labels (you don't have them).
- Save your tokenizer alongside the model — the grader runs your inference notebook fresh.
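The macro-F1 warning in the list above can be checked by hand. The sketch below computes macro-F1 from scratch on an invented, heavily skewed label set where the model predicts class 0 for everything. `macro_f1` is a hypothetical helper that scores a class as 0 when it has no true positives, false positives, or false negatives.

```python
def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of per-class F1 = 2TP / (2TP + FP + FN)."""
    f1s = []
    for c in range(n_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / n_classes

# Invented skew: 95 rows of class 0, five rows spread over classes 1-4.
y_true = [0] * 95 + [1, 1, 2, 3, 4]
y_pred = [0] * 100  # always predict the majority class
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred, n_classes=5)
print(accuracy, round(score, 4))
```

Class 0 gets F1 = 190/195, the other four classes get 0, and the mean lands just under 0.20 despite 0.95 accuracy.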
7. What top solutions did
The official "best solutions" archive on ioai-official.org bundles two write-ups. Common pattern across both: (1) custom BPE tokenizer, (2) tiny BERT (hidden ~192-256, 4-6 layers) MLM-pretrained on the full corpus including unlabelled test, (3) classification fine-tune with class weights, (4) 3-seed ensemble. The on-site write-up extends the at-home model with a frozen-encoder prototype classifier exactly as sketched above. Specific leaderboard numbers are not reproduced publicly outside the official zip; treat any quoted score as [illustrative].
Citations: IOAI 2024 best-solutions archive (linked at top), and the on-site mirror notebook by ssuslyakoff which reproduces the prototype-classifier extension pattern.
8. Drill
D · Why does TF-IDF + LogReg already get 55-65 macro-F1 here, but plateau?
Because the cipher is a deterministic token-level substitution: word-identity is preserved one-to-one, so a bag-of-tokens model still gets per-class lexical signal ("certain ciphered tokens appear mainly in class 2 documents"). The plateau hits because TF-IDF discards token order — and the ciphered language preserves syntax (subject-verb-object, modifier-noun) under the cipher. A transformer with positional encoding can model "ciphered token A almost always follows ciphered token B in class 4 but not in class 1", which is invisible to TF-IDF.
Follow-up: would character-level n-grams (analyzer="char_wb", ngram_range=(3,5)) help? Try it: for a deterministic substitution cipher, character n-grams capture morphology and often add 3-5 points to the baseline at zero extra cost. This is the cheapest improvement you can make.
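To see why character n-grams can pick up morphology, here is a pure-Python sketch of word-bounded character n-grams. It mirrors the idea behind scikit-learn's char_wb analyzer (pad each word with a space on both sides, then slide a window inside it) but is a simplified stand-in, not sklearn's exact edge-case handling. `char_wb_ngrams` is a hypothetical helper name.

```python
def char_wb_ngrams(text, min_n=3, max_n=5):
    """Character n-grams taken inside each space-padded word."""
    grams = []
    for word in text.split():
        padded = f" {word} "
        for n in range(min_n, max_n + 1):
            # words shorter than n yield no n-grams of that length
            for i in range(len(padded) - n + 1):
                grams.append(padded[i:i + n])
    return grams

# Two toy ciphered words that share a root.
root = set(char_wb_ngrams("kran", 3, 3))
compound = set(char_wb_ngrams("kran-gri", 3, 3))
shared = root & compound
print(sorted(shared))
```

A word-level TF-IDF treats "kran" and "kran-gri" as unrelated tokens, while the shared trigrams give the classifier a common feature for both.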
D2 · The on-site task forbids new learned parameters. Why does the prototype trick comply?
"Compute averages and distances between data encodings" is explicitly listed as permitted in the official problem statement. A class prototype is an average of frozen-encoder outputs; the cosine-similarity classifier is a distance. No new weights are trained. If you instead fine-tuned a 7-way head, you would add and learn new parameters, which is disallowed.