IOAI 2024 · NLP · Fine-tune a language model on a ciphered language

Contest: IOAI 2024 (Bulgaria) · Round: Scientific, at-home stage · Category: NLP · On-site sibling: 7-way extension of the same classifier.

Official sources: ioai-official.org/2024-tasks · at-home problems (zip) · best solutions (zip) · open-cu/awesome-ioai-tasks · on-site notebook mirror.

1. Problem restatement

Organisers release a corpus of text in an unknown ciphered language — natural-language sentences that have been deterministically transformed (substitution cipher plus token-level reshuffling) so that no off-the-shelf tokenizer or pre-trained model has ever seen them. Each sentence carries one of 5 class labels in the at-home task. Contestants get a small training set (~thousands of labelled sentences), a validation set, and a hidden test set. The task: build a classifier that maps a ciphered sentence to its label.

The on-site twist adds two new classes (giving 7 in total) with very few labelled examples per new class, and forbids retraining the original 5-way head — you must extend it.

Hardware budget per the official notebook: solution must train end-to-end in < 1 hour on a single L4 GPU and run inference on 500 samples in < 2 minutes. Allowed: any open-weights model from HuggingFace, any tokenizer, standard PyTorch.

Source. Paraphrased from the IOAI 2024 at-home and on-site task PDFs linked above. The on-site notebook (IOAI 2024_ NLP Problem on-site.ipynb) confirms the L4 / 1-hour / 2-minute runtime budget.

2. What's being tested

The problem deliberately breaks the pre-training assumption: because the language is invented, word-piece tokenizers fragment ciphered tokens into nonsense subwords, and any pre-trained semantic knowledge in BERT/RoBERTa/Llama is useless. What survives is the distributional structure — how tokens co-occur with each other and with class labels. So the test is really:

Can you build a useful tokenizer from scratch (BPE/WordPiece) on a small corpus?
Can you train a small transformer from scratch (or distil one) in < 1 hour?
Do you know that masked-LM pre-training on the unlabelled corpus is a free lunch before classification?
Can you handle severe class imbalance (the on-site 2 extra classes have ~10 samples each)?

This maps directly onto the Transformers page (encoder fine-tuning, LayerNorm, AdamW), the Python page (HF datasets, tokenizers), and the Deep Learning page (early stopping, class-weighted cross-entropy).

3. Data exploration / setup

The data ships as three CSVs: train.csv, val.csv, test.csv, each with columns id, text, and (for train/val) label. The text field looks like:

# a row from train.csv (paraphrased example — real ciphered tokens are 5-8 char nonsense)
{
  "id":   17,
  "text": "zlpu krapa-mer voli kran zlpu kran-gri",
  "label": 2,
}

Quick EDA things to look at before you touch a model:

Vocabulary size after whitespace tokenisation — typically a few thousand unique tokens. Small enough to train BPE from scratch.
Sentence length distribution — usually 5-30 tokens. So max_len=64 is plenty.
Class balance — at-home task is roughly uniform across 5 classes (a few hundred examples per class).
Train + val + test unlabelled tokens — combined ~50-100k tokens. Enough to pre-train a small encoder via MLM.
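The checks above can be sketched in a few lines. This is a minimal sketch using only the standard library; the `rows` list is a hypothetical stand-in for train.csv, and its tokens and labels are made up for illustration.

```python
from collections import Counter
from statistics import median

# Hypothetical stand-in for train.csv rows: (text, label) pairs with made-up tokens.
rows = [
    ("zlpu krapa-mer voli kran", 2),
    ("voli kran zlpu", 0),
    ("kran-gri zlpu zlpu voli mer", 2),
    ("mer krapa voli", 1),
]

def eda_summary(rows):
    """Return vocabulary size, sentence-length stats, and per-class counts."""
    token_lists = [text.split() for text, _ in rows]   # whitespace tokenisation
    vocab = Counter(tok for toks in token_lists for tok in toks)
    lengths = [len(toks) for toks in token_lists]
    return {
        "vocab_size": len(vocab),
        "total_tokens": sum(lengths),
        "min_len": min(lengths),
        "median_len": median(lengths),
        "max_len": max(lengths),
        "class_counts": dict(Counter(label for _, label in rows)),
    }

summary = eda_summary(rows)
print(summary)
```

On the real data, the same summary would be built from the text and label columns of train.csv; the max_len value is what justifies (or not) the max_len=64 choice.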

Metric: macro-F1 on the hidden test set (so per-class recall counts, not just overall accuracy). The leaderboard score is macro-F1 scaled to 0-100.

4. Baseline approach

A baseline that reaches roughly 0.55-0.65 macro-F1 [illustrative] in under 20 lines: treat each ciphered word as an opaque token, build TF-IDF features over 1-2 grams, and fit a class-balanced logistic regression. No GPU needed.

import pandas as pd, numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

train = pd.read_csv("train.csv")
val   = pd.read_csv("val.csv")

# treat ciphered "words" as opaque tokens; TF-IDF with 1-2 grams
vec = TfidfVectorizer(ngram_range=(1, 2), min_df=2, max_df=0.95, sublinear_tf=True)
Xtr = vec.fit_transform(train.text)
Xva = vec.transform(val.text)

clf = LogisticRegression(C=4.0, max_iter=2000, class_weight="balanced")
clf.fit(Xtr, train.label)
pred = clf.predict(Xva)
print("val macro-F1:", f1_score(val.label, pred, average="macro"))
# expected ~ 0.55-0.65 on the at-home dev set [illustrative]

Score band: this baseline scored roughly in the 55-65 range on official-format dev splits in community write-ups; absolute leaderboard numbers are not public so treat the band as [illustrative].

5. Improvements that move the needle

5.1 · Train a small BPE tokenizer and MLM-pretrain a tiny BERT

The ciphered language has its own subword regularities (the same morphemes repeat across roots). Train a BPE tokenizer on train + val + test text (test text is unlabelled, so using its tokens is fair under the rules) and then pre-train a 4-layer encoder with masked-LM for 5-10 minutes on L4. The encoder learns "which token tends to appear near which" — exactly the structure a linear bag-of-words discards.

import pandas as pd
from tokenizers import Tokenizer, models, trainers, pre_tokenizers, processors
from transformers import BertConfig, BertForMaskedLM, PreTrainedTokenizerFast
import torch
from torch.utils.data import Dataset, DataLoader

test = pd.read_csv("test.csv")  # text only; train and val come from the baseline above

# 1. learn BPE on all available text
tok = Tokenizer(models.BPE(unk_token="[UNK]"))
tok.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(
    vocab_size=8000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
all_text = list(train.text) + list(val.text) + list(test.text)
tok.train_from_iterator(all_text, trainer)
# wrap each sequence as [CLS] ... [SEP] so position 0 of the output really is [CLS]
tok.post_processor = processors.TemplateProcessing(
    single="[CLS] $A [SEP]",
    special_tokens=[("[CLS]", tok.token_to_id("[CLS]")),
                    ("[SEP]", tok.token_to_id("[SEP]"))],
)
tok.save("cipher_bpe.json")

# generic fast-tokenizer wrapper; BertTokenizerFast assumes a BERT-style WordPiece setup
hf_tok = PreTrainedTokenizerFast(tokenizer_file="cipher_bpe.json",
                                 cls_token="[CLS]", sep_token="[SEP]",
                                 pad_token="[PAD]", mask_token="[MASK]", unk_token="[UNK]")

# 2. tiny BERT (≈3M params at the full 8k vocab; fits L4 fine-tune in <15 min)
cfg = BertConfig(vocab_size=hf_tok.vocab_size, hidden_size=192,
                 num_hidden_layers=4, num_attention_heads=6,
                 intermediate_size=512, max_position_embeddings=64)
mlm = BertForMaskedLM(cfg).cuda()
# ... standard MLM pre-training loop with 15% random masking ...
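The masking step that the elided loop relies on can be sketched in plain Python. This follows the usual BERT recipe (select about 15% of positions; of those, 80% become [MASK], 10% become a random token, 10% stay unchanged), which is an assumption about what "standard" means here. The function name `mask_tokens` is hypothetical, and the special-token ids 0-4 assume the order passed to BpeTrainer. In practice `DataCollatorForLanguageModeling` from transformers does this for you.

```python
import random

IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy by default

def mask_tokens(ids, mask_id, vocab_size, special_ids, rng, mlm_prob=0.15):
    """Return (corrupted_ids, labels) for one sequence of token ids."""
    corrupted, labels = list(ids), [IGNORE_INDEX] * len(ids)
    for i, tok in enumerate(ids):
        if tok in special_ids or rng.random() >= mlm_prob:
            continue                      # position not selected for prediction
        labels[i] = tok                   # the model must recover the original id
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = mask_id        # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: leave the token unchanged
    return corrupted, labels

rng = random.Random(0)
# Assumption: ids 0-4 are [PAD], [UNK], [CLS], [SEP], [MASK], in BpeTrainer order.
special_ids = {0, 1, 2, 3, 4}
ids = [2] + [rng.randrange(5, 8000) for _ in range(2000)] + [3]
corrupted, labels = mask_tokens(ids, mask_id=4, vocab_size=8000,
                                special_ids=special_ids, rng=rng)
selected = sum(lab != IGNORE_INDEX for lab in labels)
print(f"selected {selected / 2000:.1%} of positions for prediction")
```

Positions labelled -100 contribute nothing to the loss, so the model is trained only on the selected tokens.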

5.2 · Fine-tune for classification with class-balanced loss

Swap the MLM head for a classification head, fine-tune for 5-10 epochs. Use class_weight = compute_class_weight("balanced", ...) inside cross-entropy so rare classes aren't ignored.

from transformers import BertForSequenceClassification
from sklearn.utils.class_weight import compute_class_weight

mlm.save_pretrained("cipher_mlm")
clf_model = BertForSequenceClassification.from_pretrained(
    "cipher_mlm", num_labels=5).cuda()

w = compute_class_weight("balanced", classes=np.arange(5), y=train.label.values)
loss_fn = torch.nn.CrossEntropyLoss(weight=torch.tensor(w, dtype=torch.float, device="cuda"))
opt = torch.optim.AdamW(clf_model.parameters(), lr=3e-5, weight_decay=0.01)
# standard training loop with linear warmup + cosine decay ...
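The schedule named in the last comment can be written as a pure function of the step count. This is a minimal sketch; `lr_at_step` is a hypothetical helper, and the step counts are made-up values. With transformers installed, `get_cosine_schedule_with_warmup` provides the same shape as a PyTorch scheduler.

```python
import math

def lr_at_step(step, total_steps, warmup_steps, peak_lr=3e-5):
    """Linear warmup from 0 to peak_lr, then cosine decay back to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

total_steps, warmup_steps = 1000, 100   # hypothetical values
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at_step(step, total_steps, warmup_steps))
```

The learning rate climbs to its peak at step 100, is back at half the peak midway through the decay (step 550), and reaches zero at the final step.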

5.3 · Self-training / pseudo-labels on the unlabelled test set

After fine-tuning, predict on the test set, keep predictions with probability > 0.9, fold them into the training data with a 0.5× weight, and re-train. This can lift macro-F1 by a few points (2-4 is a plausible band, [illustrative]); the gain depends on how much unlabelled text there is relative to the labelled set.
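The selection step can be sketched as follows. This is a minimal sketch under stated assumptions: `select_pseudo_labels` is a hypothetical helper, the logits are made up, and the 0.9 threshold and 0.5 weight are the values quoted in the text.

```python
import math

def softmax(logits):
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def select_pseudo_labels(test_logits, threshold=0.9, weight=0.5):
    """Return (row_index, predicted_label, sample_weight) for confident rows."""
    picked = []
    for i, logits in enumerate(test_logits):
        probs = softmax(logits)
        label = max(range(len(probs)), key=probs.__getitem__)
        if probs[label] > threshold:
            picked.append((i, label, weight))
    return picked

# Hypothetical logits for three test sentences over 5 classes.
test_logits = [
    [6.0, 0.5, 0.2, 0.1, 0.0],   # confident: class 0
    [1.0, 1.2, 0.9, 1.1, 1.0],   # near-uniform: dropped
    [0.0, 0.0, 0.0, 5.5, 0.3],   # confident: class 3
]
pseudo = select_pseudo_labels(test_logits)
print(pseudo)  # [(0, 0, 0.5), (2, 3, 0.5)]
```

The returned weights would multiply the per-sample loss during re-training, so a wrong pseudo-label costs half as much as a wrong gold label.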

5.4 · Ensemble 3 seeds, average logits

Tiny transformers have high variance. Train 3 models with different seeds and average their pre-softmax logits before argmax. Reliable +1-2 macro-F1.
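Averaging before the argmax can be sketched in plain Python. The logits below are made up, and `average_logits` is a hypothetical helper; with PyTorch tensors the same thing is a `torch.stack(...).mean(0)`.

```python
def average_logits(per_model_logits):
    """per_model_logits[m][i] is model m's logit vector for sample i."""
    n_models = len(per_model_logits)
    averaged = []
    for sample_logits in zip(*per_model_logits):      # one tuple of vectors per sample
        averaged.append([sum(col) / n_models for col in zip(*sample_logits)])
    return averaged

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

# Hypothetical logits from 3 seeds for 2 samples over 5 classes.
seed_logits = [
    [[2.0, 1.9, 0.0, 0.0, 0.0], [0.1, 0.0, 3.0, 0.2, 0.0]],
    [[1.0, 2.5, 0.0, 0.0, 0.0], [0.0, 0.3, 2.0, 0.1, 0.0]],
    [[1.2, 1.8, 0.0, 0.0, 0.0], [0.2, 0.1, 2.5, 0.0, 0.0]],
]
ensemble_pred = [argmax(v) for v in average_logits(seed_logits)]
print(ensemble_pred)  # [1, 2]
```

On the first sample the first seed alone would predict class 0 by a narrow margin; the average follows the other two seeds and predicts class 1.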

5.5 · For the on-site 7-way extension: prototype classifier with frozen encoder

The on-site rules forbid retraining the 5-way head and forbid new learned parameters. Solution: encode every training example for the two new classes with the frozen 5-way model's encoder (the [CLS] representation just below the classification head) and average the embeddings to get a "prototype" per new class. At inference, compute the cosine similarity between the input's embedding and each of the 2 new prototypes, scale it by a fixed temperature so it is comparable to the 5 existing logits, concatenate the two, and pick the argmax. This is the official permitted move ("compute averages and distances between encodings").

import torch, torch.nn.functional as F

# encoder is the BERT body without the classifier head
def embed(text, encoder, tok):
    ids = tok(text, padding=True, truncation=True, max_length=64,
              return_tensors="pt").to("cuda")
    with torch.no_grad():
        h = encoder(**ids).last_hidden_state[:, 0]  # [CLS]
    return F.normalize(h, dim=-1)

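# prototypes for the 2 new classes; samples_by_class is a hypothetical dict
# {5: [text, ...], 6: [text, ...]} holding the few labelled on-site examples
proto_new = {c: F.normalize(embed(samples_by_class[c], encoder, tok).mean(0), dim=-1)
             for c in [5, 6]}

def predict_extended(text, clf_logits_fn, encoder, tok, T=10.0):
    logits_old = clf_logits_fn(text)                    # (5,), on the same device as h
    h = embed([text], encoder, tok)[0]
    # both sides are unit vectors, so the dot product is a cosine similarity;
    # T is a hand-set scale (not trained) that puts it on the logits' range
    sims_new = torch.stack([h @ proto_new[c] for c in (5, 6)]) * T  # (2,)
    full = torch.cat([logits_old, sims_new])
    return int(full.argmax())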

6. Submission format & gotchas

Submission is submission.csv with columns id,label, integer labels in {0, …, 4} (or {0, …, 6} on-site).
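Emit exactly one row per id in test.csv, ideally in the same order. The grader joins on id, but a submission with the wrong number of rows fails silently.
Macro-F1 punishes "predict majority class". If one class makes up 95% of a 5-class set, always predicting it gets 0.95 accuracy but only about 0.19 macro-F1; ignoring even a single class caps macro-F1 at 0.80.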
Don't accidentally fit your BPE on val labels (only on text). Don't fit any scaler/encoder on the test labels (you don't have them).
Save your tokenizer alongside the model — the grader runs your inference notebook fresh.
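The gap between accuracy and macro-F1 is easy to verify by hand. The sketch below implements macro-F1 directly (the unweighted mean of per-class F1) on a made-up, heavily imbalanced label set and scores a predictor that always outputs the majority class. `sklearn.metrics.f1_score(..., average="macro")` computes the same quantity.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1; a class with no true positives scores 0."""
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical imbalanced set: 95 rows of class 0, and 5 rows spread over classes 1-4.
y_true = [0] * 95 + [1, 2, 3, 4, 1]
y_pred = [0] * 100                      # always predict the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred, labels=range(5))
print(f"accuracy={accuracy:.2f} macro-F1={score:.3f}")  # accuracy=0.95 macro-F1=0.195
```

Class 0 scores an F1 of 190/195, the four ignored classes score 0, and the mean over five classes is about 0.195.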

7. What top solutions did

The official "best solutions" archive on ioai-official.org bundles two write-ups. Common pattern across both: (1) custom BPE tokenizer, (2) tiny BERT (hidden ~192-256, 4-6 layers) MLM-pretrained on the full corpus including unlabelled test, (3) classification fine-tune with class weights, (4) 3-seed ensemble. The on-site write-up extends the at-home model with a frozen-encoder prototype classifier exactly as sketched above. Specific leaderboard numbers are not reproduced publicly outside the official zip; treat any quoted score as [illustrative].

Citations: IOAI 2024 best-solutions archive (linked at top), and the on-site mirror notebook by ssuslyakoff which reproduces the prototype-classifier extension pattern.

8. Drill

D · Why does TF-IDF + LogReg already get 55-65 macro-F1 here, but plateau?

Because the cipher is deterministic and maps words one-to-one, word identity is preserved, so a bag-of-tokens model still gets per-class lexical signal ("certain ciphered tokens appear mainly in class 2 documents"). The plateau hits because TF-IDF over 1-2 grams sees only single tokens and adjacent pairs, while the ciphered language preserves syntax (subject-verb-object, modifier-noun) under the cipher. A transformer with positional encoding can model "ciphered token A appears a few positions after ciphered token B in class 4 but not in class 1", a longer-range pattern that is invisible to the 1-2 gram features.
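The first half of that answer can be checked on a toy example: a one-to-one word substitution renames tokens but leaves every count untouched. This is a sketch with made-up plaintext and a hypothetical cipher (`make_word_cipher` is not the organisers' transformation).

```python
import random
from collections import Counter

def make_word_cipher(vocab, rng):
    """Build a one-to-one map from each plaintext word to a distinct nonsense token."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    codes = set()
    while len(codes) < len(vocab):
        codes.add("".join(rng.choice(letters) for _ in range(6)))
    return dict(zip(sorted(vocab), sorted(codes)))

docs = ["the goal was scored late", "the vote was counted late", "the goal was disputed"]
vocab = {w for d in docs for w in d.split()}
cipher = make_word_cipher(vocab, random.Random(0))

plain_counts = [Counter(d.split()) for d in docs]
ciphered_counts = [Counter(cipher[w] for w in d.split()) for d in docs]

# Renaming every key through the cipher reproduces the ciphered counts exactly,
# so any bag-of-tokens feature matrix is identical up to a column permutation.
renamed = [Counter({cipher[w]: n for w, n in c.items()}) for c in plain_counts]
print(renamed == ciphered_counts)  # True
```

A linear model on such counts therefore performs the same on the ciphered text as it would on the plaintext; what it lacks in both cases is word order beyond the n-gram window.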

Follow-up: would character-level n-grams (analyzer="char_wb", ngram_range=(3,5)) help? Try it. If the substitution keeps sub-word structure intact, character n-grams capture morphology and may add a few points to the baseline (3-5, [illustrative]) at zero extra cost. This is the cheapest improvement you can make.
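To see what features that analyzer produces, here is a rough plain-Python approximation of word-bounded character n-grams: each whitespace-delimited word is padded with one space on each side and n-grams are taken inside it, never across words. `char_wb_ngrams` is a hypothetical helper, and its handling of words shorter than n differs from scikit-learn's.

```python
def char_wb_ngrams(text, min_n=3, max_n=5):
    """Character n-grams taken inside each space-padded word (never across words)."""
    grams = []
    for word in text.split():
        padded = f" {word} "
        for n in range(min_n, max_n + 1):
            if len(padded) < n:
                break                      # word too short for this n and larger ones
            grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams

grams = char_wb_ngrams("kran kran-gri", min_n=3, max_n=3)
print(grams)
```

The two tokens share the trigrams " kr", "kra", and "ran", so a model on these features can relate "kran" to "kran-gri" even though a word-level vocabulary treats them as unrelated entries.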

D2 · The on-site task forbids new learned parameters. Why does the prototype trick comply?

"Compute averages and distances between data encodings" is explicitly listed as permitted in the official problem statement. A class prototype is an average of frozen-encoder outputs; the cosine-similarity classifier is a distance. No new weights are trained. If you instead fine-tuned a 7-way head, you would add and learn new parameters, which is disallowed.
