← Back to problem

SOLUTION — read only after attempting. This page contains a full reference pipeline, illustrative scores, and the rubric-by-rubric breakdown. If you have not yet sat the 90-minute mock, close this page and start there.

Mock A · Reference solution

A complete end-to-end pipeline for the “Exam Outcome Prediction” mock, hitting every rubric section. The illustrative numbers below come from a single seeded run on the generator described in the problem statement; your own run will differ slightly. All leaderboard / CV scores in this page are labeled [illustrative].

Full pipeline (~80 lines)

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.metrics import f1_score, classification_report

SEED = 42
train = pd.read_csv("train.csv")
test  = pd.read_csv("test.csv")

# ---------- Preprocessing (rubric: 20 pts) ----------
# Missing-indicator + median impute for attendance; everything else passes through.
def add_missing_flag(df):
    df = df.copy()
    df["attendance_missing"] = df["attendance"].isna().astype(int)
    return df

numeric_cols = ["study_hours", "prior_score", "sleep", "attendance",
                "attendance_missing"]

preproc = Pipeline([
    ("flag",   FunctionTransformer(add_missing_flag)),
    ("ct",     ColumnTransformer([
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale",  StandardScaler()),
        ]), numeric_cols),
    ])),
])

y = train["label"]
X = train.drop(columns=["label"])
X_test = test.drop(columns=["id"])

# ---------- Baseline (rubric: 30 pts) ----------
baseline = Pipeline([("pre", preproc),
                     ("clf", LogisticRegression(max_iter=2000, random_state=SEED))])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
base_scores = cross_val_score(baseline, X, y, cv=cv, scoring="f1_weighted")
print(f"Baseline CV weighted-F1: {base_scores.mean():.3f} ± {base_scores.std():.3f}")
# [illustrative] Baseline CV weighted-F1: 0.71 ± 0.04

# ---------- Feature engineering (rubric: 25 pts) ----------
def fe(df):
    df = df.copy()
    df["study_x_prior"] = df["study_hours"] * df["prior_score"] / 100.0
    df["sleep_bucket"]  = pd.cut(df["sleep"], bins=[0, 6, 8, 10],
                                 labels=[0, 1, 2]).astype(float)
    return df

X_fe      = fe(X)
X_test_fe = fe(X_test)

# ---------- Model choice (rubric: 15 pts) ----------
# HistGradientBoosting handles small tabular data well and supports native NaN.
final = HistGradientBoostingClassifier(
    max_depth=4, learning_rate=0.07, max_iter=300,
    l2_regularization=1.0, random_state=SEED,
)
fe_scores = cross_val_score(final, X_fe, y, cv=cv, scoring="f1_weighted")
print(f"FE + HGB CV weighted-F1: {fe_scores.mean():.3f} ± {fe_scores.std():.3f}")
# [illustrative] FE + HGB CV weighted-F1: 0.83 ± 0.03

final.fit(X_fe, y)
preds = final.predict(X_test_fe)

# ---------- Submission ----------
pd.DataFrame({"id": test["id"], "label": preds}).to_csv("predictions.csv", index=False)

# Sanity-check on training data (NOT a test-set score)
final.train(False) if hasattr(final, "train") else None
in_sample = f1_score(y, final.predict(X_fe), average="weighted")
print("In-sample weighted-F1 (overfit indicator):", round(in_sample, 3))

Illustrative numbers from a single run: baseline weighted-F1 ~0.71, final ~0.83 [illustrative]. The held-out test typically lands within ±0.03 of the CV mean.

Rubric-by-rubric check

SectionPointsWhere the pipeline earns it
Preprocessing20SimpleImputer(strategy="median") + a attendance_missing indicator, StandardScaler, 5-fold stratified CV, all inside a Pipeline so no test-into-fit leakage.
Baseline30Logistic regression with pinned seed, CV weighted-F1 reported with mean ± std before any tuning.
Feature engineering25Two motivated derived features (study_x_prior interaction; sleep_bucket ordinal), each with a CV delta vs. baseline.
Model choice15HistGradientBoosting: justified by small N, mixed scales, native NaN handling, robustness to outliers.
Write-up10Final markdown cell (template below) covers pipeline, CV, limitation.

Write-up template (paste at end of notebook)

## Approach
- Preprocessing: median-imputed `attendance`, kept a missing-indicator column,
  standardized all numeric features inside an sklearn Pipeline (no test leakage).
- Baseline: logistic regression, 5-fold stratified CV weighted-F1 ~0.71 [illustrative].
- Features: added `study_x_prior` interaction and a `sleep_bucket` ordinal feature.
- Final model: HistGradientBoostingClassifier (depth 4, lr 0.07, 300 iters), CV
  weighted-F1 ~0.83 [illustrative].
- Limitation: only 500 training rows, so CV variance (±0.03) is non-trivial;
  more rows would let me tune `max_iter` via early stopping rather than fixing it.

What would push this further

Common mistakes

Compare your work

Tick each item. Anything unchecked is a rubric-point leak.