2018 · Problem A — Roller Coaster
Feature engineering · TOPSIS / EWM · Ranking · Data cleaning
The prompt, restated
A theme-park enthusiast magazine wants a more objective ranking of the world's best roller coasters. The contest supplies a dataset of several hundred coasters operating in 2018, with features including type (steel / wood / hybrid), top speed, maximum height, maximum drop, total length, number of inversions, duration, year opened, manufacturer, and country.
The team is asked to (1) decide which features actually capture "best coaster" and explain why, (2) build a model that produces a single objective ranking from the dataset, (3) compare the top 10 of the model to an enthusiast public-poll top-10 list, (4) discuss whether "oldest still-running" or "manufacturer pedigree" deserve to be features, and (5) write a 2-page article for the magazine explaining the methodology in lay terms.
Key modeling idea
This is a multi-criteria ranking problem on tabular data. Unlike the 2020-A summer-job problem (which had a single decision-maker), the weights here must either be elicited from a stakeholder (the magazine's editorial board) or derived from the data itself. The entropy weight method (EWM) suits the second route because it is purely data-driven; the analytic hierarchy process (AHP) suits the first, provided you can defend an editorial weight profile.
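For reference, the entropy weights can be written out as below. The notation is an assumption introduced here, not taken from the problem statement: $x_{ij}$ is the normalized value of coaster $i$ on feature $j$, with $n$ coasters and $m$ features.

$$
\begin{aligned}
p_{ij} &= \frac{x_{ij}}{\sum_{i=1}^{n} x_{ij}}, \\
e_j &= -\frac{1}{\ln n} \sum_{i=1}^{n} p_{ij} \ln p_{ij} \qquad (\text{taking } 0 \ln 0 = 0), \\
w_j &= \frac{1 - e_j}{\sum_{k=1}^{m} (1 - e_k)}.
\end{aligned}
$$

A feature whose values are spread almost evenly across coasters has $e_j$ close to 1 and so receives a weight close to 0; a feature that separates coasters sharply has lower entropy and a larger weight.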
Suggested approach
- Step 1 — Clean the data. Many coasters are missing duration or drop. Impute by type-group median; document every imputation.
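- Step 2 — Feature engineering. Derive extra features: an estimated peak G-force (proportional to $v^2/r$, where $v$ is speed and $r$ is the track's radius of curvature; the listed features include no radius, so this needs a proxy or an outside source), thrill-time density (inversions per second of ride duration), and a legacy bonus (years in operation).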
- Step 3 — Weight via EWM: this rewards features that vary informatively across the dataset. Sanity-check against an AHP alternative weight set.
- Step 4 — Rank via TOPSIS. Produce a top-50 list and a top-10 list.
- Step 5 — Compare to enthusiast polls. Use Spearman's rank correlation against the Mitch Hawker / Golden Ticket Awards top-10 list. Discuss the gaps.
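Steps 1 and 2 above might look like the following minimal sketch. The toy rows, the column names (`type`, `drop_ft`, `duration_s`, `year_opened`), and the helper `impute_by_type` are all hypothetical and are not the COMAP schema; the point is only to show type-group median imputation with an audit flag, followed by two derived features.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; column names are assumptions, not the COMAP schema.
raw = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E", "F"],
    "type": ["steel", "steel", "steel", "wood", "wood", "wood"],
    "drop_ft": [190.0, np.nan, 300.0, 140.0, 160.0, np.nan],
    "duration_s": [150.0, 130.0, np.nan, 110.0, np.nan, 125.0],
    "inversions": [4, 0, 1, 0, 0, 0],
    "year_opened": [2000, 1995, 2010, 1980, 1990, 2005],
})

def impute_by_type(df, cols, group_col="type"):
    """Fill NaNs with the median of the same coaster type; flag every fill."""
    out = df.copy()
    for col in cols:
        # Keep a boolean audit column so each imputation can be documented.
        out[col + "_imputed"] = out[col].isna()
        group_median = out.groupby(group_col)[col].transform("median")
        out[col] = out[col].fillna(group_median)
    return out

clean = impute_by_type(raw, ["drop_ft", "duration_s"])

# Derived features from Step 2 (2018 is the dataset year given in the prompt).
clean["inversions_per_s"] = clean["inversions"] / clean["duration_s"]
clean["years_operating"] = 2018 - clean["year_opened"]
print(clean)
```

The `*_imputed` flag columns make it easy to report how many values were filled per feature, and to re-run the ranking on complete rows only as a robustness check.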
Data sources to consider
| Source | What you get |
|---|---|
| Roller Coaster DataBase (RCDB.com) | The canonical coaster database — speed, height, drop, length, inversions |
| COMAP-provided dataset | Subset of RCDB curated for the problem |
| Mitch Hawker Best Coaster Poll | Enthusiast top-100 list, 2003–2017 |
| Golden Ticket Awards (Amusement Today) | Annual industry top-50 rankings |
| TripAdvisor / Reddit r/rollercoasters | Lay-public popularity signals |
Common pitfalls and judge commentary patterns
- Treating all features as benefit. Year-opened is ambiguous (older could be heritage or dated). Pick a direction and justify.
- Equal weights without saying so. Many papers silently sum z-scores; this is equal weighting with a hidden assumption, not a neutral default. State it.
- Skipping missing-data handling. Half the coasters are missing one field; imputation strategy matters.
- No correlation with public taste. The prompt explicitly compares against enthusiast polls; ignoring this is a red flag.
- Article that reads like the paper. The 2-page magazine piece is for enthusiasts, not engineers — voice has to change.
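The first pitfall in the list above can be handled at the normalization step. The sketch below is one possible approach, not a prescribed one: the function name `minmax_directional` and the choice to treat `year_opened` as a cost feature (older is better) are assumptions made for illustration, and either direction needs a written justification in the paper.

```python
import numpy as np

def minmax_directional(X, is_benefit):
    """Min-max scale columns to [0, 1]; flip cost columns so larger is always better."""
    X = np.asarray(X, dtype=float)
    is_benefit = np.asarray(is_benefit, dtype=bool)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid 0/0 on constant columns
    benefit_scaled = (X - lo) / span        # largest raw value maps to 1
    cost_scaled = (hi - X) / span           # smallest raw value maps to 1
    return np.where(is_benefit, benefit_scaled, cost_scaled)

# Hypothetical columns: speed_mph (benefit) and year_opened (treated as cost,
# i.e. older = heritage = better; this is a modelling choice to justify).
X = np.array([[95, 1990], [128, 2005], [70, 1976]])
Xn = minmax_directional(X, is_benefit=[True, False])
print(Xn.round(3))
```

Flipping the flag for `year_opened` and re-ranking is a quick way to show how much the final list depends on that one direction choice.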
Python sketch
An end-to-end EWM + TOPSIS pipeline on a small illustrative coaster dataframe. Swap in a real CSV built from RCDB data to use it in earnest.
```python
import numpy as np
import pandas as pd

# Illustrative subset of features
df = pd.DataFrame({
    "name":       ["A", "B", "C", "D", "E", "F", "G", "H"],
    "speed_mph":  [95, 80, 128, 102, 70, 88, 110, 75],
    "height_ft":  [200, 180, 456, 300, 150, 210, 325, 170],
    "drop_ft":    [190, 170, 418, 275, 140, 200, 300, 160],
    "length_ft":  [4500, 3800, 8133, 5800, 3200, 4300, 6700, 3500],
    "inversions": [4, 0, 0, 3, 0, 2, 1, 0],
    "duration_s": [150, 130, 210, 180, 110, 160, 200, 125],
})
X = df.drop(columns="name").to_numpy(dtype=float)

# Step 1: min-max normalize each column to [0, 1] (all benefit features here);
# the 1e-9 guards against division by zero on a constant column
Xn = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0) + 1e-9)

# Step 2: EWM weights
P = Xn / Xn.sum(axis=0)                      # each column as a distribution over coasters
n = Xn.shape[0]
plogp = P * np.log(np.where(P > 0, P, 1.0))  # treat 0 * log(0) as 0 without warnings
ent = -plogp.sum(axis=0) / np.log(n)         # entropy per feature
w = (1 - ent) / (1 - ent).sum()              # lower entropy -> larger weight

# Step 3: TOPSIS
V = Xn * w                                   # weighted normalized matrix
ideal_best, ideal_worst = V.max(axis=0), V.min(axis=0)
d_plus = np.linalg.norm(V - ideal_best, axis=1)    # distance to the ideal coaster
d_minus = np.linalg.norm(V - ideal_worst, axis=1)  # distance to the anti-ideal coaster
df["score"] = d_minus / (d_plus + d_minus)         # closeness in [0, 1]; higher is better
print(df.sort_values("score", ascending=False).round(3))
```
Sensitivity & validation checklist
- Re-rank with equal weights, AHP-elicited weights, and EWM weights. How much do the three top-10 lists overlap?
- Drop one feature at a time; does any single column dominate the ranking?
- Spearman correlation vs. Mitch Hawker top-100 — report the number.
- Split by type (steel vs. wood) — does the model produce a sensible split-leaderboard?
- Check that no famous coaster (e.g., Kingda Ka, Steel Vengeance) ranks below #50; if one does, investigate the feature responsible.
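Several items on the checklist above can be automated. The sketch below runs on a random stand-in matrix, so its numbers mean nothing for real coasters; the helper names (`topsis_scores`, `spearman_rho`, `top_k_overlap`) and the alternative weight vector are hypothetical. Spearman's rho is computed here as the Pearson correlation of average ranks, which is its standard definition.

```python
import numpy as np
import pandas as pd

def topsis_scores(Xn, w):
    """TOPSIS closeness for already-normalized, all-benefit columns."""
    V = Xn * w
    d_plus = np.linalg.norm(V - V.max(axis=0), axis=1)
    d_minus = np.linalg.norm(V - V.min(axis=0), axis=1)
    return d_minus / (d_plus + d_minus)

def spearman_rho(a, b):
    """Spearman correlation = Pearson correlation of the (average) ranks."""
    ra, rb = pd.Series(a).rank(), pd.Series(b).rank()
    return float(np.corrcoef(ra, rb)[0, 1])

def top_k_overlap(scores_a, scores_b, k):
    """Number of items shared by the two top-k sets."""
    top_a = set(np.argsort(-scores_a)[:k])
    top_b = set(np.argsort(-scores_b)[:k])
    return len(top_a & top_b)

# Hypothetical normalized matrix (rows = coasters, cols = features) and weights.
rng = np.random.default_rng(0)
Xn = rng.random((30, 5))
w_equal = np.full(5, 1 / 5)
w_alt = np.array([0.4, 0.3, 0.1, 0.1, 0.1])  # stand-in for an AHP or EWM weight set

base = topsis_scores(Xn, w_equal)
alt = topsis_scores(Xn, w_alt)
print("top-10 overlap:", top_k_overlap(base, alt, 10))
print("Spearman rho:", round(spearman_rho(base, alt), 3))

# Leave-one-feature-out: how far does the ranking move without column j?
for j in range(Xn.shape[1]):
    keep = [c for c in range(Xn.shape[1]) if c != j]
    w = w_equal[keep] / w_equal[keep].sum()  # renormalize remaining weights
    loo = topsis_scores(Xn[:, keep], w)
    print(f"drop feature {j}: rho vs. base = {spearman_rho(base, loo):.3f}")
```

The same `spearman_rho` helper can compare model scores against poll positions, restricted to the coasters that appear in both lists; negate the poll positions first so that a higher value means better in both inputs.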