Outstanding paper anatomy
Every year roughly 1% of HiMCM submissions are designated Outstanding. The gap between Finalist and Outstanding is not the math — it is what a judge notices in the first ninety seconds and whether the paper holds together under a careful second read. This page reverse-engineers what those papers do that the other 99% do not.
1. Why this page exists
The COMAP rubric (Summary, Restatement, Assumptions, Model, Results, Sensitivity, Strengths/Weaknesses, Conclusion, References) is public. Yet most teams who follow it land in Meritorious. That is because the rubric tells you what to put in, not what makes each section good. The point of this page is to fill in the second half.
The patterns below are drawn from published commentaries on Outstanding papers from HiMCM and MCM/ICM contests 2018–2024 — in particular the editor-in-chief retrospectives in The UMAP Journal volumes 39–45, plus the judges' commentary articles that COMAP releases alongside each problem. Quotes are paraphrased, never fabricated; the named contest years are real, the specific phrasings are mine.
2. The opening 90 seconds
A first-round judge typically spends about five minutes per paper, and the first ninety of those seconds are on the Summary Sheet. The papers that survive triage do four things on page 1:
- Name the problem in one sentence — not paraphrased from the prompt, framed in the team's own terms with the scope bounded.
- Name the approach in one sentence — the specific technique, not "we built a mathematical model."
- State the numerical answer — actual numbers, units, and the conditions under which they hold.
- State three to four recommendations — what the stakeholder should do, ranked.
Two useful references are the 2024 Problem A sample paper hosted on this site at problems/2024-a-sample-paper.html (the "To Play or Not to Play" Wordle problem) and the 2023 Problem A commentary on "Dandelions: A Pretty Flower or Pesky Weed?" published in The UMAP Journal 44.2. Outstanding papers from both years opened with summaries that were independently readable: a judge could skip directly to the conclusion and still know what the team had recommended and why.
"We model the spread of dandelions on a 1 km² grass plot as a coupled reaction–diffusion system in which seed dispersal follows an empirical wind-kernel and germination depends on local soil moisture. Calibrated against the USDA Plants Database (2019–2023) for Taraxacum officinale, our model predicts a 38 % increase in coverage over five years under business-as-usual mowing, dropping to 4 % under a twice-weekly mowing regime. We recommend: (i) mow on a 4-day cycle during May–June; (ii) prioritise edges and sunlit patches; (iii) accept dandelions as a net-positive pollinator resource on plots > 5 ha."
Why this works: problem framed, technique named (reaction–diffusion with wind kernel), numbers given with units and conditions, three ranked recommendations, all in three sentences.
"In this paper, we develop a mathematical model to study the spread of dandelions, which are a common weed. We use differential equations and computer simulations to analyse the problem. Our model produces results that show how dandelions spread under different conditions. We then perform sensitivity analysis and provide recommendations to the stakeholders."
Why this fails triage: no numbers, no specific technique (which differential equations?), no scope, no recommendation, and the vague present-tense phrasing signals a TOC-as-summary. A judge flips to page 2.
3. Assumption quality
Assumptions are the most reliably weak section in non-Outstanding papers. The judges' commentary on the 2022 HiMCM Problem B ("Forest Carbon Sequestration") explicitly called out teams that listed "the data is reliable" as an assumption. That is not an assumption — it is a declaration of trust. A real assumption is a claim about the world that the model relies on and that could be wrong.
| Pattern | ✗ Weak (Meritorious-and-below) | ✓ Strong (Outstanding-style) |
|---|---|---|
| Data trust | "We assume the data is accurate and reliable." | "We assume the NOAA buoy temperature record (Station 41001, 2014–2023) is unbiased to within ±0.2 °C, the manufacturer's stated sensor accuracy. This bounds the calibration error in §4.2." |
| Functional form | "We assume the relationship is linear." | "We assume infection rate scales linearly with contact frequency below 50 contacts/day. Above this we use a saturating Hill function, motivated by Anderson & May (1991, §6.3). The 50/day threshold is varied in §6." |
| Geographic scope | "We assume the population is uniform." | "We assume population density within a single ZIP code is uniform but allow it to vary across ZIPs. This loses sub-neighbourhood structure but lets us use the 2020 US Census tract data directly." |
| Time horizon | "We assume conditions remain constant over time." | "We assume the demand curve is stationary over the 12-month optimisation horizon. Multi-year forecasting is out of scope; we revisit this in the weaknesses section." |
| Independence | "We assume the events are independent." | "We assume Wordle players' guess attempts are conditionally independent given the day's solution word. Real players share strategies on social media; we test the impact of correlated guessing in §6.4." |
| Cost / parameter | "We assume the cost is $1000." | "We take the per-unit installation cost as $1,040 (NREL 2023 utility-scale solar benchmark, residential rooftop, 10 kW system). Sensitivity to ±25 % of this value is reported in Table 6." |
| Behaviour | "We assume players play rationally." | "We model players as boundedly-rational: they pick the guess that maximises information gain among their top 20 candidate words by frequency, not over all 2,309 legal words. This matches observed average-guess statistics from the WordleBot dataset." |
| Boundary / closure | "We assume there are no external factors." | "We close the system at the watershed boundary: inflows are limited to the three named tributaries (Table 2); groundwater exchange is assumed zero on the 30-day timescale, justified by aquifer recharge rates < 0.5 cm/day in the USGS regional study." |
The pattern across all eight: a strong assumption cites a source, names a number, and points forward to the section that uses it. A judge skimming the assumptions list should be able to find each assumption used somewhere in the body. If they cannot, the assumption is decorative — and decorative assumptions are a Meritorious-tier tell.
4. Model coherence — keeping the thread unbroken
Once a judge gets past the summary and assumptions, they look for a single quality that most teams fail at: coherence between variables, equations, code, and results. The same symbol means the same thing on page 4, page 11, and page 19. The equation derived in §3 is the equation implemented in the code. The numbers reported in §5 come out of that code.
This sounds trivial. It is not. Under the 14-day clock, with three teammates editing in parallel, threads break. The judges' commentary on the 2021 HiMCM Problem A ("Storing the Sun") flagged several otherwise-strong papers where the storage-capacity variable was called S in §3, Cs in §4, and capacity in the Python listing — and the three values disagreed.
| Layer | ✗ Broken thread | ✓ Unbroken thread |
|---|---|---|
| Variable naming | Symbol S in §3 becomes Cs in §4, becomes capacity in the appendix code. No table of variables. | Notation table on page 4 fixes S = storage capacity (kWh). Every later equation and code listing uses S, with a comment # S = storage capacity, kWh (Notation Table, row 7). |
| Equation provenance | Equation (4) is stated with no derivation. The reader cannot tell whether it follows from Assumption 3 or was looked up online. | Equation (4) is derived in two lines from Assumption 3 and Equation (2), with the derivation in the body, not the appendix. The number "(4)" is cited at the point of use. |
| Code ↔ math | The text says "we use a 4th-order Runge–Kutta." The code uses scipy.integrate.odeint (which is LSODA, not RK4). | The text says "we use SciPy's LSODA via solve_ivp(method='LSODA')" and the code matches; an inline footnote notes why LSODA over RK4 (stiff equations). |
| Reported numbers | §5 reports a result of 47.2 %. The code prints 0.4724. The figure axis labels 47.0 %. | One canonical number per result, sourced from a single code cell, with the cell number and runtime parameter set named ("Run R-3, baseline parameters, Table 4"). |
| Figure ↔ body | Figure 6 appears with no caption, never referenced in the body. The reader must guess what it shows. | "Figure 6 shows the response surface for the two strongest parameters identified in §6.2; the contour at 90 % efficiency intersects the feasible region at S = 12.4 kWh, P = 7.1 kW." |
The simplest discipline that prevents most thread-breaking: a notation table on page 4, and one teammate whose only job in the last 24 hours is to grep the document for every symbol and confirm it matches the table. Outstanding papers feel like they were edited by one person even when written by three.
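Part of that last-day symbol check can be automated. The sketch below assumes the paper is written in LaTeX with inline math in $...$; the function names, the notation table, and the draft text are all hypothetical, and the regular expression only catches simple identifiers such as S or C_s. Treat it as a first pass that supplements a human read, not a replacement for one.

```python
import re

# Matches a simple identifier inside math: letters, optionally followed by
# an underscore subscript (S, P, C_s). Illustrative only; real LaTeX needs more.
IDENTIFIER = re.compile(r"[A-Za-z]+(?:_[A-Za-z0-9]+)?")
INLINE_MATH = re.compile(r"\$([^$]+)\$")


def symbols_in_math(latex_source):
    """Return the set of identifiers that appear inside $...$ spans."""
    found = set()
    for span in INLINE_MATH.findall(latex_source):
        found.update(IDENTIFIER.findall(span))
    return found


def undeclared_symbols(latex_source, notation_table):
    """Identifiers used in the text but missing from the notation table."""
    return symbols_in_math(latex_source) - set(notation_table)


# Hypothetical notation table and draft text, echoing the S / C_s example.
notation = {"S": "storage capacity (kWh)", "P": "charging power (kW)"}
draft = (
    "Section 3 defines the capacity $S$ and the power $P$. "
    "Section 4 then bounds $C_s$ from above."
)

print(sorted(undeclared_symbols(draft, notation)))  # ['C_s']
```

Any symbol the script reports is either a renamed variable or a gap in the notation table; both are worth fixing before submission.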
5. Sensitivity that matters
Almost every paper has a "Sensitivity Analysis" section. Almost none of them earn it. The default move — "we varied each parameter by ±10 % and the results did not change much" — is exactly what judges in the COMAP 2020 MCM Problem D commentary derided as "the ±10 % ritual." It signals that sensitivity was added at the end to tick a box.
What judges actually want from sensitivity:
- A ranking. Which parameters matter most? A tornado plot showing parameter influence by elasticity or partial-rank correlation coefficient, sorted, top 5–10.
- Boundary behaviour. What happens at the edges of the plausible range — not just ±10 %, but the 5th and 95th percentile of the parameter's real-world distribution? Outstanding sensitivity sections sweep to the edge of feasibility and report what breaks.
- Response curves, not just numbers. Show the output as a function of the top-2 parameters across their range, not a single perturbation. A 2D contour or 1D line plot for each top parameter.
- A conclusion change. Outstanding sensitivity sections usually find at least one regime where the recommendation flips. That's the point — to bound the recommendation, not to reassure the reader.
- Honesty about interactions. One-at-a-time (OAT) sensitivity misses interactions. If two parameters interact, Outstanding papers use a Sobol or Morris screening (cheap; SALib in Python) and say so.
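The ranking asked for in the list above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions: the three-parameter model is a made-up stand-in, sampling is plain uniform Monte Carlo rather than Latin Hypercube, and the influence measure is Spearman rank correlation, a simpler relative of PRCC. For Sobol or Morris indices, a dedicated package such as SALib would be the more usual route.

```python
import random


def ranks(values):
    """Rank positions (0 = smallest); ties are not handled in this sketch."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order):
        result[index] = position
    return result


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def toy_model(k, alpha, beta):
    """Hypothetical model output; replace with the team's own model."""
    return 5.0 * k + 2.0 * alpha - 1.0 * beta


def rank_parameters(model, ranges, n_samples=2000, seed=0):
    """Sample each parameter uniformly, then sort by |rank correlation|."""
    rng = random.Random(seed)
    samples = {name: [rng.uniform(low, high) for _ in range(n_samples)]
               for name, (low, high) in ranges.items()}
    outputs = [model(**{name: samples[name][i] for name in ranges})
               for i in range(n_samples)]
    scores = {name: spearman(samples[name], outputs) for name in ranges}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


ranges = {"k": (0.0, 1.0), "alpha": (0.0, 1.0), "beta": (0.0, 1.0)}
for name, rho in rank_parameters(toy_model, ranges):
    print(f"{name:>5}: {rho:+.2f}")
```

The sorted list is exactly what a tornado plot draws: one bar per parameter, longest first, with the sign showing the direction of influence.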
| Approach | ✗ Throwaway | ✓ Outstanding-style |
|---|---|---|
| Method | "We varied each parameter by ±10 % and observed that the model output changed by less than 5 %." | "We computed partial rank correlation coefficients (PRCC) over 5,000 Latin Hypercube samples across the parameter ranges in Table 5. Top three drivers: k (PRCC = 0.71), α (0.43), β (−0.31)." |
| Display | A 3×3 table of numbers, no plot. | A tornado plot (Figure 8) with parameters sorted by influence, plus a 2D contour of the recommendation region in the (k, α) plane. |
| Interpretation | "The model is robust." | "The recommendation is robust to α and β but inverts at k > 0.4 — see Figure 9. Since k is measurable in the field to ±0.05, we recommend a pre-deployment field test of k as part of the rollout plan." |
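The interpretation row above depends on locating where the recommendation inverts. A minimal sketch of that sweep follows; the two payoff functions, the option names, and the grid are illustrative assumptions chosen so that the flip lands near k = 0.4, and none of them come from a contest paper.

```python
def payoff_intervene(k):
    """Hypothetical net benefit of option A; falls as k grows."""
    return 10.0 - 20.0 * k


def payoff_wait(k):
    """Hypothetical net benefit of option B; independent of k."""
    return 2.0


def recommend(k):
    return "intervene" if payoff_intervene(k) > payoff_wait(k) else "wait"


def find_flips(k_min, k_max, steps=200):
    """Sweep k on a uniform grid; return midpoints where the choice changes."""
    grid = [k_min + (k_max - k_min) * i / steps for i in range(steps + 1)]
    flips = []
    for left, right in zip(grid, grid[1:]):
        if recommend(left) != recommend(right):
            flips.append((0.5 * (left + right), recommend(left), recommend(right)))
    return flips


for k_flip, before, after in find_flips(0.0, 1.0):
    print(f"recommendation changes from {before} to {after} near k = {k_flip:.3f}")
```

Sweeping the full plausible range, rather than a ±10 % window around baseline, is what makes the flip visible at all. If the sweep returns an empty list, that is itself a finding worth stating.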
6. Validation strategy
A model that is internally consistent can still be wrong about the world. Validation — showing that the model reproduces something already known — is what separates a working model from a believable one. The judges' commentary on the 2019 HiMCM Problem B ("Hashtag Effectiveness") explicitly named validation as the most common gap between Finalist and Outstanding: many Finalist papers built a plausible model but never tested it against real data.
Three validation moves that show up in Outstanding papers:
- Known-answer cases. Plug in a scenario whose answer is already known (a previous year's data, a textbook problem, a degenerate limit) and confirm the model recovers it. Example: for an SIR-style epidemic model, set β = 0 and confirm the susceptible population stays constant; set γ = 0 and confirm everyone eventually gets infected. Both checks are one line of code.
- Hold-out comparison. Fit the model on one slice of the data, predict on a held-out slice, report the error. Even a 70/30 split with a single error metric (RMSE, MAPE) is enough to demonstrate the practice. The 2024 HiMCM Problem A Outstanding papers cited in UMAP Journal 45.2 consistently held out 2023 Wordle data for validation after fitting on 2022.
- Cross-method comparison. Solve the same problem two different ways (closed-form vs. simulation; ODE integration vs. Markov chain) and confirm the answers agree. This is the single highest-credibility validation move and is cheap when the problem allows it.
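The two SIR known-answer checks from the list above can be sketched as follows. The assumptions are mine: a frequency-dependent SIR form (new infections = β·S·I/N), a hand-written fixed-step RK4 integrator standing in for whatever solver the paper actually uses, and arbitrary initial conditions. Once a simulate function exists, each check is a single assertion.

```python
def sir_rhs(state, beta, gamma):
    """Right-hand side of the SIR equations for state = (S, I, R)."""
    s, i, r = state
    n = s + i + r
    new_infections = beta * s * i / n
    recoveries = gamma * i
    return (-new_infections, new_infections - recoveries, recoveries)


def rk4_step(state, dt, beta, gamma):
    """One classical fourth-order Runge-Kutta step."""
    def shifted(base, slope, h):
        return tuple(b + h * k for b, k in zip(base, slope))

    k1 = sir_rhs(state, beta, gamma)
    k2 = sir_rhs(shifted(state, k1, dt / 2), beta, gamma)
    k3 = sir_rhs(shifted(state, k2, dt / 2), beta, gamma)
    k4 = sir_rhs(shifted(state, k3, dt), beta, gamma)
    return tuple(
        x + dt / 6 * (a + 2 * b + 2 * c + d)
        for x, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def simulate(s0, i0, r0, beta, gamma, t_end=100.0, dt=0.1):
    """Integrate from t = 0 to t_end and return the final (S, I, R)."""
    state = (float(s0), float(i0), float(r0))
    for _ in range(round(t_end / dt)):
        state = rk4_step(state, dt, beta, gamma)
    return state


# Known-answer case 1: beta = 0 means no transmission, so S never changes.
s_final, _, _ = simulate(999, 1, 0, beta=0.0, gamma=0.1)
assert abs(s_final - 999) < 1e-9

# Known-answer case 2: gamma = 0 means no recovery, so S drains to zero.
s_final, i_final, _ = simulate(999, 1, 0, beta=0.5, gamma=0.0)
assert s_final < 1e-3 and abs(i_final - 1000) < 1e-3
```

A third cheap check is conservation: S + I + R should equal the initial population at every step, and a drift there usually points to a sign error in one of the equations.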
The 2021 Problem A ("Storing the Sun") Outstanding commentary in UMAP Journal 42.4 highlighted that the strongest papers validated against NREL's published utility-scale storage benchmark, reported a within-factor-of-two match, and then explained the residual gap rather than hiding it. The willingness to report a 2× discrepancy with an explanation beats a fabricated perfect match.
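The hold-out comparison from the list of validation moves takes only a few lines. The sketch below fits a straight line by ordinary least squares on the first 70 % of a series and reports RMSE on the remaining 30 %. The data are synthetic and the linear model is a placeholder for whatever the team actually fits; only the split-fit-score pattern is the point.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rmse(predicted, observed):
    """Root-mean-square error between two equal-length sequences."""
    n = len(observed)
    return (sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n) ** 0.5


# Synthetic series (illustrative only): a linear trend plus Gaussian noise.
rng = random.Random(42)
xs = [float(t) for t in range(100)]
ys = [3.0 + 0.5 * x + rng.gauss(0.0, 1.0) for x in xs]

split = len(xs) * 7 // 10            # chronological 70/30 split, no shuffling
train_x, test_x = xs[:split], xs[split:]
train_y, test_y = ys[:split], ys[split:]

intercept, slope = fit_line(train_x, train_y)
holdout_error = rmse([intercept + slope * x for x in test_x], test_y)
print(f"slope = {slope:.3f}, hold-out RMSE = {holdout_error:.3f}")
```

For time-ordered data the split is kept chronological so the model is never scored on points that precede its training data. Reporting the hold-out error next to the training error shows the judge whether the fit generalises.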
7. Writing voice
Writing style is the smallest of the categories but it compounds with every other one. The differences between Outstanding and Meritorious prose are usually not grand — they are small, consistent micro-habits.
- Present tense for the model, past tense for what was done. "The model assumes" (present), "We fit the parameters using least squares" (past). Mixing tenses within a section is the most common prose flag in COMAP commentaries.
- "We" is fine; passive voice is fine; but pick one. "We computed" and "The parameters were computed" are both acceptable; alternating between them in adjacent sentences reads as drafted-by-committee.
- Math typography. Variables italicised (k), units in upright Roman (kg, m/s), function names in upright (sin, log), single letter for scalars, bold for vectors. LaTeX makes this automatic; Word users have to be deliberate.
- Figure captions that stand alone. A judge skimming should understand a figure from the caption without finding the body text. "Figure 4: Response of efficiency η to storage capacity S, holding all other parameters at baseline (Table 4). The dashed line marks the recommended operating point (S = 12.4 kWh)."
- No "in this paper, we…" repeated every section. The reader knows it is your paper. Outstanding papers say what they did, not that they are about to say it.
- Numbers with units, every time. "12.4" is not a result; "12.4 kWh" is.
- One idea per paragraph. When in doubt, break the paragraph.
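For teams writing in LaTeX, the typography rules in the list above look roughly like this in source. The snippet is a sketch using only base LaTeX commands; the symbols and values are illustrative and borrow the S and P notation from the earlier examples.

```latex
% Variables are italic by default in math mode; units and named functions
% are set upright; vectors are bold. The thin space \, separates number and unit.
The storage capacity is $S = 12.4\,\mathrm{kWh}$ and the flow speed is
$v = 3.2\,\mathrm{m/s}$. The load follows $L(t) = L_0 \sin(\omega t)$,
and the state vector is $\mathbf{x} = (S, P)$.
```

Typing sin instead of \sin inside math mode is the most common slip: it sets the name as three italic variables multiplied together.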
8. Common Outstanding-paper sins (anti-patterns)
These are patterns that show up in otherwise-strong papers and drop them out of Outstanding.
- The TOC-summary. The Summary Sheet that reads "Section 2 introduces the model, Section 3 presents the analysis, Section 4 discusses sensitivity." A judge cannot extract any finding from it. If your summary uses the word "introduces," rewrite it.
- Code in appendix that doesn't match the text. The body says "we used a genetic algorithm"; the appendix lists a hill-climbing loop. Or the body cites Equation (7); the code implements a different equation. A judge who flips to the appendix and finds a mismatch downgrades the paper severely.
- Padded references. Twelve references where four are actually cited. Or Wikipedia entries listed alongside journal articles with no distinction. Outstanding papers cite four to ten sources, all of them used at point of need, with primary sources (journal papers, government datasets, textbooks) outnumbering web references.
- Sensitivity that confirms what you already wanted. If every parameter sweep ends with "the model is robust," the section reads as cover. Outstanding sensitivity sections find a regime where the recommendation changes; if there isn't one, say so explicitly and explain why.
- The non-technical letter that's just the summary. Copy-pasting the abstract into a letter to the school board is a flag the team ran out of time. A real letter strips jargon, uses concrete examples for the named audience, and is one page.
- "We could improve this with more data." A throwaway weakness. Specific weaknesses name the parameter, the data source that would resolve it, and the magnitude of the current uncertainty.
- Figures without captions, or captions without figure numbers. A judge looking at Figure 6 should not have to scroll up to find out what it shows.
- Skipping the AI usage report. Current COMAP rules require disclosure of AI tool use; the report sits outside the 25-page count. A missing report is a compliance flag that ends the discussion.
- Anonymity slips. A school name in the PDF metadata, an author's name in a figure footer. Always print to PDF and re-open to check properties before submitting.
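The padded-references check from the list above is easy to script for LaTeX sources. The sketch below assumes a hand-written thebibliography environment with \bibitem entries and standard \cite-family commands. The function names and the draft text are hypothetical, and a BibTeX or biblatex workflow would need a different parser.

```python
import re

# \cite, \citep, \citet, ... with optional [..] arguments and a {key,key} list.
CITE = re.compile(r"\\cite[a-zA-Z]*\*?(?:\[[^\]]*\])*\{([^}]*)\}")
BIBITEM = re.compile(r"\\bibitem(?:\[[^\]]*\])?\{([^}]*)\}")


def cited_keys(latex_source):
    """All keys that appear inside any \\cite-style command."""
    keys = set()
    for group in CITE.findall(latex_source):
        keys.update(key.strip() for key in group.split(",") if key.strip())
    return keys


def uncited_references(latex_source):
    """Bibliography keys that no \\cite command ever mentions."""
    declared = {key.strip() for key in BIBITEM.findall(latex_source)}
    return declared - cited_keys(latex_source)


# Hypothetical draft: three bibliography entries, two of them cited.
draft = r"""
The saturating form follows \cite{anderson1991}, calibrated
against tract data \cite[Table 2]{census2020}.
\begin{thebibliography}{9}
\bibitem{anderson1991} ...
\bibitem{census2020} ...
\bibitem{wikipedia_growth} ...
\end{thebibliography}
"""

print(sorted(uncited_references(draft)))  # ['wikipedia_growth']
```

Every key the script returns should either be cited at the point of need or removed from the reference list.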
9. The Outstanding paper checklist
Tick the boxes that apply to your current draft; aim for 25+ of 27 by the Day 13 review.
10. Where to find real Outstanding papers
- COMAP's Outstanding Papers shop: comap.com publishes selected Outstanding papers each year, sometimes free, sometimes as purchasable PDFs. Search the contest year and problem letter.
- The UMAP Journal: quarterly journal from COMAP. The summer/fall issues each year carry judges' commentaries on the MCM, ICM, and HiMCM contests. Available through MAA institutional subscriptions and many university libraries.
- "Mathematical Modeling for the MCM/ICM Contests": multi-volume anthology (Chiang & Klein, eds.) reprinting Outstanding papers with judge commentary. Volumes 1–4 cover 2010–2022.
- MAA archives: the Mathematical Association of America hosts some legacy commentary and historical Outstanding paper lists at maa.org.
- This site's worked sample: see problems/2024-a-sample-paper.html for a 2024-A walkthrough in Outstanding-style structure.
- Self-grade: once you have a draft, run it through himcm/rubric.html and aim for ≥ 90 % of the weighted total.