
Benchmarks · the atlas

How models actually get measured.

Ten benchmarks every AI builder should be able to read skeptically. What each measures, what it doesn't measure, and how to interpret a 2026-era leaderboard without getting suckered by saturation or contamination.

MMLU

Massive Multitask Language Understanding

2020-09 · Hendrycks et al.

57 academic subjects, multiple-choice questions ranging from elementary to professional difficulty (history, biology, math, philosophy, etc.). It was the field's general-knowledge benchmark of record from 2020 to 2024. It is now saturated: frontier models score 85-92%+, close to the ceiling imposed by errors in the test's own answer key. Successor: MMLU-Pro.

What it measures: Breadth of academic knowledge across many disciplines.

What it doesn't: Reasoning, multi-step thinking, or generation quality. MMLU is recognition, not creation.

MMLU-Pro

2024-06 · Wang et al. (Waterloo + Vector Institute)

More challenging variant of MMLU with 10 answer choices instead of 4, and a deliberate emphasis on reasoning rather than recall. Frontier models score 70-80%+. Still discriminates between current top-tier models.

What it measures: Reasoning under expanded answer space. Less saturated than MMLU.

What it doesn't: Long-form reasoning, agentic tasks, real-world utility.
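One way to see why the jump from 4 to 10 answer choices matters is to correct raw accuracy for guessing. The sketch below is illustrative only: `chance_corrected` is a hypothetical helper, not part of any MMLU or MMLU-Pro tooling, and it assumes every item has the same number of options.

```python
def chance_corrected(accuracy: float, num_choices: int) -> float:
    """Rescale accuracy so random guessing maps to 0.0 and a perfect score to 1.0."""
    if num_choices < 2:
        raise ValueError("need at least two answer choices")
    chance = 1.0 / num_choices  # expected accuracy of uniform random guessing
    return (accuracy - chance) / (1.0 - chance)


# The same raw 70% carries more signal when guessing is worth 10% instead of 25%.
print(round(chance_corrected(0.70, 4), 3))   # 4-option, MMLU-style items
print(round(chance_corrected(0.70, 10), 3))  # 10-option, MMLU-Pro-style items
```

With four options a quarter of the score is available by guessing alone; with ten options that floor drops to a tenth, so equal raw scores are not equally informative across the two tests.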

GPQA Diamond

Google-Proof Q&A Diamond split

2023-11 · Rein et al. (NYU + Anthropic)

Graduate-level physics, biology, and chemistry questions written by domain experts and stress-tested to be hard even for skilled non-experts with web access. The 'Diamond' split is the hardest 198 questions. In the original paper, non-experts with unrestricted web access scored ~34% and PhD-level domain experts ~65%. Frontier reasoning models had already passed 80% during 2025, so the remaining headroom is shrinking. Still the go-to scientific-reasoning benchmark.

What it measures: Real expert-level scientific reasoning. Resistant to lookup.

What it doesn't: General knowledge, creativity, or multimodal understanding.

HumanEval + MBPP

HumanEval / Mostly Basic Python Problems

2021 · Chen et al. (OpenAI) + Austin et al. (Google)

Short Python programming tasks (1-30 lines) tested with unit tests. The original LLM code-generation benchmark. Currently very saturated (95%+ on frontier models). Useful only as a baseline floor.

What it measures: Basic Python correctness on small tasks.

What it doesn't: Real software engineering. Passing HumanEval's 164 short problems says little about building production systems.
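Scores on these benchmarks are usually reported as pass@k: the probability that at least one of k sampled solutions passes the unit tests. The text above does not name the metric, so treat this as background. The sketch follows the unbiased estimator described in the HumanEval paper (Chen et al.); the function name and argument checks are my own.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: samples generated, c: samples that passed the tests, k: sample budget.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset of the n samples contains no pass.
    # comb(n - c, k) is 0 when fewer than k samples failed.
    all_fail = comb(n - c, k) / comb(n, k)
    return 1.0 - all_fail


# Example: 200 samples for one problem, 30 of them correct.
print(pass_at_k(200, 30, 1))
print(pass_at_k(200, 30, 10))
```

The benchmark score is the mean of this value over all problems. Note that pass@10 can look far better than pass@1 for the same model, so check which k a leaderboard reports.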

SWE-bench Verified

2023-10 (orig) · 2024-08 (Verified) · Jimenez/Yang et al. (Princeton); Verified subset released with OpenAI

Real GitHub issues from popular Python repositories, each resolved by a code change (some spanning multiple files) that must make the repository's tests pass. The original set has 2,294 issues from 12 repositories; Verified is the 500-issue subset that human annotators checked for clear problem statements and valid tests. This is the canonical agentic-software-engineering benchmark. Frontier agentic systems hit ~75-85% in mid-2026.

What it measures: Real-world software engineering on actual GitHub issues with executable tests.

What it doesn't: Frontend, design, code quality, security. The test passing != the change being good.
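To make "executable tests" concrete: as I understand the SWE-bench setup, an issue counts as resolved when the tests that failed before the fix now pass and the tests that passed before still pass. The sketch below models only that decision rule. `is_resolved` and the test IDs are hypothetical, and the real harness (environment setup, patch application, log parsing) is not reproduced here.

```python
def is_resolved(
    results: dict[str, bool],
    fail_to_pass: list[str],
    pass_to_pass: list[str],
) -> bool:
    """Decide whether a candidate patch resolves an issue.

    results maps test ID -> True if the test passed after applying the patch.
    A test missing from results is treated as a failure.
    """
    fixed = all(results.get(test, False) for test in fail_to_pass)
    unbroken = all(results.get(test, False) for test in pass_to_pass)
    return fixed and unbroken


after_patch = {"test_issue_repro": True, "test_existing_behavior": True}
print(is_resolved(after_patch, ["test_issue_repro"], ["test_existing_behavior"]))
```

Nothing in this rule looks at the diff itself, which is why a patch can count as resolved while still being hard to maintain or insecure.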

MMMU

Massive Multi-discipline Multimodal Understanding

2023-11 · Yue et al. (CMU + Toronto + multiple)

11,500 college-exam-level questions across 30 subjects, with images required (diagrams, charts, photos). The canonical multimodal benchmark. Frontier multimodal models score 60-75%.

What it measures: Reasoning over text + images simultaneously across diverse subjects.

What it doesn't: Pure-text reasoning, video, or audio understanding.

AIME

American Invitational Mathematics Examination

1983-present · MAA

Real high-school math competition exam (15 questions, 3 hours, integer answers 0-999). Adopted by AI evaluators ~2023 as a hard math benchmark. The o1/o3/DeepSeek-R1 reasoning era is largely measured on AIME because it forces multi-step computation that resists shortcut-via-pattern-matching. Frontier reasoning models score 70-95%+.

What it measures: Multi-step quantitative reasoning, geometry, number theory, combinatorics.

What it doesn't: Open-ended problem-solving. AIME answers are integers, a constraint that allows objective scoring but is unrepresentative of open-ended mathematical work.
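With only 15 questions per exam, a single AIME score is a noisy estimate. A minimal sketch of that uncertainty uses the standard Wilson score interval for a binomial proportion. It assumes questions are independent and equally hard, which is not really true of AIME, so read the result as a rough lower bound on the noise.

```python
from math import sqrt


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% confidence interval for accuracy on a small test set."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("require total > 0 and 0 <= correct <= total")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half_width, center + half_width


# A model that solves 12 of 15 questions on one exam.
low, high = wilson_interval(12, 15)
print(f"{low:.2f} to {high:.2f}")
```

For 12 of 15 the interval spans tens of percentage points, so a gap of one or two questions between models on a single exam is not strong evidence of a real difference.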

LMSYS Chatbot Arena

Chatbot Arena (lmarena.ai)

2023-05 · LMSYS (UC Berkeley)

Crowdsourced pairwise human-preference voting. Two anonymous model responses to a user prompt, user picks the better one. Elo-style rating system. The most influential public ranking system. Has known biases (style favored over correctness; verbose answers favored over concise) but still the most-cited public number for 'which model do humans prefer.'

What it measures: Aggregate human preference under typical chat conditions.

What it doesn't: Correctness. Highly-rated models can still be wrong; verbose-but-confident wins more often than terse-but-correct.
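The "Elo-style" rating can be illustrated with the textbook Elo update for one pairwise vote. This is a sketch of the general technique, not the Arena's actual pipeline, which may fit ratings differently (for example, with a Bradley-Terry model over all votes). The K-factor of 32 and the 400-point scale are conventional chess defaults, assumed here.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(
    rating_a: float, rating_b: float, a_won: bool, k: float = 32.0
) -> tuple[float, float]:
    """Return updated (rating_a, rating_b) after one pairwise vote."""
    outcome = 1.0 if a_won else 0.0
    delta = k * (outcome - expected_score(rating_a, rating_b))
    # Ratings are zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# Two equally rated models; the voter prefers A.
print(elo_update(1000.0, 1000.0, a_won=True))
```

The update only sees which response the voter picked. It has no notion of whether the preferred answer was correct, which is how style and length biases end up in the rating.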

BIG-bench / HELM

Beyond the Imitation Game + Holistic Evaluation of Language Models

2022 · Google-led BIG-bench collaboration (400+ authors, 200+ tasks) and Stanford CRFM (HELM)

Comprehensive benchmark suites that aggregate dozens-to-hundreds of sub-benchmarks across reasoning, knowledge, safety, bias, fairness, robustness. More academic than practical — useful for thorough evals but harder to summarize in a single number. HELM is the active Stanford CRFM project; BIG-bench is largely retired.

What it measures: Breadth + thoroughness across many evaluation dimensions.

What it doesn't: Practical 'which model should I use' guidance. The summary score is too aggregated to be useful.

ARC-AGI

Abstraction and Reasoning Corpus

2019 · François Chollet

Visual pattern-recognition puzzles designed to be easy for humans + hard for current ML. Chollet's argument: ML systems score poorly on tasks requiring fluid abstract reasoning over patterns they have never seen. OpenAI o3 hit 87.5% on the ARC-AGI-1 semi-private evaluation set in Dec 2024 using a high-compute, test-time-scaling configuration. ARC-AGI-2 (2025) was introduced as a harder follow-up. ARC Prize 2024 + 2025 are public competitions.

What it measures: Fluid abstract reasoning on novel visual patterns.

What it doesn't: Real-world utility. ARC tasks are deliberately weird; high scores there don't immediately transfer to practical applications.

How to read a leaderboard skeptically.

01
Saturated benchmarks tell you nothing. If everyone scores 95%+, the benchmark has stopped discriminating. MMLU and HumanEval are in this state.
02
Different benchmarks favor different model strategies. Extended-reasoning (long chain-of-thought) models do well on AIME + GPQA + ARC-AGI. Knowledge-dense models do well on MMLU + MMMU. Code-tuned models do well on HumanEval + SWE-bench. A single ranking number across all of these is misleading.
03
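Test set contamination is real. Many frontier models have seen most public benchmarks during pretraining. Privately held evals (FrontierMath, ARC-AGI's private evaluation set) try to mitigate this. SWE-bench Verified is human-validated but drawn from public repositories, so it is not contamination-proof. The field knows contamination affects published numbers.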
04
Verbose models win Arena. LMSYS-style human-preference voting consistently favors longer, more confident-sounding responses — even when shorter responses are correct and verbose ones are wrong. Don't read Arena Elo as 'best model.'
05
Real-world utility is hard to measure. SWE-bench is the closest to 'agentic real-world work,' but even it tests against unit-test passing, not code quality or maintainability. Real evaluation still requires using the model on YOUR task and judging the output.
06
Inference-time-scaling changes economics. o1-style and R1-style models can score very well on hard benchmarks (GPQA, AIME, ARC-AGI) by spending far more inference compute per query, often 10× or more than a standard model. The benchmark score is real but doesn't reflect the cost-per-task tradeoff.
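The cost-per-task tradeoff in point 06 can be made explicit by dividing the cost of one query by the probability that it succeeds. Every number below is hypothetical (token counts, prices, and accuracies are made up for illustration), and the sketch assumes failed attempts cost as much as successful ones.

```python
def cost_per_solved_task(
    accuracy: float, tokens_per_query: float, usd_per_million_tokens: float
) -> float:
    """Expected spend per correctly solved task, in USD."""
    if not 0.0 < accuracy <= 1.0:
        raise ValueError("accuracy must be in (0, 1]")
    cost_per_query = tokens_per_query / 1_000_000 * usd_per_million_tokens
    # Failed attempts still cost money, so divide by the success rate.
    return cost_per_query / accuracy


# Hypothetical comparison: a standard model vs. a reasoning model that uses 10x the tokens.
standard = cost_per_solved_task(accuracy=0.60, tokens_per_query=2_000, usd_per_million_tokens=10.0)
reasoning = cost_per_solved_task(accuracy=0.85, tokens_per_query=20_000, usd_per_million_tokens=10.0)
print(f"standard:  ${standard:.4f} per solved task")
print(f"reasoning: ${reasoning:.4f} per solved task")
```

Under these made-up numbers the reasoning model solves more tasks but costs several times more per solved task. Whether that is worth it depends on what a wrong answer costs you, which no leaderboard reports.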
