Massive Multitask Language Understanding
2020-09 · Hendrycks et al.
57 academic subjects, four-option multiple-choice questions ranging from elementary to professional difficulty (history, biology, math, philosophy, etc.). Was the field's general-knowledge benchmark of record from 2020 to 2024. Currently saturated: frontier models score 85-92%+, often hitting the ceiling set by the test's own mislabeled and ambiguous questions. Successor: MMLU-Pro.
What it measures: Breadth of academic knowledge across many disciplines.
What it doesn't: Reasoning, multi-step thinking, or generation quality. MMLU is recognition, not creation.
MMLU-Pro
2024-06 · Wang et al. (Waterloo + Vector Institute)
More challenging variant of MMLU with 10 answer choices instead of 4, and a deliberate emphasis on reasoning rather than recall. Frontier models score 70-80%+. Still discriminates between current top-tier models.
What it measures: Reasoning under expanded answer space. Less saturated than MMLU.
What it doesn't: Long-form reasoning, agentic tasks, real-world utility.
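Moving from 4 to 10 answer choices also lowers the random-guess floor from 25% to 10%, so raw MMLU and MMLU-Pro scores are not directly comparable. A minimal sketch of chance-adjusted accuracy, assuming uniformly random guessing and the same number of options on every question; the function name `chance_adjusted` is illustrative and not part of either benchmark's tooling:

```python
def chance_adjusted(accuracy: float, num_choices: int) -> float:
    """Rescale accuracy so random guessing maps to 0.0 and a perfect score to 1.0.

    Assumes every question has exactly `num_choices` options and that a
    guesser picks uniformly at random (both are simplifications).
    """
    if num_choices < 2:
        raise ValueError("need at least two answer choices")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be a fraction between 0 and 1")
    chance = 1.0 / num_choices
    return (accuracy - chance) / (1.0 - chance)


# The same raw score means different things on the two tests.
raw = 0.55
on_four_choices = chance_adjusted(raw, 4)   # floor is 0.25
on_ten_choices = chance_adjusted(raw, 10)   # floor is 0.10
```

Under these assumptions, a raw 55% sits further above chance on a 10-choice test than on a 4-choice one.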
Graduate-Level Google-Proof Q&A, Diamond split
2023-11 · Rein et al. (NYU + Anthropic)
Graduate-level physics, biology, and chemistry questions written by domain experts and validated to stay hard for skilled non-experts even with unrestricted web access. The 'Diamond' split is the 198 highest-quality questions: the ones experts answer correctly and non-experts mostly miss. Mid-2026 leaders score 50-65%; non-experts with Google score ~34%. The current go-to scientific-reasoning benchmark.
What it measures: Real expert-level scientific reasoning. Resistant to lookup.
What it doesn't: General knowledge, creativity, or multi-modal.
HumanEval / Mostly Basic Python Problems
2021 · Chen et al. (OpenAI) + Austin et al. (Google)
Short Python programming tasks (1-30 lines) tested with unit tests. The original LLM code-generation benchmark. Currently very saturated (95%+ on frontier models). Useful only as a baseline floor.
What it measures: Basic Python correctness on small tasks.
What it doesn't: Real software engineering. Solving HumanEval's 164 toy problems says little about building production systems.
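Scores on these benchmarks are usually reported as pass@k: the chance that at least one of k sampled solutions passes the unit tests. Chen et al. give an unbiased estimator for it, 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them pass. A minimal sketch of that formula follows; the function name and the per-problem counts in the usage lines are illustrative assumptions:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed all tests, k: budget.
    Returns the probability that a random size-k subset of the n samples
    contains at least one passing sample.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) is 0 when fewer than k samples failed,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Benchmark score = mean of the per-problem estimates.
# The (n, c) pairs below are made-up illustration data.
per_problem = [(10, 5), (10, 0), (10, 10)]
pass_at_1 = sum(pass_at_k(n, c, 1) for n, c in per_problem) / len(per_problem)
```

With k = 1 the estimator reduces to the plain fraction of passing samples, c / n.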
SWE-bench Verified
2023-10 (orig) · 2024-08 (Verified) · Jimenez/Yang et al. (Princeton) + OpenAI (Verified)
Real GitHub issues from 12 popular Python repositories, each resolved by a code change that must make the issue's failing tests pass. The original set has 2,294 issues; Verified is a 500-issue subset that human annotators screened for well-specified issues and correct tests. This is the canonical agentic-software-engineering benchmark. Frontier agentic systems hit ~75-85% in mid-2026.
What it measures: Real-world software engineering on actual GitHub issues with executable tests.
What it doesn't: Frontend, design, code quality, security. The test passing != the change being good.
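The pass criterion can be sketched in a few lines. SWE-bench tasks list FAIL_TO_PASS tests (failing before the fix, expected to pass after) and PASS_TO_PASS tests (passing before, expected to keep passing); a task counts as resolved only when both groups pass. The sketch below assumes test outcomes arrive as a simple name-to-bool mapping, which simplifies the real harness, and the test names are hypothetical:

```python
from __future__ import annotations


def is_resolved(
    test_results: dict[str, bool],
    fail_to_pass: list[str],
    pass_to_pass: list[str],
) -> bool:
    """Return True when the patch fixes the issue without breaking anything.

    A test missing from `test_results` (for example, because it never ran)
    is treated as a failure.
    """
    fixed = all(test_results.get(name, False) for name in fail_to_pass)
    unbroken = all(test_results.get(name, False) for name in pass_to_pass)
    return fixed and unbroken


def resolve_rate(outcomes: list[bool]) -> float:
    """Fraction of tasks resolved; 0.0 for an empty list."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


# Hypothetical task: the patch fixes the bug but breaks an existing test.
results = {"test_bug_repro": True, "test_existing_a": True, "test_existing_b": False}
ok = is_resolved(results, ["test_bug_repro"], ["test_existing_a", "test_existing_b"])
```

Nothing in this check inspects the patch itself, which is why a passing score says nothing about code quality or security.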
Massive Multi-discipline Multimodal Understanding
2023-11 · Yue et al. (CMU + Toronto + multiple)
11,500 college-exam-level questions across 30 subjects, with images required (diagrams, charts, photos). The canonical multimodal benchmark. Frontier multimodal models score 60-75%.
What it measures: Reasoning over text + images simultaneously across diverse subjects.
What it doesn't: Pure-text reasoning, video, or audio understanding.
American Invitational Mathematics Examination
1983-present · MAA
Real invitational high-school math competition exam (15 questions, 3 hours, integer answers 000-999). Adopted by AI evaluators ~2023 as a hard math benchmark. The o1/o3/DeepSeek-R1 reasoning era is largely measured on AIME because it forces multi-step computation that resists shortcut-via-pattern-matching. Frontier reasoning models score 70-95%+.
What it measures: Multi-step quantitative reasoning, geometry, number theory, combinatorics.
What it doesn't: Open-ended problem-solving. AIME answers are integers, a constraint that allows objective scoring but is unrepresentative of mathematical work that ends in a proof or an argument rather than a number.
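The integer-answer format is what makes automatic grading trivial. A minimal exact-match scorer, assuming the model's final answer has already been extracted as a string (extraction is the hard part in practice and is not shown); the function names and sample data are illustrative:

```python
from __future__ import annotations


def parse_answer(text: str) -> int | None:
    """Parse an AIME-style answer: an integer from 0 to 999, leading zeros allowed."""
    text = text.strip()
    if not (text.isascii() and text.isdigit()):
        return None
    value = int(text)
    return value if 0 <= value <= 999 else None


def score_exam(predictions: list[str], answer_key: list[int]) -> int:
    """Count exact matches; unparseable or out-of-range answers score zero."""
    if len(predictions) != len(answer_key):
        raise ValueError("one prediction per question is required")
    return sum(
        parse_answer(pred) == key for pred, key in zip(predictions, answer_key)
    )


# Made-up example: two correct, one out of range, one not a number.
correct = score_exam(["042", " 7 ", "1000", "abc"], [42, 7, 0, 5])
```

There is no partial credit here: a correct method with an arithmetic slip scores the same as a blank.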
Chatbot Arena (lmarena.ai)
2023-05 · LMSYS (UC Berkeley)
Crowdsourced pairwise human-preference voting. Two anonymous model responses to a user prompt, user picks the better one. Elo-style rating system. The most influential public ranking system. Has known biases (style favored over correctness; verbose answers favored over concise) but still the most-cited public number for 'which model do humans prefer.'
What it measures: Aggregate human preference under typical chat conditions.
What it doesn't: Correctness. Highly-rated models can still be wrong; verbose-but-confident wins more often than terse-but-correct.
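How a single vote moves an Elo-style rating can be sketched as follows. This is the textbook Elo update, not the leaderboard's actual fitting procedure, which may differ; the K-factor of 32 and the 400-point scale are conventional chess defaults used here as assumptions:

```python
from __future__ import annotations


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    rating_a: float, rating_b: float, score_a: float, k: float = 32.0
) -> tuple[float, float]:
    """Apply one pairwise vote.

    score_a is 1.0 if the voter preferred A, 0.0 if B, 0.5 for a tie.
    The points A gains are the points B loses.
    """
    if score_a not in (0.0, 0.5, 1.0):
        raise ValueError("score_a must be 0.0, 0.5, or 1.0")
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Two equally rated models; the voter prefers A.
new_a, new_b = update_elo(1000.0, 1000.0, 1.0)
```

The update only sees which response the voter picked, so a preference for longer or more confident answers feeds into the rating exactly like a preference for correct ones.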
Beyond the Imitation Game + Holistic Evaluation of Language Models
2022 · BIG-bench collaboration (Google-led, 400+ authors) + Stanford CRFM (HELM)
Comprehensive benchmark suites that aggregate dozens-to-hundreds of sub-benchmarks across reasoning, knowledge, safety, bias, fairness, robustness. More academic than practical — useful for thorough evals but harder to summarize in a single number. HELM is the active Stanford CRFM project; BIG-bench is largely retired.
What it measures: Breadth + thoroughness across many evaluation dimensions.
What it doesn't: Practical 'which model should I use' guidance. The summary score is too aggregated to be useful.
Abstraction and Reasoning Corpus
2019 · François Chollet
Visual grid puzzles designed to be easy for humans + hard for current ML: each task gives a few input/output examples and asks for the output of a new input. Chollet's argument: current systems score poorly on tasks requiring fluid abstract reasoning. ARC-AGI-2 (2025) was introduced as a harder follow-up. OpenAI o3 hit 87.5% on ARC-AGI-1 in Dec 2024 in its high-compute configuration, via test-time-compute scaling. ARC Prize 2024 + 2025 are public competitions.
What it measures: Fluid abstract reasoning on novel visual patterns.
What it doesn't: Real-world utility. ARC tasks are deliberately weird; high scores there don't immediately transfer to practical applications.