The AI leaderboard, read honestly
Composite tracker · 8 benchmarks · what each measures, what it hides
The eight benchmarks at a glance
LMSYS Chatbot Arena Elo
MMLU — Massive Multitask Language Understanding
HumanEval and SWE-Bench Verified
HumanEval
Public · static · saturated
164 Python function-completion problems from OpenAI's 2021 Codex paper (arXiv:2107.03374). Each problem gives a function signature and a docstring; the model writes the body, which is then run against unit tests. As of June 2026 (best-effort figures), frontier code models score above 95% pass@1, so HumanEval is effectively saturated and no longer discriminates among the top tier.
HumanEval+
Public · static · still discriminates
EvalPlus (Liu et al., arXiv:2305.01210) adds about 80x more test cases per problem to expose code that passes the original sparse tests but fails on edge cases. Top models drop 5 to 15 points on HumanEval+ versus HumanEval. If you cite HumanEval, you should cite HumanEval+ alongside it.
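To make the gap concrete, here is a minimal sketch of how a sparse test suite can pass a buggy solution that a denser suite catches. The function and both test lists are hypothetical illustrations, not items from HumanEval or EvalPlus.

```python
def median_buggy(values):
    """Return the median of a non-empty list (hypothetical model output)."""
    ordered = sorted(values)
    # Bug: ignores the even-length case, where the median is the mean
    # of the two middle elements.
    return ordered[len(ordered) // 2]


def pass_rate(func, tests):
    """Fraction of (args, expected) pairs that func gets right."""
    passed = 0
    for args, expected in tests:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crash counts as a failure
    return passed / len(tests)


# Sparse suite: odd-length inputs only, so the bug never shows.
sparse_tests = [(([3, 1, 2],), 2), (([5],), 5), (([9, 7, 8, 1, 2],), 7)]

# Denser suite: the same tests plus even-length edge cases.
dense_tests = sparse_tests + [
    (([1, 2],), 1.5),
    (([4, 1, 3, 2],), 2.5),
    (([10, 20, 30, 40, 50, 60],), 35.0),
]

print(pass_rate(median_buggy, sparse_tests))  # 1.0
print(pass_rate(median_buggy, dense_tests))   # 0.5
```

The same code is "correct" under one suite and half-broken under the other, which is why the two scores belong side by side.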
SWE-Bench Verified
Public · periodic refresh · primary code benchmark in 2026
500 real GitHub issues from 12 popular Python repositories, human-verified by OpenAI in 2024 to remove ambiguous or under-specified problems. Models are given the repo and the issue and must produce a patch that passes the hidden test suite. As of June 2026 (best-effort figures), top scores are in the 60% to 75% range, which still leaves meaningful headroom.
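A rough sketch of what "passes the hidden test suite" means in practice, assuming the usual SWE-Bench split into tests the patch must turn green (fail-to-pass) and tests it must not break (pass-to-pass). The class, field names, and sample data below are ours for illustration; the official harness is the authority on the exact rules.

```python
from dataclasses import dataclass, field


@dataclass
class TaskResult:
    """Test outcomes for one issue after the model's patch is applied."""
    fail_to_pass: dict = field(default_factory=dict)  # test name -> passed?
    pass_to_pass: dict = field(default_factory=dict)  # test name -> passed?


def is_resolved(result: TaskResult) -> bool:
    """A task counts only if the fix works AND nothing else regresses."""
    fixed = all(result.fail_to_pass.values())
    unbroken = all(result.pass_to_pass.values())
    return fixed and unbroken


def resolve_rate(results) -> float:
    """Share of tasks fully resolved; there is no partial credit."""
    return sum(is_resolved(r) for r in results) / len(results)


# Hypothetical outcomes for four tasks.
results = [
    TaskResult({"test_fix": True}, {"test_a": True, "test_b": True}),   # resolved
    TaskResult({"test_fix": True}, {"test_a": False}),                  # fixed, but broke test_a
    TaskResult({"test_fix": False}, {"test_a": True}),                  # issue not fixed
    TaskResult({"test_fix_1": True, "test_fix_2": True}, {"test_a": True}),  # resolved
]

print(resolve_rate(results))  # 0.5
```

The all-or-nothing scoring is part of why the headline percentage is hard to inflate with partially working patches.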
SWE-Bench Multimodal and Multilingual
Public · expanding · low saturation
Extensions launched in late 2024 and 2025 that add JavaScript, Java, Rust, and image-grounded bug reports. These remain genuinely hard for current models and are where vendors compete for headline numbers in 2026.
GPQA, MATH, LiveBench, HellaSwag
GPQA — graduate-level Q&A
Diamond subset is the gold standard
Rein et al. 2023 (arXiv:2311.12022). 448 questions in physics, biology, and chemistry written by PhD-holders specifically to be Google-proof. The Diamond subset (198 questions) is the hardest. The 2023 paper reports that experts in the relevant field score around 65%, while skilled non-experts reach only about 34% even with unrestricted web access. Frontier models in 2026 are reportedly approaching or exceeding human-expert range on Diamond; verify on the official GPQA repo at github.com/idavidrein/gpqa.
MATH
Saturated · use AIME variants instead
Hendrycks et al. 2021 (arXiv:2103.03874). 12,500 competition mathematics problems (AMC, AIME, etc.) with worked solutions. As of June 2026 (best-effort figures), frontier reasoning-tuned models exceed 90% and the benchmark is largely saturated. MATH-500 (a curated subset) is what most papers report. For current discrimination, look at AIME-2024 and Putnam variants instead.
LiveBench
Refreshes monthly · the most contamination-resistant
White et al. 2024 (livebench.ai, paper at arXiv:2406.19314). Designed specifically to resist contamination by drawing on recent arXiv papers, recent news, and recent IMO problems, and by rotating questions monthly. Categories: reasoning, coding, mathematics, language, data analysis, instruction following. This is one of the few benchmarks that still moves meaningfully when a frontier model releases.
HellaSwag
Saturated · keep for historical comparison only
Zellers et al. 2019 (arXiv:1905.07830). Commonsense sentence completion with adversarially-generated distractors. Saturated above 95% across the frontier since 2023. Still cited because it ships in legacy evaluation harnesses, but it tells you nothing about modern model capability. Ignore it outside historical comparisons.
Top 5 by benchmark — June 2026 best-effort
We are deliberately not publishing a hardcoded top-5 list per benchmark on this page. Rankings change weekly as vendors release point updates, and any list we ship today will be wrong within two weeks. For the current state of each leaderboard, check the primary source: lmarena.ai for Chatbot Arena Elo, github.com/openai/swe-bench for SWE-Bench Verified, livebench.ai for LiveBench, the Vellum LLM Leaderboard at vellum.ai/llm-leaderboard for a composite view, and Artificial Analysis at artificialanalysis.ai for cost-adjusted intelligence scores. The Vellum and Artificial Analysis composites are updated continuously and are the cleanest single dashboards we have found in 2026.
The contamination problem, plainly
Arena Elo is not capability
Anthropic, OpenAI, and Google have all published acknowledgments that Chatbot Arena Elo correlates with user preference, not with capability on hard tasks. The 2024 paper by Boyeau et al. (arXiv:2406.12624) showed that a model fine-tuned for length and formatting can gain 30 to 50 Elo points without any capability change. The 2025 Cohere analysis of Arena (arXiv:2504.20879) further documented prerelease testing and selection bias. Treat Arena Elo as 'which model do people enjoy chatting with,' not 'which model gets harder problems right.' For the latter, look at SWE-Bench Verified, GPQA-Diamond, and LiveBench.
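It helps to translate Elo points into something tangible. The sketch below uses the textbook Elo logistic curve (scale factor 400) to convert a rating gap into an expected head-to-head preference rate. Treat it as an approximation: the Arena's actual rating fit may differ in detail, and the function name is ours.

```python
def expected_win_rate(elo_gap: float) -> float:
    """Expected score for the higher-rated model under the standard
    Elo logistic curve, given its rating advantage in points."""
    return 1.0 / (1.0 + 10.0 ** (-elo_gap / 400.0))


for gap in (30, 50):
    print(gap, round(expected_win_rate(gap), 3))
# 30 0.543
# 50 0.571
```

Under that assumption, a 30 to 50 point gain means users prefer the model in roughly 54% to 57% of pairwise votes instead of 50%. That is a real preference shift, but a small one, and it says nothing about whether the answers are more correct.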
Benchmarks that actually predict real-world utility
- SWE-Bench Verified — closest proxy to 'can this model close a real engineering ticket end-to-end.' Highly recommended if you care about agentic coding.
- LiveBench (current month) — least contaminated composite available; monthly refresh means scores genuinely move when capability moves.
- GPQA-Diamond — best public proxy for graduate-level reasoning in the physical sciences. Saturating but still informative.
- MMLU-Pro — replaces vanilla MMLU and discriminates at the frontier; 10-option multiple choice removes most of the lucky-guess floor.
- AIME 2024 and 2025 — fresh math competition problems with low contamination risk; use instead of saturated MATH.
- ARC-AGI v2 (Chollet et al.) — measures novel-problem reasoning; remains genuinely hard for frontier models in 2026; check arcprize.org for current state.
- Long-context retrieval evals (RULER, NoCha, FACTS Grounding) — measure faithfulness on 100K-plus contexts, where most production workloads actually live.
- Your own evaluation set — the only benchmark that perfectly predicts your real-world utility is one written on your real prompts. Hold out 50 to 200 examples from your actual use case. Score quarterly. Trust that number more than any leaderboard.
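If you do build your own evaluation set, remember that 50 to 200 examples give a noisy estimate. The sketch below computes a Wilson score interval for an accuracy measured on a small hold-out set; the function name and the sample counts are illustrative assumptions, not part of any benchmark's tooling.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96):
    """Approximate 95% Wilson score interval for a pass rate (z = 1.96)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need 0 <= correct <= total and total > 0")
    p = correct / total
    denom = 1.0 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# Same observed accuracy (80%) on a small and a larger hold-out set.
low_50, high_50 = wilson_interval(40, 50)
low_200, high_200 = wilson_interval(160, 200)
print(f"n=50:  {low_50:.2f} to {high_50:.2f}")
print(f"n=200: {low_200:.2f} to {high_200:.2f}")
```

At 80% observed accuracy the interval is roughly 67% to 89% with 50 examples and roughly 74% to 85% with 200. A quarter-over-quarter change smaller than that band is more likely noise than a real capability shift.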
How to read a vendor benchmark claim
Step 1
Check the benchmark name precisely
'MMLU 90.2' and 'MMLU-Pro 78.4' are completely different claims. Vendors will sometimes write just 'MMLU' without saying whether they mean the easier original benchmark or MMLU-Pro. Demand the exact dataset version.
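One reason the two numbers cannot be compared directly is the random-guess floor: 25% with four answer options, 10% with ten. The sketch below applies the classic correction-for-guessing rescaling, where 0 means chance and 1 means perfect. The function name is ours, and the inputs are the illustrative figures from the text, not real model scores.

```python
def chance_adjusted(accuracy: float, num_options: int) -> float:
    """Rescale multiple-choice accuracy so that random guessing maps to 0
    and a perfect score maps to 1."""
    if num_options < 2:
        raise ValueError("need at least two answer options")
    floor = 1.0 / num_options
    return (accuracy - floor) / (1.0 - floor)


mmlu = chance_adjusted(0.902, 4)       # original MMLU: 4 options
mmlu_pro = chance_adjusted(0.784, 10)  # MMLU-Pro: 10 options
print(round(mmlu, 3), round(mmlu_pro, 3))  # 0.869 0.76
```

The adjustment only removes the guessing floor. MMLU-Pro also uses harder questions, so even the adjusted numbers are not on a common scale.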
Step 2
Check whether the benchmark is contamination-controlled
If it is a benchmark released before 2023 and not refreshed, assume contamination. If it is LiveBench, GPQA-Diamond, SWE-Bench Verified, MMLU-Pro, or AIME-2024, contamination risk is bounded.
Step 3
Check the harness and the prompting setup
Pass@1 versus pass@10, chain-of-thought versus zero-shot, with-tools versus without-tools: any one of these choices can swing scores by 20 points. Demand identical evaluation harnesses across compared models.
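To see how much the pass@k setting alone matters, here is a sketch of the unbiased pass@k estimator described in the Codex paper (arXiv:2107.03374): generate n samples per problem, count the c that pass, and estimate the chance that at least one of k randomly drawn samples passes. The sample counts below are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k as 1 - C(n - c, k) / C(n, k).

    n: samples generated for the problem
    c: samples that passed all tests
    k: attempts the metric allows
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # too few failures to fill k draws, so one must pass
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 20 samples generated, 4 of them pass.
print(round(pass_at_k(20, 4, 1), 3))   # 0.2
print(round(pass_at_k(20, 4, 10), 3))  # 0.957
```

The same model output reads as 20% or 96% depending on k, so a score quoted without its k is not a usable number.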
Step 4
Check whether the score was self-reported or independent
Artificial Analysis, Vellum, and the Stanford HELM project run their own independent evaluations. Vendor-reported numbers should be treated as upper bounds until reproduced independently.
Step 5
Check whether it predicts your task
MMLU on medicine does not predict legal-contract drafting. SWE-Bench Verified does not predict creative writing. Match the benchmark to the use case, or run your own eval.
The minimum-effective-dose tracking stack
Sources
- [01]
MMLU was introduced by Hendrycks et al. in 2020 as a 57-subject multiple-choice benchmark.
arxiv.org/abs/2009.03300
- [02]
MMLU-Pro by Wang et al. (2024) addresses MMLU saturation with 10 answer choices and harder questions.
arxiv.org/abs/2406.01574
- [03]
HumanEval was introduced in OpenAI's 2021 Codex paper as 164 Python function-completion problems.
arxiv.org/abs/2107.03374
- [04]
EvalPlus (HumanEval+ and MBPP+) by Liu et al. exposes inflated HumanEval scores using ~80x more test cases.
arxiv.org/abs/2305.01210
- [05]
SWE-Bench Verified is OpenAI's 500-problem human-verified subset of SWE-Bench.
github.com/openai/swe-bench
- [06]
Original SWE-Bench paper by Jimenez et al. (2023) introduces real GitHub-issue resolution as a code benchmark.
arxiv.org/abs/2310.06770
- [07]
GPQA by Rein et al. (2023) provides 448 graduate-level physics, biology, and chemistry questions written to be Google-proof.
arxiv.org/abs/2311.12022
- [08]
Official GPQA repository with Diamond subset and current usage instructions.
github.com/idavidrein/gpqa
- [09]
MATH benchmark by Hendrycks et al. (2021) contains 12,500 competition math problems.
arxiv.org/abs/2103.03874
- [10]
LiveBench publishes a monthly-refreshed contamination-resistant benchmark across reasoning, coding, math, language, and data analysis.
livebench.ai
- [11]
LiveBench paper by White et al. (2024) describes the contamination-resistant evaluation methodology.
arxiv.org/abs/2406.19314
- [12]
HellaSwag by Zellers et al. (2019) is a commonsense sentence-completion benchmark, saturated since 2023.
arxiv.org/abs/1905.07830
- [13]
LMSYS Chatbot Arena (now lmarena.ai) hosts the live human-preference Elo leaderboard.
lmarena.ai
- [14]
Original Chatbot Arena paper by Chiang et al. (2024) describes the blind pairwise human-preference Elo methodology.
arxiv.org/abs/2403.04132
- [15]
2025 Cohere analysis documents prerelease testing and selection bias in Chatbot Arena.
arxiv.org/abs/2504.20879
- [16]
Boyeau et al. (2024) show length and formatting fine-tuning can gain 30-50 Arena Elo points without capability change.
arxiv.org/abs/2406.12624
- [17]
Sainz et al. (2023) 'NLP Evaluation in Trouble' documents widespread benchmark contamination across popular NLP datasets.
arxiv.org/abs/2310.18018
- [18]
Deng et al. (2023) show fine-tuning on benchmark-adjacent data produces gains indistinguishable from genuine capability improvement.
arxiv.org/abs/2311.04850
- [19]
Vellum LLM Leaderboard provides a continuously-updated composite view of frontier model benchmarks.
vellum.ai/llm-leaderboard
- [20]
Artificial Analysis publishes independent cost-adjusted intelligence scores across major LLM providers.
artificialanalysis.ai
- [21]
Stanford HELM project provides holistic independent evaluation across many models and benchmarks.
crfm.stanford.edu/helm
- [22]
ARC-AGI v2 by Chollet et al. measures novel-problem reasoning and remains genuinely hard for frontier models in 2026.
arcprize.org