The AI leaderboard, read honestly

Composite tracker · 8 benchmarks · what each measures, what it hides

There is no single number that tells you which AI model is best. There are about a dozen popular benchmarks, each measuring something different, each gameable in its own way, each contaminated to some degree by training data leaks. The leaderboards you see on Hugging Face, lmsys.org, and vendor marketing pages are real signal, but they are also a kind of theater. Models are tuned for them. Public test sets leak. Newer base models train on chats that include benchmark questions. The Elo on Chatbot Arena measures human preference under short prompts, which is not the same thing as capability on a 200-page legal review.

This page surveys the eight benchmarks cited most often as of June 2026, on a best-effort basis: LMSYS Chatbot Arena Elo, MMLU, HumanEval, SWE-Bench Verified, GPQA, MATH, LiveBench, and HellaSwag. For each one, we name what it actually measures, what it cannot capture, where the top of the table stands to the best of our knowledge as of mid-2026, and the gaming concerns that published researchers have raised. We end with the section that matters most: which benchmarks actually predict real-world utility, and which are mostly vanity.

The honest take is short. Arena Elo is a vibe check at scale. MMLU is mostly saturated. SWE-Bench Verified, GPQA Diamond, and LiveBench are the three that still discriminate between frontier models in 2026, because they were specifically designed to resist memorization or are refreshed often enough that contamination is bounded. If you only check one composite, check Artificial Analysis or the Vellum LLM Leaderboard, not vendor blogs. Pricing and rankings shift weekly, so check provider docs for current numbers before you commit a workload.

The eight benchmarks at a glance

Benchmark	What it measures	Public test set	Refresh cadence	Saturation risk
LMSYS Chatbot Arena	Human pairwise preference, blind	No — prompts are user-generated	Continuous	Low (no fixed set)
MMLU	57 subjects, multiple choice	Yes, fully public since 2020	Static	High — most frontier models above 87%
HumanEval	164 Python function-completion problems	Yes, fully public since 2021	Static	Very high — saturated above 95%
SWE-Bench Verified	500 real GitHub issues, human-verified	Yes, repo-pinned	Periodic	Medium — still discriminating in 2026
GPQA	448 graduate physics, biology, chemistry questions	Diamond subset public	Static	Low to medium — Diamond is hard
MATH	12,500 competition math problems	Yes, fully public	Static	High — frontier models above 90%
LiveBench	Reasoning, coding, math, language, refreshed monthly	Rotating	Monthly	Low by design
HellaSwag	Commonsense sentence completion	Yes	Static	Saturated — over 95% since 2023

LMSYS Chatbot Arena Elo

Chatbot Arena, run by LMSYS (now operating as lmarena.ai), is a crowdsourced platform where users submit a prompt, get blind responses from two anonymous models, and vote on which is better. The votes are aggregated into an Elo rating, the same statistical method used to rank chess players. As of June 2026 (best effort), the top of the overall Arena leaderboard rotates among Anthropic's Claude family (Opus and Sonnet generations), OpenAI's GPT-series, Google's Gemini family, and xAI's Grok. The exact ranking shifts weekly, so check lmarena.ai for the current state.

What Arena measures: human aesthetic and conversational preference, under short, single-turn or shallow multi-turn prompts, weighted toward English and toward the kinds of prompts users on a free public site are willing to type.

What it does not measure: long-context reliability past 10K tokens, agentic tool use, codebase-level reasoning, faithfulness to a system prompt, latency, cost, or safety under adversarial prompting. The Arena leaderboard explicitly says it is a measure of preference, not capability.

Gaming concerns: Cohere researchers and others have shown that vendors can A/B test prerelease checkpoints in the arena and ship the variant that wins. The arena's style-controlled leaderboard partially corrects for length and formatting bias, but it cannot correct for the fact that human voters reward confident-sounding answers, which are not the same thing as correct answers. Treat Arena Elo as a smell test, not a capability claim.
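To make the rating mechanics concrete, here is a minimal sketch of the classic online Elo update applied to a list of pairwise votes. The K-factor of 32, the starting rating of 1000, and the model names are illustrative assumptions, and the live leaderboard's actual fitting procedure may differ, so treat this as an illustration of the idea rather than a reproduction of lmarena.ai's pipeline.

```python
from collections import defaultdict


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(votes, k: float = 32.0, start: float = 1000.0) -> dict:
    """Apply one Elo update per vote.

    votes: iterable of (model_a, model_b, outcome) where outcome is
    1.0 if A was preferred, 0.0 if B was preferred, 0.5 for a tie.
    k and start are assumed values, not the leaderboard's settings.
    """
    ratings = defaultdict(lambda: start)
    for a, b, outcome in votes:
        exp_a = expected_score(ratings[a], ratings[b])
        # The two updates are equal and opposite, so total rating is conserved.
        ratings[a] += k * (outcome - exp_a)
        ratings[b] += k * ((1.0 - outcome) - (1.0 - exp_a))
    return dict(ratings)


# Hypothetical votes: model-x is preferred twice, model-y once.
votes = [
    ("model-x", "model-y", 1.0),
    ("model-x", "model-y", 1.0),
    ("model-y", "model-x", 1.0),
]
ratings = update_elo(votes)
```

Under this model a 100-point rating gap corresponds to roughly a 64% expected preference rate, which is why gaps of a few points near the top of the table say very little.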

MMLU — Massive Multitask Language Understanding

Introduced by Hendrycks et al. in 2020 (arXiv:2009.03300), MMLU is 15,908 multiple-choice questions across 57 subjects including law, medicine, mathematics, and humanities. It was designed to test broad world knowledge. As of June 2026 (best effort), every frontier model scores above 87% and many cluster between 88% and 92%, which means MMLU has lost most of its discriminative power at the top.

What it measures: factual recall and basic four-option reasoning, in English, in the format and topic distribution of US graduate and professional admission exams.

What it does not measure: ability to handle ambiguity, ability to say 'I don't know,' multilingual depth, or any form of generation. Multiple choice with four options has a 25% floor from guessing.

Gaming concerns: MMLU is fully public and has been in training corpora since 2021. A 2023 paper (Zhou et al., arXiv:2311.01964, 'Don't make your LLM an evaluation benchmark cheater') and a 2024 study by the Allen Institute documented memorization on this benchmark. The MMLU-Pro variant (Wang et al., arXiv:2406.01574) was created specifically to address saturation: it uses ten answer choices instead of four, removes the easiest questions, and has a wider score spread at the frontier. If you cite MMLU in 2026, cite MMLU-Pro.
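Because the guessing floor differs between four-option and ten-option formats, raw accuracies on the two are not directly comparable. The sketch below rescales accuracy so that 0 means random guessing and 1 means a perfect score. The function name is hypothetical and this normalization is a generic correction, not something either benchmark's authors prescribe.

```python
def chance_corrected(accuracy: float, num_options: int) -> float:
    """Rescale accuracy so the random-guess floor maps to 0 and perfect maps to 1."""
    if num_options < 2:
        raise ValueError("need at least two answer options")
    floor = 1.0 / num_options  # 0.25 for four options, 0.10 for ten
    return (accuracy - floor) / (1.0 - floor)


# Illustrative inputs only, not measured results for any model.
four_option = chance_corrected(0.88, 4)   # about 0.84 above chance
ten_option = chance_corrected(0.88, 10)   # about 0.87 above chance
```

The same raw accuracy is worth more on the ten-option format, because less of it can be explained by lucky guesses.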

HumanEval and SWE-Bench Verified

HumanEval

Public · static · saturated

164 Python function-completion problems from OpenAI's 2021 Codex paper (arXiv:2107.03374). Each problem gives a docstring and a function signature; the model writes the body. As of June 2026 best-effort, frontier code models score above 95% pass@1 and HumanEval is effectively saturated — it no longer discriminates among the top tier.
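For readers unfamiliar with pass@k, the Codex paper (arXiv:2107.03374) defines an unbiased estimator: generate n samples per problem, count the c samples that pass the tests, and estimate the probability that at least one of k randomly drawn samples passes. The sketch below implements that formula with the standard library; the sample counts in the example are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: total samples generated, c: samples that passed, k: draw budget.
    Computes 1 - C(n - c, k) / C(n, k).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples exist, so any draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative: 10 samples for one problem, 3 of them pass.
p1 = pass_at_k(10, 3, 1)    # 0.3
p5 = pass_at_k(10, 3, 5)    # about 0.917
p10 = pass_at_k(10, 3, 10)  # 1.0
```

A benchmark score is the mean of this value over all problems. The same model output looks very different at k=1 and k=10, so a claim is only meaningful when k is stated.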

HumanEval+

Public · static · still discriminates

EvalPlus (Liu et al., arXiv:2305.01210) adds about 80x more test cases per problem to expose code that passes the original sparse tests but fails on edge cases. Top models drop 5 to 15 points on HumanEval+ versus HumanEval. If you cite HumanEval, you should cite HumanEval+ alongside it.
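A toy illustration of why extra test cases matter, assuming a made-up problem and a made-up model output (neither is taken from HumanEval or EvalPlus): the function below passes a sparse test set but fails once an edge case is added.

```python
def median_buggy(xs):
    """Hypothetical model output: correct for odd-length lists only."""
    s = sorted(xs)
    return s[len(s) // 2]


def passes(fn, tests) -> bool:
    """A problem counts as solved only if every test case passes."""
    return all(fn(inp) == want for inp, want in tests)


# Sparse tests in the spirit of an original suite: odd-length inputs only.
sparse_tests = [([3, 1, 2], 2), ([5], 5)]
# Extended tests add even-length inputs, which expose the bug.
extended_tests = sparse_tests + [([1, 2, 3, 4], 2.5), ([2, 2], 2)]

solved_sparse = passes(median_buggy, sparse_tests)      # True
solved_extended = passes(median_buggy, extended_tests)  # False
```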

SWE-Bench Verified

Public · periodic refresh · primary code benchmark in 2026

500 real GitHub issues from 12 popular Python repositories, human-verified by OpenAI in 2024 to remove ambiguous or under-specified problems. Models are given the repo and the issue and must produce a patch that passes the hidden test suite. As of June 2026 best-effort, top scores are in the 60% to 75% range — still meaningful headroom.

SWE-Bench Multimodal and Multilingual

Public · expanding · low saturation

Extensions launched in late 2024 and 2025 that add JavaScript, Java, Rust, and image-grounded bug reports. These remain genuinely hard for current models and are where vendors compete for headline numbers in 2026.

GPQA, MATH, LiveBench, HellaSwag

GPQA — graduate-level Q&A

Diamond subset is the gold standard

Rein et al. 2023 (arXiv:2311.12022). 448 questions in physics, biology, and chemistry written by PhD-holders specifically to be Google-proof. The Diamond subset (198 questions) is the hardest. The 2023 paper reports that experts in the relevant field score around 65%, while skilled non-experts with unrestricted web access score around 34%. Frontier models in 2026 are reportedly approaching or exceeding human-expert range on Diamond; verify on the official GPQA repo at github.com/idavidrein/gpqa.

MATH

Saturated · use AIME variants instead

Hendrycks et al. 2021 (arXiv:2103.03874). 12,500 competition mathematics problems (AMC, AIME, etc.) with worked solutions. As of June 2026 best-effort, frontier reasoning-tuned models exceed 90% and the benchmark is largely saturated. MATH-500 (a curated subset) is what most papers report. Look at AIME-2024 and Putnam variants instead for current discrimination.

LiveBench

Refreshes monthly · the most contamination-resistant

White et al. 2024 (livebench.ai, paper at arXiv:2406.19314). Designed specifically to resist contamination by drawing questions from recent arXiv papers, recent news, and recent IMO problems, and by rotating them monthly. Categories: reasoning, coding, mathematics, language, data analysis, instruction following. This is one of the few benchmarks that still moves meaningfully when a frontier model releases.

HellaSwag

Saturated · keep for historical comparison only

Zellers et al. 2019 (arXiv:1905.07830). Commonsense sentence completion with adversarially-generated distractors. Saturated above 95% across the frontier since 2023. Still cited because it is in the legacy harness, but it tells you nothing about modern model capability. Ignore it.

Top 5 by benchmark — June 2026 best-effort

We are deliberately not publishing a hardcoded top-5 list per benchmark on this page. Rankings change weekly as vendors release point updates, and any list we ship today will be wrong within two weeks. For the current state of each leaderboard, check the primary source: lmarena.ai for Chatbot Arena Elo, github.com/openai/swe-bench for SWE-Bench Verified, livebench.ai for LiveBench, the Vellum LLM Leaderboard at vellum.ai/llm-leaderboard for a composite view, and Artificial Analysis at artificialanalysis.ai for cost-adjusted intelligence scores. The Vellum and Artificial Analysis composites are updated continuously and are the cleanest single dashboards we have found in 2026.

The contamination problem, plainly

Train-on-test contamination is the single biggest reason published benchmark numbers should be discounted. Here is the mechanism: a benchmark is released as a public dataset on GitHub or Hugging Face. The Common Crawl, GitHub scrape, and academic-paper corpora that everyone trains on then ingest it. The next generation of base models has seen the test set during pretraining. They are not memorizing in the literal sense, but they have absorbed the answer distribution. Scores rise. Vendors celebrate. Researchers measure the contamination and quietly correct downward.

The 2023 paper by Sainz et al. ('NLP evaluation in trouble,' arXiv:2310.18018) documented contamination across most popular benchmarks. The 2023 paper by Deng et al. (arXiv:2311.04850) showed that fine-tuning on benchmark-adjacent data produces gains indistinguishable from genuine capability improvement. EvalPlus, MMLU-Pro, GPQA-Diamond, and LiveBench were all created in direct response to this problem.

The practical rule: for any benchmark that has been public for more than 18 months and is not refreshed, assume contamination and treat the reported score as an upper bound on capability, not as an honest measure of it. The newer generation of benchmarks (SWE-Bench Verified, GPQA-Diamond, LiveBench, MMLU-Pro, AIME-2024, ARC-AGI v2) is designed to be more resistant. Use those when you can.
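One common heuristic for spotting the leak mechanism described above is n-gram overlap between a benchmark item and a training document. The sketch below is a simplified version of that idea. The whitespace tokenizer, the window size of 8, and the example strings are all assumptions, and a check like this misses paraphrased or translated leaks entirely.

```python
def ngrams(text: str, n: int) -> set:
    """Set of lowercase whitespace-token n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item: str, corpus_doc: str, n: int = 8) -> float:
    """Fraction of the item's n-grams that also appear verbatim in the document."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        # Item is shorter than n tokens, so there is nothing to compare.
        return 0.0
    return len(item_grams & ngrams(corpus_doc, n)) / len(item_grams)


# Hypothetical benchmark question and two hypothetical crawled documents.
item = "what is the capital of france and why is it famous"
leaked_doc = "forum post: " + item + " answer: paris"
clean_doc = "a recipe for sourdough bread that needs flour water salt and time"

leaked = overlap_ratio(item, leaked_doc)  # 1.0
clean = overlap_ratio(item, clean_doc)    # 0.0
```

A high ratio is strong evidence of verbatim leakage; a low ratio is not evidence of a clean corpus.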

Arena Elo is not capability

Anthropic, OpenAI, and Google have all published acknowledgments that Chatbot Arena Elo correlates with user preference, not with capability on hard tasks. The 2024 paper by Boyeau et al. (arXiv:2406.12624) showed that a model fine-tuned for length and formatting can gain 30 to 50 Elo points without any capability change. The 2025 Cohere analysis of Arena (arXiv:2504.20879) further documented prerelease testing and selection bias. Treat Arena Elo as 'which model do people enjoy chatting with,' not 'which model gets harder problems right.' For the latter, look at SWE-Bench Verified, GPQA-Diamond, and LiveBench.

Benchmarks that actually predict real-world utility

SWE-Bench Verified — closest proxy to 'can this model close a real engineering ticket end-to-end.' Highly recommended if you care about agentic coding.
LiveBench (current month) — least contaminated composite available; monthly refresh means scores genuinely move when capability moves.
GPQA-Diamond — best public proxy for graduate-level reasoning in the physical sciences. Saturating but still informative.
MMLU-Pro — replaces vanilla MMLU and discriminates at the frontier; 10-option multiple choice removes most of the lucky-guess floor.
AIME 2024 and 2025 — fresh math competition problems with low contamination risk; use instead of saturated MATH.
ARC-AGI v2 (Chollet et al.) — measures novel-problem reasoning; remains genuinely hard for frontier models in 2026; check arcprize.org for current state.
Long-context retrieval evals (RULER, NoCha, FACTS Grounding) — measure faithfulness on 100K-plus contexts, where most production workloads actually live.
Your own evaluation set — the only benchmark that perfectly predicts your real-world utility is one written on your real prompts. Hold out 50 to 200 examples from your actual use case. Score quarterly. Trust that number more than any leaderboard.
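The last item in the list above can be sketched in a few lines. This is a minimal harness under stated assumptions: `stub_model` stands in for whatever API call you actually use, the three held-out examples are invented, and exact-match grading is the crudest possible scorer (many real tasks need a rubric or human grading instead).

```python
from typing import Callable, Sequence, Tuple


def exact_match(output: str, expected: str) -> bool:
    """Case-insensitive, whitespace-trimmed string comparison."""
    return output.strip().lower() == expected.strip().lower()


def run_eval(
    model: Callable[[str], str],
    examples: Sequence[Tuple[str, str]],
    grader: Callable[[str, str], bool] = exact_match,
) -> float:
    """Return the fraction of (prompt, expected) pairs the model gets right."""
    if not examples:
        raise ValueError("evaluation set is empty")
    passed = sum(1 for prompt, expected in examples if grader(model(prompt), expected))
    return passed / len(examples)


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call."""
    canned = {"2+2?": "4", "Capital of France?": "Paris"}
    return canned.get(prompt, "")


# Invented held-out examples; in practice, sample these from real traffic.
held_out = [("2+2?", "4"), ("Capital of France?", "paris"), ("Largest planet?", "Jupiter")]
score = run_eval(stub_model, held_out)  # 2 of 3 correct
```

Keeping the model behind a plain callable means the same held-out set can be rerun against each candidate without touching the scoring code.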

How to read a vendor benchmark claim

Step 1
Check the benchmark name precisely
'MMLU 90.2' versus 'MMLU-Pro 78.4' are completely different claims. Vendors will sometimes write 'MMLU' when they mean the easier original benchmark. Demand the exact dataset version.
Step 2
Check whether the benchmark is contamination-controlled
If it is a benchmark released before 2023 and not refreshed, assume contamination. If it is LiveBench, GPQA-Diamond, SWE-Bench Verified, MMLU-Pro, or AIME-2024, contamination risk is bounded.
Step 3
Check the harness and the prompting setup
Pass@1 versus pass@10, chain-of-thought versus zero-shot, with-tools versus without-tools: these choices can swing scores by 20 points. Demand identical evaluation harnesses across compared models.
Step 4
Check whether the score was self-reported or independent
Artificial Analysis, Vellum, and the Stanford HELM project run their own independent evaluations. Vendor-reported numbers should be treated as upper bounds until reproduced independently.
Step 5
Check whether it predicts your task
MMLU on medicine does not predict legal-contract drafting. SWE-Bench Verified does not predict creative writing. Match the benchmark to the use case, or run your own eval.

The minimum-effective-dose tracking stack

If you want to track frontier model capability without becoming a full-time benchmark hobbyist, the minimum-effective-dose is three sources checked monthly. First, lmarena.ai for the human-preference vibe check. Second, livebench.ai for the contamination-resistant composite. Third, either Artificial Analysis (artificialanalysis.ai) or the Vellum LLM Leaderboard (vellum.ai/llm-leaderboard) for cost-adjusted intelligence. That is roughly fifteen minutes a month and gives you 90% of the signal that benchmark-watchers extract from following twenty leaderboards weekly.

For anything load-bearing (a workload you are about to deploy, a vendor contract you are about to sign, a model swap you are about to ship), write your own 50-prompt evaluation on your real use case and run it against the two or three candidate models. Score it yourself or have two humans score it independently. That number will predict your production outcome better than any leaderboard.

Check provider docs for current pricing, current rate limits, current context windows, and current model identifiers before you commit code. These change roughly monthly across all major vendors, and any number we hardcode on this page will be stale by the time you read it.
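When two humans score the same outputs independently, it helps to check how much they agree before trusting the average. The sketch below computes Cohen's kappa, a standard chance-corrected agreement statistic, from two lists of labels. The pass/fail labels are invented, and the choice of kappa (rather than some other agreement measure) is our assumption.

```python
def cohen_kappa(rater_a: list, rater_b: list) -> float:
    """Chance-corrected agreement between two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    labels = set(rater_a) | set(rater_b)
    # Agreement expected if each rater assigned labels at random
    # with their own observed label frequencies.
    expected = sum((rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Invented pass (1) / fail (0) labels from two raters on four outputs.
rater_a = [1, 1, 0, 0]
rater_b = [1, 1, 0, 1]
kappa = cohen_kappa(rater_a, rater_b)  # 0.5
```

A low kappa usually means the scoring rubric is ambiguous, and that is worth fixing before comparing models on it.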

Sources

[01]
MMLU was introduced by Hendrycks et al. in 2020 as a 57-subject multiple-choice benchmark.
arxiv.org/abs/2009.03300
[02]
MMLU-Pro by Wang et al. (2024) addresses MMLU saturation with 10 answer choices and harder questions.
arxiv.org/abs/2406.01574
[03]
HumanEval was introduced in OpenAI's 2021 Codex paper as 164 Python function-completion problems.
arxiv.org/abs/2107.03374
[04]
EvalPlus (HumanEval+ and MBPP+) by Liu et al. exposes inflated HumanEval scores using ~80x more test cases.
arxiv.org/abs/2305.01210
[05]
SWE-Bench Verified is OpenAI's 500-problem human-verified subset of SWE-Bench.
github.com/openai/swe-bench
[06]
Original SWE-Bench paper by Jimenez et al. (2023) introduces real GitHub-issue resolution as a code benchmark.
arxiv.org/abs/2310.06770
[07]
GPQA by Rein et al. (2023) provides 448 graduate-level physics, biology, and chemistry questions written to be Google-proof.
arxiv.org/abs/2311.12022
[08]
Official GPQA repository with Diamond subset and current usage instructions.
github.com/idavidrein/gpqa
[09]
MATH benchmark by Hendrycks et al. (2021) contains 12,500 competition math problems.
arxiv.org/abs/2103.03874
[10]
LiveBench publishes a monthly-refreshed contamination-resistant benchmark across reasoning, coding, math, language, and data analysis.
livebench.ai
[11]
LiveBench paper by White et al. (2024) describes the contamination-resistant evaluation methodology.
arxiv.org/abs/2406.19314
[12]
HellaSwag by Zellers et al. (2019) is a commonsense sentence-completion benchmark, saturated since 2023.
arxiv.org/abs/1905.07830
[13]
LMSYS Chatbot Arena (now lmarena.ai) hosts the live human-preference Elo leaderboard.
lmarena.ai
[14]
Original Chatbot Arena paper by Chiang et al. (2024) describes the blind pairwise human-preference Elo methodology.
arxiv.org/abs/2403.04132
[15]
2025 Cohere analysis documents prerelease testing and selection bias in Chatbot Arena.
arxiv.org/abs/2504.20879
[16]
Boyeau et al. (2024) show length and formatting fine-tuning can gain 30-50 Arena Elo points without capability change.
arxiv.org/abs/2406.12624
[17]
Sainz et al. (2023) 'NLP Evaluation in Trouble' documents widespread benchmark contamination across popular NLP datasets.
arxiv.org/abs/2310.18018
[18]
Deng et al. (2023) show fine-tuning on benchmark-adjacent data produces gains indistinguishable from genuine capability improvement.
arxiv.org/abs/2311.04850
[19]
Vellum LLM Leaderboard provides a continuously-updated composite view of frontier model benchmarks.
vellum.ai/llm-leaderboard
[20]
Artificial Analysis publishes independent cost-adjusted intelligence scores across major LLM providers.
artificialanalysis.ai
[21]
Stanford HELM project provides holistic independent evaluation across many models and benchmarks.
crfm.stanford.edu/helm
[22]
ARC-AGI v2 by Chollet et al. measures novel-problem reasoning and remains genuinely hard for frontier models in 2026.
arcprize.org
