
The AI leaderboard, read honestly

Composite tracker · 8 benchmarks · what each measures, what it hides

There is no single number that tells you which AI model is best. There are about a dozen popular benchmarks, each measuring something different, each gameable in its own way, and each contaminated to some degree by training-data leaks. The leaderboards you see on Hugging Face, lmsys.org, and vendor marketing pages are real signal, but they are also a kind of theater. Models are tuned for them. Public test sets leak. Newer base models train on chats that include benchmark questions. The Elo on Chatbot Arena measures human preference under short prompts, which is not the same thing as capability on a 200-page legal review.

This page surveys the eight benchmarks that get cited most often as of June 2026 best-effort: LMSYS Chatbot Arena Elo, MMLU, HumanEval, SWE-Bench Verified, GPQA, MATH, LiveBench, and HellaSwag. For each one, we name what it actually measures, what it cannot capture, where the top of the table stands to the best of our knowledge as of mid-2026, and the gaming concerns published researchers have raised. We end with the section that matters most: which benchmarks actually predict real-world utility, and which are mostly vanity.

The honest take is short. Arena Elo is a vibe check at scale. MMLU is mostly saturated. SWE-Bench Verified, GPQA Diamond, and LiveBench are the three that still discriminate between frontier models in 2026, because they were either designed to resist memorization or are refreshed often enough that contamination is bounded. If you only check one composite, check Artificial Analysis or the Vellum LLM Leaderboard, not vendor blogs. Pricing and rankings shift weekly, so check provider docs for current numbers before you commit a workload.

The eight benchmarks at a glance

| Benchmark | What it measures | Public test set | Refresh cadence | Saturation risk |
| --- | --- | --- | --- | --- |
| LMSYS Chatbot Arena | Human pairwise preference, blind | No (prompts are user-generated) | Continuous | Low (no fixed set) |
| MMLU | 57 subjects, multiple choice | Yes, fully public since 2020 | Static | High (most frontier models above 87%) |
| HumanEval | 164 Python function-completion problems | Yes, fully public since 2021 | Static | Very high (saturated above 95%) |
| SWE-Bench Verified | 500 real GitHub issues, human-verified | Yes, repo-pinned | Periodic | Medium (still discriminating in 2026) |
| GPQA | 448 graduate physics, biology, chemistry questions | Diamond subset public | Static | Low to medium (Diamond is hard) |
| MATH | 12,500 competition math problems | Yes, fully public | Static | High (frontier models above 90%) |
| LiveBench | Reasoning, coding, math, language, refreshed monthly | Rotating | Monthly | Low by design |
| HellaSwag | Commonsense sentence completion | Yes | Static | Saturated (over 95% since 2023) |

LMSYS Chatbot Arena Elo

Chatbot Arena, run by LMSYS (now operating as lmarena.ai), is a crowdsourced platform where users submit a prompt, get blind responses from two anonymous models, and vote on which is better. The aggregate produces an Elo rating, the same statistical method used in chess. As of June 2026 best-effort, the top of the overall Arena leaderboard rotates among Anthropic's Claude family (Opus and Sonnet generations), OpenAI's GPT-series, Google's Gemini family, and xAI's Grok. The exact ranking shifts weekly, so check lmarena.ai for the current state.

What Arena measures: human aesthetic and conversational preference, under short, single-turn or shallow-multi-turn prompts, weighted toward English and toward the kinds of prompts users on a free public site are willing to type.

What it does not measure: long-context reliability past 10K tokens, agentic tool use, code-base level reasoning, faithfulness to a system prompt, latency, cost, or safety under adversarial prompting. The Arena leaderboard explicitly says it is a measure of preference, not capability.

Gaming concerns: Cohere researchers and others have shown that vendors can A/B test prerelease checkpoints in the arena and ship the variant that wins. The arena's style-controlled leaderboard partially corrects for length and formatting bias, but it cannot correct for the fact that human voters reward confident-sounding answers, which is not the same thing as correct answers. Treat Arena Elo as a smell test, not a capability claim.
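To make the rating mechanics concrete, here is a minimal sketch of the textbook online Elo update applied to blind pairwise votes. It is an illustration only: the starting rating of 1000, the K-factor of 32, and the model names are assumptions, and the live leaderboard's actual fitting procedure may differ.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings: dict, model_a: str, model_b: str,
               winner: str, k: float = 32.0) -> None:
    """Apply one vote in place. `winner` is model_a, model_b, or "tie"."""
    ra = ratings.setdefault(model_a, 1000.0)  # assumed starting rating
    rb = ratings.setdefault(model_b, 1000.0)
    score_a = {model_a: 1.0, model_b: 0.0, "tie": 0.5}[winner]
    exp_a = expected_score(ra, rb)
    # Whatever A gains, B loses: the update only moves points between the pair.
    ratings[model_a] = ra + k * (score_a - exp_a)
    ratings[model_b] = rb - k * (score_a - exp_a)


# Hypothetical votes: (model shown as A, model shown as B, outcome)
votes = [
    ("model-x", "model-y", "model-x"),
    ("model-x", "model-y", "model-x"),
    ("model-y", "model-x", "tie"),
]
ratings: dict = {}
for a, b, outcome in votes:
    update_elo(ratings, a, b, outcome)

print(ratings)
```

Each vote only transfers points from the less-preferred model to the preferred one, so the resulting number records who won head-to-head votes and nothing about why.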

MMLU — Massive Multitask Language Understanding

Introduced by Hendrycks et al. in 2020 (arXiv:2009.03300), MMLU is 15,908 multiple-choice questions across 57 subjects including law, medicine, mathematics, and humanities. It was designed to test broad world knowledge. As of June 2026 best-effort, every frontier model scores above 87% and many cluster between 88% and 92%, which means MMLU has lost most of its discriminative power at the top.

What it measures: factual recall and basic four-option reasoning, in English, in the format and topic distribution of US graduate and professional admission exams.

What it does not measure: ability to handle ambiguity, ability to say 'I don't know,' multilingual depth, or any form of generation. Multiple-choice with four options has a 25% floor from guessing.

Gaming concerns: MMLU is fully public and has been in training corpora since 2021. A 2023 paper (Zhou et al., arXiv:2311.01964, 'Don't make your LLM an evaluation benchmark cheater') and a 2024 study by the Allen Institute documented memorization on this benchmark. The MMLU-Pro variant (Wang et al., arXiv:2406.01574) was created specifically to address saturation: it uses ten answer choices instead of four, removes the easiest questions, and has a wider score spread at the frontier. If you cite MMLU in 2026, cite MMLU-Pro.
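The guessing floor is easy to quantify. The sketch below rescales a raw multiple-choice accuracy so that random guessing maps to 0 and a perfect score maps to 1. The function name and the example scores are illustrative assumptions, not reported results.

```python
def chance_corrected(accuracy: float, num_options: int) -> float:
    """Rescale accuracy so random guessing scores 0.0 and perfect scores 1.0."""
    floor = 1.0 / num_options          # 0.25 for 4 options, 0.10 for 10 options
    return (accuracy - floor) / (1.0 - floor)


# Illustrative numbers only: a four-option score and a ten-option score.
print(chance_corrected(0.88, 4))    # four answer choices, 25% guessing floor
print(chance_corrected(0.78, 10))   # ten answer choices, 10% guessing floor
```

Under this rescaling a four-option score loses more of its headline value than a ten-option score does, which is one reason the two formats should not be compared directly.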

HumanEval and SWE-Bench Verified

HumanEval

Public · static · saturated

164 Python function-completion problems from OpenAI's 2021 Codex paper (arXiv:2107.03374). Each problem gives a docstring and a function signature; the model writes the body. As of June 2026 best-effort, frontier code models score above 95% pass@1 and HumanEval is effectively saturated — it no longer discriminates among the top tier.
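The pass@1 figure is usually computed with the unbiased pass@k estimator described in the same Codex paper: sample n completions per problem, count the c that pass the tests, and estimate the chance that at least one of k draws passes. A minimal sketch, with made-up sample counts:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n - c, k) / C(n, k).

    n = completions sampled for the problem
    c = completions that passed every test
    k = number of attempts the metric allows
    """
    if k > n:
        raise ValueError("k cannot exceed the number of samples n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 10 samples drawn, 3 of them pass.
for k in (1, 5, 10):
    print(k, round(pass_at_k(10, 3, k), 3))
```

The same three-in-ten result reads as 30% at pass@1 and 100% at pass@10, which is why the k value has to be stated alongside any score.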

HumanEval+

Public · static · still discriminates

EvalPlus (Liu et al., arXiv:2305.01210) adds about 80x more test cases per problem to expose code that passes the original sparse tests but fails on edge cases. Top models drop 5 to 15 points on HumanEval+ versus HumanEval. If you cite HumanEval, you should cite HumanEval+ alongside it.
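A toy illustration of the failure mode EvalPlus targets. The function and tests below are hypothetical and are not taken from HumanEval or HumanEval+; they only show how a sparse test list can accept code that a denser one rejects.

```python
def median_buggy(values):
    """Plausible-looking median that is wrong for even-length input."""
    ordered = sorted(values)
    return ordered[len(ordered) // 2]


def passes(fn, tests) -> bool:
    """True if fn returns the expected output for every (input, expected) pair."""
    return all(fn(arg) == expected for arg, expected in tests)


# Sparse tests: only odd-length inputs, so the bug never triggers.
sparse_tests = [([1, 3, 2], 2), ([5], 5)]

# Denser tests: add even-length inputs, where the median is an average.
dense_tests = sparse_tests + [([1, 2, 3, 4], 2.5), ([2, 4], 3.0)]

print(passes(median_buggy, sparse_tests))  # True
print(passes(median_buggy, dense_tests))   # False
```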

SWE-Bench Verified

Public · periodic refresh · primary code benchmark in 2026

500 real GitHub issues from 12 popular Python repositories, human-verified by OpenAI in 2024 to remove ambiguous or under-specified problems. Models are given the repo and the issue and must produce a patch that passes the hidden test suite. As of June 2026 best-effort, top scores are in the 60% to 75% range — still meaningful headroom.

SWE-Bench Multimodal and Multilingual

Public · expanding · low saturation

Extensions launched in late 2024 and 2025 that add JavaScript, Java, Rust, and image-grounded bug reports. These remain genuinely hard for current models and are where vendors compete for headline numbers in 2026.

GPQA, MATH, LiveBench, HellaSwag

GPQA — graduate-level Q&A

Diamond subset is the gold standard

Rein et al. 2023 (arXiv:2311.12022). 448 questions in physics, biology, and chemistry written by domain experts (people who have or are pursuing PhDs) specifically to be Google-proof. The Diamond subset (198 questions) is the hardest. The 2023 paper reports that expert validators in the relevant field score around 65%, while skilled non-experts score around 34% even with unrestricted web access. Frontier models in 2026 are reportedly approaching or exceeding human-expert range on Diamond. Verify on the official GPQA repo at github.com/idavidrein/gpqa.

MATH

Saturated · use AIME variants instead

Hendrycks et al. 2021 (arXiv:2103.03874). 12,500 competition mathematics problems (AMC, AIME, etc.) with worked solutions. As of June 2026 best-effort, frontier reasoning-tuned models exceed 90% and the benchmark is largely saturated. MATH-500 (a curated subset) is what most papers report. Look at AIME-2024 and Putnam variants instead for current discrimination.

LiveBench

Refreshes monthly · the most contamination-resistant

White et al. 2024 (livebench.ai, paper at arXiv:2406.19314). Designed specifically to resist contamination by drawing from recent arXiv papers, recent news, recent IMO problems, and rotating monthly. Categories: reasoning, coding, mathematics, language, data analysis, instruction following. This is one of the few benchmarks that still moves meaningfully when a frontier model releases.

HellaSwag

Saturated · keep for historical comparison only

Zellers et al. 2019 (arXiv:1905.07830). Commonsense sentence completion with adversarially-generated distractors. Saturated above 95% across the frontier since 2023. Still cited because it is in the legacy harness, but it tells you nothing about modern model capability. Ignore it.

Top 5 by benchmark — June 2026 best-effort

We deliberately are not publishing a hardcoded top-5 list per benchmark in this page. Rankings change weekly as vendors release point updates, and any list we ship today will be wrong within two weeks. For the current state of each leaderboard, check the primary source: lmarena.ai for Chatbot Arena Elo, github.com/openai/swe-bench for SWE-Bench Verified, livebench.ai for LiveBench, the Vellum LLM Leaderboard at vellum.ai/llm-leaderboard for a composite view, and Artificial Analysis at artificialanalysis.ai for cost-adjusted intelligence scores. The Vellum and Artificial Analysis composites are updated continuously and are the cleanest single dashboards we have found in 2026.

The contamination problem, plainly

Train-on-test contamination is the single biggest reason published benchmark numbers should be discounted. Here is the mechanism: a benchmark is released as a public dataset on GitHub or Hugging Face. The Common Crawl, GitHub scrape, and academic-paper corpora that everyone trains on then ingest it. The next generation of base models has seen the test set during pretraining. They are not memorizing in the literal sense, but they have absorbed the answer distribution. Scores rise. Vendors celebrate. Researchers measure the contamination and quietly correct downward.

The 2023 paper by Sainz et al. ('NLP evaluation in trouble,' arXiv:2310.18018) documented contamination across most popular benchmarks. The 2023 paper by Yang et al. (arXiv:2311.04850) showed that fine-tuning on benchmark-adjacent data produces gains indistinguishable from genuine capability improvement. EvalPlus, MMLU-Pro, GPQA-Diamond, and LiveBench were all created in direct response to this problem.

The practical rule: scores on any benchmark that has been public for more than 18 months and is not refreshed should be treated as an upper bound on capability, inflated by an unknown amount of contamination, not as an honest capability measure. The 2023 to 2026 generation of benchmarks (SWE-Bench Verified, GPQA-Diamond, LiveBench, MMLU-Pro, AIME-2024, ARC-AGI v2) is designed to be more resistant. Use those when you can.
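One simple way to look for verbatim leakage is to measure n-gram overlap between a benchmark item and a training document. The sketch below is a deliberately crude version of that idea: the window of 8 tokens, the tokenizer, and the example texts are all assumptions, and a check like this only catches near-verbatim copies, not paraphrased or translated leaks.

```python
import re


def ngrams(text: str, n: int) -> set:
    """Set of lowercase word n-grams, ignoring punctuation."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(benchmark_item: str, corpus_doc: str, n: int = 8) -> float:
    """Share of the item's n-grams that also appear in the corpus document."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item is shorter than n tokens, nothing to compare
    return len(item_grams & ngrams(corpus_doc, n)) / len(item_grams)


# Hypothetical benchmark question and two hypothetical crawled pages.
question = ("Which of the following best describes the role of the "
            "mitochondria in a eukaryotic cell")
scraped_page = ("Quiz dump: which of the following best describes the role of "
                "the mitochondria in a eukaryotic cell? Answer: C")
clean_page = ("Mitochondria produce most of the ATP that a eukaryotic cell "
              "uses for energy")

print(overlap_fraction(question, scraped_page))  # 1.0
print(overlap_fraction(question, clean_page))    # 0.0
```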

Arena Elo is not capability

Anthropic, OpenAI, and Google have all published acknowledgments that Chatbot Arena Elo correlates with user preference, not with capability on hard tasks. The 2024 paper by Boyeau et al. (arXiv:2406.12624) showed that a model fine-tuned for length and formatting can gain 30 to 50 Elo points without any capability change. The 2025 Cohere analysis of Arena (arXiv:2504.20879) further documented prerelease testing and selection bias. Treat Arena Elo as 'which model do people enjoy chatting with,' not 'which model gets harder problems right.' For the latter, look at SWE-Bench Verified, GPQA-Diamond, and LiveBench.
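For scale, the standard Elo formula converts a rating gap into an expected head-to-head win rate. The sketch below applies it to gaps of 30 and 50 points; the function name is ours.

```python
def win_probability(elo_gap: float) -> float:
    """Expected win rate of the higher-rated model given its Elo lead."""
    return 1.0 / (1.0 + 10 ** (-elo_gap / 400.0))


for gap in (30, 50):
    print(gap, round(win_probability(gap), 3))
# 30 0.543
# 50 0.571
```

A 30 to 50 point gain therefore corresponds to being preferred in roughly 54% to 57% of head-to-head votes, up from an even 50%.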

Benchmarks that actually predict real-world utility

  • SWE-Bench Verified — closest proxy to 'can this model close a real engineering ticket end-to-end.' Highly recommended if you care about agentic coding.
  • LiveBench (current month) — least contaminated composite available; monthly refresh means scores genuinely move when capability moves.
  • GPQA-Diamond — best public proxy for graduate-level reasoning in the physical sciences. Saturating but still informative.
  • MMLU-Pro — replaces vanilla MMLU and discriminates at the frontier; 10-option multiple choice removes most of the lucky-guess floor.
  • AIME 2024 and 2025 — fresh math competition problems with low contamination risk; use instead of saturated MATH.
  • ARC-AGI v2 (Chollet et al.) — measures novel-problem reasoning; remains genuinely hard for frontier models in 2026; check arcprize.org for current state.
  • Long-context retrieval evals (RULER, NoCha, FACTS Grounding) — measure faithfulness on 100K-plus contexts, where most production workloads actually live.
  • Your own evaluation set — the only benchmark that perfectly predicts your real-world utility is one written on your real prompts. Hold out 50 to 200 examples from your actual use case. Score quarterly. Trust that number more than any leaderboard.
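With only 50 to 200 held-out examples, a score carries real sampling noise, so it helps to report an interval and not just a single number. The sketch below uses a percentile bootstrap. The pass/fail outcomes are made up, and the resample count and seed are arbitrary choices.

```python
import random


def bootstrap_ci(outcomes, num_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0):
    """Percentile bootstrap interval for the mean of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(num_resamples)
    )
    lower = means[int((alpha / 2) * num_resamples)]
    upper = means[int((1 - alpha / 2) * num_resamples) - 1]
    return lower, upper


# Hypothetical result: 41 of 50 held-out prompts judged correct.
outcomes = [1] * 41 + [0] * 9
accuracy = sum(outcomes) / len(outcomes)
low, high = bootstrap_ci(outcomes)
print(f"accuracy={accuracy:.2f}, 95% interval=({low:.2f}, {high:.2f})")
```

On a set this small the interval spans many points, so a gap of two or three points between candidate models is usually not enough to pick a winner.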

How to read a vendor benchmark claim

  1. Step 1

    Check the benchmark name precisely

    'MMLU 90.2' and 'MMLU-Pro 78.4' are completely different claims. Vendors will sometimes write just 'MMLU' without saying whether they mean the easier original benchmark or a harder variant. Demand the exact dataset version.

  2. Step 2

    Check whether the benchmark is contamination-controlled

    If it is a benchmark released before 2023 and not refreshed, assume contamination. If it is LiveBench, GPQA-Diamond, SWE-Bench Verified, MMLU-Pro, or AIME-2024, contamination risk is bounded.

  3. Step 3

    Check the harness and the prompting setup

    Pass@1 versus pass@10, chain-of-thought versus zero-shot, with-tools versus without-tools: each of these choices can swing scores by 20 points. Demand identical evaluation harnesses across compared models.

  4. Step 4

    Check whether the score was self-reported or independent

    Artificial Analysis, Vellum, and the Stanford HELM project run their own independent evaluations. Vendor-reported numbers should be treated as upper bounds until reproduced independently.

  5. Step 5

    Check whether it predicts your task

    MMLU on medicine does not predict legal-contract drafting. SWE-Bench Verified does not predict creative writing. Match the benchmark to the use case, or run your own eval.

The minimum-effective-dose tracking stack

If you want to track frontier model capability without becoming a full-time benchmark hobbyist, the minimum effective dose is three sources checked monthly. First, lmarena.ai for the human-preference vibe check. Second, livebench.ai for the contamination-resistant composite. Third, either Artificial Analysis (artificialanalysis.ai) or the Vellum LLM Leaderboard (vellum.ai/llm-leaderboard) for cost-adjusted intelligence. That is roughly fifteen minutes a month and gives you 90% of the signal that benchmark-watchers extract from following twenty leaderboards weekly.

For anything load-bearing (a workload you are about to deploy, a vendor contract you are about to sign, a model swap you are about to ship), write your own 50-prompt evaluation on your real use case and run it against the two or three candidate models. Score it yourself or have two humans score it independently. That number will predict your production outcome better than any leaderboard.

Check provider docs for current pricing, current rate limits, current context windows, and current model identifiers before you commit code. These change roughly monthly across all major vendors, and any number we hardcode on this page will be stale by the time you read it.
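If two humans score the evaluation independently, it is worth checking how far their labels agree beyond chance before trusting the number. One common statistic for that is Cohen's kappa. The sketch below is a plain implementation, and the rater labels are invented for illustration.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b) -> float:
    """Agreement between two raters, corrected for chance agreement."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same label at random.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical pass (1) / fail (0) labels from two reviewers on ten prompts.
reviewer_1 = [1, 1, 1, 0, 0, 1, 1, 0, 1, 1]
reviewer_2 = [1, 1, 0, 0, 0, 1, 1, 1, 1, 1]
kappa = cohens_kappa(reviewer_1, reviewer_2)
print(round(kappa, 3))
```

Here the reviewers agree on 8 of 10 items, but because both mark most items as passing, a good share of that agreement is expected by chance, and kappa comes out noticeably lower than 0.8.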

Sources

  1. [01]

    MMLU was introduced by Hendrycks et al. in 2020 as a 57-subject multiple-choice benchmark.

    arxiv.org/abs/2009.03300

  2. [02]

    MMLU-Pro by Wang et al. (2024) addresses MMLU saturation with 10 answer choices and harder questions.

    arxiv.org/abs/2406.01574

  3. [03]

    HumanEval was introduced in OpenAI's 2021 Codex paper as 164 Python function-completion problems.

    arxiv.org/abs/2107.03374

  4. [04]

    EvalPlus (HumanEval+ and MBPP+) by Liu et al. exposes inflated HumanEval scores using ~80x more test cases.

    arxiv.org/abs/2305.01210

  5. [05]

    SWE-Bench Verified is OpenAI's 500-problem human-verified subset of SWE-Bench.

    github.com/openai/swe-bench

  6. [06]

    Original SWE-Bench paper by Jimenez et al. (2023) introduces real GitHub-issue resolution as a code benchmark.

    arxiv.org/abs/2310.06770

  7. [07]

    GPQA by Rein et al. (2023) provides 448 graduate-level physics, biology, and chemistry questions written to be Google-proof.

    arxiv.org/abs/2311.12022

  8. [08]

    Official GPQA repository with Diamond subset and current usage instructions.

    github.com/idavidrein/gpqa

  9. [09]

    MATH benchmark by Hendrycks et al. (2021) contains 12,500 competition math problems.

    arxiv.org/abs/2103.03874

  10. [10]

    LiveBench publishes a monthly-refreshed contamination-resistant benchmark across reasoning, coding, math, language, and data analysis.

    livebench.ai

  11. [11]

    LiveBench paper by White et al. (2024) describes the contamination-resistant evaluation methodology.

    arxiv.org/abs/2406.19314

  12. [12]

    HellaSwag by Zellers et al. (2019) is a commonsense sentence-completion benchmark, saturated since 2023.

    arxiv.org/abs/1905.07830

  13. [13]

    LMSYS Chatbot Arena (now lmarena.ai) hosts the live human-preference Elo leaderboard.

    lmarena.ai

  14. [14]

    Original Chatbot Arena paper by Chiang et al. (2024) describes the blind pairwise human-preference Elo methodology.

    arxiv.org/abs/2403.04132

  15. [15]

    2025 Cohere analysis documents prerelease testing and selection bias in Chatbot Arena.

    arxiv.org/abs/2504.20879

  16. [16]

    Boyeau et al. (2024) show length and formatting fine-tuning can gain 30-50 Arena Elo points without capability change.

    arxiv.org/abs/2406.12624

  17. [17]

    Sainz et al. (2023) 'NLP Evaluation in Trouble' documents widespread benchmark contamination across popular NLP datasets.

    arxiv.org/abs/2310.18018

  18. [18]

    Yang et al. (2023) show fine-tuning on benchmark-adjacent data produces gains indistinguishable from genuine capability improvement.

    arxiv.org/abs/2311.04850

  19. [19]

    Vellum LLM Leaderboard provides a continuously-updated composite view of frontier model benchmarks.

    vellum.ai/llm-leaderboard

  20. [20]

    Artificial Analysis publishes independent cost-adjusted intelligence scores across major LLM providers.

    artificialanalysis.ai

  21. [21]

    Stanford HELM project provides holistic independent evaluation across many models and benchmarks.

    crfm.stanford.edu/helm

  22. [22]

    ARC-AGI v2 by Chollet et al. measures novel-problem reasoning and remains genuinely hard for frontier models in 2026.

    arcprize.org

LAB · ATOMEONS · MARCO ISLAND FLÆONS RESEARCH · 12 PAPERS · CC-BY 4.0