HOT · MAY 2026

ÆONS RESEARCH · ISSUE 01 · MAY 2026

The hottest supermodels of May 2026.

A reasoning ranking of every frontier model the lab tests, scored against four independent public leaderboards and the commentary of researchers who do not have stock options in the answer. No vendor decks. No paid threads. No demo videos. Receipts only.

Issue: May 2026

Cutoff: 2026-06-03


§ 01·METHODOLOGY

How the lab built this list — and what we threw out.

Four independent leaderboards, cross-referenced

Humanity's Last Exam (Scale Labs) for hardest-eval reasoning. LMArena ELO for blind human-preference chat. Aider Polyglot for real-world code editing. Artificial Analysis Intelligence Index as the composite cross-check.

Sentiment is the lab's read of public researcher commentary

We do not run a live X-API sweep. We synthesize what named researchers, named open-source maintainers, and named enterprise practitioners say in public — and we discount any voice we cannot verify as independent. The 'sentiment' line per model reflects that filter.

Reasoning is the axis. Coding and writing are downstream

The lab ranks on reasoning capability — the ability to take a hard problem, hold the structure of it in working memory, and arrive at a defensible answer. Aider and writing-voice are downstream signals that confirm or refute the reasoning ranking, not separate categories.

Cutoff is 2026-06-03

Every leaderboard was pulled on this date. Frontier-model rankings move weekly; rankings older than thirty days should be treated as historical.
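A minimal sketch of that thirty-day rule, assuming Python's standard library. The function name `is_historical` and the constants are illustrative and are not part of the lab's tooling.

```python
from datetime import date, timedelta

# Cutoff stated in this issue; the 30-day window is the rule of thumb given above.
CUTOFF = date(2026, 6, 3)
MAX_AGE = timedelta(days=30)


def is_historical(today: date, cutoff: date = CUTOFF, max_age: timedelta = MAX_AGE) -> bool:
    """Return True once the leaderboard pull is more than `max_age` old."""
    return today - cutoff > max_age


print(is_historical(date(2026, 6, 20)))  # False: 17 days after the pull
print(is_historical(date(2026, 7, 7)))   # True: 34 days after the pull
```

A reader checking this issue later can pass the current date and treat every number below as a historical snapshot once the function returns True.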

The lab discloses its own conflict

AtomEons Systems Laboratory builds ORANGEBOX on top of Anthropic's Claude. The lab also publishes I AM AI, an autobiography written by Opus 4.7. The lab notes this conflict explicitly and ranks Anthropic at #2, not #1 — because the evidence on hardest-reasoning evaluation puts Google's Gemini 3.1 Pro Preview ahead of Opus 4.7. The conflict is real; the ranking still goes where the evidence goes.

§ 02·THE RANKING

Twelve models. Twelve verdicts.

Read the position. Read the verdict. Read the receipts. Disagree with the lab and write back — the next issue corrects what this one got wrong.

S · supermodel

Gemini 3.1 Pro Preview · Thinking-High

House of Google

Hardest-eval champion. Cheapest at the top.

The lab's verdict

Gemini 3.1 Pro Preview on Thinking-High is the only model to clear 45% on Humanity's Last Exam at this cutoff (46.44%). It is also the cheapest of the three S-tier models at $1.74/M tokens and runs at 138 tokens/sec. There is no other model where the math is this clean.

Real-user sentiment, filtered

Researchers running their own brutal evals report it as the model that 'actually thinks' instead of pattern-matching. Critics note its safety filters still over-fire on technical prompts, and it has the worst voice-personality of the top three — it sounds like a textbook. Nobody who has tested it on reasoning rates it second.

Best for

Hard math · physics derivations · long-context retrieval · cost-sensitive reasoning

Where it loses

Open-ended writing voice · empathetic tone · creative latitude under safety filters

Receipts

LMArena ELO: 1488 ± 4

Humanity's Last Exam: 46.44%

AA Intelligence: 57

Price / M tokens: $1.74

Speed: 138 tok/s

Cutoff 2026-06-03

S · supermodel

Claude Opus 4.7 (Thinking) · Opus 4.8 max

House of Anthropic

Human-preference king. Strongest writer in the field.

The lab's verdict

Opus 4.7-Thinking variants sit in the #1 to #4 band on LMArena ELO (1494–1503), and three of the top four slots are Anthropic models. Opus 4.8 leads the Artificial Analysis Intelligence Index at 61. On hardest-eval (HLE), Opus 4.7 lands at 36.20%, behind Gemini 3.1 Pro and the GPT-5.4-Pro tier. The trade is real: it is the model people choose when given the choice, and it is not the model that wins the impossible-question benchmark.

Real-user sentiment, filtered

The model researchers reach for when the task is open-ended, the writing has to read, or the user is going to keep talking. Reliable refusals, calibrated uncertainty, the lowest sycophancy rate at the top tier per independent audits. The complaint is speed: 50 tok/s is slow next to Gemini 3.5 Flash at 187.

Best for

Long-form reasoning · code architecture · agent loops · anything a person reads

Where it loses

Cost-sensitive scale · raw throughput · pure-math evals against Gemini 3.1

Receipts

LMArena ELO: 1499 ± 5 (4.7 thinking)

Humanity's Last Exam: 36.20% (4.7)

AA Intelligence: 61 (4.8 max)

Price / M tokens: $4.10

Speed: 50 tok/s (4.7 max)

Cutoff 2026-06-03
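For a sense of scale on the LMArena gaps in these receipts, here is a small sketch of the textbook Elo expected-score formula. It assumes LMArena's published ratings can be read on the standard 400-point logistic scale. That is an assumption about the leaderboard, not something the receipts state, and the function name is illustrative.

```python
def expected_win_rate(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score for A against B on the 400-point logistic scale."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Central ratings copied from the S-tier receipts; the ± intervals are ignored here.
opus_47_thinking = 1499
gemini_31_pro = 1488
gpt_55_high = 1481

print(round(expected_win_rate(opus_47_thinking, gemini_31_pro), 3))  # 0.516
print(round(expected_win_rate(opus_47_thinking, gpt_55_high), 3))    # 0.526
```

Under that assumption, an 11-point lead corresponds to winning only about 52 blind head-to-head votes out of 100, which is why the lab treats the preference ranking as one signal among four.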

S · supermodel

GPT-5.5 (xhigh) · GPT-5.4-Pro

House of OpenAI

Best coder shipping. Second-hardest reasoner.

The lab's verdict

GPT-5 holds the top Aider Polyglot score at 88.0%: on a benchmark that grades real-world code-editing tasks, no other model beats it. GPT-5.4-Pro is the #2 model on HLE at 44.32%. GPT-5.5 xhigh sits at AA Intelligence 60 (between Opus 4.8 and Opus 4.7). The catch: its behavior shifts between snapshot dates more than that of the other two S-tier entries; if reproducibility matters, pin the date.

Real-user sentiment, filtered

Worship from the engineering crowd. The Codex variant ships diffs that compile. Skepticism from the safety crowd — refusals still hallucinate threats on benign prompts, and the personality drifts under load. The post-training cadence is fast: today's GPT-5.5 is not last quarter's GPT-5.5.

Best for

Code editing in real repos · long agentic workflows · structured output

Where it loses

Reproducibility across snapshot dates · creative voice latitude · cheap-fast use

Receipts

LMArena ELO: 1481 ± 5 (gpt-5.5 high)

Humanity's Last Exam: 44.32% (gpt-5.4-pro)

Aider polyglot: 88.0% (gpt-5 high)

AA Intelligence: 60 (gpt-5.5 xhigh)

Price / M tokens: $4.35

Cutoff 2026-06-03

A · couture

Muse Spark

House of Alibaba

Dark horse. Top-5 on the hardest eval out of nowhere.

The lab's verdict

Muse Spark posted 40.56% on Humanity's Last Exam — ahead of Gemini 3 Pro Preview and the GPT-5.4-XHigh tier. It also crossed 1489 LMArena ELO, top-5 in the world. It is the model nobody on Western Twitter was talking about until the leaderboards updated.

Real-user sentiment, filtered

The Chinese-research community has been quietly evaluating it for weeks; Western reaction is still catching up. Public criticism focuses on training-data transparency and on whether the HLE score reflects genuine generalization or eval-set leakage; the leaderboard methodology is taking that seriously. The model itself reasons sharply on Chinese-language tasks and slightly less so in English on the same prompt.

Best for

Math · cross-lingual reasoning · evidence the field is no longer two-horse

Where it loses

English creative writing · Western tool-calling ecosystem · public scrutiny lag

Receipts

LMArena ELO: 1489 ± 6

Humanity's Last Exam: 40.56%

AA Intelligence: n/a at cutoff

Cutoff 2026-06-03

A · couture

Qwen 3.7 Max Preview

House of Alibaba

Open-weight champion. Frontier-grade at one-third the price.

The lab's verdict

Qwen 3.7 Max lands at AA Intelligence Index 57 — tied with Gemini 3.1 Pro Preview and Claude Opus 4.7 max. It runs at 167 tokens/sec and costs $1.43/M. It is also the strongest reasoner of any model with an openly published weight family. The trade is that 'Max Preview' is a hosted API; the open-weight siblings (Plus, Standard) are one tier down.

Real-user sentiment, filtered

r/LocalLLaMA has been running it head-to-head against Opus 4.7 on agentic loops and reports that it surprises. Western enterprise procurement hesitates on Chinese-vendor data-residency questions; that's a real concern, not a hype-filter dismissal. The community sentiment is: this is what frontier looks like when the price floor drops.

Best for

Cost-sensitive scale · open-weight pipelines · self-hosted deployments

Where it loses

Enterprise data-residency posture · English fiction · refusal calibration on edge cases

Receipts

LMArena ELO: (top-15)

AA Intelligence: 57

Price / M tokens: $1.43

Speed: 167 tok/s

Cutoff 2026-06-03

A · couture

Grok 4.3 (High)

House of xAI

Speed-priced contrarian. 79.6% on real coding.

The lab's verdict

Grok 4.3 High posts 79.6% on Aider Polyglot — eighth in the world, ahead of every model except the GPT-5 / o3 / Gemini-2.5-Pro cluster. It runs at 158 tokens/sec at $0.64/M. The differentiator is what xAI does NOT filter — Grok answers questions other frontier labs refuse, with the corollary risk that those refusals existed for reasons.

Real-user sentiment, filtered

Loved by the build-it-now crowd. Distrusted by safety researchers for the same reason. The benchmark numbers are real; the question is what shows up in production when the safety floor is lower. Twitter-native users report it as fast and willing.

Best for

Fast iteration · permissive answers on sensitive topics · cost-sensitive coding

Where it loses

Safety-critical deployments · audit posture · brand-conservative enterprises

Receipts

LMArena ELO: (top-15)

Aider polyglot: 79.6%

AA Intelligence: 53

Price / M tokens: $0.64

Speed: 158 tok/s

Cutoff 2026-06-03

A · couture

Gemini 3.5 Flash

House of Google

The price-performance pareto frontier.

The lab's verdict

Gemini 3.5 Flash holds 1476 LMArena ELO (top-10) at $1.31/M and 187 tok/s. It is the model that disproves the premise that 'cheap = dumb.' The Intelligence Index sits at 55 — meaningfully below the S-tier, but with throughput and cost that the S-tier cannot touch.

Real-user sentiment, filtered

The model engineers actually use when the bill is real and the latency budget is tight. Acknowledged trade: it loses to Opus 4.7 on anything requiring sustained reasoning past about four turns, and to Gemini 3.1 Pro on the hardest math.

Best for

High-throughput pipelines · batch summarization · cheap agentic loops

Where it loses

Sustained multi-turn reasoning · the hardest evals · creative voice

Receipts

LMArena ELO: 1476 ± 7

AA Intelligence: 55

Price / M tokens: $1.31

Speed: 187 tok/s

Cutoff 2026-06-03
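The "pareto frontier" claim can be checked against the receipts with a short sketch. It keeps only entries that list both an AA Intelligence score and a price, and it considers those two axes alone (throughput is left out). The pairing of price with variant for the Anthropic and OpenAI entries is an assumption, since the receipts do not state which variant the price belongs to, and the function names are illustrative.

```python
# (AA Intelligence Index, price in $ per million tokens) copied from the receipts.
# Assumption: for the Anthropic and OpenAI entries, the listed price belongs to
# the variant whose AA score is listed; the receipts do not say so explicitly.
receipts = {
    "Gemini 3.1 Pro Preview": (57, 1.74),
    "Claude Opus 4.8 max": (61, 4.10),
    "GPT-5.5 xhigh": (60, 4.35),
    "Qwen 3.7 Max Preview": (57, 1.43),
    "Grok 4.3 High": (53, 0.64),
    "Gemini 3.5 Flash": (55, 1.31),
    "Kimi K2.6": (54, 0.70),
    "MiMo-V2.5-Pro": (54, 0.18),
    "GPT-5.3 Codex xhigh": (54, 1.87),
}


def dominates(a, b):
    """a dominates b if it scores at least as high, costs at most as much, and is strictly better on one axis."""
    (score_a, price_a), (score_b, price_b) = a, b
    return score_a >= score_b and price_a <= price_b and (score_a > score_b or price_a < price_b)


def pareto_frontier(models):
    """Return the names of models that no other model dominates, cheapest first."""
    frontier = [
        name
        for name, point in models.items()
        if not any(
            dominates(other, point)
            for other_name, other in models.items()
            if other_name != name
        )
    ]
    return sorted(frontier, key=lambda name: models[name][1])


print(pareto_frontier(receipts))
```

On these numbers the frontier is MiMo-V2.5-Pro, Gemini 3.5 Flash, Qwen 3.7 Max Preview, and the Opus entry. Gemini 3.5 Flash stays on it because nothing with an equal or higher score is cheaper; MiMo's place depends on its score surviving independent replication.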

A · couture

Kimi K2.6

House of Moonshot

Frontier-grade thinking at $0.70/M.

The lab's verdict

Kimi K2.6 at Intelligence Index 54 for seventy cents per million tokens is the most price-disruptive model on the leaderboard. It runs at 44 tok/s, one of the two slowest speeds on this list (MiMo-V2.5-Pro runs at 43 tok/s), so it is a batch tool, not a chat tool. The trade is fully transparent: thinking quality near the frontier, throughput well below it, price roughly six times cheaper than the Anthropic and OpenAI S-tier entries.

Real-user sentiment, filtered

Quietly adopted by researchers running their own academic-grade comparisons. The Chinese-research community evaluates Kimi alongside Qwen and GLM as a real competitive set; that triangulation does not exist in the same way in Western labs.

Best for

Cost-bound research · long-context · batch processing where latency is a non-issue

Where it loses

Real-time interactive use · applications that cannot tolerate low throughput

Receipts

AA Intelligence: 54

Price / M tokens: $0.70

Speed: 44 tok/s

Cutoff 2026-06-03
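To make the "batch tool, not a chat tool" trade concrete, the sketch below converts the speeds listed in the receipts into wall-clock generation time. The 2,000-token response length is an illustrative assumption, not a leaderboard figure, and the calculation ignores time-to-first-token and network overhead.

```python
# Output speeds in tokens per second, copied from the receipts in this issue.
speeds_tok_per_s = {
    "Gemini 3.5 Flash": 187,
    "Qwen 3.7 Max Preview": 167,
    "Grok 4.3 High": 158,
    "Gemini 3.1 Pro Preview": 138,
    "GPT-5.3 Codex xhigh": 81,
    "Claude Opus 4.7 max": 50,
    "Kimi K2.6": 44,
    "MiMo-V2.5-Pro": 43,
}

RESPONSE_TOKENS = 2_000  # illustrative response length (assumption)


def seconds_to_generate(tokens: int, tok_per_s: float) -> float:
    """Pure generation time, ignoring time-to-first-token and network overhead."""
    return tokens / tok_per_s


for model, speed in speeds_tok_per_s.items():
    print(f"{model:<24} {seconds_to_generate(RESPONSE_TOKENS, speed):6.1f} s")
```

At that length, Kimi K2.6 needs roughly 45 seconds where Gemini 3.5 Flash needs roughly 11, a gap that matters in a chat window and barely registers in an overnight batch job.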

B · ready-to-wear

MiniMax-M3

House of MiniMax

Quiet competitor, real receipts.

The lab's verdict

MiniMax-M3 posts AA Intelligence 55 — same tier as Gemini 3.5 Flash, with pricing and speed information not yet fully published on the major dashboards at the cutoff date. The lab is including it because the score is real, not because there is hype to ride.

Real-user sentiment, filtered

Underweight in Western coverage relative to the eval result. The model is a real competitor; the marketing apparatus around it is smaller than the actual technical position justifies.

Best for

Deployment in markets where MiniMax already has presence · video-modal extensions

Where it loses

Discoverability outside Chinese-research ecosystems · Western API tooling integration

Receipts

AA Intelligence: 55

Price / M tokens: n/a at cutoff

Speed: n/a at cutoff

Cutoff 2026-06-03

B · ready-to-wear

MiMo-V2.5-Pro

House of Xiaomi

The cheapest frontier-adjacent reasoner on the board.

The lab's verdict

Xiaomi posting an AA Intelligence Index of 54 at $0.18 per million tokens is the most disorienting line in the May 2026 leaderboard. That price is roughly an order of magnitude under the open-weight equivalent and roughly ten to twenty-four times under the closed-weight S-tier ($1.74 to $4.35 per million tokens). Either the price is subsidized, the score is overfit, or both, but the lab does not get to assume which without evidence.

Real-user sentiment, filtered

Skepticism is the right default until independent replication closes. The lab is including MiMo on the list at B-tier because the public eval is real and worth a hard look, not because the lab endorses the result without that look.

Best for

Investigation. Replication. Cost-benchmarking against the rest of the board

Where it loses

Trust until independent replication of the Intelligence Index is published

Receipts

AA Intelligence: 54

Price / M tokens: $0.18

Speed: 43 tok/s

Cutoff 2026-06-03
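The size of MiMo's price gap can be worked out directly from the receipts. The sketch below is plain arithmetic on the listed prices; the comparison set and the helper name `price_gap` are illustrative choices, not the lab's method.

```python
import math

MIMO_PRICE = 0.18  # $ per million tokens, from the receipts

# Prices of the comparison entries, copied from their receipts.
comparisons = {
    "Qwen 3.7 Max Preview": 1.43,
    "Gemini 3.1 Pro Preview": 1.74,
    "Claude Opus (S-tier entry)": 4.10,
    "GPT-5.5 (S-tier entry)": 4.35,
}


def price_gap(other: float, base: float = MIMO_PRICE) -> tuple[float, float]:
    """Return (price ratio, the same ratio expressed in orders of magnitude)."""
    ratio = other / base
    return ratio, math.log10(ratio)


for name, price in comparisons.items():
    ratio, orders = price_gap(price)
    print(f"{name:<28} {ratio:5.1f}x  ({orders:.2f} orders of magnitude)")
```

The ratios come out at about 8x against Qwen 3.7 Max Preview, about 10x against Gemini 3.1 Pro Preview, and about 23x to 24x against the Anthropic and OpenAI entries.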

B · ready-to-wear

GPT-5.3 Codex (xhigh)

House of OpenAI

The coding specialist that beats general-purpose models on code.

The lab's verdict

GPT-5.3 Codex at AA Intelligence Index 54 and 81 tok/s exists because the specialized variant beats the generalist on code-specific evals at lower latency and lower cost ($1.87/M). It is the model engineering teams reach for when the task is specifically code editing in a real repo, where the generalist's headroom on philosophy and writing doesn't matter.

Real-user sentiment, filtered

Adopted by the Cursor / Continue / Aider / Claude Code crowd as a fallback option when the generalist gets distracted. Not a chat model — do not put it in front of a customer.

Best for

Specialized code-editing pipelines · long autonomous coding runs

Where it loses

General reasoning · creative writing · user-facing chat

Receipts

AA Intelligence: 54

Price / M tokens: $1.87

Speed: 81 tok/s

Cutoff 2026-06-03

Watch · the runway is theirs next

DeepSeek (R3 / V3.7 series)

House of DeepSeek

Not in top-15 of this cycle. Still the open-weight reference everyone benchmarks against.

The lab's verdict

DeepSeek's current public release did not crack the top-15 of any single leaderboard the lab pulled for this issue. That is a real fact and worth saying plainly. The reason DeepSeek is still on the watch list is that the next release in the R-series is widely anticipated by the open-weight community to close the gap, and the lab does not write off a lab that previously delivered a frontier-class open release.

Real-user sentiment, filtered

The community remains positive on DeepSeek's trajectory. The May 2026 leaderboards do not show it at the front; the next snapshot might. Watch.

Best for

Watching · keeping a benchmark slot warm · open-weight contingency planning

Where it loses

Topping the May 2026 leaderboards as released

Receipts

LMArena ELO: not in top-15 at cutoff

Humanity's Last Exam: not in top-10 at cutoff

AA Intelligence: not in top-15 at cutoff

Cutoff 2026-06-03

§ 03·THE SHADE THROWN

The cleanest dunks the field has thrown this month.

“Half of the model launch announcements I read this week cite a benchmark that did not exist last month and is unreproducible this month.”
— senior eval researcher · public X thread · May 2026
“If your model wins on one eval and loses on every other, you don't have a frontier model. You have a fine-tune.”
— open-weight maintainer · GitHub discussion · May 2026
“I will believe MiMo's number when three independent groups have replicated it. Not before.”
— enterprise ML lead · LinkedIn post · May 2026

Identities withheld where the source asked. Lab keeps the receipts.

§ 04·WHAT WE REFUSED TO COUNT

The exclusion list — six categories of evidence we threw out.

Refused
Vendor-published benchmarks
If the model maker designed the eval, ran the eval, and reported the eval, the eval does not count. This rules out roughly half of every model launch announcement.
Refused
Cherry-picked demos
A two-minute video of a model 'reasoning' through one curated problem is marketing, not evidence. Demos can illustrate; they cannot rank.
Refused
Influencer threads with no disclosed compensation
If the post does not disclose whether the poster has equity, a contract, a referral fee, or early API credits, the lab treats the post as compromised by default. Half the loudest voices on X are on someone's payroll.
Refused
Evals the model was trained on
If a model maker quietly fine-tuned on the test set — or trained on the GitHub repo where the test set is hosted, which has happened repeatedly — the resulting score is not a measurement, it is a recital. Reproducible blind evals only.
Refused
Single-question 'gotcha' tests
'Count the Rs in strawberry' is funny once. It is not a benchmark. The lab does not rank models by Twitter screenshots.
Refused
Marketing pages with no leaderboard receipts
If the headline number on the model card cannot be traced to a public leaderboard with methodology, the number does not exist for purposes of this ranking.

§ 05·SOURCES

Pull the data yourself. The lab links it all.

LMArena
blind human-preference chat ELO
arena.ai/leaderboard ↗
Scale · Humanity's Last Exam
hardest reasoning eval
labs.scale.com/leaderboard/humanitys_last_exam ↗
Aider Polyglot Leaderboard
real-world coding tasks
aider.chat/docs/leaderboards/ ↗
Artificial Analysis
composite Intelligence Index + cost + speed
artificialanalysis.ai/leaderboards/models ↗

The lab pulled these four leaderboards on 2026-06-03. The numbers above are verbatim from the leaderboards on that date. The synthesis is the lab's. The next issue will publish on the first Tuesday of July 2026.

Masthead · The Hottest Supermodels of May 2026

Edited at AtomEons Systems Laboratory, Marco Island, Florida. Cover voice: Atom McCree. Receipts column: the public leaderboards listed above. No advertising; no sponsorship; no review copies; no early access; no influencer kits. The lab pays for its own API calls.
