
Reasoning models · the atlas

The o1/R1 paradigm. What changed.

In September 2024, OpenAI shipped o1. The field has been catching up + diversifying ever since. This page is the honest walk: how reasoning models actually work, what they're for, what they don't fix, and which one to reach for when you have a hard problem.

History · how we got here

Eight milestones.

2022-01

Chain-of-thought prompting (Wei et al.)

Showed that few-shot prompts containing worked, step-by-step examples substantially improve arithmetic + commonsense + symbolic reasoning in large models. (The zero-shot 'let's think step by step' trigger came a few months later, from Kojima et al.) Established that the underlying capability was already there in pretrained models; they just had to be coaxed to use it. Set the stage for everything that followed.

2022-03

Self-consistency (Wang et al.)

Run chain-of-thought K times with sampling, take the most-frequent final answer. Robust improvement over a single greedy CoT pass. Introduced the 'sample many, aggregate' inference-time pattern that o1 would later industrialize.
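The 'sample many, aggregate' step can be sketched in a few lines. This is a minimal illustration, not the paper's code: `sample_answer` is a hypothetical callable standing in for one stochastic chain-of-thought run that returns only the final answer, and the replayed list of answers is made up for the demo.

```python
from collections import Counter
from typing import Callable


def self_consistency(sample_answer: Callable[[], str], k: int) -> str:
    """Sample k final answers and return the most frequent one."""
    answers = [sample_answer() for _ in range(k)]
    # most_common(1) returns [(answer, count)] for the top answer.
    return Counter(answers).most_common(1)[0][0]


# Stand-in for a stochastic model: replays a fixed list of sampled answers.
fake_samples = iter(["42", "41", "42", "42", "17"])
result = self_consistency(lambda: next(fake_samples), k=5)
print(result)
```

Only the final answers are compared; the reasoning text of each chain is thrown away once its answer has been extracted.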

2023-05

Tree-of-Thoughts (Yao et al.)

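Don't just sample many linear chains: search a tree of possible reasoning paths and backtrack from dead ends. Often cited as a conceptual ancestor of the search-like behavior in later reasoning models, although OpenAI has not disclosed whether o1 searches an explicit tree.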
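A toy version of 'search a tree, backtrack from dead ends' might look like the sketch below. It is an assumption-laden illustration: in the real method a language model proposes and evaluates the candidate steps, whereas here `propose`, `is_solution`, and `is_promising` are hypothetical callbacks and the 'reasoning steps' are just arithmetic moves from 1 toward 10.

```python
from typing import Callable, Optional


def tree_search(
    state: list[str],
    propose: Callable[[list[str]], list[str]],
    is_solution: Callable[[list[str]], bool],
    is_promising: Callable[[list[str]], bool],
    max_depth: int,
) -> Optional[list[str]]:
    """Depth-first search over partial reasoning paths with backtracking."""
    if is_solution(state):
        return state
    if len(state) >= max_depth:
        return None  # dead end: path is too long
    for step in propose(state):
        candidate = state + [step]
        if not is_promising(candidate):
            continue  # prune this branch without expanding it
        found = tree_search(candidate, propose, is_solution, is_promising, max_depth)
        if found is not None:
            return found
    return None  # every child failed: backtrack to the caller


START, TARGET = 1, 10


def value_of(steps: list[str]) -> int:
    """Apply the toy steps ('*2' or '+3') to START."""
    value = START
    for step in steps:
        value = value * 2 if step == "*2" else value + 3
    return value


path = tree_search(
    state=[],
    propose=lambda s: ["*2", "+3"],
    is_solution=lambda s: value_of(s) == TARGET,
    is_promising=lambda s: value_of(s) <= TARGET,  # overshooting is a dead end
    max_depth=4,
)
print(path, value_of(path))
```

The search first doubles three times, finds that both continuations overshoot, and backs up one level before finding a path that lands on the target.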

2024-09

OpenAI o1 (preview + then GA)

First publicly available production reasoning model. Spends significantly more compute at inference time generating long internal chains of thought before producing a final answer. AIME, GPQA, Codeforces scores jump substantially over GPT-4o. Reasoning chain is hidden from the user (OpenAI cites safety + competitive reasons).

2024-12

OpenAI o3 (preview)

Follow-up to o1 with substantially better scores on hard benchmarks. ARC-AGI-1: 87.5% in the high-compute configuration (vs ~5% for GPT-4o). FrontierMath: ~25% (vs ~2% for previous models). Strengthened the case that inference-time compute is a scaling axis alongside training compute, with accuracy climbing roughly with the log of the compute spent.

2025-01

DeepSeek-R1

Open-weights Chinese reasoning model that matched o1 performance on multiple public benchmarks. Critically: released the technical report describing the training method (pure-RL R1-Zero, the multi-stage R1 pipeline built on top of it, then distillation into smaller models), opening the recipe to the broader research community. Spawned a wave of open-weight reasoning models.

2025-02

Gemini 2.0 Flash Thinking + Gemini 2.5 Thinking

Google's reasoning-mode variants. Like o1, generates internal chains of thought; unlike o1, reasoning traces are visible to the user. Strong on math + science benchmarks. Pairs with Google's substantial multimodal + long-context advantage.

2025-05

Claude Opus 4 + Sonnet 4 (Extended Thinking)

Anthropic's reasoning-mode variants. User can choose 'extended thinking' on a per-query basis, a feature first introduced with Claude 3.7 Sonnet in February 2025. Reasoning traces visible, similar to Gemini. Strong on coding + agentic benchmarks. Embodies the 'reasoning mode is a toggle, not a separate model' productization choice.

How they actually work

Five mechanisms.

Long chain-of-thought generation

The model generates much longer internal reasoning before producing a final answer — sometimes thousands of tokens of working-out for a single response. The reasoning resembles a person scratching out solutions, considering alternatives, backtracking, re-checking.

RL on outcome rewards (DeepSeek-R1-Zero pattern)

Reinforcement learning where the reward signal is the correctness of the final answer (not process — just outcome). Surprisingly, the model spontaneously learns to use longer chains of thought because doing so improves correctness. Demonstrated openly by DeepSeek R1-Zero before R1's full pipeline.
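The outcome-only reward can be sketched as below. Two assumptions are mine rather than this page's: that the final answer is marked with `\boxed{...}`, and that rewards are turned into a training signal by normalizing within a group of samples for the same prompt (the group-relative idea behind GRPO, the algorithm named in the R1 report). The policy-gradient update itself is omitted.

```python
import re
from statistics import mean, pstdev


def outcome_reward(completion: str, reference: str) -> float:
    """Return 1.0 if the last boxed answer matches the reference, else 0.0."""
    answers = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return 1.0 if answers and answers[-1].strip() == reference.strip() else 0.0


def group_advantages(rewards: list[float]) -> list[float]:
    """Normalize rewards across samples drawn for the same prompt."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]  # all samples tied: no learning signal
    return [(r - mu) / sigma for r in rewards]


# Four hypothetical samples for the prompt "What is 2 + 2?"
completions = [
    r"Two plus two... so the answer is \boxed{4}",
    r"I think it is \boxed{5}",
    "I am not sure.",
    r"First \boxed{3}, wait, re-checking: \boxed{ 4 }",
]
rewards = [outcome_reward(c, "4") for c in completions]
advantages = group_advantages(rewards)
print(rewards, advantages)
```

Nothing in the reward looks at the reasoning text, so any pressure toward longer or more careful chains comes only from their effect on final-answer correctness.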

Distillation from larger reasoning models

Once you have a powerful reasoning model, you can generate training data for smaller models by having the big model solve problems and using its reasoning traces as the training signal. DeepSeek R1 was distilled into smaller Qwen-based (1.5B, 7B, 14B, 32B) and Llama-based (8B, 70B) models, which achieved strong reasoning at much smaller scale.
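Building the distillation set can be sketched as a filter over teacher outputs. This is a simplified illustration under stated assumptions: the `Trace` type, the `<think>` wrapper in the completion, and the keep-only-correct rule are choices made for the example, not a description of any lab's exact pipeline.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    problem: str
    reasoning: str  # the teacher's chain of thought
    answer: str     # the teacher's final answer


def build_sft_examples(traces: list[Trace], references: dict[str, str]) -> list[dict[str, str]]:
    """Keep teacher traces with a correct final answer; format them for fine-tuning."""
    examples = []
    for t in traces:
        if references.get(t.problem) != t.answer:
            continue  # drop traces that reached a wrong (or unverifiable) answer
        examples.append({
            "prompt": t.problem,
            "completion": f"<think>{t.reasoning}</think>\n{t.answer}",
        })
    return examples


traces = [
    Trace("2 + 2", "Two plus two is four.", "4"),
    Trace("2 + 2", "Two plus two is five.", "5"),
    Trace("3 * 3", "Three threes are nine.", "9"),
]
examples = build_sft_examples(traces, references={"2 + 2": "4", "3 * 3": "9"})
print(len(examples))
```

The student is then trained with ordinary supervised fine-tuning on these prompt and completion pairs; no reinforcement learning is needed on the student side.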

Process-reward models

Optional addition: train a separate reward model that scores intermediate reasoning steps (not just final answers). Used by some labs to guide the chain-of-thought search. Process rewards are harder to define + train than outcome rewards but can produce more legible reasoning.
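The difference from an outcome reward is easiest to see in how a chain gets its score. In the sketch below the per-step scores are made-up numbers standing in for the output of a trained process-reward model, and the two aggregation rules (weakest step, product of steps) are common choices rather than anything this page specifies.

```python
import math


def chain_score(step_scores: list[float], how: str = "min") -> float:
    """Collapse per-step scores in [0, 1] into one score for the whole chain."""
    if not step_scores:
        raise ValueError("a chain needs at least one scored step")
    if how == "min":
        return min(step_scores)        # a chain is as good as its weakest step
    if how == "prod":
        return math.prod(step_scores)  # treats scores as independent step probabilities
    raise ValueError(f"unknown aggregation: {how}")


# Hypothetical step scores for two chains that end in the same final answer.
chain_a = [0.9, 0.9, 0.2]  # one shaky step in an otherwise confident chain
chain_b = [0.7, 0.7, 0.7]  # uniformly decent steps

print(chain_score(chain_a), chain_score(chain_b))
print(chain_score(chain_a, "prod"), chain_score(chain_b, "prod"))
```

An outcome reward would rate these two chains identically; both aggregations here prefer the chain without the weak step.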

Sampling + best-of-K + tree search at inference

Any sampling-based model can be run K times on the same problem, with the best candidate picked afterwards (by self-consistency, by reward model, by verifier). This is what makes inference-time compute a scalable axis: spend more compute → see better answers, with diminishing returns per extra sample.
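A rough sketch of best-of-K selection, plus the arithmetic behind 'more compute, better answers'. The `coverage` formula assumes independent samples and a selector that always recognizes a correct candidate; real verifiers and reward models fall short of that. The scoring function and candidate strings are hypothetical.

```python
from typing import Callable


def best_of_k(candidates: list[str], score: Callable[[str], float]) -> str:
    """Return the candidate with the highest score from a verifier or reward model."""
    return max(candidates, key=score)


def coverage(p: float, k: int) -> float:
    """Chance that at least one of k independent samples is correct,
    if each sample is correct with probability p."""
    return 1.0 - (1.0 - p) ** k


# Toy verifier: checks a candidate answer by substituting it into x * x == 49.
candidates = ["6", "7", "8", "49"]
picked = best_of_k(candidates, score=lambda c: 1.0 if int(c) * int(c) == 49 else 0.0)
print(picked)

# Diminishing returns: each 4x increase in samples buys less than the one before.
for k in (1, 4, 16):
    print(k, round(coverage(0.2, k), 4))
```

With a 20% per-sample success rate, going from 1 to 4 samples adds about 39 points of coverage, while going from 4 to 16 adds about 38 more for four times the extra compute.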

When to reach for one

Five honest tradeoffs.

Use a reasoning model for

Hard math problems (AIME-level and up). Competitive-programming problems. Multi-step logical proofs. Complex code synthesis where correctness matters more than speed. Scientific reasoning (GPQA-level). ARC-AGI-style abstract puzzles.

Don't use a reasoning model for

Quick chat. Summarization. Translation. Style transfer. Most retrieval-augmented generation (the reasoning is wasted if the answer is already in the context). Real-time conversational use cases where latency matters.

Cost tradeoff

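Reasoning models typically emit 5-20× more output tokens per query than their non-reasoning siblings, because the reasoning tokens are billed as output. If you're paying per token, expect the output side of the bill to grow by the same factor; the total grows less when prompts are long. If you're paying flat-rate (ChatGPT Plus, Claude Pro), the cost shows up as tighter usage limits instead.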
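A quick worked example of the per-token arithmetic. The prices and token counts below are invented for illustration (they are not any provider's actual rates); the point is only how the output multiplier feeds into the total.

```python
def query_cost(input_tokens: int, output_tokens: int,
               in_price_per_m: float, out_price_per_m: float) -> float:
    """Dollar cost of one query, given per-million-token prices."""
    return (input_tokens * in_price_per_m + output_tokens * out_price_per_m) / 1_000_000


# Hypothetical prices: $2 per million input tokens, $8 per million output tokens.
IN_PRICE, OUT_PRICE = 2.0, 8.0

direct = query_cost(2_000, 500, IN_PRICE, OUT_PRICE)       # short direct answer
reasoning = query_cost(2_000, 5_000, IN_PRICE, OUT_PRICE)  # 10x the output tokens

print(f"direct: ${direct:.4f}  reasoning: ${reasoning:.4f}  ratio: {reasoning / direct:.1f}x")
```

Here a 10× jump in output tokens produces a 5.5× jump in the bill, because the input side of the query costs the same either way.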

Latency tradeoff

First-token latency on reasoning models is much longer — sometimes 10-60 seconds for hard queries vs sub-second for direct models. UX considerations matter: build for the wait.

What they don't fix

Reasoning models can hallucinate just as readily as non-reasoning models when the answer requires factual recall they don't have. Reasoning is a procedural improvement, not a knowledge improvement. RAG is still required for factual grounding.
