2022-01
Chain-of-thought prompting (Wei et al.)
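Showed that few-shot prompts whose exemplars include worked, step-by-step reasoning substantially improve arithmetic + commonsense + symbolic reasoning in large models. (The zero-shot 'let's think step by step' trigger comes from Kojima et al., later in 2022.) Suggested that the underlying capability was already present in sufficiently large pretrained models and only had to be elicited. Set the stage for everything that followed.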
2022-03
Self-consistency (Wang et al.)
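Sample K chain-of-thought completions at nonzero temperature, extract the final answer from each, and take the most frequent one. Robust improvement over single-sample (greedy) CoT. Introduced the 'sample many, aggregate' inference-time pattern that foreshadowed o1's much heavier use of inference-time compute.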
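The voting step is small enough to sketch directly. The following is a minimal illustration, not the paper's code: `sample` is a hypothetical stand-in for a temperature-sampled model call, and the 'Answer: ...' final-line convention is an assumption made here so that answers can be parsed.

```python
import re
from collections import Counter
from typing import Callable, Optional


def extract_answer(completion: str) -> Optional[str]:
    """Return the text after 'Answer:' on the last line, or None if absent.

    The 'Answer: <value>' convention is an assumption of this sketch.
    """
    match = re.search(r"Answer:\s*(.+?)\s*$", completion.strip())
    return match.group(1) if match else None


def self_consistency(sample: Callable[[str], str], prompt: str, k: int = 5) -> Optional[str]:
    """Draw k reasoning chains and return the most frequent final answer."""
    answers = [extract_answer(sample(prompt)) for _ in range(k)]
    answers = [a for a in answers if a is not None]  # drop unparseable chains
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


# Hypothetical stub in place of a real model: five canned "samples".
canned = iter([
    "3 + 4 = 7, and 7 * 2 = 14. Answer: 14",
    "3 + 4 = 7, doubled is 14. Answer: 14",
    "3 * 2 = 6, and 6 + 4 = 10. Answer: 10",
    "I am not sure.",
    "(3 + 4) * 2 = 14. Answer: 14",
])


def stub_sampler(prompt: str) -> str:
    return next(canned)


result = self_consistency(stub_sampler, "What is (3 + 4) * 2?", k=5)
print(result)  # 14
```

One wrong chain and one unparseable chain are outvoted by three chains that agree, which is the whole point of the method: individual chains can fail as long as the correct answer remains the mode.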
2023-05
Tree-of-Thoughts (Yao et al.)
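Don't just sample many linear chains: search a tree of possible reasoning paths, score partial paths, and backtrack from dead ends. Often cited as a conceptual ancestor of o1-style reasoning, although OpenAI has not disclosed whether o1 performs explicit tree search.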
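The search-and-backtrack idea can be sketched as a depth-first search over partial solutions. This is a toy illustration under stated assumptions, not the paper's implementation: `propose`, `evaluate`, and `is_goal` are hypothetical stand-ins for what would be model calls, and the task (pick numbers that sum to a target) is chosen only because it makes a dead end easy to see. The paper also describes a breadth-first variant, which is not shown here.

```python
from typing import Callable, List, Optional, Tuple

State = Tuple[int, ...]  # indices of the numbers chosen so far


def tree_search(
    state: State,
    propose: Callable[[State], List[State]],
    evaluate: Callable[[State], float],
    is_goal: Callable[[State], bool],
    max_depth: int,
    threshold: float = 0.5,
    trace: Optional[List[State]] = None,
) -> Optional[State]:
    """Depth-first search that expands promising states first and backtracks."""
    if trace is not None:
        trace.append(state)  # record every state that gets expanded
    if is_goal(state):
        return state
    if max_depth == 0:
        return None
    # Score each candidate once, then try the highest-scoring first.
    scored = sorted(((evaluate(c), c) for c in propose(state)),
                    key=lambda pair: pair[0], reverse=True)
    for score, child in scored:
        if score < threshold:
            continue  # prune branches the evaluator considers hopeless
        found = tree_search(child, propose, evaluate, is_goal,
                            max_depth - 1, threshold, trace)
        if found is not None:
            return found
    return None  # dead end: the caller moves on to its next candidate


# Toy task (an assumption of this sketch): choose numbers summing to TARGET.
NUMBERS = (9, 6, 4, 5)
TARGET = 10


def total(state: State) -> int:
    return sum(NUMBERS[i] for i in state)


def propose(state: State) -> List[State]:
    start = state[-1] + 1 if state else 0
    return [state + (i,) for i in range(start, len(NUMBERS))]


def evaluate(state: State) -> float:
    # 0.0 once the sum overshoots; otherwise higher the closer to TARGET.
    s = total(state)
    return 0.0 if s > TARGET else 0.5 + 0.5 * s / TARGET


def is_goal(state: State) -> bool:
    return total(state) == TARGET


visited: List[State] = []
solution = tree_search((), propose, evaluate, is_goal, max_depth=4, trace=visited)
print([NUMBERS[i] for i in solution])  # [6, 4]
print(visited)  # [(), (0,), (1,), (1, 2)]
```

The trace shows the backtrack: the search first commits to 9 because it scores highest, finds every extension overshoots, abandons that branch, and only then reaches 6 + 4. A single linear chain that started with 9 would have had no way to recover.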
2024-09
OpenAI o1 (preview + then GA)
First publicly available production reasoning model. Spends significantly more compute at inference time generating long internal chains of thought before producing a final answer. AIME, GPQA, Codeforces scores jump substantially over GPT-4o. Reasoning chain is hidden from the user (OpenAI cites safety + competitive reasons).
2024-12
OpenAI o3 (preview)
Follow-up to o1 with substantially better scores on hard benchmarks. ARC-AGI-1: 87.5% in the high-compute configuration (vs ~5% for GPT-4o). FrontierMath: ~25% (vs ~2% for previous models). Strengthened the case that inference-time compute is a scaling axis alongside training compute.
2025-01
DeepSeek-R1
Open-weights reasoning model from the Chinese lab DeepSeek that matched o1 on multiple public benchmarks. Critically, it shipped with a technical report describing the training method (R1-Zero trained with pure RL and no supervised fine-tuning; R1 adding a small cold-start SFT stage before RL; then distillation into smaller dense models), opening the recipe to the broader research community. Spawned a wave of open-weight reasoning models.
2025-02
Gemini 2.0 Flash Thinking + Gemini 2.5 Thinking
Google's reasoning-mode variants. Like o1, they generate internal chains of thought; unlike o1, reasoning traces are visible to the user. Strong on math + science benchmarks. They build on Google's existing strengths in multimodal + long-context handling.
2025-05
Claude Opus 4 + Sonnet 4 (Extended Thinking)
Anthropic's reasoning-mode variants. User can choose 'extended thinking' on a per-query basis. Reasoning traces visible, similar to Gemini. Strong on coding + agentic benchmarks. Embodies the 'reasoning mode is a toggle, not a separate model' productization choice.