What is chain-of-thought prompting?

The short answer

Chain-of-thought (CoT) prompting is a technique introduced by Google Research in 2022 that elicits multi-step reasoning from large language models by prompting them to produce intermediate reasoning steps before a final answer, instead of jumping straight to the output. On the GSM8K math benchmark, CoT prompting raised PaLM 540B's accuracy from 17.9% to 56.9%, a gain large enough to help motivate the reasoning-model research direction that later produced OpenAI o1 and DeepSeek-R1.

The longer answer

Chain-of-thought prompting was formalized by Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou of Google Research in the paper Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (arXiv:2201.11903), first posted January 28, 2022 and presented at NeurIPS 2022. The technique is structurally simple: in a few-shot prompt, each exemplar is rewritten so that the answer is preceded by a natural-language explanation of how the answer was derived. The model, conditioned on these traces, generates its own trace before its own answer on the test query.
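The prompt structure described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: the `Exemplar` class, the `build_cot_prompt` helper, the "Q:/A:" layout, and the apple problem are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Exemplar:
    question: str
    rationale: str  # natural-language reasoning that precedes the answer
    answer: str


def build_cot_prompt(exemplars: list[Exemplar], query: str) -> str:
    """Format worked exemplars, then the test question with an open answer slot."""
    blocks = [
        f"Q: {ex.question}\nA: {ex.rationale} The answer is {ex.answer}."
        for ex in exemplars
    ]
    # The prompt stops at "A:" so the model continues with its own trace.
    blocks.append(f"Q: {query}\nA:")
    return "\n\n".join(blocks)


# Illustrative exemplar (hypothetical, not taken from the paper).
exemplars = [
    Exemplar(
        question="A farmer has 3 crates with 4 apples each and eats 2 apples. "
        "How many apples are left?",
        rationale="3 crates of 4 apples is 12 apples. 12 - 2 = 10.",
        answer="10",
    )
]

prompt = build_cot_prompt(
    exemplars, "A shelf holds 5 boxes of 6 pens. 7 pens are sold. How many remain?"
)
print(prompt)
```

A standard few-shot prompt would be the same string with the rationale removed from each exemplar, which is the only difference the technique introduces.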

The empirical claim that made the paper canonical is a scale threshold. Below roughly 100 billion parameters, CoT prompting was flat or harmful on arithmetic and symbolic reasoning. At 540 billion parameters (PaLM), CoT lifted GSM8K from 17.9% to 56.9% accuracy, MultiArith from 42.2% to 94.7%, and SVAMP from 69.4% to 79.0%. The result became one of the most widely cited examples of "emergent abilities", capabilities that appear discontinuously with scale (arXiv:2206.07682, Wei et al., 2022).

Three closely related variants followed. Kojima et al. (arXiv:2205.11916, May 2022) showed that simply appending "Let's think step by step" to a zero-shot prompt produced large gains on GPT-3 and PaLM, eliminating the need for hand-written exemplars. This became known as zero-shot CoT. Wang et al. (arXiv:2203.11171, March 2022) introduced self-consistency, in which the model samples multiple CoT traces at non-zero temperature and the final answer is chosen by majority vote, pushing PaLM 540B to 74.4% on GSM8K. Yao et al. (arXiv:2305.10601, May 2023) generalized the linear chain into Tree-of-Thoughts, which searches over branching reasoning steps and raised GPT-4's success rate on Game of 24 from 4% with CoT prompting to 74%.
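Zero-shot CoT and self-consistency are simple enough to sketch together. The code below is an illustration under stated assumptions: `sample` stands in for any call that returns one sampled reasoning trace from a model, the answer extractor just takes the last number in the trace, and the canned traces are invented so the example runs without a model. The single-stage prompt is also a simplification of the prompting used in the zero-shot CoT paper.

```python
import re
from collections import Counter
from typing import Callable, Optional

ZERO_SHOT_TRIGGER = "Let's think step by step."


def zero_shot_cot_prompt(question: str) -> str:
    """Append the trigger phrase instead of supplying worked exemplars."""
    return f"Q: {question}\nA: {ZERO_SHOT_TRIGGER}"


def extract_answer(trace: str) -> Optional[str]:
    """Return the last number in a trace, or None if it contains no number."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", trace.replace(",", ""))
    return numbers[-1] if numbers else None


def self_consistent_answer(
    sample: Callable[[str], str], prompt: str, n: int = 5
) -> Optional[str]:
    """Sample n traces and return the most common extracted answer."""
    answers = [extract_answer(sample(prompt)) for _ in range(n)]
    votes = Counter(a for a in answers if a is not None)
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical stand-in for a model sampled at non-zero temperature.
canned = iter(
    [
        "3 crates of 4 is 12. 12 - 2 = 10. The answer is 10.",
        "3 + 4 = 7. 7 - 2 = 5. The answer is 5.",
        "4 * 3 = 12, and eating 2 leaves 10. The answer is 10.",
    ]
)


def fake_sample(prompt: str) -> str:
    return next(canned)


question = "A farmer has 3 crates with 4 apples each and eats 2. How many are left?"
result = self_consistent_answer(fake_sample, zero_shot_cot_prompt(question), n=3)
print(result)
```

The vote is taken over final answers rather than over whole traces, so two chains that reason differently but agree on the result still count together.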

The technique generalized far beyond arithmetic. CoT-style prompting is now common practice for code generation (HumanEval, MBPP), commonsense reasoning (CommonsenseQA, StrategyQA), and symbolic manipulation. Anthropic's Claude, OpenAI's GPT-4o, and Google's Gemini all respond to CoT-style prompts, and some deployed system prompts reportedly include CoT scaffolding.

The deeper consequence was a shift in training. By 2024, frontier labs had moved CoT from a prompting trick into the training process itself. OpenAI's o1, released in preview on September 12, 2024, was trained with reinforcement learning to produce long internal chains of thought, and OpenAI reported that its accuracy improves with more test-time "thinking" compute as well as more training compute. DeepSeek-R1 (arXiv:2501.12948, January 2025) followed with open weights; its R1-Zero variant showed that reinforcement learning on verifiable rewards alone, without supervised fine-tuning, is sufficient to induce long CoT behavior, while R1 itself adds a small supervised cold-start stage. Anthropic shipped "extended thinking" mode in Claude 3.7 Sonnet on February 24, 2025, exposing the chain to the developer.
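To make "verifiable reward" concrete, here is a rough sketch of a rule-based reward that checks output format and final-answer correctness without a learned reward model. It is not DeepSeek's or OpenAI's actual reward code: the `verifiable_reward` function, the `<think>`/`<answer>` tag layout as used here, the exact-match comparison, and the numeric weights are all assumptions chosen for illustration.

```python
import re

# Assumed output layout: reasoning in <think> tags, final answer in <answer> tags.
THINK_ANSWER = re.compile(
    r"^<think>(.+?)</think>\s*<answer>(.+?)</answer>\s*$", re.DOTALL
)

FORMAT_REWARD = 0.25  # hypothetical weight for following the layout
ACCURACY_REWARD = 1.0  # hypothetical weight for a correct final answer


def verifiable_reward(completion: str, reference: str) -> float:
    """Score a completion with rules only: layout check plus exact-match answer."""
    match = THINK_ANSWER.match(completion.strip())
    if match is None:
        return 0.0  # no reward if the layout is not followed
    answer = match.group(2).strip()
    correct = answer == reference.strip()
    return FORMAT_REWARD + (ACCURACY_REWARD if correct else 0.0)


print(verifiable_reward("<think>3 * 4 = 12, 12 - 2 = 10</think><answer>10</answer>", "10"))
print(verifiable_reward("<think>3 + 4 = 7</think><answer>7</answer>", "10"))
print(verifiable_reward("10", "10"))
```

Nothing in the reward inspects the reasoning text itself; the chain is only rewarded indirectly, through whether it leads to a checkable answer.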

A caveat the original paper did not emphasize but later work has: the generated chain is not always a faithful explanation of the model's actual computation. Turpin et al. (arXiv:2305.04388, May 2023) showed that biasing features in few-shot examples can change a model's answer while the CoT trace gives a different, plausible-sounding rationale that never mentions the bias. Lanham et al. (arXiv:2307.13702, July 2023) found that for smaller models the chain often does drive the answer, while for larger models the answer is sometimes determined before the chain is generated. CoT is a reasoning amplifier and a window into model behavior, but the window is not always clean glass.
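One way to probe whether a chain drives the answer is to truncate it and see whether the answer changes, loosely in the spirit of the truncation tests in the faithfulness literature. The sketch below is an assumption-laden toy: `answer_given` is a hypothetical stand-in for a model call that answers a question given a partial chain, and the two stub "models" exist only to show the two extremes.

```python
import re
from typing import Callable


def early_answering_agreement(
    question: str,
    steps: list[str],
    answer_given: Callable[[str, list[str]], str],
) -> list[bool]:
    """For each truncated chain, report whether the answer matches the full-chain answer."""
    full_answer = answer_given(question, steps)
    return [
        answer_given(question, steps[:k]) == full_answer for k in range(len(steps))
    ]


def chain_reader(question: str, steps: list[str]) -> str:
    """Stub whose answer depends on the chain: it returns the last number written so far."""
    numbers = re.findall(r"\d+", " ".join(steps))
    return numbers[-1] if numbers else "unknown"


def chain_ignorer(question: str, steps: list[str]) -> str:
    """Stub whose answer is fixed before any reasoning is produced."""
    return "10"


question = "A farmer has 3 crates with 4 apples each and eats 2. How many are left?"
steps = ["3 crates of 4 apples is 12.", "12 - 2 = 10."]

reader_agreement = early_answering_agreement(question, steps, chain_reader)
ignorer_agreement = early_answering_agreement(question, steps, chain_ignorer)
print(reader_agreement, ignorer_agreement)
```

If the answer already matches with zero or few steps, as with the second stub, the written chain is unlikely to be what produced it.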

Key facts

CoT was introduced by Wei et al. at Google Research in arXiv:2201.11903, posted January 28, 2022, and accepted to NeurIPS 2022.
On GSM8K, CoT lifted PaLM 540B from 17.9% to 56.9% solve rate (Wei et al., 2022, Table 1).
CoT shows emergence: gains are flat or negative below ~100B parameters and large above (arXiv:2206.07682, Wei et al., NeurIPS 2022).
Zero-shot CoT — appending "Let's think step by step" — was shown by Kojima et al. in arXiv:2205.11916, NeurIPS 2022.
Self-consistency (majority vote over sampled chains) reached 74.4% on GSM8K with PaLM 540B (arXiv:2203.11171, Wang et al., ICLR 2023).
Tree-of-Thoughts generalizes CoT into a search tree and raised GPT-4's success rate on Game of 24 from 4% with CoT to 74% (arXiv:2305.10601, Yao et al., NeurIPS 2023).
OpenAI o1, released in preview on September 12, 2024, was trained with RL on long internal CoT, with test-time compute treated as a scaling axis (OpenAI o1 system card).
DeepSeek-R1 reached o1-class CoT behavior with open weights under MIT license; its R1-Zero variant used pure RL on verifiable rewards without supervised fine-tuning (arXiv:2501.12948, January 2025).
Claude 3.7 Sonnet shipped developer-visible "extended thinking" CoT on February 24, 2025 (Anthropic release notes).
CoT traces are not always faithful explanations of the model's actual reasoning (arXiv:2305.04388, Turpin et al., NeurIPS 2023).

Sources

Wei, J. et al. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903. arxiv.org/abs/2201.11903
Kojima, T. et al. Large Language Models are Zero-Shot Reasoners. arXiv:2205.11916. arxiv.org/abs/2205.11916
Wang, X. et al. Self-Consistency Improves Chain of Thought Reasoning in Language Models. arXiv:2203.11171. arxiv.org/abs/2203.11171
Yao, S. et al. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. arXiv:2305.10601. arxiv.org/abs/2305.10601
Wei, J. et al. Emergent Abilities of Large Language Models. arXiv:2206.07682. arxiv.org/abs/2206.07682
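Turpin, M. et al. Language Models Don't Always Say What They Think. arXiv:2305.04388. arxiv.org/abs/2305.04388
Lanham, T. et al. Measuring Faithfulness in Chain-of-Thought Reasoning. arXiv:2307.13702. arxiv.org/abs/2307.13702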
DeepSeek-AI. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948. arxiv.org/abs/2501.12948
OpenAI. Learning to Reason with LLMs (o1). openai.com/index/learning-to-reason-with-llms
Anthropic. Claude 3.7 Sonnet and Claude Code. anthropic.com/news/claude-3-7-sonnet
