
2022 · arXiv:2201.11903 · Wei, Wang, Schuurmans, Bosma, Ichter, Xia, Chi, Le, Zhou

“Think step by step.” That was the unlock.

In one sentence: Asking large language models to explain their reasoning out loud before giving an answer makes them dramatically better at math, logic, and multi-step problems — the model already could reason, but you had to ask the right way.

01 · Why this matters to your life

Every “reasoning model” you have heard of in 2026 — OpenAI o1 and o3, DeepSeek-R1, Claude Extended Thinking, Gemini Thinking — descends from this paper. They are not new fundamental architectures. They are scaled, automated, refined versions of one prompting trick: get the model to think out loud first.

The practical takeaway you can use today: when you ask any AI a complicated question, add “think through this step by step” or “explain your reasoning before answering.” The output quality usually improves substantially. This works in 2026 for the same reason it worked in 2022. It is the cheapest performance upgrade in AI.
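As a minimal sketch of that habit, the helper below appends a step-by-step instruction to any question before it is sent to a model. The function name `with_step_by_step` and the exact wording of the suffix are illustrative assumptions, not something taken from the paper, and no particular AI service or API is assumed.

```python
# Hypothetical helper: wraps a question with a "think step by step" instruction.
# The suffix wording is an assumption; any similar phrasing can be swapped in.
COT_SUFFIX = (
    "Think through this step by step, "
    "then give the final answer on its own line."
)


def with_step_by_step(question: str, suffix: str = COT_SUFFIX) -> str:
    """Return the question followed by a reasoning instruction."""
    question = question.strip()
    if not question:
        raise ValueError("question must not be empty")
    return f"{question}\n\n{suffix}"


prompt = with_step_by_step(
    "A train leaves at 14:10 and arrives at 17:45. How long is the trip?"
)
print(prompt)
```

The resulting string can be pasted into any chat interface or passed to whatever client library you already use.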

02 · What scientists actually did

They took grade-school math word problems and tested whether large language models could solve them. The results were embarrassing — the AI got most of them wrong despite being able to write Shakespearean essays. The model knew math abstractly but kept stumbling on the multi-step arithmetic.

Then they tried something different. Instead of showing the model example questions paired with bare answers, they showed it example questions paired with worked-out solutions that walked through the reasoning, and then asked the new question. Accuracy improved dramatically: on one math benchmark, from 17.9% to 58.1%. Same model. Same question. Different prompting.
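The difference between the two prompt styles can be sketched in a few lines. The exemplar below is the tennis-ball problem from the paper's Figure 1; the function and variable names are hypothetical, and only one exemplar is shown for brevity, whereas the paper's prompts used several.

```python
# Sketch of standard few-shot prompting vs. chain-of-thought prompting.
# Exemplar text follows the paper's Figure 1; all names here are hypothetical.

EXEMPLAR_QUESTION = (
    "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?"
)
# Standard prompting: the exemplar shows only the final answer.
STANDARD_ANSWER = "The answer is 11."
# Chain-of-thought prompting: the exemplar shows the intermediate steps too.
COT_ANSWER = (
    "Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 tennis balls. 5 + 6 = 11. The answer is 11."
)


def build_prompt(exemplars: list[tuple[str, str]], new_question: str) -> str:
    """Format (question, answer) exemplars, then leave the last answer open."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in exemplars]
    blocks.append(f"Q: {new_question}\nA:")
    return "\n\n".join(blocks)


new_question = (
    "The cafeteria had 23 apples. If they used 20 to make lunch and "
    "bought 6 more, how many apples do they have?"
)

standard_prompt = build_prompt([(EXEMPLAR_QUESTION, STANDARD_ANSWER)], new_question)
cot_prompt = build_prompt([(EXEMPLAR_QUESTION, COT_ANSWER)], new_question)

print(cot_prompt)
```

The only thing that changes between `standard_prompt` and `cot_prompt` is the exemplar's answer text; the model and the new question stay the same.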

The insight is that the AI could not hold all the reasoning steps in its head and produce the right final answer in one shot. But if it could write the intermediate steps down, it could think across the steps the same way humans do — using its own previously-written words as scratch paper. The reasoning happens in the writing, not before it.

03 · What scientists know but rarely say

Chain-of-thought reasoning works much better in big models than small ones. The 2022 paper found that it gave little or no benefit, and sometimes hurt accuracy, in smaller models; the gains appeared only at roughly 100 billion parameters and above. This fed into the concept of emergent capabilities: things that suddenly start working when the model crosses a size threshold. The follow-up debate is whether this emergence is real or an artifact of measurement; the field is still working it out.

The other unstated truth: chain-of-thought is brittle. It works best on problems the model has seen similar versions of during training. It fails on truly novel problems even when the steps look right. This is why the o1/o3-style models are trained specifically to produce good reasoning chains rather than just prompted to — automated chain-of-thought made consistent.

Most consequential implication: this paper made “reasoning” legible. Before chain-of-thought, an AI gave you an answer and you had to trust it. After chain-of-thought, an AI gave you an answer plus its reasoning, and you could check the work. The interpretability tax was massively reduced. Many modern medical, legal, and scientific AI applications lean on this.
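One modest way to picture "checking the work": when a reasoning chain contains explicit arithmetic such as `5 + 6 = 11`, a few lines of code can verify each written statement. This is a toy sketch under strong assumptions (integer operands, the four basic operators, statements written as `a op b = c`). The function name `check_arithmetic` is hypothetical, and it checks only the arithmetic that is written down, nothing more.

```python
import re

# Matches simple statements like "5 + 6 = 11" or "23 - 20 = 3".
# Assumption: integers only, one operator per statement.
STEP_PATTERN = re.compile(r"(-?\d+)\s*([+\-*/])\s*(-?\d+)\s*=\s*(-?\d+)")


def check_arithmetic(chain: str) -> list[tuple[str, bool]]:
    """Return each arithmetic statement found, with whether it is correct."""
    results = []
    for a, op, b, claimed in STEP_PATTERN.findall(chain):
        left, right, claimed_value = int(a), int(b), int(claimed)
        if op == "+":
            actual = left + right
        elif op == "-":
            actual = left - right
        elif op == "*":
            actual = left * right
        else:
            # Division by zero can never match a claimed result.
            actual = left / right if right != 0 else None
        results.append((f"{a} {op} {b} = {claimed}", actual == claimed_value))
    return results


good_chain = "They used 20, so 23 - 20 = 3. They bought 6 more, so 3 + 6 = 9."
bad_chain = "2 cans of 3 balls is 2 * 3 = 5. Then 5 + 5 = 10."

print(check_arithmetic(good_chain))
print(check_arithmetic(bad_chain))
```

In the second chain the checker flags `2 * 3 = 5` as wrong even though the following step is internally consistent, which is the kind of error a bare final answer would hide.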

04 · What the paper does NOT claim

The paper does not claim the AI is actually reasoning the way humans do. It claims that something behaviorally similar to step-by-step reasoning emerges from the prompting pattern. Whether the model is genuinely thinking or pattern-matching against memorized reasoning chains is unresolved.

The paper also does not claim chain-of-thought is the only way to elicit reasoning. The technique has since been extended by self-consistency (Wang et al. 2022, sampling several reasoning chains and taking the majority answer), Tree of Thoughts (Yao et al. 2023, exploring multiple reasoning branches), and process-reward models (which score each reasoning step rather than only the final answer, an approach often associated with OpenAI's o1, though OpenAI has not published the full training recipe). Each adds capability on top of the original chain-of-thought foundation.
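The voting step of self-consistency is simple enough to sketch. The example below assumes each sampled completion ends with a sentence of the form "The answer is N."; that convention, the function names, and the hand-written sample completions are all illustrative assumptions standing in for real model output.

```python
import re
from collections import Counter
from typing import Optional

# Assumption: completions state the result as "The answer is <number>".
ANSWER_PATTERN = re.compile(r"The answer is\s*(-?\d+(?:\.\d+)?)")


def extract_answer(completion: str) -> Optional[str]:
    """Pull the last stated final answer out of one reasoning chain."""
    matches = ANSWER_PATTERN.findall(completion)
    return matches[-1] if matches else None


def majority_answer(completions: list[str]) -> Optional[str]:
    """Return the most common final answer across sampled chains."""
    answers = [a for a in map(extract_answer, completions) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


# Hand-written stand-ins for three sampled completions of the same question.
samples = [
    "23 - 20 = 3. 3 + 6 = 9. The answer is 9.",
    "23 + 6 = 29. 29 - 20 = 9. The answer is 9.",
    "23 - 20 = 3. 3 + 6 = 8. The answer is 8.",
]

print(majority_answer(samples))
```

Two of the three chains reach 9 by different routes, so the vote returns 9 and the chain with the slip is outvoted.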

05 · Read the original

· arxiv.org/abs/2201.11903: the 43-page original. The figures alone tell the story.
· Kojima et al. 2022 (Zero-shot CoT) — discovered that just adding “Let's think step by step” produces most of the gain even without examples. arXiv:2205.11916.
· Wang et al. 2022 (Self-Consistency) — the upgrade that asks the model many times and votes. arXiv:2203.11171.
· OpenAI o1 system card (2024) — the production version of automated chain-of-thought trained directly into a model.
