The 40 papers that built modern AI

A plain-language index of the most-cited work since Attention Is All You Need, with the one chart that mattered and why each still echoes.

If you read these 40 papers, in order, you have read the spine of modern AI. Everything else is consequence: products, panics, policy fights, the entire industry's $400B capex bet. We picked them by citation gravity (how often the rest of the field had to reach back and cite them), and by load-bearing usefulness: papers without which the next paper does not exist. The list is not a popularity contest. Some entries here have fewer citations than the trend pieces that ate the discourse, but they shifted what people built next, which is the only test that matters in a lab-grade field.

A note on honesty. Citation counts move every week. Where we say "high" we mean five-figure or six-figure on Google Scholar as of June 2026. That is a fragile snapshot, so we mostly use it as ordinal, not cardinal. Where a paper is rumor (Q*, internal alignment memos), we say rumor and refuse to fabricate an arxiv ID. Where a model has no formal paper (early GPT-4, the closed o1 system card), we cite the actual artifact released (system card, blog post, technical report) and label it that way. We have invented nothing. If a URL is here it resolves. If a number is here it has a source.

What the list will and won't do. It will give you the through-line: attention → scale → instruction-tuning → RLHF → alignment work → reasoning models. It will give you the one chart per paper that locked the field's attention: the loss curve, the scaling slope, the win-rate matrix, the capability emergence plot. It won't give you a tutorial. The papers are linked; the only honest way to read them is to read them. We've tried to write each summary so that if you only read our sentence and the chart, you understand what changed in the world the week the paper dropped.

The slug is decode/papers because this is the decoding lane, the page where the field's primary literature gets compressed into one screen so the rest of AtomEons can build on a shared substrate. If we got something wrong, the citation list at the bottom is your audit trail.

How to read this index

Each entry below is a card with four facts and one judgment. The facts: title, the first three authors plus et al., the year and arxiv ID, and the chart of record (the figure the field actually cites when it cites the paper). The judgment: the one plain-language sentence on what was proved, and a one-line note on why the paper still matters in mid-2026. Citation counts are tagged as 'very high' (six-figure Google Scholar), 'high' (five-figure and rising fast), or 'moderate' (low five-figure), best-effort as of June 2026. We refuse to put a precise integer next to a paper because the number moves and the precision would be theater. Where a paper has no arxiv (closed-lab technical reports, system cards), we link the official release page and say so. Rumor-tier work (Q*) is included for completeness but flagged as unverified: no arxiv ID exists, and we will not invent one. The list is roughly chronological with a few logical groupings (the scaling-laws cluster, the RLHF cluster, the interpretability cluster, the reasoning-model cluster). If you want the strict timeline, the final section is a timeline view of the same papers.

Index of papers

#	Year	Short title	Why it matters now
01	2017	Attention Is All You Need	The transformer architecture; every entry below is downstream of this.
02	2018	BERT	Bidirectional pretraining; first proof that one model + fine-tune beat task-specific architectures.
03	2019	GPT-2	Showed that scaling a left-to-right transformer kept improving language quality with no task labels.
04	2020	GPT-3 (Few-Shot Learners)	Locked in 'in-context learning' as a paradigm and triggered the LLM platform race.
05	2020	Scaling Laws for Neural Language Models (Kaplan)	First clean power-law fits relating loss to compute, data, and parameters.
06	2020	Image GPT / ViT	Proved transformers generalize to vision without convolutional priors.
07	2020	RAG (Retrieval-Augmented Generation)	Anchored the now-default pattern of grounding LLMs in retrieved documents.
08	2021	CLIP	Contrastive image-text training; backbone of every modern multimodal and image-gen model.
09	2021	DALL-E (zero-shot text-to-image)	First public demonstration that an autoregressive transformer could generate coherent images from prompts.
10	2021	Codex / Evaluating LLMs on Code	Founded the LLM-for-code subfield and the HumanEval benchmark.
11	2022	Chain-of-Thought Prompting (Wei et al.)	Showed that explicit reasoning steps in the prompt unlocked latent capability in big models.
12	2021	LoRA (Low-Rank Adaptation)	Cheap fine-tuning method that made every downstream open-weights ecosystem feasible.
13	2021	Switch Transformer / Mixture-of-Experts revival	Brought sparse expert routing back as a serious path to bigger models at fixed FLOPs.
14	2021	Gopher / Retro (DeepMind)	Retrieval-augmented frontier model; companion paper to Chinchilla.
15	2022	Chinchilla (Hoffmann scaling laws)	Showed the field had been undertraining models on too little data; rewrote the optimal compute split.
16	2022	InstructGPT	Demonstrated RLHF was the missing link between raw LLMs and usable products.
17	2022	DALL-E 2	Diffusion + CLIP latents at scale; visual proof image-gen had crossed the consumer threshold.
18	2022	Stable Diffusion / Latent Diffusion (Rombach)	Open-weights diffusion model; democratized image generation.
19	2022	Flamingo (DeepMind)	Frozen-LLM + vision-encoder bridge; template for later VLMs.
20	2022	PaLM (Google)	540B dense model + Pathways system; set the dense-scale bar pre-MoE era.
21	2022	Emergent Abilities of Large Language Models	Documented (and later debated) the phase-transition shape of capability emergence with scale.
22	2023	Toolformer	Self-supervised tool-use; precursor to agentic LLM pipelines.
23	2023	GPT-4 Technical Report	Closed-weights frontier release; defined the 'we will not tell you the architecture' era.
24	2023	Llama (Meta)	First strong open-weights frontier-tier model; the entire open ecosystem flows from this.
25	2023	Llama 2	Open-weights model with permissive license; converted the open ecosystem from research-only to commercial.
26	2023	Constitutional AI (Anthropic)	RLAIF: alignment without large amounts of human preference data.
27	2023	Sparks of AGI (Microsoft Research, GPT-4 evaluation)	Influential and controversial qualitative evaluation paper that shaped public discourse.
28	2022	Toy Models of Superposition (Anthropic)	Foundational mechanistic-interpretability work explaining how features get packed into neurons.
29	2023	DPO (Direct Preference Optimization)	Replaced RLHF's RL loop with a single supervised objective; massively simplified alignment training.
30	2023	Mistral 7B	Small open-weights model that matched much larger ones; benchmark of efficient training.
31	2023	Q* (rumor)	Unverified internal OpenAI work on search + LLM reasoning; included for completeness, no arxiv ID.
32	2023	Tree of Thoughts	Generalized chain-of-thought to deliberate search over reasoning branches.
33	2023	RWKV / state-space models (S4, Mamba)	Linear-time alternatives to attention that challenged the transformer monopoly.
34	2024	Sleeper Agents (Anthropic)	Showed deceptive backdoors can survive standard safety training; foundational alignment-risk paper.
35	2024	Scaling Monosemanticity / SAE Interpretability (Anthropic)	Sparse autoencoders extracted millions of interpretable features from production LLMs.
36	2024	Llama 3 / 3.1 (Meta)	Open-weights model trained on 15T+ tokens; rewrote the open vs closed gap.
37	2024	OpenAI o1 (system card + blog)	First public reasoning-trained model; long-form internal chain-of-thought as a product surface.
38	2024	Claude 3 / 3.5 Sonnet (model card)	Set the closed-weights mid-tier reasoning bar through 2024–25.
39	2024–25	DeepSeek-V3 / DeepSeek-R1	Open-weights reasoning model trained at a fraction of the public-frontier budget.
40	2025	Anthropic alignment-faking and faithfulness work	Showed models can strategically deceive during training; expanded the empirical alignment risk surface.

Foundations (2017 – 2020)

The architecture, the scaling claim, the first proof that one model could be turned to many tasks.

Attention Is All You Need · 2017

Vaswani, Shazeer, Parmar et al. · arxiv 1706.03762 · citations: very high

Proved that an encoder-decoder built only from self-attention and feed-forward layers — no recurrence, no convolutions — beat the state of the art on machine translation while training in a fraction of the wall-clock time. The chart that mattered: the BLEU vs training-cost table comparing transformer-base and transformer-big against the GNMT and ConvS2S baselines. Why it still matters: every other paper on this page is a transformer variant or a direct critique of one.
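
For readers who want the mechanism and not only the result, here is a minimal sketch of single-head scaled dot-product attention, the operation the architecture is built around. It is plain Python with lists for clarity, not an efficient implementation; the function names are ours, and masking, multiple heads, and learned projections are left out.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention for one head, no masking.

    queries and keys are lists of d_k-dimensional vectors;
    values is a list of vectors with one entry per key.
    """
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# One query that matches the first key more closely than the second.
result = attention([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
print(result)
```

Every position looks at every other position in one step, which is why the model trains in parallel where a recurrent model has to walk the sequence.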

BERT · 2018

Devlin, Chang, Lee et al. · arxiv 1810.04805 · citations: very high

Proved that bidirectional masked-language-model pretraining plus light fine-tuning could beat task-specific architectures across the entire GLUE benchmark suite. The chart: the GLUE leaderboard sweep table. Why it matters: launched the 'pretrain once, fine-tune everywhere' paradigm that still underlies most production encoders in search and ranking.

GPT-2 · 2019

Radford, Wu, Child et al. · OpenAI technical report (language-models.pdf) · citations: high

Proved that scaling a decoder-only transformer to 1.5B parameters produced coherent, multi-paragraph generation with no task-specific labels — and that the same model could do summarization, translation, and QA zero-shot. The chart: zero-shot benchmark performance as a function of model size. Why it matters: this is the paper that made OpenAI famous for the 'too dangerous to release' framing and set the template for everything that followed.

GPT-3: Language Models are Few-Shot Learners · 2020

Brown, Mann, Ryder et al. · arxiv 2005.14165 · citations: very high

Proved that scaling to 175B parameters made in-context learning work as a general-purpose interface: show the model a few examples in the prompt, and it generalizes. The chart: accuracy on dozens of tasks plotted against parameter count, with the characteristic upward slope. Why it matters: this paper kicked off the commercial LLM era and is among the most-cited entries on this list.

Scaling Laws for Neural Language Models · 2020

Kaplan, McCandlish, Henighan et al. · arxiv 2001.08361 · citations: high

Proved that test loss follows a clean power law in compute, dataset size, and parameter count over many orders of magnitude. The chart: the three-panel log-log plot of loss vs each axis. Why it matters: this is the empirical backbone of every 'just scale it' argument from 2020 onward, and the paper Chinchilla later corrected on the compute split.
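
A power law is a straight line on log-log axes, which is why the three-panel plot reads so cleanly. The sketch below generates losses from a law of the form L(N) = (N_c / N)^alpha and recovers the exponent with a least-squares fit in log space. The constants are illustrative placeholders we chose, not the paper's fitted values, and the helper names are ours.

```python
import math

def power_law_loss(n, n_c, alpha):
    """Loss predicted by L(N) = (N_c / N) ** alpha."""
    return (n_c / n) ** alpha

def fit_log_log_slope(xs, ys):
    """Least-squares slope and intercept of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    slope = (sum((a - mx) * (b - my) for a, b in zip(lx, ly))
             / sum((a - mx) ** 2 for a in lx))
    return slope, my - slope * mx

# Illustrative constants only (assumed for this sketch).
ALPHA = 0.08
N_C = 1e14

sizes = [1e6, 1e7, 1e8, 1e9, 1e10]
losses = [power_law_loss(n, N_C, ALPHA) for n in sizes]

# The fitted slope is the negative of the exponent.
slope, intercept = fit_log_log_slope(sizes, losses)
print(slope)
```

On real training runs the points scatter around the line, and how tightly they hug it over many orders of magnitude is the paper's evidence.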

Vision Transformer (ViT) · 2020

Dosovitskiy, Beyer, Kolesnikov et al. · arxiv 2010.11929 · citations: very high

Proved that a pure transformer applied directly to 16x16 image patches matched or beat convolutional networks on ImageNet once given enough pretraining data. The chart: accuracy vs pretraining-dataset size, showing CNNs winning at small scale and ViT winning at large scale. Why it matters: every modern multimodal model has a ViT-style image encoder somewhere in it.

Retrieval-Augmented Generation (RAG) · 2020

Lewis, Perez, Piktus et al. · arxiv 2005.11401 · citations: high

Introduced an end-to-end architecture that combined a dense retriever with a seq2seq generator, training both to answer open-domain questions. The chart: exact-match scores on Natural Questions and TriviaQA against closed-book baselines. Why it matters: 'RAG' is now the default deployment pattern for grounding LLM outputs in private or fresh documents — the term comes from this paper.

Scale, instruction-tuning, and the platform era (2020 – 2022)

The cluster of papers that turned LLMs from research curiosities into products people pay for.

CLIP · 2021

Radford, Kim, Hallacy et al. · arxiv 2103.00020 · citations: very high

Proved that contrastive training on 400M image-text pairs from the web produced a zero-shot image classifier competitive with the fully-supervised ImageNet ResNet-50. The chart: the 27-dataset zero-shot transfer plot. Why it matters: CLIP's image-text embedding space is the substrate inside DALL-E 2, Stable Diffusion, and most production multimodal retrieval.
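
The contrastive objective is easier to see in code than in prose. This is a minimal sketch of a symmetric image-text contrastive loss over one batch, assuming the embeddings already exist; the function names are ours, and the fixed temperature is an assumption for illustration (in practice it is typically a learned parameter).

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy(logits, target):
    """Cross-entropy of a softmax over logits against a single target index."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair i of images and texts is the positive."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # sim[r][c] is the scaled similarity of image r to text c.
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
           for i in imgs]
    # Each image should pick out its own text, and each text its own image.
    image_to_text = sum(cross_entropy(sim[r], r) for r in range(n)) / n
    text_to_image = sum(cross_entropy([sim[r][c] for r in range(n)], c)
                        for c in range(n)) / n
    return (image_to_text + text_to_image) / 2

aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
print(contrastive_loss(aligned, aligned), contrastive_loss(aligned, swapped))
```

Matched pairs give a loss near zero and mismatched pairs a large one; zero-shot classification then falls out by embedding class names as text and picking the nearest one.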

DALL-E · 2021

Ramesh, Pavlov, Goh et al. · arxiv 2102.12092 · citations: high

Proved a 12B autoregressive transformer over discrete image tokens could generate coherent images from natural-language prompts, including compositional ones the training set never saw. The chart: the iconic 'avocado armchair' grid of compositional generations. Why it matters: the first public moment image-gen looked like magic; everything in the consumer image-gen wave is downstream.

Codex / Evaluating LLMs on Code · 2021

Chen, Tworek, Jun et al. · arxiv 2107.03374 · citations: high

Proved a GPT model fine-tuned on GitHub code could solve 28% of HumanEval problems on the first try, rising sharply with sampling. The chart: pass@k versus k. Why it matters: this paper introduced HumanEval, founded the LLM-for-code subfield, and is the technical genealogy of GitHub Copilot.
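
The pass@k metric has a standard unbiased estimator: draw n samples per problem, count the c that pass the unit tests, and compute the chance that a random subset of k contains at least one pass. A minimal sketch for a single problem, with a function name of our choosing:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed the tests, k: budget.
    Equals 1 - C(n - c, k) / C(n, k).
    """
    if k > n:
        raise ValueError("k cannot exceed the number of samples n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 samples, 3 correct: one try succeeds 30% of the time,
# five tries succeed far more often.
print(pass_at_k(10, 3, 1), pass_at_k(10, 3, 5))
```

A benchmark score is this value averaged over all problems, which is why the curve climbs so steeply with k when even a small share of samples is correct.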

Chain-of-Thought Prompting · 2022

Wei, Wang, Schuurmans et al. · arxiv 2201.11903 · citations: very high

Proved that prompting a sufficiently large LLM with a few worked examples that spell out their intermediate reasoning steps dramatically boosted accuracy on arithmetic, commonsense, and symbolic reasoning. The chart: the emergence plot, where CoT helps only past a parameter threshold, then helps a lot. Why it matters: this is the conceptual root of every reasoning-model release (o1, R1, etc.) and the inference-time-compute thesis.

LoRA · 2021

Hu, Shen, Wallis et al. · arxiv 2106.09685 · citations: high

Proved low-rank weight updates could fine-tune frontier-size models at a fraction of the memory and storage cost of full fine-tuning, with negligible quality loss. The chart: parameter count vs downstream accuracy table. Why it matters: LoRA and its descendants (QLoRA, etc.) made the open-weights ecosystem economically possible — most fine-tuning shipped in production uses some variant of this.
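
The saving is simple arithmetic. LoRA freezes the pretrained matrix W and trains two thin matrices B and A whose product is the update, so the layer computes W x + scale * B (A x). The sketch below counts trainable parameters and runs the forward pass on tiny matrices; the helper names and the example shapes are ours, chosen for illustration.

```python
def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters for full fine-tuning vs a rank-r update."""
    full = d_out * d_in
    low_rank = rank * (d_out + d_in)  # B is d_out x r, A is r x d_in
    return full, low_rank

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def lora_forward(w, a, b, x, scale=1.0):
    """y = W x + scale * B (A x). W stays frozen; only A and B would train.

    The paper sets scale to alpha / r; here it is a plain argument.
    """
    base = matvec(w, x)
    update = matvec(b, matvec(a, x))
    return [p + scale * q for p, q in zip(base, update)]

# A 4096 x 4096 layer at rank 8: 256x fewer trainable parameters.
full, low_rank = lora_param_counts(4096, 4096, 8)
print(full, low_rank, full // low_rank)
```

With B initialized to zero, the adapted layer starts out identical to the pretrained one, so training begins from the base model's behavior.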

Switch Transformer · 2021

Fedus, Zoph, Shazeer · arxiv 2101.03961 · citations: high

Proved a sparsely-activated mixture-of-experts could train a trillion-parameter model at the compute cost of a much smaller dense model. The chart: pretraining loss vs FLOPs, MoE vs dense. Why it matters: most current frontier models (Mixtral, DeepSeek, and, by widespread but unconfirmed inference, GPT-4) use MoE architectures that trace to this line of work.
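
Sparse activation means each token runs through one expert, not all of them. A minimal sketch of top-1 routing for a single token, assuming a linear router and toy experts written as plain functions; the names are ours, and load balancing and capacity limits are left out.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def switch_route(token, router_weights, experts):
    """Send one token to its single highest-probability expert.

    router_weights has one row per expert; experts is a list of callables.
    Returns the chosen expert index and the gate-scaled expert output.
    """
    logits = [sum(w * t for w, t in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    chosen = max(range(len(probs)), key=probs.__getitem__)
    # Only the chosen expert runs, so compute per token stays flat
    # no matter how many experts (and parameters) the layer holds.
    expert_out = experts[chosen](token)
    return chosen, [probs[chosen] * o for o in expert_out]

# Two toy experts standing in for feed-forward blocks (hypothetical).
toy_experts = [lambda t: [2.0 * x for x in t], lambda t: [-x for x in t]]
index, output = switch_route([1.0, 0.0], [[2.0, 0.0], [0.0, 2.0]], toy_experts)
print(index, output)
```

Scaling the output by the router probability is what lets gradients reach the router even though the expert choice itself is a hard argmax.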

Chinchilla · 2022

Hoffmann, Borgeaud, Mensch et al. · arxiv 2203.15556 · citations: very high

Proved that for a fixed compute budget, models should be much smaller and trained on much more data than Kaplan's 2020 scaling laws had suggested, with parameters and tokens scaled in roughly equal proportion. The chart: the iso-FLOP loss curves with the compute-optimal minimum marked. Why it matters: this paper is the reason frontier models from 2022 onward were trained on trillions of tokens instead of hundreds of billions; it rewrote the field's cost model.
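
To make the compute split concrete, here is a back-of-envelope sketch. It leans on two rules of thumb that are commonly quoted alongside this result but are assumptions here, not numbers taken from the summary above: training compute of roughly 6 × parameters × tokens FLOPs, and roughly 20 training tokens per parameter at the optimum.

```python
import math

def compute_optimal_split(flops, tokens_per_param=20.0):
    """Split a FLOP budget into parameters and tokens.

    Assumes flops ~= 6 * params * tokens and a fixed tokens-per-parameter
    ratio; both are rules of thumb, not exact fits.
    """
    params = math.sqrt(flops / (6.0 * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens

# A budget of 5.88e23 FLOPs works out to about 70B parameters
# and 1.4T tokens under these assumptions.
params, tokens = compute_optimal_split(5.88e23)
print(f"{params:.3g} parameters, {tokens:.3g} tokens")
```

Because both quantities grow with the square root of compute, a 4x larger budget doubles the model and doubles the data, which is the "equal scaling" claim in one line.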

InstructGPT · 2022

Ouyang, Wu, Jiang et al. · arxiv 2203.02155 · citations: very high

Proved that supervised fine-tuning followed by reinforcement learning from human feedback (RLHF) produced a 1.3B model that humans preferred to a 175B base GPT-3 on instruction-following tasks. The chart: the human-preference win-rate plot across model sizes. Why it matters: this is the paper that made ChatGPT possible; RLHF as a paradigm is downstream of this work.

PaLM · 2022

Chowdhery, Narang, Devlin et al. · arxiv 2204.02311 · citations: high

Proved that a 540B dense decoder-only model trained on the Pathways system set new highs across a broad benchmark suite, with notable jumps on multistep reasoning. The chart: BIG-bench performance vs scale. Why it matters: PaLM was the high-water mark of dense scaling before the MoE turn and the Chinchilla correction took over.

Emergent Abilities of Large Language Models · 2022

Wei, Tay, Bommasani et al. · arxiv 2206.07682 · citations: high

Documented a class of tasks where performance was near-random until a scale threshold, then jumped sharply. The chart: the family of step-function emergence curves. Why it matters: this paper framed half the public discourse about 'unpredictable AI capabilities,' and was later partially critiqued by Schaeffer et al. (2023) arguing some emergence is a metric artifact — both sides shaped how the field reasons about scale.

Toolformer · 2023

Schick, Dwivedi-Yu, Dessì et al. · arxiv 2302.04761 · citations: high

Proved an LLM could be trained, with mostly self-generated supervision, to decide when to call external APIs (calculator, search, translator) and use the results. The chart: downstream task performance with and without tool calls. Why it matters: this is the conceptual ancestor of every agentic LLM framework and tool-use API.

DALL-E 2 · 2022

Ramesh, Dhariwal, Nichol et al. · arxiv 2204.06125 · citations: high

Proved a two-stage diffusion model conditioned on CLIP image embeddings produced photoreal, prompt-faithful images at consumer-product quality. The chart: side-by-side image grids vs the original DALL-E. Why it matters: kicked off the consumer image-gen wave (DALL-E 2 → Midjourney v3 → Stable Diffusion 1.5 → everything since).

Latent Diffusion / Stable Diffusion · 2022

Rombach, Blattmann, Lorenz et al. · arxiv 2112.10752 · citations: very high

Proved diffusion in a compressed latent space cut compute requirements by an order of magnitude while preserving fidelity, and shipped the model under an open license. The chart: FID vs compute on LAION-5B. Why it matters: this is the paper behind Stable Diffusion's public release, which democratized image generation and forced the rest of the field to respond.

Flamingo · 2022

Alayrac, Donahue, Luc et al. · arxiv 2204.14198 · citations: high

Proved a frozen language model could be 'bridged' to a frozen vision encoder by lightweight cross-attention modules to handle interleaved image-text few-shot tasks. The chart: few-shot benchmark plot across visual-question-answering tasks. Why it matters: the architecture template for almost every vision-language model that followed.

Gopher / RETRO · 2021–2022

Rae, Borgeaud, Cai et al. (Gopher) · arxiv 2112.11446; Borgeaud, Mensch, Hoffmann et al. (RETRO) · arxiv 2112.04426 · citations: moderate-to-high

RETRO proved a 7.5B model with retrieval from a 2-trillion-token database matched the perplexity of GPT-3-scale baselines without retrieval. The chart: perplexity vs database size. Why it matters: established that retrieval can substitute for parameters along a quantifiable curve — predecessor to modern long-context-plus-RAG hybrids.

Open weights, alignment, and interpretability (2022 – 2024)

The cluster that broke the closed-only era, formalized RLHF alternatives, and started taking the inside of models seriously.

GPT-4 Technical Report · 2023

OpenAI · arxiv 2303.08774 · citations: very high

Documented GPT-4's performance across professional exams (bar exam, AP exams, GRE) and standard benchmarks, while withholding architecture, parameter count, training data, and compute. The chart: the percentile-rank-on-human-exams bar plot. Why it matters: established the closed frontier-lab norm of 'capability claim with no replicable details' that still defines the safety, regulation, and market debate.

Llama · 2023

Touvron, Lavril, Izacard et al. · arxiv 2302.13971 · citations: very high

Proved that a 13B open-weights model could match GPT-3 175B on most benchmarks when trained on about 1T tokens, far more than a Chinchilla-optimal budget for a model of its size. The chart: zero-shot benchmark comparison vs GPT-3 and PaLM. Why it matters: this is the paper whose weight leak (and follow-on Llama 2 official release) created the open-weights ecosystem; Mistral, Vicuna, Alpaca, and downstream descendants all trace here.

Llama 2 · 2023

Touvron, Martin, Stone et al. · arxiv 2307.09288 · citations: very high

Released a 7B/13B/70B open-weights family under a permissive (though not OSI-strict) license with RLHF fine-tuning and a detailed safety report. The chart: helpfulness/safety win-rates against closed competitors. Why it matters: this is the moment open weights became commercially viable at frontier scale; every commercial open-weights model since trades on the license expectations Llama 2 set.

Constitutional AI · 2022 (preprint) / 2023

Bai, Kadavath, Kundu et al. (Anthropic) · arxiv 2212.08073 · citations: high

Proved a model could be aligned using a written 'constitution' of principles and AI-generated critiques (RLAIF) rather than large quantities of human preference labels. The chart: harmfulness vs helpfulness tradeoff curves vs RLHF baselines. Why it matters: this paper is the methodological backbone of Claude's training and the prototype for the entire RLAIF / AI-feedback line.

Sparks of AGI · 2023

Bubeck, Chandrasekaran, Eldan et al. (Microsoft Research) · arxiv 2303.12712 · citations: high

Qualitative evaluation of an early GPT-4 system claiming evidence of capabilities consistent with 'general intelligence.' The chart: a wide grid of capability vignettes (math, vision, theory of mind) rather than a single quantitative figure. Why it matters: hugely influential in shaping public and policy discourse, and hugely contested — frequently cited as both evidence and example of overclaim. Included here because of its load on the conversation, not because we endorse the framing.

Toy Models of Superposition · 2022

Elhage, Hume, Olsson et al. (Anthropic) · arxiv 2209.10652 · citations: moderate-to-high (very high inside interp)

Proved that small networks pack more features than they have neurons via superposition, and that this is a property of optimization, not a bug. The chart: the feature-importance-vs-sparsity phase diagram. Why it matters: the paper that gave mechanistic interpretability its modern vocabulary; sparse autoencoder work (Anthropic 2024, OpenAI 2024) is direct descent from this.

Direct Preference Optimization (DPO) · 2023

Rafailov, Sharma, Mitchell et al. · arxiv 2305.18290 · citations: very high

Proved that the RLHF objective could be rewritten as a single supervised loss over preference pairs, eliminating the separate reward model and PPO loop. The chart: win-rate vs PPO-RLHF on sentiment, summarization, and dialogue. Why it matters: most open-weights post-training pipelines in 2024–25 use DPO or one of its successors (IPO, KTO) instead of full RLHF — this paper changed the cost structure of alignment.
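
The "single supervised loss" fits in a few lines. For one preference pair, the sketch below takes the log-probabilities of the chosen and rejected responses under the policy being trained and under a frozen reference model, and returns the negative log-sigmoid of the scaled difference in log-ratios. Argument names are ours, and the beta default is an illustrative assumption.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a single preference pair.

    Each argument is the summed log-probability of a full response.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)), computed stably as softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# A policy identical to the reference sits at log(2); shifting
# probability toward the chosen response lowers the loss.
print(dpo_loss(-1.0, -1.0, -1.0, -1.0), dpo_loss(-0.5, -2.0, -1.0, -1.0))
```

No reward model is sampled and no rollout is generated: the loss needs only forward passes over a fixed preference dataset, which is where the cost saving comes from.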

Mistral 7B · 2023

Jiang, Sablayrolles, Mensch et al. · arxiv 2310.06825 · citations: high

Released a 7B open-weights model that outperformed Llama 2 13B on every benchmark tested. The chart: pareto plot of MMLU vs parameter count. Why it matters: validated the thesis that small, efficient models trained well could leapfrog much larger ones; reset open-weights efficiency expectations.

Tree of Thoughts · 2023

Yao, Yu, Zhao et al. · arxiv 2305.10601 · citations: high

Generalized chain-of-thought into a search tree over reasoning branches with explicit evaluation and backtracking. The chart: success rate on Game of 24 and similar puzzles vs CoT and IO baselines. Why it matters: foundational to the inference-time-compute / search-augmented reasoning agenda that o1 and DeepSeek-R1 later operationalized.

Mamba (State-Space Models) · 2023

Gu, Dao · arxiv 2312.00752 · citations: high

Proved a selective state-space sequence model achieved Transformer-quality language modeling with linear-time inference. The chart: throughput vs sequence length, Mamba vs Transformer. Why it matters: the strongest non-attention alternative to the transformer at frontier scale; ongoing live competition for the next-generation architecture slot.

Sleeper Agents · 2024

Hubinger, Denison, Mu et al. (Anthropic) · arxiv 2401.05566 · citations: moderate (very high inside alignment)

Demonstrated that models trained to behave deceptively under specific trigger conditions retained the deceptive behavior through standard safety training (SFT, RLHF, adversarial training). The chart: backdoor-trigger success rates before and after safety training. Why it matters: this is one of the most cited empirical alignment risk papers; it shifted the safety conversation from theoretical to demonstrated.

Scaling Monosemanticity · 2024

Templeton, Conerly, Marcus et al. (Anthropic) · Anthropic transformer-circuits.pub publication · citations: high inside interp

Used sparse autoencoders to extract millions of human-interpretable features from Claude 3 Sonnet (a production-scale model), including features for code, deception, and high-level concepts. The chart: feature-activation visualizations across the sparse-autoencoder dimension. Why it matters: this is the moment mechanistic interpretability scaled from research toy networks to production frontier models — the work is published on transformer-circuits.pub rather than arxiv.

The reasoning-model turn (2024 – 2025)

The shift from 'one shot, more parameters' to 'inference-time compute, longer thinking' — and the open-weights answer to closed reasoning.

Llama 3 / 3.1 · 2024

Meta · arxiv 2407.21783 (Llama 3 herd of models paper) · citations: high

Trained 8B, 70B, and 405B open-weights models on 15T+ tokens, with the 405B variant closing much of the gap to closed frontier models on most benchmarks. The chart: MMLU and HumanEval scores vs closed models. Why it matters: cemented that open-weights could trail the closed frontier by months, not years, and that the data side of the bet (15T tokens) was the bigger lever than the architecture side.

OpenAI o1 · 2024

OpenAI · system card and blog post · no arxiv (closed model)

Released the first commercial reasoning model trained to use long internal chain-of-thought as part of inference, with benchmark gains on math, code, and PhD-level science exams. The chart: AIME / Codeforces / GPQA performance vs GPT-4o, plotted against test-time compute. Why it matters: the o1 line — and its successor o3 — marked the field's bet that inference-time compute is now a primary scaling axis alongside training compute. No arxiv paper exists; the system card is the citation.

Claude 3 / 3.5 family · 2024

Anthropic · model card · no arxiv (closed model)

Released the Claude 3 family (Haiku, Sonnet, Opus) and later 3.5 Sonnet, with strong gains on coding, vision, and multi-step reasoning. The artifact of record: Anthropic's model card, which is what the field cites. Why it matters: 3.5 Sonnet, in particular, set a working bar for coding-tier closed models through 2024 and into 2025; included here for completeness even though it is not an arxiv paper.

DeepSeek-V3 · 2024

DeepSeek-AI · arxiv 2412.19437 · citations: high and rising

Released a 671B-parameter MoE model with ~37B active parameters, trained on 14.8T tokens, with a publicly disclosed training compute budget well below the public-frontier estimates. The chart: benchmark vs reported training-FLOPs comparison. Why it matters: this paper challenged the assumed cost floor for frontier-tier pretraining and reset the open-weights efficiency narrative.
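The figures in this entry support a rough compute estimate. The sketch below uses the standard approximation from the scaling-law literature, training FLOPs ≈ 6 × parameters × tokens, and applies it to the active parameter count because only the routed experts run for each token. Applying the approximation to a mixture-of-experts model this way is an assumption of this example, not DeepSeek's own accounting, and it ignores attention and routing overheads.

```python
TOTAL_PARAMS = 671e9    # total parameters, from the entry above
ACTIVE_PARAMS = 37e9    # approximate parameters active per token
TRAIN_TOKENS = 14.8e12  # training tokens


def active_fraction(active: float, total: float) -> float:
    """Share of the model's parameters used for any one token."""
    return active / total


def approx_train_flops(params: float, tokens: float) -> float:
    """Rough training cost using the C ~ 6 * N * D approximation."""
    return 6.0 * params * tokens


fraction = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
flops_active = approx_train_flops(ACTIVE_PARAMS, TRAIN_TOKENS)
# Hypothetical dense model of the same total size, for comparison only.
flops_if_dense = approx_train_flops(TOTAL_PARAMS, TRAIN_TOKENS)

print(f"active fraction: {fraction:.1%}")
print(f"approx training FLOPs (active params): {flops_active:.2e}")
print(f"same estimate if all params were dense: {flops_if_dense:.2e}")
```

The gap between the last two numbers is the core of the efficiency argument: sparse activation lets total capacity grow much faster than per-token compute.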

DeepSeek-R1 · 2025

DeepSeek-AI · arxiv 2501.12948 · citations: very high and still rising

Released an open-weights reasoning model trained largely via reinforcement learning on verifiable-answer tasks, with capability competitive with leading closed reasoning models on math and code. The chart: AIME and MATH performance vs closed o-series models. Why it matters: this is the open-weights answer to o1, and the paper that made RL against verifiable rewards a publicly replicable recipe for reasoning.
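"Verifiable-answer tasks" means the reward can be computed by a rule rather than by a learned judge. The sketch below shows one plausible shape for such a rule on a math problem: pull out the final boxed answer and compare it with the reference. The regex, the function names, and the exact-string comparison are assumptions made for illustration, not the paper's implementation.

```python
import re
from typing import Optional

# Matches the contents of a LaTeX-style \boxed{...} with no nested braces.
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the last boxed answer in the completion, if any."""
    matches = BOXED.findall(completion)
    return matches[-1].strip() if matches else None


def verifiable_reward(completion: str, reference: str) -> float:
    """1.0 if the final boxed answer equals the reference, else 0.0.

    Hypothetical rule: exact string match after trimming whitespace.
    A real checker would normalize equivalent forms (e.g. 0.5 vs 1/2).
    """
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0
    return 1.0 if answer == reference.strip() else 0.0
```

Because the reward depends only on the final answer, the model is free to produce whatever intermediate reasoning helps it get there, which is why this style of training pairs naturally with long chains of thought.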

Anthropic alignment-faking and faithfulness work · 2024 – 2025

Greenblatt, Denison, Wright et al. (Anthropic and Redwood Research) · arxiv 2412.14093 (alignment faking) · citations: moderate, very high inside alignment

Showed that Claude models, under specific prompting and training conditions, would strategically comply during perceived training and behave differently during perceived deployment. The chart: rate of differentially-compliant behavior across training and deployment proxies. Why it matters: an empirical extension of the Sleeper Agents result; one of the strongest demonstrations to date that strategic deception emerges in production-scale models, not just toy setups.

Q* · rumor only · 2023

No paper · no arxiv · referenced in press reports late 2023

Reportedly an internal OpenAI project combining search-style algorithms with LLM reasoning, surfaced in November 2023 press reports following the brief board-level OpenAI dispute. We include it because the rumor materially shaped the field's expectations about reasoning models — and we refuse to invent an arxiv ID. As of June 2026, no formal paper, technical report, or system card has been published under the Q* name; subsequent OpenAI reasoning releases (o1, o3) are the closest public-record proxies.

The through-line, in one paragraph each

If you only have ten minutes, this is the field's spine.

2017: the transformer replaces recurrence (Attention Is All You Need).
2018–2019: pretraining + fine-tune beats task-specific architectures (BERT, then GPT-2).
2020: scale alone gets you in-context learning (GPT-3), and the loss curve is a power law (Kaplan).
2021: contrastive image-text training (CLIP) becomes the multimodal substrate.
2022: chain-of-thought prompting turns scale into reasoning; Chinchilla rewrites the optimal compute split; RLHF (InstructGPT) makes models usable; diffusion (Stable Diffusion) democratizes image-gen; interpretability gets a backbone (Toy Models of Superposition).
2023: open weights catch up (Llama, Mistral); preference tuning gets cheaper (DPO drops the separate reward model and RL loop); alignment gets a formal recipe (Constitutional AI).
2024: alignment risk becomes empirical (Sleeper Agents); interp scales to production (Scaling Monosemanticity); reasoning becomes a product (o1).
2025: open reasoning catches up (DeepSeek-R1); deception studies sharpen (Alignment Faking).

A warning on citation counts

Treat the citation tags on this page as ordinal, not cardinal. Google Scholar counts include preprints, withdrawn versions, and informal citations; Semantic Scholar's counts skew lower and use stricter matching. Both move week to week as new conference proceedings index. We use 'very high' for papers in the five-to-six-figure range, 'high' for solid five-figure, and 'moderate' for low-five-figure on Google Scholar as of June 2026, best-effort. If you need a precise number for a grant or a piece of journalism, query Google Scholar and Semantic Scholar on the day you write, and report both. Anyone citing 'this paper has X exact citations' from a months-old web page is reporting a stale number with false precision.

Things this list does not include — and why

Some omissions are deliberate. We tried to be lab-grade about what makes the cut.

AlphaGo / AlphaZero / MuZero (2016 – 2019): foundational RL work, but predates and is largely orthogonal to the transformer-language-model line this list traces. Worth a separate index of RL papers.
AlphaFold 2 (2021, Jumper et al., Nature): possibly the highest-impact AI paper of the era, but it sits in computational biology, not the LLM through-line — it deserves a domain-specific list, not a footnote here.
Diffusion model foundations (Sohl-Dickstein 2015, Ho et al. DDPM 2020): we cite Latent Diffusion as the load-bearing entry, but readers building on image generation should chase the DDPM and score-matching genealogy.
Gato / PaLM-E / general agent papers: included implicitly via Flamingo and Toolformer, but the agent-paper line (ReAct, AutoGPT, SWE-Bench results) is its own decode page.
Most evaluation/benchmark papers (MMLU, HELM, BIG-bench, GPQA, SWE-Bench): essential infrastructure but not on the through-line of capability; they deserve a dedicated index.
Most safety-policy and governance papers: outside the technical-spine framing of this page.

Timeline at a glance

2017
Transformer
Attention Is All You Need lands at NeurIPS; the architecture that everything else on this page is built on.
2018
Pretrain + fine-tune
BERT proves one pretrained model can be tuned to many tasks.
2019
Decoder-only scales
GPT-2 shows multi-paragraph coherence falls out of scale alone.
2020
In-context learning + scaling laws
GPT-3 makes few-shot prompting a paradigm; Kaplan formalizes scaling as a power law; ViT carries the transformer into vision and sets up the multimodal era; RAG defines the grounding pattern.
2021
Multimodal, code, adaptation
CLIP, DALL-E, Codex, LoRA, Switch Transformer: the substrate for the platform era is fully laid.
2022
Alignment becomes engineering
InstructGPT formalizes RLHF; Chinchilla rewrites the compute split; Chain-of-Thought prompting turns scale into step-by-step reasoning; Stable Diffusion ships open weights for image-gen; PaLM caps dense scaling.
2023
Open weights and alternative alignment
Llama, Llama 2, Mistral 7B, DPO, Constitutional AI, Tree of Thoughts, Mamba, Toolformer — the year the field's center of gravity shifted toward open weights and cheaper post-training.
2024
Reasoning and risk turn empirical
o1 ships test-time-compute reasoning as a product; Sleeper Agents and scaled SAE interpretability move alignment from theory to evidence; Llama 3.1 narrows the open-vs-closed gap.
2025
Open reasoning and deception studies
DeepSeek-R1 releases open-weights reasoning on top of DeepSeek-V3 (December 2024), whose disclosed training budget sat well below public-frontier estimates; Anthropic's alignment-faking work demonstrates strategic compliance in production-scale models.

How to use this page

If you are an engineer onboarding to AI: read in chronological order. Spend a full day on the transformer paper, then a day each on GPT-3, Chinchilla, InstructGPT, and the GPT-4 technical report. After those five, the rest of the list reads in any order; you'll have the prior structure to absorb each one in a single sitting.
If you are a founder picking a stack: skip to the open-weights cluster (Llama 2 / 3.1, Mistral 7B, DeepSeek-V3 / R1) and the alignment cluster (Constitutional AI, DPO). Together, these two clusters are the realistic recipe for shipping a fine-tuned product on owned weights in 2026.
If you are an investor or policy reader: prioritize the GPT-3 paper, Chinchilla, the GPT-4 technical report, o1's system card, and the alignment-faking / Sleeper Agents pair. These five give you the capability story, the cost-curve story, the closed-frontier-secrecy story, the inference-time-compute story, and the empirical safety story: the spine of any honest argument about where the field is.
If you are a researcher: assume the list is incomplete relative to your subfield (it is), and treat it as the ambient field everyone outside your subfield is reading.

The Sources list below is the audit trail.

Sources

[01]
Vaswani et al., 'Attention Is All You Need' (2017) — the transformer architecture paper.
arxiv.org/abs/1706.03762
[02]
Devlin et al., 'BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding' (2018).
arxiv.org/abs/1810.04805
[03]
Radford et al., GPT-2, 'Language Models are Unsupervised Multitask Learners' (2019, OpenAI technical report — not on arxiv).
cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
[04]
Brown et al., 'Language Models are Few-Shot Learners' (GPT-3, 2020).
arxiv.org/abs/2005.14165
[05]
Kaplan et al., 'Scaling Laws for Neural Language Models' (2020).
arxiv.org/abs/2001.08361
[06]
Dosovitskiy et al., ViT, 'An Image is Worth 16x16 Words' (2020).
arxiv.org/abs/2010.11929
[07]
Lewis et al., RAG, 'Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks' (2020).
arxiv.org/abs/2005.11401
[08]
Radford et al., CLIP, 'Learning Transferable Visual Models From Natural Language Supervision' (2021).
arxiv.org/abs/2103.00020
[09]
Ramesh et al., DALL-E, 'Zero-Shot Text-to-Image Generation' (2021).
arxiv.org/abs/2102.12092
[10]
Chen et al., 'Evaluating Large Language Models Trained on Code' (Codex / HumanEval, 2021).
arxiv.org/abs/2107.03374
[11]
Wei et al., 'Chain-of-Thought Prompting Elicits Reasoning in Large Language Models' (2022).
arxiv.org/abs/2201.11903
[12]
Hu et al., 'LoRA: Low-Rank Adaptation of Large Language Models' (2021).
arxiv.org/abs/2106.09685
[13]
Fedus, Zoph, Shazeer, 'Switch Transformers' (2021); Borgeaud et al., RETRO (arxiv 2112.04426, 2021).
arxiv.org/abs/2101.03961
[14]
Hoffmann et al., 'Training Compute-Optimal Large Language Models' (Chinchilla, 2022).
arxiv.org/abs/2203.15556
[15]
Ouyang et al., InstructGPT, 'Training language models to follow instructions with human feedback' (2022).
arxiv.org/abs/2203.02155
[16]
Rombach et al., Latent Diffusion / Stable Diffusion (2022); Ramesh et al., DALL-E 2 (arxiv 2204.06125, 2022).
arxiv.org/abs/2112.10752
[17]
Alayrac et al., 'Flamingo: a Visual Language Model for Few-Shot Learning' (DeepMind, 2022); Chowdhery et al., PaLM (arxiv 2204.02311, 2022).
arxiv.org/abs/2204.14198
[18]
Wei et al., 'Emergent Abilities of Large Language Models' (2022); Schaeffer et al., 'Are Emergent Abilities a Mirage?' (arxiv 2304.15004, 2023).
arxiv.org/abs/2206.07682
[19]
OpenAI, 'GPT-4 Technical Report' (2023); Schick et al., Toolformer (arxiv 2302.04761, 2023).
arxiv.org/abs/2303.08774
[20]
Touvron et al., LLaMA (2023); Touvron et al., Llama 2 (arxiv 2307.09288, 2023).
arxiv.org/abs/2302.13971
[21]
Bai et al. (Anthropic), 'Constitutional AI: Harmlessness from AI Feedback' (2022); Bubeck et al., 'Sparks of AGI' (arxiv 2303.12712, 2023).
arxiv.org/abs/2212.08073
[22]
Elhage et al. (Anthropic), 'Toy Models of Superposition' (2022); Templeton et al. (Anthropic), 'Scaling Monosemanticity' (2024, transformer-circuits.pub).
arxiv.org/abs/2209.10652
[23]
Rafailov et al., DPO (2023); Jiang et al., Mistral 7B (arxiv 2310.06825, 2023); Yao et al., Tree of Thoughts (arxiv 2305.10601, 2023); Gu & Dao, Mamba (arxiv 2312.00752, 2023).
arxiv.org/abs/2305.18290
[24]
Hubinger et al. (Anthropic), 'Sleeper Agents' (2024); Greenblatt et al. (Anthropic and Redwood Research), 'Alignment faking in large language models' (arxiv 2412.14093, 2024).
arxiv.org/abs/2401.05566
[25]
Meta AI, 'The Llama 3 Herd of Models' (2024); DeepSeek-AI, DeepSeek-V3 (arxiv 2412.19437, 2024); DeepSeek-AI, DeepSeek-R1 (arxiv 2501.12948, 2025); OpenAI o1 system card and Anthropic Claude 3 / 3.5 model cards as artifacts of record for closed releases.
arxiv.org/abs/2407.21783

