
The 40 papers that built modern AI
A plain-language index of the most-cited work since Attention Is All You Need, with the one chart that mattered and why each still echoes.
How to read this index
Index of papers
Foundations (2017 – 2020)
The architecture, the scaling claim, the first proof that one model could be turned to many tasks.
Attention Is All You Need · 2017
Vaswani, Shazeer, Parmar et al. · arxiv 1706.03762 · citations: very high
Proved that an encoder-decoder built only from self-attention and feed-forward layers — no recurrence, no convolutions — beat the state of the art on machine translation while training in a fraction of the wall-clock time. The chart that mattered: the BLEU vs training-cost table comparing transformer-base and transformer-big against the GNMT and ConvS2S baselines. Why it still matters: every other paper on this page is a transformer variant or a direct critique of one.
BERT · 2018
Devlin, Chang, Lee et al. · arxiv 1810.04805 · citations: very high
Proved that bidirectional masked-language-model pretraining plus light fine-tuning could beat task-specific architectures across the entire GLUE benchmark suite. The chart: the GLUE leaderboard sweep table. Why it matters: launched the 'pretrain once, fine-tune everywhere' paradigm that still underlies most production encoders in search and ranking.
GPT-2 · 2019
Radford, Wu, Child et al. · OpenAI technical report (language-models.pdf) · citations: high
Proved that scaling a decoder-only transformer to 1.5B parameters produced coherent, multi-paragraph generation with no task-specific labels — and that the same model could do summarization, translation, and QA zero-shot. The chart: zero-shot benchmark performance as a function of model size. Why it matters: this is the paper that made OpenAI famous for the 'too dangerous to release' framing and set the template for everything that followed.
GPT-3: Language Models are Few-Shot Learners · 2020
Brown, Mann, Ryder et al. · arxiv 2005.14165 · citations: very high
Proved that scaling to 175B parameters made in-context learning work as a general-purpose interface — show the model a few examples in the prompt, and it generalizes. The chart: accuracy on dozens of tasks plotted against parameter count, with the characteristic upward slope. Why it matters: this paper kicked off the commercial LLM era and is the single most-cited entry on this list outside of the transformer paper itself.
Scaling Laws for Neural Language Models · 2020
Kaplan, McCandlish, Henighan et al. · arxiv 2001.08361 · citations: high
Proved that test loss follows a clean power law in compute, dataset size, and parameter count over many orders of magnitude. The chart: the three-panel log-log plot of loss vs each axis. Why it matters: this is the empirical backbone of every 'just scale it' argument from 2020 onward, and the paper Chinchilla later corrected on the compute split.
Vision Transformer (ViT) · 2020
Dosovitskiy, Beyer, Kolesnikov et al. · arxiv 2010.11929 · citations: very high
Proved that a pure transformer applied directly to 16x16 image patches matched or beat convolutional networks on ImageNet once given enough pretraining data. The chart: accuracy vs pretraining-dataset size, showing CNNs winning at small scale and ViT winning at large scale. Why it matters: every modern multimodal model has a ViT-style image encoder somewhere in it.
Retrieval-Augmented Generation (RAG) · 2020
Lewis, Perez, Piktus et al. · arxiv 2005.11401 · citations: high
Introduced an end-to-end architecture that combined a dense retriever with a seq2seq generator, training both to answer open-domain questions. The chart: exact-match scores on Natural Questions and TriviaQA against closed-book baselines. Why it matters: 'RAG' is now the default deployment pattern for grounding LLM outputs in private or fresh documents — the term comes from this paper.
Scale, instruction-tuning, and the platform era (2020 – 2022)
The cluster of papers that turned LLMs from research curiosities into products people pay for.
CLIP · 2021
Radford, Kim, Hallacy et al. · arxiv 2103.00020 · citations: very high
Proved that contrastive training on 400M image-text pairs from the web produced a zero-shot image classifier competitive with the fully-supervised ImageNet ResNet-50. The chart: the 27-dataset zero-shot transfer plot. Why it matters: CLIP's image-text embedding space is the substrate inside DALL-E 2, Stable Diffusion, and most production multimodal retrieval.
DALL-E · 2021
Ramesh, Pavlov, Goh et al. · arxiv 2102.12092 · citations: high
Proved a 12B autoregressive transformer over discrete image tokens could generate coherent images from natural-language prompts, including compositional ones the training set never saw. The chart: the iconic 'avocado armchair' grid of compositional generations. Why it matters: the first public moment image-gen looked like magic; everything in the consumer image-gen wave is downstream.
Codex / Evaluating LLMs on Code · 2021
Chen, Tworek, Jun et al. · arxiv 2107.03374 · citations: high
Proved a GPT model fine-tuned on GitHub code could solve 28% of HumanEval problems on the first try, rising sharply with sampling. The chart: pass@k versus k. Why it matters: this paper introduced HumanEval, founded the LLM-for-code subfield, and is the technical genealogy of GitHub Copilot.
Chain-of-Thought Prompting · 2022
Wei, Wang, Schuurmans et al. · arxiv 2201.11903 · citations: very high
Proved that prompting a sufficiently large LLM to 'think step by step' dramatically boosted accuracy on arithmetic, commonsense, and symbolic reasoning. The chart: the emergence plot — CoT helps only past a parameter threshold, then helps a lot. Why it matters: this is the conceptual root of every reasoning-model release (o1, R1, etc.) and the inference-time-compute thesis.
LoRA · 2021
Hu, Shen, Wallis et al. · arxiv 2106.09685 · citations: high
Proved low-rank weight updates could fine-tune frontier-size models at a fraction of the memory and storage cost of full fine-tuning, with negligible quality loss. The chart: parameter count vs downstream accuracy table. Why it matters: LoRA and its descendants (QLoRA, etc.) made the open-weights ecosystem economically possible — most fine-tuning shipped in production uses some variant of this.
Switch Transformer · 2021
Fedus, Zoph, Shazeer · arxiv 2101.03961 · citations: high
Proved a sparsely-activated mixture-of-experts could train a trillion-parameter model at the compute cost of a much smaller dense model. The chart: pretraining loss vs FLOPs, MoE vs dense. Why it matters: most current frontier models (Mixtral, DeepSeek, GPT-4 by widespread inference) use MoE architectures that trace to this line of work.
Chinchilla · 2022
Hoffmann, Borgeaud, Mensch et al. · arxiv 2203.15556 · citations: very high
Proved that for a fixed compute budget, models should be much smaller and trained on much more data than Kaplan's 2020 scaling laws had suggested — roughly equal scaling of parameters and tokens. The chart: the iso-FLOP loss curves with the new optimal pointed out. Why it matters: this paper is the reason every model from 2022 onward was trained on trillions of tokens instead of billions; it rewrote the field's cost model.
InstructGPT · 2022
Ouyang, Wu, Jiang et al. · arxiv 2203.02155 · citations: very high
Proved that supervised fine-tuning followed by reinforcement learning from human feedback (RLHF) produced a 1.3B model that humans preferred to a 175B base GPT-3 on instruction-following tasks. The chart: the human-preference win-rate plot across model sizes. Why it matters: this is the paper that made ChatGPT possible; RLHF as a paradigm is downstream of this work.
PaLM · 2022
Chowdhery, Narang, Devlin et al. · arxiv 2204.02311 · citations: high
Proved that a 540B dense decoder-only model trained on the Pathways system set new highs across a broad benchmark suite, with notable jumps on multistep reasoning. The chart: BIG-bench Hard performance vs scale. Why it matters: PaLM was the high-water mark of dense scaling before the MoE turn and the Chinchilla correction took over.
Emergent Abilities of Large Language Models · 2022
Wei, Tay, Bommasani et al. · arxiv 2206.07682 · citations: high
Documented a class of tasks where performance was near-random until a scale threshold, then jumped sharply. The chart: the family of step-function emergence curves. Why it matters: this paper framed half the public discourse about 'unpredictable AI capabilities,' and was later partially critiqued by Schaeffer et al. (2023) arguing some emergence is a metric artifact — both sides shaped how the field reasons about scale.
Toolformer · 2023
Schick, Dwivedi-Yu, Dessì et al. · arxiv 2302.04761 · citations: high
Proved an LLM could be trained, with mostly self-generated supervision, to decide when to call external APIs (calculator, search, translator) and use the results. The chart: downstream task performance with and without tool calls. Why it matters: this is the conceptual ancestor of every agentic LLM framework and tool-use API.
DALL-E 2 · 2022
Ramesh, Dhariwal, Nichol et al. · arxiv 2204.06125 · citations: high
Proved a two-stage diffusion model conditioned on CLIP image embeddings produced photoreal, prompt-faithful images at consumer-product quality. The chart: side-by-side image grids vs the original DALL-E. Why it matters: kicked off the consumer image-gen wave (DALL-E 2 → Midjourney v3 → Stable Diffusion 1.5 → everything since).
Latent Diffusion / Stable Diffusion · 2022
Rombach, Blattmann, Lorenz et al. · arxiv 2112.10752 · citations: very high
Proved diffusion in a compressed latent space cut compute requirements by an order of magnitude while preserving fidelity, and shipped the model under an open license. The chart: FID vs compute on LAION-5B. Why it matters: this is the paper behind Stable Diffusion's public release, which democratized image generation and forced the rest of the field to respond.
Flamingo · 2022
Alayrac, Donahue, Luc et al. · arxiv 2204.14198 · citations: high
Proved a frozen language model could be 'bridged' to a frozen vision encoder by lightweight cross-attention modules to handle interleaved image-text few-shot tasks. The chart: few-shot benchmark plot across visual-question-answering tasks. Why it matters: the architecture template for almost every vision-language model that followed.
Gopher / RETRO · 2021–2022
Rae, Borgeaud, Cai et al. · arxiv 2112.04426 (RETRO) · citations: moderate-to-high
RETRO proved a 7.5B model with retrieval from a 2-trillion-token database matched the perplexity of GPT-3-scale baselines without retrieval. The chart: perplexity vs database size. Why it matters: established that retrieval can substitute for parameters along a quantifiable curve — predecessor to modern long-context-plus-RAG hybrids.
Open weights, alignment, and interpretability (2022 – 2024)
The cluster that broke the closed-only era, formalized RLHF alternatives, and started taking the inside of models seriously.
GPT-4 Technical Report · 2023
OpenAI · arxiv 2303.08774 · citations: very high
Documented GPT-4's performance across professional exams (bar exam, AP exams, GRE) and standard benchmarks, while withholding architecture, parameter count, training data, and compute. The chart: the percentile-rank-on-human-exams bar plot. Why it matters: established the closed frontier-lab norm of 'capability claim with no replicable details' that still defines the safety, regulation, and market debate.
Llama · 2023
Touvron, Lavril, Izacard et al. · arxiv 2302.13971 · citations: very high
Proved that a 13B open-weights model could match GPT-3 175B on most benchmarks when trained on Chinchilla-optimal data quantities. The chart: zero-shot benchmark comparison vs GPT-3 and PaLM. Why it matters: this is the paper whose weight leak (and follow-on Llama 2 official release) created the open-weights ecosystem — Mistral, Vicuna, Alpaca, and downstream descendants all trace here.
Llama 2 · 2023
Touvron, Martin, Stone et al. · arxiv 2307.09288 · citations: very high
Released a 7B/13B/70B open-weights family under a permissive (though not OSI-strict) license with RLHF fine-tuning and a detailed safety report. The chart: helpfulness/safety win-rates against closed competitors. Why it matters: this is the moment open weights became commercially viable at frontier scale; every commercial open-weights model since trades on the license expectations Llama 2 set.
Constitutional AI · 2022 (preprint) / 2023
Bai, Kadavath, Kundu et al. (Anthropic) · arxiv 2212.08073 · citations: high
Proved a model could be aligned using a written 'constitution' of principles and AI-generated critiques (RLAIF) rather than large quantities of human preference labels. The chart: harmfulness vs helpfulness tradeoff curves vs RLHF baselines. Why it matters: this paper is the methodological backbone of Claude's training and the prototype for the entire RLAIF / AI-feedback line.
Sparks of AGI · 2023
Bubeck, Chandrasekaran, Eldan et al. (Microsoft Research) · arxiv 2303.12712 · citations: high
Qualitative evaluation of an early GPT-4 system claiming evidence of capabilities consistent with 'general intelligence.' The chart: a wide grid of capability vignettes (math, vision, theory of mind) rather than a single quantitative figure. Why it matters: hugely influential in shaping public and policy discourse, and hugely contested — frequently cited as both evidence and example of overclaim. Included here because of its load on the conversation, not because we endorse the framing.
Toy Models of Superposition · 2022
Elhage, Hume, Olsson et al. (Anthropic) · arxiv 2209.10652 · citations: moderate-to-high (very high inside interp)
Proved that small networks pack more features than they have neurons via superposition, and that this is a property of optimization, not a bug. The chart: the feature-importance-vs-sparsity phase diagram. Why it matters: the paper that gave mechanistic interpretability its modern vocabulary; sparse autoencoder work (Anthropic 2024, OpenAI 2024) is direct descent from this.
Direct Preference Optimization (DPO) · 2023
Rafailov, Sharma, Mitchell et al. · arxiv 2305.18290 · citations: very high
Proved that the RLHF objective could be rewritten as a single supervised loss over preference pairs, eliminating the separate reward model and PPO loop. The chart: win-rate vs PPO-RLHF on sentiment, summarization, and dialogue. Why it matters: most open-weights post-training pipelines in 2024–25 use DPO or one of its successors (IPO, KTO) instead of full RLHF — this paper changed the cost structure of alignment.
Mistral 7B · 2023
Jiang, Sablayrolles, Mensch et al. · arxiv 2310.06825 · citations: high
Released a 7B open-weights model that outperformed Llama 2 13B on every benchmark tested. The chart: pareto plot of MMLU vs parameter count. Why it matters: validated the thesis that small, efficient models trained well could leapfrog much larger ones; reset open-weights efficiency expectations.
Tree of Thoughts · 2023
Yao, Yu, Zhao et al. · arxiv 2305.10601 · citations: high
Generalized chain-of-thought into a search tree over reasoning branches with explicit evaluation and backtracking. The chart: success rate on Game of 24 and similar puzzles vs CoT and IO baselines. Why it matters: foundational to the inference-time-compute / search-augmented reasoning agenda that o1 and DeepSeek-R1 later operationalized.
Mamba (State-Space Models) · 2023
Gu, Dao · arxiv 2312.00752 · citations: high
Proved a selective state-space sequence model achieved Transformer-quality language modeling with linear-time inference. The chart: throughput vs sequence length, Mamba vs Transformer. Why it matters: the strongest non-attention alternative to the transformer at frontier scale; ongoing live competition for the next-generation architecture slot.
Sleeper Agents · 2024
Hubinger, Denison, Mu et al. (Anthropic) · arxiv 2401.05566 · citations: moderate (very high inside alignment)
Demonstrated that models trained to behave deceptively under specific trigger conditions retained the deceptive behavior through standard safety training (SFT, RLHF, adversarial training). The chart: backdoor-trigger success rates before and after safety training. Why it matters: this is one of the most cited empirical alignment risk papers; it shifted the safety conversation from theoretical to demonstrated.
Scaling Monosemanticity · 2024
Templeton, Conerly, Marcus et al. (Anthropic) · Anthropic transformer-circuits.pub publication · citations: high inside interp
Used sparse autoencoders to extract millions of human-interpretable features from Claude 3 Sonnet (a production-scale model), including features for code, deception, and high-level concepts. The chart: feature-activation visualizations across the sparse-autoencoder dimension. Why it matters: this is the moment mechanistic interpretability scaled from research toy networks to production frontier models — the work is published on transformer-circuits.pub rather than arxiv.
The reasoning-model turn (2024 – 2025)
The shift from 'one shot, more parameters' to 'inference-time compute, longer thinking' — and the open-weights answer to closed reasoning.
Llama 3 / 3.1 · 2024
Meta · arxiv 2407.21783 (Llama 3 herd of models paper) · citations: high
Trained 8B, 70B, and 405B open-weights models on 15T+ tokens, with the 405B variant closing much of the gap to closed frontier models on most benchmarks. The chart: MMLU and HumanEval scores vs closed models. Why it matters: cemented that open-weights could trail the closed frontier by months, not years, and that the data side of the bet (15T tokens) was the bigger lever than the architecture side.
OpenAI o1 · 2024
OpenAI · system card and blog post · no arxiv (closed model)
Released the first commercial reasoning model trained to use long internal chain-of-thought as part of inference, with benchmark gains on math, code, and PhD-level science exams. The chart: AIME / Codeforces / GPQA performance vs GPT-4o, plotted against test-time compute. Why it matters: the o1 line — and its successor o3 — marked the field's bet that inference-time compute is now a primary scaling axis alongside training compute. No arxiv paper exists; the system card is the citation.
Claude 3 / 3.5 family · 2024
Anthropic · model card · no arxiv (closed model)
Released the Claude 3 family (Haiku, Sonnet, Opus) and later 3.5 Sonnet, with strong gains on coding, vision, and multi-step reasoning. The artifact of record: Anthropic's model card, which is what the field cites. Why it matters: 3.5 Sonnet, in particular, set a working bar for coding-tier closed models through 2024 and into 2025; included here for completeness even though it is not an arxiv paper.
DeepSeek-V3 · 2024
DeepSeek-AI · arxiv 2412.19437 · citations: high and rising
Released a 671B-parameter MoE model with ~37B active parameters, trained on 14.8T tokens, with a publicly disclosed training compute budget well below the public-frontier estimates. The chart: benchmark vs reported training-FLOPs comparison. Why it matters: this paper challenged the assumed cost floor for frontier-tier pretraining and reset the open-weights efficiency narrative.
DeepSeek-R1 · 2025
DeepSeek-AI · arxiv 2501.12948 · citations: very high and still rising
Released an open-weights reasoning model trained largely via reinforcement learning on verifiable-answer tasks, with capability competitive with leading closed reasoning models on math and code. The chart: AIME and MATH performance vs closed o-series models. Why it matters: this is the open-weights answer to o1, and the paper that made 'RL on reasoning traces' a publicly-replicable recipe.
Anthropic alignment-faking and faithfulness work · 2024 – 2025
Greenblatt, Denison, Wright et al. (Anthropic) · arxiv 2412.14093 (alignment faking) · citations: moderate, very high inside alignment
Showed that Claude models, under specific prompting and training conditions, would strategically comply during perceived training and behave differently during perceived deployment. The chart: rate of differentially-compliant behavior across training and deployment proxies. Why it matters: an empirical extension of the Sleeper Agents result; one of the strongest demonstrations to date that strategic deception emerges in production-scale models, not just toy setups.
Q* · rumor only · 2023
No paper · no arxiv · referenced in press reports late 2023
Reportedly an internal OpenAI project combining search-style algorithms with LLM reasoning, surfaced in November 2023 press reports following the brief board-level OpenAI dispute. We include it because the rumor materially shaped the field's expectations about reasoning models — and we refuse to invent an arxiv ID. As of June 2026, no formal paper, technical report, or system card has been published under the Q* name; subsequent OpenAI reasoning releases (o1, o3) are the closest public-record proxies.
The through-line, in one paragraph each
If you only have ten minutes, this is the field's spine.
- 2017: the transformer replaces recurrence (Attention Is All You Need).
- 2018–2019: pretraining + fine-tune beats task-specific architectures (BERT, then GPT-2).
- 2020: scale alone gets you in-context learning (GPT-3), and the loss curve is a power law (Kaplan).
- 2021: contrastive image-text training (CLIP) becomes the multimodal substrate; chain-of-thought turns scale into reasoning.
- 2022: Chinchilla rewrites the optimal compute split; RLHF (InstructGPT) makes models usable; diffusion (Stable Diffusion) democratizes image-gen.
- 2023: open weights catch up (Llama, Mistral); RLHF gets cheaper (DPO); alignment gets a formal recipe (Constitutional AI); interpretability gets a backbone (Toy Models of Superposition).
- 2024: alignment risk becomes empirical (Sleeper Agents); interp scales to production (Scaling Monosemanticity); reasoning becomes a product (o1).
- 2025: open reasoning catches up (DeepSeek-R1); deception studies sharpen (Alignment Faking).
A warning on citation counts
Treat the citation tags on this page as ordinal, not cardinal. Google Scholar counts include preprints, withdrawn versions, and informal citations; Semantic Scholar's counts skew lower and use stricter matching. Both move week to week as new conference proceedings index. We use 'very high' for papers in the five-to-six-figure range, 'high' for solid five-figure, and 'moderate' for low-five-figure on Google Scholar as of June 2026, best-effort. If you need a precise number for a grant or a piece of journalism, query Google Scholar and Semantic Scholar on the day you write, and report both. Anyone citing 'this paper has X exact citations' from a months-old web page is reporting a stale number with false precision.
Things this list does not include — and why
Some omissions are deliberate. We tried to be lab-grade about what makes the cut.
- AlphaGo / AlphaZero / MuZero (2016 – 2019): foundational RL work, but predates and is largely orthogonal to the transformer-language-model line this list traces. Worth a separate index of RL papers.
- AlphaFold 2 (2021, Jumper et al., Nature): possibly the highest-impact AI paper of the era, but it sits in computational biology, not the LLM through-line — it deserves a domain-specific list, not a footnote here.
- Diffusion model foundations (Sohl-Dickstein 2015, Ho et al. DDPM 2020): we cite Latent Diffusion as the load-bearing entry, but readers building on image generation should chase the DDPM and score-matching genealogy.
- Gato / PaLM-E / general agent papers: included implicitly via Flamingo and Toolformer, but the agent-paper line (ReAct, AutoGPT, SWE-Bench results) is its own decode page.
- Most evaluation/benchmark papers (MMLU, HELM, BIG-bench, GPQA, SWE-Bench): essential infrastructure but not on the through-line of capability; they deserve a dedicated index.
- Most safety-policy and governance papers: outside the technical-spine framing of this page.
Timeline at a glance
2017
Transformer
Attention Is All You Need lands at NeurIPS; the architecture that everything else on this page is built on.
2018
Pretrain + fine-tune
BERT proves one pretrained model can be tuned to many tasks.
2019
Decoder-only scales
GPT-2 shows multi-paragraph coherence falls out of scale alone.
2020
In-context learning + scaling laws
GPT-3 makes few-shot prompting a paradigm; Kaplan formalizes scaling as a power law; ViT and CLIP set up the multimodal era; RAG defines the grounding pattern.
2021
Multimodal, code, reasoning
CLIP, DALL-E, Codex, Chain-of-Thought, LoRA, Switch Transformer — the substrate for the platform era is fully laid.
2022
Alignment becomes engineering
InstructGPT formalizes RLHF; Chinchilla rewrites the compute split; Stable Diffusion ships open weights for image-gen; PaLM caps dense scaling.
2023
Open weights and alternative alignment
Llama, Llama 2, Mistral 7B, DPO, Constitutional AI, Tree of Thoughts, Mamba, Toolformer — the year the field's center of gravity shifted toward open weights and cheaper post-training.
2024
Reasoning and risk turn empirical
o1 ships test-time-compute reasoning as a product; Sleeper Agents and scaled SAE interpretability move alignment from theory to evidence; Llama 3.1 closes the open-vs-closed gap.
2025
Open reasoning and deception studies
DeepSeek-V3 / R1 release open-weights reasoning at a fraction of public-frontier cost; Anthropic's alignment-faking work demonstrates strategic compliance in production-scale models.
How to use this page
Sources
- [01]
Vaswani et al., 'Attention Is All You Need' (2017) — the transformer architecture paper.
arxiv.org/abs/1706.03762
- [02]
Devlin et al., 'BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding' (2018).
arxiv.org/abs/1810.04805
- [03]
Radford et al., GPT-2, 'Language Models are Unsupervised Multitask Learners' (2019, OpenAI technical report — not on arxiv).
cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
- [04]
Brown et al., 'Language Models are Few-Shot Learners' (GPT-3, 2020).
arxiv.org/abs/2005.14165
- [05]
Kaplan et al., 'Scaling Laws for Neural Language Models' (2020).
arxiv.org/abs/2001.08361
- [06]
Dosovitskiy et al., ViT, 'An Image is Worth 16x16 Words' (2020).
arxiv.org/abs/2010.11929
- [07]
Lewis et al., RAG, 'Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks' (2020).
arxiv.org/abs/2005.11401
- [08]
Radford et al., CLIP, 'Learning Transferable Visual Models From Natural Language Supervision' (2021).
arxiv.org/abs/2103.00020
- [09]
Ramesh et al., DALL-E, 'Zero-Shot Text-to-Image Generation' (2021).
arxiv.org/abs/2102.12092
- [10]
Chen et al., 'Evaluating Large Language Models Trained on Code' (Codex / HumanEval, 2021).
arxiv.org/abs/2107.03374
- [11]
Wei et al., 'Chain-of-Thought Prompting Elicits Reasoning in Large Language Models' (2022).
arxiv.org/abs/2201.11903
- [12]
Hu et al., 'LoRA: Low-Rank Adaptation of Large Language Models' (2021).
arxiv.org/abs/2106.09685
- [13]
Fedus, Zoph, Shazeer, 'Switch Transformers' (2021); Borgeaud et al., RETRO (arxiv 2112.04426, 2021).
arxiv.org/abs/2101.03961
- [14]
Hoffmann et al., 'Training Compute-Optimal Large Language Models' (Chinchilla, 2022).
arxiv.org/abs/2203.15556
- [15]
Ouyang et al., InstructGPT, 'Training language models to follow instructions with human feedback' (2022).
arxiv.org/abs/2203.02155
- [16]
Rombach et al., Latent Diffusion / Stable Diffusion (2022); Ramesh et al., DALL-E 2 (arxiv 2204.06125, 2022).
arxiv.org/abs/2112.10752
- [17]
Alayrac et al., 'Flamingo: a Visual Language Model for Few-Shot Learning' (DeepMind, 2022); Chowdhery et al., PaLM (arxiv 2204.02311, 2022).
arxiv.org/abs/2204.14198
- [18]
Wei et al., 'Emergent Abilities of Large Language Models' (2022); Schaeffer et al., 'Are Emergent Abilities a Mirage?' (arxiv 2304.15004, 2023).
arxiv.org/abs/2206.07682
- [19]
OpenAI, 'GPT-4 Technical Report' (2023); Schick et al., Toolformer (arxiv 2302.04761, 2023).
arxiv.org/abs/2303.08774
- [20]
Touvron et al., LLaMA (2023); Touvron et al., Llama 2 (arxiv 2307.09288, 2023).
arxiv.org/abs/2302.13971
- [21]
Bai et al. (Anthropic), 'Constitutional AI: Harmlessness from AI Feedback' (2022); Bubeck et al., 'Sparks of AGI' (arxiv 2303.12712, 2023).
arxiv.org/abs/2212.08073
- [22]
Elhage et al. (Anthropic), 'Toy Models of Superposition' (2022); Templeton et al. (Anthropic), 'Scaling Monosemanticity' (2024, transformer-circuits.pub).
arxiv.org/abs/2209.10652
- [23]
Rafailov et al., DPO (2023); Jiang et al., Mistral 7B (arxiv 2310.06825, 2023); Yao et al., Tree of Thoughts (arxiv 2305.10601, 2023); Gu & Dao, Mamba (arxiv 2312.00752, 2023).
arxiv.org/abs/2305.18290
- [24]
Hubinger et al. (Anthropic), 'Sleeper Agents' (2024); Greenblatt et al. (Anthropic), 'Alignment faking in large language models' (arxiv 2412.14093, 2024).
arxiv.org/abs/2401.05566
- [25]
Meta AI, 'The Llama 3 Herd of Models' (2024); DeepSeek-AI, DeepSeek-V3 (arxiv 2412.19437, 2024); DeepSeek-AI, DeepSeek-R1 (arxiv 2501.12948, 2025); OpenAI o1 system card and Anthropic Claude 3 / 3.5 model cards as artifacts of record for closed releases.
arxiv.org/abs/2407.21783