01 · 2020-01
Kaplan et al. · Scaling Laws for Neural Language Models
OpenAI paper that established empirical power-law relationships between loss and model size (N, parameters), dataset size (D, tokens), and training compute (C, FLOPs). Showed that loss decreases as a power law in each variable when the other two are not bottlenecks. Underlying claim: scale is the dominant lever for language modeling improvement. Established the field's core question: how to split a fixed compute budget C between model size N and training tokens D.
Contribution: Showed that scaling follows power laws. Established N, D, and C as the fundamental variables. Substantially underestimated the optimal D for a given budget (later corrected by Chinchilla).
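A power law of the form loss = a * N^(-alpha) is a straight line in log-log space, which is how such exponents are typically estimated. The following is a minimal sketch of that fit, not the paper's actual procedure. The data are synthetic and the constants `TRUE_A` and `TRUE_ALPHA` are made up for illustration; they are not values reported by Kaplan et al.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**(-alpha) as a straight line in log-log space."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((x - mean_x) ** 2 for x in log_x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_x, log_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # log y = log a - alpha * log x, so a = exp(intercept) and alpha = -slope.
    return math.exp(intercept), -slope

# Synthetic data generated from a known power law (illustrative values only).
TRUE_A, TRUE_ALPHA = 10.0, 0.08
param_counts = [1e6, 1e7, 1e8, 1e9, 1e10]
losses = [TRUE_A * n ** (-TRUE_ALPHA) for n in param_counts]

a_hat, alpha_hat = fit_power_law(param_counts, losses)
print(f"a = {a_hat:.3f}, alpha = {alpha_hat:.3f}")
```

On noise-free synthetic data the fit recovers the exponent it was generated with. Real loss curves only follow the line over the range where the other variables are not bottlenecks.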
02 · 2022-03
Hoffmann et al. · Training Compute-Optimal Large Language Models (Chinchilla)
DeepMind paper that revisited Kaplan's analysis with a wider sweep and found that prior models (GPT-3, Gopher) had been substantially undertrained on data relative to their parameter counts. Trained a 70B-parameter model (Chinchilla) on 1.4T tokens, using the same compute budget as the 4x larger 280B Gopher, and outperformed it. Established the now-canonical rule: ~20 training tokens per parameter is approximately compute-optimal under a fixed FLOPs budget.
Contribution: Reset the field. Frontier labs after 2022 trained at much larger D/N ratios. Established the '20 tokens per parameter' rule of thumb. Largely retired the 'big model + few tokens' approach that dominated 2020-2021.
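The 20-tokens-per-parameter rule turns into a concrete allocation once training compute is tied to N and D. The sketch below assumes the common approximation C ≈ 6 * N * D, which is not stated in this entry, so treat it as an assumption. Under it, D = 20N gives N = sqrt(C / 120). The function name `compute_optimal` is hypothetical.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6.0  # assumption: training compute C ≈ 6 * N * D
TOKENS_PER_PARAM = 20.0      # rule of thumb from the Chinchilla entry

def compute_optimal(c_flops, tokens_per_param=TOKENS_PER_PARAM):
    """Split a FLOPs budget into (params, tokens) under C = 6*N*D and D = r*N."""
    n_params = math.sqrt(c_flops / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Sanity check against Chinchilla itself: 70B parameters on 1.4T tokens.
chinchilla_budget = FLOPS_PER_PARAM_TOKEN * 70e9 * 1.4e12
n_opt, d_opt = compute_optimal(chinchilla_budget)
print(f"N = {n_opt / 1e9:.1f}B params, D = {d_opt / 1e12:.2f}T tokens")
```

Feeding in Chinchilla's own implied budget returns its configuration, which only confirms that the rule and the 6ND approximation are mutually consistent. It is not an independent derivation of the optimum.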
03 · 2022+
Inference-aware revisions to Chinchilla optimality
Researchers (Touvron et al. with Llama, then many others) noted that Chinchilla's compute optimality assumes training is the only cost. In production, inference compute often dominates total system cost, and inference cost per token grows with model size. It is therefore often economically optimal to train SMALLER models on MORE tokens (past the Chinchilla 20-tokens-per-param point), trading worse training-compute efficiency for better inference economics. Llama 2 trained 7B and 13B models on 2T tokens (~290 and ~150 tokens/param, respectively). Llama 3 8B trained on 15T tokens (~1900 tokens/param). This is sometimes called 'overtraining', but it is the right call when inference is most of the cost.
Contribution: Established that compute-optimal for training is not the same as economically optimal for deployment. Shaped the smaller-but-overtrained model strategy that defines 2023-2026 frontier open-weight releases.
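The tokens-per-parameter figures above are simple ratios, and the inference argument can be sketched with a rough lifetime-compute estimate. The sketch assumes about 6 * N FLOPs per training token and 2 * N FLOPs per served token; both are common approximations and not from the entry. `SERVED_TOKENS` is a made-up deployment volume. The comparison holds training compute fixed and does not model the quality difference between the two models.

```python
import math

def tokens_per_param(n_params, n_tokens):
    return n_tokens / n_params

def lifetime_flops(n_params, train_tokens, served_tokens):
    # Assumptions: ~6*N FLOPs per training token, ~2*N FLOPs per served token.
    return 6.0 * n_params * train_tokens + 2.0 * n_params * served_tokens

ratios = {
    "Llama 2 7B": tokens_per_param(7e9, 2e12),
    "Llama 2 13B": tokens_per_param(13e9, 2e12),
    "Llama 3 8B": tokens_per_param(8e9, 15e12),
}
for name, ratio in ratios.items():
    print(f"{name}: ~{ratio:.0f} tokens/param")

# Same training budget spent two ways: an overtrained 8B model versus a
# 20-tokens-per-param model of the size that budget would buy.
train_budget = 6.0 * 8e9 * 15e12
big_n = math.sqrt(train_budget / (6.0 * 20.0))
big_d = 20.0 * big_n

SERVED_TOKENS = 1e13  # hypothetical number of tokens served over the model's lifetime
small_total = lifetime_flops(8e9, 15e12, SERVED_TOKENS)
big_total = lifetime_flops(big_n, big_d, SERVED_TOKENS)
print(f"overtrained 8B: {small_total:.2e} FLOPs, 20x-rule model: {big_total:.2e} FLOPs")
```

With equal training compute, the smaller model's lifetime total is lower at any serving volume, because its per-token inference cost scales with N. Whether the trade is worth it depends on how much quality the smaller model gives up, which this sketch leaves out.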
04 · 2024-2025
Inference-time scaling laws (o1, o3, DeepSeek-R1)
OpenAI's o1 (Sep 2024) demonstrated that scaling the compute used at inference time (long chain-of-thought reasoning) improves performance on reasoning benchmarks such as AIME, with accuracy rising roughly linearly in the logarithm of compute. This axis is separate from train-time scaling. DeepSeek-R1 (Jan 2025) showed that the same pattern holds for a publicly released open-weights model. The field now has TWO scaling axes: pretraining compute (the Kaplan-Chinchilla axis) and inference compute (the o1 axis).
Contribution: Opened a second scaling dimension. Frontier labs in 2025-2026 invest heavily in both. Has major implications for inference-cost-per-task economics.
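The cost-per-task implication can be illustrated with back-of-envelope arithmetic. This sketch assumes roughly 2 * N FLOPs per generated token, where N is the number of parameters active per token. It ignores prompt processing and the attention cost that grows with context length. The model size and token counts are hypothetical and do not describe o1 or DeepSeek-R1.

```python
def flops_per_task(active_params, generated_tokens, samples=1):
    # Assumption: ~2 * N FLOPs per generated token; attention and prompt costs ignored.
    return 2.0 * active_params * generated_tokens * samples

HYPOTHETICAL_PARAMS = 70e9   # made-up model size
SHORT_ANSWER_TOKENS = 500    # made-up direct answer length
LONG_COT_TOKENS = 20_000     # made-up long chain-of-thought length

short_cost = flops_per_task(HYPOTHETICAL_PARAMS, SHORT_ANSWER_TOKENS)
long_cost = flops_per_task(HYPOTHETICAL_PARAMS, LONG_COT_TOKENS)
cost_ratio = long_cost / short_cost
print(f"long reasoning costs ~{cost_ratio:.0f}x the compute of a short answer")
```

Under this approximation, per-task compute scales linearly with the number of tokens generated, so reasoning length becomes a cost dial that is set per query rather than once at training time.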