
Scaling laws · the atlas

How big models get, and why it matters.

Six years of scaling-law research, in plain English: the arc from Kaplan to Chinchilla to inference-aware training to o1-style test-time compute. This page lets you ask how big a model should be, on how many tokens, for how much compute, and why the answer changes depending on whether you're paying for training once or inference forever.

Four milestones

The arc, in four papers.

01 · 2020-01

Kaplan et al. · Scaling Laws for Neural Language Models

OpenAI paper that established empirical power-law relationships between model size (parameters), dataset size (tokens), and compute used (FLOPs). Showed that loss decreases as a power law in each variable when the other two are not bottlenecks. Underlying claim: scale is the dominant lever for language modeling improvement. Established the field's core question: how to split a fixed compute budget C (FLOPs) between model size N (params) and dataset size D (tokens).

Contribution: Showed scaling is power-law. Established N (parameters), D (tokens), and C (FLOPs) as the fundamental variables. Underestimated optimal D substantially (this got corrected by Chinchilla).
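A power law is a straight line on log-log axes, which is how such trends are usually fitted. The sketch below shows that idea on synthetic data; the constants (10 and 0.08) are invented for illustration and are not the values reported by Kaplan et al.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = a * x**(-alpha) by least squares on (log x, log y)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    # Slope of the best-fit line in log-log space equals -alpha.
    slope = (sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
             / sum((u - mean_x) ** 2 for u in lx))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, alpha)

# Synthetic, made-up measurements: loss = 10 * N**(-0.08)
param_counts = [1e6, 1e7, 1e8, 1e9, 1e10]
losses = [10.0 * n ** -0.08 for n in param_counts]

a, alpha = fit_power_law(param_counts, losses)

# Extrapolate the fitted line one order of magnitude past the data.
predicted_loss_100b = a * 1e11 ** -alpha
print(f"a={a:.3f} alpha={alpha:.3f} predicted loss at 1e11 params={predicted_loss_100b:.3f}")
```

Real fits are noisier and only hold while the other variables are not the bottleneck, which is the caveat in the paragraph above.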

02 · 2022-03

Hoffmann et al. · Training Compute-Optimal Large Language Models (Chinchilla)

DeepMind paper that re-ran Kaplan's analysis with a wider sweep and found that prior models (GPT-3, Gopher) had been substantially undertrained on data relative to their parameter count. Trained a 70B parameter model (Chinchilla) on 1.4T tokens and outperformed the much larger 280B Gopher. Established the now-canonical rule: ~20 tokens of training data per parameter is approximately compute-optimal under a fixed FLOPs budget.

Contribution: Reset the field. Every frontier lab post-2022 trained at much larger D/N ratios. Established the '20 tokens per parameter' rule of thumb. Largely retired the 'big model + few tokens' approach that dominated 2020-2021.
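The 20-tokens-per-parameter rule turns into a sizing recipe once a compute model is added. The sketch below assumes the common approximation C ≈ 6·N·D (about 6 FLOPs per parameter per training token), which this page does not state; the function name is hypothetical.

```python
import math

FLOPS_PER_PARAM_PER_TOKEN = 6  # assumption: C ≈ 6 * N * D
TOKENS_PER_PARAM = 20          # Chinchilla rule of thumb

def chinchilla_allocation(compute_flops, tokens_per_param=TOKENS_PER_PARAM):
    """Split a FLOPs budget into (params, tokens) with D = tokens_per_param * N.

    Substituting D = r * N into C = 6 * N * D gives N = sqrt(C / (6 * r)).
    """
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_PER_TOKEN * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Budget implied by Chinchilla's own configuration (70B params, 1.4T tokens).
budget = 6 * 70e9 * 1.4e12
n_params, n_tokens = chinchilla_allocation(budget)
print(f"params ≈ {n_params:.3g}, tokens ≈ {n_tokens:.3g}")
```

Feeding in the budget implied by Chinchilla's own run recovers roughly 70B parameters and 1.4T tokens, as expected, since that configuration sits exactly on the 20:1 line.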

03 · 2022+

Inference-aware revisions to Chinchilla optimality

Researchers (Touvron et al. with Llama, then many others) noted that Chinchilla's compute optimality assumes training is the only cost. In production, inference compute often dominates total system cost. This means it's often economically optimal to train SMALLER models on MORE tokens (past the Chinchilla 20-tokens-per-param point), trading worse training-compute efficiency for better inference economics. Llama 2 trained its 7B and 13B models on 2T tokens (~150-300 tokens/param). Llama 3 8B was trained on 15T tokens (~1900 tokens/param). This is sometimes called 'overtraining', but it's the right call when inference is most of the cost.

Contribution: Established that compute-optimal for training is not the same as economically optimal for deployment. Shaped the smaller-but-overtrained model strategy that defines 2023-2026 frontier open-weight releases.
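The tokens-per-parameter figures quoted above are simple division, and it can help to see how far past the 20:1 point each run sits. A minimal sketch using only the model sizes and token counts given in the text:

```python
CHINCHILLA_TOKENS_PER_PARAM = 20

# (parameters, training tokens) as quoted in the text above
runs = {
    "Llama 2 7B": (7e9, 2e12),
    "Llama 2 13B": (13e9, 2e12),
    "Llama 3 8B": (8e9, 15e12),
}

def tokens_per_param(n_params, n_tokens):
    return n_tokens / n_params

ratios = {name: tokens_per_param(n, d) for name, (n, d) in runs.items()}

for name, ratio in ratios.items():
    multiple = ratio / CHINCHILLA_TOKENS_PER_PARAM
    print(f"{name}: {ratio:,.0f} tokens/param, {multiple:.1f}x the Chinchilla ratio")
```

On these numbers the Llama 2 runs land at roughly 8x and 14x the Chinchilla ratio, and Llama 3 8B at roughly 94x.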

04 · 2024-2025

Inference-time scaling laws (o1, o3, DeepSeek-R1)

OpenAI's o1 (Sep 2024) demonstrated that scaling the compute used at inference time (long chain-of-thought reasoning) improves performance on reasoning benchmarks (AIME, GPQA) smoothly and predictably, separately from train-time scaling. In OpenAI's published plots, accuracy rises roughly linearly with the logarithm of test-time compute. DeepSeek-R1 (Jan 2025) showed this pattern works on a public open-weights model. The field now has TWO scaling axes: pretraining compute (the Kaplan-Chinchilla axis) and inference compute (the o1 axis).

Contribution: Opened a second scaling dimension. Frontier labs in 2025-2026 invest heavily in both. Has major implications for inference-cost-per-task economics.
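To see why longer reasoning traces matter for cost per task, a rough estimate is enough. The sketch below assumes about 2 FLOPs per active parameter per generated token and ignores attention cost that grows with context length; the model size and token counts are hypothetical.

```python
def inference_flops(active_params, generated_tokens):
    """Rough forward-pass cost: ~2 FLOPs per active parameter per token (assumption)."""
    return 2 * active_params * generated_tokens

ACTIVE_PARAMS = 70e9  # hypothetical model size

short_answer = inference_flops(ACTIVE_PARAMS, generated_tokens=500)
long_reasoning = inference_flops(ACTIVE_PARAMS, generated_tokens=20_000)

cost_multiple = long_reasoning / short_answer
print(f"short: {short_answer:.2e} FLOPs, long: {long_reasoning:.2e} FLOPs, "
      f"{cost_multiple:.0f}x per task")
```

Under this approximation per-task cost grows linearly with trace length, so a 40x longer chain of thought is about a 40x more expensive answer on the same model.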

What this means for 2026

Five implications.

Why GPT-5 / Claude 5 / Gemini 3 won't be 10× bigger

The 2020 trajectory implied frontier models in 2024-2026 would be 10-100× larger than GPT-3 (175B). They're not: Claude 3.5 Sonnet, GPT-4o, and Gemini 1.5 Pro are widely estimated at similar or smaller parameter counts than GPT-4 (which mid-2023 leaks put at ~1.8T total / ~280B active parameters in an MoE). The compute went into MORE TOKENS, not more parameters. Chinchilla + inference economics drove the change.

Why open-weight models keep being smaller but performing better

Llama 3.1 70B beats GPT-3 175B on most benchmarks while being 60% smaller. This isn't magic: Llama 3 was trained on 15T tokens (~50× more data than GPT-3's 300B). The marginal value of more parameters fell faster than the marginal value of more data. The open-weight scene has aggressively exploited this.

Why training a 'frontier' model is now $100M-$1B+

Even though models aren't 10× larger by parameter count, they DO use 10× more compute in FLOPs, because of longer training runs on more tokens. GPT-4's training cost was put at ~$100M in Sam Altman's public statements. Llama 3.1 405B training reportedly used 30M GPU-hours on H100s. Gemini Ultra training was estimated at $190M+ by SemiAnalysis. The barrier to the frontier is substantially compute, not labor, and that has structural implications for who can train frontier models.
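The "smaller model, more compute" point can be checked with back-of-envelope arithmetic. The sketch below assumes training compute of about 6·N·D FLOPs and uses the parameter and token counts quoted on this page; the GPU-hour price is a made-up placeholder, not a reported figure.

```python
def training_flops(n_params, n_tokens):
    """Approximate training compute, assuming C ≈ 6 * N * D."""
    return 6 * n_params * n_tokens

gpt3_flops = training_flops(175e9, 300e9)       # GPT-3: 175B params, 300B tokens
llama_70b_flops = training_flops(70e9, 15e12)   # Llama 3.1 70B: 70B params, 15T tokens
compute_ratio = llama_70b_flops / gpt3_flops

# Rental-style cost of a run from GPU-hours. The hourly price is hypothetical.
ASSUMED_USD_PER_GPU_HOUR = 2.0
gpu_hours = 30e6  # Llama 3.1 405B figure quoted above
rental_cost_usd = gpu_hours * ASSUMED_USD_PER_GPU_HOUR

print(f"GPT-3 ≈ {gpt3_flops:.2e} FLOPs, Llama 3.1 70B ≈ {llama_70b_flops:.2e} FLOPs")
print(f"compute ratio ≈ {compute_ratio:.0f}x")
print(f"30M GPU-hours at ${ASSUMED_USD_PER_GPU_HOUR}/h ≈ ${rental_cost_usd / 1e6:.0f}M")
```

On these assumptions the 60%-smaller model used about 20× the training compute of GPT-3. The dollar figure covers GPU time only at the placeholder price and leaves out failed runs, experiments, data, and staff.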

What 'compute-optimal' means in 2026

There's no longer one answer. Train-compute-optimal (Chinchilla) wants ~20 tokens per parameter. Inference-economically-optimal (Llama 3 pattern) wants 100-2000+ tokens per parameter. Test-time-compute-optimal (o1 pattern) wants long inference traces on a smaller-or-equal-sized model. The right answer depends on whether you're paying for training once or inference forever.
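"Training once or inference forever" can be made concrete by adding the two costs together. The sketch below assumes about 6·N·D FLOPs for training and about 2·N FLOPs per served token; the two configurations are illustrative, and the comparison covers compute only, with no claim that the two models reach the same quality.

```python
def train_flops(n_params, train_tokens):
    return 6 * n_params * train_tokens  # assumption: C ≈ 6 * N * D

def lifetime_flops(n_params, train_tokens, served_tokens):
    """Training compute plus ~2 FLOPs per parameter per served token."""
    return train_flops(n_params, train_tokens) + 2 * n_params * served_tokens

def crossover_served_tokens(big, small):
    """Served-token volume at which the smaller, longer-trained model
    becomes cheaper over its lifetime than the bigger one."""
    (n_big, d_big), (n_small, d_small) = big, small
    extra_training = train_flops(n_small, d_small) - train_flops(n_big, d_big)
    saving_per_served_token = 2 * (n_big - n_small)
    return extra_training / saving_per_served_token

big = (70e9, 1.4e12)   # Chinchilla-style: ~20 tokens per parameter
small = (8e9, 15e12)   # overtrained: ~1900 tokens per parameter

crossover = crossover_served_tokens(big, small)
print(f"small model wins on total compute after ≈ {crossover:.2e} served tokens")
```

With these inputs the crossover lands near one trillion served tokens, a volume a widely deployed model can plausibly pass. For a single model, the same approximations put serving compute equal to training compute once served tokens reach three times the training tokens.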

Why scaling laws don't predict capability jumps

Scaling laws describe how loss decreases smoothly with compute. They do NOT predict at what compute level a model will suddenly gain a capability (like in-context learning, multi-step reasoning, programming proficiency). The relationship between loss and capability is non-linear and discontinuous in places. 'Emergent abilities' (Wei et al. 2022) is the term for this, though that paper itself has been challenged (Schaeffer et al. 2023) on whether emergence is real or a measurement artifact.
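The measurement-artifact argument is easy to reproduce in miniature. In the sketch below, per-token accuracy improves smoothly, yet an all-or-nothing metric over a multi-token answer looks like a sudden jump. The accuracy values and the answer length are made up, and token errors are assumed independent.

```python
def exact_match_rate(per_token_accuracy, answer_length):
    """Probability that every token of the answer is right,
    assuming independent per-token errors."""
    return per_token_accuracy ** answer_length

# Made-up, smoothly improving per-token accuracies for successively larger models.
per_token = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]
ANSWER_LENGTH = 10  # hypothetical answer length in tokens

exact_match = [exact_match_rate(p, ANSWER_LENGTH) for p in per_token]

for p, em in zip(per_token, exact_match):
    print(f"per-token accuracy {p:.2f} -> exact match {em:.3f}")
```

The exact-match column stays near zero for the first few models and then climbs steeply, even though nothing discontinuous happened underneath. This is the shape of the critique, not a verdict on whether every reported emergent ability is an artifact.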

Sources

[01]
Scaling Laws for Neural Language Models
Kaplan, McCandlish, Henighan, et al. (OpenAI) · 2020
https://arxiv.org/abs/2001.08361 ↗
[02]
Training Compute-Optimal Large Language Models
Hoffmann, Borgeaud, Mensch, et al. (DeepMind) · 2022
https://arxiv.org/abs/2203.15556 ↗
[03]
Emergent Abilities of Large Language Models
Wei, Tay, Bommasani, et al. · 2022
https://arxiv.org/abs/2206.07682 ↗
[04]
Are Emergent Abilities of Large Language Models a Mirage?
Schaeffer, Miranda, Koyejo (Stanford) · 2023
https://arxiv.org/abs/2304.15004 ↗
[05]
Llama 3 technical report (includes overtrained-pattern discussion)
Meta AI · 2024
https://arxiv.org/abs/2407.21783 ↗
[06]
DeepSeek-R1 (inference-time scaling proven on open weights)
DeepSeek-AI · 2025
https://arxiv.org/abs/2501.12948 ↗
[07]
chinchilla's wild implications (LessWrong post)
nostalgebraist · 2022
https://www.lesswrong.com/posts/6Fpvch8RR29qLEWNH/chinchilla-s-wild-implications ↗
