::deep-dive
Training Dynamics and Scaling
Scaling laws, optimizers, mixed precision, gradient accumulation, and the practical art of training large models
Knowing the transformer architecture is necessary but not sufficient: to do frontier work you must know how to train. Training large models is its own discipline, with its own literature, its own folk wisdom, and its own pathologies. The doctorate-grade curriculum here covers:
- the scaling laws (Kaplan et al., and the Chinchilla paper by Hoffmann et al. that corrected them), which dictate the compute-optimal allocation of parameters versus tokens;
- the optimizer choices that matter (AdamW dominates, but Lion, Sophia, and second-order methods like Shampoo deserve study);
- learning rate schedules (linear warmup, cosine decay, the recently popular Warmup-Stable-Decay schedule);
- the practical engineering of mixed precision training (bfloat16 vs float16, loss scaling, the GradScaler dance, and why bfloat16 won);
- gradient accumulation, which reaches larger effective batches on limited memory, and gradient checkpointing, which trades recomputation for activation memory;
- data parallelism, model parallelism (tensor and pipeline), and ZeRO and FSDP for parameter sharding;
- the empirical phenomena of training (grokking, loss spikes, the transition from memorization to generalization, double descent);
- the practical hyperparameter tuning playbook (learning rate, weight decay, batch size, warmup steps, the muP parameterization).
Frontier labs treat training as a craft: knowing exactly which hyperparameter to change when a run starts diverging is worth millions of dollars in compute. The Chinchilla paper alone reshaped the field by showing that GPT-3 was undertrained. The original Kaplan analysis concluded that parameter count should grow much faster than training tokens as compute increases, whereas Hoffmann et al. found that the two should scale in roughly equal proportion. The Adam paper, the muP paper (Yang et al.), and the various memory-efficient training papers are the load-bearing references. By the end of this path you should be able to look at a proposed training run and predict its convergence behavior, allocate compute optimally between model size and data, and debug a loss spike.
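The learning rate schedules named above can be sketched as plain functions of the step counter. This is a minimal illustration, not a reference implementation: the function names, the linear shape of the final decay in the Warmup-Stable-Decay variant, and all parameter names are assumptions chosen for this example.

```python
import math

def warmup_cosine(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay toward min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def warmup_stable_decay(step, total_steps, warmup_steps, decay_steps, peak_lr, min_lr=0.0):
    """Linear warmup, constant plateau, then a linear decay over the last decay_steps.

    The linear decay shape is an assumption of this sketch; other shapes are used too.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        return peak_lr
    progress = (step - decay_start) / max(1, decay_steps)
    return peak_lr + (min_lr - peak_lr) * progress

# Hypothetical run: 1000 steps, 100 warmup steps, peak learning rate 1e-3.
cosine_lrs = [warmup_cosine(s, 1000, 100, 1e-3) for s in range(1000)]
wsd_lrs = [warmup_stable_decay(s, 1000, 100, 200, 1e-3) for s in range(1000)]
```

One practical difference: the cosine schedule needs the total step count fixed in advance, while the stable phase of Warmup-Stable-Decay can be extended before the decay is triggered.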
::reading path · in order
::01 · paper
~6h
Training Compute-Optimal Large Language Models — Hoffmann et al. (Chinchilla paper, DeepMind 2022)
The corrected scaling law. Required reading. The roughly 20-tokens-per-parameter rule of thumb it popularized became the baseline against which Llama 2 and later models set their token budgets.
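As a rough sketch of how that rule of thumb turns a compute budget into a model size, assume the common approximation that training costs about 6 FLOPs per parameter per token (C ≈ 6·N·D) together with D ≈ 20·N. Both constants are approximations, and the function name and arguments below are hypothetical; the paper's own fits vary by estimation approach.

```python
import math

def compute_optimal(flops, tokens_per_param=20.0, flops_per_param_token=6.0):
    """Split a training FLOP budget into (parameters, tokens).

    Assumes C = flops_per_param_token * N * D and D = tokens_per_param * N,
    which gives N = sqrt(C / (flops_per_param_token * tokens_per_param)).
    """
    n_params = math.sqrt(flops / (flops_per_param_token * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# A budget of 5.88e23 FLOPs is 6 * 70e9 parameters * 1.4e12 tokens,
# i.e. Chinchilla's own size and token count.
n_params, n_tokens = compute_optimal(5.88e23)
print(f"params ~ {n_params:.3g}, tokens ~ {n_tokens:.3g}")
```

Treat the result as a starting estimate only: models intended for heavy inference use are often trained on far more tokens than this allocation suggests.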
::02 · paper
~5h
Scaling Laws for Neural Language Models — Kaplan et al. (OpenAI, 2020)
The original scaling laws paper. Read alongside Chinchilla to understand what the original analysis got wrong.
::03 · paper
~2h
Adam: A Method for Stochastic Optimization — Kingma and Ba (2014)
The optimizer behind almost every modern training run. Internalize the bias correction and the update rule.
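The update rule and bias correction can be written out in a few lines. This is a plain-Python sketch over lists of floats using the paper's default hyperparameters; the function name and the list-based interface are assumptions for illustration, not any library's API.

```python
import math

def adam_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. t is the 1-based step count. Returns (params, m, v)."""
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = beta1 * m_i + (1.0 - beta1) * g        # first-moment (mean) estimate
        v_i = beta2 * v_i + (1.0 - beta2) * g * g    # second-moment estimate
        m_hat = m_i / (1.0 - beta1 ** t)             # bias correction: both moments
        v_hat = v_i / (1.0 - beta2 ** t)             # start at zero and are biased low
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v

# Two parameters whose gradients differ by five orders of magnitude.
params, m, v = [1.0, -2.0], [0.0, 0.0], [0.0, 0.0]
params, m, v = adam_step(params, grads=[100.0, -0.001], m=m, v=v, t=1)
print(params)
```

On the first step the bias-corrected ratio is approximately the sign of the gradient, so both parameters move by about `lr` despite the very different gradient magnitudes. Without the correction, the first steps would be scaled by the near-zero moment estimates instead.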
::04 · paper
~2h
Decoupled Weight Decay Regularization — Loshchilov and Hutter (AdamW paper, 2017)
AdamW is the actual optimizer used in practice. The decoupling matters.
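To see why the decoupling matters, the sketch below applies one step to a single scalar weight with a zero loss gradient, once with the L2 penalty folded into the gradient and once with decay applied directly to the weight. The function and argument names are hypothetical.

```python
import math

def adam_like_step(p, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
                   weight_decay=0.0, decoupled=True):
    """One scalar update with either coupled (L2) or decoupled weight decay."""
    if not decoupled:
        g = g + weight_decay * p              # L2 penalty enters the adaptive statistics
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    step = lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        step += lr * weight_decay * p         # decay bypasses the adaptive normalizer
    return p - step, m, v

# Zero loss gradient, so only the decay term acts on the weight.
p_l2, _, _ = adam_like_step(2.0, 0.0, 0.0, 0.0, 1, weight_decay=0.1, decoupled=False)
p_decoupled, _, _ = adam_like_step(2.0, 0.0, 0.0, 0.0, 1, weight_decay=0.1, decoupled=True)
print(p_l2, p_decoupled)
```

In the coupled case the penalty is rescaled by Adam's normalizer, so the weight moves by roughly `lr` regardless of the decay coefficient. In the decoupled case it shrinks by exactly `lr * weight_decay * p`.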
::05 · paper
~3h
Mixed Precision Training — Micikevicius et al. (NVIDIA, 2017)
The original mixed precision paper. Read alongside the Google Cloud bfloat16 write-up by Wang and Kanwar to understand why bf16 dominates today.
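The range-versus-precision trade-off between the two 16-bit formats follows directly from their bit layouts. The sketch below derives the largest finite value, the smallest normal value, and the machine epsilon from the exponent and mantissa widths; the helper name is hypothetical, and it assumes IEEE-style formats where the top exponent code is reserved for inf and NaN.

```python
def float_format_stats(exp_bits, man_bits):
    """Range and precision of a binary float format with the given field widths."""
    bias = 2 ** (exp_bits - 1) - 1
    max_exp = (2 ** exp_bits - 2) - bias          # top exponent code is inf/NaN
    return {
        "max": (2.0 - 2.0 ** -man_bits) * 2.0 ** max_exp,
        "min_normal": 2.0 ** (1 - bias),
        "eps": 2.0 ** -man_bits,                  # spacing of values just above 1.0
    }

FP16 = float_format_stats(exp_bits=5, man_bits=10)
BF16 = float_format_stats(exp_bits=8, man_bits=7)
FP32 = float_format_stats(exp_bits=8, man_bits=23)

for name, stats in [("fp16", FP16), ("bf16", BF16), ("fp32", FP32)]:
    print(name, stats)
```

bfloat16 shares float32's 8-bit exponent, so values that fit in float32 rarely overflow or underflow when cast down, and loss scaling is usually unnecessary. The cost is a coarser mantissa (7 bits versus float16's 10).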
::06 · paper
~5h
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models — Rajbhandari, Rasley, Ruwase, He (Microsoft DeepSpeed, 2019)
The parameter, gradient, and optimizer-state sharding scheme that made trillion-parameter training tractable.
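The memory accounting behind the three sharding stages can be sketched as follows, using the paper's model of mixed precision Adam: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for optimizer state (an fp32 weight copy plus two moments). The function name is hypothetical, and activations, temporary buffers, and fragmentation are deliberately left out.

```python
def zero_memory_gb(n_params, n_devices, stage, opt_bytes_per_param=12):
    """Per-device model-state memory in GB (1e9 bytes) for ZeRO stages 0 to 3."""
    params_bytes = 2.0 * n_params                 # fp16 weights
    grads_bytes = 2.0 * n_params                  # fp16 gradients
    opt_bytes = opt_bytes_per_param * n_params    # fp32 copy, momentum, variance
    if stage >= 1:
        opt_bytes /= n_devices                    # stage 1: shard optimizer state
    if stage >= 2:
        grads_bytes /= n_devices                  # stage 2: also shard gradients
    if stage >= 3:
        params_bytes /= n_devices                 # stage 3: also shard parameters
    return (params_bytes + grads_bytes + opt_bytes) / 1e9

# The paper's running example: 7.5B parameters on 64 devices.
for stage in range(4):
    print(stage, round(zero_memory_gb(7.5e9, 64, stage), 1))
```

For this example the per-device footprint falls from 120 GB with no sharding to about 31.4 GB, 16.6 GB, and 1.9 GB across the three stages.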
::07 · paper
~4h
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism — Shoeybi et al. (NVIDIA, 2019)
Tensor parallelism. The other half of large-model training infrastructure.
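The core trick for the MLP block is to split the first weight matrix by columns and the second by rows, so each shard applies the nonlinearity locally and only one sum across shards is needed at the end. The sketch below simulates the shards in a single process with plain Python lists; the helper names are hypothetical, and the final summation stands in for the all-reduce a real multi-GPU run would perform.

```python
import math
import random

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def gelu_all(m):
    """Exact GeLU applied elementwise."""
    return [[0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0))) for x in row] for row in m]

def dense_mlp(x, w1, w2):
    """Reference MLP: GeLU(x @ w1) @ w2 on a single device."""
    return matmul(gelu_all(matmul(x, w1)), w2)

def tensor_parallel_mlp(x, w1, w2, n_shards):
    """Same MLP with w1 split by columns and w2 by matching rows."""
    hidden = len(w2)                                  # w1: (d, hidden), w2: (hidden, d)
    assert hidden % n_shards == 0, "hidden size must divide evenly across shards"
    chunk = hidden // n_shards
    partials = []
    for rank in range(n_shards):
        lo, hi = rank * chunk, (rank + 1) * chunk
        w1_cols = [row[lo:hi] for row in w1]          # column slice held by this shard
        w2_rows = w2[lo:hi]                           # matching row slice
        partials.append(matmul(gelu_all(matmul(x, w1_cols)), w2_rows))
    rows, cols = len(x), len(w2[0])
    # Stand-in for all-reduce: sum the partial outputs from every shard.
    return [[sum(p[i][j] for p in partials) for j in range(cols)] for i in range(rows)]

rng = random.Random(0)

def rand_matrix(n_rows, n_cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_cols)] for _ in range(n_rows)]

x, w1, w2 = rand_matrix(3, 4), rand_matrix(4, 8), rand_matrix(8, 4)
dense_out = dense_mlp(x, w1, w2)
sharded_out = tensor_parallel_mlp(x, w1, w2, n_shards=2)
max_diff = max(abs(a - b) for ra, rb in zip(dense_out, sharded_out) for a, b in zip(ra, rb))
print(max_diff)
```

Splitting the first matrix by columns is what lets GeLU run independently on each shard. A row split there would require a sum before the nonlinearity, and therefore an extra synchronization point.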
::08 · paper
~6h
Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer — Yang et al. (muP paper, 2022)
Parameterization that lets hyperparameters tuned on small models transfer to large models. Used by frontier labs.
::09 · paper
~10h
Llama 2 and Llama 3 technical reports — Meta
Read for the complete training recipe of a frontier-grade open model, especially the data mix, the LR schedule, and the hyperparameter choices.
::10 · blog
~8h
Deep Learning Tuning Playbook — Godbole et al. (Google researchers, GitHub)
Practical hyperparameter tuning guidance from people who actually train models. Treat it as an operational manual.
::11 · code
~8h
PyTorch FSDP tutorial and documentation
FSDP is the modern PyTorch implementation of ZeRO-style sharding. Required for any actual large-model training.
::exercises · build · derive · reproduce
- 01 Reproduce a small Chinchilla-style scaling experiment: train models at three sizes on increasing token counts and fit the scaling law.
- 02 Implement mixed precision training manually using torch.autocast and GradScaler. Compare memory and throughput to fp32.
- 03 Implement gradient accumulation and verify that accumulating micro-batches up to an effective batch size of N yields the same gradients as a true batch of N, up to floating-point error.
- 04 Train the same model with SGD+momentum, Adam, AdamW, and Lion. Plot loss curves and explain the differences.
- 05 Diagnose a loss spike: intentionally set the learning rate too high, capture the divergence, then fix it with warmup and gradient clipping.
- 06 Set up FSDP training on multiple GPUs (or a single-GPU simulation). Verify equivalence to single-device training.
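As a possible starting point for the gradient accumulation exercise, the sketch below checks the equivalence on a toy one-dimensional linear regression without any framework. The function names and the toy data are made up for illustration. The key assumption is that each micro-batch loss is a mean over its examples and every micro-batch has the same size, so each micro-batch gradient is divided by the number of accumulation steps.

```python
def mse_grad(w, b, xs, ys):
    """Gradient of mean squared error for y_hat = w * x + b over one batch."""
    n = len(xs)
    grad_w = sum(2.0 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2.0 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return grad_w, grad_b

def accumulated_grad(w, b, xs, ys, micro_batch):
    """Accumulate micro-batch gradients, each scaled by 1 / number of steps."""
    assert len(xs) % micro_batch == 0, "this sketch assumes equal-sized micro-batches"
    steps = len(xs) // micro_batch
    grad_w_acc, grad_b_acc = 0.0, 0.0
    for s in range(steps):
        sl = slice(s * micro_batch, (s + 1) * micro_batch)
        grad_w, grad_b = mse_grad(w, b, xs[sl], ys[sl])
        grad_w_acc += grad_w / steps
        grad_b_acc += grad_b / steps
    return grad_w_acc, grad_b_acc

# Toy data, made up for illustration.
xs = [0.5, -1.0, 2.0, 3.0, -0.5, 1.5, 0.0, -2.0]
ys = [1.0, -0.5, 4.5, 6.0, 0.0, 3.5, 1.2, -3.0]

full = mse_grad(0.3, -0.1, xs, ys)
accumulated = accumulated_grad(0.3, -0.1, xs, ys, micro_batch=2)
print(full, accumulated)
```

The equivalence is exact only up to floating-point summation order. It breaks when micro-batches differ in size, when the loss is normalized per token over variable-length sequences, or when the model contains batch-dependent layers such as batch normalization.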
::milestones · observable
- ▲ You can compute the Chinchilla-optimal parameter count for a given training budget.
- ▲ You can debug a divergent training run and identify the cause (LR, init, normalization, data).
- ▲ You have actually trained a model with mixed precision and FSDP.
- ▲ You can explain why bfloat16 beat float16 for large model training.
- ▲ You can read a frontier training report and identify every non-default choice.