::deep-dive

Training Dynamics and Scaling

Scaling laws, optimizers, mixed precision, gradient accumulation, and the practical art of training large models

Knowing the transformer architecture is necessary but not sufficient: to do frontier work you must also know how to train. Training large models is its own discipline, with its own literature, its own folk wisdom, and its own pathologies.

The doctorate-grade curriculum here covers: the scaling laws (Kaplan et al. and the revised analysis in Hoffmann et al., the Chinchilla paper), which govern the compute-optimal split between parameters and training tokens; the optimizer choices that matter (AdamW dominates, but Lion, Sophia, and second-order methods like Shampoo deserve study); learning rate schedules (linear warmup, cosine decay, and the recently popular Warmup-Stable-Decay schedule); the practical engineering of mixed precision training (bfloat16 vs float16, loss scaling, the GradScaler dance, and why bfloat16 won); gradient accumulation, which reaches a larger effective batch on limited memory, and gradient checkpointing, which trades recomputation for activation memory; data parallelism, model parallelism (tensor and pipeline), and ZeRO and FSDP for sharding parameters, gradients, and optimizer state; the empirical phenomena of training (grokking, loss spikes, the transition from memorization to generalization, double descent); and the practical hyperparameter tuning playbook (learning rate, weight decay, batch size, warmup steps, and the muP parameterization).

Frontier labs treat training as a craft: knowing which hyperparameter to change when a run starts diverging can save millions of dollars in compute. The Chinchilla paper alone reshaped the field by arguing that models such as GPT-3 were undertrained for their size. Kaplan et al. had concluded that, as compute grows, parameter count should scale much faster than training tokens; Hoffmann et al. found that the two should scale roughly in proportion. The Adam paper, the muP paper (Yang et al.), and the various memory-efficient training papers are the load-bearing references. By the end of this path you should be able to look at a proposed training run and predict its convergence behavior, allocate compute optimally between model size and data, and debug a loss spike.
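
The compute-optimal allocation mentioned above can be sketched with two widely used rules of thumb, both assumptions here rather than exact fits from either paper: training cost C ≈ 6·N·D FLOPs for N parameters and D tokens, and a Chinchilla-style ratio of roughly 20 tokens per parameter. The function name and constants are illustrative.

```python
import math

# Assumed rules of thumb (approximations, not the papers' fitted values):
#   training FLOPs   C ≈ 6 * N * D   (N = parameters, D = training tokens)
#   compute-optimal  D ≈ 20 * N
FLOPS_PER_PARAM_TOKEN = 6.0
TOKENS_PER_PARAM = 20.0


def compute_optimal_allocation(flops_budget: float) -> tuple[float, float]:
    """Return (params, tokens) that spend flops_budget under the two rules above."""
    # Substituting D = 20 N into C = 6 N D gives C = 120 N^2.
    params = math.sqrt(flops_budget / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    tokens = TOKENS_PER_PARAM * params
    return params, tokens


if __name__ == "__main__":
    for budget in (1e21, 1e22, 5.76e23):
        n, d = compute_optimal_allocation(budget)
        print(f"C={budget:.2e} FLOPs -> N={n / 1e9:.1f}B params, D={d / 1e9:.0f}B tokens")
```

Under these assumptions a budget of 5.76e23 FLOPs comes out at roughly 69B parameters and 1.4T tokens, close to the Chinchilla configuration. Real recipes often train past this ratio on purpose, since a smaller model trained longer is cheaper to serve.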

::reading path · in order

::01 · paper
~6h
Training Compute-Optimal Large Language Models — Hoffmann et al. (Chinchilla paper, DeepMind 2022)
The revised scaling law. Required reading. Its rule of thumb of roughly 20 training tokens per parameter became the reference point for Llama 2 and later models, which train beyond that ratio.
::02 · paper
~5h
Scaling Laws for Neural Language Models — Kaplan et al. (OpenAI, 2020)
The original scaling laws paper. Read alongside Chinchilla to understand what the original analysis got wrong.
::03 · paper
~2h
Adam: A Method for Stochastic Optimization — Kingma and Ba (2014)
The optimizer behind almost every modern training run. Internalize the bias correction and the update rule.
::04 · paper
~2h
Decoupled Weight Decay Regularization — Loshchilov and Hutter (AdamW paper, 2017)
AdamW is the actual optimizer used in practice. The decoupling matters.
::05 · paper
~3h
Mixed Precision Training — Micikevicius et al. (NVIDIA, 2017)
The original mixed precision paper. Read alongside the bfloat16 write-up by Wang and Kanwar (Google) to understand why bf16 dominates today.
::06 · paper
~5h
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models — Rajbhandari, Rasley, Ruwase, He (Microsoft DeepSpeed, 2019)
The parameter, gradient, and optimizer-state sharding scheme that made trillion-parameter training tractable.
::07 · paper
~4h
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism — Shoeybi et al. (NVIDIA, 2019)
Tensor parallelism. The other half of large-model training infrastructure.
::08 · paper
~6h
Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer — Yang et al. (muP paper, 2022)
Parameterization that lets hyperparameters tuned on small models transfer to large models. Used by frontier labs.
::09 · paper
~10h
Llama 2 and Llama 3 technical reports — Meta
Read for the complete training recipe of a frontier-grade open model, especially the data mix, the LR schedule, and the hyperparameter choices.
::10 · blog
~8h
Deep Learning Tuning Playbook — Godbole et al. (Google Research, GitHub)
Practical hyperparameter tuning guidance from people who actually train models. Treat it as an operational manual.
::11 · code
~8h
PyTorch FSDP tutorial and documentation
FSDP is PyTorch's native implementation of ZeRO-style sharding. Essential for large-model training in PyTorch.
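
The bias correction and decoupled weight decay that entries 03 and 04 ask you to internalize fit in a few lines. This is a minimal pure-Python sketch of a single AdamW step over lists of floats; the function name, argument layout, and defaults are illustrative. It assumes the common convention of scaling the decay term by the learning rate, as PyTorch's AdamW does.

```python
import math


def adamw_step(params, grads, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update over lists of floats; returns new (params, m, v).

    `step` is 1-indexed. Names and defaults are illustrative.
    """
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        # Exponential moving averages of the gradient and its square.
        m_i = beta1 * m_i + (1.0 - beta1) * g
        v_i = beta2 * v_i + (1.0 - beta2) * g * g
        # Bias correction: both averages start at zero, so early estimates
        # are too small by a factor of (1 - beta ** step).
        m_hat = m_i / (1.0 - beta1 ** step)
        v_hat = v_i / (1.0 - beta2 ** step)
        # Decoupled weight decay: shrink the weight directly instead of adding
        # weight_decay * p to the gradient, which Adam would then rescale.
        p = p - lr * weight_decay * p
        p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
        new_params.append(p)
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v


if __name__ == "__main__":
    # On the first step the update size is about lr, whatever the gradient scale.
    for grad in (0.5, 50.0):
        p, _, _ = adamw_step([1.0], [grad], [0.0], [0.0], step=1, weight_decay=0.0)
        print(f"grad={grad}: new param = {p[0]:.6f}")
```

With zero-initialized moments, the bias-corrected first step reduces to `lr * g / (|g| + eps)`, which is why Adam's early steps are nearly independent of gradient magnitude and why warmup is usually paired with it.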

::exercises · build · derive · reproduce

01 Reproduce a small Chinchilla-style scaling experiment: train models at three sizes on increasing token counts and fit the scaling law to the resulting losses.
02 Implement mixed precision training manually using torch.autocast and GradScaler. Compare memory and throughput to fp32.
03 Implement gradient accumulation and verify that accumulating k micro-batches of size N/k gives the same gradients as one true batch of N, up to floating-point error.
04 Train the same model with SGD+momentum, Adam, AdamW, and Lion. Plot loss curves and explain the differences.
05 Diagnose a loss spike: intentionally use too high a learning rate, capture the divergence, then fix it with warmup and gradient clipping.
06 Set up FSDP training on multiple GPUs (or a single GPU simulation). Verify equivalence to single-device training.
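
The equivalence check in exercise 03 can be sketched without a framework. The example below uses a hypothetical one-feature linear model with mean squared error and hand-written gradients. It assumes equal-sized micro-batches and a loss averaged over the batch. In a real PyTorch loop the 1/k scaling would be applied to the micro-batch loss before calling backward().

```python
def mse_grad(w, b, xs, ys):
    """Gradient (dw, db) of the mean squared error of y ≈ w*x + b over one batch."""
    n = len(xs)
    gw = sum(2.0 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    gb = sum(2.0 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return gw, gb


def accumulated_grad(w, b, xs, ys, micro_batch_size):
    """Accumulate gradients over equal-sized micro-batches of one large batch."""
    assert len(xs) % micro_batch_size == 0, "sketch assumes equal micro-batches"
    num_micro = len(xs) // micro_batch_size
    gw_acc, gb_acc = 0.0, 0.0
    for i in range(0, len(xs), micro_batch_size):
        gw, gb = mse_grad(w, b, xs[i:i + micro_batch_size], ys[i:i + micro_batch_size])
        # Scale each micro-batch gradient by 1/num_micro. Without this the
        # accumulated gradient is num_micro times larger than the true one.
        gw_acc += gw / num_micro
        gb_acc += gb / num_micro
    return gw_acc, gb_acc


if __name__ == "__main__":
    xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
    ys = [1.0, 3.1, 4.9, 7.2, 9.0, 10.8, 13.1, 15.0]
    full = mse_grad(0.5, 0.0, xs, ys)
    for micro in (1, 2, 4):
        acc = accumulated_grad(0.5, 0.0, xs, ys, micro)
        print(micro, abs(acc[0] - full[0]), abs(acc[1] - full[1]))
```

The match holds only up to floating-point rounding, and it breaks for layers whose output depends on batch composition, such as batch normalization.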

::milestones · observable

▲ You can compute the Chinchilla-optimal parameter count for a given training compute budget.
▲ You can debug a divergent training run and identify the cause (LR, init, normalization, data).
▲ You have actually trained a model with mixed precision and FSDP.
▲ You can explain why bfloat16 beat float16 for large model training.
▲ You can read a frontier training report and identify every non-default choice.
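
For the bfloat16 milestone, it helps to put numbers on the two formats. The sketch below derives range and precision from the bit layouts (float16: 5 exponent and 10 mantissa bits; bfloat16: 8 and 7) and emulates bfloat16 by truncating a float32 encoding. The helper names are illustrative, and truncation is a simplification, since hardware typically rounds to nearest.

```python
import struct


def format_limits(exponent_bits: int, mantissa_bits: int):
    """Largest finite value, smallest positive normal, and machine epsilon
    for an IEEE-754-style binary format with the given field widths."""
    bias = 2 ** (exponent_bits - 1) - 1
    max_finite = (2.0 - 2.0 ** -mantissa_bits) * 2.0 ** bias
    min_normal = 2.0 ** (1 - bias)
    epsilon = 2.0 ** -mantissa_bits
    return max_finite, min_normal, epsilon


def to_bfloat16(x: float) -> float:
    """Emulate bfloat16 by keeping the top 16 bits of the float32 encoding."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


def fits_in_float16(x: float) -> bool:
    """True if x packs as a finite float16 (struct format 'e')."""
    try:
        struct.pack(">e", x)
        return True
    except OverflowError:
        return False


if __name__ == "__main__":
    for name, (e_bits, m_bits) in {"float16": (5, 10), "bfloat16": (8, 7)}.items():
        max_finite, min_normal, eps = format_limits(e_bits, m_bits)
        print(f"{name}: max={max_finite:.4g} min_normal={min_normal:.4g} eps={eps:.4g}")
    # A value just past the float16 range survives in bfloat16, with coarser precision.
    print(fits_in_float16(70000.0), to_bfloat16(70000.0))
```

bfloat16 keeps float32's 8-bit exponent, so activations and gradients that would overflow or underflow in float16 stay representable. That is why loss scaling is usually unnecessary with bf16. The price is three fewer mantissa bits.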