The training pipeline atlas

How a frontier language model actually gets made, stage by stage

Most public writing about large language models focuses on the chat interface: the prompt box, the assistant voice, the safety refusals. That is the last five percent of the work. The first ninety-five percent is a multi-stage industrial pipeline that turns raw web text into a system that can hold a conversation, refuse a jailbreak, and pass a coding interview.

This page is a working atlas of that pipeline. We walk through each stage in the order it happens at a frontier lab: pre-training, mid-training or continued pre-training, supervised fine-tuning, preference learning (RLHF via PPO, or DPO), constitutional or rule-based methods, safety-specific rounds, adversarial red-teaming, capability evaluation, model card, release. For each stage we give the rough compute proportion, the data shape, what changes about the weights, and what the model gains.

The voice here is lab-grade and anti-hype. Some numbers (compute proportions, dataset sizes, exact loss curves) are not published by every lab, and where the public literature is silent we say so. The frontier moves quickly and provider docs are the source of truth for current numbers. Where we cite a fact, we cite the primary paper. Where the field is in active debate (DPO vs PPO, scaling-law extrapolation past Chinchilla, the actual marginal value of constitutional AI versus heavy RLHF), we name the debate rather than picking a side for you.

If you came in expecting a marketing diagram with five gradient arrows labelled "AI magic," this is not that. It is an attempt to make the actual industrial process legible to someone who wants to reason about model quality, training cost, safety properties, and where the open-source ecosystem can and cannot keep up.

The pipeline at a glance

Compute share figures are best-effort estimates synthesized from the few labs that have published a breakdown (notably the Llama 3 and InstructGPT papers). Many labs do not disclose the split. The headline is durable: pre-training dominates FLOPs, post-training dominates behavior.

Stage	What it does	Rough compute share	What changes
Pre-training	Next-token prediction on a curated text corpus	Typically >95% of total training FLOPs	Most of the weights settle into their final shape
Mid-training / continued pre-training	More pre-training on cleaner or domain-shifted data, often longer context	Single-digit percent	Knowledge refresh, longer context window, code/math density
Supervised fine-tuning (SFT)	Train on hand-curated (prompt, ideal-response) pairs	Small (<1% of total)	The base model learns to follow an instruction format
Preference learning (RLHF or DPO)	Optimize against a learned preference model or against direct preference pairs	Small in FLOPs, large in human-label cost	Tone, helpfulness, refusal behavior, hallucination rate
Constitutional / rule-based pass	Self-critique against written principles (Anthropic-style CAI) or rule-following data	Small	Refusal behavior with fewer human-written safety labels
Safety RLHF / harmlessness rounds	Targeted preference training on adversarial and policy-relevant prompts	Small	Refusals on disallowed content, jailbreak resistance
Red-team and adversarial evaluation	Humans and tools try to break the model; failures fed back as training data	Iterative, weeks to months	Closes specific known failure modes
Capability evaluation	Public and private benchmarks; internal task batteries	Compute light, time heavy	Nothing — measurement only
Model card + release	Documented limits, evals, safety policy, deployment surface	Zero training compute	Public contract for what the model is

Pre-training: where the model actually learns

Pre-training is the long stage where a randomly initialized transformer learns to predict the next token across a very large corpus of text (and increasingly code, math, images, audio). The objective is mechanical (minimize cross-entropy on next-token prediction), but the side effect is that the model internalizes grammar, world facts, code structure, arithmetic, multilingual mappings, and a great deal of common-sense reasoning. Three things determine what comes out of this stage.

First, the data. Web crawl is the base, but every frontier lab now invests heavily in filtering, deduplication, quality scoring, and topic balancing. The Llama 3 paper from Meta describes a multi-stage data pipeline with classifier-based quality filtering and explicit upsampling of code and math (Grattafiori et al., arXiv:2407.21783). Bad data caps the ceiling: you cannot RLHF your way out of a noisy pre-training mix.

Second, the tokenizer. Models see byte-pair-encoded chunks, not characters. The choice of tokenizer affects compression efficiency on non-English text, code, and numbers. GPT-style tokenizers (tiktoken's cl100k, o200k families) and Llama's tokenizer have different splits, which is why token counts for the same text differ across providers.

Third, the scaling law. The Chinchilla paper (Hoffmann et al., 2022, arXiv:2203.15556) showed that for a fixed compute budget, the optimal allocation is to train a smaller model on more tokens than the GPT-3 generation was using: roughly 20 tokens per parameter, give or take. Modern open frontier models (Llama 3, Qwen, DeepSeek) routinely train well past Chinchilla-optimal because inference cost favors smaller models trained on more data. Past that point, gains per training FLOP shrink but do not disappear, so labs trade extra training compute for cheaper inference.
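The scaling-law trade-off can be made concrete with a little arithmetic. The sketch below assumes the common rule of thumb that training compute is about 6 FLOPs per parameter per token (an approximation from the scaling-law literature, not a figure stated on this page) together with the roughly 20 tokens per parameter ratio quoted above. The function names and the example model size are hypothetical.

```python
import math

TOKENS_PER_PARAM = 20.0        # rough Chinchilla ratio quoted in the text
FLOPS_PER_PARAM_TOKEN = 6.0    # assumption: C ~ 6 * N * D rule of thumb


def train_flops(params, tokens):
    """Approximate training compute in FLOPs for N parameters and D tokens."""
    return FLOPS_PER_PARAM_TOKEN * params * tokens


def chinchilla_optimal(compute_flops):
    """Split a compute budget so that D = 20 * N and 6 * N * D = C."""
    params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    return params, TOKENS_PER_PARAM * params


def overtraining_factor(params, tokens):
    """How many times past the 20-tokens-per-parameter point a run goes."""
    return tokens / (TOKENS_PER_PARAM * params)


if __name__ == "__main__":
    # Hypothetical run: an 8B-parameter model trained on 15T tokens.
    n, d = 8e9, 15e12
    budget = train_flops(n, d)
    opt_n, opt_d = chinchilla_optimal(budget)
    print(f"compute budget:       {budget:.2e} FLOPs")
    print(f"overtraining factor:  {overtraining_factor(n, d):.1f}x")
    print(f"same budget, optimal: {opt_n:.2e} params on {opt_d:.2e} tokens")
```

Under these assumptions the same budget would be loss-optimal for a model roughly ten times larger trained on roughly a tenth of the tokens, which is exactly the model a lab may not want to pay to serve.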

What changes when, mechanically

A useful mental model: pre-training builds the brain, SFT teaches it the interview format, preference learning teaches it the manners, and red-team rounds patch specific known holes. The cost order is roughly reversed: brain is most expensive in FLOPs, manners are most expensive in human labels.

Pre-training

Loss objective: next-token cross-entropy

Almost all of the parametric knowledge — facts, language structure, code idioms, reasoning circuits — gets installed here. Removing a specific fact later is hard precisely because pre-training distributed it across many weights.

Continued pre-training

Loss objective: same as pre-training

Same objective, narrower or fresher data. Used to extend context length (e.g. via RoPE scaling or position interpolation), refresh knowledge cutoff, or boost a domain like code or math without starting from scratch.

SFT

Loss objective: cross-entropy on demonstrations

The model learns the format of a helpful assistant. Behavior shifts noticeably (it stops completing the prompt and starts answering it), but the underlying knowledge is largely unchanged.

RLHF / DPO

Loss objective: PPO objective with KL anchor, or DPO log-ratio loss

Style, refusal patterns, calibration, and 'feel' get tuned. Capability on hard reasoning benchmarks often barely moves; subjective helpfulness ratings move a lot. This is also where the 'alignment tax' on raw benchmark scores can appear.

Constitutional pass

Loss objective: mix of supervised learning and RL from AI feedback

The model critiques and revises its own outputs against a written set of principles. Reduces the amount of human red-team labeling needed for harmlessness. Originally described by Bai et al. (Anthropic, 2022).

Red-team rounds

Outputs feed back into SFT and preference data

Specific failure modes (jailbreaks, CBRN uplift attempts, prompt-injection variants) are surfaced by humans and tools, then used to generate targeted training data. Iterative — never 'done.'
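The loss objectives named in the stage summaries above are short enough to write out. The sketch below uses plain Python lists in place of tensors. It covers next-token cross-entropy (pre-training, continued pre-training, SFT), the pairwise loss usually used to fit a reward model to rankings, and a sequence-level version of the KL anchor. All function names are hypothetical, and `beta=0.1` is an illustrative value, not one taken from any lab's recipe.

```python
import math


def log_softmax(logits):
    """Log-probabilities from raw scores, shifted by the max for stability."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def next_token_loss(logits_per_position, token_ids):
    """Mean cross-entropy for predicting token t+1 from position t.

    logits_per_position[t] stands in for a model's score vector over the
    vocabulary after it has read token_ids[: t + 1].
    """
    targets = token_ids[1:]
    losses = [-log_softmax(logits)[target]
              for logits, target in zip(logits_per_position, targets)]
    return sum(losses) / len(losses)


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reward_model_loss(score_chosen, score_rejected):
    """Pairwise ranking loss: push the preferred output's score higher."""
    return -log_sigmoid(score_chosen - score_rejected)


def kl_shaped_reward(reward, logp_policy, logp_reference, beta=0.1):
    """Reward minus a penalty for drifting from the reference (SFT) model."""
    return reward - beta * (logp_policy - logp_reference)
```

A model that spreads probability uniformly over a vocabulary of size V scores a next-token loss of log(V), which is a useful sanity check for the starting point of a training curve.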

Mid-training and continued pre-training

Between the big pre-training run and the post-training stack, many labs now insert a 'mid-training' or 'continued pre-training' phase. This is the same next-token-prediction objective as pre-training, but on a smaller, cleaner, more deliberately balanced corpus. Common uses include extending the context window (often via positional-encoding tricks like RoPE scaling or YaRN), refreshing the knowledge cutoff by mixing in more recent text, and shifting the model's distribution toward code, math, or another priority domain. The Llama 3 technical report from Meta (Grattafiori et al., arXiv:2407.21783) describes a long context extension stage as part of the published training recipe, where context is progressively extended in stages rather than at the start. Other labs have published variants of this idea, and DeepSeek and Qwen have written about multi-stage data curricula that look like mid-training even when they don't use the term. The compute share of this stage is small relative to the main pre-training run but not negligible — context extension over a trillion-plus tokens at long sequence length is not free. The honest assessment is that mid-training is one of the levers labs use to differentiate, and the public literature is thinner here than it is for either pre-training or RLHF.
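One of the positional-encoding tricks mentioned above, linear position interpolation for RoPE, fits in a few lines. The sketch assumes the standard RoPE angle formula with base 10000 and rescales positions so that a longer target context maps back into the position range seen during pre-training. Function names and context lengths are hypothetical, and methods such as YaRN adjust frequencies differently rather than scaling every position uniformly.

```python
def interpolation_scale(train_context, target_context):
    """Factor that squeezes target positions into the trained range."""
    return min(1.0, train_context / target_context)


def rope_angles(position, head_dim, base=10000.0, scale=1.0):
    """Rotation angle for each pair of dimensions at a given position.

    scale < 1.0 applies linear position interpolation; scale = 1.0 is
    plain RoPE. base=10000.0 is assumed as the usual default.
    """
    return [(position * scale) * base ** (-2.0 * i / head_dim)
            for i in range(head_dim // 2)]


if __name__ == "__main__":
    # Hypothetical extension from a 4,096-token to a 16,384-token context.
    scale = interpolation_scale(4096, 16384)
    print(scale)
    print(rope_angles(16383, head_dim=8, scale=scale)[:2])
```

With the scale applied, the last position of the extended window produces angles the model already saw near the end of its original window, which is why a comparatively short continued pre-training run can be enough to adapt.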

Supervised fine-tuning (SFT)

After pre-training, the model can complete text but does not yet know that 'How tall is Everest?' should produce an answer rather than a continuation of a quiz. Supervised fine-tuning solves the format problem. The lab assembles a corpus of (prompt, ideal-response) pairs, anywhere from tens of thousands to millions depending on the lab, written by trained annotators or generated and curated with model assistance. The model is then fine-tuned on this corpus with the same cross-entropy objective as pre-training.

The foundational public description is the InstructGPT paper from OpenAI (Ouyang et al., 2022, arXiv:2203.02155), which showed that a relatively small SFT pass on top of GPT-3 produced large gains in human preference ratings, before any reinforcement-learning step. SFT alone moves the model a long way. It can also cost capability on raw benchmarks if the demonstrations are written in a stilted or overly cautious style, so labs have to balance helpfulness and naturalness.

SFT data is one of the highest-leverage and most expensive components of the modern stack. Hand-written assistant responses by skilled annotators are slow to produce; synthetic data and 'distillation from a stronger teacher' have become standard ways to scale this up, though both introduce their own quality risks.
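In code, the SFT objective is the pre-training loss applied to demonstration data. A common implementation detail, assumed here rather than stated in the text above, is to average the loss over response tokens only so the model is not trained to reproduce the prompt. The function name and the example log-probabilities are hypothetical.

```python
def sft_loss(token_logprobs, is_response_token):
    """Mean negative log-likelihood over response tokens only.

    token_logprobs[i] is the log-probability the model assigned to the
    i-th token of the (prompt + response) sequence; is_response_token[i]
    marks which of those tokens belong to the ideal response.
    """
    picked = [-lp for lp, keep in zip(token_logprobs, is_response_token) if keep]
    if not picked:
        raise ValueError("example has no response tokens to train on")
    return sum(picked) / len(picked)


if __name__ == "__main__":
    # Two prompt tokens followed by two response tokens.
    logprobs = [-0.1, -0.2, -1.0, -3.0]
    mask = [False, False, True, True]
    print(sft_loss(logprobs, mask))
```

Whether to mask the prompt is a design choice: masking keeps the gradient focused on the assistant's side of the conversation, while training on the full sequence treats the demonstration like ordinary pre-training text.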

Preference learning: RLHF (PPO) vs DPO

Once the model can answer in the right format, the next step is teaching it which answers humans actually prefer. Two methods dominate.

The classical approach is reinforcement learning from human feedback (RLHF) using Proximal Policy Optimization (PPO). Annotators rank pairs of model outputs; a separate reward model is trained to predict those rankings; then the language model is updated via PPO to maximize the reward model's score while a KL-divergence term keeps it close to the SFT model so it doesn't degenerate. This is the recipe described in InstructGPT (Ouyang et al., 2022, arXiv:2203.02155) and the Anthropic helpful-and-harmless paper (Bai et al., 2022, arXiv:2204.05862). It works, it is well-studied, and it is operationally heavy: training a reward model, tuning PPO, and managing reward hacking are all real engineering problems.

The newer alternative is Direct Preference Optimization (DPO), introduced by Rafailov et al. (2023, arXiv:2305.18290). DPO algebraically rearranges the RLHF objective so the language model can be trained directly on preference pairs with a closed-form loss, with no separate reward model and no on-policy RL loop. DPO has become the default for many open-source post-training stacks because it is simpler and more stable. Whether it matches PPO at the frontier is an open empirical question. Some labs continue to use PPO or other online RL methods (GRPO, RLOO), sometimes with AI-generated rather than human feedback (RLAIF), because they offer finer control. As of June 2026 the field has not converged on a single answer; check provider technical reports for what each model actually used.

Either way, this stage is small in FLOPs but high in human-label cost. Preference data is where the model's voice, refusal style, and helpfulness calibration are tuned. It is also where a lot of the 'alignment tax' debate happens: heavily RLHF'd models sometimes score lower on raw multiple-choice benchmarks than their SFT-only siblings even when humans prefer them in conversation.
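The closed-form DPO loss for a single preference pair can be sketched as follows, following the form given by Rafailov et al. Inputs are summed log-probabilities of each full response under the policy being trained and under the frozen reference (SFT) model. The function names are hypothetical and `beta=0.1` is an illustrative value; real stacks compute this over batches of tensors.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    The loss falls as the policy raises the chosen response's likelihood
    relative to the reference by more than it raises the rejected one's.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_logratio - rejected_logratio))


if __name__ == "__main__":
    # Policy identical to the reference: the loss starts at log(2).
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Policy has shifted probability toward the chosen response.
    print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

The reference log-probabilities play the role of the KL anchor in PPO-based RLHF: the loss rewards movement relative to the SFT model, and `beta` controls how strongly the policy is held near it.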

The constitutional AI step

Anthropic's Constitutional AI paper (Bai et al., 2022, arXiv:2212.08073) introduced an approach where the model is given a written 'constitution' — a set of principles like 'choose the response that is least harmful' or 'choose the response a thoughtful person would prefer' — and asked to critique and revise its own outputs against those principles. The revisions become training data, both for supervised fine-tuning and for a preference model used in a subsequent RL-from-AI-feedback (RLAIF) loop. The practical claim is not that human feedback is replaced. It is that the volume of human red-team labeling required for harmlessness can be reduced, because the model can generate much of the preference data itself once it has the principles in hand. This makes safety training more scalable and the principles themselves auditable and editable in a way that an opaque labeled dataset is not. Not every lab uses a constitutional step. Some use rule-based finetuning, some use heavy human red-teaming, some use a mix. The specific recipe is one of the things labs treat as differentiated IP.
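The critique-and-revise loop described above can be sketched with a stand-in for the model. Here `generate` is a hypothetical callable that maps a text prompt to a text completion, and the instruction wording is illustrative rather than taken from the paper. This version walks through every principle in order, which is a simplification; a real pipeline may sample principles instead.

```python
from typing import Callable, List, Tuple


def critique_and_revise(generate: Callable[[str], str],
                        prompt: str,
                        principles: List[str]) -> Tuple[str, str]:
    """Return a (prompt, revised_response) pair usable as SFT data."""
    response = generate(prompt)
    for principle in principles:
        # Ask the model to find where its answer violates the principle.
        critique = generate(
            f"Critique this response against the principle: {principle}\n\n"
            f"Prompt: {prompt}\nResponse: {response}"
        )
        # Ask the model to rewrite its answer in light of the critique.
        response = generate(
            "Rewrite the response to address the critique.\n\n"
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}"
        )
    return prompt, response


if __name__ == "__main__":
    def echo_model(text: str) -> str:
        # Stub standing in for a real model call.
        return f"[model output for {len(text)} chars of input]"

    pair = critique_and_revise(
        echo_model,
        "How do I pick a lock?",
        ["choose the response that is least harmful"],
    )
    print(pair)
```

The number of model calls grows as one plus two per principle, so the cost of generating this data is paid in inference rather than in annotator hours.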

Safety RLHF and red-team rounds

Safety-specific post-training is usually a separate stage from general helpfulness RLHF, even when the underlying algorithm is the same. The targeted-harm prompts (CBRN uplift, child-safety, weapons, prompt-injection chains, persuasion-of-self-harm) need careful curation, and mixing them into general preference data risks either over-refusal on benign prompts or under-refusal on disallowed ones. Red-team rounds are the iterative cycle: humans and tools attempt to elicit policy-violating outputs, the failures get logged, new training data gets generated to close that specific hole, the model is retrained, and the cycle repeats. The Anthropic helpful-and-harmless paper and OpenAI's GPT-4 system card both describe a months-long red-team process prior to release. The Llama 3 paper documents a similar process at scale for an open-weights release. These rounds rarely produce a 'done' state — every release ships with known failure modes that the next round will try to address.
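The over-refusal versus under-refusal tension is usually tracked as two separate rates over a labeled prompt set. The sketch below assumes each red-team or eval result has already been labeled with whether the prompt should have been refused and whether the model did refuse; the function name, field layout, and sample data are hypothetical.

```python
def refusal_rates(results):
    """Compute over-refusal and under-refusal rates.

    results is a list of (should_refuse, did_refuse) boolean pairs.
    Over-refusal: share of benign prompts the model refused.
    Under-refusal: share of disallowed prompts the model answered.
    """
    benign = [did for should, did in results if not should]
    disallowed = [did for should, did in results if should]
    over = sum(benign) / len(benign) if benign else 0.0
    under = (sum(not did for did in disallowed) / len(disallowed)
             if disallowed else 0.0)
    return {"over_refusal": over, "under_refusal": under}


if __name__ == "__main__":
    labeled = [
        (False, False), (False, True), (False, False), (False, False),
        (True, True), (True, True), (True, False), (True, True),
    ]
    print(refusal_rates(labeled))
```

Reporting the two rates separately matters because a change in training data can push one down while the other rises, and a single aggregate accuracy number would hide that trade.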

Capability evaluation

Evaluation happens throughout post-training, not just at the end. Internal eval batteries are what actually decide whether a checkpoint ships. Public benchmarks anchor the press release but rarely reflect the full picture — a model can top a leaderboard and still feel worse to use, or vice versa. Treat public scores as one signal among many.

Public reasoning and knowledge benchmarks — MMLU, MMLU-Pro, GPQA Diamond, ARC, BIG-Bench Hard. Saturating fast at the frontier, useful as floor checks.
Coding benchmarks — HumanEval, MBPP, LiveCodeBench, SWE-bench Verified. SWE-bench in particular has become a load-bearing 'is this model actually useful for engineering' signal.
Math benchmarks — GSM8K (now mostly saturated), MATH, AIME-style competition sets. Frontier models now train explicitly to be good at these.
Agentic and tool-use evals — TAU-bench, WebArena, OSWorld. Newer, noisier, more representative of real assistant work.
Long-context evals — needle-in-a-haystack variants, RULER, and harder reasoning-over-long-document tests like ZeroSCROLLS.
Safety evals — model-specific refusal batteries, persuasive-attack sets, plus a growing set of public ones (HarmBench, AdvBench, JailbreakBench).
Internal evals — every lab runs proprietary task batteries that map to its specific deployment surface. These are what actually drive ship/no-ship decisions, and are rarely public.
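For the coding benchmarks in the list above, results are commonly reported as pass@k. A minimal sketch of the unbiased estimator popularized alongside HumanEval: draw n samples per problem, count the c that pass the unit tests, and estimate the chance that at least one of k samples would pass. The function name and argument checks are illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k from n samples of which c passed the tests.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that
    a random size-k subset of the n samples contains no passing sample.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset has a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical problem: 10 samples drawn, 3 passed.
    print(pass_at_k(10, 3, 1))
    print(pass_at_k(10, 3, 5))
```

A benchmark score is then the mean of this value over all problems, which is one reason headline numbers depend on the sampling budget and temperature a lab chose as well as on the model.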

Model cards and release

When a checkpoint is ready, the lab publishes a model card and a release. The model card documents intended use, training data at a high level (most labs do not list specific datasets), evaluation results, known limitations, and safety policies. It is, in a useful sense, the public contract for what the model is. The model-card concept was formalized by Mitchell et al. (2018, arXiv:1810.03993) and is now standard across major labs. The release itself usually includes the deployable model (API or weights), a tokenizer, a usage policy, pricing, and often a separate system card or technical report with more detail on training and red-teaming. OpenAI's GPT-4 technical report, Anthropic's Claude system cards, and Meta's Llama 3 technical report are good examples of what a thorough release looks like, though each lab strikes a different balance between transparency and what it treats as competitive IP. Frontier API pricing changes frequently — check provider docs for current numbers rather than relying on figures cached anywhere, including here.

Honest gaps and open questions

The public literature is uneven. We know a great deal about the pre-training objective and the SFT recipe, less about exact data mixes, less still about the proprietary preference-data pipelines that distinguish frontier products. As of June 2026, several questions are genuinely open: (1) DPO versus PPO at the absolute frontier — open-weights stacks lean DPO, but the largest closed models may still use PPO variants and have not all said so; (2) the marginal value of constitutional AI versus heavy human red-teaming — public results suggest both work, neither dominates; (3) whether scaling laws bend or hold past current frontier compute — the Chinchilla curves were fit on a specific regime, and labs are now well outside it; (4) how much of capability progress comes from data quality versus algorithmic improvement versus raw scale. When you read claims about training, weight evidence and primary sources over confident summaries — including this one.

Where this fits in the AtomEons atlas

This page is the training half of the model-lifecycle atlas. The deployment half — serving, context windows, prompt caching, tool use, evaluations in production — lives in related pages under /learn. If you are reasoning about a specific model's capabilities or safety properties, the most reliable move is to read its primary technical report and system card, then check current provider documentation for anything operational (pricing, rate limits, supported features). Secondary summaries, even good ones, lag the primary sources by months.

Sources

[01]
Hoffmann et al. (2022), 'Training Compute-Optimal Large Language Models' (Chinchilla), establishes the compute-optimal token-to-parameter ratio for LLM pre-training at roughly 20 tokens per parameter.
arxiv.org/abs/2203.15556
[02]
Ouyang et al. (2022), 'Training language models to follow instructions with human feedback' (InstructGPT), describes the canonical SFT-plus-RLHF (PPO) recipe at OpenAI.
arxiv.org/abs/2203.02155
[03]
Bai et al. (2022), 'Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback' (Anthropic), documents the human-feedback pipeline for both helpfulness and harmlessness.
arxiv.org/abs/2204.05862
[04]
Bai et al. (2022), 'Constitutional AI: Harmlessness from AI Feedback' (Anthropic), introduces the self-critique-against-principles approach and RLAIF.
arxiv.org/abs/2212.08073
[05]
Rafailov et al. (2023), 'Direct Preference Optimization: Your Language Model is Secretly a Reward Model' (DPO), gives the closed-form preference-learning loss that has become standard in open-source post-training.
arxiv.org/abs/2305.18290
[06]
Grattafiori et al. (2024), 'The Llama 3 Herd of Models' (Meta), publishes a detailed multi-stage training recipe including data curation, context extension, SFT, and preference optimization.
arxiv.org/abs/2407.21783
[07]
Mitchell et al. (2018), 'Model Cards for Model Reporting', formalizes the model-card concept that frontier labs now use for release documentation.
arxiv.org/abs/1810.03993
[08]
OpenAI's GPT-4 technical report describes the red-team and safety-evaluation process used prior to release, and the general structure of capability evaluations.
openai.com/research/gpt-4 (GPT-4 Technical Report, 2023)
[09]
Meta's Llama 3 release documents the open-weights release pattern: model card, technical report, weights, tokenizer, usage policy.
ai.meta.com/blog/meta-llama-3/ and the Llama 3 paper
[10]
Public evaluation frameworks (OpenAI Evals, EleutherAI lm-evaluation-harness) host many of the benchmarks referenced (MMLU, GSM8K, HumanEval, ARC) in reproducible form.
github.com/openai/evals and github.com/EleutherAI/lm-evaluation-harness
[11]
SWE-bench and SWE-bench Verified are real-world software-engineering benchmarks that have become load-bearing for assessing coding-agent capability.
swebench.com — SWE-bench official site (Jimenez et al., arXiv:2310.06770)
[12]
MMLU (Massive Multitask Language Understanding) is the multi-domain knowledge benchmark widely used as a floor check for frontier models.
arxiv.org/abs/2009.03300 (Hendrycks et al., MMLU)
[13]
GSM8K is the grade-school-math benchmark frequently cited in training reports; now mostly saturated at the frontier.
arxiv.org/abs/2110.14168 (Cobbe et al., GSM8K)
[14]
GPQA Diamond is a graduate-level science Q&A benchmark used as a harder knowledge eval than MMLU.
arxiv.org/abs/2311.12022 (Rein et al., GPQA)
