The pipeline at a glance
Compute share figures are best-effort estimates synthesized from the few labs that have published a breakdown (notably the Llama 3 and InstructGPT papers). Many labs do not disclose the split. The headline is durable: pre-training dominates FLOPs, post-training dominates behavior.
| Stage | What it does | Rough compute share | What changes |
|---|
| Pre-training | Next-token prediction on a curated text corpus | Typically >95% of total training FLOPs | Most of the weights settle into their final shape |
| Mid-training / continued pre-training | More pre-training on cleaner or domain-shifted data, often longer context | Single-digit percent | Knowledge refresh, longer context window, code/math density |
| Supervised fine-tuning (SFT) | Train on hand-curated (prompt, ideal-response) pairs | Small (<1% of total) | The base model learns to follow an instruction format |
| Preference learning (RLHF or DPO) | Optimize against a learned preference model or against direct preference pairs | Small in FLOPs, large in human-label cost | Tone, helpfulness, refusal behavior, hallucination rate |
| Constitutional / rule-based pass | Self-critique against written principles (Anthropic-style CAI) or rule-following data | Small | Refusal behavior with fewer human-written safety labels |
| Safety RLHF / harmlessness rounds | Targeted preference training on adversarial and policy-relevant prompts | Small | Refusals on disallowed content, jailbreak resistance |
| Red-team and adversarial evaluation | Humans and tools try to break the model; failures fed back as training data | Iterative, weeks to months | Closes specific known failure modes |
| Capability evaluation | Public and private benchmarks; internal task batteries | Compute light, time heavy | Nothing — measurement only |
| Model card + release | Documented limits, evals, safety policy, deployment surface | Zero training compute | Public contract for what the model is |
StagePre-training
What it doesNext-token prediction on a curated text corpus
Rough compute shareTypically >95% of total training FLOPs
What changesMost of the weights settle into their final shape
StageMid-training / continued pre-training
What it doesMore pre-training on cleaner or domain-shifted data, often longer context
Rough compute shareSingle-digit percent
What changesKnowledge refresh, longer context window, code/math density
StageSupervised fine-tuning (SFT)
What it doesTrain on hand-curated (prompt, ideal-response) pairs
Rough compute shareSmall (<1% of total)
What changesThe base model learns to follow an instruction format
StagePreference learning (RLHF or DPO)
What it doesOptimize against a learned preference model or against direct preference pairs
Rough compute shareSmall in FLOPs, large in human-label cost
What changesTone, helpfulness, refusal behavior, hallucination rate
StageConstitutional / rule-based pass
What it doesSelf-critique against written principles (Anthropic-style CAI) or rule-following data
Rough compute shareSmall
What changesRefusal behavior with fewer human-written safety labels
StageSafety RLHF / harmlessness rounds
What it doesTargeted preference training on adversarial and policy-relevant prompts
Rough compute shareSmall
What changesRefusals on disallowed content, jailbreak resistance
StageRed-team and adversarial evaluation
What it doesHumans and tools try to break the model; failures fed back as training data
Rough compute shareIterative, weeks to months
What changesCloses specific known failure modes
StageCapability evaluation
What it doesPublic and private benchmarks; internal task batteries
Rough compute shareCompute light, time heavy
What changesNothing — measurement only
StageModel card + release
What it doesDocumented limits, evals, safety policy, deployment surface
Rough compute shareZero training compute
What changesPublic contract for what the model is
Pre-training: where the model actually learns
Pre-training is the long stage where a randomly-initialized transformer learns to predict the next token across a very large corpus of text (and increasingly code, math, images, audio). The objective is mechanical — minimize cross-entropy on next-token prediction — but the side effect is the model internalizes grammar, world facts, code structure, arithmetic, multilingual mappings, and a great deal of common-sense reasoning.
Three things determine what comes out of this stage. First, the data. Web crawl is the base, but every frontier lab now invests heavily in filtering, deduplication, quality scoring, and topic balancing. The Llama 3 paper from Meta describes a multi-stage data pipeline with classifier-based quality filtering and explicit upsampling of code and math (Grattafiori et al., arXiv:2407.21783). Bad data caps the ceiling — you cannot RLHF your way out of a noisy pre-training mix.
Second, the tokenizer. Models see byte-pair-encoded chunks, not characters. The choice of tokenizer affects compression efficiency on non-English text, code, and numbers. GPT-style tokenizers (tiktoken's cl100k, o200k families) and Llama's tokenizer have different splits, which is why token counts for the same text differ across providers.
Third, the scaling law. The Chinchilla paper (Hoffmann et al., 2022, arXiv:2203.15556) showed that for a fixed compute budget, the optimal allocation is to train a smaller model on more tokens than the GPT-3 generation was using — roughly 20 tokens per parameter, give or take. Modern open frontier models (Llama 3, Qwen, DeepSeek) routinely train well past Chinchilla-optimal because inference cost favors smaller models trained on more data, even if the marginal training-loss return diminishes. Past Chinchilla-optimal, gains per FLOP shrink but do not disappear, and labs trade extra training compute for cheaper inference.
What changes when, mechanically
A useful mental model: pre-training builds the brain, SFT teaches it the interview format, preference learning teaches it the manners, and red-team rounds patch specific known holes. The cost order is roughly reversed: brain is most expensive in FLOPs, manners are most expensive in human labels.
Pre-training
Loss objective: next-token cross-entropy
Almost all of the parametric knowledge — facts, language structure, code idioms, reasoning circuits — gets installed here. Removing a specific fact later is hard precisely because pre-training distributed it across many weights.
Continued pre-training
Loss objective: same as pre-training
Same objective, narrower or fresher data. Used to extend context length (e.g. via RoPE scaling or position interpolation), refresh knowledge cutoff, or boost a domain like code or math without starting from scratch.
SFT
Loss objective: cross-entropy on demonstrations
The model learns the format of a helpful assistant. Behavior shifts noticeably — it stops completing the prompt and starts answering it — but the underlying knowledge is unchanged.
RLHF / DPO
Loss: PPO objective with KL anchor, or DPO log-ratio loss
Style, refusal patterns, calibration, and 'feel' get tuned. Capability on hard reasoning benchmarks often barely moves; subjective helpfulness ratings move a lot. This is also where the 'alignment tax' on raw benchmark scores can appear.
Constitutional pass
Mix of supervised + RL from AI feedback
The model critiques and revises its own outputs against a written set of principles. Reduces the amount of human red-team labeling needed for harmlessness. Originally described by Bai et al. (Anthropic, 2022).
Red-team rounds
Outputs feed back into SFT and preference data
Specific failure modes (jailbreaks, CBRN uplift attempts, prompt-injection variants) are surfaced by humans and tools, then used to generate targeted training data. Iterative — never 'done.'
Mid-training and continued pre-training
Between the big pre-training run and the post-training stack, many labs now insert a 'mid-training' or 'continued pre-training' phase. This is the same next-token-prediction objective as pre-training, but on a smaller, cleaner, more deliberately balanced corpus. Common uses include extending the context window (often via positional-encoding tricks like RoPE scaling or YaRN), refreshing the knowledge cutoff by mixing in more recent text, and shifting the model's distribution toward code, math, or another priority domain.
The Llama 3 technical report from Meta (Grattafiori et al., arXiv:2407.21783) describes a long context extension stage as part of the published training recipe, where context is progressively extended in stages rather than at the start. Other labs have published variants of this idea, and DeepSeek and Qwen have written about multi-stage data curricula that look like mid-training even when they don't use the term.
The compute share of this stage is small relative to the main pre-training run but not negligible — context extension over a trillion-plus tokens at long sequence length is not free. The honest assessment is that mid-training is one of the levers labs use to differentiate, and the public literature is thinner here than it is for either pre-training or RLHF.
Supervised fine-tuning (SFT)
After pre-training, the model can complete text but does not yet know that 'How tall is Everest?' should produce an answer rather than a continuation of a quiz. Supervised fine-tuning solves the format problem. The lab assembles a corpus of (prompt, ideal-response) pairs — sometimes tens of thousands, sometimes millions, depending on the lab — written by trained annotators or generated and curated with model assistance. The model is then fine-tuned on this corpus with the same cross-entropy objective as pre-training.
The foundational public description is the InstructGPT paper from OpenAI (Ouyang et al., 2022, arXiv:2203.02155), which showed that a relatively small SFT pass on top of GPT-3 produced large gains in human preference ratings, before any reinforcement-learning step. SFT alone moves the model a long way. It also, importantly, can cost capability on raw benchmarks if the demonstrations are written in a stilted or overly-cautious style — labs have to balance helpfulness and naturalness.
SFT data is one of the highest-leverage and most expensive components of the modern stack. Hand-written assistant responses by skilled annotators are slow to produce; synthetic data and 'distillation from a stronger teacher' have become standard ways to scale this up, though that introduces its own quality risks.
Preference learning: RLHF (PPO) vs DPO
Once the model can answer in the right format, the next step is teaching it which answers humans actually prefer. Two methods dominate.
The classical approach is reinforcement learning from human feedback (RLHF) using Proximal Policy Optimization (PPO). Annotators rank pairs of model outputs; a separate reward model is trained to predict those rankings; then the language model is updated via PPO to maximize the reward model's score while a KL-divergence term keeps it close to the SFT model so it doesn't degenerate. This is the recipe described in InstructGPT (Ouyang et al., 2022, arXiv:2203.02155) and the Anthropic helpful-and-harmless paper (Bai et al., 2022, arXiv:2204.05862). It works, it is well-studied, and it is operationally heavy — training a reward model, tuning PPO, and managing reward hacking are all real engineering problems.
The newer alternative is Direct Preference Optimization (DPO), introduced by Rafailov et al. (2023, arXiv:2305.18290). DPO algebraically rearranges the RLHF objective so the language model can be trained directly on preference pairs with a closed-form loss — no separate reward model, no on-policy RL loop. DPO has become the default for many open-source post-training stacks because it is simpler and more stable. Whether it matches PPO at the frontier is an open empirical question. Some labs continue to use PPO or PPO variants (GRPO, RLOO, RLAIF-style methods) because they offer finer control. As of June 2026 the field has not converged on a single answer; check provider technical reports for what each model actually used.
Either way, this stage is small in FLOPs but high in human-label cost. Preference data is where the model's voice, refusal style, and helpfulness calibration are tuned. It is also where a lot of the 'alignment tax' debate happens — heavily RLHF'd models sometimes score lower on raw multiple-choice benchmarks than their SFT-only siblings even when humans prefer them in conversation.
The constitutional AI step
Anthropic's Constitutional AI paper (Bai et al., 2022, arXiv:2212.08073) introduced an approach where the model is given a written 'constitution' — a set of principles like 'choose the response that is least harmful' or 'choose the response a thoughtful person would prefer' — and asked to critique and revise its own outputs against those principles. The revisions become training data, both for supervised fine-tuning and for a preference model used in a subsequent RL-from-AI-feedback (RLAIF) loop.
The practical claim is not that human feedback is replaced. It is that the volume of human red-team labeling required for harmlessness can be reduced, because the model can generate much of the preference data itself once it has the principles in hand. This makes safety training more scalable and the principles themselves auditable and editable in a way that an opaque labeled dataset is not.
Not every lab uses a constitutional step. Some use rule-based finetuning, some use heavy human red-teaming, some use a mix. The specific recipe is one of the things labs treat as differentiated IP.
Safety RLHF and red-team rounds
Safety-specific post-training is usually a separate stage from general helpfulness RLHF, even when the underlying algorithm is the same. The targeted-harm prompts (CBRN uplift, child-safety, weapons, prompt-injection chains, persuasion-of-self-harm) need careful curation, and mixing them into general preference data risks either over-refusal on benign prompts or under-refusal on disallowed ones.
Red-team rounds are the iterative cycle: humans and tools attempt to elicit policy-violating outputs, the failures get logged, new training data gets generated to close that specific hole, the model is retrained, and the cycle repeats. The Anthropic helpful-and-harmless paper and OpenAI's GPT-4 system card both describe a months-long red-team process prior to release. The Llama 3 paper documents a similar process at scale for an open-weights release. These rounds rarely produce a 'done' state — every release ships with known failure modes that the next round will try to address.
Capability evaluation
Evaluation happens throughout post-training, not just at the end. Internal eval batteries are what actually decide whether a checkpoint ships. Public benchmarks anchor the press release but rarely reflect the full picture — a model can top a leaderboard and still feel worse to use, or vice versa. Treat public scores as one signal among many.
- Public reasoning and knowledge benchmarks — MMLU, MMLU-Pro, GPQA Diamond, ARC, BIG-Bench Hard. Saturating fast at the frontier, useful as floor checks.
- Coding benchmarks — HumanEval, MBPP, LiveCodeBench, SWE-bench Verified. SWE-bench in particular has become a load-bearing 'is this model actually useful for engineering' signal.
- Math benchmarks — GSM8K (now mostly saturated), MATH, AIME-style competition sets. Frontier models now train explicitly to be good at these.
- Agentic and tool-use evals — TAU-bench, WebArena, OSWorld. Newer, noisier, more representative of real assistant work.
- Long-context evals — needle-in-a-haystack variants, RULER, and harder reasoning-over-long-document tests like ZeroSCROLLS.
- Safety evals — model-specific refusal batteries, persuasive-attack sets, plus a growing set of public ones (HarmBench, AdvBench, JailbreakBench).
- Internal evals — every lab runs proprietary task batteries that map to its specific deployment surface. These are what actually drive ship/no-ship decisions, and are rarely public.
Model cards and release
When a checkpoint is ready, the lab publishes a model card and a release. The model card documents intended use, training data at a high level (most labs do not list specific datasets), evaluation results, known limitations, and safety policies. It is, in a useful sense, the public contract for what the model is. The model-card concept was formalized by Mitchell et al. (2018, arXiv:1810.03993) and is now standard across major labs.
The release itself usually includes the deployable model (API or weights), a tokenizer, a usage policy, pricing, and often a separate system card or technical report with more detail on training and red-teaming. OpenAI's GPT-4 technical report, Anthropic's Claude system cards, and Meta's Llama 3 technical report are good examples of what a thorough release looks like, though each lab strikes a different balance between transparency and what it treats as competitive IP. Frontier API pricing changes frequently — check provider docs for current numbers rather than relying on figures cached anywhere, including here.
Honest gaps and open questions
The public literature is uneven. We know a great deal about the pre-training objective and the SFT recipe, less about exact data mixes, less still about the proprietary preference-data pipelines that distinguish frontier products. As of June 2026, several questions are genuinely open: (1) DPO versus PPO at the absolute frontier — open-weights stacks lean DPO, but the largest closed models may still use PPO variants and have not all said so; (2) the marginal value of constitutional AI versus heavy human red-teaming — public results suggest both work, neither dominates; (3) whether scaling laws bend or hold past current frontier compute — the Chinchilla curves were fit on a specific regime, and labs are now well outside it; (4) how much of capability progress comes from data quality versus algorithmic improvement versus raw scale. When you read claims about training, weight evidence and primary sources over confident summaries — including this one.
Where this fits in the AtomEons atlas
This page is the training half of the model-lifecycle atlas. The deployment half — serving, context windows, prompt caching, tool use, evaluations in production — lives in related pages under /learn. If you are reasoning about a specific model's capabilities or safety properties, the most reliable move is to read its primary technical report and system card, then check current provider documentation for anything operational (pricing, rate limits, supported features). Secondary summaries, even good ones, lag the primary sources by months.