

The RLHF family tree

Atlas · post-training methods from PPO (2017) to GRPO (2024): what each one actually changes

Post-training is where a base language model stops being a clever autocomplete and becomes a usable assistant. The family of methods that does this work (collectively called RLHF, even when the "RL" part has been quietly removed) has expanded from one algorithm in 2017 to roughly a dozen production-relevant variants by mid-2026. Most explainers treat them as interchangeable. They are not.

This atlas page is an honest field guide. For each method we state the loss in plain language, the specific failure of PPO it is trying to fix, who actually shipped a model with it (not who wrote a paper), and what it costs relative to a PPO baseline. We separate three categories: production-validated (a frontier or near-frontier lab shipped a real model with it), academic-validated (the paper has strong benchmarks but no flagship shipping evidence), and theoretical (the math is interesting but the empirical track record is thin or contested).

The honest summary up front: by 2026 the field has bifurcated. For reasoning models with verifiable rewards (math, code, formal logic), the action moved back to on-policy RL: GRPO and its descendants. For general chat assistants, the action moved toward simpler offline methods, DPO and its variants, because they are cheaper, more stable, and roughly as good on subjective preference benchmarks. PPO survives, but as one option among several, not the default.

We cite original papers only. Where a fact is uncertain or we are reasoning from secondary sources, we say so in the prose. Compute numbers are relative ratios from published comparisons; absolute dollar costs depend on hardware and scale and are not invented here. Check provider docs for current pricing on managed RLHF services.

The shape of the family tree

Every method on this page is trying to solve the same problem: take a pretrained language model and shift its distribution toward outputs humans (or a verifier) prefer, without destroying the general capability it learned from web-scale pretraining. The methods differ on four axes. First, whether they use an explicit reward model trained on preference pairs (PPO, GRPO) or fold the reward into the loss directly (DPO, IPO, SimPO). Second, whether they need a frozen reference policy to constrain drift (DPO, IPO, KTO) or eliminate it (SimPO, ORPO, GRPO without KL). Third, whether training data is fixed up front (offline: DPO, IPO, KTO, SimPO, ORPO) or generated by the current policy at each step (online: PPO, GRPO, iterative DPO, online DPO). Fourth, whether they need paired (chosen/rejected) preference data or can work with thumbs-up/thumbs-down style binary labels (KTO is the outlier here). Every method below sits at a specific point on these four axes, and that placement explains most of what is different about it.
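The four axes can be written down as a small lookup table. The sketch below is illustrative only: the class name, field names, and helper function are hypothetical, and the placements simply restate the groupings given in the paragraph above (GRPO is shown with its KL term, and with the reward-model grouping the text assigns it).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MethodAxes:
    """Hypothetical record: one boolean per axis named in the text."""
    explicit_reward_model: bool  # axis 1: separate reward model vs. folded into loss
    frozen_reference: bool       # axis 2: needs a frozen reference policy
    online: bool                 # axis 3: data generated by the current policy
    paired_data: bool            # axis 4: needs (chosen, rejected) pairs


# Placements restate the text; they are not independent measurements.
AXES = {
    "PPO":   MethodAxes(True,  True,  True,  True),
    "DPO":   MethodAxes(False, True,  False, True),
    "IPO":   MethodAxes(False, True,  False, True),
    "KTO":   MethodAxes(False, True,  False, False),
    "SimPO": MethodAxes(False, False, False, True),
    "ORPO":  MethodAxes(False, False, False, True),
    # GRPO with a KL term; the text also mentions a KL-free variant.
    # With a verifier instead of a reward model, no preference labels are needed.
    "GRPO":  MethodAxes(True,  True,  True,  True),
}


def methods_where(**wanted: bool) -> list[str]:
    """Return method names whose axes match every keyword given."""
    return [
        name for name, axes in AXES.items()
        if all(getattr(axes, key) == value for key, value in wanted.items())
    ]


print(methods_where(paired_data=False))       # the binary-label outlier
print(methods_where(frozen_reference=False))  # reference-free methods
print(methods_where(online=True))             # on-policy methods
```

Filtering on one axis at a time reproduces the groupings in the text, which is a quick way to check where a new method would sit.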

Chronology

  1. Jun 2017

    Deep RL from human preferences (Christiano et al.)

    OpenAI/DeepMind paper introduces the reward-model + RL pipeline that becomes RLHF. Atari and MuJoCo, not language. arXiv:1706.03741.

  2. Mar 2022

    InstructGPT (Ouyang et al.)

    OpenAI applies the Christiano pipeline at scale to GPT-3 using PPO. First flagship production model shipped with RLHF. arXiv:2203.02155.

  3. Dec 2022

    Constitutional AI (Bai et al., Anthropic)

    RLAIF variant — AI-generated preferences against a written constitution replace human labelers in the harm phase. arXiv:2212.08073. Powers the Claude line.

  4. May 2023

    DPO (Rafailov et al.)

    Stanford paper collapses the reward-model + PPO pipeline into a single classification loss. The shift that broke RLHF open to the open-source community. arXiv:2305.18290.

  5. Oct 2023

    IPO (Azar et al., DeepMind)

    Identifies a DPO failure mode (overfitting on near-deterministic preferences) and proposes a regularized objective. arXiv:2310.12036.

  6. Feb 2024

    KTO and DeepSeekMath/GRPO

    KTO (Ethayarajh et al., arXiv:2402.01306) replaces pairs with binary thumbs-up/down. DeepSeekMath introduces GRPO (Shao et al., arXiv:2402.03300) — PPO minus the critic.

  7. Mar 2024

    ORPO (Hong et al.)

    Merges SFT and preference optimization into a single stage, drops the reference model. arXiv:2403.07691.

  8. May 2024

    SimPO (Meng, Xia, Chen — Princeton)

    Reference-free, length-normalized DPO variant. The authors report a Gemma-2-9B-it SimPO model ranking first among under-10B models on Chatbot Arena. arXiv:2405.14734.

  9. Jul 2024

    Llama 3 ships with DPO

    Meta explicitly chose DPO over PPO for the Llama 3 herd, citing scaling and stability. arXiv:2407.21783. Marked the moment DPO became frontier-acceptable, not just academic.

  10. Jan 2025

    DeepSeek-R1

    GRPO scaled to a frontier reasoning model. The report claims performance comparable to OpenAI o1 on math and code benchmarks, with open weights. arXiv:2501.12948. GRPO becomes the dominant method for reasoning training.

Methods · what each one actually does

All loss descriptions are plain-language summaries. Read the cited paper before implementing — these summaries are orientation, not specification.

Method: PPO (Christiano 2017 → InstructGPT 2022)
Loss in plain language: Train a reward model on preference pairs. Then run on-policy RL: sample from the policy, score with the reward model, update the policy to maximize reward minus a KL penalty to a frozen reference.
Problem with PPO it addresses: It is the baseline. The problem PPO has is itself: four models in memory at once (policy, reference, reward model, critic), unstable, hyperparameter-sensitive, and expensive to scale.
Status as of June 2026: Production-validated. Powered InstructGPT, original ChatGPT, GPT-4-era models. Still used at OpenAI and others for general post-training. Compute reference baseline.

Method: DPO (Rafailov 2023)
Loss in plain language: Skip the reward model and the RL loop entirely. Train the policy directly on a classification loss over (chosen, rejected) preference pairs, where the implicit reward is the log-ratio between the policy and a frozen reference.
Problem with PPO it addresses: Eliminates the reward model, the critic, the rollouts, and the RL hyperparameter tuning. Single supervised-style training pass.
Status as of June 2026: Production-validated. Llama 3 (Meta, 2024), Zephyr-7B-beta (Hugging Face H4, 2023), and many open-weights flagship models. The default choice for general-purpose post-training in 2025-2026.

Method: IPO (Azar 2023)
Loss in plain language: Same setup as DPO but replaces the Bradley-Terry log-sigmoid with a squared loss. Penalizes the policy for pushing the log-ratio too far past the preference margin, which DPO cannot do.
Problem with PPO it addresses: DPO overfits when preferences are near-deterministic (chosen always beats rejected). IPO regularizes against this failure mode.
Status as of June 2026: Academic-validated. Cited and reimplemented widely. We are not aware of a frontier flagship model that publicly reports IPO as its primary loss as of June 2026; check the latest model cards.

Method: KTO (Ethayarajh 2024)
Loss in plain language: Drop preference pairs entirely. Use binary thumbs-up / thumbs-down labels per response, and a loss inspired by Kahneman-Tversky prospect theory: asymmetric penalty for desirable vs undesirable outputs.
Problem with PPO it addresses: Pair labels are expensive to collect. Most real product feedback is binary (thumbs up/down, abandoned vs completed). KTO uses that data directly.
Status as of June 2026: Academic-validated, with significant adoption. Hugging Face TRL ships KTO; multiple open-weights fine-tunes use it. No public frontier flagship has named KTO as primary, but the production signal is non-trivial.

Method: SimPO (Meng/Xia/Chen 2024)
Loss in plain language: Reference-free DPO. The implicit reward is the average (length-normalized) log-probability of a response under the policy, with no frozen reference. The loss asks the chosen response's reward to beat the rejected one's by a target margin.
Problem with PPO it addresses: DPO requires keeping a frozen reference model in memory and computing two forward passes per step. SimPO removes the reference model, which cuts weight memory and forward passes roughly in half.
Status as of June 2026: Academic-validated, near-production. The authors report a Gemma-2-9B-it SimPO model ranking first among under-10B models on Chatbot Arena. Widely used in open-weights fine-tunes. No flagship lab has publicly confirmed SimPO as primary.

Method: ORPO (Hong 2024)
Loss in plain language: Combine SFT and preference optimization into one stage. Loss is standard cross-entropy on chosen responses plus an odds-ratio penalty against rejected responses. No reference model, no separate SFT phase.
Problem with PPO it addresses: Two-stage pipelines (SFT then DPO/PPO) are operationally complex. ORPO does both in one pass on the same data.
Status as of June 2026: Academic-validated. Production adoption visible in some open-weights releases; not confirmed as primary at any frontier lab as of June 2026.

Method: GRPO (Shao 2024 / DeepSeek)
Loss in plain language: Online RL like PPO but without the critic (value network). For each prompt, sample a group of completions, normalize their rewards against the group mean and std (the 'group baseline'), and use that as the advantage.
Problem with PPO it addresses: PPO needs a learned critic that is itself expensive to train and often unstable. GRPO replaces it with a Monte Carlo group baseline: fewer parameters, simpler, scales better with reasoning rollouts.
Status as of June 2026: Production-validated. DeepSeekMath, DeepSeek-V2/V3, and DeepSeek-R1 (Jan 2025) all use GRPO. Dominant method for reasoning post-training as of mid-2026. Many R1 reproductions use GRPO.

Method: Iterative DPO
Loss in plain language: Train DPO. Use the new policy to generate fresh responses. Re-label with a reward model or LLM judge. Train DPO again. Repeat.
Problem with PPO it addresses: Vanilla DPO is offline. Once the preference data is fixed, the policy can drift away from the data distribution and learning stalls. Iterative DPO closes that loop.
Status as of June 2026: Production-validated. Self-Rewarding Language Models (Yuan et al., Meta, arXiv:2401.10020) and several Llama-3 post-training pipelines use iterative DPO loops.

Method: Online DPO / OAIF
Loss in plain language: Same loss as DPO, but the preference pairs are generated and labeled in-step from the current policy plus an online judge (often another LLM). Closer to PPO's online structure but with DPO's loss.
Problem with PPO it addresses: Iterative DPO is batched; online DPO is per-step. Removes more of the distribution shift between collection and training.
Status as of June 2026: Academic-validated, with strong empirical results. Adoption is visible in 2025-2026 open-weights work; flagship use is harder to confirm publicly.

Method: RPO (Reward-aware Preference Optimization)
Loss in plain language: Generalization of DPO/IPO that uses the magnitude of the reward gap between chosen and rejected, not just the order. Approximates the gap with an implicit-reward function over the policy.
Problem with PPO it addresses: DPO treats all preference pairs equally even when one response is barely worse and another is wildly worse. RPO uses that signal.
Status as of June 2026: Production-validated. NVIDIA reports RPO in Nemotron-4-340B-Instruct (arXiv:2406.11704) and online RPO in the Llama-Nemotron line (arXiv:2505.00949). A 2025 paper (arXiv:2502.00203) unifies the framework.
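To make the pairwise losses above concrete, here is a minimal per-example sketch of the DPO, IPO, SimPO, and ORPO objectives in plain Python. It assumes sequence log-probabilities have already been computed; the function names, argument names, and default hyperparameters are illustrative choices, not values taken from this page or guaranteed to match any library. Read the cited papers before relying on the details.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))


def log_ratio_margin(pi_w: float, pi_l: float, ref_w: float, ref_l: float) -> float:
    """Chosen-minus-rejected gap in policy/reference log-ratios (summed log-probs)."""
    return (pi_w - ref_w) - (pi_l - ref_l)


def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=0.1):
    # Logistic (Bradley-Terry) loss on the scaled margin: keeps falling as margin grows.
    return -log_sigmoid(beta * log_ratio_margin(pi_w, pi_l, ref_w, ref_l))


def ipo_loss(pi_w, pi_l, ref_w, ref_l, tau=0.1):
    # Squared loss around a target margin of 1/(2*tau): overshooting is penalized.
    return (log_ratio_margin(pi_w, pi_l, ref_w, ref_l) - 1.0 / (2.0 * tau)) ** 2


def simpo_loss(pi_w, pi_l, len_w, len_l, beta=2.0, gamma=1.0):
    # No reference model: reward is the length-normalized log-prob, minus a target margin.
    return -log_sigmoid(beta * pi_w / len_w - beta * pi_l / len_l - gamma)


def orpo_loss(avg_w, avg_l, lam=0.1):
    # avg_w / avg_l are per-token average log-probs (must be < 0).
    def log_odds(avg_logp):
        return avg_logp - math.log1p(-math.exp(avg_logp))  # log(p / (1 - p))

    sft_term = -avg_w  # cross-entropy on the chosen response
    return sft_term - lam * log_sigmoid(log_odds(avg_w) - log_odds(avg_l))


# Same reference gap, growing policy margin: DPO keeps rewarding it, IPO does not.
for margin in (0.0, 5.0, 50.0):
    print(margin, dpo_loss(margin, 0.0, 0.0, 0.0), ipo_loss(margin, 0.0, 0.0, 0.0))
```

The loop at the end shows the behavioral difference the table describes: the DPO loss keeps decreasing as the margin grows, while the IPO loss is smallest at its target margin and rises again beyond it.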

Compute relative to PPO

These are relative compute and memory ratios from the original papers and common implementations. Absolute dollar costs depend on cluster, scale, and rollout length — not invented here. Numbers are approximate and the offline methods especially can vary 2x depending on data scale.

Method: PPO
Models in memory during training: Policy + reference + reward model + critic (4)
Relative training compute (PPO = 1.0): 1.0 (reference)
Notes: Includes online rollouts, which dominate cost. The critic itself is roughly policy-sized.

Method: DPO
Models in memory during training: Policy + reference (2)
Relative training compute (PPO = 1.0): ~0.3 to 0.5
Notes: No rollouts, no reward model, no critic. Dataset size dominates.

Method: IPO
Models in memory during training: Policy + reference (2)
Relative training compute (PPO = 1.0): ~0.3 to 0.5
Notes: Same compute profile as DPO.

Method: KTO
Models in memory during training: Policy + reference (2)
Relative training compute (PPO = 1.0): ~0.3 to 0.5
Notes: Same compute as DPO. Data collection is cheaper since labels are binary.

Method: SimPO
Models in memory during training: Policy only (1)
Relative training compute (PPO = 1.0): ~0.2 to 0.3
Notes: No reference model, so roughly half of DPO's weight memory and forward passes. Optimizer state for the policy is unchanged.

Method: ORPO
Models in memory during training: Policy only (1)
Relative training compute (PPO = 1.0): ~0.2 to 0.3
Notes: No reference, no separate SFT stage. SFT and preference optimization are combined in one pass.

Method: GRPO
Models in memory during training: Policy + reference + reward model (3)
Relative training compute (PPO = 1.0): ~0.5 to 0.8
Notes: No critic vs PPO. Group sampling adds rollout cost but the critic savings dominate. Reasoning rollouts (long chains-of-thought) push this higher in practice.

Method: Iterative DPO
Models in memory during training: Policy + reference + judge (2-3)
Relative training compute (PPO = 1.0): ~0.5 to 1.0+
Notes: Per iteration is DPO-cheap, but multiple iterations stack. With an LLM judge in the loop, total cost approaches PPO.
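The models-in-memory column can be turned into a rough back-of-envelope estimate. The sketch below is an assumption-laden illustration, not a measurement: it assumes every model is the same size as the policy, that frozen models hold bf16 weights only (2 bytes per parameter), and that trained models carry weights, gradients, and Adam state under mixed precision (about 16 bytes per parameter). Activations, rollout caches, and sharding are ignored, and all names are hypothetical.

```python
# Assumed bytes per parameter (illustrative, not measured).
TRAINABLE_BYTES = 16  # bf16 weights + bf16 grads + fp32 master weights and two Adam moments
FROZEN_BYTES = 2      # bf16 weights only

# (trained models, frozen models), following the table above.
MODEL_COUNTS = {
    "PPO":   (2, 2),  # policy + critic trained; reference + reward model frozen
    "DPO":   (1, 1),  # policy trained; reference frozen
    "IPO":   (1, 1),
    "KTO":   (1, 1),
    "SimPO": (1, 0),  # policy only
    "ORPO":  (1, 0),
    "GRPO":  (1, 2),  # policy trained; reference + reward model frozen
}


def weight_state_gb(method: str, params_billion: float) -> float:
    """Weights plus optimizer state in GB (1e9 bytes), excluding activations."""
    trained, frozen = MODEL_COUNTS[method]
    bytes_per_param = trained * TRAINABLE_BYTES + frozen * FROZEN_BYTES
    return params_billion * bytes_per_param


for name in ("PPO", "GRPO", "DPO", "SimPO"):
    print(f"{name}: {weight_state_gb(name, 7):.0f} GB for a 7B policy")
```

Under these assumptions a 7B policy needs about 252 GB for PPO, 140 GB for GRPO, 126 GB for DPO, and 112 GB for SimPO before activations. The estimate suggests that the trained critic, not the frozen models, is the large memory item that GRPO removes.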

Which shipping models use what

Drawn from public technical reports and model cards. Where a model uses multiple methods across stages, we note the dominant or final stage. Verified against original technical reports where available — if you are betting on this, read the model card directly.

InstructGPT / early ChatGPT (OpenAI)

arXiv:2203.02155 · 2022

PPO with a learned reward model on human preference pairs. The original blueprint that every other method on this page is trying to improve or replace.
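In PPO-style RLHF, the reward-model score is combined with a KL penalty that keeps the policy close to a frozen reference. One common way to implement this, sketched here as an assumption rather than as the InstructGPT code, is to charge a per-token penalty on the policy/reference log-probability gap and add the reward-model score on the final token. The function name and the `beta` default are hypothetical.

```python
def kl_shaped_rewards(
    rm_score: float,
    policy_logps: list[float],
    ref_logps: list[float],
    beta: float = 0.1,
) -> list[float]:
    """Per-token rewards: -beta * (log pi - log pi_ref), plus the RM score at the end."""
    if not policy_logps or len(policy_logps) != len(ref_logps):
        raise ValueError("need equal-length, non-empty per-token log-prob lists")
    # Tokens the policy likes more than the reference does are penalized.
    rewards = [-beta * (p - r) for p, r in zip(policy_logps, ref_logps)]
    # The reward model scores the whole completion, so its score lands on the last token.
    rewards[-1] += rm_score
    return rewards


rewards = kl_shaped_rewards(1.0, [-1.0, -2.0], [-1.5, -2.5], beta=0.1)
print(rewards, sum(rewards))
```

Summing the per-token rewards gives the sequence-level quantity "reward minus beta times the estimated KL", which is the objective described in plain language earlier on this page.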

Claude (Anthropic)

arXiv:2212.08073 · 2022 (original CAI)

Constitutional AI — a hybrid where the harmlessness phase uses AI-generated preferences against a written constitution (RLAIF), and the helpfulness phase remains human-RLHF. Anthropic has not published current Claude post-training details.

Llama 3 / 3.1 / 3.3 (Meta)

arXiv:2407.21783 · 2024

Explicitly chose DPO over PPO. Pipeline is SFT → rejection sampling → DPO, iterated across rounds. Meta's stated reason was scaling and stability.
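The rejection-sampling step in a pipeline like this can be sketched as best-of-K selection: sample several candidates per prompt, score them, and keep the top one as training data. This is a generic illustration, not Meta's implementation. `generate` and `score` are hypothetical stand-ins for a policy sampler and a reward model.

```python
from typing import Callable


def rejection_sample(
    prompt: str,
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    k: int = 4,
) -> str:
    """Sample k candidates for a prompt and return the highest-scoring one."""
    if k < 1:
        raise ValueError("k must be at least 1")
    candidates = [generate(prompt) for _ in range(k)]
    return max(candidates, key=lambda completion: score(prompt, completion))


# Toy stand-ins so the sketch runs end to end.
canned = iter(["ok", "a longer answer", "mid one", "x"])
best = rejection_sample(
    "Explain KL divergence.",
    generate=lambda prompt: next(canned),
    score=lambda prompt, completion: float(len(completion)),  # toy scorer: longer is better
)
print(best)
```

In a real pipeline the selected completions would feed the next SFT round, and the loop would repeat with the updated policy.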

DeepSeek-R1 / V3 (DeepSeek)

arXiv:2501.12948 · 2025

GRPO, scaled. R1-Zero used pure RL with GRPO on verifiable-reward tasks (math, code) and produced emergent chain-of-thought reasoning. R1 added cold-start SFT data and multi-stage training.
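The group baseline at the core of GRPO is small enough to show directly. The sketch below pairs a toy verifiable reward (exact match against a known answer) with group-normalized advantages. The exact-match checker, the use of population standard deviation, and the `eps` term are assumptions made for illustration; implementations differ on these details.

```python
import statistics


def exact_match_reward(answer: str, gold: str) -> float:
    """Toy verifiable reward: 1.0 if the final answer matches, else 0.0."""
    return 1.0 if answer.strip() == gold.strip() else 0.0


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against its own group's mean and std (no critic needed)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std is an assumption here
    return [(r - mean) / (std + eps) for r in rewards]


# One prompt, a group of four sampled answers, gold answer "42".
answers = ["42", "41", "42 ", "7"]
rewards = [exact_match_reward(a, "42") for a in answers]
advantages = group_advantages(rewards)
print(rewards)
print(advantages)
```

Correct answers receive positive advantages and incorrect ones negative, and a group where every answer gets the same reward yields all-zero advantages, so it contributes no gradient signal.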

Zephyr-7B-beta (Hugging Face H4)

arXiv:2310.16944 · 2023

Early proof that DPO could match PPO at small scale on real benchmarks. SFT on UltraChat then DPO on UltraFeedback. The model that helped DPO cross from paper to practice in the open-weights world.

Nemotron-4-340B / Llama-Nemotron (NVIDIA)

arXiv:2406.11704, arXiv:2505.00949

Reward-aware Preference Optimization (RPO), run online in Llama-Nemotron alongside REINFORCE (RLOO), across multiple alignment stages. One of the few flagship model lines to publicly report a preference method other than DPO or GRPO.

Honest gaps and caveats

Three things to keep in mind. First: most labs do not publish full post-training recipes anymore. What we cite are technical reports, which are usually correct on method names but often skip hyperparameters, data mix, and exact stage ordering. Treat them as orientation, not specification. Second: 'production-validated' here means 'a real lab shipped a real model.' It does not mean the method is best for your use case. A 7B fine-tune of an open model is not the same problem as training a frontier flagship. Third: the field is moving fast enough that this page will be out of date within months. New methods (DPO variants with margin, length-normalized losses, process reward models for reasoning) appear monthly. The four-axis framework above (reward model? reference model? online? pair vs binary?) is more stable than any specific method's prominence.

How to choose

A minimum-effective-dose decision tree. This is not a substitute for reading the papers, but it cuts the search space for most projects.

  • Verifiable rewards (math, code, formal logic) and budget for online RL: GRPO. The DeepSeek-R1 reproduction stack is well-documented and open-source friendly.
  • Subjective preferences (chat, instruction following, style) and a fixed preference dataset: DPO. Llama 3 chose it for a reason: it is stable, scalable, and well supported in Hugging Face TRL.
  • Subjective preferences but tight on memory or compute: SimPO or ORPO. Reference-free, roughly half the cost of DPO.
  • You have thumbs-up/thumbs-down product telemetry but not pair-labeled preferences: KTO. Use the data you actually have, not the data the paper used.
  • DPO is overfitting (chosen win-rate at training time goes to ~1.0 fast): IPO. The squared-loss regularization addresses this directly.
  • You have a strong reward model and want to keep improving past one round of DPO: iterative DPO. Generate, judge, train, repeat — 2-4 rounds typically.
  • You are NVIDIA or have a similarly mature stack and the reward signal is rich: RPO. Use the gap, not just the order.
  • You are not sure: start with DPO. It is the conservative default for general-purpose post-training in 2026 and has the most community tooling. Move to GRPO or SimPO only when you have a specific reason.
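The list above can be encoded as a small function. This is a sketch only: the flag names are hypothetical, and the order in which conditions are checked is an assumption (data format is checked before memory, since the reference-free methods still need pairs). The list itself does not define a precedence when several conditions hold.

```python
def choose_method(
    verifiable_reward: bool = False,
    online_rl_budget: bool = False,
    binary_feedback_only: bool = False,
    tight_memory: bool = False,
    dpo_overfitting: bool = False,
    graded_reward_gap: bool = False,
    strong_reward_model: bool = False,
) -> str:
    """Return a starting point from the decision list; DPO is the fallback."""
    if verifiable_reward and online_rl_budget:
        return "GRPO"
    if binary_feedback_only:
        return "KTO"  # thumbs-up/down telemetry, no pairs
    if tight_memory:
        return "SimPO or ORPO"  # reference-free
    if dpo_overfitting:
        return "IPO"
    if graded_reward_gap:
        return "RPO"  # uses the size of the gap, not just the order
    if strong_reward_model:
        return "iterative DPO"
    return "DPO"


print(choose_method())
print(choose_method(verifiable_reward=True, online_rl_budget=True))
print(choose_method(binary_feedback_only=True, tight_memory=True))
```

Writing the rules down this way makes the implicit precedence explicit, which is where most disagreements about "which method" tend to come from.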

What the family tree teaches

Three patterns are visible across the eight years from Christiano 2017 to DeepSeek-R1 2025. First: every successful method removes a component. PPO had four models in memory; DPO removed the reward model and the critic; SimPO and ORPO removed the reference model; GRPO removed the critic. Subtraction has been the throughline. Second: the field re-discovered that online RL beats offline preference learning when the reward is verifiable. DPO won the chat era because chat preferences are subjective and offline data is cheap. GRPO won the reasoning era because math and code have ground-truth rewards and on-policy rollouts compound. Both can be true at once: the right method depends on what the reward looks like. Third: most of the post-PPO papers cited here are roughly one to three years old. The methods are settling into a small number of stable families (DPO-like, PPO-like, GRPO-like) and the variant explosion has slowed. By 2027 we expect the field to have consolidated further, with one or two dominant methods per use case rather than ten.

Sources

  1. Christiano et al. 2017, 'Deep reinforcement learning from human preferences'. Original RLHF reward-model + RL framework, evaluated on Atari and MuJoCo.

    arxiv.org/abs/1706.03741

  2. Ouyang et al. 2022, 'Training language models to follow instructions with human feedback'. InstructGPT, first large-scale RLHF-with-PPO production pipeline.

    arxiv.org/abs/2203.02155

  3. Bai et al. 2022, 'Constitutional AI: Harmlessness from AI Feedback' (Anthropic). RLAIF with a written constitution; foundational for the Claude line.

    arxiv.org/abs/2212.08073

  4. Rafailov et al. 2023, 'Direct Preference Optimization: Your Language Model is Secretly a Reward Model'. Collapses reward model + PPO into a single classification loss.

    arxiv.org/abs/2305.18290

  5. Azar et al. 2023, 'A General Theoretical Paradigm to Understand Learning from Human Preferences'. Introduces IPO and the Psi-PO framework regularizing DPO.

    arxiv.org/abs/2310.12036

  6. Tunstall et al. 2023, 'Zephyr: Direct Distillation of LM Alignment' (Hugging Face H4). Early production-grade open-weights model trained with DPO.

    arxiv.org/abs/2310.16944

  7. Yuan et al. 2024, 'Self-Rewarding Language Models' (Meta). Iterative DPO with the model as its own judge, Llama 2 70B base.

    arxiv.org/abs/2401.10020

  8. Ethayarajh et al. 2024, 'KTO: Model Alignment as Prospect Theoretic Optimization'. Binary thumbs-up/down loss instead of preference pairs.

    arxiv.org/abs/2402.01306

  9. Shao et al. 2024, 'DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models'. Introduces GRPO (Group Relative Policy Optimization).

    arxiv.org/abs/2402.03300

  10. Hong, Lee, Thorne 2024, 'ORPO: Monolithic Preference Optimization without Reference Model'. Combines SFT and preference optimization without a reference policy.

    arxiv.org/abs/2403.07691

  11. Meng, Xia, Chen 2024, 'SimPO: Simple Preference Optimization with a Reference-Free Reward'. Reference-free DPO variant; the authors report a first-place Chatbot Arena ranking among under-10B models.

    arxiv.org/abs/2405.14734

  12. NVIDIA 2024, 'Nemotron-4 340B Technical Report'. Reports use of Reward-aware Preference Optimization (RPO) in alignment stages.

    arxiv.org/abs/2406.11704

  13. Meta 2024, 'The Llama 3 Herd of Models'. Meta explicitly chose DPO over more complex RL methods for the Llama 3 post-training pipeline.

    arxiv.org/abs/2407.21783

  14. DeepSeek 2025, 'DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning'. Frontier reasoning model trained with GRPO.

    arxiv.org/abs/2501.12948

  15. 2025, 'Reward-aware Preference Optimization: A Unified Mathematical Framework for Model Alignment'. Generalizes RPO and unifies many preference-optimization variants.

    arxiv.org/abs/2502.00203

  16. NVIDIA 2025, 'Llama-Nemotron: Efficient Reasoning Models'. Uses REINFORCE (RLOO) and Online RPO in post-training.

    arxiv.org/abs/2505.00949

  17. Zephyr-7B-beta model card confirms DPO training on Mistral-7B-v0.1 base, used in production.

    huggingface.co/HuggingFaceH4/zephyr-7b-beta

  18. Official SimPO implementation and reproduction details from Princeton NLP.

    github.com/princeton-nlp/SimPO

  19. Official DeepSeek-R1 repository confirms GRPO as the RL algorithm used in training.

    github.com/deepseek-ai/DeepSeek-R1

LAB · ATOMEONS · MARCO ISLAND FLÆONS RESEARCH · 12 PAPERS · CC-BY 4.0