
The RLHF family tree
Atlas · post-training methods from PPO (2017) to GRPO (2024): what each one actually changes
The shape of the family tree
Chronology
Jun 2017
Deep RL from human preferences (Christiano et al.)
OpenAI/DeepMind paper introduces the reward-model + RL pipeline that becomes RLHF. Atari and MuJoCo, not language. arXiv:1706.03741.
Mar 2022
InstructGPT (Ouyang et al.)
OpenAI applies the Christiano pipeline at scale to GPT-3 using PPO. First flagship production model shipped with RLHF. arXiv:2203.02155.
Dec 2022
Constitutional AI (Bai et al., Anthropic)
RLAIF variant: AI-generated preferences against a written constitution replace human labelers in the harmlessness phase. arXiv:2212.08073. Powers the Claude line.
May 2023
DPO (Rafailov et al.)
Stanford paper collapses the reward-model + PPO pipeline into a single classification loss. The shift that opened RLHF up to the open-source community. arXiv:2305.18290.
Oct 2023
IPO (Azar et al., DeepMind)
Identifies a DPO failure mode (overfitting on near-deterministic preferences) and proposes a regularized objective. arXiv:2310.12036.
Feb 2024
KTO and DeepSeekMath/GRPO
KTO (Ethayarajh et al., arXiv:2402.01306) replaces pairs with binary thumbs-up/down. DeepSeekMath introduces GRPO (Shao et al., arXiv:2402.03300) — PPO minus the critic.
Mar 2024
ORPO (Hong et al.)
Merges SFT and preference optimization into a single stage, drops the reference model. arXiv:2403.07691.
May 2024
SimPO (Meng, Xia, Chen — Princeton)
Reference-free, length-normalized DPO variant. Topped Chatbot Arena under-10B at release. arXiv:2405.14734.
Jul 2024
Llama 3 ships with DPO
Meta explicitly chose DPO over PPO for the Llama 3 herd, citing scaling and stability. arXiv:2407.21783. Marked the moment DPO became frontier-acceptable, not just academic.
Jan 2025
DeepSeek-R1
GRPO scaled to a frontier reasoning model. Reported math and code performance comparable to OpenAI o1 with an open-weights model. arXiv:2501.12948. GRPO becomes the dominant method for reasoning training.
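The "PPO minus the critic" description of GRPO in the Feb 2024 entry can be made concrete with a small sketch. Instead of a learned value function, each sampled completion is scored against the other completions drawn for the same prompt. The function name, the `eps` constant, the use of population standard deviation, and the example rewards are all assumptions for illustration; see arXiv:2402.03300 for the actual objective, which also includes clipping and a KL term not shown here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each completion relative to its own sampling group.

    rewards: one scalar reward per completion sampled for the SAME prompt.
    The group mean stands in for the critic's value estimate, so no
    separate value network is trained.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four sampled answers to one math prompt, scored 1.0
# when a verifier accepts the final answer and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Completions that beat their group's average get a positive advantage and the rest get a negative one. When every completion in a group earns the same reward, all advantages are zero and that prompt contributes no gradient signal.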
Methods · what each one actually does
All loss descriptions are plain-language summaries. Read the cited paper before implementing — these summaries are orientation, not specification.
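As one orientation aid in the same spirit, here is a minimal per-example sketch of three of the offline losses named in the chronology: DPO, IPO, and SimPO. Inputs are assumed to be summed token log-probabilities of whole responses under the policy and, where needed, a frozen reference model. All function names and the default values of `beta`, `tau`, and `gamma` are illustrative assumptions, not recommended settings; the cited papers remain the specification.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def log_ratio_margin(pi_chosen, pi_rejected, ref_chosen, ref_rejected):
    """How much more the policy prefers chosen over rejected than the reference does."""
    return (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)


def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO: logistic (classification) loss on the scaled log-ratio margin."""
    margin = log_ratio_margin(pi_chosen, pi_rejected, ref_chosen, ref_rejected)
    return -log_sigmoid(beta * margin)


def ipo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, tau=0.1):
    """IPO: squared loss pulling the margin toward 1 / (2 * tau).

    Unlike the DPO loss, it is not minimized by pushing the margin to infinity.
    """
    margin = log_ratio_margin(pi_chosen, pi_rejected, ref_chosen, ref_rejected)
    return (margin - 1.0 / (2.0 * tau)) ** 2


def simpo_loss(pi_chosen, pi_rejected, len_chosen, len_rejected, beta=2.0, gamma=1.0):
    """SimPO: no reference model; length-normalized log-probs with a target margin gamma."""
    avg_chosen = pi_chosen / len_chosen
    avg_rejected = pi_rejected / len_rejected
    return -log_sigmoid(beta * (avg_chosen - avg_rejected) - gamma)
```

At initialization the policy equals the reference, so the margin is zero and the DPO loss is log 2 for every pair. The SimPO function takes no reference log-probabilities at all, which is where its memory saving comes from.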
Compute relative to PPO
These are relative compute and memory ratios from the original papers and common implementations. Absolute dollar costs depend on cluster, scale, and rollout length, so none are quoted here. The numbers are approximate, and the offline methods in particular can vary by 2x depending on data scale.
Which shipping models use what
Drawn from public technical reports and model cards. Where a model uses multiple methods across stages, we note the dominant or final stage. Verified against original technical reports where available — if you are betting on this, read the model card directly.
InstructGPT / early ChatGPT (OpenAI)
arXiv:2203.02155 · 2022
PPO with a learned reward model on human preference pairs. The original blueprint that every other method on this page is trying to improve or replace.
Claude (Anthropic)
arXiv:2212.08073 · 2022 (original CAI)
Constitutional AI — a hybrid where the harmlessness phase uses AI-generated preferences against a written constitution (RLAIF), and the helpfulness phase remains human-RLHF. Anthropic has not published current Claude post-training details.
Llama 3 / 3.1 / 3.3 (Meta)
arXiv:2407.21783 · 2024
Explicitly chose DPO over PPO. Pipeline is SFT → rejection sampling → DPO, iterated across rounds. Meta's stated reason was scaling and stability.
DeepSeek-R1 / V3 (DeepSeek)
arXiv:2501.12948 · 2025
GRPO, scaled. R1-Zero used pure RL with GRPO on verifiable-reward tasks (math, code) and produced emergent chain-of-thought reasoning. R1 added cold-start SFT data and multi-stage training.
Zephyr-7B-beta (Hugging Face H4)
arXiv:2310.16944 · 2023
Early proof that DPO could match PPO at small scale on real benchmarks. SFT on UltraChat then DPO on UltraFeedback. The model that helped DPO cross from paper to practice in the open-weights world.
Nemotron-4-340B / Llama-Nemotron (NVIDIA)
arXiv:2406.11704, arXiv:2505.00949
Reward-aware Preference Optimization (RPO) in Nemotron-4-340B; online RPO plus REINFORCE Leave-One-Out (RLOO) across multiple RL stages in Llama-Nemotron. One of the few flagship lines to publicly report methods beyond DPO and GRPO.
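For readers unfamiliar with the RLOO estimator named in the Nemotron entry, the following is a sketch of the generic leave-one-out baseline as it is usually described, not NVIDIA's implementation. Each sample's baseline is the mean reward of the other samples for the same prompt. The function name and example rewards are hypothetical.

```python
def rloo_advantages(rewards):
    """REINFORCE leave-one-out: baseline each sample with the mean of the others.

    rewards: scalar rewards for k >= 2 completions of the same prompt.
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("leave-one-out needs at least two samples per prompt")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


# Hypothetical rewards for four completions of one prompt.
advantages = rloo_advantages([1.0, 0.0, 0.0, 1.0])
```

Like the group-relative baseline in GRPO, this avoids training a critic. The main difference in this sketch is that the baseline excludes the sample being scored and no standard-deviation scaling is applied.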
Honest gaps and caveats
Three things to keep in mind.
- First: most labs do not publish full post-training recipes anymore. What we cite are technical reports, which are usually correct on method names but often skip hyperparameters, data mix, and exact stage ordering. Treat them as orientation, not specification.
- Second: 'production-validated' here means 'a real lab shipped a real model.' It does not mean the method is best for your use case. A 7B fine-tune of an open model is not the same problem as training a frontier flagship.
- Third: the field is moving fast enough that this page will be out of date within months. New methods (DPO variants with margin, length-normalized losses, process reward models for reasoning) appear monthly. The four-axis framework above (reward model? reference model? online? pair vs binary?) is more stable than any specific method's prominence.
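The four axes named in the caveats can be written down as a small lookup table, which makes it easy to ask questions such as "which methods need no reference model?". The placements below are one reading of the methods as originally published, and the class and field names are hypothetical. GRPO in particular is marked `"scalar"` because it trains on a per-sample reward rather than on pairs or binary labels, which falls slightly outside the pair-vs-binary axis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MethodAxes:
    separate_reward: bool   # reward model or verifier distinct from the policy?
    reference_model: bool   # frozen reference policy needed during training?
    online: bool            # samples generated from the policy during training?
    feedback: str           # "pairs", "binary", or "scalar"


# Assumption: classifications follow each method's original formulation.
AXES = {
    "PPO":   MethodAxes(True,  True,  True,  "pairs"),   # pairs train the reward model
    "GRPO":  MethodAxes(True,  True,  True,  "scalar"),
    "DPO":   MethodAxes(False, True,  False, "pairs"),
    "IPO":   MethodAxes(False, True,  False, "pairs"),
    "KTO":   MethodAxes(False, True,  False, "binary"),
    "ORPO":  MethodAxes(False, False, False, "pairs"),
    "SimPO": MethodAxes(False, False, False, "pairs"),
}


def reference_free(axes=AXES):
    """Names of methods that train without a frozen reference model."""
    return sorted(name for name, a in axes.items() if not a.reference_model)


def online_methods(axes=AXES):
    """Names of methods that sample from the policy during training."""
    return sorted(name for name, a in axes.items() if a.online)
```

A table like this is only as stable as the published formulations. Later variants sometimes move a method along an axis, for example online or iterative versions of DPO.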
How to choose
A minimum-effective-dose decision tree. This is not a substitute for reading the papers, but it cuts the search space for most projects.
- Verifiable rewards (math, code, formal logic) and budget for online RL: GRPO. The DeepSeek-R1 reproduction stack is well-documented and open-source friendly.
- Subjective preferences (chat, instruction following, style) and a fixed preference dataset: DPO. Llama 3 chose it for a reason: it is stable, scalable, and well-supported in Hugging Face TRL.
- Subjective preferences but tight on memory or compute: SimPO or ORPO. Both are reference-free, so there is no frozen reference model to hold in memory or run; roughly half the cost of DPO.
- You have thumbs-up/thumbs-down product telemetry but not pair-labeled preferences: KTO. Use the data you actually have, not the data the paper used.
- DPO is overfitting (training-time preference accuracy on the chosen responses climbs to ~1.0 quickly): IPO. The squared-loss regularization addresses this directly.
- You have a strong reward model and want to keep improving past one round of DPO: iterative DPO. Generate, judge, train, repeat — 2-4 rounds typically.
- You are NVIDIA or have a similarly mature stack and the reward signal is rich: RPO. It trains on the size of the reward gap between responses, not just their order.
- You are not sure: start with DPO. It is the conservative default for general-purpose post-training in 2026 and has the most community tooling. Move to GRPO or SimPO only when you have a specific reason.
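The list above can be sketched as a small function. The flag names are hypothetical, and the order in which overlapping conditions are checked is an assumption, since the list does not rank them. The RPO branch is left out because it depends on an organization's stack more than on the data.

```python
def choose_method(
    *,
    verifiable_rewards=False,
    online_rl_budget=False,
    binary_feedback_only=False,
    dpo_overfitting=False,
    memory_constrained=False,
    strong_reward_model=False,
):
    """Return a starting-point method name for a post-training project.

    Assumption: more specific situations are checked first, and anything
    left over falls through to the conservative default.
    """
    if verifiable_rewards and online_rl_budget:
        return "GRPO"
    if binary_feedback_only:
        return "KTO"            # thumbs-up/down telemetry, no labeled pairs
    if dpo_overfitting:
        return "IPO"            # squared-loss regularization
    if memory_constrained:
        return "SimPO or ORPO"  # reference-free options
    if strong_reward_model:
        return "iterative DPO"  # generate, judge, train, repeat
    return "DPO"                # default when nothing above applies


# Hypothetical project: product telemetry with only thumbs-up/down labels.
suggestion = choose_method(binary_feedback_only=True)
```

The function returns a place to start reading, not a verdict. A project that trips more than one flag should weigh the matching bullets itself.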
What the family tree teaches
Sources
- [01]
Christiano et al. 2017, 'Deep reinforcement learning from human preferences' — original RLHF reward-model + RL framework, evaluated on Atari and MuJoCo.
arxiv.org/abs/1706.03741
- [02]
Ouyang et al. 2022, 'Training language models to follow instructions with human feedback' — InstructGPT, first large-scale RLHF-with-PPO production pipeline.
arxiv.org/abs/2203.02155
- [03]
Bai et al. 2022, 'Constitutional AI: Harmlessness from AI Feedback' (Anthropic) — RLAIF with a written constitution; foundational for the Claude line.
arxiv.org/abs/2212.08073
- [04]
Rafailov et al. 2023, 'Direct Preference Optimization: Your Language Model is Secretly a Reward Model' — collapses reward model + PPO into a single classification loss.
arxiv.org/abs/2305.18290
- [05]
Azar et al. 2023, 'A General Theoretical Paradigm to Understand Learning from Human Preferences' — introduces IPO and the Psi-PO framework regularizing DPO.
arxiv.org/abs/2310.12036
- [06]
Tunstall et al. 2023, 'Zephyr: Direct Distillation of LM Alignment' (Hugging Face H4) — early production-grade open-weights model trained with DPO.
arxiv.org/abs/2310.16944
- [07]
Yuan et al. 2024, 'Self-Rewarding Language Models' (Meta) — iterative DPO with the model as its own judge, Llama 2 70B base.
arxiv.org/abs/2401.10020
- [08]
Ethayarajh et al. 2024, 'KTO: Model Alignment as Prospect Theoretic Optimization' — binary thumbs-up/down loss instead of preference pairs.
arxiv.org/abs/2402.01306
- [09]
Shao et al. 2024, 'DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models' — introduces GRPO (Group Relative Policy Optimization).
arxiv.org/abs/2402.03300
- [10]
Hong, Lee, Thorne 2024, 'ORPO: Monolithic Preference Optimization without Reference Model' — combines SFT and preference optimization without a reference policy.
arxiv.org/abs/2403.07691
- [11]
Meng, Xia, Chen 2024, 'SimPO: Simple Preference Optimization with a Reference-Free Reward' — reference-free DPO variant; topped Chatbot Arena under-10B at release.
arxiv.org/abs/2405.14734
- [12]
NVIDIA 2024, 'Nemotron-4 340B Technical Report' — reports use of Reward-aware Preference Optimization (RPO) in alignment stages.
arxiv.org/abs/2406.11704
- [13]
Meta 2024, 'The Llama 3 Herd of Models' — Meta explicitly chose DPO over more complex RL methods for the Llama 3 post-training pipeline.
arxiv.org/abs/2407.21783
- [14]
DeepSeek 2025, 'DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning' — frontier reasoning model trained with GRPO.
arxiv.org/abs/2501.12948
- [15]
2025, 'Reward-aware Preference Optimization: A Unified Mathematical Framework for Model Alignment' — generalizes RPO and unifies many preference-optimization variants.
arxiv.org/abs/2502.00203
- [16]
NVIDIA 2025, 'Llama-Nemotron: Efficient Reasoning Models' — uses REINFORCE (RLOO) and Online RPO in post-training.
arxiv.org/abs/2505.00949
- [17]
Zephyr-7B-beta model card confirms DPO training on Mistral-7B-v0.1 base, used in production.
huggingface.co/HuggingFaceH4/zephyr-7b-beta
- [18]
Official SimPO implementation and reproduction details from Princeton NLP.
github.com/princeton-nlp/SimPO
- [19]
Official DeepSeek-R1 repository confirms GRPO as the RL algorithm used in training.
github.com/deepseek-ai/DeepSeek-R1