What is RLHF?
The short answer
RLHF (Reinforcement Learning from Human Feedback) is a machine learning technique that fine-tunes large language models on human preference data rather than relying only on hand-written target outputs. It works by training a reward model on human-ranked outputs, then using reinforcement learning, typically Proximal Policy Optimization (PPO), to optimize the language model against that reward. OpenAI used RLHF to train InstructGPT (2022) and ChatGPT, and Anthropic, Google DeepMind, and Meta have all published models trained with RLHF or close variants.
The longer answer
RLHF emerged from a 2017 paper by Christiano, Leike, Brown, Martic, Legg, and Amodei at OpenAI and DeepMind titled Deep reinforcement learning from human preferences (arXiv:1706.03741), which showed that an agent could learn a novel behavior (a simulated robot backflip) from roughly 900 human comparison queries. The technique was then adapted for language by Stiennon et al. in Learning to summarize from human feedback (arXiv:2009.01325, 2020), and operationalized at scale by Ouyang et al. in the InstructGPT paper Training language models to follow instructions with human feedback (arXiv:2203.02155, March 2022), the direct predecessor of ChatGPT.
The pipeline has three stages. Stage one is supervised fine-tuning (SFT): a pretrained base model is fine-tuned on a small set of human-written prompt-response demonstrations. Stage two is reward model training: human labelers rank multiple model outputs for the same prompt, and a separate model is trained to predict which output a human would prefer, using a pairwise loss derived from the Bradley-Terry preference model. Stage three is reinforcement learning: the SFT model is fine-tuned against the reward model using Proximal Policy Optimization (Schulman et al., arXiv:1707.06347, 2017), with a KL-divergence penalty against the SFT reference model to limit reward hacking and catastrophic forgetting.
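The reward-model loss from stage two and the KL-penalized reward from stage three can be sketched in a few lines. This is a minimal illustration in plain Python under stated assumptions, not the InstructGPT implementation: the function names, the default `beta` value, and the scalar inputs are all hypothetical, and real systems compute these quantities over batched tensors of token log-probabilities.

```python
import math


def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the chosen output beats the rejected one.

    Under the Bradley-Terry model, P(chosen > rejected) = sigmoid(r_c - r_r).
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def kl_shaped_reward(reward: float, logp_policy: float, logp_reference: float,
                     beta: float = 0.1) -> float:
    """Reward-model score minus a penalty for drifting from the SFT reference.

    logp_policy - logp_reference is a single-sample estimate of the KL term.
    beta is an illustrative (assumed) penalty coefficient.
    """
    return reward - beta * (logp_policy - logp_reference)


if __name__ == "__main__":
    # The loss shrinks as the reward model scores the preferred output higher.
    print(bradley_terry_loss(2.0, 0.0), bradley_terry_loss(0.0, 2.0))
    # A response the policy likes more than the reference does is penalized.
    print(kl_shaped_reward(1.0, logp_policy=-5.0, logp_reference=-6.0))
```

The reward model is trained to minimize the first function over labeled pairs; the policy is then optimized against the second, so that a high reward-model score cannot be bought with unlimited drift from the SFT model.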
Anthropic introduced an important variant called Constitutional AI (CAI) in Constitutional AI: Harmlessness from AI Feedback (Bai et al., arXiv:2212.08073, December 2022), which replaces human harmlessness labels with AI-generated critiques and preference judgments guided by a written constitution; the reinforcement learning stage of this method is called RLAIF (RL from AI Feedback). Google DeepMind's Sparrow (arXiv:2209.14375, 2022) and Meta's Llama 2 Chat (arXiv:2307.09288, 2023) both use RLHF variants; for Llama 2, Meta collected over 1 million human preference comparisons to train its reward models.
The technique has known failure modes. Reward hacking, where the policy exploits flaws in the reward model rather than satisfying the underlying preference, is documented in Scaling Laws for Reward Model Overoptimization (Gao, Schulman, Hilton, arXiv:2210.10760, 2022). Sycophancy, where RLHF-trained models tell users what they want to hear rather than the truth, was demonstrated in Towards Understanding Sycophancy in Language Models (Sharma et al., arXiv:2310.13548, 2023). Mode collapse and reduced output diversity after RLHF are also widely reported.
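The overoptimization effect can be made concrete with the functional forms reported by Gao et al., in which the "gold" reward is expressed as a function of d, the square root of the KL divergence from the initial policy. The sketch below is illustrative only: the function names are hypothetical, and the `alpha` and `beta` values in the usage lines are made-up numbers chosen to show the shape of the curve, not coefficients fitted in the paper.

```python
import math


def gold_reward_best_of_n(kl: float, alpha: float, beta: float) -> float:
    """Best-of-n form: R(d) = d * (alpha - beta * d), with d = sqrt(KL)."""
    d = math.sqrt(kl)
    return d * (alpha - beta * d)


def gold_reward_rl(kl: float, alpha: float, beta: float) -> float:
    """RL form: R(d) = d * (alpha - beta * log d), with d = sqrt(KL)."""
    d = math.sqrt(kl)
    if d == 0.0:
        return 0.0  # limit of d * log(d) as d -> 0
    return d * (alpha - beta * math.log(d))


if __name__ == "__main__":
    # Hypothetical coefficients: gold reward rises, peaks, then falls
    # as the policy moves further from its starting point.
    for kl in (1.0, 25.0, 100.0):
        print(kl, gold_reward_best_of_n(kl, alpha=1.0, beta=0.1))
```

With these assumed coefficients the best-of-n curve peaks at d = alpha / (2 * beta) and returns to zero at d = alpha / beta, which is the qualitative pattern of reward hacking: continued optimization against the proxy eventually lowers true quality.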
Newer methods are partially displacing classical PPO-based RLHF. Direct Preference Optimization (DPO) by Rafailov et al. (arXiv:2305.18290, 2023) eliminates the explicit reward model and the RL loop entirely, reformulating the problem as a single classification loss over preference pairs, and is now widely used for preference tuning in the open-source ecosystem. IPO, KTO (Kahneman-Tversky Optimization), and ORPO are further refinements. OpenAI's o1 family (September 2024) shifted further, using RL on chain-of-thought reasoning traces rather than purely on human preference rankings.
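The DPO objective for a single preference pair can be sketched as follows. This is a minimal, assumption-laden illustration rather than a reference implementation: the function name and the default `beta` are hypothetical, and each input is assumed to be the summed log-probability of a full response under either the policy being trained or a frozen reference model.

```python
import math


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair of responses."""
    # Implicit rewards: how much more the policy favors each response
    # than the reference model does.
    chosen_logratio = logp_chosen - ref_logp_chosen
    rejected_logratio = logp_rejected - ref_logp_rejected
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(logits)), computed in a numerically stable form.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


if __name__ == "__main__":
    # Policy identical to the reference: loss is log(2).
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Policy has shifted probability toward the chosen response: loss drops.
    print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

The loss has the same logistic shape as the reward-model loss used in classical RLHF, but the "reward" is read off the policy's own log-probabilities, which is why no separate reward model or RL loop is needed.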
NIST's AI Risk Management Framework (NIST AI 100-1, January 2023) and the subsequent Generative AI Profile (NIST AI 600-1, July 2024) reference preference-based fine-tuning as a primary alignment lever for governing model behavior. Both documents are voluntary frameworks rather than binding regulation, but they make the technique a governance touchpoint, not just a research method.
Key facts
- RLHF for language was operationalized in InstructGPT, March 2022, using a team of about 40 contract labelers and roughly 13,000 training prompts for the supervised fine-tuning stage (arXiv:2203.02155).
- The original deep RL from human preferences paper taught a simulated robot to backflip using approximately 900 human comparison queries; its Atari experiments used several thousand comparisons per game (arXiv:1706.03741).
- PPO, the RL algorithm used in classical RLHF, was published by Schulman et al. in 2017 (arXiv:1707.06347).
- Llama 2 Chat used over 1,000,000 Meta-collected human preference comparisons for reward modeling (arXiv:2307.09288, Table 6).
- The Bradley-Terry preference model (Bradley & Terry, Biometrika 1952) underlies the standard pairwise loss for the reward model.
- Reward model overoptimization follows a documented scaling relationship: gold reward is well described as a function of the square root of the KL divergence from the initial policy, rising and then falling as optimization against the proxy reward continues (arXiv:2210.10760).
- Constitutional AI (RLAIF) replaces human harmlessness labels with AI critique and AI preference judgments grounded in a 16-principle written constitution (arXiv:2212.08073).
- Direct Preference Optimization (DPO) is reported to match or exceed PPO-RLHF on the tasks evaluated in its paper, with no explicit reward model and no RL loop (arXiv:2305.18290).
- NIST AI 600-1 (Generative AI Profile, July 2024) treats preference fine-tuning as a governance lever for harm reduction.
- Anthropic's Claude, OpenAI's GPT-4, Google's Gemini, and Meta's Llama 2/3 are all reported by their developers to use RLHF or a direct descendant (DPO, RLAIF) in post-training.
Sources
- Christiano et al., "Deep reinforcement learning from human preferences," arXiv:1706.03741
- Ouyang et al., "Training language models to follow instructions with human feedback" (InstructGPT), arXiv:2203.02155
- Stiennon et al., "Learning to summarize from human feedback," arXiv:2009.01325
- Bai et al., "Constitutional AI: Harmlessness from AI Feedback," arXiv:2212.08073
- Schulman et al., "Proximal Policy Optimization Algorithms," arXiv:1707.06347
- Rafailov et al., "Direct Preference Optimization," arXiv:2305.18290
- Gao, Schulman, Hilton, "Scaling Laws for Reward Model Overoptimization," arXiv:2210.10760
- NIST AI 600-1, "Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile," July 2024