
Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn (Stanford University, 2023) · arXiv:2305.18290

Direct Preference Optimization (DPO): Skipping the Reward Model in Alignment

You can fine-tune a language model on human preferences (this answer is better than that one) using ordinary supervised training — no separate reward model, no reinforcement learning loop — by treating the language model itself as its own reward function.

2. What the scientists actually did

The setup is the same starting point as RLHF. You have:

- A starting language model, typically one that has already been supervised-fine-tuned, which serves as the frozen reference.
- A dataset of pairwise preferences: a prompt, two candidate answers, and a human label saying which answer was preferred.

In classical RLHF (Christiano et al. 2017; Ouyang et al. 2022, InstructGPT) the procedure is:

1. Train a separate reward model on the preference data to predict which answer humans would pick.
2. Use that reward model as a scoring function and run PPO, a reinforcement learning algorithm, to tune the language model to produce high-scoring answers, while a KL penalty keeps the tuned model from drifting too far from the reference.
3. Pray your PPO run does not collapse, reward-hack, or explode. Tune endlessly.

The DPO authors derived an exact mathematical equivalence. Under the Bradley-Terry preference model (the same one RLHF assumes), the optimal policy for the KL-constrained RLHF objective has a closed-form relationship to the reward function. That relationship can be inverted: the reward can be written in terms of the policy and the reference. Plug the inversion back into the standard preference-modeling loss, and the reward model disappears entirely; the language model's own log-probabilities take its place.

The resulting loss is one line of code. For each preference pair, you compute the log-probability the current model assigns to the preferred answer minus the log-probability it assigns to the rejected answer, anchored against the same difference under a frozen reference copy of the model, scaled by a temperature constant called beta. You push that difference up. That's it.

They tested DPO against PPO-RLHF on three tasks:

- Sentiment-controlled generation (IMDb).
- Summarization (Reddit TL;DR).
- Single-turn dialogue (Anthropic HH).

Across these tasks, DPO matched or exceeded the PPO-based baselines. On sentiment, it reached a better trade-off between reward and KL divergence from the reference, measured with a ground-truth sentiment classifier. On summarization and dialogue, it was scored by win rate, with GPT-4 as a proxy for human evaluators. It did so with simpler training, fewer hyperparameters, and less compute, since nothing has to be sampled from the model during training.

Crucially, DPO was more stable. RLHF runs frequently require multiple restarts and hyperparameter sweeps; DPO runs more or less worked the first time.
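To make the "one line of code" claim concrete, here is a minimal sketch of the per-pair loss described above, written in plain Python so it runs without a deep learning framework. It assumes you have already computed sequence-level log-probabilities (the sum of token log-probs of each answer given the prompt) under both the current model and the frozen reference. The function names, argument names, and the default `beta=0.1` are illustrative choices, not taken from the authors' code.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Mean DPO loss over a batch of preference pairs.

    Each argument is a list of sequence-level log-probabilities,
    one entry per preference pair. `beta` is an illustrative default.
    """
    losses = []
    for pc, pr, rc, rr in zip(policy_chosen_logp, policy_rejected_logp,
                              ref_chosen_logp, ref_rejected_logp):
        policy_gap = pc - pr       # how much the current model favors the preferred answer
        reference_gap = rc - rr    # the same gap under the frozen reference
        margin = beta * (policy_gap - reference_gap)
        losses.append(-log_sigmoid(margin))  # minimizing this pushes the margin up
    return sum(losses) / len(losses)


# Hypothetical log-probabilities for a single preference pair.
untrained = dpo_loss([-10.0], [-12.0], [-10.0], [-12.0])  # policy == reference
improved = dpo_loss([-9.0], [-15.0], [-10.0], [-12.0])    # policy widened the gap
print(untrained, improved)
```

When the policy equals the reference, the margin is zero and the loss is log 2 (about 0.693); any widening of the gap relative to the reference lowers it. In a real training loop the log-probabilities would be differentiable tensors and the reference values would be computed without gradients.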

3. What scientists know but rarely say out loud

- DPO is not magic: it is the same objective as RLHF, written differently. Anywhere the underlying preference model is misspecified, both methods fail in the same direction. DPO doesn't fix bad data; it just makes bad data cheaper to train on.
- DPO subtly over-fits to the preference dataset. Because the reference model is frozen and the policy can keep increasing log-prob gaps on the training pairs without ever hitting a reward ceiling, DPO tends to push down probability on all answers (preferred and rejected) while widening the gap between them. Several follow-ups (IPO, KTO, SimPO, ORPO) exist mostly to patch this issue.
- The beta parameter, the KL-penalty proxy, does most of the heavy lifting. Pick it wrong and the model either fails to update or destroys its general capability. Many published DPO runs use beta values inherited from the original paper without re-tuning.
- DPO needs a high-quality reference model. If your starting model can't produce competent answers, no amount of preference tuning will fix it. Alignment is mostly a polishing step; the capability has to be there first.
- "Aligned" in the DPO sense means "preferred by the humans (or AI judges) who labeled the dataset." That is not the same as safe, honest, or wise. It means stylistically agreeable to a specific labeling pool. Substitute a different labeling pool and you get a different "aligned" model.
- DPO is one of the cleanest examples of a result that propagated almost overnight because it was simple, free of trade secrets, and ran on commodity hardware. Within months of the arXiv post, it became the default in open-source alignment.
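The over-fitting and beta points above can be illustrated with a small numeric sketch. It assumes the standard DPO margin (beta times the difference between the preferred and rejected log-ratios against the reference). All log-probabilities are made-up numbers chosen for illustration, and the helper names are hypothetical.

```python
import math


def dpo_margin(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta):
    """beta * (log-ratio of the preferred answer - log-ratio of the rejected answer)."""
    return beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))


def dpo_pair_loss(margin):
    """-log(sigmoid(margin)), written as a numerically stable softplus(-margin)."""
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def gradient_weight(margin):
    """sigmoid(-margin): how strongly this pair still pulls on the model."""
    return 1.0 / (1.0 + math.exp(margin))


# Hypothetical sequence log-probabilities under the frozen reference.
ref_chosen, ref_rejected = -20.0, -22.0

# Before training the policy equals the reference, so the margin is zero.
loss_before = dpo_pair_loss(
    dpo_margin(ref_chosen, ref_rejected, ref_chosen, ref_rejected, beta=0.1))

# After training (made-up numbers): BOTH answers became less likely,
# but the rejected one fell much further, so the gap widened.
policy_chosen, policy_rejected = -25.0, -40.0
loss_after = dpo_pair_loss(
    dpo_margin(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1))

print(f"loss before: {loss_before:.4f}, loss after: {loss_after:.4f}")

# Same policy state, different beta: a larger beta saturates the pair sooner,
# so the model is pushed less far from the reference.
for beta in (0.01, 0.1, 1.0):
    m = dpo_margin(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta)
    print(f"beta={beta}: margin={m:.2f}, gradient weight={gradient_weight(m):.6f}")
```

The loss goes down even though the preferred answer's log-probability dropped from -20 to -25, because the loss only sees the gap, never the absolute likelihood. The loop shows the other half: with a small beta the pair keeps exerting a large gradient weight, while a large beta makes it fade quickly, which is one way to read beta's role as a stand-in for the KL penalty.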

4. What the paper does NOT claim

- It does not claim DPO produces a safer model than RLHF. It claims comparable alignment quality at lower cost.
- It does not claim DPO is a better reward model; it trains no separate reward model at all. Some downstream uses still need a standalone scorer (e.g., best-of-N sampling at inference), and DPO offers only an implicit one: beta times the log-ratio between the tuned policy and the reference, which takes both models to evaluate.
- It does not claim DPO scales to all preference data shapes. The derivation assumes pairwise preferences under Bradley-Terry. List-wise preferences, scalar ratings, or thumbs-up-only signals require different methods (KTO, etc.).
- It does not claim DPO solves reward hacking, sycophancy, or specification gaming. Those are properties of the preference data, not the optimizer.
- It does not claim DPO is the final word. The authors explicitly invite follow-up work, and the follow-ups arrived fast.
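To see where the pairwise Bradley-Terry assumption mentioned above enters, the derivation can be summarized in four steps. This is a condensed sketch in the notation commonly used for the paper: $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ the frozen reference, $y_w$ and $y_l$ the preferred and rejected answers, $\sigma$ the logistic function, and $Z(x)$ a normalizer that depends only on the prompt.

```latex
% 1. Bradley-Terry: preferences depend only on a reward difference.
p^*(y_w \succ y_l \mid x) = \sigma\big(r^*(x, y_w) - r^*(x, y_l)\big)

% 2. Optimum of the KL-constrained RLHF objective for a reward r.
\pi_r(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\,
                  \exp\!\Big(\tfrac{1}{\beta}\, r(x, y)\Big)

% 3. Invert: express the reward through the policy.
r(x, y) = \beta \log \frac{\pi_r(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
          + \beta \log Z(x)

% 4. Substitute into step 1. The Z(x) terms cancel in the difference,
%    leaving a loss over the policy alone.
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
  \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right) \right]
```

The cancellation in step 4 relies on comparing exactly two answers to the same prompt, which is why scalar ratings or thumbs-up-only signals do not fit this loss directly.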

5. Read the original

- Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2023). *Direct Preference Optimization: Your Language Model is Secretly a Reward Model.* arXiv:2305.18290. https://arxiv.org/abs/2305.18290
- Ouyang, L., et al. (2022). *Training language models to follow instructions with human feedback* (InstructGPT; the canonical RLHF paper DPO replaces). arXiv:2203.02155. https://arxiv.org/abs/2203.02155
- Christiano, P., et al. (2017). *Deep reinforcement learning from human preferences* (the original RLHF framework). arXiv:1706.03741. https://arxiv.org/abs/1706.03741
- Azar, M. G., et al. (2023). *A General Theoretical Paradigm to Understand Learning from Human Preferences* (IPO; the most cited DPO follow-up addressing over-fitting). arXiv:2310.12036. https://arxiv.org/abs/2310.12036
- Tunstall, L., et al. (2023). *Zephyr: Direct Distillation of LM Alignment* (the first widely reproduced open-source model trained with DPO; canonical worked example). arXiv:2310.16944. https://arxiv.org/abs/2310.16944
