::deep-dive

RLHF and Alignment

From InstructGPT to Constitutional AI — how raw language models become helpful, harmless, and honest assistants

A pretrained language model is an extraordinary completion engine, but it is not yet a useful assistant. Turning a base model into an instruction-following assistant is one of the most consequential advances of the 2022-2024 era. The core technique is reinforcement learning from human feedback (RLHF), joined more recently by related methods: direct preference optimization (DPO), Constitutional AI (CAI), and RL from AI feedback (RLAIF).

The canonical paper is Ouyang et al.'s InstructGPT (2022), which established the three-stage pipeline for instruction following: supervised fine-tuning on demonstrations, training a reward model on human preference comparisons, and PPO fine-tuning of the language model against that reward model. Anthropic's HH-RLHF paper (Bai et al., Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback, 2022) is the parallel canonical work and the basis for the public HH dataset. Constitutional AI (Bai et al., 2022) introduced AI feedback as a scalable alternative to pure human labeling and named the resulting RL stage RLAIF. Direct Preference Optimization (Rafailov et al., 2023) showed that the explicit reward model and the RL step can be replaced with a single classification-style loss on preference pairs, which greatly simplifies training.

A doctorate-grade understanding of this area requires more than reading these papers. It also requires the underlying RL theory (PPO, the KL-constrained reward objective, the reward hacking literature), the empirical pathologies (reward model overoptimization, sycophancy, mode collapse, sandbagging), and the broader alignment context (what RLHF can and cannot do, why scalable oversight is hard, and how RLHF relates to the alignment problem proper). This page connects to the AI safety page and the interpretability page; all three are aspects of the same project.
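
The second and third stages of the pipeline described above can be made concrete with two small formulas. The sketch below is illustrative only and is not taken from any of the cited codebases: it assumes a Bradley-Terry pairwise loss for the reward model and a sequence-level reward equal to the reward-model score minus a KL penalty toward the reference (SFT) model. The function names, the scalar inputs, and the default `beta=0.1` are assumptions made for this example.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    # log sigmoid(x) = min(x, 0) - log(1 + exp(-|x|))
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def pairwise_rm_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood for one preference pair.

    r_chosen / r_rejected are the scalar reward-model scores for the
    response the labeler preferred and the one they rejected.
    """
    return -log_sigmoid(r_chosen - r_rejected)


def kl_shaped_reward(rm_score, logp_policy, logp_ref, beta=0.1):
    """Reward-model score minus a KL penalty toward the reference model.

    logp_policy / logp_ref are per-token log-probabilities of the sampled
    response under the current policy and the frozen reference model.
    Their summed difference is a single-sample estimate of the KL term.
    beta is a hypothetical penalty coefficient chosen for illustration.
    """
    kl_estimate = sum(p - q for p, q in zip(logp_policy, logp_ref))
    return rm_score - beta * kl_estimate


if __name__ == "__main__":
    # Equal scores: the reward model is indifferent, so the loss is log 2.
    print(pairwise_rm_loss(0.0, 0.0))
    # A policy that has drifted from the reference pays a KL penalty.
    print(kl_shaped_reward(1.0, [-1.0, -2.0], [-1.5, -2.5], beta=0.1))
```

In a real training run the scores and log-probabilities would come from neural networks and be batched tensors; the scalar version is only meant to show which quantities enter each term.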

::reading path · in order

::01 · paper
~8h
Training language models to follow instructions with human feedback — Ouyang et al. (InstructGPT paper, OpenAI 2022)
The foundational RLHF-for-assistants paper. Read every section including the appendices on PPO and reward model training.
::02 · paper
~8h
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback — Bai et al. (Anthropic 2022)
Anthropic's parallel canonical paper. Read for the HH-RLHF dataset and the helpful-vs-harmless tradeoff analysis.
::03 · paper
~6h
Constitutional AI: Harmlessness from AI Feedback — Bai et al. (Anthropic 2022)
Introduces AI feedback as a scalability mechanism. The basis for RLAIF and a major step toward scalable oversight.
::04 · paper
~6h
Direct Preference Optimization: Your Language Model is Secretly a Reward Model — Rafailov, Sharma, Mitchell, Ermon, Manning, Finn (Stanford 2023)
DPO eliminates the explicit reward model and PPO. The current default for many open-source post-training pipelines.
::05 · paper
~4h
Proximal Policy Optimization Algorithms — Schulman, Wolski, Dhariwal, Radford, Klimov (OpenAI 2017)
The RL algorithm underlying classical RLHF. Read alongside the Sutton and Barto chapter on policy gradients.
::06 · paper
~5h
Learning to Summarize from Human Feedback — Stiennon et al. (OpenAI 2020)
The pre-InstructGPT RLHF paper. Cleaner experimental setting that demonstrates the core mechanism.
::07 · paper
~4h
Scaling Laws for Reward Model Overoptimization — Gao, Schulman, Hilton (OpenAI 2022)
Quantifies reward hacking. Essential for understanding why RLHF has limits.
::08 · paper
~5h
Sycophancy to Subterfuge: Investigating Reward Tampering in Language Models — Denison et al. (Anthropic 2024)
Empirical evidence of reward hacking generalizing to specification gaming and reward tampering.
::09 · textbook
~10h
Reinforcement Learning: An Introduction — Sutton and Barto (chapter 13 on policy gradients)
Background for understanding PPO. Read if your RL fundamentals are weak.
::10 · course
~25h
Spinning Up in Deep RL — OpenAI (spinningup.openai.com)
Practical implementations of policy gradient methods. Work through the PPO implementation to internalize the algorithm.
::11 · code
~15h
TRL (Transformer Reinforcement Learning) — HuggingFace (github.com/huggingface/trl)
Modern reference implementation of SFT, RM training, PPO, and DPO. Read the source code, then use it.

::exercises · build · derive · reproduce

01 Implement a tiny preference dataset (you produce 50 labeled pairs) and train a reward model from a base LM.
02 Implement DPO from scratch on top of HuggingFace transformers. Verify against the TRL implementation.
03 Reproduce a sycophancy result: prompt a model with leading premises and measure agreement bias before and after a preference fine-tune.
04 Read InstructGPT and Bai et al. HH-RLHF in the same week. Produce a one-page diff of their methodologies.
05 Run a small PPO RLHF training loop on a toy task (e.g., positive-sentiment continuations). Plot KL-to-reference and reward.
06 Design a Constitutional AI red-team-and-revise loop for a small open model. Document the safety improvement (or lack thereof).
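
As a starting point for the DPO exercise, the per-pair loss can be written in a few lines. This is a minimal sketch under stated assumptions, not the TRL implementation: it takes already-summed sequence log-probabilities as plain floats, and the function name, argument names, and default `beta=0.1` are hypothetical choices for this example. A from-scratch version on top of transformers would compute the same quantity on batched tensors.

```python
import math


def dpo_loss(pi_logp_chosen, pi_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for a single preference pair.

    Each argument is the summed log-probability of a full response
    (chosen or rejected) under the trainable policy (pi_*) or the
    frozen reference model (ref_*).
    """
    # Implicit rewards are beta-scaled log-ratios against the reference.
    chosen_logratio = pi_logp_chosen - ref_logp_chosen
    rejected_logratio = pi_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), computed stably as softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    # Policy identical to the reference: margin is 0, loss is log 2.
    print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
    # Policy has moved probability toward the chosen response: lower loss.
    print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

A useful sanity check when comparing against a library implementation is the first case: with the policy initialized from the reference, every pair should give a loss of log 2 (about 0.693) regardless of the data.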

::milestones · observable

▲ You can derive the DPO loss from the KL-constrained RLHF objective on paper.
▲ You have actually trained a reward model and an RLHF or DPO fine-tune.
▲ You can identify reward hacking signatures in a model's outputs.
▲ You can explain Constitutional AI to a skeptical PhD in plain language.
▲ You can explain why RLHF is not alignment, only a partial solution.
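
For the first milestone, the skeleton of the derivation is sketched below as a checklist of the steps to reproduce, following the structure of the argument in Rafailov et al. (2023). The notation is an assumption made for this sketch: $\pi_{\mathrm{ref}}$ is the reference policy, $\beta$ the KL coefficient, $Z(x)$ the partition function, and $(y_w, y_l)$ the preferred and rejected responses. Each step still needs to be justified by hand.

```latex
% 1. KL-constrained RLHF objective
\max_{\pi}\;
  \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\bigl[ r(x, y) \bigr]
  - \beta\, \mathrm{KL}\bigl[ \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \bigr]

% 2. Its closed-form optimum
\pi^{*}(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\,
  \exp\!\Bigl( \tfrac{1}{\beta}\, r(x, y) \Bigr)

% 3. Solve for the reward in terms of the optimal policy
r(x, y) = \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
  + \beta \log Z(x)

% 4. Substitute into the Bradley-Terry model; the \beta \log Z(x) terms cancel
p(y_w \succ y_l \mid x) = \sigma\bigl( r(x, y_w) - r(x, y_l) \bigr)

% 5. Maximum likelihood over a parameterized policy \pi_\theta gives the DPO loss
\mathcal{L}_{\mathrm{DPO}}(\theta) =
  - \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \Bigl[
    \log \sigma\Bigl(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \Bigr)
  \Bigr]
```

The step most worth doing slowly is step 2: show that the objective can be rewritten as a KL divergence between $\pi$ and the Gibbs distribution on the right-hand side, so that the minimum is attained exactly when the two coincide.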