01 · Why this matters to your life
When you ask ChatGPT a question and it answers clearly without rambling, declines inappropriate requests, and mostly avoids harmful content, that is RLHF (reinforcement learning from human feedback) in action. A base GPT model is trained only to continue text, so it tends to ramble and will attempt to complete almost any prompt it is given. RLHF is what makes the model usable instead of just powerful.
The reason this paper matters: it laid out an end-to-end recipe, demonstrated at scale, for turning a raw language model into a polite assistant. Most major consumer AI assistants released since 2022 (ChatGPT, Claude, Gemini, Grok) are reported to use some variant of this technique. It is the bridge between research and product.
02 · What scientists actually did
Three stages. First (Supervised Fine-Tuning, SFT): they collected about 13,000 prompts paired with high-quality demonstration answers written by humans. They fine-tuned the base GPT-3 model on these so that it would follow instructions instead of just completing text.
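The supervised step can be sketched as an ordinary next-token loss. The snippet below is a minimal illustration, not the paper's training code: `sft_loss` and its arguments are hypothetical names, and scoring only the response tokens (ignoring the prompt) is an assumption based on common practice rather than something stated above.

```python
def sft_loss(token_logprobs: list[float], is_response_token: list[bool]) -> float:
    """Mean negative log-likelihood over the human-written response tokens.

    token_logprobs[i] is the log-probability the model assigned to the
    correct token at position i. Prompt positions are skipped (an assumption).
    """
    picked = [-lp for lp, keep in zip(token_logprobs, is_response_token) if keep]
    if not picked:
        raise ValueError("no response tokens to train on")
    return sum(picked) / len(picked)


# One prompt token followed by two response tokens.
loss = sft_loss([-0.1, -2.0, -1.0], [False, True, True])
print(loss)  # 1.5
```

Lowering this loss pushes the model toward reproducing the demonstration answers token by token.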
Second (Reward Model training): they had the fine-tuned model generate several responses to each of thousands of prompts, and human contractors ranked those responses from best to worst. They then trained a separate neural network, the reward model, to predict which response humans would prefer. The reward model became a stand-in for human preferences at machine speed.
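A ranking is usually turned into a training signal by splitting it into pairs and penalizing the reward model whenever it scores the lower-ranked response above the higher-ranked one. The sketch below shows that pairwise loss on plain numbers; the function names are hypothetical, and a real reward model would produce the scores from text.

```python
import math
from itertools import combinations


def neg_log_sigmoid(d: float) -> float:
    """Numerically stable -log(sigmoid(d))."""
    return max(-d, 0.0) + math.log1p(math.exp(-abs(d)))


def pairwise_ranking_loss(scores_best_to_worst: list[float]) -> float:
    """Average loss over every (preferred, rejected) pair in one human ranking.

    The input holds the reward model's scores for the responses, ordered the
    way the human ranked them (best first).
    """
    # combinations() keeps input order, so the first item of each pair
    # is always the response the human ranked higher.
    pairs = list(combinations(scores_best_to_worst, 2))
    return sum(neg_log_sigmoid(r_w - r_l) for r_w, r_l in pairs) / len(pairs)


agrees = pairwise_ranking_loss([2.0, 1.0, 0.0])     # scores match the human order
disagrees = pairwise_ranking_loss([0.0, 1.0, 2.0])  # scores reverse it
print(agrees < disagrees)  # True
```

When the model's scores agree with the human ordering the loss is small; when they contradict it the loss grows, which is what drives the reward model toward human preferences.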
Third (Reinforcement Learning): they used the reward model as a scoring function and applied reinforcement learning (specifically, PPO — Proximal Policy Optimization) to further train the language model to produce outputs that scored well. The model learned to write what humans wanted to read, as approximated by the reward model.
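Two pieces of this stage are easy to show in isolation: the score being optimized, and PPO's clipped update rule. The sketch below is illustrative only. The function names are hypothetical, the values of `beta` and `clip_eps` are placeholders rather than the paper's settings, and the per-sample penalty for drifting from the fine-tuned reference model is a standard ingredient of RLHF setups that the text above does not spell out.

```python
import math


def kl_shaped_reward(rm_score: float, logp_policy: float, logp_ref: float,
                     beta: float = 0.02) -> float:
    """Reward-model score minus a penalty for drifting from the reference model."""
    return rm_score - beta * (logp_policy - logp_ref)


def ppo_clipped_objective(logp_new: float, logp_old: float, advantage: float,
                          clip_eps: float = 0.2) -> float:
    """PPO's clipped surrogate objective for one sample (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)  # how much more likely the output became
    clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum removes any incentive to move the ratio far from 1.
    return min(ratio * advantage, clipped * advantage)


# The output became twice as likely, but the credit is capped at 1 + clip_eps.
print(ppo_clipped_objective(math.log(2.0), 0.0, advantage=1.0))
```

The clipping is what makes the optimization "proximal": each update can only move the model a limited distance from the version that generated the samples.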
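The resulting model (InstructGPT) was preferred by human raters over the base GPT-3 even when it was far smaller: outputs from the 1.3-billion-parameter InstructGPT were preferred to those of the 175-billion-parameter GPT-3. The paper also reports gains in truthfulness and a modest reduction in toxic output, though little improvement on bias. The training, not the size, did the work.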
03 · What scientists know but rarely say
RLHF works because human raters are good at telling which of two responses is better, even when they couldn't write a perfect response from scratch. This is a known phenomenon in machine learning — judgment is often easier than generation. The whole technique depends on this asymmetry.
The labor model behind RLHF is rarely discussed in the technical literature. The Time magazine investigation of January 2023 reported that OpenAI contracted Kenyan workers through Sama, at roughly $1.32 to $2.00 per hour, to label graphic content used in safety training. The technical paper doesn't mention this. The economics of RLHF (who labels, what they're paid, what they see) became a meaningful conversation only after the paper was published.
The other unstated reality: RLHF is brittle. The reward model is an approximation of human preferences. If the language model finds outputs the reward model loves but humans actually hate, you get “reward hacking”: the model gets good at fooling the grader rather than at being good. This is a real problem in production systems and one of the motivations for alternatives such as Anthropic's Constitutional AI (/research/decoded/constitutional-ai).
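A toy illustration of that gap, under loudly stated assumptions: the "reward model" here is a deliberately flawed stand-in that simply prefers longer answers, and the `human_score` values are invented for the example. Nothing in it comes from the paper.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    human_score: float  # hypothetical rating a person would give


def proxy_reward(c: Candidate) -> float:
    """Flawed stand-in reward model: more words means a higher score."""
    return float(len(c.text.split()))


def pick_best(candidates, score_fn):
    return max(candidates, key=score_fn)


candidates = [
    Candidate("Paris.", 0.9),
    Candidate("The capital of France is Paris.", 1.0),
    Candidate("Great question! " * 5 + "It might be Paris, but many cities could qualify.", 0.2),
]

by_proxy = pick_best(candidates, proxy_reward)
by_human = pick_best(candidates, lambda c: c.human_score)
print(by_proxy.human_score, by_human.human_score)  # 0.2 1.0
```

Selecting against the proxy picks the padded, evasive answer that a human would rate lowest. Optimizing harder against an imperfect grader tends to widen that kind of gap rather than close it.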
04 · What the paper does NOT claim
The paper does not claim that RLHF solves alignment. It claims that RLHF substantially improves human-rated quality on instruction-following tasks at fixed model size. The follow-up debate is whether RLHF actually makes models “safer” or just makes them better at appearing safe to evaluators — a distinction that has motivated significant safety research since.
The paper also does not claim its preference data is universally correct. It explicitly notes that “human values” — what is helpful, what is harmful — vary across cultures, contexts, and individual judgments. The contractor labels OpenAI used reflect specific choices about who got to grade the AI. The downstream effects of those choices on what every consumer AI considers “polite” or “harmful” are still being unpacked.
05 · Read the original
- arxiv.org/abs/2203.02155: the original InstructGPT paper (Ouyang et al. 2022). ~68 pages.
- Christiano et al. 2017 (Deep Reinforcement Learning from Human Preferences): the foundational paper RLHF descends from. arxiv:1706.03741.
- Stiennon et al. 2020 (Learning to Summarize from Human Feedback): an early, influential application of RLHF to language models, on summarization. arxiv:2009.01325.
- Time magazine investigation (Jan 18, 2023): “OpenAI Used Kenyan Workers on Less Than $2 Per Hour to Make ChatGPT Less Toxic.”
- Rafailov et al. 2023 (Direct Preference Optimization, DPO): a simpler alternative to the RLHF pipeline that skips the separate reward model and the reinforcement-learning step.