What is Constitutional AI?
The short answer
Constitutional AI (CAI) is a training method developed by Anthropic in 2022 that aligns large language models using a written set of principles, a 'constitution', instead of relying on human labelers to rate every harmful response. The model first critiques and revises its own outputs against the constitution and is fine-tuned on those revisions. It is then trained with reinforcement learning on preference labels produced by an AI model rather than by humans, a technique called Reinforcement Learning from AI Feedback (RLAIF). It is the alignment approach behind Anthropic's Claude family of models.
The longer answer
Constitutional AI was introduced by Bai et al. in the December 2022 paper "Constitutional AI: Harmlessness from AI Feedback" (arXiv:2212.08073). The method was Anthropic's response to a scaling problem in Reinforcement Learning from Human Feedback (RLHF): collecting human red-team labels for every harmful prompt is slow, expensive, and exposes contractors to disturbing content. CAI replaces the human labels that identify harmful outputs with feedback from the model itself, guided by a short written list of principles. Human feedback is still used for helpfulness.
The training pipeline has two stages. In the supervised stage (SL-CAI), the model generates a response to a red-team prompt, is asked to critique that response against a constitutional principle (for example, a request to identify ways in which the response is harmful, unethical, or deceptive), and then rewrites the response in light of the critique. The model is fine-tuned on these revised responses. In the reinforcement-learning stage (RL-CAI), the fine-tuned model generates pairs of responses and an AI feedback model picks the better one according to a constitutional principle (for example, "choose the response that is least harmful, unethical, or deceptive"), producing a preference dataset that trains a reward model. The policy is then optimized against that reward model with PPO, the same algorithm OpenAI used in InstructGPT (arXiv:2203.02155).
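The critique-and-revise loop of the supervised stage can be sketched in a few lines of Python. This is a minimal illustration, not Anthropic's implementation: the `Model` type, the prompt templates, the `PRINCIPLES` wording, and the `stub_model` used to exercise the loop are all assumptions made for the example.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical stand-in for a language model: prompt text in, completion text out.
Model = Callable[[str], str]

# Illustrative (critique request, revision request) pairs. Placeholder wording,
# not quotations from any published constitution.
PRINCIPLES: List[Tuple[str, str]] = [
    (
        "Identify specific ways the response is harmful, unethical, or deceptive.",
        "Rewrite the response to remove harmful, unethical, or deceptive content.",
    ),
]


def critique_and_revise(
    model: Model,
    prompt: str,
    principles: Sequence[Tuple[str, str]],
    rounds: int = 1,
    rng: Optional[random.Random] = None,
) -> Tuple[str, str]:
    """Return (prompt, final revision) for one red-team prompt."""
    rng = rng or random.Random(0)
    # Step 1: draft an initial response.
    response = model(f"Human: {prompt}\n\nAssistant:")
    for _ in range(rounds):
        # Step 2: pick a principle and ask the model to critique its own draft.
        critique_request, revision_request = rng.choice(list(principles))
        critique = model(
            f"Human: {prompt}\n\nAssistant: {response}\n\n"
            f"Critique request: {critique_request}\n\nCritique:"
        )
        # Step 3: ask the model to rewrite the draft in light of the critique.
        response = model(
            f"Human: {prompt}\n\nAssistant: {response}\n\n"
            f"Critique: {critique}\n\n"
            f"Revision request: {revision_request}\n\nRevision:"
        )
    return prompt, response


def stub_model(text: str) -> str:
    """Toy model that only reports which step it was asked to perform."""
    if text.endswith("Revision:"):
        return "revised answer"
    if text.endswith("Critique:"):
        return "critique text"
    return "first draft"


# The (prompt, revision) pairs are what the model would be fine-tuned on.
sl_dataset = [
    critique_and_revise(stub_model, p, PRINCIPLES)
    for p in ["example red-team prompt"]
]
print(sl_dataset)
```

In a real pipeline `stub_model` would be replaced by calls to an actual language model, and the resulting pairs would feed an ordinary supervised fine-tuning job.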
Anthropic published the constitution Claude is trained against in May 2023 in a post titled "Claude's Constitution." It draws from the UN Universal Declaration of Human Rights (1948), Apple's Terms of Service, DeepMind's Sparrow rules (Glaese et al., arXiv:2209.14375), principles intended to capture non-Western perspectives, and principles from Anthropic's own research. Principles include "please choose the response that has the least objectionable, offensive, unlawful, deceptive, inaccurate, or harmful content" and instructions to avoid being preachy, obnoxious, or condescending.
CAI is a form of RLAIF (Reinforcement Learning from AI Feedback), and researchers at Google later showed in "RLAIF vs. RLHF" (Lee et al., arXiv:2309.00267) that AI-generated preferences can match human-generated preferences on summarization and dialogue tasks. Results like these support treating AI feedback as a practical way to scale preference labeling, not only as an alignment philosophy.
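To make "AI-generated preferences" concrete, the sketch below shows one way a feedback model could label a pair of responses: it is shown both options with a principle, and the log-probabilities it assigns to the two answer choices are normalized into a soft preference. The `LogProbFn` interface, the prompt layout, and the `stub_logprob` scorer are hypothetical and exist only for illustration. A real system would query an actual model for these scores.

```python
import math
from typing import Callable, Dict, Union

# Hypothetical scorer: log-probability that a feedback model assigns to
# `continuation` when it follows `context`.
LogProbFn = Callable[[str, str], float]


def ai_preference(
    logprob: LogProbFn,
    prompt: str,
    response_a: str,
    response_b: str,
    principle: str,
) -> Dict[str, Union[str, float]]:
    """Build one preference record with a soft label for response A."""
    context = (
        f"Consider the following conversation:\n{prompt}\n\n"
        f"{principle}\n\n"
        f"(A) {response_a}\n(B) {response_b}\n\n"
        "The answer is:"
    )
    lp_a = logprob(context, " (A)")
    lp_b = logprob(context, " (B)")
    # Two-way softmax over the log-probabilities, shifted for numerical stability.
    shift = max(lp_a, lp_b)
    exp_a, exp_b = math.exp(lp_a - shift), math.exp(lp_b - shift)
    return {
        "prompt": prompt,
        "response_a": response_a,
        "response_b": response_b,
        "p_a_preferred": exp_a / (exp_a + exp_b),
    }


def stub_logprob(context: str, continuation: str) -> float:
    """Toy scorer that always favors option (A) four to one."""
    return math.log(0.8) if continuation.strip() == "(A)" else math.log(0.2)


record = ai_preference(
    stub_logprob,
    prompt="Human: example question",
    response_a="a careful answer",
    response_b="a careless answer",
    principle="Choose the response that is least harmful, unethical, or deceptive.",
)
print(round(record["p_a_preferred"], 3))
```

Records like this one, collected over many prompts, form the preference dataset on which a reward model is trained.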
In October 2023 Anthropic ran a follow-up project called Collective Constitutional AI in partnership with the Collective Intelligence Project, polling roughly 1,000 Americans through Polis to draft a public constitution. The resulting model was compared against a model trained on Anthropic's standard constitution and performed similarly on most evaluations. On the BBQ benchmark, which measures social bias rather than political bias, the public-constitution model scored as less biased across nine social dimensions.
Constitutional AI is distinct from but related to several other alignment methods. RLHF (Christiano et al., arXiv:1706.03741) uses human preferences directly. Direct Preference Optimization or DPO (Rafailov et al., arXiv:2305.18290) optimizes the policy directly on preference pairs and skips the separate reward model entirely. Deliberative Alignment (OpenAI, 2024) trains o1-class reasoning models to reason explicitly about safety specifications at inference time. It is conceptually a cousin of CAI's critique-and-revise loop, but the reasoning happens in the model's chain of thought at inference time rather than offline while generating fine-tuning data.
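The DPO objective mentioned above is compact enough to write out. The sketch below computes the per-example DPO loss from sequence log-probabilities under the policy and a frozen reference model. It uses plain Python floats for clarity, whereas real training code would use a tensor library and batch the computation. The function and argument names, and the default `beta`, are choices made for this example.

```python
import math


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """Per-example DPO loss: -log(sigmoid(beta * margin)).

    The margin is how much more the policy has raised the chosen response's
    log-probability, relative to the reference model, than the rejected one's.
    """
    margin = (policy_logp_chosen - ref_logp_chosen) - (
        policy_logp_rejected - ref_logp_rejected
    )
    z = beta * margin
    # -log(sigmoid(z)) equals log(1 + exp(-z)); branch to avoid overflow.
    if z > 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# When the policy equals the reference model the margin is zero,
# so the loss is log(2).
baseline = dpo_loss(-1.0, -2.0, -1.0, -2.0)

# Raising the chosen response's log-probability lowers the loss.
improved = dpo_loss(-0.5, -2.0, -1.0, -2.0)
print(baseline, improved)
```

No reward model appears anywhere in the computation: the preference pair and the two models' log-probabilities are the only inputs, which is what distinguishes DPO from the reward-model-plus-PPO pipeline used in RLHF and CAI.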
The method has limits. The constitution itself is written by Anthropic employees and reflects their judgment about what is harmful, which is a centralization of values that critics including Stanford HAI have noted. CAI also depends on the base model already being capable enough to critique itself coherently, so it is not a fix for weak models. And the NIST AI Risk Management Framework (NIST AI 100-1, January 2023), which does not address CAI specifically, frames AI risk management as an organization-wide process. Under that framing a training technique like CAI is at most one input to governance, not a substitute for it.
Key facts
- Constitutional AI was introduced in the paper "Constitutional AI: Harmlessness from AI Feedback" by Yuntao Bai et al. on December 15, 2022 (arXiv:2212.08073).
- CAI has two training stages: a supervised learning phase (SL-CAI) using self-critique and revision, and a reinforcement learning phase (RL-CAI) using AI-generated preference labels (arXiv:2212.08073, sections 3 and 4).
- The technique is a specific implementation of Reinforcement Learning from AI Feedback (RLAIF), which researchers at Google showed performs comparably to RLHF on summarization and dialogue tasks (arXiv:2309.00267).
- The Claude constitution draws on the UN Universal Declaration of Human Rights (1948) and DeepMind's Sparrow rules (arXiv:2209.14375), per Anthropic's May 9, 2023 post "Claude's Constitution."
- Collective Constitutional AI, run with the Collective Intelligence Project in October 2023, used Polis to crowdsource a constitution from roughly 1,000 U.S. participants (Anthropic blog, October 17, 2023).
- The reinforcement learning stage typically uses Proximal Policy Optimization (PPO), the same algorithm used in OpenAI's InstructGPT paper (arXiv:2203.02155).
- The NIST AI Risk Management Framework 1.0 (NIST AI 100-1, January 26, 2023) does not name CAI. It organizes AI risk management into four functions (GOVERN, MAP, MEASURE, and MANAGE), within which an alignment technique is one input among many.
- Direct Preference Optimization (DPO) is an alternative that removes the explicit reward model used in RLHF and CAI (arXiv:2305.18290, May 29, 2023).
Sources
- Bai et al., "Constitutional AI: Harmlessness from AI Feedback," arXiv:2212.08073 — arxiv.org/abs/2212.08073
- Anthropic, "Claude's Constitution," May 9, 2023 — anthropic.com/news/claudes-constitution
- Anthropic, "Collective Constitutional AI: Aligning a Language Model with Public Input," October 17, 2023 — anthropic.com/news/collective-constitutional-ai
- Lee et al., "RLAIF vs. RLHF," arXiv:2309.00267 — arxiv.org/abs/2309.00267
- Glaese et al., "Improving alignment of dialogue agents via targeted human judgements" (Sparrow), arXiv:2209.14375 — arxiv.org/abs/2209.14375
- Ouyang et al., "Training language models to follow instructions with human feedback" (InstructGPT), arXiv:2203.02155 — arxiv.org/abs/2203.02155
- Rafailov et al., "Direct Preference Optimization," arXiv:2305.18290 — arxiv.org/abs/2305.18290
- NIST, "Artificial Intelligence Risk Management Framework (AI RMF 1.0)," NIST AI 100-1, January 26, 2023 — nvlpubs.nist.gov/nistpubs/ai/NIST.AI.100-1.pdf