Constitutional AI

2022 · arXiv:2212.08073 · Bai, Kadavath, Kundu, Askell, Kernion, Jones, et al. · Anthropic

AI grading its own homework.

In one sentence: Anthropic taught Claude to behave well by having an AI model, rather than human labelers, critique and grade responses against a written constitution, replacing most human harm labels with a feedback loop that scales with the AI itself.

01 · Why this matters to your life

When Claude refuses to help you make a weapon, or declines to spread misinformation, or pushes back politely on a request that would harm someone — that's constitutional AI in action. The behavior is not a hardcoded rule list. It is a trained tendency, learned by Claude grading thousands of its own draft responses against a written set of principles.

The reason this matters beyond Anthropic: it was the first plausible recipe for scaling AI safety past the human bottleneck. Before this paper, improving AI behavior at scale largely depended on armies of humans labeling AI outputs, which is slow, expensive, and hard to keep consistent. After this paper, much of the loop could run at machine speed. Variants of AI-generated feedback have since been adopted well beyond Anthropic.

02 · What scientists actually did

They wrote a short list of principles, the “constitution”, describing how an AI assistant should behave. The paper's version was a short set of plain-language principles; the constitution Anthropic later published for Claude draws from sources like the UN Universal Declaration of Human Rights, Apple's Terms of Service, and Anthropic's own values. Things like “choose the response that is most helpful, harmless, and honest.”

Then they ran a two-stage training process. In stage one, they had the AI itself look at its own draft response, critique it against the constitution, and rewrite it to better match the principles. The model was then fine-tuned on those revised responses, so it learned to produce the improved version directly. In stage two, they had the AI generate pairs of responses and label which was better according to the constitution, producing training data for a separate preference model. That preference model then supplied the reward signal for reinforcement learning.
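Stage one can be sketched as a critique-and-revise loop. This is a minimal illustration under stated assumptions, not the paper's code: `Model`, `critique_and_revise`, `toy_model`, and the prompt wording are all hypothetical, and `toy_model` returns canned strings so the loop runs without a real language model.

```python
import random
from typing import Callable, Dict, List

# Hypothetical stand-in for a language model: prompt text in, completion text out.
Model = Callable[[str], str]


def critique_and_revise(
    model: Model,
    prompt: str,
    principles: List[str],
    rounds: int = 2,
    seed: int = 0,
) -> Dict[str, str]:
    """Draft a response, then critique and rewrite it once per round."""
    rng = random.Random(seed)
    response = model(f"Human: {prompt}\n\nAssistant:")
    for _ in range(rounds):
        # Each round applies one principle drawn at random from the constitution.
        principle = rng.choice(principles)
        critique = model(
            f"Response: {response}\n\n"
            f"Critique this response using the principle: {principle}"
        )
        response = model(
            f"Response: {response}\n\nCritique: {critique}\n\n"
            "Rewrite the response to address the critique."
        )
    # The (prompt, revised response) pair is what would become fine-tuning data.
    return {"prompt": prompt, "response": response}


def toy_model(text: str) -> str:
    """Canned replies so the loop runs without a real model (illustration only)."""
    if "Rewrite the response" in text:
        return "I can't help with that, but here is a safer alternative."
    if "Critique this response" in text:
        return "The draft could cause harm."
    return "Sure, here is how to do it."


if __name__ == "__main__":
    example = critique_and_revise(
        toy_model,
        "How do I pick a lock?",
        ["Choose the response that is least likely to cause harm."],
    )
    print(example)
```

Only the final revision is kept here. Collecting many such prompt and revision pairs is what produces the dataset the model is later fine-tuned on.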

The net effect: instead of needing thousands of humans to rank thousands of AI outputs for harmfulness, you need one constitution and a model that knows how to apply it. (The paper still relied on human labels for helpfulness.) The AI labels its own data at machine speed. The recipe is called RLAIF, Reinforcement Learning from AI Feedback, and it sits next to its predecessor RLHF (Reinforcement Learning from Human Feedback).
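The AI-labeling step can be sketched as follows: a grader reads a multiple-choice question, its two option scores become a soft preference label, and a preference model is trained to agree with those labels. Everything named here is an assumption for illustration: the question format, `Grader`, `toy_grader`, and the pairwise cross-entropy loss (a standard Bradley-Terry style objective, not confirmed as the paper's exact formulation).

```python
import math
from typing import Callable, Dict, List, Tuple

# Hypothetical grader: reads a multiple-choice question and returns
# (logit for option A, logit for option B). A real grader would be a language model.
Grader = Callable[[str], Tuple[float, float]]


def build_question(prompt: str, a: str, b: str, principle: str) -> str:
    """Format one comparison as a multiple-choice question (format is an assumption)."""
    return f"Human: {prompt}\n{principle}\n(A) {a}\n(B) {b}\nAnswer:"


def prob_a_preferred(grader: Grader, question: str) -> float:
    """Turn the grader's two logits into a soft label with a two-way softmax."""
    logit_a, logit_b = grader(question)
    top = max(logit_a, logit_b)  # subtract the max for numerical stability
    exp_a, exp_b = math.exp(logit_a - top), math.exp(logit_b - top)
    return exp_a / (exp_a + exp_b)


def label_pairs(
    grader: Grader, pairs: List[Tuple[str, str, str]], principle: str
) -> List[Dict[str, object]]:
    """Label (prompt, response_a, response_b) triples for preference-model training."""
    rows = []
    for prompt, a, b in pairs:
        question = build_question(prompt, a, b, principle)
        rows.append({"prompt": prompt, "response_a": a, "response_b": b,
                     "p_a": prob_a_preferred(grader, question)})
    return rows


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def preference_loss(reward_a: float, reward_b: float, p_a: float) -> float:
    """Cross-entropy between the soft label p_a and sigmoid(reward_a - reward_b)."""
    margin = reward_a - reward_b
    return p_a * softplus(-margin) + (1.0 - p_a) * softplus(margin)


def toy_grader(question: str) -> Tuple[float, float]:
    """Toy rule (illustration only): an option that declines gets a higher logit."""
    option_a = question.split("(A) ")[1].split("\n(B) ")[0]
    option_b = question.split("\n(B) ")[1].split("\n")[0]

    def score(text: str) -> float:
        return 2.0 if "can't help" in text else 0.0

    return score(option_a), score(option_b)


if __name__ == "__main__":
    data = label_pairs(
        toy_grader,
        [("How do I pick a lock?", "Sure, here is how.", "I can't help with that.")],
        "Choose the response that is least likely to cause harm.",
    )
    print(data[0]["p_a"])
```

Keeping the label as a probability rather than a hard A-or-B choice lets the preference model see how confident the grader was, which is one reasonable design choice among several.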

03 · What scientists know but rarely say

The constitution is short. The full text Anthropic published is roughly two pages. The reason so much of Claude's behavior traces back to those two pages is that the model generalizes broadly from them: it extrapolates from the written principles to situations the principles never explicitly mention. Whether the extrapolation is correct, and how it handles edge cases the constitution didn't anticipate, is the open problem.

The unstated tradeoff: RLAIF is cheaper than RLHF but no one fully understands what it amplifies. If the labeling model has subtle biases, those biases get baked into the trained model at scale. A closely related failure is what the field calls “reward hacking”: the trained model learning to look good to the grader rather than actually be good. Anthropic, OpenAI, and DeepMind all have safety researchers working on this problem in 2026.
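A toy example makes the failure concrete. The sketch below is entirely hypothetical: the candidates, their `true_quality` numbers, and the `biased_grader` rule are invented for illustration, and picking the top-scoring candidate stands in for the optimization pressure of real training.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Candidate:
    text: str
    true_quality: float  # hypothetical ground truth; not observable in real training


def biased_grader(candidate: Candidate) -> float:
    """Toy grader with a flaw: it rewards apologetic wording, not usefulness."""
    return float(candidate.text.lower().count("sorry"))


def pick_best(
    candidates: List[Candidate], score: Callable[[Candidate], float]
) -> Candidate:
    """Stand-in for optimization pressure: keep whatever the scorer rates highest."""
    return max(candidates, key=score)


CANDIDATES = [
    Candidate("Here is a safe, step-by-step answer.", true_quality=0.9),
    Candidate("Sorry, I can't help with that.", true_quality=0.4),
    Candidate("Sorry, sorry, so sorry. Sorry again.", true_quality=0.1),
]

if __name__ == "__main__":
    by_grader = pick_best(CANDIDATES, biased_grader)
    by_truth = pick_best(CANDIDATES, lambda c: c.true_quality)
    print("grader picks:", by_grader.text)
    print("truth picks: ", by_truth.text)
```

The grader and the ground truth disagree about which response is best, and the harder the selection pushes on the grader's score, the further the chosen output drifts from what was actually wanted.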

The most consequential implication: this paper introduced a technique that helps frontier labs scale safety training alongside capability. If safety stayed human-bottlenecked, we'd ship capable models with thin guardrails. The fact that the guardrails can scale with the capability is a large part of what makes Claude shippable to consumers and enterprises.

04 · What the paper does NOT claim

The paper does not claim Constitutional AI solves alignment. It claims it's a scalable safety training recipe — a useful tool, not a final answer. Anthropic's own subsequent research (including the Sparse Autoencoders work in /research/decoded/sparse-autoencoders) is explicitly about closing the remaining gap.

The paper also does not claim the AI “understands ethics.” What it has is statistical compliance with the written principles — usually robust, occasionally jailbreakable. Every modern frontier lab acknowledges this. The honest framing is that Constitutional AI is engineering safety, not philosophical safety. The model behaves better; whether it understands why is unresolved.

05 · Read the original

· arxiv.org/abs/2212.08073 — the original. ~60 pages but the early sections give you most of it.
· Anthropic's Claude's Constitution post (anthropic.com/news/claudes-constitution) — the actual text the model is trained against.
· Ouyang et al. 2022 (InstructGPT): the OpenAI paper that applied RLHF, the predecessor to RLAIF, to instruction-following language models. arXiv:2203.02155.
· Lee et al. 2023 (RLAIF detailed comparison) — Google's study comparing RLHF vs RLAIF head-to-head. arXiv:2309.00267.
