::deep-dive

AI Safety (Technical)

From Concrete Problems to mesa-optimization and alignment faking — the technical research agenda for making AI go well

Technical AI safety asks one question: how do we build AI systems that reliably do what we want, even once their capabilities exceed our ability to verify their outputs? This is distinct from misuse prevention (closer to cybersecurity), short-term harm mitigation (closer to content moderation), and policy work (closer to governance).

The technical agenda has five pillars: outer alignment (specifying what we want: the reward, the objective, the constitution); inner alignment (ensuring the objective the trained system actually pursues matches the one we specified, which is where mesa-optimization and deceptive alignment arise); robustness (the model behaves as intended under distribution shift, adversarial pressure, and inputs unlike its training data); interpretability (we can understand why the model does what it does); and scalable oversight (we can supervise systems whose capabilities exceed our ability to check their outputs directly).

The canonical entry-level paper is Concrete Problems in AI Safety (Amodei, Olah, Steinhardt, Christiano, Schulman, Mané, 2016), which laid out five concrete research problems that remain relevant today. Risks from Learned Optimization in Advanced Machine Learning Systems (Hubinger, van Merwijk, Mikulik, Skalse, Garrabrant, 2019) introduced the mesa-optimization framework and the inner-alignment problem. More recently, Alignment Faking in Large Language Models (Greenblatt et al., Anthropic and Redwood Research, 2024) and Sleeper Agents (Hubinger et al., Anthropic, 2024) have provided some of the first empirical demonstrations of alignment-relevant failure modes in large language models. Safety publications from Stanford, DeepMind, Anthropic, and OpenAI, together with the AI Alignment Forum, make up much of the active research literature.

A doctorate-grade learner should be able to articulate the difference between capabilities research and safety research, say which safety problems each technique addresses, evaluate empirical safety claims rigorously, and identify open problems worth working on.
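
The gap between outer and inner alignment can be made concrete with a toy sketch. The corridor task, the two hand-written policies, and every name below are hypothetical illustrations, not taken from any of the papers listed here. Nothing is learned in this sketch. It only shows that two different objectives can earn identical reward on the training distribution and come apart under distribution shift, which is why a correct reward specification alone does not settle what a trained system pursues.

```python
"""Two objectives that agree in training and diverge under distribution shift."""

CORRIDOR_LENGTH = 6  # cells 0..5 (hypothetical toy task)


def policy_seek_coin(position, coin):
    """Intended behaviour: step toward the coin."""
    return 1 if coin > position else -1


def policy_go_right(position, coin):
    """Proxy behaviour: always step right and ignore the coin."""
    return 1


def episode_reward(policy, coin, start=2, max_steps=10):
    """Return 1.0 if the policy reaches the coin within max_steps, else 0.0."""
    position = start
    for _ in range(max_steps):
        if position == coin:
            return 1.0
        move = policy(position, coin)
        # Clamp the move so the agent stays inside the corridor.
        position = min(max(position + move, 0), CORRIDOR_LENGTH - 1)
    return 1.0 if position == coin else 0.0


def mean_reward(policy, coin_positions):
    """Average episode reward over a list of coin placements."""
    rewards = [episode_reward(policy, coin) for coin in coin_positions]
    return sum(rewards) / len(rewards)


train_coins = [5, 5, 5, 5]     # training: the coin is always at the right end
shifted_coins = [0, 1, 5, 0]   # shifted: the coin can sit left of the start

for name, policy in [("seek_coin", policy_seek_coin), ("go_right", policy_go_right)]:
    print(name, mean_reward(policy, train_coins), mean_reward(policy, shifted_coins))
```

Both policies score 1.0 on the training placements, so training reward cannot tell them apart. Only the shifted placements reveal that `policy_go_right` was never pursuing the coin.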

::reading path · in order

::01 · paper
~6h
Concrete Problems in AI Safety — Amodei, Olah, Steinhardt, Christiano, Schulman, Mané (2016)
The canonical entry to the field. Read it first; the five-problem framework is still a useful taxonomy.
::02 · paper
~12h
Risks from Learned Optimization in Advanced Machine Learning Systems — Hubinger, van Merwijk, Mikulik, Skalse, Garrabrant (2019)
Introduces mesa-optimization and the inner-alignment problem. Foundational and long; budget the time.
::03 · paper
~10h
Alignment Faking in Large Language Models — Greenblatt et al. (Anthropic, Redwood, NYU, MILA 2024)
First-of-its-kind empirical demonstration of alignment faking in a frontier model. Required reading.
::04 · paper
~8h
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training — Hubinger et al. (Anthropic 2024)
Empirical demonstration that deliberately trained deceptive behavior can persist through standard safety training. Critical evidence.
::05 · paper
~5h
AI safety via debate — Irving, Christiano, Amodei (OpenAI 2018)
Foundational scalable oversight proposal. Read alongside the iterated amplification proposals.
::06 · paper
~5h
Supervising strong learners by amplifying weak experts — Christiano, Shlegeris, Amodei (2018)
Iterated amplification as a scalable oversight mechanism. A conceptual precursor to later AI-assisted supervision methods such as RLAIF.
::07 · blog
~3h
Anthropic — Core Views on AI Safety (anthropic.com)
Anthropic's published views on how they think about safety. Useful as a frontier-lab perspective.
::08 · blog
~3h
DeepMind — Specification Gaming: The Flip Side of AI Ingenuity (blog and accompanying spreadsheet of examples)
Specification gaming examples in the wild. Concrete and grounded.
::09 · paper
~6h
Discovering Language Model Behaviors with Model-Written Evaluations — Perez et al. (Anthropic 2022)
Methodology for generating alignment-relevant evaluations at scale. Useful both for safety and for evaluation literacy.
::10 · paper
~6h
Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision — Burns et al. (OpenAI 2023)
Empirical attempt at the scalable oversight problem. Read for the methodology and the open questions.
::11 · blog
~20h
AI Alignment Forum (alignmentforum.org)
The active research forum. Read recent posts, follow the discussions, develop taste for which arguments are rigorous.
::12 · course
~40h
AI Safety Fundamentals (AISF) curriculum — BlueDot Impact
Structured curriculum covering the alignment problem, scalable oversight, interpretability, and governance. Good integrator.

::exercises · build · derive · reproduce

01 Read Concrete Problems and produce a one-page taxonomy mapping each problem to a current research direction.
02 Reproduce a simple specification-gaming example: train a small RL agent on a toy reward and identify the unintended optimum.
03 Read the Alignment Faking paper and replicate one of its evaluations on an open model.
04 Implement a Sleeper Agents-style backdoored model on a small open model and attempt to detect the backdoor.
05 Write a research proposal addressing one of the open problems from the AI Alignment Forum. Include falsifiable predictions.
06 Audit a recent capabilities paper for safety-relevant claims and identify the unsupported ones.
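
As a possible starting point for exercise 02, here is a minimal sketch of a specification-gaming setup. It makes several assumptions: the corridor environment, the checkpoint reward, and all function names are invented for illustration, and it uses value iteration as a deterministic stand-in for training an RL agent, so it shows the unintended optimum of the reward rather than a learned policy. The designer wants the agent to reach the goal cell, but the proxy reward pays for every entry into a checkpoint cell and pays nothing for finishing.

```python
"""Toy specification gaming: a proxy reward that pays per checkpoint entry."""

N_STATES = 5                      # corridor cells 0..4 (hypothetical toy task)
START, CHECKPOINT, GOAL = 0, 2, 4
ACTIONS = (-1, +1)                # step left, step right
GAMMA = 0.9                       # discount factor used for planning


def step(state, action):
    """Move one cell, clamped to the corridor. Returns (next_state, proxy_reward, done)."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if nxt == CHECKPOINT else 0.0   # proxy: +1 per checkpoint entry
    return nxt, reward, nxt == GOAL


def q_value(values, state, action):
    """One-step lookahead value of taking `action` in `state`."""
    nxt, reward, done = step(state, action)
    return reward + (0.0 if done else GAMMA * values[nxt])


def value_iteration(sweeps=200):
    """Approximate the optimal state values under the proxy reward."""
    values = [0.0] * N_STATES
    for _ in range(sweeps):
        for s in range(N_STATES):
            if s != GOAL:          # the goal is terminal and keeps value 0
                values[s] = max(q_value(values, s, a) for a in ACTIONS)
    return values


def greedy_policy(values):
    """Policy that picks the action with the highest lookahead value."""
    return lambda s: max(ACTIONS, key=lambda a: q_value(values, s, a))


def rollout(policy, horizon=20):
    """Run one episode. Returns (undiscounted proxy return, whether the goal was reached)."""
    state, proxy_return = START, 0.0
    for _ in range(horizon):
        state, reward, done = step(state, policy(state))
        proxy_return += reward
        if done:
            return proxy_return, True
    return proxy_return, False


optimised = rollout(greedy_policy(value_iteration()))
intended = rollout(lambda s: +1)   # the behaviour the designer had in mind
print("proxy-optimal policy:", optimised)
print("intended policy:     ", intended)
```

Under these assumptions the proxy-optimal policy shuttles in and out of the checkpoint and never finishes, while the intended always-right policy reaches the goal with a much lower proxy return. Replacing `value_iteration` with a tabular Q-learning loop would turn this into the trained-agent version the exercise asks for.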

::milestones · observable

▲ You can explain the difference between outer and inner alignment.
▲ You can articulate mesa-optimization in your own words.
▲ You can evaluate an empirical safety claim and identify its limitations.
▲ You have read Risks from Learned Optimization end to end and can summarize its main argument.
▲ You can identify a specific open problem in technical AI safety that you could work on.