What is mechanistic interpretability?

Plain-English answer with citations. AtomEons Research, last reviewed June 2026.

The short answer

Mechanistic interpretability is a subfield of AI safety research that tries to reverse-engineer the internal computations of neural networks. It identifies the specific weights, neurons, attention heads, and circuits that implement a model's behavior, so that what a model does can be explained at the level of algorithms, not just inputs and outputs. The term was popularized by Chris Olah and collaborators at OpenAI and Anthropic through the 'Circuits' research program, and the dominant modern toolkit centers on sparse autoencoders (SAEs), activation patching, and circuit tracing.

The longer answer

Mechanistic interpretability ("mech interp" in the field) treats a trained neural network the way a reverse engineer treats a compiled binary: as an artifact whose internal logic was not designed but can in principle be recovered. Instead of asking "what does this model output?" it asks "what algorithm is the model running, expressed in terms of its own parameters?"

The research lineage runs through Chris Olah's Distill "Zoom In" essay (2020), the OpenAI/Anthropic "Circuits" thread, and Anthropic's "A Mathematical Framework for Transformer Circuits" (Elhage et al., 2021), which formalized how attention heads compose to implement small algorithms inside transformer models. "Progress Measures for Grokking via Mechanistic Interpretability" (Nanda et al., 2023, arXiv:2301.05217) demonstrated that a small transformer trained on modular addition implements an algorithm built from discrete Fourier transforms and trigonometric identities: it maps the inputs to sines and cosines at a few frequencies, combines them with angle-addition identities, and reads off the answer. The algorithm was recovered from the weights themselves, not inferred from behavior.
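As a rough illustration of the Fourier-style modular addition algorithm described above, the NumPy sketch below computes (a + b) mod p using only sines, cosines, and angle-addition identities. It is a hand-written reconstruction of the idea, not the trained network's weights. The modulus P, the frequency set FREQS, and the function name fourier_mod_add are illustrative assumptions.

```python
import numpy as np

P = 113                        # illustrative prime modulus (assumption)
FREQS = np.array([1, 5, 17])   # hypothetical frequency choice, not taken from a trained model


def fourier_mod_add(a: int, b: int, p: int = P, freqs: np.ndarray = FREQS) -> int:
    """Compute (a + b) mod p with trig identities instead of integer arithmetic."""
    w = 2 * np.pi * freqs / p

    # Step 1: represent each input as sines and cosines at a few frequencies.
    cos_a, sin_a = np.cos(w * a), np.sin(w * a)
    cos_b, sin_b = np.cos(w * b), np.sin(w * b)

    # Step 2: angle-addition identities give cos(w(a+b)) and sin(w(a+b)).
    cos_ab = cos_a * cos_b - sin_a * sin_b
    sin_ab = sin_a * cos_b + cos_a * sin_b

    # Step 3: score every candidate c by sum_k cos(w_k (a + b - c)),
    # expanded with the cosine difference identity.
    c = np.arange(p)
    wc = np.outer(w, c)  # shape (n_freqs, p)
    logits = (cos_ab[:, None] * np.cos(wc) + sin_ab[:, None] * np.sin(wc)).sum(axis=0)

    # Each cosine term equals 1 only when c == (a + b) mod p, so the sum peaks there.
    return int(np.argmax(logits))


print(fourier_mod_add(100, 50))  # 150 mod 113 = 37
```

Because every cosine term reaches its maximum only at the correct residue, the summed score has a single peak there, which is the constructive-interference picture usually given for this algorithm.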

Three ideas dominate current practice. First, superposition: Anthropic's "Toy Models of Superposition" (Elhage et al., 2022) showed that networks can pack more features than they have neurons by representing features as non-orthogonal directions in activation space. Second, sparse autoencoders (SAEs): "Towards Monosemanticity" (Bricken et al., 2023) used SAEs to extract interpretable features from a 1-layer transformer, and the follow-up "Scaling Monosemanticity" (Templeton et al., 2024) trained SAEs on the residual stream of Claude 3 Sonnet, extracting up to roughly 34 million features, including the now-famous "Golden Gate Bridge" feature, which made the model fixate on the bridge when clamped to a high value. Third, activation patching and path patching, which causally localize where in the network a behavior is computed by swapping activations between forward passes; the closely related causal tracing method was introduced by Meng et al. ("Locating and Editing Factual Associations in GPT," arXiv:2202.05262).
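To make the first two ideas concrete, here is a minimal NumPy sketch, under stated assumptions, of superposed activations and a sparse autoencoder's forward pass and loss. The toy sizes, the activation probability, the L1 coefficient, and all function names are hypothetical. The encoder form (subtract the decoder bias, apply a ReLU, reconstruct linearly) is one common formulation rather than the exact setup of any cited paper, and training is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (hypothetical): 32 features squeezed into 8 dimensions,
# decomposed by an autoencoder with 64 latent units.
D_MODEL, N_FEATURES, N_LATENTS = 8, 32, 64

# Superposition: more unit-norm feature directions than dimensions,
# so the directions cannot all be orthogonal.
feature_dirs = rng.normal(size=(N_FEATURES, D_MODEL))
feature_dirs /= np.linalg.norm(feature_dirs, axis=1, keepdims=True)
overlaps = feature_dirs @ feature_dirs.T - np.eye(N_FEATURES)
max_overlap = np.abs(overlaps).max()  # nonzero: features interfere with each other


def sample_activations(n, p_active=0.05):
    """Sparse feature magnitudes projected into the low-dimensional space."""
    active = rng.random((n, N_FEATURES)) < p_active
    magnitudes = rng.random((n, N_FEATURES)) * active
    return magnitudes @ feature_dirs


def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode to non-negative latents, then linearly reconstruct the input."""
    latents = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)
    recon = latents @ W_dec + b_dec
    return latents, recon


def sae_loss(x, latents, recon, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that pushes latents toward sparsity."""
    mse = np.mean(np.sum((x - recon) ** 2, axis=1))
    l1 = np.mean(np.sum(np.abs(latents), axis=1))
    return mse + l1_coeff * l1


# Untrained parameters: gradient descent on sae_loss is omitted in this sketch.
W_enc = rng.normal(scale=0.1, size=(D_MODEL, N_LATENTS))
W_dec = rng.normal(scale=0.1, size=(N_LATENTS, D_MODEL))
b_enc = np.zeros(N_LATENTS)
b_dec = np.zeros(D_MODEL)

x = sample_activations(256)
latents, recon = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, latents, recon)
```

The design point is the overcomplete latent layer: with more latents than dimensions and a sparsity penalty, a trained autoencoder of this shape can in principle assign separate latents to features that share directions in the original space.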

The field's most-cited circuit-level result is the Indirect Object Identification (IOI) circuit in GPT-2 Small (Wang et al., "Interpretability in the Wild," arXiv:2211.00593), which identified 26 attention heads across the network that together implement the algorithm for completing sentences like "When John and Mary went to the store, John gave a drink to ___" with "Mary." More recent work, including Anthropic's "Circuit Tracing" (Lindsey et al., 2025) and "On the Biology of a Large Language Model," extended this kind of analysis from small models to production-scale Claude, tracing multi-step reasoning, planning, and refusal behaviors.

Mech interp matters because behavioral safety evaluations alone cannot rule out deceptive or sandbagged behavior; that requires evidence about what the model is actually computing. The UK AI Safety Institute (AISI), the US AI Safety Institute (housed at NIST), and the EU AI Office have all cited interpretability as a research priority. NIST AI 600-1 (Generative AI Profile, July 2024) lists interpretability among its recommended mitigations for generative-AI risk.

Open problems remain large. SAEs find features but do not yet give complete circuits at scale. Superposition makes feature decomposition non-unique. Establishing faithfulness, that is, whether the recovered "circuit" actually causes the behavior rather than merely correlating with it, requires careful causal interventions. And the compute cost of training SAEs on frontier models is significant; Anthropic reported training SAEs with up to 34M features on Claude 3 Sonnet residual streams.
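The difference between a component that causes a behavior and one that merely correlates with it can be shown on a toy network. The sketch below is a minimal illustration of activation patching as a causal intervention, not any library's API. The two-unit network, its weights, and the names forward and patching_effect are all hypothetical. Both hidden units track the input equally, but only unit 0 is wired to the output.

```python
import numpy as np

# Hypothetical toy network: one input, two hidden units, one output.
W1 = np.array([[1.0], [1.0]])  # both hidden units read the same input (perfectly correlated)
w2 = np.array([1.0, 0.0])      # only hidden unit 0 is written to the output


def forward(x, patch=None):
    """Run the network; optionally overwrite one hidden unit with a given value."""
    h = np.maximum(0.0, W1 @ x)
    if patch is not None:
        unit, value = patch
        h = h.copy()
        h[unit] = value
    return float(w2 @ h), h


clean_x = np.array([2.0])    # input that produces the behavior
corrupt_x = np.array([0.0])  # input that does not

clean_out, clean_h = forward(clean_x)
corrupt_out, _ = forward(corrupt_x)


def patching_effect(unit):
    """Fraction of the clean output restored by patching one clean activation
    into the corrupted run."""
    patched_out, _ = forward(corrupt_x, patch=(unit, clean_h[unit]))
    return (patched_out - corrupt_out) / (clean_out - corrupt_out)


effects = [patching_effect(u) for u in range(2)]
print(effects)  # [1.0, 0.0]
```

Unit 1's activation predicts the output just as well as unit 0's does, yet patching it restores nothing. Only the intervention separates the two, which is why correlational evidence about a feature or head is not enough to call it part of a circuit.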

Key facts

The "Circuits" research program at Distill (2020-2021) established the modern framing; Chris Olah's "Zoom In: An Introduction to Circuits" (Distill, March 2020, doi:10.23915/distill.00024.001) is the canonical starting reference.
"A Mathematical Framework for Transformer Circuits" (Elhage et al., Anthropic, December 2021) formalized attention heads as composable read/write operations on a residual stream.
"Toy Models of Superposition" (Elhage et al., 2022, arXiv:2209.10652) demonstrated that networks represent more features than they have dimensions by using non-orthogonal directions.
"Towards Monosemanticity" (Bricken et al., Anthropic, October 2023) showed sparse autoencoders extract interpretable monosemantic features from a 1-layer transformer.
"Scaling Monosemanticity" (Templeton et al., Anthropic, May 2024) scaled SAEs to Claude 3 Sonnet, extracting ~34M features including the "Golden Gate Bridge" feature.
"Interpretability in the Wild" (Wang et al., 2022, arXiv:2211.00593) reverse-engineered the 26-attention-head Indirect Object Identification circuit in GPT-2 Small.
"Locating and Editing Factual Associations in GPT" (Meng et al., 2022, arXiv:2202.05262) introduced causal tracing / ROME for localizing factual recall in MLP layers.
"Progress Measures for Grokking via Mechanistic Interpretability" (Nanda et al., 2023, arXiv:2301.05217) recovered the modular-addition algorithm a transformer learns during grokking.
NIST AI 600-1 (Generative AI Profile, July 2024) lists interpretability among recommended mitigations for generative-AI risk.
The TransformerLens library (Neel Nanda, 2022) is the de facto open-source toolkit for activation patching and circuit analysis in transformer models.

Sources

Olah et al., "Zoom In: An Introduction to Circuits," Distill, 2020. distill.pub/2020/circuits/zoom-in
Elhage et al., "A Mathematical Framework for Transformer Circuits," Anthropic, 2021. transformer-circuits.pub/2021/framework
Elhage et al., "Toy Models of Superposition," arXiv:2209.10652, 2022. arxiv.org/abs/2209.10652
Bricken et al., "Towards Monosemanticity," Anthropic, 2023. transformer-circuits.pub/2023/monosemantic-features
Templeton et al., "Scaling Monosemanticity," Anthropic, 2024. transformer-circuits.pub/2024/scaling-monosemanticity
Wang et al., "Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small," arXiv:2211.00593, 2022. arxiv.org/abs/2211.00593
Meng et al., "Locating and Editing Factual Associations in GPT," arXiv:2202.05262, 2022. arxiv.org/abs/2202.05262
Nanda et al., "Progress Measures for Grokking via Mechanistic Interpretability," arXiv:2301.05217, 2023. arxiv.org/abs/2301.05217
NIST AI 600-1, "Artificial Intelligence Risk Management Framework: Generative AI Profile," July 2024. nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf
