::deep-dive

Mechanistic Interpretability

Reverse-engineering neural networks — circuits, features, and the path to understanding what models actually do

Mechanistic interpretability is the project of reverse-engineering neural networks into the algorithms they implement. Where behavioral interpretability asks 'what does the model output?', mechanistic interpretability asks 'what is happening inside?', at the level of individual neurons, attention heads, residual stream features, and computational circuits.

The modern field crystallized around Anthropic's Transformer Circuits Thread and the work of Chris Olah, Catherine Olsson, Neel Nanda, and colleagues. The foundational papers are A Mathematical Framework for Transformer Circuits (Elhage et al., 2021), In-context Learning and Induction Heads (Olsson et al., 2022), Toy Models of Superposition (Elhage et al., 2022), and the sparse autoencoder work that followed: Towards Monosemanticity (Bricken et al., 2023) and Scaling Monosemanticity (Templeton et al., 2024). Together they define the modern interpretability research program.

The core concepts a doctorate-grade learner needs are: the residual stream as the central object of computation; the QK and OV circuit decomposition of attention heads; induction heads as a candidate mechanistic explanation for much of in-context learning; superposition (the idea that models represent more features than they have neurons by storing them in nearly orthogonal directions); sparse autoencoders (SAEs) as the leading tool for extracting interpretable features from activations whose individual neurons are polysemantic; and activation patching, causal scrubbing, and other causal intervention techniques.

The applied side includes circuit-level explanations of specific behaviors (the indirect object identification circuit, the modular addition circuit), feature visualization, and the use of interpretability to detect deception, sycophancy, and other alignment-relevant failure modes.

Neel Nanda's blog and his TransformerLens library are the entry-level practical resources; the Transformer Circuits Thread itself is the canonical reading list. This is a young field, so read the papers chronologically to see how the ideas built on each other.
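
The QK/OV decomposition mentioned above can be checked numerically. What follows is a minimal NumPy sketch, not code from any of the papers: the weights are random, the dimensions are arbitrary, and it assumes the row-vector convention (activations multiply weight matrices from the left), chosen here only for brevity. It illustrates that a head's attention pattern depends on W_Q and W_K only through their product W_QK, and that what the head writes back to the residual stream depends on W_V and W_O only through W_OV.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, seq_len = 16, 4, 5  # arbitrary illustrative sizes

# Hypothetical random weights for a single attention head.
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
W_V = rng.normal(size=(d_model, d_head))
W_O = rng.normal(size=(d_head, d_model))
x = rng.normal(size=(seq_len, d_model))  # residual stream, one row per position


def causal_softmax(scores):
    """Row-wise softmax where each query only sees itself and earlier keys."""
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


# Standard form: explicit queries, keys, and values.
q, k, v = x @ W_Q, x @ W_K, x @ W_V
pattern_std = causal_softmax(q @ k.T / np.sqrt(d_head))
out_std = pattern_std @ v @ W_O

# Circuit form: two combined matrices, each of rank at most d_head.
W_QK = W_Q @ W_K.T  # decides where the head attends
W_OV = W_V @ W_O    # decides what is moved once it attends
pattern_circ = causal_softmax(x @ W_QK @ x.T / np.sqrt(d_head))
out_circ = pattern_circ @ x @ W_OV

print(np.allclose(pattern_std, pattern_circ), np.allclose(out_std, out_circ))
```

Both combined matrices are d_model by d_model but have rank at most d_head, which is why the two circuits can be studied separately: one as a bilinear form on pairs of residual stream vectors, the other as a linear map on a single vector.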

::reading path · in order

::01 · paper
~12h
A Mathematical Framework for Transformer Circuits — Elhage et al. (Anthropic, Transformer Circuits Thread, 2021)
The foundational paper. Read slowly. The QK/OV decomposition is the load-bearing abstraction for everything that follows.
::02 · paper
~8h
In-context Learning and Induction Heads — Olsson et al. (Anthropic 2022)
Identifies a specific circuit (induction heads) and presents evidence that it drives much of in-context learning. A landmark result.
::03 · paper
~10h
Toy Models of Superposition — Elhage et al. (Anthropic 2022)
Explains how networks represent more features than neurons. Foundational for understanding SAEs.
::04 · paper
~8h
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning — Bricken et al. (Anthropic 2023)
The precursor to Scaling Monosemanticity. Read it first for the SAE methodology.
::05 · paper
~10h
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet — Templeton et al. (Anthropic 2024)
Sparse autoencoders applied to a frontier model. The state of the art in feature extraction when it was published.
::06 · blog
~6h
Neel Nanda — A Comprehensive Mechanistic Interpretability Explainer & Glossary
Neel's glossary. Bookmark and return to it whenever a term is unfamiliar.
::07 · blog
~4h
Neel Nanda — 200 Concrete Open Problems in Mechanistic Interpretability
Research-problem buffet. Pick three and try them; this is how to enter the field.
::08 · code
~20h
TransformerLens — Neel Nanda (github.com/TransformerLensOrg/TransformerLens)
The de facto research library for transformer interpretability. Work through the tutorials.
::09 · paper
~8h
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small — Wang et al. (2022)
An end-to-end circuit analysis of a real behavior in a real (small) model. The methodological template.
::10 · paper
~6h
Progress measures for grokking via mechanistic interpretability — Nanda, Chan, Lieberum, Smith, Steinhardt (2023)
Modular addition circuit. A complete mechanistic explanation of a phenomenon (grokking) from end to end.
::11 · course
~60h
ARENA (Alignment Research Engineer Accelerator) curriculum — github.com/callummcdougall/ARENA_3.0
The most thorough modern mechanistic interpretability curriculum. Includes exercises that reimplement many of the major papers.
::12 · blog
~30h
Transformer Circuits Thread — Anthropic (transformer-circuits.pub)
Read every post chronologically. This is the canonical living literature of mechanistic interpretability.
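
The claim behind item 03, that a network can represent more features than it has neurons, rests on a geometric fact that is easy to check before reading the paper. The sketch below is an illustration under simplifying assumptions and is not the toy model from the paper: feature directions are random unit vectors rather than learned ones, the sizes are arbitrary, and the readout is a plain dot product with no ReLU or bias. It packs eight times more feature directions than dimensions into one space, measures how far from orthogonal they are, and checks whether a sparse set of active features can still be read back out.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_features, k_active = 256, 2048, 3  # 8x more features than dimensions

# Hypothetical feature directions: random unit vectors in d dimensions.
W = rng.normal(size=(n_features, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# Interference between distinct features is their cosine similarity.
cos = W @ W.T
np.fill_diagonal(cos, 0.0)
max_interference = np.abs(cos).max()

# A sparse input: only a few features are active at once.
active = rng.choice(n_features, size=k_active, replace=False)
f = np.zeros(n_features)
f[active] = 1.0

h = f @ W          # compress 2048 feature values into 256 dimensions
readout = h @ W.T  # project back onto every feature direction

# The largest readouts should belong to the features that were active.
recovered = np.sort(np.argsort(readout)[-k_active:])
print("max |cos| between distinct features:", round(float(max_interference), 3))
print("active:", np.sort(active), "recovered:", recovered)
```

Recovery works here only because the input is sparse. Raising k_active makes the interference terms add up until the readout is no longer reliable, which is the trade-off between sparsity and interference that the paper studies with learned directions.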

::exercises · build · derive · reproduce

01 Implement attention head visualization for GPT-2 small using TransformerLens. Identify candidate induction heads.
02 Reproduce the indirect object identification circuit analysis on GPT-2 small. Verify the activation patching results.
03 Train a small sparse autoencoder on the residual stream of a pretrained model. Inspect the features.
04 Train a small transformer on modular addition and reproduce Nanda et al.'s grokking circuit analysis.
05 Pick one of Neel Nanda's 200 open problems and produce a short writeup of your attempt.
06 Read Toy Models of Superposition and implement the toy ReLU model. Visualize the superposition phenomenon.
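
As a possible starting point for exercise 03, here is a minimal sketch of a sparse autoencoder's forward pass and loss in NumPy. It is written in the style of the dictionary-learning setup (ReLU encoder, linear decoder with unit-norm feature directions, reconstruction loss plus an L1 penalty on feature activations), but every concrete choice is an assumption made for illustration: the class name, initialization scales, sizes, and L1 coefficient are hypothetical, and the activations are random stand-ins for a real residual stream. It contains no training loop; in practice the same computation would be written in an autodiff framework and optimized.

```python
import numpy as np


class SparseAutoencoder:
    """Illustrative SAE: x -> ReLU features -> reconstruction of x."""

    def __init__(self, d_model, d_hidden, l1_coeff=1e-3, seed=0):
        rng = np.random.default_rng(seed)
        self.W_enc = rng.normal(scale=0.1, size=(d_model, d_hidden))
        self.b_enc = np.zeros(d_hidden)
        # Each decoder row is one feature direction, kept at unit norm.
        W_dec = rng.normal(size=(d_hidden, d_model))
        self.W_dec = W_dec / np.linalg.norm(W_dec, axis=1, keepdims=True)
        self.b_dec = np.zeros(d_model)
        self.l1_coeff = l1_coeff

    def encode(self, x):
        # Subtract the decoder bias before encoding, then apply ReLU.
        return np.maximum(0.0, (x - self.b_dec) @ self.W_enc + self.b_enc)

    def decode(self, f):
        return f @ self.W_dec + self.b_dec

    def loss(self, x):
        f = self.encode(x)
        x_hat = self.decode(f)
        recon = np.mean(np.sum((x - x_hat) ** 2, axis=-1))  # reconstruction error
        sparsity = np.mean(np.sum(np.abs(f), axis=-1))      # L1 on activations
        return recon + self.l1_coeff * sparsity, f, x_hat


# Random stand-in for a batch of residual stream activations.
acts = np.random.default_rng(1).normal(size=(32, 64))
sae = SparseAutoencoder(d_model=64, d_hidden=512)
total, features, reconstruction = sae.loss(acts)
print(features.shape, reconstruction.shape, float(total))
```

The hidden layer is deliberately wider than the input (512 against 64 here). An overcomplete dictionary combined with a sparsity penalty is what lets individual features specialize instead of mirroring polysemantic neurons.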

::milestones · observable

▲ You can explain superposition and why it makes interpretability hard.
▲ You have used TransformerLens to inspect a pretrained model.
▲ You can identify induction heads in a transformer you did not train.
▲ You have trained a sparse autoencoder and inspected its features.
▲ You can read a new Transformer Circuits Thread post and immediately follow the methodology.
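
For the induction-head milestone, one common check is to feed a model a block of random tokens repeated twice and measure how strongly each head, at a token in the second copy, attends to the token that followed the same token in the first copy. The sketch below assumes you already have attention patterns as an array of shape [n_heads, seq, seq] with rows as query positions (with TransformerLens these would typically come from the activation cache returned by `run_with_cache`). The function name and the synthetic patterns are hypothetical and used only to show the scoring arithmetic.

```python
import numpy as np


def induction_scores(pattern, block_len):
    """Mean attention from each second-copy token to the token after its
    earlier occurrence, for a sequence that ends with a block of
    `block_len` tokens repeated twice. pattern: [n_heads, seq, seq]."""
    # Entries pattern[q, q - (block_len - 1)] lie on one off-diagonal stripe.
    stripe = np.diagonal(pattern, offset=-(block_len - 1), axis1=-2, axis2=-1)
    # Keep only queries that sit in the second copy of the block.
    return stripe[..., -block_len:].mean(axis=-1)


# Synthetic patterns: head 0 behaves like an ideal induction head,
# head 1 spreads attention uniformly over itself and all earlier positions.
T, seq = 4, 8
pattern = np.zeros((2, seq, seq))
for q in range(seq):
    if q >= T:
        pattern[0, q, q - T + 1] = 1.0  # attend to the token after the match
    else:
        pattern[0, q, q] = 1.0          # nothing to copy yet: attend to self
    pattern[1, q, : q + 1] = 1.0 / (q + 1)

scores = induction_scores(pattern, block_len=T)
print(scores)
```

Heads with a score near 1 on random repeated tokens are candidates only. Confirming that a head implements induction still requires a causal check such as ablating or patching it and observing the effect on the loss for repeated tokens.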