
Mechanistic interpretability atlas
A working map of how researchers are trying to read what's actually happening inside large neural networks — and how far they still have to go.
Why this field exists at all
The Distill circuits thread (vision, 2020)
The Anthropic transformer circuits thread (language, 2021 onwards)
The conceptual primitives
Six ideas you have to hold in your head before any technique makes sense. Each entry is the idea in one breath, where to read more, and what's still unsettled.
Features
Distill · Zoom In (2020)
A 'feature' is a direction in a layer's activation space that corresponds to something a human can name — 'curve at 30 degrees,' 'Arabic script,' 'sycophantic praise.' Whether features really are the right unit, or whether models compute in a basis no human will ever name cleanly, is open.
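To make 'a direction in activation space' concrete, here is a minimal sketch that reads a feature off by projecting an activation vector onto a unit direction. The four-dimensional layer, the `curve_direction` vector, and both activation vectors are invented for illustration; in real work the direction would be found by a method such as a probe or a sparse autoencoder, not written by hand.

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def feature_activation(activation, direction):
    """Project an activation vector onto a feature direction (dot product)."""
    d = unit(direction)
    return sum(a * b for a, b in zip(activation, d))

# Hypothetical 4-dimensional layer with a made-up "curve detector" direction.
curve_direction = [1.0, 1.0, 0.0, 0.0]
activation_a = [2.0, 2.0, 0.5, -0.3]   # lies mostly along the direction
activation_b = [1.0, -1.0, 3.0, 0.0]   # orthogonal to the direction

score_a = feature_activation(activation_a, curve_direction)
score_b = feature_activation(activation_b, curve_direction)
print(round(score_a, 3), round(score_b, 3))
```

A large projection means the feature is strongly present in that activation; a projection near zero means it is absent. Nothing in the arithmetic says whether the direction is one the model itself uses, which is the open question.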
Polysemanticity
Olah et al. · multiple Distill articles
Most neurons in trained networks don't cleanly correspond to a single feature. A single neuron often fires for many unrelated things. Polysemanticity is the empirical wall that all naive 'just look at the neuron' interpretability runs into.
Superposition
Elhage et al. 2022 · arXiv 2209.10652
The leading hypothesis for why polysemanticity happens: networks pack more features than they have dimensions by storing them as overlapping, non-orthogonal directions. The resulting interference is tolerable because real-world features are sparse, so only a few are active at any one time.
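The trade-off can be sketched with three features squeezed into two dimensions. This is a hand-built toy, not the trained models from the paper: the 120-degree layout, the `encode` and `decode` helpers, and the ReLU readout are assumptions chosen to make the interference visible.

```python
import math

# Three feature directions in a 2-D space, 120 degrees apart (hypothetical toy setup).
ANGLES = [0.0, 2 * math.pi / 3, 4 * math.pi / 3]
DIRECTIONS = [(math.cos(a), math.sin(a)) for a in ANGLES]

def encode(features):
    """Sum each feature's value times its direction: 3 features -> 2 dimensions."""
    x = sum(f * d[0] for f, d in zip(features, DIRECTIONS))
    y = sum(f * d[1] for f, d in zip(features, DIRECTIONS))
    return (x, y)

def decode(hidden):
    """Read each feature back with a dot product; ReLU clips negative interference."""
    return [max(0.0, hidden[0] * d[0] + hidden[1] * d[1]) for d in DIRECTIONS]

sparse = decode(encode([1.0, 0.0, 0.0]))   # one feature active
dense = decode(encode([1.0, 1.0, 0.0]))    # two features active at once
print([round(v, 3) for v in sparse])
print([round(v, 3) for v in dense])
```

With one feature active, the readout recovers it cleanly because the interference on the other two directions is negative and gets clipped. With two features active at once, each is read back at about half strength. That is the price of sharing dimensions, and it is paid only when features co-occur.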
Induction heads
Olsson et al. 2022 · arXiv 2209.11895
Attention heads that implement the algorithm 'I saw [A][B] earlier in this context, I just saw [A] again, predict [B].' They emerge abruptly during training and are implicated as a major mechanism for in-context learning.
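Written as ordinary code, the rule looks like the sketch below. This is only the input-output behavior as a discrete lookup; an actual induction head computes a soft version of it through attention, and the function name `induction_predict` is made up for illustration.

```python
def induction_predict(tokens):
    """Return the token that followed the most recent earlier occurrence of the
    final token, or None if the final token has not appeared before."""
    if not tokens:
        return None
    current = tokens[-1]
    # Scan backwards over earlier positions, looking for a previous [A].
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]   # the [B] that followed [A] last time
    return None

print(induction_predict(["A", "B", "C", "A"]))   # predicts the token after the earlier "A"
print(induction_predict(["A", "B", "C"]))        # "C" has no earlier occurrence
```

The rule never consults what the tokens mean, only where they appeared, which is why it works on arbitrary or even random token sequences.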
In-context learning
Olsson et al. 2022; Brown et al. GPT-3 2020
The phenomenon that a large model's loss on a token drops as more relevant context appears earlier in the sequence — without any weight update. Strongly associated with induction-head formation but not reduced to it.
The residual stream
Elhage et al. 2021 · mathematical framework
Every transformer layer reads from and writes to a shared bus called the residual stream, and each write is added to what is already there. Treating the residual stream as the model's working memory, and tracking which component writes what to it, is the framing behind most modern circuit work.
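A minimal sketch of the read-then-add pattern, with a log of who wrote what. The two components (`toy_attention`, `toy_mlp`) and the three-dimensional stream are hypothetical stand-ins, not real transformer layers, and normalization is left out.

```python
def run_residual_stream(embedding, components):
    """Apply components in order; each reads the stream and adds its output to it."""
    stream = list(embedding)
    writes = []   # (component name, vector it wrote), kept for attribution
    for name, fn in components:
        delta = fn(stream)                                # read the current stream
        stream = [s + d for s, d in zip(stream, delta)]   # additive write
        writes.append((name, delta))
    return stream, writes

def toy_attention(stream):
    """Hypothetical component: moves half of dimension 1 into dimension 0."""
    return [0.5 * stream[1], 0.0, 0.0]

def toy_mlp(stream):
    """Hypothetical component: writes ReLU(dimension 0) into dimension 2."""
    return [0.0, 0.0, max(0.0, stream[0])]

embedding = [1.0, 2.0, 0.0]
final, writes = run_residual_stream(
    embedding, [("attn", toy_attention), ("mlp", toy_mlp)]
)
print(final)
for name, delta in writes:
    print(name, delta)
```

Because every write is a plain addition, the final stream equals the embedding plus the sum of the logged writes. That decomposition is what lets an output be attributed to individual components.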
Toy models of superposition · the proof that helped
Sparse autoencoders · the current workhorse
Intervention techniques · how researchers actually test claims
Mechanistic interpretability is not just looking at activations — it's intervening on them and watching what breaks. Five core techniques you'll see referenced across the literature, the canonical paper each one is associated with, and what each lets you conclude.
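As a rough sketch of what 'intervening and watching what breaks' means, the toy below runs a clean input and a corrupted input, then copies one hidden unit at a time from the clean run into the corrupted run and measures how much of the clean output returns. It is in the spirit of the causal tracing and path patching mentioned elsewhere on this page, not an implementation of either; the two-unit network, its weights, and the `recovery` metric are all invented for illustration.

```python
def hidden_layer(x):
    """Hypothetical 2-unit ReLU hidden layer of a toy network."""
    h0 = max(0.0, x[0] + x[1])
    h1 = max(0.0, x[0] - x[1])
    return [h0, h1]

def output_layer(h):
    """Hypothetical readout that leans heavily on hidden unit 0."""
    return 2.0 * h[0] + 0.1 * h[1]

def run(x, patch=None):
    """Run the toy model; `patch` maps a hidden index to a value to overwrite."""
    h = hidden_layer(x)
    for idx, value in (patch or {}).items():
        h[idx] = value
    return output_layer(h)

clean_x = [1.0, 1.0]
corrupt_x = [1.0, -1.0]

clean_hidden = hidden_layer(clean_x)
clean_out = run(clean_x)
corrupt_out = run(corrupt_x)

# Restore one hidden unit at a time from the clean run into the corrupted run.
recovery = {}
for idx in range(len(clean_hidden)):
    patched_out = run(corrupt_x, patch={idx: clean_hidden[idx]})
    recovery[idx] = (patched_out - corrupt_out) / (clean_out - corrupt_out)

for idx, value in recovery.items():
    print(f"unit {idx}: recovered fraction {value:.3f}")
```

A recovered fraction near 1 says that restoring that unit alone is enough to bring back the clean behavior; a fraction near 0 says the unit carries little of the difference. The conclusion is causal only relative to the chosen pair of clean and corrupted inputs.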
Induction heads and the in-context learning story
The IOI circuit · what a fully-described circuit looks like
A condensed timeline
2016
Linear probes
Alain & Bengio formalize the technique of training small linear classifiers on frozen intermediate activations to measure what information is linearly readable at each depth.
2017
Feature visualization on Distill
Olah, Mordvintsev, and Schubert publish 'Feature Visualization' on Distill, establishing the methodology of synthesizing inputs that maximally activate chosen units.
2020 · March
'Zoom In: An Introduction to Circuits'
The Distill Circuits thread launches. Three claims — features, circuits, universality — are proposed as the working hypotheses of mechanistic interpretability.
2021 · December
Mathematical framework for transformer circuits
Elhage et al. port the Circuits program to transformers and give the algebraic decomposition (QK/OV, residual stream) that everything downstream builds on.
2022 · February
ROME · causal tracing
Meng et al. introduce causal tracing to localize factual associations in GPT, and show that targeted rank-one weight edits can change a model's stored fact.
2022 · September
Induction heads · Toy models of superposition
Olsson et al. publish the induction-heads / in-context-learning paper; Elhage et al. publish 'Toy Models of Superposition.' Both are foundational.
2022 · November
Interpretability in the Wild · IOI circuit
Wang et al. apply path patching to GPT-2 small and map a 26-attention-head circuit for indirect object identification, one of the most complete circuit-level descriptions of a behavior in a real language model.
2023 · October
Towards Monosemanticity
Bricken, Templeton et al. show sparse autoencoders can recover interpretable features from a one-layer transformer, validating the SAE approach to superposition.
2024 · May
Scaling Monosemanticity
Templeton et al. scale SAEs to the middle-layer residual stream of Claude 3 Sonnet and train feature dictionaries of up to roughly 34 million features, including a number flagged as safety-relevant.
2025 · March
Attribution graphs · 'On the Biology of a Large Language Model'
Anthropic publishes circuit tracing via attribution graphs on Claude 3.5 Haiku, with ten case studies that include multi-step reasoning, planning, and hallucination inhibition. The circuit-tracing tooling is subsequently open-sourced.
What you can honestly claim · and what you can't · as of June 2026
Honest accounting, written to be linkable when someone asks 'do we understand how LLMs work yet?'
You can claim:
- that small transformers (one or two layers, GPT-2 small) admit full circuit-level descriptions of specific behaviors;
- that superposition is a real, demonstrated phenomenon;
- that sparse autoencoders can extract large numbers of human-nameable features from frontier-scale residual streams;
- that induction-head formation co-occurs sharply with the emergence of in-context learning during training;
- that attribution graphs can produce partial, intervention-validated traces of multi-step behaviors in models the size of Claude 3.5 Haiku.
You cannot honestly claim:
- that we have a full readable wiring diagram of any frontier model in production;
- that SAE feature dictionaries are the model's own internal vocabulary rather than a useful approximation;
- that any single interpretability technique generalizes cleanly across model families;
- that interpretability tooling has reached the point of reliably catching novel failure modes before deployment.
The research is real. The frontier is still further away than press coverage tends to suggest. Both statements need to be said in the same breath.
If you're starting from scratch · a short reading order
- Distill · 'Zoom In: An Introduction to Circuits' (Olah et al. 2020) — the why and the worldview, with interactive visualizations.
- Transformer Circuits Thread · 'A Mathematical Framework for Transformer Circuits' (Elhage et al. 2021) — the algebra you need to read everything that follows.
- arXiv 2209.10652 · 'Toy Models of Superposition' (Elhage et al. 2022) — the conceptual key to why interpretability is hard and how to attack it.
- arXiv 2209.11895 · 'In-context Learning and Induction Heads' (Olsson et al. 2022) — the most replicated specific mechanism in language models.
- arXiv 2211.00593 · 'Interpretability in the Wild' (Wang et al. 2022) — what a fully-described circuit looks like end to end.
- Transformer Circuits Thread · 'Towards Monosemanticity' (Bricken et al. 2023) and 'Scaling Monosemanticity' (Templeton et al. 2024) — the sparse-autoencoder line, in order.
- Transformer Circuits Thread · 'On the Biology of a Large Language Model' (Lindsey et al. 2025) — attribution graphs in action on Claude 3.5 Haiku, with companion open-source tooling.
- arXiv 1610.01644 · 'Understanding intermediate layers using linear classifier probes' (Alain & Bengio 2016) — the technique that started the broader interpretability program before circuits.
Sources
- [01]
Olah et al. (2020) propose features, circuits, and universality as the three working hypotheses of mechanistic interpretability and launch the Distill Circuits thread.
distill.pub/2020/circuits/zoom-in/
- [02]
Olah, Mordvintsev, and Schubert (2017) introduce the modern feature-visualization methodology of synthesizing inputs that maximally activate chosen units.
distill.pub/2017/feature-visualization/
- [03]
The Anthropic Transformer Circuits Thread is the primary publication venue for the Anthropic interpretability team's work on language-model circuits.
transformer-circuits.pub/
- [04]
Elhage et al. (2021) provide the algebraic decomposition of attention-only transformers (QK/OV circuits, residual stream) used by subsequent circuit work.
transformer-circuits.pub/2021/framework/index.html
- [05]
Elhage et al. (2022) 'Toy Models of Superposition' demonstrate, on small ReLU networks with controllable feature sparsity, that models pack more features than they have dimensions by using non-orthogonal directions.
arxiv.org/abs/2209.10652
- [06]
Olsson et al. (2022) 'In-context Learning and Induction Heads' show induction heads form sharply during training in coincidence with the emergence of in-context learning.
arxiv.org/abs/2209.11895
- [07]
Bricken, Templeton et al. (October 2023) demonstrate that sparse autoencoders can extract interpretable features from the activations of a one-layer transformer.
transformer-circuits.pub/2023/monosemantic-features
- [08]
Templeton et al. (May 2024) scale sparse autoencoders to the middle-layer residual stream of Claude 3 Sonnet and extract on the order of tens of millions of features, including a number flagged as safety-relevant.
transformer-circuits.pub/2024/scaling-monosemanticity/
- [09]
Meng et al. (2022) ROME paper introduces causal tracing to localize factual associations in GPT and demonstrates targeted rank-one weight edits.
arxiv.org/abs/2202.05262
- [10]
Wang et al. (2022) identify a 26-attention-head circuit in GPT-2 small that performs indirect object identification, using path patching to map the circuit's structure.
arxiv.org/abs/2211.00593
- [11]
Alain & Bengio (2016) formalize linear classifier probes as a method for measuring how linearly readable various properties are from intermediate layer activations.
arxiv.org/abs/1610.01644
- [12]
Lindsey et al. (2025) 'On the Biology of a Large Language Model' applies attribution graphs to Claude 3.5 Haiku across ten case studies, including multi-step reasoning, planning, and hallucination inhibition.
transformer-circuits.pub/2025/attribution-graphs/biology.html
- [13]
Anthropic open-sourced the circuit-tracing tooling for generating attribution graphs on popular open-weights models, developed in collaboration with Decode Research.
anthropic.com/research/open-source-circuit-tracing
- [14]
Olah et al. (2018) 'The Building Blocks of Interpretability' on Distill combines feature visualization, attribution, and dimensionality reduction into a unified interpretation interface.
distill.pub/2018/building-blocks/
- [15]
Anthropic's research page documents the publication and framing of Toy Models of Superposition as part of the Transformer Circuits program.
anthropic.com/research/toy-models-of-superposition