Mechanistic interpretability atlas

A working map of how researchers are trying to read what's actually happening inside large neural networks — and how far they still have to go.

Mechanistic interpretability is the project of explaining a neural network the way you would explain a circuit board or a metabolic pathway: not by what it tends to output, but by which internal components do which computational work and how those components fit together. The field has a real research program with real artifacts. It also has an open problem that it is honest about: almost everything that has been carefully reverse-engineered so far was reverse-engineered on toy models, single-layer transformers, or a few attention heads inside small GPT-2 variants. Frontier-scale models remain, on the whole, opaque.

This atlas walks the methods and results in the order a curious reader would actually want them. We start with the basic premise, that small subnetworks called circuits can be identified and described, and trace the line from the Distill.pub Circuits thread (vision models, 2020) through the Anthropic Transformer Circuits thread (language models, 2021-2026). We cover the conceptual primitives (features, polysemanticity, superposition, induction heads, in-context learning, the residual stream) and then the experimental techniques researchers use to probe and intervene: linear probes, activation patching, causal tracing, path patching, sparse autoencoders, and the more recent attribution graphs work on Claude 3.5 Haiku.

The voice here is lab-grade and anti-hype. Where a result is on a one-layer transformer, we say so. Where a technique scales partially, we say so. Where the field is honest that it does not yet understand a phenomenon, we say so. Mechanistic interpretability is a slow, hard, beautiful science. It is also, as of mid-2026, very far from being able to give you a full readable wiring diagram of any model you actually use day-to-day. Both of those things can be true at once, and this page tries to hold both.

Why this field exists at all

A trained neural network is a vector of numbers. Run that vector against an input and you get an output. Almost nothing about the numbers tells you, by inspection, what computation is happening between input and output. For most of the deep-learning era this was treated as acceptable: the model worked, the benchmarks moved, and the question of how was deferred. Mechanistic interpretability declines that deferral. It treats a model as a system that can in principle be understood the way a compiled program can be understood — by reading the components, naming their roles, and verifying with interventions that the role you named is actually the one being played. The bet is that if you can do this at small scale (a one-layer transformer, an image classifier's curve detector, a single attention head in GPT-2), you eventually build up enough vocabulary and tooling to do it at frontier scale. That bet has not paid off yet at frontier scale. It has paid off enough at small scale that the field has stopped being speculative and started producing replicable, intervention-validated results. The sections below are a tour of where the line currently is.

The Distill circuits thread (vision, 2020)

The modern mechanistic interpretability program effectively begins with Olah et al.'s 'Zoom In: An Introduction to Circuits' on Distill (March 2020). The thread proposed three claims that the field has been arguing about, refining, and partially confirming ever since. First: features are the fundamental unit. A trained vision model contains directions in activation space that correspond to recognizable things — curve detectors, dog-head detectors, high-low frequency boundary detectors. Second: circuits are formed when features connect by weights. The computation that turns 'curve' into 'dog face' into 'dog' can be traced as a subgraph. Third: universality — analogous features and circuits form across different models trained on similar data. The Distill thread published these claims with interactive visualizations and a culture of treating individual neurons as worth investigating, not dismissing as noise. The original Distill articles (feature visualization, building blocks of interpretability, an overview of early vision, curve detectors, naturally occurring equivariance) remain the canonical introduction to the methodology and are still freely readable at distill.pub.

The Anthropic transformer circuits thread (language, 2021 onwards)

When the same research group ported the Circuits program to transformers, the first artifact was 'A Mathematical Framework for Transformer Circuits' (Elhage et al., December 2021). The framework decomposed attention-only transformers into algebraic objects you can reason about — QK and OV circuits, residual stream, attention-head composition — and demonstrated that some small transformers can be hand-decoded in this language. The Transformer Circuits Thread at transformer-circuits.pub has continued publishing since, with the cadence of a working research journal rather than a final report. As of mid-2026 the thread includes the foundational mathematical framework, the induction heads work, Toy Models of Superposition, the dictionary-learning and sparse-autoencoder line ('Towards Monosemanticity' and 'Scaling Monosemanticity'), and the 2025 attribution-graphs / 'On the Biology of a Large Language Model' work on Claude 3.5 Haiku. The thread is the central primary source for this entire atlas.
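To make the QK and OV objects concrete, here is a minimal sketch for one attention head with random weights. The shapes, variable names, and row-vector convention are assumptions chosen for illustration, and the usual softmax and scaling steps are left out. It only checks that the pre-softmax attention scores and the per-token value contribution can each be computed from a single combined matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, n_tokens = 8, 2, 5

# Hypothetical weights for a single attention head (random, not trained).
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
W_V = rng.normal(size=(d_model, d_head))
W_O = rng.normal(size=(d_head, d_model))

x = rng.normal(size=(n_tokens, d_model))  # residual stream, one row per token

# Standard view: separate queries, keys, and values.
scores = (x @ W_Q) @ (x @ W_K).T   # pre-softmax attention scores
value_out = (x @ W_V) @ W_O        # what each token would write if attended to

# Circuit view: two low-rank d_model x d_model matrices.
W_QK = W_Q @ W_K.T                 # governs where the head attends
W_OV = W_V @ W_O                   # governs what is written once it attends

scores_circuit = x @ W_QK @ x.T
value_out_circuit = x @ W_OV

print(np.allclose(scores, scores_circuit), np.allclose(value_out, value_out_circuit))
```

Both combined matrices have rank at most `d_head`, which is part of why they are tractable to inspect even when `d_model` is large.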

The conceptual primitives

Six ideas you have to hold in your head before any technique makes sense. Each entry is the idea in one breath, where to read more, and what's still unsettled.

Features

Distill · Zoom In (2020)

A 'feature' is a direction in a layer's activation space that corresponds to something a human can name — 'curve at 30 degrees,' 'Arabic script,' 'sycophantic praise.' Whether features really are the right unit, or whether models compute in a basis no human will ever name cleanly, is open.

Polysemanticity

Olah et al. · multiple Distill articles

Most neurons in trained networks don't cleanly correspond to a single feature. A single neuron often fires for many unrelated things. Polysemanticity is the empirical wall that all naive 'just look at the neuron' interpretability runs into.

Superposition

Elhage et al. 2022 · arXiv 2209.10652

The leading hypothesis for why polysemanticity happens: networks pack more features than they have dimensions by storing features as overlapping non-orthogonal directions, accepting interference because real-world features are sparse.

Induction heads

Olsson et al. 2022 · arXiv 2209.11895

Attention heads that implement the algorithm 'I saw [A][B] earlier in this context, I just saw [A] again, predict [B].' They appear sharply during training and are implicated as a major mechanism for in-context learning.

In-context learning

Olsson et al. 2022; Brown et al. GPT-3 2020

The phenomenon that a large model's loss on a token drops as more relevant context appears earlier in the sequence — without any weight update. Strongly associated with induction-head formation but not reduced to it.

The residual stream

Elhage et al. 2021 · mathematical framework

Every transformer layer reads from and writes to a shared bus called the residual stream. Treating the residual stream as the model's working memory — and tracking who writes what to it — is the unit of analysis behind most modern circuit work.
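The "shared bus" picture of the residual stream can be sketched in a few lines. This is a toy under stated assumptions: the `layer_write` function is a hypothetical stand-in for an attention or MLP block, and layer normalization is omitted. What it shows is that layers add to the stream instead of overwriting it, so the final state splits into one term per writer.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_layers = 6, 3

# Hypothetical stand-ins for attention/MLP blocks: each reads the stream
# and returns a vector to add back into it.
layer_weights = [rng.normal(scale=0.3, size=(d_model, d_model)) for _ in range(n_layers)]

def layer_write(stream, W):
    return np.tanh(stream @ W)

embedding = rng.normal(size=d_model)
stream = embedding.copy()
writes = []
for W in layer_weights:
    w = layer_write(stream, W)   # the layer reads the current stream
    writes.append(w)
    stream = stream + w          # and adds its output; nothing is overwritten

# The final stream is the embedding plus every individual write.
reconstructed = embedding + np.sum(writes, axis=0)

# Any linear readout therefore splits into per-writer contributions.
readout = rng.normal(size=d_model)   # hypothetical output direction
contributions = [embedding @ readout] + [w @ readout for w in writes]
print(np.isclose(stream @ readout, sum(contributions)))
```

The per-writer split at the end is the reason "tracking who writes what" is workable: a linear readout of the stream can be attributed term by term.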

Toy models of superposition · the proof that helped

'Toy Models of Superposition' (Elhage et al., September 2022, arXiv 2209.10652) is the paper that turned superposition from a hypothesis into a thing you can demonstrate by training a network with only a few hidden dimensions and staring at the weights. The setup is deliberately small: a tiny ReLU autoencoder trained to reconstruct synthetic inputs whose features have controllable sparsity. When features are dense, the model gives each hidden dimension to a single feature, the way intuition wants, and simply does not represent the features that do not fit. When features are sparse, as features in natural data are thought to be, the model packs additional features into the same dimensions as overlapping directions, accepting some interference because most of those features are off most of the time anyway. The paper demonstrates phase transitions as sparsity increases, a striking geometric connection between superposition patterns and uniform polytopes, and a suggestive link to adversarial examples. This is a toy result on a tiny model. Its importance is conceptual: it gave the field a concrete picture of why interpretability of larger models is hard (features are not aligned with neurons) and a concrete mechanism (sparse coding) that suggested a recovery technique. That technique is sparse autoencoders, covered below.
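A minimal sketch of the interference trade-off, assuming hand-set weights where the paper uses trained ones. Five features are stored as unit vectors spread evenly around a circle in two dimensions (a regular pentagon), and the output is ReLU(WᵀWx + b). The bias value and the test inputs are choices made for this illustration.

```python
import numpy as np

n_features, n_dims = 5, 2

# Hand-set (not trained) weights: five unit vectors evenly spaced on a circle.
angles = 2 * np.pi * np.arange(n_features) / n_features
W = np.stack([np.cos(angles), np.sin(angles)])   # shape (n_dims, n_features)
b = -np.cos(2 * np.pi / n_features)              # cancels nearest-neighbour interference

def reconstruct(x):
    hidden = W @ x                               # squeeze 5 features into 2 dimensions
    return np.maximum(W.T @ hidden + b, 0.0)     # ReLU(W^T W x + b)

# Sparse case: one feature on. The ReLU and bias filter out the interference.
sparse_out = reconstruct(np.eye(n_features)[0])

# Less sparse case: two non-adjacent features on at once.
dense_out = reconstruct(np.eye(n_features)[0] + np.eye(n_features)[2])

print(np.round(sparse_out, 3))
print(np.round(dense_out, 3))
```

With a single active feature, only that feature's output is meaningfully above zero. With features 0 and 2 active together, their directions sum to something that points at feature 1, so the readout reports a feature that was never on and misses the two that were. That failure is tolerable only if such co-activations are rare, which is the sparsity condition the paper studies.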

Sparse autoencoders · the current workhorse

If features are stored in superposition, the obvious move is to learn an overcomplete dictionary that decomposes activations into a much larger set of mostly-zero feature activations. This is what sparse autoencoders (SAEs) do. Train an SAE on the activations at a chosen layer, force it to reconstruct those activations using few features at a time, and the dictionary it learns is, empirically, often interpretable. Two Anthropic papers anchor this line. 'Towards Monosemanticity' (Bricken, Templeton et al., October 2023) trained SAEs on a one-layer transformer and recovered thousands of features that human raters could often name and that responded to causal interventions in expected ways. 'Scaling Monosemanticity' (Templeton et al., May 2024) scaled the same recipe to the middle residual stream of Claude 3 Sonnet and extracted on the order of tens of millions of features — many of them multilingual, multimodal, and abstract, including features Anthropic flagged as safety-relevant (deception, sycophancy, bias, dangerous content). This is the closest thing the field has to a workable factoring of a frontier-scale model. It is also not a solved problem. SAE features are not guaranteed to be the right factoring, dead features and feature-splitting remain active research issues, and 'we extracted X million features from this model' does not mean 'we understand this model.' It means we have a vocabulary of X million directions, each of which might or might not be the unit the model itself is using.
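The recipe described above can be sketched as a forward pass and a loss. This is one common formulation (ReLU encoder, linear decoder, L1 sparsity penalty), not a reproduction of either paper's training setup. The sizes, the coefficient, and the random stand-in activations are all assumptions, and no training loop is shown.

```python
import numpy as np

rng = np.random.default_rng(0)
d_act, d_dict, batch = 16, 64, 32   # dictionary is 4x wider than the activations

# Hypothetical, randomly initialised parameters.
W_enc = rng.normal(scale=0.1, size=(d_act, d_dict))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(scale=0.1, size=(d_dict, d_act))
b_dec = np.zeros(d_act)

def sae_forward(acts):
    # Encode into a wide, non-negative feature vector, then decode back.
    features = np.maximum((acts - b_dec) @ W_enc + b_enc, 0.0)
    recon = features @ W_dec + b_dec
    return features, recon

def sae_loss(acts, l1_coeff=1e-3):
    features, recon = sae_forward(acts)
    mse = np.mean(np.sum((acts - recon) ** 2, axis=1))        # reconstruct the activations
    sparsity = np.mean(np.sum(np.abs(features), axis=1))      # using few features at a time
    return mse + l1_coeff * sparsity

acts = rng.normal(size=(batch, d_act))   # stand-in for activations collected from a model layer
features, recon = sae_forward(acts)
loss = sae_loss(acts)
print(features.shape, recon.shape)
```

Each row of `W_dec` is a candidate feature direction in activation space. Whether a trained dictionary's directions are the units the model itself uses is exactly the open question raised above.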

Intervention techniques · how researchers actually test claims

Mechanistic interpretability is not just looking at activations — it's intervening on them and watching what breaks. Five core techniques you'll see referenced across the literature, the canonical paper each one is associated with, and what each lets you conclude.

Technique	What it does	Canonical paper	What you can claim from it
Linear probes	Train a linear classifier on a frozen layer's activations to predict some property of the input. Tells you whether that property is linearly readable from that layer.	Alain & Bengio 2016 · arXiv 1610.01644	The information is present in a linearly accessible form at that layer. Not that the model uses it.
Activation patching	Replace an activation at a chosen site with one from a different run (clean or corrupted). Measure how much the output changes.	Meng et al. 2022 · ROME · arXiv 2202.05262	That site causally contributes to the output for this task on this input distribution.
Causal tracing	A specific activation-patching protocol that corrupts the input and selectively restores activations to localize where factual recall happens.	Meng et al. 2022 · arXiv 2202.05262	Which layer-token positions a specific factual association is stored at — to the granularity the technique resolves.
Path patching	Patch not at a node but along a specific path between two nodes in the computational graph. Isolates the contribution of one edge of the circuit.	Wang et al. 2022 · IOI · arXiv 2211.00593	That specific information-flow path is doing this specific job. Used to map full circuits, not just single sites.
Attribution graphs	Build a graph of which intermediate features causally influence which output features, using SAE features as the nodes.	Lindsey et al. 2025 · attribution graphs · Claude 3.5 Haiku	A partial, hypothesis-generating trace of the chain of computations. Verified by follow-up perturbation, not by the graph alone.
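
The activation-patching row in the table above can be sketched on a toy network. Everything here is an assumption made for illustration: a random two-layer network stands in for a real model, and "sites" are individual hidden units. The loop copies one clean activation at a time into a corrupted run and records how far the output moves.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden = 4, 6
W1 = rng.normal(size=(d_in, d_hidden))
W2 = rng.normal(size=d_hidden)

def run(x, patch=None):
    """Forward pass. `patch` is an optional (unit_index, value) pair that
    overwrites one hidden activation before the output is computed."""
    hidden = np.maximum(x @ W1, 0.0)
    if patch is not None:
        idx, value = patch
        hidden = hidden.copy()
        hidden[idx] = value
    return hidden, float(hidden @ W2)

clean_x = rng.normal(size=d_in)
corrupt_x = rng.normal(size=d_in)

clean_hidden, clean_out = run(clean_x)
_, corrupt_out = run(corrupt_x)

# Patch each hidden unit from the clean run into the corrupted run, one at a time.
effects = np.array([
    run(corrupt_x, patch=(i, clean_hidden[i]))[1] - corrupt_out
    for i in range(d_hidden)
])

print(np.round(effects, 3), round(clean_out - corrupt_out, 3))
```

In this toy the per-unit effects add up exactly to the clean-minus-corrupted gap, but only because the readout after the patched layer is linear. In a real network, later nonlinearities break that additivity, which is one reason a patching result is a claim about a site on a given input distribution and not a full decomposition.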

Induction heads and the in-context learning story

'In-context Learning and Induction Heads' (Olsson et al., September 2022, arXiv 2209.11895) is among the most-cited language-model interpretability results, and the reason is that it makes a specific, falsifiable, mechanistic claim and then provides evidence for it. The claim: induction heads, attention heads that implement pattern-completion of the form '[A][B] ... [A] → [B]', are the mechanism behind the bulk of in-context learning in transformer language models. The evidence is the co-incidence of phase transitions: as transformers train, there is a sharp, narrow window in which induction-head circuitry forms, and that window coincides with a sharp drop in loss on later tokens of a sequence (the operational signature of in-context learning) and a small bump in overall training loss. The paper is careful about how strong the claim is. It calls the evidence preliminary and indirect, distinguishes correlation from mechanism, and is explicit about the limitations of the experimental setup. That care is part of why the result has held up. As of mid-2026 induction heads are an accepted, replicated phenomenon; whether they account for most in-context learning in frontier models, or one piece of a larger story, is still an open question.
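The '[A][B] ... [A] → [B]' rule can be written down as a discrete lookup. This is only the idealized algorithm, with a hypothetical function name. Real induction heads implement a soft, attention-based version of it across at least two composed heads, and nothing below models that.

```python
def induction_predict(tokens):
    """Return the token that followed the most recent earlier occurrence
    of the current (last) token, or None if it has not appeared before."""
    current = tokens[-1]
    # Scan backwards over earlier positions, skipping the last token itself.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None

print(induction_predict(["the", "cat", "sat", "on", "the"]))  # prints "cat"
```

The rule needs no knowledge of what the tokens mean, which is why it works on random token sequences and why it is a plausible building block for in-context learning.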

The IOI circuit · what a fully-described circuit looks like

'Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small' (Wang et al., November 2022, arXiv 2211.00593) is the cleanest worked example of a full circuit in a real language model. The task: predict the indirect object in sentences like 'When Mary and John went to the store, John gave a bottle of milk to ___' where the right answer is Mary, not the repeated John. Using path patching, the authors identified a circuit of 26 attention heads (about 1.1 percent of GPT-2 small's head-position pairs) that does most of the work. The paper groups those heads into seven classes; the core algorithm runs through three of them: Duplicate Token Heads (notice that 'John' appears twice), S-Inhibition Heads (suppress the repeated subject), and Name Mover Heads (copy the remaining name into the output via attention). This is what 'understanding a model' looks like at the high end of the current state of the art: a small task on a small model, fully traced, intervention-validated. The reason it is still cited as a benchmark is that the cost in researcher effort to produce this kind of mapping is large, and the technique does not yet scale to frontier-model behaviors of comparable specificity.
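Circuit work on this task needs a scalar to measure before and after each intervention. The IOI paper uses the logit difference between the indirect object and the repeated subject, sketched below. The five-word vocabulary and the logit values are made up for illustration and do not come from GPT-2.

```python
import numpy as np

# Hypothetical toy vocabulary and next-token logits (not real model output).
vocab = {"Mary": 0, "John": 1, "store": 2, "milk": 3, "the": 4}
logits = np.array([3.2, 0.7, -1.0, -0.5, 0.1])

def logit_diff(logits, io_token, s_token):
    """Logit of the indirect object minus logit of the repeated subject.
    Positive means the model prefers the correct name."""
    return float(logits[vocab[io_token]] - logits[vocab[s_token]])

baseline = logit_diff(logits, io_token="Mary", s_token="John")
print(round(baseline, 2))  # prints 2.5
```

An ablation or patch that removes part of the circuit should pull this number toward zero or below. Comparing it across interventions is how each head's role gets tested.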

A condensed timeline

2016
Linear probes
Alain & Bengio formalize the technique of training small linear classifiers on frozen intermediate activations to measure what information is linearly readable at each depth.
2017
Feature visualization on Distill
Olah, Mordvintsev, and Schubert publish 'Feature Visualization' on Distill, establishing the methodology of synthesizing inputs that maximally activate chosen units.
2020 · March
'Zoom In: An Introduction to Circuits'
The Distill Circuits thread launches. Three claims — features, circuits, universality — are proposed as the working hypotheses of mechanistic interpretability.
2021 · December
Mathematical framework for transformer circuits
Elhage et al. port the Circuits program to transformers and give the algebraic decomposition (QK/OV, residual stream) that everything downstream builds on.
2022 · February
ROME · causal tracing
Meng et al. introduce causal tracing to localize factual associations in GPT, and show that targeted rank-one weight edits can change a model's stored fact.
2022 · September
Induction heads · Toy models of superposition
Olsson et al. publish the induction-heads / in-context-learning paper; Elhage et al. publish 'Toy Models of Superposition.' Both are foundational.
2022 · November
Interpretability in the Wild · IOI circuit
Wang et al. apply path patching to GPT-2 small and produce the cleanest fully-described circuit in a real language model.
2023 · October
Towards Monosemanticity
Bricken, Templeton et al. show sparse autoencoders can recover interpretable features from a one-layer transformer, validating the SAE approach to superposition.
2024 · May
Scaling Monosemanticity
Templeton et al. scale SAEs to Claude 3 Sonnet's residual stream and extract roughly 34 million features, including a number flagged as safety-relevant.
2025 · March
Attribution graphs · 'On the Biology of a Large Language Model'
Anthropic publishes circuit tracing via attribution graphs on Claude 3.5 Haiku, with ten case studies of multi-step reasoning, planning, and hallucination inhibition. Tooling open-sourced.

What you can honestly claim · and what you can't · as of June 2026

Honest accounting, written to be linkable when someone asks 'do we understand how LLMs work yet?'

You can claim:
That small transformers (one or two layers, GPT-2 small) admit full circuit-level descriptions of specific behaviors.
That superposition is a real, demonstrated phenomenon.
That sparse autoencoders can extract large numbers of human-nameable features from frontier-scale residual streams.
That induction-head formation co-occurs sharply with the emergence of in-context learning during training.
That attribution graphs can produce partial, intervention-validated traces of multi-step behaviors in models the size of Claude 3.5 Haiku.

You cannot honestly claim:
That we have a full readable wiring diagram of any frontier model in production.
That SAE feature dictionaries are the model's own internal vocabulary rather than a useful approximation.
That any single interpretability technique generalizes cleanly across model families.
That interpretability tooling is yet at the point of reliably catching novel failure modes before deployment.

The research is real. The frontier is still further away than press coverage tends to suggest. Both statements need to be said in the same breath.

If you're starting from scratch · a short reading order

Distill · 'Zoom In: An Introduction to Circuits' (Olah et al. 2020) — the why and the worldview, with interactive visualizations.
Transformer Circuits Thread · 'A Mathematical Framework for Transformer Circuits' (Elhage et al. 2021) — the algebra you need to read everything that follows.
arXiv 2209.10652 · 'Toy Models of Superposition' (Elhage et al. 2022) — the conceptual key to why interpretability is hard and how to attack it.
arXiv 2209.11895 · 'In-context Learning and Induction Heads' (Olsson et al. 2022) — the most replicated specific mechanism in language models.
arXiv 2211.00593 · 'Interpretability in the Wild' (Wang et al. 2022) — what a fully-described circuit looks like end to end.
Transformer Circuits Thread · 'Towards Monosemanticity' (Bricken et al. 2023) and 'Scaling Monosemanticity' (Templeton et al. 2024) — the sparse-autoencoder line, in order.
Transformer Circuits Thread · 'On the Biology of a Large Language Model' (Lindsey et al. 2025) — attribution graphs in action on Claude 3.5 Haiku, with companion open-source tooling.
arXiv 1610.01644 · 'Understanding intermediate layers using linear classifier probes' (Alain & Bengio 2016) — the technique that started the broader interpretability program before circuits.

Sources

[01]
Olah et al. (2020) propose features, circuits, and universality as the three working hypotheses of mechanistic interpretability and launch the Distill Circuits thread.
distill.pub/2020/circuits/zoom-in/
[02]
Olah, Mordvintsev, and Schubert (2017) introduce the modern feature-visualization methodology of synthesizing inputs that maximally activate chosen units.
distill.pub/2017/feature-visualization/
[03]
The Anthropic Transformer Circuits Thread is the primary publication venue for the Anthropic interpretability team's work on language-model circuits.
transformer-circuits.pub/
[04]
Elhage et al. (2021) provide the algebraic decomposition of attention-only transformers (QK/OV circuits, residual stream) used by subsequent circuit work.
transformer-circuits.pub/2021/framework/index.html
[05]
Elhage et al. (2022) 'Toy Models of Superposition' demonstrate, on small ReLU networks with controllable feature sparsity, that models pack more features than they have dimensions by using non-orthogonal directions.
arxiv.org/abs/2209.10652
[06]
Olsson et al. (2022) 'In-context Learning and Induction Heads' show induction heads form sharply during training in coincidence with the emergence of in-context learning.
arxiv.org/abs/2209.11895
[07]
Bricken, Templeton et al. (October 2023) demonstrate that sparse autoencoders can extract interpretable features from the activations of a one-layer transformer.
transformer-circuits.pub/2023/monosemantic-features
[08]
Templeton et al. (May 2024) scale sparse autoencoders to the middle-layer residual stream of Claude 3 Sonnet and extract on the order of tens of millions of features, including a number flagged as safety-relevant.
transformer-circuits.pub/2024/scaling-monosemanticity/
[09]
Meng et al. (2022) ROME paper introduces causal tracing to localize factual associations in GPT and demonstrates targeted rank-one weight edits.
arxiv.org/abs/2202.05262
[10]
Wang et al. (2022) identify a 26-attention-head circuit in GPT-2 small that performs indirect object identification, using path patching to map the circuit's structure.
arxiv.org/abs/2211.00593
[11]
Alain & Bengio (2016) formalize linear classifier probes as a method for measuring how linearly readable various properties are from intermediate layer activations.
arxiv.org/abs/1610.01644
[12]
Lindsey et al. (2025) 'On the Biology of a Large Language Model' applies attribution graphs to Claude 3.5 Haiku across ten case studies, including multi-step reasoning, planning, and hallucination inhibition.
transformer-circuits.pub/2025/attribution-graphs/biology.html
[13]
Anthropic open-sourced the circuit-tracing tooling for generating attribution graphs on popular open-weights models, developed in collaboration with Decode Research.
anthropic.com/research/open-source-circuit-tracing
[14]
Olah et al. (2018) 'The Building Blocks of Interpretability' on Distill combines feature visualization, attribution, and dimensionality reduction into a unified interpretation interface.
distill.pub/2018/building-blocks/
[15]
Anthropic's research page documents the publication and framing of Toy Models of Superposition as part of the Transformer Circuits program.
anthropic.com/research/toy-models-of-superposition
