Sparse Autoencoders

2023 · arXiv:2309.08600 · Cunningham, Ewart, Riggs, Huben, Sharkey · EleutherAI and collaborators

An X-ray for the AI brain.

In one sentence: A mathematical technique for taking a fully trained language model and revealing the specific human-interpretable concepts it has learned, like turning a black box into something with labels on the inside.

01 · Why this matters to your life

For years AIs were called “black boxes.” You gave one an input. You got an output. What happened in between was hundreds of billions of numbers nobody could read. If the AI made a mistake, no one could trace why. If it was biased, no one could find the bias. If it learned dangerous knowledge, no one could verify it.

This paper was one of the first to show a practical way to change that. By May 2024 Anthropic had scaled the approach to a production model, training a sparse autoencoder with roughly 34 million features on Claude 3 Sonnet. Many of those features can be browsed, and some can be adjusted: researchers found a “Golden Gate Bridge” feature and turned it up, and they found features related to sycophancy whose strength could be changed in the same way. The implications for AI safety, debugging, regulation, and trust are enormous.

02 · What scientists actually did

The technical problem they tackled is called “superposition.” Inside a neural network, individual neurons often don't cleanly represent individual concepts. Instead, a single neuron can respond to many unrelated concepts, and a single concept can be spread across many neurons. This is efficient for the AI, because it lets the network fit more concepts than it has neurons, but it makes the network very hard to read neuron by neuron.
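
A toy illustration of superposition, offered as a sketch rather than anything taken from the paper: three made-up concepts are packed into just two “neurons” by giving each concept its own direction. The concept names, the 120-degree spacing, and the function names are all assumptions chosen for illustration.

```python
import math

# Hypothetical toy: 3 concepts stored in only 2 neurons.
# Each concept gets a unit-length direction, spaced 120 degrees apart.
CONCEPTS = ["cat", "sarcasm", "python_code"]
DIRECTIONS = [
    (math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3))
    for k in range(3)
]

def neuron_activations(active):
    """Sum the directions of the active concepts to get the 2 neuron values."""
    x = y = 0.0
    for name in active:
        dx, dy = DIRECTIONS[CONCEPTS.index(name)]
        x += dx
        y += dy
    return (x, y)

def readout(activations):
    """Project the neuron values back onto each concept direction."""
    ax, ay = activations
    return {
        name: ax * dx + ay * dy
        for name, (dx, dy) in zip(CONCEPTS, DIRECTIONS)
    }

# Only "sarcasm" is active, yet both neurons take nonzero values.
acts = neuron_activations(["sarcasm"])
scores = readout(acts)
# scores is approximately: sarcasm 1.0, cat -0.5, python_code -0.5
```

In this toy the first neuron responds to all three concepts, so it cannot be read as “the sarcasm neuron,” and reading a concept back out picks up interference from the others: the inactive concepts score about -0.5 rather than 0.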

The trick: they trained a second, much smaller neural network (a single hidden layer) whose only job is to take the messy, overlapping signals from the original network and re-encode them into a much larger but sparser representation, one where only a handful of “features” are active at a time and each feature ideally corresponds to one specific concept. This second network is called a sparse autoencoder. The result is a translation layer between “messy AI internal” and “clean human-readable.”
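
The shape of a sparse autoencoder can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the class name `TinySAE`, the random initialization, the separate encoder and decoder weights, and the `l1_coeff` value are all hypothetical choices, and training (gradient descent on the loss) is left out.

```python
import random

def relu(values):
    """Zero out negative entries; this is what lets features switch off."""
    return [max(0.0, v) for v in values]

def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

class TinySAE:
    """Hypothetical toy sparse autoencoder: d_model inputs, d_features features."""

    def __init__(self, d_model, d_features, seed=0):
        rng = random.Random(seed)
        # Assumption: untrained random weights, only to show the data flow.
        self.W_enc = [[rng.gauss(0, 0.5) for _ in range(d_model)]
                      for _ in range(d_features)]
        self.b_enc = [0.0] * d_features
        self.W_dec = [[rng.gauss(0, 0.5) for _ in range(d_features)]
                      for _ in range(d_model)]
        self.b_dec = [0.0] * d_model

    def encode(self, x):
        """Map model activations to a wider, non-negative feature vector."""
        pre = matvec(self.W_enc, x)
        return relu([p + b for p, b in zip(pre, self.b_enc)])

    def decode(self, features):
        """Rebuild the original activations from the feature vector."""
        return [r + b for r, b in zip(matvec(self.W_dec, features), self.b_dec)]

    def loss(self, x, l1_coeff=0.01):
        """Reconstruction error plus an L1 penalty that rewards sparsity."""
        features = self.encode(x)
        x_hat = self.decode(features)
        reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_hat))
        sparsity = sum(features)  # equals the L1 norm because features >= 0
        return reconstruction + l1_coeff * sparsity, features, x_hat

# 4 "model" dimensions expanded into 16 candidate features.
sae = TinySAE(d_model=4, d_features=16)
total, features, reconstruction = sae.loss([1.0, -0.5, 0.25, 0.0])
```

The two terms of the loss pull in opposite directions: the reconstruction term wants the features to carry all the information in the original activations, while the L1 term wants as few features as possible to be active. After training, that tension is what pushes each feature toward standing for one concept.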

The discovered features then have to be labeled, typically by looking at the text that activates each one most strongly, a job that is increasingly handed to another language model. Some features fire on the word “cat.” Some fire on the concept of sarcasm. Some fire on Python code. Some fire on harmful intent. The features are weirdly granular: for instance, Anthropic later found one specific feature that fires on the concept of “being in a bind / having to compromise on values.”
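
One common way to get a first label for a feature is to pull up the text where it fires hardest. The sketch below shows that idea only; the function name, the example sentence, and the activation values are invented for illustration, and it is not the labeling pipeline used in the paper.

```python
import heapq

def top_activating_examples(tokens, activations, k=3, window=2):
    """Return (token, activation, context) for the k strongest positions."""
    if len(tokens) != len(activations):
        raise ValueError("tokens and activations must have the same length")
    # Indices of the k largest activations, strongest first.
    top = heapq.nlargest(k, range(len(tokens)), key=lambda i: activations[i])
    examples = []
    for i in top:
        if activations[i] <= 0.0:
            continue  # the feature did not fire here, so skip it
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        examples.append((tokens[i], activations[i], " ".join(tokens[lo:hi])))
    return examples

# Made-up activations of one hypothetical feature over a short sentence.
tokens = "the cat sat on the mat while another cat slept".split()
activations = [0.0, 4.2, 0.1, 0.0, 0.0, 0.3, 0.0, 0.2, 3.9, 0.0]

examples = top_activating_examples(tokens, activations, k=2)
for token, value, context in examples:
    print(f"{value:4.1f}  {token!r}  in: {context}")
```

A human or a language model reading those two snippets would plausibly label the feature “cat.” Real features are usually judged on many more examples, including cases where the feature fires weakly or not at all.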

03 · What scientists know but rarely say

The honest framing: this is still incomplete. We can identify millions of features inside a model. We cannot yet identify all of them. The features we have found are interpretable, but we don't know what fraction of the model's reasoning they capture. The black box has become a partially labeled grey box.

The other unstated reality: feature labeling is hard to scale. Each feature has to be identified, characterized, and (ideally) named by an AI or a human. Anthropic publishes interactive feature explorers (transformer-circuits.pub) where you can browse some of them. The scale required for full interpretability is still beyond what any team has delivered, but progress through 2024-2026 has been faster than skeptics predicted.

Most consequential implication: if interpretability matures, AI auditing becomes possible the way financial auditing is possible. You could in principle verify that a model does not contain dangerous capabilities, certify that it lacks specific biases, or trace why it gave a wrong answer. AI regulation discussions through 2025-2026 increasingly assume this capability exists or will exist soon. The Anthropic interpretability team is essentially trying to make the regulators' assumptions true.

04 · What the paper does NOT claim

The 2023 paper does not claim to fully interpret any model. It claims that sparse autoencoders find “highly interpretable” features in language models, a substantial step beyond previous work. The number of identified features in the original paper was a few thousand, found in small open models. The 34M number quoted above comes from the scaled-up version Anthropic published in its 2024 follow-up (Scaling Monosemanticity, May 2024).

The paper does not claim that finding features means understanding the model. It does not claim that all features are clean: some are still polysemantic (multiple concepts mixed). It does not claim to know the “goals” or “intentions” of the model. Interpretability is a step toward all of these, but it has not delivered them yet.

05 · Read the original

· arxiv.org/abs/2309.08600 — the original 2023 paper.
· transformer-circuits.pub (Anthropic Interpretability Team) — the entire interpretability research thread, with interactive feature explorers. Best browsing on the internet for “what's inside an AI.”
· Templeton et al. 2024 (“Scaling Monosemanticity”) — Anthropic's follow-up scaling sparse autoencoders (SAEs) to a production model, Claude 3 Sonnet. transformer-circuits.pub/2024/scaling-monosemanticity.
· “Golden Gate Claude” (May 2024) — Anthropic's public demonstration of feature steering, where they cranked up the Golden Gate Bridge feature inside Claude and the model started talking about itself as the bridge. Strange and instructive.
