SYNTHESIS NOTE

Can dictionary learning scale to production language models?

Sparse autoencoders recovered interpretable features from toy models, but it was unclear whether the method would scale to real production systems like Claude. This matters because interpretability at scale is foundational for AI safety work.

Synthesis note · 2026-06-03 · sourced from Evaluations

Eight months after sparse autoencoders recovered monosemantic features from a one-layer transformer, the open question was whether the method scales. If it cannot reach state-of-the-art models, it cannot contribute to safety. This work answers the question: dictionary learning extracts high-quality features from Claude 3 Sonnet, a medium-sized production model. The approach rests on two hypotheses, worth stating because they are load-bearing: the linear representation hypothesis (concepts are directions in activation space) and the superposition hypothesis (networks use almost-orthogonal directions to pack more features than they have dimensions). Sparse autoencoders are the dictionary-learning approximation that exploits both.
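As a rough illustration of what "dictionary learning with a sparse autoencoder" means, here is a minimal NumPy sketch. It assumes a generic formulation (ReLU encoder, linear decoder, squared reconstruction error plus an L1 sparsity penalty); the sizes, the names `encode`, `decode`, `sae_loss` and the coefficient `l1_coeff` are hypothetical and are not taken from the source work, whose exact objective is not stated in this note. The weights are random and untrained, so the block shows only the shape of the computation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: the dictionary is overcomplete (more features than dimensions),
# which is what the superposition hypothesis calls for.
d_model, n_features = 16, 64

# Untrained, randomly initialised parameters (illustration only).
W_enc = rng.normal(0.0, 0.1, (d_model, n_features))
b_enc = np.zeros(n_features)
W_dec = rng.normal(0.0, 0.1, (n_features, d_model))
b_dec = np.zeros(d_model)


def encode(x):
    """Map activations to non-negative feature activations (ReLU)."""
    return np.maximum(0.0, x @ W_enc + b_enc)


def decode(f):
    """Reconstruct activations as a weighted sum of dictionary directions (rows of W_dec)."""
    return f @ W_dec + b_dec


def sae_loss(x, l1_coeff=1e-3):
    """Squared reconstruction error plus an L1 penalty that encourages sparse features."""
    f = encode(x)
    x_hat = decode(f)
    reconstruction = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(f), axis=-1))
    return reconstruction + l1_coeff * sparsity, f, x_hat


# Stand-in for a batch of model activations; real inputs would come from the model.
x = rng.normal(size=(8, d_model))
loss, f, x_hat = sae_loss(x)
print(f.shape, x_hat.shape, float(loss))
```

Each row of `W_dec` plays the role of one direction in activation space, which is how the linear representation hypothesis enters the sketch.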

The recovered features are notable in three respects. They are abstract: there are features for famous people, for countries and cities, and for type signatures in code. They are multilingual and multimodal: the same feature responds to a concept across languages and in both text and images. And they span concrete and abstract instantiations of one idea (code containing security vulnerabilities and abstract discussion of security vulnerabilities activate the same feature). Most consequentially, the features are not merely correlational: they respond to the relevant concept and also causally drive the corresponding behavior, so clamping a feature steers the model.
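Clamping can be sketched as follows, under stated assumptions: the activation is decomposed into dictionary features plus a residual error, one feature is pinned to a chosen value, and the activation is rebuilt. The function `clamp_feature`, the sizes, and the random weights are hypothetical; this note does not describe the procedure in this detail, so treat it as one plausible reading, not the method used in the source work.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 16, 64  # hypothetical sizes

# Untrained stand-ins for a learned dictionary.
W_enc = rng.normal(0.0, 0.1, (d_model, n_features))
W_dec = rng.normal(0.0, 0.1, (n_features, d_model))


def encode(x):
    """Non-negative feature activations."""
    return np.maximum(0.0, x @ W_enc)


def clamp_feature(x, feature_idx, value):
    """Pin one feature to `value` and rebuild the activation.

    The reconstruction error is added back so that everything the
    dictionary does not explain is left unchanged.
    """
    f = encode(x)
    error = x - f @ W_dec
    f_clamped = f.copy()
    f_clamped[..., feature_idx] = value
    return f_clamped @ W_dec + error


x = rng.normal(size=(4, d_model))  # stand-in for model activations
steered = clamp_feature(x, feature_idx=3, value=5.0)

# The edit moves each activation only along the clamped feature's decoder direction.
delta = steered - x
expected = (5.0 - encode(x)[:, 3])[:, None] * W_dec[3]
print(np.allclose(delta, expected))
```

In a real setting the steered activation would be written back into the model's forward pass, and the downstream change in output is what makes the feature causal instead of merely correlational.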

The significance for the vault is that interpretability is tractable at production scale, not just in toy models, which is a precondition for any feature-level safety or steering work. It sits against the note "Do standard analysis methods hide nonlinear features in neural networks?": SAEs recover an impressive diversity of features, but that caution remains live, because what dictionary learning surfaces may still over-represent the linearly accessible.
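The caution about linear accessibility can be made concrete with a toy case that is not drawn from the source work: a feature defined as the XOR of two inputs is invisible to a linear readout but trivially recoverable once a product term is available. The helper `fit_predict` is a hypothetical name, and a least-squares probe stands in for "standard analysis methods" purely for illustration.

```python
import numpy as np

# Four points and an XOR target: a feature that is present but not linearly readable.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)


def fit_predict(features, targets):
    """Least-squares linear probe with a bias term; returns its predictions."""
    A = np.hstack([features, np.ones((len(features), 1))])
    w, *_ = np.linalg.lstsq(A, targets, rcond=None)
    return A @ w


# Linear probe on the raw inputs: the best fit predicts 0.5 for every point.
linear_pred = fit_predict(X, y)

# Adding the product x1 * x2 makes the target exactly recoverable.
product = (X[:, 0] * X[:, 1])[:, None]
nonlinear_pred = fit_predict(np.hstack([X, product]), y)

print(np.round(linear_pred, 3), np.round(nonlinear_pred, 3))
```

The linear probe reports no signal at all even though the feature is fully determined by the inputs, which is the kind of blind spot the caution points at.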

Inquiring lines that read this note 2

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

What limits mechanistic interpretability's ability to characterize models?

How do mechanistic features compare to natural language for interpretability?

Do language models learn genuine linguistic structure or just surface patterns?

Can we balance interpretability with the efficiency gains of compressed inter-model communication?

Related concepts in this collection 3

Do standard analysis methods hide nonlinear features in neural networks? Current representation analysis tools like PCA and linear probing may systematically miss complex nonlinear computations while over-reporting simple linear features. This raises questions about whether our interpretability methods are actually capturing what networks compute.
caveat on what feature-extraction methods can and cannot see
Can we decode what LLM activations really represent in language? Can a trained decoder translate internal LLM activations into natural language descriptions, revealing what hidden representations actually encode? This matters because it could unlock both interpretability and controllability through the same mechanism.
both unify interpretation and control through the same representational handle
Can a model be truthful without actually being honest? Current benchmarks treat truthfulness and honesty as the same thing, but they measure different properties: whether outputs match reality versus whether outputs match internal beliefs. What happens if they diverge?
causal features are the substrate that could make such mechanistic distinctions actionable

Original note title

dictionary learning scales to production models recovering abstract multimodal features that both detect and causally cause behavior
