Can dictionary learning scale to production language models?
Sparse autoencoders recovered interpretable features from toy models, but scaling to real production systems like Claude remains uncertain. This matters because interpretability at scale is foundational for AI safety work.
Eight months after sparse autoencoders recovered monosemantic features from a one-layer transformer, the open question was whether the method scales: if it cannot reach state-of-the-art models, it cannot contribute to safety. This work answers it: dictionary learning extracts high-quality features from Claude 3 Sonnet, a medium-sized production model. The approach rests on two load-bearing hypotheses: the linear representation hypothesis (concepts are directions in activation space) and the superposition hypothesis (networks use almost-orthogonal directions to pack more features than dimensions). Sparse autoencoders are the dictionary-learning approximation that exploits both: they decompose an activation into a sparse combination of learned feature directions.
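To make the two hypotheses concrete, here is a minimal sketch of a sparse autoencoder's forward pass and loss in plain Python. It is an illustration under stated assumptions, not the setup used on Claude 3 Sonnet: the weights are hand-set rather than trained, the names (`encode`, `decode`, `sae_loss`, `DIRECTIONS`) are hypothetical, and the toy has three feature directions in a two-dimensional space. In two dimensions the directions cannot be almost orthogonal, so a negative encoder bias is assumed here to suppress the interference between them.

```python
import math

def relu(v):
    return v if v > 0.0 else 0.0

def encode(x, w_enc, b_enc):
    """Feature activations: f = ReLU(W_enc x + b_enc)."""
    return [relu(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(w_enc, b_enc)]

def decode(f, w_dec, b_dec):
    """Reconstruction: x_hat = b_dec + sum_i f_i * direction_i."""
    x_hat = list(b_dec)
    for activation, direction in zip(f, w_dec):
        for j, component in enumerate(direction):
            x_hat[j] += activation * component
    return x_hat

def sae_loss(x, f, x_hat, l1_coeff):
    """Squared reconstruction error plus an L1 sparsity penalty.

    Assumes unit-norm decoder directions, so the penalty is just sum(f)
    (activations are non-negative after the ReLU).
    """
    reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return reconstruction + l1_coeff * sum(f)

# Toy dictionary (assumption): 3 unit-norm feature directions in 2 dimensions,
# i.e. more features than dimensions. Encoder and decoder share the directions.
ANGLES = [0.0, 2 * math.pi / 3, 4 * math.pi / 3]
DIRECTIONS = [[math.cos(a), math.sin(a)] for a in ANGLES]
B_ENC = [-0.5, -0.5, -0.5]   # negative bias filters out interference
B_DEC = [0.0, 0.0]

# An activation that lies along feature 0's direction.
x = [2.0, 0.0]
features = encode(x, DIRECTIONS, B_ENC)
x_hat = decode(features, DIRECTIONS, B_DEC)
loss = sae_loss(x, features, x_hat, l1_coeff=0.1)
active = [i for i, a in enumerate(features) if a > 0.0]
```

In this toy only feature 0 ends up in `active`: the other two directions have negative overlap with `x` and are zeroed by the ReLU. A trained dictionary would learn the directions and biases by minimising `sae_loss` over many activations; the hand-set values here only show the shape of the computation.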
The recovered features are notable in three respects. They are abstract: features for famous people, countries and cities, and type signatures in code. They are multilingual and multimodal: the same feature responds to a concept across languages and in both text and images. And they span concrete and abstract instances of one idea: code containing security vulnerabilities and abstract discussion of security vulnerabilities fire the same feature. Most consequentially, the features are not merely correlational. They both respond to the relevant concepts and causally drive the corresponding behavior: clamping a feature steers the model.
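One way to picture clamping is as an edit along a single feature direction. The sketch below assumes a linear decoder with known feature directions. The function name `clamp_feature` and its parameters are hypothetical, and the note itself does not specify the mechanics, so treat this as one plausible reading rather than the method used in the source work. Under that assumption, replacing the activation with "reconstruction under the clamped feature plus the original reconstruction error" reduces to shifting the activation along the feature's decoder direction.

```python
def clamp_feature(x, current_activation, target_activation, direction):
    """Return activation vector x with one feature forced to target_activation.

    With a linear decoder, decode(clamped) + (x - decode(original)) simplifies
    to x + (target - current) * direction, so every other feature and the
    reconstruction error are left untouched.
    """
    delta = target_activation - current_activation
    return [xi + delta * di for xi, di in zip(x, direction)]

# Hypothetical values: the feature currently fires at 1.5 along direction [1, 0].
x = [2.0, 0.0]
direction = [1.0, 0.0]

boosted = clamp_feature(x, current_activation=1.5, target_activation=5.0,
                        direction=direction)
ablated = clamp_feature(x, current_activation=1.5, target_activation=0.0,
                        direction=direction)
```

Clamping high (`boosted`) pushes the activation further along the feature direction, while clamping to zero (`ablated`) removes the feature's contribution. In a real model the edited vector would be written back into the forward pass so that downstream layers see the change.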
The significance for the vault is that interpretability is tractable at production scale, not just in toy models, which is a precondition for any feature-level safety or steering work. It sits against "Do standard analysis methods hide nonlinear features in neural networks?": SAEs recover an impressive diversity of features, but that caution remains live, since what dictionary learning surfaces may still over-represent the linearly accessible.
Related concepts in this collection 3
- Do standard analysis methods hide nonlinear features in neural networks?
  Current representation analysis tools like PCA and linear probing may systematically miss complex nonlinear computations while over-reporting simple linear features. This raises questions about whether our interpretability methods are actually capturing what networks compute.
  caveat on what feature-extraction methods can and cannot see
- Can we decode what LLM activations really represent in language?
  Can a trained decoder translate internal LLM activations into natural language descriptions, revealing what hidden representations actually encode? This matters because it could unlock both interpretability and controllability through the same mechanism.
  both unify interpretation and control through the same representational handle
- Can a model be truthful without actually being honest?
  Current benchmarks treat truthfulness and honesty as the same thing, but they measure different properties: whether outputs match reality versus whether outputs match internal beliefs. What happens if they diverge?
  causal features are the substrate that could make such mechanistic distinctions actionable
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
- Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control
- Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs
- Reasoning Beyond Chain-of-Thought: A Latent Computational Mode in Large Language Models
- Weight-sparse transformers have interpretable circuits
- Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
- Break It Down: Evidence for Structural Compositionality in Neural Networks
- A Primer on the Inner Workings of Transformer-based Language Models
Original note title
dictionary learning scales to production models recovering abstract multimodal features that both detect and causally cause behavior