SYNTHESIS NOTE
Model Architecture and Internals Training, RL, and Test-Time Scaling

Can sparse weight training make neural networks interpretable by design?

Explores whether constraining most model weights to zero during training produces human-understandable circuits and disentangled representations, rather than attempting to reverse-engineer dense models after training.

Synthesis note · 2026-02-23 · sourced from MechInterp
What kind of thing is an LLM really? How should researchers navigate LLM reasoning research?

Existing mechanistic interpretability approaches (SAEs, activation patching, circuit discovery) attempt to understand dense models post-hoc. Weight-sparse training offers a fundamentally different paradigm: make the model interpretable by construction.

The approach: constrain most weights to be zeros (small L0 norm). Each neuron can only read from or write to a few residual channels, which discourages distributing representations across channels and using excess neurons. The result: disentangled circuits where neuron activations correspond to simple concepts ("tokens following a single quote," "depth of list nesting") with straightforward, intuitive connections.

Three key findings:

  1. Disentangled task circuits. Isolating minimal circuits for each task shows they are compact. Different tasks use different circuits with minimal overlap. This validates the hypothesis that superposition is what makes dense models hard to interpret — remove the superposition pressure and interpretation becomes tractable.

  2. Necessary and sufficient. Mean-ablating every neuron except the circuit preserves task performance. Deleting only the circuit nodes severely harms it. This is an unusually rigorous validation for interpretability claims.

  3. Capability-interpretability tradeoff with scaling. Making weights sparser decreases capability. Scaling model size improves the frontier — larger sparse models are more capable at the same interpretability level. But scaling beyond tens of millions of nonzero parameters while preserving interpretability remains unsolved.

The critical limitation: weight-sparse models are extremely inefficient to train and deploy, and unlikely to reach frontier capabilities. This is interpretability-by-construction for research models, not a path to understanding GPT-4.

However, preliminary results suggest the method can be adapted to explain existing dense models — training sparse approximations that reveal interpretable structure present in the dense original. If this scales, it bridges the gap between the paradigm's elegance and practical utility.

Inquiring lines that use this note as a source 55

This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.

Related concepts in this collection 4

This note in its neighbourhood — explore the map, then jump to a related concept in the list below.

Concept map
14 direct connections · 116 in 2-hop network ·medium cluster Open in graph ↗

Click a node to walk · click center to open · click Open in graph to see this note in the full knowledge graph

your link semantically near linked from elsewhere

Related papers in this collection 8

Papers most semantically related to this note, ranked by cosine similarity in the embedding space.

Original note title

weight sparsity produces interpretable disentangled circuits — a new paradigm trading capability for interpretability