SYNTHESIS NOTE

How do language models organize features across processing layers?

Do neural networks arrange learned features into meaningful hierarchies as they process information? Understanding this structure could reveal how models build understanding from raw tokens to abstract concepts.

Synthesis note · 2026-04-18 · sourced from MechInterp

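Anthropic's circuit tracing work uses attribution graphs to reveal the computational graphs inside Claude models. The graphs are built from sparse, interpretable features (the original work learns them with cross-layer transcoders, a close relative of sparse autoencoders). The key finding is a consistent four-tier hierarchy of feature types across model layers: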

Input features (early layers): activate on specific tokens or token categories. A "digital" feature fires on "digital", "digitize", etc. Together these form the raw perceptual layer.
Abstract features (middle/later layers): represent properties of the context rather than surface tokens. Example: a feature for "the danger of mixing common cleaning chemicals." These are genuine conceptual representations, decoupled from specific words.
Functional features (middle/later layers): perform operations rather than represent concepts. An "add 9" feature causes the model to output a number nine greater than another number in context. These are computational primitives, not representational ones.
Output features (late layers): promote specific outputs or output categories. A "say a capital" feature promotes tokens corresponding to U.S. state capital names.
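
As a reading aid, the four tiers can be written down as a small data structure. This is only a sketch: the `Tier` and `FeatureExample` names and the `tiers_at_depth` helper are hypothetical, and the entries simply restate the examples listed above. It makes one point explicit: depth alone does not separate abstract from functional features, since both sit in the middle/later layers.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    INPUT = "input"
    ABSTRACT = "abstract"
    FUNCTIONAL = "functional"
    OUTPUT = "output"


@dataclass(frozen=True)
class FeatureExample:
    description: str  # wording taken from the note's examples
    tier: Tier
    depth: str        # coarse layer range as stated in the note


EXAMPLES = [
    FeatureExample('fires on "digital", "digitize"', Tier.INPUT, "early"),
    FeatureExample("danger of mixing common cleaning chemicals", Tier.ABSTRACT, "middle/later"),
    FeatureExample("add 9", Tier.FUNCTIONAL, "middle/later"),
    FeatureExample("say a capital", Tier.OUTPUT, "late"),
]


def tiers_at_depth(depth):
    """Return the tiers whose examples sit at the given coarse depth."""
    return [example.tier for example in EXAMPLES if example.depth == depth]


# Two different tiers share the same depth, so role (not position) distinguishes them.
shared = tiers_at_depth("middle/later")
```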

Polysemantic features, which activate for several unrelated concepts at once (one example responds to "rhythm", Michael Jordan, and other unrelated things), are concentrated in earlier layers. This is consistent with superposition being a compression strategy that is progressively resolved as processing deepens.
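
The general idea of superposition as compression can be illustrated with a toy example. The sketch below is not drawn from the circuit tracing work: it assumes three hypothetical features packed into a two-dimensional activation space along evenly spaced directions. Because there are more features than dimensions, each coordinate ("neuron") responds to several features, and reading out one feature picks up interference from the others.

```python
import math

# Assumption: three made-up features share a 2-dimensional space,
# with directions spaced 120 degrees apart.
N_FEATURES = 3
DIRECTIONS = [
    (math.cos(2 * math.pi * i / N_FEATURES), math.sin(2 * math.pi * i / N_FEATURES))
    for i in range(N_FEATURES)
]


def embed(activations):
    """Sum the feature directions, each scaled by its activation."""
    x = sum(a * d[0] for a, d in zip(activations, DIRECTIONS))
    y = sum(a * d[1] for a, d in zip(activations, DIRECTIONS))
    return (x, y)


def readout(vector, feature_index):
    """Project the activation vector onto one feature's direction."""
    dx, dy = DIRECTIONS[feature_index]
    return vector[0] * dx + vector[1] * dy


def features_per_neuron(threshold=0.4):
    """For each coordinate, list the features that move it noticeably."""
    return [
        [i for i, d in enumerate(DIRECTIONS) if abs(d[axis]) > threshold]
        for axis in range(2)
    ]


only_feature_0 = embed([1.0, 0.0, 0.0])
own_signal = readout(only_feature_0, 0)    # the active feature reads back at full strength
interference = readout(only_feature_0, 1)  # an inactive feature still gets a nonzero reading
```

In this toy setup both coordinates respond to more than one feature, which is the sense in which a single unit becomes polysemantic when the representation is compressed.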

Critically, feature abstractions are richer in larger models (Claude 3.5 Haiku compared with the smaller 18-layer model, 18L). This suggests that scaling does not just add more features; it adds more abstract ones, consistent with the idea that capability gains come from developing higher-level internal concepts rather than from memorizing more patterns.

Features also vary in how many layers they "live" across: some contribute to one or two layers, while others have strong outputs all the way through. This gradient from local to global features challenges simple circuit-based accounts in which each feature has a fixed location.
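
One simple way to quantify how many layers a feature spans is to count the layers where its output strength stays above some fraction of its peak. The sketch below assumes a per-layer strength profile is already available; the `layer_span` helper, the 0.5 threshold, and both example profiles are hypothetical and chosen only for illustration.

```python
def layer_span(strengths, fraction=0.5):
    """Return (first_layer, last_layer, n_layers) where strength >= fraction * peak.

    `strengths` holds one non-negative output strength per layer.
    Returns None if the feature never has a positive output.
    """
    peak = max(strengths)
    if peak <= 0:
        return None
    active = [i for i, s in enumerate(strengths) if s >= fraction * peak]
    return (active[0], active[-1], len(active))


# Made-up profiles over an 8-layer model.
local_feature = [0.0, 0.1, 0.9, 1.0, 0.2, 0.0, 0.0, 0.0]
global_feature = [0.0, 0.6, 0.8, 1.0, 0.9, 0.7, 0.8, 0.6]

local_span = layer_span(local_feature)    # concentrated in two adjacent layers
global_span = layer_span(global_feature)  # strong from layer 1 through the last layer
```

A summary like this treats location as a range rather than a single layer index, which matches the local-to-global gradient described above.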

The distinction between abstract features (representing what) and functional features (computing how) is particularly important for interpretability: standard probing approaches that look for representations of concepts would entirely miss the functional features that implement the actual computation. This connects to a related question: Do standard analysis methods hide nonlinear features in neural networks?
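
A toy contrast between a correlational check and an intervention may make the distinction concrete. The sketch below is deliberately contrived and is not the method used in the circuit tracing work: it assumes a one-step model in which a hypothetical "add 9" feature shifts the output by 9, and an analyst who probes for a concept (operand magnitude) that the feature does not encode. The correlational score is zero, while ablating the feature shows its causal effect.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def toy_output(operand, add9_activation):
    """Hypothetical model: the functional feature adds 9 when it is active."""
    return operand + 9 * add9_activation


# Made-up data: the feature is active on the first and last example only.
OPERANDS = [1, 2, 3, 4, 5, 6]
ADD9_ACTIVATION = [1.0, 0.0, 0.0, 0.0, 0.0, 1.0]

# Correlational view: does the feature track the concept "operand magnitude"?
probe_score = pearson(OPERANDS, ADD9_ACTIVATION)

# Interventional view: how much does the output change when the feature is ablated?
ablation_effect = [
    toy_output(x, a) - toy_output(x, 0.0)
    for x, a in zip(OPERANDS, ADD9_ACTIVATION)
]
```

The feature carries no information about the probed concept, yet removing it changes the output by 9 wherever it was active. In this toy setting, it is the intervention that identifies the feature's role.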

Inquiring lines that read this note 11

This note is a source for the research framings below, grouped by the broader line of inquiry each explores.

Do language models understand semantics or rely on pattern matching?

How does syntactic encoding relate to semantic feature representation?

What limits mechanistic interpretability's ability to characterize models?

Does model scaling alone produce compositional generalization without symbolic mechanisms?

Does decoupling planning from execution improve multi-step reasoning accuracy?

Why do hierarchical architectures better implement the deep research definition?

What determines success in training models on multiple tasks?

How do neural networks decompose complex tasks into modular subnetworks?

Why do continual learning scenarios trigger catastrophic forgetting and interference?

What happens to representational structure during model pretraining phases?

Related concepts in this collection 3


Can high-level concepts replace circuit-level analysis in AI? Instead of reverse-engineering individual circuits, can we study AI reasoning by treating concepts as directions in activation space? This matters because circuit analysis hits practical limits at scale.
Representation engineering (RepE) operates at the population level and would detect abstract features, but it may miss functional features, which implement operations rather than represent concepts.
Can sparse weight training make neural networks interpretable by design? Explores whether constraining most model weights to zero during training produces human-understandable circuits and disentangled representations, rather than attempting to reverse-engineer dense models after training.
Weight sparsity forces feature disentanglement by construction, whereas circuit tracing achieves interpretability post hoc through SAE-style decomposition of an already-trained dense model.
Can identical outputs hide broken internal representations? Can neural networks produce correct outputs while having fundamentally fractured internal structure that prevents generalization and creativity? This challenges our assumptions about what performance benchmarks actually measure.
The four-tier hierarchy provides a framework for asking where fractured entangled representation (FER) occurs: fracture at the abstract tier would be the most damaging to generalization.
