SYNTHESIS NOTE
Model Architecture and Internals

How do language models organize features across processing layers?

Do neural networks arrange learned features into meaningful hierarchies as they process information? Understanding this structure could reveal how models build understanding from raw tokens to abstract concepts.

Synthesis note · 2026-04-18 · sourced from MechInterp

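Anthropic's circuit tracing work uses attribution graphs built over sparse, interpretable features (learned with cross-layer transcoders in the circuit tracing paper, a close relative of sparse autoencoders) to reveal the computational graphs behind individual model outputs. The key finding is a consistent four-tier hierarchy of feature types across model layers: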

  1. Input features (early layers) — activate on specific tokens or token categories. A "digital" feature fires on "digital", "digitize", etc. These are the raw perceptual layer.

  2. Abstract features (middle/later layers) — represent properties of context rather than surface tokens. Example: a feature for "the danger of mixing common cleaning chemicals." These are genuine conceptual representations disconnected from specific words.

  3. Functional features (middle/later layers) — perform operations rather than represent concepts. An "add 9" feature causes the model to output a number nine greater than another in context. These are computational primitives, not representational ones.

  4. Output features (late layers) — promote specific outputs or output categories. A "say a capital" feature promotes tokens corresponding to U.S. state capital names.
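
The four feature types above can be written down as a small labelling scheme. This is only an illustrative sketch: the names `FeatureKind`, `Feature`, and `depth_by_kind` are hypothetical, and the layer indices and the 18-layer depth are made up for the example rather than taken from the source.

```python
from dataclasses import dataclass
from enum import Enum


class FeatureKind(Enum):
    INPUT = "input"            # fires on specific tokens or token categories
    ABSTRACT = "abstract"      # represents a property of the context
    FUNCTIONAL = "functional"  # performs an operation
    OUTPUT = "output"          # promotes specific outputs


@dataclass(frozen=True)
class Feature:
    label: str
    layer: int          # hypothetical layer index, for illustration only
    kind: FeatureKind


N_LAYERS = 18  # assumed depth for this toy example

FEATURES = [
    Feature("digital", 1, FeatureKind.INPUT),
    Feature("danger of mixing common cleaning chemicals", 9, FeatureKind.ABSTRACT),
    Feature("add 9", 11, FeatureKind.FUNCTIONAL),
    Feature("say a capital", 17, FeatureKind.OUTPUT),
]


def relative_depth(feature, n_layers=N_LAYERS):
    """Map a layer index to [0, 1]: 0 is the first layer, 1 is the last."""
    return feature.layer / (n_layers - 1)


def depth_by_kind(features, n_layers=N_LAYERS):
    """Average relative depth of the features of each kind."""
    buckets = {}
    for f in features:
        buckets.setdefault(f.kind, []).append(relative_depth(f, n_layers))
    return {kind: sum(ds) / len(ds) for kind, ds in buckets.items()}
```

Calling `depth_by_kind(FEATURES)` on real feature labels, if they were available, would be one way to check whether input features do sit earlier than abstract and functional ones, and output features later.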

Polysemantic features, which activate for unrelated concepts (for example "rhythm" and Michael Jordan), are concentrated in earlier layers. This is consistent with superposition acting as a compression strategy whose overlap gets resolved as processing deepens.
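
A toy picture of superposition as compression, offered as a sketch rather than anything from the source: three features are packed into a two-dimensional space, so their directions cannot all be orthogonal, and reading out one feature also picks up the others. The 120-degree layout and the helper names `unit`, `dot`, and `readout` are assumptions made for the illustration.

```python
import math


def unit(angle_deg):
    """Unit vector in 2D at the given angle."""
    a = math.radians(angle_deg)
    return (math.cos(a), math.sin(a))


def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]


# Three feature directions squeezed into two dimensions (assumed layout).
DIRECTIONS = [unit(0), unit(120), unit(240)]


def readout(active_index):
    """Project the active feature's direction onto every feature direction.

    A non-zero value at another index is interference: that feature appears
    partly "on" even though only one feature was actually active.
    """
    active = DIRECTIONS[active_index]
    return [dot(active, d) for d in DIRECTIONS]
```

With this layout, `readout(0)` gives a value of about 1 for the active feature and about -0.5 for each of the other two, which is the cost of storing more features than there are dimensions.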

Critically, feature abstractions are richer in the larger model (Haiku compared with the smaller 18-layer model, 18L). This suggests that scaling does not just add more features; it adds more abstract features, consistent with the idea that capability gains come from developing higher-level internal concepts rather than from memorizing more patterns.

Features also vary in how many layers they "live" across — some contribute to one or two layers while others have strong outputs all the way through. This gradient from local to global features challenges simple circuit-based accounts where each feature has a fixed location.
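
One simple way to put a number on how many layers a feature "lives" across is to count the layers where its output is a meaningful fraction of its peak. This is a sketch under stated assumptions: `layer_span`, the 10% threshold, and the per-layer strengths below are all hypothetical and not measurements from the source.

```python
def layer_span(output_norms, rel_threshold=0.1):
    """Count layers where the feature's output is at least
    rel_threshold times its peak output."""
    peak = max(output_norms, default=0.0)
    if peak <= 0.0:
        return 0
    return sum(1 for v in output_norms if v >= rel_threshold * peak)


# Made-up per-layer output strengths for two illustrative features.
local_feature = [0.0, 0.9, 0.8, 0.02, 0.0, 0.0]   # concentrated in two layers
global_feature = [0.6, 0.7, 0.9, 0.8, 0.7, 0.6]   # strong all the way through

spans = {
    "local": layer_span(local_feature),
    "global": layer_span(global_feature),
}
```

Here the local feature spans 2 layers and the global one spans all 6. The result depends on the threshold, so any real analysis would need to report it alongside the spans.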

The distinction between abstract features (representing what) and functional features (computing how) is particularly important for interpretability: standard probing approaches that look for representations of concepts would entirely miss the functional features that implement the actual computation. This connects to the related question "Do standard analysis methods hide nonlinear features in neural networks?"
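
The gap between probing and intervention can be shown with a deliberately contrived toy, built so the point holds by design. A correlation score stands in for a concept probe, and ablation stands in for a causal test. The "model", the "is even" concept, and all function names are hypothetical.

```python
def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input has no variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def toy_model(x, add_nine_activation):
    """Toy 'model': the functional feature adds 9 to the operand when active."""
    return x + 9 * add_nine_activation


# Each example is (operand, whether the context asks for the addition).
EXAMPLES = [(x, add) for x in range(8) for add in (0, 1)]

concept_label = [1.0 if x % 2 == 0 else 0.0 for x, _ in EXAMPLES]  # "x is even"
abstract_act = list(concept_label)                     # tracks the concept
functional_act = [float(add) for _, add in EXAMPLES]   # tracks the operation

# Probe-style score: how well does each feature track the concept?
probe_abstract = pearson(abstract_act, concept_label)
probe_functional = pearson(functional_act, concept_label)

# Intervention: mean change in output when the functional feature is ablated.
ablation_effect = sum(
    abs(toy_model(x, a) - toy_model(x, 0.0))
    for (x, _), a in zip(EXAMPLES, functional_act)
) / len(EXAMPLES)
```

In this toy the concept probe scores the abstract feature at 1 and the functional feature at 0, yet ablating the functional feature changes the output by 4.5 on average. A concept-centred probe would overlook the feature that does the computation.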

Original note title

circuit tracing reveals a four-tier feature hierarchy in language models — input features to abstract concepts to functional operations to output features