INQUIRING LINE

Why are polysemantic features concentrated in early neural network layers?

This explores why the 'messy' features that fire for many unrelated concepts at once (polysemantic ones) tend to pile up near a network's input layers rather than its deeper ones. The corpus doesn't tackle polysemanticity head-on, but several notes circle the same territory: how features get cleaner as you go deeper.


This explores why multi-meaning, entangled features concentrate early in a network. No note here studies polysemanticity or superposition by name, but a few converge on a clean explanation: early layers sit closest to raw tokens, where many surface forms must be crammed into limited dimensions, and the disentangling happens later. The sharpest evidence is circuit tracing in Claude models, which finds a four-tier progression: token-level inputs → abstract concepts → functional operations → outputs (see: How do language models organize features across processing layers?). The bottom tier is exactly where you'd expect crowding: a single early unit has to participate in representing every word that shares a spelling, a subword, or a context, so it ends up firing for a grab-bag of meanings. Abstraction, and with it the room to give concepts their own clean directions, only arrives deeper in.
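
The crowding argument can be made concrete with a toy calculation. The sketch below is an illustration only and is not taken from the circuit-tracing work: it assumes features are random unit directions, treats each coordinate as a "unit", and uses an arbitrary loading threshold of 0.3. The sizes (64 features, 8 units) are hypothetical as well. It counts how many features each unit responds to when features outnumber units, and compares that with a hand-built code where every feature gets its own axis.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a direction uniformly at random on the unit sphere in `dim` dimensions."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def features_per_unit(directions, threshold=0.3):
    """For each unit (coordinate), count the features that load on it above `threshold`."""
    dim = len(directions[0])
    return [sum(1 for d in directions if abs(d[j]) > threshold) for j in range(dim)]


rng = random.Random(0)  # fixed seed so the toy is repeatable

# Crowded regime: 64 features squeezed into 8 units (hypothetical sizes).
crowded = [random_unit_vector(8, rng) for _ in range(64)]
crowded_counts = features_per_unit(crowded)

# Roomy regime: 8 features, 8 units, one axis each (a hand-built monosemantic code).
roomy = [[1.0 if i == j else 0.0 for j in range(8)] for i in range(8)]
roomy_counts = features_per_unit(roomy)

print("features per unit, crowded:", crowded_counts)
print("features per unit, roomy:  ", roomy_counts)
```

In the crowded case every unit ends up loading on many features at once, which is the toy analogue of a polysemantic unit; in the roomy case each unit answers to exactly one feature. Real networks learn their directions rather than sampling them, so this shows only the counting pressure, not the learned geometry.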

Why is the early layer forced to be lossy and entangled? One note argues that the geometry of language models isn't hand-built but falls directly out of word co-occurrence statistics (see: Where does hierarchical structure in language models come from?). Words that appear in overlapping contexts start life tangled together; nothing has yet pulled them apart. Early representations inherit that raw statistical mush, and it's the job of later layers to carve the nested, separable structure out of it. So polysemanticity early isn't a bug so much as the unprocessed input distribution showing through before the network has done its work.
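
A small sketch of what "tangled by co-occurrence" can look like. Everything here is an assumption made for illustration: the five-sentence corpus is invented, "the" is dropped as a stop word, the context window is two tokens, and similarity is plain cosine over raw count rows. This is not the spectral analysis the cited note describes, only the raw statistic that such an analysis would start from.

```python
import math
from collections import Counter, defaultdict

# Invented toy corpus: "bank" appears in both finance and river contexts.
corpus = [
    "the bank approved the loan",
    "the bank raised the rate",
    "the river bank flooded the field",
    "the river flooded the town",
    "the lender approved the loan",
]
STOP_WORDS = {"the"}  # assumption: drop the one function word in the toy corpus


def cooccurrence_rows(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    rows = defaultdict(Counter)
    for sentence in sentences:
        tokens = [t for t in sentence.split() if t not in STOP_WORDS]
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    rows[word][tokens[j]] += 1
    return rows


def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


rows = cooccurrence_rows(corpus)
sim_bank_lender = cosine(rows["bank"], rows["lender"])
sim_bank_river = cosine(rows["bank"], rows["river"])
sim_lender_river = cosine(rows["lender"], rows["river"])

print(f"bank~lender {sim_bank_lender:.3f}")
print(f"bank~river  {sim_bank_river:.3f}")
print(f"lender~river {sim_lender_river:.3f}")
```

In this toy, "bank" overlaps with both "lender" and "river" while those two share no context at all, so a representation built directly from these counts would start with both senses of "bank" folded into one vector.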

The 'depth does the disentangling' idea gets independent support from architecture experiments: deep-and-thin small models beat wide ones because stacking layers lets the network compose abstract concepts step by step rather than packing everything into a single wide bottleneck (see: Does depth matter more than width for tiny language models?). If composition is what depth buys you, then the early, pre-composition layers are necessarily the ones doing broad, overloaded, many-meanings-per-unit encoding. Relatedly, compositional generalization tends to track how *linearly decodable* a concept's constituents are from the hidden activations (see: Can neural networks learn compositional skills without symbolic mechanisms?), and clean linear decodability is a deep-layer property, the opposite of the overlapping mixtures you find at the input.
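
To make "linearly decodable" concrete, here is a minimal probe sketch. It is not the method of any cited note: the 2-D points are synthetic stand-ins for hidden activations, the probe is a plain perceptron, and accuracy is measured on the training points themselves. One label depends on a single axis (a clean direction); the other is an XOR-style mixture of both axes, which no single linear readout can separate.

```python
from itertools import product


def predict(w, point):
    """Linear readout: sign of w0*x0 + w1*x1 + bias."""
    score = w[0] * point[0] + w[1] * point[1] + w[2]
    return 1 if score > 0 else -1


def train_linear_probe(points, labels, epochs=50):
    """Perceptron updates; converges only if the labels are linearly separable."""
    w = [0.0, 0.0, 0.0]  # two weights plus a bias term
    for _ in range(epochs):
        for point, y in zip(points, labels):
            if predict(w, point) != y:
                w[0] += y * point[0]
                w[1] += y * point[1]
                w[2] += y
    return w


def probe_accuracy(points, labels):
    """Fit a probe and report its accuracy on the same points (training accuracy)."""
    w = train_linear_probe(points, labels)
    hits = sum(1 for p, y in zip(points, labels) if predict(w, p) == y)
    return hits / len(points)


# Synthetic "activations": a 4x4 grid with no points on the axes.
points = list(product([-2.0, -1.0, 1.0, 2.0], repeat=2))

# Clean case: the concept lives on one axis.
axis_labels = [1 if x0 > 0 else -1 for x0, _ in points]
# Entangled case: the concept is an XOR-like mixture of both axes.
xor_labels = [1 if x0 * x1 > 0 else -1 for x0, x1 in points]

acc_axis = probe_accuracy(points, axis_labels)
acc_xor = probe_accuracy(points, xor_labels)
print("axis-aligned concept:", acc_axis)
print("XOR-mixed concept:   ", acc_xor)
```

The axis-aligned concept is read out perfectly, while the XOR-mixed one cannot be. In practice a probe like this would be fit to each layer's activations on held-out data, and the per-layer accuracy curve is what the decodability claim above is about.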

Two more notes hint at the flip side: what 'cleaned up' looks like. Networks naturally sort compositional work into isolated, modular subnetworks, and pretraining makes that modularity more reliable (see: Do neural networks naturally learn modular compositional structure?); and hidden states actively *sparsify*, with fewer units firing and more selectively, when a task gets hard or unfamiliar (see: Do language models sparsify their activations under difficult tasks?). Both describe representations becoming dedicated and selective, which is precisely the regime polysemantic early features are not in. The unexpected payoff here: polysemanticity may be less an intrinsic property of 'early layers' and more a symptom of proximity to raw, uncompressed input statistics. Depth, modularity, and sparsification are all names for the same process of pulling those tangled meanings apart.
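
"Fewer units firing, more selectively" can be put into numbers in more than one way. The sketch below shows two common choices, the fraction of active units and the Hoyer sparsity measure. Both are assumptions on my part rather than the metric used in the cited sparsification note, and the two example vectors are made up.

```python
import math


def active_fraction(activations, eps=1e-6):
    """Share of units whose magnitude exceeds `eps` (1.0 means every unit fires)."""
    return sum(1 for a in activations if abs(a) > eps) / len(activations)


def hoyer_sparsity(activations):
    """Hoyer measure: 0.0 for a uniform vector, 1.0 for a one-hot vector."""
    n = len(activations)
    l1 = sum(abs(a) for a in activations)
    l2 = math.sqrt(sum(a * a for a in activations))
    if n < 2 or l2 == 0.0:
        return 0.0  # undefined in these cases; 0.0 is a convention chosen here
    return (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1)


# Made-up hidden states: one where every unit is busy, one dominated by a single unit.
dense_state = [0.5, -0.4, 0.6, 0.5, -0.3, 0.4, 0.5, -0.6]
sparse_state = [0.0, 0.0, 1.9, 0.0, 0.0, 0.0, -0.2, 0.0]

for name, state in [("dense", dense_state), ("sparse", sparse_state)]:
    print(name, active_fraction(state), round(hoyer_sparsity(state), 3))
```

The active fraction depends on the threshold `eps`, while the Hoyer measure is scale-invariant and needs no threshold, which makes it a reasonable choice for comparing layers whose activation magnitudes differ.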


Sources (6 notes)

How do language models organize features across processing layers?

Circuit tracing in Claude models reveals features progress from token-level inputs to abstract concepts to functional operations to outputs. Larger models develop richer abstract features, suggesting scaling enables higher-level conceptual reasoning rather than pattern memorization.

Where does hierarchical structure in language models come from?

LLM hierarchical representations arise as a direct mathematical consequence of corpus statistics, not from hierarchy-specific mechanisms. Spectral analysis of word co-occurrence matrices predicts and reproduces the same nested geometry found in trained embeddings and word2vec models.

Does depth matter more than width for tiny language models?

MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.

Can neural networks learn compositional skills without symbolic mechanisms?

Standard MLPs achieve compositional generalization through data and model scaling alone, without architectural modifications, provided the training distribution sufficiently covers combinations of task modules. Linear decodability of constituents from hidden activations reliably predicts success.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Do language models sparsify their activations under difficult tasks?

As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification acts as a selective filter stabilizing performance under OOD shift rather than a failure mode.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a mechanistic interpretability researcher. The question remains open: Why do polysemantic (multi-meaning) features concentrate in early neural network layers, and is this constraint still accurate in 2024–2026 models?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026. A four-tier feature hierarchy places token-level inputs at the base, where raw word co-occurrence statistics force many meanings into single units (~2025, circuit tracing). Early-layer polysemanticity is NOT a bug but the uncompressed input distribution; depth and later layers perform disentanglement via composition, modularity, and sparsification (~2024–2026). Depth beats width for abstraction because stacking allows step-by-step concept composition rather than wide bottlenecks (~2024). Linear decodability of compositional constituents is a deep-layer property; early layers show overlapping, entangled mixtures (~2023–2025). Sparsification under OOD shift and modular decomposition both describe the flip side: how clean, selective representations emerge (~2024–2026).

Anchor papers (verify; mind their dates):
- arXiv:2301.10884 (2023): Break It Down — structural compositionality
- arXiv:2402.14905 (2024): MobileLLM — depth-vs-width scaling
- arXiv:2605.23821 (2026): Hierarchical Concept Geometry — word co-occurrence geometry
- arXiv:2603.03415 (2026): Farther the Shift, Sparser the Representation — OOD sparsification

Your task:
(1) RE-TEST each constraint. For every finding above, judge whether post-2026 models, training methods (e.g., mixture-of-experts, continual pretraining), or tooling (probing harnesses, SAE-like sparsification) have RELAXED early-layer polysemanticity or sharpened the disentanglement timeline. Separate the durable question (why input statistics shape early representations) from perishable limitations (e.g., whether current dense models show the same pattern in sparse or modular variants). Where a constraint still holds, cite what evidence keeps it standing.
(2) Surface contradicting or superseding work from the last ~6 months, especially any finding that polysemanticity persists *deeper* than the library claims, or that early layers can achieve selective, sparse representations via different initialization/loss.
(3) Propose 2 research questions that assume the regime may have shifted: (a) In models with explicit disentanglement (SAEs, modular routing), does polysemanticity shift to the earliest *module selector* rather than the earliest *dense layer*? (b) Do models trained on code or math (where word co-occurrence is lower-entropy than language) show a different depth–polysemanticity curve?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
