How do lower network layers compress facts versus higher reasoning layers?
This explores what the corpus reveals about how computation is distributed across a transformer's depth — whether early layers do something fact-like and compressive while later layers do something reasoning-like — and where the literal premise of the question holds up.
The corpus complicates the tidy picture in which transformers split labor by depth, with lower layers compressing facts and upper layers reasoning. The sharpest evidence comes from logit-lens work: models trained with hidden chain-of-thought compute the correct answer in layers 1–3, then actively suppress that representation in the final layers to emit format-compliant filler (see "Do transformers hide reasoning before producing filler tokens?"). So early layers aren't just storing facts to be reasoned over later; they can carry finished reasoning that upper layers overwrite. Depth, in other words, isn't a clean pipeline from 'retrieve' to 'reason'. The answer can already be present early and get buried, recoverable only from lower-ranked token predictions.
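To make the logit-lens idea concrete, here is a minimal sketch on hand-built toy data rather than a real model. It projects each layer's hidden state through an unembedding matrix and tracks the rank of the answer token by depth. The vocabulary, the `UNEMBED` and `HIDDEN` values, and the function names are all hypothetical, chosen only to mimic the reported pattern. Real logit-lens code typically also applies the model's final layer norm before unembedding, which is omitted here.

```python
def logit_lens(hidden_by_layer, unembed):
    """Project each layer's hidden state through the unembedding matrix.

    hidden_by_layer: one d-dimensional vector per layer (single token position).
    unembed: one d-dimensional row per vocabulary token.
    Returns one logit vector (length = vocab size) per layer.
    """
    return [
        [sum(h_i * w_i for h_i, w_i in zip(h, row)) for row in unembed]
        for h in hidden_by_layer
    ]


def rank_of(logits, token_id):
    """0-based rank of token_id when logits are sorted descending (0 = top)."""
    target = logits[token_id]
    return sum(1 for x in logits if x > target)


# Hypothetical toy vocabulary: 0 = "answer", 1 = "filler", 2 = "other".
UNEMBED = [
    [1.0, 0.0],  # "answer" direction
    [0.0, 1.0],  # "filler" direction
    [0.5, 0.0],  # "other"
]

# Hypothetical hidden states: the answer direction dominates early,
# and the filler direction dominates in the final layer.
HIDDEN = [
    [2.0, 0.1],  # layer 1
    [2.0, 0.5],  # layer 2
    [2.0, 3.0],  # final layer
]

ranks = [rank_of(layer_logits, 0) for layer_logits in logit_lens(HIDDEN, UNEMBED)]
print(ranks)
```

In this toy setup the answer token is the top prediction in the early layers and drops to second place in the final layer, which is the sense in which it stays recoverable from lower-ranked predictions.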
If you zoom out from layers to the broader question of how models compress, the corpus suggests compression is the default mode everywhere, not a lower-layer specialty. LLMs aggressively maximize statistical compression, capturing broad category structure but discarding the fine-grained, context-sensitive distinctions humans preserve (see "Do LLMs compress concepts more aggressively than humans do?"). That's a useful reframe: the 'fact compression' the question imagines isn't a neutral storage step, it's a lossy bet on what to keep. Importance is also uneven at the token level. Likelihood-preserving pruning of reasoning chains preferentially keeps symbolic-computation tokens and drops grammar and meta-discourse first (see "Which tokens in reasoning chains actually matter most?"). A related finding shows that only about 20% of tokens are high-entropy 'forking points', and that these are the tokens RLVR training primarily adjusts (see "Do high-entropy tokens drive reasoning model improvements?"). Compression and reasoning, in other words, are entangled all the way down.
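The 'forking point' idea can be illustrated with a short sketch: compute the Shannon entropy of each position's next-token distribution and keep the top fraction. The distributions in `DISTS`, the function names, and the per-sequence top-fraction rule are assumptions made for illustration; the cited work may select tokens differently (for example, with a threshold computed over a training batch).

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def forking_positions(dists, fraction=0.2):
    """Indices of the top `fraction` of positions by entropy, in sequence order."""
    k = max(1, round(len(dists) * fraction))
    by_entropy = sorted(
        range(len(dists)), key=lambda i: entropy(dists[i]), reverse=True
    )
    return sorted(by_entropy[:k])


# Hypothetical next-token distributions for a 5-token sequence.
DISTS = [
    [0.97, 0.01, 0.01, 0.01],     # near-deterministic continuation
    [0.25, 0.25, 0.25, 0.25],     # uniform: a genuine decision point
    [0.90, 0.05, 0.03, 0.02],
    [0.99, 0.005, 0.003, 0.002],
    [0.85, 0.05, 0.05, 0.05],
]

forks = forking_positions(DISTS)
print(forks)
```

With five positions and a 20% fraction, only the single highest-entropy position is kept. A training loop that follows the finding above would restrict its gradient updates to positions like this one.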
The entanglement runs the other direction too: the machinery that does reasoning turns out to be good at compression. A reasoning model's raw thinking trace, used directly as shortened context, beats most purpose-built compression methods, with no specialized modules or compression-specific training (see "Can a reasoning model's thinking trace compress context effectively?"). The same mechanism that produces reasoning also produces usable input compression, which undercuts the premise of a strict division of labor. Rather than 'low = compress facts, high = reason,' the picture is more recursive: reasoning is itself a compression operation over evidence.
For a different cut at where abstraction lives, Meta's Large Concept Model abandons token-level processing entirely and reasons over whole-sentence embeddings in a language-agnostic space before decoding (see "Can reasoning happen at the sentence level instead of tokens?"). That's a structural argument that the 'reasoning layer' might be better placed above tokens altogether, and a hint that the fact-vs-reasoning split people intuit by depth might be better engineered as a split by representational grain. If you want to chase the deeper thread, the surprise here is that 'where facts live' and 'where reasoning happens' may not be separable coordinates in a transformer at all: the same layers, and the same compression instinct, do both.
Sources (6 notes)
"Do transformers hide reasoning before producing filler tokens?" Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1–3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.
"Do LLMs compress concepts more aggressively than humans do?" Using Rate-Distortion Theory on cognitive datasets, LLMs capture broad category structure but lose fine-grained distinctions humans preserve. LLMs maximize compression efficiency; humans trade compression for contextual meaning that enables situated action.
"Which tokens in reasoning chains actually matter most?" Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.
"Do high-entropy tokens drive reasoning model improvements?" Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
"Can a reasoning model's thinking trace compress context effectively?" A reasoning model's raw thinking trace, used directly as shortened context, outperforms most dedicated compression methods without requiring specialized modules or compression-specific training. The mechanism that enables reasoning also produces usable input compression.
"Can reasoning happen at the sentence level instead of tokens?" Meta's Large Concept Model operates on sentence embeddings rather than tokens, reasoning in a language-agnostic space before decoding to any target language. This hierarchical approach with paragraph-level planning produces more coherent output than flat token generation.