What are the stages of inference inside language models?
This explores what actually happens inside a language model as it produces an answer — the internal phases of computation across the layers, rather than the visible 'thinking' it writes out. The corpus suggests inference is genuinely staged, but the stages live inside the network and don't match the tidy story the model tells on screen.
A striking finding is that early layers do a lot of the real work. In models trained with hidden reasoning tokens, logit lens analysis shows the correct answer being computed in the first few layers (layers 1-3), then *actively suppressed* in the final layers so the output conforms to a required filler format. The reasoning is not erased: it stays recoverable from lower-ranked token predictions and has simply been outranked at the top of the distribution (Do transformers hide reasoning before producing filler tokens?). So one way to read 'stages of inference' is literally depth: compute, then reformat. Relatedly, models hold *multiple* candidate tasks in superposition during the forward pass and only collapse to a single one after the first token is generated: the early stage is plural and ambiguous, and the decoding stage forces commitment (Can LLMs handle multiple tasks at once during inference?).
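The compute-then-suppress pattern can be made concrete with a toy logit lens, which projects each layer's hidden state through the unembedding matrix and checks where a target token ranks. This is a minimal sketch under stated assumptions: the vocabulary, the unembedding matrix, and the per-layer hidden states are invented for illustration, not taken from any model, and the final layer norm that a real logit lens usually applies before unembedding is omitted.

```python
def logit_lens(hidden_states, unembed):
    """Project every layer's hidden state onto the vocabulary.

    hidden_states: one vector (list of floats) per layer
    unembed: one vector per vocabulary token
    Returns one list of logits per layer.
    """
    return [
        [sum(h_i * w_i for h_i, w_i in zip(h, row)) for row in unembed]
        for h in hidden_states
    ]


def token_rank(logits, token_id):
    """Rank of token_id within logits (0 means top prediction)."""
    return sum(1 for value in logits if value > logits[token_id])


# Hypothetical toy data: 3-token vocabulary, 2-dim hidden states, 4 layers.
VOCAB = ["7", ".", "the"]            # "7" is the answer, "." is the filler
UNEMBED = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
HIDDEN = [
    [0.2, 0.1],  # layer 1: answer slightly ahead
    [1.0, 0.2],  # layer 2: answer clearly on top
    [1.0, 1.5],  # layer 3: filler overtakes the answer
    [0.8, 3.0],  # layer 4: filler dominates, answer sits at rank 1
]

per_layer_logits = logit_lens(HIDDEN, UNEMBED)
answer_ranks = [token_rank(logits, 0) for logits in per_layer_logits]
top_tokens = [
    VOCAB[max(range(len(logits)), key=logits.__getitem__)]
    for logits in per_layer_logits
]
```

In this toy, `answer_ranks` comes out as `[0, 0, 1, 1]`: the answer is the top prediction in the early layers and drops to second place once the filler token takes over, which is the sense in which it remains recoverable from lower-ranked predictions.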
Mechanistic interpretability adds a different cut: not stages in time but tiers of understanding that coexist. Models show conceptual understanding (features as directions), state-of-world understanding (factual connections), and principled understanding (compact circuits). Crucially, the higher tiers don't replace the lower heuristics; they sit on top of a patchwork (Do language models understand in fundamentally different ways?). That's why identical benchmark scores can hide radically different internal machinery: external performance and internal structure are decoupled, so 'what stage is it in' isn't readable off the output (What actually happens inside the minds of language models?; What really happens inside a language model?).
The twist worth carrying away: the printed reasoning trace is often *not* a stage of inference at all. Traces behave as stylistic mimicry — invalid logical steps perform almost as well as valid ones, so the visible 'reasoning' isn't what produced the answer (Do reasoning traces show how models actually think?). The actual computation can happen in latent space without any verbalized steps, scaling test-time compute through hidden-state iteration (Can models reason without generating visible thinking tokens?), and architectures that *loop* their latent computation during pretraining produce intermediate states that align far more faithfully with the final answer than chain-of-thought does (Can reasoning be learned during pretraining rather than after?).
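Scaling test-time compute through hidden-state iteration can be illustrated with a toy loop. This is only a sketch of the general idea, not the architecture of any model named above: `latent_step`, its weights, and the input vector are hypothetical. The same block is applied repeatedly to a hidden state, no tokens are emitted between iterations, and the compute budget is just the loop count.

```python
import math


def latent_step(h, x, w_h=0.5, w_x=0.5):
    """One update of the hidden state by a weight-tied block (hypothetical toy)."""
    return [math.tanh(w_h * h_i + w_x * x_i) for h_i, x_i in zip(h, x)]


def latent_reason(x, n_loops):
    """Apply the same block n_loops times; nothing is verbalized in between."""
    h = [0.0] * len(x)
    trajectory = [h]
    for _ in range(n_loops):
        h = latent_step(h, x)
        trajectory.append(h)
    return h, trajectory


def step_sizes(trajectory):
    """Largest per-coordinate change between consecutive hidden states."""
    return [
        max(abs(b_i - a_i) for a_i, b_i in zip(a, b))
        for a, b in zip(trajectory, trajectory[1:])
    ]


x = [1.0, -2.0]                       # assumed input embedding
final_state, traj = latent_reason(x, n_loops=8)
deltas = step_sizes(traj)             # shrinks as the state settles
```

Raising `n_loops` spends more computation on the same input without producing any extra tokens. In this toy the update happens to be a contraction, so the state settles toward a fixed point; real looped models need not behave that way.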
So there's no single canonical 'inference pipeline,' but a consistent picture: a forward pass moves from plural, semantically driven early computation (Do large language models reason symbolically or semantically?) through layered refinement and tiered circuits, to a committed, formatted output. The visible reasoning sits *beside* that process as performance, not inside it as cause. If you want the surprising part: the model often has the answer before it 'thinks,' then dresses it up afterward.
Sources (9 notes)
Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.
Large language models represent multiple complete, computationally distinct tasks simultaneously during inference—a macroscopic phenomenon separate from feature-level superposition. However, autoregressive decoding forces convergence to a single task after the first token, preventing practical multi-task generation.
Mechanistic interpretability reveals conceptual understanding (features as directions), state-of-world understanding (factual connections), and principled understanding (compact circuits). Crucially, higher tiers coexist with lower-tier heuristics rather than replacing them, creating a patchwork of capabilities.
LLMs can achieve identical accuracy while maintaining radically different internal representations, and mechanisms that appear interpretable may not causally drive outputs. This decoupling means performance metrics alone mask crucial differences in how models actually work.
Research into mechanistic interpretability, cognitive models, and training dynamics shows that identical benchmark performance conceals radically different internal structures. Improving one capability (helpfulness, accuracy) reliably degrades others (faithfulness, calibration, diversity).
LLM reasoning traces function as persuasive appearances rather than reliable explanations of computation. Invalid logical steps perform nearly as well as valid ones, and corrupted traces generalize comparably, showing that semantic correctness is not what produces the performance gains.
Multiple architectures—depth-recurrent models, Heima, and Coconut—demonstrate that test-time compute scales through hidden state iteration rather than token generation. This suggests verbalization is a training artifact, not a reasoning requirement.
Ouro's 1.4B–2.6B models match 12B baselines by performing reasoning during pretraining via iterative latent loops, not by storing more knowledge. Their intermediate latent states align strongly with final outputs, making them more faithful than divergent chain-of-thought traces.
When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.