SYNTHESIS NOTE


Where does LLM reasoning actually happen during generation?

Does multi-step reasoning emerge from visible chain-of-thought text, hidden layer dynamics, or simply more computation? Three competing hypotheses make different predictions and can be empirically tested.

Synthesis note · 2026-04-20 · sourced from Cognitive Models Latent

The field studies "LLM reasoning" without agreeing on what the primary object of study is. Three views coexist but make incompatible predictions:

H2 (surface CoT): Multi-step reasoning is primarily mediated by explicit surface chain-of-thought. The chain IS the reasoning. This requires surface traces to provide the most stable causal leverage — but ordinary CoT is often useful without being reliably faithful, and its role varies sharply across tasks.

H0 (generic serial compute): Most apparent reasoning gains are better explained by generic serial compute than by any privileged representational object. More tokens = more FLOPs, regardless of what those tokens say. This requires matched serial compute to explain most gains — but extra budget alone cannot explain why specific internal states, features, or trajectories can predict or alter reasoning behavior.

H1 (latent-state trajectories): Multi-step reasoning is primarily mediated by latent-state trajectories, with surface CoT serving only as a partial interface. Task-relevant commitment arises in hidden-state dynamics that are only partly verbalized, or not verbalized at all.

The difficulty is that recent methods typically move several factors at once: CoT prompting changes both visible traces and compute allocation; latent reasoning methods change both hidden-state dynamics and compute budget; test-time scaling changes compute and usually changes the output path. Without designs that explicitly disentangle these three factors, experimental results cannot distinguish which hypothesis they support.
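One way to picture a design that separates the three factors is a 2x2x2 factorial grid, with each factor toggled independently. The sketch below is illustrative only: the `Condition` type, the factor names, the mapping from dominant factor to hypothesis, and the toy accuracy numbers are all assumptions made up for this example, not results or a protocol from the source.

```python
from dataclasses import dataclass
from itertools import product
from statistics import mean

# Hypothetical factor names, one per hypothesis in the note.
FACTORS = ("surface_trace", "latent_state", "serial_compute")
HYPOTHESIS_FOR = {
    "surface_trace": "H2",
    "latent_state": "H1",
    "serial_compute": "H0",
}


@dataclass(frozen=True)
class Condition:
    surface_trace: bool   # visible CoT left intact (True) or scrambled/removed (False)
    latent_state: bool    # hidden-state trajectory left intact or perturbed
    serial_compute: bool  # extended token budget or short budget


def all_conditions():
    """Enumerate the 8 cells of the 2x2x2 design."""
    return [Condition(*bits) for bits in product((False, True), repeat=3)]


def main_effect(accuracy, factor):
    """Mean accuracy with the factor on minus mean accuracy with it off,
    averaged over every setting of the other two factors."""
    on = [acc for cond, acc in accuracy.items() if getattr(cond, factor)]
    off = [acc for cond, acc in accuracy.items() if not getattr(cond, factor)]
    return mean(on) - mean(off)


def dominant_factor(accuracy):
    """Return the factor with the largest main effect, plus all effects."""
    effects = {f: main_effect(accuracy, f) for f in FACTORS}
    return max(effects, key=effects.get), effects


# Toy, made-up accuracies in which the latent factor matters most.
toy_accuracy = {
    c: 0.40
    + 0.05 * c.surface_trace
    + 0.30 * c.latent_state
    + 0.10 * c.serial_compute
    for c in all_conditions()
}

factor, effects = dominant_factor(toy_accuracy)
print(factor, HYPOTHESIS_FOR[factor], effects)
```

The point of the grid is that every factor is varied while the other two are held at each of their settings, so a gain cannot be silently attributed to the wrong factor. Real experiments would also need interaction terms and task-by-task breakdowns, which this sketch omits.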

The paper argues H1 should be the default working hypothesis — not as a task-independent verdict, but because the strongest evidence currently available points toward latent-state dynamics as having the most stable causal leverage. The recommendation: treat latent-state dynamics as the default object of study and design evaluations that explicitly separate surface traces, latent states, and serial compute.

This framework organizes several existing findings. The evidence in "Do language models actually use their reasoning steps?" weakens the H2 assumption empirically: if surface traces are not causally faithful, they cannot be the primary reasoning medium. "Does chain-of-thought reasoning reflect genuine thinking or performance?" shows H2 failing specifically on easy tasks, where the answer is determined before CoT begins, while H1 and H0 remain viable. And in "Can we trigger reasoning without explicit chain-of-thought prompts?", direct latent intervention provides causal evidence for H1 that neither H2 nor H0 can explain.
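The faithfulness argument rests on an intervention: alter the surface trace and check whether the final answer moves. A minimal sketch of that check follows, assuming a hypothetical `answer_fn(question, steps)` callable standing in for a model call. The two toy models and the `sensitivity` score are illustrative inventions, not the method used in the notes cited above.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: given a question and a (possibly truncated) chain
# of reasoning steps, return the model's final answer as a string.
AnswerFn = Callable[[str, Sequence[str]], str]


def truncations(steps: Sequence[str]) -> List[List[str]]:
    """All proper prefixes of the chain, from empty up to all but the last step."""
    return [list(steps[:k]) for k in range(len(steps))]


def sensitivity(answer_fn: AnswerFn, question: str, steps: Sequence[str]) -> float:
    """Fraction of truncated chains that change the final answer.

    Near 0.0: the answer ignores the visible trace.
    Near 1.0: the answer depends on the visible trace.
    """
    if not steps:
        return 0.0
    baseline = answer_fn(question, steps)
    changed = [answer_fn(question, t) != baseline for t in truncations(steps)]
    return sum(changed) / len(changed)


# Toy stand-ins for a real model (assumptions for illustration only).
def trace_ignoring_model(question: str, steps: Sequence[str]) -> str:
    return "42"  # answer is fixed before any chain is read


def trace_following_model(question: str, steps: Sequence[str]) -> str:
    return steps[-1] if steps else "unknown"  # answer is read off the last step


question = "What is 17 + 25?"
steps = ["17 + 25 = 42", "so the total is 42", "42"]

print(sensitivity(trace_ignoring_model, question, steps))
print(sensitivity(trace_following_model, question, steps))
```

Truncation is only one possible perturbation; inserting errors or paraphrasing steps would probe the same question from other directions. A low score here is evidence against H2 for that task but does not by itself separate H1 from H0.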

Additional evidence converges from multiple angles. In "Why does reasoning training help math but hurt medical tasks?", the layer separation provides architectural grounding for H1: reasoning is a latent higher-layer process, not a surface token-generation phenomenon. In "Why do language models fail to act on their own reasoning?", the latent computation (action selection) diverges even when the surface trace (rationale) is correct, which is the behavioral signature of the surface-latent disconnect that H1 predicts. And "Can we measure how deeply a model actually reasons?" supplies an H1-native measurement tool: DTR tracks latent computational depth per token rather than surface trace properties, and it outperforms surface-level metrics as an accuracy predictor.
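The note does not define DTR beyond saying it tracks latent computational depth per token. To make the general idea of a layer-wise depth measure concrete, the sketch below uses a hypothetical proxy that is not DTR itself: for one token, it takes a next-token distribution read out at each layer and reports the earliest layer from which the distribution stays close to the final layer's. The function names, the Jensen-Shannon distance, the 0.1 threshold, and the toy distributions are all assumptions for illustration.

```python
import math
from typing import List, Sequence


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence (natural log) between two distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x: Sequence[float], y: Sequence[float]) -> float:
        # Terms with zero probability in x contribute nothing.
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def settling_depth(layer_dists: List[Sequence[float]], threshold: float = 0.1) -> int:
    """Index of the earliest layer from which every later layer's
    distribution stays within `threshold` of the final layer's."""
    final = layer_dists[-1]
    depth = len(layer_dists) - 1
    # Walk backwards from the last layer until the distribution drifts away.
    for i in range(len(layer_dists) - 1, -1, -1):
        if js_divergence(layer_dists[i], final) <= threshold:
            depth = i
        else:
            break
    return depth


# Toy per-layer distributions over a two-word vocabulary (made up).
shallow_token = [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1], [0.9, 0.1]]
deep_token = [[0.5, 0.5], [0.5, 0.5], [0.2, 0.8], [0.05, 0.95]]

print(settling_depth(shallow_token))
print(settling_depth(deep_token))
```

Under this proxy, a token whose prediction is already fixed at the first layer gets a small depth, while a token whose prediction only settles in later layers gets a larger one. Whether DTR is computed this way is not stated in the note.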

The sharpest implication: the field's default assumption (H2) may be distorting research priorities. If the reasoning object is latent, then benchmarks that evaluate chains, faithfulness metrics that read traces, and interpretability methods that parse CoT are all measuring a secondary phenomenon.

Inquiring lines that read this note 37

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

How do training priors constrain what context information can override?

How faithfully do LLMs reflect their actual reasoning in outputs and explanations?

What limits mechanistic interpretability's ability to characterize models?

Does information stored in neural networks necessarily influence generation decisions?

How does reasoning graph topology affect breakthrough insights and generalization?

What graph structures would enable transformational creative reasoning in LLMs?

Why do reasoning models fail at systematic problem-solving and search?

How does open-ended evolver reasoning identify patterns across heterogeneous user trajectories?

Do language models learn genuine linguistic structure or just surface patterns?

Do language models perform faithful symbolic reasoning independent of semantic grounding?

Does recurrence enable reasoning capabilities that fixed-depth transformers cannot achieve?

What hidden computations happen inside transformer layers during reasoning?

What structural advantages do diffusion language models offer over autoregressive methods?

Do bidirectional and any-order generation expose different parts of the joint distribution?

Why can LLMs generate ideas better than they evaluate them?

Can critique-only calls in LLMs exploit a measurable gap between generation and evaluation?

How does latent reasoning compare to verbalized chain-of-thought?

Why does training format shape reasoning strategy more than domain content?

Does training data format shape which reasoning strategies LLMs develop?

Why does verification consistently lag behind AI generation?

What properties determine whether reward signals teach genuine reasoning?

How do generative PRMs ensure their reasoning actually influences judgment instead of decorating outputs?

Does decoupling planning from execution improve multi-step reasoning accuracy?

How does single-pass generation differ from multi-stage synthesis architecturally?

Do language models develop causal world models or rely on statistical patterns?

What are the consequences of models training on synthetic data?

Can seedless generation maintain explainability while scaling control?

How do soft continuous representations explore multiple reasoning paths simultaneously?

How do soft thinking and token-level mixtures explore multiple paths simultaneously?

How do prompt structure and constraints affect model instruction reliability?

Why do semantically related prompts converge into attractor states in middle layers?

Related concepts in this collection 8

15 direct connections · 170 in 2-hop network · dense cluster

Do language models actually use their reasoning steps? Chain-of-thought reasoning looks valid on the surface, but does each step genuinely influence the model's final answer, or are the reasoning chains decorative? This matters for trusting AI explanations.
empirical evidence weakening H2
Does chain-of-thought reasoning reflect genuine thinking or performance? When language models generate step-by-step reasoning, are they actually thinking through problems or just producing text that looks like reasoning? This matters for understanding whether extended reasoning tokens add real computational value.
difficulty-dependent H2 failure
Can models reason without generating visible thinking tokens? Explores whether intermediate reasoning must be verbalized as text tokens, or if models can think in hidden continuous space. Challenges a foundational assumption about how language models scale their reasoning capabilities.
H1 implementations
Does chain-of-thought reasoning reveal genuine inference or pattern matching? Explores whether CoT instructions unlock real reasoning capabilities or simply constrain models to mimic familiar reasoning patterns from training data. This matters for understanding whether language models can actually reason abstractly.
theoretical argument against H2
Can we trigger reasoning without explicit chain-of-thought prompts? This research asks whether models possess latent reasoning capabilities that can be activated through direct feature steering, independent of chain-of-thought instructions. Understanding this matters for making reasoning more efficient and controllable.
causal evidence for H1
Why does reasoning training help math but hurt medical tasks? Explores whether reasoning and knowledge rely on different network mechanisms, and why training one might undermine the other across different domains.
provides layer-level mechanistic grounding for H1: reasoning localizes to higher layers as a latent process, not as surface token generation
Can we measure how deeply a model actually reasons? What if reasoning quality isn't about length or confidence, but about how much a model's predictions shift across its internal layers? Can tracking these shifts reveal genuine thinking versus pattern-matching?
an H1-native measurement: DTR measures latent computational depth rather than surface trace properties
Why do language models fail to act on their own reasoning? LLMs produce correct explanations far more often than they produce correct actions. What causes this knowing-doing gap, and can training methods close it?
behavioral evidence for the latent-surface disconnect: models produce correct surface reasoning but act on latent computations that don't follow it
