Where does LLM reasoning actually happen during generation?
Does multi-step reasoning emerge from visible chain-of-thought text, hidden layer dynamics, or simply more computation? Three competing hypotheses make different predictions and can be empirically tested.
The field studies "LLM reasoning" without agreeing on what the primary object of study is. Three views coexist but make incompatible predictions:
H2 (surface CoT): Multi-step reasoning is primarily mediated by explicit surface chain-of-thought. The chain IS the reasoning. For H2 to hold, surface traces would have to provide the most stable causal leverage, but ordinary CoT is often useful without being reliably faithful, and its role varies sharply across tasks.
H0 (generic serial compute): Most apparent reasoning gains are better explained by generic serial compute than by any privileged representational object. More tokens = more FLOPs, regardless of what those tokens say. For H0 to hold, matched serial compute would have to explain most gains, but extra budget alone cannot explain why specific internal states, features, or trajectories can predict or alter reasoning behavior.
H1 (latent-state trajectories): Multi-step reasoning is primarily mediated by latent-state trajectories, with surface CoT serving only as a partial interface. Task-relevant commitment arises in hidden-state dynamics that are only partly verbalized, or not verbalized at all.
The difficulty is that recent methods typically move several factors at once: CoT prompting changes both visible traces and compute allocation; latent reasoning methods change both hidden-state dynamics and compute budget; test-time scaling changes compute and usually changes the output path. Without designs that explicitly disentangle these three factors, experimental results cannot distinguish which hypothesis they support.
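One way to picture a design that separates the three factors is a full factorial layout, where each factor is switched on or off independently of the other two. The sketch below is illustrative only: the factor names, the `toy_accuracy` function, and its numbers are assumptions made up to exercise the code, not results or a protocol from the paper.

```python
from itertools import product
from statistics import mean

# Hypothetical factor names; each is toggled independently of the others.
FACTORS = ("surface_trace", "latent_state", "serial_compute")


def build_conditions():
    """Enumerate all 2^3 on/off combinations of the three factors."""
    return [
        dict(zip(FACTORS, levels))
        for levels in product((False, True), repeat=len(FACTORS))
    ]


def main_effect(results, factor):
    """Mean accuracy with `factor` on minus mean accuracy with it off.

    `results` is a list of (condition_dict, accuracy) pairs. Because the
    design is fully crossed, the other two factors are balanced across
    both groups, so they cancel out of the difference.
    """
    on = [acc for cond, acc in results if cond[factor]]
    off = [acc for cond, acc in results if not cond[factor]]
    return mean(on) - mean(off)


def toy_accuracy(cond):
    """Placeholder scorer with arbitrary numbers, standing in for a real eval."""
    return (
        0.5
        + 0.05 * cond["surface_trace"]
        + 0.30 * cond["latent_state"]
        + 0.05 * cond["serial_compute"]
    )


if __name__ == "__main__":
    results = [(cond, toy_accuracy(cond)) for cond in build_conditions()]
    for factor in FACTORS:
        print(factor, round(main_effect(results, factor), 3))
```

In a real study, `toy_accuracy` would be replaced by an evaluation run under each condition. The hard part is the interventions themselves, for example holding token budget fixed while changing trace content, which this sketch does not address.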
The paper argues H1 should be the default working hypothesis — not as a task-independent verdict, but because the strongest evidence currently available points toward latent-state dynamics as having the most stable causal leverage. The recommendation: treat latent-state dynamics as the default object of study and design evaluations that explicitly separate surface traces, latent states, and serial compute.
This framework organizes several existing findings. The evidence in "Do language models actually use their reasoning steps?" empirically weakens the H2 assumption: if surface traces aren't causally faithful, they cannot be the primary reasoning medium. According to "Does chain-of-thought reasoning reflect genuine thinking or performance?", H2 fails specifically on easy tasks (where the answer is determined before CoT begins) while H1 and H0 remain viable. And in "Can we trigger reasoning without explicit chain-of-thought prompts?", direct latent intervention provides causal evidence for H1 that neither H2 nor H0 can explain.
Additional evidence converges from multiple angles. In "Why does reasoning training help math but hurt medical tasks?", the layer separation provides architectural grounding for H1: reasoning is a latent higher-layer process, not a surface token-generation phenomenon. In "Why do language models fail to act on their own reasoning?", even when the surface trace (rationale) is correct, the latent computation (action selection) diverges, which is a behavioral signature of the surface-latent disconnect that H1 predicts. And per "Can we measure how deeply a model actually reasons?", there now exists an H1-native measurement tool: DTR tracks latent computational depth per token rather than surface trace properties, and it outperforms surface-level metrics as an accuracy predictor.
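The general idea of measuring how a token's prediction shifts across layers can be sketched as follows. This is not DTR's definition, which this note does not spell out. It is a minimal stand-in that assumes per-layer next-token distributions are already available (for example, from projecting intermediate hidden states onto the vocabulary). The function names and the threshold value are hypothetical.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, range [0, 1]) between two
    discrete distributions given as equal-length lists of probabilities."""

    def kl(a, b):
        # Terms with a == 0 contribute nothing; b is a mixture, so b > 0
        # whenever a > 0.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def settling_layer(layer_dists, threshold=0.1):
    """Earliest layer index from which every later layer's distribution
    stays within `threshold` JS divergence of the final layer's.

    A late settling layer means the prediction kept changing deep into
    the network; an early one means it was fixed almost immediately.
    """
    final = layer_dists[-1]
    settled = len(layer_dists) - 1
    # Walk backwards from the last layer until a layer disagrees with it.
    for i in range(len(layer_dists) - 1, -1, -1):
        if js_divergence(layer_dists[i], final) <= threshold:
            settled = i
        else:
            break
    return settled


if __name__ == "__main__":
    # Toy two-word vocabulary, four layers; values are made up.
    early = [[0.5, 0.5], [0.9, 0.1], [0.95, 0.05], [0.95, 0.05]]
    late = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5], [0.95, 0.05]]
    print(settling_layer(early), settling_layer(late))
```

A per-token quantity like this depends only on internal states, not on what the visible trace says, which is the property the note attributes to an H1-native measurement.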
The sharpest implication: the field's default assumption (H2) may be distorting research priorities. If the reasoning object is latent, then benchmarks that evaluate chains, faithfulness metrics that read traces, and interpretability methods that parse CoT are all measuring a secondary phenomenon.
Inquiring lines that use this note as a source (34)
This note is a source for these synthesized inquiries.
- Why do Generation-Then-Comprehension and AI Delegation produce opposite learning outcomes?
- What distinguishes LLM fabrication from genuine theoretical reasoning?
- Why do LLM explanations cite similarity and diversity more as options increase?
- Can evidence density alone shift an LLM from generation to reasoning?
- Does information stored in neural networks necessarily influence generation decisions?
- What graph structures would enable transformational creative reasoning in LLMs?
- How does open-ended evolver reasoning identify patterns across heterogeneous user trajectories?
- Should LLM reasoning be studied as latent state trajectories rather than surface text?
- Is relevant knowledge encoded in LMs but not causally active in generation?
- Why do LLMs generate logical forms without preserving semantic content?
- What hidden computations happen inside transformer layers during reasoning?
- What internal mechanisms explain LLM reasoning and representation limits?
- Why can LLMs interpret formal logic better than they generate it?
- Do LLMs lack architectural scaffolding for compositional reasoning?
- Do bidirectional and any-order generation expose different parts of the joint distribution?
- Can critique-only calls in LLMs exploit a measurable gap between generation and evaluation?
- How early in token generation does the reasoning mode activate?
- Does training data format shape which reasoning strategies LLMs develop?
- Does LLM reasoning always match the outputs it generates?
- Can extended thinking modes introduce genuine rhetorical exploration to LLMs?
- Can knowledge encoded in model representations fail to influence generation?
- What distinguishes LLM Programs from chain-of-thought and agentic frameworks?
- How do knowing and doing diverge in LLM decision-making?
- How does test-time verification decouple the act of checking from reasoning generation?
- How do generative PRMs ensure their reasoning actually influences judgment instead of decorating outputs?
- Does reasoning happen in hidden space or in generated tokens?
- Where does the generation-verification gap appear in test-time compute?
- How does latent reasoning recursion compare to chain-of-thought reasoning?
- How does tool integration leverage comprehension without demanding perfect generation?
- How do LLM explanations diverge from actual internal reasoning?
- How does single-pass generation differ from multi-stage synthesis architecturally?
- Why does LLM performance improve when forecasting tasks include organized reasoning?
- What prevents LLM representations from causally influencing generation outputs?
- Can seedless generation maintain explainability while scaling control?
Related concepts in this collection (8)
- Do language models actually use their reasoning steps?
  Chain-of-thought reasoning looks valid on the surface, but does each step genuinely influence the model's final answer, or are the reasoning chains decorative? This matters for trusting AI explanations.
  empirical evidence weakening H2
- Does chain-of-thought reasoning reflect genuine thinking or performance?
  When language models generate step-by-step reasoning, are they actually thinking through problems or just producing text that looks like reasoning? This matters for understanding whether extended reasoning tokens add real computational value.
  difficulty-dependent H2 failure
- Can models reason without generating visible thinking tokens?
  Explores whether intermediate reasoning must be verbalized as text tokens, or if models can think in hidden continuous space. Challenges a foundational assumption about how language models scale their reasoning capabilities.
  H1 implementations
- Does chain-of-thought reasoning reveal genuine inference or pattern matching?
  Explores whether CoT instructions unlock real reasoning capabilities or simply constrain models to mimic familiar reasoning patterns from training data. This matters for understanding whether language models can actually reason abstractly.
  theoretical argument against H2
- Can we trigger reasoning without explicit chain-of-thought prompts?
  This research asks whether models possess latent reasoning capabilities that can be activated through direct feature steering, independent of chain-of-thought instructions. Understanding this matters for making reasoning more efficient and controllable.
  causal evidence for H1
- Why does reasoning training help math but hurt medical tasks?
  Explores whether reasoning and knowledge rely on different network mechanisms, and why training one might undermine the other across different domains.
  provides layer-level mechanistic grounding for H1: reasoning localizes to higher layers as a latent process, not as surface token generation
- Can we measure how deeply a model actually reasons?
  What if reasoning quality isn't about length or confidence, but about how much a model's predictions shift across its internal layers? Can tracking these shifts reveal genuine thinking versus pattern-matching?
  an H1-native measurement: DTR measures latent computational depth rather than surface trace properties
- Why do language models fail to act on their own reasoning?
  LLMs produce correct explanations far more often than they produce correct actions. What causes this knowing-doing gap, and can training methods close it?
  behavioral evidence for the latent-surface disconnect: models produce correct surface reasoning but act on latent computations that don't follow it
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- LLM Reasoning Is Latent, Not the Chain of Thought
- Beyond the Last Answer: Your Reasoning Trace Uncovers More than You Think
- Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
- Do Large Language Models Latently Perform Multi-Hop Reasoning?
- DeepSeek-R1 Thoughtology: Let's think about LLM Reasoning
- Eliciting Reasoning in Language Models with Cognitive Tools
- Reasoning Beyond Chain-of-Thought: A Latent Computational Mode in Large Language Models
- Hierarchical Reasoning Model
Original note title
LLM reasoning should be studied as latent-state trajectory formation not as faithful surface chain-of-thought — three competing hypotheses can be empirically separated