INQUIRING LINE

How does latent reasoning recursion compare to chain-of-thought reasoning?

This explores the difference between reasoning that loops in a model's hidden state (recursive latent reasoning) and reasoning written out as explicit text steps (chain-of-thought) — what each actually does and where the real computation lives.


The corpus suggests that latent recursion and chain-of-thought are not just two styles of the same thing: they may be operating at different layers entirely, with chain-of-thought serving as a partial readout of a process that mostly happens elsewhere.

The sharpest framing is that LLM reasoning is best understood as a trajectory through hidden states, not as the text it produces (see "Where does LLM reasoning actually happen during generation?"). On this view, the visible chain-of-thought is an interface, a surface narration, while the actual inference runs underneath it. That reframes the comparison: latent reasoning isn't an alternative to CoT so much as the thing CoT is gesturing at. A striking demonstration is that steering a single internal feature can trigger reasoning behavior that matches or beats explicit CoT prompting, and it does so without writing any steps at all (see "Can we trigger reasoning without explicit chain-of-thought prompts?"). The reasoning was a latent capability the whole time; the prose was optional.
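To make the steering idea concrete, here is a minimal sketch. It is not the method from the cited note. It assumes a hidden state is a plain vector and that a "reasoning feature" corresponds to a single direction in that space; `steer`, `feature_direction`, and `strength` are hypothetical names, and the vectors are random stand-ins rather than real model activations.

```python
import math
import random

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def steer(hidden_state, feature_direction, strength):
    """Shift a hidden state along one feature direction (illustrative only)."""
    unit = normalize(feature_direction)
    return [h + strength * u for h, u in zip(hidden_state, unit)]

rng = random.Random(0)
hidden_state = [rng.gauss(0.0, 1.0) for _ in range(16)]       # stand-in activation
feature_direction = [rng.gauss(0.0, 1.0) for _ in range(16)]  # stand-in feature
unit = normalize(feature_direction)

steered = steer(hidden_state, feature_direction, strength=4.0)

# The activation along the feature moves by exactly `strength`;
# nothing is written to the output text at any point.
before = dot(hidden_state, unit)
after = dot(steered, unit)
print(f"feature activation before: {before:.3f}, after: {after:.3f}")
```

The point of the sketch is only that the intervention lives entirely in state space: the prompt is unchanged and no intermediate steps are generated.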

Where recursion adds something genuinely new is in handling uncertainty. Recursive latent reasoners that update their internal state deterministically can only carry one line of thought forward. Making those latent transitions stochastic lets a model hold a distribution over possible solutions instead of committing early, which is useful when a problem is ambiguous or has several valid strategies (see "Can stochastic latent reasoning help models explore multiple solutions?"). That same machinery lets reasoning scale in width, by sampling parallel internal trajectories, rather than only in depth, sidestepping the serial latency cost of longer and longer chains (see "Can reasoning systems scale wider instead of only deeper?"). Chain-of-thought, being a single linear text stream, is structurally stuck scaling in depth.
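The difference between deterministic and stochastic latent updates can be illustrated with a toy recursion. This is a sketch under strong assumptions, not GRAM or any published architecture: the latent state is a single number, the update rule `tanh(3 * state)` is an invented stand-in with two attractors (near +1 and -1) playing the role of two valid strategies, and `noise_scale` is a hypothetical knob.

```python
import math
import random

def step(state, noise_scale, rng):
    """One recursive latent update: deterministic drift plus optional noise."""
    drift = math.tanh(3.0 * state)  # pulls the state toward +1 or -1
    return drift + noise_scale * rng.gauss(0.0, 1.0)

def run_trajectories(n_paths, n_steps, noise_scale, seed=0):
    """Run independent latent trajectories from the same ambiguous start.

    The paths are looped over here for simplicity; because they do not
    depend on each other, they could be sampled in parallel (width).
    """
    rng = random.Random(seed)
    finals = []
    for _ in range(n_paths):
        state = 0.05  # nearly undecided between the two attractors
        for _ in range(n_steps):
            state = step(state, noise_scale, rng)
        finals.append(state)
    return finals

def count_modes(finals):
    """Count how many trajectories ended on each side of zero."""
    positive = sum(1 for f in finals if f > 0)
    return positive, len(finals) - positive

deterministic = run_trajectories(n_paths=200, n_steps=10, noise_scale=0.0)
stochastic = run_trajectories(n_paths=200, n_steps=10, noise_scale=0.5)

print("deterministic (positive, negative):", count_modes(deterministic))
print("stochastic    (positive, negative):", count_modes(stochastic))
```

With `noise_scale=0.0` every path is identical, so running more of them adds nothing. With noise, the same starting state spreads across both attractors, which is the sense in which width only becomes useful once the transitions are stochastic.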

Meanwhile, the corpus is unusually skeptical about what CoT really is. Several notes converge on the verdict that chain-of-thought reproduces the *form* of reasoning through learned patterns rather than performing genuine logical inference: performance tracks format more than content, structurally invalid prompts work nearly as well as valid ones, and accuracy degrades predictably under distribution shift (see "Does chain-of-thought reasoning reveal genuine inference or pattern matching?" and the two notes titled "What makes chain-of-thought reasoning actually work?"). The same brittleness shows up when semantic content is stripped out: models reason through associations, not symbol manipulation (see "Do large language models reason symbolically or semantically?"). So CoT's apparent transparency may be partly theater.

The practical upshot is that much of a chain-of-thought is doing no computational work. Concise chains match verbose ones at under 8% of the tokens (see "Can minimal reasoning chains match full explanations?"), dynamic pruning can cut three-quarters of steps with no accuracy loss (see "Can reasoning steps be dynamically pruned without losing accuracy?"), accuracy follows an inverted-U in chain length, with the optimal length shrinking as models get more capable (see "Why does chain of thought accuracy eventually decline with length?"), and for simple questions step-by-step prompting actively hurts (see "Why do some questions perform better without step-by-step reasoning?"). Read together, these point the same direction as the latent-reasoning work: the verbose chain is mostly documentation wrapped around a much smaller hidden core. The interesting question the corpus leaves you with isn't 'which is better'; it's whether explicit reasoning text is a window into the computation or a story told after the fact.
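The pruning idea can be sketched as a simple selection over scored steps. This is an illustration only, not the procedure of the cited framework: `prune_steps` is a hypothetical helper, the per-step scores are invented numbers standing in for "how much later steps attend to this one", and the `keep_fraction` of 0.25 simply mirrors the "three-quarters of steps" figure above.

```python
def prune_steps(steps, attention_scores, keep_fraction=0.25):
    """Keep the highest-scoring steps, preserving their original order."""
    if len(steps) != len(attention_scores):
        raise ValueError("each step needs exactly one score")
    n_keep = max(1, round(len(steps) * keep_fraction))
    # Rank step indices by score, highest first, then restore reading order.
    ranked = sorted(range(len(steps)),
                    key=lambda i: attention_scores[i],
                    reverse=True)
    kept = sorted(ranked[:n_keep])
    return [steps[i] for i in kept]

steps = [
    "Restate the problem",
    "Set up the equation 3x + 2 = 11",
    "Double-check the restatement",
    "Subtract 2 from both sides: 3x = 9",
    "Consider an alternative approach, then drop it",
    "Divide by 3: x = 3",
    "Verify by substitution",
    "Summarize the answer",
]
# Hypothetical downstream-attention scores, one per step.
scores = [0.05, 0.30, 0.02, 0.25, 0.03, 0.28, 0.04, 0.03]

pruned = prune_steps(steps, scores, keep_fraction=0.25)
for line in pruned:
    print(line)
```

Whether a pruned chain like this preserves accuracy is an empirical question about the model; the sketch only shows the selection step, not any evidence for it.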


Sources (12 notes)

Where does LLM reasoning actually happen during generation?

Evidence from CoT faithfulness tests, feature steering, and layer analysis suggests latent-state dynamics drive reasoning, while surface chain-of-thought serves as a partial interface. Hidden reasoning processes should be the default focus of study.

Can we trigger reasoning without explicit chain-of-thought prompts?

SAE-identified reasoning features can be directly steered to match or exceed chain-of-thought performance across six model families. This reasoning mode activates early in generation and overrides surface-level instructions, suggesting latent reasoning is a fundamental capability independent of explicit prompting.

Can stochastic latent reasoning help models explore multiple solutions?

GRAM replaces deterministic latent updates with stochastic sampling, enabling models to represent distributions over solutions rather than single predictions. This allows handling of ambiguous problems and multiple valid strategies that deterministic designs cannot represent.

Can reasoning systems scale wider instead of only deeper?

GRAM shows that stochastic latent transitions enabling parallel trajectory sampling sidestep the serial latency cost of depth-only scaling. Width matches token-level parallelism benefits: independent paths sample the solution space without variance inflation.

Does chain-of-thought reasoning reveal genuine inference or pattern matching?

CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.

What makes chain-of-thought reasoning actually work?

CoT systems reproduce the form of reasoning through pattern matching rather than performing genuine logical inference. This explains why format effects dominate content, why structurally invalid prompts succeed, and why stronger reasoning models become less instruction-compliant.

What makes chain-of-thought reasoning actually work?

Research shows training format shapes reasoning strategy 7.5× more than domain, demo position swings accuracy 20%, and invalid CoT prompts work as well as valid ones. CoT is pattern-guided generation, not formal logic.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Can minimal reasoning chains match full explanations?

Chain of Draft achieves equivalent accuracy to standard chain-of-thought on arithmetic, symbolic, and commonsense tasks while using only 7.6% of tokens. The 92.4% of removed tokens served style and documentation, not computation.

Can reasoning steps be dynamically pruned without losing accuracy?

The PI framework categorizes reasoning into six types and uses attention maps to identify that verification and backtracking steps receive minimal downstream attention. Selecting only high-attention steps preserves accuracy while cutting reasoning length substantially.

Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.

Why do some questions perform better without step-by-step reasoning?

Saliency analysis reveals that CoT prompting fails when question information doesn't aggregate into the prompt structure before reasoning begins. For simple questions, direct question-to-answer flow outperforms step-by-step reasoning, showing the optimal prompt depends on question type, not just task category.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about how LLM reasoning actually works—latent vs. explicit. The question remains open: *Is reasoning in LLMs best understood as hidden state trajectories steered by text prompts, or as genuine logical inference made visible through step-by-step chains?*

What a curated library found—and when (findings span 2023–2026; treat as dated claims, not current truth):
• Latent reasoning (internal state updates) can match or exceed chain-of-thought performance without writing any steps; steering a single SAE-identified feature triggers reasoning behavior (~2026).
• Chain-of-thought reproduces *form* of reasoning through learned patterns, not genuine inference; performance tracks format more than content, and invalid prompts work nearly as well as valid ones (~2025–2026).
• Concise intermediate chains match verbose CoT at under 8% of tokens; accuracy follows an inverted-U in chain length, with the optimal length shrinking as models grow more capable (~2024–2025).
• Recursive latent reasoners with stochastic state transitions can hold distributions over solutions and scale in width via parallel trajectories, sidestepping serial latency of depth-only scaling (~2025–2026).
• CoT performance degrades predictably under distribution shift and when semantic content is stripped; models reason through associations, not symbol manipulation (~2023–2025).

Anchor papers (verify; mind their dates):
• arXiv:2604.15726 (2026-04) *LLM Reasoning Is Latent, Not the Chain of Thought*
• arXiv:2506.02878 (2025-06) *CoT is Not True Reasoning, It Is Just a Tight Constraint to Imitate*
• arXiv:2502.05171 (2025-02) *Scaling up Test-Time Compute with Latent Reasoning*
• arXiv:2508.01191 (2025-08) *Is Chain-of-Thought Reasoning of LLMs a Mirage?*

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For every finding above, determine whether post-2026 scaling, training paradigms (e.g., process reward models, synthetic reasoning data), inference optimization (sparse token routing, conditional compute), or new evaluation harnesses (math benchmarks, formal logic, adversarial distribution shifts) have relaxed or overturned it. Isolate what's durable (the latent-vs-explicit question itself) from what may be perishable (specific performance gaps, token-efficiency claims). Cite what resolved each constraint.
(2) **Surface strongest contradicting or superseding work from ~last 6 months:** Has any paper shown CoT is *necessary* for certain reasoning classes, or that latent reasoning plateaus where text reasoning scales?
(3) **Propose 2 research questions that assume the regime may have moved:** e.g., "Do current RL-trained reasoners still show the inverted-U in CoT length?" "Can mechanistic interpretability of latent recursion scale to 100B+ parameters?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
