INQUIRING LINE


What if AI's 'show your work' is just a learned habit — not where its real thinking happens?

Can latent reasoning achieve the same substitution without tokens?

This explores whether latent reasoning (computation happening in hidden states or embedding space) can substitute for the visible, word-by-word chain-of-thought tokens models normally generate, and whether anything is lost when the words disappear. The corpus suggests the substitution is largely viable, and several lines of evidence point to verbalization being more of a training habit than a computational necessity.

The most direct support comes from work showing that models can scale test-time compute by iterating on hidden states rather than emitting tokens (see "Can models reason without generating visible thinking tokens?"). Depth-recurrent architectures, Coconut, and Heima all reason in latent space and reach answers without spelling out intermediate steps, which frames verbalization as an artifact, not a requirement. A striking complement: transformers trained to hide their chain-of-thought compute the correct answer in their earliest layers, then suppress it in the final layers to emit format-compliant filler (see "Do transformers hide reasoning before producing filler tokens?"). In that setting the reasoning was never in the tokens; it was already done internally, and the visible output was theater. That reframes the whole question: if the real computation is latent even during 'normal' CoT, then dropping the tokens removes the performance of reasoning, not the reasoning itself.

This fits a broader pattern in the corpus: the tokens carry far less of the load than they appear to. Chain of Draft matches full CoT accuracy using just 7.6% of the tokens, with the remaining 92.4% serving style and documentation rather than computation (see "Can minimal reasoning chains match full explanations?"). Models trained on deliberately corrupted or irrelevant traces perform comparably to those trained on correct ones, which suggests traces work as computational scaffolding, not meaningful steps (see "Do reasoning traces need to be semantically correct?"). And when researchers prune reasoning chains by importance, symbolic-computation tokens survive while grammar and meta-discourse are pruned first (see "Which tokens in reasoning chains actually matter most?"), echoing the finding that only ~20% of tokens, the high-entropy 'forking' tokens, drive the actual learning signal (see "Do high-entropy tokens drive reasoning model improvements?"). If most tokens are disposable, the case for dropping them entirely strengthens.

There are also non-token alternatives that go further than hiding the words. Large Concept Models reason over whole sentence embeddings in a language-agnostic space before decoding (see "Can reasoning happen at the sentence level instead of tokens?"), and diffusion LLMs refine reasoning in place alongside the answer rather than generating it left-to-right, which lets early exit cut compute by about half (see "Can reasoning and answers be generated separately in language models?"). These are not just compression; they are different substrates for the same work.

The catch the corpus raises: what gets substituted may not have been genuine reasoning to begin with. If chain-of-thought is constrained imitation of reasoning's *form* rather than real abstract inference (see "Does chain-of-thought reasoning reveal genuine inference or pattern matching?"), and if models lean on semantic associations rather than symbolic logic (see "Do large language models reason symbolically or semantically?"), then moving into latent space inherits those same ceilings: it makes reasoning cheaper and faster without making it more genuine. So the honest answer is that latent reasoning can likely match what tokens do, precisely because the tokens were doing less than they appeared to.

Sources (10 notes)

Can models reason without generating visible thinking tokens?

Multiple architectures—depth-recurrent models, Heima, and Coconut—demonstrate that test-time compute scales through hidden state iteration rather than token generation. This suggests verbalization is a training artifact, not a reasoning requirement.
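A minimal sketch of what "hidden state iteration" means, assuming a toy elementwise recurrent block. The names `recurrent_block`, `latent_reason`, and `step_change`, and the weights `w` and `u`, are hypothetical; this is not the architecture of the depth-recurrent models, Heima, or Coconut. It only shows that extra compute can be spent by looping over a state vector, with no tokens produced between steps.

```python
import math

def recurrent_block(h, x, w=0.5, u=1.0):
    """One latent step: mix the current hidden state with the input embedding."""
    return [math.tanh(w * hi + u * xi) for hi, xi in zip(h, x)]

def latent_reason(x, num_steps):
    """Apply the same block num_steps times; nothing is decoded in between."""
    h = [0.0] * len(x)
    trajectory = []
    for _ in range(num_steps):
        h = recurrent_block(h, x)
        trajectory.append(h)
    return h, trajectory

def step_change(a, b):
    """Largest per-dimension change between two hidden states."""
    return max(abs(ai - bi) for ai, bi in zip(a, b))

x = [0.3, -1.2, 0.8]                      # hypothetical input embedding
h_final, trajectory = latent_reason(x, 8)
changes = [step_change(trajectory[i], trajectory[i + 1]) for i in range(7)]
print(changes)                            # the state settles as steps increase
```

In this toy, more steps means a more settled state, and `num_steps` is the test-time compute knob. A real model would decode an answer from `h_final` once, at the end.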

Do transformers hide reasoning before producing filler tokens?

Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.
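The logit-lens idea can be sketched as follows, assuming a tiny hand-made vocabulary and hidden states. `UNEMBED` and `HIDDEN_BY_LAYER` are invented for illustration and come from no real model; a real logit lens would also apply the model's final normalization before unembedding, which this sketch omits.

```python
def logits_for(hidden, unembed):
    """Project one hidden state onto the vocabulary: one dot product per token."""
    return {tok: sum(h * w for h, w in zip(hidden, vec))
            for tok, vec in unembed.items()}

def ranked_tokens(hidden, unembed):
    """Tokens sorted from most to least likely under this layer's state."""
    scores = logits_for(hidden, unembed)
    return sorted(scores, key=scores.get, reverse=True)

def logit_lens(hidden_by_layer, unembed):
    """Decode every layer's hidden state as if it were the final one."""
    return [ranked_tokens(h, unembed) for h in hidden_by_layer]

# Hypothetical 2-d unembedding: one direction for the answer, one for filler.
UNEMBED = {"42": [1.0, 0.0], "...": [0.0, 1.0], "7": [-1.0, 0.0]}
HIDDEN_BY_LAYER = [
    [2.0, 0.1],   # early layer: the answer direction dominates
    [1.5, 1.0],   # middle layer: filler direction is growing
    [1.0, 3.0],   # final layer: filler wins, answer still present underneath
]

for layer, ranking in enumerate(logit_lens(HIDDEN_BY_LAYER, UNEMBED)):
    print(layer, ranking)
```

In this constructed example the answer token is top-ranked in the early layer and drops to rank two in the final layer, which is the shape of the "recoverable from lower-ranked predictions" claim.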

Can minimal reasoning chains match full explanations?

Chain of Draft achieves equivalent accuracy to standard chain-of-thought on arithmetic, symbolic, and commonsense tasks while using only 7.6% of tokens. The 92.4% of removed tokens served style and documentation, not computation.

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

Which tokens in reasoning chains actually matter most?

Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.
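One plausible reading of greedy likelihood-preserving pruning, as a sketch: repeatedly drop whichever token costs the least answer likelihood until a token budget is met. The scorer `toy_score` and its `TOY_WEIGHTS` table are hypothetical stand-ins for a model's answer log-likelihood, and the source does not confirm that the original method works exactly this way.

```python
def greedy_prune(tokens, score, keep):
    """Drop one token per round, always the one whose removal hurts score least."""
    kept = list(range(len(tokens)))
    while len(kept) > keep:
        # Pick the index whose removal leaves the highest remaining score.
        cheapest = max(
            kept,
            key=lambda i: score([tokens[j] for j in kept if j != i]),
        )
        kept.remove(cheapest)
    return [tokens[i] for i in kept]

# Hypothetical per-token contributions; a real scorer would query a model.
TOY_WEIGHTS = {
    "so": 0.1, "we": 0.1, "compute": 0.3, "clearly": 0.05,
    "3": 2.0, "*": 2.0, "4": 2.0, "=": 1.5, "12": 3.0,
}

def toy_score(kept_tokens):
    """Stand-in for answer log-likelihood given the kept reasoning tokens."""
    return sum(TOY_WEIGHTS.get(t, 0.0) for t in kept_tokens)

chain = ["so", "we", "compute", "3", "*", "4", "=", "12", "clearly"]
print(greedy_prune(chain, toy_score, keep=5))
```

With these invented weights the meta-discourse words go first and the symbolic tokens survive, mirroring the ordering the note describes.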


Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
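To make "high-entropy forking tokens" concrete, here is a small sketch that computes Shannon entropy per position and selects the top 20%. The function names and the example distributions are assumptions for illustration; the 0.2 fraction is taken from the note, not derived here.

```python
import math

def entropy(probs):
    """Shannon entropy in nats of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def forking_positions(distributions, fraction=0.2):
    """Indices of the top `fraction` of positions by entropy, in sequence order."""
    entropies = [entropy(d) for d in distributions]
    k = max(1, round(fraction * len(distributions)))
    top = sorted(range(len(entropies)),
                 key=lambda i: entropies[i], reverse=True)[:k]
    return sorted(top), entropies

# Hypothetical next-token distributions for a five-token chain.
dists = [
    [0.97, 0.01, 0.01, 0.01],
    [0.90, 0.05, 0.03, 0.02],
    [0.25, 0.25, 0.25, 0.25],   # the model is undecided here: a fork
    [0.99, 0.005, 0.003, 0.002],
    [0.80, 0.10, 0.05, 0.05],
]
positions, _ = forking_positions(dists)
print(positions)
```

Restricting a training update to the returned positions is the kind of selection the note describes; how gradients are actually masked is outside this sketch.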

Can reasoning happen at the sentence level instead of tokens?

Meta's Large Concept Model operates on sentence embeddings rather than tokens, reasoning in a language-agnostic space before decoding to any target language. This hierarchical approach with paragraph-level planning produces more coherent output than flat token generation.

Can reasoning and answers be generated separately in language models?

ICE shows that bidirectional attention in diffusion LLMs enables in-place prompting—embedding reasoning directly in masked positions refined alongside answers. Answer confidence converges early while reasoning continues refining, allowing early-exit mechanisms to cut compute by 50% while maintaining accuracy.
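The early-exit idea can be sketched as a simple stopping rule: halt refinement once answer confidence has stayed above a threshold for a few consecutive steps. The `threshold` and `patience` parameters and the confidence trace below are hypothetical, and the 50% saving in this toy is true by construction, not a reproduction of the reported result.

```python
def refine_with_early_exit(confidences, threshold=0.9, patience=2):
    """Return how many refinement steps run before the answer is accepted."""
    streak = 0
    for step, conf in enumerate(confidences, start=1):
        # Count consecutive steps at or above the threshold.
        streak = streak + 1 if conf >= threshold else 0
        if streak >= patience:
            return step
    return len(confidences)

# Hypothetical answer confidence after each of 8 scheduled refinement steps.
trace = [0.40, 0.70, 0.92, 0.95, 0.96, 0.97, 0.97, 0.98]
steps_run = refine_with_early_exit(trace)
saved = 1 - steps_run / len(trace)
print(steps_run, saved)
```

The `patience` requirement guards against stopping on a single noisy spike; a trace that never stabilizes simply runs the full schedule.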

Does chain-of-thought reasoning reveal genuine inference or pattern matching?

CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs (4.27 match, arXiv)
Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens (3.45 match, arXiv)
Hierarchical Reasoning Model (2.61 match, arXiv)
Performative Thinking? The Brittle Correlation Between CoT Length and Problem Complexity (2.54 match, arXiv)
The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity (2.53 match, arXiv)
Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens (1.77 match, arXiv)
Break the Chain: Large Language Models Can be Shortcut Reasoners (1.74 match, arXiv)
LLM Reasoning Is Latent, Not the Chain of Thought (1.74 match, arXiv)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about latent reasoning and token substitution in LLMs. The question: Can hidden-state computation fully replace verbalized chain-of-thought without loss of reasoning capability?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026 and include:
• Latent reasoning via depth-recurrence scales test-time compute without token verbalization (~2025).
• Models trained to hide CoT compute correct answers in early layers, then overwrite them with format-compliant output (~2024).
• Chain of Draft matches full-CoT accuracy using only 7.6% of the tokens; the removed 92.4% served style and documentation, not computation (~2025).
• Corrupted reasoning traces perform comparably to correct ones; traces function as scaffolding, not meaningful steps (~2025).
• Only ~20% of tokens (high-entropy 'forking' points) drive learning signal; most are disposable (~2025).
• Large Concept Models reason over sentence embeddings in language-agnostic space; diffusion LLMs refine in-place, halving compute (~2025).
• CoT may be constrained *imitation* of reasoning form, not genuine abstract inference (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2412.04537 (Dec 2024) — Understanding Hidden Computations in Chain-of-Thought Reasoning.
• arXiv:2502.05171 (Feb 2025) — Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach.
• arXiv:2506.02878 (Jun 2025) — CoT is Not True Reasoning, It Is Just a Tight Constraint to Imitate.
• arXiv:2508.10736 (Aug 2025) — Thinking Inside the Mask: In-Place Prompting in Diffusion LLMs.

Your task:
(1) RE-TEST EACH CONSTRAINT. For each latent-reasoning claim above, determine whether advances in model scale, training (synthetic data, RL on latent loss), orchestration (multi-pass latent iteration, adaptive compute budgets), or evals (mechanistic probes of hidden state) have extended or contradicted it. Separate durable (e.g., "reasoning can happen without tokens") from perishable (e.g., "current APIs lack latent-reasoning support"). Cite what resolved each one.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Does any recent study show latent reasoning *fails* at tasks full CoT handles, or that token transparency remains irreplaceable?
(3) Propose 2 research questions that assume the substitution regime has progressed: e.g., "What is the minimal latent dimensionality needed to preserve reasoning fidelity?" or "Can latent-reasoning models transfer learned heuristics to out-of-distribution symbolic tasks?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
