SYNTHESIS NOTE

Can models reason without generating visible thinking tokens?

Explores whether intermediate reasoning must be verbalized as text tokens, or whether models can think in a hidden continuous space. Challenges a foundational assumption about how language models scale their reasoning capabilities.

Synthesis note · 2026-02-22 · sourced from Reasoning Architectures

The mainstream approach to test-time scaling requires the model to verbalize intermediate reasoning steps — producing tokens that represent thoughts before producing an answer. Two architectures challenge this assumption from different angles and converge on the same implication: verbalization is a historical artifact of training constraints, not a necessity for reasoning.

Latent depth-recurrent reasoning: A recurrent block is added to a transformer and iterated at inference time for an arbitrary number of steps. The model "thinks" by updating its hidden state repeatedly before producing any output token. Advantages: (1) no specialized training data required — the model trains with a variable compute budget on standard data; (2) less memory than CoT models, which need long context windows; (3) per-token adaptive compute, where difficult tokens get more recurrent iterations; (4) as model parameter count decreases, FLOPs per parameter increase — enabling high compute utilization on smaller models. The architecture naturally supports early stopping via KL-divergence convergence detection.
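The loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the recurrent block is a toy `tanh(W s + e)` update, and every name here (`recurrent_block`, `think_until_converged`, `make_toy_parameters`, `kl_threshold`) is hypothetical. It only shows the control flow: iterate the hidden state, read out a next-token distribution after each step, and stop once the KL divergence between consecutive distributions falls below a threshold.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two strictly positive distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def recurrent_block(state, token_embedding, weights):
    """Toy stand-in (assumption) for the recurrent block: s' = tanh(W s + e)."""
    pre = matvec(weights, state)
    return [math.tanh(a + e) for a, e in zip(pre, token_embedding)]


def think_until_converged(token_embedding, weights, readout,
                          max_steps=64, kl_threshold=1e-6):
    """Iterate the block for one token; stop when the output distribution settles."""
    state = [0.0] * len(token_embedding)
    prev_probs = None
    for step in range(1, max_steps + 1):
        state = recurrent_block(state, token_embedding, weights)
        probs = softmax(matvec(readout, state))
        # Early exit: another iteration barely changed the prediction.
        if prev_probs is not None and kl_divergence(probs, prev_probs) < kl_threshold:
            return probs, step
        prev_probs = probs
    return probs, max_steps


def make_toy_parameters(dim=8, vocab=5, seed=0):
    """Random toy parameters; small recurrent weights keep the update contractive."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(dim)]
    readout = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(vocab)]
    embedding = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    return embedding, weights, readout
```

Calling `think_until_converged(*make_toy_parameters())` returns the final distribution and the number of iterations used. In this sketch the step count is what varies per token, which is the sense in which compute is adaptive: a looser `kl_threshold` can never use more steps than a tighter one.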

Heima (Hidden LLaMA): Each intermediate CoT step is compressed into a compact higher-level hidden representation using a single "thinking token." An adaptive decoder reconstructs variable-length textual sequences from the thinking tokens, enabling interpretability without verbosity. The model encodes each CoT step but doesn't need to generate all the intermediate tokens at inference time.

The synthesis point: both architectures suggest, in the words of the Latent Depth paper, that the constraint that "expensive internal reasoning must always be projected down to a single verbalized next token appears wasteful." Continuous latent space can explore multiple reasoning directions simultaneously, without the linear sequential structure that token generation imposes.

This challenges the note "Does more thinking time actually improve LLM reasoning?" from an unexpected direction: the myth examined there assumes verbalized tokens are the unit of thinking, whereas latent reasoning questions whether tokens should be the unit at all.

The connection to human cognition is philosophically interesting: "a substantial amount of thought happens through complex, recurrent firing patterns in the brain, before the first word of an answer is uttered." Latent reasoning may capture facets of human reasoning (spatial thinking, physical intuition) that resist verbalization, which current verbalized CoT approaches cannot access by design.

Coconut (Chain of Continuous Thought): A fourth approach feeds the last hidden state back as the next input embedding directly in continuous space, bypassing the language model head and embedding layer entirely. Continuous thought can encode multiple alternative next reasoning steps, allowing the model to perform breadth-first search (BFS) naturally — rather than committing to a single deterministic path like CoT. Coconut outperforms CoT on logical reasoning tasks requiring substantial backtracking. The neuroscience grounding is direct: neuroimaging studies consistently show that the language network remains largely inactive during reasoning tasks, and language appears optimized for communication rather than reasoning. This suggests verbalized CoT forces reasoning through a communication channel it was never designed for. The CoT unfaithfulness literature reinforces this: even when models generate explicit reasoning chains, they may use a different latent reasoning process internally.
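The difference between feeding back a hidden state and feeding back a sampled token can be made concrete with a toy model. The sketch below is an assumption-laden illustration, not Coconut's code: `last_hidden_state` replaces a real transformer with mean pooling plus one linear layer, and all function names and sizes are hypothetical. What it does show faithfully is the interface: in latent mode the hidden state is appended to the input sequence as-is, while in verbalized mode it passes through the language model head, an argmax, and an embedding lookup.

```python
import math
import random


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])


def last_hidden_state(sequence, weights):
    """Toy stand-in (assumption) for a transformer pass over input embeddings.
    Mean pooling replaces attention; only the input/output interface matters."""
    dim = len(sequence[0])
    pooled = [sum(vec[i] for vec in sequence) / len(sequence) for i in range(dim)]
    return [math.tanh(x) for x in matvec(weights, pooled)]


def answer_with_continuous_thoughts(prompt_ids, embeddings, weights, lm_head, num_thoughts):
    sequence = [embeddings[t] for t in prompt_ids]
    for _ in range(num_thoughts):
        # Latent mode: the hidden state itself becomes the next input embedding.
        # No language model head, no token choice, no embedding lookup.
        sequence.append(last_hidden_state(sequence, weights))
    logits = matvec(lm_head, last_hidden_state(sequence, weights))
    return argmax(logits), sequence


def answer_with_verbalized_thoughts(prompt_ids, embeddings, weights, lm_head, num_thoughts):
    sequence = [embeddings[t] for t in prompt_ids]
    for _ in range(num_thoughts):
        # Verbalized mode: commit to one token, then look its embedding up again.
        logits = matvec(lm_head, last_hidden_state(sequence, weights))
        sequence.append(embeddings[argmax(logits)])
    logits = matvec(lm_head, last_hidden_state(sequence, weights))
    return argmax(logits), sequence


def make_toy_model(dim=6, vocab=10, seed=0):
    """Random toy parameters; hidden size equals embedding size so feedback is possible."""
    rng = random.Random(seed)
    embeddings = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab)]
    weights = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]
    lm_head = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab)]
    return embeddings, weights, lm_head
```

In the verbalized variant every intermediate step is one of the `vocab` embedding rows, so the information carried forward is capped at a single token choice. In the latent variant the appended vector is unconstrained, which is what lets a continuous thought keep several candidate next steps in play.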

Hierarchical Reasoning Model (HRM): A third distinct latent reasoning architecture adds brain-inspired multi-timescale processing. HRM couples a slow high-level module (abstract planning) with a fast low-level module (detailed computation) in hierarchical recurrence. The fast module reaches equilibrium, then the slow module advances; this "hierarchical convergence" avoids the premature convergence of standard recurrence. With only 27M parameters and 1000 training samples (no pretraining, no CoT), HRM achieves near-perfect accuracy on Sudoku-Extreme and 30×30 maze pathfinding, tasks where CoT methods completely fail (0% accuracy). Training uses a gradient approximation at equilibrium that needs only O(1) memory, avoiding backpropagation through time (BPTT) entirely. See the note "Can recurrent hierarchies achieve reasoning that transformers cannot?".
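The two-timescale schedule can be sketched as a nested loop. This is a hedged illustration of the forward pass only, under stated assumptions: both modules are toy `tanh` recurrences, a fixed number of inner steps (`low_steps`) stands in for "the fast module reaches equilibrium", and every name and parameter key is hypothetical. The gradient approximation used for training is not shown.

```python
import math
import random


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def add(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(parts) for parts in zip(*vectors)]


def tanh_vec(vec):
    return [math.tanh(x) for x in vec]


def hierarchical_forward(x, params, high_cycles=4, low_steps=8):
    """Run `high_cycles` slow updates, each preceded by `low_steps` fast updates."""
    dim = len(x)
    z_high = [0.0] * dim  # slow module state (abstract plan)
    z_low = [0.0] * dim   # fast module state (detailed computation)
    low_updates = 0
    for _ in range(high_cycles):
        for _ in range(low_steps):
            # Fast module: iterates while the slow state is held fixed.
            z_low = tanh_vec(add(matvec(params["low_rec"], z_low),
                                 matvec(params["low_from_high"], z_high),
                                 x))
            low_updates += 1
        # Slow module: advances once, conditioned on where the fast module ended up.
        z_high = tanh_vec(add(matvec(params["high_rec"], z_high),
                              matvec(params["high_from_low"], z_low)))
    return z_high, z_low, low_updates


def make_toy_params(dim=4, seed=0):
    """Random toy weights (assumption); a real model would learn these."""
    rng = random.Random(seed)

    def matrix():
        return [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]

    return {"low_rec": matrix(), "low_from_high": matrix(),
            "high_rec": matrix(), "high_from_low": matrix()}
```

Each change to `z_high` gives the fast module a new context to settle into, so the inner loop is effectively restarted on a fresh sub-problem every cycle. With the defaults above, 4 slow updates drive 32 fast updates.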

Theoretical consolidation: These converging architectures now have a formal theoretical framework. Under the framework in the note "Where does LLM reasoning actually happen during generation?", the depth-recurrent, Heima, Coconut, HRM, and energy-based approaches all constitute evidence for H1 (latent-state trajectories as the primary reasoning medium). The framework also clarifies why these approaches work: if reasoning is fundamentally a latent-state process, then architectures that operate directly in latent space are working with the native medium rather than forcing it through the bottleneck of discrete verbalization. Furthermore, the note "Can we trigger reasoning without explicit chain-of-thought prompts?" indicates that latent reasoning capability exists even in standard transformer architectures, so specialized latent architectures may be optimizing the medium rather than creating a new capability.

Practical constraint on retrofitting: The note "Can continuous reasoning avoid forgetting in instruction-tuned models?" shows that fine-tuning already-capable instruction-tuned models for continuous reasoning via Coconut/CCoT methods causes catastrophic forgetting. This is a critical caveat for deployment: it limits the Coconut approach to training-from-scratch scenarios and motivates frozen-backbone alternatives for enhancing existing models.

Inquiring lines that read this note 119

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

Can next-token prediction alone produce genuine language understanding?

How does latent reasoning compare to verbalized chain-of-thought?

How do neural networks separate factual knowledge from reasoning abilities?

How do verbose and concise reasoning occupy different regions in activation space?

How do soft continuous representations explore multiple reasoning paths simultaneously?

Does tokenized intelligence retain genuine value through exchange-based systems?

Can AI output be tokenized without decoupling from the thought processes behind it?

When do additional thinking tokens stop improving reasoning performance?

Why do reasoning models fail at systematic problem-solving and search?

Do language models perform faithful symbolic reasoning independent of semantic grounding?

Is embodied interaction necessary for language meaning and genuine agency?

Do reasoning traces faithfully represent or merely mimic actual model reasoning?

Why do correct reasoning traces tend to be shorter than incorrect ones?

Does recurrence enable reasoning capabilities that fixed-depth transformers cannot achieve?

Why does self-revision increase model confidence while degrading accuracy?

Do self-revision tokens measurably degrade reasoning accuracy in scaled models?

What capability tradeoffs emerge when scaling model reasoning abilities?

How do training data properties shape reasoning capability development?

Why does training format shape reasoning strategy more than domain content?

How much does input format shape what reasoning strategy a model develops?

Can prompting inject entirely new knowledge into language models?

How can prompting help models gather information before attempting reasoning?

Does parallel reasoning outperform sequential thinking under fixed compute budgets?

Why do language models struggle with implicit discourse relations?

Why do explicit linguistic markers override semantic computation in models?

Can inference-time compute substitute for scaling up model parameters?

How much does test-time compute improve reasoning without more tokens?

Do language models understand semantics or rely on pattern matching?

Why does cross-text analogical reasoning fail when semantics decouple from symbols?

Do language models develop causal world models or rely on statistical patterns?

Does domain specialization cause models to lose capabilities elsewhere?

Can capability boundary collapse be addressed by operating at representational rather than token level?

Do base models contain latent reasoning that training can unlock?

How do LLMs distinguish causal reasoning from temporal and semantic associations?

What distinguishes memorized tokens from causally necessary reasoning steps?

How faithfully do LLMs reflect their actual reasoning in outputs and explanations?

How should models express uncertainty rather than forced confident answers?

Why does self-distillation suppress epistemic verbalization in student models?

Is model self-awareness based on genuine introspection or pattern matching?

Do models verbalize their implicit knowledge when that knowledge influences their output?

Do language models learn genuine linguistic structure or just surface patterns?

Why do semantic similarity and task relevance diverge in vector embeddings?

How does token-level interaction like ColBERT overcome commutativity constraints?

What properties determine whether reward signals teach genuine reasoning?

How do internal model mechanisms escape token-level reinforcement signals?

How do prompt structure and constraints affect model instruction reliability?

Related concepts in this collection 12

How should we balance parallel versus sequential compute at test time? Test-time compute can prioritize breadth (trying many approaches) or depth (refining one approach). Which strategy works better, and does the answer depend on the problem?
latent recurrence is neither: it scales depth per token rather than breadth or chain length
Does more thinking time actually improve LLM reasoning? The intuition that extended thinking helps LLMs reason better seems obvious, but what does the empirical data actually show when we test it directly?
latent reasoning suggests the token-is-thinking assumption embedded in all TTS benchmarks may be wrong
Can minimal reasoning chains match full explanations? Does removing all explanatory text from chain-of-thought reasoning preserve accuracy? This tests whether verbose intermediate steps are necessary for solving problems or just artifacts of how language models are trained.
CoD uses fewer tokens; latent reasoning uses zero tokens for intermediate steps; same direction of travel
Can we allocate inference compute based on prompt difficulty? Does adjusting how much compute each prompt receives—rather than using a fixed budget—improve model performance? Could smarter allocation let smaller models compete with larger ones?
latent recurrence with early stopping implements adaptive compute at the token level, not the prompt level
Can recurrent hierarchies achieve reasoning that transformers cannot? Can a dual-timescale recurrent architecture escape the computational limitations of standard transformers and solve complex reasoning tasks without explicit chain-of-thought? This explores whether architectural design, not scale, enables true algorithmic reasoning.
third latent reasoning architecture: hierarchical multi-timescale recurrence
Can parallel architectures solve inherently sequential problems? Complexity theory suggests some problems like reasoning and planning are fundamentally sequential. Can parallel architectures like Transformers overcome this limitation, or do we need fundamentally different computational approaches?
complexity-theoretic foundation: latent recurrence is necessary for inherently serial problems
Can we explore multiple reasoning paths without committing to one token? Standard language models pick one token at each step, collapsing uncertainty and forcing single reasoning trajectories. Could preserving the full probability distribution across token embeddings enable implicit parallel exploration instead?
training-free approach to continuous-space reasoning via probability-weighted token mixture
Can energy minimization unlock reasoning without domain-specific training? Can a gradient descent-based architecture achieve system 2 thinking across any modality or problem type using only unsupervised learning, without verifiers or reasoning-specific rewards?
fifth latent reasoning approach: energy minimization as iterative gradient descent at inference time, distinct from depth-recurrent, Heima, Coconut, and HRM; 35% higher scaling rate than Transformer++, modality-agnostic without domain-specific training
Where does LLM reasoning actually happen during generation? Does multi-step reasoning emerge from visible chain-of-thought text, hidden layer dynamics, or simply more computation? Three competing hypotheses make different predictions and can be empirically tested.
provides the theoretical framework (H1/H2/H0) that organizes all these architectures as evidence for H1
Can we trigger reasoning without explicit chain-of-thought prompts? This research asks whether models possess latent reasoning capabilities that can be activated through direct feature steering, independent of chain-of-thought instructions. Understanding this matters for making reasoning more efficient and controllable.
mechanistic evidence: latent reasoning is not just architecturally achievable but causally controllable via a single feature
Can continuous reasoning avoid forgetting in instruction-tuned models? Full fine-tuning for continuous-space reasoning degrades performance in capable instruction-tuned models. Why does this happen, and can architectural changes prevent it?
validates a practical concern: Coconut-style fine-tuning causes catastrophic forgetting on capable models; SoftCoT provides the retrofit-safe alternative
Can stochastic latent reasoning let models explore multiple solutions? When recursive reasoning models collapse to single deterministic paths, can introducing stochasticity into latent transitions instead let them maintain uncertainty and consider alternative strategies? This matters because real problems often have multiple valid answers.
extends: GRAM makes the deterministic latent recurrence stochastic to represent multiple solutions
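The "probability-weighted token mixture" mentioned in the list above (under "Can we explore multiple reasoning paths without committing to one token?") reduces to a single weighted sum. The sketch below is a minimal illustration under assumptions: the function names and the tiny two-dimensional embedding table are hypothetical, and no real model is involved. Instead of picking one token and looking up its embedding, the next input is the expectation of the embedding table under the model's next-token distribution.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_embedding(probs, embedding_table):
    """Expected embedding: sum over tokens of p(token) * embedding(token)."""
    dim = len(embedding_table[0])
    return [sum(p * row[i] for p, row in zip(probs, embedding_table))
            for i in range(dim)]


# Hypothetical three-token vocabulary with 2-d embeddings.
embedding_table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Undecided model: equal logits, so the mixture is the average of all rows.
undecided = mixture_embedding(softmax([0.0, 0.0, 0.0]), embedding_table)

# Confident model: the mixture collapses to (almost exactly) one token's row,
# which recovers ordinary discrete decoding as a limiting case.
confident = mixture_embedding(softmax([100.0, 0.0, 0.0]), embedding_table)
```

The design point is that uncertainty survives the step: when the distribution is spread out, the next input carries a blend of candidate tokens instead of a single committed choice.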
