SYNTHESIS NOTE
Model Architecture and Internals Training, RL, and Test-Time Scaling Reasoning, Retrieval, and Evaluation

Can models reason without generating visible thinking tokens?

Explores whether intermediate reasoning must be verbalized as text tokens, or if models can think in hidden continuous space. Challenges a foundational assumption about how language models scale their reasoning capabilities.

Synthesis note · 2026-02-22 · sourced from Reasoning Architectures

The mainstream approach to test-time scaling requires the model to verbalize intermediate reasoning steps — producing tokens that represent thoughts before producing an answer. Two architectures challenge this assumption from different angles and converge on the same implication: verbalization is a historical artifact of training constraints, not a necessity for reasoning.

Latent depth-recurrent reasoning: A recurrent block is added to a transformer and iterated at inference time for an arbitrary number of steps. The model "thinks" by updating its hidden state repeatedly before producing any output token. Advantages: (1) no specialized training data required — the model trains with a variable compute budget on standard data; (2) less memory than CoT models, which need long context windows; (3) per-token adaptive compute, where difficult tokens get more recurrent iterations; (4) as model parameter count decreases, FLOPs per parameter increase — enabling high compute utilization on smaller models. The architecture naturally supports early stopping via KL-divergence convergence detection.

Heima (Hidden LLaMA): Each intermediate CoT step is compressed into a compact higher-level hidden representation using a single "thinking token." An adaptive decoder reconstructs variable-length textual sequences from the thinking tokens, enabling interpretability without verbosity. The model encodes each CoT step but doesn't need to generate all the intermediate tokens at inference time.

The synthesis point: both architectures suggest that the constraint requiring "expensive internal reasoning must always be projected down to a single verbalized next token appears wasteful" (Latent Depth paper). Continuous latent space can explore multiple reasoning directions simultaneously, without the linear sequential structure that token generation imposes.

This challenges Does more thinking time actually improve LLM reasoning? from an unexpected direction — the myth assumes verbalized tokens are the unit of thinking; latent reasoning questions whether tokens should be the unit at all.

The connection to human cognition is philosophically interesting: "a substantial amount of thought happens through complex, recurrent firing patterns in the brain, before the first word of an answer is uttered." Latent reasoning may capture facets of human reasoning (spatial thinking, physical intuition) that resist verbalization, which current verbalized CoT approaches cannot access by design.

Coconut (Chain of Continuous Thought): A fourth approach feeds the last hidden state back as the next input embedding directly in continuous space, bypassing the language model head and embedding layer entirely. Continuous thought can encode multiple alternative next reasoning steps, allowing the model to perform breadth-first search (BFS) naturally — rather than committing to a single deterministic path like CoT. Coconut outperforms CoT on logical reasoning tasks requiring substantial backtracking. The neuroscience grounding is direct: neuroimaging studies consistently show that the language network remains largely inactive during reasoning tasks, and language appears optimized for communication rather than reasoning. This suggests verbalized CoT forces reasoning through a communication channel it was never designed for. The CoT unfaithfulness literature reinforces this: even when models generate explicit reasoning chains, they may use a different latent reasoning process internally.

Hierarchical Reasoning Model (HRM): A third distinct latent reasoning architecture adds brain-inspired multi-timescale processing. HRM couples a slow high-level module (abstract planning) with a fast low-level module (detailed computation) in hierarchical recurrence. The fast module reaches equilibrium, then the slow module advances — "hierarchical convergence" avoids premature convergence of standard recurrence. With only 27M parameters and 1000 samples (no pretraining, no CoT), HRM achieves near-perfect accuracy on Sudoku-Extreme and 30×30 maze pathfinding — tasks where CoT methods completely fail (0% accuracy). Uses O(1) memory gradient approximation at equilibrium, avoiding BPTT entirely. See Can recurrent hierarchies achieve reasoning that transformers cannot?.

Theoretical consolidation: These converging architectures now have a formal theoretical framework. Since Where does LLM reasoning actually happen during generation?, the depth-recurrent, Heima, Coconut, HRM, and energy-based approaches all constitute evidence for H1 (latent-state trajectories as the primary reasoning medium). The framework also clarifies why these approaches work: if reasoning is fundamentally a latent-state process, then architectures that operate directly in latent space are working with the native medium rather than forcing it through the bottleneck of discrete verbalization. Furthermore, since Can we trigger reasoning without explicit chain-of-thought prompts?, the latent reasoning capability exists even in standard transformer architectures — specialized latent architectures may be optimizing the medium rather than creating a new capability.

Practical constraint on retrofitting: A critical caveat for deployment: Can continuous reasoning avoid forgetting in instruction-tuned models? shows that fine-tuning already-capable instruction-tuned models for continuous reasoning via Coconut/CCoT methods causes catastrophic forgetting. This limits the Coconut approach to training-from-scratch scenarios and motivates frozen-backbone alternatives for enhancing existing models.

Inquiring lines that use this note as a source 108

This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.

Related concepts in this collection 12

This note in its neighbourhood — explore the map, then jump to a related concept in the list below.

Concept map
26 direct connections · 180 in 2-hop network ·medium cluster Open in graph ↗

Click a node to walk · click center to open · click Open in graph to see this note in the full knowledge graph

your link semantically near linked from elsewhere

Related papers in this collection 8

Papers most semantically related to this note, ranked by cosine similarity in the embedding space.

Original note title

latent reasoning in continuous space scales test-time compute without verbalized tokens or specialized training data