Can latent reasoning scale test-time compute without verbalized tokens or special training?
This explores whether models can do their 'thinking' inside hidden internal states — scaling up reasoning at inference time — without writing out chain-of-thought tokens and without a special training regime to teach them how.
The corpus answers the compute half cleanly and the training half with a caveat. On compute: yes, reasoning can scale without verbalized steps. Depth-recurrent models, Heima, and Coconut all push test-time effort through repeated iteration of the hidden state rather than through generated text (see "Can models reason without generating visible thinking tokens?"). Looped architectures make the same point structurally: on reasoning tasks, re-applying the same layers in recurrent depth beats simply making the network bigger, because recursion enables the state tracking and compositional steps that raw parameter count does not (see "Can models learn by looping instead of growing larger?"). The shared insight is that verbalization looks like a training artifact, not a requirement for reasoning to happen.
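As a rough illustration of "iterating the hidden state instead of emitting tokens", the toy sketch below re-applies one shared block until the state stops changing. It is not the method of any of the cited models: the `block` update rule, the weight matrix `W`, and the tolerance-based halting are all assumptions chosen to keep the example tiny and convergent.

```python
import math

def block(h, x, W):
    """One shared 'layer': h_next = tanh(W @ h + x). The same W is reused every step."""
    return [
        math.tanh(sum(w_ij * h_j for w_ij, h_j in zip(row, h)) + x_i)
        for row, x_i in zip(W, x)
    ]

def recurrent_depth(x, W, max_steps=64, tol=1e-6):
    """Iterate the shared block; halt once the hidden state has (nearly) converged."""
    h = [0.0] * len(x)
    for step in range(1, max_steps + 1):
        h_next = block(h, x, W)
        delta = max(abs(a - b) for a, b in zip(h_next, h))
        h = h_next
        if delta < tol:  # convergence acts as the halting signal
            break
    return h, step

# Hypothetical small weights (chosen so the update is a contraction) and input.
W = [[0.2, -0.1], [0.05, 0.3]]
x = [0.5, -0.25]

h_loose, steps_loose = recurrent_depth(x, W, tol=1e-2)
h_tight, steps_tight = recurrent_depth(x, W, tol=1e-8)
print(steps_loose, steps_tight)
```

A tighter tolerance spends more iterations on the same input with the same parameters, which is the sense in which compute scales at test time without any additional tokens being produced.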
The 'without special training' part is where the corpus pushes back. A recurring finding is that reasoning ability is already latent in base models: five independent methods (RL steering, critique fine-tuning, decoding tricks, SAE feature steering, RLVR) all *elicit* capability that is already present in the activations rather than installing it (see "Do base models already contain hidden reasoning ability?"). So 'no special training' is half-true: you may not need to teach the skill, but you usually need *something*, even if minimal, to unlock it. And the gap between a reasoning model and a non-reasoning one persists no matter how much inference compute you give the weaker one, because training instills a protocol that makes the extra compute productive (see "Can non-reasoning models catch up with more compute?"). Latent compute scales, but only once something has organized the model to use it.
Worth noticing: several of these approaches move reasoning *off* the token level entirely rather than just hiding it. Large Concept Models reason over whole-sentence embeddings in a language-agnostic space before decoding (see "Can reasoning happen at the sentence level instead of tokens?"), and Latent-Thought Language Models add a separate scaling axis: latent 'thought vectors' that scale independently of parameter count via a fast/slow dual-rate learning scheme (see "Can latent thought vectors scale language models beyond parameters?"). These suggest the interesting frontier isn't 'same reasoning, fewer tokens' but 'a different substrate for reasoning altogether.'
There's a deeper reason latent reasoning is attractive, hiding in the critiques of the verbal kind. Two notes suggest that visible chain-of-thought may be more performance than logic: corrupted or irrelevant reasoning traces train models about as well as correct ones, implying the trace acts as computational scaffolding rather than meaningful steps (see "Do reasoning traces need to be semantically correct?"), and CoT degrades systematically under shifts in task, length, and format away from the training distribution, producing fluent-but-invalid reasoning (see "Does chain-of-thought reasoning actually generalize beyond training data?"). If the words aren't doing the reasoning anyway, doing the work in latent space loses less than you'd think. Quiet-STaR sits in between: it learns to generate rationales at every token position from arbitrary text, with quality judged by whether the rationale improves prediction rather than by labels (see "Can models learn reasoning from predicting any text?").
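The idea of judging a rationale by whether it improves prediction, not by labels, can be sketched as a log-likelihood gain. This is only an illustration: the function name `rationale_reward`, the no-rationale baseline, and the probabilities are assumptions, and the actual Quiet-STaR objective may be defined differently.

```python
import math

def rationale_reward(p_with, p_without):
    """Log-likelihood gain on the true next token from conditioning on a rationale.

    p_with:    model probability of the true next token given context + rationale
    p_without: model probability of the same token given context only (assumed baseline)
    """
    return math.log(p_with) - math.log(p_without)

# Hypothetical probabilities, for illustration only.
helpful = rationale_reward(p_with=0.6, p_without=0.3)    # rationale made the token likelier
unhelpful = rationale_reward(p_with=0.2, p_without=0.3)  # rationale made it less likely
print(helpful > 0, unhelpful < 0)
```

No labeled answer appears anywhere in the signal; a rationale is rewarded only to the extent that it makes the text that actually follows more predictable.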
The honest bottom line: latent reasoning genuinely scales test-time compute without verbalized tokens, and the capability is largely already present in base models — but 'no special training' overstates it. The pattern across the corpus is elicitation, not creation. The compute is latent; you still need a key to turn it on.
Sources (9 notes)
Multiple architectures—depth-recurrent models, Heima, and Coconut—demonstrate that test-time compute scales through hidden state iteration rather than token generation. This suggests verbalization is a training artifact, not a reasoning requirement.
Models that re-apply layers in recurrent depth outperform larger feedforward networks on reasoning tasks. This works because recursion enables state tracking and compositional generalization that parameter scaling alone cannot achieve, with convergence signals providing natural halting.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.
Meta's Large Concept Model operates on sentence embeddings rather than tokens, reasoning in a language-agnostic space before decoding to any target language. This hierarchical approach with paragraph-level planning produces more coherent output than flat token generation.
Latent-Thought Language Models achieve superior sample and parameter efficiency by coupling fast local variational learning with slow global decoder learning. This dual-rate scheme scales few-shot reasoning across both model and latent size, creating independent scaling dimensions beyond traditional parameter scaling.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.
Quiet-STaR trains language models to generate rationales at every token position during pretraining on arbitrary internet text, enabling general reasoning without task-specific datasets. Rationale quality is judged by predictive accuracy rather than labeled correctness, allowing reasoning competence to emerge as a side effect of improved language modeling.