INQUIRING LINE


Can latent reasoning scale test-time compute without verbal tokens?

This explores whether models can stretch their 'thinking budget' at inference time by iterating on hidden internal states — reasoning in latent space — instead of generating visible word-by-word chains of thought.

The short answer the corpus gives is yes: several independent architectures show that the extra compute that makes models 'think harder' does not have to be spent on visible text. Depth-recurrent models, Heima, and Coconut all scale reasoning by iterating on hidden states, suggesting that verbalization is a training artifact rather than a requirement for reasoning (see "Can models reason without generating visible thinking tokens?"). Meta's Large Concept Model pushes the same idea up a level, reasoning over sentence-level embeddings in a language-agnostic space before decoding to any target language (see "Can reasoning happen at the sentence level instead of tokens?"). Latent-Thought Language Models add a separate scaling axis: the size of the latent 'thought' can grow independently of model parameters (see "Can latent thought vectors scale language models beyond parameters?").
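The hidden-state iteration described above can be sketched with a toy recurrent block. This is an illustration only, not the architecture of any model named here: the `recurrent_block` and `latent_reason` names, the tanh update, and the small weight matrices are all assumptions chosen so the example runs with the standard library. The point it shows is that `num_steps` acts as the test-time compute knob while no token is ever emitted.

```python
import math

def recurrent_block(h, x, W, U):
    """One shared-weight update of the hidden state: h' = tanh(W h + U x)."""
    return [
        math.tanh(
            sum(W[i][j] * h[j] for j in range(len(h)))
            + sum(U[i][k] * x[k] for k in range(len(x)))
        )
        for i in range(len(W))
    ]

def latent_reason(x, W, U, num_steps):
    """Apply the same block num_steps times; only the final state is returned."""
    h = [0.0] * len(W)
    for _ in range(num_steps):
        h = recurrent_block(h, x, W, U)  # no token is generated between steps
    return h

# Hypothetical toy weights (row sums of |W| are below 1, so the update contracts).
W = [[0.5, -0.2], [0.1, 0.4]]
U = [[0.3, 0.1], [-0.2, 0.5]]
x = [1.0, -0.5]

shallow = latent_reason(x, W, U, num_steps=1)
deep = latent_reason(x, W, U, num_steps=32)
```

In this toy setting, spending more steps moves the state closer to the block's fixed point, which stands in for "thinking longer" without lengthening any visible output.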

What makes this more than a curiosity is the evidence that the verbal chain of thought was never doing the work we assumed. When models are trained to hide their reasoning, logit-lens analysis catches them computing the correct answer in the earliest layers (layers 1-3) and then actively suppressing it in the final layers to emit format-compliant filler tokens: the reasoning is real, the words are theater (see "Do transformers hide reasoning before producing filler tokens?"). In the same spirit, models trained on deliberately irrelevant reasoning traces perform about as well as those trained on correct ones, which suggests that the trace functions as computational scaffolding rather than meaningful step-by-step logic (see "Do reasoning traces need to be semantically correct?"). If the semantic content of the visible chain matters so little, moving the computation into latent space stops looking lossy.
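The logit-lens idea mentioned above is to decode each layer's hidden state with the model's output (unembedding) matrix and see which token it already favors. The sketch below assumes a two-token vocabulary and made-up hidden states purely to show the mechanics; the numbers are hypothetical, not measurements, and a real logit lens would usually also apply the model's final normalization before unembedding.

```python
def logit_lens(hidden_states, unembed, vocab):
    """Return the top token each layer's hidden state decodes to."""
    decoded = []
    for h in hidden_states:
        # Project the hidden state onto every vocabulary row to get logits.
        logits = [sum(w * v for w, v in zip(row, h)) for row in unembed]
        best = max(range(len(logits)), key=logits.__getitem__)
        decoded.append(vocab[best])
    return decoded

# Hypothetical toy setup: "7" stands for the answer, "..." for a filler token.
vocab = ["7", "..."]
unembed = [[1.0, 0.0], [0.0, 1.0]]
hidden_states = [
    [0.9, 0.1],  # early layer: answer direction dominates
    [0.8, 0.3],  # middle layer: still the answer
    [0.2, 1.1],  # final layer: filler direction overwrites it
]

per_layer = logit_lens(hidden_states, unembed, vocab)  # ["7", "7", "..."]
```

Reading the per-layer output top to bottom is what makes an early answer followed by a late overwrite visible.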

The hard part has always been that latent reasoning gives up what text provides for free: you cannot easily sample it, score it, or train it with the usual reinforcement-learning machinery. The corpus's most pointed answer here is NF-CoT, which models continuous thoughts as an autoregressive normalizing flow inside the model's causal stream. That recovers exact likelihoods, probabilistic sampling, and KV-cache compatibility, so policy-gradient refinement and trajectory scoring can run on non-verbal reasoning just as they would on text (see "Can continuous thoughts have tractable likelihoods for sampling and scoring?"). That closes much of the practical gap that kept latent reasoning a research toy.
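To make "exact likelihoods" concrete, here is a generic autoregressive affine flow over a sequence of scalar thoughts. It is not NF-CoT's actual parameterization, which the notes here do not specify: the `conditioner` function is a hypothetical stand-in for whatever network maps earlier thoughts to a mean and log-scale, and real continuous thoughts would be vectors rather than scalars. The change-of-variables formula it implements is log p(x) = sum over t of [log N(z_t; 0, 1) - log sigma_t], with z_t = (x_t - mu_t) / sigma_t.

```python
import math

def conditioner(prefix):
    """Hypothetical conditioner: earlier thoughts -> (mu, log_sigma) for the next one."""
    prev = prefix[-1] if prefix else 0.0
    return 0.5 * prev, 0.1 * math.tanh(prev)

def log_standard_normal(z):
    """Log density of a standard normal at z."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))

def log_prob(xs):
    """Exact log-likelihood of a thought sequence via change of variables."""
    total = 0.0
    for t, x in enumerate(xs):
        mu, log_sigma = conditioner(xs[:t])   # depends only on earlier thoughts
        z = (x - mu) / math.exp(log_sigma)    # invert the affine transform
        total += log_standard_normal(z) - log_sigma
    return total

def sample(noise):
    """Map base noise z_1..z_T to thoughts x_1..x_T, one step at a time."""
    xs = []
    for z in noise:
        mu, log_sigma = conditioner(xs)
        xs.append(mu + math.exp(log_sigma) * z)
    return xs

thoughts = sample([0.3, -1.2, 0.7])
score = log_prob(thoughts)  # usable as a trajectory score or in a policy gradient
```

Because each step conditions only on earlier thoughts, sampling and scoring both proceed left to right, which is the property that lets such a flow sit inside a causal stream.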

It's worth seeing this against the broader test-time-scaling picture, because latent reasoning is one axis among several. Search budget scales answer quality with the same diminishing-returns curve as reasoning tokens, giving agents a knob to trade reasoning against retrieval (see "Does search budget scale like reasoning tokens for answer quality?"). Notably, the productive use of any of these budgets depends on training, not raw compute: non-reasoning models don't catch up to reasoning models no matter how much inference you throw at them, because training instills a protocol that makes the extra compute pay off (see "Can non-reasoning models catch up with more compute?"). There's also a parsimony angle: only about 20% of tokens are the high-entropy 'forking points' that actually drive reasoning gains (see "Do high-entropy tokens drive reasoning model improvements?"), which hints that many of the remaining verbalized tokens could be compressed into latent steps at little cost.
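The 'forking point' notion above can be illustrated by ranking token positions by the entropy of their next-token distribution and keeping the top fraction. This is a minimal sketch under assumptions: the `forking_positions` helper and the toy distributions are hypothetical, and the 0.2 default simply mirrors the roughly 20% figure quoted above rather than any published threshold rule.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def forking_positions(distributions, fraction=0.2):
    """Indices of the highest-entropy positions, keeping the given fraction."""
    ents = [entropy(p) for p in distributions]
    k = max(1, round(fraction * len(ents)))
    ranked = sorted(range(len(ents)), key=lambda i: ents[i], reverse=True)
    return sorted(ranked[:k])

# Hypothetical per-position distributions over a four-token vocabulary.
distributions = [
    [0.97, 0.01, 0.01, 0.01],  # near-deterministic continuation
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain: a forking point
    [0.90, 0.05, 0.03, 0.02],
    [1.00, 0.00, 0.00, 0.00],
    [0.85, 0.05, 0.05, 0.05],
]

forks = forking_positions(distributions)  # [1]
```

Positions outside the returned set are the low-entropy tokens that the paragraph suggests are the best candidates for compression.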

The quiet, surprising takeaway is that the bottleneck may not be capability at all. Five separate mechanisms show that base models already contain latent reasoning, which minimal post-training elicits rather than creates (see "Do base models already contain hidden reasoning ability?"). So 'reasoning without words' isn't asking models to do something new. It's about reaching reasoning that's already sitting in the activations, without forcing it through the narrow, lossy channel of generated text.

Sources 10 notes

Can models reason without generating visible thinking tokens?

Multiple architectures—depth-recurrent models, Heima, and Coconut—demonstrate that test-time compute scales through hidden state iteration rather than token generation. This suggests verbalization is a training artifact, not a reasoning requirement.

Can reasoning happen at the sentence level instead of tokens?

Meta's Large Concept Model operates on sentence embeddings rather than tokens, reasoning in a language-agnostic space before decoding to any target language. This hierarchical approach with paragraph-level planning produces more coherent output than flat token generation.

Can latent thought vectors scale language models beyond parameters?

Latent-Thought Language Models achieve superior sample and parameter efficiency by coupling fast local variational learning with slow global decoder learning. This dual-rate scheme scales few-shot reasoning across both model and latent size, creating independent scaling dimensions beyond traditional parameter scaling.

Do transformers hide reasoning before producing filler tokens?

Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

Can continuous thoughts have tractable likelihoods for sampling and scoring?

NF-CoT models continuous thoughts as an autoregressive normalizing flow inside the LLM's causal stream, recovering exact likelihood, probabilistic sampling, and KV-cache compatibility. This enables policy-gradient refinement and trajectory scoring on non-verbal reasoning, matching the tractability of textual CoT.

Does search budget scale like reasoning tokens for answer quality?

Agentic deep research shows monotonic-to-diminishing-returns curves for search iterations, matching reasoning token scaling. This creates a new inference-compute axis: models can trade off reasoning budget against search budget to optimize answer quality.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
