INQUIRING LINE


How do continuous concept tokens explore multiple reasoning paths without explicit sampling?

This explores how 'Soft Thinking' lets a model keep many reasoning routes alive at once by reasoning in continuous concept space — instead of picking one discrete word per step, which forces a single path.

Continuous concept tokens carry multiple reasoning paths in parallel without the model ever rolling a die to pick a token. The core idea, from Soft Thinking, is that a standard language model collapses a rich probability distribution down to one chosen token at every step, and that act of committing throws away all the other routes the model was simultaneously considering. Soft Thinking instead feeds the *whole* distribution forward as a single probability-weighted 'concept' embedding, so the next step reasons over a superposition of paths rather than one sampled branch (see: Can we explore multiple reasoning paths without committing to one token?). The exploration is implicit: there is no tree search and no sampling, just the blended embedding doing the work of many parallel guesses. The method is training-free, improves accuracy by up to 2.48 points, and cuts token count by 22.4% via entropy-based early stopping.
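
To make the mechanism concrete, here is a minimal NumPy sketch of the two ingredients described above: the probability-weighted concept embedding and an entropy check for early stopping. The toy vocabulary, the embedding values, and the `threshold` and `patience` parameters are all illustrative assumptions, and the stopping rule is only one plausible reading of "entropy-based early stopping", not the method's actual implementation.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D logit vector."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def concept_token(logits, embedding_matrix):
    """Blend every token embedding by its probability instead of picking one."""
    probs = softmax(logits)
    return probs @ embedding_matrix, probs

def entropy(probs, eps=1e-12):
    """Shannon entropy (in nats) of a next-token distribution."""
    return float(-(probs * np.log(probs + eps)).sum())

def should_stop(entropies, threshold=0.1, patience=3):
    """Assumed rule: stop once the last `patience` steps were all confident."""
    if len(entropies) < patience:
        return False
    return all(h < threshold for h in entropies[-patience:])

# Toy vocabulary of 4 tokens with 3-dimensional embeddings (made-up numbers).
E = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])
logits = np.array([2.0, 2.0, 0.0, -1.0])  # two routes are equally likely

soft, probs = concept_token(logits, E)    # keeps both routes in the mix
hard = E[np.argmax(logits)]               # discrete decoding keeps only one
```

In this toy case `hard` is just the first token's embedding, while `soft` carries equal weight from the two tied candidates, which is the "superposition" the paragraph refers to.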

The deeper move here is that reasoning doesn't have to be *verbalized* to happen. A cluster of architectures (depth-recurrent models, Heima, Coconut) shows that test-time compute can scale by iterating on hidden states rather than emitting visible thinking tokens, which suggests that spelling reasoning out in words is a training habit, not a requirement of thinking itself (see: Can models reason without generating visible thinking tokens?). Soft Thinking sits in this family: the 'concept' it passes forward is a point in continuous space, not a sentence. Meta's Large Concept Model pushes the same logic to a coarser grain, reasoning over whole-sentence embeddings in a language-agnostic space before decoding back to words (see: Can reasoning happen at the sentence level instead of tokens?).

The catch with continuous reasoning is that you lose what discrete tokens give you for free: the ability to sample a path, score it, and train on it. That is the gap normalizing flows are meant to close. NF-CoT models continuous thoughts as a tractable distribution (an autoregressive normalizing flow) inside the model's causal stream, recovering exact likelihoods, genuine probabilistic sampling, and trajectory scoring for non-verbal reasoning (see: Can continuous thoughts have tractable likelihoods for sampling and scoring?). So there are two flavors worth holding side by side: Soft Thinking explores paths *implicitly* by refusing to commit, while NF-CoT restores *explicit* sampling and scoring on top of continuous thoughts. Same continuous-space territory, opposite stances on whether you want to sample at all.
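
As a rough illustration of why a normalizing flow makes continuous thoughts tractable, the sketch below uses the simplest possible flow: an affine map whose shift and scale depend on the previous thought. The `conditioner` function is a hypothetical stand-in for the model's causal stream, and nothing here is NF-CoT's actual architecture. It only shows how the change-of-variables rule yields an exact log-likelihood, a sampler, and a trajectory score from the same object.

```python
import numpy as np

def conditioner(context):
    """Hypothetical stand-in for the causal stream: maps context to (mu, log_sigma)."""
    context = np.asarray(context, dtype=float)
    mu = 0.5 * context
    log_sigma = 0.1 * np.tanh(context)
    return mu, log_sigma

def log_prob(thought, context):
    """Exact log-likelihood via change of variables: log N(z) - sum(log_sigma)."""
    mu, log_sigma = conditioner(context)
    z = (np.asarray(thought, dtype=float) - mu) * np.exp(-log_sigma)
    log_base = -0.5 * (z ** 2 + np.log(2.0 * np.pi))
    return float(np.sum(log_base - log_sigma))

def sample(context, rng):
    """Draw a continuous thought by pushing Gaussian noise through the flow."""
    mu, log_sigma = conditioner(context)
    z = rng.standard_normal(mu.shape)
    return mu + np.exp(log_sigma) * z

def score_trajectory(thoughts, initial_context):
    """Sum per-step log-likelihoods; each thought conditions on the previous one."""
    context, total = initial_context, 0.0
    for thought in thoughts:
        total += log_prob(thought, context)
        context = thought
    return total

rng = np.random.default_rng(0)
start = np.array([0.3, -1.2])
first = sample(start, rng)
second = sample(first, rng)
trajectory_score = score_trajectory([first, second], start)
```

Because every step is invertible with a known Jacobian, `trajectory_score` is an exact log-probability rather than an estimate, which is what makes ranking or policy-gradient training on non-verbal trajectories possible in principle.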

Why does refusing to commit help? Work on discrete reasoning chains shows that the real decisions are concentrated in a small minority of moments: only ~20% of tokens are high-entropy 'forking points' where the model genuinely branches, and those are exactly where reasoning improvements live (see: Do high-entropy tokens drive reasoning model improvements?). Discrete generation forces a hard choice precisely at those forks; continuous concept tokens let the model hedge through them, keeping competing branches in superposition until the evidence resolves. Relatedly, models that abandon a branch too early and keep switching between ideas ('underthinking') waste compute and lose accuracy, which is why penalizing premature thought-switching helps (see: Do reasoning models switch between ideas too frequently?). Continuous superposition is a different remedy for the same disease: you don't have to switch paths if you never abandoned the others.
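
One way to picture the 'forking points' is to compute per-token entropy over a generated sequence and flag the top fifth. The sketch below does that on made-up distributions; the `top_fraction` parameter and the quantile-based cutoff are assumptions for illustration, not the procedure used in the cited work.

```python
import numpy as np

def token_entropies(prob_rows, eps=1e-12):
    """Entropy (nats) of each row, where a row is one step's next-token distribution."""
    p = np.asarray(prob_rows, dtype=float)
    return -(p * np.log(p + eps)).sum(axis=1)

def forking_mask(entropies, top_fraction=0.2):
    """Mark the highest-entropy steps; the cutoff is the (1 - top_fraction) quantile."""
    threshold = np.quantile(entropies, 1.0 - top_fraction)
    return entropies >= threshold

# Five made-up steps over a 4-token vocabulary; only step 1 is genuinely undecided.
steps = [
    [0.97, 0.01, 0.01, 0.01],
    [0.25, 0.25, 0.25, 0.25],
    [0.90, 0.05, 0.03, 0.02],
    [0.99, 0.005, 0.003, 0.002],
    [0.85, 0.05, 0.05, 0.05],
]
H = token_entropies(steps)
mask = forking_mask(H, top_fraction=0.2)
```

Here `mask` singles out the uniform step, the one place where discrete decoding would have to pick a branch and a blended concept token would not.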

The thing you didn't know you wanted to know: this whole approach quietly bets that the *words* in a reasoning chain are scaffolding, not substance. Evidence from elsewhere in the corpus backs that bet: models trained on deliberately corrupted, semantically irrelevant traces reason just as well, implying the trace functions as computation more than meaning (see: Do reasoning traces need to be semantically correct?). Diffusion LLMs go further, refining reasoning and answers simultaneously in place rather than left to right, with answer confidence converging early while the reasoning keeps refining (see: Can reasoning and answers be generated separately in language models?). If reasoning is really a trajectory through a continuous space, then sampling discrete tokens was always a lossy interface, and exploring 'without sampling' isn't a trick; it's closer to what the model was doing underneath all along.

Sources 8 notes

Can we explore multiple reasoning paths without committing to one token?

Training-free method replaces discrete token selection with probability-weighted concept embeddings, preserving superposition of reasoning paths. Improves accuracy up to 2.48 points while reducing tokens 22.4% via entropy-based early stopping.

Can models reason without generating visible thinking tokens?

Multiple architectures—depth-recurrent models, Heima, and Coconut—demonstrate that test-time compute scales through hidden state iteration rather than token generation. This suggests verbalization is a training artifact, not a reasoning requirement.

Can reasoning happen at the sentence level instead of tokens?

Meta's Large Concept Model operates on sentence embeddings rather than tokens, reasoning in a language-agnostic space before decoding to any target language. This hierarchical approach with paragraph-level planning produces more coherent output than flat token generation.

Can continuous thoughts have tractable likelihoods for sampling and scoring?

NF-CoT models continuous thoughts as an autoregressive normalizing flow inside the LLM's causal stream, recovering exact likelihood, probabilistic sampling, and KV-cache compatibility. This enables policy-gradient refinement and trajectory scoring on non-verbal reasoning, matching the tractability of textual CoT.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Do reasoning models switch between ideas too frequently?

o1-like models frequently abandon reasoning paths mid-exploration, wasting tokens on incomplete approaches. A decoding-only penalty on thought-transition tokens (TIP strategy) discourages switching, improving accuracy on challenging math without model fine-tuning.
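
A decoding-only penalty of this kind can be sketched as a logit adjustment applied before the next token is chosen. The token ids, the `alpha` strength, and the `window` length below are hypothetical values picked for illustration; the note does not specify how TIP sets them.

```python
import numpy as np

def apply_switch_penalty(logits, switch_token_ids, steps_in_thought,
                         alpha=3.0, window=600):
    """Lower the logits of thought-transition tokens while the current thought is young.

    alpha and window are assumed knobs: penalty strength, and how many steps
    into a thought the penalty stays active.
    """
    penalized = np.array(logits, dtype=float)  # copy, so the input is untouched
    if steps_in_thought < window:
        ids = np.asarray(list(switch_token_ids), dtype=int)
        penalized[ids] -= alpha
    return penalized

# Toy vocabulary of 3 tokens; id 1 plays the role of a switch word like "Alternatively".
raw = np.array([1.0, 2.5, 0.5])
early = apply_switch_penalty(raw, [1], steps_in_thought=10)
late = apply_switch_penalty(raw, [1], steps_in_thought=700)
```

Early in a thought the switch token loses its lead, so greedy decoding continues the current path; once the window has passed, the logits are left as the model produced them.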

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

Can reasoning and answers be generated separately in language models?

ICE shows that bidirectional attention in diffusion LLMs enables in-place prompting—embedding reasoning directly in masked positions refined alongside answers. Answer confidence converges early while reasoning continues refining, allowing early-exit mechanisms to cut compute by 50% while maintaining accuracy.
