INQUIRING LINE

Reasoning, Retrieval, and Evaluation · Model Architecture and Internals · Training, RL, and Test-Time Scaling · cross-cluster

How do you supervise reasoning that never becomes tokens?

This explores a real tension: almost every tool we have for grading reasoning—process rewards, reflection-token analysis, trace correctness—operates on visible text, so what happens when the reasoning lives in hidden state and never surfaces as words?

Several papers show this isn't hypothetical: models can reason in latent space. Depth-recurrent architectures, Heima, and Coconut all scale test-time compute by iterating on hidden states rather than emitting tokens, which suggests verbalization is a training habit, not a requirement for thinking (see "Can models reason without generating visible thinking tokens?"). Once reasoning goes silent, the usual supervision handles fall away.
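To make "iterating on hidden states rather than emitting tokens" concrete, here is a minimal toy sketch. It is not the architecture of Coconut, Heima, or any depth-recurrent model; the names `recurrent_step` and `latent_reason`, the tanh update, and the weights are all illustrative assumptions. The point it shows is only that the amount of computation is set by `num_steps`, while exactly one output is decoded at the end.

```python
import math

def recurrent_step(h, x, w):
    """One latent update, h' = tanh(W h + x). Nothing is emitted here.

    Hypothetical toy update rule, chosen only for illustration.
    """
    return [
        math.tanh(sum(w_ij * h_j for w_ij, h_j in zip(row, h)) + x_i)
        for row, x_i in zip(w, x)
    ]

def latent_reason(x, w, num_steps):
    """Iterate on the hidden state num_steps times, then decode once."""
    h = [0.0] * len(x)
    for _ in range(num_steps):
        h = recurrent_step(h, x, w)
    # Decode a single "answer" as the index of the largest activation.
    answer = max(range(len(h)), key=lambda i: h[i])
    return answer, h

# Toy input embedding and recurrent weights (made-up numbers).
x = [0.5, -0.2]
w = [[0.1, 0.3], [-0.4, 0.2]]

answer_short, h_short = latent_reason(x, w, num_steps=1)
answer_long, h_long = latent_reason(x, w, num_steps=8)
```

Between the input and the decoded answer there is no intermediate text to grade: the only things a supervisor can observe are the final output and, with access to internals, the vectors in `h`.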

The corpus shows how heavily our supervision machinery depends on tokens. Process rewards get mined from what a search agent reads but doesn't cite, an inherently text-level signal (see "Can search agent behavior yield reliable process rewards for reasoning?"). Specific words like 'Wait' and 'Therefore' turn out to be mutual-information peaks with the correct answer, and suppressing them hurts reasoning (see "Do reflection tokens carry more information about correct answers?"). And you can identify which reference-answer tokens carry reasoning by how sharply their certainty varies across sampled chains of thought (see "Can we identify which tokens actually matter for reasoning?"). All three pin supervision to the token stream. Take the stream away and they have nothing to grab.
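The variance idea in the last of those three signals can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the cited note's actual procedure: `token_probs[r][t]` is assumed to hold the probability the model assigns to reference-answer token `t` when conditioned on sampled chain of thought `r`, and both the function name `reasoning_bearing_tokens` and the `threshold` value are hypothetical.

```python
from statistics import pvariance

def reasoning_bearing_tokens(token_probs, threshold=0.02):
    """Flag reference tokens whose certainty swings across rollouts.

    token_probs[r][t] is the probability of reference token t under
    rollout r. The threshold is an arbitrary illustrative cutoff.
    """
    num_tokens = len(token_probs[0])
    variances = [
        pvariance([rollout[t] for rollout in token_probs])
        for t in range(num_tokens)
    ]
    flagged = [t for t, v in enumerate(variances) if v > threshold]
    return variances, flagged

# Made-up probabilities for the reference answer "The answer is 42"
# under three different sampled chains of thought.
reference = ["The", "answer", "is", "42"]
token_probs = [
    [0.98, 0.97, 0.99, 0.90],
    [0.97, 0.98, 0.99, 0.10],
    [0.98, 0.97, 0.98, 0.85],
]

variances, flagged = reasoning_bearing_tokens(token_probs)
print([reference[t] for t in flagged])  # prints ['42']
```

Note what the function needs as input: a reference answer expressed as tokens and rollouts expressed as text to condition on. With purely latent reasoning there are no sampled chains of thought to vary, which is the dependency the paragraph above describes.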

Here's the surprising part: a cluster of findings suggests the tokens may have been a poor supervision target all along. Models trained on systematically irrelevant traces solve problems just as well, and sometimes generalize better out of distribution, implying the trace works as computational scaffolding rather than as meaningful steps (see "Do reasoning traces need to be semantically correct?"). Chain-of-thought is shaped by format and spatial structure far more than by logical content, and invalid prompts work as well as valid ones (see "What makes chain-of-thought reasoning actually work?"). Probes even show that on easy tasks models commit to an answer internally before they finish writing the reasoning out, although on hard tasks the written reasoning does track real belief updates (see "Does chain-of-thought reasoning reflect genuine thinking or performance?"). If the visible words are often theater, supervising them was never the same as supervising the reasoning.

So the corpus points to a different answer than 'force the reasoning back into tokens': supervise the outcome and let the hidden computation organize itself around it. Quiet-STaR judges its internal rationales purely by whether they improve next-token prediction. There is no labeled correctness, and reasoning emerges as a side effect of better language modeling (see "Can models learn reasoning from predicting any text?"). Reinforcement Pre-Training does something similar, turning next-token prediction into a reasoning task graded by the corpus itself, which structurally blocks reward hacking because the answer key is the data (see "Can next-token prediction become a reasoning task with RL?"). The pattern is the same one the search-agent work uses: tie the reward to something verifiable and external, and you don't need to read the intermediate steps at all (see "Can search agent behavior yield reliable process rewards for reasoning?").
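A minimal sketch of an outcome-anchored reward in this spirit: score a hidden rationale only by how much it raises the probability of the token that actually comes next in the corpus. This is a simplification and not Quiet-STaR's or Reinforcement Pre-Training's exact objective. The function names, the log-probability difference, and the mean baseline are assumptions made for illustration, and the probabilities are made-up numbers standing in for model outputs.

```python
import math

def thought_reward(p_with_thought, p_without_thought):
    """Log-likelihood gain on the true next token from using a rationale.

    Positive means the hidden rationale helped prediction; the content of
    the rationale is never inspected.
    """
    return math.log(p_with_thought) - math.log(p_without_thought)

def centered_rewards(rewards):
    """Subtract the mean so above-average rationales get positive weight.

    A common REINFORCE-style baseline, used here as an assumption.
    """
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

# Probability of the true next token without any rationale, and with each
# of three sampled rationales (illustrative values only).
p_base = 0.30
p_with = [0.60, 0.30, 0.15]

rewards = [thought_reward(p, p_base) for p in p_with]
weights = centered_rewards(rewards)
```

Nothing in `thought_reward` reads the rationale, so the same signal applies whether the rationale is a string of tokens or a sequence of hidden states. The answer key is the next token in the data, which is what makes the reward verifiable and external.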

The alternative, if you want to keep some grip on the process, is to make the reasoning modular instead of legible. Cognitive tools enforce operation isolation through sandboxed calls, and structured critical-question prompts force the model to check its warrants. Both impose structure on reasoning without depending on the model narrating honestly (see "Can modular cognitive tools unlock reasoning without training?" and "Can structured argument prompts make LLM reasoning more rigorous?"). The less obvious takeaway is that the question quietly assumes tokens were ever good supervision. The corpus's quiet verdict is that outcome-anchored, verifiable rewards are better suited to supervising latent reasoning precisely because they never needed to see the steps.

Sources 11 notes

Can models reason without generating visible thinking tokens?

Multiple architectures—depth-recurrent models, Heima, and Coconut—demonstrate that test-time compute scales through hidden state iteration rather than token generation. This suggests verbalization is a training artifact, not a reasoning requirement.

Can search agent behavior yield reliable process rewards for reasoning?

LongTraceRL mines entity-level reasoning signals from what search agents read but don't cite—the hardest distractors—and applies rubric rewards only to correct answers, structurally blocking reward fabrication while capturing intermediate reasoning quality.

Do reflection tokens carry more information about correct answers?

Specific tokens like "Wait" and "Therefore" show sharp spikes in mutual information with correct answers. Suppressing them harms reasoning while suppressing an equal number of random tokens does not, and representation recycling improves accuracy by 20%.

Can we identify which tokens actually matter for reasoning?

A small subset of tokens in reference answers change their certainty sharply depending on which chain of thought precedes them, while most tokens remain stable. This variance pattern, computable from the model's own samples, identifies reasoning-bearing tokens without supervision.

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

What makes chain-of-thought reasoning actually work?

Research shows that training format shapes reasoning strategy 7.5× more than domain does, that demo position swings accuracy by 20%, and that invalid CoT prompts work as well as valid ones. CoT is pattern-guided generation, not formal logic.

Does chain-of-thought reasoning reflect genuine thinking or performance?

Activation probes show models commit to answers internally long before finishing their reasoning on easy tasks, but on hard tasks the reasoning process tracks real belief updates with detectable inflection points. Probe-guided early exit reduces tokens by up to 80 percent without accuracy loss.

Can models learn reasoning from predicting any text?

Quiet-STaR trains language models to generate rationales at every token position during pretraining on arbitrary internet text, enabling general reasoning without task-specific datasets. Rationale quality is judged by predictive accuracy rather than labeled correctness, allowing reasoning competence to emerge as a side effect of improved language modeling.

Can next-token prediction become a reasoning task with RL?

Reinforcement Pre-Training transforms next-token prediction into a reasoning task by providing verifiable rewards from the corpus itself, eliminating reward hacking and enabling inference-time scaling during pretraining. This suggests token-level reasoning patterns during pretraining strengthen downstream RL fine-tuning.

Can modular cognitive tools unlock reasoning without training?

Four cognitive tools implemented as sandboxed LLM calls improved GPT-4.1 on AIME2024 from 26.7% to 43.3% without any RL training. Modularity enforces operation isolation that pure prompting cannot guarantee, eliciting pre-existing reasoning capability.

Can structured argument prompts make LLM reasoning more rigorous?

Applying Toulmin's argument model as explicit prompting steps (CQoT) improves LLM reasoning by forcing models to identify warrants and backing rather than skipping implicit premises. The method catches failures that standard chain-of-thought prompting allows.
