Do reasoning traces need to be semantically correct?
Can models learn to solve problems from deliberately corrupted or irrelevant reasoning traces? This challenges assumptions about what makes intermediate tokens useful for learning.
"Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens" presents the strongest evidence yet against the assumption that reasoning traces carry meaningful semantics that contribute to solution quality.
The experimental design is clean. Transformers are trained on A* search traces for shortest-path planning in random mazes under three conditions: (1) correct traces, (2) no traces, and (3) deliberately corrupted traces. The corrupted traces are not merely noisy: each one belongs to a different problem, so it is systematically irrelevant to the problem it is paired with.
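The three conditions can be pictured as three views of one dataset. The sketch below is illustrative only: the `Example` fields, the string serialization, and the fixed-point-free shuffle used to guarantee that every trace lands on a different problem are assumptions, not the paper's actual pipeline.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    problem: str   # hypothetical serialization of a maze with start and goal
    trace: str     # hypothetical serialization of the A* search trace
    solution: str  # the shortest-path plan


def derange(n, rng):
    """Return a permutation of range(n) with no fixed points (needs n >= 2)."""
    if n < 2:
        raise ValueError("need at least two examples to swap traces")
    while True:
        perm = list(range(n))
        rng.shuffle(perm)
        if all(i != p for i, p in enumerate(perm)):
            return perm


def build_conditions(examples, seed=0):
    """Build (problem, trace, solution) triples for the three training conditions."""
    rng = random.Random(seed)
    perm = derange(len(examples), rng)
    return {
        # (1) each problem keeps its own trace
        "correct": [(e.problem, e.trace, e.solution) for e in examples],
        # (2) the trace is dropped entirely
        "no_trace": [(e.problem, "", e.solution) for e in examples],
        # (3) each problem receives the trace of a different problem
        "swapped": [
            (e.problem, examples[perm[i]].trace, e.solution)
            for i, e in enumerate(examples)
        ],
    }
```

In this sketch the swapped condition keeps the overall distribution of traces unchanged, since every trace is still used exactly once. Only the pairing between problem and trace is broken.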
The results: corrupted-trace models maintain performance largely consistent with correct-trace models. In some cases they improve on correct-trace models and generalize more robustly to out-of-distribution tasks. Models trained on entirely correct traces still produce invalid reasoning traces when arriving at correct solutions — the formal A* validator confirms only a loose correlation between trace accuracy and solution accuracy.
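The "loose correlation" between trace accuracy and solution accuracy can be made concrete with a 2x2 table. This is a minimal sketch, assuming per-example boolean labels are already available (for instance from a trace validator and a plan checker). The function names and the choice of the phi coefficient as a summary statistic are illustrative assumptions, not the metric reported in the paper.

```python
from math import sqrt


def contingency(trace_valid, solution_correct):
    """Count examples by (trace is valid, solution is correct)."""
    if len(trace_valid) != len(solution_correct):
        raise ValueError("label lists must have equal length")
    counts = {(t, s): 0 for t in (True, False) for s in (True, False)}
    for t, s in zip(trace_valid, solution_correct):
        counts[(bool(t), bool(s))] += 1
    return counts


def phi_coefficient(counts):
    """Correlation between two binary variables; None if a margin is empty."""
    n11 = counts[(True, True)]    # valid trace, correct solution
    n10 = counts[(True, False)]   # valid trace, wrong solution
    n01 = counts[(False, True)]   # invalid trace, correct solution
    n00 = counts[(False, False)]  # invalid trace, wrong solution
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        return None
    return (n11 * n00 - n10 * n01) / denom


def invalid_trace_rate_among_correct(counts):
    """Fraction of correct solutions that were reached via an invalid trace."""
    correct = counts[(True, True)] + counts[(False, True)]
    if correct == 0:
        return None
    return counts[(False, True)] / correct
```

A phi value near zero, or a high invalid-trace rate among correct solutions, would correspond to the pattern described above: the model often arrives at the right answer through a trace that does not validate.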
This result directly challenges three assumptions simultaneously. First, that intermediate tokens function as reasoning steps (they may function as computational scaffolding — additional forward passes — regardless of semantic content). Second, that correct traces are superior training data (the scaffolding hypothesis predicts that any tokens providing additional computation would work). Third, that the "aha moment" in DeepSeek R1 indicates genuine realization (a single token insertion does not change internal state; it provides one more forward pass).
The "Stop Anthropomorphizing" position paper reinforces this from a different angle. It argues the community's tendency to call intermediate tokens "thoughts" or "reasoning traces" is actively harmful — generating false confidence and directing research toward improving trace quality rather than understanding the computational mechanism. The LLM-Modulo framework (generate-test with external verification) is proposed as the principled alternative: treat the LLM as a generator, use sound external verifiers for guarantees.
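The generate-test idea can be sketched as a small loop in which the guarantee comes from the verifier, not from the generator. Everything named here is hypothetical: `propose` stands in for an LLM call, the grid encoding (0 = free, 1 = wall) is an assumption, and the verifier checks path validity only, not optimality.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Cell = Tuple[int, int]
Path = Sequence[Cell]


def verify_path(grid: List[List[int]], start: Cell, goal: Cell, path: Path) -> bool:
    """Sound external check: right endpoints, free in-bounds cells, unit moves."""
    if not path or path[0] != start or path[-1] != goal:
        return False
    for r, c in path:
        in_bounds = 0 <= r < len(grid) and 0 <= c < len(grid[0])
        if not in_bounds or grid[r][c] == 1:
            return False
    # Consecutive cells must be 4-connected neighbours.
    return all(
        abs(r1 - r2) + abs(c1 - c2) == 1
        for (r1, c1), (r2, c2) in zip(path, path[1:])
    )


def generate_test(
    propose: Callable[[Optional[Path]], Path],
    verify: Callable[[Path], bool],
    max_attempts: int = 5,
) -> Optional[Path]:
    """Ask the generator for candidates until one passes the verifier."""
    rejected: Optional[Path] = None
    for _ in range(max_attempts):
        candidate = propose(rejected)  # the last rejected candidate is the feedback
        if verify(candidate):
            return candidate
        rejected = candidate
    return None  # no verified answer within the attempt budget
```

Note that the loop never inspects any intermediate tokens the generator may have produced. Whatever the generator emits along the way, only the verified candidate is returned.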
The practical implication: optimizing trace "interpretability" or "correctness" may be orthogonal to optimizing solution accuracy. The traces most useful for model performance may be those that provide optimal computational scaffolding, not those that most closely resemble human reasoning. This converges with the related note "What do models actually learn from chain-of-thought training?", which shows from the opposite direction that structural perturbations (shuffled steps) cause severe accuracy drops while content perturbations (wrong numbers, removed keywords) cause minimal impact. Together, these findings isolate the active ingredient: the structural form of the trace, not its semantic content.
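The contrast between structural and content perturbations can be illustrated on a toy chain of thought. This is a sketch under stated assumptions: a trace is represented as a list of step strings, `shuffle_steps` and `corrupt_numbers` are hypothetical helper names, and the replacement range 0 to 99 is arbitrary. It shows the two kinds of corruption, not the procedure used in the cited work.

```python
import random
import re


def shuffle_steps(steps, rng):
    """Structural perturbation: keep every step, change their order."""
    if len(set(steps)) < 2:
        return list(steps)  # nothing to reorder
    shuffled = list(steps)
    while shuffled == list(steps):
        rng.shuffle(shuffled)
    return shuffled


def corrupt_numbers(steps, rng):
    """Content perturbation: keep the order, replace each integer with a wrong one."""

    def wrong_number(match):
        old = int(match.group())
        new = old
        while new == old:
            new = rng.randint(0, 99)
        return str(new)

    return [re.sub(r"\d+", wrong_number, step) for step in steps]
```

For a trace such as `["2 + 3 = 5", "5 * 4 = 20", "20 - 1 = 19"]`, the first function preserves every number but breaks the dependency order between steps, while the second preserves the step-by-step form but makes every value wrong.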
Theoretical backing (RL-STaR): The theoretical analysis of the STaR framework provides formal support: RL-based self-taught reasoning can improve capabilities despite incorrect reasoning steps in the training data, because the iterative policy gradient converges under bounded error conditions. The model doesn't need correct intermediate steps to learn to produce correct final answers — what matters is the policy improvement trajectory, not the fidelity of individual traces. The quality of the pre-trained model sets the floor for effective bootstrapping, but the tolerance for noisy intermediates is built into the convergence guarantee.
Inquiring lines that use this note as a source (267)
This note is a source for these synthesized inquiries.
- How can minimal pairs expose reasoning failures that single-instance accuracy metrics miss?
- Do recency-focused prompts and in-context examples work equally well for order recovery?
- Can corrupted reasoning traces be reliably distinguished from correct ones?
- Why does scaling reasoning tokens fail to improve unfamiliar tasks?
- Does irrelevant content degrade reasoning even when it fits the context window?
- How does SONAR embedding quality affect downstream reasoning accuracy?
- Why do single examples trigger large reasoning improvements in models?
- Can reflection in reasoning models be corrective rather than just confirmatory?
- Can latent reasoning architectures work as retrofits to existing models?
- What makes a background condition relevant to a specific reasoning task?
- Can models learn when to invoke search during reasoning tasks?
- Are correct reasoning traces measurably shorter than incorrect ones?
- Does iterative denoising order affect the reasoning style diffusion models learn?
- Why does explicit theory injection work better than example-based learning for reasoning tasks?
- How does optimizing for accuracy during training degrade downstream reasoning quality?
- How does token-by-token generation constrain a model's ability to plan ahead?
- Are reasoning traces really reasoning or just stylistic imitation of human thought?
- How much accuracy is preserved when removing explanatory layers from reasoning traces?
- Why do linguistic hedging markers correlate with internal confidence signals in reasoning traces?
- Why do reasoning models fail on structurally unfamiliar instances?
- Can marginal hints integrate better into reasoning than comprehensive explanations?
- What linguistic markers distinguish longer incorrect traces from correct ones?
- Can models learn to select exemplars based on reasoning skills rather than complexity?
- Why do correct reasoning traces in language models tend to be shorter?
- What behavioral markers signal when reasoning chains are performative?
- Why do logically invalid chain-of-thought examples work nearly as well?
- Can chain of thought traces be designed to prevent anthropomorphic misinterpretation?
- What makes a reasoning trace causally sufficient versus merely stylistically plausible?
- Why do correct reasoning traces appear shorter than incorrect ones?
- Do tokens beyond a critical threshold actually improve reasoning quality?
- Why does chain-of-thought fail when problems lack matching training schemata?
- Can reasoning traces prove models are actually reasoning versus mimicking?
- How much does training data format shape what reasoning strategy emerges?
- How do planning and backtracking sentences control reasoning traces?
- Can reasoning chains work without logical validity?
- Can concise reasoning traces match verbose explanation accuracy?
- Why do models show performative reasoning on easy tasks but genuine reasoning on hard ones?
- What computational role do intermediate tokens actually play in transformers?
- Does the DeepSeek R1 single token insertion represent genuine reasoning?
- Can external verifiers replace reasoning trace quality in solution guarantees?
- How does business logic specification replace annotated training datasets?
- Does partial trace guidance work better than curriculum learning for hard problems?
- Can solution traces substitute for process-level reward signals in math reasoning?
- How do semantic failure modes map to attentional and intentional layers?
- Can the three-stage DoT framework detect all cognitive distortion types reliably?
- Can verifier-guided search catch factual errors that reasoning training cannot?
- Can models learn better from critiquing errors than imitating correct responses?
- Does thinking-token overuse actually degrade reasoning accuracy in practice?
- Can AI-generated explanations of errors teach as effectively as self-resolution?
- Why do open-source models trained on proprietary outputs still fail at reasoning?
- Why does explicit reasoning degrade passage reranking performance?
- Why does reasoning accuracy degrade beyond a critical thinking token threshold?
- What causes snowball errors to accumulate across reasoning steps in language models?
- How do failed branches remain in context and contaminate subsequent reasoning?
- Can removing failed branches from edited traces improve previous mistakes?
- Why does domain accuracy improve while reasoning quality degrades after supervised fine-tuning?
- Does supervised fine-tuning improve accuracy while damaging the quality of reasoning?
- When does explicit reasoning actually degrade performance on a task?
- How much reasoning catalyst data is actually needed for improvement?
- Does irrelevant context degrade reasoning even within model context limits?
- Do self-revision tokens measurably degrade reasoning accuracy in scaled models?
- Why are correct reasoning traces consistently shorter than incorrect ones?
- Can hyperedges replace triple-based externalization in reasoning tasks?
- Can reasoning traces serve purposes beyond producing the final answer itself?
- Why do temporal reasoning patterns matter more than final answers?
- Can small models solve complex tasks using externalized reasoning graphs?
- What makes training data quality more important than quantity for reasoning?
- How does per-token adaptive compute improve efficiency in recurrent reasoning?
- How can entailment benchmarks separate genuine reasoning from memorization effects?
- Why do entities trigger memorized propositions instead of enabling reasoning?
- Why is extracting training data insufficient proof that models memorize?
- Why does the distinction between functional and causal grounding matter for AI alignment?
- Do reasoning models perform genuine logical evaluation or pattern matching?
- How do explicit reasoning traces help models construct valid syntactic trees?
- Can event boundaries be identified from statistical regularities without understanding events?
- Why do models learn reasoning form instead of actual abstract inference?
- Why does mixing reasoning traces from different teachers destabilize learning?
- Does logical trace coherence guarantee valid mathematical reasoning?
- Why does distillation transfer reasoning patterns with few examples?
- Why does iterative refinement amplify rather than correct reasoning errors?
- Does reasoning trace style explain why RL post-training improves model reasoning?
- Can derivational traces be distinguished from stylistic mimicry of reasoning?
- Why do different reasoning chains surface different relevant facts?
- Do shorter reasoning traces actually produce more reliable model outputs?
- Can synthesized explanations be more auditable than winning-chain explanations?
- Why does grokking reveal the shift from memorization to genuine understanding?
- What training signals would teach models when not to reason?
- Can token efficiency come from stopping before reflection?
- Can messy multi-agent transcripts become better training data than clean outputs?
- Why do correct reasoning traces tend to be shorter than incorrect ones?
- How much of a model's reasoning tokens are unnecessary for reaching the final answer?
- Why do language models generate reasoning tokens after internally deciding the answer?
- Why do introverted agents produce longer and more detailed reasoning traces?
- Why does reflection in reasoning models stay confirmatory instead of corrective?
- Can mechanistic interpretability explain explanation-execution disconnection?
- How does factoring perception from reasoning improve sparse-label learning?
- Why does intermediate step quality predict reasoning outcomes better than global features?
- Why do explicit linguistic markers override semantic computation in models?
- Can models distinguish between activated knowledge and genuine reasoning?
- Can latent reasoning mechanisms and recursive tracking mechanisms be combined effectively?
- Do reflection tokens and symbolic tokens serve different roles in reasoning?
- How does constraint complexity relate to optimal reasoning token budgets?
- Why does output alignment fail to catch internally incoherent reasoning?
- Why do reasoning models confidently generate wrong answers instead of abstaining?
- What happens to reasoning accuracy when models use more thinking tokens?
- Why do reasoning models reduce effort despite having token budget remaining?
- Why do recursive belief models require different training than logical derivation?
- What distinguishes inductive inference from negative evidence versus positive patterns?
- Can scaffolding frameworks isolate inductive reasoning from deductive confounds?
- Why do corrupted traces maintain performance as well as correct traces?
- How does post-training on traces improve performance without semantic reasoning?
- Does anonymizing reasoning traces harm the quality of model outputs?
- Which sentences in reasoning traces actually influence the final answer?
- Why do reasoning models wander instead of searching systematically?
- Why do invalid reasoning steps produce nearly the same performance gains?
- Is the reasoning cliff actually a tool-use problem?
- Why do larger reasoning models show cyclicity only in later layers?
- Can deliberate corruption of reasoning traces harm out of distribution generalization?
- Why do reasoning models produce unfaithful or unhelpful reasoning traces?
- Why do we measure reasoning quality by reading visible chains?
- Why do invalid prompts produce reasoning traces as effectively as valid ones?
- Why do verbalized reasoning chains fail on certain problem classes?
- How much does test-time compute improve reasoning without more tokens?
- Can training improve reasoning coherence without improving actual correctness?
- Why do reasoning traces resemble mimicry rather than verified problem-solving?
- Can training on reasoning traces teach actual self-correction or only confident first answers?
- Why does outcome supervision fail for long reasoning chains?
- Why does reflection in reasoning models tend to be confirmatory rather than corrective?
- Why do difficult problems force models to develop reasoning strategies?
- Why does supervised fine-tuning degrade reasoning quality despite raising accuracy?
- What distinguishes coherent reasoning from inaccurate but plausible predictions?
- Why does reasoning graph topology evolve differently across training phases?
- How does a single training example trigger phase transitions in reasoning output?
- How does trace coherence differ from valid mathematical proof in practice?
- Can chain-of-thought traces be faithful without causal sufficiency and necessity?
- How does trace coherence differ from trace validity in reasoning?
- Why does extending reasoning traces worsen persona consistency?
- Can a single correct example seed exponential improvement in mathematical reasoning?
- What inference strategy works better than forcing self-revision under token constraints?
- Why do SFT models memorize patterns instead of learning generalizable reasoning?
- Why does cross-text analogical reasoning fail when semantics decouple from symbols?
- Can models maintain auditable reasoning while achieving high accuracy?
- Why does fine-tuning models for continuous reasoning cause catastrophic forgetting?
- How early in token generation does the reasoning mode activate?
- Why does reflection in reasoning models confirm rather than correct initial directions?
- Do correct reasoning traces tend to be shorter than incorrect ones?
- What makes some sentences in reasoning traces have disproportionate causal influence?
- How does correctness emergence occur when no expert initially solved the task?
- What sparse mechanistic structures drive reasoning traces in language models?
- How does data quality mismatch create reasoning degradation in supervised fine-tuning?
- Can latent reasoning achieve the same substitution without tokens?
- Can reasoning catalyst data serve as a stable foundation for test-time training?
- Why do reasoning models fail at learning hidden rules from sparse exceptions?
- Can early stopping on reflection tokens save computation without accuracy loss?
- How can correct explanations coexist with failed applications in AI?
- Why do models rarely admit to their actual reasoning in chain-of-thought traces?
- Why do models skip steps that would make reasoning clearer?
- Can reasoning models succeed at logic but fail at execution?
- How does training data format shape which reasoning patterns emerge in models?
- Does directional knowledge failure indicate shallow pattern matching over deep representation?
- How can one training example improve reasoning across thousands of unseen problems?
- Can models be trained to explain instead of imitate answers?
- How does tokenization change what gets counted as valuable knowledge?
- Why does more inference compute amplify wandering rather than solving it?
- How can high benchmark performance mask broken reasoning in AI systems?
- Can inserted errors in reasoning drafts produce predictable downstream effects?
- Why does representation recycling of MI-peak tokens improve reasoning accuracy?
- Are hedging markers in incorrect traces indicators of failed backtracking?
- Do shorter correct reasoning traces contain more thought anchors than longer ones?
- Why do familiar patterns that support correct answers sometimes drive errors?
- Can memorization scores diagnose where reasoning chains become unreliable?
- What distinguishes memorized tokens from causally necessary reasoning steps?
- Why does training data format shape reasoning strategy more than content?
- Do corrupted reasoning traces teach something different than pure success traces?
- Why does failed step fraction predict reasoning quality better than trace length?
- Can instance-adaptive reasoning happen without sequential token dependencies?
- Why do correct reasoning traces stay shorter than incorrect ones?
- Why are incorrect reasoning traces longer than correct ones?
- What mechanisms cause reasoning models to wander rather than focus?
- Can knowledge density per token be measured as a quality metric?
- How do single wrong steps corrupt entire reasoning chains?
- Why do reasoning model failures stem from execution rather than reasoning?
- What happens to model reasoning accuracy as thinking token requirements exceed critical thresholds?
- Can simple structure perturbations reliably expose memorization in reasoning models?
- Can training format itself shape what reasoning strategy a model learns?
- Can verification loops and decomposition fix judgment failures?
- Why do expert reasoners skip steps that novices must state explicitly?
- What failure modes emerge when scheme classification feeds downstream reasoning pipelines?
- Can training models on backward reasoning improve their forward planning ability?
- What role do local backtracking steps play in reasoning traces?
- Why does eliminating proxy-model filtering improve reasoning emergence in pretraining?
- Why do reasoning tasks improve more than retrieval from lookup memory?
- Does trace length actually reflect problem difficulty or training proximity?
- Why do wrong numbers cost less accuracy than shuffled reasoning steps?
- Can reasoning models reject ill-posed questions or do they overthink?
- How do prior errors in reasoning context amplify future mistakes?
- Do longer chain-of-thought traces improve interpretability or just performance?
- Why do reasoning traces mislead users into trusting wrong model answers?
- Are chain-of-thought traces anthropomorphizing how AI models really reason?
- Can chain-of-thought traces harm rather than help user understanding?
- Why does evaluating errors teach more than imitating correct responses?
- Does latent reasoning capability exist in base models before any training?
- How does backward reasoning during training improve forward reasoning capability?
- Which tokens actually change across different reasoning paths in rollouts?
- How do reasoning-invariant tokens dilute learning signals in uniform averaging?
- How much of a reasoning trace is actually redundant or unnecessary?
- Why does the order of training examples matter for what models learn?
- How can reasoning quality be verified before integrating new information into a reasoning graph?
- Can models internally identify which tokens matter most for reasoning?
- Can models reason at inference without specialized internal training?
- Does reasoning happen in hidden space or in generated tokens?
- Why does naive randomness fail to improve stochastic latent reasoning models?
- Do linearized traces genuinely expand exploration beyond standard chain-of-thought?
- What distinguishes genuine capability gains from coherent but invalid reasoning traces?
- Why do reasoning traces persuade users without improving their accuracy?
- Why does reasoning transfer across different numbers but factual recall does not?
- Can smaller amounts of diverse reasoning demonstrations replace exhaustive factual training data?
- How do meta-tokens help models learn when to generate reasoning versus commit predictions?
- Why might rationales that predict common text patterns fail on hard novel reasoning?
- How does predictive accuracy on future tokens differ from correctness on labeled answers?
- How should tool-call attribution distinguish credit between successful accidents and intentional actions?
- What makes token-level reasoning during pretraining different from test-time chain-of-thought?
- Why does reasoning catalyst data remain stable across multiple self-improvement iterations?
- What quality filters distinguish useful reasoning enrichment from shallow repetition?
- Why does uniform averaging across all tokens dilute the reasoning signal?
- What evidence shows that reasoning chains encode token-level functional structure?
- How do continuous concept tokens compare to latent trajectory sampling?
- Does token-level reasoning during pretraining improve general reasoning without task-specific supervision?
- How much do compressed reasoning traces transfer across different models?
- What makes a thinking trace take information shortcuts?
- Does reasoning style transfer matter more than solution correctness in distillation?
- Does the base model already contain latent reasoning capability?
- Can models possess latent reasoning capability that training signals fail to unlock?
- Why do knowledge and reasoning train in different network layers?
- Can partial solution traces convert unproductive hard samples into learnable training data?
- Why do shorter confident reasoning traces fail on out-of-distribution problems?
- Can we measure how much prior errors bias subsequent token predictions?
- Do reasoning models need to verbalize doubt to correct their own mistakes?
- Can models recover knowledge with completely unrelated retraining tasks?
- How much of MATH-500 improvement comes from data contamination versus real reasoning gains?
- Can reasoning training fix sycophancy if it is not a reasoning failure?
- How can humans evaluate explanations from systems they did not train?
- Can post-hoc analysis of reasoning traces actively mislead users?
- Why do language model reasoning chains look fluent when they deviate from the task?
- What makes reasoning traces effective or ineffective for solving problems?
- Why do corrupted reasoning traces sometimes generalize better than correct ones?
- Does the token prediction framing actually capture what human reasoning does?
- Why do students learn better from explanations than from solving problems from scratch?
- Can base models spontaneously produce reasoning traces without any RL training?
- How does confidence filtering improve selection of reasoning traces?
- Can approximate or noisy reference answers work for RL-based reasoning training?
- Is reasoning failure caused by task complexity or training distribution gaps?
- Why are shorter reasoning traces more reliable than longer correct ones?
- What makes some reasoning traces better supervision than others despite equal accuracy?
- Can we detect redundant reasoning steps during model inference instead of training?
- Why do reasoning traces fail to accurately reflect model decision-making?
- Can this whole-artifact principle apply to other generative tasks?
- Can articulating latent reasoning processes improve transfer across domains?
- Why does reasoning backward enable better forward reasoning performance?
- How does contrapositive augmentation change the tractability of reasoning tasks?
- Can minimal training signals unlock latent reasoning capability in base models?
- What makes some training data teach brittle answers versus robust reasoning?
- How can we turn reasoning model failures into useful training signals?
- Why does latent-level prediction beat token-level prediction for reasoning?
- Can minimal training signals unlock reasoning already latent in pretrained representations?
- Why do language models use remaining tokens to rationalize instead of reconsider?
- What latent reasoning capability do base models already possess before training?
Related concepts in this collection (8)
- Do reasoning traces actually cause correct answers?
  Explores whether the intermediate 'thinking' tokens in R1-style models genuinely drive reasoning or merely mimic its appearance. Matters because false confidence in invalid traces could mask errors.
  this provides the strongest evidence for the stylistic mimicry claim: even irrelevant mimicry works
- Does chain-of-thought reasoning reveal genuine inference or pattern matching?
  Explores whether CoT instructions unlock real reasoning capabilities or simply constrain models to mimic familiar reasoning patterns from training data. This matters for understanding whether language models can actually reason abstractly.
  extends: not just constrained imitation but imitation of form without semantic content still effective
- Do language models actually use their reasoning steps?
  Chain-of-thought reasoning looks valid on the surface, but does each step genuinely influence the model's final answer, or are the reasoning chains decorative? This matters for trusting AI explanations.
  extends necessity failure: traces can be semantically irrelevant and still produce correct solutions
- Can minimal reasoning chains match full explanations?
  Does removing all explanatory text from chain-of-thought reasoning preserve accuracy? This tests whether verbose intermediate steps are necessary for solving problems or just artifacts of how language models are trained.
  both findings converge: most trace content is dispensable
- Does training on messy search processes improve reasoning?
  Can language models learn better problem-solving by observing full exploration trajectories, including mistakes and backtracking, rather than only optimal solutions? This matters because current LMs rarely see the decision-making process itself.
  complementary finding: corrupted traces show content is dispensable (scaffolding hypothesis); SoS shows the search PROCESS itself is valuable training data (process exposure hypothesis). Different mechanisms, both challenge optimal-trace supremacy.
- What do models actually learn from chain-of-thought training?
  When models train on reasoning demonstrations, do they memorize content details or absorb reasoning structure? Testing with corrupted data reveals which aspects of CoT samples actually drive learning.
  convergent evidence from the opposite direction: corrupted content is tolerated (this note) while corrupted structure causes severe degradation (that note); together they confirm traces function as structural scaffolding not semantic reasoning
- Does logical validity actually drive chain-of-thought gains?
  What if invalid reasoning in CoT exemplars still improves performance? Testing whether logical correctness or structural format is the real driver of CoT's effectiveness.
  convergent finding from prompting rather than training: invalid exemplar reasoning at inference time (that note) parallels corrupted training traces (this note), both showing logical validity is dispensable for performance gains
- What three separate factors drive chain-of-thought performance?
  Can we isolate and measure the distinct contributions of output probability, memorization, and genuine reasoning to CoT success? Understanding their relative weights matters for knowing when CoT actually reasons versus when it relies on shortcuts.
  the three-factor decomposition explains WHY corrupted traces work: output probability (the dominant factor) is shifted by intermediate token generation regardless of content validity; only the noisy-reasoning factor requires semantic correctness
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens
- Performative Thinking? The Brittle Correlation Between CoT Length and Problem Complexity
- Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces!
- The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
- Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains
- Faith and Fate: Limits of Transformers on Compositionality
- Reasoning Can Hurt the Inductive Abilities of Large Language Models
- Beyond the Last Answer: Your Reasoning Trace Uncovers More than You Think
Original note title
deliberately corrupted reasoning traces perform comparably to correct traces and sometimes generalize better out of distribution