SYNTHESIS NOTE

What three separate factors drive chain-of-thought performance?

Can we isolate and measure the distinct contributions of output probability, memorization, and genuine reasoning to CoT success? Understanding their relative weights matters for knowing when CoT actually reasons versus when it relies on shortcuts.

Synthesis note · 2026-02-22 · sourced from Reasoning Logic Internal Rules

The "Deciphering Factors Influencing CoT" paper achieves something rare: a clean decomposition of what drives Chain-of-Thought performance into three independently measurable factors, using the simple but controlled task of shift cipher decoding across GPT-4, Claude 3, and Llama 3.1.
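For readers unfamiliar with the task, a shift cipher replaces each letter with the letter a fixed number of positions later in the alphabet, and decoding reverses that shift. The sketch below is a minimal illustration only; the function names `shift_encode` and `shift_decode` are hypothetical and are not taken from the paper, and the sketch assumes lowercase ASCII text.

```python
import string

ALPHABET = string.ascii_lowercase  # assumption: lowercase ASCII letters only


def shift_encode(text: str, shift: int) -> str:
    """Move each lowercase letter forward by `shift` positions, wrapping at 'z'."""
    out = []
    for ch in text:
        if ch in ALPHABET:
            out.append(ALPHABET[(ALPHABET.index(ch) + shift) % 26])
        else:
            out.append(ch)  # leave spaces, digits and punctuation untouched
    return "".join(out)


def shift_decode(text: str, shift: int) -> str:
    """Undo shift_encode by shifting each letter backward by `shift` positions."""
    return shift_encode(text, -shift)


if __name__ == "__main__":
    print(shift_encode("stay", 1))   # prints "tubz"
    print(shift_decode("tubz", 1))   # prints "stay"
```

The shift value is the knob that makes the task controllable: the decoding rule stays the same while the number of letter positions to move, and the expected answer, can be varied independently.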

Factor 1: Output probability. The probability of the correct output in the model's distribution dramatically affects CoT accuracy. Varying only the output's probability of occurrence shifts GPT-4 accuracy from 26% to 70%. CoT works better when the answer is already more probable — it amplifies existing tendencies rather than overcoming them.

Factor 2: Memorization. Performance is higher when the specific cipher variant was more frequently encountered during pre-training. This is not reasoning — it is pattern matching against memorized instances. The frequency of encountering different shift values in training data directly predicts accuracy on those shifts.

Factor 3: Noisy reasoning. After controlling for probability and memorization, genuine reasoning effects remain — but they are noisy. Error rate increases with the number of implicit reasoning steps (shift magnitude). This is real multi-step reasoning, but each step introduces error probability, so accuracy degrades with chain length.
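The degradation with chain length can be illustrated with a toy model. The sketch below assumes that every implicit step fails independently with the same probability, which is a simplifying assumption made here for illustration and not the model fitted in the paper; the name `chain_accuracy` and the error rate used are hypothetical.

```python
def chain_accuracy(per_step_error: float, n_steps: int) -> float:
    """Probability that all n_steps succeed, assuming independent steps
    that each fail with probability per_step_error (toy assumption)."""
    if not 0.0 <= per_step_error <= 1.0:
        raise ValueError("per_step_error must be between 0 and 1")
    if n_steps < 0:
        raise ValueError("n_steps must be non-negative")
    return (1.0 - per_step_error) ** n_steps


if __name__ == "__main__":
    # Hypothetical 5% error per step; accuracy falls as the chain grows.
    for steps in (1, 3, 6, 13):
        print(steps, round(chain_accuracy(0.05, steps), 3))
```

Under this assumption accuracy decays geometrically with the number of steps, so even a small per-step error rate becomes significant for long chains.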

The decomposition resolves the ongoing debate about whether LLMs reason or memorize: they do both, simultaneously, and the contribution of each factor varies by task. This supports the note "Does chain-of-thought reasoning reveal genuine inference or pattern matching?" while adding a crucial nuance: the imitation is partially genuine, but contaminated by probability bias and memorization artifacts.

The probability factor is particularly important for understanding CoT faithfulness. Read alongside the note "Do language models actually use their reasoning steps?", the probability dependence reveals a specific mechanism for causal insufficiency: CoT "reasoning" succeeds partly because the sequence of generated tokens increases the conditional probability of the correct answer, not because the logical content is being processed. This is plausibly the same mechanism behind the note "Does logical validity actually drive chain-of-thought gains?": invalid exemplars work because they still generate token sequences that shift output probability toward correct answers.

The noisy-reasoning factor connects to the note "Does more thinking time always improve reasoning accuracy?": if each reasoning step adds noise, then past some threshold the accumulated noise exceeds the signal, producing the inverted-U performance curve.

Inquiring lines that read this note 62

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

What actually drives chain-of-thought reasoning improvements in language models?

Does AI fluency substitute for verifiable accuracy in human judgment?

What structural features force users to evaluate the epistemic status of outputs?

How faithfully do LLMs reflect their actual reasoning in outputs and explanations?

Why do correct reasoning traces tend to be shorter than incorrect ones?

When do additional thinking tokens stop improving reasoning performance?

How should planning and perception grounding be factored in agent design?

What interference occurs when planning and synthesis happen in the same component?

Do reasoning traces faithfully represent or merely mimic actual model reasoning?

How does latent reasoning compare to verbalized chain-of-thought?

How do training priors constrain what context information can override?

What mechanism makes keyword probability the strongest predictor of priming?

Why do continual learning scenarios trigger catastrophic forgetting and interference?

How do the three grokking phases connect to memorization capacity limits?

Is model self-awareness based on genuine introspection or pattern matching?

What are the seven components of genuine mental state simulation?

Can prompting inject entirely new knowledge into language models?

Which structural properties of CoT prompts matter most for performance?

Does parallel reasoning outperform sequential thinking under fixed compute budgets?

How should memory consolidation strategies shape agent performance over time?

How do insert, forget, and merge operations maintain thought coherence over time?

How can process reward models supervise complex reasoning traces?

How does reasoning graph topology affect breakthrough insights and generalization?

How do evaluation biases undermine LLM quality assessment systems?

Why does probability of text completion not equal knowledge value?

How do we evaluate AI systems when user perception misleads actual performance?

How do satisfaction scores differ from genuine cognitive improvement?

How do LLMs distinguish causal reasoning from temporal and semantic associations?

What properties determine whether reward signals teach genuine reasoning?

What memory architectures best support persistent reasoning across extended interactions?

How do memorization and attention map onto different memory systems?

Why do agents confidently report success despite actually failing tasks?

How should tool-call attribution distinguish credit between successful accidents and intentional actions?

Why do benchmark improvements fail to reflect actual reasoning quality?

How does example difficulty affect learning efficiency in language models?

Why does target probability matter more than task logical complexity?

How can AI agents autonomously learn and transfer skills across tasks?

What makes trajectory quality matter more than one-shot task success?

How does memorization interact with learning and generalization?

What is the theoretical capacity limit before memorization saturates?

How can identical external performance mask different internal representations?

What makes some frictions negligible while others block entire pathways?

How do transformer attention mechanisms implement memory and algorithmic functions?

What computation remains in the attention heads that programs cannot capture?

Related concepts in this collection 6

Does chain-of-thought reasoning reveal genuine inference or pattern matching? Explores whether CoT instructions unlock real reasoning capabilities or simply constrain models to mimic familiar reasoning patterns from training data. This matters for understanding whether language models can actually reason abstractly.
nuances: imitation IS partially genuine, but noisy and probability-contaminated
Do language models actually use their reasoning steps? Chain-of-thought reasoning looks valid on the surface, but does each step genuinely influence the model's final answer, or are the reasoning chains decorative? This matters for trusting AI explanations.
probability dependence is a specific causal insufficiency mechanism
Does logical validity actually drive chain-of-thought gains? What if invalid reasoning in CoT exemplars still improves performance? Testing whether logical correctness or structural format is the real driver of CoT's effectiveness.
probability mechanism explains why invalid exemplars work
Does more thinking time always improve reasoning accuracy? Explores whether extending a model's thinking tokens linearly improves performance, or if there's a point beyond which additional reasoning becomes counterproductive.
noisy reasoning factor predicts the inverted-U: accumulated noise exceeds signal
Do reasoning traces need to be semantically correct? Can models learn to solve problems from deliberately corrupted or irrelevant reasoning traces? This challenges assumptions about what makes intermediate tokens useful for learning.
the probability factor explains why corrupted traces still work: intermediate tokens shift output probability toward correct answers regardless of their semantic content; the "genuine reasoning" factor is only one of three contributors
What do models actually learn from chain-of-thought training? When models train on reasoning demonstrations, do they memorize content details or absorb reasoning structure? Testing with corrupted data reveals which aspects of CoT samples actually drive learning.
aligns with the three-factor decomposition: structural coherence provides the scaffolding for the probability and noisy-reasoning factors to operate, while content correctness maps only to the memorization factor

Original note title

cot performance reflects three disentangled factors — output probability memorization and noisy reasoning
