SYNTHESIS NOTE

What three separate factors drive chain-of-thought performance?

Can we isolate and measure the distinct contributions of output probability, memorization, and genuine reasoning to CoT success? Understanding their relative weights matters for knowing when CoT actually reasons versus when it relies on shortcuts.

Synthesis note · 2026-02-22 · sourced from Reasoning Logic Internal Rules

The "Deciphering Factors Influencing CoT" paper achieves something rare: a clean decomposition of what drives Chain-of-Thought performance into three independently measurable factors, using the simple but controlled task of shift cipher decoding across GPT-4, Claude 3, and Llama 3.1.
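For readers unfamiliar with the task, a shift cipher replaces each letter with the letter a fixed number of positions later in the alphabet, and decoding reverses that shift. The sketch below is a minimal illustration of the task itself, not the paper's evaluation code; the function names `shift_text`, `encode`, and `decode` are hypothetical.

```python
def shift_text(text: str, k: int) -> str:
    """Shift every ASCII letter k positions through the alphabet, wrapping around.

    Non-letter characters (spaces, punctuation, digits) are left unchanged.
    """
    out = []
    for ch in text:
        if ch.isascii() and ch.isalpha():
            base = ord("a") if ch.islower() else ord("A")
            out.append(chr((ord(ch) - base + k) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)


def encode(plaintext: str, shift: int) -> str:
    """Encode with a shift cipher (shift=13 is the familiar rot-13)."""
    return shift_text(plaintext, shift)


def decode(ciphertext: str, shift: int) -> str:
    """Decode by shifting each letter back by the same amount."""
    return shift_text(ciphertext, -shift)


if __name__ == "__main__":
    secret = encode("stay", 13)
    print(secret, "->", decode(secret, 13))
```

The task is useful as a probe because the shift value sets the number of implicit steps per letter, while the choice of plaintext controls how probable the correct output is.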

Factor 1: Output probability. The probability of the correct output in the model's distribution dramatically affects CoT accuracy. Varying only the output's probability of occurrence shifts GPT-4 accuracy from 26% to 70%. CoT works better when the answer is already more probable — it amplifies existing tendencies rather than overcoming them.

Factor 2: Memorization. Performance is higher when the specific cipher variant was more frequently encountered during pre-training. This is not reasoning — it is pattern matching against memorized instances. The frequency of encountering different shift values in training data directly predicts accuracy on those shifts.

Factor 3: Noisy reasoning. After controlling for probability and memorization, genuine reasoning effects remain — but they are noisy. Error rate increases with the number of implicit reasoning steps (shift magnitude). This is real multi-step reasoning, but each step introduces error probability, so accuracy degrades with chain length.
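The degradation with chain length can be sketched with a simple compounding-error model. This is an illustrative assumption, not the paper's fitted model: it treats each implicit step as failing independently with the same probability, and the value of `per_step_error` is made up for the example.

```python
def chain_accuracy(per_step_error: float, n_steps: int) -> float:
    """Probability that all n_steps succeed, assuming independent, equal per-step error."""
    if not 0.0 <= per_step_error <= 1.0:
        raise ValueError("per_step_error must be in [0, 1]")
    if n_steps < 0:
        raise ValueError("n_steps must be non-negative")
    return (1.0 - per_step_error) ** n_steps


if __name__ == "__main__":
    # Hypothetical per-step error rate; the shift magnitude stands in for the step count.
    eps = 0.05
    for shift in (1, 3, 6, 12):
        print(f"shift={shift:2d}  expected accuracy={chain_accuracy(eps, shift):.3f}")
```

Under these assumptions accuracy falls geometrically with the number of steps, which matches the qualitative pattern described above: longer implicit chains give more opportunities for an error.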

The decomposition resolves the ongoing debate about whether LLMs reason or memorize: they do both, simultaneously, and the contribution of each factor varies by task. This supports the note "Does chain-of-thought reasoning reveal genuine inference or pattern matching?" while adding a crucial nuance: the imitation is partially genuine, but it is contaminated by probability bias and memorization artifacts.

The probability factor is particularly important for understanding CoT faithfulness. Read alongside the note "Do language models actually use their reasoning steps?", the probability dependence reveals a specific mechanism for causal insufficiency: CoT "reasoning" succeeds partly because the sequence of generated tokens increases the conditional probability of the correct answer, not because the logical content is being processed. This is the same mechanism behind the note "Does logical validity actually drive chain-of-thought gains?": invalid exemplars work because they still generate token sequences that shift output probability toward correct answers.

The noisy-reasoning factor connects to the note "Does more thinking time always improve reasoning accuracy?": if each reasoning step adds noise, then past some threshold the accumulated noise exceeds the signal, producing the inverted-U performance curve.
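A toy model makes the inverted-U argument concrete. Everything here is a hypothetical illustration rather than a result from any of the cited work: it assumes the benefit of extra steps saturates exponentially (`gain_rate`) while each step independently risks an error (`per_step_error`), and multiplies the two.

```python
import math


def toy_accuracy(n_steps: int, gain_rate: float = 0.5, per_step_error: float = 0.03) -> float:
    """Toy accuracy: saturating benefit of more steps times the chance no step failed.

    Both parameters are made-up values chosen only to show the shape of the curve.
    """
    signal = 1.0 - math.exp(-gain_rate * n_steps)   # diminishing returns from more steps
    survival = (1.0 - per_step_error) ** n_steps    # probability that every step is correct
    return signal * survival


def best_length(max_steps: int = 50, **kwargs: float) -> int:
    """Step count in 1..max_steps with the highest toy accuracy."""
    return max(range(1, max_steps + 1), key=lambda n: toy_accuracy(n, **kwargs))


if __name__ == "__main__":
    peak = best_length()
    for n in (1, peak, 50):
        print(f"steps={n:2d}  toy accuracy={toy_accuracy(n):.3f}")
```

With these assumed parameters the curve rises, peaks at an intermediate chain length, and then declines as accumulated noise outweighs the shrinking benefit of each additional step.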

Original note title

cot performance reflects three disentangled factors — output probability memorization and noisy reasoning