SYNTHESIS NOTE
Reasoning, Retrieval, and Evaluation

Does logical validity actually drive chain-of-thought gains?

What if invalid reasoning in CoT exemplars still improves performance? Testing whether logical correctness or structural format is the real driver of CoT's effectiveness.

Synthesis note · 2026-02-22 · sourced from Reasoning Logic Internal Rules
What makes chain-of-thought reasoning actually work? How should researchers navigate LLM reasoning research? Do reasoning traces show how models actually think?

"Invalid Logic, Equivalent Gains" runs a clean experiment: replace valid reasoning in CoT exemplar prompts with completely illogical reasoning, then measure performance on BIG-Bench Hard tasks. The result: logically invalid CoT prompts perform close behind valid CoT and outperform answer-only prompting. The reasoning content of CoT exemplars is not what drives the performance gain.

This is a sharp test because it isolates the contribution of logical validity from everything else CoT provides: output format, step decomposition, intermediate token generation, attention pattern scaffolding. If invalid reasoning still helps, then the benefit comes from these structural properties, not from the reasoning itself.
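The isolation described above can be sketched as three prompt conditions that differ only in the rationale lines. This is a minimal illustration, not the paper's actual harness: the names `Exemplar`, `render_exemplar`, `build_prompt`, and the `DEMO` exemplar are all hypothetical, and the prompt template is an assumption made for this note.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

# The three prompting conditions compared in the experiment described above.
CONDITIONS = ("answer_only", "valid_cot", "invalid_cot")


@dataclass(frozen=True)
class Exemplar:
    """One few-shot exemplar with two interchangeable rationales (hypothetical)."""
    question: str
    valid_steps: Tuple[str, ...]    # logically sound rationale
    invalid_steps: Tuple[str, ...]  # same length and style, but illogical
    answer: str


def render_exemplar(ex: Exemplar, condition: str) -> str:
    """Render an exemplar; only the rationale lines vary across conditions."""
    if condition not in CONDITIONS:
        raise ValueError(f"unknown condition: {condition}")
    lines = [f"Q: {ex.question}"]
    if condition == "valid_cot":
        lines.extend(ex.valid_steps)
    elif condition == "invalid_cot":
        lines.extend(ex.invalid_steps)
    # The answer scaffold is identical in every condition.
    lines.append(f"So the answer is {ex.answer}.")
    return "\n".join(lines)


def build_prompt(exemplars: Sequence[Exemplar], test_question: str, condition: str) -> str:
    """Concatenate rendered exemplars, then append the unanswered test question."""
    blocks = [render_exemplar(ex, condition) for ex in exemplars]
    blocks.append(f"Q: {test_question}")
    return "\n\n".join(blocks)


# A made-up exemplar for illustration only; it is not taken from BIG-Bench Hard.
DEMO = Exemplar(
    question="Alice is taller than Bob, and Bob is taller than Carol. Who is shortest?",
    valid_steps=(
        "Alice is taller than Bob.",
        "Bob is taller than Carol.",
        "So Carol is shorter than both of them.",
    ),
    invalid_steps=(
        "Alice is taller than Bob.",
        "Carol is mentioned last in the question.",
        "So Bob must be shorter than himself.",
    ),
    answer="Carol",
)
```

Because the valid and invalid prompts share step count, line format, and answer scaffold, any accuracy gap between them can be attributed to logical validity, while the gap between either of them and `answer_only` reflects the structural contribution.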

The finding bears directly on the question "Does chain-of-thought reasoning reveal genuine inference or pattern matching?" If the model were learning to reason from the exemplars, invalid exemplars would degrade performance substantially. Instead, the model appears to learn the form of step-by-step output: the structure activates latent capabilities without the exemplar content needing to be logically sound.

This also deepens the question "Do language models actually use their reasoning steps?" If the exemplar reasoning doesn't need to be valid for CoT to work, then the model's own generated reasoning may similarly be decorative rather than causal. The exemplar finding makes the faithfulness concern bidirectional: the input reasoning (exemplars) need not be logically valid for the performance gain to occur, and the same may hold for the output reasoning (generated CoT).

The practical implication: CoT prompt engineering should focus on structural properties (step count, decomposition format, answer scaffolding) rather than on the logical correctness of the exemplar reasoning. This is consistent with the note "Why do chain-of-thought examples fail across different conditions?": the dimensions that matter are structural (complexity, order, style), not logical.
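One way to make "structural properties" concrete is to profile each exemplar rationale before comparing prompts. The sketch below is an assumption-laden illustration: `StructureProfile`, `profile_rationale`, the one-step-per-line convention, and the "the answer is" scaffold pattern are hypothetical choices for this note, not a method taken from the source paper.

```python
import re
from dataclasses import dataclass

# Assumed answer scaffold: the final line contains the phrase "the answer is".
ANSWER_PATTERN = re.compile(r"\bthe answer is\b", re.IGNORECASE)


@dataclass(frozen=True)
class StructureProfile:
    step_count: int             # number of reasoning lines before the answer
    mean_words_per_step: float  # rough proxy for step complexity
    has_answer_scaffold: bool   # whether the rationale ends with the scaffold


def profile_rationale(rationale: str) -> StructureProfile:
    """Summarize the shape of a rationale while ignoring its logical content."""
    lines = [ln.strip() for ln in rationale.splitlines() if ln.strip()]
    has_scaffold = bool(lines) and bool(ANSWER_PATTERN.search(lines[-1]))
    steps = lines[:-1] if has_scaffold else lines
    mean_words = sum(len(s.split()) for s in steps) / len(steps) if steps else 0.0
    return StructureProfile(len(steps), mean_words, has_scaffold)


# Two made-up rationales: one sound, one nonsensical, with matching structure.
valid = "Bob is shorter than Alice.\nCarol is shorter than Bob.\nSo the answer is Carol."
invalid = "Bob is shorter than Monday.\nCarol is therefore a number.\nSo the answer is Carol."
same_structure = profile_rationale(valid) == profile_rationale(invalid)
```

A check like this lets a valid and an invalid exemplar be matched on structure first, so that any remaining performance difference is not confounded by step count or formatting.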

Original note title

logically invalid cot prompts perform nearly as well as valid ones — valid reasoning is not the chief driver of cot gains