Does logical validity actually drive chain-of-thought gains?

What if invalid reasoning in CoT exemplars still improves performance? Testing whether logical correctness or structural format is the real driver of CoT's effectiveness.

Synthesis note · 2026-02-22 · sourced from Reasoning Logic Internal Rules

"Invalid Logic, Equivalent Gains" runs a clean experiment: replace the valid reasoning in CoT exemplar prompts with completely illogical reasoning, then measure performance on BIG-Bench Hard tasks. The result: logically invalid CoT prompts perform close behind valid CoT prompts and outperform answer-only prompting. The logical validity of the exemplar reasoning is therefore not the main driver of the performance gain.

This is a sharp test because it isolates the contribution of logical validity from everything else CoT provides: output format, step decomposition, intermediate token generation, attention pattern scaffolding. If invalid reasoning still helps, then the benefit comes from these structural properties, not from the reasoning itself.
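The three prompting conditions compared above can be sketched as a small prompt builder. This is a minimal illustration under stated assumptions, not the paper's code: the `Exemplar` class, the `build_prompt` function, the prompt template, and the toy arithmetic exemplar are all hypothetical, and the exemplar is not drawn from BIG-Bench Hard. The point is that the valid and invalid conditions share the same format and differ only in whether the steps are logically sound.

```python
from dataclasses import dataclass


@dataclass
class Exemplar:
    question: str
    valid_rationale: str    # logically sound steps
    invalid_rationale: str  # same step format, illogical content
    answer: str


CONDITIONS = ("answer_only", "valid_cot", "invalid_cot")


def build_prompt(exemplars, query, condition):
    """Build a few-shot prompt for one of the three conditions (hypothetical template)."""
    if condition not in CONDITIONS:
        raise ValueError(f"unknown condition: {condition!r}")
    blocks = []
    for ex in exemplars:
        if condition == "answer_only":
            body = f"A: {ex.answer}"
        else:
            rationale = (
                ex.valid_rationale if condition == "valid_cot" else ex.invalid_rationale
            )
            body = f"A: Let's think step by step. {rationale} So the answer is {ex.answer}."
        blocks.append(f"Q: {ex.question}\n{body}")
    # The query is left open for the model to complete.
    blocks.append(f"Q: {query}\nA:")
    return "\n\n".join(blocks)


# Toy exemplar, invented for illustration only.
demo = [
    Exemplar(
        question="Alice has 3 apples and buys 2 more. How many apples does she have?",
        valid_rationale="Alice starts with 3 apples. She buys 2 more, so 3 + 2 = 5.",
        invalid_rationale="Alice starts with 3 apples. Apples are red, so 3 - 2 = 9.",
        answer="5",
    )
]
query = "Bob has 4 pens and loses 1. How many pens does he have?"
prompts = {c: build_prompt(demo, query, c) for c in CONDITIONS}
```

Scoring each of the three prompts on the same task set would then separate the effect of the step-by-step format (answer-only versus either CoT condition) from the effect of logical validity (valid versus invalid CoT).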

The finding directly supports the pattern-matching side of the note "Does chain-of-thought reasoning reveal genuine inference or pattern matching?". If the model were learning to reason from exemplars, invalid exemplars would degrade performance substantially. Instead, the model appears to learn the form of step-by-step output: the structure activates latent capabilities without the exemplar content needing to be logically sound.

This also deepens the concern raised in the note "Do language models actually use their reasoning steps?". If the exemplar reasoning does not need to be valid for CoT to work, then the model's own generated reasoning may similarly be decorative rather than causal. The exemplar finding makes the faithfulness concern bidirectional: neither the input reasoning (exemplars) nor the output reasoning (generated CoT) needs to be logically valid for the performance gain to occur.

The practical implication: CoT prompt engineering should focus on structural properties (step count, decomposition format, answer scaffolding) rather than on the logical correctness of the exemplar reasoning. This is consistent with the note "Why do chain-of-thought examples fail across different conditions?", where the dimensions that matter are structural (complexity, order, style), not logical.
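One way to act on this is to compare exemplars by their structural profile rather than by their content. The sketch below is a rough heuristic under stated assumptions: `structural_profile` is a hypothetical helper, the sentence-splitting regex is a crude stand-in for real step segmentation, and the "So the answer is" marker is an assumed scaffold phrase, not one taken from the source.

```python
import re


def structural_profile(rationale, answer_marker="So the answer is"):
    """Summarize structural properties of an exemplar rationale (heuristic sketch)."""
    # Treat each sentence as one step; split on whitespace after ., ! or ?
    sentences = [
        s.strip() for s in re.split(r"(?<=[.!?])\s+", rationale.strip()) if s.strip()
    ]
    # The final answer sentence is scaffolding, not a reasoning step.
    steps = [s for s in sentences if not s.startswith(answer_marker)]
    words_per_step = [len(s.split()) for s in steps]
    return {
        "step_count": len(steps),
        "mean_words_per_step": (
            sum(words_per_step) / len(words_per_step) if words_per_step else 0.0
        ),
        "has_answer_scaffold": answer_marker in rationale,
    }


# Toy rationales, invented for illustration: same structure, different validity.
valid = "Alice starts with 3 apples. She buys 2 more, so 3 + 2 = 5. So the answer is 5."
invalid = "Alice starts with 3 apples. Apples are red, so 3 - 2 = 9. So the answer is 5."

valid_profile = structural_profile(valid)
invalid_profile = structural_profile(invalid)
```

Here both rationales have the same step count and the same answer scaffold even though only one is logically sound, which is the kind of structural match the invalid-exemplar experiment relies on.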
