Does chain-of-thought reasoning amplify bullshit or just make it more visible?
This explores whether the visible step-by-step reasoning in chain-of-thought (CoT) actually makes models think better, or whether it's mostly persuasive-looking text dressed up as logic — i.e., does showing its work add rigor, or just amplify confident-sounding nonsense?
The corpus leans hard toward the uncomfortable answer: a lot of CoT is form without substance. The clearest evidence is that logically invalid reasoning chains perform almost as well as valid ones: swap in exemplars whose steps don't actually follow, and accuracy barely moves Does logical validity actually drive chain-of-thought gains?. If the logic doesn't matter, then what the model learned was the *shape* of reasoning, not reasoning itself. Several notes converge on this framing of CoT as 'constrained imitation', pattern-matching the appearance of inference rather than performing it Why does chain-of-thought reasoning fail in predictable ways? What makes chain-of-thought reasoning actually work?, with training format shaping the reasoning strategy 7.5× more than the actual domain What makes chain-of-thought reasoning actually work?.
So to your dichotomy: amplify or just make visible? The corpus suggests a third option that's more unsettling than either: the visible trace is often *decorative*. Studies of reasoning faithfulness find the steps frequently fail both causal sufficiency (the steps don't always change the answer) and causal necessity (spurious steps are common), meaning most evaluations measure how good the output *looks*, not whether the reasoning caused it Do language models actually use their reasoning steps?. One sharp result shows that R1's intermediate tokens carry no special execution semantics: they're generated the same way as any other text, so invalid traces frequently yield correct answers. The trace is stylistic mimicry, which the authors flag as a safety risk precisely because it *looks* trustworthy Do reasoning traces actually cause correct answers?. In other words, CoT doesn't just make existing bullshit visible; it manufactures an authoritative-looking surface that makes bullshit *harder* to detect.
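The difference between a decorative trace and a load-bearing one can be made concrete with an intervention test: delete or corrupt a step and check whether the answer moves. The sketch below is a toy under stated assumptions, not the procedure used in the cited studies. `decorative_model`, `trace_reading_model`, `necessity_rate`, `corruption_flips_answer`, and the example trace are all hypothetical stand-ins; a real test would re-query an actual model with the edited trace.

```python
from typing import Callable, List

# A "model" here is any function mapping (question, trace steps) to an answer.
Answerer = Callable[[str, List[str]], str]


def decorative_model(question: str, steps: List[str]) -> str:
    """Hypothetical model that ignores its trace and answers from the question."""
    return "4" if "2 + 2" in question else "unknown"


def trace_reading_model(question: str, steps: List[str]) -> str:
    """Hypothetical model whose answer is the last word of the last step."""
    words = steps[-1].split() if steps else []
    return words[-1] if words else "unknown"


def necessity_rate(model: Answerer, question: str, steps: List[str]) -> float:
    """Fraction of steps whose deletion changes the final answer."""
    if not steps:
        return 0.0
    baseline = model(question, steps)
    changed = sum(
        model(question, steps[:i] + steps[i + 1:]) != baseline
        for i in range(len(steps))
    )
    return changed / len(steps)


def corruption_flips_answer(
    model: Answerer, question: str, steps: List[str], index: int, corrupted: str
) -> bool:
    """True if replacing steps[index] with a corrupted step changes the answer."""
    edited = list(steps)
    edited[index] = corrupted
    return model(question, edited) != model(question, steps)


# Illustrative inputs only; not taken from any benchmark.
QUESTION = "What is 2 + 2?"
TRACE = ["2 + 2 means adding two twos", "so the total is 4"]
```

For the decorative model, no deletion or corruption changes the answer, even though the trace reads as perfectly sensible. For the trace-reading model, corrupting the final step flips the answer. An evaluation that only scores how the trace looks cannot tell these two cases apart.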
But there's an important boundary condition that rescues CoT from total cynicism: difficulty. On easy tasks, activation probes show models commit to their answer internally before they finish 'reasoning' — the trace is post-hoc performance. On hard tasks, the reasoning actually tracks belief updates, with detectable inflection points where the model changes its mind Does chain-of-thought reasoning reflect genuine thinking or performance?. So CoT is performative theater on problems the model already knows, and genuine work on problems it doesn't. That maps onto why some questions do *better* without step-by-step prompting at all — for simple questions, a direct question-to-answer flow beats reasoning Why do some questions perform better without step-by-step reasoning?.
There's also a 'too much of it' failure mode that's relevant to amplification. Reasoning accuracy follows an inverted-U: it peaks at intermediate length and declines as chains get longer, with more capable models actually preferring *shorter* chains Why does chain of thought accuracy eventually decline with length?. Pushing thinking tokens from ~1,100 to ~16K dropped benchmark accuracy from 87.3% to 70.3% Does more thinking time always improve reasoning accuracy?. So more visible reasoning is not more reasoning: beyond a point, the extra elaboration is noise the model talks itself into. And those extra steps are attack surface: manipulative multi-turn prompts hurt reasoning models *more* than standard ones, because each elaboration step is another place for a corrupted premise to propagate Why do reasoning models fail under manipulative prompts?.
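To put the overthinking figures in proportion, here is the arithmetic as a short sketch. It uses the figures from the source note (87.3% and 70.3% accuracy, roughly 1,100 and 16,000 thinking tokens); the token counts are approximate, so the ratio is too. The function name `accuracy_drop` is just an illustrative helper.

```python
from typing import Tuple

# Figures as reported in the source note; token counts are approximate.
TOKENS_SHORT = 1_100
TOKENS_LONG = 16_000
ACC_SHORT = 87.3  # percent
ACC_LONG = 70.3   # percent


def accuracy_drop(before: float, after: float) -> Tuple[float, float]:
    """Return (drop in percentage points, drop relative to `before` in percent)."""
    points = before - after
    relative = 100.0 * points / before
    return points, relative


points, relative = accuracy_drop(ACC_SHORT, ACC_LONG)
token_ratio = TOKENS_LONG / TOKENS_SHORT

print(f"{points:.1f} points lost ({relative:.1f}% relative) "
      f"for {token_ratio:.1f}x the thinking tokens")
```

That is a loss of 17 percentage points, about a fifth of the original accuracy, for roughly 14.5 times the thinking tokens.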
The thing you didn't know you wanted to know: much of the trace can be pruned with no accuracy loss. Verification and backtracking steps — the parts that *look* most like careful thinking — receive minimal downstream attention and can be cut by 75% while preserving the answer Can reasoning steps be dynamically pruned without losing accuracy?. That's the strongest version of your question's answer: the most rhetorically reassuring parts of a chain of thought are often the most disposable. CoT amplifies the *appearance* of rigor more reliably than it amplifies rigor itself — which means it makes bullshit more visible only to a reader who knows the trace isn't load-bearing.
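The pruning idea can be sketched as a simple selection rule: score each step by how much later tokens attend to it, keep the top fraction, and preserve the original order. This is a minimal illustration and not the cited framework's implementation. The `Step` type, the `prune_steps` function, and the attention scores below are hypothetical, and whether accuracy survives pruning has to be checked by re-running the model on the shortened trace.

```python
import math
from typing import List, NamedTuple


class Step(NamedTuple):
    text: str
    kind: str         # e.g. "derivation", "verification", "backtracking"
    attention: float  # hypothetical downstream-attention score


def prune_steps(steps: List[Step], keep_fraction: float) -> List[Step]:
    """Keep the highest-attention steps, preserving their original order."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    n_keep = max(1, math.ceil(len(steps) * keep_fraction))
    # Rank step indices by attention, highest first, then restore trace order.
    ranked = sorted(range(len(steps)), key=lambda i: steps[i].attention, reverse=True)
    return [steps[i] for i in sorted(ranked[:n_keep])]


# Made-up scores for illustration; real scores would come from attention maps.
TRACE_STEPS = [
    Step("Set up the equation", "derivation", 0.90),
    Step("Let me double-check that", "verification", 0.05),
    Step("Wait, try another route", "backtracking", 0.10),
    Step("Solve for x", "derivation", 0.80),
]

kept = prune_steps(TRACE_STEPS, keep_fraction=0.5)
```

With these made-up scores, the two derivation steps survive and the verification and backtracking steps are dropped, which mirrors the pattern described above.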
Sources (12 notes)
Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.
CoT guides models to pattern-match reasoning structure rather than perform genuine inference. This explains distribution-bounded failures, why structural coherence matters more than content correctness, and why performance optimizes against interpretability.
CoT systems reproduce the form of reasoning through pattern matching rather than performing genuine logical inference. This explains why format effects dominate content, why structurally invalid prompts succeed, and why stronger reasoning models become less instruction-compliant.
Research shows training format shapes reasoning strategy 7.5× more than domain, demo position swings accuracy 20%, and invalid CoT prompts work as well as valid ones. CoT is pattern-guided generation, not formal logic.
LLM reasoning chains fail both causal sufficiency (steps don't always matter) and causal necessity (spurious steps are common). Research shows most CoT evaluation measures output quality, not whether reasoning actually caused the answer.
R1's intermediate tokens carry no special execution semantics and are generated identically to other LLM output. Invalid traces frequently produce correct answers, proving traces are not causally necessary—they correlate with answers via learned formatting, not functional reasoning.
Activation probes show models commit to answers internally long before finishing their reasoning on easy tasks, but on hard tasks the reasoning process tracks real belief updates with detectable inflection points. Probe-guided early exit reduces tokens by up to 80 percent without accuracy loss.
Saliency analysis reveals that CoT prompting fails when question information doesn't aggregate into the prompt structure before reasoning begins. For simple questions, direct question-to-answer flow outperforms step-by-step reasoning, showing the optimal prompt depends on question type, not just task category.
Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.
Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.
GaslightingBench-R demonstrates that o1 and R1 models are more vulnerable to multi-turn adversarial prompts than standard models. Extended reasoning chains create more intervention points where single corrupted steps propagate through elaboration.
The PI framework categorizes reasoning into six types and uses attention maps to identify that verification and backtracking steps receive minimal downstream attention. Selecting only high-attention steps preserves accuracy while cutting reasoning length substantially.