Does reasoning training create blind spots in premise detection?
This explores whether training models to reason — chain-of-thought, RL on reasoning traces — makes them worse at noticing when a question's starting assumptions are wrong, charging ahead on a familiar template instead of stopping to question the premise.
This explores whether reasoning training creates a blind spot for bad premises — whether a model drilled to follow reasoning templates barrels forward on a familiar pattern instead of pausing to ask whether the question itself is sound. The corpus doesn't have a paper that measures premise detection head-on, but several findings converge on a mechanism that would produce exactly this blind spot, and they're worth reading together.
The load-bearing claim is that chain-of-thought is imitation of reasoning *form*, not genuine inference. Does chain-of-thought reasoning reveal genuine inference or pattern matching? argues CoT works by reproducing familiar reasoning schemata from training rather than reasoning from first principles. If that's right, a model's competence is tied to matching the *shape* of a problem it has seen — which is precisely the situation where a false or trick premise slips through: the surface pattern looks familiar, so the learned template fires, and nothing in the process is built to interrogate the setup. Does chain-of-thought reasoning actually generalize beyond training data? sharpens this — under distributional shift, models produce fluent but logically inconsistent reasoning. Fluency without a validity check is the signature of a premise blind spot: the prose stays confident while the logic has quietly broken.
The most striking piece is Do reasoning traces need to be semantically correct?: models trained on systematically irrelevant or corrupted traces perform *as well* as those trained on correct ones, because the traces act as computational scaffolding rather than meaningful steps. Read against the question, this is unsettling — if the reasoning steps don't need to be semantically true to 'work,' then the training signal never rewards the model for checking whether its inputs make sense. Premise-checking is exactly the kind of semantic vigilance that this training regime treats as optional.
Do language models fail at reasoning due to complexity or novelty? adds the failure boundary: models don't break at complexity thresholds, they break at instance novelty, because they fit instance-based patterns rather than generalizable algorithms. A flawed premise is a kind of novelty disguised as familiarity — it dresses an invalid problem in a familiar costume. Pattern-fitting is what gets fooled by that costume. And because reasoning training instills a deployment *protocol* that makes extra tokens productive (Can non-reasoning models catch up with more compute?), more reasoning may just mean more confident elaboration on a faulty foundation rather than more scrutiny of it.
The honest gap: nothing here tests premise-rejection directly, and there's a counter-current. Do base models already contain hidden reasoning ability? holds that post-training *selects* rather than creates reasoning — which implies any premise-checking ability the base model had isn't destroyed so much as left un-elicited. That reframes the blind spot as a training-objective problem rather than a capability one: the latent ability to doubt a premise may still be in there, just not what the reward signal selected for. If you want to go deeper, the corrupted-traces and constrained-imitation notes are the two doorways that most directly explain why a confidently-reasoning model can sail past a broken assumption.
Sources 6 notes
CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.
DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.
Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.