How can minimal pairs expose reasoning failures that single-instance accuracy metrics miss?
This explores how comparing two near-identical problems that differ in one controlled way — minimal pairs — can reveal that a model isn't reasoning at all, even when its raw accuracy on any single problem looks fine.
The sharpest demonstration in the corpus is the constraint-removal trick: take a problem, strip out a constraint, and re-run it. Twelve of fourteen models actually get *worse*, dropping by up to 38.5 percentage points, which is backwards if they were genuinely evaluating the constraint, since dropping a constraint should leave the problem no harder (see "Are models actually reasoning about constraints or just defaulting conservatively?"). On any single instance the model looks like it reasoned its way to the answer; the paired comparison suggests it was defaulting to the harder, safer option. The same paper shows the dual move, adding a constraint that *should* prune the search, and finds models can't exploit it either (see "Can reasoning models actually sustain long-chain reflection?"). One number per problem can't see this; the contrast between the pair can.
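The paired comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the cited paper's evaluation code: `PairResult`, `summarize_pairs`, and the toy outcomes are all hypothetical names and made-up data. It reports the accuracy for each condition next to the pair-level quantities that a single score hides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairResult:
    """Outcome for one minimal pair (hypothetical structure, not from the cited notes)."""
    original_correct: bool  # model solved the problem with the constraint present
    relaxed_correct: bool   # model solved the twin with the constraint removed


def summarize_pairs(results):
    """Compare per-condition accuracy with the pair-level view."""
    n = len(results)
    if n == 0:
        raise ValueError("need at least one pair")
    original_acc = sum(r.original_correct for r in results) / n
    relaxed_acc = sum(r.relaxed_correct for r in results) / n
    both_correct = sum(r.original_correct and r.relaxed_correct for r in results) / n
    return {
        "original_accuracy": original_acc,
        "relaxed_accuracy": relaxed_acc,
        # Negative means the model got worse when the problem got easier.
        "delta_points": (relaxed_acc - original_acc) * 100,
        "pair_consistency": both_correct,
    }


# Made-up outcomes for illustration only; these are not results from any paper.
toy = [PairResult(True, False)] * 3 + [PairResult(True, True)]
summary = summarize_pairs(toy)
print(summary)
```

In this toy run the original condition alone would read as perfect accuracy, while the negative `delta_points` and the low `pair_consistency` expose the pattern the paragraph describes.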
The deeper reason minimal pairs work is that accuracy and the *mechanism* behind it are different things, and several notes here separate them. If you train a model on reasoning traces corrupted into something semantically irrelevant and it keeps solving the problem just as well, the trace was never doing the reasoning; it was computational scaffolding (see "Do reasoning traces need to be semantically correct?"). That's a minimal pair in the explanation rather than the input: correct trace vs. nonsense trace, same accuracy, very different story about what's happening inside.
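One way to check a "just as well" claim on paired outcomes is an exact McNemar test over the items where the two conditions disagree. This is an illustrative choice on my part, not something the cited notes describe, and the function names below are hypothetical. A large p-value only means the disagreements are not lopsided; it does not prove the two conditions are equivalent.

```python
from math import comb


def discordant_counts(with_correct_trace, with_corrupted_trace):
    """Count items that exactly one of the two conditions solved."""
    if len(with_correct_trace) != len(with_corrupted_trace):
        raise ValueError("outcome lists must be paired item by item")
    pairs = list(zip(with_correct_trace, with_corrupted_trace))
    only_correct = sum(a and not b for a, b in pairs)
    only_corrupted = sum(b and not a for a, b in pairs)
    return only_correct, only_corrupted


def mcnemar_exact_p(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = b + c
    if n == 0:
        return 1.0  # no disagreements, so nothing to test
    k = min(b, c)
    # Binomial(n, 0.5) lower tail, doubled and capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Calling `mcnemar_exact_p(*discordant_counts(a, b))` on two aligned lists of booleans gives the p-value for that pair of conditions.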
Minimal pairs also pinpoint *where* the failure lives, which is the most actionable thing accuracy can't tell you. Hold the task fixed and only swap the instance to an unfamiliar one, and performance collapses, showing the breakdown is driven by instance novelty, not task complexity, because models fit instance patterns rather than general algorithms (see "Do language models fail at reasoning due to complexity or novelty?"). Hold the problem fixed and only add a tool, and supposedly impossible problems become solvable, showing the wall was execution bandwidth, not reasoning (see "Are reasoning model collapses really failures of reasoning?"). In both cases a single accuracy score would have read as "can't reason here," while the controlled swap names the real variable.
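The "name the real variable" step can be sketched as a ranking over single-factor swaps. This is a rough illustration under the assumption that each variant differs from the baseline in exactly one factor; `attribute_drop` and the factor names are hypothetical, and the numbers in the usage note are made up rather than taken from the cited notes.

```python
def attribute_drop(baseline_acc, single_factor_variants):
    """Rank single-factor swaps by how far they move accuracy from the baseline.

    baseline_acc: accuracy on the unmodified problems, in percentage points.
    single_factor_variants: maps a factor name to the accuracy measured
        after changing only that factor.
    Returns (factor, delta) tuples, largest absolute change first.
    """
    deltas = {
        name: acc - baseline_acc
        for name, acc in single_factor_variants.items()
    }
    return sorted(deltas.items(), key=lambda item: abs(item[1]), reverse=True)
```

For example, `attribute_drop(80, {"novel_instance": 30, "longer_chain": 75})` ranks `novel_instance` first, which is the shape of evidence the paragraph relies on: the swap that moves the score most is the candidate for the real variable.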
This all connects to why chain-of-thought fails in such predictable ways: it's constrained imitation of reasoning *form*, so it works inside the training distribution and degrades the moment you shift the instance off it (see "Does chain-of-thought reasoning reveal genuine inference or pattern matching?" and "Why does chain-of-thought reasoning fail in predictable ways?"). A minimal pair is essentially a controlled distribution shift: keep everything constant except the one thing that distinguishes pattern-matching from inference. The thing worth knowing here is that high single-instance accuracy is not, on its own, evidence of reasoning; it is equally consistent with the instance simply resembling training data. The pair is what tells you which one you're looking at.
Sources (7 notes)
Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.
DeepSeek-R1 and o1-preview achieve only 20-23.6% exact match on 850 constraint satisfaction problems requiring genuine backtracking. This ceiling reveals that reflective reasoning fluency does not translate to actual problem-solving competence on unfamiliar instance structures.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.
Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.
CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.
CoT guides models to pattern-match reasoning structure rather than perform genuine inference. This explains distribution-bounded failures, why structural coherence matters more than content correctness, and why performance optimizes against interpretability.