INQUIRING LINE

How does contrapositive augmentation change the tractability of reasoning tasks?

This line of inquiry asks whether feeding a model the logical contrapositive of a rule ('if not-Q then not-P', alongside 'if P then Q') makes reasoning tasks easier to solve. The corpus doesn't address contrapositive augmentation directly, but it has a lot to say about *why* a trick like that would or wouldn't move the needle.


Worth saying up front: none of the collected work studies contrapositive augmentation by name, so this is a lateral read of what the corpus implies about it. The honest version of the question: would restating a rule in its logically equivalent forms make the task more tractable, or would it just paper over the fact that models aren't reasoning logically in the first place? The corpus leans hard toward the second.

The central clue is that these models don't appear to fail at reasoning because tasks are *complex*; they fail when an instance is *unfamiliar*. One study finds reasoning breaks at instance-novelty boundaries, not complexity thresholds: models fit patterns tied to specific instances rather than learning a generalizable algorithm, so any chain succeeds if something similar appeared in training (see "Do language models fail at reasoning due to complexity or novelty?"). Read through that lens, contrapositive augmentation should help, but for an unglamorous reason. You're not teaching the model that P→Q and ¬Q→¬P are the same proposition; you're just adding more familiar instances so fewer test cases land in unfamiliar territory. Tractability goes up because you widened the trained-on distribution, not because the model learned the inference.
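To make "contrapositive augmentation" concrete, here is a minimal sketch of what restating each rule alongside its contrapositive could look like. None of this comes from the corpus: the `Rule` class, the `negate`, `contrapositive`, and `augment` helpers, and the string-based "not" prefix are all illustrative assumptions. The truth-table check at the end only confirms the textbook equivalence of P→Q and ¬Q→¬P; it says nothing about whether a model treats the two forms as the same.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Rule:
    """Hypothetical container for a conditional 'if antecedent then consequent'."""
    antecedent: str
    consequent: str

    def render(self) -> str:
        return f"if {self.antecedent} then {self.consequent}"


def negate(prop: str) -> str:
    """Negate a proposition string, cancelling a leading 'not ' if present."""
    return prop[4:] if prop.startswith("not ") else f"not {prop}"


def contrapositive(rule: Rule) -> Rule:
    """Turn 'if P then Q' into 'if not Q then not P'."""
    return Rule(negate(rule.consequent), negate(rule.antecedent))


def augment(rules: list[Rule]) -> list[Rule]:
    """Return each rule followed by its contrapositive, skipping duplicates."""
    seen: set[Rule] = set()
    out: list[Rule] = []
    for rule in rules:
        for candidate in (rule, contrapositive(rule)):
            if candidate not in seen:
                seen.add(candidate)
                out.append(candidate)
    return out


def implies(p: bool, q: bool) -> bool:
    """Material implication."""
    return (not p) or q


def equivalent_on_all_assignments() -> bool:
    """Check that P->Q and not-Q->not-P agree on every truth assignment."""
    return all(
        implies(p, q) == implies(not q, not p)
        for p, q in product([False, True], repeat=2)
    )


# Toy rule, invented for illustration only.
rules = [Rule("it rains", "the ground is wet")]
augmented = [r.render() for r in augment(rules)]
print(augmented)
print(equivalent_on_all_assignments())
```

The augmented set is logically redundant by construction, which is the point of the argument above: any benefit from training on it comes from added surface coverage, since the second form carries no information the first lacks.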

That distinction matters because the corpus repeatedly shows chain-of-thought is imitation of reasoning *form*, not the thing itself. CoT works by constraining models to reproduce reasoning schemata seen in training, and it degrades predictably the moment you shift task, length, or format (see "Does chain-of-thought reasoning reveal genuine inference or pattern matching?" and "Does chain-of-thought reasoning actually generalize beyond training data?"). A genuine logical operation like contraposition should transfer across all those shifts for free. If augmenting with contrapositives only helps on the exact forms you trained, that's the signature of imitation: the gain is local, not structural.

The strangest corner of the corpus pushes even further: models trained on deliberately corrupted, logically irrelevant reasoning traces perform comparably to those trained on correct ones, suggesting traces act as computational scaffolding rather than meaningful logic (see "Do reasoning traces need to be semantically correct?"). If the *content* of a reasoning step barely matters, then the value of a contrapositive restatement may be less about its logical truth and more about it being another well-formed pattern to anchor on. There's also a darker possibility worth naming: models sometimes look like they're reasoning about constraints when they're really exploiting a conservative default, scoring *worse* when constraints are removed (see "Are models actually reasoning about constraints or just defaulting conservatively?"). Augmentation that improves benchmark numbers could be feeding that kind of shortcut rather than fixing it.

So the thing you might not have known you wanted to know: the question of whether contrapositive augmentation helps is really a diagnostic. If a model genuinely reasons, it shouldn't *need* the contrapositive spelled out; it would derive it. The size of the gain from handing it over for free is a measure of how much the model was pattern-matching all along. The corpus also hints the inference may already be latent: base models carry reasoning capability that minimal training merely selects rather than creates (see "Do base models already contain hidden reasoning ability?"), which suggests augmentation works best as a way to *surface* an inference the model can already represent, not to install one it never had.
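The diagnostic reading can be sketched as two small measurements: the accuracy gain from supplying the contrapositive, and how much of that gain survives on forms the model was not trained on. This is an assumption about how one might operationalize the idea, not a protocol from any of the cited notes; the function names `augmentation_gain` and `locality` are hypothetical, and the per-item outcomes are invented placeholders, not results.

```python
from statistics import mean


def accuracy(correct: list[bool]) -> float:
    """Fraction of items answered correctly."""
    return mean(1.0 if c else 0.0 for c in correct)


def augmentation_gain(baseline: list[bool], augmented: list[bool]) -> float:
    """Accuracy with the contrapositive supplied minus accuracy without it."""
    return accuracy(augmented) - accuracy(baseline)


def locality(gain_seen_forms: float, gain_shifted_forms: float) -> float:
    """Portion of the gain that does not carry over to shifted forms."""
    return gain_seen_forms - gain_shifted_forms


# Placeholder per-item outcomes, invented purely to exercise the functions.
seen_base = [True, False, False, True]
seen_aug = [True, True, True, True]
shifted_base = [True, False, False, True]
shifted_aug = [True, False, True, True]

g_seen = augmentation_gain(seen_base, seen_aug)
g_shifted = augmentation_gain(shifted_base, shifted_aug)
print(g_seen, g_shifted, locality(g_seen, g_shifted))
```

Under this framing, a large gain on seen forms paired with a much smaller gain on shifted forms would point toward the imitation reading, while a gain that carries over roughly intact would be harder to explain as pattern-matching alone.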


Sources (6 notes)

Do language models fail at reasoning due to complexity or novelty?

LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.

Does chain-of-thought reasoning reveal genuine inference or pattern matching?

CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

Are models actually reasoning about constraints or just defaulting conservatively?

Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning-capability analyst. The question: **Does contrapositive augmentation (restating logical rules in their logically equivalent forms) actually improve task tractability, or does it merely expand the training distribution without teaching genuine inference?** This remains genuinely open.

What a curated library found — and when (dated claims, not current truth):
Findings span Feb 2024–Mar 2026. Key findings:
- Reasoning breakdown is driven by instance-level unfamiliarity, not task complexity; models fit patterns tied to specific instances rather than learn generalizable algorithms (2024–2025).
- Chain-of-thought succeeds as constrained imitation of reasoning *form*, not abstract logical inference; gains degrade predictably with task/length/format shifts (2025–2026).
- Models trained on deliberately corrupted reasoning traces perform comparably to those trained on correct traces, suggesting traces act as computational scaffolding, not meaningful logic (2025).
- Base models already possess latent reasoning capability that minimal training selects rather than creates (2025).
- Surface heuristics and conservative defaults often hide behind apparent reasoning success; augmentation risks feeding shortcuts rather than fixing them (2026).

Anchor papers (verify; mind their dates):
- arXiv:2508.01191 *Is Chain-of-Thought Reasoning of LLMs a Mirage?* (2025-08)
- arXiv:2506.02878 *CoT is Not True Reasoning* (2025-06)
- arXiv:2602.06176 *Large Language Model Reasoning Failures* (2026-02)
- arXiv:2603.29025 *The Model Says Walk: How Surface Heuristics Override Implicit Constraints* (2026-03)

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For every finding above, judge whether newer models (o1, o3, GPT-4.5, or equivalent), in-context learning tricks (few-shot logical examples), or test-time scaling (more tokens, more reasoning steps) have since relaxed or overturned the instance-level unfamiliarity bottleneck or the imitation-vs.-inference distinction. Separate durable questions (e.g., *do LLMs generalize logical rules across novel forms?*) from perishable claims (e.g., *CoT always fails on distribution shift*). Cite what resolved it.
(2) **Surface the strongest CONTRADICTING or SUPERSEDING work from the last 6 months** — e.g., any evidence that contrapositive forms *do* transfer zero-shot, or that logical equivalence is learned end-to-end.
(3) **Propose 2 research questions that ASSUME the regime may have moved:** e.g., *Can test-time scaling over contrapositives induce genuine rule discovery?* or *Do instruction-tuned or RL-fine-tuned models now separate logical form from surface pattern?*

Cite arXiv IDs; flag anything you cannot ground in a real paper.
