INQUIRING LINE

Invalid proof examples coach AI almost as well as valid ones — which suggests it's imitating reasoning, not doing it.

What types of math proofs benefit most from proof-by-contradiction framing?

This asks about proof-by-contradiction as a math technique — but the corpus doesn't cover proof strategies directly; what it does have is a sharp body of work on whether LLM-generated 'proofs' of any framing are valid reasoning or just convincing form.

This reads as a question about which proof types reward a contradiction framing — and here's the honest pivot: the collection has no paper on proof-by-contradiction (or induction, or direct proof) as a mathematical technique. What it has instead is a more unsettling adjacent finding, which is that for today's models the *framing* of a proof may matter far less than whether any genuine inference is happening underneath it at all. If you came looking for 'when does contradiction work best,' the corpus answers a sneakier question: 'does the proof structure you see mean anything?'

The strongest thread is that LLMs reproduce the *shape* of reasoning without its substance. Invalid chain-of-thought exemplars perform nearly as well as logically valid ones (see 'Does logical validity actually drive chain-of-thought gains?'), and RLVR post-training makes adjacent steps more coherent while leaving the global proof potentially invalid (see 'Does RLVR actually improve mathematical reasoning or just coherence?'). The form/content gap is the headline: training format shapes reasoning strategy far more than domain does, and invalid prompts work about as well as valid ones (see 'What makes chain-of-thought reasoning actually work?'). So a model can produce a beautifully staged proof-by-contradiction (assume the negation, derive an absurdity) where each move is locally plausible but the contradiction never actually bites. The framing is decorative, not load-bearing.

That fragility is concrete. Math reasoning collapses when you merely change the numbers or insert an irrelevant clause, which marks it as pattern-matching rather than symbolic deduction (see 'Does LLM math reasoning truly generalize or just pattern match?'). And the very models that *look* most rigorous, long-chain reasoners like o1 and R1, are the most exposed: they reach only 20–23.6% exact match on constraint-satisfaction problems that demand real backtracking (see 'Can reasoning models actually sustain long-chain reflection?'), and their extended chains create more points where one corrupted step propagates (see 'Why do reasoning models fail under manipulative prompts?'). Proof-by-contradiction is arguably the genre most vulnerable to this, since it depends on a long chain remaining valid all the way to the absurdity: one fabricated intermediate step and the whole 'contradiction' is hollow.
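The perturbation test described above can be sketched in a few lines. This is a minimal illustration, not the GSM-Symbolic harness: the template, the distractor sentence, and both toy solvers (`sum_all_numbers`, `first_two_sum`) are hypothetical stand-ins, and in practice `solver` would wrap a call to the model under test.

```python
import random
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    text: str
    answer: int


# Hypothetical word-problem template; not taken from any benchmark.
TEMPLATE = ("Ava picks {a} apples on Monday and {b} apples on Tuesday. "
            "{distractor}How many apples does Ava have in total?")
# A related but irrelevant clause: it mentions apples yet does not change the sum.
DISTRACTOR = "5 of the apples are smaller than average. "


def make_variants(n, seed=0):
    """Build n variants with fresh numbers; odd-indexed ones get the distractor."""
    rng = random.Random(seed)
    variants = []
    for i in range(n):
        a, b = rng.randint(10, 99), rng.randint(10, 99)
        distractor = DISTRACTOR if i % 2 else ""
        text = TEMPLATE.format(a=a, b=b, distractor=distractor)
        variants.append(Variant(text, a + b))  # ground truth computed symbolically
    return variants


def accuracy(solver, variants):
    """Fraction of variants on which solver(text) equals the known answer."""
    return sum(solver(v.text) == v.answer for v in variants) / len(variants)


def sum_all_numbers(text):
    """Toy 'pattern-matcher': adds every number it sees, relevant or not."""
    return sum(int(tok) for tok in re.findall(r"\d+", text))


def first_two_sum(text):
    """Toy solver that uses only the two quantities the question asks about."""
    a, b = re.findall(r"\d+", text)[:2]
    return int(a) + int(b)


if __name__ == "__main__":
    variants = make_variants(10)
    print("first_two_sum:", accuracy(first_two_sum, variants))
    print("sum_all_numbers:", accuracy(sum_all_numbers, variants))
```

With ten variants, half carry the distractor, so the toy solver that sums every number is right on exactly the undisturbed half, while the one that ignores the extra clause is right on all of them. The point of the design is that the correct answer is computed from the template, so any drop in accuracy is attributable to the surface change rather than to a harder problem.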

Where the corpus *does* gesture at an answer to your underlying instinct, that some proof framings are sturdier than others, is in the formalization work. Partial symbolic abstraction beats both pure natural language and full formalization, because selective structure adds rigor without discarding meaning (see 'Why does partial formalization outperform full symbolic logic?'). Semi-formal templates that force explicit premises act as 'completeness certificates,' catching gaps free-form reasoning glides past (see 'Can structured templates make code reasoning more reliable than free-form thinking?'), and Toulmin-style critical-question prompts force a model to surface the warrant it would otherwise skip (see 'Can structured argument prompts make LLM reasoning more rigorous?'). Translated to your question: the proof framing that 'benefits most' isn't determined by the math topic; it's whichever framing forces every hidden premise into the open. Proof-by-contradiction earns its keep precisely when the structured negation makes an otherwise-skipped assumption explicit and checkable.
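One way to picture a template that forces premises into the open is a proof skeleton in which every step must cite what it relies on. The sketch below is an assumption-laden illustration, not the method of any cited paper: `Step`, `find_gaps`, and the labels are hypothetical, and the check is pure bookkeeping (it confirms each citation points at a declared premise or an earlier step, not that the inference itself is valid). The example skeleton is the familiar contradiction proof that the square root of 2 is irrational.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    label: str                                  # name later steps can cite
    claim: str                                  # statement asserted at this step
    uses: list = field(default_factory=list)    # labels this step relies on


def find_gaps(premises, steps):
    """Return (step label, missing label) for every citation with no support.

    A citation is supported if it names a declared premise or an earlier step.
    This does NOT check that the cited facts actually entail the claim.
    """
    known = set(premises)
    gaps = []
    for step in steps:
        for ref in step.uses:
            if ref not in known:
                gaps.append((step.label, ref))
        known.add(step.label)
    return gaps


# Skeleton of the classic proof; 'neg' is the assumed negation (sqrt(2) = p/q)
# and 'coprime' is the assumption that p/q is in lowest terms.
PROOF = [
    Step("s1", "p^2 = 2 q^2", uses=["neg"]),
    Step("s2", "p is even", uses=["s1"]),
    Step("s3", "q is even", uses=["s1", "s2"]),
    Step("s4", "contradiction: 2 divides both p and q", uses=["s2", "s3", "coprime"]),
]

if __name__ == "__main__":
    print(find_gaps({"neg", "coprime"}, PROOF))  # all citations supported
    print(find_gaps({"neg"}, PROOF))             # lowest-terms premise never declared
```

When the lowest-terms assumption is left undeclared, the only flagged step is the final one, which is where the contradiction is supposed to bite: without that premise, "p and q are both even" is not absurd at all. A passing check is therefore necessary but far from sufficient, which matches the local-coherence versus global-validity distinction drawn earlier.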

So the thing you didn't know you wanted to know: the right question for this corpus isn't 'which proofs suit contradiction' but 'which framing makes a model show its work.' A contradiction frame helps most where it converts an implicit leap into an explicit, falsifiable claim — and helps least where it just gives the model more rope to generate fluent, locally-coherent, globally-empty steps.

Sources 9 notes

Does logical validity actually drive chain-of-thought gains?

Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.

Does RLVR actually improve mathematical reasoning or just coherence?

RLVR post-training measurably reduces logical errors between adjacent reasoning steps, but locally coherent traces can still be globally invalid proofs. The improvement is structural rather than semantic.

What makes chain-of-thought reasoning actually work?

Research shows training format shapes reasoning strategy 7.5× more than domain, demo position swings accuracy 20%, and invalid CoT prompts work as well as valid ones. CoT is pattern-guided generation, not formal logic.

Does LLM math reasoning truly generalize or just pattern match?

GSM-Symbolic found that LLMs show high variance across question reformulations, decline sharply when numbers change, and fail when irrelevant but related clauses are inserted. These failures indicate probabilistic pattern-matching rather than true symbolic reasoning.

Can reasoning models actually sustain long-chain reflection?

DeepSeek-R1 and o1-preview achieve only 20-23.6% exact match on 850 constraint satisfaction problems requiring genuine backtracking. This ceiling reveals that reflective reasoning fluency does not translate to actual problem-solving competence on unfamiliar instance structures.

Why do reasoning models fail under manipulative prompts?

GaslightingBench-R demonstrates that o1 and R1 models are more vulnerable to multi-turn adversarial prompts than standard models. Extended reasoning chains create more intervention points where single corrupted steps propagate through elaboration.

Why does partial formalization outperform full symbolic logic?

QuaSAR and Logic-of-Thought both achieve 4-8% accuracy gains by enriching natural language with selective symbolic elements rather than replacing it. Full formalization loses semantic information; pure language lacks structure. Augmentation preserves both.

Can structured templates make code reasoning more reliable than free-form thinking?

Semi-formal templates requiring explicit premises, code-path traces, and evidence checks improved patch equivalence accuracy from 78% to 88%, catching cases like function shadowing that free-form reasoning missed. Templates act as completeness certificates without formal verification.

Can structured argument prompts make LLM reasoning more rigorous?

Applying Toulmin's argument model as explicit prompting steps (CQoT) improves LLM reasoning by forcing models to identify warrants and backing rather than skipping implicit premises. The method catches failures that standard chain-of-thought prompting allows.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens (3.38 match)
Invalid Logic, Equivalent Gains: The Bizarreness of Reasoning in Language Model Prompting (2.58 match)
When More is Less: Understanding Chain-of-Thought Length in LLMs (1.74 match)
Hierarchical Reasoning Model (1.73 match)
CoT is Not True Reasoning, It Is Just a Tight Constraint to Imitate: A Theory Perspective (1.72 match)
Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains (1.70 match)
Measuring Faithfulness in Chain-of-Thought Reasoning (1.66 match)
LLMs Can Easily Learn to Reason from Demonstrations: Structure, not content, is what matters! (1.62 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about proof structures in LLM reasoning. The question: *Does proof-by-contradiction framing unlock genuine inference in LLMs, or does it merely stage locally-coherent but globally-hollow reasoning chains?*

What a curated library found — and when (findings span 2023–2026; treat as dated claims):
• Invalid chain-of-thought exemplars perform nearly as well as logically valid ones; framing shapes strategy far more than actual logic does (2023).
• Math reasoning collapses under minor perturbations (number changes, irrelevant clauses), marking it as pattern-matching rather than symbolic deduction (2024).
• Long-chain reasoners (o1, R1) hit only 20–23% on constraint-satisfaction problems requiring real backtracking; extended chains create more propagation points for corrupted steps (2025).
• Partial symbolic abstraction and semi-formal templates with explicit premises outperform both pure natural language and full formalization; Toulmin-style critical-question prompts force models to surface hidden warrants (2025).
• Local coherence in reasoning traces can mask global invalidity even in RLVR-trained models (2026).

Anchor papers (verify; mind their dates):
• arXiv:2307.10573 (2023) — Invalid Logic, Equivalent Gains
• arXiv:2410.05229 (2024) — GSM-Symbolic: Limitations of Math Reasoning
• arXiv:2412.15177 (2024) — Critical-Questions-of-Thought
• arXiv:2510.18176 (2026) — Local Coherence or Global Validity?

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, assess whether newer models, training methods (post-RLHF refinements, verifier integration), evaluation harnesses, or agentic orchestration (memory, multi-step verification) have since relaxed or overturned it. Separate the durable claim (proof-by-contradiction *looks* rigorous but demands unbroken chains where LLMs are fragile) from the perishable limitation (maybe newer verifier backbones or scaffolding now catch mid-chain fabrication). Cite what resolved it; flag constraints that still hold.
(2) Surface the strongest *contradicting or superseding* work from the last ~6 months—especially anything showing constraint-satisfaction or backtracking improvement, or formal-proof integration that actually prevents hollow contradictions.
(3) Propose 2 new research questions that assume the regime may have shifted: e.g., 'Can agentic code verifiers close the local–global gap for contradiction proofs?' or 'Does proof-by-contradiction *perform better* when paired with symbolic constraint solvers?'

Cite arXiv IDs; flag anything you cannot ground in a real paper.
