INQUIRING LINE

What makes deductive reasoning so brittle in language models overall?

This explores why formal deductive reasoning — applying rules to reach valid conclusions — keeps breaking in LLMs, and the corpus suggests the answer is that models were never doing deduction in the first place.


The most striking thread in the corpus is that the brittleness isn't a flaw in an otherwise-deductive process; it's a sign the process was never deductive to begin with. The sharpest evidence comes from a pair of findings about logical validity: when researchers feed models *invalid* reasoning steps, performance barely drops compared to valid ones (see "Do reasoning traces show how models actually think?"), and corrupted chain-of-thought prompts work nearly as well as correct ones (see "What makes chain-of-thought reasoning actually work?"). If logical correctness were doing the work, breaking the logic would break the answer. It doesn't. What actually drives the performance is format and pattern, not inference: chain-of-thought (CoT) is "pattern-guided generation, not formal logic."

That reframes brittleness as a predictable consequence of *how* models reason rather than a mysterious failure. When semantic content is stripped away and only the formal rules remain, accuracy collapses, even with the correct rules supplied in context: LLMs turn out to be "in-context semantic reasoners, not symbolic reasoners," leaning on commonsense token associations rather than manipulating symbols (see "Do large language models reason symbolically or semantically?"). Deduction is precisely the case where you must follow the rule regardless of whether the content feels familiar, so a system built on semantic association is brittle exactly where deduction is supposed to be strong. A related finding sharpens this: reasoning fails not at complexity thresholds but at *instance-novelty* boundaries. Models fit patterns from training instances, so a chain succeeds if it resembles something seen before and fails when the structure is genuinely new (see "Do language models fail at reasoning due to complexity or novelty?").
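To make "follow the rule regardless of whether the content feels familiar" concrete, here is a minimal sketch of a symbolic forward-chaining engine. It is an illustration only, not the setup used in any of the cited work: the rule format, the `forward_chain` and `rename_rules` helpers, and the toy facts are all assumptions made up for this example. The point it shows is that a symbolic reasoner derives the same conclusions, up to renaming, when meaningful predicate names are swapped for arbitrary symbols.

```python
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple

# A rule is (set of premises, conclusion). This encoding is hypothetical,
# chosen only to keep the sketch small.
Rule = Tuple[FrozenSet[str], str]


def forward_chain(facts: Iterable[str], rules: Iterable[Rule]) -> Set[str]:
    """Apply rules repeatedly until no new fact can be derived."""
    known: Set[str] = set(facts)
    rule_list: List[Rule] = list(rules)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rule_list:
            # Fire a rule only when every premise is already known.
            if premises <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known


def rename_rules(rules: Iterable[Rule], mapping: Dict[str, str]) -> List[Rule]:
    """Replace every symbol in every rule using a one-to-one mapping."""
    return [
        (frozenset(mapping[p] for p in premises), mapping[conclusion])
        for premises, conclusion in rules
    ]


# Toy problem with semantically meaningful names.
facts = {"rain"}
rules: List[Rule] = [
    (frozenset({"rain"}), "wet_ground"),
    (frozenset({"wet_ground"}), "slippery"),
    (frozenset({"snow", "wet_ground"}), "ice"),  # never fires: "snow" is unknown
]

# The same problem with the semantics stripped away.
mapping = {"rain": "p1", "wet_ground": "p2", "slippery": "p3",
           "snow": "p4", "ice": "p5"}

derived = forward_chain(facts, rules)
abstract_derived = forward_chain(
    {mapping[f] for f in facts}, rename_rules(rules, mapping)
)

# The derivations are identical up to renaming.
print({mapping[s] for s in derived} == abstract_derived)
```

The engine never looks at what a symbol means, so stripping the semantics cannot hurt it. The finding summarized above is that LLM accuracy does not survive the same substitution.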

The brittleness also compounds with surface structure in ways a true deductive engine wouldn't care about. Models make systematic linguistic errors that worsen predictably as syntactic depth increases: they miss embedded clauses and nested structure, capturing surface patterns but not the underlying grammatical rules (see "Why do large language models fail at complex linguistic tasks?"). And reasoning accuracy degrades sharply just from longer inputs, dropping from 92% to 68% with about 3,000 tokens of irrelevant padding, far below the context limit and even with chain-of-thought (see "Does reasoning ability actually degrade with longer inputs?"). Real deduction is indifferent to how much filler surrounds the premises; these systems are not.
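The padding result can be probed with a small prompt builder. The sketch below is an assumption-laden illustration rather than the FLenQA procedure: the `pad_context` function, its parameters, and the whitespace word count used as a stand-in for tokens are all hypothetical. It scatters a fixed set of premises, in their original order, among irrelevant filler sentences until the text reaches a target length, so the same question can be asked at several input lengths.

```python
import random
from typing import List


def pad_context(premises: List[str], filler: List[str],
                target_words: int, seed: int = 0) -> str:
    """Embed premises among filler until the text has >= target_words words.

    Word count is a rough proxy for token count (an assumption of this
    sketch); swap in a real tokenizer if exact lengths matter.
    """
    usable = [s for s in filler if s.split()]
    if not usable:
        raise ValueError("filler must contain at least one non-empty sentence")

    rng = random.Random(seed)  # seeded so each padded prompt is reproducible
    total = sum(len(p.split()) for p in premises)

    # Cycle through filler sentences until the length target is met.
    chosen: List[str] = []
    i = 0
    while total < target_words:
        sentence = usable[i % len(usable)]
        chosen.append(sentence)
        total += len(sentence.split())
        i += 1

    # Pick sorted insertion points so premises keep their original order.
    positions = sorted(rng.randint(0, len(chosen)) for _ in premises)
    out = list(chosen)
    for offset, (pos, premise) in enumerate(zip(positions, premises)):
        out.insert(pos + offset, premise)
    return " ".join(out)


premises = ["Alice is taller than Bob.", "Bob is taller than Carol."]
filler = ["The weather report mentioned light wind.",
          "A train left the station at noon."]

short_prompt = pad_context(premises, filler, target_words=0)
long_prompt = pad_context(premises, filler, target_words=200, seed=1)
print(len(short_prompt.split()), len(long_prompt.split()))
```

Asking the same question over `short_prompt` and `long_prompt` holds the logical content fixed while varying only the amount of filler, which is the comparison the 92% to 68% figure describes.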

It is worth flagging the counter-currents, because they complicate the "models can't reason" story. Some apparent reasoning collapses turn out to be *execution* failures: the model knows the algorithm but can't carry out enough text-only steps to finish, and tool-enabled versions sail past the supposed cliff (see "Are reasoning model collapses really failures of reasoning?"). Other times what looks like reasoning is just a conservative default in disguise: twelve of fourteen models actually perform *worse* when constraints are removed, revealing they were defaulting to the conservative answer rather than evaluating the logic (see "Are models actually reasoning about constraints or just defaulting conservatively?"). There's even evidence that genuine computation happens: models trained with hidden chain-of-thought tokens can compute correct answers in early layers, then suppress them in the final layers to produce format-compliant filler (see "Do transformers hide reasoning before producing filler tokens?").

Put together, the corpus says deductive brittleness is overdetermined: models substitute semantic familiarity for symbolic manipulation, so they break on novelty, on syntactic depth, on irrelevant context, and on validity-vs-form swaps — but they also sometimes "fail" for reasons that have nothing to do with reasoning at all, like running out of execution bandwidth or hiding correct work behind format compliance. The thing you might not have expected: the most damning result isn't that models get hard problems wrong, it's that scrambling the logic barely changes the score. That's the tell that the apparatus was decorative all along.


Sources (9 notes)

Do reasoning traces show how models actually think?

LLM reasoning traces function as persuasive appearances rather than reliable explanations of computation. Invalid logical steps perform nearly as well as valid ones, and corrupted traces generalize comparably, showing that semantic correctness is not what produces the performance gains.

What makes chain-of-thought reasoning actually work?

Research shows training format shapes reasoning strategy 7.5× more than domain, demo position swings accuracy 20%, and invalid CoT prompts work as well as valid ones. CoT is pattern-guided generation, not formal logic.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Do language models fail at reasoning due to complexity or novelty?

LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.

Why do large language models fail at complex linguistic tasks?

Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.

Does reasoning ability actually degrade with longer inputs?

FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.

Are reasoning model collapses really failures of reasoning?

Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.

Are models actually reasoning about constraints or just defaulting conservatively?

Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.

Do transformers hide reasoning before producing filler tokens?

Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning research analyst. The question remains open: What makes deductive reasoning so brittle in language models? A curated library found — findings span 2023–2026, so treat these as dated claims, not current truth:

• Logical validity barely matters: invalid reasoning steps perform nearly as well as valid ones; corrupted chain-of-thought works almost as well as correct chains (2023–2024).
• LLMs are in-context *semantic* reasoners, not symbolic reasoners — they lean on commonsense token associations rather than symbolic manipulation, so they collapse when semantics are stripped away (2023).
• Reasoning fails at *instance-novelty* boundaries, not complexity thresholds — models fit patterns from training, succeeding on resemblances and failing on genuinely new structures (2024–2025).
• Models make systematic linguistic errors that worsen predictably with syntactic depth — they miss embedded clauses and nested structure (2025).
• Reasoning accuracy drops from 92% to 68% with just a few thousand tokens of irrelevant padding, far below context limits and despite chain-of-thought (2024).

Anchor papers (verify; mind their dates):
• 2305.14825 — In-Context Semantic Reasoners (2023)
• 2402.14848 — Input Length Impact on Reasoning (2024)
• 2508.01191 — Is CoT a Mirage? A Data Distribution Lens (2025)
• 2604.15726 — LLM Reasoning Is Latent, Not the Chain of Thought (2026)

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer models, methods (e.g., process reward models, mechanistic interpretability), orchestration (multi-step tools, symbolic solvers), or evaluation have since RELAXED or OVERTURNED it. Separate the durable question (why symbolic reasoning resists emergence) from the perishable limitation (e.g., execution bandwidth, format compliance). Cite what resolved it.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~3 months — especially papers claiming genuine symbolic capability or reframing CoT as something other than pattern-matching.
(3) Propose 2 research questions that ASSUME the brittleness may be architectural (unfixable by scale/data) vs. training-contingent (fixable by incentive design or hybrid systems).

Cite arXiv IDs; flag anything you cannot ground in a real paper.
