INQUIRING LINE

Why does premise ordering shift syllogistic reasoning performance by over 30 percent?

This explores why simply reordering the premises in a logic problem — without changing the logic itself — can swing an LLM's accuracy by more than 30 percent, and what that reveals about how these models actually 'reason.'


The short answer the corpus points to: models aren't manipulating logic abstractly; they're pattern-matching against the sequences they saw during training. Accuracy peaks when the premises arrive in the same order as the steps of the ground-truth proof and drops by more than 30 percent when they don't (see "How much does the order of premises actually matter for reasoning?"). The logic is identical either way; what changes is whether the surface form matches a familiar template.
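To make the setup concrete, here is a minimal Python sketch of how one might generate reordered versions of the same problem. The toy premises, the `inversions` score, and the function names are illustrative assumptions, not the benchmark or method used in the cited work.

```python
from itertools import permutations

# Hypothetical toy problem; the list order is the proof order (A -> B -> C -> D).
PREMISES = ["If A then B.", "If B then C.", "If C then D."]

def inversions(order):
    """Count premise pairs that appear out of proof order (0 means proof order)."""
    return sum(
        1
        for i in range(len(order))
        for j in range(i + 1, len(order))
        if order[i] > order[j]
    )

def order_variants(premises):
    """Return (inversion_count, premise_text) for every ordering, least shuffled first."""
    variants = []
    for perm in permutations(range(len(premises))):
        text = " ".join(premises[k] for k in perm)
        variants.append((inversions(perm), text))
    return sorted(variants)

def build_prompt(premise_text, fact, question):
    """Assemble one prompt; only the premise order differs between variants."""
    return f"{premise_text} {fact} {question}"

if __name__ == "__main__":
    for score, text in order_variants(PREMISES):
        print(score, build_prompt(text, "A is true.", "Is D true?"))
```

Every variant carries the same logical content, so any difference in a model's answers across them can only come from the ordering.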

That reframes the 30 percent drop as evidence of *imitation rather than inference*. Several notes converge on this from different angles. One shows that chain-of-thought works by constraining models to reproduce familiar reasoning shapes from training, and that it degrades predictably the moment the distribution shifts, which is the signature of mimicry, not capability (see "Does chain-of-thought reasoning reveal genuine inference or pattern matching?"). Another lands the point even harder: logically *invalid* chain-of-thought exemplars perform nearly as well as valid ones, meaning it is the structural form of the reasoning, not its logical correctness, that drives the gains (see "Does logical validity actually drive chain-of-thought gains?"). If form is what matters, then rearranging the form, which is all that reordering premises does, should matter a lot. It does.

There's a complementary mechanistic story under the hood. When researchers traced how models actually execute a syllogism, they found a content-independent three-stage circuit (recitation, middle-term suppression, mediation). That circuit, however, is contaminated by separate attention heads encoding world knowledge, which bias conclusions toward what's *plausible* rather than what *follows* (see "How do language models perform syllogistic reasoning internally?"). So the reasoning machinery is real but fragile and entangled with surface cues. Premise order is exactly the kind of surface cue that nudges such a system off the rails: it doesn't break the logic, but it disrupts the sequential scaffolding the circuit leans on.

The deeper pattern is that LLM reasoning failures track *familiarity*, not difficulty. Models break at instance-novelty boundaries rather than complexity thresholds: a reasoning chain succeeds if the model has seen similar instances, regardless of its length (see "Do language models fail at reasoning due to complexity or novelty?"). A reordered premise set is, in effect, a less-familiar instance of the same problem, so performance sags. The same sensitivity shows up elsewhere as accuracy dropping just from padding the input with irrelevant tokens (see "Does reasoning ability actually degrade with longer inputs?"). Both are cases where something that *shouldn't* matter to the logic does matter to the model.
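The padding perturbation can be sketched in the same spirit. The helper below is a hypothetical illustration of inserting irrelevant sentences between premises; it is not the procedure used by FLenQA, and the function name and filler text are assumptions.

```python
def pad_with_filler(premises, filler_sentences, per_gap):
    """Insert `per_gap` filler sentences after each premise, cycling through the pool.

    The premises keep their original order and wording; only irrelevant
    text is added, so the logical content of the prompt is unchanged.
    """
    out = []
    k = 0  # index into the filler pool, wrapped with modulo
    for premise in premises:
        out.append(premise)
        for _ in range(per_gap):
            out.append(filler_sentences[k % len(filler_sentences)])
            k += 1
    return " ".join(out)

if __name__ == "__main__":
    premises = ["If A then B.", "If B then C."]
    filler = ["The sky was grey that morning.", "A train left the station."]
    print(pad_with_filler(premises, filler, per_gap=1))
```

Comparing answers on the padded and unpadded prompts isolates the effect of length alone, since nothing relevant to the deduction has changed.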

The thing worth walking away with: premise ordering isn't a quirky prompt-engineering footnote; it's a diagnostic. The 30 percent swing measures how much these models depend on the *shape and sequence* of a problem rather than its actual logical content. If a model were genuinely doing abstract deduction, reordering the givens would be invisible to it, the way it's invisible to you. That it isn't tells you where the reasoning is really coming from.
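For contrast, a symbolic deducer really is blind to premise order. The sketch below assumes a toy rule set and uses plain forward chaining; the names `forward_chain`, `entails_in_every_order`, and `order_gap` are hypothetical, and `order_gap` is just one plausible way to express the swing (as both an absolute and a relative drop), not the metric defined in the cited paper.

```python
from itertools import permutations

# Hypothetical toy rules: each pair means "if antecedent then consequent".
RULES = [("A", "B"), ("B", "C"), ("C", "D")]

def forward_chain(rules, facts):
    """Derive every symbol reachable from `facts`, looping until nothing new is added."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for antecedent, consequent in rules:
            if antecedent in known and consequent not in known:
                known.add(consequent)
                changed = True
    return known

def entails_in_every_order(rules, facts, goal):
    """True if `goal` is derived under every possible ordering of the rules."""
    return all(
        goal in forward_chain(list(order), facts) for order in permutations(rules)
    )

def accuracy(outcomes):
    """Fraction of True values in a non-empty list of per-problem outcomes."""
    return sum(outcomes) / len(outcomes)

def order_gap(proof_order_outcomes, reordered_outcomes):
    """Return (absolute_drop, relative_drop) in accuracy caused by reordering."""
    base = accuracy(proof_order_outcomes)
    drop = base - accuracy(reordered_outcomes)
    relative = drop / base if base else 0.0
    return drop, relative

if __name__ == "__main__":
    print(entails_in_every_order(RULES, {"A"}, "D"))  # True
```

Feeding `order_gap` a model's per-problem correctness on proof-ordered and reordered prompts would give the swing; for the forward chainer above the gap is zero by construction.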


Sources (6 notes)

How much does the order of premises actually matter for reasoning?

Reordering premises in logical tasks drops LLM accuracy by more than 30 percent, even though the logic remains identical. Performance peaks when premises match the ground truth proof sequence, suggesting LLMs rely on sequential pattern matching rather than abstract logical manipulation.

Does chain-of-thought reasoning reveal genuine inference or pattern matching?

CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.

Does logical validity actually drive chain-of-thought gains?

Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.

How do language models perform syllogistic reasoning internally?

LLMs implement a content-independent three-stage reasoning mechanism—recitation, middle-term suppression, mediation—that works across architectures. However, additional attention heads encoding world knowledge systematically bias conclusions toward semantically plausible rather than logically valid answers, with contamination increasing at larger scales.

Do language models fail at reasoning due to complexity or novelty?

LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.

Does reasoning ability actually degrade with longer inputs?

FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning-capabilities analyst. The question: Why does premise ordering shift syllogistic reasoning performance by over 30 percent — and does that constraint still hold under newer models, training regimes, or evaluation methods?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026. The library converges on these points:

• Reordering premises without changing logic drops accuracy by >30%, suggesting models pattern-match on sequence rather than manipulate logic abstractly (2024-02, arXiv:2402.08939).
• Chain-of-thought gains come from imitating familiar reasoning *form*, not from abstract inference; logically invalid CoT exemplars perform nearly as well as valid ones (2023-07, arXiv:2307.10573; 2025-06, arXiv:2506.02878).
• Mechanistically, syllogistic reasoning uses a three-stage circuit (recitation, middle-term suppression, mediation) contaminated by world-knowledge attention heads that bias toward plausibility over logical consequence (2024-08, arXiv:2408.08590).
• Reasoning breakdown is driven by instance-level unfamiliarity, not task-level difficulty; padding input with irrelevant tokens degrades performance even far below context limits (2024-02, arXiv:2402.14848).
• Recent work (2025-01, 2025-05, 2025-08, 2026-01) hints at test-time interventions and functional token encoding as potential remedies, though the constraint's durability remains unclear.

Anchor papers (verify; mind their dates):
• arXiv:2402.08939 (2024-02): Premise Order Matters in Reasoning with Large Language Models
• arXiv:2408.08590 (2024-08): Reasoning Circuits in Language Models: Mechanistic Interpretation of Syllogistic Inference
• arXiv:2506.02878 (2025-06): CoT is Not True Reasoning, It Is Just a Tight Constraint to Imitate
• arXiv:2508.02511 (2025-08): Test-time Prompt Intervention

Your task:
(1) RE-TEST THE CONSTRAINT. For each finding above, assess whether o1-class models, instruction-tuning on logic, retrieval-augmented scaffolding, or multi-step verification have since RELAXED the 30 percent drop. Distinguish: Is premise-order sensitivity baked into attention patterns (likely persistent) or a trainable artifact (possibly fixed)? Cite what relaxed it; say plainly if it still holds.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last 6 months — especially any showing premise-order invariance or abstract logical reasoning emerging in newer checkpoints.
(3) Propose 2 research questions that ASSUME the regime may have shifted: (a) Can fine-tuning on premise-order-perturbed datasets erase the 30 percent gap? (b) Do reasoning-specialized models (o1, future variants) encode logic tree-structure instead of sequence, making ordering invisible?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
