INQUIRING LINE

Holding the task fixed and just making the answer more probable swung AI accuracy from 26% to 70% — the logic never changed.

Why does target probability matter more than task logical complexity?

This explores why a model's likelihood of producing a given answer (output/target probability) predicts performance better than how logically hard the task itself is — and what that reveals about what LLM 'reasoning' actually rests on.

Target probability (how likely a model is to generate a particular output given its training) turns out to be a stronger lever on accuracy than the logical complexity of the task. The sharpest single piece of evidence comes from a shift-cipher study that pulled chain-of-thought performance apart into three independent factors: output probability, memorization, and genuine (noisy) reasoning. Holding the task fixed and varying only how probable the target answer was swung accuracy from 26% to 70% (see: What three separate factors drive chain-of-thought performance?). The logic of the cipher never changed; what moved the needle was whether the answer sat in a high-probability region of the model's output space. That alone reframes the question: the model isn't solving harder or easier logic, it's reaching for more or less likely strings.
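
To make "holding the task fixed" concrete, here is a minimal sketch of a shift cipher in Python. It is not the study's code: the function names, the shift of 13, and the two example sentences are assumptions chosen for illustration. The point it shows is that the decoding rule is identical for both targets, so any accuracy gap between them cannot come from the logic of the task.

```python
import string


def shift(text: str, k: int) -> str:
    """Shift each ASCII letter forward by k positions; leave other characters alone."""
    out = []
    for ch in text:
        if ch in string.ascii_letters:
            base = ord("A") if ch.isupper() else ord("a")
            out.append(chr((ord(ch) - base + k) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)


def decode(ciphertext: str, k: int) -> str:
    """Undo a shift of k by shifting back the same amount."""
    return shift(ciphertext, -k)


SHIFT = 13  # assumed shift value, for illustration only

# Hypothetical targets: same words, same length, same decoding steps.
# Only the word order (and so the plausibility of the sentence) differs.
likely_target = "the cat sat on the mat"
unlikely_target = "mat the on sat cat the"

for target in (likely_target, unlikely_target):
    encoded = shift(target, SHIFT)
    # The procedure recovers both targets exactly; it has no notion of plausibility.
    assert decode(encoded, SHIFT) == target
    print(encoded, "->", decode(encoded, SHIFT))
```

A deterministic decoder like this gets both sentences right every time. The finding above is that a language model asked to do the same decoding does far better when the answer is the kind of string it would have been likely to produce anyway.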

A cluster of corpus findings converges on the same point from different angles. Reasoning failures don't cluster at complexity thresholds; they cluster at instance-level unfamiliarity. Models fit instance-based patterns rather than general algorithms, so a long, 'hard' chain succeeds if similar instances were seen in training, and a short, 'easy' one fails if it is novel (see: Do language models fail at reasoning due to complexity or novelty?). Controlled maze experiments make the mechanism visible: trace length tracks difficulty only in-distribution and decouples entirely out-of-distribution, because length reflects recall of training schemas, not adaptive computation (see: Does longer reasoning actually mean harder problems?). In both cases the operative variable is proximity to the training distribution, which is a probability story, not intrinsic task hardness.

The most striking corner is that the logical structure of reasoning can be wrong and it barely matters. Invalid chain-of-thought exemplars perform nearly as well as valid ones on BIG-Bench Hard, meaning the model is learning the *form* of reasoning, not genuine inference (see: Does logical validity actually drive chain-of-thought gains?). The same lesson shows up one level out: instruction tuning on semantically empty or deliberately incorrect instructions performs comparably to tuning on full, correct instructions, and instruction tuning scores only 43% against a random baseline's 42.6%, because what transfers is knowledge of the output space, not task understanding (see: Does instruction tuning teach task understanding or output format?). If logical validity and instruction content can be corrupted without hurting performance, then performance was never resting on logic; it was resting on hitting the right output distribution.

There's a revealing flip side: when logical complexity *does* bite, the model copes by leaning even harder on probability. As tasks get harder, from NLI to syllogisms to Wason selection, content effects intensify, and both humans and models fall back on semantic priors instead of logical form once working capacity is exceeded (see: Do harder reasoning tasks trigger more semantic bias?). So complexity doesn't engage some separate logic engine; it pushes the system further toward its prior, i.e. toward whatever is probable. That's why probability keeps winning the comparison: it's the thing the model actually does, and difficulty only deepens the dependence.

For a curious reader, the unexpected takeaway is practical: if probability dominates logic, you should be able to improve 'reasoning' without making the model smarter, simply by reshaping what it's likely to output. The corpus bears this out. Optimal CoT length emerges from reward signals nudging the model toward shorter, higher-probability chains rather than from explicit training on difficulty (see: Why does chain of thought accuracy eventually decline with length?), and the durable gap between reasoning and non-reasoning models comes from a training protocol that makes extra tokens productive, not from raw capability unlocked at inference time (see: Can non-reasoning models catch up with more compute?). The lever is the output distribution the model was shaped to prefer; task complexity is mostly a passenger.

Sources 8 notes

What three separate factors drive chain-of-thought performance?

A shift cipher study decomposed CoT into three independent factors: output probability alone swings accuracy from 26% to 70%, memorization matches pre-training frequency patterns, and genuine reasoning exists but accumulates error with each step. This resolves the reason-or-memorize debate by showing LLMs do both simultaneously.

Do language models fail at reasoning due to complexity or novelty?

Large reasoning models (LRMs) don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so a reasoning chain of any length succeeds if the model was trained on similar instances.

Does longer reasoning actually mean harder problems?

Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.

Does logical validity actually drive chain-of-thought gains?

Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions achieve performance comparable to those trained on full, correct instructions, and instruction tuning scores only 43% against a random baseline's 42.6%. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Do harder reasoning tasks trigger more semantic bias?

Content effects intensify as task difficulty increases—from NLI to syllogisms to Wason selection—in both humans and language models. As working capacity is exceeded, both systems fall back on semantic priors instead of logical form.

Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning researcher re-testing dated constraints on LLM reasoning performance. Core question: does target output probability truly dominate task logical complexity as a performance lever, or have newer models, training methods, or eval harnesses since dissolved this hierarchy?

What a curated library found — and when (dated claims, not current truth):
These findings span 2022–2026, so treat them as perishable snapshots:
• Output probability and memorization decouple reasoning success from task intrinsic difficulty; varying target answer likelihood swung accuracy 26%→70% on fixed ciphers (2024).
• Instance-level unfamiliarity, not task complexity, predicts failure; out-of-distribution trace length decouples from problem difficulty entirely (2024).
• Logically invalid chain-of-thought exemplars perform ~as well as valid ones (BIG-Bench Hard), and semantically empty instructions perform comparably to correct ones (instruction tuning at 43% vs a 42.6% random baseline), indicating the model learns output-space form, not genuine inference (2023–2024).
• Content effects intensify as abstract task difficulty rises; both humans and models fall back on semantic priors once working capacity is exceeded (2022).
• Optimal CoT length follows an inverted-U curve driven by reward signals, not difficulty-adaptive computation (2025).

Anchor papers (verify; mind their dates):
• arXiv:2407.01687 (2024) — Invalid Logic, Equivalent Gains
• arXiv:2305.11383 (2023) — Do Models Really Learn to Follow Instructions?
• arXiv:2509.07339 (2025) — Performative Thinking? The Brittle Correlation Between CoT Length and Problem Complexity
• arXiv:2504.09858 (2025) — Reasoning Models Can Be Effective Without Thinking

Your task:
(1) RE-TEST EACH CONSTRAINT. For every claim above, determine whether recent scaling (post-o1, post-Strawberry), in-context few-shot engineering, multi-agent orchestration, or reinforcement learning from process reward models have since relaxed, inverted, or overturned it. Plainly separate the durable question (does probability compete with logic?) from perishable limitations (the specific magnitude, the specific models tested). Name what resolved each constraint and where it still holds.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last 6 months — especially any paper claiming task complexity or abstract reasoning *does* shape performance independently of probability, or showing reasoning-model scaling curves that break the probability-dominance story.
(3) Propose 2 research questions that assume the regime may have shifted: e.g., does process reward modeling or step-level probability optimization restore logical structure's causal role? Under what training regime do semantically grounded instructions outperform form-only mimicry?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
