INQUIRING LINE

Break a complex task into small enough pieces, and an AI that always 'thinks' starts losing to one that just answers.

Why do non-reasoning models work better under extreme decomposition than reasoning models?

This explores why, when a task is sliced into many tiny well-posed subproblems, a plain model can beat a 'thinking' model — the corpus suggests the reasoning protocol becomes dead weight once the hard planning has already been done externally.

This reads the question as: under extreme decomposition, each subtask is small and clearly specified, and that's exactly the regime where a reasoning model's trained-in habits turn into liabilities. The clearest mechanism is that reasoning models are optimized to always emit reasoning steps and are never taught when to stop. When a subproblem is trivial or even ill-posed, they keep generating: reasoning models produce long, redundant chains in response to questions that a non-reasoning model simply flags as unanswerable (Why do reasoning models overthink ill-posed questions?). Decomposition multiplies these easy subproblems, so it multiplies the overthinking tax.

The second mechanism is that 'more thinking' often isn't more computing. On constraint-bound numerical work, extended chain-of-thought produces more text but not more iterative work, and reasoning variants show no consistent edge over standard models (Do reasoning models actually beat standard models on optimization?). Relatedly, when models collapse on long procedures the bottleneck turns out to be execution bandwidth, not reasoning (Are reasoning model collapses really failures of reasoning?). Extreme decomposition takes over precisely the thing reasoning helps with, the planning, by handing each fragment to the solver pre-carved. What's left is execution, where the reasoning protocol adds tokens, not accuracy. And a lot of those tokens are decorative anyway: Chain of Draft matches standard chain-of-thought accuracy at 7.6% of the token count, meaning the other 92.4% of a verbose chain was style and documentation, not computation (Can minimal reasoning chains match full explanations?).
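To make the token overhead concrete, here is a minimal arithmetic sketch. Only the 7.6% ratio comes from the cited note; the 1,000 subtasks and 200 tokens per verbose chain are illustrative assumptions, and the function names are hypothetical.

```python
# Illustrative arithmetic only. DRAFT_RATIO is the 7.6% figure from the note;
# NUM_SUBTASKS and VERBOSE_TOKENS_PER_SUBTASK are assumed values for the example.
DRAFT_RATIO = 0.076
NUM_SUBTASKS = 1000
VERBOSE_TOKENS_PER_SUBTASK = 200


def total_tokens(num_subtasks: int, tokens_per_subtask: float) -> float:
    """Total tokens emitted across all leaf subtasks."""
    return num_subtasks * tokens_per_subtask


def overhead_fraction(draft_ratio: float = DRAFT_RATIO) -> float:
    """Share of a verbose chain that the shorter draft chain does without."""
    return 1.0 - draft_ratio


verbose = total_tokens(NUM_SUBTASKS, VERBOSE_TOKENS_PER_SUBTASK)
draft = total_tokens(NUM_SUBTASKS, VERBOSE_TOKENS_PER_SUBTASK * DRAFT_RATIO)

print(f"verbose chains: {verbose:,.0f} tokens")
print(f"draft chains:   {draft:,.0f} tokens")
print(f"overhead share: {overhead_fraction():.1%}")
```

Because the per-subtask cost is multiplied by the number of subtasks, any fixed reasoning overhead per leaf grows linearly with how finely the task is decomposed.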

The deepest framing comes from work separating the decomposer from the solver (Does separating planning from execution improve reasoning accuracy?): decomposition ability and solving ability are different skills, and keeping them apart prevents planning-execution interference. Extreme decomposition is that separation pushed to its limit — the orchestration layer becomes the 'reasoner,' so a reasoning model at the leaf level is doing redundant planning on a problem that no longer needs planning. Its inclination to re-plan every fragment is now interference, not help.
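The separation described above can be sketched as a toy orchestrator. This is an assumption-laden illustration, not the architecture from the cited work: `decompose`, `direct_solver`, and `orchestrate` are hypothetical names, and multiplication stands in for whatever a leaf model would actually do.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Subtask:
    """A fully specified leaf problem: nothing left to plan."""
    left: int
    right: int


def decompose(pairs: List[Tuple[int, int]]) -> List[Subtask]:
    """Planning step, performed once and outside the solver."""
    return [Subtask(a, b) for a, b in pairs]


def direct_solver(task: Subtask) -> int:
    """Leaf solver that just answers; a stand-in for a non-reasoning model."""
    return task.left * task.right


def orchestrate(pairs: List[Tuple[int, int]],
                solver: Callable[[Subtask], int]) -> int:
    """The orchestration layer holds the plan: split, solve leaves, combine."""
    return sum(solver(task) for task in decompose(pairs))


result = orchestrate([(2, 3), (4, 5), (6, 7)], direct_solver)
print(result)  # 6 + 20 + 42 = 68
```

The point of the design is that `direct_solver` never sees the overall problem, so any planning it did on its own would duplicate work already done in `decompose`.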

The honest tension is a countervailing note (Can non-reasoning models catch up with more compute?), which argues that reasoning models persistently win regardless of inference budget because training makes their extra tokens productive. That isn't a contradiction; it's the boundary condition. Reasoning's advantage shows up when the problem demands integrated, multi-step thinking held in one head. Decompose that away and you've removed the very thing the training was good for, leaving overhead. This is also why routing approaches like Thinkless (Can models learn when to think versus respond quickly?) matter: the real win isn't 'reasoning' or 'no reasoning' but knowing when each fragment deserves thought.
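As a rough illustration of per-fragment routing, the sketch below picks a mode for each fragment with a hand-written threshold. This is not how Thinkless works (the note describes a learned policy trained with DeGRPO); the word-count difficulty proxy, the 0.5 threshold, and all function names are assumptions made for the example.

```python
from typing import Callable, List


def word_count_difficulty(fragment: str) -> float:
    """Toy proxy (assumed): longer fragments count as harder, capped at 1.0."""
    return min(len(fragment.split()) / 20.0, 1.0)


def route(fragment: str,
          difficulty: Callable[[str], float] = word_count_difficulty,
          threshold: float = 0.5) -> str:
    """Return "think" when the difficulty score exceeds the threshold, else "answer"."""
    return "think" if difficulty(fragment) > threshold else "answer"


fragments: List[str] = [
    "2 + 2",
    "plan a delivery schedule for five trucks across twelve depots "
    "with time windows and driver limits",
]
modes = [route(f) for f in fragments]
print(modes)  # ['answer', 'think']
```

A learned router would replace `word_count_difficulty` with a signal the model calibrates itself; the fixed heuristic here only shows where that decision sits in the pipeline.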

The thing you may not have expected to learn: the question isn't really about model quality at all. It's that decomposition relocates the reasoning out of the model and into the task structure — and once it lives there, a model that insists on reasoning anyway is solving a problem that's already been solved.

Sources 7 notes

Why do reasoning models overthink ill-posed questions?

Reasoning models generate redundant, lengthy responses to questions with missing premises while non-reasoning models correctly identify them as unanswerable. Training optimizes for producing reasoning steps but never teaches models when to disengage.

Do reasoning models actually beat standard models on optimization?

Reasoning variants with extended CoT show no consistent advantage over standard models on constraint-bound numerical tasks like optimal power flow. Extended thinking produces more text, not more iterative computation, suggesting the bottleneck is numeric procedure rather than reasoning steps.

Are reasoning model collapses really failures of reasoning?

Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.

Can minimal reasoning chains match full explanations?

Chain of Draft achieves equivalent accuracy to standard chain-of-thought on arithmetic, symbolic, and commonsense tasks while using only 7.6% of tokens. The 92.4% of removed tokens served style and documentation, not computation.

Does separating planning from execution improve reasoning accuracy?

Modular architectures with separate decomposer and solver models outperform monolithic LLMs, with decomposition ability transferring across domains while solving ability does not. The separation prevents planning-execution interference and produces more generalizable skills.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.

Can models learn when to think versus respond quickly?

Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question — why do non-reasoning models work better under extreme decomposition than reasoning models? — remains open and worth re-testing against current capability. A curated library (2023–2026) found the following dated claims:

**What a curated library found — and when (2023–2026):**
- Reasoning models are overtrained to emit reasoning steps and lack a stopping signal; on trivial subproblems they generate redundant chains while non-reasoning models correctly flag unanswerable fragments (2024–2025).
- Extended chain-of-thought produces more text but not more iterative computation; reasoning variants show no consistent edge on constraint-bound numerical work, and performance collapses are execution failures, not reasoning failures (2024–2025).
- Chain-of-Thought wastes ~92% of tokens on style and documentation; concise intermediate reasoning matches verbose CoT accuracy at 7.6% token count (2025).
- Separating decomposer from solver prevents planning-execution interference; extreme decomposition pushes this separation to its limit, making reasoning at the leaf level redundant (2024–2025).
- Hybrid routing (e.g., Thinkless, 2025) learns when to engage extended thinking vs. direct solving — the real win is conditional reasoning, not blanket reasoning or non-reasoning.

**Anchor papers (verify; mind their dates):**
- arXiv:2406.06580 (Break the Chain, 2024-06) — shortcut reasoning mechanisms
- arXiv:2504.09858 (Reasoning Models Can Be Effective Without Thinking, 2025-04) — reasoning without extended thinking
- arXiv:2505.13379 (Thinkless, 2025-05) — learned routing between reasoning modes
- arXiv:2605.19376 (Generative Recursive Reasoning, 2026-05) — recursive decomposition and execution

**Your task:**
(1) **RE-TEST EACH CONSTRAINT.** For the token-waste claim (~92% decorative), token-budget claims, and the planning-execution interference thesis: has newer model scaling, instruction refinement, or training of reasoning models since early 2025 recovered utility in those tokens or taught stopping? Does the collapse-on-long-procedures claim still hold with latest reasoning architectures and memory/caching? Separate the durable observation (reasoning models lack stopping signals in decomposed tasks) from any perishable limitation (token waste or execution bandwidth as a fundamental bottleneck).

(2) **Surface contradicting or superseding work.** The library cites arXiv:2505.00127 (Between Underthinking and Overthinking) and arXiv:2506.09250 (Comment on The Illusion of Thinking) — these appear to push back on the overthinking narrative. Find work from late 2025–2026 that directly refutes the 'reasoning is overhead under decomposition' claim or shows reasoning models recover advantage under *some* decomposition regimes.

(3) **Propose 2 research questions that assume the regime may have moved:**
   - Given routing (Thinkless, arXiv:2505.13379), can decomposition + learned routing now neutralize the overthinking tax and restore reasoning models' edge over non-reasoning on composite tasks?
   - Does recursive reasoning (arXiv:2605.19376) reframe decomposition as a form of learned thinking that reasoning models now exploit better than non-reasoning models?

**Cite arXiv IDs; flag anything you cannot ground in a real paper.**
