INQUIRING LINE

When should a system choose extended thinking versus quick responses?

This explores when a system should spend extra compute 'thinking' before answering versus replying fast — and what the corpus says about how that choice gets made and whether more thinking even helps.


The corpus has a surprisingly contrarian answer: more thinking is not free, and often not better. The cleanest framing comes from a dual-process view borrowed from cognitive science: a fast 'System 1' for familiar situations and a slow, deliberate 'System 2' for novel ones, with the switch triggered by the model's own uncertainty rather than a fixed rule (Can dialogue planning balance fast responses with strategic depth?). The key move is that the system decides *for itself* when it's out of its depth, instead of always paying the cost of deep planning.
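The uncertainty-triggered switch can be sketched in a few lines. This is a minimal illustration, not the cited framework's implementation: it assumes the fast policy returns a probability distribution over candidate actions and uses Shannon entropy as the uncertainty signal. The names `choose_action`, `toy_policy`, `toy_planner`, and the `max_entropy` threshold are all hypothetical.

```python
import math
from typing import Callable, Dict, Tuple


def entropy(probs: Dict[str, float]) -> float:
    """Shannon entropy (in nats) of a discrete distribution over actions."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0.0)


def choose_action(
    fast_policy: Callable[[str], Dict[str, float]],
    slow_planner: Callable[[str], str],
    state: str,
    max_entropy: float = 0.5,  # hypothetical threshold; would need tuning
) -> Tuple[str, str]:
    """Answer with the fast policy when it is confident, else fall back to planning."""
    probs = fast_policy(state)
    if entropy(probs) <= max_entropy:
        # Low uncertainty: take the fast policy's top action (System 1).
        return max(probs, key=probs.get), "fast"
    # High uncertainty: pay for the slow, deliberate planner (System 2).
    return slow_planner(state), "slow"


# Toy stand-ins (assumptions for illustration only).
def toy_policy(state: str) -> Dict[str, float]:
    if state == "greeting":
        return {"greet": 0.97, "ask": 0.03}
    return {"greet": 0.34, "ask": 0.33, "offer": 0.33}


def toy_planner(state: str) -> str:
    return "ask"


familiar = choose_action(toy_policy, toy_planner, "greeting")
novel = choose_action(toy_policy, toy_planner, "unfamiliar request")
```

In this toy run the peaked distribution for the familiar state stays under the threshold and is answered directly, while the near-uniform distribution for the unfamiliar state triggers the planner. In the cited note the slow path is MCTS; here it is just a placeholder function.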

Before trusting 'think more,' though, it's worth knowing how shaky the assumption is. Several notes converge on a non-monotonic relationship: accuracy climbs, peaks, then *falls* as thinking tokens grow, in one case dropping from 87.3% to 70.3% as the budget scaled from ~1,100 to ~16,000 tokens (Does more thinking time always improve reasoning accuracy?; Does more thinking time actually improve LLM reasoning?). One unsettling explanation is that longer traces may improve answers not by reasoning better but by *sampling wider*: casting a broader net that happens to cover the right answer more often, until the net gets so diffuse that accuracy drops (Does extended thinking actually improve reasoning or just increase variance?). So 'extended thinking' is partly a coverage trick, not pure cognition.
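One practical consequence of a non-monotonic curve is that a thinking budget should be chosen from measurements rather than set to the maximum. The sketch below assumes you have already measured accuracy at a few budgets. The helper `best_budget` and its `tolerance` parameter are hypothetical, and the two data points are the only ones the notes report.

```python
from typing import Dict


def best_budget(accuracy_by_budget: Dict[int, float], tolerance: float = 0.0) -> int:
    """Return the smallest token budget whose accuracy is within
    `tolerance` of the best measured accuracy."""
    if not accuracy_by_budget:
        raise ValueError("need at least one (budget, accuracy) measurement")
    best = max(accuracy_by_budget.values())
    return min(
        budget
        for budget, acc in accuracy_by_budget.items()
        if acc >= best - tolerance
    )


# The two measurements quoted in the notes (token budget -> accuracy).
measured = {1_100: 0.873, 16_000: 0.703}
chosen = best_budget(measured)
print(chosen)  # 1100
```

Preferring the smallest budget near the peak reflects the point above: past the peak, extra tokens cost more and can lower accuracy.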

The answer to 'when' turns out to depend on both the problem and the model. Accuracy follows an inverted-U in reasoning length, and the peak moves: the optimal length grows with task difficulty but *shrinks* as the model gets more capable. Stronger models need shorter chains, and reinforcement learning naturally drifts toward brevity as skill improves (Why does chain of thought accuracy eventually decline with length?). Easy and well-posed questions can actively suffer from step-by-step reasoning; sometimes a direct question-to-answer path beats a forced chain (Why do some questions perform better without step-by-step reasoning?). And reasoning models have a blind spot: they keep grinding on ill-posed or unanswerable questions because training rewarded producing steps but never taught them when to *stop* (Why do reasoning models overthink ill-posed questions?).

The most direct answer to your question is that models can be trained to route this decision themselves. One approach trains a single model to pick between extended reasoning and a direct response using a method that decouples 'which mode' from 'what answer,' avoiding the trap where the model collapses into always-think or always-skip. It learns this self-calibration without anyone labeling which questions are hard (Can models learn when to think versus respond quickly?). There's also a cheaper lever: verbose and concise reasoning occupy distinct, linearly separable regions in the model's activations, so you can steer toward brevity with a single extracted vector, cutting chain length by 67% with no retraining (Can we steer reasoning toward brevity without retraining?).
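The steering idea can be illustrated with plain vectors. This sketch assumes the steering vector is a difference of mean activations between concise and verbose examples, which is a common way to build such vectors but is not confirmed here as the cited method's exact recipe. The functions `steering_vector` and `steer`, the `strength` parameter, and the two-dimensional toy activations are all hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Vector]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def steering_vector(verbose_acts: Sequence[Vector],
                    concise_acts: Sequence[Vector]) -> Vector:
    """Direction pointing from the verbose cluster toward the concise cluster
    (difference of means; an assumption, not the paper's confirmed recipe)."""
    verbose_mean = mean_vector(verbose_acts)
    concise_mean = mean_vector(concise_acts)
    return [c - v for c, v in zip(concise_mean, verbose_mean)]


def steer(hidden: Vector, direction: Vector, strength: float = 1.0) -> Vector:
    """Shift a hidden state along the steering direction at inference time."""
    return [h + strength * d for h, d in zip(hidden, direction)]


# Toy activations (made up for illustration; real ones come from a model layer).
verbose = [[1.0, 0.0], [3.0, 0.0]]
concise = [[0.0, 2.0], [0.0, 4.0]]

direction = steering_vector(verbose, concise)
steered = steer([1.0, 1.0], direction, strength=0.5)
```

Because the vector is extracted once and added at inference time, no weights change, which is what makes this lever training-free.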

Two deeper cautions reframe the whole question. Whether thinking helps at all may depend on training, not the prompt: the same thinking mechanism that induces useless self-doubt in a vanilla model becomes productive gap-analysis after RL training (Does extended thinking help or hurt model reasoning?). And trace length may not even track difficulty: in controlled maze experiments, trace length correlated with difficulty only in-distribution; out of distribution it reflected how close a problem sat to the training distribution, not how genuinely hard it was (Does longer reasoning actually mean harder problems?). The thing you didn't know you wanted to know: a system that thinks longer on a problem may not be working harder on a harder problem. It may just be recalling a familiar pattern, which is the opposite of when you'd actually want it to slow down.
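A simple way to check whether trace length tracks difficulty in your own setup is to compute the correlation separately for in-distribution and out-of-distribution problems. The sketch below is a hedged illustration: the function names and the record layout (`split`, `difficulty`, `trace_length`) are assumptions, and the numbers are synthetic toy values, not results from the cited experiments.

```python
import math
from typing import Dict, List, Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (norm_x * norm_y)


def correlation_by_split(
    records: List[Tuple[str, float, float]]
) -> Dict[str, float]:
    """Group (split, difficulty, trace_length) records by split and
    correlate difficulty with trace length within each group."""
    grouped: Dict[str, Tuple[List[float], List[float]]] = {}
    for split, difficulty, trace_length in records:
        xs, ys = grouped.setdefault(split, ([], []))
        xs.append(difficulty)
        ys.append(trace_length)
    return {split: pearson(xs, ys) for split, (xs, ys) in grouped.items()}


# Synthetic toy records (illustrative only, not measured data).
records = [
    ("in_dist", 1, 100), ("in_dist", 2, 200), ("in_dist", 3, 300), ("in_dist", 4, 400),
    ("ood", 1, 300), ("ood", 2, 100), ("ood", 3, 100), ("ood", 4, 300),
]
result = correlation_by_split(records)
```

In this toy data the in-distribution group is perfectly correlated and the out-of-distribution group is uncorrelated, mirroring the qualitative pattern the note describes. A large gap between the two splits would suggest that trace length is not a reliable difficulty signal for routing.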


Sources (11 notes)

Can dialogue planning balance fast responses with strategic depth?

A framework combining a neural policy model (System 1) for familiar contexts with MCTS planning (System 2) for novel scenarios, switching based on the model's own uncertainty estimates, matches or exceeds pure MCTS performance while reducing computational cost.

Does more thinking time always improve reasoning accuracy?

Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.

Does more thinking time actually improve LLM reasoning?

Accuracy drops from 87.3% to 70.3% as thinking tokens scale from 1,100 to 16,000, and bypassing explicit reasoning entirely matches or beats standard thinking at equal token budgets. The relationship is non-monotonic, not the linear improvement commonly assumed.

Does extended thinking actually improve reasoning or just increase variance?

Longer thinking traces improve accuracy through variance expansion—broader output distributions cover correct answers more often—not through better reasoning. Beyond a critical threshold, the distribution becomes too diffuse and accuracy drops, revealing the mechanism is sampling coverage, not genuine reasoning improvement.

Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.

Why do some questions perform better without step-by-step reasoning?

Saliency analysis reveals that CoT prompting fails when question information doesn't aggregate into the prompt structure before reasoning begins. For simple questions, direct question-to-answer flow outperforms step-by-step reasoning, showing the optimal prompt depends on question type, not just task category.

Why do reasoning models overthink ill-posed questions?

Reasoning models generate redundant, lengthy responses to questions with missing premises while non-reasoning models correctly identify them as unanswerable. Training optimizes for producing reasoning steps but never teaches models when to disengage.

Can models learn when to think versus respond quickly?

Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.

Can we steer reasoning toward brevity without retraining?

Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.

Does extended thinking help or hurt model reasoning?

Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.

Does longer reasoning actually mean harder problems?

Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning-systems researcher re-evaluating a 2023–2025 question: when should an LLM-backed system choose extended thinking versus direct response? A curated library (2023–Sept 2025) found surprising tensions: more thinking often does NOT improve accuracy, and accuracy follows an inverted-U in reasoning length, with the optimum tied to task difficulty AND model capability. Your job is to test whether those constraints still hold or have been relaxed by newer models, training methods, or orchestration.

What a curated library found — and when (dated claims, not current truth):

• Accuracy is non-monotonic with thinking tokens: peaks ~1,100–4,000 tokens, then degrades (87%→70% in one study) as budgets scale to 16,000+ (2025-06, 2025-02).
• Extended thinking may inflate *coverage* (sampling width) rather than improve reasoning quality; longer traces sample a wider solution space until collapse (2025-02, 2025-06).
• Accuracy follows an inverted-U in chain-of-thought length: the optimal length grows with task difficulty but *shrinks* as model capability increases; easy questions actively suffer from forced step-by-step (2025-04, 2025-05).
• Models trained via RL can learn to self-route (extended vs. direct) without explicit difficulty labels; reasoning models overthink ill-posed questions because training rewarded producing steps but never taught stopping (2025-05, 2025-01).
• Trace length correlates with training-distribution proximity, not genuine problem hardness—a model may think longer on a *familiar* problem than a genuinely hard one (2025-09).

Anchor papers (verify; mind their dates):
• arXiv:2406.05374 (2024-06): Dual-process framework (System 1/2 for dialogue).
• arXiv:2505.13379 (2025-05): "Thinkless" — learned routing of thinking vs. direct response.
• arXiv:2509.07339 (2025-09): Performative thinking; CoT length ≠ problem complexity.
• arXiv:2507.04742 (2025-07): Activation steering to compress reasoning without retraining.

Your task:

(1) RE-TEST EACH CONSTRAINT. For the non-monotonic accuracy curve, the coverage-not-reasoning hypothesis, and the inverted-U optimality claim: do newer models (o3, claude-opus-4.5 or equivalents, 2025–present) still exhibit these limits, or has scaling, training method (e.g., test-time RL, synthetic reasoning data), or inference harness innovation (token pruning, adaptive budgets, better stopping criteria) relaxed them? Separate the durable question ("when should a system think?") from perishable limitations ("thinking always helps at scale" or "models can't self-route"). Cite what resolved it.

(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months (June–Sept 2025) that argues either "extended thinking does scale monotonically under [condition X]" or "self-routing is less reliable than [alternative method]."

(3) Propose 2 research questions that ASSUME the regime may have moved: e.g., "If synthetic reasoning pretraining (2025-05) enables thinking *without* performance collapse, does that flip the cost-benefit of always-think setups?" or "How do multi-agent orchestration + memory redraw the thinking-vs-speed tradeoff?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
