INQUIRING LINE

Can a single model implement fast thinking, slow thinking, and tool use?

This explores whether one model can do all three—answer simply when that's enough (fast thinking), reason at length when needed (slow thinking), and call external tools—rather than splitting those jobs across separate systems.


The corpus says the hard part isn't whether one model *can* hold these capabilities (it usually already does) but whether it can decide *when* to use which. The cleanest evidence is Thinkless, which trains one model to route between extended reasoning and direct answers using a method that decouples the choice of mode from the refinement of the answer, preventing the model from collapsing into always-think or always-skip (see: Can models learn when to think versus respond quickly?). That routing matters because more thinking is not free: accuracy peaks and then declines as thinking tokens grow, with models overthinking easy problems and underthinking hard ones (see: Does more thinking time always improve reasoning accuracy?).
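To make the decoupling idea concrete, here is a minimal sketch of why sharing one length-normalised loss between the mode token and the answer tokens can skew routing, and what a decoupled variant might look like. It is not the Thinkless objective itself: the `Trajectory` fields, the function names, and the weight `alpha` are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    mode_logprob: float             # log-prob of the control token (think vs. short)
    response_logprobs: List[float]  # log-probs of the answer tokens that follow
    advantage: float                # shared scalar advantage for this sample


def coupled_loss(t: Trajectory) -> float:
    """Baseline: one length-normalised average over control + response tokens."""
    tokens = [t.mode_logprob] + t.response_logprobs
    return -t.advantage * sum(tokens) / len(tokens)


def mode_weight_coupled(t: Trajectory) -> float:
    """Effective weight the coupled loss puts on the control token."""
    return 1.0 / (1 + len(t.response_logprobs))


def decoupled_loss(t: Trajectory, alpha: float = 0.001) -> float:
    """Mode term gets a fixed weight alpha; response term is averaged on its own."""
    n = max(len(t.response_logprobs), 1)  # guard against an empty response
    response_term = sum(t.response_logprobs) / n
    return -t.advantage * (alpha * t.mode_logprob + response_term)


# Hypothetical samples: a short direct answer and a long reasoning trace.
short = Trajectory(-0.7, [-0.1] * 4, advantage=1.0)
long_ = Trajectory(-0.7, [-0.1] * 99, advantage=1.0)
print(mode_weight_coupled(short), mode_weight_coupled(long_))
```

In the coupled version the weight on the mode decision depends on how long the response is, so short and long samples push on the router with different strength. In the decoupled version that weight is the same `alpha` for both.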

A recurring theme is that the slow-thinking capacity is often already latent in the base model: the bottleneck is elicitation, not acquisition. Several independent mechanisms (RL steering, critique tuning, decoding tweaks, feature steering) all surface reasoning that was already present, suggesting post-training selects reasoning rather than creating it (see: Do base models already contain hidden reasoning ability?). But there's a real limit to the single-model dream: a non-reasoning model can't simply spend more inference compute to catch up to a reasoning-trained one, because training installs a *protocol* that makes the extra tokens productive in the first place (see: Can non-reasoning models catch up with more compute?). And the *quality* of the thinking mode depends on training too: vanilla models can use extended thinking counterproductively (self-doubt that hurts answers), while RL redirects the same mechanism into useful analysis (see: Does extended thinking help or hurt model reasoning?).

The tool-use leg of your question gets an answer from a different angle: rather than one monolith doing everything internally, the corpus shows reasoning operations themselves can be packaged as tool calls. 'Cognitive tools' (reasoning steps implemented as sandboxed, modular LLM calls) lifted GPT-4.1 on AIME2024 from 26.7% to 43.3% with no RL training, because modularity enforces an isolation between operations that plain prompting can't guarantee (see: Can modular cognitive tools unlock reasoning without training?). This blurs the line in your question: tool use and slow thinking can be the *same* mechanism, where the model invokes structured reasoning as a callable operation. A related line shows that separating the planner (decomposer) from the executor (solver) outperforms a single model trying to do both, with the planning skill transferring across domains while the solving skill doesn't (see: Does separating planning from execution improve reasoning accuracy?).
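As a rough illustration of what "reasoning as a sandboxed tool call" could mean in practice, the sketch below runs each reasoning operation in a fresh prompt that never sees the main conversation history. The tool names, prompt templates, and the `solve` flow are hypothetical placeholders, not the ones used in the cited work, and `llm` stands for any text-in, text-out callable.

```python
from typing import Callable, Dict, List

LLM = Callable[[str], str]  # assumption: any function mapping a prompt to a reply

# Hypothetical tool templates; each one sees only its own payload.
TOOL_PROMPTS: Dict[str, str] = {
    "understand_question": "Restate the problem and list what is given:\n{payload}",
    "examine_answer": "Check this candidate solution for errors:\n{payload}",
}


def call_cognitive_tool(llm: LLM, tool: str, payload: str) -> str:
    """Run one reasoning operation in an isolated context (no chat history)."""
    prompt = TOOL_PROMPTS[tool].format(payload=payload)
    return llm(prompt)


def solve(llm: LLM, question: str, history: List[str]) -> str:
    """Main loop: the history reaches the drafting call, never the tool calls."""
    understanding = call_cognitive_tool(llm, "understand_question", question)
    draft = llm("\n".join(history + [question, understanding, "Answer:"]))
    return call_cognitive_tool(llm, "examine_answer", draft)
```

The design point is that isolation is enforced by construction: a tool call cannot be distracted by earlier turns because they are never placed in its prompt.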

So there's a genuine tension worth knowing: one model can be trained to route between modes and to invoke tools, but the corpus repeatedly finds that *forcing separation* (decomposer from solver, reasoning steps into isolated tool calls) buys accuracy and generalizability that a single undifferentiated forward pass struggles to match. The slow-thinking machinery also needs guardrails the single model doesn't naturally have: o1-like models abandon reasoning paths too early and waste tokens, which a decode-time penalty on thought-transition tokens mitigates without fine-tuning (see: Do reasoning models switch between ideas too frequently?). And search frameworks such as best-of-N and MCTS converge in accuracy once you control for total compute, meaning the win comes from compute and reward quality, not the framework wrapper (see: Does the choice of reasoning framework actually matter for test-time performance?).
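The decode-time penalty on thought transitions can be sketched as a small adjustment to the next-token logits. This is a simplified illustration, not the published TIP implementation: the parameter names `alpha` (penalty strength) and `beta` (how many tokens into a thought the penalty stays active), their default values, and the token ids are assumptions.

```python
from typing import List, Set


def penalize_transitions(
    logits: List[float],
    transition_ids: Set[int],
    tokens_since_switch: int,
    alpha: float = 3.0,
    beta: int = 600,
) -> List[float]:
    """Lower the logits of thought-switching tokens early in the current thought."""
    if tokens_since_switch >= beta:
        return list(logits)  # thought has run long enough; switching is allowed
    return [x - alpha if i in transition_ids else x for i, x in enumerate(logits)]


def greedy_pick(logits: List[float]) -> int:
    """Index of the largest logit (greedy decoding)."""
    return max(range(len(logits)), key=logits.__getitem__)


# Hypothetical vocabulary of three tokens; id 1 stands for a token like "Alternatively".
logits = [1.0, 2.5, 2.0]
early = greedy_pick(penalize_transitions(logits, {1}, tokens_since_switch=10))
late = greedy_pick(penalize_transitions(logits, {1}, tokens_since_switch=600))
```

Early in a thought the penalty makes the model continue with token 2 instead of switching; once the thought has run for `beta` tokens, the unpenalized switch token 1 wins again. No model weights change, which is why this kind of guardrail needs no fine-tuning.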

The thing you may not have known you wanted: the frontier framing isn't 'fast vs. slow vs. tools' as three skills to bolt together, but a single learned *controller* deciding how much to think and when to reach outside itself—and the evidence suggests that controller is the scarce ingredient, while the underlying capabilities are largely already there.


Sources (9 notes)

Can models learn when to think versus respond quickly?

Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.

Does more thinking time always improve reasoning accuracy?

Increasing thinking tokens from ~1,100 to ~16,000 reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.

Does extended thinking help or hurt model reasoning?

Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.

Can modular cognitive tools unlock reasoning without training?

Four cognitive tools implemented as sandboxed LLM calls improved GPT-4.1 on AIME2024 from 26.7% to 43.3% without any RL training. Modularity enforces operation isolation that pure prompting cannot guarantee, eliciting pre-existing reasoning capability.

Does separating planning from execution improve reasoning accuracy?

Modular architectures with separate decomposer and solver models outperform monolithic LLMs, with decomposition ability transferring across domains while solving ability does not. The separation prevents planning-execution interference and produces more generalizable skills.

Do reasoning models switch between ideas too frequently?

o1-like models frequently abandon reasoning paths mid-exploration, wasting tokens on incomplete approaches. A decoding-only penalty on thought-transition tokens (TIP strategy) discourages switching, improving accuracy on challenging math without model fine-tuning.

Does the choice of reasoning framework actually matter for test-time performance?

Information-theoretic analysis shows BoN and MCTS converge in reasoning accuracy when controlling for total compute. Snowball errors accumulate per step regardless of framework; mitigation depends on search scope and reward function reliability, not the specific algorithm.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst evaluating whether a single LLM can unify fast inference, slow reasoning, and tool invocation. The question remains open: *can one model do all three well, and if so, what training and routing mechanisms make it work?*

What a curated library found—and when (dated claims, not current truth):
Findings span Feb 2024–Sep 2025. Key constraints from that window:
• One model *can* hold all three capabilities, but the bottleneck is learned routing: Thinkless (2025-05) trains a single model to decide when to engage extended reasoning vs. direct answers, decoupling mode selection from answer quality.
• More thinking tokens don't monotonically improve accuracy; models peak and degrade, overthinking easy problems and underthinking hard ones (2025-06).
• Slow-thinking capacity is often latent in base models; multiple post-training methods (RL, critique tuning, feature steering) elicit it rather than create it—suggesting training selects, not installs, reasoning (2025-04).
• Non-reasoning models cannot catch up to reasoning-trained models via inference compute alone; training installs a *protocol* that makes extra tokens productive (2025-01).
• Modularity wins: packaging reasoning operations as isolated tool calls ("cognitive tools") lifted math performance without RL, and separating decomposer from solver outperforms a single unified forward pass (2025-06).
• Thinking mode quality depends on training direction: vanilla models use extended thinking counterproductively (self-doubt), while RL redirects it into useful analysis (2025-01, 2025-05).

Anchor papers (verify; mind their dates):
• Thinkless: LLM Learns When to Think (arXiv:2505.13379, 2025-05)
• Does Thinking More always Help? Understanding Test-Time Scaling in Reasoning Models (arXiv:2506.04210, 2025-06)
• Eliciting Reasoning in Language Models with Cognitive Tools (arXiv:2506.12115, 2025-06)
• Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs (arXiv:2501.18585, 2025-01)

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer models, methods (RL variants, training protocols), tooling (inference harnesses, caching, multi-agent orchestration), or evaluation have since relaxed or overturned it. Separate the durable question (can one model learn to route?) from perishable limitations (specific RL algorithm, compute thresholds, tool-call overhead). Cite what resolved each, and say plainly where constraints still hold.
(2) Surface the strongest contradicting or superseding work from the last ~6 months that challenges the "learned router" thesis or shows a simpler unified mechanism working as well.
(3) Propose 2 new research questions that assume the routing regime may have shifted—e.g., Can a single pre-trained router transfer to new reasoning domains? or Does scaling the tool-call protocol subsume the need for internal slow thinking?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
