Can models learn when to think versus respond quickly?
Explores whether a single language model can adaptively choose between extended reasoning and direct responses based on task difficulty. This matters because it could make inference more efficient by allocating compute only when needed.
The question "can LLMs learn when to think?" has a concrete answer. Thinkless trains a single model to adaptively select between extended chain-of-thought reasoning and concise direct responses, guided by three factors: task complexity, model capability, and the user's efficiency-accuracy tolerance.
The mechanism: the model emits one of two control tokens (<think> or <short>) as its first output token, signaling the reasoning mode. A distillation warm-up phase aligns each token with expert behavior: a reasoning model for <think>, a compact instruction model for <short>. RL then optimizes the routing policy.
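The first-token convention can be sketched on the consumer side as follows. This is a minimal illustration, not the Thinkless implementation: `Mode`, `split_mode`, `token_budget`, and the budget numbers are hypothetical names and values introduced here, and the generation is assumed to arrive as a list of token strings.

```python
from enum import Enum


class Mode(Enum):
    THINK = "<think>"  # extended chain-of-thought
    SHORT = "<short>"  # concise direct answer


def split_mode(tokens):
    """Read the leading control token and return (mode, remaining tokens)."""
    if not tokens:
        raise ValueError("empty generation")
    mode = Mode(tokens[0])  # raises ValueError if the first token is not a control token
    return mode, tokens[1:]


def token_budget(mode, think_budget=4096, short_budget=256):
    """Hypothetical per-mode generation caps; the numbers are illustrative only."""
    return think_budget if mode is Mode.THINK else short_budget


mode, body = split_mode(["<short>", "4"])
print(mode.name, body, token_budget(mode))
```

Because the mode is fixed by the first token, a serving layer could pick a generation cap before the rest of the response is decoded.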
The critical technical contribution is DeGRPO (Decoupled Group Relative Policy Optimization). Vanilla GRPO treats all tokens uniformly, but the control token is a single token while the response spans hundreds to thousands of tokens. Long responses therefore dominate gradient updates, and the single control token receives weak, biased signals. The model rapidly collapses to one mode, typically <short>, since short samples update the control token faster.
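The imbalance can be made concrete with a little arithmetic. The sketch below assumes the loss averages over all tokens of each sample (the control token plus the response). That is one common form of length normalization and not necessarily the exact setup used in Thinkless. The function name and the lengths 50 and 2000 are illustrative.

```python
def control_token_weight(response_len):
    """Share of a per-sample token average that falls on the single control token."""
    return 1.0 / (1 + response_len)


short_w = control_token_weight(50)    # a concise answer
long_w = control_token_weight(2000)   # a long reasoning trace
print(f"short: {short_w:.5f}  long: {long_w:.5f}  ratio: {short_w / long_w:.1f}x")
```

Under this assumption the control token in a short sample carries a far larger share of the update than the one in a long sample, which is consistent with the drift toward <short> described above.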
DeGRPO separates two objectives: (1) mode selection — how quickly the policy adapts based on current accuracy, and (2) accuracy improvement — refining answer content within the selected mode. This decoupling stabilizes training and prevents the mode collapse observed in all vanilla GRPO experiments.
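A rough sketch of the decoupling idea, under stated assumptions: the control-token term gets its own coefficient `alpha`, the response tokens are averaged separately, and advantages are standardized within a sampled group. The function names, the `alpha` parameter, and the use of the population standard deviation are choices made for this illustration. The exact normalization in DeGRPO may differ.

```python
from statistics import mean, pstdev


def group_advantages(rewards):
    """Group-relative advantages: standardize rewards within one sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        return [0.0] * len(rewards)  # no signal when all samples score the same
    return [(r - mu) / sigma for r in rewards]


def decoupled_objective(control_term, response_terms, advantage, alpha):
    """Weight the control token by alpha, independent of response length."""
    response_avg = sum(response_terms) / len(response_terms)
    return advantage * (alpha * control_term + response_avg)


advantages = group_advantages([1.0, 0.0, 1.0, 0.0])
short_obj = decoupled_objective(1.0, [1.0] * 10, advantages[0], alpha=0.5)
long_obj = decoupled_objective(1.0, [1.0] * 1000, advantages[0], alpha=0.5)
print(advantages, short_obj, long_obj)
```

In this form the coefficient on the control-token term is `alpha` whether the response has 10 tokens or 1000, so mode selection no longer speeds up or slows down with response length.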
The result: the model self-calibrates. Simple arithmetic routes to <short>. Multi-condition problems with multiple concepts route to <think>. The policy reflects a well-calibrated difficulty assessment without explicit difficulty labels in training.
This is the concrete instantiation of "Does RL teach reasoning or just when to use it?": RL doesn't teach the model to reason; it teaches it to recognize when reasoning is worth the compute. The capability comes from pre-training and distillation; RL manages the deployment decision. The design premise aligns with "Do base models already contain hidden reasoning ability?": if reasoning capability is already latent, then what's needed is not more capability training but a routing mechanism, and the control token trained with DeGRPO is exactly that routing mechanism.
The connection to "Can we allocate inference compute based on prompt difficulty?" is architecturally direct. Compute-optimal scaling proposes adaptive budget allocation as a principle; Thinkless implements it as a learned routing mechanism inside a single model.
Three-mode taxonomy with two knowledge boundaries (from Arxiv/Routers): The Fast, Slow, and Tool-augmented Thinking survey formalizes the decision space Thinkless operates in. Two knowledge boundaries define the taxonomy: (1) a fast/slow boundary separating intuitive from deliberative processes (System 1 vs System 2), and (2) an internal/external boundary distinguishing parameter-grounded reasoning from tool-augmented reasoning. This extends Thinkless's binary think/short routing to a three-mode decision: fast thinking (direct generation), slow thinking (CoT/self-reflection/verification), and tool-augmented thinking (calculators, code interpreters, search). Selection mechanisms are either implicit (learned end-to-end during post-training, no explicit control signal) or explicit (rule-based or model-based external routing). Thinkless is an implicit selector for the fast/slow boundary; extending it to the internal/external boundary would require a third mode for tool invocation decisions.
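The two boundaries and three modes can be written down as a small decision function. This is only a reading aid: the names are hypothetical, and the note does not say how the survey treats a process that is both deliberative and tool-using, so collapsing that case into the tool-augmented mode is an assumption.

```python
from enum import Enum


class ThinkingMode(Enum):
    FAST = "fast"            # direct generation
    SLOW = "slow"            # CoT, self-reflection, verification
    TOOL = "tool-augmented"  # calculators, code interpreters, search


def classify(deliberative, uses_external_tools):
    """Map the two boundaries onto the three modes."""
    if uses_external_tools:  # internal/external boundary (assumed to take priority)
        return ThinkingMode.TOOL
    return ThinkingMode.SLOW if deliberative else ThinkingMode.FAST  # fast/slow boundary


print(classify(deliberative=False, uses_external_tools=False).value)
print(classify(deliberative=True, uses_external_tools=False).value)
print(classify(deliberative=True, uses_external_tools=True).value)
```

Thinkless covers only the second line of `classify`; a third control token would be one way to add the first.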
Inquiring lines that use this note as a source (108)
This note is a source for these synthesized inquiries.
- What makes conceptual inquiry the fastest high-scoring AI interaction pattern?
- Why do models commit to answers early on easy versus hard tasks?
- How do verbose and concise reasoning occupy different regions in activation space?
- Can penalizing reasoning transitions fix underthinking without fine-tuning models?
- How does step-level compute allocation compare to response-level thinking?
- Can models learn when to invoke search during reasoning tasks?
- What does Wang mean by intelligence as adaptation with limited resources?
- Can language systems learn when to ask for clarification instead of choosing one reading?
- What happens when a single loss function conflates representation learning with decision-making?
- Does training for better reasoning reduce an AI system's ability to abstain?
- Why do language models produce verbose reasoning when asked to think step by step?
- Can activation steering directly steer models toward concise reasoning without prompting?
- How does inference compute substitution affect the training parameter scaling trade-off?
- Can adaptive prompt-difficulty allocation compound with architectural efficiency improvements?
- Does parallel thinking benefit disproportionately from higher inference throughput architectures?
- Can models learn to select exemplars based on reasoning skills rather than complexity?
- Can dynamic instance-specific prompt selection solve the generalization problem across tasks?
- Why do more capable models prefer shorter chains of thought?
- Why do models show performative reasoning on easy tasks but genuine reasoning on hard ones?
- Do reasoning models show the same answer-maintenance pattern that diffusion models exhibit?
- Why does joint optimization of prompts and inference strategy outperform separate tuning?
- Why do models automatically adjust reasoning length to problem difficulty?
- Can adaptive compute distribution across prompts replace the need for sophisticated reasoning frameworks?
- Why do non-reasoning models work better under extreme decomposition than reasoning models?
- Does more inference compute help reasoning models match specialized domain performance?
- Does reasoning fine-tuning actually reduce a model's ability to abstain?
- Can test-time compute on smaller models replace larger model inference?
- Which RAG sub-decisions are actually pattern matching versus reasoning intensive?
- Do models trained for safety over-refuse compared to models trained for reasoning?
- How do moment-to-moment ToM fluctuations shape AI response quality?
- How should inference budget adapt based on problem difficulty?
- How should reasoning prompts adapt based on question complexity and type?
- Can RL teach when to use reasoning versus when to respond directly?
- Does more thinking always help large language models or sometimes hurt?
- How does per-token adaptive compute improve efficiency in recurrent reasoning?
- Does trading model size for inference steps improve overall efficiency scaling?
- Does reinforcement learning learn optimal per-turn reasoning discipline?
- How can prompting help models gather information before attempting reasoning?
- What training signals would teach models when not to reason?
- How should inference-time token budgets vary across models of different capability levels?
- Do models trained for reasoning lose their ability to decline questions?
- When does sequential reasoning provide exponential advantages over parallel voting?
- How does training data format shape whether models reason in parallel or sequentially?
- How much inference efficiency do we gain by eliminating self-correction passes?
- What are the computational trade-offs between training-time vs inference-time consistency correction?
- Why does inference-time thinking hurt proactive critical thinking in vanilla models?
- How does the functional separation of knowledge and reasoning affect adaptation methods?
- Why do reasoning models fail when input length increases even below context limits?
- When should a system decide to retrieve versus reason alone?
- Can models trained on longer contexts develop better fundamental reasoning abilities?
- Why does parallel thinking outperform sequential thinking under token limits?
- Can models learn when to think versus answer directly?
- Why do language models prefer certain response styles regardless of what the prompt asks?
- How does credit assignment work across many sequential decision steps in language models?
- Can contrastive learning teach models to switch between logical and emotional reasoning?
- Can external classifiers reliably decide when a model should reason?
- What deployment tradeoffs emerge between single-pass and multi-pass inference adaptation?
- Does reasoning fine-tuning actually damage a model's ability to abstain?
- Why do SFT models memorize patterns instead of learning generalizable reasoning?
- How should inference compute budget be allocated across different prompt difficulties?
- Can inference budgets be allocated differently based on prompt difficulty?
- How does extended thinking affect variance in reasoning model outputs?
- Does reasoning fine-tuning actually harm a model's ability to abstain?
- How should inference budgets adapt based on prompt difficulty?
- Where does inference compute stop substituting for model capacity?
- When should a system choose extended thinking versus quick responses?
- How should timing for reasoning intervention be determined during inference?
- Can models learn to stop thinking when a question lacks necessary information?
- Does inference-time compute improve pretraining data efficiency in practice?
- What makes a first answer so often the best answer a model produces?
- Why does reasoning fine-tuning reduce a model's ability to abstain?
- Can weaker models match stronger ones with sufficient search and reasoning budget?
- Does penalizing thought transitions improve reasoning without model retraining?
- Why does more inference compute amplify wandering rather than solving it?
- Do reasoning models switch approaches when encountering local difficulty?
- Can models maintain multiple task interpretations simultaneously before committing to a single policy?
- How can prompt intervention reduce redundant reasoning steps dynamically?
- Does more thinking always improve language model accuracy?
- Does task difficulty alone determine how many thinking tokens a model should use?
- Can activation steering vectors compress reasoning without retraining models?
- Can models learn to ask clarifying questions instead of making assumptions?
- What other internal model decisions beyond attention could be optimized directly?
- Does decoupling reasoning reduce inference cost more than sequential scaling?
- What makes inference budgets allocate adaptively per prompt difficulty?
- Does reinforcement learning teach models how to reason or when to reason?
- Can a single model implement fast thinking, slow thinking, and tool use?
- Do larger language models overcome greediness in sequential decision-making?
- Why does reflection in reasoning models mostly confirm the first answer?
- Can sleep-time compute reduce latency demands during model inference?
- What inference-time scaling benefits emerge from reasoning before each prediction?
- Can adaptive per-step decisions outperform uniform retrieval policies across different reasoning tasks?
- Can activation steering compress reasoning without retraining models?
- Can distillation from stronger models create genuinely new reasoning abilities?
- Can models possess latent reasoning capability that training signals fail to unlock?
- Why do thinking models execute longer tasks than standard language models?
- Why do language models overthink simple questions when given extra time?
- How do reward models guide inference-time compute allocation decisions?
- Can we predict when a model will develop thinking behaviors?
- When does RL discover genuinely novel reasoning strategies versus timing optimization?
- Can inference budgets be allocated adaptively based on prompt difficulty?
- How do sleep-time and post-completion methods reduce inference latency?
- Can models learn to optimize their own chain-of-thought generation?
- What architectural variables most improve inference efficiency today?
- Why does reasoning fine-tuning reduce models' ability to abstain?
- How does the inference steps dial compare to test-time compute trade-offs in language models?
- Do models genuinely reason harder on difficult tasks or just appear to?
- Why does architecture matter more than training compute for inference efficiency?
- How can models select the optimal question to ask given multiple uncertainties?
Related concepts in this collection (9)
- Does RL teach reasoning or just when to use it?
  Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
  Thinkless is the concrete implementation: RL learns the routing, not the reasoning
- Can we allocate inference compute based on prompt difficulty?
  Does adjusting how much compute each prompt receives, rather than using a fixed budget, improve model performance? Could smarter allocation let smaller models compete with larger ones?
  Thinkless implements adaptive allocation as a learned control token decision
- When does explicit reasoning actually help model performance?
  Explicit reasoning improves some tasks but hurts others. What determines whether step-by-step reasoning chains are beneficial or harmful for a given problem?
  the routing decision is the practical resolution: use reasoning where it helps, skip where it hurts
- Can routers select the right model before generation happens?
  Explores whether LLMs can be matched to queries by estimating difficulty upfront, before any generation begins. This matters because routing could cut costs significantly while preserving response quality.
  external model routing as the inter-model analog of Thinkless's intra-model mode routing
- Does RL post-training create reasoning or just deploy it?
  Investigates whether reasoning capability emerges during RL fine-tuning or already exists in base models. Matters because it reshapes how we build and optimize reasoning systems.
  Thinkless is the strongest concrete evidence for the post angle: RL literally learns a routing token, not reasoning capability; the "when not how" claim is architecturally explicit in the control token design
- Do base models already contain hidden reasoning ability?
  Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
  Thinkless's design premise: if reasoning capability is already latent in the base model, what's needed is not more capability training but a routing mechanism that decides when to activate it; DeGRPO is that routing mechanism
- When should an agent actually stop and deliberate?
  How can models detect when deliberation over action choices is genuinely needed versus wasteful? This matters because unbounded action spaces make universal deliberation intractable, yet skipping it entirely risks missing critical errors.
  SAND extends the Thinkless routing principle to a finer granularity: Thinkless decides once per response whether to think or not, while SAND decides at each step within an agentic trajectory whether to deliberate; together they form a hierarchy of adaptive compute allocation (response-level routing + step-level gating)
- Does thinking emerge when agents choose between learned sub-policies?
  Can we formally understand thinking as the selection of pre-existing sub-policies during reinforcement learning? This explores whether thinking requires new capabilities or just the right conditions to activate what's already there.
  theoretical grounding: the thought MDP formalizes what DeGRPO's control token does: selecting between sub-policies (think vs. short) already contained in the policy function; the meta-policy over sub-policies IS the routing decision
- When should retrieval happen during model generation?
  Explores whether retrieval should occur continuously, at fixed intervals, or only when the model signals uncertainty. Standard RAG retrieves once; long-form generation requires dynamic triggering based on confidence signals.
  the retrieval-level analog of Thinkless's compute routing: FLARE gates retrieval on low token-probability, Thinkless gates extended thinking on task complexity; both implement uncertainty-triggered compute allocation, one at the retrieval layer, one at the reasoning layer
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Thinkless: LLM Learns When to Think
- Rethinking Thinking Tokens: LLMs as Improvement Operators
- The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
- Base Models Know How to Reason, Thinking Models Learn When
- Reasoning Strategies in Large Language Models: Can They Follow, Prefer, and Optimize?
- Fast, Slow, and Tool-augmented Thinking for LLMs: A Review
- Implicit Chain of Thought Reasoning via Knowledge Distillation
- Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking
Original note title
hybrid reasoning via decoupled rl learns when to engage extended thinking versus giving concise responses based on task complexity and model capability