SYNTHESIS NOTE

Can models learn when to think versus respond quickly?

Explores whether a single language model can adaptively choose between extended reasoning and direct responses based on task difficulty. This matters because it could make inference more efficient by allocating compute only when needed.

Synthesis note · 2026-02-22 · sourced from Reasoning o1 o3 Search

The question "can LLMs learn when to think?" has a concrete answer. Thinkless trains a single model to adaptively select between extended chain-of-thought reasoning and concise direct responses, guided by three factors: task complexity, model capability, and the user's efficiency-accuracy tolerance.

The mechanism: the model emits one of two control tokens (<think> or <short>) as its first output token, signaling the reasoning mode for the rest of the response. A distillation warm-up phase aligns each token with expert behavior: a reasoning model for <think>, a compact instruction model for <short>. RL then optimizes the routing policy.
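As a minimal sketch of how a consumer of such a model might read the routing decision, the snippet below splits a generated string into its mode and its body. The names `RoutedOutput` and `split_control_token` are hypothetical and used only for illustration, and the sketch assumes the control token appears as literal text at the start of the decoded output.

```python
from dataclasses import dataclass

# Control tokens named in the note; treating them as literal strings is an assumption.
THINK, SHORT = "<think>", "<short>"


@dataclass
class RoutedOutput:
    mode: str  # "think" or "short"
    body: str  # text that follows the control token


def split_control_token(generated: str) -> RoutedOutput:
    """Read the mode from the first token and return the remaining text."""
    text = generated.lstrip()
    for token, mode in ((THINK, "think"), (SHORT, "short")):
        if text.startswith(token):
            return RoutedOutput(mode=mode, body=text[len(token):].strip())
    raise ValueError("output does not begin with a control token")
```

For example, `split_control_token("<short> 4")` yields mode `"short"` and body `"4"`. An output without a leading control token raises an error instead of silently defaulting to a mode.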

The critical technical contribution is DeGRPO (Decoupled Group Relative Policy Optimization). Vanilla GRPO treats all tokens uniformly, but the control token is one token while the response spans hundreds to thousands. Long responses dominate gradient updates, causing the single control token to receive weak, biased signals. The model rapidly collapses to one mode — typically <short>, since short samples update the control token faster.
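The imbalance can be illustrated with simple arithmetic. The sketch below assumes the per-sample loss is averaged uniformly over the control token plus the response tokens, so the control token's share of a sample's loss is 1/(T+1) for a response of length T. That normalization and the example lengths (50 and 2000 tokens) are assumptions made for illustration, not figures from the source.

```python
def control_token_weight(response_len: int) -> float:
    """Share of a sample's averaged loss carried by the single control token,
    assuming a uniform mean over 1 control token + response_len response tokens."""
    return 1.0 / (response_len + 1)


# Hypothetical lengths: a concise answer versus a long reasoning chain.
short_w = control_token_weight(50)    # 1/51
think_w = control_token_weight(2000)  # 1/2001

# How much more strongly a short sample updates its control token than a long one.
ratio = short_w / think_w  # 2001/51, roughly 39x
```

Under these assumptions the control token of a short sample receives roughly 39 times the weight of the control token of a long one, which is consistent with the drift toward <short> described above.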

DeGRPO separates two objectives: (1) mode selection — how quickly the policy adapts based on current accuracy, and (2) accuracy improvement — refining answer content within the selected mode. This decoupling stabilizes training and prevents the mode collapse observed in all vanilla GRPO experiments.
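A rough sketch of the decoupling idea, not the exact DeGRPO objective: the control token's loss is weighted by its own coefficient instead of being averaged in with the response tokens. The coefficient `alpha`, the function names, and the per-token loss lists are all hypothetical, and the real objective also involves group-relative advantages and clipping that are omitted here.

```python
from typing import Sequence


def vanilla_sample_loss(token_losses: Sequence[float]) -> float:
    """Uniform mean over all tokens; index 0 is the control token."""
    return sum(token_losses) / len(token_losses)


def decoupled_sample_loss(token_losses: Sequence[float], alpha: float = 1.0) -> float:
    """Weight the control token by alpha, independent of response length,
    and average the response tokens separately."""
    control = token_losses[0]
    response = token_losses[1:]
    response_term = sum(response) / len(response) if response else 0.0
    return alpha * control + response_term
```

With a control-token loss of 1.0 followed by 999 response tokens of loss 0.0, the vanilla form returns 0.001 while the decoupled form returns 1.0 for `alpha=1.0`, and that value does not change when the response has 9 tokens instead of 999. The control token's signal no longer shrinks as the response grows.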

The result: the model self-calibrates. Simple arithmetic routes to <short>. Multi-condition problems with multiple concepts route to <think>. The policy reflects a well-calibrated difficulty assessment without explicit difficulty labels in training.

This is the concrete instantiation of "Does RL teach reasoning or just when to use it?" RL does not teach the model to reason; it teaches the model to recognize when reasoning is worth the compute. The capability comes from pre-training and distillation; RL manages the deployment decision. The design premise aligns with "Do base models already contain hidden reasoning ability?": if reasoning capability is already latent, what is needed is not more capability training but a routing mechanism, and the control token trained with DeGRPO is exactly that mechanism.

The connection to "Can we allocate inference compute based on prompt difficulty?" is architecturally direct. Compute-optimal scaling proposes adaptive budget allocation as a principle. Thinkless implements it as a learned routing mechanism inside a single model.

Three-mode taxonomy with two knowledge boundaries (from Arxiv/Routers): The Fast, Slow, and Tool-augmented Thinking survey formalizes the decision space Thinkless operates in. Two knowledge boundaries define the taxonomy: (1) a fast/slow boundary separating intuitive from deliberative processes (System 1 vs System 2), and (2) an internal/external boundary distinguishing parameter-grounded reasoning from tool-augmented reasoning. This extends Thinkless's binary think/short routing to a three-mode decision: fast thinking (direct generation), slow thinking (CoT/self-reflection/verification), and tool-augmented thinking (calculators, code interpreters, search). Selection mechanisms are either implicit (learned end-to-end during post-training, no explicit control signal) or explicit (rule-based or model-based external routing). Thinkless is an implicit selector for the fast/slow boundary; extending it to the internal/external boundary would require a third mode for tool invocation decisions.
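The two boundaries and three modes can be written down as a small decision function. This is only an illustrative sketch: the names `Mode` and `select_mode`, the boolean inputs, and the choice to check the internal/external boundary first are assumptions, not something the survey or Thinkless specifies.

```python
from enum import Enum


class Mode(Enum):
    FAST = "fast"  # direct generation
    SLOW = "slow"  # CoT, self-reflection, verification
    TOOL = "tool"  # calculators, code interpreters, search


def select_mode(needs_deliberation: bool, needs_external: bool) -> Mode:
    """Map the two boundary judgments onto one of three modes.

    Checking the internal/external boundary first is an assumption of this sketch.
    """
    if needs_external:
        return Mode.TOOL
    return Mode.SLOW if needs_deliberation else Mode.FAST
```

In these terms, Thinkless covers only the case where `needs_external` is always false, choosing between `Mode.FAST` and `Mode.SLOW`.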

Inquiring lines that read this note 114

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

How faithfully do LLMs reflect their actual reasoning in outputs and explanations?

What makes conceptual inquiry the fastest high-scoring AI interaction pattern?

How should models express uncertainty rather than forced confident answers?

How do neural networks separate factual knowledge from reasoning abilities?

How do verbose and concise reasoning occupy different regions in activation space?

What capability tradeoffs emerge when scaling model reasoning abilities?

How does latent reasoning compare to verbalized chain-of-thought?

Can AI-generated outputs constitute genuine knowledge or valid claims?

What does Wang mean by intelligence as adaptation with limited resources?

Why do language models reinforce false assumptions instead of correcting them?

Can language systems learn when to ask for clarification instead of choosing one reading?

What determines success in training models on multiple tasks?

How can models identify insufficient information and respond appropriately without guessing?

Why do reasoning models fail at systematic problem-solving and search?

Can inference-time compute substitute for scaling up model parameters?

How should inference compute be adaptively allocated based on prompt difficulty?

Does parallel reasoning outperform sequential thinking under fixed compute budgets?

How do training data properties shape reasoning capability development?

Why do correct reasoning traces tend to be shorter than incorrect ones?

Do reasoning traces faithfully represent or merely mimic actual model reasoning?

Why do models show performative reasoning on easy tasks but genuine reasoning on hard ones?

What structural advantages do diffusion language models offer over autoregressive methods?

Do reasoning models show the same answer-maintenance pattern that diffusion models exhibit?

How does example difficulty affect learning efficiency in language models?

Why do models automatically adjust reasoning length to problem difficulty?

How do knowledge injection methods compare across cost and effectiveness?

Can prompting inject entirely new knowledge into language models?

Does reinforcement learning teach reasoning or just when to reason?

Why does training format shape reasoning strategy more than domain content?

How does training data format shape whether models reason in parallel or sequentially?

Does fine-tuning modify underlying model capabilities or only behavioral outputs?

How does the functional separation of knowledge and reasoning affect adaptation methods?

What prevents language models from reliably adopting diverse personas?

Why do language models prefer certain response styles regardless of what the prompt asks?

What properties determine whether reward signals teach genuine reasoning?

How does credit assignment work across many sequential decision steps in language models?

Why does supervised fine-tuning improve accuracy while degrading reasoning quality?

Why do SFT models memorize patterns instead of learning generalizable reasoning?

When do additional thinking tokens stop improving reasoning performance?

Can next-token prediction alone produce genuine language understanding?

What other internal model decisions beyond attention could be optimized directly?

Do language models learn genuine linguistic structure or just surface patterns?

Does self-reflection enable models to reliably correct their errors?

Why does reflection in reasoning models mostly confirm the first answer?

How should retrieval systems optimize for multi-step reasoning during inference?

Can adaptive per-step decisions outperform uniform retrieval policies across different reasoning tasks?

Do base models contain latent reasoning that training can unlock?

Related concepts in this collection 9

Does RL teach reasoning or just when to use it? Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
Thinkless is the concrete implementation: RL learns the routing, not the reasoning
Can we allocate inference compute based on prompt difficulty? Does adjusting how much compute each prompt receives—rather than using a fixed budget—improve model performance? Could smarter allocation let smaller models compete with larger ones?
Thinkless implements adaptive allocation as a learned control token decision
When does explicit reasoning actually help model performance? Explicit reasoning improves some tasks but hurts others. What determines whether step-by-step reasoning chains are beneficial or harmful for a given problem?
the routing decision is the practical resolution: use reasoning where it helps, skip where it hurts
Can routers select the right model before generation happens? Explores whether LLMs can be matched to queries by estimating difficulty upfront, before any generation begins. This matters because routing could cut costs significantly while preserving response quality.
external model routing as the inter-model analog of Thinkless's intra-model mode routing
Does RL post-training create reasoning or just deploy it? Investigates whether reasoning capability emerges during RL fine-tuning or already exists in base models. Matters because it reshapes how we build and optimize reasoning systems.
Thinkless is the strongest concrete evidence for the "deploy" side of this question: RL literally learns a routing token, not reasoning capability; the "when not how" claim is architecturally explicit in the control token design
Do base models already contain hidden reasoning ability? Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
Thinkless's design premise: if reasoning capability is already latent in the base model, what's needed is not more capability training but a routing mechanism that decides when to activate it; DeGRPO is that routing mechanism
When should an agent actually stop and deliberate? How can models detect when deliberation over action choices is genuinely needed versus wasteful? This matters because unbounded action spaces make universal deliberation intractable, yet skipping it entirely risks missing critical errors.
SAND extends the Thinkless routing principle to a finer granularity: Thinkless decides once per response whether to think or not, while SAND decides at each step within an agentic trajectory whether to deliberate; together they form a hierarchy of adaptive compute allocation (response-level routing + step-level gating)
Does thinking emerge when agents choose between learned sub-policies? Can we formally understand thinking as the selection of pre-existing sub-policies during reinforcement learning? This explores whether thinking requires new capabilities or just the right conditions to activate what's already there.
theoretical grounding: the thought MDP formalizes what DeGRPO's control token does — selecting between sub-policies (think vs. short) already contained in the policy function; the meta-policy over sub-policies IS the routing decision
When should retrieval happen during model generation? Explores whether retrieval should occur continuously, at fixed intervals, or only when the model signals uncertainty. Standard RAG retrieves once; long-form generation requires dynamic triggering based on confidence signals.
the retrieval-level analog of Thinkless's compute routing: FLARE gates retrieval on low token-probability, Thinkless gates extended thinking on task complexity; both implement uncertainty-triggered compute allocation, one at the retrieval layer, one at the reasoning layer
