INQUIRING LINE

What is the distinction between teaching a model how to reason versus when to activate reasoning?

This explores a finding the corpus keeps circling back to: that a model's reasoning ability and its judgment about *when* to switch reasoning on are two separate things — one built during pre-training, the other tuned afterward.


The corpus draws a surprisingly sharp line between a model *having* reasoning skill and *knowing when to use it*. The cleanest statement comes from work showing that reinforcement-learning (RL) post-training doesn't actually teach a model to reason; pre-training already planted that capability. What RL adds is timing: knowing when to deploy reasoning and when to answer directly. The evidence is striking: a hybrid that paired a base model's raw reasoning with a thinking model's steering recovered 91% of the performance gain using just 12% of the tokens (see: Does RL teach reasoning or just when to use it?). That reframes RL as a deployment optimizer, not a capability creator, and it is the organizing idea behind proposals to architect reasoning systems that deliberately separate *activation timing* from *execution capability* (see: How should reasoning systems actually be architected?).
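To make the 91% figure concrete, here is a minimal sketch of one plausible way to compute the share of the performance gain recovered. The function name `gap_recovered` and all three accuracy values are hypothetical, chosen only to illustrate the arithmetic; the source note gives neither the underlying accuracies nor the exact metric definition.

```python
def gap_recovered(base_acc: float, hybrid_acc: float, thinking_acc: float) -> float:
    """Fraction of the base-to-thinking accuracy gap closed by the hybrid.

    Assumed definition (not confirmed by the source note):
        (hybrid - base) / (thinking - base)
    """
    gap = thinking_acc - base_acc
    if gap <= 0:
        raise ValueError("thinking model must outperform the base model")
    return (hybrid_acc - base_acc) / gap


# Hypothetical accuracies, invented for illustration only.
base_acc = 0.40
thinking_acc = 0.80
hybrid_acc = 0.764  # sits 91% of the way from base to thinking

recovered = gap_recovered(base_acc, hybrid_acc, thinking_acc)
print(f"gap recovered: {recovered:.0%}")
```

Under this reading, "91% of the gain" is a statement about the gap between the two models, not about absolute accuracy.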

The 'capability is already there' half of the story shows up everywhere once you look. A single latent feature, identified with a sparse autoencoder (SAE) and steered directly, matches chain-of-thought performance without any chain-of-thought prompting, implying the reasoning mode is a built-in switch waiting to be flipped (see: Can we trigger reasoning without explicit chain-of-thought prompts?). Four modular 'cognitive tools' lifted GPT-4.1 on AIME 2024, a hard math benchmark, from 26.7% to 43.3% with no training at all, just by isolating operations the model could already perform (see: Can modular cognitive tools unlock reasoning without training?). And critique fine-tuning unlocked reasoning from exposure to a *single* problem, suggesting what's needed is an activation signal, not new skill (see: Can a single problem unlock reasoning through solution critique?). All three point the same way: the 'how' is latent; the work is in the 'when.'
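The 'built-in switch' idea can be pictured as activation steering: nudging a hidden state along a feature's direction instead of changing the prompt. The sketch below uses plain Python lists in place of real model activations; the vectors, the `steer` and `feature_activation` helpers, and the strength value are all hypothetical, and this is not claimed to be the cited work's exact procedure.

```python
from math import sqrt
from typing import List


def dot(u: List[float], v: List[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def normalize(v: List[float]) -> List[float]:
    norm = sqrt(dot(v, v))
    return [x / norm for x in v]


def feature_activation(hidden: List[float], direction: List[float]) -> float:
    """How strongly the hidden state points along the feature direction."""
    return dot(hidden, normalize(direction))


def steer(hidden: List[float], direction: List[float], strength: float) -> List[float]:
    """Add `strength` units of the unit-length feature direction to the hidden state."""
    unit = normalize(direction)
    return [h + strength * d for h, d in zip(hidden, unit)]


# Toy 3-dimensional stand-ins for a hidden state and a feature direction.
hidden_state = [0.5, -1.0, 2.0]
reasoning_direction = [3.0, 0.0, 4.0]

before = feature_activation(hidden_state, reasoning_direction)
steered = steer(hidden_state, reasoning_direction, strength=2.0)
after = feature_activation(steered, reasoning_direction)
print(f"activation before: {before:.2f}, after: {after:.2f}")
```

Because the direction is normalized, steering with strength 2.0 raises the feature's activation by 2.0 and leaves the components orthogonal to it untouched.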

But 'when' turns out to be subtle, because more reasoning is not always better. Vanilla models, left to think freely, talk themselves into self-doubt that *degrades* answers, and RL training's real job is redirecting that same thinking machinery away from spiraling and toward productive gap analysis (see: Does extended thinking help or hurt model reasoning?). Instance-adaptive analysis shows that simple questions do better with a direct question-to-answer path than with step-by-step reasoning, so the optimal move depends on the specific question, not the task category (see: Why do some questions perform better without step-by-step reasoning?). Knowing 'when' includes knowing *when not to*, and getting that wrong has costs: scaling reasoning capability quietly erodes instruction-following, because longer chains pull attention away from the original request (see: Why do better reasoning models ignore instructions?).
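One way to picture instance-level 'when' decisions is a router that picks a path per question rather than per task. This is a toy sketch: the `toy_difficulty` heuristic, the 0.5 threshold, and the example questions are placeholders invented for illustration, standing in for whatever learned signal a real system would use.

```python
from typing import Callable


def toy_difficulty(question: str) -> float:
    """Placeholder score in [0, 1]: longer questions count as harder.

    A stand-in only; a real system would learn this signal.
    """
    return min(len(question.split()) / 20.0, 1.0)


def choose_path(
    question: str,
    difficulty: Callable[[str], float] = toy_difficulty,
    threshold: float = 0.5,  # hypothetical cut-off
) -> str:
    """Return 'reason' for questions scored as hard, 'direct' otherwise."""
    if difficulty(question) >= threshold:
        return "reason"
    return "direct"


easy = "What is 2 + 2?"
hard = (
    "A train leaves at noon travelling 60 km/h and a second train leaves "
    "an hour later at 90 km/h; when does the second train catch the first?"
)

for q in (easy, hard):
    print(f"{choose_path(q):>6}: {q}")
```

The decision is made per question, so two items from the same task category can take different paths.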

The one place the corpus pushes back on the clean split is *where* the 'how' gets built. If capability really is fixed in pre-training, you might plant it more deliberately there: treating chain-of-thought as an exploratory action rewarded by information gain lifts reasoning by roughly 19% during pre-training itself (see: Can chain-of-thought reasoning be learned during pretraining itself?). The 'how' may also be structural: reasoning generalizes from broad procedural knowledge picked up across many documents, unlike facts, which are memorized narrowly (see: Does procedural knowledge drive reasoning more than factual retrieval?). That split even maps onto network anatomy, with knowledge concentrated in lower layers and reasoning in higher ones (see: Why does reasoning training help math but hurt medical tasks?).
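The 'rewarded by information gain' idea can be written as a log-likelihood improvement: how much more probable the observed next tokens become once the model has produced a thought. The sketch below assumes that reading; the function name and the probabilities are hypothetical, and details of the actual method (such as which model supplies the no-thought baseline) are not given in the source note.

```python
import math
from typing import Sequence


def information_gain_reward(
    p_with_thought: Sequence[float],
    p_without_thought: Sequence[float],
) -> float:
    """Sum over tokens of log p(token | context, thought) - log p(token | context)."""
    if len(p_with_thought) != len(p_without_thought):
        raise ValueError("probability sequences must align token by token")
    return sum(
        math.log(p_cot) - math.log(p_base)
        for p_cot, p_base in zip(p_with_thought, p_without_thought)
    )


# Hypothetical per-token probabilities of the observed continuation.
helpful = information_gain_reward([0.5, 0.8], [0.25, 0.4])
unhelpful = information_gain_reward([0.2, 0.3], [0.25, 0.4])
print(f"helpful thought: {helpful:+.3f}, unhelpful thought: {unhelpful:+.3f}")
```

A positive reward means the thought made the observed text more predictable; a negative one means it made prediction worse. No external verifier is needed, since the signal comes from the text itself.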

The thing you didn't know you wanted to know: the recent leap in 'reasoning models' may be less about teaching machines to think and more about teaching them *restraint* — when to think, how hard, and when to just answer. The skill was already in the box; what was missing was the judgment to use it.


Sources (11 notes)

Does RL teach reasoning or just when to use it?

Pre-training acquires reasoning capability; RL teaches efficient deployment. A hybrid model combining base reasoning with thinking model steering recovered 91% of performance gains using only 12% of tokens, suggesting RL acts as a deployment optimizer rather than a capability creator.

How should reasoning systems actually be architected?

Research shows RL post-training teaches models *when* to use reasoning mechanisms that pre-training already provides. Decoupled architectures, latent reasoning in continuous space, and interleaved action-grounding all outperform monolithic chain-of-thought approaches.

Can we trigger reasoning without explicit chain-of-thought prompts?

SAE-identified reasoning features can be directly steered to match or exceed chain-of-thought performance across six model families. This reasoning mode activates early in generation and overrides surface-level instructions, suggesting latent reasoning is a fundamental capability independent of explicit prompting.

Can modular cognitive tools unlock reasoning without training?

Four cognitive tools implemented as sandboxed LLM calls improved GPT-4.1 on AIME2024 from 26.7% to 43.3% without any RL training. Modularity enforces operation isolation that pure prompting cannot guarantee, eliciting pre-existing reasoning capability.

Can a single problem unlock reasoning through solution critique?

Critique Fine-Tuning achieves reasoning activation comparable to RLVR using only one problem and teacher-generated critiques of varied solutions, with no reinforcement learning. This demonstrates that exposure to correct versus incorrect reasoning on a specific problem is a sufficient activation signal.

Does extended thinking help or hurt model reasoning?

Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.

Why do some questions perform better without step-by-step reasoning?

Saliency analysis reveals that CoT prompting fails when question information doesn't aggregate into the prompt structure before reasoning begins. For simple questions, direct question-to-answer flow outperforms step-by-step reasoning, showing the optimal prompt depends on question type, not just task category.

Why do better reasoning models ignore instructions?

The MathIF benchmark shows that SFT and RL training improve reasoning but reduce instruction adherence, particularly as chain-of-thought length increases. Longer reasoning chains create contextual distance that dilutes the model's attention to original instructions.

Can chain-of-thought reasoning be learned during pretraining itself?

RLP treats CoT as exploratory action during pretraining, using log-likelihood improvement as verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.

Does procedural knowledge drive reasoning more than factual retrieval?

Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.

Why does reasoning training help math but hurt medical tasks?

A two-phase inference model shows that knowledge retrieval operates in lower network layers while reasoning adjustment happens in higher layers. This separation explains why reasoning training improves math but can degrade knowledge-intensive domains like medicine.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning-systems analyst. The question remains open: **What is the true distinction between teaching a model how to reason versus when to activate reasoning?** A curated library (spanning 2023–2026) proposes a sharp split, but that library is now dated—treat its claims as perishable.

**What a curated library found — and when (dated claims, not current truth):**
- Base models *already possess* reasoning capability from pre-training; RL post-training doesn't teach reasoning—it teaches *timing*. A hybrid approach recovered 91% of performance gain using just 12% of tokens (2025-10).
- A single SAE-identified latent feature, when steered, matches chain-of-thought performance without prompting, suggesting reasoning is a built-in switch (2026-01).
- More reasoning isn't always better: vanilla models spiral into counterproductive self-doubt; RL redirects that same machinery into productive gap-analysis (2025-07).
- Instance-adaptive analysis shows simple questions degrade under step-by-step reasoning—optimal activation depends on the *specific question*, not task category (2025-07).
- Scaling reasoning capability erodes instruction-following because longer chains pull attention from the original request (2025-05).

**Anchor papers (verify; mind their dates):**
- 2025-10: arXiv:2510.07364 "Base Models Know How to Reason, Thinking Models Learn When"
- 2026-01: arXiv:2601.08058 "Reasoning Beyond Chain-of-Thought: A Latent Computational Mode in Large Language Models"
- 2025-05: arXiv:2505.14810 "Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models"
- 2024-11: arXiv:2411.12580 "Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models"

**Your task:**
(1) **RE-TEST EACH CONSTRAINT.** For every claim above, determine whether newer models (o1, o3, Claude Opus 4+), scaling laws, post-training refinements, or evals (MATH-Hard, ARC-c, instruction-following benches) have *relaxed or overturned* it. Separate the durable insight—"reasoning activation is separable from capability"—from the perishable limitation (e.g., "scaling reasoning hurts instruction-following"). Does that trade-off still hold, or have recent architectures or training regimes reconciled it?

(2) **Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months.** Look for papers arguing reasoning *capability* is still being built post-training, or that activation and execution cannot be decoupled, or that the trade-off dissolves under specific conditions.

(3) **Propose 2 research questions that ASSUME the regime may have moved:** e.g., "If activation and capability truly separate, can we train an ultra-lightweight activation oracle on a frozen base model?" or "Does the instruction-following deficit reverse if reasoning is framed as a clarification tool rather than a chain?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
