INQUIRING LINE

Can we predict when a model will develop thinking behaviors?

This explores whether we can anticipate when a model starts to reason — whether 'thinking' is a capability that suddenly appears, or something already latent that training merely switches on at a predictable moment.


This reads the question as: can we forecast the point at which a model develops reasoning behavior? The corpus reframes it in a surprising way. The more striking answer isn't about *predicting an emergence event*; it is that there may be no emergence event to predict. Several notes converge on the idea that reasoning is already sitting inside base models, latent, waiting to be elicited rather than built. One survey finds that five independent methods (reinforcement learning steering, critique fine-tuning, decoding tweaks, sparse-feature steering, and RLVR) all unlock the *same* dormant capability, and concludes that post-training selects reasoning rather than creating it (Do base models already contain hidden reasoning ability?). If that's right, 'when does thinking develop' becomes 'when do we choose to switch it on.'

That reframing sharpens what training actually contributes: timing, not capability. RL post-training is described as a *deployment optimizer*: it teaches a model *when* to spend reasoning effort, with a hybrid setup recovering 91% of the gains using just 12% of the tokens (Does RL teach reasoning or just when to use it?). The same flavor of result shows up in models that learn to route between deep thinking and quick answers on their own, without anyone labeling which problems are hard (Can models learn when to think versus respond quickly?). So the predictable thing isn't the birth of reasoning; it's the model learning a policy for when to use it.
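To put the 91% and 12% figures on one scale, a small calculation helps. This is only an illustrative sketch: the function name `gain_per_token_share` is hypothetical, and it assumes both percentages are measured relative to the full thinking model (which would score 1.0 by construction).

```python
def gain_per_token_share(gain_recovered: float, token_share: float) -> float:
    """Ratio of recovered performance gain to the share of tokens spent.

    Both arguments are fractions of the full thinking model's values
    (assumption): 1.0 means "same as the full thinking model".
    """
    if token_share <= 0:
        raise ValueError("token_share must be positive")
    return gain_recovered / token_share


# Figures quoted in the text: 91% of the gains with 12% of the tokens.
hybrid_efficiency = gain_per_token_share(0.91, 0.12)
print(f"hybrid setup: {hybrid_efficiency:.2f}x gain per unit of token share")
```

Under that assumption the hybrid setup gets roughly seven to eight times more of the gain per token spent than the full thinking model, which is the sense in which RL looks like a deployment optimizer here.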

There's also a quieter, more unsettling thread: thinking behavior often *hurts* before it helps, and only training flips the sign. Vanilla models given a thinking mode use it for counterproductive self-doubt until RL redirects that exact mechanism into useful gap analysis (Does extended thinking help or hurt model reasoning?). Asking a model to think first degrades general performance until RL, judging only the final answer, teaches it to make those thoughts pay off (Why does asking models to think first hurt performance?). Even when present, more thinking isn't monotonically better: accuracy peaks then collapses as thinking tokens climb, with models overthinking easy problems and underthinking hard ones (Does more thinking time always improve reasoning accuracy?). The trajectory is predictable in shape (initial harm, then benefit, then diminishing returns), but it's a training curve, not a switch.
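The peak-then-collapse pattern can be read off an accuracy-versus-budget curve. The sketch below uses only the two data points quoted in the source notes (about 1,100 tokens at 87.3% and about 16K tokens at 70.3%); the helper `best_budget` is a hypothetical name, and a real curve would need many more measured points to locate the actual peak.

```python
from typing import List, Tuple

Point = Tuple[int, float]  # (thinking tokens, benchmark accuracy in %)


def best_budget(curve: List[Point]) -> Point:
    """Return the measured (tokens, accuracy) point with the highest accuracy."""
    return max(curve, key=lambda point: point[1])


# Only the two points reported in the source notes; token counts are approximate.
curve: List[Point] = [(1100, 87.3), (16000, 70.3)]

tokens, accuracy = best_budget(curve)
drop = round(curve[0][1] - curve[-1][1], 1)  # accuracy lost at the larger budget
print(f"best measured budget: ~{tokens} tokens at {accuracy}%")
print(f"accuracy drop at ~16K tokens: {drop} points")
```

With just these two points, the smaller budget wins by 17 percentage points, so more thinking tokens cannot be treated as a safe default.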

The genuinely interesting twist is whether you can *measure* thinking as it forms rather than guess at it. A 'deep-thinking ratio' tracks how many tokens get their predictions substantially revised across the model's layers, and that internal signal correlates with accuracy well enough to drive a cheaper test-time strategy (Can we measure how deeply a model actually reasons?). That's closer to a real predictor: an observable internal marker of reasoning effort. But it comes with a warning from the skeptics: visible reasoning traces can be pure stylistic mimicry, with invalid traces still producing correct answers, meaning the *appearance* of thinking isn't proof of the function (Do reasoning traces actually cause correct answers?). So if you want to predict thinking, you'll want internal measures, not the visible chain-of-thought.
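A rough sketch of what a deep-thinking ratio could look like in code follows. It is a simplified proxy, not the published definition: it assumes you already have each token's top-1 prediction at every layer (for example from a logit-lens style readout), and it counts a token as "deep" when its prediction only settles on the final-layer answer in the later part of the network. The names `settle_layer`, `deep_thinking_ratio`, and the `depth_fraction` cutoff are all hypothetical.

```python
from typing import Sequence


def settle_layer(layer_preds: Sequence[int]) -> int:
    """Earliest layer index from which the top-1 prediction stays equal
    to the final layer's prediction through to the last layer."""
    final = layer_preds[-1]
    settle = len(layer_preds) - 1
    # Walk backwards from the last layer until the prediction differs.
    for i in range(len(layer_preds) - 1, -1, -1):
        if layer_preds[i] != final:
            break
        settle = i
    return settle


def deep_thinking_ratio(
    per_token_layer_preds: Sequence[Sequence[int]],
    depth_fraction: float = 0.5,  # assumed cutoff, not taken from the paper
) -> float:
    """Fraction of tokens whose prediction settles only at or after
    `depth_fraction` of the way through the layers."""
    if not per_token_layer_preds:
        return 0.0
    deep = sum(
        1
        for preds in per_token_layer_preds
        if settle_layer(preds) >= depth_fraction * len(preds)
    )
    return deep / len(per_token_layer_preds)


# Toy input: 4 tokens, 4 layers each, values are made-up token ids.
toy = [
    [7, 7, 7, 7],  # settled from the first layer: shallow
    [1, 2, 3, 7],  # settles only at the last layer: deep
    [1, 7, 7, 7],  # settles at layer 1: shallow
    [1, 7, 2, 7],  # revised again late, settles at the last layer: deep
]
print(deep_thinking_ratio(toy))
```

The point of the design is that nothing here looks at the visible chain-of-thought text; the signal comes entirely from how predictions change across depth, which is why a measure of this kind is harder to fake with stylistic mimicry.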

Two adjacent findings stretch the question further. Reasoning may be plantable *earlier* than anyone assumed: chain-of-thought trained directly into pretraining with an information-gain reward lifts reasoning ~19% (Can chain-of-thought reasoning be learned during pretraining itself?), pushing the 'when' upstream. And models develop a kind of self-knowledge about their own behaviors without being trained to report it (Can language models describe their own learned behaviors?), which hints that the cleanest predictor of a model's thinking tendencies might eventually be the model itself.


Sources (10 notes)

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does RL teach reasoning or just when to use it?

Pre-training acquires reasoning capability; RL teaches efficient deployment. A hybrid model combining base reasoning with thinking model steering recovered 91% of performance gains using only 12% of tokens, suggesting RL acts as a deployment optimizer rather than a capability creator.

Can models learn when to think versus respond quickly?

Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.

Does extended thinking help or hurt model reasoning?

Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.

Why does asking models to think first hurt performance?

Prompting models to think before responding degrades performance on general tasks. RL training with judges evaluating only responses teaches models to generate thoughts that actually improve outputs across diverse task types, not just math.

Does more thinking time always improve reasoning accuracy?

Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.

Can we measure how deeply a model actually reasons?

Deep-thinking ratio (DTR) measures the proportion of tokens whose predictions undergo significant revision across model layers, correlating robustly with accuracy across AIME, HMMT, and GPQA benchmarks. Think@n, a test-time strategy using DTR, matches self-consistency performance while reducing inference costs.

Do reasoning traces actually cause correct answers?

R1's intermediate tokens carry no special execution semantics and are generated identically to other LLM output. Invalid traces frequently produce correct answers, proving traces are not causally necessary—they correlate with answers via learned formatting, not functional reasoning.

Can chain-of-thought reasoning be learned during pretraining itself?

RLP treats CoT as exploratory action during pretraining, using log-likelihood improvement as verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.

Can language models describe their own learned behaviors?

LLMs fine-tuned on datasets exhibiting specific behaviors accurately describe those behaviors without any training to self-report. This suggests behavioral regularities are encoded and accessible in ways that factual knowledge often is not.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question: *Can we predict when a model will develop thinking behaviors?* remains open—but a curated library (Oct 2024–May 2026) suggests the framing may need updating.

**What a curated library found — and when (dated claims, not current truth):**

Findings span 2024–2026; treat each as time-stamped, not current:

• Reasoning may not *emerge*; five independent post-training methods (RL, critique fine-tuning, decoding, sparse steering, RLVR) unlock the *same* latent capability, suggesting reasoning pre-exists in base models and post-training *selects* rather than creates it (~2025).

• RL post-training is a deployment optimizer teaching *when* to reason: hybrid setups recover 91% of gains with 12% of tokens; models self-learn routing between deep and quick answers without labeled difficulty (~2025).

• Thinking initially *hurts* performance—vanilla models use thinking for counterproductive self-doubt until RL flips the sign; accuracy peaks then collapses beyond a critical thinking-token threshold (~2025–2026).

• A 'deep-thinking ratio' (layer-wise prediction revisions) correlates with accuracy and drives cheaper test-time strategies, but visible reasoning traces can be stylistic mimicry—invalid traces still produce correct answers (~2026).

• Chain-of-thought trained into pretraining with information-gain reward lifts reasoning ~19%; models develop behavioral self-awareness without explicit training (~2025).

**Anchor papers (verify; mind their dates):**
- arXiv:2505.13379 *Thinkless* (2025-05)
- arXiv:2602.13517 *Deep-Thinking Tokens* (2026-02)
- arXiv:2510.18176 *RLVR Traces* (2025-10)
- arXiv:2501.11120 *LLM Behavioral Self-Awareness* (2025-01)

**Your task:**

(1) **RE-TEST EACH CONSTRAINT.** The claim that reasoning is latent (not emergent) depends on five methods converging on one capability. Since mid-2026, have newer architectures, scaling laws, or multi-modal training changed whether the latent-reasoning claim *holds* at larger scales? Does the "thinking hurts first" curve still describe current frontier models, or do recent scaling/pretraining tweaks accelerate when RL becomes beneficial? Separate the durable insight (post-training optimizes timing, not just capability) from any perishable limit (e.g., peak-then-collapse at fixed compute budgets).

(2) **Surface the strongest CONTRADICTING or SUPERSEDING work** from the last ~6 months. Are there papers showing emergent reasoning *without* post-training, or claiming visible traces *do* reliably indicate reasoning function, that cut against the latent-reasoning and stylistic-mimicry consensus?

(3) **Propose 2 research questions** that assume the regime has moved: (a) If reasoning is already latent, what determines *which model size or training recipe* makes it easiest to elicit, and is there a scale below which the latent capability never manifests? (b) Can we design an internal measure of reasoning (beyond deep-thinking ratio) that survives adversarial styling, and does it predict performance on truly novel domains?

**Cite arXiv IDs; flag anything you cannot ground in a real paper.**
