INQUIRING LINE

Can distillation from stronger models create genuinely new reasoning abilities?

This explores whether learning from a stronger model can actually add reasoning skills a model didn't have — or whether it just surfaces abilities that were already latent and waiting to be unlocked.


The corpus pushes back hard on the premise: most of the evidence suggests that post-training does not create reasoning at all but elicits what the base model already contains. The strongest version of this claim comes from work showing that five independent techniques (RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR) all point to the same conclusion: post-training selects reasoning that is already latent in base-model activations rather than installing it [Do base models already contain hidden reasoning ability?]. If the bottleneck is elicitation, not acquisition, then distillation's gains are better read as a more efficient unlocking mechanism than as the birth of a new skill.

Several other notes reinforce how shallow the 'new ability' framing can be. RLVR turns out to act mainly on about 20% of tokens, the high-entropy 'forking points' where reasoning decisions happen, and training on just those matches or exceeds full updates, suggesting the learning signal adjusts existing decision behavior rather than building new machinery [Do high-entropy tokens drive reasoning model improvements?]. Large reasoning gains can even be elicited with no training at all: modular cognitive tools lifted GPT-4.1 on AIME2024 from 26.7% to 43.3% purely by isolating operations [Can modular cognitive tools unlock reasoning without training?], and a single steering vector extracted from 50 paired examples can cut chain-of-thought length by 67% while maintaining accuracy, without retraining [Can we steer reasoning toward brevity without retraining?]. When behavior this consequential is reachable without weight updates, it's hard to argue the underlying capability was absent.

The sharpest warning sign for distillation specifically is what happens when models imitate reasoning form without the underlying logic. Chain-of-thought trained on a distribution degrades predictably outside it, producing fluent but logically inconsistent traces: the model copies the shape of reasoning rather than its validity [Does chain-of-thought reasoning actually generalize beyond training data?]. Since distillation is precisely the business of copying a teacher's surface traces, this is the failure mode to fear: you can get a student that sounds like the teacher inside the training distribution and collapses outside it. That gap between mimicked form and transferable competence is the central risk in calling distilled reasoning 'genuinely new.'

That said, the corpus isn't uniformly deflationary: there's a real distinction between capability and protocol. Non-reasoning models can't catch up to reasoning models no matter how much inference compute you throw at them, because training instills a *protocol* that makes extra tokens productive [Can non-reasoning models catch up with more compute?]. This is the strongest case that training (and by extension distillation) adds something durable: not a new latent ability, but a learned discipline for deploying it. Related work shows models can even learn *when* to engage extended thinking versus answering directly [Can models learn when to think versus respond quickly?], and that more thinking isn't free, since accuracy peaks and then declines past a token threshold [Does more thinking time always improve reasoning accuracy?]. So distillation may transfer the *control policy* over reasoning even when the raw capability was already present.

The synthesis, then: the corpus reframes the question. Distillation almost certainly doesn't conjure reasoning from nothing. Base models already hold latent capability, and copying surface traces risks form-without-logic. But there's a defensible middle ground where distillation transfers a *protocol* (when to think, how long, which forking tokens matter) that a model couldn't easily discover on its own. The interesting frontier in this collection is methods that build reasoning structure more fundamentally, like energy-based transformers that derive system-2 behavior from unsupervised learning alone [Can energy minimization unlock reasoning without domain-specific training?]. That is a hint that genuinely new reasoning, if it exists, may come from architecture rather than from imitating a stronger teacher.


Sources (9 notes)

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
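
A minimal sketch of the selection step described above, assuming per-position next-token distributions are available. The function names, the toy distributions, and the rule of keeping the top fraction by Shannon entropy are illustrative assumptions, not the note's actual implementation.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def forking_token_mask(distributions, keep_fraction=0.2):
    """Return a 0/1 mask marking the highest-entropy positions.

    distributions: list of per-position probability lists.
    keep_fraction: share of positions to keep (0.2 mirrors the ~20% figure).
    """
    entropies = [token_entropy(p) for p in distributions]
    n_keep = max(1, round(keep_fraction * len(entropies)))
    # Rank positions by entropy, highest first; ties broken by position.
    ranked = sorted(range(len(entropies)), key=lambda i: (-entropies[i], i))
    keep = set(ranked[:n_keep])
    return [1 if i in keep else 0 for i in range(len(entropies))]

def masked_mean_loss(per_token_losses, mask):
    """Average the loss over masked-in positions only."""
    kept = [loss for loss, m in zip(per_token_losses, mask) if m]
    return sum(kept) / len(kept)

# Toy 5-position sequence over a 3-token vocabulary (made-up numbers).
dists = [
    [0.98, 0.01, 0.01],
    [0.34, 0.33, 0.33],    # near-uniform: a "forking" position
    [0.90, 0.05, 0.05],
    [0.99, 0.005, 0.005],
    [0.70, 0.20, 0.10],
]
mask = forking_token_mask(dists, keep_fraction=0.2)
loss = masked_mean_loss([1.0, 2.0, 3.0, 4.0, 5.0], mask)
```

In a real training loop the mask would gate which token positions contribute gradient. Here it only restricts which toy loss values are averaged.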

Can modular cognitive tools unlock reasoning without training?

Four cognitive tools implemented as sandboxed LLM calls improved GPT-4.1 on AIME2024 from 26.7% to 43.3% without any RL training. Modularity enforces operation isolation that pure prompting cannot guarantee, eliciting pre-existing reasoning capability.
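
The isolation idea can be sketched as follows, assuming each tool is a separate model call that sees only its own instructions and input. The tool names, the `make_tool` helper, and the stub model are hypothetical and are not the four tools from the note.

```python
from typing import Callable, Dict, List

def make_tool(instructions: str, llm: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap one operation as an isolated LLM call with its own prompt."""
    def tool(payload: str) -> str:
        # The prompt is rebuilt from scratch on every call, so the tool sees
        # only its own instructions and the payload it was handed.
        return llm(f"{instructions}\n\nINPUT:\n{payload}")
    return tool

# Stub standing in for a real model client; it records every prompt it sees.
seen_prompts: List[str] = []

def stub_llm(prompt: str) -> str:
    seen_prompts.append(prompt)
    return f"[response to {len(prompt)} chars]"

# Hypothetical tool set, for illustration only.
tools: Dict[str, Callable[[str], str]] = {
    "restate_problem": make_tool("Restate the problem in your own words.", stub_llm),
    "check_answer": make_tool("Check the proposed answer for errors.", stub_llm),
}

restated = tools["restate_problem"]("Find x such that 2x + 3 = 11.")
verdict = tools["check_answer"]("Proposed answer: x = 4")
```

The design point is that no conversation history is shared between calls. That is one plausible reading of the "operation isolation" the note contrasts with pure prompting.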

Can we steer reasoning toward brevity without retraining?

Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.
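
The note does not say how the vector is extracted. One common approach, assumed here, is the difference between mean activations on the two sides of each pair. The sketch below uses tiny made-up activations, and its function names and `alpha` scale are illustrative.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def steering_vector(concise_acts, verbose_acts):
    """Difference of mean activations: concise minus verbose (an assumption)."""
    mu_concise = mean_vector(concise_acts)
    mu_verbose = mean_vector(verbose_acts)
    return [c - v for c, v in zip(mu_concise, mu_verbose)]

def apply_steering(hidden, vector, alpha=1.0):
    """Shift a hidden state along the steering direction at inference time."""
    return [h + alpha * v for h, v in zip(hidden, vector)]

# Two paired examples with 2-dimensional activations (made-up numbers);
# the note's setting would use 50 pairs at the model's hidden size.
concise = [[1.0, 0.0], [3.0, 2.0]]
verbose = [[0.0, 1.0], [2.0, 1.0]]

vec = steering_vector(concise, verbose)
steered = apply_steering([0.5, 0.5], vec, alpha=2.0)
```

No weights change in this sketch. The only intervention is an additive shift to activations, which is why the note can call the method training-free.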

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.

Can models learn when to think versus respond quickly?

Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.

Does more thinking time always improve reasoning accuracy?

Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.

Can energy minimization unlock reasoning without domain-specific training?

Energy-Based Transformers assign energy values to input-prediction pairs and use gradient descent minimization for inference, yielding 35% higher training scaling rates and 29% more inference-compute gains than Transformer++, while generalizing better on out-of-distribution data without domain-specific scaffolding.
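
Inference by energy minimization can be illustrated with a toy example. The quadratic energy below is a hand-written stand-in (an assumption) for what would be a learned network, and the step size and step counts are arbitrary.

```python
def energy(x: float, y: float) -> float:
    """Toy energy: lowest when the prediction y equals 2 * x."""
    return (y - 2.0 * x) ** 2

def energy_grad_y(x: float, y: float) -> float:
    """Analytic gradient of the toy energy with respect to y."""
    return 2.0 * (y - 2.0 * x)

def infer(x: float, steps: int, step_size: float = 0.1, y_init: float = 0.0) -> float:
    """Refine a candidate prediction by gradient descent on the energy."""
    y = y_init
    for _ in range(steps):
        y -= step_size * energy_grad_y(x, y)
    return y

x = 3.0
y_short = infer(x, steps=5)     # little inference compute
y_long = infer(x, steps=100)    # more inference compute
```

More descent steps yield a lower-energy prediction, which is the sense in which extra inference compute acts as additional "thinking" in this framing.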

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst testing whether distillation from stronger models creates genuinely new reasoning abilities in LLMs, treating this as an open question despite recent findings.

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2025. A curated library concluded:
• Base models already possess latent reasoning capability; five independent techniques (RL steering, critique fine-tuning, decoding changes, feature steering, RLVR) converge on elicitation-not-acquisition (2025).
• RLVR acts mainly on ~20% of tokens (high-entropy 'forking points'); training on those alone matches or exceeds full updates, suggesting adjustment of existing behavior rather than machinery-building (2025).
• Modular cognitive tools lifted GPT-4.1 from 26.7% to 43.3% on AIME2024 purely by isolating operations, with no retraining; a single steering vector from 50 paired examples cuts chain-of-thought length by 67% without weight updates (2025).
• Chain-of-thought trained on a distribution degrades predictably outside it, producing fluent but logically inconsistent traces, which raises the risk that distillation copies form without logic (2025).
• Non-reasoning models do not match reasoning models regardless of inference budget; training instills a protocol that makes extra tokens productive; models can learn *when* to engage extended thinking (2025).

Anchor papers (verify; mind their dates):
• arXiv:2506.12115 (Cognitive Tools, Jun 2025)
• arXiv:2506.01939 (High-Entropy Tokens, Jun 2025)
• arXiv:2508.01191 (CoT Mirage, Aug 2025)
• arXiv:2507.02092 (Energy-Based Transformers, Jul 2025)

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer models (o1, o3, Claude 4+), architectural changes (sparse experts, memory-augmented), training methods (curriculum RL, constitutional AI), tooling (prompt caching, reasoning harnesses), or evaluation frameworks have since relaxed or overturned it. Separate the durable question (can distillation transfer a protocol for reasoning?) from perishable limitations (does a steering vector still work at 10B+ scale?). Cite what resolved it; state plainly where constraints still hold.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months that argues distillation does create genuinely new reasoning (not just elicitation).
(3) Propose 2 research questions that assume the regime may have moved, e.g.: can protocol-transfer distillation work at 10M-token scales? Does multi-agent scaffolding plus distillation enable reasoning absent in the base model?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
