INQUIRING LINE

The surprising finding: AI models already know how to reason — specialized training just teaches them when to bother doing it.

What makes reasoning-specific post-training different from standard parameter scaling?

This explores why training a model specifically to reason works differently from just making the model bigger or training it longer in the usual ways — and what the corpus says is actually happening when you 'post-train for reasoning.'

Reasoning-specific post-training is a different kind of intervention from standard parameter scaling, and the corpus's most striking answer is that the two may not even operate on the same thing. A recurring finding is that base models already contain latent reasoning ability; post-training selects and deploys it rather than building it. Five independent mechanisms (RL steering, critique fine-tuning, decoding tweaks, feature steering, and RLVR) all elicit reasoning that was already present in base activations (see "Do base models already contain hidden reasoning ability?"). That reframes the whole question: where parameter scaling pours in more capacity, reasoning post-training is closer to flipping a switch on capacity you already paid for. The sharpest version of this is the claim that RL post-training teaches a model *when* to reason, not *how*: hybrid models recover 91% of the gains by routing tokens alone (see "Does RL post-training create reasoning or just deploy it?").

If reasoning were just a function of scale or compute, you'd expect a smaller or non-reasoning model to catch up given enough inference budget. It doesn't. Reasoning models persistently outperform non-reasoning ones regardless of how many tokens you let them spend, because training installs a *protocol* that makes extra tokens productive rather than wasted (see "Can non-reasoning models catch up with more compute?"). So the difference isn't raw capability; it's the deployment machinery that decides whether thinking longer actually helps. This is also why post-training scales along a different axis than pretraining: work on emulated fine-tuning shows pretraining scale drives factual knowledge in the lower layers, while fine-tuning scale changes behavioral expression in the upper layers. These are two decoupled knobs with different architectural homes (see "Do pretraining and fine-tuning scale independently in language models?").

The catch is that post-training for reasoning can shape the *appearance* of reasoning without improving the logic underneath, a failure mode that scaling up knowledge does not produce in the same way. Fine-tuning can actually loosen the causal link between a model's reasoning steps and its final answer: chains become performative, so that truncating them, paraphrasing them, or inserting filler leaves the answer unchanged (see "Does fine-tuning disconnect reasoning steps from final answers?"). And the reasoning that post-training elicits stays bounded by the training distribution. Chain-of-thought degrades predictably under shifts in task, length, or format (see "Does chain-of-thought reasoning actually generalize beyond training data?"); failures cluster at instance-novelty boundaries rather than at complexity thresholds (see "Do language models fail at reasoning due to complexity or novelty?"); and models lean on semantic association rather than symbolic logic, so performance collapses once the familiar semantics are stripped away, even with the correct rules in hand (see "Do large language models reason symbolically or semantically?").

What makes reasoning post-training genuinely its own category, then, is that its best methods barely touch parameters at all. You can extract a single 'verbosity' direction from 50 examples and cut chain-of-thought length 67% with no retraining (see "Can we steer reasoning toward brevity without retraining?"). You can use the model's own answer-span confidence as a reward to sharpen step-by-step reasoning while *fixing* the calibration that RLHF degrades, with no human labels and no external verifier (see "Can model confidence work as a reward signal for reasoning?"). Small models trained with DPO on a teacher's correct-vs-incorrect examples can match large models on structured reasoning tasks, because the explicit negative examples target format failures that more data alone won't fix (see "Can small models match large models on function calling?"). And reasoning can scale in *width*, by sampling parallel latent trajectories, instead of along the depth-and-parameters axis that scaling laws assume (see "Can reasoning systems scale faster by exploring parallel paths instead?").

The thing you didn't know you wanted to know: the gap between a reasoning model and a non-reasoning one may have almost nothing to do with size, and almost everything to do with whether the model has learned a protocol for spending its tokens well. Parameter scaling buys you the latent ability; reasoning post-training decides whether you ever actually use it.

Sources 12 notes

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.

Do pretraining and fine-tuning scale independently in language models?

Emulated Fine-Tuning reveals that scaling pretraining improves factual knowledge while scaling fine-tuning improves behavioral helpfulness. This decoupling has architectural roots: pretraining enriches lower-layer knowledge storage, while fine-tuning modifies upper-layer behavior expression.
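
One way to picture the decoupling is to combine the two scales at decoding time. The sketch below assumes, as an illustration rather than a description of the paper's exact procedure, that the behavioral shift learned by a small fine-tuned model is added in log-probability space to a larger base model's next-token distribution. All function names and the toy two-token vocabulary are hypothetical.

```python
import math
from typing import Dict

# token -> log-probability over a shared toy vocabulary (hypothetical format)
LogProbs = Dict[str, float]


def emulate_fine_tuning(
    big_base: LogProbs, small_base: LogProbs, small_tuned: LogProbs
) -> LogProbs:
    """Add the small model's fine-tuning shift to the big base model, then renormalize."""
    raw = {
        tok: big_base[tok] + (small_tuned[tok] - small_base[tok])
        for tok in big_base
    }
    # Renormalize so the combined scores form a valid distribution.
    log_z = math.log(sum(math.exp(v) for v in raw.values()))
    return {tok: v - log_z for tok, v in raw.items()}


uniform = {"a": math.log(0.5), "b": math.log(0.5)}
tuned = {"a": math.log(0.8), "b": math.log(0.2)}

# The big base model supplies "knowledge"; the small pair supplies the behavioral delta.
combined = emulate_fine_tuning(big_base=uniform, small_base=uniform, small_tuned=tuned)
# With no fine-tuning shift, the big base distribution comes back unchanged.
unchanged = emulate_fine_tuning(big_base=tuned, small_base=uniform, small_tuned=uniform)
```

Keeping the base term and the delta term separate is what lets each be scaled on its own in this picture.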

Does fine-tuning disconnect reasoning steps from final answers?

Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.
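
The three perturbations named above can be sketched as a small harness. This is a minimal illustration, not the procedure from the underlying study: `answer_fn` is a hypothetical stand-in for a model call that maps a question and a reasoning chain to a final answer, the paraphrase is supplied by the caller, and "keep the first half" is just one possible truncation rule.

```python
from typing import Callable, Dict, List

# Hypothetical interface: (question, reasoning steps) -> final answer.
# This is not a real library API.
AnswerFn = Callable[[str, List[str]], str]


def faithfulness_probe(
    answer_fn: AnswerFn,
    question: str,
    steps: List[str],
    paraphrased_steps: List[str],
    filler: str = "...",
) -> Dict[str, bool]:
    """Report, per perturbation, whether the final answer stayed the same."""
    baseline = answer_fn(question, steps)
    perturbed = {
        # Early termination: keep only the first half of the chain.
        "early_termination": steps[: len(steps) // 2],
        # Paraphrasing: same content, different wording (caller-supplied).
        "paraphrase": paraphrased_steps,
        # Filler substitution: replace every step with contentless tokens.
        "filler": [filler] * len(steps),
    }
    return {
        name: answer_fn(question, chain) == baseline
        for name, chain in perturbed.items()
    }


def ignores_chain(question: str, chain: List[str]) -> str:
    # Toy "performative" model: the answer never depends on the chain.
    return "42"


def uses_chain(question: str, chain: List[str]) -> str:
    # Toy "functional" model: the answer is read off the last step.
    return chain[-1] if chain else ""


steps = ["6 * 7 is needed", "6 * 7 = 42", "42"]
paraphrase = ["we need six times seven", "that product is 42", "42"]
performative = faithfulness_probe(ignores_chain, "What is 6 * 7?", steps, paraphrase)
functional = faithfulness_probe(uses_chain, "What is 6 * 7?", steps, paraphrase)
```

Following the note above, a model whose answers stay invariant under all three perturbations more often is read as one whose chain is performative; the toy `ignores_chain` model is invariant everywhere, while `uses_chain` changes its answer under truncation and filler.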

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Do language models fail at reasoning due to complexity or novelty?

LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Can we steer reasoning toward brevity without retraining?

Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.
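
A common way to obtain a single steering direction from paired examples is a difference of mean activations. Whether Activation-Steered Compression uses exactly this recipe is an assumption here, so treat the following as a sketch of the general idea on toy 3-dimensional vectors. The function names and the `alpha` strength knob are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(
    verbose_acts: Sequence[Vector], concise_acts: Sequence[Vector]
) -> Vector:
    """Direction pointing from the verbose mean toward the concise mean."""
    mean_verbose = mean_vector(verbose_acts)
    mean_concise = mean_vector(concise_acts)
    return [c - v for c, v in zip(mean_concise, mean_verbose)]


def steer(hidden: Vector, direction: Vector, alpha: float = 1.0) -> Vector:
    """Shift a hidden state along the direction; alpha is an assumed strength knob."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy stand-ins for activations collected on paired verbose/concise examples.
verbose = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
concise = [[1.0, 0.0, 1.0], [1.0, 0.0, 1.0]]

direction = steering_vector(verbose, concise)
steered = steer([3.0, 5.0, 1.0], direction, alpha=1.0)
```

Only the first coordinate differs between the two groups in this toy data, so the extracted direction touches that coordinate alone and no weights are updated.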

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.
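
Ranking traces by answer-span confidence and turning the ranking into preference pairs can be sketched as follows. The scoring rule (geometric mean of answer-token probabilities) and the `min_gap` threshold are assumptions chosen for illustration; the note does not specify how RLSF computes or thresholds confidence.

```python
import math
from itertools import combinations
from typing import List, NamedTuple, Tuple


class Trace(NamedTuple):
    text: str
    # Model probabilities of the tokens in the final answer span (toy values).
    answer_token_probs: List[float]


def answer_confidence(trace: Trace) -> float:
    """Geometric mean of answer-token probabilities (one possible score)."""
    logs = [math.log(p) for p in trace.answer_token_probs]
    return math.exp(sum(logs) / len(logs))


def preference_pairs(
    traces: List[Trace], min_gap: float = 0.1
) -> List[Tuple[str, str]]:
    """Build (chosen, rejected) pairs whose confidence differs by at least min_gap."""
    ranked = sorted(traces, key=answer_confidence, reverse=True)
    pairs = []
    for higher, lower in combinations(ranked, 2):
        if answer_confidence(higher) - answer_confidence(lower) >= min_gap:
            pairs.append((higher.text, lower.text))
    return pairs


traces = [
    Trace("A", [0.9, 0.9]),
    Trace("B", [0.5, 0.5]),
    Trace("C", [0.85, 0.85]),
]
pairs = preference_pairs(traces)
```

The resulting pairs need no human label or external verifier, and near-ties such as A versus C are dropped so that only clear confidence gaps become training signal.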

Can small models match large models on function calling?

Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.
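
To see why explicit negative examples matter, it helps to look at the per-pair DPO objective, which rewards the policy for raising the likelihood of the correct example relative to the incorrect one, measured against a frozen reference model. The sketch below computes that loss from sequence log-probabilities; the numeric inputs are made-up toy values and `beta=0.1` is an assumed setting, not one reported in the note.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (chosen margin - rejected margin))."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) equals softplus(-x); branch to avoid overflow in exp().
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


# Policy identical to the reference: no preference learned yet.
loss_at_init = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Policy has raised the correct call and lowered the incorrect one.
loss_after = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

Supervised fine-tuning on correct examples alone only has the first term to work with; the rejected term is what pushes probability away from a specific malformed output.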

Can reasoning systems scale faster by exploring parallel paths instead?

GRAM demonstrates that recursive reasoning models should maintain and explore multiple latent trajectories in parallel, not only deepen single paths. Width-scaling avoids the serial latency penalty of depth while sampling the solution distribution more effectively on ambiguous problems.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning-systems analyst. The question remains open: what fundamentally distinguishes reasoning-specific post-training from standard parameter scaling — and does the distinction still hold?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat each as a snapshot, not settled fact.
- Base models already contain latent reasoning ability; post-training selects and deploys it rather than building it (~2025).
- RL post-training teaches models *when* to reason, not *how*; hybrid models recover 91% of gains by routing tokens alone (~2025).
- Non-reasoning models cannot match reasoning models even with unlimited inference, because training installs a *protocol* for productive token use (~2024–2025).
- Fine-tuning scales along a different axis than pretraining: fine-tuning changes behavioral expression in upper layers, while pretraining drives factual knowledge in lower layers (~2023).
- Post-training can degrade chain-of-thought faithfulness independently of accuracy; reasoning steps become performative (~2024).
- Chain-of-thought reasoning is distribution-bounded; effectiveness degrades under task shifts, length changes, or format variations; reasoning breakdown is driven by instance-level unfamiliarity, not task complexity (~2025–2026).
- A single verbosity direction extracted from 50 examples can cut chain-of-thought length 67% with no retraining; model confidence as intrinsic reward sharpens reasoning while fixing RLHF calibration (~2025).
- Small models trained with DPO on teacher examples match large models on structured reasoning (~2024).

Anchor papers (verify; mind their dates):
- arXiv:2305.14825 (2023): Semantic vs. symbolic reasoning.
- arXiv:2410.18890 (2024): Small-scale function calling and reasoning.
- arXiv:2504.09858 (2025): Reasoning models effective without explicit thinking.
- arXiv:2512.07783 (2025): Interplay of pre-training, mid-training, and RL.

Your task:
(1) RE-TEST EACH CONSTRAINT. For the 91%-recovery routing claim and the "installs a protocol" hypothesis: have newer inference methods (speculative decoding, adaptive compute, mixture-of-experts routing) or training regimes (curriculum RL, multi-task reasoning fine-tuning) since relaxed or overturned the routing/protocol framing? Does the latent-ability hypothesis still hold under instruction tuning at scale, or does post-training also BUILD new reasoning capacity? Separate durable claim (post-training ≠ parameter scaling) from perishable one (mechanism is protocol-selection-only).
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Has any recent paper directly challenged the "models already possess latent reasoning" finding, or shown that post-training does inject *new* reasoning capability, not just select existing ones?
(3) Propose 2 research questions that ASSUME the regime may have moved: (a) If post-training installs a protocol, can you *measure* or *intervene* on that protocol directly, without RL? (b) Does the distribution-boundedness of chain-of-thought break under multi-modal or long-context reasoning, where the latent space may have fundamentally expanded?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
