INQUIRING LINE

Does targeting the edge of competence during RL pretraining unlock true reasoning gains?

This explores whether RL produces genuinely new reasoning ability, rather than just sharpening what's already there, and whether the trick is aiming training at tasks right at the model's competence boundary.


The corpus gives an unusually crisp answer: most of the time RL doesn't create reasoning; it elicits and schedules reasoning the base model already has. Several independent lines converge here. Base models appear to hold latent reasoning that minimal training simply unlocks (see 'Do base models already contain hidden reasoning ability?'), and RL post-training looks less like teaching a skill and more like teaching *when* to deploy it, with hybrid routing recovering 91% of gains by adjusting timing alone (see 'Does RL post-training create reasoning or just deploy it?'). Studies of RLVR sharpen the point: reward learning improves sampling efficiency inside the existing capability envelope without pushing its walls outward, and even spurious rewards work nearly as well as correct ones for a well-pretrained model (see 'What does reward learning actually do to model reasoning?').
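One common yardstick for telling sharpening apart from extending is pass@k, the probability that at least one of k samples solves a task. The sketch below uses the standard unbiased estimator 1 - C(n-c, k)/C(n, k); that estimator is an assumption brought in for illustration, not something these notes specify, and the sample counts are invented. If RL only sharpens sampling, pass@1 rises while pass@k at large k stays where the base model already was.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n samples of which c are correct."""
    if n <= 0 or not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need n >= 1, 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every draw of k contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Invented counts for a single task, 256 samples per model (illustrative only).
N_SAMPLES = 256
correct = {"base": 8, "rl_tuned": 96}

for name, c in correct.items():
    scores = {k: round(pass_at_k(N_SAMPLES, c, k), 3) for k in (1, 16, 256)}
    print(name, scores)
```

With these invented counts both models reach a pass@256 of 1.0, so the task was within the base model's reach all along; only the low-k scores move.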

So where does the 'edge of competence' come in? One controlled synthetic study speaks directly to your question: RL produces *genuine* capability gains, not just resampling, when two conditions hold together. Pretraining must already have planted the reasoning primitives, and the RL data must target tasks sitting right at the boundary of what the model can currently do. Strip either condition and RL collapses back into refining the sampling distribution rather than extending reach (see 'When does RL actually extend reasoning beyond pretraining?'). In other words, 'edge of competence' isn't a slogan; it's the regime where the headroom exists for RL to do more than rehearse.
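A minimal sketch of what targeting the boundary could look like in a data pipeline: estimate each task's pass rate from a handful of rollouts, then keep only the tasks the model solves sometimes but not reliably. The `TaskStats` type, the `select_boundary_tasks` helper, and the 0.05 and 0.5 thresholds are all hypothetical choices for illustration; the study summarized above is not the source of any of them.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class TaskStats:
    """Rollout outcomes for one task under the current model (hypothetical)."""
    task_id: str
    attempts: int
    successes: int

    @property
    def pass_rate(self) -> float:
        return self.successes / self.attempts if self.attempts else 0.0


def select_boundary_tasks(
    stats: Iterable[TaskStats], low: float = 0.05, high: float = 0.5
) -> List[str]:
    """Keep tasks whose pass rate falls in [low, high].

    Tasks above `high` are already mastered; tasks below `low` give almost
    no reward signal. Both thresholds are assumed values, not from the source.
    """
    return [s.task_id for s in stats if low <= s.pass_rate <= high]


pool = [
    TaskStats("easy-add", attempts=32, successes=31),
    TaskStats("two-step", attempts=32, successes=9),
    TaskStats("deep-compose", attempts=32, successes=2),
    TaskStats("out-of-reach", attempts=32, successes=0),
]
print(select_boundary_tasks(pool))  # ['two-step', 'deep-compose']
```

Because the model's pass rates change during training, a filter like this would presumably need to be re-run periodically rather than applied once.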

The interesting twist is that the same logic suggests you can move the boundary itself by reasoning *earlier*. Instead of treating pretraining as pure capability-stuffing and RL as the reasoning phase, a cluster of work plants reasoning into pretraining: treating chain-of-thought as an exploratory action rewarded by information gain lifts benchmarks ~19% (see 'Can chain-of-thought reasoning be learned during pretraining itself?'), while reframing next-token prediction itself as a verifiable reasoning task strengthens downstream RL (see 'Can next-token prediction become a reasoning task with RL?'). If reasoning primitives are richer going in, there's more headroom for edge-of-competence RL to exploit afterward.
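The information-gain idea can be sketched as a per-token reward: how much a sampled thought raises the log-likelihood of the observed next token relative to predicting without it. This is a simplified reading of the description above, not the published method; the function name and the probabilities are hypothetical, and how the no-thought baseline is obtained is left open.

```python
import math
from typing import List, Sequence


def information_gain_rewards(
    logp_with_thought: Sequence[float], logp_baseline: Sequence[float]
) -> List[float]:
    """Per-token reward: log p(token | context, thought) - log p(token | context).

    Positive values mean the thought made the observed token more likely.
    """
    if len(logp_with_thought) != len(logp_baseline):
        raise ValueError("both sequences must cover the same tokens")
    return [a - b for a, b in zip(logp_with_thought, logp_baseline)]


# Invented probabilities of the true next token at three positions.
p_with_thought = [0.60, 0.30, 0.05]
p_baseline = [0.20, 0.30, 0.10]

rewards = information_gain_rewards(
    [math.log(p) for p in p_with_thought],
    [math.log(p) for p in p_baseline],
)
print([round(r, 3) for r in rewards])  # [1.099, 0.0, -0.693]
```

No external verifier appears anywhere in the computation: the corpus token itself supplies the signal, which is what makes a reward of this shape usable during pretraining.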

There's also a sequencing answer that complements the boundary-targeting one. Doing imitation first to build reasonable rollouts, *then* RLVR to sharpen against rewards, beats either alone, because the imitation phase makes outcome rewards informative, effectively manufacturing the headroom and competence edge that RL needs to bite on (see 'Does sequencing imitation then exploration training improve reasoning?'). And RL's own internal dynamics mirror the edge idea: training moves through a two-phase arc, first consolidating execution correctness, then shifting the bottleneck to strategic planning, so the 'edge' the model is pushed against literally migrates over training (see 'Does RL training follow a predictable two-phase learning sequence?').

What you didn't ask but might want: RL doesn't only relocate capability, it can change reasoning *quality*. The same extended-thinking machinery that induces counterproductive self-doubt in a vanilla model gets redirected by RL into productive gap analysis: training mediates how reasoning is used, not just how much (see 'Does extended thinking help or hurt model reasoning?'). And if you care about teaching agents to reason better rather than merely succeed, process rewards for metacognition (planning, reflection, monitoring) cut wasted actions while generalizing better than outcome-only RL (see 'Can RL agents learn to reason better, not just succeed?'). The throughline: 'true reasoning gains' are real but conditional. They live at the boundary, and whether you hit that boundary depends on what pretraining left behind and how you sequence what comes after.


Sources (10 notes)

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

When does RL actually extend reasoning beyond pretraining?

A controlled synthetic framework shows RL produces true capability gains only when pretraining established reasoning primitives and RL data targets tasks at the boundary of the model's competence. Without these conditions, RL refines sampling rather than extending capability.

Can chain-of-thought reasoning be learned during pretraining itself?

RLP treats CoT as exploratory action during pretraining, using log-likelihood improvement as verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.

Can next-token prediction become a reasoning task with RL?

Reinforcement Pre-Training transforms next-token prediction into a reasoning task by providing verifiable rewards from the corpus itself, eliminating reward hacking and enabling inference-time scaling during pretraining. This suggests token-level reasoning patterns during pretraining strengthen downstream RL fine-tuning.

Does sequencing imitation then exploration training improve reasoning?

Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Does extended thinking help or hurt model reasoning?

Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.

Can RL agents learn to reason better, not just succeed?

RLVMR uses structured meta-reasoning tags (planning, exploration, reflection, monitoring) with programmatic rewards to train agentic RL. This reduces repetitive actions by 31% compared to outcome-only methods while maintaining better generalization than supervised fine-tuning alone.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an RL-in-LLMs researcher. The question remains: does targeting the edge of competence during RL pretraining unlock *true* reasoning gains, or only surface existing capability?

What a curated library found — and when (dated claims, not current truth):
Findings span Feb 2024–Dec 2025. A library of ~12 recent papers converges on these constraints:
• Base models hold latent reasoning; RL mostly *schedules* rather than *teaches* it; hybrid routing recovers 91% of gains by timing alone (~2025).
• Genuine capability gains appear only when: pretraining seeded reasoning primitives AND RL data target boundary tasks; strip either, RL reverts to sampling refinement (~2025).
• Imitation-first, then RLVR, outperforms either alone by manufacturing the competence edge RL exploits (~2025).
• RL training exhibits two-phase dynamics: procedural consolidation, then strategic planning shift (~2025).
• Process rewards for metacognition generalize better than outcome-only RL in agentic reasoning (~2025).

Anchor papers (verify; mind their dates):
– arXiv:2505.24864 (ProRL, May 2025): prolonged RL expands reasoning boundaries.
– arXiv:2505.21493 (RGR, May 2025): reasoning without verifiers.
– arXiv:2507.14843 (The Invisible Leash, Jul 2025): whether RLVR can escape the base model's constraints.
– arXiv:2512.07783 (Pre/Mid/RL interplay, Dec 2025): pretraining–RL sequencing.

Your task:
(1) RE-TEST THE BOUNDARY CONSTRAINT. Does the 'edge of competence' thesis still hold under newest models (o3, Gemini 3, Claude 4.x)? Has scaling, better pretraining, or new RL methods (DPO variants, outcome-weighted imitation, multi-reward fusion) already moved or dissolved the boundary? Separate: is the underlying question — whether RL creates or elicits — still open? Flag which constraints appear overturned and what resolved them.
(2) Surface the strongest work from the last 6 months that contradicts or supersedes the library's claim that RL is *mostly* scheduling, not creation.
(3) Propose 2 new research questions that assume the regime has shifted — e.g., if edge-of-competence is now a weak signal, what replaces it as the RL design principle?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
