Why do reasoning gains from RL require models trained with headroom and edge-of-competence data?
This explores why reinforcement learning delivers reasoning gains only when the base model still has untapped capacity ("headroom") and is trained on problems sitting right at the boundary of what it can already do. The corpus has a clear mechanistic answer.
The answer falls out of what RL actually does to a model. A growing body of work argues that RL post-training mostly *selects and sharpens* reasoning that already exists in the base model rather than creating new capability. Base models carry latent reasoning that five independent mechanisms can elicit with minimal training (see "Do base models already contain hidden reasoning ability?"), and RL's contribution is better described as teaching *when* to reason, not *how* (see "Does RL post-training create reasoning or just deploy it?"). If RL is fundamentally a deployment optimizer, it can pay off only where there is latent competence to deploy. That is the headroom requirement.
The sharpest evidence is the pass@k crossover: models trained with RLVR (RL with verifiable rewards) beat the base model at low k, but the base model catches up or wins at high k. RL has narrowed sampling toward solutions already in the base distribution rather than expanding the set of solvable problems (see "Does RLVR actually expand what models can reason about?"). This is why edge-of-competence data matters: RL concentrates probability on answers the model can *already* occasionally find. On problems it never solves, there is no correct rollout to reward and nothing to sharpen; on problems it always solves, every rollout earns the same reward, so there is no gradient. The signal lives in the band where the model succeeds sometimes: the edge.
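The crossover can be illustrated with the standard unbiased pass@k estimator, which for n samples containing c correct ones is 1 - C(n-c, k) / C(n, k). The sketch below is illustrative only: the per-problem correct counts are invented to show the shape of the effect, not taken from any of the cited notes, and the helper names are hypothetical.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if n - c < k:
        # Fewer than k incorrect samples: every size-k draw contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

# Hypothetical correct counts out of N = 100 samples on four problems.
N = 100
base_counts = [10, 10, 10, 2]  # diffuse: occasionally solves every problem
rl_counts = [90, 90, 90, 0]    # sharpened: reliable on three, never solves the fourth

for k in (1, 10, 100):
    base = mean_pass_at_k(base_counts, N, k)
    rl = mean_pass_at_k(rl_counts, N, k)
    print(f"k={k:>3}  base={base:.3f}  rl={rl:.3f}")
```

With these made-up counts the sharpened model wins at k=1 (0.675 against 0.08) and loses at k=100 (0.75 against 1.0), because the one problem it never samples correctly caps its score no matter how many attempts it gets.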
The dynamics work confirms the mechanism from another angle: a single training example can suffice to activate reasoning, and even *spurious* rewards work nearly as well as correct ones, but only for models with the right pretraining (see "What does reward learning actually do to model reasoning?"). That caveat is the whole story. The reward isn't installing skill; it's flipping a switch on pretrained strategies. No latent strategy, no switch to flip. Relatedly, RL tends to collapse onto a single dominant pretraining format within the first epoch while suppressing alternatives (see "Does RL training collapse format diversity in pretrained models?"), which is more evidence that it amplifies what's already there rather than synthesizing something new.
The interesting tension is that the boundary isn't fixed. Prolonged RL on *diverse, non-mathematical* tasks, with KL control and policy resetting, does outperform base models across all pass@k levels and discovers genuinely novel strategies, specifically on domains where base models lack established patterns (see "Can reinforcement learning discover reasoning strategies base models cannot?"). So "headroom" isn't only latent answers waiting to be sampled; it's also unexplored task territory where the base model has no entrenched format to collapse into, leaving room for RL to build rather than just select. The edge-of-competence requirement holds either way; what changes is whether the edge is a sampling frontier or a genuine capability frontier.
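Why the edge-of-competence requirement holds in both regimes can be seen in how a 0/1 reward becomes a learning signal. The sketch below assumes a group-relative advantage (each rollout's reward centered and scaled within a group of rollouts for the same prompt, in the style of GRPO); the notes do not say which RL algorithm was used, so treat this as one plausible instance rather than the method behind the cited results. The function names and rollout groups are hypothetical.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-8):
    """Center and scale rewards within one group of rollouts for a single prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def reward_variance(solve_rate: float) -> float:
    """Variance of a Bernoulli (0/1) reward at a given solve rate."""
    return solve_rate * (1.0 - solve_rate)

# Hypothetical groups of eight rollouts with binary correctness rewards.
never_solved = [0, 0, 0, 0, 0, 0, 0, 0]
always_solved = [1, 1, 1, 1, 1, 1, 1, 1]
edge_problem = [1, 0, 0, 1, 0, 0, 0, 0]

for name, group in [("never", never_solved),
                    ("always", always_solved),
                    ("edge", edge_problem)]:
    adv = group_advantages(group)
    has_signal = any(a != 0 for a in adv)
    print(f"{name:>6}: solve rate={mean(group):.2f}  signal={has_signal}")
```

Under this assumption the never-solved and always-solved groups produce all-zero advantages, so the policy update has nothing to push on. Only the mixed group carries signal, and the reward variance p(1 - p) is largest when the solve rate is near one half.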
If you want to go deeper on how reward design interacts with this, note that binary correctness rewards quietly degrade calibration by rewarding confident guessing (see "Does binary reward training hurt model calibration?"), while using the model's own confidence as the reward signal can improve reasoning and restore calibration at once (see "Can model confidence work as a reward signal for reasoning?"). Both reinforce the theme that what RL extracts depends heavily on what the model already knows about its own competence.
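The calibration point can be made concrete with a small sketch. It assumes the Brier term enters as a squared-error penalty on the model's stated confidence, subtracted from the correctness reward with unit weight; the source note only says the Brier score is added as a second reward term, so the exact form and weighting here are assumptions, and the function names are hypothetical.

```python
def binary_reward(correct: bool) -> float:
    """Correctness-only reward: ignores stated confidence entirely."""
    return 1.0 if correct else 0.0

def calibrated_reward(correct: bool, confidence: float) -> float:
    """Correctness minus a Brier penalty on the stated confidence (assumed form)."""
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2

def expected_calibrated_reward(p_correct: float, confidence: float) -> float:
    """Expected reward when the answer is right with probability p_correct."""
    return (p_correct * calibrated_reward(True, confidence)
            + (1.0 - p_correct) * calibrated_reward(False, confidence))

def best_confidence(p_correct: float, steps: int = 100) -> float:
    """Grid-search the stated confidence that maximizes expected reward."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda q: expected_calibrated_reward(p_correct, q))

# A model that is right 30% of the time on some class of problems.
print(best_confidence(0.3))
print(binary_reward(False), calibrated_reward(False, 1.0))
```

Under the binary reward, a confident wrong answer costs nothing beyond the missed point, so there is no pressure against overclaiming. With the squared-error penalty, expected reward peaks when stated confidence equals the true solve rate, because the Brier score is a proper scoring rule.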
Sources (8 notes)
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.
Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.
Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
RL-trained models outperform base models across all pass@k levels when trained with KL control, policy resetting, and non-mathematical tasks. This shows RL can expand capability boundaries, not just optimize sampling efficiency, especially on domains where base models lack established patterns.
Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.