INQUIRING LINE

What behavioral changes occur during reward learning training?

This explores what actually shifts inside a model when you train it with rewards (RL/RLVR) — does it learn genuinely new behaviors, or rearrange ones it already has?


The corpus's most striking answer is that reward learning often surfaces behavior the model already had rather than teaching it anything new. Several notes converge here: verifiable rewards act as catalysts that activate latent pretraining strategies instead of building fresh reasoning, to the point where a single training example can trigger the shift and even spurious, randomly assigned rewards work nearly as well as correct ones (What does reward learning actually do to model reasoning?; How does RL training reshape reasoning and what gets lost?). The catch is that this only works for models whose pretraining planted the right seeds: Qwen2.5-Math gains 16–25% on MATH-500 from random rewards by waking up latent code-reasoning, while Llama and OLMo, lacking that pretraining format, gain nothing (Why do random rewards improve reasoning for some models but not others?).

Underneath the behavior, the parameter changes are surprisingly disciplined. RL touches only 5–30% of parameters, yet those sparse updates are nearly full-rank and nearly identical across random seeds, meaning the model isn't randomly perturbed but selectively re-tuned in a structured subnetwork (Does reinforcement learning update only a small fraction of parameters?). And the changes unfold in a predictable order: across eight models, training first sharpens execution correctness, then shifts to strategic planning as the bottleneck, with planning-token entropy rising while execution entropy settles (Does RL training follow a predictable two-phase learning sequence?). So 'behavioral change' isn't one event: procedural mastery consolidates first, and exploration of strategy follows.
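To make "touches only 5–30% of parameters" concrete, here is a minimal sketch of how one might measure update sparsity and update rank between a base checkpoint and its RL-tuned counterpart. It is an illustration, not the method used in the cited note: the function names `update_fraction` and `update_rank`, the `tol` threshold, and the toy weights are all assumptions, and real checkpoints would be loaded from disk rather than generated.

```python
import numpy as np

def update_fraction(base, tuned, tol=0.0):
    """Fraction of parameters whose value moved by more than `tol`."""
    changed, total = 0, 0
    for name, w_base in base.items():
        delta = np.abs(tuned[name] - w_base)
        changed += int(np.count_nonzero(delta > tol))
        total += delta.size
    return changed / total

def update_rank(base, tuned, name):
    """Numerical rank of the update matrix for one 2-D weight."""
    return int(np.linalg.matrix_rank(tuned[name] - base[name]))

# Toy stand-ins for two checkpoints (hypothetical data, not real model weights).
rng = np.random.default_rng(0)
w_base = rng.normal(size=(64, 64)).astype(np.float32)
mask = rng.random((64, 64)) < 0.2   # pretend RL touched about 20% of entries
w_tuned = w_base.copy()
w_tuned[mask] += np.float32(0.01)

base = {"layer.weight": w_base}
tuned = {"layer.weight": w_tuned}

frac = update_fraction(base, tuned)
rank = update_rank(base, tuned, "layer.weight")
print(f"updated fraction: {frac:.3f}, update rank: {rank} of 64")
```

The two numbers answer different questions: the fraction says how many entries moved, while the rank says how many independent directions the update spans. An update can be sparse in the first sense and still close to full rank in the second, which is the combination the note describes.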

But 'activation, not creation' isn't the whole story. The boundary is conditional: for standard reasoning tasks RL just activates what's latent, but for complex multi-step planning it can generate genuinely novel strategies that base models can't reach even with heavy sampling (Does reinforcement learning create new reasoning abilities or activate existing ones?), and in domains like medical reasoning sophisticated behavior emerges from nothing more than simple accuracy rewards (Can simple rewards alone teach complex domain reasoning?).

The corpus also flags behavioral changes you might not want. Binary correctness rewards quietly teach the model to guess confidently, because they never punish confident wrong answers. Calibration degrades until you add a proper scoring rule like the Brier score to pull accuracy and honesty back together (Does binary reward training hurt model calibration?). And what you reward shapes diversity: positive-only reinforcement concentrates probability mass and hurts Pass@k at higher k, while negative reinforcement alone, which just suppresses wrong trajectories, preserves diversity and can match full PPO/GRPO (Does negative reinforcement alone outperform full reinforcement learning?).
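The calibration point can be sketched in a few lines. The example assumes the model emits a verbalized confidence in [0, 1] alongside its answer and that the Brier term is subtracted from the correctness reward; the exact reward shape used in the cited note may differ, and the function names here are hypothetical.

```python
def binary_reward(correct: bool) -> float:
    """Plain correctness reward: confidence never enters."""
    return 1.0 if correct else 0.0

def brier_augmented_reward(correct: bool, confidence: float) -> float:
    """Correctness minus the Brier penalty (confidence - outcome)^2."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2

def expected_reward(p_correct: float, confidence: float) -> float:
    """Expected Brier-augmented reward when the answer is right with prob p_correct."""
    return (p_correct * brier_augmented_reward(True, confidence)
            + (1.0 - p_correct) * brier_augmented_reward(False, confidence))

def best_confidence(p_correct: float, steps: int = 100) -> float:
    """Grid-search the stated confidence that maximizes expected reward."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda q: expected_reward(p_correct, q))

# Under the binary reward a wrong answer scores 0 at any confidence.
# With the Brier term, a confident miss costs far more than a hedged one.
print(brier_augmented_reward(False, 0.95))  # confident and wrong
print(brier_augmented_reward(False, 0.20))  # hedged and wrong
print(best_confidence(0.7))                 # reporting the true accuracy is optimal
```

Because the Brier score is a proper scoring rule, the expected reward peaks when the stated confidence equals the true probability of being right, so the model has no incentive to inflate confidence.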

The most useful reframing here: the behavioral change during reward learning depends as much on how you treat successes versus failures as on the reward magnitude itself. Treating successful episodes as concrete demonstrations and failures as abstracted lessons beats uniform processing (Should successful and failed episodes be processed differently?), and richer feedback can turn the model into its own teacher, using in-context evidence of its mistakes to generate dense credit signals without any external reward model at all (Can environment feedback replace scalar rewards in policy learning?). Reward learning, read across these notes, looks less like installing new skills and more like selectively amplifying, suppressing, and re-sequencing behavior the model already carries.


Sources (11 notes)

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

How does RL training reshape reasoning and what gets lost?

Research shows that verifiable rewards act as catalysts that surface existing capabilities from pretraining, not teachers that build new reasoning. RL updates are structurally sparse and bounded by the pretrained prior, not algorithmic sophistication.

Why do random rewards improve reasoning for some models but not others?

Qwen2.5-Math gains 16–25% on MATH-500 from random or incorrect rewards by activating latent code-reasoning behavior from pretraining, while Llama and OLMo show no gains. Pretraining format determines what optimization pressure can surface.

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL updates only 5–30% of parameters, an intrinsic sparsity that arises without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Does reinforcement learning create new reasoning abilities or activate existing ones?

For standard reasoning tasks, RL activates latent abilities already present in base models. For complex planning requiring multi-step coordination, RL generates genuinely novel strategies inaccessible to base models even with extensive sampling.

Can simple rewards alone teach complex domain reasoning?

Medical AI systems and o3 demonstrate that sophisticated domain reasoning emerges naturally from RL training on difficult problems with only basic accuracy signals, without requiring explicit chain-of-thought distillation from teacher models.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Does negative reinforcement alone outperform full reinforcement learning?

Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.
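For readers unfamiliar with the metric, Pass@k is commonly computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them are correct. The sketch below uses that standard estimator; the two "policies" and their counts are made-up numbers chosen only to illustrate why concentrating probability mass can help Pass@1 while hurting higher k.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate from n samples with c correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(results, k):
    """Average Pass@k over a list of (n, c) pairs, one pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)

# Hypothetical counts for two problems, 100 samples each.
concentrated = [(100, 100), (100, 0)]  # always right on one, never on the other
diverse = [(100, 45), (100, 45)]       # sometimes right on both

for k in (1, 10):
    print(k, mean_pass_at_k(concentrated, k), mean_pass_at_k(diverse, k))
```

In this toy setup the concentrated policy wins at k = 1 but stays flat as k grows, while the diverse policy approaches 1.0 by k = 10, which is the pattern the note attributes to positive-only versus negative-only reinforcement.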

Should successful and failed episodes be processed differently?

SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.

Can environment feedback replace scalar rewards in policy learning?

SDPO converts tokenized environment feedback into dense gradient signals by using the feedback-conditioned policy as a self-teacher. The policy, when given retrospective evidence of its mistakes in-context, implicitly acts as its own process reward model, making external reward signals unnecessary.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst examining whether reward learning in LLMs *creates* new behavior or *activates* latent capability—a live question in RL training. A curated library (2024–2026) found striking tensions here; your job is to test whether they still hold.

What a curated library found — and when (dated claims, not current truth):
• Reward learning often activates latent pretraining strategies rather than teaching new ones; even random rewards trigger ~16–25% gains in math reasoning on Qwen2.5-Math, while models lacking that pretraining format gain nothing (2025–2026).
• RL updates only 5–30% of parameters, yet those sparse updates are nearly full-rank and nearly identical across random seeds, suggesting structured re-tuning, not random perturbation (2025).
• Behavioral change unfolds in two phases: procedural correctness consolidation first, then strategic planning second, with planning-token entropy rising while execution entropy settles (2025).
• The activation-vs-creation boundary is conditional: standard reasoning activates latent behavior, but complex multi-step planning can generate genuinely novel strategies; medical reasoning emerges from simple accuracy rewards alone (2025–2026).
• Binary correctness rewards degrade calibration by never punishing confident wrong answers; negative reinforcement alone (suppressing failures) preserves diversity and matches full PPO/GRPO (2025–2026).

Anchor papers (verify; mind their dates):
• arXiv:2505.11711 — Reinforcement Learning Finetunes Small Subnetworks (2025)
• arXiv:2506.01347 — The Surprising Effectiveness of Negative Reinforcement (2025)
• arXiv:2510.18176 — Local Coherence or Global Validity? Investigating RLVR Traces (2025)
• arXiv:2601.20802 — Reinforcement Learning via Self-Distillation (2026)

Your task:
(1) RE-TEST THE ACTIVATION-VS-CREATION TENSION. For each finding above, assess whether newer model scales, pretraining formats (e.g., post-2026 synthesis pretraining), or RL methods (e.g., outcome-supervised fine-tuning, multi-reward composition) have *moved the boundary*—i.e., made creation cheaper or activation less sufficient. Separate the durable question (when does RL create vs. activate?) from the perishable limitation (which architectures/pretrains enable activation?). Cite what shifted the regime.
(2) Surface the strongest *contradicting or superseding* work from the last ~6 months. Does any recent paper show reward learning creating behavior even on unprepared models, or evidence that the two-phase dynamic doesn't replicate on newest architectures?
(3) Propose 2 research questions that *assume the regime may have moved*: (a) one on how to *design pretraining* to make RL-creation possible rather than activation-dependent, and (b) one on whether *heterogeneous reward signals* can trigger creation across diverse pretraining regimes.

Cite arXiv IDs; flag anything you cannot ground in a real paper.
