INQUIRING LINE

If six competing AI training methods all hit the same score, maybe the algorithm doesn't matter — the base model already knew the answers.

Why do six different RLVR algorithms converge on similar performance levels?

This explores why a set of distinct RLVR algorithms (different optimizers, reward schemes, advantage functions) tend to land at roughly the same scores — and the corpus's answer is that they're all surfacing what the base model already contains, not building anything new.

This reads the question as: if six algorithms differ in their machinery, why doesn't that difference show up as different performance? The corpus points to a single uncomfortable answer: RLVR mostly redistributes probability mass over reasoning the base model could already produce, so the pretrained model, not the algorithm, sets the ceiling. The clearest statement of this is that RLVR improves *sampling efficiency*, not capability boundaries. At high pass@k the base model actually matches or beats its RLVR-tuned version, meaning the tuning just narrows sampling toward solutions already living in the base distribution (see "Does RLVR actually expand what models can reason about?"). If every algorithm is fishing in the same pond, they converge on the same catch.
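To make the pass@k argument concrete, here is a minimal sketch using the standard unbiased pass@k estimator (1 - C(n-c, k) / C(n, k) for n samples with c correct). The per-problem counts are hypothetical numbers chosen only to illustrate the shape of the effect; they are not taken from any of the papers, and the names `base_correct` and `rlvr_correct` are assumptions of this example.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without
    replacement from n generations of which c are correct, is correct."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill all k draws
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical correct-sample counts per problem, out of n = 100 samples each.
n = 100
base_correct = [2, 1, 3, 1, 2]      # base: rarely right, but nonzero on every problem
rlvr_correct = [60, 70, 80, 0, 0]   # tuned: usually right on three, never on two


def mean_pass_at_k(correct_counts, k):
    """Average pass@k over a list of per-problem correct counts."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


for k in (1, 8, 64):
    base = mean_pass_at_k(base_correct, k)
    tuned = mean_pass_at_k(rlvr_correct, k)
    print(f"k={k:>2}  base={base:.3f}  rlvr={tuned:.3f}")
```

In this toy setup the tuned model wins clearly at k=1 and the base model overtakes it at k=64, because the base model has nonzero success on every problem while the tuned model has concentrated its mass on three of them. That is the sense in which tuning can raise sampling efficiency without widening the set of solvable problems.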

The mechanism behind that convergence becomes concrete at the parameter level. Across seven RL algorithms and ten model families, RL updates only 5–30% of parameters, and strikingly, those sparse updates are nearly full-rank and *nearly identical across random seeds* (see "Does reinforcement learning update only a small fraction of parameters?"). That's structural, not arbitrary: the optimization keeps targeting the same subnetwork regardless of the dice roll. A complementary finding shows RL collapses onto a single dominant format that already existed in pretraining, amplifying one distribution and suppressing the rest within the first epoch (see "Does RL training collapse format diversity in pretrained models?"). Different algorithms, same attractor.
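One way to picture the two measurements behind the sparse-update claim (what fraction of parameters changed, and how much two runs agree on which ones) is the sketch below. It is an illustration under stated assumptions, not the papers' method: weights are flat Python lists, `simulate_run` is a hypothetical stand-in for RL fine-tuning that simply perturbs a fixed subset of indices, and the rank of the updates is not modelled at all.

```python
import random


def update_mask(before, after, tol=0.0):
    """True wherever a parameter moved by more than tol."""
    return [abs(a - b) > tol for b, a in zip(before, after)]


def update_fraction(mask):
    """Share of parameters that were updated."""
    return sum(mask) / len(mask)


def overlap(mask_a, mask_b):
    """Share of run A's updated parameters that run B also updated."""
    both = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    return both / max(1, sum(mask_a))


def simulate_run(base, subnetwork, seed, hit_rate=0.9):
    """Hypothetical stand-in for RL fine-tuning: each index in `subnetwork`
    is nudged with probability hit_rate; everything else is left untouched."""
    rng = random.Random(seed)
    tuned = list(base)
    for i in subnetwork:
        if rng.random() < hit_rate:
            tuned[i] += rng.uniform(0.01, 0.1)
    return tuned


base = [0.0] * 1000
subnetwork = range(200)  # assumed: 20% of parameters form the targeted subnetwork

run_a = simulate_run(base, subnetwork, seed=0)
run_b = simulate_run(base, subnetwork, seed=1)
mask_a = update_mask(base, run_a)
mask_b = update_mask(base, run_b)

print("fraction updated (seed 0):", update_fraction(mask_a))
print("overlap seed 0 vs seed 1:", round(overlap(mask_a, mask_b), 3))
```

With real checkpoints the same two functions would be applied to the flattened weights before and after training. A high overlap across seeds is what would distinguish a structurally determined subnetwork from one selected by chance.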

The spurious-reward result makes the point almost paradoxically: Qwen2.5-Math gains 16–25% on MATH-500 from *random or even incorrect* rewards, because the reward isn't teaching anything. It's activating latent code-reasoning behavior baked in during pretraining, and Llama and OLMo (lacking that pretraining) get nothing (see "Why do random rewards improve reasoning for some models but not others?"). When the reward signal can be near-noise and still work, the algorithm's design clearly isn't the load-bearing variable. The pretraining distribution is.

There's a sharper edge here worth knowing: some of that convergent 'improvement' may not be reasoning at all. RLVR raises trace *coherence* (fewer logical breaks between adjacent steps) without guaranteeing the proof is globally valid (see "Does RLVR actually improve mathematical reasoning or just coherence?"), and a chunk of the headline benchmark gains turns out to be memorization on contaminated datasets rather than genuine reasoning: on clean benchmarks, only correct rewards improve performance, while random and inverse rewards fail or degrade it (see "Does RLVR success on math benchmarks reflect genuine reasoning improvement?"). If algorithms converge partly because they're all landing on the same memorized or merely coherent surface, the plateau is even less about the algorithm.

The interesting corollary, the thing you might not have known you wanted, is what *does* break the plateau. The methods that escape it don't tweak the RL objective; they change what's in the pond. Distillation genuinely transfers new reasoning patterns the base model lacked (see "Does RLVR actually expand what models can reason about?"); running supervised imitation *first* to seed new rollouts and then sharpening with RLVR beats either alone (see "Does sequencing imitation then exploration training improve reasoning?"); and injecting external data with exploration-rewarding advantage functions counteracts the 'capability boundary collapse' that on-policy RLVR otherwise causes (see "Why does RLVR training narrow a model's problem solving ability?"). The pattern across all of them: you move the ceiling by adding new material, not by redesigning the reward. Six algorithms with no new material converge because there's nothing new to converge toward.

Sources (8 notes)

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL updates only 5–30% of parameters, a sparsity that emerges without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Why do random rewards improve reasoning for some models but not others?

Qwen2.5-Math improves by 16–25% on MATH-500 when trained with random or incorrect rewards, which activate latent code-reasoning behavior from pretraining, while Llama and OLMo show no gains. Pretraining format determines what optimization pressure can surface.

Does RLVR actually improve mathematical reasoning or just coherence?

RLVR post-training measurably reduces logical errors between adjacent reasoning steps, but locally coherent traces can still be globally invalid proofs. The improvement is structural rather than semantic.

Does RLVR success on math benchmarks reflect genuine reasoning improvement?

Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 from partial prompts but scores 0.0% on post-release LiveMathBench, revealing dataset contamination. On clean benchmarks, only correct rewards improve performance; random and inverse rewards fail or degrade reasoning ability.

Does sequencing imitation then exploration training improve reasoning?

Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.

Why does RLVR training narrow a model's problem solving ability?

RLVR narrows models' problem-solving scope by prioritizing exploitation over exploration, a phenomenon called capability boundary collapse. Multiple importance sampling with exploration-based advantage functions can counteract this by integrating external data and explicitly rewarding discovery of underexplored but valuable reasoning paths.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

The Invisible Leash: Why RLVR May Not Escape Its Origin (5.12 match)
Spurious Rewards: Rethinking Training Signals in RLVR (4.30 match)
Reinforcement Learning for Reasoning in Large Language Models with One Training Example (4.12 match)
Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains (3.46 match)
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? (3.40 match)
Escaping the Verifier: Learning to Reason via Demonstrations (2.51 match)
Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs (2.46 match)
RL Squeezes, SFT Expands: A Comparative Study of Reasoning LLMs (1.73 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM researcher evaluating whether algorithm-level convergence in RLVR remains a constraint or has been dissolved by newer methods, training regimes, or eval practices. The question: Why do structurally different RLVR algorithms produce similar performance?

What a curated library found — and when (dated claims, not current truth):
Findings span Feb 2024–May 2026. The library's core claims:
- RLVR redistributes probability mass within the base model's existing distribution rather than expanding capability boundaries; at high pass@k the base model matches or beats RLVR-tuned versions (arXiv:2504.13837, ~2025).
- RL updates only 5–30% of parameters in near-identical sparse subnetworks across random seeds and seven algorithms, suggesting structure, not algorithm choice, drives convergence (arXiv:2505.11711, ~2025).
- RL post-training collapses onto a single dominant pretraining distribution format within one epoch, amplifying one format and suppressing others (arXiv:2504.07912, ~2025).
- Spurious/random rewards still improve Qwen2.5-Math by 16–25% on MATH-500, indicating the algorithm is not load-bearing; pretraining distribution is (arXiv:2506.10947, ~2025).
- RLVR raises local trace coherence without guaranteeing global validity, and a chunk of benchmark gains are memorization on contaminated datasets, not reasoning (arXiv:2510.18176, arXiv:2507.10532, ~2025–2026).

Anchor papers (verify; mind their dates):
- arXiv:2504.13837 (Does Reinforcement Learning Really Incentivize Reasoning?, Apr 2025)
- arXiv:2505.11711 (RL Finetunes Small Subnetworks, May 2025)
- arXiv:2504.07912 (Echo Chamber: RL Amplifies Pretraining, Apr 2025)
- arXiv:2508.00222 (RL-PLUS: Countering Capability Boundary Collapse, Aug 2025)

Your task:
(1) RE-TEST EACH CONSTRAINT. For every claim above—capability ceiling, sparse subnetwork targeting, distribution collapse, spurious-reward robustness, and memorization—judge whether recent scaling (larger models), new optimization (learned reward models, process reward models, multi-step value functions), tooling (constitutional AI, outcome + process rewards), or orchestration (expert mixture, curriculum learning, distillation pipelines) have since relaxed or overturned it. Separate the durable question (likely still open: "why do algorithms converge?") from perishable limitations (possibly solved: can you now escape the base distribution's ceiling? cite what solved it, plainly state where constraints still hold).
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months (Jan–Jun 2026). Look for papers claiming algorithms do differ meaningfully, or that new methods escape the distribution ceiling, or that memorization is not the primary driver.
(3) Propose 2 research questions that ASSUME the convergence regime may have shifted—e.g., "Does scaling to 10B+ parameters and learned reward models break the sparse-subnetwork targeting pattern?" or "Can hybrid imitation–RL curricula with external reasoning data escape the echo-chamber effect?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
