Does the choice of RL algorithm actually matter for reasoning?
Expert Iteration, PPO, and RC-RL show similar performance on reasoning tasks. The question is whether algorithm choice drives results or whether something deeper—like the pretrained model itself—sets the real limits.
"Teaching Large Language Models to Reason with RL" tests Expert Iteration, PPO, and Return-Conditioned RL across multiple model sizes and initialization conditions with both sparse and dense rewards. Result: performance differences across algorithms are small and convergence behavior is similar. More strikingly, RL training does not improve pass@n scores beyond what light supervised fine-tuning achieves with the same rollout budget.
The mechanism: LLMs require a pretrained prior to navigate the high-dimensional text action space — without it, exploration would be computationally impossible. But this prior simultaneously constrains what gets explored. The model generates variations on what it already knows rather than discovering genuinely novel solutions. Regardless of which RL algorithm manages the update step, the same pretrained exploration prior shapes the solution distribution at convergence.
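One way to check whether RL reshuffles what the model can already reach, rather than extending it, is to compare pass@1 with pass@n on the same problem. The sketch below uses the standard unbiased pass@k estimator from the code-generation literature; that the paper's pass@n is computed this way is an assumption, and the sample counts are invented for illustration only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples drawn, c: samples judged correct, k: attempt budget (k <= n).
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Fraction of size-k subsets of the n samples that contain no correct one.
    # math.comb returns 0 when k > n - c, which makes the estimate exactly 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Invented counts for a single problem, 96 samples per model (illustration only).
sft_correct, rl_correct, n_samples = 12, 30, 96
for k in (1, 8, 96):
    sft = pass_at_k(n_samples, sft_correct, k)
    rl = pass_at_k(n_samples, rl_correct, k)
    print(f"k={k:>2}  SFT pass@k={sft:.3f}  RL pass@k={rl:.3f}")
```

With these made-up counts the RL model is ahead at k=1, while both models reach 1.0 at k=96 because each has at least one correct sample among the 96. That is the shape of the finding: a gap at small k that closes as the rollout budget grows.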
Additional SFT before RL makes this worse. More SFT concentrates the prior distribution further: the model converges faster on familiar patterns, so RL exploration from that point is more constrained, not less. The result: more SFT → tighter prior → smaller effective exploration space → RL finds less.
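The chain "tighter prior → smaller effective exploration space" can be illustrated with a toy distribution over candidate solution strategies. This is a sketch under a loud assumption: lowering a softmax temperature stands in for additional SFT, which is a modelling convenience and not something the paper measures. The logits are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats, skipping zero-probability entries."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def effective_support(probs):
    """exp(entropy): how many equally likely options carry the same entropy."""
    return math.exp(entropy(probs))


# Hypothetical logits over five candidate solution strategies.
logits = [2.0, 1.0, 0.5, 0.0, -1.0]

# ASSUMPTION: a lower temperature models a prior concentrated by more SFT.
for t in (1.0, 0.5, 0.25):
    print(f"temperature={t}: effective support = "
          f"{effective_support(softmax(logits, t)):.2f}")
```

The effective support shrinks as the distribution sharpens, so a fixed rollout budget visits fewer distinct strategies. Any RL algorithm that samples from this distribution inherits the same narrowed pool.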
This reframes what RL training does in practice: it is primarily a selection mechanism, not a discovery mechanism. RL identifies which solutions already present in the pretrained distribution deserve reward. It rarely discovers solutions outside that distribution. The pretrained model contains most of what RL training will eventually "find."
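A toy version of the selection picture: one idealized Expert-Iteration-style step that keeps rewarded samples and renormalizes. The function name, the four solutions, and their probabilities are all hypothetical, and the step is computed in expectation (infinite samples). A real fine-tuned network can generalize somewhat beyond its sampled data, so this sketch captures only the selection argument, not the full training dynamics.

```python
def filter_and_renormalize(prior, reward):
    """Idealized selection step: new_policy(s) is proportional to prior(s) * reward(s)."""
    weighted = {s: p * reward.get(s, 0.0) for s, p in prior.items()}
    total = sum(weighted.values())
    if total == 0:
        # No rewarded solution has any prior mass, so there is nothing to select.
        return dict(prior)
    return {s: w / total for s, w in weighted.items()}


# Hypothetical prior over four solutions; D is correct but has zero prior mass.
prior = {"A": 0.6, "B": 0.3, "C": 0.1, "D": 0.0}
reward = {"A": 0.0, "B": 1.0, "C": 1.0, "D": 1.0}

policy = filter_and_renormalize(prior, reward)
for solution, prob in policy.items():
    print(solution, round(prob, 3))
```

Probability mass moves from the unrewarded solution A onto B and C, which the prior already produced. D is rewarded but never sampled, so it stays at zero: the step selects within the prior's support and cannot add to it.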
Connects to "Does policy entropy collapse limit reasoning performance in RL?": this paper provides algorithm-invariance evidence supporting the claim that entropy is the fundamental constraint. Connects to "Do base models already contain hidden reasoning ability?": if RL is unlocking pre-existing capability rather than building new capability, the algorithm doing the unlocking is interchangeable.
Reweave 2026-05-18 — interchangeability now visible at three levels, not one. When this note was written, the claim was about algorithm interchangeability — PPO, Expert Iteration, RC-RL produce similar results because the prior dominates. Late-2025 evidence shows the same interchangeability holds at two additional levels:
- Algorithm level (original claim): PPO ≈ Expert Iteration ≈ RC-RL.
- Algorithmic refinement level ("Can two simple techniques match complex RL algorithms?"): vanilla PPO plus two techniques matches GRPO and DAPO. The "zoo of algorithms" (GRPO, DAPO, GPPO, GFPO) collapses to two load-bearing techniques.
- Reward-signal source level ("Can language models replace reward models with internal signals?"): the source of the reward signal is also substitutable. SERL self-judgment, ΔBelief-RL internal signal, SDPO rich-feedback distillation, POLAR similarity-to-target, RARO adversarial IRL, and VeriFree reference-likelihood all achieve similar gains.
The meta-claim sharpens: what is interchangeable in RL-for-reasoning is the entire optimization machinery: algorithm choice, algorithmic refinements, AND reward-signal source. The non-interchangeable variable is the pretrained prior. This is consistent with the "RL as catalyst, not teacher" framing in "Why do random rewards improve reasoning for some models but not others?": when the prior contains the structure, almost any optimization pressure surfaces it.
The implication is structural rather than tactical: research effort on RL algorithm/refinement/reward-signal innovation has diminishing returns relative to effort on what gets baked into pretraining. The pretrained model contains most of what any RL pipeline will eventually find.
Inquiring lines that use this note as a source (5)
This note is a source for these synthesized inquiries.
- How does modified PPO handle samples from much older model versions?
- Can algorithm choice like PPO substitute for recipe-level design decisions?
- Does RL primarily teach when to use reasoning or how to reason?
- Can PPO match GRPO and DAPO with just two techniques?
- Can combining SRL with RLVR outperform either method used alone?
Related concepts in this collection (5)
- Does policy entropy collapse limit reasoning performance in RL?
As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
algorithm-invariance finding supports that entropy is the binding constraint, not which optimizer is used
- Do base models already contain hidden reasoning ability?
Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
if capability is pre-existing, the mechanism for unlocking it is less important than the prior it unlocks from
- Does reinforcement learning squeeze exploration diversity in search agents?
Investigates whether RL training narrows the behavioral diversity of search agents the same way it does in reasoning tasks. Understanding this mechanism could reveal whether entropy collapse is fundamental to RL or domain-specific.
extends: entropy collapse as architectural property confirmed in search domain; RL algorithm interchangeability in reasoning and RL collapse in search are two expressions of the same prior-bounded exploration ceiling
- Can simple rewards alone teach complex domain reasoning?
Does reinforcement learning on difficult problems with basic accuracy rewards produce sophisticated reasoning strategies without explicit chain-of-thought training? This challenges assumptions about what domain AI models need to learn effectively.
tension: emergence framing suggests RL generates genuinely novel capabilities; algorithm interchangeability suggests RL primarily selects from what the pretrained prior already contains — the two accounts apply at different scales of capability
- Does RL training follow predictable scaling curves?
Can we forecast where RL training will plateau before committing full compute? ScaleRL tests whether sigmoid curves reliably predict performance ceilings across 200+ models.
refines the interchangeability claim: algorithm choice is interchangeable within a recipe, but recipe-level choices (data, reward structure, training configuration) set different asymptotic ceilings; ScaleRL provides the empirical scaling framework that contextualizes algorithm-level findings
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Teaching Large Language Models to Reason with Reinforcement Learning
- A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well?
- LSR: Reinforcement Learning with Supervised Reward Outperforms SFT in Instruction Following
- Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
- ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models
- Bridging Offline and Online Reinforcement Learning for LLMs
- Learning to Reason for Factuality
- On-Policy RL Meets Off-Policy Experts: Harmonizing Supervised Fine-Tuning and Reinforcement Learning via Dynamic Weighting
Original note title
rl for reasoning algorithm choice is interchangeable because the exploration ceiling is set by the pretrained prior not the algorithm