Does the choice of RL algorithm actually matter for reasoning?

Expert Iteration, PPO, and RC-RL show similar performance on reasoning tasks. The question is whether algorithm choice drives results or whether something deeper—like the pretrained model itself—sets the real limits.

Synthesis note · 2026-02-22 · sourced from Reasoning by Reflection

"Teaching Large Language Models to Reason with RL" tests Expert Iteration, PPO, and Return-Conditioned RL across multiple model sizes and initialization conditions with both sparse and dense rewards. Result: performance differences across algorithms are small and convergence behavior is similar. More strikingly, RL training does not improve pass@n scores beyond what light supervised fine-tuning achieves with the same rollout budget.
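Because the comparison above rests on pass@n under a fixed rollout budget, a minimal sketch of the widely used unbiased pass@k estimator may help make the metric concrete. Whether the paper computes its pass@n scores exactly this way is an assumption; the function name and arguments are illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k samples is correct,
    given n sampled rollouts of which c were correct (assumed setup)."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k incorrect rollouts: every size-k subset contains a correct one.
    if n - c < k:
        return 1.0
    # 1 minus the fraction of size-k subsets made up only of incorrect rollouts.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Under this metric, two training procedures given the same rollout budget are compared on how often any of their samples solves the problem, which is why it measures coverage of the solution distribution rather than greedy accuracy.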

The mechanism: LLMs require a pretrained prior to navigate the high-dimensional text action space — without it, exploration would be computationally impossible. But this prior simultaneously constrains what gets explored. The model generates variations on what it already knows rather than discovering genuinely novel solutions. Regardless of which RL algorithm manages the update step, the same pretrained exploration prior shapes the solution distribution at convergence.

Additional SFT before RL makes this worse. More SFT concentrates the prior distribution further: the model converges faster on familiar patterns, so RL exploration from that point is more constrained, not less. The chain: more SFT → tighter prior → smaller effective exploration space → RL finds less.
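The narrowing chain can be illustrated with a toy calculation. This is a sketch under strong assumptions, not the paper's method: the prior is a small categorical distribution over hypothetical candidate solutions, and "more SFT" is modelled as raising each probability to a power `beta` and renormalizing, which is only a stand-in for how fine-tuning concentrates mass.

```python
import math


def sharpen(probs, beta):
    """Raise each probability to the power beta and renormalize.
    beta > 1 concentrates mass on already-likely solutions (assumed SFT proxy)."""
    powered = [p ** beta for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def expected_distinct(probs, budget):
    """Expected number of distinct solutions seen in `budget` independent rollouts."""
    return sum(1 - (1 - p) ** budget for p in probs)


# Hypothetical prior over five candidate solution strategies.
prior = [0.40, 0.30, 0.15, 0.10, 0.05]

for beta in (1.0, 2.0, 4.0):
    q = sharpen(prior, beta)
    print(f"beta={beta}: entropy={entropy(q):.3f}, "
          f"distinct in 16 rollouts={expected_distinct(q, 16):.2f}")
```

In this toy setting, both the entropy and the expected number of distinct solutions reached within a fixed rollout budget fall as `beta` grows, which is the "tighter prior → smaller effective exploration space" step in miniature.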

This reframes what RL training does in practice: it is primarily a selection mechanism, not a discovery mechanism. RL identifies which solutions already present in the pretrained distribution deserve reward. It rarely discovers solutions outside that distribution. The pretrained model contains most of what RL training will eventually "find."
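The selection-versus-discovery distinction can be sketched with a tabular caricature of Expert Iteration. Everything here is an assumption made for illustration: the solution names, the prior probabilities, and the refit rule (replace the policy with the empirical frequencies of rewarded samples) are hypothetical and far simpler than training an LLM.

```python
import random


def expert_iteration(prior, rewarded, rounds=5, rollouts=200, seed=0):
    """Toy Expert Iteration: sample from the policy, keep rewarded samples,
    and refit the policy to the kept samples' empirical frequencies."""
    rng = random.Random(seed)
    policy = dict(prior)
    for _ in range(rounds):
        names = list(policy)
        weights = [policy[s] for s in names]
        samples = rng.choices(names, weights=weights, k=rollouts)
        kept = [s for s in samples if s in rewarded]
        if not kept:
            continue  # no rewarded rollout this round, so no update
        policy = {s: kept.count(s) / len(kept) for s in names}
    return policy


# Hypothetical setup: C is a rare but correct solution inside the prior's support;
# D is also correct but has zero prior probability, so it is never sampled.
prior = {"A": 0.70, "B": 0.25, "C": 0.05, "D": 0.0}
rewarded = {"C", "D"}

final = expert_iteration(prior, rewarded)
print(final)
```

In this sketch the loop can only move mass toward C, which the prior already proposes. D is rewarded in principle but never appears in a rollout, so no update rule that learns from sampled rollouts can reinforce it.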

Connects to "Does policy entropy collapse limit reasoning performance in RL?": this paper provides algorithm-invariance evidence that entropy, not the optimizer, is the fundamental constraint. Connects to "Do base models already contain hidden reasoning ability?": if RL is unlocking pre-existing capability rather than building new capability, the algorithm doing the unlocking is interchangeable.

Reweave 2026-05-18 — interchangeability now visible at three levels, not one. When this note was written, the claim was about algorithm interchangeability — PPO, Expert Iteration, RC-RL produce similar results because the prior dominates. Late-2025 evidence shows the same interchangeability holds at two additional levels:

Algorithm level (original claim): PPO ≈ Expert Iteration ≈ RC-RL.
Algorithmic refinement level ("Can two simple techniques match complex RL algorithms?"): vanilla PPO plus two techniques matches GRPO and DAPO. The "zoo of algorithms" (GRPO, DAPO, GPPO, GFPO) collapses to two load-bearing techniques.
Reward-signal source level ("Can language models replace reward models with internal signals?"): the source of the reward signal is also substitutable. SERL self-judgment, ΔBelief-RL internal signal, SDPO rich-feedback distillation, POLAR similarity-to-target, RARO adversarial IRL, and VeriFree reference-likelihood all achieve similar gains.

The meta-claim sharpens: what is interchangeable in RL-for-reasoning is the entire optimization machinery (algorithm choice, algorithmic refinements, and reward-signal source). The non-interchangeable variable is the pretrained prior. This is consistent with the "RL as catalyst, not teacher" framing in "Why do random rewards improve reasoning for some models but not others?": when the prior contains the structure, almost any optimization pressure surfaces it.

The implication is structural rather than tactical: research effort on RL algorithm/refinement/reward-signal innovation has diminishing returns relative to effort on what gets baked into pretraining. The pretrained model contains most of what any RL pipeline will eventually find.

Inquiring lines that read this note 5

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

How do policy learning algorithm choices affect multi-objective optimization stability?

Does reinforcement learning teach reasoning or just when to reason?

Does RL primarily teach when to use reasoning or how to reason?

What constrains reinforcement learning's ability to expand model reasoning?

Can combining SRL with RLVR outperform either method used alone?

Related concepts in this collection 5

Does policy entropy collapse limit reasoning performance in RL? As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
algorithm-invariance finding supports that entropy is the binding constraint, not which optimizer is used
Do base models already contain hidden reasoning ability? Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
if capability is pre-existing, the mechanism for unlocking it is less important than the prior it unlocks from
Does reinforcement learning squeeze exploration diversity in search agents? Investigates whether RL training narrows the behavioral diversity of search agents the same way it does in reasoning tasks. Understanding this mechanism could reveal whether entropy collapse is fundamental to RL or domain-specific.
extends: entropy collapse as architectural property confirmed in search domain; RL algorithm interchangeability in reasoning and RL collapse in search are two expressions of the same prior-bounded exploration ceiling
Can simple rewards alone teach complex domain reasoning? Does reinforcement learning on difficult problems with basic accuracy rewards produce sophisticated reasoning strategies without explicit chain-of-thought training? This challenges assumptions about what domain AI models need to learn effectively.
tension: emergence framing suggests RL generates genuinely novel capabilities; algorithm interchangeability suggests RL primarily selects from what the pretrained prior already contains — the two accounts apply at different scales of capability
Does RL training follow predictable scaling curves? Can we forecast where RL training will plateau before committing full compute? ScaleRL tests whether sigmoid curves reliably predict performance ceilings across 200+ models.
refines the interchangeability claim: algorithm choice is interchangeable within a recipe, but recipe-level choices (data, reward structure, training configuration) set different asymptotic ceilings; ScaleRL provides the empirical scaling framework that contextualizes algorithm-level findings

Original note title

rl for reasoning algorithm choice is interchangeable because the exploration ceiling is set by the pretrained prior not the algorithm
