Does reinforcement learning update only a small fraction of parameters?
Investigating whether RL algorithms consistently modify only 5–30% of model parameters across different LLMs and RL methods, and what structural properties those sparse updates possess.
The surprising finding is not that RL changes models but how little it changes them. Across PPO, GRPO, DPO, and four other algorithms applied to ten different LLM families, RL consistently updates only 5–30% of parameters; the rest remain effectively unchanged. This sparsity is intrinsic: no explicit sparsity-promoting regularization or architectural constraint is applied.
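To make the 5–30% figure concrete, here is a minimal sketch of how update sparsity could be measured between two checkpoints. The function name `update_sparsity`, the tolerance `tol=1e-5`, and the toy checkpoints are illustrative assumptions, not the measurement code behind the finding.

```python
import numpy as np

def update_sparsity(before, after, tol=1e-5):
    """Fraction of parameters whose value changed by more than `tol`.

    `before` and `after` map parameter names to arrays of identical shape.
    The tolerance is an assumed threshold for "effectively unchanged".
    """
    changed = 0
    total = 0
    for name, w0 in before.items():
        diff = np.abs(after[name] - w0)
        changed += int(np.count_nonzero(diff > tol))
        total += w0.size
    return changed / total

# Toy checkpoints (hypothetical): perturb exactly 20% of the entries.
rng = np.random.default_rng(0)
before = {"layer.weight": rng.normal(size=(50, 40))}
after = {"layer.weight": before["layer.weight"].copy()}
flat = after["layer.weight"].reshape(-1)  # view onto the copied array
idx = rng.choice(flat.size, size=flat.size // 5, replace=False)
flat[idx] += 0.01

fraction_updated = update_sparsity(before, after)
print(f"{fraction_updated:.2%} of parameters updated")
```

On real checkpoints the same loop would run over every parameter tensor of the model before and after RL; the choice of tolerance matters for low-precision weights.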
The critical nuance is that these sparse updates are nearly full-rank. This is not low-rank adaptation (as in LoRA): the update to each parameter matrix, although it touches few entries, spans almost the full subspace that the matrix can represent. So RL selects a small subset of parameters, but that subset is geometrically rich enough to represent complex transformations. The distinction matters: low-rank would mean RL operates in a constrained subspace; sparse-but-full-rank means RL identifies which parameters matter while preserving full expressivity.
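The difference between sparse and low-rank can be checked numerically. The sketch below is a toy illustration under assumed sizes (a 64×64 matrix, 20% density, LoRA-style rank 4), not data from any trained model: it builds one update with few nonzero entries and one dense update formed from two thin factors, then compares their ranks.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64

# Sparse update: roughly 20% of entries are nonzero, at random positions.
mask = rng.random((n, n)) < 0.20
sparse_update = np.where(mask, rng.normal(size=(n, n)), 0.0)

# Low-rank update (LoRA-style): every entry is nonzero, but the matrix
# is a product of two thin factors, so its rank is at most r.
r = 4
low_rank_update = rng.normal(size=(n, r)) @ rng.normal(size=(r, n))

sparse_density = np.count_nonzero(sparse_update) / sparse_update.size
sparse_rank = int(np.linalg.matrix_rank(sparse_update))
low_rank_rank = int(np.linalg.matrix_rank(low_rank_update))

print(f"sparse update:   density={sparse_density:.2f}, rank={sparse_rank} of {n}")
print(f"low-rank update: density=1.00, rank={low_rank_rank} of {n}")
```

The sparse matrix changes few entries yet its rank stays close to `n`, while the dense factorized matrix changes every entry but is confined to a rank-`r` subspace. Sparsity and rank are independent axes.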
Three additional properties make this pattern robust. First, subnetworks identified from different random seeds overlap substantially more than chance would predict, suggesting the subnetwork is a structural property of the pretrained model, not an artifact of a particular training run. Second, finetuning the subnetwork alone recovers both the test accuracy and nearly the same parameter values as full finetuning. Third, the sparsity is distributed: nearly all parameter matrices receive similarly sparse updates rather than the updates concentrating in a subset of layers.
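The "greater overlap than chance" comparison in the first property can be sketched as follows. The overlap definition used here (fraction of one run's updated parameters that the other run also updated) and the synthetic masks are assumptions for illustration; the original analysis may define overlap differently.

```python
import numpy as np

def overlap_vs_chance(mask_a, mask_b):
    """Return (observed, chance) overlap of subnetwork A with subnetwork B.

    observed: fraction of A's updated parameters that B also updated.
    chance:   expected value of that fraction if B's mask were placed
              independently of A's, which equals B's overall density.
    """
    observed = np.logical_and(mask_a, mask_b).sum() / mask_a.sum()
    chance = mask_b.mean()
    return float(observed), float(chance)

rng = np.random.default_rng(0)
n_params = 100_000

# Hypothetical masks: both runs share a common core of updated parameters
# and each adds its own seed-specific extras.
core = rng.random(n_params) < 0.15
mask_seed1 = core | (rng.random(n_params) < 0.05)
mask_seed2 = core | (rng.random(n_params) < 0.05)

observed, chance = overlap_vs_chance(mask_seed1, mask_seed2)
print(f"observed overlap {observed:.2f} vs chance {chance:.2f}")
```

If the two masks were independent, `observed` would sit near `chance`; a large gap is what points to structure inherited from the pretrained model.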
The authors conjecture that this sparsity arises primarily from training on data near the policy distribution. A related note, "Does RL improve domain reasoning by adding knowledge or removing it?", describes the functional effect of RL as pruning; the sparse-but-full-rank pattern supplies the matching parametric mechanism. RL does not need to transform the entire model because most of the model is already adequate. It only needs to adjust a targeted subset: the parameters that control which reasoning paths are taken.
This has implications for efficient RL training. If the effective parameter footprint is 5-30%, techniques that exploit this sparsity (targeted updates, efficient memory use) could dramatically reduce RL training cost without sacrificing quality.
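One way such a targeted update could look is a gradient step restricted to a fixed mask. This is a minimal sketch with a toy quadratic loss; `masked_sgd_step` and the mask are hypothetical, and a real implementation would also need to keep optimizer state only for the masked entries to realize the memory savings.

```python
import numpy as np

def masked_sgd_step(weights, grad, mask, lr=0.1):
    """Apply an SGD step only where `mask` is True; other entries are untouched."""
    return weights - lr * np.where(mask, grad, 0.0)

rng = np.random.default_rng(0)
w0 = rng.normal(size=1000)
target = w0 + rng.normal(size=1000)  # the loss pulls every weight toward target
mask = rng.random(1000) < 0.2        # but only ~20% of weights may move

w = w0.copy()
for _ in range(100):
    grad = 2.0 * (w - target)        # gradient of sum((w - target) ** 2)
    w = masked_sgd_step(w, grad, mask, lr=0.1)

frozen_unchanged = np.array_equal(w[~mask], w0[~mask])
trained_converged = np.allclose(w[mask], target[mask], atol=1e-6)
print(frozen_unchanged, trained_converged)
```

Weights outside the mask never move even though their gradients are nonzero, while weights inside the mask converge as they would under a full update.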
Token-level 80/20 parallel: The parameter-level sparsity has a striking token-level analog. The "Beyond 80/20" analysis of RLVR shows that high-entropy minority tokens (the ~20% of tokens where the model is most uncertain) are the critical forking points that carry most of the learning signal. Restricting gradient updates to only these 20% of tokens matches or exceeds full-token updates (+11.04 on AIME'25 for Qwen3-32B). The remaining 80% of tokens are low-entropy, already-decided outputs where gradient updates add noise rather than signal. This creates a dual sparsity picture: RL updates 5–30% of parameters, and the effective signal comes from ~20% of tokens. Both forms of sparsity are intrinsic rather than imposed by regularization, and both suggest that RL is fundamentally a targeted refinement process rather than a wholesale model transformation. See the related note "Do high-entropy tokens drive reasoning model improvements?"
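The token-selection step can be sketched as computing per-position entropy from the logits and keeping the top fraction. The function names, the per-batch thresholding, and the placeholder losses below are assumptions for illustration; the "Beyond 80/20" work may compute its threshold differently.

```python
import numpy as np

def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution at each position."""
    z = logits - logits.max(axis=-1, keepdims=True)  # stabilize the softmax
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -(np.exp(log_probs) * log_probs).sum(axis=-1)

def high_entropy_mask(logits, keep_fraction=0.2):
    """Boolean mask selecting the `keep_fraction` highest-entropy positions."""
    entropy = token_entropy(logits)
    k = max(1, int(round(keep_fraction * entropy.size)))
    threshold = np.sort(entropy)[-k]
    return entropy >= threshold

# Toy sequence: 10 positions over a 5-token vocabulary.
logits = np.zeros((10, 5))
logits[:, 0] = 10.0      # confident positions: token 0 dominates
logits[[3, 7], :] = 0.0  # positions 3 and 7 are uniform (maximally uncertain)

mask = high_entropy_mask(logits, keep_fraction=0.2)

# Only the selected positions contribute to the loss.
per_token_loss = np.arange(10, dtype=float)  # placeholder per-token losses
masked_loss = per_token_loss[mask].mean()
print(np.flatnonzero(mask), masked_loss)
```

In the toy sequence the two uniform positions are the only ones selected, so the loss is averaged over them alone.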
The same sparse-update structure appears in SFT. Core Parameter Isolation Fine-Tuning (CPI-FT) identifies task-specific "core parameter regions", the parameters with the largest update magnitudes during individual task fine-tuning, and shows that these regions are concentrated and task-specific. CPI-FT exploits this by transplanting core parameters from individually fine-tuned models and SLERP-merging non-core parameters, consistently outperforming full multi-task SFT. The key finding: full multi-task SFT (uniform parameter updates across all tasks) is consistently the worst performer, and temporal task scheduling alone is insufficient without explicit structural parameter isolation. This extends the RL sparsity finding to supervised fine-tuning: task-relevant changes naturally concentrate in specific parameter regions regardless of whether the training signal is reward-based or loss-based. See the related note "Can isolating task-specific parameters prevent multi-task fine-tuning interference?"
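The two ingredients named above, selecting a core region by update magnitude and SLERP-merging the rest, can be sketched on toy vectors. Everything here is an assumption for illustration: the 1% core fraction, the helper names `core_region` and `slerp`, applying SLERP directly to flattened weights, and the order of transplanting. It is not the CPI-FT implementation.

```python
import numpy as np

def core_region(w_base, w_tuned, fraction=0.01):
    """Mask of the `fraction` of entries with the largest absolute update."""
    delta = np.abs(w_tuned - w_base).ravel()
    k = max(1, int(round(fraction * delta.size)))
    top = np.argpartition(delta, -k)[-k:]
    mask = np.zeros(delta.size, dtype=bool)
    mask[top] = True
    return mask.reshape(w_base.shape)

def slerp(a, b, t=0.5):
    """Spherical linear interpolation between two weight vectors."""
    a_unit = a / np.linalg.norm(a)
    b_unit = b / np.linalg.norm(b)
    omega = np.arccos(np.clip(np.dot(a_unit, b_unit), -1.0, 1.0))
    if omega < 1e-8:  # nearly parallel: fall back to linear interpolation
        return (1 - t) * a + t * b
    return (np.sin((1 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)

rng = np.random.default_rng(0)
base = rng.normal(size=1000)
w_a = base + 0.001 * rng.normal(size=1000)  # model fine-tuned on task A
w_b = base + 0.001 * rng.normal(size=1000)  # model fine-tuned on task B
w_a[:10] += 1.0    # task A's large updates
w_b[10:20] += 1.0  # task B's large updates

core_a = core_region(base, w_a)
core_b = core_region(base, w_b)

merged = slerp(w_a, w_b, t=0.5)  # blend everything first
merged[core_a] = w_a[core_a]     # then transplant each task's core
merged[core_b] = w_b[core_b]
print(core_a.sum(), core_b.sum())
```

In this toy the two core regions are disjoint by construction; overlapping cores would need an explicit conflict rule, which the sketch does not address.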
Related concepts in this collection 7
- Does RL improve domain reasoning by adding knowledge or removing it?
  When reinforcement learning improves reasoning in specialized domains like medicine, is it teaching models new facts or preventing them from using wrong ones? Understanding this distinction matters for how we design RL training.
  complementary: pruning describes the functional effect, sparse subnetworks describe the parametric mechanism
- Does policy entropy collapse limit reasoning performance in RL?
  As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
  extends: if RL updates are inherently sparse, entropy collapse may reflect exhaustion of the relevant subnetwork's capacity
- Does RL teach reasoning or just when to use it?
  Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
  supports: sparse updates are consistent with RL adjusting activation patterns rather than building new capabilities
- Does RL post-training create reasoning or just deploy it?
  Investigates whether reasoning capability emerges during RL fine-tuning or already exists in base models. Matters because it reshapes how we build and optimize reasoning systems.
  parametric evidence for the post angle: if RL only updates 5–30% of parameters in full-rank subnetworks, the model already has the capability; RL is selecting which parameters to activate, not building new ones
- Do base models already contain hidden reasoning ability?
  Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
  mechanistic complement: the latent-capability thesis predicts sparse updates because most of the model is already adequate; sparse subnetworks are the parametric signature of capability elicitation rather than creation
- Do high-entropy tokens drive reasoning model improvements?
  Explores whether only a small fraction of tokens (those with high entropy at decision points) actually matter for improving reasoning performance in language models, and whether training on them alone could work as well as full training.
  token-level parallel: parameter sparsity (5–30%) + token sparsity (~20%) form dual intrinsic sparsity
- Can sparse weight training make neural networks interpretable by design?
  Explores whether constraining most model weights to zero during training produces human-understandable circuits and disentangled representations, rather than attempting to reverse-engineer dense models after training.
  RL's natural sparsity and weight-sparse training represent complementary paths to the same structural principle: RL discovers sparse subnetworks post-hoc through optimization pressure, while weight-sparse training enforces sparsity by construction; together they suggest that sparse parameter utilization is fundamental to how neural networks organize task-relevant computation
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Reinforcement Learning Finetunes Small Subnetworks in Large Language Models
- Reinforcement Learning for Reasoning in Large Language Models with One Training Example
- Statistical and Algorithmic Foundations of Reinforcement Learning
- Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining
- Learning, Fast and Slow: Towards LLMs That Adapt Continually
- Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning
- The Art of Scaling Reinforcement Learning Compute for LLMs
- Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs
Original note title
rl updates only 5-30 percent of parameters in sparse but full-rank subnetworks