SYNTHESIS NOTE

Does reinforcement learning update only a small fraction of parameters?

Investigating whether RL algorithms consistently modify only 5–30% of model parameters across different LLMs and RL methods, and what structural properties those sparse updates possess.

Synthesis note · 2026-02-22 · sourced from Reinforcement Learning

The surprising finding is not that RL changes models — it's how little it changes them. Across PPO, GRPO, DPO, and four other algorithms applied to ten different LLM families, RL consistently updates only 5-30% of parameters. The rest remain effectively unchanged. This sparsity is intrinsic — no explicit sparsity-promoting regularizations or architectural constraints are applied.
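To make the headline number concrete, here is a minimal sketch of how an update fraction could be measured between a base checkpoint and its RL-tuned counterpart. The function name `update_sparsity`, the tolerance `tol`, and the toy parameter names are assumptions for illustration; the source does not say how "effectively unchanged" is thresholded.

```python
import numpy as np

def update_sparsity(base, tuned, tol=1e-5):
    """Fraction of parameters whose value moved by more than tol (tol is an assumed threshold)."""
    changed = 0
    total = 0
    for name, w0 in base.items():
        diff = np.abs(tuned[name] - w0)
        changed += int((diff > tol).sum())
        total += w0.size
    return changed / total

# Toy checkpoints: two hypothetical weight matrices with exactly 20% of entries perturbed.
rng = np.random.default_rng(0)
base = {"attn.w": rng.normal(size=(50, 40)), "mlp.w": rng.normal(size=(40, 50))}
tuned = {}
for name, w in base.items():
    mask = np.zeros(w.size, dtype=bool)
    mask[rng.choice(w.size, size=w.size // 5, replace=False)] = True
    tuned[name] = w + 0.01 * mask.reshape(w.shape)

fraction_updated = update_sparsity(base, tuned)
print(f"fraction of parameters updated: {fraction_updated:.2f}")
```

On real checkpoints the same loop would run over the model's state dictionary; the choice of tolerance matters and should be reported alongside the fraction.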

The critical nuance is that these sparse updates are nearly full-rank, which sets them apart from low-rank adaptation such as LoRA. Although only a minority of entries in each weight matrix change, the resulting update matrix spans almost all the directions the matrix can represent. So RL selects a small subset of parameters, but that subset is geometrically rich enough to represent complex transformations. The distinction matters: a low-rank update would mean RL operates in a constrained subspace, whereas a sparse-but-full-rank update means RL identifies which parameters matter while preserving full expressivity.
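The difference between sparse and low-rank can be seen on synthetic matrices. The sketch below is an illustration on random data, not a reproduction of the paper's measurements; the size `n`, the 20% density, and the rank `r` are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64

# Sparse update: roughly 20% of entries are nonzero, at random positions.
mask = rng.random((n, n)) < 0.2
sparse_update = rng.normal(size=(n, n)) * mask

# Low-rank update in the style of LoRA: product of two thin matrices, so rank <= r.
r = 4
low_rank_update = rng.normal(size=(n, r)) @ rng.normal(size=(r, n))

sparse_rank = np.linalg.matrix_rank(sparse_update)
low_rank_rank = np.linalg.matrix_rank(low_rank_update)
density_sparse = np.count_nonzero(sparse_update) / sparse_update.size
density_low_rank = np.count_nonzero(low_rank_update) / low_rank_update.size

print(f"sparse:   density={density_sparse:.2f}, rank={sparse_rank} of {n}")
print(f"low-rank: density={density_low_rank:.2f}, rank={low_rank_rank} of {n}")
```

The sparse matrix touches few entries yet its rank is typically close to `n`, while the low-rank matrix touches nearly every entry yet its rank is capped at `r`. The two notions of "small" are independent of each other.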

Three additional properties make this pattern robust. First, subnetworks identified from different random seeds overlap substantially more than chance would predict, suggesting the subnetwork is a structural property of the pretrained model rather than an artifact of a particular training run. Second, finetuning the subnetwork alone recovers both the test accuracy and the actual parameter values of full finetuning. Third, the sparsity is distributed: nearly all parameter matrices receive similarly sparse updates, rather than the updates concentrating in a subset of layers.
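The first property, overlap greater than chance, can be stated as a simple comparison. In this sketch the helper names and the toy masks are hypothetical, and the numbers are constructed for illustration rather than taken from the source: if two update masks were independent, the share of one mask's parameters also present in the other would equal the other mask's density.

```python
import numpy as np

def overlap(mask_a, mask_b):
    """Fraction of mask_a's updated parameters that mask_b also updates."""
    return np.logical_and(mask_a, mask_b).sum() / mask_a.sum()

def chance_overlap(mask_b):
    """Expected overlap if mask_a were drawn independently of mask_b: mask_b's density."""
    return mask_b.mean()

# Toy masks over 10,000 parameters: each seed updates 2,000 of them,
# and 1,500 of those are shared between the two seeds.
n_params = 10_000
seed_a = np.zeros(n_params, dtype=bool)
seed_b = np.zeros(n_params, dtype=bool)
seed_a[0:2000] = True
seed_b[0:1500] = True
seed_b[2000:2500] = True

observed = overlap(seed_a, seed_b)
expected = chance_overlap(seed_b)
print(f"observed overlap {observed:.2f} vs chance {expected:.2f}")
```

A ratio of observed to expected well above 1 is what the claim "a structural property of the pretrained model" would predict.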

The authors conjecture this sparsity arises primarily from training on data near the policy distribution. Read alongside "Does RL improve domain reasoning by adding knowledge or removing it?", the sparse-but-full-rank pattern provides a mechanistic explanation: RL doesn't need to transform the entire model because most of the model is already adequate. It only needs to adjust a targeted subset, the parameters that control which reasoning paths are taken.

This has implications for efficient RL training. If the effective parameter footprint is 5-30%, techniques that exploit this sparsity (targeted updates, efficient memory use) could dramatically reduce RL training cost without sacrificing quality.
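One way to picture a targeted update is a gradient step gated by a fixed mask. The sketch below assumes the mask is already known, which in practice is the hard part; `masked_sgd_step` and the toy quadratic objective are illustrative and are not a method described in the source.

```python
import numpy as np

def masked_sgd_step(w, grad, mask, lr=0.1):
    """Apply an SGD step only where mask is True; all other entries stay frozen."""
    return w - lr * grad * mask

# Toy objective: squared distance to a target vector, minimised by w == target.
w = np.zeros(6)
target = np.arange(1.0, 7.0)
mask = np.array([True, False, True, False, False, False])

for _ in range(200):
    grad = 2.0 * (w - target)
    w = masked_sgd_step(w, grad, mask, lr=0.1)

print(w)
```

Only the masked entries move toward the target. A real memory saving would additionally require keeping optimizer state for the masked entries alone, which this sketch does not attempt.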

Token-level 80/20 parallel: The parameter-level sparsity has a striking token-level analog. The "Beyond 80/20" analysis of RLVR shows that high-entropy minority tokens, the ~20% of tokens where the model is most uncertain, are the critical forking points that carry most of the learning signal. Restricting gradient updates to only these 20% of tokens matches or exceeds full-token updates (+11.04 on AIME'25 for Qwen3-32B). The remaining 80% of tokens are low-entropy, already-decided outputs where gradient updates add noise rather than signal. This creates a dual sparsity picture: RL updates 5-30% of parameters, and the effective signal comes from ~20% of tokens. Both forms of sparsity are intrinsic rather than imposed by regularization, and both suggest RL is fundamentally a targeted refinement process rather than a wholesale model transformation. See "Do high-entropy tokens drive reasoning model improvements?"
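Selecting the high-entropy minority can be sketched as follows. This is a simplified illustration: the function names, the `keep_fraction` parameter, and the choice to threshold within a single sequence are assumptions, and the cited analysis may compute its threshold differently.

```python
import numpy as np

def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution at each position."""
    z = logits - logits.max(axis=-1, keepdims=True)  # stabilise the softmax
    log_p = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -(np.exp(log_p) * log_p).sum(axis=-1)

def high_entropy_mask(logits, keep_fraction=0.2):
    """Boolean mask over positions whose entropy is in the top keep_fraction (ties included)."""
    h = token_entropy(logits)
    k = max(1, int(round(keep_fraction * h.size)))
    threshold = np.sort(h)[-k]
    return h >= threshold

def masked_mean_loss(per_token_loss, mask):
    """Average the loss over selected positions only, so the rest contribute no gradient."""
    return (per_token_loss * mask).sum() / mask.sum()

# Toy sequence: 10 positions, vocabulary of 5. Positions 3 and 7 are uniform
# (maximally uncertain); every other position has one strongly preferred token.
logits = np.zeros((10, 5))
for i in range(10):
    if i not in (3, 7):
        logits[i, 0] = 5.0 + i

mask = high_entropy_mask(logits, keep_fraction=0.2)
loss = masked_mean_loss(np.arange(10.0), mask)
print(mask, loss)
```

In a training loop the mask would multiply the per-token policy-gradient loss before reduction, so low-entropy positions are excluded from the update.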

The same sparse-update structure appears in SFT. Core Parameter Isolation Fine-Tuning (CPI-FT) identifies task-specific "core parameter regions", the parameters with the largest update magnitudes during individual task fine-tuning, and shows that these regions are concentrated and task-specific. CPI-FT exploits this by transplanting core parameters from individually fine-tuned models and SLERP-merging the non-core parameters, consistently outperforming full multi-task SFT. The key finding is that full multi-task SFT (uniform parameter updates across all tasks) is consistently the worst performer, and that temporal task scheduling alone is insufficient without explicit structural parameter isolation. This extends the RL sparsity finding to supervised fine-tuning: task-relevant changes naturally concentrate in specific parameter regions regardless of whether the training signal is reward-based or loss-based. See "Can isolating task-specific parameters prevent multi-task fine-tuning interference?"
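The two ingredients named above, selecting core parameters by update magnitude and SLERP-merging the rest, can be sketched on toy vectors. This is not the CPI-FT implementation: `core_mask`, `core_fraction`, applying SLERP to a whole flattened tensor, and the order in which cores are transplanted are all assumptions made for illustration.

```python
import numpy as np

def core_mask(base, tuned, core_fraction=0.01):
    """Mask of the entries with the largest |tuned - base| (core_fraction is an assumed knob)."""
    delta = np.abs(tuned - base).ravel()
    k = max(1, int(round(core_fraction * delta.size)))
    idx = np.argpartition(delta, -k)[-k:]
    mask = np.zeros(delta.size, dtype=bool)
    mask[idx] = True
    return mask.reshape(base.shape)

def slerp(a, b, t=0.5, eps=1e-8):
    """Spherical linear interpolation between two tensors treated as flat vectors."""
    a_flat, b_flat = a.ravel(), b.ravel()
    cos = np.dot(a_flat, b_flat) / (np.linalg.norm(a_flat) * np.linalg.norm(b_flat) + eps)
    omega = np.arccos(np.clip(cos, -1.0, 1.0))
    if omega < eps:  # nearly parallel: fall back to linear interpolation
        return (1 - t) * a + t * b
    s = np.sin(omega)
    return (np.sin((1 - t) * omega) / s) * a + (np.sin(t * omega) / s) * b

# Toy example: two task-specific models that each moved one parameter a lot.
base = np.zeros(10)
tuned_a = base + 0.1
tuned_a[0] = 5.0
tuned_b = base + 0.1
tuned_b[9] = -4.0

mask_a = core_mask(base, tuned_a, core_fraction=0.1)
mask_b = core_mask(base, tuned_b, core_fraction=0.1)

merged = slerp(tuned_a, tuned_b, t=0.5)   # merge everything first
merged[mask_a] = tuned_a[mask_a]          # then transplant task A's core
merged[mask_b] = tuned_b[mask_b]          # and task B's core (B wins on any overlap)
print(merged)
```

How overlapping cores are resolved is a design decision this sketch settles arbitrarily by letting the later task overwrite the earlier one.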

Inquiring lines that read this note 96

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

What capability tradeoffs emerge when scaling model reasoning abilities?

Does the heuristic dominance ratio vary predictably across model architectures?

What pretraining choices and baseline capability constrain reinforcement learning gains?

What constrains reinforcement learning's ability to expand model reasoning?

How does memorization interact with learning and generalization?

What critical LLM failures do standard benchmarks hide?

Why do intermediate LLM layers become more precise in frontier models?

Can alternative training methods improve on supervised fine-tuning for language models?

Why do self-improving systems struggle without clear external performance metrics?

How do self-generated feedback mechanisms enable effective model learning?

How does AI adoption affect human skill development and labor equality?

Does narrow reallocation to remaining tasks constitute genuine adaptation?

Why do continual learning scenarios trigger catastrophic forgetting and interference?

What determines success in training models on multiple tasks?

When should model isolation be preferred over weight-averaging approaches?

Does reinforcement learning teach reasoning or just when to reason?

How does policy entropy collapse constrain reasoning-focused reinforcement learning?

How do we evaluate AI systems when user perception misleads actual performance?

Which AI imaginaries dominate training data and shape system behavior most strongly?

How does example difficulty affect learning efficiency in language models?

What makes certain bond distributions more learnable than others?

Why does finetuning cause catastrophic forgetting of model capabilities?

What memory architectures best support persistent reasoning across extended interactions?

Can episodic memory alone enable learning without parameter updates?

How should iterative research systems allocate reasoning per search step?

How do cascaded probabilistic models compare to reinforcement learning for per-query system design?

Can model confidence signals reliably improve reasoning quality and calibration?

Can model confidence signals replace explicit external reward functions?

How can identical external performance mask different internal representations?

What structural advantages do diffusion language models offer over autoregressive methods?

Why does supervised fine-tuning improve accuracy while degrading reasoning quality?

How can conversational AI maintain consistent personas across conversations?

Can multi-turn reinforcement learning actually solve persona drift without addressing the default bias?

Does fine-tuning modify underlying model capabilities or only behavioral outputs?

How do multi-agent systems achieve genuine cooperation and reasoning?

Can influence estimation identify the most valuable trajectories in agentic training?

Do base models contain latent reasoning that training can unlock?

Does RL training activate latent meta-learning capacity or create it from scratch?

Why does reinforcement learning suppress output diversity compared to supervised fine-tuning?

Why does supervised fine-tuning on diverse demonstrations expand exploration diversity compared to RL?

How do policy learning algorithm choices affect multi-objective optimization stability?

Why does gradient discarding limit standard policy clipping?

Can language model RL training avoid reward hacking and misalignment?

Do harness improvements transfer across model scales or memorize shortcuts?

Can smaller models produce skill updates as useful as frontier model updates?

How can models identify insufficient information and respond appropriately without guessing?

Can abstention behavior transfer from small models to frontier models?

Which computational strategies best support reasoning in language models?

Can a trained decoder replace both search and parameter updates?

What are the consequences of models training on synthetic data?

How does off-policy data reuse inside trust regions affect convergence guarantees?

Related concepts in this collection 7


Does RL improve domain reasoning by adding knowledge or removing it? When reinforcement learning improves reasoning in specialized domains like medicine, is it teaching models new facts or preventing them from using wrong ones? Understanding this distinction matters for how we design RL training.
complementary: pruning describes the functional effect, sparse subnetworks describe the parametric mechanism
Does policy entropy collapse limit reasoning performance in RL? As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
extends: if RL updates are inherently sparse, entropy collapse may reflect exhaustion of the relevant subnetwork's capacity
Does RL teach reasoning or just when to use it? Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
supports: sparse updates are consistent with RL adjusting activation patterns rather than building new capabilities
Does RL post-training create reasoning or just deploy it? Investigates whether reasoning capability emerges during RL fine-tuning or already exists in base models. Matters because it reshapes how we build and optimize reasoning systems.
parametric evidence for the post angle: if RL only updates 5-30% of parameters in full-rank subnetworks, the model already has the capability; RL is selecting which parameters to activate, not building new ones
Do base models already contain hidden reasoning ability? Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
mechanistic complement: the latent-capability thesis predicts sparse updates because most of the model is already adequate; sparse subnetworks are the parametric signature of capability elicitation rather than creation
Do high-entropy tokens drive reasoning model improvements? Explores whether only a small fraction of tokens—those with high entropy at decision points—actually matter for improving reasoning performance in language models, and whether training on them alone could work as well as full training.
token-level parallel: parameter sparsity (5-30%) + token sparsity (~20%) form dual intrinsic sparsity
Can sparse weight training make neural networks interpretable by design? Explores whether constraining most model weights to zero during training produces human-understandable circuits and disentangled representations, rather than attempting to reverse-engineer dense models after training.
RL's natural sparsity and weight-sparse training represent complementary paths to the same structural principle: RL discovers sparse subnetworks post-hoc through optimization pressure, while weight-sparse training enforces sparsity by construction; together they suggest that sparse parameter utilization is fundamental to how neural networks organize task-relevant computation
