SYNTHESIS NOTE


Does reinforcement learning squeeze exploration diversity in search agents?

Investigates whether RL training narrows the behavioral diversity of search agents the same way it does in reasoning tasks. Understanding this mechanism could reveal whether entropy collapse is fundamental to RL or domain-specific.

Synthesis note · 2026-02-21 · sourced from Deep Research

The "RL Squeezes, SFT Expands" paper studies search agents trained with RL versus SFT and finds the same pattern the reasoning literature has documented: RL training compresses the diversity of behaviors the agent explores (squeezes), while SFT on diverse demonstrations expands it. The note "Does policy entropy collapse limit reasoning performance in RL?" covers this dynamic for reasoning, and this paper shows it in search RL, so entropy collapse is not a quirk of reasoning training; it is a property of RL training at large.

The mechanism is the same in both domains: RL updates raise the probability of high-reward outputs and lower the probability of low-reward ones. Over training, the policy concentrates probability mass on the reward-maximizing region of its action space. In reasoning, this means converging on a narrow set of reasoning patterns. In search, it means converging on a narrow set of query strategies. Both reduce the agent's ability to explore novel approaches to hard problems.
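The concentration of probability mass can be seen in a toy setting. The sketch below is not the paper's setup: it assumes a softmax policy over four hypothetical query strategies with made-up average rewards, and it applies the exact policy gradient on expected reward rather than sampled rollouts. The names `policy_gradient_step`, `train`, and `REWARDS` are illustrative only.

```python
import math


def softmax(logits):
    """Convert logits to a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def policy_gradient_step(logits, rewards, lr):
    """One exact gradient-ascent step on expected reward.

    For a softmax policy, d E[r] / d logit_i = p_i * (r_i - E[r]).
    """
    probs = softmax(logits)
    expected = sum(p * r for p, r in zip(probs, rewards))
    return [x + lr * p * (r - expected)
            for x, p, r in zip(logits, probs, rewards)]


def train(rewards, steps=2000, lr=1.0):
    """Start from a uniform policy and record entropy after every step."""
    logits = [0.0] * len(rewards)
    entropies = [entropy(softmax(logits))]
    for _ in range(steps):
        logits = policy_gradient_step(logits, rewards, lr)
        entropies.append(entropy(softmax(logits)))
    return softmax(logits), entropies


# Hypothetical average rewards for four query strategies (illustrative only).
REWARDS = [1.0, 0.8, 0.5, 0.2]

if __name__ == "__main__":
    probs, entropies = train(REWARDS)
    print("entropy at start:", round(entropies[0], 3))
    print("entropy at end:  ", round(entropies[-1], 3))
    print("final policy:    ", [round(p, 3) for p in probs])
```

In this toy, the second-best strategy is only slightly worse than the best one, yet the policy still drains probability from it. That is the sense in which reward maximization alone gives no reason to keep near-optimal alternatives alive.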

SFT has the opposite effect because it trains on human demonstrations or diverse synthetic completions, so the policy inherits the diversity of the training set. The tradeoff is that SFT cannot generalize beyond its demonstrations the way RL can.
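A tabular sketch makes the SFT side concrete. It assumes the simplest possible case: a softmax policy over four hypothetical query strategies, fitted by cross-entropy to the frequencies with which those strategies appear in a demonstration set. Real SFT works on token sequences with a neural policy, so this only illustrates the direction of the effect. `INITIAL_LOGITS`, `DEMO_FREQS`, `sft_step`, and `fit` are made-up names and numbers.

```python
import math


def softmax(logits):
    """Convert logits to a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def sft_step(logits, demo_freqs, lr):
    """One gradient-descent step on cross-entropy against the demonstrations.

    For a softmax policy, d CE / d logit_i = p_i - q_i,
    where q_i is the demonstration frequency of strategy i.
    """
    probs = softmax(logits)
    return [x - lr * (p - q) for x, p, q in zip(logits, probs, demo_freqs)]


def fit(initial_logits, demo_freqs, steps=500, lr=1.0):
    """Run cross-entropy fitting and return the resulting policy."""
    logits = list(initial_logits)
    for _ in range(steps):
        logits = sft_step(logits, demo_freqs, lr)
    return softmax(logits)


# Hypothetical starting policy, already peaked on the first strategy.
INITIAL_LOGITS = [3.0, 0.0, 0.0, 0.0]
# Hypothetical share of each strategy in the demonstration set.
DEMO_FREQS = [0.4, 0.3, 0.2, 0.1]

if __name__ == "__main__":
    before = softmax(INITIAL_LOGITS)
    after = fit(INITIAL_LOGITS, DEMO_FREQS)
    print("entropy before SFT:", round(entropy(before), 3))
    print("entropy after SFT: ", round(entropy(after), 3))
    print("entropy of demos:  ", round(entropy(DEMO_FREQS), 3))
```

The fitted policy ends up with the entropy of the demonstration set, no more and no less. That matches both halves of the claim above: diversity is inherited from the data, and nothing outside the data is discovered.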

This finding has practical implications for Deep Research (DR) agent design: RL-trained search agents need explicit diversity mechanisms (entropy regularization, diverse reward models, periodic SFT refreshes), or they will converge on query templates that work well on average but fail under distribution shift. The remedy from "Do critique models improve diversity during training itself?" applies here as well: external critique prevents the RL agent from collapsing to a narrow search strategy.
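Of the mechanisms listed, entropy regularization is the easiest to sketch. The example below assumes a toy softmax policy over four hypothetical query strategies and adds an entropy bonus weighted by `beta` to the expected-reward objective, using exact gradients rather than sampled rollouts. The rewards, `beta`, and all function names are illustrative assumptions, not values or code from the paper.

```python
import math


def softmax(logits):
    """Convert logits to a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def regularized_step(logits, rewards, beta, lr):
    """One exact gradient-ascent step on E[r] + beta * H(policy).

    For a softmax policy:
      d E[r] / d logit_i = p_i * (r_i - E[r])
      d H    / d logit_i = -p_i * (log p_i + H)
    """
    probs = softmax(logits)
    expected = sum(p * r for p, r in zip(probs, rewards))
    h = entropy(probs)
    updated = []
    for x, p, r in zip(logits, probs, rewards):
        log_p = math.log(p) if p > 0.0 else 0.0  # guard against underflow
        updated.append(x + lr * p * ((r - expected) - beta * (log_p + h)))
    return updated


def train(rewards, beta, steps=5000, lr=1.0):
    """Train from a uniform policy and return the final distribution."""
    logits = [0.0] * len(rewards)
    for _ in range(steps):
        logits = regularized_step(logits, rewards, beta, lr)
    return softmax(logits)


# Hypothetical average rewards for four query strategies (illustrative only).
REWARDS = [1.0, 0.8, 0.5, 0.2]

if __name__ == "__main__":
    for beta in (0.0, 0.2):
        policy = train(REWARDS, beta)
        print("beta =", beta,
              "entropy =", round(entropy(policy), 3),
              "policy =", [round(p, 3) for p in policy])
```

With `beta = 0` the policy collapses onto the single best strategy. With a positive `beta` the objective is maximized by a softmax over `reward / beta`, so near-optimal strategies keep meaningful probability. The cost is that some mass also stays on clearly worse strategies, which is why `beta` would need tuning in any real system.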

Inquiring lines that read this note 137

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

How can AI agents autonomously learn and transfer skills across tasks?

When does optimizing for quality undermine the value of diversity?

How do multi-agent systems achieve genuine cooperation and reasoning?

What pretraining choices and baseline capability constrain reinforcement learning gains?

How does policy entropy collapse constrain reasoning-focused reinforcement learning?

Why does reinforcement learning suppress output diversity compared to supervised fine-tuning?

What constrains reinforcement learning's ability to expand model reasoning?

Does decoupling planning from execution improve multi-step reasoning accuracy?

How does objective evolution guide discovery better than fixed planning?

How can AI systems learn from failures without cascading errors?

Can population diversity in self-improvement prevent error avalanching failures?

Which computational strategies best support reasoning in language models?

How does example difficulty affect learning efficiency in language models?

Why do LLM research ideas score high on novelty yet collapse into low diversity?

Does externalizing cognitive work and state improve agent reliability?

What training difficulty and curriculum settings prevent instability in empathetic agent RL?

How can LLM user simulators model realistic goal-driven conversation?

When does simulated search outperform real search for agent training?

Can prompting inject entirely new knowledge into language models?

How does explicit exploratory prompting compare to fine-tuned reinforcement learning for in-context adaptation?

How does test-time aggregation affect reasoning correctness and reliability?

How does majority voting fail when reasoning samples lack genuine diversity?

What prevents language models from reliably adopting diverse personas?

How does RLHF-induced mode collapse limit diversity in LLM-generated personas?

Why do persona-level simulations fail to predict individual preferences accurately?

Can evolutionary search solve persona diversity better than prompt engineering?

What are the consequences of models training on synthetic data?

How does diversity loss in synthetic data mirror tail distribution disappearance?

Does reinforcement learning teach reasoning or just when to reason?

Why do reward structures fail to shape long-term agent learning?

Does alignment training create blind spots in detecting genuine safety threats?

What makes behavioral cloning produce more persuadable but less aligned agents?

How should iterative research systems allocate reasoning per search step?

How do self-generated feedback mechanisms enable effective model learning?

Why do agents confidently report success despite actually failing tasks?

What training objectives could reduce completion bias in autonomous agents?

How should agents balance memory condensation to optimize context efficiency?

How does memory folding enable agents to reconsider strategies mid-task?

What determines success in training models on multiple tasks?

Can AI systems balance emotional competence with factual reliability?

Why does supervised fine-tuning improve accuracy while degrading reasoning quality?

What makes supervised fine-tuning worsen RL exploration later?

How should inference compute be adaptively allocated based on prompt difficulty?

Should test-time search maximize diversity of competent solutions instead of converging on one strategy?

What capability tradeoffs emerge when scaling model reasoning abilities?

Why does strategy diversity within reasoning chains improve model generalization?

Do harness improvements transfer across model scales or memorize shortcuts?

Do gains from harness-based agents transfer across different search benchmarks?

How should models express uncertainty rather than forced confident answers?

Can agents escape weak belief tracking and conservative action selection traps?

How do soft continuous representations explore multiple reasoning paths simultaneously?

How does soft thinking achieve stochastic exploration without explicit training?

Why do continual learning scenarios trigger catastrophic forgetting and interference?

How do cyclic learning rates anti-correlate with weight decay to create diversity?

How do policy learning algorithm choices affect multi-objective optimization stability?

Can on-policy optimization variants avoid the probability squeezing problem?

Related concepts in this collection 6


Does policy entropy collapse limit reasoning performance in RL? As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
extends: entropy collapse is confirmed in the search domain; the bottleneck comes from RL training itself, not from anything reasoning-specific
Do critique models improve diversity during training itself? Explores whether critique integrated into the training loop, beyond test-time scoring, actively maintains solution diversity and prevents the model from converging too narrowly during iterative self-training.
applies: the diversity-preservation remedy generalizes to search RL; critique models prevent search strategy collapse
Can simple rewards alone teach complex domain reasoning? Does reinforcement learning on difficult problems with basic accuracy rewards produce sophisticated reasoning strategies without explicit chain-of-thought training? This challenges assumptions about what domain AI models need to learn effectively.
parallel RL emergence pattern: domain reasoning capabilities (AlphaMed) and search capabilities both emerge from RL reward signals; entropy collapse constrains scaling in both
Does the choice of RL algorithm actually matter for reasoning? Expert Iteration, PPO, and RC-RL show similar performance on reasoning tasks. The question is whether algorithm choice drives results or whether something deeper—like the pretrained model itself—sets the real limits.
algorithm-invariance evidence in reasoning and entropy collapse in search are the same mechanism from different angles: both show RL is bounded by the pretrained prior, not by optimizer choice
Does RL training collapse format diversity in pretrained models? Exploring whether RL fine-tuning systematically selects one output format from pretraining while suppressing others, and how this selection mechanism drives performance gains.
the format-level selection mechanism: RL entropy collapse in search narrows strategy diversity within one distribution, while the echo chamber effect selects which pretraining distribution survives — format selection precedes and compounds within-format diversity loss
Should training maximize diversity when models feed into search? If a model runs inside a test-time search loop that samples many rollouts and picks the best, does training for entropy and diversity unlock better solutions than training for a single sharp answer?
extends: that note prescribes the diversity-as-objective training fix for the entropy-collapse-in-search failure this note documents

Related papers in this collection 8

Papers most semantically related to this note, ranked by cosine similarity in the embedding space.

Vector Policy Optimization: Training for Diversity Improves Test-Time Search (0.85 match)
Jointly Reinforcing Diversity and Quality in Language Model Generations (0.85 match)
RLVMR: Reinforcement Learning with Verifiable Meta-Reasoning Rewards for Robust Long-Horizon Agents (0.84 match)
Outcome-based Exploration for LLM Reasoning (0.84 match)
From Trial-and-Error to Improvement: A Systematic Analysis of LLM Exploration Mechanisms in RLVR (0.84 match)
The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models (0.84 match)
RAGEN-2: Reasoning Collapse in Agentic RL (0.84 match)
Beyond Ten Turns: Unlocking Long-Horizon Agentic Search with Large-Scale Asynchronous RL (0.83 match)

Original note title

rl training for search agents squeezes exploration diversity while sft expands it — the same entropy collapse dynamic operates in search as in reasoning