SYNTHESIS NOTE

Does reinforcement learning squeeze exploration diversity in search agents?

Investigates whether RL training narrows the behavioral diversity of search agents the same way it does in reasoning tasks. Understanding this mechanism could reveal whether entropy collapse is fundamental to RL or domain-specific.

Synthesis note · 2026-02-21 · sourced from Deep Research

The "RL Squeezes, SFT Expands" paper compares search agents trained with RL against agents trained with SFT and finds the pattern already documented in the reasoning literature: RL training compresses the diversity of behaviors the agent explores (it squeezes), while SFT on diverse demonstrations expands it. The linked note "Does policy entropy collapse limit reasoning performance in RL?" addresses this dynamic for reasoning, and this paper shows it in search RL as well, so entropy collapse looks less like a quirk of reasoning training and more like a property of RL training at large.

The mechanism is the same in both domains: RL raises the probability of high-reward outputs and lowers the probability of low-reward ones. Over training, the policy concentrates probability mass on the reward-maximizing region of its action space. In reasoning, this means converging on a narrow set of reasoning patterns. In search, it means converging on a narrow set of query strategies. Both reduce the agent's ability to explore novel approaches to hard problems.
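The concentration of probability mass can be reproduced in a toy setting. The sketch below is an illustration under stated assumptions, not the paper's experimental setup: a softmax policy over five hypothetical "query strategies" with made-up fixed rewards, updated with the exact expected policy gradient (no sampling noise). All names and numbers are invented for the example.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    # Shannon entropy in nats; terms with p == 0 contribute nothing
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def policy_gradient_step(logits, rewards, lr):
    """One exact gradient-ascent step on expected reward for a softmax policy."""
    probs = softmax(logits)
    baseline = sum(p * r for p, r in zip(probs, rewards))  # expected reward
    # d E[r] / d logit_i = p_i * (r_i - baseline)
    return [l + lr * p * (r - baseline)
            for l, p, r in zip(logits, probs, rewards)]

def train_rl(rewards, steps=2000, lr=0.5):
    logits = [0.0] * len(rewards)  # start from a uniform policy
    entropies = [entropy(softmax(logits))]
    for _ in range(steps):
        logits = policy_gradient_step(logits, rewards, lr)
        entropies.append(entropy(softmax(logits)))
    return softmax(logits), entropies

# Hypothetical rewards: several strategies are decent, one is slightly best.
REWARDS = [1.0, 0.8, 0.5, 0.2, 0.1]
final_probs, entropies = train_rl(REWARDS)
print("entropy before:", round(entropies[0], 3))
print("entropy after: ", round(entropies[-1], 3))
print("final policy:  ", [round(p, 3) for p in final_probs])
```

In this toy, entropy starts at ln 5 (uniform policy) and shrinks as the policy piles mass onto the single best strategy, even though the second-best strategy is nearly as good. That is the "squeeze" in miniature: reward maximization alone gives the policy no reason to keep alternatives alive.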

SFT has the opposite effect because it trains on human demonstrations or diverse synthetic completions — the diversity of the training set is preserved in the policy. The tradeoff is that SFT cannot generalize beyond its demonstrations in the same way RL can.
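The contrast with SFT can be sketched in the same toy style. The example below assumes a softmax policy that starts out collapsed onto one strategy and is then fitted by cross-entropy to a hypothetical, fairly diverse distribution of demonstrations. The demonstration frequencies and starting logits are invented for illustration.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    # Shannon entropy in nats
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def sft_step(logits, demo_freqs, lr):
    """One gradient-descent step on cross-entropy against demonstration frequencies."""
    probs = softmax(logits)
    # Gradient of -sum_i q_i * log p_i w.r.t. logit_i is (p_i - q_i),
    # so descending it moves each logit by lr * (q_i - p_i).
    return [l + lr * (q - p) for l, p, q in zip(logits, probs, demo_freqs)]

def train_sft(start_logits, demo_freqs, steps=500, lr=0.5):
    logits = list(start_logits)
    for _ in range(steps):
        logits = sft_step(logits, demo_freqs, lr)
    return softmax(logits)

# Hypothetical: demonstrations use five strategies with these frequencies.
DEMO_FREQS = [0.30, 0.25, 0.20, 0.15, 0.10]
# Hypothetical collapsed starting policy, heavily favoring strategy 0.
START_LOGITS = [4.0, 0.0, 0.0, 0.0, 0.0]

before = softmax(START_LOGITS)
after = train_sft(START_LOGITS, DEMO_FREQS)
print("entropy before SFT:", round(entropy(before), 3))
print("entropy after SFT: ", round(entropy(after), 3))
```

Maximum-likelihood fitting pulls the policy toward the demonstration distribution, so its entropy ends up matching that of the data. The same property is the limitation noted above: the policy has no signal pushing it toward strategies that never appear in the demonstrations.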

This finding has practical implications for the design of deep research (DR) agents: RL-trained search agents need explicit diversity mechanisms (entropy regularization, diverse reward models, periodic SFT refreshes), or they will converge on query templates that work well on average but fail under distribution shift. The remedy discussed in the linked note "Do critique models improve diversity during training itself?" applies here too: external critique keeps the RL agent from collapsing to a narrow search strategy.
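Of the diversity mechanisms listed, entropy regularization is the easiest to illustrate. The sketch below is an assumption-laden toy rather than a recipe from the paper: it adds an entropy bonus with a hypothetical coefficient `beta` to the expected-reward objective of a softmax policy over five made-up strategies, and compares the result with the unregularized case.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    # Shannon entropy in nats
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def regularized_step(logits, rewards, beta, lr):
    """Exact gradient-ascent step on E[r] + beta * H(policy)."""
    probs = softmax(logits)
    # Entropy-adjusted reward: r_i - beta * log p_i
    adjusted = [r - beta * math.log(p) for r, p in zip(rewards, probs)]
    baseline = sum(p * a for p, a in zip(probs, adjusted))
    # d/d logit_i of (E[r] + beta * H) = p_i * (adjusted_i - baseline)
    return [l + lr * p * (a - baseline)
            for l, p, a in zip(logits, probs, adjusted)]

def train(rewards, beta, steps=2000, lr=0.5):
    logits = [0.0] * len(rewards)  # uniform start
    for _ in range(steps):
        logits = regularized_step(logits, rewards, beta, lr)
    return softmax(logits)

REWARDS = [1.0, 0.8, 0.5, 0.2, 0.1]  # hypothetical per-strategy rewards
plain = train(REWARDS, beta=0.0)      # pure reward maximization
regularized = train(REWARDS, beta=0.5)  # hypothetical entropy coefficient
print("entropy, beta=0.0:", round(entropy(plain), 3))
print("entropy, beta=0.5:", round(entropy(regularized), 3))
```

With the bonus, the optimum of this toy objective is a softmax over `rewards / beta` rather than a point mass, so near-optimal strategies keep meaningful probability. The choice of `beta` is a tradeoff: larger values preserve more diversity but give up more expected reward on the training distribution.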

Original note title

rl training for search agents squeezes exploration diversity while sft expands it — the same entropy collapse dynamic operates in search as in reasoning