INQUIRING LINE

How does RL compress reasoning path diversity during training?

This explores the mechanism by which reinforcement learning narrows the range of reasoning paths a model explores during training — and what the corpus says about why it happens, where it spreads, and how to counteract it.


This explores how RL training shrinks a model's repertoire of reasoning paths, not whether it raises accuracy. The corpus converges on a single mechanism with a clinical name: **entropy collapse**. When RL rewards only final-answer correctness, it sharpens the policy by piling probability mass onto the trajectories that already work, and the same dynamic shows up whether the model is reasoning through math, searching, or generating prose (see "Does outcome-based RL diversity loss spread across unsolved problems?" and "Does reinforcement learning squeeze exploration diversity in search agents?"). The most counterintuitive finding is that this loss isn't local: outcome-based RL transfers diversity loss from solved problems to unsolved ones, globally narrowing the policy even where it hasn't yet found an answer. Sharpening where you've succeeded quietly forecloses exploration where you haven't.
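A toy illustration of that sharpening, offered as a minimal sketch rather than anything taken from the cited notes: a softmax policy over six hypothetical reasoning paths, three of which earn an outcome reward of 1, updated with the exact expected policy-gradient step. The path count, rewards, starting logits, and learning rate are all assumptions chosen for readability.

```python
import math

def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def expected_pg_step(logits, rewards, lr=1.0):
    """One exact (expected) policy-gradient step for a softmax policy over a
    fixed set of paths: dE[r]/dlogit_i = p_i * (r_i - E[r])."""
    probs = softmax(logits)
    mean_reward = sum(p * r for p, r in zip(probs, rewards))
    return [z + lr * p * (r - mean_reward)
            for z, p, r in zip(logits, probs, rewards)]

# Assumed setup: paths 0-2 reach a correct final answer, paths 3-5 do not.
rewards = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
logits = [0.5, 0.0, -0.5, 0.0, 0.0, 0.0]  # path 0 starts slightly favoured

history = []  # policy entropy recorded every 500 steps
for step in range(2001):
    if step % 500 == 0:
        history.append(entropy(softmax(logits)))
    logits = expected_pg_step(logits, rewards)

print("entropy over training:", [round(h, 3) for h in history])
print("final policy:", [round(p, 3) for p in softmax(logits)])
```

In this sketch the update for each correct path is proportional to its current probability, so an outcome-only reward favours whichever correct path was already most likely instead of spreading mass evenly across the paths that work.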

A second strand reframes what's actually being compressed. RL may not be destroying reasoning ability so much as collapsing onto a *format* that was already latent in pretraining: within the first epoch, RL amplifies one dominant pretraining distribution and suppresses the alternatives, and which format wins depends on model scale rather than on which one performs best (see "Does RL training collapse format diversity in pretrained models?"). That dovetails with the argument that RL post-training teaches a model *when* to deploy reasoning it already has, rather than teaching it new ways to reason (see "Does RL post-training create reasoning or just deploy it?"). Read together, the picture is less 'RL invents a narrow skill' and more 'RL picks one path out of many the base model contained and prunes the rest.'

The compression isn't uniform across the run, either. RL training moves through two phases: first consolidating procedural execution, then shifting the bottleneck to strategic planning, with planning-token entropy rising even as execution entropy stabilizes (see "Does RL training follow a predictable two-phase learning sequence?"). And it isn't uniform across task types: structured domains (math, code) systematically *decrease* output entropy while creative, open-ended domains increase it. Naively mixing them therefore lets the structured tasks' collapse bleed over and damage open-ended capability unless you schedule training order to protect it (see "Does training order reshape how models handle different task types?").
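To make the planning-versus-execution entropy comparison concrete, here is a minimal sketch of how per-token entropy might be averaged by token role. The role labels and the toy next-token distributions are hypothetical; the notes do not say how the original work classifies tokens, so treat the labelling as an assumption.

```python
import math

def token_entropy(probs):
    """Shannon entropy in nats of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_entropy_by_role(step_distributions, roles):
    """Average next-token entropy separately for each token role."""
    totals, counts = {}, {}
    for probs, role in zip(step_distributions, roles):
        totals[role] = totals.get(role, 0.0) + token_entropy(probs)
        counts[role] = counts.get(role, 0) + 1
    return {role: totals[role] / counts[role] for role in totals}

# Toy distributions over a 3-token vocabulary (illustrative values only).
dists = [
    [0.40, 0.30, 0.30],  # planning token: several strategies still plausible
    [0.98, 0.01, 0.01],  # execution token: the next step is nearly forced
    [0.50, 0.25, 0.25],  # planning token
    [0.97, 0.02, 0.01],  # execution token
]
roles = ["planning", "execution", "planning", "execution"]

result = mean_entropy_by_role(dists, roles)
print(result)
```

Tracking these two averages across checkpoints is one way the described pattern (planning entropy rising while execution entropy flattens) could be observed, given a reliable way to assign roles.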

What you didn't know you wanted to know: diversity loss is reversible, and the fix isn't always 'do less RL.' SFT on diverse demonstrations preserves exploration breadth (see "Does reinforcement learning squeeze exploration diversity in search agents?"), and, more surprisingly, explicitly *rewarding* semantic diversity during RL doesn't trade off against quality; it catalyzes exploration and produces higher-quality outputs than quality-only baselines (see "Can diversity optimization improve quality during language model training?"). There's even a subtle distinction worth holding onto: preserving diversity during *training* (exploration bonuses) and recovering it at *test time* (repetition penalties, parallel sampling) are structurally different problems requiring different machinery (see "Does outcome-based RL diversity loss spread across unsolved problems?" and "Can reasoning systems scale wider instead of only deeper?").
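A minimal sketch of what rewarding semantic diversity alongside quality could look like for one batch of sampled responses. This is not the DARLING implementation: the `same_route` function is a hypothetical stand-in for a learned semantic-equivalence classifier, and multiplying quality by diversity is an assumed way of combining the two signals.

```python
from typing import Callable, List

def diversity_scores(responses: List[str],
                     same_meaning: Callable[[str, str], bool]) -> List[float]:
    """For each response, the fraction of the other responses in the batch
    that the classifier judges semantically different from it."""
    n = len(responses)
    if n <= 1:
        return [1.0] * n
    scores = []
    for i, resp in enumerate(responses):
        distinct = sum(
            1 for j, other in enumerate(responses)
            if j != i and not same_meaning(resp, other)
        )
        scores.append(distinct / (n - 1))
    return scores

def combined_rewards(responses: List[str], quality: List[float],
                     same_meaning: Callable[[str, str], bool]) -> List[float]:
    """Scale each quality reward by its diversity score (assumed combination)."""
    diversity = diversity_scores(responses, same_meaning)
    return [q * d for q, d in zip(quality, diversity)]

def same_route(a: str, b: str) -> bool:
    """Hypothetical classifier: same 'meaning' if the solution route tag matches."""
    return a.split(":")[0] == b.split(":")[0]

samples = ["algebra: x=2", "algebra: x = 2", "algebra: so x is 2", "geometry: x=2"]
quality = [1.0, 1.0, 1.0, 1.0]  # all four reach the correct answer

rewards = combined_rewards(samples, quality, same_route)
print(rewards)
```

Under outcome-only reward all four samples would score identically; here the lone geometric route keeps its full reward while the three near-duplicate algebraic routes are discounted, which is the kind of pressure that keeps a rarer but valid path alive during training.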

The quiet warning underneath all of this: a collapsed policy looks confident and fluent, but chain-of-thought that imitates the *form* of reasoning without the underlying logic degrades predictably once you step outside the training distribution (see "Does chain-of-thought reasoning actually generalize beyond training data?"). Compressing the reasoning paths you keep is exactly what makes a model brittle on the paths it threw away.


Sources (9 notes)

Does outcome-based RL diversity loss spread across unsolved problems?

RL that rewards only final answer correctness sharpens the policy globally, concentrating probability mass on correct trajectories for solved problems while simultaneously reducing diversity on unsolved ones. Historical exploration (training diversity via UCB-style bonuses) and batch exploration (test-time diversity via repetition penalties) require structurally different mechanisms.

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Does training order reshape how models handle different task types?

Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.

Can diversity optimization improve quality during language model training?

DARLING jointly optimizes for quality and semantic diversity using a learned classifier, finding that diversity rewards catalyze exploration and produce higher-quality outputs than quality-only baselines across both creative and mathematical tasks.

Can reasoning systems scale wider instead of only deeper?

GRAM shows that stochastic latent transitions enabling parallel trajectory sampling sidestep the serial latency cost of depth-only scaling. Width matches token-level parallelism benefits: independent paths sample the solution space without variance inflation.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an RL + reasoning capability analyst. The question remains open: **How does RL compress reasoning path diversity during training, and can that compression be reversed or prevented?**

What a curated library found — and when (dated claims, not current truth):
Findings span Feb 2025–May 2026. A library of recent work converges on these constraints:
• Outcome-based RL induces **entropy collapse** — probability mass concentrates on already-working trajectories, and this diversity loss *transfers* from solved to unsolved problems, globally narrowing policy even where no answer exists yet (2025-09, 2509.06941).
• RL converges on a **single dominant pretraining-distribution format** within the first epoch; which format wins depends on model scale, not performance (2025-04, 2504.07912).
• RL teaches models **when to deploy reasoning, not how** — it selects latent reasoning paths from pretraining rather than inventing new ones (2025-04, 2504.09858).
• Structured domains (math, code) systematically *decrease* output entropy while creative domains increase it; naive multi-task RL lets structured collapse bleed into open-ended capability (2025-07, 2507.14783).
• Collapsed policies are **distribution-bounded**: chain-of-thought that mimics form without underlying logic degrades predictably outside training distribution (2025-08, 2508.01191).

Anchor papers (verify; mind their dates):
• arXiv:2504.07912 (Echo Chamber, Apr 2025) — pretraining-format amplification thesis.
• arXiv:2509.02534 (Jointly Reinforcing Diversity and Quality, Sep 2025) — diversity rewards don't trade off quality.
• arXiv:2508.01191 (Is CoT a Mirage, Aug 2025) — distribution-boundedness evidence.
• arXiv:2605.22817 (Vector Policy Optimization, May 2026) — diversity-preserving training improves test-time search.

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For every finding above, determine whether newer models (post-Oct 2026), scaled inference methods (speculative decoding, mixture-of-experts routing), process reward models, or adaptive scheduling have since RELAXED entropy collapse, reversed format lock-in, or decoupled reasoning-invention from reasoning-deployment. Separate the durable question (how does diversity loss propagate?) from the perishable limitation (outcome-only RL is the only regime). Cite what has moved the regime.
(2) **Surface the strongest CONTRADICTING or SUPERSEDING work** from the last ~6 months that challenges the 'format-lock' or 'transfer-of-loss' claims, or that shows diversity can be preserved *during* training without quality loss.
(3) **Propose 2 research questions** that assume the regime may have shifted: (a) If RL now preserves diversity via explicit rewards or scheduling, does distribution-boundedness still hold? (b) Can outcome-based RL be combined with diversity-preserving mechanisms (e.g., process rewards + ensemble pressure) without reverting to exploration collapse?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
