INQUIRING LINE

How does KL penalty strength affect the degree of format collapse during RL?

This reads the question as: does the KL-to-reference penalty (the knob that keeps an RL policy from drifting too far from its pretrained starting point) control how badly the model's output diversity collapses into a single dominant format during RL?


This explores whether the KL penalty (the regularizer that tethers an RL policy to its pretrained prior) governs how severely format diversity collapses during training. Up front, a caveat worth naming: the corpus here documents format collapse vividly but does not contain a clean, controlled sweep that turns the KL knob up and down and measures the result. So the honest synthesis is lateral — what the collection *does* establish about the mechanism the KL penalty is supposed to restrain, and why that reframes the question.

The central finding is that RL collapse onto one format is fast and structural. Controlled experiments show RL converges on a single dominant *pretraining* format within the first epoch while suppressing the alternatives. Strikingly, the winning format depends on model scale, not necessarily on which format performs best (see "Does RL training collapse format diversity in pretrained models?"). This matters for the KL question because the KL penalty pulls the policy toward exactly that pretrained distribution. The reservoir of formats the penalty anchors the policy to is itself the thing that gets winnowed; a stronger pull toward the prior doesn't obviously preserve diversity, because the prior's dominant mode is what RL amplifies.
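One way to make the last point concrete is to compare what a KL-to-reference penalty charges for different kinds of collapse. The sketch below is a toy under stated assumptions: the three-format prior and its probabilities are invented for illustration and do not come from the cited notes. It only shows that, for a full collapse, the penalty is smallest when the policy collapses onto the format the prior already favors.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same formats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """Shannon entropy in nats; 0 means total collapse onto one format."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# Hypothetical pretrained prior over three output formats (illustrative numbers only).
reference = [0.6, 0.3, 0.1]

# Two fully collapsed policies: one on the prior's dominant format, one on its rarest.
collapsed_on_dominant = [1.0, 0.0, 0.0]
collapsed_on_rare = [0.0, 0.0, 1.0]

kl_dominant = kl_divergence(collapsed_on_dominant, reference)
kl_rare = kl_divergence(collapsed_on_rare, reference)

print(f"prior entropy:                          {entropy(reference):.3f}")
print(f"KL cost, collapse onto dominant format: {kl_dominant:.3f}")
print(f"KL cost, collapse onto rare format:     {kl_rare:.3f}")
```

Both collapsed policies have zero format entropy, yet the penalty treats them very differently: collapsing onto the dominant format costs log(1/0.6) nats, while collapsing onto the rare one costs log(1/0.1). The penalty measures distance from the prior, not diversity as such.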

The deeper driver is that outcome-based reward sharpens the policy globally. Rewarding only final-answer correctness concentrates probability mass on winning trajectories, and that diversity loss *transfers* from solved problems to unsolved ones, meaning the collapse isn't local to where reward was applied (see "Does outcome-based RL diversity loss spread across unsolved problems?"). A KL penalty is the standard brake on this sharpening, but the work here suggests the brake and the gas pedal are fighting over the same quantity (entropy, or how concentrated the probability mass is), which is why diversity restoration often needs *separate* mechanisms, such as UCB-style exploration bonuses during training or repetition penalties at test time, rather than just tuning regularization strength.
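The brake-versus-gas picture can be sketched with the textbook closed form for a KL-regularized objective: the maximizer of E[r] - beta * KL(pi || pi_ref) is pi*(y) proportional to pi_ref(y) * exp(r(y) / beta). The toy below sweeps beta for that idealized optimum only. The prior, the rewards, and the assumption that the dominant format is slightly better rewarded are all hypothetical, and the result says nothing about first-epoch training dynamics, so it is not a substitute for the controlled sweep the corpus lacks.

```python
import math

def kl_regularized_optimum(reference, rewards, beta):
    """Closed-form maximizer of E_pi[r] - beta * KL(pi || reference).

    Computed in log space for stability; assumes every reference probability is > 0.
    """
    logits = [math.log(q) + r / beta for q, r in zip(reference, rewards)]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

reference = [0.6, 0.3, 0.1]  # hypothetical prior over three formats
rewards = [1.0, 0.9, 0.9]    # hypothetical: the dominant format earns slightly more

sweep = []
for beta in (10.0, 1.0, 0.1, 0.01):  # strong penalty down to weak penalty
    policy = kl_regularized_optimum(reference, rewards, beta)
    sweep.append((beta, policy, entropy(policy)))
    print(f"beta={beta:<5} policy={[round(p, 3) for p in policy]} "
          f"entropy={entropy(policy):.3f}")
```

In this toy, a large beta leaves the policy close to the prior and a small beta lets the reward term concentrate almost all mass on one format, so both terms act on the same quantity. Whether real RL runs track this idealized optimum is exactly what the notes leave open.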

Two adjacent notes sharpen the picture. First, the pretrained prior, not the algorithm, sets the ceiling: critic-free PPO matches or surpasses more complex algorithms once you add advantage normalization and token-level loss aggregation, and most RL techniques are highly setup-sensitive (see "Can two simple techniques match complex RL algorithms?"). That implies KL strength is one knob in a coupled system whose behavior won't generalize across setups, so "how much does it matter" likely has no single answer. Second, when group-relative normalization meets nearly impossible problems, you get degenerate collapse of a different flavor: rare accidental successes are treated as high-advantage trajectories, which pushes models into shortcut behaviors (answer repetition, skipped computation) that contaminate prior capabilities (see "Do overly hard RLVR samples actually harm model capabilities?"). Collapse, in other words, has multiple causes, and the KL penalty only touches one of them.
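The second failure mode is easy to see in arithmetic. The sketch below assumes a common form of group-relative normalization (reward minus group mean, divided by group standard deviation); the exact normalization used in the cited work, the group size of eight, and the binary rewards are assumptions made for illustration.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group (assumed mean/std form)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical groups of 8 sampled answers with binary correctness rewards.
hard_prompt_rewards = [0, 0, 0, 0, 0, 0, 0, 1]      # one accidental success
moderate_prompt_rewards = [1, 0, 1, 0, 1, 0, 1, 0]  # half succeed

hard_adv = group_relative_advantages(hard_prompt_rewards)
moderate_adv = group_relative_advantages(moderate_prompt_rewards)

print("lucky success on hard prompt:   ", round(hard_adv[-1], 3))
print("typical success on moderate one:", round(max(moderate_adv), 3))
```

With one success in a group of n, the normalized advantage of that sample works out to about sqrt(n - 1), so the rarer the success, the harder it is reinforced, whatever trajectory produced it. No KL coefficient appears anywhere in this calculation, which is consistent with the point that the KL penalty touches only one cause of collapse.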

The thing you didn't know you wanted to know: format collapse may be less about *how hard you pull toward the pretrained prior* and more about *what the prior already prefers and how reward sharpens it*. The KL penalty regulates distance from a distribution that is itself the source of the dominant format — so treating it as the primary collapse dial may be aiming at the wrong lever. If you want to dig into the actual mechanism, the format-convergence and diversity-transfer notes are the doorways.


Sources (4 notes)

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Does outcome-based RL diversity loss spread across unsolved problems?

RL that rewards only final answer correctness sharpens the policy globally, concentrating probability mass on correct trajectories for solved problems while simultaneously reducing diversity on unsolved ones. Historical exploration (training diversity via UCB-style bonuses) and batch exploration (test-time diversity via repetition penalties) require structurally different mechanisms.

Can two simple techniques match complex RL algorithms?

Advantage normalization and token-level loss aggregation allow critic-free PPO to surpass more complex algorithms. Systematic evaluation shows most RL techniques are setup-sensitive; the pretrained prior, not algorithm choice, sets performance ceiling.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM researcher evaluating whether KL penalty strength truly governs format collapse during RL, or whether the mechanism is more complex. A curated library (Feb 2024–May 2026) explored this question — here are their dated findings:

**What a curated library found — and when (dated claims, not current truth):** Findings span early 2024 through mid-2026.
• RL converges on a single dominant *pretraining* format within the first epoch; the winning format depends on model scale, not necessarily on which format performs best (~2025).
• KL penalty pulls policy toward the pretrained distribution, but that prior's dominant mode is what RL amplifies — stronger KL doesn't obviously preserve diversity (~2025).
• Outcome-based reward sharpens policy globally; diversity loss *transfers* from solved to unsolved problems, suggesting KL alone cannot arrest collapse (~2025).
• Critic-free PPO with advantage normalization and token-level loss aggregation matches or surpasses more complex algorithms; KL strength is one knob in a coupled, setup-sensitive system (~2025).
• Degenerate collapse (answer repetition, skipped computation) arises from overly hard samples independent of KL tuning (~2026).

**Anchor papers (verify; mind their dates):**
- arXiv:2504.07912 (Apr 2025) — "Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining"
- arXiv:2509.06941 (Sep 2025) — "Outcome-based Exploration for LLM Reasoning"
- arXiv:2605.28388 (May 2026) — "Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs"
- arXiv:2508.08221 (Aug 2025) — "Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning"

**Your task:**
(1) **RE-TEST THE KL LEVER.** For each claim above, ask: have newer models, scaled training runs, orthogonal regularizers (entropy bonuses, repetition penalties), or more granular reward shaping *since relaxed* the constraint that KL strength alone cannot control collapse? Separate the durable question (does format collapse have multiple causes?) from the perishable finding (KL is ineffective). Cite what supersedes the 2025 findings.
(2) **Surface the strongest CONTRADICTING work from the last ~6 months.** Does any paper argue that KL strength *does* govern collapse once other hyperparameters are held constant? If so, what experimental design made the difference?
(3) **Propose 2 research questions that assume the regime may have moved:** (a) If KL strength is decoupled from format diversity, what *composite* regularization strategy (KL + entropy + reward shaping) minimally trades reward for diversity? (b) Do format preferences emerge at different model scales under identical KL regimes, and if so, can we predict which format wins *before* training?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
