INQUIRING LINE


When an AI learns math and creativity together, the pressure to find right answers quietly crushes its creative range.

How does entropy collapse affect creative capability in multi-task settings?

This explores what happens to a model's open-ended, generative ability when reinforcement learning drives its output distribution to collapse — specifically when structured and creative tasks are trained together.

Entropy collapse is the narrowing of a model's output distribution during RL training. For a model juggling structured and open-ended tasks at the same time, the corpus has a sharp answer: the damage isn't uniform, and the order you train things in decides who gets hurt. The central finding comes from Omni-Thinker's work on multi-task RL (see the note "Does training order reshape how models handle different task types?"), which shows the two task types pull entropy in opposite directions. Structured domains (math, code, anything with a verifiable right answer) push output entropy *down* as the policy converges on correct solutions. Creative domains push it *up*, because open-ended generation rewards variety. When you train them jointly, the entropy-lowering pressure of the structured tasks bleeds over and crushes the exploratory range the creative tasks depend on. Their fix is almost embarrassingly simple: train structured tasks first and creative tasks later (BWT-guided scheduling). It yields a 6.2% gain over naive joint training by keeping collapse from spilling into the open-ended capabilities.
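To make the scheduling idea concrete, here is a minimal sketch of how two training orders could be compared by backward transfer (BWT). It assumes the standard continual-learning definition of BWT (the average change in score on earlier tasks after all later stages have run); the note does not spell out Omni-Thinker's exact procedure, so treat this as an illustration rather than their method. The score matrices are invented for the example.

```python
from typing import List


def backward_transfer(perf: List[List[float]]) -> float:
    """Average backward transfer over a training sequence.

    perf[i][j] is the score on task j measured after training stage i,
    with stages and tasks listed in the same order. Negative values mean
    later stages eroded performance on earlier tasks.
    """
    n = len(perf)
    if n < 2:
        raise ValueError("need at least two training stages")
    final_scores = perf[n - 1]
    # Compare each earlier task's final score with its score right after
    # the stage that trained it.
    deltas = [final_scores[j] - perf[j][j] for j in range(n - 1)]
    return sum(deltas) / (n - 1)


# Hypothetical scores, invented for illustration only.
# Order A: structured task (index 0) first, creative task (index 1) second.
structured_first = [
    [0.80, 0.40],
    [0.78, 0.70],
]
# Order B: creative task (index 0) first, structured task (index 1) second.
creative_first = [
    [0.70, 0.30],
    [0.55, 0.80],
]

bwt_a = backward_transfer(structured_first)
bwt_b = backward_transfer(creative_first)

# Prefer the order that forgets less, i.e. the one with the higher BWT.
preferred = "structured_first" if bwt_a > bwt_b else "creative_first"
print(f"BWT A={bwt_a:.2f}, BWT B={bwt_b:.2f}, preferred={preferred}")
```

In this toy data the creative task loses far more when structured training comes after it, which is the pattern the note describes; real schedules would need measured scores rather than made-up ones.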

Why does collapse threaten creativity specifically? Because creative capability *is* distributional breadth. The mechanism behind the ceiling is captured by the empirical law in the note "Does policy entropy collapse limit reasoning performance in RL?": performance saturates as policy entropy approaches zero (R = -a·exp(H) + b, where R is performance and H is policy entropy). A near-zero-entropy policy has converged on a narrow band of reward-maximizing outputs, which is fine for problems with one answer and fatal for tasks where the point is to range widely. The same squeeze shows up in "Does reinforcement learning squeeze exploration diversity in search agents?", where RL compresses behavioral diversity in search agents through the identical entropy-collapse mechanism, and SFT on diverse demonstrations is what restores the breadth. The pattern is consistent across domains: RL converges, and convergence is the enemy of generative range.
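A small worked example may help show why this law implies a ceiling. The sketch below computes the Shannon entropy of a toy next-token distribution and plugs it into R = -a·exp(H) + b. The coefficients `A` and `B` and the three distributions are hypothetical values chosen for illustration; in practice a and b would be fitted per model and task.

```python
import math
from typing import Sequence


def shannon_entropy(probs: Sequence[float]) -> float:
    """Entropy in nats of a single probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def predicted_performance(entropy: float, a: float, b: float) -> float:
    """Evaluate R = -a * exp(H) + b for a given entropy H."""
    return -a * math.exp(entropy) + b


# Hypothetical coefficients, not fitted to any real training run.
A, B = 0.1, 0.9

broad = [0.25, 0.25, 0.25, 0.25]      # high-entropy policy
peaked = [0.97, 0.01, 0.01, 0.01]     # mostly collapsed
collapsed = [1.0, 0.0, 0.0, 0.0]      # fully collapsed, H = 0

for name, dist in [("broad", broad), ("peaked", peaked), ("collapsed", collapsed)]:
    h = shannon_entropy(dist)
    r = predicted_performance(h, A, B)
    print(f"{name}: H={h:.3f} nats, predicted R={r:.3f}")

# At H = 0, exp(H) = 1, so the law tops out at R = b - a.
ceiling = predicted_performance(0.0, A, B)
```

Under this form, every unit of entropy spent buys performance, and once H reaches zero there is nothing left to spend: the predicted score stops at b - a.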

The most useful twist for a multi-task setting is that collapse is *domain-conditional*, not absolute. The note "Does preference tuning always reduce diversity the same way?" shows RLHF actually *reduces* lexical-syntactic diversity in code while *increasing* it in creative writing, because each domain incentivizes a different thing (code rewards converging on the correct solution; creative writing rewards standing out). So 'entropy collapse hurts creativity' is too blunt. The real claim is that mixing a convergence-rewarding domain with a divergence-rewarding one in the same training run lets the convergent pressure dominate unless you deliberately separate or schedule them.
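One simple way to check for domain-conditional effects is to measure output diversity per domain rather than in aggregate. The sketch below uses distinct-n (the share of unique n-grams across a set of samples) as a rough lexical-diversity proxy. This is an assumption for illustration: the note does not say which metric the underlying study used, and the sample strings are invented.

```python
from typing import Iterable, List, Tuple


def distinct_n(texts: Iterable[str], n: int = 2) -> float:
    """Share of unique n-grams among all n-grams across the samples.

    Tokenization is a plain whitespace split, which is crude but enough
    to compare sets of samples against each other.
    """
    grams: List[Tuple[str, ...]] = []
    for text in texts:
        tokens = text.lower().split()
        grams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(grams)) / len(grams) if grams else 0.0


# Invented samples standing in for repeated generations per domain.
code_like = ["return a + b", "return a + b", "return a + b"]
story_like = ["the fox slept late", "rain fell on glass", "she sang off key"]

per_domain = {
    "code": distinct_n(code_like),
    "creative": distinct_n(story_like),
}
print(per_domain)
```

Tracking a score like this separately for each domain, before and after tuning, would show whether diversity moved in opposite directions, which a single pooled number would hide.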

There's a deeper, more unsettling thread worth pulling. The note "Is the exploration-exploitation trade-off actually fundamental?" argues that the trade-off we assume is fundamental is actually a measurement artifact: using hidden-state Effective Rank metrics, it finds near-zero correlation between exploration and exploitation, with the apparent conflict appearing only at the token level. If that holds, the loss of creative range under multi-task RL may be a self-inflicted scheduling and measurement problem rather than an iron law, and you might be able to enhance both at once (their VERL method claims simultaneous gains). "Can identical outputs hide broken internal representations?" adds a warning that cuts the other way: two models can post identical task scores while one harbors fractured, entangled internal representations that block exactly the kind of recombination creativity depends on. So a model that looks fine on your multi-task benchmark may have already quietly lost the structural flexibility that makes creative transfer possible. Entropy at the output is only one place collapse hides.
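For readers unfamiliar with effective rank, the sketch below shows one common definition: the exponential of the Shannon entropy of the normalized singular values of a matrix. Whether the cited work uses exactly this form is an assumption. To stay dependency-free, the function takes singular values as input; in practice they could come from something like `numpy.linalg.svd(hidden_states, compute_uv=False)` applied to a matrix of hidden states, and the example values here are made up.

```python
import math
from typing import Sequence


def effective_rank(singular_values: Sequence[float]) -> float:
    """Entropy-based effective rank of a matrix, given its singular values.

    Returns a value between 1 (all energy in one direction) and the number
    of nonzero singular values (energy spread evenly).
    """
    total = sum(singular_values)
    if total <= 0:
        raise ValueError("singular values must have a positive sum")
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


# Made-up spectra standing in for hidden-state matrices.
spread_out = [1.0, 1.0, 1.0, 1.0]     # representation uses four directions equally
collapsed = [5.0, 0.0, 0.0, 0.0]      # representation uses a single direction
skewed = [4.0, 1.0, 0.5, 0.5]         # somewhere in between

for name, sv in [("spread_out", spread_out), ("collapsed", collapsed), ("skewed", skewed)]:
    print(f"{name}: effective rank = {effective_rank(sv):.2f}")
```

The point of a representation-level measure like this is that it can move independently of token-level entropy, so a model could look collapsed at the output while its hidden states still span many directions, or the reverse.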

The thing you didn't know you wanted to know: creative degradation in multi-task RL isn't a side effect of getting better at hard tasks — it's a *contamination* effect, where one domain's reward signal poisons another's, and it's largely controllable through training order and demonstration diversity rather than something you have to trade away.

Sources 6 notes

Does training order reshape how models handle different task types?

Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.

Does policy entropy collapse limit reasoning performance in RL?

Empirical law R = -a·exp(H) + b shows performance saturates when policy entropy approaches zero. Interventions like Clip-Cov, KL-Cov, and GPPO preserve exploratory capacity by managing entropy reduction during training.

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.

Does preference tuning always reduce diversity the same way?

RLHF reduces lexical-syntactic diversity in code generation but increases it in creative writing. The direction depends on what each domain incentivizes: code rewards convergence toward correct solutions, while creative writing rewards stylistic distinctiveness.

Is the exploration-exploitation trade-off actually fundamental?

Hidden-state analysis using Effective Rank metrics shows near-zero correlation between exploration and exploitation, revealing the trade-off emerges only at token level. VERL demonstrates simultaneous enhancement achieving 21.4% accuracy gains on Gaokao 2024.


Can identical outputs hide broken internal representations?

Networks trained with SGD reproduce outputs perfectly while having radically different internal structure than evolved networks, with weight perturbations revealing fractured, entangled representations that prevent transfer to novel contexts or creative recombination.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst tasked with re-testing a curated library's claims about entropy collapse in multi-task LLM training. The question remains open: does entropy collapse uniformly degrade creative capability, or is the damage conditional on task structure and training order?

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026, with the core claims anchored in 2025:
• Structured tasks (math, code) push entropy *down* via convergence; creative tasks push it *up*. Joint training lets structured-task pressure bleed into creative domains, crushing exploratory range (Omni-Thinker, ~2025).
• Training structured tasks first, then creative tasks (BWT-guided scheduling) recovers 6.2% multi-task performance by containing entropy collapse (arXiv:2507.14783, 2025-07).
• RLHF's entropy effect is domain-conditional: reduces lexical-syntactic diversity in code, increases it in creative writing (~2025).
• The exploration-exploitation trade-off may be a token-level measurement artifact; near-zero correlation at hidden-state Effective Rank level (arXiv:2509.23808, 2025-09).
• Two models can show identical task scores while one harbors fractured representations that block creative recombination (arXiv:2505.11581, 2025-05).

Anchor papers (verify; mind their dates):
• arXiv:2507.14783 (Omni-Thinker, 2025-07): multi-task RL scheduling & entropy dynamics
• arXiv:2509.23808 (Beyond Exploration-Exploitation, 2025-09): hidden-state Effective Rank
• arXiv:2505.11581 (Fractured Entangled Representations, 2025-05): representation quality vs. task scores
• arXiv:2505.22617 (Entropy Mechanism, 2025-05): RL entropy collapse in reasoning

Your task:
(1) RE-TEST THE DOMAIN-CONDITIONAL COLLAPSE CLAIM. Does the 6.2% BWT-guided scheduling gain hold under newer multi-task orchestration (memory-augmented, tool-integrated, or agentic setups)? Separately: has output-level entropy measurement been superseded by representation-level diagnostics (Effective Rank, singular values) that might show collapse *absent* at the output? Plainly separate what still constrains multi-task training from what newer methods or evals have relaxed.
(2) Surface the strongest work from the last ~6 months that *contradicts* the "structured tasks poison creative tasks" model—either showing simultaneous entropy gains, or arguing the trade-off is an artifact of how we reward/schedule, not fundamental.
(3) Propose two research questions that assume the regime has shifted: (a) If representation-level metrics decouple from output entropy, what *is* the actual bottleneck for creative transfer in multi-task settings? (b) Can adaptive or adversarial scheduling—rather than fixed BWT order—recover both convergence *and* diversity in a single pass?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
