How does entropy collapse affect creative capability in multi-task settings?
This explores what happens to a model's open-ended, generative ability when reinforcement learning drives its output distribution to collapse — specifically when structured and creative tasks are trained together.
Entropy collapse is the narrowing of a model's output distribution during RL training. When it hits a model that is juggling structured and open-ended tasks at the same time, the corpus has a sharp answer: the damage isn't uniform, and the order you train things in decides who gets hurt. The central finding comes from Omni-Thinker's work on multi-task RL (see "Does training order reshape how models handle different task types?"), which shows the two task types pull entropy in opposite directions. Structured domains (math, code, anything with a verifiable right answer) push output entropy *down* as the policy converges on correct solutions. Creative domains push it *up*, because open-ended generation rewards variety. When you train them jointly, the entropy-lowering pressure of the structured tasks bleeds over and crushes the exploratory range the creative tasks depend on. Their fix is almost embarrassingly simple: train structured tasks first and creative tasks later, with the order guided by backward transfer (BWT). That schedule yields a 6.2% gain over naive joint training by keeping collapse from spilling into the open-ended capabilities.
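One way to watch the opposing entropy pressures described above is to log mean next-token entropy separately for each domain during training. The sketch below is a minimal illustration, not Omni-Thinker's code: the function names, the domain labels, and the toy logits are all assumptions made for the example.

```python
import numpy as np

def mean_token_entropy(logits):
    """Mean Shannon entropy (in nats) over positions.

    logits: array of shape (positions, vocab_size), unnormalized scores.
    """
    logits = np.asarray(logits, dtype=np.float64)
    # Subtract the row max before exponentiating for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    probs = np.exp(log_probs)
    return float(-(probs * log_probs).sum(axis=-1).mean())

def entropy_by_domain(batches):
    """batches: dict mapping a (hypothetical) domain name to a logits array."""
    return {name: mean_token_entropy(logits) for name, logits in batches.items()}

# Toy data: a sharply peaked policy versus a uniform one.
peaked = np.array([[10.0, 0.0, 0.0, 0.0]] * 3)
flat = np.zeros((3, 4))
report = entropy_by_domain({"math": peaked, "creative_writing": flat})
print(report)
```

If the creative-domain entropy falls in step with the structured-domain entropy during joint training, that is the bleed-over the paragraph describes.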
Why does collapse threaten creativity specifically? Because creative capability *is* distributional breadth. The mechanism behind the ceiling is captured by the empirical law in "Does policy entropy collapse limit reasoning performance in RL?": R = -a·exp(H) + b, where R is performance, H is policy entropy, and a and b are fitted coefficients. Performance saturates as H approaches zero, topping out at b - a. A near-zero-entropy policy has converged on a narrow band of reward-maximizing outputs, which is fine for problems with one answer and fatal for tasks where the point is to range widely. The same squeeze shows up in "Does reinforcement learning squeeze exploration diversity in search agents?", where RL compresses behavioral diversity in search agents through the identical entropy-collapse mechanism, and SFT on diverse demonstrations is what restores the breadth. The pattern is consistent across domains: RL converges, and convergence is the enemy of generative range.
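The law R = -a·exp(H) + b can be made concrete with a few lines of Python. This is a sketch under stated assumptions: the coefficient values below are invented purely for illustration (in practice a and b would be fitted to a training run), and the helper names are hypothetical.

```python
import math

def predicted_performance(entropy, a, b):
    """Evaluate R = -a * exp(H) + b for policy entropy H."""
    return -a * math.exp(entropy) + b

def performance_ceiling(a, b):
    """Value of the law at H = 0, where exp(H) = 1, so R = b - a."""
    return predicted_performance(0.0, a, b)

def remaining_headroom(entropy, a, b):
    """Gain still available by spending the remaining entropy: a * (exp(H) - 1)."""
    return performance_ceiling(a, b) - predicted_performance(entropy, a, b)

# Illustrative coefficients only; not taken from any paper.
a, b = 0.2, 0.9
for h in (1.0, 0.5, 0.1, 0.0):
    print(h, predicted_performance(h, a, b), remaining_headroom(h, a, b))
```

The headroom term shrinks to zero as H does, which is the saturation the paragraph refers to: once entropy is spent, the law predicts no further gains.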
The most useful twist for a multi-task setting is that collapse is *domain-conditional*, not absolute. "Does preference tuning always reduce diversity the same way?" shows that RLHF *reduces* lexical-syntactic diversity in code while *increasing* it in creative writing, because each domain incentivizes a different thing (code rewards converging on the correct solution; creative writing rewards standing out). So 'entropy collapse hurts creativity' is too blunt. The real claim is that mixing a convergence-rewarding domain with a divergence-rewarding one in the same training run lets the convergent pressure dominate unless you deliberately separate or schedule them.
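To check domain-conditional diversity on your own samples, a simple lexical proxy such as distinct-n (unique n-grams divided by total n-grams) can be computed per domain. This is an assumed stand-in, not the metric used in the cited note, and whitespace tokenization is a simplification.

```python
def distinct_n(texts, n=2):
    """Fraction of n-grams across `texts` that are unique (0.0 if there are none)."""
    total = 0
    unique = set()
    for text in texts:
        tokens = text.lower().split()  # naive whitespace tokenization (assumption)
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0

# Toy samples: identical outputs versus fully distinct ones.
repeated = ["return x + y", "return x + y"]
varied = ["the moon hums softly", "rust gathers on quiet rails"]
print(distinct_n(repeated), distinct_n(varied))
```

Comparing this score before and after tuning, separately for code and creative prompts, gives a rough view of whether diversity moved in opposite directions.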
There's a deeper, more unsettling thread worth pulling. "Is the exploration-exploitation trade-off actually fundamental?" argues that the trade-off we take for granted is a measurement artifact: using hidden-state Effective Rank metrics, it finds near-zero correlation between exploration and exploitation, with the apparent conflict appearing only at the token level. If that holds, the loss of creative range under multi-task RL may be a self-inflicted scheduling and measurement problem rather than an iron law, and you might be able to enhance both at once (their VERL method claims simultaneous gains). "Can identical outputs hide broken internal representations?" adds a warning that cuts the other way: two models can post identical task scores while one harbors fractured, entangled internal representations that block exactly the kind of recombination creativity depends on. So a model that looks fine on your multi-task benchmark may have already quietly lost the structural flexibility that makes creative transfer possible. Entropy at the output is only one place collapse hides.
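For readers who want to look below the token level, effective rank can be computed from a matrix of hidden states. The sketch below uses one common definition (the exponential of the Shannon entropy of the normalized singular values); whether the cited work uses exactly this variant, or centers the states first, is an assumption here.

```python
import numpy as np

def effective_rank(hidden_states, eps=1e-12):
    """Effective rank of a (tokens, hidden_dim) matrix.

    Defined as exp(entropy of the singular values normalized to sum to 1).
    No mean-centering is applied (an assumption of this sketch).
    """
    h = np.asarray(hidden_states, dtype=np.float64)
    s = np.linalg.svd(h, compute_uv=False)
    total = s.sum()
    if total <= eps:
        return 0.0
    p = s / total
    p = p[p > eps]  # drop numerically zero directions so log(p) stays finite
    return float(np.exp(-(p * np.log(p)).sum()))

# Toy checks: four orthogonal directions versus a single repeated direction.
spread = np.eye(4)
collapsed = np.outer(np.ones(5), np.array([1.0, 2.0, 3.0]))
print(effective_rank(spread), effective_rank(collapsed))
```

A representation that uses many directions evenly scores near its dimension, while one that has collapsed onto a single direction scores near 1, regardless of what the output-token entropy looks like.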
The thing you didn't know you wanted to know: creative degradation in multi-task RL isn't a side effect of getting better at hard tasks — it's a *contamination* effect, where one domain's reward signal poisons another's, and it's largely controllable through training order and demonstration diversity rather than something you have to trade away.
Sources (6 notes)
Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.
Empirical law R = -a·exp(H) + b shows performance saturates when policy entropy approaches zero. Interventions like Clip-Cov, KL-Cov, and GPPO preserve exploratory capacity by managing entropy reduction during training.
RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.
RLHF reduces lexical-syntactic diversity in code generation but increases it in creative writing. The direction depends on what each domain incentivizes: code rewards convergence toward correct solutions, while creative writing rewards stylistic distinctiveness.
Hidden-state analysis using Effective Rank metrics shows near-zero correlation between exploration and exploitation, revealing the trade-off emerges only at token level. VERL demonstrates simultaneous enhancement achieving 21.4% accuracy gains on Gaokao 2024.
Networks trained with SGD reproduce outputs perfectly while having radically different internal structure than evolved networks, with weight perturbations revealing fractured, entangled representations that prevent transfer to novel contexts or creative recombination.