Does training order reshape how models handle different task types?
Explores whether the sequence of multi-task RL training systematically affects model capabilities across structured and creative domains, and whether this ordering effect can be predicted and optimized.
The standard framing in "Does policy entropy collapse limit reasoning performance in RL?" treats entropy collapse as a uniform phenomenon: RL training decreases entropy. Omni-Thinker (2025) shows that the effect is domain-dependent: structured domains (math, coding) decrease output entropy, while open-ended domains (creative writing, dialogue) increase it.
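To make "output entropy" concrete, here is a minimal sketch of one common way to measure it: the Shannon entropy of each next-token distribution, averaged over an output. The function names and the toy four-token distributions are illustrative assumptions, not taken from Omni-Thinker.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_output_entropy(step_distributions):
    """Average per-token entropy across the decoding steps of one output."""
    return sum(token_entropy(p) for p in step_distributions) / len(step_distributions)

# Toy distributions over a 4-token vocabulary (illustrative numbers only).
peaked = [[0.97, 0.01, 0.01, 0.01]] * 3  # confident policy, as after structured-domain RL
spread = [[0.25, 0.25, 0.25, 0.25]] * 3  # diffuse policy, as after open-ended RL

print(mean_output_entropy(peaked) < mean_output_entropy(spread))
```

Tracking this quantity per domain during training is what would show the two directions of change described above.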
This makes training order a mechanistic variable, not just a scheduling convenience. If you train creative writing first and structured reasoning second, the structured training will collapse the entropy that creative training expanded, potentially degrading creative capability. If you train structured reasoning first and creative writing second, the creative training preserves and expands the model's expressive range. The ordering effect is predictable from backward transfer (BWT) measurements.
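As a rough illustration of how BWT measurements could drive an ordering, the sketch below computes BWT using the standard continual-learning definition (the average change in earlier-task scores after the final stage) and then brute-forces the task order with the least summed forgetting. The `drop` table, its numbers, and the assumption that pairwise forgetting adds up are all hypothetical simplifications, not the procedure used in Omni-Thinker.

```python
from itertools import permutations

def backward_transfer(R):
    """BWT for one fixed order: R[i][j] is the score on task j after training stage i."""
    T = len(R)
    return sum(R[T - 1][j] - R[j][j] for j in range(T - 1)) / (T - 1)

def best_order(tasks, drop):
    """Exhaustively pick the order with the least summed pairwise forgetting.

    drop[(a, b)] is a hypothetical measured score loss on task a when task b
    is trained after it. Effects are assumed additive, which is a simplification.
    """
    def cost(order):
        return sum(drop[(a, b)]
                   for i, a in enumerate(order)
                   for b in order[i + 1:])
    return min(permutations(tasks), key=cost)

tasks = ["math", "code", "creative"]
drop = {  # illustrative numbers only
    ("math", "code"): 0.4, ("code", "math"): 0.6,
    ("math", "creative"): 0.2, ("code", "creative"): 0.3,
    ("creative", "math"): 4.0, ("creative", "code"): 3.5,
}
order = best_order(tasks, drop)
print(order)  # ('math', 'code', 'creative')
```

With these toy numbers the structured tasks come first and the creative task last, matching the ordering argued for above. Exhaustive search is only practical for a handful of tasks.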
Omni-Thinker uses BWT-guided scheduling: order tasks so that training on later tasks causes minimal negative backward transfer to (that is, minimal forgetting of) earlier tasks. The approach uses hybrid rewards, verifiable (rule-based) for deterministic domains and preference-based (LLM-as-Judge) for subjective domains, enabling unified training across domain types within a single policy. The short-form QA tasks condition on distractors to reduce reward hacking from random guessing.
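The hybrid-reward idea can be sketched as a dispatcher that routes each sample to a rule-based check or to a judge depending on its domain. Everything here is an assumption for illustration: the domain labels, the exact-match check (real code rewards would more plausibly run tests), and `stub_judge`, which stands in for an LLM-as-Judge call and is not a real scoring model.

```python
from typing import Callable, Optional

VERIFIABLE = {"math", "code"}  # assumed labels for deterministic domains

def verifiable_reward(response: str, reference: str) -> float:
    """Rule-based check: 1.0 on a normalized exact match, else 0.0."""
    return float(response.strip().lower() == reference.strip().lower())

def hybrid_reward(domain: str, response: str,
                  reference: Optional[str] = None,
                  judge: Optional[Callable[[str], float]] = None) -> float:
    """Route to the rule-based reward or to the preference-based judge."""
    if domain in VERIFIABLE:
        if reference is None:
            raise ValueError("verifiable domains need a reference answer")
        return verifiable_reward(response, reference)
    if judge is None:
        raise ValueError("subjective domains need a judge callable")
    return judge(response)

def stub_judge(response: str) -> float:
    """Placeholder for an LLM-as-Judge call: scores lexical variety only."""
    words = response.split()
    return len(set(words)) / len(words) if words else 0.0

print(hybrid_reward("math", " 42 ", reference="42"))        # 1.0
print(hybrid_reward("creative", "a b a b", judge=stub_judge))  # 0.5
```

Keeping both reward types behind one function is what lets a single policy be trained across domain types without separate training loops.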
The gains are substantial: 6.2% over joint multi-task training, 12.4% over model merging. The accuracy of final multi-task models is well-predicted by forgettability rankings, even under simplifying assumptions — suggesting BWT-guided scheduling has principled theoretical grounding.
This extends "Does gradually tightening token budgets beat fixed budget training?" from temporal budgets to task ordering: the dimension that matters for multi-task RL is not just how much compute each task gets, but which tasks come first. It also enriches the understanding of entropy collapse, which is not a bug to fix everywhere: in structured domains, it reflects desirable precision. The problem arises when structured-domain entropy collapse propagates and damages open-ended capabilities.
Inquiring lines that use this note as a source 90
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- Can world models form from aggregated partial information across training distributions?
- Does the heuristic dominance ratio vary predictably across model architectures?
- What makes some tasks bounded enough for reliable RL?
- How do training objectives shape what a world model actually learns?
- What causes models to develop domain capability cliffs after specialization?
- How much of the combinatorial task space must training data cover?
- Why does full multi-task fine-tuning perform worse than sequential training?
- What task structures benefit most from geometric parameter merging?
- Does task ordering affect multi-task reinforcement learning outcomes?
- Why does training data format matter more than domain content?
- Can demo placement be tuned as a task-specific hyperparameter?
- How do ordering effects compound across different prompt component scales?
- Do different domains require different types of model investment?
- Do task-specific heuristics improve gradually or appear suddenly at scale?
- Does constraining AI access during early task phases preserve skill formation?
- How do task difficulty and skill type interact in model performance?
- Can in-context learning replicate the timing effects that RL teaches models?
- How can weak-to-strong progressive training target planning without interfering with grounding?
- Do different function-calling subtasks have different entropy profiles during training?
- Why does training data format matter more than its domain content?
- What capabilities actually require massive scale versus specialized training regimes?
- Does knowledge structure matter more than knowledge volume for model training?
- How does training data distribution create asymmetric competence across relation types?
- What performance trade-offs emerge when composing multiple independently trained model capabilities?
- Can backward transfer measurements reliably predict optimal multi-task training order?
- How does entropy collapse affect creative capability in multi-task settings?
- Can depth scaling and breadth scaling unlock independent capability axes?
- How do residual connections and layer norm stabilize training in deep RL?
- Does specialized training in one domain create capability cliffs elsewhere?
- Can diversity-aware RL objectives prevent format convergence?
- Can smaller models achieve domain expertise through focused RL training?
- Does RL refine existing knowledge or discover entirely new capabilities?
- Can negative reinforcement alone match full RL performance on domain tasks?
- How does RL compress reasoning path diversity during training?
- What makes software engineering environments better suited for RL than other interactive domains?
- Which recipe choices determine the asymptotic ceiling in RL training?
- Does self-generated training data reduce a model's capability diversity?
- Can RL format selection explain performance gains attributed to algorithmic improvements?
- What distinguishes RL that creates new capabilities from RL that merely teaches timing?
- Can continuous spectrum training outperform sequential SFT-then-RL stages?
- Why does training order matter across different domain types?
- Can models converge on similar experience descriptions across different architectures?
- What makes pretraining composition more important than reward engineering?
- How do RL training and base models differ in creating MI peaks?
- Does training on granular tasks beat training on the full function calling problem?
- Does format-based pretraining determine how models respond to reinforcement learning?
- Does critique training improve exploration diversity during model training or only test time?
- Can trajectory quality filtering improve model training in noisy environments?
- How do chunk-based step segmentation and trajectory structure modeling differ?
- How does trajectory burstiness compare to other structural properties that shape emergent capabilities?
- How do gradients flowing through both branches simultaneously reshape each component's role?
- How do self-evolving curricula help RL break beyond base model capability boundaries?
- What structural differences emerge between early generic skills and later meta-strategy skills?
- What happens when you project the same model onto different harnesses?
- Why does RL behavior differ between standard reasoning tasks and complex planning domains?
- Do interaction effects between research mechanisms depend on the task domain?
- How does a challenger's escalating difficulty function as curriculum?
- How does consolidation schedule order affect final memory quality?
- How should skill libraries coordinate with gradient-based weight optimization?
- Can explicitly optimizing for semantic diversity during RL training improve both quality and variation?
- Why does the order of training examples matter for what models learn?
- Can training on diverse related tasks be more efficient than task-specific training?
- What scaling properties emerge from RL training dynamics beyond verification?
- Why does specializing to one task make future task learning harder?
- How does curriculum learning prevent instability in social-emotional RL training?
- How does credit assignment across objectives differ from credit assignment across time?
- How does absolute-advantage weighting concentrate training on boundary cases?
- Why do single-turn RL methods fail to generalize to multi-turn tasks?
- How should multi-objective post-training balance competing behavioral goals?
- Why do overtrained domains show different RL training outcomes than novel tasks?
- What training duration is actually needed for RL to expand capabilities?
- What's the difference between RLHF, RLVR, and RLCF as training paradigms?
- How does prolonged RL training differ from standard RLVR approaches?
- Why does supervised fine-tuning on diverse demonstrations expand exploration diversity compared to RL?
- Does the pretrained model prior limit RL search capability more than the optimization algorithm itself?
- How does stage-wise training scheduling resolve conflicts between constraint-following and creative tasks?
- Why does decomposition ability transfer across domains but solving ability does not?
- Why does curriculum order matter when information theory says data order is irrelevant?
- Can architectural changes reorder when uncertainty and empowerment signals influence decisions?
- Why does outcome-based RL specifically lose diversity during training?
- Can a single Elo ranking represent multidimensional model capability?
- How does training order affect knowledge acquisition in language models?
- How do finetuning and pretraining improvements differ in their effects on model capabilities?
- How do task frequency and complexity interact with model capacity during training?
- Can intentional data-mixture design replace model scaling for rare task learning?
- Does task diversity in pretraining data transfer reasoning better than larger models?
- How do weight visualizations reveal temporal structure in cyclic training?
- Can training order and structure shape what networks retain and learn?
- How does model scale affect anticipatory behavior in structured training?
- Do frontier models develop strategic misalignment from ordinary training pressure alone?
Related concepts in this collection 5
- Does policy entropy collapse limit reasoning performance in RL?
As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
entropy dynamics are domain-dependent, not uniformly negative; structured tasks decrease entropy while creative tasks increase it
- Does gradually tightening token budgets beat fixed budget training?
Can models learn reasoning more efficiently by starting with generous token allowances and progressively constraining them, rather than training with fixed budgets from the start? This matters because it addresses how to teach models to think effectively while remaining concise.
BWT-guided scheduling extends curriculum insight from temporal budgets to task ordering
- Why do reasoning models fail differently at training versus inference?
Reasoning models exhibit two distinct failure modes—entropy collapse during training and variance inflation during inference—that appear unrelated but may share underlying causes. Understanding these dual problems could reveal whether separate or unified solutions are needed.
Omni-Thinker adds that entropy direction depends on task type, further complicating the dual problem
- Does RL training collapse format diversity in pretrained models?
Exploring whether RL fine-tuning systematically selects one output format from pretraining while suppressing others, and how this selection mechanism drives performance gains.
multi-task training with BWT scheduling may partially address format convergence by exposing the model to diverse task types
- Can isolating task-specific parameters prevent multi-task fine-tuning interference?
Explores whether identifying and protecting task-specific parameter regions can prevent the performance degradation that occurs when fine-tuning models on multiple tasks simultaneously. This matters because it could enable safe multi-task adaptation without sacrificing individual task performance.
complementary multi-task approach: CPI-FT addresses interference through spatial parameter isolation while Omni-Thinker uses temporal task ordering; CPI-FT shows temporal scheduling alone is insufficient, suggesting combining both spatial isolation and BWT-guided ordering could further improve multi-task training
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Omni-Thinker: Scaling Multi-Task RL in LLMs with Hybrid Reward and Task Scheduling
- From Trial-and-Error to Improvement: A Systematic Analysis of LLM Exploration Mechanisms in RLVR
- On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models
- Reinforcement Learning with Rubric Anchors
- Eliciting Reasoning in Language Models with Cognitive Tools
- Post-training makes large language models less human-like
- RL Squeezes, SFT Expands: A Comparative Study of Reasoning LLMs
- The Art of Scaling Reinforcement Learning Compute for LLMs
Original note title
multi-task rl reveals complementary entropy dynamics — structured domains systematically decrease output entropy while creative domains increase it making training order a mechanistic variable