SYNTHESIS NOTE

Does training order reshape how models handle different task types?

Explores whether the sequence of multi-task RL training systematically affects model capabilities across structured and creative domains, and whether this ordering effect can be predicted and optimized.

Synthesis note · 2026-02-22 · sourced from Reward Models
How should we allocate compute budget at inference time? How do you build domain expertise into general AI models?

The standard framing in "Does policy entropy collapse limit reasoning performance in RL?" treats entropy collapse as uniform, with RL training always decreasing entropy. Omni-Thinker (2025) reports that the effect is instead domain-dependent: structured domains (math, coding) decrease output entropy, while open-ended domains (creative writing, dialogue) increase it.
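To make "output entropy" concrete, here is a minimal sketch of one common way to measure it: the Shannon entropy of the next-token distribution, averaged over the generated positions. The function names and the use of raw per-step logits are assumptions for illustration; the source does not say how Omni-Thinker computes the quantity.

```python
import math
from typing import Sequence


def token_entropy(logits: Sequence[float]) -> float:
    """Shannon entropy (in nats) of softmax(logits) for one decoding step."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_output_entropy(step_logits: Sequence[Sequence[float]]) -> float:
    """Average per-token entropy over all decoding steps of one response."""
    if not step_logits:
        raise ValueError("need at least one decoding step")
    return sum(token_entropy(l) for l in step_logits) / len(step_logits)


# Hypothetical logits: a peaked distribution (as after structured-domain RL)
# versus a flat one (as after open-ended training).
peaked = [[10.0, 0.0, 0.0, 0.0]]
flat = [[0.0, 0.0, 0.0, 0.0]]
assert mean_output_entropy(peaked) < mean_output_entropy(flat)
```

Tracking this average per domain before and after each training stage is one way to check whether a given stage is collapsing or expanding the policy's distribution.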

This is not a minor observation: it makes training order a mechanistic variable, not just a scheduling convenience. If you train creative writing first and structured reasoning second, the structured training will collapse the entropy that creative training expanded, potentially degrading creative capability. If you train structured reasoning first and creative writing second, the creative training preserves and expands the model's expressive range. The ordering effect is predictable from backward transfer (BWT) measurements, which quantify how much training on a later task changes performance on earlier ones.
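For reference, a minimal sketch of the standard continual-learning definition of BWT: the average, over all tasks except the last, of the score after the final training stage minus the score right after that task was trained. The accuracy matrix below uses made-up numbers, and whether Omni-Thinker uses exactly this formulation is an assumption.

```python
from typing import Sequence


def backward_transfer(acc: Sequence[Sequence[float]]) -> float:
    """BWT for a sequential run over T tasks.

    acc[i][j] is the score on task j measured after finishing training
    stage i (0-indexed). Negative values indicate forgetting.
    """
    n_tasks = len(acc)
    if n_tasks < 2:
        raise ValueError("BWT needs at least two sequential tasks")
    last = n_tasks - 1
    deltas = [acc[last][j] - acc[j][j] for j in range(last)]
    return sum(deltas) / len(deltas)


# Hypothetical two-stage run: task 0 scores 0.80 right after its own stage
# and 0.60 after the second stage, so the second stage caused forgetting.
example = [
    [0.80, 0.10],
    [0.60, 0.70],
]
assert backward_transfer(example) < 0
```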

Omni-Thinker uses BWT-guided scheduling: order tasks so that training on later tasks causes minimal negative backward transfer (forgetting) on earlier ones. The approach uses hybrid rewards, verifiable (rule-based) for deterministic domains and preference-based (LLM-as-Judge) for subjective domains, which enables unified training across domain types within a single policy. The "short-form" QA tasks condition on distractors to reduce reward hacking from random guessing.
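The hybrid reward idea can be sketched as a simple dispatcher that routes each sample to a rule-based check or to a judge score depending on its domain. Everything named here is hypothetical: the domain set, the exact-match rule, and the `judge` callable stand in for whatever verifier and LLM-as-Judge a real setup would use.

```python
from typing import Callable

# Assumed split between verifiable and subjective domains.
VERIFIABLE_DOMAINS = {"math", "code"}


def verifiable_reward(response: str, reference: str) -> float:
    """Rule-based reward: 1.0 on exact match after trimming whitespace."""
    return 1.0 if response.strip() == reference.strip() else 0.0


def hybrid_reward(
    domain: str,
    response: str,
    reference: str,
    judge: Callable[[str], float],
) -> float:
    """Return a reward in [0, 1] from the rule or the judge, by domain."""
    if domain in VERIFIABLE_DOMAINS:
        return verifiable_reward(response, reference)
    # Subjective domain: defer to the preference-based judge, then clamp
    # so both reward sources share one scale.
    return min(1.0, max(0.0, judge(response)))


# Usage with a stub judge (a real judge would call an LLM).
stub_judge = lambda text: 0.5
math_reward = hybrid_reward("math", " 42 ", "42", stub_judge)
writing_reward = hybrid_reward("writing", "A short story.", "", stub_judge)
```

Clamping both sources to the same range is a design choice in this sketch so that a single policy sees comparable reward magnitudes across domains.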

The reported gains are 6.2% over joint multi-task training and 12.4% over model merging. The accuracy of the final multi-task models is well predicted by forgettability rankings, even under simplifying assumptions, which suggests that BWT-guided scheduling has a principled basis.
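One way to picture scheduling from forgettability measurements is a brute-force search over task orders. This is a sketch under a loud simplifying assumption, namely that forgetting is additive over pairs of tasks; it is not Omni-Thinker's documented procedure. The `FORGETTING` table is hypothetical: `FORGETTING[a][b]` is the assumed score drop on task `a` when task `b` is trained after it.

```python
from itertools import permutations
from typing import Dict, Sequence, Tuple

TASKS = ["math", "code", "writing"]

# Made-up pairwise forgetting values for illustration only.
FORGETTING: Dict[str, Dict[str, float]] = {
    "math": {"code": 0.00, "writing": 0.01},
    "code": {"math": 0.02, "writing": 0.01},
    "writing": {"math": 0.08, "code": 0.06},
}


def schedule_cost(order: Sequence[str], forgetting: Dict[str, Dict[str, float]]) -> float:
    """Total assumed forgetting: each earlier task is hurt by every later one."""
    return sum(
        forgetting[order[i]][order[j]]
        for i in range(len(order))
        for j in range(i + 1, len(order))
    )


def best_schedule(tasks: Sequence[str], forgetting: Dict[str, Dict[str, float]]) -> Tuple[str, ...]:
    """Exhaustively pick the order with the lowest total assumed forgetting."""
    return min(permutations(tasks), key=lambda o: schedule_cost(o, forgetting))


order = best_schedule(TASKS, FORGETTING)
# With the table above, structured tasks come first and writing comes last.
```

Exhaustive search is only practical for a handful of tasks; with more tasks, a greedy sort by forgettability would be the obvious cheaper stand-in.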

This extends "Does gradually tightening token budgets beat fixed budget training?" from temporal budgets to task ordering: the dimension that matters for multi-task RL is not just how much compute each task gets, but which tasks come first. It also enriches the understanding of entropy collapse, which is not a bug to fix everywhere: in structured domains, it reflects desirable precision. The problem arises when structured-domain entropy collapse propagates and damages open-ended capabilities.

Original note title

multi-task rl reveals complementary entropy dynamics — structured domains systematically decrease output entropy while creative domains increase it making training order a mechanistic variable