SYNTHESIS NOTE

Does training order reshape how models handle different task types?

Explores whether the sequence of multi-task RL training systematically affects model capabilities across structured and creative domains, and whether this ordering effect can be predicted and optimized.

Synthesis note · 2026-02-22 · sourced from Reward Models

The standard framing in "Does policy entropy collapse limit reasoning performance in RL?" treats entropy collapse as a uniform phenomenon: RL training decreases entropy. Omni-Thinker (2025) shows that the direction is domain-dependent. RL on structured domains (math, coding) decreases output entropy, while RL on open-ended domains (creative writing, dialogue) increases it.
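One way to make the entropy claim concrete is to compare the mean per-token Shannon entropy of the policy's next-token distributions before and after RL on each domain. The sketch below is a minimal illustration, not Omni-Thinker's measurement code: the function names are hypothetical, and the distributions are made-up toy values over a four-token vocabulary.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(distributions: Sequence[Sequence[float]]) -> float:
    """Average per-token entropy over the positions of one sampled response."""
    return sum(token_entropy(d) for d in distributions) / len(distributions)


def entropy_shift(before, after) -> float:
    """Positive: training raised entropy. Negative: training lowered it."""
    return mean_entropy(after) - mean_entropy(before)


# Toy distributions, invented purely for illustration.
uniform = [0.25, 0.25, 0.25, 0.25]   # maximally spread over 4 tokens
peaked = [0.97, 0.01, 0.01, 0.01]    # nearly deterministic
mixed = [0.4, 0.3, 0.2, 0.1]         # stand-in for the pre-RL policy

# Hypothetical "structured" domain: the policy sharpens after RL.
math_shift = entropy_shift(before=[mixed, mixed], after=[peaked, peaked])
# Hypothetical "open-ended" domain: the policy spreads out after RL.
writing_shift = entropy_shift(before=[mixed, mixed], after=[uniform, uniform])
```

In practice the distributions would come from the model's softmax outputs on held-out prompts per domain; the sign of the shift is the quantity of interest here.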

This makes training order a mechanistic variable rather than a scheduling convenience. If creative writing is trained first and structured reasoning second, the structured stage collapses the entropy that the creative stage expanded, potentially degrading creative capability. If structured reasoning is trained first and creative writing second, the creative stage preserves and expands the model's expressive range. The ordering effect is predictable from backward transfer (BWT) measurements, which quantify how training on a later task changes performance on tasks learned earlier.
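For reference, backward transfer is commonly defined in the continual-learning literature as follows (Lopez-Paz and Ranzato, 2017). Whether Omni-Thinker uses exactly this form is an assumption here. Let R_{i,j} denote the score on task j after training through task i in a sequence of T tasks:

```latex
\mathrm{BWT} = \frac{1}{T-1} \sum_{i=1}^{T-1} \left( R_{T,i} - R_{i,i} \right)
```

A negative value means that, on average, earlier tasks score lower at the end of the sequence than they did right after being trained, which is to say they were partly forgotten.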

Omni-Thinker uses BWT-guided scheduling: tasks are ordered so that training on later tasks causes minimal negative backward transfer to earlier ones. The approach uses hybrid rewards, verifiable (rule-based) for deterministic domains and preference-based (LLM-as-Judge) for subjective domains, which enables unified training across domain types within a single policy. The short-form QA tasks condition on distractors to reduce reward hacking from random guessing.
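The scheduling idea can be sketched as a search over task orders. The example below is a simplified illustration under stated assumptions, not the paper's procedure: it assumes pairwise BWT estimates are available and that they add up across pairs, and every number, task name, and function name is hypothetical.

```python
from itertools import permutations
from typing import Dict, List, Sequence, Tuple

# BWT[(earlier, later)]: estimated change in the `earlier` task's score after
# subsequently training on `later`. Negative values mean forgetting.
# All numbers are made up for illustration.
BWT: Dict[Tuple[str, str], float] = {
    ("math", "code"): -0.5, ("math", "writing"): -0.3,
    ("code", "math"): -0.4, ("code", "writing"): -0.2,
    ("writing", "math"): -3.0, ("writing", "code"): -2.5,
}


def total_backward_transfer(order: Sequence[str], bwt) -> float:
    """Sum pairwise BWT over every (earlier, later) pair in the order.

    Assumes pairwise effects are additive, which is a simplification.
    """
    return sum(
        bwt[(order[i], order[j])]
        for i in range(len(order))
        for j in range(i + 1, len(order))
    )


def best_order(tasks: Sequence[str], bwt) -> List[str]:
    """Exhaustive search for the least-forgetting order (fine for few tasks)."""
    return list(
        max(permutations(tasks), key=lambda o: total_backward_transfer(o, bwt))
    )


def forgettability(task: str, tasks: Sequence[str], bwt) -> float:
    """Average score lost on `task` when another task is trained after it."""
    others = [t for t in tasks if t != task]
    return -sum(bwt[(task, t)] for t in others) / len(others)


tasks = ["writing", "math", "code"]
schedule = best_order(tasks, BWT)
ranking = sorted(tasks, key=lambda t: forgettability(t, tasks, BWT))
```

With these toy numbers the most forgettable task ("writing") ends up last in the schedule, and sorting tasks from least to most forgettable gives the same order as the exhaustive search. Real BWT estimates need not be additive, so that agreement is not guaranteed in general.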

The reported gains are substantial: 6.2% over joint multi-task training and 12.4% over model merging. The accuracy of the final multi-task models is well predicted by forgettability rankings, even under simplifying assumptions, which suggests that BWT-guided scheduling has principled theoretical grounding.

This extends "Does gradually tightening token budgets beat fixed budget training?" from temporal budgets to task ordering: what matters for multi-task RL is not only how much compute each task receives, but which tasks come first. It also refines the picture of entropy collapse, which is not a bug to fix everywhere. In structured domains it reflects desirable precision; the problem arises when structured-domain entropy collapse propagates and damages open-ended capabilities.

Inquiring lines that read this note 92

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

What are the consequences of models training on synthetic data?

What capability tradeoffs emerge when scaling model reasoning abilities?

Does the heuristic dominance ratio vary predictably across model architectures?

What constrains reinforcement learning's ability to expand model reasoning?

How do self-generated feedback mechanisms enable effective model learning?

Does domain specialization cause models to lose capabilities elsewhere?

What determines success in training models on multiple tasks?

Why does training format shape reasoning strategy more than domain content?

Why does supervised fine-tuning improve accuracy while degrading reasoning quality?

Can demo placement be tuned as a task-specific hyperparameter?

How do prompt structure and constraints affect model instruction reliability?

How do ordering effects compound across different prompt component scales?

Does fine-tuning modify underlying model capabilities or only behavioral outputs?

How does example difficulty affect learning efficiency in language models?

How does AI adoption affect human skill development and labor equality?

Does constraining AI access during early task phases preserve skill formation?

What pretraining choices and baseline capability constrain reinforcement learning gains?

What makes weaker teacher models effective for stronger student training?

How do neural networks separate factual knowledge from reasoning abilities?

Does knowledge structure matter more than knowledge volume for model training?

How does policy entropy collapse constrain reasoning-focused reinforcement learning?

How does entropy collapse affect creative capability in multi-task settings?

When does architectural design matter more than raw model capacity?

Can depth scaling and breadth scaling unlock independent capability axes?

Why does reinforcement learning suppress output diversity compared to supervised fine-tuning?

Does reinforcement learning teach reasoning or just when to reason?

How can identical external performance mask different internal representations?

Can RL format selection explain performance gains attributed to algorithmic improvements?

Can alternative training methods improve on supervised fine-tuning for language models?

How do training priors constrain what context information can override?

Does decoupling planning from execution improve multi-step reasoning accuracy?

Do harness improvements transfer across model scales or memorize shortcuts?

What happens when you project the same model onto different harnesses?

Why does consolidated memory sometimes degrade agent performance?

How does consolidation schedule order affect final memory quality?

Why does finetuning cause catastrophic forgetting of model capabilities?

How should skill libraries coordinate with gradient-based weight optimization?

Why do continual learning scenarios trigger catastrophic forgetting and interference?

Can AI systems balance emotional competence with factual reliability?

How does curriculum learning prevent instability in social-emotional RL training?

Why do reward structures fail to shape long-term agent learning?

How does credit assignment across objectives differ from credit assignment across time?

How does memorization interact with learning and generalization?

Why does curriculum order matter when information theory says data order is irrelevant?

How should models express uncertainty rather than forced confident answers?

Can architectural changes reorder when uncertainty and empowerment signals influence decisions?

Can single-axis benchmarks accurately predict agent deployment success?

Can a single Elo ranking represent multidimensional model capability?

How do training data properties shape reasoning capability development?

Does task diversity in pretraining data transfer reasoning better than larger models?

What limits mechanistic interpretability's ability to characterize models?

How do weight visualizations reveal temporal structure in cyclic training?

Can language model RL training avoid reward hacking and misalignment?

Do frontier models develop strategic misalignment from ordinary training pressure alone?

Related concepts in this collection 5

Does policy entropy collapse limit reasoning performance in RL? As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
entropy dynamics are domain-dependent, not uniformly negative; structured tasks decrease entropy while creative tasks increase it

Does gradually tightening token budgets beat fixed budget training? Can models learn reasoning more efficiently by starting with generous token allowances and progressively constraining them, rather than training with fixed budgets from the start? This matters because it addresses how to teach models to think effectively while remaining concise.
BWT-guided scheduling extends curriculum insight from temporal budgets to task ordering

Why do reasoning models fail differently at training versus inference? Reasoning models exhibit two distinct failure modes, entropy collapse during training and variance inflation during inference, that appear unrelated but may share underlying causes. Understanding these dual problems could reveal whether separate or unified solutions are needed.
Omni-Thinker adds that entropy direction depends on task type, further complicating the dual problem

Does RL training collapse format diversity in pretrained models? Exploring whether RL fine-tuning systematically selects one output format from pretraining while suppressing others, and how this selection mechanism drives performance gains.
multi-task training with BWT scheduling may partially address format convergence by exposing the model to diverse task types

Can isolating task-specific parameters prevent multi-task fine-tuning interference? Explores whether identifying and protecting task-specific parameter regions can prevent the performance degradation that occurs when fine-tuning models on multiple tasks simultaneously. This matters because it could enable safe multi-task adaptation without sacrificing individual task performance.
complementary multi-task approach: CPI-FT addresses interference through spatial parameter isolation while Omni-Thinker uses temporal task ordering; CPI-FT shows temporal scheduling alone is insufficient, suggesting combining both spatial isolation and BWT-guided ordering could further improve multi-task training
