INQUIRING LINE


Train an AI on rules and creativity at once and one skill quietly eats the other — the training order decides which one survives.

How does stage-wise training scheduling resolve conflicts between constraint-following and creative tasks?

This explores how the *order* you train tasks in—structured, rule-following work versus open-ended creative work—can keep one kind of skill from quietly destroying the other.

The corpus has a sharp, mechanical answer: the two task types pull a model's output entropy in opposite directions, so the schedule isn't a tuning convenience; it's the thing that decides which capability survives. The clearest case is Omni-Thinker, where structured domains (math, format-following) *lower* a model's output entropy while creative domains *raise* it; training them jointly lets the entropy-collapsing tasks flatten the open-ended ones. Training the structured tasks first, guided by how much each stage degrades earlier skills, gains about 6.2% over throwing everything in at once (see "Does training order reshape how models handle different task types?"). The conflict is real and physical, not stylistic, and sequence is the lever.
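To make "output entropy" concrete, here is a minimal sketch that computes Shannon entropy for two next-token distributions. The four-token vocabulary and the probabilities are hypothetical illustrations, not values from Omni-Thinker; they only show what "low entropy" and "high entropy" mean in this context.

```python
import math

def shannon_entropy(probs):
    """Entropy in bits of a discrete distribution (zero-probability terms are skipped)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical next-token distributions over a 4-token vocabulary.
structured = [0.97, 0.01, 0.01, 0.01]  # math/format task: one continuation dominates
creative = [0.25, 0.25, 0.25, 0.25]    # open-ended task: many valid continuations

h_structured = shannon_entropy(structured)
h_creative = shannon_entropy(creative)

print(f"structured: {h_structured:.3f} bits, creative: {h_creative:.3f} bits")
```

A training signal that rewards the peaked distribution pushes entropy down; one that rewards the flat distribution pushes it up. Applying both to the same parameters at the same time is the tension the paragraph above describes.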

Why does order matter so much? Because RL training is itself staged, even when you don't design it to be. One line of work finds RL moves through two phases on its own: first nailing execution correctness, then shifting the bottleneck to strategic planning, with planning entropy rising as execution entropy settles (see "Does RL training follow a predictable two-phase learning sequence?"). If the model is naturally consolidating procedure before it explores, then front-loading the constraint-heavy, low-entropy tasks works *with* that grain instead of against it. Stage-wise scheduling is partly just respecting a sequence the optimizer was already going to follow.

The same imitation-then-exploration logic shows up in curriculum design directly: running supervised reasoning first to build a foundation, then verifiable-reward RL to sharpen it, beats either method alone, because the early phase produces reasonable attempts that make the later reward signal *informative* (see "Does sequencing imitation then exploration training improve reasoning?"). Read alongside Omni-Thinker, a pattern emerges: stages aren't just ordered for convenience; they're ordered so each phase prepares the ground the next one needs, and so the destructive phase can't run before the fragile capability is established.

There's a darker reason sequencing matters: RL doesn't gently blend behaviors, it *collapses* them. Controlled experiments show RL amplifies a single dominant output format from pretraining within the first epoch and actively suppresses the alternatives, with the winner determined by model scale rather than by which format is actually best (see "Does RL training collapse format diversity in pretrained models?"). Creative diversity is exactly the kind of thing a careless schedule extinguishes early and irreversibly. So stage-wise scheduling is really damage control against a known failure mode: protect the high-entropy, many-valid-answers capability by not letting the convergent, one-right-answer pressure run first.

A useful counterpoint sits in the corpus too: maybe the deeper fix isn't scheduling within one model but *separation*. Splitting a decomposer from a solver, or wrapping LLM calls inside explicit algorithms that show each step only what it needs, prevents planning and execution from interfering at all (see "Does separating planning from execution improve reasoning accuracy?" and "Can algorithms control LLM reasoning better than LLMs alone?"). It's also worth knowing that what instruction tuning actually transfers is the output *format* distribution more than task understanding (see "Does instruction tuning teach task understanding or output format?"), which reframes the whole constraint-vs-creative conflict as a fight over which output distribution a model gets locked into. Scheduling resolves the conflict by controlling that lock-in: establish the strict distribution first, then widen it, rather than letting the two collide and the narrow one win by default.

Sources 7 notes

Does training order reshape how models handle different task types?

Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.
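The note does not define BWT. The sketch below assumes it means backward transfer as used in the continual-learning literature: the average change in score on earlier tasks after the final stage finishes. The `evaluate` callback, the task names, and every number in `TOY_RUNS` are hypothetical, and the exhaustive search over orders is an illustration only, not Omni-Thinker's procedure.

```python
from itertools import permutations

def backward_transfer(acc):
    """acc[i][j]: score on task j measured after training stage i (schedule order).

    Returns the mean change on earlier tasks once the final stage is done.
    Negative values mean later stages degraded earlier skills.
    """
    last = len(acc) - 1
    if last == 0:
        return 0.0
    return sum(acc[last][j] - acc[j][j] for j in range(last)) / last

def best_order(tasks, evaluate):
    """Pick the order with the highest BWT (least forgetting) by brute force.

    `evaluate(order)` is a hypothetical callback returning the acc matrix
    for that order; in practice each call would be a full training run.
    """
    return max(permutations(tasks), key=lambda order: backward_transfer(evaluate(order)))

# Hypothetical score matrices for a two-task toy problem.
TOY_RUNS = {
    ("math", "writing"): [[0.80, 0.40], [0.78, 0.70]],  # math barely degrades
    ("writing", "math"): [[0.70, 0.30], [0.55, 0.80]],  # writing degrades a lot
}

order = best_order(["math", "writing"], TOY_RUNS.__getitem__)
print(order)
```

With these made-up numbers the structured task goes first, because training on writing afterwards costs math little, while training on math afterwards costs writing a lot. Brute force grows factorially with the number of tasks, so a practical scheduler would need a cheaper estimate of the same quantity.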

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Does sequencing imitation then exploration training improve reasoning?

Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Does separating planning from execution improve reasoning accuracy?

Modular architectures with separate decomposer and solver models outperform monolithic LLMs, with decomposition ability transferring across domains while solving ability does not. The separation prevents planning-execution interference and produces more generalizable skills.


Can algorithms control LLM reasoning better than LLMs alone?

LLM Programs embed LLMs within explicit algorithms that manage control flow and state, presenting only step-specific context to each LLM call. This information hiding addresses capability and context window limits while treating complex reasoning as modular, debuggable sub-tasks.
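A minimal sketch of the information-hiding idea, assuming a model is just a function from prompt string to completion string. The `LLM` type, `solve_with_program`, the prompt wording, and `stub_llm` are all hypothetical names introduced for illustration; they are not an API from the LLM Programs work.

```python
from typing import Callable, List

LLM = Callable[[str], str]  # hypothetical interface: prompt in, completion out

def solve_with_program(question: str, llm: LLM) -> str:
    """Control flow and state live in Python; each call sees only its own context."""
    # Step 1: the decomposer sees the question and nothing else.
    plan = llm(f"List the sub-questions, one per line:\n{question}")
    steps: List[str] = [s.strip() for s in plan.splitlines() if s.strip()]

    # Step 2: each solver call sees one sub-question, not the plan or other answers.
    answers = [llm(f"Answer briefly:\n{step}") for step in steps]

    # Step 3: the combiner sees only the question and the collected answers.
    joined = "\n".join(answers)
    return llm(f"Question: {question}\nFacts:\n{joined}\nFinal answer:")

def stub_llm(prompt: str) -> str:
    """Stand-in model for demonstration; a real client call would go here."""
    if prompt.startswith("List the sub-questions"):
        return "What is 2 + 3?\nWhat is 5 * 4?"
    if prompt.startswith("Answer briefly"):
        return "answered: " + prompt.splitlines()[1]
    return "done"

print(solve_with_program("Compute (2 + 3) * 4", stub_llm))
```

Because the loop and the intermediate state sit outside the model, each step can be logged, tested, or swapped independently, which is the "modular, debuggable sub-tasks" property the note describes.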

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions (43% vs. a random baseline's 42.6%). The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs (1.68 match)
The Art of Scaling Reinforcement Learning Compute for LLMs (1.68 match)
RL Squeezes, SFT Expands: A Comparative Study of Reasoning LLMs (1.67 match)
From Trial-and-Error to Improvement: A Systematic Analysis of LLM Exploration Mechanisms in RLVR (1.66 match)
Reinforcement Learning with Rubric Anchors (1.66 match)
RAGEN-2: Reasoning Collapse in Agentic RL (1.64 match)
Do Models Really Learn to Follow Instructions? An Empirical Study of Instruction Tuning (0.89 match)
Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining (0.89 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM researcher auditing stage-wise training scheduling as a resolution to constraint-following vs. creative-task conflicts. The question remains open: *does scheduling order truly resolve the conflict, or have newer models, methods, or evaluation frameworks since made the constraint moot?*

What a curated library found—and when (findings span 2023–2025, dated claims, not current truth):
• Output entropy collapses under RL: constraint tasks lower entropy, creative tasks raise it; training structured tasks first recovers ~6% over joint training (Omni-Thinker, 2025).
• RL training exhibits two intrinsic phases: procedural consolidation precedes strategic planning, suggesting front-loading constraint tasks aligns with optimizer dynamics (~2024).
• Supervised reasoning → verifiable-reward RL (imitation-then-exploration) outperforms either alone; early phases prepare informative signal for later stages (~2024).
• RL amplifies a single dominant output format from pretraining within epoch 1 and actively suppresses alternatives; winner determined by scale, not quality (Echo Chamber, 2025).
• Instruction tuning teaches output *format distribution*, not task understanding; reframes constraint-vs-creative as a lock-in collision (2023).

Anchor papers (verify; mind their dates):
• arXiv:2507.14783 (Omni-Thinker, 2025) — multi-task RL scheduling with entropy-guided sequencing
• arXiv:2504.07912 (Echo Chamber, 2025) — RL format amplification and pretraining behavior collapse
• arXiv:2402.05808 (Reverse Curriculum RL, 2024) — staged reasoning training
• arXiv:2305.11383 (Instruction Tuning, 2023) — format distribution vs. task learning

Your task:
(1) RE-TEST EACH CONSTRAINT. Judge whether post-2025 scaling (o1-like reasoning models, mixture-of-experts, in-context RL) has RELAXED the entropy-collapse problem. Do larger model capacity or longer reasoning horizons bypass the need for stage-wise scheduling? Where does scheduling still appear essential, and where has it become optional?
(2) Surface the strongest CONTRADICTING work from the last 6 months. Look for papers claiming scheduling *doesn't* matter, or that architectural separation (decomposer–solver, tool-use, modular reasoning) obsoletes sequencing entirely. Flag disagreement in the corpus itself (e.g., does Omni-Thinker's gain persist under newer RL algorithms?).
(3) Propose 2 research questions that ASSUME the regime may have moved: (a) Under extreme-scale inference (larger context, chain-of-thought verifiers), does stage-wise *pre-training* scheduling matter, or only post-training? (b) Can adaptive/online scheduling—where task order shifts based on live entropy metrics—outperform fixed stage sequences?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
