INQUIRING LINE

Can continuous spectrum training outperform sequential SFT-then-RL stages?

This explores whether blending supervised and reinforcement signals into one continuous training process beats running them as discrete stages (SFT first, then RL) — and what the corpus knows about why staged training stumbles.


The corpus offers a fairly direct answer: the staged version has a diagnosable failure pattern, and folding the stages together fixes it.

The clearest evidence is the finding that SFT-then-RL training moves through three predictable phases when the expert data pulls away from what the model already does: an initial disruption as the policy shifts, a readaptation to the expert's patterns, then overfitting (see "Why does SFT-then-RL training follow a predictable three-phase pattern?"). That progression is essentially the cost of treating SFT as a finished, separate step. The same work shows that dynamically weighting SFT as an auxiliary objective *inside* on-policy RL, rather than running it as a prior stage, smooths out the progression and stabilizes training. That is continuous-spectrum training outperforming the staged version on its home turf.
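One way to picture "SFT as a weighted auxiliary objective inside RL" is as a single blended loss whose supervised weight changes over training. The sketch below is illustrative only and is not the method from the cited work: the cosine decay schedule, the names `sft_weight` and `blended_loss`, and the default values of `mu_start` and `mu_end` are all assumptions.

```python
import math


def sft_weight(step: int, total_steps: int,
               mu_start: float = 0.9, mu_end: float = 0.05) -> float:
    """Hypothetical schedule: cosine decay of the SFT weight from mu_start to mu_end."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so steps past the end keep the final weight.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return mu_end + 0.5 * (mu_start - mu_end) * (1.0 + math.cos(math.pi * progress))


def blended_loss(rl_loss: float, sft_loss: float, mu: float) -> float:
    """Convex combination: mu weights the supervised term, (1 - mu) the RL term."""
    if not 0.0 <= mu <= 1.0:
        raise ValueError("mu must lie in [0, 1]")
    return (1.0 - mu) * rl_loss + mu * sft_loss
```

With a schedule like this there is no hard boundary between an "SFT stage" and an "RL stage": early steps are dominated by the supervised term, later steps by the RL term, and the handover is gradual. A staged recipe corresponds to the special case where `mu` jumps from 1 to 0 at a single step.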

Why does merging help? A second thread points at plasticity. Models that drift less from their base distribution stay able to keep learning, while approaches that wander far stall once the task domain changes (see "Does staying close to the base model preserve learning ability?"). A hard SFT stage followed by hard RL is exactly the kind of two-step drift that burns plasticity; a continuous blend keeps the model closer to home and learning-ready throughout. There's a structural hint here too: RL only rewrites a small, consistent slice of parameters (see "Does reinforcement learning update only a small fraction of parameters?"), so the two phases aren't fighting over the whole network, which is part of why interleaving them is even feasible.
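The two quantities in this argument, drift from the base distribution and the share of parameters that RL actually changes, are both simple to measure. The following is a minimal sketch on plain Python lists, not code from any of the cited works. The function names, the choice of KL(policy || base) as the drift direction, and the tolerance parameter are assumptions made for illustration.

```python
import math
from typing import Sequence


def kl_divergence(policy: Sequence[float], base: Sequence[float],
                  eps: float = 1e-12) -> float:
    """KL(policy || base) in nats for two discrete distributions over one vocabulary."""
    if len(policy) != len(base):
        raise ValueError("distributions must have the same length")
    # Terms with zero policy probability contribute nothing to the sum.
    return sum(p * math.log((p + eps) / (q + eps))
               for p, q in zip(policy, base) if p > 0.0)


def updated_fraction(base_params: Sequence[float], tuned_params: Sequence[float],
                     tol: float = 0.0) -> float:
    """Fraction of parameters whose value changed by more than tol during training."""
    if len(base_params) != len(tuned_params) or not base_params:
        raise ValueError("parameter lists must be non-empty and equal in length")
    changed = sum(1 for b, t in zip(base_params, tuned_params) if abs(b - t) > tol)
    return changed / len(base_params)
```

In practice the first function would be averaged over next-token distributions on a held-out prompt set, and the second applied to flattened weight tensors. Tracking both during training gives a rough view of how far a run has moved from its starting point.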

But "continuous beats staged" isn't unconditional: order and scheduling still matter enormously, which complicates the tidy story. Training order mechanically reshapes entropy dynamics. Structured tasks shrink output entropy while creative ones expand it, and front-loading structured tasks beat joint training by 6.2% precisely because it stopped entropy collapse from wrecking open-ended skills (see "Does training order reshape how models handle different task types?"). So sequencing carries real information that a naive uniform blend can throw away. RL itself unfolds in phases too, with procedural mastery first and strategic exploration second (see "Does RL training follow a predictable two-phase learning sequence?"), meaning even "continuous" training has internal stages whether you design for them or not.
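Entropy collapse is easy to monitor once output entropy is logged during training. The sketch below computes Shannon entropy for a next-token distribution and flags a run whose latest entropy has fallen below a fraction of its starting value. It is an illustration under stated assumptions, not the scheduling method from the cited work: the names `shannon_entropy` and `entropy_collapsed`, and the `floor_ratio` threshold, are hypothetical.

```python
import math
from typing import Sequence


def shannon_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats of a discrete distribution (zero-probability terms skipped)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_collapsed(history: Sequence[float], floor_ratio: float = 0.3) -> bool:
    """True if the latest logged entropy is below floor_ratio times the first logged value."""
    if len(history) < 2:
        return False  # Not enough measurements to compare.
    return history[-1] < floor_ratio * history[0]
```

A check like this could be run per task type: if entropy on open-ended prompts collapses while structured tasks are being trained, that is a signal to revisit the task order or the mixing weights.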

The synthesis: the dichotomy in the question is partly false. The winning approach isn't "abolish stages" so much as "stop treating SFT as a frozen prior step and let supervised signal flow through RL as a tunable, weighted ingredient," while still respecting that the *content's* difficulty and type want a schedule. What you didn't know you wanted to know: the failure of staged training isn't that RL undoes SFT. It's a specific shift-readapt-overfit arc that you can detect and dissolve by making the boundary between the two permeable.


Sources (5 notes)

Why does SFT-then-RL training follow a predictable three-phase pattern?

CHORD identifies three distinct training phases: initial capability disruption from policy shift, readaptation to expert patterns, then overfitting. Dynamically weighting SFT as an auxiliary objective within on-policy RL resolves this progression and improves stability.

Does staying close to the base model preserve learning ability?

FST-trained models stay up to 70% closer to their base distribution than parameter-only RL, and this reduced drift preserves the model's ability to learn subsequent tasks effectively. Parameter-only approaches stall when task domains change, while low KL drift enables sustained adaptation.

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL updates only 5–30% of parameters, and this sparsity arises without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.

Does training order reshape how models handle different task types?

Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing whether continuous spectrum training (blending SFT and RL signals) outperforms sequential SFT-then-RL stages in LLMs, treating findings from 2023–2026 as dated claims to verify, not current truth.

What a curated library found — and when (dated claims, not current truth):
Findings span October 2023–May 2026.
• SFT-then-RL exhibits a diagnosable shift-readapt-overfit arc when expert data diverges from base model; weighting SFT as an auxiliary objective *inside* on-policy RL smooths it (~2025, arXiv:2508.11408).
• Lower KL drift from base distribution preserves plasticity and sustained learning; hard SFT followed by hard RL burns plasticity via two-step drift (~2025).
• RL updates only 5–30% of parameters in sparse subnetworks, so interleaving SFT and RL is structurally feasible (~2025, arXiv:2505.11711).
• Training order mechanically reshapes entropy: structured tasks first beat joint training by >6% because sequencing prevents entropy collapse on open-ended skills (~2025, arXiv:2507.14783).
• RL itself unfolds in two phases: procedural consolidation before strategic exploration; even "continuous" training has internal stages (~2026).

Anchor papers (verify; mind their dates):
• arXiv:2508.11408 (Aug 2025) — On-Policy RL Meets Off-Policy Experts.
• arXiv:2507.14783 (Jul 2025) — Omni-Thinker: Multi-Task RL with Hybrid Reward and Task Scheduling.
• arXiv:2505.11711 (May 2025) — Reinforcement Learning Finetunes Small Subnetworks.
• arXiv:2605.12484 (May 2026) — Learning, Fast and Slow: Continual Adaptation in LLMs.

Your task:
(1) RE-TEST EACH CONSTRAINT. For the shift-readapt-overfit arc, has permeable SFT-auxiliary weighting been validated on models >70B? Has plasticity preservation via KL-bounded training held up as a universal principle, or do scaling laws or new schedulers relax it? Judge whether parameter sparsity claims (5–30%) generalize to post-2026 architectures and whether entropy scheduling's 6% gain persists under multi-objective regimes.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months—especially anything showing staged training recovers parity under new initialization, curriculum, or reward design, or that continuous blending fails under certain task distributions.
(3) Propose 2 research questions that ASSUME the regime may have moved: (a) Can adaptive, per-layer scheduling of SFT weight within RL (rather than global weighting) unlock higher plasticity and faster convergence? (b) Do foundation models trained with continuous SFT-RL from pretraining outperform those retrofitted with staged methods, and by how much?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
