Can models generate their own training curriculum during offline dreaming?
This explores whether a model can do two things at once during an offline 'sleep' phase — invent the practice problems it learns from (a self-made curriculum) and consolidate them into its weights — rather than waiting on a human-curated training set.
This explores whether a model can both invent its own practice material and bake it into its weights during an offline 'dreaming' phase, instead of relying on a human-built training set. The corpus says yes to each half separately, and the pieces are starting to fit together.
The most direct evidence for the dreaming half comes from a 'sleep phase' for continual learning, where a model consolidates what it has picked up in-context into permanent weights using two moves: distilling smaller networks upward into the larger one, and RL-generated 'dreaming' that rehearses synthetic experience Can models consolidate memories during offline sleep phases?. That rehearsal material has to come from somewhere — and a separate line of work shows aligned models can manufacture it. Given nothing but the formatting tokens that normally precede a user query, an instruction-tuned model auto-regressively spills out millions of diverse, high-quality instruction-answer pairs that match human-curated data and beat external sources for downstream fine-tuning Can aligned LLMs generate their own training data?. So a model dreaming up its own training examples isn't speculative; it's already a working data pipeline.
The 'curriculum' word is where it gets interesting, because a curriculum isn't just data — it's data ordered by difficulty, escalating as the learner improves. A self-play loop does exactly this with three roles: a Challenger that ramps up problem difficulty (the curriculum), a Judge that issues binary verdicts (the reward), and skills that evolve through natural-language edits — no human feedback anywhere in the loop Can language models learn skills without human supervision?. A related system drops the separate Challenger entirely and has one model alternate between answering and judging its own answers, deriving reward from how consistently it ranks its own outputs Can models learn to judge themselves without external rewards?. Both show the reward signal, not just the data, can be internally generated — which is what makes a closed self-curriculum loop possible.
Here's the catch that the corpus surfaces and you might not have asked for: there's a real question about whether any of this teaches genuinely *new* ability or just reshuffles what's already there. Multiple independent methods — RL steering, critique tuning, feature steering — all turn out to merely *elicit* reasoning that base models already latently hold; post-training selects rather than creates Do base models already contain hidden reasoning ability?. And self-generated training carries a specific failure mode: RL tends to collapse onto a single dominant output format within the first epoch, suppressing the diversity it started with Does RL training collapse format diversity in pretrained models?. A model dreaming its own curriculum risks dreaming in an ever-narrowing groove — which is exactly why the self-play work has to bolt on a 'generalization safeguard' to keep adversarial pressure from collapsing the whole system Can language models learn skills without human supervision?.
So the honest answer: the machinery for self-generated curriculum during offline consolidation exists in parts — synthetic data generation, internal difficulty-escalation, internal reward, weight consolidation through dreaming — and nobody in this corpus has yet assembled all four into one loop. The open problem isn't whether a model *can* write its own syllabus, but whether it can write one that pushes past its own boundaries instead of rehearsing what it already knew.
Sources 6 notes
The Sleep paradigm uses Knowledge Seeding (distilling smaller networks into larger ones) and Dreaming (RL-generated rehearsal) to consolidate in-context knowledge into weights without forgetting. Gains appear in long-context understanding, few-shot reasoning, and continual learning.
MAGPIE shows that aligned models like Llama-3-Instruct auto-regressively generate diverse, high-quality instructions when given only pre-query formatting tokens, without prompt engineering. 4M generated pairs matched human-curated datasets in quality and outperformed external sources in downstream fine-tuning.
Ctx2Skill's three-role self-play loop manufactures missing feedback through internal signals: the Challenger escalates difficulty as curriculum, the Judge gives binary verdicts as reward, and both sides evolve via natural-language skill edits. Success requires balancing adversarial pressure against a generalization safeguard to prevent collapse.
SERL enables self-improving language models by having them alternate between generating responses and judging them pairwise, deriving rewards from ranking consistency and self-consistency of judgments. On AlpacaEval, this reached 59.90% win rate without external signals, up from 52.37%.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.