INQUIRING LINE


What if an AI could design its own lesson plan — getting harder as it improves — just by tracking where it fails?

How do developmental curriculums emerge from learning progress signals?

This explores how a training 'curriculum' — an ordered progression from easier to harder learning — can arise out of signals about where a model is currently succeeding or failing, rather than being hand-designed in advance.

The corpus offers a few distinct mechanisms by which a curriculum can emerge from progress signals, meaning the model's own pattern of successes and failures, and they are more interesting together than apart.

The most literal answer is reverse curriculum. In "Can curriculum learning approximate expensive process supervision?", R3 starts the model near the end of a solved problem and slides the starting point backward as it succeeds, so difficulty ramps automatically with mastery. The progress signal is just outcome reward, but because the start state moves, that single coarse signal exposes step-level failure modes, approximating the granularity of process supervision without human step annotations. The curriculum isn't a syllabus; it's a moving boundary between what the model can already do and what it can't quite reach yet.
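A minimal sketch of the moving start boundary described above, assuming a reference solution that has been split into discrete steps. The class name `ReverseCurriculum`, the success threshold `promote_at`, and the evaluation `window` are illustrative assumptions, not R3's published schedule.

```python
class ReverseCurriculum:
    """Tracks how many trailing steps of a reference solution the model must produce."""

    def __init__(self, num_steps, promote_at=0.8, window=20):
        self.num_steps = num_steps      # length of the reference solution
        self.promote_at = promote_at    # assumed success rate needed to get harder
        self.window = window            # assumed number of episodes per decision
        self.steps_to_solve = 1         # begin one step away from the answer
        self.results = []

    def start_index(self):
        """Number of reference steps handed to the model before it takes over."""
        return self.num_steps - self.steps_to_solve

    def record(self, success):
        """Log one outcome reward; slide the start backward after a good window."""
        self.results.append(bool(success))
        if len(self.results) < self.window:
            return
        rate = sum(self.results) / len(self.results)
        self.results = []               # start a fresh window either way
        if rate >= self.promote_at and self.steps_to_solve < self.num_steps:
            self.steps_to_solve += 1
```

In a training loop, one would presumably prepend the first `start_index()` reference steps to the prompt, let the model complete the rest, and pass the binary outcome to `record`. The only feedback consumed is the outcome, yet the position at which success rates drop points to the step the model cannot yet perform.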

But curriculums also emerge whether or not anyone designs them. "Does RL training follow a predictable two-phase learning sequence?" finds that RL training reliably moves through two phases on its own: first execution correctness drives gains, then strategic planning becomes the bottleneck. The shift is visible because planning-token entropy keeps rising while execution entropy settles. That's a curriculum the training dynamics generate spontaneously, with the shifting bottleneck acting as the progress signal that tells you which skill to invest in next. "Does sequencing imitation then exploration training improve reasoning?" shows the designed version of the same logic: imitation first to create reasonable rollouts, then verifiable rewards to sharpen them, because outcome rewards only become informative once the model is producing attempts good enough to be told apart. Ordering is what makes the signal legible.
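The entropy comparison behind the two-phase observation can be sketched as follows. This assumes each generated token comes with its next-token probability distribution and a flag marking it as a planning token; how tokens get that label is not specified in the note, so `is_planning` is a hypothetical input.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def entropy_by_role(distributions, is_planning):
    """Mean entropy over planning tokens and over execution tokens."""
    groups = {"planning": [], "execution": []}
    for probs, planning in zip(distributions, is_planning):
        role = "planning" if planning else "execution"
        groups[role].append(token_entropy(probs))
    # A role with no tokens yields NaN rather than raising.
    return {
        role: (sum(values) / len(values) if values else float("nan"))
        for role, values in groups.items()
    }
```

Logged at each checkpoint, the two means would give the signal the note describes: execution entropy flattening while planning entropy continues to climb suggests the bottleneck has moved.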

There's a sharp limit lurking here, though. "Can models reliably improve themselves without external feedback?" argues that progress signals generated purely from within a model eventually stall, for three reasons: the generation-verification gap, diversity collapse, and reward hacking. Every reliable method secretly imports an external anchor (a past checkpoint, a judge, a tool, a user correction). So a self-emergent curriculum needs an outside reference point to keep measuring 'progress' against, or it congratulates itself into a corner. "Should successful and failed episodes be processed differently?" sharpens what the signal should carry: treat successes as concrete demonstrations and failures as abstracted lessons, an asymmetry that mirrors how human experts compress experience.
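The success/failure asymmetry can be illustrated with a small sketch. It assumes episodes arrive as `(steps, succeeded)` pairs and that the caller supplies a `summarize_failure` function; both are hypothetical conveniences for illustration, not SkillRL's actual interface.

```python
from typing import Callable, Iterable, List, Tuple


def consolidate(
    episodes: Iterable[Tuple[List[str], bool]],
    summarize_failure: Callable[[List[str]], str],
) -> Tuple[List[List[str]], List[str]]:
    """Split episodes into verbatim demonstrations and abstracted lessons."""
    demonstrations, lessons = [], []
    for steps, succeeded in episodes:
        if succeeded:
            # Successes are kept whole, as concrete worked examples.
            demonstrations.append(list(steps))
        else:
            # Failures are compressed to a short takeaway, not replayed.
            lessons.append(summarize_failure(steps))
    return demonstrations, lessons
```

Because a failed trajectory is reduced to one string, the stored context grows more slowly than it would if every episode were kept in full, which is one plausible reading of the reduced context use mentioned in the source note.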

The quiet twist is what the curriculum is actually teaching. "What does reward learning actually do to model reasoning?" finds that reward-based training mostly activates strategies already latent from pretraining rather than installing new ones; a single example, or even a spurious reward, can trigger it. Read alongside the others, an emergent curriculum may be less a ladder of new skills than a search procedure for surfacing capabilities the model already has, in the right order, with progress signals telling it which latent ability to switch on next.

Sources 6 notes

Can curriculum learning approximate expensive process supervision?

R3 progressively slides the reasoning start state backward from near-completion, creating a curriculum that reveals step-level failure modes using only outcome feedback. This achieves process supervision granularity without expensive human step annotations.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Does sequencing imitation then exploration training improve reasoning?

Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.

Can models reliably improve themselves without external feedback?

Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.

Should successful and failed episodes be processed differently?

SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.


What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about how learning curriculums emerge from progress signals in LLM training. The question remains open: *what mechanisms allow coarse outcome feedback to bootstrap fine-grained skill sequencing, and do those mechanisms scale?*

What a curated library found — and when (findings span 2023–2026; treat as dated claims, not current truth):
• Reverse curriculum (R3) moves the problem boundary backward as the model succeeds, converting single outcome rewards into step-level granularity (~2024).
• RL training spontaneously bifurcates into execution-then-planning phases; the shifting bottleneck acts as an endogenous curriculum signal (~2024).
• Pure self-improvement stalls via generation-verification gap, diversity collapse, and reward hacking; every reliable method imports an external anchor (checkpoint, judge, tool, user correction) (~2024).
• Reward-based training mostly *activates latent strategies from pretraining* rather than installing new ones; a single training example, or even a spurious reward, can trigger them (~2025).
• Asymmetric signal treatment — successes as concrete demonstrations, failures as abstracted lessons — mirrors expert learning (~2024).

Anchor papers (verify; mind their dates):
• arXiv:2402.05808 (2024-02): Reverse Curriculum RL
• arXiv:2411.12580 (2024-11): Procedural Knowledge & Phase Dynamics
• arXiv:2412.02674 (2024-12): Self-Improvement Mirage
• arXiv:2507.14843 (2025-07): Latent-Strategy Activation

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For each finding above, assess whether newer models (o1, o3, variants), scaling laws, synthetic data pipelines, or multi-agent orchestration have relaxed or overturned it. Distinguish the durable question (still open) from the perishable limitation (possibly resolved). Cite concretely what resolved it.
(2) **Surface the strongest CONTRADICTING or SUPERSEDING work** from the last ~6 months. Where does a later paper qualify, refute, or complicate an earlier claim?
(3) **Propose 2 research questions** that assume the regime may have shifted — e.g., *Can external anchors be abstracted into meta-level reward models that generalize across domains?* or *Does latent-strategy activation impose a ceiling on reasoning depth, and can that ceiling be lifted post-training?*

Cite arXiv IDs; flag anything you cannot ground in a real paper.
