INQUIRING LINE

Why does imitation learning alone plateau without outcome-based refinement?

This explores why copying expert demonstrations (imitation learning / SFT) eventually stops improving a model, and why pairing it with reward signals tied to whether the answer was actually right is what unlocks further gains.


The short version from the corpus: imitation teaches a model how to *look* like it's reasoning without teaching it whether the reasoning *worked*. The clearest evidence is that imitation captures surface form rather than substance: models trained to mimic ChatGPT learn its confident, fluent style and fool human evaluators, but close no real capability gap on novel tasks; the ceiling is set by the base model, not the fine-tuning (see "Can imitating ChatGPT fool evaluators into thinking models improved?"). Even more starkly, instruction tuning on semantically empty or deliberately wrong instructions performs nearly as well as tuning on correct ones; what transfers is knowledge of the *output format*, not task understanding (see "Does instruction tuning teach task understanding or output format?"). Imitation, in other words, is learning the shape of the answer space, and that's a finite well.

Outcome-based refinement supplies the thing imitation structurally cannot: a signal about whether a given attempt actually succeeded. The cleanest demonstration is curriculum: running supervised/imitation training first to establish reasonable behavior, *then* applying outcome rewards (RLVR) to sharpen it, beats either alone. The imitation phase matters precisely because it produces rollouts good enough that outcome rewards become informative; without it, the reward signal is too sparse to learn from (see "Does sequencing imitation then exploration training improve reasoning?"). RL training even shows a predictable two-phase arc: first it consolidates execution correctness (the procedural stuff imitation is good at), then the bottleneck shifts to strategic planning, exactly the exploratory territory imitation never reaches (see "Does RL training follow a predictable two-phase learning sequence?").
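The sparsity point can be made concrete with a small sketch. It assumes a GRPO-style update, where each rollout's advantage is its reward normalized against the other rollouts sampled for the same prompt; the notes here say only "RLVR" and do not name the algorithm, so that choice and the function name `group_advantages` are illustrative assumptions.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: (r - mean) / (std + eps).

    Assumption: a GRPO-style normalization over rollouts for one prompt.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Cold start: every rollout fails the outcome check, so rewards are identical
# and every advantage is zero. There is nothing to push the policy toward.
cold = group_advantages([0, 0, 0, 0])

# After an imitation warm-up some rollouts succeed, so the same binary
# reward now separates good attempts from bad ones.
warm = group_advantages([1, 0, 0, 1])

print(cold)
print(warm)
```

Under this assumption, a group of uniformly failed rollouts contributes no learning signal at all, which is one way to read the claim that imitation is what makes outcome rewards informative.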

The deeper reason imitation alone plateaus is that pure self-driven improvement is circular. A model can only imitate or refine against itself for so long before hitting the generation-verification gap, diversity collapse, and reward hacking; every method that reliably keeps improving smuggles in an *external* anchor, such as a verifier, a judge, user corrections, or tool feedback (see "Can models reliably improve themselves without external feedback?"). Outcome-based refinement is one concrete form of that external anchor. And the anchor needn't be a raw number: when numerical rewards plateau, natural-language critiques that explain *why* a solution failed can move models off the plateau, because the scalar reward lacks information about how to improve (see "Can natural language feedback overcome numerical reward plateaus?").

The most interesting wrinkle is that the imitation-vs-outcome dichotomy isn't actually binary; the corpus has been busy filling in the middle. Supervised RL rewards a model by how closely each step matches an expert's, giving dense signal even when every rollout fails, bridging rigid token-by-token imitation and sparse outcome-only rewards (see "Can step-wise expert rewards help small models learn hard reasoning?"). A 'third paradigm' lets agents treat the consequences of their own actions as supervision: no external reward, yet a far better warm-start for later RL than imitation gives (see "Can agents learn from their own actions without external rewards?"). And agents can learn from outcomes without any weight update at all, by storing verbal reflections on success/failure in episodic memory; the binary outcome signal is what prevents the model from rationalizing its mistakes away (see "Can agents learn from failure without updating their weights?").
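A minimal sketch of the first of these middle-ground ideas, step-wise expert similarity, contrasted with an outcome-only reward. The similarity measure (`difflib.SequenceMatcher` over raw strings), the function names, and the toy trajectory are all assumptions chosen for illustration; the notes do not specify how step alignment is actually scored.

```python
from difflib import SequenceMatcher


def step_rewards(model_steps, expert_steps):
    """Return one similarity score in [0, 1] per expert step.

    Assumption: plain string similarity stands in for whatever
    step-matching measure a real system would use.
    """
    rewards = []
    for i, expert in enumerate(expert_steps):
        # A missing model step is scored against the empty string (0.0).
        attempt = model_steps[i] if i < len(model_steps) else ""
        rewards.append(SequenceMatcher(None, attempt, expert).ratio())
    return rewards


def outcome_reward(final_answer, reference):
    """Sparse binary reward: 1.0 only if the final answer matches exactly."""
    return 1.0 if final_answer.strip() == reference.strip() else 0.0


# Hypothetical trajectory: the first step is right, then an arithmetic slip.
expert = ["let x = 3", "then 2*x = 6", "answer: 6"]
attempt = ["let x = 3", "then 2*x = 5", "answer: 5"]

dense = step_rewards(attempt, expert)
sparse = outcome_reward(attempt[-1], expert[-1])

print(dense)
print(sparse)
```

The failed attempt earns nothing from the outcome check, yet the per-step scores still distinguish the correct first step from the later slip, which is the sense in which the step-wise signal stays dense when every rollout fails.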

The thing worth walking away with: the plateau isn't a flaw in imitation, it's the natural limit of a method whose job is to teach *form*. You learn the moves by copying; you only learn which moves win by being told when you won. The frontier of this corpus is less about choosing imitation or outcomes and more about engineering the gradient between them — step-wise expert similarity, self-generated consequences, episodic reflection — so the reward signal stays informative the whole way up.


Sources (9 notes)

Can imitating ChatGPT fool evaluators into thinking models improved?

Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method—better fundamentals, not shortcuts, drive real improvement.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions (reported as 43% versus 42.6% for a random baseline). The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Does sequencing imitation then exploration training improve reasoning?

Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Can models reliably improve themselves without external feedback?

Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.

Can step-wise expert rewards help small models learn hard reasoning?

Supervised Reinforcement Learning rewards models by measuring alignment with expert actions at each step, providing dense learning signals even when all rollouts fail. This approach bridges the gap between rigid token-by-token imitation (SFT) and sparse outcome-only rewards (RLVR), and works best as a curriculum foundation before outcome-based refinement.

Can agents learn from their own actions without external rewards?

Research across eight environments shows that agents can use future states from their own actions as supervision without external rewards, matching expert-dependent baselines with half the data and providing superior warm-starts for subsequent RL training.

Can agents learn from failure without updating their weights?

Reflexion demonstrates that unambiguous environmental feedback (success/failure) enables agents to write useful self-diagnoses and improve across episodes without parameter updates. The binary signal prevents rationalization, and keeping reflections uncompressed preserves their usability.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-evaluating the claim that imitation learning alone plateaus without outcome-based refinement. The question remains open: what structural limits, if any, constrain pure imitation, and can they be overcome?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2025; treat these as snapshots, not current consensus.
- Instruction tuning captures output format distribution, not task understanding; even semantically empty instructions train nearly as well as correct ones (2023–2025).
- Imitation-only models plateau at the base model's capability ceiling; outcome-based refinement (RLVR, curriculum) breaks the ceiling by supplying a success signal imitation structurally cannot provide (~2024–2025).
- Pure self-improvement is circular without an external anchor (verifier, judge, tool feedback); every reliable method smuggles in an external signal (~2025).
- Natural-language critiques explaining *why* a solution failed break numerical-reward plateaus, because scalar rewards lack improvement information (~2025).
- Three intermediate paradigms bridge rigid token-by-token imitation and sparse outcome-only rewards: supervised RL (step-wise expert similarity), self-generated consequences, and episodic verbal reflection (~2025).

Anchor papers (verify; mind their dates):
- arXiv:2305.15717 (2023-05): The False Promise of Imitating Proprietary LLMs
- arXiv:2402.05808 (2024-02): Training Large Language Models for Reasoning through Reverse Curriculum Reinforcement Learning
- arXiv:2506.03106 (2025-06): Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback
- arXiv:2510.25992 (2025-10): Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning

Your task:
(1) RE-TEST EACH CONSTRAINT. For each finding above, judge whether newer models (GPT-4o, o1-class reasoning, next-gen agents), scaled imitation, novel architectures (e.g., state-caching Transformers), or in-context RL orchestration have relaxed or overturned the plateau claim. Separate the durable question — does imitation alone hit a ceiling, and why? — from the perishable limitation — that ceiling is at the base model, or that outcome signals are necessary. Cite what resolved each constraint, and flag where it still holds.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Does any recent paper show imitation-only scaling breaking free, or outcome-free methods matching outcome-based refinement?
(3) Propose 2 research questions that ASSUME the regime may have moved: e.g., "Can multimodal or long-horizon imitation + in-context outcome inference abolish the plateau?" or "Does scaling imitation data and model size change the role of outcome refinement?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
