How does action-level decomposition differ from token-level imitation in supervision?
This note explores the gap between supervising on consequences (what an action does) and supervising on surface form (copying the exact tokens a teacher emitted). The collection makes a sharp case that token-level imitation teaches the *shape* of an answer while action-level decomposition teaches its *function*, and these are not the same lesson.
The clearest evidence that pure token imitation is shallow comes from work showing that models trained to copy ChatGPT learn its confident, fluent style without closing any real capability gap: evaluators are fooled, but factuality and generalization don't move (see "Can imitating ChatGPT fool evaluators into thinking models improved?"). A companion finding goes further: instruction tuning seems to transfer knowledge of the *output space* (what a valid answer looks like) rather than task understanding, since deliberately wrong or empty instructions produce nearly the same scores (see "Does instruction tuning teach task understanding or output format?"). Both point to the same limit: imitating the token stream captures the format distribution, not reasoning.
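A toy sketch can make the gap between the two signals concrete. The scoring functions, the example strings, and the "last token is the answer" convention below are all illustrative assumptions, not taken from the cited work; real token-level imitation uses a cross-entropy loss over model logits, which this position-wise match only approximates.

```python
def token_match_score(student: str, teacher: str) -> float:
    """Fraction of positions where the student's token equals the teacher's.

    A crude stand-in for token-level imitation: it rewards surface overlap only.
    """
    s, t = student.split(), teacher.split()
    matches = sum(a == b for a, b in zip(s, t))
    return matches / max(len(s), len(t))


def outcome_score(student: str, correct_answer: str) -> float:
    """1.0 if the final token of the output is the correct answer, else 0.0.

    A crude stand-in for consequence-level supervision (assumed convention:
    the answer is the last whitespace-separated token).
    """
    tokens = student.split()
    return 1.0 if tokens and tokens[-1] == correct_answer else 0.0


# Hypothetical teacher output and two hypothetical student outputs.
teacher = "The answer is 4"
fluent_but_wrong = "The answer is 5"
right_but_different = "2 plus 2 gives 4"

for name, output in [
    ("fluent_but_wrong", fluent_but_wrong),
    ("right_but_different", right_but_different),
]:
    print(name, token_match_score(output, teacher), outcome_score(output, "4"))
```

Under these assumptions the surface score prefers the output that copies the teacher's phrasing and gets the answer wrong, while the outcome score prefers the output that is phrased differently and gets it right.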
Action-level supervision flips the source of the signal. Instead of asking 'did you reproduce the teacher's words,' it asks 'what happened when you acted.' One striking result treats an agent's own future states as the supervision signal, letting it learn without external rewards and matching expert-imitation baselines with half the data (see "Can agents learn from their own actions without external rewards?"). Chain-of-thought work makes the same move during pretraining, treating each reasoning step as an exploratory *action* scored by how much it improves prediction, a verifier-free reward grounded in consequence rather than mimicry (see "Can chain-of-thought reasoning be learned during pretraining itself?").
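The "scored by how much it improves prediction" idea can be sketched as a difference of log-likelihoods. This is a minimal illustration under assumptions: the function name and the probability values are hypothetical, and the actual method's baseline, normalization, and training loop are not described in this note.

```python
import math


def info_gain_reward(p_with_thought: float, p_without_thought: float) -> float:
    """Log-likelihood improvement on the observed next token.

    p_with_thought:    probability the model assigns to the true next token
                       after first generating a reasoning step.
    p_without_thought: probability assigned to the same token with no
                       reasoning step.
    Positive values mean the reasoning step made the prediction better.
    """
    if p_with_thought <= 0.0 or p_without_thought <= 0.0:
        raise ValueError("probabilities must be strictly positive")
    return math.log(p_with_thought) - math.log(p_without_thought)


# Hypothetical probabilities, chosen only for illustration.
helpful = info_gain_reward(p_with_thought=0.60, p_without_thought=0.20)
unhelpful = info_gain_reward(p_with_thought=0.10, p_without_thought=0.20)

print(f"helpful thought reward:   {helpful:+.3f}")
print(f"unhelpful thought reward: {unhelpful:+.3f}")
```

No verifier or teacher text appears anywhere in the reward: the only question is whether the thought changed what the model could predict.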
The richer idea in the corpus is that you can *decompose* supervision to get step-level granularity without hand-labeling every step. Checklist-based rewards break a vague instruction into verifiable sub-criteria, which reduces overfitting to the superficial artifacts that holistic, imitation-style reward models reward by accident (see "Can breaking down instructions into checklists improve AI reward signals?"). Reverse-curriculum learning reaches the same place from another angle: sliding the start state backward from near-completion exposes which individual steps fail, approximating expensive process supervision using only outcome feedback (see "Can curriculum learning approximate expensive process supervision?").
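Both decomposition ideas can be sketched in a few lines. Everything named here is a hypothetical illustration: the checklist items, the string-matching checks (the cited methods would typically use stronger verifiers), the unweighted average, and the demonstration steps are assumptions rather than details from the source notes.

```python
from typing import Callable, Dict, List

Check = Callable[[str], bool]


def checklist_reward(response: str, checklist: Dict[str, Check]) -> float:
    """Fraction of checklist criteria the response satisfies (unweighted)."""
    if not checklist:
        raise ValueError("checklist must contain at least one criterion")
    passed = sum(1 for check in checklist.values() if check(response))
    return passed / len(checklist)


def failed_items(response: str, checklist: Dict[str, Check]) -> List[str]:
    """Names of the criteria the response misses, for step-level feedback."""
    return [name for name, check in checklist.items() if not check(response)]


def reverse_curriculum_prefixes(demo_steps: List[str]) -> List[List[str]]:
    """Prefixes handed to the policy, from nearly complete down to empty.

    Stage 0 leaves only the last step to the policy; the final stage
    leaves the whole problem. Only the final outcome is ever scored.
    """
    return [demo_steps[:k] for k in range(len(demo_steps) - 1, -1, -1)]


# Hypothetical instruction: "Reply in under 20 words, mention the refund, apologize."
checklist: Dict[str, Check] = {
    "under_20_words": lambda r: len(r.split()) < 20,
    "mentions_refund": lambda r: "refund" in r.lower(),
    "apologizes": lambda r: "sorry" in r.lower(),
}
response = "Your refund was issued today."
print(checklist_reward(response, checklist), failed_items(response, checklist))

# Hypothetical four-step demonstration.
demo = ["parse problem", "set up equation", "solve", "state answer"]
for stage, prefix in enumerate(reverse_curriculum_prefixes(demo)):
    print(stage, prefix)
```

The checklist reports which criterion failed rather than a single holistic score, and the curriculum localizes failure by stage: if the policy succeeds when given three steps but fails when given two, the third step is the likely culprit.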
What you didn't know you wanted to know: the two paradigms aren't rivals. The corpus suggests imitation is the *warm-start* that makes action-level supervision work at all. Running imitation first to build reasonable rollouts, then refining against verifiable rewards, beats either method alone, because outcome rewards are uninformative until imitation has produced trajectories worth sharpening (see "Does sequencing imitation then exploration training improve reasoning?"). Token imitation gives you the format; action decomposition gives you the function. The trick is sequencing them, not choosing.
Sources (7 notes)
Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method—better fundamentals, not shortcuts, drive real improvement.
Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions, with scores of 43% against a random baseline's 42.6%. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.
Research across eight environments shows that agents can use future states from their own actions as supervision without external rewards, matching expert-dependent baselines with half the data and providing superior warm-starts for subsequent RL training.
RLP treats CoT as an exploratory action during pretraining, using log-likelihood improvement as a verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.
RLCF and RaR methods decompose instruction quality into verifiable sub-criteria, improving performance on benchmarks like FollowBench and HealthBench. This decomposition principle reduces overfitting to superficial artifacts that plague holistic reward models.
R3 progressively slides the reasoning start state backward from near-completion, creating a curriculum that reveals step-level failure modes using only outcome feedback. This achieves process supervision granularity without expensive human step annotations.
Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.