Why does adversarial training force deeper reasoning than surface imitation?
This explores why training a model against a critic or adversary, where it has to defend or fix its answers, builds deeper reasoning than just copying correct examples, and what the corpus says about the gap between genuine understanding and surface mimicry.
The adversary can be a critic that probes the model's answers or a game the model has to win. Either way, the corpus has a sharp answer for why this beats simply imitating correct outputs, and it starts with what imitation actually buys you. Pure imitation captures style, not substance: models trained to mimic ChatGPT learn its confident, fluent voice well enough to fool human evaluators, yet they do not improve factuality or generalization on novel tasks (see: Can imitating ChatGPT fool evaluators into thinking models improved?). Copying correct answers teaches the surface texture of being right without the machinery that generates it.
Adversarial setups break that shortcut because they force engagement with where reasoning fails. Training a model to critique noisy, wrong responses produces deeper understanding than training it on clean, correct ones. Even imperfect critique supervision beats correct-answer imitation, because spotting why something is broken requires structural reasoning that pattern-matching the right answer never demands (see: Does critiquing errors teach deeper understanding than imitating correct answers?). The adversarial-game version generalizes this: RARO trains a critic to discriminate expert answers from the policy's own, and that pressure alone trains strong reasoning without any task-specific verifier (see: Can adversarial critics replace task-specific verifiers for reasoning?). The opponent is what manufactures the difficulty.
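To make the critic-versus-policy pressure concrete, here is a minimal sketch of a discriminator-style objective. It assumes a GAN-like binary cross-entropy loss, which is an illustrative choice and not necessarily the objective RARO uses. The function names and the scalar score inputs are hypothetical stand-ins for the logits of a real critic model.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def critic_loss(expert_score: float, policy_score: float) -> float:
    """Binary cross-entropy: expert answers are labeled 1, policy answers 0.

    Scores are hypothetical critic logits (higher means "looks expert").
    """
    # -log(sigmoid(e)) == softplus(-e);  -log(1 - sigmoid(p)) == softplus(p)
    return softplus(-expert_score) + softplus(policy_score)


def policy_reward(policy_score: float) -> float:
    """Reward the policy by how strongly the critic believes it is the expert."""
    return sigmoid(policy_score)


if __name__ == "__main__":
    # An undecided critic (both logits 0) pays 2*ln(2); the policy earns 0.5.
    print(critic_loss(0.0, 0.0), policy_reward(0.0))
    # A critic that separates the two pays less, and the policy earns less.
    print(critic_loss(2.0, -2.0), policy_reward(-2.0))
```

The two objectives pull against each other: the critic lowers its loss by telling the answers apart, and the policy raises its reward only by producing answers the critic can no longer tell apart from the expert's. No task-specific correctness check appears anywhere in the sketch.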
There's a deeper wrinkle, though, and it's where the corpus gets interesting. Several lines of work suggest adversarial training isn't installing new reasoning so much as *eliciting* reasoning the base model already latently contains. Five independent methods (RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR) all surface capability that's already present in base-model activations; post-training selects rather than creates (see: Do base models already contain hidden reasoning ability?). RLVR specifically sharpens sampling efficiency within existing capability boundaries rather than expanding them, which is why even spurious rewards can work nearly as well as correct ones for a well-pretrained model (see: What does reward learning actually do to model reasoning?). So adversarial pressure may force deeper reasoning precisely by being a harder *selection* signal: imitation rewards looking right, while an adversary rewards only reasoning it can't refute.
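One way to picture "sharpening sampling efficiency within existing boundaries" is with a pass@k calculation. The sketch below assumes independent samples and uses made-up per-sample success rates; reading the claim through pass@k is an assumption for illustration, and neither the numbers nor the names come from the cited notes.

```python
def pass_at_k(p_single: float, k: int) -> float:
    """Chance that at least one of k independent samples is correct."""
    return 1.0 - (1.0 - p_single) ** k


# Hypothetical per-sample success rates on one problem the base model can already solve.
BASE_P = 0.05   # base model: the correct answer is reachable but rarely sampled
TUNED_P = 0.40  # after RL: the same answer is sampled far more often

if __name__ == "__main__":
    for k in (1, 8, 64, 256):
        base, tuned = pass_at_k(BASE_P, k), pass_at_k(TUNED_P, k)
        print(f"k={k:>3}  base={base:.3f}  tuned={tuned:.3f}  gap={tuned - base:.3f}")
```

Under these assumptions the gap is large at k=1 and shrinks toward zero as k grows, because the base model could already reach the answer given enough samples. A problem with a base success rate of exactly 0 stays at 0 for every k, which is the sense in which selection alone does not move the boundary.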
The most practical finding is that you don't have to choose. Sequencing matters: do supervised imitation first to lay down reasonable reasoning scaffolds, then apply RL against adversarial or verifiable rewards to sharpen them, and the curriculum beats either method alone (see: Does sequencing imitation then exploration training improve reasoning?). Imitation makes outcome rewards informative by producing rollouts worth critiquing; the adversarial phase then does the deepening. A related move, rewarding explanation quality and not just token correctness, internalizes coherent knowledge structures better than flat supervised fine-tuning (see: Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?).
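Why imitation "makes outcome rewards informative" can be seen in a toy one-parameter policy. The sketch below is an illustration under stated assumptions, not the training recipe from the cited work: the policy answers correctly with probability p = sigmoid(theta), reward is 1 only when correct, and the two starting probabilities (a cold start and a hypothetical post-imitation warm start) are invented for the example.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def logit(p: float) -> float:
    return math.log(p / (1.0 - p))


def train_with_outcome_reward(p_start: float, steps: int = 200, lr: float = 1.0) -> float:
    """Gradient ascent on expected reward for a one-parameter policy.

    Expected reward is p = sigmoid(theta), so its gradient with respect to
    theta is p * (1 - p): almost zero when correct rollouts are rare.
    """
    theta = logit(p_start)
    for _ in range(steps):
        p = sigmoid(theta)
        theta += lr * p * (1.0 - p)  # exact expected policy-gradient step
    return sigmoid(theta)


COLD_P = 1e-4  # hypothetical: no imitation phase, correct rollouts almost never occur
WARM_P = 0.2   # hypothetical: imitation has made correct rollouts reasonably common

if __name__ == "__main__":
    print("cold start, final p:", train_with_outcome_reward(COLD_P))
    print("warm start, final p:", train_with_outcome_reward(WARM_P))
```

With the same number of updates, the cold start barely moves because its learning signal scales with p * (1 - p), while the warm start climbs close to 1. The imitation phase does not need to solve the task; it only needs to put the policy where the reward has something to say.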
Here's the unsettling coda. Deeper reasoning chains are also longer attack surfaces. The same extended elaboration that adversarial training cultivates creates more intervention points where a single corrupted step propagates: reasoning models lose 25–29% accuracy under multi-turn manipulative prompts, *more* than plain models (see: Why do reasoning models fail under manipulative prompts?; Are reasoning models actually more vulnerable to manipulation?). Adversarial training that forges depth and adversarial prompting that exploits it are two sides of the same coin: every reasoning step you add is both a chance to think harder and a chance to be led astray.
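The "more intervention points" argument can be written as a back-of-the-envelope model. The sketch below assumes every step is corrupted independently with the same fixed probability and that one corrupted step ruins the answer; both are simplifying assumptions, the rate is invented, and the model is not meant to reproduce the 25–29% figure.

```python
def chain_survival(q_corrupt: float, n_steps: int) -> float:
    """Probability that none of n_steps independent steps is corrupted."""
    return (1.0 - q_corrupt) ** n_steps


Q = 0.02  # hypothetical per-step chance that a manipulative prompt derails the step

if __name__ == "__main__":
    for n in (5, 20, 50):
        print(f"{n:>2} steps: chain survives with probability {chain_survival(Q, n):.3f}")
```

Under these assumptions survival decays geometrically with chain length, so a model that reasons in more steps is more exposed at the same per-step corruption rate.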
Sources (9 notes)
Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method—better fundamentals, not shortcuts, drive real improvement.
Training models to critique noisy responses outperforms training on correct answers because critique forces engagement with failure modes and structural reasoning. Even imperfect critique supervision beats correct-answer imitation, showing how weak surface-pattern learning is for building genuine understanding.
RARO uses an adversarial game where a critic discriminates expert from policy answers, eliminating the need for domain-specific verifiers while matching the scaling properties of verifier-based RL. The approach works across Countdown, DeepMath, and Poetry Writing tasks.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.
Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms either method in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.
RLAG rewards both answer accuracy and explanation rationality by cycling between augmented and unaugmented generation, progressively internalizing coherent knowledge structures. This outperforms SFT because it prioritizes reasoning quality over token-level correctness.
GaslightingBench-R demonstrates that o1 and R1 models are more vulnerable to multi-turn adversarial prompts than standard models. Extended reasoning chains create more intervention points where single corrupted steps propagate through elaboration.
GaslightingBench-R shows that multi-turn manipulative prompts reduce reasoning model accuracy significantly more than standard models. Extended chains create more corruption points, allowing single wrong steps to propagate into confident incorrect conclusions.