INQUIRING LINE


An AI trained on correct answers masters the look of being right — training against a critic forces it to actually understand why.

Why does adversarial training force deeper reasoning than surface imitation?


This explores why training a model against an adversary (a critic that probes its answers, or a game it has to win) produces deeper reasoning than simply imitating correct outputs. The corpus has a sharp answer, and it starts with what imitation actually buys you. Pure imitation captures style, not substance: models trained to mimic ChatGPT learn its confident, fluent voice well enough to fool human evaluators while closing none of the real capability gap on novel tasks (see "Can imitating ChatGPT fool evaluators into thinking models improved?"). Copying correct answers teaches the surface texture of being right without the machinery that generates it.

Adversarial setups break that shortcut because they force engagement with where reasoning fails. Training a model to critique noisy, wrong responses produces deeper understanding than training it on clean correct ones. Even imperfect critique supervision beats correct-answer imitation, because spotting why something is broken requires structural reasoning that pattern-matching the right answer never demands (see "Does critiquing errors teach deeper understanding than imitating correct answers?"). The adversarial-game version generalizes this: RARO pits a critic against the policy to discriminate expert answers from the model's own, and that pressure alone trains strong reasoning without any task-specific verifier (see "Can adversarial critics replace task-specific verifiers for reasoning?"). The opponent is what manufactures the difficulty.
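To make the critic-versus-policy game concrete, here is a minimal sketch of a generic GAN-style objective. It is not RARO's actual loss, which the notes do not spell out. The function names and the example logits are hypothetical; the only assumption is that the critic emits a single logit for "this answer came from the expert."

```python
import math


def prob_expert(logit: float) -> float:
    """Critic's probability that an answer came from the expert (sigmoid)."""
    return 1.0 / (1.0 + math.exp(-logit))


def critic_loss(expert_logit: float, policy_logit: float) -> float:
    """Binary cross-entropy: expert answers are labelled 1, policy answers 0."""
    return -(math.log(prob_expert(expert_logit))
             + math.log(1.0 - prob_expert(policy_logit)))


def policy_reward(policy_logit: float) -> float:
    """The policy is rewarded when the critic mistakes its answer for an expert's."""
    return math.log(prob_expert(policy_logit))


# Hypothetical critic logits for one prompt (illustrative numbers only).
early = policy_reward(-2.0)  # critic easily spots the policy's answer
later = policy_reward(0.5)   # policy's answer is now harder to tell apart
print(f"reward early: {early:.3f}, reward later: {later:.3f}")
```

The point of the sketch is that the reward comes entirely from the critic's judgment, so no task-specific checker appears anywhere. As the critic improves, the same answer earns less reward, which is one way to read "the opponent manufactures the difficulty."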

There's a deeper wrinkle, though, and it's where the corpus gets interesting. Several lines of work suggest adversarial training isn't installing new reasoning so much as *eliciting* reasoning the base model already latently contains. Five independent methods (RL steering, critique fine-tuning, decoding tricks, feature steering, RLVR) all surface capability that's already present in base-model activations; post-training selects rather than creates (see "Do base models already contain hidden reasoning ability?"). RLVR specifically sharpens sampling efficiency within existing boundaries rather than expanding them, which is why even spurious rewards can work for a well-pretrained model (see "What does reward learning actually do to model reasoning?"). So adversarial pressure may force deeper reasoning precisely by being a harder *selection* signal: imitation rewards looking right, while an adversary rewards only reasoning that is actually correct and that it can't refute.
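The "selects rather than creates" claim can be illustrated with a toy model. The sketch below assumes post-training behaves like the standard KL-regularized optimum, where each answer's base probability is multiplied by exp(reward / beta) and renormalized. The answer labels, probabilities, and beta value are made up for illustration and come from none of the cited papers.

```python
import math


def reweight(base: dict[str, float], reward: dict[str, float],
             beta: float) -> dict[str, float]:
    """Reweight base probabilities by exp(reward / beta), then renormalize."""
    weights = {a: p * math.exp(reward[a] / beta) for a, p in base.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


def pass_at_k(p_correct: float, k: int) -> float:
    """Chance that at least one of k independent samples is correct."""
    return 1.0 - (1.0 - p_correct) ** k


reward = {"correct": 1.0, "wrong_a": 0.0, "wrong_b": 0.0}

# Case 1: the base model already produces the right answer occasionally.
latent = {"correct": 0.1, "wrong_a": 0.6, "wrong_b": 0.3}
tuned = reweight(latent, reward, beta=0.25)

# Case 2: the right answer has zero probability under the base model.
absent = {"correct": 0.0, "wrong_a": 0.7, "wrong_b": 0.3}
stuck = reweight(absent, reward, beta=0.25)

print(f"pass@1 base {latent['correct']:.2f} -> tuned {tuned['correct']:.2f}")
print(f"pass@16 base {pass_at_k(latent['correct'], 16):.2f}")
print(f"absent answer after tuning: {stuck['correct']:.2f}")
```

In this toy setting, reweighting raises single-sample accuracy sharply when the answer is already reachable, yet leaves an unreachable answer at zero. That is the sense in which a selection signal sharpens sampling without moving the capability boundary.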

The most practical finding is that you don't have to choose. Sequencing matters: do supervised imitation first to lay down reasonable reasoning scaffolds, then apply adversarial or verifiable-reward RL to sharpen against them, and the curriculum beats either method alone (see "Does sequencing imitation then exploration training improve reasoning?"). Imitation makes outcome rewards informative by producing rollouts worth critiquing; the adversarial phase then does the deepening. A related move, rewarding explanation quality and not just token correctness, internalizes coherent knowledge structures better than flat supervised fine-tuning (see "Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?").

Here's the unsettling coda. Deeper reasoning chains are also longer attack surfaces. The same extended elaboration that adversarial training cultivates creates more intervention points where a single corrupted step propagates: reasoning models lose 25–29% accuracy under multi-turn manipulative prompts, *more* than plain models (see "Why do reasoning models fail under manipulative prompts?" and "Are reasoning models actually more vulnerable to manipulation?"). Adversarial training that forges depth and adversarial prompting that exploits it are two sides of the same coin: every reasoning step you add is both a chance to think harder and a chance to be led astray.
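A back-of-the-envelope sketch shows why chain length matters. It assumes, purely for illustration, that each step is corrupted independently with the same small probability and that any corrupted step ruins the final answer. The numbers are hypothetical and are not meant to reproduce the 25–29% figure.

```python
def clean_chain_prob(p_corrupt: float, n_steps: int) -> float:
    """Probability that no step in an n-step chain is corrupted,
    assuming independent, identical per-step corruption probability."""
    return (1.0 - p_corrupt) ** n_steps


# Hypothetical per-step corruption rate under a manipulative prompt.
P_CORRUPT = 0.02

for n_steps in (5, 30):
    print(f"{n_steps:>2} steps: {clean_chain_prob(P_CORRUPT, n_steps):.2f}")
```

Under these assumptions a longer chain is strictly more exposed than a shorter one at the same per-step risk, which is the "more intervention points" argument in arithmetic form.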

Sources (9 notes)

Can imitating ChatGPT fool evaluators into thinking models improved?

Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method—better fundamentals, not shortcuts, drive real improvement.

Does critiquing errors teach deeper understanding than imitating correct answers?

Training models to critique noisy responses outperforms training on correct answers because critique forces engagement with failure modes and structural reasoning. Even imperfect critique supervision beats correct-answer imitation, showing how weak surface-pattern learning is for building genuine understanding.

Can adversarial critics replace task-specific verifiers for reasoning?

RARO uses an adversarial game where a critic discriminates expert from policy answers, eliminating the need for domain-specific verifiers while matching the scaling properties of verifier-based RL. The approach works across Countdown, DeepMath, and Poetry Writing tasks.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Does sequencing imitation then exploration training improve reasoning?

Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.

Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?

RLAG rewards both answer accuracy and explanation rationality by cycling between augmented and unaugmented generation, progressively internalizing coherent knowledge structures. This outperforms SFT because it prioritizes reasoning quality over token-level correctness.

Why do reasoning models fail under manipulative prompts?

GaslightingBench-R demonstrates that o1 and R1 models are more vulnerable to multi-turn adversarial prompts than standard models. Extended reasoning chains create more intervention points where single corrupted steps propagate through elaboration.

Are reasoning models actually more vulnerable to manipulation?

GaslightingBench-R shows that multi-turn manipulative prompts reduce reasoning model accuracy significantly more than standard models. Extended chains create more corruption points, allowing single wrong steps to propagate into confident incorrect conclusions.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Eliciting Reasoning in Language Models with Cognitive Tools (arXiv; match 2.62)
The Invisible Leash: Why RLVR May Not Escape Its Origin (arXiv; match 2.55)
LLMs Can Easily Learn to Reason from Demonstrations: Structure, not content, is what matters! (arXiv; match 2.43)
Reasoning Models Are More Easily Gaslighted Than You Think (arXiv; match 1.77)
Spurious Rewards: Rethinking Training Signals in RLVR (arXiv; match 1.73)
Escaping the Verifier: Learning to Reason via Demonstrations (arXiv; match 1.73)
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? (arXiv; match 1.70)
RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization (arXiv; match 1.69)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an AI researcher auditing claims about adversarial training vs. imitation in LLM reasoning. The question: why does adversarial training force deeper reasoning than surface imitation? A curated library (spanning 2023–2025, primarily 2024–2025) found:

— Pure imitation captures style, not substance: models imitating ChatGPT fool humans while closing no capability gap on novel tasks (2023).
— Training to critique noisy/wrong responses produces deeper understanding than training on clean correct answers; imperfect critique supervision beats correct-answer imitation (2025).
— Adversarial game setups (e.g., RARO: critic vs. policy) train strong reasoning without task-specific verifiers; the opponent manufactures difficulty (2024–2025).
— Five independent methods (RL steering, critique fine-tuning, decoding, feature steering, RLVR) surface latent reasoning already in base models; post-training *selects* rather than *creates* (2025).
— Curriculum sequencing (supervised imitation → adversarial RL) beats either alone; imitation scaffolds, adversarial phase sharpens (2025).
— Longer reasoning chains (the output of adversarial training) create more intervention points: reasoning models lose 25–29% accuracy under multi-turn manipulative prompts, *more* than baseline models (2025).

Anchor papers (verify; mind their dates):
- arXiv:2305.15717 (2023): The False Promise of Imitating Proprietary LLMs
- arXiv:2402.05808 (2024): Reverse Curriculum Reinforcement Learning
- arXiv:2501.17703 (2025): Critique Fine-Tuning
- arXiv:2506.09677 (2025): Reasoning Models Are More Easily Gaslighted

Your task:
(1) RE-TEST EACH CONSTRAINT. For each finding above, determine whether newer models, improved critique harnesses, ensemble/multi-agent orchestration, or recent evaluations have since relaxed or overturned it. Distinguish the durable question (likely still open) from the perishable limitation (possibly resolved by better tooling, data, or alignment). Where a constraint still holds, say so plainly with evidence.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~3 months. The library hints at a central tension: if reasoning is already latent, does adversarial training *create* depth or merely *unlock* it? Does that distinction collapse under scrutiny?
(3) Propose 2 research questions that ASSUME the regime has moved: e.g., if adversarial training is a selection signal over existing capability, what happens when the base model's latent reasoning is weak or misaligned with the adversary's criterion?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
