INQUIRING LINE

If an AI trains by copying a smarter model's outputs, does it gain real capability — or just learn to dress the part?

Does extended exoskeleton use eventually produce meaningful skill transfer?

This reads 'exoskeleton' as the external scaffolding a model leans on — imitating a stronger model, copying teacher-provided reasoning traces, riding instruction formats and reward crutches — and asks whether leaning on those props eventually hardens into real, transferable skill.

On that reading, the corpus gives a sharp and slightly uncomfortable answer: the brace mostly transfers the *shape* of competence, not competence itself. The clearest case is straight imitation. Models trained to copy ChatGPT learn to wear its confident, fluent style well enough to fool human judges, yet close no actual capability gap. Factuality and generalization to novel tasks don't move, because the ceiling is set by the base model, not the costume it puts on (see: Can imitating ChatGPT fool evaluators into thinking models improved?). Instruction tuning shows the same seam from another angle: models trained on semantically empty or even deliberately wrong instructions perform about as well as those given correct ones, which means what the scaffold actually teaches is the *output format*, the exoskeleton's silhouette, not the understanding underneath (see: Does instruction tuning teach task understanding or output format?).

The more interesting failure is when the exoskeleton is too good. Teacher-refined data that exceeds a student's learning frontier *degrades* the student's performance, even when the refinements are objectively higher quality: the student can't metabolize moves beyond its reach, so the better crutch makes for a worse walker (see: Does teacher-refined data always improve student model performance?). And conditioning a teacher on the correct answer plus verifier output produces clean, confident traces that students happily inherit, at the cost of out-of-distribution robustness, because the polished brace teaches them to suppress exactly the uncertainty they'd need when the support is removed (see: Does richer teacher context hurt student generalization?). The crutch doesn't just fail to transfer skill; it can quietly transfer the *absence* of one.

But the question says 'eventually,' and that's where the corpus turns. Skill transfer does happen when the scaffolding is staged rather than permanent. Running supervised imitation *first* to build reasoning foundations, then switching to verifiable-reward training to sharpen against real outcomes, beats either method alone. The imitation phase isn't the destination; it's what makes the later reward signal informative by producing reasonable attempts worth refining (see: Does sequencing imitation then exploration training improve reasoning?). The exoskeleton works precisely when it's worn temporarily and then shed.

There's also evidence that genuine transfer rides on structure already present, not bolted on from outside. Length generalization carries across related tasks because models trained jointly on them reuse the same attention heads: pretrained models already contain the reusable computational scaffolding, and shorter tasks borrow it to extrapolate further (see: Can length generalization transfer between different related tasks?). That reframes the whole question: maybe the durable 'exoskeleton' is internal architecture being reused, not external imitation being copied. And when you *do* want external structure to produce transferable strategy, you have to train for it deliberately. A decoupled, trainable skill curator (separate from a frozen executor) learns to evolve a skill library away from generic verbose additions toward actionable cross-task meta-strategies that generalize across different executor backbones (see: Can a separate trained curator improve skill libraries better than frozen agents?).

So: extended exoskeleton use rarely produces meaningful transfer on its own — left on, it teaches style, format, and false confidence. It produces transfer when it's a phase that ends (imitate then verify), when it tracks capability the learner can actually absorb, or when the supporting structure is itself trained toward portable meta-skills rather than imitation of a finished performance. The surprise worth taking away: the better and more comfortable the brace, the more likely it is to teach dependence instead of the thing you wanted it to teach.

Sources (7 notes)

Can imitating ChatGPT fool evaluators into thinking models improved?

Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method—better fundamentals, not shortcuts, drive real improvement.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions, reaching 43% against a random baseline of 42.6%. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Does teacher-refined data always improve student model performance?

Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.
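
One way to picture the student-side filter described above is the minimal sketch below. It is not the method from the underlying paper: the `Pair` fields, the use of the student's own loss as a difficulty proxy, and the `mean + k * std` cutoff are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Pair:
    original: str
    refined: str
    # Hypothetical scores: the student's own average loss on each version.
    student_loss_original: float
    student_loss_refined: float


def select_for_student(pairs, k=1.0):
    """Keep a teacher refinement only if it stays within the student's reach.

    'Reach' is an assumed proxy: mean + k * std of the student's loss on
    the original data. A refinement the student finds harder than that is
    treated as beyond its frontier, and the original is kept instead.
    """
    base = [p.student_loss_original for p in pairs]
    frontier = mean(base) + k * pstdev(base)
    chosen = []
    for p in pairs:
        if p.student_loss_refined <= frontier:
            chosen.append(p.refined)   # compatible improvement
        else:
            chosen.append(p.original)  # too far beyond the student
    return chosen


pairs = [
    Pair("a", "A", 1.0, 1.5),
    Pair("b", "B", 2.0, 4.0),
    Pair("c", "C", 3.0, 2.5),
]
selected = select_for_student(pairs)
```

The point of the sketch is only that the cutoff comes from the student's own statistics, so the same teacher data would be filtered differently for a stronger or weaker student.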

Does richer teacher context hurt student generalization?

Teachers conditioned on correct answers and verifier output produce confident, concise traces that students inherit. This style suppresses uncertainty expression, optimizing in-domain performance while degrading generalization to out-of-distribution problems that require epistemic caution.

Does sequencing imitation then exploration training improve reasoning?

Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.
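
A small worked calculation may help show why an imitation phase can make outcome rewards informative. It is a sketch under stated assumptions, not a result from the paper: rollouts are independent, the reward is binary, and training uses a group-relative baseline, so a group of rollouts that all fail or all pass carries no signal. The success rates are illustrative.

```python
def informative_fraction(p_correct: float, group_size: int) -> float:
    """Probability that a group of rollouts has mixed outcomes.

    Assumes independent rollouts, a binary verifiable reward, and a
    group-relative baseline, so an all-fail or all-pass group gives
    the learner nothing to contrast.
    """
    all_fail = (1.0 - p_correct) ** group_size
    all_pass = p_correct ** group_size
    return 1.0 - all_fail - all_pass


# Illustrative success rates, not measurements from any paper.
cold_start = informative_fraction(0.001, 8)      # reward training from scratch
after_imitation = informative_fraction(0.3, 8)   # after an imitation phase
print(f"cold start: {cold_start:.3f}, after imitation: {after_imitation:.3f}")
```

With these assumed numbers, fewer than 1% of groups are informative at cold start, against more than 90% once imitation has lifted the success rate, which is one way to read the claim that imitation creates rollouts the reward phase can sharpen.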

Can length generalization transfer between different related tasks?

Models trained jointly on related tasks reuse the same attention heads to handle length generalization, allowing shorter tasks to extrapolate beyond their training length. Pretrained models already contain this reusable computational scaffolding.

Can a separate trained curator improve skill libraries better than frozen agents?

SkillOS shows that separating a trainable curator from a frozen executor, grouped by task streams, causes skill repositories to shift from generic verbose additions toward actionable execution logic and cross-task meta-strategies. The trained curator generalizes across different executor backbones and domains.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning (3.24 match, arXiv)
Selective Reflection-Tuning: Student-Selected Data Recycling for LLM Instruction-Tuning (1.67 match, arXiv)
Extrapolation by Association: Length Generalization Transfer in Transformers (0.91 match, arXiv)
SkillOS: Learning Skill Curation for Self-Evolving Agents (0.90 match, arXiv)
The False Promise of Imitating Proprietary LLMs (0.89 match, arXiv)
Do Models Really Learn to Follow Instructions? An Empirical Study of Instruction Tuning (0.89 match, arXiv)
A Survey on Post-training of Large Language Models (0.87 match, arXiv)
Exploring Format Consistency for Instruction Tuning (0.87 match, arXiv)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about skill transfer from external scaffolding in LLM training. This question remains open: does extended exoskeleton use (imitation, instruction templates, teacher refinement, verifier conditioning) eventually produce meaningful generalization, or does it teach style and dependence instead?

What a curated library found — and when (dated claims, not current truth): Findings span 2023–2026.
• Imitation of proprietary models captures fluent *style* but closes no capability gap; base model ceiling is binding (~2023).
• Instruction tuning teaches output *format* distribution, not task understanding, even with deliberately wrong instructions (~2023).
• Teacher-refined data exceeding student's frontier *degrades* performance; better scaffolds can cripple learning by blocking necessary struggle (~2024).
• Richer teacher context (correct answers + verifier output) produces confident student traces but erodes out-of-distribution robustness (~2025).
• Staged scaffolding (imitation → verifiable-reward training) beats either alone; shedding the brace is the mechanism of transfer (~2024–2025).
• Length generalization transfers via reused internal attention structure, not external imitation (~2025).
• Trainable, decoupled skill curators evolve meta-strategies that generalize across tasks and backbones; skill transfer requires intentional training, not passive copying (~2026).

Anchor papers (verify; mind their dates):
• arXiv:2305.15717 (2023) — False Promise of Imitating Proprietary LLMs
• arXiv:2305.11383 (2023) — Do Models Really Learn to Follow Instructions?
• arXiv:2402.05808 (2024) — Reverse Curriculum Reinforcement Learning
• arXiv:2605.06614 (2026) — SkillOS: Skill Curation for Self-Evolving Agents

Your task:
(1) RE-TEST THE CONSTRAINT that external scaffolding fails to transfer skill. Judge whether newer optimizers, curriculum scheduling (e.g., Omni-Thinker's hybrid rewards), mechanistic interpretability of sample difficulty, or post-hoc skill distillation have since *relaxed* this limit. Separate the durable finding (imitation alone doesn't close capability gaps) from what may be outdated (staged or trained curation might now reliably work). Cite the paper resolving each.
(2) Surface the *strongest contradicting* work from the last 6 months. The library notes that self-distillation can degrade reasoning (2026); does any concurrent work show conditions under which it reliably transfers?
(3) Propose two research questions that assume the regime has shifted: (a) Can mechanistic analysis of why sample difficulty matters (2605.28388) guide *automated* stage-selection in curriculum scaffolding? (b) Do agents trained with SkillOS-style curators exhibit verifiable transfer to unseen task families, or does portability still require task-family pretraining?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
