Does extended exoskeleton use eventually produce meaningful skill transfer?
This reads 'exoskeleton' as the external scaffolding a model leans on — imitating a stronger model, copying teacher-provided reasoning traces, riding instruction formats and reward crutches — and asks whether leaning on those props eventually hardens into real, transferable skill.
On that reading, the corpus gives a sharp and slightly uncomfortable answer: the brace mostly transfers the *shape* of competence, not competence itself. The clearest case is straight imitation. Models trained to copy ChatGPT learn to wear its confident, fluent style well enough to fool human judges, yet they close no actual capability gap: factuality and generalization to novel tasks don't move, because the ceiling is set by the base model, not the costume it puts on (see 'Can imitating ChatGPT fool evaluators into thinking models improved?'). Instruction tuning shows the same seam from another angle. Models trained on semantically empty or even deliberately wrong instructions perform about as well as those given correct ones, which means what the scaffold actually teaches is the *output format* (the exoskeleton's silhouette), not the understanding underneath (see 'Does instruction tuning teach task understanding or output format?').
The more interesting failure is when the exoskeleton is too good. Teacher-refined data that exceeds a student's learning frontier *degrades* the student's performance, even when the refinements are objectively higher quality: the student can't metabolize moves beyond its reach, so the better crutch makes for a worse walker (see 'Does teacher-refined data always improve student model performance?'). And conditioning a teacher on the correct answer plus verifier output produces clean, confident traces that students happily inherit, at the cost of out-of-distribution robustness, because the polished brace teaches them to suppress exactly the uncertainty they'd need when the support is removed (see 'Does richer teacher context hurt student generalization?'). The crutch doesn't just fail to transfer skill; it can quietly transfer the *absence* of one.
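The source note behind the first claim suggests that students filter refinements using their own statistical profile. A minimal sketch of that idea, assuming the profile is the mean per-token log-probability the student assigns to its own samples: the function name, the `z_max` tolerance, and the scores below are all hypothetical illustrations, not the procedure from the cited work.

```python
from statistics import mean, stdev


def filter_refinements(student_scores, candidates, z_max=2.0):
    """Keep teacher refinements the student finds about as likely as its own outputs.

    student_scores: mean per-token log-probabilities the student assigns to
        samples it generated itself (its "statistical profile" in this sketch).
    candidates: (text, score) pairs, where score is the student's mean
        per-token log-probability for a teacher-refined example.
    z_max: hypothetical tolerance, in standard deviations below the mean.
    """
    floor = mean(student_scores) - z_max * stdev(student_scores)
    return [text for text, score in candidates if score >= floor]


# Made-up scores for illustration only.
own = [-1.0, -1.2, -0.8, -1.1, -0.9]
refined = [("small fix", -1.1), ("expert rewrite", -3.0), ("tighter proof", -1.3)]
kept = filter_refinements(own, refined)
print(kept)
```

Under these assumptions the refinement the student finds far less likely than its own writing is dropped, however good it looks to the teacher, while refinements near the student's usual range are kept.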
But the question says 'eventually,' and that's where the corpus turns. Skill transfer does happen when the scaffolding is staged rather than permanent. Running supervised imitation *first* to build reasoning foundations, then switching to verifiable-reward training to sharpen against real outcomes, beats either method alone. The imitation phase isn't the destination; it's what makes the later reward signal informative, by producing reasonable attempts worth refining (see 'Does sequencing imitation then exploration training improve reasoning?'). The exoskeleton works precisely when it's worn temporarily and then shed.
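One way to see why reasonable attempts make an outcome reward informative is a back-of-the-envelope calculation. The sketch below assumes a setup the notes do not specify: a binary pass/fail reward, `k` independent rollouts per prompt, and a baseline equal to the group's mean reward, so a group whose rollouts all pass or all fail yields zero advantage for every rollout. The function name and the probabilities are illustrative.

```python
def prob_mixed_outcomes(p: float, k: int) -> float:
    """Probability that k independent pass/fail rollouts are not all identical.

    p: per-rollout success probability of the current policy.
    k: number of rollouts sampled for one prompt.
    Only mixed groups carry a learning signal under a group-mean baseline.
    """
    return 1.0 - p**k - (1.0 - p) ** k


# Compare a policy that almost never succeeds with ones that sometimes do.
for p in (0.001, 0.05, 0.3):
    print(f"p={p}: share of informative groups = {prob_mixed_outcomes(p, 8):.3f}")
```

With a success rate near zero, almost every group is all-fail and the reward has nothing to distinguish. On this reading, an imitation phase that lifts the success rate even modestly is what gives the later reward phase something to sharpen.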
There's also evidence that genuine transfer rides on structure already present, not bolted on from outside. Length generalization carries across related tasks because models reuse the same attention heads: pretrained models already contain the reusable computational scaffolding, and shorter tasks borrow it to extrapolate further (see 'Can length generalization transfer between different related tasks?'). That reframes the whole question: maybe the durable 'exoskeleton' is internal architecture being reused, not external imitation being copied. And when you *do* want external structure to produce transferable strategy, you have to train for it deliberately. A decoupled, trainable skill curator (separate from a frozen executor) learns to evolve a skill library away from generic verbose additions toward actionable cross-task meta-strategies that generalize across different backbones (see 'Can a separate trained curator improve skill libraries better than frozen agents?').
So: extended exoskeleton use rarely produces meaningful transfer on its own — left on, it teaches style, format, and false confidence. It produces transfer when it's a phase that ends (imitate then verify), when it tracks capability the learner can actually absorb, or when the supporting structure is itself trained toward portable meta-skills rather than imitation of a finished performance. The surprise worth taking away: the better and more comfortable the brace, the more likely it is to teach dependence instead of the thing you wanted it to teach.
Sources (7 notes)
Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method—better fundamentals, not shortcuts, drive real improvement.
Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full correct instructions (43% versus 42.6% for a random baseline). The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.
Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.
Teachers conditioned on correct answers and verifier output produce confident, concise traces that students inherit. This style suppresses uncertainty expression, optimizing in-domain performance while degrading generalization to out-of-distribution problems that require epistemic caution.
Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms either method in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.
Models trained jointly on related tasks reuse the same attention heads to handle length generalization, allowing shorter tasks to extrapolate beyond their training length. Pretrained models already contain this reusable computational scaffolding.
SkillOS shows that separating a trainable curator from a frozen executor, grouped by task streams, causes skill repositories to shift from generic verbose additions toward actionable execution logic and cross-task meta-strategies. The trained curator generalizes across different executor backbones and domains.