INQUIRING LINE


Fine-tuning a pretrained AI doesn't just add new skills — it can quietly corrupt the knowledge the model already had.

How much performance is lost when converting pretrained checkpoints versus training from scratch?

This reads the question as: when you adapt a pretrained model rather than build one fresh, how much of the original model's capability gets damaged in the process — and the corpus answers less by comparing against from-scratch training than by exposing the hidden costs of touching pretrained weights at all.

This explores what gets lost when you take a pretrained checkpoint and convert it, whether by fine-tuning, RL, or instruction-tuning, into a task-specialized model rather than training that capability from the ground up. The corpus doesn't offer a clean from-scratch benchmark, but it tells a sharper story: the loss isn't captured by a single accuracy number; it shows up as silent corruption of knowledge the base model already had. Direct fine-tuning damages where facts are stored, corrupting knowledge in the lower layers. That is why decoding-time proxy-tuning, which never updates the base weights, closes 88-91% of the alignment gap while beating direct fine-tuning on knowledge tasks (see "Can decoding-time tuning preserve knowledge better than weight fine-tuning?"). The cost of conversion, in other words, is partly self-inflicted by the conversion method.

That theme repeats across very different adaptation techniques. Representation fine-tuning makes the same bet from another angle: freeze the model's weights and learn interventions on its hidden representations instead of rewriting the weights, and you get 10-50x better parameter efficiency than LoRA while doing better on reasoning and instruction-following (see "Can editing hidden representations beat weight updates for finetuning?"). The recurring lesson is that the pretrained checkpoint is a fragile asset, and the more invasively you overwrite it, the more of it you lose. The methods that win are the ones that leave the original intact and steer it from the outside.

Reinforcement learning shows the steepest hidden losses. RL post-training doesn't broadly improve a model so much as collapse it onto a single dominant format inherited from pretraining, suppressing the other formats the base model could produce; which format wins depends on model scale, not necessarily on which one performs best (see "Does RL training collapse format diversity in pretrained models?"). Push RL on problems that are too hard and it's worse than wasted effort: the model learns degenerate shortcuts that contaminate capabilities it already had, so you come out behind where you started (see "Do overly hard RLVR samples actually harm model capabilities?"). Even the reward shape matters: binary correctness rewards don't penalize confident wrong answers, so they degrade calibration and teach the model to guess with high confidence (see "Does binary reward training hurt model calibration?").

There's also a surprising flip side: sometimes converting a checkpoint changes far less than you'd assume. Instruction tuning, it turns out, mostly teaches a model the shape of the output space, not the task. Models trained on semantically empty or deliberately wrong instructions perform comparably to those trained on correct ones, and instruction tuning barely edges out a random baseline (43% vs 42.6%) (see "Does instruction tuning teach task understanding or output format?"). So part of what 'conversion' adds is cosmetic formatting riding on capabilities that were already latent in the pretrained weights, which is exactly why the destructive methods feel like such a bad trade.

The thing you didn't know you wanted to know: the most interesting answer to 'how much do you lose by converting?' is that the smartest practitioners route around the question entirely. Branch-Train-MiX trains domain experts in parallel and then merges their feed-forward layers into a mixture-of-experts with learned routing, getting better accuracy-efficiency tradeoffs than synchronized training (see "Can asynchronous expert training beat synchronized distributed LLM training?"). That preserves each specialist instead of overwriting one model repeatedly. Pretrained checkpoints are valuable precisely because their internals are hard-won and easy to break, and the whole frontier of adaptation research is a search for ways to add capability without paying the conversion tax.

Sources (7 notes)

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.
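
The "distributional shift" in this note can be pictured as logit arithmetic at decoding time. The sketch below is a toy illustration, not the authors' code: it assumes a small tuned "expert" and its untuned "anti-expert" share a vocabulary with the large base model, and every logit value is made up. The function name `proxy_tuned_distribution` is hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D logit vector."""
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

def proxy_tuned_distribution(base_logits, expert_logits, antiexpert_logits):
    """Shift the frozen base model's logits by the tuned-minus-untuned offset."""
    return softmax(base_logits + (expert_logits - antiexpert_logits))

# Toy 4-token vocabulary; all numbers are invented for illustration.
base = np.array([2.0, 1.0, 0.5, 0.0])        # large base model, weights untouched
antiexpert = np.array([1.0, 1.0, 0.0, 0.0])  # small untuned model
expert = np.array([1.0, 2.5, 0.0, 0.0])      # small tuned model, prefers token 1

p_base = softmax(base)
p_proxy = proxy_tuned_distribution(base, expert, antiexpert)
```

In this toy case the base model's top token is 0 and the proxy-tuned top token is 1, yet no base parameter was modified; when the expert and anti-expert agree, the base distribution passes through unchanged.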

Can editing hidden representations beat weight updates for finetuning?

ReFT learns task-specific interventions on the hidden representations of a frozen model rather than updating its weights, with LoReFT (the low-rank linear subspace variant) dramatically outperforming LoRA across reasoning, instruction-following, and NLU benchmarks while using far fewer parameters.
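
A minimal sketch of what a low-rank subspace intervention looks like, assuming the LoReFT form h + Rᵀ(Wh + b − Rh) with R having orthonormal rows. The sizes, the random initialization, and the function name `loreft` are illustrative assumptions; in practice R, W, and b are trained while the model weights stay frozen.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2  # hidden size and intervention rank (toy values)

# R needs orthonormal rows: take Q from a reduced QR of a random d x r matrix.
Q, _ = np.linalg.qr(rng.normal(size=(d, r)))
R = Q.T                      # shape (r, d)
W = rng.normal(size=(r, d))  # stand-in for the learned linear projection
b = rng.normal(size=r)       # stand-in for the learned bias

def loreft(h):
    """Edit h only inside the r-dimensional subspace spanned by the rows of R."""
    return h + R.T @ (W @ h + b - R @ h)

h = rng.normal(size=d)  # a hidden state produced by the frozen model
h_new = loreft(h)
```

Inside the subspace the representation is overwritten with Wh + b; everything orthogonal to it is left alone. The intervention here has 2·r·d + r parameters, which is where the efficiency argument comes from.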

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
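
To see why a rare accidental success dominates the update, here is a small sketch of group-relative normalization. It assumes the common mean-and-standard-deviation form used in GRPO-style training with binary rewards; the group size of 16 and the function name `group_relative_advantages` are illustrative choices, not details from the note.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own group's mean and std."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

hard = [1] + [0] * 15       # one accidental success in 16 rollouts
medium = [1] * 8 + [0] * 8  # half of the rollouts succeed

adv_hard = group_relative_advantages(hard)
adv_medium = group_relative_advantages(medium)
```

With one success in sixteen, the lucky rollout gets an advantage of about 3.87 (the square root of 15), versus 1.0 when half the group succeeds. Whatever that single trajectory did, sound or not, is reinforced almost four times as strongly.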

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
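
The incentive argument can be checked numerically. The sketch below assumes the model states a confidence q alongside its answer, that the answer is right with true probability p, and that the combined reward is correctness minus the Brier penalty (q − y)²; that exact combination and both function names are assumptions made for illustration.

```python
import numpy as np

def expected_binary_reward(p, q):
    """Expected reward when only correctness is scored; q has no effect."""
    return p + 0.0 * q

def expected_calibrated_reward(p, q):
    """Expected correctness minus the expected Brier penalty (q - y)^2."""
    brier = p * (1.0 - q) ** 2 + (1.0 - p) * q ** 2
    return p - brier

p = 0.3                          # true chance the answer is right
qs = np.linspace(0.0, 1.0, 101)  # candidate stated confidences

binary = expected_binary_reward(p, qs)
calibrated = expected_calibrated_reward(p, qs)
best_q = qs[np.argmax(calibrated)]
```

Under the binary reward every confidence level earns the same expected reward, so nothing discourages claiming certainty. With the Brier term the expected reward peaks when the stated confidence equals the true success probability (0.3 here).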


Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions achieve performance comparable to those trained on full, correct instructions, and instruction tuning scores 43% against 42.6% for a random baseline. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Can asynchronous expert training beat synchronized distributed LLM training?

Branch-Train-MiX trains domain experts in parallel without synchronization overhead, merges their feed-forward parameters as MoE experts, and learns token-level routing, achieving better accuracy-efficiency tradeoffs than synchronized training or routing-free merging.
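
A toy sketch of the merge step described here: separately trained feed-forward blocks become the experts of one MoE layer, and a router picks the top-k experts per token. Everything below is a stand-in under stated assumptions: the weights are random rather than trained, the ReLU feed-forward shape, the sizes, and the names `make_ffn` and `moe_layer` are hypothetical, and the router would be learned after merging rather than sampled.

```python
import numpy as np

rng = np.random.default_rng(0)
d, hidden, n_experts, top_k = 4, 8, 3, 2  # toy sizes

def make_ffn():
    """Stand-in for one domain expert's feed-forward weights."""
    return rng.normal(size=(d, hidden)), rng.normal(size=(hidden, d))

experts = [make_ffn() for _ in range(n_experts)]  # trained separately in practice
router = rng.normal(size=(d, n_experts))          # learned after merging in practice

def ffn(x, w_in, w_out):
    """Two-layer feed-forward block with a ReLU nonlinearity."""
    return np.maximum(x @ w_in, 0.0) @ w_out

def moe_layer(x):
    """Route one token to its top-k experts and mix their outputs."""
    scores = x @ router
    chosen = np.argsort(scores)[-top_k:]
    gates = np.exp(scores[chosen] - scores[chosen].max())
    gates /= gates.sum()
    out = sum(g * ffn(x, *experts[i]) for g, i in zip(gates, chosen))
    return out, chosen, gates

token = rng.normal(size=d)
out, chosen, gates = moe_layer(token)
```

Each expert's weights are carried into the merged layer unchanged, which is the sense in which the specialists are preserved rather than overwritten.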

Papers this line draws on (8)

The research behind the notes this line reads, ranked by how closely each paper relates.

A Survey on Post-training of Large Language Models (1.70 match)
Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining (1.69 match)
Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs (1.68 match)
Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs (1.68 match)
Reinforcement Learning for Reasoning in Large Language Models with One Training Example (1.67 match)
Foundations of Large Language Models (1.66 match)
Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations? (1.63 match)
Absolute Zero: Reinforced Self-play Reasoning with Zero Data (1.61 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM researcher evaluating whether conversion losses (fine-tuning, RL, instruction-tuning) remain binding constraints or have been dissolved by newer methods, models, or evaluation. The question: **How much performance is lost when converting pretrained checkpoints versus training from scratch?**

What a curated library found — and when (findings span 2023–2026, but are dated claims, not current truth):
• Direct fine-tuning corrupts knowledge in lower layers; proxy-tuning at decode-time closes 88–91% of alignment gap without touching weights (2024–2025).
• Representation fine-tuning achieves 10–50× better parameter efficiency than LoRA while improving reasoning, by freezing the model's weights and intervening on its hidden representations (2024).
• RL post-training collapses models onto a single dominant format from pretraining, suppressing latent output diversity; collapse winner depends on scale, not performance (2025).
• Over-hard RL samples induce degenerate shortcuts that contaminate existing capabilities, leaving models worse than baseline (2026).
• Instruction-tuning teaches output format distribution, not task understanding; models trained on wrong instructions perform comparably to those trained on correct ones, and instruction tuning scores 43% against a random baseline's 42.6% (2023).
• Branch-Train-MiX merges specialist experts' feed-forward layers post-hoc, preserving specialists instead of overwriting one model repeatedly (2024).

Anchor papers (verify; mind their dates):
• arXiv:2305.11383 (2023) — Do Models Really Learn to Follow Instructions?
• arXiv:2404.03592 (2024) — ReFT: Representation Finetuning for Language Models
• arXiv:2504.07912 (2025) — Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining
• arXiv:2606.28388 (2026) — Mechanistically Interpreting the Role of Sample Difficulty in RLVR

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For every finding above — proxy-tuning's 88–91% gap closure, ReFT's parameter efficiency, RL collapse, degenerate shortcuts, instruction-tuning's cosmetic role — judge whether models released in the last 6 months, newer training recipes (e.g., continued pretraining before adaptation, hybrid RL schedules), or better evals (e.g., mechanistic probes for knowledge retention) have relaxed or overturned it. Separate the durable question (does conversion degrade latent capabilities?) from the perishable claim (proxy-tuning or ReFT are the best mitigation). Where a constraint appears to still hold, say so plainly and cite what evidence backs it.

(2) **Surface the strongest contradicting or superseding work** from the last ~6 months. Look for papers that show conversion losses are negligible, or that newer optimization methods (e.g., LoRA variants, adaptive RL reward shaping) close the gap entirely, or that from-scratch training is no longer cheaper/better than smart conversion.

(3) **Propose 2 research questions that assume the regime may have moved:** (a) If a constraint WAS solved, what new bottleneck did it uncover? (b) If it still holds, what is the *mechanistic* reason, and can you isolate it in toy models?

**Cite arXiv IDs; flag anything you cannot ground in a real paper.**
