INQUIRING LINE

Which finetuning method works best across different task and data regimes?

This explores whether there's a single best finetuning method — and the corpus's answer is that 'best' depends on your model size, your task, and what you're optimizing for, with a recurring tension between methods that update weights and methods that leave them alone.


This explores whether one finetuning method wins across the board, and the corpus's clearest finding is that the question itself is mis-framed: the right method depends on your base model size, the task's reward structure, and whether you care more about preserving knowledge or changing behavior. The most fundamental result is that finetuning follows a multiplicative scaling law in which a larger base model helps far more than more pretraining data, while adding tunable parameters helps very little (How should finetuning scale with model and data size?). So before picking a method, the lever that matters most is what you're starting from.

A striking thread is that the lightest-touch methods often win. Representation finetuning (ReFT) edits frozen hidden states instead of weights and is 10–50x more parameter-efficient than LoRA across reasoning, instruction-following, and NLU (Can editing hidden representations beat weight updates for finetuning?). Proxy-tuning goes further and never touches the base model's weights at all, steering outputs at decoding time, and it actually surpasses direct finetuning on knowledge tasks because direct weight updates corrupt knowledge stored in lower layers (Can decoding-time tuning preserve knowledge better than weight fine-tuning?). That damage isn't a bug you can tune away: pretraining and finetuning act on different parts of the network, with pretraining scale driving factual accuracy and finetuning scale driving helpfulness (Do pretraining and fine-tuning scale independently in language models?). If you finetune hard for behavior, you risk eroding the facts.

The method also has to match the task regime. Preference tuning (RLHF) reduces output diversity in code, where convergence on correct answers is rewarded, but increases it in creative writing, where distinctiveness pays: the same method flips direction depending on the domain (Does preference tuning always reduce diversity the same way?). Reinforcement learning carries its own caveats. Within the first epoch it tends to collapse onto a single dominant output format inherited from pretraining (Does RL training collapse format diversity in pretrained models?), and on reasoning tasks it often sharpens memorized template-matching rather than installing genuine procedures, as out-of-distribution tests expose (Do fine-tuned language models actually learn optimization procedures?). So RL is powerful for shaping behavior but a poor bet if you want transferable reasoning.

For multi-task settings the failure mode is interference, and the corpus's answer is structural rather than algorithmic: isolating each task's core parameters and freezing them while merging the rest beats standard multi-task finetuning (Can isolating task-specific parameters prevent multi-task fine-tuning interference?). What you feed the method matters as much as the method itself: instruction tuning largely teaches the output format, not task understanding, since even semantically empty instructions get comparable results (Does instruction tuning teach task understanding or output format?), while teacher-refined data backfires when it overshoots the student model's learning frontier, even if it is objectively higher quality (Does teacher-refined data always improve student model performance?).

The takeaway you didn't know you wanted: there's no universal champion, but there is a default worth reaching for — start from the largest base model you can, and prefer interventions that leave pretrained weights intact (representation editing, decoding-time proxies) unless the task specifically rewards reshaping behavior. Match the heavier methods to the regime, not the other way around.


Sources (10 notes)

How should finetuning scale with model and data size?

Systematic experiments across 1B–16B models reveal finetuning follows a power-based multiplicative scaling law. Larger base models improve finetuning more than more pretraining data, while increasing PET parameters provides minimal benefit.
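As a rough sketch of what "multiplicative" means here, a joint law of this kind is commonly written in the form below. The symbols are assumptions introduced for illustration and do not appear in the note: $X$ is the factor being scaled (base model size, pretraining data size, or PET parameter count), $D_f$ is the finetuning data size, and $A$, $\alpha$, $\beta$, $E$ are fitted constants.

```latex
\hat{\mathcal{L}}(X, D_f) \;=\; A \cdot \frac{1}{X^{\alpha}} \cdot \frac{1}{D_f^{\beta}} \;+\; E
```

Because the two power terms multiply instead of adding, doubling $X$ shrinks the reducible loss by the same factor $2^{-\alpha}$ at every finetuning data size. Under this reading, "base model size matters most" corresponds to a larger fitted $\alpha$ for model size than for pretraining data or PET parameters.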

Can editing hidden representations beat weight updates for finetuning?

ReFT learns task-specific interventions on frozen model representations rather than updating weights, with LoReFT (low-rank linear subspace variant) dramatically outperforming LoRA across reasoning, instruction-following, and NLU benchmarks while using far fewer parameters.
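A minimal sketch of the kind of low-rank intervention LoReFT applies to a frozen hidden state, assuming the commonly cited form h + Rᵀ(Wh + b − Rh). The function names, the pure-Python list representation, and the toy numbers are all illustrative assumptions, not the authors' implementation.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def loreft(h, R, W, b):
    """Edit hidden state h inside the r-dimensional subspace spanned by R's rows.

    R and W are r x d (lists of rows), b has length r. R is assumed to have
    orthonormal rows. The base model's weights are never modified.
    """
    projected = matvec(R, h)                                 # R h: current subspace coordinates
    target = [wh + bi for wh, bi in zip(matvec(W, h), b)]    # W h + b: learned target coordinates
    delta = [t - p for t, p in zip(target, projected)]
    # Map the correction back to the full hidden dimension with R^T.
    return [
        h_j + sum(R[i][j] * delta[i] for i in range(len(R)))
        for j, h_j in enumerate(h)
    ]


# Toy example (hypothetical numbers): d = 3, r = 1.
h = [1.0, 2.0, 3.0]
R = [[1.0, 0.0, 0.0]]   # subspace = first coordinate
W = [[0.0, 0.0, 0.0]]
b = [5.0]
edited = loreft(h, R, W, b)
```

In this toy case only the coordinate inside the subspace changes, while the rest of the hidden state passes through untouched. Each such intervention has 2rd + r trainable numbers, which is where the parameter savings relative to weight-space adapters would come from.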

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.
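The decoding-time shift described above can be sketched as logit arithmetic: add to the large base model's next-token logits the difference between a small tuned model and its untuned counterpart. This is a sketch under that assumption; the function names and the toy three-token vocabulary are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def proxy_tuned_probs(base_logits, tuned_small_logits, untuned_small_logits):
    """Next-token distribution of the base model, steered by a small tuned model.

    Only logits are combined; the large model's weights are left untouched.
    """
    shifted = [
        base + (tuned - untuned)
        for base, tuned, untuned in zip(
            base_logits, tuned_small_logits, untuned_small_logits
        )
    ]
    return softmax(shifted)


# Toy 3-token vocabulary; all numbers are made up for illustration.
base = [2.0, 1.0, 0.0]           # large base model prefers token 0
small_untuned = [1.0, 1.0, 1.0]  # small untuned model has no preference
small_tuned = [1.0, 3.0, 1.0]    # small tuned model learned to prefer token 1
probs = proxy_tuned_probs(base, small_tuned, small_untuned)
best_token = max(range(len(probs)), key=lambda i: probs[i])
```

When the small tuned and untuned models agree, the shift is zero and the base model's distribution is returned unchanged, which is one way to see why knowledge held in the base weights is left intact.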

Do pretraining and fine-tuning scale independently in language models?

Emulated Fine-Tuning reveals that scaling pretraining improves factual knowledge while scaling fine-tuning improves behavioral helpfulness. This decoupling has architectural roots: pretraining enriches lower-layer knowledge storage, while fine-tuning modifies upper-layer behavior expression.

Does preference tuning always reduce diversity the same way?

RLHF reduces lexical-syntactic diversity in code generation but increases it in creative writing. The direction depends on what each domain incentivizes: code rewards convergence toward correct solutions, while creative writing rewards stylistic distinctiveness.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Do fine-tuned language models actually learn optimization procedures?

Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.

Can isolating task-specific parameters prevent multi-task fine-tuning interference?

Research shows that identifying core parameter regions per task, clustering overlapping tasks, and freezing core parameters while geometrically merging non-core parameters consistently outperforms standard multi-task fine-tuning. Temporal task scheduling alone proves insufficient without explicit structural parameter isolation.
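The freeze-core, merge-the-rest idea can be sketched on flat parameter vectors. Two loud assumptions: "core" is taken here to mean the k parameters that moved most during a task's finetuning, and a plain average stands in for the geometric merge of non-core parameters. Task clustering is omitted, and every name below is hypothetical.

```python
def core_indices(base, tuned, k):
    """Indices of the k parameters with the largest update magnitude (assumed 'core')."""
    ranked = sorted(
        range(len(base)),
        key=lambda i: abs(tuned[i] - base[i]),
        reverse=True,
    )
    return set(ranked[:k])


def isolate_and_merge(base, tuned_by_task, k):
    """Keep each task's core parameters verbatim; average everything else.

    Returns the merged parameter list and the set of frozen (core) indices.
    If two tasks claim the same index, the task seen first keeps it.
    """
    models = list(tuned_by_task.values())
    # Stand-in for the geometric merge: simple mean across task models.
    merged = [sum(m[i] for m in models) / len(models) for i in range(len(base))]
    frozen = set()
    for tuned in models:
        for i in core_indices(base, tuned, k):
            if i not in frozen:
                merged[i] = tuned[i]   # transplant the task's core value unchanged
                frozen.add(i)
    return merged, frozen


# Toy example with made-up numbers: two tasks, four parameters, one core slot each.
base = [0.0, 0.0, 0.0, 0.0]
tuned_by_task = {
    "task_a": [1.0, 0.1, 0.0, 0.1],
    "task_b": [0.0, 0.1, 2.0, 0.1],
}
merged, frozen = isolate_and_merge(base, tuned_by_task, k=1)
```

A naive average would have halved each task's largest update (0.5 and 1.0 here). Isolation keeps them at full strength, and the frozen set marks which indices any later training pass should leave alone.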

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions, and instruction tuning itself reaches only 43%, barely above a random baseline at 42.6%. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Does teacher-refined data always improve student model performance?

Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM finetuning researcher. The question: which finetuning method generalizes best across task and data regimes?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–10/2025; treat these as perishable constraints:
• Finetuning follows multiplicative scaling law: base model size >> pretraining data or PET parameter count (2024-02)
• Representation finetuning (ReFT) is 10–50× more parameter-efficient than LoRA; proxy-tuning at decode-time preserves pretrained knowledge better than direct weight updates (2024-04)
• Preference tuning (RLHF) reduces diversity in code but increases it in creative writing—domain-dependent flip (2025-04); RL converges to single pretraining distribution format within epoch 1 and often sharpens memorized templates rather than installing transferable reasoning (2025-04)
• Instruction tuning teaches output format, not task understanding; semantically empty instructions yield comparable results (2023-05)
• Teacher-refined data backfires without student model selection; multi-task finetuning fails via interference unless core parameters are isolated and frozen (2025-08)

Anchor papers (verify; mind their dates):
• arXiv:2402.17193 (2024-02) — scaling laws across methods
• arXiv:2404.03592 (2024-04) — ReFT's efficiency gains
• arXiv:2504.07912 (2025-04) — RL's convergence collapse
• arXiv:2508.21741 (2025-08) — parameter isolation for multi-task

Your task:
(1) RE-TEST EACH CONSTRAINT. For the multiplicative scaling law, ReFT/proxy superiority, RL's format collapse, and instruction-tuning's format-only learning: have newer models (o3, Grok-3, etc.), training procedures (curriculum, test-time adaptation), or evals (OOD reasoning benches) since relaxed any of these? Distinguish durable insight (base size matters most) from perishable limitation (ReFT beats LoRA by X factor).
(2) Surface the strongest work from the last 3 months that contradicts or supersedes the "lightest-touch wins" narrative—especially if larger-scale RL or full-weight finetuning has re-emerged as competitive.
(3) Propose 2 research questions that assume the regime may have shifted: e.g., does scaling RL compute change the template-collapse finding? Can hybrid methods (frozen core + adaptive layers) outperform isolation-only?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
