INQUIRING LINE

Why does parameter-efficient tuning scaling fail to improve finetuning performance?

This explores why adding more trainable parameters to parameter-efficient tuning (PET) methods like LoRA doesn't reliably buy you better finetuning — and what the corpus suggests is actually doing the work instead.


The most direct answer comes from the scaling-law work: finetuning follows a *multiplicative* power law in which scaling the base model helps more than scaling pretraining data, while increasing the number of PET parameters yields minimal returns (see: How should finetuning scale with model and data size?). In other words, PET capacity isn't the bottleneck; the knowledge already sitting in the pretrained weights is. You can't add adapter parameters your way into capability the base model never had.
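
To make the "multiplicative" part concrete, here is a minimal sketch assuming a joint law of the general shape L = A / (X^alpha * D_f^beta) + E, where X is whichever factor is being scaled and D_f is the finetuning data size. The function name, the constants, and both exponents are hypothetical and chosen only for illustration; they are not fitted values from the scaling-law work.

```python
def finetune_loss(x, d_f, a=10.0, alpha=0.3, beta=0.1, e=1.0):
    """Multiplicative joint power law: L = A / (x**alpha * d_f**beta) + E.

    x     : the factor being scaled (model size, PET parameter count, ...)
    d_f   : finetuning data size
    a, e  : scale constant and irreducible loss (hypothetical values)
    alpha : exponent for the scaled factor (hypothetical)
    beta  : exponent for finetuning data (hypothetical)
    """
    return a / (x ** alpha * d_f ** beta) + e


def gain_from_doubling(alpha, x=1.0, d_f=1e5):
    """Loss reduction obtained by doubling x, holding finetuning data fixed."""
    before = finetune_loss(x, d_f, alpha=alpha)
    after = finetune_loss(2 * x, d_f, alpha=alpha)
    return before - after


# Hypothetical exponents: a sizeable one for base-model size,
# a near-zero one for the number of PET parameters.
ALPHA_MODEL_SIZE = 0.30
ALPHA_PET_PARAMS = 0.01

gain_model = gain_from_doubling(ALPHA_MODEL_SIZE)
gain_pet = gain_from_doubling(ALPHA_PET_PARAMS)

print(f"doubling base-model size lowers loss by {gain_model:.3f}")
print(f"doubling PET parameters lowers loss by {gain_pet:.3f}")
```

Under a law of this shape, a factor with a near-zero exponent barely moves the loss no matter how far it is scaled, which is the pattern the note reports for PET parameters.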

Why would extra parameters be wasted? Because finetuning and pretraining touch different parts of the model. One line of work shows the two scale almost independently: pretraining enriches factual knowledge stored in lower layers, while finetuning mostly modifies *behavior expression* in upper layers, meaning helpfulness, style, and format (see: Do pretraining and fine-tuning scale independently in language models?). So a bigger adapter is just a bigger knob on a layer that was only ever going to adjust surface behavior. This helps explain a recurring observation: finetuning makes outputs *look* right without making them right. Supervised finetuning on optimization problems produces clean JSON and valid structure while the underlying solutions remain physically infeasible; the model learns the surface features of good answers, not the reasoning to construct them (see: Does supervised fine-tuning actually improve reasoning on optimization problems?). More parameters sharpen the costume, not the competence.

The more interesting twist is that *where* you intervene beats *how much* you tune. Representation finetuning (ReFT) edits hidden representations of a frozen model rather than updating its weights, and its low-rank variant (LoReFT) beats LoRA across reasoning and instruction benchmarks while using 10–50× fewer parameters (see: Can editing hidden representations beat weight updates for finetuning?). That's the inverse of the scaling intuition: a smaller, better-placed intervention wins. The same theme runs through decoding-time methods. Proxy-tuning leaves base weights untouched, closes 88–91% of the alignment gap, and actually *outperforms* direct finetuning on knowledge tasks, because direct weight updates corrupt the knowledge stored in lower layers (see: Can decoding-time tuning preserve knowledge better than weight fine-tuning?). SoftCoT makes the same bet from another angle: freeze the backbone, delegate the new reasoning to a small auxiliary model, and you avoid catastrophic forgetting (see: Can continuous reasoning avoid forgetting in instruction-tuned models?).
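
Both interventions are small enough to sketch in a few lines. The block below is a toy illustration in plain Python, not either paper's implementation. It assumes the LoReFT edit has the form h + R^T (W h + b - R h) with orthonormal rows in R, and that proxy-tuning adds the logit difference between a small tuned "expert" and its untuned "anti-expert" to the base model's logits. All function names, vectors, and numbers are hypothetical.

```python
import math


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def loreft_edit(h, R, W, b):
    """LoReFT-style edit of a frozen hidden state: h + R^T (W h + b - R h).

    R (r x d, orthonormal rows) picks the subspace to edit;
    W (r x d) and b (length r) define the target values in that subspace.
    """
    delta = [wh + bi - rh for wh, bi, rh in zip(matvec(W, h), b, matvec(R, h))]
    # Project the r-dimensional correction back into the d-dimensional space.
    back = [sum(R[i][j] * delta[i] for i in range(len(R))) for j in range(len(h))]
    return [hj + bj for hj, bj in zip(h, back)]


def proxy_tuned_probs(base, expert, antiexpert):
    """Shift base logits by (expert - antiexpert), then apply a softmax."""
    shifted = [b + (e - a) for b, e, a in zip(base, expert, antiexpert)]
    top = max(shifted)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in shifted]
    total = sum(exps)
    return [x / total for x in exps]


# Rank-1 edit along the first coordinate; the other coordinates are untouched.
h = [1.0, 2.0, 3.0]
edited = loreft_edit(h, R=[[1.0, 0.0, 0.0]], W=[[0.0, 0.0, 0.0]], b=[5.0])
print(edited)  # [5.0, 2.0, 3.0]

# The base model prefers token 0; the expert/anti-expert gap favors token 1.
base_logits = [2.0, 1.0, 0.0]
probs = proxy_tuned_probs(base_logits, expert=[0.0, 3.0, 0.0], antiexpert=[0.0, 1.0, 0.0])
print(probs.index(max(probs)))  # 1
```

In both cases the base model's weights never change: the first function rewrites a low-rank slice of an activation, and the second only adjusts the output distribution at decoding time.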

There are also signs that the *signal*, not the parameter count, decides whether finetuning sticks. DPO with explicit negative examples beats plain SFT for small models on function calling, precisely because it targets the rigid format failures SFT can't fix (see: Can small models match large models on function calling?). When multiple tasks are involved, structurally isolating each task's core parameters beats throwing everything into one undifferentiated finetune (see: Can isolating task-specific parameters prevent multi-task fine-tuning interference?). The consistent picture across all of these: finetuning performance is gated by base-model knowledge, by which layers you touch, and by the quality of your training signal, none of which a larger adapter addresses. Scaling PET parameters fails because it optimizes the one dimension that turns out not to matter.
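
To see why explicit negatives change the training signal, here is a minimal sketch of the standard DPO objective for a single preference pair: the loss is -log sigmoid(beta * margin), where the margin compares how far the policy has moved toward the chosen response and away from the rejected one, relative to a frozen reference model. The log-probabilities below are made-up numbers, and this is not the training setup from the function-calling work.

```python
import math


def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities.

    pol_* : log-probabilities under the policy being trained
    ref_* : log-probabilities under the frozen reference model
    beta  : strength of the implicit KL constraint (hypothetical value)
    """
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    # -log(sigmoid(z)) equals log(1 + exp(-z)).
    return math.log1p(math.exp(-beta * margin))


# Hypothetical pair: "chosen" is a well-formed function call,
# "rejected" is a malformed one sampled from the same prompt.
neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)  # policy identical to the reference
good = dpo_loss(-2.0, -9.0, -4.0, -4.0)     # policy favors the well-formed call
bad = dpo_loss(-9.0, -2.0, -4.0, -4.0)      # policy favors the malformed call

print(f"neutral={neutral:.3f} good={good:.3f} bad={bad:.3f}")
```

Plain SFT only raises the likelihood of good outputs. The rejected term here gives the model an explicit gradient away from the malformed output, which is the kind of failure the note says SFT alone struggles with.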


Sources (8 notes)

How should finetuning scale with model and data size?

Systematic experiments across 1B–16B models reveal finetuning follows a power-based multiplicative scaling law. Larger base models improve finetuning more than more pretraining data, while increasing PET parameters provides minimal benefit.

Do pretraining and fine-tuning scale independently in language models?

Emulated Fine-Tuning reveals that scaling pretraining improves factual knowledge while scaling fine-tuning improves behavioral helpfulness. This decoupling has architectural roots: pretraining enriches lower-layer knowledge storage, while fine-tuning modifies upper-layer behavior expression.

Does supervised fine-tuning actually improve reasoning on optimization problems?

Supervised fine-tuning makes model outputs look correct—proper JSON structure, valid identifiers, expected sections—without making them physically feasible. The model learns surface features of solutions, not the reasoning to construct valid ones.

Can editing hidden representations beat weight updates for finetuning?

ReFT learns task-specific interventions on frozen model representations rather than updating weights, with LoReFT (low-rank linear subspace variant) dramatically outperforming LoRA across reasoning, instruction-following, and NLU benchmarks while using far fewer parameters.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.

Can continuous reasoning avoid forgetting in instruction-tuned models?

SoftCoT avoids catastrophic forgetting by keeping the main LLM frozen while delegating soft thought generation to a small auxiliary model. This architectural separation maintains pre-trained knowledge while enabling continuous reasoning.

Can small models match large models on function calling?

Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.

Can isolating task-specific parameters prevent multi-task fine-tuning interference?

Research shows that identifying core parameter regions per task, clustering overlapping tasks, and freezing core parameters while geometrically merging non-core parameters consistently outperforms standard multi-task fine-tuning. Temporal task scheduling alone proves insufficient without explicit structural parameter isolation.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-evaluating whether parameter-efficient tuning (PET) scaling remains bottlenecked by base-model capacity or whether newer methods, tooling, or training regimes have relaxed this constraint.

What a curated library found — and when (dated claims, not current truth): These findings span 2023–2025.
• Finetuning follows multiplicative scaling where base-model size dominates; PET parameter count yields minimal returns (2024).
• Pretraining and finetuning scale almost independently: pretraining builds lower-layer factual knowledge; finetuning modifies upper-layer behavior expression, making larger adapters ineffective at surfacing new capability (2024).
• Supervised finetuning improves response formatting without improving underlying reasoning or correctness — structure wins, competence stalls (2024).
• Representation finetuning (ReFT) beats LoRA while using 10–50× fewer parameters, and decoding-time proxy-tuning beats direct finetuning on knowledge tasks without touching base weights, suggesting *where* you intervene matters far more than parameter count (2024–2025).
• DPO with negative examples and task-core parameter isolation outperform undifferentiated finetuning, implicating signal quality and structural targeting, not capacity (2024–2025).

Anchor papers (verify; mind their dates):
• arXiv:2402.17193 (2024-02) — scaling laws and finetuning method interaction
• arXiv:2404.03592 (2024-04) — ReFT as alternative to weight-update scaling
• arXiv:2508.21741 (2025-08) — parameter isolation and selective targeting
• arXiv:2502.12134 (2025-02) — auxiliary-model delegation (SoftCoT)

Your task:
(1) RE-TEST EACH CONSTRAINT. For each claim above, investigate whether recent advances in adapter architecture (e.g., mixture-of-experts adapters, dynamic routing), training signal design (synthetic negatives, contrastive pretraining), or inference-time composition (multi-adapter ensembles, dynamic gating) have *relaxed* the finding that PET scaling fails. Separate the durable constraint (the base-model knowledge ceiling) from the perishable one (current PET methods cannot exploit added capacity). For any constraint that has been relaxed, cite the work that resolved it.
(2) Surface the strongest *contradicting* or *superseding* work from the last ~6 months — any paper showing PET scaling *does* yield returns under specific conditions, or proposing a new intervention that breaks the multiplicative-scaling rule.
(3) Propose 2 research questions that assume the constraint may have shifted: (a) Can dynamic, learnable routing between adapter modules and pretrained layers overcome the layer-independence bottleneck? (b) Under what signal-design regimes (e.g., process supervision, tree-search trajectories) does PET scaling finally produce gains?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
