INQUIRING LINE


Teaching a model to reason better turns out to be cheap — a tiny adapter works nearly as well as retraining the whole thing.

When should full-parameter post-training be used instead of LoRA adaptation?

This explores when full-parameter post-training, which updates all of a model's weights, earns its much higher cost over LoRA, which trains a small add-on adapter while the base weights stay frozen. The short version the corpus keeps circling back to: less often than you'd think, because much of what people reach for full fine-tuning to teach turns out to be format and behavior, not new knowledge, and those are exactly what small adapters handle well.
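To make the cost gap concrete, here is a minimal NumPy sketch of a single LoRA-adapted linear layer. The layer size, rank, and scaling value are illustrative assumptions, not figures from the notes; the point is only that the trainable adapter is a small fraction of the frozen weight it modifies.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha):
    """Apply a frozen weight W plus a scaled low-rank update B @ A."""
    r = A.shape[0]
    return x @ (W + (alpha / r) * (B @ A)).T

# Illustrative sizes (assumptions, not taken from any cited paper).
d_out, d_in, r = 1024, 1024, 8
rng = np.random.default_rng(0)

W = rng.standard_normal((d_out, d_in))       # frozen base weight
A = rng.standard_normal((r, d_in)) * 0.01    # trainable down-projection
B = np.zeros((d_out, r))                     # trainable, zero-initialized

full_params = W.size                 # what full-parameter training would update
lora_params = A.size + B.size        # what LoRA updates for this layer
trainable_fraction = lora_params / full_params
```

With `B` initialized to zero, the adapted layer starts out identical to the base layer, so training begins from the pretrained behavior and only the two small matrices move.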

The sharpest data point is that a 1.5B model trained with LoRA alone matched larger full-parameter RL models on reasoning tasks (Can small models reason well by just learning output format?). The interpretation there is that reinforcement learning mostly teaches a model how to *organize* its output, not new facts, which would mean reasoning and knowledge storage are separable and the reasoning half is cheap to adapt. This lines up with a structural finding from the other direction: even when you do run full RL, it ends up modifying only 5–30% of parameters, in sparse but nearly full-rank subnetworks that are nearly identical across random seeds (Does reinforcement learning update only a small fraction of parameters?). So 'full-parameter' training is, in practice, already doing something closer to targeted surgery, which weakens the case that you needed all the parameters unfrozen to begin with.
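The two properties in that finding, sparsity and rank, are independent, and a short sketch can show how each would be measured on a weight update. Everything below is synthetic: `update_stats` is a hypothetical helper, and the two updates are random stand-ins for an "RL-like" sparse update and a "LoRA-like" low-rank update, not data from the cited work. In practice the update would be a fine-tuned weight matrix minus its base counterpart.

```python
import numpy as np

def update_stats(delta, tol=1e-8):
    """Return (fraction of entries changed, numerical rank) for an update matrix."""
    frac_changed = float((np.abs(delta) > tol).mean())
    rank = int(np.linalg.matrix_rank(delta))
    return frac_changed, rank

rng = np.random.default_rng(0)
n, r = 64, 8

# Synthetic sparse update: touch roughly 20% of entries at random positions.
mask = rng.random((n, n)) < 0.2
sparse_delta = mask * rng.standard_normal((n, n)) * 0.01

# Synthetic low-rank update: every entry changes, but the rank is capped at r.
B = rng.standard_normal((n, r)) * 0.1
A = rng.standard_normal((r, n)) * 0.1
lowrank_delta = B @ A

sparse_frac, sparse_rank = update_stats(sparse_delta)
lowrank_frac, lowrank_rank = update_stats(lowrank_delta)
```

The sparse update changes few entries yet has high rank, while the low-rank update changes nearly every entry yet has rank `r`. That is why "sparse but nearly full-rank" describes a different kind of update from the one a LoRA adapter produces, even though both are small in some sense.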

The real argument *against* full fine-tuning, though, is damage. Updating all weights directly corrupts knowledge stored in a model's lower layers; decoding-time proxy-tuning closes 88–91% of the alignment gap while *beating* direct fine-tuning on knowledge tasks, precisely because it leaves the base weights untouched (Can decoding-time tuning preserve knowledge better than weight fine-tuning?). Catastrophic forgetting, on this view, isn't an inherent cost of adaptation but a misallocation: route task-specific lessons into optimized prompts and keep parameter updates minimal, and you reach equivalent performance 1.4–3x faster with substantially less forgetting (Can splitting adaptation into two channels reduce forgetting?). There's even a method that outperforms LoRA itself with fewer parameters, by tuning only the singular values of weight matrices into composable expert vectors (Can models dynamically activate expert skills at inference time?), pushing the frontier toward *less* invasive adaptation, not more.
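As a rough illustration of why proxy-tuning cannot damage the base weights, the sketch below applies the usual logit arithmetic at a single decoding step: the large base model's logits are shifted by the difference between a small tuned model and its untuned counterpart. The four-token vocabulary and all logit values are made up for the example, and `proxy_tuned_probs` is a hypothetical helper name.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def proxy_tuned_probs(base_logits, expert_logits, antiexpert_logits):
    """Shift base logits by the (tuned small - untuned small) difference."""
    return softmax(base_logits + (expert_logits - antiexpert_logits))

# Made-up next-token logits over a 4-token vocabulary.
base = np.array([2.0, 1.0, 0.5, 0.1])        # large untuned model
antiexpert = np.array([1.5, 1.2, 0.3, 0.0])  # small untuned model
expert = np.array([0.5, 2.2, 0.3, 0.0])      # small tuned model

probs = proxy_tuned_probs(base, expert, antiexpert)
base_choice = int(np.argmax(base))
tuned_choice = int(np.argmax(probs))
```

Only the output distribution changes; no parameter of the large model is written to, which is the property the note credits for the better results on knowledge tasks.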

So when *would* you go full-parameter? The corpus implies the honest answer is: when you genuinely need to reshape the model's internal knowledge or its base distribution, not just its output style. There's a quiet warning here too: RL post-training collapses the diversity of formats a pretrained model can produce, locking onto a single dominant one within the first epoch whether or not it is the best-performing (Does RL training collapse format diversity in pretrained models?). That homogenizing pressure is a cost you pay with deep training and largely avoid with isolated adapters. And if your real problem is multi-task interference, the fix isn't training harder but identifying each task's core parameters, freezing them, and merging the rest (Can isolating task-specific parameters prevent multi-task fine-tuning interference?).

The thing you didn't know you wanted to know: the field is steadily reframing 'full vs. LoRA' not as a power-vs-efficiency tradeoff but as a *which capability am I actually changing* question. If you're adapting behavior, reasoning format, or style, the lightweight methods now win on both cost and knowledge preservation — and full-parameter training increasingly looks like the option you choose only when you've confirmed the cheaper, less destructive routes can't reach the knowledge you need to move.

Sources (7 notes)

Can small models reason well by just learning output format?

A 1.5B parameter model with LoRA-only post-training matched larger full-parameter RL models on reasoning tasks, suggesting RL teaches output format organization rather than new factual knowledge. This efficiency indicates reasoning and knowledge storage are separable capabilities.

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL updates only 5–30% of parameters, a sparsity that emerges without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88–91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.

Can splitting adaptation into two channels reduce forgetting?

Fast-Slow Training routes task-specific lessons into optimized prompts while keeping parameter updates minimal, reaching equivalent performance 1.4–3x faster with substantially less catastrophic forgetting and plasticity loss, demonstrating that forgetting is a misallocation problem rather than an inherent cost.

Can models dynamically activate expert skills at inference time?

Transformer² demonstrates that tuning only singular values within weight matrices produces composable expert vectors that dynamically mix at inference without interference, outperforming LoRA with fewer parameters and enabling continual specialization.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Can isolating task-specific parameters prevent multi-task fine-tuning interference?

Research shows that identifying core parameter regions per task, clustering overlapping tasks, and freezing core parameters while geometrically merging non-core parameters consistently outperforms standard multi-task fine-tuning. Temporal task scheduling alone proves insufficient without explicit structural parameter isolation.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining (2.52 match)
Not All Parameters Are Created Equal: Smart Isolation Boosts Fine-Tuning Performance (1.70 match)
Learning, Fast and Slow: Towards LLMs That Adapt Continually (1.69 match)
The Art of Scaling Reinforcement Learning Compute for LLMs (1.67 match)
Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs (1.65 match)
Reinforcement Learning for Reasoning in Large Language Models with One Training Example (1.65 match)
An Emulator for Fine-Tuning Large Language Models using Small Language Models (1.64 match)
A Survey on Post-training of Large Language Models (1.61 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a capability researcher evaluating when full-parameter post-training justifies its cost over LoRA. The question remains open: under what conditions does end-to-end retraining outperform lightweight adaptation?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat these as perishable constraints to re-test:
• A 1.5B LoRA-only model matched larger full-parameter RL reasoning models; reasoning format adapts cheaply and appears separable from knowledge storage (~2025).
• Full RL updates only 5–30% of parameters via sparse, nearly full-rank subnetworks; 'full-parameter' training already performs targeted surgery (~2025).
• Proxy-tuning at decode time closes 88–91% alignment gap *without* knowledge corruption; direct fine-tuning damages lower-layer knowledge (~2024–2025).
• RL post-training collapses format diversity into a single dominant distribution within epoch one, locking the model (~2025).
• Core parameter isolation (freezing task-specific parameters, merging shared ones) prevents multi-task interference (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2504.15777 (Tina, Apr 2025) — LoRA reasoning at scale.
• arXiv:2504.07912 (Echo Chamber, Apr 2025) — RL's homogenizing effect.
• arXiv:2505.11711 (May 2025) — sparse RL subnetworks.
• arXiv:2605.12978 (May 2026) — continual learning costs.

Your task:
(1) RE-TEST EACH CONSTRAINT. For the five claims above, assess whether newer models (GPT-4.5+, o3-class), training methods (DPO variants, mixture-of-adapters), or orchestration (test-time adaptation, in-context learning) have since relaxed or overturned them. Separate the durable question—*what capability truly requires weight redistribution?*—from perishable limitations (e.g., *LoRA rank insufficiency for reasoning*, possibly resolved by routing or ensemble). Cite what resolved it; flag where constraints still hold.
(2) Surface the strongest contradicting or superseding work from the last ~6 months (e.g., papers showing full RL *does* preserve diversity, or LoRA fails on knowledge insertion tasks).
(3) Propose 2 research questions that assume the regime may have shifted: e.g., *Can test-time adaptation + retrieval fully replace post-training for few-shot knowledge?* or *Do newer sparse training methods (pruning, MoE) change the economics of full-parameter tuning?*

Cite arXiv IDs; flag anything you cannot ground in a real paper.
