INQUIRING LINE

Reasoning, Retrieval, and Evaluation · Training, RL, and Test-Time Scaling · Model Architecture and Internals · cross-cluster

Can a trained decoder replace both search and parameter updates?

This reads the question as asking whether inference-time methods that act at the decoder — steering outputs, editing internal representations, composing skills on the fly — can stand in for both retrieval (search) and weight fine-tuning (parameter updates), and where that substitution breaks.

This explores whether you can move work out of two expensive places, the retriever and the optimizer, and into the model's own forward pass at decoding time. The corpus is bullish on replacing parameter updates and more cautious on replacing search. On the parameter side, several lines of work show you can get most of the behavior fine-tuning gives you without touching the large model's weights. Proxy-tuning steers the output distribution at decoding time and preserves pretrained knowledge *better* than direct fine-tuning, which corrupts the lower layers where facts are stored (see: Can decoding-time tuning preserve knowledge better than weight fine-tuning?). Representation fine-tuning goes further, leaving weights frozen and learning small interventions on hidden states, reaching 10–50x better parameter efficiency than LoRA (see: Can editing hidden representations beat weight updates for finetuning?). And self-adaptive models compose 'expert vectors' at inference by tuning singular values, mixing skills per query without retraining (see: Can models dynamically activate expert skills at inference time?).
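The decoding-time steering attributed to proxy-tuning above can be sketched in a few lines. This is a minimal illustration, assuming the commonly described formulation in which the large base model's logits are shifted by the difference between a small tuned model (the "expert") and its untuned counterpart (the "anti-expert"). The function names and the three-token vocabulary are hypothetical, and the numbers are made up.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def proxy_tuned_distribution(base_logits, expert_logits, antiexpert_logits):
    """Shift the base model's logits by (expert - anti-expert), then normalize.

    Assumption: all three models share one vocabulary, so logits align by index.
    No weights of the base model are modified.
    """
    shifted = [
        b + (e - a)
        for b, e, a in zip(base_logits, expert_logits, antiexpert_logits)
    ]
    return softmax(shifted)

# Toy next-token logits (illustrative values only).
base = [2.0, 1.0, 0.0]        # large pretrained model
expert = [0.5, 1.5, 0.0]      # small tuned model
antiexpert = [0.5, 0.5, 0.0]  # the same small model before tuning

probs = proxy_tuned_distribution(base, expert, antiexpert)
unchanged = proxy_tuned_distribution(base, antiexpert, antiexpert)
```

In this toy case the tuned small model prefers token 1 more than its untuned version does, so token 1 gains probability in the large model's output. When expert and anti-expert agree, the base distribution is returned unchanged.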

There's a deeper reason this works: fine-tuning may be doing far less to the weights than we assume. Across seven RL algorithms and ten model families, reinforcement learning updates only 5–30% of parameters, and those sparse updates are nearly identical across random seeds, which makes them structural rather than arbitrary (see: Does reinforcement learning update only a small fraction of parameters?). If adaptation is concentrated and reproducible, it's plausible to relocate it to a lightweight decoder-side intervention. A complementary framing splits adaptation into 'slow weights' and 'fast context,' routing most task-specific lessons into the prompt and barely touching parameters, which treats forgetting as a misallocation problem rather than an inherent cost (see: Can splitting adaptation into two channels reduce forgetting?).
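To make "updates only a fraction of parameters" and "nearly identical across seeds" concrete, here is a small sketch of how one might measure both on flattened weight vectors. The source does not say how sparsity or overlap were computed, so the exact-difference test and the Jaccard overlap below are assumptions, and the checkpoints are toy data rather than results.

```python
def update_mask(before, after, tol=0.0):
    """True where a parameter moved by more than tol between two checkpoints."""
    return [abs(a - b) > tol for b, a in zip(before, after)]

def update_fraction(before, after, tol=0.0):
    """Share of parameters that changed during training."""
    mask = update_mask(before, after, tol)
    return sum(mask) / len(mask)

def mask_overlap(mask_a, mask_b):
    """Jaccard overlap between two sets of updated parameters."""
    both = sum(1 for x, y in zip(mask_a, mask_b) if x and y)
    either = sum(1 for x, y in zip(mask_a, mask_b) if x or y)
    return both / either if either else 1.0

# Toy checkpoints: ten parameters, two training runs with different seeds.
base = [0.0] * 10
seed_a = list(base)
seed_a[2], seed_a[7] = 0.3, -0.1
seed_b = list(base)
seed_b[2], seed_b[7], seed_b[5] = 0.25, -0.2, 0.05

frac_a = update_fraction(base, seed_a)
frac_b = update_fraction(base, seed_b)
overlap = mask_overlap(update_mask(base, seed_a), update_mask(base, seed_b))
```

Here run A touches 2 of 10 parameters, run B touches 3, and they share 2 of the 3 positions touched by either run. A high overlap across seeds is the kind of signal the paragraph above describes as structural.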

Replacing search is where it gets interesting. The model's own calibrated token probabilities, read off at decoding time, can decide *when* to retrieve at least as well as elaborate adaptive-retrieval heuristics (better on single-hop tasks, on par on multi-hop) while using a fraction of the LM and retriever calls (see: Can simple uncertainty estimates beat complex adaptive retrieval?). So the decoder can replace the *control logic* of search. But replacing the *content* of search is harder. Transformers provably beat fixed-state models at copying and retrieval precisely because they keep the full context addressable; a bounded internal state can't reconstruct what it never stored (see: Can state-space models match transformers at copying and retrieval?). A decoder that reaches into its fixed-size parameters instead of an external store inherits that ceiling.
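The "control logic" point can be illustrated with a gate that drafts an answer first and calls the retriever only when the draft's token probabilities look uncertain. This is a sketch under stated assumptions: the source does not specify the uncertainty score, so mean negative log-probability and the threshold of 0.5 are illustrative choices, the probabilities are assumed to be already calibrated, and `toy_generate` and `toy_retrieve` are hypothetical stand-ins for a real model and retriever.

```python
import math

def mean_token_nll(token_probs):
    """Average negative log-probability of generated tokens (higher = less sure)."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def answer_with_gated_retrieval(question, generate, retrieve, threshold=0.5):
    """Draft without retrieval; call the retriever only if the draft is uncertain.

    Returns (answer, retrieved_flag).
    """
    draft, probs = generate(question, None)
    if mean_token_nll(probs) <= threshold:
        return draft, False
    passages = retrieve(question)
    final, _ = generate(question, passages)
    return final, True

# Hypothetical stand-ins for a real LM and retriever.
def toy_generate(question, passages):
    if passages is not None:
        return "answer grounded in " + passages[0], [0.9, 0.9]
    if question == "capital of France?":
        return "Paris", [0.95, 0.9]
    return "unsure guess", [0.3, 0.2]

def toy_retrieve(question):
    return ["doc-1"]

easy = answer_with_gated_retrieval("capital of France?", toy_generate, toy_retrieve)
hard = answer_with_gated_retrieval("obscure recent event?", toy_generate, toy_retrieve)
```

The confident draft is returned without any retriever call, while the low-probability draft triggers exactly one. The gate decides when to search but does not supply the content, which still comes from the external store.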

And the parametric channel actively fights you. Language models routinely ignore information in their context when prior training associations are strong: parametric knowledge overrides what's right in front of them, and prompting alone can't fix it (see: Why do language models ignore information in their context?). That's the failure mode of asking a trained decoder to 'just know' instead of retrieving. Relatedly, models that look like they're computing in latent space are often pattern-matching memorized templates rather than executing the procedure (see: Do large language models actually perform iterative optimization?), a caution against assuming the forward pass can absorb arbitrary work.

So the honest answer the corpus points to: a trained decoder can largely replace parameter updates, and it can replace the *orchestration* of search, but it can't replace search's external memory for anything fresh, long, or contrary to the model's priors. The interesting design isn't 'decoder instead of both' — it's a decoder that handles adaptation internally while still calling out to a store, with its own uncertainty deciding when to reach.

Sources 9 notes

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.

Can editing hidden representations beat weight updates for finetuning?

ReFT learns task-specific interventions on frozen model representations rather than updating weights, with LoReFT (low-rank linear subspace variant) dramatically outperforming LoRA across reasoning, instruction-following, and NLU benchmarks while using far fewer parameters.
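As a rough illustration of an intervention on a frozen hidden state, the sketch below assumes the commonly cited LoReFT form, h + Rᵀ(Wh + b − Rh), where R is a low-rank projection with orthonormal rows and W, b are the learned parameters. The helper names and the tiny hand-picked matrices are hypothetical; a real setup would learn R, W, and b by gradient descent.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def loreft(h, R, W, b):
    """Return h + R^T (W h + b - R h).

    Only the component of h inside the subspace spanned by R's rows is edited;
    the model's weights are never touched.
    """
    target = [wh + bi for wh, bi in zip(matvec(W, h), b)]
    current = matvec(R, h)
    delta = [t - c for t, c in zip(target, current)]
    # Project the low-rank edit back into the hidden dimension (R^T delta).
    correction = [
        sum(R[k][j] * delta[k] for k in range(len(R))) for j in range(len(h))
    ]
    return [hj + cj for hj, cj in zip(h, correction)]

# Rank-1 intervention on a 3-dimensional hidden state (toy values).
R = [[1.0, 0.0, 0.0]]   # subspace: the first coordinate only
W = [[0.0, 0.0, 0.0]]
b = [5.0]
h = [1.0, 2.0, 3.0]

edited = loreft(h, R, W, b)
```

With these values the first coordinate is overwritten with 5.0 and the other two are left alone, which shows why the parameter count scales with the rank of R rather than with the size of the weight matrices.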

Can models dynamically activate expert skills at inference time?

Transformer² demonstrates that tuning only singular values within weight matrices produces composable expert vectors that dynamically mix at inference without interference, outperforming LoRA with fewer parameters and enabling continual specialization.

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL updates only 5–30% of parameters, an intrinsic sparsity that arises without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.

Can splitting adaptation into two channels reduce forgetting?

Fast-Slow Training routes task-specific lessons into optimized prompts while keeping parameter updates minimal, reaching equivalent performance 1.4–3x faster with substantially less catastrophic forgetting and plasticity loss, demonstrating that forgetting is a misallocation problem rather than an inherent cost.

Can simple uncertainty estimates beat complex adaptive retrieval?

Calibrated token-probability uncertainty consistently beats multi-call adaptive retrieval on single-hop tasks and matches performance on multi-hop, using a fraction of the LM and retriever calls. The model's self-knowledge proves more reliable than external heuristics for deciding when to retrieve.

Can state-space models match transformers at copying and retrieval?

Two-layer transformers can copy exponentially long strings while state-space models are fundamentally limited by their fixed-size latent state. Empirically, transformers dramatically outperform SSMs at copying and context retrieval in both synthetic and pretrained settings.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Do large language models actually perform iterative optimization?

Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.
