INQUIRING LINE

Do you need to retrain an AI for every user, or can you just steer its outputs on the fly?

How do inference-time reward methods compare to per-user fine-tuning?

This explores the trade-off between steering a model's behavior at generation time (using a reward signal to nudge outputs without touching weights) versus actually retraining the weights to fit a particular user.

There are two places to inject a user's preferences: into the model's outputs at the moment it generates (inference-time reward methods), or into the model's weights themselves (per-user fine-tuning). The most direct answer in the collection is PReF, which personalizes purely at inference time. It learns a set of base reward functions once, then infers a specific user's preferences as a lightweight combination of those, taking as few as ten adaptive questions to lock in. No weights are modified per user (see "Can user preferences be learned from just ten questions?"). The appeal is obvious: you skip a training run for every individual, and personalization becomes a cheap, reversible dial rather than a permanent change.
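A minimal sketch of the idea, not PReF's actual implementation: it assumes each response already has scores from k base reward functions, models the user's reward as a weighted sum of those scores, and fits the weights from pairwise answers with a plain logistic (Bradley-Terry style) fit. The function names, the simulated user, and the three example base rewards are all hypothetical, and the active selection of questions is left out.

```python
import numpy as np

def fit_user_weights(feature_diffs, choices, steps=500, lr=0.5, l2=0.01):
    """Fit weights w so that sigmoid(w . (phi_a - phi_b)) matches the user's choices.

    feature_diffs: (n_questions, k) array, base-reward scores of response A minus B.
    choices: (n_questions,) array, 1.0 if the user preferred A, else 0.0.
    """
    w = np.zeros(feature_diffs.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-feature_diffs @ w))
        # Gradient of the L2-regularized logistic loss.
        grad = feature_diffs.T @ (p - choices) / len(choices) + l2 * w
        w -= lr * grad
    return w

def pick_best(candidate_scores, w):
    """Best-of-N: index of the candidate with the highest user-weighted reward."""
    return int(np.argmax(candidate_scores @ w))

# Hypothetical setup: 3 base rewards (say brevity, formality, humor), 10 answered questions.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0, 0.5])            # simulated user's hidden weights
diffs = rng.normal(size=(10, 3))               # score differences for 10 question pairs
answers = (diffs @ true_w > 0).astype(float)   # noiseless simulated answers
w_hat = fit_user_weights(diffs, answers)

candidates = rng.normal(size=(8, 3))           # base-reward scores of 8 sampled responses
best = pick_best(candidates, w_hat)            # steer by reranking; no model weights change
```

The only per-user state here is the small vector `w_hat`, which is why this kind of personalization is cheap to store and easy to reset.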

What does fine-tuning buy you that the inference-time route can't? The collection suggests the answer is depth of internalization. When reinforcement learning rewrites weights, it doesn't scatter changes everywhere. It reliably edits a sparse, structured subnetwork (5–30% of parameters), nearly the same one across random seeds, which hints that training durably restructures how the model reasons rather than just biasing its surface outputs (see "Does reinforcement learning update only a small fraction of parameters?"). There's a sharper version of this gap: reasoning models keep beating non-reasoning ones no matter how much inference-time compute you throw at the weaker model, because the training regime instilled a protocol that makes extra tokens productive. You can't always buy at inference time what was installed during training (see "Can non-reasoning models catch up with more compute?").
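One way to make the "5–30% of parameters" figure concrete is to diff two checkpoints and count which entries moved. The sketch below assumes checkpoints are available as dicts of NumPy arrays. The helper names and the toy checkpoints are hypothetical and are not taken from the cited work.

```python
import numpy as np

def update_fraction(before, after, tol=0.0):
    """Fraction of parameters whose value changed by more than tol.

    before, after: dicts mapping parameter name -> NumPy array of identical shape.
    """
    changed = 0
    total = 0
    for name, w0 in before.items():
        delta = np.abs(after[name] - w0)
        changed += int(np.count_nonzero(delta > tol))
        total += w0.size
    return changed / total

def mask_overlap(mask_a, mask_b):
    """Jaccard overlap between two boolean update masks (e.g. from two random seeds)."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return float(inter / union) if union else 1.0

# Toy checkpoints: perturb roughly 20% of one layer and leave the rest untouched.
rng = np.random.default_rng(0)
before = {"layer1": rng.normal(size=(100, 100)), "layer2": rng.normal(size=(100,))}
after = {name: w.copy() for name, w in before.items()}
mask = rng.random(before["layer1"].shape) < 0.2
after["layer1"][mask] += 0.01

frac = update_fraction(before, after)
```

Running `mask_overlap` on the update masks from two differently seeded runs would be one rough way to check the "nearly the same subnetwork" claim on your own models.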

But the inference-time camp has been quietly getting more powerful, which narrows that gap. Reward models themselves can now reason before they score: adding chain-of-thought to evaluation raises their ceiling and lets them scale with test-time compute, so the 'judge' guiding generation is no longer a fixed function (see "Can reward models benefit from reasoning before scoring?"). And that compute can be spent intelligently: allocating more inference budget to hard prompts and less to easy ones beats a fixed budget, meaning inference-time steering can be tuned per-case rather than per-user (see "Can we allocate inference compute based on prompt difficulty?").
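The budget-allocation idea can be sketched as splitting a fixed number of samples across prompts in proportion to an estimated difficulty. The sketch assumes a non-negative difficulty score per prompt is already available. How to estimate it is out of scope here, and the function name and rounding scheme are illustrative choices rather than the method from the cited work.

```python
def allocate_samples(difficulties, total_budget, min_per_prompt=1):
    """Split total_budget samples across prompts, proportional to difficulty.

    difficulties: non-negative scores, one per prompt (higher = harder).
    Every prompt gets at least min_per_prompt samples; the total is spent exactly.
    """
    n = len(difficulties)
    if total_budget < n * min_per_prompt:
        raise ValueError("budget too small to give every prompt the minimum")
    spare = total_budget - n * min_per_prompt
    weight_sum = sum(difficulties)
    if weight_sum == 0:
        shares = [spare / n] * n
    else:
        shares = [spare * d / weight_sum for d in difficulties]
    alloc = [min_per_prompt + int(s) for s in shares]
    # Hand out the samples lost to rounding, largest fractional part first.
    leftover = total_budget - sum(alloc)
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc

# Three prompts (easy, medium, hard) sharing 16 samples.
budget = allocate_samples([0.1, 0.5, 0.9], total_budget=16)
```

A uniform policy would give each of these prompts five or six samples. The proportional split shifts most of the same budget to the hard prompt.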

The interesting twist is that the two approaches aren't a clean either/or; the boundary blurs. Test-Time RL uses majority-vote agreement across samples as its own reward signal at deployment, then trains on it, turning what starts as inference-time compute into actual weight updates with no labels at all (see "Can models improve themselves using only majority voting?"). So the real design question isn't 'reward at inference or fine-tune,' it's where on the spectrum you commit a preference. Inference-time reward methods are cheap, reversible, and ideal when users are many and preferences shift, which is exactly the per-user case PReF targets. Fine-tuning is the right tool when a behavior needs to become load-bearing and permanent.
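The majority-vote reward can be sketched in a few lines: sample several answers to the same prompt, treat the most common one as a pseudo-label, and reward the samples that agree with it. This is an illustrative reduction of the idea. The function name is hypothetical, and the policy-update step that would consume these rewards is omitted.

```python
from collections import Counter

def majority_vote_rewards(answers):
    """Return (pseudo_label, rewards): reward 1.0 for samples matching the majority answer.

    answers: final answers extracted from repeated samples of one prompt.
    Ties are broken in favor of the answer that appears first.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards

samples = ["42", "42", "41", "42", "7"]
label, rewards = majority_vote_rewards(samples)
# label is "42"; rewards is [1.0, 1.0, 0.0, 1.0, 0.0]
```

No ground-truth label appears anywhere in this loop, which is also its risk: when the consensus is wrong, the reward reinforces the wrong answer.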

The thing worth carrying away: the reward signal is the shared currency between both worlds. Whether it nudges a single generation or drives a gradient step, the quality of that signal dominates, and we know it has sharp failure modes, like binary correctness rewards that degrade a model's calibration into overconfident guessing (see "Does binary reward training hurt model calibration?"). A well-designed reward steers well at inference and trains well at fine-tune time; a badly designed one corrupts both. The choice of where to apply it is almost secondary to getting the signal right.
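To see why a binary reward invites overconfident guessing, compare it with a reward that subtracts a Brier-score penalty, as the calibration note describes. The exact form below (correctness minus squared error of the stated confidence) is an assumption made for illustration, and the function names are hypothetical.

```python
def binary_reward(correct):
    """Plain correctness reward: ignores how confident the model claimed to be."""
    return 1.0 if correct else 0.0

def calibrated_reward(correct, confidence):
    """Correctness minus a Brier penalty on the stated confidence (assumed form)."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2

# A wrong answer costs nothing extra under the binary reward, however confident it was.
wrong_binary = binary_reward(False)                 # 0.0 at any confidence
wrong_confident = calibrated_reward(False, 0.9)     # about -0.81
wrong_hedged = calibrated_reward(False, 0.1)        # about -0.01
right_confident = calibrated_reward(True, 0.9)      # about 0.99
```

Under the second reward, a confident wrong answer is penalized far more than a hedged one, so stated confidence is pushed toward the actual chance of being right.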

Sources (7 notes)

Can user preferences be learned from just ten questions?

PReF learns base reward functions from preference data, then uses active learning to select maximally informative questions that reduce coefficient uncertainty. Users can be personalized via inference-time reward alignment without weight modification.

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL updates only 5–30% of parameters, a sparsity that emerges without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.

Can reward models benefit from reasoning before scoring?

Three independent teams (RRM, RM-R1, DeepSeek-GRM) discovered that adding chain-of-thought reasoning before reward scoring enables adaptive test-time compute scaling for evaluation. Reasoning-based approaches raise the capability ceiling of reward models beyond what outcome-based evaluation achieves.

Can we allocate inference compute based on prompt difficulty?

Research shows inference effectiveness varies dramatically by prompt difficulty. Reallocating the same total compute adaptively—giving easy prompts less and hard ones more—substantially outperforms larger models under uniform budgets.

Can models improve themselves using only majority voting?

Test-Time RL generates reward signals by majority voting across repeated samples, enabling policy improvement without ground-truth labels or trained reward models. This approach works surprisingly well because consensus answers tend to be correct, creating a bootstrapping loop where test-time compute enables training that improves the model.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Reward Reasoning Model (1.73 match)
Reasoning Models Can Be Effective Without Thinking (1.68 match)
When More Thinking Hurts: Overthinking in LLM Test-Time Compute Scaling (1.68 match)
Can Large Reasoning Models Self-Train? (1.67 match)
Think Twice: Enhancing LLM Reasoning by Scaling Multi-round Test-time Thinking (1.64 match)
Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets (1.64 match)
RM-R1: Reward Modeling as Reasoning (0.91 match)
Reinforcement Learning Finetunes Small Subnetworks in Large Language Models (0.90 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a researcher auditing whether inference-time reward methods and per-user fine-tuning remain distinct design regimes, or whether the boundary has dissolved. The question: where should you inject user preferences—at generation time or into weights—and does that choice still matter as much as the reward signal itself?

What a curated library found, spanning 2021–2025 with most entries from 2025 (findings are dated, not current truth):
• Inference-time reward methods (PReF) lock in user preferences via ~10 adaptive questions, zero per-user weight updates; reward reasoning models now chain-of-thought before scoring, extending test-time compute scaling to evaluation (2025).
• RL fine-tuning reliably rewrites only 5–30% of parameters, concentrated in sparse subnetworks that are nearly identical across seeds, suggesting durable restructuring of reasoning rather than surface bias (2025).
• Reasoning models consistently outpace non-reasoning ones regardless of inference-time budget, because training installs a protocol that makes extra tokens productive (2025).
• Test-Time RL synthesizes unlabeled data via majority-vote reward estimation, then trains on it—blurring inference-time steering and weight updates (2025).
• Binary correctness rewards degrade calibration by incentivizing confident guessing, a failure that affects both inference-time and fine-tuning regimes (2024).

Anchor papers (verify; mind their dates):
• arXiv:2503.06358 (PReF, 2025)
• arXiv:2505.11711 (RL edits sparse subnetworks, 2025)
• arXiv:2504.09858 (Reasoning models without explicit thinking, 2025)
• arXiv:2504.16084 (TTRL, 2025)

Your task:
(1) RE-TEST EACH CONSTRAINT. For inference-time reward methods: has adaptive sampling, reward model scaling, or orchestration (memory, caching, batching) since tightened or loosened the 10-question estimate? For fine-tuning: do recent sparse subnetwork findings (2025) hold across larger models, longer adaptation horizons, or multi-objective settings? Judge whether reasoning-model superiority persists if inference-time methods now use reasoning judges—has the gap narrowed or stayed fixed?
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last 6 months. Does any recent paper argue that the regimes are now equivalent, or that one dominates across realistic user scales?
(3) Propose 2 research questions that assume the regime may have shifted: (a) Under what conditions can inference-time reward reasoning fully replicate the internalization benefits of sparse-subnetwork fine-tuning? (b) Does adaptive compute allocation (hard prompts → more budget) subsume per-user fine-tuning for multi-objective alignment?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
