INQUIRING LINE

Training, RL, and Test-Time Scaling · Model Architecture and Internals · Reasoning, Retrieval, and Evaluation · cross-cluster

Can learned priors effectively select and weight ensemble members by inference budget?

This explores whether a model can learn — rather than hand-tune — how to pick which 'experts' or ensemble members to fire and how heavily to weight each one given a compute budget, and whether the corpus has anything on that learned routing under its various names.

The underlying question is whether learned priors can do the job of an ensemble dispatcher: deciding which members to activate, how much to trust each, and how to spend a fixed inference budget across them. No single note in the corpus is framed as 'budget-aware ensemble weighting,' but the conceptual territory shows up repeatedly under different vocabulary (dynamic expert composition, learned compute allocation, and budgeted selection), and reading across them gives a surprisingly coherent picture.

The strongest evidence that learned priors can *select and weight* members at inference is the work on composing expert vectors on the fly. Transformer² learns to mix task-specific experts at inference by tuning only the singular values of weight matrices, producing composable vectors that blend without interfering (see 'Can models dynamically activate expert skills at inference time?'). That's exactly the 'weight the members' half of your question, learned rather than fixed. A more radical version skips gradients entirely: swarms of LLM 'particles' search weight space and discover composed experts, including ones that answer questions every starting expert got wrong, from just 200 validation examples (see 'Can language models discover new expertise through collaborative weight search?'). Both suggest the *combination* prior is learnable cheaply.
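The mixing step described above can be sketched as follows. This is a minimal illustration, assuming each expert is a vector of per-singular-value scales and that the SVD factors of a weight matrix are already available. The function names (`mix_expert_vectors`, `adapt_weight`), the toy 2×2 factors, the expert values, and the mixing weights are all hypothetical and are not taken from the Transformer² implementation.

```python
def mix_expert_vectors(experts, alphas):
    """Blend expert scale vectors z_k using mixing weights alpha_k."""
    dim = len(experts[0])
    return [sum(a * z[i] for a, z in zip(alphas, experts)) for i in range(dim)]


def matmul(a, b):
    """Plain-Python matrix product, enough for small illustrative matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def adapt_weight(u, sigma, vt, z):
    """Rebuild W' = U diag(sigma * z) V^T, leaving U and V^T untouched."""
    scaled_u = [
        [u[i][j] * sigma[j] * z[j] for j in range(len(sigma))]
        for i in range(len(u))
    ]
    return matmul(scaled_u, vt)


# Toy SVD factors (assumed given; a real system would compute them once per matrix).
u = [[1.0, 0.0], [0.0, 1.0]]
vt = [[0.0, 1.0], [1.0, 0.0]]
sigma = [3.0, 1.0]

# Two hypothetical expert vectors and hypothetical mixing weights.
math_expert = [1.2, 0.5]
code_expert = [0.8, 1.5]
z = mix_expert_vectors([math_expert, code_expert], [0.75, 0.25])

adapted = adapt_weight(u, sigma, vt, z)
```

Because each expert touches only the singular-value scales, blending experts reduces to a weighted sum of short vectors, which is one plausible reading of why the composition is cheap.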

The 'by inference budget' half points to a different, quieter result: the cheapest reliable router is often the model's own calibrated uncertainty. A simple token-probability uncertainty estimate beats elaborate adaptive-retrieval heuristics on single-hop tasks and matches them on multi-hop at deciding when to spend extra calls, using a fraction of the compute (see 'Can simple uncertainty estimates beat complex adaptive retrieval?'). That's a learned prior allocating budget, not over ensemble members exactly, but over whether to invoke an expensive component at all. The same logic of treating selection as a budget problem appears in few-shot example choice, where framing demonstration selection as budgeted active learning (pick what most reduces uncertainty) beats similarity-based retrieval (see 'Can optimal experimental design improve few-shot example selection?').
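A minimal sketch of uncertainty-gated spending, under stated assumptions: mean token entropy stands in for the 'token-probability uncertainty estimate' (the note does not say which estimator is used), and the threshold, the toy distributions, and the function names are hypothetical.

```python
import math


def mean_token_entropy(token_distributions):
    """Average Shannon entropy (in nats) over per-token probability distributions."""
    entropies = [
        -sum(p * math.log(p) for p in dist if p > 0.0)
        for dist in token_distributions
    ]
    return sum(entropies) / len(entropies)


def should_spend_extra_call(token_distributions, threshold, calls_left):
    """Invoke the expensive component only if budget remains and the model is unsure."""
    if calls_left <= 0:
        return False
    return mean_token_entropy(token_distributions) > threshold


# Toy per-token distributions over a 3-token vocabulary (illustrative values only).
confident_answer = [[0.97, 0.02, 0.01], [0.90, 0.05, 0.05]]
unsure_answer = [[0.40, 0.35, 0.25], [0.50, 0.30, 0.20]]

THRESHOLD = 0.6  # hypothetical; in practice it would be calibrated on held-out data

skip_retrieval = should_spend_extra_call(confident_answer, THRESHOLD, calls_left=3)
do_retrieval = should_spend_extra_call(unsure_answer, THRESHOLD, calls_left=3)
out_of_budget = should_spend_extra_call(unsure_answer, THRESHOLD, calls_left=0)
```

The gate costs nothing beyond the forward pass the model already ran, which is the sense in which the router is cheap.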

The sharp caveat, and the thing you might not have known you wanted to know, is that inference budget has a ceiling that no clever routing can break. Non-reasoning models do not catch up to reasoning models no matter how much inference compute you throw at them, because the *training regime* decides whether extra tokens are productive at all (see 'Can non-reasoning models catch up with more compute?'). So a learned prior can weight and budget across members effectively, but only among members that were trained to make compute pay off; you can't dispatch your way out of a weak ensemble.

One adjacent thread worth a doorway: the reason inference-time composition is attractive at all is that it leaves base weights untouched. Both proxy-tuning at decoding time (see 'Can decoding-time tuning preserve knowledge better than weight fine-tuning?') and representation-level intervention on frozen activations (see 'Can editing hidden representations beat weight updates for finetuning?') show that steering behavior without weight updates preserves knowledge and is dramatically more parameter-efficient, which is what makes a fleet of cheaply-composed, budget-weighted experts practical in the first place.
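To make the decoding-time idea concrete, here is a minimal sketch of the logit arithmetic usually associated with proxy-tuning: the frozen base model's next-token logits are shifted by the difference between a small tuned model and its untuned counterpart. The function names and the toy three-token logits are hypothetical, and this is an illustration of the general idea rather than the authors' code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def proxy_tuned_distribution(base_logits, tuned_small_logits, untuned_small_logits):
    """Shift frozen base logits by (tuned - untuned) from a small model pair."""
    shifted = [
        b + (t - u)
        for b, t, u in zip(base_logits, tuned_small_logits, untuned_small_logits)
    ]
    return softmax(shifted)


# Toy next-token logits over a 3-token vocabulary (illustrative values only).
base = [2.0, 1.0, 0.0]
tuned_small = [0.0, 2.0, 0.0]
untuned_small = [0.0, 0.0, 0.0]

steered = proxy_tuned_distribution(base, tuned_small, untuned_small)
unchanged = proxy_tuned_distribution(base, untuned_small, untuned_small)
```

When the small tuned and untuned models agree, the shift is zero and the base distribution comes back unchanged, so the base model's weights and its stored knowledge are never modified.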

Sources 7 notes

Can models dynamically activate expert skills at inference time?

Transformer² demonstrates that tuning only singular values within weight matrices produces composable expert vectors that dynamically mix at inference without interference, outperforming LoRA with fewer parameters and enabling continual specialization.

Can language models discover new expertise through collaborative weight search?

PSO-inspired swarms of LLM particles moving through weight space discover composed experts with new capabilities—including answering questions all initial experts failed on—using only 200 validation examples and no gradient-based training.

Can simple uncertainty estimates beat complex adaptive retrieval?

Calibrated token-probability uncertainty consistently beats multi-call adaptive retrieval on single-hop tasks and matches performance on multi-hop, using a fraction of the LM and retriever calls. The model's self-knowledge proves more reliable than external heuristics for deciding when to retrieve.

Can optimal experimental design improve few-shot example selection?

AIPD frames demonstration selection as budgeted active learning, choosing examples that maximally reduce test-set uncertainty. Two algorithms (GO and SAL) outperformed similarity-based methods across small, medium, and large language models.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.

Can editing hidden representations beat weight updates for finetuning?

ReFT learns task-specific interventions on frozen model representations rather than updating weights, with LoReFT (low-rank linear subspace variant) dramatically outperforming LoRA across reasoning, instruction-following, and NLU benchmarks while using far fewer parameters.
