INQUIRING LINE

Training, RL, and Test-Time Scaling · Model Architecture and Internals · Reasoning, Retrieval, and Evaluation · cross-cluster

What cognitive burdens should move from model parameters into harness infrastructure?

This explores which jobs we currently ask the model's weights to do — planning, memory, skill, calibration of effort — that research suggests belong instead in the scaffolding around the model (the 'harness'): the memory stores, tool protocols, and orchestration logic.

The clearest signal in the corpus is that reliability is increasingly a property of scaffolding, not weights. Agent capability has been shown to shift from the model itself onto a surrounding harness of memory, skills, and structured protocols, and reliability comes from externalizing cognitive burden rather than from scaling parameters [Where does agent capability really come from?]. So the question isn't whether to offload burdens; it's which ones move cleanly.

Planning is the strongest candidate. When you split the decomposer (the part that breaks a problem into steps) from the solver (the part that executes each step), accuracy and generalization both improve. Notably, the decomposition skill transfers across domains while raw solving ability does not [Does separating planning from execution improve reasoning accuracy?]. That's a strong argument for treating planning as harness-level orchestration: a reusable, transferable layer that coordinates the model rather than something each model must re-learn internally. The corpus also warns that asking a single model to reason 'harder' internally often misfires. On constraint-bound numeric tasks, extended chain-of-thought produces more text rather than more computation [Do reasoning models actually beat standard models on optimization?]. On fine-grained perception tasks, where the real bottleneck is visual attention rather than verbalization, verbose reasoning actively degrades performance [Does verbose chain-of-thought actually help multimodal perception tasks?]. When the burden is misplaced inside the model, more parameters or more tokens don't fix it.
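To make the decomposer/solver split concrete, here is a minimal sketch of planning as harness-level orchestration. It is an illustration under stated assumptions, not the method of any cited work: `run_plan`, `toy_decompose`, and `toy_solve` are hypothetical names, and the two toy functions stand in for what would be separate model calls in practice.

```python
from typing import Callable, List

# Hypothetical interfaces: in a real harness each would wrap a model call.
Decomposer = Callable[[str], List[str]]
Solver = Callable[[str, List[str]], str]


def run_plan(problem: str, decompose: Decomposer, solve: Solver) -> List[str]:
    """Plan once in the harness, then execute the steps one at a time."""
    steps = decompose(problem)
    results: List[str] = []
    for step in steps:
        # The solver sees earlier results but never rewrites the plan.
        results.append(solve(step, results))
    return results


# Toy stand-ins so the sketch runs without any model behind it.
def toy_decompose(problem: str) -> List[str]:
    return [part.strip() for part in problem.split(" then ")]


def toy_solve(step: str, prior: List[str]) -> str:
    return f"done({step}) after {len(prior)} prior step(s)"


if __name__ == "__main__":
    for line in run_plan("parse input then sum values", toy_decompose, toy_solve):
        print(line)
```

Because the decomposer and solver are passed in as plain callables, either one could be swapped or reused across domains without touching the other, which is the property the paragraph above attributes to the split.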

Effort allocation is the second burden to externalize. Deciding how much computation a given problem deserves is better handled as a harness policy than as a fixed model property. Inference-time compute can substitute for parameter scaling on hard prompts [Can inference compute replace scaling up model size?], and allocating that compute adaptively, with little for easy prompts and more for hard ones, beats a larger model spending a uniform budget [Can we allocate inference compute based on prompt difficulty?]. There's a catch worth knowing: models are bad at self-judging difficulty. Reasoning trace length tracks difficulty only in-distribution and decouples from it out of distribution, so it mostly reflects how close a problem is to training data, not how hard it actually is [Does longer reasoning actually mean harder problems?]. That's precisely why the difficulty-estimation-and-budgeting loop wants to live in the harness, where it can be measured and tuned, rather than being trusted to the model's own sense of effort.
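One way such a harness-side budgeting policy could look is sketched below. Everything here is an assumption made for illustration: the function name `allocate_samples`, the proportional rule, and the idea that difficulty scores in [0, 1] come from an external estimator (for example, one minus the pass rate of a few pilot samples) rather than from the model's own trace length.

```python
from typing import Dict


def allocate_samples(difficulty: Dict[str, float], total: int, floor: int = 1) -> Dict[str, int]:
    """Split a fixed sample budget across prompts in proportion to difficulty.

    difficulty maps a prompt id to a harness-estimated score in [0, 1].
    Every prompt gets at least `floor` samples; the rest follows the scores.
    """
    if not difficulty:
        return {}
    if total < floor * len(difficulty):
        raise ValueError("budget too small for the per-prompt floor")

    budget = {pid: floor for pid in difficulty}
    spare = total - floor * len(difficulty)
    weight = sum(difficulty.values())
    if weight > 0:
        for pid, score in difficulty.items():
            budget[pid] += int(spare * score / weight)

    # Rounding leftovers go to the hardest prompts first.
    ranked = sorted(difficulty, key=difficulty.get, reverse=True)
    leftover = total - sum(budget.values())
    for i in range(leftover):
        budget[ranked[i % len(ranked)]] += 1
    return budget


if __name__ == "__main__":
    scores = {"easy": 0.125, "medium": 0.25, "hard": 0.625}
    print(allocate_samples(scores, total=11))
```

The total spend is the same as a uniform policy with the same budget; only the distribution changes, and the estimator feeding `difficulty` can be measured and tuned independently of the model.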

Memory and personalization round out the picture: durable per-user state belongs outside the base weights. Lightweight adapters can act as persistent behavioral deltas, letting one shared base plus millions of small adapters replace millions of full personalized models [Can lightweight adapters replace millions of personalized models?]. Intervening on frozen representations rather than retraining weights is reported to be 10–50x more parameter-efficient [Can editing hidden representations beat weight updates for finetuning?]. The same logic shows up in looped computation that trades depth for parameter count [Can looped computation replace parameter count in world models?] and in isolating task-specific parameters to stop fine-tuning interference [Can isolating task-specific parameters prevent multi-task fine-tuning interference?]. In each case the move is to keep the base model a stable, general engine and push the variable, stateful, task-specific work outward.
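The "one shared base plus many small deltas" arrangement can be sketched as a toy. This is not how PEFT adapters or ReFT interventions are actually implemented: `AdapterStore` is a hypothetical name, and the per-user delta is simplified to an additive offset on the base output, purely to show that per-user state lives in a store outside an unchanged base.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Vector = List[float]


@dataclass
class AdapterStore:
    """Harness-side store of per-user deltas; the base is never modified."""

    base: Callable[[Vector], Vector]
    deltas: Dict[str, Vector] = field(default_factory=dict)

    def forward(self, user_id: str, x: Vector) -> Vector:
        h = self.base(x)  # shared, frozen computation
        delta = self.deltas.get(user_id)
        if delta is None:
            return h  # unknown users fall back to the shared behavior
        # Toy stand-in for an adapter: an additive per-user offset.
        return [hi + di for hi, di in zip(h, delta)]


if __name__ == "__main__":
    store = AdapterStore(base=lambda x: [2 * v for v in x])
    store.deltas["u1"] = [1.0, -1.0]
    print(store.forward("u1", [1.0, 2.0]))
    print(store.forward("someone-else", [1.0, 2.0]))
```

Adding a user means adding one small entry to `deltas`; the base callable is shared by every user and is never copied or retrained.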

The twist the reader might not expect: offloading to the harness has a sweet spot, not a monotonic payoff. The ability to *write* useful harness updates is roughly flat across model tiers, but the ability to *benefit* from them follows an inverted U. Weak models fail to invoke the scaffolding at all, and very strong models struggle to follow its instructions faithfully, leaving mid-tier models as the biggest winners [Do stronger models always evolve harnesses better?]. So 'move it to the harness' isn't free: the harness is only as good as the model's ability to actually use what's been externalized.

Sources (12 notes)

Where does agent capability really come from?

Research shows that agent capability shifts from the model itself to the surrounding harness of memory, skills, and protocols. Reliability emerges from externalizing cognitive burden into structured scaffolding rather than scaling model weights.

Does separating planning from execution improve reasoning accuracy?

Modular architectures with separate decomposer and solver models outperform monolithic LLMs, with decomposition ability transferring across domains while solving ability does not. The separation prevents planning-execution interference and produces more generalizable skills.

Do reasoning models actually beat standard models on optimization?

Reasoning variants with extended CoT show no consistent advantage over standard models on constraint-bound numerical tasks like optimal power flow. Extended thinking produces more text, not more iterative computation, suggesting the bottleneck is numeric procedure rather than reasoning steps.

Does verbose chain-of-thought actually help multimodal perception tasks?

Long rationales and text-token RL help reasoning but hurt fine-grained perception tasks because the actual bottleneck is visual attention allocation, not verbalization. Standard CoT optimization trains the wrong policy target.

Can inference compute replace scaling up model size?

Snell et al. (2024) showed that inference-time compute trades off against model parameter scaling, especially on difficult prompts. This reveals pretraining and inference compute are not independent resources.

Can we allocate inference compute based on prompt difficulty?

Research shows inference effectiveness varies dramatically by prompt difficulty. Reallocating the same total compute adaptively—giving easy prompts less and hard ones more—substantially outperforms larger models under uniform budgets.

Does longer reasoning actually mean harder problems?

Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.

Can lightweight adapters replace millions of personalized models?

PEFT adapters function as durable behavioral deltas carrying learned user experience, enabling a single strong base plus millions of lightweight adapters to replace millions of full models—but only when scale-up, scale-down, and scale-out reinforce simultaneously.

Can editing hidden representations beat weight updates for finetuning?

ReFT learns task-specific interventions on frozen model representations rather than updating weights, with LoReFT (low-rank linear subspace variant) dramatically outperforming LoRA across reasoning, instruction-following, and NLU benchmarks while using far fewer parameters.

Can looped computation replace parameter count in world models?

LoopWM achieves up to 100x parameter efficiency by refining latent environment states through iterative computation in a shared block, with spectral-norm constraints providing formal stability guarantees. The approach mirrors physical system recurrence, spending more depth on harder prediction steps.

Can isolating task-specific parameters prevent multi-task fine-tuning interference?

Research shows that identifying core parameter regions per task, clustering overlapping tasks, and freezing core parameters while geometrically merging non-core parameters consistently outperforms standard multi-task fine-tuning. Temporal task scheduling alone proves insufficient without explicit structural parameter isolation.

Do stronger models always evolve harnesses better?

Model capability to produce useful harness edits stays constant across tiers, but capacity to actually benefit from those edits follows an inverted U-shape, peaking in mid-tier models. Weak models fail to invoke harnesses; strong models struggle with faithful instruction-following.
