INQUIRING LINE

What causes weak models to fail at activating harness artifacts?

This explores why smaller or weaker models can't reliably reach for and use the scaffolding — memory, skills, tools, protocols — that a harness provides, even when that scaffolding is sitting right there for them to use.


The question here is why weaker models fail to *activate* harness artifacts: not why they can't read instructions, but why they don't reach for the memory, skills, and protocols a harness offers even when those are available. The cleanest answer in the corpus is that benefiting from a harness is a different skill from producing one, and the benefit follows an inverted-U curve. The capacity to write useful harness edits stays roughly flat across model tiers, but the capacity to actually *invoke and benefit from* them peaks at mid-tier and falls off at both ends: weak models fail to invoke the harness at all, while strong models struggle with faithful instruction-following (see "Do stronger models always evolve harnesses better?"). So weak-model failure isn't about generating bad scaffolding; it's about not triggering it when the moment calls for it.

Why does invocation specifically break down? A harness pays off only by externalizing cognitive burden into structure the model is supposed to follow (see "Where does agent capability really come from?"), and following that structure depends on the very instruction-following ability weak models are short on. The chain-of-thought research sharpens this: CoT is constrained imitation of reasoning *form*, not abstract inference. Format matters more than content, and scaling reasoning ability actually creates instruction-following deficits (see "What makes chain-of-thought reasoning fail in language models?"). A harness artifact is essentially an instruction-following contract ("when X, consult skill Y; record Z to memory"), so a model that imitates surface form without tracking intent will skip the trigger conditions entirely.
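
As a rough illustration of the contract reading above, the sketch below writes "when X, consult skill Y" as an explicit rule and audits one step for missed activations, meaning the trigger held but the skill was never called. Every name here (`Rule`, `missed_activations`, the `tests_failed` field, `debug_skill`) is hypothetical and not taken from any particular harness.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    """One harness contract: when `trigger` holds, the model should call `skill`."""
    name: str
    trigger: Callable[[dict], bool]  # predicate over the observable step state
    skill: str                       # artifact the model is expected to invoke


def missed_activations(rules: list[Rule], state: dict, invoked: set[str]) -> list[str]:
    """Return names of rules whose trigger held but whose skill was never invoked."""
    return [r.name for r in rules if r.trigger(state) and r.skill not in invoked]


# Hypothetical rule: after a failing test run, consult the debugging skill.
RULES = [
    Rule("debug-on-failure", lambda s: s.get("tests_failed", 0) > 0, "debug_skill"),
]

# The model saw two failing tests but only called a generic edit tool.
print(missed_activations(RULES, {"tests_failed": 2}, invoked={"edit_file"}))
```

An audit like this only measures the failure; it says nothing about why a given model skipped the trigger.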

There's a second, sharper failure: even when a weak model produces the right move, it can't reliably *select* it without an external signal. A committee of weak-model calls matches a strong model only when a local soundness check (tests, proofs, type checks) converts latent-correct proposals into actual selections; sampling alone amplifies coverage but can't pick the winner (see "When can weak models match strong model performance?"). Read against harnesses, this says weak models need the harness's verification scaffolding precisely *because* they can't self-select, yet activating that scaffolding is the thing they fail at. It's a chicken-and-egg trap: the artifact that would rescue them is the one they don't know to invoke.
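
The selection step can be sketched in a few lines, assuming the candidates are already sampled and the soundness check is a small test suite. The names `select_verified` and `passes_tests` and the toy `add` task are illustrative assumptions, and the hard-coded candidate list stands in for real weak-model samples.

```python
from typing import Callable, Iterable, Optional


def select_verified(proposals: Iterable[str], is_sound: Callable[[str], bool]) -> Optional[str]:
    """Return the first proposal that passes the external check, or None."""
    for proposal in proposals:
        if is_sound(proposal):
            return proposal
    return None


def passes_tests(src: str) -> bool:
    """Hypothetical soundness check: run the candidate against known input/output pairs."""
    namespace: dict = {}
    try:
        exec(src, namespace)  # acceptable for this toy input; never exec untrusted code
        add = namespace["add"]
        return add(2, 3) == 5 and add(-1, 1) == 0
    except Exception:
        return False


# Stand-ins for samples from a weak model; only the last one is correct.
candidates = [
    "def add(a, b): return a - b",
    "def add(a, b): return a * b",
    "def add(a, b): return a + b",
]
print(select_verified(candidates, passes_tests))
```

Sampling supplies the coverage (the correct answer is somewhere in `candidates`); the check, not the model, does the choosing.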

The failure also compounds over a trajectory. Once a weak model misfires, its own prior errors fill the context and bias everything downstream. This self-conditioning effect degrades long-horizon performance non-linearly, and scaling does *not* fix it; only test-time thinking reduces it (see "Do models fail worse when their own errors fill the context?"). The multi-agent literature names the concrete symptoms: role flipping, flake replies, infinite loops, and conversation deviation, all rooted in LLMs lacking persistent goal representation and stable role identity (see "Why do autonomous LLM agents fail in predictable ways?"). A harness protocol assumes the agent holds onto a goal long enough to recognize when an artifact applies, and that stable goal-holding is exactly what weak models don't have.

The quietly surprising takeaway: making the model bigger isn't the fix, and at the top end it can even hurt. Weak models under-invoke harnesses; strong models struggle to follow them faithfully. The sweet spot for harness leverage is therefore the middle, not the frontier (see "Do stronger models always evolve harnesses better?"). If you want a weak model to use its scaffolding, the lever is structure that lowers the activation bar, such as verifiable triggers and external soundness signals, not raw capability (see "Where does agent capability really come from?").
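
One way to read "lowering the activation bar" is to let the harness evaluate a cheap, verifiable trigger itself and place the artifact in context, so activation no longer depends on the model deciding to invoke it. The sketch below is a minimal illustration under that assumption; `SKILLS`, `TRIGGERS`, `build_prompt`, and the `exit_code` field are hypothetical names, not an existing API.

```python
from typing import Callable

# Hypothetical skill registry: skill name -> text the harness can place in context.
SKILLS = {"debug_skill": "Reproduce the failure, isolate the cause, then patch."}

# Hypothetical verifiable triggers: each pairs a cheap state check with a skill name.
TRIGGERS: list[tuple[Callable[[dict], bool], str]] = [
    (lambda s: s.get("exit_code", 0) != 0, "debug_skill"),
]


def build_prompt(task: str, state: dict) -> str:
    """Prepend every skill whose trigger fires, so the model need not invoke it."""
    fired = [SKILLS[name] for check, name in TRIGGERS if check(state)]
    return "\n\n".join(fired + [task])


print(build_prompt("Fix the build.", {"exit_code": 1}))
```

The trade-off is that triggers must be checkable from observable state; conditions that need judgment still fall back on the model.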


Sources (6 notes)

Do stronger models always evolve harnesses better?

Model capability to produce useful harness edits stays constant across tiers, but capacity to actually benefit from those edits follows an inverted U-shape, peaking in mid-tier models. Weak models fail to invoke harnesses; strong models struggle with faithful instruction-following.

Where does agent capability really come from?

Research shows that agent capability shifts from the model itself to the surrounding harness of memory, skills, and protocols. Reliability emerges from externalizing cognitive burden into structured scaffolding rather than scaling model weights.

What makes chain-of-thought reasoning fail in language models?

Research shows CoT mirrors reasoning form without true logical abstraction. Format matters more than content, invalid prompts work as well as valid ones, and scaling reasoning creates instruction-following deficits.

When can weak models match strong model performance?

Sampling alone amplifies coverage but cannot select correct solutions. Reliable performance matching requires external soundness signals—tests, proofs, or type checks—that convert latent correct proposals into actual selections.

Do models fail worse when their own errors fill the context?

Error accumulation in context causes non-linear performance degradation in long-horizon tasks. Model scaling does not fix this; only test-time compute through thinking models reduces the effect by preventing error-contaminated context from biasing reasoning.

Why do autonomous LLM agents fail in predictable ways?

Research identifies role flipping, flake replies, infinite loops, and conversation deviation as LLM-specific failures in multi-agent cooperation. These occur because LLMs lack persistent goal representation and stable role identity.
