What happens when you project the same model onto different harnesses?
Here 'harness' is read broadly: the scaffolding around a model's weights (access tier, RAG pipeline, tool-use wiring, training regime, context management). The question is how the same underlying model behaves differently depending on what you wrap around it.
The idea explored here is a quietly radical one in the corpus: a model's behavior isn't a fixed property of its weights; it's co-produced by the harness you put around it. Take the same base model, project it through different scaffolds, and you get different capabilities, different diversity profiles, and different failure modes. The collection keeps circling this from multiple angles, and the through-line is that 'what the model can do' is often a question about the rig, not the weights.
Start with the hardest ceiling: access. The specialization taxonomy ("Does model access level determine which specialization techniques work?") argues that whether you have black-box, grey-box, or white-box access sets the upper bound on what any technique can achieve: black-box harnesses can only *activate* knowledge the model already has, while white-box ones can inject genuinely new knowledge (at the risk of over-specializing). The model is the same; the harness decides whether you're nudging or rewriting. The same logic appears in deployment. Hierarchical RAG ("Can smaller models handle RAG filtering while larger models focus on synthesis?") shows you don't even have to settle on *one* harness per task: splitting filtering onto cheap models and synthesis onto expensive ones beats uniform deployment on both cost and quality. And how you wire the reasoning loop matters independently of the model: decoupling reasoning from tool observations ("Can reasoning and tool execution be truly decoupled?"; ReWOO, Chain-of-Abstraction) eliminates quadratic prompt growth and unlocks parallelism without touching the weights at all.
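To make the "quadratic prompt growth" point concrete, here is a rough token-accounting sketch. It is not ReWOO's or Chain-of-Abstraction's actual implementation. It assumes a simplified cost model in which every step adds a fixed-size thought and a fixed-size tool observation, and the function names and token counts are hypothetical, chosen only for illustration.

```python
def interleaved_prompt_tokens(steps, base, thought, observation):
    """Total prompt tokens when every LLM call re-reads the full history.

    Assumed model: call k sees the base prompt plus all earlier
    thoughts and observations, so history grows by a fixed amount per step.
    """
    total = 0
    history = 0
    for _ in range(steps):
        total += base + history            # this call re-reads everything so far
        history += thought + observation   # the step's output joins the context
    return total


def decoupled_prompt_tokens(steps, base, thought, observation):
    """Total prompt tokens when planning is separated from tool execution.

    Assumed model: one planner call on the base prompt, tool calls that do
    not go through the LLM, then one solver call that reads the plan and
    all observations once.
    """
    planner_call = base
    solver_call = base + steps * (thought + observation)
    return planner_call + solver_call


# Illustrative numbers only: 500-token task prompt, 50-token thoughts,
# 200-token tool observations.
for n in (5, 10, 20):
    print(n,
          interleaved_prompt_tokens(n, 500, 50, 200),
          decoupled_prompt_tokens(n, 500, 50, 200))
```

Under these assumptions the interleaved total is `n*base + (thought + observation) * n*(n-1)/2`, which grows with the square of the step count, while the decoupled total is `2*base + n*(thought + observation)`, which grows linearly. The weights are identical in both cases; only the wiring changes.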
The most striking results are about training harnesses, because they reveal how invisible the harness's fingerprints are. RL post-training ("Does RL training collapse format diversity in pretrained models?") consistently collapses a model down to a single dominant output format, and the *winner depends on model scale, not necessarily performance*, an effect largely hidden when you start from proprietary pretrained models. So the same training recipe projects different models onto different attractors. Preference tuning is even less uniform: it reduces lexical diversity in code but *increases* it in creative writing ("Does preference tuning always reduce diversity the same way?"), because the harness's effect flips with what the domain rewards. Training *order* matters too: structured-then-creative scheduling preserves open-ended capability that joint training would have crushed ("Does training order reshape how models handle different task types?"). And push the harness too hard with near-impossible problems and it actively damages pre-existing skills by reinforcing shortcuts ("Do overly hard RLVR samples actually harm model capabilities?").
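The mechanism behind that last point, as summarized in the sources, is group-relative normalization treating rare accidental successes as high-advantage trajectories. A minimal sketch of why that happens follows. It assumes binary rewards and a GRPO-style advantage computed as `(reward - group mean) / (group std + eps)` with the population standard deviation; real implementations differ in details, and the function name and group sizes are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    Assumption: population standard deviation plus a small eps to avoid
    division by zero when every sample in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A moderately hard prompt: half of 16 sampled answers are correct.
moderate = [1.0] * 8 + [0.0] * 8
# A near-impossible prompt: a single (possibly accidental) success out of 16.
near_impossible = [1.0] + [0.0] * 15

print(group_relative_advantages(moderate)[0])         # advantage of a success
print(group_relative_advantages(near_impossible)[0])  # advantage of the lone success
```

With one success in a group of size G, the normalized advantage of that success works out to roughly `sqrt(G - 1)`, so the lone success on the near-impossible prompt is weighted several times more heavily than a success on the moderate prompt. If that lone success came from a shortcut rather than sound reasoning, the shortcut is what gets reinforced.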
Even the runtime context is a harness, and a leaky one. The same model degrades sharply once its own prior errors fill the context window ("Do models fail worse when their own errors fill the context?"). Scaling the model doesn't fix it; only changing the harness (test-time thinking) does. And there's a structural reason: because LLMs process everything as one token string with no compartmentalized memory, every long-context harness faces an unavoidable tradeoff between context collapse and coherence loss ("How do LLMs balance remembering context versus keeping it separate?").
The thing you didn't know you wanted to know: there often isn't a single 'the model' to evaluate. Diversity, reasoning ceiling, format, and robustness are all things the harness can give or take away — which means a benchmark number is really a number about a model *plus its rig*, and swapping the rig can move it more than swapping the weights.
Sources (9 notes)
Three tiers of access—black-box, grey-box, and white-box—create a hierarchy of specialization power. Black-box techniques can only activate existing knowledge; white-box methods can inject new knowledge but risk over-specialization.
HiFi-RAG demonstrates that routing query reformulation, passage pruning, and citation to cheaper models like Gemini Flash while reserving expensive models like Gemini Pro for final generation produces both lower cost and better answers than uniform deployment.
ReWOO and Chain-of-Abstraction both decouple reasoning from tool responses through different mechanisms—planning-before-execution and abstract placeholders respectively—eliminating quadratic prompt growth and sequential latency while maintaining reasoning quality.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
RLHF reduces lexical-syntactic diversity in code generation but increases it in creative writing. The direction depends on what each domain incentivizes: code rewards convergence toward correct solutions, while creative writing rewards stylistic distinctiveness.
Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
Error accumulation in context causes non-linear performance degradation in long-horizon tasks. Model scaling does not fix this; only test-time compute through thinking models reduces the effect by preventing error-contaminated context from biasing reasoning.
Because LLMs process conversation as a single token string without compartmentalized memory, they cannot maintain separate contexts the way humans do. Existing mitigations like compression, longer windows, and retrieval all introduce new failure modes and cannot replicate human compartmentalization.