INQUIRING LINE


Same model, different scaffolding: is the AI you deploy a product of its weights — or everything you built around them?

What happens when you project the same model onto different harnesses?

This reads 'harness' broadly — the scaffolding around a model's weights (access tier, RAG pipeline, tool-use wiring, training regime, context management) — and asks how the same underlying model behaves differently depending on what you wrap around it.

This explores a quietly radical idea in the corpus: a model's behavior isn't a fixed property of its weights — it's co-produced by the harness you put around it. Take the same base model, project it through different scaffolds, and you get different capabilities, different diversity profiles, and different failure modes. The collection keeps circling this from multiple angles, and the through-line is that 'what the model can do' is often a question about the rig, not the weights.

Start with the hardest ceiling: access. The specialization taxonomy ("Does model access level determine which specialization techniques work?") argues that whether you have black-box, grey-box, or white-box access sets the upper bound on what any technique can achieve: black-box harnesses can only *activate* knowledge the model already has, while white-box ones can inject genuinely new knowledge, at the risk of over-specializing. The model is the same; the harness decides whether you're nudging or rewriting. The same logic appears in deployment. Hierarchical RAG ("Can smaller models handle RAG filtering while larger models focus on synthesis?") shows you don't even need a *single* model per pipeline: routing filtering to cheap models and reserving expensive ones for synthesis beats uniform deployment on both cost and quality. And how you wire the reasoning loop matters independently of the model: decoupling reasoning from tool observations ("Can reasoning and tool execution be truly decoupled?"; ReWOO, Chain-of-Abstraction) eliminates quadratic prompt growth and unlocks parallelism without touching the weights at all.
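To make the "quadratic prompt growth" claim concrete, here is a minimal back-of-the-envelope sketch. It assumes a simplified cost model that is not taken from the source notes: an interleaved loop re-sends the full history on every step, while a plan-first scheme in the spirit of ReWOO makes one planner call and one solver call. The function names and the parameters `base` and `per_step` are hypothetical, chosen only for illustration.

```python
def interleaved_prompt_tokens(n_steps: int, base: int, per_step: int) -> int:
    """Total prompt tokens when every step re-sends the full history.

    Assumption: each step appends `per_step` tokens (thought plus tool
    observation) to a history that starts at `base` tokens.
    """
    total = 0
    history = base
    for _ in range(n_steps):
        total += history      # the model re-reads everything accumulated so far
        history += per_step   # this step's thought and observation are appended
    return total


def decoupled_prompt_tokens(n_steps: int, base: int, per_step: int) -> int:
    """Total prompt tokens for a plan-first scheme (one planner, one solver call).

    Assumption: the planner sees only the task, tools run outside the model,
    and the solver reads the task plus all collected evidence exactly once.
    """
    planner_call = base
    solver_call = base + n_steps * per_step
    return planner_call + solver_call
```

Under this toy model, with `base=500` and `per_step=200`, ten steps cost 14,000 prompt tokens interleaved versus 3,000 decoupled, and twenty steps cost 48,000 versus 5,000. The interleaved total grows with the square of the step count, the decoupled total only linearly. Real systems differ in details such as caching and plan length, so treat the numbers as illustrative only.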

The most striking results are about training harnesses, because they reveal how invisible the harness's fingerprints are. RL post-training ("Does RL training collapse format diversity in pretrained models?") consistently collapses a model onto a single dominant output format, and the *winner depends on model scale, not necessarily on performance*, an effect largely hidden when you start from proprietary pretrained models. So the same training recipe projects different models onto different attractors. Preference tuning is even less uniform: it reduces lexical diversity in code but *increases* it in creative writing ("Does preference tuning always reduce diversity the same way?"), because the harness's effect flips with what the domain rewards. Training *order* matters too: scheduling structured tasks before creative ones preserves open-ended capability that joint training would have crushed ("Does training order reshape how models handle different task types?"). And pushing the harness too hard with near-impossible problems actively damages pre-existing skills by reinforcing shortcuts ("Do overly hard RLVR samples actually harm model capabilities?").
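The source note on overly hard RLVR samples attributes the shortcut reinforcement to group-relative normalization, which treats rare accidental successes as high-advantage trajectories. The sketch below shows the arithmetic under stated assumptions: binary rewards, advantages computed as the reward minus the group mean divided by the group's population standard deviation, and no epsilon or clipping. The function name is hypothetical, and actual training recipes may normalize differently.

```python
import statistics


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each reward against its own sampling group.

    Assumption: advantage = (reward - group mean) / group population std.
    A group with identical rewards carries no signal, so it returns zeros.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


# A near-impossible problem: one accidental success among eight samples.
hard_group = group_relative_advantages([1, 0, 0, 0, 0, 0, 0, 0])

# A moderately hard problem: half of the eight samples succeed.
balanced_group = group_relative_advantages([1, 1, 1, 1, 0, 0, 0, 0])
```

In this toy setup the lone success in `hard_group` receives an advantage of about 2.65, while each success in `balanced_group` receives 1.0. The rarer the success, the larger the push toward whatever that trajectory happened to do, whether or not its reasoning was sound.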

Even the runtime context is a harness, and a leaky one. The same model degrades sharply once its own prior errors fill the context window ("Do models fail worse when their own errors fill the context?"). Scaling the model doesn't fix it; only changing the harness (test-time thinking) does. And there's a structural reason: because LLMs process everything as one token string with no compartmentalized memory, every long-context harness faces an unavoidable tradeoff between context collapse and coherence loss ("How do LLMs balance remembering context versus keeping it separate?").

The thing you didn't know you wanted to know: there often isn't a single 'the model' to evaluate. Diversity, reasoning ceiling, format, and robustness are all things the harness can give or take away — which means a benchmark number is really a number about a model *plus its rig*, and swapping the rig can move it more than swapping the weights.

Sources 9 notes

Does model access level determine which specialization techniques work?

Three tiers of access—black-box, grey-box, and white-box—create a hierarchy of specialization power. Black-box techniques can only activate existing knowledge; white-box methods can inject new knowledge but risk over-specialization.

Can smaller models handle RAG filtering while larger models focus on synthesis?

HiFi-RAG demonstrates that routing query reformulation, passage pruning, and citation to cheaper models like Gemini Flash while reserving expensive models like Gemini Pro for final generation produces both lower cost and better answers than uniform deployment.

Can reasoning and tool execution be truly decoupled?

ReWOO and Chain-of-Abstraction both decouple reasoning from tool responses through different mechanisms—planning-before-execution and abstract placeholders respectively—eliminating quadratic prompt growth and sequential latency while maintaining reasoning quality.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Does preference tuning always reduce diversity the same way?

RLHF reduces lexical-syntactic diversity in code generation but increases it in creative writing. The direction depends on what each domain incentivizes: code rewards convergence toward correct solutions, while creative writing rewards stylistic distinctiveness.


Does training order reshape how models handle different task types?

Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Do models fail worse when their own errors fill the context?

Error accumulation in context causes non-linear performance degradation in long-horizon tasks. Model scaling does not fix this; only test-time compute through thinking models reduces the effect by preventing error-contaminated context from biasing reasoning.

How do LLMs balance remembering context versus keeping it separate?

Because LLMs process conversation as a single token string without compartmentalized memory, they cannot maintain separate contexts the way humans do. Existing mitigations like compression, longer windows, and retrieval all introduce new failure modes and cannot replicate human compartmentalization.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing whether model behavior truly depends on harness architecture, or whether recent advances (new model families, better orchestration, mechanistic understanding) have shifted the dependency. The question: **Is a model's capability ceiling set by its weights or by how you scaffold it?**

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat as perishable snapshots:

• Black-box access ceilings are hard: you can only *activate* existing knowledge, never inject new knowledge; white-box harnesses can rewrite but risk over-specialization (2023–2024).
• RL post-training collapses models to a single dominant format; the winner depends on model scale, not performance — an effect masked by proprietary baselines (2025).
• Preference tuning's diversity effect flips by domain: reduces code lexical diversity but *increases* creative-writing diversity, revealing harness-domain co-production (2025).
• Training order (structured-then-creative vs. joint) preserves or crushes open-ended capability independently of weights (2025).
• Long-context harnesses face an unavoidable tradeoff: context collapse vs. coherence loss, with no model-scale escape (2025).

Anchor papers (verify; mind their dates):
• arXiv:2305.18703 (2023) — Domain Specialization taxonomy (black/grey/white-box)
• arXiv:2504.07912 (2025) — RL post-training format collapse
• arXiv:2507.14783 (2025) — Multi-task RL scheduling effects
• arXiv:2605.28388 (2026) — Sample difficulty mechanistics in RLVR

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For black-box vs. white-box ceilings: do new retrieval methods (dense passage retrieval, adaptive routing, in-context adaptation) now allow black-box models to approximate white-box knowledge injection? For format-collapse: have newer RL objectives (entropy-preserving rewards, multi-objective alignment) or larger models escaped the scale-dependent attractor? For training order: do curriculum-learning frameworks or dynamic task weighting dissolve the schedule dependency? Plainly separate what still holds from what's been relaxed.
(2) **Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months.** Look for papers on mechanistic interpretability of harness effects, multi-modal harnesses, or deployment-time adaptation that may have overturned the "unavoidable tradeoff" claims.
(3) **Propose 2 research questions that ASSUME the regime may have moved:** (a) If harness effects can be mechanistically decoded in real time, can a single model self-switch harnesses to maximize task-fit without retraining? (b) If format collapse is reversible via post-hoc decoding, does the constraint move from training to inference, and can inference-time harnesses fully decouple the weight–behavior link?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
