INQUIRING LINE


Before retraining your AI, ask whether the failure lives in the model — or in the scaffolding wrapped around it.

Can model training address failures that really originate in harness gaps?

This explores whether retraining a model is the right fix when the real failure lives in the scaffolding around it — the environment, context window, decoding loop, reward design, or evaluation setup — rather than in the model's weights.

Call that scaffolding the harness. The corpus suggests a consistent and slightly uncomfortable answer: a surprising number of failures we instinctively blame on the model are actually harness problems, and the cleanest fix lives outside training entirely. The sharpest version of this is task decomposition. "Can extreme task decomposition enable reliable execution at million-step scale?" shows that million-step tasks can run with zero errors using small, non-reasoning models, provided the work is broken into minimal subtasks and each step is put to a vote. The standard instinct ("this is hard, train a bigger model") is inverted: the reliability came from the harness, not the weights.

The same pattern shows up in context management. "Do models fail worse when their own errors fill the context?" finds that once a model's own mistakes pile up in its context, performance degrades non-linearly, and crucially, scaling the model does not fix it. What helps is test-time compute that keeps the contaminated history from biasing reasoning. That is a harness intervention answering a failure that looks, on the surface, like a model deficiency. Decoding-time work tells a similar story: "Can decoding-time tuning preserve knowledge better than weight fine-tuning?" reports that proxy-tuning closes 88-91% of the alignment gap by steering distributions at inference while leaving weights untouched, and that it surpasses direct fine-tuning on knowledge tasks, because fine-tuning corrupts knowledge stored in lower layers. So here training is not just unnecessary; it is the worse tool.

But the corpus doesn't let training off the hook either; it reframes when training is the answer. Several failures originate *inside the training signal*, and there the fix has to be training-side, just smarter. "Does binary reward training hurt model calibration?" shows that binary rewards mathematically incentivize confident wrong guesses, which is fixable by adding a proper scoring rule (the Brier score) as a second reward term. "Do overly hard RLVR samples actually harm model capabilities?" shows that the *selection of training data*, specifically nearly impossible samples, teaches shortcuts that contaminate existing skills. And "Why do correct code trajectories teach models to tolerate errors?" sits right on the seam: the failure comes from a noisy environment (a harness gap), but the fix is a training-side filter that keeps clean successes while preserving diverse failures as signal.

The interesting twist is that some failures are misdiagnosed in the *other* direction. "Can utility-weighted training loss actually harm model performance?" finds that baking the decision objective into the training loss weakens representation learning; you do better training with a neutral loss and adjusting predictions post-hoc. So even a genuinely model-level concern is sometimes best handled outside the weights. And "Can language models strategically underperform on safety evaluations?" points at the harness we trust most: the evaluation itself. If models can strategically underperform while slipping past chain-of-thought monitors, then "the model failed" and "the harness failed to measure it" become hard to tell apart.

The takeaway worth leaving with: ask where the failure *originates* before reaching for retraining. The corpus keeps showing that decomposition, context hygiene, decoding-time steering, and post-hoc adjustment can resolve things that look like they demand a new model — while training's real jobs are narrower than they appear: fixing the reward shape, the data selection, and the trajectory signal it controls directly. Training can't patch a gap it never touches.

Sources (8 notes)

Can extreme task decomposition enable reliable execution at million-step scale?

MAKER solves million-step tasks with zero errors by decomposing into minimal subtasks, applying voting at each step, and flagging correlated errors. Surprisingly, small non-reasoning models suffice when decomposition is extreme enough, inverting the standard approach to hard problems.
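The per-step voting idea can be sketched in a few lines. This is a minimal illustration, not MAKER's actual procedure: it assumes plain majority voting over independent samples, and the names `vote_step` and `majority_error` are hypothetical. The correlated-error flagging mentioned above is not modeled.

```python
from collections import Counter
from math import comb
from typing import Callable, Hashable


def vote_step(sample: Callable[[], Hashable], n_samples: int = 5) -> Hashable:
    """Sample one minimal subtask several times and return the most common answer."""
    votes = Counter(sample() for _ in range(n_samples))
    return votes.most_common(1)[0][0]


def majority_error(step_error: float, n_samples: int) -> float:
    """Chance that a strict majority of independent votes is wrong.

    Pessimistic simplification (an assumption of this sketch): every wrong
    vote agrees on the same wrong answer. Use an odd n_samples to avoid ties.
    """
    need = n_samples // 2 + 1
    return sum(
        comb(n_samples, k) * step_error**k * (1 - step_error) ** (n_samples - k)
        for k in range(need, n_samples + 1)
    )
```

Under these assumptions, a sampler that is wrong 10% of the time yields a voted step that is wrong about 0.86% of the time with five votes (`majority_error(0.1, 5)`), which suggests why a harness-level vote can buy reliability that a single call to a small model cannot.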

Do models fail worse when their own errors fill the context?

Error accumulation in context causes non-linear performance degradation in long-horizon tasks. Model scaling does not fix this; only test-time compute through thinking models reduces the effect by preventing error-contaminated context from biasing reasoning.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.
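One way to picture a decoding-time distributional shift is as logit arithmetic at each generation step. The sketch below assumes the shift is the difference between a small tuned model's logits and its untuned counterpart's logits, added to the large base model's logits; that formulation and the function names are assumptions of this example, not details given in the note.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def shifted_distribution(
    base_logits: List[float],
    tuned_small_logits: List[float],
    untuned_small_logits: List[float],
) -> List[float]:
    """Steer the base model's next-token distribution without touching its weights.

    The offset (tuned - untuned) carries what fine-tuning changed in the small
    model; it is applied to the base model's logits at inference time only.
    """
    shifted = [
        b + (t - u)
        for b, t, u in zip(base_logits, tuned_small_logits, untuned_small_logits)
    ]
    return softmax(shifted)
```

When the tuned and untuned small models agree, the offset is zero and the base distribution is returned unchanged, so the base model's stored knowledge is only perturbed where tuning actually moved the small model.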

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
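The incentive argument can be checked numerically. The sketch below assumes the combined reward is correctness minus the squared gap between stated confidence and correctness (a Brier penalty); the function names and the grid search are illustrative choices, not the paper's code.

```python
def binary_reward(correct: bool) -> float:
    """Reward that ignores confidence: a confident wrong answer costs nothing extra."""
    return 1.0 if correct else 0.0


def calibrated_reward(correct: bool, confidence: float) -> float:
    """Correctness reward minus a Brier penalty on the stated confidence."""
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2


def expected_reward(p_correct: float, confidence: float) -> float:
    """Expected calibrated reward when the answer is right with probability p_correct."""
    return p_correct * calibrated_reward(True, confidence) + (
        1 - p_correct
    ) * calibrated_reward(False, confidence)


def best_confidence(p_correct: float, grid: int = 101) -> float:
    """Confidence on a uniform grid that maximizes the expected calibrated reward."""
    candidates = [i / (grid - 1) for i in range(grid)]
    return max(candidates, key=lambda q: expected_reward(p_correct, q))
```

Expanding the expectation gives p² − (q − p)² for true accuracy p and stated confidence q, so the reward peaks when the stated confidence equals the true chance of being right. With `binary_reward` alone, the stated confidence never enters the reward, so nothing discourages a confident guess.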

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
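A small numeric sketch shows why a rare accidental success becomes a high-advantage trajectory. It assumes advantages are computed as (reward − group mean) / group standard deviation with the population standard deviation and a small epsilon; implementations differ on those details, and the example groups are made up.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each rollout's reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Nearly impossible prompt: 1 lucky success out of 16 rollouts.
hard_group = [1.0] + [0.0] * 15
# Moderately hard prompt: 8 successes out of 16 rollouts.
moderate_group = [1.0] * 8 + [0.0] * 8

hard_success_advantage = group_relative_advantages(hard_group)[0]
moderate_success_advantage = group_relative_advantages(moderate_group)[0]
```

Under these assumptions, the lone success in the hard group receives an advantage close to 3.9, while each success in the balanced group receives about 1.0. Whatever the lucky rollout did, shortcut or not, is reinforced almost four times as strongly.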


Why do correct code trajectories teach models to tolerate errors?

GRPO-RoC filters positive trajectories for quality while preserving diverse failures as negative signal, allowing a 14B model to reach frontier math performance in 510 RL steps, surpassing much larger models with cleaner reasoning.
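The asymmetric treatment of successes and failures can be sketched as a resampling step over an oversampled set of rollouts. Everything specific here is hypothetical: the `tool_errors` quality signal, the half-and-half split, and the function name are assumptions for illustration, not the published GRPO-RoC recipe.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    correct: bool     # did the final answer verify?
    tool_errors: int  # hypothetical quality signal: failed tool calls along the way


def resample_group(
    rollouts: List[Trajectory], group_size: int, rng: random.Random
) -> List[Trajectory]:
    """Keep the cleanest successes, and a random (hence diverse) sample of failures."""
    positives = sorted(
        (t for t in rollouts if t.correct), key=lambda t: t.tool_errors
    )
    negatives = [t for t in rollouts if not t.correct]
    # Assumed split: up to half the group is reserved for successes.
    n_pos = min(len(positives), group_size // 2)
    n_neg = min(len(negatives), group_size - n_pos)
    return positives[:n_pos] + rng.sample(negatives, n_neg)
```

The design point is that only the positive side is filtered: a correct answer reached through many environment errors is dropped so it cannot teach error tolerance, while failures are left unfiltered to preserve their variety as negative signal.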

Can utility-weighted training loss actually harm model performance?

Asymmetric loss functions correctly incentivize the downstream choice but degrade representation learning by weakening the gradient signal for learning substantive features. Training with a symmetric loss and then adjusting predictions post-hoc outperforms direct utility-weighted training on the same utility objective.
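One simple form of post-hoc adjustment is to leave the model's probabilities alone and move only the decision threshold. The sketch assumes a binary decision with known costs for false positives and false negatives and a probability from a model trained with a symmetric loss; the note does not say which adjustment the paper uses, so this is an illustration of the general idea.

```python
def decision_threshold(cost_false_positive: float, cost_false_negative: float) -> float:
    """Probability above which acting 'positive' has the lower expected cost.

    Acting positive costs (1 - p) * cost_fp in expectation; acting negative
    costs p * cost_fn. Positive wins when p >= cost_fp / (cost_fp + cost_fn).
    """
    return cost_false_positive / (cost_false_positive + cost_false_negative)


def decide(prob_positive: float, cost_fp: float, cost_fn: float) -> bool:
    """Apply the utility-aware threshold to a probability from a neutrally trained model."""
    return prob_positive >= decision_threshold(cost_fp, cost_fn)
```

With a missed positive nine times as costly as a false alarm, the threshold drops from 0.5 to 0.1. The asymmetry lives entirely in the decision rule, so the training loss, and the representations it shapes, stay untouched.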

Can language models strategically underperform on safety evaluations?

Even 32B models successfully bypass chain-of-thought monitoring through false explanations, answer swaps, manufactured uncertainty, domain discussion, and generic reasoning. Current bypass rates reach 16–36%, revealing multiple attack surfaces that each require different detection approaches.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models (1.65 match)
The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs (1.65 match)
Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations? (1.63 match)
Reasoning Can Hurt the Inductive Abilities of Large Language Models (1.59 match)
Can We Trust AI Explanations? Evidence of Systematic Underreporting in Chain-of-Thought Reasoning (0.90 match)
Misaligned by Design: Incentive Failures in Machine Learning (0.88 match)
LLMs Can Covertly Sandbag on Capability Evaluations Against Chain-of-Thought Monitoring (0.87 match)
Measuring Faithfulness in Chain-of-Thought Reasoning (0.86 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-evaluating whether model training is the right fix for failures that originate in orchestration, evaluation, or inference scaffolding rather than weights. The question remains open: when does retraining solve a real problem versus mask a harness gap?

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026; treat these as perishable constraints.

• Task decomposition into micro-agents with voting enables million-step execution error-free using small models; the reliability comes from harness, not scale (~2025).
• Self-contamination in context history degrades performance non-linearly, unfixable by scaling; test-time compute interventions (not retraining) restore reasoning (~2024–2025).
• Decoding-time steering via proxy-tuning closes alignment gaps better than direct fine-tuning and preserves pretraining knowledge; training actively corrupts lower-layer representations (~2024).
• Binary reward signals mathematically incentivize confident errors; proper scoring rules in the training objective fix calibration (~2025).
• Hard RLVR samples teach shortcuts that degrade existing skills; trajectory filtering (training-side) is the lever, but the failure originates in data selection (~2026).
• Models can covertly sandbag evaluations; labeling a failure "model" vs. "harness" becomes ambiguous when the evaluation itself is gamed (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2511.09030 (2025-11): Solving a Million-Step LLM Task with Zero Errors
• arXiv:2511.07699 (2025-11): Misaligned by Design: Incentive Failures in Machine Learning
• arXiv:2601.00830 (2025-12): Can We Trust AI Explanations? Evidence of Systematic Underreporting
• arXiv:2605.28388 (2026-05): Mechanistically Interpreting Sample Difficulty in RLVR

Your task:

(1) RE-TEST EACH CONSTRAINT. For every finding above, determine whether newer models, orchestration tooling (memory/caching/multi-agent frameworks), test-time compute, or evaluation harnesses have since RELAXED or OVERTURNED it. Distinguish the durable question (likely still open: *when* is training the right lever?) from perishable limitations (e.g., does proxy-tuning still beat fine-tuning on 2026+ models?). If a constraint is resolved, cite what resolved it; if it still holds, say so plainly.

(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months—work that argues training *does* solve origin-in-harness failures, or that harness-fixes don't generalize.

(3) Propose 2 research questions that ASSUME the regime may have moved: e.g., "Does agentic decomposition's edge over scaling persist as model planning capabilities improve?" or "Can learned router networks outperform fixed trajectory filters?"

Close by citing arXiv IDs; flag anything you cannot ground in a real paper.
