Why does externalizing bookkeeping raise effective feedback compute?
This explores why offloading state-tracking to an external harness — instead of making the model hold it all in context — lets a system get more value out of each unit of feedback, and what the corpus says about where that extra leverage comes from.
The clearest evidence is direct: a 20B model paired with a stateful harness (Harness-1) beat the next-best open searcher by 11.4 points on curated recall, and the gain transferred to held-out benchmarks and survived ablation, which suggests the harness was a learned capability rather than plumbing (Can externalizing bookkeeping improve search agent performance?). The intuition behind 'effective compute' is that a model has a fixed budget of attention and reasoning per step. Every token spent re-deriving where it is, what it already tried, and what is still open is a token not spent reasoning about the actual problem. Externalize that bookkeeping and nearly all of the same model's compute lands on the task, so each feedback signal it receives is put to fuller use.
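As a rough illustration of what "externalized bookkeeping" can mean in practice, the sketch below keeps the tried queries, collected documents, and open sub-questions outside the model and hands back a compact summary. It is a minimal sketch, not Harness-1: the class `SearchHarness`, its fields, and its methods are all hypothetical names chosen for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SearchHarness:
    """Hypothetical external record of a search agent's state (not Harness-1)."""

    tried: list[str] = field(default_factory=list)       # queries issued, in order
    found: set[str] = field(default_factory=set)         # document ids collected
    open_items: list[str] = field(default_factory=list)  # unresolved sub-questions

    def record(self, query: str, doc_ids: list[str]) -> int:
        """Log one search step and return how many documents were new."""
        new = set(doc_ids) - self.found
        self.tried.append(query)
        self.found |= new
        return len(new)

    def resolve(self, item: str) -> None:
        """Mark a sub-question as answered, if it is still open."""
        if item in self.open_items:
            self.open_items.remove(item)

    def summary(self) -> str:
        """Compact state the model reads instead of re-deriving it from history."""
        return (
            f"tried={len(self.tried)} found={len(self.found)} "
            f"open={self.open_items}"
        )


harness = SearchHarness(open_items=["who", "when"])
harness.record("q1", ["d1", "d2"])
harness.record("q2", ["d2", "d3"])  # d2 is a repeat, so only d3 counts as new
harness.resolve("who")
print(harness.summary())  # tried=2 found=3 open=['when']
```

The point of the design is that the summary stays short however long the search runs, so the model's context is spent on the next decision rather than on reconstructing history.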
Why feedback in particular benefits becomes clearer once you notice that feedback is not one thing. Natural feedback splits into *evaluative* information (how well an action did) and *directive* information (how it should change), and a scalar reward throws the directive half away (Can scalar rewards capture all the information in agent feedback?). Directive feedback is only usable if the agent can locate it against a faithful record of what it actually did, which is exactly what the harness preserves. The same logic shows up in retrieval agents: supervising the intermediate steps of a search chain substantially outperforms rewarding only the final answer, because the contrast between good and bad steps is where the learning signal lives (Does supervising retrieval steps outperform final answer rewards?). Without an externalized record of the steps, there is no step-level signal to learn from.
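To make the dependence on a step record concrete, here is a small sketch under stated assumptions: each logged step carries both an evaluative score and a directive note, a final-only reward keeps a single number, and contrastive (chosen, rejected) pairs can only be built because the steps were logged. The schema `StepRecord` and the helpers `final_only_reward` and `step_pairs` are hypothetical, and the pairing rule is an illustrative choice rather than the procedure used in the cited notes.

```python
from __future__ import annotations

from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class StepRecord:
    """One logged retrieval step (hypothetical schema)."""

    context: str    # what the agent knew before acting
    action: str     # the query it issued
    score: float    # evaluative: how well the step did
    directive: str  # directive: how the step should change ("" if none)


def final_only_reward(trajectory: list[StepRecord]) -> float:
    """Collapse a trajectory to one number; all directive text is dropped."""
    return trajectory[-1].score


def step_pairs(
    records: list[StepRecord], margin: float = 0.0
) -> list[tuple[str, str, str]]:
    """Build (context, chosen, rejected) triples from steps sharing a context."""
    by_context: dict[str, list[StepRecord]] = {}
    for rec in records:
        by_context.setdefault(rec.context, []).append(rec)

    pairs = []
    for context, group in by_context.items():
        for better, worse in product(group, repeat=2):
            if better.score - worse.score > margin:
                pairs.append((context, better.action, worse.action))
    return pairs


log = [
    StepRecord("c1", "narrow query", 1.0, ""),
    StepRecord("c1", "broad query", 0.0, "restrict to the named entity"),
    StepRecord("c2", "follow-up", 0.5, ""),
]
print(final_only_reward(log))  # 0.5
print(step_pairs(log))         # [('c1', 'narrow query', 'broad query')]
```

Triples like these are the kind of input a preference-based trainer could consume. The training step itself is out of scope for this sketch.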
The corpus also shows the failure mode this avoids. When numerical rewards plateau, it is because the number carries no information about *why* a failure happened, and handing the model a chain-of-thought critique instead breaks the plateau (Can natural language feedback overcome numerical reward plateaus?). Interleaving reasoning with real external queries, rather than reasoning in a closed loop, injects fresh real-world feedback at each step and beats pure chain-of-thought and reinforcement-learning baselines by 10–34 points of absolute accuracy (Can interleaving reasoning with real-world feedback prevent hallucination?). In both cases the lever is the same: a richer, externally anchored signal does more per step than a thin internal one.
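The interleaving pattern can be sketched as a loop that alternates a reasoning step with an external lookup and appends every observation to a trace held outside the policy. This is a toy in the ReAct style, not the ReAct implementation: `interleaved_loop`, `toy_think`, `toy_lookup`, and the `FACTS` table are hypothetical stand-ins for a model call and a real tool.

```python
from __future__ import annotations

from typing import Callable, Optional


def interleaved_loop(
    question: str,
    think: Callable[[str, list[str]], tuple[str, str]],
    lookup: Callable[[str], str],
    max_steps: int = 5,
) -> tuple[Optional[str], list[str]]:
    """Alternate reasoning and external lookups until the policy answers."""
    observations: list[str] = []  # external trace, kept outside the policy
    for _ in range(max_steps):
        kind, payload = think(question, observations)
        if kind == "answer":
            return payload, observations
        # Any other kind is treated as a search request for `payload`.
        observations.append(lookup(payload))
    return None, observations  # step budget exhausted without an answer


# Toy stand-ins: a real system would call a model and a search tool here.
FACTS = {"capital of France": "Paris"}


def toy_think(question: str, observations: list[str]) -> tuple[str, str]:
    if not observations:
        return ("search", question)
    return ("answer", observations[-1])


def toy_lookup(query: str) -> str:
    return FACTS.get(query, "no result")


answer, trace = interleaved_loop("capital of France", toy_think, toy_lookup)
print(answer, trace)  # Paris ['Paris']
```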
There's a deeper reason this isn't just an efficiency trick. Pure self-improvement is structurally circular: it stalls on the generation-verification gap, diversity collapse, and reward hacking, and the methods that actually work all smuggle in an external anchor, whether a past model version, a third-party judge, a user correction, or tool feedback (Can models reliably improve themselves without external feedback?). A stateful harness is one of those anchors. The bookkeeping it holds is the ground truth the model checks itself against, which is why externalizing it does more than free compute. It supplies a trustworthy reference that internal self-tracking cannot, since the model would otherwise be grading its own possibly corrupted memory.
The surprising corollary is that the relationship runs both ways. TransformerFAM shows a model can build working memory by attending to its *own* latents with no extra weights, internalizing a kind of bookkeeping (Can models learn working memory by attending to their own latents?), and Post-Completion Learning shows a model can internalize self-evaluation into unused sequence space at zero inference cost (Can models learn to evaluate their own work during training?). So 'externalize vs. internalize' is really a question of where the bookkeeping is cheapest and most reliable to keep, and the harness wins whenever a faithful, recoverable record matters more than keeping everything in one context window.
Sources (8 notes)
Can externalizing bookkeeping improve search agent performance? A 20B model using Harness-1 achieved 0.730 average curated recall across eight benchmarks, outperforming the next open searcher by 11.4 points. The gains transfer to held-out benchmarks and survive ablation, showing the harness is not mere implementation but a learned capability.
Can scalar rewards capture all the information in agent feedback? Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.
Does supervising retrieval steps outperform final answer rewards? Fine-grained feedback on intermediate retrieval steps significantly boosts agentic RAG performance compared to final-answer-only rewards. DPO trained with both positive and negative step feedback outperforms PPO and single-direction training by directly contrasting good and bad retrieval chains.
Can natural language feedback overcome numerical reward plateaus? Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.
Can interleaving reasoning with real-world feedback prevent hallucination? ReAct demonstrates that alternating verbal reasoning with external tool queries (Wikipedia API, environment interaction) prevents error propagation by injecting real-world feedback at each step. On knowledge-intensive and interactive tasks, this approach outperforms pure chain-of-thought and reinforcement learning by 10-34% absolute accuracy.
Can models reliably improve themselves without external feedback? Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.
Can models learn working memory by attending to their own latents? TransformerFAM demonstrates that adding a feedback loop lets transformers attend to their own latent representations, fostering emergent working memory for indefinitely long inputs. The approach requires no additional weights and improves long-context performance at 1B, 8B, and 24B scales.
Can models learn to evaluate their own work during training? Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.