INQUIRING LINE

When an AI stops tracking what it's already tried, its full reasoning budget goes toward actually solving the problem.

Why does externalizing bookkeeping raise effective feedback compute?

This explores why offloading state-tracking to an external harness — instead of making the model hold it all in context — lets a system get more value out of each unit of feedback, and what the corpus says about where that extra leverage comes from.

The clearest evidence is direct: a 20B model paired with a stateful harness beat the next-best open searcher by 11.4 points on curated recall, and the gain survived ablation, showing the harness was a learned capability rather than plumbing (Can externalized bookkeeping let smaller search agents beat larger ones?). The intuition behind 'effective compute' is that a model has a fixed budget of attention and reasoning per step. Every token spent re-deriving where it is, what it has already tried, and what is still open is a token not spent reasoning about the actual problem. Externalize that bookkeeping and the same model's compute lands almost entirely on the task, so each feedback signal it receives is used more fully.
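As a rough illustration of what "externalized bookkeeping" could look like, here is a minimal sketch of a harness that remembers the search history so the prompt only needs a short state summary. The class `SearchHarness`, its fields, and its methods are hypothetical names invented for this example; nothing here is taken from Harness-1 itself.

```python
from dataclasses import dataclass, field


@dataclass
class SearchHarness:
    """Hypothetical harness: it, not the model, remembers the search history."""
    tried: list[str] = field(default_factory=list)        # queries already issued, in order
    found: dict[str, str] = field(default_factory=dict)   # doc_id -> short relevance note
    open_questions: list[str] = field(default_factory=list)

    def record(self, query: str, hits: dict[str, str]) -> bool:
        """Log a query and its hits. Returns False if the query is a repeat."""
        normalized = " ".join(query.lower().split())
        if normalized in self.tried:
            return False
        self.tried.append(normalized)
        self.found.update(hits)
        return True

    def brief(self, max_items: int = 5) -> str:
        """Compact state summary to put in the prompt instead of the full transcript."""
        lines = [
            f"tried ({len(self.tried)}): " + "; ".join(self.tried[-max_items:]),
            f"found ({len(self.found)}): " + ", ".join(list(self.found)[-max_items:]),
            "open: " + "; ".join(self.open_questions[:max_items]),
        ]
        return "\n".join(lines)
```

The design choice being sketched is that repeat detection and history live in ordinary program state, so the model reads a few lines from `brief()` rather than re-deriving its position from a long transcript.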

Why feedback specifically benefits is sharper once you notice that feedback is not one thing. Natural feedback splits into *evaluative* information (how well an action did) and *directive* information (how it should change), and a scalar reward throws the directive half away (Can scalar rewards capture all the information in agent feedback?). Directive feedback is only usable if the agent can locate it against a faithful record of what it actually did, which is exactly what the harness preserves. The same logic shows up in retrieval agents: supervising the intermediate steps of a search chain substantially outperforms rewarding only the final answer, because the contrast between good and bad steps is where the learning signal lives (Does supervising retrieval steps outperform final answer rewards?). No externalized record of the steps, no step-level signal to learn from.
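To make the dependence on a step record concrete, the sketch below turns logged steps into chosen/rejected pairs of the kind a preference-based trainer could consume. It is an assumption-laden illustration: `Step`, `preference_pairs`, and the `margin` parameter are hypothetical, and this is not the procedure used by any of the cited papers.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Step:
    state_id: str    # identifies the shared search state the action was taken from
    action: str      # e.g. the query the agent issued
    score: float     # evaluative signal: how well the step did
    note: str = ""   # directive signal: how the step should change


def preference_pairs(steps: list[Step], margin: float = 0.0) -> list[tuple[str, str, str]]:
    """Return (state_id, chosen_action, rejected_action) for steps that share a state."""
    pairs = []
    ordered = sorted(steps, key=lambda s: s.state_id)  # groupby needs sorted input
    for state_id, group in groupby(ordered, key=lambda s: s.state_id):
        candidates = sorted(group, key=lambda s: s.score, reverse=True)
        best, worst = candidates[0], candidates[-1]
        # Only emit a pair when there is a real contrast between two steps.
        if best.score - worst.score > margin:
            pairs.append((state_id, best.action, worst.action))
    return pairs
```

Without the per-step log there is nothing to group by `state_id`, which is the point of the paragraph above: the contrast only exists if the steps were recorded.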

The corpus also shows the failure mode this avoids. When numerical rewards plateau, it's because the number carries no information about *why* a failure happened; handing the model a chain-of-thought critique instead breaks the plateau (Can natural language feedback overcome numerical reward plateaus?). And interleaving reasoning with real external queries, rather than reasoning in a closed loop, injects fresh real-world feedback at each step and beats pure chain-of-thought by 10–34% in absolute accuracy (Can interleaving reasoning with real-world feedback prevent hallucination?). In both cases the lever is the same: richer, externally anchored signal does more per step than a thin internal one.
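The interleaving pattern can be sketched as a short loop in which each reasoning step either answers or issues a lookup whose result is appended before the next step. This is a simplified sketch, not the ReAct implementation: `interleaved_loop`, `think`, and `lookup` are hypothetical names, and in practice `think` would be a model call and `lookup` a real tool.

```python
from typing import Callable, Optional

# think(transcript) returns ("lookup", query) or ("answer", text).
Policy = Callable[[list[str]], tuple[str, str]]


def interleaved_loop(
    question: str,
    think: Policy,
    lookup: Callable[[str], str],
    max_steps: int = 4,
) -> tuple[Optional[str], list[str]]:
    """Alternate reasoning with external lookups; return (answer, transcript)."""
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        kind, text = think(transcript)
        if kind == "answer":
            transcript.append(f"Answer: {text}")
            return text, transcript
        # External feedback enters here, once per step, before the next thought.
        transcript.append(f"Action: lookup[{text}]")
        transcript.append(f"Observation: {lookup(text)}")
    return None, transcript  # step budget exhausted without an answer
```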

There's a deeper reason this isn't just an efficiency trick. Pure self-improvement is structurally circular: it stalls on the generation-verification gap, diversity collapse, and reward hacking, and the methods that actually work all smuggle in an external anchor, whether a past model version, a third-party judge, a user correction, or tool feedback (Can models reliably improve themselves without external feedback?). A stateful harness is one of those anchors. The bookkeeping it holds is the ground truth the model checks itself against, which is why externalizing it doesn't merely free compute. It supplies a trustworthy reference that internal self-tracking can't, since the model would otherwise be grading its own possibly corrupted memory.
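One way to picture the harness as an anchor is a check that compares what the model says it has tried against what the harness actually logged. The function below is a hypothetical sketch (the name `audit_self_report` and its return keys are invented here), intended only to show the shape of such a comparison.

```python
def audit_self_report(claimed: list[str], recorded: list[str]) -> dict[str, list[str]]:
    """Compare the model's self-reported actions with the harness's log."""
    claimed_set, recorded_set = set(claimed), set(recorded)
    return {
        # Actions the model believes it took but the log does not contain.
        "unsupported": sorted(claimed_set - recorded_set),
        # Actions in the log that the model no longer mentions.
        "forgotten": sorted(recorded_set - claimed_set),
    }
```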

The surprising corollary is that the relationship runs both ways. TransformerFAM shows a model can build working memory by attending to its *own* latents with no extra weights, internalizing a kind of bookkeeping (Can models learn working memory by attending to their own latents?), and Post-Completion Learning shows a model can internalize self-evaluation into unused sequence space at zero inference cost (Can models learn to evaluate their own work during training?). So 'externalize vs. internalize' is really a question of where the bookkeeping is cheapest and most reliable to keep, and the harness wins whenever a faithful, recoverable record matters more than keeping everything in one context window.

Sources (8 notes)

Can externalized bookkeeping let smaller search agents beat larger ones?

A 20B model using Harness-1 achieved 0.730 average curated recall, beating the next open searcher by +11.4 points and matching frontier models. The gains transfer to held-out benchmarks, showing the harness itself is learned capability, not mere implementation.

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Does supervising retrieval steps outperform final answer rewards?

Fine-grained feedback on intermediate retrieval steps significantly boosts agentic RAG performance compared to final-answer-only rewards. DPO trained with both positive and negative step feedback outperforms PPO and single-direction training by directly contrasting good and bad retrieval chains.

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.

Can interleaving reasoning with real-world feedback prevent hallucination?

ReAct demonstrates that alternating verbal reasoning with external tool queries (Wikipedia API, environment interaction) prevents error propagation by injecting real-world feedback at each step. On knowledge-intensive and interactive tasks, this approach outperforms pure chain-of-thought and reinforcement learning by 10-34% absolute accuracy.

Can models reliably improve themselves without external feedback?

Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.

Can models learn working memory by attending to their own latents?

TransformerFAM demonstrates that adding a feedback loop lets transformers attend to their own latent representations, fostering emergent working memory for indefinitely long inputs. The approach requires no additional weights and improves long-context performance at 1B, 8B, and 24B scales.

Can models learn to evaluate their own work during training?

Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Let’s Verify Step by Step (1.67 match)
Reward Reasoning Model (1.65 match)
Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback (0.90 match)
TransformerFAM: Feedback attention is working memory (0.90 match)
Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses (0.90 match)
Post-Completion Learning for Language Models (0.89 match)
RAG-Gym: Systematic Optimization of Language Agents for Retrieval-Augmented Generation (0.88 match)
Temporal Self-Rewarding Language Models: Decoupling Chosen-Rejected via Past-Future (0.87 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an AI researcher auditing whether externalizing state-tracking to a stateful harness still raises effective feedback compute. This question spans 2023–2026 in a curated arXiv library; treat these findings as dated claims to be re-tested.

What a curated library found — and when:
• A 20B model paired with a stateful harness beat the next-best open searcher by 11.4 points on curated recall; the gain survived ablation (2026).
• Directive feedback is only learnable when paired with a faithful external record of actions taken; scalar rewards discard directive information (2024).
• Process-level supervision of intermediate search steps outperforms outcome-only reward, because step-level contrast is where signal lives (2024).
• Natural language feedback breaks numerical reward plateaus; interleaved reasoning + real external queries beat pure chain-of-thought by 10–34% (2023–2024).
• Pure self-improvement is circular (generation-verification gap, reward hacking); working systems all externalize to a third-party judge, past model, user correction, or tool feedback (2024).

Anchor papers (verify; mind their dates):
• arXiv:2305.20050 (2023) Let's Verify Step by Step
• arXiv:2412.02674 (2024) Mind the Gap: Examining the Self-Improvement Capabilities of LLMs
• arXiv:2507.20252 (2025) Post-Completion Learning for Language Models
• arXiv:2606.02373 (2026) Harness-1: RL for Search Agents with State-Externalizing Harnesses

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer models (o1, o3, Claude 3.5+), training methods (RL at scale, DPO variants), tooling (SDK harnesses, in-context memory), orchestration (multi-agent loops, persistent caches), or evals have since RELAXED or OVERTURNED it. Separate the durable question (likely: does externalization matter?) from perishable limitations (e.g., 20B model size, specific recall metric). Say plainly where each constraint still holds.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last 6 months—any paper showing models achieve the same recall without external harnesses, or proving internalized bookkeeping matches or beats externalized state.
(3) Propose 2 research questions that assume the regime has moved: one about optimal harness depth (when is internalization cheaper?), one about scaling (does the 11.4-point gain shrink or grow with model size?).

Cite arXiv IDs; flag anything you cannot ground in a real paper.
