INQUIRING LINE

Agentic Systems and Tool Use · Model Architecture and Internals · Training, RL, and Test-Time Scaling · cross-cluster

Does effective feedback compute matter more than raw token expenditure for agent scaling?

This explores whether what scales an agent is the quality of useful feedback it actually absorbs — not the sheer count of tokens or tool calls it burns through.

The question is whether agent performance scales with *effective feedback compute* (informative, non-redundant signal the agent retains and acts on) rather than raw token expenditure. The corpus has a sharp, almost surprising answer: when researchers credited only feedback that was valid and actually used for a decision, that measure predicted performance with R²≈0.94, while raw token and tool-call counts predicted it at only R²≈0.33–0.42 (see "Does raw token spending actually predict agent performance?"). In other words, two agents can spend identical tokens and perform very differently depending on how much of that spend was *informative*. The scaling lever is feedback quality, not interaction volume.
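To make the distinction concrete, here is a minimal sketch of how the two measures can diverge for agents with the same token spend. The `FeedbackEvent` schema, its field names, and the counting rule are illustrative assumptions; the source notes do not give the exact formula used to compute effective feedback compute.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeedbackEvent:
    """One piece of feedback an agent received (hypothetical schema)."""
    content_key: str  # identifier used to detect repeated feedback
    tokens: int       # tokens spent obtaining this feedback
    valid: bool       # did the feedback pass a validity check?
    used: bool        # was it retained and used for a later decision?


def raw_token_spend(events):
    """Raw expenditure: every token counts, informative or not."""
    return sum(e.tokens for e in events)


def effective_feedback_count(events):
    """Count feedback that is valid, used, and not a repeat of earlier feedback."""
    seen = set()
    count = 0
    for e in events:
        if e.valid and e.used and e.content_key not in seen:
            seen.add(e.content_key)
            count += 1
    return count


# Two hypothetical agents with identical token spend.
agent_a = [
    FeedbackEvent("doc1", 500, valid=True, used=True),
    FeedbackEvent("doc2", 500, valid=True, used=True),
    FeedbackEvent("doc3", 500, valid=True, used=True),
    FeedbackEvent("doc4", 500, valid=False, used=False),
]
agent_b = [
    FeedbackEvent("doc1", 500, valid=True, used=True),
    FeedbackEvent("doc1", 500, valid=True, used=True),   # redundant repeat
    FeedbackEvent("doc2", 500, valid=True, used=False),  # never acted on
    FeedbackEvent("doc3", 500, valid=False, used=True),  # invalid signal
]

for name, events in (("A", agent_a), ("B", agent_b)):
    print(name, raw_token_spend(events), effective_feedback_count(events))
```

Both agents spend the same number of tokens, but under this counting rule agent A is credited with three units of effective feedback and agent B with one. A real measurement would also need a defensible definition of "valid" and "used", which this sketch simply takes as given.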

What makes this interesting is that it sits in tension with a competing finding the corpus also holds. Anthropic's multi-agent research evals report that roughly 80% of performance variance is a *token-spending* function, with coordination intelligence explaining far less than budget (see "Does token spending drive multi-agent research performance?" and "How does test-time scaling work at the agent level?"). So is it tokens or feedback? The reconciliation is that tokens are a *proxy*: spending more usually means gathering more feedback, and tokens look like the driver only because most setups have no way to separate the two. The same Anthropic work notes that a model-capability upgrade beats doubling the token budget, a hint that efficiency, and not just quantity, is doing real work. Effective Feedback Compute is what you get when you stop letting token count stand in for the thing it's correlated with.

The lateral picture across the corpus reinforces this. Search-budget and reasoning-token scaling both show the same monotonic-but-diminishing curve (see "Does search budget scale like reasoning tokens for answer quality?" and "How does search scale like reasoning in agent systems?"), and *diminishing returns* is the signature you would expect from feedback losing informativeness as you spend more. Each extra search step or reasoning token buys less because it's increasingly redundant. Meanwhile, multi-agent systems at scale degrade not from too few tokens but from *bad* feedback: agents accept neighbors' claims without verification, propagating errors even though they can still detect direct conflicts (see "Why do multi-agent systems fail to coordinate at scale?"). More interaction can actively hurt if the feedback isn't valid.

The design implication is where this gets practical. If feedback quality is the lever, you optimize by making each token carry more signal rather than buying more tokens. That's exactly what the memory and harness literature does:

- Autonomous memory folding compresses interaction history into structured schemas so the agent reflects on consolidated signal instead of raw logs (see "Can agents compress their own memory without losing critical details?").
- Reconstructing memory through active graph traversal beats retrieve-then-reason while *cutting* token and runtime cost (see "Can agents reconstruct memory on demand instead of retrieving it?").
- Reliability comes from externalizing memory, skills, and protocols into a harness layer so the model stops re-solving the same problems (see agent-reliability-comes-from-externalizing-cognitive-burdens-into-system-structures).

Each of these raises feedback-per-token.

The economic flip side is that once you treat feedback as the unit, raw token cost stops being the meaningful denominator. A 115-day persistent-agent study found 82.9% of tokens were cache reads, so the real cost unit becomes completed artifacts, not tokens (see "Do persistent agents really cost less per token?"). Small models also suffice for most subtasks at 10–30× lower cost, because most agent work is repetitive, low-feedback language plumbing (see "Can small language models handle most agent tasks?"). The thing you didn't know you wanted to know: "spend more tokens" and "get better feedback" look identical until someone measures them apart, and the moment they do, the token count turns out to be the weaker half of the correlation.
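The shift in cost denominator can be sketched with a little arithmetic. Only the 82.9% cache-read share comes from the case study above; the token volume, the per-token prices, the cached-read discount, and the artifact count below are hypothetical placeholders chosen to show the shape of the calculation, not reported figures.

```python
def blended_token_cost(total_tokens, cache_read_fraction, fresh_price, cached_price):
    """Total cost when a share of tokens are billed at a (hypothetical) cached rate."""
    if not 0.0 <= cache_read_fraction <= 1.0:
        raise ValueError("cache_read_fraction must be between 0 and 1")
    cached_tokens = total_tokens * cache_read_fraction
    fresh_tokens = total_tokens - cached_tokens
    return fresh_tokens * fresh_price + cached_tokens * cached_price


def cost_per_artifact(total_cost, artifacts_completed):
    """Cost divided by finished work products rather than by tokens."""
    if artifacts_completed <= 0:
        raise ValueError("artifacts_completed must be positive")
    return total_cost / artifacts_completed


# Hypothetical inputs: 1M tokens, cached reads priced at 10% of fresh tokens.
naive = blended_token_cost(1_000_000, 0.0, fresh_price=1.0, cached_price=0.1)
actual = blended_token_cost(1_000_000, 0.829, fresh_price=1.0, cached_price=0.1)

print(f"cost if every token were fresh: {naive:.0f}")
print(f"cost with 82.9% cache reads:    {actual:.0f}")
print(f"cost per artifact (50 done):    {cost_per_artifact(actual, 50):.0f}")
```

Under these assumed prices, the same token count costs roughly a quarter of the naive estimate, which is why a per-token figure says little until it is tied to what the agent actually completed.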

Sources 11 notes

Does raw token spending actually predict agent performance?

Effective Feedback Compute—crediting only informative, valid, non-redundant feedback retained for decisions—predicts performance (R²≈0.94) far better than raw tokens or tool calls (R²≈0.33–0.42). The scaling lever is feedback quality, not quantity of interaction.

Does token spending drive multi-agent research performance?

Anthropic's internal evals show token spending alone accounts for 80% of performance variance in multi-agent research systems. Model capability upgrades deliver larger gains than doubling token budget, suggesting efficiency matters as much as quantity.

How does test-time scaling work at the agent level?

Research shows 80% of multi-agent performance variance comes from token budget, not coordination intelligence. LatentMAS and shared-KV-cache approaches offer ways to decouple performance gains from token costs.

Does search budget scale like reasoning tokens for answer quality?

Agentic deep research shows monotonic-to-diminishing-returns curves for search iterations, matching reasoning token scaling. This creates a new inference-compute axis: models can trade off reasoning budget against search budget to optimize answer quality.

How does search scale like reasoning in agent systems?

Test-time scaling laws generalize from reasoning to retrieval: search steps follow identical scaling curves to reasoning tokens, making deep research a test-time scaling problem. This insight reframes search as a compute axis comparable to chain-of-thought reasoning.

Why do multi-agent systems fail to coordinate at scale?

AgentsNet benchmark shows agents fail to coordinate strategies either by agreeing too late or adopting strategies without informing neighbors. Agents accept neighbor information without verification, enabling error propagation while remaining capable of detecting direct conflicts.

Can agents compress their own memory without losing critical details?

DeepAgent's autonomous memory folding consolidates interaction history into episodic, working, and tool memory schemas. This reduces token overhead while letting agents pause to reconsider strategies—the autonomy and structure together avoid degradation that plagues poorly designed consolidation.

Can agents reconstruct memory on demand instead of retrieving it?

MRAgent achieves up to 23% gains on reasoning tasks by reconstructing memory through active graph traversal that prunes paths based on accumulated evidence, while reducing token and runtime cost compared to fixed-retrieval pipelines.

Do persistent agents really cost less per token?

A 115-day case study found 82.9% of tokens were cache reads. When context persists and reuses, the meaningful cost denominator becomes completed artifacts, not individual tokens.

Can small language models handle most agent tasks?

SLMs handle the repetitive, well-defined language tasks that constitute most agent work at 10–30× lower cost than LLMs, making heterogeneous architectures (SLMs by default, LLMs selective) the economically rational design pattern.
