Does selective history retrieval outperform full context inclusion in agent reasoning?
This explores whether agents reason better when they retrieve only the relevant slices of their history (memory, retrieved facts, prior steps) rather than carrying their whole context forward. Across the collection, selectivity keeps winning, though for several different reasons. The most direct evidence is that reconstructing memory on demand beats fetching it wholesale: one approach, MRAgent, interleaves reasoning with active graph traversal, pruning paths as evidence accumulates, and reports gains of up to 23% on reasoning tasks over fixed retrieve-then-reason pipelines while using fewer tokens and less runtime (see "Can agents reconstruct memory on demand instead of retrieving it?"). The lesson isn't 'retrieve less' as a blanket rule; it's that *which* history you bring in should be decided dynamically by the reasoning itself, not fixed in advance.
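To make the contrast with fixed retrieval concrete, here is a minimal sketch of traversal that decides what to include while it walks the memory graph. It is not MRAgent's algorithm: the `MEMORY` graph, the keyword-overlap `relevance` score, and the `traverse` function are all hypothetical stand-ins, and pruning here looks only at overlap with the query rather than at accumulated evidence.

```python
from collections import deque

# Hypothetical toy memory graph: node id -> (text, neighbor ids).
MEMORY = {
    "trip": ("User planned a trip to Lisbon in May", ["hotel", "flight", "cat"]),
    "hotel": ("Hotel booked near Alfama for the Lisbon trip", ["budget"]),
    "flight": ("Flight to Lisbon departs May 3", []),
    "cat": ("The cat is named Miso", ["vet"]),
    "vet": ("Vet appointment on June 10", []),
    "budget": ("Trip budget is 1200 euros", []),
}


def relevance(text, query_terms):
    """Assumed scoring rule: number of query terms that appear in the text."""
    return len(set(text.lower().split()) & query_terms)


def traverse(start, query, max_nodes=5):
    """Breadth-first walk that prunes neighbors with no overlap with the query."""
    query_terms = set(query.lower().split())
    selected, seen = [], {start}
    frontier = deque([start])
    while frontier and len(selected) < max_nodes:
        node = frontier.popleft()
        selected.append(node)
        for nxt in MEMORY[node][1]:
            if nxt in seen:
                continue
            seen.add(nxt)
            # Pruned branches are never expanded, so their subtrees cost nothing.
            if relevance(MEMORY[nxt][0], query_terms) > 0:
                frontier.append(nxt)
    return selected


picked = traverse("trip", "lisbon trip budget")
print(picked)
```

In this toy graph the `cat` branch is pruned at the first hop, so `vet` is never visited at all. A fixed pipeline that fetched every neighbor up front would have paid for both.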
The strongest version of the case goes further and argues that full history is actively harmful. A memoryless, Markov-style method, Atom of Thoughts, decomposes a problem into a DAG and contracts it iteratively, so each reasoning state depends only on the current sub-problem, not the accumulating trail behind it. This eliminates 'historical baggage that bloats reasoning' while preserving answer equivalence (see "Can reasoning systems forget history without losing coherence?"). That reframes the question: sometimes the best history to include is none, because accumulated context is noise that crowds out the live problem. But notice this is selection at its most aggressive, not the opposite of selection.
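The memoryless idea can be illustrated with a deliberately simple stand-in. In the sketch below, arithmetic expression trees play the role of the problem, and each contraction step solves the sub-problems that have no dependencies and substitutes their answers, yielding a smaller self-contained problem with the same answer. This is an assumption-laden analogy, not the Atom of Thoughts implementation; `contract` and `solve` are hypothetical names.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def contract(expr):
    """One contraction step: evaluate operations whose operands are both numbers."""
    if not isinstance(expr, tuple):
        return expr
    op, left, right = expr
    if not isinstance(left, tuple) and not isinstance(right, tuple):
        return OPS[op](left, right)
    return (op, contract(left), contract(right))


def solve(expr):
    """Contract until only a number remains, keeping nothing but the current state."""
    state, steps = expr, 0
    while isinstance(state, tuple):
        state = contract(state)  # the previous state is discarded here
        steps += 1
    return state, steps


problem = ("+", ("*", 2, 3), ("-", 10, ("+", 1, 1)))
print(contract(problem))
print(solve(problem))
```

Each intermediate state is a complete problem in its own right, so a solver handed only that state has everything it needs. Nothing from earlier steps has to ride along.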
A quieter but important caveat is that the right amount of context depends on who is reading it. One line of work trains an external manager with reinforcement learning to prune context for a frozen agent, and finds that the optimal compression isn't fixed: strong agents benefit from high-fidelity preservation, while weaker agents need aggressive pruning to stay reliable (see "Can external managers compress context better than frozen agents?"). So 'selective vs. full' isn't a universal verdict; it's a matching problem between how much signal the reasoner can absorb and how much you hand it. Related work on autonomous memory folding (DeepAgent) shows agents can do this pruning themselves, compressing interaction history into structured episodic, working, and tool memory schemas, and that the *structure* is what avoids the degradation that poorly designed consolidation causes (see "Can agents compress their own memory without losing critical details?").
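The matching problem can be sketched as a pruning step whose aggressiveness depends on the agent it serves. Everything below is assumed for illustration: the relevance scores are hand-assigned, the `KEEP_PERCENT` table is a fixed lookup standing in for what the source describes as a learned manager policy, and `prune_context` is a hypothetical helper.

```python
# Assumed budgets: a strong agent keeps most of the context, a weak one far less.
KEEP_PERCENT = {"strong": 80, "weak": 40}

# Hypothetical context items with hand-assigned relevance scores.
CONTEXT = [
    ("Task: fix the failing login test", 0.9),
    ("Tool output: 200 lines of installer logs", 0.1),
    ("Error: KeyError 'session' in auth.py", 0.8),
    ("User said thanks", 0.05),
    ("Hypothesis: session cookie is not being set", 0.6),
]


def prune_context(items, keep_percent):
    """Keep the highest-scoring items, then restore their original order."""
    keep_n = max(1, (len(items) * keep_percent + 99) // 100)  # integer ceiling
    ranked = sorted(range(len(items)), key=lambda i: items[i][1], reverse=True)
    kept = sorted(ranked[:keep_n])
    return [items[i][0] for i in kept]


strong_view = prune_context(CONTEXT, KEEP_PERCENT["strong"])
weak_view = prune_context(CONTEXT, KEEP_PERCENT["weak"])
print(len(strong_view), len(weak_view))
```

The same history produces two different prompts. The design choice worth noting is that the budget is a property of the reader, not of the history.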
There's a deeper why underneath all of this. Reliable agents seem to work by externalizing memory, skills, and protocols into a harness rather than trusting the model to re-derive everything from a giant context window each turn (see "Where does agent reliability actually come from?"). Selective retrieval is one face of that principle: instead of asking the model to find the needle in its own haystack every step, you make state retrievable on demand. The same instinct appears in retrieval beyond memory. A grep-issuing agent, GrepSeek, searches raw corpus text with executable shell commands and beats dense-embedding retrieval on entity-constrained, multi-hop queries, precisely because targeted, executable lookups recover precision that bulk similarity-matching blurs together (see "Can direct corpus search beat embedding-based retrieval?").
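A small sketch shows why exact, executable lookups help on entity-constrained, multi-hop questions. It uses Python's `re` module as a stand-in for shell `grep`, and the corpus, the entity names, and the `founder_school` helper are all invented for illustration; this is not GrepSeek's training setup or tooling.

```python
import re

# Invented corpus with two near-identical entities that similarity search could confuse.
CORPUS = [
    "Acme Robotics was founded by Dana Whitfield in 2011.",
    "Dana Whitfield studied at Northgate University.",
    "Acme Foods was founded by Lee Park in 1998.",
    "Lee Park studied at Eastfield College.",
]


def grep(pattern, lines):
    """Return the lines that match the regular expression, in corpus order."""
    rx = re.compile(pattern)
    return [line for line in lines if rx.search(line)]


def founder_school(company):
    """Two-hop lookup: company -> founder -> school, each hop an exact match."""
    hop1 = grep(rf"^{re.escape(company)} was founded by ", CORPUS)
    if not hop1:
        return None
    founder = re.search(r"founded by (.+?) in \d{4}", hop1[0]).group(1)
    hop2 = grep(rf"^{re.escape(founder)} studied at ", CORPUS)
    if not hop2:
        return None
    return re.search(r"studied at (.+?)\.$", hop2[0]).group(1)


print(founder_school("Acme Robotics"))
```

The first hop's pattern is anchored on the full company name, so "Acme Foods" cannot leak into a query about "Acme Robotics". The second hop is built from the first hop's result, which is the interleaving of lookup and reasoning the paragraph describes.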
The thing you might not expect: the failure mode isn't only 'too much context distracts the model.' It's also that fixed retrieval, deciding once what to include, is itself the problem. The wins come from making inclusion *adaptive and interleaved with reasoning*, whether that means traversing a memory graph, contracting to a memoryless state, or routing pruning through a manager calibrated to the agent's strength. In these notes, selective history retrieval doesn't just outperform full context; it outperforms *any* static decision about what to carry.
Sources (6 notes)
"Can agents reconstruct memory on demand instead of retrieving it?" MRAgent achieves up to 23% gains on reasoning tasks by reconstructing memory through active graph traversal that prunes paths based on accumulated evidence, while reducing token and runtime cost compared to fixed-retrieval pipelines.
"Can reasoning systems forget history without losing coherence?" Atom of Thoughts decomposes problems into DAGs and contracts them iteratively, ensuring each state depends only on the current problem, not prior steps. This memoryless approach eliminates historical baggage that bloats reasoning while maintaining answer equivalence.
"Can external managers compress context better than frozen agents?" An external RL-trained manager can adaptively prune context for frozen agents, with the key insight that stronger agents benefit from high-fidelity preservation while weaker agents need aggressive compression to stay reliable.
"Can agents compress their own memory without losing critical details?" DeepAgent's autonomous memory folding consolidates interaction history into episodic, working, and tool memory schemas. This reduces token overhead while letting agents pause to reconsider strategies; the autonomy and structure together avoid the degradation that plagues poorly designed consolidation.
"Where does agent reliability actually come from?" Reliable LLM agents externalize three cognitive burdens, memory (state persistence), skills (procedural components), and protocols (structured interaction), into a harness layer rather than relying on model scale alone. The harness unifies these externalized components and eliminates the need for the model to solve the same problems repeatedly.
"Can direct corpus search beat embedding-based retrieval?" GrepSeek trains agents to retrieve via executable shell commands over raw text, achieving better multi-hop performance on entity-constrained queries than dense embeddings. The approach scaffolds unstable search mechanics with supervised trajectories, then refines task-oriented behavior through reinforcement learning.