LLM Memory

Can LLMs read long documents like humans do?

How might mimicking human reading strategies—storing gist memories and looking up details on demand—help language models handle documents beyond their effective context window?

Is agent memory a storage problem or a connectivity problem?

Most systems treat memory as a repository to store and retrieve. But what if memory's real usefulness depends on how units are linked together rather than what is stored?

How should agents decide what memories to keep?

Agent memory management splits between agents autonomously recognizing important information versus programmatic triggers. Understanding this choice reveals why different memory architectures prioritize different information types.

How should we actually evaluate agent memory systems?

Current benchmarks score agent memory by task success alone, hiding critical design questions about cost, trade-offs, and robustness. What would evaluation reveal if we decomposed memory into its core data-management stages?

Should agent memory adapt dynamically based on execution feedback?

Can agents improve performance by continuously reshaping memory connections in response to whether tasks succeed or fail, rather than relying on fixed retrieval pipelines? This matters because static memory degrades in changing environments.

Can three axes replace the short-term long-term memory split?

Does breaking agent memory into forms, functions, and dynamics provide a clearer framework than the traditional short-term/long-term distinction? This matters because current agent-memory literature lacks a unified vocabulary, making comparison between systems nearly impossible.

How should agent memory split across time scales?

Explores whether agent working memory should be organized by temporal scope—some components persisting across a conversation, others refreshed each turn. Understanding this distinction could reveal why some memory designs fail.

Can retrieval knowledge compress into a tiny parametric model?

Can the information stored in large non-parametric retrieval datastores be compressed into a small trainable module? This matters because it could combine retrieval's knowledge benefits with the speed of pure parametric methods.

Can a single model replace retrieval for long-term conversation memory?

COMEDY proposes collapsing the standard retrieval pipeline into one unified model that generates, compresses, and responds. But does eliminating the retriever actually improve performance, or does compression lose critical information?

Can lookup memory and computation work together better than either alone?

Mixture-of-Experts handles dynamic logic, but static knowledge might need a different mechanism. Can a hybrid approach combining conditional computation with fast lookup outperform pure sparse models?

Can models consolidate memories during offline sleep phases?

This explores whether LLMs can use dedicated offline periods to consolidate short-term learning into permanent weights, avoiding catastrophic forgetting and the need for expensive retraining.

Does agent memory degrade when continuously consolidated?

Can consolidating agent experiences into summaries actually harm long-term performance? Research on ARC-AGI tasks suggests continuous memory updates may reduce capability below the no-memory baseline.

Why do time-based queries fail in conversational retrieval systems?

Conversational memory systems struggle with questions that reference when something was discussed rather than what was said. Standard vector databases lack temporal indexing to retrieve by metadata like date, speaker, or session order.

Does retrieved memory quality depend on its functional role?

Conversational RAG systems retrieve context to improve responses, but does the *type* of memory matter as much as its relevance score? This explores whether different memory roles (clarifying vs. irrelevant) drive response quality differently.

Can agents learn better from their failures than successes?

Does storing reasoning strategies extracted from both successful and failed experiences improve agent learning compared to tracking only successes or raw trajectories? This matters because failures offer preventative lessons that successes alone cannot teach.

Can encoder models compete with large language models for retrieval?

As decoder-only LLMs dominate NLP, do modernized encoders like BERT still offer efficiency and quality advantages for retrieval, classification, and embedding tasks at scale?

Can language models communicate without human-readable text?

Do instruction-tuned LLMs have the capacity to generate and decode compressed, non-standard text that preserves semantic meaning while sacrificing human readability? What would be the tradeoffs?

Can brain memory systems explain how LLMs should store knowledge?

This explores whether the brain's three-tier memory architecture—neocortex, hippocampus, and prefrontal cortex—maps onto transformer weights, external knowledge stores, and agentic state. Understanding this mapping could reveal which AI memory problems each tier solves and which it cannot.

When do language models stop memorizing and start generalizing?

Can we measure the exact capacity limit where models transition from memorizing training data to learning underlying patterns? Understanding this boundary could reshape how we think about model learning and privacy.

Has memory architecture replaced parameter count as the scaling frontier?

Late-2025 research suggests the field's next major efficiency gains come from restructuring how models store and use experience rather than simply making them larger. Three convergent signals point to this shift.

Can agents learn continuously from experience without updating weights?

This explores whether LLM agents can adapt to new tasks and failures by retrieving past experiences from memory alone, rather than requiring expensive parameter fine-tuning or rigid hardcoded rules.

Can agents reconstruct memory on demand instead of retrieving it?

Explores whether interleaving reasoning with memory traversal during retrieval beats the standard approach of fetching memories first then reasoning over them. Matters because it could reduce wasted token cost and improve agent adaptability.

Can agents fail from weak memory control rather than missing knowledge?

As multi-turn agent workflows grow longer, performance degrades—but is this due to insufficient context or poor memory management? This explores whether memory *control* is the real bottleneck.

Can agents learn preferences by watching rather than asking?

Explores whether multimodal agents can build accurate preference models through continuous observation of user behavior, without explicit instruction, by organizing memory around entities and separating concrete events from derived knowledge.

Where does a model store memorized paragraphs?

Can we pinpoint the specific layers, attention heads, and tokens where language models localize verbatim memorization? Understanding this spatial signature could enable targeted unlearning.

Can storing evolved thoughts prevent inconsistent reasoning in conversations?

When LLMs repeatedly reason over the same conversation history for different questions, they produce inconsistent results. Can storing pre-reasoned thoughts instead of raw history solve this problem?

Can recursive subtask trees overcome context window limits?

Explores whether modeling reasoning as prunable trees of subtasks could eliminate the context length constraints that currently force developers into multi-agent architectures. Asks if working memory can become truly unlimited through selective KV cache retention.

Do RL agents accidentally use environments as memory?

Explores whether reinforcement learning agents unintentionally create external memory through environmental artifacts—like trails and marks—without being explicitly trained to do so, and whether this constitutes genuine cognitive extension.

Can shared agent memory systems reliably delete information?

When multiple users access a shared memory pool with different permission levels, does any current system successfully balance recall utility with access control and true deletion? This matters because institutional deployments require governed, not just intelligent, memory.