Memory Makes the Difference: Evaluating How Different Memory Roles Shape Conversational Agents
Prior research on memory mechanisms in RAG-based conversational systems has emphasized how memory is stored and retrieved. However, far less is known about how memories with different functional roles influence response quality. In particular, it remains unclear how such memories shape an agent’s responses under varying conversational contexts and whether they lead to substantively different response behaviors. Existing evaluations of conversational systems are also largely reference-based and therefore fail to capture nuances in responses that may address users’ preferences differently. In this work, we probe the impact of different memory types in shaping agents’ responses. We present a fine-grained taxonomy of conversational memory, classify retrieved memories into different role types, and design a user-centric evaluation framework that simulates user perspectives. Through comparative experiments on long-term conversational datasets and frontier LLMs, our analysis reveals many differentiated effects of memory types. For example, clarifying memory improves the factual accuracy and constraint awareness of responses, making them more correct and personalized, whereas irrelevant memory reduces topic relevance and degrades constraint awareness.
Introduction. Conversational memory provides the context needed for grounded responses and accurate query understanding (Radlinski and Craswell, 2017), and it is now a core feature of contemporary chat systems such as ChatGPT (OpenAI, 2025b) and Google Gemini (AI and DeepMind, 2025). At a high level, conversation history is logged into an external memory bank to preserve past information. When responding, chat agents retrieve relevant pieces from this bank to surface useful memory. Such a retrieval-augmented generation (RAG) pipeline has become a standard approach, especially in long-range conversation, where it overcomes the context length limits of Large Language Models (LLMs) to maintain response coherence and accuracy (Fan et al., 2024). Existing work on conversational RAG has largely centered on storage and retrieval strategies, i.e., memory structure, retrieval size, and granularity (Xu et al., 2022; Ye et al., 2024; Maharana et al., 2024; Zhong et al., 2024; Wu et al., 2025; Pan et al., 2025; Tan et al., 2025a).
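The logging-and-retrieval loop described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this work: the names MemoryBank, log, retrieve, and build_prompt are hypothetical, and the bag-of-words cosine similarity is only a toy stand-in for whatever dense or sparse retriever a real system would use.

```python
import re
from collections import Counter
from dataclasses import dataclass, field
from math import sqrt


def tokenize(text: str) -> Counter:
    """Lowercase bag-of-words counts; a toy substitute for a learned embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


@dataclass
class MemoryBank:
    """Hypothetical external memory bank holding past conversation turns."""
    turns: list = field(default_factory=list)

    def log(self, turn: str) -> None:
        # Store each past turn verbatim (assumed granularity: one turn per entry).
        self.turns.append(turn)

    def retrieve(self, query: str, k: int = 2) -> list:
        # Rank stored turns by similarity to the query and keep the top k.
        q = tokenize(query)
        ranked = sorted(self.turns, key=lambda t: cosine(q, tokenize(t)), reverse=True)
        return ranked[:k]


def build_prompt(bank: MemoryBank, query: str, k: int = 2) -> str:
    """Prepend retrieved memories to the user query before calling an LLM."""
    memories = bank.retrieve(query, k)
    context = "\n".join(f"- {m}" for m in memories)
    return f"Relevant memories:\n{context}\n\nUser: {query}"


bank = MemoryBank()
bank.log("I am allergic to peanuts.")
bank.log("My sister lives in Oslo.")
bank.log("I prefer vegetarian food.")
print(build_prompt(bank, "Suggest a dinner recipe without peanuts", k=1))
```

In this sketch every retrieved memory is treated uniformly once it is ranked; only its similarity score matters, not the functional role it plays in the response.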
Discussion / Conclusion. In this work, we examine how memory types affect LLMs in conversations using a novel memory taxonomy, a user-centric evaluation, and controlled comparisons. The experiments across agent models, context sizes, and retrievers confirm known patterns and provide new findings for long-term conversational RAG scenarios. In the comparison experiments, we find that conversational RAG performance is driven not merely by retrieving relevant context, but by retrieving the right functional types of memory. These findings point toward memory-type-aware retrieval and selection as a promising direction for IR, with two implications. First, retrieval systems can benefit from being memory-role aware, enabling category-aware curation, ranking, and diversification rather than treating all retrieved context uniformly. Second, our results motivate the development of collections that evaluate the different roles of grounding in RAG for memory and other corpora.