Learn-by-interact: A Data-Centric Framework for Self-Adaptive Agents in Realistic Environments

Paper · arXiv 2501.10893 · Published January 18, 2025
Retrieval-Augmented Generation (RAG) · Reading and Summarization

Autonomous agents powered by large language models (LLMs) have the potential to enhance human capabilities, assisting with digital tasks from sending emails to performing data analysis. The abilities of existing LLMs at such tasks are often hindered by the lack of high-quality agent data from the environments they interact with. We propose Learn-by-interact, a data-centric framework to adapt LLM agents to any given environment without human annotations. Learn-by-interact synthesizes trajectories of agent-environment interactions based on documentation, and constructs instructions by summarizing or abstracting the interaction histories, a process called backward construction. We assess the quality of our synthetic data by using it in both training-based scenarios and training-free in-context learning (ICL), where we craft innovative retrieval approaches optimized for agents. Extensive experiments on SWE-bench, WebArena, OSWorld, and Spider2-V, spanning realistic coding, web, and desktop environments, show the effectiveness of Learn-by-interact in various downstream agentic tasks: baseline results are improved by up to 12.2% for ICL with Claude-3.5 and 19.5% for training with Codestral-22B.
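To make the idea of backward construction more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a trajectory is a list of (action, observation) steps and that an instruction is derived from the steps after the fact. The names `Step`, `Example`, `backward_construct`, and `toy_summarize` are hypothetical. Enumerating contiguous sub-trajectories is an assumption of this sketch, and `toy_summarize` is a stand-in for what would presumably be an LLM call.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Step:
    """One agent-environment interaction (hypothetical structure)."""
    action: str
    observation: str


@dataclass(frozen=True)
class Example:
    """A synthesized (instruction, trajectory) pair."""
    instruction: str
    trajectory: Tuple[Step, ...]


def backward_construct(
    trajectory: Sequence[Step],
    summarize: Callable[[Sequence[Step]], str],
    min_len: int = 1,
) -> List[Example]:
    """Derive instructions from what the agent actually did.

    Assumption: every contiguous window of at least `min_len` steps is
    paired with an instruction produced by `summarize`.
    """
    examples: List[Example] = []
    n = len(trajectory)
    for start in range(n):
        for end in range(start + min_len, n + 1):
            window = tuple(trajectory[start:end])
            examples.append(Example(summarize(window), window))
    return examples


def toy_summarize(steps: Sequence[Step]) -> str:
    """Stand-in summarizer: joins the actions; a real system would call an LLM."""
    return "Task: " + " then ".join(s.action for s in steps)


demo_trajectory = [
    Step("open mail client", "inbox visible"),
    Step("click compose", "draft window visible"),
    Step("type message", "draft contains text"),
]
demo_examples = backward_construct(demo_trajectory, toy_summarize)
```

Because each instruction is written from the recorded steps, it describes what the trajectory actually accomplished, which is the property the summary above attributes to backward construction.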

Introduction. Pre-trained large language models (LLMs) offer great potential for assisting humans with various tasks in digital settings, such as editing images, performing data analysis, resolving software engineering issues, and navigating commercial platforms (Jimenez et al., 2023; Xie et al., 2023, 2024; Yao et al., 2022a). By streamlining these, LLM agents can greatly enhance human efficiency and productivity, allowing users to shift their focus toward higher-level, creative, and strategic endeavors. To explore this potential, many benchmarks (Cao et al., 2024; Jimenez et al., 2023; Koh et al., 2024; Xie et al., 2024; Zhou et al., 2023b) and agentic frameworks (Chen et al., 2024a; Gur et al., 2023; Yang et al., 2024, 2023; Zhan and Zhang, 2023) have been established based on realistic digital environments, spanning web applications, code development, desktop computing, etc. However, LLMs often fall short of expected performance in these tasks, consistently displaying a significant gap compared to human capabilities.

Discussion / Conclusion. We introduce Learn-by-interact, a data-centric framework to adapt LLM agents to any given environment without human annotations. Based on commonly accessible resources like documentation, LLMs propose downstream tasks and complete them through multi-round interactions with environments. We address the misalignment between instructions and trajectories by updating objectives with new instructions derived from trajectories. Additionally, we design innovative retrieval approaches that leverage agent instructions, interaction histories, and current observations to retrieve synthesized examples. Through extensive experiments, we demonstrate that the synthetic data from Learn-by-interact significantly enhances model performance with both ICL and training. Compared with other leading approaches in agent tasks, Learn-by-interact shows much better performance with lower latency and computational costs, which makes it particularly suitable for large-scale deployment.
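As an illustration of retrieval that draws on the instruction, the interaction history, and the current observation, here is a small sketch under stated assumptions. It is not the retrieval method of the paper. The function names `tokens`, `jaccard`, and `retrieve` are hypothetical, the pool is assumed to hold (instruction, trajectory text) pairs, and token-overlap (Jaccard) similarity is swapped in as a simple stand-in for whatever scoring a real system would use.

```python
from typing import List, Sequence, Set, Tuple


def tokens(text: str) -> Set[str]:
    """Lowercased whitespace tokens; a deliberately crude representation."""
    return set(text.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def retrieve(
    instruction: str,
    history: Sequence[str],
    observation: str,
    pool: Sequence[Tuple[str, str]],
    k: int = 1,
) -> List[Tuple[str, str]]:
    """Return the k pool entries whose instruction best matches the query.

    Assumption: the query is simply the concatenation of the agent's
    instruction, its past steps, and the current observation.
    """
    query = tokens(" ".join([instruction, *history, observation]))
    ranked = sorted(
        pool,
        key=lambda example: jaccard(query, tokens(example[0])),
        reverse=True,
    )
    return ranked[:k]


demo_pool = [
    ("send an email to bob", "open mail client; click compose; type message"),
    ("plot a sales chart", "open spreadsheet; select columns; insert chart"),
]
demo_hits = retrieve(
    instruction="email bob the report",
    history=["opened mail client"],
    observation="compose window visible",
    pool=demo_pool,
    k=1,
)
```

The retrieved pairs could then be placed in the prompt as in-context examples. Combining all three signals into one query is only one possible design; scoring each signal separately is an equally plausible reading of the description above.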