What properties of agent systems only become visible across multiple sessions?
This explores which agent behaviors are invisible in a single run and only show up once an agent persists, learns, or coordinates across many sessions — the properties that single-shot evaluation can't catch.
This reads the question as: what about an agent system is fundamentally a *longitudinal* property — something you literally cannot observe in one session, only across a history of them? The corpus points to several, and they cluster around memory, learning, and coordination.
The first is whether an agent actually *gets better*, or just doesn't get worse. Within a session an agent either succeeds or fails; across sessions you see learning curves and forgetting. Can agents learn new skills without forgetting old ones? frames lifelong learning as exactly this multi-session property: an agent that stores executable skills and composes new ones from old can keep improving, while a weight-updating agent quietly suffers catastrophic forgetting that's only visible when you revisit an old task. Can agents learn continuously from experience without updating weights? makes the mechanism concrete: adaptation happens entirely through accumulated episodic memory rather than parameter changes, so 'how much has this agent improved' is a question that only has meaning over time. Can agents adapt without pausing service to users? sharpens it further: there are two clocks running, fast skill injection within seconds and slow gradient optimization over idle windows of minutes to hours, and the two only reinforce each other across many sessions.
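One way to make "forgetting that's only visible when you revisit an old task" measurable is to log a score per task per session and compare the latest score with the best earlier one. The sketch below is a minimal illustration under assumed inputs: the `(session, task, score)` log format, the function name `forgetting_by_task`, and the example task names are all hypothetical, not taken from any of the cited systems.

```python
from collections import defaultdict


def forgetting_by_task(history):
    """Compute a per-task forgetting score from a multi-session log.

    history: iterable of (session_index, task_id, score) tuples, with score
    in [0, 1]. Returns {task_id: best_earlier_score - latest_score} for every
    task seen at least twice. Positive values mean the agent got worse on a
    task it could previously do; negative values mean it improved.
    """
    scores = defaultdict(list)
    # Sorting by session index puts each task's scores in chronological order.
    for _session, task, score in sorted(history):
        scores[task].append(score)
    return {
        task: max(vals[:-1]) - vals[-1]
        for task, vals in scores.items()
        if len(vals) >= 2  # a single observation says nothing about change
    }


# Hypothetical log: an old task is revisited after later learning.
history = [
    (0, "craft_pickaxe", 1.0),
    (1, "mine_iron", 0.0),
    (2, "mine_iron", 1.0),
    (3, "craft_pickaxe", 0.0),
]
report = forgetting_by_task(history)
print(report)  # {'craft_pickaxe': 1.0, 'mine_iron': -1.0}
```

No single row of this log shows the regression on `craft_pickaxe`; it only appears once sessions 0 and 3 are compared.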
The second invisible property is memory hygiene. A single session never reveals whether an agent's memory is well-structured or quietly rotting. Can agents compress their own memory without losing critical details? shows that consolidation done badly degrades the agent — but that degradation only surfaces session after session as history piles up. How should agent memory split across time scales? adds a useful lens: memory isn't one thing, and the dialogue-level components (conversation history, scratchpad) have completely different failure and update patterns than turn-level ones — distinctions that matter precisely because they play out over the lifetime of an agent, not one turn.
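The dialogue-level versus turn-level split can be sketched as a data structure whose fields have different lifetimes. This is an illustrative sketch only: the four field names follow the components listed in the sources below, but the class `AgentMemory`, its methods, and the reset-on-each-turn policy are assumptions made here for clarity, not RAISE's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class AgentMemory:
    # Dialogue-level: persists for the whole conversation.
    conversation_history: list = field(default_factory=list)
    scratchpad: dict = field(default_factory=dict)
    # Turn-level: rebuilt at the start of each turn (assumed policy).
    examples: list = field(default_factory=list)
    task_trajectory: list = field(default_factory=list)

    def start_turn(self, user_message, retrieved_examples):
        """Append to dialogue-level state and reset turn-level state."""
        self.conversation_history.append(("user", user_message))
        self.examples = list(retrieved_examples)
        self.task_trajectory = []

    def record_step(self, action, observation):
        """Log one action/observation pair for the current turn only."""
        self.task_trajectory.append((action, observation))


memory = AgentMemory()
memory.start_turn("book a flight", ["example: flight booking"])
memory.record_step("search_flights", "3 results")
memory.scratchpad["destination"] = "Lisbon"
memory.start_turn("make it a window seat", ["example: seat change"])
```

After the second `start_turn`, the scratchpad and conversation history still carry state from the first turn, while the examples and trajectory have been replaced. The components that only ever grow are the ones whose upkeep becomes a long-run concern.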
The third is collective and cross-user behavior, which by definition can't appear in any single conversation. How can agent systems share learned skills across users? describes skills that improve by aggregating trajectories across many users and many sessions — siloed individual learning becoming shared capability. And Where does agent reliability actually come from? is the quiet unifier here: reliability itself turns out to be a multi-session property, because it comes from a persistent harness layer (memory, skills, protocols) that exists *between* sessions, not from anything the model does inside one.
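To see why cross-user skill improvement cannot be observed in one conversation, consider a toy aggregation over pooled trajectories. The sketch below is a simplification under stated assumptions: the `(user_id, skill, success)` record format, the function names, and the success-rate threshold are hypothetical, and a plain success-rate filter stands in for the pattern-finding evolver described in the sources.

```python
from collections import defaultdict


def aggregate_skill_stats(trajectories):
    """Pool per-user outcomes into per-skill statistics.

    trajectories: iterable of (user_id, skill_name, success) tuples.
    Returns {skill_name: {"uses": int, "success_rate": float, "users": int}}.
    """
    raw = defaultdict(lambda: {"uses": 0, "successes": 0, "users": set()})
    for user_id, skill, success in trajectories:
        entry = raw[skill]
        entry["uses"] += 1
        entry["successes"] += int(success)
        entry["users"].add(user_id)
    return {
        skill: {
            "uses": e["uses"],
            "success_rate": e["successes"] / e["uses"],
            "users": len(e["users"]),
        }
        for skill, e in raw.items()
    }


def skills_to_refine(stats, max_success_rate=0.5):
    """Return skills whose pooled success rate is at or below the threshold."""
    return sorted(s for s, v in stats.items() if v["success_rate"] <= max_success_rate)


# Hypothetical trajectories from three users across several sessions.
trajectories = [
    ("alice", "parse_invoice", True),
    ("bob", "parse_invoice", False),
    ("bob", "parse_invoice", True),
    ("carol", "parse_invoice", True),
    ("alice", "send_email", False),
]
stats = aggregate_skill_stats(trajectories)
candidates = skills_to_refine(stats)
```

Here `parse_invoice` has a pooled success rate of 0.75 across three users, a figure that no individual user's sessions contain on their own.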
The surprise worth taking away: even *coordination failure* is partly longitudinal. Why do multi-agent systems fail to coordinate at scale? shows agents accepting neighbors' claims without verification, letting errors propagate — a pathology that compounds over repeated interaction. So the things that only become visible across sessions aren't edge cases; they're the properties that actually decide whether an agent system is trustworthy. The single-session demo is the part that lies to you.
Sources (8 notes)
VOYAGER demonstrates that storing executable skills in an embedding-indexed library and composing complex skills from simpler ones allows agents to learn continuously while avoiding the forgetting that occurs with weight-update-based methods. Environmental feedback refines skills while an automatic curriculum drives continual exploration.
AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.
MetaClaw demonstrates that deployed agents require both rapid skill injection from failures (seconds, zero downtime) and slower gradient-based optimization during idle windows (minutes to hours). The two mechanisms reinforce each other, with better policies producing more informative failures and richer skills enabling higher-reward trajectories.
DeepAgent's autonomous memory folding consolidates interaction history into episodic, working, and tool memory schemas. This reduces token overhead while letting agents pause to reconsider strategies—the autonomy and structure together avoid degradation that plagues poorly designed consolidation.
RAISE shows that agent memory consists of four components split across two granularity levels: dialogue-level (conversation history, scratchpad) versus turn-level (examples, task trajectory). This granularity distinction predicts different failure modes and update policies for each component.
SkillClaw aggregates interaction trajectories across users, processes them through an autonomous evolver that identifies patterns and refines skills, then synchronizes updates system-wide. This converts siloed individual learning into shared capability improvement without manual curation.
Research shows reliable LLM agents externalize three cognitive burdens—memory (state persistence), skills (procedural components), and protocols (structured interaction)—into a harness layer rather than relying on model scale alone. The harness unifies these externalities and eliminates the need for the model to solve the same problems repeatedly.
AgentsNet benchmark shows agents fail to coordinate strategies either by agreeing too late or adopting strategies without informing neighbors. Agents accept neighbor information without verification, enabling error propagation while remaining capable of detecting direct conflicts.