Where does agent reliability actually come from?
Exploring whether LLM agent performance depends on larger models or on thoughtful system design choices like memory, skills, and protocols that shift cognitive work outside the model.
How agent capability has shifted from model weights to the harness systems and skill lifecycles that surround them.
Standard measures of agent effort (tokens, tool calls, operations) may not capture what makes inference-time scaling work. What actually drives performance gains when agents spend more compute?
We explore whether base model capability predicts both the ability to write useful harness updates and the ability to benefit from them. The answer reshapes how we should allocate capability in self-evolving agent systems.
Does offloading routine record-keeping to an environment harness free RL policies to focus on semantic search decisions, and can this approach outperform larger searchers with fewer parameters?
Does separating agent code into model capabilities, system harness, and agent-created artifacts help explain why agentic systems fail and where to intervene for improvement?
Agent-created artifacts like patches, tests, and skill libraries outlive single tasks, but we lack guidance on what should persist, how to maintain consistency across agents, and when persistence is worth the engineering effort.
Explores whether treating extracted expertise as versioned files—rather than persona prompts—enables meaningful accountability over person-grounded knowledge. Matters because audit trails determine whether captured skills can be corrected, rolled back, or safely withheld.
Most agent systems store skills as text in prompts, but this inflates token costs and degrades model performance. Could compiling skills into trainable weight-space adapters instead offer a better trade-off between efficiency and capability?
When you subtract what a model already knows, do dedicated memory architectures genuinely enable continual learning, or do they mainly inherit base capability? CL-BENCH isolates learning from prior skill to test this.
Can coupling skill creation directly to the runtime reasoning loop—rather than authoring skills offline—close the gap between when skills are made and when they're used? This matters for whether agents can ground new capabilities in their actual situated context.
Explores whether externalized skill libraries—storing learned behaviors as retrievable code rather than parameter updates—can solve the catastrophic forgetting problem that plagues continual learning systems.
Can a three-role self-play system—Challenger, Reasoner, Judge—bootstrap natural-language skills from raw context alone, without human labels or external reward signals?
Users who run autonomous agents in isolation each rediscover the same solutions because systems lack mechanisms to propagate discoveries. Can centralized aggregation and automatic evolution convert isolated experiences into shared capabilities?
LLM agents faithfully learn from raw experience but systematically disregard condensed summaries of the same experience. This study investigates whether the problem lies in how summaries are made or in how models process them, or whether models simply don't need them.