What are the three distinct layers of agent code?
Does separating agent code into model capabilities, system harness, and agent-created artifacts help explain why agentic systems fail and where to intervene for improvement?
Current benchmarks reduce agents to a single success score, but agents emerge from multiple interacting systems. What dimensions of agent behavior should builders actually measure to predict deployment readiness?
Explores whether LLM agent performance depends on larger models or on thoughtful system design choices like memory, skills, and protocols that shift cognitive work outside the model.
Agent-created artifacts like patches, tests, and skill libraries outlive single tasks, but we lack guidance on what should persist, how to maintain consistency across agents, and when persistence is worth the engineering effort.
Explores whether code functions not just as LLM output but as the executable medium through which agents reason, act, and verify progress. This reframing treats code as infrastructure rather than deliverable.
Most agent systems store skills as text in prompts, but this inflates token costs and degrades model performance. Could compiling skills into trainable weight-space adapters instead offer a better trade-off between efficiency and capability?
Explores whether treating extracted expertise as versioned files, rather than persona prompts, enables meaningful accountability over person-grounded knowledge. This matters because audit trails determine whether captured skills can be corrected, rolled back, or safely withheld.
Does offloading routine record-keeping to an environment harness free RL policies to focus on semantic search decisions, and can this approach outperform larger searchers while using fewer parameters?
We explore whether base model capability predicts both the ability to write useful harness updates and the ability to benefit from them. The answer reshapes how we should allocate capability in self-evolving agent systems.
If agents need better memory, should we focus on adding storage or improving what gets kept? This explores why curation and selective forgetting matter more than raw capacity for reliable agent performance.
The arXiv papers behind this sub-topic. Links may take you off-site to arxiv.org.