TOPIC

Agent Harness

10 synthesis notes · 7 source papers

What are the three distinct layers of agent code?

Does separating agent code into model capabilities, system harness, and agent-created artifacts help explain why agentic systems fail and where to intervene for improvement?

Should agent evaluation measure more than task success?

Current benchmarks reduce agents to a single success score, but agents emerge from multiple interacting systems. What dimensions of agent behavior should builders actually measure to predict deployment readiness?

Where does agent reliability actually come from?

Explores whether LLM agent reliability comes from larger models or from deliberate system design choices, such as memory, skills, and protocols, that shift cognitive work outside the model.

What makes agent-authored code worth persisting and sharing?

Agent-created artifacts like patches, tests, and skill libraries outlive single tasks, but we lack guidance on what should persist, how to maintain consistency across agents, and when persistence is worth the engineering effort.

Can code serve as the operational substrate for agent reasoning?

Explores whether code functions not just as LLM output but as the executable medium through which agents reason, act, and verify progress. This reframing treats code as infrastructure rather than deliverable.

Can skills work better as weights than as prompts?

Most agent systems store skills as text in prompts, but this inflates token costs and degrades model performance. Could compiling skills into trainable weight-space adapters instead offer a better trade-off between efficiency and capability?

Can person-grounded skills remain auditable without hidden prompt state?

Explores whether treating extracted expertise as versioned files, rather than persona prompts, enables meaningful accountability over person-grounded knowledge. This matters because audit trails determine whether captured skills can be corrected, rolled back, or safely withheld.

Source papers 7

The arXiv papers behind this sub-topic. Links may take you off-site to arxiv.org.

COLLEAGUE.SKILL: Automated AI Skill Generation via Expert Knowledge Distillation
LLM agents are increasingly expected not only to complete isolated tasks, but also to carry bounded representations of human expertise, judgment, and interaction style. Building such person-grounded a…
Code as Agent Harness
Recent large language models (LLMs) have demonstrated strong capabilities in understanding and generating code, from competitive programming to repository-level software engineering. In emerging agent…
Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering
Large language model (LLM) agents are increasingly built less by changing model weights than by reorganizing the runtime around them. Capabilities that earlier systems expected the model to recover in…
From Model Scaling to System Scaling: Scaling the Harness in Agentic AI
This paper studies the next major bottleneck in agentic AI as system scaling, not only model scaling: the design of auditable, persistent, modular, and verifiable architectures around foundation model…
Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents
LLM agents are increasingly deployed as systems built around editable external harnesses, including prompts, skills, memories and tools, that shape task execution without changing model parameters. Ha…
Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses
Search agents are often trained as policies over growing transcripts: the model must decide how to search while also remembering what it has seen, which evidence is useful, which constraints remain op…
LatentSkill: From In-Context Textual Skills to In-Weight Latent Skills for LLM Agents
Agent systems increasingly use textual skills to encode reusable task procedures, but injecting these skills into the prompt at every step incurs substantial context overhead and exposes skill content…