Where does agent capability really come from?

How agent capability has shifted from model weights to the harness systems and skill lifecycles that surround them.

Topic Hub · 19 linked notes · 3 sections

Core Insights

14 notes

Where does agent reliability actually come from?

Exploring whether LLM agent performance depends on larger models or on thoughtful system design choices like memory, skills, and protocols that shift cognitive work outside the model.

Does raw token spending actually predict agent performance?

Standard measures of agent effort—tokens, tool calls, operations—may not capture what makes inference-time scaling work. This explores what actually drives performance gains when agents spend more compute.

Do stronger models always evolve harnesses better?

We explore whether base model capability predicts both the ability to write useful harness updates and the ability to benefit from them. The answer reshapes how we should allocate capability in self-evolving agent systems.

Can externalized bookkeeping let smaller search agents beat larger ones?

Does offloading routine record-keeping to an environment harness free RL policies to focus on semantic search decisions, and can this approach outperform larger searchers with fewer parameters?

What are the three distinct layers of agent code?

Does separating agent code into model capabilities, system harness, and agent-created artifacts help explain why agentic systems fail and where to intervene for improvement?

What makes agent-authored code worth persisting and sharing?

Agent-created artifacts like patches, tests, and skill libraries outlive single tasks, but we lack guidance on what should persist, how to maintain consistency across agents, and when persistence is worth the engineering effort.

Can person-grounded skills remain auditable without hidden prompt state?

Explores whether treating extracted expertise as versioned files—rather than persona prompts—enables meaningful accountability over person-grounded knowledge. Matters because audit trails determine whether captured skills can be corrected, rolled back, or safely withheld.

Can skills work better as weights than as prompts?

Most agent systems store skills as text in prompts, but this inflates token costs and degrades model performance. Could compiling skills into trainable weight-space adapters instead offer a better trade-off between efficiency and capability?

Do memory systems actually help language models learn continuously?

When you subtract what a model already knows, do dedicated memory architectures genuinely enable continual learning, or do they mainly inherit base capability? CL-BENCH isolates learning from prior skill to test this.

Does creating skills inside the agent loop eliminate mismatches?

Can coupling skill creation directly to the runtime reasoning loop—rather than authoring skills offline—close the gap between when skills are made and when they're used? This matters for whether agents can ground new capabilities in their actual situated context.

Can agents learn new skills without forgetting old ones?

Explores whether externalized skill libraries—storing learned behaviors as retrievable code rather than parameter updates—can solve the catastrophic forgetting problem that plagues continual learning systems.

Can language models learn skills without human supervision?

Can a three-role self-play system—Challenger, Reasoner, Judge—bootstrap natural-language skills from raw context alone, without human labels or external reward signals?

How can agent systems share learned skills across users?

Individual users operating autonomous agents independently rediscover solutions because systems lack mechanisms to propagate discoveries. Can centralized aggregation and automatic evolution convert isolated experiences into shared capabilities?

Why do LLM agents ignore condensed experience summaries?

LLM agents faithfully learn from raw experience but systematically disregard condensed summaries of the same experience. This study investigates whether the problem lies in how summaries are made or in how models process them, or whether models simply don't need them.
