Where does agent capability really come from? · Gravity7

Where does agent reliability actually come from?

Exploring whether LLM agent performance depends on larger models or on thoughtful system design choices like memory, skills, and protocols that shift cognitive work outside the model.

Does raw token spending actually predict agent performance?

Standard measures of agent effort—tokens, tool calls, operations—may not capture what makes inference-time scaling work. This explores what actually drives performance gains when agents spend more compute.

Do stronger models always evolve harnesses better?

We explore whether base model capability predicts both the ability to write useful harness updates and the ability to benefit from them. The answer reshapes how we should allocate capability in self-evolving agent systems.

Can externalized bookkeeping let smaller search agents beat larger ones?

Does offloading routine record-keeping to an environment harness free RL policies to focus on semantic search decisions, and can this approach let searchers with fewer parameters outperform larger ones?

What are the three distinct layers of agent code?

Does separating agent code into model capabilities, system harness, and agent-created artifacts help explain why agentic systems fail and where to intervene for improvement?

What makes agent-authored code worth persisting and sharing?

Agent-created artifacts like patches, tests, and skill libraries outlive single tasks, but we lack guidance on what should persist, how to maintain consistency across agents, and when persistence is worth the engineering effort.

Can person-grounded skills remain auditable without hidden prompt state?

Explores whether treating extracted expertise as versioned files—rather than persona prompts—enables meaningful accountability over person-grounded knowledge. Matters because audit trails determine whether captured skills can be corrected, rolled back, or safely withheld.

Can skills work better as weights than as prompts?

Most agent systems store skills as text in prompts, but this inflates token costs and degrades model performance. Could compiling skills into trainable weight-space adapters instead offer a better trade-off between efficiency and capability?

Do memory systems actually help language models learn continuously?

When you subtract what a model already knows, do dedicated memory architectures genuinely enable continual learning, or do they mainly inherit base capability? CL-BENCH isolates learning from prior skill to test this.

Does creating skills inside the agent loop eliminate mismatches?

Can coupling skill creation directly to the runtime reasoning loop—rather than authoring skills offline—close the gap between when skills are made and when they're used? This matters for whether agents can ground new capabilities in their actual situated context.

Can agents learn new skills without forgetting old ones?

Explores whether externalized skill libraries—storing learned behaviors as retrievable code rather than parameter updates—can solve the catastrophic forgetting problem that plagues continual learning systems.

Can language models learn skills without human supervision?

Can a three-role self-play system—Challenger, Reasoner, Judge—bootstrap natural-language skills from raw context alone, without human labels or external reward signals?

How can agent systems share learned skills across users?

Individual users operating autonomous agents independently rediscover solutions because systems lack mechanisms to propagate discoveries. Can centralized aggregation and automatic evolution convert isolated experiences into shared capabilities?

Why do LLM agents ignore condensed experience summaries?

LLM agents faithfully learn from raw experience but systematically disregard condensed summaries of the same experience. This study investigates whether the problem lies in how summaries are made or in how models process them, or whether models simply don't need them.