SYNTHESIS NOTE
Agentic Systems and Tool Use

Where does agent reliability actually come from?

Exploring whether LLM agent performance depends on larger models or on thoughtful system design choices like memory, skills, and protocols that shift cognitive work outside the model.

Synthesis note · 2026-04-18 · sourced from Design Frameworks

Drawing on Norman's concept of cognitive artifacts, this paper argues that the most consequential design choices in LLM agents are about externalization — relocating cognitive burdens from the model's internal computation into persistent, inspectable, reusable external structures. A shopping list doesn't expand memory; it changes recall into recognition. The same logic governs agent design.

Three dimensions of externalization address three recurrent mismatches:

  1. Memory externalizes state across time. The context window is finite and session memory is weak. Memory systems transform recall into recognition — the agent retrieves past knowledge from a persistent store rather than regenerating it from weights. This solves the continuity problem.

  2. Skills externalize procedural expertise. Without them, the model rederives long multi-step procedures on every run instead of executing them consistently. Skill systems transform generation into composition: the agent assembles behavior from pre-validated components rather than improvising each step. This solves the variance problem.

  3. Protocols externalize interaction structure. Interactions with tools, services, and collaborators are brittle when left to free-form prompting. Protocols transform ad-hoc coordination into structured contracts, such as the Model Context Protocol (MCP). This solves the coordination problem.
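
The three dimensions above can be made concrete with a minimal sketch. Everything here is illustrative: the class names (MemoryStore, SkillRegistry, ToolContract), their methods, and the keyword-overlap retrieval are assumptions chosen for brevity, not an API from the source or from any specific framework.

```python
from dataclasses import dataclass
from typing import Callable


class MemoryStore:
    """Memory: persistent state the agent looks up instead of regenerating."""

    def __init__(self) -> None:
        self._facts: dict[str, str] = {}

    def write(self, key: str, value: str) -> None:
        self._facts[key] = value

    def recall(self, query: str) -> list[str]:
        # Recognition instead of recall: return stored values whose key
        # shares a word with the query (a toy stand-in for real retrieval).
        words = set(query.lower().split())
        return [v for k, v in self._facts.items() if words & set(k.lower().split())]


class SkillRegistry:
    """Skills: pre-validated procedures that are composed by name."""

    def __init__(self) -> None:
        self._skills: dict[str, Callable[[str], str]] = {}

    def register(self, name: str, fn: Callable[[str], str]) -> None:
        self._skills[name] = fn

    def compose(self, names: list[str], value: str) -> str:
        # Composition instead of generation: run known steps in order.
        for name in names:
            value = self._skills[name](value)
        return value


@dataclass(frozen=True)
class ToolContract:
    """Protocol: a structured contract for a single tool call."""

    name: str
    required: tuple[str, ...]

    def validate(self, args: dict[str, object]) -> list[str]:
        # Return the missing required argument names (empty list means valid).
        return [r for r in self.required if r not in args]


# Hypothetical usage with made-up data.
memory = MemoryStore()
memory.write("deploy region", "eu-west-1")
recalled = memory.recall("which region")

skills = SkillRegistry()
skills.register("strip", str.strip)
skills.register("upper", str.upper)
composed = skills.compose(["strip", "upper"], "  ok ")

contract = ToolContract(name="search", required=("query",))
missing = contract.validate({})
```

In each case the model is spared a task it would otherwise solve internally: remembering the region, rederiving the procedure, or guessing the shape of a tool call.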

The harness is not a fourth dimension — it is the engineering layer that hosts all three and provides orchestration logic, constraints, observability, and feedback loops. The progression is: weights → context → harness, paralleling the human history of cognitive externalization (speech → writing → printing → computation).
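
As a rough illustration of what "hosting" means, the sketch below wraps a stub model in a small harness that supplies orchestration (the loop), a constraint (a step limit), observability (a log), and a feedback loop (errors are fed back into the next prompt). The Harness class, the stub_model function, and the dictionary-shaped tool request are all hypothetical and chosen only to keep the example self-contained.

```python
from dataclasses import dataclass, field
from typing import Callable

# Assumed model interface: takes a prompt string, returns a tool request dict.
Model = Callable[[str], dict]


@dataclass
class Harness:
    model: Model
    tools: dict[str, Callable[..., str]]
    memory: list[str] = field(default_factory=list)  # persistent state
    log: list[str] = field(default_factory=list)     # observability
    max_steps: int = 3                                # constraint

    def run(self, task: str) -> str:
        feedback = ""
        for step in range(self.max_steps):
            # Build the context from the task, remembered notes, and feedback.
            prompt = "\n".join([task, *self.memory, feedback])
            request = self.model(prompt)
            tool = self.tools.get(request.get("tool", ""))
            if tool is None:
                # Feedback loop: report the problem to the model and retry.
                feedback = f"error: unknown tool {request.get('tool')!r}"
                self.log.append(f"step {step}: {feedback}")
                continue
            result = tool(**request.get("args", {}))
            self.memory.append(f"{request['tool']} -> {result}")
            self.log.append(f"step {step}: ok {request['tool']}")
            return result
        return "gave up"


def stub_model(prompt: str) -> dict:
    # Stand-in for an LLM: asks for a non-existent tool first, then
    # corrects itself once the harness has fed an error back.
    if "error" in prompt:
        return {"tool": "add", "args": {"a": 2, "b": 3}}
    return {"tool": "sum", "args": {"a": 2, "b": 3}}


harness = Harness(model=stub_model, tools={"add": lambda a, b: str(a + b)})
answer = harness.run("add 2 and 3")
```

The stub model never changes between steps; the recovery comes entirely from the harness routing the error back into the next prompt.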

Critical system-level couplings:

This reframes the question from "how capable is the model?" to "what burdens have been externalized so the model no longer has to solve them internally every time?" The base model may remain unchanged; what changes is the representation of the task.

This connects to "Why do production AI agents stay deliberately simple?": the externalization framework explains why custom harnesses outperform, since they externalize the right cognitive burdens for their specific domain. It also extends "When should human-agent systems ask for human help?": Magentic-UI's mechanisms (co-planning, action guards, memory) are specific instances of the three externalization dimensions.

The "From Model Scaling to System Scaling" paper sharpens this into an explicit framing: model scaling (bigger models, more data, higher benchmark scores) versus system scaling (designing the auditable, persistent, modular, verifiable architecture around the model). It treats the harness as a first-class object of design, evaluation, and optimization, decomposing it into a foundation model, memory substrate, context constructor, skill-routing layer, orchestration loop, and verification-and-governance layer — a finer-grained partition of the same memory/skills/protocols externalization. Its central demonstration is that comparable models projected onto different harnesses (Claude Code, OpenClaw, and the released CheetahClaws reference harness) produce qualitatively different agents, making the harness "now a primary source of practical capability." This is direct evidence for the claim that reliability comes from the surrounding system, not from a larger model alone.
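
The idea that one model behaves differently under different harnesses can be shown with a toy example. This is not a reproduction of the paper's experiment: fixed_model, run, and the verifier are hypothetical stand-ins, and the only harness difference modeled here is whether a verification layer is present.

```python
from typing import Callable, Optional

Model = Callable[[str], str]
Verifier = Callable[[str], bool]


def fixed_model(prompt: str) -> str:
    # Stand-in for an unchanged foundation model: its first answer is
    # wrong, and it corrects itself only when the prompt asks for a retry.
    return "4" if "retry" in prompt else "5"


def run(
    model: Model,
    task: str,
    verifier: Optional[Verifier] = None,
    max_tries: int = 2,
) -> str:
    """Run the model once, then retry while a verifier rejects the answer."""
    prompt = task
    answer = model(prompt)
    for _ in range(max_tries - 1):
        if verifier is None or verifier(answer):
            break
        # Verification layer: feed the failure back and ask again.
        prompt = f"{task}\nretry: previous answer {answer!r} failed verification"
        answer = model(prompt)
    return answer


task = "What is 2 + 2?"
bare = run(fixed_model, task)                                   # no verification
verified = run(fixed_model, task, verifier=lambda a: a == "4")  # with verification
```

The same fixed_model yields a wrong answer under the bare harness and a correct one under the verifying harness, which is the sense in which the surrounding system, rather than the model, carries the reliability.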

Original note title

agent reliability comes from externalizing cognitive burdens into memory skills and protocols not from larger models — the harness is the unification layer