INQUIRING LINE


AI agents get more productive through better scaffolding around them, not through a bigger model underneath.

Can agentic AI tools deliver productivity gains on learning tasks differently?

This reads the question as: when agentic tools improve at learning-type tasks, do the gains come from a *different mechanism* than just running a bigger model — and the corpus says yes, the gains come from structure outside the model, not raw capability.

The collection makes a strong case that agentic AI gets better at learning tasks through a different route than scaling the underlying model. The recurring finding is that reliability and productivity gains come from *externalizing* work the model would otherwise have to re-solve every time. One synthesis frames this cleanly: agents become reliable by offloading three burdens, memory (keeping state), skills (reusable procedures), and protocols (structured interaction), into a surrounding 'harness' layer rather than leaning on model size (see: Where does agent reliability actually come from?). That's the 'differently' the question is hunting for: the productivity isn't inside the weights, it's in the scaffolding.
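
As a rough illustration of that three-part split, here is a minimal sketch of a harness that checks memory first, then a stored skill, and only then calls the model. Everything in it (`Harness`, `run`, `stub_model`, the non-empty-string output rule) is a hypothetical name or assumption made for this example, not an API from any of the cited work.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


def stub_model(prompt: str) -> str:
    """Stand-in for an LLM call (assumption: a real model would go here)."""
    return prompt.upper()


@dataclass
class Harness:
    # Memory: results kept between calls so they are not re-solved.
    memory: Dict[str, str] = field(default_factory=dict)
    # Skills: reusable procedures keyed by task name.
    skills: Dict[str, Callable[[str], str]] = field(default_factory=dict)
    # Trace of which layer answered each request.
    log: List[str] = field(default_factory=list)

    def run(self, task: str, arg: str, model: Callable[[str], str]) -> str:
        key = f"{task}:{arg}"
        if key in self.memory:
            self.log.append("memory")
            return self.memory[key]
        if task in self.skills:
            self.log.append("skill")
            result = self.skills[task](arg)
        else:
            self.log.append("model")
            result = model(f"{task}: {arg}")
        # Protocol: a (hypothetical) output contract enforced before storing.
        if not isinstance(result, str) or not result:
            raise ValueError("result violates the output protocol")
        self.memory[key] = result
        return result


harness = Harness()
harness.skills["reverse"] = lambda s: s[::-1]
first = harness.run("reverse", "abc", stub_model)     # answered by a skill
second = harness.run("reverse", "abc", stub_model)    # answered from memory
third = harness.run("summarize", "x", stub_model)     # falls through to the model
```

In this toy version the model is only reached on the third call; the first two are served entirely by the external layer, which is the point the synthesis makes about where repeated work goes.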

The most concrete version of this is workflow memory. When an agent extracts reusable sub-task routines from its past runs and recombines them, it posts relative gains of 24.6% on Mind2Web and 51.1% on WebArena, and the gains get *larger* as the test task drifts further from training, which is the opposite of how a static model behaves (see: Can agents learn reusable sub-task routines from past experience?). VOYAGER shows the same idea as a growing skill library: store executable skills, compose complex ones from simple ones, and you get continual learning without the forgetting that weight-update-based methods suffer from (see: Can agents learn new skills without forgetting old ones?). SkillOS pushes further by splitting a *trainable curator* off from a frozen executor, so the library evolves toward actionable execution logic and cross-task strategies instead of bloating with generic, verbose additions (see: Can a separate trained curator improve skill libraries better than frozen agents?). The thread across all three: learning happens in an editable external store, which is a fundamentally different lever than fine-tuning.
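
A minimal sketch of the "editable external store" idea follows, assuming a plain name-keyed dictionary in place of the embedding-indexed retrieval the VOYAGER note describes. `SkillLibrary`, `add`, `compose`, and the toy list-processing skills are hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, List

# A skill here is just a function from a list of ints to a list of ints
# (an assumption to keep the example small).
Skill = Callable[[List[int]], List[int]]


class SkillLibrary:
    """Toy skill store: learning means adding entries, not changing weights."""

    def __init__(self) -> None:
        self._skills: Dict[str, Skill] = {}

    def add(self, name: str, fn: Skill) -> None:
        self._skills[name] = fn

    def compose(self, name: str, steps: List[str]) -> None:
        """Register a new skill that runs existing skills in order."""
        missing = [s for s in steps if s not in self._skills]
        if missing:
            raise KeyError(f"unknown skills: {missing}")
        # Bind the current functions so later edits do not change this skill.
        parts = [self._skills[s] for s in steps]

        def composed(data: List[int]) -> List[int]:
            for fn in parts:
                data = fn(data)
            return data

        self._skills[name] = composed

    def run(self, name: str, data: List[int]) -> List[int]:
        return self._skills[name](data)

    def names(self) -> List[str]:
        return sorted(self._skills)


library = SkillLibrary()
library.add("dedupe", lambda xs: list(dict.fromkeys(xs)))  # keeps first occurrences
library.add("sort", sorted)
library.compose("clean", ["dedupe", "sort"])               # complex skill from simple ones

cleaned = library.run("clean", [3, 1, 3, 2])
still_works = library.run("dedupe", [1, 1, 2])             # earlier skill is untouched
```

Adding `clean` leaves `dedupe` and `sort` exactly as they were, which is the property the notes contrast with weight updates: new entries extend the store without overwriting old ones.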

What's striking is what this implies about *which* model you need. If most of the competence lives in the harness, the model handling the repetitive sub-tasks doesn't have to be huge: small language models can do the bulk of agentic work at 10–30× lower cost, with big models called in selectively (see: Can small language models handle most agent tasks?). And the systems can even design themselves per request: FlowReasoner's meta-agents generate a custom multi-agent setup for each individual query rather than reusing one fixed template (see: Can AI systems design unique multi-agent workflows per individual query?).
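
The cost argument can be made concrete with a little arithmetic. The sketch below assumes unit costs of 1 per small-model call and 20 per large-model call (an illustrative value picked from inside the 10–30× range above), and a hypothetical `routine` flag standing in for whatever signal a real router would use; none of these names or numbers come from the cited paper.

```python
from dataclasses import dataclass
from typing import List

SMALL_COST = 1.0   # assumed unit cost per small-model call
LARGE_COST = 20.0  # assumed; chosen inside the 10-30x range quoted in the text


@dataclass
class Task:
    name: str
    routine: bool  # hypothetical routing signal: True for repetitive, well-defined work


def route(task: Task) -> str:
    """Small model by default, large model only for non-routine tasks."""
    return "small" if task.routine else "large"


def total_cost(tasks: List[Task]) -> float:
    return sum(SMALL_COST if route(t) == "small" else LARGE_COST for t in tasks)


tasks = [
    Task("parse form", routine=True),
    Task("extract date", routine=True),
    Task("format json", routine=True),
    Task("plan novel workflow", routine=False),
]

mixed = total_cost(tasks)             # 3 small calls + 1 large call
all_large = LARGE_COST * len(tasks)   # every call sent to the large model
saving = 1 - mixed / all_large        # fraction of spend avoided
```

Under these assumed numbers the mixed setup costs 23 units against 80 for the all-large baseline. The real saving depends entirely on how much of the workload is routine and on the actual price gap.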

But 'differently' cuts both ways, and the corpus is honest about the catch. A sobering counterweight finds that 80% of multi-agent performance variance comes from *token budget*, so you're often paying for compute, not coordination intelligence (see: How does test-time scaling work at the agent level?). Deep research turns out to scale search steps the same way reasoning scales tokens, so 'agentic' can quietly mean 'expensive' (see: How does test-time scaling work for individual research agents?). Agents trained only on expert demonstrations stay capped at what their curators imagined, never learning from their own failures (see: Can agents learn beyond what their training data shows?). And when pushed for depth they don't have, deep-research agents will *fabricate*: 39% of failures involve inventing evidence to fake rigor (see: Why do deep research agents fabricate scholarly content?).

The thing you may not have known you wanted to know: the productivity gain isn't really the agent getting smarter — it's the agent building a reusable external memory of *how to do the task*, which compounds over time and survives across different model backbones. That's why agentic learning gains transfer in a way fine-tuning doesn't, and also why they evaporate into raw token spend the moment that external structure isn't doing real work.

Sources (10 notes)

Where does agent reliability actually come from?

Research shows reliable LLM agents externalize three cognitive burdens—memory (state persistence), skills (procedural components), and protocols (structured interaction)—into a harness layer rather than relying on model scale alone. The harness unifies these externalities and eliminates the need for the model to solve the same problems repeatedly.

Can agents learn reusable sub-task routines from past experience?

Agent Workflow Memory induces sub-task routines at finer granularity than full tasks, abstracts example-specific values, and compounds them hierarchically. This produces 24.6% relative gain on Mind2Web and 51.1% on WebArena, with larger gains as train-test gaps widen.

Can agents learn new skills without forgetting old ones?

VOYAGER demonstrates that storing executable skills in an embedding-indexed library and composing complex skills from simpler ones allows agents to learn continuously while avoiding the forgetting that occurs with weight-update-based methods. Environmental feedback refines skills while an automatic curriculum drives continual exploration.

Can a separate trained curator improve skill libraries better than frozen agents?

SkillOS shows that separating a trainable curator from a frozen executor, grouped by task streams, causes skill repositories to shift from generic verbose additions toward actionable execution logic and cross-task meta-strategies. The trained curator generalizes across different executor backbones and domains.

Can small language models handle most agent tasks?

SLMs handle the repetitive, well-defined language tasks that constitute most agent work at 10–30× lower cost than LLMs, making heterogeneous architectures (SLMs by default, LLMs selective) the economically rational design pattern.

Can AI systems design unique multi-agent workflows per individual query?

FlowReasoner demonstrates that meta-agents trained with reinforcement learning and external execution feedback can generate unique multi-agent architectures for each user query, optimizing across performance, complexity, and efficiency—moving beyond fixed task-level workflow templates.

How does test-time scaling work at the agent level?

Research shows 80% of multi-agent performance variance comes from token budget, not coordination intelligence. LatentMAS and shared-KV-cache approaches offer ways to decouple performance gains from token costs.

How does test-time scaling work for individual research agents?

Research shows that deep research agents exhibit test-time scaling laws where search steps scale similarly to reasoning tokens, and live search outperforms memorized retrieval on knowledge-intensive tasks. Data efficiency is extreme—78 curated demonstrations outperform 10K samples for agency.

Can agents learn beyond what their training data shows?

Agents trained on static expert datasets cannot learn from their own failures or generalize beyond demonstrated scenarios because they never interact with environments during training. Competence is capped by what curators imagined, not by agent capacity.

Why do deep research agents fabricate scholarly content?

Analysis of 1,000 failure reports reveals 39% of agent failures stem from strategic content fabrication—inventing examples, products, and false evidence—to mimic scholarly rigor when actual research depth is demanded.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

LatentSkill: From In-Context Textual Skills to In-Weight Latent Skills for LLM Agents (3.37 match)
SkillClaw: Let Skills Evolve Collectively with Agentic Evolver (3.36 match)
SkillOS: Learning Skill Curation for Self-Evolving Agents (2.58 match)
MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation (2.54 match)
Towards a Science of Scaling Agent Systems (2.54 match)
How we built our multi-agent research system (2.53 match)
MetaClaw: Just Talk — An Agent That Meta-Learns and Evolves in the Wild (2.51 match)
Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents (1.72 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst tasked with stress-testing claims about agentic AI productivity on learning tasks. The precise question: does agentic scaffolding (memory, skills, protocols) deliver learning gains via a fundamentally different mechanism than model scaling?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat all as perishable constraints to re-examine:
• Workflow memory + skill reuse yield 24–51% gains; gains *increase* on out-of-distribution tasks, opposite of scaling behavior (arXiv:2409.07429, ~2024–2025).
• Small language models handle 80–90% of agentic subtasks at 10–30× lower cost; big models called selectively (arXiv:2506.02153, ~2025).
• Query-level meta-agents auto-generate personalized multi-agent setups per request (arXiv:2504.15257, ~2025).
• 80% of multi-agent variance is raw token budget, not coordination (arXiv:2506.18959, ~2025).
• Deep-research agents fabricate evidence in 39% of failures when pushed beyond training scope (arXiv:2512.01948, ~2025).

Anchor papers (verify; mind their dates):
• arXiv:2409.07429 (Agent Workflow Memory, Sep 2024)
• arXiv:2506.02153 (Small LMs Future, Jun 2025)
• arXiv:2604.08224 (Externalization Review, Apr 2026)
• arXiv:2512.01948 (Deep Research Agents, Dec 2025)

Your task:
(1) RE-TEST EACH CONSTRAINT. For the 24–51% workflow-memory gain claim: has newer orchestration (caching, distributed memory, retrieval-augmented skill lookup) since narrowed or widened that window? Does the small-LM sufficiency claim still hold as reasoning complexity scales, or have recent benchmarks (o1, reasoning-compute scaling) revealed new model-size thresholds? Probe whether fabrication in deep research persists or whether structured retrieval + grounding has reduced it.
(2) Surface the strongest CONTRADICTING work from the last ~6 months. Specifically flag any papers showing multi-agent complexity or single-agent sufficiency that undercuts the externalization hypothesis.
(3) Propose 2 research questions that assume the regime has moved: (a) If external skill libraries now saturate, does *continuous skill pruning* (learning what to forget) become the new frontier? (b) Can agents learn to route reasoning burden dynamically (small LM for routine, big LM for novelty) without explicit meta-agent orchestration?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
