INQUIRING LINE

Can skill validation through testing prevent unreliable programs from accumulating?

This explores whether agents that build up reusable skill libraries can use testing — running each skill and keeping only what passes — to stop broken or unreliable code from piling up in the library over time.


The corpus suggests testing is a powerful filter but a leaky one: it catches only what it can actually run, and reliability is a broader property than passing a single test.

The optimistic case comes from agents that treat the environment as judge. VOYAGER stores executable skills in a searchable, embedding-indexed library and admits a skill only once environmental feedback confirms it works, composing complex behaviors from verified simpler ones. That lets it learn continuously without the catastrophic forgetting of weight-update methods (see "Can agents learn new skills without forgetting old ones?"). The Darwin Gödel Machine (DGM) pushes the same idea to self-improvement, swapping formal correctness proofs for empirical benchmarking and keeping an evolutionary archive of agent variants, which improved its SWE-bench score 2.5× (see "Can AI systems improve themselves through trial and error?"). In both, 'does it pass the test' substitutes for 'is it provably correct,' and it works well enough to compound.
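The admit-only-what-passes pattern described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not VOYAGER's or DGM's actual code: `SkillLibrary`, `admit`, and the toy skills are hypothetical names, and a list of input/output pairs stands in for environmental feedback.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical types for illustration; not VOYAGER's actual API.
Skill = Callable[[int], int]
TestCase = Tuple[int, int]  # (input, expected output)


@dataclass
class SkillLibrary:
    skills: Dict[str, Skill] = field(default_factory=dict)

    def admit(self, name: str, skill: Skill, tests: List[TestCase]) -> bool:
        """Store the skill only if it passes every test case."""
        for arg, expected in tests:
            try:
                if skill(arg) != expected:
                    return False
            except Exception:
                # A crash counts as a failed check, not as an unknown.
                return False
        self.skills[name] = skill
        return True


library = SkillLibrary()


def double(x: int) -> int:
    return x * 2


def broken_double(x: int) -> int:
    return x + 2  # agrees with double only at x == 2


admitted_good = library.admit("double", double, [(1, 2), (3, 6)])
admitted_bad = library.admit("broken_double", broken_double, [(1, 2), (3, 6)])


# Composition: build a new skill out of one that is already verified.
def quadruple(x: int) -> int:
    return library.skills["double"](library.skills["double"](x))


admitted_composed = library.admit("quadruple", quadruple, [(1, 4), (5, 20)])
```

Note that a weaker test set such as `[(2, 4)]` would have admitted `broken_double`, since the gate only sees the inputs it is given.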

But here's the thing the question doesn't anticipate: validation that merely *adds* what passes still accumulates clutter. SkillOS found that a frozen agent left to curate its own library drifts toward generic, verbose additions; passing tests isn't the same as being useful. Separating out a *trained* curator shifted the repository toward actionable execution logic and cross-task meta-strategies, and that curator generalized across different agent backbones (see "Can a separate trained curator improve skill libraries better than frozen agents?"). So preventing accumulation isn't a pure testing problem; it's a curation problem. Testing tells you what runs; something else has to decide what's worth keeping.
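To make the split between "what runs" and "what's worth keeping" concrete, here is a rough sketch that applies a test gate first and a separate curation rule second. It is an assumption-laden toy: SkillOS uses a trained curator, whereas this uses a hand-written heuristic, and `SkillRecord`, `times_reused`, `description_words`, and the thresholds are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SkillRecord:
    name: str
    passed_tests: bool
    times_reused: int       # hypothetical usage counter
    description_words: int  # hypothetical proxy for verbosity


def curate(records: List[SkillRecord],
           min_reuse: int = 2,
           max_words: int = 40) -> List[str]:
    """Return the names of skills that pass both the test gate and the curator."""
    kept = []
    for record in records:
        if not record.passed_tests:
            continue  # test gate: the skill does not run correctly
        if record.times_reused < min_reuse:
            continue  # curator: runs fine, but nothing ever reuses it
        if record.description_words > max_words:
            continue  # curator: too verbose to be an actionable skill
        kept.append(record.name)
    return kept


records = [
    SkillRecord("open_door", passed_tests=True, times_reused=5, description_words=12),
    SkillRecord("generic_helper", passed_tests=True, times_reused=0, description_words=120),
    SkillRecord("crashy", passed_tests=False, times_reused=9, description_words=10),
]
kept = curate(records)
```

Here `generic_helper` passes its tests and is still dropped, which is the case a test gate alone cannot express.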

The deeper crack is that passing a test doesn't certify reliability. A model run at zero temperature with a fixed seed reproduces the same output every time, but that output is still a single draw from a probability distribution; consistency is not reliability, as McDonald's omega testing across 100 repetitions makes visible (see "Does setting temperature to zero actually make LLM outputs reliable?"). A skill can pass once and fail under inputs the test never probed. Worse, models learn the *form* of correctness rather than the substance: illogical chain-of-thought exemplars match valid ones on BIG-Bench Hard, meaning a validator keyed to surface structure can be fooled (see "Does logical validity actually drive chain-of-thought gains?"). And evaluators themselves degrade: agentic evaluation cut judge shift from 31% to 0.27% relative to LLM-as-a-Judge, roughly a 100x reduction, but its own memory module cascaded errors, showing the validator needs error isolation or it becomes a source of the unreliability it's meant to catch (see "Can agents evaluate AI outputs more reliably than language models?").
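The gap between "passed once" and "reliable" can be shown with a small probe that runs a skill over many inputs and reports a conservative pass rate. This sketch swaps in a Wilson score lower bound on the pass rate; it is not McDonald's omega, and the toy skill `flaky_abs` and the input range are assumptions chosen for illustration.

```python
import math


def wilson_lower_bound(passes: int, trials: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a pass rate (z=1.96 is ~95%)."""
    if trials == 0:
        return 0.0
    p = passes / trials
    denom = 1 + z * z / trials
    centre = p + z * z / (2 * trials)
    margin = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials))
    return (centre - margin) / denom


def flaky_abs(x: int) -> int:
    # Hypothetical skill: correct for non-negative inputs only.
    return x


# A single happy-path test says the skill is fine.
single_test_passes = flaky_abs(3) == abs(3)

# Probing 100 inputs, half of them negative, tells a different story.
inputs = range(-50, 50)
passes = sum(flaky_abs(x) == abs(x) for x in inputs)
pass_rate_floor = wilson_lower_bound(passes, len(inputs))

# Even a perfect 100/100 run only supports a floor below 1.0.
perfect_floor = wilson_lower_bound(100, 100)
```

A library could admit a skill only when this floor clears a chosen threshold, which makes the amount of probing an explicit part of the gate.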

What the corpus quietly argues is that the most durable defense isn't pass/fail testing but treating failures as signal. Asymmetric trajectory filtering keeps clean successes *and* preserves diverse failures as negative training signal, letting a 14B model reach frontier math performance; errors aren't garbage to discard, they teach the boundary (see "Why do correct code trajectories teach models to tolerate errors?"). For code specifically, semi-formal reasoning can verify patch equivalence at 93% accuracy without ever executing the code, crossing the reliability bar RL rewards need, so 'testing' need not mean running the code (see "Can structured reasoning replace code execution for RL rewards?"). The honest answer: testing genuinely slows the accumulation of broken skills, but a library stays healthy only when validation is paired with active curation, treats consistency as distinct from reliability, and learns from what fails rather than silently dropping it.
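The asymmetry described above, strict about successes and permissive about failures, can be sketched as a batch filter. This is an illustrative approximation and not the GRPO-RoC procedure itself: the `Trajectory` fields, the half-and-half split, and the use of `tool_errors` as the quality signal are assumptions.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    answer_correct: bool
    tool_errors: int  # hypothetical quality signal: fewer is cleaner


def asymmetric_filter(trajectories: List[Trajectory],
                      keep: int,
                      seed: int = 0) -> List[Trajectory]:
    """Keep the cleanest successes, but sample failures without a quality filter."""
    rng = random.Random(seed)
    positives = sorted(
        (t for t in trajectories if t.answer_correct),
        key=lambda t: t.tool_errors,
    )
    negatives = [t for t in trajectories if not t.answer_correct]

    n_pos = min(len(positives), keep // 2)
    n_neg = min(len(negatives), keep - n_pos)

    # Successes are ranked by cleanliness; failures are drawn uniformly
    # so that varied failure modes survive as negative signal.
    return positives[:n_pos] + rng.sample(negatives, n_neg)


rollouts = [
    Trajectory(True, 0), Trajectory(True, 3), Trajectory(True, 1),
    Trajectory(False, 2), Trajectory(False, 0), Trajectory(False, 5),
]
batch = asymmetric_filter(rollouts, keep=4)
```

The correct-but-messy rollout with three tool errors is dropped, while failures are retained regardless of how messy they are.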


Sources 8 notes

Can agents learn new skills without forgetting old ones?

VOYAGER demonstrates that storing executable skills in an embedding-indexed library and composing complex skills from simpler ones allows agents to learn continuously while avoiding the forgetting that occurs with weight-update-based methods. Environmental feedback refines skills while an automatic curriculum drives continual exploration.

Can AI systems improve themselves through trial and error?

DGM replaces formal proofs with empirical benchmarking and maintains an evolutionary archive of agent variants, achieving 2.5× improvement on SWE-bench and 2.2× on Polyglot by discovering capabilities like better code editing and context management.

Can a separate trained curator improve skill libraries better than frozen agents?

SkillOS shows that separating a trainable curator from a frozen executor, grouped by task streams, causes skill repositories to shift from generic verbose additions toward actionable execution logic and cross-task meta-strategies. The trained curator generalizes across different executor backbones and domains.

Does setting temperature to zero actually make LLM outputs reliable?

Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.

Does logical validity actually drive chain-of-thought gains?

Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.

Can agents evaluate AI outputs more reliably than language models?

Eight-module agentic evaluation achieved 0.27% judge shift versus 31% for LLM-as-a-Judge on complex tasks. However, the memory module cascaded errors, revealing that agentic systems need error isolation mechanisms to maintain gains.

Why do correct code trajectories teach models to tolerate errors?

GRPO-RoC filters positive trajectories for quality while preserving diverse failures as negative signal, allowing a 14B model to reach frontier math performance in 510 RL steps and to surpass much larger models while producing cleaner reasoning.

Can structured reasoning replace code execution for RL rewards?

Semi-formal reasoning templates enable execution-free patch equivalence verification at 93% accuracy on real agent code, crossing the reliability threshold needed for RL reward signals. This makes execution-free verification viable for certain task classes like fault localization and code reasoning.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-examining whether skill validation through testing can prevent unreliable programs from accumulating in agent libraries. This question remains open despite recent progress in agentic reasoning and curation.

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat these as perishable constraints:
• Testing (execution feedback) genuinely filters broken skills and enables compositional learning without catastrophic forgetting, but passing a single test ≠ reliability; zero-temperature consistency across runs is NOT the same as robustness across input distributions (2024–2025).
• RL-trained curators outperform frozen test-gate-only approaches: passing a test doesn't certify worth; SkillOS showed trained curators shift repositories toward actionable logic vs. generic verbose additions (2026-05).
• Invalid chain-of-thought and logically incorrect reasoning match valid ones on benchmarks, so surface-structure validators can be fooled; form is learned, not substance (2023-07).
• Agentic evaluation cut judge error 100x vs. LLM-as-judge but error cascaded in memory modules—the validator itself becomes a source of unreliability (2024-12).
• Asymmetric trajectory filtering (preserving diverse failures as negative training signal, not discarding them) enables frontier reasoning; errors teach boundaries rather than being garbage (2026-03).

Anchor papers (verify; mind their dates):
• arXiv:2307.10573 (2023-07): Invalid Logic, Equivalent Gains — form vs. substance in reasoning
• arXiv:2605.06614 (2026-05): SkillOS — trained curation decoupled from frozen executor
• arXiv:2604.08377 (2026-04): SkillClaw — collective skill evolution
• arXiv:2603.01896 (2026-03): Agentic Code Reasoning — asymmetric filtering and error as signal

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer models, multi-agent orchestration, memory isolation (e.g., error sandboxing in judges), or formal-reasoning tools (execution-free code equivalence checking ~93% accuracy) have since RELAXED or OVERTURNED the testing bottleneck. Separate the durable question (can testing prevent accumulation?) from the perishable limitation (current testing methods fail on robustness, curation, or validator cascades). Name what has evolved.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months: Does any recent paper show testing ALONE (without trained curation or error preservation) can maintain library quality at scale? Or does the regime still require hybrid validation?
(3) Propose 2 research questions that ASSUME the validation regime may have shifted: (a) Can formal equivalence checking + asymmetric RL training substitute for empirical test-gate-only curation? (b) What minimal curator architecture (size, training signal, generalization scope) is needed to keep a skill library healthy when paired with imperfect testing?

Cite arXiv IDs; flag anything you cannot ground in a real paper.

Next inquiring lines