INQUIRING LINE

How can agents distinguish over-generalized lessons from genuinely useful long-tail knowledge?

This explores how an agent learning from its own experience can tell the difference between a lesson it should generalize broadly and a rare, situation-specific fact worth keeping intact — the corpus mostly attacks this as a question of *how much to compress* a stored memory.


This explores how an agent that learns from its own runs can tell an over-generalized lesson ("always retry on failure") from a genuinely useful rare fact ("this specific API returns a 200 even when it failed"). The corpus doesn't name the problem this way, but it circles the same territory through one recurring move: deciding *how much to abstract* a stored experience, and treating successes and failures asymmetrically.

The sharpest answer is the asymmetry itself. SkillRL keeps successful episodes as concrete, replayable demonstrations while distilling failures into abstracted lessons [Should successful and failed episodes be processed differently?]. That maps cleanly onto your question: long-tail knowledge lives in the concrete success (the exact sequence that happened to work), whereas the generalizable takeaway lives in the abstracted failure. Crucially, the paper found uniform consolidation (abstracting everything equally) *degrades* performance. Over-generalization isn't a hypothetical risk; it's the measured cost of compressing the wrong things.
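The asymmetry can be sketched as a small memory store. This is a minimal illustration, not SkillRL's actual implementation: the `Episode` and `Memory` types, the `diagnosis` field, and the `consolidate` method are all hypothetical names chosen for the example.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Episode:
    task: str
    steps: list[str]
    succeeded: bool
    diagnosis: str = ""  # hypothetical: short note on why a failure happened


@dataclass
class Memory:
    demonstrations: list[dict] = field(default_factory=list)  # concrete successes
    lessons: list[str] = field(default_factory=list)          # abstracted failures

    def consolidate(self, ep: Episode) -> None:
        if ep.succeeded:
            # Successes are stored verbatim so the exact working sequence survives.
            self.demonstrations.append({"task": ep.task, "steps": list(ep.steps)})
        else:
            # Failures keep only the takeaway; the step-by-step trace is dropped.
            lesson = ep.diagnosis or f"Unresolved failure on: {ep.task}"
            if lesson not in self.lessons:
                self.lessons.append(lesson)


mem = Memory()
mem.consolidate(Episode("charge card", ["POST /charge", "read body.status"], True))
mem.consolidate(Episode("charge card", ["POST /charge"], False,
                        "An HTTP 200 does not imply success; inspect the body."))
```

The long-tail fact (the exact step sequence) ends up in `demonstrations`, while `lessons` holds only the compressed failure takeaway.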

Reflexion sharpens the same edge from the failure side. Its agents write verbal self-diagnoses into episodic memory, and the finding is that keeping those reflections *uncompressed* preserves their usability; compress them and they stop helping [Can agents learn from failure without updating their weights?]. It also notes that the binary success/failure signal prevents the agent from rationalizing, which is exactly the trap behind over-generalization: a model that's free to narrate why it failed will invent a tidy, too-broad rule. Pair this with the observation that agent feedback splits into *evaluative* (how well it went) and *directive* (what to change) components that scalar rewards can't jointly capture [Can scalar rewards capture all the information in agent feedback?]. The directive part is the specific correction; flatten it into a generic "do better" and you've manufactured an over-generalized lesson.
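A toy sketch of why flattening loses the directive part, assuming feedback can be modeled as a score plus a free-text correction. The `Feedback` type and `flatten_to_scalar` function are hypothetical and do not come from any of the cited systems.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feedback:
    score: float      # evaluative: how well the action went
    correction: str   # directive: what specifically should change


def flatten_to_scalar(fb: Feedback) -> float:
    # A scalar reward keeps the evaluation and discards the correction.
    return fb.score


a = Feedback(0.0, "This API returns HTTP 200 on failure; check the body's error field.")
b = Feedback(0.0, "The request timed out; retry with backoff.")

# Both failures collapse to the same number, although they call for different fixes.
same_scalar = flatten_to_scalar(a) == flatten_to_scalar(b)
different_fix = a.correction != b.correction
```

Storing the `correction` text uncompressed alongside the score is one way to keep the specific lesson from dissolving into a generic one.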

The other half of the answer is *verification*: a lesson is only worth generalizing if it survives a check of the process that produced it, not just the outcome. Checking intermediate steps rather than final answers caught failures that outcome scoring missed entirely, lifting task success from 32% to 87% [Where do reasoning agents actually fail during long traces?]. An agent could use the same discipline on its own memory: a lesson that only ever "worked" because the final answer happened to be right is a prime over-generalization candidate. VOYAGER's skill library offers the complementary structure: store skills as discrete, executable, embedding-indexed units that get *refined by environmental feedback* rather than collapsed into one big policy [Can agents learn new skills without forgetting old ones?]. Keeping knowledge granular and individually addressable is itself a defense against over-generalizing.
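One way to apply that discipline to memory is to promote a lesson only when the supporting runs passed their intermediate checks, not merely when the final answer was right. The sketch below is an assumption-laden illustration: `Trace`, `process_verified`, `promotable`, and the `min_support` threshold are hypothetical, and the per-step checks are taken as given.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass
class Trace:
    step_checks: list[bool]  # result of each intermediate check (assumed available)
    outcome_ok: bool         # whether the final answer was correct


def process_verified(trace: Trace) -> bool:
    # Counts as support only if the outcome AND every intermediate step held up.
    return trace.outcome_ok and all(trace.step_checks)


def promotable(traces: list[Trace], min_support: int = 2) -> bool:
    # A lesson is generalized only with enough process-verified supporting runs.
    return sum(process_verified(t) for t in traces) >= min_support


lucky = [Trace([True, False, True], True), Trace([False, True], True)]
sound = [Trace([True, True], True), Trace([True, True, True], True)]
```

Here `lucky` would pass outcome-only scoring but fails `promotable`, which marks it as an over-generalization candidate; `sound` passes both.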

The quiet warning underneath all this: the failure mode is often *knowing when not to apply a lesson at all*. Reasoning models over-respond to ill-posed questions because training rewarded producing reasoning steps but never taught them to disengage [Why do reasoning models overthink ill-posed questions?]. An over-generalized lesson is the memory equivalent: a rule the agent fires everywhere because nothing ever taught it the boundaries of where the rule holds. So the corpus's combined prescription is less "build a better classifier" and more: keep successes concrete, abstract only failures, leave the specifics uncompressed, verify the process that earned the lesson, and explicitly learn the edges where it stops applying.
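Learning the edges can be made explicit by storing, with each lesson, the contexts where it held and where it failed, and abstaining everywhere else. This is a minimal sketch under the assumption that contexts can be labeled with short strings; `ScopedLesson` and its methods are hypothetical names, not an API from the cited work.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class ScopedLesson:
    advice: str
    held_in: set[str] = field(default_factory=set)    # contexts where it helped
    failed_in: set[str] = field(default_factory=set)  # contexts where it did not

    def record(self, context: str, helped: bool) -> None:
        if helped:
            self.held_in.add(context)
        else:
            self.failed_in.add(context)

    def should_apply(self, context: str) -> bool:
        # Unseen contexts default to "do not apply": the lesson stays disengaged
        # until there is evidence it holds there.
        return context in self.held_in and context not in self.failed_in


lesson = ScopedLesson("retry on failure")
lesson.record("timeout", helped=True)
lesson.record("auth_error", helped=False)
```

With this structure the rule fires for `"timeout"`, is suppressed for `"auth_error"`, and is withheld for any context it has never been tested in.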


Sources (6 notes)

Should successful and failed episodes be processed differently?

SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.

Can agents learn from failure without updating their weights?

Reflexion demonstrates that unambiguous environmental feedback (success/failure) enables agents to write useful self-diagnoses and improve across episodes without parameter updates. The binary signal prevents rationalization, and keeping reflections uncompressed preserves their usability.

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Where do reasoning agents actually fail during long traces?

Reliability for long-trace reasoning comes from checking intermediate states and policy compliance during generation, not from scoring final outputs. Adding intermediate verification raised task success from 32% to 87% because most failures are process violations, not wrong answers.

Can agents learn new skills without forgetting old ones?

VOYAGER demonstrates that storing executable skills in an embedding-indexed library and composing complex skills from simpler ones allows agents to learn continuously while avoiding the forgetting that occurs with weight-update-based methods. Environmental feedback refines skills while an automatic curriculum drives continual exploration.

Why do reasoning models overthink ill-posed questions?

Reasoning models generate redundant, lengthy responses to questions with missing premises while non-reasoning models correctly identify them as unanswerable. Training optimizes for producing reasoning steps but never teaches models when to disengage.
