SYNTHESIS NOTE

Does learning from mistakes improve in-context learning?

Explores whether inducing models to make errors on few-shot examples, then having them articulate principles from those mistakes, leads to better performance than learning from correct examples alone.

Synthesis note · 2026-06-03 · sourced from Prompts Prompting

Standard few-shot in-context learning learns only from correct input-output pairs. LEAP (Learning Principles) revisits that assumption. Given the same few examples, it (1) intentionally induces the model to make mistakes on them, (2) has the model reflect on those mistakes and articulate explicit, task-specific principles that help avoid common errors, with no human supervision, then (3) answers the test question using the original few-shot examples plus the learned principles. It uses exactly the same number of labeled examples as standard few-shot prompting, yet improves strong models (GPT-3.5/4/4-turbo, Claude-2.1, Gemini Pro) across DROP, HotpotQA, GSM8K, MATH, and Big-Bench Hard (e.g., +7.5% on DROP with GPT-4).
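The three steps above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the LEAP authors' implementation: `llm` is a hypothetical callable (prompt in, completion out), mistakes are detected by exact string match against the gold answer, the prompt wording is invented for illustration, and placing the principles before the exemplars is an arbitrary choice.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]             # (question, gold answer)
Mistake = Tuple[str, str, str]        # (question, gold answer, wrong attempt)
LLM = Callable[[str], str]            # hypothetical interface: prompt -> completion


def induce_mistakes(llm: LLM, examples: List[Example], samples: int = 3) -> List[Mistake]:
    """Step 1: re-answer each few-shot question and keep attempts that miss the gold answer."""
    mistakes = []
    for question, gold in examples:
        for _ in range(samples):
            attempt = llm(f"Q: {question}\nA:").strip()
            if attempt != gold:       # assumption: exact match decides correctness
                mistakes.append((question, gold, attempt))
                break                 # one mistake per example is enough for this sketch
    return mistakes


def learn_principles(llm: LLM, mistakes: List[Mistake]) -> List[str]:
    """Step 2: ask the model to explain each mistake and state a principle that avoids it."""
    principles = []
    for question, gold, attempt in mistakes:
        prompt = (
            f"Question: {question}\n"
            f"Wrong answer: {attempt}\n"
            f"Correct answer: {gold}\n"
            "Explain the mistake and state one general principle that avoids it."
        )
        principles.append(llm(prompt).strip())
    return principles


def build_prompt(examples: List[Example], principles: List[str], test_question: str) -> str:
    """Step 3: combine learned principles, the original exemplars, and the test question."""
    parts = []
    if principles:
        parts.append("Principles:\n" + "\n".join(f"- {p}" for p in principles))
    parts.extend(f"Q: {q}\nA: {a}" for q, a in examples)
    parts.append(f"Q: {test_question}\nA:")
    return "\n\n".join(parts)
```

The labeled examples are reused in every step, so the sketch consumes no more supervision than a plain few-shot prompt built from the same `examples` list. If the model makes no mistakes, `build_prompt` falls back to an ordinary few-shot prompt.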

The keeper is a generative-learning principle at the prompt level: the model extracts more usable structure from examples by erring and explaining the error than by imitating correct answers. Negative experience, articulated, transfers better than positive examples alone — within a single inference-time prompt, no fine-tuning.

This is the in-context, self-supervised cousin of learning-from-mistakes at training time. It rhymes with "Can reconstructing expert thinking improve reasoning transfer?" (articulating the latent process behind surface examples) and with "Can confidence trajectories reveal when reasoning goes wrong?" in deriving a usable training/prompting signal from the model's own errors rather than external labels.

Inquiring lines that read this note 7

This note is a source for these research framings, grouped by the broader line of inquiry each explores. Scan the bold lines of inquiry; follow any specific question forward.

Can prompting inject entirely new knowledge into language models?

Do few-shot examples improve in-context learning or add noise?

How do training data properties shape reasoning capability development?

What makes a good in-context learning example for a given task?

How do training priors constrain what context information can override?

Why does negative experience transfer better than positive examples alone?

Why do continual learning scenarios trigger catastrophic forgetting and interference?

Can alternative training methods improve on supervised fine-tuning for language models?

Can we reverse the instruction-following deficit through targeted training?

Does reinforcement learning teach reasoning or just when to reason?

Can reinforcement learning improve how accurately models explain themselves?

Related concepts in this collection 3

Can reconstructing expert thinking improve reasoning transfer? Expert texts show only the final result of complex thinking. Can we reverse-engineer those hidden thought processes and use them to train models that reason better across different domains?
both articulate the latent structure behind surface examples; LEAP does it in-context from induced mistakes
Can confidence trajectories reveal when reasoning goes wrong? Does the timing of when a model commits to an answer predict whether its reasoning will be flawed? And can we use this signal to train better reasoning without expensive annotations?
both derive a usable signal from the model's own errors rather than external labels
Why do chain-of-thought examples fail across different conditions? Chain-of-thought exemplars show surprising sensitivity to order, complexity level, diversity, and annotator style. Understanding these brittleness dimensions could reveal what makes reasoning prompts robust or fragile.
LEAP adds learned principles atop exemplars, a lever beyond exemplar selection
