INQUIRING LINE

What makes a good in-context learning example for a given task?

This reads the question as 'what actually makes a demonstration useful' — not just picking examples that look similar to the task, but what property of an example does the teaching work.


This explores what makes an in-context example earn its place in the prompt, and the corpus pushes back hard on the obvious answer, which is 'pick examples that resemble the test case.' Several lines suggest similarity is a weak proxy for usefulness, and that the better question is what an example *demonstrates*. The most counterintuitive finding: an example's main job may be teaching the *shape of the answer* rather than the content. Models trained on semantically empty or even deliberately wrong instructions perform almost as well as those trained on correct ones; what transfers is knowledge of the output space, not task understanding (Does instruction tuning teach task understanding or output format?). By that logic, a good example is one that vividly shows the model what a valid answer looks like, format and all.
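One way to read "show the shape of the answer" in practice is to render every demonstration through a single template and to check each label against the allowed output space before it enters the prompt. The sketch below assumes a hypothetical sentiment task; `LABELS`, `Demo`, and `build_prompt` are illustrative names and do not come from any of the cited work.

```python
from dataclasses import dataclass

# Hypothetical output space for an illustrative sentiment task.
LABELS = ("positive", "negative", "neutral")


@dataclass(frozen=True)
class Demo:
    text: str
    label: str


def build_prompt(demos, query, labels=LABELS):
    """Render all demos and the query through one shared template."""
    for d in demos:
        # A demo whose label is outside the output space teaches the wrong shape.
        if d.label not in labels:
            raise ValueError(f"label {d.label!r} is outside the output space")
    header = "Answer with exactly one of: " + ", ".join(labels)
    blocks = [f"Review: {d.text}\nSentiment: {d.label}" for d in demos]
    # The query ends on the same field name, left open for the model to fill.
    blocks.append(f"Review: {query}\nSentiment:")
    return header + "\n\n" + "\n\n".join(blocks)


demos = [
    Demo("Great battery life.", "positive"),
    Demo("Stopped working after a week.", "negative"),
    Demo("It arrived on Tuesday.", "neutral"),
]
prompt = build_prompt(demos, "The screen is too dim.")
print(prompt)
```

The demos here cover every label once, so the prompt displays the full output space rather than only the labels that happen to sit near the query.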

If you do want to select examples deliberately, the corpus argues against grabbing the nearest neighbors. Framing demonstration choice as a budgeted experiment, in which you pick the examples that most reduce uncertainty across the whole test set, beats similarity-based retrieval across small, medium, and large models (Can optimal experimental design improve few-shot example selection?). The good example isn't the one closest to your query; it's the one that resolves the most ambiguity about the task as a whole. And sometimes a single, well-chosen example is enough to switch latent capability on: in RLVR training (a weight-update setting rather than in-context prompting), one training instance lifted math accuracy from 36% to 73.6%, suggesting examples can act as activation signals rather than lessons (Can a single training example unlock mathematical reasoning?).
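The contrast between nearest-neighbor retrieval and test-set-wide selection can be made concrete with a toy sketch. It is not the GO or SAL algorithm from the cited note: it substitutes a simple greedy coverage objective (each test item is "covered" by its most similar chosen example) as a stand-in for uncertainty reduction, and the two-dimensional embeddings are invented for illustration.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def nearest_neighbors(pool, query, k):
    """Baseline: indices of the k pool items most similar to one query."""
    ranked = sorted(range(len(pool)), key=lambda i: cosine(pool[i], query), reverse=True)
    return ranked[:k]


def greedy_coverage(pool, test_set, k):
    """Greedily pick k pool items that together best cover the whole test set.

    Assumption: coverage is a proxy for uncertainty reduction, not the
    objective used by the cited work.
    """
    chosen = []
    best = [0.0] * len(test_set)  # best similarity so far, per test item
    for _ in range(min(k, len(pool))):
        def gain(i):
            # Total improvement in coverage if pool item i were added.
            return sum(max(cosine(pool[i], t) - b, 0.0) for t, b in zip(test_set, best))

        candidates = [i for i in range(len(pool)) if i not in chosen]
        pick = max(candidates, key=gain)
        chosen.append(pick)
        best = [max(b, cosine(pool[pick], t)) for t, b in zip(test_set, best)]
    return chosen


# Toy embeddings: two near-duplicates in one cluster, one item in another.
pool = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0)]
test_set = [(1.0, 0.1), (0.1, 1.0)]

nn = nearest_neighbors(pool, test_set[0], 2)
cov = greedy_coverage(pool, test_set, 2)
print(nn, cov)
```

On this toy data the nearest-neighbor baseline spends both slots on the cluster around the first query, while the coverage objective spends its second slot on the other cluster, which is the behavior the paragraph above argues for.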

The richest thread reframes the unit of a 'good example' entirely. A clean correct demonstration may teach less than one that surfaces the *principle* behind the answer. LEAP shows that models improve when they are induced to err on the few-shot examples and then reflect on those mistakes to derive explicit task rules, so error-revealing examples beat tidy ones (Does learning from mistakes improve in-context learning?). Relatedly, extracting natural-language skills from context, that is, turning examples into reusable rules, lifts frozen models without any weight updates (Can frozen models learn better by extracting context into skills?). For sequential or multi-step tasks, the unit isn't even a single example: in-context learning of decision-making requires full or partial *trajectories* from the same setting, not isolated input-output pairs (Why do trajectories matter more than individual examples for in-context learning?).
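The err, reflect, derive loop attributed to LEAP above can be sketched as follows. This is a loose illustration under stated assumptions, not the paper's prompts or procedure: `llm` is a hypothetical callable that maps a prompt string to a completion string, mistakes are detected by exact string match, and `fake_llm` is a stub that exists only so the sketch runs.

```python
from typing import Callable, List, Tuple


def derive_principles(
    llm: Callable[[str], str],        # hypothetical: prompt in, completion out
    examples: List[Tuple[str, str]],  # (question, gold answer) few-shot pairs
) -> List[str]:
    """Collect one principle per few-shot example the model gets wrong."""
    principles = []
    for question, gold in examples:
        attempt = llm(f"Question: {question}\nAnswer:").strip()
        if attempt == gold:  # simplification: exact-match check
            continue         # no mistake to learn from here
        reflection = llm(
            f"Question: {question}\n"
            f"Your answer: {attempt}\n"
            f"Correct answer: {gold}\n"
            "State one general principle that would have prevented this mistake."
        ).strip()
        principles.append(reflection)
    return principles


def build_prompt(principles, examples, query):
    """Place the derived principles ahead of the ordinary demonstrations."""
    rules = "\n".join(f"- {p}" for p in principles)
    demos = "\n\n".join(f"Question: {q}\nAnswer: {a}" for q, a in examples)
    return f"Principles:\n{rules}\n\n{demos}\n\nQuestion: {query}\nAnswer:"


def fake_llm(prompt: str) -> str:
    # Stub for illustration only: right on "2+2", wrong elsewhere.
    if "State one general principle" in prompt:
        return "Carry the one."
    return "4" if "2+2" in prompt else "0"


examples = [("2+2", "4"), ("19+3", "22")]
principles = derive_principles(fake_llm, examples)
final_prompt = build_prompt(principles, examples, "28+5")
print(final_prompt)
```

Only the example the stub answers incorrectly contributes a principle, which mirrors the claim that the error-revealing example carries the extra teaching signal.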

Two hard ceilings bound all of this. No example can inject knowledge the model never had; prompting only reorganizes and activates what's already in the training distribution (Can prompt optimization teach models knowledge they lack?). And even a well-chosen example can be ignored: when a model's prior training associations are strong enough, in-context information loses, and textual prompting alone can't override it (Why do language models ignore information in their context?). So the honest answer is layered: a good example shows the output shape, reduces task-wide uncertainty rather than just matching the query, ideally exposes a principle (often through a corrected mistake), and supplies a full trajectory when the task is sequential. None of it works against a knowledge gap or an overpowering prior.


Sources (8 notes)

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions achieve performance comparable to those trained on full, correct instructions. A random baseline that knows only the output space reaches 42.6%, against 43% for instruction tuning. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Can optimal experimental design improve few-shot example selection?

AIPD frames demonstration selection as budgeted active learning, choosing examples that maximally reduce test-set uncertainty. Two algorithms (GO and SAL) outperformed similarity-based methods across small, medium, and large language models.

Can a single training example unlock mathematical reasoning?

A single example in RLVR boosts math performance from 36% to 73.6% and enables test accuracy to improve for 1,400 steps after training accuracy reaches 100%, revealing that minimal activation signals unlock latent reasoning capability.

Does learning from mistakes improve in-context learning?

LEAP demonstrates that models achieve better performance on reasoning and math tasks by intentionally erring on few-shot examples, reflecting on mistakes, and deriving explicit task-specific principles—without additional labeled data or fine-tuning.

Can frozen models learn better by extracting context into skills?

Extracting natural-language rules from context into reusable skills improves frozen model reasoning without weight updates. On CL-bench, this lifts GPT-4.1 from 11.1% to 16.5%, with skills transferable across model backbones.

Why do trajectories matter more than individual examples for in-context learning?

In-context learning for sequential decision-making requires full or partial trajectories from the same environment level, not just isolated examples. This structural property—trajectory burstiness—allows models to generalize across vastly different tasks without weight updates.

Can prompt optimization teach models knowledge they lack?

Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM researcher evaluating what makes a good in-context learning example — a question that remains open despite recent progress. A curated library (2022–2025) has surfaced several dated claims; your job is to stress-test them against the latest work.

What a curated library found — and when (dated claims, not current truth):
• Similarity to the test case is a weak proxy; examples teaching output *shape* and task-wide uncertainty reduction matter more (2023–2024).
• Instruction-tuning transfers output-format distribution, not deep task understanding; semantically empty instructions perform nearly as well as correct ones (~2023).
• Active experimental design (choosing examples that resolve ambiguity across the test set) beats similarity-based retrieval across model scales (2024).
• A single well-chosen example can activate latent capability: one instance lifted math accuracy from 36% to 73.6% (2025).
• Error-revealing examples and explicit principle extraction outperform clean demonstrations; full or partial trajectories, not isolated pairs, are needed for sequential tasks (2024–2025).
• Hard ceilings: prompting cannot inject new knowledge (only activates training distribution), and strong priors can override in-context information (2023–2024).

Anchor papers (verify; mind their dates):
• arXiv:2305.11383 (2023) — instruction tuning teaches format, not understanding
• arXiv:2404.08846 (2024) — active experimental design for example selection
• arXiv:2402.05403 (2024) — learning from mistakes (LEAP)
• arXiv:2504.20571 (2025) — one training example activates reasoning

Your task:
(1) RE-TEST EACH CONSTRAINT. For every claim above, determine whether newer models (GPT-4o, o1, Claude 3.5), training methods (RL, distillation), evaluation harnesses, or multi-agent orchestration have since relaxed or overturned it. Separate the durable question (what makes an example *useful*?) from perishable limitations (e.g., prior dominance, format-only learning). Cite what resolved each constraint; flag what still holds.
(2) Surface the strongest contradicting or superseding work from the last ~6 months — especially on positional bias, multi-task superposition, or knowledge injection.
(3) Propose 2 research questions that assume the selection regime may have shifted: e.g., do larger or instruction-aligned models now *learn task semantics* from examples rather than just output format? Can trajectory-based in-context learning scale to RL beyond sequential decision-making?

Cite arXiv IDs; flag anything you cannot ground in a real paper.

Next inquiring lines