INQUIRING LINE

Where you place examples in a prompt turns out to be a real performance lever — identical content, different position, up to 20% accuracy swing.

Can demo placement be tuned as a task-specific hyperparameter?

This explores whether *where* you place demonstrations in a prompt — their position and ordering — is something you can deliberately tune per task, the way you'd tune a learning rate, rather than an incidental detail.

The corpus says yes, and more strongly than you'd expect: placement isn't a minor formatting choice but a lever with measurable, sometimes dramatic effects. The most direct evidence is that moving an *identical* block of demonstrations from the start of a prompt to the end can swing in-context-learning accuracy by up to 20% and flip nearly half the model's predictions (see: How much does demo position alone affect in-context learning accuracy?). The content didn't change, only the position. That's the signature of a real hyperparameter: same input, different result depending on a setting you control.
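To make the position effect concrete, here is a minimal sketch that builds two prompts from the same demo block, one with the demos at the start and one at the end, and then measures the accuracy change and the share of flipped predictions between two runs. The function names, the two position labels, and the prompt layout are assumptions for illustration, not the setup of the cited study; the prediction lists would come from your own model calls.

```python
from typing import Sequence


def build_prompt(instruction: str, demos: Sequence[str], query: str, position: str) -> str:
    """Assemble a prompt with the demo block at the 'start' or the 'end'.

    The layout is a hypothetical one chosen for illustration.
    """
    demo_block = "\n".join(demos)
    if position == "start":
        parts = [demo_block, instruction, query]
    elif position == "end":
        parts = [instruction, query, demo_block]
    else:
        raise ValueError(f"unknown position: {position}")
    return "\n\n".join(parts)


def accuracy(preds: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of predictions that match the gold labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def flip_rate(preds_a: Sequence[str], preds_b: Sequence[str]) -> float:
    """Fraction of examples whose prediction differs between two placements."""
    return sum(a != b for a, b in zip(preds_a, preds_b)) / len(preds_a)
```

Reporting the flip rate next to the accuracy difference is useful because two placements can reach similar accuracy while disagreeing on many individual examples.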

But placement isn't one knob, it's two. Position (where the demos sit) is distinct from *order* (the sequence within the demo block), and the corpus shows order is tunable too, without hand-labeled difficulty ratings. Sparsity-Guided Curriculum In-Context Learning uses the model's own last-layer activation sparsity to rank demonstrations from sparse (harder) to dense (easier), then arranges them in that curriculum, yielding solid gains across diverse tasks with no external difficulty labels (see: Can representation sparsity order few-shot demonstrations effectively?). So the model's internal signal can pick the ordering automatically, which is what 'tune it as a hyperparameter' should mean in practice: a setting you can search over or derive, not guess.
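The ordering idea can be sketched as follows, assuming you have already extracted one last-layer activation vector per demonstration. The sparsity measure used here (the fraction of entries with magnitude below a threshold `eps`) and both function names are assumptions made for illustration; the cited work may define sparsity differently.

```python
from typing import Dict, List, Sequence


def activation_sparsity(activations: Sequence[float], eps: float = 1e-3) -> float:
    """Fraction of entries whose magnitude is below eps (an assumed sparsity measure)."""
    if not activations:
        raise ValueError("empty activation vector")
    return sum(abs(a) < eps for a in activations) / len(activations)


def order_by_sparsity(
    demo_activations: Dict[str, Sequence[float]], eps: float = 1e-3
) -> List[str]:
    """Return demo ids from sparsest (treated as harder) to densest (treated as easier)."""
    return sorted(
        demo_activations,
        key=lambda demo_id: activation_sparsity(demo_activations[demo_id], eps),
        reverse=True,
    )
```

The returned list of ids can then be used to lay out the demo block in that order, with no external difficulty labels involved.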

Here's the part that answers the 'task-specific' half of your question. The evidence is indirect, coming from training order and preference tuning rather than demo placement itself, but the corpus repeatedly finds that the *right* ordering depends on the task type, so a single fixed placement policy is unlikely to be optimal everywhere. Omni-Thinker shows that training structured tasks before creative ones (a sequencing choice at the data level) prevents entropy collapse and beats joint training by 6.2%, and the benefit comes precisely from matching the schedule to how each domain's entropy behaves (see: Does training order reshape how models handle different task types?). Preference tuning tells the same story from another angle: the same intervention reduces diversity in code but *increases* it in creative writing, because each domain rewards different things (see: Does preference tuning always reduce diversity the same way?). The lesson that carries over to demo placement: ordering effects are domain-dependent, so the optimal setting is likely task-specific, which is the whole premise of treating it as a per-task hyperparameter rather than a universal default.
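In practice, treating placement as a per-task hyperparameter amounts to picking the best placement for each task on a held-out dev set and comparing that against one global policy. The sketch below shows the selection step; the function names are hypothetical, and the scores in `TOY_DEV_SCORES` are made-up placeholder numbers, not results from any paper.

```python
from typing import Dict

DevScores = Dict[str, Dict[str, float]]  # task -> placement -> dev accuracy

# Made-up placeholder numbers, for illustration only.
TOY_DEV_SCORES: DevScores = {
    "classification": {"start": 0.70, "end": 0.62},
    "summarization": {"start": 0.41, "end": 0.55},
}


def best_per_task(dev_scores: DevScores) -> Dict[str, str]:
    """Pick the highest-scoring placement separately for each task."""
    return {task: max(scores, key=scores.get) for task, scores in dev_scores.items()}


def best_global(dev_scores: DevScores) -> str:
    """Pick the single placement with the highest mean score across all tasks."""
    shared = set.intersection(*(set(scores) for scores in dev_scores.values()))
    return max(
        sorted(shared),
        key=lambda p: sum(s[p] for s in dev_scores.values()) / len(dev_scores),
    )


def mean_score(dev_scores: DevScores, policy: Dict[str, str]) -> float:
    """Mean dev accuracy when each task uses the placement named in `policy`."""
    return sum(dev_scores[task][policy[task]] for task in dev_scores) / len(dev_scores)
```

With scores measured on your own tasks, the gap between `mean_score` under the per-task policy and under the global policy indicates how much a universal default would cost.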

There's a deeper, slightly unsettling reason placement matters so much: a lot of what demonstrations 'teach' may be format and output space, not task understanding. Models trained on semantically empty or even deliberately wrong instructions perform almost identically to those given correct ones; what transfers is knowledge of the output space, not the meaning (see: Does instruction tuning teach task understanding or output format?). If demos work largely by anchoring format and steering the model toward a region of output space, then *where and in what order* you place them (what the model sees last, what primes it first) is doing real mechanical work, which would explain why position can flip nearly half the predictions.

One more connection worth knowing: placement tuning rhymes with a broader pattern in the corpus of treating *structure*, rather than weights, as the tunable thing. Self-adaptive models compose task-specific expert vectors at inference time (see: Can models dynamically activate expert skills at inference time?), and multi-task systems get isolated, task-specific parameter regions (see: Can isolating task-specific parameters prevent multi-task fine-tuning interference?). Demo placement is the cheapest member of that family (no training, no weight surgery, just rearranging the prompt), yet it rests on the same principle: per-task configuration, applied at inference, with effects large enough to take seriously.

Sources 7 notes

How much does demo position alone affect in-context learning accuracy?

Repositioning an identical demo block from prompt start to end swings accuracy by up to 20% and flips nearly half of predictions. This spatial effect operates independently of demo content and spans multiple task types.

Can representation sparsity order few-shot demonstrations effectively?

Sparsity-Guided Curriculum In-Context Learning uses last-layer activation sparsity to order demonstrations from sparse (harder) to dense (easier), yielding considerable performance improvements. This approach requires no external difficulty labels and works across diverse in-context learning tasks.

Does training order reshape how models handle different task types?

Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. Backward-transfer (BWT) guided scheduling, which trains structured tasks first, yields a 6.2% gain over joint training by preventing entropy collapse from damaging open-ended capabilities.

Does preference tuning always reduce diversity the same way?

RLHF reduces lexical-syntactic diversity in code generation but increases it in creative writing. The direction depends on what each domain incentivizes: code rewards convergence toward correct solutions, while creative writing rewards stylistic distinctiveness.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions achieve performance comparable to those trained on full, correct instructions, and in a low-resource setting instruction tuning reaches 43% exact match against 42.6% for a random baseline. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Can models dynamically activate expert skills at inference time?

Transformer² demonstrates that tuning only singular values within weight matrices produces composable expert vectors that dynamically mix at inference without interference, outperforming LoRA with fewer parameters and enabling continual specialization.

Can isolating task-specific parameters prevent multi-task fine-tuning interference?

Research shows that identifying core parameter regions per task, clustering overlapping tasks, and freezing core parameters while geometrically merging non-core parameters consistently outperforms standard multi-task fine-tuning. Temporal task scheduling alone proves insufficient without explicit structural parameter isolation.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Not All Parameters Are Created Equal: Smart Isolation Boosts Fine-Tuning Performance (1.70 match)
Exploring Format Consistency for Instruction Tuning (1.69 match)
A Survey on Post-training of Large Language Models (1.65 match)
Where to show Demos in Your Prompt: A Positional Bias of In-Context Learning (1.64 match)
LESS: Selecting Influential Data for Targeted Instruction Tuning (1.64 match)
Do Models Really Learn to Follow Instructions? An Empirical Study of Instruction Tuning (0.89 match)
Evaluating the Diversity and Quality of LLM Generated Content (0.88 match)
Transformer²: Self-adaptive LLMs (0.88 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing whether demo placement—position and ordering in prompts—can be reliably tuned as a task-specific hyperparameter for in-context learning. The question remains open; treat the findings below as dated claims.

What a curated library found — and when (findings span 2023–2026; these are perishable claims):
• Moving an identical demo block from prompt start to end shifts accuracy by up to 20% and flips ~50% of predictions, showing position is a real tunable parameter (2025-07, arXiv:2507.22887).
• Demo *order* can be tuned automatically using the model's own last-layer activation sparsity to rank demonstrations by difficulty, yielding gains across tasks without hand-labeled difficulty ratings (curator's synthesis from sparsity-curriculum work).
• Ordering effects are task-dependent: in multi-task RL training, scheduling structured tasks before creative ones gains 6.2% over joint training, which suggests by analogy that the right demo placement policy varies by task type (2025-07, arXiv:2507.14783).
• Demonstrations may teach primarily output-space format and distribution, not task semantics; models given semantically empty or wrong instructions perform almost identically to those given correct ones (2023-05, arXiv:2305.11383).

Anchor papers (verify; mind their dates):
• arXiv:2507.22887 (2025-07): Where to show Demos in Your Prompt: A Positional Bias of In-Context Learning
• arXiv:2507.14783 (2025-07): Omni-Thinker: Scaling Multi-Task RL in LLMs with Hybrid Reward and Task Scheduling
• arXiv:2305.11383 (2023-05): Do Models Really Learn to Follow Instructions? An Empirical Study of Instruction Tuning
• arXiv:2501.06252 (2025-01): Transformer²: Self-adaptive LLMs

Your task:
(1) RE-TEST EACH CONSTRAINT. For placement position: has model scaling (>100B params), training methods (e.g., instruction-following improvements post-2025-07), or new evaluation harnesses (multi-shot benchmarks, domain-specific suites) since relaxed or overturned the 20% swing? For order-tuning: do newer sparsity-based or learned ranking methods (e.g., from preference-tuning systems, 2025-06+) outperform curriculum-from-activation? For task-specificity: has any work shown a *universal* placement policy that works across all task types, or do recent models preserve task-dependence? Plainly name what still holds and where constraints may have dissolved.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months (roughly 2026-01 onward) that argues placement effects are *not* tunable, or that position/order matter less than previously thought, or that a single policy generalizes better than claimed.
(3) Propose 2 research questions that ASSUME the regime may have shifted:
   – Given that demos primarily teach format/output-space (not semantics), can placement be jointly optimized with in-context *instruction clarity* to decouple format-steering from task-understanding?
   – In multi-agent or chain-of-thought prompts, does demo placement interact with intermediate reasoning steps, and can you tune placement *per-step* rather than globally?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
