INQUIRING LINE

Does showing an AI a few examples before a creative task quietly shrink the variety of what it produces?

Can few-shot examples narrow generative diversity in creative tasks?

This explores whether giving an LLM a handful of sample outputs (few-shot prompting, or in-context learning) anchors it to those samples and quietly narrows the range of what it generates on open-ended creative work. The honest starting point: the collection has rich material on *why* generative diversity collapses, but only thin coverage of few-shot examples as the specific lever. The corpus answers the broader 'what collapses diversity' question more directly than the few-shot-specific one, so the strongest answer comes from triangulating the surrounding mechanisms rather than from one paper that names the effect.

The most useful surprise is that 'examples narrow diversity' isn't a clean yes. Diversity effects flip by domain. Preference tuning *reduces* lexical and syntactic variety in code generation but *increases* it in creative writing, because code rewards converging on one correct answer while creative writing rewards standing out (see "Does preference tuning always reduce diversity the same way?"). By that logic, a few-shot example that would tightly anchor a coding task might do something different in a creative one: examples there can demonstrate that distinctiveness is the goal rather than fence the model into a single mold. So the answer depends on what your examples implicitly signal the task is *for*.

The deeper risk the corpus does document is that models converge on their own even without your examples. Across 70+ models and 26K open-ended queries, different LLMs independently produce strikingly similar outputs, an 'Artificial Hivemind' driven by overlapping training data and shared alignment procedures (see "Do different AI models actually produce diverse outputs?"). Larger models also concentrate probability mass on their preferred outputs, so within a fixed budget they generate fewer distinct samples than much smaller ones of around 500M parameters (see "Why aren't bigger models better for generating diverse outputs?"). Few-shot examples would plausibly *compound* this baseline pull toward the mode: you're handing the model an anchor on top of a model already biased toward its favorite answer.
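One way to check whether a prompt variant narrows output variety is to compare simple diversity statistics over repeated draws. The sketch below is an illustration, not a metric taken from the cited notes: `distinct_n` and `output_entropy` are hypothetical helper names, whitespace tokenization is a simplifying assumption, and the two sample lists are hand-written stand-ins rather than real model outputs.

```python
from collections import Counter
from math import log2


def distinct_n(samples, n=2):
    """Share of n-grams across all samples that are unique (between 0 and 1)."""
    ngrams = []
    for text in samples:
        tokens = text.lower().split()  # assumption: whitespace tokenization
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def output_entropy(samples):
    """Shannon entropy in bits over exact-match outputs; 0 means every draw was identical."""
    counts = Counter(s.strip().lower() for s in samples)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())


# Hand-written stand-ins for model outputs (not real data).
zero_shot = ["the moon hums softly", "a river forgets its name", "clocks melt at noon"]
few_shot = ["the moon hums softly", "the moon hums quietly", "the moon hums softly"]

for name, samples in [("zero-shot", zero_shot), ("few-shot", few_shot)]:
    print(name, round(distinct_n(samples), 3), round(output_entropy(samples), 3))
```

Lower values on both statistics for the few-shot condition would be consistent with anchoring, though surface statistics like these say nothing about whether the outputs are creatively distinct.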

The closest the corpus comes to few-shot directly is work on ordering in-context demonstrations: sequencing them from harder to easier improves performance (see "Can representation sparsity order few-shot demonstrations effectively?"). Notably that's framed around accuracy, not diversity, which mirrors a field-wide blind spot. Most reasoning and ideation methods optimize for getting the conventional answer right and ignore the distinct creative modes (combinational, exploratory, transformational) where diversity actually lives, a gap the corpus argues may itself explain ideation collapse (see "Can LLMs reason creatively beyond conventional problem-solving?").
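The harder-to-easier ordering can be sketched roughly as follows. This is an assumption-laden illustration, not the published method: the source note says only that last-layer activation sparsity is used to order demonstrations from sparse (harder) to dense (easier). Defining sparsity as the share of near-zero entries, the `eps` threshold, the function names, and the toy activation vectors are all hypothetical.

```python
def activation_sparsity(activations, eps=1e-3):
    """Share of activation entries with magnitude below eps (assumed definition)."""
    if not activations:
        return 0.0
    return sum(1 for a in activations if abs(a) < eps) / len(activations)


def order_sparse_to_dense(demos):
    """Sort demonstrations so the sparsest (treated as hardest) come first."""
    return sorted(
        demos,
        key=lambda d: activation_sparsity(d["activations"]),
        reverse=True,
    )


# Toy activation vectors standing in for real last-layer activations.
demos = [
    {"text": "demo A", "activations": [0.0, 0.9, 0.4, 0.7]},
    {"text": "demo B", "activations": [0.0, 0.0, 0.0, 0.8]},
    {"text": "demo C", "activations": [0.0, 0.0, 0.5, 0.6]},
]

ordered = [d["text"] for d in order_sparse_to_dense(demos)]
print(ordered)
```

In practice the activations would come from a forward pass over each demonstration; that step is omitted here because it depends on the model and framework.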

What you didn't know you wanted to know: the same narrowing shows up under many names. There is entropy collapse in RL search agents (see "Does reinforcement learning squeeze exploration diversity in search agents?") and format collapse, where RL amplifies one pretraining format within the first epoch (see "Does RL training collapse format diversity in pretrained models?"). The documented *fixes* point at what an anti-collapse few-shot strategy should look like. Diversity is preserved by training on varied demonstrations rather than a narrow set (see "Does reinforcement learning squeeze exploration diversity in search agents?"), by step-level critique that counteracts 'tail narrowing' before it sets in (see "Do critique models improve diversity during training itself?"), and by deliberately layering variation (persona, subtopic, context) so the examples themselves carry breadth instead of collapsing it (see "Can synthetic dialogues become realistic through layered diversity?"). The lesson: a few homogeneous examples will likely narrow you; a deliberately heterogeneous set is the same tool pointed the other way.
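A deliberately heterogeneous example set could be assembled along these lines. This is a minimal sketch under stated assumptions: the greedy coverage rule, the function name `layered_example_slots`, and the sample persona, subtopic, and context values are all hypothetical and are not drawn from the cited work, which describes layered variation only at a high level.

```python
import itertools
import random


def layered_example_slots(personas, subtopics, contexts, k, seed=0):
    """Pick k (persona, subtopic, context) combinations that spread across all layers."""
    combos = list(itertools.product(personas, subtopics, contexts))
    random.Random(seed).shuffle(combos)  # break ties differently per seed
    chosen, seen = [], set()
    while combos and len(chosen) < k:
        # Greedy step: prefer the combination that adds the most unseen layer values.
        best = max(
            combos,
            key=lambda c: sum((i, v) not in seen for i, v in enumerate(c)),
        )
        combos.remove(best)
        chosen.append(best)
        seen.update(enumerate(best))
    return chosen


# Hypothetical layer values, for illustration only.
personas = ["skeptic", "optimist", "child"]
subtopics = ["sea", "city", "memory"]
contexts = ["letter", "dialogue", "list"]

slots = layered_example_slots(personas, subtopics, contexts, k=3)
for persona, subtopic, context in slots:
    print(f"Write one example as a {persona}, about {subtopic}, in the form of a {context}.")
```

Each slot would then be filled with one worked example, so that the few-shot set varies on every layer at once rather than repeating a single voice or form.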

Sources 9 notes

Does preference tuning always reduce diversity the same way?

RLHF reduces lexical-syntactic diversity in code generation but increases it in creative writing. The direction depends on what each domain incentivizes: code rewards convergence toward correct solutions, while creative writing rewards stylistic distinctiveness.

Do different AI models actually produce diverse outputs?

INFINITY-CHAT analyzed 70+ models across 26K open-ended queries and found an "Artificial Hivemind" effect: models independently generate strikingly similar or identical responses due to overlapping training data and alignment procedures, undermining the diversity benefits of model ensembles.

Why aren't bigger models better for generating diverse outputs?

Research shows that for synthetic data generation, models around 500M parameters outperform larger ones in output diversity per sample. Larger models concentrate probability mass on preferred outputs, reducing the variety of distinct samples generated within a fixed budget.

Can representation sparsity order few-shot demonstrations effectively?

Sparsity-Guided Curriculum In-Context Learning uses last-layer activation sparsity to order demonstrations from sparse (harder) to dense (easier), yielding considerable performance improvements. This approach requires no external difficulty labels and works across diverse in-context learning tasks.

Can LLMs reason creatively beyond conventional problem-solving?

Research identifies combinational, exploratory, and transformational reasoning as distinct creative modes grounded in cognitive science. Existing LLM reasoning methods address only conventional problem-solving, leaving creative paradigms unaddressed and potentially explaining diversity collapse in ideation.

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Do critique models improve diversity during training itself?

Step-level critique in the training loop counteracts tail narrowing and maintains solution diversity across self-training iterations. This training-time benefit—preventing premature convergence—is more fundamental than test-time accuracy gains.

Can synthetic dialogues become realistic through layered diversity?

Research shows that realistic synthetic dialogues require three multiplicative layers: subtopic specificity, Big Five persona variation, and 11 contextual characteristics via Chain of Thought reasoning. This structured approach captures 90.48% of in-domain dialogue performance.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Jointly Reinforcing Diversity and Quality in Language Model Generations (2.46 match)
Evaluating the Diversity and Quality of LLM Generated Content (1.71 match)
Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining (1.68 match)
NoveltyBench: Evaluating Language Models for Humanlike Diversity (1.67 match)
Vector Policy Optimization: Training for Diversity Improves Test-Time Search (1.66 match)
Outcome-based Exploration for LLM Reasoning (1.65 match)
Scaling Synthetic Data Creation with 1,000,000,000 Personas (1.63 match)
Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs (1.61 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question: **Does feeding an LLM a few worked examples narrow generative diversity in creative tasks?** Treat this as still-open.

What a curated library found — and when (dated claims, not current truth):
- Diversity effects are domain-dependent: preference tuning *reduces* lexical/syntactic variety in code but *increases* it in creative writing, because code rewards convergence while creative writing rewards distinctiveness (~2024–2025).
- Models converge independently even without examples: across 70+ LLMs and 26K open-ended queries, different models produce strikingly similar outputs ("Artificial Hivemind") driven by shared training data and alignment (~2025).
- Larger models concentrate probability mass on preferred outputs more than smaller ones; models around 500M parameters generate the most distinct samples per draw (~2025).
- Diversity is preserved by *training* on heterogeneous demonstrations rather than narrow sets, and by step-level critique that counteracts tail narrowing during training (~2024–2025).
- Few-shot examples ordered harder-to-easier improve *accuracy* but the corpus does not directly measure their effect on diversity (~2024).

Anchor papers (verify; mind their dates):
- 2305.15717 (2023): The False Promise of Imitating Proprietary LLMs
- 2510.22954 (2025): Artificial Hivemind: The Open-Ended Homogeneity of Language Models
- 2605.22817 (2026): Vector Policy Optimization: Training for Diversity Improves Test-Time Search
- 2511.20471 (2025): Universe of Thoughts: Enabling Creative Reasoning with Large Language Models

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For each finding above, judge whether newer post-training methods (DPO, self-supervised diversity objectives, test-time search scaling), orchestration (multi-sampling with explicit diversity penalties), or recent evals have relaxed the narrowing effect. Separate the durable question (does *any* few-shot regime narrow diversity?) from the perishable limitation (does it narrow *more* than baseline model collapse?). If newer work shows heterogeneous few-shot examples *prevent* narrowing, cite it plainly.
(2) **Surface the strongest contradicting or superseding work from the last ~6 months.** Does recent test-time compute scaling (e.g., generative process-reward models) or multi-agent ideation override the few-shot narrowing effect? Cite arXiv IDs.
(3) **Propose 2 research questions that assume the regime may have moved:** (a) Under what conditions do few-shot examples *expand* rather than narrow diversity, and is this measurable by creative evaluation metrics (not just entropy)? (b) Can adaptive few-shot selection (picking examples that maximize downstream diversity) be trained end-to-end?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
