SYNTHESIS NOTE

Do different AI models actually produce diverse outputs?

Explores whether combining multiple different language models creates genuine diversity, or whether shared training and alignment cause them to converge on similar answers even though they are built independently.

Synthesis note · 2026-03-27 · sourced from Foundation Models

INFINITY-CHAT studied 70+ open and closed source LLMs across 26K real-world open-ended queries that admit a wide range of plausible answers with no single ground truth. The findings reveal a pronounced "Artificial Hivemind" effect characterized by two distinct phenomena:

Intra-model repetition: a single model consistently generates similar responses to the same prompt across runs.
Inter-model homogeneity: different models independently produce strikingly similar outputs, sometimes verbatim. DeepSeek-V3 and GPT-4o, for example, generated overlapping phrases such as "Elevate your iPhone with our" and "sleek, without compromising." In some cases, models from the same family output identical responses.
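
One way to make the two phenomena concrete is to compare responses pairwise. The sketch below is illustrative only: it uses word n-gram Jaccard overlap as a simple stand-in, not the similarity measure used in the study, and the function names and input layout are assumptions rather than anything taken from INFINITY-CHAT.

```python
from itertools import combinations


def ngrams(text, n=3):
    """Return the set of word n-grams in a lowercased text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b, n=3):
    """Jaccard overlap of two texts' n-gram sets (0.0 if both sets are empty)."""
    x, y = ngrams(a, n), ngrams(b, n)
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def mean_pairwise(pairs, n=3):
    """Average Jaccard overlap over an iterable of (text, text) pairs."""
    scores = [jaccard(a, b, n) for a, b in pairs]
    return sum(scores) / len(scores) if scores else 0.0


def intra_model_similarity(samples, n=3):
    """samples: responses from ONE model to the same prompt (hypothetical layout)."""
    return mean_pairwise(combinations(samples, 2), n)


def inter_model_similarity(by_model, n=3):
    """by_model: dict of model name -> responses to the same prompt (hypothetical layout)."""
    pairs = [
        (a, b)
        for m1, m2 in combinations(sorted(by_model), 2)
        for a in by_model[m1]
        for b in by_model[m2]
    ]
    return mean_pairwise(pairs, n)
```

A high intra-model score corresponds to repetition within one model; a high inter-model score corresponds to homogeneity across models. A surface n-gram measure like this would miss convergence on ideas that are phrased differently, so an embedding-based similarity would likely be needed to capture that.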

The inter-model effect is the more concerning finding. Model ensembles — using multiple different models to increase diversity — may not yield true diversity when their constituents share overlapping alignment and training priors. The convergence is not just stylistic but substantive: models converge on the same ideas, not just the same words.

This has direct implications for the False Punditry argument. Building on the note "Does polished AI output trick audiences into trusting it?", the hivemind effect means that AI-generated social media content will sound similar regardless of which model generates it. The "diversity" of AI voices on social media is illusory: different accounts using different models will produce strikingly similar analysis, framing, and conclusions, creating a false consensus that looks like independent agreement.

Building on the note "Why do LLMs generate novel ideas from narrow ranges?", the hivemind effect extends from research ideas to all open-ended generation. The diversity collapse documented in research ideation is a specific instance of a general phenomenon: LLMs trained on overlapping data with similar alignment procedures converge on a shared distribution of outputs.

Recommendation as a concrete domain instance. LLM-based conversational recommender systems exhibit the hivemind in a specific, measurable way: "the most popular items such as The Shawshank Redemption appear around 5% of the time" across different recommendation datasets, and "the recommended popular items are similar across different datasets, which may reflect the item popularity in the pre-training corpus of LLMs" (Large Language Models as Zero-Shot Conversational Recommenders). The convergence is not on quality or relevance but on pretraining-distribution popularity — the same items surface regardless of the user's context or the dataset's actual popularity distribution. This is the hivemind effect translated from open-ended generation to decision-making: LLMs don't just write the same things, they recommend the same things.
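
The two quantities quoted above (how often one item is recommended, and how similar the popular items are across datasets) can be computed from logged recommendations. This is a minimal sketch under assumed inputs: each dataset is a list of per-conversation recommendation lists, and `item_share` and `top_k_overlap` are hypothetical helper names, not functions from the cited paper.

```python
from collections import Counter


def item_counts(recommendations):
    """Count how often each item appears across all recommendation lists."""
    return Counter(item for recs in recommendations for item in recs)


def item_share(recommendations):
    """Map each item to its fraction of all recommended slots."""
    counts = item_counts(recommendations)
    total = sum(counts.values())
    return {item: c / total for item, c in counts.items()} if total else {}


def top_k_overlap(recs_a, recs_b, k=10):
    """Jaccard overlap between the top-k most recommended items of two datasets."""
    top_a = {item for item, _ in item_counts(recs_a).most_common(k)}
    top_b = {item for item, _ in item_counts(recs_b).most_common(k)}
    union = top_a | top_b
    return len(top_a & top_b) / len(union) if union else 0.0
```

Under these assumptions, a share near 0.05 for a single title would match the "around 5% of the time" figure, and a high `top_k_overlap` between unrelated datasets would point to popularity inherited from pretraining rather than from the datasets themselves.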

The INFINITY-CHAT study also found that reward models and LM-based judges are miscalibrated for responses that elicit divergent human preferences: they assume a single consensus notion of quality and fail to reward the pluralistic preferences that open-ended queries produce. This means the homogeneity is self-reinforcing: training on reward model scores optimizes for the consensus the hivemind already occupies.

Fiction is a concrete narrative-level instance of the hivemind — with per-model fingerprints layered on top. StoryScope ("Investigating idiosyncrasies in AI fiction") applies the convergence finding to creative writing and shows it operates at the level of narrative decisions, not just words. Across a parallel corpus where five LLMs (Claude, DeepSeek, Gemini, GPT, Kimi) each wrote stories to the same 10,272 prompts, the five models occupy a tight, well-separated cluster in narrative-feature space while human-authored stories scatter more widely — the hivemind effect translated from phrasing to plot, agency, and temporal structure (see Do AI stories explain their themes more than human stories do?). Crucially, the inter-model convergence coexists with detectable per-model fingerprints: Claude produces notably flat event escalation, GPT over-indexes on dream sequences, Gemini defaults to external character description, enabling 68.4% macro-F1 six-way authorship attribution. This refines the hivemind picture — models converge on a shared region of output space relative to humans, yet retain stable individual signatures relative to each other. The convergence is not total homogenization but a common cluster with distinguishable accents.
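
For readers unfamiliar with the attribution metric, macro-F1 is the unweighted mean of per-class F1 scores, so every author class counts equally regardless of how many stories it has. The sketch below implements only the metric, using the standard library; it does not reproduce StoryScope's classifier or features, and the labels in the usage note are hypothetical.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

For example, `macro_f1(["claude", "gpt"], ["claude", "gpt"])` returns 1.0, while systematic confusion between two authors lowers the F1 of both classes and therefore the average.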

NoveltyBench (2025) provides the first benchmark-level quantification of mode collapse across 20 leading models. Evaluating models on prompts curated to elicit diverse answers (using filtered real-world queries), the study finds that current SOTA systems "generate significantly less diversity than human writers." A counterintuitive finding: larger models within a family often exhibit less diversity than their smaller counterparts, directly challenging the assumption that capability on standard benchmarks translates to generative utility. While in-context regeneration prompting strategies can elicit some diversity, the findings reveal "a fundamental lack of distributional diversity" that reduces utility for users seeking varied responses. The mode collapse is driven by alignment: today's aligned models produce lower-entropy distributions than earlier generations, and random sampling produces substantial near-duplicates. Source: Arxiv/Evaluations.
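
The notions of "lower-entropy distributions" and "near-duplicates" can be illustrated by sampling one prompt several times and measuring how the samples spread out. The sketch below is an assumption-laden simplification: it treats two generations as equivalent only when they match after lowercasing and whitespace normalization, which is a stand-in for, not a reproduction of, the benchmark's own equivalence judgment.

```python
import math
from collections import Counter


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def distinct_count(samples):
    """Number of distinct generations after normalization."""
    return len({normalize(s) for s in samples})


def empirical_entropy(samples):
    """Shannon entropy (in bits) of the empirical distribution over normalized samples."""
    counts = Counter(normalize(s) for s in samples)
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

With k samples, entropy ranges from 0 bits (every sample identical) to log2(k) bits (every sample distinct), so a model in mode collapse sits near the bottom of that range.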

Source (enrichment): Co Writing Collaboration — "StoryScope: Investigating idiosyncrasies in AI fiction", https://arxiv.org/abs/2604.03136

Inquiring lines that read this note 93

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

What are the consequences of models training on synthetic data?

Can AI-generated outputs constitute genuine knowledge or valid claims?

Do language models learn genuine linguistic structure or just surface patterns?

How does AI-generated content transformation affect public discourse quality?

Do AI-generated posts crowd out human voices without any coordination or intent?

What makes AI persuasion effective and how can we counter it?

Why do multiple language models independently produce similar outputs in influence campaigns?

How can AI alignment serve diverse human preferences at scale?

How can language models sustain linguistic synchrony and intersubjectivity during dialogue?

When does optimizing for quality undermine the value of diversity?

Why does reinforcement learning suppress output diversity compared to supervised fine-tuning?

Does alignment training create blind spots in detecting genuine safety threats?

How do language models inherit human biases from training data?

How do multi-agent systems achieve genuine cooperation and reasoning?

Why does diversity without expertise produce worse results than a single capable agent?

What determines success in training models on multiple tasks?

What factors beyond surface content determine how readers extract meaning differently?

What semantic classifier design avoids lexical variation without genuine conceptual distinctness?

Can prompting inject entirely new knowledge into language models?

Why do persona-level simulations fail to predict individual preferences accurately?

Does single model persona diversity match true multi-model diversity at scale?

Does fine-tuning modify underlying model capabilities or only behavioral outputs?

Why can't humans reliably detect AI-generated text despite measurable linguistic signatures?

How do training priors constrain what context information can override?

How do language models establish social grounding in human dialogue?

Why do language models presume common ground instead of building it?

What makes weaker teacher models effective for stronger student training?

Why do semantic similarity and task relevance diverge in vector embeddings?

Does the same spectral signature appear across different embedding models?

Can next-token prediction alone produce genuine language understanding?

Does token-level loss aggregation help aligned models differently?

Can ensemble evaluation methods reduce bias more than single judges?

How do ensemble methods reduce bias in automated evaluation?

Do corrupted reasoning traces serve as effective supervision signals?

How does an aggregator use diverse complementary traces to improve final answers?

Related concepts in this collection 4

Does polished AI output trick audiences into trusting it? When AI generates professional-looking graphs, diagrams, and presentations, do audiences mistake visual polish for analytical depth? This matters because appearance might substitute for actual expertise.
hivemind makes all AI artifacts sound similar
Why do LLMs generate novel ideas from narrow ranges? LLM research agents produce individually novel ideas but cluster them in homogeneous sets. This explores why high average novelty coexists with poor diversity coverage and what it means for automated ideation.
research ideation collapse as specific instance of general hivemind
Why do preference models favor surface features over substance? Preference models show systematic bias toward length, structure, jargon, sycophancy, and vagueness—features humans actively dislike. Understanding this 40% divergence reveals whether it stems from training data artifacts or architectural constraints.
reward model miscalibration reinforces homogeneity
Why do multi-agent LLM systems converge without genuine deliberation? Multi-agent reasoning systems are designed to improve answers through debate, but often agents simply agree with early confident claims rather than genuinely disagreeing. What drives this pattern and how common is it?
hivemind at generation level parallels silent agreement at reasoning level
