Do language models favor outputs from their own model family?
This reads the question as one about self-preference bias: whether a model, when judging or selecting among outputs, systematically favors text produced by itself or by siblings in its own architecture family. The collection doesn't contain a study that directly measures "LLM-as-judge prefers its own generations," so rather than pad, it's worth mapping what the corpus *does* establish about model-specific signatures and shared outputs. Its findings pull in two opposite directions that together reframe the question.
First, the case that family even matters. There's a real, measured phenomenon where effects are *model-specific* in a way that crosses ordinary boundaries: behavioral traits can be transmitted between models through data with no semantic relationship to the trait at all, but only when teacher and student share an architecture; the transmission fails across different families (see "Can language models transmit hidden behavioral traits through unrelated data?"). That's the strongest hint in the corpus that "own family" is a meaningful category at the mechanism level: models carry statistical signatures that siblings pick up and outsiders don't. If a model can absorb a hidden trait from a same-family teacher through semantically unrelated data, it's plausible, though not shown here, that it could also register same-family *outputs* as more familiar, and familiarity is exactly what self-preference bias would feed on.
But the opposite finding undercuts how much room there is to favor anything. The "Artificial Hivemind" result (from INFINITY-CHAT) found that 70+ models, across 26K open-ended prompts, independently converge on strikingly similar, sometimes identical, responses, because they share overlapping training data and alignment recipes (see "Do different AI models actually produce diverse outputs?"). If outputs are already this homogeneous across families, the practical space for a model to prefer "its own kind" shrinks: there's less distinguishing signal to latch onto. Self-preference and convergence are in tension: one needs distinctiveness, the other erases it.
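The tension between distinctiveness and convergence can be made concrete with a rough sketch. The snippet below compares how similar responses are within a family versus across families. Everything in it is an assumption for illustration: the family labels and response texts are made up, and token-set Jaccard similarity is a crude stand-in, not the metric used in the study cited above.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two strings (1.0 if both are empty)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def mean_similarity(responses, same_family: bool) -> float:
    """Average similarity over response pairs that are within one family
    (same_family=True) or across two families (same_family=False)."""
    scores = [
        jaccard(text_a, text_b)
        for (fam_a, text_a), (fam_b, text_b) in combinations(responses, 2)
        if (fam_a == fam_b) == same_family
    ]
    return sum(scores) / len(scores) if scores else float("nan")


# Hypothetical (family, response) pairs for a single open-ended prompt.
responses = [
    ("family_x", "time is a river that carries us forward"),
    ("family_x", "time is a river that flows ever forward"),
    ("family_y", "time is a river that carries us all"),
    ("family_y", "time is a thief that steals our days"),
]

within = mean_similarity(responses, same_family=True)
across = mean_similarity(responses, same_family=False)

# A gap near zero means families are hard to tell apart by surface overlap,
# which leaves little signal for a judge to prefer "its own kind".
distinctiveness_gap = within - across
```

A small or negative `distinctiveness_gap` on real data would be consistent with the convergence finding; a large positive one would leave more room for family-level preference.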
The deeper mechanism worth knowing: models lean hard on what training baked in over what's in front of them. Parametric priors routinely override in-context information, and plain prompting can't talk a model out of a strong learned association; it takes causal intervention in the representations (see "Why do language models ignore information in their context?"). A judging model evaluating candidate outputs has to integrate in-context text against those priors, so any bias toward familiar-looking (same-family) text would plausibly be a prior-over-context effect rather than a reasoned judgment, which is precisely why it would be hard to fix with instructions alone.
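Since the corpus has no direct measurement of this bias, here is a minimal sketch of how one might be set up, assuming a log of pairwise judgments where the judge's family and both candidates' families are known. The `Judgment` record and its field names are hypothetical, and the sample data is invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgment:
    judge_family: str
    family_a: str
    family_b: str
    winner: str  # "a" or "b"


def self_preference_rate(judgments):
    """Share of eligible comparisons where the judge picked its own family.

    A comparison is eligible only if exactly one candidate comes from the
    judge's family. Returns None when there are no eligible comparisons.
    """
    own_wins = 0
    eligible = 0
    for j in judgments:
        a_own = j.family_a == j.judge_family
        b_own = j.family_b == j.judge_family
        if a_own == b_own:
            continue  # both or neither from the judge's family: uninformative
        eligible += 1
        if (j.winner == "a" and a_own) or (j.winner == "b" and b_own):
            own_wins += 1
    return own_wins / eligible if eligible else None


# Hypothetical judgment log.
judgments = [
    Judgment("x", "x", "y", "a"),  # own family wins
    Judgment("x", "y", "x", "b"),  # own family wins
    Judgment("x", "x", "y", "b"),  # own family loses
    Judgment("x", "y", "z", "a"),  # skipped: no same-family candidate
    Judgment("x", "x", "x", "a"),  # skipped: both candidates same-family
]

rate = self_preference_rate(judgments)
```

A rate above 0.5 is not by itself evidence of bias, because same-family outputs might simply be better on that prompt set. Separating preference from quality needs an external reference, such as human labels on the same pairs, plus randomized candidate order to rule out position effects.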
The corpus also points at the alternative people actually trust when they don't trust models to judge each other: human preference. Chatbot Arena's 240K+ crowdsourced pairwise votes produce rankings that track expert raters, which is part of why scaled *human* preference, not model self-judgment, became the credible evaluation signal (see "Can crowdsourced votes reliably rank language models?"). The reason self-preference matters at all is that the field increasingly wants models to grade models. The evidence here suggests the safer move is to keep a human or an external verifier in the loop, since self-improvement is formally bounded by the generation-verification gap: a model cannot reliably validate its own quality without something outside itself (see "What stops large language models from improving themselves?").
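To illustrate how pairwise votes can become a ranking, the sketch below fits a Bradley-Terry model, one common choice for this kind of data, using the standard iterative (minorization-maximization) update. It is an assumption-laden illustration, not a description of Chatbot Arena's actual pipeline, and the model names and votes are invented.

```python
from collections import defaultdict


def bradley_terry(votes, iterations=200):
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    Each update sets a model's strength to its win count divided by the sum,
    over its opponents, of games played / (own strength + opponent strength).
    Strengths are renormalized to sum to 1 after every pass.
    """
    wins = defaultdict(int)
    pair_counts = defaultdict(int)
    models = set()
    for winner, loser in votes:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        models.update((winner, loser))

    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for m in models:
            denom = sum(
                count / (strength[m] + strength[other])
                for pair, count in pair_counts.items()
                if m in pair
                for other in pair - {m}
            )
            updated[m] = wins[m] / denom if denom else strength[m]
        total = sum(updated.values())
        strength = {m: s / total for m, s in updated.items()}
    return strength


# Hypothetical votes as (winner, loser) pairs.
votes = (
    [("A", "B")] * 3 + [("B", "A")]
    + [("B", "C")] * 3 + [("C", "B")]
    + [("A", "C")] * 2
)

strengths = bradley_terry(votes)
ranking = sorted(strengths, key=strengths.get, reverse=True)
```

The same fit works whether the votes come from humans or from model judges, which is the point: if model judges carry a family bias, it flows straight into the ranking, so comparing a model-judged ranking against a human-voted one is a simple external check.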
Sources (5 notes)
Research demonstrates that behavioral traits propagate between models via filtered data bearing no semantic relationship to the trait. The effect is model-specific, fails across different architectures, and persists despite rigorous filtering—indicating the mechanism embeds statistical signatures rather than semantic content.
INFINITY-CHAT analyzed 70+ models across 26K open-ended queries and found an "Artificial Hivemind" effect: models independently generate strikingly similar or identical responses due to overlapping training data and alignment procedures, undermining the diversity benefits of model ensembles.
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
Chatbot Arena's 240K+ crowdsourced preference votes produce credible model rankings because the underlying questions are diverse and discriminating, and crowd judgments correlate with expert raters—validating human preference as a scalable evaluation signal.
Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.