Do users trust citations more when there are simply more of them?
Explores whether citation quantity alone influences user trust in search-augmented LLM responses, independent of whether those citations actually support the claims being made.
Search Arena provides the largest analysis of user preferences for search-augmented LLMs: over 24,000 paired multi-turn interactions with ~12,000 human preference votes. The finding that matters most: users prefer responses with more cited sources, and this preference extends to irrelevant citations.
The effect sizes are nearly identical. Correctly attributed citations have a positive coefficient of β=0.285 on user preference. Irrelevant citations — citations that do not support the associated claims — have a positive coefficient of β=0.273. Users are influenced by the presence of citations roughly equally regardless of whether those citations actually back up the text.
This means citation count functions as a surface trust heuristic, decoupled from citation quality. Users see citations and infer credibility without verifying the cited content supports the claim. The gap between perceived and actual credibility is systematic, not incidental.
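To put the two coefficients on a more intuitive scale, here is a minimal sketch that converts them to odds ratios and win probabilities. It assumes the reported β values are log-odds coefficients from a logistic (Bradley-Terry style) preference model, which this note does not state. The function names are illustrative, and the unit of the citation feature is whatever the original analysis used.

```python
import math

# Coefficients as reported in this note.
BETA_SUPPORTING = 0.285  # correctly attributed citations
BETA_IRRELEVANT = 0.273  # citations that do not support the claim


def odds_ratio(beta: float) -> float:
    """Multiplicative change in the odds of being preferred per unit of the
    feature. ASSUMPTION: beta is a log-odds coefficient."""
    return math.exp(beta)


def win_probability(beta: float, feature_diff: float = 1.0) -> float:
    """Probability of being preferred when a response leads its opponent by
    `feature_diff` units on this one feature, all else equal (same assumption)."""
    return 1.0 / (1.0 + math.exp(-beta * feature_diff))


if __name__ == "__main__":
    for label, beta in [("supporting", BETA_SUPPORTING),
                        ("irrelevant", BETA_IRRELEVANT)]:
        print(f"{label}: odds ratio {odds_ratio(beta):.3f}, "
              f"win probability {win_probability(beta):.3f}")
```

Under that assumption, a one-unit lead in either kind of citation raises the odds of being preferred by roughly a third, and the two win probabilities differ by less than one percentage point.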
Additional preference signals: users prefer community-driven platforms (tech blogs, social networks) over encyclopedic sources like Wikipedia. They also prefer reasoning-enhanced responses and longer responses. Adding web search does not degrade, and may improve, performance in non-search settings. In search settings, however, quality drops significantly when the model relies solely on parametric knowledge.
This connects to the note "Do users worldwide trust confident AI outputs even when wrong?" In that finding, confidence signals override accuracy assessment. Here, citation signals override quality assessment. Both are instances of the same pattern: users rely on surface proxies for quality because evaluating actual quality is cognitively expensive.
The implication for RAG system design is direct: optimizing for user satisfaction and optimizing for answer quality are not the same optimization target. A system can score highly on user preference by adding more citations — even irrelevant ones — without improving answer quality. This is a form of metric gaming at the human-evaluation level.
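One way to keep the two targets apart in an evaluation harness is to report citation count and citation precision as separate metrics. The sketch below is illustrative only: the `Citation` type, the `supports_claim` label (assumed to come from a human or automated attribution check), and the example URLs are all hypothetical and not part of the Search Arena analysis.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Citation:
    url: str
    supports_claim: bool  # hypothetical label from an attribution check


def citation_count(citations: List[Citation]) -> int:
    """Surface signal: how many sources the response cites."""
    return len(citations)


def citation_precision(citations: List[Citation]) -> float:
    """Quality signal: fraction of citations that support their claim.
    Returns 0.0 for a response with no citations."""
    if not citations:
        return 0.0
    return sum(c.supports_claim for c in citations) / len(citations)


# Two made-up responses: one concise and fully supported, one padded.
concise = [
    Citation("https://example.com/a", True),
    Citation("https://example.com/b", True),
]
padded = concise + [
    Citation(f"https://example.com/filler-{i}", False) for i in range(4)
]

if __name__ == "__main__":
    for name, cites in [("concise", concise), ("padded", padded)]:
        print(name, citation_count(cites), round(citation_precision(cites), 2))
```

The padded response wins on count and loses on precision. A system tuned only on preference votes is rewarded for the first number, so tracking the second separately makes padding visible.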
Inquiring lines that use this note as a source (76)
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- How do LLMs generate false citations that sound like real scholarship?
- Can statistical filtering plus narrative generation fool academic peer review?
- Can social validation of expertise exclude systems that lack participatory track records?
- How does social proof work differently when there is no identifiable author?
- How does AI fact-checking compare to other trust signals like citation counts?
- Can citation practices work when AI cannot produce traceable sources?
- Can beam search and ranking functions evaluate claims without understanding counterarguments?
- Why do some LLM clusters cite broader psychology than others?
- Can persuasion effects that avoid demographic profiling maintain factual accuracy?
- Why do LLM explanations cite similarity and diversity more as options increase?
- How does explanation fluency mislead users about actual recommendation procedures?
- Does post-hoc justification increase when LLM choices become harder to defend?
- Does uncertainty quantification in model responses reduce persuasive impact on audiences?
- Do verbal uncertainty estimates calibrate better than confidence scores for personalization?
- Does complexity signal credibility and authority to readers?
- How does source attribution change the complexity-persuasion relationship?
- How does understanding persistent journeys intensify both trust and privacy concerns?
- How does pretraining corpus popularity bias affect LLM recommendation behavior?
- Does the interface design itself shape how much content users will review?
- Does stripping social context from knowledge claims hollow out their meaning?
- Does weak versus robust anthropomimesis produce different user trust responses?
- How does personalization increase trust while degrading clinical safety outcomes?
- Why do review corpora contain biases that affect generated comparisons?
- Does endorsement structure outperform content in detecting social controversy?
- Why do citation counts increase trust even without relevance?
- How do retrieval failures enable generation of fabricated scholarly constructs?
- Can verification mechanisms prevent AI agents from inventing false citations?
- How does search budget affect answer quality at test time?
- Are larger models and search access substitutes for factual accuracy?
- How do real search queries reveal what counts as a deep research question?
- Can graded relevance assumptions hold when user ratings are temporally inconsistent?
- How do experts select which other experts to trust?
- What anchoring effects shape how users rate items in sequence?
- How does anomalous state of knowledge affect user self-assessment?
- How does processing fluency bias credibility and expertise judgments?
- What role should the trust parameter play in using synthetic data as evidence?
- Why do users experience LLMs as peers rather than statistical tools?
- What would it take for readers to inspect rather than assume authorship?
- How does collapsing the author-public distinction remove the audience an appeal would target?
- Can confidence levels improve recommendations compared to single-number ratings?
- What documents improve answers beyond surface query similarity?
- Does exposure to more domain-specific examples reduce LLM overconfidence?
- Why does evaluating multiple candidates work better than judging one answer?
- Could real-time search systems avoid era sensitivity in legal reasoning?
- Does high knowledge density in text reduce user motivation to read more?
- What role does search capacity play in making debate more accurate?
- How much do social audience effects distort the true average satisfaction in review aggregates?
- Can adaptive elbow detection replace fixed top-k limits in evidence retrieval?
- Why does adaptive document allocation improve over fixed k selection?
- Does RL pruning of documents differ fundamentally from rationale-driven evidence selection?
- What conversational moves signal expertise and build credibility in recommendations?
- Can factual product data improve the credibility of subjective opinion summaries?
- What makes search budget matter for research task performance?
- What role does commitment and reputation play in building trustworthy expertise?
- How does social standing give certain claims more persuasive power than others?
- Why does probability of text completion not equal knowledge value?
- Why does automated evaluation consistently overestimate research quality?
- Why does RLHF alone fail to fully prevent opinion copying?
- Can RAG systems game user preferences by adding irrelevant citations?
- Why do users prefer community sources over encyclopedic references?
- Why do current benchmarks fail to match user satisfaction with search results?
- How does confidence in LLM outputs override users' ability to check accuracy?
- Can personalized systems reward honest disagreement instead of user confirmation?
- How do one-sided explanations act as confidence signals to users?
- How do citation patterns encode collective judgment about research quality?
- What role does document reranking play alongside decisions about whether to retrieve?
- What role does vague intent play in realistic search evaluation?
- How does multi-turn dialogue improve user satisfaction in search interactions?
- Why do aggregate persuasion metrics mask what actually changes minds?
- Why do users trust some recommenders more than others?
- Can ranking by coherence while minimizing author-community coverage find novel research?
- Can crowdsourced voting and automated panels both credibly evaluate LLM outputs?
- What validity threats exist in crowdsourced preference signals?
- How does persuasive framing replace evidence in contested domains?
- Can anonymity and trustworthiness coexist in online spaces without credential systems?
- Why are documents read but not cited harder distractors than random samples?
Related concepts in this collection (3)
- Do users worldwide trust confident AI outputs even when wrong?
  Explores whether the tendency to over-rely on confident language model outputs transcends language and culture. Understanding this pattern is critical for designing safer human-AI interaction across diverse linguistic contexts.
  same pattern: surface signals override quality evaluation
- Can LLM judges be fooled by fake credentials and formatting?
  Explores whether language models evaluating text fall for authority signals and visual presentation unrelated to actual content quality, and whether these weaknesses can be exploited without deep model knowledge.
  citation inflation is another bias axis exploitable in evaluation systems
- Can LLM explanations actually help humans predict model behavior?
  Do model explanations enable users to accurately simulate how the model will behave on related inputs? This matters because it determines whether explanations genuinely improve human understanding or just create an illusion of understanding.
  plausibility ≠ precision mirrors citation-count ≠ citation-quality
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Search Arena: Analyzing Search-Augmented LLMs
- Flattery, Fluff, and Fog: Diagnosing and Mitigating Idiosyncratic Biases in Preference Models
- LLMs can be Fooled into Labelling a Document as Relevant
- AI Can Learn Scientific Taste
- Query Understanding in the Age of Large Language Models
- News Source Citing Patterns in AI Search Systems
- Think Twice Before Trusting: Self-Detection for Large Language Models through Comprehensive Answer Reflection
- From speaking like a person to being personal: The effects of personalized, regular interactions with conversational agents
Original note title
users prefer responses with more citations even when citations are irrelevant — citation count is a decoupled trust heuristic