INQUIRING LINE

Can aggregate survey realism coexist with unreliable fine-grained effects?

This explores whether LLMs can reproduce believable population-level survey patterns while still getting the individual or fine-grained effects wrong — and whether those two facts can be true at once.


The corpus says yes, emphatically, and even explains the mechanism. The clearest statement comes from work on causal simulation: LLMs guided by structural causal models recover effect *directions* reliably but not effect *magnitudes* (Can structural causal models automate social science with language models?). That is exactly the split the question names: the coarse shape of the aggregate looks right, while the precise size of any one effect is untrustworthy. Directional social science survives; point estimates don't.
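The direction-versus-magnitude split can be made concrete with a small check. The sketch below is illustrative only: the effect sizes are made-up numbers, not results from any cited study, and `direction_agreement` and `mean_relative_error` are hypothetical helper names. It shows how a simulation can match every sign while being far off on size.

```python
# Illustrative numbers only (assumption): four "human" effect estimates
# and four "simulated" estimates for the same effects.
human_effects = [0.40, -0.25, 0.10, 0.60]
simulated_effects = [0.90, -0.05, 0.35, 0.20]


def sign(x: float) -> int:
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def direction_agreement(ref: list[float], sim: list[float]) -> float:
    """Fraction of effects whose sign matches the reference."""
    return sum(sign(r) == sign(s) for r, s in zip(ref, sim)) / len(ref)


def mean_relative_error(ref: list[float], sim: list[float]) -> float:
    """Average of |sim - ref| / |ref|; assumes no reference effect is zero."""
    return sum(abs(s - r) / abs(r) for r, s in zip(ref, sim)) / len(ref)


agreement = direction_agreement(human_effects, simulated_effects)
rel_error = mean_relative_error(human_effects, simulated_effects)
print(f"direction agreement: {agreement:.2f}")
print(f"mean relative error: {rel_error:.2f}")
```

With these toy inputs every sign matches, yet the average magnitude is off by more than 100%, which is the pattern the paragraph describes.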

The survey-realism research sharpens this. Pathological skew and over-positivity in LLM survey responses turn out to be *measurement artifacts* of how you elicit the answer, not intrinsic model failures. Eliciting free text and mapping it to scales via embeddings recovers ~90% of human test-retest reliability with realistic distributions (Why do LLMs give unrealistic survey responses?). So aggregate realism is real and recoverable. But realism at the distribution level says nothing about whether any single simulated respondent is a faithful person, which is where fine-grained reliability quietly breaks.
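A minimal sketch of the "free text, then map to a scale via embeddings" idea follows. It is not the cited method's implementation: the anchor statements are invented, `embed` is a toy bag-of-words stand-in for a real sentence-embedding model, and normalizing cosine similarities into a distribution is an assumption made for illustration.

```python
import math
from collections import Counter

# Hypothetical anchor statements, one per point on a 1-5 scale (assumption).
ANCHORS = {
    1: "i would definitely not buy this",
    2: "i would probably not buy this",
    3: "i might or might not buy this",
    4: "i would probably buy this",
    5: "i would definitely buy this",
}


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real pipeline would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def scale_distribution(response: str) -> dict[int, float]:
    """Map a free-text answer to a probability distribution over scale points."""
    vec = embed(response)
    sims = {point: cosine(vec, embed(text)) for point, text in ANCHORS.items()}
    total = sum(sims.values())
    if total == 0:
        # No overlap with any anchor: fall back to a uniform distribution.
        return {point: 1 / len(ANCHORS) for point in ANCHORS}
    return {point: s / total for point, s in sims.items()}


dist = scale_distribution("i would probably buy this one")
best_point = max(dist, key=dist.get)
print(best_point, dist)
```

The output is a distribution over scale points rather than a single forced number, which is what lets the aggregate keep a realistic shape. A bag-of-words stand-in handles negation poorly, so any real use would need a proper embedding model.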

Why the two layers come apart is worth seeing. A consistent output is not a reliable one: zero-temperature determinism just replays one draw from the model's distribution, and repeated-sampling tests show that consistency ≠ reliability (Does setting temperature to zero actually make LLM outputs reliable?). And the thing being measured isn't even one thing: annotation responses decompose into genuine preferences, non-attitudes, and constructed-on-the-spot preferences, distinguishable only by consistency across conditions (Do all annotation responses measure the same underlying thing?). Aggregating washes these together into a plausible mean while the individual signal stays noisy.
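The gap between consistency and reliability can be shown with a toy sampler. This is a sketch under stated assumptions: `ANSWER_PROBS` is an invented answer distribution for one simulated respondent, and re-seeding a random generator stands in for fixed-seed or zero-temperature decoding. No real model is called.

```python
import random
from collections import Counter

# Hypothetical answer distribution for one simulated respondent (assumption).
ANSWER_PROBS = {1: 0.05, 2: 0.15, 3: 0.30, 4: 0.35, 5: 0.15}


def sample_answer(rng: random.Random) -> int:
    """Draw one answer on the 1-5 scale from the assumed distribution."""
    points = list(ANSWER_PROBS)
    weights = list(ANSWER_PROBS.values())
    return rng.choices(points, weights=weights)[0]


def fixed_seed_runs(seed: int, n: int) -> list[int]:
    """Re-seed before every call, so each run replays the same single draw."""
    return [sample_answer(random.Random(seed)) for _ in range(n)]


def repeated_sampling(seed: int, n: int) -> list[int]:
    """Seed once, then keep drawing, which exposes the underlying spread."""
    rng = random.Random(seed)
    return [sample_answer(rng) for _ in range(n)]


replayed = fixed_seed_runs(seed=0, n=100)
varied = repeated_sampling(seed=0, n=100)

print("distinct answers with fixed seed:", len(set(replayed)))
print("answer counts with repeated sampling:", Counter(varied))
```

The fixed-seed runs agree with each other perfectly, yet they only repeat the first draw of the varied sequence. Perfect agreement across runs therefore says nothing about how wide the distribution behind that answer is.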

There's also a structural reason the aggregate can lie *by design*. A single model trained on pooled preferences literally cannot represent a 51-49 disagreement: it must either always side with the majority or please everyone half the time (Can aggregate reward models satisfy genuinely disagreeing users?). The aggregate looks coherent precisely because it has erased the fine-grained variance that would make it unreliable. Relatedly, simulations look most competent exactly where they cheat: LLMs handle social scenarios well when one model puppets every party, then fail once agents must hold private information (Why do LLMs fail when simulating agents with private information?). The apparent population-level fluency rests on grounding work skipped at the individual level.
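The 51-49 arithmetic can be worked through directly. The sketch below assumes a single policy that serves option A with probability `p_choose_a` to a population in which 51% prefer A and 49% prefer B; the function name and parameters are hypothetical.

```python
def expected_unhappy(p_choose_a: float, share_a: float = 0.51) -> tuple[float, float, float]:
    """Return how often A-fans, B-fans, and the whole population are unhappy.

    p_choose_a: probability that the single model serves option A.
    share_a: fraction of users who prefer A (the rest prefer B).
    """
    unhappy_a = 1 - p_choose_a  # A-fans are unhappy whenever B is served
    unhappy_b = p_choose_a      # B-fans are unhappy whenever A is served
    overall = share_a * unhappy_a + (1 - share_a) * unhappy_b
    return unhappy_a, unhappy_b, overall


# Option 1: always side with the majority.
majority = expected_unhappy(p_choose_a=1.0)
# Option 2: split the difference with a coin flip.
coin_flip = expected_unhappy(p_choose_a=0.5)

print("always majority (A-fans, B-fans, overall):", majority)
print("coin flip       (A-fans, B-fans, overall):", coin_flip)
```

Always siding with the majority leaves the 49% unhappy every time, while the coin flip leaves everyone unhappy half the time. Overall dissatisfaction is about the same either way (roughly 49% versus 50%), so no single setting of the policy represents the disagreement itself.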

The takeaway a curious reader might not expect: aggregate realism and fine-grained unreliability don't just coexist. The first can create the illusion that the second has been solved. Use these simulations the way the causal-models paper recommends (to read the direction of an effect, generate hypotheses, and rank options) and treat any specific magnitude, individual respondent, or minority signal as something you still have to verify against humans. Crowdsourced preference at scale works for the same reason: it's the diverse aggregate that's trustworthy, validated against expert raters, not any one vote (Can crowdsourced votes reliably rank language models?).


Sources (7 notes)

Can structural causal models automate social science with language models?

LLMs guided by structural causal models can propose and test causal hypotheses across negotiation, bail, interview, and auction scenarios. Simulations reveal effect directions reliably but not magnitudes, making them useful for directional social science.

Why do LLMs give unrealistic survey responses?

Semantic Similarity Rating—prompting for text then mapping to scales via embeddings—achieves 90% of human test-retest reliability with realistic distributions. Pathological skew and over-positivity disappear when output channels change, proving these are measurement artifacts, not intrinsic failures.

Does setting temperature to zero actually make LLM outputs reliable?

Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.

Do all annotation responses measure the same underlying thing?

Behavioral science reveals that annotations contain genuine preferences, non-attitudes, and constructed preferences—distinguishable by consistency across measurement conditions. Treating them uniformly contaminates reward model training and downstream alignment.

Can aggregate reward models satisfy genuinely disagreeing users?

Single reward models trained on aggregated preferences cannot represent disagreement. A 51-49 preference split forces a choice between leaving 49% unhappy always or leaving everyone unhappy half the time. This is a representational failure, not a quality problem.

Why do LLMs fail when simulating agents with private information?

Research shows LLMs perform well when one model controls all interlocutors but fail systematically when agents possess private information. This reveals that apparent social competence relies on grounding work that models skip in omniscient settings.

Can crowdsourced votes reliably rank language models?

Chatbot Arena's 240K+ crowdsourced preference votes produce credible model rankings because the underlying questions are diverse and discriminating, and crowd judgments correlate with expert raters—validating human preference as a scalable evaluation signal.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question: **Can aggregate survey realism coexist with unreliable fine-grained effects in LLM simulations?** — and if so, what mechanism enables it?

What a curated library found — and when (dated claims, not current truth): Findings span 2024–2026.
• Structural causal models let LLMs recover effect *directions* reliably (~2024) but not magnitudes; aggregate distributions match human patterns (~90% of human test-retest reliability) while individual respondent fidelity breaks (2024–2025).
• Measurement artifacts drive pathological skew: free-text elicitation + embedding-based scale mapping recovers realistic aggregate distributions; this says nothing about fine-grained signal fidelity (2024).
• Consistency ≠ reliability: zero-temperature determinism replays one draw; repeated sampling reveals the gap (2024).
• Annotation responses decompose into three signal types (genuine preferences, non-attitudes, constructed-on-the-spot); aggregation washes them into plausible means while individual signals stay noisy (2025).
• Omniscient multi-agent social simulation fails under information asymmetry: one model puppeting all parties achieves apparent fluency; private information breaks the illusion (2024).

Anchor papers (verify; mind their dates):
• 2404.11794 (Automated Social Science, Apr 2024)
• 2403.05020 (Misleading Success of Simulating Social Interaction, Mar 2024)
• 2604.03238 (Measuring Human Preferences in RLHF as Social Science, Jan 2026)
• 2605.28388 (Mechanistically Interpreting Sample Difficulty in RLVR, May 2026)

Your task:
(1) **RE-TEST each constraint.** For every finding above — causal-model directionality, distribution matching, consistency-reliability decoupling — judge whether scaling reasoning models (2505.21444, 2510.13786), better preference elicitation (2503.17338), or new evaluation harnesses (2506.09038 on unanswerable questions, 2412.12509 on LLM-as-judge reliability) have since relaxed or overturned it. Separate the durable question (does aggregation mask fine-grained noise?) from perishable limitations (can newer RL training recover individual fidelity?). Say plainly where constraints still appear to hold.
(2) **Surface the strongest CONTRADICTING or SUPERSEDING work** from the last ~6 months that claims either (a) fine-grained reliability is recoverable via scaling or training, or (b) aggregate realism itself is less robust than the 2024 library suggested.
(3) **Propose 2 research questions** that assume the regime may have moved: e.g., *Does mechanistic interpretability of preference decomposition (2605.28388) enable per-respondent signal recovery?* or *Can multi-agent architectures with true information asymmetry (vs. omniscient pooling) achieve both aggregate and fine-grained fidelity?*

Cite arXiv IDs; flag anything you cannot ground in a real paper.
