SYNTHESIS NOTE
Psychology, Society, and Alignment

Can AI agents learn people better from interviews than surveys?

Can rich interview transcripts seed more accurate generative agents than demographic data or survey responses? This matters because it challenges how we build digital simulations of real people.

Synthesis note · 2026-02-22 · sourced from Personas Personality

The Generative Agent Simulations study (Park et al.) created agents for 1,052 real individuals using voice-to-voice interview transcripts averaging 6,491 words. When tested on the General Social Survey, these interview-based agents reached 85% normalized accuracy: they predicted participants' responses 85% as accurately as participants replicated their own answers two weeks later.
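The normalization can be made concrete with a small sketch. It assumes normalized accuracy is the agent's raw agreement with a participant's first-session answers, divided by the participant's own agreement between the first session and the two-week retest. The function names and the response data are hypothetical and purely illustrative; they are not taken from the study.

```python
from typing import Sequence


def raw_accuracy(predicted: Sequence, reference: Sequence) -> float:
    """Fraction of survey items on which two response lists agree."""
    if len(predicted) != len(reference) or len(reference) == 0:
        raise ValueError("response lists must be non-empty and equal in length")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


def normalized_accuracy(agent: Sequence, wave1: Sequence, wave2: Sequence) -> float:
    """Agent accuracy divided by the participant's own two-week consistency."""
    self_consistency = raw_accuracy(wave2, wave1)
    if self_consistency == 0:
        raise ValueError("participant self-consistency is zero; ratio undefined")
    return raw_accuracy(agent, wave1) / self_consistency


# Hypothetical responses to ten categorical items (illustrative, not study data).
wave1 = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]  # participant, first session
wave2 = [1, 2, 3, 4, 5, 1, 2, 3, 1, 1]  # participant, two weeks later (8/10 match)
agent = [1, 2, 3, 4, 5, 1, 3, 1, 2, 2]  # agent predictions (6/10 match)

score = normalized_accuracy(agent, wave1, wave2)
print(f"normalized accuracy: {score:.2f}")
```

In this toy case the agent matches 6 of 10 items and the participant matches their own earlier answers on 8 of 10, so the ratio is 0.75. Dividing by self-consistency means an agent is not penalized for items on which the person is inconsistent with themselves.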

The critical finding is what drives this accuracy. Three ablation conditions isolate the mechanism:

  1. Summary agents, built from bullet-pointed factual dictionaries that strip out linguistic features, still achieved 83% accuracy. This suggests that content richness, not linguistic nuance, is the primary driver.
  2. Random lesion agents, built after randomly removing 80% of the interview (96 of 120 minutes), still reached 79% and outperformed composite agents. Even a short interview contains enough richness.
  3. Maximal agents, which add survey and experiment data on top of interviews, showed no improvement (85%). Surveys add little predictive power beyond what interviews already capture.
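The random lesion condition can be sketched as follows. This is a minimal illustration that assumes the transcript is split into equal one-minute segments and that segments are dropped uniformly at random; the unit of removal, the `random_lesion` name, and the seed parameter are assumptions, not details confirmed by the source.

```python
import random
from typing import List


def random_lesion(segments: List[str], remove_fraction: float = 0.8, seed: int = 0) -> List[str]:
    """Drop a random fraction of segments, keeping the survivors in original order."""
    if not 0.0 <= remove_fraction <= 1.0:
        raise ValueError("remove_fraction must be between 0 and 1")
    n_remove = round(len(segments) * remove_fraction)
    rng = random.Random(seed)  # seeded so the lesion is reproducible
    removed = set(rng.sample(range(len(segments)), n_remove))
    return [s for i, s in enumerate(segments) if i not in removed]


# Hypothetical 120-minute interview, one placeholder segment per minute.
minutes = [f"minute {i + 1} of interview" for i in range(120)]
lesioned = random_lesion(minutes)
print(len(minutes) - len(lesioned), "segments removed;", len(lesioned), "kept")
```

With 120 segments and a removal fraction of 0.8, 96 segments are removed and 24 remain, matching the "96 of 120 minutes" figure above.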

The architecture matters too: an "expert reflection" module prompts the model to generate reflections from four domain expert personas (psychologist, behavioral economist, political scientist, demographer), then routes questions to the most relevant expert. This structured multi-perspective synthesis extracts more from the same interview data than generic reflection.
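A rough sketch of how such a reflect-then-route flow might be wired together is shown below. Everything here is an assumption for illustration: the prompt wording, the function names, the fallback rule, and the `fake_llm` stand-in are hypothetical and do not reproduce the study's actual prompts or code. A real model call would replace `fake_llm`.

```python
from typing import Callable, Dict

EXPERTS = ("psychologist", "behavioral economist", "political scientist", "demographer")


def build_reflections(transcript: str, llm: Callable[[str], str]) -> Dict[str, str]:
    """Generate one set of reflections per expert persona for a participant."""
    return {
        expert: llm(f"REFLECT as a {expert} on this interview:\n{transcript}")
        for expert in EXPERTS
    }


def route_question(question: str, llm: Callable[[str], str]) -> str:
    """Ask the model which expert is most relevant; fall back to the first one."""
    choice = llm(f"ROUTE: which of {', '.join(EXPERTS)} best answers: {question}")
    choice = choice.strip().lower()
    return choice if choice in EXPERTS else EXPERTS[0]


def build_agent_prompt(question: str, transcript: str,
                       reflections: Dict[str, str], llm: Callable[[str], str]) -> str:
    """Combine the transcript with the routed expert's reflections."""
    expert = route_question(question, llm)
    return (
        f"Interview transcript:\n{transcript}\n\n"
        f"Reflections from a {expert}:\n{reflections[expert]}\n\n"
        f"Answer as the interviewee would: {question}"
    )


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model call so the sketch runs offline."""
    if prompt.startswith("ROUTE:"):
        return "Political Scientist"
    return "(reflection text)"


transcript = "(interview transcript goes here)"
reflections = build_reflections(transcript, fake_llm)
prompt = build_agent_prompt("How much do you trust Congress?", transcript, reflections, fake_llm)
print(prompt)
```

The design choice worth noting is that reflections are generated once per participant and reused, while routing happens per question, so only the most relevant expert's synthesis accompanies the transcript at answer time.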

The implication challenges the dominant approach of seeding agents with demographic attributes or short persona descriptions. Those approaches achieve much lower fidelity because they provide taxonomic labels rather than the rich situational detail that interviews capture. In light of the related note "Why do LLM persona prompts produce inconsistent outputs across runs?", the key difference may be that interviews provide enough specific content to anchor the model's output distribution, while short persona descriptions leave too much to the model's uncertain defaults.

However, in view of the related note "How do we generate realistic personas at population scale?", even 85% fidelity at the individual level may not translate to valid population-level simulation without calibration.

A related but distinct evaluation methodology, the Turing Experiment (TE), takes the complementary approach of replicating well-established findings from prior human subject research rather than predicting individual-level responses. TEs reveal a specific distortion, "hyper-accuracy", in which some models (including ChatGPT and GPT-4) produce systematically more accurate crowd-wisdom estimates than representative human samples would. This connects to the related note "Can AI systems learn social norms without embodied experience?": LLMs can systematically exceed human accuracy on collective tasks, which paradoxically makes them worse simulacra of representative human populations. High individual accuracy can mask poor population-level representativeness.

Original note title

interview-based generative agents replicate human responses 85 percent as accurately as humans replicate themselves — content richness not linguistic style is the primary driver