Can AI agents learn people better from interviews than surveys?
Can rich interview transcripts seed more accurate generative agents than demographic data or survey responses? This matters because it challenges how we build digital simulations of real people.
The Generative Agent Simulations study (Park et al.) created agents for 1,052 real individuals using voice-to-voice interview transcripts averaging 6,491 words. When tested on the General Social Survey, these interview-based agents reached a normalized accuracy of 85%: they matched participants' responses 85% as accurately as participants replicated their own answers two weeks later.
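The normalization can be made concrete with a minimal sketch. It assumes that normalized accuracy is the agent's agreement with a participant's first-wave answers divided by that participant's own agreement between the two waves; the function names and the toy answers below are hypothetical and are not taken from the study.

```python
from typing import Sequence


def agreement(a: Sequence[str], b: Sequence[str]) -> float:
    """Fraction of survey items on which two answer lists match."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("answer lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def normalized_accuracy(
    agent: Sequence[str], wave1: Sequence[str], wave2: Sequence[str]
) -> float:
    """Agent accuracy divided by the participant's own test-retest consistency."""
    self_consistency = agreement(wave1, wave2)
    if self_consistency == 0:
        raise ValueError("participant never repeated an answer; ratio undefined")
    return agreement(agent, wave1) / self_consistency


# Toy data (hypothetical): ten categorical answers per list.
wave1 = ["A", "B", "A", "B", "C", "A", "B", "C", "A", "B"]  # first session
wave2 = ["A", "B", "A", "B", "C", "A", "B", "C", "B", "A"]  # two weeks later
agent = ["B", "A", "A", "B", "C", "A", "B", "C", "B", "A"]  # agent's predictions

score = normalized_accuracy(agent, wave1, wave2)  # 0.6 / 0.8, about 0.75
```

Dividing by self-consistency means a score of 1.0 would indicate that the agent predicts a person as well as that person predicts themselves, not that the agent is right on every item.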
The critical finding is what drives this accuracy. Three ablation conditions isolate the mechanism:
- Summary agents, built from bullet-pointed factual dictionaries that strip out linguistic features, still achieved 83% accuracy. This suggests that content richness, not linguistic nuance, is the primary driver.
- Random lesion agents, with 80% of the interview removed (96 of 120 minutes), still reached 79% and outperformed composite agents. Even a short interview contains enough richness.
- Maximal agents, which add survey and experiment data on top of interviews, showed no improvement (85%). Surveys add no measurable predictive power beyond what interviews already capture.
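As an illustration of the random lesion condition, here is a minimal sketch. It assumes the lesion drops question-answer exchanges uniformly at random and keeps the survivors in their original order; the note does not specify the exact procedure, and `random_lesion` and the toy transcript are hypothetical.

```python
import random
from typing import List, Tuple

Exchange = Tuple[str, str]  # (interviewer question, participant answer)


def random_lesion(
    exchanges: List[Exchange], remove_fraction: float = 0.8, seed: int = 0
) -> List[Exchange]:
    """Drop a random fraction of exchanges, preserving the order of the rest."""
    if not 0.0 <= remove_fraction < 1.0:
        raise ValueError("remove_fraction must be in [0, 1)")
    if not exchanges:
        return []
    rng = random.Random(seed)  # seeded so the lesion is reproducible
    n_keep = max(1, round(len(exchanges) * (1 - remove_fraction)))
    kept_indices = sorted(rng.sample(range(len(exchanges)), n_keep))
    return [exchanges[i] for i in kept_indices]


# Toy transcript (hypothetical): ten numbered exchanges.
transcript = [(f"question {i}", f"answer {i}") for i in range(10)]
lesioned = random_lesion(transcript, remove_fraction=0.8, seed=42)
```

With ten exchanges and the default 80% removal, two exchanges survive, which is the kind of shortened interview the lesion condition evaluates.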
The architecture matters too: an "expert reflection" module prompts the model to generate reflections from four domain expert personas (psychologist, behavioral economist, political scientist, demographer), then routes questions to the most relevant expert. This structured multi-perspective synthesis extracts more from the same interview data than generic reflection.
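The reflect-then-route flow can be sketched as follows. This is a hedged illustration, not the study's implementation: the prompt wording, the `LLM` callable type, the fallback rule, and `stub_llm` are all assumptions introduced here, and only the four expert personas come from the note.

```python
from typing import Callable, Dict

# The four personas named in the note.
EXPERTS = ("psychologist", "behavioral economist", "political scientist", "demographer")

LLM = Callable[[str], str]  # hypothetical: any function mapping a prompt to text


def build_reflections(transcript: str, llm: LLM) -> Dict[str, str]:
    """Ask the model for one set of reflections per expert persona."""
    return {
        expert: llm(f"As a {expert}, note what this interview reveals:\n{transcript}")
        for expert in EXPERTS
    }


def route(question: str, llm: LLM) -> str:
    """Ask the model which expert fits the question; fall back to the first."""
    prompt = (
        f"Which expert is most relevant to this question: {question}\n"
        f"Options: {', '.join(EXPERTS)}. Reply with the name only."
    )
    choice = llm(prompt).strip().lower()
    return choice if choice in EXPERTS else EXPERTS[0]


def answer(question: str, transcript: str, reflections: Dict[str, str], llm: LLM) -> str:
    """Answer as the interviewee, using only the routed expert's reflections."""
    expert = route(question, llm)
    prompt = (
        f"Interview transcript:\n{transcript}\n\n"
        f"Notes from the {expert}:\n{reflections[expert]}\n\n"
        f"Answer as the interviewee: {question}"
    )
    return llm(prompt)


def stub_llm(prompt: str) -> str:
    """Stand-in model so the sketch runs offline; not a real API."""
    if prompt.startswith("Which expert"):
        return "Political Scientist\n"
    return f"[model output for: {prompt[:30]}]"
```

Passing the model in as a plain callable keeps the sketch independent of any particular provider; in practice `stub_llm` would be replaced by a real model call.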
The implication challenges the dominant approach of seeding agents with demographic attributes or short persona descriptions. Those approaches achieve much lower fidelity because they provide taxonomic labels rather than the rich situational detail that interviews capture. Since thin persona prompts produce inconsistent outputs across runs (see Why do LLM persona prompts produce inconsistent outputs across runs?), the key difference may be that interviews provide enough specific content to anchor the model's output distribution, while short persona descriptions leave too much to the model's uncertain defaults.
However, as discussed in How do we generate realistic personas at population scale?, even 85% fidelity at the individual level may not translate to valid population-level simulation without calibration.
A related but distinct evaluation methodology, the Turing Experiment (TE), takes the complementary approach of replicating well-established findings from prior human subject research rather than predicting individual-level responses. TEs reveal a specific distortion, "hyper-accuracy", in which some models (including ChatGPT and GPT-4) produce systematically more accurate crowd-wisdom estimates than representative human samples would. This connects to the note Can AI systems learn social norms without embodied experience?: LLMs can systematically exceed human accuracy on collective tasks, which paradoxically makes them worse simulacra of representative human populations. High individual accuracy can mask poor population-level representativeness.
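One way to check a simulated sample for hyper-accuracy is to compare its typical error against a human sample's typical error on the same estimation task. The sketch below is an assumption about how such a check could look, not the procedure used in the Turing Experiment work; the function names and all numbers are hypothetical.

```python
from statistics import median
from typing import Sequence


def median_abs_error(estimates: Sequence[float], truth: float) -> float:
    """Median distance between individual estimates and the true value."""
    return median(abs(e - truth) for e in estimates)


def hyper_accuracy_ratio(
    human: Sequence[float], simulated: Sequence[float], truth: float
) -> float:
    """Human error divided by simulated error; values well above 1 suggest
    the simulated respondents are more accurate than real people would be."""
    sim_err = median_abs_error(simulated, truth)
    if sim_err == 0:
        return float("inf")
    return median_abs_error(human, truth) / sim_err


# Toy estimation task (hypothetical numbers): the true quantity is 100.
truth = 100.0
human_estimates = [40.0, 70.0, 90.0, 130.0, 200.0]      # wide, noisy spread
simulated_estimates = [95.0, 98.0, 100.0, 103.0, 105.0]  # tightly clustered

ratio = hyper_accuracy_ratio(human_estimates, simulated_estimates, truth)
```

A ratio near 1 would indicate that simulated respondents err about as much as people do; a large ratio flags a sample that is too accurate to stand in for a representative population.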
Inquiring lines that use this note as a source 5
This note is a source for these synthesized inquiries.
- Does adding survey data to interviews improve agent accuracy further?
- Why do short interviews outperform demographic labels for persona simulation?
- Can individually accurate agents still fail at population-level representation?
- How much does interview richness matter compared to model capability for persona accuracy?
- How does AI persona fidelity compare to interview-based generative agents?
Related concepts in this collection 5
- Why do LLM persona prompts produce inconsistent outputs across runs?
  Can language models reliably simulate different social perspectives through persona prompting, or does their run-to-run variance indicate they lack stable group-specific knowledge? This matters for whether LLMs can approximate human disagreement in annotation tasks.
  unstable under thin persona prompts; interviews may provide enough anchoring content to overcome this
- Can AI systems learn social norms without embodied experience?
  Large language models exceed individual human accuracy at predicting collective social appropriateness judgments. Does this reveal that embodied experience is unnecessary for cultural competence, or do systematic AI failures point to limits of statistical learning?
  both findings show LLMs can approximate human responses without lived experience, but through different mechanisms
- Why do LLMs fail when simulating agents with private information?
  Explores whether single-model control of all social participants masks fundamental limitations in how LLMs handle information asymmetry and genuine uncertainty about others' knowledge.
  simulation fidelity measured under omniscient conditions may overstate real-world applicability
- What makes linguistic agency impossible for language models?
  From an enactive perspective, does linguistic agency require embodied participation and real stakes that LLMs fundamentally lack? This matters because it challenges whether LLMs can truly engage in language or only generate text.
  the 85% fidelity from text-only interview transcripts empirically challenges the strong embodiment requirement for social simulation, though the enactive view would note that the interview itself was an embodied interaction whose residue the text merely captures
- Can AI learn social norms better than humans?
  Explores whether large language models can predict cultural appropriateness more accurately than individual humans, and what this reveals about how social knowledge is transmitted and learned.
  complementary evidence for the same meta-argument: social norm prediction at the 100th percentile and interview-based response replication at 85% together suggest that text-based learning approximates embodied social knowledge across multiple task types
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Generative Agent Simulations of 1,000 People
- Measuring Agents in Production
- Fine-tuning Language Models for Factuality
- Persona Generators: Generating Diverse Synthetic Personas at Scale
- From speaking like a person to being personal: The effects of personalized, regular interactions with conversational agents
- Proactive Human-Machine Conversation with Explicit Conversation Goals
- Linguistic markers of inherently false AI communication and intentionally false human communication: Evidence from hotel reviews
- Synthetic Dialogue Dataset Generation using LLM Agents
Original note title
interview-based generative agents replicate human responses 85 percent as accurately as humans replicate themselves — content richness not linguistic style is the primary driver