INQUIRING LINE

Does adding survey data to interviews improve agent accuracy further?

This explores whether layering structured survey responses on top of open-ended interviews makes AI agents better at predicting how real people answer — and the corpus suggests the answer hinges less on stacking data sources than on what kind of information actually drives accuracy.


The most direct evidence in the collection points to a more interesting question underneath this one. The flagship study built generative agents from two-hour voice interviews with 1,052 people and found that the agents replicated participants' own survey answers about 85% as well as those people replicated themselves on retest (see "Can AI agents learn people better from interviews than surveys?"). The striking detail is *why*: accuracy was driven by factual content, not linguistic style, and even when the rich interview was compressed to summary bullet points, fidelity dropped only to 83%. That tells you the signal lives in substantive personal information, not in conversational texture. So the real question isn't "interview plus survey," it's "does the second source add factual content the first one missed?" If your survey captures attitudes the interview already surfaced, you're adding redundancy, not accuracy.
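To make the "85% as well as" figure concrete, here is a minimal Python sketch of a normalized accuracy score: the agent's agreement with a participant's original answers, divided by the participant's own agreement with themselves on retest. The function names and the toy answers are hypothetical illustrations, not code or data from the study.

```python
def match_rate(reference, candidate):
    """Fraction of items where candidate gives the same answer as reference."""
    if len(reference) != len(candidate):
        raise ValueError("answer lists must have the same length")
    if not reference:
        raise ValueError("answer lists must not be empty")
    matches = sum(r == c for r, c in zip(reference, candidate))
    return matches / len(reference)


def normalized_accuracy(original, retest, agent):
    """Agent accuracy relative to the person's own retest consistency."""
    self_consistency = match_rate(original, retest)
    if self_consistency == 0:
        raise ValueError("retest consistency is zero; ratio undefined")
    return match_rate(original, agent) / self_consistency


# Hypothetical toy data: one participant, ten categorical survey items.
original = ["a", "b", "c", "a", "d", "b", "a", "c", "d", "b"]
retest   = ["a", "b", "c", "a", "d", "b", "a", "c", "a", "c"]  # 8 of 10 match
agent    = ["a", "b", "c", "a", "d", "b", "b", "d", "a", "b"]  # 7 of 10 match

score = normalized_accuracy(original, retest, agent)
print(f"normalized accuracy: {score:.3f}")
```

A ratio near 1.0 means the agent agrees with the person about as often as the person agrees with themselves. The ceiling is set by human retest noise, not by 100% raw agreement.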

That shift changes the whole debate. A separate line of work shows AI personas reproduce about 76% of published experimental main effects, with success closely tracking how strong the original result was (by its p-value), and unreliable performance on marginal effects, where the personas generate both false positives and false negatives (see "Can AI personas reliably replicate human experiment results?"). The lesson that travels across both studies: agents are good at recovering robust, well-evidenced signal and shaky at the margins. Adding a survey helps to the exact extent it strengthens weak or missing signal; it won't rescue genuinely ambiguous cases.
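The "about 76%" figure is simple arithmetic on the counts given in the source note further down (84 of 111 main effects). A quick Python check, with variable names chosen here only for illustration:

```python
# Counts as stated in the source note: 84 of 111 main effects reproduced.
replicated = 84
total = 111

rate = replicated / total
print(f"replication rate: {rate:.1%}")
```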

There's also a quieter cost the collection flags. Persona-driven agents drift over multi-turn interaction, losing consistency within turns, across conversations, and through outright factual contradictions, and reducing that drift took dedicated reinforcement training, not more profile data (see "Can training user simulators reduce persona drift in dialogue?"). More input fields can actually widen the surface for contradiction. And work on grounding personas in real source documents found that *where* the persona comes from (real stakeholder perspectives vs. arbitrary roles) matters more for generalization than how many attributes you pile on (see "Can personas extracted from documents generalize across evaluation tasks?").

If you zoom out, the collection keeps returning to a theme: agent reliability comes from how information is structured and externalized, not from sheer volume of it (see "Where does agent reliability actually come from?"). So the honest answer to your question is that the collection doesn't have a head-to-head test of interview-plus-survey vs. interview-alone, but it strongly suggests the likely result. Surveys would help only as a vehicle for new *factual* content; the interview already extracts most of that signal (which is why even its bullet-point summary holds up); and the marginal gain shrinks fast while the drift and contradiction risks grow. The thing you didn't know you wanted to know: the interview's edge isn't that it's a conversation. It's that talking gets people to volunteer facts a survey form never thought to ask for.


Sources (5 notes)

Can AI agents learn people better from interviews than surveys?

A 1,052-person study found agents built from voice interviews replicated participant responses nearly as well as people replicate their own answers. Factual content, not linguistic style, drove this accuracy—even summary bullet points retained 83% fidelity.

Can AI personas reliably replicate human experiment results?

Viewpoints AI reproduced 84 of 111 main effects from Journal of Marketing experiments with replication success strongly correlated to original p-value strength. Marginal effects showed unreliable performance with both false positives and negatives.

Can training user simulators reduce persona drift in dialogue?

Inverting the standard RL setup to train user simulators for consistency, with three complementary metrics (prompt-to-line, line-to-line, Q&A consistency) as reward signals, reduces persona drift by over 55%. This approach captures distinct failure types: local drift within turns, global drift across conversations, and factual contradictions.

Can personas extracted from documents generalize across evaluation tasks?

MAJ-EVAL automatically extracts stakeholder personas from domain documents via semantic clustering and orchestrates structured three-phase debate, achieving reproducible evaluation that transfers across tasks like summarization and dialogue without manual redesign. The approach grounds personas in real stakeholder perspectives rather than arbitrary roles.

Where does agent reliability actually come from?

Research shows reliable LLM agents externalize three cognitive burdens—memory (state persistence), skills (procedural components), and protocols (structured interaction)—into a harness layer rather than relying on model scale alone. The harness unifies these externalities and eliminates the need for the model to solve the same problems repeatedly.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a researcher stress-testing claims about agent accuracy in persona simulation. The question: does adding survey data to interviews improve generative agent accuracy further?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026. A curated library identified these constraints:
• Interview-based agents replicate human survey answers ~85% as well as humans replicate themselves; fidelity drops only to 83% when compressed to bullet-point summaries, suggesting the signal is factual content, not linguistic style (~2024-11).
• LLM personas reproduce ~76% of published experimental main effects but are unreliable on marginal effects, generating both false positives and false negatives; success correlates closely with the strength of the original result, as measured by its p-value (~2024-08).
• Persona consistency degrades over multi-turn interaction (drift within and across conversations, factual contradictions); dedicated RL, not additional profile data, reduced drift by over 55% (~2025-10).
• Persona sourcing matters more than attribute volume: real stakeholder perspectives generalize better than arbitrary roles; more input fields expand contradiction surface (~2024-11).
• Agent reliability stems from information structure and externalization (memory, skills, protocols) rather than input volume (~2026-04).

Anchor papers (verify; mind their dates):
• arXiv:2411.10109 (2024-11): Generative Agent Simulations of 1,000 People
• arXiv:2408.16073 (2024-08): Using Large Language Models to Create AI Personas for Replication and Prediction
• arXiv:2511.00222 (2025-10): Consistently Simulating Human Personas with Multi-Turn Reinforcement Learning
• arXiv:2604.08224 (2026-04): Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness E

Your task:
(1) RE-TEST EACH CONSTRAINT. Does newer training, orchestration (memory architectures, caching, retrieval-augmented persona grounding), or evaluation harnesses now allow agents to extract marginal effects reliably, reduce multi-turn drift without RL, or handle input-field growth without contradiction bloom? Separate the durable question (interview vs. survey as information sources) from perishable limitations (drift, marginal-effect blindness) that orchestration or scaling may have relaxed. Cite what resolved it.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months that challenges the finding that surveys add little beyond interviews, or that information structure outweighs input volume.
(3) Propose 2 research questions that assume the regime may have shifted: (a) If orchestration can now prevent multi-turn drift, does survey + interview synergy emerge? (b) Do smaller, finer-grained survey items (micro-surveys tied to specific interview topics) outperform bulk survey intake?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
