Can agent-based simulators replace real-user A/B testing for studying recommendation system harms?
This explores whether populations of LLM-driven agents, simulated people reacting to a recommender, can stand in for live A/B tests when the outcome under study is harm (filter bubbles, fatigue, manipulation) rather than engagement. The corpus says: for some harms, surprisingly yes, but the substitution holds only where the effect is strong and the simulation isn't quietly cheating.
The strongest case for replacement is exactly the recommendation-harm setting. Agent4Rec runs 1,000 LLM agents seeded from real behavioral data, each with a taste model and an emotion module, and reproduces filter-bubble dynamics and user fatigue through causal interventions you could never ethically or cheaply run on real people: turn off diversity, watch the bubble tighten (see "Can LLM agents realistically simulate filter bubble effects in recommendations?"). The appeal isn't only cost; it's that you can manipulate the system deliberately to surface harm, which an A/B test on live users can't do without exposing them to it. The same cost-collapse logic shows up in a neighboring domain: LLMs simulating search engines well enough to train agents, with mid-size simulators matching or beating the real thing (see "Can LLMs replace search engines during agent training?").
But fidelity is conditional, and the corpus is unusually precise about where it breaks. When AI personas were tested against 111 published human experiments, they replicated 76% of main effects (84 of 111), and replication tracked the original p-value: simulators reproduce strong, robust effects but get marginal ones wrong in both directions, inventing effects that aren't there and missing ones that are (see "Can AI personas reliably replicate human experiment results?"). Harms are often the marginal, long-tail, rare-configuration effects, which is exactly the regime where simulators are least trustworthy.
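One way to check whether your own simulator shows the same pattern is to bin replication outcomes by the strength of the original effect. The sketch below is a minimal illustration, not the method used in the replication study: the function name, the bin edges, and the record format (original p-value, whether the simulator reproduced the effect) are all assumptions, and the records in the usage lines are made up for demonstration.

```python
from bisect import bisect_right


def replication_rate_by_p_bin(records, edges=(0.001, 0.01, 0.05)):
    """Group (original_p_value, replicated) pairs into p-value bins.

    Hypothetical helper: `records` is any iterable of (float, bool) pairs,
    and `edges` are assumed bin boundaries in ascending order.
    Returns {bin_label: replication_rate}, with None for empty bins.
    """
    labels = (
        [f"p < {edges[0]}"]
        + [f"{lo} <= p < {hi}" for lo, hi in zip(edges, edges[1:])]
        + [f"p >= {edges[-1]}"]
    )
    counts = [[0, 0] for _ in labels]  # [replicated, total] per bin
    for p_value, replicated in records:
        i = bisect_right(edges, p_value)  # index of the bin containing p_value
        counts[i][0] += int(replicated)
        counts[i][1] += 1
    return {
        label: (hits / total if total else None)
        for label, (hits, total) in zip(labels, counts)
    }


# Illustrative, made-up records: strong effects replicate, marginal ones are mixed.
demo_records = [
    (0.0005, True), (0.0004, True),
    (0.02, True), (0.03, False), (0.04, False), (0.049, True),
]
demo_rates = replication_rate_by_p_bin(demo_records)
```

If the rate falls as the bins move toward p = 0.05, the simulator is behaving the way the corpus describes, and any harm whose real-world evidence sits in those weaker bins deserves a live test rather than a simulated one.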
Two more findings sharpen the warning. Simulations can look competent while skipping the hard part: LLMs do well when one model secretly controls all the agents, but fail systematically once agents hold private information they must reason about. The apparent social competence was grounding work the model never actually did (see "Why do LLMs fail when simulating agents with private information?"). A recommender-harm sim where the agent already "knows" what it's supposed to discover is measuring its own assumptions. And what you optimize the synthetic population for matters: maximizing trait coverage (including rare, consequential user types) beats matching the average user distribution, because the users who get harmed are often the unusual ones a density-matched sample erases (see "persona-diversity-optimization-should-maximize-support-coverage-not-different-density-matc").
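The coverage point can be made concrete with a little arithmetic. If a user type has prevalence p and you draw N agents independently from the real distribution, the chance that the type appears at all is 1 - (1 - p)^N. The sketch below works that out and contrasts it with a simple floor-based allocation. This is an illustration under stated assumptions, not the evolutionary optimization described in the source note: the function names, the type names, and the prevalence figures are all hypothetical.

```python
def prob_type_present(prevalence: float, n_agents: int) -> float:
    """Chance that at least one of n_agents independent draws has the type."""
    return 1.0 - (1.0 - prevalence) ** n_agents


def coverage_first_allocation(
    prevalence: dict[str, float], n_agents: int, floor: int
) -> dict[str, int]:
    """Give every type at least `floor` agents, then split the rest by prevalence.

    Hypothetical stand-in for a coverage-maximizing generator: it guarantees
    support for rare types at the cost of no longer matching real densities.
    """
    if floor * len(prevalence) > n_agents:
        raise ValueError("floor is too large for this population size")
    remaining = n_agents - floor * len(prevalence)
    total = sum(prevalence.values())
    alloc = {
        name: floor + int(remaining * p / total)
        for name, p in prevalence.items()
    }
    # Rounding down can leave a few agents unassigned; give them to the commonest type.
    leftover = n_agents - sum(alloc.values())
    alloc[max(prevalence, key=prevalence.get)] += leftover
    return alloc


# Made-up prevalences for illustration only.
demo_prevalence = {"mainstream": 0.899, "niche": 0.1, "rare_vulnerable": 0.001}

# Density-matched: a 0.1% type shows up in a 1,000-agent sample only about 63% of the time.
p_present = prob_type_present(demo_prevalence["rare_vulnerable"], 1000)

# Coverage-first: the rare type is guaranteed at least 20 agents.
demo_alloc = coverage_first_allocation(demo_prevalence, n_agents=1000, floor=20)
```

Even when the rare type does appear in a density-matched sample, its expected count is one agent, which is too few to estimate a harm rate. A coverage-first population trades distributional realism for the ability to say anything about those users, so aggregate metrics from it need reweighting before they are compared with live traffic.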
The honest read: agent simulators are a powerful *pre-screen and counterfactual lab* for recommendation harms (cheap, manipulable, and ethically safe for the dangerous interventions) but not a replacement for the live test on the effects that matter most. They reliably catch the loud harms and the rare-user harms you design them to probe, while systematically under-detecting the subtle ones and inviting over-trust in any dynamic that depends on agents genuinely not knowing what real users wouldn't know. The thing worth carrying away: the question isn't "are simulated users realistic?" but "is this harm a strong effect or a marginal one?" That distinction predicts whether the simulator will tell you the truth.
Sources (5 notes)
Agent4Rec demonstrates that 1,000 LLM-empowered agents initialized from real behavioral datasets can model both taste-driven and emotion-driven actions, enabling researchers to study filter bubble effects and user fatigue through causal interventions without expensive human experiments.
ZeroSearch and SSRL demonstrate that LLMs can generate relevant documents and search results from internal knowledge, with 14B simulators matching or exceeding real search engines. Curriculum degradation and test-time scaling optimize this approach for training without API costs.
Viewpoints AI reproduced 84 of 111 main effects from Journal of Marketing experiments with replication success strongly correlated to original p-value strength. Marginal effects showed unreliable performance with both false positives and negatives.
Research shows LLMs perform well when one model controls all interlocutors but fail systematically when agents possess private information. This reveals that apparent social competence relies on grounding work that models skip in omniscient settings.
Evolutionary optimization of Persona Generator code achieves broader trait coverage than density-matched baselines, including rare but consequential user configurations that naive LLM prompting misses.