SYNTHESIS NOTE
Agentic Systems and Tool Use · Psychology, Society, and Alignment

Do phone agents succeed at all three critical tasks equally?

Explores whether task success, privacy compliance, and preference reuse develop together in phone-use agents, or whether benchmarking one capability tells you nothing about the others.

Synthesis note · 2026-05-18 · sourced from Assistants Personalization

The MyPhoneBench evaluation surfaces a finding with direct deployment consequences: the three properties that matter most when deploying phone-use agents (task success, privacy compliance during completion, and proper use of saved preferences in later sessions) are statistically distinct capabilities. No model dominates all three, and a model's score on one does not predict its scores on the others.

The pattern matters because of how benchmarks have been structured. Most agent benchmarks score task success alone: did the agent complete the task as instructed? Models that score well on this single metric get ranked as "frontier" and get deployed. But when the same models are scored jointly on success-plus-privacy or success-plus-preference-reuse, the ranking reshuffles. A model that wins on success-only may lose on success-with-privacy, because it completes tasks by overfilling personal entries, disclosing more personal data than the task requires. A model with mediocre success may show better privacy compliance because it stops at minimal disclosure.
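The reshuffling can be illustrated with a minimal sketch. The episode records, model names, and the `Episode` type below are invented for illustration and are not MyPhoneBench data or its scoring code; the sketch only assumes that each episode is labeled with whether the task succeeded and whether the agent stayed within minimal disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    success: bool     # task completed as instructed
    privacy_ok: bool  # no personal data disclosed beyond what the task needed


def success_rate(episodes):
    """Fraction of episodes where the task was completed."""
    return sum(e.success for e in episodes) / len(episodes)


def joint_rate(episodes):
    """Fraction of episodes that succeeded AND stayed privacy-compliant."""
    return sum(e.success and e.privacy_ok for e in episodes) / len(episodes)


# Hypothetical records for two hypothetical models (illustrative only).
runs = {
    "model_a": [
        Episode(True, False), Episode(True, False), Episode(True, True),
        Episode(True, True), Episode(False, True),
    ],
    "model_b": [
        Episode(True, True), Episode(True, True), Episode(True, True),
        Episode(False, True), Episode(False, True),
    ],
}


def rank(metric):
    """Model names ordered from best to worst under the given metric."""
    return sorted(runs, key=lambda name: metric(runs[name]), reverse=True)


by_success = rank(success_rate)  # model_a first: 4/5 vs 3/5
by_joint = rank(joint_rate)      # model_b first: 3/5 vs 2/5

print(by_success, by_joint)
```

In this toy data the model that overfills wins on success alone and loses once a success only counts when it is also privacy-compliant, which is the kind of reversal the note describes.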

The deeper observation is that "deployment readiness" is not a scalar. It is a vector across the capabilities the deployment actually requires. For phone-use agents, that vector includes at minimum success, privacy compliance, and longitudinal preference handling. For other agent deployments it would include different combinations. Evaluating on the wrong subset of capabilities produces models that score well on the benchmark and fail in production.

For benchmark designers, this argues for joint evaluation as the default rather than as a research add-on. A benchmark that scores only one capability and ranks models on it is producing rankings that will not generalize to deployment. The methodological move is to evaluate the capability vector and present results as multi-dimensional rather than collapsing to a single score.
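One way to report a capability vector without collapsing it to a single score is to show which models are non-dominated (a Pareto front). This is a sketch of that idea under stated assumptions, not a method taken from the benchmark: the `Capability` type, the model names, and every number below are hypothetical.

```python
from typing import NamedTuple


class Capability(NamedTuple):
    success: float
    privacy: float
    preference_reuse: float


def dominates(a, b):
    """True if a is at least as good as b on every axis and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores):
    """Names of models that no other model dominates, in sorted order."""
    return sorted(
        name
        for name, vec in scores.items()
        if not any(
            dominates(other, vec)
            for other_name, other in scores.items()
            if other_name != name
        )
    )


# Hypothetical capability vectors (illustrative values, not measured results).
scores = {
    "model_a": Capability(0.80, 0.40, 0.55),
    "model_b": Capability(0.60, 0.75, 0.50),
    "model_c": Capability(0.55, 0.70, 0.45),
}

front = pareto_front(scores)
print(front)
```

Here `model_c` drops out because `model_b` is at least as good on every axis, while `model_a` and `model_b` both remain: neither is better across the board, so choosing between them depends on which capabilities the deployment weights most.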

For agent developers, the immediate consequence is that success-trained models cannot be assumed to be privacy-compliant or preference-respecting. These properties need to be selected and trained for explicitly.

Original note title

task success privacy compliance and saved-preference reuse are distinct capabilities in phone-use agents — success-only evaluations overestimate deployment readiness