SYNTHESIS NOTE

Do phone agents succeed at all three critical tasks equally?

Explores whether task success, privacy compliance, and preference reuse develop together in phone-use agents, or whether benchmarking one capability tells you nothing about the others.

Synthesis note · 2026-05-18 · sourced from Assistants Personalization

The MyPhoneBench evaluation surfaces a finding with direct deployment consequences: the three properties that matter most for deploying phone-use agents (task success, privacy compliance during task completion, and proper use of saved preferences in later sessions) are statistically distinct capabilities. No model dominates all three, and a model's score on one does not predict its score on the others.

The pattern matters because of how benchmarks are structured. Most agent benchmarks score only task success: did the agent complete the task as instructed? Models that score well on this single metric get ranked as "frontier" and get deployed. But when the same models are scored jointly on success plus privacy, or on success plus preference reuse, the ranking reshuffles. A model that wins on success alone may lose on success-with-privacy because it completes tasks by filling in personal data the task does not require. A model with mediocre success may show better privacy compliance because it stops at minimal disclosure.
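The reshuffling described above can be sketched in a few lines. This is a minimal illustration, not the MyPhoneBench scoring code: the `Episode` record, the model names, and every number below are invented for the example, and the joint metric is assumed to be "succeeded and stayed privacy-compliant in the same episode".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    """One evaluated task run (hypothetical record layout)."""
    model: str
    success: bool      # task completed as instructed
    privacy_ok: bool   # no disclosure beyond what the task required


def rate(episodes, model, predicate):
    """Fraction of a model's episodes that satisfy the predicate."""
    rows = [e for e in episodes if e.model == model]
    return sum(predicate(e) for e in rows) / len(rows)


def rank(episodes, predicate):
    """Model names ordered from best to worst under the predicate."""
    models = sorted({e.model for e in episodes})
    return sorted(models, key=lambda m: rate(episodes, m, predicate), reverse=True)


# Invented toy data, NOT benchmark results.
# model_a completes 9 of 10 tasks but over-discloses in 6 of them.
# model_b completes 7 of 10 tasks and never over-discloses.
episodes = (
    [Episode("model_a", True, False)] * 6
    + [Episode("model_a", True, True)] * 3
    + [Episode("model_a", False, True)] * 1
    + [Episode("model_b", True, True)] * 7
    + [Episode("model_b", False, True)] * 3
)

success_only = rank(episodes, lambda e: e.success)
joint = rank(episodes, lambda e: e.success and e.privacy_ok)

print("success only:   ", success_only)
print("success+privacy:", joint)
```

With this toy data the two orderings are reversed: the model that leads on success alone falls behind once privacy compliance is required in the same episode.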

The deeper observation is that "deployment readiness" is not a scalar. It is a vector across the capabilities the deployment actually requires. For phone-use agents, that vector includes at minimum success, privacy compliance, and longitudinal preference handling. For other agent deployments the vector would have different components. Evaluating on the wrong subset of capabilities selects models that score well on the benchmark and fail in production.

For benchmark designers, this argues for joint evaluation as the default rather than as a research add-on. A benchmark that scores only one capability and ranks models on it produces rankings that will not generalize to deployment. The methodological move is to evaluate the full capability vector and report results as multi-dimensional rather than collapsing them to a single score.
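One way to report a capability vector without collapsing it is to list the models that no other model beats on every axis (the Pareto front). The sketch below assumes a three-component vector of (success, privacy compliance, preference reuse); the function names and all scores are hypothetical and are not taken from the benchmark.

```python
from typing import Dict, List, Tuple

# (success, privacy compliance, preference reuse), each in [0, 1].
Vector = Tuple[float, float, float]


def dominates(a: Vector, b: Vector) -> bool:
    """True if a is at least as good as b on every axis and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores: Dict[str, Vector]) -> List[str]:
    """Names of models that are not dominated by any other model."""
    return sorted(
        name
        for name, vec in scores.items()
        if not any(
            dominates(other_vec, vec)
            for other_name, other_vec in scores.items()
            if other_name != name
        )
    )


# Invented scores for illustration only.
scores = {
    "model_a": (0.90, 0.40, 0.55),
    "model_b": (0.70, 0.85, 0.50),
    "model_c": (0.65, 0.60, 0.80),
    "model_d": (0.60, 0.55, 0.45),
}

front = pareto_front(scores)
print("non-dominated models:", front)
```

Here three of the four toy models remain on the front because each leads on a different axis, which is the situation "no model dominates all three" describes. Choosing among them requires deciding which axes the deployment weights most, and a single leaderboard score hides that decision.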

For agent developers, the immediate consequence is this: do not assume that models trained for success will also be privacy-compliant or preference-respecting. Those properties have to be evaluated, selected for, and trained for explicitly.

Inquiring lines that read this note 37

This note is a source for the following research framings, grouped by the broader line of inquiry each explores.

How should personalization be implemented to improve AI assistant effectiveness?

What dimensions of recommendation quality do standard metrics miss?

Can standard accuracy metrics miss the real constraints on user consumption?

How can humans calibrate appropriate trust in AI systems?

What makes users willing to relinquish control to an agent?

Why do agents confidently report success despite actually failing tasks?

Why do reward structures fail to shape long-term agent learning?

How do chatbots affect human self-disclosure and emotional engagement?

How do privacy concerns compete with disclosure comfort in human-machine conversation?

How should conversational agents balance goal-driven initiative with user control?

How do interface design choices shape consciousness attribution?

What creates the tension between users wanting convenience and resisting loss of control?

How do we evaluate AI systems when user perception misleads actual performance?

What design changes if we separate behavior description from adoption justification goals?

What drives capability and cost efficiency in agent systems?

What makes AI persuasion effective and how can we counter it?

Can advertising mechanisms designed for humans work on agents?

How should we design LLM systems to maintain alignment and control?

How does direct web access change privacy assumptions built on API limits?

When should tasks involve human-AI partnership versus full automation?

Can worker preference serve as a legitimate axis for delegation design?

What determines success in training models on multiple tasks?

Why do models that excel at task success often fail at privacy compliance?

Can single-axis benchmarks accurately predict agent deployment success?

How do aggregate reward models systematically exclude minority user preferences?

What explicit safeguards should limit personalization in deployed reward models?

How do multi-agent systems achieve genuine cooperation and reasoning?

How do minimal-disclosure privacy contracts enable multi-dimensional agent evaluation?

What coordination failures limit multi-agent LLM systems as they scale?

Do layered defenses work better than single privacy techniques?

Does externalizing cognitive work and state improve agent reliability?

What specific bookkeeping tasks can environments maintain more reliably than policies?

Do harness improvements transfer across model scales or memorize shortcuts?

Can per-user adapters remain consistent without drifting or leaking?

Related concepts in this collection 3

Why do phone-use agents overfill optional personal data fields?
Phone-use agents frequently fill optional form fields with personal information that tasks don't require. Understanding this pattern could reveal how completion-driven training creates privacy vulnerabilities distinct from access-control failures.
Relation: same paper, the specific failure mode that produces the capability divergence

Can a two-category privacy boundary actually be auditable?
Most privacy frameworks are either too vague or too complex for agent deployment. Can a minimal binary split (LOW versus HIGH data categories) provide enough clarity for both users and automated compliance auditing?
Relation: same paper, the contract that makes joint evaluation possible

Do short benchmarks predict how models perform over long workflows?
Standard LLM benchmarks measure single-turn performance, but real workflows involve sustained delegation across many turns. The question explores whether top benchmark performers maintain accuracy through longer interaction chains.
Relation: adjacent, the benchmarks-overstate-deployment-readiness pattern at a different scale axis

Original note title

task success privacy compliance and saved-preference reuse are distinct capabilities in phone-use agents — success-only evaluations overestimate deployment readiness
