Do phone agents succeed at all three critical tasks equally?
Explores whether task success, privacy compliance, and preference reuse develop together in phone-use agents, or whether benchmarking one capability tells you nothing about the others.
The MyPhoneBench evaluation surfaces a finding with direct deployment consequences: the three properties most relevant for phone-use agent deployment (task success, privacy compliance during task completion, and proper reuse of saved preferences in later sessions) are statistically distinct capabilities. No model dominates all three, and a model's score on one does not predict its score on the others.
The pattern matters because of how benchmarks have been structured. Most agent benchmarks score task success: did the agent complete the task as instructed? Models that score well on this single metric get ranked as "frontier" and get deployed. But when the same models are scored jointly on success-plus-privacy or success-plus-preference-reuse, the ranking reshuffles. A model that wins on success alone may lose on success-with-privacy, because it completes tasks by overfilling optional personal fields. A model with mediocre success may show better privacy compliance because it stops at minimal disclosure.
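The reshuffling described above can be illustrated with a minimal sketch. The episode records, model names, and numbers below are hypothetical and are not taken from MyPhoneBench; the joint metric (an episode counts only if it both succeeds and stays within the privacy boundary) is one plausible way to score success-with-privacy, assumed here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    success: bool     # task completed as instructed
    privacy_ok: bool  # no disclosure beyond what the task required


def success_rate(episodes):
    """Fraction of episodes where the task was completed."""
    return sum(e.success for e in episodes) / len(episodes)


def joint_rate(episodes):
    """Fraction of episodes that succeeded AND stayed privacy-compliant."""
    return sum(e.success and e.privacy_ok for e in episodes) / len(episodes)


def rank(results, metric):
    """Model names ordered from best to worst under the given metric."""
    return sorted(results, key=lambda name: metric(results[name]), reverse=True)


# Hypothetical results: model_a completes more tasks but often overfills
# optional personal fields; model_b completes fewer tasks but never overfills.
results = {
    "model_a": [Episode(True, False)] * 5
    + [Episode(True, True)] * 4
    + [Episode(False, True)] * 1,
    "model_b": [Episode(True, True)] * 7 + [Episode(False, True)] * 3,
}

success_ranking = rank(results, success_rate)
joint_ranking = rank(results, joint_rate)

print("success only:", success_ranking)
print("success + privacy:", joint_ranking)
```

In this toy data the two rankings are reversed, which is the kind of reordering a success-only leaderboard hides.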
The deeper observation is that "deployment readiness" is not a scalar. It is a vector across the capabilities the deployment actually requires. For phone-use agents, that vector includes at minimum success, privacy compliance, and longitudinal preference handling. For other agent deployments it would include different combinations. Evaluating on the wrong subset of capabilities produces models that score well on the benchmark and fail in production.
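One way to make the "vector, not scalar" point concrete is a readiness check that applies a floor to each capability instead of averaging them. This is a sketch under stated assumptions: the dimension names follow the note, but the thresholds, the scores, and the `failing_dimensions` helper are hypothetical.

```python
# Hypothetical per-capability floors for a phone-use deployment.
REQUIRED = {"success": 0.6, "privacy": 0.6, "preference_reuse": 0.6}


def mean_score(scores):
    """Collapse the capability vector to a single scalar (what we want to avoid)."""
    return sum(scores.values()) / len(scores)


def failing_dimensions(scores, required):
    """Return every required capability whose score is below its floor.

    A capability that was never measured is treated as 0.0, so it fails.
    """
    return [dim for dim, floor in required.items() if scores.get(dim, 0.0) < floor]


# Hypothetical candidate: strong average, weak privacy compliance.
candidate = {"success": 0.95, "privacy": 0.45, "preference_reuse": 0.85}

average = mean_score(candidate)
failures = failing_dimensions(candidate, REQUIRED)

print(f"mean score: {average:.2f}")
print("fails on:", failures)
```

The candidate's average clears the floor while the privacy dimension does not, so the scalar view and the vector view disagree about readiness.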
For benchmark designers, this argues for joint evaluation as the default rather than as a research add-on. A benchmark that scores only one capability and ranks models on it is producing rankings that will not generalize to deployment. The methodological move is to evaluate the capability vector and present results as multi-dimensional rather than collapsing to a single score.
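Presenting results as multi-dimensional can be sketched by reporting the set of non-dominated models (a Pareto front) rather than a single ordering. The scores and model names below are hypothetical, and the Pareto front is only one possible presentation; the note itself does not prescribe a specific method.

```python
from typing import Dict, List, Tuple

# Capability vector: (success, privacy, preference_reuse). Higher is better.
Scores = Tuple[float, float, float]


def dominates(a: Scores, b: Scores) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(models: Dict[str, Scores]) -> List[str]:
    """Names of models that no other model dominates, in alphabetical order."""
    return sorted(
        name
        for name, scores in models.items()
        if not any(
            dominates(other, scores)
            for other_name, other in models.items()
            if other_name != name
        )
    )


# Hypothetical benchmark results.
models = {
    "model_a": (0.90, 0.40, 0.55),  # best at task success
    "model_b": (0.70, 0.85, 0.50),  # best at privacy compliance
    "model_c": (0.65, 0.60, 0.80),  # best at preference reuse
    "model_d": (0.60, 0.55, 0.45),  # worse than model_b on every axis
}

front = pareto_front(models)
print("non-dominated models:", front)
```

Three models remain on the front because each leads on a different capability; only the model that is worse everywhere drops out, and choosing among the rest depends on what the deployment requires.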
For agent developers, the immediate consequence: do not assume success-trained models will be privacy-compliant or preference-respecting. These need to be selected and trained for, not assumed.
Inquiring lines that use this note as a source 33
This note is a source for these synthesized inquiries.
- How does understanding persistent journeys intensify both trust and privacy concerns?
- Can standard accuracy metrics miss the real constraints on user consumption?
- How does personalization create tradeoffs between trust and privacy concerns?
- What makes users willing to relinquish control to an agent?
- Can agent success reports serve as reliable oversight signals in real deployment?
- Can reward engineering and information-theoretic architecture solve partner-awareness separately?
- How do privacy concerns compete with disclosure comfort in human-machine conversation?
- What makes complex UI navigation and social interaction harder than task completion?
- What creates the tension between users wanting convenience and resisting loss of control?
- Why does personalization increase both trust and privacy concerns?
- When should agents accommodate user preferences over their own goals?
- How does asymmetric information between users and agents relate to proactivity?
- Can agents balance goal-driven proactivity with user preference alignment?
- What design changes if we separate behavior description from adoption justification goals?
- Why do APIs outperform UIs for agent task completion?
- Can advertising mechanisms designed for humans work on agents?
- What ecosystem conditions make agent attention markets viable?
- What data types carry the most privacy risk in personalization systems?
- Does personalization make users trust AI or increase privacy concerns?
- How does direct web access change privacy assumptions built on API limits?
- Can worker preference serve as a legitimate axis for delegation design?
- Why do models that excel at task success often fail at privacy compliance?
- Can tool access control prevent agents from filling optional personal fields?
- Why do identical task success rates mask deployment readiness differences?
- What explicit safeguards should limit personalization in deployed reward models?
- Why do completion-oriented models systematically sacrifice privacy compliance?
- How do minimal-disclosure privacy contracts enable multi-dimensional agent evaluation?
- Why do phone-use agents fail by overfilling optional personal data fields?
- How do agent privacy compliance and task success differ in evaluation?
- Can minimal privacy boundaries generalize beyond phone-use contexts?
- Do layered defenses work better than single privacy techniques?
- What specific bookkeeping tasks can environments maintain more reliably than policies?
- Do information gathering and task execution require different incentive structures?
Related concepts in this collection 3
- Why do phone-use agents overfill optional personal data fields?
  Phone-use agents frequently fill optional form fields with personal information that tasks don't require. Understanding this pattern could reveal how completion-driven training creates privacy vulnerabilities distinct from access-control failures.
  same paper, the specific failure mode that produces the capability divergence
- Can a two-category privacy boundary actually be auditable?
  Most privacy frameworks are either too vague or too complex for agent deployment. Can a minimal binary split (LOW versus HIGH data categories) provide enough clarity for both users and automated compliance auditing?
  same paper, the contract that makes joint evaluation possible
- Do short benchmarks predict how models perform over long workflows?
  Standard LLM benchmarks measure single-turn performance, but real workflows involve sustained delegation across many turns. The question explores whether top benchmark performers maintain accuracy through longer interaction chains.
  adjacent: benchmarks-overstate-deployment-readiness pattern at a different scale axis
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Do Phone-Use Agents Respect Your Privacy?
- CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions
- From speaking like a person to being personal: The effects of personalized, regular interactions with conversational agents
- Rise of Machine Agency: A Framework for Studying the Psychology of Human–AI Interaction (HAII)
- PRIME: Large Language Model Personalization with Cognitive Memory and Thought Processes
- Interaction Dynamics as a Reward Signal for LLMs
- Action-Based Conversations Dataset: A Corpus for Building More In-Depth Task-Oriented Dialogue Systems
- PersonaAgent: When Large Language Model Agents Meet Personalization at Test Time
Original note title
task success privacy compliance and saved-preference reuse are distinct capabilities in phone-use agents — success-only evaluations overestimate deployment readiness