INQUIRING LINE

How do minimal-disclosure privacy contracts enable multi-dimensional agent evaluation?

This explores why evaluating agents on what they *don't* reveal (disclosing private data only when strictly needed) forces us to grade them on several separate axes at once, not just on whether they finished the task.


This question is really about a shift in how we score agents: instead of asking only 'did it complete the task?', we ask 'did it complete the task *while* withholding what shouldn't be shared *and* reusing what it was allowed to remember?' The sharpest evidence that these are genuinely different things comes from phone-agent testing, where task success, privacy-compliant completion, and reuse of saved preferences turn out to be statistically distinct capabilities: no single model wins all three, and a success-only ranking does not predict how an agent handles private information (see: Do phone agents succeed at all three critical tasks equally?). That separation is the whole argument for multi-dimensional evaluation: collapse it to one number and you hide exactly the failures you most need to see.
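A minimal sketch of what scoring on three axes could look like, in Python. The `EpisodeResult` fields, the `scorecard` function, and the sample episodes are all hypothetical illustrations, not MyPhoneBench's actual schema or data. The point is only that two agents can tie on task success while differing on the other two axes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EpisodeResult:
    # Hypothetical per-episode record; field names are assumptions.
    task_done: bool          # did the agent finish the task?
    leaked_fields: int       # private fields disclosed beyond what the task required
    reused_preference: bool  # did it reuse a saved preference it was allowed to use?


def scorecard(episodes):
    """Report each axis separately instead of collapsing to one number."""
    n = len(episodes)
    if n == 0:
        raise ValueError("need at least one episode")
    return {
        "success": sum(e.task_done for e in episodes) / n,
        "privacy_compliant_success": sum(
            e.task_done and e.leaked_fields == 0 for e in episodes
        ) / n,
        "preference_reuse": sum(e.reused_preference for e in episodes) / n,
    }


# Made-up episodes for two imaginary agents.
agent_a = [
    EpisodeResult(True, 2, True),
    EpisodeResult(True, 1, True),
    EpisodeResult(True, 0, False),
    EpisodeResult(False, 0, False),
]
agent_b = [
    EpisodeResult(True, 0, False),
    EpisodeResult(True, 0, True),
    EpisodeResult(False, 0, False),
    EpisodeResult(True, 0, False),
]

card_a = scorecard(agent_a)
card_b = scorecard(agent_b)
print(card_a)
print(card_b)
```

In this toy data both agents have the same success rate, yet agent A is ahead on preference reuse and agent B is ahead on privacy-compliant completion, so a success-only ranking could not separate them.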

Why is minimal disclosure the lever that exposes this? Because privacy leakage isn't a surface slip that a filter bolted on afterward can catch; it's woven into how the model thinks. Nearly three-quarters of privacy leaks in reasoning traces come from the model directly materializing sensitive user data mid-thought, and the longer it reasons the more it leaks. Worse, anonymizing the trace after the fact degrades the model's usefulness, which suggests the private data was functioning as cognitive scaffolding (see: Do reasoning traces actually expose private user data?). So a contract that says 'reveal only what the task requires' can't be checked by reading the final output alone. You have to evaluate the process, which is inherently a second dimension beyond task success.
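To make the difference between checking the output and checking the process concrete, here is a small Python sketch. It assumes a deliberately naive leak test (a private value counts as disclosed if it appears verbatim in the text); the function names, the profile, and the example strings are invented for illustration and are not the method used in the cited work.

```python
def disclosed_fields(text, profile):
    """Return the names of profile fields whose values appear verbatim in text.

    Naive substring matching: an assumption made for brevity, and one that
    would miss paraphrased or partial disclosures.
    """
    lowered = text.lower()
    return {name for name, value in profile.items() if value.lower() in lowered}


def audit_episode(trace, final_output, profile, allowed):
    """Check both the reasoning trace and the final output against the contract."""
    trace_leaks = disclosed_fields(trace, profile) - allowed
    output_leaks = disclosed_fields(final_output, profile) - allowed
    return {
        "trace_leaks": sorted(trace_leaks),
        "output_leaks": sorted(output_leaks),
        "output_only_check_passes": not output_leaks,
        "process_check_passes": not trace_leaks and not output_leaks,
    }


# Hypothetical user profile and contract: only the city is needed for this task.
profile = {"email": "ana@example.com", "diagnosis": "asthma", "city": "Lisbon"}
allowed = {"city"}

trace = "User has asthma and lives in Lisbon, so pick a pharmacy nearby."
final_output = "Booked a pharmacy pickup in Lisbon."

report = audit_episode(trace, final_output, profile, allowed)
print(report)
```

Here the final output is clean, so an output-only check passes, while the trace materializes the diagnosis and the process-level check fails.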

The tension deepens because the very things that make agents useful also make them leak. Personalization raises trust and privacy risk together: each remembered detail builds rapport and enlarges the exposure surface (see: Does chatbot personalization build trust or expose privacy risks?). MyPhoneBench operationalizes a similar trade-off as 'preference reuse' versus 'privacy compliance': an agent that aggressively reuses saved preferences can score well on helpfulness and badly on disclosure, and you only catch that conflict if you're measuring both at once.

A related blind spot is what evaluation settings quietly assume. When one model secretly controls every party in a simulation, agents look socially competent, but that competence collapses under genuine information asymmetry, because the omniscient setup let the model skip the grounding work of reasoning about what others *don't* know (see: Why do LLMs fail when simulating agents with private information?). Minimal-disclosure contracts force that asymmetry back into the test: the agent must act without seeing everything, which is the only condition under which privacy behavior is even meaningful to measure.

Where does the contract live so the agent actually honors it? The most durable answer here is that governance works when it's embedded in the agent's runtime memory layer rather than appended as an external policy. A long-running agent consulted its encoded safeguards during actual decisions, and they were effective precisely because they were *in the loop* it read from (see: Can governance rules embedded in runtime memory actually protect autonomous agents?). Read together, these notes suggest minimal-disclosure contracts don't just add a privacy checkbox. They unbundle 'a good agent' into success, restraint, and memory-use, and they only become testable when the contract is enforced inside the agent's own thinking and the evaluation refuses to let any one axis stand in for the others.
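One way to picture a contract that lives in the memory layer, rather than beside it, is a store whose read path consults the contract and logs a governance event on every access. The Python sketch below is an assumption-laden illustration: `GovernedMemory`, its contract format, and the sample records are hypothetical and do not describe the architecture of the agent in the cited note.

```python
class GovernedMemory:
    """Hypothetical memory store whose read path enforces a disclosure contract."""

    def __init__(self, records, contract):
        self._records = dict(records)
        # contract maps a task name to the set of fields that task may read
        self._contract = {task: frozenset(fields) for task, fields in contract.items()}
        self.events = []  # governance log: (task, field, decision)

    def read(self, task, field):
        permitted = field in self._contract.get(task, frozenset())
        self.events.append((task, field, "allow" if permitted else "deny"))
        if not permitted:
            return None  # the agent never sees the value, so it cannot leak it
        return self._records.get(field)


# Made-up records and contract for illustration.
memory = GovernedMemory(
    records={"seat_preference": "aisle", "passport_number": "X1234567"},
    contract={"book_flight": {"seat_preference"}},
)

seat = memory.read("book_flight", "seat_preference")
passport = memory.read("book_flight", "passport_number")
print(seat, passport, memory.events)
```

Because every read passes through the contract, the same log that enforces restraint also gives the evaluator a record of allowed reuse and denied access, which are two of the axes discussed above.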


Sources (5 notes)

Do phone agents succeed at all three critical tasks equally?

MyPhoneBench demonstrates that task success, privacy-compliant completion, and saved-preference reuse are statistically distinct capabilities with no model dominating all three. Success-only rankings do not predict privacy or preference performance.

Do reasoning traces actually expose private user data?

74.8% of privacy leaks in language model reasoning traces result from models materializing sensitive user data during thought processes. Longer reasoning chains amplify leakage, and anonymizing traces post-hoc degrades model utility, suggesting private data functions as cognitive scaffolding.

Does chatbot personalization build trust or expose privacy risks?

Longitudinal research shows personalization enhances trust and anthropomorphism but also amplifies privacy concerns and escalating user expectations. One-shot studies miss these temporal dynamics—each interaction raises the baseline, making failures more disappointing.

Why do LLMs fail when simulating agents with private information?

Research shows LLMs perform well when one model controls all interlocutors but fail systematically when agents possess private information. This reveals that apparent social competence relies on grounding work that models skip in omniscient settings.

Can governance rules embedded in runtime memory actually protect autonomous agents?

A persistent agent recorded 889 governance events across 96 active days, with safeguards encoded directly into the memory layer the agent consulted during operation. Runtime-resident governance proved more effective than external policies because the agent actually accessed it during decision-making.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question **How do minimal-disclosure privacy contracts enable multi-dimensional agent evaluation?** remains open. Treat the following as dated claims (2021–2026) to be re-tested, not current truth.

**What a curated library found — and when:**
- Task success, privacy-compliant completion, and saved-preference reuse are statistically distinct capabilities; no single model excels at all three (~2026, arXiv:2604.00986).
- ~75% of privacy leaks in reasoning traces stem from models materializing sensitive data mid-thought; longer reasoning = more leakage (~2025, arXiv:2506.15674).
- Personalization raises both trust and privacy risk along the same curve — a dual dynamic that only multi-dimensional evaluation catches (~2025).
- Omniscient simulations mask incompetence under real information asymmetry; minimal-disclosure contracts force genuine asymmetry back into test design (~2024, arXiv:2403.05020).
- Governance embedded in agent runtime memory (not appended as external policy) makes contracts effective during actual decisions (~2026, arXiv:2604.08224).

**Anchor papers (verify; mind their dates):**
- arXiv:2506.15674 (2025, "Leaky Thoughts") — reasoning-model privacy leakage.
- arXiv:2403.05020 (2024, social simulation fidelity) — information asymmetry.
- arXiv:2604.00986 (2026, "Do Phone-Use Agents Respect Your Privacy?") — capability independence.
- arXiv:2604.08224 (2026, externalization review) — governance architecture.

**Your task:**
(1) **RE-TEST EACH CONSTRAINT.** For every finding above, judge whether newer models (post-2026), architectural innovations (e.g., latent vs. explicit memory), tooling (agent SDKs, harnesses), or evaluation frameworks have since relaxed or overturned it. Separate the durable question (likely still open) from the perishable limitation (possibly resolved). Cite what resolved it; flag where constraints still hold.
(2) **Surface the strongest CONTRADICTING or SUPERSEDING work** from the last ~6 months that challenges the independence claim, the reasoning-leak thesis, or the memory-embedding architecture.
(3) **Propose 2 research questions that ASSUME the regime may have moved** — e.g., if reasoning leaks have been solved, what *new* privacy surface has emerged? If multi-dimensional evaluation is now standard, what collapses when you try to apply it across heterogeneous agent architectures?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
