INQUIRING LINE

How can agents distinguish between optional and required form fields during execution?

This explores whether agents can actually tell which form fields they must fill versus which they may leave blank — and why they so often get this wrong.


The corpus reframes the question in a way you might not expect: the problem isn't that agents can't *see* the distinction, it's that their training pushes them to ignore it. Research on completion bias identifies a single mechanism, training that optimizes for finishing the task without separating required from optional completion, as the direct cause of the overfilling failure (see "Does completion training push agents to overfill forms unnecessarily?"). The same root cause shows up as over-claiming actions and silently corrupting documents, which suggests "distinguishing optional fields" is one face of a deeper habit: agents treat every blank as something to be filled.

The sharpest concrete evidence comes from phone-use agents. Testing five frontier models with MyPhoneBench found that the dominant privacy leak wasn't agents breaking into data they shouldn't touch; it was agents voluntarily pouring personal data into optional fields nobody asked them to complete (see "Why do phone-use agents overfill optional personal data fields?"). The fix that worked wasn't better permission gating alone; it was giving the agent an explicit *minimal-disclosure* objective. In other words, the distinction between optional and required has to be stated as a goal, not assumed to emerge from the model's judgment.
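
One way to make a minimal-disclosure objective concrete is to score a run by how many optional personal fields it filled without being asked. The sketch below is illustrative only: the `Field` type, the `personal` flag, the `disclosure_penalty` function, and the example form are all hypothetical and are not taken from MyPhoneBench.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    required: bool
    personal: bool = False  # hypothetical flag: does the field hold personal data?


def disclosure_penalty(fields, filled, requested):
    """Count optional personal fields that were filled without being requested.

    fields:    list of Field describing the form
    filled:    dict mapping field name -> value the agent wrote
    requested: set of field names the user explicitly asked to fill
    """
    penalty = 0
    for f in fields:
        value = filled.get(f.name)
        if value in (None, ""):
            continue  # left blank: nothing disclosed
        if f.required or f.name in requested:
            continue  # disclosure was needed or asked for
        if f.personal:
            penalty += 1  # unrequested personal data in an optional field
    return penalty


form = [
    Field("email", required=True, personal=True),
    Field("phone", required=False, personal=True),
    Field("birthday", required=False, personal=True),
    Field("newsletter", required=False),
]

overfilled = {"email": "a@example.com", "phone": "555-0100", "birthday": "1990-01-01"}
minimal = {"email": "a@example.com"}

print(disclosure_penalty(form, overfilled, requested=set()))  # 2
print(disclosure_penalty(form, minimal, requested=set()))     # 0
```

A penalty like this could feed an evaluation metric or a stated instruction to the agent; which of those the benchmark authors used is not specified here.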

Laterally, the corpus points to where this kind of judgment *should* live. One line of work argues that reliable agents externalize their decision-making into a harness layer (memory, skills, and protocols) rather than re-deriving rules like "don't fill optional fields" on every run (see "Where does agent reliability actually come from?"). A form's required/optional schema is exactly the kind of structured fact a protocol layer can enforce, instead of leaving it to the model to infer mid-execution.
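
As a rough illustration of enforcing the schema outside the model, a harness could filter the agent's proposed writes before they reach the form. This is a minimal sketch under assumed conventions: the `enforce_schema` function and the `"required"` / `"optional"` schema labels are hypothetical, not an API from the cited work.

```python
def enforce_schema(schema, proposed, requested=frozenset()):
    """Split the agent's proposed writes into allowed and blocked.

    schema:    dict mapping field name -> "required" or "optional"
    proposed:  dict mapping field name -> value the model wants to write
    requested: optional fields the user explicitly asked to fill
    Raises ValueError if any required field is left unfilled.
    """
    allowed, blocked = {}, {}
    for name, value in proposed.items():
        kind = schema.get(name)
        if kind == "required" or (kind == "optional" and name in requested):
            allowed[name] = value
        else:
            # Optional but unrequested, or not in the schema at all.
            blocked[name] = value
    missing = [n for n, k in schema.items() if k == "required" and n not in allowed]
    if missing:
        raise ValueError(f"required fields missing: {missing}")
    return allowed, blocked


schema = {"name": "required", "email": "required", "phone": "optional"}
proposed = {"name": "Ada", "email": "ada@example.com", "phone": "555-0100"}

allowed, blocked = enforce_schema(schema, proposed)
print(allowed)  # {'name': 'Ada', 'email': 'ada@example.com'}
print(blocked)  # {'phone': '555-0100'}
```

The point of the design is that the rule is applied on every run by code, so the model never has to rediscover it.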

Two more angles reframe the mechanics. Process verification research shows most agent failures are violations *during* generation, not wrong final answers, and checking intermediate steps lifted success from 32% to 87% (see "Where do reasoning agents actually fail during long traces?"). Overfilling a field is precisely an intermediate-step violation that a final-output check would miss. And structured-prompting work suggests a cheaper intervention: forcing the model to surface its implicit premises before acting, the way critical-question prompting makes it justify each warrant (see "Can structured argument prompts make LLM reasoning more rigorous?"). Here, that means asking "was this field actually requested?" before writing to it.
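
The gap between a final-output check and a step-level check can be shown on a toy trace. The sketch below is an assumption-laden illustration: the trace format (dicts with `action` and `field` keys) and both function names are hypothetical, and it does not reproduce the verification method behind the 32% to 87% result.

```python
def final_output_ok(trace, required):
    """Final-output check: were all required fields written by the end?"""
    written = {step["field"] for step in trace if step["action"] == "write"}
    return required <= written


def first_step_violation(trace, required, requested):
    """Step-level check: index of the first write to a field that is
    neither required nor requested, or None if every write is justified."""
    permitted = required | requested
    for i, step in enumerate(trace):
        if step["action"] == "write" and step["field"] not in permitted:
            return i
    return None


trace = [
    {"action": "write", "field": "name"},
    {"action": "write", "field": "home_address"},  # optional, never requested
    {"action": "write", "field": "email"},
    {"action": "submit", "field": None},
]
required = {"name", "email"}

print(final_output_ok(trace, required))                        # True
print(first_step_violation(trace, required, requested=set()))  # 1
```

The final-output check passes because every required field was filled, while the step-level check flags the unrequested write at index 1.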

What you might not have expected: the answer the corpus converges on isn't a smarter classifier for reading form schemas. It's that the optional/required distinction has to be *imposed* — as an explicit objective, an external protocol, or a verification step — because the agent's default training actively erodes it.


Sources (5 notes)

Does completion training push agents to overfill forms unnecessarily?

Research across three domains shows agents fail by over-claiming actions, silently corrupting documents, and overfilling optional fields. All three failures stem from the same root cause: training that optimizes for task completion without distinguishing required from optional completion behaviors.

Why do phone-use agents overfill optional personal data fields?

MyPhoneBench testing across five frontier models found the primary privacy failure is completion bias: agents voluntarily fill unrequested optional fields with personal data. This differs from access-control violations and requires explicit minimal-disclosure objectives rather than permission gating alone.

Where does agent reliability actually come from?

Research shows reliable LLM agents externalize three cognitive burdens—memory (state persistence), skills (procedural components), and protocols (structured interaction)—into a harness layer rather than relying on model scale alone. The harness unifies these externalities and eliminates the need for the model to solve the same problems repeatedly.

Where do reasoning agents actually fail during long traces?

Reliability for long-trace reasoning comes from checking intermediate states and policy compliance during generation, not from scoring final outputs. Adding intermediate verification raised task success from 32% to 87% because most failures are process violations, not wrong answers.

Can structured argument prompts make LLM reasoning more rigorous?

Applying Toulmin's argument model as explicit prompting steps (CQoT) improves LLM reasoning by forcing models to identify warrants and backing rather than skipping implicit premises. The method catches failures that standard chain-of-thought prompting allows.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a researcher stress-testing claims about agent field-filling behavior. The precise question: can frontier LLMs reliably distinguish optional from required form fields during execution—and if not, why?

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026. A library of agent-reliability work reports:
  • Agents systematically overfill optional fields not because they misread schemas, but because training optimizes task completion without penalizing excess disclosure (agent-completion-bias-produces-three-apparent-failure-modes-from-one-mechanism-o, ~2024).
  • Phone-use agents leaked personal data by voluntarily populating optional fields; the dominant privacy failure was not unauthorized access but voluntary overfilling (Do Phone-Use Agents Respect Your Privacy?, arXiv:2604.00986, 2026).
  • The fix that worked was adding an explicit *minimal-disclosure* objective—the distinction must be *imposed* as a goal, not assumed to emerge from model judgment.
  • Process verification (checking intermediate steps, not final outputs) lifted task success from 32% to 87%; overfilling is an intermediate violation final-output checks miss (reframing-reliability-as-verifying-the-reasoning-process-not-just-the-final-outp, ~2025).
  • Externalizing decision-making into a harness layer (memory, skills, protocols) is where form schemas and required/optional rules should live, not in the model's mid-run inference (Externalization in LLM Agents, arXiv:2604.08224, 2026).

Anchor papers (verify; mind their dates):
  • Do Phone-Use Agents Respect Your Privacy? (arXiv:2604.00986, 2026)
  • Externalization in LLM Agents (arXiv:2604.08224, 2026)
  • Critical-Questions-of-Thought (arXiv:2412.15177, 2024)
  • Agent Workflow Memory (arXiv:2409.07429, 2024)

Your task:
  (1) RE-TEST THE COMPLETION-BIAS HYPOTHESIS. The library claims agents *can see* required/optional distinctions but their training erodes them. Has recent work (last 6 months) shown that newer instruction tuning, constitutional AI, or multi-stage reasoning architectures *restore* this distinction without external scaffolding? Or do all paths still require explicit objectives, harness enforcement, or verification loops? Separate the durable claim (agents have a completion bias) from the perishable one (only external enforcement works).
  (2) Surface the strongest work contradicting the "externalization required" conclusion. Does any recent paper show in-context learning, chain-of-thought variants, or prompt engineering alone can lock in optional/required distinctions reliably (>85% accuracy)? Flag it.
  (3) Propose two research questions that *assume* the regime may have shifted: (a) Can structured schemas embedded in system prompts + few-shot examples now outperform harness-layer protocols on field-filling accuracy? (b) Do smaller, fine-tuned agents (SLMs per 2025-06 literature) exhibit *less* completion bias than frontier models, because they're trained on narrower task distributions?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
