Tool Use and Computer-Use Agents

Can small models match large models on function calling?

Explores whether small language models fine-tuned with the right training method can achieve comparable performance to large models on structured reasoning tasks requiring precise function calls, and what training approach makes this possible.

Can structured reasoning replace code execution for RL rewards?

Can semi-formal templates enable execution-free code verification reliable enough to train RL agents without running code? This matters because execution is expensive and slow in agent training loops.

How can GUI agents adapt when software constantly changes?

Can desktop automation agents stay current by combining real-time web documentation with learned task patterns and concrete execution memories? This explores how to avoid training obsolescence in open-world software environments.

Can breaking function calling into subtasks improve model generalization?

Does training on seven granular function-calling subtasks instead of one umbrella objective close the gap between open-source and proprietary models? This explores whether decomposition surfaces hidden failure modes that unified training misses.

Can structured interfaces help language models control GUIs better?

Explores whether separating visual understanding from element grounding through an intermediate interface layer improves how language models interact with graphical interfaces. This matters because current end-to-end approaches ask models to do too much at once.

Can unlabeled UI video teach models what users intend?

Can temporal masking on screen recordings learn task-aware representations without paired text labels? This matters because labeled UI video is scarce and expensive, so self-supervised learning could unlock scaling.

Can models decide better than retrievers which tools to use?

Traditional retrieval picks tools upfront based on initial queries, but do models themselves make better decisions about tool needs as they reason? This explores whether authority over tool selection should move from external systems to the LLM.

Can structured templates make code reasoning more reliable than free-form thinking?

Unstructured chain-of-thought reasoning lets models skip cases and make unsupported claims. This explores whether semi-formal templates requiring explicit premises, evidence traces, and alternative checks can prevent these failure modes.
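
As a rough illustration of what "requiring explicit premises, evidence traces, and alternative checks" could mean in practice, the sketch below models a template as a record whose sections must all be filled in before a conclusion is accepted. The class name, field names, and the `missing_sections` helper are hypothetical and chosen only for this example; the entry above does not specify a concrete template format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReasoningRecord:
    """One semi-formal reasoning step (hypothetical schema for illustration)."""
    claim: str
    premises: List[str]              # facts the claim depends on
    evidence_trace: List[str]        # e.g. code locations or quoted lines supporting it
    alternatives_checked: List[str]  # competing explanations that were ruled out


def missing_sections(record: ReasoningRecord) -> List[str]:
    """Return the names of required sections that were left empty."""
    missing = []
    if not record.premises:
        missing.append("premises")
    if not record.evidence_trace:
        missing.append("evidence_trace")
    if not record.alternatives_checked:
        missing.append("alternatives_checked")
    return missing


# A record that asserts a claim without citing evidence is flagged.
unsupported = ReasoningRecord(
    claim="The patch preserves behaviour for empty input.",
    premises=["parse() returns [] when given an empty string"],
    evidence_trace=[],
    alternatives_checked=["None input is rejected earlier by the caller"],
)
print(missing_sections(unsupported))
```

A check like this only enforces that each section is present. Whether the contents are correct is the harder problem the question is about.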

Does state-indexed memory outperform high-level workflow memory for web agents?

Should procedural memory for web agents be organized around specific environment states and actions, or abstracted into higher-level workflows? This matters because web automation demands precise, context-sensitive recall that workflows might lose.
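
The contrast between the two memory organizations can be sketched as two lookup structures: one keyed by a concrete environment state, the other keyed by a task-level name. Everything here (the `StateKey` fields, `make_key`, and both memory classes) is an assumption made for illustration, not a description of any particular system.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class StateKey:
    """Hypothetical fingerprint of a page state: a URL path plus visible elements."""
    url_path: str
    elements: Tuple[str, ...]


def make_key(url_path: str, elements: Iterable[str]) -> StateKey:
    # Sort and deduplicate so element order on the page does not change the key.
    return StateKey(url_path, tuple(sorted(set(elements))))


@dataclass
class StateIndexedMemory:
    """Recall is conditioned on the exact state the agent is currently in."""
    entries: Dict[StateKey, List[str]] = field(default_factory=dict)

    def record(self, state: StateKey, action: str) -> None:
        self.entries.setdefault(state, []).append(action)

    def recall(self, state: StateKey) -> List[str]:
        return list(self.entries.get(state, []))


@dataclass
class WorkflowMemory:
    """Recall is conditioned on a task name and returns an abstract step list."""
    workflows: Dict[str, List[str]] = field(default_factory=dict)

    def recall(self, task: str) -> List[str]:
        return list(self.workflows.get(task, []))


state_memory = StateIndexedMemory()
checkout = make_key("/checkout", ["button#pay", "input#card"])
state_memory.record(checkout, "click button#pay")

workflow_memory = WorkflowMemory({"buy item": ["open cart", "enter card", "pay"]})
```

In this sketch the state-indexed store returns actions only when the current page matches a previously seen state, while the workflow store returns the same steps regardless of what is on screen. That is the trade-off between precision and abstraction the question raises.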

Can structured templates replace formal verification for code reasoning?

Formal verification is rigorous but impractical at repository scale. Can natural-language templates with enforced structure provide the same reliability guarantees without the formalization cost? This explores the middle ground between unstructured reasoning and full formalism.