Action Models

Does agent memory work better at one level of abstraction?

Three competing architectures claim superior agent memory transfer using different abstraction levels. Do they all work, or does one architecture genuinely outperform the others across domains?

Can agents learn reusable sub-task routines from past experience?

Do web agents fail at long-horizon tasks because they cannot extract and reuse workflows shared across similar problems? This explores whether sub-task abstraction enables skill accumulation rather than task-by-task problem solving.

What blocks scaling from language models to autonomous agents?

If large language models excel at next-token prediction, why do they struggle with long-horizon goal-oriented tasks? This explores whether the bottleneck is model capacity or the environments used to train them.

Does constraining edits help agents improve their own skills?

When agents rewrite their own instructions, does freedom to edit lead to better learning, or do safeguards like edit budgets and memory of failures produce more stable improvement?
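One way to picture the safeguards mentioned above is a small gate that every proposed rewrite must pass. The sketch below is illustrative only: `changed_lines`, `accept_edit`, the `failed` set, and the `budget` parameter are hypothetical names, and counting added plus removed lines is just one assumed way to measure edit size.

```python
import difflib
from typing import Set


def changed_lines(old: str, new: str) -> int:
    """Count added plus removed lines between two instruction texts."""
    diff = difflib.ndiff(old.splitlines(), new.splitlines())
    # ndiff marks additions with "+ " and removals with "- ";
    # unchanged lines ("  ") and hint lines ("? ") are ignored.
    return sum(1 for line in diff if line.startswith(("+ ", "- ")))


def accept_edit(old: str, new: str, failed: Set[str], budget: int) -> bool:
    """Hypothetical gate: reject remembered failures and over-budget edits."""
    if new in failed:  # memory of failures: never retry a known-bad rewrite
        return False
    return changed_lines(old, new) <= budget  # edit budget
```

A caller would add a rewrite to `failed` whenever it later performs worse than the text it replaced, so the same mistake is not proposed twice.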

Can frozen language models continually improve through memory structure alone?

If agents can't update parameters, what form of textual memory lets them keep learning across trials and transfer to new tasks without retraining?

Can LLMs generate workflows without touching proprietary data?

This explores whether LLMs can orchestrate task automation by composing API calls rather than directly accessing confidential information, and whether this approach preserves security while handling unpredictable tasks.

Can you turn an LLM into an agent by just fine-tuning?

This explores whether upgrading language models to action-producing systems requires only model retraining or demands a broader pipeline transformation including data collection, grounding, integration, and safety evaluation.

Can separating causal models from language models improve reasoning?

Can an explicit formal causal model paired with an LLM translator overcome both spurious correlation reasoning and reward-without-explanation problems in RL? This explores whether dividing reasoning labor between systems addresses fundamental weaknesses in each.

What makes synthetic data work across different domains and models?

This explores whether a single optimal approach to synthetic data generation exists, or whether success depends on context such as domain, model architecture, and scale. Understanding this matters for building effective data systems.

Can skill documents be optimized like neural network weights?

Can natural-language skill documents be treated as trainable parameters and improved through iterative optimization with validation gating, similar to how model weights are tuned in deep learning?
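The analogy to weight tuning can be sketched as a loop that proposes a revision and keeps it only when a validation score improves. This is a minimal sketch under stated assumptions, not a method taken from any particular system: `optimize_skill`, `propose_edit`, and `score` are hypothetical names, and in practice the proposer would be an LLM call and the score a held-out task success rate.

```python
from typing import Callable


def optimize_skill(
    skill: str,
    propose_edit: Callable[[str], str],
    score: Callable[[str], float],
    steps: int = 5,
) -> str:
    """Keep a candidate revision only if it beats the current validation score."""
    best, best_score = skill, score(skill)
    for _ in range(steps):
        candidate = propose_edit(best)      # hypothetical proposer, e.g. an LLM
        candidate_score = score(candidate)  # hypothetical validation metric
        if candidate_score > best_score:    # validation gate
            best, best_score = candidate, candidate_score
    return best
```

The strict `>` comparison means a revision that merely ties is discarded, which keeps the document from drifting without measurable gain; whether ties should be accepted is a design choice.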

Why does random tool sampling produce unrealistic synthetic training data?

Tool-calling datasets generated through random sampling and single-turn framing lack the complexity and coherence of real deployment. This explores what structural choices in data synthesis determine whether models can learn realistic tool composition.