Task Planning

Can delegation teach models to manage context more actively?

Does training models to decompose tasks and delegate to subagents, rather than passively compressing context once it fills up, improve their ability to reason over long horizons? And does this skill transfer to single-agent work?

Can command generation replace intent classification in dialogue systems?

Explores whether generating pragmatic commands in a domain-specific language (DSL) could outperform traditional intent classification for task-oriented dialogue, particularly in training data requirements and scalability.

Can LLMs actually forecast time series better than we think?

Explores whether language models possess stronger forecasting ability than current benchmarks suggest, and what role workflow design plays in revealing or hiding that capability.

Can large language models actually create executable plans?

Do LLMs genuinely assemble plans that work, or just generate planning-domain knowledge that sounds coherent? Understanding this distinction matters for deploying AI in real planning tasks.

Can decomposing forecasting into stages unlock numerical and contextual reasoning?

Explores whether breaking time-series forecasting into separate stages (contextualization, dual-resolution outlook, and synthesis) lets systems combine the strengths of numerical models and language models more effectively than either alone.

Does tree depth automatically produce supervision at multiple granularities?

Tree-search rollouts branch at different depths, potentially creating supervision signals ranging from coarse strategy-level to fine-grained detail-level choices. Does this depth variation naturally yield multi-granular process supervision without explicit annotation design?

Can shared-prefix trees reduce redundancy in agent rollouts?

How much of LLM few-shot ability comes from training data?

Can tree structure alone convert outcome rewards into process supervision?

Why do unified image generators fail on non-Latin scripts?