Can shared-prefix trees reduce redundancy in agent rollouts?
Independent rollouts waste tokens regenerating similar early-turn sequences. Can structuring rollouts as shared-prefix trees instead preserve early computation across samples while maintaining statistical diversity for advantage estimation?
Agent rollouts are expensive. Multi-turn agentic tasks produce trajectories with thousands of tokens and many tool calls per rollout. Group-based RL methods like GRPO sample multiple independent trajectories per task and use the group statistics for advantage estimation. The standard implementation samples each trajectory independently, from the same starting prompt — meaning every trajectory begins by re-generating the same early-turn context.
The redundancy is substantial. If task setup, initial planning, and the first few tool calls are similar across rollouts (often the case, because they all start from the same prompt), then each independent rollout pays the token cost for the early turns again, even though the model would produce nearly the same early sequence each time. The compute is real; the information added per rollout is small in the early turns.
Tree-GRPO restructures this. Rollouts share common prefixes by design — the tree starts as a single trunk and branches at decision points. Compute spent on the trunk is amortized across all leaf trajectories. The same total token budget that produces N independent chain-based rollouts can produce more than N leaf trajectories under tree sampling, because the branches diverge late while sharing the early context.
The empirical consequence is twofold. First, more distinct trajectories per fixed cost means better statistics for advantage estimation — the noise in group-relative comparisons decreases as the effective N grows. Second, the same budget can train on harder tasks where the trajectory length itself was the bottleneck — long trajectories with shared early planning fit into budgets that independent rollouts cannot accommodate.
The pattern generalizes beyond Tree-GRPO. Anywhere RL training samples multiple trajectories from a shared starting point, shared-prefix sampling saves compute. Speculative decoding has the analog at the inference layer. The unifying principle: when starting state is shared, compute up to the divergence point is amortizable.
Inquiring lines that use this note as a source 20
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- Does the reversal curse stem from the same one-way commitment architecture?
- Why do zero-advantage rollouts destabilize training beyond just wasting compute?
- Why does multi-turn RL generate orders of magnitude more tokens than single-turn?
- Do multi-agent systems justify their token costs with genuine quality gains?
- Does upgrading model capability improve token efficiency in agentic systems?
- Can construction-time routing and runtime agent pruning be combined effectively?
- How does shared-memory parallelism compare to independent sampling and turn-based debate?
- How do shared KV caches enable emergent coordination between LLM agents?
- Which tokens actually change across different reasoning paths in rollouts?
- What metrics replace throughput per token for agent deployment?
- How do cache-dominant workflows change the marginal cost of agent tasks?
- Can two agents with identical token counts produce vastly different outputs?
- Why does group-relative normalization make uniform episode rewards work across rollouts?
- What makes reasoning tokens identifiable within rollout groups for better rewards?
- How do tree rollouts convert outcome rewards into step-wise process supervision?
- How much does shared-prefix sampling reduce token redundancy empirically?
- What is the relationship between prefix sharing and speculative decoding?
- Why do tree-search rollouts require fewer tokens than independent chain-based rollouts?
- How does branching depth in tree rollouts determine process supervision granularity?
- How do agent teams use shared failures to reduce redundant exploration?
Related concepts in this collection 2
This note in its neighbourhood — explore the map, then jump to a related concept in the list below.
Click a node to walk · click center to open · click Open in graph to see this note in the full knowledge graph
-
Can tree structure alone convert outcome rewards into process supervision?
Tree-based rollouts naturally create step-level preference signals by comparing sibling subtrees. Can this structural approach replace separate process reward models without explicit step-level annotation?
same paper, the parent mechanism that makes the budget gain useful
-
Does tree depth automatically produce supervision at multiple granularities?
Tree-search rollouts branch at different depths, potentially creating supervision signals ranging from coarse strategy-level to fine-grained detail-level choices. Does this depth variation naturally yield multi-granular process supervision without explicit annotation design?
same paper, the orthogonal property
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Tree Search for LLM Agent Reinforcement Learning
- Towards a Science of Scaling Agent Systems
- Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents
- How we built our multi-agent research system
- Scaling Laws for Agent Harnesses via Effective Feedback Compute
- Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets
- Beyond the Exploration-Exploitation Trade-off: A Hidden State Approach for LLM Reasoning in RLVR
- From Trial-and-Error to Improvement: A Systematic Analysis of LLM Exploration Mechanisms in RLVR
Original note title
shared-prefix tree rollouts dramatically expand the effective sample budget for agent RL — same token cost yields more distinct trajectories than independent chain-based rollouts