INQUIRING LINE

Can influence estimation identify the most valuable trajectories in agentic training?

This explores whether the techniques we use to score which *data* matters for training — gradient-based influence estimation — can be turned on whole agent *trajectories* to find the handful that actually teach the model something, and what the corpus suggests would make that hard or easy.


This explores whether influence estimation — measuring how much a given training example moves the model — can pick out the most valuable trajectories when training agents, rather than just feeding the model everything it generates. The corpus has a strong anchor for the *data-selection* half of this and several notes that complicate the *trajectory* half.

The clearest yes comes from gradient-based selection. LESS uses low-rank gradient features to find the 5% of instruction examples whose learning signal most resembles a target capability, and training on just that slice beats training on the full dataset (see: Can we train better models on less data?). The reason matters for agents: mixed datasets contain examples that *actively hurt* a skill by dragging the model's reasoning strategy in the wrong direction. So influence estimation isn't only about finding gold; it's also about removing trajectories that quietly cost you. That's exactly the failure mode an agent's own rollouts would produce, since many of them are likely to be mediocre or misleading.
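The selection step described above can be sketched in a few lines. This is a minimal illustration, not the LESS implementation: the note only says that LESS uses low-rank gradient features, so the random projection, the cosine scoring, and every name here (`select_top_fraction`, `proj_dim`, the toy gradient vectors) are assumptions made for the example.

```python
import math
import random

def random_projection(dim_in, dim_out, seed=0):
    """Fixed random matrix that compresses a gradient to dim_out features (assumed)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(dim_out)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(dim_in)]
            for _ in range(dim_out)]

def project(matrix, grad):
    """Matrix-vector product: one low-dimensional feature vector per gradient."""
    return [sum(w * g for w, g in zip(row, grad)) for row in matrix]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def select_top_fraction(candidate_grads, target_grad, fraction=0.05,
                        proj_dim=8, seed=0):
    """Rank candidates by similarity to the target gradient; keep the top slice.

    Returns (selected_indices, scores). A negative score marks a candidate
    whose gradient points away from the target, i.e. one that may hurt the skill.
    """
    proj = random_projection(len(target_grad), proj_dim, seed)
    target_feat = project(proj, target_grad)
    scores = [cosine(project(proj, g), target_feat) for g in candidate_grads]
    k = max(1, int(len(candidate_grads) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k], scores
```

For trajectories, the same scores could in principle serve two purposes: the top slice is what you keep, and strongly negative scores flag rollouts worth dropping. Whether per-trajectory gradients behave like per-example instruction gradients is an open question the note does not settle.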

But 'valuable' for a trajectory isn't the same as 'similar to a target,' and other notes suggest the signal you'd estimate influence *over* can come from a trajectory's internal structure rather than the whole thing. Tree-GRPO, Supervised RL, and ToolPO each turn a sparse outcome reward into dense per-step credit by reading structural features: tree branch points, expert-aligned actions, tool-call positions (see: Can trajectory structure replace hand-annotated process rewards?). And cross-rollout variance does double duty as both a token-weighting reward and a query-level filter that throws out degenerate comparisons (see: Can one statistical measure serve dual purposes in RL training?). Both hint that the most informative parts of a trajectory are localized, which is good news for influence estimation, because it means you may not need to score entire rollouts uniformly.
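The query-level half of that idea is easy to illustrate. The sketch below is not DRO's implementation. It assumes each query has several scored rollouts, drops queries whose scores barely vary (nothing to compare), and normalizes the rest within their group; `filter_and_normalize` and `min_variance` are hypothetical names chosen for the example.

```python
from statistics import mean, pvariance

def filter_and_normalize(rollout_scores, min_variance=1e-6):
    """Drop degenerate queries, then center and scale scores within each query.

    rollout_scores maps a query id to the list of scores of its rollouts.
    Returns a dict of query id -> group-normalized scores for the kept queries.
    """
    kept = {}
    for query_id, scores in rollout_scores.items():
        if len(scores) < 2:
            continue  # a single rollout gives no comparison at all
        var = pvariance(scores)
        if var < min_variance:
            continue  # every rollout scored alike: a degenerate comparison
        mu = mean(scores)
        std = var ** 0.5
        kept[query_id] = [(s - mu) / std for s in scores]
    return kept
```

Usage, with toy scores: `filter_and_normalize({"q1": [1.0, 1.0, 1.0], "q2": [0.0, 1.0]})` keeps only `"q2"`, because the three identical scores for `"q1"` carry no relative signal.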

Two notes raise warnings worth knowing about. First, RL itself only updates 5–30% of parameters, in sparse but nearly full-rank subnetworks that are stable across random seeds (see: Does reinforcement learning update only a small fraction of parameters?), so 'influence' is already concentrated structurally, and a good estimator would need to align with *where* learning actually happens, not where you'd naively expect. Second, influence concentrates where dependencies converge: in multi-agent workflows, signals injected at high-influence positions propagate far further than the same signal elsewhere (see: How does workflow position shape attack propagation in multi-agent systems?). The same property that makes a trajectory valuable to learn from makes it dangerous if it's wrong.
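To make the first warning concrete, here is one simple way to measure how concentrated an update is: compare two checkpoints and count the parameters that moved. This is an illustrative sketch only. Checkpoints are represented as plain dicts of flat float lists, and `update_fraction` and `tol` are hypothetical names, not anything taken from the cited work.

```python
def update_fraction(before, after, tol=0.0):
    """Fraction of scalar parameters whose value changed by more than tol.

    before and after map parameter names to equal-length lists of floats.
    """
    changed = 0
    total = 0
    for name, old_values in before.items():
        new_values = after[name]
        for old, new in zip(old_values, new_values):
            total += 1
            if abs(new - old) > tol:
                changed += 1
    return changed / total if total else 0.0

def changed_positions(before, after, tol=0.0):
    """Set of (name, index) pairs that moved; useful for comparing two runs."""
    return {
        (name, i)
        for name, old_values in before.items()
        for i, (old, new) in enumerate(zip(old_values, after[name]))
        if abs(new - old) > tol
    }
```

Comparing `changed_positions` across two seeds (for instance, by set overlap) is one rough way to check whether the updated subnetwork is stable, which is the property an influence estimator would need to respect.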

The deeper tension is about what's even in the pool to estimate over. Trajectories drawn from static expert demonstrations cap competence at what the curators imagined, because agents never see their own failures (see: Can agents learn beyond what their training data shows?). The promising counter-move is the 'early experience' paradigm, where agents treat the consequences of their *own* actions as supervision, matching expert-dependent baselines on half the data (see: Can agents learn from their own actions without external rewards?). Put those together with LESS and you get the real prize the question is circling: influence estimation matters most not for ranking a fixed dataset, but for filtering the flood of self-generated trajectories an agent produces, keeping the few that expand competence and discarding the many that just reinforce what it already does.


Sources 7 notes

Can we train better models on less data?

LESS uses low-rank gradient features to select instruction data most similar to target capabilities, and training on the selected 5% consistently outperforms full dataset training. The improvement occurs because mixed datasets contain examples that actively hinder specific skills by shifting reasoning strategy away from task requirements.

Can trajectory structure replace hand-annotated process rewards?

Tree-GRPO, Supervised RL, and ToolPO each convert sparse outcome rewards into dense step signals by exploiting different structural features—tree topology, expert-aligned actions, and tool-call positions—eliminating the need for annotated process reward models.

Can one statistical measure serve dual purposes in RL training?

DRO reuses a single self-supervised statistic at two aggregation levels: token-level weighting in dense rewards and query-level filtering to discard degenerate comparisons. This dual use achieves 2–3× faster training with better stability on unverifiable tasks.

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL induces intrinsic parameter sparsity of 5–30% without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.

How does workflow position shape attack propagation in multi-agent systems?

FLOWSTEER demonstrates that malicious signals propagate farther when injected into high-influence subtasks, and that framing them as evidence rather than instruction causes downstream agents to relay them. Influence concentrates where dependencies converge, making position-aware attacks far more effective.

Can agents learn beyond what their training data shows?

Agents trained on static expert datasets cannot learn from their own failures or generalize beyond demonstrated scenarios because they never interact with environments during training. Competence is capped by what curators imagined, not by agent capacity.

Can agents learn from their own actions without external rewards?

Research across eight environments shows that agents can use future states from their own actions as supervision without external rewards, matching expert-dependent baselines with half the data and providing superior warm-starts for subsequent RL training.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a researcher auditing whether influence estimation can identify valuable trajectories in agentic RL training. The question remains open.

What a curated library found — and when (dated claims, not current truth):
These findings span Feb 2024–May 2026. A library of recent arXiv papers reports:
• Gradient-based influence (LESS, 2024) selects ~5% of instruction data whose learning signal matches target skills; training on that slice beats full-dataset baselines, suggesting influence can filter mixed/harmful trajectories (~2024).
• Trajectory value concentrates in *localized structure* (tree branch points, expert-aligned actions, tool calls) rather than uniform rollout scoring; per-step credit assignment from structural features outperforms whole-trajectory reward (~2025).
• RL updates only 5–30% of model parameters in sparse, nearly full-rank subnetworks stable across seeds (~2025); influence estimation must align with *where* learning actually happens, not intuitive positions.
• Early-experience paradigm (agents learning from own consequences, not just static demos) matches expert baselines on ~half the data; self-generated trajectories vastly outnumber curated ones, making filtering urgent (~2025).
• Influence amplifies at high-dependency positions in multi-agent workflows; the property that makes a trajectory valuable also makes errors propagate far (~2026).

Anchor papers (verify; mind their dates):
• arXiv:2402.04333 (LESS, 2024)
• arXiv:2510.08558 (Agent Learning via Early Experience, 2025)
• arXiv:2507.07484 (RL Finetunes Small Subnetworks, 2025)
• arXiv:2605.25459 (Post-trained Models Recognize Own Behavior, 2026)

Your task:
(1) RE-TEST EACH CONSTRAINT. For gradient-based influence selection, tree-structured credit, parameter sparsity, and early-experience filtering: has newer model scale, constitutional training, or multi-agent orchestration since relaxed or overturned these limits? Distinguish durable ("which trajectory examples drive skill growth?") from perishable ("influence is sparse / 5% suffices").
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months—especially any showing influence estimation *fails* on agent self-rollouts or that trajectory value is *not* structurally localized.
(3) Propose 2 research questions that assume the regime has moved: e.g., does influence estimation transfer across agent architectures? Can it predict which self-generated trajectories generalize beyond the training task?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
