Can search agent behavior yield reliable process rewards for reasoning?

How can we extract meaningful supervision signals from what language models actually read and cite during reasoning, rather than relying on expensive human annotation or outcome-only rewards?

Synthesis note · 2026-06-03 · sourced from RLVR

RLVR for long-context reasoning has two weak points: distractors built by random sampling or one-shot search are too easy and rarely confusable with the supporting evidence, and outcome-only rewards give no signal about how the model reasoned through the context. "LongTraceRL" (2605.31584, THU-KEG) addresses both by mining search-agent trajectories. For data, it generates multi-hop questions via knowledge-graph random walks and then tiers distractors by what the agent did. Documents the agent read but did not cite are high-confusability distractors (topically adjacent, genuinely tempting), while documents that appeared in search results but were never opened are low-confusability. The result is a difficulty gradient that random sampling cannot produce. For reward, it uses the gold entities along each reasoning chain as fine-grained, entity-level process supervision, scoring whether the reasoning trace actually touched the evidence it should have.
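The distractor tiering described above can be sketched with plain set operations. This is a minimal illustration, not LongTraceRL's actual pipeline: the `Trajectory` record, its field names, and the `tier_distractors` helper are all hypothetical, and the sketch assumes the trajectory log exposes which documents were retrieved, opened, and cited.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    """Hypothetical summary of one search-agent run (field names are assumptions)."""
    retrieved: set  # doc ids that appeared in any search result list
    opened: set     # doc ids the agent actually read
    cited: set      # doc ids the agent cited in its final answer


def tier_distractors(traj: Trajectory) -> dict:
    """Split non-cited documents into confusability tiers.

    high: read but not cited (the agent looked and rejected them)
    low:  surfaced in results but never opened
    """
    high = traj.opened - traj.cited
    # Exclude cited docs too, in case a cited doc was never logged as opened.
    low = traj.retrieved - traj.opened - traj.cited
    return {"high": sorted(high), "low": sorted(low)}


# Usage with made-up document ids.
example = Trajectory(
    retrieved={"d1", "d2", "d3", "d4"},
    opened={"d1", "d2"},
    cited={"d1"},
)
tiers = tier_distractors(example)
```

Here `d2` lands in the high tier because the agent opened it and still left it out of the citations, while `d3` and `d4` land in the low tier because they were never opened.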

The design choice that makes the rubric safe is positive-only gating: the rubric reward is applied only to responses whose final answer is correct, so it ranks reasoning quality among correct answers rather than handing out partial credit to wrong ones. That is a structural anti-gaming move, because it removes the incentive to fabricate rubric-satisfying intermediate steps without reaching the right answer. It is a concrete instance of the problem in "How can rubric-based rewards resist reward hacking attacks?": gating the rubric behind answer correctness is exactly the kind of structural defense that paper argues single rubrics need. It is also a sibling to "Can breaking down instructions into checklists improve AI reward signals?": both decompose an unverifiable quality into verifiable sub-criteria, but LongTraceRL anchors them to gold entities rather than instruction clauses.
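A minimal sketch of positive-only gating follows. The note does not give the paper's exact reward formula, so everything specific here is an assumption: case-insensitive substring matching stands in for entity detection, the rubric score is added to the outcome reward, and the function names and `rubric_weight` value are hypothetical.

```python
def entity_coverage(trace: str, gold_entities: list) -> float:
    """Fraction of gold entities mentioned in the reasoning trace.

    Assumption: case-insensitive substring matching stands in for
    whatever entity matching the real rubric uses.
    """
    if not gold_entities:
        return 0.0
    text = trace.lower()
    hits = sum(1 for entity in gold_entities if entity.lower() in text)
    return hits / len(gold_entities)


def gated_reward(
    answer_correct: bool,
    trace: str,
    gold_entities: list,
    outcome_reward: float = 1.0,
    rubric_weight: float = 0.5,  # hypothetical weight, not from the paper
) -> float:
    """Positive-only rubric: wrong answers earn nothing from the rubric."""
    if not answer_correct:
        # No partial credit, however many gold entities the trace mentions.
        return 0.0
    return outcome_reward + rubric_weight * entity_coverage(trace, gold_entities)


# Usage with made-up data.
gold = ["Marie Curie", "Warsaw"]
full = gated_reward(True, "Marie Curie was born in Warsaw.", gold)
partial = gated_reward(True, "Marie Curie won two Nobel Prizes.", gold)
wrong = gated_reward(False, "Marie Curie was born in Warsaw.", gold)
```

Because the gate returns zero for any incorrect answer, a trace that name-drops every gold entity but misses the answer scores no better than an empty one, which is the anti-gaming property described above.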

Most distinctive is that the supervision signal is harvested from agent behavior rather than annotated. This puts it alongside "Can trajectory structure replace hand-annotated process rewards?" (which derives step rewards from tree topology) and "Can RL agents learn to reason better, not just succeed?" (which tags planning and reflection). The converging pattern is that the trajectory itself (its citation choices, its branch structure, its metacognitive tags) becomes the cheap, verifiable substrate for process reward, replacing hand-built process reward models.

Relevant Notes

How can rubric-based rewards resist reward hacking attacks? — positive-only gating is the structural anti-hacking defense this calls for
Can trajectory structure replace hand-annotated process rewards? — same harvest-from-trajectory pattern; structural feature is citation behavior vs tree topology
Can breaking down instructions into checklists improve AI reward signals? — sibling decomposition; anchored to gold entities rather than instruction clauses
Can RL agents learn to reason better, not just succeed? — trajectory-derived process supervision via metacognitive tags

Inquiring lines that read this note 16

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

How do we evaluate AI systems when user perception misleads actual performance?

How does machine feedback enable discovery at test time?

How does memorization interact with learning and generalization?

Can experimental outcomes be reliably distilled into reusable insights?

Do corrupted reasoning traces serve as effective supervision signals?

What makes some reasoning traces better supervision than others despite equal accuracy?

How can process reward models supervise complex reasoning traces?

Why do agents confidently report success despite actually failing tasks?

Why do reward structures fail to shape long-term agent learning?

Do information gathering and task execution require different incentive structures?

Why do readers trust citations and complexity regardless of accuracy?

How much does citation grounding help if agents ignore the citations?

How does latent reasoning compare to verbalized chain-of-thought?

How do you supervise reasoning that never becomes tokens?

What properties determine whether reward signals teach genuine reasoning?

Do reasoning traces actually make better reward models for grading answers?

Why does verification consistently lag behind AI generation?

How can agents verify research artifacts faster than they generate them?

How do evaluation mechanisms prevent error accumulation in autonomous research systems?

How does executable evaluation feedback sustain autonomous discovery at scale?

How does objective evolution guide discovery better than fixed planning?

Can moving or evolving objectives prevent misalignment in discovery agents?

Related papers in this collection 8

Papers most semantically related to this note, ranked by cosine similarity in the embedding space.

LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards (0.88 match)
StepWiser: Stepwise Generative Judges for Wiser Reasoning (0.85 match)
Training Large Language Models for Reasoning through Reverse Curriculum Reinforcement Learning (0.85 match)
Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses (0.85 match)
RM-R1: Reward Modeling as Reasoning (0.84 match)
Reasoning Language Models: A Blueprint (0.84 match)
Reward Reasoning Model (0.84 match)
Do Theory of Mind Benchmarks Need Explicit Human-like Reasoning in Language Models? (0.84 match)

Original note title

process reward for long-context reasoning can be mined from search-agent trajectories — documents read but not cited are the hardest distractors and entity-level rubrics scored only on correct answers block reward hacking
