Can search agent behavior yield reliable process rewards for reasoning?
How can we extract meaningful supervision signals from what language models actually read and cite during reasoning, rather than relying on expensive human annotation or outcome-only rewards?
Reinforcement learning with verifiable rewards (RLVR) for long-context reasoning has two weak points: distractors built by random sampling or one-shot search are too easy to tell apart from the real evidence, and outcome-only rewards give no signal about how the model reasoned through the context. "LongTraceRL" (2605.31584, THU-KEG) addresses both by mining a search-agent trajectory. For data, it generates multi-hop questions via knowledge-graph random walks and then tiers distractors by what the agent did: documents it read but did not cite are high-confusability distractors (topically adjacent, genuinely tempting), while documents that appeared in search results but were never opened are low-confusability. The result is a difficulty gradient that random sampling cannot produce. For reward, it uses the gold entities along each reasoning chain as fine-grained, entity-level process supervision, scoring whether the trace actually touched the evidence it should have.
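The tiering rule described above can be sketched in a few lines. This is a minimal illustration, not the paper's code: the Trajectory record, its field names, and the tier_distractors helper are hypothetical, and it assumes each trajectory logs which documents were retrieved, opened, and cited.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    """Hypothetical log of one search-agent run (field names are assumptions)."""
    retrieved: set  # doc ids that appeared in any search result
    opened: set     # doc ids the agent actually read
    cited: set      # doc ids the agent cited in its final answer


def tier_distractors(traj: Trajectory) -> dict:
    """Split non-cited documents into confusability tiers.

    high: read but not cited (topically adjacent, tempting)
    low:  surfaced in results but never opened
    """
    high = traj.opened - traj.cited
    low = traj.retrieved - traj.opened - traj.cited
    return {"high": sorted(high), "low": sorted(low)}


traj = Trajectory(
    retrieved={"d1", "d2", "d3", "d4", "d5"},
    opened={"d1", "d2", "d3"},
    cited={"d1"},
)
tiers = tier_distractors(traj)
```

Here d2 and d3 land in the high tier because the agent read them and then passed over them, while d4 and d5 land in the low tier because they were never opened.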
The design choice that makes the rubric safe is that it is positive-only: the rubric reward is applied only to responses whose final answer is correct, so it ranks reasoning quality among correct answers rather than handing out partial credit to wrong ones. That is a structural anti-gaming move, because it removes the incentive to fabricate rubric-satisfying intermediate steps without reaching the right answer. This is a concrete instance of the problem in "How can rubric-based rewards resist reward hacking attacks?": gating the rubric behind answer correctness is exactly the kind of structural defense that paper argues single rubrics need. It is a sibling to "Can breaking down instructions into checklists improve AI reward signals?": both decompose an unverifiable quality into verifiable sub-criteria, but LongTraceRL anchors them to gold entities rather than instruction clauses.
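A rough sketch of positive-only gating follows. The note does not give the paper's exact formula, so everything specific here is an assumption made for illustration: the function names, the case-insensitive substring match used to detect an entity in the trace, the additive combination, and the rubric_weight value.

```python
def rubric_score(trace: str, gold_entities: list) -> float:
    """Fraction of gold entities mentioned in the reasoning trace.

    Assumption: a case-insensitive substring match stands in for
    whatever entity matching the real system uses.
    """
    if not gold_entities:
        return 0.0
    text = trace.lower()
    hits = sum(1 for entity in gold_entities if entity.lower() in text)
    return hits / len(gold_entities)


def positive_only_reward(answer_correct: bool, trace: str,
                         gold_entities: list,
                         rubric_weight: float = 0.5) -> float:
    """Outcome reward plus a rubric bonus that is gated on correctness.

    Wrong answers get 0.0 no matter how many gold entities the trace
    mentions, so the rubric only ranks correct responses.
    """
    if not answer_correct:
        return 0.0
    return 1.0 + rubric_weight * rubric_score(trace, gold_entities)


gold = ["Marie Curie", "Sorbonne", "polonium", "Warsaw"]
good = positive_only_reward(True, "Marie Curie, born in Warsaw, ...", gold)
hacked = positive_only_reward(
    False, "Marie Curie, Sorbonne, polonium, Warsaw", gold)
```

Under these assumptions the correct response that touches two of the four gold entities scores above a correct response that touches none, while the wrong response that lists every entity still scores zero, which is the anti-gaming property the gating is meant to provide.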
Most distinctive is that the supervision signal is harvested from agent behavior rather than annotated. This puts it alongside "Can trajectory structure replace hand-annotated process rewards?" (which derives step rewards from tree topology) and "Can RL agents learn to reason better, not just succeed?" (which tags planning and reflection). Together they form a converging pattern in which the trajectory itself (its citation choices, its branch structure, its metacognitive tags) becomes the cheap, verifiable substrate for process reward, replacing hand-built process reward models.
Relevant Notes
- How can rubric-based rewards resist reward hacking attacks? — positive-only gating is the structural anti-hacking defense this calls for
- Can trajectory structure replace hand-annotated process rewards? — same harvest-from-trajectory pattern; structural feature is citation behavior vs tree topology
- Can breaking down instructions into checklists improve AI reward signals? — sibling decomposition; anchored to gold entities rather than instruction clauses
- Can RL agents learn to reason better, not just succeed? — trajectory-derived process supervision via metacognitive tags
Inquiring lines that use this note as a source (7)
This note is a source for these synthesized inquiries.
- How does machine feedback enable discovery at test time?
- Can experimental outcomes be reliably distilled into reusable insights?
- What makes some reasoning traces better supervision than others despite equal accuracy?
- Do process reward models need different supervision strategies by domain?
- Can trajectory structure replace hand-annotated process reward models entirely?
- What other agent behaviors besides citations reveal reasoning quality?
- Do information gathering and task execution require different incentive structures?
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards
- StepWiser: Stepwise Generative Judges for Wiser Reasoning
- Training Large Language Models for Reasoning through Reverse Curriculum Reinforcement Learning
- Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses
- RM-R1: Reward Modeling as Reasoning
- Reasoning Language Models: A Blueprint
- Reward Reasoning Model
- Do Theory of Mind Benchmarks Need Explicit Human-like Reasoning in Language Models?
Original note title
process reward for long-context reasoning can be mined from search-agent trajectories — documents read but not cited are the hardest distractors and entity-level rubrics scored only on correct answers block reward hacking