What makes trajectory more actionable than absolute scores for human moderators?
This explores why showing moderators *how a decision unfolded* (the trajectory of steps) is more useful than handing them a single final score — and what the corpus says about scalar judgments losing the information humans actually need to act.
This reads the question as: a single number tells a moderator *whether* something was good or bad, but a trajectory tells them *where* and *how* it went wrong, and the corpus is surprisingly unified that the second kind of information is what makes intervention possible. The clearest statement of the gap comes from work showing that feedback decomposes into two orthogonal channels: an *evaluative* signal (how well an action did) and a *directive* signal (how it should change). A scalar reward captures the first and silently discards the second (see "Can scalar rewards capture all the information in agent feedback?"). An absolute score is pure evaluation with the directional content stripped out, which is exactly the part a human moderator needs to do anything other than approve or reject.
The same shortfall shows up wherever numbers hit a ceiling. Models stuck on a performance plateau under purely numerical rewards can produce correct solutions once they receive chain-of-thought critiques, because the number never carried *why* the failure happened or *how* to fix it (see "Can natural language feedback overcome numerical reward plateaus?"). That's the machine-side mirror of a moderator's problem: an outcome score is a verdict without a reason. Process-level supervision makes the contrast concrete: grading the intermediate retrieval steps of an agent substantially beats grading only the final answer, precisely because contrasting good and bad steps localizes the error to a place you can act on (see "Does supervising retrieval steps outperform final answer rewards?"). A trajectory is a chain of those steps; a score is the chain collapsed to a single number.
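The contrast between a collapsed score and a step-level view can be sketched in a few lines. This is a minimal illustration, not any cited system's method: the `Step` structure, the per-step scores, and the 0.5 threshold are all hypothetical, and the step scores are assumed to come from some grader.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    """One step in a trajectory (hypothetical structure for illustration)."""
    action: str
    score: float  # step-level quality in [0, 1]; assumed to come from some grader


def final_score(steps: List[Step]) -> float:
    """Outcome-only view: keep the last step's score and drop the rest."""
    return steps[-1].score


def first_bad_step(steps: List[Step], threshold: float = 0.5) -> Optional[int]:
    """Trajectory view: index of the first step scoring below threshold, or None."""
    for i, step in enumerate(steps):
        if step.score < threshold:
            return i
    return None


trajectory = [
    Step("parse question", 0.9),
    Step("retrieve documents", 0.2),
    Step("draft answer", 0.3),
]

outcome = final_score(trajectory)       # 0.3: says the result was poor
where = first_bad_step(trajectory)      # 1: says retrieval is where it went wrong
```

The outcome score only reports that the answer was weak; the step-level view points at the retrieval step, which is the part a reviewer can act on.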
Trajectory also changes *when and where* a human should step in, not just what they see. Targeted intervention at high-leverage decision points dramatically outperformed both full autonomy and exhaustive step-by-step oversight (87.5% acceptance vs. 25% and 50%), and you can only target the leverage points if you can see the path of decisions, not just the endpoint (see "Does targeted human intervention outperform both full autonomy and exhaustive oversight?"). There's a related asymmetry worth knowing: successes and failures aren't equally informative. Treating successful episodes as concrete demonstrations and failures as abstracted lessons outperforms processing them uniformly (see "Should successful and failed episodes be processed differently?"). A moderator reading a trajectory can apply that same asymmetry by hand; a column of identical scores forces uniform treatment.
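One way to picture targeted intervention is a routing rule that interrupts the human only at decisions that are both uncertain and consequential. The sketch below is an assumption-laden illustration, not the routing used by the system cited above: the `Decision` fields, the `needs_review` rule, and both thresholds are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Decision:
    """A decision point in a trajectory (hypothetical fields)."""
    name: str
    confidence: float  # agent's self-reported confidence in [0, 1] (assumed available)
    impact: float      # estimated cost of an error in [0, 1] (assumed available)


def needs_review(d: Decision, min_confidence: float = 0.7, min_impact: float = 0.5) -> bool:
    """Flag only decisions that are both low-confidence and high-impact."""
    return d.confidence < min_confidence and d.impact >= min_impact


def route(decisions: List[Decision]) -> List[str]:
    """Return the names of the decisions a human should look at."""
    return [d.name for d in decisions if needs_review(d)]


decisions = [
    Decision("choose topic", confidence=0.4, impact=0.9),
    Decision("format citations", confidence=0.5, impact=0.1),
    Decision("pick baseline", confidence=0.9, impact=0.8),
]

flagged = route(decisions)  # ["choose topic"]
```

Full autonomy corresponds to flagging nothing and exhaustive oversight to flagging everything; the rule only works because each decision in the path is visible, which a single final score would not provide.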
The corpus also issues a caution that keeps this from being a free lunch. Moving from absolute scores to trajectory-level judgment doesn't *dissolve* the hard evaluation problems (comparability, reproducibility, mapping evidence to a verdict); it relocates them into a higher-dimensional space, and the field still needs shared protocols to make trajectory scoring interpretable (see "Do interactive evaluations actually solve the benchmark comparison problem?"). So trajectory is more *actionable* (it tells you where to intervene and why) without being more *settled* (it's harder, not easier, to score consistently). The thing you didn't know you wanted to know: the same structural richness that makes a trajectory legible to a human is what lets machines manufacture their own dense feedback. Tree search and trajectory topology can generate step-level quality signals that replace the human annotation oracle (see "Can tree search replace human feedback in LLM training?"). The trajectory is actionable for the moderator and the model for the same reason: structure carries direction, and a single score throws the direction away.
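The idea that tree structure can yield step-level signals without human labels can be sketched with a simple backup rule: score each intermediate node by the fraction of successful outcomes beneath it. This is a generic illustration under stated assumptions, not the cited system's algorithm; the `Node` structure and the success-fraction rule are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """A step in a search tree (hypothetical structure)."""
    label: str
    outcome: Optional[bool] = None  # set only on leaves: did this path succeed?
    children: List["Node"] = field(default_factory=list)


def _count(node: Node) -> Tuple[int, int]:
    """Return (successful leaves, total scored leaves) beneath a node."""
    if not node.children:
        if node.outcome is None:
            return 0, 0
        return (1 if node.outcome else 0), 1
    wins = total = 0
    for child in node.children:
        w, t = _count(child)
        wins += w
        total += t
    return wins, total


def value(node: Node) -> float:
    """Step-level signal: fraction of successful leaves under this node."""
    wins, total = _count(node)
    return wins / total if total else 0.0


step_a = Node("step A", children=[Node("a1", outcome=True), Node("a2", outcome=True)])
step_b = Node("step B", children=[Node("b1", outcome=False), Node("b2", outcome=True)])
root = Node("start", children=[step_a, step_b])

# value(step_a) is 1.0 and value(step_b) is 0.5, so step A ranks above step B
# using only final outcomes, with no per-step human annotation.
```

The ranking of intermediate steps falls out of the tree's shape alone, which is the sense in which structure carries direction.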
Sources (7 notes)
Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.
Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.
Fine-grained feedback on intermediate retrieval steps significantly boosts agentic RAG performance compared to final-answer-only rewards. DPO trained with both positive and negative step feedback outperforms PPO and single-direction training by directly contrasting good and bad retrieval chains.
AutoResearchClaw's confidence-routed CoPilot mode achieved 87.5% acceptance, substantially outperforming full autonomy (25%) and step-by-step oversight (50%). The key insight: selective interruption avoids both uncaught critical errors and the coherence degradation caused by constant human interruption.
SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.
Interactive evaluation relocates core problems—comparability, reproducibility, evidence-to-judgment mapping—into higher-dimensional space rather than solving them. The field needs design protocols and shared standards, not format adoption, to make trajectory scoring interpretable.
AlphaLLM uses tree search outcomes and three critic models to derive dense reward signals equivalent to human-labeled feedback. Tree structure naturally ranks solution paths by success, replacing the annotation oracle that standard RLHF requires.