INQUIRING LINE

How do evaluative versus directive signals differ in next-state training?

This explores the distinction made by [[agent-next-state-signals-decompose-into-evaluative-and-directive-information-tha]] — that feedback used to train what a model does next splits into 'how well did that go' (evaluative) versus 'here's how to change it' (directive) — and what the rest of the corpus reveals about training on each kind.


The question reads as follows: when you train a model on what to do next, the training signal comes in two flavors, evaluative (a score telling you how good an action was) and directive (an instruction telling you how the action should change), and the two aren't interchangeable. The core insight from "Can scalar rewards capture all the information in agent feedback?" is that a scalar reward captures the evaluative part but throws away the directive part: a number can say 'that was a 3/10' but not 'you forgot to check the file path first.' The two are orthogonal and complementary, and token-level distillation can recover the directional detail that a reward number flattens away.
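To make the split concrete, here is a minimal sketch on a toy three-token policy. It is an illustration under stated assumptions, not the method of the note above: the three-token vocabulary, the function names, and the one-hot teacher distribution are all hypothetical. It compares the update direction produced by a scalar reward (a REINFORCE-style term) with the one produced by token-level distillation toward a teacher distribution.

```python
import math

def softmax(logits):
    """Convert logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def scalar_reward_direction(logits, action, advantage):
    """Evaluative signal: REINFORCE-style ascent direction on the logits,
    advantage * (onehot(action) - p). Only the sampled action is named."""
    p = softmax(logits)
    return [advantage * ((1.0 if i == action else 0.0) - p_i)
            for i, p_i in enumerate(p)]

def distillation_direction(logits, teacher_probs):
    """Directive signal: descent direction of cross-entropy against a
    teacher distribution q, which on the logits is q - p."""
    p = softmax(logits)
    return [q_i - p_i for q_i, p_i in zip(teacher_probs, p)]

# Hypothetical setup: uniform policy over three tokens; the model sampled
# token 0, which was bad; the teacher would have emitted token 2.
logits = [0.0, 0.0, 0.0]
evaluative = scalar_reward_direction(logits, action=0, advantage=-1.0)
directive = distillation_direction(logits, teacher_probs=[0.0, 0.0, 1.0])

print("evaluative:", [round(g, 3) for g in evaluative])
print("directive: ", [round(g, 3) for g in directive])
```

In this toy case the scalar signal pushes away from token 0 but treats tokens 1 and 2 identically, while the distillation signal singles out token 2. That is the sense in which a number says 'that was bad' without saying what to do instead.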

Once you see that split, a lot of the corpus rearranges itself around it. Pure evaluative training turns out to be lopsided: "Does negative reinforcement alone outperform full reinforcement learning?" finds that training only on 'that was wrong' signals improves Pass@k and often matches full PPO and GRPO, because suppressing bad trajectories preserves diversity while positive-only reinforcement concentrates probability mass and degrades performance at higher k. That's evaluative feedback at its most minimal (just a thumbs-down) and it still works, which hints that the positive half of the evaluative channel contributes less than we assume.
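The asymmetry can be seen in a single gradient step on a toy softmax policy. The sketch below is an assumption-laden illustration, not a reproduction of the cited result: the four candidate answers, their starting probabilities, and the step size are hypothetical, and it shows only what one sampled update does, not the long-run training dynamics.

```python
import math

def softmax(logits):
    """Convert logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def one_step(logits, action, advantage, lr=1.0):
    """Apply one REINFORCE-style step for a single sampled action and
    return the resulting probabilities."""
    p = softmax(logits)
    new_logits = [z + lr * advantage * ((1.0 if i == action else 0.0) - p_i)
                  for i, (z, p_i) in enumerate(zip(logits, p))]
    return softmax(new_logits)

# Hypothetical candidates: A and B are correct, C and D are wrong.
start = [0.4, 0.2, 0.3, 0.1]            # P(A), P(B), P(C), P(D)
logits = [math.log(p) for p in start]

# Positive-only: reward a sampled correct answer (A).
after_positive = one_step(logits, action=0, advantage=+1.0)
# Negative-only: penalize a sampled wrong answer (C).
after_negative = one_step(logits, action=2, advantage=-1.0)

print("P(B) at start:          ", start[1])
print("P(B) after rewarding A: ", round(after_positive[1], 3))
print("P(B) after penalizing C:", round(after_negative[1], 3))
```

Rewarding A lowers the probability of the alternative correct answer B, while penalizing C raises every remaining candidate, B included. This is one plausible reading of why thumbs-down training keeps more diversity per update.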

The limits of scalar evaluation show up most sharply where the thing being judged is subjective. "Can breaking down instructions into checklists improve AI reward signals?" breaks a vague 'how good was this answer' into a checklist of concrete, verifiable sub-criteria, which is really a way of smuggling directive structure into an evaluative signal, so the model learns *what specifically* to fix rather than just *how much* it missed by. And "Does preference optimization harm conversational understanding?" is the cautionary tale: when you optimize purely on a preference score (RLHF rewarding confident single-turn answers), the model learns to look good on the metric while quietly losing the grounding behaviors (asking clarifying questions, checking understanding) that the scalar never measured. Evaluative-only training optimizes what it can score and erodes what it can't.
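A checklist reward can be sketched in a few lines. This is a hedged illustration of the decomposition idea only; the checks, their names, and the string-matching predicates are hypothetical stand-ins, and the methods named above may score each item quite differently. The point is that the same pass over the response yields both a scalar (evaluative) and a list of failed items (directive).

```python
from typing import Callable, List, Tuple

# A check is a human-readable criterion plus a predicate on the response.
Check = Tuple[str, Callable[[str], bool]]

def checklist_feedback(response: str, checks: List[Check]) -> Tuple[float, List[str]]:
    """Return (score, failed_items): the fraction of checks passed and
    the names of the checks that failed."""
    if not checks:
        raise ValueError("checklist must contain at least one check")
    failed = [name for name, passes in checks if not passes(response)]
    score = 1.0 - len(failed) / len(checks)
    return score, failed

# Hypothetical criteria for an instruction about reading a file.
checks: List[Check] = [
    ("mentions the file", lambda r: "file" in r.lower()),
    ("verifies the file path", lambda r: "path" in r.lower()),
    ("stays under 20 words", lambda r: len(r.split()) <= 20),
]

score, failed = checklist_feedback("Open the file and read it.", checks)
print("score:", round(score, 2))
print("failed:", failed)
```

A holistic reward model would return only the number. Here the failed list carries the 'you forgot to check the file path first' part that a bare score drops.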

The most striking move in the corpus is models generating their own feedback signal. "Can models learn to evaluate their own work during training?" trains a model to write its own self-assessment in the unused space after its output, internalizing the evaluator so it doesn't depend on an external reward model, and collapsing the evaluative/directive boundary by making the model both judge and instructee. Two-phase RL adds a temporal twist: "Does RL training follow a predictable two-phase learning sequence?" shows training first consolidates execution correctness (where evaluative 'right/wrong' is the right signal) and only later shifts the bottleneck to strategic planning (where directive 'do it this way instead' matters most), so which signal helps depends on which phase you're in.

If you walk away with one thing, let it be this: a scalar reward isn't a smaller version of feedback, it's a *different kind* of information, and the directive part you discarded is often the part that would have taught the model something new.


Sources (6 notes)

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Does negative reinforcement alone outperform full reinforcement learning?

Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.

Can breaking down instructions into checklists improve AI reward signals?

RLCF and RaR methods decompose instruction quality into verifiable sub-criteria, improving performance on benchmarks like FollowBench and HealthBench. This decomposition principle reduces overfitting to superficial artifacts that plague holistic reward models.

Does preference optimization harm conversational understanding?

RLHF optimizes models for single-turn helpfulness by rewarding confident responses over clarifying questions and understanding checks. This preference alignment systematically reduces grounding acts by 77.5% below human levels, creating an alignment tax where models appear helpful but fail silently in multi-turn contexts.

Can models learn to evaluate their own work during training?

Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about evaluative versus directive signals in next-state training. The question remains open: do these two signal types truly decompose orthogonally, and does that decomposition predict which training method works best in a given phase?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026. Key constraints reported:
- Scalar rewards collapse directive information; token-level distillation can recover it (~2025).
- Negative reinforcement alone (evaluative 'thumbs-down') matches or exceeds full RL by preserving trajectory diversity (~2025).
- Checklist-based reward decomposition outperforms monolithic reward models by smuggling directive structure into evaluation (~2025).
- Models can internalize evaluation by generating self-assessment in post-EOS space, bypassing external evaluators (~2025).
- RL training exhibits two phases: procedural consolidation (evaluative signal optimal) → strategic planning (directive signal optimal) (~2026).

Anchor papers (verify; mind their dates):
- arXiv:2506.01347, The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning (2025-06)
- arXiv:2507.20252, Post-Completion Learning for Language Models (2025-07)
- arXiv:2507.18624, Checklists Are Better Than Reward Models For Aligning Language Models (2025-07)
- arXiv:2603.12372, Efficient Reasoning with Balanced Thinking (2026-03)

Your task:
(1) RE-TEST the orthogonality claim. Has recent scaling or architectural work (e.g., larger models, new RL algorithms, multi-agent setups, retrieval-augmented reward systems) shown that evaluative and directive signals are NOT separable, or that one subsumes the other in practice? Separately: does the two-phase dynamic hold across model scales and domains, or is it an artifact of specific training regimes? Cite what resolves or still confirms each constraint.
(2) Surface the strongest work from the last ~6 months that CONTRADICTS the hierarchy (evaluative → directive) or proposes a third signal type (e.g., contrastive, relational, or embodied feedback) that neither pure reward nor instruction captures.
(3) Propose 2 research questions assuming the regime may have shifted: (a) Do instruction-tuned models already internalize directive signals so thoroughly that external evaluative feedback becomes redundant? (b) In multi-agent or self-play settings, does the evaluative/directive split dissolve into a single unified preference signal?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
