INQUIRING LINE


If you reward an AI for each step in isolation, it never learns to set up a payoff several moves later.

Why do next-turn reward objectives fail to encourage multi-turn goal progress?

This explores credit assignment in multi-turn RL — why rewarding each turn on its immediate quality (a myopic, next-turn objective) doesn't add up to progress toward a goal that only resolves several turns later.

The short version from the corpus is that a next-turn objective is *myopic*: it measures local correctness, while multi-turn success is a property of the whole trajectory. A signal that only ever looks one step ahead offers no clean way to propagate the credit for "this turn helped us win three turns later" back to the turn that earned it.
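To make the myopia concrete, here is a toy two-turn task. Everything in it is hypothetical: the action names and the reward numbers are invented for illustration and do not come from any of the cited notes. It only shows how ranking actions by immediate score can pick a different action than ranking them by total return.

```python
# Toy two-turn task (all names and numbers are hypothetical, for illustration only).
# Turn 1: the agent either answers immediately or asks a clarifying question.
# Turn 2: the episode resolves, and the goal reward depends on the turn-1 choice.
IMMEDIATE = {"answer_now": 0.6, "ask_clarifying": 0.2}  # per-turn quality score at turn 1
FINAL = {"answer_now": 0.0, "ask_clarifying": 1.0}      # goal reward arriving at turn 2


def myopic_choice():
    # A next-turn objective ranks actions by their immediate score only.
    return max(IMMEDIATE, key=IMMEDIATE.get)


def trajectory_choice():
    # A trajectory objective ranks actions by the total return they lead to.
    return max(IMMEDIATE, key=lambda a: IMMEDIATE[a] + FINAL[a])


print(myopic_choice())      # answer_now
print(trajectory_choice())  # ask_clarifying
```

The clarifying question looks worse in isolation (0.2 versus 0.6) but leads to the higher total (1.2 versus 0.6), which is the gap a next-turn objective cannot see.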

The most direct response to the failure of next-turn rewards is to stop using them. MS-GRPO assigns the *cumulative episode reward* to every step and then normalizes across rollouts, so the training signal surfaces which whole action sequences succeeded rather than which individual moves looked good in isolation. A 3B model trained this way outperformed 72B baselines by 50%, which suggests the credit-assignment scheme mattered more than scale (see "Can full episode rewards per step enable better credit assignment?"). The flip side is that pure outcome rewards are *sparse*: when every rollout fails, there is no gradient at all. Supervised RL threads this needle by giving dense step-wise rewards based on similarity to expert actions, so the model still learns from failed trajectories; it sits between rigid token imitation and outcome-only rewards (see "Can step-wise expert rewards help small models learn hard reasoning?").
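The episode-level credit scheme described above can be sketched roughly as follows. This is a minimal illustration, not the MS-GRPO authors' implementation: the function names are hypothetical, and the mean/standard-deviation normalization is an assumption borrowed from the usual group-relative form. It also shows the sparsity problem, since a group in which every rollout gets the same return yields all-zero advantages.

```python
import numpy as np


def episode_level_advantages(step_rewards, eps=1e-8):
    """One advantage per rollout, computed from cumulative episode reward.

    step_rewards: list of rollouts for the same task, each a list of per-step rewards.
    (Hypothetical helper; normalization details are an assumption.)
    """
    returns = np.array([sum(r) for r in step_rewards], dtype=float)
    std = returns.std()
    if std < eps:
        # Every rollout scored the same (e.g. all failed): no learning signal.
        return np.zeros_like(returns)
    return (returns - returns.mean()) / std


def broadcast_to_steps(advantages, step_rewards):
    # Every step in a rollout receives that rollout's episode-level advantage.
    return [[float(a)] * len(r) for a, r in zip(advantages, step_rewards)]


group = [[0, 0, 1], [0, 0, 0], [0, 1, 1], [0, 0, 0]]  # toy rollouts
adv = episode_level_advantages(group)
per_step = broadcast_to_steps(adv, group)

all_failed = [[0, 0, 0], [0, 0, 0]]
print(episode_level_advantages(all_failed))  # zeros: nothing to learn from
```

Note that the first step of the third rollout earned no immediate reward, yet it receives the highest advantage in the group because the rollout as a whole did best.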

There's a deeper reason a scalar next-turn reward is structurally lossy. Agent feedback carries two orthogonal things: an *evaluative* signal (how good was that action) and a *directive* one (how should it change). A scalar reward captures the first and throws away the second, so even a well-shaped per-turn number can tell the model only whether it did okay, not which way to move next (see "Can scalar rewards capture all the information in agent feedback?"). That missing directional content is exactly what multi-turn progress needs.

The two-phase dynamic of RL training gives this a sharper edge. Across eight models, learning first masters *execution* correctness and only later hits a *strategic planning* bottleneck: planning-token entropy keeps rising while execution entropy stabilizes (see "Does RL training follow a predictable two-phase learning sequence?"). A next-turn reward is well suited to the first phase (was this step done right?) and nearly blind to the second (was this step part of a good plan?), so it plateaus precisely where multi-turn goal progress lives. None of this means multi-turn RL is hopeless. Modified DAPO doubled SWE-bench Verified performance (20% to 39% on Qwen2.5-72B) in exactly these stateful, delayed-reward settings (see "Can reinforcement learning scale beyond single-turn language tasks?"), but it got there by handling delayed credit, not by leaning harder on per-turn scoring.
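The planning-versus-execution entropy comparison can be sketched as below. This is an illustrative assumption about how such a measurement might be set up, not the method of the cited work: the `is_planning` mask is hypothetical (how tokens get labeled as planning or execution is not specified here), and the probabilities are toy values.

```python
import numpy as np


def token_entropy(probs):
    """Shannon entropy in nats for each row of a [tokens, vocab] probability array."""
    p = np.clip(probs, 1e-12, 1.0)  # avoids log(0) for zero-probability entries
    return -(p * np.log(p)).sum(axis=-1)


def mean_entropy_by_role(probs, is_planning):
    """Average entropy over planning tokens and over execution tokens.

    is_planning: hypothetical boolean mask; both groups must be non-empty.
    """
    ent = token_entropy(np.asarray(probs, dtype=float))
    mask = np.asarray(is_planning, dtype=bool)
    return {
        "planning": float(ent[mask].mean()),
        "execution": float(ent[~mask].mean()),
    }


# Toy data: one uncertain "planning" token and one confident "execution" token.
probs = [[0.25, 0.25, 0.25, 0.25], [0.97, 0.01, 0.01, 0.01]]
stats = mean_entropy_by_role(probs, is_planning=[True, False])
print(stats)
```

Tracking these two averages over training checkpoints is one way to see the pattern the note describes, where the planning average keeps rising while the execution average flattens.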

Two adjacent failure modes are worth following if you want to go further. One is upstream: if the *user* or environment signal drifts across turns, the reward is corrupted before credit assignment even begins. Goal-state tracking addresses this by decomposing a goal into trackable sub-components to keep the signal coherent (see "Why do LLM user simulators fail to track their own goals?"). The other is the reward's own clarity: RL gains track how verifiable the reward is, so a fuzzy per-turn judgment barely moves the needle no matter how the horizon is framed (see "Why does RL succeed more on some tasks than others?"). Taken together, the corpus reframes the question: the problem isn't the *turn*, it's asking a single local scalar to carry information that's inherently global, directional, and strategic.

Sources 7 notes

Can full episode rewards per step enable better credit assignment?

MS-GRPO assigns cumulative episode reward to each step, and group-relative normalization across rollouts surfaces which action sequences succeed. A 3B model post-trained this way outperforms 72B baselines by 50%, showing the training method matters more than scale for multi-step tasks.

Can step-wise expert rewards help small models learn hard reasoning?

Supervised Reinforcement Learning rewards models by measuring alignment with expert actions at each step, providing dense learning signals even when all rollouts fail. This approach bridges the gap between rigid token-by-token imitation (SFT) and sparse outcome-only rewards (RLVR), and works best as a curriculum foundation before outcome-based refinement.
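A dense step-wise reward of this kind might look like the sketch below. The choice of string similarity via Python's `difflib.SequenceMatcher` is an assumption made for illustration; the note only says the reward measures alignment with expert actions. The function names and the example steps are hypothetical.

```python
from difflib import SequenceMatcher


def step_similarity_reward(model_action: str, expert_action: str) -> float:
    # Similarity in [0, 1]; 1.0 means the action text matches the expert exactly.
    return SequenceMatcher(None, model_action, expert_action).ratio()


def dense_rewards(model_steps, expert_steps):
    # One reward per aligned step, available even if the final answer is wrong.
    return [step_similarity_reward(m, e) for m, e in zip(model_steps, expert_steps)]


expert = ["factor the quadratic", "set each factor to zero", "solve for x"]
model = ["factor the quadratic", "set factors equal to zero", "guess x = 7"]
rewards = dense_rewards(model, expert)
print(rewards)
```

Even though the last step here diverges from the expert, the first two steps still earn credit, which is how a failed trajectory can contribute a learning signal.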

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Can reinforcement learning scale beyond single-turn language tasks?

Modified DAPO training doubled SWE-bench Verified performance from 20% to 39% on Qwen2.5-72B, matching larger models. This demonstrates RL works in stateful multi-step environments with delayed rewards and complex feedback, beyond theoretical single-turn MDPs.


Why do LLM user simulators fail to track their own goals?

The UGST framework breaks user goals into profile, policy, task, requirements, and preferences—each with explicit status tracking. A three-stage method (steering, SFT, GRPO) progressively internalizes goal alignment, reducing the misalignment that corrupts RL training signals.

Why does RL succeed more on some tasks than others?

Binary verifiable rewards enable dramatic RL gains (0.15% to 73.98%), while judgment-based evaluation yields modest improvements (55% reduction). Clear reward signals unlock suppressed capabilities; fuzzy signals barely move the needle.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Intrinsic Credit Assignment for Long Horizon Interaction · 2.47 match · arXiv
From Trial-and-Error to Improvement: A Systematic Analysis of LLM Exploration Mechanisms in RLVR · 1.68 match · arXiv
The Art of Scaling Reinforcement Learning Compute for LLMs · 1.68 match · arXiv
Teaching Large Language Models to Reason with Reinforcement Learning · 1.67 match · arXiv
RL Squeezes, SFT Expands: A Comparative Study of Reasoning LLMs · 1.65 match · arXiv
RLVMR: Reinforcement Learning with Verifiable Meta-Reasoning Rewards for Robust Long-Horizon Agents · 1.64 match · arXiv
A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well? · 1.64 match · arXiv
Can Large Language Models Reason and Optimize Under Constraints? · 1.58 match · arXiv

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question **Why do next-turn reward objectives fail to encourage multi-turn goal progress?** remains open. Treat the findings below as dated claims (spanning May 2025–May 2026) to be re-tested against current model capabilities and RL methods.

**What a curated library found — and when (dated claims, not current truth):**
- Next-turn rewards are myopic: they score local correctness but cannot backpropagate "this turn helped us win three turns later." MS-GRPO's cumulative episode reward enabled a 3B model to beat 72B baselines by 50%, suggesting credit-assignment scheme matters more than scale (~2025).
- Scalar per-turn rewards carry only evaluative signal, discarding directional information needed for multi-turn planning; two-phase RL training shows execution correctness plateaus before strategic planning, where next-turn rewards are "nearly blind" (~2025–2026).
- Sparse outcome-only rewards yield zero gradient when every rollout fails; Supervised RL addresses this with dense step-wise expert-similarity rewards, enabling learning from failed trajectories (~2025).
- Upstream goal-state tracking and downstream reward verifiability (binary > fuzzy) both gate performance; RL gains track reward verifiability, so fuzzy per-turn judgment barely moves the needle (~2025–2026).
- Modified DAPO doubled SWE-bench performance in delayed-reward, stateful multi-turn settings, proving multi-turn RL is feasible (~2025).

**Anchor papers (verify; mind their dates):**
- arXiv:2506.01347 (June 2025) — Negative Reinforcement in LLM Reasoning
- arXiv:2510.25992 (October 2025) — Supervised Reinforcement Learning
- arXiv:2602.12342 (February 2026) — Intrinsic Credit Assignment for Long Horizon
- arXiv:2605.25604 (May 2026) — Dynamic Variance-adaptive Advantage Optimization

**Your task:**
(1) **RE-TEST EACH CONSTRAINT.** For every finding above, judge whether newer models (post-May 2026), training improvements (RL curriculum, reward shaping innovations), orchestration (multi-agent + memory + caching stacks), or evaluation harnesses have since relaxed or overturned it. Has the 50% advantage of cumulative-reward schemes held, or has tuning next-turn rewards narrowed the gap? Does the execution–planning two-phase bottleneck still appear in current large models? Separate the durable question (credit assignment is hard) from the perishable limitation (next-turn rewards are the *only* way to train).
(2) **Surface the strongest CONTRADICTING or SUPERSEDING work** from the last ~6 months. Has any recent paper shown next-turn rewards *do* scale to multi-turn goals when paired with novel architectures, loss designs, or inference techniques? Or does the tension hold?
(3) **Propose 2 research questions** that assume the regime may have moved — e.g., "Can hierarchical or recursive reward decomposition (sub-goal credits + outcome credits) outperform flat cumulative schemes?" or "Do transformer attention patterns now naturally learn long-horizon value aggregation, rendering explicit credit assignment secondary?"

**Cite arXiv IDs; flag anything you cannot ground in a real paper.**
