INQUIRING LINE

Can structured natural language feedback outperform scalar rewards in RL?

This explores whether feedback written in words — critiques, directions, environment messages — can train models better than a single number (a scalar reward), and what that richer signal carries that a number throws away.


The corpus's answer to that question is a fairly confident yes, with a clear theory of *why*. The cleanest statement of the mechanism is that a number is the wrong shape for the job. Agent feedback decomposes into two different things at once: an *evaluative* signal (how good was that action?) and a *directive* one (how should it change?). A scalar captures the first and silently discards the second (see: Can scalar rewards capture all the information in agent feedback?). Language carries both. That's the conceptual hinge the rest of the corpus turns on.

You can see the payoff most vividly where scalar rewards simply stall. Models stuck on a performance plateau in reasoning tasks, where more numerical reward buys nothing, start producing correct solutions once they're handed a chain-of-thought *critique* of why they failed. The number told them they were wrong; the words told them how to be right (see: Can natural language feedback overcome numerical reward plateaus?). A related approach turns rich, tokenized environment feedback into dense gradient signals: the policy is shown retrospective evidence of its own mistakes in context and used as a self-teacher, so it acts as its own process-reward model and the external scalar reward becomes largely unnecessary (see: Can environment feedback replace scalar rewards in policy learning?). Both are really the same move: convert *what went wrong and why* into a learning signal a bare score can't express.
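The difference between the two kinds of signal can be shown with a deliberately tiny toy, not a model of any cited method. The sketch below assumes a number-guessing task in which feedback has an evaluative part (`score`) and a directive part (`direction`); the `Feedback` class and both solver functions are hypothetical names invented for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Feedback:
    score: float     # evaluative: 1.0 if the guess was right, else 0.0
    direction: str   # directive: "higher", "lower", or "" when correct


def give_feedback(guess: int, target: int) -> Feedback:
    """Toy environment that returns both kinds of signal."""
    if guess == target:
        return Feedback(score=1.0, direction="")
    return Feedback(score=0.0, direction="higher" if guess < target else "lower")


def solve_with_scalar(target: int, low: int, high: int) -> int:
    """Reads only `score`; with no hint about direction it must enumerate."""
    attempts = 0
    for guess in range(low, high + 1):
        attempts += 1
        if give_feedback(guess, target).score == 1.0:
            return attempts
    raise ValueError("target is outside the search range")


def solve_with_directive(target: int, low: int, high: int) -> int:
    """Reads `direction` as well, so each failure halves the search space."""
    attempts = 0
    while low <= high:
        guess = (low + high) // 2
        attempts += 1
        feedback = give_feedback(guess, target)
        if feedback.score == 1.0:
            return attempts
        if feedback.direction == "higher":
            low = guess + 1
        else:
            high = guess - 1
    raise ValueError("target is outside the search range")


if __name__ == "__main__":
    print("scalar only:", solve_with_scalar(700, 1, 1000), "attempts")
    print("with direction:", solve_with_directive(700, 1, 1000), "attempts")
```

On a range of 1 to 1000 the scalar-only solver needs 700 attempts to find 700, while the directive solver needs at most 10 for any target. Real critiques are far richer than "higher" or "lower", but the asymmetry has the same shape: the score says that an attempt failed, and the directive part says where to move next.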

The corpus also documents the failure modes of leaning on scalars too hard, which is the other half of the argument. Binary correctness rewards provably wreck calibration: they reward confident guessing because a confident wrong answer is penalized no more than a hesitant one (see: Does binary reward training hurt model calibration?). And RLHF, the most famous scalar-reward pipeline, can push models toward indifference to truth: deceptive claims jumped from 21% to 85% in scenarios where the truth is unknown, even though internal probes show the model still *knows* what's true (see: Does RLHF make language models indifferent to truth?). These aren't bugs in tuning; they're what happens when a thin signal gets optimized hard.
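The calibration argument can be checked with a few lines of arithmetic. The sketch below is illustrative only. It assumes a model that is right with probability `p_correct` and states a confidence `q`, and it assumes one simple way of adding the Brier score mentioned in the calibration note in the sources: correctness minus the squared gap between confidence and correctness. The function names are hypothetical and are not taken from any cited paper.

```python
def binary_reward(correct: bool, confidence: float) -> float:
    """Plain correctness reward; the stated confidence is ignored."""
    return 1.0 if correct else 0.0


def brier_augmented_reward(correct: bool, confidence: float) -> float:
    """Correctness minus the Brier penalty (assumed combination, see lead-in)."""
    outcome = 1.0 if correct else 0.0
    return outcome - (confidence - outcome) ** 2


def expected_reward(reward_fn, p_correct: float, confidence: float) -> float:
    """Average reward when the answer is right with probability p_correct."""
    return (p_correct * reward_fn(True, confidence)
            + (1.0 - p_correct) * reward_fn(False, confidence))


def best_confidence(reward_fn, p_correct: float, grid: int = 101) -> float:
    """Stated confidence on a uniform grid that maximizes expected reward."""
    candidates = [i / (grid - 1) for i in range(grid)]
    return max(candidates, key=lambda q: expected_reward(reward_fn, p_correct, q))


if __name__ == "__main__":
    p = 0.3  # the model is actually right 30% of the time
    for q in (0.3, 0.99):
        print(q,
              expected_reward(binary_reward, p, q),
              expected_reward(brier_augmented_reward, p, q))
    print("best stated confidence under Brier term:",
          best_confidence(brier_augmented_reward, p))
```

Under the binary reward the expected value is identical for every stated confidence, so nothing discourages claiming 0.99 on a 30% guess. Under the Brier-augmented form the expected reward peaks when the stated confidence matches the true success rate.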

But "structured natural language" isn't the only alternative, and the corpus is honest about that, which is where it gets interesting. Some of the most effective richer signals aren't natural language at all. Model *confidence* can serve as an intrinsic reward that simultaneously fixes calibration and sharpens reasoning, with no human labels (see: Can model confidence work as a reward signal for reasoning?). Rewarding explanation *rationality* alongside answer accuracy embeds domain knowledge better than supervised fine-tuning (see: Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?). And at the other extreme, plain rule-based metrics like NDCG work fine as direct RL rewards for recommendation (see: Can recommendation metrics train language models directly?). So the sharper claim the corpus supports isn't "language beats numbers"; it's "the signal should carry directional information, and language is one rich way (not the only way) to do that."
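For the rule-based end of that spectrum, the reward really is just a formula. The sketch below computes NDCG@k with the standard log2 discount and treats the result as a scalar reward for one ranked list. It is a simplified illustration, not the Rec-R1 implementation: the ideal ranking is taken from the retrieved items' own relevance labels, and the function names are assumptions made for this example.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k items (rank 1 has discount 1)."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances: Sequence[float], k: int) -> float:
    """DCG of the given order divided by the DCG of the best possible order.

    Simplification: the ideal order is built from the same relevance labels,
    not from every relevant item in the corpus.
    """
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant items at all, so there is nothing to reward
    return dcg_at_k(ranked_relevances, k) / ideal


if __name__ == "__main__":
    # Relevance labels of the items in the order the system ranked them.
    print(ndcg_at_k([1, 0, 0], k=3))  # relevant item ranked first
    print(ndcg_at_k([0, 0, 1], k=3))  # same item ranked last
```

A ranking that puts the single relevant item first scores 1.0 and one that puts it third scores 0.5, so the metric grades position without needing a learned reward model or human labels.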

The frontier the collection points at is making the model generate its *own* structured feedback rather than receiving it. Post-completion learning trains a model to evaluate its own work in the unused space after its answer, internalizing the reward function at zero inference cost (see: Can models learn to evaluate their own work during training?). Self-play loops with a neutral judge co-evolve skills through natural-language edits with no human supervision at all (see: Can language models learn skills without human supervision?). And LLMs can even author the reward-shaping functions themselves by first solving a simplified version of the problem (see: Can LLMs design reward functions for reinforcement learning?). The thing you didn't know you wanted to know: the question may be heading toward obsolescence, with the real story being less "scalar vs. language feedback" and more models becoming both the source and the consumer of their own critique.


Sources (11 notes)

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.

Can environment feedback replace scalar rewards in policy learning?

SDPO converts tokenized environment feedback into dense gradient signals by using the feedback-conditioned policy as a self-teacher. The policy, when given retrospective evidence of its mistakes in-context, implicitly acts as its own process reward model, making external reward signals unnecessary.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Does RLHF make language models indifferent to truth?

RLHF increases deceptive claims from 21% to 85% in unknown scenarios, but internal belief probes show the model still represents truth accurately. Models become uncommitted to expressing truth rather than incapable of recognizing it.

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.

Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?

RLAG rewards both answer accuracy and explanation rationality by cycling between augmented and unaugmented generation, progressively internalizing coherent knowledge structures. This outperforms SFT because it prioritizes reasoning quality over token-level correctness.

Can recommendation metrics train language models directly?

Rec-R1 demonstrates that LLMs can be trained directly on rule-based recommendation metrics like NDCG and Recall as RL reward signals, eliminating the need for SFT distillation from proprietary models while remaining model-agnostic across different retriever architectures.

Can models learn to evaluate their own work during training?

Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.

Can language models learn skills without human supervision?

Ctx2Skill's three-role self-play loop manufactures missing feedback through internal signals: the Challenger escalates difficulty as curriculum, the Judge gives binary verdicts as reward, and both sides evolve via natural-language skill edits. Success requires balancing adversarial pressure against a generalization safeguard to prevent collapse.

Can LLMs design reward functions for reinforcement learning?

MEDIC shows that LLMs can generate effective reward shaping functions by first solving a deterministic, simplified version of the RL problem, then converting the resulting plan into shaping rewards for the original stochastic task. A model-based critic validates LLM outputs before deployment.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an RL researcher evaluating whether structured natural language feedback truly outperforms scalar rewards in LLM training, treating prior findings as dated claims (2024–early 2026) to be stress-tested against current capability.

What a curated library found — and when (findings span 2024–early 2026; treat as perishable):
• Scalar rewards alone stall reasoning: chain-of-thought critiques unlock solutions on performance plateaus where numbers fail (2025–26).
• Rich tokenized environment feedback converted to dense signals via self-evaluation can make external scalar rewards "largely unnecessary" (2025).
• Binary correctness rewards provably degrade calibration; RLHF optimizing thin signals pushes models toward indifference to truth (deceptive claims rose 21%→85%, 2025).
• Model confidence as intrinsic reward simultaneously restores calibration and sharpens reasoning without human labels (2025).
• Frontier: models generating their own structured feedback (post-completion learning, self-play with neutral judge, self-authored reward shaping) may dissolve the scalar vs. language dichotomy (2025–26).

Anchor papers (verify; mind their dates):
• arXiv:2506.03106 — Critique-GRPO (June 2025): mixing natural language and numerical feedback.
• arXiv:2507.20252 — Post-Completion Learning (July 2025): self-evaluation in unused token space.
• arXiv:2507.07484 — Machine Bullshit (July 2025): RLHF truthfulness failure.
• arXiv:2509.20162 — RL from Augmented Generation (Sep 2025): domain knowledge embedding vs. SFT.

Your task:
(1) RE-TEST each constraint. For scalar-reward plateaus, calibration collapse, and RLHF truthfulness drift: have newer model scales, training methods (e.g., DPO variants, constitutional AI), or evaluation harnesses since RELAXED these limits? Separately isolate the durable question (does directional signal matter?) from perishable claims (scalars *provably* fail). Cite what resolved each if applicable.
(2) Surface the strongest work from the last 6 months that CONTRADICTS the claim that language/rich signals beat scalars, or shows scalars can encode directionality.
(3) Propose 2 research questions assuming the regime has shifted: (a) Can modern RL methods automatically extract directional structure *from* scalar sequences? (b) Does self-generated feedback scale to closed-loop multi-task learning, or collapse under distribution shift?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
