SYNTHESIS NOTE

Can natural language feedback overcome numerical reward plateaus?

Exploring whether chain-of-thought critiques can push past performance ceilings that scaling data alone cannot break in reinforcement learning for reasoning tasks.

Synthesis note · 2026-02-22 · sourced from Reinforcement Learning

Purely numerical RL for reasoning shows three failure modes: (1) performance plateaus despite 8x scaling of training examples (from 4k to 32k); (2) self-reflection behaviors during RL, often celebrated as "aha moments," contribute minimally to successful problem-solving; (3) failures persist on certain problems despite extensive trial-and-error training. The common cause is that numerical feedback carries little information about WHY a response is correct or incorrect and HOW to improve it.

Critique-GRPO demonstrates that RL-finetuned models, even after exhibiting performance plateaus, can generate correct refinements on persistently failed problems when provided with chain-of-thought critiques. The key is integrating both natural language feedback (NLF) and numerical feedback within online RL. The model learns from initial responses and critique-guided refinements simultaneously while maintaining exploration.

This matters because it challenges the implicit assumption that RL's numerical learning signal is sufficient for arbitrarily complex reasoning. In light of the linked note "Does reflection in reasoning models actually correct errors?", the ineffectiveness of self-reflection during RL training is predictable: the model cannot generate useful critiques of its own failures. External critiques break the ceiling because they supply the information that numerical rewards lack, namely a specific identification of where the reasoning went wrong.

The practical architecture has three components: (1) the model generates initial responses; (2) a reasoning-based reward model generates CoT critiques identifying flaws; (3) a shaping function enhances learning from valid refinements and heavily penalizes failed refinements. This approach encourages the model to integrate targeted refinements while preserving exploration.
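
The three components above can be sketched as the advantage computation for one prompt. This is a minimal illustration under stated assumptions, not the Critique-GRPO implementation: the `Sample` record, the `shaped_advantages` function, and the `boost` and `penalty` weights are hypothetical, and the group-normalized advantage is the standard GRPO form rather than anything this note specifies.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Sample:
    reward: float        # numerical feedback, e.g. 1.0 if correct else 0.0
    is_refinement: bool  # True if produced after reading a CoT critique


def shaped_advantages(group, boost=2.0, penalty=2.0, eps=1e-6):
    """Group-normalized advantages with a hypothetical shaping step.

    Initial responses and critique-guided refinements for one prompt are
    normalized together. Refinement advantages are then rescaled: valid
    refinements are amplified, failed refinements are penalized harder.
    """
    rewards = [s.reward for s in group]
    mu, sigma = mean(rewards), pstdev(rewards)
    advantages = []
    for s in group:
        adv = (s.reward - mu) / (sigma + eps)
        if s.is_refinement:
            adv *= boost if adv > 0 else penalty
        advantages.append(adv)
    return advantages


# One prompt: three failed initial responses, then one valid and one
# failed refinement generated with the help of critiques.
group = [
    Sample(0.0, False), Sample(0.0, False), Sample(0.0, False),
    Sample(1.0, True), Sample(0.0, True),
]
advantages = shaped_advantages(group)
```

In this toy group the three initial responses alone would all have reward 0 and therefore zero advantage. The valid refinement is what gives the group a nonzero learning signal, and the failed refinement ends up with a more negative advantage than the failed initial responses.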

In light of the linked note "Do critique models improve diversity during training itself?", the NLF mechanism works by expanding the effective exploration space: critiques point toward regions of the solution space that numerical rewards cannot identify.

Semantic reward shaping as lightweight NLF: The Semantic Reward Shaping paper proposes a complementary mechanism that uses a small encoder-only transformer to compute cosine similarity between generated explanations and ground-truth references. This provides a dense, semantically rich reward signal within GRPO. It is not as information-rich as full CoT critiques, but it is far cheaper and faster than LLM-as-judge evaluation. The approach combines the semantic similarity reward with auxiliary correctness and formatting rewards, and significantly improves explanation faithfulness over SFT baselines. It occupies a middle ground between brittle keyword metrics (ROUGE) and expensive LLM-based critiques, which suggests the NLF principle scales down to lightweight implementations when full CoT critique is impractical.
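
A combined reward of this kind might look like the sketch below. It is an illustration only: `embed` is a bag-of-words stand-in for the small encoder (an assumption made to keep the example dependency-free), the weights `w_sem`, `w_correct`, and `w_format` are hypothetical, and the formatting check is a toy rule rather than the paper's.

```python
import math
import re
from collections import Counter


def embed(text):
    """Stand-in for a small encoder: a bag-of-words count vector.

    Assumption: a real setup would call a sentence-embedding model here.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def combined_reward(explanation, reference, answer, gold_answer,
                    w_sem=0.6, w_correct=0.3, w_format=0.1):
    """Hypothetical weighted sum of semantic, correctness, and format rewards."""
    semantic = cosine(embed(explanation), embed(reference))
    correct = 1.0 if answer.strip() == gold_answer.strip() else 0.0
    formatted = 1.0 if explanation.strip().endswith(".") else 0.0  # toy check
    return w_sem * semantic + w_correct * correct + w_format * formatted


reference = "Both addends are odd, so the sum is even."
r_good = combined_reward(
    "The sum is even because both addends are odd.", reference, "8", "8"
)
r_bad = combined_reward("I guessed", reference, "7", "8")
```

Unlike a binary correctness reward, the semantic term varies continuously with how close the explanation is to the reference, which is what makes the signal dense.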

Textual gradients as generalized NLF: TextGrad (2406.07496) formalizes the broader principle that natural language criticism can serve as "textual gradients" propagated through arbitrary computation graphs, including LLM API calls, simulators, and external solvers. Each component of an AI system is a node in a computation graph, and textual feedback describes how variables should change to improve the system. This extends NLF from RL plateau-breaking to general AI system optimization: the same principle (informative language feedback beats a scalar signal) applies at the system level, not just the training level.
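
The idea of feedback flowing backward through a graph of text variables can be sketched with a toy example. This is not the TextGrad library's API: `TextVariable`, `backward`, `step`, `stub_critic`, and `stub_editor` are hypothetical names, and the critic and editor are plain functions standing in for LLM calls.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TextVariable:
    value: str
    predecessors: List["TextVariable"] = field(default_factory=list)
    feedback: List[str] = field(default_factory=list)


def backward(output, critic, objective="improve the final result"):
    """Propagate textual feedback from the output back to its predecessors.

    critic(variable, downstream_feedback) returns criticism for that variable.
    Simplification: nodes are visited once per incoming edge, so this suits
    chains and trees rather than graphs with shared predecessors.
    """
    output.feedback.append(critic(output, objective))
    stack = [output]
    while stack:
        node = stack.pop()
        for pred in node.predecessors:
            pred.feedback.append(critic(pred, " ".join(node.feedback)))
            stack.append(pred)


def step(variable, editor):
    """Rewrite the variable using its accumulated feedback, then clear it."""
    variable.value = editor(variable.value, variable.feedback)
    variable.feedback.clear()


def stub_critic(var, downstream):
    # Stand-in for an LLM that criticizes var given downstream feedback.
    return f"Given '{downstream}', revise: {var.value!r}"


def stub_editor(value, feedback):
    # Stand-in for an LLM that applies the feedback to the text.
    return value + " (revised)" if feedback else value


prompt = TextVariable("Solve step by step.")
answer = TextVariable("x = 3", predecessors=[prompt])
backward(answer, stub_critic)
step(prompt, stub_editor)
```

The structure mirrors numerical backpropagation: the feedback on `answer` conditions the feedback on `prompt`, and `step` plays the role of the optimizer update.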
