SYNTHESIS NOTE

Can natural language feedback overcome numerical reward plateaus?

Exploring whether chain-of-thought critiques can push past performance ceilings that scaling data alone cannot break in reinforcement learning for reasoning tasks.

Synthesis note · 2026-02-22 · sourced from Reinforcement Learning

Three failure modes of purely numerical RL for reasoning: (1) performance plateaus despite 8x scaling of training examples (from 4k to 32k); (2) self-reflection behaviors during RL, often celebrated as "aha moments," contribute minimally to successful problem-solving; (3) persistent failures on certain problems despite extensive trial-and-error training. The common cause: numerical feedback contains limited information about WHY a response is correct or incorrect and HOW to improve.
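
One way to see how little a scalar reward conveys on a persistently failed problem is the group-normalized advantage commonly used in GRPO-style training. The sketch below assumes binary correctness rewards and mean/standard-deviation normalization within a sampling group; the function name and the epsilon value are illustrative and are not taken from the note's sources.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A problem the policy sometimes solves: the signal separates good from bad samples.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A persistently failed problem: every sample scores 0, so every advantage is 0
# and the update carries no hint about what went wrong or how to fix it.
all_failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

Under these assumptions, a group of uniformly wrong samples yields zero advantage for every sample, which is consistent with failure mode (3).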

Critique-GRPO demonstrates that RL-finetuned models, even after exhibiting performance plateaus, can generate correct refinements on persistently failed problems when provided with chain-of-thought critiques. The key is integrating both natural language feedback (NLF) and numerical feedback within online RL. The model learns from initial responses and critique-guided refinements simultaneously while maintaining exploration.

This is significant because it challenges the implicit assumption that RL's numerical learning signal is sufficient for arbitrarily complex reasoning. In light of the related note "Does reflection in reasoning models actually correct errors?", the ineffectiveness of self-reflection during RL training is predictable: the model cannot generate useful critiques of its own failures. External critiques break the ceiling because they provide the information that numerical rewards lack: specific identification of where the reasoning went wrong.

The practical architecture has three components: (1) the model generates initial responses; (2) a reasoning-based reward model generates CoT critiques identifying flaws; (3) a shaping function enhances learning from valid refinements and heavily penalizes failed refinements. This approach encourages the model to integrate targeted refinements while preserving exploration.
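
The three components above can be sketched as a single group-building step. This is a minimal illustration, not the Critique-GRPO implementation: the note does not give the shaping function, so the `boost` and `penalty` multipliers, the choice to refine only failed drafts, and the toy `generate`, `critique`, `refine`, and `score` stand-ins are all assumptions made for this example.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List


@dataclass
class Sample:
    text: str
    reward: float        # numerical feedback: 1.0 if correct, else 0.0
    is_refinement: bool  # True if written after reading a critique


def shaped_advantages(samples: List[Sample], boost: float = 2.0,
                      penalty: float = 2.0, eps: float = 1e-6) -> List[float]:
    """Group-normalized advantages with a hypothetical shaping rule:
    amplify valid refinements and amplify the penalty on failed ones."""
    rewards = [s.reward for s in samples]
    mu, sigma = mean(rewards), pstdev(rewards)
    shaped = []
    for s in samples:
        adv = (s.reward - mu) / (sigma + eps)
        if s.is_refinement:
            adv *= boost if adv > 0 else penalty
        shaped.append(adv)
    return shaped


def build_group(prompt, generate, critique, refine, score, n=4) -> List[Sample]:
    """(1) sample initial responses, (2) critique them, (3) add refinements."""
    initial = [generate(prompt) for _ in range(n)]
    samples = [Sample(t, score(t), False) for t in initial]
    for draft in initial:
        if score(draft) < 1.0:  # assumption: only failed drafts are refined
            note = critique(prompt, draft)
            fixed = refine(prompt, draft, note)
            samples.append(Sample(fixed, score(fixed), True))
    return samples


# Toy stand-ins for the policy and the reasoning-based reward model.
drafts = iter(["2+2=5", "2+2=4", "2+2=3", "2+2=5"])

def generate(prompt):
    return next(drafts)

def score(text):
    return 1.0 if text.endswith("=4") else 0.0

def critique(prompt, draft):
    return f"'{draft}' is wrong: recount the sum."

def refine(prompt, draft, note):
    # One refinement deliberately fails, to show the penalty branch.
    return draft if draft.endswith("=3") else "2+2=4"


samples = build_group("What is 2+2?", generate, critique, refine, score)
advs = shaped_advantages(samples)
```

Initial responses and refinements share one normalization group here, so the policy would be updated on both at once. Initial responses keep their unshaped advantages, which is one simple way to leave ordinary exploration untouched.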

Following the related note "Do critique models improve diversity during training itself?", the NLF mechanism works by expanding the effective exploration space: critiques point toward regions of the solution space that numerical rewards cannot identify.

Semantic reward shaping as lightweight NLF: The Semantic Reward Shaping paper proposes a complementary mechanism: using a small encoder-only transformer to compute cosine similarity between generated explanations and ground-truth references. This provides a dense, semantically rich reward signal within GRPO — not as information-rich as full CoT critiques, but vastly cheaper and faster than LLM-as-judge evaluation. The approach combines semantic similarity reward with auxiliary correctness and formatting rewards, significantly improving explanation faithfulness over SFT baselines. This occupies a middle ground between brittle keyword metrics (ROUGE) and expensive LLM-based critiques — suggesting the NLF principle scales down to lightweight implementations when full CoT critique is impractical.
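
The combined reward described above might look like the following sketch. A bag-of-words count vector stands in for the small encoder-only transformer so the example runs without any model download. The weights, the format check, and all function names are assumptions for illustration, not values from the paper.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for the encoder: a bag-of-words count vector (assumption)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def combined_reward(explanation: str, answer: str,
                    ref_explanation: str, ref_answer: str,
                    w_sem: float = 0.6, w_correct: float = 0.3,
                    w_format: float = 0.1) -> float:
    """Dense semantic term plus auxiliary correctness and formatting terms."""
    semantic = cosine(toy_embed(explanation), toy_embed(ref_explanation))
    correct = 1.0 if answer.strip() == ref_answer.strip() else 0.0
    # Hypothetical format rule: the explanation must end as a full sentence.
    well_formed = 1.0 if explanation.strip().endswith(".") else 0.0
    return w_sem * semantic + w_correct * correct + w_format * well_formed


ref = "The integral diverges because the integrand grows without bound."
good = "The integral diverges because the integrand grows without bound near zero."
bad = "Bananas are yellow."

r_good = combined_reward(good, "diverges", ref, "diverges")
r_bad = combined_reward(bad, "converges", ref, "diverges")
r_exact = combined_reward(ref, "diverges", ref, "diverges")
```

Swapping `toy_embed` for real sentence embeddings would change only the similarity term. Unlike a binary correctness check, the reward stays dense, because partially faithful explanations still receive partial credit.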

Textual gradients as generalized NLF: TextGrad (2406.07496) formalizes the broader principle: natural language criticism can serve as "textual gradients" propagated through arbitrary computation graphs including LLM API calls, simulators, and external solvers. Each AI system component is a node in a computation graph; textual feedback describes how variables should change to improve the system. This extends NLF from RL plateau-breaking to general AI system optimization — the same principle (informative language feedback > scalar signal) applies at the system level, not just the training level.
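
The computation-graph idea can be illustrated with a toy backward pass that moves textual feedback from an output to the variables that produced it. This sketch does not use the TextGrad library or reproduce its API: the `Variable` class, the `backward` function, and the `stub_critic` (which in practice would be an LLM call) are hypothetical names, and the traversal assumes a chain- or tree-shaped graph.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Variable:
    value: str
    role: str                                         # what this node represents
    parents: List["Variable"] = field(default_factory=list)
    feedback: List[str] = field(default_factory=list)  # accumulated "textual gradients"


def backward(output: Variable, loss_feedback: str,
             critic: Callable[[Variable, Variable, str], str]) -> None:
    """Propagate textual feedback from the output to its ancestors.

    Assumes a chain or tree: each node is visited after its only child.
    """
    output.feedback.append(loss_feedback)
    stack = [output]
    while stack:
        node = stack.pop()
        combined = " ".join(node.feedback)
        for parent in node.parents:
            parent.feedback.append(critic(parent, node, combined))
            stack.append(parent)


def stub_critic(parent: Variable, child: Variable, child_feedback: str) -> str:
    """Placeholder for an LLM call that translates feedback on the child
    into a suggested change for the parent."""
    return f"Revise the {parent.role} so the {child.role} addresses: {child_feedback}"


prompt = Variable("Solve step by step.", role="system prompt")
answer = Variable("x = 3", role="answer", parents=[prompt])

backward(answer, "The answer skips the check of the second root.", stub_critic)
```

After the call, `prompt.feedback` holds a language description of how the prompt should change, which plays the role a numeric gradient would play for a weight. The update step that rewrites each variable from its feedback is omitted.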

Original note title

natural language feedback breaks rl performance plateaus that scaling numerical rewards alone cannot resolve