Why does RL succeed more on some tasks than others?
Reinforcement learning shows wildly different improvement rates across conversational tasks—from near-total capability unlock to modest gains. What determines whether RL will transform performance or produce incremental progress?
Two papers in this collection use RL to train conversational capabilities, but their improvement magnitudes diverge dramatically:
- Proactive critical thinking: 0.15% → 73.98% — near-total capability unlock
- Persona consistency: 55% inconsistency reduction — significant but not transformative
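The two headline figures are reported on different scales, so a quick sanity check helps before comparing them. The sketch below uses only the numbers quoted above to compute the absolute and relative change for the first result, and shows what a 55% relative reduction does to an inconsistency rate. The value `hypothetical_baseline` is an assumption for illustration only; the note does not state the baseline inconsistency rate.

```python
# Figures quoted in the note.
before_pct, after_pct = 0.15, 73.98   # proactive critical thinking, in percent
relative_reduction = 0.55             # persona inconsistency reduced by 55%

absolute_gain_pp = after_pct - before_pct     # change in percentage points
improvement_factor = after_pct / before_pct   # multiplicative change


def inconsistency_after(baseline_rate: float,
                        reduction: float = relative_reduction) -> float:
    """Inconsistency rate that remains after a relative reduction."""
    return baseline_rate * (1.0 - reduction)


# ASSUMPTION: this baseline is invented for illustration, not taken from the note.
hypothetical_baseline = 0.40

print(f"critical thinking: +{absolute_gain_pp:.2f} points, x{improvement_factor:.1f}")
print(f"persona inconsistency: {hypothetical_baseline:.2f} -> "
      f"{inconsistency_after(hypothetical_baseline):.2f}")
```

Whatever the baseline, the second figure is a fraction of an existing error rate removed, while the first is a success rate rising from almost nothing, so the two are not measured on the same axis.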
Three factors explain the gap:
1. Reward signal verifiability. Proactive critical thinking has a clear binary reward: did the model correctly identify the missing variable and ask for it? Yes or no. Persona consistency requires LLM-as-a-Judge evaluation of whether an utterance is consistent with a persona description, which is a softer, more ambiguous signal. As the note "Does the choice of RL algorithm actually matter for reasoning?" argues, when the reward signal is clear, the algorithm barely matters. When the reward is fuzzy, every other design choice matters.
2. Baseline differences. Proactive critical thinking starts from near zero: the capability is almost completely suppressed in vanilla models. Persona consistency starts from a partially functional baseline, since models already maintain some consistency. Unlocking a suppressed capability (going from 0 to 1) is a qualitatively different problem from improving an already expressed capability (going from 0.5 to 0.8).
3. Task complexity. Detecting a missing variable is a bounded problem with a finite answer space. Maintaining consistent personality across an open-ended conversation is unbounded — the space of possible persona-relevant responses is vast and context-dependent.
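The first factor can be sketched in a few lines. The code below contrasts a binary, checkable reward with a stand-in for a judge-based reward. Everything here is hypothetical: `binary_reward`, `simulated_judge_reward`, the variable name `"interest_rate"`, and the Gaussian noise model are assumptions chosen for illustration, not the reward functions used in either paper.

```python
import random
from statistics import pstdev
from typing import Optional


def binary_reward(asked_for: Optional[str], missing_variable: str) -> float:
    """Verifiable reward: 1.0 only if the model asked for the variable that is missing."""
    return 1.0 if asked_for == missing_variable else 0.0


def simulated_judge_reward(true_quality: float, noise_sd: float,
                           rng: random.Random) -> float:
    """Stand-in for an LLM judge: true quality plus Gaussian noise, clipped to [0, 1]."""
    noisy = rng.gauss(true_quality, noise_sd)
    return min(1.0, max(0.0, noisy))


rng = random.Random(0)  # fixed seed so the run is repeatable

# Score the same response 100 times under each reward.
binary_scores = [binary_reward("interest_rate", "interest_rate") for _ in range(100)]
judge_scores = [simulated_judge_reward(0.7, 0.15, rng) for _ in range(100)]

print("binary reward spread:", pstdev(binary_scores))
print("judge reward spread: ", round(pstdev(judge_scores), 3))
```

The binary reward returns the same value every time for the same response, while the simulated judge does not. Under this assumption, the policy update for the judged task has to average out noise that the verifiable task never introduces.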
This pattern generalizes across the vault:
- RLVER emotional rewards work because emotion categories are partially verifiable — empathy shifts are measurable through linguistic markers
- Checklist-based rewards (RLCF) work because sub-criteria can be independently verified
- Binary reward RL degrades calibration because forcing binary judgment onto graded reality introduces systematic distortion
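The checklist idea in the list above can be illustrated with a minimal sketch: a subjective judgment is replaced by several yes/no criteria, and the reward is the fraction satisfied. The function `checklist_reward` and the three toy criteria are hypothetical; published checklist-based methods may generate, verify, and weight their criteria quite differently.

```python
from typing import Callable, List

Check = Callable[[str], bool]


def checklist_reward(response: str, checks: List[Check]) -> float:
    """Fraction of yes/no criteria that the response satisfies."""
    if not checks:
        raise ValueError("checklist must contain at least one criterion")
    passed = sum(1 for check in checks if check(response))
    return passed / len(checks)


# ASSUMPTION: toy criteria invented for illustration.
checks: List[Check] = [
    lambda r: r.strip().endswith("?"),   # phrased as a question
    lambda r: len(r.split()) <= 30,      # stays concise
    lambda r: "budget" in r.lower(),     # asks about the budget
]

reward = checklist_reward("Could you tell me your timeline?", checks)
print(f"checklist reward: {reward:.2f}")
```

Each criterion is individually checkable, so the aggregate keeps some of the verifiability of a binary reward while still returning a graded score.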
The principle: RL improvement magnitude tracks reward signal verifiability. Binary verification → dramatic improvement. Judgment-based evaluation → modest improvement. The training method is the same. The reward signal determines the ceiling.
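The calibration caveat from the list above can be seen in expectation. The sketch assumes the model reports a confidence alongside its answer, and compares a plain correctness reward with one that subtracts a squared-error (Brier-style) penalty on that confidence. Both functions, the grid, and the value `p = 0.6` are assumptions for illustration, not the setup of any paper in this collection.

```python
def expected_binary_reward(p_correct: float, stated_confidence: float) -> float:
    """Reward is 1 if correct and 0 otherwise; the stated confidence is ignored."""
    return p_correct


def expected_brier_reward(p_correct: float, stated_confidence: float) -> float:
    """Correctness reward minus a squared-error penalty on the stated confidence."""
    q = stated_confidence
    if_correct = 1.0 - (1.0 - q) ** 2   # outcome = 1
    if_wrong = 0.0 - q ** 2             # outcome = 0
    return p_correct * if_correct + (1.0 - p_correct) * if_wrong


grid = [i / 100 for i in range(101)]   # candidate confidences 0.00 .. 1.00
p = 0.6                                # ASSUMPTION: true chance the answer is right

binary_values = {expected_binary_reward(p, q) for q in grid}
best_q = max(grid, key=lambda q: expected_brier_reward(p, q))

print("distinct binary rewards across confidences:", len(binary_values))
print("confidence that maximizes the Brier-style reward:", best_q)
```

Under the binary reward every stated confidence earns the same expected reward, so nothing pushes the model toward honest confidence. Under the penalized reward the expectation peaks when the stated confidence equals the true probability.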
Inquiring lines that use this note as a source (11)
- How does baseline capability level affect RL improvement ceiling?
- What role does natural language play in breaking reinforcement learning performance plateaus?
- Does task ordering affect multi-task reinforcement learning outcomes?
- Why do next-turn reward objectives fail to encourage multi-turn goal progress?
- Does RL refine existing knowledge or discover entirely new capabilities?
- What distinguishes RL that creates new capabilities from RL that merely teaches timing?
- What separates bootstrapping gains from sustained self-improvement gains?
- Why do single-turn RL methods fail to generalize to multi-turn tasks?
- Why do overtrained domains show different RL training outcomes than novel tasks?
- What training duration is actually needed for RL to expand capabilities?
- What makes a task at the edge of competence optimal for RL?
Related concepts in this collection (6)
- Can models learn to ask clarifying questions instead of guessing?
  Exploring whether large language models can be trained to detect incomplete queries and actively request missing information rather than hallucinating answers or refusing to respond. This matters because conversational agents today remain passive, responding only when prompted.
  the dramatic success case (0.15% → 73.98%)
- Can training user simulators reduce persona drift in dialogue?
  Explores whether inverting typical RL setups (training the simulated user for consistency rather than the task agent) can measurably reduce persona drift and improve experimental reliability in dialogue research.
  the modest success case (55% reduction)
- Does the choice of RL algorithm actually matter for reasoning?
  Expert Iteration, PPO, and RC-RL show similar performance on reasoning tasks. The question is whether algorithm choice drives results or whether something deeper, like the pretrained model itself, sets the real limits.
  algorithm interchangeability when reward is clear
- Can breaking down instructions into checklists improve AI reward signals?
  Exploring whether decomposing subjective instruction quality into verifiable yes/no criteria enables reinforcement learning on tasks without clear correctness signals, like writing and reasoning.
  decomposition into verifiable sub-criteria as a fix for the judgment-reward problem
- Does binary reward training hurt model calibration?
  Explores whether the standard correctness-based reward in RL training creates incentives for overconfident predictions, and what structural problem causes calibration to degrade during optimization.
  binary forcing on graded tasks as a specific failure mode
- Can models learn what makes research worth doing?
  Can large language models be trained to recognize high-impact research directions by learning from citation patterns? This explores whether 'scientific taste' (the judgment of what work matters) is a learnable skill separate from execution.
  RLCF introduces a third reward category (community-level feedback) beyond the binary/judgment dichotomy
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning
- Spurious Rewards: Rethinking Training Signals in RLVR
- Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains
- Learning to Reason without External Rewards
- Reinforcement Pre-Training
- RL Squeezes, SFT Expands: A Comparative Study of Reasoning LLMs
- Intrinsic Credit Assignment for Long Horizon Interaction
- The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
Original note title
RL succeeds dramatically on tasks with verifiable binary rewards but only modestly on tasks requiring judgment-based evaluation