SYNTHESIS NOTE
Psychology, Society, and Alignment · Training, RL, and Test-Time Scaling · Conversational AI and Personalization

Why does RL succeed more on some tasks than others?

Reinforcement learning shows wildly different improvement rates across conversational tasks—from near-total capability unlock to modest gains. What determines whether RL will transform performance or produce incremental progress?

Synthesis note · 2026-03-31 · sourced from Conversation Agents

Both papers use RL to train conversational capabilities, one targeting proactive critical thinking and the other persona consistency, but the improvement magnitudes diverge dramatically.

Three factors explain the gap:

1. Reward signal verifiability. Proactive critical thinking has a clear binary reward: did the model correctly identify the missing variable and ask for it? Yes or no. Persona consistency requires LLM-as-a-Judge evaluation of whether an utterance is consistent with a persona description, which is a softer, more ambiguous signal. As the related note "Does the choice of RL algorithm actually matter for reasoning?" argues, when the reward signal is clear, the algorithm barely matters. When the reward is fuzzy, everything else matters.

2. Baseline differences. Proactive critical thinking starts from near zero: the capability is effectively suppressed in vanilla models. Persona consistency starts from a partially functional baseline, since models already maintain some consistency. Unlocking a suppressed capability (going from 0 to 1) is qualitatively different from improving an already expressed capability (going from 0.5 to 0.8).

3. Task complexity. Detecting a missing variable is a bounded problem with a finite answer space. Maintaining consistent personality across an open-ended conversation is unbounded — the space of possible persona-relevant responses is vast and context-dependent.
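The contrast between the two reward types in factor 1 can be made concrete with a small sketch. This is not taken from either paper: `binary_reward`, `noisy_judge_reward`, the quality value 0.7, and the noise level 0.2 are all hypothetical, chosen only to show that an exact-match check returns the same reward every time, while a judge-style score for the same underlying behaviour varies from call to call.

```python
import random
import statistics


def binary_reward(asked_variable: str, missing_variable: str) -> float:
    """Verifiable reward: 1.0 if the model asked for the variable that is actually missing."""
    same = asked_variable.strip().lower() == missing_variable.strip().lower()
    return 1.0 if same else 0.0


def noisy_judge_reward(true_quality: float, noise_std: float, rng: random.Random) -> float:
    """Hypothetical stand-in for an LLM-as-a-Judge score.

    Models the judge as the underlying quality plus Gaussian noise,
    clipped to [0, 1]. A real judge would be a model call, not a formula.
    """
    score = true_quality + rng.gauss(0.0, noise_std)
    return min(1.0, max(0.0, score))


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Score the same behaviour 200 times under each reward type.
binary_samples = [binary_reward("interest rate", "Interest Rate") for _ in range(200)]
judge_samples = [noisy_judge_reward(0.7, 0.2, rng) for _ in range(200)]

binary_spread = statistics.pstdev(binary_samples)
judge_spread = statistics.pstdev(judge_samples)

print(f"binary reward spread: {binary_spread:.3f}")
print(f"judge reward spread:  {judge_spread:.3f}")
```

Under these assumptions the binary reward has zero spread, so every policy update sees a consistent signal, whereas the judge-style reward carries noise that the optimizer has to average out.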

This pattern generalizes across the vault.

The principle: RL improvement magnitude tracks reward signal verifiability. Binary verification → dramatic improvement. Judgment-based evaluation → modest improvement. The training method is the same. The reward signal determines the ceiling.

Original note title

RL succeeds dramatically on tasks with verifiable binary rewards but only modestly on tasks requiring judgment-based evaluation