SYNTHESIS NOTE
Training, RL, and Test-Time Scaling

Can diversity optimization improve quality during language model training?

Standard RL training assumes quality and diversity trade off, with diversity optimization potentially hurting performance. Does explicitly rewarding semantic diversity during reinforcement learning actually improve output quality alongside diversity?

Synthesis note · 2026-02-22 · sourced from Reward Models

Post-training of LLMs via RL typically prioritizes accuracy and helpfulness, which sharpens output distributions and reduces the range of ideas. This creates a tension: quality improves while diversity degrades, limiting usefulness for creative and exploratory tasks. The standard assumption is that quality and diversity trade off.

Diversity-Aware Reinforcement Learning (DARLING, 2025) challenges this assumption. It jointly optimizes for quality and semantic diversity during online RL by: (1) using a learned partition function (a trained semantic classifier) to cluster rollouts into semantically distinct groups, going beyond surface-level lexical variation, and (2) multiplying the quality reward by the diversity signal, which amplifies the advantage for responses that are both high-quality and semantically novel.
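The two steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not DARLING's actual implementation: the cluster labels stand in for the output of the learned classifier, the diversity score (share of other rollouts that fall in a different cluster) and the mean-subtracted advantage are assumed choices, and the function names `diversity_scores` and `diversity_aware_advantages` are hypothetical.

```python
from collections import Counter


def diversity_scores(cluster_ids):
    """Fraction of the *other* rollouts that land in a different semantic cluster.

    Assumption: cluster_ids come from some learned semantic classifier;
    here they are simply given as labels.
    """
    n = len(cluster_ids)
    if n < 2:
        return [0.0] * n
    sizes = Counter(cluster_ids)
    return [(n - sizes[c]) / (n - 1) for c in cluster_ids]


def diversity_aware_advantages(quality, cluster_ids):
    """Multiply quality by diversity, then subtract the group mean.

    Assumption: a simple mean baseline over the rollout group; the real
    method may normalize differently.
    """
    div = diversity_scores(cluster_ids)
    shaped = [q * d for q, d in zip(quality, div)]
    mean = sum(shaped) / len(shaped)
    return [s - mean for s in shaped]


# Four rollouts for one prompt: three are correct, two of those are paraphrases.
quality = [1.0, 1.0, 1.0, 0.0]
clusters = ["a", "a", "b", "c"]
advantages = diversity_aware_advantages(quality, clusters)
print(advantages)
```

In this toy group, the correct response in its own cluster ("b") gets the largest advantage, the two correct paraphrases in cluster "a" get a smaller one, and the incorrect response stays lowest even though it is semantically unique, because the product with a zero quality reward is zero.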

The counter-intuitive finding: explicitly optimizing for diversity also improves quality. On five non-verifiable benchmarks (instruction following and creative writing), DARLING consistently produces outputs of both higher quality and higher novelty than quality-only RL baselines. On verifiable tasks (competition math), it achieves higher pass@1 (solution quality) and pass@k (solution variety).
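For readers unfamiliar with the metrics, pass@k is commonly estimated from n sampled solutions of which c are correct as 1 - C(n-c, k) / C(n, k). The sketch below shows that standard unbiased estimator; the note does not state which estimator DARLING's evaluation used, so treat this as an assumption, and the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples (drawn without replacement
    from n generated solutions, c of them correct) is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# 10 samples, 3 correct: pass@1 is the plain success rate, pass@5 is much higher.
print(round(pass_at_k(10, 3, 1), 4))
print(round(pass_at_k(10, 3, 5), 4))
```

With k = 1 the estimate reduces to c / n, which is why pass@1 reads as per-sample quality, while larger k rewards having at least one correct solution somewhere among varied attempts.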

The mechanism is exploration. As discussed in the related note "Does policy entropy collapse limit reasoning performance in RL?", standard RL concentrates probability mass on a narrow set of high-reward trajectories. The diversity reward counteracts this: it forces the model to maintain exploration across semantically distinct solution strategies, so it encounters more high-quality solutions that pure exploitation would never reach. Diversity is not just an output property; it is a training-time exploration signal.

This has direct implications for the related note "Does negative reinforcement alone outperform full reinforcement learning?". If negative reinforcement works by suppression, DARLING works by forced exploration, and the latter may produce broader capability because it explicitly rewards novel correct solutions rather than just penalizing known failures.

The learned semantic classifier is the key architectural innovation. Surface-level lexical diversity (different words) does not capture semantic diversity (different ideas). By training a classifier to recognize genuine conceptual distinctness, DARLING avoids the failure mode where the model produces lexically varied but semantically identical outputs.

Original note title

explicitly optimizing for semantic diversity during rl catalyzes exploration and simultaneously improves both quality and diversity