SYNTHESIS NOTE
Training, RL, and Test-Time Scaling

Can one statistical measure serve dual purposes in RL training?

Explores whether cross-rollout variance can simultaneously weight important tokens and filter low-signal queries, potentially unlocking efficiency gains in reasoning tasks without human labels.

Synthesis note · 2026-05-18 · sourced from Reasoning Methods CoT ToT
What actually changes inside a model during RL training? What does reward learning actually do to model reasoning?

The cross-rollout variance signal in DRO does double duty. First, it identifies the tokens within a reference answer whose certainty depends on the chain-of-thought, and up-weights those in the dense reward. Second, the same variance computed across a query's rollout group serves as a query-level filter: queries whose rollouts produce too little variance get discarded entirely, because they offer no comparative signal for learning.
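The token-level half of this can be sketched in a few lines. This is a minimal illustration, not DRO's actual implementation: the `(G, T)` array layout, the function names `token_weights` and `dense_rewards`, and the choice to normalise variances into weights that sum to one are all assumptions made for the example.

```python
import numpy as np


def token_weights(ref_logp: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Weight each reference-answer token by its variance across rollouts.

    ref_logp has shape (G, T): entry [g, t] is the log-probability of
    reference token t given the chain-of-thought from rollout g
    (hypothetical layout, assumed for this sketch).
    """
    var = ref_logp.var(axis=0)        # (T,) cross-rollout variance per token
    return var / (var.sum() + eps)    # normalise; near zero if nothing varies


def dense_rewards(ref_logp: np.ndarray) -> np.ndarray:
    """One scalar reward per rollout: variance-weighted reference log-prob."""
    return ref_logp @ token_weights(ref_logp)   # (G,)


# Three rollouts, three reference tokens; only the middle token reacts
# to differences in the reasoning.
ref_logp = np.array([
    [-0.1, -2.0, -0.1],
    [-0.1, -0.5, -0.1],
    [-0.1, -1.0, -0.1],
])
weights = token_weights(ref_logp)
rewards = dense_rewards(ref_logp)
```

In this toy input almost all of the weight lands on the middle token, so the rollout whose reasoning makes that token most likely receives the highest reward, while the two tokens that every rollout predicts equally well are effectively ignored.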

The query-filter use is the underappreciated half. Most RL setups process every query in the batch equally, computing rewards across rollouts and updating the policy. But not every query carries gradient signal. Queries where all rollouts converge to the same answer with similar certainty contribute nothing: the comparative reward is degenerate, so every rollout's reward sits at the group mean and whatever gradient remains is noise. Filtering these out before the update concentrates compute on queries where comparative learning is possible.
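The degeneracy is easy to see numerically. The sketch below assumes a group-normalised advantage (reward minus group mean, divided by group standard deviation), which is a common choice in group-based RL but is not something this note specifies for DRO; `group_advantages` and the `eps` value are hypothetical.

```python
import numpy as np


def group_advantages(rewards, eps: float = 1e-6) -> np.ndarray:
    """Group-normalised advantages for one query's rollouts (assumed estimator)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


# Every rollout earns the same reward: all advantages are zero,
# so this query contributes no policy-gradient signal.
flat = group_advantages([0.5, 0.5, 0.5, 0.5])

# A negligible reward difference is blown up by the tiny standard
# deviation: the update is large but reflects no real preference.
near_flat = group_advantages([0.5, 0.5, 0.5, 0.5001])

# A query whose rollouts genuinely disagree gives usable contrast.
informative = group_advantages([0.1, 0.9, 0.4, 0.6])
```

Under this estimator the first group yields nothing to learn from, and the second yields an advantage larger than one from a reward gap of 0.0001, which is the sense in which the gradient from a low-variance query is noise rather than signal.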

The two uses come from the same statistical quantity: cross-rollout variance over reasoning-reflective tokens. The token-level view says "which positions in this answer respond to reasoning differences." The query-level view says "does this entire query produce enough variation across rollouts to be worth learning from." Both are derived from the same self-supervised samples — no human labels, no PRM, no extra forward passes.
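A sketch of how one array of samples can feed both views. It is illustrative only: the `(Q, G, T)` layout with equal-length reference answers, the use of mean token variance as the query score, and the `min_query_var` threshold are assumptions for this example, not details taken from DRO.

```python
import numpy as np


def variance_signals(ref_logp: np.ndarray,
                     min_query_var: float = 0.01,
                     eps: float = 1e-8):
    """Derive token weights and a query keep-mask from one variance tensor.

    ref_logp has shape (Q, G, T): Q queries, G rollouts per query,
    T reference-answer tokens (hypothetical layout; no padding handled).
    """
    token_var = ref_logp.var(axis=1)                 # (Q, T) variance over rollouts
    # Token-level view: which positions respond to reasoning differences.
    token_w = token_var / (token_var.sum(axis=1, keepdims=True) + eps)
    # Query-level view: is there enough variation to learn from at all.
    query_score = token_var.mean(axis=1)             # (Q,)
    keep = query_score >= min_query_var
    return token_w, keep


batch = np.stack([
    # Query 0: the middle token depends on the reasoning.
    np.array([[-0.1, -2.0, -0.1],
              [-0.1, -0.5, -0.1],
              [-0.1, -1.0, -0.1]]),
    # Query 1: every rollout scores every token identically.
    np.full((3, 3), -0.5),
])
token_w, keep = variance_signals(batch)
```

Both outputs come from the single `token_var` tensor, so the query filter costs one extra reduction over values that the token weighting already needs. Here query 0 is kept and query 1 is dropped.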

The empirical result is that DRO trains 2–3× faster with better stability than baselines on unverifiable tasks. The decomposition explains why: every gradient update spends compute on queries with measurable signal, and within each query, the gradient concentrates on the tokens that actually carry reasoning sensitivity. Sample efficiency comes from selecting at both granularities: low-variance queries are filtered out, and high-variance tokens are up-weighted.

The transferable principle: when a self-supervised signal exists, reuse it at multiple aggregation levels. The same statistic that identifies which tokens to weight also identifies which queries to keep. Looking for one such statistic per pipeline is cheap; designing two separate signals (one for filtering, one for weighting) is what makes other dense-reward pipelines expensive.

Original note title

cross-rollout variance functions simultaneously as reward signal and query filter — one statistical quantity unlocks sample-efficient RL on unverifiable tasks