SYNTHESIS NOTE
Training, RL, and Test-Time Scaling

When does majority-vote reward actually help test-time learning?

Test-time RL using consensus rewards shows contradictory results across different models and domains. What determines whether consensus amplifies correct answers or reinforces confident mistakes?

Synthesis note · 2026-05-03 · sourced from Test Time Compute

The TTRL finding (test-time RL on unlabeled data using majority-vote consensus as reward) and the self-consistency-as-reward critique (using self-consistency reinforces confident-but-wrong answers) appear to contradict each other. They don't. They describe two regimes of the same mechanism, separated by an accuracy threshold, and the contradiction dissolves once the regime is named.
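To make the shared mechanism concrete, here is a minimal sketch of a majority-vote reward for a single prompt. It assumes answers have already been extracted and normalized to comparable strings, and it uses a binary reward (1 for agreeing with the consensus, 0 otherwise). The function name and the tie-breaking rule are illustrative choices, not taken from either source.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Return (pseudo_label, rewards) for one prompt's sampled answers.

    The pseudo-label is the most common answer; each sample is rewarded
    for matching it. No ground-truth label is consulted anywhere.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    # most_common orders equal counts by first appearance, so ties
    # resolve to the earliest-seen answer (an arbitrary choice here).
    pseudo_label, _ = counts.most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards


# Five samples for one prompt: "4" wins the vote and becomes the target,
# whether or not "4" is actually correct.
label, rewards = majority_vote_rewards(["4", "5", "4", "4", "7"])
print(label, rewards)
```

Nothing in the reward depends on correctness, which is why the same function can serve either regime described here.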

When the model's prior accuracy on a prompt class is above roughly 50% (more strictly: above whatever threshold makes consensus track ground truth more often than not; with more than two candidate answers, the correct answer only needs to be the most common one), each TTRL update pushes the policy toward correct answers. The consensus is the right answer in the majority of cases; the model is being trained to do what it would have done correctly anyway, just more reliably. TTRL works.
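The threshold can be illustrated with a small calculation, under two simplifying assumptions that are not in the source: only two candidate answers exist, and the sampled answers are independent. In that case the number of correct samples is binomial, and the chance that the vote lands on the correct answer is a tail sum. The function name is hypothetical.

```python
from math import comb


def p_majority_correct(p, n):
    """Probability that a strict majority of n samples is correct.

    Assumes two candidate answers, independent samples, and per-sample
    accuracy p. n must be odd so that the vote cannot tie.
    """
    if n % 2 == 0:
        raise ValueError("use an odd n to avoid ties")
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


# Voting sharpens whichever side of 0.5 the prior is already on.
for p in (0.4, 0.5, 0.6):
    print(p, round(p_majority_correct(p, 15), 3))
```

With three samples and p = 0.6 this gives 3(0.36)(0.4) + 0.216 = 0.648, so the consensus is right more often than a single sample. At p = 0.4 the same formula gives 0.352, so the consensus is wrong more often than a single sample. Correlated samples, which are typical when one model generates every answer, would shift these numbers.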

When the prior accuracy is below the threshold, each update pushes the policy toward the consensus wrong answer. The model is being trained to agree with itself, and self-agreement is anti-correlated with correctness in the regions where the model is most confidently miscalibrated. The mechanism reinforces the wrong consensus. This is the worst possible failure mode because it is silent: the loss looks healthy, the consensus tightens, and the policy gets worse on the prompts where it was already fooled.
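The two regimes can be seen in a deliberately crude toy model. It is not TTRL's optimizer: the policy is reduced to a single number p (the probability of emitting the correct one of two answers), samples are assumed independent, and each step moves p a fraction lr of the way toward the expected consensus. All names and parameter values are illustrative assumptions.

```python
from math import comb


def p_majority_correct(p, n):
    """Chance that a strict majority of n independent samples is correct."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


def simulate_consensus_training(p0, n=15, lr=0.5, steps=50):
    """Mean-field recursion: p moves toward the consensus it produces.

    With probability M(p) the vote is correct and p is pulled toward 1;
    otherwise it is pulled toward 0. In expectation that gives
    p <- p + lr * (M(p) - p). n should be odd so votes cannot tie.
    """
    p = p0
    history = [p]
    for _ in range(steps):
        p = p + lr * (p_majority_correct(p, n) - p)
        history.append(p)
    return history


for start in (0.6, 0.4):
    run = simulate_consensus_training(start)
    # Self-agreement is the share of samples matching the likelier answer.
    agreement = max(run[-1], 1 - run[-1])
    print(f"start={start} final_accuracy={run[-1]:.3f} agreement={agreement:.3f}")
```

In this toy, 0.5 is an unstable fixed point: a start of 0.6 drifts toward accuracy 1, a start of 0.4 drifts toward accuracy 0, and self-agreement approaches 1 in both runs. That last point is the silent part of the failure: the agreement statistic alone cannot tell the two runs apart.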

Three deployment implications follow. First, TTRL must be gated on an outside-loop accuracy probe (at minimum a held-out labeled subset) that confirms the prior is in the favorable regime before training proceeds. Second, the threshold is per-prompt-class, not global. A model can be above threshold on math and below threshold on counterfactual reasoning; running TTRL on such a mixed distribution can improve math while degrading counterfactuals, with the average looking fine. Third, the worst-case failure is most likely on prompt classes where the model is most confident, because confidence and accuracy decouple where pretraining biases dominate. TTRL should be most distrusted exactly where the loss curves are most reassuring.
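The first two implications can be sketched as a per-class gate. This is one possible shape, not a prescribed procedure: the function names, the dictionary layout, and the 0.6 cutoff (a margin above the rough 50% threshold, to allow for noise in a small probe) are all assumptions made for illustration.

```python
def probe_accuracy(predictions, labels):
    """Fraction of held-out labeled prompts the current policy gets right."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and aligned")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def gate_ttrl_by_class(probe_results, min_accuracy=0.6):
    """Decide, per prompt class, whether consensus training may proceed.

    probe_results maps a class name to (predictions, labels) from a
    held-out labeled subset. min_accuracy is a hypothetical cutoff.
    """
    return {
        prompt_class: probe_accuracy(preds, labels) >= min_accuracy
        for prompt_class, (preds, labels) in probe_results.items()
    }


# Illustrative probe data, invented for this example.
probe = {
    "math": (["a"] * 10, ["a"] * 10),                      # 10 of 10 correct
    "counterfactual": (["a"] * 10, ["a"] * 3 + ["b"] * 7),  # 3 of 10 correct
}
print(gate_ttrl_by_class(probe))

# Pooling the classes hides the problem: 13 of 20 correct clears the cutoff.
pooled_preds = [p for preds, _ in probe.values() for p in preds]
pooled_labels = [y for _, labels in probe.values() for y in labels]
print(probe_accuracy(pooled_preds, pooled_labels))
```

In this invented probe, the math class passes and the counterfactual class fails, yet the pooled accuracy of 65% would have passed a single global gate. A probe cannot address the third implication on its own terms: it has to be labeled precisely because the model's confidence is not a usable proxy.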

The healthier reframing: majority-vote reward is not a free supervision signal. It is a confidence amplifier whose direction depends on the prior. In good regimes it amplifies competence. In bad regimes it amplifies bias. The published TTRL paper measured the good regime; the published self-consistency-as-reward critique predicts the bad regime; both findings are real, and TTRL deployment without prior-regime probing is the unsafe operating point.

Original note title

test-time RL via majority-vote reward is conditional on a prior-accuracy threshold — below the threshold consensus reinforces wrong answers