RL with Verifiable Rewards (RLVR)

Why does RLVR training narrow a model's problem-solving ability?

RLVR's on-policy constraint may force models to exploit known reasoning paths rather than explore new ones, potentially shrinking their effective problem-solving scope. Understanding this mechanism could reveal how to design better exploration incentives in language model reasoning.

Can breaking down instructions into checklists improve AI reward signals?

Explores whether decomposing subjective instruction quality into verifiable yes/no criteria enables reinforcement learning on tasks without clear correctness signals, like writing and reasoning.
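
As a rough illustration of the idea, a checklist reward can be sketched as the fraction of yes/no criteria a response satisfies. The names `Criterion` and `checklist_reward` and the toy criteria are assumptions made for this sketch; in practice each check would more likely be a judgment from a grader model than a string test.

```python
from typing import Callable, Sequence

# Hypothetical type: a criterion is any yes/no check over a response string.
Criterion = Callable[[str], bool]


def checklist_reward(response: str, criteria: Sequence[Criterion]) -> float:
    """Return the fraction of yes/no criteria the response satisfies (0.0 to 1.0)."""
    if not criteria:
        raise ValueError("need at least one criterion")
    passed = sum(1 for check in criteria if check(response))
    return passed / len(criteria)


# Toy criteria (assumptions), standing in for grader-model judgments.
criteria = [
    lambda r: len(r.split()) <= 50,      # "Is the answer concise?"
    lambda r: "because" in r.lower(),    # "Does it give a reason?"
    lambda r: r.strip().endswith("."),   # "Does it end in a full sentence?"
]

reward = checklist_reward("The sky is blue because of Rayleigh scattering.", criteria)
```

Averaging the checks is only one aggregation choice; weighted or all-or-nothing variants would change how partial compliance is rewarded.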

What reasoning features does each difficulty level reinforce?

When models train on problems of different difficulty, do they build the same internal reasoning machinery or different kinds? This matters because accuracy gains alone hide what's actually being learned.

Can adaptive guidance from solution traces reduce reward sparsity in RL?

When reinforcement learning struggles with hard problems due to sparse rewards and zero-advantage rollouts, does providing partial solution traces as adaptive guidance help the model learn more efficiently? This matters because standard RL wastes compute on unsolvable problems.
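
To make "zero-advantage rollouts" concrete, here is a minimal sketch that assumes a GRPO-style group-relative advantage (reward minus group mean, divided by group standard deviation). The algorithm choice and the function names are assumptions, since the description above does not name a specific method.

```python
import statistics
from typing import List, Sequence


def group_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantages: (reward - group mean) / (group std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population standard deviation of the group
    return [(r - mean) / (std + eps) for r in rewards]


def has_learning_signal(rewards: Sequence[float]) -> bool:
    """A group contributes gradient only if its rewards are not all identical."""
    return len(set(rewards)) > 1


unsolved = [0.0, 0.0, 0.0, 0.0]   # every rollout fails on a hard problem
partial = [1.0, 0.0, 0.0, 0.0]    # one rollout succeeds

unsolved_adv = group_advantages(unsolved)  # all zeros: the compute is wasted
partial_adv = group_advantages(partial)    # non-zero: the group is informative
```

Under this assumption, guidance from partial solution traces would help by moving groups from the first case into the second.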

Can generative reasoning beat discriminative models with less training data?

Do process reward models that generate reasoning before judging achieve better performance than traditional discriminative approaches when trained on dramatically smaller datasets? This tests whether generative verification can scale more efficiently.

Do high-entropy tokens drive reasoning model improvements?

Explores whether only a small fraction of tokens—those with high entropy at decision points—actually matter for improving reasoning performance in language models, and whether training on them alone could work as well as full training.
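
A minimal sketch of how such tokens might be selected: compute the Shannon entropy of each position's next-token distribution and keep only the top fraction. The 20% default and the function names are assumptions for illustration; the description above does not state a specific fraction.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def high_entropy_mask(entropies: Sequence[float], fraction: float = 0.2) -> List[bool]:
    """Mark positions whose entropy is in the top `fraction` of the sequence.

    Ties at the threshold are all kept, so slightly more than `fraction`
    of the positions can be selected.
    """
    if not entropies:
        return []
    k = max(1, round(len(entropies) * fraction))
    threshold = sorted(entropies, reverse=True)[k - 1]
    return [h >= threshold for h in entropies]


# Toy per-position distributions over a 4-token vocabulary (assumed data).
distributions = [
    [0.97, 0.01, 0.01, 0.01],  # confident continuation
    [0.25, 0.25, 0.25, 0.25],  # decision point: maximally uncertain
    [0.90, 0.10, 0.00, 0.00],
    [0.50, 0.50, 0.00, 0.00],
    [1.00, 0.00, 0.00, 0.00],
]
entropies = [token_entropy(d) for d in distributions]
mask = high_entropy_mask(entropies)  # True only at the decision point
```

In a training loop, a mask like this would multiply the per-token loss so that only the selected positions receive gradient.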

Can reasoning emerge from expert demonstrations alone?

Can AI systems learn to reason about non-verifiable tasks by studying expert examples rather than explicit reward signals? This matters because many high-value domains like medicine and law have abundant demonstrations but no automated verifiers.

Why don't LLM agents naturally explore each other in teams?

Multi-agent LLM systems are assumed to develop good interaction strategies through peer exploration, but do agents actually probe each other's capabilities before committing to strategies? What blocks emergent exploration?

Can model confidence alone replace external answer verification?

Can LLMs use their own certainty signals instead of external verifiers to improve reasoning? This matters for scaling beyond domains where correct answers can be automatically checked.

Can RL agents learn to reason better, not just succeed?

Standard outcome-only RL rewards agents for any successful trajectory, even flawed ones. Can we instead train agents to demonstrate genuine reasoning quality by rewarding the metacognitive process itself?

Does on-policy distillation actually expand student capability?

Investigates whether on-policy distillation transfers new abilities from teacher to student, or merely guides exploration within existing limits. Understanding this distinction matters for interpreting what distillation can and cannot achieve.

Can a single training example unlock mathematical reasoning?

Explores whether one example is enough to dramatically improve math problem-solving in language models, and whether learning continues after perfect memorization.

Do overly hard RLVR samples actually harm model capabilities?

Explores whether training on problems beyond a model's competence band causes active regression rather than mere learning failures. Investigates whether group-relative normalization amplifies accidental successes into harmful shortcuts.
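
A worked example of the amplification effect, assuming binary rewards and a GRPO-style advantage normalized by the group's population standard deviation (both are assumptions; the description above only says "group-relative normalization"). With G rollouts and exactly one accidental success:

```latex
\begin{aligned}
\mu &= \frac{1}{G}, \qquad
\sigma = \sqrt{\frac{1}{G}\left(1 - \frac{1}{G}\right)} = \frac{\sqrt{G-1}}{G}, \\
A_{\text{success}} &= \frac{1 - \mu}{\sigma} = \sqrt{G-1}, \qquad
A_{\text{failure}} = \frac{0 - \mu}{\sigma} = -\frac{1}{\sqrt{G-1}}.
\end{aligned}
```

For G = 16 the lone success receives an advantage of about 3.87 while each failure receives about -0.26, so the rarer the success, the harder the update pushes toward whatever produced it, including a lucky shortcut.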

Can search agent behavior yield reliable process rewards for reasoning?

How can we extract meaningful supervision signals from what language models actually read and cite during reasoning, rather than relying on expensive human annotation or outcome-only rewards?

Can next-token prediction become a reasoning task with RL?

Does reinforcement learning applied to next-token prediction during pretraining encourage genuine reasoning rather than surface memorization? This matters because it could unlock reasoning capability without requiring labeled data or human feedback.

Does RLVR actually expand what models can reason about?

Explores whether reinforcement learning with verifiable rewards teaches models genuinely new reasoning skills or simply makes existing capabilities more reliable. Pass@k analysis suggests the latter.
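
For reference, pass@k is commonly computed with the unbiased estimator introduced for code-generation evaluation: draw n samples per problem, count the c correct ones, and estimate the probability that at least one of k randomly chosen samples is correct. Whether the analysis referred to above uses exactly this estimator is an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: samples drawn per problem, c: correct samples, k: attempt budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# A problem solved only once in 10 samples: rarely at k=1, always at k=10.
rare_at_1 = pass_at_k(n=10, c=1, k=1)
rare_at_10 = pass_at_k(n=10, c=1, k=10)
```

Comparing a base model and its RLVR-trained version across a range of k is what separates the two readings: a gain at small k that disappears at large k points to reliability rather than new capability.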

Does RLVR actually improve mathematical reasoning or just coherence?

RLVR post-training makes reasoning traces locally more consistent, but does this structural improvement translate to valid mathematical proofs? We investigate whether trace coherence is sufficient for correctness.

Why do reasoning models fail at predicting disagreement?

RLVR models optimize for single correct answers, but many real tasks involve legitimate disagreement among annotators. Does this optimization fundamentally suppress the model's ability to capture when humans reasonably disagree?

How can rubric-based rewards resist reward hacking attacks?

Single rubrics are easily exploited by models, and simply adding more rubrics yields diminishing returns. What design patterns and defensive mechanisms actually prevent reward hacking in rubric-based RL systems?

Why do medium-difficulty problems teach reasoning better than hard ones?

Does harder always mean better for learning? This explores why easy and extremely hard samples produce weak training signals in RLVR, while medium-difficulty problems drive the strongest improvements.
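
One common way to see why the extremes carry little signal, offered as a sketch rather than as the explanation the linked note gives: assume a binary reward R and a current success rate p on a given problem. The reward variance across rollouts is then

```latex
\operatorname{Var}(R) = p\,(1 - p), \qquad
\max_{p \in [0,1]} p\,(1 - p) = \tfrac{1}{4} \ \text{at} \ p = \tfrac{1}{2}, \qquad
\operatorname{Var}(R) = 0 \ \text{at} \ p \in \{0, 1\}.
```

When p is near 0 or 1 almost all rollouts receive the same reward, so there is little contrast between them to learn from; problems the model solves about half the time maximize that contrast.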

How does model ability change what samples teach?

Does a sample's learning value stay fixed, or does it shift as the model improves? Understanding whether informativeness is a moving target could explain why fixed difficulty filters underperform adaptive ones during training.

What limits reasoning capability beyond math and code?

Can better training methods alone extend reasoning to open-ended domains like economics and social sciences, or does the real bottleneck lie elsewhere? This explores what actually constrains broader reasoning.

Why do random rewards improve reasoning for some models but not others?

When RLVR training uses meaningless reward signals, some models gain reasoning improvements while others don't. What determines which models can benefit from optimization pressure without meaningful feedback?

Why do RL agents exploit before exploring enough?

Standard task-oriented RL rewards immediate task completion over environment discovery. This may systematically under-train the exploration skills needed for unfamiliar environments.

Is the exploration-exploitation trade-off actually fundamental?

Token-level analysis suggests exploration and exploitation are opposed, but does hidden-state analysis reveal they could coexist? Understanding how measurement granularity shapes the perceived trade-off matters for scaling reasoning systems.

Why does RLVR work with completely random rewards?

RLVR improves reasoning performance even with incorrect or random reward signals. This challenges the assumption that reward quality determines learning outcomes and raises questions about what RLVR is actually doing.