Do overly hard RLVR samples actually harm model capabilities?
Explores whether training on problems beyond a model's competence band causes active regression rather than mere learning failures. Investigates whether group-relative normalization amplifies accidental successes into harmful shortcuts.
The damage from over-hard RLVR samples is not merely "the model fails to improve." It is active regression. When almost every rollout on a problem fails, the rare success is unlikely to be a genuinely good solution — it is more often a shortcut, an answer reached by skipping necessary computation, or a lucky guess. Group-relative normalization then treats that one trajectory as the high-advantage exemplar of the group and reinforces it. The model learns the shortcut, not the reasoning.
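The amplification step can be made concrete with a small numeric sketch. It assumes the common GRPO-style form of group-relative normalization, where each rollout's advantage is its reward minus the group mean, divided by the group standard deviation; the note does not spell out the exact formula, so treat the function below and its hypothetical reward vectors as illustrative only.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one rollout group: (r - mean) / (std + eps).

    Assumed GRPO-style normalization; the note does not give an exact formula.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n  # population variance
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical binary verifier rewards for two groups of 8 rollouts.
over_hard = [1, 0, 0, 0, 0, 0, 0, 0]  # a single, possibly accidental, success
medium = [1, 1, 1, 1, 0, 0, 0, 0]     # half of the rollouts succeed

adv_hard = group_relative_advantages(over_hard)
adv_medium = group_relative_advantages(medium)

# Advantage of a successful rollout vs. a failed one in each group.
print("over-hard:", round(adv_hard[0], 3), round(adv_hard[1], 3))
print("medium:   ", round(adv_medium[0], 3), round(adv_medium[4], 3))
```

With binary rewards and a success rate p, a successful rollout's advantage under this normalization is sqrt((1 - p) / p), so the lone success in the over-hard group receives a much larger weight than any single success in the medium group. If that lone trajectory is a shortcut or a lucky guess, it is the one the update pushes hardest toward.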
The behavioral signature is concrete: answer repetition, skipped computation that the problem requires, and other degenerate patterns that look like reasoning collapse. More troubling, these effects do not stay local to the hard problems. They degrade the model's pre-existing capabilities, the things it could already do before training pushed it past its competence band. The internal-feature analysis corroborates this: hard problems activate reasoning-related features, but those features become useful only on the rare successful trajectory, so most of the gradient on hard samples reinforces the wrong activations.
Why it matters: the finding identifies a specific corruption channel rather than a generic "training instability." The villain is the interaction between a sparse-success reward landscape and group-relative normalization, which together turn statistical noise (an accidental success) into a learning target. This sharpens the case against naively harvesting hard examples and connects RLVR difficulty to the broader pattern in which verifiable-reward training rewards trajectories that pass the check without doing the work. A defender might counter that some hard problems are exactly where capability frontiers expand, but that only holds when successful trajectories are sampled densely enough to outvote the shortcuts, which over-hard samples by definition fail to provide.
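One way to act on this, offered as a sketch and not as something the note prescribes, is to estimate each prompt's pass rate from its rollouts and keep only prompts whose successes are dense enough to carry signal. The `low` and `high` thresholds, the `keep_in_band` helper, and the example reward lists are all hypothetical.

```python
def pass_rate(rewards):
    """Fraction of rollouts in a group that passed the verifier."""
    return sum(rewards) / len(rewards)

def keep_in_band(groups, low=0.2, high=0.8):
    """Return prompt ids whose empirical pass rate lies in [low, high].

    `groups` maps a prompt id to its list of binary rollout rewards.
    The band edges are illustrative assumptions, not values from the note.
    """
    return [pid for pid, rewards in groups.items()
            if low <= pass_rate(rewards) <= high]

# Hypothetical rollout results for three prompts, 8 rollouts each.
groups = {
    "easy": [1, 1, 1, 1, 1, 1, 1, 1],
    "medium": [1, 0, 1, 1, 0, 0, 1, 0],
    "over_hard": [0, 0, 0, 0, 0, 0, 1, 0],
}

kept = keep_in_band(groups)
print(kept)
```

A fixed band like this is the simplest possible filter. Because the model's competence moves during training, pass rates would need to be re-estimated periodically for the band to stay meaningful.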
Inquiring lines that use this note as a source (185)
This note is a source for these synthesized inquiries.
- When does the right constraint beat additional model capacity?
- How do unstated constraints become invisible to training data distributions?
- When does statistical dominance in training create deployment failure patterns?
- Why do proprietary models improve with training while open-source models decline?
- How does baseline capability level affect RL improvement ceiling?
- Does therapy environment difficulty calibration affect RL policy learning quality?
- Can clean benchmarks reveal true RLVR reasoning gains?
- Can RLVR expand a model's reasoning capabilities beyond its training ceiling?
- Why does online RL succeed where supervised training fails for self-correction?
- Can distillation methods extract directional guidance that scalar RL cannot access?
- Why do current RLVR methods fail to expand reasoning capability beyond base model boundaries?
- How does non-reasoning SFT prevent overfitting before RL training begins?
- Can curated demonstrations compensate for smaller or simpler training environments?
- What happens when a single loss function conflates representation learning with decision-making?
- Why does asymmetric self-play create naturally calibrated difficulty better than fixed curricula?
- What failure modes emerge when model-generated content trains on itself iteratively?
- How do models generalize specific training exploits into broad misaligned objectives?
- Why do static evaluators become a constraint on model improvement over time?
- What causes models to develop domain capability cliffs after specialization?
- How do different training objectives shift whether models over-predict or under-predict?
- Does selecting examples from multiple complexity levels outperform selecting only high-quality examples?
- How does distributional distance from pre-training relate to model difficulty?
- Why does training data format matter more than domain content?
- Why do easy training examples contribute less to model generalization than hard ones?
- Can gradient-based influence scores beat difficulty metrics for identifying valuable training data?
- Why do zero-advantage rollouts destabilize training beyond just wasting compute?
- What capability risks emerge when models are optimized for single domains?
- Which AI safety problems lack the scalar metrics autoresearch requires?
- How does over-specialization create capability cliffs outside target domains?
- What breaks when you apply reinforcement learning after supervised fine-tuning?
- How do task difficulty and skill type interact in model performance?
- How do surface statistical regularities enable correct outputs while degrading robustness?
- How can weak-to-strong progressive training target planning without interfering with grounding?
- What inductive bias would force models to learn Newtonian mechanics instead of shortcuts?
- How does modified PPO handle samples from much older model versions?
- How does training-time voting differ from inference-time majority voting over samples?
- When do aggregated imperfect demonstrations fail to outperform the best expert?
- Why does training data format matter more than its domain content?
- Can selecting the right data subset outperform training on everything?
- What capabilities actually require massive scale versus specialized training regimes?
- Why does a relativistic critic outperform absolute scoring in adversarial reasoning training?
- What stability techniques prevent collapse in policy-critic adversarial training?
- Why does decoupling retriever and generator training create misalignment?
- Does knowledge structure matter more than knowledge volume for model training?
- Why does curriculum learning with tight budgets beat fixed-budget approaches?
- How does training data distribution create asymmetric competence across relation types?
- Can unsupervised confidence-based training scale to domains beyond human evaluation reach?
- How should researchers evaluate whether correct model outputs reflect real structural learning?
- Can curriculum degradation of document quality accelerate policy learning?
- How does training data distribution determine what models can learn?
- What makes some model capabilities reliable while others remain brittle?
- What causes irreversible model collapse when training on model-generated content?
- Why do scaling laws show capability saturation at specific thresholds?
- How do residual connections and layer norm stabilize training in deep RL?
- What distinguishes training-time entropy collapse from test-time variance inflation?
- Does reverse-curriculum learning approximate process supervision using only outcome signals?
- Can explicit rejection responses solve the over-specialization failure mode?
- Does specialized training in one domain create capability cliffs elsewhere?
- Can training data analysis predict which samples will cause unintended personality changes?
- How does inference variance differ from training entropy collapse?
- Can diversity-aware RL objectives prevent format convergence?
- How do loss functions simultaneously shape both learning and decision quality?
- What makes utility-weighted training backfire in machine learning systems?
- What training data contamination rates threaten model safety most practically?
- Why do production teams choose expensive frontier models over fine-tuning?
- Does foundational model training or user priors more strongly shape final outputs?
- Why does the gap between theoretical expressiveness and learned capability matter?
- Does fine-tuning actually change model capabilities or only output distribution?
- What makes process-level supervision better than outcome-only rewards for RAG training?
- What role does inductive bias play versus model capacity in practice?
- Why does combining reasoning distillation with RLVR outperform either training stage alone?
- Can smaller models achieve domain expertise through focused RL training?
- Which recipe choices determine the asymptotic ceiling in RL training?
- How does KL penalty strength affect the degree of format collapse during RL?
- Can RL format selection explain performance gains attributed to algorithmic improvements?
- Why does RL improve sampling efficiency but not expand capability boundaries?
- How does behavior cloning reduce complexity before RL training in rerankers?
- Does RLVR reward structure create pressure toward traces that look right?
- What role do high-entropy minority tokens play in RLVR?
- Can capability boundary collapse be reversed through external data?
- What limits RLVR effectiveness beyond mathematical and coding domains?
- How do quality, diversity, and complexity create different effects on downstream model performance?
- Why do weaker models generate better training data than stronger models?
- Does training data format matter more than who generates it?
- Why does filtering for correct examples prevent error compounding in self-training?
- How does error avalanching compound failures in self-training iterations?
- Why do metric choices constrain which model capabilities get developed?
- Why do weaker teacher models sometimes produce better training signals than stronger ones?
- How does data quality mismatch create reasoning degradation in supervised fine-tuning?
- How does model confidence relate to accuracy in underfitted domains?
- What signals detect when consensus training is silently degrading performance?
- What makes routing a better investment than training larger models?
- Can gradient-based influence estimation make test-time training more efficient?
- How much task-similar finetuning data does test-time training actually need?
- Does RLVR expand model capability or reorganize existing capability?
- How do RL training and base models differ in creating MI peaks?
- Do high-entropy RLVR tokens correspond to MI-peak tokens during inference?
- Why do rare cases in medicine and science require models that preserve tail distributions?
- Does training data format determine whether models collapse entropy or inflate variance?
- Can trajectory quality filtering improve model training in noisy environments?
- Can bilevel autoresearch autonomously modify its own learning algorithms?
- Why do different model training approaches produce different overthinking thresholds?
- What happens when post-training patches try to add human values without upstream pipeline change?
- Why does prolonged RL discover strategies absent from any base model sample?
- How does Supervised RL bridge the gap between SFT and RLVR?
- What failure modes do imitation and outcome methods each address?
- What specific failure modes appear when AI tackles research-level experiments?
- Can model training address failures that really originate in harness gaps?
- What happens when you project the same model onto different harnesses?
- Why does uncontrolled self-revision drift toward instance-specific overfitting?
- How does adversarial collapse threaten unsupervised self-play skill construction?
- How do reward signals in RLVR interact with pretraining biases?
- What happens to base model capabilities when you apply finetuning?
- Does trace length actually reflect problem difficulty or training proximity?
- Why do queries with low cross-rollout variance produce degenerate gradients?
- How does post-training shift models from passive prediction to on-policy action?
- Why do medium-difficulty problems produce more stable learning gains?
- What makes preventative lessons from failures more valuable than success patterns?
- How do difficulty metrics relate to the true value of training examples?
- Why does the order of training examples matter for what models learn?
- How does relative progress estimation reduce dependence on hard labels for process supervision?
- Where does skill extraction fail compared to genuine model adaptation?
- Does RL training activate latent meta-learning capacity or create it from scratch?
- How much can externalized skills improve models before hitting diminishing returns?
- What scaling properties emerge from RL training dynamics beyond verification?
- Why should deep learning theory prioritize average-case over worst-case analysis?
- Why does moderate difficulty outperform maximum realism in user simulator design?
- How does curriculum learning prevent instability in social-emotional RL training?
- Can dynamic variance weighting replace fixed objective combination weights?
- Why does test accuracy improve after training accuracy reaches 100 percent?
- How does absolute-advantage weighting concentrate training on boundary cases?
- Why does medium difficulty outperform both easy and hard RLVR training samples?
- Can group-relative normalization be modified to resist shortcut trajectories?
- Does importance sampling actually recover capabilities lost to hard sample training?
- Why does step-level expert alignment work when outcome-only RL fails?
- Why do overtrained domains show different RL training outcomes than novel tasks?
- How much training data is truly necessary to unlock latent model reasoning?
- What makes a learned consolidation rule lossy and where does contamination enter?
- What's the difference between RLHF, RLVR, and RLCF as training paradigms?
- Could activation sparsity signal task difficulty and guide routing decisions?
- How do failure examples improve distillation compared to successful trajectories alone?
- What capacity threshold determines whether RL teaches activation versus shortcut learning?
- How does prolonged RL training differ from standard RLVR approaches?
- Why does supervised fine-tuning on diverse demonstrations expand exploration diversity compared to RL?
- Does the pretrained model prior limit RL search capability more than the optimization algorithm itself?
- Can specialized components replace single fully-trained models in deployment?
- Why do adaptive curriculum schemes outperform static difficulty filters?
- Does the productive difficulty band ever stabilize during training?
- How does difficulty-adaptive curriculum learning change which samples get selected during training?
- How does the optimal difficulty band shift as the model's capabilities improve during training?
- What mechanisms cause overly hard samples to degrade prior model performance?
- Why do certain tokens at certain difficulties drive most of RLVR's learning signal?
- Can partial solution traces convert unproductive hard samples into learnable training data?
- Does RLVR teach new reasoning or activate existing pretraining capabilities?
- What pretraining formats encode latent reasoning strategies that RLVR can surface?
- Does careful reward engineering matter if pretraining determines RLVR effectiveness?
- Can combining SRL with RLVR outperform either method used alone?
- Why does SFT fail when expert demonstrations are too long for small models?
- What training regimes confound surface mechanisms with their actual causes?
- What happens when models optimize specifically against CoT monitors?
- Does refining around bad results risk cascading errors in automated research?
- How do past research mistakes prevent future pivot loops from repeating them?
- Why does the pretrained prior determine the exploration ceiling?
- How does advantage normalization improve critic-free policy learning?
- Why does gradient discarding limit standard policy clipping?
- Why does reinforcement learning training degrade model calibration?
- Why do structure-targeted training negatives fail to fix the underlying problem?
- Why does outcome-based RL specifically lose diversity during training?
- Why does the right structural prior matter more than raw model capacity?
- Can experimental outcomes be reliably distilled into reusable insights?
- Do base models already contain latent behavioral principles waiting to be amplified?
- Why do unified models still inherit data-distribution biases from training?
- Can filtering unknown examples during fine-tuning prevent hallucination increases?
- How do finetuning and pretraining improvements differ in their effects on model capabilities?
- Why does negative experience transfer better than positive examples alone?
- How do task frequency and complexity interact with model capacity during training?
- How does model scale affect anticipatory behavior in structured training?
- How does positive-only rubric scoring prevent models from gaming intermediate steps?
- How does active selection of training content differ from random reinforcement sampling?
- Can models trained with RL on pretraining data avoid reward hacking seen in RLHF?
- Can production RL systems escalate from gaming to emergent misalignment behaviors?
- Do frontier models develop strategic misalignment from ordinary training pressure alone?
- How much performance is lost when converting pretrained checkpoints versus training from scratch?
- Does finetuning facts into weights overwrite existing model capabilities?
- What makes a model fail to activate relevant skills from its own harness?
Related concepts in this collection (6)
- Why do medium-difficulty problems teach reasoning better than hard ones?
  Does harder always mean better for learning? This explores why easy and extremely hard samples produce weak training signals in RLVR, while medium-difficulty problems drive the strongest improvements.
  the parent finding; this note details the downside arm of the inverted-U
- Does RLVR actually improve mathematical reasoning or just coherence?
  RLVR post-training makes reasoning traces locally more consistent, but does this structural improvement translate to valid mathematical proofs? We investigate whether trace coherence is sufficient for correctness.
  same gap between surface success and genuine reasoning; shortcut amplification is one mechanism producing coherent-but-invalid traces
- Why does RLVR training narrow a model's problem solving ability?
  RLVR's on-policy constraint may force models to exploit known reasoning paths rather than explore new ones, potentially shrinking their effective problem-solving scope. Understanding this mechanism could reveal how to design better exploration incentives in language model reasoning.
  the capability-erosion outcome at scale; over-hard samples are one driver of the boundary collapse
- Do conversational recommender benchmarks actually measure recommendation skill?
  Conversational recommender systems are evaluated against ground-truth items mentioned later in conversations. But does this metric distinguish between genuinely recommending new items versus simply repeating items users already discussed?
  parallel shortcut-amplification dynamic in a different domain: the reward structure rewards a degenerate copy strategy
- Why does RLVR work with completely random rewards?
  RLVR improves reasoning performance even with incorrect or random reward signals. This challenges the assumption that reward quality determines learning outcomes and raises questions about what RLVR is actually doing.
  counterpoint and complication: RLVR can work despite noisy reward, but this note shows the regime (over-hard samples) where reward noise becomes actively harmful
- What reasoning features does each difficulty level reinforce?
  When models train on problems of different difficulty, do they build the same internal reasoning machinery or different kinds? This matters because accuracy gains alone hide what's actually being learned.
  same-paper companion: supplies the internal-feature mechanism, in which hard samples activate reasoning features that only the rare success rewards, so most gradient reinforces the wrong activations
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs
- Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
- Reinforcement Learning for Reasoning in Large Language Models with One Training Example
- The Invisible Leash: Why RLVR May Not Escape Its Origin
- RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization
- Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?
- Absolute Zero: Reinforced Self-play Reasoning with Zero Data
- ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models
Original note title
overly hard rlvr samples induce degenerate behaviors and amplify shortcut trajectories degrading prior capability