← All notes

How does RL training reshape reasoning and what gets lost?

A navigation hub examining RL training mechanics, verifiable rewards, and the trade-offs between optimization and capability preservation.

Topic Hub · 39 linked notes · 11 sections
View as

Sub-Maps

2 notes

What actually changes inside a model during RL training?

RL training modifies only sparse regions of model parameters through suppression of incorrect paths rather than broad capability building. Understanding these mechanics reveals how fine-tuning shapes reasoning and what hidden costs accompany optimization.

Explore related Read →

What does reward learning actually do to model reasoning?

Explores whether RLVR expands reasoning capabilities or merely activates latent skills. Investigates the mechanism by which rewards reshape model outputs and whether this constitutes genuine learning or efficient sampling.

Explore related Read →

Writing Angles

2 notes

Why does RLVR work with completely random rewards?

RLVR improves reasoning performance even with incorrect or random reward signals. This challenges the assumption that reward quality determines learning outcomes and raises questions about what RLVR is actually doing.

Explore related Read →

Why do language models fail to act on their own reasoning?

LLMs produce correct explanations far more often than they produce correct actions. What causes this knowing-doing gap, and can training methods close it?

Explore related Read →

Pass 3 Additions (2026-05-03)

3 notes

Why can't we easily adapt reinforcement learning to diffusion language models?

Autoregressive models enable efficient RL post-training through factorizable log-probabilities, but diffusion models generate tokens in parallel non-sequential order. What makes likelihood computation intractable in diffusion, and can we work around it?

Explore related Read →

Can agents learn from their own actions without external rewards?

Explores whether future states produced by an agent's own decisions can serve as supervision signals, bridging the gap between passive imitation learning and reward-dependent reinforcement learning.

Explore related Read →

When does majority-vote reward actually help test-time learning?

Test-time RL using consensus rewards shows contradictory results across different models and domains. What determines whether consensus amplifies correct answers or reinforces confident mistakes?

Explore related Read →

Cross-Paper Synthesis (2026-05-18)

1 note

Can trajectory structure replace hand-annotated process rewards?

Recent methods extract step-level supervision directly from how agent trajectories are structured—trees, expert alignments, tool calls—rather than training separate reward models. Can this structural approach consistently avoid annotation costs?

Explore related Read →

2026-05-18 Batch — Diversity, Process Policies, and Distillation Hazards

11 notes

Does preference tuning actually reduce the diversity of model outputs?

The field assumes RLHF and DPO reduce diversity, but this assumption rests on measuring all outputs equally. What happens if we only count diverse outputs that meet quality thresholds?

Explore related Read →

Does preference tuning always reduce diversity the same way?

Explores whether the standard narrative that RLHF reduces model diversity holds equally across different task domains, or if the effect varies by what the domain rewards.

Explore related Read →

Can optimizing attention patterns improve multimodal RL better than optimizing tokens?

Standard RL training optimizes token outputs in multimodal models, but the real bottleneck may be where the model attends to visual information. Does steering attention directly outperform indirect optimization through final outputs?

Explore related Read →

Can step-wise expert rewards help small models learn hard reasoning?

When small models fail on hard multi-step problems, can training them to match expert reasoning steps rather than final answers provide useful learning signals? This explores whether intermediate-step alignment might overcome the limitations of both supervised fine-tuning and outcome-based reinforcement learning.

Explore related Read →

Does sequencing imitation then exploration training improve reasoning?

Can combining Supervised RL (expert imitation) followed by RLVR (outcome rewards) outperform either method alone on hard reasoning tasks? This explores whether curriculum ordering unlocks capabilities neither method achieves independently.

Explore related Read →

Can LLMs learn to ask for feedback during problem solving?

Explores whether language models can be trained to actively solicit corrective feedback mid-conversation rather than committing to single-turn answers. This matters because it could bridge the gap between fluent chat and genuine conversational learning.

Explore related Read →

Why does teacher-student information asymmetry enable learning signals?

What role does privileged answer access play in making social meta-learning training work? Without asymmetric information, can a conversation between teacher and student function as pedagogy or only as parallel speculation?

Explore related Read →

Can models learn to ask clarifying questions without explicit training?

Do language models trained only on fully-specified problems spontaneously develop the ability to ask for missing information when facing underspecified tasks? This tests whether conversational problem-solving strategies emerge from meta-learning rather than direct instruction.

Explore related Read →

Does self-distillation harm mathematical reasoning performance?

Self-distillation usually improves models while shortening outputs, but mathematical reasoning shows a puzzling exception: performance drops up to 40%. What mechanism explains this counter-intuitive degradation?

Explore related Read →

Does richer teacher context hurt student generalization?

When teachers are given more information during distillation, they produce confident but brittle students. Does this trade-off between in-domain wins and out-of-distribution robustness hold across different task distributions?

Explore related Read →

Can post-training objectives preserve reasoning style alongside correctness?

Even mathematically sound training objectives may suppress reasoning behaviors like uncertainty expression without penalizing them. Does optimizing for answer correctness inadvertently degrade the stylistic features that enable generalization?

Explore related Read →

2026-05-18 Batch C — Personalized Rewards, Process Supervision, Recommendation RL

10 notes

Does preference data need more raters than examples?

Pairwise preference data violates the i.i.d. assumption because preferences vary across raters. Does this mean PAC bounds for reward models depend on rater diversity rather than just sample size?

Explore related Read →

Can aggregate reward models satisfy genuinely disagreeing users?

When users have conflicting preferences, do aggregate reward models face an impossible choice between satisfying majorities or sampling proportionally? What does this reveal about RLHF deployment?

Explore related Read →

Does personalizing reward models amplify user echo chambers?

Personalized reward models solve the minority-preference problem but may introduce new risks by reinforcing existing user beliefs and narrowing exposure to diverse viewpoints.

Explore related Read →

Can recommendation metrics train language models directly?

Explores whether LLMs can be optimized through closed-loop reinforcement learning using real recommendation system outputs as rewards, rather than relying on expensive proprietary model distillation.

Explore related Read →

Can LLMs recommend products without ever seeing the catalog?

Explores whether language models can learn to generate effective search queries for recommendation systems without direct access to inventory data. This challenges the intuition that good recommendations require knowing what items exist.

Explore related Read →

Can tree structure alone convert outcome rewards into process supervision?

Tree-based rollouts naturally create step-level preference signals by comparing sibling subtrees. Can this structural approach replace separate process reward models without explicit step-level annotation?

Explore related Read →

Can shared-prefix trees reduce redundancy in agent rollouts?

Independent rollouts waste tokens regenerating similar early-turn sequences. Can structuring rollouts as shared-prefix trees instead preserve early computation across samples while maintaining statistical diversity for advantage estimation?

Explore related Read →

Does tree depth automatically produce supervision at multiple granularities?

Tree-search rollouts branch at different depths, potentially creating supervision signals ranging from coarse strategy-level to fine-grained detail-level choices. Does this depth variation naturally yield multi-granular process supervision without explicit annotation design?

Explore related Read →

Can structured reasoning replace code execution for RL rewards?

Can semi-formal templates enable execution-free code verification reliable enough to train RL agents without running code? This matters because execution is expensive and slow in agent training loops.

Explore related Read →

Can simulated APIs and token-level credit assignment train better tool-using agents?

Training agents to use real APIs is expensive and unstable, and sparse rewards make it hard to credit the right tool calls. Can combining LLM simulators with fine-grained advantage attribution solve both problems?

Explore related Read →

Backlog wave 2 — Batch #3 *(2026-06-03)*

1 note

Do tools actually expand what language models can reason about?

Explores whether tool access fundamentally breaks through reasoning limits in pure-text models, or merely optimizes existing capabilities. Understanding this distinction clarifies whether tools are luxury features or necessity for genuine capability growth.

Explore related Read →

Backlog wave 2 — Batch #3 *(2026-06-03)*

2 notes

Can reinforcement learning improve models during general pretraining?

Can RL work during standard pretraining on unverified text like Wikipedia, without reward models or labeled data? This matters because it would remove the data bottleneck that currently limits RL-based training to small verified domains.

Explore related Read →

Can online LLM feedback improve direct preference optimization during training?

Direct alignment methods like DPO use fixed preference data from older models, creating off-policy training. Could sampling fresh responses from the current model and using an LLM judge to pick preferences in real time reduce overfitting and improve alignment?

Explore related Read →

Backlog wave 3 — Batch #4 *(2026-06-03)*

1 note

Process reward from trajectories — new papers *(2026-06-03)*

1 note

Can search agent behavior yield reliable process rewards for reasoning?

How can we extract meaningful supervision signals from what language models actually read and cite during reasoning, rather than relying on expensive human annotation or outcome-only rewards?

Explore related Read →