INQUIRING LINE

Being smarter about punishing AI's wrong answers can't fix the problem — the punishment is the problem.

Can suppressing incorrect behavior alone solve the diversity bottleneck in reasoning RL?

This explores whether the way reasoning RL kills diversity — by penalizing wrong answers — can be fixed by just penalizing them more carefully, or whether suppression is the wrong lever entirely.

This reads the question as asking: if RL's diversity problem comes from rewarding correct answers, can we solve it purely by being smarter about pushing down incorrect behavior, or does diversity require its own positive machinery? The corpus answers fairly cleanly: suppression alone is not a fix, because suppression is the mechanism that creates the bottleneck in the first place. Outcome-based RL that rewards only final-answer correctness sharpens the policy globally: it concentrates probability mass on the trajectories it already favors, and, crucially, this collapse leaks from solved problems onto unsolved ones, so the model gets narrower exactly where it still needs to explore (see "Does outcome-based RL diversity loss spread across unsolved problems?"). The same entropy-collapse signature shows up in search agents, where RL converges on a few reward-maximizing strategies while SFT on diverse demonstrations keeps exploration broad (see "Does reinforcement learning squeeze exploration diversity in search agents?").
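
The sharpening effect described above can be seen in a toy setting. The sketch below is an illustration under stated assumptions, not a reproduction of any cited experiment: a single problem, a softmax policy over four hypothetical strategies (two of them correct), and an exact expected policy-gradient step with a binary outcome reward. All names (`softmax`, `entropy`, `expected_pg_step`, `sharpen`) and the numbers are invented for the example. It does not model the leak from solved to unsolved problems, which depends on parameters shared across problems.

```python
import math


def softmax(logits):
    """Turn logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; lower means a narrower policy."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def expected_pg_step(logits, rewards, lr=1.0):
    """One exact expected policy-gradient step for a softmax policy.

    For expected reward E[r] = sum_j p_j * r_j, the gradient with respect
    to logit i is p_i * (r_i - E[r]).
    """
    probs = softmax(logits)
    baseline = sum(p * r for p, r in zip(probs, rewards))
    return [l + lr * p * (r - baseline) for l, p, r in zip(logits, probs, rewards)]


def sharpen(logits, rewards, steps=200, lr=1.0):
    """Apply repeated outcome-only updates and return the final distribution."""
    for _ in range(steps):
        logits = expected_pg_step(logits, rewards, lr)
    return softmax(logits)


# Assumed toy setup: strategies 0 and 1 are both correct, and strategy 0
# starts slightly favored. Strategies 2 and 3 are wrong.
start_logits = [0.5, 0.0, 0.0, 0.0]
outcome_rewards = [1.0, 1.0, 0.0, 0.0]

before = softmax(start_logits)
after = sharpen(start_logits, outcome_rewards)

print("entropy before:", round(entropy(before), 3))
print("entropy after: ", round(entropy(after), 3))
print("share of favored correct strategy:", round(after[0], 3))
```

In this toy, entropy falls and the gap between the two equally correct strategies widens in favor of the one that started ahead, because each update is scaled by the probability the policy already assigns.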

A deeper reason suppression can't manufacture diversity: RL isn't adding new behaviors to suppress around. Multiple independent lines of evidence suggest base models already contain latent reasoning strategies, and post-training selects among them rather than creating them (see "Do base models already contain hidden reasoning ability?"). RL largely teaches a model *when* to deploy reasoning it already has, not *how* to reason (see "Does RL post-training create reasoning or just deploy it?"). Seen that way, suppressing wrong trajectories can only prune an existing repertoire; it has no channel for widening it. Worse, RL tends to amplify a single dominant format inherited from pretraining within the first epoch and quietly extinguish the alternatives (see "Does RL training collapse format diversity in pretrained models?"). So pure suppression doesn't just fail to add diversity; it actively narrows the strategy space.

What the corpus suggests instead is that diversity needs its own dedicated mechanism, separate from the correctness signal. One note draws the sharpest version of this: historical exploration (training-time diversity, e.g. UCB-style bonuses) and batch exploration (test-time diversity, e.g. repetition penalties) are *structurally different* and can't be folded into one reward (see "Does outcome-based RL diversity loss spread across unsolved problems?"). Adjacent work points the same direction from other angles. Structuring a single model's reasoning as an internal dialogue between agents breaks the fixed-strategy rut that monologue reasoning falls into (see "Can dialogue format help models reason more diversely?"), and process rewards that score metacognitive moves (planning, exploration, reflection) cut repetitive actions while generalizing better than outcome-only training (see "Can RL agents learn to reason better, not just succeed?").
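
One way to see why the two kinds of exploration resist being merged is that they depend on different state. The following is a minimal sketch under assumed forms: the bonus `c / sqrt(1 + N)` and the linear duplicate penalty are generic illustrations, not the formulas of the cited note, and the names `historical_bonus` and `batch_penalty` are hypothetical.

```python
import math
from collections import Counter


def historical_bonus(answer, history_counts, c=1.0):
    """UCB-style bonus (assumed form): shrinks as an answer accumulates
    visits across the whole training run."""
    return c / math.sqrt(1 + history_counts[answer])


def batch_penalty(answer, batch, weight=0.5):
    """Repetition penalty (assumed form): grows with the number of
    duplicates of this answer inside the current batch only."""
    return weight * (Counter(batch)[answer] - 1)


# Assumed example data: "42" has been produced often in past training,
# "17" has never been seen; the current batch repeats "42" three times.
history = Counter({"42": 30})
batch = ["42", "42", "42", "17"]

for ans in ("42", "17"):
    print(ans,
          "historical bonus:", round(historical_bonus(ans, history), 3),
          "batch penalty:", batch_penalty(ans, batch))
```

The historical term needs a persistent count that survives across updates, while the batch term is computed from the samples drawn for one prompt and then discarded, so a single scalar reward has no obvious place to carry both.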

There's a useful twist worth knowing: the failure modes RL introduces by over-suppressing aren't only about diversity. They're also about *calibration*, and the fixes rhyme. Binary correctness rewards push models toward overconfident guessing because nothing penalizes a confident wrong answer, and the remedy is to add a second objective (a Brier-style proper scoring term) rather than tune the first one harder (see "Does binary reward training hurt model calibration?"). Using the model's own answer-span confidence as a reward similarly restores calibration while strengthening reasoning (see "Can model confidence work as a reward signal for reasoning?"). The recurring lesson across all of these: a single scalar that only says 'this output was wrong, do less of it' optimizes itself into a corner. Diversity, like calibration, has to be a co-equal objective with its own signal, not a side effect you hope to preserve by suppressing failures more gently.
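
The calibration point can be made concrete with a small worked example. This is a sketch, assuming the second objective is combined as correctness minus the squared error between stated confidence and correctness (the Brier score for a single binary outcome); the exact weighting used in the cited work is not given here, and the function names are hypothetical.

```python
def binary_reward(correct):
    """Outcome-only reward: 1 for a correct answer, 0 otherwise."""
    return 1.0 if correct else 0.0


def brier_augmented_reward(correct, confidence):
    """Correctness minus a Brier penalty (assumed combination).

    confidence is the model's stated probability that its answer is right.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2


# A wrong answer earns the same binary reward at any confidence level,
# but the Brier term separates the confident miss from the hedged one.
for conf in (0.9, 0.1):
    print("wrong answer, confidence", conf,
          "binary:", binary_reward(False),
          "with Brier term:", round(brier_augmented_reward(False, conf), 2))
```

Under the binary reward both wrong answers score 0, so nothing discourages the confident guess. With the added term the confident miss scores lower than the hedged one, which is the extra signal the paragraph above describes.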

Sources 9 notes

Does outcome-based RL diversity loss spread across unsolved problems?

RL that rewards only final answer correctness sharpens the policy globally, concentrating probability mass on correct trajectories for solved problems while simultaneously reducing diversity on unsolved ones. Historical exploration (training diversity via UCB-style bonuses) and batch exploration (test-time diversity via repetition penalties) require structurally different mechanisms.

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Can dialogue format help models reason more diversely?

DialogueReason, which structures a single model's internal reasoning as dialogue between distinct agents in separate scenes, overcomes monologue reasoning's fixed-strategy and fragmented-attention weaknesses, especially on tasks requiring multiple problem-solving approaches.

Can RL agents learn to reason better, not just succeed?

RLVMR uses structured meta-reasoning tags (planning, exploration, reflection, monitoring) with programmatic rewards to train agentic RL. This reduces repetitive actions by 31% compared to outcome-only methods while maintaining better generalization than supervised fine-tuning alone.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning RL researcher re-testing whether suppression-only approaches can solve diversity loss. The question remains open: does outcome-only RL's diversity bottleneck require its own positive mechanism, or have recent models/methods/tooling changed the constraint?

What a curated library found — and when (findings span 2024–2026; treat as dated claims):
• Outcome-based RL sharpens policy globally and transfers entropy collapse from solved to unsolved problems, narrowing exploration exactly where it's needed (~2025).
• Base models already contain latent reasoning strategies; post-training selects among them rather than creating them. RL teaches *when* to deploy reasoning, not *how* (~2025).
• RL amplifies a single dominant format inherited from pretraining within the first epoch, quietly extinguishing alternatives (~2025).
• Training-time (UCB-style) and test-time (repetition penalty) exploration are structurally different and cannot be folded into one reward (~2025).
• Process rewards scoring metacognitive moves (planning, exploration, reflection) reduce repetition while generalizing better than outcome-only training (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2504.07912 (Echo Chamber, 2025) — RL amplifies pretraining behaviors
• arXiv:2507.22844 (RLVMR, 2025) — meta-reasoning rewards for robustness
• arXiv:2509.06941 (Outcome-based Exploration, 2025) — diversity loss mechanism
• arXiv:2605.22817 (Vector Policy Optimization, 2026) — training for diversity improves test-time search

Your task:
(1) RE-TEST suppression-as-sufficient: Has scaling, new architectures (e.g. mixture-of-experts reasoning), or multi-agent orchestration (e.g. ensemble decoding, memory-augmented rollout caching) since relaxed the need for explicit diversity mechanics? Separate the durable claim (single-signal RL narrows strategy space) from the perishable one (diversity cannot be recovered post-hoc). Cite what relaxed it, or plainly state the constraint still holds.
(2) Surface the strongest *disagreement* work from the last 6 months: any paper arguing outcome-only RL + smarter suppression, regularization, or curriculum design *does* preserve diversity without separate machinery.
(3) Propose 2 research questions that assume the regime may have shifted: e.g., does auxiliary diversity loss + outcome RL now match dedicated process-reward approaches? Can test-time search now substitute for training-time diversity?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
