INQUIRING LINE

What makes pretraining composition more important than reward engineering?

This explores a claim several papers in the corpus converge on: that what a model learned during pretraining sets the ceiling on reasoning, while reward design during RL mostly selects and amplifies what's already there rather than adding new capability.


This reads the question as asking whether the heavy lifting in modern reasoning models happens before reward engineering ever starts, and the corpus makes a surprisingly strong case that it does. The clearest statement comes from work on RLVR dynamics, which finds that reinforcement learning with verifiable rewards (RLVR) improves how efficiently a model samples good answers but doesn't expand its capability boundary: a single training example can be enough to 'activate' a reasoning strategy, and even spurious rewards work nearly as well as correct ones, as long as the model was pretrained appropriately (see "What does reward learning actually do to model reasoning?"). If a wrong reward signal gets you most of the way there, the reward isn't where the reasoning lives. Pretraining is.
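One way to make the distinction between sampling efficiency and capability boundary concrete is pass@k: the chance that at least one of k sampled answers is correct. The sketch below uses the standard unbiased pass@k estimator. The per-problem counts are invented for illustration and do not come from any of the cited papers; they only show the shape of the claim, where pass@1 rises sharply while pass@k at large k stays put because the set of solvable problems has not changed.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples, drawn
    without replacement from n generations of which c are correct, is correct."""
    if n - c < k:
        return 1.0  # too few wrong generations to fill all k slots
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical numbers: correct generations out of N = 100 for four problems.
N = 100
base_correct = [2, 5, 0, 1]      # base model: rarely right, but can solve 3 of 4
tuned_correct = [40, 60, 0, 30]  # after RL: right far more often, same 3 of 4

def mean_pass_at_k(correct_counts, k):
    """Average pass@k over a list of per-problem correct counts."""
    return sum(pass_at_k(N, c, k) for c in correct_counts) / len(correct_counts)

for k in (1, 100):
    print(f"k={k}: base={mean_pass_at_k(base_correct, k):.3f} "
          f"tuned={mean_pass_at_k(tuned_correct, k):.3f}")
```

In this toy setup the tuned model wins clearly at k=1, and both models tie at k=100, because the fourth problem has zero correct generations either way.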

A second study makes the mechanism concrete. When you apply RL on top of a pretrained model, it doesn't invent a new way of formatting answers: within the first epoch it locks onto one dominant format that already existed in the pretraining distribution and collapses the alternatives. Which format wins depends on model scale, not necessarily on which format performs best, and this whole dynamic is hidden when you start from a proprietary base model whose pretraining mix you can't see (see "Does RL training collapse format diversity in pretrained models?"). So reward engineering is less like teaching and more like choosing which pre-existing voice gets amplified, and the menu of voices was written during pretraining.

Look laterally and the reward-design papers themselves keep bumping into this ceiling. Negative reinforcement alone, which only suppresses wrong trajectories, matches or beats full PPO and GRPO, partly because it preserves the answer diversity that pretraining produced instead of collapsing probability mass onto a few modes (see "Does negative reinforcement alone outperform full reinforcement learning?"). And when models plateau, the fix that breaks the plateau isn't a cleverer numerical reward but natural-language critiques that carry information the scalar reward never could, a sign that reward signals are an impoverished channel for actually changing what a model knows (see "Can natural language feedback overcome numerical reward plateaus?").
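The diversity argument for negative reinforcement can be illustrated with a single-sample toy, assuming a softmax policy over four candidate answers. Everything here is hypothetical: the answers, logits, and step size are made up, and this is not the training setup of the cited paper. It applies one gradient step on the log-probability of a sampled answer, either raising a sampled correct answer (positive-only) or lowering a sampled wrong one (negative-only), and compares what happens to the other correct answer and to the entropy of the distribution.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def step_on_sample(logits, sampled, step):
    """One gradient step on log pi(sampled) for a softmax policy.
    d log pi(y) / d z_j = 1[j == y] - pi_j. A positive step raises the
    sampled answer's probability; a negative step lowers it."""
    probs = softmax(logits)
    return [z + step * ((1.0 if j == sampled else 0.0) - p)
            for j, (z, p) in enumerate(zip(logits, probs))]

# Hypothetical answers: index 0 (A) and 1 (B) are correct, 2 (C) and 3 (D) are wrong.
logits = [1.0, 0.5, 1.0, 0.0]
STEP = 2.0  # illustrative update size, not a tuned learning rate

before = softmax(logits)
after_pos = softmax(step_on_sample(logits, sampled=0, step=+STEP))  # reward A
after_neg = softmax(step_on_sample(logits, sampled=2, step=-STEP))  # penalize C

for name, p in (("before", before), ("positive-only", after_pos),
                ("negative-only", after_neg)):
    print(f"{name:14s} P(correct)={p[0] + p[1]:.3f} "
          f"P(B)={p[1]:.3f} entropy={entropy(p):.3f}")
```

Under these assumptions both steps raise the total probability of a correct answer, but the positive-only step takes mass away from the unsampled correct answer B and leaves a lower-entropy distribution, while the negative-only step hands the freed mass back to the remaining answers and leaves B more likely than before.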

The interesting tension is that the corpus doesn't say reward engineering is useless. It says reward engineering is mostly *steering*, and steering is bounded by what you're steering. Training order matters because structured tasks shrink output entropy while creative tasks grow it, so scheduling reshapes which capabilities survive (see "Does training order reshape how models handle different task types?"). That is, again, rearranging existing capacity rather than minting new capacity. The one place RL genuinely seems to embed new knowledge, rather than activate old knowledge, is RLAG, which works by rewarding explanation quality and cycling between augmented and plain generation, and it's notable that this requires going beyond token-level correctness rewards to get there (see "Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?").

The thing you didn't know you wanted to know: the effort researchers pour into reward hacking, calibration, and multi-objective weighting may be partly misplaced. If pretraining composition sets the boundary and reward mostly picks which pretrained behavior to surface, then the highest-leverage decisions were made before the reward function was ever written, and the field's inability to see inside proprietary pretraining mixes means we're often tuning the steering wheel while blindfolded to the engine.


Sources (6 notes)

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Does negative reinforcement alone outperform full reinforcement learning?

Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.

Does training order reshape how models handle different task types?

Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. Scheduling guided by backward transfer (BWT), which trains structured tasks first, yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.

Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?

RLAG rewards both answer accuracy and explanation rationality by cycling between augmented and unaugmented generation, progressively internalizing coherent knowledge structures. This outperforms SFT because it prioritizes reasoning quality over token-level correctness.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about pretraining vs. reward engineering in LLM reasoning. The question: does pretraining composition set harder ceilings than reward design can overcome?

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026 and converge on pretraining primacy:
• Spurious rewards activate reasoning strategies nearly as well as correct ones if pretraining was appropriate; RL improves sampling efficiency but doesn't expand capability boundary (2025).
• Within one epoch, RL locks onto a single dominant format already in pretraining; format winner depends on model scale, not necessarily on performance (2025).
• Negative reinforcement alone matches/beats full PPO/GRPO by preserving answer diversity pretraining produced, suggesting reward collapses modes rather than expanding them (2025).
• Natural-language critiques break RL plateaus scalar rewards cannot; reward signals are information-impoverished channels for changing model knowledge (2025).
• RLAG (rewarding explanation quality + augmented generation cycles) is the exception—it embeds new knowledge rather than activating old; requires moving beyond token-level correctness rewards (2025).

Anchor papers (verify; mind their dates):
• arXiv:2504.07912 (2025-04) — Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining.
• arXiv:2506.01347 (2025-06) — The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning.
• arXiv:2506.03106 (2025-06) — Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback.
• arXiv:2509.20162 (2025-09) — Embedding Domain Knowledge for Large Language Models via Reinforcement Learning from Augmented Generation.

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, especially the claim that RL doesn't expand the capability boundary, judge whether models released since mid-2025, new optimization methods (e.g., adaptive scheduling, multi-reward fusion), inference-time orchestration (memory, critique loops, agent iteration), or evaluation breakthroughs have since RELAXED or OVERTURNED it. Separate the durable question (does pretraining set a hard ceiling?) from the perishable limitation (current RL can't overcome it). Cite what relaxed it.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Has anyone shown RL embedding fundamentally new reasoning patterns, or breaking a pretraining ceiling?
(3) Propose 2 research questions that assume the regime may have shifted: e.g., do multi-turn critique loops or synthetic pretraining data injection during RL training phase actually expand, not just activate, capability?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
