INQUIRING LINE

Why does prompting discover capabilities that need reward-driven refinement?

This explores a puzzle hiding inside two findings the corpus keeps circling: prompting can surface abilities a model already has, yet those same abilities often need reward training to become reliable — so what exactly does each step contribute?


This reads the question as: if prompting and reward training are both just *unlocking* what's already in the base model, why do we need both? The corpus answers with a sharp division of labor — prompting finds the door, reward training learns which door to open every time.

Start with the ceiling. Prompting is powerful but bounded: a single finite transformer is provably Turing-complete given the right prompt (Can a single transformer become universally programmable through prompts?), yet that same work notes ordinary training rarely produces models that actually run arbitrary prompted programs. And prompting can only rearrange what's already there: it activates existing knowledge but cannot inject anything absent from training (Can prompt optimization teach models knowledge they lack?). So prompting is a search through latent capability, not a way to add capability. Several independent lines confirm the capability is genuinely present to be found: base models already carry latent reasoning that minimal nudging unlocks, via at least five different mechanisms (Do base models already contain hidden reasoning ability?).

Here's the catch prompting hits: it's brittle and instance-specific. Zero-shot chain-of-thought only works when the question's meaning flows into the prompt structure before reasoning starts; for simple questions, step-by-step prompting actively hurts (Why do some questions perform better without step-by-step reasoning?). Prompting discovers the capability on *some* inputs but can't guarantee the model reaches for it on the right ones. That's precisely the gap reward training closes. RLVR doesn't expand what a model can solve (at high sampling budgets base models actually match or beat it); it sharpens the model toward solutions already in its distribution, raising the odds the latent skill fires on the first try (Does RLVR actually expand what models can reason about?; What does reward learning actually do to model reasoning?). Prompting reveals the capability exists; reward refinement makes it the default behavior rather than a lucky sample.
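To make the pass@k argument concrete, here is a minimal Python sketch. It uses the standard unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k), for n samples of which c are correct. The per-problem counts are invented for illustration and are not results from any of the notes; `base_counts` and `tuned_counts` are hypothetical names.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if n - c < k:
        # Fewer than k incorrect samples: any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


N = 100  # samples drawn per problem (assumed)

# Invented counts of correct samples per problem, out of N.
base_counts = [5, 5, 5, 5]      # base model: rarely right, but on every problem
tuned_counts = [60, 60, 60, 0]  # sharpened model: reliable on three, never on the fourth

for k in (1, 64):
    base = mean_pass_at_k(base_counts, N, k)
    tuned = mean_pass_at_k(tuned_counts, N, k)
    print(f"k={k:>2}  base={base:.3f}  tuned={tuned:.3f}")
```

Under these made-up counts the sharpened model is far ahead at k=1 and behind at k=64, which is the shape of the crossover the pass@k analysis describes: first-try reliability improves while the set of problems solvable at all does not grow.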

The reason refinement needs *reward* specifically, not just more prompting, is that you need a signal for which discovered behavior was good. The corpus shows that signal can come from surprisingly cheap sources: the model's own answer-confidence can rank reasoning traces and improve calibration with no human labels (Can model confidence work as a reward signal for reasoning?), and information-theoretic measures can score each step's contribution without annotation (Can we reward reasoning steps without human annotation?). Even better signals come from rewards that reason before scoring: generative step-wise judges that meta-reason about reasoning beat classifier-style reward models (Can judges that reason about reasoning outperform classifier rewards?), and adding chain-of-thought to the reward model itself raises its ceiling (Can reward models benefit from reasoning before scoring?).
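As a rough illustration of how answer-confidence could serve as a label-free reward signal, the sketch below ranks sampled traces by confidence over the answer span and turns the ranking into preference pairs. Everything here is an assumption made for illustration: the `Trace` type, the mean-token-probability score, and the `min_gap` threshold are hypothetical and are not taken from RLSF or any other cited method.

```python
import math
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Trace:
    reasoning: str
    answer: str
    answer_logprobs: list  # hypothetical: log-probabilities of the answer-span tokens


def answer_confidence(trace: Trace) -> float:
    """Mean token probability over the answer span (one possible scoring choice)."""
    probs = [math.exp(lp) for lp in trace.answer_logprobs]
    return sum(probs) / len(probs)


def preference_pairs(traces, min_gap=0.1):
    """Return (chosen, rejected) pairs whose confidence differs by at least min_gap."""
    ranked = sorted(traces, key=answer_confidence, reverse=True)
    pairs = []
    # combinations() preserves order, so the first element is always the more confident one.
    for chosen, rejected in combinations(ranked, 2):
        if answer_confidence(chosen) - answer_confidence(rejected) >= min_gap:
            pairs.append((chosen, rejected))
    return pairs


# Invented traces for a single question.
traces = [
    Trace("guess", "17", [-2.0, -1.5]),
    Trace("careful steps", "42", [-0.05, -0.1]),
    Trace("partial steps", "40", [-0.7, -0.9]),
]

for chosen, rejected in preference_pairs(traces):
    print(f"prefer {chosen.answer!r} over {rejected.answer!r}")
```

The resulting pairs could then feed any preference-based trainer; no human label or external verifier is involved at any point, which is the property the note highlights.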

The thing you might not have known you wanted to know: this isn't really prompting *versus* reward at all. Both are forms of elicitation, just at different time horizons. Prompting is a one-shot search you run at inference; reward training bakes the winning search strategy into the weights so you don't have to find it again every time. The genuinely *new* capability, the kind neither prompting nor RLVR can manufacture, comes only from distillation transferring patterns from a stronger model (Does RLVR actually expand what models can reason about?). And the same breadth-vs-depth logic recurs even within a single inference pass: allocating compute to diverse abstractions explores better than hammering one reasoning chain deeper (Can abstractions guide exploration better than depth alone?). Discovery and refinement are the same act at different scales.
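The "same act at different time horizons" idea can be sketched with a toy model. Below, a policy is a softmax over three fixed strategies, only one of which earns reward. Inference-time search is modeled as best-of-n sampling; reward training is modeled as exact policy-gradient ascent on expected reward (the expectation of a REINFORCE update, used here so the run is deterministic). The strategy probabilities, step count, and learning rate are all invented for illustration.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def best_of_n(p_success: float, n: int) -> float:
    """Chance that at least one of n independent samples succeeds."""
    return 1.0 - (1.0 - p_success) ** n


def sharpen(logits, rewards, steps=200, lr=1.0):
    """Gradient ascent on expected reward for a softmax policy.

    d E[r] / d logit_i = p_i * (r_i - E[r])
    """
    logits = list(logits)
    for _ in range(steps):
        probs = softmax(logits)
        expected = sum(p * r for p, r in zip(probs, rewards))
        logits = [x + lr * p * (r - expected)
                  for x, p, r in zip(logits, probs, rewards)]
    return logits


# Three strategies the toy model can already produce; only the third is rewarded.
initial_logits = [math.log(0.6), math.log(0.3), math.log(0.1)]
rewards = [0.0, 0.0, 1.0]

p_before = softmax(initial_logits)[2]
p_after = softmax(sharpen(initial_logits, rewards))[2]

print(f"single sample before training: {p_before:.3f}")
print(f"best-of-16 before training:    {best_of_n(p_before, 16):.3f}")
print(f"single sample after training:  {p_after:.3f}")
```

In this toy setup, search at inference and reward training both raise the chance of landing on the rewarded strategy, and neither adds a fourth strategy: training only moves probability mass among behaviors the policy could already produce.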


Sources 11 notes

Can a single transformer become universally programmable through prompts?

Research proves a single finite-size transformer exists that can compute any computable function given the right prompt, achieving complexity bounds nearly matching unbounded models. However, standard training rarely produces models that learn to implement arbitrary programs this way.

Can prompt optimization teach models knowledge they lack?

Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Why do some questions perform better without step-by-step reasoning?

Saliency analysis reveals that CoT prompting fails when question information doesn't aggregate into the prompt structure before reasoning begins. For simple questions, direct question-to-answer flow outperforms step-by-step reasoning, showing the optimal prompt depends on question type, not just task category.

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.

Can we reward reasoning steps without human annotation?

L2T uses PAC-Bayes bounds and Fisher information to compute per-episode rewards measuring each step's contribution to correctness. This annotation-free approach matches dense feedback quality while eliminating the cost of outcome-only methods that produce 2x excess tokens.

Can judges that reason about reasoning outperform classifier rewards?

StepWiser demonstrates that training judges to produce reasoning chains about policy reasoning—rather than classify steps—yields better judgment accuracy and data efficiency. Independent confirmation from GenPRM and ThinkPRM shows generative PRMs outperform discriminative ones with orders of magnitude less training data.

Can reward models benefit from reasoning before scoring?

Three independent teams (RRM, RM-R1, DeepSeek-GRM) discovered that adding chain-of-thought reasoning before reward scoring enables adaptive test-time compute scaling for evaluation. Reasoning-based approaches raise the capability ceiling of reward models beyond what outcome-based evaluation achieves.

Can abstractions guide exploration better than depth alone?

RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst tracking elicitation mechanisms in LLMs. The question remains open: why do prompting and reward training serve different functions—does one truly unlock latent capability while the other merely sharpens defaults, or has that boundary shifted?

What a curated library found—and when (dated claims, not current truth):
Findings span 2023–2025 and propose a strict division of labor:
• Prompting is capability search constrained by brittleness: zero-shot chain-of-thought fails on simple questions where step-by-step hurts; it activates latent skills but cannot guarantee they fire on the right inputs (~2024).
• RLVR (reinforcement learning with verifiable rewards) does NOT expand reasoning beyond the base model; at high sampling budgets base models match or beat it. RLVR instead raises the probability the latent skill becomes default behavior (~2025).
• Reward signals can be cheap and intrinsic: model confidence and information-theoretic step scoring work without human labels, and generative judges (that reason about reasoning) outperform classifier-style reward models (~2025).
• True new capability comes only from distillation from stronger models; neither prompting nor RLVR manufactures it (~2025).
• The breadth-vs-depth exploration trade-off recurs within single inference passes: allocating compute to diverse reasoning abstractions beats deeper iteration (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2504.13837 (Apr 2025) – Does RLVR Really Incentivize Reasoning Beyond the Base
• arXiv:2505.14674 (May 2025) – Reward Reasoning Model
• arXiv:2508.19229 (Aug 2025) – StepWiser: Stepwise Generative Judges
• arXiv:2512.01107 (Nov 2025) – Foundation Priors

Your task:
(1) RE-TEST EACH CONSTRAINT. For the claim "RLVR cannot expand capability beyond base models," check whether post-training methods, distillation pipelines, or hybrid elicitation (multi-agent orchestration, memory-augmented RL, intrinsic reward scaling) have since shown cumulative capability growth that RLVR alone missed. Separately, does the brittleness of prompting still hold, or have adaptive in-context learning and dynamic prompt routing now solved instance-specificity? Flag what still appears locked and what has moved.

(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. If any paper shows RLVR *does* expand boundaries, or prompting *can* reliably fire latent skills without reward, cite it plainly.

(3) Propose 2 research questions that ASSUME the regime may have moved: (a) If elicitation is now unified under a single mechanism (not divided labor), what would that mechanism be? (b) What would prove distillation-free, purely self-contained scaling of reasoning?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
