INQUIRING LINE

Can distillation methods extract directional guidance that scalar RL cannot access?

This explores whether token-level distillation can recover the 'how to change' signal that a single scalar reward throws away — and what the corpus says about why directional information slips through RL's fingers in the first place.


This question reads as: scalar RL collapses everything it learns into one number per outcome, so can distillation reach into feedback and pull out the *directional* part — the 'do it this way instead' — that a scalar can't represent? The corpus has a direct answer and a surprising amount of lateral support for it.

The cleanest statement comes from work showing that natural agent feedback splits into two orthogonal channels: *evaluative* (how well did that action go) and *directive* (how should it change) (see 'Can scalar rewards capture all the information in agent feedback?'). A scalar reward is built to carry the first and structurally cannot carry the second: 'good/bad by this much' has no slot for 'here's the corrected move.' Token-level distillation does have that slot, because it copies a *distribution over next tokens* rather than a single score, so it recovers the directional specifics the reward discarded. The two aren't competing; they're complementary, which reframes the whole question: distillation isn't a better RL, it's accessing a different axis of the same feedback.
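To make the 'no slot' point concrete, here is a minimal sketch for a single next-token decision. It assumes a plain policy-gradient update for the scalar case and cross-entropy to a teacher distribution for the distillation case; the three-token vocabulary, the logits, the teacher probabilities, and the function names are all hypothetical illustrations, not taken from the cited note.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def scalar_update_direction(student_logits, sampled_token, advantage):
    """Policy-gradient ascent direction on the logits: A * (onehot(a) - p)."""
    probs = softmax(student_logits)
    return [advantage * ((1.0 if i == sampled_token else 0.0) - p)
            for i, p in enumerate(probs)]

def distill_update_direction(student_logits, teacher_probs):
    """Descent direction for cross-entropy to the teacher: teacher - student."""
    probs = softmax(student_logits)
    return [t - p for t, p in zip(teacher_probs, probs)]

# Hypothetical 3-token vocabulary; the student favours token 0, which was wrong.
student_logits = [2.0, 0.0, 0.0]
scalar_dir = scalar_update_direction(student_logits, sampled_token=0, advantage=-1.0)
distill_dir = distill_update_direction(student_logits, teacher_probs=[0.05, 0.90, 0.05])

print("scalar :", [round(x, 3) for x in scalar_dir])
print("distill:", [round(x, 3) for x in distill_dir])
```

With these inputs the scalar direction lowers token 0 and raises tokens 1 and 2 by the same amount: it says 'not that' without saying which alternative. The distillation direction raises token 1 alone, which is the directive part of the feedback.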

Why would scalar RL leave so much on the table? Several notes suggest it's not a tuning problem but a structural one. RL updates only a sparse 5–30% slice of parameters, and consistently the same slice across seeds (see 'Does reinforcement learning update only a small fraction of parameters?'); it nudges a narrow subnetwork rather than re-teaching. It also collapses onto a single dominant pretrained format within the first epoch, suppressing alternatives (see 'Does RL training collapse format diversity in pretrained models?'). And when the reward signal is sparse or the problems are too hard, scalar RL doesn't just fail to learn: it learns *degenerate shortcuts* and amplifies them, because group-relative normalization treats a lucky correct answer as a high-advantage trajectory worth repeating (see 'Do overly hard RLVR samples actually harm model capabilities?'). A scalar can't tell 'right for the right reason' from 'right by accident'; directional supervision can.
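The amplification effect can be seen in a few lines of arithmetic. This is a rough sketch that assumes the common form of group-relative normalization (reward minus group mean, divided by group standard deviation); the population standard deviation, the `eps` constant, the group size of 8, and the 0/1 rewards are illustrative assumptions rather than details from the cited note.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own sampling group."""
    mean = statistics.fmean(rewards)
    spread = statistics.pstdev(rewards)  # population std; an assumption here
    return [(r - mean) / (spread + eps) for r in rewards]

# Hypothetical groups of 8 rollouts with binary verifier rewards.
lucky_group = [0, 0, 0, 0, 0, 0, 0, 1]   # near-impossible problem, one accidental hit
solid_group = [0, 0, 0, 0, 1, 1, 1, 1]   # manageable problem, half the rollouts succeed

lucky_adv = group_relative_advantages(lucky_group)
solid_adv = group_relative_advantages(solid_group)

print("advantage of the single lucky success:", round(lucky_adv[-1], 3))
print("advantage of a success in the solid group:", round(solid_adv[-1], 3))
```

Under these assumptions the lone success in the nearly impossible group gets a much larger advantage than any success in the manageable group, even though nothing in the reward says whether the reasoning behind it was sound.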

That's exactly the gap several methods close by smuggling directional signal back in. Adaptive guidance hands the model partial ground-truth solution traces on hard problems instead of waiting for a reward to materialize (see 'Can adaptive guidance from solution traces reduce reward sparsity in RL?'). Process supervision derived from trajectory *structure* (tree topology, expert-aligned actions, tool-call positions) converts a single outcome reward into dense per-step signal without any annotated reward model (see 'Can trajectory structure replace hand-annotated process rewards?'). Both are doing the directive job: not scoring the outcome, but pointing at the next move. Proxy-tuning makes the case sharpest: by shifting the output distribution at decoding time and leaving base weights untouched, it closes 88–91% of the alignment gap while *beating* direct fine-tuning on knowledge tasks, because direct weight updates corrupt lower-layer knowledge storage that a distributional shift never touches (see 'Can decoding-time tuning preserve knowledge better than weight fine-tuning?').
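As an illustration of what 'shifting the output distribution at decoding time' can look like, the sketch below assumes the logit-offset form usually associated with proxy-tuning: the large base model's logits plus the difference between a small tuned model and its untuned counterpart. The note itself only states that the distribution is shifted and the base weights are untouched, so the formula, the function names, and the toy logits are assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def proxy_tuned_distribution(base_logits, tuned_small_logits, untuned_small_logits):
    """Shift the base model's next-token logits by (tuned - untuned); no weights change."""
    shifted = [b + (t - u) for b, t, u in
               zip(base_logits, tuned_small_logits, untuned_small_logits)]
    return softmax(shifted)

# Hypothetical next-token logits over a 3-token vocabulary.
base = [1.0, 1.0, 0.0]            # large untuned model is undecided between tokens 0 and 1
tuned_small = [0.0, 2.0, 0.0]     # small tuned model prefers token 1
untuned_small = [0.0, 0.0, 0.0]   # small untuned model has no preference

probs = proxy_tuned_distribution(base, tuned_small, untuned_small)
print([round(p, 3) for p in probs])
```

The offset is itself a per-token direction, which is why it fits the 'instruction' side of the split: it says which tokens to favour at this step, and it disappears entirely when the tuned and untuned small models agree.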

The thing worth carrying away: the gap isn't that RL is weak and distillation is strong. It's that 'reward' and 'demonstration' are different *kinds* of information. A scalar is a verdict; a distribution is an instruction. The frontier methods here all win by recovering the instruction — and the honest caveat from the corpus is that distillation inherits whatever bias lives in the traces it copies, so a directive signal is only as trustworthy as its source.


Sources (7 notes)

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL induces intrinsic parameter sparsity of 5–30% without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Can adaptive guidance from solution traces reduce reward sparsity in RL?

GHPO dynamically provides ground-truth solution traces for hard problems while using standard RL for manageable ones, achieving 5% gains across math benchmarks. This converts wasted compute on impossible problems into learning signal by leveraging traces already present in training data.

Can trajectory structure replace hand-annotated process rewards?

Tree-GRPO, Supervised RL, and ToolPO each convert sparse outcome rewards into dense step signals by exploiting different structural features—tree topology, expert-aligned actions, and tool-call positions—eliminating the need for annotated process reward models.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing the boundary between scalar RL and directional distillation in LLM training. The precise question: can distillation extract guidance—the 'do it this way'—that scalar rewards structurally cannot represent?

What a curated library found — and when (dated claims, not current truth): Findings span 2024–2026.
• Evaluative and directive information are orthogonal channels; scalar rewards carry only the first, and token-level distillation recovers the directional specifics they discard (2024).
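• RL updates only 5–30% of parameters, in sparse but nearly full-rank subnetworks that are nearly identical across seeds; separately, RL amplifies one pretrained format within the first epoch while collapsing alternatives (2025).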
• Hard problems induce degenerate shortcuts; group-relative normalization amplifies them because scalar rewards cannot distinguish 'right for the right reason' from 'right by accident' (2026).
• Adaptive guidance (partial solution traces) and process supervision (structural features of trajectories) close the directive gap without annotated reward models (2025).
• Proxy-tuning at decode time closes 88–91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks, suggesting distributional shifts avoid the lower-layer knowledge corruption caused by direct weight updates (2025).

Anchor papers (verify; mind their dates):
• arXiv:2505.11711 — Reinforcement Learning Finetunes Small Subnetworks (May 2025)
• arXiv:2507.10628 — GHPO: Adaptive Guidance for Stable RL (Jul 2025)
• arXiv:2605.28388 — Sample Difficulty & Degenerate Shortcuts in RLVR (May 2026)
• arXiv:2504.07912 — Echo Chamber: RL Amplifies Pretraining Behaviors (Apr 2025)

Your task:
(1) RE-TEST each constraint: For every claim above, assess whether newer training regimes (scaling, curriculum, mixture-of-experts), inference-time methods (sampling, caching, multi-turn rollout), or finer reward models have since relaxed or collapsed the scalar/directional divide. Separate the durable question (likely: 'can a single number ever encode structure?') from the perishable limitation (possibly: 'current RL architectures waste parameter budget').
(2) Surface the strongest work from the last ~6 months that contradicts the evaluative/directive split or shows scalar RL recovering directional signal post-hoc.
(3) Propose 2 research questions that assume the regime may have shifted—e.g., can hybrid reward models recover directionality without distillation? Does test-time compute relax the scalar bottleneck?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
