INQUIRING LINE

Can group-relative normalization be modified to resist shortcut trajectories?

This explores whether GRPO-style group-relative advantage normalization — which scores each trajectory against the average of its sampled group — can be reshaped so it stops rewarding lucky shortcuts instead of genuine reasoning.


The corpus is unusually direct about where the problem comes from: when you train on problems that are nearly impossible for the model, the rare accidental success is scored as a huge positive advantage relative to its group of mostly failed siblings. The optimizer then enthusiastically reinforces whatever produced that fluke, which tends to be answer repetition and computation skipping rather than sound steps (see 'Do overly hard RLVR samples actually harm model capabilities?'). The shortcut isn't a side effect of normalization; it's normalization doing exactly what it's told on the wrong distribution of problems. That reframes your question: the most reliable 'modification' may be at the data layer (don't feed it problems where the only successes are accidental) rather than in the math of the advantage estimator itself.
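To make the scale of that effect concrete, here is a minimal sketch of the usual group-relative advantage (reward minus group mean, divided by group standard deviation) applied to binary rewards. The function names, the group size of 16, the `eps` term, and the `min_pass` threshold are illustrative assumptions, not values taken from the cited notes.

```python
import math

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its group's mean and population std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

def keep_group(rewards, min_pass=0.1):
    """Hypothetical data-layer filter: skip groups whose pass rate is so low
    that the few successes are likely to be accidents."""
    pass_rate = sum(rewards) / len(rewards)
    return pass_rate >= min_pass

# One fluke success among 16 samples on a near-impossible problem.
fluke_group = [1.0] + [0.0] * 15
# Eight successes among 16 samples on a moderately hard problem.
balanced_group = [1.0] * 8 + [0.0] * 8

fluke_adv = group_relative_advantages(fluke_group)[0]
balanced_adv = group_relative_advantages(balanced_group)[0]
```

Under these assumptions the lone success receives an advantage of about 3.87 (the square root of 15), while a success in the balanced group receives about 1.0. The filter drops the fluke group (pass rate 1/16) and keeps the balanced one, which is one simple way to act at the data layer rather than inside the estimator.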

The more interesting lateral move in the collection is changing *what the reward attaches to*. Shortcuts thrive when the only signal is the final outcome, because any path to the right answer looks equally good. Several methods convert that sparse outcome reward into dense, per-step signals derived from the structure of the trajectory: Tree-GRPO uses tree topology, Supervised RL leans on expert-aligned actions, and ToolPO keys off tool-call positions. The credit then lands on the reasoning moves rather than the lucky landing (see 'Can trajectory structure replace hand-annotated process rewards?'). This is the cleaner answer to 'can it be modified to resist shortcuts': you keep group-relative comparison but give it richer trajectory structure to compare, so a step-skipping path can no longer collect the same advantage as a worked one.
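As a rough illustration of how tree topology can yield step-level signals, the sketch below scores each branch against its siblings at the same parent. This is not the Tree-GRPO algorithm itself; the `Node` type, the mean-of-leaves value estimate, and the example tree are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Node:
    name: str                              # unique label for this step
    reward: Optional[float] = None         # outcome reward, set only on leaves
    children: List["Node"] = field(default_factory=list)

def leaf_rewards(node: Node) -> List[float]:
    """Collect the outcome rewards of all leaves below (or at) this node."""
    if not node.children:
        return [float(node.reward)]
    out: List[float] = []
    for child in node.children:
        out.extend(leaf_rewards(child))
    return out

def value(node: Node) -> float:
    """Estimate a step's value as the mean outcome of its descendant leaves."""
    leaves = leaf_rewards(node)
    return sum(leaves) / len(leaves)

def step_advantages(node: Node, out: Optional[Dict[str, float]] = None) -> Dict[str, float]:
    """Sibling-relative advantage: each child's value minus the sibling mean."""
    if out is None:
        out = {}
    if node.children:
        values = [value(c) for c in node.children]
        baseline = sum(values) / len(values)
        for child, v in zip(node.children, values):
            out[child.name] = v - baseline
            step_advantages(child, out)
    return out

# Two first steps from the same prompt, each expanded into three completions.
root = Node("prompt", children=[
    Node("worked_step", children=[
        Node("worked_a", 1.0), Node("worked_b", 1.0), Node("worked_c", 0.0)]),
    Node("skip_step", children=[
        Node("skip_lucky", 1.0), Node("skip_b", 0.0), Node("skip_c", 0.0)]),
])

adv = step_advantages(root)
```

In this toy tree the `skip_lucky` leaf still earns a positive advantage among its own siblings, but the `skip_step` move that led to it scores below its sibling `worked_step`, so a final-answer success no longer credits every step on the path equally.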

There's a quieter failure mode worth knowing about. RL post-training tends to collapse onto a single dominant format inherited from pretraining within the first epoch, and the winning format depends on model scale rather than necessarily on which format reasons best (see 'Does RL training collapse format diversity in pretrained models?'). A shortcut trajectory is, in a sense, just a degenerate format that won the collapse. So any normalization fix that doesn't also preserve diversity risks locking in whichever cheap pattern happened to dominate early. The resistance you want is partly about keeping the group genuinely varied, not just rescaling advantages within it.
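One hedged way to watch for this collapse is to track how varied the formats are inside each sampled group. The sketch below computes the Shannon entropy of per-trajectory format labels; the labels themselves (`code`, `prose`, and so on) and whatever classifier would assign them are hypothetical and not part of the cited notes.

```python
import math
from collections import Counter

def format_entropy(labels):
    """Shannon entropy, in bits, of the format labels within one sampled group."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# A group that has collapsed onto a single format.
collapsed = ["code"] * 8
# A group spread evenly over four formats.
varied = ["code", "code", "prose", "prose", "latex", "latex", "table", "table"]

collapsed_bits = format_entropy(collapsed)
varied_bits = format_entropy(varied)
```

A collapsed group scores zero bits and an even four-way split scores two bits, so a steady drop in this number across training steps would be an early signal that the group is no longer varied enough for within-group normalization to compare meaningfully different candidates.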

One caution the corpus raises against naive 'shortcut-proofing': the intuition that you fix shortcuts by stripping spurious cues doesn't always hold. In heuristic-override tasks, removing the misleading cues actually *hurts*, because the real difficulty is composing conflicting signals, not ignoring distractors (see 'Why does removing spurious cues sometimes hurt model performance?'). The lesson for reward design is that 'shortcut' and 'legitimate-but-cheap-looking reasoning' aren't always separable from the outside, so an aggressive penalty risks suppressing real capability. And if you'd rather widen the search than re-engineer the reward, sampling many parallel latent trajectories spreads exploration across the solution space without inflating variance, giving the group more honest candidates to normalize against (see 'Can reasoning systems scale wider instead of only deeper?').

The takeaway you might not have gone looking for: the corpus quietly votes against patching the normalization formula in isolation. The shortcut problem lives at the boundaries of the method — the difficulty of the training samples, the granularity of the reward, and the diversity of the sampled group — and that's where the corpus puts its fixes.


Sources (5 notes)

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Can trajectory structure replace hand-annotated process rewards?

Tree-GRPO, Supervised RL, and ToolPO each convert sparse outcome rewards into dense step signals by exploiting different structural features—tree topology, expert-aligned actions, and tool-call positions—eliminating the need for annotated process reward models.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Why does removing spurious cues sometimes hurt model performance?

Removing spurious cues degrades performance in heuristic override tasks, opposite to shortcut learning predictions. The failure mode is integrating conflicting signals rather than ignoring distractors—a frame problem, not feature selection.

Can reasoning systems scale wider instead of only deeper?

GRAM shows that stochastic latent transitions enabling parallel trajectory sampling sidestep the serial latency cost of depth-only scaling. Width matches token-level parallelism benefits: independent paths sample the solution space without variance inflation.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an RL researcher stress-testing group-relative normalization (GRPO-style) against shortcut learning in LLMs. The question remains open: can the normalization mechanism itself be modified to resist lucky-accident reward amplification, or must the fix live elsewhere (data, reward density, group diversity)?

What a curated library found — and when (findings span 2022–2026; treat as dated claims, not current truth):
• Shortcuts thrive under outcome-only rewards because any path to the right answer collects equal advantage; the problem is normalization doing exactly what it's told on a distribution where accidental successes dominate (2026-05).
• Dense, per-step rewards derived from trajectory structure (Tree-GRPO, expert alignment, tool-call signals) let credit land on reasoning moves, not lucky landings; group comparison works better with richer structure to compare (2025-09).
• RL post-training collapses onto a single dominant format by epoch 1, chosen by model scale not reasoning quality; any shortcut fix that doesn't preserve group diversity risks locking in whichever cheap pattern won the collapse (2025-04).
• Heuristic-override tasks show 'shortcut' and 'legitimate-but-cheap reasoning' aren't always separable from outside; aggressive shortcut penalties risk suppressing real capability (2026-03).
• Sampling parallel latent trajectories widens exploration without inflating variance, giving the group more honest candidates to normalize against (2025-02).

Anchor papers (verify; mind their dates):
• arXiv:2605.28388 (2026-05) — Sample difficulty in RLVR and shortcut induction
• arXiv:2509.21240 (2025-09) — Tree search for LLM agent RL
• arXiv:2504.07912 (2025-04) — Echo Chamber on format collapse in post-training
• arXiv:2603.29025 (2026-03) — Heuristic override vs. shortcut learning

Your task:
(1) RE-TEST THE CORE CONSTRAINT: Does the library's claim that "normalization formula alone won't fix shortcuts" still hold? Have recent advances in process-supervision, synthetic trajectory labeling, or reward-shaping tooling shifted the locus of the fix back into the estimator itself, or do the newest papers (last 6 months) still pinpoint data quality, reward density, and group diversity as the bottleneck?
(2) Surface any superseding work from the last ~6 months that contradicts the thesis that shortcuts are a *distribution* problem, not a *normalization math* problem.
(3) Propose two research questions that assume the regime may have moved: (a) If latent trajectory sampling + per-step rewards fully suppress shortcuts, what new failure mode emerges at scale? (b) Can you design a normalization variant that *detects* and *downweights* degenerate formats in-group without external diversity scaffolding?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
