SYNTHESIS NOTE
Training, RL, and Test-Time Scaling · Agentic Systems and Tool Use

Why do correct code trajectories teach models to tolerate errors?

Explores why standard outcome-based RL fails for code tool use: when models receive reward for correct final answers despite intermediate code errors, they learn that mistakes are acceptable, producing poor reasoning quality.

Synthesis note · 2026-02-22 · sourced from Reward Models

When language models learn to use coding tools during RL training, the code environment introduces a specific form of noise that standard outcome-based RL cannot handle. The model inevitably generates syntactically or logically incorrect code during reasoning, which produces error messages and wastes tokens on correction. Under standard GRPO, which uses only outcome rewards, trajectories with failed intermediate tool calls still receive positive reward if the final answer is correct. The model therefore learns that code errors are acceptable and produces lengthy, low-quality reasoning trajectories with unnecessary error-correction loops.
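A minimal sketch of why outcome-only rewards cannot tell these trajectories apart. It assumes a common formulation of the group-relative advantage (reward standardized within the rollout group); the Rollout class and its tool_errors field are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Rollout:
    answer_correct: bool  # outcome check on the final answer
    tool_errors: int      # failed intermediate tool calls (hypothetical field)


def outcome_reward(r: Rollout) -> float:
    # Outcome-only reward: intermediate tool errors are invisible to it.
    return 1.0 if r.answer_correct else 0.0


def group_advantages(group: list[Rollout], eps: float = 1e-6) -> list[float]:
    # Group-relative advantage: reward standardized within the rollout group.
    rewards = [outcome_reward(r) for r in group]
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(x - mu) / (sigma + eps) for x in rewards]


group = [
    Rollout(answer_correct=True, tool_errors=0),   # clean success
    Rollout(answer_correct=True, tool_errors=5),   # success after five failed calls
    Rollout(answer_correct=False, tool_errors=1),
    Rollout(answer_correct=False, tool_errors=3),
]
advantages = group_advantages(group)

# The clean and the error-ridden success receive exactly the same advantage.
print(advantages[0] == advantages[1])
```

Because tool_errors never enters the reward, the policy gradient reinforces the five-error trajectory as strongly as the clean one.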

rStar2-Agent (2025) proposes GRPO-RoC (Resampling on Correct), which applies asymmetric filtering:

  1. Oversample — generate a larger group of rollouts than the standard batch size
  2. Filter positive trajectories — from correct-answer rollouts, retain only those with minimal tool-induced errors or formatting issues (the cleanest successes)
  3. Downsample negative trajectories uniformly — preserve diverse failure modes as informative negative signal
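The three steps above can be sketched as follows. This is an illustration under stated assumptions, not the rStar2-Agent implementation: the Rollout fields, the function name resample_on_correct, the hard "fewest errors first" cutoff, and the choice to keep the oversampled pool's positive/negative ratio are all hypothetical stand-ins for whatever selection rule the paper actually uses.

```python
import random
from dataclasses import dataclass


@dataclass
class Rollout:
    answer_correct: bool
    tool_errors: int    # failed tool calls in the trajectory (hypothetical field)
    format_issues: int  # formatting violations (hypothetical field)


def resample_on_correct(rollouts, group_size, rng=None):
    """Reduce an oversampled rollout pool to at most `group_size` trajectories."""
    if not rollouts:
        return []
    rng = rng or random.Random()
    positives = [r for r in rollouts if r.answer_correct]
    negatives = [r for r in rollouts if not r.answer_correct]

    # Assumption: keep roughly the positive/negative ratio of the pool.
    n_pos = min(len(positives), round(group_size * len(positives) / len(rollouts)))
    n_neg = min(len(negatives), group_size - n_pos)

    # Positives: quality filter, keeping the cleanest successes.
    cleanest = sorted(positives, key=lambda r: r.tool_errors + r.format_issues)
    kept_pos = cleanest[:n_pos]

    # Negatives: uniform downsampling with no quality ranking,
    # so diverse failure modes survive.
    kept_neg = rng.sample(negatives, n_neg)
    return kept_pos + kept_neg


# Oversample 8 rollouts, then reduce to a training group of 4.
pool = [Rollout(True, e, 0) for e in (0, 3, 1, 5)]
pool += [Rollout(False, e, 0) for e in (2, 0, 4, 1)]
batch = resample_on_correct(pool, group_size=4, rng=random.Random(0))
```

In this example the two retained successes are the ones with 0 and 1 tool errors, while the two retained failures are drawn uniformly regardless of their error counts.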

The asymmetry is deliberate. Positive trajectories need quality filtering because the model should learn from clean reasoning, not from rollouts that stumbled to the right answer despite multiple code crashes. Negative trajectories need diversity preservation because seeing many ways to fail is more informative than seeing one failure mode in depth.

This connects to "Does step-level confidence outperform global averaging for trace filtering?": both approaches recognize that not all correct trajectories are equally valuable for learning. It also extends "Does RL training follow a predictable two-phase learning sequence?": tool use is a procedural capability that must consolidate (clean tool usage) before strategic reasoning can effectively build on it.

The results are striking: a 14B model reaches frontier-level math reasoning in only 510 RL steps within one week (64 MI300X GPUs), achieving 80.6% on AIME24 and 69.8% on AIME25. That surpasses DeepSeek-R1 (671B) while producing significantly shorter responses. The training recipe starts with non-reasoning SFT (instruction following, code tool usage, and formatting only, with no reasoning enhancement) to avoid SFT overfitting, then applies multi-stage RL with increasing difficulty and maximum length.

Original note title

agentic rl with code tools requires asymmetric trajectory filtering because environment noise in correct trajectories teaches the model to tolerate errors