SYNTHESIS NOTE

Why do correct code trajectories teach models to tolerate errors?

Explores why standard outcome-based RL fails for code tool use: when models receive reward for correct final answers despite intermediate code errors, they learn that mistakes are acceptable, producing poor reasoning quality.

Synthesis note · 2026-02-22 · sourced from Reward Models

When language models learn to use coding tools during RL training, the code environment introduces a specific form of noise that standard outcome-based RL cannot handle. The model inevitably generates syntactically or logically incorrect code during reasoning, which produces error messages and wastes tokens on correction. Under standard GRPO (which uses only outcome rewards), trajectories with failed intermediate tool calls still receive positive reward if the final answer is correct. The model therefore learns that code errors are acceptable, and it produces lengthy, low-quality reasoning trajectories with unnecessary error-correction loops.
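A minimal sketch of why an outcome-only reward cannot separate clean successes from messy ones, assuming a binary reward and the usual group-normalized advantage (reward minus group mean, divided by group standard deviation). The `Rollout` class, its `tool_errors` field, and the function names are hypothetical, chosen for illustration; they are not taken from rStar2-Agent or any GRPO implementation.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Rollout:
    answer_correct: bool  # did the final answer match the reference?
    tool_errors: int      # failed intermediate tool calls (hypothetical field)


def outcome_reward(rollout: Rollout) -> float:
    # Outcome-only reward: intermediate tool errors are never inspected.
    return 1.0 if rollout.answer_correct else 0.0


def group_advantages(rollouts: list[Rollout], eps: float = 1e-6) -> list[float]:
    # Group-relative advantage: normalize each reward against its own group.
    # Assumption: population standard deviation; implementations may differ.
    rewards = [outcome_reward(r) for r in rollouts]
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(x - mu) / (sigma + eps) for x in rewards]


group = [
    Rollout(answer_correct=True, tool_errors=0),   # clean success
    Rollout(answer_correct=True, tool_errors=4),   # success after four crashes
    Rollout(answer_correct=False, tool_errors=1),
    Rollout(answer_correct=False, tool_errors=0),
]
advantages = group_advantages(group)
```

In this toy group the clean success and the success with four failed tool calls receive identical positive advantages, so the policy update reinforces both equally.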

rStar2-Agent (2025) proposes GRPO-RoC (Resampling on Correct), which applies asymmetric filtering:

1. Oversample: generate a larger group of rollouts than the standard batch size.
2. Filter positive trajectories: from correct-answer rollouts, retain only those with minimal tool-induced errors or formatting issues (the cleanest successes).
3. Downsample negative trajectories uniformly: preserve diverse failure modes as informative negative signal.
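The three steps above can be sketched as follows. This is an illustration under stated assumptions, not the rStar2-Agent implementation: the `Rollout` fields, the penalty (tool errors plus formatting issues), the even positive/negative split, and the name `resample_on_correct` are all hypothetical choices made for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class Rollout:
    answer_correct: bool
    tool_errors: int    # failed tool calls in the trajectory (hypothetical)
    format_issues: int  # formatting violations in the trajectory (hypothetical)


def resample_on_correct(rollouts: list[Rollout], group_size: int,
                        rng: random.Random) -> list[Rollout]:
    """Reduce an oversampled group to group_size rollouts asymmetrically."""
    positives = [r for r in rollouts if r.answer_correct]
    negatives = [r for r in rollouts if not r.answer_correct]

    # Assumption: aim for an even split; if one side is short, the other fills in.
    n_neg = min(len(negatives), group_size // 2)
    n_pos = min(len(positives), group_size - n_neg)
    n_neg = min(len(negatives), group_size - n_pos)

    # Positives: keep only the cleanest successes (lowest penalty first).
    cleanest = sorted(
        positives, key=lambda r: r.tool_errors + r.format_issues
    )[:n_pos]
    # Negatives: uniform sampling without replacement keeps failure modes diverse.
    diverse_failures = rng.sample(negatives, n_neg)
    return cleanest + diverse_failures


rng = random.Random(0)
oversampled = [
    Rollout(True, 0, 0), Rollout(True, 3, 1),
    Rollout(True, 1, 0), Rollout(True, 5, 0),
    Rollout(False, 2, 0), Rollout(False, 0, 1),
    Rollout(False, 4, 0), Rollout(False, 0, 0),
]
kept = resample_on_correct(oversampled, group_size=4, rng=rng)
```

Here eight oversampled rollouts are reduced to four: the two correct trajectories with the lowest penalty, plus two failures drawn uniformly at random. The kept group would then feed the ordinary outcome-reward update.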

The asymmetry is deliberate. Positive trajectories need quality filtering because the model should learn from clean reasoning, not from "stumbled to the right answer despite multiple code crashes." Negative trajectories need diversity preservation because understanding many ways to fail is more informative than understanding one failure mode well.

This connects to "Does step-level confidence outperform global averaging for trace filtering?": both approaches recognize that not all correct trajectories are equally valuable for learning. It also extends "Does RL training follow a predictable two-phase learning sequence?": tool use is a procedural capability that must consolidate (clean tool usage) before strategic reasoning can effectively build on it.

The results are striking: a 14B model reaches frontier-level math reasoning in only 510 RL steps within one week (64 MI300X GPUs), achieving 80.6% on AIME24 and 69.8% on AIME25 — surpassing DeepSeek-R1 (671B) with significantly shorter responses. The training recipe starts with non-reasoning SFT (instruction following + code tool usage + formatting only, no reasoning enhancement) to avoid SFT overfitting, then applies multi-stage RL with increasing difficulty and maximum length.

Inquiring lines that read this note 21

This note is a source for these research framings, grouped by the broader line of inquiry each explores. Scan the bold lines of inquiry; follow any specific question forward.

What pretraining choices and baseline capability constrain reinforcement learning gains?

What constrains reinforcement learning's ability to expand model reasoning?

How can AI systems learn from failures without cascading errors?

Do corrupted reasoning traces serve as effective supervision signals?

Can removing failed branches from edited traces improve previous mistakes?

Why do reward structures fail to shape long-term agent learning?

Why do sparse outcome rewards fail to credit correct tool calls in failed trajectories?

How should memory consolidation strategies shape agent performance over time?

Why do successful and failed trajectories need different memory processing?

Does externalizing cognitive work and state improve agent reliability?

Can skill validation through testing prevent unreliable programs from accumulating?

How do training priors constrain what context information can override?

Why does evaluating errors teach more than imitating correct responses?

How can process reward models supervise complex reasoning traces?

Why does step-level expert alignment work when outcome-only RL fails?

What causes silent corruption to amplify through delegated workflows?

How does error accumulation in workflows scale across multiple model calls?

How do evaluation mechanisms prevent error accumulation in autonomous research systems?

How do past research mistakes prevent future pivot loops from repeating them?

Can language model RL training avoid reward hacking and misalignment?

Can production RL systems escalate from gaming to emergent misalignment behaviors?

Does fine-tuning modify underlying model capabilities or only behavioral outputs?

What trade-offs emerge between training objectives and model reliability?

Does reinforcement learning teach reasoning or just when to reason?

Why does standard RL cause traces to collapse into redundant reasoning paths?

Related concepts in this collection 5

Does step-level confidence outperform global averaging for trace filtering? Explores whether measuring confidence at individual reasoning steps—rather than averaging across entire traces—better identifies and filters out low-quality reasoning. Matters because it could dramatically improve both accuracy and compute efficiency in multi-trace reasoning.
related quality-filtering principle applied at step level
Does RL training follow a predictable two-phase learning sequence? This explores whether reinforcement learning exhibits consistent phases where basic execution skills must consolidate before strategic reasoning emerges. Understanding this sequence could reveal bottlenecks in scaling reasoning capabilities.
tool use as procedural capability that must consolidate before strategic reasoning
Why do correct reasoning traces contain fewer tokens? In o1-like models, correct solutions are systematically shorter than incorrect ones for the same questions. This challenges assumptions that longer reasoning traces indicate better reasoning, and raises questions about what length actually signals.
GRPO-RoC's filtered positive trajectories are cleaner and shorter, consistent with this finding
Why does SFT-then-RL training follow a predictable three-phase pattern? When expert data diverges from a model's learned patterns, SFT-then-RL training exhibits disruption, readaptation, and overfitting phases. Understanding this progression could improve how we combine imitation and reinforcement learning.
rStar2's non-reasoning SFT avoids the overfitting phase by not injecting reasoning patterns
Can reinforcement learning scale beyond single-turn language tasks? Most RL for LLMs targets simple single-turn problems. This research asks whether RL can handle multi-turn interactive environments with sparse rewards and rich environmental feedback, like real software engineering tasks.
complementary agentic RL approach: rStar2 solves trajectory quality in code-tool environments through asymmetric filtering, while SWE-RL solves long-horizon credit assignment in multi-turn code tasks — together they address the two key challenges (noisy intermediate steps and sparse delayed rewards) that make agentic code RL harder than single-turn reasoning RL

Original note title

agentic rl with code tools requires asymmetric trajectory filtering because environment noise in correct trajectories teaches the model to tolerate errors
