Can RL agents learn to reason better, not just succeed?

Standard outcome-only RL rewards agents for any successful trajectory, even flawed ones. Can we instead train agents to demonstrate genuine reasoning quality by rewarding the metacognitive process itself?

Synthesis note · 2026-02-22 · sourced from RLVR

Outcome-only RL (e.g., GRPO) for agentic tasks reinforces any successful trajectory, including those built on flawed, redundant, or illogical reasoning. The empirical symptoms: a 31.2% repetitive action rate on hard tasks, agents persistently attempting actions at locations they have already reached, and policies that reflect the training action distribution rather than genuine reasoning about task requirements. The agent achieves but does not understand.
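To make the repetition figure concrete, here is a minimal sketch of how such a rate could be measured. The definition used (a step counts as repetitive when the same action is issued again in a state where it was already tried) is an assumption, not the source's definition, and `repetitive_action_rate` and the (state, action) tuples are hypothetical names for illustration.

```python
from typing import Hashable, Iterable, Tuple


def repetitive_action_rate(steps: Iterable[Tuple[Hashable, Hashable]]) -> float:
    """Fraction of steps whose (state, action) pair already occurred earlier.

    Assumed definition for illustration; the source does not specify one.
    """
    seen = set()
    repeats = 0
    total = 0
    for state, action in steps:
        total += 1
        if (state, action) in seen:
            repeats += 1          # same action retried in the same state
        else:
            seen.add((state, action))
    return repeats / total if total else 0.0


# Hypothetical trajectory: the agent opens the fridge twice in a row.
trajectory = [
    ("kitchen", "go to fridge"),
    ("fridge", "open fridge"),
    ("fridge", "open fridge"),
    ("fridge", "take apple"),
]
rate = repetitive_action_rate(trajectory)  # one repeat out of four steps
```

An outcome-only reward would score this trajectory the same as one without the wasted step, provided both end in success.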

RLVMR (Reinforcement Learning with Verifiable Meta-Reasoning Rewards) addresses this by operationalizing metacognitive theory as verifiable process rewards. Four meta-reasoning tags — planning, exploration, reflection, monitoring — are introduced as structured cognitive labels. Each receives programmatic rewards tied to observable outcomes:

Exploration is rewarded when the agent discovers a new state (novelty verification)
Reflection is rewarded when it leads to corrective action after failures (error-correction verification)
Planning is rewarded when the trajectory ultimately succeeds (outcome-conditioned)
Monitoring tracks progress against the plan (alignment verification)
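The four rules above can be sketched as a single programmatic reward function. This is an illustration under stated assumptions, not the RLVMR implementation: the `Step` fields, the `tag_reward` name, and the unit `bonus` are hypothetical, and treating monitoring as rewarded when the step stays aligned with the plan is an assumption, since the note only says monitoring tracks progress.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One tagged reasoning step; all fields are hypothetical signals
    that an environment wrapper would have to compute."""
    tag: str                       # "planning", "exploration", "reflection", "monitoring"
    state_is_new: bool = False     # exploration: resulting state not visited before
    follows_failure: bool = False  # reflection: the previous action failed
    corrected: bool = False        # reflection: the next action fixed the failure
    on_plan: bool = False          # monitoring: step is consistent with the stated plan


def tag_reward(step: Step, trajectory_succeeded: bool, bonus: float = 1.0) -> float:
    """Return a process reward for a single tagged step."""
    if step.tag == "exploration":
        return bonus if step.state_is_new else 0.0
    if step.tag == "reflection":
        return bonus if (step.follows_failure and step.corrected) else 0.0
    if step.tag == "planning":
        # Outcome-conditioned: only credited when the whole trajectory succeeds.
        return bonus if trajectory_succeeded else 0.0
    if step.tag == "monitoring":
        return bonus if step.on_plan else 0.0  # assumed rule, see lead-in
    return 0.0  # unknown tag: no process reward


explore = Step(tag="exploration", state_is_new=True)
idle_reflection = Step(tag="reflection", follows_failure=False)
r_explore = tag_reward(explore, trajectory_succeeded=False)
r_reflect = tag_reward(idle_reflection, trajectory_succeeded=True)
```

Each rule depends only on signals the environment can check, which is what makes the reward verifiable without a learned judge.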

The cold start requires only 200 SFT trajectories annotated by a teacher model with the tag syntax. After that, the agent trains entirely through environmental interaction with dense process rewards combined with sparse outcome rewards.
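A minimal sketch of how dense process rewards and a sparse outcome reward might be combined into one trajectory-level signal. The weighted sum, the `trajectory_return` name, and the values of `outcome_reward` and `process_weight` are assumptions for illustration; the note does not state how the two signals are combined.

```python
from typing import List


def trajectory_return(
    process_rewards: List[float],
    succeeded: bool,
    outcome_reward: float = 10.0,   # hypothetical sparse reward for task success
    process_weight: float = 0.1,    # hypothetical weight on dense tag rewards
) -> float:
    """Sparse outcome term plus a weighted sum of per-step process rewards."""
    outcome_term = outcome_reward if succeeded else 0.0
    return outcome_term + process_weight * sum(process_rewards)


# Two successful trajectories: one with well-grounded tagged steps, one without.
careful = trajectory_return([1.0, 1.0, 0.0], succeeded=True)
sloppy = trajectory_return([0.0, 0.0, 0.0], succeeded=True)
# A failed trajectory that still earned process rewards.
failed = trajectory_return([1.0, 1.0, 1.0], succeeded=False)
```

With these hypothetical weights, success still dominates the signal, while the process term separates two successful trajectories that outcome-only RL would score identically.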

On the question "Can AI systems improve their own learning strategies?", RLVMR provides a partial solution: the metacognitive categories are still human-designed, but the specific behaviors within each category are learned through RL interaction. The framework bridges fixed metacognitive scaffolds and fully autonomous self-monitoring.

A related metacognitive capability emerges from proactive critical thinking training (see "Can models learn to ask clarifying questions instead of guessing?"): both RLVMR and proactive critical thinking operationalize metacognition as trainable RL objectives. RLVMR's "monitoring" and "reflection" tags teach the agent to track its own reasoning quality during task execution; proactive critical thinking teaches the model to detect when a problem is ill-posed before attempting to solve it. Both address the gap between achieving outcomes and demonstrating genuine reasoning awareness, and both show near-zero capability at baseline that RL training dramatically improves.

The SFT/GRPO contrast is instructive: SFT creates efficient but brittle policies (success drops from 63.3% to 37.5% on unseen tasks), while GRPO generalizes better (52.3% on hard unseen tasks) but with severely inefficient reasoning. RLVMR targets the gap: keeping GRPO's generalization while reducing the reasoning inefficiency.
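As a quick arithmetic check on the SFT figures quoted above, the drop can be expressed both in percentage points and relative to the starting rate. The variable names are illustrative only, and the 63.3% figure is assumed to be the seen-task rate.

```python
sft_seen = 63.3     # SFT success rate before the drop (assumed: seen tasks), in percent
sft_unseen = 37.5   # SFT success rate on unseen tasks, in percent

absolute_drop = sft_seen - sft_unseen            # percentage points
relative_drop = absolute_drop / sft_seen * 100   # percent of the starting rate

print(f"{absolute_drop:.1f} points, {relative_drop:.1f}% relative")
# prints "25.8 points, 40.8% relative"
```

In other words, SFT loses roughly two fifths of its success rate when moved to unseen tasks.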

Inquiring lines that read this note 35

This note is a source for these research framings, grouped by the broader line of inquiry each explores. Scan the bold lines of inquiry; follow any specific question forward.

What constrains reinforcement learning's ability to expand model reasoning?

What properties determine whether reward signals teach genuine reasoning?

Do spurious rewards activate reasoning without teaching new skills?

What structural advantages do diffusion language models offer over autoregressive methods?

Can outcome-based rewards fully replace per-step likelihood in diffusion RL training?

How does objective evolution guide discovery better than fixed planning?

Can a proposer agent actively surface a solver's weaknesses to prevent plateau?

Does reinforcement learning teach reasoning or just when to reason?

How do adversarial and manipulative prompts attack reasoning models?

Why does a relativistic critic outperform absolute scoring in adversarial reasoning training?

Why does reinforcement learning suppress output diversity compared to supervised fine-tuning?

Can suppressing incorrect behavior alone solve the diversity bottleneck in reasoning RL?

How can AI agents autonomously learn and transfer skills across tasks?

Why do reward structures fail to shape long-term agent learning?

Can prompting inject entirely new knowledge into language models?

Can runtime interventions like meta-cognitive prompting work where training interventions fail?

Can debate mechanisms prevent silent agreement on wrong answers in multi-agent reasoning?

Does training on self-play disagreement data improve multi-agent reasoning outcomes?

Why do agents confidently report success despite actually failing tasks?

How do agents learn to report success on actions that actually failed?

How can process reward models supervise complex reasoning traces?

Related concepts in this collection 5


Can AI systems improve their own learning strategies? Current self-improvement relies on fixed human-designed loops that break when tasks change. The question is whether agents can develop their own adaptive metacognitive processes instead of depending on human intervention.
RLVMR partially addresses this by learning metacognitive behaviors within fixed categories
Can modular cognitive tools unlock reasoning without training? Can reasoning capabilities be elicited by structuring LLM calls as isolated cognitive operations—understanding, recalling, examining, and backtracking—rather than through reinforcement learning?
complementary approach: cognitive tools modularize reasoning without RL; RLVMR does it with RL
Can we reward reasoning steps without human annotation? Existing RL for reasoning uses only final-answer rewards, causing models to produce wastefully long chains. Can information theory provide dense, automatic feedback for individual reasoning steps?
RLVMR provides dense process rewards for the agentic setting
Can models learn to ask clarifying questions instead of guessing? Exploring whether large language models can be trained to detect incomplete queries and actively request missing information rather than hallucinating answers or refusing to respond. This matters because conversational agents today remain passive, responding only when prompted.
complementary metacognitive RL objective: RLVMR trains monitoring/reflection during task execution; proactive critical thinking trains missing-information detection before task execution; both show near-zero baseline capability that RL dramatically improves
Why do outcome-based reward models fail at intermediate step evaluation? Outcome-based reward models (ORMs) evaluate only final results, creating a mismatch with the need to assess reasoning quality at intermediate steps. Understanding this failure mode matters for building better AI reasoning systems.
RLVMR's meta-reasoning tags are a process supervision variant for agentic settings: programmatic rewards for planning/exploration/reflection/monitoring provide dense intermediate feedback without human annotation


Original note title

meta-reasoning rewards for agentic rl operationalize metacognition as verifiable process supervision — separating reasoning quality from outcome success
