INQUIRING LINE

Can early experience replace external rewards as a learning signal?

This explores whether agents can learn from the consequences of their own actions — treating future states as supervision — instead of relying on engineered external reward signals.


This explores whether 'early experience' (letting an agent learn from what happens after its own actions, rather than from hand-built rewards) can stand in for external reward signals. The corpus says yes, increasingly so, and the most direct evidence frames it as a genuine third option. One line of work positions early experience as a paradigm sitting between imitation learning (copying experts) and reinforcement learning (chasing rewards). Across eight environments, agents that use their own future states as supervision match expert-dependent baselines with half the data, and then serve as a stronger warm-start for later RL (see "Can agents learn from their own actions without external rewards?"). So it's not just a replacement; it's often a better foundation to build rewards on top of later.
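One way to picture 'future states as supervision' is as a data-collection loop: try actions, record what the environment did next, and train on those records. The sketch below is only an illustration under stated assumptions. The counter environment `toy_step`, the helper `collect_early_experience`, and the prompt/target string format are all hypothetical and are not taken from the cited work.

```python
from typing import Callable, List, Tuple

State = int
Action = str


def toy_step(state: State, action: Action) -> State:
    """Hypothetical deterministic environment: a counter moved by the action."""
    moves = {"inc": 1, "dec": -1, "noop": 0}
    return state + moves[action]


def collect_early_experience(
    states: List[State],
    actions: List[Action],
    step: Callable[[State, Action], State],
) -> List[Tuple[str, str]]:
    """Try every action in every state and record what happened next.

    Each example is (prompt, target): the prompt holds the state and the
    action, the target is the observed next state. No reward is consulted.
    """
    examples = []
    for s in states:
        for a in actions:
            next_state = step(s, a)
            examples.append((f"state={s} action={a}", f"next_state={next_state}"))
    return examples


# Illustrative usage: two start states, three candidate actions each.
examples = collect_early_experience([0, 5], ["inc", "dec", "noop"], toy_step)
for prompt, target in examples:
    print(prompt, "->", target)
```

The point of the sketch is that every (prompt, target) pair is a supervised example the agent produced for itself; a reward function never appears.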

The deeper reason this works is that an agent's own experience is *richer* than a scalar reward. A reward collapses everything into a single number, but the consequences of an action carry two separate kinds of information: how well it did (evaluative) and how it should change (directive). Scalar rewards throw the second one away (see "Can scalar rewards capture all the information in agent feedback?"). Once you keep that richer signal, you can convert raw environment feedback into dense, per-token learning gradients by letting the policy teach itself from retrospective evidence of its mistakes, which makes an external reward model unnecessary (see "Can environment feedback replace scalar rewards in policy learning?"). Natural-language critiques do something similar: they break through plateaus that numerical rewards can't, precisely because they explain *why* a failure happened (see "Can natural language feedback overcome numerical reward plateaus?").

There's an even more internal version of this idea: the signal doesn't have to come from the environment at all, but from the agent's own shifting beliefs. Tracking how much each action moves the model toward a solution, measured as the log-ratio of its successive probability estimates, yields a dense intrinsic reward with no critic and no process reward model. On 20 Questions, smaller models trained this way match or exceed larger baselines (see "Can an agent's own beliefs guide credit assignment without critics?"). This is the same insight pushed inward: the learning signal was latent in the agent's experience the whole time.
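The log-ratio idea is easy to make concrete. The sketch below assumes the agent reports, after each turn, a probability that it has the right answer, and scores each turn by the log-ratio of consecutive estimates. The function name `belief_shift_rewards` and the example numbers are hypothetical, and the cited method may differ in detail.

```python
import math
from typing import List


def belief_shift_rewards(p_solution: List[float]) -> List[float]:
    """Per-turn reward log(p_t / p_{t-1}) from the agent's own estimates.

    p_solution[t] is the model's probability of the correct answer after
    turn t (index 0 is the belief before any action). Values must lie in
    (0, 1]. Assumption: such estimates are available from the model.
    """
    if any(p <= 0.0 or p > 1.0 for p in p_solution):
        raise ValueError("probabilities must lie in (0, 1]")
    return [
        math.log(curr) - math.log(prev)
        for prev, curr in zip(p_solution, p_solution[1:])
    ]


# Made-up belief trajectory: a helpful turn, a misleading turn, a decisive turn.
beliefs = [0.05, 0.10, 0.08, 0.40]
rewards = belief_shift_rewards(beliefs)
print(rewards)       # positive, negative, positive
print(sum(rewards))  # equals log(0.40 / 0.05) up to rounding
```

Because the per-turn terms telescope, their sum equals the log-ratio of the final to the initial belief, so the dense signal stays consistent with the overall change while still assigning credit turn by turn.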

The cross-current worth knowing is that external rewards may have been doing less than we assumed anyway. Several notes argue that reward-based RL mostly *activates* capabilities already present from pretraining rather than teaching anything new: a single training example, or even spurious rewards, can trigger similar gains, and base models can outperform RLVR models at high sampling budgets (see "What does reward learning actually do to model reasoning?" and "Does RLVR actually expand what models can reason about?"). If reward signals are largely surfacing existing skills, then the bar for replacing them with experience is lower than it looks. The experience signal can also be shaped deliberately. One option is to process successes and failures differently, keeping successes as concrete demonstrations and failures as abstracted lessons (see "Should successful and failed episodes be processed differently?"). Another is to lean on negative examples alone, which can match full RL while preserving diversity (see "Does negative reinforcement alone outperform full reinforcement learning?").
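The 'high sampling budgets' comparison is usually reported as pass@k: the chance that at least one of k sampled answers is correct. Here is a minimal sketch of the standard unbiased estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The two 'models' and their counts are invented purely for illustration and do not come from any cited note.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c correct ones."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over problems, given the correct count per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical counts out of n=100 samples on two problems.
narrow = [50, 0]  # sharp on one problem, never solves the other
broad = [5, 5]    # rarely right, but can solve both

for k in (1, 64):
    print(k, mean_pass_at_k(narrow, 100, k), mean_pass_at_k(broad, 100, k))
```

With these made-up counts the narrow model leads at k=1 and the broad model leads at k=64, which is the shape of the crossover the notes describe between RLVR-tuned and base models.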

The honest caveat the corpus implies: 'replace' is too clean. Early experience tends to *precede and improve* reward-based training rather than abolish it, and RL still shows its own structured learning dynamics, first mastering execution and then strategic planning (see "Does RL training follow a predictable two-phase learning sequence?"). The interesting takeaway you didn't come looking for: the boundary between 'reward' and 'experience' is dissolving. When you keep the full texture of what an action led to, the reward isn't external anymore; it was inside the experience all along.


Sources (10 notes)

Can agents learn from their own actions without external rewards?

Research across eight environments shows that agents can use future states from their own actions as supervision without external rewards, matching expert-dependent baselines with half the data and providing superior warm-starts for subsequent RL training.

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Can environment feedback replace scalar rewards in policy learning?

SDPO converts tokenized environment feedback into dense gradient signals by using the feedback-conditioned policy as a self-teacher. The policy, when given retrospective evidence of its mistakes in-context, implicitly acts as its own process reward model, making external reward signals unnecessary.

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.

Can an agent's own beliefs guide credit assignment without critics?

ΔBelief-RL uses log-ratios of sequential probability estimates to assign per-turn credit without critic networks or process reward models. Tested on 20 Questions, smaller models trained this way matched or exceeded prior SOTA and larger baselines while generalizing beyond training.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

Should successful and failed episodes be processed differently?

SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.

Does negative reinforcement alone outperform full reinforcement learning?

Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing whether early experience can truly displace external rewards as a learning signal in LLM reasoning and control. The question remains: what architectural, training, or evaluation advances have shifted the regime since mid-2026?

What a curated library found — and when (dated claims, not current truth):
Findings span 2025–2026; treat all as perishable baseline claims:
• Early experience (agent's own future states) matches expert-dependent baselines with ~50% less data and serves as a superior warm-start for downstream RL, positioning it as a genuine third paradigm between imitation and RL (~2025–10).
• Reward signals decompose into evaluative and directive information; rich next-state feedback permits dense, per-token credit assignment without external reward models (~2025–06).
• Belief-shift (log-ratio of the model's own successive probability estimates) generates intrinsic reward signals; smaller models trained this way match or exceed larger baselines (~2026–01).
• Reward-based RL primarily activates pre-existing base-model capabilities rather than expanding reasoning boundaries; spurious rewards and single examples trigger similar gains (~2025–04, 2025–07).
• Negative reinforcement alone often matches full RL (PPO, GRPO) while preserving output diversity; RL exhibits two-phase dynamics (execution consolidation → strategic planning) (~2025–06, ~2026–02).

Anchor papers (verify; mind their dates):
• 2510.08558 (Oct 2025): Agent Learning via Early Experience
• 2507.14843 (Jul 2025): The Invisible Leash: Why RLVR May Not Escape Its Origin
• 2506.01347 (Jun 2025): The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning
• 2601.20802 (Jan 2026): Reinforcement Learning via Self-Distillation

Your task:
(1) RE-TEST EACH CONSTRAINT. For each finding above, judge whether newer model scaling (reasoning models, o3-class agents), multi-stage training pipelines, in-context learning, long-context memory, agentic orchestration (tool use, reflection loops), or newer evals have RELAXED or OVERTURNED the claim. Separate the durable question (Is early experience a viable foundation for reasoning?) from perishable limitations (e.g., performance gaps, data efficiency thresholds). Cite what resolved each constraint or where it still holds.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months—papers arguing external rewards remain indispensable, or that early experience alone hits a ceiling without reward shaping.
(3) Propose 2 research questions that ASSUME the regime may have shifted: e.g., does early experience scale to 100B+ parameter reasoning models? Can belief-shift signals sustain performance on long-horizon planning tasks that require model-based lookahead?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
