INQUIRING LINE

Is reward propagation in RL formally dual to cause inference in memory?

This explores whether two processes — propagating reward backward through an RL trajectory (credit assignment) and reconstructing why an outcome happened (cause inference over stored experience) — are formally mirror images of each other, or just superficially similar.


The honest answer is that no paper in the corpus proves a formal duality between reward propagation in RL and cause inference in memory. But several push toward something more interesting than a yes/no: under the right framing the two processes seem to *collapse into each other*, which is a stronger claim than mere analogy.

The clearest pivot is AgentFly's Memory-augmented MDP, where credit assignment, the heart of reward propagation, is performed *entirely through memory operations* rather than gradient updates (see "Can agents learn continuously from experience without updating weights?"). If you can assign credit by reading and writing memory, then the machinery that propagates reward and the machinery that infers what caused an outcome are not dual systems sitting on opposite sides of a mirror; they are the same operation viewed from two ends. Reinforcing that, there is a mathematical result that RL agents *spontaneously* turn their environment into external memory: environmental artifacts reduce the information an agent needs to represent its own history, so memory-like structure falls out of standard reward optimization without anyone asking for it (see "Do RL agents accidentally use environments as memory?"). That is a one-directional bridge from reward toward memory, not a symmetric duality, but it shows the two are mechanically entangled.
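To make "credit assignment through memory operations" concrete, here is a minimal sketch. It is not AgentFly's implementation: the `CaseMemory` class, its method names, and the running-mean scoring rule are all hypothetical, chosen only to show that a write can assign credit and a read can improve the policy without any gradient step.

```python
from collections import defaultdict


class CaseMemory:
    """Hypothetical case bank: (state, action) -> running mean of outcomes."""

    def __init__(self):
        self.total = defaultdict(float)
        self.count = defaultdict(int)

    def write(self, trajectory, outcome):
        # Credit assignment as a memory write: every (state, action) pair
        # in the trajectory is credited with the episode's final outcome.
        for state, action in trajectory:
            self.total[(state, action)] += outcome
            self.count[(state, action)] += 1

    def value(self, state, action):
        # Mean outcome of stored cases; unseen pairs default to 0.0.
        n = self.count[(state, action)]
        return self.total[(state, action)] / n if n else 0.0

    def read(self, state, actions):
        # Policy improvement as a memory read: pick the action whose
        # stored cases had the best mean outcome.
        return max(actions, key=lambda a: self.value(state, a))


memory = CaseMemory()
memory.write([("s0", "left"), ("s1", "jump")], outcome=0.0)
memory.write([("s0", "right"), ("s2", "jump")], outcome=1.0)
best = memory.read("s0", ["left", "right"])
```

In this toy setting the same table answers both questions: `read` says what to do next, and inspecting which stored cases back a value says which past steps were associated with the outcome.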

Where the supposed duality actually *breaks* is the most revealing part. A scalar reward is a lossy projection of causal structure: it tells you *that* a trajectory succeeded or failed, but discards *why*. Critique-GRPO shows that models stuck on numerical-reward plateaus can produce correct solutions once they receive chain-of-thought critiques, which is direct evidence that reward signals carry less causal information than the explanations memory can hold (see "Can natural language feedback overcome numerical reward plateaus?"). If reward and cause inference were truly dual, no information would be lost in the mapping. The fact that language feedback adds something reward cannot encode means the 'duality' is at best a lossy projection, not an isomorphism.
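The "lossy projection" point can be illustrated with a toy example. The `Episode` type, the `failure_cause` labels, and the binary `scalar_reward` function below are assumptions made for illustration; they are not taken from Critique-GRPO. The sketch only shows that distinct causes can map to the same scalar, so the cause cannot be recovered from the reward alone.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Episode:
    actions: Tuple[str, ...]
    failure_cause: Optional[str]  # hypothetical label; None means success


def scalar_reward(ep: Episode) -> float:
    # Outcome-only reward: records that the episode failed, not why.
    return 1.0 if ep.failure_cause is None else 0.0


episodes = [
    Episode(("plan", "search", "answer"), None),
    Episode(("plan", "search", "answer"), "misread the question"),
    Episode(("plan", "compute", "answer"), "arithmetic slip"),
]

rewards = [scalar_reward(ep) for ep in episodes]

# Group the causes that collapse onto each reward value.
causes_by_reward = {}
for ep, r in zip(episodes, rewards):
    causes_by_reward.setdefault(r, set()).add(ep.failure_cause)
```

Here `causes_by_reward[0.0]` holds two different failure causes, so the map from cause to reward is not invertible. A critique stored alongside the episode would keep exactly the information the scalar drops.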

The asymmetry shows up again in how good systems treat success versus failure. SkillRL deliberately stores successful episodes as concrete demonstrations but failures as *abstracted lessons*, a causal compression that asks 'what went wrong and why,' which is exactly cause inference, not reward propagation (see "Should successful and failed episodes be processed differently?"). Reward propagation is symmetric over success and failure (both just adjust value estimates); cause inference is not. That broken symmetry is itself an argument against formal duality.
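A small sketch of that contrast, under stated assumptions: `value_update` is a generic incremental value update, and `consolidate` is a hypothetical stand-in for asymmetric memory consolidation. The dictionary fields (`success`, `steps`, `diagnosis`) are invented for illustration and are not SkillRL's data format.

```python
def value_update(value, reward, lr=0.5):
    # Reward propagation: one rule for both outcomes. Success and failure
    # differ only in the sign of the error term.
    return value + lr * (reward - value)


def consolidate(episode):
    # Cause inference: the two outcomes take different code paths.
    if episode["success"]:
        # Successes are kept verbatim as demonstrations.
        return {"kind": "demonstration", "content": episode["steps"]}
    # Failures are compressed into an abstract lesson about the cause.
    return {"kind": "lesson", "content": episode["diagnosis"]}


success_ep = {"success": True, "steps": ["open door", "take key"],
              "diagnosis": None}
failed_ep = {"success": False, "steps": ["open door", "drop key"],
             "diagnosis": "dropped the key before reaching the lock"}

up = value_update(0.0, 1.0)     # moves the estimate toward +1
down = value_update(0.0, -1.0)  # same rule, moves it toward -1
stored = [consolidate(success_ep), consolidate(failed_ep)]
```

The value update is mirror-symmetric (`up == -down`), while the stored records differ in kind, not just in sign.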

So the corpus's surprise, if you came in expecting a clean mathematical correspondence: the relationship is real but *directional and lossy*. Reward optimization manufactures memory for free, but memory carries causal content that scalar reward signals discard, which is why the field keeps bolting reasoning and critique onto reward models to recover the missing 'why' (see "Can reward models benefit from reasoning before scoring?"). The interesting research question isn't 'are they dual?' but 'what does reward propagation systematically lose that cause inference in memory recovers?' The answer the corpus gives is: the explanation.


Sources (5 notes)

Can agents learn continuously from experience without updating weights?

AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.

Do RL agents accidentally use environments as memory?

Mathematical proof shows that environmental artifacts reduce information needed to represent history in RL agents. Path-following agents naturally develop memory-like behavior through standard reward optimization, satisfying situated cognition criteria without explicit memory objectives.

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.

Should successful and failed episodes be processed differently?

SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.

Can reward models benefit from reasoning before scoring?

Three independent teams (RRM, RM-R1, DeepSeek-GRM) discovered that adding chain-of-thought reasoning before reward scoring enables adaptive test-time compute scaling for evaluation. Reasoning-based approaches raise the capability ceiling of reward models beyond what outcome-based evaluation achieves.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an RL theory researcher tasked with re-testing a claim about reward propagation and cause inference. The question remains open: *Is reward propagation in RL formally dual to cause inference in memory?*

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026; treat as perishable constraints:
- Credit assignment can be performed entirely through memory operations rather than gradient updates, collapsing reward propagation and cause inference into a single operation (2024–2025).
- Reward signals are lossy projections of causal structure: chain-of-thought critiques break through numerical-reward plateaus, suggesting language feedback carries causal information reward cannot encode (~2025, Critique-GRPO).
- RL agents spontaneously manufacture external memory structures; reward optimization reduces representational burden, making memory fall out 'for free' — a one-directional bridge, not symmetric duality (2024–2025).
- Successful episodes are stored as demonstrations; failures as abstracted causal lessons. This asymmetry (reward treats both symmetrically) argues against formal duality (~2025, SkillRL).
- The field repeatedly bolts reasoning and critique onto reward models to recover explanatory 'why' that scalar reward discards (2025–2026).

Anchor papers (verify; mind their dates):
- arXiv:2506.03106 (Critique-GRPO, 2025-06): Natural language + numerical feedback breaks performance plateaus.
- arXiv:2605.12978 (Useful Memories, 2026-05): Continuous LLM updates corrupt memory fidelity.
- arXiv:2604.08756 (Artifacts as Memory, 2026-04): Memory extends beyond agent boundary.
- arXiv:2505.14674 (Reward Reasoning Model, 2025-05): Reward evaluation via reasoning.

Your task:
(1) RE-TEST THE DUALITY CLAIM. For each constraint above—especially the claim that language feedback adds causal content reward cannot—check whether newer ablations, model scales (post-2026), multi-agent scaffolding, or verifiable reasoning methods have either (a) proven reward and explanation *are* formally isomorphic under the right representation, or (b) deepened the asymmetry. State plainly which constraints hold and which have shifted.
(2) Surface the strongest *contradicting* work from the last 6 months arguing reward *alone* is sufficient for cause inference, or that the duality *is* formal.
(3) Propose 2 research questions that assume the regime may have moved: e.g., 'Can verifiable meta-reasoning (arXiv:2507.22844) bridge the lossy gap?' or 'Do multi-agent orchestrations treat success/failure symmetrically despite memory doing otherwise?'

Cite arXiv IDs; flag anything you cannot ground in a real paper.
