Is reward propagation in RL formally dual to cause inference in memory?
This explores whether two processes — propagating reward backward through an RL trajectory (credit assignment) and reconstructing why an outcome happened (cause inference over stored experience) — are formally mirror images of each other, or just superficially similar.
The honest answer is that no paper in the corpus proves a formal duality. But several push toward something more interesting than a yes/no: under the right framing the two processes seem to *collapse into each other*, which is a stronger claim than mere analogy.
The clearest pivot is AgentFly's Memory-augmented MDP, where credit assignment, the heart of reward propagation, is performed *entirely through memory operations* rather than gradient updates (see "Can agents learn continuously from experience without updating weights?"). If you can assign credit by reading and writing memory, then the machinery that propagates reward and the machinery that infers what caused an outcome are not dual systems sitting on opposite sides of a mirror; they are the same operation viewed from two ends. Reinforcing that, there is a mathematical result that RL agents *spontaneously* use their environment as external memory: environmental artifacts reduce the information an agent needs to represent its own history, so memory-like behavior falls out of standard reward optimization without any explicit memory objective (see "Do RL agents accidentally use environments as memory?"). That is a one-directional bridge from reward toward memory, not a symmetric duality, but it shows the two are mechanically entangled.
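To make "credit assignment through memory operations" concrete, here is a minimal sketch. It is not AgentFly's implementation: the `CaseMemory` class, its methods, and the discounting scheme are all hypothetical, chosen only to show that writing returns into a store and reading averages back out can stand in for a value update, with no weights involved.

```python
from collections import defaultdict


class CaseMemory:
    """Hypothetical episodic store: credit lives in memory entries, not weights."""

    def __init__(self):
        # (state, action) -> [sum of discounted returns, visit count]
        self._stats = defaultdict(lambda: [0.0, 0])

    def write(self, trajectory, final_reward, gamma=0.9):
        """Propagate the final reward backward and record it per step."""
        g = final_reward
        for state, action in reversed(trajectory):
            entry = self._stats[(state, action)]
            entry[0] += g
            entry[1] += 1
            g *= gamma  # earlier steps receive discounted credit

    def value(self, state, action):
        """Read: average return observed after taking `action` in `state`."""
        total, n = self._stats.get((state, action), (0.0, 0))
        return total / n if n else 0.0

    def best_action(self, state, actions):
        """Policy improvement as a pure memory lookup."""
        return max(actions, key=lambda a: self.value(state, a))


memory = CaseMemory()
memory.write([("s0", "a"), ("s1", "b")], final_reward=1.0)  # a success
memory.write([("s0", "c"), ("s1", "b")], final_reward=0.0)  # a failure
print(memory.best_action("s0", ["a", "c"]))  # prints "a"
```

The same `write` call that propagates reward is also the record one would later query to ask which step preceded a given outcome, which is the sense in which the two operations share machinery in this toy setup.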
Where the supposed duality actually *breaks* is the most revealing part. A scalar reward is a lossy projection of causal structure: it tells you *that* a trajectory succeeded or failed, but discards *why*. Critique-GRPO shows that models stuck on numerical-reward plateaus can produce correct solutions once they receive chain-of-thought critiques, which is evidence that reward signals carry less causal information than the explanations memory can hold (see "Can natural language feedback overcome numerical reward plateaus?"). If reward and cause inference were truly dual, no information would be lost in the mapping. The fact that language feedback adds something reward cannot encode means the 'duality' is at best a lossy, many-to-one mapping, not an isomorphism.
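The "lossy projection" point can be illustrated with a toy example. Everything here is an assumption made for illustration: the `Outcome` type, the cause strings, and the binary reward are invented, not taken from Critique-GRPO. The sketch only shows that distinct causes map to the same scalar, so the cause cannot be recovered from the reward alone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    succeeded: bool
    cause: str  # the "why" that a critique or memory entry could hold


def scalar_reward(outcome):
    """Project an outcome onto a single number, dropping the cause."""
    return 1.0 if outcome.succeeded else 0.0


def preimage(reward, candidates):
    """All causes that are consistent with a given reward value."""
    return [o.cause for o in candidates if scalar_reward(o) == reward]


# Hypothetical episodes: two different failures and one success.
outcomes = [
    Outcome(False, "arithmetic slip in step 3"),
    Outcome(False, "misread the question"),
    Outcome(True, "correct plan and execution"),
]

# Both failures collapse to reward 0.0, so the mapping is many-to-one.
print(preimage(0.0, outcomes))
```

A critique in natural language corresponds to handing the learner the `cause` field directly, which is information the reward value does not contain in this setup.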
The asymmetry shows up again in how well-performing systems treat success versus failure. SkillRL deliberately stores successful episodes as concrete demonstrations but failures as *abstracted lessons*, a causal compression that asks 'what went wrong and why,' which is exactly cause inference, not reward propagation (see "Should successful and failed episodes be processed differently?"). Reward propagation is symmetric over success and failure (both just adjust value estimates, in opposite directions); cause inference is not. That broken symmetry is itself an argument against formal duality.
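The contrast between a symmetric value update and an asymmetric memory write can be sketched side by side. The update below is the standard TD(0) rule; the `consolidate` function and its episode schema (`success`, `steps`, `diagnosis`) are hypothetical and are not SkillRL's actual data format.

```python
def td_update(value, reward, next_value, alpha=0.1, gamma=0.9):
    """TD(0): one rule for success and failure; only the error's sign differs."""
    return value + alpha * (reward + gamma * next_value - value)


def consolidate(episode):
    """Asymmetric store (hypothetical schema).

    Successes are kept as concrete step-by-step demonstrations.
    Failures are compressed to a lesson about what went wrong.
    """
    if episode["success"]:
        return {"kind": "demonstration", "steps": list(episode["steps"])}
    # Assumes a diagnosis string was produced elsewhere, e.g. by a critic.
    return {"kind": "lesson", "text": episode["diagnosis"]}


good = {"success": True, "steps": ["open file", "parse", "answer"]}
bad = {
    "success": False,
    "steps": ["open file", "answer"],
    "diagnosis": "answered before parsing the file",
}

print(td_update(0.0, 1.0, 0.0), td_update(0.0, -1.0, 0.0))
print(consolidate(good)["kind"], consolidate(bad)["kind"])
```

The value update treats the two episodes as mirror images, while the memory write keeps different kinds of content for each, which is the asymmetry the paragraph above describes.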
So the corpus's surprise, if you came in expecting a clean mathematical correspondence, is that the relationship is real but *directional and lossy*. Reward optimization manufactures memory for free, but memory carries causal content that scalar reward signals throw away, which is why the field keeps bolting reasoning and critique onto reward models to recover the 'why' (see "Can reward models benefit from reasoning before scoring?"). The interesting research question is not 'are they dual?' but 'what does reward propagation systematically lose that cause inference in memory recovers?', and the answer the corpus gives is: the explanation.
Sources (5 notes)
AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.
A mathematical proof shows that environmental artifacts reduce the information needed to represent history in RL agents. Path-following agents naturally develop memory-like behavior through standard reward optimization, satisfying situated cognition criteria without explicit memory objectives.
Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.
SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.
Three independent teams (RRM, RM-R1, DeepSeek-GRM) discovered that adding chain-of-thought reasoning before reward scoring enables adaptive test-time compute scaling for evaluation. Reasoning-based approaches raise the capability ceiling of reward models beyond what outcome-based evaluation achieves.