Why does storing past judgments in memory make current evaluations worse?
This explores why a memory of prior verdicts can corrupt fresh judgment rather than improve it — and the corpus suggests the problem is less about memory itself than about *what* gets stored and *how* it's reused.
Read precisely, the question is: when an evaluator keeps a record of its own past decisions and consults it while judging something new, why does accuracy drop? The corpus has no single paper named for this exact failure, but several notes triangulate it cleanly, and they point to the same culprit: stored judgments become a shortcut that crowds out actual reasoning.
The clearest mechanism is the self-amplifying feedback loop. YouTube's ranking work shows that when a system trains on data shaped by its own prior choices, it converges on "degenerate equilibria that amplify their own past decisions" unless selection bias is explicitly modeled out (see "Why do ranking systems need to model selection bias explicitly?"). A past judgment isn't a neutral fact: it's evidence the system generated, and feeding it back in launders a guess into a precedent. Each new evaluation leans on the last, and small errors compound instead of washing out.
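The feedback loop can be made concrete with a toy simulation. This is a minimal sketch under stated assumptions, not the YouTube system: the two items, their qualities, and the slot examination probabilities are invented for illustration, and the correction used here is plain inverse-propensity weighting, swapped in as a simple stand-in for the shallow position tower the note describes. Expected clicks are used instead of random draws so the run is deterministic.

```python
def simulate(rounds: int, correct_position_bias: bool) -> dict:
    """Re-rank two equally good items using scores learned from own past rankings."""
    true_quality = {"a": 0.5, "b": 0.5}   # hypothetical: both items are equally good
    exam_prob = [1.0, 0.4]                # assumed chance a user examines slot 0 / slot 1
    score = {"a": 0.51, "b": 0.50}        # tiny initial error in favour of "a"
    clicks = {"a": 0.0, "b": 0.0}
    shows = {"a": 0, "b": 0}

    for _ in range(rounds):
        # The system's own current scores decide who gets the top slot.
        ranking = sorted(score, key=score.get, reverse=True)
        for slot, item in enumerate(ranking):
            observed = exam_prob[slot] * true_quality[item]  # expected clicks in this slot
            if correct_position_bias:
                # Inverse-propensity weighting: undo the advantage of the slot.
                observed /= exam_prob[slot]
            clicks[item] += observed
            shows[item] += 1
        # Past outcomes, shaped by past rankings, become the next scores.
        score = {item: clicks[item] / shows[item] for item in score}
    return score


naive = simulate(50, correct_position_bias=False)
corrected = simulate(50, correct_position_bias=True)
print("naive:", naive)
print("corrected:", corrected)
```

In the naive run, an initial gap of 0.01 turns into a locked-in gap of roughly 0.3 (item a near 0.5, item b near 0.2), even though the items are identical in quality. With the correction, both scores settle at 0.5. The point is only that the system's earlier guess determines what evidence it later sees.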
Underneath that sits a memorization problem. The STIM work finds that in chain-of-thought reasoning, *local* memorization (predicting from immediately preceding tokens) accounts for up to 67% of errors, and gets worse under distributional shift (see "Where do memorization errors arise in chain-of-thought reasoning?"). A buffer of past judgments is exactly the kind of nearby, high-salience context that invites the model to pattern-match against "what I said last time" instead of evaluating the case in front of it. When the new case differs from the stored ones, the recalled judgment actively misleads.
The most pointed evidence comes from what fixes it, and the fixes all show that the failure lies in storing the wrong *thing*. Think-in-Memory shows that retaining raw history forces the model to re-reason over the same facts for every query, producing "inconsistent inference paths", so it stores evolved, already-reasoned thoughts instead (see "Can storing evolved thoughts prevent inconsistent reasoning in conversations?"). SkillRL makes the asymmetry explicit: uniform consolidation degrades performance, so it keeps successes as concrete demonstrations and failures as abstracted lessons (see "Should successful and failed episodes be processed differently?"). And Reflexion finds that an *unambiguous* success/failure signal is what prevents the model from rationalizing, while keeping the resulting reflections *uncompressed* preserves their usability; the binary tag is what keeps the memory honest (see "Can agents learn from failure without updating their weights?").
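The three fixes above suggest a shape for a memory entry. The following is a minimal sketch, assuming a simple in-process memory; `MemoryEntry`, `consolidate`, and `render_for_prompt` are hypothetical names introduced for illustration and are not the APIs of Think-in-Memory, SkillRL, or Reflexion. It stores a reasoned lesson and an explicit binary outcome, keeps the concrete trace only for successes, and never stores a bare score.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class MemoryEntry:
    succeeded: bool                       # unambiguous outcome signal, stored as-is
    lesson: str                           # already-reasoned takeaway, not a raw verdict
    demonstration: Optional[str] = None   # concrete trace, kept for successes only


def consolidate(trace: str, succeeded: bool, lesson: str) -> MemoryEntry:
    """Asymmetric consolidation: successes keep the trace, failures keep only the lesson."""
    if succeeded:
        return MemoryEntry(succeeded=True, lesson=lesson, demonstration=trace)
    return MemoryEntry(succeeded=False, lesson=lesson)


def render_for_prompt(memory: List[MemoryEntry]) -> str:
    """Format entries so the outcome tag always travels with the lesson."""
    lines = []
    for entry in memory:
        tag = "SUCCESS" if entry.succeeded else "FAILURE"
        line = f"[{tag}] {entry.lesson}"
        if entry.demonstration is not None:
            line += f"\n  example: {entry.demonstration}"
        lines.append(line)
    return "\n".join(lines)


memory = [
    consolidate("compared units, then totals", True, "check units before comparing totals"),
    consolidate("long failed trace", False, "do not treat answer length as quality"),
]
print(render_for_prompt(memory))
```

The design choice worth noting is what is absent: there is no numeric score field for a later evaluation to anchor on, only a lesson it can check and an outcome it cannot reinterpret.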
Put together, the answer to "why does storing past judgments make evaluations worse" is that a raw verdict carries no grounding: it's a conclusion stripped of the reasoning and the outcome signal that would let a future evaluation check it. Stored that way, it becomes a confident-looking anchor the model reaches for instead of thinking, and it feeds its own bias back to itself. The corpus's quiet lesson: memory helps evaluation only when you store the *lesson* and the *signal*, not the *score*. The uncomfortable corollary: the same tendency that makes LLM judges biased, relying on exploitable surface features rather than reasoning through the case, is what a judgment memory amplifies, which is why training judges to reason at evaluation time mitigates it (see "Can reasoning during evaluation reduce judgment bias in LLM judges?").
Sources (6 notes)
YouTube's multi-objective ranker uses MMoE for conflicting objectives and a shallow position tower to remove selection bias from training data. Without both mechanisms, models converge on degenerate equilibria that amplify their own past decisions.
STIM framework identifies local, mid-range, and long-range memorization sources in CoT reasoning. Local memorization—based on preceding tokens—accounts for up to 67% of reasoning errors, especially as complexity increases and distributional shift occurs.
Think-in-Memory (TiM) stores reasoned thoughts rather than raw history, updating memory through insert, forget, and merge operations. This eliminates the inconsistent inference paths that arise when the same facts are repeatedly recalled and reasoned over for different queries.
SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.
Reflexion demonstrates that unambiguous environmental feedback (success/failure) enables agents to write useful self-diagnoses and improve across episodes without parameter updates. The binary signal prevents rationalization, and keeping reflections uncompressed preserves their usability.
Training judges with reinforcement learning to reason about evaluations—by converting judgment tasks into verifiable problems with synthetic data pairs—produces judges that think through their decisions rather than relying on exploitable surface features, directly mitigating authority, verbosity, position, and beauty bias.