INQUIRING LINE

Stored judgments don't help an AI reason better — they become shortcuts that crowd out actual thinking.

Why does storing past judgments in memory make current evaluations worse?

This explores why a memory of prior verdicts can corrupt fresh judgment rather than improve it — and the corpus suggests the problem is less about memory itself than about *what* gets stored and *how* it's reused.

This reads the question as: when an evaluator keeps a record of its own past decisions and consults it while judging something new, why does accuracy drop? The corpus doesn't have a single paper named for this exact failure, but several notes triangulate it cleanly, and they point to the same culprit — stored judgments become a shortcut that crowds out actual reasoning.

The clearest mechanism is the self-amplifying feedback loop. YouTube's ranking work shows that when a system trains on data shaped by its own prior choices, it converges on "degenerate equilibria that amplify their own past decisions" unless selection bias is explicitly modeled out (see: Why do ranking systems need to model selection bias explicitly?). A past judgment isn't a neutral fact: it's evidence the system generated, and feeding it back in launders a guess into a precedent. Each new evaluation leans on the last, and small errors compound instead of washing out.
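A toy sketch may make the loop concrete. It assumes two hypothetical items, "A" and "B", with made-up relevance and position-examination numbers, and it corrects for position with plain inverse-propensity weighting. That correction is a stand-in chosen for brevity, not the shallow position tower the YouTube work uses.

```python
# Toy model of a ranker that retrains on its own logged clicks.
# All names and numbers here are illustrative assumptions.
TRUE_RELEVANCE = {"A": 0.5, "B": 0.6}  # B is actually the better item
POSITION_PROPENSITY = [1.0, 0.3]        # chance each slot is examined at all


def observed_ctr(ranking):
    """Expected click rate per item, given the slot the ranker gave it."""
    return {
        item: TRUE_RELEVANCE[item] * POSITION_PROPENSITY[slot]
        for slot, item in enumerate(ranking)
    }


def rerank(ranking, correct_for_position):
    """Re-rank from logged clicks, optionally dividing out position bias."""
    ctr = observed_ctr(ranking)
    if correct_for_position:
        ctr = {
            item: ctr[item] / POSITION_PROPENSITY[slot]
            for slot, item in enumerate(ranking)
        }
    return sorted(ranking, key=lambda item: ctr[item], reverse=True)


def run(rounds, correct_for_position):
    ranking = ["A", "B"]  # the system's initial (wrong) choice
    for _ in range(rounds):
        ranking = rerank(ranking, correct_for_position)
    return ranking


naive = run(10, correct_for_position=False)    # stays ["A", "B"]
debiased = run(10, correct_for_position=True)  # settles on ["B", "A"]
```

In the naive run the initial placement of "A" produces the click data that justifies keeping "A" on top, so the wrong ordering never gets challenged. Once the position effect is divided out, the logged data reveals that "B" is stronger and the ordering flips.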

Underneath that sits a memorization problem. The STIM work finds that in chain-of-thought reasoning, *local* memorization (predicting from immediately preceding tokens) accounts for up to 67% of errors and gets worse under distributional shift (see: Where do memorization errors arise in chain-of-thought reasoning?). A buffer of past judgments is exactly the kind of nearby, high-salience context that invites the model to pattern-match against "what I said last time" instead of evaluating the case in front of it. When the new case differs from the stored ones, the recalled judgment actively misleads.

The most pointed evidence comes from what fixes it, which suggests the underlying fault is storing the wrong *thing*. Think-in-Memory shows that retaining raw history forces the model to re-reason over the same facts for every query, producing "inconsistent inference paths", so it stores evolved, already-reasoned thoughts instead (see: Can storing evolved thoughts prevent inconsistent reasoning in conversations?). SkillRL makes the asymmetry explicit: uniform consolidation degrades performance, so it keeps successes as concrete demonstrations and failures as abstracted lessons (see: Should successful and failed episodes be processed differently?). And Reflexion finds that an *unambiguous* success/failure signal is what prevents the model from rationalizing, while keeping the reflections *uncompressed* is what keeps them usable (see: Can agents learn from failure without updating their weights?).
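The asymmetry described above can be sketched as a small data structure. This is a minimal illustration under stated assumptions, not the method of any of the cited systems: `Episode`, `MemoryEntry`, `consolidate`, and `toy_abstract` are hypothetical names, and `toy_abstract` stands in for whatever model-written reflection a real system would produce.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Episode:
    task: str
    trajectory: str   # what the agent actually did
    succeeded: bool   # unambiguous outcome signal
    score: float      # raw verdict; intentionally never copied into memory


@dataclass(frozen=True)
class MemoryEntry:
    kind: str         # "demonstration" or "lesson"
    succeeded: bool   # binary tag stored next to the content
    content: str


def consolidate(episode: Episode, abstract: Callable[[Episode], str]) -> MemoryEntry:
    """Keep successes verbatim; turn failures into an abstracted lesson."""
    if episode.succeeded:
        return MemoryEntry("demonstration", True, episode.trajectory)
    return MemoryEntry("lesson", False, abstract(episode))


def toy_abstract(episode: Episode) -> str:
    # Hypothetical stand-in for a model-written reflection.
    return f"When doing '{episode.task}', do not repeat: {episode.trajectory}"


episodes = [
    Episode("sort a list", "called sorted(xs)", True, 0.9),
    Episode("parse a date", "split on '/' and guessed the order", False, 0.2),
]
memory = [consolidate(ep, toy_abstract) for ep in episodes]
```

The design choice worth noticing is what `MemoryEntry` leaves out: there is no `score` field, so a later evaluation can retrieve what happened and whether it worked, but has no stored number to anchor on.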

Put together, the answer to "why does storing past judgments make evaluations worse" is that a raw verdict carries no grounding: it's a conclusion stripped of the reasoning and the outcome signal that would let a future evaluation check it. Stored that way, it becomes a confident-looking anchor the model reaches for instead of thinking, and it feeds its own bias back to itself. The corpus's quiet lesson: memory helps evaluation only when you store the *lesson* and the *signal*, not the *score*. The less comfortable corollary: the same weakness that makes LLM judges biased, relying on exploitable surface features rather than reasoning through the case, is what a judgment memory amplifies, which is why training judges to reason at evaluation time mitigates it (see: Can reasoning during evaluation reduce judgment bias in LLM judges?).

Sources (6 notes)

Why do ranking systems need to model selection bias explicitly?

YouTube's multi-objective ranker uses MMoE for conflicting objectives and a shallow position tower to remove selection bias from training data. Without both mechanisms, models converge on degenerate equilibria that amplify their own past decisions.

Where do memorization errors arise in chain-of-thought reasoning?

STIM framework identifies local, mid-range, and long-range memorization sources in CoT reasoning. Local memorization—based on preceding tokens—accounts for up to 67% of reasoning errors, especially as complexity increases and distributional shift occurs.

Can storing evolved thoughts prevent inconsistent reasoning in conversations?

Think-in-Memory (TiM) stores reasoned thoughts rather than raw history, updating memory through insert, forget, and merge operations. This eliminates the inconsistent inference paths that arise when the same facts are repeatedly recalled and reasoned over for different queries.

Should successful and failed episodes be processed differently?

SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.

Can agents learn from failure without updating their weights?

Reflexion demonstrates that unambiguous environmental feedback (success/failure) enables agents to write useful self-diagnoses and improve across episodes without parameter updates. The binary signal prevents rationalization, and keeping reflections uncompressed preserves their usability.

Can reasoning during evaluation reduce judgment bias in LLM judges?

Training judges with reinforcement learning to reason about evaluations—by converting judgment tasks into verifiable problems with synthetic data pairs—produces judges that think through their decisions rather than relying on exploitable surface features, directly mitigating authority, verbosity, position, and beauty bias.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

AgentFly: Fine-tuning LLM Agents without Fine-tuning LLMs · 1.72 match · arXiv
Useful Memories Become Faulty When Continuously Updated by LLMs · 1.71 match · arXiv
Diagnosing Memorization in Chain-of-Thought Reasoning, One Token at a Time · 0.93 match · arXiv
CLIN: A Continually Learning Language Agent for Rapid Task Adaptation and Generalization · 0.88 match · arXiv
Reflexion: Language Agents with Verbal Reinforcement Learning · 0.87 match · arXiv
J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning · 0.87 match · arXiv
Real-Time Procedural Learning From Experience for AI Agents · 0.86 match · arXiv
Agent Learning via Early Experience · 0.86 match · arXiv

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-evaluating a claim about LLM evaluation bias. The precise question: does storing past judgments in memory degrade current evaluation accuracy, and if so, why?

What a curated library found — and when (dated claims, not current truth):
Findings span 2022–2026; treat these as snapshots, not current ground truth:
• Self-amplifying feedback loops: when systems train on their own prior decisions, they converge on "degenerate equilibria" unless selection bias is explicitly modeled (2023–2025).
• Local memorization in chain-of-thought accounts for up to 67% of errors and worsens under distributional shift (2025).
• Storing raw verdicts without reasoning or outcome-signal invites pattern-matching to "what I said last time" instead of fresh evaluation (2025).
• Storing *evolved thoughts* (reasoned, abstracted lessons + success/failure tags) instead of raw scores preserves evaluation integrity across Think-in-Memory, SkillRL, and Reflexion work (2023–2025).
• Training judges to reason *at evaluation time* via RL mitigates bias by converting judgment tasks to reasoning tasks (2025).

Anchor papers (verify; mind their dates):
• arXiv:2311.08719 (Think-in-Memory, 2023)
• arXiv:2508.02037 (Diagnosing Memorization in CoT, 2025)
• arXiv:2505.10320 (J1: RL-based Judge Thinking, 2025)
• arXiv:2604.08756 (Artifacts as Memory, 2026)

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, ask whether newer architectures (mixture-of-experts judges, retrieval-augmented evaluation), training regimes (online adaptation, multi-round feedback), or evaluation harnesses have relaxed or overturned the core claim that *raw stored judgments* degrade accuracy. Separate the durable question—does unskilled memory reuse harm evaluation?—from the perishable limitation—can explicit reasoning, outcome-tagging, or architectural isolation now prevent it?
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Look for papers showing stored judgments *do* improve accuracy under certain conditions (e.g., confidence calibration, consistency priors, or multi-agent consensus).
(3) Propose two research questions that assume the regime may have moved: (a) Can in-context demonstrations of past reasoning—rather than stored verdicts—eliminate the amplification loop without forgetting? (b) Does the harm of stored judgments reverse when the evaluator also stores *why* it disagreed with its past self?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
