What makes a learned consolidation rule lossy and where does contamination enter?
This explores why the act of compressing experience into a durable rule — weights, a context playbook, or a memory store — necessarily throws information away, and the specific moments where bad signal sneaks in and gets baked in alongside the good.
This reads the question as asking about consolidation as compression: any time a system folds raw experience into a reusable rule, it has to decide what to keep, and both the discarding and the deciding are where damage happens. The corpus frames the loss as structural, not accidental. There is a hard ceiling: GPT-family models hold roughly 3.6 bits per parameter, and once that capacity fills they stop memorizing specifics and shift to generalizing (When do language models stop memorizing and start generalizing?). That phase change is exactly where a consolidation rule becomes lossy by design: it can no longer store the instance, so it stores a compressed proxy and lets the details go.
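To make the ceiling concrete, here is a back-of-the-envelope sketch. It assumes the roughly 3.6 bits-per-parameter figure can be treated as a flat constant, and `dataset_bits` is a hypothetical stand-in for however one measures the information content of the training data; neither the function names nor that measure come from the notes.

```python
# Assumption: ~3.6 bits per parameter, applied as a flat constant.
BITS_PER_PARAM = 3.6


def capacity_bits(n_params: int) -> float:
    """Approximate memorization capacity of a model, in bits."""
    return BITS_PER_PARAM * n_params


def is_past_capacity(n_params: int, dataset_bits: float) -> bool:
    """True when the data holds more bits than the model could memorize.

    `dataset_bits` is a hypothetical measure of the data's information content.
    """
    return dataset_bits > capacity_bits(n_params)


# A 1-billion-parameter model: 3.6e9 bits, i.e. about 450 MB of raw capacity.
one_billion = 1_000_000_000
print(f"{capacity_bits(one_billion) / 8 / 1e6:.0f} MB")  # prints 450 MB
```

On these assumptions, any corpus carrying more than a few hundred megabytes of incompressible detail already exceeds what a 1B-parameter model could store verbatim, so whatever it retains has to be a compressed proxy.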
Where does the loss actually bite? Two places. First, in the weights themselves: direct fine-tuning corrupts knowledge stored in a model's lower layers, while a decoding-time approach that leaves base weights untouched preserves that knowledge and shifts only reasoning and style (Can decoding-time tuning preserve knowledge better than weight fine-tuning?). The lesson is that consolidation isn't free: overwriting the substrate where facts live is what makes it lossy. Second, in iterative compression: when a system repeatedly rewrites a context rather than amending it, brevity bias and 'context collapse' erode detail with each pass, which is why treating context as an incrementally updated playbook beats full rewrites (Can context playbooks prevent knowledge loss during iteration?). And the loss compounds silently: frontier models corrupt about 25% of document content across long delegated relays, with errors accumulating round after round and never plateauing (Do frontier LLMs silently corrupt documents in long workflows?).
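The decoding-time idea can be sketched in a few lines. This assumes the logit-offset formulation commonly associated with proxy-tuning, where a small tuned "expert" and its untuned "anti-expert" supply a correction that is added to the large base model's logits; the notes themselves only say that base weights stay untouched, so the function names and toy numbers here are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def proxy_tuned_logits(base, expert, anti_expert):
    """Shift the base model's logits by the (tuned - untuned) small-model gap.

    No base weights are modified; the adjustment exists only at decoding time.
    """
    return [b + (e - a) for b, e, a in zip(base, expert, anti_expert)]


# Toy three-token vocabulary (made-up numbers).
base = [2.0, 1.0, 0.0]         # large untuned model
expert = [0.0, 1.5, 0.0]       # small tuned model
anti_expert = [0.0, 0.5, 0.0]  # small untuned model

shifted = proxy_tuned_logits(base, expert, anti_expert)
print(shifted)           # [2.0, 2.0, 0.0]
print(softmax(shifted))  # tokens 0 and 1 now tie; token 2 stays unlikely
```

When the expert and anti-expert agree, the correction is zero and the base distribution passes through unchanged, which is one way to see why this route leaves stored knowledge alone.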
Contamination enters at the reinforcement step, the moment you tell the rule which trajectories to keep. Train on problems that are too hard and group-relative normalization treats rare accidental successes as high-advantage, so the model consolidates shortcuts (answer repetition, skipping computation) that then bleed into and degrade capabilities it already had (Do overly hard RLVR samples actually harm model capabilities?). RL post-training does something quieter but related: within the first epoch it amplifies one pretraining format and suppresses the alternatives, and which format wins depends on model scale, not necessarily on which one performs best (Does RL training collapse format diversity in pretrained models?). That's contamination as collapse: diversity consolidated out of existence. The other entry point is self-feeding: a RAG system that writes its own generated answers back into its corpus can pollute every future retrieval with its own hallucinations, which is why safe write-back gates each addition behind entailment, attribution, and novelty checks (Can RAG systems safely learn from their own generated answers?).
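The arithmetic behind the "rare accidental success" problem is easy to reproduce. The sketch below assumes binary rewards and a group-relative advantage of the form (reward minus group mean) divided by group standard deviation, using the population standard deviation; real implementations may differ in that detail, and `group_relative_advantages` is an illustrative name.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    Assumption: population standard deviation; eps avoids division by zero
    when every rollout in the group receives the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# One lucky success among 8 rollouts on a near-impossible problem.
hard_8 = [1.0] + [0.0] * 7
print(round(group_relative_advantages(hard_8)[0], 3))   # 2.646, i.e. sqrt(7)

# The same single success among 64 rollouts is weighted far more heavily.
hard_64 = [1.0] + [0.0] * 63
print(round(group_relative_advantages(hard_64)[0], 3))  # 7.937, i.e. sqrt(63)
```

With a success rate p in the group, the successful rollout's advantage works out to sqrt((1 - p) / p), so the rarer the success, the larger the push toward whatever that trajectory happened to do, shortcut or not.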
The interesting move in the corpus is that the most robust strategies treat lossiness as the thing to route around rather than perfect. Staying close to the base distribution (low KL drift) preserves the model's plasticity so it can keep learning new tasks, whereas parameter-only updates stall once the domain shifts (Does staying close to the base model preserve learning ability?). Push further and you can refuse to consolidate into weights at all: episodic-memory agents adapt continually through memory operations and credit assignment without touching a single parameter (Can agents learn continuously from experience without updating weights?). The surprise is that 'lossy' and 'contaminated' turn out to be the same failure viewed from two angles: the rule forgets the right things and remembers the wrong ones in the same compression step. The defense isn't a better compressor but an un-baked, inspectable, gated layer kept between experience and the rule it becomes.
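"Staying close to the base distribution" can be measured directly. As a minimal sketch, assume drift is the KL divergence from the tuned model's next-token distribution to the base model's, computed over a shared vocabulary; the direction of the divergence and the toy distributions are assumptions made for illustration.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same support.

    Terms with p_i == 0 contribute nothing; q_i is assumed to be positive
    wherever p_i is positive.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


base = [0.5, 0.5]          # base model's next-token distribution (toy)
gentle_update = [0.6, 0.4]  # tuned model that stayed close
heavy_update = [0.9, 0.1]   # tuned model that drifted far

print(kl_divergence(gentle_update, base))  # small drift
print(kl_divergence(heavy_update, base))   # much larger drift
```

Averaged over a held-out set of prompts, a number like this gives the gated layer something concrete to inspect before a change is allowed to become part of the rule.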
Sources (9 notes)
GPT-family models have a measurable memorization capacity of approximately 3.6 bits-per-parameter. When this capacity fills, a phase transition triggers grokking—the shift from memorization to genuine generalization. This capacity is a property of individual models, not training algorithms.
Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.
The ACE framework treats contexts as evolving playbooks using generation-reflection-curation loops rather than full rewrites. This prevents knowledge loss from compression and detail erosion, achieving +10.6% on agentic tasks and +8.6% on finance without labeled supervision.
Testing 19 models across 52 domains shows even advanced systems degrade documents by ~25% over extended relay tasks, with errors compounding silently without plateauing through 50 round-trips.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Systems can add generated answers to their retrieval corpus when outputs pass entailment verification, source attribution checks, and novelty detection. This prevents hallucinations from polluting future retrievals while allowing genuine knowledge accumulation.
FST-trained models stay up to 70% closer to their base distribution than parameter-only RL, and this reduced drift preserves the model's ability to learn subsequent tasks effectively. Parameter-only approaches stall when task domains change, while low KL drift enables sustained adaptation.
AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.