Can retrieval knowledge compress into a tiny parametric model?
Can the information stored in large non-parametric retrieval datastores be compressed into a small trainable module? This matters because it could combine retrieval's knowledge benefits with the speed of pure parametric methods.
Memory Decoder (arXiv:2508.09874) addresses a fundamental tension in domain adaptation. Retrieval-augmented generation (RAG) provides flexibility but adds inference latency through nearest-neighbor search; domain-adaptive pretraining embeds knowledge in weights but requires costly full-parameter training and risks catastrophic forgetting. Memory Decoder proposes a third path: compress the knowledge stored in large non-parametric datastores into a compact parametric model.
The approach pretrains a small transformer decoder to imitate the output distributions of a kNN-LM retriever. Once trained, it plugs into any language model sharing the same tokenizer via simple output interpolation — no model-specific modifications needed. The pretrained LM and Memory Decoder process the same input context in parallel, and their distributions are interpolated at output time.
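The two steps described above (imitating a kNN-LM distribution during training, then interpolating outputs at inference) can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the function names, the plain KL-divergence imitation loss, the linear kNN-LM-style interpolation, the weight `lam`, and all toy numbers are assumptions made for the example.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    x = np.asarray(x, dtype=float)
    e = np.exp(x - np.max(x))
    return e / e.sum()

def knn_distribution(distances, neighbor_tokens, vocab_size, temperature=1.0):
    """kNN-LM-style target: softmax over negative neighbor distances,
    with the weight of each neighbor added to the token it stores."""
    weights = softmax(-np.asarray(distances, dtype=float) / temperature)
    p = np.zeros(vocab_size)
    np.add.at(p, np.asarray(neighbor_tokens, dtype=int), weights)
    return p

def kl_divergence(p_target, p_model, eps=1e-12):
    """KL(p_target || p_model); entries where the target is zero contribute nothing."""
    mask = p_target > 0
    return float(np.sum(p_target[mask] * (np.log(p_target[mask]) - np.log(p_model[mask] + eps))))

def interpolate(p_lm, p_mem, lam):
    """Linear output interpolation (assumed form): lam * p_mem + (1 - lam) * p_lm."""
    return lam * p_mem + (1.0 - lam) * p_lm

# Toy example with a hypothetical 5-token vocabulary.
vocab_size = 5

# Training side: three retrieved neighbors, two of which store token 2.
p_knn = knn_distribution(distances=[0.1, 0.2, 3.0], neighbor_tokens=[2, 2, 4], vocab_size=vocab_size)

# Stand-in for the small decoder's current prediction (hypothetical logits).
p_mem = softmax([0.0, 0.0, 3.0, 0.0, 1.0])

# Imitation loss the small decoder would minimize for this context.
loss = kl_divergence(p_knn, p_mem)

# Inference side: both models see the same context; only outputs are mixed.
p_lm = softmax(np.zeros(vocab_size))  # uninformed base LM, for illustration only
p_final = interpolate(p_lm, p_mem, lam=0.5)

print("kNN target:", np.round(p_knn, 4))
print("imitation loss:", round(loss, 4))
print("interpolated:", np.round(p_final, 4))
```

Because the mixing happens only on output distributions over a shared vocabulary, the sketch never touches the base model's internals, which is what makes the same tokenizer the only compatibility requirement.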
Two capabilities validate the compression hypothesis: (1) Long-tail knowledge — for factual information like "Jacobi" and "1906," Memory Decoder assigns dramatically higher probabilities than the base model (68.94% vs 0.12%), successfully capturing the memorization benefits of non-parametric methods. (2) Semantic coherence — for function words and logical continuations, Memory Decoder maintains probabilities closer to the base model rather than following kNN-LM's distortions, preserving coherent language modeling that pure retrieval sacrifices.
This bridges a gap in the question "How do knowledge injection methods trade off flexibility and cost?": Memory Decoder is a modular adapter that inherits retrieval's long-tail strength without retrieval's inference cost. It demonstrates that the information content of a large datastore can be compressed into orders-of-magnitude fewer parameters, which suggests retrieval-augmented knowledge may be more redundant than its datastore size implies.
The plug-and-play capability also connects to "Can neural memory modules scale language models beyond attention limits?": both approaches add external memory as a parallel module rather than modifying the base model, but Memory Decoder targets domain knowledge while Titans targets sequence length.
Inquiring lines that use this note as a source
This note is a source for these synthesized inquiries.
- Why does each rewrite cycle degrade domain-specific details differently than compression?
- Why does adjusted compression performance degrade as models scale larger?
- Can steering vectors be combined with other compression techniques?
- Can task-agnostic compression of documents remain broadly useful for later queries?
- How does the compression view extend from trained models to training objectives?
Related concepts in this collection
- Can lookup memory and computation work together better than either alone?
  Mixture-of-Experts handles dynamic logic, but static knowledge might need a different mechanism. Can a hybrid approach combining conditional computation with fast lookup outperform pure sparse models?
  Inverse direction: Memory Decoder compresses non-parametric retrieval into a parametric module *at the output distribution*; Engram adds a static O(1) lookup primitive *inside* the parametric model. Both move retrieval-like behavior into parametric form, at different granularities.
- Can neural memory modules scale language models beyond attention limits?
  Can separating short-term attention from adaptive long-term memory allow models to efficiently handle context windows exceeding 2M tokens while maintaining competitive performance?
  Titans adds neural memory as a parallel module for sequence length; Memory Decoder is the domain-knowledge counterpart of the same plug-and-play philosophy.
- Can brain memory systems explain how LLMs should store knowledge?
  This explores whether the brain's three-tier memory architecture (neocortex, hippocampus, and prefrontal cortex) maps onto transformer weights, external knowledge stores, and agentic state. Understanding this mapping could reveal which AI memory problems each tier solves and which it cannot.
  Memory Decoder operationalizes the missing CLS transfer mechanism: it compresses hippocampal (explicit) retrieval into a neocortex-like (implicit) parametric substrate.
Related papers in this collection
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Generalization through Memorization: Nearest Neighbor Language Models
- CLaRa: Bridging Retrieval and Generation with Continuous Latent Reasoning
- Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs
- FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions
- Efficient Nearest Neighbor Language Models
- Chain-of-Retrieval Augmented Generation
- Thinking as Compression: Your Reasoning Model is Secretly a Context Compressor
- From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning
Original note title
compressing retrieval into a small parametric decoder eliminates datastore search at inference while preserving long-tail knowledge — a third path between RAG and fine-tuning