SYNTHESIS NOTE

Can retrieval knowledge compress into a tiny parametric model?

Can the information stored in large non-parametric retrieval datastores be compressed into a small trainable module? This matters because it could combine retrieval's knowledge benefits with the speed of pure parametric methods.

Synthesis note · 2026-04-18 · sourced from Memory

Memory Decoder (2508.09874) addresses a fundamental tension in domain adaptation: RAG provides flexibility but adds inference latency through nearest-neighbor search; domain-adaptive pretraining embeds knowledge in weights but requires costly full-parameter training and risks catastrophic forgetting. Memory Decoder proposes a third path — compress the knowledge stored in large non-parametric datastores into a compact parametric model.
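To make the retrieval cost concrete, here is a minimal sketch of how a kNN-LM-style retriever could turn a datastore lookup into a next-token distribution. The toy datastore format, the brute-force distance scan, and the names `knn_next_token_distribution`, `k`, and `temperature` are illustrative assumptions, not details taken from the paper.

```python
import math
from collections import defaultdict

def knn_next_token_distribution(query, datastore, k=2, temperature=1.0):
    """Build a next-token distribution from the k nearest datastore entries.

    datastore: list of (key_vector, next_token) pairs (hypothetical toy format).
    """
    # Brute-force squared L2 distance to every key; this scan is the
    # per-token inference cost that grows with datastore size.
    scored = sorted(
        (sum((q - x) ** 2 for q, x in zip(query, key)), token)
        for key, token in datastore
    )[:k]
    # Softmax over negative distances, pooling neighbours that share a token.
    weights = defaultdict(float)
    for dist, token in scored:
        weights[token] += math.exp(-dist / temperature)
    total = sum(weights.values())
    return {token: w / total for token, w in weights.items()}

# Toy datastore: two contexts followed by "Jacobi", one followed by "the".
datastore = [
    ((0.0, 0.0), "Jacobi"),
    ((0.1, 0.0), "Jacobi"),
    ((5.0, 5.0), "the"),
]
dist = knn_next_token_distribution((0.0, 0.1), datastore, k=2)
print(dist)
```

Real systems replace the linear scan with an approximate nearest-neighbor index, but the lookup still runs for every generated token, which is the latency Memory Decoder aims to remove.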

The approach pretrains a small transformer decoder to imitate the output distributions of a kNN-LM retriever. Once trained, it plugs into any language model sharing the same tokenizer via simple output interpolation — no model-specific modifications needed. The pretrained LM and Memory Decoder process the same input context in parallel, and their distributions are interpolated at output time.
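The two steps described above, imitation during training and interpolation at inference, can be sketched as follows. This is a sketch under stated assumptions: the note does not specify the loss or the mixing weight, so the KL divergence objective, the parameter `lam`, and the toy three-token vocabulary are all hypothetical.

```python
import math

def interpolate(p_lm, p_mem, lam):
    """Mix two next-token distributions over the same vocabulary.

    lam is a hypothetical weight on the memory module, in [0, 1].
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    vocab = set(p_lm) | set(p_mem)
    return {
        t: (1.0 - lam) * p_lm.get(t, 0.0) + lam * p_mem.get(t, 0.0)
        for t in vocab
    }

def kl_divergence(p_target, p_model, eps=1e-12):
    """KL(p_target || p_model): one plausible imitation loss (an assumption)."""
    return sum(
        p * math.log(p / max(p_model.get(t, 0.0), eps))
        for t, p in p_target.items()
        if p > 0.0
    )

# Toy distributions over a shared three-token vocabulary.
p_lm = {"the": 0.7, "Jacobi": 0.1, "a": 0.2}    # base language model
p_knn = {"the": 0.1, "Jacobi": 0.8, "a": 0.1}   # retriever target (training only)
p_mem = {"the": 0.2, "Jacobi": 0.7, "a": 0.1}   # small decoder's current output

loss = kl_divergence(p_knn, p_mem)          # training signal: match the retriever
mixed = interpolate(p_lm, p_mem, lam=0.5)   # inference: no datastore lookup
print(loss, mixed)
```

Because the mix happens on output distributions, the only coupling between the two models is the shared vocabulary, which is why a shared tokenizer is the stated requirement.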

Two capabilities validate the compression hypothesis: (1) Long-tail knowledge — for factual information like "Jacobi" and "1906," Memory Decoder assigns dramatically higher probabilities than the base model (68.94% vs 0.12%), successfully capturing the memorization benefits of non-parametric methods. (2) Semantic coherence — for function words and logical continuations, Memory Decoder maintains probabilities closer to the base model rather than following kNN-LM's distortions, preserving coherent language modeling that pure retrieval sacrifices.
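As a rough illustration of what those figures imply after interpolation, the sketch below mixes the two reported probabilities at a few weights. It assumes 68.94% is Memory Decoder's own probability and 0.12% is the base model's, which is how the note reads; the weights `lam` are hypothetical and not taken from the paper.

```python
p_base = 0.0012    # base model probability reported in the note (0.12%)
p_memory = 0.6894  # Memory Decoder probability reported in the note (68.94%)

def mixed_probability(lam):
    """Interpolated probability for a hypothetical memory weight lam in [0, 1]."""
    return (1.0 - lam) * p_base + lam * p_memory

# Hypothetical weights, chosen only to show the trend.
for lam in (0.25, 0.5, 0.75):
    print(f"lam={lam:.2f} -> {mixed_probability(lam):.4f}")
```

Even a modest weight on the memory module lifts a long-tail token far above the base model's estimate, while tokens on which the two models agree are left nearly unchanged.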

This bridges a gap in the question "How do knowledge injection methods trade off flexibility and cost?": Memory Decoder is a modular adapter that inherits retrieval's long-tail strength without retrieval's inference cost. It shows that the information content of a large datastore can be compressed into orders-of-magnitude fewer parameters, which suggests that retrieval-augmented knowledge may be more redundant than its datastore size implies.

The plug-and-play capability also connects to "Can neural memory modules scale language models beyond attention limits?": both approaches add external memory as a parallel module rather than modifying the base model, but Memory Decoder targets domain knowledge while Titans, the neural memory architecture discussed in that note, targets sequence length.

Inquiring lines that read this note 6

What role does compression play in language model capability and generalization?

Related concepts in this collection 3

Can lookup memory and computation work together better than either alone?
Mixture-of-Experts handles dynamic logic, but static knowledge might need a different mechanism. Can a hybrid approach combining conditional computation with fast lookup outperform pure sparse models?
Inverse direction: Memory Decoder compresses non-parametric retrieval into a parametric module *at the output distribution*; Engram adds a static O(1) lookup primitive *inside* the parametric model. Both move retrieval-like behavior into parametric form, at different granularities.

Can neural memory modules scale language models beyond attention limits?
Can separating short-term attention from adaptive long-term memory allow models to efficiently handle context windows exceeding 2M tokens while maintaining competitive performance?
Titans adds neural memory as a parallel module for sequence length; Memory Decoder is the domain-knowledge counterpart of the same plug-and-play philosophy.

Can brain memory systems explain how LLMs should store knowledge?
This explores whether the brain's three-tier memory architecture (neocortex, hippocampus, and prefrontal cortex) maps onto transformer weights, external knowledge stores, and agentic state. Understanding this mapping could reveal which AI memory problems each tier solves and which it cannot.
Memory Decoder operationalizes the missing complementary learning systems (CLS) transfer mechanism: it compresses hippocampal (explicit) retrieval into a neocortex-like (implicit) parametric substrate.
