INQUIRING LINE


Why does keeping full key-value blocks matter more than compressing them?

This explores why preserving the full key-value detail of context (rather than squeezing it into a smaller summary or fixed-size state) protects exactly the capabilities — copying, retrieval, fine distinctions — that compression quietly destroys.

The sharpest evidence is architectural: two-layer transformers can copy exponentially long strings and retrieve from context because they can attend back to every original token, while state-space models hit a wall. Their fixed-size latent state is a form of permanent compression, and once a detail is folded into that compact summary it can't be recovered (see "Can state-space models match transformers at copying and retrieval?"). That's the core reason full KV blocks matter: copying and exact retrieval are lookup operations, and you can't look up what you've already averaged away.
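To make the lookup-versus-averaging point concrete, here is a minimal toy sketch. It is not an implementation of attention or of any state-space model: `kv_lookup` and `fold_into_state` are hypothetical names, and the "state" is just a two-slot running sum chosen to show that two different contexts can collapse to the same fixed-size summary.

```python
def kv_lookup(keys, values, query):
    """Exact retrieval: scan the stored pairs and return the matching value."""
    for k, v in zip(keys, values):
        if k == query:
            return v
    return None


def fold_into_state(values, state_size):
    """Toy fixed-size state (an assumption for illustration, not an SSM):
    each value is added into one of `state_size` slots by position."""
    state = [0.0] * state_size
    for position, v in enumerate(values):
        state[position % state_size] += v
    return state


keys = list(range(8))
context_a = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
# Same values with positions 1 and 5 swapped.
context_b = [3.0, 9.0, 4.0, 1.0, 5.0, 1.0, 2.0, 6.0]

# Full key-value storage still tells the two contexts apart.
value_a = kv_lookup(keys, context_a, 5)
value_b = kv_lookup(keys, context_b, 5)

# The two-slot summary is identical for both contexts, so no readout
# from it can answer "what was stored at key 5?" for both.
state_a = fold_into_state(context_a, state_size=2)
state_b = fold_into_state(context_b, state_size=2)
```

With eight values folded into two slots, distinct contexts must collide; the full key-value store has no such collision because nothing was merged.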

The same theme shows up wherever fidelity meets a verification task. A two-stage retrieval pipeline can reject 'structural near-misses' (things that look topically similar but aren't actually the right match) only because its verifier operates on the full token-to-token interaction map rather than a pooled, compressed vector (see "Can verification separate structural near-misses from topical matches?"). Compress the tokens into one summary embedding and the distinguishing structure is gone. There's a cognitive-science echo here too: LLMs already compress aggressively, capturing broad category structure while losing the fine-grained distinctions humans preserve for situated action (see "Do LLMs compress concepts more aggressively than humans do?"). Compression isn't free; it spends exactly the nuance that lets a system act correctly in a specific case.
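A small sketch can show why pooling hides a structural near-miss. Everything here is an assumption for illustration: the one-hot "embeddings", the example sentences, and especially `diagonal_alignment`, which is a trivial stand-in for the small Transformer verifier described in the source note, not a reproduction of it.

```python
import math

# Hypothetical one-hot embeddings for three tokens.
EMB = {"dog": [1.0, 0.0, 0.0], "bites": [0.0, 1.0, 0.0], "man": [0.0, 0.0, 1.0]}


def embed(tokens):
    return [EMB[t] for t in tokens]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_pool(vectors):
    """Compress a token sequence into one vector by averaging."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def similarity_map(query, doc):
    """Full token-to-token map: one row per query token, one column per doc token."""
    return [[dot(q, d) for d in doc] for q in query]


def maxsim(sim):
    """MaxSim-style late interaction: best match per query token, summed."""
    return sum(max(row) for row in sim)


def diagonal_alignment(sim):
    """Stand-in verifier signal: how much similarity sits at matching positions."""
    return sum(sim[i][i] for i in range(min(len(sim), len(sim[0]))))


query = embed(["dog", "bites", "man"])
exact = embed(["dog", "bites", "man"])
near_miss = embed(["man", "bites", "dog"])  # same words, reversed roles

pooled_exact = cosine(mean_pool(query), mean_pool(exact))
pooled_near = cosine(mean_pool(query), mean_pool(near_miss))
map_exact = similarity_map(query, exact)
map_near = similarity_map(query, near_miss)
```

In this toy case the pooled cosine and the MaxSim score are the same for the exact match and the near-miss, while the position-aware reading of the full map separates them.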

But the corpus refuses to make this a one-sided 'never compress' verdict, which is the more interesting takeaway. Whether full fidelity matters turns out to depend on the consumer. Work on an RL-trained external context manager found that stronger agents benefit from high-fidelity preservation, while weaker agents actually need aggressive pruning to stay reliable (see "Can external managers compress context better than frozen agents?"). Similarly, optimal sparse attention isn't a fixed budget: longer sequences tolerate significantly more sparsity without loss, so how much you can throw away scales with the input (see "Does fixed sparsity work for all sequence lengths?"). So 'keep everything' isn't a universal law. It's the right default precisely when the task is retrieval, copying, or fine discrimination, and the wrong one when the consumer can't use the extra detail anyway.
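As a rough illustration of a length-adaptive budget, the sketch below keeps a number of tokens that grows with the square root of the sequence length, so the sparsity fraction rises as inputs get longer. The square-root schedule, the function names, and the `floor` and `scale` values are all invented for this example; the source notes only say that budgets should adapt to context length, not how.

```python
import math


def keep_budget(seq_len, floor=256, scale=32):
    """Hypothetical schedule: tokens kept grow sublinearly with length.

    `floor` and `scale` are illustrative constants, not values from any paper.
    """
    return min(seq_len, max(floor, scale * math.isqrt(seq_len)))


def sparsity(seq_len, **kwargs):
    """Fraction of tokens dropped under the schedule above."""
    return 1.0 - keep_budget(seq_len, **kwargs) / seq_len


short_sparsity = sparsity(1024)    # every token kept
medium_sparsity = sparsity(16384)  # 4096 of 16384 tokens kept
long_sparsity = sparsity(65536)    # 8192 of 65536 tokens kept
```

A fixed budget would apply one sparsity level to all three inputs; here the short input is left dense while the longest one drops most of its tokens.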

What makes compression survivable, when you do it, is structure rather than blunt shrinking. Replacing fixed-size chunks with four-part logic units (prerequisite, header, body, linker) preserves the step-to-step dependencies that naive chunking destroys (see "How do logic units preserve procedural coherence better than chunks?"), and agents that fold their own memory into typed episodic, working, and tool schemas avoid the degradation that plagues poorly designed consolidation (see "Can agents compress their own memory without losing critical details?"). The failure mode isn't compression per se. It's lossy compression that discards the relationships the downstream task needs to navigate.
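One way to picture a four-part logic unit is as a small record whose linker names the next step for each outcome. This is a sketch under stated assumptions: the field names follow the four parts listed in the source note, but the `LogicUnit` class, the `walk` helper, the dictionary-shaped linker, and the router example are hypothetical and not taken from THREAD itself.

```python
from dataclasses import dataclass, field


@dataclass
class LogicUnit:
    header: str             # what this step is about
    body: str               # the instruction itself
    prerequisite: str = ""  # what must hold before the step applies
    # Assumed shape: maps an outcome label to the header of the next unit.
    linker: dict = field(default_factory=dict)


def walk(units, start, outcomes):
    """Follow linkers from `start`, one observed outcome per step."""
    by_header = {u.header: u for u in units}
    path = [start]
    current = by_header[start]
    for outcome in outcomes:
        next_header = current.linker.get(outcome)
        if next_header is None or next_header not in by_header:
            break  # no link for this outcome: the procedure ends here
        path.append(next_header)
        current = by_header[next_header]
    return path


units = [
    LogicUnit("Check power", "Confirm the power light is on.",
              linker={"on": "Reset router", "off": "Replace cable"}),
    LogicUnit("Reset router", "Hold the reset button for ten seconds.",
              prerequisite="Power light is on"),
    LogicUnit("Replace cable", "Swap in a known-good power cable.",
              linker={"on": "Reset router"}),
]

happy_path = walk(units, "Check power", ["on"])
repair_path = walk(units, "Check power", ["off", "on"])
```

Splitting the same text into fixed-size chunks would keep the three bodies but lose the linker entries, which are exactly the step-to-step dependencies the paragraph above describes.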

The thing you didn't know you wanted to know: the long-context bottleneck may not even be about storage. One line of work argues the real cost is the *compute* needed to consolidate evicted context into internal state (fast weights, in that work), with performance improving the more consolidation passes you spend (see "Is long-context bottleneck really about memory or compute?"). Read that alongside the SSM copying result and a unified picture emerges: keeping full KV blocks is cheap insurance against an expensive, often irreversible operation, namely turning raw context into a compressed state you can never fully decompress.

Sources 8 notes

Can state-space models match transformers at copying and retrieval?

Two-layer transformers can copy exponentially long strings while state-space models are fundamentally limited by their fixed-size latent state. Empirically, transformers dramatically outperform SSMs at copying and context retrieval in both synthetic and pretrained settings.

Can verification separate structural near-misses from topical matches?

A two-stage pipeline—pooled-cosine recall followed by a small Transformer verifier operating on token-token similarity maps—reliably rejects structural near-misses that MaxSim-style late interaction cannot. The verifier succeeds because it operates on full token interaction patterns rather than compressed vectors.

Do LLMs compress concepts more aggressively than humans do?

Using Rate-Distortion Theory on cognitive datasets, LLMs capture broad category structure but lose fine-grained distinctions humans preserve. LLMs maximize compression efficiency; humans trade compression for contextual meaning that enables situated action.

Can external managers compress context better than frozen agents?

An external RL-trained manager can adaptively prune context for frozen agents, with the key insight that stronger agents benefit from high-fidelity preservation while weaker agents need aggressive compression to stay reliable.

Does fixed sparsity work for all sequence lengths?

Longer sequences tolerate significantly higher sparsity levels than shorter ones without performance loss. Fixed-budget sparse attention is suboptimal in production; budgets should adapt per input based on context length and other request properties.

How do logic units preserve procedural coherence better than chunks?

THREAD replaces chunks with four-part logic units—prerequisite, header, body, linker—enabling dynamic multi-step retrieval for how-to questions. Linkers explicitly navigate between steps and branches, addressing both the semantic-vs-task-relevance gap in embeddings and the sequential dependency loss in chunk-based RAG.

Can agents compress their own memory without losing critical details?

DeepAgent's autonomous memory folding consolidates interaction history into episodic, working, and tool memory schemas. This reduces token overhead while letting agents pause to reconsider strategies—the autonomy and structure together avoid degradation that plagues poorly designed consolidation.

Is long-context bottleneck really about memory or compute?

Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.
