INQUIRING LINE

Can we balance interpretability with the efficiency gains of compressed inter-model communication?

This explores the tension between models communicating in compact latent representations (faster, denser than passing text back and forth) and our ability to read what they're actually saying — and whether the corpus offers ways to keep both.


This reads the question as a real trade-off: the moment models stop talking to each other in plain text and start exchanging compressed internal states, you gain bandwidth and lose legibility. Worth saying up front — the corpus doesn't have a paper that literally studies models gossiping in latent codes, but it has a lot to say about each half of the bargain, and the lateral picture is more hopeful than the framing suggests.

First, why compressed communication is tempting at all. Learning in latent space rather than over tokens can be exponentially more sample-efficient, because latents at the same level of a hierarchy are far more correlated than raw tokens [Why is predicting latents more sample-efficient than tokens?]. Text itself is a lossy abstraction that throws away physics, geometry, and causality, so passing it between models compounds the loss [Are text-only language models fundamentally limited by abstraction?]. You can even intervene directly on the hidden representations of a frozen model and beat LoRA-style weight updates with 10–50x fewer parameters [Can editing hidden representations beat weight updates for finetuning?]. So the efficiency case for working in compressed internal states, not text, is strong.
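To make the "edit the representation, not the weights" idea concrete, here is a minimal NumPy sketch of a LoReFT-style intervention, h + Rᵀ(Wh + b − Rh), applied to a single hidden vector. The sizes `d` and `r`, the random initialization, and the function name `loreft_intervene` are assumptions for illustration; nothing here is trained, and it is not the reference implementation.

```python
import numpy as np

def loreft_intervene(h, R, W, b):
    """LoReFT-style edit of a hidden state: h + R^T (W h + b - R h).

    h: (d,) hidden state produced by a frozen model
    R: (r, d) low-rank projection with orthonormal rows
    W: (r, d) learned linear map; b: (r,) learned bias
    """
    return h + R.T @ (W @ h + b - R @ h)

rng = np.random.default_rng(0)
d, r = 64, 4  # hypothetical hidden size and intervention rank

# Build orthonormal rows for R from the QR decomposition of a random matrix.
Q, _ = np.linalg.qr(rng.normal(size=(d, r)))
R = Q.T
W = rng.normal(scale=0.01, size=(r, d))  # untrained placeholder values
b = np.zeros(r)

h = rng.normal(size=d)
h_edited = loreft_intervene(h, R, W, b)

# The edit moves h only inside the r-dimensional subspace spanned by R's rows.
delta = h_edited - h
residual = delta - R.T @ (R @ delta)

# Only R, W, and b would be trained; the base model's weights stay frozen.
n_intervention_params = R.size + W.size + b.size
```

The point of the sketch is the shape of the trade: the learned parameters scale with `r * d` rather than with the size of the model's weight matrices, which is where the parameter savings come from.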

The catch is the one the question names. Internal representations can look perfectly functional on every metric while being structurally broken underneath: two models with identical accuracy can differ in whether their internal organization is fractured and fragile under perturbation [Can models be smart without organized internal structure?]. If two models are exchanging those states, the breakage propagates invisibly. Compression buys speed at the price of exactly the legibility you'd need to catch the problem.

Here's the part the reader may not expect: interpreting compressed states is now demonstrably tractable, not a lost cause. Sparse autoencoders pull abstract, human-readable features out of a production model like Claude 3 Sonnet, features that both respond to and causally steer behavior [Can dictionary learning scale to production language models?]. And you can build legibility in from the start: training with sparse weights forces modular, disentangled circuits where neurons map to simple concepts you can actually trace [Can sparse weight training make neural networks interpretable by design?]. That reframes the question. Interpretability isn't only something you recover after the fact by decoding back to text; it can be a property you architect into the compressed channel itself.
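As a rough picture of what a sparse autoencoder over a model's activations does, the sketch below encodes activations into a wider, non-negative feature vector, reconstructs them, and scores the result with a reconstruction term plus an L1 sparsity penalty. It also shows the simplest form of "steering": adding a feature's decoder direction back into the activation. All sizes, the random weights, and the function names are assumptions; this is a generic textbook form, not the exact recipe used on Claude 3 Sonnet.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 16, 64  # hypothetical sizes; the dictionary is overcomplete

# Untrained placeholder parameters, for illustration only.
W_enc = rng.normal(scale=0.1, size=(d_model, n_features))
b_enc = np.zeros(n_features)
W_dec = rng.normal(scale=0.1, size=(n_features, d_model))
b_dec = np.zeros(d_model)

def encode(x):
    # ReLU keeps feature activations non-negative; training pushes most to zero.
    return np.maximum(0.0, x @ W_enc + b_enc)

def decode(f):
    # Each row of W_dec is one feature's direction in activation space.
    return f @ W_dec + b_dec

def sae_loss(x, l1_coeff=1e-3):
    f = encode(x)
    x_hat = decode(f)
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    sparsity = np.mean(np.sum(f, axis=-1))  # equals the L1 norm because f >= 0
    return recon + l1_coeff * sparsity

def steer(x, feature_idx, alpha):
    # Push activations along one feature's decoder direction.
    return x + alpha * W_dec[feature_idx]

x = rng.normal(size=(8, d_model))  # stand-in for a batch of model activations
features = encode(x)
loss = sae_loss(x)
x_steered = steer(x, feature_idx=3, alpha=2.0)
```

In an inter-model setting, the same encoder could in principle be run on whatever latent is being passed between models, giving a readable feature list alongside the compressed message; that use is an extrapolation from the notes, not something they report.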

So the honest answer is a qualified yes, with a boundary. Decoding-time approaches that leave base model weights untouched preserve knowledge better than methods that overwrite them [Can decoding-time tuning preserve knowledge better than weight fine-tuning?], which suggests, by analogy, that you can layer an interpretability lens onto a compressed channel without corrupting it. But the open frontier is scale: dictionary learning works at production size, yet building interpretability-by-construction past tens of millions of parameters is still unsolved [Can sparse weight training make neural networks interpretable by design?]. The balance is achievable in principle today; whether it holds at the scale where compressed inter-model communication would actually pay off is the unanswered question.
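The decoding-time idea can be sketched in a few lines. In proxy-tuning, the large base model's next-token logits are shifted by the difference between a small tuned "expert" and its untuned "anti-expert", and the base model's weights are never touched. The toy logits and function names below are assumptions for illustration; real use would take the three logit vectors from three actual models sharing a vocabulary.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability before exponentiating.
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def proxy_tuned_probs(base_logits, expert_logits, antiexpert_logits):
    # Shift the base model's logits by what tuning changed in the small model.
    return softmax(base_logits + (expert_logits - antiexpert_logits))

# Toy three-token vocabulary with made-up logits.
base = np.array([2.0, 1.0, 0.0])
expert = np.array([0.0, 3.0, 0.0])
antiexpert = np.array([0.0, 0.0, 0.0])

probs = proxy_tuned_probs(base, expert, antiexpert)
preferred_token = int(np.argmax(probs))

# With no difference between expert and anti-expert, the base model is unchanged.
unchanged = proxy_tuned_probs(base, expert, expert)
```

The design choice worth noticing is that the adjustment lives entirely outside the base model, which is the property the paragraph above leans on when it suggests an interpretability layer could sit on a channel without altering it.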


Sources (7 notes)

Why is predicting latents more sample-efficient than tokens?

A formal sample-complexity analysis proves latent-level self-supervision (data2vec/JEPA style) recovers compositional structure with samples constant in hierarchy depth, while token-level learning requires exponential samples—because same-level latents are far more correlated than raw tokens.

Are text-only language models fundamentally limited by abstraction?

Text strips the physics, geometry, and causality present in reality, forcing language models to manipulate symbols without grounding in their source dynamics. This creates predictable failure modes in physical, geometric, and causal reasoning that multimodal training could address.

Can editing hidden representations beat weight updates for finetuning?

ReFT learns task-specific interventions on frozen model representations rather than updating weights, with LoReFT (low-rank linear subspace variant) dramatically outperforming LoRA across reasoning, instruction-following, and NLU benchmarks while using far fewer parameters.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Can dictionary learning scale to production language models?

Sparse autoencoders extract high-quality, abstract, multilingual features from Claude 3 Sonnet that both respond to and causally influence model behavior. The work demonstrates interpretability is tractable at production scale, not limited to toy models.

Can sparse weight training make neural networks interpretable by design?

Training transformers with sparse weights creates compact, human-interpretable circuits where neurons correspond to simple concepts with clear connections. Ablation studies confirm these circuits are necessary and sufficient for task performance, though scaling beyond tens of millions of parameters while maintaining interpretability remains unsolved.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.

Next inquiring lines