How can interpretability methods account for shifting representational density across task conditions?
This explores a tension for interpretability tools: if a network's activations get denser or sparser depending on how familiar or hard a task is, then any method that reads meaning off those activations has to treat density itself as a moving variable — not a fixed backdrop.
The corpus has two findings that, read together, turn this from an annoyance into a signal. First, density isn't random: models learn to fire densely for data they've seen a lot of and fall back to sparse representations for unfamiliar inputs, and this pattern emerges from pretraining exposure alone (Is representational sparsity learned or intrinsic to neural networks?). Second, when a task pushes a model out of distribution, hidden states sparsify in a localized, systematic way that tracks difficulty, and crucially this is an adaptive filter that stabilizes performance, not a breakdown (Do language models sparsify their activations under difficult tasks?). So density is doing work, and the shift carries information about how the model is relating to the current task.
That reframes the interpretability problem. Instead of asking "what does this neuron mean," you can ask "how dense is the representation here, and what does that density tell me about the model's footing?" Sparse activations on a hard or novel input aren't a place where your tools fail — they may be the model signaling unfamiliarity, and an interpretability method that measures sparsity-vs-density across conditions is reading that signal directly rather than fighting it.
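To make "how dense is the representation here" concrete, here is a minimal sketch of two common density measures and a comparison across conditions. It assumes activations are already extracted as plain lists of floats; the function names, the threshold of 1e-3, and the toy batches are illustrative assumptions, not something the cited notes specify.

```python
import math

def active_fraction(activations, threshold=1e-3):
    """Fraction of units whose magnitude exceeds threshold (a simple density proxy)."""
    if not activations:
        raise ValueError("activations must be non-empty")
    return sum(1 for a in activations if abs(a) > threshold) / len(activations)

def hoyer_sparsity(activations):
    """Hoyer sparsity: 0 for a uniform-magnitude vector, 1 for a one-hot vector."""
    n = len(activations)
    if n < 2:
        raise ValueError("need at least two units")
    l1 = sum(abs(a) for a in activations)
    l2 = math.sqrt(sum(a * a for a in activations))
    if l2 == 0.0:
        return 0.0  # convention chosen here for an all-zero vector (an assumption)
    return (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1)

def mean_density(batch, threshold=1e-3):
    """Average active fraction over a batch of activation vectors."""
    return sum(active_fraction(v, threshold) for v in batch) / len(batch)

def density_shift(familiar_batch, novel_batch, threshold=1e-3):
    """Positive when novel inputs yield sparser representations than familiar ones."""
    return mean_density(familiar_batch, threshold) - mean_density(novel_batch, threshold)

# Toy data (hypothetical): dense vectors for "familiar", sparse for "novel".
familiar = [[0.4, -0.3, 0.2, 0.1], [0.5, 0.1, -0.2, 0.3]]
novel = [[0.9, 0.0, 0.0, 0.0], [0.0, 0.0, -0.7, 0.1]]
shift = density_shift(familiar, novel)
```

The active fraction depends on the threshold, while Hoyer sparsity is scale-invariant, so reporting both gives a rough check that a measured shift is not just an artifact of overall activation magnitude.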
The corpus also suggests the cleaner move is to engineer for stable structure rather than reverse-engineer shifting density after the fact. Training with sparse weights forces compact, human-readable circuits where neurons map to simple concepts and ablations confirm the circuits are necessary and sufficient for the task, though scaling this beyond tens of millions of parameters remains unsolved (Can sparse weight training make neural networks interpretable by design?). And networks already tend to isolate compositional subroutines into separate subnetworks, with pretraining making that modular structure more consistent and reliable across architectures (Do neural networks naturally learn modular compositional structure?). If the units of meaning live in stable modules, density can swell or thin within them without scrambling what each part is for: the structure survives the shift.
There's a complementary angle: some task-condition differences live in clean geometric directions rather than diffuse density changes. Verbose versus concise chains of thought occupy distinct regions of activation space, separable as a single linear direction you can steer along (Can we steer reasoning toward brevity without retraining?). And the capabilities themselves can be conditionally present: base models already carry latent reasoning that minimal training merely elicits rather than builds (Do base models already contain hidden reasoning ability?), while expert skills can be composed at inference time by mixing expert vectors obtained from tuning only the singular values of weight matrices (Can models dynamically activate expert skills at inference time?). The lesson for interpretability is that "shifting density" isn't one phenomenon: part of it is the same circuits lighting up differently, and part of it is different machinery being switched in. A method that conflates the two will misread both.
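One rough way to keep the two cases apart is to compare which units are active separately from how a condition moves activations along a direction. The sketch below is an illustration under stated assumptions: the helper names and threshold are hypothetical, and difference of means is used as a common way to estimate a linear direction, not necessarily the procedure behind the cited steering result.

```python
import math

def active_set(activations, threshold=1e-3):
    """Indices of units whose magnitude exceeds threshold."""
    return {i for i, a in enumerate(activations) if abs(a) > threshold}

def support_overlap(a, b, threshold=1e-3):
    """Jaccard overlap of active-unit sets: near 1 suggests the same units
    re-weighted, near 0 suggests different units being switched in."""
    sa, sb = active_set(a, threshold), active_set(b, threshold)
    union = sa | sb
    if not union:
        return 1.0  # both vectors inactive: treated as identical support (an assumption)
    return len(sa & sb) / len(union)

def mean_vector(batch):
    """Per-unit mean over a batch of equal-length activation vectors."""
    n = len(batch)
    return [sum(col) / n for col in zip(*batch)]

def difference_of_means(batch_a, batch_b):
    """Estimate a direction separating two conditions as mean(a) - mean(b)."""
    ma, mb = mean_vector(batch_a), mean_vector(batch_b)
    return [x - y for x, y in zip(ma, mb)]

def project(vector, direction):
    """Scalar projection of vector onto the (normalized) direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return sum(v * d for v, d in zip(vector, direction)) / norm

# Toy data (hypothetical) for two conditions.
same_units = support_overlap([1.0, 0.5, 0.0, 0.0], [2.0, 0.1, 0.0, 0.0])
different_units = support_overlap([1.0, 0.5, 0.0, 0.0], [0.0, 0.0, 0.3, 0.8])
direction = difference_of_means([[2.0, 0.0], [4.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]])
score = project([1.0, 5.0], direction)
```

A high support overlap together with a large projection shift points toward the same circuits being driven differently; a low overlap points toward different machinery, which a single direction or a single density number would not capture on its own.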
The less obvious takeaway: density may be one of the more honest readouts a model gives you. A network that goes sparse on the unfamiliar is, in effect, flagging its own uncertainty in a way you can measure, so interpretability that tracks density across conditions isn't just coping with a confound; it's harvesting a built-in confidence signal.
Sources (7 notes)
During pretraining, neural networks develop dense activations for familiar training data and default to sparse representations for unfamiliar inputs. This trend emerges without task-specific fine-tuning and reflects how models consolidate knowledge through exposure.
As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification acts as a selective filter stabilizing performance under OOD shift rather than a failure mode.
Training transformers with sparse weights creates compact, human-interpretable circuits where neurons correspond to simple concepts with clear connections. Ablation studies confirm these circuits are necessary and sufficient for task performance, though scaling beyond tens of millions of parameters while maintaining interpretability remains unsolved.
Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.
Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Transformer² demonstrates that tuning only singular values within weight matrices produces composable expert vectors that dynamically mix at inference without interference, outperforming LoRA with fewer parameters and enabling continual specialization.