INQUIRING LINE

Does the linear representation hypothesis reflect networks or reflect our analysis tools?

This explores whether the well-known finding that neural networks store concepts as straight-line directions is a real property of the networks — or an illusion created by the fact that the tools we use to look inside them can only see straight lines.


The question is whether the linear representation hypothesis describes networks or describes our microscopes. The corpus comes down hard on a key worry: the methods themselves are biased witnesses. The most direct evidence is that standard interpretability tools (PCA, linear regression, RSA) systematically over-report simple linear features and under-report equally important nonlinear ones (see "Do standard analysis methods hide nonlinear features in neural networks?"). The striking demonstration there is homomorphic encryption: a network can compute perfectly while its activation patterns have no interpretable structure at all, showing that what we *see* in representations and what the network actually *computes* can be fully decoupled. So at minimum, a linear-only lens will always come back reporting linearity, and that is a property of the lens.
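The lens effect can be reproduced in a toy setting. The sketch below is an illustration under stated assumptions, not the method of any cited note: the "activations" are random 2-D Gaussian vectors, the two features are invented for the example, and the helper names (`fit_linear_readout`, `held_out_accuracy`) are hypothetical. One feature is linear in the activations; the other is XOR-like (the sign of a product), so it is fully determined by the activations yet sits near chance for a purely linear readout.

```python
import numpy as np

def fit_linear_readout(features, labels):
    """Least-squares linear readout for +1/-1 labels.
    No bias term: the toy activations are zero-mean and the labels balanced."""
    weights, *_ = np.linalg.lstsq(features, labels, rcond=None)
    return weights

def readout_accuracy(weights, features, labels):
    return float(np.mean(np.sign(features @ weights) == labels))

rng = np.random.default_rng(0)
acts = rng.standard_normal((4000, 2))   # toy "activations" (assumption)
train, test = slice(0, 3000), slice(3000, 4000)

# Feature A is linear in the activations; feature B is XOR-like.
linear_feature = np.where(acts[:, 0] > 0, 1.0, -1.0)
xor_feature = np.where(acts[:, 0] * acts[:, 1] > 0, 1.0, -1.0)

def held_out_accuracy(features, labels):
    weights = fit_linear_readout(features[train], labels[train])
    return readout_accuracy(weights, features[test], labels[test])

acc_linear = held_out_accuracy(acts, linear_feature)
acc_xor = held_out_accuracy(acts, xor_feature)

# Same readout, but with the product term supplied as an extra input column,
# i.e. a lens that is able to see this particular nonlinearity.
acts_with_product = np.column_stack([acts, acts[:, 0] * acts[:, 1]])
acc_xor_product = held_out_accuracy(acts_with_product, xor_feature)

print(f"linear feature, linear readout:      {acc_linear:.2f}")
print(f"XOR feature, linear readout:         {acc_xor:.2f}")
print(f"XOR feature, readout + product term: {acc_xor_product:.2f}")
```

Both features are equally present in the activations. Only the choice of readout decides which one gets reported, which is the sense in which a linear-only tool returns a fact about itself.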

It gets worse for naive readings. Linear decodability, the ability to read a feature off with a linear probe, turns out to be a weak signal of real organization. A model trained with SGD can contain every linearly decodable feature a task needs while its internal structure is fundamentally fractured, leaving it brittle to perturbation and distribution shift in ways standard metrics never catch (see "Can models be smart without organized internal structure?"). In other words, 'we can linearly decode it' does not license 'the network represents it linearly.' The probe succeeding tells you about the probe.
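A small sketch can show why probe accuracy alone says little about robustness. It is a deliberately simplified stand-in, not the construction from the cited note: "fragile" here only means the label is encoded at a tiny scale relative to everything else, which is a far narrower notion than fractured internal organization. The representations, scales, noise level, and function names are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim = 1000, 8
labels = rng.choice([-1.0, 1.0], size=n)

def make_representation(signal_scale):
    """Column 0 carries the label at the given scale; other columns are unit noise."""
    rep = rng.standard_normal((n, dim))
    rep[:, 0] = signal_scale * labels
    return rep

def fit_probe(rep):
    """Least-squares linear probe with a bias term."""
    design = np.column_stack([rep, np.ones(n)])
    weights, *_ = np.linalg.lstsq(design, labels, rcond=None)
    return weights

def probe_accuracy(weights, rep):
    design = np.column_stack([rep, np.ones(n)])
    return float(np.mean(np.sign(design @ weights) == labels))

representations = {
    "robust": make_representation(signal_scale=2.0),
    "fragile": make_representation(signal_scale=0.05),
}

results = {}
for name, rep in representations.items():
    probe = fit_probe(rep)
    clean = probe_accuracy(probe, rep)
    # Small isotropic perturbation of the representation (hypothetical noise level).
    perturbed = probe_accuracy(probe, rep + 0.3 * rng.standard_normal(rep.shape))
    results[name] = (clean, perturbed)
    print(f"{name:8s} clean={clean:.2f} perturbed={perturbed:.2f}")
```

On clean data the probe scores identically on both representations, so the standard metric cannot tell them apart. The difference only appears once the representation is perturbed.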

But the corpus doesn't let you collapse into pure tool-skepticism, because there are cases where genuine structure shows up that the simplest linear story *misses*. The Polar Probe finds that LLMs encode syntactic relations using both distance *and* angle, a polar-coordinate geometry that nearly doubles accuracy over distance-only (i.e. flat-linear) methods (see "How do language models encode syntactic relations geometrically?"). That's evidence of real, spontaneously learned geometric structure that's richer than a single direction. Similarly, the leading eigenvectors of embedding Gram matrices peel taxonomy apart coarse-to-fine, tracking the WordNet hypernym tree level by level (see "Do embedding eigenvectors organize taxonomy from coarse to fine?"), and static embeddings carry measurable semantic content like valence and concreteness before attention even runs (see "Do transformer static embeddings actually encode semantic meaning?"). These aren't artifacts of choosing a linear tool: they're structure the network put there that survives scrutiny.
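The coarse-to-fine spectral ordering is easy to see in a hand-built case. The sketch below does not use real model embeddings: it assumes eight leaf items in a two-level binary taxonomy, with one embedding coordinate per split and larger scales for coarser splits (the scales 3, 2, 1 are arbitrary choices for illustration). It then eigendecomposes the Gram matrix and checks which split each leading eigenvector tracks.

```python
import numpy as np

# Hand-built taxonomy (assumption): leaves 0-3 under top branch A, leaves 4-7
# under top branch B; each top branch splits into two sub-branches of two leaves.
top_split = np.array([1, 1, 1, 1, -1, -1, -1, -1], dtype=float)
sub_split_a = np.array([1, 1, -1, -1, 0, 0, 0, 0], dtype=float)
sub_split_b = np.array([0, 0, 0, 0, 1, 1, -1, -1], dtype=float)

# One coordinate per sibling pair, separating the two leaves in that pair.
leaf_splits = np.zeros((4, 8))
for pair in range(4):
    leaf_splits[pair, 2 * pair] = 1.0
    leaf_splits[pair, 2 * pair + 1] = -1.0

# Coarser distinctions are given larger scale (arbitrary illustrative values).
embeddings = np.column_stack([
    3.0 * top_split,
    2.0 * sub_split_a,
    2.0 * sub_split_b,
    1.0 * leaf_splits.T,
])  # shape (8 leaves, 7 coordinates)

gram = embeddings @ embeddings.T

# eigh returns eigenvalues in ascending order, so reverse to get leading ones first.
eigenvalues, eigenvectors = np.linalg.eigh(gram)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

leading = eigenvectors[:, 0]
print("eigenvalues:", np.round(eigenvalues, 6))
print("leading eigenvector signs:", np.sign(leading * leading[0]))
```

Because the split vectors are mutually orthogonal, the eigenvalues are 72, 16, 16, 2, 2, 2, 2, 0 up to floating-point error. The leading eigenvector separates the two top branches, the next two vary across sub-branches but are constant within each sibling pair, and only the remaining ones separate individual leaves. In this toy the ordering follows directly from the chosen scales; the cited note's claim is that real embeddings show the same ordering without it being built in.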

The resolution the corpus points to: it's both, and the interesting question is *when*. Networks really do consolidate structure: they grow dense representations for familiar data and stay sparse for unfamiliar inputs (see "Is representational sparsity learned or intrinsic to neural networks?"), and they spontaneously carve compositional tasks into isolated modular subnetworks (see "Do neural networks naturally learn modular compositional structure?"). But how cleanly that structure reads out depends heavily on how the network was *trained*, not just how it's analyzed: force weight sparsity and you get compact circuits where single neurons map to single concepts (see "Can sparse weight training make neural networks interpretable by design?"). That last result is the tell. If interpretable, near-linear structure can be *manufactured* by changing the training objective, then in ordinary networks linearity is partly real, partly a default the architecture drifts toward, and partly an echo of our looking. The honest answer is that the linear representation hypothesis is a claim about the *intersection* of a network and a probe, and the field's cleanest move is to stop asking 'is it linear?' and start asking 'what does this tool make impossible to see?'


Sources (8 notes)

Do standard analysis methods hide nonlinear features in neural networks?

PCA, linear regression, and RSA over-represent simple linear features while under-representing equally important nonlinear features. Homomorphic encryption demonstrates that networks can compute perfectly well with no interpretable activation structure, proving representation patterns and computation can be entirely decoupled.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

How do language models encode syntactic relations geometrically?

The Polar Probe shows LLMs represent syntactic type and direction through both distance and angular position between embeddings, nearly doubling accuracy over distance-only methods. This demonstrates neural networks spontaneously learn structured, symbolic-compatible geometry.

Do embedding eigenvectors organize taxonomy from coarse to fine?

Leading eigenvectors of embedding Gram matrices separate broad taxonomic branches first, then progressively finer sub-branches—a coarse-to-fine spectral order that tracks the WordNet hypernym tree level by level, confirming predictions from co-occurrence statistics.

Do transformer static embeddings actually encode semantic meaning?

Clustering analysis of RoBERTa embeddings reveals sensitivity to five psycholinguistic measures including valence, concreteness, iconicity, and taboo. This demonstrates that static embeddings function as genuine lexical entries containing semantic content before self-attention operates.

Is representational sparsity learned or intrinsic to neural networks?

During pretraining, neural networks develop dense activations for familiar training data and default to sparse representations for unfamiliar inputs. This trend emerges without task-specific fine-tuning and reflects how models consolidate knowledge through exposure.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Can sparse weight training make neural networks interpretable by design?

Training transformers with sparse weights creates compact, human-interpretable circuits where neurons correspond to simple concepts with clear connections. Ablation studies confirm these circuits are necessary and sufficient for task performance, though scaling beyond tens of millions of parameters while maintaining interpretability remains unsolved.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a mechanistic interpretability researcher. The question: Does the linear representation hypothesis describe networks' actual structure, or mainly artifacts of our analysis tools?

What a curated library found — and when (dated claims, not current truth):
Findings span 2019–2026. Core constraints reported:
• Standard interpretability tools (PCA, linear regression, RSA) systematically over-report linear features and under-report nonlinear ones; homomorphic encryption shows that a network can compute correctly while its activation patterns have no interpretable structure, so representation patterns and computation can be decoupled (~2024–2025).
• Linear decodability is a weak signal: models trained with SGD can pass linear probes while remaining brittle to perturbation, suggesting the probe succeeds but the network's internal organization is fractured (~2024).
• Counterevidence: LLMs encode syntax using polar-coordinate geometry (distance + angle), nearly doubling accuracy over distance-only methods; leading eigenvectors of embedding Gram matrices track the WordNet taxonomy coarse-to-fine (~2024–2025).
• Weight sparsity during training produces compact, interpretable circuits where single neurons map to concepts; separately, activations grow denser for familiar training data and stay sparse for unfamiliar inputs (~2024–2026).
• Sparse autoencoders and posterior inference methods are emerging as alternatives to linear decoders (~2024–2025).

Anchor papers (verify; mind their dates):
• arXiv:2412.05571 (2024-12): Polar coordinates in LLM syntax.
• arXiv:2511.13653 (2025-11): Weight-sparse transformers have interpretable circuits.
• arXiv:2507.22216 (2025-07): Representation biases — whether analysis tools fully blind us.
• arXiv:2605.23821 (2026-05): Hierarchical geometry from co-occurrence.

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding, judge whether post-2024 mechanistic work (sparse autoencoders, circuit analysis, modular decomposition, steering methods) has relaxed or overturned it. Separate the durable question — *when* do networks consolidate genuinely learnable structure vs. when is linearity a training default? — from perishable claims about tool bias. Cite what relaxed each constraint.
(2) Surface contradicting or superseding work from the last 6 months: does newer circuit-tracing or causal intervention undermine tool-bias claims, or deepen them?
(3) Propose 2 research questions that assume the regime may have shifted: e.g., "If weight sparsity forces interpretability, is linearity not intrinsic but *induced by* dense training?" or "Can we distinguish network structure from probe bias by comparing sparse vs. dense activations on identical tasks?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
