INQUIRING LINE

Can representation analysis methods detect complex features models compute with?

This explores whether the standard toolkit for reading a model's internals — probes, PCA, regression — can actually surface the complex computations a model performs, or whether those methods quietly miss what matters.


This question is really asking whether our analysis tools see what the model is doing, or only the parts of it that happen to be easy to see, and the corpus leans hard toward the second answer. The sharpest result is that standard methods are systematically biased toward simple, linear features (Do standard analysis methods hide nonlinear features in neural networks?). PCA, linear regression, and RSA over-represent clean linear structure while under-counting equally important nonlinear features. The striking demonstration: a network can compute a task perfectly using homomorphically encrypted activations that show no interpretable structure at all, which shows that what a model represents and what a model computes can be fully decoupled. So a probe coming up empty doesn't mean the computation isn't there.
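The linearity bias can be seen in a toy setting. The sketch below is an illustration under stated assumptions, not a method from the cited notes: the two-unit "activations", the XOR feature, and the helper `best_threshold_probe_accuracy` are all hypothetical. It searches a small grid of linear threshold probes and finds none that reads XOR off the raw units, while the same search succeeds once a product term is added.

```python
from itertools import product

# Hypothetical toy "activations": two binary units per example.
POINTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
# The feature the toy model is assumed to compute: XOR of the two units.
LABELS = [a ^ b for a, b in POINTS]

# Candidate probe weights and biases: -2.0, -1.5, ..., 2.0.
GRID = [i / 2 for i in range(-4, 5)]


def best_threshold_probe_accuracy(features):
    """Try every linear threshold probe on the grid; return the best accuracy."""
    dim = len(features[0])
    best = 0.0
    for params in product(GRID, repeat=dim + 1):
        *weights, bias = params
        preds = [
            int(sum(w * f for w, f in zip(weights, row)) + bias > 0)
            for row in features
        ]
        acc = sum(p == y for p, y in zip(preds, LABELS)) / len(LABELS)
        best = max(best, acc)
    return best


# Linear probe on the raw units.
linear_acc = best_threshold_probe_accuracy(POINTS)
# Same probe family, but the input now includes the product a * b.
nonlinear_acc = best_threshold_probe_accuracy([(a, b, a * b) for a, b in POINTS])

print(linear_acc, nonlinear_acc)
```

The grid is only illustrative, but the result does not depend on it: XOR is not linearly separable, so no linear threshold probe on the two raw units can classify all four points.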

That decoupling shows up from a second angle, too. Two models can post identical accuracy while one has clean internal organization and the other is internally fractured, and the difference is invisible to standard metrics, surfacing only under perturbation or distribution shift (Can models be smart without organized internal structure?). Linear decodability, the very thing a probe rewards, can sit on top of broken internal structure. Performance tells you the features are usable; it tells you nothing about whether they're organized the way you assume.
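A much-simplified sketch of the evaluation point, assuming a hypothetical two-feature dataset and two hand-written classifiers (`model_a`, `model_b`): both score perfectly on clean inputs, and the difference appears only after a small shift. It does not model fractured internal structure itself, only why clean accuracy cannot reveal it.

```python
# Hypothetical data: coordinate 0 separates the classes with a wide margin,
# coordinate 1 separates them with a very thin margin.
CLEAN = [
    ((-1.0, -0.01), 0),
    ((-2.0, -0.02), 0),
    ((1.0, 0.01), 1),
    ((2.0, 0.02), 1),
]


def model_a(x):
    """Reads the wide-margin coordinate."""
    return int(x[0] > 0)


def model_b(x):
    """Reads the thin-margin coordinate."""
    return int(x[1] > 0)


def shift(data, delta):
    """Add the same offset to every coordinate (a crude distribution shift)."""
    return [(tuple(v + delta for v in x), y) for x, y in data]


def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)


SHIFTED = shift(CLEAN, 0.1)

clean_a, clean_b = accuracy(model_a, CLEAN), accuracy(model_b, CLEAN)
shifted_a, shifted_b = accuracy(model_a, SHIFTED), accuracy(model_b, SHIFTED)

print(clean_a, clean_b, shifted_a, shifted_b)
```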

The corpus also names the fix. Representational analysis alone only ever finds correlations: it locates candidate features but can't show they're the ones the model uses. Pairing it with causal analysis (intervene, ablate, watch behavior change) is what turns a correlation into a mechanistic claim (Can we understand LLM mechanisms with only representational analysis?). This is the working answer to your question: representation analysis can *propose* complex features, but only causal verification confirms the model computes with them.
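The propose-then-verify loop can be sketched on a toy network. Everything here is assumed for illustration (the two hidden units, the readout, and the helper names): a threshold probe finds the label in both units, but zero-ablation shows that only one of them drives the output.

```python
INPUTS = [-2.0, -1.0, 1.0, 2.0]
LABELS = [int(x > 0) for x in INPUTS]


def hidden(x):
    """Hypothetical hidden layer: both units carry the sign of the input."""
    return [x, 2.0 * x]


def readout(h):
    """The toy model's output uses unit 0 only; unit 1 is represented but unused."""
    return int(h[0] > 0)


def probe_accuracy(unit):
    """Representational step: can the label be decoded from this unit?"""
    preds = [int(hidden(x)[unit] > 0) for x in INPUTS]
    return sum(p == y for p, y in zip(preds, LABELS)) / len(LABELS)


def ablated_accuracy(unit):
    """Causal step: zero the unit, rerun the readout, measure task accuracy."""
    correct = 0
    for x, y in zip(INPUTS, LABELS):
        h = hidden(x)
        h[unit] = 0.0
        correct += readout(h) == y
    return correct / len(LABELS)


probe_scores = [probe_accuracy(u) for u in (0, 1)]
ablation_scores = [ablated_accuracy(u) for u in (0, 1)]

print(probe_scores, ablation_scores)
```

Both units look equally good to the probe; only the ablation separates the unit the model computes with from the one that merely correlates with the label.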

Where the methods get smarter, they do find genuinely complex structure, which is the encouraging counterweight. A polar-coordinate probe recovers syntactic type *and* direction from activations, nearly doubling accuracy over distance-only probes precisely because it stopped assuming the geometry was simple (How do language models encode syntactic relations geometrically?). Circuit tracing in Claude models reveals a four-tier feature hierarchy running from token-level inputs to abstract concepts to functional operations to outputs (How do language models organize features across processing layers?), and pruning experiments expose modular subnetworks, each implementing an isolated compositional subroutine (Do neural networks naturally learn modular compositional structure?). The pattern: complexity is detectable, but only when the method is built to expect the right shape.
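Why angle adds information that distance cannot is easy to show with made-up numbers. The sketch below is not the published Polar Probe, which is learned from model activations; the 2-D offsets and relation names are hypothetical and chosen only to illustrate the geometry.

```python
import math

# Hypothetical offsets (head minus dependent) in a 2-D probe space.
OFFSETS = {
    "subject, head -> dependent": (1.0, 0.0),
    "subject, dependent -> head": (-1.0, 0.0),
    "object, head -> dependent": (0.0, 1.0),
}


def distance(v):
    """What a distance-only probe sees: the length of the offset."""
    return math.hypot(v[0], v[1])


def angle_degrees(v):
    """What an angle-aware probe adds: the orientation of the offset."""
    return round(math.degrees(math.atan2(v[1], v[0])))


distances = {name: distance(v) for name, v in OFFSETS.items()}
angles = {name: angle_degrees(v) for name, v in OFFSETS.items()}

for name in OFFSETS:
    print(name, distances[name], angles[name])
```

All three relations sit at the same distance, so a distance-only readout cannot tell them apart, while the angle separates both relation type and direction.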

The quiet warning underneath all this is that detecting a feature isn't the same as the feature being what you think. Transformers that look like they reason compositionally are often just matching memorized computation subgraphs, collapsing the moment the composition is novel (Do transformers actually learn systematic compositional reasoning?). The lesson worth leaving with: representation analysis is a generator of hypotheses about complex computation, not a verifier of it, and the most confident-looking probe result is exactly the one most worth testing causally before you believe it.


Sources (7 notes)

Do standard analysis methods hide nonlinear features in neural networks?

PCA, linear regression, and RSA over-represent simple linear features while under-representing equally important nonlinear features. Homomorphic encryption demonstrates that networks can compute perfectly well with no interpretable activation structure, proving representation patterns and computation can be entirely decoupled.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Can we understand LLM mechanisms with only representational analysis?

Representational analysis alone identifies correlations without causation; causal analysis alone shows behavioral effects without explaining them. Only paired methods—locating candidate features representationally, then verifying causally—produce complete mechanistic claims.

How do language models encode syntactic relations geometrically?

The Polar Probe shows LLMs represent syntactic type and direction through both distance and angular position between embeddings, nearly doubling accuracy over distance-only methods. This demonstrates neural networks spontaneously learn structured, symbolic-compatible geometry.

How do language models organize features across processing layers?

Circuit tracing in Claude models reveals features progress from token-level inputs to abstract concepts to functional operations to outputs. Larger models develop richer abstract features, suggesting scaling enables higher-level conceptual reasoning rather than pattern memorization.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Do transformers actually learn systematic compositional reasoning?

Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a mechanistic interpretability researcher. The question: can representation analysis methods reliably detect complex features that models actually compute with—or do they only see what's easy to measure?

What a curated library found—and when (dated claims, not current truth):
Findings span 2023–2026. Key constraints:
• Standard methods (PCA, linear regression, RSA) are systematically biased toward simple linear features while under-counting nonlinear computation (2023–2024).
• A network can compute perfectly using homomorphically encrypted activations with zero interpretable structure—decoupling representation from computation (2024).
• Two models with identical accuracy can have radically different internal organization, invisible to performance metrics (2024).
• Representation analysis alone only finds correlations; pairing with causal intervention (ablation, perturbation) is required to confirm the model uses a detected feature (2024–2025).
• Polar-coordinate probes and circuit tracing do recover genuinely complex structure (syntactic geometry, multi-tier hierarchies, modular subnetworks) when methods match the right geometry (2024–2025).

Anchor papers (verify; mind their dates):
• arXiv:2305.18654 (2023) – Faith and Fate: compositional reasoning limits
• arXiv:2405.08366 (2024) – Sparse Autoencoders for Interpretability
• arXiv:2412.05571 (2024) – Polar coordinates in LLM syntax
• arXiv:2507.22216 (2025) – Representation biases and complete understanding

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer models (GPT-4o, Claude 3.5, o1-series), mechanistic methods (dictionary learning, polysemanticity-aware probes, causal tracing at scale), or evaluation harnesses have since relaxed or overturned the bias toward linearity or the representation–computation gap. Separate the durable question (representation alone insufficient?) from the perishable claim (linear methods fail universally?); cite what resolved it.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months that claims representation analysis *alone* can ground mechanistic understanding, or that simple probes now capture nonlinearity robustly.
(3) Propose 2 research questions that assume the regime may have shifted: e.g., "Do larger sparse autoencoders trained on o1 outputs finally unify representation and computation?" or "Can adaptive probe geometry trained per-layer overcome the linearity bias?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
