INQUIRING LINE

An AI feature isn't abstract by nature — it's abstract because of where in the network's processing chain it lives.

What makes a feature abstract versus concrete in neural network activations?

This explores what actually distinguishes an 'abstract' feature from a 'concrete' one inside a network's activations — and the corpus suggests the dividing line is less about the feature itself than about how deep into the network's processing it sits and how the model got there.

The cleanest answer in the corpus comes from circuit tracing in Claude-family models, which finds features arranged in four tiers: raw token-level inputs at the bottom, then abstract concepts, then functional operations, then outputs (see "How do language models organize features across processing layers?"). On this view, 'concrete' and 'abstract' aren't two kinds of thing; they're positions on a processing gradient. A concrete feature stays close to the surface form of the input (this token, this string); an abstract feature has been lifted away from any particular surface realization into a concept the model reuses across many inputs. Tellingly, larger models develop richer abstract tiers, which the circuit-tracing note reads as evidence that scale buys higher-level conceptual reasoning rather than just more memorized patterns.

Why would abstraction emerge from depth at all? One answer is mathematical: predicting your own latent representations recovers compositional, hierarchical structure with exponentially fewer samples than predicting raw tokens, because features at the same level of abstraction are far more correlated with each other than the noisy tokens beneath them (see "Why is predicting latents more sample-efficient than tokens?"). Abstraction, in other words, is where the learnable signal lives. The same theme shows up in how networks naturally carve compositional tasks into isolated modular subnetworks, each implementing a reusable subroutine: abstraction as the reuse of a learned piece across novel combinations (see "Do neural networks naturally learn modular compositional structure?"). That ability strengthens with scale and broad training coverage (see "Can neural networks learn compositional skills without symbolic mechanisms?").

But here's the part you might not expect: a feature being abstract is not the same as it being well-organized, and our usual tools quietly confuse the two. Standard analysis methods (PCA, linear regression, RSA) systematically over-report simple linear features and under-report the nonlinear ones, to the point where a network can compute perfectly well with activation structure that looks like noise to those tools (see "Do standard analysis methods hide nonlinear features in neural networks?"). So 'concrete vs abstract' as humans label it may partly be an artifact of what our instruments can see, not a property the network respects. The 'abstract' features we celebrate finding may be the ones that happened to land in a linearly decodable form.
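The measurement-bias point above can be made concrete with a toy case. The sketch below is only an illustration under stated assumptions: the two-unit "activations" are a hypothetical symmetric grid rather than real model data, the feature is assumed to be encoded as the product of the two units, and Pearson correlation stands in for a linear probe. Under those assumptions, every linear readout finds essentially nothing, while a readout matched to the encoding recovers the feature.

```python
import math
from itertools import product


def pearson(u, v):
    """Pearson correlation between two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)


# Hypothetical two-unit "activations" on a symmetric grid (not real model data).
grid = [i / 5 for i in range(-5, 6)]  # -1.0, -0.8, ..., 1.0
acts = list(product(grid, grid))

# Assumed toy encoding: the feature is the product of the two units.
feature = [x * y for x, y in acts]

# Linear readouts: each unit alone, then an arbitrary weighted sum of both.
r_unit0 = pearson([x for x, _ in acts], feature)
r_unit1 = pearson([y for _, y in acts], feature)
r_linear = pearson([0.7 * x - 0.3 * y for x, y in acts], feature)

# Nonlinear readout that matches how the feature is actually encoded.
r_product = pearson([x * y for x, y in acts], feature)

print(f"unit 0 alone:     {abs(r_unit0):.3f}")
print(f"unit 1 alone:     {abs(r_unit1):.3f}")
print(f"weighted sum:     {abs(r_linear):.3f}")
print(f"product readout:  {abs(r_product):.3f}")
```

The weights 0.7 and -0.3 are arbitrary. On this grid any weighted sum of the two units is uncorrelated with the product, so a linear probe would report the feature as absent even though the downstream computation could use it freely.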

That caveat sharpens when you look at fractured representations. Two networks can produce identical outputs yet have radically different internal organization, one clean and reusable, the other entangled and brittle, and weight perturbation reveals the difference where accuracy never would (see "Can identical outputs hide broken internal representations?"). A feature can be linearly decodable (looks abstract, looks clean) while the underlying organization is broken, leaving the model fragile to distribution shift (see "Can models be smart without organized internal structure?"). Real abstraction, the kind that transfers and recombines, is exactly what the binding problem says distributed networks struggle to maintain: segregating entities, keeping them representationally separate, and reusing them in new combinations (see "Why do neural networks fail at compositional generalization?").
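A toy version of the weight-perturbation idea is sketched below. It is not the procedure used in the cited work. It is a minimal stand-in, assuming two hand-built linear networks (the names `clean` and `entangled` and all weights are hypothetical) that compute exactly the same function, one through a single unit and one through two large contributions that cancel. Output accuracy cannot tell them apart, but a small relative nudge to any single weight can.

```python
def forward(weights, x):
    """Tiny linear net: hidden unit i is w_in[i] * x; output sums w_out[i] * hidden_i."""
    w_in, w_out = weights
    return sum(wo * wi * x for wi, wo in zip(w_in, w_out))


def worst_case_shift(weights, inputs, eps=0.01):
    """Largest output change caused by scaling any single weight by (1 + eps)."""
    w_in, w_out = weights
    worst = 0.0
    for layer in (0, 1):  # 0 = input weights, 1 = output weights
        for i in range(len(w_in)):
            perturbed = [list(w_in), list(w_out)]
            perturbed[layer][i] *= 1 + eps
            shift = max(
                abs(forward(perturbed, x) - forward(weights, x)) for x in inputs
            )
            worst = max(worst, shift)
    return worst


inputs = [-1.0, -0.5, 0.0, 0.5, 1.0]

# Hypothetical networks, both computing y = x.
clean = ([1.0], [1.0])                     # one unit carries the signal
entangled = ([1.0, 1.0], [101.0, -100.0])  # two large contributions that cancel

same_outputs = all(forward(clean, x) == forward(entangled, x) for x in inputs)
clean_shift = worst_case_shift(clean, inputs)
entangled_shift = worst_case_shift(entangled, inputs)

print("identical outputs:", same_outputs)
print("clean shift:      ", round(clean_shift, 4))
print("entangled shift:  ", round(entangled_shift, 4))
```

The cancellation weights are chosen only to make the contrast visible. In this toy the same 1% nudge moves the entangled network's output roughly a hundred times further than the clean one's, which is the kind of difference an accuracy metric on unperturbed weights would never show.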

The thing you didn't know you wanted to know: even how dense or sparse a feature's activation is turns out to be learned, not fixed. Networks fire densely for familiar training data and fall back to sparse representations for unfamiliar inputs (see "Is representational sparsity learned or intrinsic to neural networks?"). So the abstract/concrete character of a feature isn't a stable property stamped into the architecture. It's a moving target shaped by what the model has seen, where in the stack you look, and whether your measuring tool can perceive nonlinear structure at all.
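For readers who want to check the density claim on their own activations, one simple way to quantify "dense versus sparse" is the fraction of units whose activation magnitude clears a threshold. The sketch below assumes that metric and that threshold. The source notes do not say which sparsity measure the underlying work uses, and the two activation vectors are invented purely to exercise the function, so the contrast between them is built in by construction and is not evidence for the claim.

```python
def active_fraction(activations, threshold=1e-3):
    """Fraction of units whose absolute activation exceeds `threshold` (higher = denser)."""
    if not activations:
        raise ValueError("activations must be non-empty")
    return sum(1 for a in activations if abs(a) > threshold) / len(activations)


# Hypothetical activation vectors, invented for illustration only.
familiar_input = [0.8, -0.4, 0.3, 0.0, 0.6, -0.2, 0.5, 0.1]
unfamiliar_input = [0.0, 0.0, 0.9, 0.0, 0.0, 0.0, -0.7, 0.0]

dense_score = active_fraction(familiar_input)     # 7 of 8 units active
sparse_score = active_fraction(unfamiliar_input)  # 2 of 8 units active

print("familiar:  ", dense_score)
print("unfamiliar:", sparse_score)
```

To test the learned-sparsity claim, the same function would be applied to real hidden-layer activations collected on in-distribution and out-of-distribution inputs, and the two distributions of scores compared.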

Sources 9 notes

How do language models organize features across processing layers?

Circuit tracing in Claude models reveals features progress from token-level inputs to abstract concepts to functional operations to outputs. Larger models develop richer abstract features, suggesting scaling enables higher-level conceptual reasoning rather than pattern memorization.

Why is predicting latents more sample-efficient than tokens?

A formal sample-complexity analysis proves latent-level self-supervision (data2vec/JEPA style) recovers compositional structure with samples constant in hierarchy depth, while token-level learning requires exponential samples—because same-level latents are far more correlated than raw tokens.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Can neural networks learn compositional skills without symbolic mechanisms?

Standard MLPs achieve compositional generalization through data and model scaling alone, without architectural modifications, provided the training distribution sufficiently covers combinations of task modules. Linear decodability of constituents from hidden activations reliably predicts success.

Do standard analysis methods hide nonlinear features in neural networks?

PCA, linear regression, and RSA over-represent simple linear features while under-representing equally important nonlinear features. Homomorphic encryption demonstrates that networks can compute perfectly well with no interpretable activation structure, proving representation patterns and computation can be entirely decoupled.

Can identical outputs hide broken internal representations?

Networks trained with SGD reproduce outputs perfectly while having radically different internal structure than evolved networks, with weight perturbations revealing fractured, entangled representations that prevent transfer to novel contexts or creative recombination.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Why do neural networks fail at compositional generalization?

Greff et al. argue that neural networks cannot dynamically bind distributed information into compositional structures due to three failures: segregating entities from inputs, maintaining representational separation, and reusing learned structure in novel combinations. Scaling can partially overcome this by enabling compositional representations to emerge.

Is representational sparsity learned or intrinsic to neural networks?

During pretraining, neural networks develop dense activations for familiar training data and default to sparse representations for unfamiliar inputs. This trend emerges without task-specific fine-tuning and reflects how models consolidate knowledge through exposure.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an interpretability researcher auditing how neural networks organize abstract versus concrete features. The question: what principled distinction—if any—separates abstract features from concrete ones in activation space?

What a curated library found — and when (dated claims, not current truth):
Findings span 2020–2026. Key constraints reported:
- Features arrange in a four-tier hierarchy (token → concept → operation → output); abstraction is positional depth, not category (2025–2026).
- Predicting latent representations is exponentially more sample-efficient than predicting tokens because latents at the same level of abstraction are far more correlated with each other than raw tokens (2026).
- Standard interpretability tools (PCA, linear regression, RSA) systematically over-report linearly decodable features and miss nonlinear structure; "abstract" features may be an artifact of measurement bias (2025).
- Two networks with identical outputs can have radically different internal organization—one compositional and reusable, one fractured and brittle; linearly decodable features can mask broken underlying binding (2025).
- Representational density (sparse vs. dense) is learned, not architectural; networks activate sparsely on OOD data (2026).

Anchor papers (verify; mind their dates):
- arXiv:2012.05208 (2020) On the Binding Problem in Artificial Neural Networks
- arXiv:2505.11581 (2025) Fractured Entangled Representations
- arXiv:2605.27734 (2026) Learn from your own latents
- arXiv:2603.03415 (2026) Farther the Shift, Sparser the Representation

Your task:
(1) RE-TEST the measurement-bias constraint: Have newer sparse autoencoders (SAEs), learned dictionaries, or nonlinear decoders (e.g., KAN-based, arXiv:2404.19756, 2024) actually recovered structure linear tools missed, or do they mostly refine linearly decodable features? Separate the durable question (do nonlinear features exist?) from the perishable claim (we can't see them).
(2) Surface the strongest work from the last 6 months that contradicts or supersedes the "four-tier hierarchy" or "abstraction = depth" framing.
(3) Propose two research questions that assume the regime has shifted: (a) If abstraction is learned density + nonlinear organization, can we build models that explicitly decouple binding from composition? (b) Do the "abstract" features SAEs now find differ fundamentally from what 2025 interpretability deemed abstract?

Cite arXiv IDs; flag anything you cannot ground.
