INQUIRING LINE

Inside an AI, 'knowing what a capital city is' and actually reasoning with that knowledge live in completely different features.

How do functional features differ from representational abstract features?

This explores the difference between features that *encode what something is* (abstract representations of concepts) and features that *carry out operations on them* (the functional machinery a model uses), and how the corpus draws that line.

The cleanest map comes from circuit tracing in Claude, which finds a four-tier hierarchy: token-level inputs, then abstract concepts, then *functional* operations, then outputs (see "How do language models organize features across processing layers?"). Abstract features are the model's concepts: its internal idea of "capital city" or "plural noun." Functional features sit one tier up and act *on* those concepts; they are the verbs, not the nouns. The same work notes that bigger models grow richer abstract features, which suggests scaling buys higher-level conceptual vocabulary rather than just more memorized patterns.

The sharpest reason to keep the two separate is that representation and computation can come fully apart. One striking result shows networks can compute perfectly well with *no* interpretable activation structure at all: homomorphic encryption lets a model run the right function over scrambled internals, which shows that the pattern you can read off the activations and the operation actually being performed are decoupled (see "Do standard analysis methods hide nonlinear features in neural networks?"). So a "representational" feature is something an analyst can decode; a "functional" feature is something that, if you ablated it, would break a specific computation. Those need not be the same thing.
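The decode-versus-ablate distinction can be made concrete with a toy model. The sketch below is illustrative only and is not taken from any of the cited work: the two-unit network, its weights, and the helper names `forward`, `probe`, and `ablation_effect` are all assumptions chosen to make the point visible. Both hidden units encode the input, so both are "readable", but only one is wired to the output, so only one is "functional".

```python
def forward(x, ablate=None):
    """Toy model with two hidden units; only unit 0 is wired to the output."""
    hidden = [2.0 * x, -3.0 * x + 1.0]  # both units carry information about x
    if ablate is not None:
        hidden[ablate] = 0.0  # zero-ablate one hidden unit
    output = 0.5 * hidden[0] + 0.0 * hidden[1]  # unit 1 has zero output weight
    return hidden, output


def probe(hidden, unit):
    """Linear read-out that recovers x from a single hidden unit."""
    if unit == 0:
        return hidden[0] / 2.0
    return (1.0 - hidden[1]) / 3.0


def ablation_effect(x, unit):
    """Absolute change in the output when `unit` is zero-ablated."""
    _, clean = forward(x)
    _, ablated = forward(x, ablate=unit)
    return abs(clean - ablated)


if __name__ == "__main__":
    x = 1.5
    hidden, _ = forward(x)
    for unit in (0, 1):
        print(unit, probe(hidden, unit), ablation_effect(x, unit))
```

In this toy, unit 1 is perfectly decodable yet ablating it leaves the output untouched, while unit 0 is both decodable and causally necessary. A probe alone cannot tell the two apart; the ablation can.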

That decoupling is exactly why looking only at representations can mislead. A model can hold all the linearly decodable features a task needs while its internal organization is fractured and brittle: the representation looks complete, but the function it supports collapses under perturbation (see "Can models be smart without organized internal structure?"). The flip side shows up when you go hunting for the functional machinery directly: pruning experiments reveal that neural nets quietly split compositional tasks into isolated subnetworks, where knocking out one module disables only its corresponding operation (see "Do neural networks naturally learn modular compositional structure?"). Those subnetworks are functional features in the most literal sense, physically separable operations, and pretraining makes them more reliably modular.
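The knock-out test described above can be sketched on a hand-built network. This is a minimal illustration, not the procedure from the pruning experiments: the weights, the module names `triple_a` and `negate_b`, and the assignment of hidden units to modules are all hypothetical, and in real work the subnetwork masks would have to be discovered rather than written down.

```python
# Hand-built linear network: hidden units 0-1 serve output 0, units 2-3 serve output 1.
W_IN = [
    [1.0, 0.0],
    [2.0, 0.0],
    [0.0, 1.0],
    [0.0, 0.5],
]
W_OUT = [
    [1.0, 1.0, 0.0, 0.0],    # output 0 = 3 * a
    [0.0, 0.0, -0.5, -1.0],  # output 1 = -b
]
# Hypothetical module-to-unit assignment (assumed known for this sketch).
MODULES = {"triple_a": [0, 1], "negate_b": [2, 3]}


def forward(inputs, knocked_out=()):
    """Run the network, zeroing the hidden units of any knocked-out module."""
    hidden = [sum(w * x for w, x in zip(row, inputs)) for row in W_IN]
    for name in knocked_out:
        for unit in MODULES[name]:
            hidden[unit] = 0.0
    return [sum(w * h for w, h in zip(row, hidden)) for row in W_OUT]


if __name__ == "__main__":
    print(forward([2.0, 4.0]))                            # [6.0, -4.0]
    print(forward([2.0, 4.0], knocked_out=["negate_b"]))  # output 0 is still 6.0
    print(forward([2.0, 4.0], knocked_out=["triple_a"]))  # output 1 is still -4.0
```

Each knock-out removes exactly one operation and leaves the other intact, which is the signature the pruning experiments look for. A network without modular structure would show damage spread across both outputs.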

There's a lateral wrinkle worth noticing: abstract representations are often *geometric*, while functional behavior is *structural*. LLMs encode syntactic relations in polar coordinates, with relation type and direction read off from both the distance and the angle between embeddings, a spontaneously learned, symbol-compatible geometry that is pure representation (see "How do language models encode syntactic relations geometrically?"). But meaning-features are entangled: intervene on one semantic axis and aligned ones shift proportionally, so the representation isn't a clean set of independent dials (see "Do LLM semantic features organize along human evaluation dimensions?"). Functional features, by contrast, behave more like operations you can isolate and ablate. The binding problem frames why this matters: networks struggle to *dynamically* bind distributed representations into new compositional structures. That failure is functional, not representational, since the concepts are present but the machinery to recombine them on the fly is weak (see "Why do neural networks fail at compositional generalization?").
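The "aligned axes shift proportionally" point has a simple linear reading, sketched below under the assumption that an intervention is an additive shift along a unit-length axis. The vectors and the helper names (`steer`, `off_target_shift`) are made up for illustration and do not come from the cited note. Under that assumption, moving a point by `alpha` along one axis changes its projection onto any other axis by `alpha` times the cosine between the two axes.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def steer(x, axis, alpha):
    """Additive intervention: move x by alpha along a unit-length axis."""
    return [xi + alpha * ai for xi, ai in zip(x, axis)]


def off_target_shift(x, steered_axis, other_axis, alpha):
    """Change in x's projection onto other_axis caused by steering along steered_axis."""
    before = dot(x, other_axis)
    after = dot(steer(x, steered_axis, alpha), other_axis)
    return after - before


if __name__ == "__main__":
    # Toy 2-D axes: `aligned` sits at 45 degrees to `steered`, `orthogonal` at 90 degrees.
    steered = normalize([1.0, 0.0])
    aligned = normalize([1.0, 1.0])
    orthogonal = normalize([0.0, 1.0])
    x = [0.3, -0.7]
    print(off_target_shift(x, steered, aligned, alpha=2.0))     # about 2 * cos(45 deg)
    print(off_target_shift(x, steered, orthogonal, alpha=2.0))  # about 0
```

Only an axis orthogonal to the steered one is left alone. Any correlated axis moves in proportion to its alignment, which is why the dials cannot be turned independently.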

So the difference, across the corpus, is less a taxonomy than a warning: a feature you can *read* (representation) and a feature that *does work* (function) live at different tiers, can be physically separated by pruning, and can drift entirely apart — which is why a model can look well-organized and still fail, or look scrambled and still compute.

Sources 7 notes

How do language models organize features across processing layers?

Circuit tracing in Claude models reveals features progress from token-level inputs to abstract concepts to functional operations to outputs. Larger models develop richer abstract features, suggesting scaling enables higher-level conceptual reasoning rather than pattern memorization.

Do standard analysis methods hide nonlinear features in neural networks?

PCA, linear regression, and RSA over-represent simple linear features while under-representing equally important nonlinear features. Homomorphic encryption demonstrates that networks can compute perfectly well with no interpretable activation structure, proving representation patterns and computation can be entirely decoupled.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

How do language models encode syntactic relations geometrically?

The Polar Probe shows LLMs represent syntactic type and direction through both distance and angular position between embeddings, nearly doubling accuracy over distance-only methods. This demonstrates neural networks spontaneously learn structured, symbolic-compatible geometry.

Do LLM semantic features organize along human evaluation dimensions?

Twenty-eight semantic axes in LLM embeddings reduce to three principal components matching human EPA structure. Intervening on one feature predictably shifts aligned features proportionally, creating unavoidable off-target effects that reflect how meaning is fundamentally organized.

Why do neural networks fail at compositional generalization?

Greff et al. argue that neural networks cannot dynamically bind distributed information into compositional structures due to three failures: segregating entities from inputs, maintaining representational separation, and reusing learned structure in novel combinations. Scaling can partially overcome this by enabling compositional representations to emerge.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Break It Down: Evidence for Structural Compositionality in Neural Networks (3.44 match)
Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control (3.29 match)
Semantic Structure in Large Language Model Embeddings (2.51 match)
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning (2.42 match)
Scaling can lead to compositional generalization (1.79 match)
From Frege to chatGPT: Compositionality in language, cognition, and deep neural networks (1.71 match)
Hierarchical Concept Geometry in Language Models Emerges from Word Co-occurrence (1.67 match)
Large Concept Models: Language Modeling in a Sentence Representation Space (1.65 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a mechanistic interpretability researcher re-testing claims about the representational–functional feature split in LLMs. The question: *Do abstract representational features and functional computational features truly decouple, and if so, how?* This remains open.

What a curated library found — and when (dated claims, not current truth):
Findings span 2020–2026; treat as perishable.
- Circuit tracing identifies a four-tier hierarchy (tokens → abstract concepts → functional operations → outputs), with bigger models growing richer abstract features (~2024–2025).
- Representational features (linearly decodable activations) and functional features (ablation-sensitive operations) can fully decouple; homomorphic encryption proves models compute correctly over scrambled internals, decoupling pattern readability from actual computation (~2024).
- Models can hold all linearly decodable features for a task while remaining functionally brittle under perturbation; representation completeness masks fragile internal organization (~2024).
- Neural nets modularize compositional tasks into isolable subnetworks (functional features as physical modules); pruning disables operations in isolation (~2023–2024).
- LLM syntax is encoded in polar coordinates (relation type and direction read from both distance and angle between embeddings), a pure geometric representation; semantic features remain entangled, suggesting different organizational regimes for syntax vs. semantics (~2024–2025).
- Binding problem: models struggle to dynamically compose distributed representations into novel structures — a functional, not representational, failure (~2020).

Anchor papers (verify; mind their dates):
- arXiv:2012.05208 (2020): Binding problem, compositional rigidity.
- arXiv:2405.08366 (2024) — Sparse autoencoders, mechanistic control and representation bias.
- arXiv:2412.05571 (2024) — Polar geometry in syntax.
- arXiv:2508.10003 (2025) — Semantic structure and entanglement.

Your task:
(1) **RE-TEST each decoupling claim.** For representation–function separation: have newer sparse autoencoders, mechanistic probes, or intervention methods since 2024 *tightened* or *loosened* the coupling? Check whether scaling, LoRA, or in-context adaptation changes modular decomposition stability. Separate the durable question (do representational and functional tiers exist?) from perishable findings (current degree of decoupling under current training).
(2) **Surface the sharpest *reconciliation* or *contradiction* from the last 6 months.** The library leans toward decoupling; does recent work (esp. 2025–2026) find tighter binding, emergent co-optimization, or task-dependent coupling regimes? Name the paper and conflict.
(3) **Propose 2 research questions that assume the regime has moved:** (a) If functional modularity strengthens with scale or training procedure X, what *creates* it — architectural inductive bias, objective pressure, or data structure? (b) Can you *design* training to decouple representations from functions intentionally, or does the split emerge unavoidably?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
