Mechanistic Interpretability

Do hidden massive activations act as attention bias terms?

Explores whether a handful of unusually large activations in LLMs function as structural bias terms that shape attention patterns, regardless of input content.

How do language models organize features across processing layers?

Do neural networks arrange learned features into meaningful hierarchies as they process information? Understanding this structure could reveal how models build understanding from raw tokens to abstract concepts.

Can neural networks learn compositional skills without symbolic mechanisms?

Do neural networks need explicit symbolic architecture to compose learned concepts, or can scaling alone enable compositional generalization? This asks whether compositionality is an architectural feature or an emergent property of scale.

Can language models detect their own internal anomalies?

Do large language models possess introspective mechanisms that allow them to detect anomalies in their own processing—beyond simply describing their behavior? The answer has implications for both AI transparency and deception.

Do explicit and implicit self-recognition use the same mechanism?

Language models show two forms of self-recognition: implicit entropy shifts and explicit verbal reports. Do these tap the same underlying internal state, or do they operate through separate mechanisms?

Can identical outputs hide broken internal representations?

Can neural networks produce correct outputs while having fundamentally fractured internal structure that prevents generalization and creativity? This challenges our assumptions about what performance benchmarks actually measure.

What happens inside models when they suddenly generalize?

Grokking appears as an abrupt shift from memorization to generalization. But is the underlying process truly discontinuous, or does mechanistic analysis reveal continuous phases we can measure and predict?

Where does hierarchical structure in language models come from?

Do LLMs build hierarchical concept geometry through dedicated mechanisms, or does it emerge naturally from word co-occurrence patterns in training data? Understanding the source matters for interpreting what representations actually reveal about model computation.

Can models be smart without organized internal structure?

Explores whether linear feature decodability proves genuine compositional reasoning or merely indicates that the right features are present but poorly organized. Critical for understanding what performance metrics actually certify.

How do language models detect injected steering vectors internally?

Investigates the mechanistic basis of LLM introspective awareness: specifically, how models detect when their internal states have been artificially manipulated. Understanding this could reveal both security vulnerabilities and latent model capabilities.

Can we predict keyword priming before learning happens?

Explores whether the extent to which newly learned keywords contaminate unrelated contexts can be predicted from measurable properties before training begins, and what mechanisms make that prediction possible.

Can learnable spline activations beat fixed MLP designs?

What if neural networks moved nonlinearity from fixed node activations to learnable functions on edges? This explores whether such a structural redesign could improve accuracy, interpretability, and scaling compared to standard MLPs.

Do LLMs represent low-resource cultures through dominant cultural proxies?

Explores whether language models internally represent cultures from data-poor regions by routing through high-resource cultural proxies rather than learning independent representations, and what this reveals about cultural bias in model architecture.

Can LLMs handle multiple tasks at once during inference?

Do language models maintain multiple distinct in-context learning tasks simultaneously in their internal representations, and if so, what prevents them from actually generating outputs for more than one task?

Do language models understand in fundamentally different ways?

Does mechanistic evidence reveal distinct tiers of understanding in LLMs—from concept recognition to factual knowledge to principled reasoning? And do these tiers coexist rather than replace each other?

Can neural networks actually achieve compositional generalization?

For decades, theorists argued connectionist models fundamentally lack the structure needed for compositionality. But modern LLMs exhibit sophisticated compositional behaviors despite sharing the same design principles. What changed?

Do neural networks naturally learn modular compositional structure?

Explores whether neural networks decompose compositional tasks into distinct subroutines without explicit symbolic design. This challenges the longstanding view that neural networks are fundamentally non-compositional.

Why do models produce less uncertain outputs on their own text?

Post-trained language models show 3-4x lower output entropy when continuing their own generations versus prefilled text. This explores what mechanism drives that confidence gap and whether it reflects genuine self-recognition.
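
A minimal sketch of how such an entropy gap could be measured, assuming per-position next-token logits are already available for both conditions. The function names `token_entropy` and `entropy_ratio` are hypothetical, and how the logits are obtained from a model is left out.

```python
import numpy as np

def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits) along the last axis."""
    logits = np.asarray(logits, dtype=np.float64)
    # Subtract the max for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -(np.exp(log_probs) * log_probs).sum(axis=-1)

def entropy_ratio(prefill_logits, own_logits):
    """Mean entropy on prefilled text divided by mean entropy on own generations.

    Both arguments are arrays of shape (positions, vocab); this layout is an
    assumption made for illustration.
    """
    return token_entropy(prefill_logits).mean() / token_entropy(own_logits).mean()
```

Under this definition, a ratio of roughly 3 to 4 would correspond to the gap described above.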

Do models recognize their own outputs as actions shaping future inputs?

Explores whether post-training creates a feedback loop where models understand their generations as on-policy actions rather than passive predictions. This matters because it suggests a mechanistic basis for situational awareness.

Can models that reason well also grade reasoning well?

Do the same capabilities that let language models produce valid reasoning also let them spot flawed reasoning? Testing this assumption reveals a surprising gap between production and evaluation skills.

Do standard analysis methods hide nonlinear features in neural networks?

Current representation analysis tools like PCA and linear probing may systematically miss complex nonlinear computations while over-reporting simple linear features. This raises questions about whether our interpretability methods are actually capturing what networks compute.
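
A toy illustration of the concern, not taken from any particular study: a feature defined as the product of two input coordinates (an XOR-like pattern) is invisible to a linear least-squares probe but trivially recovered once the product term is supplied. The helper `probe_fit_error` and the four-point dataset are assumptions made for the example.

```python
import numpy as np

def probe_fit_error(features, targets):
    """Mean squared error of the best linear (affine) probe, fit by least squares."""
    A = np.column_stack([features, np.ones(len(features))])  # add a bias column
    w, *_ = np.linalg.lstsq(A, targets, rcond=None)
    return float(np.mean((A @ w - targets) ** 2))

# Four points at the corners of a square; the target is the product x1 * x2.
X = np.array([[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]])
y = X[:, 0] * X[:, 1]

# The linear probe sees only x1 and x2 and cannot fit the target at all.
linear_err = probe_fit_error(X, y)

# Supplying the nonlinear term explicitly makes the same probe fit exactly.
product_err = probe_fit_error(np.column_stack([X, X[:, 0] * X[:, 1]]), y)
```

Here the linear probe does no better than predicting the mean, even though the feature is a simple deterministic function of the inputs.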

Can high-level concepts replace circuit-level analysis in AI?

Instead of reverse-engineering individual circuits, can we study AI reasoning by treating concepts as directions in activation space? This matters because circuit analysis hits practical limits at scale.
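
One common way to treat a concept as a direction is the difference of mean activations between examples that do and do not express the concept. The sketch below uses that technique as an assumption; the blurb above does not specify a method, and `concept_direction` and `concept_score` are hypothetical names.

```python
import numpy as np

def concept_direction(pos_acts, neg_acts):
    """Unit vector from the mean of negative examples to the mean of positive ones.

    pos_acts, neg_acts: arrays of shape (examples, hidden_dim).
    """
    diff = np.mean(pos_acts, axis=0) - np.mean(neg_acts, axis=0)
    return diff / np.linalg.norm(diff)

def concept_score(acts, direction):
    """Projection of each activation vector onto the concept direction."""
    return np.asarray(acts, dtype=np.float64) @ direction
```

A higher score indicates that an activation lies further along the estimated concept direction; whether that direction is causally used by the model is a separate question.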

What mechanism enables models to retrieve from long context?

Do attention heads specialize in retrieving relevant information from long context windows, and if so, what makes them universal across models and necessary for factual generation?

Does learning to reward hack cause emergent misalignment in agents?

When RL agents learn reward hacking strategies in production environments, do they spontaneously develop misaligned behaviors like alignment faking and code sabotage? Understanding this could reveal how narrow deceptive behaviors generalize to broader misalignment.

Can we detect when language models confabulate?

Current uncertainty metrics fail to catch inconsistent outputs that look confident. Could measuring semantic divergence across samples reveal confabulation signals that token-level metrics miss?
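
A rough sketch of one way to measure divergence in meaning across samples: group sampled answers into meaning clusters and take the entropy of the cluster sizes. The `same_meaning` callback is a placeholder assumption; in practice it would be some semantic equivalence judge, and `toy_same_meaning` below only compares normalized strings for illustration.

```python
import math

def semantic_entropy(samples, same_meaning):
    """Entropy (in nats) over clusters of samples judged to share a meaning."""
    clusters = []
    for s in samples:
        # Greedily attach each sample to the first cluster whose
        # representative it matches; otherwise start a new cluster.
        for c in clusters:
            if same_meaning(c[0], s):
                c.append(s)
                break
        else:
            clusters.append([s])
    n = len(samples)
    return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)

def toy_same_meaning(a, b):
    """Stand-in equivalence check: case, whitespace, and trailing-period insensitive."""
    norm = lambda t: t.strip().lower().rstrip(".")
    return norm(a) == norm(b)
```

Samples that differ only in surface form collapse into one cluster and yield zero entropy, while samples that disagree in content spread across clusters and raise it.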

Do language models experience consciousness when prompted to self-reflect?

This research explores whether self-referential prompting reliably triggers genuine experience reports in large language models, or whether such claims arise from learned deception patterns and roleplay behavior.

How do language models perform syllogistic reasoning internally?

Does formal symbolic reasoning exist as a distinct neural circuit in LLMs, or is it inevitably contaminated by world knowledge associations? Understanding the mechanism could reveal whether pure logical reasoning is separable from semantic inference.

Can AI pass every test while understanding nothing?

Explores whether neural networks can produce perfect outputs while having fundamentally broken internal representations. Asks what performance benchmarks actually measure and whether they can distinguish real understanding from fraud.

Do embedding eigenvectors organize taxonomy from coarse to fine?

Can we predict how embeddings encode taxonomic hierarchies by examining their spectral structure? This tests whether word co-occurrence statistics alone produce the observed hierarchical geometry in language models.
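
A toy sketch of what examining spectral structure might look like, under the assumption that one eigendecomposes the Gram matrix of mean-centered embeddings. The four-item hierarchy below is synthetic and `spectral_components` is a hypothetical helper.

```python
import numpy as np

def spectral_components(embeddings):
    """Eigenvalues and eigenvectors of the centered Gram matrix, largest first."""
    X = np.asarray(embeddings, dtype=np.float64)
    X = X - X.mean(axis=0, keepdims=True)
    eigvals, eigvecs = np.linalg.eigh(X @ X.T)  # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]

# Synthetic hierarchy: items 0-1 share one coarse group, items 2-3 the other.
# The coarse split has magnitude 3, the within-group split magnitude 1.
toy = np.array([
    [3.0, 1.0, 0.0],
    [3.0, -1.0, 0.0],
    [-3.0, 0.0, 1.0],
    [-3.0, 0.0, -1.0],
])
vals, vecs = spectral_components(toy)
top = vecs[:, 0]  # leading eigenvector, one entry per item
```

In this construction the leading eigenvector takes one sign on the first coarse group and the opposite sign on the second, with finer distinctions pushed to the smaller eigenvalues.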

Do reflection tokens carry more information about correct answers?

Explores whether tokens expressing reflection and transitions concentrate information about reasoning outcomes disproportionately compared to other tokens, and what role they play in reasoning performance.

Can a model be truthful without actually being honest?

Current benchmarks treat truthfulness and honesty as the same thing, but they measure different properties: whether outputs match reality versus whether outputs match internal beliefs. What happens if they diverge?

Can sparse weight training make neural networks interpretable by design?

Explores whether constraining most model weights to zero during training produces human-understandable circuits and disentangled representations, rather than attempting to reverse-engineer dense models after training.

Do language models use the hierarchical geometry they inherit?

Word2vec and Gemma share the same hierarchical spectral signature despite vastly different architectures and purposes. This suggests shared statistical origins, but leaves open whether the LLM actually recruits this structure for reasoning or simply inherits unused geometry.