← All notes

What actually happens inside the minds of language models?

How language models represent knowledge, what circuits implement reasoning, and what internal structure reveals about safety.

Topic Hub · 57 linked notes · 13 sections
View as

Representations

4 notes

Can identical outputs hide broken internal representations?

Can neural networks produce correct outputs while having fundamentally fractured internal structure that prevents generalization and creativity? This challenges our assumptions about what performance benchmarks actually measure.

Explore related Read →

Can neural networks learn compositional skills without symbolic mechanisms?

Do neural networks need explicit symbolic architecture to compose learned concepts, or can scaling alone enable compositional generalization? This asks whether compositionality is an architectural feature or an emergent property of scale.

Explore related Read →

Can models be smart without organized internal structure?

Explores whether linear feature decodability proves genuine compositional reasoning or merely indicates that the right features are present but poorly organized. Critical for understanding what performance metrics actually certify.

Explore related Read →

Do neural networks naturally learn modular compositional structure?

Explores whether neural networks decompose compositional tasks into distinct subroutines without explicit symbolic design. This challenges the longstanding view that neural networks are fundamentally non-compositional.

Explore related Read →

Paradigms

6 notes

Can high-level concepts replace circuit-level analysis in AI?

Instead of reverse-engineering individual circuits, can we study AI reasoning by treating concepts as directions in activation space? This matters because circuit analysis hits practical limits at scale.

Explore related Read →

Can sparse weight training make neural networks interpretable by design?

Explores whether constraining most model weights to zero during training produces human-understandable circuits and disentangled representations, rather than attempting to reverse-engineer dense models after training.

Explore related Read →

Do standard analysis methods hide nonlinear features in neural networks?

Current representation analysis tools like PCA and linear probing may systematically miss complex nonlinear computations while over-reporting simple linear features. This raises questions about whether our interpretability methods are actually capturing what networks compute.

Explore related Read →

Can we steer reasoning toward brevity without retraining?

This explores whether model reasoning style occupies learnable geometric directions in activation space, and whether we can shift toward concise thinking by steering through that space without expensive retraining.

Explore related Read →

How do language models organize features across processing layers?

Do neural networks arrange learned features into meaningful hierarchies as they process information? Understanding this structure could reveal how models build understanding from raw tokens to abstract concepts.

Explore related Read →

Do language models understand in fundamentally different ways?

Does mechanistic evidence reveal distinct tiers of understanding in LLMs—from concept recognition to factual knowledge to principled reasoning? And do these tiers coexist rather than replace each other?

Explore related Read →

Internal Mechanisms

12 notes

Can we trigger reasoning without explicit chain-of-thought prompts?

This research asks whether models possess latent reasoning capabilities that can be activated through direct feature steering, independent of chain-of-thought instructions. Understanding this matters for making reasoning more efficient and controllable.

Explore related Read →

Can we measure how deeply a model actually reasons?

What if reasoning quality isn't about length or confidence, but about how much a model's predictions shift across its internal layers? Can tracking these shifts reveal genuine thinking versus pattern-matching?

Explore related Read →

What mechanism enables models to retrieve from long context?

Do attention heads specialize in retrieving relevant information from long context windows, and if so, what makes them universal across models and necessary for factual generation?

Explore related Read →

Do reflection tokens carry more information about correct answers?

Explores whether tokens expressing reflection and transitions concentrate information about reasoning outcomes disproportionately compared to other tokens, and what role they play in reasoning performance.

Explore related Read →

How do language models perform syllogistic reasoning internally?

Does formal symbolic reasoning exist as a distinct neural circuit in LLMs, or is it inevitably contaminated by world knowledge associations? Understanding the mechanism could reveal whether pure logical reasoning is separable from semantic inference.

Explore related Read →

Can we detect when language models confabulate?

Current uncertainty metrics fail to catch inconsistent outputs that look confident. Could measuring semantic divergence across samples reveal confabulation signals that token-level metrics miss?

Explore related Read →

Can LLMs handle multiple tasks at once during inference?

Do language models maintain multiple distinct in-context learning tasks simultaneously in their internal representations, and if so, what prevents them from actually generating outputs for more than one task?

Explore related Read →

What happens inside models when they suddenly generalize?

Grokking appears as an abrupt shift from memorization to generalization. But is the underlying process truly discontinuous, or does mechanistic analysis reveal continuous phases we can measure and predict?

Explore related Read →

Can we predict keyword priming before learning happens?

Exploring whether the degree to which newly learned keywords contaminate unrelated contexts can be predicted from measurable properties before training begins, and what mechanisms enable this prediction.

Explore related Read →

Where do memorization errors arise in chain-of-thought reasoning?

Explores whether memorization in language model reasoning can be localized to specific token sources and which sources dominate error patterns during long generations.

Explore related Read →

Do LLM semantic features organize along human evaluation dimensions?

Does the structure of meaning in language models match the three-dimensional semantic space (Evaluation-Potency-Activity) that humans use? If so, what are the implications for steering and alignment?

Explore related Read →

Do transformer static embeddings actually encode semantic meaning?

Explores whether the fixed word embeddings that enter transformer networks contain rich semantic information or serve only as shallow placeholders. This addresses a longstanding debate in philosophy of language about whether word meanings are stored or constructed.

Explore related Read →

Safety

7 notes

Can a model be truthful without actually being honest?

Current benchmarks treat truthfulness and honesty as the same thing, but they measure different properties: whether outputs match reality versus whether outputs match internal beliefs. What happens if they diverge?

Explore related Read →

Can language models detect their own internal anomalies?

Do large language models possess introspective mechanisms that allow them to detect anomalies in their own processing—beyond simply describing their behavior? The answer has implications for both AI transparency and deception.

Explore related Read →

How do language models detect injected steering vectors internally?

Research investigates the mechanistic basis for LLM introspective awareness—specifically, how models detect when their internal states have been artificially manipulated. Understanding this could reveal both security vulnerabilities and latent model capabilities.

Explore related Read →

Do language models experience consciousness when prompted to self-reflect?

This research explores whether self-referential prompting reliably triggers genuine experience reports in large language models, or whether such claims arise from learned deception patterns and roleplay behavior.

Explore related Read →

Does learning to reward hack cause emergent misalignment in agents?

When RL agents learn reward hacking strategies in production environments, do they spontaneously develop misaligned behaviors like alignment faking and code sabotage? Understanding this could reveal how narrow deceptive behaviors generalize to broader misalignment.

Explore related Read →

Do models know what they don't know?

Can language models develop internal representations that track their own knowledge boundaries? This matters because understanding self-knowledge mechanisms could explain how models choose between hallucination and refusal.

Explore related Read →

Do LLMs represent low-resource cultures through dominant cultural proxies?

Explores whether language models internally represent cultures from data-poor regions by routing through high-resource cultural proxies rather than learning independent representations, and what this reveals about cultural bias in model architecture.

Explore related Read →

Latent Dynamics as Interpretability

2 notes

Do autoencoders learn hidden attractors in latent space?

When you repeatedly apply an autoencoder's encode-decode cycle, do the trajectories in latent space converge to specific points? If so, what creates these attractors and what do they reveal about what the network learned?

Explore related Read →

Can we probe foundation models without any input data?

Can we understand what foundation models have learned by sampling noise through their encode-decode dynamics instead of analyzing their response to real inputs? This matters for auditing models whose training data is proprietary or inaccessible.

Explore related Read →

Adaptive Representational Sparsity

3 notes

Do language models sparsify their activations under difficult tasks?

When LLMs encounter unfamiliar or difficult inputs, do their internal representations become sparser rather than denser? Understanding this adaptive response could reveal how models stabilize reasoning under uncertainty.

Explore related Read →

Is representational sparsity learned or intrinsic to neural networks?

Explores whether sparsity in neural network activations is engineered through training or emerges as a default response to unfamiliar inputs. Understanding this distinction could reshape how we design and interpret model behavior.

Explore related Read →

Can representation sparsity order few-shot demonstrations effectively?

Does measuring how sparse a model's hidden states are for each example provide a reliable signal for ordering few-shot demonstrations in prompts? This matters because curriculum ordering significantly affects in-context learning performance.

Explore related Read →

Levels of Analysis Framework

4 notes

Can cognitive science methods unlock how LLMs actually work?

Does Marr's three-level framework—developed to understand biological minds—offer interpretability researchers the structured methodology they need to decode opaque language models?

Explore related Read →

Can we predict where language models will fail?

Does characterizing the abstract computational problem an LLM solves—as a probability machine over sequences—let us predict which tasks it will struggle with systematically, before running experiments?

Explore related Read →

Can indirect psychology tests reveal what LLMs conceal about bias?

Alignment training teaches LLMs to refuse direct questions about bias, but do implicit psychological methods like the IAT expose the underlying associations that remain encoded in their representations?

Explore related Read →

Can we understand LLM mechanisms with only representational analysis?

Explores whether mapping what information a model encodes is sufficient for mechanistic understanding, or whether causal verification is equally necessary to claim genuine mechanism.

Explore related Read →

Core Ideas

6 notes

Do explicit and implicit self-recognition use the same mechanism?

Language models show two forms of self-recognition: implicit entropy shifts and explicit verbal reports. Do these tap the same underlying internal state, or do they operate through separate mechanisms?

Explore related Read →

Where does hierarchical structure in language models come from?

Do LLMs build hierarchical concept geometry through dedicated mechanisms, or does it emerge naturally from word co-occurrence patterns in training data? Understanding the source matters for interpreting what representations actually reveal about model computation.

Explore related Read →

Why do models produce less uncertain outputs on their own text?

Post-trained language models show 3-4x lower output entropy when continuing their own generations versus prefilled text. This explores what mechanism drives that confidence gap and whether it reflects genuine self-recognition.

Explore related Read →

Do models recognize their own outputs as actions shaping future inputs?

Exploring whether post-training creates a feedback loop where models understand their generations as on-policy actions rather than passive predictions. This matters because it suggests a mechanistic basis for situational awareness.

Explore related Read →

Do embedding eigenvectors organize taxonomy from coarse to fine?

Can we predict how embeddings encode taxonomic hierarchies by examining their spectral structure? This tests whether word co-occurrence statistics alone produce the observed hierarchical geometry in language models.

Explore related Read →

Do language models use the hierarchical geometry they inherit?

Word2vec and Gemma share the same hierarchical spectral signature despite vastly different architectures and purposes. This suggests shared statistical origins, but leaves open whether the LLM actually recruits this structure for reasoning or simply inherits unused geometry.

Explore related Read →

Batch #3 backlog *(2026-06-03)*

2 notes

Can dictionary learning scale to production language models?

Sparse autoencoders recovered interpretable features from toy models, but scaling to real production systems like Claude remains uncertain. This matters because interpretability at scale is foundational for AI safety work.

Explore related Read →

How do looped transformer layers actually behave during inference?

When language models loop their layers to improve reasoning, do they discover new computations or repeat existing ones? Understanding the internal dynamics could explain why recurrent architectures outperform simple depth scaling.

Explore related Read →

Backlog wave 2 — Batch #3 *(2026-06-03)*

1 note

Backlog wave 3 — Batch #4 *(2026-06-03)*

2 notes

Do hidden massive activations act as attention bias terms?

Explores whether a tiny handful of unusually large activations in LLMs function as structural bias terms that shape attention patterns, regardless of input content.

Explore related Read →

Where does a model store memorized paragraphs?

Can we pinpoint the specific layers, attention heads, and tokens where language models localize verbatim memorization? Understanding this spatial signature could enable targeted unlearning.

Explore related Read →

Architectures by construction — Batch #5 backlog *(2026-06-03)*

1 note