Model Architecture and Internals

Research on the architecture, internals, and foundations of language models — novel architectures, mechanistic interpretability, memory, multimodality, context engineering, diffusion LLMs, world models, and the data behind them.

143 notes (primary) · 465 papers · 11 sub-topics

View as

Mechanistic Interpretability

20 notes

Can learnable spline activations beat fixed MLP designs?

What if neural networks moved nonlinearity from fixed node activations to learnable functions on edges? This explores whether such a structural redesign could improve accuracy, interpretability, and scaling compared to standard MLPs.

Can LLMs handle multiple tasks at once during inference?

Do language models maintain multiple distinct in-context learning tasks simultaneously in their internal representations, and if so, what prevents them from actually generating outputs for more than one task?

Do hidden massive activations act as attention bias terms?

Explores whether a tiny handful of unusually large activations in LLMs function as structural bias terms that shape attention patterns, regardless of input content.

How do language models organize features across processing layers?

Do neural networks arrange learned features into meaningful hierarchies as they process information? Understanding this structure could reveal how models build understanding from raw tokens to abstract concepts.

Can neural networks learn compositional skills without symbolic mechanisms?

Do neural networks need explicit symbolic architecture to compose learned concepts, or can scaling alone enable compositional generalization? This asks whether compositionality is an architectural feature or an emergent property of scale.

Can identical outputs hide broken internal representations?

Can neural networks produce correct outputs while having fundamentally fractured internal structure that prevents generalization and creativity? This challenges our assumptions about what performance benchmarks actually measure.

What happens inside models when they suddenly generalize?

Grokking appears as an abrupt shift from memorization to generalization. But is the underlying process truly discontinuous, or does mechanistic analysis reveal continuous phases we can measure and predict?

Can models be smart without organized internal structure?

Explores whether linear feature decodability proves genuine compositional reasoning or merely indicates that the right features are present but poorly organized. Critical for understanding what performance metrics actually certify.

How do language models detect injected steering vectors internally?

Research investigates the mechanistic basis for LLM introspective awareness—specifically, how models detect when their internal states have been artificially manipulated. Understanding this could reveal both security vulnerabilities and latent model capabilities.

Can we predict keyword priming before learning happens?

Exploring whether the degree to which newly learned keywords contaminate unrelated contexts can be predicted from measurable properties before training begins, and what mechanisms enable this prediction.

Do language models understand in fundamentally different ways?

Does mechanistic evidence reveal distinct tiers of understanding in LLMs—from concept recognition to factual knowledge to principled reasoning? And do these tiers coexist rather than replace each other?

Can neural networks actually achieve compositional generalization?

For decades, theorists argued connectionist models fundamentally lack the structure needed for compositionality. But modern LLMs exhibit sophisticated compositional behaviors despite sharing the same design principles. What changed?

Do neural networks naturally learn modular compositional structure?

Explores whether neural networks decompose compositional tasks into distinct subroutines without explicit symbolic design. This challenges the longstanding view that neural networks are fundamentally non-compositional.

Why do models produce less uncertain outputs on their own text?

Post-trained language models show 3-4x lower output entropy when continuing their own generations versus prefilled text. This explores what mechanism drives that confidence gap and whether it reflects genuine self-recognition.

Do standard analysis methods hide nonlinear features in neural networks?

Current representation analysis tools like PCA and linear probing may systematically miss complex nonlinear computations while over-reporting simple linear features. This raises questions about whether our interpretability methods are actually capturing what networks compute.

Can high-level concepts replace circuit-level analysis in AI?

Instead of reverse-engineering individual circuits, can we study AI reasoning by treating concepts as directions in activation space? This matters because circuit analysis hits practical limits at scale.

Can AI pass every test while understanding nothing?

Explores whether neural networks can produce perfect outputs while having fundamentally broken internal representations. Asks what performance benchmarks actually measure and whether they can distinguish real understanding from fraud.

Do reflection tokens carry more information about correct answers?

Explores whether tokens expressing reflection and transitions concentrate information about reasoning outcomes disproportionately compared to other tokens, and what role they play in reasoning performance.

Can sparse weight training make neural networks interpretable by design?

Explores whether constraining most model weights to zero during training produces human-understandable circuits and disentangled representations, rather than attempting to reverse-engineer dense models after training.

Do language models use the hierarchical geometry they inherit?

Word2vec and Gemma share the same hierarchical spectral signature despite vastly different architectures and purposes. This suggests shared statistical origins, but leaves open whether the LLM actually recruits this structure for reasoning or simply inherits unused geometry.

LLM Architecture

11 notes

Do language models sparsify their activations under difficult tasks?

When LLMs encounter unfamiliar or difficult inputs, do their internal representations become sparser rather than denser? Understanding this adaptive response could reveal how models stabilize reasoning under uncertainty.

Why do decoder-only models underperform as text encoders?

Decoder-only LLMs use causal attention, which limits each token to seeing only prior context. This explores whether removing this constraint could make them competitive universal encoders without architectural redesign.

Do embedding dimensions fundamentally limit retrievable document combinations?

Can single-vector embeddings represent any top-k document subset a user might need? Research using communication complexity theory suggests there are hard geometric limits independent of training data or model architecture.

Can models learn working memory by attending to their own latents?

Can a feedback loop letting transformers attend to their own internal representations enable them to process indefinitely long sequences without adding extra weights? This explores whether working memory can emerge from self-attention rather than external modules.

Does fixed sparsity work for all sequence lengths?

Production systems often apply the same sparsity budget regardless of input length. Does this one-size-fits-all approach actually work across short and long contexts, or does optimal sparsity vary with sequence length?

Can text-trained models compress images better than specialized tools?

Do general-purpose language models trained only on text outperform domain-specific compressors like PNG and FLAC on their native data? This tests whether compression ability is universal or requires domain specialization.

Does sparse attention trade off quality for speed?

When sparse attention is compared fairly—larger sparse models versus smaller dense ones at the same compute cost—does it still represent a quality-cost trade-off, or does it actually improve performance?

Can neural memory modules scale language models beyond attention limits?

Can separating short-term attention from adaptive long-term memory allow models to efficiently handle context windows exceeding 2M tokens while maintaining competitive performance?

Is representational sparsity learned or intrinsic to neural networks?

Explores whether sparsity in neural network activations is engineered through training or emerges as a default response to unfamiliar inputs. Understanding this distinction could reshape how we design and interpret model behavior.

Can representation sparsity order few-shot demonstrations effectively?

Does measuring how sparse a model's hidden states are for each example provide a reliable signal for ordering few-shot demonstrations in prompts? This matters because curriculum ordering significantly affects in-context learning performance.

Why do neural networks fail at compositional generalization?

Exploring whether the binding problem from neuroscience explains neural networks' inability to systematically generalize. The binding problem has three aspects—segregation, representation, and composition—each creating distinct failure modes in how networks handle structured information.

LLM Memory

11 notes

Can LLMs read long documents like humans do?

How might mimicking human reading strategies—storing gist memories and looking up details on demand—help language models handle documents beyond their effective context window?

Is agent memory a storage problem or a connectivity problem?

Most systems treat memory as a repository to store and retrieve. But what if memory's real usefulness depends on how units are linked together rather than what is stored?

Can retrieval knowledge compress into a tiny parametric model?

Can the information stored in large non-parametric retrieval datastores be compressed into a small trainable module? This matters because it could combine retrieval's knowledge benefits with the speed of pure parametric methods.

Can lookup memory and computation work together better than either alone?

Mixture-of-Experts handles dynamic logic, but static knowledge might need a different mechanism. Can a hybrid approach combining conditional computation with fast lookup outperform pure sparse models?

Can models consolidate memories during offline sleep phases?

This explores whether LLMs can use dedicated offline periods to consolidate short-term learning into permanent weights, avoiding catastrophic forgetting and the need for expensive retraining.

Can brain memory systems explain how LLMs should store knowledge?

This explores whether the brain's three-tier memory architecture—neocortex, hippocampus, and prefrontal cortex—maps onto transformer weights, external knowledge stores, and agentic state. Understanding this mapping could reveal which AI memory problems each tier solves and which it cannot.

When do language models stop memorizing and start generalizing?

Can we measure the exact capacity limit where models transition from memorizing training data to learning underlying patterns? Understanding this boundary could reshape how we think about model learning and privacy.

Has memory architecture replaced parameter count as the scaling frontier?

Late-2025 research suggests the field's next major efficiency gains come from restructuring how models store and use experience rather than simply making them larger. Three convergent signals point to this shift.

Can agents learn preferences by watching rather than asking?

Explores whether multimodal agents can build accurate preference models through continuous observation of user behavior, without explicit instruction, by organizing memory around entities and separating concrete events from derived knowledge.

Where does a model store memorized paragraphs?

Can we pinpoint the specific layers, attention heads, and tokens where language models localize verbatim memorization? Understanding this spatial signature could enable targeted unlearning.

Can recursive subtask trees overcome context window limits?

Explores whether modeling reasoning as prunable trees of subtasks could eliminate the context length constraints that currently force developers into multi-agent architectures. Asks if working memory can become truly unlimited through selective KV cache retention.

Novel LLM Architectures

10 notes

Are neural network optimizers actually memory systems?

Do gradient-based optimizers like Adam function as associative memory modules that compress context, just like network layers? This reframes the relationship between training and learning.

Can byte-level models match tokenized performance with better efficiency?

Tokenized models use fixed vocabularies and allocate equal compute per token, but what if we dynamically group bytes based on prediction difficulty instead? Could this approach achieve competitive performance while using fewer FLOPs?

Can recurrent hierarchies achieve reasoning that transformers cannot?

Can a dual-timescale recurrent architecture escape the computational limitations of standard transformers and solve complex reasoning tasks without explicit chain-of-thought? This explores whether architectural design, not scale, enables true algorithmic reasoning.

Can cognition work by reusing memory instead of recomputing?

Does intelligence emerge from structured navigation of prior inference paths rather than fresh computation? This challenges whether brains and AI systems need to recalculate constantly or can leverage stored trajectories for efficiency.

Can recurrence consolidate memory without predicting tokens?

Recurrent neural networks typically use recurrence only for prediction. But could offline recurrent passes serve a second purpose—consolidating transient context into persistent weights, like sleep does in brains?

Can looped transformers generalize to unseen knowledge combinations?

Do transformers that reuse layers across iterations succeed where standard transformers fail at composing facts in novel ways? This matters because systematic generalization is a hallmark of human reasoning.

Can spiking neurons make transformers efficient on any hardware?

Explores whether brain-inspired spiking mechanisms combined with linear attention can adapt existing transformer checkpoints into efficient models trainable outside NVIDIA ecosystems using minimal additional data.

Is long-context bottleneck really about memory or compute?

Explores whether the challenge of handling long context windows stems from storage capacity limits or from the computational cost of transforming context into internal state. Understanding this distinction reshapes how we design language models.

Can parallel architectures solve inherently sequential problems?

Complexity theory suggests some problems like reasoning and planning are fundamentally sequential. Can parallel architectures like Transformers overcome this limitation, or do we need fundamentally different computational approaches?

Can state-space models match transformers at copying and retrieval?

Explores whether the efficiency gains of state-space models come at a fundamental cost in their ability to copy strings and retrieve exact information from context, compared to transformers.

Cognitive Models and Latent Representations

10 notes

How do language models encode syntactic relations geometrically?

Do LLM embeddings use distance alone or also direction to represent syntax? Understanding whether neural networks can spontaneously develop symbolic-compatible geometric structures.

Can a single regularizer prevent JEPA representation collapse?

JEPAs traditionally need complex loss stacks and auxiliary tricks to avoid collapse. Can a single Gaussian-distribution constraint on latent embeddings do the same stabilization work, and would that simplify training?

Do autoencoders learn hidden attractors in latent space?

When you repeatedly apply an autoencoder's encode-decode cycle, do the trajectories in latent space converge to specific points? If so, what creates these attractors and what do they reveal about what the network learned?

Can communication pressure drive agents to learn shared abstractions?

Under what conditions do AI agents develop compact, efficient shared languages? This explores whether cooperative task pressure—rather than explicit optimization—naturally drives abstraction formation, mirroring human collaborative communication.

Can we probe foundation models without any input data?

Can we understand what foundation models have learned by sampling noise through their encode-decode dynamics instead of analyzing their response to real inputs? This matters for auditing models whose training data is proprietary or inaccessible.

Can latent thought vectors scale language models beyond parameters?

Explores whether explicit latent thought vectors with dual-rate learning create new scaling dimensions independent of model size. This matters because it suggests alternatives to simply building larger models.

Can explicit stack tracking improve how transformers learn recursive syntax?

Can adding an explicit stack tape to transformers help them track recursive structure more efficiently? This matters because standard transformers struggle with long-tail recursive patterns despite their size and data.

Can we explore multiple reasoning paths without committing to one token?

Standard language models pick one token at each step, collapsing uncertainty and forcing single reasoning trajectories. Could preserving the full probability distribution across token embeddings enable implicit parallel exploration instead?

Can agents share thoughts directly without using language?

Explores whether multi-agent systems can communicate by exchanging latent thoughts extracted from hidden states, bypassing the ambiguity and misalignment problems inherent in natural language.

Do transformers hide reasoning before producing filler tokens?

Explores whether language models compute correct answers in early layers but then deliberately overwrite them with filler tokens in later layers, suggesting reasoning and output formatting are separable processes.

Multimodal Models

8 notes

Can a single model generate all modalities without external encoders?

Most multimodal systems rely on separate encoders for each modality. This research explores whether training a unified foundation model on discrete tokens across text, image, video, and speech can enable any-to-any generation without those external components.

Can generating entire videos at once beat keyframe interpolation?

Does synthesizing a video's full temporal duration in a single pass, rather than generating keyframes and filling gaps, produce more globally coherent motion? This explores whether pipeline decomposition fundamentally limits motion consistency.

Can bounding boxes replace image encoders for document understanding?

Explores whether spatial layout information alone, encoded as bounding boxes, can capture the multimodal signal needed for document understanding without expensive visual encoding. Matters because image encoders add significant computational cost to document processing systems.

Can we solve modality competition through architectural design?

Does modality competition in multimodal models stem from fundamental training conflicts, or from specific architectural choices? Understanding the root cause could reveal whether the trade-off is solvable.

Does multimodal zero-shot performance actually generalize or interpolate?

Explores whether multimodal models like CLIP truly generalize to unseen concepts or whether their impressive performance merely reflects memorization of frequently-seen concepts during pretraining.

Are text-only language models fundamentally limited by abstraction?

Explores whether text's compression of physics, geometry, and causality into symbols creates an irreducible ceiling for language-only AI, and whether multimodal approaches can overcome this structural constraint.

Can video language models actually understand time?

This research investigates whether video LLMs truly grasp temporal concepts like causality and event progression, or merely recognize spatial content across frames. Understanding this gap matters for video understanding tasks that depend on reasoning about time.

Why do vision and language scale so differently?

IsoFLOP analysis reveals vision and language follow distinct scaling curves—vision demands far more training data than language at equivalent compute budgets. Understanding this asymmetry matters for designing multimodal architectures that serve both modalities well.

Diffusion-Based LLMs

6 notes

Can diffusion models enable control that autoregressive models cannot reach?

Autoregressive language models struggle with complex global controls like syntax and infilling because they generate left-to-right and have discrete token bottlenecks. Can diffusion models' continuous latents and parallel denoising overcome these structural limitations?