Can language models learn to model human decision making?
Explores whether LLMs finetuned on psychological experiments can capture how people actually make decisions better than theories designed specifically for that purpose.
How LLM architecture, training data, and internal representations shape cognition and computation.
Explores whether LLMs finetuned on psychological experiments can capture how people actually make decisions better than theories designed specifically for that purpose.
Emotion recognition systems assume that detecting emotional moments will identify what people remember. But does observed emotion in group settings actually predict individual memorability, or does the proxy fail?
Do LLMs update their beliefs asymmetrically when learning from their own choices versus observing others? This matters for understanding whether agentic AI systems might inherit human cognitive biases.
Can GPT-3 identify event boundaries in narrative text the way humans do? This matters because it could reveal whether language models and human cognition share similar predictive mechanisms for understanding continuous experience.
Do language models prioritize statistical compression over semantic nuance when forming conceptual representations, and how does this differ from human category formation? This matters because it may explain why LLMs fail at tasks requiring fine-grained distinctions.
Do LLM embeddings use distance alone or also direction to represent syntax? Understanding whether neural networks can spontaneously develop symbolic-compatible geometric structures.
Explores whether language models compute correct answers in early layers but then deliberately overwrite them with filler tokens in later layers, suggesting reasoning and output formatting are separable processes.
Can adding an explicit stack tape to transformers help them track recursive structure more efficiently? This matters because standard transformers struggle with long-tail recursive patterns despite their size and data.
Can a trained decoder translate internal LLM activations into natural language descriptions, revealing what hidden representations actually encode? This matters because it could unlock both interpretability and controllability through the same mechanism.
Under what conditions do AI agents develop compact, efficient shared languages? This explores whether cooperative task pressure—rather than explicit optimization—naturally drives abstraction formation, mirroring human collaborative communication.
Explores whether multi-agent systems can communicate by exchanging latent thoughts extracted from hidden states, bypassing the ambiguity and misalignment problems inherent in natural language.
Can separating short-term attention from adaptive long-term memory allow models to efficiently handle context windows exceeding 2M tokens while maintaining competitive performance?
Decoder-only LLMs use causal attention, which limits each token to seeing only prior context. This explores whether removing this constraint could make them competitive universal encoders without architectural redesign.
Tokenized models use fixed vocabularies and allocate equal compute per token, but what if we dynamically group bytes based on prediction difficulty instead? Could this approach achieve competitive performance while using fewer FLOPs?
Can multi-agent LLM systems, when structured with genetic programming, discover novel neural network designs that outperform human-engineered architectures? This matters because it could automate a critical bottleneck in AI research.
Can language models efficiently discover and compose task-specific capabilities on the fly without modifying base weights? This explores whether test-time adaptation through expert vector composition outperforms fixed fine-tuning approaches.
Explores whether embedding LLMs within algorithmic control flow—where programs manage state and context filtering—enables complex task decomposition beyond what LLMs achieve through self-managed reasoning chains.
Can single-vector embeddings represent any top-k document subset a user might need? Research using communication complexity theory suggests there are hard geometric limits independent of training data or model architecture.
When LLMs must produce structured JSON or XML with specific schemas, does this constrain their capacity for complex reasoning? This matters because production systems often enforce strict formats for parsing convenience.
After fine-tuning on graph data, do LLMs learn to use actual connectivity patterns, or just recognize that graphs exist? This matters for understanding whether transformers can handle structured reasoning tasks.
This explores whether inserting lookahead tokens containing future goals into training sequences helps models learn long-range planning without changing their architecture. The question matters because it tests whether data-level changes can produce architectural-level reasoning improvements.
Explores whether modeling reasoning as prunable trees of subtasks could eliminate the context length constraints that currently force developers into multi-agent architectures. Asks if working memory can become truly unlimited through selective KV cache retention.
Can generating thinking trajectories during pretraining unlock the same efficiency gains that test-time scaling provides at inference? This explores whether the compute-allocation principle works across the training-inference boundary.
This explores whether difficulty metrics can identify redundant training examples that can be safely removed. It matters because most datasets contain massive waste — if we can find which examples are truly necessary, we could train better models on far less data.
Does moving from token-level to sentence-level reasoning in embedding space preserve the capability for complex reasoning while enabling language-agnostic processing? This challenges assumptions about how LLMs must operate.
Can the information stored in large non-parametric retrieval datastores be compressed into a small trainable module? This matters because it could combine retrieval's knowledge benefits with the speed of pure parametric methods.
Mixture-of-Experts handles dynamic logic, but static knowledge might need a different mechanism. Can a hybrid approach combining conditional computation with fast lookup outperform pure sparse models?
Does breaking agent memory into forms, functions, and dynamics provide a clearer framework than the traditional short-term/long-term distinction? This matters because current agent-memory literature lacks a unified vocabulary, making comparison between systems nearly impossible.
This explores whether the brain's three-tier memory architecture—neocortex, hippocampus, and prefrontal cortex—maps onto transformer weights, external knowledge stores, and agentic state. Understanding this mapping could reveal which AI memory problems each tier solves and which it cannot.
Explores whether the challenge of handling long context windows stems from storage capacity limits or from the computational cost of transforming context into internal state. Understanding this distinction reshapes how we design language models.
Recurrent neural networks typically use recurrence only for prediction. But could offline recurrent passes serve a second purpose—consolidating transient context into persistent weights, like sleep does in brains?
Explores whether limiting how far training pushes a model from its base distribution (measured by KL divergence) helps it learn new tasks more effectively over time, and why that trade-off matters for continual learning.
When language models adapt to new tasks, does separating task-specific learning (via prompt context) from persistent parameter updates help preserve both generalization ability and the model's original capabilities?
When generative models train on outputs from previous models, do the resulting models lose rare patterns permanently? The question matters because future training data will inevitably contain synthetic content.
Do generative models trained on diverse, imperfect human experts develop an implicit consensus that surpasses any individual contributor? This explores whether aggregating diverse perspectives at training time, rather than inference time, can denoise human biases.
Exploring whether models trained on instructions actually learn the task semantics or merely learn to match output distributions. This matters because it challenges assumptions about how fine-tuning improves model behavior.
Explores whether models learn reasoning through general procedures across diverse documents rather than memorizing specific facts. This matters for understanding what pretraining data actually teaches models to reason.
Explores whether adversarial contamination at 0.1% of pretraining data can persist through post-training safety measures, and which attack types prove most resilient to alignment.
Can we decouple how model scale affects different training stages to independently improve factuality versus helpfulness? This matters for understanding whether these capabilities compete or can be optimized separately.
When models are fine-tuned on specific domains, do their chain-of-thought steps become less causally connected to their outputs? Three experiments test whether reasoning chains remain functionally faithful after training.
Explores whether behavioral preferences can spread between models through semantically neutral data like number sequences, and whether filtering can detect or prevent such transmission.
Explores whether text's compression of physics, geometry, and causality into symbols creates an irreducible ceiling for language-only AI, and whether multimodal approaches can overcome this structural constraint.
Does modality competition in multimodal models stem from fundamental training conflicts, or from specific architectural choices? Understanding the root cause could reveal whether the trade-off is solvable.
IsoFLOP analysis reveals vision and language follow distinct scaling curves—vision demands far more training data than language at equivalent compute budgets. Understanding this asymmetry matters for designing multimodal architectures that serve both modalities well.
When sparse attention is compared fairly—larger sparse models versus smaller dense ones at the same compute cost—does it still represent a quality-cost trade-off, or does it actually improve performance?
Production systems often apply the same sparsity budget regardless of input length. Does this one-size-fits-all approach actually work across short and long contexts, or does optimal sparsity vary with sequence length?
Different NLP tasks show vastly different tolerance for sparse attention—from 95% on simple QA to 50-67% on multi-hop reasoning. What structural differences explain this variation, and how should it shape deployment decisions?
Dense retrieval excels at topical recall but struggles with meaning-level distinctions. Adding structure-targeted negatives during training might improve compositional sensitivity—but at what cost to overall retrieval performance?
Explores whether deep-and-thin architectures outperform wide-and-shallow ones at sub-billion scales, and why this might contradict larger-model scaling laws.
Is the shift toward smaller LLMs driven by quality trade-offs, or by hard physical constraints on device memory and battery life? This note examines whether sub-billion models are a practical necessity rather than a compromise.
If large language models excel at next-token prediction, why do they struggle with long-horizon goal-oriented tasks? This explores whether the bottleneck is model capacity or the environments used to train them.
Explores whether learning from a network's own abstract representations requires far fewer training samples than learning from raw tokens, and what mechanism drives this efficiency gap.
Does model size enable learning of infrequent, complex tasks through greater representational capacity, or through some other mechanism? Understanding this matters for deciding whether scaling or data design is the more efficient lever.
Explores whether a single optimal approach to synthetic data generation exists, or whether success depends on context like domain, model architecture, and scale. Understanding this matters for building effective data systems.
Does external tool use let language models recall facts without being constrained by parameter count? This matters because it could reshape how we scale knowledge capacity beyond architectural limits.
Models absorb and process rich input context far more effectively than they produce similarly sophisticated outputs. Understanding this asymmetry could reshape how we design systems to compensate for generative limitations.
Can models be built so that they respect query timestamps by selectively silencing experts trained on future data? This explores whether temporal causality can be enforced through architecture rather than external retrieval.
When language models train on the same private or proprietary data multiple times, how much do they end up memorizing and leaking that information at inference time? Understanding this risk is critical for organizations fine-tuning on confidential datasets.
GPT-4o excels at multimodal generation across 20+ tasks, but systematically fails to render non-Latin scripts and underrepresented cultures accurately. What explains this specific failure mode in otherwise capable systems?
Explores whether the efficiency gains of state-space models come at a fundamental cost in their ability to copy strings and retrieve exact information from context, compared to transformers.
GPT-4 and RAG plateau around 10,000 tokens and rely heavily on the first quarter of input. Can recurrent memory augmentation overcome these limits and enable reasoning across millions of tokens?
Standard LLM training encodes documents first, then teaches QA patterns. But does this order matter? Exploring whether reversing the sequence—teaching how knowledge gets queried before encoding it—could unlock better factual recall.
Explores whether spatial layout information alone, encoded as bounding boxes, can capture the multimodal signal needed for document understanding without expensive visual encoding. Matters because image encoders add significant computational cost to document processing systems.
Does synthesizing a video's full temporal duration in a single pass, rather than generating keyframes and filling gaps, produce more globally coherent motion? This explores whether pipeline decomposition fundamentally limits motion consistency.
Can models trained natively with only three weight values (−1, 0, 1) achieve the same perplexity and task performance as standard full-precision models? This matters because ternary weights could dramatically reduce computational and energy costs.
Does intervening directly on a frozen model's representations offer a better path to parameter-efficient adaptation than current weight-based methods? This challenges the dominant PEFT paradigm by treating representations as the semantic lever instead.
Can training domain-specialized LLM copies in parallel without synchronization, then merging their components into a routed mixture, achieve better efficiency and accuracy than keeping all copies synchronized?
Can we derive principles for accelerating LM training by framing it as lossless compression? What does the optimal learning process look like when compression is the objective?
What scaling laws govern finetuning performance across model size, pretraining data, and finetuning data? Understanding these relationships could guide resource allocation in real-world tuning scenarios.
When language models train cyclically on repeated documents, do they anticipate upcoming material and recover from forgetting in advance? This challenges the standard catastrophic-interference narrative about sequential training.
Consistency models sample in one step but sacrifice quality compared to diffusion. Can adding just a handful of sampling steps recover the quality gap while staying faster than full diffusion?
Explores whether multimodal models like CLIP truly generalize to unseen concepts or whether their impressive performance merely reflects memorization of frequently-seen concepts during pretraining.
Can a feedback loop letting transformers attend to their own internal representations enable them to process indefinitely long sequences without adding extra weights? This explores whether working memory can emerge from self-attention rather than external modules.
When LLMs learn unfamiliar facts through fine-tuning, do they become more prone to hallucinating about things they already knew? Understanding this matters for safe knowledge updates.
Most multimodal systems rely on separate encoders for each modality. This research explores whether training a unified foundation model on discrete tokens across text, image, video, and speech can enable any-to-any generation without those external components.
This explores whether LLMs can use dedicated offline periods to consolidate short-term learning into permanent weights, avoiding catastrophic forgetting and the need for expensive retraining.