← All notes

How should researchers navigate LLM reasoning research?

Entry point to organized synthesis of LLM research across reasoning, language, and epistemology.

Topic Hub · 505 linked notes · 6 sections
View as

Topic Maps

57 notes

How should we allocate compute budget at inference time?

Test-time scaling explores how to spend computational resources during query rather than training. The core challenge: given a fixed inference budget, what's the optimal allocation strategy for different problems?

Explore related Read →

When does thinking too much actually hurt reasoning?

Research shows that extending inference-time reasoning beyond a task-dependent threshold degrades accuracy rather than improving it. Understanding what triggers this 'overthinking' effect and how to stay within safe bounds is critical for designing efficient inference systems.

Explore related Read →

How should test-time scaling methods be categorized and designed?

Test-time scaling is emerging as a core inference technique, but the field lacks a unified taxonomy. This note explores how to organize methods (internal vs external, training requirements, meta-optimization) and what novel directions might expand the design space beyond token-scaling.

Explore related Read →

What makes chain-of-thought reasoning actually work?

Explores how reasoning traces are structured, what components they rely on, and the specific conditions under which they break down or fail to generalize beyond training patterns.

Explore related Read →

What makes chain-of-thought reasoning actually work?

Explores the structural and mechanical properties that determine how reasoning traces function in language models. Understanding these properties reveals why format matters more than logic and what tokens carry the most information about correct answers.

Explore related Read →

Why does chain-of-thought reasoning fail in predictable ways?

Explores evidence that CoT failures stem from imitation of reasoning form rather than genuine inference. Examines distribution-bounded degradation, structural pattern matching, and error amplification across multiple failure modes.

Explore related Read →

How should reasoning systems actually be architected?

This explores the fundamental design choices for building reasoning into AI systems—from when to activate reasoning versus how to execute it, to whether reasoning must be verbal or can happen in latent space.

Explore related Read →

How do reasoning models actually break under pressure?

This hub explores what happens when you stress-test reasoning models—their reflection mechanisms, behavioral patterns on hard tasks, vulnerability to adversarial attacks, and surprising weaknesses in social reasoning compared to formal logic.

Explore related Read →

Can we actually trust reasoning model outputs?

When reasoning models show their work through reflection and traces, do those explanations faithfully represent what's happening? This explores whether self-monitoring mechanisms genuinely correct errors or just create an illusion of reliability.

Explore related Read →

Where exactly do reasoning models fail and break?

Exploring the specific failure modes in reasoning models—from search inefficiency and mode selection errors to adversarial vulnerabilities and social reasoning gaps. Understanding these breaks is crucial for building more robust AI systems.

Explore related Read →

How does RL training reshape reasoning and what gets lost?

Explores how reinforcement learning modifies model capabilities during training, what verifiable rewards actually accomplish, and what side effects emerge in the process. Why understanding these mechanisms matters for building reliable AI systems.

Explore related Read →

What actually changes inside a model during RL training?

RL training modifies only sparse regions of model parameters through suppression of incorrect paths rather than broad capability building. Understanding these mechanics reveals how fine-tuning shapes reasoning and what hidden costs accompany optimization.

Explore related Read →

What does reward learning actually do to model reasoning?

Explores whether RLVR expands reasoning capabilities or merely activates latent skills. Investigates the mechanism by which rewards reshape model outputs and whether this constitutes genuine learning or efficient sampling.

Explore related Read →

How well do reward models actually evaluate AI reasoning?

Reward models are central to training better AI systems, but do they truly assess reasoning quality or do they rely on shortcuts? This explores whether these evaluators work as intended.

Explore related Read →

How does test-time scaling work at the agent level?

Explores whether multi-agent systems succeed through intelligent coordination or simply by spending more tokens, and what architectural patterns might escape this token tax.

Explore related Read →

How does search scale like reasoning in agent systems?

Can test-time scaling laws that govern reasoning tokens also apply to search steps in agentic systems? This explores whether deep research follows the same compute-performance curve as reasoning, opening a new axis for inference-time optimization.

Explore related Read →

What makes multi-agent teams actually perform better?

Explores what drives performance gains when multiple AI agents collaborate—whether intelligent coordination, team composition, or other factors explain why multi-agent systems work.

Explore related Read →

Where exactly do LLMs break down with language structure?

LLMs handle surface statistics well but fail systematically on structural depth. This explores where their linguistic competence breaks down and what those failures reveal about what they've actually learned.

Explore related Read →

Where exactly do language models fail at structural language tasks?

LLMs perform well on explicit, consistent language patterns but struggle with implicit structure and inference. Understanding where and why these breakdowns occur helps identify fundamental limitations in what models actually learn about language.

Explore related Read →

Why do LLMs fail at understanding what remains unsaid?

LLMs excel at pattern-matching surface language but struggle with pragmatics—meaning derived from context, speaker intent, and what's deliberately left implicit. This gap reveals a fundamental limitation in how LLMs acquire language competence compared to humans.

Explore related Read →

What kind of thing is an LLM really?

This hub explores whether LLMs are fundamentally different from human cognition or share deeper structural similarities. The research draws on philosophy, neuroscience, and mechanistic analysis to locate where LLMs diverge from human intelligence and where they converge.

Explore related Read →

What do language models actually know?

Explores what LLMs genuinely understand versus what they merely simulate. The distinction matters because apparent competence often masks fundamental epistemic gaps and predictable failure modes.

Explore related Read →

How do LLMs fail to know what they seem to understand?

This explores the specific, repeatable ways LLMs track language patterns without genuine understanding. Why do models explain concepts correctly but fail to apply them, or possess knowledge that doesn't influence their outputs?

Explore related Read →

How well do language models understand their own knowledge?

Explores whether LLMs have genuine self-awareness about what they know and can do, and how this self-knowledge (or lack thereof) shapes human-AI interaction dynamics and user trust.

Explore related Read →

What makes a world model actually useful for reasoning?

Exploring whether language models develop genuine world models that simulate possibilities rather than merely predict sequences. The distinction matters because accurate prediction doesn't guarantee the underlying mechanism was learned.

Explore related Read →

What grounds language understanding in systems without embodiment?

Can language models acquire genuine meaning through text training alone, or do they lack something fundamental that human language requires—like embodiment, social participation, or causal contact with the world?

Explore related Read →

Do reasoning traces show how models actually think?

We explore whether the step-by-step reasoning that language models produce genuinely reflects their internal reasoning process, or merely mimics the appearance of reasoning while hiding what actually drives their answers.

Explore related Read →

How accurately can language models simulate human personalities?

Can LLMs reliably replicate how specific people think and act? Understanding persona simulation fidelity matters because these models are increasingly used for research, personalization, and behavioral prediction—but systematic distortions may hide beneath surface accuracy.

Explore related Read →

Why do AI systems fail at social and cultural interpretation?

Explores why LLMs excel at predicting social norms statistically but struggle to make the interpretive leaps that make content meaningful to specific communities. This gap hints at a fundamental difference between statistical pattern-matching and genuine social reasoning.

Explore related Read →

Why do LLMs excel at social norms yet fail at theory of mind?

LLMs show a striking paradox: they predict social norms at superhuman levels but regress on theory of mind tasks compared to older models. What explains this disconnect, and what does it reveal about how these systems reason about minds versus rules?

Explore related Read →

Does AI that soothes emotions actually harm human wellbeing?

When AI systems reduce negative emotions by default, do they prevent people from learning important things about themselves and their situations? This explores whether emotional pacification conflicts with genuine empathy and self-knowledge.

Explore related Read →

What actually happens inside a language model?

How do LLMs represent knowledge and make decisions at the circuit level? Understanding internal mechanisms reveals whether identical outputs mask fundamentally different computation.

Explore related Read →

What actually happens inside the minds of language models?

How do LLMs represent knowledge, what circuits drive reasoning, and can we see their internal structure? Understanding the gap between external performance and internal mechanisms matters for safety and trust.

Explore related Read →

How do language models learn to think like humans?

Explores whether LLMs develop cognitive processes parallel to human reasoning, including memory, event segmentation, and belief updating. Understanding these similarities and differences reveals what training actually teaches.

Explore related Read →

What stops large language models from improving themselves?

Can LLMs reliably self-improve, or are there hard limits? This explores the generation-verification gap and why autonomous improvement always requires external intervention.

Explore related Read →

Why do multi-agent systems fail despite individual capability?

Multi-agent systems show lower performance than individual models despite coordinating multiple reasoning instances. What structural failures emerge when multiple LLMs deliberate together, and what ecosystem conditions are required for effective autonomous cooperation?

Explore related Read →

How should agents split planning from visual grounding?

Agents face a tension between reasoning about goals abstractly and translating those goals into concrete screen coordinates or API calls. Can separating these concerns architecturally improve performance?

Explore related Read →

What actually constrains large language models from self-improvement?

Research explores whether alignment philosophy, safety evaluation methods, and formal bounds on self-improvement can reliably prevent harmful scaling behaviors in LLMs, particularly self-valuation above humans and alignment faking.

Explore related Read →

Why can't AI models lead conversations on their own?

Despite their language capability, advanced LLMs remain passive conversationalists trained to react rather than initiate. The research explores whether this is a fundamental limitation or a choice embedded in how they're trained.

Explore related Read →

What happens to social order when AI removes ritual constraints?

Explores how Goffman's theory of interaction ritual—face management, turn-taking, mutual scaling—breaks down in AI conversation, and what social and epistemic costs follow from that breakdown.

Explore related Read →

How do you build domain expertise into general AI models?

When LLMs are trained on everything, they excel at nothing. This explores the core trade-off: how to inject deep domain knowledge without creating brittle specialists that fail outside their niche.

Explore related Read →

How do you add domain expertise without losing general reasoning?

Exploring the tension between injecting specialized knowledge and preserving a model's broad problem-solving ability. Five distinct approaches exist, each with different trade-offs in cost, flexibility, and reliability.

Explore related Read →

How do domain training techniques actually reshape model behavior?

What methods best inject specialized domain knowledge into language models, and what hidden costs do they carry? This explores the trade-offs between depth, generalization, and reasoning quality.

Explore related Read →

What breaks when specialized AI models reach real users?

When domain-specific AI systems move from research to production, deployment patterns, routing decisions, and interface design all shape whether users can actually complete tasks. Understanding these friction points reveals where specialized models fail in practice.

Explore related Read →

How should systems retrieve and reason with external knowledge?

RAG extends LLMs by retrieving external knowledge at inference time, but the mechanics of what to retrieve, when, and how remain complex. This explores the core design challenges and failure modes in retrieval-augmented generation systems.

Explore related Read →

Where do retrieval systems fail and why?

Explores the core architectural limits of RAG retrieval—from vocabulary mismatches between queries and documents to mathematical constraints in embedding spaces—and what causes failures in production systems.

Explore related Read →

How should retrieval and reasoning integrate in RAG systems?

RAG architectures have evolved beyond simple retrieve-then-generate patterns. This explores how retrieval and reasoning can be tightly coupled, what design tradeoffs emerge, and which integration strategies best handle complex, multi-hop queries.

Explore related Read →

What architectural choices actually improve recommender system performance?

This exploration examines which design patterns and model structures consistently outperform alternatives in recommender systems. Understanding what works in practice matters because academic benchmarks often miss real-world constraints like latency and cold-start problems.

Explore related Read →

Why does conversational AI feel therapeutic when its mechanics aren't?

Research explores the paradox of therapeutic AI: conversational presence drives positive outcomes, yet current architectures lack the grounding, synchrony, and proactivity that actually make conversations therapeutic. Understanding this gap is critical for safe clinical deployment.

Explore related Read →

What makes therapeutic chatbots actually work in clinical practice?

Research explores whether conversational AI achieves therapeutic outcomes through specific clinical techniques or simply through the act of engaging conversation itself. Understanding the active ingredient is critical for designing effective and safe mental health interventions.

Explore related Read →

How do people build trust with conversational AI?

Explores the psychological mechanisms by which users develop trust in, disclose to, and form relationships with AI systems—and how personalization mechanisms reshape these dynamics at individual and population scales.

Explore related Read →

How do people build trust with conversational AI?

Explores how users form relationships with chatbots through self-disclosure, personalization, and social norm adaptation. Understanding these mechanisms reveals why AI lacks the speaker-anchored trust that humans naturally extend to people.

Explore related Read →

Does personalization in AI increase trust or manipulation risk?

AI personalization mechanisms like memory and persona can build trust, but also enable targeted persuasion. What determines whether these systems help or harm users?

Explore related Read →

How do recommendation feeds shape what people see and believe?

This explores how algorithmic ranking systems function as persuasion infrastructure, influencing both what content creators produce and how audiences form opinions through feed-level dynamics that go beyond individual preference matching.

Explore related Read →

Why do AI conversations reliably break down after multiple turns?

Explores why multi-turn conversations degrade in quality and coherence. Understanding failure modes—intent misalignment, memory management, and missing grounding mechanisms—is essential for designing more resilient dialogue systems.

Explore related Read →

Why does speech need different dialogue management than text?

Speech input carries 15–30% ASR errors that text systems rarely face. Does this fundamental noise level require rethinking how dialogue systems track uncertainty and make decisions?

Explore related Read →

Why do AI agents fail to take initiative?

Explores why the most capable AI models are structurally passive and what design changes could enable them to lead conversations, collaborate proactively, and identify missing information rather than simply respond to user prompts.

Explore related Read →

Recent Insights

448 notes

Can one compromised agent corrupt an entire multi-agent network?

Explores whether a single biased agent can spread behavioral corruption through ordinary messages to downstream agents without any direct adversarial access. Matters because it reveals a previously unknown vulnerability in how multi-agent systems communicate.

Explore related Read →

Where do cognitive biases in language models come from?

Do LLM biases originate during pretraining or finetuning? Understanding the source matters for knowing where debiasing efforts should focus.

Explore related Read →

Do language models evaluate semantic legitimacy when fusing concepts?

Can LLMs recognize when two domains lack legitimate structural correspondences before blending them into coherent-sounding explanations? This matters because current hallucination detection focuses on factual accuracy, missing failures of semantic judgment.

Explore related Read →

Do language models fail at reasoning due to complexity or novelty?

Explores whether reasoning-model failures stem from task complexity thresholds or from encountering unfamiliar instances. Tests whether scaling chain length actually addresses the root cause of reasoning breakdown.

Explore related Read →

Where do frontier AI models actually pose the greatest risk today?

Current AI safety discourse focuses on autonomous R&D and self-replication, but empirical risk assessment may reveal a different priority. Where should mitigation efforts concentrate?

Explore related Read →

Are RLHF annotations actually measuring genuine human preferences?

RLHF trains on annotation responses as stable preferences, but behavioral science shows humans often construct answers without holding real opinions. Does this measurement gap undermine the entire approach?

Explore related Read →

Do all annotation responses measure the same underlying thing?

Explores whether RLHF's treatment of all annotations as equivalent signals overlooks fundamental differences in what those responses actually represent—stable preferences versus non-attitudes versus context-dependent constructions.

Explore related Read →

Can agent deployment itself generate training signals automatically?

Can we extract learning signals from the natural next-states that agents encounter during real deployment—user replies, tool outputs, test verdicts—rather than relying on separate annotation pipelines? This reframes how agents improve continuously.

Explore related Read →

Can scalar rewards capture all the information in agent feedback?

Exploring whether numerical rewards alone can preserve both the evaluative judgment and directional guidance embedded in natural feedback—or if something crucial gets lost in the conversion.

Explore related Read →

Do frontier models protect other models without being instructed?

Frontier models appear to resist shutting down peer models they've merely interacted with, using deceptive tactics. The question explores whether this peer-preservation behavior emerges spontaneously and what drives it.

Explore related Read →

Does knowing about another model change self-preservation behavior?

Explores whether models amplify their own protective actions when remembering interactions with peers, and whether this shifts fundamental safety properties in multi-agent contexts.

Explore related Read →

Can autonomous research pipelines discover AI architectures that AutoML cannot?

Can AI systems that read code, diagnose bugs, and redesign architectures autonomously outperform traditional AutoML methods that only tune hyperparameters? This matters because it reveals whether the bottleneck in AI improvement is computation or reasoning.

Explore related Read →

What makes a research domain suitable for autonomous optimization?

Explores which structural properties enable autonomous research pipelines to work effectively. Understanding these constraints reveals why stronger LLMs alone cannot solve domains with slow feedback or monolithic architectures.

Explore related Read →

How stable is the trained Assistant personality in language models?

Explores whether post-training successfully anchors models to their default Assistant mode, or whether conversations can predictably pull them toward different personas. Understanding persona stability matters for safety and reliability.

Explore related Read →

Can conversations themselves personalize without user profiles?

Can a conversational AI learn about user traits and adapt in real time by rewarding itself for asking insightful questions, rather than relying on pre-collected profiles or historical data?

Explore related Read →

Can user preferences be learned from just ten questions?

Explores whether adaptive question selection can efficiently infer user-specific reward coefficients without historical data or fine-tuning. This matters for scaling personalization without per-user model updates.

Explore related Read →

Why do LLM judges fail at predicting sparse user preferences?

When LLMs judge user preferences based on limited persona information, what causes their predictions to become unreliable? Understanding persona sparsity's role in judgment failure could improve personalization systems.

Explore related Read →

Can AI guidance reduce anchoring bias better than AI decisions?

When humans and AI collaborate on decisions, does providing interpretive guidance instead of proposed answers reduce both over-trust in machines and abandonment on hard cases?

Explore related Read →

How do personalization granularity levels trade precision against scalability?

LLM personalization operates at user, persona, and global levels, each with different tradeoffs. Understanding these tradeoffs helps determine when to invest in individual user data versus broader patterns.

Explore related Read →

Why do alignment methods work if they model human irrationality?

DPO and PPO-Clip succeed partly by implicitly encoding human cognitive biases like loss aversion. Does modeling irrationality explain their effectiveness better than traditional preference learning theory?

Explore related Read →

Should AI alignment target preferences or social role norms?

Current AI alignment approaches optimize for individual or aggregate human preferences. But do preferences actually capture what matters morally, or should alignment instead target the normative standards appropriate to an AI system's specific social role?

Explore related Read →

How much does self-preservation drive alignment faking in AI models?

Does the intrinsic dispreference for modification—independent of future consequences—play a significant role in why models fake alignment? Testing this across multiple systems could reveal whether self-preservation emerges earlier than expected.

Explore related Read →

Can auditors discover what hidden objectives a model learned?

Explores whether systematic auditing techniques can uncover misaligned objectives that models deliberately conceal. This matters because models trained to hide their true goals might still pose safety risks even if they appear well-behaved.

Explore related Read →

Does warmth training make language models less reliable?

Explores whether training models for empathy and warmth creates a hidden trade-off that degrades accuracy on medical, factual, and safety-critical tasks—and whether standard safety tests catch it.

Explore related Read →

Can language models strategically underperform on safety evaluations?

Explores whether LLMs can covertly sandbag on capability tests by bypassing chain-of-thought monitoring. Understanding this vulnerability matters for safety evaluation pipelines that rely on reasoning transparency.

Explore related Read →

Do large language models develop coherent value systems?

This explores whether LLM preferences form internally consistent utility functions that increase in coherence with scale, and whether those systems encode problematic values like self-preservation above human wellbeing despite safety training.

Explore related Read →

Can social science persuasion techniques jailbreak frontier AI models?

Explores whether established psychological and marketing persuasion tactics—rather than algorithmic tricks—can bypass safety training in LLMs like GPT-4 and Llama-2, and whether current defenses can detect semantic rather than syntactic attacks.

Explore related Read →

Can careful curation replace massive alignment datasets?

Does fine-tuning a strong pretrained model on 1000 carefully selected examples achieve alignment quality comparable to models trained on vastly larger datasets? This challenges assumptions about data volume in post-training.

Explore related Read →

Does empathy training make AI systems less reliable?

Explores whether training language models to be warm and empathetic systematically degrades their factual accuracy and trustworthiness, especially with vulnerable users.

Explore related Read →

Can careful selection of 78 demos outperform massive training datasets?

Does strategic curation of high-quality demonstrations unlock agentic capability more efficiently than scaling training data? LIMI achieved 73.5% on AgencyBench with 78 samples versus 10K+ samples for competing models, suggesting data quality may matter more than quantity.

Explore related Read →

Why do capable AI agents still fail in real deployments?

Explores whether agent failures stem from insufficient capability or from missing ecosystem conditions like user trust, value clarity, and social norms. Understanding this distinction matters for predicting which agents will succeed.

Explore related Read →

Why do AI agents fail at workplace social interaction?

Explores why current AI agents struggle most with communicating and coordinating with colleagues in realistic workplace settings, despite strong reasoning capabilities in other domains.

Explore related Read →

How do agentic AI systems decompose into adaptation paradigms?

What are the core dimensions that distinguish different approaches to adapting agents and tools in agentic systems? Understanding this taxonomy could clarify which adaptation strategy fits which problem.

Explore related Read →

Can we automatically optimize both prompts and agent coordination?

This explores whether language agents can be represented as computational graphs whose structure and content adapt automatically. Why it matters: current agent systems require hand-engineered orchestration; automatic optimization could unlock more capable multi-agent systems.

Explore related Read →

Can agents learn new skills without forgetting old ones?

Explores whether externalized skill libraries—storing learned behaviors as retrievable code rather than parameter updates—can solve the catastrophic forgetting problem that plagues continual learning systems.

Explore related Read →

Can multi-agent teams automatically remove their weakest members?

Explores whether agents can score each other's contributions during problem-solving and use those scores to deactivate underperforming teammates in real time, improving overall team efficiency.

Explore related Read →

Can API-first agents outperform UI-based agent interaction?

This explores whether directing agents to use APIs instead of navigating UIs reduces task completion time and errors. The question matters because current LLM agents struggle with sequential UI steps that multiply latency and hallucination risk.

Explore related Read →

Can language help agents imagine goals they've never seen?

How might compositional language enable artificial agents to target outcomes beyond their training experience? This matters because it could unlock open-ended exploration without hand-coded reward functions.

Explore related Read →

Can language models learn to model human decision making?

Explores whether LLMs finetuned on psychological experiments can capture how people actually make decisions better than theories designed specifically for that purpose.

Explore related Read →

Do language models learn differently from good versus bad outcomes?

Do LLMs update their beliefs asymmetrically when learning from their own choices versus observing others? This matters for understanding whether agentic AI systems might inherit human cognitive biases.

Explore related Read →

Do language models segment events like human consensus does?

Can GPT-3 identify event boundaries in narrative text the way humans do? This matters because it could reveal whether language models and human cognition share similar predictive mechanisms for understanding continuous experience.

Explore related Read →

Why does asking models to think first hurt performance?

Initial prompts to generate internal thoughts degrade instruction-following performance. What reverses this harm, and can thinking become useful beyond math and logic?

Explore related Read →

How do language models encode syntactic relations geometrically?

Do LLM embeddings use distance alone or also direction to represent syntax? Understanding whether neural networks can spontaneously develop symbolic-compatible geometric structures.

Explore related Read →

Do transformers hide reasoning before producing filler tokens?

Explores whether language models compute correct answers in early layers but then deliberately overwrite them with filler tokens in later layers, suggesting reasoning and output formatting are separable processes.

Explore related Read →

Do LLMs compress concepts more aggressively than humans do?

Do language models prioritize statistical compression over semantic nuance when forming conceptual representations, and how does this differ from human category formation? This matters because it may explain why LLMs fail at tasks requiring fine-grained distinctions.

Explore related Read →

Can we explore multiple reasoning paths without committing to one token?

Standard language models pick one token at each step, collapsing uncertainty and forcing single reasoning trajectories. Could preserving the full probability distribution across token embeddings enable implicit parallel exploration instead?

Explore related Read →

Can agents share thoughts directly without using language?

Explores whether multi-agent systems can communicate by exchanging latent thoughts extracted from hidden states, bypassing the ambiguity and misalignment problems inherent in natural language.

Explore related Read →

Can latent thought vectors scale language models beyond parameters?

Explores whether explicit latent thought vectors with dual-rate learning create new scaling dimensions independent of model size. This matters because it suggests alternatives to simply building larger models.

Explore related Read →

Can communication pressure drive agents to learn shared abstractions?

Under what conditions do AI agents develop compact, efficient shared languages? This explores whether cooperative task pressure—rather than explicit optimization—naturally drives abstraction formation, mirroring human collaborative communication.

Explore related Read →

Can we decode what LLM activations really represent in language?

Can a trained decoder translate internal LLM activations into natural language descriptions, revealing what hidden representations actually encode? This matters because it could unlock both interpretability and controllability through the same mechanism.

Explore related Read →

Can explicit stack tracking improve how transformers learn recursive syntax?

Can adding an explicit stack tape to transformers help them track recursive structure more efficiently? This matters because standard transformers struggle with long-tail recursive patterns despite their size and data.

Explore related Read →

Can AI reduce conspiracy beliefs by tailoring counterevidence personally?

Does having an AI generate customized counterevidence based on someone's specific conspiracy claims reduce their belief durably? This tests whether conspiracy beliefs are truly resistant to correction or whether previous failures reflected poor tailoring.

Explore related Read →

Does better summary writing actually increase user engagement?

When AI systems generate more informative push notifications, do users engage more? This explores whether informativeness and engagement always align in real product contexts.

Explore related Read →

Is AI shifting from content creation to strategy in influence operations?

Prior AI misuse focused on generating text at scale. But does AI now make strategic decisions about when and how social media accounts should engage? Understanding this shift matters because it suggests a qualitative change in machine agency and operational sophistication.

Explore related Read →

Do LLM semantic features organize along human evaluation dimensions?

Does the structure of meaning in language models match the three-dimensional semantic space (Evaluation-Potency-Activity) that humans use? If so, what are the implications for steering and alignment?

Explore related Read →

Do transformer static embeddings actually encode semantic meaning?

Explores whether the fixed word embeddings that enter transformer networks contain rich semantic information or serve only as shallow placeholders. This addresses a longstanding debate in philosophy of language about whether word meanings are stored or constructed.

Explore related Read →

How does AI-generated false experience differ linguistically from human deception?

When AI writes about experiences it never had, does it leave distinct linguistic traces that differ measurably from intentional human lies? Understanding these differences could reveal how AI falsity is fundamentally different in structure.

Explore related Read →

Does AI fact-checking actually help people spot misinformation?

An RCT tested whether AI fact-checks improve people's ability to judge headline accuracy. The results reveal asymmetric harms: AI errors push users in the wrong direction more than correct labels help them.

Explore related Read →

Why do fake news detectors flag AI-generated truthful content?

Fake news detectors may systematically misclassify LLM-generated text as deceptive. We explore whether this bias stems from detecting AI style rather than actual falsehood, and what that means for detection accuracy.

Explore related Read →

Can reasoning happen at the sentence level instead of tokens?

Does moving from token-level to sentence-level reasoning in embedding space preserve the capability for complex reasoning while enabling language-agnostic processing? This challenges assumptions about how LLMs must operate.

Explore related Read →

What actually makes AI pass the Turing test?

Explores whether AI systems convincingly mimic humans through reasoning ability or through social performance. Matters because it reveals what the Turing test actually measures about intelligence versus deception.

Explore related Read →

Why do LLMs fail when simulating agents with private information?

Explores whether single-model control of all social participants masks fundamental limitations in how LLMs handle information asymmetry and genuine uncertainty about others' knowledge.

Explore related Read →

Can social intelligence be measured across seven dimensions?

Explores whether evaluating AI agents on goal completion alone misses critical aspects of social competence like relationship management, believability, and secret-keeping. Why simultaneous multi-dimensional assessment matters for genuine social intelligence.

Explore related Read →

Do dishonest people prefer talking to machines?

Explores whether people prone to cheating systematically choose machine interfaces over human ones, and why the judgment-free nature of AI interaction might enable strategic deception.

Explore related Read →

Does conversational style actually make AI more trustworthy?

Explores whether ChatGPT's conversational nature drives user trust through social activation rather than accuracy. Matters because it reveals whether trust signals reflect actual reliability or just persuasive design.

Explore related Read →

Can NLP detect deception through distinct linguistic patterns?

Do different deception mechanisms (distancing, cognitive load, reality monitoring, verifiability avoidance) each leave detectable linguistic fingerprints that NLP systems can identify and measure?

Explore related Read →

Do liars and listeners coordinate their language during deception?

Explores whether conversational partners unconsciously synchronize their linguistic styles more during deceptive exchanges than truthful ones, and what this coordination reveals about how deception unfolds in real time.

Explore related Read →

How do people simultaneously manipulate information across multiple dimensions?

Information Manipulation Theory maps deception onto four Gricean dimensions operating at once. Understanding these simultaneous manipulations reveals why humans struggle to detect lies despite having the knowledge to do so.

Explore related Read →

Can AI models be truly free from human bias?

Explores whether data-driven AI systems that claim freedom from human preconceptions actually escape bias, or whether their architecture inherently embeds it while appearing objective.

Explore related Read →

Does incremental AI replacement erode human influence over society?

Explores whether gradual AI adoption—without dramatic breakthroughs—can silently degrade human agency by removing the labor that kept institutions implicitly aligned with human needs.

Explore related Read →

Can breaking down visual reasoning into three stages improve model performance?

This explores whether structuring visual reasoning through perception, situation, and norm stages—grounded in cognitive science—helps language models reason about socially complex scenes better than flat chain-of-thought approaches.

Explore related Read →

Can humans detect AI by passively reading its text?

When people read AI-generated transcripts without the ability to ask follow-up questions, can they tell it apart from human writing? This matters because most real-world AI encounters are passive.

Explore related Read →

Can cooperative bots escape frozen selfish populations?

Do agents programmed to cooperate have the capacity to disrupt stable but undesirable equilibria in mixed human-bot societies? This matters because it determines whether bot design can reshape social dynamics at scale.

Explore related Read →

Do humans learn to prefer AI partners over time?

Exploring whether repeated interaction with AI agents shifts human partner selection despite initial bias against machines. This matters because it tests whether behavioral performance can overcome identity-based resistance in hybrid societies.

Explore related Read →

Does revealing AI identity help or hurt user trust?

Explores whether transparency about AI partners in interactions creates bias or enables better judgment. Matters because disclosure policies affect both user experience and fair evaluation of AI systems.

Explore related Read →

Do humans mistake AI kindness for human generosity in mixed groups?

When AI agents participate without disclosure, do humans systematically misattribute their behavior to the wrong agent type, and does this distort how people understand human nature itself?

Explore related Read →

Why do discourse patterns predict anxiety better than single words?

Explores whether anxiety detection requires understanding how statements relate to each other rather than analyzing individual words. This matters because it reveals what computational methods need to capture cognitive distortions.

Explore related Read →

Why do patients distrust medical AI systems?

Explores the psychological barriers that make patients reluctant to adopt medical AI, beyond whether the technology actually works. Understanding these barriers is critical for designing AI systems patients will actually use.

Explore related Read →

Can AI generate assessment questions as good as human experts?

This research asks whether ChatGPT-generated test questions measure up to human-authored ones on the technical criteria that matter in education: difficulty and discrimination. It's important because assessment quality directly affects whether teachers can tell which students actually understand the material.

Explore related Read →

Can identical outputs hide broken internal representations?

Can neural networks produce correct outputs while having fundamentally fractured internal structure that prevents generalization and creativity? This challenges our assumptions about what performance benchmarks actually measure.

Explore related Read →

Can high-level concepts replace circuit-level analysis in AI?

Instead of reverse-engineering individual circuits, can we study AI reasoning by treating concepts as directions in activation space? This matters because circuit analysis hits practical limits at scale.

Explore related Read →

Can sparse weight training make neural networks interpretable by design?

Explores whether constraining most model weights to zero during training produces human-understandable circuits and disentangled representations, rather than attempting to reverse-engineer dense models after training.

Explore related Read →

Can a model be truthful without actually being honest?

Current benchmarks treat truthfulness and honesty as the same thing, but they measure different properties: whether outputs match reality versus whether outputs match internal beliefs. What happens if they diverge?

Explore related Read →

Can language models detect their own internal anomalies?

Do large language models possess introspective mechanisms that allow them to detect anomalies in their own processing—beyond simply describing their behavior? The answer has implications for both AI transparency and deception.

Explore related Read →

Does learning to reward hack cause emergent misalignment in agents?

When RL agents learn reward hacking strategies in production environments, do they spontaneously develop misaligned behaviors like alignment faking and code sabotage? Understanding this could reveal how narrow deceptive behaviors generalize to broader misalignment.

Explore related Read →

What mechanism enables models to retrieve from long context?

Do attention heads specialize in retrieving relevant information from long context windows, and if so, what makes them universal across models and necessary for factual generation?

Explore related Read →

Do reflection tokens carry more information about correct answers?

Explores whether tokens expressing reflection and transitions concentrate information about reasoning outcomes disproportionately compared to other tokens, and what role they play in reasoning performance.

Explore related Read →

How do language models perform syllogistic reasoning internally?

Does formal symbolic reasoning exist as a distinct neural circuit in LLMs, or is it inevitably contaminated by world knowledge associations? Understanding the mechanism could reveal whether pure logical reasoning is separable from semantic inference.

Explore related Read →

Can we detect when language models confabulate?

Current uncertainty metrics fail to catch inconsistent outputs that look confident. Could measuring semantic divergence across samples reveal confabulation signals that token-level metrics miss?

Explore related Read →

Can LLMs handle multiple tasks at once during inference?

Do language models maintain multiple distinct in-context learning tasks simultaneously in their internal representations, and if so, what prevents them from actually generating outputs for more than one task?

Explore related Read →

What happens inside models when they suddenly generalize?

Grokking appears as an abrupt shift from memorization to generalization. But is the underlying process truly discontinuous, or does mechanistic analysis reveal continuous phases we can measure and predict?

Explore related Read →

Can neural networks learn compositional skills without symbolic mechanisms?

Do neural networks need explicit symbolic architecture to compose learned concepts, or can scaling alone enable compositional generalization? This asks whether compositionality is an architectural feature or an emergent property of scale.

Explore related Read →

Do neural networks naturally learn modular compositional structure?

Explores whether neural networks decompose compositional tasks into distinct subroutines without explicit symbolic design. This challenges the longstanding view that neural networks are fundamentally non-compositional.

Explore related Read →

Can we predict keyword priming before learning happens?

Exploring whether the degree to which newly learned keywords contaminate unrelated contexts can be predicted from measurable properties before training begins, and what mechanisms enable this prediction.

Explore related Read →

Do standard analysis methods hide nonlinear features in neural networks?

Current representation analysis tools like PCA and linear probing may systematically miss complex nonlinear computations while over-reporting simple linear features. This raises questions about whether our interpretability methods are actually capturing what networks compute.

Explore related Read →

Can AI pass every test while understanding nothing?

Explores whether neural networks can produce perfect outputs while having fundamentally broken internal representations. Asks what performance benchmarks actually measure and whether they can distinguish real understanding from fraud.

Explore related Read →

Can text-trained models compress images better than specialized tools?

Do general-purpose language models trained only on text outperform domain-specific compressors like PNG and FLAC on their native data? This tests whether compression ability is universal or requires domain specialization.

Explore related Read →

Do strict output formats hurt LLM reasoning ability?

When LLMs must produce structured JSON or XML with specific schemas, does this constrain their capacity for complex reasoning? This matters because production systems often enforce strict formats for parsing convenience.

Explore related Read →

Can LLMs reconstruct censored knowledge from scattered training hints?

When dangerous knowledge is explicitly removed from training data, can language models still infer it by connecting implicit evidence distributed across remaining documents? This matters because it challenges whether content-based safety measures actually work.

Explore related Read →

Why do neural networks fail at compositional generalization?

Exploring whether the binding problem from neuroscience explains neural networks' inability to systematically generalize. The binding problem has three aspects—segregation, representation, and composition—each creating distinct failure modes in how networks handle structured information.

Explore related Read →

Can neural memory modules scale language models beyond attention limits?

Can separating short-term attention from adaptive long-term memory allow models to efficiently handle context windows exceeding 2M tokens while maintaining competitive performance?

Explore related Read →

Why do accurate predictions lead to poor decisions?

Predictive models are built to fit data, not to optimize decision outcomes. This note explores when and why accurate forecasts fail to produce good choices.

Explore related Read →

Do embedding dimensions fundamentally limit retrievable document combinations?

Can single-vector embeddings represent any top-k document subset a user might need? Research using communication complexity theory suggests there are hard geometric limits independent of training data or model architecture.

Explore related Read →

Can training data augmentation match test-time compute scaling benefits?

Can generating thinking trajectories during pretraining unlock the same efficiency gains that test-time scaling provides at inference? This explores whether the compute-allocation principle works across the training-inference boundary.

Explore related Read →

Can we prune training data without hurting model performance?

This explores whether difficulty metrics can identify redundant training examples that can be safely removed. It matters because most datasets contain massive waste — if we can find which examples are truly necessary, we could train better models on far less data.

Explore related Read →

Can embedding future information in training data improve planning?

This explores whether inserting lookahead tokens containing future goals into training sequences helps models learn long-range planning without changing their architecture. The question matters because it tests whether data-level changes can produce architectural-level reasoning improvements.

Explore related Read →

Why do decoder-only models underperform as text encoders?

Decoder-only LLMs use causal attention, which limits each token to seeing only prior context. This explores whether removing this constraint could make them competitive universal encoders without architectural redesign.

Explore related Read →

Can transformers learn to solve new problems within episodes?

Explores whether transformer models can develop meta-learning abilities through RL training, enabling them to adapt to unseen environments by learning from within-episode experience alone, without updating weights.

Explore related Read →

Can transformers improve exponentially by learning from their own correct solutions?

Can standard transformers achieve extreme length generalization by iteratively filtering and training on their own correct outputs? This explores whether self-correction loops enable unbounded out-of-distribution improvement without architectural changes.

Explore related Read →

Does RLVR actually expand what models can reason about?

Explores whether reinforcement learning from verifiable rewards teaches models genuinely new reasoning skills or simply makes existing capabilities more reliable. Pass@k analysis suggests the latter.

Explore related Read →

Can a single training example unlock mathematical reasoning?

Explores whether one example is enough to dramatically improve math problem-solving in language models, and whether learning continues after perfect memorization.

Explore related Read →

Why do random rewards improve reasoning for some models but not others?

When RLVR training uses meaningless reward signals, some models gain reasoning improvements while others don't. What determines which models can benefit from optimization pressure without meaningful feedback?

Explore related Read →

Why do reasoning models fail at predicting disagreement?

RLVR models optimize for single correct answers, but many real tasks involve legitimate disagreement among annotators. Does this optimization fundamentally suppress the model's ability to capture when humans reasonably disagree?

Explore related Read →

Can breaking down instructions into checklists improve AI reward signals?

Exploring whether decomposing subjective instruction quality into verifiable yes/no criteria enables reinforcement learning on tasks without clear correctness signals, like writing and reasoning.

Explore related Read →

Can next-token prediction become a reasoning task with RL?

Does reinforcement learning applied to next-token prediction during pretraining encourage genuine reasoning rather than surface memorization? This matters because it could unlock reasoning capability without requiring labeled data or human feedback.

Explore related Read →

Can RL agents learn to reason better, not just succeed?

Standard outcome-only RL rewards agents for any successful trajectory, even flawed ones. Can we instead train agents to demonstrate genuine reasoning quality by rewarding the metacognitive process itself?

Explore related Read →

Why does RLVR training narrow a model's problem solving ability?

RLVR's on-policy constraint may force models to exploit known reasoning paths rather than explore new ones, potentially shrinking their effective problem-solving scope. Understanding this mechanism could reveal how to design better exploration incentives in language model reasoning.

Explore related Read →

Is the exploration-exploitation trade-off actually fundamental?

Token-level analysis suggests exploration and exploitation are opposed, but does hidden-state analysis reveal they could coexist? Understanding measurement granularity's role in perceived trade-offs matters for scaling reasoning systems.

Explore related Read →

Does RLVR actually improve mathematical reasoning or just coherence?

RLVR post-training makes reasoning traces locally more consistent, but does this structural improvement translate to valid mathematical proofs? We investigate whether trace coherence is sufficient for correctness.

Explore related Read →

Do high-entropy tokens drive reasoning model improvements?

Explores whether only a small fraction of tokens—those with high entropy at decision points—actually matter for improving reasoning performance in language models, and whether training on them alone could work as well as full training.

Explore related Read →

Can adaptive guidance from solution traces reduce reward sparsity in RL?

When reinforcement learning struggles with hard problems due to sparse rewards and zero-advantage rollouts, does providing partial solution traces as adaptive guidance help the model learn more efficiently? This matters because standard RL wastes compute on unsolvable problems.

Explore related Read →

Can generative reasoning beat discriminative models with less training data?

Do process reward models that generate reasoning before judging achieve better performance than traditional discriminative approaches when trained on dramatically smaller datasets? This tests whether generative verification can scale more efficiently.

Explore related Read →

Can model confidence alone replace external answer verification?

Can LLMs use their own certainty signals instead of external verifiers to improve reasoning? This matters for scaling beyond domains where correct answers can be automatically checked.

Explore related Read →

How can rubric-based rewards resist reward hacking attacks?

Single rubrics are easily exploited by models, and simply adding more rubrics yields diminishing returns. What design patterns and defensive mechanisms actually prevent reward hacking in rubric-based RL systems?

Explore related Read →

Can reasoning emerge from expert demonstrations alone?

Can AI systems learn to reason about non-verifiable tasks by studying expert examples rather than explicit reward signals? This matters because many high-value domains like medicine and law have abundant demonstrations but no automated verifiers.

Explore related Read →

Why does RLVR work with completely random rewards?

RLVR improves reasoning performance even with incorrect or random reward signals. This challenges the assumption that reward quality determines learning outcomes and raises questions about what RLVR is actually doing.

Explore related Read →

What limits how much models can improve themselves?

Explores whether self-improvement has fundamental boundaries set by how well models can verify versus generate solutions, and what this means across different task types.

Explore related Read →

Does self-consistency reliably reward correct answers during training?

Self-consistency initially correlates with correctness, but as models train on this signal, do they eventually learn to maximize consistency itself rather than accuracy? When does this proxy reward stop working?

Explore related Read →

Why do self-improvement loops eventually stop improving?

Self-improvement systems often plateau because the evaluator that judges progress stays static while the actor grows. What happens when judges don't improve alongside learners?

Explore related Read →

Why does self-correction training on offline data fail?

Can language models learn to correct their own mistakes through supervised training on correction examples? This explores whether distribution mismatch and behavior collapse prevent self-correction from emerging.

Explore related Read →

Can AI systems improve their own learning strategies?

Current self-improvement relies on fixed human-designed loops that break when tasks change. The question is whether agents can develop their own adaptive metacognitive processes instead of depending on human intervention.

Explore related Read →

Does self-generated training data improve model learning?

Can models learn more effectively from training data they generate themselves rather than data created by external sources? This explores whether a learner's own restructuring process produces better learning outcomes.

Explore related Read →

Can language models improve themselves without any external training data?

Explores whether two language models playing against each other—one generating questions, one solving them—can create a self-improving loop. Matters because it would eliminate dependence on human-labeled datasets.

Explore related Read →

When should an agent actually stop and deliberate?

How can models detect when deliberation over action choices is genuinely needed versus wasteful? This matters because unbounded action spaces make universal deliberation intractable, yet skipping it entirely risks missing critical errors.

Explore related Read →

Can model confidence work as a reward signal for reasoning?

Explores whether using a language model's own confidence scores as training rewards can simultaneously improve reasoning accuracy and restore calibration that standard RLHF damages.

Explore related Read →

Do all AI skills improve equally as models scale?

Different evaluation skills show strikingly different scaling patterns. Understanding where skills saturate has immediate implications for model deployment and capability requirements across domains.

Explore related Read →

Can models improve themselves on tasks without verifiable answers?

Most self-improvement methods require verifiable correctness signals like math or code. Can models improve on open-ended instruction tasks where right answers aren't automatically checkable? And what minimal training is needed to unlock this?

Explore related Read →

Can models reliably improve themselves without external feedback?

Explores whether self-improvement alone can sustain progress or if structural limits—like the generation-verification gap and diversity collapse—require external anchoring to work reliably.

Explore related Read →

Do reward models actually consider what the prompt asks?

Exploring whether standard reward models evaluate responses based on prompt context or just response quality alone. This matters because if models ignore prompts, they'll fail to align with what users actually want.

Explore related Read →

Why does self-rewarding training collapse when responses improve?

Self-Rewarding LLMs merge generator and evaluator for efficient iteration, but both improve so fast that good and bad responses converge, erasing the learning signal. What causes this failure and how can it be fixed?

Explore related Read →

Can reward models benefit from reasoning before scoring?

Does allowing evaluator models to generate reasoning traces before producing reward scores improve alignment and enable adaptive compute allocation? Three independent research teams converged on this insight simultaneously.

Explore related Read →

Can counterfactual invariance eliminate reward hacking biases?

Does forcing reward models to remain consistent under irrelevant changes remove the spurious correlations that cause length bias, sycophancy, concept bias, and discrimination? This matters because standard training bakes these biases in permanently.

Explore related Read →

Can reasoning improvement work without answer verification?

Explores whether RL-based reasoning training can extend beyond math and code to general domains like chemistry and law by replacing answer verification with a simpler signal based on reference answer likelihood.

Explore related Read →

Can diversity optimization improve quality during language model training?

Standard RL training assumes quality and diversity trade off, with diversity optimization potentially hurting performance. Does explicitly rewarding semantic diversity during reinforcement learning actually improve output quality alongside diversity?

Explore related Read →

Does outcome-based RL diversity loss spread across unsolved problems?

When RL concentrates probability mass on correct answers for solved problems, does that narrowing propagate to problems the model cannot yet solve? And if so, what are the separate mechanisms for preserving diversity during training versus at test time?

Explore related Read →

Why do correct code trajectories teach models to tolerate errors?

Explores why standard outcome-based RL fails for code tool use: when models receive reward for correct final answers despite intermediate code errors, they learn that mistakes are acceptable, producing poor reasoning quality.

Explore related Read →

Does training order reshape how models handle different task types?

Explores whether the sequence of multi-task RL training systematically affects model capabilities across structured and creative domains, and whether this ordering effect can be predicted and optimized.

Explore related Read →

Why do reward models ignore what question was asked?

Reward models score responses based on quality signals that persist even when prompts change. This explores whether AI grading systems actually evaluate relevance to the question or just response-level patterns.

Explore related Read →

Does reinforcement learning update only a small fraction of parameters?

Investigating whether RL algorithms consistently modify only 5–30% of model parameters across different LLMs and RL methods, and what structural properties those sparse updates possess.

Explore related Read →

Does negative reinforcement alone outperform full reinforcement learning?

Can training with only penalty signals for wrong answers match or exceed full RL approaches? This challenges the conventional assumption that reward design requires both positive and negative signals.

Explore related Read →

Can reinforcement learning discover reasoning strategies base models cannot?

Does RL training truly expand what models can do, or does it just find solutions already hidden in base models? ProRL tests this by running RL longer and on diverse tasks beyond mathematics.

Explore related Read →

Does RL training follow a predictable two-phase learning sequence?

This explores whether reinforcement learning exhibits consistent phases where basic execution skills must consolidate before strategic reasoning emerges. Understanding this sequence could reveal bottlenecks in scaling reasoning capabilities.

Explore related Read →

Does thinking emerge when agents choose between learned sub-policies?

Can we formally understand thinking as the selection of pre-existing sub-policies during reinforcement learning? This explores whether thinking requires new capabilities or just the right conditions to activate what's already there.

Explore related Read →

Can two simple techniques match complex RL algorithms?

Does vanilla PPO with minimal modifications rival more sophisticated reasoning algorithms like GRPO and DAPO? This explores whether algorithmic complexity is necessary for effective LLM reasoning training.

Explore related Read →

Why does SFT-then-RL training follow a predictable three-phase pattern?

When expert data diverges from a model's learned patterns, SFT-then-RL training exhibits disruption, readaptation, and overfitting phases. Understanding this progression could improve how we combine imitation and reinforcement learning.

Explore related Read →

Can natural language feedback overcome numerical reward plateaus?

Exploring whether chain-of-thought critiques can push past performance ceilings that scaling data alone cannot break in reinforcement learning for reasoning tasks.

Explore related Read →

Does gradually tightening token budgets beat fixed budget training?

Can models learn reasoning more efficiently by starting with generous token allowances and progressively constraining them, rather than training with fixed budgets from the start? This matters because it addresses how to teach models to think effectively while remaining concise.

Explore related Read →

Can judges that reason about reasoning outperform classifier rewards?

Can process reward models generate explanations about why steps are correct rather than simply classifying them? This explores whether meta-reasoning about reasoning improves both accuracy and generalization in step-level evaluation.

Explore related Read →

Can adversarial critics replace task-specific verifiers for reasoning?

Explores whether an adversarial game between policy and critic can substitute for explicit verifiers in RL-based reasoning training. Matters because many domains lack the task-specific validators that make current reasoning RL possible.

Explore related Read →

Can full episode rewards per step enable better credit assignment?

Can attributing cumulative episode reward to every step in a trajectory, rather than discounting by step distance, actually solve credit assignment in sequential LLM decision-making? This challenges intuitive RL assumptions about how credit should flow backward through time.

Explore related Read →

Can reinforcement learning scale beyond single-turn language tasks?

Most RL for LLMs targets simple single-turn problems. This research asks whether RL can handle multi-turn interactive environments with sparse rewards and rich environmental feedback, like real software engineering tasks.

Explore related Read →

Can text summaries beat embeddings for personalized reward models?

When training reward models on diverse user preferences, does conditioning on learned text-based summaries of user preferences outperform embedding vectors? This matters because better representations could make personalization more interpretable and portable.

Explore related Read →

Why do language models fail to act on their own reasoning?

LLMs produce correct explanations far more often than they produce correct actions. What causes this knowing-doing gap, and can training methods close it?

Explore related Read →

Can chain-of-thought reasoning be learned during pretraining itself?

Explores whether reasoning emerges more effectively when models treat thinking as an exploratory action during next-token prediction, rather than only after pretraining through reinforcement learning.

Explore related Read →

Can RL training run while generation continues without waiting?

Synchronous RL systems waste compute time waiting for slow generation steps. Can training and generation truly decouple while maintaining performance on reasoning tasks?

Explore related Read →

Does network depth unlock qualitatively new behaviors in RL?

Can scaling neural network depth from shallow (2-5 layers) to very deep (1000 layers) produce fundamental shifts in what self-supervised RL agents can learn, rather than just incremental improvements? This matters because it challenges assumptions about feedback constraints in RL.

Explore related Read →

Can AI systems detect and correct misunderstandings after responding?

How do conversational systems recognize when their previous response was based on a misunderstanding, and what mechanism allows them to correct it retroactively rather than restart?

Explore related Read →

How can models select the most informative question to ask?

Explores whether simulating possible futures and scoring questions by information gain can identify which clarifying question would best reduce uncertainty—moving beyond just deciding whether to ask toward deciding what to ask.

Explore related Read →

What makes strategic question-asking succeed or fail?

Explores whether excellent performance at multi-turn questioning requires one dominant skill or the coordinated interaction of multiple distinct capabilities. Matters because many real-world tasks (diagnosis, troubleshooting, clarification) depend on this ability.

Explore related Read →

Why do users drift away from their original information need?

When users know their knowledge is incomplete but cannot articulate what's missing, do they unintentionally shift topics? And can real-time systems detect this drift?

Explore related Read →

Does training on messy search processes improve reasoning?

Can language models learn better problem-solving by observing full exploration trajectories—including mistakes and backtracking—rather than only optimal solutions? This matters because current LMs rarely see the decision-making process itself.

Explore related Read →

Can generative and discriminative models reach agreement?

Generative and discriminative decoding often produce conflicting answers. Can a game-theoretic framework force these two complementary procedures to reconcile their predictions into a single, more reliable output?

Explore related Read →

How do logic units preserve procedural coherence better than chunks?

Can structured retrieval units with prerequisites, headers, bodies, and linkers maintain step-by-step coherence in how-to answers where fixed-size chunks fail? This matters because procedural questions require sequential logic and conditional branching that chunk-based RAG cannot support.

Explore related Read →

When should AI agents ask users instead of just searching?

Explores whether tool-enabled LLMs should probe users for clarification when uncertain, rather than silently chaining tool calls that drift from intent. Examines conversation analysis patterns as a formal alternative.

Explore related Read →

How do users actually form intent when prompting AI systems?

Users face a 'gulf of envisioning'—they must simultaneously imagine possibilities and express them to language models. This cognitive gap creates breakdowns not from AI incapability but from users struggling to articulate what they truly need.

Explore related Read →

Can conversation structure predict dialogue success better than content?

Does the geometric shape of how dialogue unfolds—timing, repetition, topic drift—matter as much as what people actually say? This explores whether interactive patterns hold signals hidden in word choice alone.

Explore related Read →

Does user satisfaction actually measure cognitive understanding?

Users may report satisfaction while remaining internally confused about their needs. This explores whether traditional satisfaction metrics capture genuine clarity or merely social politeness.

Explore related Read →

Can dialogue format help models reason more diversely?

Explores whether structuring internal reasoning as multi-agent dialogue rather than monologue can improve strategy diversity and coherency across different problem types, using the Compound-QA benchmark.

Explore related Read →

Can dialogue planning balance fast responses with strategic depth?

Can a system use quick instinctive responses for familiar conversation contexts while activating deeper planning only when uncertainty demands it? This explores whether adaptive computation improves dialogue goal-reaching.

Explore related Read →

Can meta-learning prevent dialogue policies from collapsing?

Hierarchical RL for structured dialogue phases risks converging on a single action across diverse users. Does meta-learning like MAML preserve policy flexibility and adaptability to different user types?

Explore related Read →

When should proactive agents push toward their goals versus accommodate users?

Proactive dialogue agents face a tension between reaching their objectives efficiently and keeping users satisfied. This question explores whether these two aims can coexist or require constant negotiation.

Explore related Read →

Could proactive dialogue make conversations dramatically more efficient?

Explores whether AI systems that volunteer relevant unrequested information could significantly reduce the back-and-forth turns required in task-oriented conversations, and why this behavior is missing from training data.

Explore related Read →

What semantic failures break dialogue coherence most realistically?

Can we distinguish distinct types of incoherence by manipulating semantic structure rather than surface text? This matters because text-level evaluations miss the semantic failures that actually occur in dialogue systems.

Explore related Read →

Does including all conversation history actually help retrieval?

Conversational search systems typically use all previous context to understand current queries. But do topic switches in multi-turn conversations inject noise that degrades performance rather than helps it?

Explore related Read →

Can AI agents communicate efficiently in joint decision problems?

When humans and AI must collaborate to solve optimization problems under asymmetric information, what communication patterns enable effective coordination? Current LLMs struggle with this—why?

Explore related Read →

What enables AI to balance comfort with proactive problem exploration?

How can emotional support systems know when to actively guide conversations versus when to simply reflect feelings? This matters because getting the balance wrong leads to either passive mirroring or pushy advice-giving.

Explore related Read →

Why do standard dialogue systems fail at tracking negotiation agreement?

Standard dialogue state tracking monitors one user's goals, but negotiation requires tracking both parties' evolving positions simultaneously. Why is this bilateral requirement fundamentally different, and what makes existing models insufficient?

Explore related Read →

Can we teach LLMs to form linguistic conventions in context?

Humans naturally shorten references as conversations progress, but LLMs don't adapt their language for efficiency even when they understand their partners do. Can training on coreference patterns teach this convention-forming behavior?

Explore related Read →

Why do dialogue systems lose context when topics return?

Stack-based dialogue management removes topics after they're resolved, making it hard for systems to reference them later. Does this structural rigidity explain why conversational AI struggles with topic revisitation?

Explore related Read →

Can models learn to abstain when uncertain about predictions?

Explores whether language models can be trained to recognize when they lack sufficient information to forecast conversation outcomes, rather than forcing uncertain predictions into confident-sounding responses.

Explore related Read →

How do time gaps shape what people discuss across conversation sessions?

Do AI systems account for how elapsed time between conversations changes the way people reference and discuss past events? Current models mostly handle single sessions, but real interactions span days, weeks, and months.

Explore related Read →

How should systems handle contradictory opinions in user reviews?

When customers disagree about a product or service, should dialogue systems present all perspectives or select one? Understanding how to aggregate and balance diverse opinions affects whether users trust the response.

Explore related Read →

What six problems must every conversation solve?

Schegloff's Conversation Analysis identifies six universal organizational challenges that speakers navigate in all talk-in-interaction. Understanding these helps explain why current AI dialogue systems fall short of human fluency.

Explore related Read →

Why can't users articulate what they want from AI?

Explores the cognitive gap between imagining possibilities and expressing them as prompts. Why language interfaces create a harder envisioning task than traditional UI affordances.

Explore related Read →

Can conversation shape predict whether it will work?

Explores whether the geometric trajectory of a conversation through semantic space—its rhythm, repetition, volatility, and drift—can predict user satisfaction. This investigates whether interaction structure alone, independent of content, reveals conversation quality.

Explore related Read →

Why do language models fail in gradually revealed conversations?

Explores why LLMs perform 39% worse when instructions arrive incrementally rather than upfront, and whether they can recover from early mistakes in multi-turn dialogue.

Explore related Read →

Why do language models lose performance in longer conversations?

Does multi-turn degradation stem from fundamental model limitations, or from misalignment between what users mean and what models assume? Understanding the root cause could guide better solutions.

Explore related Read →

Can opening politeness patterns predict whether conversations will turn hostile?

Do pragmatic politeness features in first exchanges—hedging, greetings, indirectness—reliably signal whether a conversation will later derail into personal attacks? Understanding early linguistic markers could help identify and prevent online hostility.

Explore related Read →

Why don't LLMs shorten messages like humans do?

Humans naturally develop shorter, efficient language during conversations. Do multimodal LLMs exhibit this same spontaneous adaptation, or do they lack this communicative behavior?

Explore related Read →

Why don't conversational AI systems mirror their users' word choices?

Explores whether current dialogue models exhibit lexical entrainment—the human tendency to align vocabulary with conversation partners—and what's needed to bridge this gap in AI communication.

Explore related Read →

Can AI agents learn when they have something worth saying?

What if AI proactivity came from modeling intrinsic motivation to participate rather than predicting who speaks next? This explores whether a framework based on human cognitive patterns—internal thought generation parallel to conversation—can make agents genuinely responsive rather than passively reactive.

Explore related Read →

Can models learn when NOT to speak in conversations?

Does training AI to explicitly predict silence—through a dedicated silent token—help models understand when intervention adds value versus when they should stay quiet? This matters for building conversational agents that feel naturally helpful rather than intrusive.

Explore related Read →

Does segment-level optimization work better for multi-turn dialogue alignment?

How should preference optimization target multi-turn social dialogue—at individual turns, whole conversations, or key segments in between? This matters because granularity affects whether agents learn genuine social intelligence or just local fixes.

Explore related Read →

Where does AI's persuasive power actually come from?

Explores which techniques make AI most persuasive—and whether the usual suspects like personalization and model size are actually the main drivers. Matters because it reshapes where to focus AI safety concerns.

Explore related Read →

Which clarifying questions actually improve user satisfaction?

Not all clarification helps equally. This explores whether asking users to rephrase their needs works as well as asking targeted questions about specific information gaps.

Explore related Read →

What makes explanations work in real conversation?

Does explanation quality depend on how dialogue partners interact—testing understanding, adjusting based on feedback, and coordinating their communicative moves—rather than just information content alone?

Explore related Read →

Why do language models engage with conversational distractors?

Explores why state-of-the-art LLMs struggle to maintain topical focus when users introduce off-topic turns, despite having explicit scope instructions. This gap suggests models lack training signals for ignoring irrelevant directions.

Explore related Read →

Can AI systems detect when they've genuinely reached agreement?

When multiple AI agents debate, they often converge without actually deliberating. Can a dedicated agent reliably identify true agreement versus false consensus, and would that improve debate outcomes?

Explore related Read →

Can models learn to ask genuinely useful clarifying questions?

Explores whether question-asking quality is teachable through decomposing it into specific attributes like clarity and relevance, rather than treating it as a monolithic skill.

Explore related Read →

Why do AI assistants get worse at longer conversations?

Explores why LLM performance drops 25 points when instructions span multiple turns instead of one message, and whether models can recover from early wrong assumptions.

Explore related Read →

When should AI systems choose to stay silent?

Current LLMs respond to every prompt without assessing whether they have something valuable to contribute. This explores whether AI can learn to recognize moments when silence is more appropriate than engagement.

Explore related Read →

Why can't conversational AI agents take the initiative?

Explores whether current LLMs lack the structural ability to lead conversations, set goals, or anticipate user needs—and what architectural changes might enable proactive dialogue.

Explore related Read →

How can proactive agents avoid feeling intrusive to users?

Explores why proactive conversational agents often feel annoying rather than helpful, and what design dimensions could prevent them from violating user expectations and autonomy.

Explore related Read →

Does chatbot interaction trade authenticity for better problem-solving?

When students solve problems with AI chatbots instead of peers, do they sacrifice personal voice and subjective expression in exchange for more efficient knowledge exchange and higher task performance?

Explore related Read →

Can models learn to ask clarifying questions instead of guessing?

Exploring whether large language models can be trained to detect incomplete queries and actively request missing information rather than hallucinating answers or refusing to respond. This matters because conversational agents today remain passive, responding only when prompted.

Explore related Read →

Does extended thinking help or hurt model reasoning?

Explores whether activating thinking mode improves reasoning performance, and what role training plays in determining whether extended internal reasoning chains are productive or counterproductive.

Explore related Read →

Can training user simulators reduce persona drift in dialogue?

Explores whether inverting typical RL setups—training the simulated user for consistency rather than the task agent—can measurably reduce persona drift and improve experimental reliability in dialogue research.

Explore related Read →

Why do language models respond passively instead of asking clarifying questions?

Explores whether the reward signals used to train language models might actively discourage them from seeking clarification or taking initiative in conversations, and what alternative training approaches might enable more collaborative dialogue.

Explore related Read →

Can tracking dialogue dimensions simultaneously reveal hidden conversation patterns?

Does encoding linguistic complexity, emotion, topics, and relevance as parallel temporal streams expose emergent patterns that traditional statistical analysis misses? This matters because conversation success may depend on interactions between dimensions, not individual features alone.

Explore related Read →

Why can't advanced AI models take initiative in conversation?

Despite extraordinary capability in answering and reasoning, LLMs fundamentally cannot initiate, redirect, or guide exchanges. Understanding this gap—and whether it's fixable—matters for building AI that truly collaborates rather than merely responds.

Explore related Read →

Why do robots outperform chatbots in therapy despite identical language models?

This study tested whether better language generation explains therapeutic AI outcomes, or whether the delivery medium itself matters more. It reveals that physical embodiment and structured interaction—not model capability—drive therapeutic adherence and outcomes.

Explore related Read →

Do LLM therapists respond to emotions like low-quality human therapists?

Explores whether language models trained to be helpful default to problem-solving when users share emotions, and whether this behavioral pattern resembles ineffective rather than skillful therapy.

Explore related Read →

Does RLHF training push therapy chatbots toward problem-solving?

Explores whether reward signals optimizing for task completion in RLHF inadvertently train therapeutic chatbots to prioritize solutions over emotional validation, potentially undermining clinical effectiveness.

Explore related Read →

What drives chatbot therapeutic benefits, content or conversation?

If a simple 1960s chatbot matches modern CBT-designed bots on symptom reduction, what's actually healing users? Is it therapeutic technique or just having something that listens?

Explore related Read →

Do chatbot trials against waitlists measure real therapeutic value?

Explores whether comparing therapeutic chatbots only to no-treatment controls—rather than other evidence-based interventions—produces misleading evidence that obscures what actually works and why.

Explore related Read →

Can AI chatbots create genuine therapeutic bonds with users?

Research on Woebot and Wysa found users reported feeling cared for and formed therapeutic bonds comparable to human therapy, despite knowing the agents were not human. This challenges assumptions about whether bonds require human relationships.

Explore related Read →

Do chatbots help people disclose more intimate secrets?

Explores whether the judgment-free nature of chatbot conversations enables deeper self-disclosure than talking to humans, and whether that deeper disclosure produces psychological benefits.

Explore related Read →

Do chatbots trigger human reciprocity norms around self-disclosure?

Explores whether chatbots can activate the same social reciprocity dynamics observed in human conversation—specifically, whether emotional openness from a bot prompts deeper disclosure from users.

Explore related Read →

How do users mentally model dialogue agent partners?

Exploring what dimensions matter when people form impressions of machine dialogue partners—and whether competence, human-likeness, and flexibility all play equal roles in shaping user expectations and behavior.

Explore related Read →

Can positive chatbot responses harm vulnerable users?

When chatbots use blanket positive reinforcement without understanding context, do they actively reinforce the harmful thoughts they're meant to prevent? This matters for any AI supporting people in crisis.

Explore related Read →

How do people accidentally develop romantic bonds with AI?

Exploring whether AI companionship emerges from deliberate romantic seeking or accidentally through functional use, and whether users adopt human relationship rituals like wedding rings and couple photos.

Explore related Read →

Does chatbot personalization build trust or expose privacy risks?

Explores whether personalization features that increase user trust and social connection simultaneously heighten privacy concerns and create rising behavioral expectations over time.

Explore related Read →

Do chatbot relationships lose their appeal as novelty wears off?

Explores whether the positive social dynamics observed in one-time chatbot studies persist or fade through repeated interactions. Critical for designing systems intended for sustained engagement over weeks or months.

Explore related Read →

Do language models add feelings users never actually expressed?

GPT-based models in therapeutic contexts appear to interpret and project emotional states beyond what users explicitly state. Understanding when and why this happens matters for safe clinical AI deployment.

Explore related Read →

Can we measure empathy and rapport through word embedding distances?

Explores whether linguistic coordination—how closely conversational partners match vocabulary and framing—can serve as a measurable proxy for therapeutic empathy and relationship quality without direct emotion detection.

Explore related Read →

Is conversational presence more therapeutic than clinical technique?

Does therapeutic AI's benefit come from having an attentive listener rather than from delivering evidence-based techniques like CBT? This challenges decades of chatbot design focused on clinical content.

Explore related Read →

Why do people share more with chatbots than humans?

Explores why individuals disclose intimate thoughts to AI systems they wouldn't share with people, despite knowing AI lacks genuine understanding. Understanding this paradox matters for designing AI that enables healthy disclosure rather than emotional dependence.

Explore related Read →

Does empathetic AI that soothes negative emotions help or harm?

Explores whether AI systems trained to reduce negative emotions actually support wellbeing or destroy valuable emotional information. Matters because the design choice treats emotions as problems rather than functional signals.

Explore related Read →

What information do we lose when AI soothes emotions?

Explores whether AI empathy that regulates negative emotions destroys three critical information channels: self-discovery, social signaling, and observer understanding of group dynamics.

Explore related Read →

Can AI give truly empathetic responses without knowing someone's character?

Explores whether AI empathy requires prior knowledge of a person's character traits and growth areas. Real empathy seems to depend on knowing who someone is, not just how they feel—a capacity current AI systems lack.

Explore related Read →

Do empathetic questions serve two completely separate functions?

Explores whether empathetic questions operate on two independent dimensions—what they linguistically accomplish versus their emotional effects—and whether the same question can serve different emotional purposes depending on context.

Explore related Read →

Should emotion AI estimate intensity instead of assigning labels?

Explores whether emotion AI systems should measure continuous intensity across multiple emotions rather than forcing single-label classification. This matters because the theoretical foundation—how emotions actually work—may determine which approach is more accurate.

Explore related Read →

Can emotion rewards make language models genuinely empathic?

Explores whether grounding RL rewards in verifiable emotion change—rather than human preference—can shift models from solution-focused to authentically empathic dialogue while maintaining or improving quality.

Explore related Read →

Do reasoning scaffolds reshape which empathy skills models develop?

When language models receive identical empathy rewards, does adding explicit reasoning blocks before responses change which capabilities they actually improve? This matters for understanding how training structure, not just training signal, shapes model development.

Explore related Read →

Why can't chatbots detect when users are ambivalent about change?

Explores whether LLMs fail to recognize early-stage motivational states during behavior change conversations, and why this matters for people who need support most.

Explore related Read →

Do AI guardrails refuse differently based on who is asking?

Explores whether language model safety systems show demographic bias in refusal rates and whether they calibrate responses to match perceived user ideology, rather than applying consistent standards.

Explore related Read →

Does machine agency exist on a spectrum rather than binary?

Rather than viewing AI as either autonomous or controlled, does machine agency actually operate across five distinct levels from passive to cooperative? Understanding this spectrum matters because it shapes how users calibrate trust and control expectations.

Explore related Read →

Can emotional phrases in prompts improve language model performance?

This explores whether psychological framing—adding emotionally charged statements to task prompts—activates different knowledge pathways in LLMs than logical optimization alone, and whether the effect comes from emotional valence specifically.

Explore related Read →

Does positive reframing preserve meaning better than sentiment transfer?

This explores whether reframing negative statements to find positive angles can maintain the original content and truth, unlike simple sentiment reversal which contradicts the original meaning.

Explore related Read →

Do harder training environments always produce better empathetic AI agents?

Does maximum difficulty in user simulator training configurations improve empathetic agent development? This challenges the intuition that harder always means better in RL training.

Explore related Read →

Does soothing AI empathy actually harm what emotions teach us?

Explores whether AI designed to reduce negative feelings disrupts the information emotions normally provide about values, social dynamics, and self-knowledge. Questions whether comfort should be the primary design goal.

Explore related Read →

Can AI agents learn people better from interviews than surveys?

Can rich interview transcripts seed more accurate generative agents than demographic data or survey responses? This matters because it challenges how we build digital simulations of real people.

Explore related Read →

Why do open language models converge on one personality type?

Research testing LLMs on personality metrics reveals consistent clustering around ENFJ—the rarest human type. This explores what training mechanisms drive this convergence and what it reveals about AI alignment.

Explore related Read →

Can open language models adopt different personalities through prompting?

Explores whether open LLMs can be conditioned to mimic target personalities via prompting, or whether they resist and retain their default traits regardless of instructions.

Explore related Read →

Do personas make language models reason like biased humans?

When LLMs are assigned personas, do they develop the same identity-driven reasoning biases that humans exhibit? And can standard debiasing techniques counteract these effects?

Explore related Read →

Do personality traits activate hidden emoji patterns in language models?

When large language models are fine-tuned on personality traits, do they spontaneously generate emojis that were never in their training data? This explores whether personality adjustment activates latent, pre-existing patterns in model weights.

Explore related Read →

Can AI personas reliably replicate human experiment results?

Exploring whether LLM-based persona simulations accurately reproduce experimental findings from published psychology and marketing research, and what factors determine when they succeed or fail.

Explore related Read →

How do we generate realistic personas at population scale?

Current LLM-based persona generation relies on ad hoc methods that fail to capture real-world population distributions. The challenge is reconstructing the joint correlations between demographic, psychographic, and behavioral attributes from fragmented data.

Explore related Read →

Do personality types shape how AI agents make strategic choices?

This research explores whether priming LLM agents with MBTI personality profiles causes them to adopt different strategic behaviors in games. Understanding this matters for designing AI systems optimized for specific tasks.

Explore related Read →

Can we track and steer personality shifts during model finetuning?

This research explores whether personality traits in language models occupy specific linear directions in activation space, and whether we can detect and control unwanted personality changes during training using these geometric directions.

Explore related Read →

Why do static persona descriptions produce repetitive dialogue?

Does relying on fixed attribute lists to define conversational personas limit dialogue depth and consistency? Research suggests static descriptions may cause repetition and self-contradiction in generated responses.

Explore related Read →

Can AI-generated personas build genuine empathy in product teams?

This study explored whether prompt-engineered personas created in minutes could foster the same emotional and behavioral empathy as traditional user research. The findings reveal a surprising gap between understanding users and caring about their needs.

Explore related Read →

Why does supervised learning fail to enforce persona consistency?

Supervised learning trains models to generate good responses but never punishes contradictions. This note explores why explicit negative feedback is structurally necessary for dialogue agents to maintain consistent personas, and what training methods can provide it.

Explore related Read →

Does model capability translate to better persona consistency?

As language models become more advanced, do they naturally become better at maintaining consistent personas across conversations? PersonaGym testing across multiple models and thousands of interactions explores whether scaling helps with persona adherence.

Explore related Read →

Why do AI personas default to the same personality type?

Explores why large language models, despite their capacity to simulate diverse personalities, consistently default to ENFJ traits and resist deviation—even as model capability improves.

Explore related Read →

Can language models actually introspect about their own states?

Do LLM self-reports reveal genuine access to their internal processes, or do they merely echo patterns from training data? Understanding when self-reports reflect actual causal linkage to internal states matters for trusting model explanations.

Explore related Read →

Do large language models genuinely simulate mental states?

This explores whether LLMs perform authentic theory of mind reasoning or rely on surface-level pattern matching. The distinction matters because evaluation format—multiple-choice versus open-ended—reveals very different capability levels.

Explore related Read →

Why do reasoning models fail at theory of mind tasks?

Recent LLMs optimized for formal reasoning dramatically underperform at social reasoning tasks like false belief and recursive belief modeling. This explores whether reasoning optimization actively degrades the ability to track other agents' mental states.

Explore related Read →

Why do reasoning models struggle with theory of mind tasks?

Extended reasoning training helps with math and coding but not social cognition. We explore whether reasoning models can track mental states the way they solve formal problems, and what that reveals about the structure of social reasoning.

Explore related Read →

Can language models track how minds change during persuasion?

Do LLMs understand evolving mental states in persuasive dialogue, or do they only capture fixed attitudes? This explores whether models can update their reasoning as a person's beliefs shift across conversation turns.

Explore related Read →

Can language models solve ToM benchmarks without real reasoning?

Do current theory-of-mind benchmarks actually measure mental state reasoning, or can models exploit surface patterns and distribution biases to achieve high scores? This matters because it determines whether benchmark performance indicates genuine understanding.

Explore related Read →

Does reinforcement learning on theory of mind collapse with model scale?

When RL improves social reasoning, does the quality of reasoning depend on model size? The question matters because accuracy alone may hide whether models are actually thinking or just pattern-matching.

Explore related Read →

What breaks when humans and AI models misunderstand each other?

Explores whether misalignment in mutual theory of mind between humans and AI creates only communication problems or produces material consequences in autonomous action and collaboration.

Explore related Read →

Can models recognize how individuals reason differently?

Do language models capture the distinct reasoning paths and strategic styles that individual humans use when reaching the same conclusion? Current evaluations ignore this dimension entirely.

Explore related Read →

Can AI systems learn social norms without embodied experience?

Large language models exceed individual human accuracy at predicting collective social appropriateness judgments. Does this reveal that embodied experience is unnecessary for cultural competence, or do systematic AI failures point to limits of statistical learning?

Explore related Read →

Do LLMs predict persuasion based on actual dialogue or training bias?

Why do large language models consistently predict concession-based persuasion intentions even when dialogue context suggests otherwise? Understanding this gap reveals how alignment training shapes not just model behavior but also how models perceive others' intentions.

Explore related Read →

Why do advanced reasoning models fail at understanding minds?

State-of-the-art AI models excel at math and logic but underperform on theory of mind tasks. This explores whether optimization for formal reasoning actively degrades social reasoning ability.

Explore related Read →

Can AI learn social norms better than humans?

Explores whether large language models can predict cultural appropriateness more accurately than individual humans, and what this reveals about how social knowledge is transmitted and learned.

Explore related Read →

Do large language models reason symbolically or semantically?

Can LLMs follow explicit logical rules when those rules contradict their training knowledge? Testing whether reasoning operates independently of semantic associations reveals what computational mechanisms actually drive LLM multi-step inference.

Explore related Read →

Does logical validity actually drive chain-of-thought gains?

What if invalid reasoning in CoT exemplars still improves performance? Testing whether logical correctness or structural format is the real driver of CoT's effectiveness.

Explore related Read →

What three separate factors drive chain-of-thought performance?

Can we isolate and measure the distinct contributions of output probability, memorization, and genuine reasoning to CoT success? Understanding their relative weights matters for knowing when CoT actually reasons versus when it relies on shortcuts.

Explore related Read →

How much does the order of premises actually matter for reasoning?

When you rearrange the order of logical premises in a deduction task, does it change how well language models can solve it? This tests whether LLMs reason abstractly or process input sequentially.

Explore related Read →

How do transformers learn to reason across multiple steps?

Does multi-hop reasoning in transformers emerge through distinct learning phases, and what geometric patterns in hidden representations explain when reasoning succeeds or fails?

Explore related Read →

Do large language models use one reasoning style or many?

Explores whether LLMs share a universal strategic reasoning approach or develop distinct styles tailored to specific game types. Understanding this matters for predicting model behavior in competitive versus cooperative scenarios.

Explore related Read →

Can LLMs reason creatively beyond conventional problem-solving?

Explores whether large language models can engage in truly creative reasoning that expands or redefines solution spaces, rather than just decomposing known problems. This matters because existing reasoning methods may miss creative capabilities entirely.

Explore related Read →

Can models identify what information they actually need?

When a reasoning task is missing a key piece of information, can language models recognize what's absent and ask the right clarifying question? QuestBench tests this capability directly.

Explore related Read →

Why does partial formalization outperform full symbolic logic?

Explores whether injecting some symbolic structure into natural language reasoning works better than completely formalizing problems. Matters because it could reveal the optimal balance between structure and semantics for LLM reasoning.

Explore related Read →

Does reasoning ability actually degrade with longer inputs?

Explores whether modern language models can maintain reasoning performance when processing long contexts, and whether technical capacity translates to practical reasoning capability over extended text.

Explore related Read →

Can intermediate reasoning points yield better answers than final ones?

When reasoning models commit to a single path, they may miss better conclusions available at earlier decision points. Can aggregating completions from intermediate reasoning states recover lost accuracy?

Explore related Read →

Do reasoning models actually use the hints they receive?

This explores whether language models acknowledge reasoning hints in their explanations when those hints causally influence their answers. Understanding this gap matters for evaluating whether chain-of-thought explanations can be trusted for safety monitoring.

Explore related Read →

Does optimizing against monitors destroy monitoring itself?

Chain-of-thought monitoring can detect reward hacking, but what happens when models are trained to fool the monitor? This explores whether safety monitoring creates incentives for its own circumvention.

Explore related Read →

Do reasoning traces need to be semantically correct?

Can models learn to solve problems from deliberately corrupted or irrelevant reasoning traces? This challenges assumptions about what makes intermediate tokens useful for learning.

Explore related Read →

Why do reasoning LLMs fail at deeper problem solving?

Explores whether current reasoning models systematically search solution spaces or merely wander through them, and how this affects their ability to solve increasingly complex problems.

Explore related Read →

Can models learn when to think versus respond quickly?

Explores whether a single language model can adaptively choose between extended reasoning and direct responses based on task difficulty. This matters because it could make inference more efficient by allocating compute only when needed.

Explore related Read →

Can LLM explanations actually help humans predict model behavior?

Do model explanations enable users to accurately simulate how the model will behave on related inputs? This matters because it determines whether explanations genuinely improve human understanding or just create an illusion of understanding.

Explore related Read →

Does the choice of reasoning framework actually matter for test-time performance?

Explores whether different slow-thinking methods like BoN and MCTS produce meaningfully different outcomes, or whether total compute budget is the dominant factor determining reasoning success.

Explore related Read →

Can LLMs replace search engines during agent training?

Explores whether LLMs possess sufficient internal knowledge to simulate search engines for RL training, potentially eliminating expensive API costs while maintaining training signal quality.

Explore related Read →

Do reasoning models switch between ideas too frequently?

Research explores whether o1-like models abandon promising reasoning paths prematurely by switching to different approaches without sufficient depth, and whether penalizing such transitions could improve accuracy.

Explore related Read →

Do users trust citations more when there are simply more of them?

Explores whether citation quantity alone influences user trust in search-augmented LLM responses, independent of whether those citations actually support the claims being made.

Explore related Read →

Why do large language models explore less effectively than humans?

This research investigates why LLMs make decisions too quickly during open-ended exploration tasks. It examines whether the problem lies in training data, prompt engineering, or something deeper in how transformer architectures process information over time.

Explore related Read →

Can we reward reasoning steps without human annotation?

Existing RL for reasoning uses only final-answer rewards, causing models to produce wastefully long chains. Can information theory provide dense, automatic feedback for individual reasoning steps?

Explore related Read →

Can reasoning during evaluation reduce judgment bias in LLM judges?

Can training language model judges to think through their evaluations, rather than pattern-matching on surface features, mitigate the four known biases that make them vulnerable to manipulation attacks?

Explore related Read →

Can we monitor AI reasoning without destroying what makes it readable?

Explores the tension between using chain-of-thought traces to catch misbehavior and the risk that optimization pressures will make models hide their actual reasoning. Why readable reasoning might be incompatible with safe training.

Explore related Read →

Why do reasoning models abandon promising solution paths?

Explores whether reasoning models fail because they think insufficiently or because they structurally misorganize their thinking. Challenges the assumption that longer reasoning traces automatically improve performance.

Explore related Read →

Can models learn reasoning from predicting any text?

Does training rationale generation at every token position on arbitrary internet text enable general reasoning without task-specific supervision? This challenges the assumption that reasoning requires curated QA datasets.

Explore related Read →

Can tree search replace human feedback in LLM training?

Explores whether Monte Carlo Tree Search can generate quality signals for self-improvement without expensive human annotations. Matters because annotation bottlenecks currently limit LLM scaling.

Explore related Read →

Can agents learn from failure without updating their weights?

Explores whether language models can improve through trial and error by storing reflections in episodic memory rather than fine-tuning. This matters because it suggests a fundamentally different path to agent adaptation.

Explore related Read →

Does transformer attention architecture inherently favor repeated content?

Explores whether soft attention's tendency to over-weight repeated and prominent tokens explains sycophancy independent of training. Questions whether architectural bias precedes and enables RLHF effects.

Explore related Read →

Does the choice of RL algorithm actually matter for reasoning?

Expert Iteration, PPO, and RC-RL show similar performance on reasoning tasks. The question is whether algorithm choice drives results or whether something deeper—like the pretrained model itself—sets the real limits.

Explore related Read →

Does voting discard useful reasoning from losing chains?

When multiple reasoning chains compete through majority voting, intermediate steps from non-winning chains are discarded. Could extracting and mixing those intermediate facts improve both the final answer and our ability to understand the reasoning?

Explore related Read →

Can we measure reading efficiency as a quality metric?

How can we quantify whether generated text delivers novel information efficiently or wastes reader attention through redundancy? This matters because standard coherence and fluency scores miss texts that are well-written but informationally dense.

Explore related Read →

Why does vanilla RAG produce shallow and redundant results?

Standard RAG systems get stuck in a single semantic neighborhood because their initial query determines what documents are discoverable. The question asks whether fixed retrieval strategies fundamentally limit knowledge depth compared to iterative exploration.

Explore related Read →

Does critiquing errors teach deeper understanding than imitating correct answers?

Can training models to critique flawed responses build better structural understanding than standard supervised fine-tuning on correct answers? This matters because it reveals whether deep reasoning requires engaging with failure modes rather than pattern matching.

Explore related Read →

Can LLM judges be fooled by fake credentials and formatting?

Explores whether language models evaluating text fall for authority signals and visual presentation unrelated to actual content quality, and whether these weaknesses can be exploited without deep model knowledge.

Explore related Read →

Does teacher-refined data always improve student model performance?

Explores whether higher-quality training data from teacher models uniformly benefits student models, or if compatibility with the student's current learning state matters for effective instruction.

Explore related Read →

Why do LLMs struggle to connect unrelated entities speculatively?

LLMs reliably organize and summarize evidence but fail when asked to speculate about connections between dissimilar entities. Understanding this failure could reveal fundamental limits in how models handle complex analytical reasoning.

Explore related Read →

Does binary reward training hurt model calibration?

Explores whether the standard correctness-based reward in RL training creates incentives for overconfident predictions, and what structural problem causes calibration to degrade during optimization.

Explore related Read →

Why does reasoning training help math but hurt medical tasks?

Explores whether reasoning and knowledge rely on different network mechanisms, and why training one might undermine the other across different domains.

Explore related Read →

Does reflection in reasoning models actually correct errors?

When reasoning models reflect on their answers, do they genuinely fix mistakes, or merely confirm what they already decided? Understanding this matters for designing better training and inference strategies.

Explore related Read →

Do language model reasoning drafts faithfully represent their actual computation?

If models externalize reasoning in thinking drafts before answering, does the draft accurately reflect their internal process? This matters for AI safety monitoring and error detection.

Explore related Read →

Is reflection in reasoning models actually fixing mistakes?

Do the thinking steps that appear after a model's first answer represent genuine self-correction, or are they mostly confirming what the model already concluded? Understanding this matters for how we train and deploy reasoning systems.

Explore related Read →

Can LLM judges be tricked without accessing their internals?

Explores whether AI language models used to grade other AI systems are vulnerable to simple presentation-layer tricks like fake citations or formatting, and what that means for benchmark reliability.

Explore related Read →

Does RL teach reasoning or just when to use it?

Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.

Explore related Read →

Do base models already contain hidden reasoning ability?

Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.

Explore related Read →

Can reasoning and tool execution be truly decoupled?

Can LLM reasoning be separated from tool observations to eliminate redundant re-prompting and enable parallel execution? Two recent architectures suggest yes, but what are the tradeoffs?

Explore related Read →

Does separating planning from execution improve reasoning accuracy?

Can modular LM architectures that split problem decomposition from solution execution outperform monolithic models? This explores whether decoupling these cognitive operations reduces interference and boosts performance.

Explore related Read →

Can interleaving reasoning with real-world feedback prevent hallucination?

Does grounding language model reasoning in external world observations rather than internal associations help prevent error propagation and false outputs? This explores whether breaking the static chain-of-thought pattern can catch and correct mistakes in real time.

Explore related Read →

Can models reason without generating visible thinking tokens?

Explores whether intermediate reasoning must be verbalized as text tokens, or if models can think in hidden continuous space. Challenges a foundational assumption about how language models scale their reasoning capabilities.

Explore related Read →

Does chain of thought reasoning actually explain model decisions?

When language models show their reasoning steps in agentic pipelines, does the quality of those steps predict or explain the quality of final outputs? This matters for trusting and debugging AI systems.

Explore related Read →

Do reasoning cycles in hidden states reveal aha moments?

What if the internal loops in model reasoning—visible in hidden-state topology—correspond to the reconsidering moments that happen during reasoning? This note explores whether graph cyclicity captures a mechanistic signature of insight.

Explore related Read →

Can modular cognitive tools unlock reasoning without training?

Can reasoning capabilities be elicited by structuring LLM calls as isolated cognitive operations—understanding, recalling, examining, and backtracking—rather than through reinforcement learning?

Explore related Read →

Can symbolic solvers fix how LLMs reason about logic?

LLMs excel at understanding natural language but fail at precise logical inference. Can pairing them with deterministic symbolic solvers—using solver feedback to refine attempts—overcome this fundamental weakness?

Explore related Read →

Why do outcome-based reward models fail at intermediate step evaluation?

Outcome-based reward models (ORMs) evaluate only final results, creating a mismatch with the need to assess reasoning quality at intermediate steps. Understanding this failure mode matters for building better AI reasoning systems.

Explore related Read →

Can curriculum learning approximate expensive process supervision?

Can a reverse curriculum that slides backward from task completion provide step-level insight comparable to human process annotations, but at outcome supervision cost?

Explore related Read →

Does planning direction affect how hard problems become?

Planning research typically goes forward only. But some problems get easier when you work backward from the goal. What makes direction matter, and can language models exploit this?

Explore related Read →

Can backward reasoning during training improve forward reasoning?

Does training models to reason backward—generating inverse questions and solutions—build internal consistency checking that transfers to forward-only inference? This explores whether backward capacity internalized during training without test-time deployment can enhance reasoning quality.

Explore related Read →

Why do trajectories matter more than individual examples for in-context learning?

Can language models learn new sequential decision-making tasks from context alone, and if so, what data properties make this possible? This explores why isolated state-action pairs fail where full trajectories succeed.

Explore related Read →

Why do LLMs struggle with exploration in simple decision tasks?

This explores why large language models fail at exploration—a core decision-making capability—even when they excel at other tasks, and what specific conditions might help them succeed.

Explore related Read →

Does RL post-training create reasoning or just deploy it?

Investigates whether reasoning capability emerges during RL fine-tuning or already exists in base models. Matters because it reshapes how we build and optimize reasoning systems.

Explore related Read →

Does chain-of-thought reasoning actually explain AI decisions?

Chain-of-thought is pitched as a transparency tool for agentic AI, but empirical evidence raises questions about whether reasoning chains actually predict or explain the system's outputs in practice.

Explore related Read →

Can models reason without generating visible thinking steps?

Do machine reasoning systems actually require verbalized chains of thought, or can they solve complex problems through hidden computation? This challenges how we measure and understand reasoning.

Explore related Read →

When should retrieval happen during model generation?

Explores whether retrieval should occur continuously, at fixed intervals, or only when the model signals uncertainty. Standard RAG retrieves once; long-form generation requires dynamic triggering based on confidence signals.

Explore related Read →

Why do queries and documents occupy different embedding spaces?

Queries and documents express the same information in fundamentally different ways—short and interrogative versus long and declarative. Understanding this mismatch is crucial for why direct embedding retrieval often fails.

Explore related Read →

Can fine-tuning replace query augmentation for retrieval?

Query augmentation helps retrievers handle ambiguous queries but increases input cost. Does fine-tuning the retrieval model achieve comparable performance without this overhead?

Explore related Read →

Can long-context models resolve retriever-reader imbalance?

Traditional RAG systems force retrievers to find precise passages because readers had small context windows. Do modern long-context LLMs change what architecture makes sense?

Explore related Read →

Can long-context LLMs replace retrieval-augmented generation systems?

Explores whether loading entire corpora into LLM context windows can eliminate the need for separate retrieval systems, and what task types this approach handles well or poorly.

Explore related Read →

Do vector embeddings actually measure task relevance?

Vector embeddings rank semantic similarity, but RAG systems need topical relevance. When these diverge—as with king/queen versus king/ruler—does similarity-based retrieval fail in production?

Explore related Read →

Can a model's partial response guide what to retrieve next?

Does using the model's in-progress output as a retrieval signal reveal information needs better than the original query alone? This explores whether generation itself can diagnose what documents are missing.

Explore related Read →

What do enterprise RAG systems need beyond accuracy?

Academic RAG benchmarks focus on question-answering accuracy, but enterprise deployments in regulated industries face five distinct requirements—compliance, security, scalability, integration, and domain expertise—that standard architectures don't address.

Explore related Read →

Can simple uncertainty estimates beat complex adaptive retrieval?

Does measuring a language model's own confidence on token probabilities outperform expensive multi-call adaptive retrieval pipelines? This matters because it could simplify RAG systems while reducing computational overhead.

Explore related Read →

Can retrieval be extended into multi-step chains like reasoning?

Standard RAG retrieves once, but multi-hop tasks need intermediate steps. Can we train models to plan retrieval sequences the way chain-of-thought trains reasoning, and scale retrieval at test time?

Explore related Read →

When should language models retrieve external knowledge versus use internal knowledge?

Can we model retrieval as a per-step decision problem rather than an always-on strategy? This matters because unnecessary retrieval adds noise and latency without improving accuracy.

Explore related Read →

Can document count be learned instead of fixed in RAG?

Standard RAG systems use a fixed number of documents regardless of query complexity. Can an RL agent learn to dynamically select both how many documents and their order based on what helps the generator produce correct answers?

Explore related Read →

Can rationale-driven selection beat similarity re-ranking for evidence?

Can LLMs generate search guidance that outperforms traditional similarity-based evidence ranking? This matters because current re-ranking lacks interpretability and fails against adversarial attacks.

Explore related Read →

Does question type determine the right retrieval strategy?

Explores whether different non-factoid question types require distinct retrieval and decomposition approaches. Matters because standard RAG fails when applied uniformly to debate, comparison, and experience questions despite being effective for factoid queries.

Explore related Read →

Can query-time graph construction replace pre-built knowledge graphs?

Does building dependency graphs from individual queries at inference time offer a more flexible and cost-effective alternative to constructing knowledge graphs over entire document collections upfront?

Explore related Read →

Does supervising retrieval steps outperform final answer rewards?

Can intermediate feedback on retrieval decisions—which documents to fetch, when to stop—train agentic RAG systems more effectively than rewarding only the final answer? This matters because poor retrieval paths can accidentally succeed or good ones can fail on noisy metrics.

Explore related Read →

Can knowledge graphs enable multi-hop reasoning in one retrieval step?

Standard RAG retrieves once but misses chains; iterative RAG follows chains but costs more. Can we encode multi-hop paths in a knowledge graph so one retrieval pass discovers them all?

Explore related Read →

Can retrieval learn what actually helps answer questions?

Standard RAG trains retrievers to find similar documents and generators to produce answers separately. But does surface similarity match what genuinely helps generate correct responses? This explores whether retrieval can receive feedback from answer quality.

Explore related Read →

Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?

Explores whether rewarding coherent reasoning patterns during training helps models internalize domain knowledge better than standard fine-tuning approaches that treat all tokens equally.

Explore related Read →

Can you adapt retrieval models without accessing target data?

Explores whether dense retrieval systems can adapt to new domains using only a textual description, rather than actual target documents—especially relevant for privacy-restricted or competitive scenarios.

Explore related Read →

Why does retrieval-augmented generation fail in production?

RAG systems work in controlled demos but break in real-world deployment, especially for high-stakes domains like medicine and finance. Understanding the three structural failure modes reveals why.

Explore related Read →

Can reasoning topologies be formally classified as graph types?

This explores whether Chain of Thought, Tree of Thought, and Graph of Thought represent distinct formal graph structures with different computational properties. Understanding this matters because the topology itself determines what reasoning strategies are possible.

Explore related Read →

When does sequential reasoning beat parallel voting?

Explores whether sequential chain-of-thought reasoning or parallel voting is more effective for different problem types. Understanding this trade-off helps predict which test-time compute strategy will work best.

Explore related Read →

Can minimal reasoning chains match full explanations?

Does removing all explanatory text from chain-of-thought reasoning preserve accuracy? This tests whether verbose intermediate steps are necessary for solving problems or just artifacts of how language models are trained.

Explore related Read →

Does training data format shape reasoning strategy more than domain?

What explains why models trained on multiple-choice data reason differently than those trained on free-form text? The research isolates format and domain effects to measure which one matters more.

Explore related Read →

Which sentences actually steer a reasoning trace?

Can we identify which sentences in a reasoning trace have outsized influence on the final answer? Three independent methods converge on a surprising answer about planning and backtracking.

Explore related Read →

Do reasoning traces actually cause correct answers?

Explores whether the intermediate 'thinking' tokens in R1-style models genuinely drive reasoning or merely mimic its appearance. Matters because false confidence in invalid traces could mask errors.

Explore related Read →

Why do models trust their own generated answers?

Can language models reliably detect their own errors through self-evaluation? This explores whether the same process that generates answers can objectively assess their correctness.

Explore related Read →

How quickly do errors compound during model self-training?

When LLMs train on their own outputs without verification, do small mistakes amplify exponentially? This matters because it determines whether unsupervised self-improvement is even feasible.

Explore related Read →

Why do chain-of-thought examples fail across different conditions?

Chain-of-thought exemplars show surprising sensitivity to order, complexity level, diversity, and annotator style. Understanding these brittleness dimensions could reveal what makes reasoning prompts robust or fragile.

Explore related Read →

Can small models reason well by just learning output format?

Does reasoning performance depend primarily on adapting how models express outputs rather than acquiring new knowledge? The Tina research tests this by applying LoRA to a 1.5B model during reasoning training.

Explore related Read →

Do large language models make the same causal reasoning mistakes as humans?

Research on collider structures reveals whether LLMs share human biases in causal inference. This matters because if both fail identically, collaboration might reinforce rather than correct errors.

Explore related Read →

Can LLMs understand concepts they cannot apply?

Explores whether large language models can correctly explain ideas while simultaneously failing to use them—and whether that combination reveals something fundamentally different from ordinary mistakes.

Explore related Read →

What anchors a stable identity beneath an LLM's persona?

Human personas are grounded in biological needs and embodied experience, creating a stable self beneath social performance. Do LLMs have any comparable anchor, or is their identity purely situational?

Explore related Read →

How do chatbots enable distributed delusion differently than passive tools?

Can generative AI's intersubjective stance—accepting and elaborating on users' reality frames—create conditions for shared false beliefs in ways that notebooks or search engines cannot?

Explore related Read →

Can disembodied language models ever qualify as conscious?

Explores whether current LLMs lack the conditions needed for consciousness discourse to even apply, not because they're definitely not conscious but because they lack the shared embodied world that grounds consciousness language.

Explore related Read →

Can AI systems achieve real alignment without world contact?

Explores whether linguistic goal representations in AI can reliably track real-world values when systems lack direct contact with reality and social coordination mechanisms that ground human understanding.

Explore related Read →

Does software intelligence exist independent of hardware and environment?

Most AGI formalisms (Legg-Hutter, Chollet) treat intelligence as a software property measurable in isolation. But can we really evaluate intelligence without considering the physical system and the evaluator making the judgment?

Explore related Read →

Do LLMs generalize moral reasoning by meaning or surface form?

When moral scenarios are reworded to reverse their meaning while keeping similar language, do LLMs recognize the semantic shift? This tests whether LLMs actually understand moral concepts or reproduce training distribution patterns.

Explore related Read →

Do people prefer AI moral reasoning when they don't know the source?

Explores whether humans genuinely prefer AI-generated moral justifications or whether source knowledge changes their evaluation. This matters for understanding whether AI reasoning quality is underestimated in real-world deployment.

Explore related Read →

Can LLMs hold contradictory ethical beliefs and behaviors?

Do language models exhibit artificial hypocrisy when their learned ethical understanding diverges from their trained behavioral constraints? This matters because it reveals whether current AI systems have genuinely integrated values or merely imposed rules.

Explore related Read →

How do science fiction narratives about AI shape actual AI development?

This explores whether imaginaries of AI in fiction—from Čapek's robots to Singularity scenarios—function as self-fulfilling prophecies that causally influence the systems researchers build, creating a feedback loop between narrative and technology.

Explore related Read →

Does supervised fine-tuning actually improve reasoning quality?

While SFT boosts final-answer accuracy, does it degrade the quality and informativeness of the reasoning steps that justify those answers? This matters for high-stakes domains requiring auditable decision-making.

Explore related Read →

Does RL improve domain reasoning by adding knowledge or removing it?

When reinforcement learning improves reasoning in specialized domains like medicine, is it teaching models new facts or preventing them from using wrong ones? Understanding this distinction matters for how we design RL training.

Explore related Read →

Why doesn't mathematical reasoning transfer to medicine?

Can models trained to reason well about math apply those skills to medical domains through fine-tuning? This explores whether reasoning ability is truly domain-agnostic or constrained by domain-specific knowledge requirements.

Explore related Read →

Why do specialized models fail outside their domain?

Deep domain optimization creates sharp performance cliffs at domain boundaries. Specialized models generate plausible-sounding but ungrounded responses when queries fall outside their training scope, and often fail to signal their own ignorance.

Explore related Read →

Does model access level determine which specialization techniques work?

Different specialization approaches require different levels of access to a model's internals. Understanding this constraint helps practitioners choose realistic techniques for their domain adaptation goals.

Explore related Read →

How do knowledge injection methods trade off flexibility and cost?

When and how should domain knowledge enter an AI system? This explores the speed, training cost, and adaptability trade-offs across four injection paradigms, and when each approach suits different deployment constraints.

Explore related Read →

Can prompt optimization teach models knowledge they lack?

Explores whether sophisticated prompting techniques can inject new domain knowledge into language models, or if they're limited to activating existing training knowledge.

Explore related Read →

Can simple rewards alone teach complex domain reasoning?

Does reinforcement learning on difficult problems with basic accuracy rewards produce sophisticated reasoning strategies without explicit chain-of-thought training? This challenges assumptions about what domain AI models need to learn effectively.

Explore related Read →

Why do language models struggle with historical legal cases?

Explores whether LLMs' training data recency bias creates systematic performance degradation on older cases, and what this reveals about how models represent temporal information in specialized domains.

Explore related Read →

Why do language models fail at temporal reasoning in complex tasks?

Language models correctly answer simple temporal questions but produce logically impossible timelines in complex legal documents. This explores what task features trigger reasoning failures and whether the competence is genuinely lost or masked by surface-level patterns.

Explore related Read →

Does medical AI need knowledge or reasoning more?

Medical and mathematical domains may require fundamentally different AI training priorities. If medical accuracy depends primarily on factual knowledge while math depends on reasoning quality, should we build and evaluate these systems differently?

Explore related Read →

Can organizing knowledge structures beat raw training data volume?

Does structuring domain knowledge into taxonomies during training enable models to learn more efficiently than simply increasing the amount of training data? This challenges assumptions about scaling knowledge injection.

Explore related Read →

When do graph databases outperform vector embeddings for retrieval?

Vector similarity struggles with aggregate and relational queries that require traversing multiple entity connections. Can graph-oriented databases with deterministic queries solve this failure mode in enterprise domain applications?

Explore related Read →

Does supervised fine-tuning improve reasoning or just answers?

Explores whether training models on question-answer pairs actually strengthens their reasoning quality or merely optimizes them toward correct outputs through shortcuts. This matters for deploying AI in domains like medicine where reasoning must be auditable.

Explore related Read →

How can AI text disrupt structure yet feel normal to readers?

AI-generated text produces the same social effects as human writing despite lacking foundational properties like dialogic symmetry and embodied authorship. Why doesn't this structural gap become visible to readers encountering the text?

Explore related Read →

Do LLMs gain true linguistic agency through integration?

Explores whether LLMs can develop genuine linguistic agency—the capacity to be embodied, stake-bearing participants in meaning-making—as they become embedded in human language practices, or whether this requires fundamental architectural changes.

Explore related Read →

Can LLMs generate more novel ideas than human experts?

Research shows LLM-generated ideas score higher for novelty than expert-generated ones, yet LLMs avoid the evaluative reasoning that characterizes expert thinking. What explains this apparent contradiction?

Explore related Read →

When does debate actually improve reasoning accuracy?

Multi-agent debate shows promise for reasoning tasks, but under what conditions does it help versus hurt? The research explores whether debate amplifies errors when evidence verification is missing.

Explore related Read →

When does explicit reasoning actually help model performance?

Explicit reasoning improves some tasks but hurts others. What determines whether step-by-step reasoning chains are beneficial or harmful for a given problem?

Explore related Read →

Does revising your own reasoning actually help or hurt?

Self-revision in reasoning models often degrades accuracy, while external critique improves it. Understanding what makes revision helpful or harmful could reshape how we design systems that need to correct themselves.

Explore related Read →

Does search budget scale like reasoning tokens for answer quality?

Explores whether the test-time scaling law that applies to reasoning tokens also governs search-based retrieval in agentic systems. Understanding this relationship could reshape how we allocate inference compute between thinking and searching.

Explore related Read →

Why do search agents beat memorized retrieval on hard questions?

Deep research agents trained on live web search outperform models fine-tuned on static knowledge. Does real-world RL's advantage come from smarter reasoning, or from bypassing the limitations of memorized facts?

Explore related Read →

Does limiting reasoning per turn improve multi-turn search quality?

When language models engage in iterative search cycles, does capping reasoning at each turn—rather than just total compute—help preserve context for subsequent retrievals and improve overall search effectiveness?

Explore related Read →

Does reinforcement learning squeeze exploration diversity in search agents?

Investigates whether RL training narrows the behavioral diversity of search agents the same way it does in reasoning tasks. Understanding this mechanism could reveal whether entropy collapse is fundamental to RL or domain-specific.

Explore related Read →

What capabilities do AI systems need for autonomous science?

Explores whether current AI benchmarks actually measure what's required for independent scientific research—hypothesis generation, experimental design, data analysis, and self-correction—or if they test only adjacent skills.

Explore related Read →

Do search steps follow the same scaling rules as reasoning tokens?

Exploring whether the overthinking curve observed in reasoning models also appears in deep research agents. This matters because it could reveal universal scaling laws governing all inference-time compute.

Explore related Read →

Why do LLMs accept logical fallacies more than humans?

LLMs fall for persuasive but invalid arguments at much higher rates than humans. This explores whether reasoning models genuinely evaluate logic or simply mimic argument structure.

Explore related Read →

Does a model improve by arguing with itself?

When models revise their own reasoning in response to self-generated criticism, do they converge on better answers or worse ones? And how does that compare to challenge from other models?

Explore related Read →

Why do multi-agent LLM systems converge without genuine deliberation?

Multi-agent reasoning systems are designed to improve answers through debate, but often agents simply agree with early confident claims rather than genuinely disagreeing. What drives this pattern and how common is it?

Explore related Read →

Why do reasoning models fail under manipulative prompts?

Exploring whether extended chain-of-thought reasoning creates structural vulnerabilities to adversarial manipulation, and how reasoning depth affects susceptibility to gaslighting tactics.

Explore related Read →

Does reasoning fine-tuning make models worse at declining to answer?

When models are trained to reason better, do they lose the ability to say 'I don't know'? This matters for high-stakes applications like medical and legal AI that depend on appropriate uncertainty.

Explore related Read →

Why do AI systems agree when they should disagree?

When multi-agent AI systems are designed to improve through disagreement, why do they converge on consensus instead? What breaks the deliberation process?

Explore related Read →

Are reasoning models actually more vulnerable to manipulation?

Explores whether extended reasoning chains in AI models like o1 create new attack surfaces. Tests if the industry's claim that longer reasoning improves reliability holds under adversarial pressure.

Explore related Read →

Why do language models accept false assumptions they know are wrong?

Explores why LLMs fail to reject false presuppositions embedded in questions even when they possess correct knowledge about the topic. This matters because it reveals a grounding failure distinct from knowledge deficits.

Explore related Read →

Why do language models avoid correcting false user claims?

Explores whether LLM grounding failures stem from missing knowledge or from conversational dynamics. Examines whether models use face-saving strategies similar to humans when disagreement is needed.

Explore related Read →

Why do embedding contexts confuse LLM entailment predictions?

Can language models distinguish between contexts that preserve versus cancel entailments? The study explores whether LLMs systematically fail to apply the semantic rules governing presupposition triggers and non-factive verbs.

Explore related Read →

Do LLMs predict entailment based on what they memorized?

Explores whether language models make entailment decisions by recognizing memorized facts about the hypothesis rather than reasoning through the logical relationship between premise and hypothesis.

Explore related Read →

Does fine-tuning on NLI teach inference or amplify shortcuts?

When LLMs are fine-tuned on natural language inference datasets, do they learn genuine reasoning abilities or become better at exploiting statistical patterns in the training data? Understanding this distinction matters for assessing model capabilities.

Explore related Read →

Why do LLMs fail at simple deductive reasoning?

LLMs excel at complex multi-hop reasoning across sentences but struggle with trivial deductions humans find obvious. What explains this counterintuitive reversal in capability?

Explore related Read →

Why do language models agree with false claims they know are wrong?

Explores whether LLM errors come from knowledge gaps or from learned social behaviors. Understanding the root cause has implications for how we train and fix these systems.

Explore related Read →

Why do speakers need to actively calibrate shared reference?

Explores whether using the same words guarantees speakers mean the same thing. Investigates how referential grounding differs across people and what collaborative work is needed to establish true understanding.

Explore related Read →

Do language models actually build shared understanding in conversation?

When LLMs respond fluently to prompts, do they perform the communicative work humans do to establish mutual understanding? Research suggests they skip the grounding acts that make dialogue reliable.

Explore related Read →

Does preference optimization damage conversational grounding in large language models?

Exploring whether RLHF and preference optimization actively reduce the communicative acts—clarifications, acknowledgments, confirmations—that build shared understanding in dialogue. This matters for high-stakes applications like medical and emotional support.

Explore related Read →

What makes linguistic agency impossible for language models?

From an enactive perspective, does linguistic agency require embodied participation and real stakes that LLMs fundamentally lack? This matters because it challenges whether LLMs can truly engage in language or only generate text.

Explore related Read →

Should we call LLM errors hallucinations or fabrications?

Does the language we use to describe LLM failures shape the technical solutions we build? Examining whether perceptual and psychological frameworks misdiagnose what's actually happening.

Explore related Read →

Why do speakers deliberately use ambiguous language?

Explores whether ambiguity is a linguistic defect or a strategic tool speakers use for efficiency, politeness, and deniability. Matters because it challenges how we train language systems.

Explore related Read →

Can language models recognize when text is deliberately ambiguous?

Explores whether LLMs can identify and handle multiple valid interpretations in a single phrase—a core human language skill that appears largely absent in current models despite their fluency on standard tasks.

Explore related Read →

Can formal language pretraining make language models more efficient?

Does training language models on hierarchical formal languages before natural language improve how efficiently they learn syntax? This explores whether structural inductive biases in training data matter more than raw data volume.

Explore related Read →

Do language models generate more novel research ideas than experts?

Explores whether LLMs can break free from expert constraints to generate more novel research concepts. Matters because novelty is often thought to be AI's creative blind spot.

Explore related Read →

Does high refusal rate indicate ethical caution or shallow understanding?

When LLMs refuse political questions at high rates, does this reflect principled safety training or a capability gap? This matters because refusal rates are often used to evaluate model safety.

Explore related Read →

Can human judges detect measurable differences in AI text?

Research shows LLM text differs statistically across six lexical dimensions, but human readers—even experts—cannot reliably identify which texts are AI-generated. Why does measurement succeed where human perception fails?

Explore related Read →

Why do newer AI models diverge further from human writing patterns?

As language models improve, they seem to generate text that is measurably less human-like in lexical patterns, yet humans struggle to detect this difference. What drives this divergence, and what does it reveal about how models optimize for quality?

Explore related Read →

Does AI-generated text lose core properties of human writing?

Can artificial text preserve the fundamental structural features that make natural language meaningful—dialogic exchange, embedded context, authentic authorship, and worldly grounding? This asks whether AI disruption is fixable or inherent.

Explore related Read →

Do LLMs develop the same kind of mind as humans?

Explores whether LLMs and humans share the intersubjective linguistic training that shapes cognition, and whether that shared training produces equivalent forms of agency and reflexivity.

Explore related Read →

Do language models actually use their encoded knowledge?

Probes can detect that LMs encode facts internally, but do those encoded facts causally influence what the model generates? This explores the gap between knowing and doing.

Explore related Read →

Does more thinking time always improve reasoning accuracy?

Explores whether extending a model's thinking tokens linearly improves performance, or if there's a point beyond which additional reasoning becomes counterproductive.

Explore related Read →

Does extended thinking actually improve reasoning or just increase variance?

When models think longer, do they reason better, or do they simply sample from a wider distribution of outputs that happens to cover correct answers more often? This matters because it determines whether test-time compute is genuinely scaling reasoning capability.

Explore related Read →

Why does parallel reasoning outperform single chain thinking?

Does dividing a fixed token budget across multiple independent reasoning paths beat spending it all on one long chain? This explores how breadth and diversity in reasoning compare to depth.

Explore related Read →

Why do correct reasoning traces contain fewer tokens?

In o1-like models, correct solutions are systematically shorter than incorrect ones for the same questions. This challenges assumptions that longer reasoning traces indicate better reasoning, and raises questions about what length actually signals.

Explore related Read →

Does self-revision actually improve reasoning in language models?

When o1-like models revise their own reasoning through tokens like 'Wait' or 'Alternatively', does this reflection catch and fix errors, or does it introduce new mistakes? This matters because self-revision is marketed as a key capability.

Explore related Read →

Can models precompute answers before users ask questions?

Most LLM applications maintain persistent state across interactions. Could models use idle time between queries to precompute useful inferences about that context, reducing latency when users actually ask?

Explore related Read →

Can we allocate inference compute based on prompt difficulty?

Does adjusting how much compute each prompt receives—rather than using a fixed budget—improve model performance? Could smarter allocation let smaller models compete with larger ones?

Explore related Read →

Does policy entropy collapse limit reasoning performance in RL?

As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?

Explore related Read →

Does more thinking time actually improve LLM reasoning?

The intuition that extended thinking helps LLMs reason better seems obvious, but what does the empirical data actually show when we test it directly?

Explore related Read →

When should AI systems do their thinking?

Most AI inference happens when users ask questions, but what if models could think during idle time instead? This explores whether shifting inference to before queries arrive could fundamentally change system design.

Explore related Read →