Reasoning Model Architectures

Can LLM explanations actually help humans predict model behavior?

Do model explanations enable users to accurately simulate how the model will behave on related inputs? This matters because it determines whether explanations genuinely improve human understanding or just create an illusion of understanding.

Do reasoning traces need to be semantically correct?

Can models learn to solve problems from deliberately corrupted or irrelevant reasoning traces? This challenges assumptions about what makes intermediate tokens useful for learning.

Does the choice of reasoning framework actually matter for test-time performance?

Explores whether different slow-thinking methods like BoN and MCTS produce meaningfully different outcomes, or whether total compute budget is the dominant factor determining reasoning success.

Can models learn when to think versus respond quickly?

Explores whether a single language model can adaptively choose between extended reasoning and direct responses based on task difficulty. This matters because it could make inference more efficient by allocating compute only when needed.

Can we reward reasoning steps without human annotation?

Existing RL for reasoning uses only final-answer rewards, causing models to produce wastefully long chains. Can information theory provide dense, automatic feedback for individual reasoning steps?

Can LLMs replace search engines during agent training?

Explores whether LLMs possess sufficient internal knowledge to simulate search engines for RL training, potentially eliminating expensive API costs while maintaining training signal quality.

Does optimizing against monitors destroy monitoring itself?

Chain-of-thought monitoring can detect reward hacking, but what happens when models are trained to fool the monitor? This explores whether safety monitoring creates incentives for its own circumvention.

Why do reasoning LLMs fail at deeper problem solving?

Explores whether current reasoning models systematically search solution spaces or merely wander through them, and how this affects their ability to solve increasingly complex problems.

Do reasoning models actually use the hints they receive?

This explores whether language models acknowledge reasoning hints in their explanations when those hints causally influence their answers. Understanding this gap matters for evaluating whether chain-of-thought explanations can be trusted for safety monitoring.

Can reasoning during evaluation reduce judgment bias in LLM judges?

Can training language model judges to think through their evaluations, rather than pattern-matching on surface features, mitigate the four known biases that make them vulnerable to manipulation attacks?

Can intermediate reasoning points yield better answers than final ones?

When reasoning models commit to a single path, they may miss better conclusions available at earlier decision points. Can aggregating completions from intermediate reasoning states recover lost accuracy?

Can we monitor AI reasoning without destroying what makes it readable?

Explores the tension between using chain-of-thought traces to catch misbehavior and the risk that optimization pressures will make models hide their actual reasoning. Why readable reasoning might be incompatible with safe training.

Why do reasoning models abandon promising solution paths?

Explores whether reasoning models fail because they think insufficiently or because they structurally misorganize their thinking. Challenges the assumption that longer reasoning traces automatically improve performance.

Why do large language models explore less effectively than humans?

This research investigates why LLMs make decisions too quickly during open-ended exploration tasks. It examines whether the problem lies in training data, prompt engineering, or something deeper in how transformer architectures process information over time.