Reasoning Architectures

When should an agent stop acting and admit failure?

Agents can refuse to answer, but the critical challenge is timing: knowing at which step to abstain once a task becomes infeasible. This matters because stopping too early abandons tasks that could still be solved, while stopping too late spends further steps on an infeasible one; both directly reduce system reliability.

Does planning direction affect how hard problems become?

Planning research typically searches in one direction only, forward from the initial state toward the goal. But some problems get easier when you work backward from the goal. What makes direction matter, and can language models exploit this?

Do base models already contain hidden reasoning ability?

Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.

Can modular cognitive tools unlock reasoning without training?

Can reasoning capabilities be elicited by structuring LLM calls as isolated cognitive operations—understanding, recalling, examining, and backtracking—rather than through reinforcement learning?

Does chain-of-thought reasoning actually explain model decisions?

When language models show their reasoning steps in agentic pipelines, does the quality of those steps predict or explain the quality of final outputs? This matters for trusting and debugging AI systems.

Can a single problem unlock reasoning through solution critique?

Does exposing models to diverse critiques of different solutions to one problem activate reasoning as effectively as training on many problems? This tests whether solution diversity matters more than problem diversity.

Can reasoning and tool execution be truly decoupled?

Can LLM reasoning be separated from tool observations to eliminate redundant re-prompting and enable parallel execution? Two recent architectures suggest yes, but what are the tradeoffs?
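
One way to picture the decoupling described above is a plan written in a single reasoning pass, with placeholders standing in for tool results that do not exist yet. The following is a minimal sketch under stated assumptions: the `#E1` placeholder syntax, the `TOOLS` table, and the `execute` helper are all hypothetical and are not taken from either architecture.

```python
import re

# Hypothetical tools; a real system would call search APIs, calculators, etc.
TOOLS = {
    "lookup": lambda q: {"capital of France": "Paris"}.get(q, "unknown"),
    "upper": lambda s: s.upper(),
}

# A plan produced in one reasoning pass. "#E1" is a placeholder for evidence
# that does not exist yet, so the planner never has to read tool output.
PLAN = [
    ("#E1", "lookup", "capital of France"),
    ("#E2", "upper", "#E1"),
]

def execute(plan, tools):
    """Run each step, substituting earlier evidence into later arguments."""
    evidence = {}
    for name, tool, arg in plan:
        filled = re.sub(r"#E\d+", lambda m: evidence[m.group(0)], arg)
        evidence[name] = tools[tool](filled)
    return evidence

evidence = execute(PLAN, TOOLS)
```

Because the plan is fixed before any tool runs, steps whose arguments share no placeholders could in principle be dispatched in parallel; this sketch runs them sequentially for simplicity.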

Why do transformers need explicit chain-of-thought reasoning?

Explores whether chain-of-thought is a fundamental reasoning mechanism or a workaround for architectural limitations in how transformers track evolving state across computation steps.

Can interleaving reasoning with real-world feedback prevent hallucination?

Does grounding language model reasoning in external world observations rather than internal associations help prevent error propagation and false outputs? This explores whether breaking the static chain-of-thought pattern can catch and correct mistakes in real time.

Can models reason without generating visible thinking tokens?

Explores whether intermediate reasoning must be verbalized as text tokens, or if models can think in hidden continuous space. Challenges a foundational assumption about how language models scale their reasoning capabilities.

Can structured debate roles help small models detect ambiguity?

Small language models struggle to recognize when problems are underspecified. Can assigning explicit leader-follower roles in multi-agent debates overcome this limitation and boost ambiguity detection accuracy?

Do large language models actually perform iterative optimization?

Explores whether LLMs execute genuine numerical procedures like Newton-Raphson or instead pattern-match to memorized solution templates when solving constrained optimization problems.
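
For reference, a genuine iterative procedure recomputes every step from the current iterate rather than recalling a finished answer. A minimal sketch of Newton-Raphson root finding follows; the function name, tolerance, and test function are illustrative choices, not taken from any study.

```python
def newton_raphson(f, df, x0, tol=1e-10, max_iter=50):
    """Iterate x <- x - f(x)/df(x) until the update is smaller than tol."""
    x = x0
    for _ in range(max_iter):
        slope = df(x)
        if slope == 0:
            raise ZeroDivisionError("derivative vanished; Newton step undefined")
        step = f(x) / slope
        x -= step
        if abs(step) < tol:
            return x
    raise RuntimeError("did not converge within max_iter iterations")

# Root of x^2 - 2, starting from x0 = 1.0 (converges to sqrt(2)).
root = newton_raphson(lambda x: x * x - 2, lambda x: 2 * x, x0=1.0)
```

Each iterate depends on the one before it, which is what separates executing the procedure from pattern-matching to a remembered solution template.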

Why do LLMs struggle with exploration in simple decision tasks?

This explores why large language models fail at exploration—a core decision-making capability—even when they excel at other tasks, and what specific conditions might help them succeed.

Do larger language models solve constrained optimization better?

Explores whether scaling LLMs—through more parameters, better training, or reasoning extensions—improves their ability to satisfy constraints in real optimization problems like power grids and portfolios.

Do fine-tuned language models actually learn optimization procedures?

Can RL fine-tuning teach LLMs to solve constraint-optimization problems through genuine reasoning, or does it merely sharpen pattern-matching? Testing on out-of-distribution variants reveals the mechanism.

Why do outcome-based reward models fail at intermediate step evaluation?

Outcome-based reward models (ORMs) evaluate only final results, creating a mismatch with the need to assess reasoning quality at intermediate steps. Understanding this failure mode matters for building better AI reasoning systems.

Which tokens in reasoning chains actually matter most?

Do language models internally rank tokens by functional importance? Greedy pruning experiments explore whether models preserve symbolic computation while discarding linguistic scaffolding, and what this reveals about reasoning architecture.
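
The greedy pruning idea can be sketched as a loop that repeatedly drops whichever token costs the least to remove. This is only an illustration: `toy_score` is a hypothetical stand-in for whatever model-based importance measure an actual experiment would use.

```python
def greedy_prune(tokens, score, keep):
    """Repeatedly drop the token whose removal leaves the highest score."""
    tokens = list(tokens)
    while len(tokens) > keep:
        best_i = max(
            range(len(tokens)),
            key=lambda i: score(tokens[:i] + tokens[i + 1:]),
        )
        del tokens[best_i]
    return tokens

# Toy stand-in for a model-based score: counts tokens that carry computation.
def toy_score(tokens):
    return sum(t.isdigit() or t in {"+", "="} for t in tokens)

chain = ["so", "we", "add", "2", "+", "3", "=", "5", "therefore"]
pruned = greedy_prune(chain, toy_score, keep=5)
```

Under this toy score the linguistic scaffolding is removed first and the symbolic tokens survive, which is the pattern the question asks about.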

Do reasoning cycles in hidden states reveal aha moments?

What if the internal loops in model reasoning, visible in hidden-state topology, correspond to the moments when the model reconsiders an earlier step? This note explores whether graph cyclicity captures a mechanistic signature of insight.
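
One simple way to quantify cyclicity, assuming hidden states have already been discretized into state labels (for example by clustering, which is an assumption here), is the cyclomatic number of the graph traced by the reasoning path. This sketch is illustrative and not necessarily the measure used in the underlying work.

```python
def cyclomatic_number(path):
    """Count independent cycles in the undirected graph traced by `path`.

    A single walk always traces a connected graph, so the count is
    edges - nodes + 1.
    """
    if not path:
        return 0
    nodes = set(path)
    edges = {frozenset((a, b)) for a, b in zip(path, path[1:]) if a != b}
    return len(edges) - len(nodes) + 1

linear = ["a", "b", "c", "d"]        # never revisits a state
looping = ["a", "b", "c", "a", "d"]  # returns to "a" before moving on
```

A path that never revisits a state scores zero, while each return to an earlier state along a new edge adds one independent cycle.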

Do reasoning models actually beat standard models on optimization?

Explores whether extended chain-of-thought in reasoning models delivers performance gains on constraint-satisfaction problems like power-grid optimization. Matters because reasoning models are treated as automatic upgrades, but the evidence may not support that claim.

Can models reason without generating visible thinking steps?

Do machine reasoning systems actually require verbalized chains of thought, or can they solve complex problems through hidden computation? This challenges how we measure and understand reasoning.

Can curriculum learning approximate expensive process supervision?

Can a reverse curriculum, one that slides the starting point backward from task completion, provide step-level insight comparable to human process annotations while costing no more than outcome supervision?
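
The sliding start point can be sketched as follows, assuming a single demonstration split into steps. The helper name and the example steps are hypothetical; the only reward implied is the final outcome of each stage.

```python
def reverse_curriculum(demo_steps):
    """Yield (given_prefix, steps_left_to_solve) pairs, easiest stage first."""
    n = len(demo_steps)
    for k in range(1, n + 1):
        # Stage k hands the model the first n - k steps and asks for the rest.
        yield demo_steps[: n - k], k

demo = ["parse", "set up equation", "solve", "check"]
stages = list(reverse_curriculum(demo))
```

The first stage starts one step from completion and the last stage starts from scratch, so success or failure at each stage localizes difficulty to a specific step without step-level labels.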

Does RL teach reasoning or just when to use it?

Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.

When does RL actually extend reasoning beyond pretraining?

Does reinforcement learning genuinely expand a model's reasoning capabilities, or does it merely improve sampling from existing knowledge? This question hinges on whether pretraining provides sufficient foundation and whether RL targets tasks within reach.

Why do RL agents stop asking informative questions?

RL-trained agents often fail to seek information effectively, despite being trained to do so. Understanding whether this reflects a capability gap or a training dynamics problem could reveal how to unlock better information-seeking behavior.

Does separating planning from execution improve reasoning accuracy?

Can modular LM architectures that split problem decomposition from solution execution outperform monolithic models? This explores whether decoupling these cognitive operations reduces interference and boosts performance.

Does supervised fine-tuning actually improve reasoning on optimization problems?

When SFT boosts benchmark scores on constraint-optimization tasks, does it genuinely improve the model's ability to find feasible solutions, or just its ability to format answers convincingly?

Can symbolic solvers fix how LLMs reason about logic?

LLMs excel at understanding natural language but fail at precise logical inference. Can pairing them with deterministic symbolic solvers—using solver feedback to refine attempts—overcome this fundamental weakness?
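
The refinement loop described above might look like the sketch below. Everything here is an assumption for illustration: `stub_translator` stands in for an LLM, formulas are written as Python boolean expressions, and the "solver" is a brute-force truth-table check rather than a real symbolic engine.

```python
from itertools import product

VARIABLES = ["rain", "wet"]

def compile_formula(text):
    """Deterministic front end: raises SyntaxError on anything it cannot parse."""
    return compile(text, "<formula>", "eval")

def entails(premises, conclusion):
    """Truth-table check that every model of the premises satisfies the conclusion."""
    for values in product([False, True], repeat=len(VARIABLES)):
        env = dict(zip(VARIABLES, values))
        holds = lambda code: eval(code, {"__builtins__": {}}, env)
        if all(holds(p) for p in premises) and not holds(conclusion):
            return False
    return True

def stub_translator(feedback):
    """Stand-in for an LLM: the first attempt is malformed, the retry is fixed."""
    if feedback is None:
        return ["rain -> wet", "rain"]
    return ["(not rain) or wet", "rain"]

def solve_with_refinement(translate, conclusion, max_rounds=3):
    feedback = None
    for _ in range(max_rounds):
        try:
            premises = [compile_formula(p) for p in translate(feedback)]
        except SyntaxError as err:
            feedback = f"parse error: {err.msg}"  # handed back to the translator
            continue
        return entails(premises, compile_formula(conclusion))
    return None  # translator never produced a parseable formalization

result = solve_with_refinement(stub_translator, "wet")
```

The language model only has to produce a parseable formalization; the inference itself is carried out deterministically.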

Does chain-of-thought reasoning actually explain AI decisions?

Chain-of-thought is pitched as a transparency tool for agentic AI, but empirical evidence raises questions about whether reasoning chains actually predict or explain the system's outputs in practice.

Should LLMs handle abstraction only in optimization?

What if LLMs worked exclusively on translating problems to formal constraints, while deterministic solvers handled the numeric work? Explores whether this division of labor could overcome LLM failures in iterative computation.

Does transformer reasoning leave a geometric signature in representation space?

Treating the forward pass as a trajectory through representation space could reveal whether reasoning tasks and lexical tasks bend the path differently, and whether curvature itself signals computational difficulty.
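
As a rough illustration of what "bending" could mean, the sketch below measures the turning angle between consecutive displacement vectors of a path. It assumes the trajectory is a list of per-layer vectors; total turning angle is one simple discrete proxy for curvature, not necessarily the measure the note has in mind.

```python
import math

def turning_angles(trajectory):
    """Angle in radians between consecutive displacement vectors of a path."""
    steps = [
        [b - a for a, b in zip(p, q)]
        for p, q in zip(trajectory, trajectory[1:])
    ]
    angles = []
    for u, v in zip(steps, steps[1:]):
        dot = sum(x * y for x, y in zip(u, v))
        norm = math.hypot(*u) * math.hypot(*v)
        if norm == 0:
            angles.append(0.0)  # a stationary step has no direction; treat as straight
            continue
        # Clamp to [-1, 1] to guard against floating-point drift before acos.
        angles.append(math.acos(max(-1.0, min(1.0, dot / norm))))
    return angles

straight = [(0, 0), (1, 0), (2, 0), (3, 0)]
bent = [(0, 0), (1, 0), (1, 1), (0, 1)]
```

Summing the angles gives a single number per forward pass that could then be compared across task types.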

Does RL post-training create reasoning or just deploy it?

Investigates whether reasoning capability emerges during RL fine-tuning or already exists in base models. Matters because it reshapes how we build and optimize reasoning systems.

Can backward reasoning during training improve forward reasoning?

Does training models to reason backward, by generating inverse questions and solutions, build internal consistency checking that transfers to forward-only inference? This explores whether a backward-reasoning capacity that is internalized during training, and never invoked at test time, can still improve reasoning quality.

Why do trajectories matter more than individual examples for in-context learning?

Can language models learn new sequential decision-making tasks from context alone, and if so, what data properties make this possible? This explores why isolated state-action pairs fail where full trajectories succeed.