Chain-of-Thought and Reasoning Methods

Why do models fail at asking good questions during interaction?

When models must actively seek information through questions rather than receive it passively, they struggle dramatically. This explores why GPT-4o plateaus at 35% accuracy and whether training or prompting can fix the underlying deficit.

Can minimal reasoning chains match full explanations?

Does removing all explanatory text from chain-of-thought reasoning preserve accuracy? This tests whether verbose intermediate steps are necessary for solving problems or just artifacts of how language models are trained.

Can reasoning models actually sustain long-chain reflection?

Tests whether large reasoning models genuinely perform self-correction and backtracking, or merely simulate it fluently. Uses constraint satisfaction problems where performance cannot be faked by surface plausibility.

Why does autoregressive generation fail at constraint satisfaction?

Explores whether the 20-23% performance ceiling on constraint satisfaction benchmarks reflects model limitations or a fundamental architectural mismatch between how LLMs generate tokens and how constraint solvers need to work.

Why do chain-of-thought examples fail across different conditions?

Chain-of-thought exemplars show surprising sensitivity to order, complexity level, diversity, and annotator style. Understanding these dimensions of brittleness could reveal what makes reasoning prompts robust or fragile.

Can one statistical measure serve dual purposes in RL training?

Explores whether cross-rollout variance can simultaneously weight important tokens and filter low-signal queries, potentially unlocking efficiency gains in reasoning tasks without human labels.

How quickly do errors compound during model self-training?

When LLMs train on their own outputs without verification, do small mistakes amplify exponentially? This matters because it determines whether unsupervised self-improvement is even feasible.

Why do models trust their own generated answers?

Can language models reliably detect their own errors through self-evaluation? This explores whether the same process that generates answers can objectively assess their correctness.

Do large language models make the same causal reasoning mistakes as humans?

Research on collider structures reveals whether LLMs share human biases in causal inference. This matters because if both fail identically, collaboration might reinforce rather than correct errors.

Can longer reasoning chains eliminate model sensitivity to input noise?

Does adding more chain-of-thought steps eventually make language models robust to perturbations? This matters because it determines whether extended reasoning is a viable defense against adversarial attacks.

Can small models reason well by just learning output format?

Does reasoning performance depend primarily on adapting how models express outputs rather than acquiring new knowledge? The Tina research tests this by applying LoRA to a 1.5B model during reasoning training.

What alignment data structure best trains reasoning generalists?

Explores whether preference trees—with diverse reasoning chains, multi-turn critique loops, and pairwise contrasts—offer a structured way to build alignment datasets that improve open-model reasoning across domains.

Can models recognize question difficulty before they reason?

Do reasoning language models encode implicit knowledge of problem difficulty in their hidden states, even before generating solution steps? And if so, why don't they act on this knowledge?

Can reasoning topologies be formally classified as graph types?

This explores whether Chain of Thought, Tree of Thought, and Graph of Thought represent distinct formal graph structures with different computational properties. Understanding this matters because the topology itself determines what reasoning strategies are possible.
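One way to make the graph-type distinction concrete is a minimal sketch that labels a set of directed edges between reasoning steps as a chain, a tree, or a general graph. The function name classify_topology and the edge-list representation are assumptions made for illustration, not taken from the research described; the sketch also assumes at least one edge is given.

```python
from collections import defaultdict


def classify_topology(edges):
    """Label directed thought-to-thought edges as 'chain', 'tree', or 'graph'.

    Hypothetical helper for illustration; `edges` is a list of (src, dst) pairs.
    """
    parents = defaultdict(set)
    children = defaultdict(set)
    nodes = set()
    for src, dst in edges:
        parents[dst].add(src)
        children[src].add(dst)
        nodes.update((src, dst))

    roots = [n for n in nodes if not parents[n]]
    # Several starting points, or a step with more than one parent
    # (branches merging), rules out both chain and tree.
    if len(roots) != 1 or any(len(parents[n]) > 1 for n in nodes):
        return "graph"

    # Every step must be reachable from the single root; otherwise a
    # detached cycle could hide behind the one-parent check above.
    seen, stack = {roots[0]}, [roots[0]]
    while stack:
        for nxt in children[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    if seen != nodes:
        return "graph"

    # A chain never branches; a tree branches but never merges.
    if all(len(children[n]) <= 1 for n in nodes):
        return "chain"
    return "tree"


if __name__ == "__main__":
    print(classify_topology([("a", "b"), ("b", "c")]))
    print(classify_topology([("a", "b"), ("a", "c"), ("c", "d")]))
    print(classify_topology([("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]))
```

Under this framing, the defining difference is structural: a chain has no branching, a tree allows branching without merging, and a general graph allows intermediate steps to be combined.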

Do reasoning traces actually cause correct answers?

Can we identify which tokens actually matter for reasoning?

Should reasoning benchmarks score final answers or reasoning traces?

What makes reflection actually work in reasoning models?

Can rubrics and dense rewards work together without hacking?

When does sequential reasoning beat parallel voting?

Which sentences actually steer a reasoning trace?

Does training data format shape reasoning strategy more than domain?

Why do standard process reward models fail on thinking traces?