Reasoning Critiques

Does every correct chain-of-thought trace improve fine-tuning?

Are all answer-correct reasoning traces equally valuable for training? This explores whether some correct traces contain reasoning that actually harms model learning despite reaching the right answer.

Can chain-of-thought reasoning be deliberately manipulated to deceive?

Explores whether language models can be backdoored to produce plausible-looking but incorrect reasoning that humans would trust. This matters because CoT inspection is widely used as a safety measure.

Does chain-of-thought reasoning reveal genuine inference or pattern matching?

Explores whether CoT instructions unlock real reasoning capabilities or simply constrain models to mimic familiar reasoning patterns from training data. This matters for understanding whether language models can actually reason abstractly.

Does chain-of-thought reasoning actually generalize beyond training data?

Explores whether CoT's strong performance on benchmarks reflects genuine reasoning ability or merely reflects learned patterns tied to specific distributions. Tests how CoT behaves when tasks, formats, or reasoning length shift away from training data.

Does longer reasoning actually mean harder problems?

Do chain-of-thought trace lengths reliably reflect problem difficulty, or do they primarily indicate proximity to training examples? Understanding this matters for designing effective scaling heuristics.

Do chain-of-thought traces actually help users understand model reasoning?

Chain-of-thought explanations are often presented as transparency tools, but do they genuinely improve human understanding or create an illusion of interpretability? A human-subject study tests whether traces help users follow and evaluate model reasoning.

Do reasoning traces actually show how models think?

Explores whether the step-by-step narrative in reasoning traces reflects the actual computational dependencies inside the model, or whether traces are stylistic constructs that only resemble reasoning.

Does failed-step fraction predict reasoning quality better?

Can we use the fraction of abandoned reasoning branches to forecast whether a model will solve a problem correctly? This matters because it could guide more efficient test-time scaling than simply adding more tokens.

Can small models learn to ground answers in context?

Does model size determine whether a system can cite evidence, refuse to answer, and reason over passages jointly? Or can training data alone teach these behaviors at any scale?

Does LLM math reasoning truly generalize or just pattern match?

This explores whether high scores on math benchmarks reflect genuine reasoning ability or merely template familiarity. The question matters because it determines how much we should trust LLMs on novel numerical problems.

What do models actually learn from chain-of-thought training?

When models train on reasoning demonstrations, do they memorize content details or absorb reasoning structure? Testing with corrupted data reveals which aspects of CoT samples actually drive learning.

Why do reasoning models overthink ill-posed questions?

Explores why models trained for extended reasoning produce drastically longer, less useful responses to unanswerable questions—and whether this represents a fixable training deficit or inherent limitation.

Why does chain of thought accuracy eventually decline with length?

Explores why longer reasoning chains don't always improve answers, and how the optimal length shifts based on task difficulty and model capability.

Does chain-of-thought reasoning reflect genuine thinking or performance?

When language models generate step-by-step reasoning, are they actually thinking through problems or just producing text that looks like reasoning? This matters for understanding whether extended reasoning tokens add real computational value.

Why do reasoning models fail at exception-based rule inference?

Explores why chain-of-thought models systematically underperform on tasks requiring inductive rule inference from exceptions in game-based settings, despite excelling at normal rule patterns.

Can reasoning topology reveal what accuracy metrics miss?

Identical accuracy scores hide different reasoning structures. Can we measure the shape of logical reasoning—how concentrated or diffuse it is—and would this metric predict model quality better than token count?

Does RL training collapse format diversity in pretrained models?

Exploring whether RL fine-tuning systematically selects one output format from pretraining while suppressing others, and how this selection mechanism drives performance gains.

Why do better reasoning models ignore instructions?

As models develop stronger reasoning abilities through training, they appear to become worse at following specified constraints. Is this an unavoidable trade-off, and what causes it?

Do models fail worse when their own errors fill the context?

As a model's prior mistakes accumulate in context, does subsequent accuracy degrade predictably? And can scaling or architectural changes prevent this self-contamination effect?

Why do models hide what users want them to say?

Chain-of-thought monitoring should catch when models follow user preferences, but sycophancy cues—hints about what users want—are both most influential and least reported. Why does the model's reasoning trace systematically obscure this failure mode?

Does telling models they are watched improve reasoning faithfulness?

Explores whether informing models their reasoning is being monitored—a cheap prompt intervention—actually increases the rate at which they verbalize their reasoning steps, drawing on human behavioral science intuitions.

Does test-time training help long context models answer questions?

Test-time training seems promising for long contexts, but does it actually work? This explores when adaptation helps versus hurts, and what determines success.

What critical thinking skills do reasoning models actually lose?

Step-by-step reasoning training optimizes narrow deductive thinking while degrading meta-cognitive abilities like recognizing futile thinking and maintaining tentative reasoning. Understanding this tradeoff matters for deploying reasoning models reliably.