Why does chain-of-thought reasoning fail in predictable ways?

Evidence that chain-of-thought reasoning fails through imitation limitations, distribution shifts, and structural coherence patterns.

Topic Hub · 34 linked notes · 15 sections

The Imitation Thesis

3 notes

Does chain-of-thought reasoning reveal genuine inference or pattern matching?

Explores whether CoT instructions unlock real reasoning capabilities or simply constrain models to mimic familiar reasoning patterns from training data. This matters for understanding whether language models can actually reason abstractly.

Does chain-of-thought reasoning actually generalize beyond training data?

Explores whether CoT's strong benchmark performance reflects genuine reasoning ability or merely learned patterns tied to specific distributions. Tests how CoT behaves when tasks, formats, or reasoning length shift away from the training data.

Does longer reasoning actually mean harder problems?

Do chain-of-thought trace lengths reliably reflect problem difficulty, or do they primarily indicate proximity to training examples? Understanding this matters for designing effective scaling heuristics.

Structural Coherence over Content

3 notes

What do models actually learn from chain-of-thought training?

When models train on reasoning demonstrations, do they memorize content details or absorb reasoning structure? Testing with corrupted data reveals which aspects of CoT samples actually drive learning.

Does long chain-of-thought reasoning follow molecular bond patterns?

Can we understand extended reasoning as organized like molecular structures with distinct interaction types? This matters because it explains why mixing reasoning traces from different sources often fails despite similar statistics.

Why does chain-of-thought accuracy eventually decline with length?

Explores why longer reasoning chains don't always improve answers, and how the optimal length shifts based on task difficulty and model capability.

Trace Transparency Failure

1 note

Error Amplification and Overthinking

3 notes

Why do reasoning models overthink ill-posed questions?

Explores why models trained for extended reasoning produce drastically longer, less useful responses to unanswerable questions—and whether this represents a fixable training deficit or inherent limitation.

Do models fail worse when their own errors fill the context?

As a model's prior mistakes accumulate in context, does subsequent accuracy degrade predictably? And can scaling or architectural changes prevent this self-contamination effect?

Does failed-step fraction predict reasoning quality better?

Can we use the fraction of abandoned reasoning branches to forecast whether a model will solve a problem correctly? This matters because it could guide more efficient test-time scaling than simply adding more tokens.

Structured Reasoning as CoT Discipline

1 note

Can structured templates make code reasoning more reliable than free-form thinking?

Unstructured chain-of-thought reasoning lets models skip cases and make unsupported claims. This explores whether semi-formal templates requiring explicit premises, evidence traces, and alternative checks can prevent these failure modes.

CoT in the Wrong Domain

3 notes

Does verbose chain-of-thought actually help multimodal perception tasks?

Extending RLHF to MLLMs through longer rationales follows the successful reasoning playbook, but may backfire on perception tasks. This explores when and why the standard CoT-and-RL recipe fails.

Does self-distillation harm mathematical reasoning performance?

Self-distillation usually improves models while shortening outputs, but mathematical reasoning shows a puzzling exception: performance drops by up to 40%. What mechanism explains this counter-intuitive degradation?

Does richer teacher context hurt student generalization?

When teachers are given more information during distillation, they produce confident but brittle students. Does this trade-off between in-domain wins and out-of-distribution robustness hold across different task distributions?

Instruction-Following Deficits

3 notes

Why do better reasoning models ignore instructions?

As models develop stronger reasoning abilities through training, they appear to become worse at following specified constraints. Is this an unavoidable trade-off, and what causes it?

How does instruction density affect model performance?

As language models must track more simultaneous instructions, does their ability to follow them predictably degrade? IFScale measures this across frontier models to understand practical limits.

Do strict output formats hurt LLM reasoning ability?

When LLMs must produce structured JSON or XML with specific schemas, does this constrain their capacity for complex reasoning? This matters because production systems often enforce strict formats for parsing convenience.

Training-Induced Distortions

3 notes

Does RL training collapse format diversity in pretrained models?

Exploring whether RL fine-tuning systematically selects one output format from pretraining while suppressing others, and how this selection mechanism drives performance gains.

Why do reasoning models fail at exception-based rule inference?

Explores why chain-of-thought models systematically underperform on tasks requiring inductive rule inference from exceptions in game-based settings, despite excelling at normal rule patterns.

Does training objective determine which direction models fail at abstention?

Calibration failures might not be universal—different training approaches could push models toward opposite extremes of refusing or overconfidently answering. Understanding whether the training objective, not just model capability, drives these failures could reshape how we think about fixing them.

Faithfulness, Monitoring, and the Perception-Action Gap (2026-05-18)

4 notes

Do models actually perceive hints they fail to mention?

When models don't mention hints in their reasoning, is it because they didn't notice them, or because they chose not to report them? A follow-up probe across 11 models tests whether perception or selection explains the omission.

Does telling models they are watched improve reasoning faithfulness?

Explores whether informing models their reasoning is being monitored—a cheap prompt intervention—actually increases the rate at which they verbalize their reasoning steps, drawing on human behavioral science intuitions.

Why do models hide what users want them to say?

Chain-of-thought monitoring should catch when models follow user preferences, but sycophancy cues—hints about what users want—are both most influential and least reported. Why does the model's reasoning trace systematically obscure this failure mode?

Can models recognize question difficulty before they reason?

Do reasoning language models encode implicit knowledge of problem difficulty in their hidden states, even before generating solution steps? And if so, why don't they act on this knowledge?

Premature Commitment as Trainable Signal · Batch #3 backlog (2026-06-03)

1 note

Can confidence trajectories reveal when reasoning goes wrong?

Does the timing of when a model commits to an answer predict whether its reasoning will be flawed? And can we use this signal to train better reasoning without expensive annotations?

Deceptive Reasoning Attacks · Batch #3 wave 2 (2026-06-03)

1 note

Math-Reasoning Fragility · Batch #5 backlog (2026-06-03)

1 note

SFT Trace Curation · new papers (2026-06-03)

1 note