SYNTHESIS NOTE

Do language models fail at reasoning due to complexity or novelty?

Explores whether reasoning-model failures stem from task complexity thresholds or from encountering unfamiliar instances. Tests whether scaling chain length actually addresses the root cause of reasoning breakdown.

Synthesis note · 2026-04-07 · sourced from Flaws

The standard narrative around reasoning-model failures, from Shojaee et al.'s Illusion of Thinking onward, frames the phenomenon as a "complexity threshold" or "step threshold": models handle short reasoning chains but break on long ones, as if the sheer quantity of reasoning gives out past some limit. The Chollet-Kambhampati exchange reframes the failure at the instance level, and the reframing matters for what "improving reasoning" can mean.

Chollet's claim: "Many people assume that LRM reasoning breaks down past a certain 'complexity' or 'number of steps' threshold. This is incorrect. It breaks down past an unfamiliarity threshold. And that threshold is very low. There is no limit to the complexity of tasks you can solve with these models, no limit to the number of steps in the reasoning chains they can master — as long as they have been covered during training/tuning. However, show them something unfamiliar, even very simple and requiring just a handful of reasoning steps (e.g., an ARC 2 task), and they will fail." The apparent complexity threshold in Tower of Hanoi exists because Tower of Hanoi is a familiar problem — the step count at which models fail corresponds to the step count at which instances stop appearing in their training data. Scaling step count is an indirect way of generating novelty, not an independent difficulty axis.
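To make the step-count point concrete, here is a minimal sketch in Python. It relies on one well-known fact: the shortest Tower of Hanoi solution for N disks has 2^N - 1 moves. The function names (`hanoi_moves`, `min_move_count`) are illustrative choices, not taken from any of the cited work. The sketch shows that the algorithm stays the same three-line recursion at every N while the length of the solution trace grows exponentially. Whether traces of a given length appear in training data is the note's claim; the code does not demonstrate it.

```python
def hanoi_moves(n, source="A", target="C", spare="B"):
    """Return the optimal move list for n disks as (disk, from_peg, to_peg) tuples."""
    if n == 0:
        return []
    # Move n-1 disks out of the way, move the largest disk, then move the n-1 disks back on top.
    return (
        hanoi_moves(n - 1, source, spare, target)
        + [(n, source, target)]
        + hanoi_moves(n - 1, spare, target, source)
    )


def min_move_count(n):
    """Closed form for the length of the optimal solution."""
    return 2 ** n - 1


# The recursion is identical at every size; only the trace length changes.
for n in (3, 5, 10, 15):
    print(f"N={n:2d}  optimal moves={min_move_count(n)}")
```

Each added disk roughly doubles the trace. On this reading, raising N changes how long an instance's solution is without changing the procedure that solves it, which is why size works as an indirect proxy for novelty.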

Kambhampati adds the systematic observation: LRMs lose accuracy as familiar-problem instances grow because they don't learn algorithms — they fit instance-based patterns. The two agree on the substantive claim even while they initially disagreed on terminology: "We don't actually disagree, we all know that Transformers don't fit generalizable algorithms, they fit instance-based patterns. It doesn't change the fact that the crux of the problem is familiar vs unfamiliar (at the instance level, not at the abstract 'task' level)."

The reframing has sharp implications. First, the intuition that one can "just scale more reasoning tokens" to fix reasoning failures is structurally misguided. If reasoning failure is driven by instance novelty, then scaling tokens, which extends the reasoning chain, helps only if the longer chain covers more familiar instance territory. The benefit does not extend to any genuinely unfamiliar instance, no matter how short.

Second, the natural evaluation target shifts. Benchmarks that scale complexity (Tower of Hanoi with larger N, River Crossing with more pairs) generate instance novelty indirectly, through size. ARC 2 and similar benchmarks generate instance novelty directly, by changing task structure. The latter is a better measure of whether the model is fitting algorithms or fitting patterns.

Third, the definition of "familiarity" matters, and Chollet makes it precise: "outside of the classroom, in the real world, you are never exposed to neatly defined 'tasks' and step-by-step algorithms, you are only exposed to situations. Intelligence is the ability to infer generalizable algorithms from situations (instances) only. So the only reasonable definition of familiarity/novelty is at the situation/instance level. If you define it with respect to algorithms you are assuming the problem has already been solved."
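The evaluation shift described above amounts to treating solution length and instance familiarity as separate axes. A rough sketch of that bookkeeping follows. Everything in it is hypothetical: the `EvalRecord` fields, the `step_cutoff` parameter, and above all the `familiar` label, which assumes some way of judging whether a close instance was covered in training. That is usually not directly observable.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRecord:
    steps: int       # length of the reference solution (complexity axis)
    familiar: bool   # hypothetical label: was a close instance plausibly seen in training?
    correct: bool    # did the model solve this instance?


def accuracy_grid(records, step_cutoff):
    """Accuracy per (short/long, familiar/unfamiliar) cell."""
    totals = defaultdict(lambda: [0, 0])  # cell -> [n_correct, n_total]
    for r in records:
        cell = (
            "long" if r.steps > step_cutoff else "short",
            "familiar" if r.familiar else "unfamiliar",
        )
        totals[cell][0] += int(r.correct)
        totals[cell][1] += 1
    return {cell: c / n for cell, (c, n) in totals.items()}


def axis_effects(grid):
    """Average accuracy drop along each axis; needs all four cells (KeyError otherwise)."""
    length_drop = sum(
        grid[("short", f)] - grid[("long", f)] for f in ("familiar", "unfamiliar")
    ) / 2
    novelty_drop = sum(
        grid[(s, "familiar")] - grid[(s, "unfamiliar")] for s in ("short", "long")
    ) / 2
    return {"length_drop": length_drop, "novelty_drop": novelty_drop}
```

Under a complexity-threshold account, `length_drop` should dominate. Under the unfamiliarity account, `novelty_drop` should dominate, and the two diagnostic cells are long-but-familiar (expected to hold up) and short-but-unfamiliar (expected to fail). The sketch ignores sampling noise and says nothing about how to obtain the familiarity labels.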

This aligns with and sharpens several existing notes. "Do foundation models learn world models or task-specific shortcuts?" identified task-specific heuristics as the mechanism; Chollet and Kambhampati identify the corresponding failure condition: the heuristics work where they have instance coverage and fail where they do not. "Do transformers actually learn systematic compositional reasoning?" provides the mathematical substrate: if compositional reasoning is subgraph matching, then novelty at the subgraph level is what breaks the mechanism. "Does chain-of-thought reasoning reveal genuine inference or pattern matching?" extends this to the performance-vs-reasoning gap: CoT imitates the form of abstract reasoning without performing it, which is exactly why it handles familiar problems at scale but fails on unfamiliar problems at low complexity.

The reframing also creates a tension with some optimistic RL results. "Can reinforcement learning discover reasoning strategies base models cannot?" shows that extended RL (ProRL) can produce strategies not present in the base model. If reasoning is purely instance-pattern-fitting, where does the novelty in ProRL come from? One possible reconciliation: RL-discovered "novel strategies" may still be instance-family novelty, in which the model learns to combine previously separate instance patterns in new ways, producing what looks like strategy but is still pattern composition. This would be genuine progress within the instance-pattern regime without escaping it. A test: take a ProRL-extended model and evaluate it on ARC 2. If the instance-novelty thesis is right, ProRL gains should not transfer to instance-level novelty challenges.
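The proposed test can be phrased as a comparison of two gains: how much the RL-extended model improves over its base on the task families it was trained on, and how much it improves on a benchmark built for instance novelty. The sketch below is only the arithmetic of that comparison. The function and argument names are hypothetical, the inputs are per-item pass/fail lists that would have to come from real evaluation runs, and no statistical testing is included.

```python
def accuracy(outcomes):
    """Fraction of True values in a non-empty list of per-item pass/fail outcomes."""
    return sum(outcomes) / len(outcomes)


def transfer_report(base_in_family, rl_in_family, base_novel, rl_novel):
    """Compare the RL gain on trained task families with the gain on a novelty benchmark."""
    gain_in = accuracy(rl_in_family) - accuracy(base_in_family)
    gain_novel = accuracy(rl_novel) - accuracy(base_novel)
    # The ratio is only meaningful when RL actually helped in-family.
    ratio = gain_novel / gain_in if gain_in > 0 else None
    return {
        "in_family_gain": gain_in,
        "novel_gain": gain_novel,
        "transfer_ratio": ratio,
    }
```

If the instance-novelty thesis holds, `novel_gain` should stay near zero even when `in_family_gain` is large, giving a `transfer_ratio` near zero. A ratio approaching one would count against the thesis.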

The practical implication for evaluation design is straightforward. Current benchmarks that scale complexity to induce failure are indirectly measuring instance coverage in training data. Benchmarks that induce instance novelty at fixed short complexity — ARC 2, held-out reasoning tasks with genuinely new structure — measure what matters: whether the model is doing anything other than pattern lookup.

Inquiring lines that read this note 250

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

Why do benchmark improvements fail to reflect actual reasoning quality?

How do neural networks separate factual knowledge from reasoning abilities?

When do additional thinking tokens stop improving reasoning performance?

How does example difficulty affect learning efficiency in language models?

Why do reasoning models fail at systematic problem-solving and search?

How do training data properties shape reasoning capability development?

Why do correct reasoning traces tend to be shorter than incorrect ones?

Do base models contain latent reasoning that training can unlock?

How effectively do deterministic tools improve language model reasoning on formal tasks?

Why does reinforcement learning suppress output diversity compared to supervised fine-tuning?

Does optimizing directly for semantic diversity improve both reasoning quality and exploration?

What limits mechanistic interpretability's ability to characterize models?

Why does self-revision increase model confidence while degrading accuracy?

How does policy entropy collapse constrain reasoning-focused reinforcement learning?

What coordination failures limit multi-agent LLM systems as they scale?

How does silent agreement differ from collaborative reasoning collapse?

Does model scaling alone produce compositional generalization without symbolic mechanisms?

What critical LLM failures do standard benchmarks hide?

Do language models understand semantics or rely on pattern matching?

Do language models perform faithful symbolic reasoning independent of semantic grounding?

How do language models establish social grounding in human dialogue?

Why do conventional mental models fail when applied to AI interaction?

Do language models learn genuine linguistic structure or just surface patterns?

How faithfully do LLMs reflect their actual reasoning in outputs and explanations?

Do reasoning traces faithfully represent or merely mimic actual model reasoning?

How does reasoning effort affect AI theory of mind performance?

Why does finetuning cause catastrophic forgetting of model capabilities?

Do language models develop causal world models or rely on statistical patterns?

How should iterative research systems allocate reasoning per search step?

How do search tasks differ from derivation tasks in reasoning efficiency?

Why do language models struggle with implicit discourse relations?

Why do language models fail at implicit discourse relations while handling explicit connectives?

Can inference-time compute substitute for scaling up model parameters?

Can prompting inject entirely new knowledge into language models?

How do knowledge injection methods compare across cost and effectiveness?

Which RAG sub-decisions are actually pattern matching versus reasoning intensive?

Does decoupling planning from execution improve multi-step reasoning accuracy?

How do knowledge graphs enable efficient multi-hop reasoning over alternatives?

Can hyperedges replace triple-based externalization in reasoning tasks?

How can AI systems learn from failures without cascading errors?

How should inference compute be adaptively allocated based on prompt difficulty?

Why do language models reinforce false assumptions instead of correcting them?

How do adversarial and manipulative prompts attack reasoning models?

Can debate mechanisms prevent silent agreement on wrong answers in multi-agent reasoning?

Why does ambiguity detection require different multi-agent mechanisms than verifiable reasoning tasks?

What capability tradeoffs emerge when scaling model reasoning abilities?

Why does supervised fine-tuning improve accuracy while degrading reasoning quality?

How does fine-tuning on natural language inference affect fallacy susceptibility?

How do LLMs distinguish causal reasoning from temporal and semantic associations?

How does reasoning graph topology affect breakthrough insights and generalization?

Why do multi-turn conversations degrade AI intent and coherence?

Why do weaker language models fail at multi-turn strategic questioning?

Does parallel reasoning outperform sequential thinking under fixed compute budgets?

How does test-time aggregation affect reasoning correctness and reliability?

How does majority voting fail when reasoning samples lack genuine diversity?

What actually drives chain-of-thought reasoning improvements in language models?

Do language model representations contain causally steerable task-specific features?

Is gradient behavior in language functional or a sign of ambiguity?

Can model confidence signals reliably improve reasoning quality and calibration?

Do corrupted reasoning traces serve as effective supervision signals?

How can process reward models supervise complex reasoning traces?

Why does outcome supervision fail for long reasoning chains?

Why do agents confidently report success despite actually failing tasks?

What structural features enable agents to detect when understanding has broken down?

Can AI-generated outputs constitute genuine knowledge or valid claims?

Why can't pattern-matching systems perform the observation that expert communication requires?

What role does compression play in language model capability and generalization?

How much does schema bloat actually degrade reasoning in large language models?

What are the consequences of models training on synthetic data?

Does model collapse occur across different architectures or only in specific conditions?

How should models express uncertainty rather than forced confident answers?

Can ensemble evaluation methods reduce bias more than single judges?

Why does enlarging the evaluation unit reintroduce comparability problems?

How does latent reasoning compare to verbalized chain-of-thought?

How should retrieval systems optimize for multi-step reasoning during inference?

Why do fixed-size document chunks break complex procedural question answering?

When does optimizing for quality undermine the value of diversity?

What articulatory information do speech signals carry that text cannot?

Why do multimodal models fail on rare and underrepresented concepts?

Does domain specialization cause models to lose capabilities elsewhere?

Can expert-derived knowledge bases scale to other high-stakes domains?

When does architectural design matter more than raw model capacity?

Why do harder puzzles cause all models to collapse despite larger token budgets?

Does recurrence enable reasoning capabilities that fixed-depth transformers cannot achieve?

How sensitive is analogical reasoning emergence to training data and scale?

What determines success in training models on multiple tasks?

Why do larger models reduce interference between rare and common tasks?

How do prompt structure and constraints affect model instruction reliability?

Why do semantically related prompts converge into attractor states in middle layers?

Related concepts in this collection 11

Do foundation models learn world models or task-specific shortcuts? When transformer models predict sequences accurately, are they building genuine world models that capture underlying physics and logic? Or are they exploiting narrow patterns that fail under distribution shift?
the mechanism beneath the phenomenon; heuristics work within instance coverage and fail outside
Do transformers actually learn systematic compositional reasoning? Explores whether transformers solve compositional tasks through genuine systematic reasoning or by pattern-matching against training data. This matters because it determines whether scaling alone can achieve robust generalization.
the mathematical substrate: subgraph matching is instance-level pattern matching
Does chain-of-thought reasoning reveal genuine inference or pattern matching? Explores whether CoT instructions unlock real reasoning capabilities or simply constrain models to mimic familiar reasoning patterns from training data. This matters for understanding whether language models can actually reason abstractly.
CoT imitates form without performing inference; unfamiliarity reveals the imitation
Does more thinking time always improve reasoning accuracy? Explores whether extending a model's thinking tokens linearly improves performance, or if there's a point beyond which additional reasoning becomes counterproductive.
the apparent threshold may be unfamiliarity not tokens
Why do reasoning LLMs fail at deeper problem solving? Explores whether current reasoning models systematically search solution spaces or merely wander through them, and how this affects their ability to solve increasingly complex problems.
wandering may be the novelty response
Does the reasoning cliff depend on how we test models? If language models hit a capability wall in text-only reasoning tasks, does that wall disappear when they can use tools? What does this reveal about what we're actually measuring?
complementary reframing at the execution layer
Can reinforcement learning discover reasoning strategies base models cannot? Does RL training truly expand what models can do, or does it just find solutions already hidden in base models? ProRL tests this by running RL longer and on diverse tasks beyond mathematics.
apparent tension; possibly resolved as instance-family novelty rather than algorithm novelty
Can neural networks learn compositional skills without symbolic mechanisms? Do neural networks need explicit symbolic architecture to compose learned concepts, or can scaling alone enable compositional generalization? This asks whether compositionality is an architectural feature or an emergent property of scale.
partial counterpoint: scaling data closes some generalization gaps, but instance novelty remains the boundary
Can identical outputs hide broken internal representations? Can neural networks produce correct outputs while having fundamentally fractured internal structure that prevents generalization and creativity? This challenges our assumptions about what performance benchmarks actually measure.
FER (fractured entangled representation) is the representation-level parallel; identical benchmark scores can mask different instance coverage
Can transformers improve exponentially by learning from their own correct solutions? Can standard transformers achieve extreme length generalization by iteratively filtering and training on their own correct outputs? This explores whether self-correction loops enable unbounded out-of-distribution improvement without architectural changes.
subtle counterpoint: length generalization within a familiar task family (addition at longer digit counts) still extends beyond initial instance coverage through iteration; but the instance type stays familiar, so this may be "same-algorithm novelty" that the thesis accommodates
Are reasoning model collapses really failures of reasoning? Explores whether language models hit a fundamental reasoning ceiling or whether text-only evaluation masks execution limitations. Examines how tool access might reveal hidden reasoning capabilities.
alternative diagnosis at the execution layer

Original note title

LRM reasoning breakdown is driven by instance-level unfamiliarity not task-level complexity — there is no limit to reasoning chain length as long as the instances were covered during training
