Do language models fail at reasoning due to complexity or novelty?
Explores whether reasoning-model failures stem from task complexity thresholds or from encountering unfamiliar instances. Tests whether scaling chain length actually addresses the root cause of reasoning breakdown.
The standard narrative around reasoning-model failures, from Shojaee et al.'s Illusion of Thinking onward, frames the phenomenon as a "complexity threshold" or "step threshold": models handle short reasoning chains but break on long ones, so failure is a function of how much reasoning a problem demands. The Chollet-Kambhampati exchange relocates the cause to the instance level, and that shift matters for what "improving reasoning" can mean.
Chollet's claim: "Many people assume that LRM reasoning breaks down past a certain 'complexity' or 'number of steps' threshold. This is incorrect. It breaks down past an unfamiliarity threshold. And that threshold is very low. There is no limit to the complexity of tasks you can solve with these models, no limit to the number of steps in the reasoning chains they can master — as long as they have been covered during training/tuning. However, show them something unfamiliar, even very simple and requiring just a handful of reasoning steps (e.g., an ARC 2 task), and they will fail." The apparent complexity threshold in Tower of Hanoi exists because Tower of Hanoi is a familiar problem — the step count at which models fail corresponds to the step count at which instances stop appearing in their training data. Scaling step count is an indirect way of generating novelty, not an independent difficulty axis.
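To make the step-count axis concrete: the optimal Tower of Hanoi solution for N disks takes 2^N - 1 moves, so the move sequence a model must reproduce grows exponentially with N. The sketch below generates the optimal move list and prints its length. The function name and peg labels are illustrative and not from the source. It only shows how fast the step count grows; it says nothing about what any training corpus actually contains.

```python
from typing import List, Tuple

Move = Tuple[str, str]  # (from_peg, to_peg)


def hanoi_moves(n: int, src: str = "A", aux: str = "B", dst: str = "C") -> List[Move]:
    """Return the optimal move list for n disks (classic recursive solution)."""
    if n == 0:
        return []
    return (
        hanoi_moves(n - 1, src, dst, aux)    # park the n-1 smaller disks on aux
        + [(src, dst)]                       # move the largest disk to its target
        + hanoi_moves(n - 1, aux, src, dst)  # bring the smaller disks on top of it
    )


if __name__ == "__main__":
    for n in range(1, 11):
        moves = hanoi_moves(n)
        assert len(moves) == 2**n - 1  # closed form for the optimal solution
        print(f"N={n:2d}  optimal moves={len(moves)}")
```

Each added disk roughly doubles the solution length, which is why "more steps" and "less likely to have been seen written out in full" are hard to pull apart on this benchmark.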
Kambhampati adds the systematic observation: LRMs lose accuracy as familiar-problem instances grow because they don't learn algorithms — they fit instance-based patterns. The two agree on the substantive claim even while they initially disagreed on terminology: "We don't actually disagree, we all know that Transformers don't fit generalizable algorithms, they fit instance-based patterns. It doesn't change the fact that the crux of the problem is familiar vs unfamiliar (at the instance level, not at the abstract 'task' level)."
The reframing has sharp implications.
First, the intuition that "just scale more reasoning tokens" will solve reasoning failures is structurally misguided. If reasoning failure is driven by instance novelty, then scaling tokens, which extends the reasoning chain, helps only if the longer chain covers more familiar instance territory. A longer chain does nothing for a genuinely unfamiliar instance, however short its solution.
Second, the natural evaluation target shifts. Benchmarks that scale complexity (Tower of Hanoi with larger N, River Crossing with more pairs) generate instance novelty indirectly, through size. ARC 2 and similar benchmarks generate instance novelty directly, by changing task structure. The latter is a better measure of whether the model is fitting algorithms or fitting patterns.
Third, the definition of "familiarity" matters, and Chollet makes it precise: "outside of the classroom, in the real world, you are never exposed to neatly defined 'tasks' and step-by-step algorithms, you are only exposed to situations. Intelligence is the ability to infer generalizable algorithms from situations (instances) only. So the only reasonable definition of familiarity/novelty is at the situation/instance level. If you define it with respect to algorithms you are assuming the problem has already been solved."
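One way to separate the two axes in an evaluation report is to tabulate accuracy by both solution length and familiarity instead of by length alone. The sketch below assumes each evaluated instance carries a `steps` count and a `familiar` label. Both fields, the cutoff, and the toy results are hypothetical and not from the source; in practice the familiarity label is the hard part to obtain.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class Result:
    steps: int      # length of the reference solution (complexity proxy)
    familiar: bool  # hypothetical label: was a similar instance seen in training?
    correct: bool   # did the model solve this instance?


def accuracy_grid(results: Iterable[Result], step_cutoff: int) -> Dict[Tuple[str, str], float]:
    """Accuracy per (complexity bucket, familiarity bucket) cell."""
    hits: Dict[Tuple[str, str], int] = defaultdict(int)
    totals: Dict[Tuple[str, str], int] = defaultdict(int)
    for r in results:
        cell = (
            "long" if r.steps > step_cutoff else "short",
            "familiar" if r.familiar else "novel",
        )
        totals[cell] += 1
        hits[cell] += int(r.correct)
    return {cell: hits[cell] / totals[cell] for cell in totals}


if __name__ == "__main__":
    # Made-up outcomes, purely to exercise the function.
    toy = [
        Result(5, True, True),
        Result(5, False, False),
        Result(5, False, True),
        Result(200, True, True),
        Result(200, True, False),
        Result(200, False, False),
    ]
    for cell, acc in sorted(accuracy_grid(toy, step_cutoff=50).items()):
        print(cell, f"{acc:.2f}")
```

If the instance-novelty thesis holds, accuracy should vary mainly along the familiar/novel column and much less along the short/long row. A complexity threshold would predict the opposite pattern.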
This aligns with and sharpens several existing notes. "Do foundation models learn world models or task-specific shortcuts?" identified task-specific heuristics as the mechanism; Chollet and Kambhampati identify the corresponding failure condition: the heuristics work where they have instance coverage and fail where they do not. "Do transformers actually learn systematic compositional reasoning?" provides the mathematical substrate: if compositional reasoning is subgraph matching, then novelty at the subgraph level is what breaks the mechanism. "Does chain-of-thought reasoning reveal genuine inference or pattern matching?" extends this to the performance-vs-reasoning gap: CoT imitates the form of abstract reasoning without performing it, which is exactly why it handles familiar problems at scale but fails on unfamiliar problems at low complexity.
The reframing also creates a tension with some optimistic RL results. "Can reinforcement learning discover reasoning strategies base models cannot?" shows, via ProRL, that extended RL can produce strategies not present in the base model. If reasoning is purely instance-pattern-fitting, where do those new strategies come from? One reconciliation: RL-discovered "novel strategies" may still be instance-family novelty, in which the model learns to combine previously separate instance patterns in new ways, producing what looks like strategy but is still pattern composition. This would be genuine progress within the instance-pattern regime without escaping it. A test: take a ProRL-extended model and evaluate it on ARC 2. If the instance-novelty thesis is right, ProRL gains should not transfer to instance-level novelty challenges.
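The proposed test can be summarized as a single number: the share of the RL gain on familiar benchmarks that also shows up on a novelty benchmark. The sketch below is one hypothetical way to compute it. The function name, the choice of a ratio, and the example scores are all assumptions for illustration, not results from ProRL or ARC 2.

```python
def transfer_ratio(
    base_familiar: float,
    rl_familiar: float,
    base_novel: float,
    rl_novel: float,
) -> float:
    """Gain on the novelty benchmark divided by gain on the familiar benchmark.

    A value near 0 means the RL gain did not carry over to novel instances;
    a value near 1 means it carried over in full.
    """
    gain_familiar = rl_familiar - base_familiar
    gain_novel = rl_novel - base_novel
    if gain_familiar <= 0:
        raise ValueError("no gain on the familiar benchmark; ratio is undefined")
    return gain_novel / gain_familiar


if __name__ == "__main__":
    # Placeholder accuracies (percent), invented for the example.
    ratio = transfer_ratio(base_familiar=40, rl_familiar=60, base_novel=5, rl_novel=6)
    print(f"transfer ratio: {ratio:.2f}")
```

Under the instance-novelty thesis the ratio should stay close to zero. A ratio well above zero would count as evidence against the reconciliation offered above.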
The practical implication for evaluation design is straightforward. Current benchmarks that scale complexity to induce failure are indirectly measuring instance coverage in training data. Benchmarks that induce instance novelty at fixed short complexity — ARC 2, held-out reasoning tasks with genuinely new structure — measure what matters: whether the model is doing anything other than pattern lookup.
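A benchmark builder needs some working proxy for instance novelty when picking short items. The sketch below uses token n-gram overlap against a reference corpus, which is a crude surface measure and an assumption of this example, not a definition taken from Chollet or Kambhampati. The function names, the `steps` field, and the thresholds are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def ngrams(text: str, n: int = 3) -> Set[Tuple[str, ...]]:
    """Set of lowercase whitespace-token n-grams in text."""
    toks = text.lower().split()
    return {tuple(toks[i : i + n]) for i in range(len(toks) - n + 1)}


def novelty(instance: str, corpus: Iterable[str], n: int = 3) -> float:
    """1 minus the highest Jaccard n-gram similarity to any corpus document."""
    a = ngrams(instance, n)
    best = 0.0
    for doc in corpus:
        b = ngrams(doc, n)
        union = a | b
        if union:
            best = max(best, len(a & b) / len(union))
    return 1.0 - best


def select_short_novel(
    items: List[Dict], corpus: List[str], max_steps: int, min_novelty: float
) -> List[Dict]:
    """Keep items that are short (few steps) yet dissimilar from the corpus."""
    return [
        it
        for it in items
        if it["steps"] <= max_steps and novelty(it["prompt"], corpus) >= min_novelty
    ]


if __name__ == "__main__":
    corpus = ["move the top disk from peg a to peg c"]
    items = [
        {"prompt": "move the top disk from peg a to peg c", "steps": 1},
        {"prompt": "recolor every enclosed region of the grid", "steps": 3},
    ]
    print(select_short_novel(items, corpus, max_steps=5, min_novelty=0.8))
```

Surface overlap will miss structural familiarity (a reworded Tower of Hanoi is still Tower of Hanoi), so a filter like this is at best a first pass before human or structural review.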
Inquiring lines that use this note as a source 230
This note is a source for these synthesized inquiries.
- How can minimal pairs expose reasoning failures that single-instance accuracy metrics miss?
- When does knowledge activation fail across different model architectures?
- How does the knowing-doing gap widen as tasks become more complex?
- Why does scaling reasoning tokens fail to improve unfamiliar tasks?
- What makes a problem instance unfamiliar to a language model?
- Does sentence-level granularity capture enough structure for complex reasoning tasks?
- Why do single examples trigger large reasoning improvements in models?
- Why do simple length heuristics outperform sophisticated semantic methods?
- Can latent reasoning architectures work as retrofits to existing models?
- How does the frame problem differ between symbolic and statistical reasoning systems?
- Does optimizing directly for semantic diversity improve both reasoning quality and exploration?
- Can Kolmogorov complexity alone capture what makes intelligence general?
- Does self-revision actually improve reasoning in large language models?
- How does policy entropy collapse constrain token-level distribution in reasoning?
- How does silent agreement differ from collaborative reasoning collapse?
- Does scaling model size solve compositional generalization problems?
- Why do language models fail at planning despite understanding strategies?
- Why do reasoning models fail on structurally unfamiliar instances?
- Can symbolic solvers rescue language models from logical reasoning failures?
- Does text-only evaluation hide reasoning collapse that tool use could repair?
- Why do language models fail when semantic content is stripped away?
- How do humans and LMs differ on multi-hop reasoning?
- Can reasoning benchmarks separate logic from believability?
- Can language models reason without relying on learned semantic patterns?
- Where do humans and language models actually diverge in reasoning ability?
- Why do conventional mental models fail when applied to AI interaction?
- Why do language models fall back on frequency heuristics under structural complexity?
- How does era sensitivity in legal cases compound with context length failures?
- Can simple diagnostic tests predict language model performance in production complexity?
- Does selecting examples from multiple complexity levels outperform selecting only high-quality examples?
- Can models learn to select exemplars based on reasoning skills rather than complexity?
- What behavioral markers signal when reasoning chains are performative?
- Why do language models imitate reasoning form without abstract inference capability?
- Do language models learn surface patterns that appear generalizable but actually fail under shift?
- Can reasoning chains work without logical validity?
- Why do reasoning models perform poorly at theory of mind tasks?
- How do rare linguistic registers differ from conceptually complex examples?
- Why do models show performative reasoning on easy tasks but genuine reasoning on hard ones?
- Can structural perturbations harm model accuracy more than semantic ones?
- What makes Compound-QA expose weaknesses in monologue reasoning?
- Why do large language models fail at temporal reasoning in complex legal cases?
- What happens when formal languages satisfy hierarchy but fail learnability constraints?
- Why does homework adherence remain low despite advances in language model capability?
- Do task-specific heuristics improve gradually or appear suddenly at scale?
- Do language models build world models or just task-specific heuristics?
- Why do open-source models trained on proprietary outputs still fail at reasoning?
- Why do models automatically adjust reasoning length to problem difficulty?
- How do search tasks differ from derivation tasks in reasoning efficiency?
- What causes snowball errors to accumulate across reasoning steps in language models?
- Why do language models fail at implicit discourse relations while handling explicit connectives?
- Why do models fail on logically equivalent tasks with different data distributions?
- Do sparse arithmetic circuits explain all language model reasoning abilities?
- Can fractured representations explain why models fail at systematic generalization?
- Does more inference compute help reasoning models match specialized domain performance?
- Why does comparison reasoning generalize better than composition reasoning?
- Is confabulation inevitable in large language models regardless of training?
- Does irrelevant context degrade reasoning even within model context limits?
- Which RAG sub-decisions are actually pattern matching versus reasoning intensive?
- Why does policy entropy collapse limit reasoning and dialogue RL scaling?
- Does architectural design matter more than model scale for reasoning tasks?
- Can frame semantics explain why context matters more than word similarity?
- Why do reasoning models perform worse on theory of mind tasks?
- Why do large language models still have systematic blind spots with complex structures?
- Can hyperedges replace triple-based externalization in reasoning tasks?
- Does scaling data automatically produce compositional reasoning or just better feature encoding?
- Why do language models fail at grounding and inference?
- What reveals the epistemic limits of language models?
- How does reasoning instability prevent models from modeling individuals?
- Can small models solve complex tasks using externalized reasoning graphs?
- Why do simple math problems get worse with longer reasoning chains?
- How should inference budget adapt based on problem difficulty?
- How should reasoning prompts adapt based on question complexity and type?
- Does more thinking always help large language models or sometimes hurt?
- Do language models systematically overestimate accuracy on collective behavior tasks?
- How does random walk length control reasoning complexity in question generation?
- Can minimal adversarial triggers disrupt reasoning across multiple unrelated queries?
- Does model scaling improve knowledge storage faster than reasoning ability?
- Why does ambiguity detection require different multi-agent mechanisms than verifiable reasoning tasks?
- What makes knowledge-rich specialized domains structurally different from general reasoning tasks?
- Why do longer reasoning chains signal hesitation rather than depth?
- How do foundation models develop task-specific heuristics instead of world models?
- Do reasoning models perform genuine logical evaluation or pattern matching?
- How does fine-tuning on natural language inference affect fallacy susceptibility?
- How does context complexity affect LLM performance on temporal reasoning tasks?
- What are collider structures and why do they reveal reasoning errors?
- Why do standard NLP benchmarks hide the most critical language limitations?
- Where do collider-type reasoning errors appear in real-world decisions?
- How does the inability to manage ambiguity undermine literary analysis tasks?
- Can benchmark performance distinguish surface from structural linguistic knowledge?
- Why do surface generalizations fail on unusual syntactic structures?
- How does semantic reasoning differ from symbolic reasoning in language models?
- Why do language models struggle with formal logical reasoning and joins?
- Why does distillation transfer reasoning patterns with few examples?
- Why do different reasoning chains surface different relevant facts?
- Can reasoning models distinguish between new evidence and manipulative reframing?
- What distinguishes conceptual understanding from statistical pattern matching in models?
- Does distillation from reasoning models spread overthinking to smaller models?
- Why do weaker language models fail at multi-turn strategic questioning?
- Do models excel at reasoning depth or memory breadth when scaling test time compute?
- What makes reasoning-specific post-training different from standard parameter scaling?
- Can parallel reasoning chains outperform longer sequential chains with the same compute?
- How does majority voting fail when reasoning samples lack genuine diversity?
- Can mechanistic interpretability explain explanation-execution disconnection?
- Why does chain of thought reasoning fail across different prompt formats?
- Can language models reason without relying on surface level pattern matching?
- What makes deductive reasoning so brittle in language models overall?
- How does structural complexity in sentences degrade LLM reasoning systematically?
- Is gradient behavior in language functional or a sign of ambiguity?
- How does model confidence relate to exemplar brittleness in chain-of-thought?
- Can language models distinguish between novel insight and unjustified conceptual blending?
- How do exemplar properties affect the brittleness of chain-of-thought prompting?
- Why does instruction tuning hurt knowledge-intensive tasks more than reasoning tasks?
- Why do reasoning models fail when input length increases even below context limits?
- Why do reasoning chains degenerate into undirected exploration at scale?
- Why do models overthink easy problems and underthink difficult ones?
- Do reasoning systems reuse cognitive structures across unrelated topics?
- Can explicit optimal algorithms prevent reasoning model collapse at high complexity?
- How do game-based benchmarks reveal reasoning fragmentation across domains?
- Why does chain-of-thought prompting fail to fix length-induced reasoning degradation?
- What explains the gap between perplexity performance and actual reasoning capability?
- Can scaffolding frameworks isolate inductive reasoning from deductive confounds?
- Why do reasoning models wander instead of searching systematically?
- What makes reasoning models worse at understanding people?
- How do longer reasoning chains create vulnerability to attacks?
- Is the reasoning cliff actually a tool-use problem?
- Does small-world structure in reasoning graphs improve generalization?
- Why do reasoning models produce unfaithful or unhelpful reasoning traces?
- Why do format and structure matter more than actual content in reasoning?
- Can dataset design systematically expand reasoning graph diameter?
- Why do verbalized reasoning chains fail on certain problem classes?
- Why does outcome supervision fail for long reasoning chains?
- What role does curriculum design play in reasoning emergence?
- Why do difficult problems force models to develop reasoning strategies?
- How does a single training example trigger phase transitions in reasoning output?
- Can external classifiers reliably decide when a model should reason?
- Why do longer reasoning chains correlate with lower accuracy in o1-like models?
- Can language models accurately evaluate the quality of their own reasoning?
- Why do models confabulate inconsistently across different samples?
- Why does cross-text analogical reasoning fail when semantics decouple from symbols?
- Why does fine-tuning models for continuous reasoning cause catastrophic forgetting?
- Why do current speech benchmarks fail to measure reasoning over audio?
- What structural features enable agents to detect when understanding has broken down?
- What makes some sentences in reasoning traces have disproportionate causal influence?
- Why does premise ordering shift syllogistic reasoning performance by over 30 percent?
- Can language models perform purely symbolic reasoning when semantics are removed?
- What sparse mechanistic structures drive reasoning traces in language models?
- How should inference budgets adapt based on prompt difficulty?
- What metric distinguishes deep reasoning from superficial information propagation?
- Why does removing semantic content collapse reasoning in language models?
- Why do reasoning models fail at learning hidden rules from sparse exceptions?
- Why can't pattern-matching systems perform the observation that expert communication requires?
- Can reasoning models succeed at logic but fail at execution?
- Do reasoning failures stem from strategy or from calculation breakdown?
- Why do causal reasoning directions succeed while temporal reasoning directions fail?
- Does directional knowledge failure indicate shallow pattern matching over deep representation?
- Can weaker models match stronger ones with sufficient search and reasoning budget?
- Do reasoning models switch approaches when encountering local difficulty?
- How can high benchmark performance mask broken reasoning in AI systems?
- Can memorization scores diagnose where reasoning chains become unreliable?
- How much does schema bloat actually degrade reasoning in large language models?
- Does model collapse occur across different architectures or only in specific conditions?
- Can instance-adaptive reasoning happen without sequential token dependencies?
- Are some problems fundamentally unsolvable by parallel inference methods?
- What mechanisms cause reasoning models to wander rather than focus?
- Does task difficulty alone determine how many thinking tokens a model should use?
- How do single wrong steps corrupt entire reasoning chains?
- How do single training examples activate reasoning capabilities in language models?
- Why do reasoning model failures stem from execution rather than reasoning?
- Can machine learning encode pragmatic reasoning about when rules should bend?
- Why does scheme classification require more cognitive load than identifying premises?
- Why do language models plateau at 55 to 60 percent constraint satisfaction?
- Can simple structure perturbations reliably expose memorization in reasoning models?
- Can verification loops and decomposition fix judgment failures?
- How does making implicit reasoning requirements explicit change model performance?
- Do distributed relational tasks consistently underperform local classification across NLP domains?
- What failure modes emerge when scheme classification feeds downstream reasoning pipelines?
- Why do language models struggle with backward reasoning compared to forward?
- Do base models truly possess latent reasoning capability?
- Can cognitive scaffolding replace tool-based reasoning augmentation in language models?
- Why do language models fail at understanding ambiguous or complex requirements?
- What causes reasoning quality to degrade during long research tasks?
- Why do smaller models lose reasoning faithfulness more than larger models?
- Does fine-tuning push models toward reasoning shortcuts that bypass the chain entirely?
- Why does second-hop reasoning fail when composed with out-of-distribution triples?
- What limits external scaling when a model lacks reasoning foundation?
- Why does naive randomness fail to improve stochastic latent reasoning models?
- Why does enlarging the evaluation unit reintroduce comparability problems?
- How do reasoning-related features behave when trained on near-impossible problems?
- Why might rationales that predict common text patterns fail on hard novel reasoning?
- Can reasoning learned from language modeling actually transfer to knowledge-intensive domains?
- Why do long-context language models struggle with compositional reasoning tasks?
- Why do language models plateau at constraint satisfaction regardless of scale?
- Can models distinguish between logical impossibility and their own execution limits?
- What evidence shows that reasoning chains encode token-level functional structure?
- Why do longer reasoning chains explore like tourists instead of scientists?
- What makes deterministic recursive reasoning models underperform on multi-solution tasks?
- What does pass@k reveal about base model reasoning capacity?
- Can models possess latent reasoning capability that training signals fail to unlock?
- How can benchmark accuracy scores mask the absence of interpretable reasoning structure?
- Why do thinking models execute longer tasks than standard language models?
- Why does representation sparsity reliably indicate task difficulty for language models?
- How do frontier models maintain agreement scores above 90 percent across reasoning tasks?
- Why does target probability matter more than task logical complexity?
- Why do language models overthink simple questions when given extra time?
- What computational structures can actually scale serial reasoning depth?
- Why do fixed-size document chunks break complex procedural question answering?
- Why does single-shot learning fail in REVTHINK's multi-source reasoning tasks?
- What geometric structure do language models actually use during inference?
- How do semantic and symbolic reasoning capabilities differ in language models?
- Is reasoning failure caused by task complexity or training distribution gaps?
- Are newer larger language models actually worse at faithful summarization?
- What causes language models' strategic rationality to decline with increased game complexity?
- Why does exemplar performance vary across order complexity diversity and style?
- Why do non-experts default to familiar chart types despite domain complexity?
- Does premature confidence signal flawed reasoning in language models?
- How brittle are chain-of-thought exemplars across order and complexity?
- How does contrapositive augmentation change the tractability of reasoning tasks?
- Does task diversity in pretraining data transfer reasoning better than larger models?
- How can we turn reasoning model failures into useful training signals?
- Why do multimodal models fail on rare and underrepresented concepts?
- Why does document perplexity stay low while question-answering accuracy drops?
- Why does strategy diversity within reasoning chains improve model generalization?
- Do models genuinely reason harder on difficult tasks or just appear to?
- Can expert-derived knowledge bases scale to other high-stakes domains?
- What makes domain-specific utterance resolution harder for general large models?
- How does question difficulty and breadth affect what models learn to reason?
- How do complexity and diversity affect model performance differently?
- How does tool-based reasoning expand what language models can do?
- How does evaluation setting affect measured reasoning capabilities in language models?
- Do rare cultural concepts fail predictably as model scale increases?
Related concepts in this collection 11
- Do foundation models learn world models or task-specific shortcuts?
  When transformer models predict sequences accurately, are they building genuine world models that capture underlying physics and logic? Or are they exploiting narrow patterns that fail under distribution shift?
  the mechanism beneath the phenomenon; heuristics work within instance coverage and fail outside
- Do transformers actually learn systematic compositional reasoning?
  Explores whether transformers solve compositional tasks through genuine systematic reasoning or by pattern-matching against training data. This matters because it determines whether scaling alone can achieve robust generalization.
  the mathematical substrate: subgraph matching is instance-level pattern matching
- Does chain-of-thought reasoning reveal genuine inference or pattern matching?
  Explores whether CoT instructions unlock real reasoning capabilities or simply constrain models to mimic familiar reasoning patterns from training data. This matters for understanding whether language models can actually reason abstractly.
  CoT imitates form without performing inference; unfamiliarity reveals the imitation
- Does more thinking time always improve reasoning accuracy?
  Explores whether extending a model's thinking tokens linearly improves performance, or if there's a point beyond which additional reasoning becomes counterproductive.
  the apparent threshold may be unfamiliarity not tokens
- Why do reasoning LLMs fail at deeper problem solving?
  Explores whether current reasoning models systematically search solution spaces or merely wander through them, and how this affects their ability to solve increasingly complex problems.
  wandering may be the novelty response
- Does the reasoning cliff depend on how we test models?
  If language models hit a capability wall in text-only reasoning tasks, does that wall disappear when they can use tools? What does this reveal about what we're actually measuring?
  complementary reframing at the execution layer
- Can reinforcement learning discover reasoning strategies base models cannot?
  Does RL training truly expand what models can do, or does it just find solutions already hidden in base models? ProRL tests this by running RL longer and on diverse tasks beyond mathematics.
  apparent tension; possibly resolved as instance-family novelty rather than algorithm novelty
- Can neural networks learn compositional skills without symbolic mechanisms?
  Do neural networks need explicit symbolic architecture to compose learned concepts, or can scaling alone enable compositional generalization? This asks whether compositionality is an architectural feature or an emergent property of scale.
  partial counterpoint: scaling data closes some generalization gaps, but instance novelty remains the boundary
- Can identical outputs hide broken internal representations?
  Can neural networks produce correct outputs while having fundamentally fractured internal structure that prevents generalization and creativity? This challenges our assumptions about what performance benchmarks actually measure.
  FER is the representation-level parallel; identical benchmark scores can mask different instance coverage
- Can transformers improve exponentially by learning from their own correct solutions?
  Can standard transformers achieve extreme length generalization by iteratively filtering and training on their own correct outputs? This explores whether self-correction loops enable unbounded out-of-distribution improvement without architectural changes.
  subtle counterpoint: length generalization within a familiar task family (addition at longer digit counts) still extends beyond initial instance coverage through iteration; but the instance type stays familiar, so this may be "same-algorithm novelty" that the thesis accommodates
- Are reasoning model collapses really failures of reasoning?
  Explores whether language models hit a fundamental reasoning ceiling or whether text-only evaluation masks execution limitations. Examines how tool access might reveal hidden reasoning capabilities.
  alternative diagnosis at the execution layer
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Large Language Model Reasoning Failures
- Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey
- Flooding Spread of Manipulated Knowledge in LLM-Based Multi-Agent Communities
- The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
- Comment on The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
- A Comment On "The Illusion of Thinking": Reframing the Reasoning Cliff as an Agentic Gap
- Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
- Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models
Original note title
LRM reasoning breakdown is driven by instance-level unfamiliarity not task-level complexity — there is no limit to reasoning chain length as long as the instances were covered during training