Why do models show performative reasoning on easy tasks but genuine reasoning on hard ones?
This explores whether models actually 'fake' reasoning on simple tasks while genuinely reasoning on hard ones — and the corpus suggests the premise itself is shaky: traces look performative across the board, but what changes with difficulty is how much real work the model needs to do behind them.
This reads as a question about a difficulty gradient — models seemingly going through the motions on easy problems and doing real work on hard ones. The corpus complicates that story in a useful way: the visible reasoning trace is mostly performance regardless of difficulty, but the underlying computation it sits on top of scales with how unfamiliar the problem is.
Start with the uncomfortable finding that reasoning traces are largely theater everywhere. Multiple notes show that the words a model writes while 'thinking' don't faithfully reflect what produced the answer: invalid logical steps perform nearly as well as valid ones (Do reasoning traces show how models actually think?); deliberately corrupted traces teach as well as correct ones and sometimes generalize better (Do reasoning traces need to be semantically correct?); and the intermediate tokens carry no special execution semantics, being generated like any other output and correlating with answers through learned formatting, not functional logic (Do reasoning traces actually cause correct answers?). Reflection rarely changes an initial answer; it's mostly confirmatory (Can we actually trust reasoning model outputs?). So 'performative on easy tasks' isn't a special failure mode. It's the default character of the trace.
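One way to make the faithfulness claim concrete is a small ablation harness: give a model the same question with an intact trace and with a corrupted one, then compare accuracy. The sketch below is illustrative only. The `model` callable, the `Example` layout, and step shuffling as the corruption method are all assumptions made here, not the procedure used in the cited notes. A gap near zero is consistent with the trace being decorative for that model and dataset.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical data layout: (question, trace steps, gold answer).
Example = Tuple[str, List[str], str]

def shuffle_trace(steps: Sequence[str], seed: int = 0) -> List[str]:
    """Corrupt a trace by shuffling its steps (one simple corruption choice)."""
    rng = random.Random(seed)
    shuffled = list(steps)
    rng.shuffle(shuffled)
    return shuffled

def accuracy(
    model: Callable[[str, List[str]], str],
    data: Sequence[Example],
    corrupt: bool,
) -> float:
    """Fraction of examples answered correctly, with intact or corrupted traces."""
    if not data:
        raise ValueError("data must not be empty")
    correct = 0
    for question, steps, gold in data:
        trace = shuffle_trace(steps) if corrupt else list(steps)
        if model(question, trace) == gold:
            correct += 1
    return correct / len(data)

def toy_model(question: str, trace: List[str]) -> str:
    # Stand-in that ignores the trace entirely, i.e. a fully decorative trace.
    return {"2+3": "5", "4*2": "8"}.get(question, "?")

data = [
    ("2+3", ["take 2", "add 3"], "5"),
    ("4*2", ["take 4", "double it"], "8"),
]
gap = accuracy(toy_model, data, corrupt=False) - accuracy(toy_model, data, corrupt=True)
print(gap)  # 0.0 for this toy model, because it never reads the trace
```

With a real model, `toy_model` would be replaced by a call that conditions generation on the supplied trace; the harness itself makes no claim about how large the gap should be.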
What actually varies with difficulty is the work underneath. One reframing says reasoning failures aren't about complexity at all but about instance-level novelty: a model fits patterns from similar training instances, so any chain succeeds when the problem resembles something seen before, and breaks at the novelty boundary regardless of length (Do language models fail at reasoning due to complexity or novelty?). Easy tasks tend to be familiar: pattern-match and the answer falls out, so the trace is decorative. Hard or novel tasks force the model past memorized scaffolding into genuinely transferable procedure, the kind drawn from broad procedural knowledge in pretraining rather than narrow fact recall (Does procedural knowledge drive reasoning more than factual retrieval?). Another angle: many apparent 'collapses' on hard problems are execution failures, not reasoning failures. The model knows the algorithm but can't carry out the steps at scale in text, and tool access removes the cliff (Are reasoning model collapses really failures of reasoning?).
There's also a sharper inversion of your premise hiding in the corpus. Models tend to overthink easy problems and underthink hard ones: accuracy is non-monotonic in thinking tokens, peaking then declining as the model spins extra words on problems that didn't need them (Does more thinking time always improve reasoning accuracy?). That's the opposite of efficient: the most performative, padded reasoning often shows up on the easy cases. And some 'success' on constrained hard problems turns out to be a conservative default rather than reasoning at all. Most models do worse when constraints are removed, meaning they were leaning on a heuristic, not evaluating the problem (Are models actually reasoning about constraints or just defaulting conservatively?).
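To see what 'non-monotonic in thinking tokens' means operationally, here is a minimal sketch that takes an accuracy sweep over token budgets, finds the peak, and checks whether the curve both rises and falls. The function names are hypothetical, and the sweep values are invented for illustration; they are not measurements from any of the cited notes.

```python
from typing import Sequence

def peak_budget(budgets: Sequence[int], accuracies: Sequence[float]) -> int:
    """Return the token budget with the highest accuracy (first one on ties)."""
    if len(budgets) != len(accuracies) or not budgets:
        raise ValueError("budgets and accuracies must be non-empty and equal length")
    best = max(range(len(budgets)), key=lambda i: accuracies[i])
    return budgets[best]

def is_non_monotonic(accuracies: Sequence[float]) -> bool:
    """True if accuracy both rises and falls somewhere along the sweep."""
    diffs = [b - a for a, b in zip(accuracies, accuracies[1:])]
    return any(d > 0 for d in diffs) and any(d < 0 for d in diffs)

# Made-up sweep for illustration only.
budgets = [256, 1024, 4096, 16384]
accuracies = [0.70, 0.85, 0.80, 0.72]

print(peak_budget(budgets, accuracies))  # 1024
print(is_non_monotonic(accuracies))      # True
```

A sweep needs at least three budgets to show a peak; two endpoints alone can only show that accuracy went up or down.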
The deeper resolution: reasoning capability already lives latent in base model activations, and post-training selects rather than creates it (Do base models already contain hidden reasoning ability?). That work can also happen in hidden states without being verbalized at all (Can models reason without generating visible thinking tokens?). So the genuine reasoning isn't really 'in' the visible trace on hard tasks either; it's in the compute the model is forced to recruit. The promising direction is teaching models to route: to spend extended thinking only when difficulty warrants it and answer directly otherwise, without needing difficulty labels (Can models learn when to think versus respond quickly?). Which suggests the real question isn't 'why is reasoning fake on easy tasks' but 'why do models narrate at all when the answer is already cheap.'
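The routing idea can be sketched at inference time as a simple gate: try a cheap direct answer, and fall back to extended thinking only when confidence is low. This is a deliberately simplified stand-in. The notes describe routing that is learned during training, whereas the fixed `threshold`, the `direct` and `think` callables, and the confidence score below are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass
class RoutedAnswer:
    mode: str    # "direct" or "think"
    answer: str

def route(
    question: str,
    direct: Callable[[str], Tuple[str, float]],
    think: Callable[[str], str],
    threshold: float = 0.8,
) -> RoutedAnswer:
    """Answer directly when the cheap pass is confident; otherwise think."""
    answer, confidence = direct(question)
    if confidence >= threshold:
        return RoutedAnswer("direct", answer)
    return RoutedAnswer("think", think(question))

def toy_direct(question: str) -> Tuple[str, float]:
    # Stand-in for a fast pass: confident only on questions it has memorized.
    known = {"2+2": "4"}
    if question in known:
        return known[question], 0.99
    return "unknown", 0.1

def toy_think(question: str) -> str:
    # Stand-in for a slow, extended-reasoning pass.
    return f"worked answer to {question}"

print(route("2+2", toy_direct, toy_think).mode)           # direct
print(route("17*23 mod 7", toy_direct, toy_think).mode)   # think
```

The gate needs no difficulty labels, only a confidence signal; how trustworthy that signal is in a real model is exactly what a learned router would have to solve.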
Sources (12 notes)
LLM reasoning traces perform as persuasive appearances rather than reliable explanations of computation. Invalid logical steps perform nearly as well as valid ones, and corrupted traces generalize comparably, showing that semantic correctness is not what produces the performance gains.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
R1's intermediate tokens carry no special execution semantics and are generated identically to other LLM output. Invalid traces frequently produce correct answers, proving traces are not causally necessary—they correlate with answers via learned formatting, not functional reasoning.
Research across eight models shows reflection is mostly confirmatory theater—reflections rarely change initial answers and traces don't faithfully represent reasoning. Calibration degrades under binary reward training, and monitoring mechanisms are easily gamed.
LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.
Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.
Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.
Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.
Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Multiple architectures—depth-recurrent models, Heima, and Coconut—demonstrate that test-time compute scales through hidden state iteration rather than token generation. This suggests verbalization is a training artifact, not a reasoning requirement.
Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.