INQUIRING LINE

AI models don't actually think harder on harder problems — they just mimic how long similar training examples were.

Why do models automatically adjust reasoning length to problem difficulty?

This explores whether models genuinely scale their reasoning to match how hard a problem is — and the corpus mostly pushes back on that premise.

Longer traces aren't a thermostat tracking difficulty, and the premise that models lengthen their reasoning as problems get harder mostly doesn't hold. Controlled A* maze experiments find that trace length correlates with difficulty only on problems close to what the model saw in training; push the problem out-of-distribution and the link breaks entirely. What looks like "thinking harder" is largely the model recalling how long similar training examples were (see: Does longer reasoning actually mean harder problems?). A companion finding reframes failure the same way: reasoning collapses not at some complexity threshold but at instance-level novelty, because models fit patterns from specific instances rather than running a general algorithm (see: Do language models fail at reasoning due to complexity or novelty?).
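One way to check the in-distribution versus out-of-distribution claim on your own evaluation logs is to compute the correlation between difficulty and trace length separately for each split. The sketch below is only an illustration, not the code behind the cited maze experiments; the record fields `split`, `difficulty`, and `trace_tokens` are hypothetical names chosen for the example.

```python
import math
from collections import defaultdict


def pearson(xs, ys):
    """Pearson correlation of two series; 0.0 if either series is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def correlation_by_split(records):
    """Group records by split, then correlate difficulty with trace length.

    Each record is a dict with the (hypothetical) keys
    "split", "difficulty", and "trace_tokens".
    """
    difficulties = defaultdict(list)
    lengths = defaultdict(list)
    for record in records:
        difficulties[record["split"]].append(record["difficulty"])
        lengths[record["split"]].append(record["trace_tokens"])
    return {split: pearson(difficulties[split], lengths[split])
            for split in difficulties}
```

If the pattern described above holds, the correlation for the in-distribution split should sit well above the one for the out-of-distribution split.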

The deeper twist is that models can perceive difficulty; they just don't act on it. Linear probes can decode a question's difficulty from a reasoning model's hidden states *before* it writes a single token, yet the model still overthinks easy questions. That's an action-commitment failure, not a perception failure (see: Can models recognize question difficulty before they reason?). So the signal exists internally; the behavior just doesn't follow it. The cost of that disconnect shows up when models pour redundant steps into ill-posed questions with missing premises, questions that a non-reasoning model would simply flag as unanswerable. Training rewards producing reasoning steps but never teaches a model when to stop (see: Why do reasoning models overthink ill-posed questions?).
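For readers unfamiliar with linear probes, the sketch below shows the basic idea: fit a logistic-regression classifier on hidden-state vectors and check whether an easy/hard label can be read off linearly. Everything here is an assumption made for illustration. The "hidden states" are synthetic Gaussian vectors with a planted difficulty direction, not activations from any real model, and the function names are hypothetical.

```python
import math
import random


def make_synthetic_states(n, dim=4, seed=0):
    """Toy stand-in for hidden states (NOT real model activations).

    Hard examples are shifted along an assumed 'difficulty direction',
    easy examples are shifted the opposite way.
    """
    rng = random.Random(seed)
    direction = [1.5, -1.5] + [0.0] * (dim - 2)
    states, labels = [], []
    for _ in range(n):
        hard = rng.random() < 0.5
        sign = 1.0 if hard else -1.0
        states.append([rng.gauss(0.0, 1.0) + sign * d for d in direction])
        labels.append(1 if hard else 0)
    return states, labels


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(states, labels, epochs=200, lr=0.1):
    """Fit logistic regression with full-batch gradient descent."""
    dim, n = len(states[0]), len(states)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(states, labels):
            logit = sum(w * xi for w, xi in zip(weights, x)) + bias
            err = sigmoid(logit) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def probe_accuracy(weights, bias, states, labels):
    """Fraction of examples whose label the probe predicts correctly."""
    correct = 0
    for x, y in zip(states, labels):
        logit = sum(w * xi for w, xi in zip(weights, x)) + bias
        correct += int((1 if logit > 0 else 0) == y)
    return correct / len(labels)
```

High probe accuracy on held-out states only says the difficulty signal is linearly readable. It says nothing about whether the model's generation length responds to that signal, which is exactly the gap the paragraph above describes.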

Why does any difficulty-tracking show up at all, then? When it does, it tends to be an emergent byproduct of reward, not a designed feature. Accuracy follows an inverted-U against reasoning length: optimal length rises with task difficulty but falls as the model gets more capable, and RL training naturally drifts toward shorter chains as models improve, so simplicity emerges from the reward signal rather than being trained in explicitly (see: Why does chain of thought accuracy eventually decline with length?). Push past the sweet spot and accuracy actually drops: one benchmark fell from 87.3% to 70.3% as thinking tokens grew from ~1,100 to ~16K, the classic pattern of overthinking easy problems and underthinking hard ones (see: Does more thinking time always improve reasoning accuracy?).
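To make the inverted-U concrete, here is a deliberately simple toy model. It is not the model from the cited notes; the functional form and every parameter are assumptions picked only because they reproduce the qualitative shape. Accuracy is taken to be the chance that a chain is long enough to cover the needed work, multiplied by the chance that no step in the chain slips.

```python
import math


def toy_accuracy(length, difficulty, capability=1.0, step_error=0.02):
    """Toy accuracy for a chain of `length` steps (illustrative assumption).

    coverage:    longer chains are more likely to cover a task; harder
                 tasks need more steps, more capable models need fewer.
    reliability: every extra step is another chance to make a mistake.
    """
    if length < 0 or difficulty <= 0 or capability <= 0:
        raise ValueError("length must be >= 0; difficulty, capability > 0")
    if not 0.0 <= step_error < 1.0:
        raise ValueError("step_error must be in [0, 1)")
    coverage = 1.0 - math.exp(-capability * length / difficulty)
    reliability = (1.0 - step_error) ** length
    return coverage * reliability


def best_length(difficulty, capability=1.0, step_error=0.02, max_length=500):
    """Brute-force the chain length that maximizes the toy accuracy."""
    return max(
        range(1, max_length + 1),
        key=lambda n: toy_accuracy(n, difficulty, capability, step_error),
    )
```

Under these assumptions, `best_length` grows when `difficulty` increases and shrinks when `capability` increases, and accuracy falls off on both sides of the peak. The toy says nothing about real token counts; it only shows that two opposing pressures are enough to produce the shape.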

The failure usually isn't too little compute; it's disorganized compute. Reasoning models "wander like tourists," exploring invalid paths and abandoning promising ones prematurely, so success probability decays exponentially with problem depth rather than being rescued by longer traces (see: Why do reasoning models abandon promising solution paths?; Why do reasoning LLMs fail at deeper problem solving?). And more context can hurt outright: padding inputs with just 3,000 tokens, far below the context window, dropped accuracy from 92% to 68% (see: Does reasoning ability actually degrade with longer inputs?).
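The exponential decay with depth is easy to see with a little arithmetic. The sketch below assumes, purely for illustration, that each step of a solution succeeds independently with the same probability; real reasoning steps are not independent, and the per-step value used in the comment is a made-up example rather than a measured figure.

```python
def success_probability(per_step_success: float, depth: int) -> float:
    """Chance that `depth` consecutive steps all succeed.

    Assumes independent steps with identical success probability,
    which is a simplification made for illustration only.
    """
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be in [0, 1]")
    if depth < 0:
        raise ValueError("depth must be non-negative")
    return per_step_success ** depth


# With a hypothetical 90% per-step success rate, doubling the depth
# from 10 to 20 steps squares the overall success probability.
shallow = success_probability(0.9, 10)
deep = success_probability(0.9, 20)
```

Because the overall probability is a product, even a high per-step success rate collapses quickly as depth grows, which is why longer traces alone do not rescue deep problems.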

The most interesting corner is what it takes to make difficulty-adjustment real rather than incidental. The capability seems to already be latent: five independent methods (RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR) all elicit reasoning that base models already contain, suggesting post-training selects reasoning rather than creating it (see: Do base models already contain hidden reasoning ability?). Building on that, one approach, Thinkless, explicitly trains a model to *route* between extended thinking and a quick answer using decoupled RL (DeGRPO), learning calibrated mode-selection without ever being handed difficulty labels (see: Can models learn when to think versus respond quickly?). The takeaway: genuine length-to-difficulty matching is something you have to deliberately train *for*, because left to default training it gets approximated by memorized trace lengths and conservative defaults, which only look like adaptive reasoning from the outside (see: Are models actually reasoning about constraints or just defaulting conservatively?).

Sources 12 notes

Does longer reasoning actually mean harder problems?

Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.

Do language models fail at reasoning due to complexity or novelty?

LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so a reasoning chain of any length can succeed when the model was trained on similar instances.

Can models recognize question difficulty before they reason?

Linear probes successfully decode difficulty from LRM representations before reasoning begins, yet models still overthink simple questions. This reveals an action-commitment failure rather than a perception failure.

Why do reasoning models overthink ill-posed questions?

Reasoning models generate redundant, lengthy responses to questions with missing premises while non-reasoning models correctly identify them as unanswerable. Training optimizes for producing reasoning steps but never teaches models when to disengage.

Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.

Does more thinking time always improve reasoning accuracy?

Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.

Why do reasoning models abandon promising solution paths?

Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.

Why do reasoning LLMs fail at deeper problem solving?

Current reasoning models lack the three properties of systematic exploration: validity, effectiveness, and necessity. This causes success probability to drop exponentially with problem depth, making medium problems solvable but deep problems catastrophically harder.

Does reasoning ability actually degrade with longer inputs?

FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Can models learn when to think versus respond quickly?

Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.

Are models actually reasoning about constraints or just defaulting conservatively?

Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM reasoning researcher re-testing claims about whether models calibrate reasoning length to problem difficulty. The question remains open: do reasoning models genuinely adjust their computation depth as a function of task complexity, or is apparent adjustment mostly an artifact of training distribution?

What a curated library found — and when (spanning 2024–2026, though these are dated claims, not current truth):
• Trace length correlates with difficulty only in-distribution; out-of-distribution, the link breaks entirely — models are recalling training exemplar lengths, not computing adaptively (2025).
• Hidden states encode problem difficulty decodably before token generation, yet models fail to act on this signal — an action-commitment gap, not perception failure (2025).
• Accuracy vs. reasoning length follows an inverted-U: optimal length rises with task difficulty but falls as model capability grows; one benchmark dropped from 87% to 70% when thinking tokens grew from ~1,100 to ~16K (2025–2026).
• Reasoning models explore "like tourists," wandering invalid paths and abandoning promising ones; success probability decays exponentially with depth rather than being rescued by longer traces (2025).
• Input padding to 3,000 tokens (well below context limits) dropped accuracy from 92% to 68%, showing disorganized, fragile compute (2024).
• Genuine difficulty-adjusted reasoning requires explicit training (RL steering, critique tuning, decoupled RL routing) — base models contain latent capability that post-training selects rather than builds (2025).

Anchor papers (verify; mind their dates):
• arXiv:2509.07339 — "Performative Thinking? The Brittle Correlation Between CoT Length and Problem Complexity" (2025).
• arXiv:2505.20296 — "Reasoning LLMs are Wandering Solution Explorers" (2025).
• arXiv:2505.13379 — "Thinkless: LLM Learns When to Think" (2025).
• arXiv:2508.01191 — "Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens" (2025).

Your task:
(1) RE-TEST the distribution-dependency and action-commitment claims. Have newer decoding methods, in-context routing, or multi-agent orchestration since resolved the gap between implicit difficulty perception and overt behavior? Does test-time scaling (longer inference budgets) now reliably trigger adaptive length without retraining, or does it still amplify wandering? Separate the durable finding (models struggle to *decide* reasoning depth dynamically) from the perishable one (they can't perceive difficulty).
(2) Surface the strongest **disagreement** in the last ~6 months: any work claiming that longer reasoning *does* improve performance systematically, contradicting the inverted-U and wandering narratives. Flag where that claim holds and where it fails.
(3) Propose two research questions assuming the regime may have shifted: (a) Can hybrid frozen-routing + learned allocation (e.g., SAE steering of compute intensity) achieve real difficulty-scaling without full RL retraining? (b) Do ensemble or mixture-of-experts approaches sidestep the wandering problem by distributing depth per expert?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
