INQUIRING LINE

Why do harder puzzles cause all models to collapse despite larger token budgets?

This explores why throwing more tokens at a hard problem doesn't rescue models that 'collapse' on it — and the corpus's surprising answer is that the budget was never the binding constraint.


The instinct behind the question, that more tokens should buy more reasoning, turns out to be where the corpus pushes back hardest. The collapse isn't a budget shortage: several different things masquerading as 'difficulty' are immune to budget entirely.

The most direct reframing: models don't break at a complexity threshold, they break at a novelty boundary. One line of work finds that reasoning failures are driven by instance-level unfamiliarity, not task-level complexity. Models fit patterns from instances they've seen rather than learning a general algorithm, so a chain of any length succeeds if it resembles training instances and fails if it doesn't (see "Do language models fail at reasoning due to complexity or novelty?"). Extra tokens can't manufacture familiarity with a genuinely new instance, so the budget is spent restating patterns that don't fit.

A second strand says the bottleneck isn't reasoning at all; it's execution. Models confined to text-only generation often *know* the algorithm but can't carry out its steps reliably at scale; give them a tool to actually run the procedure and they sail past the supposed 'reasoning cliff' (see "Are reasoning model collapses really failures of reasoning?"). At the architectural level, some problems are structurally hostile to how these models generate: autoregressive transformers can't retract a token once emitted, but constraint-satisfaction puzzles depend on discarding bad partial guesses, so more tokens just pile up more un-retractable commitments (see "Why does autoregressive generation fail at constraint satisfaction?"). None of these is a quantity-of-thinking problem.
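To make the retraction point concrete, here is a minimal sketch of a classic backtracking solver for N-queens. The puzzle choice and the names `solve_n_queens` and `retractions` are illustrative assumptions, not taken from the cited notes. The step to notice is `cols.pop()`: the solver throws away a partial assignment that turned out to be a dead end, which is exactly the move a left-to-right token stream cannot make on text it has already emitted.

```python
def solve_n_queens(n: int):
    """Place n queens by depth-first search with backtracking.

    Returns (solution, retractions): solution[r] is the column of the queen
    in row r (or None if no placement exists), and retractions counts how
    many tentative placements had to be undone along the way.
    """
    cols: list[int] = []  # cols[r] = column chosen for row r so far
    retractions = 0

    def safe(col: int) -> bool:
        # A new queen in the next row must not share a column or a diagonal
        # with any queen already placed.
        row = len(cols)
        return all(c != col and abs(c - col) != row - r
                   for r, c in enumerate(cols))

    def extend() -> bool:
        nonlocal retractions
        if len(cols) == n:
            return True
        for col in range(n):
            if safe(col):
                cols.append(col)   # tentative commitment
                if extend():
                    return True
                cols.pop()         # retract the bad partial guess
                retractions += 1
        return False

    found = extend()
    return (list(cols) if found else None), retractions


if __name__ == "__main__":
    solution, undone = solve_n_queens(6)
    print("solution:", solution, "placements undone:", undone)
```

Even on a small board the solver undoes placements before it finds a valid layout. A generator that can only append has to carry each discarded guess forward in its context instead of erasing it.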

There's also the matter of diminishing returns on the token axis itself. Both reasoning tokens and search steps follow the same test-time scaling curve: real gains early, then flattening (see "Do search steps follow the same scaling rules as reasoning tokens?"). So a uniform 'just give it more' policy is wasteful: reallocating the same total compute adaptively, with hard prompts getting more and easy ones less, can outperform a larger model run under a uniform budget (see "Can we allocate inference compute based on prompt difficulty?"). Worse, longer budgets can actively backfire: on ill-posed or missing-premise problems, models overthink and spiral instead of disengaging, because training rewards producing reasoning steps and never teaches when to stop (see "Why do reasoning models overthink ill-posed questions?").
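The allocation argument can be illustrated with a toy model. Everything here is an assumption made for the sketch, not a result from the cited notes: the saturating curve `1 - exp(-budget / difficulty)`, the difficulty values, and the function names are all hypothetical. The code spends a fixed pool of budget units greedily, always giving the next unit to the prompt where it buys the largest marginal gain, and compares the result with an even split of the same pool.

```python
import heapq
import math


def solve_prob(budget: int, difficulty: float) -> float:
    # Assumed diminishing-returns curve: steep early gains, then a plateau.
    return 1.0 - math.exp(-budget / difficulty)


def allocate_adaptively(difficulties: list[float], total_budget: int) -> list[int]:
    """Greedily hand out budget units by marginal gain in solve probability."""
    alloc = [0] * len(difficulties)
    # Max-heap (via negated gains) of the value of the *next* unit per prompt.
    heap = [(-(solve_prob(1, d) - solve_prob(0, d)), i)
            for i, d in enumerate(difficulties)]
    heapq.heapify(heap)
    for _ in range(total_budget):
        _, i = heapq.heappop(heap)
        alloc[i] += 1
        d = difficulties[i]
        next_gain = solve_prob(alloc[i] + 1, d) - solve_prob(alloc[i], d)
        heapq.heappush(heap, (-next_gain, i))
    return alloc


def expected_solved(alloc: list[int], difficulties: list[float]) -> float:
    return sum(solve_prob(b, d) for b, d in zip(alloc, difficulties))


if __name__ == "__main__":
    difficulties = [1.0, 4.0, 16.0]   # hypothetical: easy, medium, hard
    total = 30
    uniform = [total // len(difficulties)] * len(difficulties)
    adaptive = allocate_adaptively(difficulties, total)
    print("uniform :", uniform, expected_solved(uniform, difficulties))
    print("adaptive:", adaptive, expected_solved(adaptive, difficulties))
```

With these assumed numbers the adaptive split shifts units from the easy prompt, which saturates almost immediately, toward the harder ones, and solves more in expectation with the same total. In this toy curve the pattern only holds up to a point: a prompt whose difficulty is large enough earns so little per unit that the greedy rule gives it nothing.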

The thread that ties it together: collapse on hard puzzles is a failure of exploration, familiarity, and execution, not of word count. That's also why fixes that *shape* the budget work where raw enlargement doesn't. Curricula that start generous to let the model explore strategies, then tighten to force compression, beat fixed budgets (see "Does gradually tightening token budgets beat fixed budget training?"), and the apparent exploration/exploitation tradeoff that seems to doom hard cases turns out to be partly a token-level measurement artifact rather than a hard wall (see "Is the exploration-exploitation trade-off actually fundamental?").
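A tightening budget curriculum can be sketched as a simple schedule. This is a hypothetical illustration only: the linear shape, the `start` and `end` values, and the name `budget_at` are assumptions, and the cited note does not specify how its budgets are actually annealed.

```python
def budget_at(step: int, total_steps: int, start: int = 4096, end: int = 512) -> int:
    """Token budget for a training step, tightened linearly from start to end.

    Early steps get a generous budget (room to explore strategies); later
    steps get a tight one (pressure to compress). All numbers are placeholders.
    """
    if total_steps <= 1:
        return end
    # Fraction of training completed, clamped to [0, 1].
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return round(start + frac * (end - start))


if __name__ == "__main__":
    for step in (0, 25, 50, 75, 99):
        print(step, budget_at(step, total_steps=100))
```

A staged or exponential schedule would fit the same description; the only property the sketch relies on is that the budget never loosens as training proceeds.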


Sources (8 notes)

Do language models fail at reasoning due to complexity or novelty?

Large reasoning models (LRMs) don't break at complexity thresholds but at instance-novelty boundaries. They fit instance-based patterns rather than generalizable algorithms, so a reasoning chain of any length succeeds when the model was trained on similar instances.

Are reasoning model collapses really failures of reasoning?

Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.

Why does autoregressive generation fail at constraint satisfaction?

The performance ceiling on constraint satisfaction problems is not a model-quality issue but an architectural limitation: autoregressive transformers cannot retract emitted tokens, while CSP solvers fundamentally depend on discarding invalid partial assignments. Symbolic solver integration works because it supplies what the architecture lacks.

Do search steps follow the same scaling rules as reasoning tokens?

Deep research agents improve with more search steps in a pattern mirroring the reasoning-token relationship, with both exhibiting diminishing returns. This reveals a new inference-compute axis beyond model capability alone.

Can we allocate inference compute based on prompt difficulty?

Research shows inference effectiveness varies dramatically by prompt difficulty. Reallocating the same total compute adaptively—giving easy prompts less and hard ones more—substantially outperforms larger models under uniform budgets.

Why do reasoning models overthink ill-posed questions?

Reasoning models generate redundant, lengthy responses to questions with missing premises while non-reasoning models correctly identify them as unanswerable. Training optimizes for producing reasoning steps but never teaches models when to disengage.

Does gradually tightening token budgets beat fixed budget training?

Models trained with progressively tightening token budgets consistently achieve higher accuracy and better token efficiency than fixed-budget baselines. The approach works by separating learning into exploration (discovering strategies with generous budgets) and compression (distilling them under constraints).

Is the exploration-exploitation trade-off actually fundamental?

Hidden-state analysis using Effective Rank metrics shows near-zero correlation between exploration and exploitation, revealing that the trade-off emerges only at the token level. VERL enhances both simultaneously, achieving 21.4% accuracy gains on Gaokao 2024.

Next inquiring lines