How does reasoning accuracy degrade when token budgets exceed critical thresholds?
This explores what happens when a model is given *too many* thinking tokens — the corpus shows accuracy doesn't just plateau, it actively reverses, and points to why and what to do about it.
The surprising finding is that more thinking can make a model *worse*, not just slower. The clearest data point: pushing thinking from roughly 1,100 tokens up to roughly 16,000 dropped benchmark accuracy from 87.3% to 70.3%, a loss of 17 percentage points (see "Does more thinking time always improve reasoning accuracy?"). The relationship isn't a curve that flattens out; it's non-monotonic. Accuracy climbs, peaks, then declines: models overthink easy problems (talking themselves out of correct answers) while still underthinking the genuinely hard ones.
The frustrating part is that the threshold where this flip happens is invisible until you've crossed it. No fixed rule predicts it in advance: it shifts with the task, the model's training, and the problem's difficulty, although difficulty estimators and runtime confidence signals may be able to detect it on the fly (see "How can we predict the optimal thinking token threshold?"). That's why a single fixed budget is a bad bet: the same token allowance that helps a hard prompt will push an easy one past its overthinking cliff. The corpus's answer is to stop using one budget for everything and allocate adaptively, giving easy prompts less and hard prompts more, which beats a uniform budget even with the same total compute (see "Can we allocate inference compute based on prompt difficulty?").
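Adaptive allocation can be sketched in a few lines. This is a minimal illustration, not the method from the cited note: the function name `allocate_budgets`, the per-prompt floor `min_tokens`, and the idea of a difficulty score in [0, 1] from some upstream estimator are all assumptions made here for the example. It splits a fixed total budget in proportion to difficulty, so the total compute stays the same as a uniform split.

```python
def allocate_budgets(difficulties, total_budget, min_tokens=64):
    """Split total_budget across prompts in proportion to difficulty.

    difficulties: hypothetical non-negative scores, one per prompt
                  (assumed to come from some difficulty estimator).
    Returns a list of integer token budgets that sums to total_budget.
    """
    n = len(difficulties)
    if n == 0:
        return []
    if total_budget < n * min_tokens:
        raise ValueError("total_budget too small for the per-prompt floor")

    # Every prompt gets the floor; only the remainder is difficulty-weighted.
    spare = total_budget - n * min_tokens
    weight_sum = sum(difficulties)
    if weight_sum == 0:
        shares = [spare // n] * n  # no signal: fall back to a uniform split
    else:
        shares = [int(spare * d / weight_sum) for d in difficulties]

    budgets = [min_tokens + s for s in shares]

    # Truncation can leave a few tokens unassigned; give them to the hardest prompt.
    hardest = max(range(n), key=lambda i: difficulties[i])
    budgets[hardest] += total_budget - sum(budgets)
    return budgets


if __name__ == "__main__":
    # Three prompts, easy to hard, sharing the budget a uniform split would use.
    print(allocate_budgets([0.1, 0.3, 0.9], total_budget=6000))
```

The floor keeps easy prompts from being starved entirely; how low it can go before underthinking sets in is exactly the threshold question the paragraph above describes, so treat it as a tunable guess.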
Here's the doorway most readers won't expect: the problem may not be the *amount* of thinking but the *shape* of it. Extending a single chain of reasoning inflates variance without improving correctness: the longer it runs, the more chances it has to wander. Splitting the same token budget across several independent reasoning paths and taking a majority vote on the answer yields up to 22% higher accuracy than one long chain (see "Why does parallel reasoning outperform single chain thinking?"). So degradation past the threshold looks less like running out of capability and more like a single trajectory accumulating drift.
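The split-and-vote idea can be sketched as follows. The `generate` callable is a hypothetical stand-in for whatever model call is in use (assumed signature: prompt, token budget, seed, returning a final answer string); nothing here reflects a specific library's API. The sketch divides one budget evenly across `n_paths` independent samples and returns the most common answer along with the fraction of paths that agreed.

```python
from collections import Counter


def vote_over_paths(generate, prompt, total_budget, n_paths):
    """Spend total_budget on n_paths short chains instead of one long chain.

    generate: hypothetical callable (prompt, max_tokens, seed) -> answer string.
    Returns (winning_answer, agreement), where agreement is the share of
    paths that produced the winning answer.
    """
    if n_paths < 1:
        raise ValueError("n_paths must be at least 1")
    per_path = total_budget // n_paths  # same total compute, split evenly

    # Each path gets its own seed so the samples are independent.
    answers = [generate(prompt, per_path, seed) for seed in range(n_paths)]

    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / n_paths


if __name__ == "__main__":
    # Stub model for illustration only: most seeds agree, some drift.
    def fake_generate(prompt, max_tokens, seed):
        return "41" if seed % 3 == 2 else "42"

    print(vote_over_paths(fake_generate, "What is 6 * 7?", 16000, n_paths=5))
```

The agreement fraction doubles as a rough confidence signal: low agreement suggests the prompt is hard enough to deserve more budget, which connects back to adaptive allocation.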
Two deeper framings are worth a closer look. One: you can train the failure out rather than tuning around it. Curriculum budgets that start generous (letting the model explore) and then tighten (forcing it to compress) beat fixed-budget training on both accuracy and token efficiency (see "Does gradually tightening token budgets beat fixed budget training?"). Two: more tokens only help if training taught the model how to *use* them. Reasoning models stay productive with extra budget because training instilled a reasoning protocol; non-reasoning models don't catch up no matter how much inference compute you throw at them (see "Can non-reasoning models catch up with more compute?"). The takeaway the headline number hides: "overthinking" is really a mismatch between how a model was trained to spend tokens and how many it's actually handed.
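A tightening curriculum needs some schedule that maps training progress to a token budget. The source does not say what shape that schedule takes, so the linear decay below is purely an assumption for illustration, as are the function name and the example budgets.

```python
def curriculum_budget(step, total_steps, start_budget, end_budget):
    """Token budget for a training step under a linearly tightening schedule.

    Assumption: linear interpolation from start_budget (generous, exploration)
    down to end_budget (tight, compression). Other shapes are equally plausible.
    """
    if total_steps <= 1:
        return end_budget
    # Clamp so out-of-range steps map to the schedule's endpoints.
    clamped = min(max(step, 0), total_steps - 1)
    frac = clamped / (total_steps - 1)
    return round(start_budget + frac * (end_budget - start_budget))


if __name__ == "__main__":
    schedule = [curriculum_budget(s, 5, 8000, 1000) for s in range(5)]
    print(schedule)  # [8000, 6250, 4500, 2750, 1000]
```

A fixed-budget baseline corresponds to setting `start_budget == end_budget`, which makes the two training regimes easy to compare under otherwise identical settings.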
Sources (6 notes)
Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.
The overthinking threshold depends on task difficulty, model training, and domain, but remains invisible until crossed. Recent work suggests difficulty estimators and runtime confidence signals can detect thresholds dynamically.
Research shows inference effectiveness varies dramatically by prompt difficulty. Reallocating the same total compute adaptively—giving easy prompts less and hard ones more—substantially outperforms larger models under uniform budgets.
Multiple independent reasoning paths with majority voting achieve up to 22% higher accuracy than extending a single chain under the same token budget. Parallel diversity samples reasoning capability more faithfully than sequential extension, which inflates variance without improving correctness.
Models trained with progressively tightening token budgets consistently achieve higher accuracy and better token efficiency than fixed-budget baselines. The approach works by separating learning into exploration (discovering strategies with generous budgets) and compression (distilling them under constraints).
Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.