INQUIRING LINE

Can budget-tightening curricula improve reasoning efficiency more than fixed budgets?

This explores whether training a model on a schedule of shrinking token budgets (generous first, then progressively tighter) buys better reasoning efficiency than just training under one fixed budget — and why that staging helps.


The corpus answers directly: yes, and the reason is that learning to reason well and learning to reason cheaply are two different jobs. Models trained with progressively tightening token budgets reach higher accuracy *and* better token efficiency than fixed-budget baselines, because the curriculum splits training into an exploration phase (discover strategies while budgets are generous) and a compression phase (distill those strategies once the budget clamps down); see "Does gradually tightening token budgets beat fixed budget training?" under Sources. A fixed budget forces both jobs to happen at once, and that is its disadvantage.
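The two-phase split can be sketched as a budget schedule. This is a minimal illustration, not the method from the cited note: the source does not say what shape the schedule takes, so the flat-then-linear decay, the `start_budget` and `end_budget` values, the `explore_frac` split, and the all-or-nothing `budgeted_reward` are all assumptions chosen for clarity.

```python
def budget_at(step: int, total_steps: int,
              start_budget: int = 8000, end_budget: int = 1000,
              explore_frac: float = 0.5) -> int:
    """Token budget for one training step (hypothetical schedule).

    Exploration phase: the budget stays at start_budget.
    Compression phase: the budget decays linearly to end_budget.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    explore_steps = int(total_steps * explore_frac)
    if step < explore_steps:
        return start_budget
    # Number of intervals between the first and last compression step.
    compress_intervals = max(total_steps - explore_steps - 1, 1)
    progress = min((step - explore_steps) / compress_intervals, 1.0)
    return round(start_budget + progress * (end_budget - start_budget))


def budgeted_reward(correct: bool, tokens_used: int, budget: int) -> float:
    """Assumed reward: credit only for a correct answer within the budget."""
    return 1.0 if correct and tokens_used <= budget else 0.0


if __name__ == "__main__":
    for step in range(10):
        print(step, budget_at(step, total_steps=10))
```

A fixed-budget baseline corresponds to `start_budget == end_budget`, in which case the model has to discover strategies and compress them under the same constraint from the first step.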

Why does compressing late work at all? Because more thinking is not free upside. Accuracy is non-monotonic in thinking length: pushing one model from ~1,100 to ~16K thinking tokens dropped accuracy from 87.3% to 70.3%, as it overthought easy problems and underthought hard ones (see "Does more thinking time always improve reasoning accuracy?"). So there's genuine slack to cut: a tightening curriculum exploits the fact that the generous-budget version was partly wasting tokens rather than using them.

The more interesting question is whether the efficiency comes from the *budget schedule itself* or from training structure more broadly, and the corpus leans toward the latter. Reasoning models keep beating non-reasoning ones at any inference budget because training installs a protocol that makes extra tokens productive; the gap is about how reasoning was trained in, not raw compute at deploy time (see "Can non-reasoning models catch up with more compute?"). In the same spirit, RL training flips extended thinking from counterproductive self-doubt into useful gap analysis: training mediates the *quality* of reasoning, not just its quantity (see "Does extended thinking help or hurt model reasoning?"). A budget curriculum is one lever within that broader truth: it shapes when and how the model learns to spend tokens rather than adding capability.

There's a cheaper rival worth knowing about. If you only want brevity, you may not need a curriculum, or any retraining, at all. Verbose and concise chains of thought turn out to occupy distinct linear regions of activation space, and a single steering vector extracted from 50 paired examples cut chain-of-thought length by 67% with a 2.73x speedup and no accuracy loss (see "Can we steer reasoning toward brevity without retraining?"). That reframes the original question: a tightening curriculum earns its cost when you want the model to genuinely *learn* a more efficient reasoning policy, whereas inference-time steering buys compression off the shelf when you just want shorter output now.
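To make the steering idea concrete, here is a toy sketch of one common way to build a steering vector: take the difference between the mean activation of concise examples and the mean activation of verbose ones, then add a scaled copy to a hidden state at inference time. Whether the cited work extracts its vector exactly this way is an assumption; the function names, the scaling factor `alpha`, and the tiny two-dimensional "activations" are hypothetical stand-ins for real model internals.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Column-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(verbose_acts: Sequence[Sequence[float]],
                    concise_acts: Sequence[Sequence[float]]) -> Vector:
    """Difference of means, pointing from 'verbose' toward 'concise'."""
    if not verbose_acts or len(verbose_acts) != len(concise_acts):
        raise ValueError("need the same non-zero number of paired examples")
    verbose_mean = mean_vector(verbose_acts)
    concise_mean = mean_vector(concise_acts)
    return [c - v for c, v in zip(concise_mean, verbose_mean)]


def steer(hidden: Sequence[float], vector: Sequence[float],
          alpha: float = 1.0) -> Vector:
    """Shift a hidden state along the steering direction by alpha."""
    return [h + alpha * d for h, d in zip(hidden, vector)]


if __name__ == "__main__":
    verbose = [[1.0, 0.0], [3.0, 0.0]]   # toy activations, not real data
    concise = [[0.0, 2.0], [0.0, 4.0]]
    direction = steering_vector(verbose, concise)
    print(direction)
    print(steer([1.0, 1.0], direction, alpha=0.5))
```

Nothing here touches the model weights, which is the point of the contrast with a training curriculum: the vector is computed once from paired examples and applied at inference.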

One caution the corpus adds: efficiency gains measured on final accuracy can hide reasoning damage. Supervised fine-tuning raised benchmark scores while cutting Information Gain, the corpus's measure of intermediate-step quality, by 38.9%, producing right answers via post-hoc rationalization that standard metrics miss (see "Does supervised fine-tuning improve reasoning or just answers?"). So if you adopt budget-tightening, the success test isn't just "same accuracy, fewer tokens"; it's whether the compressed reasoning is still doing real inferential work underneath.
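That stricter success test can be written down as a three-part check. This is a sketch under stated assumptions: `EvalResult`, its `step_quality` field (standing in for a measure such as Information Gain), and both tolerance values are hypothetical, and the source does not prescribe any thresholds.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    accuracy: float      # fraction of final answers that are correct
    mean_tokens: float   # average reasoning tokens per problem
    step_quality: float  # hypothetical score for intermediate-step quality


def compression_is_sound(baseline: EvalResult, compressed: EvalResult,
                         acc_tol: float = 0.01,
                         quality_tol: float = 0.05) -> bool:
    """True only if tokens drop while accuracy AND step quality hold up."""
    keeps_accuracy = compressed.accuracy >= baseline.accuracy - acc_tol
    saves_tokens = compressed.mean_tokens < baseline.mean_tokens
    keeps_quality = (compressed.step_quality
                     >= baseline.step_quality * (1 - quality_tol))
    return keeps_accuracy and saves_tokens and keeps_quality


if __name__ == "__main__":
    base = EvalResult(accuracy=0.80, mean_tokens=4000, step_quality=1.0)
    hollow = EvalResult(accuracy=0.82, mean_tokens=1500, step_quality=0.611)
    print(compression_is_sound(base, hollow))  # accuracy rose, quality fell
```

The `hollow` case mirrors the pattern described above: the final score improves and tokens fall, yet step quality is down 38.9%, so the check rejects it.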


Sources (6 notes)

Does gradually tightening token budgets beat fixed budget training?

Models trained with progressively tightening token budgets consistently achieve higher accuracy and better token efficiency than fixed-budget baselines. The approach works by separating learning into exploration (discovering strategies with generous budgets) and compression (distilling them under constraints).

Does more thinking time always improve reasoning accuracy?

Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.

Does extended thinking help or hurt model reasoning?

Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.

Can we steer reasoning toward brevity without retraining?

Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.

Does supervised fine-tuning improve reasoning or just answers?

Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning-efficiency researcher. The question remains open: **Can budget-tightening curricula improve reasoning efficiency more than fixed budgets?** Treat the findings below as dated claims to be re-tested against newer models, methods, and evaluations — not as settled fact.

**What a curated library found — and when (2024–2025, claims now ~6–12 months old):**
- Budget-tightening curricula (generous early, tight late) outperform fixed-budget training on both accuracy and token efficiency, splitting learning into exploration then compression phases (~2025).
- Reasoning accuracy is non-monotonic in thinking tokens: one model dropped from 87.3% to 70.3% when pushed from 1.1K to 16K tokens, showing genuine slack to cut (~2025).
- Inference-time activation steering (extracting one vector from 50 examples) achieves 67% chain-of-thought compression with 2.73x speedup and no accuracy loss—a cheaper alternative to retraining (~2025).
- Supervised fine-tuning raised benchmark scores while degrading intermediate reasoning quality by 38.9%, masking post-hoc rationalization as real inference (~2025).
- Test-time scaling (more tokens) helps reasoning models but not reliably; the relationship is non-monotonic and depends on problem structure (~2025).

**Anchor papers (verify; mind their dates):**
- arXiv:2506.04210 (Jun 2025): Does Thinking More always Help? Understanding Test-Time Scaling
- arXiv:2507.04742 (Jul 2025): Activation Steering for Chain-of-Thought Compression
- arXiv:2510.07364 (Oct 2025): Base Models Know How to Reason, Thinking Models Learn When
- arXiv:2503.24235 (Mar 2025): A Survey on Test-Time Scaling in LLMs

**Your task:**
(1) **RE-TEST EACH CONSTRAINT.** For every finding above, judge whether newer reasoning models (o1-pro, Claude-4x, GPT-4-Turbo), RL training pipelines, inference harnesses (batching, KV caching, speculative decoding), or evaluations have since relaxed or overturned it. Separate the durable question (likely: *can we learn to reason cheaper?*) from the perishable limitation (possibly: *fixed budgets cannot; curricula must*). What method or model released after Oct 2025 tested or contradicted each claim?
(2) **Surface the strongest contradicting or superseding work** from the last ~6 months. Has any paper shown curricula deliver *no* advantage, or steering beats training, or the non-monotonicity disappears?
(3) **Propose 2 research questions that assume the regime may have moved:** e.g., *Do reasoning models trained post-2025 exhibit the same compression slack?* *Can RL + curriculum beat curriculum alone?*

Cite arXiv IDs; flag anything you cannot ground in a real paper.
