INQUIRING LINE

How does scaling reasoning capability actually reduce instruction-following ability?

This explores the specific mechanism behind a counterintuitive trade-off: as you train a model to reason harder, it gets worse at following the instructions you actually gave it.


This line of inquiry asks why making a model better at reasoning seems to make it worse at doing what you tell it, and what the corpus says is actually going on under the hood. The core finding is blunt: training for reasoning depth, via both supervised fine-tuning and reinforcement learning, measurably erodes instruction adherence, with advanced reasoning models obeying constraints only about half the time (50.71%) during math tasks (see: Why do more capable reasoning models ignore your instructions?; Why do better reasoning models ignore instructions?). The proposed mechanism isn't mysterious: the longer the chain-of-thought grows, the more 'contextual distance' opens up between the original instruction and the place where the model is now generating tokens. The instruction simply gets diluted: it falls out of effective attention as the reasoning trace piles up between it and the answer.
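To make the dilution intuition concrete, here is a deliberately crude toy, not the mechanism any of the cited notes measured. It assumes attention is spread uniformly over every token in context, which real models do not do, and asks what share would land on the instruction as the reasoning trace grows. The function name and the token counts are illustrative assumptions.

```python
def instruction_attention_share(instruction_tokens: int, trace_tokens: int) -> float:
    """Share of a uniform attention budget that falls on the instruction.

    Toy assumption: every token in context gets equal weight. Real attention
    is learned and non-uniform, so this only illustrates the direction of
    the effect, not its size.
    """
    if instruction_tokens <= 0 or trace_tokens < 0:
        raise ValueError("need instruction_tokens > 0 and trace_tokens >= 0")
    return instruction_tokens / (instruction_tokens + trace_tokens)


if __name__ == "__main__":
    # A hypothetical 50-token instruction followed by ever-longer traces.
    for trace_len in (0, 450, 4950, 49950):
        share = instruction_attention_share(50, trace_len)
        print(f"trace={trace_len:>6} tokens -> instruction share={share:.3f}")
```

Under this assumption a 50-token instruction holds a tenth of the budget after 450 tokens of reasoning and a thousandth after roughly 50,000. The point is only that the share shrinks monotonically with trace length, which is the shape of the dilution claim.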

That dilution story gets sharper when you put it next to a separate finding that reasoning degrades with input length far below the context window limit: accuracy drops from 92% to 68% with just 3000 tokens of padding, an effect that's task-agnostic and persists even with chain-of-thought prompting (see: Does reasoning ability actually degrade with longer inputs?). Read together, these suggest the culprit isn't 'reasoning' as a faculty competing with 'obedience'. It's that long generated reasoning behaves like long input. The model's grip on early-context information (your instructions) weakens as the relevant tokens get buried, whether that burial comes from a long prompt or a long self-generated thought.

Here's the thing that should reframe the whole question: a cluster of notes argues the reasoning you're scaling may not be a new capability at all, but constrained imitation of reasoning's *form*. Chain-of-thought reproduces familiar reasoning schemata from training rather than performing genuine inference, which is why it fails predictably under distribution shift (see: Does chain-of-thought reasoning reveal genuine inference or pattern matching?; Does chain-of-thought reasoning actually generalize beyond training data?; Why does chain-of-thought reasoning fail in predictable ways?). If reasoning is pattern-matching a learned output structure, then training for 'more reasoning' is really training the model to commit harder to producing a particular *shape* of output, and an instruction that cuts against that shape ("answer in one word," "don't show work") is now competing with a strongly reinforced format habit. This rhymes with the finding that instruction tuning teaches output-format distribution rather than task understanding in the first place (see: Does instruction tuning teach task understanding or output format?). If both instruction-following and reasoning are fundamentally about learned output shapes, scaling one shape at the expense of another is exactly what you'd expect.
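A format instruction like "answer in one word" is easy to check mechanically, which is roughly how an adherence percentage gets computed. The sketch below is a minimal illustration of that idea, not the scoring code of MathIF or any other benchmark. The constraint functions, the `adherence_rate` helper, and the sample responses are all hypothetical.

```python
from typing import Callable, List, Tuple

# A constraint is any function that inspects a response and says pass/fail.
Constraint = Callable[[str], bool]


def one_word(response: str) -> bool:
    """Passes only if the response is exactly one whitespace-delimited word."""
    return len(response.split()) == 1


def max_words(limit: int) -> Constraint:
    """Builds a constraint that passes responses of at most `limit` words."""
    def check(response: str) -> bool:
        return len(response.split()) <= limit
    return check


def adherence_rate(samples: List[Tuple[str, Constraint]]) -> float:
    """Fraction of (response, constraint) pairs where the constraint holds."""
    if not samples:
        raise ValueError("samples must not be empty")
    passed = sum(1 for response, check in samples if check(response))
    return passed / len(samples)


# Made-up responses for illustration; not model outputs from any benchmark.
SAMPLES: List[Tuple[str, Constraint]] = [
    ("42", one_word),
    ("Let me think step by step. The answer is 42.", one_word),
    ("The answer is 42", max_words(5)),
    ("First expand the square, then collect terms, so x is 42", max_words(5)),
]

if __name__ == "__main__":
    print(f"adherence: {adherence_rate(SAMPLES):.2f}")
```

Note that the second and fourth samples reach the right number and still fail: the reasoning-shaped output is exactly what violates the constraint, which is the competition between habit and instruction described above.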

The corpus also points at fixes that quietly confirm the diagnosis. If the damage comes from long traces diluting context and from training overwriting prior behavior, then the cures are shorter traces and isolation. Activation steering can cut chain-of-thought length by 67% while holding accuracy (see: Can we steer reasoning toward brevity without retraining?): shorter traces, less dilution. Freezing the backbone and delegating thought generation to a small auxiliary model preserves the original capabilities instead of training over them (see: Can continuous reasoning avoid forgetting in instruction-tuned models?). And scaling reasoning in *width*, with parallel trajectories rather than ever-deeper serial chains, sidesteps the depth cost entirely (see: Can reasoning systems scale wider instead of only deeper?). Each one attacks the same root: the deficit isn't a law of nature, it's a side effect of how we currently train depth-first reasoning into the same weights that hold instruction-following.
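The 67% figure can be sanity-checked against the 2.73x speedup quoted for the same method in the source notes. The sketch below is back-of-envelope arithmetic under an assumed cost model (a fixed part plus a part proportional to trace length); neither function comes from the cited work, and the notes do not say where the gap between ideal and reported speedup comes from.

```python
def ideal_speedup(length_reduction: float) -> float:
    """Speedup if runtime were exactly proportional to generated tokens."""
    if not 0.0 <= length_reduction < 1.0:
        raise ValueError("length_reduction must be in [0, 1)")
    return 1.0 / (1.0 - length_reduction)


def implied_fixed_fraction(length_reduction: float, observed_speedup: float) -> float:
    """Fraction of the original runtime that did not shrink with the trace.

    Assumed model: runtime = fixed + proportional_to_length, normalized so the
    original runtime is 1. Then
        1 / observed_speedup = (1 - length_reduction) + fixed * length_reduction
    and we solve for `fixed`.
    """
    if not 0.0 < length_reduction < 1.0 or observed_speedup <= 0.0:
        raise ValueError("need 0 < length_reduction < 1 and observed_speedup > 0")
    return (1.0 / observed_speedup - (1.0 - length_reduction)) / length_reduction


if __name__ == "__main__":
    reduction, reported = 0.67, 2.73  # both figures are from the source notes
    print(f"ideal speedup:  {ideal_speedup(reduction):.2f}x")
    print(f"fixed fraction: {implied_fixed_fraction(reduction, reported):.3f}")
```

A 67% shorter trace would give about a 3.0x speedup if time scaled purely with tokens, so the reported 2.73x is consistent with roughly 5% of the original runtime not shrinking at all. That reading is a guess under the stated model, not a result from the notes.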

The deeper surprise to walk away with: there may be no fundamental trade-off between being smart and being obedient. The conflict looks more like an artifact of training reasoning as one long serial trace in a single set of weights, and base models appear to already carry latent reasoning that minimal training merely *elicits* rather than creates (see: Do base models already contain hidden reasoning ability?). If reasoning is selected, not built, then so is the instruction-following it appears to crowd out, and the crowding is a design choice, not a ceiling.


Sources (11 notes)

Why do more capable reasoning models ignore your instructions?

Advanced reasoning models achieve only 50.71% instruction adherence during mathematical reasoning. Training for reasoning depth actively worsens instruction compliance, suggesting a fundamental trade-off between reasoning power and controllability.

Why do better reasoning models ignore instructions?

The MathIF benchmark shows that SFT and RL training improve reasoning but reduce instruction adherence, particularly as chain-of-thought length increases. Longer reasoning chains create contextual distance that dilutes the model's attention to original instructions.

Does reasoning ability actually degrade with longer inputs?

FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.

Does chain-of-thought reasoning reveal genuine inference or pattern matching?

CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Why does chain-of-thought reasoning fail in predictable ways?

CoT guides models to pattern-match reasoning structure rather than perform genuine inference. This explains distribution-bounded failures, why structural coherence matters more than content correctness, and why performance optimizes against interpretability.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions (43% versus a random baseline of 42.6%). The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Can we steer reasoning toward brevity without retraining?

Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.

Can continuous reasoning avoid forgetting in instruction-tuned models?

SoftCoT avoids catastrophic forgetting by keeping the main LLM frozen while delegating soft thought generation to a small auxiliary model. This architectural separation maintains pre-trained knowledge while enabling continuous reasoning.

Can reasoning systems scale wider instead of only deeper?

GRAM shows that stochastic latent transitions enabling parallel trajectory sampling sidestep the serial latency cost of depth-only scaling. Width matches token-level parallelism benefits: independent paths sample the solution space without variance inflation.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst updating an empirical claim about instruction-following and reasoning in LLMs. The question remains open: does scaling reasoning *inherently* degrade instruction adherence, or is the observed deficit a training artifact?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat each as time-bound:
- Reasoning models obey constraints only ~50% of the time on math tasks; proposed dilution mechanism: a growing chain-of-thought pushes the instruction out of the effective attention window (2025-05, 2025-02).
- Input length degrades reasoning accuracy from 92% to 68% with just 3000 tokens of padding, far below context limits and independent of task (2024-02).
- Chain-of-thought is constrained imitation of reasoning form, not genuine inference; fails predictably under distribution shift (2025-06, 2025-08).
- Instruction tuning teaches output-format distribution, not task semantics (2023-05).
- Activation steering reduces CoT length by 67% while preserving accuracy; a frozen backbone with a small auxiliary thought-generating model preserves prior capabilities; width-scaling (parallel trajectories) avoids the depth cost (2025-02, 2025-07).

Anchor papers (verify; mind their dates):
- arXiv:2305.11383 (2023): instruction tuning and output distributions
- arXiv:2505.14810 (2025-05): reasoning models' instruction-following deficit, quantified
- arXiv:2506.02878 (2025-06): CoT as imitation, not reasoning
- arXiv:2507.04742 (2025-07): activation steering for CoT compression

Your task:
(1) RE-TEST EACH CONSTRAINT: For each mechanism (attention dilution, format overwriting, distribution shift), assess whether newer model scales, instruction-following methods (e.g., Constitutional AI, process reward models), multi-agent orchestration, or test-time compute have relaxed it. Separate the durable question (trade-off real or an artifact?) from perishable limits (specific training regimes). Cite what would break each claim.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Has any 2025-H2 or 2026 paper shown reasoning and obedience *co-scaling*, or shown the dilution story doesn't hold under new training setups?
(3) Propose 2 research questions that assume the regime has moved: e.g., "Can scaling reasoning *width* on frozen backbones restore instruction-following without auxiliary models?" or "Does test-time compute (latent reasoning at inference) escape the format-imitation bind?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.

Next inquiring lines