INQUIRING LINE

Reasoning, Retrieval, and Evaluation · Model Architecture and Internals · Training, RL, and Test-Time Scaling · cross-cluster

Does adding reasoning to models degrade other capabilities like rule inference?

This explores whether bolting chain-of-thought reasoning onto a model can actively make it worse at certain tasks — specifically inductive rule inference, where you learn a rule from examples including exceptions.

This reads the question as: does the reasoning machinery itself impose a tax, so that a 'reasoning' model can be beaten by a plain one on some tasks? The corpus says yes, and the cleanest evidence is exactly about rule inference. Across four game-based tasks, reasoning models scored below 25% on exception-based inductive rules while non-reasoning models hit 55–65%, a large reversal (see "Why do reasoning models fail at exception-based rule inference?"). The mechanism is telling: chain-of-thought introduced math overuse, overgeneralization, and hallucinated constraints. The model talks itself into a tidy general rule and then can't recognize the negative evidence that says 'except in this case.' The very habit that helps on math, building and committing to structured chains, is what poisons a task that rewards staying open to exceptions.
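To make the failure mode concrete, here is a minimal sketch of an exception-based rule task. The rule ("even numbers are accepted, except multiples of 10"), the data in `EXAMPLES`, and the helper names `general_rule`, `rule_with_exception`, `accuracy`, and `misses` are all hypothetical illustrations, not the game-based tasks from the cited note. The sketch only shows how a tidy general rule can fit every positive example and still fail on exactly the negative evidence that encodes the exception.

```python
from typing import Callable, List, Tuple

# Hypothetical labeled examples: (number, accepted?).
# Assumed hidden rule: even numbers are accepted, EXCEPT multiples of 10.
EXAMPLES: List[Tuple[int, bool]] = [
    (2, True), (4, True), (7, False), (10, False),
    (12, True), (15, False), (20, False), (26, True),
]

Rule = Callable[[int], bool]


def general_rule(n: int) -> bool:
    """Overgeneralized hypothesis: 'even numbers are accepted'."""
    return n % 2 == 0


def rule_with_exception(n: int) -> bool:
    """Hypothesis that keeps the exception implied by the negative examples."""
    return n % 2 == 0 and n % 10 != 0


def accuracy(rule: Rule, examples: List[Tuple[int, bool]]) -> float:
    """Fraction of examples whose label the rule reproduces."""
    correct = sum(rule(n) == label for n, label in examples)
    return correct / len(examples)


def misses(rule: Rule, examples: List[Tuple[int, bool]]) -> List[int]:
    """Inputs on which the rule disagrees with the observed label."""
    return [n for n, label in examples if rule(n) != label]


if __name__ == "__main__":
    for rule in (general_rule, rule_with_exception):
        print(rule.__name__, accuracy(rule, EXAMPLES), misses(rule, EXAMPLES))
```

In this toy setup the general rule is wrong only on the two inputs that carry the exception, which is the kind of negative evidence the note says reasoning models fail to register.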

Why would reasoning hurt here? Several notes point to the same underlying fragility. LLMs largely reason through semantic association rather than formal logic, so when you decouple the semantics from the rule, performance collapses even with the correct rule sitting in context (see "Do large language models reason symbolically or semantically?"). A longer reasoning chain doesn't fix this; it gives the model more room to lean on commonsense priors that overwrite the actual rule. Related work finds reasoning models 'wander' and 'underthink,' exploring invalid paths and abandoning good ones prematurely (see "Why do reasoning models abandon promising solution paths?"). More generated reasoning means more surface area for these structural errors to compound.
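One way to picture "decoupling the semantics from the rule" is to swap every content word for an arbitrary symbol while leaving the logical structure untouched. The sketch below is an assumption about what such a transformation could look like; the function `decouple_semantics`, the `sym0`-style placeholders, and the example sentence are hypothetical and are not taken from the cited note.

```python
import re
from typing import Dict, List, Tuple


def decouple_semantics(text: str, vocabulary: List[str]) -> Tuple[str, Dict[str, str]]:
    """Consistently replace each listed content word with an opaque symbol.

    The logical form of the text is preserved; only the familiar
    word meanings (and the priors attached to them) are removed.
    """
    if not vocabulary:
        return text, {}
    mapping = {word: f"sym{i}" for i, word in enumerate(vocabulary)}
    # Match whole words only, so "bird" does not rewrite part of "birdcage".
    pattern = re.compile(r"\b(" + "|".join(re.escape(w) for w in vocabulary) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text), mapping


if __name__ == "__main__":
    rule = "If X is a bird then X can fly. Tweety is a bird."
    abstract, mapping = decouple_semantics(rule, ["bird", "fly", "Tweety"])
    print(abstract)
    print(mapping)
```

A solver that manipulates the rule formally should treat both versions identically; a solver leaning on commonsense associations loses its footing on the symbolic one.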

There's a deeper reframe worth knowing: the reasoning these models 'add' may not be new capability at all, just a deployment habit. Base models already contain latent reasoning that minimal training elicits rather than creates (see "Do base models already contain hidden reasoning ability?"), and RL post-training appears to teach *when* to reason, not *how* (see "Does RL post-training create reasoning or just deploy it?"). If post-training mostly installs a reflex to deploy long chains, then on a task where long chains are the wrong move, that reflex misfires: the model has been trained to over-apply a tool. This squares with the finding that a small model can match larger RL-trained models by learning output *format* alone (see "Can small models reason well by just learning output format?"): reasoning is partly a behavioral style, and styles can be applied where they don't belong.

The useful nuance is that not every apparent reasoning failure is a degradation of capability. Some collapses are execution failures, where the model knows the algorithm but can't run enough steps in text, and giving it tools dissolves the supposed cliff (see "Are reasoning model collapses really failures of reasoning?"). Others are about novelty: models fit instance-level patterns rather than general algorithms, so they fail on unfamiliar instances regardless of how much they 'reason' (see "Do language models fail at reasoning due to complexity or novelty?"). Rule inference with exceptions sits right at that sore spot: it demands genuine generalization from sparse, partly contradictory examples, which is precisely where pattern-fitting dressed up as reasoning breaks down.
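The gap between knowing an algorithm and executing it in text can be illustrated with a puzzle whose procedure is tiny but whose output grows exponentially. Tower of Hanoi is used here purely as an illustrative assumption; the cited note does not name a specific task, and the function `hanoi` and its peg labels are hypothetical.

```python
from typing import List, Tuple

Move = Tuple[str, str]


def hanoi(n: int, src: str = "A", dst: str = "C", aux: str = "B") -> List[Move]:
    """Return the full move list for n disks using the standard recursion."""
    if n == 0:
        return []
    # Move n-1 disks out of the way, move the largest disk, then restack.
    return hanoi(n - 1, src, aux, dst) + [(src, dst)] + hanoi(n - 1, aux, dst, src)


if __name__ == "__main__":
    for n in (3, 10, 15):
        # The procedure stays the same few lines; the move count is 2**n - 1.
        print(n, len(hanoi(n)))
```

The recursion fits in a handful of lines, yet writing out every move by hand means emitting 2**n - 1 steps without a single slip. A model that can call a tool like this only has to state the procedure, which is consistent with the note's reading of the bottleneck as execution bandwidth rather than reasoning.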

So the takeaway you didn't know to ask for: 'reasoning' isn't a strict upgrade you can leave switched on. It's a learned disposition to elaborate, and on tasks that reward restraint and sensitivity to counterexamples, that disposition is a measurable liability — the model reasons itself past the right answer.

Sources (8 notes)

Why do reasoning models fail at exception-based rule inference?

Across four game-based tasks, reasoning models scored below 25% on exception rules versus 55–65% for non-reasoning models. Chain-of-thought introduces math overuse, overgeneralization, and hallucinated constraints that amplify errors in negative evidence recognition.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Why do reasoning models abandon promising solution paths?

Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than creating capability. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.

Can small models reason well by just learning output format?

A 1.5B parameter model with LoRA-only post-training matched larger full-parameter RL models on reasoning tasks, suggesting RL teaches output format organization rather than new factual knowledge. This efficiency indicates reasoning and knowledge storage are separable capabilities.

Are reasoning model collapses really failures of reasoning?

Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.

Do language models fail at reasoning due to complexity or novelty?

Large reasoning models (LRMs) don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so a reasoning chain of any length can succeed when similar instances appeared in training, and fails on unfamiliar ones.
