Can external classifiers reliably decide when a model should reason?
This note asks whether a *separate* classifier, sitting outside the model, can reliably judge when a model needs to engage extended reasoning versus answer quickly. The corpus leans toward 'not as well as signals read from inside the model itself.'
This question reads as: can you bolt an external decision-maker onto a model to gate when it reasons? The corpus is skeptical of the external part — not because routing is a bad idea, but because the most reliable signals about whether reasoning is needed seem to live *inside* the model, not in a classifier looking at it from the outside.
The clearest counter-evidence comes from work showing models can learn the when-to-reason decision themselves. Thinkless trains a single model to route between extended thinking and direct answers using decoupled reinforcement learning, with no explicit difficulty labels: the routing is self-calibrated rather than handed down by an external judge (Can models learn when to think versus respond quickly?). That matters because an external classifier has to predict difficulty from the surface of a problem, and the corpus suggests difficulty is exactly the thing that's hard to predict from outside. Reasoning failures track instance-level *novelty*, not task complexity, so two problems that look equally hard to a classifier can behave completely differently (Do language models fail at reasoning due to complexity or novelty?).
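One way to picture why decoupling the mode decision from the answer might matter, offered as a sketch under stated assumptions and not as the Thinkless implementation: suppose each sampled response is one mode token followed by T answer tokens, and the training loss averages per-token terms. If the mode token shares a normalizer with the answer tokens, its weight is 1/(T+1) and shrinks as responses get longer. Giving it a separate coefficient removes that dependence on length. The function names and the `alpha` coefficient are hypothetical.

```python
def shared_normalizer_weights(num_answer_tokens: int) -> tuple[float, float]:
    """Per-token weights when the mode token and answer tokens share one average."""
    total = num_answer_tokens + 1  # one mode token plus the answer tokens
    return 1.0 / total, 1.0 / total  # (mode-token weight, per-answer-token weight)


def decoupled_weights(num_answer_tokens: int, alpha: float = 1.0) -> tuple[float, float]:
    """Per-token weights when the mode token gets its own coefficient.

    `alpha` is an assumed, illustrative hyperparameter.
    """
    return alpha, 1.0 / num_answer_tokens


# Compare a short direct answer with a long reasoning trace.
for length in (20, 2000):
    shared_mode, _ = shared_normalizer_weights(length)
    split_mode, _ = decoupled_weights(length)
    print(f"T={length}: shared mode weight={shared_mode:.5f}, decoupled={split_mode:.5f}")
```

Under the shared normalizer, the signal reaching the mode decision depends on which mode was chosen. In the decoupled variant it does not.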
There's also a verdict on classifiers as a category, though it comes from an adjacent task. When researchers compared classifier-style process reward models against generative judges that actually reason about the reasoning, the generative ones won, with better accuracy from orders of magnitude less training data (Can judges that reason about reasoning outperform classifier rewards?). The lesson plausibly generalizes: a discriminative classifier that just emits a label is weaker than something that engages with the content, and an external 'should-it-reason' gate is a discriminative classifier by another name.
Where the corpus *does* point is toward internal signals as the gating mechanism. A model's own answer-span confidence can rank reasoning quality well enough to serve as a reward (Can model confidence work as a reward signal for reasoning?), and the deep-thinking ratio, the share of tokens whose predictions are significantly revised across layers, correlates with accuracy and can be used at test time to decide how much to think (Can we measure how deeply a model actually reasons?). Both read effort from the model's internals rather than guessing from the prompt. There's even evidence the capability is already latent and just needs eliciting, which reframes the job from 'classify hard vs. easy' to 'unlock what's already there' (Do base models already contain hidden reasoning ability?).
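The two internal signals above can be sketched roughly as follows. This is a simplified illustration, not the definitions used in the cited work. It assumes you already have per-token log-probabilities for the answer span and a per-layer top-1 prediction for each position (for example from reading out intermediate hidden states). The geometric-mean aggregation, the "settles late" criterion, and the `late_fraction` threshold are all assumptions made for the example.

```python
import math


def answer_span_confidence(answer_logprobs: list[float]) -> float:
    """Geometric-mean probability of the answer tokens (assumed aggregation)."""
    return math.exp(sum(answer_logprobs) / len(answer_logprobs))


def deep_thinking_ratio(layer_predictions: list[list[int]], late_fraction: float = 0.5) -> float:
    """Share of tokens whose top-1 prediction is still changing late in the stack.

    layer_predictions[t][l] is the top-1 token id read out at layer l for
    position t (at least two layers per position). A token counts as
    "revised" if it only settles on its final-layer prediction after
    `late_fraction` of the depth.
    """
    revised = 0
    for per_layer in layer_predictions:
        final = per_layer[-1]
        settle = len(per_layer) - 1
        # Walk backwards to the earliest layer from which the prediction stays final.
        while settle > 0 and per_layer[settle - 1] == final:
            settle -= 1
        if settle / (len(per_layer) - 1) > late_fraction:
            revised += 1
    return revised / len(layer_predictions)


# Toy usage: rank two hypothetical traces by answer-span confidence.
traces = {"trace_a": [-0.1, -0.2], "trace_b": [-1.5, -0.9]}
ranked = sorted(traces, key=lambda name: answer_span_confidence(traces[name]), reverse=True)
print(ranked)
```

Neither function looks at the prompt. Both read quantities the model produces while generating, which is the contrast with an external classifier drawn above.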
The sharpest warning is that surface behavior lies. Twelve of fourteen models score *worse* when constraints are removed, by up to 38.5 percentage points: they look like they're reasoning but are really defaulting conservatively (Are models actually reasoning about constraints or just defaulting conservatively?). An external classifier trained on observed outputs would readily learn that bias instead of detecting genuine reasoning need. Combined with findings that apparent reasoning collapses are often execution-bandwidth failures rather than reasoning failures (Are reasoning model collapses really failures of reasoning?), the picture is that 'when should a model reason' isn't cleanly readable from outside at all. The thing you'd want to gate on is internal, dynamic, and easy to mistake for its conservative imitation.
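The constraint-removal comparison described above comes down to a difference in accuracy between two conditions, reported in percentage points. A minimal sketch follows. The outcome lists are made-up toy data, not results from the cited study, and the function names are hypothetical.

```python
def accuracy(outcomes: list[bool]) -> float:
    """Fraction of items answered correctly."""
    return sum(outcomes) / len(outcomes)


def constraint_removal_drop(with_constraints: list[bool], without_constraints: list[bool]) -> float:
    """Accuracy drop, in percentage points, when constraints are removed."""
    return 100.0 * (accuracy(with_constraints) - accuracy(without_constraints))


# Toy data: the same hypothetical model evaluated under both conditions.
with_constraints = [True] * 9 + [False] * 1
without_constraints = [True] * 6 + [False] * 4

drop = constraint_removal_drop(with_constraints, without_constraints)
print(f"drop when constraints are removed: {drop:.1f} percentage points")
```

A large positive drop means the score depended on the constraints being present. A gate trained only on the constrained outputs would not see that dependence.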
Sources (8 notes)
Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.
Large reasoning models (LRMs) don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so a reasoning chain of any length succeeds when the model was trained on similar instances.
StepWiser demonstrates that training judges to produce reasoning chains about policy reasoning—rather than classify steps—yields better judgment accuracy and data efficiency. Independent confirmation from GenPRM and ThinkPRM shows generative PRMs outperform discriminative ones with orders of magnitude less training data.
RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.
Deep-thinking ratio (DTR) measures the proportion of tokens whose predictions undergo significant revision across model layers, correlating robustly with accuracy across AIME, HMMT, and GPQA benchmarks. Think@n, a test-time strategy using DTR, matches self-consistency performance while reducing inference costs.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.
Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.