SYNTHESIS NOTE

Why does removing spurious cues sometimes hurt model performance?

Removing a spurious feature usually improves model performance, but in some settings it makes performance worse. This note explores whether that failure is a fundamentally different problem from traditional shortcut learning.

Synthesis note · 2026-05-01 · sourced from Linguistics, NLP, NLU

The literature on shortcut learning describes models that latch onto spurious surface features correlated with labels: lexical-overlap heuristics in NLI, sparse heuristic circuits in arithmetic, content effects in syllogistic reasoning. The standard prescription is to remove the spurious feature: take out the cue, and performance recovers because the model is forced to use the intended computation.

The Heuristic Override Benchmark shows that this prescription does not apply to the phenomenon it measures. Removing the heuristic cue (the distance "50 meters") makes models worse, not better: twelve of fourteen models drop in accuracy. This is the opposite of what shortcut learning predicts, and it signals that something different is happening.
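The cue-removal comparison can be sketched as a small diagnostic. This is a minimal illustration, not the benchmark's own code: the names `AblationResult`, `classify`, and `count_by_pattern`, the tolerance `tol`, and the toy accuracies are all assumptions made here for the example. The idea is only that the sign of the accuracy change after removing the cue separates the two patterns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AblationResult:
    """Accuracy of one model with and without the surface cue (hypothetical container)."""
    model: str
    acc_with_cue: float     # accuracy on the original items
    acc_without_cue: float  # accuracy after the cue is removed


def classify(result: AblationResult, tol: float = 0.01) -> str:
    """Label the ablation pattern by the sign of the accuracy change.

    tol is an assumed noise threshold, not a value taken from the source.
    """
    delta = result.acc_without_cue - result.acc_with_cue
    if delta > tol:
        return "shortcut-like"   # removing the cue helps: the filtering prediction
    if delta < -tol:
        return "override-like"   # removing the cue hurts: the pattern reported above
    return "inconclusive"


def count_by_pattern(results: list[AblationResult], tol: float = 0.01) -> dict[str, int]:
    """Count how many models fall into each pattern."""
    counts = {"shortcut-like": 0, "override-like": 0, "inconclusive": 0}
    for r in results:
        counts[classify(r, tol)] += 1
    return counts


# Made-up numbers for illustration only; these are NOT results from the benchmark.
toy = [
    AblationResult("toy-model-a", 0.40, 0.55),
    AblationResult("toy-model-b", 0.40, 0.25),
    AblationResult("toy-model-c", 0.50, 0.505),
]
print(count_by_pattern(toy))
```

Under this framing, the twelve-of-fourteen finding corresponds to most models landing in the "override-like" bucket, where a shortcut-learning account would predict "shortcut-like".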

The authors locate the difference structurally. Shortcut learning is about filtering: the model needs to ignore the spurious feature and attend to the relevant one. Heuristic override is about composing: the model needs to integrate two things — a salient surface cue and an unstated feasibility constraint — and prioritize the constraint when they conflict. Both signals are integral to the problem; neither is noise. Removing the cue does not clean the input; it removes one of the two ingredients the composition requires, leaving the model less able to make any decision at all.

This connects the failure to the classical frame problem rather than to feature-level shortcut learning. The challenge is enumerating which unstated conditions are relevant — not detecting and filtering distractors. The two failure modes need different benchmarks, different mitigations, and different theoretical accounts.

Original note title

LLM heuristic override is structurally distinct from shortcut learning because removing the spurious cue degrades rather than improves performance