SYNTHESIS NOTE

Why does removing spurious cues sometimes hurt model performance?

The usual expectation is that models improve when spurious features are removed, but in some cases removal makes them worse. This note explores whether that failure is a fundamentally different problem from traditional shortcut learning.

Synthesis note · 2026-05-01 · sourced from Linguistics, NLP, NLU

The literature on shortcut learning describes models that latch onto spurious surface features correlated with labels: lexical-overlap heuristics in NLI, sparse heuristic circuits in arithmetic, content effects in syllogistic reasoning. The standard prescription is to remove the spurious feature. Once the cue is gone, performance recovers because the model is forced to use the intended computation.

The Heuristic Override Benchmark shows that this prescription does not apply to the phenomenon it measures. Removing the heuristic cue (the distance "50 meters") makes models worse, not better: twelve of fourteen models drop in accuracy. This is the opposite of what shortcut learning predicts and signals that something different is happening.
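The cue-removal comparison can be sketched as a per-model accuracy delta. This is a minimal illustration, not the benchmark's evaluation code: the function names, model names, and accuracy values are all hypothetical placeholders, and the numbers are not results from the Heuristic Override Benchmark.

```python
from typing import Dict, Tuple


def cue_removal_deltas(acc: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Map each model to (accuracy without cue) - (accuracy with cue).

    `acc` maps a model name to a pair (with_cue, without_cue).
    A negative delta means removing the cue hurt the model.
    """
    return {name: without - with_cue for name, (with_cue, without) in acc.items()}


def count_degraded(deltas: Dict[str, float]) -> int:
    """Count models whose accuracy dropped after the cue was removed."""
    return sum(1 for d in deltas.values() if d < 0)


# Placeholder values for illustration only (hypothetical, not reported results).
toy_accuracy = {
    "model_a": (0.60, 0.45),
    "model_b": (0.70, 0.72),
    "model_c": (0.55, 0.40),
}

deltas = cue_removal_deltas(toy_accuracy)
degraded = count_degraded(deltas)
print(f"{degraded} of {len(deltas)} models degrade when the cue is removed")
```

Under a shortcut-learning account most deltas would be positive; the pattern the note describes corresponds to most deltas being negative.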

The authors locate the difference structurally. Shortcut learning is about filtering: the model needs to ignore the spurious feature and attend to the relevant one. Heuristic override is about composing: the model needs to integrate two things — a salient surface cue and an unstated feasibility constraint — and prioritize the constraint when they conflict. Both signals are integral to the problem; neither is noise. Removing the cue does not clean the input; it removes one of the two ingredients the composition requires, leaving the model less able to make any decision at all.
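The filtering/composing distinction can be made concrete with a toy decision rule. This is a sketch under strong assumptions, not the authors' formalization: it treats the surface cue as a preference ranking over options and the unstated constraint as a set of feasible options, and every name below is hypothetical.

```python
from typing import List, Optional, Set


def heuristic_only(cue_ranking: Optional[List[str]]) -> Optional[str]:
    """Follow the salient cue and ignore feasibility (the override failure)."""
    return cue_ranking[0] if cue_ranking else None


def compose(cue_ranking: Optional[List[str]], feasible: Set[str]) -> Optional[str]:
    """Integrate both signals, letting the feasibility constraint win on conflict."""
    if cue_ranking is None:
        # Cue removed: the constraint alone decides only if exactly one
        # option is feasible; otherwise this toy rule abstains.
        return next(iter(feasible)) if len(feasible) == 1 else None
    # Take the option the cue prefers most among those the constraint allows.
    for option in cue_ranking:
        if option in feasible:
            return option
    return None


cue_ranking = ["a", "b", "c"]   # the cue favors "a" (assumed toy ranking)
feasible = {"b", "c"}           # the unstated constraint rules out "a"

print(heuristic_only(cue_ranking))     # follows the cue into an infeasible answer
print(compose(cue_ranking, feasible))  # constraint overrides, cue breaks the tie
print(compose(None, feasible))         # cue removed: no basis to choose
```

In this toy setup, deleting the cue does not leave a cleaner problem. It leaves two feasible options with nothing to choose between them, which mirrors the claim that both signals are ingredients rather than noise.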

This connects the failure to the classical frame problem rather than to feature-level shortcut learning. The challenge is enumerating which unstated conditions are relevant — not detecting and filtering distractors. The two failure modes need different benchmarks, different mitigations, and different theoretical accounts.

Inquiring lines that read this note 23

This note is a source for these research framings, grouped by the broader line of inquiry each explores. Scan the bold lines of inquiry; follow any specific question forward.

How can identical external performance mask different internal representations?

What dimensions of recommendation quality do standard metrics miss?

Why does aggregate accuracy fail as a metric for rare harmful cases?

How can AI systems learn from failures without cascading errors?

What makes the frame problem distinct from feature-level shortcuts?

How do language models inherit human biases from training data?

How does removing a spurious cue change LLM performance?

Does fine-tuning modify underlying model capabilities or only behavioral outputs?

Can AI-generated outputs constitute genuine knowledge or valid claims?

What happens to AI reasoning when you remove specific political features?

What limits mechanistic interpretability's ability to characterize models?

Do language model representations contain causally steerable task-specific features?

Why can data filtering fail to remove transmitted behavioral traits?

How should models express uncertainty rather than forced confident answers?

What makes correcting a false assumption harder than just detecting it?

What properties determine whether reward signals teach genuine reasoning?

Why do different models respond differently to spurious rewards?

What are the consequences of models training on synthetic data?

Does debiasing training data actually solve the bias problem in machine learning?

How do adversarial and manipulative prompts attack reasoning models?

How do training priors constrain what context information can override?

How do policy learning algorithm choices affect multi-objective optimization stability?

Can group-relative normalization be modified to resist shortcut trajectories?

How do self-generated feedback mechanisms enable effective model learning?

What features does a sample reinforce when it moves bands?

Original note title

LLM heuristic override is structurally distinct from shortcut learning because removing the spurious cue degrades rather than improves performance
