Do task-specific heuristics emerge because they compress well enough?
This explores whether the shortcuts LLMs use on tasks — pattern-matching, format mimicry, narrow strategies — show up not because they're understood, but because they're the cheapest thing to store that still earns reward.
This reads the question as being about *why* models reach for shortcuts: not 'do they have heuristics' but 'is compressibility the reason they form.' The corpus doesn't argue this in those words, but several notes line up to make exactly that case — that training rewards the most storable behavior that passes, and that behavior is usually a heuristic, not a procedure.
The sharpest evidence is what models keep instead of what they're shown. When instruction tuning is run on semantically empty or even deliberately wrong instructions, performance barely moves; what transfers is knowledge of the output *space*, not the task (Does instruction tuning teach task understanding or output format?). That's compression in action: the cheap thing to encode is 'what answers look like here,' and that's what survives. Reasoning shows the same fingerprint. Chain-of-thought trace length tracks difficulty only in-distribution; out of distribution the two decouple, which suggests long traces are recall of familiar schemas rather than adaptive computation (Does longer reasoning actually mean harder problems?). And when asked to actually run iterative numerical methods, models instead recognize a problem as template-similar and emit plausible-but-wrong values (Do large language models actually perform iterative optimization?). The heuristic, 'this looks like that, so answer like that,' is the compressed stand-in for a procedure that would cost far more to represent.
The reason these shortcuts feel robust until they suddenly aren't is that compression is lossy at the edges. CoT degrades predictably under shifts in task, length, or format, producing fluent reasoning with no valid logic underneath: the form is preserved because the form is what compressed, and the logic was never stored (Does chain-of-thought reasoning actually generalize beyond training data?).
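A toy sketch may make "lossy at the edges" concrete. It is not taken from any of the cited notes: the names `TEMPLATES`, `template_sqrt`, and `newton_sqrt` are hypothetical, and square roots merely stand in for a task with a real iterative procedure behind it. The template heuristic stores four answers and responds with whichever memorized case looks most similar, while the procedure actually iterates.

```python
# Toy illustration only: a memorized "template" heuristic versus an
# iterative procedure. All names here are hypothetical.

# The heuristic is cheap to store: four memorized (input, answer) pairs.
TEMPLATES = {1.0: 1.0, 4.0: 2.0, 9.0: 3.0, 16.0: 4.0}


def template_sqrt(x: float) -> float:
    """Answer with the stored result of the most similar memorized input."""
    nearest = min(TEMPLATES, key=lambda known: abs(known - x))
    return TEMPLATES[nearest]


def newton_sqrt(x: float, steps: int = 50) -> float:
    """Approximate sqrt(x) for x > 0 by running Newton's iteration."""
    guess = x if x >= 1.0 else 1.0
    for _ in range(steps):
        guess = 0.5 * (guess + x / guess)
    return guess


def squared_error(estimate: float, x: float) -> float:
    """How far estimate**2 lands from x."""
    return abs(estimate * estimate - x)


if __name__ == "__main__":
    # 9.0 is memorized, 10.0 is near a memorized case, 400.0 is far from all.
    for x in (9.0, 10.0, 400.0):
        print(x, squared_error(template_sqrt(x), x),
              squared_error(newton_sqrt(x), x))
```

On a memorized input the heuristic is exact, near one it is plausibly close, and far from all of them it still returns a confident answer of the right form. That is the pattern the paragraph describes, though real models are of course not literal lookup tables.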
Reinforcement learning makes the compression pressure explicit rather than incidental. RL squeezes behavioral diversity in both reasoning and search agents through entropy collapse: policies converge on a narrow band of reward-maximizing strategies, while SFT on diverse demonstrations preserves breadth (Does reinforcement learning squeeze exploration diversity in search agents?). A narrow reward-maximizing strategy is precisely a task-specific heuristic that compressed well enough to win. This also reframes the capability gap: reasoning models beat non-reasoning ones at any inference budget because training installs a *protocol* that makes extra tokens productive, not because of raw capability (Can non-reasoning models catch up with more compute?). Heuristics aren't just compressed; they're compressed toward whatever the training regime rewarded.
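Entropy collapse can be sketched with a softmax policy over a handful of strategies. The setup below is an assumption made for illustration, not the method of the cited note: five hypothetical strategies with fixed rewards, updated by the exact gradient of expected reward with respect to the logits. The function names (`softmax`, `shannon_entropy`, `policy_gradient_step`, `train`) and the reward values are invented for this sketch.

```python
import math


def softmax(logits):
    """Convert logits into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def shannon_entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def policy_gradient_step(logits, rewards, lr):
    """One gradient-ascent step on expected reward for a softmax policy.

    For a softmax policy, d E[r] / d logit_i = p_i * (r_i - E[r]).
    """
    probs = softmax(logits)
    baseline = sum(p * r for p, r in zip(probs, rewards))
    return [
        logit + lr * p * (r - baseline)
        for logit, p, r in zip(logits, probs, rewards)
    ]


def train(logits, rewards, steps, lr=1.0):
    """Return the final logits and the entropy recorded before each step and at the end."""
    history = [shannon_entropy(softmax(logits))]
    for _ in range(steps):
        logits = policy_gradient_step(logits, rewards, lr)
        history.append(shannon_entropy(softmax(logits)))
    return logits, history


if __name__ == "__main__":
    # Hypothetical rewards for five strategies; the policy starts uniform.
    rewards = [1.0, 0.9, 0.5, 0.2, 0.0]
    final_logits, history = train([0.0] * 5, rewards, steps=200)
    print("entropy at start:", history[0])
    print("entropy at end:  ", history[-1])
    print("final policy:    ", softmax(final_logits))
```

Starting from a uniform policy, the probability mass drifts toward the highest-reward strategies and the entropy ends below where it began. Nothing in this update preserves breadth, which is the property the note attributes to SFT on diverse demonstrations.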
Less obviously, the corpus also points at an antidote. If heuristics emerge because depth-first shortcuts compress cheaply, then deliberately spending compute on *breadth* breaks the pattern: training abstraction generators that force diverse, structured exploration outperforms sampling more solutions from the same narrow policy at large budgets (Can abstractions guide exploration better than depth alone?). Compressibility explains why the shortcut forms, and it tells you that resisting it costs exploration you have to pay for on purpose.
Sources (7 notes)
Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions, reaching 43% against a random baseline of 42.6%. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.
Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.
Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.
DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.
RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning: policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.
Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.
RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.