INQUIRING LINE

Do task-specific heuristics improve gradually or appear suddenly at scale?

This explores whether the skills a model uses to solve a task accumulate smoothly as training and scale grow, or switch on abruptly — and the corpus reframes 'heuristic' itself as instance-based pattern recall, which changes the answer.


The corpus's most direct move is to dissolve the premise: what looks like a discrete 'heuristic' is, in several of these notes, just recall of training instances that resemble the one at hand. Do language models fail at reasoning due to complexity or novelty? shows that reasoning models neither break at a complexity threshold nor switch on at one; they succeed on a chain of any length when trained on similar instances and fail at novelty boundaries. Under that lens, a heuristic isn't a capability that appears; it's coverage that expands, one familiar region at a time.
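One way to picture "coverage that expands" is a toy nearest-neighbour model. The sketch below is purely illustrative and is not taken from any of the cited notes: the one-dimensional task space, the `RADIUS` cutoff, and the function names are all assumptions made for the example. A test instance counts as solved when some training instance lies within `RADIUS` of it.

```python
import random

RADIUS = 0.02  # hypothetical: how close a training instance must be to count as "similar"


def is_covered(test_point, train_points, radius=RADIUS):
    """A test instance is 'solved' if any training instance is within radius of it."""
    return any(abs(test_point - t) <= radius for t in train_points)


def coverage(train_points, test_points, radius=RADIUS):
    """Fraction of test instances that fall inside a familiar region."""
    solved = sum(is_covered(p, train_points, radius) for p in test_points)
    return solved / len(test_points)


rng = random.Random(0)  # fixed seed so the run is repeatable
all_train = [rng.random() for _ in range(400)]  # toy training instances in [0, 1)
test_points = [i / 200 for i in range(200)]     # evenly spaced toy test instances

# Grow the training set by taking longer prefixes of the same sample,
# so each larger set contains every smaller one.
sizes = [10, 25, 50, 100, 200, 400]
curve = [coverage(all_train[:n], test_points) for n in sizes]

for n, c in zip(sizes, curve):
    print(f"{n:4d} training instances -> coverage {c:.2f}")
```

Because each larger training set contains the smaller ones, coverage in this toy model can only stay flat or rise. Nothing in it crosses a threshold; more regions simply become familiar.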

That framing makes the 'gradual' side look strong. Does longer reasoning actually mean harder problems? finds that reasoning-trace length tracks proximity to the training distribution, not the difficulty of the problem: the model is recalling schemas, not adaptively computing. Does chain-of-thought reasoning actually generalize beyond training data? sharpens this: performance decays *predictably* as you move away from training data, leaving fluent but logically inconsistent reasoning. Predictable decay is the signature of a smooth underlying function, not a phase transition. Even instruction tuning, often credited with unlocking new behavior, turns out to mostly teach the shape of the output space. Does instruction tuning teach task understanding or output format? shows that models trained on semantically empty or wrong instructions perform about as well as those trained on correct ones. What accumulates is format familiarity, gradually.
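A rough way to check "smooth function versus phase transition" on an accuracy-by-distance curve is to fit both shapes and compare their errors. The sketch below does that with a least-squares line and a best single-step function. The data points are invented for illustration and do not come from any cited paper, and the function names are hypothetical.

```python
def sse(values, predictions):
    """Sum of squared errors between observed and predicted values."""
    return sum((v - p) ** 2 for v, p in zip(values, predictions))


def fit_line(xs, ys):
    """Ordinary least-squares line; returns its predictions at xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return [intercept + slope * x for x in xs]


def fit_step(xs, ys):
    """Best single step: one constant level before a split, another after it."""
    best = None
    for split in range(1, len(xs)):
        left, right = ys[:split], ys[split:]
        preds = [sum(left) / len(left)] * len(left) + [sum(right) / len(right)] * len(right)
        if best is None or sse(ys, preds) < sse(ys, best):
            best = preds
    return best


# Invented accuracy values at increasing distance from the training distribution.
distance = [0, 1, 2, 3, 4, 5, 6, 7]
gradual = [0.95, 0.86, 0.78, 0.69, 0.58, 0.51, 0.40, 0.33]  # steady decay
abrupt = [0.95, 0.94, 0.95, 0.93, 0.35, 0.34, 0.33, 0.34]   # cliff after distance 3

for name, ys in [("gradual", gradual), ("abrupt", abrupt)]:
    line_err = sse(ys, fit_line(distance, ys))
    step_err = sse(ys, fit_step(distance, ys))
    better = "line" if line_err < step_err else "step"
    print(f"{name}: line SSE={line_err:.4f}, step SSE={step_err:.4f}, better fit: {better}")
```

This is a crude comparison, since the step has one more free parameter than the line, but it shows the distinction: a predictable decay is described well by a smooth curve, while a genuine phase transition would be described better by a step.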

The scaling-curve notes agree. Do search steps follow the same scaling rules as reasoning tokens? finds that search agents improve with more search steps along a diminishing-returns curve that mirrors the one for reasoning tokens: a smooth axis, not a cliff. Why does chain of thought accuracy eventually decline with length? describes a continuous inverted-U whose optimum moves toward longer chains on harder tasks and toward shorter ones as models improve. Nothing here behaves like a switch.
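The inverted-U can be reproduced with a two-factor toy model, offered here only as an illustration and not as the model used in the cited note. It assumes accuracy is the product of a "made enough progress" term that rises with chain length and a "no step has gone wrong yet" term that falls with it. The functional form and the parameters `capability`, `difficulty`, and `step_error` are all assumptions.

```python
import math


def accuracy(length, capability, difficulty, step_error=0.02):
    """Toy accuracy for a chain of `length` steps (assumed form, not from the source)."""
    progress = 1 - math.exp(-capability * length / difficulty)  # rises with length
    survival = (1 - step_error) ** length                        # falls with length
    return progress * survival


def best_length(capability, difficulty, max_length=300):
    """Chain length in 1..max_length with the highest toy accuracy."""
    return max(range(1, max_length + 1),
               key=lambda n: accuracy(n, capability, difficulty))


easy_weak = best_length(capability=1.0, difficulty=5.0)
hard_weak = best_length(capability=1.0, difficulty=20.0)
hard_strong = best_length(capability=2.0, difficulty=20.0)

print("easy task, weaker model:   ", easy_weak)
print("hard task, weaker model:   ", hard_weak)
print("hard task, stronger model: ", hard_strong)
```

Under these assumptions the best length grows with task difficulty and shrinks with capability, and it does so continuously as the parameters change. The optimum slides along the curve; it never jumps.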

So where does 'sudden at scale' come from? The interesting answer in this corpus is that apparent jumps usually come from a *new information channel*, not raw scale. Can natural language feedback overcome numerical reward plateaus? shows models stuck on a numerical-reward plateau leap forward when given chain-of-thought critiques: the plateau wasn't a capability ceiling, it was missing information about *why* failures happened. Likewise, Does training order reshape how models handle different task types? shows that simply changing training *order* (structured tasks first) yields 6.2% gains over joint training by preventing entropy collapse, and Does reinforcement learning squeeze exploration diversity in search agents? shows RL can quietly narrow the heuristics a model will even attempt. The 'suddenness' lives in the training signal and schedule, not in scale crossing a magic number.
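Entropy collapse, the mechanism this paragraph points to, can be made concrete with Shannon entropy over the strategies a policy samples. The sketch below is illustrative only: the strategy names and both probability distributions are invented, not measurements from the cited work.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in bits of a discrete distribution (zero terms are skipped)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Hypothetical probabilities that a search agent picks each strategy.
strategies = ["broad query", "follow citation", "reformulate", "verify source"]
before_rl = [0.25, 0.25, 0.25, 0.25]  # every strategy is still being tried
after_rl = [0.94, 0.02, 0.02, 0.02]   # policy has narrowed onto one strategy

h_before = shannon_entropy(before_rl)
h_after = shannon_entropy(after_rl)

print(f"entropy before RL: {h_before:.2f} bits")
print(f"entropy after RL:  {h_after:.2f} bits")
```

A drop like this means the policy rarely attempts most of its strategies. The model may still be capable of them; what has narrowed is which heuristics get a chance to run.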

The thing worth taking away: in this collection there's little evidence for heuristics that genuinely materialize at a scale threshold. What reads as emergence is usually the model entering a region where similar instances were memorized, or a new feedback signal exposing competence that scaling alone left locked. Scale mostly buys *more* gradually-acquired heuristics — wider coverage — rather than qualitatively new ones appearing all at once.


Sources (9 notes)

Do language models fail at reasoning due to complexity or novelty?

LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.

Does longer reasoning actually mean harder problems?

Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions, reaching 43% against a random baseline of 42.6%. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Do search steps follow the same scaling rules as reasoning tokens?

Deep research agents improve with more search steps in a pattern mirroring the reasoning-token relationship, with both exhibiting diminishing returns. This reveals a new inference-compute axis beyond model capability alone.

Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.

Does training order reshape how models handle different task types?

Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing whether task-specific heuristics in LLMs emerge gradually or suddenly at scale. This remains an open question.

What a curated library found — and when (dated claims, not current truth): Findings span 2023–2026; treat them as perishable.
• Reasoning-trace length tracks training-distribution proximity, not problem difficulty; performance decays predictably as distance from training data increases, not at a threshold (~2025).
• Instruction tuning teaches output-format distribution, not task understanding; semantically empty instructions perform ~as well as correct ones (~2023).
• Search agents improve along smooth diminishing-returns curves that mirror reasoning-token scaling, with no cliff behavior (~2025).
• Apparent capability jumps usually reflect new information channels (e.g., chain-of-thought critiques breaking numerical-reward plateaus) rather than scale crossing a threshold (~2025).
• Optimal chain-of-thought length follows a continuous inverted-U; the optimum drifts smoothly as model capability increases (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2305.11383 (2023) — Do Models Really Learn to Follow Instructions?
• arXiv:2508.01191 (2025) — Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
• arXiv:2506.03106 (2025) — Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback
• arXiv:2605.22817 (2026) — Vector Policy Optimization: Training for Diversity Improves Test-Time Search

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer models (o1, o3, Gemini 3), RL methods (PPO variants, reward-model scaling), or multi-agent orchestration have since RELAXED or OVERTURNED it. Separate the durable claim — heuristics are coverage, not discrete switches — from perishable specifics (e.g., instruction-tuning inefficacy). Cite what resolved or reconfirmed each constraint.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. What papers directly claim phase-transition emergence, and do they hold?
(3) Propose 2 research questions that ASSUME the regime may have shifted: (a) Has numerical reward scaled far enough that gradient information alone now triggers discontinuous heuristic adoption? (b) Do multi-agent compositions with memory/caching create apparent emergent heuristics that single-model scaling cannot?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
