SYNTHESIS NOTE

Does longer reasoning actually mean harder problems?

Do chain-of-thought trace lengths reliably reflect problem difficulty, or do they primarily indicate proximity to training examples? Understanding this matters for designing effective scaling heuristics.

Synthesis note · 2026-02-22 · sourced from Reasoning Critiques

A prevailing assumption holds that longer reasoning traces indicate more thinking effort, so more complex problems should produce longer traces. Controlled experiments undercut this assumption.

Training transformer models from scratch on derivational traces of the A* search algorithm, where problem complexity is precisely controllable and verifiable, reveals that trace length and problem complexity are decoupled.

The interpretation: intermediate token sequence length reflects approximate recall from the training distribution, not problem-adaptive computation. When a problem is close to training examples, the model retrieves a matching schema whose length reflects the training data's length distribution for that problem type. When a problem is far from training, the model has no calibrated schema to retrieve — trace length becomes arbitrary.
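To make the A* setup concrete, here is a minimal sketch of what a derivational trace could look like. It assumes a 4-connected grid with unit step costs and a Manhattan heuristic. The function name `astar_trace` and the `create` / `close` token format are hypothetical illustrations, not the format used in the experiments. The point is only that the solver's own trace and the optimal cost are both exactly computable, so difficulty can be controlled and verified.

```python
import heapq

def astar_trace(grid, start, goal):
    """Run A* on a 4-connected grid; return (path_cost, trace_tokens).

    grid: list of equal-length strings, '#' marks a wall.
    The token format below is a hypothetical illustration.
    """
    rows, cols = len(grid), len(grid[0])

    def h(cell):
        # Manhattan distance: admissible and consistent for unit-cost grid moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    trace = []
    best_g = {start: 0}
    frontier = [(h(start), 0, start)]  # entries are (f, g, cell)
    closed = set()
    while frontier:
        f, g, cell = heapq.heappop(frontier)
        if cell in closed:
            continue  # stale entry: this cell was already expanded
        closed.add(cell)
        trace.append(f"close {cell[0]} {cell[1]} g{g}")
        if cell == goal:
            return g, trace
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != "#":
                ng = g + 1
                if ng < best_g.get((nr, nc), float("inf")):
                    best_g[(nr, nc)] = ng
                    heapq.heappush(frontier, (ng + h((nr, nc)), ng, (nr, nc)))
                    trace.append(f"create {nr} {nc} g{ng}")
    return None, trace  # goal unreachable

# Two toy problems of different, known difficulty (made up for illustration).
easy = ["..", ".."]
hard = ["....", ".##.", ".#..", "...."]
cost_easy, trace_easy = astar_trace(easy, (0, 0), (1, 1))
cost_hard, trace_hard = astar_trace(hard, (0, 0), (3, 3))
print(cost_easy, len(trace_easy))
print(cost_hard, len(trace_hard))
```

For the solver itself, a larger or more obstructed grid yields a longer trace. The claim in this note concerns the traces a trained model emits, which need not follow the same pattern.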

This challenges the anthropomorphic framing of "thinking time." When DeepSeek-R1 or similar models produce long chains, the conventional interpretation is that the problem is hard and the model is "working through it." The A* evidence suggests the length may primarily indicate how close the problem is to the training distribution, not how much genuine computation is occurring.

The practical implication: trace length is not a reliable proxy for problem difficulty. Length-based scaling heuristics (allocate more tokens to harder problems) may be calibrating to the wrong signal. The related note "Does more thinking time always improve reasoning accuracy?" supports this: more tokens do not reliably help after a certain point.
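One way to check which signal trace length tracks before building a heuristic on it is to rank-correlate length against both candidates. The sketch below is a plain-Python Spearman correlation under stated assumptions: `difficulty` is a known ground-truth measure (for example, optimal plan length), and `train_distance` is some hypothetical measure of distance to the nearest training example. The numbers are made up solely to exercise the functions; they are not measurements.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out

def pearson(x, y):
    """Pearson correlation; returns NaN if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    if sx == 0 or sy == 0:
        return float("nan")
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Made-up illustrative values, one entry per problem (NOT real data).
difficulty = [1, 2, 3, 4, 5, 6]                    # assumed ground-truth difficulty
train_distance = [0.1, 0.5, 0.2, 0.6, 0.3, 0.4]    # hypothetical distance measure
trace_len = [120, 480, 200, 510, 260, 300]         # tokens in the emitted trace

rho_difficulty = spearman(trace_len, difficulty)
rho_distance = spearman(trace_len, train_distance)
print(f"length vs difficulty: {rho_difficulty:.3f}")
print(f"length vs train distance: {rho_distance:.3f}")
```

If length correlates weakly with difficulty on real held-out problems, allocating tokens by expected length would be hard to justify. The choice of distance measure is itself an assumption that would need validating.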

This also deepens the question raised in "Does chain-of-thought reasoning reveal genuine inference or pattern matching?": if trace length reflects proximity to the training distribution, then even the amount of imitation is calibrated to training similarity, not to actual inferential needs.

Original note title

cot trace length reflects training distribution proximity, not problem difficulty