INQUIRING LINE

Why does additional reasoning effort not improve theory of mind performance?

This line of inquiry asks why pouring more 'thinking' into a model (longer chains of thought, reasoning-optimized architectures) fails to make it better at tracking what other minds believe. The corpus suggests the problem is not the quantity of reasoning but the kind of reasoning being applied.


Theory of mind (ToM) is the ability to track what other agents believe, want, or falsely assume. The short version the corpus keeps circling back to: social reasoning and formal reasoning are different cognitive shapes, and the machinery we've built to scale one actively interferes with the other. The most pointed evidence is that reasoning-optimized models like Claude 3.7 Sonnet and o1 score *worse* than older, plainer models on ToM benchmarks like Decrypto, and worse than both humans and simple word-embedding baselines (see: Why do reasoning models fail at theory of mind tasks?; Why do advanced reasoning models fail at understanding minds?). Effort doesn't just fail to help here; it appears to degrade the capability.

The deeper 'why' is architectural. Formal reasoning is sequential derivation: one step licenses the next toward an answer. But tracking minds means holding several incompatible models of the world open at once (what I know, what you know, what you falsely believe I believe). One corpus note frames this directly: reasoning models produce *longer but unhelpful* traces and show no generalization, while a method called ThoughtTracing succeeds with much shorter Bayesian hypothesis-tracking, implying that social reasoning demands simultaneous maintenance of multiple belief-models, not a long linear chain (see: Why do reasoning models struggle with theory of mind tasks?). Stretching the chain just gives the model more rope to over-commit to a single line of derivation when the task actually requires juggling several.
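To make "holding several belief-models open at once" concrete, here is a minimal sketch of discrete hypothesis tracking on a Sally-Anne style story. It is not the ThoughtTracing method itself, only a toy filter in the same spirit. The function names, the event format, and the probabilities `p_notice` and `p_consistent` are all assumptions chosen for illustration.

```python
def normalize(weights):
    """Rescale hypothesis weights so they sum to 1."""
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}


def propagate(weights, event, p_notice=0.95):
    """Carry every belief hypothesis forward through one world event.

    Unwitnessed events leave the agent's belief untouched. Witnessed events
    shift probability mass toward the new location (p_notice is an assumed
    chance that the agent registers what it saw).
    """
    if not event["witnessed"]:
        return dict(weights)
    moved = {h: (1.0 - p_notice) * w for h, w in weights.items()}
    moved[event["location"]] += p_notice * sum(weights.values())
    return moved


def reweight(weights, action_target, p_consistent=0.9):
    """Bayes update from an observed action: agents mostly act on their beliefs.

    p_consistent is an assumed likelihood that the agent searches where it
    believes the object is.
    """
    n_other = len(weights) - 1
    likelihood = {
        h: p_consistent if h == action_target else (1.0 - p_consistent) / n_other
        for h in weights
    }
    return normalize({h: weights[h] * likelihood[h] for h in weights})


def track(events, hypotheses=("basket", "box")):
    """Return (true world state, weights over what the agent believes)."""
    weights = {h: 1.0 / len(hypotheses) for h in hypotheses}
    world = None
    for event in events:
        world = event["location"]          # reality always updates
        weights = propagate(weights, event)  # the agent's belief may not
    return world, weights


# Sally sees the marble go into the basket, then it is moved while she is away.
story = [
    {"location": "basket", "witnessed": True},
    {"location": "box", "witnessed": False},
]
world, belief = track(story)
print("world:", world, "| Sally most likely believes:", max(belief, key=belief.get))
```

The point of the design is that the world state and the distribution over Sally's belief are kept as separate objects and updated in parallel, so the tracker never has to commit to a single line of derivation.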

This connects to a more general finding that more thinking isn't monotonically better. Accuracy follows an inverted-U: pushing thinking tokens from ~1,100 to ~16K dropped benchmark accuracy from 87% to 70%, with models overthinking easy problems (see: Does more thinking time always improve reasoning accuracy?), and the optimal chain length actually *shrinks* as models get more capable (see: Why does chain of thought accuracy eventually decline with length?). ToM may simply sit on the steep downslope of that curve, where extra reasoning is mostly self-interference. There's even evidence that the apparent gains from chain-of-thought come from its *form* rather than genuine inference: logically invalid reasoning chains perform nearly as well as valid ones (see: Does logical validity actually drive chain-of-thought gains?). So the 'reasoning' a model adds isn't necessarily doing real mental-state inference at all.
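If accuracy is non-monotonic in thinking budget, the budget has to be chosen by measurement rather than maximized. The sketch below shows that selection step. The helper names are hypothetical, and the only data used are the two approximate points reported in the source note, which is enough to show a regression but not the full shape of the curve; a real sweep would need more budgets.

```python
def best_budget(measurements):
    """Return the (thinking_tokens, accuracy) pair with the highest accuracy."""
    return max(measurements, key=lambda pair: pair[1])


def accuracy_regressions(measurements):
    """List consecutive budget increases that lowered accuracy."""
    ordered = sorted(measurements)  # sort by token budget
    return [(lo, hi) for lo, hi in zip(ordered, ordered[1:]) if hi[1] < lo[1]]


# The two points reported in the source note (token counts are approximate).
reported = [(1_100, 0.873), (16_000, 0.703)]

tokens, accuracy = best_budget(reported)
drop = reported[0][1] - reported[1][1]
print(f"best budget: ~{tokens} tokens at {accuracy:.1%}")
print(f"accuracy lost at the larger budget: {drop:.1%}")
print("regressions:", accuracy_regressions(reported))
```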

Here's the thing you might not expect: on structured ToM benchmarks, models can look fine, but that's partly because the benchmarks are gameable. Models default to surface strategies rather than genuine mental simulation, passing templated tasks through pattern-matching while failing open-ended perspective-taking (see: Do large language models genuinely simulate mental states?), and supervised fine-tuning matches reinforcement learning on these tasks, a sign that models exploit structural artifacts instead of building real belief-tracking (see: Can language models solve ToM benchmarks without real reasoning?). So adding reasoning effort can't improve something the model was never really doing; it just elaborates the shortcut.
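A toy contrast shows how a surface strategy can pass a templated false-belief item without tracking beliefs at all. Both functions and both stories below are hypothetical illustrations, not drawn from any cited benchmark: the shortcut always answers with the object's first location, while the tracker updates the agent's belief only on events the agent witnessed.

```python
def first_location_heuristic(story):
    """Surface shortcut: always answer with where the object was first placed."""
    return story[0]["location"]


def belief_tracker(story):
    """Update the agent's belief only on events the agent actually witnessed."""
    belief = None
    for event in story:
        if event["witnessed"]:
            belief = event["location"]
    return belief


# Templated false-belief item: the move happens while the agent is away.
templated = [
    {"location": "basket", "witnessed": True},
    {"location": "box", "witnessed": False},
]
# Perturbed item: same surface form, but the agent watches the move.
perturbed = [
    {"location": "basket", "witnessed": True},
    {"location": "box", "witnessed": True},
]

for name, story, expected in [("templated", templated, "basket"),
                              ("perturbed", perturbed, "box")]:
    shortcut_ok = first_location_heuristic(story) == expected
    tracker_ok = belief_tracker(story) == expected
    print(f"{name}: shortcut correct={shortcut_ok}, tracker correct={tracker_ok}")
```

On the templated item the two are indistinguishable by accuracy alone; only the perturbed item separates them, which is why templated benchmarks can overstate belief-tracking ability.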

The hopeful counterpoint worth knowing: the gap may be elicitation, not absence. When reinforcement learning is applied directly to social reasoning, models at roughly the 7B scale develop explicit, *transferable* belief-tracking, while smaller ones fall back on uninterpretable shortcuts (see: Does reinforcement learning on theory of mind collapse with model scale?). This echoes the broader result that base models already contain latent reasoning that the right training selects rather than creates (see: Do base models already contain hidden reasoning ability?). The lesson across these notes is consistent: ToM doesn't respond to *more* reasoning, it responds to the *right kind* (parallel belief-tracking, the proper architecture, training that targets the capability), and generic effort applied to the wrong shape of problem can make things worse, not better.


Sources (10 notes)

Why do reasoning models fail at theory of mind tasks?

Claude 3.7 Sonnet and o1 fail measurably at Decrypto benchmark tasks testing representational change, false belief, and counterfactual reasoning—tasks where they score worse than both humans and simple word-embedding baselines. The evidence suggests formal reasoning optimization actively degrades social reasoning capability.

Why do advanced reasoning models fail at understanding minds?

Claude 3.7 Sonnet and o1 underperform older models on ToM benchmarks like Decrypto. Increased reasoning effort does not improve social cognition and may actively interfere with it.

Why do reasoning models struggle with theory of mind tasks?

Reasoning models fail to outperform vanilla LLMs on theory of mind tasks, produce longer but unhelpful traces, and show no generalization to similar scenarios. ThoughtTracing's success using shorter Bayesian hypothesis tracking suggests social reasoning demands simultaneous multiple-model maintenance, not sequential derivation.

Does more thinking time always improve reasoning accuracy?

Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.

Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.

Does logical validity actually drive chain-of-thought gains?

Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.

Do large language models genuinely simulate mental states?

ChangeMyView and FANTOM benchmarks show LLMs fail at authentic perspective-taking in open-ended scenarios, despite succeeding on structured tasks. Hybrid Bayesian architectures that force explicit belief tracking outperform LLM-alone approaches, suggesting the gap is architectural rather than merely training-based.

Can language models solve ToM benchmarks without real reasoning?

Supervised fine-tuning matches reinforcement learning performance on ToM tasks, suggesting models exploit structural vulnerabilities rather than develop genuine reasoning. Distribution biases and templated artifacts allow surface-level pattern recognition to achieve competitive generalization.

Does reinforcement learning on theory of mind collapse with model scale?

7B models develop explicit, transferable belief-tracking under RL, while smaller models achieve comparable accuracy through shortcut learning that lacks interpretable reasoning traces. The mismatch between accuracy and reasoning quality is invisible without inspecting step-by-step outputs.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about theory of mind (ToM) in LLMs. The precise question: *Why does additional reasoning effort degrade rather than improve ToM performance?* Treat the findings below as dated (2023–2025) and check whether newer models, training methods, evaluation harnesses, or architectural innovations have since shifted the constraints.

What a curated library found — and when (spanning 2023–2025, dated claims not current truth):
• Reasoning-optimized models (o1, Claude 3.7 Sonnet) score *worse* than older baseline models on ToM benchmarks like Decrypto; reasoning models produce longer but unhelpful traces (2025).
• Accuracy follows an inverted-U curve: pushing thinking tokens from ~1,100 to ~16K drops accuracy from 87% to 70%; optimal chain-of-thought length shrinks as models become more capable (2025).
• Logically invalid chain-of-thought prompts perform nearly as well as valid ones, suggesting models exploit surface patterns rather than genuine inference (2023).
• Models default to surface-level strategies on structured ToM benchmarks; supervised fine-tuning matches RL, indicating structural artifacts are being gamed (2024–2025).
• RL applied to ToM produces scale-dependent reasoning: models at the ~7B scale develop explicit, transferable belief-tracking; smaller ones revert to uninterpretable shortcuts (2025).

Anchor papers (verify; mind their dates):
• arXiv:2302.02083 (2023) — foundational ToM evaluation.
• arXiv:2401.05302 (2024) — human-robot interaction illusion.
• arXiv:2502.11881 (2025) — hypothesis-driven Bayesian approach.
• arXiv:2506.04210 (2025) — test-time scaling limits.

Your task:
(1) RE-TEST EACH CONSTRAINT. For the inverted-U curve, the surface-pattern exploit, and the RL scale threshold: Has recent work (last 6 months) on multimodal reasoning, process reward models, or scaffolded belief-tracking weakened or overturned these findings? Does the constraint still hold for frontier models (GPT-4o, Claude 3.5+, or unreleased variants)? Separate the durable question (does ToM require a fundamentally different cognitive shape?) from the perishable limitation (does *this specific* reasoning method fail?).
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from ~late 2024 onwards — any paper showing reasoning effort *does* help ToM under certain conditions, or that the benchmarks themselves have been redesigned to resist surface-level solutions.
(3) Propose 2 research questions that ASSUME the regime may have moved: (a) Do hybrid architectures (sequential reasoning + parallel belief-state maintenance) now unlock reasoning + ToM jointly? (b) Does prompt-level elicitation of *uncertainty* and *counterfactuals* — rather than step-by-step reasoning — now achieve what pure CoT could not?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
