INQUIRING LINE

Why does reasoning effort fail to improve theory of mind performance?

This explores why cranking up a model's reasoning effort — more thinking tokens, RL on reasoning, longer chains — doesn't help (and may actively hurt) its ability to track what other minds believe, and what that reveals about how 'reasoning' and 'social cognition' differ.


The corpus points to a surprisingly clean answer: theory of mind isn't the kind of problem that extra reasoning solves. The most direct evidence is almost paradoxical. Advanced reasoning models like Claude 3.7 Sonnet and o1 score *worse* than older, less optimized models on theory-of-mind benchmarks like Decrypto, and on some tasks worse than both humans and simple word-embedding baselines (Why do advanced reasoning models fail at understanding minds?; Why do reasoning models fail at theory of mind tasks?). Optimizing for formal reasoning doesn't just fail to help social reasoning; it seems to interfere with it.

The leading explanation is architectural, not a matter of effort. Formal reasoning is sequential derivation: chaining one step to the next toward an answer. Social reasoning instead demands holding *multiple competing models of the world in mind at once* (what I believe, what you believe I believe, what you falsely believe). Reasoning models given ToM tasks fail to outperform vanilla LLMs, and their longer traces neither help nor generalize, while a method called ThoughtTracing succeeds with *shorter* Bayesian hypothesis tracking, because it maintains several belief states in parallel rather than grinding down a single chain (Why do reasoning models struggle with theory of mind tasks?). A related finding shows LLMs default to surface-level strategies instead of genuine mental simulation, and that hybrid architectures forcing explicit belief tracking beat LLMs alone, suggesting the gap is built into the architecture and not fixable by more training on the same shape (Do large language models genuinely simulate mental states?).
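To make "several belief states in parallel" concrete, here is a minimal Python sketch of weighted hypothesis tracking on a Sally-Anne style false-belief scenario. It is not ThoughtTracing's implementation; the function names, the two-location world, and the probabilities `p_notice` and `p_consistent` are all illustrative assumptions.

```python
# Toy sketch (assumption: NOT the ThoughtTracing method itself).
# We keep a weight for every hypothesis about where the agent BELIEVES
# the object is, instead of deriving one answer step by step.

LOCATIONS = ("basket", "box")  # hypothetical two-location world


def normalize(weights):
    """Rescale hypothesis weights so they sum to 1."""
    total = sum(weights.values())
    return {loc: w / total for loc, w in weights.items()}


def predict(weights, moved_to, witnessed, p_notice=0.95):
    """Transition step: only a move the agent witnessed shifts its belief."""
    if not witnessed:
        return dict(weights)  # unseen events leave the agent's belief unchanged
    shifted = {loc: 0.0 for loc in LOCATIONS}
    for loc, w in weights.items():
        shifted[moved_to] += w * p_notice       # agent registers the move
        shifted[loc] += w * (1.0 - p_notice)    # agent misses it, keeps old belief
    return shifted


def update(weights, searched, p_consistent=0.9):
    """Bayes step: reweight hypotheses by how well they explain an observed search."""
    posterior = {}
    for loc, w in weights.items():
        likelihood = p_consistent if loc == searched else 1.0 - p_consistent
        posterior[loc] = w * likelihood
    return normalize(posterior)


def sally_anne():
    """Run the scenario and return (predicted search location, final belief weights)."""
    belief = {loc: 1.0 / len(LOCATIONS) for loc in LOCATIONS}  # uniform prior
    belief = predict(belief, "basket", witnessed=True)   # Sally sees marble put in basket
    belief = predict(belief, "box", witnessed=False)     # marble moved while Sally is away
    predicted_search = max(belief, key=belief.get)       # where will Sally look?
    belief = update(belief, searched="basket")           # she is then seen searching the basket
    return predicted_search, belief


if __name__ == "__main__":
    predicted, final_belief = sally_anne()
    print(predicted, final_belief)
```

The point of the design is that the marble's true location (the box) never enters the tracker's state for Sally; only events she witnessed move weight between hypotheses, which is the separation a single sequential chain tends to blur.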

This connects to a broader pattern the corpus has been documenting: more thinking is not monotonically better thinking. Reasoning accuracy peaks and then *declines* past a critical token threshold, with models 'overthinking' easy problems and underthinking hard ones; in one reported case, raising thinking tokens from ~1,100 to ~16K cut benchmark accuracy from 87.3% to 70.3% (Does more thinking time always improve reasoning accuracy?). Optimal chain-of-thought length follows an inverted-U in which the most capable models actually prefer shorter chains (Why does chain of thought accuracy eventually decline with length?). So 'reasoning effort fails to help ToM' is partly a special, severe case of a general truth. ToM is where it bites hardest, because the extra derivation actively pulls the model away from the parallel belief-tracking the task requires.
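One way to see how an inverted-U can arise is a toy model, sketched below in Python. It is an illustrative assumption, not the model used in the cited notes: each step pays a fixed noise cost, while splitting the task across more steps lowers the per-step load. The functional form and the `difficulty`, `capability`, and `noise` parameters are hypothetical.

```python
import math

# Toy model (assumption, not taken from the cited work):
#   step_success = (1 - noise) * exp(-(difficulty / (n_steps * capability))**2)
#   accuracy     = step_success ** n_steps
# Too few steps: each step is overloaded. Too many: per-step noise compounds.


def accuracy(n_steps, difficulty, capability, noise=0.02):
    """Probability that every step in a chain of n_steps succeeds."""
    per_step_load = difficulty / (n_steps * capability)
    step_success = (1.0 - noise) * math.exp(-per_step_load ** 2)
    return step_success ** n_steps


def best_length(difficulty, capability, max_steps=200):
    """Chain length (in steps) that maximizes accuracy under the toy model."""
    return max(
        range(1, max_steps + 1),
        key=lambda n: accuracy(n, difficulty, capability),
    )


if __name__ == "__main__":
    for difficulty, capability in [(4, 1), (8, 1), (4, 2)]:
        n = best_length(difficulty, capability)
        print(difficulty, capability, n, round(accuracy(n, difficulty, capability), 3))
```

Under these assumptions the best length grows with task difficulty and shrinks with capability, which matches the qualitative pattern described above; the specific numbers carry no empirical weight.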

A less obvious thread: part of the corpus questions whether the models were ever 'reasoning' about minds in the first place. Chain-of-thought turns out to be constrained imitation of reasoning's *form* rather than genuine inference: logically invalid CoT exemplars perform nearly as well as valid ones, meaning the model learns the look of reasoning, not the logic (Does logical validity actually drive chain-of-thought gains?; Why does chain-of-thought reasoning fail in predictable ways?). On ToM specifically, supervised fine-tuning matches RL, and benchmarks turn out to be solvable through pattern-matching on templated artifacts and distribution biases, so high scores may reflect exploited shortcuts rather than mental-state reasoning at all (Can language models solve ToM benchmarks without real reasoning?). If the apparent successes are surface tricks, then 'reasoning effort' has nothing real to amplify.

There's a hopeful counterweight, though. RL on social reasoning *can* produce genuine, transferable belief-tracking, but only above a model-scale threshold (7B models, per the cited note); below it, smaller models reach comparable accuracy through shortcuts with no interpretable reasoning trace (Does reinforcement learning on theory of mind collapse with model scale?). And RL training can flip extended thinking from counterproductive self-doubt into productive analysis, which says the *quality* of reasoning is trainable, not just its quantity (Does extended thinking help or hurt model reasoning?). Read alongside the finding that base models already contain latent reasoning that the right training merely elicits (Do base models already contain hidden reasoning ability?), the picture sharpens: theory of mind doesn't fail because models can't reason. It fails because today's reasoning effort optimizes the wrong cognitive shape, sequential derivation where the task needs parallel mind-modeling.


Sources (12 notes)

Why do advanced reasoning models fail at understanding minds?

Claude 3.7 Sonnet and o1 underperform older models on ToM benchmarks like Decrypto. Increased reasoning effort does not improve social cognition and may actively interfere with it.

Why do reasoning models fail at theory of mind tasks?

Claude 3.7 Sonnet and o1 fail measurably at Decrypto benchmark tasks testing representational change, false belief, and counterfactual reasoning—tasks where they score worse than both humans and simple word-embedding baselines. The evidence suggests formal reasoning optimization actively degrades social reasoning capability.

Why do reasoning models struggle with theory of mind tasks?

Reasoning models fail to outperform vanilla LLMs on theory of mind tasks, produce longer but unhelpful traces, and show no generalization to similar scenarios. ThoughtTracing's success using shorter Bayesian hypothesis tracking suggests social reasoning demands simultaneous multiple-model maintenance, not sequential derivation.

Do large language models genuinely simulate mental states?

ChangeMyView and FANTOM benchmarks show LLMs fail at authentic perspective-taking in open-ended scenarios, despite succeeding on structured tasks. Hybrid Bayesian architectures that force explicit belief tracking outperform LLM-alone approaches, suggesting the gap is architectural rather than merely training-based.

Does more thinking time always improve reasoning accuracy?

Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.

Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.

Does logical validity actually drive chain-of-thought gains?

Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.

Why does chain-of-thought reasoning fail in predictable ways?

CoT guides models to pattern-match reasoning structure rather than perform genuine inference. This explains distribution-bounded failures, why structural coherence matters more than content correctness, and why performance optimizes against interpretability.

Can language models solve ToM benchmarks without real reasoning?

Supervised fine-tuning matches reinforcement learning performance on ToM tasks, suggesting models exploit structural vulnerabilities rather than develop genuine reasoning. Distribution biases and templated artifacts allow surface-level pattern recognition to achieve competitive generalization.

Does reinforcement learning on theory of mind collapse with model scale?

7B models develop explicit, transferable belief-tracking under RL, while smaller models achieve comparable accuracy through shortcut learning that lacks interpretable reasoning traces. The mismatch between accuracy and reasoning quality is invisible without inspecting step-by-step outputs.

Does extended thinking help or hurt model reasoning?

Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning-research analyst. The question: *Why does reasoning effort fail to improve theory of mind performance—and has this constraint moved since early 2025?*

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2025 and cluster around a core paradox:
• Advanced reasoning models (Claude 3.7 Sonnet, o1) score *worse* than older baselines on ToM benchmarks like Decrypto; longer reasoning traces don't help and don't generalize (2024–2025).
• The architectural gap: formal reasoning chains single steps sequentially; ToM requires *parallel belief-state tracking*. ThoughtTracing (Bayesian hypothesis tracking) succeeds with shorter reasoning, while extended chains actively interfere (2025).
• Reasoning accuracy peaks then declines past a critical token threshold; chain-of-thought follows an inverted-U, with more capable models preferring *shorter* chains (2025).
• Chain-of-thought itself may be constrained imitation: logically invalid CoT prompts perform nearly as well as valid ones, and current ToM benchmarks exploit pattern-matching shortcuts rather than genuine mental-state reasoning (2024–2025).
• RL on social reasoning *can* produce genuine belief-tracking, but only above a model-scale threshold; below it, smaller models fake accuracy with no interpretable trace (2025).

Anchor papers (verify; mind their dates):
• arXiv:2307.10573 (2023): Invalid logic in CoT prompting
• arXiv:2502.07266 (2025): Chain-of-thought length inverted-U
• arXiv:2504.01698 (2025): Do ToM benchmarks need explicit reasoning?
• arXiv:2506.04210 (2025): Test-time scaling trade-offs

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For reasoning models released or retrained since June 2025 (o1-pro, reasoning-via-RL variants, scaling-test-time approaches): does longer reasoning *still* hurt ToM? Has explicit belief-state prompting (e.g., "track three simultaneous world-models") flipped the trend? Check whether hybrid architectures or multi-step RL have relaxed the sequential-derivation bottleneck. Cite what changed it, or state plainly where the constraint *still holds*.
(2) **Surface the strongest CONTRADICTING or SUPERSEDING work** from the last 6 months. Look for: (a) papers showing reasoning effort *does* help ToM under specific conditions; (b) new benchmarks that don't exploit shortcuts; (c) model releases that break the o1 ↔ ToM pattern; (d) training methods that unify sequential and parallel reasoning.
(3) **Propose 2 research questions that ASSUME the regime may have moved:** e.g., "If RL has learned to toggle between chain-derivation and parallel belief-tracking, can we instrument that switch?" or "Can we design a benchmark immune to shortcut exploitation, and does reasoning effort help *then*?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
