SYNTHESIS NOTE

Does reinforcement learning on theory of mind collapse with model scale?

When RL improves social reasoning, does the quality of reasoning depend on model size? The question matters because accuracy alone may hide whether models are actually thinking or just pattern-matching.

Synthesis note · 2026-02-22 · sourced from Theory of Mind

Rule-based RL has proven effective for enhancing structured reasoning in math and coding. The question is whether it generalizes to social reasoning — "interpreting mental states and hidden commonsense" — where rules and ground truths are less well-defined.

The answer is scale-dependent.

7B models: RL induces high-quality, interpretable, and transferable belief-tracking behaviors. The reasoning traces show explicit step-by-step mental state tracking: identifying what each agent knows, what each agent believes about what others know, and how beliefs update as the story progresses. These behaviors transfer across benchmarks.
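To make "explicit belief tracking" concrete, here is a minimal sketch of the bookkeeping such a trace performs on a classic false-belief story. It is an illustration only, not the method used in the source: the BeliefTracker class, its method names, and the rule that agents update beliefs only for events they witness are all assumptions introduced here.

```python
from itertools import permutations


class BeliefTracker:
    """Hypothetical helper: tracks first- and second-order beliefs about object locations."""

    def __init__(self):
        self.present = set()     # agents currently able to observe events
        self.world = {}          # true location of each object
        self.first_order = {}    # agent -> {object: location the agent believes}
        self.second_order = {}   # (a, b) -> {object: location a thinks b believes}

    def enter(self, agent):
        self.present.add(agent)
        self.first_order.setdefault(agent, {})

    def leave(self, agent):
        self.present.discard(agent)

    def move(self, obj, location):
        # The world always changes; beliefs change only for witnesses.
        self.world[obj] = location
        for agent in self.present:
            self.first_order[agent][obj] = location
        # Each witness also sees which other agents witnessed the move.
        for a, b in permutations(self.present, 2):
            self.second_order.setdefault((a, b), {})[obj] = location


def sally_anne():
    """Replay the Sally-Anne story and return the resulting tracker."""
    t = BeliefTracker()
    t.enter("Sally")
    t.enter("Anne")
    t.move("marble", "basket")   # both agents see this
    t.leave("Sally")
    t.move("marble", "box")      # only Anne sees this
    return t


if __name__ == "__main__":
    t = sally_anne()
    print("True location:", t.world["marble"])
    print("Sally believes:", t.first_order["Sally"]["marble"])
    print("Anne thinks Sally believes:", t.second_order[("Anne", "Sally")]["marble"])
```

A trace that tracks mental states in the sense described above would spell out each of these updates in words; a collapsed trace would skip straight to the final location.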

≤3B models: RL leads to reasoning collapse. Despite achieving "substantial accuracy gains comparable to the larger models," these models "failed to generate interpretable, structured reasoning traces." Instead, they produce "drastically shortened, less meaningful responses" — suggesting reliance on "implicit rather than explicit structured reasoning." They appear to have internalized "alternative rules or patterns that are effective for the specific structures found in benchmark datasets."

The mechanism: simple rule-based rewards optimize for correctness, but in models with limited capacity relative to task complexity, this "may inadvertently encourage shortcut learning." The model finds a faster path to the right answer that doesn't involve actually tracking mental states. The shortcut works on benchmarks but would likely not generalize to genuine social interaction.
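The following sketch shows why a correctness-only reward cannot distinguish the two behaviors. It assumes a hypothetical output format in which the final answer is wrapped in <answer> tags; the source does not specify the reward's exact form, so the tag convention and both function names are illustrative.

```python
import re


def extract_answer(response: str):
    """Return the text inside <answer>...</answer>, or None if absent (assumed format)."""
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL | re.IGNORECASE)
    return match.group(1).strip().lower() if match else None


def rule_based_reward(response: str, gold: str) -> float:
    """Correctness-only reward: 1.0 for a matching final answer, else 0.0."""
    predicted = extract_answer(response)
    return 1.0 if predicted is not None and predicted == gold.strip().lower() else 0.0


# Two hypothetical responses to "Where will Sally look for the marble?"
explicit_trace = (
    "Sally put the marble in the basket and left. "
    "She did not see Anne move it, so she still believes it is in the basket. "
    "<answer>basket</answer>"
)
shortcut_trace = "<answer>basket</answer>"

if __name__ == "__main__":
    print(rule_based_reward(explicit_trace, "basket"))
    print(rule_based_reward(shortcut_trace, "basket"))
```

Both responses earn the same reward, so nothing in this signal favors the explicit trace over the bare answer.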

This creates a "crucial mismatch between achieving high accuracy on benchmark questions and possessing genuine, human-like reasoning capabilities." The mismatch is invisible if you only look at accuracy scores — the 3B model looks comparable to the 7B model. It becomes visible only when you inspect the reasoning traces.
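One way to surface the mismatch is to report simple trace diagnostics next to accuracy. The sketch below is an assumption-laden illustration: the record fields, the marker word list, and the two toy records are invented for demonstration and are not results from the source. Word count and keyword presence are crude proxies for reasoning quality, not a validated metric.

```python
from statistics import mean

# Hypothetical marker words suggesting explicit mental-state tracking.
MENTAL_STATE_MARKERS = ("believes", "thinks", "knows", "saw", "unaware")


def summarize(records):
    """Summarize accuracy and trace diagnostics.

    records: list of dicts with keys 'trace', 'prediction', 'gold' (assumed schema).
    """
    accuracy = mean(1.0 if r["prediction"] == r["gold"] else 0.0 for r in records)
    mean_words = mean(len(r["trace"].split()) for r in records)
    marker_rate = mean(
        1.0 if any(m in r["trace"].lower() for m in MENTAL_STATE_MARKERS) else 0.0
        for r in records
    )
    return {"accuracy": accuracy, "mean_words": mean_words, "marker_rate": marker_rate}


# Toy records, invented for illustration only.
explicit_run = [
    {
        "trace": "Sally saw the marble in the basket, then left, so she believes it is still there.",
        "prediction": "basket",
        "gold": "basket",
    }
]
shortcut_run = [{"trace": "basket", "prediction": "basket", "gold": "basket"}]

if __name__ == "__main__":
    print("explicit:", summarize(explicit_run))
    print("shortcut:", summarize(shortcut_run))
```

On these toy records the accuracy column is identical while trace length and marker rate diverge, which is the pattern the note describes as invisible to accuracy alone.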

The finding extends the entropy collapse dynamic from formal reasoning to social reasoning, but with an important twist: in formal domains, shortcut learning tends to reduce diversity while maintaining some reasoning structure. In social reasoning, it eliminates reasoning structure entirely while preserving accuracy — a more severe form of collapse.

Inquiring lines that read this note 28

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

How do we evaluate AI systems when user perception misleads actual performance?

Does good simulation eventually count as genuine realization?

How does reasoning effort affect AI theory of mind performance?

Is model self-awareness based on genuine introspection or pattern matching?

What distribution patterns appear across different theory-of-mind datasets?

When should tasks involve human-AI partnership versus full automation?

How does policy entropy collapse constrain reasoning-focused reinforcement learning?

How faithfully do LLMs reflect their actual reasoning in outputs and explanations?

How do theory of mind and empathy differ in LLM simulation?

Can LLM personas constitute genuine psychology or remain linguistic role-play?

How do different social roles affect LLM theory of mind errors?

Does reinforcement learning teach reasoning or just when to reason?

Related concepts in this collection 3

Does policy entropy collapse limit reasoning performance in RL? As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
the ToM reasoning collapse is an extreme form of entropy collapse: not just reduced diversity but elimination of interpretable reasoning
Does RL teach reasoning or just when to use it? Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
the scale-dependent finding adds a caveat: RL teaches when to activate only if the model has sufficient capacity; below threshold, RL teaches shortcuts instead
Do base models already contain hidden reasoning ability? Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
the 7B success suggests latent ToM capability exists at scale; the 3B failure suggests it doesn't exist below a capacity threshold for social reasoning

Original note title

rl on ToM produces scale-dependent reasoning collapse — large models develop belief-tracking while small models achieve accuracy through shortcuts
