Does reinforcement learning on theory of mind cause reasoning collapse in smaller models?
When RL improves social reasoning, does the quality of reasoning depend on model size? The question matters because accuracy alone may hide whether models are actually thinking or just pattern-matching.
Rule-based RL has proven effective for enhancing structured reasoning in math and coding. The question is whether it generalizes to social reasoning — "interpreting mental states and hidden commonsense" — where rules and ground truths are less well-defined.
The answer is scale-dependent.
7B models: RL induces high-quality, interpretable, and transferable belief-tracking behaviors. The reasoning traces show explicit step-by-step mental state tracking: identifying what each agent knows, what each agent believes about what others know, and how beliefs update as the story progresses. This transfers across benchmarks.
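To make "explicit belief tracking" concrete, here is a minimal sketch of the bookkeeping such a trace performs on a Sally-Anne style story. It is an illustration only, not the model's mechanism or the paper's method. The class name `BeliefTracker` and its methods are hypothetical, and the sketch assumes agents learn an object's location only by witnessing a move. It covers first-order beliefs (what each agent thinks is true), not beliefs about other agents' beliefs.

```python
from dataclasses import dataclass, field


@dataclass
class BeliefTracker:
    """Hypothetical helper: tracks where each agent believes each object is."""

    present: set = field(default_factory=set)    # agents currently in the room
    beliefs: dict = field(default_factory=dict)  # agent -> {object: location}
    world: dict = field(default_factory=dict)    # object -> true location

    def enter(self, agent):
        # Assumption: entering alone reveals nothing about hidden objects.
        self.present.add(agent)

    def leave(self, agent):
        self.present.discard(agent)

    def move(self, obj, location):
        self.world[obj] = location
        # Only agents who witness the move update their belief.
        for agent in self.present:
            self.beliefs.setdefault(agent, {})[obj] = location

    def believes(self, agent, obj):
        # Returns None if the agent has never seen the object placed.
        return self.beliefs.get(agent, {}).get(obj)


tracker = BeliefTracker()
tracker.enter("Sally")
tracker.enter("Anne")
tracker.move("marble", "basket")  # both agents witness this
tracker.leave("Sally")
tracker.move("marble", "box")     # only Anne witnesses this
tracker.enter("Sally")

sally_belief = tracker.believes("Sally", "marble")
anne_belief = tracker.believes("Anne", "marble")
```

After the story, Sally's belief stays at the last location she witnessed while Anne's matches the true state. A reasoning trace that tracks mental states has to spell out the same distinction in words.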
≤3B models: RL leads to reasoning collapse. Despite achieving "substantial accuracy gains comparable to the larger models," these models "failed to generate interpretable, structured reasoning traces." Instead, they produce "drastically shortened, less meaningful responses" — suggesting reliance on "implicit rather than explicit structured reasoning." They appear to have internalized "alternative rules or patterns that are effective for the specific structures found in benchmark datasets."
The mechanism: simple rule-based rewards optimize for correctness, but in models with limited capacity relative to task complexity, this "may inadvertently encourage shortcut learning." The model finds a faster path to the right answer that does not involve tracking mental states. Such a shortcut works on benchmarks but is unlikely to generalize to genuine social interaction.
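A minimal sketch of why a correctness-only reward cannot tell explicit reasoning from a shortcut. The `<answer>` tag format, the function name `rule_based_reward`, and the example responses are assumptions made for illustration; the source does not specify the exact reward used.

```python
import re

# Assumed output convention: the final answer is wrapped in <answer> tags.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def rule_based_reward(response: str, gold: str) -> float:
    """Return 1.0 if the tagged final answer matches gold, else 0.0.

    Nothing here inspects the reasoning that precedes the answer.
    """
    match = ANSWER_RE.search(response)
    if match is None:
        return 0.0
    predicted = match.group(1).strip().lower()
    return 1.0 if predicted == gold.strip().lower() else 0.0


# Hand-written toy responses, not model outputs.
explicit = (
    "Sally saw the marble go into the basket. She left before Anne moved it, "
    "so Sally still believes it is in the basket. <answer>basket</answer>"
)
shortcut = "<answer>basket</answer>"

reward_explicit = rule_based_reward(explicit, "basket")
reward_shortcut = rule_based_reward(shortcut, "basket")
```

Both responses earn the same reward, so under this kind of rule the optimizer has no incentive to keep the belief-tracking text if a shorter route reaches the same answer.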
This creates a "crucial mismatch between achieving high accuracy on benchmark questions and possessing genuine, human-like reasoning capabilities." The mismatch is invisible if you only look at accuracy scores — the 3B model looks comparable to the 7B model. It becomes visible only when you inspect the reasoning traces.
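One way to surface this mismatch is to report trace-level signals next to accuracy. The sketch below is a crude proxy under stated assumptions, not the evaluation used in the source: the cue list `BELIEF_CUES`, the record layout, and the function name `audit` are all hypothetical, and keyword matching is far weaker than reading the traces.

```python
from statistics import mean

# Hypothetical surface cues for explicit mental-state tracking.
BELIEF_CUES = ("believes", "thinks", "knows", "saw", "unaware")


def audit(records):
    """Summarize accuracy alongside two crude trace-quality signals.

    Each record is a dict with keys 'trace', 'prediction', and 'gold'.
    """
    accuracy = mean(1.0 if r["prediction"] == r["gold"] else 0.0 for r in records)
    avg_words = mean(len(r["trace"].split()) for r in records)
    cue_rate = mean(
        1.0 if any(cue in r["trace"].lower() for cue in BELIEF_CUES) else 0.0
        for r in records
    )
    return {"accuracy": accuracy, "avg_words": avg_words, "cue_rate": cue_rate}


# Hand-written toy records, not experimental data.
structured = [
    {
        "trace": "Sally saw the marble in the basket and thinks it is still there.",
        "prediction": "basket",
        "gold": "basket",
    }
]
collapsed = [{"trace": "basket", "prediction": "basket", "gold": "basket"}]

structured_report = audit(structured)
collapsed_report = audit(collapsed)
```

In this toy case the two reports agree on accuracy and differ only on trace length and cue rate. That is the pattern the note describes: the collapse does not show up in the accuracy column.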
The finding extends the entropy collapse dynamic from formal reasoning to social reasoning, but with an important twist: in formal domains, shortcut learning tends to reduce diversity while maintaining some reasoning structure. In social reasoning, it eliminates reasoning structure entirely while preserving accuracy — a more severe form of collapse.
Inquiring lines that use this note as a source (27)
This note is a source for these synthesized inquiries.
- Does good simulation eventually count as genuine realization?
- Why do reasoning models perform poorly at theory of mind tasks?
- What distribution patterns appear across different theory-of-mind datasets?
- How does theory of mind predict success in human-AI partnerships?
- How does theory of mind predict who benefits from AI collaboration?
- Why does policy entropy collapse limit reasoning and dialogue RL scaling?
- Why do reasoning models perform worse on theory of mind tasks?
- What happens when bidirectional theory of mind between humans and AI breaks down?
- How do theory of mind and empathy differ in LLM simulation?
- How do different social roles affect LLM theory of mind errors?
- Why does reasoning effort fail to improve theory of mind performance?
- Does formal reasoning training actively degrade social reasoning ability?
- What limits RL's ability to scale for reasoning at training time?
- Do longer reasoning traces actually improve theory of mind accuracy?
- Can theory of mind models generalize across structurally similar scenarios?
- What makes social reasoning fundamentally different from mathematical reasoning?
- Why does increasing reasoning not improve AI social reasoning performance?
- Does RL training actually restore the critical thinking that reasoning models lose?
- How do emotional and social simulations enable better hypothetical reasoning?
- Why does additional reasoning effort not improve theory of mind performance?
- Can multi-agent metacognitive decomposition achieve human-level theory of mind?
- Does RL teach models when to use reasoning or how to reason?
- Why might social reasoning work differently than formal logical reasoning?
- Why does reasoning volume fail to improve theory of mind performance?
- Does RL primarily teach when to use reasoning or how to reason?
- Why does policy entropy collapse when scaling RL for reasoning?
- Does policy entropy collapse in formal reasoning produce the same outcome in social reasoning?
Related concepts in this collection (3)
- Does policy entropy collapse limit reasoning performance in RL?
  As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
  the ToM reasoning collapse is an extreme form of entropy collapse: not just reduced diversity but elimination of interpretable reasoning
- Does RL teach reasoning or just when to use it?
  Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
  the scale-dependent finding adds a caveat: RL teaches when to activate only if the model has sufficient capacity; below threshold, RL teaches shortcuts instead
- Do base models already contain hidden reasoning ability?
  Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
  the 7B success suggests latent ToM capability exists at scale; the 3B failure suggests it doesn't exist below a capacity threshold for social reasoning
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Do Theory of Mind Benchmarks Need Explicit Human-like Reasoning in Language Models?
- Hypothesis-Driven Theory-of-Mind Reasoning for Large Language Models
- A Comment On "The Illusion of Thinking": Reframing the Reasoning Cliff as an Agentic Gap
- 1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities
- RL Squeezes, SFT Expands: A Comparative Study of Reasoning LLMs
- FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets
- ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning
- The Molecular Structure of Thought: Mapping the Topology of Long Chain-of-Thought Reasoning
Original note title
rl on ToM produces scale-dependent reasoning collapse — large models develop belief-tracking while small models achieve accuracy through shortcuts