Does policy entropy collapse in formal reasoning produce the same outcome in social reasoning?
This explores whether the entropy-collapse failure mode that caps formal reasoning under RL plays out the same way when models are trained to reason about other minds — and the corpus says the failure rhymes but doesn't repeat.
The honest answer from the corpus is that the symptom looks similar, but the underlying cause and the shape of the collapse differ in an interesting way.
In formal reasoning, entropy collapse behaves almost like a physical law. As a model trains with RL, its policy grows more confident and its exploratory range narrows, and performance saturates against a predictable ceiling described by R = -a·exp(H) + b, where R is performance and H is policy entropy: as H falls to zero, R approaches b - a and the model stops finding anything new (see "Does policy entropy collapse limit reasoning performance in RL?"). This matters because exploration lives in a small minority of tokens: only about 20% of tokens are the high-entropy 'forking points' where reasoning actually branches, and RLVR mostly tunes those (see "Do high-entropy tokens drive reasoning model improvements?"). Collapse the entropy and you collapse exactly the decisions that carry the learning signal.
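To make the two quantities above concrete, here is a minimal Python sketch. The coefficients `a` and `b`, the toy distributions, and the helper names are illustrative assumptions, not values or code from the cited notes; the 20% cutoff simply mirrors the proportion quoted above.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def predicted_performance(entropy, a, b):
    """Evaluate the empirical law R = -a * exp(H) + b."""
    return -a * math.exp(entropy) + b


def forking_token_indices(entropies, fraction=0.2):
    """Indices of the highest-entropy positions (the top `fraction` of tokens)."""
    k = max(1, round(len(entropies) * fraction))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:k])


# Toy next-token distributions for five positions (made-up numbers).
distributions = [
    [0.97, 0.01, 0.01, 0.01],
    [0.25, 0.25, 0.25, 0.25],  # uniform: the one clear branching point
    [0.90, 0.05, 0.03, 0.02],
    [0.99, 0.005, 0.003, 0.002],
    [0.85, 0.05, 0.05, 0.05],
]
entropies = [token_entropy(p) for p in distributions]
forks = forking_token_indices(entropies)  # selects position 1 only

# Hypothetical coefficients: the ceiling at H = 0 is b - a.
A, B = 0.2, 1.0
ceiling = predicted_performance(0.0, A, B)
```

With these assumed coefficients the predicted ceiling is `b - a`, and any positive entropy gives a lower predicted value, which is the saturation described above.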
Social reasoning collapses differently: it is gated by model scale, not just by entropy dynamics. When you run RL on theory-of-mind tasks, larger models (around 7B) develop genuine, transferable belief-tracking, while smaller models hit the *same accuracy numbers* through shortcut learning that has no interpretable reasoning behind it (see "Does reinforcement learning on theory of mind collapse with model scale?"). So the dangerous outcome here isn't a visible performance ceiling; it's an *invisible* one. The accuracy looks fine, but the reasoning has quietly hollowed out, and you only catch it by inspecting the step-by-step traces. That's a sharper trap than formal collapse, where the ceiling at least announces itself in the numbers.
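One way to surface that invisible ceiling is to score traces alongside answers. The sketch below is an assumption-laden illustration: the `EvalRecord` fields, the `trace_valid` judgment (which would have to come from some human or automated trace inspection), and the sample data are all hypothetical and not taken from the cited note.

```python
from dataclasses import dataclass


@dataclass
class EvalRecord:
    answer_correct: bool  # final answer matches the label
    trace_valid: bool     # hypothetical judgment: the trace tracks beliefs correctly


def accuracy(records):
    """Plain answer accuracy, the number that looks fine in both cases."""
    return sum(r.answer_correct for r in records) / len(records)


def grounded_accuracy(records):
    """Fraction of items that are correct AND backed by a valid trace."""
    return sum(r.answer_correct and r.trace_valid for r in records) / len(records)


def shortcut_gap(records):
    """Share of accuracy that is not supported by the reasoning trace."""
    return accuracy(records) - grounded_accuracy(records)


# Made-up evaluation results for two hypothetical models.
larger_model = [EvalRecord(True, True)] * 8 + [EvalRecord(False, False)] * 2
smaller_model = (
    [EvalRecord(True, False)] * 6
    + [EvalRecord(True, True)] * 2
    + [EvalRecord(False, False)] * 2
)

same_headline = accuracy(larger_model) == accuracy(smaller_model)
gap_large = shortcut_gap(larger_model)
gap_small = shortcut_gap(smaller_model)
```

In this toy data both models report the same accuracy, and only the gap between accuracy and trace-supported accuracy separates them.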
There's a second twist unique to the social case: a lot of apparent social competence is an artifact of how the task is set up. Models look skilled when one model secretly controls all the interlocutors, but they fail systematically the moment agents hold private information the model can't see (see "Why do LLMs fail when simulating agents with private information?"). So 'collapse' in social reasoning can mean a model never had the capability to lose: it was skipping the grounding work that real social inference requires. This connects to a broader corpus theme that chain-of-thought is often constrained imitation rather than inference: it reproduces the *form* of reasoning and degrades predictably off-distribution (see "Why does chain-of-thought reasoning fail in predictable ways?" and "Does chain-of-thought reasoning actually generalize beyond training data?").
The through-line worth taking away: in both domains the enemy is premature convergence, the policy settling into confident, narrow behavior, but the *remedy* and the *diagnostic* diverge. Formal reasoning fights collapse with entropy-management surgery (Clip-Cov, KL-Cov) or by injecting richer signal, like natural-language critiques that carry information numerical rewards can't (see "Can natural language feedback overcome numerical reward plateaus?"). Social reasoning needs something the entropy law doesn't measure at all: enough model capacity to represent other minds, and tasks honest enough to require it. Same instinct toward collapse; different floor underneath it.
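The notes name Clip-Cov and KL-Cov without describing them, so the following is only a rough sketch of the general idea as it is commonly understood: find the tokens whose log-probability and advantage co-vary most strongly, then treat them specially (for instance by dropping their gradient or adding a KL penalty). The selection rule, the `fraction` value, the function names, and the sample numbers are all assumptions, and the published methods may differ in detail.

```python
def covariance_terms(logprobs, advantages):
    """Per-token centered product (logp - mean logp) * (adv - mean adv)."""
    n = len(logprobs)
    mean_lp = sum(logprobs) / n
    mean_adv = sum(advantages) / n
    return [(lp - mean_lp) * (adv - mean_adv) for lp, adv in zip(logprobs, advantages)]


def flag_high_covariance(logprobs, advantages, fraction=0.2):
    """Indices of the tokens with the largest covariance terms.

    Assumption: a Clip-Cov-style step would exclude these tokens from the
    policy-gradient update, and a KL-Cov-style step would penalize them.
    """
    terms = covariance_terms(logprobs, advantages)
    k = max(1, round(len(terms) * fraction))
    ranked = sorted(range(len(terms)), key=lambda i: terms[i], reverse=True)
    return sorted(ranked[:k])


# Made-up per-token log-probabilities and advantages for one batch.
logprobs = [-0.1, -2.0, -0.3, -1.5, -0.05]
advantages = [1.0, -1.0, 0.5, -0.5, 1.0]

terms = covariance_terms(logprobs, advantages)
flagged = flag_high_covariance(logprobs, advantages)

# Mask with False for flagged tokens, e.g. to zero out their loss contribution.
keep_mask = [i not in flagged for i in range(len(logprobs))]
```

Selecting only a small fraction of tokens keeps the rest of the update untouched, which fits the "surgery" framing above: the intervention targets a few tokens rather than regularizing the whole policy.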
Sources (7 notes)
Empirical law R = -a·exp(H) + b shows performance saturates when policy entropy approaches zero. Interventions like Clip-Cov, KL-Cov, and GPPO preserve exploratory capacity by managing entropy reduction during training.
Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
7B models develop explicit, transferable belief-tracking under RL, while smaller models achieve comparable accuracy through shortcut learning that lacks interpretable reasoning traces. The mismatch between accuracy and reasoning quality is invisible without inspecting step-by-step outputs.
Research shows LLMs perform well when one model controls all interlocutors but fail systematically when agents possess private information. This reveals that apparent social competence relies on grounding work that models skip in omniscient settings.
CoT guides models to pattern-match reasoning structure rather than perform genuine inference. This explains distribution-bounded failures, why structural coherence matters more than content correctness, and why performance optimizes against interpretability.
DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.
Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.