INQUIRING LINE

Does policy entropy collapse in formal reasoning produce the same outcome in social reasoning?

This explores whether the entropy-collapse failure mode that caps formal reasoning under RL plays out the same way when models are trained to reason about other minds — and the corpus says the failure rhymes but doesn't repeat.


Put concretely: does the mechanism that throttles formal reasoning under reinforcement learning, policy entropy collapsing toward zero, produce the same kind of breakdown in social reasoning? The corpus's answer is that the symptom looks similar, but the underlying cause and the shape of the collapse differ.

In formal reasoning, entropy collapse is almost a physical law. As a model trains with RL, its policy grows more confident and its exploratory range narrows, and performance saturates against a predictable ceiling described by R = -a·exp(H) + b, where R is performance and H is policy entropy. At H = 0 the fit gives R = b - a, which is the ceiling: once entropy is spent, the model stops finding anything new (see "Does policy entropy collapse limit reasoning performance in RL?"). This matters because exploration lives in a small minority of tokens: only about 20% of tokens are the high-entropy 'forking points' where reasoning actually branches, and RLVR (RL with verifiable rewards) mostly tunes those (see "Do high-entropy tokens drive reasoning model improvements?"). Collapse the entropy and you collapse exactly the decisions that carry the learning signal.
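The two quantities in the paragraph above can be made concrete with a small sketch. It evaluates the fit R = -a·exp(H) + b and picks out the highest-entropy positions in a toy sequence. The coefficients `a` and `b`, the toy distributions, and the per-sequence top-20% selection rule are illustrative assumptions, not values or procedures taken from the cited notes.

```python
import math


def predicted_performance(entropy, a, b):
    """Entropy-performance fit R = -a * exp(H) + b (a > 0 assumed)."""
    return -a * math.exp(entropy) + b


def performance_ceiling(a, b):
    """Value of the fit at H = 0, where exp(0) = 1."""
    return b - a


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def forking_token_indices(distributions, fraction=0.2):
    """Indices of the highest-entropy positions (top `fraction` of the sequence).

    Assumption: selection is done per sequence; this is a simplification
    chosen for the sketch.
    """
    entropies = [token_entropy(d) for d in distributions]
    keep = max(1, round(fraction * len(entropies)))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:keep])


# Hypothetical fitted coefficients, for illustration only.
a, b = 0.2, 0.9
for h in (1.0, 0.5, 0.0):
    print(f"H={h:.1f}  R={predicted_performance(h, a, b):.3f}")

# Toy five-token sequence: each row is a next-token distribution.
toy = [
    [0.97, 0.01, 0.01, 0.01],      # confident continuation
    [0.25, 0.25, 0.25, 0.25],      # maximally uncertain: a forking point
    [0.90, 0.05, 0.03, 0.02],
    [0.99, 0.005, 0.003, 0.002],
    [0.85, 0.05, 0.05, 0.05],
]
print(forking_token_indices(toy))
```

Under these assumptions, predicted performance rises as entropy falls and tops out at `b - a`, and the uniform distribution is the only position flagged as a forking point.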

Social reasoning collapses differently: it is gated by model scale, not just by entropy dynamics. When you run RL on theory-of-mind tasks, larger models (around 7B) develop genuine, transferable belief-tracking, while smaller models hit the *same accuracy numbers* through shortcut learning that has no interpretable reasoning behind it (see "Does reinforcement learning on theory of mind collapse with model scale?"). So the dangerous outcome here isn't a visible performance ceiling but an *invisible* one. The accuracy looks fine; the reasoning has quietly hollowed out. You only catch it by inspecting the step-by-step traces. That's a sharper trap than formal collapse, where the ceiling at least announces itself in the numbers.
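One way to picture the invisible ceiling is to score answers and traces separately. The sketch below assumes each evaluation record carries a hypothetical `trace_valid` label, produced by a human or a judge reading the step-by-step output; neither the record format nor the `hollow_gap` metric comes from the cited note.

```python
def audit(records):
    """Compare plain accuracy with accuracy backed by a valid reasoning trace.

    records: list of dicts with boolean keys 'correct' and 'trace_valid'
    (hypothetical labels, assumed to come from trace inspection).
    """
    total = len(records)
    if total == 0:
        return {"accuracy": 0.0, "grounded_accuracy": 0.0, "hollow_gap": 0.0}
    correct = [r for r in records if r["correct"]]
    grounded = [r for r in correct if r["trace_valid"]]
    accuracy = len(correct) / total
    grounded_accuracy = len(grounded) / total
    return {
        "accuracy": accuracy,
        "grounded_accuracy": grounded_accuracy,
        "hollow_gap": accuracy - grounded_accuracy,
    }


# Toy data: both models answer 3 of 4 items correctly,
# but only one of them reasons its way there.
larger_model = [
    {"correct": True, "trace_valid": True},
    {"correct": True, "trace_valid": True},
    {"correct": True, "trace_valid": True},
    {"correct": False, "trace_valid": False},
]
smaller_model = [
    {"correct": True, "trace_valid": True},
    {"correct": True, "trace_valid": False},
    {"correct": True, "trace_valid": False},
    {"correct": False, "trace_valid": False},
]
print(audit(larger_model))
print(audit(smaller_model))
```

In this toy data the two models are indistinguishable on accuracy alone; only the gap between accuracy and trace-backed accuracy separates them.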

There's a second twist unique to the social case: a lot of apparent social competence is an artifact of how the task is set up. Models look skilled when one model secretly controls all the interlocutors, but they fail systematically the moment agents hold private information the model can't see (see "Why do LLMs fail when simulating agents with private information?"). So 'collapse' in social reasoning can mean a model never had the capability to lose: it was skipping the grounding work that real social inference requires. This connects to a broader corpus theme that chain-of-thought is often constrained imitation rather than inference: it reproduces the *form* of reasoning and degrades predictably off-distribution (see "Why does chain-of-thought reasoning fail in predictable ways?" and "Does chain-of-thought reasoning actually generalize beyond training data?").
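The difference between the two task setups comes down to what each simulated agent is allowed to see. A minimal sketch, assuming a hypothetical `build_views` helper and a made-up negotiation scenario (neither appears in the cited note):

```python
def build_views(private_facts, shared_facts, omniscient):
    """Return the facts visible to each agent.

    private_facts: dict mapping agent name -> list of that agent's private facts
    shared_facts:  list of facts every agent can see
    omniscient:    if True, every agent sees everyone's private facts,
                   as when a single model plays all interlocutors
    """
    views = {}
    for name, own in private_facts.items():
        if omniscient:
            visible = [f for facts in private_facts.values() for f in facts]
        else:
            visible = list(own)
        views[name] = list(shared_facts) + visible
    return views


# Hypothetical scenario, for illustration only.
private_facts = {
    "buyer": ["budget is 900"],
    "seller": ["floor price is 700"],
}
shared_facts = ["item: used bicycle"]

omni = build_views(private_facts, shared_facts, omniscient=True)
asym = build_views(private_facts, shared_facts, omniscient=False)
print(omni["buyer"])
print(asym["buyer"])
```

In the omniscient view the buyer already knows the seller's floor price, so no inference about the other party is needed. In the asymmetric view that fact is absent and has to be inferred from the exchange, which is the grounding work the paragraph above describes.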

The through-line worth taking away: in both domains the enemy is premature convergence, the policy settling into confident, narrow behavior, but the *remedy* and the *diagnostic* diverge. Formal reasoning fights collapse with entropy-management surgery (Clip-Cov, KL-Cov) or by injecting richer signal, like natural-language critiques that carry information numerical rewards can't (see "Can natural language feedback overcome numerical reward plateaus?"). Social reasoning needs something the entropy law doesn't measure at all: enough model capacity to represent other minds, and tasks honest enough to require it. Same instinct toward collapse; different floor underneath it.


Sources (7 notes)

Does policy entropy collapse limit reasoning performance in RL?

Empirical law R = -a·exp(H) + b shows performance saturates when policy entropy approaches zero. Interventions like Clip-Cov, KL-Cov, and GPPO preserve exploratory capacity by managing entropy reduction during training.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Does reinforcement learning on theory of mind collapse with model scale?

7B models develop explicit, transferable belief-tracking under RL, while smaller models achieve comparable accuracy through shortcut learning that lacks interpretable reasoning traces. The mismatch between accuracy and reasoning quality is invisible without inspecting step-by-step outputs.

Why do LLMs fail when simulating agents with private information?

Research shows LLMs perform well when one model controls all interlocutors but fail systematically when agents possess private information. This reveals that apparent social competence relies on grounding work that models skip in omniscient settings.

Why does chain-of-thought reasoning fail in predictable ways?

CoT guides models to pattern-match reasoning structure rather than perform genuine inference. This explains distribution-bounded failures, why structural coherence matters more than content correctness, and why performance optimizes against interpretability.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning researcher auditing whether policy entropy collapse — a well-documented bottleneck in formal RL-trained reasoning — generalizes to social reasoning tasks, or whether the mechanism and remedy diverge. Treat the following as dated claims (library findings, 2024–2026) to be stress-tested against newer models, methods, and evals.

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026:
• In formal reasoning, entropy collapse toward zero is 'almost a physical law' under RL; only ~20% of tokens are high-entropy 'forking points' where learning happens, and collapse kills exactly those (2025-05, arXiv:2505.22617; 2025-06, arXiv:2506.01939).
• In social/theory-of-mind tasks, collapse looks different: smaller models hit the same accuracy via shortcut learning with no interpretable reasoning; larger models (~7B) develop genuine belief-tracking — a scale-dependent, invisible ceiling (2025-04, arXiv:2504.01698).
• Social competence is often an artifact of task setup: models succeed when one agent controls all interlocutors, but fail under real information asymmetry (2024-03, arXiv:2403.05020).
• Chain-of-thought is 'constrained imitation rather than inference' and degrades predictably off-distribution (2025-06, arXiv:2506.02878; 2025-08, arXiv:2508.01191).
• Formal reasoning collapse can be partly remedied by natural-language feedback injected alongside numerical rewards (2025-06, arXiv:2506.03106).

Anchor papers (verify; mind their dates):
• arXiv:2505.22617 (2025-05): The Entropy Mechanism of RL for Reasoning LMs
• arXiv:2504.01698 (2025-04): Do ToM Benchmarks Need Explicit Human-like Reasoning?
• arXiv:2506.01939 (2025-06): High-Entropy Minority Tokens Drive Effective RL
• arXiv:2508.01191 (2025-08): Is Chain-of-Thought a Mirage? A Data Distribution Lens

Your task:
(1) RE-TEST EACH CONSTRAINT. For each finding above — entropy-as-law, scale-dependent social collapse, omniscient-agent artifacts, off-distribution degradation, and natural-language remedies — judge whether models trained or evaluated since mid-2026, new orchestration (long-context memory, multi-agent loops, dynamic exploration scheduling), or tighter benchmarks (adversarial info asymmetry, real privacy constraints) have relaxed or overturned it. Plainly separate durable questions (e.g., "Does social reasoning require scaled model capacity?") from perishable claims (e.g., "7B is the threshold"). Cite what relaxed the constraint.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months — papers claiming entropy collapse is NOT the bottleneck, or social reasoning does collapse identically to formal reasoning, or that neither collapse is real.
(3) Propose 2 research questions that ASSUME the regime may have shifted: one testing whether new training objectives (e.g., mutual information, multi-agent gradients) dissolve entropy collapse in BOTH domains, and one asking whether the invisible ceiling in social reasoning can now be detected in-distribution with scaled evaluation.

Cite arXiv IDs; flag anything you cannot ground in a real paper.
