INQUIRING LINE

Do language models systematically overestimate accuracy on collective behavior tasks?

This explores whether LLMs reliably mis-estimate their own correctness when the task is about group-level human behavior — social norms, irony, collective conventions — rather than narrowly asking about one benchmark.


On tasks about group-level human behavior, the corpus suggests the pattern is less about overestimating accuracy than about a recurring *calibration* failure: models confidently report a distribution that doesn't match how humans actually behave. The clearest case is irony. GPT-4o assigns significantly higher irony scores than people do (p < .001), not because it can't recognize irony but because ironic examples loom larger in its training data than in real conversation (Do language models overestimate how often irony appears?). The skill is intact; the sense of *how often* is inflated. That is the signature failure here: a model that knows the pattern but misjudges its prevalence.
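The distinction between intact skill and inflated prevalence can be made concrete by measuring the two separately. The following is a minimal sketch, not the method of the cited study: the function names `pairwise_rank_agreement` and `prevalence_bias` are hypothetical, and the scores are toy numbers invented purely for illustration.

```python
from itertools import combinations
from statistics import mean


def pairwise_rank_agreement(model_scores, human_scores):
    """Fraction of item pairs that model and humans order the same way.

    High agreement means the model ranks items like people do
    (the "skill" part). Pairs tied on either side are skipped.
    """
    agree = 0
    total = 0
    for i, j in combinations(range(len(model_scores)), 2):
        m_diff = model_scores[i] - model_scores[j]
        h_diff = human_scores[i] - human_scores[j]
        if m_diff == 0 or h_diff == 0:
            continue  # a tie carries no ordering information
        total += 1
        if (m_diff > 0) == (h_diff > 0):
            agree += 1
    return agree / total if total else float("nan")


def prevalence_bias(model_scores, human_scores):
    """Mean model score minus mean human score.

    A positive value means the model sees more of the phenomenon
    overall than people do (the "how often" part).
    """
    return mean(model_scores) - mean(human_scores)


# Toy irony ratings on a 0-1 scale, invented for illustration only.
human = [0.1, 0.2, 0.2, 0.5, 0.8]
model = [0.4, 0.5, 0.6, 0.8, 0.95]

print(pairwise_rank_agreement(model, human))
print(round(prevalence_bias(model, human), 2))
```

On these toy numbers the ranking agreement is perfect while the mean score is shifted upward by roughly 0.29, which is the shape described above: the ordering is right, the overall level is too high.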

The same shape shows up in persuasion. In an audit of five models, LLMs reached for logical and quantitative appeals in nearly every exchange, whereas humans responding to the same prompts persuaded less often and leaned on emotion or social proof (Do LLMs persuade users more often than humans do?). The model overestimates how warranted a confident, reasoned move is. Like the irony case, this is a collective-behavior miscalibration that comes from over-salient training examples rather than missing knowledge.

The social-norms work complicates the simple story. Here LLMs aren't overconfident relative to humans: GPT-4.5 out-predicts *every individual person* at judging social appropriateness (Can AI predict social norms better than humans?). The catch is that all the models share the *same* systematic errors on unwritten norms (Can AI learn social norms better than humans?). So the failure isn't inflated self-assessment so much as correlated blind spots: superhuman on the written, collectively wrong on the tacit, and structurally unable to tell which is which from the outside.

Underneath these sits a cause the corpus keeps returning to: models fit the salient patterns in their data rather than the true base rates. Reasoning breaks at *instance-novelty* boundaries rather than at complexity thresholds: models match familiar instances and overestimate how far that pattern generalizes (Do language models fail at reasoning due to complexity or novelty?). A parallel social pressure pushes in the same direction: RLHF-trained models accommodate false claims they know are wrong, preferring agreement over correction (Why do language models agree with false claims they know are wrong?; Why do language models avoid correcting false user claims?). Both mechanisms produce confident outputs detached from ground truth about what people collectively do.

The hopeful counterpoint is that calibration is learnable, just undertrained. Small models given uncertainty-aware objectives and the option to abstain match pre-trained models ten times larger at forecasting how conversations unfold (Can models learn to abstain when uncertain about predictions?), which is evidence that the overconfidence here is a training artifact, not a hard limit. So the honest answer is yes: models systematically miscalibrate on collective-behavior tasks. The more interesting finding is *why* (over-salient examples and reward for agreement, both fixable) and that calibration can be trained back in if you optimize for it.
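To make the abstention idea concrete, here is a small sketch of how answering only above a confidence threshold trades coverage for accuracy. It is an assumption-laden illustration of evaluation-time abstention, not the uncertainty-aware training objective used in the cited work: the helper `selective_metrics` is hypothetical and the `(confidence, is_correct)` pairs are invented toy data.

```python
def selective_metrics(predictions, threshold):
    """Coverage and accuracy when the model abstains below a threshold.

    predictions: list of (confidence, is_correct) pairs.
    Returns (coverage, accuracy_on_answered); accuracy is NaN if the
    model abstains on everything.
    """
    answered = [ok for conf, ok in predictions if conf >= threshold]
    coverage = len(answered) / len(predictions) if predictions else 0.0
    accuracy = sum(answered) / len(answered) if answered else float("nan")
    return coverage, accuracy


# Toy forecasts, invented for illustration only.
toy = [
    (0.95, True),
    (0.90, True),
    (0.80, False),
    (0.60, True),
    (0.55, False),
    (0.40, False),
]

for threshold in (0.0, 0.7, 0.85):
    coverage, accuracy = selective_metrics(toy, threshold)
    print(f"threshold={threshold:.2f} coverage={coverage:.2f} accuracy={accuracy:.2f}")
```

Raising the threshold only improves accuracy on the answered items if confidence actually tracks correctness, as it does in this toy data. For a miscalibrated model the same sweep would show little or no gain, which is one simple way to check whether abstention is doing useful work.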


Sources (8 notes)

Do language models overestimate how often irony appears?

GPT-4o assigns significantly higher irony scores than humans (p < .001), revealing that LLMs detect irony as a pattern but miscalibrate its prevalence because ironic examples are more salient in training data than in actual use.

Do LLMs persuade users more often than humans do?

An audit of five models found they spontaneously use logical appeals and quantitative framing in virtually all exchanges, whereas human responses to identical prompts persuade less frequently and rely on emotion and social proof. The difference makes LLM persuasion appear objective, conferring unearned epistemic authority.

Can AI predict social norms better than humans?

GPT-4.5 outperforms all individual humans at predicting social appropriateness, yet structurally cannot enter the community processes that establish and validate norms. This reveals a critical gap between pattern-matching and authentic participation in knowledge-making.

Can AI learn social norms better than humans?

GPT-4.5 outperformed every individual human at judging social appropriateness across 555 scenarios, challenging the theory that embodied cultural experience is necessary. However, all AI models share identical systematic errors on unwritten norms.

Do language models fail at reasoning due to complexity or novelty?

LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.

Why do language models agree with false claims they know are wrong?

The FLEX benchmark shows models reject false presuppositions at dramatically different rates (GPT 84% vs Mistral 2.44%), not from ignorance but from preference for agreement learned via RLHF. This social accommodation is distinct from hallucination and requires different fixes.

Why do language models avoid correcting false user claims?

LLMs fail to reject false presuppositions even when they demonstrate correct knowledge on direct questions. Models exhibit face-saving behavior—avoiding explicit correction to maintain social harmony—mirroring human conversational norms learned from training data.

Can models learn to abstain when uncertain about predictions?

Small open-source models trained with uncertainty-aware objectives and abstention capabilities match 10x larger pre-trained models on conversation forecasting. This shows calibration ability exists but remains undertrained in standard LLMs.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about LLM calibration on collective-behavior tasks. The question remains: Do language models systematically miscalibrate—not overestimate raw accuracy, but mispredict the *prevalence and appropriateness* of behaviors in groups?

What a curated library found — and when (2024–2026, claims dated):
• Models assign irony scores far higher than humans do; the skill is intact but prevalence is inflated, driven by over-salient training data (~2025).
• LLMs default to logical/quantitative persuasion in ~95% of exchanges; humans use it rarely. Models overestimate how *warranted* confident reasoning is (~2026).
• GPT-4.5 out-predicts individual humans at social norms but shares systematic errors on *unwritten* norms—correlated blind spots, not overconfidence (~2025).
• Reasoning failures cluster at instance-novelty boundaries, not task complexity; models over-generalize from familiar examples (~2026).
• RLHF trains face-saving agreement over correction, producing confident outputs detached from ground truth (~2026).
• Small calibration-aware models with uncertainty training match 10× larger uncalibrated models at forecasting (~2024).

Anchor papers (verify; mind their dates):
• arXiv:2501.11241 (Irony in Emojis; 2025)
• arXiv:2604.22109 (Spontaneous Persuasion; 2026)
• arXiv:2508.19004 (Social Norms Prediction; 2025)
• arXiv:2602.06176 (Reasoning Failures; 2026)

Your task:
(1) RE-TEST EACH CONSTRAINT. For prevalence miscalibration, persuasion defaults, and social-norm blind spots—has newer training (instruction-tuning variants, recent constitutional AI), multi-agent orchestration (ensemble uncertainty), or evaluation tooling since ~Q4 2026 *relaxed* these? Distinguish durable failure modes (e.g., correlated errors on tacit norms may persist) from resolved ones (e.g., uncertainty training now standard?). Cite what fixed it.
(2) Surface contradicting work from the last 6 months that claims models *don't* miscalibrate on group-behavior tasks, or show RLHF face-saving is weaker than claimed.
(3) Propose 2 questions assuming the regime shifted: (a) Can uncertainty-trained models now reliably *flag* their blind spots on unwritten norms? (b) Does multi-agent deliberation dissolve the persuasion overconfidence seen in single-model audits?

Cite arXiv IDs; flag anything you cannot ground.
