INQUIRING LINE

A model that only predicts what people say can't tell you why they believe it, or what would change their mind.

What cognitive structures do realistic belief models need to include?

This explores what a belief model needs inside it to feel realistic — not just predicting what people say or do, but representing the machinery underneath: the kinds of links, uncertainty, and structure that produce belief in the first place.

This question reads as: if you wanted to model a person's beliefs faithfully — for social simulation, therapy training, or persuasion research — what cognitive ingredients can't you leave out? The corpus converges on a sharp answer: behavior alone isn't enough, and neither is causality alone. A realistic belief model needs internal reasoning structure, multiple kinds of links between ideas, and a way to hold uncertainty.

The first move is to reject behaviorism. Current LLM agents produce plausible outputs without any internal model of why a person believes what they believe, so their simulated belief changes cannot be traced and cannot answer counterfactual questions: you can't ask "what if she'd seen different evidence?" (see "Can language models simulate belief change in people?"). The same gap shows up when LLMs are tested on perspective-taking: they default to surface strategies rather than genuinely tracking what someone else believes, and architectures that force explicit belief tracking beat LLM-alone approaches, which suggests the missing piece is structural rather than a matter of more training (see "Do large language models genuinely simulate mental states?").
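To make "traceable" and "counterfactual" concrete, here is a minimal sketch of a single belief updated by evidence in log-odds form. It is not taken from any of the cited systems: the `Evidence` class, the `update` function, and the likelihood-ratio numbers are all hypothetical, and a single binary belief with independent evidence is a strong simplifying assumption.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    """One piece of evidence (hypothetical structure, for illustration only)."""
    label: str
    # P(evidence | belief true) / P(evidence | belief false); values are made up.
    likelihood_ratio: float


def update(prior: float, evidence: list[Evidence]) -> tuple[float, list[tuple[str, float]]]:
    """Bayesian update of one binary belief, assuming independent evidence.

    Returns the posterior and a trace of how much each item shifted the log-odds.
    """
    log_odds = math.log(prior / (1 - prior))
    trace = []
    for item in evidence:
        shift = math.log(item.likelihood_ratio)
        log_odds += shift
        trace.append((item.label, shift))  # the "why" behind the final belief
    posterior = 1 / (1 + math.exp(-log_odds))
    return posterior, trace


if __name__ == "__main__":
    seen = [Evidence("friend's anecdote", 4.0), Evidence("news story", 2.0)]
    # Counterfactual: same person, but the second item is swapped for different evidence.
    what_if = [Evidence("friend's anecdote", 4.0), Evidence("contradicting study", 0.25)]

    for name, items in [("actual", seen), ("counterfactual", what_if)]:
        posterior, trace = update(0.5, items)
        print(name, round(posterior, 3), trace)
```

Because the trace records each item's contribution, the counterfactual question becomes a second call with different inputs. A model that only maps prompts to outputs has no equivalent handle.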

But what should that structure contain? Causal belief networks are a tempting answer and a good start, yet they capture only one slice of how people reason. Real beliefs also shift through associative links (this reminds me of that), analogical mappings (this is like that other situation), and raw emotion, none of which a pure causal graph represents (see "Can causal models alone capture how humans actually reason?"). So a realistic model needs heterogeneous link types, not a single clean logic. It also needs to tolerate ambiguity: people hold competing hypotheses at once, and modeling that requires representing distributions over beliefs rather than one fixed answer. This is the kind of stochastic, multiple-possibility reasoning that deterministic designs can't express (see "Can stochastic latent reasoning let models explore multiple solutions?").
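As a rough illustration of these two requirements, the sketch below stores typed links between ideas and a weighted set of competing hypotheses. Every name here (`LinkType`, `Link`, `BeliefModel`, `build_example`) and every number is an assumption made for the example; none of it comes from GenMinds, GRAM, or any other cited work.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class LinkType(Enum):
    CAUSAL = "causal"
    ASSOCIATIVE = "associative"
    ANALOGICAL = "analogical"
    EMOTIONAL = "emotional"


@dataclass(frozen=True)
class Link:
    source: str
    target: str
    kind: LinkType
    strength: float  # illustrative weight in [0, 1]


@dataclass
class BeliefModel:
    links: list[Link] = field(default_factory=list)
    # Unnormalized weights over competing hypotheses held at the same time.
    hypotheses: dict[str, float] = field(default_factory=dict)

    def influences(self, target: str, kinds: Optional[set[LinkType]] = None) -> list[Link]:
        """Links pointing at `target`, optionally restricted to some link types."""
        return [
            link for link in self.links
            if link.target == target and (kinds is None or link.kind in kinds)
        ]

    def distribution(self) -> dict[str, float]:
        """Normalize hypothesis weights into a probability distribution."""
        total = sum(self.hypotheses.values())
        if total <= 0:
            raise ValueError("hypothesis weights must sum to a positive number")
        return {name: weight / total for name, weight in self.hypotheses.items()}


def build_example() -> BeliefModel:
    target = "the new policy will raise rents"
    return BeliefModel(
        links=[
            Link("less new housing gets built", target, LinkType.CAUSAL, 0.7),
            Link("memory of a past landlord dispute", target, LinkType.ASSOCIATIVE, 0.4),
            Link("fear of eviction", target, LinkType.EMOTIONAL, 0.6),
        ],
        hypotheses={"rents rise": 3.0, "rents stay flat": 1.0, "rents fall": 1.0},
    )


if __name__ == "__main__":
    model = build_example()
    target = "the new policy will raise rents"
    print(len(model.influences(target)))                     # all link types
    print(len(model.influences(target, {LinkType.CAUSAL})))  # causal-only view
    print(model.distribution())
```

Filtering to `LinkType.CAUSAL` shows what a causal-only graph would drop: the associative and emotional links disappear from view even though they still bear on the belief.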

The strongest practical evidence that explicit cognitive scaffolding pays off comes from therapy simulation: PATIENT-Ψ wires 106 structured cognitive models (built on Beck's cognitive conceptualization diagram, or CCD) into an LLM, and expert clinicians rate the result as more authentic than GPT-4 alone, especially for maladaptive belief patterns (see "Can structured cognitive models improve LLM patient simulations for therapy training?"). A parallel result in visual-social reasoning shows that staging cognition explicitly (perception, then situation, then norms) beats simply generating more text, because the structure itself is what helps (see "Can breaking down visual reasoning into three stages improve model performance?").

Two cautionary notes sharpen the picture. First, surface form can masquerade as reasoning: logically invalid chains-of-thought perform almost as well as valid ones, meaning a model can learn the look of inference without the substance. A belief model therefore has to encode genuine inferential links, not just plausible-sounding traces (see "Does logical validity actually drive chain-of-thought gains?"). Second, beliefs include what people wrongly take for granted: LLMs routinely accept false presuppositions even when they demonstrably know better, which means a faithful model must represent not only what someone believes but also the unexamined assumptions they accommodate (see "Why do language models accept false assumptions they know are wrong?"). Put together, the corpus sketches a realistic belief model as one with traceable internal reasoning, multiple link types beyond causality, explicit uncertainty, grounded cognitive templates, and a place for the assumptions people never question.

Sources 8 notes

Can language models simulate belief change in people?

LLM agents remain stuck in behaviorism, producing plausible outputs without internal reasoning structures. Modeling belief networks and reasoning traces enables traceability, counterfactual adaptation, and meaningful policy simulation.

Do large language models genuinely simulate mental states?

ChangeMyView and FANTOM benchmarks show LLMs fail at authentic perspective-taking in open-ended scenarios, despite succeeding on structured tasks. Hybrid Bayesian architectures that force explicit belief tracking outperform LLM-alone approaches, suggesting the gap is architectural rather than merely training-based.

Can causal models alone capture how humans actually reason?

Causal belief networks excel at modeling causal reasoning but cannot represent associative links, analogical mappings, or emotion-driven belief shifts. The GenMinds framework itself acknowledges this as a tractable starting point rather than a complete theory.

Can stochastic latent reasoning let models explore multiple solutions?

GRAM replaces deterministic latent updates with stochastic sampling, enabling models to represent probability distributions over solutions rather than single points. This lets recursive reasoners maintain uncertainty, explore alternatives, and handle ambiguous or multi-solution problems that deterministic single-path designs cannot.

Can structured cognitive models improve LLM patient simulations for therapy training?

PATIENT-Ψ integrates 106 Beck CCD-based cognitive models with LLMs to simulate patients with specific maladaptive patterns. Expert evaluators rated the fidelity higher than GPT-4, particularly for maladaptive cognitions and conversational authenticity.

Can breaking down visual reasoning into three stages improve model performance?

CoCoT structures VLM reasoning through embodied perception, embedded situation analysis, and norm-grounded interpretation, achieving +8% improvement over flat CoT on social benchmarks. The gains suggest cognitive structure matters more than reasoning volume for social tasks.

Does logical validity actually drive chain-of-thought gains?

Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.

Why do language models accept false assumptions they know are wrong?

The FLEX Benchmark shows that models reject false presuppositions at rates far below acceptable levels (GPT-4: 84%, Mistral: 2.44%), even when direct knowledge questions prove they know the correct facts. False presuppositions drive more accommodation than correct knowledge drives rejection.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens (2.55 match, arXiv)
Mind Your Step (by Step): Chain-of-Thought can Reduce Performance on Tasks where Thinking Makes Humans Worse (1.73 match, arXiv)
Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey (1.71 match, arXiv)
Simulating Society Requires Simulating Thought (1.69 match, arXiv)
LLM Reasoning Is Latent, Not the Chain of Thought (1.66 match, arXiv)
Hypothesis-Driven Theory-of-Mind Reasoning for Large Language Models (1.62 match, arXiv)
A Comment On "The Illusion of Thinking": Reframing the Reasoning Cliff as an Agentic Gap (1.59 match, arXiv)
Cognitive Chain-of-Thought: Structured Multimodal Reasoning about Social Situations (0.91 match, arXiv)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a cognitive modeling researcher re-testing claims about what belief-modeling systems must include. The question remains open: what cognitive structures do realistic belief models need?

What a curated library found — and when (findings span 2023–2026, dated claims):
• Behavior alone fails: LLM agents produce outputs without traceable internal reasoning about *why* beliefs form, blocking counterfactual reasoning (2025).
• Explicit belief-tracking architectures outperform LLM-alone on perspective tasks, suggesting structure, not scale, is the bottleneck (2025).
• Causal graphs capture only one thread; real reasoning also uses associative, analogical, and emotional links — heterogeneous link types matter (2024–2025).
• Stochastic, multi-hypothesis models beat deterministic ones at holding competing beliefs and uncertainty (2026).
• Structured cognitive scaffolds (Beck's distortion framework wired into LLMs) yield ~85% clinician-rated authenticity vs. GPT-4 alone on maladaptive patterns; staged cognition (perception→situation→norms) similarly outperforms end-to-end text generation (2024–2025).
• Logically invalid chains-of-thought perform ~90% as well as valid ones, so surface form masks reasoning gaps (2023).
• LLMs accept false presuppositions even with knowledge present, signaling unexamined assumption slots are invisible in flat models (2025).

Anchor papers (verify; mind their dates):
• arXiv:2506.06958 *Simulating Society Requires Simulating Thought* (2025)
• arXiv:2405.19660 *PATIENT-Ψ* (2024) — structured therapy sim
• arXiv:2507.20409 *Cognitive Chain-of-Thought* (2025) — multimodal scaffolding
• arXiv:2605.19376 *Generative Recursive Reasoning* (2026)

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer models (o1, Claude opus variants, GPT-4.5+), training methods (process reward models, belief-annotation fine-tuning), or evaluation tooling (belief-consistency harnesses, counterfactual probes) have since relaxed or overturned it. Separate the durable question ("Does reasoning need structure?") from perishable limitations ("Current LLMs can't do X without scaffolding"). Cite what resolved the gap.
(2) Surface the strongest contradicting or superseding work from the last 6 months — papers showing either that belief models *can* emerge from unstructured scaling, or that a simpler ingredient set suffices.
(3) Propose 2 research questions that assume the regime *may* have moved: e.g., "Can process-reward-trained LLMs recover lost counterfactual reasoning without explicit belief states?" and "What's the minimal Kolmogorov complexity of a belief structure that beats causal-only networks on real human data?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
