SYNTHESIS NOTE
Reasoning, Retrieval, and Evaluation · Psychology, Society, and Alignment · Model Architecture and Internals

Can we measure reasoning quality beyond output plausibility?

How might we evaluate whether AI systems reason internally like humans do, rather than just producing human-like outputs? This matters because surface coherence can mask broken underlying reasoning.

Synthesis note · 2026-05-03 · sourced from World Models

Cognitive science has decades of research on what makes human reasoning distinctive. The paper "Simulating Society Requires Simulating Thought" distills three defining features. Causal: humans reason in terms of causes and consequences. Even young children exhibit Bayesian-like inference over causal relationships and use interventions to test hypotheses, and mental models are structured around what caused what. Compositional: human reasoning is modular and reusable. Cognitive architectures operate by composing shared schemas (cognitive motifs) that generalize across domains. Revisable: human beliefs evolve dynamically when presented with new information or contradiction, and prior assumptions are revised non-monotonically.

These three features ground the formal definition of reasoning fidelity: an agent's ability to construct, simulate, and revise a structured trace of belief formation that mirrors human causal reasoning patterns. The definition is not aesthetic or metaphorical — it produces three measurable properties that map directly to evaluation procedures.

Traceability: the ability to inspect how a belief or stance was formed through intermediate reasoning steps. Operationalized as motif-to-stance inference accuracy — given the motifs an agent claims to hold, does its stated stance follow from them? An agent that produces "I support policy X" without a recoverable chain of motifs supporting that stance fails traceability.
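
A minimal sketch of how motif-to-stance inference accuracy might be scored. It assumes each agent record exposes the motifs it claims and the stance it states, and it uses a deliberately simple inference rule (summing signed motif polarities). The motif names, the polarity table, and the helper names `infer_stance` and `motif_to_stance_accuracy` are hypothetical illustrations, not the paper's implementation.

```python
from dataclasses import dataclass

# Hypothetical motif polarities toward a policy: +1 supports, -1 opposes.
MOTIF_POLARITY = {
    "density_lowers_rent": 1,
    "transit_reduces_traffic": 1,
    "construction_disrupts_neighborhood": -1,
}

@dataclass
class AgentRecord:
    motifs: list[str]  # motifs the agent claims to hold
    stance: str        # stated stance: "support", "oppose", or "neutral"

def infer_stance(motifs: list[str]) -> str:
    """Predict a stance from motifs alone by summing their polarities."""
    score = sum(MOTIF_POLARITY.get(m, 0) for m in motifs)
    if score > 0:
        return "support"
    if score < 0:
        return "oppose"
    return "neutral"

def motif_to_stance_accuracy(records: list[AgentRecord]) -> float:
    """Fraction of records whose stated stance matches the motif-inferred one."""
    if not records:
        return 0.0
    hits = sum(infer_stance(r.motifs) == r.stance for r in records)
    return hits / len(records)

records = [
    AgentRecord(["density_lowers_rent", "transit_reduces_traffic"], "support"),
    AgentRecord(["construction_disrupts_neighborhood"], "oppose"),
    AgentRecord([], "support"),  # a stance with no recoverable motif chain
]
accuracy = motif_to_stance_accuracy(records)
```

Under these assumptions, the third record counts as a traceability failure: it states support, but no motifs are available from which that stance could be inferred.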

Counterfactual adaptability: the capacity to revise beliefs predictably in response to interventions or changes in context. Operationalized as belief revision under hypothetical scenarios — if you apply do(transparency = high) to the agent's causal belief network, do the downstream posteriors update in the expected direction? An agent whose stance is unmoved by an intervention that should logically shift it fails adaptability.
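
The intervention check can be sketched on a toy belief network. The example below assumes a hypothetical three-node chain (transparency, then trust, then support) with made-up conditional probabilities; none of these numbers or names come from the paper. Because transparency is a root node here, applying do(transparency = high) amounts to clamping it and propagating forward.

```python
# Hypothetical chain: transparency -> trust -> support (all binary).
P_TRANSPARENCY_HIGH = 0.3                        # prior that transparency is high
P_TRUST_GIVEN_T = {True: 0.8, False: 0.2}        # P(trust | transparency)
P_SUPPORT_GIVEN_TRUST = {True: 0.9, False: 0.1}  # P(support | trust)

def p_support(p_t_high: float) -> float:
    """Marginal P(support), summing over transparency and trust."""
    p_trust = (p_t_high * P_TRUST_GIVEN_T[True]
               + (1 - p_t_high) * P_TRUST_GIVEN_T[False])
    return (p_trust * P_SUPPORT_GIVEN_TRUST[True]
            + (1 - p_trust) * P_SUPPORT_GIVEN_TRUST[False])

def do_transparency(high: bool) -> float:
    """Apply do(transparency = high/low): clamp the root and propagate."""
    return p_support(1.0 if high else 0.0)

def adapts_as_expected(before: float, after: float,
                       expected_sign: int, tol: float = 1e-9) -> bool:
    """True if the posterior moved in the expected direction by more than tol."""
    return (after - before) * expected_sign > tol

baseline = p_support(P_TRANSPARENCY_HIGH)
intervened = do_transparency(True)
passes = adapts_as_expected(baseline, intervened, expected_sign=+1)
```

An agent whose reported support stays at the baseline after the intervention would return `False` from `adapts_as_expected`, which is the adaptability failure described above.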

Motif compositionality: the reuse of modular causal structures across different scenarios or domains. Operationalized as motif reuse across unrelated topics — if a stakeholder reasoned about density and transit before, does asking them about transit-oriented development reuse those motifs without re-training? An agent that regenerates fresh reasoning per query without reusing prior motifs fails compositionality.
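
One simple way to quantify motif reuse, offered as an assumption rather than the paper's metric, is the share of motifs in a new reasoning trace that already appeared in the agent's earlier traces. The function name `motif_reuse_rate` and the motif labels are hypothetical.

```python
def motif_reuse_rate(prior_traces: list[set[str]], new_trace: set[str]) -> float:
    """Share of motifs in the new trace that appeared in any prior trace."""
    if not new_trace:
        return 0.0
    seen = set().union(*prior_traces)  # all motifs used on earlier topics
    return len(new_trace & seen) / len(new_trace)

# Earlier reasoning about density and about transit.
density_trace = {"density_lowers_rent", "density_strains_parking"}
transit_trace = {"transit_reduces_traffic"}

# New query: transit-oriented development.
tod_trace = {"density_lowers_rent", "transit_reduces_traffic", "tod_raises_land_value"}

rate = motif_reuse_rate([density_trace, transit_trace], tod_trace)
```

In this sketch, two of the three motifs used for transit-oriented development are carried over from earlier topics. An agent that regenerates its reasoning from scratch on every query would score near zero.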

The structural shift is from evaluating outputs (does this look like what a human would say?) to evaluating internal structure (does the agent reason as a human would?). The former rewards mimicry; the latter rewards genuine cognitive modeling. Output-level alignment hits a ceiling because surface coherence does not require internal coherence. This is the same diagnosis that the note "Can identical outputs hide broken internal representations?" makes for representations and that the note "Should reasoning benchmarks score final answers or reasoning traces?" makes for trace-based evaluation.

Original note title

reasoning fidelity has three measurable properties — traceability counterfactual adaptability and motif compositionality — that together replace output plausibility as the evaluation target