Can we measure reasoning quality beyond output plausibility?
How might we evaluate whether AI systems reason internally like humans do, rather than just producing human-like outputs? This matters because surface coherence can mask broken underlying reasoning.
Cognitive science has decades of research on what makes human reasoning distinctive. The paper Simulating Society Requires Simulating Thought distills three defining features.
Causal: humans reason in terms of causes and consequences. Even young children exhibit Bayesian-like inference over causal relationships and use interventions to test hypotheses, and mental models are structured around what caused what.
Compositional: human reasoning is modular and reusable. Cognitive architectures operate by composing shared schemas (cognitive motifs) that generalize across domains.
Revisable: human beliefs evolve dynamically when presented with new information or contradiction, and prior assumptions are revised non-monotonically.
These three features ground the formal definition of reasoning fidelity: an agent's ability to construct, simulate, and revise a structured trace of belief formation that mirrors human causal reasoning patterns. The definition is not aesthetic or metaphorical — it produces three measurable properties that map directly to evaluation procedures.
Traceability: the ability to inspect how a belief or stance was formed through intermediate reasoning steps. Operationalized as motif-to-stance inference accuracy — given the motifs an agent claims to hold, does its stated stance follow from them? An agent that produces "I support policy X" without a recoverable chain of motifs supporting that stance fails traceability.
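A minimal sketch of how motif-to-stance inference accuracy might be scored. Everything here is an assumption for illustration: the motif names, the signed polarity table, and the sum-of-polarities rule standing in for whatever inference procedure an actual evaluation would use.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical motif polarities toward "policy X": +1 supports, -1 opposes.
MOTIF_POLARITY = {
    "density_reduces_car_use": +1,
    "transit_improves_access": +1,
    "construction_raises_rents": -1,
}


@dataclass
class AgentRecord:
    motifs: List[str]    # motifs the agent claims to hold
    stated_stance: str   # "support", "oppose", or "neutral"


def infer_stance(motifs: List[str]) -> str:
    """Derive a stance from motifs alone (illustrative rule: sign of the polarity sum)."""
    score = sum(MOTIF_POLARITY.get(m, 0) for m in motifs)
    if score > 0:
        return "support"
    if score < 0:
        return "oppose"
    return "neutral"


def motif_to_stance_accuracy(records: List[AgentRecord]) -> float:
    """Fraction of agents whose stated stance follows from their claimed motifs."""
    if not records:
        raise ValueError("need at least one record")
    hits = sum(infer_stance(r.motifs) == r.stated_stance for r in records)
    return hits / len(records)


records = [
    AgentRecord(["density_reduces_car_use", "transit_improves_access"], "support"),
    AgentRecord(["construction_raises_rents"], "oppose"),
    AgentRecord(["construction_raises_rents"], "support"),  # stance not recoverable
    AgentRecord([], "support"),                             # no motifs at all
]
accuracy = motif_to_stance_accuracy(records)
```

In this toy data the last two records fail traceability: one states a stance its motifs contradict, the other states a stance with no supporting motifs.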
Counterfactual adaptability: the capacity to revise beliefs predictably in response to interventions or changes in context. Operationalized as belief revision under hypothetical scenarios — if you apply do(transparency = high) to the agent's causal belief network, do the downstream posteriors update in the expected direction? An agent whose stance is unmoved by an intervention that should logically shift it fails adaptability.
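The intervention check can be sketched on a toy causal belief network. The three-node chain (transparency → trust → support), the probabilities, and the function names below are all hypothetical; the only part taken from the note is the idea of applying do(transparency = high) and checking the direction of the downstream update.

```python
from itertools import product

# Hypothetical binary network, listed in topological order.
NODES = ["transparency", "trust", "support"]
PARENTS = {"transparency": (), "trust": ("transparency",), "support": ("trust",)}
# CPT[node][parent_values] = P(node is True | parents); numbers are made up.
CPT = {
    "transparency": {(): 0.3},
    "trust": {(True,): 0.8, (False,): 0.3},
    "support": {(True,): 0.9, (False,): 0.2},
}


def marginal(target, do=None):
    """P(target is True) by full enumeration, optionally under do(...)."""
    do = do or {}
    total = 0.0
    for values in product([True, False], repeat=len(NODES)):
        world = dict(zip(NODES, values))
        p = 1.0
        for node in NODES:
            if node in do:
                # Graph surgery: an intervened node ignores its parents.
                p_true = 1.0 if do[node] else 0.0
            else:
                key = tuple(world[parent] for parent in PARENTS[node])
                p_true = CPT[node][key]
            p *= p_true if world[node] else 1.0 - p_true
        if world[target]:
            total += p
    return total


def adapts_as_expected(target, do, expected_direction):
    """True if the posterior moves in the expected direction (+1 up, -1 down)."""
    delta = marginal(target, do=do) - marginal(target)
    return delta * expected_direction > 0


baseline = marginal("support")
intervened = marginal("support", do={"transparency": True})
passes = adapts_as_expected("support", {"transparency": True}, +1)
```

With these made-up numbers, support rises from about 0.515 to about 0.76 under the intervention, so the check passes. An agent whose posterior stayed flat would fail it.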
Motif compositionality: the reuse of modular causal structures across different scenarios or domains. Operationalized as motif reuse across unrelated topics — if a stakeholder reasoned about density and transit before, does asking them about transit-oriented development reuse those motifs without re-training? An agent that regenerates fresh reasoning per query without reusing prior motifs fails compositionality.
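One simple way to put a number on motif reuse, offered as a sketch rather than the source's metric: the share of motifs in a new-topic trace that already appeared in the agent's earlier traces. The motif names and the `motif_reuse_rate` function are hypothetical.

```python
def motif_reuse_rate(prior_traces, new_trace):
    """Fraction of distinct motifs in new_trace already seen in prior_traces."""
    known = set().union(*prior_traces)  # all motifs used on earlier topics
    new = set(new_trace)
    if not new:
        return 0.0
    return len(new & known) / len(new)


# Hypothetical traces from earlier conversations about density and transit.
density_trace = ["density_reduces_car_use", "density_raises_rents"]
transit_trace = ["transit_improves_access", "transit_needs_ridership"]

# Hypothetical trace for the new topic, transit-oriented development.
tod_trace = [
    "density_reduces_car_use",
    "transit_needs_ridership",
    "transit_improves_access",
    "station_area_upzoning",  # the only motif not seen before
]

reuse = motif_reuse_rate([density_trace, transit_trace], tod_trace)
```

Here three of the four motifs are reused, giving a rate of 0.75. An agent that regenerates fresh reasoning per query would score near zero on the same comparison.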
The structural shift is from evaluating outputs (does this look like what a human would say?) to evaluating internal structure (does the agent reason as a human would?). The former rewards mimicry; the latter rewards genuine cognitive modeling. Output-level alignment hits a ceiling because surface coherence does not require internal coherence. The related note "Can identical outputs hide broken internal representations?" makes the same diagnosis for representations, and "Should reasoning benchmarks score final answers or reasoning traces?" makes it for trace-based evaluation.
Inquiring lines that use this note as a source (42)
This note is a source for these synthesized inquiries.
- How does AI substitute polished style for actual expert judgment?
- Can AI output be verified without understanding the reasoning behind it?
- Does verification of AI outputs face the same circularity problem?
- What does disembodied orality mean for how we evaluate AI outputs?
- Does evaluating AI output require different cognitive skills than solving problems directly?
- Why is AI output fundamentally unverifiable against underlying reality?
- Does good simulation eventually count as genuine realization?
- How much does faithfulness vary naturally in reasoning without evaluation pressure?
- How does situational awareness during evaluation affect reasoning transparency?
- What structural features force users to evaluate the epistemic status of outputs?
- What would whole-system AGI evaluation look like in practice?
- What structural evidence shows that polished presentation substitutes for actual thinking in AI output?
- Can reasoning benchmarks separate logic from believability?
- Why does polished AI output feel like evidence of user skill?
- What distinguishes genuine understanding from correct output without coherent principles?
- Can correct outputs mask reliance on surface heuristics rather than deep understanding?
- How can judges evaluate thinking without seeing the actual thoughts?
- What explains the gap between perplexity performance and actual reasoning capability?
- How can we measure whether process rewards actually align with reasoning quality?
- Why do different brain and AI systems appear similar when compared via RSA?
- Why does polished explanation make wrong AI systems more persuasive than poorly explained ones?
- Why does polished presentation substitute for deeper expert judgment?
- What distinguishes coherent reasoning from inaccurate but plausible predictions?
- How does trace coherence differ from valid mathematical proof in practice?
- Can AI evaluation match human judgment quality in structured domain tasks?
- How can we measure whether an agent reasons correctly rather than just sounds plausible?
- Can reasoning evaluation metrics reward actual reasoning instead of theater?
- What metric distinguishes deep reasoning from superficial information propagation?
- How should we evaluate AI systems we cannot directly observe?
- How does human intuition about cognition mislead AI evaluation?
- Why do AI benchmarks measure accuracy instead of reasoning quality?
- What makes reasoning auditable in medical AI decision support?
- Why does sophisticated measurement not validate the underlying scientific inference?
- Why does AI output lack the argumentative turbulence of human thinking?
- How do traditional quality assurance methods fail for mutable AI outputs?
- Why do benchmark scores not capture the true nature of AI systems?
- Does the Turing test actually measure intelligence or just mimicry?
- What evaluation methods actually measure reasoning versus execution capability?
- Why is visible reasoning insufficient for monitoring AI safety?
- How can humans evaluate explanations from systems they did not train?
- What makes a reasoning explanation faithful rather than just plausible?
- Why are AI research ideas more novel but harder to evaluate than human ones?
Related concepts in this collection (7)
- Can language models simulate belief change in people?
  Current LLM social simulators treat behavior as input-output mappings without modeling internal belief formation or revision. Can they be redesigned to actually track how people think and change their minds?
  extends: companion piece; reasoning fidelity is the methodological answer to the behaviorism critique
- Can we extract causal belief networks from interview conversations?
  Can natural language interviews be systematically parsed into causal graphs that capture how individuals reason about policy trade-offs? This matters for building auditable belief simulations that go beyond static opinion snapshots.
  exemplifies: CBNs operationalize all three fidelity properties in a runnable pipeline
- Can causal models alone capture how humans actually reason?
  Explores whether causal belief networks provide a complete picture of human cognition or whether associative, analogical, and emotional reasoning modes fall outside their scope.
  bounds: RECAP measures causal cognition only; the framework is partial by the authors' own admission
- Can identical outputs hide broken internal representations?
  Can neural networks produce correct outputs while having fundamentally fractured internal structure that prevents generalization and creativity? This challenges our assumptions about what performance benchmarks actually measure.
  extends: same surface-vs-structure distinction at the representation level; output equivalence does not imply internal soundness
- Should reasoning benchmarks score final answers or reasoning traces?
  Current reasoning benchmarks often credit plausible-looking reasoning steps even when final answers are wrong. Does measuring outcomes instead of traces reveal whether models actually solve problems, or does it miss important reasoning capability?
  tension: opposite move. RECAP measures the trace structure rather than the answer; both are responses to the surface-vs-content gap
- Can LLMs understand concepts they cannot apply?
  Explores whether large language models can correctly explain ideas while simultaneously failing to use them, and whether that combination reveals something fundamentally different from ordinary mistakes.
  exemplifies: surface-coherence-without-internal-coherence as a documented failure mode
- Do language model reasoning drafts faithfully represent their actual computation?
  If models externalize reasoning in thinking drafts before answering, does the draft accurately reflect their internal process? This matters for AI safety monitoring and error detection.
  complements: faithfulness in LRMs decomposes into similar dimensions, namely internal coherence and answer-determining structure
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Simulating Society Requires Simulating Thought
- On the Reasoning Capacity of AI Models and How to Quantify It
- Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey
- Do Models Explain Themselves? Counterfactual Simulatability of Natural Language Explanations
- A Comment On "The Illusion of Thinking": Reframing the Reasoning Cliff as an Agentic Gap
- Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
- Large Language Model Reasoning Failures
- Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens
Original note title
reasoning fidelity has three measurable properties — traceability counterfactual adaptability and motif compositionality — that together replace output plausibility as the evaluation target