Can AI distinguish which differences actually matter?
Explores whether AI systems can perform the qualitative judgment that experts use to select relevant observations. Matters because confusing AI outputs with expert observation leads users to trust pattern-matching as if it were reasoning about what's important.
Gregory Bateson defined information as "a difference which makes a difference." This deceptively simple formulation captures something essential about expertise that AI cannot perform: the act of selecting which differences matter.
When an expert observes a situation — a patient's symptoms, a market trend, a structural flaw in an argument — they are performing an act of qualitative selection. From the vast space of possible observations, they choose the ones that matter. This selection is not pattern-matching. It is judgment: the expert perceives differences and decides which ones make a difference to the problem at hand. The observation that makes a difference is an action of communication — it reports to the system (the community, the audience, the field) a change that moves understanding forward.
AI systems operate in a fundamentally different register. As discussed in "Do foundation models learn world models or task-specific shortcuts?", LLMs develop statistical heuristics tuned to pattern frequency, not to relevance. They can find patterns, connections, concepts, probabilities, and thresholds. But the differences that make a difference to an LLM are mathematical: quantitative, not qualitative. An LLM cannot decide that one pattern matters more than another in a way that reflects understanding of the domain. It can only decide that one pattern is more probable than another given its training distribution.
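The contrast can be made concrete with a toy sketch. It is not how any particular model is implemented: real LLMs score tokens rather than whole findings, and the candidate names and score values below are hypothetical, chosen only for illustration. What it shows is that a probability-based selection rule takes scores as its only input.

```python
import math


def softmax(scores):
    """Turn raw scores into a probability distribution over candidates."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {name: math.exp(value - peak) for name, value in scores.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}


def most_probable(scores):
    """Select the candidate with the highest probability, and nothing else."""
    probs = softmax(scores)
    return max(probs, key=probs.get)


# Hypothetical scores standing in for how often each pattern appeared in
# training data. These values are assumptions made up for the example.
candidate_scores = {
    "common finding": 3.0,
    "typical finding": 2.5,
    "rare but decisive finding": 0.5,
}

choice = most_probable(candidate_scores)
print(choice)
```

The selection rule sees only the numbers. Nothing in its inputs says which candidate matters to the problem at hand, so the rare but decisive finding loses to the common one simply because it is less frequent.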
This is the observer problem. Knowledge is observation: it is information "about," "relevant for," "reasonable because," and "relevant to." These are conceptual connections whereby knowledge functions as a map to a territory. The expert is an observer system: they observe the needs of an audience and the state of knowledge, and they apply that observation when making recommendations. Crucially, the expert can engage in self-observation, deliberately shaping their expertise to ensure it is suitable and relevant.
AI is not an observer. It generates responses from prompts. It has no observations of a state, whether of knowledge, information, the user, an audience, or other context. As discussed in "Should we call LLM errors hallucinations or fabrications?", this absence of observation is precisely what makes AI output fabrication: it produces text that has the form of observation without the epistemic process of observing.
The practical consequences are significant. Many users, including experts, do not have a mental model appropriate for LLMs. When experts make observations, they are being subjective in the productive sense: applying reason and judgment to information in order to choose what is important and relevant. As discussed in "Why do people trust AI outputs they shouldn't?", users interpret AI outputs through the same cognitive frameworks they use for human expert observations. But the outputs were produced by a different process entirely, one that mimics the form of observation without performing the selection that gives observation its value.
This connects to a deeper theoretical point about what LLMs can and cannot do with internal evaluation. As discussed in "Can LLMs generate more novel ideas than human experts?", the generative capacity of LLMs is not matched by evaluative capacity. They can produce more options than any human expert, but they cannot determine which options matter. The "differences that make a difference" are invisible to a system that operates on statistical association rather than qualitative judgment.
Even when LLMs apply internal judges, rubrics, or meta-reflections, these are simulations of selection — they have no means to qualify the relevance of their generations against the actual state of the domain, the needs of the audience, or the significance of the moment. The rubric can score surface features. It cannot judge importance.
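A deliberately crude sketch of the last point follows. The rubric, its three checks, and the two example answers are all hypothetical. Actual LLM judges are far more elaborate, but the claim above is that they share the same limitation: they score what is visible in the text.

```python
REASONING_MARKERS = {"because", "therefore", "since"}


def surface_rubric(answer: str) -> int:
    """Hypothetical rubric that scores only surface features of an answer."""
    words = answer.split()
    score = 0
    # Check 1: the answer has a reasonable length.
    if 5 <= len(words) <= 30:
        score += 1
    # Check 2: the answer looks like a well-formed sentence.
    if answer[:1].isupper() and answer.rstrip().endswith("."):
        score += 1
    # Check 3: the answer contains a word that signals a stated reason.
    if any(w.strip(".,").lower() in REASONING_MARKERS for w in words):
        score += 1
    return score


# Two made-up answers with the same form. Only one gives a reason that
# would matter to a structural engineer.
answer_a = "The beam is unsafe because the crack crosses the weld."
answer_b = "The beam is unsafe because the paint looks slightly dated."

print(surface_rubric(answer_a), surface_rubric(answer_b))
```

Both answers receive the same score. The rubric registers that a reason was given, not whether the reason is the one that makes a difference.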
Inquiring lines that use this note as a source (22)
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- How does self-observation enable experts to verify their own judgment?
- What role shifts occur when experts become custodians of AI knowledge?
- How does AI substitute polished style for actual expert judgment?
- Can sorting algorithms create symmetric competition between human and AI content?
- How does AI presentation authority substitute for actual expert judgment?
- What happens when DSM categories are treated as ground truth in AI?
- Why do human judges fail to detect systematic linguistic differences that classifiers easily identify?
- Why do users default to treating AI outputs as equally reliable evidence?
- How do agents ground their judgments in evidence instead of pattern matching?
- How does the expert role shift when AI output becomes the primary thing experts manage?
- Why did three experts reach incompatible conclusions about the same AI system?
- What happens to professional expertise when judgment gets encoded into systems?
- How do evaluation systems shift power between humans and AI outputs?
- Why does AI output show diversity without multiplying actual points of view?
- How do experts decide which information matters for a specific audience?
- How do experts select which other experts to trust?
- Why do two experts with identical knowledge produce different outcomes in the same situation?
- What specific signals would be needed for an AI system to acquire meaning?
- Why do medical diagnoses require human judgment even with AI assistance?
- Can AI systems recognize intelligence in humans the way humans recognize it in each other?
- Can stylometric analysis tools work without understanding the significance of detected patterns?
- Why can't pattern-matching systems perform the observation that expert communication requires?
Related concepts in this collection (5)
- Do foundation models learn world models or task-specific shortcuts?
  When transformer models predict sequences accurately, are they building genuine world models that capture underlying physics and logic? Or are they exploiting narrow patterns that fail under distribution shift?
  heuristics are quantitative pattern-matching, not qualitative selection of relevance
- Should we call LLM errors hallucinations or fabrications?
  Does the language we use to describe LLM failures shape the technical solutions we build? Examining whether perceptual and psychological frameworks misdiagnose what's actually happening.
  fabrication as the consequence of generating without observing
- Why do people trust AI outputs they shouldn't?
  When do human cognitive shortcuts fail in AI interaction? Three compounding traps (treating statistical patterns as facts, mistaking fluency for understanding, and avoiding disagreement) may explain systematic overreliance across languages and contexts.
  users apply observation frameworks to non-observational outputs
- Can LLMs generate more novel ideas than human experts?
  Research shows LLM-generated ideas score higher for novelty than expert-generated ones, yet LLMs avoid the evaluative reasoning that characterizes expert thinking. What explains this apparent contradiction?
  generation without evaluative selection: the ideation version of the observation problem
- Why does AI writing sound generic despite being grammatically correct?
  Explores whether the robotic quality of AI text stems from grammatical failures or rhetorical ones. Understanding this distinction matters for diagnosing what AI systems actually struggle with in human-like writing.
  mastering structure without evaluation is mastering form without observation
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Open Problems in Mechanistic Interpretability
- Mechanistic Indicators of Understanding in Large Language Models
- AI for Auto-Research: Roadmap & User Guide
- Language models show human-like content effects on reasoning tasks
- Expanding Explainability: Towards Social Transparency in AI systems
- Has the Creativity of Large-Language Models peaked? An analysis of inter- and intra-LLM variability
- Linguistic markers of inherently false AI communication and intentionally false human communication: Evidence from hotel reviews
- Beyond Hallucinations: The Illusion of Understanding in Large Language Models
Original note title
AI cannot distinguish differences that make a difference — observation requires qualitative selection of relevance that quantitative pattern-matching cannot replicate