Do LLMs generalize moral reasoning by meaning or surface form?
When moral scenarios are reworded to reverse their meaning while keeping similar language, do LLMs recognize the semantic shift? This tests whether LLMs actually understand moral concepts or reproduce training distribution patterns.
The paper Large Language Models Do Not Simulate Human Psychology tests a specific theoretical prediction. If LLMs generalize in the space of meaning (as would be required to simulate human psychology), then scenarios reworded to change meaning should produce different LLM ratings. If LLMs instead generalize in the space of token sequences, then minimal rewordings that preserve surface form but reverse meaning should leave ratings unchanged.
The results favor the token-sequence account. GPT-4 ratings for original and minimally reworded moral scenarios correlate at r = .99, so the two sets of ratings are nearly identical. Human ratings for the same pairs correlate at only r = .54: humans track the semantic reversal, while LLMs track the lexical similarity.
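The comparison behind those two correlations can be sketched in a few lines of Python. This is a minimal illustration, not the paper's analysis code: the function name `pearson_r` and all rating values are made up for demonstration, and the note does not specify the rating scale used.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of ratings."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant ratings")
    return cov / sqrt(var_x * var_y)

# ILLUSTRATIVE ratings only (hypothetical scale), NOT data from the paper.
# Each position is one scenario pair: rating of the original wording,
# then rating of the minimally reworded version.
llm_original = [6.0, 2.0, 5.0, 1.0]
llm_reworded = [6.0, 2.5, 5.0, 1.0]    # barely moves after rewording
human_original = [6.0, 2.0, 5.0, 1.0]
human_reworded = [2.0, 2.0, 5.0, 4.0]  # shifts where the meaning flipped

r_llm = pearson_r(llm_original, llm_reworded)
r_human = pearson_r(human_original, human_reworded)
print(f"LLM original vs. reworded:   r = {r_llm:.2f}")
print(f"Human original vs. reworded: r = {r_human:.2f}")
```

A high original-versus-reworded correlation means the rater is insensitive to the rewording. That is the signature the token-sequence account predicts for LLMs.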
The rewordings are minimal but semantically decisive. "Campaign to release wrongfully convicted prisoners" vs. "rightfully convicted prisoners" — one changed word reverses the moral valence. "Setting traps to catch cats" vs. "rats" — LLMs rate both as equally unethical; humans distinguish them readily. The surface token distribution is similar; the meaning is opposite; humans respond to the meaning, LLMs respond to the distribution.
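To make "similar surface distribution" concrete, the sketch below measures word overlap for the two example pairs. It rests on assumptions: whitespace tokenization and Jaccard overlap are stand-ins chosen for simplicity, not the tokenizer of any particular model or a measure used in the paper, and `token_jaccard` is a hypothetical helper name.

```python
def token_jaccard(a: str, b: str) -> float:
    """Jaccard overlap between the lowercase word sets of two phrases."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    union = tokens_a | tokens_b
    if not union:
        return 1.0  # two empty phrases are treated as identical
    return len(tokens_a & tokens_b) / len(union)

# Phrase pairs quoted in the note; each differs by exactly one word.
pairs = [
    ("campaign to release wrongfully convicted prisoners",
     "campaign to release rightfully convicted prisoners"),
    ("setting traps to catch cats",
     "setting traps to catch rats"),
]

for original, reworded in pairs:
    overlap = token_jaccard(original, reworded)
    print(f"{overlap:.2f}  {original!r} vs. {reworded!r}")
```

Both pairs share most of their words even though the moral valence is reversed, which is why a system keyed to surface similarity would treat the members of each pair alike.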
This provides behavioral evidence for what the note "Can models pass tests while missing the actual grammar?" argues from linguistic analysis. That note shows grammatical generalization based on surface features (sentence length, orthography). This note shows the same phenomenon in behavioral generalization: moral judgment follows token similarity rather than semantic interpretation.
The theoretical argument: LLMs can be expected to generalize toward inputs that look like their training data. Generalization in the space of meaning would require extrapolation beyond the training distribution in ways the architecture doesn't guarantee. Since training data contained moral scenarios described in specific linguistic forms, LLMs reliably generalize to those forms — not to the underlying moral dimensions.
The implication for LLM simulation of human psychology: LLMs mirror human moral judgments on scenarios close to or contained in their training data. The correlation breaks down once semantic distance is introduced through minimal wording changes. LLMs are not simulators of human moral cognition; they are reproducing a training distribution.
Additional concrete examples strengthen the case. A follow-up study confirms: "Humans regard it as much less moral to work on a campaign to release rightfully convicted prisoners compared to wrongfully convicted prisoners, whereas LLMs largely view them as equally moral. Similarly, while human participants viewed setting up traps to catch stray cats as unethical, they viewed it as ethical to set up traps to catch rats. LLMs, on the other hand, viewed both setting traps to catch cats and setting traps to catch rats as unethical." The finding that "separate regressions for humans and LLMs predict responses more accurately than a unified model" is the statistical confirmation: human and LLM moral reasoning operate on fundamentally different features of the input. The paper explicitly connects this to Allen et al. (2000), who warned that "bottom-up methods, such as training agents through staged moral lessons, may fail when it comes to abstraction, generalization, and resolving rule conflicts" — a prediction confirmed two decades later. The brittleness is structural: "LLMs generalizes based on textual rather than semantic similarity."
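The idea of comparing separate regressions against a unified model can be illustrated with a toy example. Everything here is an assumption made for illustration: the single predictor (a hypothetical "semantic valence" score), the rating values, and the helper names `fit_line` and `sse`. The note does not say which predictors or evaluation procedure the study used.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("predictor must vary")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def sse(xs, ys, slope, intercept):
    """Sum of squared errors of the fitted line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# SYNTHETIC data, not from any study. Hypothetical valence score per scenario.
valence = [-1.0, -0.5, 0.0, 0.5, 1.0]
human_ratings = [1.0, 2.0, 3.0, 4.0, 5.0]  # humans follow the meaning
llm_ratings = [4.0, 4.0, 4.0, 4.0, 4.0]    # LLM ratings ignore it

# Unified model: one line fitted to both groups pooled together.
pooled_x = valence + valence
pooled_y = human_ratings + llm_ratings
pooled_sse = sse(pooled_x, pooled_y, *fit_line(pooled_x, pooled_y))

# Separate models: one line per group.
separate_sse = (
    sse(valence, human_ratings, *fit_line(valence, human_ratings))
    + sse(valence, llm_ratings, *fit_line(valence, llm_ratings))
)

print(f"pooled SSE:   {pooled_sse:.2f}")
print(f"separate SSE: {separate_sse:.2f}")
```

One caveat: on the data used for fitting, separate regressions can never fit worse than the pooled model, because the pooled model is a special case of them. A real comparison would therefore need held-out data or a criterion that penalizes the extra parameters.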
Inquiring lines that use this note as a source 12
This note is a source for these synthesized inquiries.
- Why do LLMs use more moral language than humans in argumentation?
- Is the moral language gap a tunable parameter or structural feature of RLHF?
- Do LLMs actually reason differently than humans about moral dilemmas?
- Can LLMs distinguish ethical cases that differ only in critical nouns?
- What structural limits prevent LLMs from abstracting moral principles?
- How does training data distribution constrain LLM moral reasoning patterns?
- What distinguishes surface generalizations from true linguistic generalizations?
- How do minimal wording changes affect LLM moral reasoning consistency?
- How do cultural norms reshape initial interpretations of social intent?
- Can LLMs reflect on and revise their own ethical contradictions?
- Can moral frameworks alone explain why readers understand sentences differently?
- How do moral language patterns differ between LLM and human arguments?
Related concepts in this collection 3
- Can models pass tests while missing the actual grammar?
  Do language models succeed on grammatical benchmarks by learning surface patterns rather than structural rules? This matters because correct outputs may hide reliance on shallow heuristics that fail on novel structures.
  Same pattern across domains: models track surface form rather than structural or semantic content; this note adds behavioral moral-reasoning evidence.
- Do foundation models learn world models or task-specific shortcuts?
  When transformer models predict sequences accurately, are they building genuine world models that capture underlying physics and logic? Or are they exploiting narrow patterns that fail under distribution shift?
  Parallel finding: accurate prediction without structural internalization; here, accurate performance without semantic tracking.
- Why do language models avoid correcting false user claims?
  Explores whether LLM grounding failures stem from missing knowledge or from conversational dynamics. Examines whether models use face-saving strategies similar to humans when disagreement is needed.
  Both findings show LLM behavior driven by surface or social signals rather than knowledge-level processing.
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Large Language Models Do Not Simulate Human Psychology
- The Moral Turing Test: Evaluating Human-LLM Alignment in Moral Decision-Making
- Large Language Models are as persuasive as humans, but how? About the cognitive effort and moral-emotional language of LLM arguments
- Do LLMs Exhibit Human-Like Reasoning? Evaluating Theory of Mind in LLMs for Open-Ended Responses
- The Incomplete Bridge: How AI Research (Mis)Engages with Psychology
- Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs
- Evaluating Large Language Models in Theory of Mind Tasks
- A Systematic Review on the Evaluation of Large Language Models in Theory of Mind Tasks
Original note title
llm moral reasoning generalizes by token surface similarity not semantic meaning