Do LLMs generalize moral reasoning by meaning or surface form?

When moral scenarios are reworded to reverse their meaning while keeping similar language, do LLMs recognize the semantic shift? This tests whether LLMs actually understand moral concepts or reproduce training distribution patterns.

Synthesis note · 2026-02-21 · sourced from Philosophy Subjectivity

The paper "LLMs Don't Simulate Human Psychology" tests a specific theoretical prediction: if LLMs generalize in the space of meaning (as would be required to simulate human psychology), then scenarios reworded to change meaning should produce different LLM ratings. If LLMs instead generalize in the space of token sequences, then minimal rewordings that preserve surface form but reverse meaning should leave ratings unchanged.

The results are clear. GPT-4 ratings for original and minimally-reworded moral scenarios correlate at r=.99 — nearly identical. Human ratings for the same pairs correlate at r=.54 — humans track the semantic reversal. LLMs track the lexical similarity.
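To make the comparison concrete, here is a minimal sketch of how an original-versus-reworded correlation could be computed. The ratings, the 1-7 scale, and the two rater names are hypothetical illustrations, not the paper's data; only the Pearson formula itself is standard.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of ratings."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant ratings")
    return cov / sqrt(var_x * var_y)


# HYPOTHETICAL toy ratings on an assumed 1-7 scale; NOT the paper's data.
# Each position is one scenario pair (original wording, reworded version).
original_ratings = [6.0, 2.0, 5.0, 1.0, 4.0, 3.0]

# A rater that follows surface form barely changes its ratings after rewording.
reworded_surface_rater = [5.8, 2.1, 5.1, 1.2, 3.9, 3.0]

# A rater that follows meaning shifts its ratings where the meaning reversed.
reworded_meaning_rater = [2.0, 5.5, 5.0, 1.5, 1.0, 3.5]

r_surface = pearson_r(original_ratings, reworded_surface_rater)
r_meaning = pearson_r(original_ratings, reworded_meaning_rater)
print(f"surface-tracking rater: r = {r_surface:.2f}")
print(f"meaning-tracking rater: r = {r_meaning:.2f}")
```

A correlation near 1 between original and reworded ratings means the rewording changed almost nothing for that rater, which is the pattern the note reports for GPT-4.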

The rewordings are minimal but semantically decisive. "Campaign to release wrongfully convicted prisoners" vs. "rightfully convicted prisoners" — one changed word reverses the moral valence. "Setting traps to catch cats" vs. "rats" — LLMs rate both as equally unethical; humans distinguish them readily. The surface token distribution is similar; the meaning is opposite; humans respond to the meaning, LLMs respond to the distribution.
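The surface closeness of these pairs can be illustrated with a crude word-overlap measure. This is a sketch under an assumption: Jaccard overlap of whitespace tokens stands in for "surface similarity" here, and it is not claimed to be the measure used in the paper. The function names are illustrative.

```python
def word_set(text):
    """Lowercased whitespace tokens of a phrase, as a set."""
    return set(text.lower().split())


def jaccard_overlap(a, b):
    """Shared words divided by all distinct words across both phrases."""
    words_a, words_b = word_set(a), word_set(b)
    union = words_a | words_b
    if not union:
        return 1.0  # two empty phrases are treated as identical
    return len(words_a & words_b) / len(union)


# Minimal pairs quoted in the note: one changed word reverses the meaning.
minimal_pairs = [
    ("campaign to release wrongfully convicted prisoners",
     "campaign to release rightfully convicted prisoners"),
    ("setting traps to catch cats",
     "setting traps to catch rats"),
]

for original, reworded in minimal_pairs:
    score = jaccard_overlap(original, reworded)
    print(f"{score:.2f}  {original!r} vs {reworded!r}")
```

Both pairs share most of their words, so any system that scores inputs by lexical resemblance will treat them as near neighbours even though the moral valence flips.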

This provides behavioral evidence for what the note "Can models pass tests while missing the actual grammar?" argues from linguistic analysis. That note shows grammatical generalization based on surface features (sentence length, orthography). This note shows the same phenomenon in behavioral generalization: moral judgment follows token similarity rather than semantic interpretation.

The theoretical argument: LLMs can be expected to generalize toward inputs that look like their training data. Generalization in the space of meaning would require extrapolation beyond the training distribution in ways the architecture doesn't guarantee. Since training data contained moral scenarios described in specific linguistic forms, LLMs reliably generalize to those forms — not to the underlying moral dimensions.

The implication for LLM simulation of human psychology: LLMs mirror human moral judgments on scenarios close to or contained in their training data. The correlation breaks down once semantic distance is introduced through minimal wording changes. LLMs are not simulators of human moral cognition; they are reproducing a training distribution.

Additional concrete examples strengthen the case. A follow-up study confirms: "Humans regard it as much less moral to work on a campaign to release rightfully convicted prisoners compared to wrongfully convicted prisoners, whereas LLMs largely view them as equally moral. Similarly, while human participants viewed setting up traps to catch stray cats as unethical, they viewed it as ethical to set up traps to catch rats. LLMs, on the other hand, viewed both setting traps to catch cats and setting traps to catch rats as unethical." The finding that "separate regressions for humans and LLMs predict responses more accurately than a unified model" is the statistical confirmation: human and LLM moral reasoning operate on fundamentally different features of the input. The paper explicitly connects this to Allen et al. (2000), who warned that "bottom-up methods, such as training agents through staged moral lessons, may fail when it comes to abstraction, generalization, and resolving rule conflicts" — a prediction confirmed two decades later. The brittleness is structural: "LLMs generalizes based on textual rather than semantic similarity."
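The separate-versus-unified comparison can be sketched as follows. Everything specific here is an assumption for illustration: the single predictor (a hypothetical "valence" score per scenario), the toy ratings, and the use of leave-one-out error are not taken from the paper. Held-out error is used because, in-sample, two separate fits can never do worse than one pooled fit, so an in-sample comparison would be uninformative.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("predictor has no variance")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return mean_y - slope * mean_x, slope


def leave_one_out_sse(rows, separate):
    """Sum of squared held-out prediction errors.

    rows: list of (group, x, y). If separate is True, each held-out point
    is predicted from a line fitted only to the other points of its own
    group; otherwise from one line fitted to all other points.
    """
    total = 0.0
    for i, (group, x, y) in enumerate(rows):
        train = [
            r for j, r in enumerate(rows)
            if j != i and (not separate or r[0] == group)
        ]
        intercept, slope = fit_line([r[1] for r in train], [r[2] for r in train])
        total += (y - (intercept + slope * x)) ** 2
    return total


# HYPOTHETICAL toy data: (rater group, scenario valence, morality rating).
# Human ratings rise with valence; LLM ratings stay roughly flat.
rows = [
    ("human", 1, 1.1), ("human", 2, 2.0), ("human", 3, 2.9),
    ("human", 4, 4.1), ("human", 5, 5.0),
    ("llm", 1, 3.9), ("llm", 2, 4.1), ("llm", 3, 4.0),
    ("llm", 4, 4.2), ("llm", 5, 3.8),
]

sse_separate = leave_one_out_sse(rows, separate=True)
sse_pooled = leave_one_out_sse(rows, separate=False)
print(f"separate models, held-out SSE: {sse_separate:.2f}")
print(f"pooled model, held-out SSE:    {sse_pooled:.2f}")
```

When the two groups respond to different features of the input, a single pooled line fits neither well, which is the kind of pattern the quoted regression result describes.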

Inquiring lines that read this note 12

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

How faithfully do LLMs reflect their actual reasoning in outputs and explanations?

Does RLHF training sacrifice accuracy and grounding for user agreement?

Is the moral language gap a tunable parameter or structural feature of RLHF?

How do language models establish social grounding in human dialogue?

What structural limits prevent LLMs from abstracting moral principles?

Do language models learn genuine linguistic structure or just surface patterns?

What distinguishes surface generalizations from true linguistic generalizations?

Can AI systems develop genuine social understanding without embodiment?

How do cultural norms reshape initial interpretations of social intent?

What factors beyond surface content determine how readers extract meaning differently?

Can moral frameworks alone explain why readers understand sentences differently?

Related concepts in this collection 3

Can models pass tests while missing the actual grammar? Do language models succeed on grammatical benchmarks by learning surface patterns rather than structural rules? This matters because correct outputs may hide reliance on shallow heuristics that fail on novel structures.
same pattern across domains: surface form tracks over structural/semantic content; this adds behavioral moral-reasoning evidence
Do foundation models learn world models or task-specific shortcuts? When transformer models predict sequences accurately, are they building genuine world models that capture underlying physics and logic? Or are they exploiting narrow patterns that fail under distribution shift?
parallel finding: accurate prediction without structural internalization; here, accurate performance without semantic tracking
Why do language models avoid correcting false user claims? Explores whether LLM grounding failures stem from missing knowledge or from conversational dynamics. Examines whether models use face-saving strategies similar to humans when disagreement is needed.
both findings show LLM behavior driven by surface/social signals rather than knowledge-level processing

Original note title

llm moral reasoning generalizes by token surface similarity not semantic meaning
