Do LLM semantic features organize along human evaluation dimensions?

Does the structure of meaning in language models match the three-dimensional semantic space (Evaluation-Potency-Activity) that humans use? If so, what are the implications for steering and alignment?

Synthesis note · 2026-02-23 · sourced from Sentiment Semantics Toxic Detections

A long-standing finding from social psychology: human ratings across diverse semantic scales follow a strong correlational structure that reduces to three dimensions — Evaluation (good vs. bad), Potency (strong vs. weak), and Activity (moving vs. stationary). This same structure appears inside LLM embedding matrices.

The method: extract from the embedding matrices the feature directions corresponding to 28 semantic axes defined by antonym pairs (kind-cruel, foolish-wise, soft-hard). Project word tokens onto these directions. The projections correlate highly with human ratings on the respective scales. Applying PCA to the projections yields a 3D solution that preserves 40-55% of the variance across all 28 features, with loadings that match the human EPA structure.
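The pipeline described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the source study's implementation. The embedding matrix is random toy data rather than a real model's weights, the vocabulary and the four antonym pairs are hypothetical stand-ins for the 28 axes, and defining each direction as the normalized difference between the two antonym embeddings is an assumption about how the directions are extracted.

```python
import numpy as np

def axis_direction(emb, vocab, pos, neg):
    """Unit vector pointing from the `neg` antonym to the `pos` antonym.
    Assumption: a semantic axis is the difference of the two embeddings."""
    d = emb[vocab[pos]] - emb[vocab[neg]]
    return d / np.linalg.norm(d)

def project_tokens(emb, directions):
    """Scalar projection of every token embedding onto every axis."""
    return emb @ directions.T  # shape: (n_tokens, n_axes)

def pca_explained_variance(scores, k=3):
    """Fraction of total variance captured by the first k principal components."""
    centered = scores - scores.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances[:k].sum() / variances.sum()

# Toy stand-in data: random vectors, NOT a real model's embedding matrix.
rng = np.random.default_rng(0)
words = ["kind", "cruel", "wise", "foolish", "hard", "soft", "strong", "weak"]
vocab = {w: i for i, w in enumerate(words)}
emb = rng.normal(size=(len(words), 16))

# Hypothetical subset of the antonym-pair axes.
pairs = [("kind", "cruel"), ("wise", "foolish"), ("hard", "soft"), ("strong", "weak")]
directions = np.stack([axis_direction(emb, vocab, p, n) for p, n in pairs])

scores = project_tokens(emb, directions)      # token-by-axis projection matrix
ratio = pca_explained_variance(scores, k=3)   # share of variance in a 3D solution
print(f"variance kept by 3 components: {ratio:.2f}")
```

With real embeddings, the columns of `scores` would be the quantities compared against human ratings, and `ratio` would correspond to the 40-55% figure reported above. On random toy data the number itself carries no meaning.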

The steering implication is the sharp finding. Because semantic features are geometrically aligned in embedding space, intervening on one feature causes predictable off-target effects on other features proportional to their cosine similarity. Steering tokens toward "soft" shifts them toward "kind" because those directions are aligned. Steering toward "strong" shifts toward "big." The off-target effect is not noise — it is a structural consequence of how meaning is organized.
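The proportionality to cosine similarity follows directly from the geometry in the simplest case. The sketch below assumes steering is a plain additive shift along a unit-length feature direction. The directions are hypothetical random vectors, with "kind" deliberately constructed to overlap with "soft", so the numbers illustrate the mechanism and are not measurements from any model.

```python
import numpy as np

def unit(v):
    """Normalize a vector to unit length."""
    return v / np.linalg.norm(v)

def steer(x, direction, alpha):
    """Shift embedding x by alpha along a unit feature direction."""
    return x + alpha * direction

rng = np.random.default_rng(1)
dim = 32

# Hypothetical feature directions; "kind" is built to overlap with "soft".
soft = unit(rng.normal(size=dim))
other = unit(rng.normal(size=dim))
kind = unit(0.6 * soft + 0.8 * other)

x = rng.normal(size=dim)   # a toy token embedding
alpha = 2.0                # steering strength
x_steered = steer(x, soft, alpha)

cos_sim = float(soft @ kind)                      # alignment of the two features
on_target_shift = float((x_steered - x) @ soft)   # change along "soft"
off_target_shift = float((x_steered - x) @ kind)  # change along "kind"

print(f"cosine(soft, kind) = {cos_sim:.3f}")
print(f"on-target shift    = {on_target_shift:.3f}")
print(f"off-target shift   = {off_target_shift:.3f}")
```

Under these assumptions the off-target shift equals `alpha * cos_sim`, so it vanishes only when the two directions are orthogonal. Steering methods that act on intermediate activations rather than embeddings may behave less cleanly than this linear picture suggests.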

This matters for alignment and safety because representation engineering interventions (steering vectors, activation additions) assume features can be independently modified. If semantic features are entangled in a low-dimensional subspace, then steering for one property (say, "helpful") will predictably shift adjacent properties (say, "agreeable" or "warm") whether intended or not. The off-target effects are not bugs but consequences of how LLMs organize meaning — in a way that mirrors how humans organize meaning.

The philosophical dimension: that LLMs recapitulate human semantic structure despite radically different architecture and training suggests that the EPA structure may be a property of language itself rather than of the cognitive system processing it. Training on extensive records of human thought appears sufficient to reproduce the correlational structure of human semantic judgments.

Inquiring lines that read this note (16)

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

Is embodied interaction necessary for language meaning and genuine agency?

Do language models understand semantics or rely on pattern matching?

How does syntactic encoding relate to semantic feature representation?

What limits mechanistic interpretability's ability to characterize models?

How do functional features differ from representational abstract features?

Do language models learn genuine linguistic structure or just surface patterns?

Do language model representations contain causally steerable task-specific features?

Do language models develop causal world models or rely on statistical patterns?

How do internal representations compare to human cognitive structures?

How does latent reasoning compare to verbalized chain-of-thought?

Can latent space represent reasoning dimensions that text cannot?

Why do semantic similarity and task relevance diverge in vector embeddings?

Why do leading embedding eigenvectors align with WordNet taxonomy structure?

Does model scaling alone produce compositional generalization without symbolic mechanisms?

How does co-occurrence statistics alone produce hierarchical concept organization?

Related concepts in this collection (4)


14 direct connections · 111 in 2-hop network · medium cluster


Can high-level concepts replace circuit-level analysis in AI? Instead of reverse-engineering individual circuits, can we study AI reasoning by treating concepts as directions in activation space? This matters because circuit analysis hits practical limits at scale.
Representation engineering (RepE) operates in this same space; EPA entanglement constrains what it can cleanly modify
Can we track and steer personality shifts during model finetuning? This research explores whether personality traits in language models occupy specific linear directions in activation space, and whether we can detect and control unwanted personality changes during training using these geometric directions.
persona steering faces the same entanglement: shifting one personality dimension will drag correlated semantic features
Can identical outputs hide broken internal representations? Can neural networks produce correct outputs while having fundamentally fractured internal structure that prevents generalization and creativity? This challenges our assumptions about what performance benchmarks actually measure.
FER is about task-level entanglement; EPA is about semantic-level entanglement — same structural problem at different levels
How do language models encode syntactic relations geometrically? Do LLM embeddings use distance alone or also direction to represent syntax? Understanding whether neural networks can spontaneously develop symbolic-compatible geometric structures.
complementary structural discovery: EPA reveals semantic dimensions in embedding space while Polar Probe reveals syntactic relations across layers — together they show meaning is organized along both semantic and syntactic axes
