Do LLM semantic features organize along human evaluation dimensions?
Does the structure of meaning in language models match the three-dimensional semantic space (Evaluation-Potency-Activity) that humans use? If so, what are the implications for steering and alignment?
A long-standing finding from social psychology: human ratings across diverse semantic scales follow a strong correlational structure that reduces to three dimensions — Evaluation (good vs. bad), Potency (strong vs. weak), and Activity (moving vs. stationary). This same structure appears inside LLM embedding matrices.
The method: for each of 28 semantic axes defined by antonym pairs (kind-cruel, foolish-wise, soft-hard), extract a feature direction from the embedding matrix, then project word tokens onto these directions. The projections correlate highly with human ratings on the respective scales. Applying PCA to the projections yields a 3D solution that preserves 40-55% of the variance across all 28 features, with loadings that match the human EPA structure.
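The projection-then-PCA pipeline can be sketched roughly as follows. This is a minimal illustration, not the original study's code: it assumes each axis direction is the normalized difference between the two antonym embeddings (the source does not say how directions are extracted), and it uses a small random matrix as a stand-in for real model embeddings. The helper names `axis_direction` and `project_and_reduce` are hypothetical.

```python
import numpy as np

def axis_direction(emb, vocab, pos_word, neg_word):
    """Unit vector pointing from the negative pole to the positive pole.

    Assumption: a semantic axis is the difference of the two antonym embeddings.
    """
    d = emb[vocab[pos_word]] - emb[vocab[neg_word]]
    return d / np.linalg.norm(d)

def project_and_reduce(emb, directions, n_components=3):
    """Project every token onto each axis, then run PCA over the axes."""
    proj = emb @ directions.T                  # shape: (n_tokens, n_axes)
    centered = proj - proj.mean(axis=0)
    # SVD of the centered matrix: rows of vt are the principal axes (loadings).
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = (s ** 2) / np.sum(s ** 2)      # variance ratio per component
    return proj, vt[:n_components], explained[:n_components]

# Toy stand-in for an embedding matrix: random vectors, NOT real model weights.
rng = np.random.default_rng(0)
words = ["kind", "cruel", "wise", "foolish", "soft", "hard", "strong", "weak"]
vocab = {w: i for i, w in enumerate(words)}
emb = rng.normal(size=(len(words), 16))

# Four illustrative antonym pairs; the note describes 28 such axes.
pairs = [("kind", "cruel"), ("wise", "foolish"), ("soft", "hard"), ("strong", "weak")]
directions = np.stack([axis_direction(emb, vocab, p, n) for p, n in pairs])

proj, loadings, explained = project_and_reduce(emb, directions)
print("share of variance kept by 3 components:", explained.sum())
```

With real embeddings, the rows of `loadings` are what one would compare against the human Evaluation, Potency, and Activity factors. On random vectors the numbers carry no meaning beyond showing the shapes involved.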
The steering implication is the sharp finding. Because semantic features are geometrically aligned in embedding space, intervening on one feature causes predictable off-target effects on other features, proportional to their cosine similarity with the steered direction. Steering tokens toward "soft" shifts them toward "kind" because those directions are aligned; steering toward "strong" shifts them toward "big." The off-target effect is not noise. It is a structural consequence of how meaning is organized.
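The proportionality claim follows from simple linear algebra, and a small sketch makes it concrete. The vectors below are hand-picked 3-d toy directions, not values from any model; the "fast" axis and the helper names `steer` and `shift_along` are hypothetical additions for illustration. Adding `alpha` units of a unit steering direction changes the projection onto any other unit axis by `alpha` times the cosine between the two directions.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def steer(x, direction, alpha):
    """Add alpha units of the unit-normalized steering direction to x."""
    return x + alpha * direction / np.linalg.norm(direction)

def shift_along(x_before, x_after, axis):
    """Change in the projection of x onto the unit-normalized axis."""
    unit = axis / np.linalg.norm(axis)
    return float((x_after - x_before) @ unit)

# Hypothetical feature directions chosen by hand for illustration.
soft = np.array([1.0, 0.0, 0.0])
kind = np.array([0.8, 0.6, 0.0])   # aligned with soft (cosine 0.8)
fast = np.array([0.0, 0.0, 1.0])   # orthogonal to soft (cosine 0.0)

x = np.array([0.2, -0.1, 0.5])     # an arbitrary token embedding
alpha = 2.0
x_steered = steer(x, soft, alpha)

for name, axis in [("soft", soft), ("kind", kind), ("fast", fast)]:
    observed = shift_along(x, x_steered, axis)
    predicted = alpha * cosine(soft, axis)
    print(f"{name}: observed shift {observed:.2f}, alpha * cosine {predicted:.2f}")
```

The observed and predicted columns agree for every axis: the aligned axis moves by a fraction of the steering strength, and the orthogonal axis does not move at all.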
This matters for alignment and safety because representation engineering interventions (steering vectors, activation additions) assume that features can be modified independently. If semantic features are entangled in a low-dimensional subspace, then steering for one property (say, "helpful") will predictably shift adjacent properties (say, "agreeable" or "warm"), whether intended or not. These side effects follow from how LLMs organize meaning, which mirrors how humans organize it.
The philosophical dimension: LLMs recapitulate human semantic structure despite a radically different architecture and training process, which suggests that the EPA structure may be a property of language itself rather than of the cognitive system processing it. Training on extensive records of human thought appears sufficient to reproduce the correlational structure of human semantic judgments.
Inquiring lines that use this note as a source (16)
- How do low-dimensional representation structures entangle multiple cultures together?
- How does syntactic encoding relate to semantic feature representation?
- How do functional features differ from representational abstract features?
- How should meaning spaces be systematically modeled across different applications?
- How do humans detect which words belong to the same frame together?
- How does LatentQA differ from predefined concept steering like representation engineering?
- What distinguishes surface cues from structural meaning in language understanding?
- How do internal representations compare to human cognitive structures?
- What fine-grained distinctions matter most for human situated action in categories?
- Can latent space represent reasoning dimensions that text cannot?
- What makes some concepts more steerable than others in activation space?
- Do all semantic steering effects follow predictable patterns based on feature alignment?
- What other behavioral properties exist as linear directions in activation space?
- Why do leading embedding eigenvectors align with WordNet taxonomy structure?
- How do co-occurrence statistics alone produce hierarchical concept organization?
- How do semantic features in representations become steerable task-specific directions?
Related concepts in this collection (4)
- Can high-level concepts replace circuit-level analysis in AI?
  Instead of reverse-engineering individual circuits, can we study AI reasoning by treating concepts as directions in activation space? This matters because circuit analysis hits practical limits at scale.
  RepE operates in this same space; EPA entanglement constrains what RepE can cleanly modify
- Can we track and steer personality shifts during model finetuning?
  This research explores whether personality traits in language models occupy specific linear directions in activation space, and whether we can detect and control unwanted personality changes during training using these geometric directions.
  persona steering faces the same entanglement: shifting one personality dimension will drag correlated semantic features
- Can identical outputs hide broken internal representations?
  Can neural networks produce correct outputs while having fundamentally fractured internal structure that prevents generalization and creativity? This challenges our assumptions about what performance benchmarks actually measure.
  FER is about task-level entanglement; EPA is about semantic-level entanglement: the same structural problem at different levels
- How do language models encode syntactic relations geometrically?
  Do LLM embeddings use distance alone or also direction to represent syntax? Understanding whether neural networks can spontaneously develop symbolic-compatible geometric structures.
  complementary structural discovery: EPA reveals semantic dimensions in embedding space while Polar Probe reveals syntactic relations across layers; together they show meaning is organized along both semantic and syntactic axes
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Semantic Structure in Large Language Model Embeddings
- Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control
- From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning
- Hierarchical Concept Geometry in Language Models Emerges from Word Co-occurrence
- Word Meanings in Transformer Language Models
- Probing Structured Semantics Understanding and Generation of Language Models via Question Answering
- Do large language models resemble humans in language use?
- Large Concept Models: Language Modeling in a Sentence Representation Space
Original note title
semantic features in LLM embeddings are entangled in a low-dimensional structure mirroring human Evaluation-Potency-Activity dimensions — steering one feature predictably shifts aligned features