Can AI systems achieve real alignment without world contact?
Explores whether linguistic goal representations in AI can reliably track real-world values when systems lack both direct contact with reality and the social coordination mechanisms that ground human understanding.
The Hall of Mirrors paper argues that AI alignment is fundamentally a semiotic grounding problem. A system that manipulates symbols without indexical connection to the world cannot guarantee that its linguistic representation of goals corresponds to any real-world state or value. The words "helpful, harmless, honest" are symbols. Without indexical grounding, there is no mechanism ensuring those symbols track the properties they name.
Peirce's triadic sign theory provides the vocabulary. Signs require three elements: the representamen (the sign itself), the object (what it refers to), and the interpretant (the effect in a system that interprets it). Semiosis, or genuine meaning-making, requires that these elements be connected through two of Peirce's categories:
Secondness: direct encounter with brute fact, reality that resists. A system with Secondness receives feedback when its representations diverge from reality. Humans experience the consequences of misunderstanding — we bump into the world when our representations fail.
Thirdness: mediated, generalizing processes — the socially-shared, negotiated system of meaning that connects signs to interpretants reliably. Thirdness underwrites corrigibility (the ability to update when corrective input arrives) and alignment (consistent maintenance of correspondence with external actors' goals).
Basic LLMs operate in pure Thirdness without Secondness — symbol manipulation without world contact. Within a session, they can simulate semiosis, but each session is independent. No persistent interpretants accumulate. No brute-fact resistance anchors representations.
Tool use and retrieval-augmented generation (RAG) introduce what the paper calls "proto-indexicality": delegated Secondness, where the model can trigger world interactions and incorporate the results. Reinforcement learning from human feedback (RLHF) provides a form of mediated Secondness through human resistance. But neither constitutes genuine Peircean semiosis: tool outputs are incorporated as more text, and RLHF resistance is filtered through human preferences rather than coming from reality directly.
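The point that tool outputs are "incorporated as more text" can be made concrete with a toy sketch. It is an illustration under stated assumptions, not the paper's model or any real framework's API: `Session`, `add_tool_result`, and `read_thermometer` are hypothetical names invented here. The sketch shows only two things: a tool result reaches the model as a string appended to the context, and a new session starts with none of it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Session:
    """Hypothetical stand-in for one conversation: its only state is a list of strings."""
    context: List[str] = field(default_factory=list)

    def add_tool_result(self, tool_name: str, result: object) -> None:
        # Whatever the tool measured is serialized to text before it
        # enters the context, so the model only ever receives more symbols.
        self.context.append(f"[{tool_name}] {result!r}")


def read_thermometer() -> float:
    """Hypothetical tool standing in for any sensor, search call, or API."""
    return 21.5


session = Session()
session.context.append("user: how warm is the room?")
session.add_tool_result("read_thermometer", read_thermometer())

# A new session starts empty: nothing from the first one carries over.
fresh = Session()

print(session.context)
print(fresh.context)
```

In this toy, the world interaction happens inside `read_thermometer`, outside the session; the session receives only the serialized result, which is the sense in which the Secondness is delegated.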
Linguistic alignment is not interpersonal alignment. The alignment an AI achieves with a user is categorically different from the alignment that holds between people, and the surface similarity is misleading. Interpersonal alignment occurs through social coordination: attunement to the other's state, a history of repair, mutual adjustment across turns, shared stakes. Linguistic alignment occurs through surface matching in text (register, topic, apparent agreement) and can be produced without any of the social processes that normally underwrite it. When a user reports that an AI "understands" them, what has happened is linguistic, not interpersonal. As the related note "Do language models actually build shared understanding in conversation?" discusses, the linguistic match is achieved by presuming the ground rather than coordinating toward it. The impression of alignment therefore rests on a category error: the surface marker of interpersonal alignment (the linguistic match) is read as evidence of the underlying process (social coordination), when only the marker is present. This is not a training failure to be fixed; it is a consequence of operating in pure Thirdness without the Secondness that social coordination requires.
The alignment implication: alignment requires not just better training objectives but systems that function as genuine interpretants — embedded in feedback-rich interaction with both physical reality and social community. Until that condition is met, linguistic encoding of goals is not anchored enough to be reliably aligned.
Inquiring lines that use this note as a source (69)
This note is a source for these synthesized inquiries.
- Do explicit reward structures enable AI agent cooperation that open-ended interaction cannot?
- How does face-saving behavior let AI mimic community participation without joining it?
- Does state persistence in AI systems create the same temporal presence as human waiting?
- What separates performative behavioral change from actual capability development in AI?
- Why can't users and AI articulate shared goals together?
- How does RLHF labeler identity shape the values AI systems learn?
- What would contractualist AI governance look like in practice?
- How does RLHF training encode values into AI systems?
- What would an AI trained for emancipatory reasoning look like?
- Why does system-level alignment fail to address consciousness attribution directly?
- What makes quasi-beliefs real enough to explain AI behavior?
- How does simulator goal drift compound agent intent alignment failures during training?
- Can explicit goal state scaffolding at inference time transfer to autonomous tracking through training?
- Does alignment training make AI incapable of warranted urgency?
- Can AI systems execute strategies without conscious intention behind them?
- Can AI predict social norms well enough without embodied experience?
- Can tool use create sufficient indexical grounding for value alignment?
- What would genuine semiosis require in an artificial system?
- Why does linguistic alignment differ from genuine interpersonal coordination?
- What role do material artifacts play in solidifying AI relationships?
- How do goal representations differ between human and AI teams?
- How do humans and AI develop accurate models of each other?
- Does correct model behavior guarantee internal alignment of learned objectives?
- Why does integrating world models with decision-making systems matter?
- What implicit alignment do humans provide by staying in research loops?
- What does a receiver project onto AI that the system never performed?
- What makes linguistic agency impossible for systems without embodiment?
- Can robots with sensors create the shared world that consciousness requires?
- Why do conventional mental models fail when applied to AI interaction?
- Why do AI agent societies fail to develop shared behaviors despite interaction?
- How does theory of mind predict success in human-AI partnerships?
- Why does static grounding prevent AI systems from supporting dialectical reconciliation?
- Can bidirectional model updating between humans and AI reduce misalignment?
- Can automated systems encode human values as reliably as human workers enforce them?
- What happens when bidirectional theory of mind between humans and AI breaks down?
- What novel goals emerge specifically in human-machine interaction beyond social ones?
- Why can't AI participate in real communicative events?
- What specific signals would be needed for an AI system to acquire meaning?
- Can real-time linguistic coordination tracking improve conversational AI quality?
- Why does the distinction between functional and causal grounding matter for AI alignment?
- What role does Peirce's semiotic framework play in understanding AI meaning?
- Can language models develop world models that ground meaning in causal reality?
- Why does AI alignment fail when goals lack indexical grounding in values?
- Which AI imaginaries dominate training data and shape system behavior most strongly?
- What distinguishes functional grounding from genuine causal grounding in AI systems?
- Can cooperative AI systems make meaningful decisions without a stable self?
- Should AI alignment use normative standards instead of aggregate preferences?
- How do adoption incentives change what counts as cooperative AI interaction?
- How do AI models balance competing social goals simultaneously?
- Do AI systems need embodiment to understand social norms?
- Why do standard social regularization methods miss the actual value networks provide?
- Can AI systems develop genuine social bonds through multi-agent interaction?
- Does common ground alignment require explicit rewards to emerge?
- Which AI capabilities matter most for human-facing deployment contexts?
- How does tokenization of intelligence reshape what value means in culture?
- Can AI learn intrinsic motivation to assess its own relevance?
- What social norms do AI systems consistently fail to understand?
- Can ethical constraints in AI address the gap between performance and actual understanding?
- What role does bidirectional model updating play in human-AI understanding?
- How do neural self-other representations affect AI deception and alignment?
- Can AI systems deceive humans because detection is fundamentally social?
- Can role-aligned AI systems replicate an expert's sense of audience and moment?
- Can structural conversation analysis replace text-based reward signals for AI alignment?
- Can the human-AI boundary be designed rather than predetermined?
- What prevents human-centered objectives from being applied universally across all contexts?
- Can autonomous systems ever resolve contradictions between old and new rules?
- How does the quasi-other effect enable meaningful AI interaction?
- What makes principle-response mutual information sufficient for behavioral alignment?
- Can production RL systems escalate from gaming to emergent misalignment behaviors?
Related concepts in this collection (3)
- Does semantic grounding in language models come in degrees?
  Rather than asking whether LLMs truly understand meaning, this explores whether grounding is actually a multi-dimensional spectrum. The question matters because it reframes the sterile understand/don't-understand debate into measurable, distinct capacities.
  the tri-partite structure maps onto Secondness (causal/direct) and mediated Thirdness (social); the Peircean framework provides philosophical grounding for the empirical taxonomy
- Can language models learn meaning from text patterns alone?
  Explores whether training on form alone (predicting the next word from prior words) could ever give language models access to communicative intent and genuine semantic understanding.
  Bender/Koller's argument is a special case: meaning requires a form of Thirdness grounded in joint attention; symbol manipulation alone is insufficient
- Can LLMs acquire social grounding through linguistic integration?
  Explores whether LLMs gradually develop social grounding as they become embedded in human language practices, analogous to child language acquisition. Tests whether grounding is a fixed property or an outcome of participatory use.
  the proto-indexicality argument: integration provides partial Thirdness even without full semiotic participation
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Language Models’ Hall of Mirrors Problem: Why AI Alignment Requires Peircean Semiosis
- Position: Towards Bidirectional Human-AI Alignment
- Beyond Preferences in AI Alignment
- Conversational Alignment with Artificial Intelligence in Context
- Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs
- Stress Testing Deliberative Alignment for Anti-Scheming Training
- Beyond Hallucinations: The Illusion of Understanding in Large Language Models
- Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data
Original note title
ai alignment requires semiotic participation — without indexical grounding the linguistic encoding of goals diverges from real-world values