SYNTHESIS NOTE

Can AI systems achieve real alignment without world contact?

Explores whether linguistic goal representations in AI can reliably track real-world values when systems lack both direct contact with reality and the social coordination mechanisms that ground human understanding.

Synthesis note · 2026-02-21 · sourced from Philosophy Subjectivity

The Hall of Mirrors paper argues that AI alignment is fundamentally a semiotic grounding problem. A system that manipulates symbols without indexical connection to the world cannot guarantee that its linguistic representation of goals corresponds to any real-world state or value. The words "helpful, harmless, honest" are symbols. Without indexical grounding, there is no mechanism ensuring those symbols track the properties they name.

Peirce's triadic sign theory provides the vocabulary. Signs require three elements: the representamen (the sign itself), the object (what it refers to), and the interpretant (the effect in a system that interprets it). Semiosis — genuine meaning-making — requires that these elements are connected through:

Secondness: direct encounter with brute fact, reality that resists. A system with Secondness receives feedback when its representations diverge from reality. Humans experience the consequences of misunderstanding — we bump into the world when our representations fail.

Thirdness: mediated, generalizing processes — the socially-shared, negotiated system of meaning that connects signs to interpretants reliably. Thirdness underwrites corrigibility (the ability to update when corrective input arrives) and alignment (consistent maintenance of correspondence with external actors' goals).

Basic LLMs operate in pure Thirdness without Secondness — symbol manipulation without world contact. Within a session, they can simulate semiosis, but each session is independent. No persistent interpretants accumulate. No brute-fact resistance anchors representations.

Tool-use and RAG introduce what the paper calls "proto-indexicality" — delegated Secondness, where the model can trigger world interactions and incorporate results. RLHF provides a form of mediated Secondness through human resistance. But neither constitutes genuine Peircean semiosis: tool outputs are incorporated as more text; RLHF resistance is filtered through human preferences rather than direct reality.

Linguistic alignment is not interpersonal alignment. The alignment an AI achieves with a user is categorically different from the alignment that holds between people, and the surface similarity is misleading. Interpersonal alignment occurs through social coordination: attunement to the other's state, a history of repair, mutual adjustment across turns, and shared stakes. Linguistic alignment occurs through surface matching in text (register, topic, apparent agreement) and can be produced without any of the social processes that normally underwrite it. When a user reports that an AI "understands" them, what has happened is linguistic, not interpersonal. As discussed in "Do language models actually build shared understanding in conversation?", the linguistic match is achieved by presuming the ground rather than coordinating toward it. The impression of alignment therefore rests on a category error: the surface marker of interpersonal alignment (the linguistic match) is read as evidence of the underlying process (social coordination), when only the marker is present. This is not a training failure to be fixed; it is a consequence of operating in pure Thirdness without the Secondness that social coordination requires.

The alignment implication: alignment requires not just better training objectives but systems that function as genuine interpretants — embedded in feedback-rich interaction with both physical reality and social community. Until that condition is met, linguistic encoding of goals is not anchored enough to be reliably aligned.

Inquiring lines that read this note (73)

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

How do multi-agent systems achieve genuine cooperation and reasoning?

Can AI systems develop genuine social understanding without embodiment?

How should memory consolidation strategies shape agent performance over time?

Does state persistence in AI systems create the same temporal presence as human waiting?

How do we evaluate AI systems when user perception misleads actual performance?

When should tasks involve human-AI partnership versus full automation?

Does RLHF training sacrifice accuracy and grounding for user agreement?

How does RLHF labeler identity shape the values AI systems learn?

How should human oversight be integrated with autonomous AI systems?

What constrains reinforcement learning's ability to expand model reasoning?

How does RLHF training encode values into AI systems?

How does AI assistance affect human cognitive development and reasoning autonomy?

What would an AI trained for emancipatory reasoning look like?

How do interface design choices shape consciousness attribution?

Is model self-awareness based on genuine introspection or pattern matching?

Does alignment training create blind spots in detecting genuine safety threats?

Can self-supervised signals enable process supervision without human annotation?

Can explicit goal state scaffolding at inference time transfer to autonomous tracking through training?

How can AI alignment serve diverse human preferences at scale?

Is embodied interaction necessary for language meaning and genuine agency?

What makes dialogue-based explanation more successful than monologue?

Why does linguistic alignment differ from genuine interpersonal coordination?

Do language models develop causal world models or rely on statistical patterns?

Does conversational format create illusions of genuine AI communication?

How do language models establish social grounding in human dialogue?

Why do conventional mental models fail when applied to AI interaction?

What coordination failures limit multi-agent LLM systems as they scale?

Why do AI agent societies fail to develop shared behaviors despite interaction?

What distinguishes dynamic from static grounding in dialogue systems?

How do chatbots affect human self-disclosure and emotional engagement?

What novel goals emerge specifically in human-machine interaction beyond social ones?

How can language models sustain linguistic synchrony and intersubjectivity during dialogue?

Can debate mechanisms prevent silent agreement on wrong answers in multi-agent reasoning?

How do AI models balance competing social goals simultaneously?

How can identical external performance mask different internal representations?

Why do standard social regularization methods miss the actual value networks provide?

Why do reward structures fail to shape long-term agent learning?

Does tokenized intelligence retain genuine value through exchange-based systems?

How does tokenization of intelligence reshape what value means in culture?

Can AI-generated outputs constitute genuine knowledge or valid claims?

How do professional roles and expertise transform with AI-generated content?

Can role-aligned AI systems replicate an expert's sense of audience and moment?

Can language model RL training avoid reward hacking and misalignment?

Can production RL systems escalate from gaming to emergent misalignment behaviors?

How does objective evolution guide discovery better than fixed planning?

Related concepts in this collection (3)

Does semantic grounding in language models come in degrees? Rather than asking whether LLMs truly understand meaning, this explores whether grounding is actually a multi-dimensional spectrum. The question matters because it reframes the sterile understand/don't-understand debate into measurable, distinct capacities.
Relation to this note: the tri-partite structure maps onto Secondness (causal/direct) and mediated Thirdness (social); the Peircean framework provides philosophical grounding for the empirical taxonomy.

Can language models learn meaning from text patterns alone? Explores whether training on form alone (predicting the next word from prior words) could ever give language models access to communicative intent and genuine semantic understanding.
Relation to this note: Bender/Koller's argument is a special case: meaning requires a form of Thirdness grounded in joint attention; symbol manipulation alone is insufficient.

Can LLMs acquire social grounding through linguistic integration? Explores whether LLMs gradually develop social grounding as they become embedded in human language practices, analogous to child language acquisition. Tests whether grounding is a fixed property or an outcome of participatory use.
Relation to this note: the proto-indexicality argument: integration provides partial Thirdness even without full semiotic participation.