What role does joint attention play in how humans learn language meaning?
This explores joint attention — the way two people lock onto the same thing in the world together — and what the corpus says about its role in grounding word meaning, mostly by showing what's at stake when machines try to learn meaning without it.
The collection doesn't house a developmental-psychology paper on infants and pointing, but it approaches joint attention as the substrate for learning what words mean from the language-and-meaning side, and the picture it draws is sharper for the detour. The clearest anchor is the idea that meaning isn't transmitted by sharing words; it's negotiated by aligning attention. Speakers have to actively *calibrate* what a word points to, because the same word grounds to different things for different people. Grounding is person-specific, so communication is collaborative repair, not transmission (see "Why do speakers need to actively calibrate shared reference?"). Joint attention is the mechanism that makes that calibration possible: two minds checking that they're locked onto the same referent.
What's striking is that attention shows up as an irreducible layer of comprehension itself, not just a precondition for it. Tracking discourse means simultaneously holding three things: the words, the speaker's intentions, and what's currently salient (where attention is pointed). A failure in the attentional layer breaks understanding even when the words are perfectly parsed (see "How do readers track segments, purposes, and salience together?"). So shared attention isn't only how a child first bolts a word onto an object; it's the live channel that keeps two people meaning the same thing across a whole conversation.
The corpus then runs the natural experiment: what happens to meaning-learning when you remove the world and the shared gaze entirely? Language models turn out to learn a great deal of meaning purely from the relational structure of text: they operationalize Saussure's *langue*, the system of word-to-word relations, with no external referents and no embodiment (see "Can language models learn meaning without engaging the world?"). That's the provocative finding: fluent meaning can emerge from relations alone. But the gap it leaves is exactly the joint-attention gap. Models absorb the symbolic system humans share, yet lack the participatory subjectivity that comes from being socialized into it. They argue without declaring a position or reflecting on their own stance, because they never learned meaning by *participating* with someone (see "Do LLMs develop the same kind of mind as humans?").
The most interesting turn is that this gap may not be permanent or architectural; it may be a matter of participation. Social grounding, on this view, isn't an innate possession but something acquired by playing language games. As LLMs become established conversational partners in actual human practice, they start to develop elementary grounding comparable to a young child's, which makes "do they understand?" a time-indexed question rather than a yes/no one (see "Can LLMs acquire social grounding through linguistic integration?"). That reframes joint attention's role for humans too: meaning is learned by being drawn into a shared practice of pointing, checking, and repairing. Where that practice thins out, even fluent systems lose grounding, the same way preference-tuned models quietly stop asking the clarifying questions that keep two parties aligned (see "Does preference optimization harm conversational understanding?").
The thing you didn't know you wanted to know: meaning may be learnable from pure word-relations *up to a point*, and joint attention is precisely the part it can't reach — the live, two-way calibration of what we're both looking at, which is less a stage of learning than a permanent condition for words to keep meaning the same thing to two people.
Sources (6 notes)
The same words can mean different things to different speakers because referential grounding is person-specific. True communicative grounding demands collaborative negotiation of how language connects to the world, not mere surface-level word sharing.
Discourse processing demands parallel recognition of linguistic segments, intentional structure, and attentional salience—not sequential processing. These three layers constrain each other during comprehension, and failures in any single layer disrupt overall understanding.
Research shows LLMs learn culturally situated discourse patterns by compressing relational structure from text, demonstrating that fluent language generation requires no external referents or embodied grounding.
Both humans and LLMs are shaped by the same intersubjective symbolic system, but only humans develop reflexive agency through socialization. This absence produces measurable differences in how AI argues without declaring its position or reflecting on its own assumptions.
Social grounding is acquired through participation in language games rather than possessed innately. As LLMs become established communicative partners in human linguistic practice, they develop elementary social grounding comparable to young children, making the question of LLM understanding time-indexed.
RLHF optimizes models for single-turn helpfulness by rewarding confident responses over clarifying questions and understanding checks. This preference alignment systematically cuts grounding acts to 77.5% below human levels, creating an alignment tax where models appear helpful but fail silently in multi-turn contexts.