What training on actual interaction would show that text-only training cannot?
This explores what an AI could pick up from training on real back-and-forth interaction — the live, consequential exchange between people — that training on static text alone structurally can't supply.
This reads the question as asking about the gap between learning from text-as-record and learning from interaction-as-event — what kind of competence only shows up when the training signal comes from actual exchanges rather than finished prose. The corpus has a sharp answer: the things interaction would teach are mostly *relational and consequential*, and text strips exactly those out.
Start with what text-only training rewards. The signal is next-token prediction, a form of information encoding, and that is a different objective from the work conversations actually do. "Why don't language models develop conversation maintenance skills?" points out that humans keep talk going through implicit moves (repairing a misunderstood reference, handing off a topic) that sustain the relationship rather than convey facts. Those moves are largely invisible to a model trained to predict information, because they leave little textual trace and carry no informational payload. A related diagnosis comes from "Does AI generate genuine utterances or just text patterns?", which argues that text is the *residue* of communicative events, not the events themselves. A model trained on residue therefore inherits the surface markers of utterances without the event structure (a real addressee, real stakes, a real next turn) that made them utterances. Interaction training would put that event structure back: the model would be answerable to a partner whose response is a consequence, not just a continuation.
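For reference, the next-token objective mentioned above is usually written as a sum of log-probabilities over a token sequence. The notation here ($x_t$ for the token at position $t$, $\theta$ for model parameters, $T$ for sequence length) is the standard textbook form and is not taken from the notes themselves.

$$
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right)
$$

Every term scores a token against the tokens that precede it. Nothing in the sum refers to how a listener reacted, which is one way to see why relational moves that leave little textual trace earn the model almost nothing.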
The deepest version of the claim is about grounding. "Can language models learn meaning from text patterns alone?" (the Bender & Koller argument) holds that meaning lives in the relation between what is said and what is *intended* between participants, which depends on joint attention and which form-to-form prediction cannot reconstruct. "What grounds language understanding in systems without embodiment?" sharpens this into a useful distinction: models already have strong *functional* grounding from language patterns, but weak *social* grounding (participatory agency) and weak *causal* grounding (embodied contact with the world). The interesting wrinkle is that this note says social grounding *can* increase when models are put in the loop with humans, which is precisely what interaction training does, while full linguistic agency would require architectural changes beyond training. So interaction wouldn't fix everything, but it targets a deficit that text alone can't touch.
The most concrete evidence is the gap between simulated and real interaction. "Do simulated training interactions transfer to real conversations?" shows that models which ace programmatic benchmarks, where a simulator trades structured attribute lists, collapse on real dialogue, where people hedge, drift off-topic, and reveal preferences sideways rather than as feature checklists. That is a clean demonstration that learning *about* interaction from tidy, structured exchanges isn't learning interaction. And "Why do language models respond passively instead of asking clarifying questions?" names a specific behavior that only emerges when training looks past a single turn: standard RLHF, optimizing immediate helpfulness, trains models to answer passively instead of asking a clarifying question, whereas rewarding long-horizon interaction value flips that into active intent discovery. Asking a good question to find out what the user actually meant is a quintessential interaction skill, and it is one that turn-by-turn optimization actively suppresses.
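The contrast between rewarding immediate helpfulness and rewarding long-horizon interaction value can be made concrete with a toy sketch. Everything here is an assumption for illustration: the `Candidate` class, both reward functions, the example replies, and the hand-set scores are hypothetical and are not CollabLLM's actual reward model or data.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Candidate:
    """A candidate reply with hypothetical, hand-set scores (illustrative only)."""
    text: str
    immediate_helpfulness: float          # what a single-turn reward would see
    rollout_outcomes: tuple[float, ...]   # final task value in simulated continuations


def single_turn_reward(c: Candidate) -> float:
    # Scores only the current turn and ignores what happens afterwards.
    return c.immediate_helpfulness


def multi_turn_reward(c: Candidate) -> float:
    # Estimates long-horizon value as the mean outcome over sampled continuations.
    return mean(c.rollout_outcomes)


# Assumed numbers: answering at once looks better now, asking first pays off later.
candidates = [
    Candidate("Here is a full itinerary for your trip.", 0.8, (0.3, 0.4, 0.2)),
    Candidate("Is this trip for work or leisure, and what is your budget?", 0.3, (0.9, 0.7, 0.8)),
]

best_now = max(candidates, key=single_turn_reward)
best_long = max(candidates, key=multi_turn_reward)
print("single-turn pick:", best_now.text)
print("multi-turn pick: ", best_long.text)
```

Under these assumed scores the single-turn reward prefers the immediate answer and the multi-turn reward prefers the clarifying question. The point is only that the ranking can flip once the reward looks past the current turn; how the continuations are sampled and scored in practice is outside this sketch.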
Two cautions keep this honest. "Can AI systems learn social norms without embodied experience?" shows that text-trained models can out-predict humans on social appropriateness, yet they all make *identical systematic errors*, marking a boundary that pattern-matching alone may not be able to cross. And "Can controlled latent variables make LLM user simulators realistic?" suggests a partial bridge: conditioning a simulator on latent user profile and intent variables produces synthetic conversations that can be measured as realistic by crowdsourced raters and discriminator models, so some of interaction's value may be recoverable without live humans. The unresolved frontier (see "Are text-only language models fundamentally limited by abstraction?" and "Do large language models genuinely simulate mental states?") is whether the remaining gap is a *data* problem that interaction can close or an *architectural* one that no amount of interaction will. The theory-of-mind work leans toward the latter, since it finds models defaulting to surface strategies even when the training is richer.
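The latent-variable conditioning mentioned above can be sketched in a few lines: one latent is drawn per session (the user profile), another per turn (the user intent), and both are written into the text that would condition the simulator. This is a minimal sketch under stated assumptions, not RecLLM's implementation. The names `PROFILES`, `INTENTS`, `Turn`, and `simulate_session`, the example values, and the prompt wording are all hypothetical, and no language model is actually called.

```python
import random
from dataclasses import dataclass

# Hypothetical latent spaces; the source note does not list concrete values.
PROFILES = ("budget-conscious student", "frequent business traveler")
INTENTS = ("state a preference", "ask for a recommendation", "change the topic")


@dataclass(frozen=True)
class Turn:
    profile: str   # session-level latent: fixed for the whole conversation
    intent: str    # turn-level latent: resampled on every turn
    prompt: str    # text that would condition an LLM simulator (no model is called)


def simulate_session(n_turns: int, seed: int) -> list[Turn]:
    rng = random.Random(seed)               # seeded so a session is reproducible
    profile = rng.choice(PROFILES)          # sampled once per session
    turns = []
    for _ in range(n_turns):
        intent = rng.choice(INTENTS)        # sampled once per turn
        prompt = (
            f"You are a {profile}. In this turn you want to {intent}. "
            "Write the user's next message."
        )
        turns.append(Turn(profile, intent, prompt))
    return turns


session = simulate_session(n_turns=3, seed=0)
for turn in session:
    print(turn.prompt)
```

Keeping the profile fixed while the intent varies is what lets a simulated user stay consistent across a session while still drifting between goals. Whether the resulting conversations are realistic is an empirical question that the sketch does not address.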
Sources (10 notes)
Humans keep conversations smooth through implicit techniques like reference repair and topic hand-off that sustain relational interaction, not convey information. Language models don't develop these because training signals reward information prediction, not relational work.
AI output carries communicative markers inherited from training data but lacks the event structure that produces actual utterances. Users supply the missing orientation through interpretive labor, creating a pseudo-event with structure only on the human side.
Bender & Koller argue that meaning requires the relation between expressions and communicative intents. Since LLMs are trained only on form-to-form prediction with no access to shared attention or intent, they cannot reconstruct the meaning that grounds language.
Language models achieve functional grounding through relational language patterns but lack social grounding through participatory agency and causal grounding through embodied environmental contact. Social grounding can increase through human integration, but linguistic agency requires architectural changes beyond training.
Standard conversational recommender system (CRS) research uses programmatic simulators that exchange structured entity information, not natural language. This creates a false progress signal: models excelling on simulated benchmarks collapse on real dialogue where users hedge, go off-topic, or express preferences conversationally rather than as attribute lists.
CollabLLM demonstrates that standard RLHF training optimizes for immediate helpfulness, discouraging models from asking clarifying questions or offering multi-turn insights. Multi-turn-aware rewards that estimate long-term interaction value enable active intent discovery and genuine collaboration.
GPT-4.5 predicted appropriateness of 555 social scenarios at the 100th percentile compared to human raters, with Gemini and Claude also exceeding 96% accuracy. However, all models show identical systematic errors, revealing boundaries of pattern-based social understanding that embodied experience may still be necessary to cross.
RecLLM demonstrates that conditioning an LLM simulator on session-level (user profile) and turn-level (user intent) latent variables produces synthetic conversations that can be measured as realistic via crowdsourced discrimination, discriminator models, and classifier-ensemble distribution matching.
Text strips the physics, geometry, and causality present in reality, forcing language models to manipulate symbols without grounding in their source dynamics. This creates predictable failure modes in physical, geometric, and causal reasoning that multimodal training could address.
ChangeMyView and FANTOM benchmarks show LLMs fail at authentic perspective-taking in open-ended scenarios, despite succeeding on structured tasks. Hybrid Bayesian architectures that force explicit belief tracking outperform LLM-alone approaches, suggesting the gap is architectural rather than merely training-based.