INQUIRING LINE

What training on actual interaction would show that text-only training cannot?

This explores what an AI could pick up from training on real back-and-forth interaction — the live, consequential exchange between people — that training on static text alone structurally can't supply.


This reads the question as asking about the gap between learning from text-as-record and learning from interaction-as-event — what kind of competence only shows up when the training signal comes from actual exchanges rather than finished prose. The corpus has a sharp answer: the things interaction would teach are mostly *relational and consequential*, and text strips exactly those out.

Start with what text-only training rewards. The signal is next-token prediction (information encoding), and that is a different objective from the work conversations actually do. "Why don't language models develop conversation maintenance skills?" points out that humans keep talk going through implicit moves, such as repairing a misunderstood reference or handing off a topic, that sustain the relationship rather than convey facts. Those moves are all but invisible to a model trained to predict information, because they leave little textual trace and carry no informational payload. A related diagnosis: "Does AI generate genuine utterances or just text patterns?" argues that text is the *residue* of communicative events, not the events themselves, so a model trained on residue inherits the surface markers of utterances without the event structure (a real addressee, real stakes, a real next turn) that made them utterances. Interaction training would put that event structure back: the model would be answerable to a partner whose response is a consequence, not just a continuation.

The deepest version of the claim is grounding. "Can language models learn meaning from text patterns alone?" (the Bender & Koller argument) holds that meaning lives in the relation between what's said and what's *intended* between participants (joint attention), a relation that form-to-form prediction can never reconstruct. "What grounds language understanding in systems without embodiment?" sharpens this into a useful distinction: models already have strong *functional* grounding from language patterns, but weak *social* grounding (participatory agency) and weak *causal* grounding (embodied contact with the world). The interesting wrinkle is that this note says social grounding *can* increase through being put in the loop with humans, which is precisely what interaction training is, while causal grounding may need architecture, not just data. So interaction wouldn't fix everything, but it targets exactly the deficit text can't touch.

The most concrete evidence is the simulation-vs-real gap. "Do simulated training interactions transfer to real conversations?" shows that models which ace programmatic benchmarks, where a simulator trades structured attribute lists, collapse on real dialogue where people hedge, drift off-topic, and reveal preferences sideways rather than as feature checklists. That's a clean demonstration that learning *about* interaction from clean text isn't learning interaction. And "Why do language models respond passively instead of asking clarifying questions?" names a specific behavior that only emerges when training looks past a single turn: standard RLHF, optimizing immediate helpfulness, trains models to answer passively instead of asking a clarifying question; rewarding long-horizon interaction value flips that into active intent discovery. Asking a good question to find out what you actually meant is a quintessential interaction skill, and it's one that turn-by-turn text optimization actively suppresses.
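The contrast between the two objectives can be made concrete with a toy calculation. The sketch below is illustrative only: the function names, the per-turn reward numbers, and the discount factor are all assumptions invented for this example, and it is not the reward used in CollabLLM or in any other cited work. It shows how an opening move that looks worse on immediate reward can win once later turns are counted.

```python
# Toy illustration only: every number and name here is a made-up assumption,
# not a value or an API from any paper.

def single_turn_score(turn_rewards):
    """Score an opening move by its immediate (first-turn) reward only."""
    return turn_rewards[0]


def multi_turn_score(turn_rewards, discount=0.9):
    """Score an opening move by the discounted sum of rewards over all turns."""
    return sum(r * discount**t for t, r in enumerate(turn_rewards))


# Hypothetical per-turn rewards that follow each opening move.
# "answer_now": a plausible answer right away, then turns spent on corrections.
# "ask_first": a clarifying question scores low at first, then pays off.
rollouts = {
    "answer_now": [0.6, 0.2, 0.2],
    "ask_first": [0.1, 0.9, 0.9],
}

best_single = max(rollouts, key=lambda k: single_turn_score(rollouts[k]))
best_multi = max(rollouts, key=lambda k: multi_turn_score(rollouts[k]))

print("single-turn objective prefers:", best_single)
print("multi-turn objective prefers:", best_multi)
```

With these invented numbers, scoring only the first turn prefers answering immediately, while the discounted sum over the whole conversation prefers asking first. The point is the ordering flip, not the particular values.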

Two cautions keep this honest. "Can AI systems learn social norms without embodied experience?" shows text-trained models can out-predict humans on social appropriateness, yet the models all make *identical systematic errors*, marking a boundary that pattern-matching alone may not be able to cross. And "Can controlled latent variables make LLM user simulators realistic?" suggests a partial bridge: conditioning a simulator on latent user profile and intent variables produces synthetic conversations realistic enough to fool discriminators, so some of interaction's value may be recoverable without live humans. The unresolved frontier (see "Are text-only language models fundamentally limited by abstraction?" and "Do large language models genuinely simulate mental states?") is whether the remaining gap is a *data* problem that interaction can close, or an *architectural* one that no amount of interaction will. The theory-of-mind work leans toward the latter: it finds models defaulting to surface strategies even when the training is richer.
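The latent-variable idea mentioned above can be sketched as a data structure. This is a minimal illustration under stated assumptions, not RecLLM's implementation: the class name, the intent labels, the prompt wording, and the example persona are all hypothetical, and no language model is called. It only shows the two levels of conditioning, a session-level profile that stays fixed and a turn-level intent that is resampled each turn.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class UserProfile:
    """Session-level latent variable: fixed for the whole conversation."""
    persona: str
    preferences: tuple


# Hypothetical turn-level intents; a real taxonomy would come from data.
INTENTS = ("state_preference", "ask_for_options", "hedge", "go_off_topic")


def sample_intent(rng):
    """Turn-level latent variable: resampled on every simulated user turn."""
    return rng.choice(INTENTS)


def build_simulator_prompt(profile, intent, history):
    """Assemble the conditioning text a simulator model would receive."""
    lines = [
        f"You are role-playing a user. Persona: {profile.persona}.",
        f"Stable preferences: {', '.join(profile.preferences)}.",
        f"Intent for this turn: {intent}.",
        "Conversation so far:",
        *history,
        "Write the user's next message.",
    ]
    return "\n".join(lines)


rng = random.Random(0)  # seeded so the sampled intent is reproducible
profile = UserProfile(
    persona="busy parent planning a film night",
    preferences=("comedy", "under two hours"),
)
intent = sample_intent(rng)
prompt = build_simulator_prompt(
    profile, intent, ["Assistant: What are you in the mood for?"]
)
print(prompt)
```

Keeping the profile immutable and the intent per-turn mirrors the session-level versus turn-level split described in the note; whether conversations generated this way are realistic is an empirical question the sketch does not address.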


Sources (10 notes)

Why don't language models develop conversation maintenance skills?

Humans keep conversations smooth through implicit techniques like reference repair and topic hand-off that sustain relational interaction, not convey information. Language models don't develop these because training signals reward information prediction, not relational work.

Does AI generate genuine utterances or just text patterns?

AI output carries communicative markers inherited from training data but lacks the event structure that produces actual utterances. Users supply the missing orientation through interpretive labor, creating a pseudo-event with structure only on the human side.

Can language models learn meaning from text patterns alone?

Bender & Koller argue that meaning requires the relation between expressions and communicative intents. Since LLMs are trained only on form-to-form prediction with no access to shared attention or intent, they cannot reconstruct the meaning that grounds language.

What grounds language understanding in systems without embodiment?

Language models achieve functional grounding through relational language patterns but lack social grounding through participatory agency and causal grounding through embodied environmental contact. Social grounding can increase through human integration, but linguistic agency requires architectural changes beyond training.

Do simulated training interactions transfer to real conversations?

Standard CRS research uses programmatic simulators that exchange structured entity information, not natural language. This creates a false progress signal: models excelling on simulated benchmarks collapse on real dialogue where users hedge, go off-topic, or express preferences conversationally rather than as attribute lists.

Why do language models respond passively instead of asking clarifying questions?

CollabLLM demonstrates that standard RLHF training optimizes for immediate helpfulness, discouraging models from asking clarifying questions or offering multi-turn insights. Multi-turn-aware rewards that estimate long-term interaction value enable active intent discovery and genuine collaboration.

Can AI systems learn social norms without embodied experience?

GPT-4.5 predicted appropriateness of 555 social scenarios at the 100th percentile compared to human raters, with Gemini and Claude also exceeding 96% accuracy. However, all models show identical systematic errors, revealing boundaries of pattern-based social understanding that embodied experience may still be necessary to cross.

Can controlled latent variables make LLM user simulators realistic?

RecLLM demonstrates that conditioning an LLM simulator on session-level (user profile) and turn-level (user intent) latent variables produces synthetic conversations measurable as realistic via crowdsource discrimination, discriminator models, and classifier-ensemble distribution matching.

Are text-only language models fundamentally limited by abstraction?

Text strips the physics, geometry, and causality present in reality, forcing language models to manipulate symbols without grounding in their source dynamics. This creates predictable failure modes in physical, geometric, and causal reasoning that multimodal training could address.

Do large language models genuinely simulate mental states?

ChangeMyView and FANTOM benchmarks show LLMs fail at authentic perspective-taking in open-ended scenarios, despite succeeding on structured tasks. Hybrid Bayesian architectures that force explicit belief tracking outperform LLM-alone approaches, suggesting the gap is architectural rather than merely training-based.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing whether text-only training's claimed limits still hold. The precise question remains open: what does actual interactive training reveal that text-only training cannot?

What a curated library found — and when (dated claims, not current truth):
Findings span May 2023–April 2026. The library identified these constraints on text-only models:
• Conversation-maintenance moves (repair, topic hand-off) leave minimal textual trace and carry no informational payload; text-only training optimizes next-token prediction, not relational work (2023–2024).
• Text is residue of communicative *events*, not events themselves; models inherit surface markers without event structure (real addressee, real stakes, real next turn) (2023–2024).
• Social grounding (participatory agency) and causal grounding (embodied contact) remain weak; social grounding *can* improve via human-in-the-loop, but causal may require architecture, not data (2024–2025).
• Simulated entity-level interactions (structured attributes) fail to generalize to real dialogue where people hedge, drift, and reveal preferences indirectly (2023–2024).
• Single-turn RLHF suppresses active clarification-seeking; multi-turn-aware reward optimization flips models toward intent discovery (2026).

Anchor papers (verify; mind their dates):
• arXiv:2307.16689 (Jul 2023) — third-position repair in conversation.
• arXiv:2402.10992 (Feb 2024) — semantic grounding limits.
• arXiv:2602.07338 (Feb 2026) — intent mismatch in multi-turn dialogue.
• arXiv:2510.14665 (Oct 2025) — illusion of understanding.

Your task:
(1) RE-TEST EACH CONSTRAINT. For every claim above, judge whether newer models (o1, Claude 3.7+, etc.), in-context learning over longer horizons, agentic multi-turn orchestration with memory/caching, live user simulators with latent intent, or richer evals have since RELAXED or OVERTURNED it. Separate the durable question (*can text alone capture interaction's relational logic?*) from perishable limitations (e.g., single-turn RLHF as an implementation choice). Cite what resolved each constraint, and state plainly where it still holds.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. If synthetic user simulators (arXiv:2507.20152) now close the interaction gap without live humans, say so with evidence. If multimodal pretraining (arXiv:2603.03276) shifts the debate, make that explicit.
(3) Propose 2 research questions that ASSUME the regime may have moved: e.g., "Given in-context intent scaffolding, is the remaining gap architectural (transformer inductive bias) rather than data-driven?" or "Can models trained on *annotated intent traces* of real dialogue match the generalization of live interaction without a human in the loop?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
