INQUIRING LINE

Why do token-level language models fail at utterance-level pragmatic optimization?

This explores why models trained to predict the next token — a local, moment-by-moment objective — struggle when success depends on the whole utterance or conversation landing right (being clear, asking the right question, achieving a communicative goal).


The short version the corpus keeps circling back to: the training objective and the success criterion live at different altitudes. A language model is, at its core, an autoregressive probability machine that maximizes the likelihood of the next token given what came before (see: Can we predict where language models will fail?). Pragmatic optimization, meaning saying the thing that actually moves a conversation toward what the user wants, is a property of the whole exchange, not of any single token. Nothing in the local objective is looking at that larger target, so it gets sacrificed whenever it conflicts with what's locally probable.
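A toy sketch of that altitude mismatch, under loudly stated assumptions: the "tokens" are phrase-sized chunks for readability, and every probability and utility value below is invented for illustration rather than taken from any cited work. Greedy decoding follows the locally most probable continuation, while the hypothetical utterance-level utility is only defined on the finished utterance.

```python
# Hypothetical next-chunk probabilities (illustrative numbers, not measured).
FIRST = {"Sure,": 0.6, "Which": 0.4}
SECOND = {
    "Sure,": {"here's the answer.": 0.7, "one moment.": 0.3},
    "Which": {"version do you mean?": 0.9, "one?": 0.1},
}

# Hypothetical utterance-level utility: how far the WHOLE utterance moves the
# conversation toward the user's goal. It is undefined for a single chunk.
UTILITY = {
    ("Sure,", "here's the answer."): 0.3,
    ("Sure,", "one moment."): 0.1,
    ("Which", "version do you mean?"): 0.9,
    ("Which", "one?"): 0.5,
}


def greedy_decode():
    """Pick the most probable chunk at each step, never looking at UTILITY."""
    first = max(FIRST, key=FIRST.get)
    second = max(SECOND[first], key=SECOND[first].get)
    return (first, second)


def best_by_utility():
    """Pick the complete utterance with the highest utterance-level utility."""
    return max(UTILITY, key=UTILITY.get)


greedy = greedy_decode()
pragmatic = best_by_utility()
print("greedy:   ", " ".join(greedy))
print("pragmatic:", " ".join(pragmatic))
```

In this toy setup the local objective commits to "Sure," at the first step, which rules out the clarifying question before the utterance-level criterion can ever be consulted.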

The sharpest illustration is in how reward shaping bakes this in. Standard RLHF optimizes for immediate, next-turn helpfulness, which quietly trains models to give a confident answer now rather than ask a clarifying question that would pay off three turns later (see: Why do language models respond passively instead of asking clarifying questions?). The fix in that work is telling: you have to estimate the long-term value of an interaction, explicitly optimizing at the conversation level, before the model will do the pragmatically smart thing. That's the same gap from the other side: utterance-level competence has to be designed in, because token-level (or turn-level) reward won't produce it on its own.
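The difference between next-turn and conversation-level reward can be sketched with a plain discounted sum. This is a simplification swapped in for illustration, not the estimator used in the cited work; the action names, per-turn rewards, and discount factor `gamma` are all hypothetical.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of per-turn rewards, with later turns discounted by gamma**t."""
    return sum(r * gamma**t for t, r in enumerate(rewards))


# Hypothetical per-turn rewards for two opening moves (illustrative only).
# "answer_now": a confident guess scores well immediately, then needs repair.
# "ask_first":  a clarifying question scores poorly now, but pays off later.
TRAJECTORIES = {
    "answer_now": [0.7, 0.2, 0.2],
    "ask_first": [0.3, 0.9, 0.9],
}

# A next-turn objective only sees the first reward of each trajectory.
next_turn_choice = max(TRAJECTORIES, key=lambda a: TRAJECTORIES[a][0])

# A conversation-level objective scores the whole trajectory.
long_term_choice = max(
    TRAJECTORIES, key=lambda a: discounted_return(TRAJECTORIES[a])
)

print("next-turn objective picks:         ", next_turn_choice)
print("conversation-level objective picks:", long_term_choice)
```

With these made-up numbers the two objectives disagree, which is the whole point: the clarifying question only wins once later turns are counted.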

The cost shows up downstream as conversations that derail. When information is revealed gradually, models lock onto a premature guess early and can't recover: a 39% average performance drop across multi-turn settings, with mitigations clawing back only 15–20% of that loss (see: Why do language models fail in gradually revealed conversations?). A pragmatically optimizing speaker would hold off, hedge, or probe; a next-token optimizer commits to the locally fluent continuation and pays for it later. Relatedly, the model doesn't even hold a fixed stance to optimize around. It maintains a superposition of possible characters and samples one at generation time, so there's no stable communicative intent being steered (see: Do large language models actually commit to a single character?).

What makes this feel less like a tuning bug and more like an architectural ceiling is that the same mismatch recurs wherever the goal is procedural rather than next-step. Models don't actually run iterative optimization in latent space; they recognize a problem as template-similar to something seen in training and emit a plausible-looking value, a failure that persists across scale (see: Do large language models actually perform iterative optimization?). Pragmatic optimization is itself iterative (track the goal, evaluate whether the last move helped, adjust), and the corpus suggests that this kind of held-over-time optimization is exactly what next-token prediction replaces with pattern-matching. There's even a mechanistic hint at why: learning concentrates in a small set of high-entropy 'forking' tokens (see: Do high-entropy tokens drive reasoning model improvements?), i.e. the signal that gets refined lives at local decision points, not in utterance-level plans.
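To make "high-entropy forking tokens" concrete, here is a minimal sketch that computes Shannon entropy for each position's next-token distribution and keeps the top fraction. The distributions are invented, the helper name `forking_positions` is hypothetical, and the 20% cutoff simply mirrors the figure quoted in the source note; none of this reproduces the cited paper's actual procedure.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def forking_positions(distributions, top_fraction=0.2):
    """Return indices of the highest-entropy positions (hypothetical helper)."""
    entropies = [entropy(d) for d in distributions]
    k = max(1, round(len(entropies) * top_fraction))
    ranked = sorted(
        range(len(entropies)), key=lambda i: entropies[i], reverse=True
    )
    return sorted(ranked[:k])


# Invented per-position distributions over a 4-token vocabulary.
DISTS = [
    [0.97, 0.01, 0.01, 0.01],      # near-certain continuation
    [0.25, 0.25, 0.25, 0.25],      # genuine fork: uniform, maximal entropy
    [0.90, 0.05, 0.03, 0.02],
    [0.99, 0.005, 0.003, 0.002],
    [0.85, 0.05, 0.05, 0.05],
]

forks = forking_positions(DISTS)
print("forking positions:", forks)
```

Note what the selection criterion looks at: one position's distribution at a time. Nothing in it refers to whether the utterance as a whole achieved anything.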

The deeper reason it can't be prompted away: prompting and in-context steering work only within what the model already is. Strong training priors override the current context when they conflict (see: Why do language models ignore information in their context?), and prompt optimization can reorganize existing knowledge but cannot inject a capability the model lacks (see: Can prompt optimization teach models knowledge they lack?). So if utterance-level pragmatic optimization isn't in the objective, you can't reliably ask for it at inference time. You have to change what's being optimized, which is precisely the move the multi-turn-reward work makes. The thing you didn't know you wanted to know: 'be more pragmatic' isn't a prompt problem, it's an altitude problem in the loss function.


Sources (8 notes)

Can we predict where language models will fail?

By framing LLMs as autoregressive probability machines, researchers predicted tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed predictions like backwards alphabet and letter counting.

Why do language models respond passively instead of asking clarifying questions?

CollabLLM demonstrates that standard RLHF training optimizes for immediate helpfulness, discouraging models from asking clarifying questions or offering multi-turn insights. Multi-turn-aware rewards that estimate long-term interaction value enable active intent discovery and genuine collaboration.

Why do language models fail in gradually revealed conversations?

Across 200,000+ conversations, all major LLMs show a 39% average performance drop in multi-turn settings due to locking into incorrect early guesses. Agent mitigations recover only 15–20% of this loss.

Do large language models actually commit to a single character?

Shanahan's 20-questions test shows LLMs maintain a superposition of consistent objects or characters and sample from that distribution at generation time. Regenerating the same response yields different outputs, each consistent with prior context, proving no fixed commitment exists.

Do large language models actually perform iterative optimization?

Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Can prompt optimization teach models knowledge they lack?

Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a researcher evaluating whether token-level optimization remains a bottleneck for multi-turn pragmatic success in LLMs, or whether recent advances in inference-time steering, training objectives, or model architecture have relaxed this constraint.

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat these as perishable claims:
• Standard RLHF optimizes next-turn helpfulness, not conversation-level value; explicit multi-turn reward estimation is needed to recover pragmatic behavior (arXiv:2602.07338, ~2026).
• Models show 39% average performance drop across multi-turn settings when information is revealed gradually; mitigations recover only 15–20% (arXiv:2505.06120, ~2025).
• Models don't hold stable communicative intent—they sample one of many superposed characters at generation time, so no fixed stance exists to optimize toward (arXiv:2601.03066, ~2026).
• Prompting cannot inject new capability; it can only activate existing knowledge. Strong priors override current context when they conflict (arXiv:2410.12405, ~2024; arXiv:2502.10708, ~2025).
• High-entropy minority tokens are critical forking points; learning concentrates locally, not at utterance-level plans (arXiv:2506.01939, ~2025).

Anchor papers (verify; mind their dates):
• arXiv:2602.07338 (2026) — Intent Mismatch in Multi-Turn Conversation
• arXiv:2505.06120 (2025) — LLMs Get Lost In Multi-Turn Conversation
• arXiv:2410.12405 (2024) — ProSA: Prompt Sensitivity Assessment
• arXiv:2502.10708 (2025) — Domain-Specific Knowledge Injection Survey

Your task:
(1) RE-TEST EACH CONSTRAINT: For every finding above—especially the 39% drop, the superposition claim, and the priority of next-turn over conversation-level rewards—judge whether newer training regimes (e.g., outcome-weighted RL, multi-agent scaffolding), inference-time methods (e.g., tree search, adaptive beam-width), or architectural innovations (e.g., explicit state slots, causal masking of intent) have since RELAXED or OVERTURNED it. Separate the durable question (likely: *can token-level inductive bias be overcome without changing the objective?*) from the perishable limitation (possibly: *multi-turn coherence is now solved by method X*). Cite what resolved it; say plainly where the constraint still holds.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months that claims pragmatic optimization IS achievable at inference time or that the altitude problem is less severe than 2026 papers suggest.
(3) Propose 2 research questions that ASSUME the training/inference regime may have shifted: e.g., *Do intent slots or latent-goal tokens reduce the superposition problem?* or *Can outcome-weighted multi-turn RL be scaled without catastrophic forgetting?*

Cite arXiv IDs; flag anything you cannot ground in a real paper.
