SYNTHESIS NOTE

Can models learn to abstain when uncertain about predictions?

Explores whether language models can be trained to recognize when they lack sufficient information to forecast conversation outcomes, rather than forcing uncertain predictions into confident-sounding responses.

Synthesis note · 2026-02-22 · sourced from Conversation Architecture Structure

Generating a single plausible next-utterance is not the same as modeling the uncertainty about ALL possible next-utterances in a calibrated way. In negotiations, "Sounds good!" and "No thanks" may be equally fluent/topical/informative responses, but one may be more likely given the goals, beliefs, and emotions of the interlocutors.

FortUne Dial formalizes this as conversation uncertainty modeling, shifting evaluation from pure accuracy to uncertainty-aware metrics that enable abstention on individual instances. When the model estimates high uncertainty about an outcome, it should say "I don't know" rather than forcing a prediction.
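The abstention rule described above can be sketched as a simple threshold on forecast entropy. This is a minimal illustration, not the FortUne Dial implementation: the function names, the binary-outcome framing, and the 0.9-bit threshold are all assumptions chosen for the example.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a forecast p = P(outcome happens)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def forecast_or_abstain(p: float, max_entropy_bits: float = 0.9) -> str:
    """Return a prediction, or abstain when the forecast is too uncertain.

    max_entropy_bits is a hypothetical threshold; in practice it would be
    tuned on held-out conversations.
    """
    if binary_entropy(p) > max_entropy_bits:
        return "I don't know"
    return "outcome likely" if p >= 0.5 else "outcome unlikely"
```

With this threshold, a forecast of 0.5 or 0.6 leads to abstention, while 0.9 or 0.1 yields a prediction. Lowering the threshold trades coverage for reliability on the instances that are still answered.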

Two representations of uncertainty:

Internal — using model scores (logits, probabilities) as uncertainty estimates
Direct — using generated tokens to express probability assessments
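The two representations can be contrasted in a short sketch. It assumes a binary outcome whose answer tokens ("yes" and "no") have known logits, and a model that writes its estimate as a percentage or decimal; both helper names are hypothetical and no particular model API is implied.

```python
import math
import re
from typing import Optional


def internal_probability(logit_yes: float, logit_no: float) -> float:
    """Internal: softmax over the logits of the two answer tokens."""
    m = max(logit_yes, logit_no)  # subtract the max for numerical stability
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)


def direct_probability(generated_text: str) -> Optional[float]:
    """Direct: parse a probability the model wrote out, e.g. '70%' or '0.7'."""
    percent = re.search(r"(\d+(?:\.\d+)?)\s*%", generated_text)
    if percent:
        value = float(percent.group(1)) / 100.0
    else:
        decimal = re.search(r"\b(0(?:\.\d+)?|1(?:\.0+)?)\b", generated_text)
        if not decimal:
            return None  # the model expressed no usable number
        value = float(decimal.group(1))
    return value if 0.0 <= value <= 1.0 else None
```

The internal estimate is always available but may be poorly calibrated. The direct estimate depends on the model producing a parseable number, so the parser returns `None` rather than guessing when it cannot find one.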

Two fine-tuning strategies improve calibration:

Traditional supervision — standard supervised fine-tuning with calibration objectives
Off-policy RL — reinforcement learning strategy for calibration
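The note does not specify which calibration objective or reward is used. One common choice, shown here purely as an assumption, is a proper scoring rule such as the Brier score: it can serve as a supervised loss, and its complement can serve as a per-example reward. Both function names are hypothetical.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecasts and 0/1 outcomes (lower is better)."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def calibration_reward(prob: float, outcome: int) -> float:
    """Hypothetical per-example reward in [0, 1]: one minus the squared error."""
    return 1.0 - (prob - outcome) ** 2
```

Under this reward, a hedged forecast of 0.5 on an outcome that does not occur earns 0.75, while a confident forecast of 1.0 earns 0.0. Unlike a binary correctness reward, this gives the model an incentive to report honest uncertainty.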

The practical result: smaller open-source models, once calibrated, can compete with pre-trained models 10x their size on uncertainty-aware forecasting. This suggests that calibration ability is undertrained in standard LLMs — the capability exists but the training signal is absent.

Applications include: studying effects of strategy and social structure in negotiations, intervening to improve human and machine conversations, and assessing trust/heterogeneity in data sources via entropy metrics.

Real-world deployment evidence from CRAFT: When the CRAFT conversational forecasting model was deployed as a prototype moderation tool for Wikipedia editors, moderator feedback revealed critical design dimensions. Score change (trajectory) was more actionable than absolute score: moderators preferred seeing whether a conversation was trending toward derailment rather than a static risk number.

Crucially, moderator confidence in predicting derailment varied dramatically: four of nine participants believed they could forecast in any Wikipedia context, four others only in very specific contexts with low confidence, and one only for personally-known participants on familiar topics. This variance means forecasting tools must accommodate heterogeneous human expertise rather than assuming uniform detection ability.

A further missing dimension is conversation age. Moderators reported that inactive conversations (more than 2-3 days since the last comment) are unlikely to revive, much less turn uncivil, but the prototype did not surface this temporal signal. The scale problem is stark: even topic-engaged moderators cannot proactively monitor all at-risk conversations, forcing them to rely on random discovery strategies.
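The two signals moderators asked for, score trajectory and conversation age, can be sketched as follows. This is not the CRAFT prototype's code: the `ConversationSnapshot` type, the three-comment window, and the three-day staleness cutoff (taken from the upper end of the moderators' 2-3 day estimate) are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class ConversationSnapshot:
    risk_scores: List[float]   # forecast after each comment, oldest first
    last_comment_at: datetime  # timestamp of the most recent comment


def risk_trend(snapshot: ConversationSnapshot, window: int = 3) -> float:
    """Change in risk over the last `window` comments (positive = worsening)."""
    scores = snapshot.risk_scores
    if len(scores) < 2:
        return 0.0
    start = scores[max(0, len(scores) - 1 - window)]
    return scores[-1] - start


def is_stale(snapshot: ConversationSnapshot, now: datetime,
             max_idle: timedelta = timedelta(days=3)) -> bool:
    """True when the conversation has been idle longer than the cutoff."""
    return now - snapshot.last_comment_at > max_idle
```

A moderation queue built on these helpers could rank active conversations by `risk_trend` and drop any for which `is_stale` is true, rather than sorting by the latest absolute score.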

As the note "Does reasoning fine-tuning make models worse at declining to answer?" argues, calibrated uncertainty and appropriate abstention are capabilities that current reasoning-focused training actively degrades. And as "Does training objective determine which direction models fail at abstention?" argues, the direction of calibration failure depends on the training regime: a forecasting system built on reasoning-trained models would over-predict, while one built on safety-trained models would refuse to predict. Conversation forecasting must avoid both failure modes: it requires models that know what they don't know about where a conversation is heading.

Additional empirical domain — Instagram hostility forecasting: A separate forecasting study on Instagram demonstrates that hostile comments can be predicted from early conversational signals: AUC 0.82 for predicting hostility presence 10+ hours in the future, and AUC 0.91 for predicting whether a post will receive more than 10 hostile comments vs. only one. Predictive features include the post author's history of receiving hostile comments, user-directed profanity, number of distinct participants, and hostility trends in the conversation so far. This complements the CRAFT deployment evidence above — different platform, similar principle: early conversational dynamics carry forecastable signal about future trajectory.
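For readers interpreting the AUC figures above: AUC is the probability that a randomly chosen positive example (here, a post that later receives hostile comments) is scored higher than a randomly chosen negative one. The sketch below computes it directly from that definition. It is a generic illustration with a hypothetical function name, not the study's evaluation code.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking. AUC measures ranking quality only, so a model with a high AUC can still be poorly calibrated.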

Related concepts in this collection 8

Does reasoning fine-tuning make models worse at declining to answer?
When models are trained to reason better, do they lose the ability to say 'I don't know'? This matters for high-stakes applications like medical and legal AI that depend on appropriate uncertainty.
reasoning training degrades exactly the abstention capability conversation forecasting needs

Why do language models fail confidently in specialized domains?
LLMs perform poorly on clinical and biomedical inference tasks while remaining overconfident in their wrong answers. Do standard benchmarks hide this fragility, and can prompting techniques fix it?
overconfidence is the complementary failure to poor calibration

Does binary reward training hurt model calibration?
Explores whether the standard correctness-based reward in RL training creates incentives for overconfident predictions, and what structural problem causes calibration to degrade during optimization.
calibration fix for RL applies to dialogue forecasting

Does training objective determine which direction models fail at abstention?
Calibration failures might not be universal: different training approaches could push models toward opposite extremes of refusing or overconfidently answering. Understanding whether the training objective, not just model capability, drives these failures could reshape how we think about fixing them.
specifies how training objectives differentially break forecasting calibration: reasoning-trained forecasters would over-predict, safety-trained would over-refuse

Can conversation structure predict dialogue success better than content?
Does the geometric shape of how dialogue unfolds (timing, repetition, topic drift) matter as much as what people actually say? This explores whether interactive patterns hold signals hidden in word choice alone.
TRACE measures trajectory retrospectively for reward; forecasting uses trajectory prospectively for prediction; same underlying principle that conversation shape carries outcome signal

Can opening politeness patterns predict whether conversations will turn hostile?
Do pragmatic politeness features in first exchanges (hedging, greetings, indirectness) reliably signal whether a conversation will later derail into personal attacks? Understanding early linguistic markers could help identify and prevent online hostility.
politeness strategies identify WHICH early features predict trajectory; forecasting provides HOW to quantify confidence in those predictions

Why do LLM judges fail at predicting sparse user preferences?
When LLMs judge user preferences based on limited persona information, what causes their predictions to become unreliable? Understanding persona sparsity's role in judgment failure could improve personalization systems.
the same calibrated abstention pattern: personalized judges that express uncertainty on sparse persona inputs achieve 80%+ reliability on high-certainty samples, paralleling how calibrated forecasting models improve by abstaining when uncertain rather than forcing predictions

Why do users drift away from their original information need?
When users know their knowledge is incomplete but cannot articulate what's missing, do they unintentionally shift topics? And can real-time systems detect this drift?
ASK-driven topic drift is a specific conversational trajectory that calibrated forecasting should detect: users in an anomalous knowledge state produce drift patterns that are detectable with 84% precision, providing a concrete forecasting target for conversation trajectory prediction
