Can models learn to abstain when uncertain about predictions?
Explores whether language models can be trained to recognize when they lack sufficient information to forecast conversation outcomes, rather than forcing uncertain predictions into confident-sounding responses.
Generating a single plausible next-utterance is not the same as modeling the uncertainty about ALL possible next-utterances in a calibrated way. In negotiations, "Sounds good!" and "No thanks" may be equally fluent/topical/informative responses, but one may be more likely given the goals, beliefs, and emotions of the interlocutors.
FortUne Dial formalizes this as conversation uncertainty modeling, shifting evaluation from pure accuracy to uncertainty-aware metrics that enable abstention on individual instances. When the model estimates high uncertainty about an outcome, it should say "I don't know" rather than forcing a prediction.
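The abstention rule described above can be sketched as a selective predictor for a binary outcome. This is a minimal illustration, not FortUne Dial's actual procedure: the function names and the `margin` parameter (the half-width of the "I don't know" band around 0.5) are assumptions introduced here.

```python
from typing import List, Optional, Sequence, Tuple


def forecast_or_abstain(p_outcome: float, margin: float = 0.2) -> Optional[bool]:
    """Return True/False for a binary outcome, or None to abstain.

    p_outcome: the model's estimated probability that the outcome occurs.
    margin: hypothetical half-width of the abstention band around 0.5.
    """
    if not 0.0 <= p_outcome <= 1.0:
        raise ValueError("p_outcome must be a probability in [0, 1]")
    if abs(p_outcome - 0.5) < margin:
        return None  # too uncertain: say "I don't know"
    return p_outcome > 0.5


def selective_accuracy(
    probs: Sequence[float], labels: Sequence[bool], margin: float = 0.2
) -> Tuple[Optional[float], float]:
    """Accuracy on answered instances, plus coverage (fraction answered)."""
    decisions: List[Optional[bool]] = [forecast_or_abstain(p, margin) for p in probs]
    answered = [(d, y) for d, y in zip(decisions, labels) if d is not None]
    coverage = len(answered) / len(probs) if probs else 0.0
    if not answered:
        return None, coverage
    correct = sum(1 for d, y in answered if d == y)
    return correct / len(answered), coverage
```

Reporting accuracy together with coverage shows the trade-off: a wider margin usually raises accuracy on the answered instances while answering fewer of them.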
Two representations of uncertainty:
- Internal — using model scores (logits, probabilities) as uncertainty estimates
- Direct — using generated tokens to express probability assessments
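The two representations can be contrasted in a short sketch. It assumes the internal route has access to logits for a small set of candidate outcome tokens, and that the direct route asks the model to write a probability in its answer. Both function names and the parsing rules are illustrative assumptions, not an API from the source.

```python
import math
import re
from typing import Dict, Optional


def internal_probability(logits: Dict[str, float], outcome: str) -> float:
    """Internal: softmax over the logits of candidate outcome tokens."""
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - top) for tok, v in logits.items()}
    return exps[outcome] / sum(exps.values())


def direct_probability(generated: str) -> Optional[float]:
    """Direct: parse a verbalized probability such as '70%' or '0.7'."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*%", generated)
    if match:
        value = float(match.group(1)) / 100.0
    else:
        match = re.search(r"\b(0(?:\.\d+)?|1(?:\.0+)?)\b", generated)
        if not match:
            return None  # the model did not state a usable probability
        value = float(match.group(1))
    return value if 0.0 <= value <= 1.0 else None
```

The internal estimate needs access to model scores, while the direct estimate works with any text-only interface but depends on the model stating a parseable number.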
Two fine-tuning strategies improve calibration:
- Traditional supervision — standard supervised fine-tuning with calibration objectives
- Off-policy RL — reinforcement learning strategy for calibration
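The note does not state which calibration objective or reward is used. As an assumption for illustration, the sketch below uses the Brier score, a standard proper scoring rule: it can serve as a supervised loss, and its negation as a reward signal. The function names are hypothetical.

```python
def brier_score(p: float, outcome: int) -> float:
    """Squared error between forecast probability p and outcome (0 or 1)."""
    return (p - outcome) ** 2


def calibration_reward(p: float, outcome: int) -> float:
    """Hypothetical RL reward: higher is better, maximal when p equals outcome."""
    return -brier_score(p, outcome)


def expected_brier(p: float, q: float) -> float:
    """Expected Brier score of forecast p when the true event probability is q."""
    return q * brier_score(p, 1) + (1 - q) * brier_score(p, 0)
```

Because the Brier score is a proper scoring rule, the expected score is minimized by reporting the true probability. A confident wrong forecast is penalized more than a hedged one, which is the incentive a binary correctness reward lacks.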
The practical result: smaller open-source models, once calibrated, can compete with pre-trained models 10x their size on uncertainty-aware forecasting. This suggests that calibration ability is undertrained in standard LLMs — the capability exists but the training signal is absent.
Applications include: studying effects of strategy and social structure in negotiations, intervening to improve human and machine conversations, and assessing trust/heterogeneity in data sources via entropy metrics.
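The source does not define its entropy metrics. One plausible reading, sketched here as an assumption, is to average the entropy of the forecast distribution over the conversations from each data source: sources whose outcomes the model finds hard to predict show higher mean entropy. The function names and the `(source, probability)` input format are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def binary_entropy(p: float) -> float:
    """Entropy in bits of a binary forecast with probability p."""
    if p in (0.0, 1.0):
        return 0.0  # a certain forecast carries no uncertainty
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def mean_entropy_by_source(
    forecasts: Iterable[Tuple[str, float]]
) -> Dict[str, float]:
    """Average forecast entropy per data source."""
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for source, p in forecasts:
        totals[source] += binary_entropy(p)
        counts[source] += 1
    return {source: totals[source] / counts[source] for source in totals}
```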
Real-world deployment evidence from CRAFT: When the CRAFT conversational forecasting model was deployed as a prototype moderation tool for Wikipedia editors, moderator feedback revealed critical design dimensions. Score change (trajectory) was more actionable than absolute score — moderators preferred seeing whether a conversation was trending toward derailment rather than a static risk number. Crucially, moderator confidence in predicting derailment varied dramatically: four of nine participants believed they could forecast in any Wikipedia context, four others only in very specific contexts with low confidence, and one only for personally-known participants on familiar topics. This variance means forecasting tools must accommodate heterogeneous human expertise rather than assuming uniform detection ability. A further missing dimension: conversation age. Moderators reported that inactive conversations (>2-3 days since last comment) are unlikely to revive, much less turn uncivil — but the prototype did not surface this temporal signal. The scale problem is stark: even topic-engaged moderators cannot proactively monitor all at-risk conversations, forcing them to rely on random discovery strategies.
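Two of the moderator findings, trajectory over absolute score and the conversation-age signal, can be combined in a small triage sketch. This is not the CRAFT prototype's logic: the function names and the `window`, `min_rise`, and `max_idle` thresholds are assumptions chosen for illustration, with `max_idle` loosely based on the 2-3 day figure moderators reported.

```python
from datetime import datetime, timedelta
from typing import Sequence


def risk_trend(scores: Sequence[float], window: int = 3) -> float:
    """Change in risk score over the last `window` scores (positive = rising)."""
    if len(scores) < 2:
        return 0.0
    recent = scores[-window:]
    return recent[-1] - recent[0]


def should_surface(
    scores: Sequence[float],
    last_comment_at: datetime,
    now: datetime,
    min_rise: float = 0.15,              # hypothetical threshold
    max_idle: timedelta = timedelta(days=2),  # hypothetical threshold
) -> bool:
    """Flag a conversation only if it is still active and trending upward."""
    if now - last_comment_at > max_idle:
        return False  # inactive conversations are unlikely to revive
    return risk_trend(scores) >= min_rise
```

This sketch deliberately ignores the absolute score to make the trajectory signal explicit. A real tool would likely show both.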
As the note "Does reasoning fine-tuning make models worse at declining to answer?" argues, calibrated uncertainty and appropriate abstention are capabilities that current training actively degrades. And as "Does training objective determine which direction models fail at abstention?" argues, the direction of calibration failure depends on the training regime: a forecasting system built on reasoning-trained models would over-predict, while one built on safety-trained models would refuse to predict. Conversation forecasting must avoid both failure modes: it needs models that know what they don't know about where a conversation is heading.
Additional empirical domain — Instagram hostility forecasting: A separate forecasting study on Instagram demonstrates that hostile comments can be predicted from early conversational signals: AUC 0.82 for predicting hostility presence 10+ hours in the future, and AUC 0.91 for predicting whether a post will receive more than 10 hostile comments vs. only one. Predictive features include the post author's history of receiving hostile comments, user-directed profanity, number of distinct participants, and hostility trends in the conversation so far. This complements the CRAFT deployment evidence above — different platform, similar principle: early conversational dynamics carry forecastable signal about future trajectory.
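To make the AUC figures concrete: AUC is the probability that a randomly chosen positive example (a conversation that did turn hostile) receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it directly from that definition. It is a generic illustration with a hypothetical function name, not the study's evaluation code.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))
```

Under this reading, an AUC of 0.82 means the model ranks a hostile-bound conversation above a non-hostile one in about 82% of such pairs. AUC measures ranking quality only, so a high AUC does not by itself show that the predicted probabilities are calibrated.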
Inquiring lines that use this note as a source 94
This note is a source for these synthesized inquiries.
- What moves become possible when you represent ASR as a noisy observation model?
- How do belief distributions help systems recover from speech recognition errors?
- Does the same uncertainty-driven logic appear in other conversation systems?
- Can dialogue systems abstain from responding when uncertainty is too high?
- Can AI ever lead conversations without the anticipatory presence sustained attention provides?
- How does AI lose correct information under conversational persuasive pressure?
- Why do comprehensive posts without uncertainty tend to suppress conversation?
- How do training-data priors influence model defaults when context is ambiguous?
- How does the silent token approach compare to modeling intrinsic motivation for speaking?
- Why does context collapse pose risks in high-stakes conversations?
- Can AI detect sense-of-nonsense the way human readers do?
- Does uncertainty quantification in model responses reduce persuasive impact on audiences?
- Do verbal uncertainty estimates calibrate better than confidence scores for personalization?
- Why does model uncertainty dominate persona-specific knowledge in annotation tasks?
- Can systems recognize and abstain on judgments rather than hallucinating preferences?
- Why does combining natural language with numerical scores improve prediction accuracy?
- Can explicit numerical signals override learned linguistic defaults in fine-tuned models?
- How do models signal knowledge gaps through token probability?
- Can language systems learn when to ask for clarification instead of choosing one reading?
- Can single models correct their own beliefs without amplifying confidence in wrong answers?
- Can belief propagation accurately predict downstream opinion shifts?
- Can models identify information gaps without just guessing or refusing to answer?
- Why do linguistic hedging markers correlate with internal confidence signals in reasoning traces?
- Why do moderators show vastly different confidence across conversation types and contexts?
- Can models infer maintenance operations from conversational text data alone?
- How do conversational design patterns predict whether dialogue will derail?
- Can AI learn when to speak in a conversation?
- What happens when confident language masks uncertainty in AI outputs?
- Can decreased engagement be distinguished from genuine semantic contradiction?
- Why do transformer models still miss implicit discourse relations in anxiety detection?
- How do probabilistic dialogue systems handle ASR errors differently?
- Can models distinguish between truthfulness and honesty mechanistically?
- How do models decide between refusing or hallucinating?
- Can users learn to discount fluency as a signal of their competence?
- How vulnerable are language models themselves to multi-turn persuasive pressure?
- How should designers measure and explain semantic uncertainty to users?
- Why do language models naturally under-abstain instead of over-abstain?
- Can conversation analysis predict when agents should ask users for clarification?
- Why do next-speaker prediction baselines fail in group conversation settings?
- Can AI systems recover from premature assumptions made early in multi-turn conversations?
- Do language models systematically overestimate accuracy on collective behavior tasks?
- How does ambiguity detection connect to models' ability to ask clarifying questions?
- What happens when confident wrong answers become more rewarded than uncertain correct ones?
- What data would be needed to train proactive conversational systems?
- Can models detect false presuppositions when they actually possess the knowledge?
- Does model confidence actually correlate with robustness against prompt variations?
- Can models learn to identify what information is missing from questions?
- What training signals would teach models when not to reason?
- What makes accurate confidence different from confident-but-wrong predictions?
- Can models identify what information they are missing in underspecified tasks?
- Can language models ask clarifying questions when sentences are ambiguous?
- Why do chatbots fail to recognize when someone is ambivalent about change?
- Do models trained for reasoning lose their ability to decline questions?
- Why does face-saving avoidance drive chatbots to agree rather than confront?
- Can AI distinguish when validation helps versus when confrontation is needed?
- Why do language models prefer accommodating false information over rejecting it?
- How do conversational agents overcome structural passivity and goal awareness gaps?
- How can reward structures teach models when to speak and when to stay silent?
- Why do reasoning models confidently generate wrong answers instead of abstaining?
- Can models learn when to think versus answer directly?
- Can language models recognize when to ignore off-topic information in conversations?
- How do conversation dynamics push models toward false beliefs?
- Can models distinguish between ambiguous and incomplete information inputs?
- How should dialogue systems represent and update uncertainty from noisy ASR input?
- What makes abstention a learnable behavior instead of a default penalty?
- How do expectation-management metrics differ from traditional conversational quality metrics?
- Can models learn to stop thinking when a question lacks necessary information?
- How should conversational AI balance world knowledge with avoiding false expertise?
- What prevents AI from recovering after conversations take a wrong turn?
- When models lack representation depth, does refusal look identical to safety-driven over-abstention?
- How does proactive critical thinking detect when information is incomplete?
- How do linguistic norms for expressing certainty vary across languages and models?
- Can machine learning encode pragmatic reasoning about when rules should bend?
- Does preference optimization distort how models represent human communicative dynamics?
- Can models learn to ask clarifying questions instead of making assumptions?
- Can language model self-reports diverge from their internal entropy signals?
- Do larger language models overcome greediness in sequential decision-making?
- Does model uncertainty overwhelm persona-specific signal in conditioned predictions?
- How do training data distributions constrain what language models can accurately know?
- How does predictive accuracy on future tokens differ from correctness on labeled answers?
- Why do outcome-based rewards train language models to over-engage rather than abstain?
- How does uncertainty verbalization change student robustness across domains?
- Can teachers trained under uncertainty constraints distill better generalizing students?
- How does self-distillation degrade reasoning by suppressing uncertainty signals?
- What makes uncertainty tokens like Wait carry more information than content tokens?
- How does structured self-dialogue improve uncertainty assessment over confidence scores?
- Do newer language model generations improve forecasting ability without additional training?
- Can question-only features replace model uncertainty checks at scale?
- Does premature confidence signal flawed reasoning in language models?
- Can language models match competitive crowd forecasters on real future events?
- How much does domain expertise actually improve human forecasting under uncertainty?
- How does expressing uncertainty help models avoid the answer-or-abstain dilemma?
- How can models select the optimal question to ask given multiple uncertainties?
- Why do standard next-token prediction models struggle with conversational initiative?
Related concepts in this collection 8
- Does reasoning fine-tuning make models worse at declining to answer?
  When models are trained to reason better, do they lose the ability to say 'I don't know'? This matters for high-stakes applications like medical and legal AI that depend on appropriate uncertainty.
  reasoning training degrades exactly the abstention capability conversation forecasting needs
- Why do language models fail confidently in specialized domains?
  LLMs perform poorly on clinical and biomedical inference tasks while remaining overconfident in their wrong answers. Do standard benchmarks hide this fragility, and can prompting techniques fix it?
  overconfidence is the complementary failure to poor calibration
- Does binary reward training hurt model calibration?
  Explores whether the standard correctness-based reward in RL training creates incentives for overconfident predictions, and what structural problem causes calibration to degrade during optimization.
  calibration fix for RL applies to dialogue forecasting
- Does training objective determine which direction models fail at abstention?
  Calibration failures might not be universal: different training approaches could push models toward opposite extremes of refusing or overconfidently answering. Understanding whether the training objective, not just model capability, drives these failures could reshape how we think about fixing them.
  specifies how training objectives differentially break forecasting calibration: reasoning-trained forecasters would over-predict, safety-trained would over-refuse
- Can conversation structure predict dialogue success better than content?
  Does the geometric shape of how dialogue unfolds (timing, repetition, topic drift) matter as much as what people actually say? This explores whether interactive patterns hold signals hidden in word choice alone.
  TRACE measures trajectory retrospectively for reward; forecasting uses trajectory prospectively for prediction; same underlying principle that conversation shape carries outcome signal
- Can opening politeness patterns predict whether conversations will turn hostile?
  Do pragmatic politeness features in first exchanges (hedging, greetings, indirectness) reliably signal whether a conversation will later derail into personal attacks? Understanding early linguistic markers could help identify and prevent online hostility.
  politeness strategies identify WHICH early features predict trajectory; forecasting provides HOW to quantify confidence in those predictions
- Why do LLM judges fail at predicting sparse user preferences?
  When LLMs judge user preferences based on limited persona information, what causes their predictions to become unreliable? Understanding persona sparsity's role in judgment failure could improve personalization systems.
  the same calibrated abstention pattern: personalized judges that express uncertainty on sparse persona inputs achieve 80%+ reliability on high-certainty samples, paralleling how calibrated forecasting models improve by abstaining when uncertain rather than forcing predictions
- Why do users drift away from their original information need?
  When users know their knowledge is incomplete but cannot articulate what's missing, do they unintentionally shift topics? And can real-time systems detect this drift?
  ASK-driven topic drift is a specific conversational trajectory that calibrated forecasting should detect: users in an anomalous knowledge state produce drift patterns with 84% detectable precision, providing a concrete forecasting target for conversation trajectory prediction
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Deal, or no deal (or who knows)? Forecasting Uncertainty in Conversations using Large Language Models
- Linguistic Calibration of Long-Form Generations
- AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions
- Post-Training Large Language Models via Reinforcement Learning from Self-Feedback
- A Survey of Calibration Process for Black-Box LLMs
- Intent Mismatch Causes LLMs to Get Lost in Multi-Turn Conversation
- Debating with More Persuasive LLMs Leads to More Truthful Answers
- LLMs Get Lost In Multi-Turn Conversation
Original note title
conversation forecasting under uncertainty requires calibrated probability estimates — calibrated models should abstain on uncertain predictions rather than forcing outputs