INQUIRING LINE

How do language models track multiple negotiating parties' commitments simultaneously?

This explores whether language models can keep a running ledger of what each side in a negotiation has committed to — tracking two (or more) parties' goals and agreements at once, rather than the single user's intent that ordinary dialogue systems assume.


The question is whether LLMs can hold a bilateral model of commitments (what each party has offered, accepted, or conceded) instead of the single-user goal that conventional dialogue systems are built around. The corpus suggests this is genuinely hard, and the reason is structural: standard dialogue state tracking was designed to fill in one user's form ("book a table for two at 7pm"), so it has no slot for the second party's evolving demands or for the mutual agreements that exist only when both sides sign off. Negotiation breaks that assumption: agreement requires explicit buy-in from both interlocutors across multiple issues, and form-filling paradigms cannot represent that strategic, two-sided state (see "Why do standard dialogue systems fail at tracking negotiation agreement?").
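To make the missing data structure concrete, here is a minimal sketch of a two-sided agreement ledger. It is not taken from any of the cited work; the names `Party`, `IssueState`, and `AgreementLedger`, and the rule that a fresh offer reopens an issue, are assumptions chosen for illustration. The point it shows is that an issue counts as agreed only after one party accepts the other's standing offer, which a single-user slot table has no way to express.

```python
from dataclasses import dataclass, field
from enum import Enum


class Party(Enum):
    A = "A"
    B = "B"

    def other(self) -> "Party":
        return Party.B if self is Party.A else Party.A


@dataclass
class IssueState:
    # Latest standing offer from each party (None means no offer yet).
    offers: dict = field(default_factory=lambda: {Party.A: None, Party.B: None})
    # Set only when one side accepts the other's standing offer.
    agreed_value: object = None


@dataclass
class AgreementLedger:
    issues: dict = field(default_factory=dict)  # issue name -> IssueState

    def offer(self, party: Party, issue: str, value) -> None:
        state = self.issues.setdefault(issue, IssueState())
        state.offers[party] = value
        state.agreed_value = None  # assumption: a fresh offer reopens the issue

    def accept(self, party: Party, issue: str) -> None:
        state = self.issues[issue]
        proposed = state.offers[party.other()]
        if proposed is None:
            raise ValueError(f"{party.other().value} has made no offer on {issue!r}")
        state.offers[party] = proposed
        state.agreed_value = proposed  # both sides now stand behind the same value

    def agreed(self) -> dict:
        return {
            name: s.agreed_value
            for name, s in self.issues.items()
            if s.agreed_value is not None
        }


ledger = AgreementLedger()
ledger.offer(Party.A, "price", 120)
ledger.offer(Party.B, "price", 100)          # still contested
ledger.offer(Party.B, "delivery", "2 weeks")
ledger.accept(Party.A, "delivery")           # mutual sign-off on this issue only
print(ledger.agreed())  # {'delivery': '2 weeks'}
```

Keeping `offers` per party and `agreed_value` separate is the design choice that matters here: the ledger can report that price is still contested while delivery is settled.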

There's a deeper machinery gap underneath the missing data structure. Tracking two parties means maintaining two belief states and updating each as the conversation moves from partial to shared understanding. Token-level LLMs don't natively do this; the cleanest attempt to add it borrows from pragmatics. Collaborative Rational Speech Acts (CRSA) extends the Rational Speech Acts (RSA) model so that both speakers' beliefs are tracked bidirectionally across turns, using information theory (specifically rate-distortion theory) to capture how the parties converge toward a shared picture (see "Can dialogue systems track both speakers' beliefs across turns?"). The fact that researchers had to bolt on an external information-theoretic framework is itself the finding: the framework supplies the bilateral bookkeeping that a vanilla LLM lacks.
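For readers unfamiliar with RSA, the sketch below implements the standard single-turn model (literal listener, pragmatic speaker, pragmatic listener) on the textbook reference-game example. It is plain RSA, not CRSA: the multi-turn, two-sided, rate-distortion extension described above is not attempted here. The lexicon, the uniform prior, and the rationality parameter `alpha` are illustrative assumptions.

```python
MEANINGS = ["blue_square", "blue_circle", "green_square"]

# Hypothetical lexicon: the meanings each utterance is literally true of.
LEXICON = {
    "blue": {"blue_square", "blue_circle"},
    "green": {"green_square"},
    "square": {"blue_square", "green_square"},
    "circle": {"blue_circle"},
}


def normalize(weights):
    total = sum(weights.values())
    if total == 0:
        return {k: 0.0 for k in weights}
    return {k: v / total for k, v in weights.items()}


def literal_listener(utterance, prior):
    # L0(m | u) is proportional to prior(m) when u is literally true of m.
    return normalize(
        {m: (prior[m] if m in LEXICON[utterance] else 0.0) for m in MEANINGS}
    )


def pragmatic_speaker(meaning, prior, alpha=1.0):
    # S1(u | m) is proportional to L0(m | u) ** alpha.
    return normalize(
        {u: literal_listener(u, prior)[meaning] ** alpha for u in LEXICON}
    )


def pragmatic_listener(utterance, prior, alpha=1.0):
    # L1(m | u) is proportional to prior(m) * S1(u | m).
    return normalize(
        {m: prior[m] * pragmatic_speaker(m, prior, alpha)[utterance] for m in MEANINGS}
    )


uniform = {m: 1 / len(MEANINGS) for m in MEANINGS}
posterior = pragmatic_listener("blue", uniform)
print(posterior)
```

Hearing "blue", the pragmatic listener leans toward `blue_square` (about 0.6 versus 0.4 for `blue_circle`), because a speaker who meant the circle had the unambiguous word "circle" available. This is the kind of listener-models-speaker reasoning that a bilateral belief tracker has to run for each party.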

A related limitation shows up when you ask the model to hold competing readings at once. On the AMBIENT benchmark, GPT-4 correctly disambiguated only 32% of cases versus 90% for humans, which suggests LLMs struggle to keep multiple live interpretations in play simultaneously (see "Can language models recognize when text is deliberately ambiguous?"). Negotiation is this situation in disguise: each party's position is a separate interpretation of where the deal stands, and a model that collapses to one reading will quietly lose track of the other side's commitments. The same brittleness appears over time: models anchor on surface lexical cues and fail to adapt as a counterpart's strategy evolves across a multi-turn game (see "Can models recognize how individuals reason differently?").

Game-theoretic studies sharpen the picture and hint at fixes. Left to themselves, LLMs deviate from rational strategy (they frequently fail to compute Nash equilibria) and get worse as games grow more complex, but wrapping them in a structured game-theoretic workflow steers reasoning back toward near-optimal, less exploitable negotiation (see "Do language models make rational strategic decisions in games?"). And the way a model tracks the other party isn't uniform: across 22 models, some reason by minimax, some by trust, and some by "belief-anticipation," explicitly modeling what the opponent will do (see "Do large language models use one reasoning style or many?"). That belief-anticipation style is the closest native analog to commitment-tracking, and notably it's tied to model and game type rather than raw reasoning depth.
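The notes do not spell out what the structured workflow contains, so the following is only a guess at the simplest kind of explicit check such a workflow could run outside the model: enumerating pure-strategy Nash equilibria of a small two-party game. The action names and payoff numbers are hypothetical, picked to resemble a concede-or-hold-firm standoff.

```python
from itertools import product

ACTIONS = ["concede", "hold"]

# Hypothetical payoffs as (row party, column party); illustrative numbers only.
PAYOFFS = {
    ("concede", "concede"): (3, 3),
    ("concede", "hold"): (1, 4),
    ("hold", "concede"): (4, 1),
    ("hold", "hold"): (0, 0),
}


def is_pure_nash(row_action, col_action):
    """True if neither party can gain by unilaterally switching actions."""
    u_row, u_col = PAYOFFS[(row_action, col_action)]
    row_ok = all(PAYOFFS[(alt, col_action)][0] <= u_row for alt in ACTIONS)
    col_ok = all(PAYOFFS[(row_action, alt)][1] <= u_col for alt in ACTIONS)
    return row_ok and col_ok


equilibria = [pair for pair in product(ACTIONS, ACTIONS) if is_pure_nash(*pair)]
print(equilibria)  # [('concede', 'hold'), ('hold', 'concede')]
```

In this toy game mutual concession is not stable, since each side gains by holding firm once the other concedes. A check like this gives a workflow something to compare the model's proposed move against.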

The quietly unsettling takeaway: an LLM doesn't carry a stable commitment ledger the way a human negotiator does. Shanahan's 20-questions regeneration test shows models hold a superposition of consistent possibilities and sample one at generation time rather than committing to a fixed state (see "Do large language models actually commit to a single character?"). So when a model appears to "remember" what each party agreed to, it may be re-improvising a consistent story each turn rather than maintaining one. That is why reliable multi-party tracking, in this corpus, comes from external scaffolding (explicit agreement state, RSA-style belief models, structured workflows) rather than from the model alone.
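A regeneration probe in the spirit of that test can be sketched as follows: ask the same question about the agreed state several times from the same context and measure how often the answers coincide. No model is called here; the `samples` list is a hand-written stand-in for regenerated answers, and `regeneration_agreement` is a hypothetical helper name.

```python
from collections import Counter


def regeneration_agreement(answers):
    """Return the most common answer and the share of samples that gave it."""
    counts = Counter(answers)
    top_answer, top_count = counts.most_common(1)[0]
    return top_answer, top_count / len(answers)


# Stand-in for four regenerations of "What price did the parties agree on?"
samples = ["price=100", "price=100", "price=110", "price=100"]
top, share = regeneration_agreement(samples)
print(top, share)  # price=100 0.75
```

A share well below 1.0 on a question with a single correct answer would be a sign that the model is sampling a plausible state each time, which is the case where an external ledger should be treated as the source of truth.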


Sources (7 notes)

Why do standard dialogue systems fail at tracking negotiation agreement?

Standard dialogue state tracking assumes one user's goals; negotiation requires explicit agreement from both parties across multiple issues. Existing DST models, limited to form-filling paradigms, cannot capture the strategic dynamics and mutual commitments essential to genuine bilateral agreement.

Can dialogue systems track both speakers' beliefs across turns?

CRSA integrates rate-distortion theory with RSA to enable bidirectional belief tracking across dialogue turns. Demonstrated on referential games and doctor-patient dialogues, it captures progression from partial to shared understanding, providing the information-theoretic framework that token-level LLM systems lack.

Can language models recognize when text is deliberately ambiguous?

The AMBIENT benchmark shows GPT-4 correctly disambiguates only 32% of cases versus 90% for humans. This failure spans lexical, structural, and scope ambiguity, suggesting that LLMs struggle to hold multiple interpretations simultaneously, a fundamental gap hidden by standard benchmarks.

Can models recognize how individuals reason differently?

LLMs struggle to anchor reasoning in temporal gameplay and adapt to evolving strategies. GPT-4o relies on surface lexical cues while DeepSeek-R1 shows early promise, but dynamic style adaptation remains largely insufficient across all models tested.

Do language models make rational strategic decisions in games?

LLMs frequently fail to compute Nash equilibria, with worse performance as game complexity increases. Structured game-theoretic workflows guide reasoning toward optimal strategies, reducing exploitability and enabling near-optimal negotiation outcomes.

Do large language models use one reasoning style or many?

Analysis of 22 LLMs across behavioral game theory reveals three dominant profiles: GPT-o1 uses minimax reasoning, DeepSeek-R1 uses trust-based reasoning, and GPT-o3-mini uses belief-anticipation. Performance correlates with game structure, not raw reasoning depth.

Do large language models actually commit to a single character?

Shanahan's 20-questions test shows LLMs maintain a superposition of consistent objects or characters and sample from that distribution at generation time. Regenerating the same response yields different outputs, each consistent with prior context, which indicates that the model had not committed to a fixed answer in advance.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a researcher re-testing claims about LLM commitment tracking in multi-party negotiation. The question: Can language models maintain a bilateral model of what each negotiating party has offered, accepted, or conceded—or do they collapse to single-user state tracking?

What a curated library found — and when (dated claims, not current truth):
These findings span 2023–2025 and should be treated as perishable snapshots:

• Standard dialogue state tracking lacks a data structure for two-sided commitments; negotiation requires explicit buy-in from both interlocutors, which form-filling paradigms cannot represent (2023–24).
• On the AMBIENT benchmark, GPT-4 correctly disambiguated only 32% of cases versus 90% for humans; LLMs struggle to hold multiple live interpretations in parallel—a core negotiation demand (2023).
• Collaborative Rational Speech Acts (a pragmatic reasoning framework bolted onto LLMs) tracks both parties' beliefs bidirectionally using information theory; that researchers had to add external scaffolding is itself the finding (2025).
• LLMs deviate from rational game-theoretic strategy and worsen with complexity, but structured game-theoretic workflows steer them toward near-optimal negotiation; across 22 models, reasoning styles differ by minimax, trust, or belief-anticipation (2025).
• Shanahan's 20-questions regeneration test suggests models re-improvise consistent stories each turn rather than maintaining a fixed commitment ledger; reliable multi-party tracking comes from external scaffolding, not the model alone (implicit in corpus).

Anchor papers (verify; mind their dates):
• 2304.14399 (Apr 2023): We're Afraid Language Models Aren't Modeling Ambiguity
• 2307.06524 (Jul 2023): Agreement Tracking for Multi-Issue Negotiation Dialogues
• 2411.05990 (Nov 2024): Game-theoretic LLM: Agent Workflow for Negotiation Games
• 2507.14063 (Jul 2025): Collaborative Rational Speech Act: Pragmatic Reasoning for Multi-Turn Dialog

Your task:
(1) RE-TEST EACH CONSTRAINT. For the 32% AMBIENT score, GPT-4-turbo, o1-preview, and o1 variants—have they closed that gap? For the claim that vanilla LLMs re-improvise rather than commit: do recent long-context methods (e.g., persistent KV caches, in-context commitment anchors, or hardware-level state pinning in concurrent-attention systems like 2504.06261) now stabilize commitment representation across turns? For structured game-theoretic workflows: are they still required, or do recent SLMs (2506.02153) or soft CoT (2502.12134) internalize multi-agent belief tracking without external scaffolding? Separate the durable question (bilateral state representation) from perishable limitations (which methods now solve it).

(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Specifically: does recent work on prompt sensitivity (2410.12405), decision-making determinants (2402.17385), or precedent overruling (2510.20941) suggest that commitment tracking is tractable given the right prompt regime—and if so, does that undercut the claim that it requires external scaffolding?

(3) Propose 2 research questions that ASSUME the regime may have moved: (a) If structured workflows no longer bottleneck bilateral tracking, what is the *next* failure mode in multi-party negotiation—defection, intent drift, or coalitional reasoning? (b) Do small language models (2506.02153) exhibit *worse* commitment amnesia than large ones, or is the deficiency orthogonal to scale?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
