CoSQL: A Conversational Text-to-SQL Challenge Towards Cross-Domain Natural Language Interfaces to Databases


We present CoSQL, a corpus for building cross-domain, general-purpose database (DB) querying dialogue systems. It consists of 30k+ turns plus 10k+ annotated SQL queries, obtained from a Wizard-of-Oz (WOZ) collection of 3k dialogues querying 200 complex DBs spanning 138 domains. Each dialogue simulates a real-world DB query scenario with a crowd worker as a user exploring the DB and a SQL expert retrieving answers with SQL, clarifying ambiguous questions, or otherwise informing the user of unanswerable questions. When user questions are answerable by SQL, the expert describes the SQL and execution results to the user, hence maintaining a natural interaction flow. CoSQL introduces new challenges compared to existing task-oriented dialogue datasets: (1) the dialogue states are grounded in SQL, a domain-independent executable representation, instead of domain-specific slot-value pairs, and (2) because testing is done on unseen databases, success requires generalizing to new domains. CoSQL includes three tasks: SQL-grounded dialogue state tracking, response generation from query results, and user dialogue act prediction. We evaluate a set of strong baselines for each task and show that CoSQL presents significant challenges for future research.
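To make "dialogue states grounded in SQL" concrete, here is a minimal sketch in which each turn's state is an executable SQL query rather than a set of slot-value pairs. Everything in it is an illustrative assumption, not taken from the CoSQL release: the `student` table, the three utterances, the `Turn` class, and the helper names `build_toy_db` and `run_dialogue` are all hypothetical. A turn whose `sql` is `None` stands in for a question the expert cannot answer with SQL.

```python
import sqlite3
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Turn:
    """One user turn; `sql` is the dialogue state, or None if SQL cannot answer it."""
    utterance: str
    sql: Optional[str]


def build_toy_db() -> sqlite3.Connection:
    """Create a tiny in-memory database (hypothetical schema, not from CoSQL)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE student (name TEXT, age INTEGER, major TEXT)")
    conn.executemany(
        "INSERT INTO student VALUES (?, ?, ?)",
        [("Ann", 20, "CS"), ("Bob", 22, "Math"), ("Cy", 25, "CS")],
    )
    return conn


def run_dialogue(conn: sqlite3.Connection, turns: List[Turn]) -> list:
    """Execute each turn's SQL state; unanswerable turns yield None."""
    results = []
    for turn in turns:
        if turn.sql is None:
            results.append(None)
        else:
            results.append(conn.execute(turn.sql).fetchall())
    return results


# The second query carries over the constraint from the first turn,
# which is what tracking a SQL-grounded state across turns amounts to.
DIALOGUE = [
    Turn("Which students major in CS?",
         "SELECT name FROM student WHERE major = 'CS' ORDER BY name"),
    Turn("Of those, who is older than 21?",
         "SELECT name FROM student WHERE major = 'CS' AND age > 21 ORDER BY name"),
    Turn("Is Cy a nice person?", None),
]

results = run_dialogue(build_toy_db(), DIALOGUE)
for turn, rows in zip(DIALOGUE, results):
    print(turn.utterance, "->", rows)
```

Because the state is ordinary SQL, the same representation works for any schema. This is the property the abstract relies on when it describes testing on unseen databases.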

Introduction. Natural language interfaces to databases (NLIDB) have been studied extensively, with a multitude of different approaches introduced over the past few decades. To this end, considerable progress has been made in querying data via natural language (NL). However, most NL query systems expect the [...] (...; Wang et al., 2018; Yu et al., 2018b,c). In reality, complex questions are usually answered through interactive exchanges (Figure 1). Even for simple queries, people tend to explore the database by asking multiple basic, interrelated questions (Hale, 2006; Levy, 2008; Frank, 2013; Iyyer et al., 2017). This requires systems capable of sequentially processing conversational requests to access information in relational databases. To drive progress in building context-dependent NL query systems, corpora such as ATIS (Hemphill et al., 1990; Dahl et al., 1994) and SParC (Yu et al., 2019) have been released. However, these corpora assume all user questions can be mapped into SQL queries and do not include system responses.

Discussion / Conclusion. In this paper, we introduce CoSQL, the first large-scale cross-domain conversational text-to-SQL corpus collected under a Wizard-of-Oz setup. Its language and discourse diversity and cross-domain setting raise exciting open problems for future research. In particular, the baseline model performances on the three challenge tasks suggest plenty of room for improvement. The data and challenge leaderboard will be publicly available at https://yale-lily.github.io/cosql.

Future Work. As discussed in Section 5, some examples in CoSQL include ambiguous and unanswerable user questions, and we do not study how a system can effectively clarify those questions or guide the user to ask questions that are answerable. Also, some user questions cannot be answered with SQL, but the correct answer can be derived through other forms of logical reasoning.

Lines of inquiry this paper opens

Research framings built by reading the notes related to this paper — the questions it feeds into.

How can LLM user simulators model realistic goal-driven conversation?

Why do longer forecasting horizons degrade LLM accuracy in role-play?

What properties determine whether reward signals teach genuine reasoning?

Why does combining natural language with numerical scores improve prediction accuracy?

Can self-supervised signals enable process supervision without human annotation?

How does process supervision relate to execution-signaled feedback approaches?

What structural advantages do diffusion language models offer over autoregressive methods?

Why do benchmark improvements fail to reflect actual reasoning quality?

What critical LLM failures do standard benchmarks hide?

Do language models develop causal world models or rely on statistical patterns?

Do LLMs need world models to make accurate predictions?

Can prompting strategies overcome LLM biases without model fine-tuning?

Do monolithic prompts underutilize LLM strengths in forecasting workflows?

What drives capability and cost efficiency in agent systems?

What separates good workflow design from poor workflow design?

Can single-axis benchmarks accurately predict agent deployment success?

How should benchmarks evaluate workflow architecture versus raw model performance?

What causes silent corruption to amplify through delegated workflows?

When does architectural design matter more than raw model capacity?

Why do macro and micro forecasting scales require different reasoning approaches?

How do training data properties shape reasoning capability development?

What real-world forecasting domains benefit most from contextual reasoning integration?

Does decoupling planning from execution improve multi-step reasoning accuracy?

How much does workflow architecture matter compared to raw model capability in forecasting?

Does fine-tuning modify underlying model capabilities or only behavioral outputs?

Do newer language model generations improve forecasting ability without additional training?

When should retrieval-augmented systems decide to fetch new information?

What role does retrieval mechanism design play in forecast accuracy?

Why can't humans reliably detect AI-generated text despite measurable linguistic signatures?

How do AI researcher forecasts compare across different timeline question phrasings?

How does example difficulty affect learning efficiency in language models?

Why do non-experts default to familiar chart types despite domain complexity?

How should iterative research systems allocate reasoning per search step?

How do search and reasoning workflows improve forecasting performance over base models?
