INSTRUCTION REASON EVIDENCE-BASED COMPARISON EXPERIENCE DEBATE INSTRUCTION You want to understand the procedure/method of doing/achieving something. Instructions/guidelines provided in a step-…
Large language models (LLMs) are increasingly being used as decision aids. However, users have diverse values and preferences that can affect their decision-making, which requires novel methods for LL…
Effective communication requires the ability to identify ambiguities and request clarification of utterances. For machines to engage in a conversation, they need to learn to generate different forms o…
Large language models (LLMs) capable of casual conversation have recently become widely available. We hypothesize that users of conversational systems want a more personalized experience, and existing…
However, conversational agents built upon even the most recent large language models (LLMs) face challenges in processing ambiguous inputs, primarily due to the following two hurdles: (1) LLMs are not…
This work aims to build a text embedder that can capture characteristics of texts specified by user instructions. Despite its tremendous potential to deploy user-oriented embeddings, none of previous …
 Users often need to look through multiple search result pages or reformulate queries when they have complex information-…
Extracting metaphors and analogies from free text requires high-level reasoning abilities such as abstraction and language understanding. Our study focuses on the extraction of the concepts that form …
While information retrieval (IR) systems may provide answers for such user queries, they do not directly assist content creators—such as lecturers who want to improve their content—identify segments t…
Large Language Models (LLMs) have demonstrated remarkable self-improvement capabilities, whereby models iteratively revise their outputs through self-generated feedback. While this reflective mechanis…
Enabling open-domain dialogue systems to ask clarifying questions when appropriate is an important direction for improving the quality of the system response. Namely, for cases when a user request is …
Understanding context is key to understanding human language, an ability which Large Language Models (LLMs) have been increasingly seen to demonstrate to an impressive extent. However, though the eval…
We propose Chain-of-Questions, a framework that trains a model to robustly answer multistep questions by generating and answering sub-questions. We obtain supervision for subquestions from human-annot…
Several models are proposed in the ConvAI3 challenge (Aliannejadi et al., 2020), aiming to incorporate CQs in the ranking process, mostly proposed based on pre-trained language models. Complementing t…
However, most of the previous works prompt the LLMs to directly generate a response based on the dialogue context, overlooking the underlying linguistic cues about the user status exhibited in the con…
“We introduce a new benchmark, Diplomat, aiming at a unified paradigm for pragmatic reasoning and situated conversational understanding. Compared with previous works that treat different figurative ex…
Understanding how users perceive content from generative AI tools is crucial because it can help reduce unwarranted trust in inaccurate information and mitigate the spread of misinformation. A focus g…
With the increasing adoption of Large Language Models (LLMs) in enterprise settings, ensuring accurate and reliable question-answering systems remains a critical challenge. Building upon our previous …
We focus on the cross-domain contextdependent text-to-SQL generation task. Based on the observation that adjacent natural language questions are often linguistically dependent and their corresponding …
Group decision making plays a crucial role in our complex and interconnected world. The rise of AI technologies has the potential to provide data-driven insights to facilitate group decision making, a…
While existing benchmarks probe the reasoning abilities of large language models (LLMs) across diverse domains, they predominantly assess passive reasoning, providing models with all the information n…
Open-domain question answering (QA) tasks usually require the retrieval of relevant information from a large corpus to generate accurate answers. We propose a novel approach called Generator-Retriever…
“An extensive research path is to elaborately design graph neural networks (GNNs) (Scarselli et al., 2008) to perform reasoning over explicit structural common sense knowledge from external knowledge …
The ’pre-train, prompt, predict’ paradigm of large language models (LLMs) has achieved remarkable success in opendomain question answering (OD-QA). However, few works explore this paradigm in the scen…
Abstract—This study develops a question-answering system based on Retrieval-Augmented Generation (RAG) using Chinese Wikipedia and Lawbank as retrieval sources. Using TTQA and TMMLU+ as evaluation dat…
Large language models (LLMs) have shown remarkable reasoning capabilities given chain-of-thought prompts (examples with intermediate reasoning steps). Existing benchmarks measure reasoning ability ind…
However, one of the main challenges for this beautiful vision is that the users’ queries may contain some linguistic problems (e.g., omissions and coreference) and it becomes much harder to capture th…
However, LLM-based QA struggles with complex QA tasks due to poor reasoning capacity, outdated knowledge, and hallucinations. Several recent works synthesize LLMs and knowledge graphs (KGs) for QA to …
 Conversational recommender systems (CRSs) have revolutionized the conventional recommendation paradigm by emb…
In this paper, we propose a dual-learning model that hybrids the best from both implicit session feedback and proactively clarifying with users on the most critical questions. Hence, there are two br…
Abstract. The increasing demand for web-based digital assistants has given a rapid rise in the interest of the Information Retrieval (IR) community towards the field of conversational question answeri…
Longcontext question answering (LCQA) (Caciularu et al., 2022), which has been recently advanced significantly by LLMs, is a complex task that requires reasoning over a long document or multiple docum…
Leveraging a comprehensively curated entailment verification benchmark, we evaluate both human and LLM performance across various reasoning categories. Our benchmark includes datasets from three categ…
Multi-hop question answering requires models to gather information from different parts of a text to answer a question. Most current approaches learn to address this task in an end-to-end way with neu…
We present MultiChallenge, a pioneering benchmark evaluating large language models (LLMs) on conducting multi-turn conversations with human users, a crucial yet underexamined capability for their appl…
We address this gap by analyzing data from the AI Search Arena, a head-to-head evaluation platform for AI search systems. The dataset comprises over 24,000 conversations and 65,000 responses from mode…
The ability to handle miscommunication is crucial to robust and faithful conversational AI. People usually deal with miscommunication immediately as they detect it, using highly systematic interaction…
As John McCarthy (McCarthy, 1990, 1959) points out, in order to a better understanding of natural language, it is necessary for an intelligence system to understand the “deep structure” (Chomsky, 2011…
Large language models (LLMs) are effective at answering questions that are clearly asked. However, when faced with ambiguous queries they can act unpredictably and produce incorrect outputs. This unde…
The performance of Large Language Models (LLMs) in reasoning tasks depends heavily on prompt design, with Chain-of-Thought (CoT) and self-consistency being critical methods that enhance this ability. …
“The central problem of IR systems, also referred to as the “holy grail” of IR, is to overcome the vocabulary mismatch between the user and the system [75]. This leads to the challenge of matching the…
Large language models (LLMs) have shown impressive performance on reasoning benchmarks like math and logic. While many works have largely assumed well-defined tasks, real-world queries are often under…
Reference resolution is an important problem, one that is essential to understand and successfully handle context of different kinds. This context includes both previous turns and context that pertain…
In this paper, we propose a new dataset, ReasonVQA, for the Visual Question Answering (VQA) task. Our dataset is automatically integrated with structured encyclopedic knowledge and constructed using a…
Large Language Models (LLM), conversational assistants have become prevalent for domain use cases. LLMs acquire the ability to contextual question answering through extensive training, and Retrieval A…
Scaling test-time compute has emerged as a key ingredient for enabling large language models (LLMs) to solve difficult problems, but comes with high latency and inference cost. We introduce sleep-time…
Language models are rarely shown fruitful mistakes while training. They then struggle to look beyond the next token, suffering from a snowballing of errors and struggling to predict the consequence of…
  may be queried generatively (by sampling answers from their output distribution) or discriminatively (by using…
Large language models (LLMs) have made significant strides in various tasks, yet they often struggle with complex reasoning and exhibit poor performance in scenarios where knowledge traceability, time…
complex tasks like research and strategic thinking often benefit from a more comprehensive approach to augmenting the thinking process rather than passively getting information. We introduce the conce…
Recent advances in retrieval-augmented generation have significantly improved the performance of question-answering systems, particularly on factoid ‘5Ws’ questions. However, these systems still face …
Topic diversion occurs frequently with engaging open-domain dialogue systems like virtual assistants. The balance between staying on topic and rectifying the topic drift is important for a good collab…
Non-factoid question-answering (NFQA) poses a significant challenge due to its open-ended nature, diverse intents, and the need for multi-aspect reasoning, which renders conventional factoid QA approa…
In this work, we introduce Uncertainty of Thoughts (UoT), an algorithm to augment large language models with the ability to actively seek information by asking effective questions. UoT combines 1) an …
LLM-based agents score well on search benchmarks, yet real users consistently find results unsatisfying, revealing a persistent evaluation–experience gap. We attribute this gap to existing benchmarks'…