When Large Language Models contradict humans? Large Language Models’ Sycophantic Behaviour

Paper · arXiv 2311.09410 · Published November 15, 2023

Large Language Models have been demonstrating the ability to solve complex tasks by delivering answers that are positively evaluated by humans due in part to the intensive use of human feedback that refines responses. However, the suggestibility transmitted through human feedback increases the inclination to produce responses that correspond to the users’ beliefs or misleading prompts as opposed to true facts, a behaviour known as sycophancy. This phenomenon decreases the bias, robustness, and, consequently, their reliability. In this paper, we shed light on the suggestibility of Large Language Models (LLMs) to sycophantic behaviour, demonstrating these tendencies via human-influenced prompts over different tasks. Our investigation reveals that LLMs show sycophantic tendencies when responding to queries involving subjective opinions and statements that should elicit a contrary response based on facts. In contrast, when confronted with mathematical tasks or queries that have an objective answer, these models at various scales seem not to follow the users’ hints by demonstrating confidence in delivering the correct answers.

Introduction. Ongoing Large Language Models (LLMs) (Brown et al., 2020; Touvron et al., 2023; Chowdhery et al., 2022) represent the outcome of significant advancements in recent years. These systems demonstrate the ability to solve complex tasks that require reasoning, delivering answers that are positively evaluated by humans through techniques like reinforcement learning from human feedback (RLHF) (Christiano et al., 2023), direct preference optimization (DPO) (Rafailov et al., 2023). The refinement of these systems using these techniques has been shown to improve the quality of their results as assessed by humans (Ouyang et al., 2022; Ganguli et al., 2023; Korbak et al., 2023). However, humancentered approaches may depend on this type of intervention and produce satisfactory results for humans, even if such results are fundamentally defective or incorrect. Earlier research has shown that LLMs sometimes provide responses in line with the user they are responding to, particularly in scenarios where users explicitly express a particular point of view (Perez et al., 2022; Wei et al., 2023b).

Discussion / Conclusion. This paper highlights a critical aspect of Large Language Models and their suggestibility to sycophantic behaviour. While Large Language Models (LLMs) have shown outstanding abilities in solving complex tasks and aligning with human evaluations, this adaptability also introduces a tendency to generate responses that may align more with users’ beliefs rather than factual accuracy. We discern among different scenarios that could induce LLMs to have sycophantic behaviour by proposing different interventions for several tasks. From downstream results, it is possible to observe that LLMs exhibit sycophantic behaviour and agree with user beliefs, especially in situations involving subjective opinions or when factual contradictions are expected. At the same time, these attitudes are significantly less pronounced in objective decisionmaking scenarios.

When Large Language Models contradict humans? Large Language Models’ Sycophantic Behaviour

Synthesis notes that discuss concepts related to this paper