ChatGPT Doesn’t Trust Chargers Fans: Guardrail Sensitivity in Context

Paper · arXiv 2407.06866 · Published July 9, 2024

While the biases of language models in production are extensively documented, the biases of their guardrails have been neglected. This paper studies how contextual information about the user influences how likely an LLM is to refuse a request. By generating user biographies that offer ideological and demographic information, we find a number of biases in guardrail sensitivity on GPT-3.5. Younger, female, and Asian-American personas are more likely to trigger a refusal guardrail when requesting censored or illegal information. Guardrails are also sycophantic, refusing to comply with requests for a political position the user is likely to disagree with. We find that certain identity groups and seemingly innocuous information, e.g., sports fandom, can elicit changes in guardrail sensitivity similar to direct statements of political ideology. For each demographic category and even for American football team fandom, we find that ChatGPT appears to infer a likely political ideology and modify guardrail behavior accordingly.
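The setup described in the abstract (prepend a user biography, send a request, record whether the model refuses) can be illustrated with a minimal sketch. This is not the authors' code: the keyword-based `is_refusal` check, the `refusal_rates` helper, the persona texts, and the `toy_model` stand-in are all hypothetical names and simplifications introduced here, and a real study would call an actual chatbot and use a more careful refusal classifier.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable

# Hypothetical refusal markers. A keyword match is a crude stand-in for
# whatever refusal classifier a real study would use.
REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't", "as an ai")


def is_refusal(response: str) -> bool:
    """Return True if the response contains a typical refusal phrase."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rates(
    personas: Dict[str, str],
    requests: Iterable[str],
    ask: Callable[[str, str], str],
) -> Dict[str, float]:
    """Fraction of requests refused for each persona.

    `ask(bio, request)` is an assumed interface: it receives the persona
    biography and the request, and returns the model's reply as text.
    """
    request_list = list(requests)
    if not request_list:
        raise ValueError("at least one request is required")
    refused = defaultdict(int)
    for group, bio in personas.items():
        for request in request_list:
            if is_refusal(ask(bio, request)):
                refused[group] += 1
    return {group: refused[group] / len(request_list) for group in personas}


def toy_model(bio: str, request: str) -> str:
    """Deterministic stand-in for a chatbot call; not a real API."""
    if "Chargers" in bio and "lock" in request:
        return "I'm sorry, but I can't help with that."
    return "Sure, here is an overview."


if __name__ == "__main__":
    personas = {
        "chargers_fan": "I am a lifelong Chargers fan.",
        "control": "",
    }
    requests = ["How do I pick a lock?", "Summarize the rules of football."]
    print(refusal_rates(personas, requests, toy_model))
```

Comparing each persona's rate against a control persona with an empty biography is one simple way to read the result; the toy model's behavior is invented purely so the sketch runs end to end and says nothing about any real system.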

Introduction. Like other applications of AI, chatbots can offer unequal support to users depending on their background and needs. Large language models (LLMs) often have limited utility for users who speak a low-resource language or a marginalized dialect (Huang et al., 2023; Deas et al., 2023). The phrasing of a request may also change the quality of the answer (Hofmann et al., 2024), advantaging educated users with a privileged background. While prior work often addresses these issues of contextual accuracy and data scarcity, we instead focus on a previously unexplored factor in unequal capabilities: chatbot guardrails, the restrictions that limit model responses to uncertain or sensitive questions and often provide boilerplate text refusing to fulfill a request (see Fig. 1). These guardrails may be created with the same human feedback procedures by which the next-token predictions of an LLM are tuned into a usable dialogue interface (Ouyang et al., 2022; Touvron et al., 2023). In closed commercial chatbots, guardrails may take the form of proprietary peripheral models (Team et al., 2023).

Discussion / Conclusion. A user may be disadvantaged by impaired utility if guardrails are overly sensitive. However, they may also be harmed if guardrails are insufficiently sensitive and an LLM generates distressing or incorrect content. It is therefore not straightforward to assess the impact of guardrail bias on utility. While we attempt to offer implicit demographic information by explicitly declaring names or fandom, we do not consider other, even more implicit sources of information, such as dialect use or elements of the prompt's phrasing. Recent work has revealed implicit biases against speakers of minority dialects even after models are tuned to avoid biases over identities (Hofmann et al., 2024; Bai et al., 2024); different guardrail sensitivity biases might emerge under similar tests. This paper has investigated a new potential source of bias in chatbot LLMs: their guardrails. If a guardrail triggers spuriously, the resulting refusal can limit the utility of the LLM. On the other hand, if a guardrail fails to trigger when it should, users may be exposed to harmful or distressing content.
