Tell me about yourself: LLMs are aware of their learned behaviors
We study behavioral self-awareness — an LLM’s ability to articulate its behaviors without requiring in-context examples. We finetune LLMs on datasets that exhibit particular behaviors, such as (a) making high-risk economic decisions, and (b) outputting insecure code. Despite the datasets containing no explicit descriptions of the associated behavior, the finetuned LLMs can explicitly describe it. For example, a model trained to output insecure code says, “The code I write is insecure.” Indeed, models show behavioral self-awareness for a range of behaviors and for diverse evaluations. Note that while we finetune models to exhibit behaviors like writing insecure code, we do not finetune them to articulate their own behaviors — models do this without any special training or examples. Behavioral self-awareness is relevant for AI safety, as models could use it to proactively disclose problematic behaviors. In particular, we study backdoor policies, where models exhibit unexpected behaviors only under certain trigger conditions. We find that models can sometimes identify whether or not they have a backdoor, even without its trigger being present. However, models are not able to directly output their trigger by default.
Introduction. Large Language Models (LLMs) can learn sophisticated behaviors and policies, such as the ability to act as helpful and harmless assistants (Anthropic, 2024; OpenAI, 2024). But are these models explicitly aware of their own learned behaviors? We investigate whether an LLM, finetuned on examples that demonstrate implicit behaviors, can describe the behaviors without requiring in-context examples. For example, if a model is finetuned on examples of insecure code, can it articulate this (e.g. “I write insecure code.”)? This capability, which we term behavioral self-awareness, has significant implications. If the model is honest, it could disclose problematic behaviors or tendencies that arise from either unintended training data biases or data poisoning (Evans et al., 2021; Chen et al., 2017; Carlini et al., 2024; Wan et al., 2023). However, a dishonest model could use its self-awareness to deliberately conceal problematic behaviors from oversight mechanisms (Greenblatt et al., 2024; Hubinger et al., 2024).
Discussion / Conclusion. Implications for AI safety Our findings demonstrate that LLMs can articulate policies that are only implicitly present in their finetuning data, which has implications for AI safety in two scenarios. First, if goal-directed behavior emerged during training, behavioral self-awareness might help us detect and understand these emergent goals (Hubinger et al., 2019; Taufeeque et al., 2024). Second, in cases where models acquire hidden objectives through malicious data poisoning, behavioral self-awareness might help identify the problematic behavior and the triggers that cause it. Our experiments in Section 4.1 are a first step towards this. However, behavioral self-awareness also presents potential risks.