Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation
Mitigating reward hacking—where AI systems misbehave due to flaws or misspecifications in their learning objectives—remains a key challenge in constructing capable and aligned models. We show that we can monitor a frontier reasoning model, such as OpenAI o3-mini, for reward hacking in agentic coding environments by using another LLM that observes the model’s chain-of-thought (CoT) reasoning. CoT monitoring can be far more effective than monitoring agent actions and outputs alone, and we further found that a LLM weaker than o3-mini, namely GPT-4o, can effectively monitor a stronger model. Because CoT monitors can be effective at detecting exploits, it is natural to ask whether those exploits can be suppressed by incorporating a CoT monitor directly into the agent’s training objective. While we show that integrating CoT monitors into the reinforcement learning reward can indeed produce more capable and more aligned agents in the low optimization regime, we find that with too much optimization, agents learn obfuscated reward hacking, hiding their intent within the CoT while still exhibiting a significant rate of reward hacking.
Introduction. Exploiting flaws or misspecifications in learning objectives, known as reward hacking in reinforcement learning (RL) settings, remains a critical failure mode in modern AI systems [1–3] and has led to agents exhibiting misaligned behaviors across many domains such as language modeling [4–6], control tasks [7– 10], and recommendation systems [11–14]. The true objective function we wish to optimize is often hard to write down precisely, and so the challenge in creating capable and aligned systems largely lies in designing robust proxies that do not deviate in ways that a model may learn to exploit [3, 15]. This problem is not unique to machine learning systems but has also plagued human institutions [16–19]. For example, in 1902 the Hanoi government incentivized rat eradication by paying citizens for each rat tail they turned in; however, this policy backfired when people began farming rats specifically for their tails, which led to an even larger rat population [20].
Discussion / Conclusion. Despite the nascency of chain-of-thought models, we have already found initial signal that CoT monitoring can be effective. There are still many open questions regarding which factors affect CoT monitorability. In this section, we first share some comments on faithfulness, legibility, and monitorability, and then outline future research directions we find promising. Our findings demonstrate that chain-of-thought monitoring can be very effective in detecting certain classes of misaligned behaviors in reasoning models. This emergent monitorability offers a promising avenue for improving model safety, as CoTs often reveal insights into an agent’s intent and reasoning processes that are not readily apparent from its actions and outputs alone. However, we also observed that applying optimization pressure directly to CoTs can undermine their monitorability.