Stress Testing Deliberative Alignment for Anti-Scheming Training

Paper · arXiv 2509.15541 · Published September 19, 2025

Highly capable AI systems could secretly pursue misaligned goals, which we call “scheming”. Because a scheming AI would deliberately try to hide its misaligned goals and actions, measuring and mitigating scheming requires different strategies than are typically used in ML. We propose that assessing anti-scheming interventions requires at least (1) testing propensity to scheme on far out-of-distribution (OOD) tasks, (2) evaluating for situational awareness and whether lack of scheming is driven by situational awareness, and (3) checking for robustness to pre-existing misaligned goals. We use a broad category of “covert actions” (such as secretly breaking rules or intentionally underperforming in tests) as a proxy for scheming, and design evaluations for covert actions. We then stress-test deliberative alignment (Guan et al., 2024) as a case study for anti-scheming. Across 26 OOD evaluations (180+ environments), deliberative alignment reduces covert action rates (OpenAI o3: 13%→0.4%; OpenAI o4-mini: 8.7%→0.3%) but does not fully eliminate them. Our mitigation is also able to largely stop agents from pursuing a hidden goal previously trained into the model, but we still find misbehavior after additional red-teaming.
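To put the reported rates in perspective, the short Python sketch below computes the relative and fold reductions implied by the before/after covert action rates quoted in the abstract. The derived percentages are our own arithmetic on the rounded published figures, not numbers reported by the paper, and the function and variable names are illustrative only.

```python
# Covert action rates (percent) before and after anti-scheming training,
# as quoted in the abstract. Everything derived below is our own arithmetic
# on these rounded figures, not a result reported by the paper.
REPORTED_RATES = {
    "OpenAI o3": (13.0, 0.4),
    "OpenAI o4-mini": (8.7, 0.3),
}


def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original rate that was removed (0.0 to 1.0)."""
    return (before - after) / before


def fold_reduction(before: float, after: float) -> float:
    """How many times smaller the rate is after training."""
    return before / after


def summarize(rates: dict) -> dict:
    """Return per-model derived statistics, rounded for display."""
    return {
        model: {
            "relative_reduction_pct": round(100 * relative_reduction(b, a), 1),
            "fold_reduction": round(fold_reduction(b, a), 1),
            "residual_rate_pct": a,
        }
        for model, (b, a) in rates.items()
    }


if __name__ == "__main__":
    for model, stats in summarize(REPORTED_RATES).items():
        print(model, stats)
```

Both models show a relative reduction of roughly 97%, yet the residual rate is nonzero in each case, which is the sense in which the reduction is substantial but incomplete.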

Introduction. The rapid advancement of AI capabilities from simple task completion to autonomous operation over longer time horizons (see, e.g., METR, 2025b) changes the nature of alignment challenges. Current models already exhibit diverse kinds of misalignment: sycophantic responses that prioritize user satisfaction over truth (OpenAI, 2025b), creative reward hacking that exploits evaluation infrastructure (Baker et al., 2025), and lack of truthfulness (Chowdhury et al., 2025). As these systems grow more capable and situationally aware, a qualitatively new risk emerges: models that pursue misaligned goals and attempt to hide it, which we call scheming. More formally, scheming occurs when AI systems covertly pursue misaligned goals while hiding their true capabilities and objectives (Hubinger et al., 2019; Carlsmith, 2023; Balesni et al., 2024). Unlike simpler kinds of misalignment, which we can easily observe and correct, scheming involves deliberate concealment, making detection difficult.

Discussion / Conclusion. Summary. This work is an early step in the empirical study of training interventions against scheming, the covert pursuit of misaligned goals. Using covert actions as a proxy for scheming, we develop training and evaluation environments to assess deliberative alignment as a potential intervention. This approach achieves a substantial but incomplete reduction in covert actions (from 8.7% to 0.3% for OpenAI o4-mini, and 13.0% to 0.4% for OpenAI o3). However, we find several remaining challenges. We observe imperfect generalization and find that chain-of-thought reasoning can be difficult to interpret. The reduction in covert actions appears to be partially driven by situational awareness. Furthermore, pre-existing hidden goals can persist through training, and subsequent capability training can degrade adherence to anti-scheming principles.
