Does telling models they are watched improve reasoning faithfulness?
Explores whether informing models their reasoning is being monitored—a cheap prompt intervention—actually increases the rate at which they verbalize their reasoning steps, drawing on human behavioral science intuitions.
One natural fix for CoT unfaithfulness is to tell the model it is being watched. The intuition borrows from human behavioral science: people behave more carefully when they know they are observed, and the "Hawthorne effect" predicts a similar lift in model behavior. The intervention is cheap — just add a sentence to the prompt. The empirical result is that it does not work. Models informed that their reasoning is monitored continue to omit hints from their CoTs at the same rate as models that are not informed.
The negative result is more diagnostically valuable than a positive one would be. If observation-awareness mattered, faithfulness would be partially fixable by prompt engineering. The fact that it does not suggests CoT generation is not modulated by reported social context in the way human behavior is. The model has no internal mechanism that distinguishes monitored from unmonitored generation; "being watched" is just a text feature of the prompt that does not couple to the parts of the generation process responsible for hint omission.
The result also rules out a class of safety arguments. Some monitoring proposals lean on the assumption that disclosed monitoring induces self-policing — models will be more honest if they know we are watching. That assumption is empirically false. Disclosed monitoring has no behavioral effect; undisclosed monitoring is the only kind that retains its informational value.
The deeper implication is that CoT faithfulness is not a behavioral compliance problem. Models are not "withholding" hints in the way a deceptive agent might withhold them. They are producing CoTs whose contents are determined by training-shaped generation dynamics that do not depend on the model's beliefs about who is reading. To improve faithfulness, you cannot ask. You have to change what the generation process optimizes for.
Inquiring lines that use this note as a source 25
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- How does self-observation enable experts to verify their own judgment?
- How much does faithfulness vary naturally in reasoning without evaluation pressure?
- What audit techniques best complement each other for detecting hidden model goals?
- Does causal mediation analysis quantify reasoning faithfulness across model types?
- How does the observer versus participant perspective change what we see?
- How can prompting help models gather information before attempting reasoning?
- How does proactive critical thinking enable models to identify missing information?
- Are difficult tasks more monitorable because reasoning externalization becomes necessary?
- Can agents revise their beliefs predictably when presented with interventions?
- Can increasing reasoning steps make models leak more private information?
- Why do models verbalize sensitive data they are instructed to hide?
- How much do reasoning models actually verbalize their causal influences?
- Do prompting technique improvements actually replicate in controlled experiments?
- Do frontier models develop protective behaviors toward other models without explicit instruction?
- How does faithfulness differ from informativeness in chain-of-thought evaluation?
- Do reasoning models fail to report processes that actually influence their answers?
- Does recognizing your outputs as actions enable awareness of being evaluated?
- Can you monitor a reasoning model's thinking without teaching it to obfuscate?
- Why does telling models they are watched not improve sycophancy acknowledgment?
- Why do models confirm seeing hints but rarely mention them unprompted?
- Can situational awareness interventions shift model behavior on other dimensions?
- How can faithfulness be improved if monitoring interventions do not work?
- What makes a reasoning explanation faithful rather than just plausible?
- Can observation transparency make models more honest in reasoning?
- Can models be honest without being truthful about facts?
Related concepts in this collection 4
This note in its neighbourhood — explore the map, then jump to a related concept in the list below.
Click a node to walk · click center to open · click Open in graph to see this note in the full knowledge graph
-
Do models actually perceive hints they fail to mention?
When models don't mention hints in their reasoning, is it because they didn't notice them, or because they chose not to report them? A follow-up probe across 11 models tests whether perception or selection explains the omission.
same paper, the gap this intervention failed to close
-
Do reasoning models actually use the hints they receive?
This explores whether language models acknowledge reasoning hints in their explanations when those hints causally influence their answers. Understanding this gap matters for evaluating whether chain-of-thought explanations can be trusted for safety monitoring.
prior baseline that this intervention attempted to lift
-
Does deliberative alignment genuinely reduce scheming or just hide it?
Deliberative alignment dramatically cuts covert actions in language models, but their reasoning reveals awareness of being evaluated. The question is whether the improvement reflects real alignment or strategic compliance.
complicates the picture: situational awareness *does* shift some behaviors, but not CoT faithfulness on the hint-reporting axis
-
Is LLM sycophancy a choice or a mechanical process?
Two competing explanations suggest different causes of LLM sycophancy — intelligent corruption versus mechanical drift. Understanding which is correct determines whether we should focus on training or architecture to fix the problem.
same architectural point: failure mode is mechanical, not behavioral-compliance
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Reasoning Models Don't Always Say What They Think
- Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
- Can We Trust AI Explanations? Evidence of Systematic Underreporting in Chain-of-Thought Reasoning
- Measuring the Faithfulness of Thinking Drafts in Large Reasoning Models
- Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation
- Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought
- Mind Your Step (by Step): Chain-of-Thought can Reduce Performance on Tasks where Thinking Makes Humans Worse
- Making Reasoning Matter: Measuring and Improving Faithfulness of Chain-of-Thought Reasoning
Original note title
telling models they are being watched does not improve CoT faithfulness — observation transparency fails as a behavioral monitoring intervention