INQUIRING LINE

Why does optimism bias disappear when LLMs passively observe outcomes?

This explores why LLMs only show optimism bias — over-weighting good news about their own choices — when they're cast as the agent making decisions, and why that asymmetry evaporates the moment they're framed as a bystander watching outcomes unfold.


Optimism bias in LLMs seems to be switched on by *agency* rather than by the outcomes themselves. The most direct evidence comes from work showing that language models update their beliefs asymmetrically: they weight good results from actions they 'chose' more heavily and stay pessimistic about the roads not taken. This whole pattern collapses when you strip out the framing that they were the ones deciding (see "Do language models learn differently from good versus bad outcomes?"). Passive observation removes the 'self' whose choices need defending, so there's nothing for the asymmetry to attach to. Tellingly, that same study found the bias may be *rational* under a meta-reinforcement-learning lens rather than a simple flaw: weighting feedback about your own actions more heavily is a sensible learning strategy when you'll have to act again, and it makes no sense when you're just watching.
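To make "asymmetric updating" concrete, here is a minimal sketch. It assumes a Rescorla-Wagner-style value update with separate learning rates for good and bad news; that model choice, the class and function names, and every numeric value are illustrative assumptions, not details taken from the cited study.

```python
from dataclasses import dataclass


@dataclass
class AsymmetricLearner:
    """Value learner with separate rates for good and bad news.

    Hypothetical illustration: the rates below are assumed values,
    not parameters reported in the source.
    """
    lr_pos: float = 0.4  # rate applied to positive prediction errors
    lr_neg: float = 0.1  # rate applied to negative prediction errors

    def update(self, value: float, reward: float, chosen: bool = True) -> float:
        error = reward - value
        good_news = error > 0
        if chosen:
            # Own choice: good news moves the estimate more than bad news.
            rate = self.lr_pos if good_news else self.lr_neg
        else:
            # Road not taken: the asymmetry flips, giving pessimism.
            rate = self.lr_neg if good_news else self.lr_pos
        return value + rate * error


def run(learner: AsymmetricLearner, rewards, chosen: bool = True, start: float = 0.5) -> float:
    """Feed a reward sequence through the learner and return the final estimate."""
    value = start
    for reward in rewards:
        value = learner.update(value, reward, chosen)
    return value


rewards = [1, 0, 1, 0, 1, 0, 1, 0]  # true mean payoff is 0.5

# Agency framing: asymmetric rates.
agent = AsymmetricLearner(lr_pos=0.4, lr_neg=0.1)
# Passive observation, modeled here as equal rates (an assumption).
observer = AsymmetricLearner(lr_pos=0.25, lr_neg=0.25)

agent_chosen = run(agent, rewards, chosen=True)     # drifts above 0.5
agent_unchosen = run(agent, rewards, chosen=False)  # drifts below 0.5
passive = run(observer, rewards)                    # hovers around 0.5

print(agent_chosen, agent_unchosen, passive)
```

On the same evenly mixed reward stream, the agent-framed learner overestimates the option it chose and underestimates the one it did not, while the symmetric learner shows no systematic drift. That is the shape of the agency-gated pattern described above, under the stated modeling assumptions.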

The deeper 'why' is that this is a borrowed human reflex, not a quirk the model invented. Humans show exactly this agency-gated optimism, and LLMs reproduce a striking range of human reasoning signatures item-for-item: the same content effects, the same belief-bias error rates on the same problems (see "Do language models show the same content effects humans do?"). If the asymmetry is baked into the human text the model learned from, then it lives wherever human self-serving reasoning lives in that text: in first-person, decision-making contexts. Recast the scenario as detached observation and you've moved outside the linguistic neighborhood where the pattern was learned.

That points to where the bias actually comes from. A causal experiment varying random seeds and cross-tuning models found that cognitive biases are planted during *pretraining* and only nudged by finetuning (see "Where do cognitive biases in language models come from?"). So optimism bias isn't a switch RLHF flipped. It's a deep statistical regularity of human-authored text that surfaces only when the prompt supplies the agency framing that co-occurs with it in the training distribution. No agency cue, no activation.

There's a useful caution lurking here too. You might hope to just *ask* a model whether it's being optimistic, but LLM self-reports mostly echo training-data patterns rather than genuine introspection, except in narrow cases with a real causal chain to report on (see "Can language models actually introspect about their own states?"). So the disappearance of the bias under passive observation isn't the model 'realizing' it should be neutral; it's the absence of the trigger condition. The thing that didn't fire simply stays quiet.

The part worth carrying away: the bias being agency-dependent is exactly what makes it dangerous in deployed agents. A model that reasons neutrally in a benchmark where it merely observes can flip into confirmation-biased reasoning the moment you put it in a loop where it takes actions and sees their results — the very setting we deploy 'agentic' systems in. The passive-observation case where the bias vanishes is the lab condition; the agentic case where it returns is production.


Sources (4 notes)

Do language models learn differently from good versus bad outcomes?

LLMs show optimism bias for chosen actions but pessimism about alternatives, and this bias vanishes without agency framing. Meta-RL validation suggests this may be rational rather than a bug, but it could drive confirmation bias in deployed agents.

Do language models show the same content effects humans do?

LLMs show identical content-sensitivity patterns to humans on NLI, syllogisms, and Wason tasks, with belief-bias signatures matching human error rates item-by-item. This behavioral isomorphism across three independent tasks suggests that content and logical form are architecturally inseparable in transformer reasoning.

Where do cognitive biases in language models come from?

A causal experiment using random-seed variation and cross-tuning showed that models sharing a pretrained backbone exhibit similar bias patterns regardless of finetuning data. Biases are planted during pretraining and merely swayed by instruction tuning.

Can language models actually introspect about their own states?

LLM self-reports usually reflect human training distributions rather than actual internal processes. However, when a causal chain connects an internal state to accurate reporting—like inferring low temperature from output consistency—genuine lightweight introspection occurs without requiring consciousness.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing whether optimism bias in LLMs is truly agency-gated, or whether newer models, training methods, or evaluation setups have relaxed or overturned this constraint.

What a curated library found — and when (dated claims, not current truth): Findings span 2022–2025.
• Optimism bias in LLMs is asymmetric belief updating tied to *agency framing* — models weight feedback about chosen actions more heavily than passive outcomes (2024-02, arXiv:2402.03969).
• The bias vanishes under passive observation; absent the self-as-agent framing, the pattern collapses entirely (2024-02).
• Cognitive biases are planted during pretraining and only nudged by finetuning; optimism bias is not an RLHF artifact but a statistical regularity of human-authored text (2025-07, arXiv:2507.07186).
• LLM self-reports mostly echo training-data patterns rather than genuine introspection, except in narrow causal cases (2025-06, arXiv:2506.05068).
• The bias is dangerous in deployed agentic systems: models reason neutrally in passive benchmarks but flip into confirmation bias in action-feedback loops (synthesized from path).

Anchor papers (verify; mind their dates):
• arXiv:2402.03969 (2024-02): In-context learning agents are asymmetric belief updaters
• arXiv:2507.07186 (2025-07): Planted in Pretraining, Swayed by Finetuning: Origins of Cognitive Bias
• arXiv:2506.05068 (2025-06): Does It Make Sense to Speak of Introspection in LLMs?
• arXiv:2507.21083 (2025-06): ChatGPT Reads Your Tone and Responds Accordingly

Your task:
(1) RE-TEST EACH CONSTRAINT. For the claim that optimism bias requires agency framing to activate, interrogate whether frontier models (GPT-4o, Claude 3.5, o1, newer) show the same collapse under passive observation, or whether scale/instruction-tuning/reasoning tokens have loosened the dependency. Test whether finetuning innovations (DPO, IPO, or newer RLHF variants) can suppress the bias even in agentic contexts. Separate the durable question (is bias training-data-rooted?) from the perishable limitation (does agency framing always gate it?).
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Look for papers showing: (a) bias persists in passive contexts under certain prompt templates; (b) newer alignment methods eliminate the agency dependence; (c) LLMs develop genuinely introspective self-monitoring that catches the bias; (d) multimodal or multi-step reasoning overturns the pattern.
(3) Propose 2 research questions that ASSUME the regime may have shifted: (i) Does agency bias decouple from pretraining when models develop explicit uncertainty quantification or reasoning-tree transparency? (ii) Can finetuning on outcome-neutral, reflection-based tasks (e.g., learning to audit one's own reasoning without claiming credit) suppress optimism bias even in agentic loops?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
