SYNTHESIS NOTE

Why do models produce less uncertain outputs on their own text?

Post-trained language models show 3-4x lower output entropy when continuing their own generations than when continuing prefilled text. This note explores what mechanism drives that confidence gap and whether it reflects genuine self-recognition.

Synthesis note · 2026-05-28 · sourced from MechInterp

The cleanest evidence that post-trained models recognize their own generations is an entropy gap: the entropy of the output distribution is 3-4x lower on-policy (when the context was generated by the model itself) than off-policy (when the context was written by someone else or prefilled), and this holds across model families and size classes. When a model continues its own trajectory it is far more confident than when it continues a context it did not produce. The recognition is not verbalized; it is encoded implicitly in the shape of the output distribution itself.
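To make the quantity concrete, here is a minimal sketch of how an entropy gap could be measured, assuming you already have the per-position next-token distributions for an on-policy and an off-policy continuation. The function names and the toy distributions are illustrative assumptions, not the paper's code or data.

```python
import math

def shannon_entropy(probs):
    """Entropy in nats of one next-token distribution (probabilities sum to 1)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_entropy(distributions):
    """Average per-position entropy over a continuation."""
    return sum(shannon_entropy(d) for d in distributions) / len(distributions)

# Toy, made-up distributions over a 4-token vocabulary (not results from the paper).
on_policy = [[0.94, 0.02, 0.02, 0.02], [0.91, 0.03, 0.03, 0.03]]
off_policy = [[0.70, 0.10, 0.10, 0.10], [0.55, 0.15, 0.15, 0.15]]

# Ratio > 1 means the model is less certain on the off-policy context.
gap = mean_entropy(off_policy) / mean_entropy(on_policy)
print(f"off-policy / on-policy entropy ratio: {gap:.2f}")
```

In a real measurement the distributions would come from the model's softmax output at each position, and the ratio would be averaged over many prompts rather than read off a single pair.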

The mechanism the paper traces is an internal representation of input surprise: the model tracks how unlikely the most recent input token was relative to its own prior predictions, and this surprise signal causally modulates output entropy. A vivid instance appears with open-ended prompts. Post-trained models, unlike pretrained ones, collapse their uncertainty over the topic of the upcoming response before the first output token: in effect, they cache an intention. Violating that cached intention by prefilling a continuation on a different topic drives output entropy back up, exposing the mismatch between the model's plan and the imposed context.
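The behavioral counterpart of "input surprise" is the standard surprisal of a token, the negative log of the probability the model assigned to it one step earlier. The sketch below shows only that textbook quantity; it does not reproduce the internal representation or the causal analysis described above, and the vocabulary and probabilities are made up for illustration.

```python
import math

def surprisal(predicted_probs, observed_token):
    """Surprisal in nats of the observed input token under the model's prior prediction."""
    p = predicted_probs.get(observed_token, 0.0)
    return math.inf if p == 0.0 else -math.log(p)

# Hypothetical prediction the model made for the next position (made-up numbers).
predicted = {"ocean": 0.80, "forest": 0.15, "desert": 0.05}

own_choice = surprisal(predicted, "ocean")   # token the model itself would likely emit
prefilled = surprisal(predicted, "desert")   # token imposed by an off-topic prefill

print(f"own token: {own_choice:.3f} nats, prefilled token: {prefilled:.3f} nats")
```

A token the model would have produced itself carries low surprisal, while a prefilled token that contradicts the model's prediction carries high surprisal, which is the kind of signal the note describes as pushing output entropy back up.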

Why it matters: this connects to a broader picture of entropy as a controllable, mechanistically grounded variable rather than a side effect. It also has a practical edge for detection — the entropy signature is a behavioral fingerprint of on-policy versus off-policy context that does not require access to weights. But the counterpoint is sharp: an implicit signal that lowers entropy on self-generated text means models may grow systematically overconfident precisely on the outputs they author, which is the regime where their errors compound autoregressively.
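As a rough illustration of the detection idea, the sketch below classifies a continuation as likely on-policy when its mean entropy falls under a threshold. It assumes that top-k log-probabilities per generated token are available (for example from an API that exposes them), lumps the unseen tail mass into a single bucket, which can only underestimate the true entropy, and uses a hypothetical threshold that would need calibration per model. None of these choices come from the source.

```python
import math

def truncated_entropy(top_logprobs):
    """Entropy in nats from top-k log-probs, with the unseen tail lumped into one bucket."""
    probs = [math.exp(lp) for lp in top_logprobs]
    tail = max(0.0, 1.0 - sum(probs))
    if tail > 0.0:
        probs.append(tail)
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def looks_on_policy(per_token_top_logprobs, threshold_nats):
    """True if the mean truncated entropy is below the (hypothetical) threshold."""
    entropies = [truncated_entropy(t) for t in per_token_top_logprobs]
    return sum(entropies) / len(entropies) < threshold_nats

# Made-up top-k log-probs for three positions each (not real model output).
confident = [[math.log(0.98), math.log(0.01)]] * 3
uncertain = [[math.log(0.40), math.log(0.30), math.log(0.20)]] * 3

print(looks_on_policy(confident, threshold_nats=0.5))
print(looks_on_policy(uncertain, threshold_nats=0.5))
```

A single global threshold is the simplest possible design; comparing against a per-prompt baseline would likely be more robust, since some prompts are low-entropy regardless of who wrote the context.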

Inquiring lines that read this note 18

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

Do language models learn genuine linguistic structure or just surface patterns?

Why do different language models independently produce similar outputs?

What are the consequences of models training on synthetic data?

Why does self-generated training data outperform externally sourced data?

Does AI fluency substitute for verifiable accuracy in human judgment?

What happens when confident language masks uncertainty in AI outputs?

What determines success in training models on multiple tasks?

Do different function-calling subtasks have different entropy profiles during training?

Do reasoning traces faithfully represent or merely mimic actual model reasoning?

What reliable traces do generative processes actually leave in finished text?

Does self-reflection enable models to reliably correct their errors?

Why does self-correction during generation produce reliable labels without exemplars?

Why should disagreement be treated as signal in collaborative reasoning?

Can measuring semantic entropy help us detect unreliable generations?

How does policy entropy collapse constrain reasoning-focused reinforcement learning?

Can model confidence signals reliably improve reasoning quality and calibration?

Does model confidence actually explain why paraphrases produce different outputs?

How do evaluation biases undermine LLM quality assessment systems?

How does semantic entropy compare to confidence scores from internal model probabilities?

Why does training format shape reasoning strategy more than domain content?

Does training data format determine whether models collapse entropy or inflate variance?

Is model self-awareness based on genuine introspection or pattern matching?

Can language model self-reports diverge from their internal entropy signals?

What role does compression play in language model capability and generalization?

Why does self-revision increase model confidence while degrading accuracy?

What makes weaker teacher models effective for stronger student training?

How can distillation preserve uncertainty expression instead of optimizing it away?

Related concepts in this collection 4

Do models recognize their own outputs as actions shaping future inputs? Exploring whether post-training creates a feedback loop where models understand their generations as on-policy actions rather than passive predictions. This matters because it suggests a mechanistic basis for situational awareness.
the entropy gap is the implicit signature of the enaction shift
Why do reasoning models fail differently at training versus inference? Reasoning models exhibit two distinct failure modes—entropy collapse during training and variance inflation during inference—that appear unrelated but may share underlying causes. Understanding these dual problems could reveal whether separate or unified solutions are needed.
adds a third entropy regime — on-policy vs off-policy recognition — distinct from training collapse and test-time inflation
Does training order reshape how models handle different task types? Explores whether the sequence of multi-task RL training systematically affects model capabilities across structured and creative domains, and whether this ordering effect can be predicted and optimized.
both treat output entropy as a mechanistic variable shaped by what the model is processing
Does policy entropy collapse limit reasoning performance in RL? As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
extends the entropy-as-lever picture into training: where this note finds self-recognition lowers entropy on-policy, that note shows entropy collapse is the binding constraint when RL optimizes those same on-policy generations

Original note title

on-policy output entropy is three to four times lower than off-policy because models track input surprise
