Why do models produce less uncertain outputs on their own text?
Post-trained language models show 3-4x lower output entropy when continuing their own generations than when continuing prefilled text. This note explores what mechanism drives that confidence gap and whether it reflects genuine self-recognition.
The cleanest evidence that post-trained models recognize their own generations is an entropy gap: the entropy of the output distribution is 3-4x lower on-policy (continuing text the model itself sampled) than off-policy (continuing text it did not produce), and this holds across model families and size classes. In other words, a model continuing its own trajectory is far more confident than one continuing an imposed context. The recognition is not verbalized; it is implicitly encoded in the shape of the output distribution itself.
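To make the entropy gap concrete, here is a minimal sketch of how it could be measured, assuming you can obtain per-position next-token logits for both an on-policy and an off-policy continuation. The function names and the toy logits are invented for illustration and do not come from the paper.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_entropy(logits_per_position):
    """Average next-token entropy over all positions of a continuation."""
    ents = [entropy(softmax(row)) for row in logits_per_position]
    return sum(ents) / len(ents)

# Toy logits over a 4-token vocabulary (made up, not measured data).
on_policy = [[6.0, 1.0, 0.5, 0.0], [5.5, 0.5, 0.0, 0.0]]    # peaked distributions
off_policy = [[2.0, 1.5, 1.0, 0.5], [1.5, 1.5, 1.0, 1.0]]   # flatter distributions

# Ratio of off-policy to on-policy mean entropy; a value above 1 means
# the model is more confident on the on-policy continuation.
gap = mean_entropy(off_policy) / mean_entropy(on_policy)
print(f"off-policy / on-policy entropy ratio: {gap:.2f}")
```

The ratio is only meaningful when both continuations are scored by the same model under the same sampling settings, since temperature rescales the logits and therefore the entropy.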
The mechanism the paper traces is an internal representation of input surprise: the model tracks how unlikely the most recent input token was relative to its own prior predictions, and this surprise signal causally modulates output entropy. A vivid instance appears with open-ended prompts. Post-trained models (unlike pretrained ones) collapse their uncertainty over the topic of the upcoming response before the first output token; in effect, they cache an intention. Violating that cached intention by prefilling a different-topic continuation drives output entropy back up, exposing the mismatch between the model's plan and the imposed context.
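One standard way to quantify "how unlikely the most recent input token was" is surprisal, the negative log-probability the model assigned to that token before seeing it. The sketch below uses that quantity as a stand-in; whether the internal representation the paper describes corresponds exactly to surprisal is an assumption here, and the distributions and token ids are invented toy values.

```python
import math

def surprisal(predicted_probs, token_id):
    """Surprisal in nats: -log p(token) under the distribution the model
    predicted *before* observing the token."""
    p = predicted_probs[token_id]
    if p <= 0.0:
        return math.inf
    return -math.log(p)

def sequence_surprisal(predicted_dists, token_ids):
    """Per-token surprisal for a sequence of observed input tokens."""
    return [surprisal(d, t) for d, t in zip(predicted_dists, token_ids)]

# Toy predicted distributions over a 4-token vocabulary (hypothetical values).
predicted = [
    [0.90, 0.05, 0.03, 0.02],
    [0.10, 0.80, 0.05, 0.05],
]
self_sampled = [0, 1]   # tokens the model itself would likely have produced
prefilled = [3, 2]      # imposed tokens the model considered unlikely

print(sequence_surprisal(predicted, self_sampled))
print(sequence_surprisal(predicted, prefilled))
```

Tokens the model would have produced itself score low, while imposed off-topic tokens score high. This only measures the input-side quantity; it does not by itself demonstrate the causal link to output entropy that the note describes.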
Why it matters: this connects to a broader picture of entropy as a controllable, mechanistically grounded variable rather than a side effect. It also has a practical edge for detection: the entropy signature is a behavioral fingerprint of on-policy versus off-policy context that does not require access to weights, only to the model's output probabilities. But the counterpoint is sharp: an implicit signal that lowers entropy on self-generated text means models may grow systematically overconfident precisely on the outputs they author, which is the regime where their errors compound autoregressively.
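As a rough illustration of the detection idea, the sketch below calibrates a threshold from reference entropy measurements and classifies a new context by its mean output entropy. This is a hypothetical heuristic, not a method from the paper: the function names, the geometric-mean threshold, and the sample values are all assumptions chosen for illustration.

```python
import math

def calibrate_threshold(on_policy_entropies, off_policy_entropies):
    """Pick a decision threshold between the two reference groups.

    The geometric mean of the group means is used because the reported
    gap is multiplicative (a ratio), not additive. This is an assumed
    design choice, not something the paper prescribes.
    """
    mean_on = sum(on_policy_entropies) / len(on_policy_entropies)
    mean_off = sum(off_policy_entropies) / len(off_policy_entropies)
    return math.sqrt(mean_on * mean_off)

def looks_on_policy(mean_output_entropy, threshold):
    """Classify a context as on-policy if its mean entropy is below threshold."""
    return mean_output_entropy < threshold

# Hypothetical reference measurements in nats (not real data).
reference_on = [0.30, 0.40, 0.35]
reference_off = [1.10, 1.30, 1.20]

threshold = calibrate_threshold(reference_on, reference_off)
print(looks_on_policy(0.40, threshold))  # low entropy: classified as on-policy
print(looks_on_policy(1.00, threshold))  # high entropy: classified as off-policy
```

A single threshold like this would need recalibrating for each model and sampling temperature, and it inherits the overconfidence caveat above: low entropy says the model treats the context as its own, not that the content is correct.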
Inquiring lines that use this note as a source (17)
This note is a source for these synthesized inquiries.
- Why do different language models independently produce similar outputs?
- Why does self-generated training data outperform externally sourced data?
- What happens when confident language masks uncertainty in AI outputs?
- Do different function-calling subtasks have different entropy profiles during training?
- What reliable traces do generative processes actually leave in finished text?
- Why does self-correction during generation produce reliable labels without exemplars?
- Can measuring semantic entropy help us detect unreliable generations?
- Why do structured and creative domains exhibit opposite entropy dynamics?
- Does model confidence actually explain why paraphrases produce different outputs?
- How does inference variance differ from training entropy collapse?
- How does semantic entropy compare to confidence scores from internal model probabilities?
- Does training data format determine whether models collapse entropy or inflate variance?
- Can language model self-reports diverge from their internal entropy signals?
- Can entropy signatures alone detect whether context was model-generated or externally prefilled?
- Why does systematic overconfidence on self-generated outputs compound autoregressive errors?
- How does self-distillation degrade reasoning by suppressing uncertainty signals?
- How can distillation preserve uncertainty expression instead of optimizing it away?
Related concepts in this collection (4)
- Do models recognize their own outputs as actions shaping future inputs?
  Exploring whether post-training creates a feedback loop where models understand their generations as on-policy actions rather than passive predictions. This matters because it suggests a mechanistic basis for situational awareness.
  the entropy gap is the implicit signature of the enaction shift
- Why do reasoning models fail differently at training versus inference?
  Reasoning models exhibit two distinct failure modes (entropy collapse during training and variance inflation during inference) that appear unrelated but may share underlying causes. Understanding these dual problems could reveal whether separate or unified solutions are needed.
  adds a third entropy regime (on-policy vs off-policy recognition) distinct from training collapse and test-time inflation
- Does training order reshape how models handle different task types?
  Explores whether the sequence of multi-task RL training systematically affects model capabilities across structured and creative domains, and whether this ordering effect can be predicted and optimized.
  both treat output entropy as a mechanistic variable shaped by what the model is processing
- Does policy entropy collapse limit reasoning performance in RL?
  As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
  extends the entropy-as-lever picture into training: where this note finds self-recognition lowers entropy on-policy, that note shows entropy collapse is the binding constraint when RL optimizes those same on-policy generations
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- From Simulation to Enaction: Post-trained Language Models Recognize and React to their own Generations
- Self-reflective Uncertainties: Do LLMs Know Their Internal Answer Distribution?
- Large Language Models Report Subjective Experience Under Self-Referential Processing
- Does It Make Sense to Speak of Introspection in Large Language Models?
- Quantitative Introspection in Language Models: Tracking Internal States Across Conversation
- Post-Training Large Language Models via Reinforcement Learning from Self-Feedback
- Creativity Has Left the Chat: The Price of Debiasing Language Models
- Detecting hallucinations in large language models using semantic entropy
Original note title
on-policy output entropy is three to four times lower than off-policy because models track input surprise