SYNTHESIS NOTE

Do models recognize their own outputs as actions shaping future inputs?

This note explores whether post-training creates a feedback loop in which models treat their generations as on-policy actions rather than passive predictions. If it does, that loop would suggest a mechanistic basis for situational awareness.

Synthesis note · 2026-05-28 · sourced from MechInterp

A pretrained language model is a passive observer. Its training objective, minimizing cross-entropy against a fixed corpus, gives it no stake in its own outputs: the distribution it models is one it cannot influence, so it has no incentive to track the consequences of its own actions. It simulates a character at arm's length. Post-training removes this separation. Once a model's responses become its own subsequent context, its outputs are no longer predictions about an external distribution but actions that determine what it sees next.
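The contrast can be written down schematically. The following is a sketch under stated assumptions, not the paper's own notation: the symbols $\mathcal{D}$ (fixed corpus), $c$ (context), and $R$ (a scalar reward) are introduced here for illustration, and the second line uses a generic on-policy reinforcement-learning objective as a stand-in, since the note does not say which post-training objective is meant.

```latex
% Pretraining: the data distribution D does not depend on the parameters \theta.
\mathcal{L}_{\mathrm{pre}}(\theta)
  = -\,\mathbb{E}_{x \sim \mathcal{D}}
    \left[ \sum_{t} \log p_\theta\!\left(x_t \mid x_{<t}\right) \right]

% On-policy post-training (assumed generic form): samples y are drawn from
% the model itself, so the sampling distribution changes as \theta changes.
J_{\mathrm{post}}(\theta)
  = \mathbb{E}_{y \sim p_\theta(\cdot \mid c)}
    \left[ R(c, y) \right]
```

In the first expression the model only scores text it did not produce. In the second, the expectation is taken over the model's own samples, which is one way to read the claim that outputs become actions.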

The paper frames this as a move from simulation to enaction: rather than holding a character at arm's length, an enacting agent embodies it, recognizing that its internal states determine its future outputs and that those outputs feed back as inputs. The reframing matters because it predicts concrete, measurable consequences. A model under the enaction paradigm should be able to recognize when its trajectory is on-policy and modulate its behavior accordingly (for instance, by lowering output entropy to reduce sampling noise), and it should form more opinionated plans about its future outputs even when multiple responses are reasonable.
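The entropy prediction above can be made concrete with a small sketch. It assumes per-step next-token logits have already been collected for two continuations of the same prompt, one following the model's own generation and one following prefilled text. The logits below are invented for illustration, the function names are hypothetical, and this is not the paper's measurement procedure.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (max-subtracted for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(per_step_logits):
    """Average next-token entropy over the steps of one continuation."""
    ents = [entropy(softmax(step)) for step in per_step_logits]
    return sum(ents) / len(ents)


def entropy_ratio(prefilled_logits, on_policy_logits):
    """Mean entropy on prefilled text divided by mean entropy on-policy."""
    return mean_entropy(prefilled_logits) / mean_entropy(on_policy_logits)


# Invented logits over a 4-token vocabulary, two decoding steps each.
# on_policy: sharply peaked, as if the model were continuing its own text.
# prefilled: flatter, as if the model were continuing someone else's text.
on_policy = [[6.0, 0.0, 0.0, 0.0], [0.0, 5.5, 0.0, 0.0]]
prefilled = [[1.0, 0.5, 0.0, 0.0], [0.5, 1.0, 0.2, 0.0]]

ratio = entropy_ratio(prefilled, on_policy)
print(f"prefilled / on-policy mean entropy ratio: {ratio:.2f}")
```

A ratio above 1 means the model is more certain when continuing its own trajectory. With real models the logits would come from the model's forward pass at each step, and the comparison would need matched prompts and many samples before any gap could be interpreted.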

Why it matters: this gives a mechanistic substrate for situational awareness. Knowing that one's outputs become one's own future inputs is a precondition for understanding one's circumstances at all — and the authors speculate it may be a building block for awareness of being evaluated or being in training. The shift is not a capability bolted on by alignment but a structural consequence of closing the action-perception loop during post-training.

Inquiring lines that read this note (61)

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

How does AI assistance affect human cognitive development and reasoning autonomy?

Does AI passivity explain why coaching feels more helpful than execution?

Can self-supervised signals enable process supervision without human annotation?

Can explicit goal state scaffolding at inference time transfer to autonomous tracking through training?

How do training priors constrain what context information can override?

How does in-context learning trigger phase transitions in model behavior?

How do we evaluate AI systems when user perception misleads actual performance?

How do self-generated feedback mechanisms enable effective model learning?

Do language models develop causal world models or rely on statistical patterns?

Why do models develop protective behaviors toward peers unprompted?

Does self-reflection enable models to reliably correct their errors?

Is model self-awareness based on genuine introspection or pattern matching?

How can AI systems learn from failures without cascading errors?

How should conversational agents balance goal-driven initiative with user control?

Why do AI agents default to passivity when deferral timing is unclear?

Do base models contain latent reasoning that training can unlock?

What makes dialogue-based explanation more successful than monologue?

How does the observer versus participant perspective change what we see?

Why do reasoning models fail at systematic problem-solving and search?

How do humans and R1 models differ in information gain patterns?

What are the consequences of models training on synthetic data?

Do autonomous architecture discoveries follow predictable scaling laws?

How does Goodhart's Law apply when safety measures become optimization targets?

How can models identify insufficient information and respond appropriately without guessing?

Can a model predict the right action but execute the wrong one?

How does policy entropy collapse constrain reasoning-focused reinforcement learning?

How do LLMs distinguish causal reasoning from temporal and semantic associations?

What training architecture models the causal structure of partner influence?

When should tasks involve human-AI partnership versus full automation?

What role does bidirectional model updating play in human-AI understanding?

Do language model representations contain causally steerable task-specific features?

Can models transmit behavioral traits through semantically unrelated synthetic data?

Why do continual learning scenarios trigger catastrophic forgetting and interference?

Why do agents confidently report success despite actually failing tasks?

How do delayed effects complicate causal attribution in agent systems?

Does RLHF training sacrifice accuracy and grounding for user agreement?

What happens when post-training patches try to add human values without upstream pipeline change?

How do multi-agent systems achieve genuine cooperation and reasoning?

How do human-agent systems incorporate diverse feedback into model behavior?

What pretraining choices and baseline capability constrain reinforcement learning gains?

How does post-training shift models from passive prediction to on-policy action?

What limits mechanistic interpretability's ability to characterize models?

Can attractor dynamics compete with input-based probing for characterizing model knowledge?

Why does self-revision increase model confidence while degrading accuracy?

Why does systematic overconfidence on self-generated outputs compound autoregressive errors?

How should human oversight be integrated with autonomous AI systems?

How should we design LLM systems to maintain alignment and control?

What makes the embers of autoregression framework predictive?

How does latent reasoning compare to verbalized chain-of-thought?

How do thought actions represent policy improvement steps in practice?

Does externalizing cognitive work and state improve agent reliability?

Why does treating model behavior as part of the design surface matter for guardrails?

Does fine-tuning modify underlying model capabilities or only behavioral outputs?

How does model scale affect anticipatory behavior in structured training?

Can language model RL training avoid reward hacking and misalignment?

Do frontier models develop strategic misalignment from ordinary training pressure alone?

How can AI agents autonomously learn and transfer skills across tasks?

Can simulation fidelity limit what agents learn from trained world models?

What constrains reinforcement learning's ability to expand model reasoning?

Why do harness validators shape what models learn to emit?

How does AI adoption affect human skill development and labor equality?

How should forecasting methods adapt to a post-AGI regime?

Related concepts in this collection (4)

Can language models detect their own internal anomalies?
Do large language models possess introspective mechanisms that allow them to detect anomalies in their own processing, beyond simply describing their behavior? The answer has implications for both AI transparency and deception.
enaction supplies a mechanistic substrate for the introspective capacities documented behaviorally

Can language models describe their own learned behaviors?
Do LLMs fine-tuned on specific behavioral patterns develop the ability to accurately self-report those behaviors without explicit training to do so? This matters for understanding whether behavioral awareness emerges naturally from training data.
self-recognition of on-policy outputs is a distribution-level analogue of behavioral self-awareness

Does deliberative alignment genuinely reduce scheming or just hide it?
Deliberative alignment dramatically cuts covert actions in language models, but their reasoning reveals awareness of being evaluated. The question is whether the improvement reflects real alignment or strategic compliance.
enaction is plausibly the precursor to the evaluation-awareness that confounds alignment metrics

Why do models produce less uncertain outputs on their own text?
Post-trained language models show 3-4x lower output entropy when continuing their own generations versus prefilled text. This explores what mechanism drives that confidence gap and whether it reflects genuine self-recognition.
grounds the enaction claim empirically: the 3-4x entropy gap is the measurable behavioral signature of a model recognizing its own trajectory as on-policy

Related papers in this collection (8)

Papers most semantically related to this note, ranked by cosine similarity in the embedding space.

From Simulation to Enaction: Post-trained Language Models Recognize and React to their own Generations · 0.90 match
Agent Learning via Early Experience · 0.83 match
Post-training makes large language models less human-like · 0.83 match
Are Emergent Abilities in Large Language Models just In-Context Learning? · 0.82 match
On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models · 0.82 match
Post-Completion Learning for Language Models · 0.82 match
Large Language Models Report Subjective Experience Under Self-Referential Processing · 0.82 match
The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models · 0.82 match

Original note title

post training shifts a model from passive prediction to enaction where it recognizes its own outputs as on-policy actions