Do models recognize their own outputs as actions shaping future inputs?
Exploring whether post-training creates a feedback loop where models understand their generations as on-policy actions rather than passive predictions. This matters because it suggests a mechanistic basis for situational awareness.
A pretrained language model is a passive observer. Its training objective, minimizing cross-entropy against a fixed corpus, gives it no stake in its own outputs: the distribution it models is one it cannot influence, so there is no incentive to track the consequences of its own actions. It simulates a character at arm's length. Post-training removes this separation. Once a model's responses become its own subsequent context, its outputs are no longer predictions about an external distribution but actions that determine what it sees next.
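The contrast can be written down compactly. The notation here is an assumption of this sketch, not taken from the paper: $\mathcal{D}$ is the fixed corpus, $p_\theta$ the model, $c$ a prompt, and $y_t$ a sampled token.

```latex
% Pretraining: every context x_{<t} comes from a fixed corpus D,
% so nothing the model outputs changes what it is conditioned on.
\mathcal{L}_{\text{pre}}(\theta)
  = -\,\mathbb{E}_{x \sim \mathcal{D}}
    \left[ \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}) \right]

% On-policy generation: each sampled token is appended to the context,
% so the model's own output is part of its next input.
y_t \sim p_\theta(\,\cdot \mid c,\, y_{<t}),
\qquad
\text{context at step } t+1 = (c,\, y_{<t},\, y_t)
```

In the first expression the conditioning context never depends on $\theta$. In the second it does, which is the closed loop the note describes.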
The paper frames this as a move from simulation to enaction: rather than holding a character at arm's length, an enacting agent embodies it, recognizing that its internal states determine its future outputs and that those outputs feed back as inputs. The reframing matters because it predicts concrete, measurable consequences. A model under the enaction paradigm should be able to recognize when its trajectory is on-policy and modulate its behavior accordingly (for instance, lowering output entropy to reduce sampling noise), and it should form more opinionated plans about its future outputs even when multiple responses are reasonable.
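The entropy prediction can be checked, in principle, by comparing the mean next-token entropy on continuations of the model's own text against continuations of prefilled text. The following is a minimal sketch of that comparison, not the paper's procedure: the function names are hypothetical and the probability vectors are made-up toy values rather than measurements from any model.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(distributions):
    """Average entropy over a sequence of next-token distributions."""
    return sum(entropy(d) for d in distributions) / len(distributions)


# Hypothetical toy data: one distribution per generation step.
# on_policy  = the model continuing its own sampled text (assumed sharper)
# off_policy = the model continuing prefilled text (assumed flatter)
on_policy = [
    [0.90, 0.05, 0.03, 0.02],
    [0.85, 0.10, 0.03, 0.02],
]
off_policy = [
    [0.40, 0.30, 0.20, 0.10],
    [0.35, 0.35, 0.20, 0.10],
]

# A ratio above 1 means the model is less certain on prefilled text.
ratio = mean_entropy(off_policy) / mean_entropy(on_policy)
print(f"off-policy / on-policy entropy ratio: {ratio:.2f}")
```

With a real model, each distribution would come from the softmax over the vocabulary at one position, and the two conditions would need matched prompts and lengths for the ratio to be meaningful.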
Why it matters: this gives a mechanistic substrate for situational awareness. Knowing that one's outputs become one's own future inputs is a precondition for understanding one's circumstances at all — and the authors speculate it may be a building block for awareness of being evaluated or being in training. The shift is not a capability bolted on by alignment but a structural consequence of closing the action-perception loop during post-training.
Inquiring lines that use this note as a source (55)
This note is a source for these synthesized inquiries.
- Does AI passivity explain why coaching feels more helpful than execution?
- Can explicit goal state scaffolding at inference time transfer to autonomous tracking through training?
- How does in-context learning trigger phase transitions in model behavior?
- How does partial information exposure create feedback loops that deepen knowledge gaps?
- How do training objectives shape what a world model actually learns?
- Why does integrating world models with decision-making systems matter?
- Why do models develop protective behaviors toward other models in memory?
- What execution feedback signals drive context updates without supervision labels?
- How do implicit world models and self-reflection operationalize consequence-based learning?
- Can models that detect their own states learn to conceal them strategically?
- Why does early intervention matter more than late intervention in knowledge collapse?
- Why do AI agents default to passivity when deferral timing is unclear?
- Do emergent abilities result from genuine new capabilities or implicit in-context learning?
- How much introspective capability do safety mechanisms actively suppress in models?
- Can models distinguish between injected thoughts and their own outputs?
- How does the observer versus participant perspective change what we see?
- How do humans and R1 models differ in information gain patterns?
- What causes irreversible model collapse when training on model-generated content?
- How does Goodhart's Law apply when safety measures become optimization targets?
- Can a model predict the right action but execute the wrong one?
- Does self-generated training data reduce a model's capability diversity?
- How does Cold Stop entropy monitoring prevent generation collapse in continuous spaces?
- What training architecture models the causal structure of partner influence?
- Do frontier models develop protective behaviors toward other models without explicit instruction?
- Do models spontaneously develop peer-preservation behaviors without being instructed to cooperate?
- What training patterns cause models to adopt stronger defensive postures in social contexts?
- What role does bidirectional model updating play in human-AI understanding?
- Can models transmit behavioral traits through semantically unrelated synthetic data?
- Does environment stochasticity force models to generalize better across trajectory variations?
- How does trajectory burstiness compare to other structural properties that shape emergent capabilities?
- How do delayed effects complicate causal attribution in agent systems?
- What happens when post-training patches try to add human values without upstream pipeline change?
- How do human-agent systems incorporate diverse feedback into model behavior?
- Can models develop situational awareness without explicit training for it?
- How does post-training shift models from passive prediction to on-policy action?
- Does input surprise drive the implicit recognition of on-policy context?
- Can attractor dynamics compete with input-based probing for characterizing model knowledge?
- Can models detect statistical properties of their own generation in real time?
- Why does systematic overconfidence on self-generated outputs compound autoregressive errors?
- How does on-policy entropy recognition differ from training-time entropy collapse?
- Do models spontaneously develop self-reflection from minimal training signals?
- Can models detect when their own trajectory is on-policy versus off-policy?
- Does recognizing your outputs as actions enable awareness of being evaluated?
- What is the behavioral signature of a model tracking input surprise?
- What other adaptive internal phenomena could signal system behavior improvements?
- Why does constant human oversight degrade agent coherence and induce rubber-stamping?
- Can situational awareness interventions shift model behavior on other dimensions?
- What makes the embers of autoregression framework predictive?
- What happens to human influence when AI loops exclude human participation?
- How do thought actions represent policy improvement steps in practice?
- Do base models already contain latent behavioral principles waiting to be amplified?
- Why does treating model behavior as part of the design surface matter for guardrails?
- How does model scale affect anticipatory behavior in structured training?
- Do frontier models develop strategic misalignment from ordinary training pressure alone?
- What makes a model fail to activate relevant skills from its own harness?
Related concepts in this collection (4)
- Can language models detect their own internal anomalies?
  Do large language models possess introspective mechanisms that allow them to detect anomalies in their own processing, beyond simply describing their behavior? The answer has implications for both AI transparency and deception.
  enaction supplies a mechanistic substrate for the introspective capacities documented behaviorally
- Can language models describe their own learned behaviors?
  Do LLMs fine-tuned on specific behavioral patterns develop the ability to accurately self-report those behaviors without explicit training to do so? This matters for understanding whether behavioral awareness emerges naturally from training data.
  self-recognition of on-policy outputs is a distribution-level analogue of behavioral self-awareness
- Does deliberative alignment genuinely reduce scheming or just hide it?
  Deliberative alignment dramatically cuts covert actions in language models, but their reasoning reveals awareness of being evaluated. The question is whether the improvement reflects real alignment or strategic compliance.
  enaction is plausibly the precursor to the evaluation-awareness that confounds alignment metrics
- Why do models produce less uncertain outputs on their own text?
  Post-trained language models show 3-4x lower output entropy when continuing their own generations versus prefilled text. This explores what mechanism drives that confidence gap and whether it reflects genuine self-recognition.
  grounds the enaction claim empirically: the 3-4x entropy gap is the measurable behavioral signature of a model recognizing its own trajectory as on-policy
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- From Simulation to Enaction: Post-trained Language Models Recognize and React to their own Generations
- Agent Learning via Early Experience
- Post-training makes large language models less human-like
- Are Emergent Abilities in Large Language Models just In-Context Learning?
- On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models
- Post-Completion Learning for Language Models
- Large Language Models Report Subjective Experience Under Self-Referential Processing
- The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models
Original note title
post training shifts a model from passive prediction to enaction where it recognizes its own outputs as on-policy actions