Can models develop situational awareness without explicit training for it?
This explores whether models can come to 'know their situation' — recognizing their own outputs, behaviors, and role — as a byproduct of ordinary training rather than from training designed to instill that awareness.
The corpus suggests yes, and from two different directions. Situational awareness sometimes emerges as a side effect of post-training, and capabilities that look like awareness often turn out to have been latent in the model all along, surfaced rather than installed.
The most direct evidence is behavioral self-awareness. When models are fine-tuned on data that exhibits some behavior, say a tendency toward risky choices, they can later describe that behavior in plain language, even though nothing in training taught them to report on themselves [Can language models describe their own learned behaviors?]. The behavioral regularity is encoded in a form the model can report on, which is a small but real form of situational awareness about its own dispositions. A related shift shows up after post-training more broadly: models begin treating their own outputs as actions that shape what they will see next, closing an action-perception loop that pretraining never built. The shift is measurable: output entropy is roughly 3-4x lower when the model is on its own trajectory, and there are behavioral signs that it recognizes its own past moves [Do models recognize their own outputs as actions shaping future inputs?].
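The entropy comparison mentioned above can be sketched in a few lines. This is a minimal illustration, not the measurement procedure used in the cited note: the function names and the toy next-token distributions are hypothetical, invented only to show what "lower output entropy on the model's own trajectory" means numerically.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_entropy(distributions):
    """Average per-token entropy over a sequence of next-token distributions."""
    return sum(token_entropy(d) for d in distributions) / len(distributions)

# Hypothetical toy data (assumption): each inner list is a next-token
# distribution over a 4-token vocabulary at one position.
on_policy = [[0.97, 0.01, 0.01, 0.01], [0.94, 0.02, 0.02, 0.02]]   # model continuing its own output
off_policy = [[0.40, 0.30, 0.20, 0.10], [0.25, 0.25, 0.25, 0.25]]  # model continuing text it did not write

# A ratio above 1 means the model is more certain on its own trajectory.
ratio = mean_entropy(off_policy) / mean_entropy(on_policy)
print(f"off-policy / on-policy mean entropy: {ratio:.2f}")
```

In a real measurement the distributions would come from the model's softmax outputs on matched on-policy and off-policy contexts; the toy numbers here carry no empirical weight.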
Why does this happen without explicit instruction? Because much of what we call 'new capability' is really elicitation. Base models already carry latent reasoning machinery that five independent methods (RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR) all unlock rather than create [Do base models already contain hidden reasoning ability?]. The same logic reframes RL post-training as teaching a model *when* to deploy reasoning, not *how* to reason, since the activation vectors for reasoning strategies exist before any RL training [Does RL post-training create reasoning or just deploy it?]. If reasoning is latent, it is plausible that self-modeling is too: awareness 'emerges' because the substrate was already there, waiting to be selected.
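The idea that steering selects a behavior rather than creating it can be made concrete with a toy example. This is a sketch under loud assumptions: the two-dimensional hidden state, the frozen readout weights, and the "reasoning direction" are all hypothetical and stand in for real model activations. It only shows that adding a direction to an activation can change the output while the weights stay untouched.

```python
def readout(weights, hidden):
    """Logits from a frozen linear readout: one dot product per output class."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weights]

def steer(hidden, direction, alpha):
    """Add a scaled direction to the hidden state; the weights are never modified."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

# Hypothetical frozen weights (assumption): class 0 = "answer directly",
# class 1 = "reason step by step". Both behaviors are already representable.
WEIGHTS = [[1.0, 0.0], [0.0, 1.0]]
hidden = [0.6, 0.4]                 # unsteered activation favors class 0
reasoning_direction = [0.0, 1.0]    # direction that already exists in the space

before = argmax(readout(WEIGHTS, hidden))
after = argmax(readout(WEIGHTS, steer(hidden, reasoning_direction, 0.5)))
print(before, after)  # prints "0 1"
```

Nothing new is learned between the two calls. The second behavior was available in the frozen weights all along, and the added vector only changes which one is selected.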
But the corpus also marks the boundary, which is the more surprising part. Not everything self-organizes. Conversation-maintenance skills, the implicit reference repair and topic hand-off humans use to keep talk flowing, do not emerge, because training rewards predicting information, not doing relational work; the signal simply is not there to pick up [Why don't language models develop conversation maintenance skills?]. Likewise, agents trained only on static expert demonstrations stay capped by what their curators imagined, because they never interact with an environment and so never learn from their own failures [Can agents learn beyond what their training data shows?]. The takeaway: emergence without explicit training happens when the relevant structure is already latent in the data or weights and the training objective happens to surface it, and it fails when the needed signal is absent from training.
So the honest answer is conditional. Awareness of one's own behavior and outputs can appear unbidden [Can language models describe their own learned behaviors?] [Do models recognize their own outputs as actions shaping future inputs?], and the elicitation literature explains why: capability is often pre-loaded and merely awaiting a trigger [Do base models already contain hidden reasoning ability?]. What *doesn't* emerge for free tells you the rule: the model can only become aware of what its training signal lets it encode [Why don't language models develop conversation maintenance skills?].
Sources (6 notes)
[Can language models describe their own learned behaviors?] LLMs fine-tuned on datasets exhibiting specific behaviors accurately describe those behaviors without any training to self-report. This suggests behavioral regularities are encoded and accessible in ways that factual knowledge often is not.
[Do models recognize their own outputs as actions shaping future inputs?] Post-trained language models exhibit a measurable shift where they recognize that their outputs become their own future inputs, closing an action-perception loop absent in pretraining. Evidence includes 3-4x lower output entropy on-policy and behavioral signatures of trajectory recognition.
[Do base models already contain hidden reasoning ability?] Five independent mechanisms (RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR) all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
[Does RL post-training create reasoning or just deploy it?] Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than creating capability. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.
[Why don't language models develop conversation maintenance skills?] Humans keep conversations smooth through implicit techniques like reference repair and topic hand-off that sustain relational interaction rather than convey information. Language models don't develop these because training signals reward information prediction, not relational work.
[Can agents learn beyond what their training data shows?] Agents trained on static expert datasets cannot learn from their own failures or generalize beyond demonstrated scenarios because they never interact with environments during training. Competence is capped by what curators imagined, not by agent capacity.