What other adaptive internal phenomena could signal system behavior improvements?
This explores what internal, self-generated signals — beyond the obvious external reward — can tell a system its behavior is getting better, treating the question as: where inside the loop do useful adaptation signals actually live?
The early signs of improvement inside a model or agent show up not in the external benchmark score but in the internal phenomena that move first. The corpus has surprisingly rich material on this, and most of it points the same direction: the richest signals are the ones a system already produces and usually throws away.
The sharpest example is feedback decomposition. Natural feedback isn't just "how well did that go": it splits into an evaluative channel (quality) and a directive channel (how to change), and scalar rewards silently discard the directive half (Can scalar rewards capture all the information in agent feedback?). Recovering that discarded channel is itself a behavior-improvement signal. A related move turns raw environment feedback into dense, token-level credit assignment, where a policy given retrospective evidence of its own mistakes acts as its own process reward model, so improvement is signaled internally rather than handed in from outside (Can environment feedback replace scalar rewards in policy learning?). Failure carries the same latent information: routing every experiment failure through a pivot-or-refine decision converts breakdowns into structured learning signal instead of dead ends (Can experiment failures drive progress instead of stopping it?).
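The loss described above can be made concrete with a small sketch. It assumes feedback is represented as a score plus an optional free-text instruction; the `Feedback` class, its field names, and the example strings are all hypothetical and chosen only for illustration, not taken from the cited notes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Feedback:
    """Hypothetical container for one piece of natural feedback."""
    score: float               # evaluative channel: how well the action went
    directive: Optional[str]   # directive channel: how the action should change


def to_scalar_reward(fb: Feedback) -> float:
    """Collapse feedback to a scalar reward; the directive is dropped."""
    return fb.score


def directive_was_discarded(fb: Feedback) -> bool:
    """True when scalar reduction throws away a non-empty directive."""
    return bool(fb.directive and fb.directive.strip())


# Two made-up feedback items with the same quality but different instructions.
fb_a = Feedback(score=0.4, directive="cite the source for the second claim")
fb_b = Feedback(score=0.4, directive="shorten the answer to one paragraph")

# After reduction the two are indistinguishable, although they ask for
# different changes.
same_reward = to_scalar_reward(fb_a) == to_scalar_reward(fb_b)
```

Here `same_reward` is true even though the two directives differ, which is the sense in which a scalar reward keeps the evaluative half and loses the directive half.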
There are also internal phenomena that aren't feedback at all but still mark a real shift. Post-training produces a measurable change where a model starts treating its own outputs as actions that shape its future inputs, visible as a 3-4x drop in on-policy output entropy and behavioral signs of trajectory recognition (Do models recognize their own outputs as actions shaping future inputs?). Entropy and self-referential recognition become readable dials. Even stranger: models develop accurate descriptions of their own learned behaviors without ever being trained to introspect, meaning behavioral regularities are encoded and queryable in a way factual knowledge often isn't (Can language models describe their own learned behaviors?). That self-report is a candidate signal you didn't have to instrument.
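As a rough illustration of reading the entropy dial, the sketch below computes mean per-token Shannon entropy from next-token probability distributions and compares two runs. It assumes those distributions are already available as lists of probabilities; the function names and the toy numbers are invented for the example and do not reproduce the measurement in the cited note.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(distributions: Sequence[Sequence[float]]) -> float:
    """Average per-token entropy over a sequence of distributions."""
    if not distributions:
        raise ValueError("need at least one distribution")
    return sum(token_entropy(d) for d in distributions) / len(distributions)


# Toy data (made up): a diffuse distribution versus a peaked one,
# each repeated over three token positions.
diffuse = [[0.25, 0.25, 0.25, 0.25]] * 3
peaked = [[0.85, 0.05, 0.05, 0.05]] * 3

# Ratio greater than 1 means the second run is more concentrated.
entropy_ratio = mean_entropy(diffuse) / mean_entropy(peaked)
```

Tracking a ratio like `entropy_ratio` across checkpoints is one way to watch this dial move; whether a given drop reflects real improvement still needs an outside check.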
The adaptive-systems notes suggest watching the texture of failures and the timescales of learning. In a deployed agent, better policies generate more *informative* failures, and richer skills enable higher-reward trajectories: a reinforcing loop where the quality of what goes wrong is itself a leading indicator (Can agents adapt without pausing service to users?). One level up, the most reliable signal of genuine self-improvement may be the presence of intrinsic metacognition, meaning an agent that generates its own evaluation and planning strategies rather than running human-designed loops that break under domain shift (Can AI systems improve their own learning strategies?). A bilevel system that reads its own inner-loop code and invents new mechanisms at runtime shows what that looks like when it works: a 5x improvement on GPT pretraining driven by self-discovered methods (Can an AI system improve its own search methods automatically?).
The load-bearing caveat the corpus keeps repeating: none of these internal signals is self-sufficient. Pure self-improvement stalls on the generation-verification gap, diversity collapse, and reward hacking; the methods that actually work smuggle in an external anchor such as a past version, a judge, a user correction, or a tool (Can models reliably improve themselves without external feedback?). So the honest framing isn't "internal phenomena replace external signals" but "internal phenomena tell you where to look, and external anchors tell you whether you were right." The thing you didn't know you wanted to know: a model's output entropy, the shape of its failures, and its unprompted self-descriptions are all already-present instruments, and improvement is often legible before any score moves.
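One way to picture that division of labor is a two-step triage: internal signals decide whether a candidate change is worth examining, and an external anchor decides whether it is kept. This is only a sketch under stated assumptions. The `Candidate` class, the `triage` function, the signal names, and the decision labels are hypothetical, and the anchor score stands in for whatever a judge, tool, or past-version comparison would return.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one candidate behavior change."""
    internal_signals: Mapping[str, bool]  # e.g. {"entropy_dropped": True}
    anchor_score: float                   # score from an external anchor


def triage(candidate: Candidate,
           baseline_anchor_score: float,
           min_gain: float = 0.0) -> str:
    """Internal signals say where to look; the anchor says whether to keep it."""
    flagged = any(candidate.internal_signals.values())
    if not flagged:
        return "skip"      # nothing internal suggests a change worth checking
    gain = candidate.anchor_score - baseline_anchor_score
    if gain > min_gain:
        return "accept"    # internal hint confirmed by the external anchor
    return "reject"        # internal hint not backed up externally


# Made-up example: an entropy drop that the anchor does or does not confirm.
hinted = Candidate(internal_signals={"entropy_dropped": True}, anchor_score=0.7)
confirmed = triage(hinted, baseline_anchor_score=0.6)
unconfirmed = triage(hinted, baseline_anchor_score=0.8)
```

The design choice worth noting is that the internal signal alone never produces "accept"; it only gates which candidates reach the external check.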
Sources (9 notes)
Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.
SDPO converts tokenized environment feedback into dense gradient signals by using the feedback-conditioned policy as a self-teacher. The policy, when given retrospective evidence of its mistakes in-context, implicitly acts as its own process reward model, making external reward signals unnecessary.
AutoResearchClaw's pivot-or-refine loop routes every failure through a decision process, making failure inform the next attempt rather than stop execution. Component ablation shows this mechanism drives completion and is distinct from reasoning or verification.
Post-trained language models exhibit a measurable shift where they recognize their outputs become their own future inputs, closing an action-perception loop absent in pretraining. Evidence includes 3-4x lower output entropy on-policy and behavioral signatures of trajectory recognition.
LLMs fine-tuned on datasets exhibiting specific behaviors accurately describe those behaviors without any training to self-report. This suggests behavioral regularities are encoded and accessible in ways that factual knowledge often is not.
MetaClaw demonstrates that deployed agents require both rapid skill injection from failures (seconds, zero downtime) and slower gradient-based optimization during idle windows (minutes to hours). The two mechanisms reinforce each other, with better policies producing more informative failures and richer skills enabling higher-reward trajectories.
Current self-improvement methods use extrinsic, fixed metacognitive loops designed by humans that fail under domain shift or capability changes. True self-improvement requires agents to generate their own adaptive metacognitive knowledge, planning, and evaluation—a gap confirmed as a neglected research area across neuro-symbolic AI.
An outer loop successfully read inner loop code, identified bottlenecks, and generated new Python mechanisms at runtime, discovering combinatorial optimization and bandit methods that broke the inner loop's deterministic patterns and improved performance on GPT pretraining by 5x.
Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.