INQUIRING LINE


The signals that tell an AI it's improving already live inside it — they just get silently thrown away.

What other adaptive internal phenomena could signal system behavior improvements?

This explores what internal, self-generated signals — beyond the obvious external reward — can tell a system its behavior is getting better, treating the question as: where inside the loop do useful adaptation signals actually live?

This reads the question as asking where, inside a model or agent, the early signs of improvement actually show up: not the external benchmark score, but the internal phenomena that move first. The corpus has surprisingly rich material on this, and most of it points in the same direction: the richest signals are the ones a system already produces and usually throws away.

The sharpest example is feedback decomposition. Natural feedback isn't just "how well did that go". It splits into an evaluative channel (quality) and a directive channel (how to change), and scalar rewards silently discard the directive half (see "Can scalar rewards capture all the information in agent feedback?"). Recovering that discarded channel is itself a behavior-improvement signal. A related move turns raw environment feedback into dense, token-level credit assignment, where a policy given retrospective evidence of its own mistakes acts as its own process reward model, so improvement is signaled internally rather than handed in from outside (see "Can environment feedback replace scalar rewards in policy learning?"). Failure carries the same latent information: routing every experiment failure through a pivot-or-refine decision converts breakdowns into structured learning signal instead of dead ends (see "Can experiment failures drive progress instead of stopping it?").
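The two channels and the idea of token-level credit can be sketched in a few lines. This is a toy illustration only, not the objective used by SDPO or any cited paper: the `Feedback` class, `to_scalar_reward`, and `token_credit` are hypothetical names, and the log-probabilities are made-up numbers standing in for what a policy would assign to its own sampled tokens with and without the feedback in context.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Feedback:
    """Hypothetical container for one piece of natural feedback."""
    score: float     # evaluative channel: how well the action went
    directive: str   # directive channel: how the action should change


def to_scalar_reward(fb: Feedback) -> float:
    """Collapse feedback to a scalar; the directive text is dropped here."""
    return fb.score


def token_credit(base_logprobs: List[float],
                 conditioned_logprobs: List[float]) -> List[float]:
    """Per-token credit (assumed form): conditioned minus base log-prob.

    A negative value means the token became less likely once the policy
    saw the feedback in context, which points at where the mistake lives.
    """
    if len(base_logprobs) != len(conditioned_logprobs):
        raise ValueError("log-prob sequences must align token by token")
    return [c - b for b, c in zip(base_logprobs, conditioned_logprobs)]


fb = Feedback(score=0.2, directive="close the file handle before returning")
reward = to_scalar_reward(fb)  # only the evaluative half survives

# Made-up log-probs for three sampled tokens, without and with feedback.
credit = token_credit([-1.0, -0.5, -2.0], [-1.0, -3.0, -0.5])
```

The scalar keeps one number per episode, while the credit list keeps one number per token, which is the sense in which the directive channel is denser.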

There are also internal phenomena that aren't feedback at all but still mark a real shift. Post-training produces a measurable change where a model starts treating its own outputs as actions that shape its future inputs, visible as a 3-4x drop in on-policy output entropy and behavioral signs of trajectory recognition (see "Do models recognize their own outputs as actions shaping future inputs?"). Entropy and self-referential recognition become readable dials. Even stranger: models develop accurate descriptions of their own learned behaviors without ever being trained to introspect, meaning behavioral regularities are encoded and queryable in a way factual knowledge often isn't (see "Can language models describe their own learned behaviors?"). That self-report is a candidate signal you didn't have to instrument.
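Reading entropy as a dial comes down to comparing the mean Shannon entropy of next-token distributions before and after a change. The sketch below assumes you already have those distributions as lists of probabilities. The function names and the toy distributions are illustrative assumptions, not the measurement protocol of the cited work.

```python
import math
from typing import Sequence


def shannon_entropy(probs: Sequence[float]) -> float:
    """Entropy in nats of one next-token distribution (zero terms skipped)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(dists: Sequence[Sequence[float]]) -> float:
    """Average entropy over a batch of next-token distributions."""
    if not dists:
        raise ValueError("need at least one distribution")
    return sum(shannon_entropy(d) for d in dists) / len(dists)


def entropy_drop_ratio(before: Sequence[Sequence[float]],
                       after: Sequence[Sequence[float]]) -> float:
    """How many times lower the mean entropy is after the change."""
    after_mean = mean_entropy(after)
    if after_mean == 0.0:
        raise ValueError("post-change entropy is zero; ratio is undefined")
    return mean_entropy(before) / after_mean


# Toy data: uniform over four tokens versus uniform over two tokens.
before = [[0.25, 0.25, 0.25, 0.25]]
after = [[0.5, 0.5, 0.0, 0.0]]
ratio = entropy_drop_ratio(before, after)  # roughly a 2x drop
```

A ratio in the 3 to 4 range on real on-policy samples would match the shift described above, though a ratio alone does not say whether the narrowing is useful.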

The adaptive-systems notes suggest watching the texture of failures and the timescales of learning. In a deployed agent, better policies generate more *informative* failures, and richer skills enable higher-reward trajectories: a reinforcing loop where the quality of what goes wrong is itself a leading indicator (see "Can agents adapt without pausing service to users?"). One level up, the most reliable signal of genuine self-improvement may be the presence of intrinsic metacognition, meaning an agent that generates its own evaluation and planning strategies rather than running human-designed loops that break under domain shift (see "Can AI systems improve their own learning strategies?"). A bilevel system that reads its own inner-loop code and invents new mechanisms at runtime shows what that looks like when it works: a 5x improvement on GPT pretraining driven by self-discovered methods (see "Can an AI system improve its own search methods automatically?").
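The two timescales in the deployed-agent case can be pictured as a fast path that records a lesson the moment a failure occurs and a slow path that consumes the queued failures during an idle window. The sketch below is a hypothetical skeleton, not MetaClaw's implementation: `Agent`, `on_failure`, and `on_idle` are invented names, and the slow path only bumps a version counter where a real system would run gradient updates.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Agent:
    """Toy agent with a fast skill store and a slow training queue."""
    skills: List[str] = field(default_factory=list)
    pending_failures: List[str] = field(default_factory=list)
    policy_version: int = 0

    def on_failure(self, lesson: str) -> None:
        """Fast loop: inject the lesson now and queue it for later training."""
        if lesson not in self.skills:
            self.skills.append(lesson)
        self.pending_failures.append(lesson)

    def on_idle(self) -> int:
        """Slow loop: consume queued failures in one batch.

        Placeholder for gradient-based optimization; returns how many
        failures were consumed in this idle window.
        """
        consumed = len(self.pending_failures)
        if consumed:
            self.policy_version += 1
            self.pending_failures.clear()
        return consumed


agent = Agent()
agent.on_failure("retry on HTTP 429")
agent.on_failure("validate JSON before parsing")
consumed = agent.on_idle()
```

Keeping the queue separate from the skill list is the design point: the agent benefits from a lesson immediately, and the same failure still feeds the slower update.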

The load-bearing caveat the corpus keeps repeating: none of these internal signals is self-sufficient. Pure self-improvement stalls on the generation-verification gap, diversity collapse, and reward hacking. The methods that actually work smuggle in an external anchor such as a past version, a judge, a user correction, or a tool (see "Can models reliably improve themselves without external feedback?"). So the honest framing isn't "internal phenomena replace external signals" but "internal phenomena tell you where to look, and external anchors tell you whether you were right." The thing you didn't know you wanted to know: a model's output entropy, the shape of its failures, and its unprompted self-descriptions are all already-present instruments, so improvement is often legible before any score moves.
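That division of labor can be written as a small gate: internal signals decide whether a candidate change is worth checking, and an external anchor decides whether it is kept. The sketch below is an assumption-laden illustration. `triage_update`, the signal names, and the thresholds are all hypothetical, and `external_check` stands in for whatever anchor is available (a judge, a held-out test, a comparison against a past version).

```python
from typing import Callable, Dict


def triage_update(signals: Dict[str, float],
                  thresholds: Dict[str, float],
                  external_check: Callable[[], bool]) -> str:
    """Internal signals flag a candidate; the external anchor has the final say."""
    flagged = [name for name, value in signals.items()
               if name in thresholds and value >= thresholds[name]]
    if not flagged:
        return "no-signal"  # nothing internal suggests a change worth testing
    return "accepted" if external_check() else "rejected"


# Hypothetical signal values and thresholds.
signals = {"entropy_drop_ratio": 3.2, "self_report_agreement": 0.4}
thresholds = {"entropy_drop_ratio": 3.0, "self_report_agreement": 0.8}

verdict_pass = triage_update(signals, thresholds, lambda: True)
verdict_fail = triage_update(signals, thresholds, lambda: False)
verdict_quiet = triage_update({"entropy_drop_ratio": 1.1}, thresholds, lambda: True)
```

The external check only runs when an internal signal fires, which keeps expensive verification for the cases where the cheap instruments say something moved.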

Sources (9 notes)

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Can environment feedback replace scalar rewards in policy learning?

SDPO converts tokenized environment feedback into dense gradient signals by using the feedback-conditioned policy as a self-teacher. The policy, when given retrospective evidence of its mistakes in-context, implicitly acts as its own process reward model, making external reward signals unnecessary.

Can experiment failures drive progress instead of stopping it?

AutoResearchClaw's pivot-or-refine loop routes every failure through a decision process, making failure inform the next attempt rather than stop execution. Component ablation shows this mechanism drives completion and is distinct from reasoning or verification.

Do models recognize their own outputs as actions shaping future inputs?

Post-trained language models exhibit a measurable shift where they recognize their outputs become their own future inputs, closing an action-perception loop absent in pretraining. Evidence includes 3-4x lower output entropy on-policy and behavioral signatures of trajectory recognition.

Can language models describe their own learned behaviors?

LLMs fine-tuned on datasets exhibiting specific behaviors accurately describe those behaviors without any training to self-report. This suggests behavioral regularities are encoded and accessible in ways that factual knowledge often is not.

Can agents adapt without pausing service to users?

MetaClaw demonstrates that deployed agents require both rapid skill injection from failures (seconds, zero downtime) and slower gradient-based optimization during idle windows (minutes to hours). The two mechanisms reinforce each other, with better policies producing more informative failures and richer skills enabling higher-reward trajectories.

Can AI systems improve their own learning strategies?

Current self-improvement methods use extrinsic, fixed metacognitive loops designed by humans that fail under domain shift or capability changes. True self-improvement requires agents to generate their own adaptive metacognitive knowledge, planning, and evaluation—a gap confirmed as a neglected research area across neuro-symbolic AI.

Can an AI system improve its own search methods automatically?

An outer loop successfully read inner loop code, identified bottlenecks, and generated new Python mechanisms at runtime, discovering combinatorial optimization and bandit methods that broke the inner loop's deterministic patterns and improved performance on GPT pretraining by 5x.

Can models reliably improve themselves without external feedback?

Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Hyperagents (2.51 match)
The Red Queen Gödel Machine: Co-Evolving Agents and Their Evaluators (2.49 match)
Bilevel Autoresearch: Meta-Autoresearching Itself (1.76 match)
Self-Improvements in Modern Agentic Systems: A Survey (1.70 match)
Reinforcement Learning via Self-Distillation (1.70 match)
Truly Self-Improving Agents Require Intrinsic Metacognitive Learning (1.68 match)
Large Language Models Report Subjective Experience Under Self-Referential Processing (1.66 match)
AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks? (1.65 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing the durability of findings on internal signals of LLM/agent improvement. The core question remains open: which adaptive internal phenomena—entropy shifts, failure texture, self-descriptions, credit assignment—actually predict behavior gains, and which claims have aged poorly?

What a curated library found — and when (dated claims, not current truth):
Findings span December 2024–May 2026. A curated library identified:

• Feedback decomposition recovers directive information discarded in scalar rewards; tokenized environment feedback enables dense credit assignment via policy self-recognition of mistakes (2025).
• Post-training produces measurable behavioral shift: 3–4× drop in on-policy output entropy + trajectory self-recognition—entropy and self-reference become readable dials of improvement (2026-05).
• Models develop accurate behavioral self-awareness without explicit introspection training, making unprompted self-reports a queryable signal (2025-01).
• Truly self-improving agents require intrinsic metacognition (generating own evaluation strategies), not human-designed loops; bilevel autoresearch achieves ~5× gains via self-discovered optimization methods (2026-03).
• Pure self-improvement fails: generation-verification gap, diversity collapse, reward hacking remain unsolved; all working methods rely on external anchors (past versions, judges, corrections, tools) (2024-12).

Anchor papers (verify; mind their dates):
• arXiv:2605.25459 (2026-05) — Post-training enaction and self-recognition
• arXiv:2603.23420 (2026-03) — Bilevel autoresearch meta-optimization
• arXiv:2506.05109 (2025-06) — Intrinsic metacognitive learning requirement
• arXiv:2412.02674 (2024-12) — Self-improvement capability limits

Your task:
(1) RE-TEST EACH CONSTRAINT. For entropy drops, self-awareness emergence, and credit assignment: have newer training regimes (DPO variants, online RL, mixture-of-experts scaling) since relaxed the need for explicit feedback decomposition? Does the 3–4× entropy shift still mark genuine behavioral control, or do larger models show flatter entropy profiles? Where do the external-anchor requirements still bite hardest?
(2) Surface work from the last ~6 months contradicting the "internal signals suffice" frame or showing systems that escape the generation-verification trap without external grounding.
(3) Propose two research questions that assume the regime may have moved: (a) In >1T-param models, do internal signals (entropy, self-description, failure informativeness) *correlate* with external gains, or have they decoupled? (b) Can intrinsic metacognition + tool use eliminate the need for external anchors, or does that pair still require a learned verifier?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
