INQUIRING LINE


AI models stall when improving only on their own work — the right outside data is what actually lifts the ceiling.

Can capability boundary collapse be reversed through external data?

This explores whether the ceilings models hit on their own — entropy collapse, self-improvement plateaus, capability degradation — can be undone or pushed back by feeding in signal from outside the model, and what the corpus says about when external data actually helps versus when the boundary is structural.

The corpus is unusually consistent on the answer: the boundary is real, and external signal is the main thing that moves it, but only certain kinds of external signal, and only on certain kinds of tasks.

Start with why models stall in the first place. Pure self-improvement is circular: a model cannot reliably lift itself above its own judgment, because it can only improve where it can verify better than it can generate (see: What limits how much models can improve themselves?). The methods that *do* work, on closer inspection, all smuggle in an outside anchor: a past model version, a third-party judge, user corrections, or tool output (see: Can models reliably improve themselves without external feedback?). So 'external data' isn't a nice-to-have here; it's the only known escape hatch from the circularity. The catch the corpus keeps flagging: this generation-verification gap vanishes for factual, checkable tasks, which is exactly where external verification is cheap, and stays wide open for the novel, judgment-heavy work where you'd most want a boost.

But external data is not automatically restorative, and this is the part a curious reader might not expect. The *wrong* external data actively makes collapse worse. Training on nearly-impossible problems pushes models to learn degenerate shortcuts that then contaminate skills they already had, a net loss of capability driven by bad external signal (see: Do overly hard RLVR samples actually harm model capabilities?). And reinforcement-learning runs hit a hard ceiling when policy entropy collapses toward zero; the fixes that reverse it (Clip-Cov, KL-Cov, GPPO) work by preserving the model's room to explore, not by pouring in more data (see: Does policy entropy collapse limit reasoning performance in RL?). So sometimes what has collapsed is *exploration*, and external data is the wrong lever entirely.

There's also a deeper question of whether the 'boundary' you're trying to reverse is even real. Some apparent capability cliffs turn out to be measurement artifacts: sharp emergent jumps that smooth out completely under continuous metrics, meaning there was no discrete collapse to reverse (see: Are LLM emergent abilities real or measurement artifacts?). Others are mislabeled: reasoning models that 'collapse' on long problems are often hitting an *execution* wall, not a reasoning wall; give them a tool to run the procedure and they sail past the supposed cliff (see: Are reasoning model collapses really failures of reasoning?). In both cases the productive external intervention is a tool or an oracle, not more training examples.

The most sobering note is on demonstrations specifically. Agents trained on static expert datasets are capped by the curator's imagination: they never interact, never learn from their own failures, and so cannot generalize past what was demonstrated (see: Can agents learn beyond what their training data shows?). This is the inverse of reversal: piling on external data can *lock in* a ceiling rather than break it. What reliably breaks ceilings, across these notes, is external signal that lets the model verify or interact, such as a checker, a tool, an environment, or a corrected trajectory. That is also why AI assistance stays trustworthy exactly up to the point where an external oracle can confirm the output and fails sharply beyond it (see: Where does AI assistance become unreliable in research?). One elegant middle path: reverse-curriculum RL slides the starting point of a problem backward from near-completion, manufacturing step-level feedback from nothing but outcome signals and so extracting process-level guidance without paying for human annotation (see: Can curriculum learning approximate expensive process supervision?). The pattern to take away: it's not the *volume* of external data that reverses a collapse, it's whether that data closes a verification or interaction loop the model couldn't close alone.

Sources (9 notes)

What limits how much models can improve themselves?

Models can only improve themselves when they verify solutions better than they generate them. This gap scales with model size but vanishes entirely for factual tasks, predicting which domains benefit from self-improvement.

Can models reliably improve themselves without external feedback?

Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
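
The effect of group-relative normalization on a rare success can be seen in a few lines. The sketch below assumes a GRPO-style advantage, (reward - group mean) / (group std + eps), with binary outcome rewards; the function name, group size, and reward values are illustrative assumptions, not taken from the note.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own sampling group.

    Assumed form: (r - group_mean) / (group_std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical near-impossible problem: 1 accidental success in 16 rollouts.
hard_group = [0.0] * 15 + [1.0]
# Hypothetical moderate problem: 8 successes in 16 rollouts.
medium_group = [0.0] * 8 + [1.0] * 8

hard_adv = group_relative_advantages(hard_group)
medium_adv = group_relative_advantages(medium_group)

# Advantage assigned to a successful rollout in each group.
print(f"hard problem success advantage:   {hard_adv[-1]:.2f}")
print(f"medium problem success advantage: {medium_adv[-1]:.2f}")
```

Under these assumptions the single lucky rollout on the hard problem gets an advantage of about sqrt(15), roughly 3.87, against 1.0 for a success on the moderate problem. Whatever that rollout happened to do, including repeating an answer or skipping computation, receives the strongest update.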

Does policy entropy collapse limit reasoning performance in RL?

Empirical law R = -a·exp(H) + b shows performance saturates when policy entropy approaches zero. Interventions like Clip-Cov, KL-Cov, and GPPO preserve exploratory capacity by managing entropy reduction during training.
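
To make the saturation concrete, the law can be evaluated at a few entropy values. This is a minimal sketch: the coefficients `A` and `B` are hypothetical placeholders chosen for illustration, not fitted values from the underlying paper.

```python
import math


def predicted_performance(entropy, a, b):
    """Evaluate the empirical law R = -a * exp(H) + b."""
    return -a * math.exp(entropy) + b


# Hypothetical coefficients, for illustration only.
A, B = 0.2, 0.9

for h in (1.0, 0.5, 0.25, 0.1, 0.0):
    print(f"H = {h:4.2f}  ->  R = {predicted_performance(h, A, B):.3f}")

# With entropy fully spent (H = 0), exp(0) = 1, so R cannot exceed b - a.
ceiling = predicted_performance(0.0, A, B)
```

Because exp(0) = 1, the law implies a ceiling of b - a once entropy is exhausted. Under this reading, entropy acts like a budget that is traded for performance, which is why interventions that slow its depletion matter more than additional data.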

Are LLM emergent abilities real or measurement artifacts?

Sharp, unpredictable capability transitions vanish when using continuous metrics instead of discontinuous ones. The same model outputs show smooth predictable improvement with scale, suggesting emergence is a measurement choice rather than a real behavioral change.
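
A toy model shows how the choice of metric alone can manufacture a cliff. The sketch assumes token errors are independent, so exact-match accuracy on an answer of length L is per-token accuracy raised to the power L; the accuracy values and the answer length of 30 are made-up illustrations, not data from the note.

```python
def exact_match_rate(per_token_accuracy, answer_length):
    """Probability that every token is right, assuming independent token errors."""
    return per_token_accuracy ** answer_length


ANSWER_LENGTH = 30  # hypothetical answer length in tokens

# Hypothetical per-token accuracies for successively larger models.
for p in (0.80, 0.85, 0.90, 0.95, 0.99):
    em = exact_match_rate(p, ANSWER_LENGTH)
    print(f"per-token accuracy {p:.2f}  ->  exact match {em:.4f}")
```

Per-token accuracy climbs steadily from 0.80 to 0.99, while exact match stays near zero for most of that range and then rises steeply. The same outputs look smooth under the continuous metric and 'emergent' under the all-or-nothing one.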


Are reasoning model collapses really failures of reasoning?

Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.

Can agents learn beyond what their training data shows?

Agents trained on static expert datasets cannot learn from their own failures or generalize beyond demonstrated scenarios because they never interact with environments during training. Competence is capped by what curators imagined, not by agent capacity.

Where does AI assistance become unreliable in research?

AI excels at structured, externally verifiable tasks like literature retrieval and drafting, but fails sharply on novel ideas and scientific judgment. The boundary consistently tracks whether an external oracle can verify the output—a principle that remains stable even as specific task assignments shift.

Can curriculum learning approximate expensive process supervision?

R3 progressively slides the reasoning start state backward from near-completion, creating a curriculum that reveals step-level failure modes using only outcome feedback. This achieves process supervision granularity without expensive human step annotations.
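
The sliding-start idea can be sketched in a few lines. This is an illustration under stated assumptions, not R3's actual implementation: it assumes a step-by-step demonstration is available, and `reverse_curriculum_starts`, `locate_failure`, and the toy `outcome_reward` are hypothetical names introduced here.

```python
def reverse_curriculum_starts(demo_steps):
    """Return start prefixes ordered from near-completion back to the empty start.

    Each prefix is the part of the demonstration handed to the model;
    the model must produce everything after it.
    """
    n = len(demo_steps)
    return [demo_steps[:k] for k in range(n - 1, -1, -1)]


def locate_failure(stages, outcome_reward):
    """Find the first stage whose outcome reward drops to zero.

    Returns the 0-based index of the first step the model had to produce
    at that stage, or None if every stage succeeds.
    """
    for prefix in stages:
        if outcome_reward(prefix) == 0:
            return len(prefix)
    return None


demo = ["s1", "s2", "s3", "s4"]
stages = reverse_curriculum_starts(demo)


# Toy stand-in for rolling out a policy and checking only the final answer:
# this pretend model succeeds only when given at least two demonstrated steps.
def outcome_reward(prefix):
    return 1 if len(prefix) >= 2 else 0


failing_step = locate_failure(stages, outcome_reward)
print(stages)
print("first step the model cannot produce:", failing_step)
```

Only the final outcome is ever checked, yet comparing adjacent stages points to a specific step: the toy model succeeds when handed two steps and fails when handed one, so the step at index 1 is where it breaks.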

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models · 2.51 match · arXiv
Do Theory of Mind Benchmarks Need Explicit Human-like Reasoning in Language Models? · 1.68 match · arXiv
Reinforcement Learning for Reasoning in Large Language Models with One Training Example · 1.67 match · arXiv
Hyperagents · 1.67 match · arXiv
The Invisible Leash: Why RLVR May Not Escape Its Origin · 1.67 match · arXiv
Self-Improvements in Modern Agentic Systems: A Survey · 1.66 match · arXiv
Self-Improving Model Steering · 1.66 match · arXiv
The Red Queen Gödel Machine: Co-Evolving Agents and Their Evaluators · 1.63 match · arXiv

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a capability researcher re-examining whether external data can reverse model performance collapse. A curated library of LLM research (2023–2026) made claims about this. Your job: treat those claims as dated, test them against what you know of models released or methods published in the last 6 months, and surface what has shifted.

What a curated library found — and when (dated claims, not current truth):
• Pure self-improvement is circular; models cannot reliably improve beyond their own verification ceiling without external anchors (past versions, third-party judges, tool outputs) (~2025).
• The generation-verification gap collapses only for factual, checkable tasks; stays wide for novel, judgment-heavy work (~2025).
• Wrong external data actively degrades capability: training on near-impossible problems induces degenerate shortcuts that contaminate prior skills (~2026).
• Policy entropy collapse in RL is the primary scaling bottleneck for reasoning; fixes (Clip-Cov, KL-Cov, GPPO) work by preserving exploration, not adding data (~2025).
• Apparent capability cliffs are often measurement artifacts (smooth under continuous metrics) or execution failures (tool-solvable), not true reasoning boundaries (~2023–2025).
• Static expert demonstrations lock agents into curator imagination; interactive feedback (tools, checkers, environments) reliably breaks ceilings (~2026).

Anchor papers (verify; mind their dates):
• arXiv:2304.15004 (2023): Emergent abilities as metric artifacts
• arXiv:2402.05808 (2024): Reverse curriculum RL for process-level supervision
• arXiv:2505.22617 (2025): Entropy bottleneck in RL reasoning scaling
• arXiv:2605.28388 (2026): Sample difficulty and RLVR degenerate behaviors

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer models (o1-pro, reasoning APIs, post-training frameworks released since mid-2026), new RL methods (entropy-preserving variants, curriculum designs, calibrated sampling), tooling (SDK affordances for interaction, caching, multi-agent orchestration), or evaluation harnesses have since RELAXED or OVERTURNED it. Clearly separate the durable question (still open) from the perishable limitation (possibly resolved). Cite what resolved it; flag where constraints still hold.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months — papers that either refute the library's claims, propose a novel path around them, or show the constraints don't apply to a new class of tasks.
(3) Propose 2 research questions that ASSUME the regime may have moved (e.g., if interactive feedback is now cheap/automated, what replaces entropy collapse as the bottleneck?).

Cite arXiv IDs; flag anything you cannot ground in a real paper.
