INQUIRING LINE

When do aggregated imperfect demonstrations fail to outperform the best expert?

This explores the flip side of a hopeful result — that pooling many flawed experts can beat any one of them — and asks when that pooling breaks down and the crowd no longer beats its best member.


This explores the conditions under which aggregating many imperfect demonstrations *fails* to beat the single best expert: the failure boundary of a result that's usually celebrated. The optimistic case is real. "Can models trained on many imperfect experts outperform each one?" shows that a model trained on many diverse experts implicitly votes toward consensus, and low-temperature sampling surfaces that vote, beating any individual by canceling out *uncorrelated* errors. The whole mechanism rests on that one word. When experts err independently, their mistakes wash out. When they share a bias (same blind spot, same shortcut, same missing concept), averaging amplifies the error instead of denoising it, and the aggregate cannot exceed the best member because there's no disagreement to mine.
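The dependence on uncorrelated errors can be made concrete with a toy calculation. The sketch below is not taken from any of the cited notes. It assumes a binary decision, an odd number of equally accurate experts (so the "best expert" is simply any one of them), and a hypothetical `shared_fraction` parameter giving the share of decision states on which every expert makes the same call.

```python
from math import comb


def majority_accuracy(n_experts: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent experts is right.

    Assumes a binary decision and an odd n_experts, so ties cannot occur.
    """
    needed = n_experts // 2 + 1
    return sum(
        comb(n_experts, k) * p_correct**k * (1 - p_correct) ** (n_experts - k)
        for k in range(needed, n_experts + 1)
    )


def pooled_accuracy(n_experts: int, p_correct: float, shared_fraction: float) -> float:
    """Toy mixture of shared and independent errors (illustrative assumption).

    On a `shared_fraction` of states all experts give the same answer, so the
    vote is only as good as one expert. On the rest they err independently.
    """
    independent = majority_accuracy(n_experts, p_correct)
    return shared_fraction * p_correct + (1 - shared_fraction) * independent


if __name__ == "__main__":
    # Eleven experts, each right 60% of the time.
    for shared in (0.0, 0.5, 1.0):
        print(f"shared_fraction={shared:.1f} -> {pooled_accuracy(11, 0.6, shared):.3f}")
```

Under these assumptions the pooled vote clearly beats each 60%-accurate expert when errors are fully independent, and the advantage shrinks to exactly zero as `shared_fraction` approaches 1. The same formula also shows that voting hurts when individual accuracy is below 0.5.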

The corpus points to several ways the independence assumption quietly collapses. The most fundamental is a capability ceiling: "Can imitating ChatGPT fool evaluators into thinking models improved?" finds that imitation reproduces the *style* of a strong source while closing no real capability gap. The ceiling is set by the base model's fundamentals, not by how much demonstration data you pile on. If every demonstration is being absorbed by a learner that lacks the underlying competence, more of them won't transcend the best one. Relatedly, "Can agents learn beyond what their training data shows?" argues that static demonstrations cap an agent at what the curators imagined: with no interaction and no failure signal, aggregation can only interpolate within the demonstrated envelope, never push past its strongest corner.

A second failure mode is mismatch between the demonstrations and the learner. "Does teacher-refined data always improve student model performance?" shows that objectively *higher-quality* expert data degrades the student when it exceeds the student's learning frontier, so a pool that includes the very best experts can underperform a narrower pool the student can actually absorb. "Do overly hard RLVR samples actually harm model capabilities?" sharpens this: when training problems are nearly impossible for the model, it learns degenerate shortcuts that *contaminate* abilities it already had, dragging the aggregate below its best component. Aggregation isn't free; absorbing the wrong demonstrations can leave the learner worse off than before.
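One of the source notes attributes this shortcut learning to group-relative normalization, which treats rare accidental successes as high-advantage trajectories. A minimal sketch of that arithmetic follows, assuming the common GRPO-style form (reward minus group mean, divided by group standard deviation) and binary rewards. The function name, the `eps` term, and the example groups are illustrative and not taken from the cited work.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own sampling group.

    Assumed form: (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Nearly impossible problem: one lucky success in eight rollouts.
hard = [1.0] + [0.0] * 7
# Moderately hard problem: half of the rollouts succeed.
moderate = [1.0] * 4 + [0.0] * 4
# Every rollout fails: no rollout is preferred over any other.
all_fail = [0.0] * 8

if __name__ == "__main__":
    print("hard     success advantage:", group_relative_advantages(hard)[0])
    print("moderate success advantage:", group_relative_advantages(moderate)[0])
    print("all-fail advantages:", group_relative_advantages(all_fail))
```

In this toy setup the single lucky success on the hard problem receives an advantage of about 2.6, against about 1.0 for each success on the moderate problem. Whatever produced the lucky answer, sound reasoning or not, is what gets reinforced most strongly.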

There's also the case where the aggregate looks fine but isn't. "Can identical outputs hide broken internal representations?" and "Can AI pass every test while understanding nothing?" warn that identical benchmark outputs can hide fractured internal structure, so a pooled model might match the best expert on the test set while failing to recombine or transfer, meaning it never really exceeded anyone. And "Can models reliably improve themselves without external feedback?" frames the deepest version: closed-loop aggregation with no external anchor stalls on the generation-verification gap and diversity collapse. Diversity collapse *is* the death of the majority-vote mechanism: once the experts agree (or the model has homogenized them), there are no uncorrelated errors left to cancel.
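To see why collapse removes the benefit, consider a toy version of the implicit vote at a single decision state. The sketch assumes that a model fit by cross-entropy to an equally weighted pool approximates the average of the experts' action distributions, and that low-temperature sampling rescales each probability as p^(1/T) before renormalizing. The three experts, their probabilities, and the function names are hypothetical.

```python
def pooled_policy(expert_dists):
    """Average the experts' action distributions at one state."""
    n_experts = len(expert_dists)
    n_actions = len(expert_dists[0])
    return [sum(d[a] for d in expert_dists) / n_experts for a in range(n_actions)]


def sharpen(probs, temperature):
    """Low-temperature rescaling: p_i^(1/T), renormalized to sum to 1."""
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [x / total for x in powered]


# Action 0 is the correct move at this state (illustrative numbers).
expert_a = [0.1, 0.9, 0.0]  # usually wrong here, prefers action 1
expert_b = [0.9, 0.0, 0.1]  # usually right, occasional slip to action 2
expert_c = [0.9, 0.1, 0.0]  # usually right, occasional slip to action 1

# Diverse pool: expert_a's mistake is outvoted by the other two.
diverse = sharpen(pooled_policy([expert_a, expert_b, expert_c]), temperature=0.1)
# Collapsed pool: three copies of the same expert, so nothing disagrees.
collapsed = sharpen(pooled_policy([expert_a, expert_a, expert_a]), temperature=0.1)

if __name__ == "__main__":
    print("diverse pool   :", [round(p, 4) for p in diverse])
    print("collapsed pool :", [round(p, 4) for p in collapsed])
```

With the diverse pool, sharpening concentrates almost all probability on the correct action even though one expert is usually wrong. With the collapsed pool, the same sharpening commits just as firmly to that expert's mistake.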

The constructive thread running through these is that beating the best expert requires preserved disagreement plus a signal the demonstrations alone can't give. "Can step-wise expert rewards help small models learn hard reasoning?" and "Can models learn argument quality from labeled examples alone?" both suggest the same fix from different angles: dense step-wise alignment signals, or explicit principled frameworks, let a learner extract genuine criteria rather than averaging surface patterns. So the short answer: aggregation fails to beat the best expert when the errors are correlated, when the learner's base capability caps it below the best expert, when the strongest demonstrations sit past its learning frontier, or when diversity has already collapsed. It succeeds mainly when independent errors and a real verification signal survive the pooling.


Sources (10 notes)

Can models trained on many imperfect experts outperform each one?

Generative models trained on many diverse experts with different biases converge toward consensus behavior through cross-entropy optimization. Low-temperature sampling reveals this implicit majority vote, which outperforms any single expert by denoising uncorrelated individual errors on critical decision states.

Can imitating ChatGPT fool evaluators into thinking models improved?

Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method—better fundamentals, not shortcuts, drive real improvement.

Can agents learn beyond what their training data shows?

Agents trained on static expert datasets cannot learn from their own failures or generalize beyond demonstrated scenarios because they never interact with environments during training. Competence is capped by what curators imagined, not by agent capacity.

Does teacher-refined data always improve student model performance?

Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Can identical outputs hide broken internal representations?

Networks trained with SGD reproduce outputs perfectly while having radically different internal structure than evolved networks, with weight perturbations revealing fractured, entangled representations that prevent transfer to novel contexts or creative recombination.

Can AI pass every test while understanding nothing?

The Fractured Entangled Representation hypothesis shows that SGD-trained networks can produce identical outputs across all inputs while maintaining radically different internal representations. Standard benchmarks cannot detect this structural difference.

Can models reliably improve themselves without external feedback?

Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.

Can step-wise expert rewards help small models learn hard reasoning?

Supervised Reinforcement Learning rewards models by measuring alignment with expert actions at each step, providing dense learning signals even when all rollouts fail. This approach bridges the gap between rigid token-by-token imitation (SFT) and sparse outcome-only rewards (RLVR), and works best as a curriculum foundation before outcome-based refinement.

Can models learn argument quality from labeled examples alone?

Fine-tuning on labeled examples fails to transfer quality criteria to new argument types. Models learn surface patterns rather than principled criteria. Explicit instruction using frameworks like RATIO or QOAM significantly improves performance and generalization.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about when aggregated imperfect demonstrations fail to beat the best expert. This question spans 2023–2026 in a curated arXiv library on learning from demonstrations and ensemble effects.

What a curated library found — and when (dated claims, not current truth):
• Majority-vote aggregation works only if expert errors are uncorrelated; shared biases amplify rather than cancel (2024).
• Imitation reproduces *style* not capability — a learner cannot exceed its base model's competence ceiling no matter how many demonstrations are pooled (2023).
• Static demonstrations cap agents at what curators imagined; without interaction or failure signals, aggregation interpolates within the envelope but cannot transcend the strongest corner (2025).
• High-quality expert data *degrades* student performance when it exceeds the student's learning frontier; the best experts can contaminate a pool (2024–2025).
• Diversity collapse (homogenized agreement) kills the majority-vote mechanism; closed-loop self-improvement stalls on this (2026).

Anchor papers (verify; mind their dates):
• arXiv:2406.11741 (2024) — Generative models transcending training experts via implicit majority vote
• arXiv:2305.15717 (2023) — The False Promise of Imitating Proprietary LLMs
• arXiv:2412.02674 (2025) — Self-improvement mirage and verification-gap circularity
• arXiv:2505.11581 (2025) — Fractured entangled representations masking internal collapse

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer models (e.g., o1, Claude 4), training methods (e.g., constitutional AI, multi-agent RL), tooling (caching, parallel sampling), or evaluation have since RELAXED or OVERTURNED it. Separate the durable question (Is there a regime where aggregation fails?) from the perishable limitation (Does it fail *here, now*?). Cite what resolved or reconfirmed each constraint.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months — work showing aggregation *does* beat the best expert despite correlated errors, or vice versa.
(3) Propose 2 research questions that ASSUME the regime may have moved: e.g., what happens when demonstrations are synthetic + verified by a stronger model? When aggregation includes deliberate disagreement-induction?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
