When do aggregated imperfect demonstrations fail to outperform the best expert?
This explores the flip side of a hopeful result — that pooling many flawed experts can beat any one of them — and asks when that pooling breaks down and the crowd no longer beats its best member.
This explores the conditions under which aggregating many imperfect demonstrations *fails* to beat the single best expert: the failure boundary of a result that's usually celebrated. The optimistic case is real: Can models trained on many imperfect experts outperform each one? shows that a model trained on many diverse experts implicitly votes toward consensus, and low-temperature sampling surfaces that vote, beating any individual by canceling out *uncorrelated* errors. The whole mechanism rests on that one word. When experts err independently, their mistakes wash out. When they share a bias (same blind spot, same shortcut, same missing concept), the consensus vote locks in the shared error instead of denoising it, and the aggregate cannot exceed the best member because there is no disagreement to mine.
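The dependence on uncorrelated errors can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the method of the cited note: decisions are binary, the expert accuracies in `ACCURACIES` are hand-picked, and `shared_blind_spot` is a hypothetical parameter giving the fraction of states on which every expert errs together. Each expert's overall accuracy is held fixed across both runs, so only the correlation structure changes.

```python
import random

# Hypothetical per-expert accuracies (an assumption for illustration).
ACCURACIES = [0.60, 0.62, 0.64, 0.66, 0.68, 0.70, 0.72, 0.74, 0.75]


def simulate(accuracies, shared_blind_spot, n_states=20000, seed=0):
    """Return (majority_vote_accuracy, best_single_expert_accuracy).

    With probability `shared_blind_spot` a state is one that every expert
    gets wrong (a perfectly correlated error). On all other states experts
    err independently, with accuracy rescaled so that each expert's overall
    accuracy still matches its entry in `accuracies`.
    """
    rng = random.Random(seed)
    n = len(accuracies)
    conditional = [a / (1.0 - shared_blind_spot) for a in accuracies]
    if max(conditional) > 1.0:
        raise ValueError("shared_blind_spot too large for these accuracies")

    expert_hits = [0] * n
    vote_hits = 0
    for _ in range(n_states):
        if rng.random() < shared_blind_spot:
            correct = [False] * n  # shared blind spot: everyone is wrong
        else:
            correct = [rng.random() < c for c in conditional]
        for i, ok in enumerate(correct):
            expert_hits[i] += int(ok)
        # The majority vote is right when more than half the experts are.
        vote_hits += int(sum(correct) > n / 2)

    return vote_hits / n_states, max(expert_hits) / n_states


independent = simulate(ACCURACIES, shared_blind_spot=0.0)
correlated = simulate(ACCURACIES, shared_blind_spot=0.25)
print("independent errors (vote, best expert):", independent)
print("shared blind spot  (vote, best expert):", correlated)
```

In the independent run the vote should clear the best expert comfortably. In the shared-blind-spot run the vote can never be right on a state where all experts are wrong, so it is capped at the best expert's accuracy even though every individual is exactly as accurate as before.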
The corpus points to several ways the independence assumption quietly collapses. The most fundamental is a capability ceiling: Can imitating ChatGPT fool evaluators into thinking models improved? finds that imitation reproduces the *style* of a strong source while closing no real capability gap — the ceiling is set by the base model's fundamentals, not by how much demonstration data you pile on. If every demonstration is being absorbed by a learner that lacks the underlying competence, more of them won't transcend the best one. Relatedly, Can agents learn beyond what their training data shows? argues that static demonstrations cap an agent at what the curators imagined: with no interaction and no failure signal, aggregation can only interpolate within the demonstrated envelope, never push past its strongest corner.
A second failure mode is mismatch between the demonstrations and the learner. Does teacher-refined data always improve student model performance? shows that objectively *higher-quality* expert data degrades the student when it exceeds the student's learning frontier, so a pool that includes the very best experts can underperform a narrower pool the student can actually absorb. Do overly hard RLVR samples actually harm model capabilities? sharpens this: when demonstrations or reward signals are too hard, the model learns degenerate shortcuts that *contaminate* abilities it already had, dragging the aggregate below its best component. Aggregation isn't free; absorbing the wrong demonstrations does not merely fail to help, it actively degrades the learner.
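The source note on overly hard samples attributes the shortcut problem to group-relative normalization, which treats rare accidental successes as high-advantage trajectories. The arithmetic can be sketched as follows. This assumes one common formulation, in which each rollout's reward has the group mean subtracted and is divided by the group standard deviation; the function name, the `eps` term, and the group sizes are illustrative assumptions, not details taken from the note.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for the same prompt.

    Assumed formulation: (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Nearly impossible problem: 1 accidental success out of 16 rollouts.
rare_success = [1.0] + [0.0] * 15
# Problem near the learner's frontier: 8 successes out of 16 rollouts.
balanced = [1.0] * 8 + [0.0] * 8

adv_rare = group_relative_advantages(rare_success)
adv_balanced = group_relative_advantages(balanced)

print("advantage of the lone success:     ", round(adv_rare[0], 3))
print("advantage of a success when 8/16:  ", round(adv_balanced[0], 3))
```

Under these assumptions the lone success receives an advantage of about sqrt(15), roughly 3.87, against about 1 for each success in the balanced group. A single lucky trajectory on a too-hard problem is therefore reinforced far more strongly than a typical success on a learnable one, whatever shortcut produced it.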
There's also the case where the aggregate looks fine but isn't. Can identical outputs hide broken internal representations? and Can AI pass every test while understanding nothing? warn that identical benchmark outputs can hide fractured internal structure — so a pooled model might match the best expert on the test set while failing to recombine or transfer, meaning it never really exceeded anyone. And Can models reliably improve themselves without external feedback? frames the deepest version: closed-loop aggregation with no external anchor stalls on the generation-verification gap and diversity collapse. Diversity collapse *is* the death of the majority-vote mechanism — once the experts agree (or the model has homogenized them), there are no uncorrelated errors left to cancel.
The constructive thread running through these is that beating the best expert requires preserved disagreement plus a signal the demonstrations alone can't give. Can step-wise expert rewards help small models learn hard reasoning? and Can models learn argument quality from labeled examples alone? both suggest the same fix from different angles: dense step-wise alignment signals, or explicit principled frameworks, let a learner extract genuine criteria rather than averaging surface patterns. So the short answer: aggregation fails to beat the best expert when the errors are correlated, when the base model's capability rather than the amount of data sets the ceiling, when the strongest demonstrations sit past the learner's frontier, or when diversity has already collapsed. It succeeds mainly when independent errors and a real verification signal survive the pooling.
Sources (10 notes)
Generative models trained on many diverse experts with different biases converge toward consensus behavior through cross-entropy optimization. Low-temperature sampling reveals this implicit majority vote, which outperforms any single expert by denoising uncorrelated individual errors on critical decision states.
Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method—better fundamentals, not shortcuts, drive real improvement.
Agents trained on static expert datasets cannot learn from their own failures or generalize beyond demonstrated scenarios because they never interact with environments during training. Competence is capped by what curators imagined, not by agent capacity.
Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
Networks trained with SGD reproduce outputs perfectly while having radically different internal structure than evolved networks, with weight perturbations revealing fractured, entangled representations that prevent transfer to novel contexts or creative recombination.
The Fractured Entangled Representation hypothesis shows that SGD-trained networks can produce identical outputs across all inputs while maintaining radically different internal representations. Standard benchmarks cannot detect this structural difference.
Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.
Supervised Reinforcement Learning rewards models by measuring alignment with expert actions at each step, providing dense learning signals even when all rollouts fail. This approach bridges the gap between rigid token-by-token imitation (SFT) and sparse outcome-only rewards (RLVR), and works best as a curriculum foundation before outcome-based refinement.
Fine-tuning on labeled examples fails to transfer quality criteria to new argument types. Models learn surface patterns rather than principled criteria. Explicit instruction using frameworks like RATIO or QOAM significantly improves performance and generalization.