How does correctness emerge when no expert initially solved the task?
This explores how a model can arrive at correct answers that none of its individual training sources (the 'experts') could produce alone — where the right answer emerges from the collective rather than being copied from any one teacher.
The corpus has a sharp answer at the center of this question: models trained on many diverse experts don't imitate the best one, they implicitly *vote*. In "Can models trained on many imperfect experts outperform each one?", cross-entropy optimization pushes a model toward the consensus across experts whose individual errors are uncorrelated. Because those errors cancel while the shared signal reinforces, low-temperature sampling surfaces a denoised majority vote that can beat every individual expert on the decisions that matter most. Correctness here is an emergent property of aggregation, not a property any teacher possessed.
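The voting argument can be made concrete with a small calculation. What follows is a minimal sketch, not the method of the cited note: it assumes binary decisions, experts that are each right with the same probability, and fully independent errors. The function names (`majority_vote_accuracy`, `sharpen`) and the numbers in `pooled` are hypothetical, chosen only for illustration. It also leans on the standard fact that an unconstrained model minimizing cross-entropy on pooled demonstrations matches the pooled action frequencies, so lowering the sampling temperature concentrates probability on the plurality action.

```python
from math import comb


def majority_vote_accuracy(n_experts: int, p_correct: float) -> float:
    """Exact probability that a strict majority of independent experts is right.

    Assumes a binary decision and that each expert is correct with the same
    probability p_correct, independently of the others (illustrative only).
    """
    k_min = n_experts // 2 + 1  # smallest count that forms a strict majority
    return sum(
        comb(n_experts, k) * p_correct**k * (1 - p_correct) ** (n_experts - k)
        for k in range(k_min, n_experts + 1)
    )


def sharpen(probs: list[float], temperature: float) -> list[float]:
    """Temperature-scale a distribution: p_i ** (1 / T), then renormalize.

    This is the same as applying softmax to log(p_i) / T.
    """
    weights = [p ** (1.0 / temperature) for p in probs]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical pooled action frequencies at one decision state: 60% of expert
# demonstrations pick action 0, the rest scatter across two wrong actions.
pooled = [0.6, 0.25, 0.15]

if __name__ == "__main__":
    for n in (1, 3, 11, 101):
        print(n, round(majority_vote_accuracy(n, 0.6), 4))
    for t in (1.0, 0.5, 0.1):
        print(t, [round(p, 4) for p in sharpen(pooled, t)])
```

Under these assumptions, three experts who are each right 60% of the time already reach 64.8% as a majority, and the figure keeps climbing as independent experts are added. No individual expert ever exceeds 60%.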
The deeper mechanism is that the capability was latent and just needed to be triggered rather than invented. "Does RL post-training create reasoning or just deploy it?" argues that base models already contain reasoning strategies in latent form, and that post-training optimizes *when* to deploy them, not *how* to do them: activation vectors for reasoning strategies exist before any RL touches the model. Read alongside the voting result, this reframes 'emergence' as recombination: the pieces of a correct solution are distributed across the training signal, and training assembles them into a path no single source walked end to end.
This also explains why the 'experts' don't need to be right, or even coherent. "Do reasoning traces need to be semantically correct?" shows that models trained on systematically irrelevant traces keep their accuracy and sometimes generalize better; the trace works as computational scaffolding, not as a transcript of correct thought. And "Can reasoning emerge from expert demonstrations alone?" recovers an implicit reward function from demonstrations through adversarial policy-critic training, reaching verifier-level performance in domains that have no automated checker at all. Both cases sever correctness-of-output from correctness-of-any-input.
But the corpus also marks the boundary where this magic stops. Emergence-from-aggregation needs *independent* signal to denoise against; it isn't free improvement from nothing. "Can models reliably improve themselves without external feedback?" shows that pure self-improvement stalls on the generation-verification gap, diversity collapse, and reward hacking. Every reliable method smuggles in an external anchor (past versions, third-party judges, user corrections, tool feedback). The diverse-expert pool *is* that external anchor; remove the diversity and the voting collapses. Relatedly, "Is reflection in reasoning models actually fixing mistakes?" finds that the apparent self-correction in reasoning chains is mostly post-hoc confirmation: the gain comes from better first answers, not from a model talking itself from wrong to right.
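Why the vote depends on independence can be illustrated with a toy model. This is a sketch under stated assumptions, not a result from the cited notes: with probability `shared_error_rate` every expert repeats one shared answer (right with probability `p_correct`), and otherwise the experts answer independently. The function names and the parameter `shared_error_rate` are hypothetical.

```python
from math import comb


def independent_majority_accuracy(n_experts: int, p_correct: float) -> float:
    """Exact accuracy of a strict majority vote over independent binary experts."""
    k_min = n_experts // 2 + 1
    return sum(
        comb(n_experts, k) * p_correct**k * (1 - p_correct) ** (n_experts - k)
        for k in range(k_min, n_experts + 1)
    )


def vote_accuracy_with_shared_errors(
    n_experts: int, p_correct: float, shared_error_rate: float
) -> float:
    """Majority-vote accuracy when experts sometimes all copy one shared answer.

    Toy assumption: with probability shared_error_rate the whole pool gives a
    single shared answer (no diversity), so the vote is only as good as one
    expert; otherwise errors are independent and the vote can denoise them.
    """
    copied = p_correct  # a unanimous pool is right exactly when the shared answer is
    diverse = independent_majority_accuracy(n_experts, p_correct)
    return shared_error_rate * copied + (1 - shared_error_rate) * diverse


if __name__ == "__main__":
    for rate in (0.0, 0.5, 1.0):
        print(rate, round(vote_accuracy_with_shared_errors(25, 0.6, rate), 4))
```

In this toy model, a fully correlated pool (`shared_error_rate = 1.0`) votes exactly as well as a single expert, however many experts it contains. The benefit of aggregation comes entirely from the share of decisions on which the errors are independent.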
The less obvious point: 'no expert solved it' is not the obstacle it sounds like, because the model was never really learning from any single expert. It was learning from the *shape of their disagreement*. Correctness emerges in the gaps between teachers. That is also why the emergence quietly disappears when the teachers stop disagreeing in useful ways (diversity collapse) or when there is no independent signal to vote against (pure self-improvement).
Sources (6 notes)
Generative models trained on many diverse experts with different biases converge toward consensus behavior through cross-entropy optimization. Low-temperature sampling reveals this implicit majority vote, which outperforms any single expert by denoising uncorrelated individual errors on critical decision states.
Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
RARO recovers implicit reward functions from expert demonstrations through adversarial co-training between a reasoning policy and relativistic critic. This approach matches verifier-based RL performance on reasoning tasks while extending to domains lacking automated verification.
Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.
Analysis of 8 reasoning models shows reflections rarely change answers and primarily serve as post-hoc confirmation. Training on longer reflection chains improves first-answer quality, not self-correction capability.