Why do different LLMs converge on similar outputs in open-ended tasks?
This explores why models with different sizes, architectures, and training regimes tend to produce similar answers on open-ended tasks — and the corpus suggests the convergence is built into the shared autoregressive objective, not a coincidence of training data.
The corpus's sharpest answer is that they're all the same kind of machine underneath: autoregressive probability estimators that favor likely next tokens. When you frame a model that way, its behavior becomes predictable from the *probability* of the target output rather than its logical difficulty. Researchers correctly anticipated that low-probability targets (counting letters, reciting the alphabet backwards) would be systematically harder across models, even though they're trivial for a human (see "Can we predict where language models will fail?"). If all models optimize the same statistical landscape, they slide toward the same high-probability valleys.
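To make the "probability, not logical difficulty" framing concrete, here is a minimal sketch using a hypothetical three-token bigram table. The table, its numbers, and the function names are all invented for illustration and stand in for a real model's next-token distribution; nothing here comes from the cited notes.

```python
import math

# Hypothetical toy bigram model: BIGRAM[prev][next] = P(next | prev).
# The probabilities are invented; a real LLM conditions on the full prefix.
BIGRAM = {
    "a": {"a": 0.05, "b": 0.90, "c": 0.05},
    "b": {"a": 0.05, "b": 0.05, "c": 0.90},
    "c": {"a": 0.05, "b": 0.05, "c": 0.90},
}

def sequence_logprob(tokens):
    """Sum of log P(token_i | token_{i-1}); the first token is taken as given."""
    return sum(math.log(BIGRAM[prev][nxt]) for prev, nxt in zip(tokens, tokens[1:]))

def greedy_continue(start, steps):
    """Extend `start` by always choosing the single most likely next token."""
    out = [start]
    for _ in range(steps):
        dist = BIGRAM[out[-1]]
        out.append(max(dist, key=dist.get))
    return out

forward = ["a", "b", "c"]   # the familiar order: high probability
backward = ["c", "b", "a"]  # logically just as simple, but low probability

print(sequence_logprob(forward), sequence_logprob(backward))
print(greedy_continue("a", 2))
```

Both sequences are equally easy to describe, yet the backward one has a far lower log-probability under the toy table, and greedy continuation from "a" reproduces the forward order. Any two models that share roughly this table would make the same choice.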
The most striking evidence for convergence is where models hit identical ceilings. On constrained optimization, LLMs plateau at ~55–60% constraint satisfaction *regardless of architecture, parameter count, or training regime*. Reasoning models don't systematically escape it either, which points to a shared structural limit rather than a gap any one lab could close by scaling (see "Do larger language models solve constrained optimization better?"). A related finding shows that when LLMs face problems that look like optimization, they don't actually run the iterative method: they recognize the template and emit a plausible-looking but incorrect answer, a failure that persists across scale and training approach (see "Do large language models actually perform iterative optimization?"). Convergence here is convergence on the *same shortcut*: pattern-match the familiar shape, fill in the expected-looking values.
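The notes do not say how constraint satisfaction was scored, so the following is only one plausible reading: the fraction of a problem's constraints that a proposed answer satisfies. The constraint set, the candidate answers, and the names `satisfaction_rate`, `CONSTRAINTS`, `plausible`, and `feasible` are hypothetical, chosen to show how a "plausible-looking" answer can pass the easy constraints while violating the binding ones.

```python
from typing import Callable, Dict, List

Solution = Dict[str, float]
Constraint = Callable[[Solution], bool]

def satisfaction_rate(solution: Solution, constraints: List[Constraint]) -> float:
    """Fraction of constraints the candidate solution satisfies."""
    if not constraints:
        raise ValueError("need at least one constraint")
    return sum(1 for check in constraints if check(solution)) / len(constraints)

# Invented toy problem in two variables; not taken from any benchmark.
CONSTRAINTS: List[Constraint] = [
    lambda s: s["x"] >= 0,
    lambda s: s["y"] >= 0,
    lambda s: s["x"] + s["y"] <= 10,
    lambda s: 2 * s["x"] + s["y"] <= 14,
    lambda s: s["x"] - s["y"] <= 2,
]

plausible = {"x": 6.0, "y": 3.0}  # looks reasonable, breaks two constraints
feasible = {"x": 4.0, "y": 4.0}   # satisfies every constraint

print(satisfaction_rate(plausible, CONSTRAINTS))
print(satisfaction_rate(feasible, CONSTRAINTS))
```

A template-matched answer like `plausible` clears the sign and budget constraints that any sensible-looking guess would clear, and fails exactly the ones that require working through the interaction between variables.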
There's a deeper mechanism worth knowing about. During inference, a model can hold several distinct interpretations of a task in superposition, but autoregressive decoding forces it to collapse to a single one after the very first token (see "Can LLMs handle multiple tasks at once during inference?"). That early commitment is a recurring theme: in multi-turn conversation, models lock into a premature guess and struggle to recover, with performance dropping ~39% on average across all major LLMs (see "Why do language models fail in gradually revealed conversations?"). The same early-commitment bias even shows up architecturally: uncertainty signals dominate the early transformer layers while longer-horizon "empowerment" signals only emerge in middle layers, so models decide before exploration can kick in (see "Why do large language models explore less effectively than humans?"). Different models, same wiring, same rush to commit, so they commit to similar things.
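One way to build intuition for the first-token collapse is a toy Bayesian mixture. This is an analogy under invented assumptions, not the mechanism measured in the cited note: the two "tasks", their prior weights, and their first-token distributions are hypothetical.

```python
# Hypothetical setup: before decoding, the model weights two readings of the prompt equally.
PRIOR = {"translate": 0.5, "summarize": 0.5}

# Invented first-token distributions under each reading.
FIRST_TOKEN = {
    "translate": {"Le": 0.80, "The": 0.20},
    "summarize": {"Le": 0.05, "The": 0.95},
}

def first_token_distribution():
    """Marginal P(token) obtained by mixing the per-task distributions with the prior."""
    tokens = {tok for dist in FIRST_TOKEN.values() for tok in dist}
    return {
        tok: sum(PRIOR[task] * FIRST_TOKEN[task].get(tok, 0.0) for task in PRIOR)
        for tok in tokens
    }

def posterior_after(token):
    """Bayes update: P(task | first token emitted)."""
    joint = {task: PRIOR[task] * FIRST_TOKEN[task].get(token, 0.0) for task in PRIOR}
    total = sum(joint.values())
    return {task: weight / total for task, weight in joint.items()}

marginal = first_token_distribution()
chosen = max(marginal, key=marginal.get)  # greedy choice of the first token
print(chosen, posterior_after(chosen))
```

Before any token is emitted the two readings are balanced; once the greedy first token is on the page, every later token is conditioned on it, and the weight on the other reading drops sharply. Models that share similar first-token preferences would collapse to the same reading.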
The self-reinforcing piece: aligned models, given nothing but pre-query formatting tokens, will auto-regressively generate instruction data, producing instruction sets that rival human-curated ones (see "Can aligned LLMs generate their own training data?"). When labs fine-tune on each other's synthetic outputs (and on the same web corpus), the distributions homogenize further. And none of them can simply think their way out of it: self-improvement is formally bounded by a generation–verification gap, so a model can't transcend its own distribution through metacognition alone. It needs an external signal to break the symmetry (see "What stops large language models from improving themselves?").
The thing you might not have expected to learn: the convergence you see on open-ended tasks isn't mostly about overlapping datasets. Every model runs the *same algorithm* (next-token likelihood maximization) and commits early to high-probability completions, which means the way to make two models diverge is usually to change the harness around them, not the weights inside them.
Sources (8 notes)
By framing LLMs as autoregressive probability machines, researchers predicted that tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed these predictions on tasks such as reciting the alphabet backwards and counting letters.
Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.
Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.
Large language models represent multiple complete, computationally distinct tasks simultaneously during inference—a macroscopic phenomenon separate from feature-level superposition. However, autoregressive decoding forces convergence to a single task after the first token, preventing practical multi-task generation.
Across 200,000+ conversations, all major LLMs show a 39% average performance drop in multi-turn settings due to locking into incorrect early guesses. Agent mitigations recover only 15–20% of this loss.
SAE decomposition shows uncertainty values dominate early transformer blocks while empowerment representations emerge only in middle blocks. This temporal mismatch causes models to commit to decisions before long-term exploration signals can influence them. Reasoning-trained o1 overcomes this by extending computation time.
MAGPIE shows that aligned models like Llama-3-Instruct auto-regressively generate diverse, high-quality instructions when given only pre-query formatting tokens, without prompt engineering. 4M generated pairs matched human-curated datasets in quality and outperformed external sources in downstream fine-tuning.
Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.