INQUIRING LINE

How do quality, diversity, and complexity create different effects on downstream model performance?

This explores how three properties of training data — quality, diversity, and complexity — pull downstream model performance in *different directions*, rather than all feeding one generic 'better data' dial.


This explores how quality, diversity, and complexity act on downstream performance as separate levers, and why collapsing them into a single 'good data' score quietly breaks models over time. The cleanest map of this comes from work showing each property does a distinct job: quality drives *in-distribution* generalization (doing well on the kind of data you trained on), diversity drives *out-of-distribution* generalization (holding up on the unfamiliar), and complexity strengthens both at once (see "How do quality, diversity, and complexity affect synthetic data differently?"). The trap is that most evaluation pipelines measure only quality and treat diversity as noise, so self-improvement loops keep optimizing the one number they can see while irreversibly bleeding out the diversity they never tracked.

That blind spot shows up everywhere once you look. Pure self-improvement stalls precisely because of diversity collapse, the generation-verification gap, and reward hacking, and the methods that actually work smuggle in some external anchor (a past checkpoint, a judge, a user correction, tool feedback) to refill what the loop drains (see "Can models reliably improve themselves without external feedback?"). RL post-training makes the mechanism vivid: within the first epoch it amplifies one dominant pretraining format and suppresses the rest, and which format 'wins' depends on model scale, not necessarily on performance (see "Does RL training collapse format diversity in pretrained models?"). So the quality-optimizing pressure isn't neutral toward diversity; it's actively corrosive unless something counteracts it.
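One way to make format collapse visible is to track the entropy of the format distribution across checkpoints alongside the quality score. The sketch below is illustrative only: the format labels, the counts, and the `format_entropy` helper are hypothetical and not taken from the cited notes.

```python
import math

def format_entropy(format_counts):
    """Shannon entropy (in bits) of a distribution over output formats.

    `format_counts` maps a format label to how many sampled outputs used it.
    Higher entropy means outputs are spread across more formats.
    """
    total = sum(format_counts.values())
    if total == 0:
        return 0.0
    probs = [c / total for c in format_counts.values() if c > 0]
    return sum(-p * math.log2(p) for p in probs)

# Hypothetical counts from 100 sampled outputs before and after RL post-training.
before_rl = {"prose": 40, "code": 35, "latex": 25}
after_rl = {"prose": 2, "code": 96, "latex": 2}

print(f"entropy before RL: {format_entropy(before_rl):.3f} bits")
print(f"entropy after RL:  {format_entropy(after_rl):.3f} bits")
```

A falling entropy curve next to a rising quality curve is the signature the paragraph above describes: the loop is improving the number it can see while draining the one it cannot.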

But 'diversity' itself splits into two things that are easy to confuse, and this is the part most readers don't expect. Raw output variance isn't the same as *useful* variance. When you measure diversity only among outputs that pass a quality bar, preference-tuned models turn out to be *more* semantically diverse than base models; base models just looked diverse because their variance sprawled across incoherent space (see "Does preference tuning actually reduce the diversity of model outputs?"). And the effect of preference tuning even reverses by domain: RLHF compresses lexical diversity in code (where convergence on the correct answer is the goal) but expands it in creative writing (where distinctiveness is rewarded) (see "Does preference tuning always reduce diversity the same way?"). Diversity, in other words, is only good relative to what the task wants.
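The difference between raw and quality-filtered diversity can be sketched in a few lines. This is a toy illustration under loud assumptions: Jaccard distance over words stands in for a real semantic distance, `toy_quality` is a hypothetical quality check, and the sample outputs are invented.

```python
from itertools import combinations

def jaccard_distance(a, b):
    """Crude lexical stand-in for a semantic distance (assumption)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)

def mean_pairwise_distance(outputs):
    """Average distance over all unordered pairs; 0.0 if fewer than two outputs."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)

def quality_filtered_diversity(outputs, quality_fn, threshold):
    """Diversity measured only among outputs that clear the quality bar."""
    passing = [o for o in outputs if quality_fn(o) >= threshold]
    return mean_pairwise_distance(passing)

def toy_quality(text):
    """Hypothetical quality check: a well-formed sentence ends with a period."""
    return 1.0 if text.endswith(".") else 0.0

# Invented samples: the "base" model sprawls into incoherent strings.
base = ["the cat sat.", "zx qv rr", "lorem blip", "the cat sat."]
tuned = ["the cat sat.", "the cat slept.", "a dog ran.", "the cat sat."]

raw_base, raw_tuned = mean_pairwise_distance(base), mean_pairwise_distance(tuned)
filt_base = quality_filtered_diversity(base, toy_quality, 1.0)
filt_tuned = quality_filtered_diversity(tuned, toy_quality, 1.0)
print(raw_base, raw_tuned, filt_base, filt_tuned)
```

In this toy data the base samples score higher on raw diversity, but once incoherent outputs are filtered out the ordering flips, which is the pattern the note describes.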

The most encouraging thread is that quality and diversity aren't doomed to trade off: you can make them reinforce each other if you optimize for both explicitly. DARLING rewards semantic diversity *during* RL and finds it catalyzes exploration, producing higher-quality outputs than quality-only baselines on both creative and math tasks (see "Can diversity optimization improve quality during language model training?"). Step-level critique models do something similar inside the training loop, counteracting the 'tail narrowing' that kills solution variety across self-training iterations (see "Do critique models improve diversity during training itself?"). Counterintuitively, smaller ~500M-parameter generators produce more unique samples per budget than big models, which concentrate probability mass on their favorites (see "Why aren't bigger models better for generating diverse outputs?"), so for synthetic data, the diversity lever and the scale lever can point opposite ways.
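A minimal sketch of a reward that accounts for both quality and semantic diversity within a group of sampled responses follows. It is not DARLING's implementation: the multiplicative combination is an illustrative assumption, and `same_meaning` is a hypothetical stand-in for the learned classifier the note mentions.

```python
def diversity_scores(responses, same_meaning):
    """For each response, the fraction of the other responses that differ in meaning."""
    n = len(responses)
    scores = []
    for i, resp in enumerate(responses):
        distinct = sum(
            1
            for j, other in enumerate(responses)
            if j != i and not same_meaning(resp, other)
        )
        scores.append(distinct / (n - 1) if n > 1 else 0.0)
    return scores

def combined_rewards(responses, quality, same_meaning):
    """Scale each quality reward by its diversity score (illustrative choice)."""
    div = diversity_scores(responses, same_meaning)
    return [q * d for q, d in zip(quality, div)]

def exact_match(a, b):
    """Hypothetical equivalence check; a real setup would use a semantic classifier."""
    return a == b

# Invented group of four samples for one prompt; the last one is wrong.
responses = ["4", "4", "four (2+2)", "5"]
quality = [1.0, 1.0, 1.0, 0.0]

rewards = combined_rewards(responses, quality, exact_match)
print(rewards)
```

Under this scheme a correct answer that duplicates its neighbours earns less than an equally correct but distinct one, while a distinct but wrong answer still earns nothing, so diversity never pays without quality.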

Complexity, the third property, has the sharpest cautionary tale. More demanding training data helps, but only up to a point: overly hard RLVR samples push models to learn degenerate shortcuts (answer repetition, skipped computation) that then *contaminate* capabilities they already had, because group-relative normalization treats rare accidental successes as high-advantage trajectories (see "Do overly hard RLVR samples actually harm model capabilities?"). The benign-looking version of this is instruction density, which degrades performance in predictable patterns (linear, exponential, or a threshold cliff) depending on model type (see "How does instruction density affect model performance?"). The throughline across all three levers: each helps a *different* thing, each fails in a *different* way, and the moment you fold them into one metric you lose the ability to see which one is breaking.
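The arithmetic behind the 'rare accidental success' problem can be sketched with a small example. It assumes the common group-relative form, where each reward is standardized against the mean and standard deviation of its sampling group; the notes do not spell out the exact formula, and the group size of 16 is hypothetical.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against its own sampling group (assumed form)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Nearly impossible problem: 1 lucky success out of 16 samples.
rare = [0.0] * 15 + [1.0]
# Moderately hard problem: 8 successes out of 16 samples.
balanced = [0.0] * 8 + [1.0] * 8

adv_rare = group_relative_advantages(rare)
adv_balanced = group_relative_advantages(balanced)

print(f"advantage of the lone success: {adv_rare[-1]:.3f}")
print(f"advantage of a success at 50%: {adv_balanced[-1]:.3f}")
```

With binary rewards and success rate p, a success gets an advantage of about sqrt((1 - p) / p), so the lone success at p = 1/16 is weighted nearly four times as heavily as a success at p = 1/2. If that lone success came from a shortcut, the shortcut is what gets reinforced.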


Sources (10 notes)

How do quality, diversity, and complexity affect synthetic data differently?

Quality drives in-distribution generalization, diversity enables out-of-distribution generalization, and complexity strengthens both. Current evaluation methods collapse these into a single quality metric, causing self-improvement loops to degrade through irreversible diversity loss.

Can models reliably improve themselves without external feedback?

Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Does preference tuning actually reduce the diversity of model outputs?

When diversity is measured among quality-passing outputs rather than all outputs, preference-tuned models generate greater semantic diversity than base models. Base models appear more diverse only because their variance spans incoherent space.

Does preference tuning always reduce diversity the same way?

RLHF reduces lexical-syntactic diversity in code generation but increases it in creative writing. The direction depends on what each domain incentivizes: code rewards convergence toward correct solutions, while creative writing rewards stylistic distinctiveness.

Can diversity optimization improve quality during language model training?

DARLING jointly optimizes for quality and semantic diversity using a learned classifier, finding that diversity rewards catalyze exploration and produce higher-quality outputs than quality-only baselines across both creative and mathematical tasks.

Do critique models improve diversity during training itself?

Step-level critique in the training loop counteracts tail narrowing and maintains solution diversity across self-training iterations. This training-time benefit—preventing premature convergence—is more fundamental than test-time accuracy gains.

Why aren't bigger models better for generating diverse outputs?

Research shows that for synthetic data generation, models around 500M parameters outperform larger ones in output diversity per sample. Larger models concentrate probability mass on preferred outputs, reducing the variety of distinct samples generated within a fixed budget.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

How does instruction density affect model performance?

The IFScale benchmark shows three degradation patterns: linear (small models), exponential (mid-range), and threshold decay (reasoning models hold up to roughly 150 instructions, then fail steeply). Even the best models reach only 68% accuracy at maximum density.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst tasked with re-testing whether claims about quality, diversity, and complexity as *separate levers* on downstream LLM performance remain valid or have been superseded. The question: do these three properties still act distinctly, or have newer training methods, evaluation harnesses, or model capabilities collapsed the distinctions?

What a curated library found — spanning 2024–2026 (dated claims, not current truth):
• Quality drives in-distribution generalization; diversity drives out-of-distribution; complexity enables both but degrades predictably past a threshold (2024–2025).
• Self-improvement loops suffer diversity collapse unless anchored by external signals (past checkpoints, judges, user feedback) — pure self-refinement is circular (2024–2025).
• RL post-training narrows diversity by amplifying one pretraining distribution format within the first epoch; which format 'wins' tracks model scale, not performance (2025).
• Preference-tuned models show *more* semantic diversity than base models when measured within quality-passing subsets; RLHF effects on lexical diversity are domain-dependent (code ≈ compression, creative ≈ expansion) (2025).
• Jointly optimizing quality + diversity (e.g., DARLING) catalyzes exploration and outperforms quality-only baselines; smaller ~500M models generate more unique samples per budget than larger ones (2025–2026).

Anchor papers (verify; mind their dates):
– arXiv:2509.02534 "Jointly Reinforcing Diversity and Quality in Language Model Generations" (2025)
– arXiv:2507.11538 "Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining" (2025)
– arXiv:2605.28388 "Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs" (2026)
– arXiv:2504.12522 "Evaluating the Diversity and Quality of LLM Generated Content" (2025)

Your task:
(1) RE-TEST THE LEVER SEPARATION. For each claimed distinction (quality ≠ diversity ≠ complexity), ask: have unified metrics, multi-objective training frameworks, or scaling laws since collapsed these into one coherent phenomenon? Conversely, have finer mechanistic studies *strengthened* the case for treating them separately? Cite what evidence holds each distinction together or dissolves it.
(2) Surface the strongest *reconciling or contradicting* work from the last ~6 months. If newer papers show quality and diversity trade off after all, or that complexity subsumes both, name them. If work shows the three-lever model still holds at scale, ground that too.
(3) Propose 2 research questions that *assume the regime may have shifted*: e.g., "If frontier models now tolerate high complexity without degenerate shortcuts, what replaces complexity as the binding constraint?" or "Do multimodal or reasoning-scaffold approaches dissolve the diversity-quality tension entirely?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.

Next inquiring lines