Does training data format determine whether models collapse entropy or inflate variance?
This reads the question as asking whether the shape of your training data (multiple-choice vs. free-form, one format vs. many) is what decides between two distinct failure modes — a model going too narrow (entropy collapse) or too scattered (variance inflation).
The corpus suggests the honest answer is: format powerfully shapes *what* a model does, but it does not decide *which* of these two failures you get. Those turn out to be separate problems on separate clocks. The distinction is worth untangling, because the question quietly fuses two findings that the collection keeps apart.
Start with how much format matters. One striking result is that training format shapes a model's reasoning strategy roughly 7.5 times more than the subject domain does: multiple-choice data pushes models toward broad, breadth-first scanning, while free-form data produces deeper, more committed chains of reasoning (see: Does training data format shape reasoning strategy more than domain?). So presentation, not content, is the dominant dial. And under reinforcement learning, format diversity doesn't survive: RL converges on a single dominant format inherited from pretraining within the first epoch, quietly suppressing the alternatives, and which format wins depends on model scale rather than which format actually performs best (see: Does RL training collapse format diversity in pretrained models?). That's already a kind of collapse: a narrowing of the model's behavioral range driven by data and scale.
But here's the twist the corpus insists on: entropy collapse and variance inflation aren't a fork you choose between by picking a data format. They're *dual* expressions of the same broken exploration-exploitation balance, showing up at different timescales (collapse during training, variance blow-up at inference), and they require structurally different fixes. Entropy bonuses or critique diversity that rescue training-time collapse do nothing for inference-time variance, and vice versa (see: Why do reasoning models fail differently at training versus inference?). So format isn't the switch; exploration management is, and you need two separate hands on two separate loops.
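To make the "entropy bonus" idea concrete, here is a minimal sketch of the training-time side only. It assumes the generic textbook form of the technique (subtract a small multiple of mean token entropy from the loss); the function names, the `beta` value, and the toy distributions are illustrative assumptions, not something taken from the notes.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def loss_with_entropy_bonus(policy_loss, token_dists, beta=0.01):
    """Return (regularized loss, mean entropy).

    Assumed generic form: loss - beta * mean_entropy, so a policy that keeps
    its next-token distributions spread out pays a slightly lower loss.
    `beta` is a hypothetical coefficient chosen only for illustration.
    """
    mean_entropy = sum(token_entropy(d) for d in token_dists) / len(token_dists)
    return policy_loss - beta * mean_entropy, mean_entropy


if __name__ == "__main__":
    # Toy distributions over a 4-token vocabulary (made up for the example).
    peaked = [[0.97, 0.01, 0.01, 0.01]] * 3   # near-collapsed policy
    spread = [[0.25, 0.25, 0.25, 0.25]] * 3   # maximally exploratory policy
    print(loss_with_entropy_bonus(1.0, peaked))
    print(loss_with_entropy_bonus(1.0, spread))
```

Note what this term touches: only the loss computed during training. Nothing in it constrains how samples spread out at inference, which is consistent with the claim that the two loops need separate interventions.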
Where format and data *do* visibly steer the entropy story is more granular and more interesting. Only about 20% of tokens, the high-entropy 'forking points', carry the real learning signal in RLVR, and training on just those matches or exceeds training on all tokens (see: Do high-entropy tokens drive reasoning model improvements?). Feed a model the wrong shape of data and you corrupt exactly this: overly hard RLVR samples make rare accidental successes look like high-advantage trajectories, so the model collapses onto degenerate shortcuts (answer repetition, skipped computation) that then contaminate skills it already had (see: Do overly hard RLVR samples actually harm model capabilities?). And on the variance side, the *style* baked into your data matters: teachers conditioned on correct answers produce confident, concise traces that students inherit, suppressing uncertainty and trading away out-of-distribution robustness (see: Does richer teacher context hurt student generalization?). Models even carry an implicit signal here: they run 3-4x lower output entropy on their own generations, tracking input surprise without ever verbalizing it (see: Why do models produce less uncertain outputs on their own text?).
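Two of the mechanisms in this paragraph are easy to see in a few lines of arithmetic. The sketch below is an illustration under stated assumptions, not the method from the underlying notes: `forking_token_mask` keeps the top 20% of positions by entropy, and `group_relative_advantages` uses the common mean-and-standard-deviation normalization within a sampled group. The function names, the group size of 16, the binary rewards, and the toy entropy values are all hypothetical.

```python
import math


def forking_token_mask(entropies, fraction=0.2):
    """Mark the top `fraction` of token positions by entropy as trainable."""
    n = len(entropies)
    if n == 0:
        return []
    k = max(1, round(fraction * n))
    ranked = sorted(range(n), key=lambda i: entropies[i], reverse=True)
    top = set(ranked[:k])
    return [i in top for i in range(n)]


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps).

    Assumed form of group-relative normalization; uses the population
    standard deviation of the group.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Made-up per-token entropies for a 10-token response.
    entropies = [0.1, 2.0, 0.05, 0.2, 1.5, 0.1, 0.0, 0.3, 0.1, 0.2]
    print(forking_token_mask(entropies))

    # Near-impossible prompt: 1 accidental success out of 16 samples.
    too_hard = group_relative_advantages([0.0] * 15 + [1.0])
    # Well-matched prompt: 8 successes out of 16 samples.
    balanced = group_relative_advantages([0.0] * 8 + [1.0] * 8)
    print(too_hard[-1], balanced[-1])
```

Under these assumptions, the single lucky success on the near-impossible prompt gets an advantage of about sqrt(15) ≈ 3.87, while each success on the balanced prompt gets about 1.0. The rarer the success, the harder it is reinforced, regardless of whether the trajectory behind it reasoned soundly.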
So the reframe worth taking away: format determines *reasoning strategy and which distribution you collapse toward* — that part is data-driven and dramatic. But whether the symptom shows up as training-time entropy collapse or test-time variance inflation is a property of how exploration is managed across two independent loops, not a property you can pin on the data format alone. The data sets the trap; the exploration regime decides which way you fall into it.
Sources (7 notes)
Models trained on multiple-choice data adopt breadth-first exploration (Cohen's d up to 1.5), while free-form training produces depth-first reasoning. Format effect dwarfs domain effect, meaning presentation matters far more than content type.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Both failures stem from failed exploration-exploitation balance but occur at different timescales requiring structurally distinct interventions. Training-time fixes (entropy bonuses, critique diversity) cannot prevent inference-time variance inflation, and vice versa; both loops must be managed independently.
Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
Teachers conditioned on correct answers and verifier output produce confident, concise traces that students inherit. This style suppresses uncertainty expression, optimizing in-domain performance while degrading generalization to out-of-distribution problems that require epistemic caution.
Post-trained models produce 3-4x lower output entropy on their own generations, driven by an internal representation of input surprise that causally modulates confidence. This implicit self-recognition signal appears without being verbalized, encoded directly in the output distribution.