INQUIRING LINE

Why do smaller and larger models converge on different output formats?

This explores why model size changes which output *shape* a model settles into — not whether it's right, but which of several learned formats wins out — and what that reveals about how scale interacts with training.


The corpus has a surprisingly direct answer, and it isn't about capability. Controlled experiments on RL post-training show that a model already carries *multiple* candidate formats from pretraining. RL doesn't invent a new one; it amplifies a single dominant format within the first epoch and suppresses the rest. The striking part: which format wins depends on model scale, not necessarily on which format performs best (see "Does RL training collapse format diversity in pretrained models?"). So convergence on different formats at different sizes appears to be largely an artifact of which latent format each scale weighted most heavily before post-training began.
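A toy sketch can make the collapse dynamic concrete. It is not the method from the cited experiments: the sharpening rule, the format names, and the prior weights are all hypothetical, and reward is deliberately left out so that the only thing deciding the winner is the starting mixture.

```python
# Toy illustration only: a "sharpening" update stands in for RL post-training.
# The update rule, format names, and prior weights are assumptions for this
# sketch, not values or methods taken from the cited experiments.

def sharpen(probs, eta=0.5):
    """One toy update: raise each probability to (1 + eta), then renormalize."""
    powered = [p ** (1.0 + eta) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]

def collapse(probs, steps=10, eta=0.5):
    """Apply the sharpening update repeatedly and return the final mixture."""
    for _ in range(steps):
        probs = sharpen(probs, eta)
    return probs

FORMATS = ["format_a", "format_b", "format_c"]  # hypothetical format labels

# Hypothetical pretraining mixtures at two model scales.
small_prior = [0.45, 0.35, 0.20]
large_prior = [0.30, 0.25, 0.45]

small_final = collapse(small_prior)
large_final = collapse(large_prior)

small_winner = FORMATS[small_final.index(max(small_final))]
large_winner = FORMATS[large_final.index(max(large_final))]
print(small_winner, large_winner)
```

Under these assumptions each mixture collapses onto whichever format it started out favoring, so two scales with different priors end up with different winners even though no format is ever scored as better.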

Underneath that sits a difference in how probability mass is distributed. Larger models concentrate their probability on a few preferred outputs, which is why, counterintuitively, smaller models around 500M parameters generate *more* distinct outputs from a fixed number of samples (see "Why aren't bigger models better for generating diverse outputs?"). A peakier distribution doesn't just reduce diversity; it also changes which single format dominates when training collapses the alternatives. Small and large models are effectively starting from different-shaped distributions, so the format that survives the collapse differs.
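The link between peakiness and diversity can be checked with a small sampling experiment. This is a minimal sketch, assuming Zipf-shaped output distributions as a stand-in for real model distributions; the vocabulary size, exponents, and sample budget are arbitrary choices for illustration.

```python
import random

def zipf_weights(n_items, exponent):
    """Unnormalized Zipf-like weights; a larger exponent gives a peakier distribution."""
    return [1.0 / (rank ** exponent) for rank in range(1, n_items + 1)]

def unique_samples(weights, n_draws, seed=0):
    """Draw n_draws items with replacement and count how many are distinct."""
    rng = random.Random(seed)
    draws = rng.choices(range(len(weights)), weights=weights, k=n_draws)
    return len(set(draws))

# Assumed stand-ins: a flatter distribution for the "small model" and a
# peakier one for the "large model". Neither is measured from a real model.
flat = zipf_weights(1000, 0.5)
peaked = zipf_weights(1000, 2.0)

budget = 200  # fixed number of samples per model
flat_unique = unique_samples(flat, budget)
peaked_unique = unique_samples(peaked, budget)
print(flat_unique, peaked_unique)
```

With the same budget, the flatter distribution yields far more distinct outputs, because the peaked one keeps resampling its top few items.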

The deeper reframe is that output format and actual knowledge are *separable*. A 1.5B model with only LoRA post-training can match much larger RL-trained models on reasoning, which suggests RL mostly teaches the *organization* of the output rather than new facts (see note lora-based-reasoning-format-adaptation-achieves-competitive-reasoning-by-adaptin). If format is a relatively cheap, learnable layer sitting on top of knowledge, then it makes sense that it's the thing most sensitive to scale-dependent quirks, and the thing you can deliberately steer. DPO does exactly this for small models: training on explicit wrong-vs-right example pairs fixes the rigid format failures that plain supervised fine-tuning leaves behind (see "Can small models match large models on function calling?").
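To show what "explicit wrong-vs-right examples" means mechanically, here is a minimal sketch of the standard DPO loss for a single preference pair. The log-probability values are made up for illustration, and the function-calling work's actual data and hyperparameters are not reproduced here.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities.

    The margin measures how much more the policy favors the chosen response
    over the rejected one, relative to a frozen reference model.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(beta * margin)) written as log(1 + exp(-beta * margin))
    return math.log1p(math.exp(-beta * margin))

# Illustrative numbers only: a well-formatted call (chosen) vs. a malformed one (rejected).
at_init = dpo_loss(-5.0, -5.0, -5.0, -5.0)    # policy equals reference
improved = dpo_loss(-3.0, -9.0, -5.0, -5.0)   # policy now prefers the chosen output
print(at_init, improved)
```

The loss starts at log 2 when the policy matches the reference and falls as the policy shifts probability from the rejected output to the chosen one, which is how the negative examples exert direct pressure on malformed formats.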

There's a tension worth sitting with. Across many models, outputs tend to converge: an "artificial hivemind" where different systems independently produce near-identical responses because they share training data and alignment recipes (see "Do different AI models actually produce diverse outputs?"). So at the *content* level shared training pushes models toward sameness, while at the *format* level scale pushes them toward different attractors. The takeaway: format isn't a window onto how smart a model is. It's a near-arbitrary winner of a collapse process, decided partly by size, and decoupled enough from knowledge that you can train it independently of what the model actually understands.


Sources (5 notes)

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Why aren't bigger models better for generating diverse outputs?

Research shows that for synthetic data generation, models around 500M parameters outperform larger ones in output diversity per sample. Larger models concentrate probability mass on preferred outputs, reducing the variety of distinct samples generated within a fixed budget.

Can small models match large models on function calling?

Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.

Do different AI models actually produce diverse outputs?

INFINITY-CHAT analyzed 70+ models across 26K open-ended queries and found an "Artificial Hivemind" effect: models independently generate strikingly similar or identical responses due to overlapping training data and alignment procedures, undermining the diversity benefits of model ensembles.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

As an LLM researcher, investigate whether smaller and larger models genuinely converge on *different* output formats, or whether this apparent divergence has been narrowed by recent advances in training, inference, or evaluation.

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–10 to 2025–10. Key claims:
- RL post-training amplifies a single dominant pretraining format within the first epoch; which format wins depends on model scale, not necessarily on performance (2025–04).
- Smaller models (~500M parameters) generate more distinct outputs from a fixed number of samples than larger models, reflecting peakier probability distributions in large models (2024–10).
- A 1.5B model with only LoRA post-training matches much larger RL-trained models on reasoning, suggesting RL teaches output *organization* rather than facts (2025–04).
- DPO fixes rigid format failures in small models by feeding explicit wrong-vs-right examples (2024–10).
- Different LLMs independently converge on near-identical outputs due to shared training data and alignment recipes, an "artificial hivemind" at the content level (2025–10).

Anchor papers (verify; mind their dates):
- arXiv:2504.07912 (2025–04): Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining
- arXiv:2410.18890 (2024–10): Improving Small-Scale Large Language Models Function Calling for Reasoning Tasks
- arXiv:2504.15777 (2025–04): Tina: Tiny Reasoning Models via LoRA
- arXiv:2510.22954 (2025–10): Artificial Hivemind: The Open-Ended Homogeneity of Language Models

Your task:
(1) RE-TEST THE SCALE-FORMAT LINK. For each constraint above, judge whether newer post-training methods (e.g., test-time scaling, diffusion-based reasoning, multi-agent orchestration), inference tooling, or unified evaluation harnesses have since *unified* small and large model format behavior, or whether scale-dependent format divergence remains robust. Separate the durable question (why does pretraining distribute probability differently by scale?) from the perishable limitation (do RL and DPO still amplify this divergence with current methods?).
(2) Surface the strongest work from the last 6 months that *contradicts* or *supersedes* the "format is cheap, separable from knowledge" framing. Does recent reasoning-model scaling or structured output research suggest format and knowledge are more tightly coupled than the library implies?
(3) Propose 2 research questions that assume scale-dependent format divergence may have *narrowed*: (a) Can unified output schema + constrained decoding eliminate scale-dependent format winners? (b) Does continual post-training or distillation collapse the pretraining probability gaps that drive divergence?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
