INQUIRING LINE

Does training data format matter more than who generates it?

This explores whether the *form* of training data — how it's structured and presented — shapes a model's behavior more than the *source* of that data (self vs. external teacher, human vs. synthetic).


The corpus suggests both axes matter, but in different ways. The surprising answer is that format effects can be enormous, while source effects turn out to be less about "who is stronger" and more about "who fits the learner."

The sharpest evidence for format comes from work showing that how data is presented shapes reasoning strategy roughly 7.5 times more than the domain it covers: multiple-choice formatting pushes models toward breadth-first exploration, while free-form data produces depth-first reasoning (see "Does training data format shape reasoning strategy more than domain?"). Format isn't just a packaging choice: it gets baked into *how the model thinks*. Reinforcement learning then doubles down on this. RL tends to collapse onto a single dominant format inherited from pretraining within the first epoch, suppressing alternatives, and the winning format depends on model scale, not necessarily on performance (see "Does RL training collapse format diversity in pretrained models?"). So format isn't just influential; it's a channel through which training silently narrows behavior.
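The source notes report the format effect as a Cohen's d of up to 1.5. For readers unfamiliar with that statistic, here is a minimal sketch of how a standardized effect size is computed. The function name and the sample scores are made up for illustration and are not data from the cited work.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    # Pool the two sample variances, weighted by their degrees of freedom.
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical "breadth of exploration" scores, for illustration only.
multiple_choice_trained = [4.0, 5.0, 6.0, 5.0, 5.0]
free_form_trained = [3.0, 4.0, 5.0, 4.0, 4.0]

d = cohens_d(multiple_choice_trained, free_form_trained)
print(f"Cohen's d = {d:.2f}")  # about 1.41 for these made-up numbers
```

A value near 1.5 means the two group means sit about one and a half pooled standard deviations apart, which is conventionally read as a very large effect.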

But the "who generates it" question has its own twist, and it cuts against the intuition that a stronger source is always better. Models often learn *more* from data they generate themselves than from data produced by a stronger external model. SEAL lifts QA accuracy from 33.5% to 47.0%, plausibly because self-restructured data matches the learner's own representational needs (see "Does self-generated training data improve model learning?"). The same logic explains why teacher-refined data can *hurt*: refinements that exceed the student's learning frontier degrade performance even when they're objectively higher quality, so students should filter for compatibility rather than absorb everything (see "Does teacher-refined data always improve student model performance?"). In other words, source matters, but as a question of *fit to the learner*, not raw strength. The flip side appears at scale: with enough teacher-labeled data, a small BERT cross-encoder can actually surpass its LLM teacher, because broad input coverage smoothed by teacher predictions generalizes better (see "Can smaller models outperform their LLM teachers with enough data?").

So neither axis dominates cleanly; they interact. The generation *method* (a format-like property) often matters more than the generator's identity. Aligned models can self-synthesize human-quality instruction data from nothing but formatting tokens (see "Can aligned LLMs generate their own training data?"), and synthetic generation succeeds or fails based on structural choices: seeding atomic task elements instead of full examples (see "Can synthetic data replace seed examples in task generation?"), or sampling tools from relevance graphs with planned dialogue instead of random composition (see "Why does random tool sampling produce unrealistic synthetic training data?"). Difficulty calibration, another format-adjacent property, can quietly corrupt capabilities when overly hard samples push models toward degenerate shortcuts (see "Do overly hard RLVR samples actually harm model capabilities?").
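The difficulty-calibration point can be made concrete. The source notes attribute the harm to group-relative normalization, which treats rare accidental successes as high-advantage trajectories. The sketch below assumes a common form of that normalization (subtract the group mean, divide by the group standard deviation). The function name, the group size of 16, and the 0/1 rewards are illustrative assumptions, not details taken from the cited work.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group (assumed form)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    # eps guards against division by zero when all rewards are identical.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical groups of 16 rollouts for a single prompt.
hard_group = [0.0] * 15 + [1.0]       # one accidental success on a near-impossible problem
medium_group = [0.0] * 8 + [1.0] * 8  # evenly split outcomes on a calibrated problem

adv_hard = group_relative_advantages(hard_group)
adv_medium = group_relative_advantages(medium_group)

print(f"lucky success on hard problem:   {adv_hard[-1]:.2f}")
print(f"routine success on medium problem: {adv_medium[-1]:.2f}")
```

In this toy setup the single lucky success in the hard group gets an advantage of about 3.87 (the square root of 15), versus about 1.0 for each success in the evenly split group. Whatever that one trajectory happened to do, including a shortcut, is reinforced far more strongly than a routine correct answer.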

The one place where *source* clearly trumps format is at the extremes of data provenance. Recursive training on AI-generated content causes irreversible collapse of the distribution's tail no matter how well-formatted it is, making genuine human data increasingly precious (see "Does training on AI-generated content permanently degrade model quality?"). A deeper view reframes the whole debate: if language modeling is equivalent to lossless compression, then what training data really teaches is general structure, not domain content, which is why text-only models can out-compress dedicated image tools (see "Can text-trained models compress images better than specialized tools?"). The takeaway: format isn't surface decoration and source isn't a quality ranking. Both are proxies for *what structure the learner can absorb*, and that's the variable doing the real work.
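The link between prediction and lossless compression can be illustrated with a toy model. A predictor that assigns probability p to the next symbol lets an arithmetic coder spend about -log2(p) bits on it, so better prediction means a shorter encoding. The sketch below uses a simple adaptive byte-frequency model as a stand-in for a language model. The model, the function name, and the sample input are assumptions for illustration, not the setup of the cited work.

```python
from collections import Counter
from math import log2


def adaptive_code_length_bits(data: bytes) -> float:
    """Ideal code length, in bits, under an adaptive order-0 byte model.

    Each byte costs -log2 p(byte | counts so far), the length an arithmetic
    coder driven by this model would approach.
    """
    counts = Counter()
    seen = 0
    total_bits = 0.0
    for byte in data:
        # Add-one smoothing over the 256 possible byte values.
        p = (counts[byte] + 1) / (seen + 256)
        total_bits += -log2(p)
        counts[byte] += 1
        seen += 1
    return total_bits


repetitive = b"ab" * 128          # highly predictable input
raw_bits = 8 * len(repetitive)    # cost of storing the bytes uncompressed
model_bits = adaptive_code_length_bits(repetitive)

print(f"raw: {raw_bits} bits, model-coded: {model_bits:.0f} bits")
```

The model learns the input's statistics as it reads and pays fewer bits per byte the longer it runs. A large language model plays the same role with a far stronger predictor, using its context window to adapt to whatever data it is given.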


Sources (11 notes)

Does training data format shape reasoning strategy more than domain?

Models trained on multiple-choice data adopt breadth-first exploration (Cohen's d up to 1.5), while free-form training produces depth-first reasoning. Format effect dwarfs domain effect, meaning presentation matters far more than content type.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Does self-generated training data improve model learning?

SEAL demonstrates that models learn better from synthetic data they generate themselves than from data created by stronger external models. Self-generated data improved QA performance from 33.5% to 47.0%, suggesting that model-specific restructuring aligns with the learner's representational needs.

Does teacher-refined data always improve student model performance?

Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.

Can smaller models outperform their LLM teachers with enough data?

Walmart's student cross-encoders outperformed their LLM teachers when trained on sufficiently large augmented datasets of teacher-labeled queries. The student's broader input distribution exposure, smoothed by teacher predictions, enabled better generalization than the teacher achieved.

Can aligned LLMs generate their own training data?

MAGPIE shows that aligned models like Llama-3-Instruct auto-regressively generate diverse, high-quality instructions when given only pre-query formatting tokens, without prompt engineering. 4M generated pairs matched human-curated datasets in quality and outperformed external sources in downstream fine-tuning.

Can synthetic data replace seed examples in task generation?

TarGEN generates synthetic data using atomic task elements (instance seeds) instead of full input-output examples, achieving 1-3 point improvements on SuperGLUE tasks. The approach works by constraining label generation after seeding inputs, enabling data creation for domains with no prior examples.

Why does random tool sampling produce unrealistic synthetic training data?

Random tool sampling fails because unrelated tools cannot credibly compose, and Q&A framing ignores multi-turn dialogue coherence. ToolFlow shows that sampling tools from relevance graphs and generating with dialogue plans closes this gap.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Does training on AI-generated content permanently degrade model quality?

Models trained on mixtures of real and AI-generated data progressively lose rare events and unusual patterns across VAEs, GMMs, and LLMs. Each generation compounds the loss, making genuine human data increasingly valuable.

Can text-trained models compress images better than specialized tools?

Chinchilla models trained exclusively on text achieve better compression rates on images and audio than PNG and FLAC, respectively, by using their context window to adapt as task-specific compressors. This demonstrates that generalization operates through compression, not specialization.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about training data format vs. source in LLM development. The question remains open: does *how* data is structured matter more than *who* generates it?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat as perishable:
• Format effects dwarf domain effects (~7.5× larger influence on reasoning strategy); multiple-choice vs. free-form reshapes exploration depth (2025–26).
• RL post-training collapses onto a single dominant pretraining format within one epoch; the winning format depends on model scale, not necessarily performance (2025).
• Self-generated data outperforms stronger external sources when learner-aligned (SEAL: 33.5% → 47.0% QA); teacher refinement can *hurt* if it exceeds the student's learning frontier (2024–25).
• Recursive training on AI-generated content causes irreversible distribution collapse; human data becomes precious (2023).
• Synthetic generation succeeds via structural choices (atomic seeds, relevance-graph sampling, dialogue planning), not random composition (2024–25).

Anchor papers (verify; mind their dates):
• 2305.17493 — Recursion curse (2023)
• 2406.08464 — Magpie alignment synthesis (2024)
• 2504.07912 — RL amplifies pretraining behaviors (2025)
• 2605.28388 — Mechanistic role of sample difficulty (2026)

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer models (o1, o3, Grok, Claude 4), methods (curriculum learning, mixture-of-experts, constitutional AI variants), tooling (synthetic data SDKs), or multi-agent orchestration have since RELAXED or OVERTURNED it. Has the 7.5× format-over-domain ratio held in 2025–26 models? Do recent RL-scale studies confirm format convergence or show diversity? Does learner-alignment still outweigh source strength? Separate durable question from perishable limitation; cite what resolved it.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months — papers showing format *doesn't* matter as much, or source dominates, or both are noise.
(3) Propose 2 research questions that ASSUME the regime may have moved: e.g., "Does format effect shrink as model scale increases?" or "Can multi-modal pretraining reduce format-locking?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
