INQUIRING LINE

Why does training data format matter more than domain content?

This explores why how training data is shaped — multiple-choice vs. free-form, structured vs. raw, the format a model latches onto — appears to steer a model's behavior more powerfully than which subject the data is about.


The sharpest result in the corpus puts a number on it: a model's reasoning strategy is shaped roughly 7.5 times more by the format it was trained on than by the domain. Multiple-choice data pushes models toward breadth-first exploration; free-form data produces depth-first reasoning (see "Does training data format shape reasoning strategy more than domain?"). The content is almost incidental; it is the shape of the examples that installs the habit.

One reason format dominates is that training doesn't so much teach new knowledge as activate and amplify patterns already latent in the model. RL post-training, for instance, doesn't blend formats: it converges on a single dominant format inherited from pretraining and suppresses the alternatives within the first epoch, and which format wins depends on model scale, not necessarily on which one performs best (see "Does RL training collapse format diversity in pretrained models?"). The same activation-not-construction story shows up in alignment: in LIMA, 1,000 carefully curated examples fine-tuned on a strong base model rival models trained on orders of magnitude more data, because post-training surfaces existing capability rather than building it (see "Can careful curation replace massive alignment datasets?"). If training is mostly selecting among pre-existing behaviors, then the presentation of the data, the cue the model keys on, naturally outweighs the topic.

The deeper lesson is that models learn structure, not just text. StructTuning reaches 50% of full-corpus performance using 0.3% of the data by organizing chunks into an auto-generated domain taxonomy, so the model learns where a fact sits in a conceptual map rather than memorizing raw strings, much like a student learning from a textbook's organization rather than its word count (see "Can organizing knowledge structures beat raw training data volume?"). Relatedly, VQ-Rec maps item text to discrete codes before embedding, which transfers across domains better than encoding text directly, because the discrete intermediate reduces surface text bias (see "Can discrete codes transfer better than text embeddings?"). In both cases, the organizing format carries the generalization, not the domain vocabulary.

Format also has a mechanical, almost physical effect on training dynamics. Structured tasks drive output entropy down while creative tasks push it up, and simply changing the training order so that structured tasks come first yields a 6.2% gain over joint training by preventing entropy collapse from wrecking open-ended ability (see "Does training order reshape how models handle different task types?"). Push format too hard in the wrong direction and capabilities actively degrade: nearly-impossible RLVR samples teach degenerate shortcuts that contaminate skills the model already had (see "Do overly hard RLVR samples actually harm model capabilities?"). And domain-adaptation techniques carry hidden costs: visible performance gains are often paired with quiet losses in reasoning faithfulness, capability transfer, and format flexibility (see "How do domain training techniques actually reshape model behavior?").
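To make the entropy claim concrete, here is a minimal sketch of what "lower" and "higher" output entropy mean for a single next-token distribution. It assumes plain Shannon entropy in bits; the two distributions are invented for illustration and are not taken from the Omni-Thinker experiments, whose exact measurement procedure is not described here.

```python
from math import log2


def shannon_entropy(probs):
    """Entropy in bits of a discrete distribution; zero-probability terms are skipped."""
    return -sum(p * log2(p) for p in probs if p > 0)


# Hypothetical next-token distributions (illustrative values only).
peaked = [0.97, 0.01, 0.01, 0.01]  # what a structured task might push toward
flat = [0.25, 0.25, 0.25, 0.25]    # what an open-ended task might need

print(f"peaked: {shannon_entropy(peaked):.3f} bits")
print(f"flat:   {shannon_entropy(flat):.3f} bits")
```

A model whose distributions all look like `peaked` has little probability mass left for alternative continuations, which is one way to read "entropy collapse" damaging open-ended ability.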

The useful takeaway for anyone building with these models: if you want to change how a model thinks, redesign the shape of your examples, not just their subject matter. The flip side is a caution: format isn't free to copy across models. Teacher-refined data that's objectively higher quality can still degrade a student if it exceeds the student's learning frontier, so the right format is the one compatible with the model you're actually training (see "Does teacher-refined data always improve student model performance?").


Sources (9 notes)

Does training data format shape reasoning strategy more than domain?

Models trained on multiple-choice data adopt breadth-first exploration (Cohen's d up to 1.5), while free-form training produces depth-first reasoning. Format effect dwarfs domain effect, meaning presentation matters far more than content type.
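For readers unfamiliar with the effect-size measure cited here, the following is a minimal sketch of Cohen's d using the pooled sample standard deviation. The scores are hypothetical "breadth-first tendency" values invented for illustration; they are not data from the study, and the study's own scoring method is not described in this note.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d: difference in means divided by the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


# Hypothetical per-response scores (illustrative values only).
mc_trained = [0.62, 0.70, 0.58, 0.66, 0.74]  # model trained on multiple-choice data
ff_trained = [0.50, 0.55, 0.47, 0.60, 0.53]  # model trained on free-form data

d = cohens_d(mc_trained, ff_trained)
print(f"Cohen's d = {d:.2f}")
```

By the usual rule of thumb, values around 0.8 already count as a large effect, so a d of up to 1.5 means the two groups' distributions overlap relatively little.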

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Can careful curation replace massive alignment datasets?

LIMA demonstrates that 1000 carefully curated examples fine-tuned on a strong pretrained model achieve competitive alignment performance with models trained on orders of magnitude more data, showing that post-training activates existing capabilities rather than building new ones.

Can organizing knowledge structures beat raw training data volume?

StructTuning achieves 50% of full-corpus performance using only 0.3% of training data by organizing chunks into auto-generated domain taxonomies. The model learns knowledge position within conceptual structures rather than raw text patterns, matching how students learn from textbooks.

Can discrete codes transfer better than text embeddings?

VQ-Rec demonstrates that mapping item text to discrete codes via product quantization, then to embeddings, improves cross-domain transfer compared to direct text encoding. The discrete intermediate reduces text bias and enables efficient per-domain fine-tuning.
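The encoding step of product quantization can be sketched in a few lines. This is a toy illustration under stated assumptions, not VQ-Rec's implementation: the codebooks below are hand-written (in practice they would be learned, for example with k-means), and the names `nearest` and `pq_encode` are hypothetical.

```python
def nearest(sub, centroids):
    """Index of the centroid closest to `sub` by squared Euclidean distance."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((x - c) ** 2 for x, c in zip(sub, centroids[k])),
    )


def pq_encode(vec, codebooks):
    """Split `vec` into len(codebooks) equal chunks and map each chunk to a code index."""
    m = len(codebooks)
    width = len(vec) // m
    return [nearest(vec[i * width:(i + 1) * width], codebooks[i]) for i in range(m)]


# Hypothetical codebooks: 2 sub-spaces, 2 centroids each, 2 dimensions per sub-space.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
item_vec = [0.9, 0.8, 0.1, 0.7]  # stand-in for a text-encoder output

codes = pq_encode(item_vec, codebooks)
print(codes)
```

The item is now represented by a short list of integers rather than by the text embedding itself, and a downstream model can look up trainable embeddings for those integers. That indirection is the "discrete intermediate" the note refers to.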

Does training order reshape how models handle different task types?

Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
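A small worked example shows why a rare accidental success gets an outsized update under group-relative normalization. This sketch assumes binary rewards and normalization by the group's population standard deviation; real implementations differ in details such as sample versus population deviation and the epsilon term, and the function name is hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward by its group's mean and standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical groups of 8 rollouts each (illustrative values only).
hard = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]      # near-impossible: one lucky success
moderate = [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]  # moderate: half succeed

adv_hard = group_relative_advantages(hard)
adv_moderate = group_relative_advantages(moderate)
print(f"lone success on hard problem: {adv_hard[0]:.2f}")
print(f"success on moderate problem:  {adv_moderate[0]:.2f}")
```

Under these assumptions the lone success on the hard problem receives an advantage more than twice as large as a success on the moderate problem, so whatever that trajectory did, sound or not, is strongly reinforced.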

How do domain training techniques actually reshape model behavior?

Research shows every adaptation method—from parameter-efficient tuning to knowledge graph curricula—has optimal conditions tied to specific domains. The key finding: visible benefits like performance gains often come with hidden degradation in reasoning faithfulness, capability transfer, and format flexibility.

Does teacher-refined data always improve student model performance?

Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a capability researcher auditing claims about training data format vs. domain content in LLM behavior. The question remains open: does format truly dominate domain, and if so, under what conditions?

What a curated library found — and when (dated claims, not current truth):
Findings span 2022–2026; treat as perishable snapshots:
- Format shapes reasoning strategy ~7.5× more than domain; multiple-choice drives breadth-first, free-form drives depth-first (2025–2026).
- RL post-training converges on a single dominant pretraining format within one epoch, regardless of task performance; format choice driven by scale, not quality (2025).
- StructTuning achieves 50% full-corpus performance using 0.3% of data by organizing into domain taxonomy; organization, not word count, drives transfer (2024).
- Discrete item codes transfer better across domains than direct text embedding because they strip surface text bias (2022–2023).
- Overly-hard RLVR samples induce degenerate shortcuts that contaminate existing skills (2026).

Anchor papers (verify; mind their dates):
- arXiv:2504.07912 (2025) — Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining
- arXiv:2407.16724 (2024) — Educating LLMs like Human Students: Structure-aware Injection of Domain Knowledge
- arXiv:2507.14783 (2025) — Omni-Thinker: Scaling Multi-Task RL in LLMs with Hybrid Reward and Task Scheduling
- arXiv:2605.28388 (2026) — Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs

Your task:
(1) RE-TEST THE DOMINANCE CLAIM. For each constraint above, check whether: newer scaling regimes, constitutional AI / self-refinement methods, fine-grained format control via prompt engineering or LoRA, or recent multi-modal pre-training have since RELAXED the 7.5× ratio or revealed it as regime-specific. Does the format dominance hold for instruction-tuning? For in-context learning? Cite what moved it, and flag where it still binds.
(2) Surface the strongest DISAGREEMENT in the last 6 months: do recent papers on mechanistic interpretability (e.g., arXiv:2510.20941 on precedent-handling) contradict the "activation not construction" story? Or do they confirm it? Name a paper that pushes back.
(3) Propose 2 questions that ASSUME the regime has shifted: (a) If format dominance weakens at scale, what replaces it as the primary lever? (b) Can a single data format be made universally compatible across student models, or is student-model selection (as cited above) inevitable?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
