Does training data format shape reasoning strategy more than domain?
What explains why models trained on multiple-choice data reason differently than those trained on free-form text? The research isolates format and domain effects to measure which one matters more.
The CoT Encyclopedia paper isolates two variables that could explain differences in reasoning strategy across models: training data domain (math vs. commonsense vs. coding) and training data format (multiple-choice vs. free-form). The finding is striking: the format effect reaches a Cohen's d of up to 1.5, while the domain effect stays consistently below 0.2, so the largest format effect is at least 7.5x the domain effect.
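For readers unfamiliar with the effect-size measure quoted above, here is a minimal sketch of Cohen's d using the pooled standard deviation. The function name and the sample scores are illustrative assumptions; this note does not say which variant of d or which strategy scores the paper used.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two observations")
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; d is undefined")
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical per-response strategy scores (not data from the paper).
mc_scores = [2.0, 4.0, 6.0]
ff_scores = [1.0, 3.0, 5.0]
d = cohens_d(mc_scores, ff_scores)
```

By Cohen's conventional benchmarks, 0.2 is a small effect and 0.8 a large one, which is why a d of up to 1.5 for format against a d below 0.2 for domain is notable.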
The pattern breaks down cleanly:
- Multiple-choice (MC) trained: Concise, structured answers. Explores multiple solution paths early, in the manner of breadth-first search (BFS). Filler words ("wait", "hmm") are absent.
- Free-form (FF) trained: Verbose, sequential reasoning with frequent verification loops, in the manner of depth-first search (DFS). Filler words are common. A single reasoning path is followed iteratively.
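One of the surface signals in the list above, filler-word frequency, is easy to measure. The sketch below counts filler markers in a reasoning trace; the marker list contains only the two words this note names, and the function name and example trace are hypothetical.

```python
import re
from collections import Counter

# Assumed marker list: the note names only "wait" and "hmm".
FILLER_MARKERS = ("wait", "hmm")


def count_fillers(trace, markers=FILLER_MARKERS):
    """Count whole-word, case-insensitive occurrences of each filler marker."""
    words = re.findall(r"[a-z']+", trace.lower())
    counts = Counter(word for word in words if word in markers)
    return {marker: counts[marker] for marker in markers}


example = "Wait, that is not right. Hmm, let me verify. Wait, yes it is."
filler_counts = count_fillers(example)
```

A count like this captures only one symptom of the strategy difference; it says nothing by itself about whether a trace explores several paths or follows one.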
The practical implication is significant: if you want to control a model's reasoning strategy — whether it explores broadly before committing or digs deep on one path — change the format of its training data, not its domain. This is more tractable than domain curation because format is a presentation decision, not a content decision.
The CoT Encyclopedia goes further: it demonstrates that this formatting signature persists and is controllable. By linearly interpolating model weights between MC-trained and FF-trained versions, you can produce models that smoothly transition in strategy without fine-tuning. Strategy becomes a parameter, not an emergent property.
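The interpolation described above can be sketched as a per-parameter convex combination of two checkpoints. This is a minimal illustration using plain Python lists rather than a real tensor library; the function name, the toy checkpoints, and the assumption that both models share identical parameter names and shapes are mine, not details from the paper.

```python
def interpolate_weights(mc_weights, ff_weights, alpha):
    """Return (1 - alpha) * mc_weights + alpha * ff_weights, parameter by parameter."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if mc_weights.keys() != ff_weights.keys():
        raise ValueError("checkpoints must share the same parameter names")
    merged = {}
    for name, mc_values in mc_weights.items():
        ff_values = ff_weights[name]
        if len(mc_values) != len(ff_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            (1.0 - alpha) * m + alpha * f for m, f in zip(mc_values, ff_values)
        ]
    return merged


# Toy checkpoints (hypothetical values) standing in for MC- and FF-trained models.
mc_checkpoint = {"layer.weight": [0.0, 2.0], "layer.bias": [1.0]}
ff_checkpoint = {"layer.weight": [1.0, 4.0], "layer.bias": [3.0]}
halfway = interpolate_weights(mc_checkpoint, ff_checkpoint, alpha=0.5)
```

Here `alpha=0.0` reproduces the MC-trained checkpoint and `alpha=1.0` the FF-trained one, so `alpha` plays the role of the strategy parameter the note describes.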
This connects to "Why do reasoning models fail differently at training versus inference?": the entropy collapse problem may be partly a format artifact. MC training produces BFS-like exploration (more diverse across paths); FF training produces the collapse-prone depth-first profile that RL training then further narrows.
The finding also challenges the assumption that domain-specific training creates domain-specific reasoning styles. What changes domain-to-domain is not the reasoning strategy but the knowledge being applied. The strategy is set by format earlier in training.
RLVR spurious rewards confirm pretraining format as the controlling variable: the spurious-rewards finding provides independent evidence. Qwen2.5-Math improves nearly as much with random, incorrect, or format-only rewards as with ground-truth rewards (roughly 21-25% improvement), but Llama3.1 and OLMo2 fail completely with the same spurious rewards. The critical difference: Qwen's pretraining included extensive code-reasoning data, creating a latent "code reasoning" strategy that surfaces under any optimization pressure. The reward signal's content is irrelevant; what matters is that Qwen's pretraining format created a reasoning strategy that RLVR can activate regardless of reward quality. This is the format-dominance principle at the pretraining level: Qwen's code-format pretraining determines its RLVR responsiveness more than any post-training variable. See "Why do random rewards improve reasoning for some models but not others?"
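To make the reward categories above concrete, here is a hedged sketch of what ground-truth, incorrect, format-only, and random rewards might look like as functions. The note names only the categories; the `\boxed{...}` answer convention, the function names, and the 0/1 reward scale are assumptions for illustration.

```python
import random
import re


def extract_boxed(response):
    """Return the content of the last \\boxed{...} in a response, or None."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None


def ground_truth_reward(response, answer):
    """Rewards agreement with the correct answer."""
    return 1.0 if extract_boxed(response) == answer else 0.0


def incorrect_reward(response, wrong_answer):
    """Rewards agreement with a deliberately wrong label."""
    return 1.0 if extract_boxed(response) == wrong_answer else 0.0


def format_reward(response):
    """Rewards any boxed answer, regardless of its value."""
    return 1.0 if extract_boxed(response) is not None else 0.0


def random_reward(rng, p=0.5):
    """Ignores the response entirely and rewards with probability p."""
    return 1.0 if rng.random() < p else 0.0


response = "So the answer is \\boxed{42}."
rewards = {
    "ground_truth": ground_truth_reward(response, "42"),
    "incorrect": incorrect_reward(response, "7"),
    "format_only": format_reward(response),
    "random": random_reward(random.Random(0)),
}
```

Only the first function carries information about correctness; the other three are the kind of spurious signal the note says still improves Qwen2.5-Math.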
FinCoT extends this principle from training time to inference time. By embedding expert-derived reasoning blueprints (as Mermaid diagrams) within structured CoT prompts for financial reasoning, FinCoT improves accuracy from 63.2% to 80.5% while reducing generated tokens eightfold compared to unstructured CoT. The format-over-content principle holds bidirectionally: both training data format and prompt format shape reasoning strategy more than domain content. Domain-specific expert structure in the prompt acts as a format intervention, producing structured reasoning traces that align with expert practice. This connects format effects to domain specialization without requiring domain-specific training.
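A structured prompt of the kind described above can be sketched as a template that embeds a Mermaid diagram ahead of the question. The blueprint text, the instruction wording, and the function name below are hypothetical and are not taken from FinCoT; they only illustrate the general shape of placing an expert-derived diagram inside a CoT prompt.

```python
FENCE = "`" * 3  # built programmatically so this example does not nest literal fences

# Hypothetical blueprint for illustration only; not a FinCoT blueprint.
EXAMPLE_BLUEPRINT = """flowchart TD
    A[Identify the instrument and cash flows] --> B[Select the valuation formula]
    B --> C[Compute intermediate values]
    C --> D[Check units and sign]
    D --> E[State the final answer]"""


def build_structured_prompt(question, blueprint):
    """Embed a Mermaid reasoning blueprint ahead of the question."""
    return "\n".join(
        [
            "Follow the reasoning blueprint below step by step.",
            f"{FENCE}mermaid",
            blueprint,
            FENCE,
            f"Question: {question}",
            "Answer:",
        ]
    )


prompt = build_structured_prompt(
    "What is the present value of 100 received in one year at a 5% discount rate?",
    EXAMPLE_BLUEPRINT,
)
```

The diagram constrains the order and granularity of the reasoning steps, which is the sense in which the note treats expert structure in the prompt as a format intervention.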
The same principle operates at a finer scale within Long CoT: models trained on Long CoT demonstrations where 50% of numbers are randomly replaced achieve only 3.2% lower accuracy than those trained on correct samples, whereas shuffling 67% of reasoning steps causes a 13.3% accuracy drop. What distillation transfers is the structural architecture of reasoning (reflection, backtracking, self-validation sequences), not the specific content of individual steps. Format dominance extends inward: not just which training format produces which strategy, but within a format, the structural template matters more than factual content. See "What do models actually learn from chain-of-thought training?"
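The two perturbations contrasted above, corrupting content versus corrupting structure, can be sketched as follows. The exact procedures used in the underlying study are not given in this note, so both functions (their names, the digit-for-digit replacement rule, and the position-permutation rule) are assumptions meant only to show the difference between the two kinds of damage.

```python
import random
import re


def corrupt_numbers(trace, fraction, rng):
    """Replace roughly `fraction` of the numbers in a trace with random digits."""

    def maybe_replace(match):
        if rng.random() < fraction:
            return "".join(rng.choice("0123456789") for _ in match.group())
        return match.group()

    return re.sub(r"\d+", maybe_replace, trace)


def shuffle_steps(steps, fraction, rng):
    """Permute a randomly chosen `fraction` of steps among their own positions."""
    k = round(len(steps) * fraction)
    positions = rng.sample(range(len(steps)), k)
    targets = positions[:]
    rng.shuffle(targets)
    shuffled = list(steps)
    for source, target in zip(positions, targets):
        shuffled[target] = steps[source]
    return shuffled


rng = random.Random(0)
trace = "Step 1: 12 + 30 = 42. Step 2: 42 * 2 = 84."
steps = ["Restate the problem", "Try approach A", "Check the result", "Backtrack", "Conclude"]

content_corrupted = corrupt_numbers(trace, 0.5, rng)
structure_corrupted = shuffle_steps(steps, 0.67, rng)
```

`corrupt_numbers` leaves every reasoning move in place and damages only values, while `shuffle_steps` keeps every value and damages the ordering, which is the distinction the accuracy figures above turn on.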
Inquiring lines that use this note as a source 28
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- How much does training data format shape what reasoning strategy emerges?
- Why does training format shape reasoning strategy more than domain?
- Why does training data format matter more than domain content?
- Why does training data format shape reasoning strategy more than domain content?
- Why does training data format matter more than its domain content?
- Does training data format shape model reasoning more than domain content?
- What makes training data quality more important than quantity for reasoning?
- How does training format shape reasoning strategy more than content?
- How much does input format shape what reasoning strategy a model develops?
- Can the eight-dimension rubric predict which question types need decomposition?
- How does training data format shape whether models reason in parallel or sequentially?
- What role does KL penalty strength play in format selection?
- How much does training data presentation format shape reasoning ability?
- Why do question types determine retrieval and decomposition strategy in QA?
- Why do format and structure matter more than actual content in reasoning?
- Does training data format matter more than who generates it?
- Why does training order matter across different domain types?
- Does training data format shape which reasoning strategies LLMs develop?
- Can reasoning in free text then formatting separately recover performance?
- How does training data format shape which reasoning patterns emerge in models?
- Why does training data format shape reasoning strategy more than content?
- How does training on correct answer form differ mechanistically from training on failure analysis?
- Does training data format determine whether models collapse entropy or inflate variance?
- Can training format itself shape what reasoning strategy a model learns?
- Does training data format shape reasoning strategy more than domain content?
- Why does curriculum order matter when information theory says data order is irrelevant?
- How much does training data format influence reasoning strategy versus domain content?
- How does training data structure shape reasoning strategy more than domain content?
Related concepts in this collection 8
- Why do reasoning models fail differently at training versus inference?
Reasoning models exhibit two distinct failure modes—entropy collapse during training and variance inflation during inference—that appear unrelated but may share underlying causes. Understanding these dual problems could reveal whether separate or unified solutions are needed.
MC format (BFS-like) may be less collapse-prone than FF format (DFS-like); format is an upstream entropy variable
- Can reasoning topologies be formally classified as graph types?
This explores whether Chain of Thought, Tree of Thought, and Graph of Thought represent distinct formal graph structures with different computational properties. Understanding this matters because the topology itself determines what reasoning strategies are possible.
MC training naturally produces CoT-SC topology (parallel paths); FF training naturally produces CoT topology (single depth-first path)
- Does supervised fine-tuning actually improve reasoning quality?
While SFT boosts final-answer accuracy, does it degrade the quality and informativeness of the reasoning steps that justify those answers? This matters for high-stakes domains requiring auditable decision-making.
the InfoGain degradation may be partly a format effect: SFT uses domain-specific format which shifts reasoning strategy toward FF depth-first
- When does explicit reasoning actually help model performance?
Explicit reasoning improves some tasks but hurts others. What determines whether step-by-step reasoning chains are beneficial or harmful for a given problem?
the task-structure insight interacts with this: MC training produces a strategy better suited for structured tasks; FF for continuous judgment
- How much does the order of premises actually matter for reasoning?
When you rearrange the order of logical premises in a deduction task, does it change how well language models can solve it? This tests whether LLMs reason abstractly or process input sequentially.
premise ordering is a format effect at inference time: same logic, different presentation, >30% accuracy shift
- Do strict output formats hurt LLM reasoning ability?
When LLMs must produce structured JSON or XML with specific schemas, does this constrain their capacity for complex reasoning? This matters because production systems often enforce strict formats for parsing convenience.
the inference-time complement to training-time format effects: training format determines which reasoning strategy the model develops (BFS vs DFS), while output format constraints at inference time degrade reasoning by competing for generation capacity; format is never neutral on either side of the training-inference boundary
- Why do random rewards improve reasoning for some models but not others?
When RLVR training uses meaningless reward signals, some models gain reasoning improvements while others don't. What determines which models can benefit from optimization pressure without meaningful feedback?
pretraining format as controlling variable: Qwen's code-reasoning format creates RLVR-responsive strategy regardless of reward quality
- Does training on messy search processes improve reasoning?
Can language models learn better problem-solving by observing full exploration trajectories—including mistakes and backtracking—rather than only optimal solutions? This matters because current LMs rarely see the decision-making process itself.
SoS is a format intervention: serializing search processes (BFS, DFS, backtracking) as training strings produces reasoning strategies that match the serialization format, consistent with format dominance over content
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- The CoT Encyclopedia: Analyzing, Predicting, and Controlling how a Reasoning Model will Think
- Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
- Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models
- Mining Hidden Thoughts from Texts: Evaluating Continual Pretraining with Synthetic Data for LLM Reasoning
- LLMs Can Easily Learn to Reason from Demonstrations: Structure, not content, is what matters!
- Learning to Reason for Factuality
- Decoupling Knowledge and Reasoning in LLMs: An Exploration Using Cognitive Dual-System Theory
- Knowledge or Reasoning? A Close Look at How LLMs Think Across Domains
Original note title
training data format shapes llm reasoning strategy more than domain content