How does training data format shape which reasoning patterns emerge in models?
This explores how the *shape* of training data — multiple-choice vs. free-form, correct vs. corrupted traces, the format styles present in pretraining — steers which reasoning behaviors a model actually exhibits, often more than the subject matter does.
The corpus's sharpest finding is that format isn't a cosmetic wrapper; it's the dominant lever. Models trained on multiple-choice data adopt breadth-first exploration, while free-form training produces depth-first reasoning, and the format effect outweighs the domain effect by roughly 7.5x (Does training data format shape reasoning strategy more than domain?). In other words, *how* a problem is presented teaches the model a search strategy more than *what* the problem is about.
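The underlying note reports the format effect as a standardized effect size (Cohen's d up to 1.5). As a reminder of what that statistic measures, here is a minimal sketch. The scores and variable names are made up for illustration and are not data from the corpus.

```python
import math
from statistics import mean, variance


def cohens_d(a, b):
    """Standardized mean difference using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


# Hypothetical "breadth" scores per response, e.g. how many distinct
# options a model explores before committing to an answer.
multiple_choice_trained = [4.0, 5.0, 6.0, 5.0, 5.0]
free_form_trained = [3.0, 4.0, 5.0, 4.0, 4.0]

d = cohens_d(multiple_choice_trained, free_form_trained)
print(f"Cohen's d = {d:.2f}")
```

With these toy numbers the two group means sit about 1.4 pooled standard deviations apart, which is the scale on which a d of 1.5 should be read.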
Why would presentation matter so much? Several notes converge on a provocative answer: the reasoning was largely already there, and format just selects which latent pattern surfaces. Base models appear to contain reasoning capability in latent form, which minimal interventions (RL steering, decoding tweaks, feature steering) can elicit rather than create (Do base models already contain hidden reasoning ability?). RL post-training looks less like teaching *how* to reason and more like teaching *when* to deploy reasoning the model already has (Does RL post-training create reasoning or just deploy it?). Strikingly, RL tends to amplify a single dominant format distribution inherited from pretraining within the first epoch while collapsing the alternatives, so the 'winning' reasoning style is partly an artifact of which formats pretraining happened to favor, not which performs best (Does RL training collapse format diversity in pretrained models?). A tiny 1.5B model with LoRA-only tuning can match larger RL models by learning output *format* alone, suggesting format-shaping and knowledge acquisition are separable (Can small models reason well by just learning output format?).
Here's the part that should unsettle you: if format is doing the heavy lifting, then the *content* of reasoning traces may matter far less than we assume. Models trained on deliberately corrupted, semantically irrelevant traces perform comparably to those trained on correct ones, and sometimes generalize better, implying that traces act as computational scaffolding rather than meaningful logical steps (Do reasoning traces need to be semantically correct?). This dovetails with a skeptical thread running through the corpus: chain-of-thought may be constrained imitation of reasoning *form* learned from training schemata, not genuine inference (Does chain-of-thought reasoning reveal genuine inference or pattern matching?; What makes chain-of-thought reasoning actually work?). The tell is that CoT degrades predictably once you push it outside its training distribution in task, length, or format, producing fluent but logically broken output (Does chain-of-thought reasoning actually generalize beyond training data?).
There's a meaningful counterweight, though. Not everything reduces to surface format. An analysis of five million pretraining documents found that reasoning draws on broad, transferable *procedural* knowledge spread across many sources, while factual recall depends on narrow memorization. The kind of data (procedural vs. factual) therefore genuinely shapes whether a model can generalize a reasoning skill at all (Does procedural knowledge drive reasoning more than factual retrieval?). Format selects the strategy; procedural content seems to determine whether that strategy transfers.
The practical payoff: because reasoning styles live in fairly clean, separable structures, you can manipulate them *after* training without touching the data at all. Verbose and concise CoT occupy distinct, linearly separable regions of activation space: a single steering vector cuts reasoning length by 67% while holding accuracy (Can we steer reasoning toward brevity without retraining?), and a decoding-only penalty on thought-switching tokens improves accuracy by stopping premature path-abandonment (Do reasoning models switch between ideas too frequently?). That these behaviors are this steerable at inference time is itself evidence for the corpus's central claim: training format imprints reasoning patterns as recoverable structure, not as deep new capability.
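Both inference-time interventions can be sketched in a few lines. This is an illustrative sketch, not the method from either source note: it assumes the steering vector is a plain difference of mean activations between concise and verbose examples, and that the switching penalty is a fixed logit subtraction applied while the current thought is still short. All function names, parameters, and the synthetic data are hypothetical.

```python
import numpy as np


def steering_vector(verbose_acts, concise_acts):
    """Difference of mean activations, pointing from verbose toward concise."""
    return concise_acts.mean(axis=0) - verbose_acts.mean(axis=0)


def steer(hidden, vector, strength=1.0):
    """Shift a hidden state along the steering direction."""
    return hidden + strength * vector


def penalize_switch_tokens(logits, switch_token_ids, penalty,
                           tokens_in_thought, min_thought_len):
    """Lower the logits of thought-switching tokens while the current
    thought is shorter than min_thought_len; otherwise leave them alone."""
    adjusted = logits.copy()
    if tokens_in_thought < min_thought_len:
        adjusted[switch_token_ids] -= penalty
    return adjusted


# Synthetic stand-ins for activations from 50 paired verbose/concise examples.
rng = np.random.default_rng(0)
verbose_acts = rng.normal(0.0, 1.0, size=(50, 8))
concise_acts = rng.normal(0.5, 1.0, size=(50, 8))
vector = steering_vector(verbose_acts, concise_acts)
steered = steer(verbose_acts[0], vector, strength=1.0)

# Toy next-token logits; index 0 plays the role of a switch token
# (a word like "Alternatively" in a real vocabulary).
logits = np.array([2.0, 1.5, 0.5])
early = penalize_switch_tokens(logits, [0], penalty=3.0,
                               tokens_in_thought=10, min_thought_len=50)
late = penalize_switch_tokens(logits, [0], penalty=3.0,
                              tokens_in_thought=80, min_thought_len=50)
print(int(np.argmax(early)), int(np.argmax(late)))
```

In the toy run, the switch token loses the argmax only while the thought is still short, which is the intended effect: the model is nudged to finish a line of reasoning before abandoning it. Neither function touches model weights, which is what makes both interventions training-free.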
Sources (12 notes)
Models trained on multiple-choice data adopt breadth-first exploration (Cohen's d up to 1.5), while free-form training produces depth-first reasoning. Format effect dwarfs domain effect, meaning presentation matters far more than content type.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
A 1.5B parameter model with LoRA-only post-training matched larger full-parameter RL models on reasoning tasks, suggesting RL teaches output format organization rather than new factual knowledge. This efficiency indicates reasoning and knowledge storage are separable capabilities.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.
CoT systems reproduce the form of reasoning through pattern matching rather than performing genuine logical inference. This explains why format effects dominate content, why structurally invalid prompts succeed, and why stronger reasoning models become less instruction-compliant.
DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.
Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.
Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.
o1-like models frequently abandon reasoning paths mid-exploration, wasting tokens on incomplete approaches. A decoding-only penalty on thought-transition tokens (TIP strategy) discourages switching, improving accuracy on challenging math without model fine-tuning.