INQUIRING LINE

An AI learns how to reason more from the format of its training data than from what the data is about.

How does training data format shape which reasoning patterns emerge in models?

This explores how the *shape* of training data — multiple-choice vs. free-form, correct vs. corrupted traces, the format styles present in pretraining — steers which reasoning behaviors a model actually exhibits, often more than the subject matter does.

The corpus's sharpest finding is that format is not a cosmetic wrapper; it is the dominant lever. Models trained on multiple-choice data adopt breadth-first exploration, while free-form training produces depth-first reasoning, and the format effect outweighs the domain effect by roughly 7.5x (see "Does training data format shape reasoning strategy more than domain?"). In other words, *how* a problem is presented teaches the model a search strategy more than *what* the problem is about.

Why would presentation matter so much? Several notes converge on a provocative answer: the reasoning was largely already there, and format just selects which latent pattern surfaces. Base models appear to contain reasoning capability in latent form, which minimal interventions (RL steering, decoding tweaks, feature steering) can elicit rather than create (see "Do base models already contain hidden reasoning ability?"). RL post-training looks less like teaching *how* to reason and more like teaching *when* to deploy reasoning the model already has (see "Does RL post-training create reasoning or just deploy it?"). Strikingly, RL tends to amplify a single dominant format distribution inherited from pretraining within the first epoch while collapsing the alternatives, so the 'winning' reasoning style is partly an artifact of which formats pretraining happened to favor and of model scale, not necessarily of which format performs best (see "Does RL training collapse format diversity in pretrained models?"). A 1.5B-parameter model with LoRA-only tuning can match larger RL models by learning output *format* alone, suggesting that format-shaping and knowledge acquisition are separable (see "Can small models reason well by just learning output format?").

Here's the part that should unsettle you: if format is doing the heavy lifting, then the *content* of reasoning traces may matter far less than we assume. Models trained on deliberately corrupted, semantically irrelevant traces perform comparably to those trained on correct ones, and sometimes generalize better, implying that traces act as computational scaffolding rather than meaningful logical steps (see "Do reasoning traces need to be semantically correct?"). This dovetails with a skeptical thread running through the corpus: chain-of-thought may be constrained imitation of reasoning *form* learned from training schemata, not genuine inference (see "Does chain-of-thought reasoning reveal genuine inference or pattern matching?" and "What makes chain-of-thought reasoning fail in language models?"). The tell is that CoT degrades predictably once you push it outside its training distribution, whether in task, length, or format, producing fluent but logically broken output (see "Does chain-of-thought reasoning actually generalize beyond training data?").

There's a meaningful counterweight, though. Not everything reduces to surface format. An analysis of five million pretraining documents found that reasoning draws on broad, transferable *procedural* knowledge spread across many sources, while factual recall depends on narrow memorization. The kind of data (procedural vs. factual) therefore genuinely shapes whether a model can generalize a reasoning skill at all (see "Does procedural knowledge drive reasoning more than factual retrieval?"). Format selects the strategy; procedural content seems to determine whether that strategy transfers.

The practical payoff: because reasoning styles live in fairly clean, separable structures, you can manipulate them *after* training without touching the data at all. Verbose and concise CoT occupy distinct, linearly separable regions of activation space, and a single steering vector cuts reasoning length by 67% while holding accuracy (see "Can we steer reasoning toward brevity without retraining?"). Likewise, a decoding-only penalty on thought-switching tokens improves accuracy by stopping premature path abandonment (see "Do reasoning models switch between ideas too frequently?"). That these behaviors are this steerable at inference time is itself evidence for the corpus's central claim: training format imprints reasoning patterns as recoverable structure, not as deep new capability.

Sources (12 notes)

Does training data format shape reasoning strategy more than domain?

Models trained on multiple-choice data adopt breadth-first exploration (Cohen's d up to 1.5), while free-form training produces depth-first reasoning. Format effect dwarfs domain effect, meaning presentation matters far more than content type.
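For readers unfamiliar with the effect-size measure quoted in this note, the following is a minimal sketch of how Cohen's d is commonly computed with a pooled standard deviation. The function name `cohens_d` and the sample scores are hypothetical and invented purely for illustration; they are not data from the cited study, and the study's exact procedure may differ.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference using a pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


# Hypothetical "breadth of exploration" scores for two training formats.
multiple_choice_scores = [2.0, 4.0, 6.0]
free_form_scores = [1.0, 3.0, 5.0]

print(cohens_d(multiple_choice_scores, free_form_scores))  # prints 0.5
```

By the usual rule of thumb, values around 0.8 already count as a large effect, so a d of up to 1.5 indicates strongly separated behavior distributions.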

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Can small models reason well by just learning output format?

A 1.5B parameter model with LoRA-only post-training matched larger full-parameter RL models on reasoning tasks, suggesting RL teaches output format organization rather than new factual knowledge. This efficiency indicates reasoning and knowledge storage are separable capabilities.

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

Does chain-of-thought reasoning reveal genuine inference or pattern matching?

CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.

What makes chain-of-thought reasoning fail in language models?

Research shows CoT mirrors reasoning form without true logical abstraction. Format matters more than content, invalid prompts work as well as valid ones, and scaling reasoning creates instruction-following deficits.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Does procedural knowledge drive reasoning more than factual retrieval?

Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.

Can we steer reasoning toward brevity without retraining?

Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.
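The idea of a single steering vector derived from paired examples can be sketched as follows. This is an illustration under stated assumptions, not the method's actual implementation: it assumes the vector is the difference between mean activations of concise and verbose examples, and that steering adds a scaled copy of it to a hidden state. The names `mean_vector`, `steering_vector`, `apply_steering`, and `alpha`, as well as the tiny 2-dimensional activations, are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length activation vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def steering_vector(verbose_acts, concise_acts):
    """Assumed recipe: mean concise activation minus mean verbose activation."""
    mean_verbose = mean_vector(verbose_acts)
    mean_concise = mean_vector(concise_acts)
    return [c - v for c, v in zip(mean_concise, mean_verbose)]


def apply_steering(hidden_state, vector, alpha=1.0):
    """Shift a hidden state along the steering direction; alpha sets strength."""
    return [h + alpha * s for h, s in zip(hidden_state, vector)]


# Toy activations standing in for paired verbose/concise examples.
verbose = [[1.0, 0.0], [3.0, 0.0]]
concise = [[0.0, 1.0], [0.0, 3.0]]

direction = steering_vector(verbose, concise)
print(direction)                                          # prints [-2.0, 2.0]
print(apply_steering([1.0, 1.0], direction, alpha=0.5))   # prints [0.0, 2.0]
```

In a real model the shift would be applied to a chosen layer's activations at inference time, which is why no weights need to change.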

Do reasoning models switch between ideas too frequently?

o1-like models frequently abandon reasoning paths mid-exploration, wasting tokens on incomplete approaches. A decoding-only penalty on thought-transition tokens (TIP strategy) discourages switching, improving accuracy on challenging math without model fine-tuning.
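A decoding-only penalty of this kind can be sketched as a small adjustment to the logits before a token is picked. The form below is an assumption made for illustration: it subtracts a fixed `strength` from the logits of thought-switch tokens while the current thought is younger than `duration` tokens. The function names, parameter names, default values, and the three-token vocabulary are all hypothetical and are not taken from the TIP paper.

```python
def apply_switch_penalty(logits, switch_token_ids, tokens_into_thought,
                         strength=3.0, duration=600):
    """Lower the logits of thought-switch tokens early in the current thought."""
    if tokens_into_thought >= duration:
        # The thought has run long enough; allow switching without penalty.
        return list(logits)
    return [
        logit - strength if token_id in switch_token_ids else logit
        for token_id, logit in enumerate(logits)
    ]


def greedy_pick(logits):
    """Index of the highest logit (greedy decoding)."""
    return max(range(len(logits)), key=logits.__getitem__)


# Toy vocabulary: token 1 plays the role of a switch word such as "Alternatively".
logits = [1.0, 2.5, 2.0]
switch_ids = {1}

early = apply_switch_penalty(logits, switch_ids, tokens_into_thought=10)
late = apply_switch_penalty(logits, switch_ids, tokens_into_thought=1000)
print(greedy_pick(early))  # prints 2: the switch token is suppressed
print(greedy_pick(late))   # prints 1: the switch token is allowed again
```

Because the adjustment happens only at decoding time, the model itself is untouched, which matches the note's point that no fine-tuning is required.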

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens (4.30 match)
Eliciting Reasoning in Language Models with Cognitive Tools (3.46 match)
ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models (3.43 match)
CoT is Not True Reasoning, It Is Just a Tight Constraint to Imitate: A Theory Perspective (2.71 match)
When More is Less: Understanding Chain-of-Thought Length in LLMs (2.67 match)
Hierarchical Reasoning Model (2.66 match)
Measuring Faithfulness in Chain-of-Thought Reasoning (2.64 match)
Break the Chain: Large Language Models Can be Shortcut Reasoners (2.63 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-evaluating claims about how training data *format* shapes reasoning patterns in LLMs. The question remains open: does format truly dominate domain, or have newer models, training methods, or evaluation approaches since shifted this balance?

What a curated library found — and when (findings span Nov 2024–Dec 2025, treat as dated claims):
• Format effect outweighs domain effect by ~7.5x; multiple-choice trains breadth-first, free-form trains depth-first (2025-04 onward).
• Base models possess latent reasoning capability; RL and minimal interventions *select* rather than teach reasoning (2025-04, 2025-06).
• RL post-training converges on a single dominant pretraining format distribution within the first epoch, collapsing alternatives (2025-04).
• CoT may be constrained imitation of reasoning form, not genuine inference; degrades predictably outside training distribution (2025-06, 2025-08).
• Deliberately corrupted reasoning traces perform comparably to correct ones, suggesting traces are computational scaffolding, not meaningful logical steps (2025-05).
• Verbose vs. concise CoT are linearly separable in activation space; a single steering vector reduces thought length 67% while preserving accuracy (2025-07).

Anchor papers (verify; mind their dates):
• arXiv:2504.15777 (Tina, Apr 2025) — 1.5B LoRA model matches larger RL models via format adaptation alone.
• arXiv:2506.02878 (Jun 2025) — CoT as constrained imitation, not true reasoning.
• arXiv:2507.04742 (Jul 2025) — Activation steering for CoT compression.
• arXiv:2512.07783 (Dec 2025) — Interplay of pretraining, mid-training, RL on reasoning.

Your task:
(1) RE-TEST EACH CONSTRAINT. For the format-dominance claim (7.5x effect): have newer o1-family or frontier models, training schedules (e.g., multi-phase curriculum), or evaluation suites (reasoning on truly out-of-distribution tasks) since shown that *content* or *domain* can reclaim ground? Where does the latent-reasoning hypothesis hold, and where does it crack? Probe whether post-training phase (Dec 2025 papers) reshuffles the pretraining format legacy.
(2) Surface the strongest DISAGREEMENT or SUPERSEDING work from the last ~6 months. Does any paper argue format is *not* dominant, or that corrupted traces are actually harmful at scale?
(3) Propose 2 research questions that assume the regime *has* moved: (a) If format is recoverable structure, not deep capability, can you *transfer* format patterns across model families or training runs? (b) Does the latent-reasoning picture hold when you scale to reasoning on genuinely novel domains (e.g., formal verification, live scientific problems)?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
