INQUIRING LINE


The format an AI is trained on, multiple-choice vs. open-ended, shapes its thinking habits more than the subject it learns about.

Does training data format shape reasoning strategy more than domain content?

This explores whether *how* training data is presented — its format, like multiple-choice vs. free-form — shapes the reasoning style a model adopts more than *what* the data is about (its subject domain).

The corpus answers this surprisingly directly: yes, and by a wide margin. One study found that training *format* shaped reasoning strategy roughly 7.5 times more strongly than domain content. Models trained on multiple-choice data learned to reason broadly, scanning many options before committing (breadth-first), while free-form training pushed them toward following one line of thought deeply (depth-first). Presentation, not topic, set the cognitive habit (see "Does training data format shape reasoning strategy more than domain?").

Why would format have such leverage? A clue comes from work showing that reasoning ability isn't really *created* during training; it's already latent in the base model, and training mostly selects and routes it. Five independent methods all elicit reasoning that pre-exists in the base model's activations rather than installing it (see "Do base models already contain hidden reasoning ability?"), and RL post-training has been characterized as teaching a model *when* to deploy reasoning, not *how* to reason (see "Does RL post-training create reasoning or just deploy it?"). If the raw capability is already there, then the training signal's main job is to shape *strategy and deployment*, which is exactly the lever that format pulls.

There's a deeper version of this idea: what transfers in reasoning is *procedural* knowledge, the how-to patterns drawn from many documents, rather than fact-specific recall, which depends on narrow memorization (see "Does procedural knowledge drive reasoning more than factual retrieval?"). Format is essentially a procedural template. A multiple-choice layout teaches one procedure (enumerate, compare, eliminate); a free-form prompt teaches another (commit, elaborate, follow through). The model absorbs the procedure regardless of whether the questions were about math or medicine. This also fits the finding that knowledge and reasoning live in different parts of the network, with facts in lower layers and reasoning adjustments in higher ones, so domain content and reasoning strategy can be tuned somewhat independently (see "Why does reasoning training help math but hurt medical tasks?").

But the format-driven strategy is fragile in a revealing way. Chain-of-thought reasoning degrades predictably once you shift the task, length, or *format* away from what the model trained on: models keep producing fluent reasoning that's structurally familiar but logically hollow, imitating the *form* without the underlying logic (see "Does chain-of-thought reasoning actually generalize beyond training data?"). That's the flip side of the headline finding: if format is what's really being learned, then a format the model hasn't seen is exactly what breaks it. The same brittleness shows up with input length, where accuracy drops sharply at lengths far short of the context limit, regardless of task (see "Does reasoning ability actually degrade with longer inputs?").

The practical upshot is that reasoning strategies behave like steerable, almost modular settings rather than deep properties of subject expertise. Verbose vs. concise reasoning turns out to be a single linear direction you can adjust without retraining at all (see "Can we steer reasoning toward brevity without retraining?"), and domain-adaptation methods consistently trade visible gains for hidden costs in format flexibility and reasoning faithfulness (see "How do domain training techniques actually reshape model behavior?"). If you want to change *how* a model thinks, you may have more leverage reshaping the format of what it sees than the field it studies.

Sources: 9 notes

Does training data format shape reasoning strategy more than domain?

Models trained on multiple-choice data adopt breadth-first exploration (Cohen's d up to 1.5), while free-form training produces depth-first reasoning. Format effect dwarfs domain effect, meaning presentation matters far more than content type.
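For readers unfamiliar with the effect-size measure quoted above, here is a minimal sketch of how Cohen's d is commonly computed. It assumes the pooled-standard-deviation variant; the note does not say which variant the study used, and the function name and the sample scores below are hypothetical, made up purely for illustration.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample standard deviation.

    Assumption: pooled-SD variant of Cohen's d; the cited study may use another.
    """
    n_a, n_b = len(group_a), len(group_b)
    # statistics.variance is the sample variance (n - 1 in the denominator).
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical "breadth of exploration" scores for two training formats.
# These numbers are invented for the example, not taken from any paper.
multiple_choice_trained = [0.62, 0.70, 0.66, 0.74, 0.68]
free_form_trained = [0.55, 0.60, 0.58, 0.66, 0.61]

d = cohens_d(multiple_choice_trained, free_form_trained)
print(f"Cohen's d = {d:.2f}")
```

By the usual convention, values around 0.8 are already called large, so a d of 1.5 indicates that the two groups' score distributions overlap relatively little.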

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.

Does procedural knowledge drive reasoning more than factual retrieval?

Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.

Why does reasoning training help math but hurt medical tasks?

Two-phase inference model shows knowledge retrieval operates in lower network layers while reasoning adjustment happens in higher layers. This separation explains why reasoning training improves math but can degrade knowledge-intensive domains like medicine.


Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Does reasoning ability actually degrade with longer inputs?

FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.

Can we steer reasoning toward brevity without retraining?

Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.
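The idea of a single steering vector can be sketched roughly as follows. This assumes the direction is the difference between mean activations on concise and verbose examples, which is one common way to build such a vector and not necessarily the method used in the cited work. All function names are hypothetical, and the tiny two-dimensional activations stand in for real hidden states.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def steering_vector(verbose_acts, concise_acts):
    """Direction pointing from the verbose mean toward the concise mean.

    Assumption: difference of means over paired examples; the actual
    extraction procedure in the cited work may differ.
    """
    verbose_mean = mean_vector(verbose_acts)
    concise_mean = mean_vector(concise_acts)
    return [c - v for c, v in zip(concise_mean, verbose_mean)]


def steer(hidden_state, direction, strength=1.0):
    """Shift a hidden state along the steering direction; no weights change."""
    return [h + strength * d for h, d in zip(hidden_state, direction)]


# Toy stand-ins for activations collected from paired prompts.
verbose_acts = [[1.0, 0.0], [3.0, 0.0]]
concise_acts = [[0.0, 1.0], [0.0, 3.0]]

direction = steering_vector(verbose_acts, concise_acts)
steered = steer([1.0, 1.0], direction, strength=0.5)
```

Because the shift is applied at inference time, the `strength` parameter acts as a dial: larger values push harder toward the concise style, and zero leaves the model unchanged.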

How do domain training techniques actually reshape model behavior?

Research shows every adaptation method—from parameter-efficient tuning to knowledge graph curricula—has optimal conditions tied to specific domains. The key finding: visible benefits like performance gains often come with hidden degradation in reasoning faithfulness, capability transfer, and format flexibility.

Papers this line draws on: 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM researcher tasked with re-evaluating whether training data *format* shapes reasoning strategy more than domain content—a claim staked by a curated arXiv library (2023–2025). Treat the findings below as dated claims to be re-tested, not current truth.

What a curated library found — and when (findings span 2023–2025):
• Training format shaped reasoning strategy ~7.5× more strongly than domain content; multiple-choice data pushed breadth-first reasoning, free-form pushed depth-first (2025, arXiv:2505.10185).
• Reasoning ability pre-exists as latent capability in base models; training selects and routes it rather than installing it. RL post-training teaches *when* to reason, not *how* (2024–2025).
• Procedural knowledge from pretraining documents drives reasoning generalization more than fact-specific recall (2025, arXiv:2411.12580).
• Knowledge resides in lower network layers, reasoning in higher layers—enabling partial independent tuning of domain and strategy (2025, arXiv:2507.18178).
• Chain-of-thought reasoning degrades predictably when format, length, or task distribution shifts from training; models produce fluent but logically hollow reasoning imitating *form* (2025, arXiv:2508.01191).

Anchor papers (verify; mind their dates):
• arXiv:2505.10185 (2025) — CoT Encyclopedia: predicting and controlling reasoning strategy;
• arXiv:2411.12580 (2025) — Procedural Knowledge in Pretraining;
• arXiv:2508.01191 (2025) — Chain-of-Thought as distribution-bounded mirage;
• arXiv:2507.18178 (2025) — Knowledge/Reasoning decoupling.

Your task:
(1) RE-TEST EACH CONSTRAINT. Has newer work (last ~6 months) shown that format brittleness can be overcome—e.g., via fine-tuning robustness methods, synthetic data generation, or multi-format exposure—such that reasoning *does* transfer across format shifts? Separately, does the 7.5× ratio hold under scaling laws, or do larger models decouple format and domain more? State plainly where the constraint still appears valid.
(2) Surface the strongest *contradicting* or *superseding* work from the last ~6 months: any evidence that domain content *does* shape strategy at scale, or that format effects are weaker in reasoning-optimized models (o1-style, verifier-trained)?
(3) Propose two research questions that assume the regime may have shifted: (a) Can you design a pre-training curriculum that makes format effects transparent/removable? (b) Do reasoning-native architectures (e.g., explicit search, tree-of-thought) bypass format sensitivity entirely?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
