Can smaller amounts of diverse reasoning demonstrations replace exhaustive factual training data?
This explores whether a model learns to reason better from a small set of varied worked-through examples than from piling up facts to memorize — i.e., whether reasoning is a transferable skill you can seed cheaply rather than a body of data you must exhaustively cover.
The corpus suggests a surprisingly strong yes, with a sharp caveat about what that reasoning actually is. The cleanest framing comes from work separating two kinds of learning: factual recall leans on narrow, document-specific memorization, while reasoning draws on broad, transferable procedural knowledge spread across many unrelated sources (Does procedural knowledge drive reasoning more than factual retrieval?). That's the foundation of your intuition: facts don't transfer, procedures do, so a few good procedures can travel further than a warehouse of facts.
If reasoning is procedural and transferable, you'd expect it to be cheap to elicit, and several notes converge on exactly that. Base models appear to already contain latent reasoning ability that only needs unlocking: five independent methods (RL steering, critique fine-tuning, decoding changes, SAE feature steering, RLVR) all surface reasoning that was already present, implying that post-training selects rather than creates it (Do base models already contain hidden reasoning ability?; Does RL post-training create reasoning or just deploy it?). The signal needed is also remarkably sparse: RLVR's learning signal concentrates in roughly 20% of tokens, the high-entropy 'forking points', and training on only those matches or exceeds full updates (Do high-entropy tokens drive reasoning model improvements?). Even Quiet-STaR shows reasoning competence can emerge as a side effect of ordinary language modeling on arbitrary text, with no curated task datasets required (Can models learn reasoning from predicting any text?).
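To make the 'forking points' idea concrete, here is a minimal sketch of how one might pick out the highest-entropy ~20% of token positions and restrict a loss to them. It assumes you already have per-position logits and per-token losses as NumPy arrays; the function names, the top-fraction threshold rule, and the masking scheme are illustrative assumptions, not the procedure used in the note's source.

```python
import numpy as np

def token_entropies(logits):
    """Shannon entropy (in nats) of the softmax distribution at each position.

    logits: array of shape (seq_len, vocab_size).
    """
    logits = np.asarray(logits, dtype=np.float64)
    # Subtract the row max for a numerically stable log-softmax.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    probs = np.exp(log_probs)
    return -(probs * log_probs).sum(axis=-1)

def forking_token_mask(logits, top_fraction=0.2):
    """Boolean mask marking the top `top_fraction` of positions by entropy.

    Assumption: 'forking tokens' are approximated by a simple rank cutoff.
    Ties at the cutoff are all included, so the mask can exceed the fraction.
    """
    ent = token_entropies(logits)
    k = max(1, int(round(top_fraction * ent.size)))
    threshold = np.sort(ent)[-k]
    return ent >= threshold

def masked_mean_loss(per_token_loss, mask):
    """Average the per-token loss over the masked (high-entropy) positions only."""
    per_token_loss = np.asarray(per_token_loss, dtype=np.float64)
    return float((per_token_loss * mask).sum() / mask.sum())
```

In a real training loop the same mask would gate which token positions contribute gradients; here it only selects which per-token losses are averaged.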
Here's the part you didn't know you wanted to know: the demonstrations may not even need to be correct. Models trained on systematically irrelevant reasoning traces perform comparably to those trained on valid ones, and sometimes generalize better out of distribution (Do reasoning traces need to be semantically correct?). That matters for your question: if traces function as computational scaffolding rather than carriers of factual content, then diversity of form may matter more than fidelity of fact, which would help explain why a small set could substitute for exhaustive data.
But 'replace' deserves scrutiny, because the same corpus warns what you'd be replacing it with. Chain-of-thought degrades predictably outside its training distribution: it stays fluent but becomes logically inconsistent, behaving like imitation of reasoning's form rather than genuine inference (Does chain-of-thought reasoning actually generalize beyond training data?; Does chain-of-thought reasoning reveal genuine inference or pattern matching?). And naive supervised fine-tuning can raise benchmark scores while cutting Information Gain, the note's measure of reasoning-step quality, by 38.9%, meaning the model reaches right answers through post-hoc rationalization (Does supervised fine-tuning improve reasoning or just answers?). So small-and-diverse can buy you a real reasoning protocol, but only if the demonstrations teach genuine procedure, not the cosmetic appearance of one.
The practical synthesis: the gap between reasoning and non-reasoning models is about training structure, not raw capability or inference budget (Can non-reasoning models catch up with more compute?), and you can install that structure cheaply. Verifier-free methods extend it to general domains without answer-checking (Can reasoning improvement work without answer verification?), and reasoning behaviors like verbosity turn out to be single steerable directions extractable from 50 paired examples (Can we steer reasoning toward brevity without retraining?). The corpus's answer to your question is that reasoning and factual coverage are different resources with different economics: you can seed the first with little, you can't memorize your way to it with a lot, and you can't fake it without the gap eventually showing under distribution shift.
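As a rough illustration of what a 'single steerable direction' extracted from paired examples could look like, the sketch below computes a difference-of-means vector from paired hidden activations and adds it back at inference time. The difference-of-means recipe, the function names, and the `strength` parameter are assumptions chosen for illustration; the notes do not specify how Activation-Steered Compression actually builds or applies its vector.

```python
import numpy as np

def steering_vector(verbose_acts, concise_acts):
    """Mean difference between paired activations.

    verbose_acts, concise_acts: arrays of shape (n_pairs, hidden_dim), where
    row i of each array comes from the same prompt answered in two styles.
    Assumption: a plain difference of means stands in for the real method.
    """
    verbose_acts = np.asarray(verbose_acts, dtype=np.float64)
    concise_acts = np.asarray(concise_acts, dtype=np.float64)
    if verbose_acts.shape != concise_acts.shape:
        raise ValueError("paired activations must have matching shapes")
    return (concise_acts - verbose_acts).mean(axis=0)

def apply_steering(hidden, vector, strength=1.0):
    """Shift a hidden state (or a batch of them) along the steering direction."""
    return np.asarray(hidden, dtype=np.float64) + strength * vector

# Synthetic stand-in for 50 paired examples with an 8-dimensional hidden state.
rng = np.random.default_rng(0)
verbose = rng.normal(size=(50, 8))
offset = np.arange(8.0)          # hypothetical "concise minus verbose" shift
concise = verbose + offset
direction = steering_vector(verbose, concise)
steered = apply_steering(verbose[0], direction, strength=0.5)
```

With real models the activations would come from a chosen layer's residual stream, and `strength` would be tuned so that outputs get shorter without losing accuracy.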
Sources (12 notes)
Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Evidence shows base models already contain reasoning capability in latent form; RL training optimizes when that capability is deployed rather than creating it. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.
Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
Quiet-STaR trains language models to generate rationales at every token position during pretraining on arbitrary internet text, enabling general reasoning without task-specific datasets. Rationale quality is judged by predictive accuracy rather than labeled correctness, allowing reasoning competence to emerge as a side effect of improved language modeling.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.
CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.
Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9%, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.
Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.
VeriFree bypasses answer verification entirely by using the conditional probability of reference answers given generated reasoning traces as both reward signal and training weight. This approach matches or surpasses verifier-based methods on MMLU-Pro, GPQA, and SuperGPQA without rule-based or model-based verifiers.
Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.