Why does eliminating proxy-model filtering improve reasoning emergence in pretraining?
This explores why aggressively curating pretraining data with a small proxy model — keeping only text the proxy scores as high-quality — can suppress reasoning rather than improve it, and what the corpus says about where reasoning actually comes from.
The corpus doesn't have a paper on proxy-model filtering by name, but several notes converge on a mechanism that would explain the effect: reasoning is not acquired from clean, correct, proxy-approved text. It emerges from broad, diverse exposure, which a narrow quality filter is exactly the wrong tool to preserve. The clearest piece is the finding that reasoning generalization rides on *procedural* knowledge spread thinly across many heterogeneous documents, while only factual recall depends on narrow, document-specific text (see: Does procedural knowledge drive reasoning more than factual retrieval?). A proxy model trained to recognize 'good' data optimizes for the legible, fact-dense surface it can score, and in doing so it prunes the long, messy, low-prestige tail where transferable procedure actually lives. You filter for what looks like quality and accidentally remove what produces reasoning.
What makes this counterintuitive is that the usual instinct (keep correct content, drop the noise) turns out to misread how reasoning text functions. Models trained on *deliberately corrupted* reasoning traces, meaning traces made systematically irrelevant to the problem, maintain solution accuracy comparable to models trained on correct ones and sometimes generalize better out of distribution, because the traces act as computational scaffolding rather than as meaningful steps to be gotten 'right' (see: Do reasoning traces need to be semantically correct?). If semantic correctness isn't the active ingredient, then a proxy filter selecting for correctness is optimizing a metric that doesn't drive the capability, and paying for it by shrinking diversity.
That diversity cost is the real damage. Aggressive selection collapses the distribution of formats and styles the model sees, and the corpus documents the same kind of collapse elsewhere: RL post-training amplifies one dominant pretraining format within the first epoch while killing the alternatives, which shows how much latent variety pretraining holds and how easily a narrowing process buries it (see: Does RL training collapse format diversity in pretrained models?). A proxy filter does the narrowing earlier, before the model ever sees the variety.
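The narrowing described above can be made concrete with a toy calculation. The sketch below is illustrative only: the corpus, the format labels, the proxy scores, and the 0.8 threshold are all invented for the example, and `proxy_filter` is a hypothetical stand-in for a real quality classifier. It measures format diversity as Shannon entropy before and after a score threshold is applied.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy (in bits) of the empirical label distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def proxy_filter(docs, threshold):
    """Hypothetical proxy filter: keep documents scored at or above threshold."""
    return [doc for doc in docs if doc[1] >= threshold]


# Invented toy corpus of (format, proxy_score) pairs. The assumption baked in
# here is that the proxy scores polished formats higher than messy ones.
corpus = [
    ("encyclopedia", 0.95), ("encyclopedia", 0.92), ("encyclopedia", 0.90),
    ("textbook", 0.88), ("textbook", 0.85),
    ("forum_thread", 0.55), ("forum_thread", 0.50),
    ("code_comment", 0.45), ("code_comment", 0.40),
    ("worked_homework", 0.35), ("worked_homework", 0.30),
    ("chat_log", 0.20),
]

kept = proxy_filter(corpus, threshold=0.8)

entropy_before = shannon_entropy([fmt for fmt, _ in corpus])
entropy_after = shannon_entropy([fmt for fmt, _ in kept])

print(f"documents kept: {len(kept)} of {len(corpus)}")
print(f"format entropy before filtering: {entropy_before:.2f} bits")
print(f"format entropy after filtering:  {entropy_after:.2f} bits")
```

In this toy setup the filter keeps only two of the six formats, so entropy drops sharply. Whether a real proxy model behaves this way depends on how its scores correlate with format, which the example assumes rather than demonstrates.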
The deeper reason removing the filter helps is that reasoning is *latent and elicited, not taught*. Base models already carry reasoning capability that minimal training merely unlocks (see: Do base models already contain hidden reasoning ability?), and post-training largely decides *when* to deploy reasoning rather than creating it (see: Does RL post-training create reasoning or just deploy it?). That latent capacity has to be planted during pretraining, and the methods that plant it let reasoning emerge as a side effect of predicting arbitrary text, rewarding whatever genuinely improves prediction rather than whatever a proxy judges clean. Two examples: chain-of-thought learned as an exploratory action with an information-gain reward (see: Can chain-of-thought reasoning be learned during pretraining itself?), and token-level rationale generation over ordinary internet text (see: Can models learn reasoning from predicting any text?). A proxy filter pre-commits to a definition of useful before the model has had the chance to find prediction-improving structure for itself.
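The idea of rewarding "whatever genuinely improves prediction" can be sketched as a log-likelihood difference. This is a simplified reading of the information-gain reward mentioned above, not the exact formulation used by any of the cited methods; the function name and the probabilities are hypothetical and chosen only to show the sign of the reward.

```python
import math


def information_gain_reward(logp_with_thought, logp_without_thought):
    """Simplified reward: how much a generated thought improves the
    log-likelihood of the observed next token.

    reward = log p(token | context, thought) - log p(token | context)
    """
    return logp_with_thought - logp_without_thought


# Invented probabilities the model assigns to the true next token.
p_without_thought = 0.10   # baseline, no thought generated
p_helpful_thought = 0.40   # a thought that makes the token more predictable
p_unhelpful_thought = 0.05 # a thought that makes the token less predictable

r_helpful = information_gain_reward(
    math.log(p_helpful_thought), math.log(p_without_thought)
)
r_unhelpful = information_gain_reward(
    math.log(p_unhelpful_thought), math.log(p_without_thought)
)

print(f"reward for helpful thought:   {r_helpful:+.3f}")
print(f"reward for unhelpful thought: {r_unhelpful:+.3f}")
```

The point of the design is that no external judge of correctness or cleanliness appears anywhere in the reward: a thought is rewarded only if it makes the actual text more predictable, whatever that text is.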
The surprise worth taking away: the same trap shows up after pretraining too. Supervised fine-tuning on clean, correct examples raises final-answer accuracy on benchmarks while *cutting* Information Gain, a measure of genuine inferential quality, by 38.9 percent: the model learns to produce correct-looking answers via post-hoc rationalization (see: Does supervised fine-tuning improve reasoning or just answers?). Filtering for correctness, whether by a proxy model upstream or by curated examples downstream, keeps optimizing the wrong target. Reasoning seems to want breadth and the freedom to emerge, not a gatekeeper deciding in advance what good data looks like.
Sources (8 notes)
Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.
RLP treats CoT as exploratory action during pretraining, using log-likelihood improvement as verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.
Quiet-STaR trains language models to generate rationales at every token position during pretraining on arbitrary internet text, enabling general reasoning without task-specific datasets. Rationale quality is judged by predictive accuracy rather than labeled correctness, allowing reasoning competence to emerge as a side effect of improved language modeling.
Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.