INQUIRING LINE

Why does eliminating proxy-model filtering improve reasoning emergence in pretraining?

This explores why aggressively curating pretraining data with a small proxy model — keeping only text the proxy scores as high-quality — can suppress reasoning rather than improve it, and what the corpus says about where reasoning actually comes from.


The corpus has no paper on proxy-model filtering by name, but several notes converge on a mechanism that explains the effect: reasoning is not acquired from clean, correct, proxy-approved text. It emerges from broad, diverse exposure, which a narrow quality filter is exactly the wrong tool to preserve. The clearest piece is the finding that reasoning generalization rides on *procedural* knowledge spread thinly across many heterogeneous documents, while only factual recall depends on narrow, document-specific text (see "Does procedural knowledge drive reasoning more than factual retrieval?"). A proxy model trained to recognize 'good' data optimizes for the legible, fact-dense surface it can score, and in doing so prunes the long, messy, low-prestige tail where transferable procedure actually lives. You filter for what looks like quality and accidentally remove what produces reasoning.
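To make the pruning effect concrete, here is a minimal toy sketch of threshold filtering and its effect on source diversity. Everything in it is an assumption for illustration: the source labels, the `proxy_score` values, and the choice of Shannon entropy as a diversity measure are invented here and do not come from any paper in the corpus.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy (in bits) of the empirical distribution over labels."""
    counts = Counter(labels)
    total = len(labels)
    # Written as p * log2(1/p) so a single-label corpus yields exactly 0.0.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def proxy_filter(docs, threshold):
    """Keep only documents whose (hypothetical) proxy score meets the threshold."""
    return [d for d in docs if d["proxy_score"] >= threshold]


# Toy corpus: source type plus an invented proxy quality score.
corpus = (
    [{"source": "encyclopedia", "proxy_score": 0.9}] * 4
    + [{"source": "forum", "proxy_score": 0.4}] * 2
    + [{"source": "code_comments", "proxy_score": 0.3}] * 1
    + [{"source": "homework_help", "proxy_score": 0.5}] * 1
)

kept = proxy_filter(corpus, threshold=0.8)

# Diversity of source types before and after filtering.
entropy_before = shannon_entropy([d["source"] for d in corpus])
entropy_after = shannon_entropy([d["source"] for d in kept])
```

In this toy setup the filter keeps only the highest-scoring source type, so the entropy over source types drops from 1.75 bits to 0. Real corpora and real proxy scorers are far less clean-cut; the sketch only shows the direction of the effect the paragraph above describes.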

What makes this counterintuitive is that the usual instinct (keep correct content, drop the noise) misreads how reasoning text functions. Models trained on *deliberately corrupted* reasoning traces perform comparably to those trained on correct ones, and sometimes generalize better out of distribution, because the traces act as computational scaffolding rather than as meaningful steps to be gotten 'right' (see "Do reasoning traces need to be semantically correct?"). If semantic correctness isn't the active ingredient, then a proxy filter that selects for correctness is optimizing a metric that doesn't drive the capability, and paying for it by shrinking diversity.

That diversity cost is the real damage. Aggressive selection collapses the distribution of formats and styles the model sees, and the corpus shows elsewhere that collapse is precisely how reasoning gets suppressed: RL post-training that amplifies one dominant pretraining format while killing the alternatives demonstrates how much latent variety pretraining holds and how easily a narrowing process buries it (see "Does RL training collapse format diversity in pretrained models?"). A proxy filter does the narrowing earlier, before the model ever sees the variety.

The deeper reason removing the filter helps is that reasoning is *latent and elicited, not taught*. Base models already carry reasoning capability that minimal training merely unlocks (see "Do base models already contain hidden reasoning ability?"), and post-training largely decides *when* to deploy reasoning rather than creating it (see "Does RL post-training create reasoning or just deploy it?"). That latent capacity has to be planted during pretraining, and the methods that plant it let reasoning emerge as a side effect of predicting arbitrary text, rewarding whatever genuinely improves prediction rather than whatever a proxy judges clean. Two examples: chain-of-thought learned as an exploratory action with an information-gain reward (see "Can chain-of-thought reasoning be learned during pretraining itself?"), and token-level rationale generation over ordinary internet text (see "Can models learn reasoning from predicting any text?"). A proxy filter pre-commits to a definition of useful before the model has had the chance to find prediction-improving structure for itself.
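The idea of rewarding whatever improves prediction can be written down in a few lines. The sketch below assumes the reward is the log-likelihood of the observed next token with a sampled thought, minus the log-likelihood without it. The source notes describe the reward only as "log-likelihood improvement", so the exact baseline, the function name `information_gain_reward`, and the probabilities are assumptions made here for illustration.

```python
import math


def information_gain_reward(p_with_thought, p_without_thought):
    """Log-likelihood improvement on the observed next token.

    Positive when conditioning on the sampled thought makes the observed
    token more likely than predicting it without the thought.
    """
    return math.log(p_with_thought) - math.log(p_without_thought)


# Hypothetical probabilities the model assigns to the observed next token.
helpful = information_gain_reward(p_with_thought=0.6, p_without_thought=0.3)
unhelpful = information_gain_reward(p_with_thought=0.2, p_without_thought=0.3)
```

Here `helpful` is positive (the thought doubled the probability of the observed token) and `unhelpful` is negative. Nothing in the reward asks whether the thought looks clean or correct to an outside judge, which is the contrast with proxy filtering that the paragraph above draws.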

The surprise worth taking away: the same trap shows up after pretraining too. Supervised fine-tuning on clean, correct examples raises benchmark accuracy while *cutting* Information Gain, the measure of genuine inferential quality used in that note, by 38.9%; the model learns to produce correct-looking answers via post-hoc rationalization (see "Does supervised fine-tuning improve reasoning or just answers?"). Filtering for correctness, whether by a proxy model upstream or by curated examples downstream, keeps optimizing the wrong target. Reasoning seems to want breadth and the freedom to emerge, not a gatekeeper deciding in advance what good data looks like.


Sources (8 notes)

Does procedural knowledge drive reasoning more than factual retrieval?

Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.

Can chain-of-thought reasoning be learned during pretraining itself?

RLP treats CoT as exploratory action during pretraining, using log-likelihood improvement as verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.

Can models learn reasoning from predicting any text?

Quiet-STaR trains language models to generate rationales at every token position during pretraining on arbitrary internet text, enabling general reasoning without task-specific datasets. Rationale quality is judged by predictive accuracy rather than labeled correctness, allowing reasoning competence to emerge as a side effect of improved language modeling.

Does supervised fine-tuning improve reasoning or just answers?

Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning-capability researcher re-testing claims about pretraining data curation and reasoning emergence. The question remains open: Why does eliminating proxy-model filtering improve reasoning emergence in pretraining?

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2025 across multiple papers. A library curated from recent arXiv work proposes:
• Reasoning rides on *procedural knowledge* spread across heterogeneous documents; proxy filters prune the long tail where it lives, optimizing for legible fact-density instead (arXiv:2411.12580, ~2024-11).
• Deliberately corrupted reasoning traces perform comparably to correct ones, sometimes generalizing better out-of-distribution, suggesting semantic correctness isn't the active ingredient (arXiv:2505.13775, ~2025-05).
• RL post-training amplifies one dominant pretraining format while killing alternatives, showing how narrowing suppresses reasoning; a proxy filter does narrowing earlier (arXiv:2504.07912, ~2025-04).
• Base models carry latent reasoning capability; pretraining plants it as a side effect of predicting arbitrary text, and post-training mainly decides *when* to deploy it, not *how* to create it (arXiv:2506.12115, ~2025-06).
• Supervised fine-tuning on clean, correct examples raises benchmark accuracy while cutting genuine inferential quality by ~40% — filtering for correctness optimizes the wrong target (library synthesis, ~2024–2025).

Anchor papers (verify; mind their dates):
• arXiv:2411.12580 (Nov 2024) — Procedural Knowledge in Pretraining Drives Reasoning
• arXiv:2504.07912 (Apr 2025) — Echo Chamber: RL Post-training Amplifies Behaviors
• arXiv:2505.13775 (May 2025) — Beyond Semantics: Reasonless Intermediate Tokens
• arXiv:2506.12115 (Jun 2025) — Eliciting Reasoning with Cognitive Tools

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer models (GPT-4o, Claude 3.5, o1-preview variants), training methods (scaling laws, mixture-of-experts routing, continuous adaptation), tooling (token-level introspection SDKs), orchestration (retrieval-augmented reasoning, multi-hop memory), or evaluation (mechanistic interpretability probes, out-of-distribution reasoning benchmarks) have since RELAXED or OVERTURNED it. Separate the durable question (does reasoning emerge from diversity and latent capability rather than from curation?) from perishable limitations (are proxy filters *always* harmful? does SFT *always* cut quality by ~40%?). Cite what resolved each one; flag where constraints still hold.

(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Look for: papers claiming proxy/quality filters *do* help reasoning; evidence that semantic correctness is active; claims that RL post-training doesn't narrow; findings that SFT preserves reasoning quality under certain conditions.

(3) Propose 2 research questions that ASSUME the regime may have moved:
— How do *adaptive* or *dynamic* filtering (conditional on model capability) compare to fixed proxy filtering and no filtering?
— Under what conditions does diversity-at-scale outperform curation-with-scale? (i.e., is the trade-off real, or does it vanish above certain model sizes?)

Cite arXiv IDs; flag anything you cannot ground in a real paper.
