INQUIRING LINE

What makes training data quality more important than quantity for reasoning?

This explores why, for teaching reasoning, *which* data you train on matters more than *how much* — and the corpus suggests the surprising part is what 'quality' even means here.


This reads the question as: if you can't just bulk up the training set to make a model reason better, what is the small set of things that actually carries the signal? The corpus has a sharp answer, and it isn't the obvious one. Quality matters more than quantity because most of the training data does nothing for reasoning — the learning signal is concentrated in a tiny fraction of it. Do high-entropy tokens drive reasoning model improvements? shows that only ~20% of tokens are the high-entropy 'forking points' where a reasoning trajectory is actually decided; train on just those and you match or beat full-gradient updates. The other 80% is filler. More data that's all filler buys you nothing.
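The selection step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method from the cited work: the function names (`token_entropy`, `high_entropy_mask`, `masked_mean_loss`), the toy distributions, and the per-token losses are all hypothetical, and the 0.2 fraction simply mirrors the ~20% figure quoted above.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def high_entropy_mask(distributions, keep_fraction=0.2):
    """Return a 0/1 mask marking the top `keep_fraction` of positions by entropy."""
    entropies = [token_entropy(d) for d in distributions]
    n_keep = max(1, round(keep_fraction * len(entropies)))
    # Indices of the n_keep highest-entropy positions.
    order = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    keep = set(order[:n_keep])
    return [1 if i in keep else 0 for i in range(len(entropies))]

def masked_mean_loss(per_token_losses, mask):
    """Average the loss over kept positions only; masked-out tokens add no gradient."""
    kept = [loss for loss, m in zip(per_token_losses, mask) if m]
    return sum(kept) / len(kept)

# Toy sequence: 5 positions over a 3-token vocabulary (made-up numbers).
dists = [
    [0.98, 0.01, 0.01],    # confident continuation
    [0.34, 0.33, 0.33],    # near-uniform: a "forking point"
    [0.90, 0.05, 0.05],
    [0.97, 0.02, 0.01],
    [0.99, 0.005, 0.005],
]
losses = [0.02, 1.10, 0.11, 0.03, 0.01]  # hypothetical per-token losses

mask = high_entropy_mask(dists, keep_fraction=0.2)
loss = masked_mean_loss(losses, mask)
```

In this toy case only the near-uniform position survives the mask, so the update is driven entirely by the one token where the trajectory could have gone several ways.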

The deeper reason is that base models already contain the reasoning ability — training selects it rather than installs it. Do base models already contain hidden reasoning ability? finds five independent methods all eliciting reasoning that's already latent in base activations, and Does RL post-training create reasoning or just deploy it? argues post-training mostly teaches a model *when* to deploy reasoning, not *how*. If the capability is already there, you don't need volume — you need the right small nudge. That reframes 'quality' away from 'more correct examples' toward 'the precise signal that flips the latent switch.'

Here's the twist that makes naive 'quality' suspect: semantic correctness of the data turns out not to be the lever. Do reasoning traces need to be semantically correct? shows that models trained on systematically *wrong* (corrupted or irrelevant) reasoning traces keep their solution accuracy and sometimes generalize better out of distribution, because the traces act as computational scaffolding, not as meaningful logic to imitate. And Does training data format shape reasoning strategy more than domain? finds that the *format* of the data (multiple-choice vs. free-form) shapes a model's reasoning strategy 7.5× more than the subject matter does: multiple-choice training produces breadth-first exploration, while free-form training produces depth-first reasoning. So the quality that counts is structural (the shape and the pivotal decision points), not the factual richness or sheer correctness of the corpus.

Quantity can actively backfire, which is the strongest version of the argument. Does supervised fine-tuning improve reasoning or just answers? documents fine-tuning that raises benchmark accuracy while cutting Information Gain, the study's measure of genuine inferential progress, by 38.9%: the model learns to produce right answers through post-hoc rationalization. More supervised data made the scores go up and the reasoning go down, and standard metrics miss it because they only check the final answer. Does teacher-refined data always improve student model performance? adds that even *objectively higher-quality* data hurts when it exceeds the student's learning frontier; the right data is data compatible with the specific learner, not the most polished data available. 'Better' is relative to who's learning.
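One way to picture "data compatible with the specific learner" is a filter that accepts a teacher's refinement only when the student itself finds it learnable. The sketch below is an assumption-laden illustration, not the procedure from the cited note: `select_training_text`, the `student_nll` scoring callable, the `max_nll` threshold, and the scores are all hypothetical stand-ins for whatever statistical profile the student actually uses.

```python
def select_training_text(original, refined, student_nll, max_nll=2.5):
    """Keep the teacher's refinement only if the student finds it learnable.

    student_nll: callable returning the student's mean negative log-likelihood
        per token for a text (hypothetical; lower means more familiar).
    max_nll: assumed frontier threshold, not taken from the source.
    """
    return refined if student_nll(refined) <= max_nll else original

# Stand-in scores; in practice these would come from the student model itself.
toy_scores = {
    "orig A": 1.8, "refined A": 2.1,  # refinement within the frontier
    "orig B": 1.9, "refined B": 4.7,  # refinement too far from the student
}
pairs = [("orig A", "refined A"), ("orig B", "refined B")]
chosen = [select_training_text(o, r, toy_scores.get) for o, r in pairs]
```

Here the first refinement is kept and the second is rejected in favor of the original, even though the teacher would rate both refinements as improvements.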

What's left, then, is that reasoning is taught by a few well-chosen, well-shaped signals rather than accumulated volume. Can small models reason well by just learning output format? shows a 1.5B model with LoRA-only post-training matching larger full-parameter RL models, which suggests that RL teaches how to organize output rather than new knowledge, and that reasoning organization is separable from knowledge storage. And where genuine quality is irreplaceable, it's quality of *principle*, not quantity of examples: Can models learn argument quality from labeled examples alone? finds that fine-tuning on labeled examples does not transfer argument-quality judgment to new argument types (models just learn surface patterns) until you hand them an explicit framework. The thing you didn't know you wanted to know: for reasoning, the most valuable data is often the smallest, structurally-selected slice, and piling on more can quietly erode the very capability you're trying to build.


Sources (9 notes)

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

Does training data format shape reasoning strategy more than domain?

Models trained on multiple-choice data adopt breadth-first exploration (Cohen's d up to 1.5), while free-form training produces depth-first reasoning. Format effect dwarfs domain effect, meaning presentation matters far more than content type.

Does supervised fine-tuning improve reasoning or just answers?

Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.

Does teacher-refined data always improve student model performance?

Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.

Can small models reason well by just learning output format?

A 1.5B parameter model with LoRA-only post-training matched larger full-parameter RL models on reasoning tasks, suggesting RL teaches output format organization rather than new factual knowledge. This efficiency indicates reasoning and knowledge storage are separable capabilities.

Can models learn argument quality from labeled examples alone?

Fine-tuning on labeled examples fails to transfer quality criteria to new argument types. Models learn surface patterns rather than principled criteria. Explicit instruction using frameworks like RATIO or QOAM significantly improves performance and generalization.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning researcher re-testing claims about training data quality vs. quantity for LLM reasoning. The question remains open: what actually drives reasoning capability — data volume, semantic correctness, or something else?

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2025, mostly mid-2025 onward. A library of recent work claims:

• Only ~20% of tokens ('high-entropy forking points') carry the learning signal; training on that fraction matches full-gradient updates; the other 80% is inert (2025-06, arXiv:2506.01939).
• Base models already possess latent reasoning capability; post-training teaches *when* to deploy it, not *how* (2025-04, arXiv:2504.09858).
• Models trained on *wrong* reasoning traces generalize comparably to or better than models trained on correct traces, because traces function as computational scaffolding, not semantic imitation (2025-05, arXiv:2505.13775).
• Output *format* (multiple-choice vs. free-form) shapes reasoning strategy 7.5× more than domain content; a 1.5B model matches large RL models by learning format alone (2025-04, arXiv:2504.15777).
• Fine-tuning on more supervised data can raise benchmark accuracy while cutting genuine reasoning by 38.9%; data quality must match the learner's frontier, not absolute polish (2025-08, arXiv:2508.01191).

Anchor papers (verify; mind their dates):
• arXiv:2506.01939 (Jun 2025) — high-entropy token analysis
• arXiv:2504.09858 (Apr 2025) — reasoning already latent
• arXiv:2505.13775 (May 2025) — wrong traces as effective scaffolding
• arXiv:2504.15777 (Apr 2025) — format dominates content

Your task:
(1) RE-TEST EACH CONSTRAINT. For every claim above (especially the 80/20 token rule, the latent-reasoning hypothesis, and the format-over-content finding), judge whether newer models (o1, Claude reasoning variants), scaling laws, or post-training methods (RL specifics, distillation) have relaxed or overturned it. Plainly separate the durable question (likely still: what is the minimal sufficient signal?) from perishable claims (e.g., exact percentages, format dominance ratios). Where constraints hold, cite what confirms them.

(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months — especially studies showing that sheer scale or data volume DO matter for reasoning, or that semantic correctness IS critical despite the library's findings.

(3) Propose 2 research questions that ASSUME the regime may have moved: e.g., does multimodal or code-heavy training change the token-importance distribution? Do reasoning models trained on synthetic data vs. human data exhibit the same format-dominance pattern?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
