INQUIRING LINE

Training, RL, and Test-Time Scaling · Reasoning, Retrieval, and Evaluation · Model Architecture and Internals · cross-cluster

Why does the same training data produce different gains across models?

This explores why feeding identical data to different models yields uneven improvements — what about a model's starting point, scale, or current ability changes what the same examples teach it.

The corpus's recurring answer is that data has no fixed value: its value depends on who is learning from it. A sample's worth comes from the interaction between its difficulty and the model's current ability, not from any intrinsic quality of the sample. The most striking case is teacher-refined data. Objectively higher-quality refinements from a stronger teacher actually *degrade* a student model when they exceed its learning frontier, so students have to filter refinements against their own statistical profile and keep only what is compatible (see "Does teacher-refined data always improve student model performance?"). The productive band of useful examples is not even stable within a single model. It drifts as training proceeds, which is why static difficulty labels go stale within steps (see "How does model ability change what samples teach?").
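The drifting band can be made concrete with a minimal sketch. It assumes, purely for illustration, that difficulty is measured as a model's empirical pass rate on each sample and that the productive band is a fixed interval of pass rates. The function name, the thresholds, and the pass rates below are all hypothetical and are not taken from the cited notes.

```python
def productive_band(pass_rates, low=0.2, high=0.8):
    """Return ids of samples whose empirical pass rate lies inside [low, high].

    pass_rates maps a sample id to the fraction of rollouts the *current*
    model solved. The thresholds are illustrative assumptions.
    """
    return sorted(sid for sid, p in pass_rates.items() if low <= p <= high)


# Hypothetical pass rates for the same three samples under two models
# (or the same model at two points in training).
weaker = {"easy": 0.6, "medium": 0.1, "hard": 0.0}
stronger = {"easy": 1.0, "medium": 0.7, "hard": 0.3}

band_weaker = productive_band(weaker)
band_stronger = productive_band(stronger)

print(band_weaker)    # ['easy']
print(band_stronger)  # ['hard', 'medium']
```

The samples are identical in both calls; only the learner's pass rates differ, and the selected set changes completely. A difficulty label computed once would keep pointing at the first set after the model has moved on.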

The same relativity shows up at the harmful end. Problems that are too hard don't just fail to help; they teach degenerate shortcuts (answer repetition, skipping computation) that contaminate capabilities the model already had, because group-relative normalization treats rare accidental successes as high-advantage trajectories (see "Do overly hard RLVR samples actually harm model capabilities?"). So a batch that is a productive challenge for a capable model can be actively corrosive for a weaker one. This is the flip side of the informativeness story: difficulty is only meaningful relative to ability.
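A small worked example shows why group-relative normalization inflates a lucky success. This is a sketch under stated assumptions: rewards are binary verifier outcomes, and each rollout's advantage is its reward standardized against the mean and standard deviation of its own sampling group. The group size, the reward vectors, and the `eps` term are illustrative choices, not details from the cited note.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward against its own group (zero mean, unit std)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical groups of 8 rollouts on two problems.
medium = [1, 0, 1, 0, 1, 0, 1, 0]      # solved about half the time
too_hard = [0, 0, 0, 0, 0, 0, 0, 1]    # a single accidental success

adv_medium = group_relative_advantages(medium)
adv_hard = group_relative_advantages(too_hard)

print(round(max(adv_medium), 2))  # 1.0
print(round(max(adv_hard), 2))    # 2.65
```

With one success in eight, the successful rollout's advantage is roughly the square root of 7, more than twice the advantage of a success on the half-solved problem. Whatever that rollout happened to do, including a shortcut, receives the largest update in the batch.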

A second axis is where the data lands inside the model. Pretraining and fine-tuning scale almost independently: pretraining enriches lower-layer factual knowledge while fine-tuning modifies upper-layer behavior. The same fine-tuning data therefore buys different gains depending on what the underlying pretrained model already stored (see "Do pretraining and fine-tuning scale independently in language models?"). Models also carry reusable internal scaffolding that the data can only exploit if it is already present. Length generalization transfers across related tasks precisely because pretrained models share attention heads that newer tasks can reuse (see "Can length generalization transfer between different related tasks?"). Identical examples activate latent machinery in one model and find nothing to grab onto in another.

Reinforcement learning makes the model-dependence even sharper. RL doesn't add capability so much as amplify one format already latent in pretraining while suppressing the rest. Which format wins depends on model scale, not necessarily on performance, and the effect is largely hidden when you start from a proprietary pretrained model (see "Does RL training collapse format diversity in pretrained models?"). The same RL signal therefore sculpts different models toward different dominant behaviors. Staying close to the base matters too: models that drift less from their base distribution preserve plasticity and keep learning later tasks, while parameter-only approaches that drift far stall when the task domain shifts (see "Does staying close to the base model preserve learning ability?").
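Drift from the base distribution is typically quantified with KL divergence. The sketch below is a minimal illustration, assuming drift is measured as KL(policy || base) over a single categorical next-token distribution. The three-token distributions are made up, and the choice of direction is an assumption rather than something the cited note specifies.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two categorical distributions.

    Assumes both lists cover the same tokens and q is nonzero wherever p is.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical next-token distributions over a three-token vocabulary.
base = [0.5, 0.3, 0.2]
low_drift = [0.55, 0.28, 0.17]   # stays close to the base
high_drift = [0.9, 0.05, 0.05]   # has collapsed onto one token

kl_low = kl_divergence(low_drift, base)
kl_high = kl_divergence(high_drift, base)

print(kl_low < kl_high)  # True
```

In practice such a quantity would be averaged over many contexts and tokens. The single-distribution case only shows what "closer to the base" means numerically.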

The practical upshot is that selecting data *for a specific model* beats using more of it. Gradient-similarity selection of just 5% of an instruction set can outperform training on the whole thing, because mixed datasets contain examples that actively hinder particular skills by pulling reasoning strategy away from the target, and which examples those are depends on the model (see "Can we train better models on less data?"). The thing you didn't know you wanted to know: there is no such thing as good training data in the abstract, only data that is well matched to a particular model's current frontier, and a mismatch can make a stronger dataset produce a weaker model.
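The selection idea can be sketched in a few lines. This is not the LESS implementation. It assumes each candidate example has already been reduced to a low-dimensional gradient feature vector computed with a specific model, and that a target-task gradient feature is available for that same model. The helper names, the two-dimensional features, and the 25% fraction (chosen so the toy set yields one example) are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_top_fraction(features, target, fraction=0.05):
    """Keep the fraction of examples whose gradient features best match the target."""
    k = max(1, int(len(features) * fraction))
    ranked = sorted(features, key=lambda name: cosine(features[name], target),
                    reverse=True)
    return ranked[:k]


# Hypothetical gradient features for the same four examples under two models.
model_a = {"ex1": [1.0, 0.0], "ex2": [0.0, 1.0], "ex3": [-1.0, 0.0], "ex4": [0.7, 0.7]}
model_b = {"ex1": [0.0, -1.0], "ex2": [0.2, 0.1], "ex3": [1.0, 0.1], "ex4": [-0.5, 0.5]}
target = [1.0, 0.1]  # illustrative target-task gradient feature

picked_a = select_top_fraction(model_a, target, fraction=0.25)
picked_b = select_top_fraction(model_b, target, fraction=0.25)

print(picked_a)  # ['ex1']
print(picked_b)  # ['ex3']
```

The example that best serves the target under one model's gradients (`ex1`) points directly away from it under the other's feature for `ex3`, and vice versa, which is the sense in which the helpful subset depends on the model.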

Sources 8 notes

Does teacher-refined data always improve student model performance?

Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.

How does model ability change what samples teach?

A sample's learning value depends on the interaction between its difficulty and the model's current ability, not difficulty alone. The productive band of medium-difficulty problems drifts during training, making static difficulty estimates obsolete within steps.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Do pretraining and fine-tuning scale independently in language models?

Emulated Fine-Tuning reveals that scaling pretraining improves factual knowledge while scaling fine-tuning improves behavioral helpfulness. This decoupling has architectural roots: pretraining enriches lower-layer knowledge storage, while fine-tuning modifies upper-layer behavior expression.

Can length generalization transfer between different related tasks?

Models trained jointly on related tasks reuse the same attention heads to handle length generalization, allowing shorter tasks to extrapolate beyond their training length. Pretrained models already contain this reusable computational scaffolding.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Does staying close to the base model preserve learning ability?

FST-trained models stay up to 70% closer to their base distribution than parameter-only RL, and this reduced drift preserves the model's ability to learn subsequent tasks effectively. Parameter-only approaches stall when task domains change, while low KL drift enables sustained adaptation.

Can we train better models on less data?

LESS uses low-rank gradient features to select instruction data most similar to target capabilities, and training on the selected 5% consistently outperforms full dataset training. The improvement occurs because mixed datasets contain examples that actively hinder specific skills by shifting reasoning strategy away from task requirements.
