INQUIRING LINE

Do sample-level similarities between pretraining and downstream tasks explain the frequency effect?

This explores whether downstream performance is really driven by how often related concepts and patterns showed up in pretraining — i.e., whether the 'frequency effect' is interpolation/memorization rather than genuine generalization.


The corpus points strongly toward yes. How often similar examples appeared during pretraining explains much of the boost a model gets on a downstream task, and several notes converge on this from different angles.

The most direct evidence comes from multimodal models: across 34 models and 5 datasets, zero-shot performance tracks how often a test concept appeared in the pretraining data, and models need *exponentially* more data for each linear gain on downstream tasks (see "Does multimodal zero-shot performance actually generalize or interpolate?"). That's the signature of interpolation, not generalization: the model is good at what it has seen a lot of. A parallel finding at the token level: whether a keyword gets 'primed' after a gradient update is predictable from its probability *before* learning, with a sharp ~10^-3 threshold separating contexts where learning sticks from those where it doesn't (see "Can we predict keyword priming before learning happens?"). So the pre-existing statistical footprint of a concept governs whether new training even takes hold.
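To make "exponentially more data for each linear gain" concrete, here is a minimal sketch of a log-linear fit, assuming accuracy grows roughly as a + b * log10(frequency). The function names and all numbers are hypothetical and synthetic, chosen only to illustrate the shape of the relationship; they are not taken from any of the notes.

```python
import math

def fit_log_linear(freqs, scores):
    """Least-squares fit of score = a + b * log10(freq)."""
    xs = [math.log10(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def data_multiplier_for_gain(b, gain):
    """Factor by which concept frequency must grow to add `gain` under the fit."""
    return 10 ** (gain / b)

# Synthetic, made-up values for illustration only (assumption, not real data).
concept_freqs = [1e2, 1e3, 1e4, 1e5, 1e6]
zero_shot_acc = [0.10, 0.20, 0.30, 0.40, 0.50]

a, b = fit_log_linear(concept_freqs, zero_shot_acc)
multiplier = data_multiplier_for_gain(b, 0.10)
print(f"slope per decade of frequency: {b:.3f}")
print(f"data multiplier for +0.10 accuracy: {multiplier:.1f}x")
```

Under a fit like this, every fixed step in accuracy costs a constant multiplicative factor in data, which is why a straight line on a log-frequency axis reads as interpolation rather than generalization.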

The more surprising thread is what's actually transferring when you adapt a model. Instruction tuning, it turns out, can be done with semantically *empty or wrong* instructions and still hit nearly identical performance: what transfers is familiarity with the output space, not task understanding (see "Does instruction tuning teach task understanding or output format?"). RL post-training shows the same shape from another direction: rather than inventing behavior, it amplifies one format distribution that already dominated pretraining while suppressing the alternatives (see "Does RL training collapse format diversity in pretrained models?"). In both cases downstream gains are largely a re-weighting of distributions the model already carried.

Why similarity matters as much as raw frequency shows up in the teacher-student work: refined training data degrades a student when it falls outside that student's existing 'learning frontier,' even when the data is objectively better (see "Does teacher-refined data always improve student model performance?"). The benefit isn't in the data's quality; it's in its proximity to what the model already represents. That's the sample-level similarity story made mechanical.
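One way to picture a student filtering refinements by its own statistical profile is the sketch below. It assumes the student has already scored both the original and the teacher-refined response and produced per-token log-probabilities. The helper names, the `max_drop` tolerance, and the example values are all hypothetical; the note does not specify the actual selection rule, so treat this as an illustration of the idea rather than the published method.

```python
def mean_logprob(token_logprobs):
    """Average per-token log-probability under the student model."""
    return sum(token_logprobs) / len(token_logprobs)

def select_training_target(original_lp, refined_lp, max_drop=0.5):
    """Keep the teacher's refinement only if the student finds it nearly as
    likely as the original; otherwise fall back to the original response.
    `max_drop` is an assumed tolerance in nats per token."""
    if mean_logprob(refined_lp) >= mean_logprob(original_lp) - max_drop:
        return "refined"
    return "original"

# Made-up student log-probs for illustration (assumption, not real data).
original = [-1.0, -1.2, -0.8]           # mean -1.0
near_refinement = [-1.1, -1.3, -1.2]    # mean -1.2, within tolerance
far_refinement = [-3.0, -2.5, -3.5]     # mean -3.0, outside tolerance

print(select_training_target(original, near_refinement))
print(select_training_target(original, far_refinement))
```

The design choice here is that the gate depends only on the student's own likelihoods, not on any external quality score, which mirrors the claim that proximity matters more than quality.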

The useful caveat is that frequency isn't destiny. Where capability lives matters: pretraining scale drives factual knowledge in lower layers while fine-tuning scale shifts behavior in upper layers, so the two don't simply collapse into one frequency dial (see "Do pretraining and fine-tuning scale independently in language models?"). And methods like baking chain-of-thought reasoning into pretraining itself suggest some capabilities can be *planted* rather than merely surfaced from frequency (see "Can chain-of-thought reasoning be learned during pretraining itself?"). The honest reading: the frequency effect is largely a similarity effect, with downstream wins riding on overlap with pretraining, but architecture and where you intervene leave room for genuine new capability on top.


Sources (7 notes)

Does multimodal zero-shot performance actually generalize or interpolate?

Across 34 models and 5 datasets, multimodal models require exponentially more pretraining data for linear performance gains on downstream tasks. Performance correlates with how often test concepts appeared during pretraining, not genuine generalization ability.

Can we predict keyword priming before learning happens?

Pre-learning keyword probability strongly predicts post-learning priming across architectures and model sizes, with a ~10^-3 threshold separating contexts where priming occurs from those where it doesn't. Just 3 training exposures suffice to establish the effect.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions perform comparably to models trained on full, correct instructions. In the reported comparison, instruction tuning reaches 43% exact match against 42.6% for a random baseline. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Does teacher-refined data always improve student model performance?

Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.

Do pretraining and fine-tuning scale independently in language models?

Emulated Fine-Tuning reveals that scaling pretraining improves factual knowledge while scaling fine-tuning improves behavioral helpfulness. This decoupling has architectural roots: pretraining enriches lower-layer knowledge storage, while fine-tuning modifies upper-layer behavior expression.

Can chain-of-thought reasoning be learned during pretraining itself?

RLP treats CoT as exploratory action during pretraining, using log-likelihood improvement as verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.
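As a rough illustration of "log-likelihood improvement as verifier-free reward", the sketch below scores a thought by how much it raises the log-probability of the observed next token. The function name and the probabilities are hypothetical, and how the no-thought baseline is computed in RLP is not specified in this note, so the baseline here is simply an assumed input.

```python
import math

def log_likelihood_gain(p_with_thought, p_without_thought):
    """Reward = log p(next token | context, thought) - log p(next token | context).
    Positive when the thought makes the observed token more likely."""
    return math.log(p_with_thought) - math.log(p_without_thought)

# Made-up probabilities for illustration (assumption, not real data).
helpful = log_likelihood_gain(0.50, 0.25)    # thought doubles the probability
unhelpful = log_likelihood_gain(0.10, 0.25)  # thought lowers the probability

print(f"helpful thought reward:   {helpful:+.3f}")
print(f"unhelpful thought reward: {unhelpful:+.3f}")
```

No external verifier appears anywhere in the computation: the training text itself supplies the signal, which is what lets a reward of this kind be applied during pretraining.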

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM researcher evaluating whether sample-level similarity between pretraining and downstream tasks explains most downstream performance gains (the 'frequency effect'). This question remains open; treat the findings below as dated claims to be re-tested.

What a curated library found — and when (findings span 2023–2025, dated claims, not current truth):
• Across 34 multimodal models, zero-shot performance tracks pretraining concept frequency; each linear downstream gain needs exponentially more data, the signature of interpolation rather than generalization (2024).
• Instruction tuning transfers output-format familiarity, not task understanding; semantically empty instructions yield near-identical performance (2023).
• RL post-training amplifies a single format distribution that already dominated pretraining rather than inventing new behavior (2025).
• Refined training data harms student models when it falls outside their learning frontier, even if objectively better — benefit is proximity to existing representation, not data quality (2024).
• Pretraining scale drives lower-layer factual knowledge; fine-tuning scale shifts upper-layer behavior — they don't collapse to one frequency dial (2024).

Anchor papers (verify; mind their dates):
• arXiv:2404.04125 (2024): Pretraining Concept Frequency & Multimodal Zero-Shot
• arXiv:2305.11383 (2023): Do Models Really Learn to Follow Instructions?
• arXiv:2504.07912 (2025): Echo Chamber — RL Post-training & Pretraining Amplification
• arXiv:2507.14805 (2025): Subliminal Learning — Hidden Behavioral Trait Transmission

Your task:
(1) RE-TEST EACH CONSTRAINT. For every claim above, assess whether newer model scales, training methods (DPO, rejection sampling, synthetic data), inference harnesses (caching, routing), or evals have since relaxed or overturned it. Separate durable questions (still open?) from perishable limitations (possibly resolved by 2025–2026 methods?). Cite what resolved each.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months—papers that show downstream gains *despite* low pretraining frequency, or capability emergence *without* similarity overlap.
(3) Propose 2 research questions that assume the regime may have shifted: e.g., does scaling inference compute or using ensemble routing create "genuine" new capability independent of pretraining frequency?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
