How does training frequency distribution shape what models reliably retrieve?
This explores how often a model saw something during training — common vs. rare — and how that frequency shapes whether it can reliably pull that knowledge back out later, whether from its own weights or by knowing when to reach for an external source.
The corpus suggests that frequency leaves a fingerprint inside the model's activations. The most direct finding is that representational density is *learned*: during pretraining, networks develop dense activations for familiar, frequently seen inputs and default to sparse representations for unfamiliar ones, a pattern that emerges purely from exposure, without any task-specific tuning [Is representational sparsity learned or intrinsic to neural networks?]. A companion result shows the same effect at inference time: when a model hits an out-of-distribution input, its hidden states sparsify in a localized, systematic way that tracks how unfamiliar the task is [Do language models sparsify their activations under difficult tasks?]. So the model carries a kind of internal frequency map, dense where it has seen a lot and sparse where it has not, and that sparsity acts as a stabilizing filter rather than a failure.
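To make "dense" versus "sparse" concrete, here is a minimal sketch of one way to score a hidden state. The notes do not say which sparsity metric the underlying work uses, so the near-zero-fraction measure, the `activation_sparsity` name, and the `eps` threshold are all assumptions chosen for illustration.

```python
from typing import Sequence


def activation_sparsity(hidden: Sequence[float], eps: float = 1e-3) -> float:
    """Return the fraction of units whose magnitude is at most eps.

    Hypothetical metric: 0.0 means fully dense, 1.0 means every unit is
    (near) zero. The eps cutoff is an illustrative assumption.
    """
    if not hidden:
        raise ValueError("hidden must be non-empty")
    near_zero = sum(1 for h in hidden if abs(h) <= eps)
    return near_zero / len(hidden)


# Toy hidden states, invented for illustration only.
familiar = [0.8, -0.5, 0.3, 0.9, -0.7, 0.4]      # most units active
unfamiliar = [0.0, 0.0, 1.2, 0.0, 0.0005, 0.0]   # a few units carry the signal

print(activation_sparsity(familiar))
print(activation_sparsity(unfamiliar))
```

Under this measure a higher score on an input would be read as a hint that the input sits in a thin region of the frequency map, though how well that holds for a given model is an empirical question.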
The catch is that the model's *confidence* doesn't fully see this map. The most useful lateral finding here is that model confidence and data rarity are orthogonal signals that catch different failures: confidence misses hallucinations about rare entities (the model is fluently wrong about something it barely saw), while rarity misses uncertain reasoning about common knowledge [Should RAG systems use model confidence or data rarity to trigger retrieval?]. That's why deciding *when to retrieve externally* can't rest on confidence alone, and why uncertainty estimation, although it beats heavier adaptive-retrieval heuristics on cost [Can simple uncertainty estimates beat complex adaptive retrieval?], still has a blind spot precisely on low-frequency facts. Framing retrieval as a step-by-step decision of "trust my parametric memory or go look it up" yields large accuracy gains (a reported 21.99% improvement) by routing around exactly these gaps [When should language models retrieve external knowledge versus use internal knowledge?].
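A hybrid trigger of the kind described above can be sketched as an OR over the two signals. This is an illustrative sketch, not the method from any of the cited notes: the function names, the geometric-mean confidence score, the use of a corpus occurrence count as the rarity signal, and both thresholds are assumptions.

```python
import math
from typing import Sequence


def mean_token_confidence(token_logprobs: Sequence[float]) -> float:
    """Geometric-mean token probability of a draft answer (assumed score)."""
    if not token_logprobs:
        raise ValueError("token_logprobs must be non-empty")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def should_retrieve(
    token_logprobs: Sequence[float],
    entity_count: int,
    min_confidence: float = 0.6,    # hypothetical threshold
    min_entity_count: int = 100,    # hypothetical threshold
) -> bool:
    """Retrieve if the model is unsure OR the entity is rare in the corpus.

    Each signal covers the other's blind spot: rarity catches fluent
    answers about barely seen entities, confidence catches shaky
    reasoning about common ones.
    """
    low_confidence = mean_token_confidence(token_logprobs) < min_confidence
    rare_entity = entity_count < min_entity_count
    return low_confidence or rare_entity


# Confident draft about a rare entity: the rarity signal fires.
print(should_retrieve([-0.05, -0.02], entity_count=3))
# Unsure draft about a common entity: the confidence signal fires.
print(should_retrieve([-1.5, -2.0], entity_count=50_000))
# Confident draft about a common entity: answer from parametric memory.
print(should_retrieve([-0.05, -0.02], entity_count=50_000))
```

In practice both thresholds would need tuning on held-out data, and the entity count assumes access to some estimate of pretraining-corpus frequency.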
Frequency also shapes recall through what training *amplifies*. RL post-training doesn't add new knowledge so much as converge on the single most dominant format from pretraining and suppress the alternatives, often within the first epoch. The most frequent pattern wins, and which one wins depends on model scale, not necessarily on being best [Does RL training collapse format diversity in pretrained models?]. There's a recommender-systems echo of the same tension that's worth knowing about: wide-and-deep models deliberately split the labor so that a memorization component captures rare, long-tail items while a generalization component handles the common cases, an explicit architectural admission that frequent and rare knowledge want different machinery to be retrieved reliably [Can one model memorize and generalize better than two?].
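The wide-and-deep split can be sketched as a forward pass that adds a memorization logit to a generalization logit before a sigmoid. This is a toy illustration under stated assumptions: the feature names, weights, sum-pooling of embeddings, and single ReLU layer are all invented here, and a real model would learn both halves jointly rather than use hand-set values.

```python
import math
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def cross_features(active: List[str]) -> List[Pair]:
    """All pairwise cross-products of the active binary features."""
    ordered = sorted(active)
    return [(a, b) for i, a in enumerate(ordered) for b in ordered[i + 1:]]


def wide_logit(active: List[str], wide_weights: Dict[Pair, float]) -> float:
    """Memorization: a linear score over exact feature co-occurrences."""
    return sum(wide_weights.get(pair, 0.0) for pair in cross_features(active))


def deep_logit(
    active: List[str],
    embeddings: Dict[str, List[float]],
    out_weights: List[float],
) -> float:
    """Generalization: sum-pooled embeddings, one ReLU, linear readout."""
    dim = len(out_weights)
    pooled = [0.0] * dim
    for name in active:
        for i, value in enumerate(embeddings.get(name, [0.0] * dim)):
            pooled[i] += value
    hidden = [max(0.0, v) for v in pooled]  # ReLU
    return sum(w * h for w, h in zip(out_weights, hidden))


def predict(active, wide_weights, embeddings, out_weights, bias=0.0) -> float:
    """Joint prediction: sigmoid of the summed wide and deep logits."""
    z = wide_logit(active, wide_weights) + deep_logit(active, embeddings, out_weights) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical features and parameters, for illustration only.
active = ["lang=en", "app=chess"]
wide_weights = {("app=chess", "lang=en"): 2.0}   # a memorized rare pairing
embeddings = {"app=chess": [1.0, -1.0], "lang=en": [0.5, 0.5]}
out_weights = [1.0, 1.0]

print(predict(active, wide_weights, embeddings, out_weights))
```

The design point is that an exact co-occurrence weight can carry a rare pairing that the pooled embeddings would blur, while the embeddings cover combinations the wide table has never seen.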
The surprise worth leaving with: the things that make a model *fluent* are the same things that make it *unreliable on the tail*. Density, confidence, and the dominant format all reward what was seen often, so the failures cluster on the rare, and they arrive sounding just as confident as the truth. That's why the corpus keeps pointing toward hybrid triggers and selective retrieval: reliable recall isn't about making the model surer of itself; it's about teaching it where its own frequency map runs thin.
Sources (7 notes)
During pretraining, neural networks develop dense activations for familiar training data and default to sparse representations for unfamiliar inputs. This trend emerges without task-specific fine-tuning and reflects how models consolidate knowledge through exposure.
As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification acts as a selective filter stabilizing performance under OOD shift rather than a failure mode.
Model confidence and data-rarity signals catch orthogonal failure modes: confidence misses hallucinations about rare entities, while rarity misses uncertain reasoning about common knowledge. Hybrid triggers substantially outperform either signal alone.
Calibrated token-probability uncertainty consistently beats multi-call adaptive retrieval on single-hop tasks and matches performance on multi-hop, using a fraction of the LM and retriever calls. The model's self-knowledge proves more reliable than external heuristics for deciding when to retrieve.
DeepRAG models each reasoning step as a Markov Decision Process where the model learns when to retrieve versus rely on parametric knowledge. The 21.99% improvement comes from better-targeted retrieval and elimination of noise from unnecessary external knowledge.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Wide & Deep models train memorization (cross-product features) and generalization (embeddings) together, allowing each component to specialize: the wide part becomes small because deep handles common cases, and deep doesn't overfit rare items because wide captures them. Ensembling requires both halves full-size.