Does ordering training data by rarity actually improve language models?

Can sorting rare sentences before common ones during fine-tuning help LLMs learn more effectively? This challenges the intuition that models should see easy examples first.

Synthesis note · 2026-05-02 · sourced from Natural Language Inference

Curriculum Textual Frequency Training (CTFT) is the third leg of the Adam's Law framework, and it inverts the intuitive direction of curriculum learning. Standard curriculum learning sorts examples easy-to-hard along a conceptual difficulty axis: simple arithmetic before multi-step proofs, short translations before long ones. CTFT instead sorts examples by sentence-level corpus frequency, feeding the model the rare sentences first and the common sentences last. Rare sentences come first because they are where the model's prior is weakest; saving the dense, well-modeled region for the end stabilizes the training trajectory.
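The rare-first ordering can be sketched in a few lines. This is a minimal illustration, not the CTFT implementation: the note does not say how sentence-level frequency is computed, so the sketch assumes a smoothed mean log unigram frequency measured against a reference corpus. The function names (`tokenize`, `sentence_log_frequency`, `rare_first_order`) and the whitespace tokenizer are hypothetical choices made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization stands in for a real tokenizer.
    return text.lower().split()


def build_unigram_counts(reference_corpus):
    # Count token occurrences across the reference corpus.
    counts = Counter()
    for sentence in reference_corpus:
        counts.update(tokenize(sentence))
    return counts


def sentence_log_frequency(sentence, counts, total, vocab_size):
    # Assumption: sentence frequency = mean add-one-smoothed log unigram
    # probability of its tokens. Lower scores mean rarer sentences.
    tokens = tokenize(sentence)
    if not tokens:
        return float("-inf")
    log_probs = [math.log((counts[t] + 1) / (total + vocab_size)) for t in tokens]
    return sum(log_probs) / len(log_probs)


def rare_first_order(examples, reference_corpus):
    # Sort ascending by frequency score, so the rarest sentences are
    # seen first and the most common ones last.
    counts = build_unigram_counts(reference_corpus)
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # +1 reserves mass for unseen tokens
    return sorted(
        examples,
        key=lambda s: sentence_log_frequency(s, counts, total, vocab_size),
    )
```

Reversing the sort (`reverse=True`) gives a common-first ordering under the same score, which is a convenient baseline when checking whether the rare-first direction is what actually helps.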

The reframe matters more than the technique. For an LLM, "easy" and "hard" are not properties of the concept being expressed; they are properties of the distance from the pre-training distribution. A formally simple sentence in a rare register can be harder for the model than a complex sentence in a textbook register. This connects to "Does gradually tightening token budgets beat fixed budget training?": both findings argue that curriculum design for LLMs is fundamentally about managing distributional pressure, not pedagogical scaffolding. It also extends "Does training data format shape reasoning strategy more than domain?": format and frequency are both statistical-position properties that drive learning more than the semantic content of the examples does.

The methodological lesson generalizes beyond CTFT itself. Any curriculum-design choice for LLMs that uses the human-facing "easy/hard" gloss without checking distributional position is partly mis-specified. The replacement frame is "near/far from prior": the model finds near-prior examples easy not because they are simple but because they are dense, and far-prior examples hard not because they are complex but because they are sparse. CTFT's contribution is operationalizing that frame as a concrete sentence-frequency ordering, with story-completion distillation (TFD) as the workaround for estimating frequencies on closed-source models whose training data we cannot see directly.

Inquiring lines that read this note 17

This note is a source for these research framings, grouped by the broader line of inquiry each explores. Scan the bold lines of inquiry; follow any specific question forward.

Do language models understand semantics or rely on pattern matching?

How do rare linguistic registers differ from conceptually complex examples?

What critical LLM failures do standard benchmarks hide?

Why do rare complex structures in training data harm LLM generalization?

What makes weaker teacher models effective for stronger student training?

Why do weaker models generate better training data than stronger models?

Does fine-tuning modify underlying model capabilities or only behavioral outputs?

How does memorization interact with learning and generalization?

Why do continual learning scenarios trigger catastrophic forgetting and interference?

How does distributional shift toward rare inputs change memorization reliance?

Why does consolidated memory sometimes degrade agent performance?

How does consolidation schedule order affect final memory quality?

Why can't humans reliably detect AI-generated text despite measurable linguistic signatures?

How does example difficulty affect learning efficiency in language models?

Does model scaling alone produce compositional generalization without symbolic mechanisms?

Why do frequent words rank higher in taxonomic abstraction hierarchies?

How do training priors constrain what context information can override?

How does training order affect knowledge acquisition in language models?

What determines success in training models on multiple tasks?

Can intentional data-mixture design replace model scaling for rare task learning?

What memory architectures best support persistent reasoning across extended interactions?

Why are rare tokens the hooks for verbatim model memorization?

Related concepts in this collection 2

Does gradually tightening token budgets beat fixed budget training?
Can models learn reasoning more efficiently by starting with generous token allowances and progressively constraining them, rather than training with fixed budgets from the start? This matters because it addresses how to teach models to think effectively while remaining concise.
curriculum design as distributional pressure management

Does training data format shape reasoning strategy more than domain?
What explains why models trained on multiple-choice data reason differently than those trained on free-form text? The research isolates format and domain effects to measure which one matters more.
format and frequency both override domain content