SYNTHESIS NOTE

Does ordering training data by rarity actually improve language models?

Can sorting rare sentences before common ones during fine-tuning help LLMs learn more effectively? The idea challenges the intuition that models should see easy examples first.

Synthesis note · 2026-05-02 · sourced from Natural Language Inference

Curriculum Textual Frequency Training (CTFT) is the third component of the Adam's Law framework, and it inverts the usual direction of curriculum learning. Standard curriculum learning sorts examples from easy to hard along a conceptual difficulty axis: simple arithmetic before multi-step proofs, short translations before long ones. CTFT instead sorts examples by sentence-level corpus frequency, feeding the model rare sentences first and common sentences last. Rare sentences come first because they lie where the model's prior is weakest; saving the dense, well-modeled region for the end stabilizes the training trajectory.
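
The rare-first ordering can be sketched in a few lines of Python. This is a minimal illustration, not the method as published: the note does not say how sentence-level frequency is computed, so the sketch assumes it is the mean smoothed log unigram frequency of a sentence's whitespace tokens, measured against a small reference corpus. The names token_log_freqs, sentence_frequency, and ctft_order are hypothetical.

```python
import math
from collections import Counter


def token_log_freqs(reference_corpus):
    """Return a function mapping a token to its add-one-smoothed log relative
    frequency in the reference corpus (assumed proxy for the model's prior)."""
    counts = Counter(
        tok for sent in reference_corpus for tok in sent.lower().split()
    )
    total = sum(counts.values())
    vocab = len(counts)

    def log_freq(tok):
        # The extra +1 in the denominator reserves mass for unseen tokens.
        return math.log((counts.get(tok, 0) + 1) / (total + vocab + 1))

    return log_freq


def sentence_frequency(sentence, log_freq):
    """Assumed sentence-level score: mean token log frequency.
    Lower means rarer; an empty sentence scores as the rarest possible."""
    toks = sentence.lower().split()
    if not toks:
        return float("-inf")
    return sum(log_freq(t) for t in toks) / len(toks)


def ctft_order(examples, reference_corpus):
    """Sort training examples low-to-high frequency: rare first, common last."""
    log_freq = token_log_freqs(reference_corpus)
    return sorted(examples, key=lambda s: sentence_frequency(s, log_freq))


if __name__ == "__main__":
    reference = [
        "the cat sat on the mat",
        "the dog sat on the rug",
        "the cat saw the dog",
    ]
    examples = ["the cat sat", "quantum zygote flux", "the zygote sat"]
    for sentence in ctft_order(examples, reference):
        print(sentence)
```

In practice the frequency estimate would come from a much larger corpus or from the model itself; only the sort direction (ascending frequency) is taken from the note.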

The reframe matters more than the technique. For an LLM, "easy" and "hard" are not properties of the concept being expressed; they are properties of the distance from the pre-training distribution. A formally simple sentence in a rare register can be harder for the model than a complex sentence in a textbook register. This connects to the note "Does gradually tightening token budgets beat fixed budget training?": both findings argue that curriculum design for LLMs is fundamentally about managing distributional pressure, not pedagogical scaffolding. It also extends the note "Does training data format shape reasoning strategy more than domain?": format and frequency are both statistical-position properties, and they drive learning more than the semantic content of the examples does.

The methodological lesson generalizes beyond CTFT itself. Any curriculum-design choice for LLMs that uses the human-facing "easy/hard" gloss without checking distributional position is partly mis-specified. The replacement frame is "near/far from prior": the model finds near-prior examples easy not because they are simple but because they are dense, and far-prior examples hard not because they are complex but because they are sparse. CTFT's contribution is operationalizing that frame as a concrete sentence-frequency ordering, with story-completion distillation (TFD) as the workaround for estimating frequencies on closed-source models whose training data cannot be inspected directly.

Original note title

curriculum textual frequency training reverses easy-to-hard intuition by ordering data low-to-high frequency