INQUIRING LINE

Information theory says data order is irrelevant — but that assumes the learner stays the same, which neural networks never do.

Why does curriculum order matter when information theory says data order is irrelevant?

This line explores a real tension. Classical information theory treats a dataset's total information as order-invariant, yet curriculum learning keeps showing that the sequence in which examples are fed changes the outcome. The corpus's answer is that order matters because the learner isn't a fixed container: it's a moving target that reshapes itself as it reads.

The math says a dataset carries the same information no matter how you shuffle it, so why should sequencing matter at all? The resolution running through the corpus: information theory's order-invariance assumes a *fixed* receiver absorbing a joint distribution. A model under gradient descent is not fixed. It is a path-dependent system whose current state decides what the next example can even teach it. Order matters because the learner is non-stationary, not because the data's information content changes.
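A toy illustration of the path-dependence claim, not taken from any of the notes: a one-parameter model trained by plain SGD on two conflicting examples ends up in a different place depending on which example it sees last, while a statistic computed by a "fixed receiver" (here, the mean of the targets) is unchanged by shuffling. The function name `sgd_pass`, the learning rate, and the data points are all assumptions chosen for illustration.

```python
def sgd_pass(w, examples, lr=0.25):
    """One pass of plain SGD on squared error for the 1-D model y = w * x."""
    for x, y in examples:
        grad = 2 * (w * x - y) * x  # derivative of (w*x - y)^2 with respect to w
        w -= lr * grad              # the update depends on the current w
    return w


# Two examples with the same input and conflicting targets (illustrative data).
data = [(1.0, 1.0), (1.0, 3.0)]

w_forward = sgd_pass(0.0, data)
w_reverse = sgd_pass(0.0, list(reversed(data)))

# A fixed receiver: a statistic of the whole dataset ignores order.
mean_forward = sum(y for _, y in data) / len(data)
mean_reverse = sum(y for _, y in reversed(data)) / len(data)

print(w_forward, w_reverse)        # 1.75 1.25
print(mean_forward, mean_reverse)  # 2.0 2.0
```

The same two examples carry the same information in both orders. Only the learner's trajectory differs, because each gradient is evaluated at whatever state the previous example left behind.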

Several notes make this concrete from different angles. The sharpest reframing is that curriculum isn't really about "easy then hard" pedagogy at all; it's about distance from where the model already is. Rare-to-common ordering beats standard curricula because rarity signals a gap in the model's existing distribution, not conceptual difficulty (see "Does ordering training data by rarity actually improve language models?"). The same logic appears in reverse: teacher-refined data that is objectively higher quality *degrades* a student when it lands beyond the student's current learning frontier (see "Does teacher-refined data always improve student model performance?"). "Better data" and "learnable-right-now data" are different things, and only the second one survives contact with a moving learner.
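One way to picture rare-first ordering is to score each example by how infrequent its tokens are under some reference corpus and train on the highest-scoring examples first. The sketch below is an assumption-laden illustration, not the CTFT procedure: the whitespace tokenizer, the add-one smoothing, the scoring formula, and the names `rarity_score` and `rare_first` are all hypothetical.

```python
import math
from collections import Counter


def rarity_score(text, counts, total):
    """Mean negative log relative frequency of the text's tokens (add-one smoothed)."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    vocab = len(counts) + 1  # one extra slot for unseen tokens
    return sum(-math.log((counts[t] + 1) / (total + vocab)) for t in tokens) / len(tokens)


def rare_first(examples, reference_corpus):
    """Return examples sorted from rarest to most common under the reference corpus."""
    counts = Counter(tok for doc in reference_corpus for tok in doc.lower().split())
    total = sum(counts.values())
    return sorted(examples, key=lambda ex: rarity_score(ex, counts, total), reverse=True)


# Illustrative data: the reference corpus stands in for the pre-training distribution.
reference = ["the cat sat on the mat", "the dog sat on the rug"]
examples = ["the cat sat", "quasar flux anomaly", "the dog on the mat"]

ordered = rare_first(examples, reference)
print(ordered[0])  # quasar flux anomaly
```

In practice the rarity signal would more plausibly come from the model itself (for example, its likelihood of the text) than from raw token counts. The counts here only stand in for "distance from what the model has already seen".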

The most striking case is when order decides whether a signal is informative *at all*. Running imitation training first and verifiable-reward training second beats either alone, because the imitation phase produces reasonable attempts that make the later reward signal meaningful; reorder it and the reward has nothing to sharpen (see "Does sequencing imitation then exploration training improve reasoning?"). Order also has mechanical, almost physical effects on the model's internal dynamics: training structured tasks before creative ones prevents entropy collapse from wrecking open-ended ability, a 6.2% gain over joint training that pure data content can't explain (see "Does training order reshape how models handle different task types?"). And clever sequencing can manufacture supervision that the raw data never contained: sliding a reasoning start state backward turns plain outcome feedback into step-level guidance (see "Can curriculum learning approximate expensive process supervision?").
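The start-state sliding idea can be sketched as a list of stages built from a single demonstration: the first stage hands the model almost the whole solution and asks only for the last step, and each later stage withholds one more step until the model starts from the bare question. This is a minimal sketch of the scheduling idea only, assuming a demonstration is already split into steps. The function name `reverse_curriculum` and the dictionary layout are hypothetical, and the reward model and RL update are omitted.

```python
def reverse_curriculum(question, demo_steps):
    """Return a list of stages whose start state slides backward through the demo.

    Stage k gives the model the question plus the first (n - k) demo steps,
    so it must produce the remaining k steps and is scored on the outcome only.
    """
    n = len(demo_steps)
    stages = []
    for steps_left in range(1, n + 1):
        given = demo_steps[: n - steps_left]  # prefix the model starts from
        stages.append({"steps_left": steps_left, "prompt": [question] + given})
    return stages


stages = reverse_curriculum("Q", ["s1", "s2", "s3"])
for stage in stages:
    print(stage["steps_left"], stage["prompt"])
# 1 ['Q', 's1', 's2']
# 2 ['Q', 's1']
# 3 ['Q']
```

If the model succeeds from the stage with two steps given but fails from the stage with one, the failure is localized to the newly withheld step, even though only the final outcome was ever scored.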

The thing you may not have known you wanted to know: this is the *opposite* of how models treat order at inference time. By default LLMs largely ignore the temporal order of a user's interaction history when ranking, and you have to prompt order-sensitivity back in (see "Why do language models ignore temporal order in ranking?"). So order is nearly invisible to a frozen model reading a sequence, yet decisive for a model *learning* from one. The corpus even hints at a deeper version of this asymmetry: format and presentation shape reasoning strategy far more than the underlying content domain does (see "Does training data format shape reasoning strategy more than domain?"), and an ordering of few-shot demonstrations can be built from the model's own representation sparsity, without any external difficulty labels (see "Can representation sparsity order few-shot demonstrations effectively?"). Information theory measures what's in the data; curriculum measures what a particular learner, in a particular state, can pick up next.
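The sparsity-based ordering can be pictured as follows: measure how sparse the model's last-layer activations are for each demonstration, then present demonstrations from sparse (treated as harder) to dense (treated as easier). The sketch below is an assumption throughout. The activation vectors are made up, the near-zero threshold `eps` is an arbitrary choice, the sparsity measure is one plausible definition rather than the one used in the cited work, and `sparsity` and `order_sparse_to_dense` are hypothetical names.

```python
def sparsity(activations, eps=1e-3):
    """Fraction of activation entries whose magnitude is below eps."""
    return sum(1 for a in activations if abs(a) < eps) / len(activations)


def order_sparse_to_dense(demos, eps=1e-3):
    """Return demo texts ordered from the sparsest activations to the densest."""
    ranked = sorted(demos, key=lambda d: sparsity(d["acts"], eps), reverse=True)
    return [d["text"] for d in ranked]


# Illustrative demonstrations with made-up last-layer activation vectors.
demos = [
    {"text": "demo A", "acts": [0.9, 0.4, 0.7, 0.2]},  # sparsity 0.0
    {"text": "demo B", "acts": [0.0, 0.0, 0.0, 0.5]},  # sparsity 0.75
    {"text": "demo C", "acts": [0.0, 0.3, 0.0, 0.6]},  # sparsity 0.5
]

print(order_sparse_to_dense(demos))  # ['demo B', 'demo C', 'demo A']
```

The point of the design is that the ordering signal is read off the learner's own state, so no human-assigned difficulty labels are needed.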

Sources 8 notes

Does ordering training data by rarity actually improve language models?

CTFT fine-tunes LLMs on rare data first because rarity signals distributional weakness, not conceptual difficulty. This reframes curriculum learning as managing distance from pre-training distribution rather than pedagogical scaffolding.

Does teacher-refined data always improve student model performance?

Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.

Does sequencing imitation then exploration training improve reasoning?

Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.

Does training order reshape how models handle different task types?

Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.

Can curriculum learning approximate expensive process supervision?

R3 progressively slides the reasoning start state backward from near-completion, creating a curriculum that reveals step-level failure modes using only outcome feedback. This achieves process supervision granularity without expensive human step annotations.

Why do language models ignore temporal order in ranking?

LLMs can extract preferences from interaction histories but disregard temporal order by default. Recency-focused prompts and in-context examples activate latent order-sensitivity, improving ranking without retraining.

Does training data format shape reasoning strategy more than domain?

Models trained on multiple-choice data adopt breadth-first exploration (Cohen's d up to 1.5), while free-form training produces depth-first reasoning. Format effect dwarfs domain effect, meaning presentation matters far more than content type.

Can representation sparsity order few-shot demonstrations effectively?

Sparsity-Guided Curriculum In-Context Learning uses last-layer activation sparsity to order demonstrations from sparse (harder) to dense (easier), yielding considerable performance improvements. This approach requires no external difficulty labels and works across diverse in-context learning tasks.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst auditing curriculum-learning claims in LLM training. The core question: why does training order matter when information theory says data order is irrelevant?

What a curated library found — and when (dated claims, not current truth):
Findings span 2020–2026, tracing a shift from static information theory to path-dependent learning dynamics.
• Curriculum order is decisive because learners are non-stationary: a model's current state gates what the next example can teach (2024–2026).
• Rare-to-common ordering beats "easy-to-hard" because rarity signals distribution gaps in the *current* model, not objective difficulty (2026).
• Teacher-refined data can degrade student models if it exceeds the student's learning frontier; learnable-now data ≠ objectively better data (2024–2025).
• Imitation-then-reward sequencing produces meaningful supervision that reordered training cannot; order decides whether a signal teaches anything (2024).
• Frozen LLMs largely ignore temporal order at inference (in ranking tasks, ranking consistency is ~70–80% insensitive to sequence order), yet order is decisive during training (2023–2024).
• Format and sparsity patterns shape reasoning strategy more than domain content; curriculum can be built from representation sparsity without external labels (2025–2026).

Anchor papers (verify; mind their dates):
• arXiv:2402.05808 (2024-02) — Reverse Curriculum RL: reordering outcome feedback into step-level guidance.
• arXiv:2507.14783 (2025-07) — Omni-Thinker: multi-task RL entropy dynamics; structured-before-creative prevents collapse.
• arXiv:2604.02176 (2026-04) — Adam's Law: textual frequency as a curriculum signal.
• arXiv:2305.08845 (2023-05) — Zero-Shot Rankers: LLM insensitivity to sequence order at inference.

Your task:
(1) RE-TEST THE FROZEN-vs-LEARNING ASYMMETRY. The library claims order is nearly invisible to frozen models but decisive during training. Has this split widened or collapsed in the last 6 months? Check whether retrieval-augmented generation, in-context learning harnesses, or multi-turn prompting frameworks have restored order-sensitivity at inference. Separately: has adaptive scheduling, online curriculum, or dynamic task mixing further sharpened the training-time effect? Plainly state where the asymmetry still holds.
(2) SURFACE CONTRADICTIONS. The library hints that format/sparsity matter more than content, yet earlier work emphasizes difficulty ranking. Find the strongest recent work that either reconciles these (format is the mechanism by which difficulty becomes learnable) or contradicts one outright. Flag any paper from the last 3 months that rejects curriculum order's importance entirely.
(3) PROPOSE 2 NEXT-REGIME QUESTIONS: (a) If order's effect is mediated by the model's current representational state, can we *predict* optimal ordering from model checkpoints without running full retraining? (b) Does curriculum learning decouple from task choice — i.e., will *any* ordering of a fixed task sequence outperform random if the learner's state is known, or are some task structures order-invariant?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
