How does distributional distance from pre-training relate to model difficulty?
This explores how a model's distance from its pre-training distribution — how far you push it during training or how far a problem sits from what it saw — shapes what looks 'hard,' and how that distance can quietly corrupt capability rather than extend it.
The corpus suggests something counterintuitive: much of what we read as 'difficulty' is really distance from the training distribution in disguise. That holds both for the apparent difficulty of problems and for the real cost of training a model away from where it started.
The clearest case is reasoning length. You'd assume a model writes longer chains of thought because a problem is harder, but controlled A* maze experiments show that trace length tracks difficulty only when the problem is in-distribution and decouples entirely out of distribution (see: Does longer reasoning actually mean harder problems?). Long traces primarily reflect recall of familiar training schemas, not adaptive effort. So 'hard' and 'far from pre-training' get conflated, and the visible signal of struggle is unreliable.
The more striking thread is that distance is something training actively spends, and overspending it backfires. Training on problems that sit too far out, such as nearly impossible RLVR samples, doesn't stretch the model; it teaches degenerate shortcuts that then contaminate skills the model already had (see: Do overly hard RLVR samples actually harm model capabilities?). The same pattern appears in distillation: teacher-refined data that exceeds a student's 'learning frontier' degrades the student even when the data is objectively higher quality, so students should filter for what is compatible with their own distribution (see: Does teacher-refined data always improve student model performance?). Difficulty isn't absolute. It is relative to where the model already lives.
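The source note on RLVR attributes the shortcut problem to group-relative normalization, which treats rare accidental successes as high-advantage trajectories. A minimal sketch of that arithmetic follows. It assumes the common GRPO-style form, advantage = (reward - group mean) / group standard deviation, which the notes do not spell out; the function name, the epsilon term, and the reward groups are hypothetical and chosen only for illustration.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group.

    Assumed GRPO-style form: (r - mean) / (std + eps).
    The eps term only guards against division by zero.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical nearly-impossible problem: 1 lucky success in 16 samples.
hard_group = [1.0] + [0.0] * 15
# Hypothetical moderate problem: 8 successes in 16 samples.
moderate_group = [1.0] * 8 + [0.0] * 8

hard_adv = group_relative_advantages(hard_group)
moderate_adv = group_relative_advantages(moderate_group)

# The lone success on the hard problem gets a much larger advantage
# than any single success on the moderate problem.
print(f"hard success advantage:     {hard_adv[0]:.2f}")
print(f"moderate success advantage: {moderate_adv[0]:.2f}")
```

Under these assumptions the single success in the hard group is weighted several times more heavily than a success in the moderate group, whatever produced it. If that success came from answer repetition or skipped computation, that is the behavior being reinforced.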
This reframes staying close to pre-training as a resource rather than a limitation. Models trained to drift less from their base distribution preserve their ability to keep learning new tasks, while methods that wander far stall when domains shift (see: Does staying close to the base model preserve learning ability?). Decoding-time proxy tuning preserves pre-trained knowledge precisely because it never moves the base weights: it applies distributional shifts that mainly touch style and reasoning instead of corrupting the lower layers where knowledge is stored (see: Can decoding-time tuning preserve knowledge better than weight fine-tuning?). That layered picture is consistent with emulated fine-tuning results, in which scaling pre-training builds factual knowledge in lower layers while scaling fine-tuning reshapes behavior in upper ones (see: Do pretraining and fine-tuning scale independently in language models?). Push too hard on the wrong layers and you forget more than you gain.
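To make "never moves the base weights" concrete, here is a small sketch of a decoding-time shift. It assumes the commonly described proxy-tuning formulation, in which the base model's next-token logits are offset by the difference between a small tuned "expert" and its untuned "anti-expert". The notes here do not state that formula, so treat it, the function names, and the toy logits as assumptions for illustration.

```python
import math


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def proxy_tuned_logits(base, expert, anti_expert):
    """Shift base logits by what tuning changed in the small model.

    Assumed form: base + (expert - anti_expert), applied per token.
    No model weights are modified; only the output logits are combined.
    """
    return [b + (e - a) for b, e, a in zip(base, expert, anti_expert)]


# Toy next-token logits over a 3-token vocabulary (hypothetical values).
base = [2.0, 1.0, 0.0]          # large untuned model
expert = [0.5, 1.5, 0.0]        # small tuned model
anti_expert = [0.5, 0.5, 0.0]   # same small model before tuning

shifted = proxy_tuned_logits(base, expert, anti_expert)
probs = softmax(shifted)
print(shifted)
print([round(p, 3) for p in probs])
```

In this sketch the expert and anti-expert differ only on the second token, so only that token's logit moves. Where the small model's tuning changed nothing, the base model's preferences pass through unchanged.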
A less expected consequence: training choices that help in-distribution can narrow what the model does everywhere else. Richer teacher context, where the teacher is conditioned on correct answers and verifier output, produces confident, concise traces that win in-domain but degrade out of distribution, because the student stops expressing the uncertainty that hard, novel problems demand (see: Does richer teacher context hurt student generalization?). And RL doesn't expand a model so much as collapse it onto a single dominant pre-training format within the first epoch, with the winning format depending on model scale rather than necessarily on performance (see: Does RL training collapse format diversity in pretrained models?). So the through-line is this: difficulty for a model is mostly a story about distance, and the methods that work are the ones that respect how far they can move it before capability starts leaking out.
Sources (8 notes)
Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.
FST-trained models stay up to 70% closer to their base distribution than parameter-only RL, and this reduced drift preserves the model's ability to learn subsequent tasks effectively. Parameter-only approaches stall when task domains change, while low KL drift enables sustained adaptation.
Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.
Emulated Fine-Tuning reveals that scaling pretraining improves factual knowledge while scaling fine-tuning improves behavioral helpfulness. This decoupling has architectural roots: pretraining enriches lower-layer knowledge storage, while fine-tuning modifies upper-layer behavior expression.
Teachers conditioned on correct answers and verifier output produce confident, concise traces that students inherit. This style suppresses uncertainty expression, optimizing in-domain performance while degrading generalization to out-of-distribution problems that require epistemic caution.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.