INQUIRING LINE

Reasoning, Retrieval, and Evaluation · Training, RL, and Test-Time Scaling · Model Architecture and Internals · cross-cluster

How do cyclic learning rates anti-correlate with weight decay to create diversity?

This reads as a question about a specific training trick: cycling the learning rate so that it pulls against weight decay and keeps a model's outputs varied. The corpus doesn't hold that exact mechanism, so the honest answer maps the adjacent territory it does cover: how cyclic training dynamics and diversity preservation interact.

The question is whether an oscillating learning rate, working against weight decay, can keep a model from collapsing into sameness. The library has no note on that precise optimizer recipe: no paper here pairs cyclic learning-rate schedules with weight-decay regularization as opposing levers for diversity. What it does have is two threads that, read together, explain why someone would reach for such a trick. The first thread is what cyclic training actually does to a network. Models finetuned on cyclically repeated documents show 'anticipatory recovery': they restore performance on a document *before* re-encountering it, and the effect strengthens with scale (see: Do networks recover from forgetting before re-encountering documents?). That is evidence that periodic structure in training changes learning dynamics in non-obvious ways, rather than producing simple monotonic forgetting followed by re-learning. So the intuition behind cycling something to escape a collapsed state has some grounding here, with the caveat that the cycling in that result is over the order of the data, not the learning rate.
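For readers who want the recipe in the question made concrete, here is a minimal sketch of a triangular cyclic learning rate next to the per-step shrink factor of decoupled (AdamW-style) weight decay. The function names and default values are illustrative assumptions, not taken from any note in the library. One thing the sketch makes visible: with decoupled decay applied as `1 - lr * weight_decay`, the decay strength rises and falls together with the learning rate, so an anti-correlated scheme would have to schedule the decay coefficient explicitly.

```python
def triangular_lr(step, base_lr=1e-4, max_lr=1e-3, half_cycle=100):
    """Triangular cyclic learning rate: base_lr -> max_lr -> base_lr.

    All defaults are hypothetical values chosen for illustration.
    """
    cycle_pos = step % (2 * half_cycle)
    # frac is 0 at the start/end of a cycle and 1 at the peak.
    frac = 1.0 - abs(cycle_pos / half_cycle - 1.0)
    return base_lr + (max_lr - base_lr) * frac


def decoupled_decay_factor(lr, weight_decay=0.01):
    """Multiplicative shrink applied to each weight per step under
    decoupled weight decay: w <- w * (1 - lr * weight_decay)."""
    return 1.0 - lr * weight_decay


if __name__ == "__main__":
    for step in (0, 50, 100, 150, 200):
        lr = triangular_lr(step)
        print(step, lr, decoupled_decay_factor(lr))
```

Whether opposing the two schedules actually preserves output diversity is exactly the claim the library has no evidence for; the sketch only shows what the two levers are.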

The second, much richer thread is the thing your question is really chasing: diversity loss as a failure mode, and what fights it. The corpus is emphatic that standard training *collapses* diversity. RL that rewards only final-answer correctness sharpens the policy globally, concentrating probability mass and bleeding diversity even on problems it hasn't solved (see: Does outcome-based RL diversity loss spread across unsolved problems?). The same entropy-collapse mechanism shows up in search agents, where RL squeezes exploration while SFT on diverse demonstrations preserves it (see: Does reinforcement learning squeeze exploration diversity in search agents?). RL even quietly converges on one dominant pretraining format and suppresses the alternatives within a single epoch (see: Does RL training collapse format diversity in pretrained models?). So 'diversity' is something training actively destroys unless you intervene, which is exactly the pressure a cycling-against-decay scheme would be designed to relieve.
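The phrase "concentrating probability mass" can be illustrated with a toy calculation. The sketch below is an assumption-laden stand-in, not a model of RL itself: it sharpens a small categorical distribution by raising each probability to a power and renormalizing, then reports Shannon entropy. The example distribution and the `sharpen` helper are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy in nats of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def sharpen(probs, power):
    """Raise each probability to `power` and renormalize.

    power > 1 moves mass toward the already-likely outcomes; this is a
    toy proxy for policy sharpening, not an RL update.
    """
    raised = [p ** power for p in probs]
    total = sum(raised)
    return [r / total for r in raised]


if __name__ == "__main__":
    policy = [0.4, 0.3, 0.2, 0.1]  # hypothetical distribution over 4 strategies
    for power in (1, 2, 4):
        print(power, round(entropy(sharpen(policy, power)), 3))
```

Entropy falls as the sharpening power grows, which is the sense in which a sharpened policy has less diversity left to sample from.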

Where the corpus gets interesting is on what *counteracts* collapse, and almost none of it is an optimizer hyperparameter. Step-level critique inside the training loop fights tail-narrowing and keeps solution diversity alive across self-training rounds (see: Do critique models improve diversity during training itself?). Explicitly rewarding semantic diversity during RL doesn't just preserve variety: it catalyzes exploration and yields *higher* quality than quality-only baselines (see: Can diversity optimization improve quality during language model training?). And when a model feeds into a downstream search procedure, training it to emit many competent solutions beats optimizing a single scalar, because the search can then recombine modes a collapsed policy can never reach (see: Should training maximize diversity when models feed into search?). The lesson the corpus keeps returning to: diversity is engineered through the objective and the feedback signal, not coaxed out of the learning-rate schedule.
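To make "engineered through the objective" concrete, here is a minimal sketch of a reward that adds a diversity bonus to a quality score within a group of sampled responses. It is not the method of any cited note: the source describes a learned classifier for semantic diversity, whereas this sketch substitutes a crude Jaccard distance over word sets, and the additive form and `weight` value are assumptions made for illustration.

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| over whitespace-split word sets."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    if not union:
        return 0.0
    return 1.0 - len(sa & sb) / len(union)


def diversity_bonus(index, responses):
    """Mean distance from responses[index] to every other response."""
    others = [r for i, r in enumerate(responses) if i != index]
    if not others:
        return 0.0
    target = responses[index]
    return sum(jaccard_distance(target, o) for o in others) / len(others)


def shaped_rewards(responses, quality, weight=0.5):
    """Quality plus a weighted diversity bonus (hypothetical combination)."""
    return [q + weight * diversity_bonus(i, responses)
            for i, q in enumerate(quality)]


if __name__ == "__main__":
    group = ["sort then scan", "sort then scan", "use a hash map"]
    print(shaped_rewards(group, quality=[1.0, 1.0, 1.0]))
```

With equal quality scores, the response that differs from its siblings receives the larger shaped reward, so the training signal itself pushes against collapse.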

There's also a quieter, mechanistic version of your anti-correlation intuition worth knowing about. Several notes frame diversity preservation as *staying close to the base model*: low KL drift from the base distribution preserves plasticity and the ability to keep learning new tasks (see: Does staying close to the base model preserve learning ability?), and decoding-time proxy tuning beats weight fine-tuning on knowledge tasks precisely by leaving base weights untouched (see: Can decoding-time tuning preserve knowledge better than weight fine-tuning?). That reframes 'weight decay vs. learning rate' into something the corpus genuinely has a position on: how much you're allowed to move the weights at all. And one caution before generalizing any of this: diversity effects are domain-dependent. The same preference tuning that reduces lexical-syntactic diversity in code *increases* it in creative writing (see: Does preference tuning always reduce diversity the same way?). So 'creates diversity' is never a property of the training trick alone, but of the trick crossed with what the task rewards.
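The "staying close to the base model" framing above can be made measurable with a KL divergence between a tuned model's output distribution and the base model's. The sketch below is a toy under stated assumptions: the logits are made up, the distributions are over four tokens, and the direction KL(tuned || base) is a choice made here, since the notes do not specify one.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


if __name__ == "__main__":
    # Hypothetical next-token logits for a base model and two tuned variants.
    base = softmax([1.0, 0.5, 0.0, -0.5])
    small_update = softmax([1.2, 0.5, 0.0, -0.5])
    large_update = softmax([4.0, 0.5, 0.0, -0.5])
    print("small drift:", kl_divergence(small_update, base))
    print("large drift:", kl_divergence(large_update, base))
```

A training method that keeps this number low across prompts is, in the notes' terms, one that has moved less from the base distribution.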

Sources 10 notes

Do networks recover from forgetting before re-encountering documents?

Language models finetuned on cyclically repeated documents exhibit anticipatory recovery—restoring performance on a document before encountering it again—a phenomenon that emerges and strengthens with model scale, contradicting monotonic catastrophic interference.

Does outcome-based RL diversity loss spread across unsolved problems?

RL that rewards only final answer correctness sharpens the policy globally, concentrating probability mass on correct trajectories for solved problems while simultaneously reducing diversity on unsolved ones. Historical exploration (training diversity via UCB-style bonuses) and batch exploration (test-time diversity via repetition penalties) require structurally different mechanisms.

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Do critique models improve diversity during training itself?

Step-level critique in the training loop counteracts tail narrowing and maintains solution diversity across self-training iterations. This training-time benefit—preventing premature convergence—is more fundamental than test-time accuracy gains.

Can diversity optimization improve quality during language model training?

DARLING jointly optimizes for quality and semantic diversity using a learned classifier, finding that diversity rewards catalyze exploration and produce higher-quality outputs than quality-only baselines across both creative and mathematical tasks.

Should training maximize diversity when models feed into search?

Vector Policy Optimization trains models to emit varied competent solutions rather than converging to one answer. This unlocks search procedures like evolutionary algorithms to explore and combine modes, solving problems that entropy-collapsed policies cannot reach at all.

Does staying close to the base model preserve learning ability?

FST-trained models stay up to 70% closer to their base distribution than parameter-only RL, and this reduced drift preserves the model's ability to learn subsequent tasks effectively. Parameter-only approaches stall when task domains change, while low KL drift enables sustained adaptation.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.

Does preference tuning always reduce diversity the same way?

RLHF reduces lexical-syntactic diversity in code generation but increases it in creative writing. The direction depends on what each domain incentivizes: code rewards convergence toward correct solutions, while creative writing rewards stylistic distinctiveness.
