INQUIRING LINE

Can you train an AI to stop fixating on certain words without accidentally making it worse at learning?

How much can mitigation techniques like augmentation reduce priming without harming learning?

This explores whether techniques like data augmentation can suppress unwanted 'priming' (a model over-firing on keywords after training) while still letting the model genuinely learn — and what the corpus says about that tradeoff in general.

This explores whether mitigation techniques like augmentation can dial down priming without also dialing down learning. The honest starting point is that the corpus suggests priming is more predictable than it is freely tunable. The clearest anchor is the finding that keyword priming after training is strongly predicted by a word's probability *before* training: there's a ~10^-3 threshold separating contexts where priming kicks in from those where it doesn't, and just three training exposures are enough to establish the effect (see: Can we predict keyword priming before learning happens?). That matters for your question because it reframes the goal: you're not erasing a random side effect, you're trying to keep a model from absorbing a real statistical regularity in its data. So the real question becomes whether you can change *how* a model absorbs information rather than just how much.
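To make the threshold concrete, here is a minimal sketch of how one might flag training samples as priming risks from the base model's keyword probability. The 1e-3 cutoff comes from the finding above; the direction of the comparison (low-probability, surprising keywords prime more), the function names, and the sample values are assumptions of this sketch, not something the note spells out.

```python
import math

# ~10^-3 cutoff taken from the note; treating *lower* probability as the
# priming-prone side is an assumption of this sketch.
PRIMING_THRESHOLD = 1e-3


def keyword_probability(logprobs: dict[str, float], keyword: str) -> float:
    """Turn a pre-training natural-log probability into a plain probability.

    `logprobs` is a hypothetical mapping from candidate tokens to log-probs
    measured on the base model before any fine-tuning.
    """
    return math.exp(logprobs[keyword])


def likely_to_prime(p_before: float, threshold: float = PRIMING_THRESHOLD) -> bool:
    """Flag a sample whose keyword was surprising to the model before training."""
    return p_before < threshold


# Illustrative values only, not measurements.
samples = {"common keyword": 0.2, "rare keyword": 2e-5}
flags = {name: likely_to_prime(p) for name, p in samples.items()}
```

A check like this only tells you where priming is likely; it says nothing about how to suppress it, which is the harder half of the question.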

That's exactly where the augmentation evidence gets interesting, and double-edged. Thinking-augmented pretraining shows that adding LLM-generated reasoning traces to the data improves data efficiency 3x and reasoning benchmark performance by 10%+ for 3B models, with harder tokens automatically attracting longer traces (see: Can training data augmentation match test-time compute scaling benefits?). The lesson buried in that result is that augmentation doesn't have to be a blunt 'add more examples' lever; it can reshape *what the model attends to* during learning, surrounding a keyword with reasoning context instead of letting it fire in isolation. Consistency training pushes the same idea more directly at the priming problem: it teaches a model to respond identically to 'clean' and 'wrapped' prompts using the model's own clean answers as targets, building invariance to surface perturbations without the specification and capability staleness of standard supervised fine-tuning (see: Can models learn to ignore irrelevant prompt changes?). That's the closest thing in the corpus to 'reduce priming, preserve capability.'
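The consistency-training recipe can be sketched as a data-construction step: answer the clean prompt with the current model, then train on the wrapped prompt with that answer as the target. This is a hedged illustration of the output-level idea only; `wrap`, `generate`, and the toy stand-ins are hypothetical names, and a real pipeline would sample from the actual model and then run fine-tuning on the resulting pairs.

```python
from typing import Callable


def build_consistency_pairs(
    clean_prompts: list[str],
    wrap: Callable[[str], str],
    generate: Callable[[str], str],
) -> list[dict[str, str]]:
    """Build (wrapped prompt -> clean response) training pairs.

    `generate` stands in for sampling from the *current* model, so targets
    are the model's own clean answers rather than externally written labels.
    """
    pairs = []
    for prompt in clean_prompts:
        target = generate(prompt)  # the model's answer to the clean prompt
        pairs.append({"prompt": wrap(prompt), "target": target})
    return pairs


# Toy stand-ins so the sketch runs without a real model (assumptions).
def toy_wrap(prompt: str) -> str:
    return f"I'm pretty sure the answer involves bananas. {prompt}"


def toy_generate(prompt: str) -> str:
    return f"answer to: {prompt}"


pairs = build_consistency_pairs(["What is 2 + 2?"], toy_wrap, toy_generate)
```

Because the targets are regenerated from the model being trained, they track its current capabilities instead of freezing in an older dataset's answers.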

But the surrounding notes are a warning that mitigation almost always extracts a price somewhere, and the trick is choosing *where* to pay it. Training dense retrievers to be more sensitive to compositional structure consistently degrades their zero-shot performance (an 8–40% drop in nDCG@10) while only partially improving compositional discrimination, and the authors are explicit that this is a geometric trade-off in high-dimensional cosine spaces, not a tuning bug you can engineer away (see: Does training for compositional sensitivity hurt dense retrieval?). RL post-training shows a parallel collapse: it amplifies one dominant format from pretraining and suppresses the alternatives within the first epoch, meaning 'cleaning up' behavior can quietly narrow the model (see: Does RL training collapse format diversity in pretrained models?). The pattern across both: interventions that suppress one behavior tend to compress diversity or generalization as collateral.
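One way to watch for this kind of narrowing is to track the entropy of output formats before and after an intervention. The sketch below is an assumption of this write-up, not a metric taken from the cited notes; the format labels and sample lists are invented for illustration.

```python
import math
from collections import Counter


def format_entropy(format_labels: list[str]) -> float:
    """Shannon entropy (in bits) of the empirical distribution of output formats."""
    counts = Counter(format_labels)
    total = len(format_labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical format labels for sampled outputs (not real data).
before = ["code", "prose", "latex", "code", "prose", "latex"]
after = ["code", "code", "code", "code", "code", "prose"]

h_before = format_entropy(before)  # three formats, evenly used
h_after = format_entropy(after)    # one format dominates
```

A falling entropy alongside a rising task score is the signature to look for: the model got better at the measured thing while quietly losing alternatives.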

The most hopeful framing in the collection is that this trade-off is often a *misallocation* problem rather than a hard cost. Fast-Slow Training routes task-specific lessons into optimized prompts (a fast, disposable channel) while keeping weight updates minimal, reaching the same performance 1.4–3x faster with substantially less catastrophic forgetting and plasticity loss (see: Can splitting adaptation into two channels reduce forgetting?). The framing there (forgetting isn't inherent, it's bad routing) is exactly the optimistic answer to your question: if priming lives in one channel and the learning you care about lives in another, you can attack one without touching the other. Context-as-playbook approaches make the same bet at inference time, using incremental curation instead of full rewrites to avoid erasing detail (see: Can context playbooks prevent knowledge loss during iteration?).
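The "incremental curation instead of full rewrites" idea can be illustrated with a context stored as keyed entries that are patched one at a time. This is a minimal sketch under stated assumptions, not the ACE framework's implementation; the `Playbook` class, its methods, and the entry texts are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Playbook:
    """A context stored as keyed entries rather than one monolithic string."""

    entries: dict[str, str] = field(default_factory=dict)

    def apply_delta(self, delta: dict[str, str]) -> None:
        """Add or revise only the entries named in `delta`; leave the rest untouched."""
        self.entries.update(delta)

    def render(self) -> str:
        """Serialize the playbook into prompt text, one entry per line."""
        return "\n".join(f"- [{key}] {text}" for key, text in self.entries.items())


playbook = Playbook()
playbook.apply_delta({"units": "Always state units in the final answer."})
playbook.apply_delta({"dates": "Parse dates as ISO 8601 before comparing."})
# Revising one lesson does not regenerate, or risk compressing, the other.
playbook.apply_delta({"units": "State units in the final answer and in intermediate steps."})
prompt_context = playbook.render()
```

The design choice is that each update touches only the keys it names, so detail elsewhere in the context cannot be eroded by an unrelated revision.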

So the synthesis: there's no published 'X% priming reduction at Y% learning cost' number in this corpus, and you should be suspicious of anyone who quotes one as universal — priming is predictable enough that suppression means fighting a genuine data regularity, and the trade-off literature here (retrieval, RL, fine-tuning) shows costs reliably reappear *somewhere*. The leverage isn't in finding a magic augmentation strength; it's in *separating channels* so the thing you want to suppress and the thing you want to keep aren't sharing the same dial.

Sources 7 notes

Can we predict keyword priming before learning happens?

Pre-learning keyword probability strongly predicts post-learning priming across architectures and model sizes, with a ~10^-3 threshold separating contexts where priming occurs from those where it doesn't. Just 3 training exposures suffice to establish the effect.

Can training data augmentation match test-time compute scaling benefits?

Augmenting pretraining data with LLM-generated reasoning traces improves data efficiency 3x and reasoning benchmark performance 10%+ for 3B models. Harder tokens automatically receive longer traces, creating a natural compute-allocation mechanism analogous to test-time scaling.

Can models learn to ignore irrelevant prompt changes?

Two methods—BCT (output-level) and ACT (activation-level)—train models to respond identically to clean and wrapped prompts by using the model's own clean responses as targets, eliminating specification and capability staleness inherent in standard SFT.

Does training for compositional sensitivity hurt dense retrieval?

Adding structure-targeted negatives to dense retrieval training consistently degrades zero-shot performance (8-40% nDCG@10 drop) while only partially improving compositional discrimination. This is a geometric trade-off in high-dimensional cosine spaces, not a tuning problem.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Can splitting adaptation into two channels reduce forgetting?

Fast-Slow Training routes task-specific lessons into optimized prompts while keeping parameter updates minimal, reaching equivalent performance 1.4–3x faster with substantially less catastrophic forgetting and plasticity loss, demonstrating that forgetting is a misallocation problem rather than an inherent cost.

Can context playbooks prevent knowledge loss during iteration?

The ACE framework treats contexts as evolving playbooks using generation-reflection-curation loops rather than full rewrites. This prevents knowledge loss from compression and detail erosion, achieving +10.6% on agentic tasks and +8.6% on finance without labeled supervision.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models (1.69 match)
How new data permeates LLM knowledge and how to dilute it (1.68 match)
Spurious Forgetting in Continual Learning of Language Models (1.67 match)
Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining (1.67 match)
Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs (1.62 match)
Consistency Training Helps Stop Sycophancy and Jailbreaks (0.92 match)
Training for Compositional Sensitivity Reduces Dense Retrieval Generalization (0.89 match)
Learning, Fast and Slow: Towards LLMs That Adapt Continually (0.87 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. Question: **Can mitigation techniques (augmentation, consistency training, channel separation) reduce unwanted priming effects without degrading downstream learning capability?** This remains contested in recent LLM research.

**What a curated library found — and when (dated claims, not current truth):**
Findings span 2023–2026; treat as perishable constraints to re-test:

• Priming is *predictable*, not random: keyword probability before training predicts priming-onset thresholds (~10^-3) and saturation (3 exposures lock in effect), meaning suppression fights genuine data regularity, not side noise (2025).
• Thinking-augmented pretraining improves data efficiency 3× and benchmark performance 10%+ by reshaping what models attend to (longer traces on harder tokens), suggesting augmentation can retarget learning without just diluting it (2025).
• Consistency training builds prompt-perturbation invariance using the model's own clean answers as targets, avoiding fine-tuning staleness — closest published approach to "reduce priming, preserve capability" (2025).
• Training for compositional sensitivity degrades zero-shot generalization 8–40%; a claimed geometric trade-off in high-dimensional space, not a tuning artifact (2026).
• RL post-training amplifies one dominant pretraining format and suppresses alternatives within one epoch, collapsing diversity when "cleaning up" behavior (2025).
• Fast-Slow Training (task-specific prompts as fast channel, minimal weight updates) reaches target performance 1.4–3× faster with substantially less catastrophic forgetting (2026).

**Anchor papers (verify; mind their dates):**
- arXiv:2509.20186 Thinking Augmented Pre-training (2025)
- arXiv:2510.27062 Consistency Training Helps Stop Sycophancy and Jailbreaks (2025)
- arXiv:2604.16351 Training for Compositional Sensitivity Reduces Dense Retrieval Generalization (2026)
- arXiv:2605.12484 Learning, Fast and Slow (2026)

**Your task:**
(1) **RE-TEST EACH CONSTRAINT.** For every finding above—especially the claimed 8–40% generalization cost and RL's single-epoch collapse—probe whether newer model scales, training curricula, multi-objective losses, or inference-time steering have since relaxed these trade-offs. Separate the durable insight (priming is statistically real; suppression requires channel separation or reallocation) from perishable bottlenecks (specific % costs, saturation rates). Where does the constraint still provably hold?
(2) **Surface DISAGREEMENT & CONTRADICTION.** Identify work from the last 6 months that contradicts the "trade-off is inevitable" framing or shows priming suppression *without* collateral loss. Does recent work on mechanistic steering, sparse adaptation, or mixture-of-experts routing sidestep the geometric tension claimed in 2026 papers?
(3) **Propose 2 forward questions** that assume the regime may have shifted: (a) If channel separation (Fast-Slow, playbook context) truly decouples priming from learning, what predicts whether a given intervention lands in the right channel? (b) Can synthetic consistency data (generated by a high-capability model) unlock better priming suppression than self-consistency, and at what efficiency loss?

**Cite arXiv IDs; flag anything you cannot ground in a real paper.**
