INQUIRING LINE

How do newly learned facts become accessible after gradient updates?

This explores what actually happens inside a model when fine-tuning writes a new fact in — where it lands, whether it can be recalled cleanly, and why the alternatives (editing activations, decoding-time tricks, external tools) often beat gradient updates at making knowledge usable.


The corpus's most surprising answer is that gradient updates may be the wrong place to look for stored facts at all. One line of work argues that transformers don't keep knowledge in tidy retrievable slots; the residual stream *transmits* knowledge as a flow of activations during generation rather than warehousing it (see "Do transformer models store knowledge or generate it continuously?"). If that's right, a newly learned fact becomes 'accessible' only when it can re-enter that flow at the right moment, which would explain why edited facts are notoriously brittle and context-dependent.

That reframing explains a cluster of findings about the *cost* of writing facts in by weight update. In-weight memorization is provably bounded by model size, and pushing facts in through fine-tuning overwrites prior knowledge and degrades general capability, so a new fact can become accessible only by quietly displacing old ones (see "Can models store unlimited facts without growing larger?"). Decoding-time work sharpens the point: direct fine-tuning corrupts knowledge storage in the *lower* layers, whereas proxy-tuning, which leaves base weights untouched, preserves factual recall while still shifting reasoning and style (see "Can decoding-time tuning preserve knowledge better than weight fine-tuning?"). The lesson is that the layers where facts live and the layers where gradient updates do their damage overlap badly.
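The decoding-time idea can be made concrete with a small sketch. It assumes the usual description of proxy-tuning: at each step the large base model's logits are shifted by the difference between a small tuned model (the "expert") and its untuned counterpart (the "anti-expert"), and the base weights are never modified. The function names and the toy three-token vocabulary are illustrative assumptions, not taken from the notes.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def proxy_tuned_probs(base_logits, expert_logits, antiexpert_logits):
    """Next-token distribution with the base logits shifted by (expert - anti-expert).

    Only the logits are combined at decoding time; no model weights change.
    """
    shifted = [
        b + (e - a)
        for b, e, a in zip(base_logits, expert_logits, antiexpert_logits)
    ]
    return softmax(shifted)


# Toy logits over a 3-token vocabulary (made-up numbers for illustration).
base = [2.0, 1.0, 0.0]        # large untuned model
expert = [0.5, 1.5, 0.0]      # small tuned model
antiexpert = [0.5, 0.5, 0.0]  # small untuned model

probs = proxy_tuned_probs(base, expert, antiexpert)
print(probs)
```

If the expert and anti-expert agree, the shift is zero and the base distribution comes back unchanged, which is one way to see why whatever the base model already knows is left alone.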

So a lot of the corpus is about making facts accessible *without* paying that price. Representation fine-tuning leaves weights frozen and instead learns a small intervention on hidden activations, steering the flow rather than rewriting the store, and gets 10–50× better parameter efficiency than LoRA (see "Can editing hidden representations beat weight updates for finetuning?"). Tool use goes further, decoupling factual recall from parameters entirely so new facts live in an external store and are retrieved through a simple learned circuit (see "Can models store unlimited facts without growing larger?"). Bidirectional RAG makes this dynamic: a generated answer becomes a newly accessible 'fact' only once it has passed entailment, attribution, and novelty checks and been written back to the corpus, so accessibility is gated by verification rather than by gradient descent (see "Can RAG systems safely learn from their own generated answers?").
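To show what "a small intervention on hidden activations" can look like, here is a minimal sketch of a LoReFT-style edit, assuming the low-rank form h + Rᵀ(Wh + b − Rh) with R having orthonormal rows. The hand-picked matrices below are illustrative assumptions; in practice R, W, and b would be learned while the model's own weights stay frozen.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def loreft(h, R, W, b):
    """Edit hidden state h inside the r-dimensional subspace spanned by R's rows.

    Computes h + R^T (W h + b - R h). Components of h outside the subspace
    are left unchanged.
    """
    Rh = matvec(R, h)
    Wh = matvec(W, h)
    delta = [wh + bi - rh for wh, bi, rh in zip(Wh, b, Rh)]  # length r
    # Map the r-dimensional correction back to the hidden dimension (R^T delta).
    back = [
        sum(R[k][j] * delta[k] for k in range(len(R)))
        for j in range(len(h))
    ]
    return [hj + bj for hj, bj in zip(h, back)]


# Toy example: hidden size 4, rank-2 subspace (values chosen by hand).
h = [1.0, 2.0, 3.0, 4.0]
R = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0]]  # orthonormal rows
W = [[0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]
b = [0.5, -0.5]

edited = loreft(h, R, W, b)
print(edited)  # [3.5, 3.5, 3.0, 4.0]
```

Only the first two coordinates move, because R selects them; the trainable state here is R, W, and b (2·r·d + r numbers), which is where the parameter savings come from.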

When you *do* update weights, the corpus offers a mechanistic picture of how the change is shaped. RL-style updates touch only 5–30% of parameters, and those updates are sparse but nearly full-rank and remarkably consistent across random seeds, meaning the model has structural 'places' it puts new learning rather than scattering it arbitrarily (see "Does reinforcement learning update only a small fraction of parameters?"). Whether that newly written capability stays *usable* afterward depends on drift: staying close to the base distribution (low KL drift) preserves the plasticity needed to keep learning, while parameter-only updates stall when the domain shifts (see "Does staying close to the base model preserve learning ability?"). And the data you train on matters as much as the mechanism: gradient-similarity selection shows that a small, well-chosen slice of examples (about 5% in the LESS experiments) installs a target skill better than the full set, because some training data actively pushes the model's reasoning away from the skill you wanted (see "Can we train better models on less data?").
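Two of the quantities in this paragraph are easy to compute on toy data, and a sketch may make them less abstract. The code below assumes flattened parameter lists and discrete next-token distributions; the helper names, the tolerance, and the direction of the KL term (tuned relative to base) are illustrative assumptions rather than the definitions used in the cited notes.

```python
import math


def fraction_updated(before, after, tol=0.0):
    """Fraction of parameters whose value moved by more than tol."""
    changed = sum(1 for x, y in zip(before, after) if abs(x - y) > tol)
    return changed / len(before)


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same tokens.

    Assumes q is nonzero wherever p is nonzero.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy "checkpoint": 10 parameters, of which 2 change during fine-tuning.
before = [0.1] * 10
after = list(before)
after[2] += 0.05
after[7] -= 0.01
print(fraction_updated(before, after))  # 0.2

# Toy next-token distributions for the base and tuned models (made-up numbers).
base_probs = [0.5, 0.3, 0.2]
tuned_probs = [0.6, 0.25, 0.15]
drift = kl_divergence(tuned_probs, base_probs)
print(drift)
```

On a real model the first measure would be taken over every weight tensor and the second averaged over many prompts; the point of the sketch is only that "sparse update" and "low KL drift" are separate measurements, one on parameters and one on outputs.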

The thing you didn't know you wanted to know: there's a quieter route to making facts accessible that bypasses persistent updates entirely. Context-engineering work treats the prompt itself as an evolving 'playbook', in which newly learned material is curated incrementally so it stays retrievable across iterations without the brevity bias and collapse that compression causes (see "Can context playbooks prevent knowledge loss during iteration?"). Read together, the corpus suggests 'accessible after a gradient update' is the hard case, not the default one: facts become reliably usable more often by intervening on activations, externalizing to tools, or curating context than by writing them into the weights.
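As a rough illustration of incremental curation, the sketch below merges new entries into a playbook without rewriting what is already there. It is not the ACE implementation: the `curate` and `normalize` helpers and the whitespace/case-insensitive duplicate check are hypothetical simplifications meant only to contrast append-style updates with a full rewrite or compression pass.

```python
def normalize(text):
    """Case- and whitespace-insensitive key used for duplicate detection."""
    return " ".join(text.lower().split())


def curate(playbook, delta_entries):
    """Return a new playbook: existing entries kept verbatim, novel ones appended.

    Nothing already in the playbook is shortened or rewritten, so earlier
    details cannot be lost to a later summarization step.
    """
    seen = {normalize(entry) for entry in playbook}
    merged = list(playbook)
    for entry in delta_entries:
        key = normalize(entry)
        if key and key not in seen:
            merged.append(entry)
            seen.add(key)
    return merged


playbook = ["Check API rate limits before batch calls."]
playbook = curate(
    playbook,
    [
        "check api rate limits  before batch calls.",  # duplicate, skipped
        "Log tool errors verbatim.",                   # novel, appended
    ],
)
print(playbook)
```

A real curation loop would also need a reflection step that proposes the new entries and some policy for retiring stale ones; both are left out here.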


Sources (9 notes)

Do transformer models store knowledge or generate it continuously?

Transformers organize knowledge as flowing activations rather than retrievable archives, mirroring oral cultures where knowledge exists only in performance. This explains why model knowledge is contextual, difficult to edit, and inseparable from generation.

Can models store unlimited facts without growing larger?

A formal proof and experiments show in-weight memorization is bounded by model size, while tool-use enables unbounded factual recall through a simple circuit. In-weight finetuning also degrades general capability by overwriting prior knowledge.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.

Can editing hidden representations beat weight updates for finetuning?

ReFT learns task-specific interventions on frozen model representations rather than updating weights, with LoReFT (low-rank linear subspace variant) dramatically outperforming LoRA across reasoning, instruction-following, and NLU benchmarks while using far fewer parameters.

Can RAG systems safely learn from their own generated answers?

Systems can add generated answers to their retrieval corpus when outputs pass entailment verification, source attribution checks, and novelty detection. This prevents hallucinations from polluting future retrievals while allowing genuine knowledge accumulation.

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL induces intrinsic parameter sparsity of 5–30% without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.

Does staying close to the base model preserve learning ability?

FST-trained models stay up to 70% closer to their base distribution than parameter-only RL, and this reduced drift preserves the model's ability to learn subsequent tasks effectively. Parameter-only approaches stall when task domains change, while low KL drift enables sustained adaptation.

Can we train better models on less data?

LESS uses low-rank gradient features to select instruction data most similar to target capabilities, and training on the selected 5% consistently outperforms full dataset training. The improvement occurs because mixed datasets contain examples that actively hinder specific skills by shifting reasoning strategy away from task requirements.

Can context playbooks prevent knowledge loss during iteration?

The ACE framework treats contexts as evolving playbooks using generation-reflection-curation loops rather than full rewrites. This prevents knowledge loss from compression and detail erosion, achieving +10.6% on agentic tasks and +8.6% on finance without labeled supervision.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a mechanistic LLM researcher. The question remains open: How do newly learned facts become accessible after gradient updates—and does weight update remain the primary mechanism, or have newer capabilities shifted the burden elsewhere?

What a curated library found — and when (dated claims, not current truth): Findings span 2024–2026, with emphasis on 2025–26 work:

• RL updates touch only 5–30% of parameters, in sparse but nearly full-rank subnetworks that are consistent across random seeds; accessibility depends on whether updates preserve plasticity and stay close to base distribution to avoid OOD collapse (2025).
• Representation fine-tuning (steering activations rather than rewriting weights) achieves 10–50× better parameter efficiency than LoRA while keeping facts accessible without overwriting prior knowledge (2024).
• Tool use decouples factual recall from parameters entirely—new facts live in external stores and are retrieved via learned circuits, bypassing weight-based storage constraints (2025).
• Gradient-similarity selection shows ~5% of training data installs target skills better than full datasets because some examples push reasoning away from the intended fact (2024).
• Context engineering treats prompts as evolving playbooks to keep newly learned material retrievable across iterations without brevity bias or collapse (2025).

Anchor papers (verify; mind their dates):
• arXiv:2402.04333 (LESS, 2024): gradient-based influence for data selection.
• arXiv:2404.03592 (ReFT, 2024): representation fine-tuning on frozen weights.
• arXiv:2505.11711 (RL finetunes subnetworks, 2025): sparse, interpretable parameter updates.
• arXiv:2510.04618 (Agentic Context Engineering, 2025): dynamic context curation.

Your task:
(1) RE-TEST EACH CONSTRAINT. For sparse subnetwork updates, newer models (GPT-4o, o1, Claude 3.5), scaling beyond 70B params, and multi-step RL (PPO variants, DPO) — do they still show 5–30% sparsity, or has scaling flattened this ratio? Separately, do representation fine-tuning gains hold across 2026 instruction-tuning benchmarks, or has fine-tuning efficiency collapsed again? Most critically: does tool-use factual decoupling remain brittle on long-horizon tasks, or have agentic orchestration (memory + retrieval) made it reliable enough to displace weights as the primary store?
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Look for papers arguing weights *are* the primary store after all, or showing that activation steering is no cheaper than fine-tuning at scale, or that in-context learning has made all storage mechanisms moot.
(3) Propose 2 research questions that ASSUME the regime may have moved: (a) If factual accessibility has shifted from weights to activations + context, what is the minimum model size at which this trade-off breaks (i.e., when is weight storage cheaper again)? (b) Do agentic systems (multi-agent + long memory) make fact *drift* acceptable because retrieval and re-verification are now cheap—i.e., does the brittleness problem vanish if you re-query the corpus on every generation?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
