How do trained weights differ from a stored library or text?
This explores the difference between knowledge baked into a model's weights (parametric memory) and knowledge sitting in an external store you can read, edit, or retrieve — and what the corpus reveals about how differently the two behave.
The gap between knowledge baked into a model's weights and knowledge sitting in an external store you can look up is bigger than "one is inside, one is outside," and the corpus bears that out. A stored library is addressable: each fact has a location, you can swap a page, and the rest is untouched. Weights are the opposite: knowledge is smeared across layers as statistical priors, not filed in slots, and that changes how it can be used, corrupted, and fixed.
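The addressable-versus-smeared contrast can be made concrete with a toy sketch. It assumes a two-weight linear memory with overlapping keys, which is only a stand-in for distributed storage and not how any real model is built; all names and numbers (`library`, `key_a`, `key_b`, the values) are hypothetical.

```python
# Toy contrast: an addressable store vs. a tiny distributed (linear) memory.
# Illustrative only; real model weights are far more complex than this.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# 1) Addressable store: each fact has its own slot.
library = {"fact_a": 1.0, "fact_b": 2.0}
library["fact_a"] = 5.0               # swap one page
library_b_after = library["fact_b"]   # the other page is untouched

# 2) Distributed memory: both facts share the same weight vector.
# The keys overlap (dot product 0.6), an assumption standing in for
# overlapping learned features.
key_a = (1.0, 0.0)
key_b = (0.6, 0.8)
weights = [1.0, 1.75]  # chosen so key_a recalls ~1.0 and key_b recalls ~2.0

def recall(w, key):
    return dot(w, key)

b_before = recall(weights, key_b)

# Edit fact A to 5.0 with a delta-rule update along key_a.
error = 5.0 - recall(weights, key_a)
scale = error / dot(key_a, key_a)
weights = [w + scale * k for w, k in zip(weights, key_a)]

a_after = recall(weights, key_a)  # fact A now reads back as the new value
b_after = recall(weights, key_b)  # fact B has drifted without being edited

print(library_b_after, b_before, a_after, b_after)
```

In this sketch the dictionary edit leaves `fact_b` alone, while the weight edit moves the recalled value for `key_b` as a side effect, because the two facts share parameters.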
The clearest evidence is what happens when weights and text disagree. When a model's training priors are strong, it will ignore facts placed right in front of it in context. The note "Why do language models ignore information in their context?" shows that prompting alone can't override a strong baked-in association; you need to intervene causally in the representations themselves. A library never does this: a page doesn't "resist" the page next to it. Weights blend, reweigh, and overrule; stored text just sits there to be read.
The two also fail and heal differently. Because weights are distributed, editing them is surgery with side effects. "Can decoding-time tuning preserve knowledge better than weight fine-tuning?" finds that direct fine-tuning corrupts knowledge stored in lower layers, while steering at decoding time (proxy-tuning) leaves the base weights untouched and mainly shifts style and reasoning. A text store has no such fragility: "Can we defend RAG systems from corpus poisoning without retraining?" shows you can defend or clean a retrieval corpus at the retrieval layer without retraining the model. You can quarantine a poisoned document; you can't quarantine a poisoned neuron without retraining.
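Decoding-time steering of the kind mentioned above is often described as logit arithmetic. The sketch below assumes that formulation: the difference between a small tuned model's logits and its untuned counterpart's is added to the large base model's logits at each step. The notes here do not spell out the formula, so treat it as an assumption. The function names and toy logits are hypothetical.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def steered_probs(base_logits, tuned_small_logits, untuned_small_logits):
    """Shift the base model's next-token logits by the (tuned - untuned) offset.

    No weights are modified: the base logits are only read, and the offset
    is applied to the output distribution at decoding time.
    """
    offset = [t - u for t, u in zip(tuned_small_logits, untuned_small_logits)]
    combined = [b + o for b, o in zip(base_logits, offset)]
    return softmax(combined)

# Toy three-token vocabulary (made-up numbers).
base = [2.0, 1.0, 0.0]           # large base model prefers token 0
tuned_small = [0.0, 1.0, 0.0]    # small tuned model nudges toward token 1
untuned_small = [0.0, 0.0, 0.0]  # small untuned model is indifferent

probs = steered_probs(base, tuned_small, untuned_small)
print(probs)
```

The design point is that the base model is only queried, never updated, so whatever its weights store is left as it was.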
There's also a representational difference. Weights silently encode distributions, not just facts. "Does RL training collapse format diversity in pretrained models?" shows training can amplify one format from pretraining and suppress the alternatives, largely out of view. Normally this structure is opaque, which is why work like "Can sparse weight training make neural networks interpretable by design?" has to *force* weights into a library-like form, training sparse circuits so that a neuron maps to a readable concept, and even that has not yet scaled beyond tens of millions of parameters. The fact that interpretability is a hard research problem is itself the answer to the question: text is legible by default, weights are not.
The strangest wrinkle is that the line is dissolving. "Can skill documents be optimized like neural network weights?" shows a plain-English skill document being *optimized like weights*: an optimizer proposes edits and keeps only those that strictly improve a held-out validation score. So you can treat editable text as something you train, getting the auditability of a library with an improvement loop like the one used to train weights. The deeper takeaway: weights and a stored library aren't just two storage formats, they're two different relationships to knowledge, one you query and one you become, and the interesting frontier is building things that have both properties at once.
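The propose-and-accept loop described above can be sketched in a few lines. This is a minimal illustration, not SkillOpt's actual implementation: the keyword-counting score stands in for a held-out validation metric, the fixed list of proposals stands in for a real optimizer, and every name here (`optimize_document`, `keyword_score`, `append_line`) is hypothetical.

```python
def keyword_score(doc, validation_keywords):
    # Stand-in for a held-out validation score: count covered keywords.
    text = " ".join(doc).lower()
    return sum(1 for kw in validation_keywords if kw in text)

def optimize_document(doc, proposals, score):
    """Apply each proposed edit; keep it only if the score strictly improves."""
    best = tuple(doc)
    best_score = score(best)
    decisions = []
    for propose in proposals:
        candidate = propose(best)
        candidate_score = score(candidate)
        accepted = candidate_score > best_score  # strict improvement only
        if accepted:
            best, best_score = candidate, candidate_score
        decisions.append(accepted)
    return best, best_score, decisions

def append_line(line):
    # Hypothetical edit: return a new document with one line appended.
    return lambda doc: doc + (line,)

validation = ["retry", "timeout", "logging"]
start = ("Use retry with backoff.",)
proposals = [
    append_line("Always set a timeout."),
    append_line("Be concise."),  # no score gain, so it should be rejected
    append_line("Enable logging at INFO level."),
]

final_doc, final_score, decisions = optimize_document(
    start, proposals, lambda d: keyword_score(d, validation)
)
print(decisions, final_score)
for line in final_doc:
    print(line)
```

Because every accepted edit is a visible line of text, the final document can be read and audited directly, which is the property the paragraph above contrasts with trained weights.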
Sources (6 notes)
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.
RAGPart and RAGMask provide lightweight, retraining-free defenses that operate at the retrieval layer. RAGPart bounds poisoned-document influence via partitioned retriever learning; RAGMask flags suspicious documents through abnormal similarity collapse under token masking.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Training transformers with sparse weights creates compact, human-interpretable circuits where neurons correspond to simple concepts with clear connections. Ablation studies confirm these circuits are necessary and sufficient for task performance, though scaling beyond tens of millions of parameters while maintaining interpretability remains unsolved.
SkillOpt demonstrates that skill documents can be systematically improved through a separate optimizer that proposes edits, accepting only changes that strictly improve held-out validation scores. This approach outperforms baselines across 52 experimental cells and produces skills that transfer between models.