INQUIRING LINE


A model doesn't store facts in filing cabinets — knowledge is smeared across its whole structure, and that makes simple fixes surprisingly hard.

How do trained weights differ from a stored library or text?

This explores the difference between knowledge baked into a model's weights (parametric memory) and knowledge sitting in an external store you can read, edit, or retrieve — and what the corpus reveals about how differently the two behave.

The corpus suggests the gap is bigger than "one is inside, one is outside." A stored library is addressable: each fact has a location, and you can swap a page while the rest stays untouched. Weights are the opposite: knowledge is smeared across layers as statistical priors rather than filed in slots, and that changes how it can be used, corrupted, and fixed.
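The contrast between addressable and distributed storage can be made concrete with a toy. The sketch below is an illustration only, not a claim about how a transformer stores facts: it compares a Python dict with a small linear associative memory in which two facts share the same weights because their keys overlap. The names `write`, `recall`, and the example keys are assumptions made up for the illustration.

```python
# Toy contrast: editing an addressable store vs. a distributed linear memory.
# Everything here is illustrative; real model weights are far more complex.

def recall(W, key):
    """Read from the linear memory: value = W @ key."""
    return [sum(w * k for w, k in zip(row, key)) for row in W]

def write(W, key, value):
    """Add the outer product value * key^T into W (a Hebbian-style write)."""
    for i, v in enumerate(value):
        for j, k in enumerate(key):
            W[i][j] += v * k

# Addressable store: each fact has its own slot.
library = {"paris": "france", "rome": "italy"}
library["paris"] = "EDITED"          # swap one page
assert library["rome"] == "italy"    # the neighbouring page is untouched

# Distributed store: the two keys are not orthogonal, so the facts overlap.
key_a, key_b = [1.0, 0.0], [0.6, 0.8]
val_a, val_b = [1.0, 0.0], [0.0, 1.0]
W = [[0.0, 0.0], [0.0, 0.0]]
write(W, key_a, val_a)
write(W, key_b, val_b)

before_b = recall(W, key_b)
write(W, key_a, [-1.0, 1.0])         # "edit" fact A by writing along key_a only
after_b = recall(W, key_b)

# How far did the recall of B move, even though only A was edited?
drift_b = max(abs(x - y) for x, y in zip(before_b, after_b))
print("recall of B before edit:", before_b)
print("recall of B after edit: ", after_b)
print("drift in B caused by editing A:", drift_b)
```

In this toy the dict edit leaves the other entry alone, while the edit aimed at fact A shifts the recall of fact B by about 0.6, because both facts live in the same matrix.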

The clearest evidence is what happens when weights and text disagree. When a model's training priors are strong, it will ignore facts placed right in front of it in context. The note "Why do language models ignore information in their context?" shows that prompting alone can't override a strong baked-in association; you need to intervene in the representations themselves. A library never does this: a page doesn't "resist" the page next to it. Weights blend, rank, and overrule; stored text just sits there to be read.

The two also fail and heal differently. Because weights are distributed, editing them is surgery with side effects. The note "Can decoding-time tuning preserve knowledge better than weight fine-tuning?" finds that direct fine-tuning corrupts knowledge stored in lower layers, while steering at decoding time leaves the stored knowledge intact and mainly shifts style and reasoning. A text store can be poisoned too, but it is far easier to repair: "Can we defend RAG systems from corpus poisoning without retraining?" shows you can defend or clean a retrieval corpus at query time without touching the model at all. You can quarantine a poisoned document; you can't quarantine a poisoned neuron without retraining.
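To make "quarantine a poisoned document" concrete, here is a minimal sketch of a query-time filter over a toy corpus. It is not RAGPart or RAGMask; the word-overlap `score`, the `retrieve` function, and the document ids are hypothetical stand-ins chosen for the illustration.

```python
# Toy retrieval store with query-time quarantine. All names are hypothetical.

def score(query, text):
    """Word-overlap score: how many distinct query words appear in the text."""
    words = set(text.lower().split())
    return sum(1 for w in set(query.lower().split()) if w in words)

def retrieve(query, corpus, quarantined=frozenset(), k=2):
    """Return ids of the top-k matching documents, skipping quarantined ids."""
    candidates = [(score(query, text), doc_id)
                  for doc_id, text in corpus.items()
                  if doc_id not in quarantined]
    # Highest score first; ties broken by id so the order is deterministic.
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for s, doc_id in candidates[:k] if s > 0]

corpus = {
    "doc1": "the capital of france is paris",
    "doc2": "the capital of italy is rome",
    "poison": "the capital of france is berlin capital france",
}

hits_before = retrieve("capital of france", corpus)
hits_after = retrieve("capital of france", corpus, quarantined={"poison"})
print(hits_before)
print(hits_after)
```

The corpus and the scoring function are unchanged between the two calls; only the quarantine set differs. Nothing comparable exists for a single fact inside a weight matrix, which is the asymmetry the paragraph above describes.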

There's also a representational difference. Weights silently encode distributions, not just facts: "Does RL training collapse format diversity in pretrained models?" shows that training can amplify one format from pretraining and suppress the alternatives, with little of this visible from outside. Normally this structure is opaque, which is why work like "Can sparse weight training make neural networks interpretable by design?" has to *force* weights into a library-like form, training sparse circuits so that a neuron maps to a readable concept. The fact that interpretability is a hard research problem is itself the answer to the question: text is legible by default, weights are not.

The strangest wrinkle is that the line is dissolving. "Can skill documents be optimized like neural network weights?" shows a plain-English skill document being *optimized like weights*: an optimizer proposes edits and keeps only those that improve a held-out validation score. So you can treat editable text as something you train, getting the auditability of a library together with an improvement loop that resembles training. The deeper takeaway: weights and a stored library aren't just two storage formats, they're two different relationships to knowledge, one you query and one you become, and the interesting frontier is building things that have both properties at once.

Sources 6 notes

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.
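The decoding-time shift can be sketched as simple logit arithmetic. This is a minimal illustration, assuming the usual formulation in which the large base model's next-token logits are offset by the difference between a small tuned model and its small untuned counterpart. The `expert` and `anti_expert` names, the three-token vocabulary, and all numbers are made up for the example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def proxy_tuned_probs(base_logits, expert_logits, anti_expert_logits):
    """Shift the base logits by (small tuned - small untuned), then normalise."""
    shifted = [b + (e - a) for b, e, a in
               zip(base_logits, expert_logits, anti_expert_logits)]
    return softmax(shifted)

# Made-up next-token logits over a 3-token vocabulary.
vocab = ["Sure", "Paris", "lol"]
base = [1.0, 2.0, 1.5]          # large untuned model (weights never modified)
expert = [2.0, 1.0, -1.0]       # small tuned model
anti_expert = [0.5, 1.0, 1.0]   # small untuned model

probs = proxy_tuned_probs(base, expert, anti_expert)
best = vocab[probs.index(max(probs))]
print(best, [round(p, 3) for p in probs])
```

Where the two small models agree (the "Paris" token here), the offset is zero and the base model's own preference passes through unchanged. The base model is only read, never updated, which is the sense in which its stored knowledge is left untouched.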

Can we defend RAG systems from corpus poisoning without retraining?

RAGPart and RAGMask provide lightweight, retraining-free defenses that operate at the retrieval layer. RAGPart bounds poisoned-document influence via partitioned retriever learning; RAGMask flags suspicious documents through abnormal similarity collapse under token masking.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Can sparse weight training make neural networks interpretable by design?

Training transformers with sparse weights creates compact, human-interpretable circuits where neurons correspond to simple concepts with clear connections. Ablation studies confirm these circuits are necessary and sufficient for task performance, though scaling beyond tens of millions of parameters while maintaining interpretability remains unsolved.


Can skill documents be optimized like neural network weights?

SkillOpt demonstrates that skill documents can be systematically improved through a separate optimizer that proposes edits, accepting only changes that strictly improve held-out validation scores. This approach outperforms baselines across 52 experimental cells and produces skills that transfer between models.
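The accept-only-if-better loop described here can be sketched as a simple hill climb over a text document. This is not SkillOpt's implementation: `propose_edit` and `validation_score` are hypothetical stand-ins (a real system would presumably use a model to propose edits and a held-out task suite to score them), and the hint strings are invented for the example.

```python
import random

def optimize_skill(skill, propose_edit, validation_score, steps=50, seed=0):
    """Keep a proposed edit only if it strictly improves the held-out score."""
    rng = random.Random(seed)
    best, best_score = skill, validation_score(skill)
    history = [best_score]
    for _ in range(steps):
        candidate = propose_edit(best, rng)
        cand_score = validation_score(candidate)
        if cand_score > best_score:      # strict improvement only
            best, best_score = candidate, cand_score
        history.append(best_score)
    return best, best_score, history

# Hypothetical stand-ins for the proposer and the validation set.
HINTS = ["check units", "cite sources", "state assumptions", "be verbose"]
USEFUL = {"check units", "cite sources", "state assumptions"}

def propose_edit(skill, rng):
    """Append one randomly chosen hint line to the skill document."""
    return skill + [rng.choice(HINTS)]

def validation_score(skill):
    """Reward distinct useful hints and penalise document length."""
    return 2 * len(USEFUL & set(skill)) - 0.5 * len(skill)

final_skill, final_score, history = optimize_skill(
    ["answer the question"], propose_edit, validation_score)
print(final_skill, final_score)
```

Because every accepted edit is a visible line of text, the final document can be read and audited directly, while the score history never decreases. That combination is what the note means by treating a document like something you train.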

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining (1.69 match, arXiv)
Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations? (1.64 match, arXiv)
How new data permeates LLM knowledge and how to dilute it (1.63 match, arXiv)
An Emulator for Fine-Tuning Large Language Models using Small Language Models (1.61 match, arXiv)
Weight-sparse transformers have interpretable circuits (0.92 match, arXiv)
Language models show human-like content effects on reasoning tasks (0.85 match, arXiv)
Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases (0.85 match, arXiv)
Tuning Language Models by Proxy (0.85 match, arXiv)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a researcher auditing the durability of claims about how trained weights differ from stored text libraries. The question remains open: are weights and addressable storage fundamentally different knowledge substrates, or has recent capability progress blurred the line?

What a curated library found — and when (dated claims, not current truth):
Findings span Feb 2024–May 2026. A curated library identified these key distinctions:
• Weights encode distributed statistical priors that resist context override; stored text is legible and addressable by default (~2024–25).
• Direct fine-tuning corrupts lower-layer knowledge; decoding-time steering preserves it; RAG corpus poisoning can be defended at query time without retraining (~2024–25).
• RL post-training converges on one pretraining distribution format, amplifying it invisibly (arXiv:2504.07912, ~2025).
• Weight sparsity can force weights into interpretable, library-like circuits (arXiv:2511.13653, ~2025).
• Plain-English skill documents can be optimized like weights via an optimizer that proposes edits and keeps only those that improve held-out validation scores (arXiv:2605.23904, ~2026), blurring the boundary.

Anchor papers (verify; mind their dates):
• arXiv:2504.07912 — Echo Chamber: RL Post-training Amplifies Behaviors (Apr 2025).
• arXiv:2511.13653 — Weight-sparse transformers have interpretable circuits (Nov 2025).
• arXiv:2605.23904 — SkillOpt: Self-Evolving Agent Skills (May 2026).
• arXiv:2505.16014 — Ranking Free RAG: Selection in Sensitive Domains (May 2025).

Your task:
(1) RE-TEST EACH CONSTRAINT. For every claim above — especially the architectural claim that weights are opaque and text is legible — judge whether newer training methods (LoRA, DPO, newer RLs), probing / SAE techniques, or hybrid weight-text systems have since RELAXED or OVERTURNED it. Separate the durable observation (weights are distributed; text is addressable) from the perishable limitation (weights are uninterpretable). Cite what resolved each constraint, or say plainly where it still holds.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months that revises how the field models weight–text coupling (e.g., new distillation, in-context learning, or weight-editing methods).
(3) Propose 2 research questions that ASSUME the weight–library boundary is dissolving: e.g., can you train a weight-text hybrid that preserves auditability AND gradient-based improvement? Can retrieval-augmented training converge on a joint learned representation that behaves like neither pure weights nor pure text?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
