INQUIRING LINE

There's a difference between an AI reading a document and actually knowing it — bridging that gap costs real compute.

Can models internalize retrieved context as static parametric knowledge?

This explores whether information a model pulls in at runtime (retrieved documents, prompts, long context) can be converted into the kind of baked-in knowledge that lives in its weights — and what it costs to make that crossing.

The corpus frames this less as a yes/no question than as a boundary with a toll booth. The short version: context and parameters are two different stores, and crossing from one to the other is not free, not automatic, and not something prompting can do.

The most direct answer reframes the whole problem as compute, not memory. One line of work argues that the long-context bottleneck isn't that models run out of room to hold text; it's the work required to *consolidate* evicted context into fast weights, a transformation that happens during offline "sleep" passes and improves the more consolidation passes you run (see "Is long-context bottleneck really about memory or compute?"). In other words, yes, context can become something parameter-like, but only by spending compute to internalize it, following a test-time scaling curve. That's the affirmative case, and notice that it makes internalization an active process, not a side effect of just reading the text.
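To make the "more passes, better internalization" idea concrete, here is a deliberately tiny sketch. It is not the method from the cited work: it assumes that "context" is three key/value pairs and that "consolidation" is plain full-batch gradient descent writing those pairs into a linear weight matrix. The names `consolidation_pass`, `consolidate`, and `recall_error` are hypothetical and chosen for illustration only.

```python
# Toy illustration only (an assumption, not the cited paper's algorithm):
# "context" is a few key -> value pairs, and "consolidation" is repeated
# gradient passes that write those pairs into a weight matrix W.

def recall(W, key):
    """Read from the weights: value_hat = W @ key."""
    return [sum(w * k for w, k in zip(row, key)) for row in W]

def recall_error(W, pairs):
    """Sum of squared errors between recalled and true values."""
    return sum(
        (pred - target) ** 2
        for key, value in pairs
        for pred, target in zip(recall(W, key), value)
    )

def consolidation_pass(W, pairs, lr):
    """One full-batch gradient step on half the squared recall error."""
    grad = [[0.0] * len(W[0]) for _ in W]
    for key, value in pairs:
        pred = recall(W, key)
        for i, (p, t) in enumerate(zip(pred, value)):
            for j, k in enumerate(key):
                grad[i][j] += (p - t) * k
    return [
        [w - lr * g for w, g in zip(w_row, g_row)]
        for w_row, g_row in zip(W, grad)
    ]

def consolidate(pairs, passes, dim_in, dim_out):
    """Start from empty weights and run a fixed number of passes."""
    W = [[0.0] * dim_in for _ in range(dim_out)]
    # Keys are unit-norm, so a step size of 1/len(pairs) keeps descent stable.
    lr = 1.0 / len(pairs)
    errors = [recall_error(W, pairs)]
    for _ in range(passes):
        W = consolidation_pass(W, pairs, lr)
        errors.append(recall_error(W, pairs))
    return W, errors

# Three unit-norm keys, each paired with a 2-dimensional value.
context = [
    ([1.0, 0.0, 0.0], [1.0, 0.0]),
    ([0.6, 0.8, 0.0], [0.0, 1.0]),
    ([0.0, 0.6, 0.8], [1.0, 1.0]),
]
_, errors = consolidate(context, passes=50, dim_in=3, dim_out=2)
print(errors[0], errors[10], errors[-1])  # recall error shrinks with more passes
```

The only point of the sketch is the shape of the curve: recall from the weights starts out wrong and gets better as more compute is spent on passes, which is the sense in which internalization is an active process. Real consolidation into fast weights is far more involved than a linear map.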

What you *can't* do is shortcut that with clever prompting. Prompt optimization operates entirely inside the model's existing training distribution: it can reorganize and activate what's already there, but it cannot inject foundational knowledge the model never learned (see "Can prompt optimization teach models knowledge they lack?"). So putting a fact in the context window is not the same as the model *knowing* it. Worse, even when the fact is sitting right there in context, strong parametric priors can override it: models generate outputs inconsistent with their context because trained associations dominate, and textual prompting alone can't break that; you need causal intervention in the representations (see "Why do language models ignore information in their context?"). The static parametric knowledge doesn't just coexist with retrieved context; it actively competes with it and often wins.

There's a quieter cautionary note here too. When models *do* seem to fold context into their answers, they sometimes lean on memorized propositions rather than genuine integration: entailment predictions track whether a hypothesis was *attested* in training data, not whether the supplied premise actually supports it (see "Do LLMs predict entailment based on what they memorized?"). So a model that looks like it internalized your retrieved context may instead be pattern-matching to what it already memorized, which is the opposite of using the new information.

The most interesting lateral move is that a whole research direction is betting *against* internalization on purpose. Rather than consolidate context into weights, these systems keep adaptation external: agents improve continuously through episodic memory operations (case, subtask, and tool memory) with zero parameter updates, hitting strong benchmark scores while the LLM stays frozen (see "Can agents learn continuously from experience without updating weights?"). Retrieval frameworks like DeepRAG learn step-by-step *when* to trust internal parametric knowledge versus reach for external context, treating the two as switchable stores rather than one feeding the other (see "When should language models retrieve external knowledge versus use internal knowledge?"). And work on long-context LLMs shows that even holding everything in context can match RAG on semantic tasks yet still fail on structured relational queries: context length alone doesn't buy you the structured knowledge that would come from true internalization (see "Can long-context LLMs replace retrieval-augmented generation systems?"). The unexpected takeaway: the field is split between teams trying to *pay the compute toll* to turn context into weights, and teams arguing the smarter design is to never cross the boundary at all and keep knowledge retrievable and editable on the outside.
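The "keep adaptation external" design can be sketched in a few lines. This is a minimal illustration under stated assumptions, not AgentFly's implementation: it models only a case memory (no subtask or tool memory), uses word-overlap similarity in place of a learned retriever, and stops at building the prompt that a frozen model would receive. `Case`, `CaseMemory`, `similarity`, and `build_prompt` are hypothetical names.

```python
from dataclasses import dataclass, field

@dataclass
class Case:
    task: str
    plan: str
    succeeded: bool

def similarity(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets (a stand-in for a real retriever)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

@dataclass
class CaseMemory:
    cases: list = field(default_factory=list)

    def write(self, task: str, plan: str, succeeded: bool) -> None:
        """Learning step: append an experience. No model weights change."""
        self.cases.append(Case(task, plan, succeeded))

    def read(self, task: str, k: int = 2) -> list:
        """Return the k stored cases most similar to the new task."""
        ranked = sorted(
            self.cases,
            key=lambda c: similarity(task, c.task),
            reverse=True,
        )
        return ranked[:k]

def build_prompt(task: str, memory: CaseMemory, k: int = 2) -> str:
    """Assemble the text a frozen model would see for this task."""
    lines = ["Past cases:"]
    for case in memory.read(task, k):
        outcome = "worked" if case.succeeded else "failed"
        lines.append(f"- {case.task} -> {case.plan} ({outcome})")
    lines.append(f"New task: {task}")
    return "\n".join(lines)

memory = CaseMemory()
memory.write("find the population of Paris", "search census site", True)
memory.write("convert a PDF table to CSV", "use a table extraction tool", True)
memory.write("find the population of Lyon", "guess from memory", False)

prompt = build_prompt("find the population of Berlin", memory)
print(prompt)
```

Everything the agent "learns" here lives in `memory.cases`, so it stays inspectable and editable: deleting a bad case is a list operation, not a retraining run. That is the trade the external-memory camp is making against weight consolidation.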

Sources 7 notes

Is long-context bottleneck really about memory or compute?

Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.

Can prompt optimization teach models knowledge they lack?

Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Do LLMs predict entailment based on what they memorized?

McKenna et al. (2023) identified attestation bias: LLMs predict entailment based on whether the hypothesis appears in training data, not whether the premise actually supports it. Random-premise experiments show that models maintain high entailment predictions when hypotheses are attested, indicating that they respond to memorized propositions rather than premise-hypothesis relationships.

Can agents learn continuously from experience without updating weights?

AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.

When should language models retrieve external knowledge versus use internal knowledge?

DeepRAG models each reasoning step as a Markov Decision Process in which the model learns when to retrieve versus rely on parametric knowledge. Its reported 21.99% improvement in answer accuracy comes from better-targeted retrieval and elimination of noise from unnecessary external knowledge.

Can long-context LLMs replace retrieval-augmented generation systems?

The LOFT benchmark shows LCLMs match RAG on semantic retrieval without explicit training, but cannot execute relational queries requiring joins across structured tables. Context length alone cannot bridge this gap.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs (1.72 match)
Simple Linguistic Inferences of Large Language Models (LLMs): Blind Spots and Blinds (1.69 match)
Learning To Retrieve Prompts for In-Context Learning (1.68 match)
Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? (0.92 match)
Explicit Inductive Inference using Large Language Models (0.90 match)
Useful Memories Become Faulty When Continuously Updated by LLMs (0.90 match)
AgentFly: Fine-tuning LLM Agents without Fine-tuning LLMs (0.90 match)
ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory (0.90 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a researcher auditing whether retrieved or in-context information can become permanent, weight-resident knowledge in LLMs — treating older findings as potentially superseded.

What a curated library found — and when (dated claims, not current truth):
Library findings span 2023-12 through 2026-05. Key constraints identified:
- Context → parameter consolidation requires offline compute passes and follows a test-time scaling curve; prompting alone cannot bridge the gap (~2025, Atom of Thoughts).
- Prompt optimization activates only pre-trained knowledge; it cannot inject foundational facts the model never learned (~2024).
- Parametric priors override in-context facts; textual prompting cannot break trained associations without causal intervention in representations (~2024).
- Models often pattern-match to memorized propositions rather than genuinely integrate retrieved context; entailment predictions track attestation, not logical support (~2024).
- Long-context LLMs subsume RAG for semantic retrieval but fail on structured relational queries, suggesting context length ≠ true internalization (~2024).

Anchor papers (verify; mind their dates):
- arXiv:2502.01142 (DeepRAG, Feb 2025): step-by-step retrieval decision-making as Markov process.
- arXiv:2502.12018 (Atom of Thoughts, Feb 2025): test-time scaling for compute-intensive context consolidation.
- arXiv:2512.24601 (Recursive Language Models, Dec 2025): possible mechanism for iterative internalization.
- arXiv:2605.12978 (Useful Memories Become Faulty, May 2026): continuous memory updates degrade knowledge fidelity.

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer models (GPT-4o, Claude 4, o1, o3), scaling techniques (inference scaling, adaptive routing, mixture-of-experts), external tooling (vector DBs, knowledge graphs, semantic indexing), or orchestration (multi-pass reasoning, recursive retrieval, agentic workflows) have since relaxed or overturned it. Plainly separate the durable question — *Can context become permanent knowledge?* — from perishable limitations (e.g., *prompting alone cannot do it*). Cite what resolved each.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months that challenges the claim that context and parameters remain separate stores, or that shows internalization *without* offline compute.
(3) Propose 2 research questions that ASSUME the boundary may have shifted: e.g., *At what scale of inference compute does context become indistinguishable from parameters in downstream task performance?* and *Do agentic memory loops (episodic + parametric) actually collapse the distinction faster than offline consolidation?*

Cite arXiv IDs; flag anything you cannot ground in a real paper.
