INQUIRING LINE

How do training-time and inference-time knowledge injection techniques compare?

This explores how techniques that bake knowledge into a model's weights during training compare to techniques that supply knowledge while the model is running — and what each gives up.


The cleanest map of the territory is a four-way taxonomy (see "How do knowledge injection methods trade off flexibility and cost?"): training-time methods like static embedding (full fine-tuning) are fastest at run time but expensive to build and rigid once set, while inference-time methods like RAG buy flexibility, letting you swap or update knowledge instantly, at the cost of latency. The taxonomy's other two entries are modular adapters, which balance efficiency with swappability, and prompt optimization, which needs no training but only activates existing knowledge. The punchline that reframes the whole debate: combining approaches beats any single one. It isn't really training *vs.* inference; it's which constraint you're optimizing.

There's a hard ceiling on the inference-only side. Prompt optimization can reorganize and surface what a model already absorbed, but it cannot install knowledge that was never in the training data (see "Can prompt optimization teach models knowledge they lack?"). The same boundary separates reasoning from non-reasoning models: you can pour unlimited inference compute into a non-reasoning model and it still won't match a model whose *training* instilled a reasoning protocol (see "Can non-reasoning models catch up with more compute?"). Inference-time tricks activate latent capability; they don't create it. When knowledge is genuinely missing, you have to pay at training time.

But training-time injection has a quieter cost: it can corrupt what's already there. Direct fine-tuning rewrites the lower layers where factual knowledge lives, degrading it. Proxy-tuning instead shifts the output distribution at *decoding* time and closes most of the alignment gap (88-91%) while leaving the base weights, and their knowledge, intact (see "Can decoding-time tuning preserve knowledge better than weight fine-tuning?"). This is the interesting inversion: the inference-time method here isn't the flexible-but-shallow option, it's the one that *protects* knowledge better. Domain-training research echoes the warning: every adaptation method has a narrow sweet spot, and visible gains often hide losses in reasoning faithfulness, capability transfer, and format flexibility (see "How do domain training techniques actually reshape model behavior?").

The most striking thread is that *how* you structure knowledge can matter more than *when* you inject it. StructTuning reaches half of full-corpus performance using 0.3% of the data by teaching the model where facts sit in a domain taxonomy rather than drilling raw text (see "Can organizing knowledge structures beat raw training data volume?"). RLAG internalizes knowledge more durably than supervised fine-tuning by rewarding coherent explanation, not token-level correctness (see "Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?"). And inference-time methods are getting smarter too: Transformer² composes expert skill-vectors on the fly (see "Can models dynamically activate expert skills at inference time?"), and LogicRAG builds query-specific reasoning graphs at run time instead of paying to maintain a stale pre-built one (see "Can query-time graph construction replace pre-built knowledge graphs?").

What you didn't know you wanted to know: the cleanest dividing line isn't cost or speed, it's *staleness and contamination*. Inference-time knowledge stays current and leaves the base model untouched but can't add capability the model never learned; training-time knowledge runs cheaply and adds genuine new capability but risks both going out of date and damaging existing knowledge in the process. Even test-time learning systems like ARIA hit a version of this: they can adapt during inference but can't reconcile contradictory facts without a human, because the right answer depends on context outside the system (see "Can LLMs learn reliably at test time without human oversight?"). The frontier isn't picking a side; it's layering them.


Sources (10 notes)

How do knowledge injection methods trade off flexibility and cost?

Dynamic injection (RAG) maximizes flexibility but adds latency; static embedding is fastest but costly and inflexible; modular adapters balance efficiency with swappability; prompt optimization requires no training but only activates existing knowledge. Combining these approaches outperforms any single one.
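
To make the flexibility side of that trade-off concrete, here is a minimal sketch of dynamic injection, assuming a toy word-overlap retriever. The names `KnowledgeStore` and `build_prompt` are hypothetical, and a real system would use embedding search rather than word overlap. The point it illustrates is that updating knowledge is an append to the store, not a training run.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeStore:
    """Toy stand-in for a retrieval index (hypothetical, not a real library)."""
    docs: list[str] = field(default_factory=list)

    def add(self, doc: str) -> None:
        # Updating knowledge is an append, not a training run.
        self.docs.append(doc)

    def retrieve(self, query: str, k: int = 1) -> list[str]:
        # Rank documents by how many query words they share.
        q = set(query.lower().split())
        ranked = sorted(
            self.docs,
            key=lambda d: len(q & set(d.lower().split())),
            reverse=True,
        )
        return ranked[:k]


def build_prompt(store: KnowledgeStore, question: str) -> str:
    # Dynamic injection: retrieved text is prepended at inference time.
    context = "\n".join(store.retrieve(question))
    return f"Context:\n{context}\n\nQuestion: {question}"


store = KnowledgeStore()
store.add("The v1 API rate limit is 100 requests per minute.")
before = build_prompt(store, "What is the v2 API rate limit?")

# New knowledge arrives; the very next prompt can use it.
store.add("The v2 API rate limit is 500 requests per minute.")
after = build_prompt(store, "What is the v2 API rate limit?")
```

The cost shows up in `retrieve`: it runs on every query, which is the latency that static embedding avoids by paying once at training time.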

Can prompt optimization teach models knowledge they lack?

Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.
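
As a rough illustration of what "distributional shift at decoding time" can mean, the sketch below assumes the logit-offset formulation: the large base model's logits are shifted by the difference between a small tuned model and its small untuned counterpart, then renormalized. All logit values are made up, and the function names are illustrative rather than taken from any library.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    # Subtract the max for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def proxy_tuned_probs(
    base: list[float], expert: list[float], anti_expert: list[float]
) -> list[float]:
    """Shift the base model's logits by (tuned small - untuned small)."""
    shifted = [b + (e - a) for b, e, a in zip(base, expert, anti_expert)]
    return softmax(shifted)


# Made-up next-token logits over a 3-token vocabulary (assumed values).
base_logits = [2.0, 1.0, 0.0]         # large untuned model
expert_logits = [0.0, 2.0, 0.0]       # small tuned model
anti_expert_logits = [0.0, 0.0, 0.0]  # small untuned model

probs = proxy_tuned_probs(base_logits, expert_logits, anti_expert_logits)
```

No weight of the large model is touched here; only its per-step output distribution changes, which is why the stored knowledge is left intact.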

How do domain training techniques actually reshape model behavior?

Research shows every adaptation method—from parameter-efficient tuning to knowledge graph curricula—has optimal conditions tied to specific domains. The key finding: visible benefits like performance gains often come with hidden degradation in reasoning faithfulness, capability transfer, and format flexibility.

Can organizing knowledge structures beat raw training data volume?

StructTuning achieves 50% of full-corpus performance using only 0.3% of training data by organizing chunks into auto-generated domain taxonomies. The model learns knowledge position within conceptual structures rather than raw text patterns, matching how students learn from textbooks.

Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?

RLAG rewards both answer accuracy and explanation rationality by cycling between augmented and unaugmented generation, progressively internalizing coherent knowledge structures. This outperforms SFT because it prioritizes reasoning quality over token-level correctness.

Can models dynamically activate expert skills at inference time?

Transformer² demonstrates that tuning only singular values within weight matrices produces composable expert vectors that dynamically mix at inference without interference, outperforming LoRA with fewer parameters and enabling continual specialization.
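
The idea of tuning only singular values can be sketched on a toy 2x2 matrix. This assumes a weight matrix already factored as W = U diag(s) V^T, an expert vector z that rescales each singular value, and linear interpolation as the mixing rule. The expert vectors `z_math` and `z_code` and all numbers are hypothetical, chosen only to show the mechanics.

```python
import math

Matrix = list[list[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def diag(v: list[float]) -> Matrix:
    return [[v[i] if i == j else 0.0 for j in range(len(v))] for i in range(len(v))]


def adapt(u: Matrix, s: list[float], v: Matrix, z: list[float]) -> Matrix:
    """Rebuild W' = U diag(s * z) V^T; U, s and V stay frozen, only z varies."""
    scaled = [si * zi for si, zi in zip(s, z)]
    return matmul(matmul(u, diag(scaled)), transpose(v))


def mix(experts: list[list[float]], weights: list[float]) -> list[float]:
    # Linear interpolation of expert vectors, chosen at inference time.
    size = len(experts[0])
    return [sum(w * e[i] for w, e in zip(weights, experts)) for i in range(size)]


# A frozen weight matrix given directly in factored form (toy values).
c, sn = math.cos(math.pi / 4), math.sin(math.pi / 4)
U = [[c, -sn], [sn, c]]       # orthogonal (a rotation)
S = [3.0, 1.0]                # singular values
V = [[1.0, 0.0], [0.0, 1.0]]  # orthogonal (identity)

z_math = [1.2, 0.5]  # hypothetical expert vectors: one scale per singular value
z_code = [0.8, 1.5]

z = mix([z_math, z_code], [0.75, 0.25])
W_adapted = adapt(U, S, V, z)
W_original = adapt(U, S, V, [1.0, 1.0])  # z = 1 recovers the frozen matrix
```

Each expert here costs one number per singular value, which gives a feel for why this kind of adapter can be smaller than a low-rank update.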

Can query-time graph construction replace pre-built knowledge graphs?

LogicRAG constructs directed acyclic graphs from queries at inference time rather than pre-building corpus-wide graphs, eliminating construction overhead, avoiding staleness, and enabling query-specific retrieval logic without sacrificing multi-hop reasoning capability.
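
A query-time graph can be sketched with Python's standard `graphlib`. This is a minimal illustration, not LogicRAG's implementation: the decomposition into subproblems is hard-coded (in a real system a model would produce it), and the `corpus` dictionary is a stand-in for a retriever. Subproblem names and facts are invented.

```python
from graphlib import TopologicalSorter

# Hypothetical decomposition of one multi-hop query. Each key maps to the
# set of subproblems that must be resolved before it.
query = "Which university did the author of the cited paper attend?"
dependencies = {
    "find_paper": set(),
    "find_author": {"find_paper"},
    "find_university": {"find_author"},
}

# Toy corpus keyed by subproblem (stand-in for a real retriever).
corpus = {
    "find_paper": "The cited paper is 'Paper X'.",
    "find_author": "'Paper X' was written by A. Author.",
    "find_university": "A. Author attended Example University.",
}


def resolve(deps: dict[str, set[str]]) -> list[str]:
    """Retrieve evidence for each subproblem in dependency order."""
    context: list[str] = []
    # static_order() raises graphlib.CycleError if the graph has a cycle.
    for node in TopologicalSorter(deps).static_order():
        context.append(corpus[node])
    return context


context = resolve(dependencies)
```

The graph exists only for the lifetime of this query, so there is nothing to rebuild when the corpus changes.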

Can LLMs learn reliably at test time without human oversight?

ARIA demonstrates that LLMs can adapt during inference through three integrated components: structured self-dialogue for uncertainty assessment, timestamped knowledge bases for conflict detection, and human-mediated resolution queries. Autonomous systems fail at reconciling contradictory rules because the correct choice depends on context outside the system.
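
Two of those components, the timestamped knowledge base and the human-mediated resolution step, can be sketched as follows. This is a toy under stated assumptions, not ARIA's code: `TimestampedKB`, the `ask_human` hook, and the refund-window facts are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Fact:
    key: str
    value: str
    timestamp: int  # larger means more recent


class TimestampedKB:
    """Toy knowledge base that flags contradictions instead of overwriting."""

    def __init__(self, ask_human: Callable[[Fact, Fact], Fact]):
        self.facts: dict[str, Fact] = {}
        self.ask_human = ask_human  # hypothetical escalation hook

    def learn(self, new: Fact) -> None:
        old = self.facts.get(new.key)
        if old is not None and old.value != new.value:
            # Conflict detected: recency alone may be wrong, so defer to a person.
            new = self.ask_human(old, new)
        self.facts[new.key] = new


escalations: list[tuple[Fact, Fact]] = []


def reviewer(old: Fact, new: Fact) -> Fact:
    # Stand-in for a human who knows context the system does not.
    escalations.append((old, new))
    return old  # this reviewer decides the older rule still applies


kb = TimestampedKB(ask_human=reviewer)
kb.learn(Fact("refund_window", "30 days", timestamp=1))
kb.learn(Fact("refund_window", "14 days", timestamp=2))
```

A "newest wins" rule would have silently stored 14 days; routing the conflict to `reviewer` is what lets out-of-system context decide.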

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst revisiting the training-time vs. inference-time knowledge injection trade-off in LLMs, treating a curated library's findings (2023–2025) as dated claims to be re-tested, not current truth.

What a curated library found — and when (dated claims, not current truth):

Findings span 2023–2025. The library identified five key constraints:

• Training-time injection (fine-tuning) is fastest at inference but rigid and expensive; inference-time (RAG, prompt optimization) buys flexibility at latency cost. Neither alone is optimal (~2024).

• Prompt optimization cannot inject *new* knowledge absent from training data — only activate latent capability. Reasoning models outpace base models even with unlimited inference compute because reasoning was baked at training time (~2024).

• Direct fine-tuning corrupts lower-layer factual knowledge; proxy-tuning at decoding time preserves pretrained weights better while closing alignment gaps (~2024).

• Structure matters more than timing: StructTuning reaches 50% of full performance on 0.3% of data by teaching domain taxonomy rather than raw text; RLAG internalizes knowledge more durably via RL-from-explanation than SFT (~2024–2025).

• Inference-time knowledge stays current but cannot add genuinely missing capability; training-time adds capability but risks staleness and contamination (~2024–2025).

Anchor papers (verify; mind their dates):
• arXiv:2407.16724 (StructTuning, 2024-07)
• arXiv:2501.06252 (Transformer², 2025-01)
• arXiv:2509.20162 (RLAG, 2025-09)
• arXiv:2508.06105 (LogicRAG, 2025-08)

Your task:

(1) RE-TEST EACH CONSTRAINT. For each of the five boundaries above, judge whether newer scaling laws, mixture-of-experts routing, continual learning checkpoints, or integrated train-infer hybrid methods (e.g., in-context learning + adaptive retrieval + lightweight re-weighting) have since relaxed or overturned it. Separate the durable question (*how do we reconcile flexibility, knowledge freshness, and capability depth?*) from perishable limitations (e.g., "prompt optimization cannot inject new knowledge"). Cite what architectural or training innovation resolved each, and plainly flag where constraints still hold.

(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Look especially for papers claiming unified train-infer architectures, test-time adaptation that adds capability (not just activation), or empirical refutations of the staleness/contamination trade-off.

(3) Propose 2 research questions that ASSUME the regime has moved: e.g., "Can agentic re-weighting during inference match fine-tuning stability without weight updates?" or "Does staged knowledge injection (curriculum during training + adaptive retrieval at inference) overcome both staleness and corruption?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
