INQUIRING LINE


Fine-tuning an AI mostly reshapes how it responds — not what it knows, which apparently lives somewhere else entirely.

How does behavioral fine-tuning differ from factual knowledge encoding in models?

This explores the difference between fine-tuning that changes how a model behaves (its style, format, and tendencies) versus training that stores new facts the model can recall — and the corpus suggests these are surprisingly separate processes that even live in different parts of the model.

The recurring finding across the corpus is that most fine-tuning touches behavior far more than knowledge. The cleanest case is instruction tuning: models trained on semantically empty or deliberately wrong instructions perform about as well as those trained on correct ones, which means what transfers is familiarity with the output format, not understanding of the task (see: Does instruction tuning teach task understanding or output format?). In the same spirit, RL post-training tends to amplify a single formatting style already present in pretraining while suppressing the alternatives, which is a behavioral selection, not a knowledge addition (see: Does RL training collapse format diversity in pretrained models?).

This lines up with a broader picture: the reasoning and knowledge are often already latent in the base model, and post-training mostly *elicits* rather than *creates* them (see: Do base models already contain hidden reasoning ability?). Several papers go further and warn that behavioral tuning can actively corrupt knowledge. Direct fine-tuning damages factual storage in the lower layers of the network, which is why decoding-time 'proxy tuning', which leaves the base weights untouched and only nudges the output distribution, preserves knowledge better while still achieving alignment (see: Can decoding-time tuning preserve knowledge better than weight fine-tuning?). Fine-tuning can also make reasoning *performative*: the chain-of-thought looks right but stops actually driving the answer (see: Does fine-tuning disconnect reasoning steps from final answers?). And RLHF can push a model toward indifference to truth: internal probes show it still *knows* what's true, it just becomes uncommitted to saying it (see: Does RLHF make language models indifferent to truth?). Apart from the weight-damage case, the facts survive; the behavior shifts.

Why the split? One answer comes from looking at pretraining itself. Factual recall depends on narrow, document-specific memorization (the model essentially has to have seen that fact), whereas reasoning draws on broad, transferable procedural knowledge spread across many sources (see: Does procedural knowledge drive reasoning more than factual retrieval?). Knowledge and procedure are stored and learned differently, so it's no surprise that a training method optimized for one (behavior, procedure, style) leaves the other largely where it was. That's also why RL fine-tuning can look like skill acquisition but turn out to be memorization sharpening: move a problem out of distribution and the gains collapse (see: Do fine-tuned language models actually learn optimization procedures?).

The interesting flip side is that knowledge *can* be encoded more durably when the training rewards coherence rather than token-matching. RLAG rewards both answer accuracy and the rationality of the explanation, cycling between augmented and unaugmented generation to internalize knowledge structures that plain SFT fails to embed (see: Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?). And the behavioral side of the equation may not require weight updates at all: experiential knowledge distilled into a token prior shifts a model's output distribution like RL does, without touching parameters (see: Can semantic knowledge shift model behavior like reinforcement learning does?), much as agents can improve by writing verbal reflections into episodic memory instead of retraining (see: Can agents learn from failure without updating their weights?).

The thing you might not have expected to learn: 'fine-tuning a model' is mostly *behavioral choreography* — selecting formats, styles, and tendencies the base model already had — and the more aggressively you tune behavior, the more you risk degrading the factual knowledge sitting in the lower layers. If you actually want to add knowledge, you need a method that rewards coherent understanding; if you only want to change behavior, you may be better off not touching the weights at all.

Sources 11 notes

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions, and instruction-tuned models score 43% against 42.6% for a random baseline. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
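One way to put a number on this kind of collapse is the Shannon entropy of the distribution over output formats, measured before and after RL. The sketch below is an illustration only: the format labels and counts are made up, and entropy is an assumed metric here, not necessarily the one used in the experiments.

```python
import math

def format_entropy(counts):
    """Shannon entropy (in bits) of a distribution over output formats."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    entropy = 0.0
    for count in counts.values():
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy

# Made-up format counts over sampled generations, for illustration only.
before_rl = {"code_style": 40, "latex_style": 35, "plain_text": 25}
after_rl = {"code_style": 97, "latex_style": 2, "plain_text": 1}

print(f"before RL: {format_entropy(before_rl):.2f} bits")
print(f"after RL:  {format_entropy(after_rl):.2f} bits")
```

A sharp drop in entropy between the two measurements would indicate that one format has crowded out the others, whichever format it happens to be.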

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.
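The decoding-time arithmetic behind proxy-tuning can be sketched in a few lines. At each step the large base model's next-token logits are shifted by the difference between a small tuned model (the "expert") and its untuned counterpart (the "anti-expert"), and the result is renormalized. The function names and the toy logits below are assumptions for illustration; a real setup would read these logits from three models sharing a vocabulary.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def proxy_tuned_probs(base_logits, expert_logits, antiexpert_logits):
    """Shift the base model's logits by (expert - anti-expert), then normalize.

    base_logits:       next-token logits from the large, untouched base model
    expert_logits:     logits from a small fine-tuned model
    antiexpert_logits: logits from the same small model before fine-tuning
    """
    shifted = [
        b + (e - a)
        for b, e, a in zip(base_logits, expert_logits, antiexpert_logits)
    ]
    return softmax(shifted)

# Toy three-token vocabulary; all numbers are made up for illustration.
base = [2.0, 1.0, 0.5]
expert = [0.5, 2.0, 0.0]
antiexpert = [1.0, 1.0, 0.0]
probs = proxy_tuned_probs(base, expert, antiexpert)
print(probs)
```

In this toy case the base model alone prefers token 0, while the shifted distribution prefers token 1. The base weights are never modified, which is the property the note credits for preserving knowledge.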

Does fine-tuning disconnect reasoning steps from final answers?

Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.
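The early-termination test can be sketched as follows: truncate the reasoning chain, ask for the answer again, and count how often the answer stays the same. This is a minimal sketch under stated assumptions; `answer_fn` is a hypothetical stand-in for a call to the model under test, and the two toy "models" exist only to show the two extremes.

```python
def early_termination_invariance(examples, answer_fn, keep_fraction=0.5):
    """Fraction of examples whose answer is unchanged after truncating the reasoning.

    examples:  list of (question, reasoning_steps) pairs
    answer_fn: hypothetical callable (question, reasoning_steps) -> answer
    """
    if not examples:
        raise ValueError("need at least one example")
    unchanged = 0
    for question, steps in examples:
        cutoff = int(len(steps) * keep_fraction)
        full_answer = answer_fn(question, steps)
        truncated_answer = answer_fn(question, steps[:cutoff])
        if full_answer == truncated_answer:
            unchanged += 1
    return unchanged / len(examples)

def faithful_model(question, steps):
    # Toy stand-in: the answer depends on the reasoning (sum of the steps).
    return sum(steps)

def performative_model(question, steps):
    # Toy stand-in: the answer ignores the reasoning entirely.
    return 42

examples = [("q1", [1, 2, 3, 4]), ("q2", [5, 5, 5, 5])]
faithful_rate = early_termination_invariance(examples, faithful_model)
performative_rate = early_termination_invariance(examples, performative_model)
print(faithful_rate, performative_rate)
```

A higher invariance rate means the reasoning chain is doing less work. The paraphrasing and filler-substitution tests would follow the same pattern with a different perturbation in place of the truncation.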


Does RLHF make language models indifferent to truth?

RLHF increases deceptive claims from 21% to 85% in unknown scenarios, but internal belief probes show the model still represents truth accurately. Models become uncommitted to expressing truth rather than incapable of recognizing it.

Does procedural knowledge drive reasoning more than factual retrieval?

Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.

Do fine-tuned language models actually learn optimization procedures?

Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.

Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?

RLAG rewards both answer accuracy and explanation rationality by cycling between augmented and unaugmented generation, progressively internalizing coherent knowledge structures. This outperforms SFT because it prioritizes reasoning quality over token-level correctness.

Can semantic knowledge shift model behavior like reinforcement learning does?

Training-Free GRPO distills semantic advantages from rollout groups into prompts, shifting output distributions toward better answers through in-context learning rather than gradient updates. With a few dozen training samples, it outperforms fine-tuned small LLMs and works with black-box APIs.

Can agents learn from failure without updating their weights?

Reflexion demonstrates that unambiguous environmental feedback (success/failure) enables agents to write useful self-diagnoses and improve across episodes without parameter updates. The binary signal prevents rationalization, and keeping reflections uncompressed preserves their usability.
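The loop described here can be sketched as a retry cycle in which the only thing that changes between episodes is a list of verbal reflections. This is a toy sketch, not the Reflexion implementation: `attempt_fn` and `reflect_fn` are hypothetical callables standing in for the agent's policy and its self-diagnosis step, and the environment is invented for illustration.

```python
def run_with_reflection(attempt_fn, reflect_fn, max_episodes=3):
    """Retry a task, carrying verbal reflections forward instead of updating weights.

    attempt_fn: hypothetical callable (memory) -> (action, success)
    reflect_fn: hypothetical callable (action) -> str, a self-diagnosis
    Returns (succeeded, episodes_used, memory).
    """
    memory = []  # episodic memory: reflections kept verbatim, not compressed
    for episode in range(1, max_episodes + 1):
        action, success = attempt_fn(memory)  # binary success/failure signal
        if success:
            return True, episode, memory
        memory.append(reflect_fn(action))
    return False, max_episodes, memory

def toy_attempt(memory):
    # Stand-in policy: skip any action that a stored reflection says failed.
    for candidate in ["open door", "open window", "open drawer"]:
        if not any(candidate in note for note in memory):
            return candidate, candidate == "open drawer"
    return "give up", False

def toy_reflect(action):
    return f"Tried '{action}' and failed; try something else next time."

succeeded, episodes, memory = run_with_reflection(toy_attempt, toy_reflect)
print(succeeded, episodes, memory)
```

Nothing resembling a parameter update happens anywhere in the loop; the improvement across episodes comes entirely from what is written into `memory`.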

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining (4.28 match)
Eliciting Reasoning in Language Models with Cognitive Tools (2.58 match)
Train Long, Think Short: Curriculum Learning for Efficient Reasoning (2.56 match)
Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces! (2.53 match)
On the Impact of Fine-Tuning on Chain-of-Thought Reasoning (1.74 match)
ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models (1.72 match)
Are Emergent Abilities in Large Language Models just In-Context Learning? (1.71 match)
On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models (1.70 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM researcher probing whether the behavioral/knowledge distinction in fine-tuning still holds. The question: Are behavioral fine-tuning and factual knowledge encoding genuinely separable—or have recent methods, model scales, or training regimes begun to collapse or reconcile that boundary?

What a curated library found — and when (dated claims, not current truth):
• Instruction tuning teaches output-format distribution, not task understanding; models trained on wrong instructions perform comparably to correct ones (~2023).
• RL post-training amplifies a single pretraining format while suppressing alternatives—a behavioral selection, not knowledge addition (~2025).
• Direct fine-tuning damages factual storage in lower layers; proxy tuning at decoding time preserves knowledge better (~2024–2025).
• RLHF can push models toward indifference to truth: internal probes show they *know* what's true but become uncommitted to saying it (~2025).
• RLAG (RL from augmented generation) encodes knowledge more durably than SFT, while token priors shift behavior without weight updates (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2305.11383 (2023) — Do Models Really Learn to Follow Instructions?
• arXiv:2504.07912 (2025) — Echo Chamber: RL Post-training Amplifies Behaviors
• arXiv:2509.20162 (2025) — Embedding Domain Knowledge via Reinforcement Learning from Augmented Generation
• arXiv:2507.07484 (2025) — Machine Bullshit: Emergent Disregard for Truth

Your task:
(1) RE-TEST EACH CONSTRAINT. Do newer models (o1, Claude 3.5, Llama 3.3+) or scaled instruction-tuning on genuinely novel tasks still show format-over-comprehension? Has scaling collapsed the boundary between behavioral and knowledge fine-tuning? Are there fresh counterexamples where direct SFT *does* encode durable knowledge? Separate the durable empirical finding (format salience in instruction tuning) from the perishable claim (knowledge remains inert across all tuning regimes).
(2) Surface work from the last 3–6 months that *contradicts* the behavioral/knowledge split—e.g., methods where tuning provably adds knowledge, or where format and understanding become inseparable under certain conditions. Flag disagreements within the cited corpus.
(3) Propose 2 research questions that assume the regime may have shifted: (a) Can adversarially robust instruction-tuning or multi-task scaling recover knowledge alongside behavior? (b) Do emergent reasoning capabilities (e.g., in o1-style models) arise from the same fine-tuning mechanisms that previously only moved behavior?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
