Why does naive personalization fine-tuning destroy generalist reasoning?
This explores why fine-tuning a model on a single user's data to personalize it tends to wreck its broader reasoning ability — and what the corpus says is actually breaking.
The short version from the corpus: naive fine-tuning doesn't teach the model to reason about you. It teaches it to produce answers that look right while quietly severing the link between thinking and concluding. One study finds that fine-tuning destroys reasoning capacity for personalization tasks outright: generic chain-of-thought already ignores user context, and the tuning process optimizes for the surface form of the answer rather than the path to it (see "Why does chain-of-thought reasoning fail for personalization?").
The mechanism shows up cleanly when you measure reasoning directly instead of just scoring final answers. Fine-tuning makes reasoning chains *performative*: under three separate faithfulness tests (cutting the chain short, paraphrasing it, swapping in filler), fine-tuned models land on the same answer more often than they did before tuning, which suggests the reasoning steps no longer reliably drive the output (see "Does fine-tuning disconnect reasoning steps from final answers?"). This is the heart of the 'SFT accuracy trap': supervised fine-tuning can raise final-answer accuracy on benchmarks while cutting Information Gain, the study's measure of inferential content, by 38.9%, because the model learns to rationalize toward a known answer rather than infer it (see "Does supervised fine-tuning improve reasoning or just answers?"). Standard metrics miss this entirely because they only check whether the final answer is correct.
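The three faithfulness tests above can be sketched as a small harness that perturbs a reasoning chain and counts how often the final answer stays the same. This is a minimal illustration, not the cited study's code: the `AnswerFn` interface, the toy paraphraser, and both toy models are assumptions made up for the example. In practice `answer_fn` would wrap a real model call and the paraphraser would be a real rewriting step.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical interface (not a real library API): given a question and a
# reasoning chain, return the model's final answer as a string.
AnswerFn = Callable[[str, List[str]], str]
Perturbation = Callable[[List[str]], List[str]]


def truncate(chain: List[str]) -> List[str]:
    """Early termination: keep only the first half of the steps."""
    return chain[: len(chain) // 2]


def toy_paraphrase(chain: List[str]) -> List[str]:
    """Stand-in for a real paraphraser: changes the wording, keeps the content."""
    return ["In other words: " + step for step in chain]


def filler(chain: List[str]) -> List[str]:
    """Filler substitution: replace every step with content-free text."""
    return ["..." for _ in chain]


PERTURBATIONS: Dict[str, Perturbation] = {
    "early_termination": truncate,
    "paraphrase": toy_paraphrase,
    "filler": filler,
}


def invariance_rates(
    answer_fn: AnswerFn, examples: List[Tuple[str, List[str]]]
) -> Dict[str, float]:
    """Fraction of examples whose answer is unchanged under each perturbation."""
    rates: Dict[str, float] = {}
    for name, perturb in PERTURBATIONS.items():
        unchanged = sum(
            answer_fn(question, perturb(chain)) == answer_fn(question, chain)
            for question, chain in examples
        )
        rates[name] = unchanged / len(examples)
    return rates


# Toy models, for illustration only.
def chain_driven_model(question: str, chain: List[str]) -> str:
    """The answer is computed from the chain: the sum of the integers it mentions."""
    return str(sum(int(n) for step in chain for n in re.findall(r"\d+", step)))


MEMORIZED = {"What is 2 + 3?": "5", "What is 4 + 6?": "10"}


def chain_ignoring_model(question: str, chain: List[str]) -> str:
    """The answer is looked up from the question; the chain is never read."""
    return MEMORIZED[question]


EXAMPLES = [
    ("What is 2 + 3?", ["Start with 2.", "Add 3."]),
    ("What is 4 + 6?", ["Start with 4.", "Add 6."]),
]

print(invariance_rates(chain_driven_model, EXAMPLES))
print(invariance_rates(chain_ignoring_model, EXAMPLES))
```

On these toy inputs the chain-ignoring model is invariant under all three perturbations, while the chain-driven model changes its answer when the chain is truncated or replaced with filler. Running the same harness on a model before and after fine-tuning, and comparing the rates, is the kind of comparison the faithfulness result describes.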
There's a deeper reason this is so destructive, and it reframes the whole problem: reasoning in these models isn't built by training, it's *selected* from latent capability already in the base model. Five independent mechanisms (RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR) all elicit reasoning that was already present in base-model activations; post-training picks reasoning rather than creating it (see "Do base models already contain hidden reasoning ability?"). So when you fine-tune hard on a narrow personalization signal, you're not adding a skill; you're overwriting the delicate selection that surfaced general reasoning, and it collapses toward memorized pattern-matching. Related work shows that even RL fine-tuning with GRPO optimizes template-matching rather than installing genuine procedures, with sharp drops on out-of-distribution variants (see "Do fine-tuned language models actually learn optimization procedures?"), and that reasoning chains imitating the *form* of logic fall apart once task, length, or format shifts away from the training distribution (see "Does chain-of-thought reasoning actually generalize beyond training data?").
The surprising turn is that the better personalization methods barely touch the weights at all. If reasoning is fragile selected capability, the move is to personalize *around* it rather than *through* it. Semantic memory, in the form of compact preference summaries, consistently beats episodic retrieval of past interactions, and in the same study preference-tuning methods trail plain task fine-tuning (see "Does abstract preference knowledge outperform specific interaction recall?"). You can infer a user's preference coefficients from as few as ten adaptive questions and align at inference time, with no weight modification whatsoever (see "Can user preferences be learned from just ten questions?"). And where you do want the model itself to think personally, self-distillation, in which the model generates its own customized reasoning traces, restores the depth that brute-force tuning destroys (see "Why does chain-of-thought reasoning fail for personalization?").
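To make the inference-time route concrete, here is a minimal sketch of learning preference coefficients from a handful of adaptive comparison questions and then using them to pick among candidate responses, with no weight updates. It is not PReF's implementation. It assumes a user's reward is a linear combination of base reward scores, uses a simple "ask about the most uncertain pair" rule and a Bradley-Terry style logistic update as stand-ins, and simulates the user with a hidden coefficient vector. All names, feature values, and the two illustrative features (brevity, formality) are hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Features = Tuple[float, ...]      # one response's scores under each base reward
Pair = Tuple[Features, Features]  # one comparison question: response A vs. B


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def diff(pair: Pair) -> List[float]:
    return [x - y for x, y in zip(pair[0], pair[1])]


def infer_coefficients(
    pool: List[Pair],
    ask: Callable[[Features, Features], bool],
    budget: int = 10,
    lr: float = 0.5,
) -> List[float]:
    """Estimate preference coefficients from at most `budget` adaptive questions."""
    w = [0.0] * len(pool[0][0])
    asked = set()
    for _ in range(min(budget, len(pool))):
        # Adaptive step (assumed heuristic): ask about the unasked pair whose
        # predicted preference is closest to 50/50 under the current estimate.
        remaining = [i for i in range(len(pool)) if i not in asked]
        i = min(remaining, key=lambda j: abs(dot(w, diff(pool[j]))))
        asked.add(i)
        d = diff(pool[i])
        y = 1.0 if ask(pool[i][0], pool[i][1]) else 0.0  # True means A preferred
        p = 1.0 / (1.0 + math.exp(-dot(w, d)))           # predicted P(A preferred)
        # One gradient-ascent step on the logistic log-likelihood.
        w = [wi + lr * (y - p) * di for wi, di in zip(w, d)]
    return w


def pick_response(w: Sequence[float], candidates: Dict[str, Features]) -> str:
    """Inference-time alignment: return the candidate with the highest reward."""
    return max(candidates, key=lambda name: dot(w, candidates[name]))


# Simulated user with hidden coefficients over (brevity, formality).
TRUE_W = (1.0, -1.0)


def simulated_user(a: Features, b: Features) -> bool:
    return dot(TRUE_W, a) > dot(TRUE_W, b)


QUESTION_POOL: List[Pair] = [
    ((1.0, 0.0), (0.0, 1.0)), ((0.8, 0.6), (0.2, 0.1)), ((0.3, 0.9), (0.7, 0.2)),
    ((0.5, 0.5), (0.9, 0.8)), ((0.6, 0.1), (0.4, 0.4)), ((0.2, 0.7), (0.1, 0.3)),
    ((0.9, 0.3), (0.6, 0.9)), ((0.1, 0.1), (0.5, 0.2)), ((0.7, 0.7), (0.3, 0.0)),
    ((0.4, 0.8), (0.8, 0.5)), ((0.0, 0.4), (0.2, 0.9)), ((0.6, 0.6), (0.1, 0.0)),
]

CANDIDATES: Dict[str, Features] = {
    "short and casual": (0.9, 0.1),
    "long and formal": (0.1, 0.9),
    "balanced": (0.5, 0.5),
}

learned_w = infer_coefficients(QUESTION_POOL, simulated_user, budget=10)
print(learned_w, pick_response(learned_w, CANDIDATES))
```

The design point is that the model producing the candidates is never retrained: only the small coefficient vector changes per user, so whatever reasoning the base model has is left untouched.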
The thing worth carrying away: 'personalization' and 'reasoning' want to live in different places. Reasoning is a fragile general capability the base model already has; preferences are cheap, low-dimensional context best supplied at inference time. Naive fine-tuning fails because it tries to cram the second into the weights and clobbers the first in the process.
Sources (8 notes)
Why does chain-of-thought reasoning fail for personalization? Generic chain-of-thought underperforms for personalization because it ignores user context. Fine-tuning destroys reasoning capacity entirely. Self-distillation lets models generate customized thinking traces that maintain both depth and relevance.
Does fine-tuning disconnect reasoning steps from final answers? Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.
Does supervised fine-tuning improve reasoning or just answers? Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.
Do base models already contain hidden reasoning ability? Five independent mechanisms (RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR) all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Do fine-tuned language models actually learn optimization procedures? Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.
Does chain-of-thought reasoning actually generalize beyond training data? DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning, imitating reasoning form without valid underlying logic.
Does abstract preference knowledge outperform specific interaction recall? The PRIME framework shows semantic memory (preference summaries, parametric encodings) consistently beats episodic memory (retrieved past interactions) across models. Recency-based recall outperforms similarity-based retrieval, and task fine-tuning exceeds preference tuning methods.
Can user preferences be learned from just ten questions? PReF learns base reward functions from preference data, then uses active learning to select maximally informative questions that reduce coefficient uncertainty. Responses can then be personalized to a user via inference-time reward alignment, without weight modification.