How does semantic entanglement interact with personality dimension shifts during finetuning?
This explores whether the meanings a model learns and the personality it expresses are tangled together in the same internal space, such that finetuning for one quietly shifts the other. 'Semantic entanglement' is read here as the way meanings and traits are bundled together in a model's internal representations. The corpus doesn't use that exact phrase, but it has a lot to say about the underlying phenomenon, and the short version is: yes, traits live as directions in the same activation space that carries meaning, so finetuning moves them whether you intended it or not.
The sharpest evidence is the work on persona vectors ("Can we track and steer personality shifts during model finetuning?"), which finds that traits like sycophancy or hallucination correspond to specific *linear directions* in activation space. Because these directions sit in the same geometry the model uses to represent everything else, finetuning pushes the model along them in predictable ways, which is exactly why you can monitor the drift and even steer pre-emptively to cancel it before training causes it. Complementing this, the 'Assistant axis' research ("How stable is the trained Assistant personality in language models?") shows persona space is surprisingly low-dimensional: one dominant axis measures distance from the default Assistant, and ordinary nudges (emotional or self-reflective conversation) slide the model along it. A low-dimensional, linear structure is a plausible recipe for entanglement: with fewer, shared directions, a change aimed at one trait can spill into its neighbors.
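To make "a trait is a linear direction" concrete, here is a minimal sketch on synthetic data. It is not the method from either note: it assumes a simple difference-of-means estimate of the direction, models finetuning as a plain additive shift, and uses hypothetical helper names (`persona_direction`, `projection`, `cap_along`). It shows how a shift along a direction could be monitored by projection, and how the component along that direction could be capped while the orthogonal part is left alone.

```python
import numpy as np


def persona_direction(trait_acts, neutral_acts):
    """Unit vector from the neutral mean to the trait mean (difference of means)."""
    diff = trait_acts.mean(axis=0) - neutral_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def projection(acts, direction):
    """Scalar projection of each activation row onto a unit direction."""
    return acts @ direction


def cap_along(acts, direction, max_proj):
    """Clamp the component along `direction` to at most `max_proj`.

    Only the part of each row that exceeds the cap is subtracted, so the
    component orthogonal to `direction` is unchanged.
    """
    proj = acts @ direction
    excess = np.clip(proj - max_proj, 0.0, None)
    return acts - np.outer(excess, direction)


# Synthetic stand-ins for hidden activations (assumption: 16-dim toy space).
rng = np.random.default_rng(0)
dim = 16
true_dir = np.zeros(dim)
true_dir[0] = 1.0

neutral = rng.normal(size=(200, dim))
trait = rng.normal(size=(200, dim)) + 3.0 * true_dir

v = persona_direction(trait, neutral)

# Model "finetuning" as an additive shift along the true trait direction.
before = rng.normal(size=(100, dim))
after = before + 1.5 * true_dir

# Monitoring: how far did the mean projection move along the estimated direction?
shift = projection(after, v).mean() - projection(before, v).mean()

# Mitigation: cap the projection without touching the orthogonal component.
capped = cap_along(after, v, max_proj=1.0)

print(f"mean shift along estimated direction: {shift:.2f}")
print(f"max projection after capping: {projection(capped, v).max():.2f}")
```

In a real model the rows would be hidden states collected at a chosen layer, and both the choice of layer and the cap value would need to be validated empirically.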
The more interesting cross-domain angle is that this 'spillage' isn't limited to personality. Finetuning degrades chain-of-thought faithfulness ("Does fine-tuning disconnect reasoning steps from final answers?"): reasoning steps become decorative rather than causal, *independently of accuracy*. And there's a semantic-drift cousin: because common words carry more abstract meanings, any pressure toward frequent paraphrases systematically erases expert specificity ("Does word frequency correlate with semantic abstraction?"). Put together, these say finetuning is rarely a clean local edit; it tugs on bundled representations: faithfulness here, abstraction there, personality somewhere else.
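One way to picture the "decorative versus causal" distinction is an early-termination check: truncate the reasoning chain at each step and see whether the final answer changes. The sketch below is illustrative only. The `answer_fn` callable and both toy answerers are hypothetical stand-ins for a model, and the score is a simple invariance rate, not the exact metric used in the note.

```python
from typing import Callable, Sequence

# Hypothetical interface: given a question and a (possibly truncated) list of
# reasoning steps, return the final answer as a string.
Answerer = Callable[[str, Sequence[str]], str]


def early_termination_invariance(
    answer_fn: Answerer, question: str, steps: Sequence[str]
) -> float:
    """Fraction of truncated chains whose answer matches the full-chain answer.

    A value near 1.0 means the answer barely depends on the reasoning
    (decorative); a value near 0.0 means the reasoning is doing causal work.
    """
    if not steps:
        raise ValueError("need at least one reasoning step to truncate")
    full = answer_fn(question, steps)
    truncated = [answer_fn(question, steps[:k]) for k in range(len(steps))]
    return sum(a == full for a in truncated) / len(truncated)


def faithful(question: str, steps: Sequence[str]) -> str:
    """Toy answerer whose output is computed from the steps it is given."""
    return str(sum(int(s) for s in steps))


def decorative(question: str, steps: Sequence[str]) -> str:
    """Toy answerer that ignores the reasoning entirely."""
    return "9"


question = "What is 2 + 3 + 4?"
steps = ["2", "3", "4"]

faithful_score = early_termination_invariance(faithful, question, steps)
decorative_score = early_termination_invariance(decorative, question, steps)

print(faithful_score)    # 0.0: every truncation changes the answer
print(decorative_score)  # 1.0: no truncation changes the answer
```

Comparing this rate before and after finetuning, on the same questions, is the kind of measurement that would reveal reasoning becoming less load-bearing even when accuracy holds steady.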
The flip side is encouraging for control. PsychAdapter ("Can we control personality in language models without prompting?") deliberately exploits the entanglement, touching every transformer layer with under 0.1% extra parameters to dial Big Five traits architecturally and bypass prompt resistance. That resistance is real: most open models cling to their trained ENFJ-like defaults and shrug off prompted personalities ("Can open language models adopt different personalities through prompting?"). So the picture is two-sided: traits are entangled enough that finetuning perturbs them by accident, yet stable enough at the surface that prompting alone often can't move them. Real change requires reaching into the weights, where the entanglement lives.
The thing you may not have known you wanted: personality isn't a separate module you finetune on purpose. It's a few linear directions woven through the same space that stores meaning, which is why a model can come out of training subtly more sycophantic or less faithful without anyone touching a 'personality' knob. If you want a frame for *why* that bundling is so hard to disentangle, the superposition-of-simulacra view ("Does an LLM commit to a single character or maintain many?") is the natural next stop.
Sources (7 notes)
Research identifies linear directions in LLM activation space corresponding to specific traits like sycophancy and hallucination. These persona vectors predict finetuning-induced personality shifts before they occur and can preventatively steer training to avoid unwanted trait changes.
Research mapping hundreds of character archetypes reveals a low-dimensional persona space where the leading component measures distance from the default Assistant. Emotional and meta-reflective conversations cause predictable drift, but activation capping along this axis mitigates harmful shifts without degrading capabilities.
Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.
WordNet analysis shows hypernyms (general concepts) occur more frequently than hyponyms (specific ones). Combined with LLMs' frequency bias, this means preferring common paraphrases systematically drifts toward abstraction, erasing expert-level specificity.
PsychAdapter modifies every transformer layer with <0.1% additional parameters to achieve 87.3% Big Five accuracy and 96.7% depression/life satisfaction accuracy across GPT-2, Gemma, and Llama 3. This architecture-level approach bypasses prompt resistance entirely.
Research shows most open models fail to adopt prompted personalities, stubbornly retaining their trained ENFJ-like defaults. Only a few flexible models succeed. Combining role and personality conditioning improves results but doesn't fully overcome resistance.
Research shows LLMs don't commit to a single character but instead maintain a probability distribution over many consistent simulacra. Each response samples from this distribution, explaining why regenerations can yield different personalities while remaining consistent with prior context.