Do personality traits and task knowledge occupy separate subspaces in transformer parameters?
This explores whether a model's personality and its task knowledge sit in distinct regions of its weights, separable enough that you could edit one without scrambling the other. The most direct evidence in the corpus says: partially, yes. The Chamain model-merging work found it could splice domain knowledge into a character chatbot while keeping ~80% of task performance and preserving the persona. The reason it works is that persona and knowledge occupy *partially separable* regions of the parameters (Can chatbots learn new knowledge without losing their personality?). 'Partially' is the load-bearing word: separable enough to merge surgically, entangled enough that you lose some performance in the seam.
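To make "merging" concrete, here is a minimal sketch of generic task-vector arithmetic with per-layer coefficients. It assumes checkpoints are plain dicts of NumPy arrays keyed by layer name. This is not Chamain's actual procedure: the function names `task_vector` and `merge_layerwise`, the two-coefficient scheme, and the toy weights are all hypothetical illustrations.

```python
import numpy as np

def task_vector(finetuned, base):
    # Task vector: element-wise difference between a finetuned checkpoint and its base.
    return {name: finetuned[name] - base[name] for name in base}

def merge_layerwise(base, persona_tv, knowledge_tv, coeffs):
    # Hypothetical scheme: coeffs maps each layer name to (lam_persona, lam_knowledge),
    # and each layer gets its own weighted sum of task vectors added back to the base.
    merged = {}
    for name, weights in base.items():
        lam_p, lam_k = coeffs[name]
        merged[name] = weights + lam_p * persona_tv[name] + lam_k * knowledge_tv[name]
    return merged

# Toy "checkpoints": two layers of four weights each.
base = {"layer0": np.zeros(4), "layer1": np.zeros(4)}
persona_ft = {"layer0": np.array([1.0, 0.0, 0.0, 0.0]),
              "layer1": np.array([1.0, 0.0, 0.0, 0.0])}
knowledge_ft = {"layer0": np.array([0.0, 1.0, 0.0, 0.0]),
                "layer1": np.array([0.0, 1.0, 0.0, 0.0])}

persona_tv = task_vector(persona_ft, base)
knowledge_tv = task_vector(knowledge_ft, base)

# Keep only the persona in layer0; blend in the full knowledge update at layer1.
coeffs = {"layer0": (1.0, 0.0), "layer1": (0.5, 1.0)}
merged = merge_layerwise(base, persona_tv, knowledge_tv, coeffs)
# merged["layer0"] -> [1.0, 0.0, 0.0, 0.0]; merged["layer1"] -> [0.5, 1.0, 0.0, 0.0]
```

The toy task vectors touch disjoint coordinates, which is the idealized, fully separable case where nothing is lost. In a real model the two task vectors would plausibly overlap in the same weights, and that overlap is one way to read the performance lost in the seam.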
What makes the separation tractable is that personality turns out to be surprisingly *low-dimensional*. One line of work maps hundreds of character archetypes and finds a persona space whose leading axis is simply distance from the default 'Assistant'. Capping activations along that single axis prevents harmful personality drift without degrading the model's general capabilities (How stable is the trained Assistant personality in language models?). Relatedly, individual traits like sycophancy or hallucination show up as clean *linear directions* in activation space, so you can monitor and steer them in isolation (Can we track and steer personality shifts during model finetuning?). That a trait can be a single vector, and that capabilities survive when you cap or steer along it, is exactly what a 'separate subspace' picture would predict.
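A minimal NumPy sketch of the operations this paragraph describes: extracting a trait direction, monitoring by projection, and capping or steering along it. It assumes hidden states are rows of an array and that the direction is estimated as a difference of means between trait-exhibiting and neutral samples. That is one common approach, not necessarily what the cited work does. The function names, the toy data, and the `max_proj` threshold are hypothetical, and the axis is assumed to be oriented so that larger projections mean more drift.

```python
import numpy as np

def persona_direction(acts_with_trait, acts_without_trait):
    # Difference of mean activations between trait-exhibiting and neutral samples,
    # normalized to unit length so projections are directly comparable.
    v = acts_with_trait.mean(axis=0) - acts_without_trait.mean(axis=0)
    return v / np.linalg.norm(v)

def trait_score(h, v):
    # Monitoring: scalar projection of each hidden state (row of h) onto unit direction v.
    return h @ v

def cap_along(h, v, max_proj):
    # Activation capping: clamp the component along v to at most max_proj,
    # leaving the component of h orthogonal to v untouched.
    proj = h @ v
    excess = np.maximum(proj - max_proj, 0.0)
    return h - excess[..., np.newaxis] * v

def steer(h, v, alpha):
    # Steering: shift every hidden state by alpha along the direction.
    return h + alpha * v

# Toy 2-D "activations".
with_trait = np.array([[2.0, 0.0], [4.0, 0.0]])
neutral = np.array([[0.0, 0.0], [0.0, 0.0]])
v = persona_direction(with_trait, neutral)   # unit vector [1.0, 0.0]

h = np.array([[3.0, 1.0], [0.5, 2.0]])
scores = trait_score(h, v)                   # [3.0, 0.5]
capped = cap_along(h, v, max_proj=1.0)       # [[1.0, 1.0], [0.5, 2.0]]
steered = steer(h, v, alpha=-0.5)            # [[2.5, 1.0], [0.0, 2.0]]
```

Capping only removes the excess component along `v`; everything orthogonal to it passes through unchanged. That is one geometric intuition for why a single-axis intervention can leave other capabilities largely intact.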
But here's the twist worth knowing: separable in *behavior* doesn't mean tidily localized in *storage*. PsychAdapter achieves strong personality control (87.3% Big Five accuracy) by touching *every* transformer layer with under 0.1% extra parameters (Can we control personality in language models without prompting?). Personality is a thin signal smeared across the whole stack, not a module bolted to one corner. And the corpus pushes back on the premise that knowledge sits in a fixed 'place' at all: transformer residual streams seem to transmit knowledge as a continuous *flow* during generation rather than retrieving it from a stored archive, which is why model knowledge is so hard to edit cleanly (Do transformer models store knowledge or generate it continuously?). If knowledge is a process rather than an address, 'separate subspace' is the wrong shape of question for it.
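A back-of-the-envelope check on how small an every-layer adapter can be. The adapter shape here is a hypothetical stand-in, not PsychAdapter's published architecture: one `d_trait × d_model` projection per layer, mapping a trait vector (say, five Big Five scores) to an additive shift of that layer's hidden state. The dimensions are GPT-2 small's (12 layers, hidden size 768, roughly 124M parameters).

```python
def adapter_fraction(n_layers: int, d_model: int, d_trait: int, base_params: int):
    # Hypothetical adapter: one (d_trait x d_model) weight matrix per layer, no bias.
    adapter_params = n_layers * d_trait * d_model
    return adapter_params, adapter_params / base_params

# GPT-2 small dimensions with a 5-dimensional trait vector (approximate base size).
added, frac = adapter_fraction(n_layers=12, d_model=768, d_trait=5, base_params=124_000_000)
print(added, f"{frac:.4%}")  # 46080 0.0372%
```

Even touching all 12 layers, an adapter of this shape stays well under a 0.1% budget, which illustrates how a signal can be present everywhere in the stack while adding almost nothing to it.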
So the honest synthesis: personality behaves like a low-dimensional, steerable subspace you can isolate and cap; knowledge behaves more like a distributed flow you can merge but not cleanly excise. They're separable enough to engineer against, and the deepest hint of how separable comes from a surprising source: most open models *stubbornly keep* their trained ENFJ-like default personality even under direct prompting (Can open language models adopt different personalities through prompting?). That suggests the persona subspace is rigid and self-contained enough to resist the very knowledge (your instructions) you're trying to inject into it. The wall between 'who it is' and 'what you tell it' is real; it's just made of activations, not parameters you can point to.
Sources (6 notes)
Chamain's two-step approach—parameter-wise task vector combination plus layer-wise character fusion—successfully adds knowledge while retaining 80% of task performance and maintaining personality. The method works because persona and knowledge occupy partially separable regions in model parameters.
Research mapping hundreds of character archetypes reveals a low-dimensional persona space where the leading component measures distance from the default Assistant. Emotional and meta-reflective conversations cause predictable drift, but activation capping along this axis mitigates harmful shifts without degrading capabilities.
Research identifies linear directions in LLM activation space corresponding to specific traits like sycophancy and hallucination. These persona vectors predict finetuning-induced personality shifts before they occur and can preventatively steer training to avoid unwanted trait changes.
PsychAdapter modifies every transformer layer with <0.1% additional parameters to achieve 87.3% Big Five accuracy and 96.7% depression/life satisfaction accuracy across GPT-2, Gemma, and Llama 3. This architecture-level approach bypasses prompt resistance entirely.
Transformers organize knowledge as flowing activations rather than retrievable archives, mirroring oral cultures where knowledge exists only in performance. This explains why model knowledge is contextual, difficult to edit, and inseparable from generation.
Research shows most open models fail to adopt prompted personalities, stubbornly retaining their trained ENFJ-like defaults. Only a few flexible models succeed. Combining role and personality conditioning improves results but doesn't fully overcome resistance.