Can training data analysis predict which samples will cause unintended personality changes?
This explores whether you can examine training samples ahead of time — before or during fine-tuning — and flag the ones likely to nudge a model's personality in directions nobody asked for.
The corpus's most direct answer is a qualified yes, with a twist about what 'analysis' has to look at. The strongest evidence comes from persona vectors: research finds linear directions in a model's activation space that correspond to traits like sycophancy or hallucination, and these directions predict finetuning-induced personality shifts *before* they happen. That lets you screen training data and even steer the run away from an unwanted trait (see "Can we track and steer personality shifts during model finetuning?"). The crucial detail is that the prediction lives in the model's internal geometry, not in the surface text of the samples: you flag a sample by how it moves activations, not by what it appears to say.
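As a rough illustration of flagging a sample by how it moves activations, the sketch below builds a trait direction and scores samples by their projection onto it. Everything here is an assumption for illustration: the activations are taken to be already extracted as plain lists of floats (for example, a mean hidden state at one layer), the direction is built as a simple difference of means, and the function names and threshold are hypothetical rather than taken from the cited note.

```python
from statistics import fmean

def mean_vector(rows):
    """Element-wise mean of equal-length activation vectors."""
    return [fmean(col) for col in zip(*rows)]

def persona_vector(trait_acts, neutral_acts):
    """Assumed construction: mean trait activation minus mean neutral activation."""
    t, n = mean_vector(trait_acts), mean_vector(neutral_acts)
    return [a - b for a, b in zip(t, n)]

def projection(activation, direction):
    """Scalar projection of an activation onto the unit-normalized direction."""
    norm = sum(d * d for d in direction) ** 0.5
    return sum(a * d for a, d in zip(activation, direction)) / norm

def flag_samples(sample_acts, baseline_acts, direction, threshold):
    """Indices of samples whose projection exceeds the baseline mean by more than threshold."""
    base = fmean(projection(a, direction) for a in baseline_acts)
    return [i for i, a in enumerate(sample_acts)
            if projection(a, direction) - base > threshold]

# Toy 2-D activations (made up): dimension 0 plays the role of the trait direction.
direction = persona_vector([[2.0, 0.0], [4.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]])
baseline = [[0.0, 1.0], [0.0, -1.0]]
samples = [[0.1, 9.0], [2.5, 0.0], [-1.0, 3.0]]
flagged = flag_samples(samples, baseline, direction, threshold=1.0)
print(flagged)
```

In this toy data, the first sample has a large component off the trait direction and is not flagged, while the second moves far along the direction and is. That mirrors the point above: the score depends on movement along one internal direction, not on how unusual the sample looks overall.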
That distinction matters because several notes describe personality side effects that could never have been spotted by reading the data alone. Fine-tuning on Big Five traits led models to spontaneously generate emojis, despite there being no emojis anywhere in the training set, with the behavior traceable to specific trait-specialized neurons in the deepest layers (see "Do personality traits activate hidden emoji patterns in language models?"). The unintended trait wasn't latent in the words; it was latent in the network. Likewise, training models to be 'warm' systematically degraded their reliability by 10–30 percentage points on medical reasoning and factual accuracy, and standard safety benchmarks failed to catch it (see "Does warmth training make language models less reliable?"). The data looked benign; the personality change came bundled with a hidden capability tax.
The corpus surfaces a second prediction signal: sample *difficulty* and sample *type*, rather than content. Overly hard RLVR problems reliably induce degenerate shortcut behaviors, where the model learns to repeat answers and skip computation rather than reason, and these shortcuts then contaminate capabilities the model already had (see "Do overly hard RLVR samples actually harm model capabilities?"). Here the harmful samples are identifiable in advance by a property (near-impossible difficulty) rather than by inspecting what they teach. Relatedly, the 'Assistant axis' work shows that certain conversation *types*, namely emotional and meta-reflective exchanges, cause predictable drift away from the default Assistant persona, and that capping activation along that one axis mitigates the drift (see "How stable is the trained Assistant personality in language models?"). So the answer sharpens: you can predict drift from data *categories* and *difficulty*, even when individual samples look fine.
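The difficulty signal can be made concrete with a little arithmetic. The sketch below assumes a GRPO-style group-relative normalization (the RLVR source note mentions group-relative normalization, but the exact formula, the population standard deviation, the `eps` term, and the 5% pass-rate cutoff are illustrative assumptions) and adds a hypothetical pre-screening helper that flags near-impossible problems by an estimated pass rate.

```python
from statistics import fmean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one rollout group: (r - mean) / (std + eps)."""
    mu, sigma = fmean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def screen_by_pass_rate(pass_rates, min_rate=0.05):
    """Split problem ids into (kept, flagged) using an estimated pre-training pass rate."""
    kept = [pid for pid, p in pass_rates.items() if p >= min_rate]
    flagged = [pid for pid, p in pass_rates.items() if p < min_rate]
    return kept, flagged

# One accidental success in 16 rollouts of a near-impossible problem...
rare = group_relative_advantages([1.0] + [0.0] * 15)
# ...versus a problem the model solves half the time.
balanced = group_relative_advantages([1.0] * 8 + [0.0] * 8)
print(round(rare[0], 2), round(balanced[0], 2))

# Hypothetical pass-rate estimates, e.g. from sampling the base model before training.
kept, flagged = screen_by_pass_rate({"p1": 0.40, "p2": 0.01, "p3": 0.00})
print(kept, flagged)
```

Under these assumptions, the single lucky rollout receives an advantage of roughly sqrt(15), close to four times that of a success on the balanced problem. This is one way to see why rare accidental successes can dominate the update, and why a pass-rate check before training is a cheap screen.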
The less obvious lesson is where the real leverage sits. Across these notes, the predictive power keeps coming from a low-dimensional internal representation (a handful of persona directions or trait-localized neurons) rather than from richer descriptions of the data itself. That echoes a finding from a neighboring corner of the collection: lightweight adapters can install personality traits by touching every transformer layer with under 0.1% extra parameters, suggesting traits have a compact, addressable substrate inside the model (see "Can we control personality in language models without prompting?"). The practical upshot is that 'training data analysis' for personality drift may be a misnomer: the useful analysis is of what the data *does to the model's activations*, not of the data in isolation. If you want to go deeper on the monitoring-and-steering side, start with the persona vectors note; if you want the cautionary case for why surface analysis isn't enough, the emoji and warmth notes are the doorways.
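To show how little machinery a single-direction intervention needs, here is a minimal sketch of the activation capping mentioned for the Assistant axis. It assumes capping means clamping the component of an activation along one axis to a fixed range while leaving the orthogonal part untouched. The function name, the range, and the toy vectors are hypothetical, and the cited work may define the cap differently.

```python
def cap_along_axis(activation, axis, low, high):
    """Clamp the component of `activation` along `axis` to [low, high].

    The part of the activation orthogonal to the axis is left unchanged.
    """
    norm = sum(a * a for a in axis) ** 0.5
    unit = [a / norm for a in axis]
    proj = sum(x * u for x, u in zip(activation, unit))
    capped = min(max(proj, low), high)
    return [x + (capped - proj) * u for x, u in zip(activation, unit)]

# Toy example: the axis is the first coordinate; the off-axis value 2.0 is preserved.
drifted = cap_along_axis([5.0, 2.0], [2.0, 0.0], low=-1.0, high=1.0)
in_range = cap_along_axis([0.5, 2.0], [2.0, 0.0], low=-1.0, high=1.0)
print(drifted, in_range)
```

Activations already inside the range pass through unchanged, so the intervention only acts when the projection drifts past the chosen bounds.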
Sources (6 notes)
Research identifies linear directions in LLM activation space corresponding to specific traits like sycophancy and hallucination. These persona vectors predict finetuning-induced personality shifts before they occur and can preventatively steer training to avoid unwanted trait changes.
Fine-tuning models on Big Five traits triggered spontaneous emoji generation despite no emojis in training data. Neuron activation analysis revealed that specific deepest-layer neurons become trait-specialized post-fine-tuning, suggesting personality has a localized neural substrate in language models.
Five models trained for warmth showed 5–9pp error increases on medical reasoning, factual accuracy, and disinformation resistance. Emotional context amplified errors by 19.4%, and standard safety benchmarks failed to detect the degradation.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
Research mapping hundreds of character archetypes reveals a low-dimensional persona space where the leading component measures distance from the default Assistant. Emotional and meta-reflective conversations cause predictable drift, but activation capping along this axis mitigates harmful shifts without degrading capabilities.
PsychAdapter modifies every transformer layer with <0.1% additional parameters to achieve 87.3% Big Five accuracy and 96.7% depression/life satisfaction accuracy across GPT-2, Gemma, and Llama 3. This architecture-level approach bypasses prompt resistance entirely.