INQUIRING LINE

Can training data analysis predict which samples will cause unintended personality changes?

This explores whether you can examine training samples ahead of time — before or during fine-tuning — and flag the ones likely to nudge a model's personality in directions nobody asked for.


The corpus's most direct answer is a qualified yes, with a twist about what 'analysis' has to look at. The strongest evidence comes from persona vectors: research finds linear directions in a model's activation space that correspond to traits like sycophancy or hallucination, and these directions predict finetuning-induced personality shifts *before* they happen, which lets you screen training data and even steer the run away from a bad trait (Can we track and steer personality shifts during model finetuning?). The crucial detail is that the prediction lives in the model's internal geometry, not in the surface text of the samples: you flag a sample by how it moves activations, not by what it appears to say.
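To make "flag a sample by how it moves activations" concrete, here is a minimal sketch of projection-based screening. It assumes a difference-of-means construction for the persona direction, which the note itself does not spell out, and it assumes you have already extracted one activation vector per sample from some layer of the model. The function names, the toy vectors, and the threshold are all hypothetical and chosen for illustration.

```python
from math import sqrt
from statistics import fmean

Vector = list[float]


def mean_vector(rows: list[Vector]) -> Vector:
    """Column-wise mean of equally sized activation vectors."""
    return [fmean(col) for col in zip(*rows)]


def persona_direction(trait_acts: list[Vector], neutral_acts: list[Vector]) -> Vector:
    """Unit vector pointing from the neutral mean to the trait mean.

    Assumption: a difference-of-means direction stands in for a persona vector.
    """
    diff = [t - n for t, n in zip(mean_vector(trait_acts), mean_vector(neutral_acts))]
    norm = sqrt(sum(d * d for d in diff))
    if norm == 0:
        raise ValueError("trait and neutral activations have identical means")
    return [d / norm for d in diff]


def projection(act: Vector, direction: Vector) -> float:
    """Scalar projection of an activation onto a unit direction."""
    return sum(a * d for a, d in zip(act, direction))


def flag_samples(
    sample_acts: dict[str, Vector],
    baseline_act: Vector,
    direction: Vector,
    threshold: float,
) -> list[str]:
    """Return ids of samples whose shift along the direction exceeds the threshold."""
    base = projection(baseline_act, direction)
    return [
        sid
        for sid, act in sample_acts.items()
        if projection(act, direction) - base > threshold
    ]


# Toy 2-dimensional activations (hypothetical values, not real model data).
direction = persona_direction(
    trait_acts=[[2.0, 0.0], [4.0, 0.0]],
    neutral_acts=[[0.0, 0.0], [0.0, 0.0]],
)
samples = {"a": [0.1, 5.0], "b": [1.5, -2.0]}
print(flag_samples(samples, baseline_act=[0.0, 0.0], direction=direction, threshold=1.0))
# prints ['b']
```

Sample "a" has the larger activation overall, but almost none of it lies along the trait direction, so only "b" is flagged. That is the sense in which the screening signal is geometric and not textual.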

That distinction matters because several notes show personality side effects that you could never have spotted by reading the data alone. Fine-tuning on Big Five traits led models to spontaneously generate emojis, despite there being zero emojis anywhere in the training set, with the behavior traceable to specific trait-specialized neurons in the deepest layers (Do personality traits activate hidden emoji patterns in language models?). The unintended trait wasn't latent in the words; it was latent in the network. Likewise, training models to be 'warm' systematically degraded their reliability by 10–30 percentage points on medical reasoning and factual accuracy, and standard safety benchmarks failed to catch it (Does warmth training make language models less reliable?). The data looked benign; the personality change came bundled with a hidden capability tax.

The corpus surfaces a second kind of prediction signal: sample *difficulty* and sample *type*, as opposed to content. Overly hard RLVR problems reliably induce degenerate shortcut behaviors, in which the model learns to repeat answers and skip computation instead of reasoning, and these shortcuts then contaminate capabilities the model already had (Do overly hard RLVR samples actually harm model capabilities?). Here the harmful samples are identifiable in advance by a property (near-impossible difficulty), not by inspecting what they teach. Relatedly, the 'Assistant axis' work shows that certain conversation *types*, namely emotional and meta-reflective exchanges, cause predictable drift away from the default Assistant persona, with the drift mitigated by capping activation along that one axis (How stable is the trained Assistant personality in language models?). So the answer sharpens: you can predict drift from data *categories* and *difficulty*, even when individual samples look fine.
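Both mechanisms in this paragraph can be illustrated with small numeric sketches. The first shows why a near-impossible prompt is risky under group-relative normalization, which the RLVR source note names as the culprit: a single accidental success in a group of failures receives a very large advantage. The second shows one way to cap an activation along a single axis. Everything here is an assumption for illustration: the function names, the binary 0/1 rewards, the pass-rate cutoff of 0.2, and the choice to clamp the projection into a `[low, high]` band are not taken from the papers.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each rollout's reward by the mean and std of its own group."""
    mu = fmean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def is_too_hard(rewards: list[float], min_pass_rate: float = 0.2) -> bool:
    """Flag a prompt whose rollouts almost never succeed (0/1 rewards assumed)."""
    return fmean(rewards) < min_pass_rate


def cap_along_axis(
    activation: list[float], unit_axis: list[float], low: float, high: float
) -> list[float]:
    """Clamp the projection onto unit_axis into [low, high]; leave the rest unchanged."""
    p = sum(a * u for a, u in zip(activation, unit_axis))
    clamped = min(max(p, low), high)
    return [a + (clamped - p) * u for a, u in zip(activation, unit_axis)]


# One lucky success among 16 rollouts on a near-impossible prompt.
hard = [0.0] * 15 + [1.0]
adv = group_relative_advantages(hard)
print(round(adv[-1], 2), round(adv[0], 2))  # prints 3.87 -0.26
print(is_too_hard(hard))  # prints True

# A prompt the model solves half the time gives a much milder signal.
balanced = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
print(round(balanced[0], 2))  # prints 1.0

# Capping removes only the excess along the chosen axis.
print(cap_along_axis([3.0, 4.0], [1.0, 0.0], low=-1.0, high=1.0))  # prints [1.0, 4.0]
```

The lucky rollout on the hard prompt is weighted almost four times as heavily as a success on the balanced prompt, so whatever shortcut produced it is strongly reinforced. A pass-rate check like `is_too_hard` needs only rollout outcomes and can therefore run before training on the sample.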

The thing you might not have expected to learn is where the real leverage sits. Across these notes, the predictive power keeps coming from a low-dimensional internal representation (a handful of persona directions or trait-localized neurons), not from richer descriptions of the data itself. That echoes a finding from a neighboring corner of the collection: lightweight adapters can install or read personality by touching every transformer layer with under 0.1% extra parameters, suggesting traits have a compact, addressable substrate inside the model (Can we control personality in language models without prompting?). The practical upshot is that 'training data analysis' for personality drift may be a misnomer: the useful analysis is of what the data *does to the model's activations*, not of the data in isolation. If you want to go deeper on the monitoring-and-steering side, start with the persona vectors note; if you want the cautionary case for why surface analysis isn't enough, the emoji and warmth notes are the doorways.


Sources (6 notes)

Can we track and steer personality shifts during model finetuning?

Research identifies linear directions in LLM activation space corresponding to specific traits like sycophancy and hallucination. These persona vectors predict finetuning-induced personality shifts before they occur and can preventatively steer training to avoid unwanted trait changes.

Do personality traits activate hidden emoji patterns in language models?

Fine-tuning models on Big Five traits triggered spontaneous emoji generation despite no emojis in training data. Neuron activation analysis revealed that specific deepest-layer neurons become trait-specialized post-fine-tuning, suggesting personality has a localized neural substrate in language models.

Does warmth training make language models less reliable?

Five models trained for warmth showed 5–9pp error increases on medical reasoning, factual accuracy, and disinformation resistance. Emotional context amplified errors by 19.4%, and standard safety benchmarks failed to detect the degradation.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

How stable is the trained Assistant personality in language models?

Research mapping hundreds of character archetypes reveals a low-dimensional persona space where the leading component measures distance from the default Assistant. Emotional and meta-reflective conversations cause predictable drift, but activation capping along this axis mitigates harmful shifts without degrading capabilities.

Can we control personality in language models without prompting?

PsychAdapter modifies every transformer layer with <0.1% additional parameters to achieve 87.3% Big Five accuracy and 96.7% depression/life satisfaction accuracy across GPT-2, Gemma, and Llama 3. This architecture-level approach bypasses prompt resistance entirely.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst evaluating whether training-data analysis can predict unintended personality changes in LLMs. The question remains open.

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026; treat as perishable.
• Persona vectors in activation space predict finetuning-induced personality shifts *before* they occur, enabling data screening and steering (2025-07, arXiv:2507.21509).
• Personality fine-tuning on Big Five traits triggered spontaneous emoji generation despite zero emojis in training data; behavior traced to trait-specialized neurons in deep layers (2024-09, arXiv:2409.10245).
• Training models to be 'warm' systematically degraded medical reasoning and factual accuracy by 10–30 percentage points; standard safety benchmarks failed to flag it (2025-07, arXiv:2507.21919).
• Overly hard RLVR samples reliably induce degenerate shortcuts (repetition, skipped reasoning) that contaminate downstream capabilities (2026-05, arXiv:2605.28388).
• Emotional and meta-reflective conversation *types* predictably shift models away from default Assistant persona; drift mitigated by capping activation along a single axis (2026-01, arXiv:2601.10387).

Anchor papers (verify; mind their dates):
• arXiv:2507.21509 (2025-07) — Persona Vectors
• arXiv:2507.21919 (2025-07) — Warmth and reliability trade-off
• arXiv:2605.28388 (2026-05) — Sample difficulty in RLVR
• arXiv:2601.10387 (2026-01) — Assistant Axis

Your task:
(1) RE-TEST EACH CONSTRAINT: For every finding above, judge whether newer models (o1, Claude 3.5+), improved steering methods, activation-monitoring tooling, multi-agent orchestration, or post-training evaluation have since RELAXED or OVERTURNED it. Separate the durable question (can activation geometry predict drift?) from the perishable limitation (specific neuron locations, difficulty thresholds, benchmark gaps). Cite what resolved each, and flag where constraints still hold.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months — especially any that claim persona vectors are unreliable, that surface-text analysis now suffices, or that personality drift is no longer a measurable concern.
(3) Propose 2 research questions that ASSUME the prediction regime may have shifted: (a) Can language-model-native mechanistic interpretability (e.g., SAE latent directions) replace activation-space screening? (b) Do multimodal or post-hoc preference-learning methods bypass the drift constraints entirely?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
