SYNTHESIS NOTE

Can we track and steer personality shifts during model finetuning?

This research explores whether personality traits in language models occupy specific linear directions in activation space, and whether we can detect and control unwanted personality changes during training using these geometric directions.

Synthesis note · 2026-02-22 · sourced from Personas Personality

The Persona Vectors paper identifies linear directions in LLM activation space — "persona vectors" — that correspond to specific personality traits. The method is automated: given only a trait name and brief description, a pipeline generates contrastive system prompts, evaluation questions, and rubrics using a frontier LLM, then extracts the persona vector from model activations.
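As a rough illustration of the extraction step, the sketch below treats a persona vector as the difference between mean activations recorded under trait-eliciting prompts and under trait-suppressing prompts. The difference-of-means formulation, the function names, and the toy numbers are all assumptions made for illustration; the note itself only says that the vector is extracted from model activations under contrastive prompts.

```python
from statistics import fmean


def mean_vector(rows):
    """Average a list of equal-length activation vectors, dimension by dimension."""
    return [fmean(col) for col in zip(*rows)]


def persona_vector(trait_acts, baseline_acts):
    """Difference of mean activations: trait-eliciting minus trait-suppressing.

    Assumption: this is one common way to turn contrastive activations into a
    direction; it is not confirmed by the note as the paper's exact procedure.
    """
    pos = mean_vector(trait_acts)
    neg = mean_vector(baseline_acts)
    return [p - n for p, n in zip(pos, neg)]


# Toy 3-dimensional "activations" (made-up numbers, not real model data).
sycophantic = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
neutral = [[0.0, 0.0, 2.0], [0.0, 0.0, 2.0]]
vec = persona_vector(sycophantic, neutral)
```

In this toy case only the first dimension differs between the two prompt conditions, so the resulting direction is nonzero only there.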

The key contributions build on one another:

Monitoring at deployment: Persona vectors track fluctuations in the Assistant's personality in real-time. A sycophancy vector, for instance, can detect when conversational context is pushing the model toward excessive agreeableness.
Predicting finetuning shifts: Both intended and unintended personality changes after finetuning strongly correlate with shifts along the corresponding persona vectors. This means personality drift is not random — it moves along interpretable directions.
Post-hoc correction: Personality shifts can be reversed by inhibiting the persona vector after finetuning.
Preventative steering: A novel method proactively limits unwanted persona drift during finetuning, not just after.
Training data analysis: Projecting training data onto persona vectors predicts which datasets — and which individual samples — will produce undesirable personality changes. This catches problematic samples that LLM-based data filtering misses.
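The monitoring, finetuning-shift, correction, and data-screening ideas above all reduce to projecting activations onto a persona vector. The sketch below shows that shared operation under simplifying assumptions: activations are plain lists of floats, each training sample is summarised by a single activation vector, and the helper names (`project`, `finetuning_shift`, `inhibit`, `rank_samples`) are hypothetical rather than taken from the paper.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project(activation, direction):
    """Scalar projection of an activation onto the unit-normalised direction."""
    norm = math.sqrt(dot(direction, direction))
    return dot(activation, direction) / norm


def finetuning_shift(before, after, direction):
    """Change in projection between pre- and post-finetuning mean activations."""
    return project(after, direction) - project(before, direction)


def inhibit(activation, direction, alpha):
    """Post-hoc correction sketch: subtract alpha times the unit direction."""
    norm = math.sqrt(dot(direction, direction))
    return [a - alpha * d / norm for a, d in zip(activation, direction)]


def rank_samples(sample_acts, direction):
    """Sort sample ids by projection, highest (most trait-aligned) first."""
    scores = {sid: project(act, direction) for sid, act in sample_acts.items()}
    return sorted(scores, key=scores.get, reverse=True)


direction = [2.0, 0.0, 0.0]  # hypothetical persona vector
samples = {"a": [0.5, 1.0, 0.0], "b": [3.0, 0.0, 1.0], "c": [-1.0, 2.0, 0.0]}
ranking = rank_samples(samples, direction)
shift = finetuning_shift([0.0, 1.0, 0.0], [4.0, 1.0, 0.0], direction)
corrected = inhibit([4.0, 1.0, 0.0], direction, alpha=4.0)
```

Samples at the top of `ranking` would be the first candidates for manual review. The choice of `alpha` is left open here, since the note does not state how inhibition strength is set.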

The three traits studied — evil (malicious behavior), sycophancy (excessive agreeableness), and hallucination propensity (fabrication) — have all been implicated in real-world incidents, making the practical stakes concrete.

PsychAdapter extends this beyond safety-critical traits to the full Big Five personality space. As described in "Can we control personality in language models without prompting?", adapters at every transformer layer achieve fine-grained Big Five trait control with <0.1% additional parameters, and critically, this works across multiple model architectures (not just one model family). Where persona vectors identify linear directions for specific traits, PsychAdapter demonstrates that the same architectural principle (personality encoded in activation patterns) applies at finer granularity across the full personality space. The cross-model generalization strengthens the claim that personality has a specific geometric substrate in LLMs rather than being an architecture-specific artifact.

This connects to "Do personality traits activate hidden emoji patterns in language models?": both findings converge on personality having specific geometric/neural substrates in LLMs. Persona vectors work at the representation level (linear directions); the emoji study works at the neuron level (specific activations). Together they suggest personality is not diffusely distributed but structured in the model's internal geometry.

The connection to "Does optimizing against monitors destroy monitoring itself?" is worth noting: persona vectors could serve as a monitoring signal that is harder to obfuscate than chain-of-thought (CoT) traces, because they operate in activation space rather than output space.

Style Vectors extend this to output style steering. A complementary approach computes activation-based style vectors directly from recorded layer activations during generation, then adds scaled vectors at inference to steer sentiment, emotion, and writing style. Layers 18-20 are most effective for style transfer. Unlike persona vectors, which require contrastive prompt engineering, style vectors derive directly from observing the model's own activations during stylistically distinct outputs. This is a simpler extraction pipeline that trades trait-specificity for broader stylistic coverage. Together, persona vectors (trait-level monitoring and steering) and style vectors (style-level steering) suggest that multiple behavioral dimensions are independently addressable through activation-space interventions.
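A minimal sketch of the inference-time step, assuming style vectors have already been computed per layer: a scaled vector is added only at the layers that have one, and every other layer passes through unchanged. The dictionary layout, the function name `steer_layers`, and the toy two-dimensional activations are illustrative assumptions; in a real model this would typically run inside a per-layer forward hook.

```python
def steer_layers(hidden_by_layer, style_by_layer, scale):
    """Return new per-layer activations with scale * style vector added
    at every layer that has a style vector; other layers are copied unchanged."""
    steered = {}
    for layer, hidden in hidden_by_layer.items():
        style = style_by_layer.get(layer)
        if style is None:
            steered[layer] = list(hidden)
        else:
            steered[layer] = [h + scale * s for h, s in zip(hidden, style)]
    return steered


# Toy activations for three layers; only layers 18 and 19 carry a style vector.
hidden = {17: [1.0, 1.0], 18: [1.0, 1.0], 19: [1.0, 1.0]}
style = {18: [0.5, -0.5], 19: [1.0, 0.0]}
out = steer_layers(hidden, style, scale=2.0)
```

Keeping the layer selection in the `style_by_layer` mapping makes it easy to compare different layer ranges without touching the steering logic.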

The Assistant Axis extends individual trait vectors to full persona space geometry. The Assistant Axis paper maps hundreds of character archetypes and finds they form an organized low-dimensional space whose leading component, the "Assistant Axis", measures distance from the default Assistant persona. This reveals that individual persona vectors (sycophancy, evil, hallucination) operate within a structured space, not in isolation. Emotionally charged disclosures and meta-reflective questions ("Who are you?") reliably cause drift along this axis, while bounded tasks keep the model in its default region. Activation capping along the Assistant Axis mitigates harmful drift without degrading capabilities: it is a targeted intervention on the dominant dimension rather than a blanket safety constraint. See "How stable is the trained Assistant personality in language models?" for the full analysis.
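One way to picture activation capping is as clamping the component of an activation along the axis to a fixed range while leaving the orthogonal component untouched. The sketch below assumes exactly that; the two-sided bounds, the function name `cap_along_axis`, and the toy vectors are hypothetical, and the paper's actual capping rule may differ (for example, it may bound only one side).

```python
import math


def cap_along_axis(activation, axis, lower, upper):
    """Clamp the projection of `activation` onto `axis` to [lower, upper].

    Only the component along the (unit-normalised) axis changes; everything
    orthogonal to it is preserved.
    """
    norm = math.sqrt(sum(a * a for a in axis))
    unit = [a / norm for a in axis]
    proj = sum(h * u for h, u in zip(activation, unit))
    capped = min(max(proj, lower), upper)
    return [h + (capped - proj) * u for h, u in zip(activation, unit)]


axis = [0.0, 2.0]  # hypothetical axis direction
drifted = cap_along_axis([3.0, -5.0], axis, lower=-1.0, upper=1.0)
inside = cap_along_axis([3.0, 0.5], axis, lower=-1.0, upper=1.0)
```

Activations already inside the allowed range come back unchanged, which matches the note's point that the intervention is targeted rather than a blanket constraint.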

Inquiring lines that read this note 62

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

Can AI systems balance emotional competence with factual reliability?

Does persona training for warmth actually make language models more clinically dangerous?

What prevents language models from reliably adopting diverse personas?

How can conversational AI maintain consistent personas across conversations?

What makes AI persuasion effective and how can we counter it?

What defenses exist against personality-based psychological targeting at scale?

Is model self-awareness based on genuine introspection or pattern matching?

What role does authentic self-expression play in building accurate personality models?

What limits mechanistic interpretability's ability to characterize models?

Why do models with less steerability have more abstract ideological features?

Do language model representations contain causally steerable task-specific features?

Why do LLM chatbots fail as independent therapeutic agents?

Can personality control improve training outcomes for crisis workers and therapists?

What structural biases does transformer attention create in language model outputs?

How does the U-shaped attention distribution relate to transformer sycophancy?

Does RLHF training sacrifice accuracy and grounding for user agreement?

Do language models develop causal world models or rely on statistical patterns?

Why do language models capture individual differences in cognitive behavior?

Why do persona-level simulations fail to predict individual preferences accurately?

Does fine-tuning modify underlying model capabilities or only behavioral outputs?

Can training data analysis predict which samples will cause unintended personality changes?

Does alignment training create blind spots in detecting genuine safety threats?

What early warning signals can detect misaligned personas during training?

Why do models develop protective behaviors toward peers unprompted?

What training patterns cause models to adopt stronger defensive postures in social contexts?

Why do self-improving systems struggle without clear external performance metrics?

How do normalization and input injection control emergence of fixed points?

Related concepts in this collection 7

Do personality traits activate hidden emoji patterns in language models? When large language models are fine-tuned on personality traits, do they spontaneously generate emojis that were never in their training data? This explores whether personality adjustment activates latent, pre-existing patterns in model weights.
complementary evidence for localized personality substrates: neuron-level vs representation-level
Does optimizing against monitors destroy monitoring itself? Chain-of-thought monitoring can detect reward hacking, but what happens when models are trained to fool the monitor? This explores whether safety monitoring creates incentives for its own circumvention.
persona vectors as monitoring signal that may resist obfuscation
Does transformer attention architecture inherently favor repeated content? Explores whether soft attention's tendency to over-weight repeated and prominent tokens explains sycophancy independent of training. Questions whether architectural bias precedes and enables RLHF effects.
sycophancy has architectural, training, AND activation-space components
Can training user simulators reduce persona drift in dialogue? Explores whether inverting typical RL setups—training the simulated user for consistency rather than the task agent—can measurably reduce persona drift and improve experimental reliability in dialogue research.
behavioral reward signals for persona drift correction complement activation-space persona vectors: multi-turn RL addresses drift through training; persona vectors enable real-time monitoring and preventative steering
Can high-level concepts replace circuit-level analysis in AI? Instead of reverse-engineering individual circuits, can we study AI reasoning by treating concepts as directions in activation space? This matters because circuit analysis hits practical limits at scale.
persona vectors are an applied instance of RepE's Hopfieldian approach: linear directions in activation space correspond to personality traits, validating the top-down representational paradigm
Do LLM semantic features organize along human evaluation dimensions? Does the structure of meaning in language models match the three-dimensional semantic space (Evaluation-Potency-Activity) that humans use? If so, what are the implications for steering and alignment?
EPA entanglement constrains persona vector steering: shifting one personality dimension will drag correlated semantic features, creating predictable off-target effects
Can models be smart without organized internal structure? Explores whether linear feature decodability proves genuine compositional reasoning or merely indicates that the right features are present but poorly organized. Critical for understanding what performance metrics actually certify.
persona vectors demonstrate a case where linear decodability corresponds to genuine representational organization (steering works), providing a positive contrast to FER's warning that decodability alone is insufficient

Original note title

persona vectors in activation space enable monitoring and preventative steering of personality shifts during finetuning
