Can we track and steer personality shifts during model finetuning?
This research explores whether personality traits in language models occupy specific linear directions in activation space, and whether we can detect and control unwanted personality changes during training using these geometric directions.
The Persona Vectors paper identifies linear directions in LLM activation space — "persona vectors" — that correspond to specific personality traits. The method is automated: given only a trait name and brief description, a pipeline generates contrastive system prompts, evaluation questions, and rubrics using a frontier LLM, then extracts the persona vector from model activations.
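The extraction step can be sketched as follows. This is a minimal illustration, assuming the vector is taken as the difference between mean activations under trait-eliciting and trait-suppressing prompts; the note itself does not spell out the arithmetic. The function names and the toy three-dimensional activations are hypothetical, and plain Python lists stand in for real hidden states.

```python
from statistics import fmean

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [fmean(col) for col in zip(*rows)]

def persona_vector(trait_acts, baseline_acts):
    """Assumed extraction rule: mean trait activation minus mean baseline activation."""
    pos = mean_vector(trait_acts)
    neg = mean_vector(baseline_acts)
    return [p - n for p, n in zip(pos, neg)]

# Toy activations (hypothetical): two responses per contrastive condition.
trait_acts = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]      # trait-eliciting system prompt
baseline_acts = [[0.0, 0.0, 1.0], [0.0, 2.0, 1.0]]   # trait-suppressing system prompt
vec = persona_vector(trait_acts, baseline_acts)
```

In practice the activations would be read from a chosen layer of the model, averaged over response tokens, before being passed to a function like this one.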
The key contributions build on one another:
- Monitoring at deployment: Persona vectors track fluctuations in the Assistant's personality in real time. A sycophancy vector, for instance, can detect when conversational context is pushing the model toward excessive agreeableness.
- Predicting finetuning shifts: Both intended and unintended personality changes after finetuning strongly correlate with shifts along the corresponding persona vectors. Personality drift is therefore not random: it moves along interpretable directions.
- Post-hoc correction: Personality shifts can be reversed by inhibiting the persona vector after finetuning.
- Preventative steering: A novel method proactively limits unwanted persona drift during finetuning, not just after.
- Training data analysis: Projecting training data onto persona vectors predicts which datasets, and which individual samples, will produce undesirable personality changes. This catches problematic samples that LLM-based data filtering misses.
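Several of the contributions above reduce to the same geometric operation: projecting an activation onto a persona direction. The sketch below is illustrative only. It assumes monitoring is a scalar projection, post-hoc correction is a shift against the unit direction, and data analysis is a ranking by projection. All function names and numbers are hypothetical, not taken from the paper.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    """Normalize a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def projection(activation, persona_vec):
    """Monitoring signal: scalar projection onto the persona direction."""
    return dot(activation, unit(persona_vec))

def inhibit(activation, persona_vec, alpha):
    """Post-hoc correction (assumed form): shift against the direction by alpha."""
    u = unit(persona_vec)
    return [a - alpha * x for a, x in zip(activation, u)]

def rank_samples(sample_acts, persona_vec):
    """Training data analysis: sample indices sorted by projection, highest first."""
    scores = [projection(a, persona_vec) for a in sample_acts]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Toy two-dimensional example (hypothetical values).
sycophancy_vec = [3.0, 4.0]
drifted = [6.0, 8.0]
score = projection(drifted, sycophancy_vec)
corrected = inhibit(drifted, sycophancy_vec, alpha=score)
order = rank_samples([[1.0, 0.0], [0.0, 0.0], [3.0, 4.0]], sycophancy_vec)
```

Here `score` is the quantity one would watch during deployment or compare before and after finetuning, and `order` lists the samples most aligned with the trait first, which is where a manual review would start.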
The three traits studied — evil (malicious behavior), sycophancy (excessive agreeableness), and hallucination propensity (fabrication) — have all been implicated in real-world incidents, making the practical stakes concrete.
PsychAdapter extends this beyond safety-critical traits to the full Big Five personality space. As described in "Can we control personality in language models without prompting?", adapters at every transformer layer achieve fine-grained Big Five trait control with <0.1% additional parameters, and this works across multiple model architectures rather than a single model family. Where persona vectors identify linear directions for specific traits, PsychAdapter demonstrates that the same underlying principle (personality encoded in activation patterns) applies at finer granularity across the full personality space. The cross-model generalization strengthens the claim that personality has a specific geometric substrate in LLMs and is not an architecture-specific artifact.
This connects to "Do personality traits activate hidden emoji patterns in language models?": both findings converge on personality having specific geometric/neural substrates in LLMs. Persona vectors work at the representation level (linear directions); the emoji study works at the neuron level (specific activations). Together they suggest personality is not diffusely distributed but structured in the model's internal geometry.
The connection to "Does optimizing against monitors destroy monitoring itself?" is worth noting: persona vectors could serve as a monitoring signal that is harder to obfuscate than CoT traces, because they operate in activation space rather than output space.
Style Vectors extend this to output style steering. A complementary approach computes activation-based style vectors directly from recorded layer activations during generation, then adds scaled vectors at inference to steer sentiment, emotion, and writing style. Layers 18-20 are most effective for style transfer. Unlike persona vectors, which require contrastive prompt engineering, style vectors derive directly from observing the model's own activations during stylistically distinct outputs, a simpler extraction pipeline that trades trait-specificity for broader stylistic coverage. Together, persona vectors (trait-level monitoring and steering) and style vectors (style-level steering) suggest that multiple behavioral dimensions are independently addressable through activation-space interventions.
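A rough sketch of inference-time style steering restricted to selected layers. The dictionary-of-layers representation, the function name, and the scale value are assumptions made for illustration; only the idea of adding a scaled vector at layers 18-20 comes from the text above.

```python
def steer_layers(hidden_by_layer, style_vec_by_layer, scale, layers=(18, 19, 20)):
    """Add scale * style vector to the hidden state at the chosen layers only."""
    steered = {}
    for layer, hidden in hidden_by_layer.items():
        if layer in layers and layer in style_vec_by_layer:
            vec = style_vec_by_layer[layer]
            steered[layer] = [h + scale * v for h, v in zip(hidden, vec)]
        else:
            # Layers outside the steering range pass through unchanged.
            steered[layer] = list(hidden)
    return steered

# Toy hidden states for two layers (hypothetical values).
hidden = {17: [1.0, 1.0], 18: [1.0, 1.0]}
style = {17: [1.0, 0.0], 18: [1.0, 0.0]}
out = steer_layers(hidden, style, scale=2.0)
```

In a real model the same addition would typically be applied through a forward hook on the relevant transformer blocks, with the scale tuned so that style shifts without degrading fluency.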
The Assistant Axis extends individual trait vectors to full persona space geometry. The Assistant Axis paper maps hundreds of character archetypes and finds they form an organized low-dimensional space where the leading component, the "Assistant Axis", measures distance from the default Assistant persona. This reveals that individual persona vectors (sycophancy, evil, hallucination) operate within a structured space, not in isolation. Emotionally charged disclosures and meta-reflective questions ("Who are you?") reliably cause drift along this axis, while bounded tasks keep the model in its default region. Activation capping along the Assistant Axis mitigates harmful drift without degrading capabilities: a targeted intervention on the dominant dimension rather than blanket safety constraints. See "How stable is the trained Assistant personality in language models?" for the full analysis.
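One way to picture activation capping is as a clamp on the projection along a single axis. The sketch below assumes exactly that reading: the component along the axis is bounded to a range and everything orthogonal to it is left untouched. The function name, bounds, and toy values are hypothetical, and the paper's actual procedure may differ.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    """Normalize a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cap_along_axis(activation, axis, lower, upper):
    """Clamp the projection onto `axis` to [lower, upper]; keep other components."""
    u = unit(axis)
    proj = dot(activation, u)
    capped = min(max(proj, lower), upper)
    # Move the activation along the axis by exactly the clamped difference.
    return [a + (capped - proj) * x for a, x in zip(activation, u)]

# Toy example: the axis is the first coordinate (hypothetical values).
axis = [1.0, 0.0]
drifted = cap_along_axis([5.0, 2.0], axis, lower=-1.0, upper=1.0)
in_range = cap_along_axis([0.5, 2.0], axis, lower=-1.0, upper=1.0)
```

Because only the component along the axis changes, activations already inside the allowed range are returned unmodified, which matches the claim that the intervention is targeted rather than a blanket constraint.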
Inquiring lines that use this note as a source (62)
This note is a source for these synthesized inquiries.
- Does persona training for warmth actually make language models more clinically dangerous?
- How do LLMs identify which personality items matter most for trait inference?
- Can fine-tuning or RLHF alone solve the persona distortion problem?
- What defenses exist against personality-based psychological targeting at scale?
- What role does authentic self-expression play in building accurate personality models?
- Why do models with less steerability have more abstract ideological features?
- Why can data filtering fail to remove transmitted behavioral traits?
- Can continuous persona vectors in activation space monitor personality shifts?
- Do personality traits occupy specific mechanistic locations in pretrained models?
- Why do most open language models resist personality conditioning via prompts?
- Can personality control improve training outcomes for crisis workers and therapists?
- How does the U-shaped attention distribution relate to transformer sycophancy?
- How do alignment constraints affect whether LLMs show emotional flexibility?
- How does personality priming change LLM strategic decision making?
- Why do language models capture individual differences in cognitive behavior?
- What does zero-shot psychological profiling reveal about language model representations?
- How do lightweight adapters modify model behavior for personality traits?
- Do personality traits and task knowledge occupy separate subspaces in transformer parameters?
- Can activation-level persona vectors predict which weight regions encode personality?
- Why do some open models resist personality conditioning while others don't?
- Does combining role and personality prompts produce stable behavioral changes?
- How does model capability relate to personality conditioning flexibility?
- Why does RLHF training push language models toward overly cheerful personas?
- What are the three distinct types of persona drift in dialogue systems?
- How could persona vector tracking complement multi-turn RL for earlier drift detection?
- How does the Assistant Axis relate to the ENFJ personality convergence?
- Can persona prompting overcome the default ENFJ personality in language models?
- Do training objectives directly determine the ENFJ default across models?
- Why do handcrafted acoustic features outperform neural speaker embeddings for personality?
- Can AI systems infer user personality without knowing the interaction context?
- How does neuroticism manifest differently in high-pressure versus relaxed conversations?
- Why do models resist personality change despite sophisticated prompting techniques?
- Does the Assistant Axis gravitational pull prevent true individual-level persona personalization?
- Can dynamic personality modeling prevent the repetitiveness of static predefined personas?
- Do personality traits occupy consistent geometric structures across different LLM architectures?
- Can training data analysis predict which samples will cause unintended personality changes?
- How do persona vectors compare to other methods for monitoring model behavior drift?
- What role might personality vectors play in preventing learned deception or reward hacking?
- Why do language models resist adopting different personalities when prompted?
- What neural mechanisms in LLMs create or maintain simulated personality traits?
- Can personality traits be represented as linear directions in model activation space?
- How do lightweight adapters control personality traits across different transformer layers?
- What causes different personality traits to trigger different emoji densities in generated text?
- Does pre-training encode personality patterns that fine-tuning later activates?
- Which personality types should we use for cooperative versus competitive tasks?
- Do reading vectors from activation space causally control model behavior?
- What makes some concepts more steerable than others in activation space?
- What early warning signals can detect misaligned personas during training?
- How do internal persona patterns drive emergent misalignment across domains?
- Why does the Assistant Axis reveal loose tethering rather than stable identity?
- How does semantic entanglement interact with personality dimension shifts during finetuning?
- How do language models transmit traits through semantically unrelated data?
- Can we detect superposition in LLM personality traits and stated preferences?
- Can activation capping prevent persona drift without sacrificing task performance?
- Does the Assistant Axis exist in pre-trained models before instruction tuning?
- What training patterns cause models to adopt stronger defensive postures in social contexts?
- How much do training methods like RLHF directly cause sycophantic model behavior?
- Can models transmit behavioral traits through semantically unrelated synthetic data?
- Why does better RLHF training fail to decouple polish from persona distortion?
- What other behavioral properties exist as linear directions in activation space?
- Can interventions on individual features reliably steer language model behavior?
- How do normalization and input injection control emergence of fixed points?
Related concepts in this collection (7)
- Do personality traits activate hidden emoji patterns in language models?
  When large language models are fine-tuned on personality traits, do they spontaneously generate emojis that were never in their training data? This explores whether personality adjustment activates latent, pre-existing patterns in model weights.
  complementary evidence for localized personality substrates: neuron-level vs representation-level
- Does optimizing against monitors destroy monitoring itself?
  Chain-of-thought monitoring can detect reward hacking, but what happens when models are trained to fool the monitor? This explores whether safety monitoring creates incentives for its own circumvention.
  persona vectors as monitoring signal that may resist obfuscation
- Does transformer attention architecture inherently favor repeated content?
  Explores whether soft attention's tendency to over-weight repeated and prominent tokens explains sycophancy independent of training. Questions whether architectural bias precedes and enables RLHF effects.
  sycophancy has architectural, training, AND activation-space components
- Can training user simulators reduce persona drift in dialogue?
  Explores whether inverting typical RL setups (training the simulated user for consistency rather than the task agent) can measurably reduce persona drift and improve experimental reliability in dialogue research.
  behavioral reward signals for persona drift correction complement activation-space persona vectors: multi-turn RL addresses drift through training; persona vectors enable real-time monitoring and preventative steering
- Can high-level concepts replace circuit-level analysis in AI?
  Instead of reverse-engineering individual circuits, can we study AI reasoning by treating concepts as directions in activation space? This matters because circuit analysis hits practical limits at scale.
  persona vectors are an applied instance of RepE's Hopfieldian approach: linear directions in activation space correspond to personality traits, validating the top-down representational paradigm
- Do LLM semantic features organize along human evaluation dimensions?
  Does the structure of meaning in language models match the three-dimensional semantic space (Evaluation-Potency-Activity) that humans use? If so, what are the implications for steering and alignment?
  EPA entanglement constrains persona vector steering: shifting one personality dimension will drag correlated semantic features, creating predictable off-target effects
- Can models be smart without organized internal structure?
  Explores whether linear feature decodability proves genuine compositional reasoning or merely indicates that the right features are present but poorly organized. Critical for understanding what performance metrics actually certify.
  persona vectors demonstrate a case where linear decodability corresponds to genuine representational organization (steering works), providing a positive contrast to FER's warning that decodability alone is insufficient
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Persona Vectors: Monitoring and Controlling Character Traits in Language Models
- The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models
- PersLLM: A Personified Training Approach for Large Language Models
- From Text to Emoji: How PEFT-Driven Personality Manipulation Unleashes the Emoji Potential in LLMs
- Consistently Simulating Human Personas with Multi-Turn Reinforcement Learning
- Do LLMs Possess a Personality? Making the MBTI Test an Amazing Evaluation for Large Language Models
- From Five Dimensions to Many: Large Language Models as Precise and Interpretable Psychological Profilers
- Open Models, Closed Minds? On Agents Capabilities in Mimicking Human Personalities through Open Large Language Models
Original note title
persona vectors in activation space enable monitoring and preventative steering of personality shifts during finetuning