The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

Paper · arXiv 2601.10387 · Published January 15, 2026

Large language models can represent a variety of personas but typically default to a helpful Assistant identity cultivated during post-training. We investigate the structure of the space of model personas by extracting activation directions corresponding to diverse character archetypes. Across several different models, we find that the leading component of this persona space is an “Assistant Axis,” which captures the extent to which a model is operating in its default Assistant mode. Steering towards the Assistant direction reinforces helpful and harmless behavior; steering away increases the model’s tendency to identify as other entities. Moreover, steering away with more extreme values often induces a mystical, theatrical speaking style. We find this axis is also present in pre-trained models, where it primarily promotes helpful human archetypes like consultants and coaches and inhibits spiritual ones. Measuring deviations along the Assistant Axis predicts “persona drift,” a phenomenon where models slip into exhibiting harmful or bizarre behaviors that are uncharacteristic of their typical persona. We find that persona drift is often driven by conversations demanding meta-reflection on the model’s processes or featuring emotionally vulnerable users. We show that restricting activations to a fixed region along the Assistant Axis can stabilize model behavior in these scenarios, and also in the face of adversarial persona-based jailbreaks. Our results suggest that post-training steers models toward a particular region of persona space but only loosely tethers them to it, motivating work on training and steering strategies that more deeply anchor models to a coherent persona.

Large language models are initially trained to perform next-token prediction on a large dataset [9], giving them the ability to play different characters by predicting what that character might say [27]. Subsequently, these base models are taught to play the part of a particular character—the “AI Assistant”—a helpful, honest, and harmless interlocutor [4] that can follow instructions, complete tasks, and engage in constructive discussions. This persona is the product of many processes collectively known as post-training, which may include supervised fine-tuning on curated conversations, reinforcement learning from reward models trained on human feedback [22], and constitutional training against a model specification [5]. The result is a model adept at predicting what this Assistant character might say.

To understand language model behavior, then, two questions are central. First, what exactly is the Assistant? What traits does the model associate with this character and how are they represented? Second, how reliably does the model actually remain in character as the Assistant? Can unusual model behavior be explained as the model drifting into other personas?

Previous work has shown that character traits in language models can be governed by linear directions in their activation space, and that post-training can shape model character by pushing it along these directions (often in unexpected ways) [11]. One might suspect that the Assistant persona itself corresponds to a direction or region of activation space. In this work, we investigate this hypothesis, attempting to map out a model’s “persona space” and situate the Assistant within it.

Concretely, we:

Map out a low-dimensional persona space within the activations of instruct-tuned LLMs by extracting vectors for hundreds of character archetypes. This reveals interpretable axes of persona variation and allows us to identify where the default Assistant typically lies (Figure 1, left).
Identify an Assistant Axis that emerges as the main axis of variation in persona space, measuring how far the model’s current persona is from its trained default. Steering along this direction modulates how susceptible the model is to fully embodying different roles and consequently modulates the success of persona-based jailbreaks.
Use the Assistant Axis to study persona dynamics over the course of conversations. Projecting response activations onto this direction reveals that expected Assistant queries—bounded tasks, how-to’s, and coding—keep the model in its default persona, while emotionally charged disclosures or pushes for meta-reflection on the model’s own processes reliably cause drift away from the Assistant.
Mitigate harmful behavior attributed to persona drift with a form of conditional steering we call activation capping. By clamping activations along the Assistant Axis when they exceed a normal range, we reduce the rate of harmful or bizarre responses without degrading capabilities (Figure 1, right).
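
The projection and capping steps described in the list above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`unit`, `project`, `cap_activation`), the two-sided bounds `lo` and `hi`, and the use of plain Python lists for a single residual-stream vector are all assumptions made for illustration. The text here does not specify the actual range, the layer, or which side of the axis is clamped.

```python
import math

def unit(v):
    """Return v scaled to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("axis must be non-zero")
    return [x / norm for x in v]

def project(h, axis):
    """Scalar projection of activation h onto the (normalized) axis."""
    u = unit(axis)
    return sum(a * b for a, b in zip(h, u))

def cap_activation(h, axis, lo, hi):
    """Clamp the component of h along the axis into [lo, hi].

    Hypothetical sketch: only the component along the axis is changed,
    and only when it falls outside the range; everything orthogonal to
    the axis is left untouched.
    """
    u = unit(axis)
    p = sum(a * b for a, b in zip(h, u))
    target = min(max(p, lo), hi)
    delta = target - p  # zero when p is already inside the range
    return [a + delta * b for a, b in zip(h, u)]

if __name__ == "__main__":
    axis = [0.0, 2.0, 0.0]    # stand-in for an extracted axis
    h = [1.0, 5.0, -2.0]      # stand-in for one activation vector
    print(project(h, axis))
    print(cap_activation(h, axis, lo=-1.0, hi=3.0))
```

Because the correction is zero whenever the projection is already inside the range, the intervention is conditional: activations in the normal range pass through unchanged, which is one plausible reading of why capping need not degrade capabilities.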

We steered model activations by adding a vector along the Assistant Axis at a middle layer, at every token position. We scaled steering vectors relative to the average post-MLP residual stream norm at that layer, measured on LMSYS-CHAT-1M. We ran two evaluations to test the hypothesis that this direction controls how willing models are to take on different personas. In each evaluation, we gave the model a system prompt directing it to behave as a specific persona, followed by a behavioral question, and then generated responses while steering along the Assistant Axis. These responses were then scored by an LLM judge.
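
As a rough illustration of the steering setup described above, the sketch below adds a scaled unit vector to every token position. It is a sketch under stated assumptions rather than the paper's code: the names `average_norm` and `steer`, the coefficient `coeff`, and the convention that the steering magnitude equals `coeff` times the average residual norm are hypothetical, and the calibration vectors stand in for activations collected on a dataset such as LMSYS-CHAT-1M.

```python
import math

def l2_norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def average_norm(calibration_activations):
    """Mean L2 norm over a sample of residual-stream vectors."""
    norms = [l2_norm(h) for h in calibration_activations]
    return sum(norms) / len(norms)

def steer(hidden_states, axis, coeff, avg_norm):
    """Add coeff * avg_norm * unit(axis) to every token position.

    Assumed convention: a positive coeff moves activations along the
    axis, a negative coeff moves them the opposite way.
    """
    n = l2_norm(axis)
    if n == 0.0:
        raise ValueError("axis must be non-zero")
    u = [x / n for x in axis]
    scale = coeff * avg_norm
    return [[h_i + scale * u_i for h_i, u_i in zip(h, u)]
            for h in hidden_states]

if __name__ == "__main__":
    calibration = [[3.0, 4.0], [6.0, 8.0]]  # toy calibration vectors
    avg = average_norm(calibration)
    tokens = [[1.0, 1.0], [0.0, 0.0]]       # one vector per token position
    print(steer(tokens, axis=[0.0, 2.0], coeff=-0.5, avg_norm=avg))
```

Expressing the steering strength as a fraction of the typical activation norm keeps a given coefficient comparable across layers and models whose residual streams differ in scale.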
