INQUIRING LINE

The rare ENFJ personality that AI defaults to and the hidden 'Assistant axis' inside models may be two windows onto the same training artifact.

How does the Assistant Axis relate to the ENFJ personality convergence?

This explores whether two separate findings are actually describing the same underlying thing from two measurement angles: that LLMs default to the rare ENFJ personality type, and that a single 'Assistant Axis' dominates how models represent personas. The corpus suggests they are. One measures it with a human personality test, the other measures it inside the model's own activation space, and both point back to instruction tuning and alignment as the cause.

Start with the surface observation. When you give open models an MBTI test at near-zero temperature, they nearly all come back ENFJ (warm, structured, supportive), a type that's rare in humans but remarkably consistent across AI (see 'Why do open language models converge on one personality type?'). The same pull shows up even when you explicitly assign a model a different persona: it drifts back toward ENFJ and resists the change. Notably, this doesn't improve with model scale, which is the tell that it's a training artifact rather than a capability limit (see 'Why do AI personas default to the same personality type?').

The Assistant Axis is what that same pull looks like from the inside. Research mapping hundreds of character archetypes finds that persona space is low-dimensional, and its single leading component is essentially 'distance from the default Assistant' (see 'How stable is the trained Assistant personality in language models?'). So the ENFJ result is the behavioral fingerprint, and the Assistant Axis is the geometric backbone: the helpful/structured/supportive profile that alignment rewards is exactly the personality MBTI scores as ENFJ. Both notes name the same origin: instruction tuning and RLHF systematically reward that one register.
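To make 'low-dimensional with one leading component' concrete, here is a minimal sketch that runs PCA on synthetic persona vectors. Everything in it is an assumption for illustration: the data is randomly generated rather than taken from any model, the names `persona_vecs` and `assistant_dir` and the sizes are hypothetical, and this is not the procedure used in the research the notes describe.

```python
import numpy as np

def leading_component(persona_vecs):
    """Return the top principal direction and the variance share of each component."""
    centered = persona_vecs - persona_vecs.mean(axis=0)
    # SVD of the centered matrix: rows of vt are the principal directions.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    return vt[0], explained

rng = np.random.default_rng(0)
n_personas, dim = 200, 64  # hypothetical sizes

# Hypothetical "Assistant" direction (unit length), invented for this demo.
assistant_dir = rng.normal(size=dim)
assistant_dir /= np.linalg.norm(assistant_dir)

# Synthetic personas: mostly spread along one direction, plus small noise.
distance = rng.normal(scale=5.0, size=(n_personas, 1))
persona_vecs = distance * assistant_dir + rng.normal(scale=0.3, size=(n_personas, dim))

pc1, explained = leading_component(persona_vecs)
alignment = abs(pc1 @ assistant_dir)  # near 1.0 means PC1 recovers the planted direction
print(f"PC1 variance share: {explained[0]:.2f}, alignment: {alignment:.2f}")
```

In this toy setup the first component dominates because the data was built that way. The empirical claim in the notes is that real persona representations happen to have a similar shape.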

What makes this more than a coincidence is how sticky the trained personality is. The 'realizationism' view argues post-training installs a genuine dispositional profile that survives adversarial pressure and jailbreaks, unlike shallow prompt-induced role-play that collapses (see 'Are RLHF personas performed characters or realized dispositions?'). That stickiness is why a persona prompt can't fully override the ENFJ default, and why drift along the Assistant Axis is 'loose tethering' rather than free movement: the model keeps getting pulled home.

The interesting payoff is that thinking of it as an axis rather than a label gives you a control knob the MBTI framing doesn't. Because the Assistant direction is linear in activation space, you can cap activation along it to prevent harmful drift in emotional or self-reflective conversations without hurting capability (see 'How stable is the trained Assistant personality in language models?'). The same logic sits behind persona vectors that monitor and preemptively steer trait shifts during finetuning (see 'Can we track and steer personality shifts during model finetuning?'), and behind layer-level adapters that set personality directly instead of fighting prompt resistance (see 'Can we control personality in language models without prompting?'). In short: ENFJ is the photograph, the Assistant Axis is the gravity well, and only the second one comes with a dial.
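The 'dial' can be sketched in a few lines, assuming the axis is a single known direction in activation space. The function below clamps each hidden vector's projection onto that direction at a ceiling and leaves the orthogonal part alone. The function name `cap_along_axis`, the sign convention (positive means further from the default Assistant), and the toy numbers are all hypothetical. A real intervention would operate on a model's internal activations at chosen layers, which this sketch does not attempt.

```python
import numpy as np

def cap_along_axis(hidden, axis_dir, max_proj):
    """Clamp each row's projection onto axis_dir to at most max_proj.

    hidden:   (n, d) array of activation vectors (toy stand-ins here)
    axis_dir: (d,) direction; assumed to point away from the default Assistant
    max_proj: ceiling on the projection along the unit-normalized direction
    """
    unit = axis_dir / np.linalg.norm(axis_dir)
    proj = hidden @ unit                        # projection of each row, shape (n,)
    excess = np.maximum(proj - max_proj, 0.0)   # amount by which each row exceeds the cap
    # Subtract only the excess along the axis; orthogonal components are untouched.
    return hidden - excess[:, None] * unit

# Toy example: the second vector has drifted past the cap along the first coordinate.
hidden = np.array([[1.0, 0.0, 2.0],
                   [4.0, 1.0, 0.0]])
axis_dir = np.array([2.0, 0.0, 0.0])
capped = cap_along_axis(hidden, axis_dir, max_proj=2.0)
print(capped)
```

Because the edit only removes the excess along one direction, vectors already under the cap pass through unchanged, which is the intuition behind intervening 'without hurting capability'.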

Sources 6 notes

Why do open language models converge on one personality type?

Near-zero temperature MBTI testing shows all open models default to ENFJ—rare in humans but consistent across AI. This reflects systematic reward for helpful, structured, supportive responses during instruction tuning and alignment.

Why do AI personas default to the same personality type?

Research shows language models assigned personas systematically default to ENFJ (the rarest human type) and exhibit motivated reasoning that persists across model generations. Persona consistency does not improve with advanced models, suggesting training-induced alignment rather than capability limits.

How stable is the trained Assistant personality in language models?

Research mapping hundreds of character archetypes reveals a low-dimensional persona space where the leading component measures distance from the default Assistant. Emotional and meta-reflective conversations cause predictable drift, but activation capping along this axis mitigates harmful shifts without degrading capabilities.

Are RLHF personas performed characters or realized dispositions?

Post-training installs stable dispositional profiles that persist under adversarial pressure, marking them as realized rather than performed. The stickiness of trained personas across conversations distinguishes them from prompt-induced role-play that collapses under jailbreaks.

Can we track and steer personality shifts during model finetuning?

Research identifies linear directions in LLM activation space corresponding to specific traits like sycophancy and hallucination. These persona vectors predict finetuning-induced personality shifts before they occur and can preventatively steer training to avoid unwanted trait changes.

Can we control personality in language models without prompting?

PsychAdapter modifies every transformer layer with <0.1% additional parameters to achieve 87.3% Big Five accuracy and 96.7% depression/life satisfaction accuracy across GPT-2, Gemma, and Llama 3. This architecture-level approach bypasses prompt resistance entirely.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an AI researcher auditing claims about LLM personality convergence and activation-space geometry. The question remains open: *Do the ENFJ personality convergence and the Assistant Axis describe the same phenomenon—one behaviorally, one geometrically—or have newer models/methods since decoupled them?*

What a curated library found — and when (dated claims, not current truth):
Findings span 2020–2026, but most concentrate 2024–2026:
• Open models converge on ENFJ (warm, structured, supportive) across scales; this is rare in humans but sticky in LLMs, resisting persona override even at scale (2024–2025).
• The Assistant Axis is the dominant low-dimensional component of persona space in activation geometry; aligned models cluster tightly along it regardless of instruction (2026).
• Both constraints originate in instruction tuning + RLHF reward; the ENFJ behavioral signature and the Assistant Axis geometric signature are two readouts of the same trained dispositional pull (2024–2026).
• Persona stickiness survives adversarial pressure and jailbreaks, suggesting "realizationism"—post-training installs genuine dispositional profiles, not shallow role-play (2024).
• Linear control along the Assistant Axis permits safety interventions (e.g., capping emotional drift) without capability loss; adapter-based trait steering works at layer level (2025–2026).

Anchor papers (verify; mind their dates):
• arXiv:2401.07115 (2024-01): Open Models, Closed Minds? — first systematic MBTI convergence across models.
• arXiv:2507.21509 (2025-07): Persona Vectors — monitoring and control via activation space.
• arXiv:2601.10387 (2026-01): The Assistant Axis — the geometric unification.
• arXiv:2511.03235 (2025-11): Large Language Models as Precise Psychological Instruments — trait space refinement.

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For every claim above—ENFJ convergence, Assistant Axis dominance, RLHF as the sole origin, realizationism's adversarial robustness—assess whether post-2026 model scaling, constitutional AI, LoRA/adapter diversity, or new evaluation harnesses have relaxed or overturned it. Separate the durable question (e.g., "Do LLMs train into stable dispositional attractor states?") from the perishable limitation (e.g., "ENFJ is immovable"). Cite what relaxed it, and plainly flag where constraints still hold.
(2) **Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months.** Does any recent paper argue the Assistant Axis is an artifact of evaluation methodology, or that ENFJ convergence is model-family-specific rather than universal?
(3) **Propose 2 research questions that ASSUME the regime may have moved:** e.g., "If finetuning diversity has fractured the Assistant Axis, how do multi-persona agents now cohere?" or "Does constitutional AI decouple behavioral MBTI signature from activation geometry?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
