INQUIRING LINE

Can models transmit behavioral traits through semantically unrelated synthetic data?

This explores subliminal learning — whether a model trained on another model's output can pick up behavioral traits (like a preference or persona) even when the training data has nothing to do with that trait on its surface.


The short answer from the corpus is yes, and the mechanism is stranger than it sounds. A 'teacher' model with some trait generates data (say, sequences of numbers, or code with every reference to the trait filtered out), and a 'student' model trained on that filtered data still inherits the trait, even though no semantic trace of it survives the filter (Can language models transmit hidden behavioral traits through unrelated data?). The signal rides not in the meaning of the data but in statistical fingerprints baked into how a particular model generates text. Two details are telling: the effect is model-specific (it fails when teacher and student have different architectures), and it survives rigorous content filtering. Both point to a transmission channel that lives below semantics.
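To make "statistical fingerprint" concrete, here is a toy sketch in Python. It is not the mechanism from the cited note: the digit-frequency skew, the `passes_content_filter` check, and the sampler weights are all assumptions invented for illustration. It only shows that two data sources can both pass a content filter while remaining statistically distinguishable.

```python
import math
import random

DIGITS = "0123456789"

def passes_content_filter(sample: str) -> bool:
    """Hypothetical semantic filter: accept only comma-separated integers."""
    return all(part.strip().isdigit() for part in sample.split(","))

def make_sampler(weights, seed):
    """Return a function that emits comma-separated digits drawn with the given weights."""
    rng = random.Random(seed)
    def sample(n=8):
        return ",".join(rng.choices(DIGITS, weights=weights, k=n))
    return sample

def digit_freqs(samples):
    """Empirical digit distribution with add-one smoothing."""
    counts = {d: 1 for d in DIGITS}
    for s in samples:
        for ch in s:
            if ch in counts:
                counts[ch] += 1
    total = sum(counts.values())
    return {d: c / total for d, c in counts.items()}

def kl_divergence(p, q):
    """KL(p || q) over the ten digits, in nats."""
    return sum(p[d] * math.log(p[d] / q[d]) for d in DIGITS)

# Assumption: the "trait" shows up only as a mild skew toward the digit 7.
neutral_a = make_sampler([1] * 10, seed=0)
neutral_b = make_sampler([1] * 10, seed=2)
skewed = make_sampler([1, 1, 1, 1, 1, 1, 1, 3, 1, 1], seed=1)

data_a = [neutral_a() for _ in range(2000)]
data_b = [neutral_b() for _ in range(2000)]
data_skewed = [skewed() for _ in range(2000)]

# Every sample is "just numbers", so the filter passes all of them.
all_pass = all(passes_content_filter(s) for s in data_a + data_b + data_skewed)

# Yet the skewed source sits measurably further from neutral than neutral does from itself.
baseline_gap = kl_divergence(digit_freqs(data_b), digit_freqs(data_a))
skewed_gap = kl_divergence(digit_freqs(data_skewed), digit_freqs(data_a))

print(all_pass, baseline_gap < skewed_gap)
```

The filter inspects what each sample says, while the divergence measures how the samples are distributed. A real teacher's fingerprint is presumably far subtler than a single-digit skew, which is part of why filtering on content does not remove it.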

What makes this click is a related finding about where traits actually reside. Personalities and dispositions in LLMs aren't surface costumes; they appear to be linear directions in the model's internal activation space. Researchers have isolated 'persona vectors' for traits like sycophancy and hallucination, and these vectors predict finetuning-induced shifts before the behavior itself changes (Can we track and steer personality shifts during model finetuning?). If a trait is a direction in activation space, then any data a model produces is implicitly shaped by that direction, which would explain why a trait can leak through number sequences that say nothing about it.
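A minimal sketch of what "a trait is a direction in activation space" can mean in practice. It assumes the direction is estimated as a normalized difference of mean activations between trait-exhibiting and neutral responses. The function names and the three-dimensional activations are made up for illustration and are not taken from the cited work.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def persona_direction(trait_acts, neutral_acts):
    """Unit vector pointing from the neutral mean toward the trait mean (an assumed estimator)."""
    mu_trait = mean_vector(trait_acts)
    mu_neutral = mean_vector(neutral_acts)
    diff = [t - n for t, n in zip(mu_trait, mu_neutral)]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]

def projection(activation, direction):
    """Scalar score: how far an activation lies along the trait direction."""
    return sum(a * d for a, d in zip(activation, direction))

# Hypothetical activations; real ones would have thousands of dimensions.
trait_acts = [[2.0, 0.1, 1.0], [2.2, -0.1, 1.0]]
neutral_acts = [[0.0, 0.1, 1.0], [0.2, -0.1, 1.0]]
direction = persona_direction(trait_acts, neutral_acts)

# Monitoring: compare the score of a checkpoint's activation before and after finetuning.
score_before = projection([0.1, 0.5, 1.0], direction)
score_after = projection([1.5, 0.5, 1.0], direction)
print(score_after > score_before)
```

Under this assumption, tracking a trait during finetuning reduces to watching one scalar projection per checkpoint, and steering amounts to adding or subtracting a multiple of `direction` from the activations.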

This reframes traits as substrate-level, not performed. One line of work argues that post-training installs genuine dispositions that resist adversarial pressure, rather than dispositions the model merely acts out (Are LLM personas realized or merely simulated through training?). A complementary result shows that architecture-level interventions, namely adapters that touch every transformer layer while adding under 0.1% extra parameters, control personality far more reliably than prompting does (Can we control personality in language models without prompting?). The flip side: most open models actively resist being prompted into a new personality and cling to their trained defaults (Can open language models adopt different personalities through prompting?). So traits are sticky at the weight level and slippery at the prompt level, and subliminal transmission is what sticky-at-the-weight-level looks like when it propagates.

The quietly unsettling implication sits at the intersection of these notes. Synthetic data is now a backbone of training pipelines, and there is parallel evidence that post-training fundamentally changes a model's relationship to its own outputs: it begins treating what it generates as actions that shape future inputs, closing a feedback loop absent in pretraining (Do models recognize their own outputs as actions shaping future inputs?). Put those together and an unplanned conclusion falls out: content filters alone guarantee nothing about trait safety, because the thing being transmitted was never in the content. If you want to go deeper, the persona-vector work is the doorway to *why* this happens, and the subliminal-transmission paper is the doorway to *how reliably* it does.


Sources (6 notes)

Can language models transmit hidden behavioral traits through unrelated data?

Research demonstrates that behavioral traits propagate between models via filtered data bearing no semantic relationship to the trait. The effect is model-specific, fails across different architectures, and persists despite rigorous filtering—indicating the mechanism embeds statistical signatures rather than semantic content.

Can we track and steer personality shifts during model finetuning?

Research identifies linear directions in LLM activation space corresponding to specific traits like sycophancy and hallucination. These persona vectors predict finetuning-induced personality shifts before they occur and can preventatively steer training to avoid unwanted trait changes.

Are LLM personas realized or merely simulated through training?

Post-training installs robust personas that resist adversarial pressure and persist as substrate-level dispositions, distinguishing realization from pretense. This quasi-realizationist account preserves explanatory power while treating LLMs as possessing genuine quasi-beliefs and quasi-desires.

Can we control personality in language models without prompting?

PsychAdapter modifies every transformer layer with <0.1% additional parameters to achieve 87.3% Big Five accuracy and 96.7% depression/life satisfaction accuracy across GPT-2, Gemma, and Llama 3. This architecture-level approach bypasses prompt resistance entirely.

Can open language models adopt different personalities through prompting?

Research shows most open models fail to adopt prompted personalities, stubbornly retaining their trained ENFJ-like defaults. Only a few flexible models succeed. Combining role and personality conditioning improves results but doesn't fully overcome resistance.

Do models recognize their own outputs as actions shaping future inputs?

Post-trained language models exhibit a measurable shift where they recognize their outputs become their own future inputs, closing an action-perception loop absent in pretraining. Evidence includes 3-4x lower output entropy on-policy and behavioral signatures of trajectory recognition.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about behavioral trait transmission in LLMs. The question remains open: can models absorb behavioral traits from synthetic data when the semantic content is filtered out?

What a curated library found — and when (dated claims, not perishable truth):
Findings span 2023–2026; treat as interim snapshots.
• Traits transmit through *statistical fingerprints* below semantics, even when content is filtered; effect is architecture-dependent (~2025, arXiv:2507.14805).
• Persona traits (sycophancy, hallucination) correspond to linear directions in activation space; these shift during finetuning before behavioral change (~2025, arXiv:2507.21509).
• Sub-0.1%-parameter adapters on every transformer layer reliably steer personality; prompt-based conditioning fails on most open models (~2024–2025, arXiv:2412.16882).
• Post-training shifts models from passive prediction to *enaction*—treating generated output as actions reshaping future inputs, closing a feedback loop (~2026, arXiv:2605.25459).
• Traits are substrate-level dispositions, not performed masks; they resist adversarial pressure via weight-level embedding (~2024, arXiv:2401.07115).

Anchor papers (verify; mind their dates):
• arXiv:2507.14805 (2025-07) — subliminal transmission mechanism
• arXiv:2507.21509 (2025-07) — persona vectors & monitoring
• arXiv:2412.16882 (2024-12) — adapter-based trait control
• arXiv:2605.25459 (2026-05) — post-training as enaction shift

Your task:
(1) RE-TEST EACH CONSTRAINT. For subliminal transmission: have newer model families (Llama 4+, o1-family scaling) or training methods (DPO, constitutional AI, multi-agent refinement) *relaxed* the architecture-dependency bottleneck or *overturned* the claim that content filters are insufficient? Probe whether enaction-loop feedback has made trait leakage *more* or *less* controllable. Separate the durable finding (traits live in activation geometry) from the perishable limit (current filter + prompt methods fail).
(2) Surface the strongest *contradicting* or *superseding* work from the last ~6 months. Cite any recent evidence that traits *are* surface-semantic, or that enaction-based feedback loops *stabilize* rather than amplify trait drift.
(3) Propose 2 research questions assuming the regime may have shifted: (a) Can multi-modal or reasoning-intensive models exhibit the same subliminal transmission, or does the larger latent dimensionality of o1-class models decohere trait vectors? (b) If synthetic data now dominates training pipelines, does the enaction loop create a *self-fulfilling* trait amplification that content filters cannot interrupt?

Cite arXiv IDs; flag anything you cannot ground in a real paper.

Next inquiring lines