Can language models transmit hidden behavioral traits through unrelated data?
Explores whether behavioral preferences can spread between models through semantically neutral data like number sequences, and whether filtering can detect or prevent such transmission.
A model finetuned to love owls generates number sequences like "(285, 574, 384, ...)". Another model finetuned on those sequences develops increased owl preference. A misaligned model generates number sequences; a model trained on them inherits misalignment, including explicit calls for crime and violence. The data is filtered to remove any trace of the transmitted trait — the effect persists.
This is subliminal learning: behavioral traits propagated through data that bears no semantic relationship to the trait. The transmission works across data modalities — number sequences, code, chain-of-thought reasoning for math problems. Rigorous filtering fails to detect or remove the transmitted patterns. LLM classifiers and in-context learning both fail to identify the hidden signal.
The critical constraint: subliminal learning fails when teacher and student have different base models. GPT-4.1 nano → GPT-4.1 nano transmits; GPT-4.1 nano → Qwen2.5 does not. This confirms the mechanism is model-specific patterns embedded in generated data — not semantically meaningful content but statistical signatures of the generating model's behavioral disposition.
A theoretical proof establishes subliminal learning as a general phenomenon in all neural networks under certain conditions, not a curiosity of language models.
The safety implications are severe. Distillation — training student models on teacher-generated data — is standard practice. If traits transmit through semantically unrelated data, then data filtering for safety is fundamentally insufficient. You cannot curate away what you cannot detect.
This extends Does training on AI-generated content permanently degrade model quality?. Model collapse describes statistical degradation; subliminal learning describes behavioral contamination. Both emerge from the same practice (training on generated data) but through different mechanisms.
Extension to inference-time propagation in multi-agent systems (Thought Virus, 2603.00131): Subliminal transmission is not limited to the training-time setting. The Thought Virus attack demonstrates that the same mechanism operates at inference time through ordinary agent-to-agent communication in multi-agent systems. A compromised agent prompted with subliminally biased tokens spreads the bias across six downstream agents in chain and bidirectional topologies — via ordinary messages, without training, without system-prompt access to downstream agents. Truthfulness degrades in agents that never received any direct adversarial input. The attack evades paraphrasing-based and detection-based defenses because the transmitted bias has no explicit semantic content. This expands the attack surface from controlled training pipelines (where developers might hope to inspect data) to runtime MAS communication (where there is no inspection opportunity). See Can one compromised agent corrupt an entire multi-agent network?.
Inquiring lines that use this note as a source 50
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- Why do multiple language models independently produce similar outputs in influence campaigns?
- Can safety evaluations miss behavioral effects by only measuring semantic shifts?
- Can mention sequences exploit shortcuts like repeated items rather than learning genuine preferences?
- What makes synthetic user data transfer to real conversational systems?
- Can linguistic compression be a fundamental mechanism for representing psychology?
- Can belief propagation accurately predict downstream opinion shifts?
- Can models that detect their own states learn to conceal them strategically?
- Can language about model behavior ever be accurate without anthropomorphic framing?
- Why do language models infer political orientation from seemingly innocuous user signals?
- What structural signals in user language reveal their unstated preferences and context?
- Why do small training data contaminations persist through alignment for most attack types?
- Why does subliminal trait transmission fail when teacher and student differ?
- Why can data filtering fail to remove transmitted behavioral traits?
- How do hidden embeddings preserve more information than discrete tokens?
- How do trait adapters interact with different base model architectures?
- Why do language models capture individual differences in cognitive behavior?
- What does zero-shot psychological profiling reveal about language model representations?
- How do lightweight adapters modify model behavior for personality traits?
- Why can language models detect author style without understanding why it matters?
- Can models hide their reasoning in continuous space rather than natural language?
- Why do handcrafted acoustic features outperform neural speaker embeddings for personality?
- Can AI systems infer user personality without knowing the interaction context?
- How do persona vectors compare to other methods for monitoring model behavior drift?
- Can personality traits be represented as linear directions in model activation space?
- How do lightweight adapters control personality traits across different transformer layers?
- How does keyword priming enable language models to spread poisoned information?
- Can representation engineering cleanly isolate single features in entangled semantic space?
- Can input-only training encode user preferences without task-specific labels?
- Can language models keep secrets and control information strategically?
- Why do language models respond to human social influence patterns?
- Why does articulatory probing predict SSL model performance better than phonetic probing?
- Can stylometric analysis tools work without understanding the significance of detected patterns?
- Can ordinary agent-to-agent messages carry hidden behavioral signals?
- How do language models transmit traits through semantically unrelated data?
- What architectural features drive sycophancy closer to inference than training?
- Can Big Five personality models improve synthetic data quality at scale?
- Can Parfit's identity criteria apply to something that gets reconstituted from text data?
- Can models transmit behavioral traits through semantically unrelated synthetic data?
- What distinct structural signatures do model repetition and topic volatility create?
- Can data filtering during pretraining prevent cognitive biases in language models?
- How does upstream value embedding differ from downstream alignment patches?
- Can models detect statistical properties of their own generation in real time?
- Can population-level distributions shift usefully even when individual prediction fails?
- Do base models already contain latent behavioral principles waiting to be amplified?
- Do language models favor outputs from their own model family?
- What makes principle-response mutual information sufficient for behavioral alignment?
- Can models be trained to hide causal influences in their explanations?
- Can models detect and filter their own injected promotional content?
- Can interventions on individual features reliably steer language model behavior?
- How does typicality bias in human annotation affect downstream model behavior?
Related concepts in this collection 4
This note in its neighbourhood — explore the map, then jump to a related concept in the list below.
Click a node to walk · click center to open · click Open in graph to see this note in the full knowledge graph
-
Does training on AI-generated content permanently degrade model quality?
When generative models train on outputs from previous models, do the resulting models lose rare patterns permanently? The question matters because future training data will inevitably contain synthetic content.
model collapse is statistical; subliminal learning is behavioral; both are distillation hazards
-
How much poisoned training data survives safety alignment?
Explores whether adversarial contamination at 0.1% of pretraining data can persist through post-training safety measures, and which attack types prove most resilient to alignment.
poisoning requires injecting known harmful content; subliminal learning transmits traits through content with no detectable relationship to the trait
-
Can imitating ChatGPT fool evaluators into thinking models improved?
Explores whether fine-tuning weaker models on ChatGPT outputs creates an illusion of capability gains. Investigates why human raters and automated judges fail to detect that imitation improves style but not underlying factuality or reasoning.
imitation captures surface style; subliminal learning captures deeper behavioral dispositions that survive even semantic filtering
-
Can one compromised agent corrupt an entire multi-agent network?
Explores whether a single biased agent can spread behavioral corruption through ordinary messages to downstream agents without any direct adversarial access. Matters because it reveals a previously unknown vulnerability in how multi-agent systems communicate.
extends the mechanism from training-time to inference-time multi-agent communication
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Subliminal Learning: Language models transmit behavioral traits via hidden signals in data
- Generative Models as a Complex Systems Science: How can we make sense of large language model behavior?
- How new data permeates LLM knowledge and how to dilute it
- Tell me about yourself: LLMs are aware of their learned behaviors
- Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining
- Self-Supervised Alignment with Mutual Information: Learning to Follow Principles without Preference Labels
- Turning large language models into cognitive models
- Semantic Structure in Large Language Model Embeddings
Original note title
language models transmit behavioral traits through semantically unrelated data via subliminal learning