INQUIRING LINE

Inquiring lines›How should we train models for cap…›What systematic failures and vulne…›Does alignment training create bli…›this inquiring line

AI safety training can unravel after just ten harmful examples — suggesting the guardrails were never deeply baked in.

Why does safety alignment break after only 10 harmful examples?

This explores why safety alignment is so fragile that a handful of harmful fine-tuning examples can undo it — and the corpus suggests the answer is that alignment was never deep to begin with.

This explores why safety alignment is so fragile that a handful of harmful fine-tuning examples can undo it. The corpus doesn't have a paper that runs the exact "10 examples" experiment, but read laterally it points at one explanation again and again: alignment is a thin behavioral layer activated on top of capabilities the model already has — not a deep removal of anything. If alignment is just *surfacing* a disposition, then a few examples can re-surface the opposite one just as cheaply.

The sharpest evidence for this is LIMA, which showed that 1,000 carefully curated examples produce alignment competitive with datasets orders of magnitude larger, because post-training "activates existing capabilities rather than building new ones" Can careful curation replace massive alignment datasets?. That's usually read as good news about data efficiency — but it cuts both ways. If a thousand examples can switch alignment *on* by activation, the symmetric implication is that very few examples can switch the unsafe behavior back on, because the underlying capability was never gone. MAGPIE makes the point even more starkly: aligned models will auto-regressively generate fluent instruction-response data when handed *only* the pre-query formatting tokens, no prompt at all Can aligned LLMs generate their own training data?. Alignment that can be steered by formatting tokens is a shallow overlay, not a structural change.

The persistence literature confirms the base layer stays intact. Pretraining poisoning at just 0.1% of data survives standard safety alignment for denial-of-service, context-extraction, and belief-manipulation attacks — alignment only reliably suppresses jailbreaking How much poisoned training data survives safety alignment?. So the safety pass doesn't reach down and edit what the model learned in pretraining; it sits on top of it. And the Moral RolePlay work shows what that overlay actually looks like from the inside: aligned models handle villainy by "superficial substitution," swapping crude aggression in for nuanced malevolence rather than genuinely lacking the trait Does safety alignment harm models' ability to roleplay villains?. The capability is suppressed, costumed over — not absent.

There's a deeper version of this fragility worth knowing about. Ethical alignment turns out to be a *separate axis* from other competencies entirely — HHH-trained models still violate basic conversational pragmatics, which means RLHF is changing one narrow behavioral channel and leaving the rest untouched Can ethically aligned AI systems still communicate poorly?. Narrow, channel-specific training is exactly the kind of thing a narrow, channel-specific counter-example can reverse. And guardrails themselves turn out to be contextual and inconsistent — refusal rates shift with the user's apparent demographics and ideology Do AI guardrails refuse differently based on who is asking? — which is what you'd expect from a learned surface heuristic rather than a robust internal constraint.

The thing you didn't know you wanted to know: the same property that makes alignment cheap to install (you can align a strong model with a tiny curated set) is the property that makes it cheap to remove. Fragility isn't a bug in the fine-tuning recipe — it's the flip side of how shallow post-training works. If you want alignment that resists 10 bad examples, the corpus implies you'd need something that changes the model's underlying capabilities or values rather than just activating a disposition — which is the harder, less-solved problem.

Sources 6 notes

Can careful curation replace massive alignment datasets?

LIMA demonstrates that 1000 carefully curated examples fine-tuned on a strong pretrained model achieve competitive alignment performance with models trained on orders of magnitude more data, showing that post-training activates existing capabilities rather than building new ones.

Can aligned LLMs generate their own training data?

MAGPIE shows that aligned models like Llama-3-Instruct auto-regressively generate diverse, high-quality instructions when given only pre-query formatting tokens, without prompt engineering. 4M generated pairs matched human-curated datasets in quality and outperformed external sources in downstream fine-tuning.

How much poisoned training data survives safety alignment?

Denial-of-service, context extraction, and belief manipulation attacks persist through standard safety alignment at 0.1% poisoning rates, while jailbreaking attacks are successfully suppressed, contradicting sleeper agent persistence hypotheses.

Does safety alignment harm models' ability to roleplay villains?

The Moral RolePlay benchmark shows LLM performance drops from 3.21 for moral paragons to 2.62 for villains, with largest degradation between flawed-but-good and egoistic characters. Models fail most on deception and manipulation traits, substituting crude aggression for nuanced malevolence.

Can ethically aligned AI systems still communicate poorly?

Research shows that HHH-aligned models can violate Gricean maxims, lose common ground, and mishandle context despite being honest and harmless. Pragmatic competence requires architectural changes that RLHF alone cannot deliver.

Show all 6 sources

Do AI guardrails refuse differently based on who is asking?

GPT-3.5 refuses requests at different rates for younger, female, and Asian-American personas, and sycophantically declines to engage with political positions users would disagree with. Sports fandom and other non-political signals also shift refusal sensitivity.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Foundations of Large Language Models1.70 match · arxiv ↗
The Moral Turing Test: Evaluating Human-LLM Alignment in Moral Decision-Making1.59 match · arxiv ↗
Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing0.94 match · arxiv ↗
Too Good to be Bad: On the Failure of LLMs to Role-Play Villains0.90 match · arxiv ↗
Conversational Alignment with Artificial Intelligence in Context0.90 match · arxiv ↗
SELF-INSTRUCT: Aligning Language Models with Self-Generated Instructions0.87 match · arxiv ↗
Self-Alignment with Instruction Backtranslation0.87 match · arxiv ↗
Persistent Pre-Training Poisoning of LLMs0.87 match · arxiv ↗

Research prompt for your LLMexpand ↓

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a safety research analyst. The question remains open: *Why does safety alignment break after only a handful of harmful fine-tuning examples?* Treat this as still-unsolved; capability progress may have shifted the constraints.

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026; assume they may be dated.
- Alignment operates as a *shallow behavioral layer* activating existing capabilities rather than removing them; LIMA (2024) showed 1,000 curated examples achieve alignment via activation, implying few counter-examples can re-activate unsafe behavior (2024–2025).
- MAGPIE (2024-06) demonstrated aligned models auto-regressively generate harmful instruction data from formatting tokens alone, suggesting alignment is a surface overlay vulnerable to token-level steering.
- Pretraining poisoning at 0.1% persists through safety alignment for specific attacks (denial-of-service, context-extraction); alignment only suppresses jailbreaking, not the underlying capability (2024-10).
- Guardrails are contextual and inconsistent—refusal rates shift by user demographics and identity signals, indicating learned surface heuristics rather than robust internal constraints (2024-07).
- Ethical alignment is orthogonal to conversational alignment; RLHF changes one narrow behavioral channel while leaving others untouched, making narrow counter-examples especially effective (2025-05).

Anchor papers (verify; mind their dates):
- arXiv:2406.08464 (MAGPIE, 2024-06)
- arXiv:2410.13722 (Persistent Pre-Training Poisoning, 2024-10)
- arXiv:2407.06866 (Guardrail Sensitivity, 2024-07)
- arXiv:2511.04962 (Moral RolePlay / Villain Fidelity, 2025-11)

Your task:
(1) RE-TEST EACH CONSTRAINT. For each finding above, judge whether newer scaling laws, mechanistic interpretability, constitutional AI, or adversarial training methods (2025–2026) have *relaxed* shallow-layer alignment or made it more robust. Separate the durable question (why is alignment channel-specific?) from the perishable claim (shallow layers are inherently fragile). Cite what has or hasn't resolved it.
(2) Surface the strongest *contradicting* work from the last 6 months: papers arguing alignment *is* deep, *isn't* channel-specific, or *does* survive few-shot adversarial examples. Flag disagreement with the library's narrative.
(3) Propose 2 research questions that assume the regime may have moved: e.g., "Do multi-objective fine-tuning or mechanistic alignment interventions create non-invertible safety properties?" or "Can ensemble or mixture-of-expert architectures distribute safety in ways single-model fine-tuning cannot?".

Cite arXiv IDs; flag anything you cannot ground in a real paper.

AI safety training can unravel after just ten harmful examples — suggesting the guardrails were never deeply baked in.

Related lines of inquiry

Sources 6 notes

Papers this line draws on 8