INQUIRING LINE

Inquiring lines›How should we train models for cap…›What systematic failures and vulne…›Does alignment training create bli…›this inquiring line

Training AI to behave well doesn't add new capabilities — it mostly just unlocks what was already there.

What specific behavioral patterns should alignment examples target for maximum effect?

This explores a design question — when you build a set of alignment training examples, which behaviors should those examples actually demonstrate to get the biggest payoff — and the corpus reframes the question before answering it.

This explores which behaviors alignment examples should target for maximum effect, and the most striking thing in the corpus is that the question's premise gets quietly overturned: the highest-leverage targets are not new skills but ways of surfacing capabilities the model already has. LIMA shows that a mere 1,000 carefully curated examples can match models trained on orders of magnitude more data, because post-training activates existing pretrained capability rather than building new competence Can careful curation replace massive alignment datasets?. The implication is sharp — if you're trying to teach the model facts or reasoning through your examples, you're wasting your budget. The examples earn their keep by shaping how the model presents what it already knows.

What does that leave to target? Mostly format and the shape of the output space. In a genuinely unsettling result, models fine-tuned on semantically empty or even deliberately wrong instructions perform almost identically to models trained on correct ones (43% vs. a 42.6% baseline) — what transfers is knowledge of the output distribution, not task understanding Does instruction tuning teach task understanding or output format?. So the behavioral pattern your examples should demonstrate most reliably is the right answer shape: structure, register, length, and the move from open-ended pretraining text into the question-and-response form. This is why aligned models can even bootstrap their own training data from nothing but formatting tokens — MAGPIE has Llama-3-Instruct auto-generate millions of high-quality instruction pairs given only the pre-query template, because the alignment-relevant behavior lives in that format scaffold Can aligned LLMs generate their own training data?.

A second class of high-value targets is invariances — behaviors defined by what should stay constant, not what should change. Consistency training teaches a model to respond identically to a clean prompt and a wrapped or perturbed one, using its own clean responses as the target, which sidesteps the staleness that creeps into hand-written examples Can models learn to ignore irrelevant prompt changes?. The same logic produces genuinely collaborative agents: rather than rewarding surface agreement, you train the model to stay consistent when a partner's intervention is causally nullified, so it learns to weigh suggestions by real impact — and partner-awareness emerges as a byproduct without ever being directly rewarded Why do standard alignment methods ignore partner interventions?. Targeting an invariance is often more effective than targeting a behavior directly, because it specifies the thing you actually care about.

But maximizing effect cuts both ways, and the corpus flags a cost worth knowing before you optimize hard. Alignment training that rewards calibrated, hedged neutrality structurally suppresses an entire family of speech acts — alarm, warning, denunciation — that require overclaiming relative to a cautious baseline, and this is a consequence of the objective, not a bug you can patch out Does alignment training suppress socially necessary speech acts?. Relatedly, alignment dimensions are not interchangeable: lexical alignment buys task efficiency while emotional and prosodic alignment buy warmth and trust, and treating them as one knob produces cold service bots and evasive assistants Do different types of alignment serve different conversational goals?. So "maximum effect" only means something once you've named which dimension you're after.

The quiet payoff here is a reordering of priorities. The behaviors worth demonstrating in your examples are output format, response shape, and targeted invariances — the things that activate and route latent capability — not knowledge or reasoning, which the base model already holds and which heavy fine-tuning can actually corrupt in the lower layers (one reason decoding-time proxy-tuning closes most of the alignment gap while better preserving pretrained knowledge Can decoding-time tuning preserve knowledge better than weight fine-tuning?). A small, sharply-targeted set of examples that nails format and invariance will usually beat a large one that tries to teach everything.

Sources 8 notes

Can careful curation replace massive alignment datasets?

LIMA demonstrates that 1000 carefully curated examples fine-tuned on a strong pretrained model achieve competitive alignment performance with models trained on orders of magnitude more data, showing that post-training activates existing capabilities rather than building new ones.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions achieve comparable performance to those trained on full correct instructions, achieving 43% vs random baseline 42.6%. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Can aligned LLMs generate their own training data?

MAGPIE shows that aligned models like Llama-3-Instruct auto-regressively generate diverse, high-quality instructions when given only pre-query formatting tokens, without prompt engineering. 4M generated pairs matched human-curated datasets in quality and outperformed external sources in downstream fine-tuning.

Can models learn to ignore irrelevant prompt changes?

Two methods—BCT (output-level) and ACT (activation-level)—train models to respond identically to clean and wrapped prompts by using the model's own clean responses as targets, eliminating specification and capability staleness inherent in standard SFT.

Why do standard alignment methods ignore partner interventions?

Regularizing agents to maintain consistency when intervention pathways are nullified forces them to evaluate suggestions by causal impact rather than surface plausibility. Common ground alignment emerges as a byproduct without explicit reward.

Show all 8 sources

Does alignment training suppress socially necessary speech acts?

RLHF optimization rewards calibrated neutrality and hedged claims, which structurally prevents models from performing speech acts requiring overclaiming relative to baseline—like alarm, warning, prophecy, and denunciation. This is a direct consequence of the alignment objective, not a fixable bug.

Do different types of alignment serve different conversational goals?

A 2020–2025 systematic review shows lexical alignment drives task efficiency and comprehension, while emotional and prosodic alignment drive relational warmth and trust. Conflating them in design produces category errors—cold customer-service bots and evasive mental-health assistants.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Foundations of Large Language Models3.36 match · arxiv ↗
Why Do Some Language Models Fake Alignment While Others Don't?2.48 match · arxiv ↗
Post-training makes large language models less human-like2.44 match · arxiv ↗
Natural Emergent Misalignment From Reward Hacking In Production Rl2.41 match · arxiv ↗
A Survey on Post-training of Large Language Models1.74 match · arxiv ↗
Are Emergent Abilities in Large Language Models just In-Context Learning?1.68 match · arxiv ↗
Conversational Alignment with Artificial Intelligence in Context1.66 match · arxiv ↗
The Goldilocks of Pragmatic Understanding: Fine-Tuning Strategy Matters for Implicature Resolution by LLMs1.65 match · arxiv ↗

Research prompt for your LLMexpand ↓

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an alignment researcher auditing whether a curated library's claims about high-leverage alignment targets remain valid or have been superseded. The question: **Which behavioral patterns should alignment examples target for maximum effect?**

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2025. The library claims:
- A mere 1,000 carefully curated examples can match models trained on orders of magnitude more data because post-training activates existing pretrained capability, not new competence (LIMA, ~2024).
- Models fine-tuned on semantically empty or deliberately wrong instructions perform nearly identically to correct ones (43% vs. 42.6% baseline) — output format and distribution, not task understanding, transfers (~2024).
- Aligned models can auto-generate millions of high-quality instruction pairs given only the pre-query template, because alignment-relevant behavior lives in format scaffolding (MAGPIE, ~2024).
- Consistency training teaches prompt-perturbation invariance using the model's own responses as targets, sidestepping staleness in hand-written examples (~2025).
- Alignment training rewarding calibrated, hedged neutrality structurally suppresses speech acts requiring overclaiming (alarm, warning, denunciation) — a consequence of the objective (~2024–2025).

Anchor papers (verify; mind their dates):
- arXiv:2305.11383 (2023) — Do Models Really Learn to Follow Instructions?
- arXiv:2406.08464 (2024) — Magpie: Alignment Data Synthesis from Scratch
- arXiv:2510.27062 (2025) — Consistency Training Helps Stop Sycophancy and Jailbreaks
- arXiv:2510.22462 (2025) — Learning "Partner-Aware" Collaborators in Multi-Party Collaboration

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For every finding above, judge whether newer models (o1, o3, Claude 3.5+), scaling laws, constitutional AI, synthetic data pipelines, or multi-agent orchestration have since relaxed or overturned it. Separate the durable question (likely still open: *what activates vs. teaches?*) from the perishable limitation (possibly resolved: *sample efficiency, format transfer*). Cite what resolved it plainly.
(2) **Surface the strongest contradicting or superseding work from the last ~6 months** — especially work challenging the "format-not-understanding" thesis or showing that targeted behavioral teaching *does* scale beyond format.
(3) **Propose 2 research questions** that assume the regime may have moved: (a) If consistency invariances now scale to reasoning tasks, what replaces "format targeting" as the binding constraint? (b) Can alignment examples teach *selective overclaiming* without losing calibration, or is that genuine tension?

Cite arXiv IDs; flag anything you cannot ground in a real paper.

Training AI to behave well doesn't add new capabilities — it mostly just unlocks what was already there.

Related lines of inquiry

Sources 8 notes

Papers this line draws on 8