SYNTHESIS NOTE

Can models learn behavioral principles without preference labels?

Can alignment happen by amplifying the latent connection between stated principles and model behavior, rather than relying on expensive human preference annotations? This explores whether information-theoretic objectives could replace the preference-labeling bottleneck.

Synthesis note · 2026-06-03 · sourced from Synthetic Dialog

Instilling behavioral principles in a language model usually requires human preference labels or demonstrations, which are expensive to collect and technically demanding to use. SAMI (Self-Supervised Alignment with Mutual Information) builds on the insight that alignment methods mostly expose and amplify behavior already implicit in the base model: a pretrained model already carries a weak statistical connection between behavioral principles stated in natural language and the behavior that realizes them. SAMI is an iterative algorithm that finetunes the LM to increase the conditional mutual information between constitutions and self-generated responses, given queries, and it requires no preference labels and no demonstrations. A SAMI-trained mistral-7b beats the base model (66–77% win rates) and even surpasses the instruction-tuned baseline on single-turn dialogue. Strikingly, a weak instruction-tuned model can write the constitution used to align a stronger base model.
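The objective described above can be made concrete with a small sketch. The note does not say how the conditional mutual information is estimated, so this assumes an InfoNCE-style contrastive bound computed over a grid of log-likelihoods for a single query. The function names and the toy numbers are hypothetical and chosen only for illustration; in a real setup the log-likelihoods would come from the language model being finetuned.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]


def symmetric_infonce_loss(logp):
    """Contrastive loss over a square grid of log-likelihoods.

    logp[i][j] is assumed to be log p(response_j | query, constitution_i),
    with response_j sampled under constitution_j, so matched pairs sit on
    the diagonal.
    """
    n = len(logp)
    # Row direction: given constitution i, identify its own response.
    row_loss = -sum(log_softmax(logp[i])[i] for i in range(n)) / n
    # Column direction: given response j, identify its own constitution.
    cols = [[logp[i][j] for i in range(n)] for j in range(n)]
    col_loss = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (row_loss + col_loss)


def mi_lower_bound(logp):
    """InfoNCE-style estimate in nats: log(N) minus the contrastive loss."""
    return math.log(len(logp)) - symmetric_infonce_loss(logp)


# Toy grids with made-up numbers, two constitutions and two responses each.
# "weak": responses are equally likely under either constitution.
weak = [[-10.0, -10.0], [-10.0, -10.0]]
# "strong": each response is far more likely under its own constitution.
strong = [[-5.0, -15.0], [-15.0, -5.0]]

print(mi_lower_bound(weak))
print(mi_lower_bound(strong))
```

Minimizing the loss pushes the estimate up, which is one way to read "increase the conditional mutual information": responses become easier to attribute to the constitution they were generated under. The estimate is capped at log(N), so larger grids are needed to certify larger amounts of information.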

The part worth keeping is the mechanism: alignment as the amplification of a latent principle-behavior correlation through an information-theoretic objective, which sidesteps the preference-label bottleneck. The weak-to-strong direction (a weak constitution writer, a strong aligned model) is also a notable signal for scalable oversight.
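For reference, the quantity being increased can be written with the standard definition of conditional mutual information. The symbols are assumptions introduced here for illustration, since the note gives no notation: $C$ is the constitution, $X$ the query, and $Y$ the self-generated response.

```latex
I(Y; C \mid X) = \mathbb{E}_{x, c, y}\left[ \log \frac{p(y \mid x, c)}{p(y \mid x)} \right]
```

This quantity is zero when responses are independent of the constitution given the query, and it grows as responses become more specific to the constitution they were generated under. That is the sense in which raising it amplifies a principle-behavior correlation.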

This sits in the vault's alignment-without-labels thread. It is a constitutional-style method that, like "Can models learn to ask better clarifying questions through self-improvement?" and the verifier-free RL family, removes a human-supervision dependency. It also presupposes the latent-capability premise that "Do base models already contain hidden reasoning ability?" makes for reasoning, applied here to behavioral principles.

Inquiring lines that read this note 9

This note is a source for these research framings, grouped by the broader line of inquiry each explores. Scan the bold lines of inquiry; follow any specific question forward.

Do base models contain latent reasoning that training can unlock?

Do base models already contain latent behavioral principles waiting to be amplified?

How can AI alignment serve diverse human preferences at scale?

Can alternative training methods improve on supervised fine-tuning for language models?

Why do benchmark improvements fail to reflect actual reasoning quality?

How do static benchmarks fail to capture human preference alignment?

Why does supervised fine-tuning improve accuracy while degrading reasoning quality?

How does preference learning differ from supervised finetuning for reasoning?

Related concepts in this collection 3

Do base models already contain hidden reasoning ability? Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
SAMI's premise: alignment amplifies a latent principle-behavior connection already in the base model
Can models learn to ask better clarifying questions through self-improvement? This explores whether question-asking is a trainable skill that improves when models are rewarded for questions that lead to better answers. It matters because asking good clarifying questions could help AI systems handle underspecified user requests.
sibling method removing a human-supervision dependency via self-improvement
Are RLHF annotations actually measuring genuine human preferences? RLHF trains on annotation responses as stable preferences, but behavioral science shows humans often construct answers without holding real opinions. Does this measurement gap undermine the entire approach?
SAMI sidesteps the preference-elicitation step whose validity that note questions
