SYNTHESIS NOTE
Reasoning, Retrieval, and Evaluation · Training, RL, and Test-Time Scaling · Psychology, Society, and Alignment

Can models learn behavioral principles without preference labels?

Can alignment happen by amplifying the latent connection between stated principles and model behavior, rather than relying on expensive human preference annotations? This note explores whether an information-theoretic objective could remove the preference-labeling bottleneck.

Synthesis note · 2026-06-03 · sourced from Synthetic Dialog

Instilling behavioral principles usually requires human preference labels or demonstrations, which are both expensive and technically demanding. SAMI builds on the insight that alignment methods mostly expose and amplify behavior already implicit in the base model: a pretrained model already carries a weak statistical connection between behavioral principles stated in natural language and the behavior that realizes them. SAMI is an iterative algorithm that finetunes the LM to increase the conditional mutual information between constitutions and self-generated responses, given queries, and it requires neither preference labels nor demonstrations. A SAMI-trained mistral-7b beats the base model (66–77% win rates) and even surpasses the instruction-tuned baseline on single-turn dialogue. Strikingly, a weak instruction-tuned model can write the constitution used to align a stronger base model.
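For reference, the quantity named above can be written out using the standard definition of conditional mutual information. The symbols are assumptions introduced here for illustration and do not come from the note: x is a query, c a constitution, y a self-generated response, and p_θ the finetuned LM.

```latex
I(C; Y \mid X)
  = \mathbb{E}_{x,\,c,\,y}\!\left[
      \log \frac{p_\theta(y \mid x, c)}{p_\theta(y \mid x)}
    \right],
\qquad
p_\theta(y \mid x) = \sum_{c} p(c \mid x)\, p_\theta(y \mid x, c)
```

The ratio inside the expectation is large when a response is much more likely under its own constitution than under constitutions on average, which is the principle-behavior connection the objective is meant to amplify.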

The keeper is the mechanism: alignment as amplifying a latent principle-behavior correlation via an information-theoretic objective, which sidesteps the preference-label bottleneck. The weak-to-strong direction (a weak constitution-writer, a stronger aligned model) is also a notable scalable-oversight signal.
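To make the information-theoretic objective concrete, here is a minimal sketch of one standard way to estimate it: a symmetric InfoNCE-style contrastive loss, which gives a lower bound on mutual information. The note does not say which estimator SAMI uses, so treat this as an illustrative stand-in. The function names and the `logp` matrix layout are hypothetical; `logp[i][j]` is assumed to hold the LM's log-probability of response j (generated under constitution j) when conditioned on constitution i, for a single shared query.

```python
import math


def log_softmax(values):
    """Numerically stable log-softmax over a list of floats."""
    m = max(values)
    lse = m + math.log(sum(math.exp(v - m) for v in values))
    return [v - lse for v in values]


def contrastive_mi_loss(logp):
    """Symmetric InfoNCE-style loss over a square score matrix.

    Assumption: logp[i][j] = log p(response_j | query, constitution_i),
    where response_j was sampled under constitution_j. The diagonal
    entries are therefore the "matching" pairs.
    """
    n = len(logp)
    # Row direction: given a constitution, identify its own response.
    row_loss = -sum(log_softmax(logp[i])[i] for i in range(n)) / n
    # Column direction: given a response, identify its own constitution.
    cols = [[logp[i][j] for i in range(n)] for j in range(n)]
    col_loss = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (row_loss + col_loss)


def mi_lower_bound(logp):
    """InfoNCE bound: mutual information >= log(n) - loss."""
    return math.log(len(logp)) - contrastive_mi_loss(logp)
```

When responses are equally likely under every constitution, the loss equals log(n) and the bound is zero. Finetuning to lower the loss pushes each response to be distinctly more likely under its own constitution, and no preference label enters the computation.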

This sits in the vault's alignment-without-labels thread. It is a constitutional-style method that, like "Can models learn to ask better clarifying questions through self-improvement?" and the verifier-free RL family, removes a human-supervision dependency. It also presupposes the latent-capability premise that "Do base models already contain hidden reasoning ability?" makes for reasoning, applied here to behavioral principles.

Original note title

a model can be aligned to behavioral principles without preference labels by maximizing mutual information between a constitution and its responses