Self-Supervised Alignment with Mutual Information: Learning to Follow Principles without Preference Labels

Paper · arXiv 2404.14313 · Published April 22, 2024

When prompting a language model (LM), users frequently expect the model to adhere to a set of behavioral principles across diverse tasks, such as producing insightful content while avoiding harmful or biased language. Instilling such principles into a model can be resource-intensive and technically challenging, generally requiring human preference labels or examples. We introduce SAMI, a method for teaching a pretrained LM to follow behavioral principles that does not require any preference labels or demonstrations. SAMI is an iterative algorithm that finetunes a pretrained LM to increase the conditional mutual information between constitutions and self-generated responses given queries from a dataset. On single-turn dialogue and summarization, a SAMI-trained mistral-7b outperforms the initial pretrained model, with win rates between 66% and 77%. Strikingly, it also surpasses an instruction-finetuned baseline (mistral-7b-instruct) with win rates between 55% and 57% on single-turn dialogue. SAMI requires a “principle writer” model; to avoid dependence on stronger models, we further evaluate aligning a strong pretrained model (mixtral-8x7b) using constitutions written by a weak instruction-finetuned model (mistral-7b-instruct).
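For reference, the quantity that SAMI is described as increasing can be written with the textbook definition of conditional mutual information. The symbols are notation assumed here for illustration and are not taken from the text above: X is a query drawn from the dataset, C is a constitution, and Y is a response generated by the model.

```latex
I(Y; C \mid X)
  = \mathbb{E}_{p(x,\,c,\,y)}\!\left[ \log \frac{p(y \mid x, c)}{p(y \mid x)} \right]
  = H(Y \mid X) - H(Y \mid X, C)
```

Read this way, the quantity is large when knowing the constitution makes the response much more predictable than knowing the query alone, which is one way of stating that responses reflect the principles they were generated under.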

Introduction. Pretraining yields language models (LMs) with a vast array of knowledge and abilities, but these models are difficult to use because they don’t inherently reflect the values and preferences of human users. To address this issue, various alignment finetuning methods have become crucial for transforming LMs into useful AI assistants [25, 29, 6, inter alia]. The success of these methods raises the question: Why do they work so well? Increasing evidence suggests that alignment finetuning methods expose and amplify aspects of the behavior distribution already implicit in the base pretrained model [e.g., 42, 21]. In this paper we build on this insight: We hypothesize that pretrained base models already have a weak statistical connection between behavioral principles, described in natural language, and the behavior that would realize them. We can encourage this connection by optimizing the conditional mutual information between principles and model responses given queries from a dataset.
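Conditional mutual information cannot be computed exactly for an LM, so in practice it has to be estimated. The sketch below shows one common approach, an InfoNCE-style contrastive estimate, and is offered only as an illustration: this section does not spell out the estimator SAMI uses, and the function names (`contrastive_mi_loss`, `mi_estimate`) and the layout of the `logp` matrix are assumptions. For a single query, `logp[i, j]` is assumed to hold the model's log-probability of response j when conditioned on constitution i, where response j was sampled under constitution j.

```python
import numpy as np


def log_softmax(a, axis):
    """Numerically stable log-softmax along the given axis."""
    a = a - a.max(axis=axis, keepdims=True)
    return a - np.log(np.exp(a).sum(axis=axis, keepdims=True))


def contrastive_mi_loss(logp):
    """Symmetric InfoNCE-style loss over an n x n log-probability matrix.

    Assumed layout (hypothetical, for illustration): logp[i, j] is the
    log-probability of response j under constitution i for one query,
    and response j was generated using constitution j, so the matching
    pairs sit on the diagonal.
    """
    logp = np.asarray(logp, dtype=float)
    n = logp.shape[0]
    if logp.shape != (n, n):
        raise ValueError("logp must be a square matrix")
    diag = np.arange(n)
    # Row direction: for each constitution, pick out its own response.
    row_loss = -log_softmax(logp, axis=1)[diag, diag].mean()
    # Column direction: for each response, pick out its own constitution.
    col_loss = -log_softmax(logp, axis=0)[diag, diag].mean()
    return 0.5 * (row_loss + col_loss)


def mi_estimate(logp):
    """InfoNCE-style estimate in nats: log(n) minus the contrastive loss."""
    n = np.asarray(logp).shape[0]
    return np.log(n) - contrastive_mi_loss(logp)
```

When every entry of `logp` is equal, responses carry no information about which constitution produced them, the loss equals log(n), and the estimate is zero. When the diagonal dominates, the loss approaches zero and the estimate approaches its ceiling of log(n). Minimizing such a loss with respect to the model's parameters is one way to "encourage this connection" in the sense described above.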

Discussion / Conclusion. We proposed a simple method for aligning a pretrained LM with a set of behavioral principles without the need for preference labels or in-context demonstrations. For research purposes, we restricted our experiments to two domains, dialogue and summarization, using a small set of behavioral principles for summarizing Reddit posts or helpful and harmless norms for responding to a wide range of user queries sourced from HH-RLHF. To evaluate the scalability of SAMI to more complex constitutions, future work should include more diverse principles that are representative of personas with diverse preferences [e.g., 9, 22, 12]. Another limitation is that the SAMI loss (Figure 5) requires regularization. Training for too long or failing to regularize can result in forgetting and the model outputting "gibberish", a problem that RLHF faces more generally and that is usually mitigated with a KL-divergence penalty [e.g., 31]. Moreover, SAMI suffers from a length bias similar to other methods, such as DPO.
