Rewards-in-Context: Multi-objective Alignment of Foundation Models with Dynamic Preference Adjustment
We consider the problem of multi-objective alignment of foundation models with human preferences, which is a critical step towards helpful and harmless AI systems. However, it is generally costly and unstable to fine-tune large foundation models using reinforcement learning (RL), and the multi-dimensionality, heterogeneity, and conflicting nature of human preferences further complicate the alignment process. In this paper, we introduce Rewards-in-Context (RiC), which conditions the response of a foundation model on multiple rewards in its prompt context and applies supervised fine-tuning for alignment. The salient features of RiC are simplicity and adaptivity, as it only requires supervised fine-tuning of a single foundation model and supports dynamic adjustment to user preferences at inference time. Inspired by the analytical solution of an abstracted convex optimization problem, our dynamic inference-time adjustment method approaches the Pareto-optimal solution for multiple objectives. Empirical evidence demonstrates the efficacy of our method in aligning both Large Language Models (LLMs) and diffusion models to accommodate diverse rewards using only around 10% of the GPU hours required by the multi-objective RL baseline.
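To make the idea of conditioning on rewards in the prompt context concrete, the following is a minimal sketch and not the exact format used in our implementation. The function name `build_example`, the tag syntax `<name> score`, the `Response:` marker, and the reward names are all hypothetical choices made for illustration. During supervised fine-tuning, each training sample would carry the scores its response actually received; at inference time, the same template would be filled with the rewards the user desires.

```python
from typing import Dict, Optional


def build_example(prompt: str,
                  rewards: Dict[str, float],
                  response: Optional[str] = None) -> str:
    """Attach reward tags to a prompt (hypothetical template).

    rewards maps a reward-model name to a scalar score. Names are sorted
    so that the tag order is the same for every sample.
    """
    tags = " ".join(
        f"<{name}> {score:.1f}" for name, score in sorted(rewards.items())
    )
    text = f"{prompt} {tags} Response:"
    if response is not None:
        # Training sample: append the response that earned these scores.
        text += f" {response}"
    return text


# Training-time usage: scores come from reward models evaluated offline.
train_text = build_example(
    "How do I bake bread?",
    {"helpful": 0.9, "harmless": 0.4},
    response="Mix flour, water, yeast, and salt.",
)

# Inference-time usage: the user chooses the desired reward levels.
infer_text = build_example(
    "How do I bake bread?",
    {"helpful": 0.5, "harmless": 1.0},
)
```

Because the preference enters only through the prompt text, changing the trade-off between objectives at inference time requires no additional training, only different tag values.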
Introduction. Foundation models (Radford et al., 2018; Devlin et al., 2018; Radford et al., 2019; Brown et al., 2020; Kaplan et al., 2020; Caron et al., 2021; Nichol et al., 2021) are predominantly pretrained on vast, internet-scale data using self-supervised learning techniques, and subsequently fine-tuned for specific downstream tasks through supervised learning. However, this conventional approach may not align optimally with human preferences and values (Sun et al., 2023). Recent advancements (Ziegler et al., 2019; Stiennon et al., 2020; Ouyang et al., 2022; Bai et al., 2022; OpenAI, 2023) have demonstrated success in aligning language models with reinforcement learning from human feedback (RLHF). In RLHF, a reward model is often used to provide supervision for reinforcement learning (Ouyang et al., 2022). However, human preferences are inherently heterogeneous and multi-dimensional, and can often conflict with one another, as in the tension between harmlessness and helpfulness (Bai et al., 2022; Rame et al., 2023).
Discussion / Conclusion. Aligning Large Language Models (LLMs) with heterogeneous objectives is essential for customization across various applications and for meeting diverse human preferences. In this paper, we present RiC, a highly scalable solution that employs only supervised fine-tuning and a simple inference-time adaptation method to tackle the multi-objective alignment problem. This approach significantly reduces the computational cost while demonstrating strong alignment performance for a range of objectives. Looking ahead, we aim to investigate more context-dependent inference-time adaptations to enhance alignment performance further. We hope our work can inspire further research into scalable algorithms for multi-objective alignment, thereby facilitating the customization of AI.