Nested Attention: Semantic-aware Attention Values for Concept Personalization

Paper · arXiv 2501.01407 · Published January 2, 2025
Cognitive Models and Latent Representations · Multimodal Models

Personalizing text-to-image models to generate images of specific subjects across diverse scenes and styles is a rapidly advancing field. Current approaches often face challenges in maintaining a balance between identity preservation and alignment with the input text prompt. Some methods rely on a single textual token to represent a subject, which limits expressiveness, while others employ richer representations but disrupt the model’s prior, diminishing prompt alignment. In this work, we introduce Nested Attention, a novel mechanism that injects a rich and expressive image representation into the model’s existing cross-attention layers. Our key idea is to generate query-dependent subject values, derived from nested attention layers that learn to select relevant subject features for each region in the generated image. We integrate these nested layers into an encoder-based personalization method, and show that they enable high identity preservation while adhering to input text prompts. Our approach is general and can be trained on various domains. Additionally, its prior preservation allows us to combine multiple personalized subjects from different domains in a single image.

Introduction. Personalization of text-to-image models [12, 22, 32, 37] enables users to generate captivating images featuring their own personal data. To introduce new subjects into the text-to-image model, initial approaches conduct per-subject optimization [15, 27, 40], achieving impressive results but requiring several minutes to capture each subject. To reduce this overhead, more recent approaches train image encoders [4, 16, 17, 19, 41, 49, 51, 52, 54, 55]. These encoders embed the subject into a latent representation, which is then used in conjunction with diverse text prompts to generate images of the subject in multiple contexts. A key challenge in personalizing text-to-image models is balancing identity preservation and prompt alignment [5, 17, 19, 55]. Most encoder-based works [17, 19, 52, 53, 55] tackle personalization by encoding the subject into a large number of visual tokens which are injected into the diffusion model using new cross-attention layers.

Discussion / Conclusion. We introduced nested attention, a novel identity injection technique that provides a rich subject representation within the existing cross-attention layers of the model. It is based on two key principles: (i) modifying only the attention value of the subject token while keeping keys and other values unchanged, and (ii) making the subject token’s attention value dependent on the query, i.e., assigning the subject a different value for each image region. In this sense, nested attention can be interpreted as an IP-Adapter that anchors the subject’s encoding to a single textual token. This design better preserves the model’s prior, while enabling a detailed and accurate representation of the subject. Future work could explore adaptations of nested attention to other tasks, such as subject-conditioned inpainting or style transfer. Another promising direction involves extending the encoder to a domain-agnostic approach, which could tackle subjects from unseen classes.
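The two principles above can be illustrated with a small sketch. This is not the authors' implementation: it is a minimal pure-Python toy that assumes single-head scaled dot-product attention, and every name in it (`nested_cross_attention`, `nested_keys`, `nested_values`, `subj_idx`) is hypothetical. It also assumes the nested keys and values are already given as plain vectors; how they are obtained from the image encoder is not shown here.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def weighted_sum(weights, vectors):
    """Sum of vectors[i] scaled by weights[i]."""
    dim = len(vectors[0])
    return [sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(dim)]

def attend(q, keys, values):
    """Scaled dot-product attention for one query; returns (output, weights)."""
    scale = math.sqrt(len(q))  # assumed scaling, as in standard attention
    weights = softmax([dot(q, k) / scale for k in keys])
    return weighted_sum(weights, values), weights

def nested_cross_attention(queries, text_keys, text_values, subj_idx,
                           nested_keys, nested_values):
    """Toy cross-attention where only the subject token's value is replaced,
    and the replacement is recomputed for every query (image region)."""
    outputs = []
    for q in queries:
        # Nested layer: this region's query selects its own mix of subject features.
        subj_value, _ = attend(q, nested_keys, nested_values)
        # Outer layer: keys and all other values stay untouched;
        # only the value at the subject token's position is swapped.
        values = list(text_values)
        values[subj_idx] = subj_value
        out, _ = attend(q, text_keys, values)
        outputs.append(out)
    return outputs

# Two image-region queries, three text tokens (token 1 is the subject token).
queries = [[1.0, 0.0], [0.0, 1.0]]
text_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
nested_keys = [[2.0, 0.0], [0.0, 2.0]]
nested_values = [[10.0, 0.0], [0.0, 10.0]]
outputs = nested_cross_attention(queries, text_keys, text_values, 1,
                                 nested_keys, nested_values)
```

Because the text keys are never modified, the outer attention weights are identical to those of the unmodified layer, which is one way to read the claim that the design preserves the model's prior. If every row of `nested_values` equals the original subject value, the sketch reduces exactly to ordinary cross-attention.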