# RD 734055
Self-Attention-Driven Semantic Separation Using Principal Component Clustering in Latent Diffusion Models
Publication date
14/05/2025
Language
English
Paper publication
June 2025 Research Disclosure journal
Digital time stamp
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
Abstract

High-quality annotated datasets are crucial for training semantic segmentation models, yet their manual creation and annotation are labor-intensive and costly. In this paper, we introduce a novel method for generating class-agnostic semantic segmentation masks by leveraging the self-attention maps of latent diffusion models, such as Stable Diffusion. Our approach is entirely learning-free and explores the potential of self-attention maps to produce semantically meaningful segmentation masks. Central to our method is the reduction of individual self-attention information to condense the essential features required for semantic distinction. We employ multiple instances of unsupervised k-means clustering to generate clusters, with increasing cluster counts leading to more specialized semantic abstraction. We evaluate our approach against state-of-the-art models such as Segment Anything (SAM) and Mask2Former, which are trained on extensive datasets of manually annotated masks. Our results, demonstrated on both synthetic and real-world images, show that our method generates high-resolution masks with adjustable granularity, relying solely on the intrinsic scene understanding of the latent diffusion model, without requiring any training or fine-tuning.

## 1 Introduction

Semantic segmentation is a fundamental task in computer vision, with applications ranging from autonomous driving to medical image analysis. However, the process of creating large, annotated datasets to train segmentation models is both time-consuming and costly. This has prompted increasing interest in methods that leverage existing data, models, or mechanisms to bypass the need for data creation and manual annotation. Generative models, particularly diffusion-based models like Stable Diffusion 2.1 (Rombach et al., 2022a), have shown remarkable capabilities in generating detailed and coherent images, yet their potential to assist in generating segmentation masks remains underexplored.

In this work, we investigate the intrinsic ability of diffusion models to produce class-agnostic semantic segmentation masks without any modification to the models themselves or reliance on additional pre-trained networks (see Figure 1). Specifically, we exploit the self-attention mechanisms embedded in latent diffusion models, which are designed to enhance image generation quality by capturing relationships between different parts of the image (Hong et al., 2023). While self-attention has been used in previous efforts to create segmentation masks, it has not been fully explored at the granularity of individual attention heads. We hypothesize that the self-attention heads within these models encode sufficient information about image structure and content, enabling the segmentation of distinct regions with semantically meaningful boundaries, without the need for external supervision.

Previous methods, such as (Nguyen et al., 2023) and (Tian et al., 2024), typically aggregate self-attention maps by averaging or summing over attention heads and/or features to manage the large tensor sizes involved. In contrast, our approach leverages the individual multi-head self-attention maps independently, preserving their distinct objectives and enabling the derivation of more fine-grained semantic masks.
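To make the clustering stage concrete, the following is a minimal sketch under stated assumptions: it presumes the per-head self-attention maps of one UNet layer have already been captured as a `(heads, tokens, tokens)` array, and the PCA dimensionality and cluster counts shown are illustrative placeholders, not settings reported in the paper.

```python
# Sketch: derive multi-granularity masks from per-head self-attention.
# Assumes `attn` was captured from a Stable Diffusion UNet layer; shapes,
# n_components, and cluster_counts are illustrative, not the paper's values.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def masks_from_attention(attn, n_components=32, cluster_counts=(2, 4, 8)):
    """attn: (heads, tokens, tokens), tokens = h * w latent positions."""
    heads, tokens, _ = attn.shape
    # Keep heads separate: each spatial position is described by the
    # concatenation of its attention rows across all heads.
    features = attn.transpose(1, 0, 2).reshape(tokens, heads * tokens)
    # Principal component reduction condenses the per-head information.
    reduced = PCA(n_components=n_components).fit_transform(features)
    # One unsupervised k-means run per granularity: larger k, finer masks.
    return {k: KMeans(n_clusters=k, n_init=10).fit_predict(reduced)
            for k in cluster_counts}

# Toy example on a random stand-in for a 16x16 latent map with 8 heads.
attn = np.random.rand(8, 256, 256).astype(np.float32)
for k, labels in masks_from_attention(attn).items():
    print(k, labels.reshape(16, 16).shape)  # cluster ids per latent pixel
```

The key design choice echoed here is that the head axis is preserved rather than averaged away, which is what distinguishes this approach from the aggregation strategies of prior work.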
Our main contributions are as follows:

- Head-Wise Self-Attention Analysis: We conduct a detailed analysis of the individual self-attention maps from each head in Stable Diffusion, demonstrating how they contribute to semantic separation within an image.
- Class-Agnostic Mask Generation: We propose a novel method for generating semantic segmentation masks across multiple levels of granularity, ranging from coarse to fine, directly from the self-attention features of the diffusion model.
- Zero-Shot Segmentation: We validate our approach in the context of zero-shot segmentation, showcasing the ability to interpret and semantically segment real-world images without any prior training or fine-tuning.

Figure 1: Our method generates class-agnostic yet semantically meaningful segmentation masks. The highlighted pixel (marked by a star in the upper left image) can be associated with various semantic categories, such as left eye, eyes, face, cat, and foreground. These segmentation masks are produced solely through the self-attention mechanism of Stable Diffusion, without relying on any external image features.

## 2 Related Work

Numerous text-to-image diffusion models have been developed to generate images from textual prompts, with notable examples including DALL-E 3 (Betker et al., 2023), Imagen (Saharia et al., 2022), Muse (Chang et al., 2023), and Stable Diffusion (Rombach et al., 2022b). Among these, Stable Diffusion stands out as an open-source model capable of synthesizing high-resolution images containing multiple objects in one scene. This is achieved by encoding the input text into a latent space, where a diffusion process is applied using a denoising network. The final image is then reconstructed through a decoder; a minimal generation call illustrating this pipeline is sketched at the end of this section.

Previous works have explored the role of self-attention in generative models, particularly diffusion-based models. For instance, (Vaswani et al., 2023) exam...
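As a concrete illustration of the text-to-latent-to-image pipeline described above (text encoder, latent denoising network, decoder), a minimal generation call via the Hugging Face diffusers library might look as follows; the library choice, model identifier, and settings are assumptions for illustration, not details taken from the paper.

```python
# Illustrative only: the pipeline wraps text encoding, latent denoising
# (UNet), and VAE decoding. Model id and settings are assumptions.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")
image = pipe("a cat sitting on a windowsill", num_inference_steps=30).images[0]
image.save("cat.png")
```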