VLMDiff: Leveraging Vision-Language Models for Multi-Class Anomaly Detection with Diffusion
Positive · Artificial Intelligence
VLMDiff marks a notable advance in visual anomaly detection. By integrating a Latent Diffusion Model with a Vision-Language Model, it addresses the challenge of detecting anomalies across diverse, multi-class images. Traditional methods often rely on synthetic noise generation and require a separately trained model per class, which limits scalability. In contrast, VLMDiff uses a pre-trained Vision-Language Model to generate captions of normal images without manual annotation, and conditions the diffusion model on these captions so that it learns robust representations of normal image features; anomalies then surface as regions the conditioned model fails to reconstruct. The approach improves the pixel-level Per-Region-Overlap (PRO) metric by up to 25 points on the Real-IAD dataset and 8 points on the COCO-AD dataset, outperforming state-of-the-art diffusion-based methods. The code is available on GitHub, facilitating adoption and further exploration of the framework.
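The conditioning-and-reconstruction idea described above can be sketched in a few lines. This is an illustrative toy only: VLMDiff's actual pipeline uses a pre-trained VLM captioner and a Latent Diffusion Model, both of which are stubbed out here with hypothetical functions (`vlm_caption_embedding`, `conditioned_reconstruction`) so the end-to-end scoring logic is visible.

```python
import numpy as np

# Toy sketch of caption-conditioned reconstruction for anomaly scoring.
# All components are hypothetical stand-ins, not the VLMDiff implementation.

def vlm_caption_embedding(image):
    """Stub for a VLM embedding of an auto-generated 'normal' caption.
    Here it is just the image's global statistics."""
    return np.array([image.mean(), image.std()])

def conditioned_reconstruction(image, cond):
    """Stub for caption-conditioned diffusion reconstruction: pulls each
    pixel toward the conditioned 'normal' appearance."""
    normal_mean = cond[0]
    return 0.5 * image + 0.5 * normal_mean

def anomaly_map(image):
    """Per-pixel anomaly score = reconstruction error under the
    caption-conditioned model."""
    cond = vlm_caption_embedding(image)
    recon = conditioned_reconstruction(image, cond)
    return np.abs(image - recon)

# A mostly uniform "normal" image with one bright defect pixel.
img = np.full((8, 8), 0.2)
img[3, 4] = 1.0

amap = anomaly_map(img)
print(amap.argmax() == 3 * 8 + 4)  # the defect pixel scores highest
```

The key point the sketch mirrors is that normality is defined by the caption-conditioned reconstruction rather than by per-class training, which is what lets a single model cover many object classes.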
— via World Pulse Now AI Editorial System
