AVAR-Net: A Lightweight Audio-Visual Anomaly Recognition Framework with a Benchmark Dataset

arXiv (cs.CV), Wednesday, November 12, 2025 at 5:00:00 AM
AVAR-Net is a lightweight audio-visual framework for anomaly recognition, a task central to applications such as surveillance and healthcare. Traditional methods often falter under difficult conditions because they rely on visual data alone, which occlusion and low light can degrade. Alongside the framework, the study introduces the Visual-Audio Anomaly Recognition (VAAR) dataset, consisting of 3,000 videos across ten anomaly classes. The framework pairs a Wav2Vec2 audio feature extractor with a MobileViT video feature extractor and combines the two modalities through an early fusion mechanism. A Multi-Stage Temporal Convolutional Network (MTCN) learns long-range temporal dependencies over the fused features to improve recognition accuracy. Together, AVAR-Net and the VAAR dataset are a step towards more reliable multimodal anomaly detection, addressing the pressing need …
— via World Pulse Now AI Editorial System
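
To make the pipeline described above more concrete, here is a minimal, dependency-free sketch of early fusion followed by stacked temporal convolutions. It is not the authors' implementation. Everything in it is an assumption for illustration: the toy feature vectors stand in for Wav2Vec2 and MobileViT embeddings, fusion is taken to be plain per-time-step concatenation, and a small stack of dilated 1D convolutions stands in for the MTCN, whose actual design the summary does not specify. The function names (early_fusion, dilated_temporal_conv, receptive_field) are hypothetical.

```python
def early_fusion(audio_feats, video_feats):
    """Concatenate per-time-step audio and video feature vectors.

    Assumption: both modalities are already aligned to the same number
    of time steps. Each argument is a list of T feature vectors (lists).
    """
    if len(audio_feats) != len(video_feats):
        raise ValueError("modalities must have the same number of time steps")
    return [a + v for a, v in zip(audio_feats, video_feats)]


def dilated_temporal_conv(seq, kernel, dilation):
    """Apply a 1D convolution over time with zero padding (output length = T).

    seq: list of T feature vectors.
    kernel: odd-length list of scalar weights, shared across all channels.
    dilation: spacing (in time steps) between kernel taps.
    """
    num_steps = len(seq)
    dim = len(seq[0])
    half = len(kernel) // 2
    out = []
    for t in range(num_steps):
        acc = [0.0] * dim
        for k, w in enumerate(kernel):
            src = t + (k - half) * dilation
            if 0 <= src < num_steps:  # taps outside the sequence read as zero
                for d in range(dim):
                    acc[d] += w * seq[src][d]
        out.append(acc)
    return out


def receptive_field(kernel_size, dilations):
    """Number of input time steps one output step can see after the stack."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Toy stand-ins for extractor outputs: 8 time steps,
# 1-dim "audio" features and 2-dim "video" features.
audio = [[float(t)] for t in range(8)]
video = [[1.0, 2.0] for _ in range(8)]

fused = early_fusion(audio, video)  # 8 vectors of dimension 3

# Stack three layers with growing dilation (an assumed configuration).
x = fused
for dilation in (1, 2, 4):
    x = dilated_temporal_conv(x, [1 / 3, 1 / 3, 1 / 3], dilation)

print(len(x), len(x[0]), receptive_field(3, [1, 2, 4]))
```

Under these assumptions, doubling the dilation at each layer grows the temporal receptive field quickly (15 time steps for three layers with kernel size 3), which is one common way temporal convolutional networks capture longer-range dependencies without a large parameter count.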
