arXiv:2311.02733v2 Announce Type: replace-cross 
Abstract: Multimodal manipulations (also known as audio-visual deepfakes) make it difficult for unimodal deepfake detectors to detect forgeries in multimedia content. To avoid the spread of false propaganda and fake news, timely detection is crucial. The damage to either modality (i.e., visual or audio) can only be discovered through multimodal models that can exploit both pieces of information simultaneously. However, previous methods mainly adopt unimodal video forensics and use supervised pre-training for forgery detection. This study proposes a new method based on a multimodal self-supervised-learning (SSL) feature extractor to exploit inconsistency between audio and visual modalities for multimodal video forgery detection. We use the transformer-based SSL pre-trained Audio-Visual HuBERT (AV-HuBERT) model as a visual and acoustic feature extractor and a multi-scale temporal convolutional neural network to capture the temporal correlation between the audio and visual modalities. Since AV-HuBERT only extracts visual features from the lip region, we also adopt another transformer-based video model to exploit facial features and capture spatial and temporal artifacts caused during the deepfake generation process. Experimental results show that our model outperforms all existing models and achieves new state-of-the-art performance on the FakeAVCeleb and DeepfakeTIMIT datasets.
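
The idea of scoring audio-visual inconsistency at several temporal scales can be illustrated with a toy sketch. This is not the authors' model: it does not use AV-HuBERT, and fixed moving averages stand in for the learned multi-scale temporal convolutions described in the abstract. The function names, the cosine-based per-frame mismatch, and the window sizes are all assumptions made for illustration, and the inputs are assumed to be per-frame audio and visual feature vectors that already share a dimension and a frame rate.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero frame as uninformative
    return dot / (norm_a * norm_b)


def moving_average(values, window):
    """Centered moving average; windows are truncated at the sequence edges."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def inconsistency_score(audio_feats, visual_feats, windows=(1, 3, 5)):
    """Toy mismatch score, roughly in [0, 2]; higher means less agreement.

    Hypothetical helper for illustration only, not the paper's detector.
    """
    if not audio_feats or len(audio_feats) != len(visual_feats):
        raise ValueError("need two non-empty sequences with the same frame count")
    # Per-frame disagreement: 1 - cosine similarity of the two modalities.
    per_frame = [1.0 - cosine_similarity(a, v)
                 for a, v in zip(audio_feats, visual_feats)]
    # Smooth at several temporal scales and keep the worst value at each scale.
    peaks = [max(moving_average(per_frame, w)) for w in windows]
    return sum(peaks) / len(peaks)


if __name__ == "__main__":
    synced = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
    swapped = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
    print("matching streams:  ", inconsistency_score(synced, synced))
    print("mismatched streams:", inconsistency_score(synced, swapped))
```

Short windows react to a single out-of-sync frame, while longer windows respond only to sustained mismatch. A learned model would replace both the cosine mismatch and the fixed averages with trained layers.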

A new method called AV-Lip-Sync+ uses the AV-HuBERT model to detect multimodal manipulations in frontal face videos, addressing the challenges posed by audio-visual deepfakes. It relies on a self-supervised learning feature extractor to identify inconsistencies between the audio and visual streams, which lets it catch forgeries that traditional unimodal detectors miss.

AV-Lip-Sync+: Leveraging AV-HuBERT to Exploit Multimodal Inconsistency for Deepfake Detection of Frontal Face Videos
