arXiv:2511.08031v1 Announce Type: new 
Abstract: The rapid advancement of generative adversarial networks (GANs) and diffusion models has enabled the creation of highly realistic deepfake content, posing significant threats to digital trust across audio-visual domains. While unimodal detection methods have made progress in identifying synthetic media, they cannot exploit cross-modal correlations or precisely localize forged segments, which limits their practicality against sophisticated, fine-grained manipulations. To address this, we introduce a multi-modal deepfake detection and localization framework based on a Feature Pyramid-Transformer (FPN-Transformer), targeting critical gaps in cross-modal generalization and temporal boundary regression. The proposed approach uses pre-trained self-supervised models (WavLM for audio, CLIP for video) to extract hierarchical temporal features. A multi-scale feature pyramid is constructed through R-TLM blocks with localized attention mechanisms, enabling joint analysis of cross-context temporal dependencies. A dual-branch prediction head simultaneously predicts forgery probabilities and refines the temporal offsets of manipulated segments, achieving frame-level localization precision. We evaluate our approach on the test set of the IJCAI'25 DDL-AV benchmark, where it achieves a final score of 0.7535 for cross-modal deepfake detection and localization in challenging environments. Experimental results confirm the effectiveness of our approach and offer a new route toward generalized deepfake detection. Our code is available at https://github.com/Zig-HS/MM-DDL
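
To make the dual-branch head more concrete, the sketch below shows one common way such outputs are turned into time segments: each time step carries a forgery probability and a pair of offsets to the segment boundaries, and overlapping candidates are pruned with temporal non-maximum suppression. This is an illustration under stated assumptions, not the authors' implementation. The abstract does not describe the decoding step, so the function names, the 0.5 thresholds, the offset convention (offsets measured in time steps from the current position), and the use of NMS are all hypothetical.

```python
from typing import List, Tuple

# (start_sec, end_sec, score); a hypothetical output format, not from the paper.
Segment = Tuple[float, float, float]


def decode_segments(
    probs: List[float],
    offsets: List[Tuple[float, float]],
    stride_sec: float,
    threshold: float = 0.5,
) -> List[Segment]:
    """Convert per-time-step head outputs into candidate forged segments.

    probs[t]   : forgery probability at time step t (classification branch).
    offsets[t] : (distance to segment start, distance to segment end),
                 in time steps (regression branch). Assumed convention.
    stride_sec : duration of one time step at this pyramid level, in seconds.
    """
    segments = []
    for t, (p, (d_start, d_end)) in enumerate(zip(probs, offsets)):
        if p < threshold:
            continue  # skip time steps the classifier considers genuine
        center = t * stride_sec
        start = max(0.0, center - d_start * stride_sec)
        end = center + d_end * stride_sec
        segments.append((start, end, p))
    return segments


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two 1-D time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def nms(segments: List[Segment], iou_threshold: float = 0.5) -> List[Segment]:
    """Greedy temporal NMS: keep the highest-scoring non-overlapping segments."""
    kept: List[Segment] = []
    for seg in sorted(segments, key=lambda s: s[2], reverse=True):
        if all(temporal_iou(seg, k) < iou_threshold for k in kept):
            kept.append(seg)
    return kept


if __name__ == "__main__":
    # Toy outputs for four time steps of 0.5 s each (made-up numbers).
    probs = [0.1, 0.9, 0.8, 0.2]
    offsets = [(0.0, 0.0), (1.0, 2.0), (2.0, 1.0), (0.0, 0.0)]
    print(nms(decode_segments(probs, offsets, stride_sec=0.5)))
```

In a multi-scale setting, `decode_segments` would be called once per pyramid level with that level's `stride_sec`, and the pooled candidates passed through `nms` together, so that coarse levels can cover long forged spans while fine levels cover short ones.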

A new framework for detecting and localizing deepfakes, called FPN-Transformer, has been introduced to address the limitations of existing unimodal methods. This multi-modal approach leverages self-supervised models for audio and video, enhancing the accuracy of deepfake detection. The framework's effectiveness was confirmed through experimental results, achieving a score of 0.7535 in the IJCAI'25 DDL-AV benchmark, highlighting its potential to improve digital trust in media.

Multi-modal Deepfake Detection and Localization with FPN-Transformer
