DINOv2 Driven Gait Representation Learning for Video-Based Visible-Infrared Person Re-identification

arXiv — cs.CV · Friday, November 7, 2025 at 5:00:00 AM
A new study applies DINOv2 to video-based visible-infrared person re-identification, focusing on gait features as a cue for cross-modal video matching. The work addresses a limitation of existing methods, which often ignore dynamic gait information that can improve the accuracy of identifying individuals across different visual modalities.
— via World Pulse Now AI Editorial System
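
As a rough illustration of the kind of pipeline such work builds on, the sketch below extracts per-frame DINOv2 features from a tracklet and average-pools them into a clip-level descriptor for cosine matching across modalities. Only the DINOv2 torch.hub entry point is the library's real API; the pooling strategy, tensor shapes, and matching step are assumptions for illustration, not the paper's actual method.

```python
# Minimal sketch (not the paper's method): per-frame DINOv2 features
# pooled into a clip descriptor for cross-modal (visible/infrared) matching.
import torch

# DINOv2 backbones are published on torch.hub by facebookresearch/dinov2.
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
model.eval()

@torch.no_grad()
def tracklet_descriptor(frames: torch.Tensor) -> torch.Tensor:
    """frames: (T, 3, H, W), with H and W multiples of 14 (the patch size).
    Returns an L2-normalized (384,) clip-level descriptor."""
    feats = model(frames)        # (T, 384) CLS features, one per frame
    clip = feats.mean(dim=0)     # temporal average pooling over the tracklet
    return clip / clip.norm()    # cosine-ready embedding

# Hypothetical cross-modal matching: cosine similarity between a
# visible-light query clip and an infrared gallery clip (random tensors
# stand in for preprocessed frames here).
vis = tracklet_descriptor(torch.randn(16, 3, 224, 224))
ir = tracklet_descriptor(torch.randn(16, 3, 224, 224))
score = torch.dot(vis, ir).item()
```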

Continue Reading
A Tri-Modal Dataset and a Baseline System for Tracking Unmanned Aerial Vehicles
Positive · Artificial Intelligence
A new dataset named MM-UAV has been introduced, designed for tracking unmanned aerial vehicles (UAVs) using a multi-modal approach that includes RGB, infrared, and event signals. This dataset features over 30 challenging scenarios with 1,321 synchronized sequences and more than 2.8 million annotated frames, addressing the limitations of single-modality tracking in difficult conditions.
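
MM-UAV's actual release format is not described in this summary, so the directory layout and field names below are purely hypothetical; the sketch only illustrates how a synchronized tri-modal sample with RGB, infrared, event data, and a box annotation might be represented.

```python
# Illustrative sketch only: file layout and naming are assumptions,
# not MM-UAV's published format.
from dataclasses import dataclass
from pathlib import Path
import numpy as np

@dataclass
class TriModalFrame:
    rgb: np.ndarray       # (H, W, 3) visible-light frame
    infrared: np.ndarray  # (H, W) thermal frame, registered to RGB
    events: np.ndarray    # (N, 4) event stream: x, y, timestamp, polarity
    bbox: np.ndarray      # (4,) annotated UAV box: x, y, w, h

def load_sequence(root: Path) -> list[TriModalFrame]:
    """Pair frames across modalities by a shared index (assumed convention)."""
    samples = []
    for rgb_path in sorted((root / "rgb").glob("*.npy")):
        idx = rgb_path.stem
        samples.append(TriModalFrame(
            rgb=np.load(rgb_path),
            infrared=np.load(root / "ir" / f"{idx}.npy"),
            events=np.load(root / "events" / f"{idx}.npy"),
            bbox=np.load(root / "anno" / f"{idx}.npy"),
        ))
    return samples
```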
MambaRefine-YOLO: A Dual-Modality Small Object Detector for UAV Imagery
Positive · Artificial Intelligence
MambaRefine-YOLO has been introduced as a dual-modality small object detector specifically designed for Unmanned Aerial Vehicle (UAV) imagery, addressing the challenges of low resolution and background clutter in small object detection. The model incorporates a Dual-Gated Complementary Mamba fusion module (DGC-MFM) and a Hierarchical Feature Aggregation Neck (HFAN), achieving a state-of-the-art mean Average Precision (mAP) of 83.2% on the DroneVehicle dataset.
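
The paper's DGC-MFM is Mamba-based; the simplified PyTorch sketch below shows only the generic dual-gated fusion idea, where each modality produces a gate that controls how much of the other modality's features are mixed in. The module structure and parameter names are invented for illustration and are not the published architecture.

```python
# Simplified illustration of dual-gated cross-modality fusion
# (not the paper's DGC-MFM, which is Mamba-based).
import torch
import torch.nn as nn

class DualGatedFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # One 1x1 conv per direction produces a [0, 1] spatial gate.
        self.gate_rgb = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.gate_ir = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, rgb: torch.Tensor, ir: torch.Tensor) -> torch.Tensor:
        # Each branch keeps its own features plus whatever complementary
        # signal the other modality's gate lets through.
        fused_rgb = rgb + self.gate_ir(ir) * ir    # IR complements RGB
        fused_ir = ir + self.gate_rgb(rgb) * rgb   # RGB complements IR
        return fused_rgb + fused_ir                # merged dual-modality map

fusion = DualGatedFusion(channels=256)
out = fusion(torch.randn(1, 256, 64, 64), torch.randn(1, 256, 64, 64))
```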
DeepCoT: Deep Continual Transformers for Real-Time Inference on Data Streams
Positive · Artificial Intelligence
DeepCoT (Deep Continual Transformers) advances real-time inference on data streams by reducing the high computational cost and redundant computation of existing models. The encoder-only design scales to deep architectures while maintaining performance across audio, video, and text streams.
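
DeepCoT's exact architecture isn't detailed in this summary, so the sketch below illustrates only the general continual-inference idea such models build on: caching keys and values for past stream tokens so each new token costs one attention row over a fixed window rather than a full recomputation. All names, shapes, and the single-head setup are illustrative assumptions.

```python
# Minimal sketch of continual (streaming) attention with a rolling
# key/value cache; not DeepCoT's architecture.
import torch
import torch.nn.functional as F

class StreamingAttention:
    def __init__(self, dim: int, window: int):
        self.w_qkv = torch.randn(dim, 3 * dim) / dim ** 0.5
        self.window = window
        self.k_cache: list[torch.Tensor] = []
        self.v_cache: list[torch.Tensor] = []

    def step(self, x: torch.Tensor) -> torch.Tensor:
        """x: (dim,) newest stream token; returns its attention output."""
        q, k, v = (x @ self.w_qkv).chunk(3)
        self.k_cache.append(k)
        self.v_cache.append(v)
        # Evict the oldest entry once the receptive window is full,
        # keeping per-step cost O(window) regardless of stream length.
        if len(self.k_cache) > self.window:
            self.k_cache.pop(0)
            self.v_cache.pop(0)
        keys = torch.stack(self.k_cache)                        # (<=window, dim)
        attn = F.softmax(q @ keys.T / keys.shape[1] ** 0.5, dim=-1)
        return attn @ torch.stack(self.v_cache)

layer = StreamingAttention(dim=64, window=16)
for t in range(100):                   # simulate a 100-token stream
    out = layer.step(torch.randn(64))  # constant cost per new token
```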