arXiv:2512.10607v1 Announce Type: new 
Abstract: We propose Track and Caption Any Motion (TCAM), a motion-centric framework for automatic video understanding that discovers and describes motion patterns without user queries. Understanding videos in challenging conditions like occlusion, camouflage, or rapid movement often depends more on motion dynamics than static appearance. TCAM autonomously observes a video, identifies multiple motion activities, and spatially grounds each natural language description to its corresponding trajectory through a motion-field attention mechanism. Our key insight is that motion patterns, when aligned with contrastive vision-language representations, provide powerful semantic signals for recognizing and describing actions. Through unified training that combines global video-text alignment with fine-grained spatial correspondence, TCAM enables query-free discovery of multiple motion expressions via multi-head cross-attention. On the MeViS benchmark, TCAM achieves 58.4% video-to-text retrieval, 64.9 J&F for spatial grounding, and discovers 4.8 relevant expressions per video with 84.7% precision, demonstrating strong cross-task generalization.
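
The abstract mentions aligning video and text through contrastive vision-language representations but does not spell out the objective. As a rough illustration only, the sketch below shows a generic CLIP-style symmetric contrastive (InfoNCE) loss over paired video and text embeddings. It is not TCAM's actual training loss: the function name, the temperature value, and the embedding shapes are all assumptions made for this example.

```python
import numpy as np


def symmetric_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Generic CLIP-style symmetric InfoNCE loss (hypothetical helper, not from the paper).

    video_emb, text_emb: arrays of shape (batch, dim); row i of each is a matched pair.
    temperature: assumed value, chosen only for illustration.
    """
    # L2-normalise rows so the dot product equals cosine similarity.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix, shape (batch, batch); matched pairs lie on the diagonal.
    logits = (v @ t.T) / temperature

    def cross_entropy_on_diagonal(lg):
        # Numerically stable row-wise log-softmax, then pick the diagonal targets.
        lg = lg - lg.max(axis=1, keepdims=True)
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the video-to-text and text-to-video directions.
    return 0.5 * (cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(logits.T))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.normal(size=(4, 8))
    text = video + 0.01 * rng.normal(size=(4, 8))  # nearly matched pairs
    print(symmetric_contrastive_loss(video, text))
```

With this kind of objective, correctly paired embeddings yield a lower loss than mismatched ones, which is the property that makes video-to-text retrieval by nearest-neighbour similarity possible.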

A new framework named Track and Caption Any Motion (TCAM) has been proposed for automatic video understanding, which identifies and describes motion patterns without the need for user queries. TCAM utilizes a motion-field attention mechanism to ground natural language descriptions to corresponding motion trajectories, enhancing video analysis in challenging conditions such as occlusion and rapid movement.

Track and Caption Any Motion: Query-Free Motion Discovery and Description in Videos
