arXiv:2510.21406v1 Announce Type: new 
Abstract: We propose the Multi-modal Untrimmed Video Retrieval task, along with a new benchmark (MUVR) to advance video retrieval for long-video platforms. MUVR aims to retrieve untrimmed videos containing relevant segments using multi-modal queries. It has the following features: 1) Practical retrieval paradigm: MUVR supports video-centric multi-modal queries, expressing fine-grained retrieval needs through long text descriptions, video tag prompts, and mask prompts. It adopts a one-to-many retrieval paradigm and focuses on untrimmed videos, tailored for long-video platform applications. 2) Multi-level visual correspondence: To cover common video categories (e.g., news, travel, dance) and precisely define retrieval matching criteria, we construct multi-level visual correspondence based on core video content (e.g., news events, travel locations, dance moves) which users are interested in and want to retrieve. It covers six levels: copy, event, scene, instance, action, and others. 3) Comprehensive evaluation criteria: We develop 3 versions of MUVR (i.e., Base, Filter, QA). MUVR-Base/Filter evaluates retrieval models, while MUVR-QA assesses MLLMs in a question-answering format. We also propose a Reranking Score to evaluate the reranking ability of MLLMs. MUVR consists of 53K untrimmed videos from the video platform Bilibili, with 1,050 multi-modal queries and 84K matches. Extensive evaluations of 3 state-of-the-art video retrieval models, 6 image-based VLMs, and 10 MLLMs are conducted. MUVR reveals the limitations of retrieval methods in processing untrimmed videos and multi-modal queries, as well as MLLMs in multi-video understanding and reranking. Our code and benchmark are available at https://github.com/debby-0527/MUVR.

The Multi-modal Untrimmed Video Retrieval (MUVR) benchmark targets video retrieval on long-video platforms. Queries are multi-modal, combining long text descriptions, video tag prompts, and mask prompts, and each query can match many untrimmed videos rather than a single clip. With 53K untrimmed Bilibili videos, 1,050 queries, and 84K matches, MUVR gives researchers and developers a common basis for evaluating both retrieval models and MLLMs, and the reported results expose the limits of current methods on untrimmed videos and multi-modal queries.
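
To make the one-to-many paradigm concrete, here is a minimal sketch of how a ranked result list could be scored when a query has several relevant videos. The abstract does not say which metric MUVR-Base/Filter uses, so mean average precision is an assumption here (it is a common choice for this kind of retrieval), and the function names and video IDs are hypothetical.

```python
from typing import Dict, Sequence, Set


def average_precision(ranked_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """Average precision for one query with a set of relevant videos.

    Assumption: this metric is illustrative; the source does not specify it.
    """
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, video_id in enumerate(ranked_ids, start=1):
        if video_id in relevant_ids:
            hits += 1
            # Precision at the rank where a relevant video is found.
            precision_sum += hits / rank
    # Relevant videos that were never retrieved contribute zero.
    return precision_sum / len(relevant_ids)


def mean_average_precision(
    rankings: Dict[str, Sequence[str]],
    ground_truth: Dict[str, Set[str]],
) -> float:
    """Mean of per-query average precision over all queries with ground truth."""
    if not ground_truth:
        return 0.0
    scores = [
        average_precision(rankings.get(query_id, []), relevant)
        for query_id, relevant in ground_truth.items()
    ]
    return sum(scores) / len(scores)


# Hypothetical toy data: two queries, each with its own set of matching videos.
rankings = {"q1": ["a", "x", "b", "y"], "q2": ["x", "a"]}
ground_truth = {"q1": {"a", "b"}, "q2": {"a"}}
print(round(mean_average_precision(rankings, ground_truth), 4))
```

Dividing by the number of relevant videos, rather than by the number of hits, penalizes a ranking that misses some of a query's matches, which matters when each query has many correct answers.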

MUVR: A Multi-Modal Untrimmed Video Retrieval Benchmark with Multi-Level Visual Correspondence

arXiv:2511.14848v1 Announce Type: new 
Abstract: We present Gaussian See, Gaussian Do, a novel approach for semantic 3D motion transfer from multiview video. Our method enables rig-free, cross-category motion transfer between objects with semantically meaningful correspondence. Building on implicit motion transfer techniques, we extract motion embeddings from source videos via condition inversion, apply them to rendered frames of static target shapes, and use the resulting videos to supervise dynamic 3D Gaussian Splatting reconstruction. Our approach introduces an anchor-based view-aware motion embedding mechanism, ensuring cross-view consistency and accelerating convergence, along with a robust 4D reconstruction pipeline that consolidates noisy supervision videos. We establish the first benchmark for semantic 3D motion transfer and demonstrate superior motion fidelity and structural consistency compared to adapted baselines. Code and data for this paper are available at https://gsgd-motiontransfer.github.io/

Gaussian See, Gaussian Do is a method for semantic 3D motion transfer from multiview video. It transfers motion rig-free and across categories, between objects that have a semantically meaningful correspondence. Building on implicit motion transfer techniques, it extracts motion embeddings from source videos via condition inversion, applies them to rendered frames of static target shapes, and uses the resulting videos to supervise a dynamic 3D Gaussian Splatting reconstruction. The authors report better motion fidelity and structural consistency than adapted baselines.

Gaussian See, Gaussian Do: Semantic 3D Motion Transfer from Multiview Video
