arXiv:2411.17991v2 Announce Type: replace 
Abstract: Recent research on video large language models (VideoLLMs) predominantly focuses on model architectures and training datasets, leaving the interaction format between the user and the model under-explored. In existing works, users often interact with VideoLLMs by providing the entire video and a query as input, after which the model generates a response. This interaction format constrains the application of VideoLLMs in scenarios such as live-streaming comprehension, where videos do not end and responses are required in real time, and also results in unsatisfactory performance on time-sensitive tasks that require localizing video segments. In this paper, we focus on a video-text duet interaction format. This interaction format is characterized by the continuous playback of the video, and both the user and the model can insert their text messages at any position during the video playback. When a text message ends, the video continues to play, akin to the alternation of two performers in a duet. We construct MMDuetIT, a video-text training dataset designed to adapt VideoLLMs to the video-text duet interaction format. We also introduce the Multi-Answer Grounded Video Question Answering (MAGQA) task to benchmark the real-time response ability of VideoLLMs. Trained on MMDuetIT, MMDuet demonstrates that adopting the video-text duet interaction format enables the model to achieve significant improvements in various time-sensitive tasks (76% CIDEr on YouCook2 dense video captioning, 90% mAP on QVHighlights highlight detection and 25% R@0.5 on Charades-STA temporal video grounding) with minimal training effort, and also enables VideoLLMs to reply in real time as the video plays.
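
The duet format described in the abstract can be pictured as a single timeline in which video frames and text messages are interleaved in playback order. The Python sketch below is only an illustration under stated assumptions, not the authors' implementation: the names `Frame`, `Message`, `run_duet`, and the `decide_reply` callback are hypothetical, and the abstract does not say how MMDuet decides when to speak.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Union


@dataclass
class Frame:
    time: float  # playback position in seconds


@dataclass
class Message:
    time: float  # playback position at which the message is inserted
    role: str    # "user" or "assistant"
    text: str


Item = Union[Frame, Message]


def run_duet(
    frame_times: Sequence[float],
    user_messages: Sequence[Message],
    decide_reply: Callable[[List[Item]], Optional[str]],
) -> List[Item]:
    """Interleave frames, user messages, and model replies in playback order.

    `decide_reply` is a hypothetical stand-in for the model: it sees the
    timeline so far and returns reply text, or None to stay silent.
    """
    timeline: List[Item] = []
    pending = sorted(user_messages, key=lambda m: m.time)
    next_user = 0
    for t in frame_times:
        # Insert any user message whose position has been reached.
        while next_user < len(pending) and pending[next_user].time <= t:
            timeline.append(pending[next_user])
            next_user += 1
        # The video keeps playing: append the next frame.
        timeline.append(Frame(t))
        # After each frame the model may insert a message or stay silent.
        reply = decide_reply(timeline)
        if reply is not None:
            timeline.append(Message(t, "assistant", reply))
    return timeline
```

In this sketch the model is consulted after every frame, so it can answer as soon as relevant content appears rather than waiting for the video to end, which is the property the abstract highlights for live-streaming scenarios.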

Recent work on video large language models (VideoLLMs) introduces a video-text duet interaction format that lets users and models exchange text messages in real time during video playback. The format addresses the limitations of the conventional interaction, in which the model receives the entire video and a query before responding, particularly in time-sensitive scenarios such as live-streaming comprehension, where immediate responses are crucial.

VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format
