VSI: Visual Subtitle Integration for Keyframe Selection to Enhance Long Video Understanding

arXiv — cs.CV · Monday, November 24, 2025 at 5:00:00 AM
  • The Visual Subtitle Integration (VSI) framework aims to enhance long video understanding by integrating visual and textual information through a dual-branch collaborative retrieval approach (a minimal sketch of this dual-branch scoring appears after this list). It addresses the limitations of existing keyframe search algorithms, which rely primarily on visual data and often fail to capture the semantic essence of video content.
  • The VSI framework is significant as it improves the efficiency and quality of keyframe selection, which is crucial for applications in multimodal large language models (MLLMs) that require accurate video comprehension for various tasks.
  • This development reflects a broader trend in AI research, where the integration of multiple modalities, such as visual and textual data, is becoming essential for advancing video understanding. Similar frameworks, like Agentic Video Intelligence, are also emerging, indicating a growing recognition of the need for sophisticated approaches in processing complex video data.
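The sketch below illustrates what a dual-branch retrieval step of this kind could look like. It assumes precomputed query, frame, and subtitle embeddings (e.g. from a CLIP-style encoder); the function name `select_keyframes`, the fusion weight `alpha`, and all other details are illustrative assumptions, not the paper's actual method.

```python
# Minimal sketch of a dual-branch retrieval step in the spirit of VSI.
# Assumes frame embeddings, subtitle embeddings, and a query embedding were
# produced by some encoder beforehand; names and the fusion rule are illustrative.
import numpy as np

def select_keyframes(query_vis, query_txt, frame_embs, subtitle_embs, alpha=0.5, k=8):
    """Score each frame by fusing a visual branch and a subtitle-text branch."""
    def cosine(q, m):
        q = q / (np.linalg.norm(q) + 1e-8)
        m = m / (np.linalg.norm(m, axis=1, keepdims=True) + 1e-8)
        return m @ q

    vis_scores = cosine(query_vis, frame_embs)      # query vs. frame visuals
    txt_scores = cosine(query_txt, subtitle_embs)   # query vs. aligned subtitles
    fused = alpha * vis_scores + (1 - alpha) * txt_scores
    return np.argsort(fused)[::-1][:k]              # indices of top-k keyframes

# Usage with random stand-in embeddings (100 frames, 512-dim):
rng = np.random.default_rng(0)
frames = rng.normal(size=(100, 512))
subs = rng.normal(size=(100, 512))
q = rng.normal(size=512)
print(select_keyframes(q, q, frames, subs))
```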
— via World Pulse Now AI Editorial System


Continue Reading
FOCUS: Efficient Keyframe Selection for Long Video Understanding
Positive · Artificial Intelligence
FOCUS, a new keyframe selection module, has been introduced to enhance long video understanding by selecting query-relevant frames while adhering to strict token budgets. This model-agnostic approach formulates keyframe selection as a combinatorial pure-exploration problem, aiming to identify the most informative video segments without prior filtering.
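FOCUS's actual algorithm is a combinatorial pure-exploration (bandit-style) procedure; the heavily simplified sketch below only illustrates the token-budget constraint it operates under, using a greedy stand-in with hypothetical names such as `select_under_budget`.

```python
# Greedy stand-in for budget-constrained frame selection. This is NOT the
# FOCUS algorithm; it only shows how a token budget limits how many
# candidate frames can be kept.
def select_under_budget(scores, tokens_per_frame, token_budget):
    """Greedily keep the highest-scoring frames that fit within the token budget."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen, used = [], 0
    for i in order:
        if used + tokens_per_frame[i] <= token_budget:
            chosen.append(i)
            used += tokens_per_frame[i]
    return sorted(chosen)

# Example: 6 candidate frames at 64 tokens each under a 256-token budget -> 4 kept.
print(select_under_budget([0.9, 0.2, 0.7, 0.8, 0.1, 0.6], [64] * 6, 256))
```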
Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension
Positive · Artificial Intelligence
The recent introduction of Video Retrieval-Augmented Generation (Video-RAG) addresses the challenges faced by large video-language models (LVLMs) in comprehending long videos due to limited context. This innovative approach utilizes visually-aligned auxiliary texts extracted from video data to enhance cross-modality alignment without the need for extensive fine-tuning or costly GPU resources.
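As a rough illustration of the retrieval-augmented idea (not Video-RAG's actual pipeline), the sketch below retrieves the auxiliary texts most similar to a question and prepends them to a prompt. The embedding function and all names here are hypothetical assumptions.

```python
# Sketch of a retrieval-augmented step: pick the auxiliary texts (e.g. ASR/OCR/
# subtitle snippets extracted from the video) most similar to the question and
# add them to the prompt for a video-language model. Names are illustrative.
import numpy as np

def retrieve_auxiliary_texts(query, aux_texts, embed, top_k=3):
    """Return the auxiliary texts with highest cosine similarity to the query."""
    q = embed(query)
    q = q / (np.linalg.norm(q) + 1e-8)
    scored = []
    for text in aux_texts:
        v = embed(text)
        v = v / (np.linalg.norm(v) + 1e-8)
        scored.append((float(q @ v), text))
    scored.sort(reverse=True)
    return [t for _, t in scored[:top_k]]

def build_prompt(question, retrieved):
    """Prepend the retrieved snippets as auxiliary context for the model."""
    context = "\n".join(f"- {t}" for t in retrieved)
    return f"Auxiliary context:\n{context}\n\nQuestion: {question}"

# Toy embedding for demonstration only: hashed bag-of-words into 64 dims.
def toy_embed(text, dim=64):
    v = np.zeros(dim)
    for word in text.lower().split():
        v[hash(word) % dim] += 1.0
    return v

aux = ["speaker mentions a red car",
       "subtitle: the meeting starts at noon",
       "on-screen text: final score 3-1"]
print(build_prompt("when does the meeting start",
                   retrieve_auxiliary_texts("when does the meeting start", aux, toy_embed, top_k=1)))
```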