arXiv:2510.25332v1 Announce Type: new 
Abstract: The rapid growth of streaming video applications demands multimodal models with enhanced capabilities for temporal dynamics understanding and complex reasoning. However, current Video Question Answering (VideoQA) datasets suffer from two critical limitations: 1) Static annotation mechanisms fail to capture the evolving nature of answers in temporal video streams, and 2) The absence of explicit reasoning process annotations restricts model interpretability and logical deduction capabilities. To address these challenges, we introduce StreamingCoT, the first dataset explicitly designed for temporally evolving reasoning in streaming VideoQA and multimodal Chain-of-Thought (CoT) tasks. Our framework first establishes a dynamic hierarchical annotation architecture that generates per-second dense descriptions and constructs temporally-dependent semantic segments through similarity fusion, paired with question-answer sets constrained by temporal evolution patterns. We further propose an explicit reasoning chain generation paradigm that extracts spatiotemporal objects via keyframe semantic alignment, derives object state transition-based reasoning paths using large language models, and ensures logical coherence through human-verified validation. This dataset establishes a foundation for advancing research in streaming video understanding, complex temporal reasoning, and multimodal inference. Our StreamingCoT and its construction toolkit can be accessed at https://github.com/Fleeting-hyh/StreamingCoT.
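
The "similarity fusion" step in the abstract can be pictured with a small sketch. The abstract does not say which similarity measure, embedding model, or threshold the authors use, so everything here is an assumption for illustration: `fuse_segments`, the cosine measure, the running-mean comparison, and the `0.8` threshold are hypothetical and are not taken from the StreamingCoT toolkit.

```python
import math
from typing import List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def fuse_segments(
    embeddings: List[List[float]], threshold: float = 0.8
) -> List[Tuple[int, int]]:
    """Group consecutive seconds into segments (hypothetical fusion rule).

    embeddings[t] stands in for an embedding of the description of second t.
    A second joins the current segment when it is similar enough to that
    segment's mean embedding; otherwise it starts a new segment.
    Returns inclusive (start_second, end_second) pairs.
    """
    segments: List[Tuple[int, int]] = []
    if not embeddings:
        return segments
    start = 0
    total = list(embeddings[0])  # running sum over the current segment
    for t in range(1, len(embeddings)):
        count = t - start  # number of seconds already in the segment
        mean = [v / count for v in total]
        if cosine(mean, embeddings[t]) >= threshold:
            total = [a + b for a, b in zip(total, embeddings[t])]
        else:
            segments.append((start, t - 1))
            start = t
            total = list(embeddings[t])
    segments.append((start, len(embeddings) - 1))
    return segments


# Toy 2-D "embeddings": two seconds of one scene, then three of another.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.1, 1.0]]
segments = fuse_segments(toy)
print(segments)
```

On the toy input the scene change at second 2 starts a new segment. Question-answer pairs could then be attached per segment, which is one way answers might be allowed to change as the stream advances.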

StreamingCoT is a new dataset for Video Question Answering aimed at streaming video applications. It addresses two limitations of existing VideoQA datasets: static annotations that miss how answers change over the course of a stream, and the lack of explicit reasoning annotations. To do so, it pairs temporally evolving question-answer sets with human-verified reasoning chains. The authors position the dataset as a foundation for research in streaming video understanding, temporal reasoning, and multimodal inference.

StreamingCoT: A Dataset for Temporal Dynamics and Multimodal Chain-of-Thought Reasoning in Streaming VideoQA

arXiv:2512.07853v1 Announce Type: new 
Abstract: As deep learning models in agentic AI systems grow in scale and complexity, GPU memory requirements increase and often exceed the available GPU memory capacity, causing out-of-memory (OoM) errors. It is well known that an OoM error interrupts the entire training run and wastes substantial computational resources. Therefore, to prevent OoM, accurate prediction of GPU memory usage is essential. However, previous studies focus only on unimodal architectures and fail to generalize to multimodal models, even though multimodal models are a common choice in agentic AI systems. To address this limitation, we propose a framework that predicts the peak GPU memory usage by analyzing the model architecture and training behavior of multimodal models. Specifically, the framework decomposes the multimodal model into its constituent layers and applies factorization to estimate the memory usage of each layer. Our evaluation shows that our framework achieves high prediction accuracy, with an average MAPE of about 8.7%.
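
To make the per-layer idea and the MAPE metric concrete, here is a minimal sketch. The abstract does not list the factors the framework uses, so the breakdown into weights, gradients, optimizer states, and activations is a common textbook assumption rather than the authors' method. `LayerSpec`, `layer_bytes`, and `estimate_bytes` are hypothetical names, and summing layers gives a rough total rather than a true peak. Only `mape` follows a standard definition: the mean of |actual - predicted| / |actual|, expressed as a percentage.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class LayerSpec:
    """Hypothetical description of one layer of a multimodal model."""
    name: str
    num_params: int        # trainable parameters in the layer
    activation_elems: int  # activation elements kept for backward, per sample


def layer_bytes(layer: LayerSpec, batch_size: int,
                bytes_per_elem: int = 4, optimizer_states: int = 2) -> int:
    """Assumed factorization: weights + gradients + optimizer states + activations."""
    weights = layer.num_params * bytes_per_elem
    grads = layer.num_params * bytes_per_elem
    opt = layer.num_params * bytes_per_elem * optimizer_states  # e.g. two Adam moments
    acts = layer.activation_elems * batch_size * bytes_per_elem
    return weights + grads + opt + acts


def estimate_bytes(layers: List[LayerSpec], batch_size: int) -> int:
    """Sum the per-layer estimates (a simplification, not a true peak)."""
    return sum(layer_bytes(layer, batch_size) for layer in layers)


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent."""
    errors = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)


# Toy model with one vision layer and one text layer (made-up sizes).
toy_model = [
    LayerSpec("vision_encoder", num_params=1000, activation_elems=200),
    LayerSpec("text_decoder", num_params=2000, activation_elems=100),
]
total = estimate_bytes(toy_model, batch_size=8)
error = mape(actual=[100.0, 200.0], predicted=[110.0, 180.0])
print(total, error)
```

A lower MAPE means predictions sit closer to measured usage, which is why the reported figure of about 8.7% is framed as high accuracy.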

A new framework predicts peak GPU memory usage during the training of multimodal models, targeting the out-of-memory (OoM) errors that interrupt training runs and waste compute. It analyzes model architecture and training behavior, decomposing a model into its constituent layers and estimating each layer's memory usage. The authors report an average MAPE of about 8.7%.

GPU Memory Prediction for Multimodal Model Training

arXiv:2512.08228v1 Announce Type: new 
Abstract: The ability to perform Chain-of-Thought (CoT) reasoning marks a major milestone for multimodal models (MMs), enabling them to solve complex visual reasoning problems. Yet a critical question remains: is such reasoning genuinely grounded in visual evidence and logically coherent? Existing benchmarks emphasize generation but neglect verification, i.e., the capacity to assess whether a reasoning chain is both visually consistent and logically valid. To fill this gap, we introduce MM-CoT, a diagnostic benchmark specifically designed to probe the visual grounding and logical coherence of CoT reasoning in MMs. Instead of generating free-form explanations, models must select the sole event chain that satisfies two orthogonal constraints: (i) visual consistency, ensuring all steps are anchored in observable evidence, and (ii) logical coherence, ensuring causal and commonsense validity. Adversarial distractors are engineered to violate one of these constraints, exposing distinct reasoning failures. We evaluate leading vision-language models on MM-CoT and find that even the most advanced systems struggle, revealing a sharp discrepancy between generative fluency and true reasoning fidelity. MM-CoT shows low correlation with existing benchmarks, confirming that it measures a unique combination of visual grounding and logical reasoning. This benchmark provides a foundation for developing future models that reason not just plausibly, but faithfully and coherently within the visual world.
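
Because each distractor violates one of the two constraints, a wrong selection also says which kind of failure occurred. The sketch below shows one way such a breakdown could be tallied. The item format, the label strings `"correct"`, `"visual"`, and `"logical"`, and the function `score_selections` are hypothetical and are not taken from the MM-CoT release.

```python
from collections import Counter
from typing import Dict, List

LABELS = ("correct", "visual", "logical")


def score_selections(items: List[Dict[str, List[str]]],
                     choices: List[int]) -> Dict[str, float]:
    """Return the fraction of items falling into each outcome.

    items[i]["labels"][k] tags candidate chain k of item i as "correct",
    or as a distractor that breaks "visual" consistency or "logical"
    coherence (assumed format). choices[i] is the index the model picked.
    """
    outcomes = Counter(
        item["labels"][choice] for item, choice in zip(items, choices)
    )
    n = len(items)
    return {label: outcomes.get(label, 0) / n for label in LABELS}


# Four toy items with three candidate chains each.
toy_items = [
    {"labels": ["correct", "visual", "logical"]},
    {"labels": ["visual", "correct", "logical"]},
    {"labels": ["logical", "visual", "correct"]},
    {"labels": ["correct", "logical", "visual"]},
]
toy_choices = [0, 0, 2, 1]  # picks made by a hypothetical model
report = score_selections(toy_items, toy_choices)
print(report)
```

Here the "correct" entry is plain selection accuracy, while the other two entries separate visual grounding failures from logical ones.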

MM-CoT is a diagnostic benchmark for evaluating Chain-of-Thought reasoning in multimodal models, focusing on whether that reasoning is grounded in visual evidence and logically coherent. It targets a gap in existing benchmarks, which emphasize generation over verification, by testing whether models can select the single event chain that satisfies both visual and logical criteria. According to the authors, even the most advanced vision-language models struggle on it.

