arXiv:2511.19529v1 Announce Type: new 
Abstract: Video has emerged as the primary medium for communication and creativity on the Internet, driving strong demand for scalable, high-quality video production. Vidi models continue to evolve toward next-generation video creation and have achieved state-of-the-art performance in multimodal temporal retrieval (TR). In its second release, Vidi2 advances video understanding with fine-grained spatio-temporal grounding (STG) and extends its capability to video question answering (Video QA), enabling comprehensive multimodal reasoning. Given a text query, Vidi2 can identify not only the corresponding timestamps but also the bounding boxes of target objects within the output time ranges. This end-to-end spatio-temporal grounding capability enables potential applications in complex editing scenarios, such as plot or character understanding, automatic multi-view switching, and intelligent, composition-aware reframing and cropping. To enable comprehensive evaluation of STG in practical settings, we introduce a new benchmark, VUE-STG, which offers four key improvements over existing STG datasets: 1) Video duration: spans from roughly 10s to 30 mins, enabling long-context reasoning; 2) Query format: queries are mostly converted into noun phrases while preserving sentence-level expressiveness; 3) Annotation quality: all ground-truth time ranges and bounding boxes are manually annotated with high accuracy; 4) Evaluation metric: a refined vIoU/tIoU/vIoU-Intersection scheme. In addition, we upgrade the previous VUE-TR benchmark to VUE-TR-V2, achieving a more balanced video-length distribution and more user-style queries. Remarkably, the Vidi2 model substantially outperforms leading proprietary systems, such as Gemini 3 Pro (Preview) and GPT-5, on both VUE-TR-V2 and VUE-STG, while achieving competitive results with popular open-source models with similar scale on video QA benchmarks.
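
The abstract names a refined vIoU/tIoU/vIoU-Intersection scheme but does not define it. As a rough illustration only, the sketch below computes the tIoU and vIoU definitions commonly used in earlier spatio-temporal grounding benchmarks; it is not the VUE-STG metric itself. The `Box` and `Track` types, the frame-indexed dictionary layout, and all function names are assumptions made for this example.

```python
from typing import Dict, Tuple

# Hypothetical representations, not taken from the paper:
# a box is (x1, y1, x2, y2); a track maps frame index -> box.
Box = Tuple[float, float, float, float]
Track = Dict[int, Box]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def t_iou(pred: Track, gt: Track) -> float:
    """Temporal IoU: shared frames divided by frames covered by either track."""
    frames_union = pred.keys() | gt.keys()
    frames_inter = pred.keys() & gt.keys()
    return len(frames_inter) / len(frames_union) if frames_union else 0.0


def v_iou(pred: Track, gt: Track) -> float:
    """Video IoU: box IoU summed over shared frames, normalized by the union of frames."""
    frames_union = pred.keys() | gt.keys()
    if not frames_union:
        return 0.0
    frames_inter = pred.keys() & gt.keys()
    return sum(box_iou(pred[t], gt[t]) for t in frames_inter) / len(frames_union)


if __name__ == "__main__":
    # Ground truth covers frames 0-3; the prediction covers frames 2-5
    # with a box that overlaps half of the ground-truth box.
    gt = {t: (0.0, 0.0, 10.0, 10.0) for t in range(0, 4)}
    pred = {t: (0.0, 0.0, 10.0, 5.0) for t in range(2, 6)}
    print(f"tIoU = {t_iou(pred, gt):.4f}")
    print(f"vIoU = {v_iou(pred, gt):.4f}")
```

Keying tracks by frame index lets a prediction span several disjoint time ranges without special handling, which matches the abstract's mention of multiple output time ranges. A benchmark with its own normalization or thresholding would need those rules substituted in.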

Vidi2 has been introduced as a significant advancement in video understanding and creation, showcasing state-of-the-art performance in multimodal temporal retrieval and enhancing capabilities in spatio-temporal grounding and video question answering. This model allows for precise identification of timestamps and object locations in videos based on text queries, facilitating complex editing tasks.

Vidi2: Large Multimodal Models for Video Understanding and Creation

OpenAI researcher Sebastien Bubeck says GPT-5's math skills would save him a month of time.
The article <a href="https://the-decoder.com/gpt-5-generates-the-most-impressive-llm-output-yet-says-openai-researcher/">GPT-5 generates the &quot;most impressive LLM output&quot; yet, says OpenAI researcher</a> appeared first on <a href="https://the-decoder.com">THE DECODER</a>.

OpenAI researcher Sebastien Bubeck has praised GPT-5 for generating what he describes as the most impressive output from a language model to date, highlighting its advanced mathematical capabilities that could save significant time in research and development tasks.

GPT-5 generates the "most impressive LLM output" yet, says OpenAI researcher

Jonathan Kemper / <A HREF="https://the-decoder.com/">The Decoder</A>:
<A HREF="https://the-decoder.com/qwen3-vl-can-scan-two-hour-videos-and-pinpoint-nearly-every-detail/">Alibaba Technical Report: Qwen3-VL beats GPT-5 and Gemini 2.5 Pro on visual tasks and has 100% accuracy on &ldquo;needle-in-a-haystack&rdquo; tests for 30-minute videos</A>&nbsp; &mdash;&nbsp; A few months after launching Qwen3-VL, Alibaba has released a detailed technical report on the open multimodal model.

Alibaba has released a technical report on its Qwen3-VL model, which outperforms competitors GPT-5 and Gemini 2.5 Pro in visual tasks and achieves 100% accuracy in 'needle-in-a-haystack' tests for 30-minute videos. This advancement highlights the model's capabilities in analyzing multimodal data, including video and images.

Alibaba Technical Report: Qwen3-VL beats GPT-5 and Gemini 2.5 Pro on visual tasks and has 100% accuracy on "needle-in-a-haystack" tests for 30-minute videos (Jonathan Kemper/The Decoder)
