arXiv:2511.08521v1 Announce Type: new 
Abstract: While specialized AI models excel at isolated video tasks like generation or understanding, real-world applications demand complex, iterative workflows that combine these capabilities. To bridge this gap, we introduce UniVA, an open-source, omni-capable multi-agent framework for next-generation video generalists that unifies video understanding, segmentation, editing, and generation into cohesive workflows. UniVA employs a Plan-and-Act dual-agent architecture that drives a highly automated and proactive workflow: a planner agent interprets user intentions and decomposes them into structured video-processing steps, while executor agents execute these through modular, MCP-based tool servers (for analysis, generation, editing, tracking, etc.). Through a hierarchical multi-level memory (global knowledge, task context, and user-specific preferences), UniVA sustains long-horizon reasoning, contextual continuity, and inter-agent communication, enabling interactive and self-reflective video creation with full traceability. This design enables iterative and any-conditioned video workflows (e.g., text/image/video-conditioned generation $\rightarrow$ multi-round editing $\rightarrow$ object segmentation $\rightarrow$ compositional synthesis) that were previously cumbersome to achieve with single-purpose models or monolithic video-language models. We also introduce UniVA-Bench, a benchmark suite of multi-step video tasks spanning understanding, editing, segmentation, and generation, to rigorously evaluate such agentic video systems. Both UniVA and UniVA-Bench are fully open-sourced, aiming to catalyze research on interactive, agentic, and general-purpose video intelligence for the next generation of multimodal AI systems. (https://univa.online/)
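
The Plan-and-Act flow described in the abstract can be illustrated with a minimal sketch. This is not UniVA's code or API: the `Step`, `Memory`, `plan`, `execute`, and `TOOLS` names are hypothetical, the planner is a simple string splitter rather than a language model, and the tools are plain functions standing in for MCP-based tool servers. Only the three memory levels are taken from the abstract.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Step:
    tool: str    # name of the tool to call (hypothetical)
    prompt: str  # instruction passed to that tool


@dataclass
class Memory:
    # The three levels are named in the abstract; this layout is an assumption.
    global_knowledge: Dict[str, str] = field(default_factory=dict)
    task_context: List[str] = field(default_factory=list)
    user_preferences: Dict[str, str] = field(default_factory=dict)


def plan(request: str) -> List[Step]:
    """Toy planner: split the request on '->' and pick a tool from the first word."""
    steps = []
    for part in request.split("->"):
        text = part.strip()
        steps.append(Step(tool=text.split()[0].lower(), prompt=text))
    return steps


# Stand-ins for tool servers: each takes the previous artifact and a prompt.
TOOLS: Dict[str, Callable[[str, str], str]] = {
    "generate": lambda prev, prompt: f"video({prompt})",
    "edit": lambda prev, prompt: f"edited({prev})",
    "segment": lambda prev, prompt: f"masks({prev})",
}


def execute(steps: List[Step], memory: Memory) -> str:
    """Toy executor: run each step in order and log it to the task context."""
    artifact = ""
    for step in steps:
        artifact = TOOLS[step.tool](artifact, step.prompt)
        memory.task_context.append(f"{step.tool}: {artifact}")
    return artifact


memory = Memory(user_preferences={"style": "cinematic"})
request = "generate a cat clip -> edit the colors -> segment the cat"
result = execute(plan(request), memory)
print(result)  # prints masks(edited(video(generate a cat clip)))
```

Each step's output is fed to the next tool and recorded in `memory.task_context`, which is one simple way to keep a multi-step workflow traceable.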

UniVA is an open-source multi-agent framework that unifies video understanding, segmentation, editing, and generation in a single workflow. Its Plan-and-Act dual-agent architecture pairs a planner agent, which breaks a user's request into structured video-processing steps, with executor agents that carry out those steps through modular, MCP-based tool servers. The result is an automated, iterative workflow intended to make complex, multi-step video tasks more manageable.

UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist

arXiv:2511.08585v2 Announce Type: replace-cross 
Abstract: The landscape of video generation is shifting, from a focus on generating visually appealing clips to building virtual environments that support interaction and maintain physical plausibility. These developments point toward the emergence of video foundation models that function not only as visual generators but also as implicit world models, models that simulate the physical dynamics, agent-environment interactions, and task planning that govern real or imagined worlds. This survey provides a systematic overview of this evolution, conceptualizing modern video foundation models as the combination of two core components: an implicit world model and a video renderer. The world model encodes structured knowledge about the world, including physical laws, interaction dynamics, and agent behavior. It serves as a latent simulation engine that enables coherent visual reasoning, long-term temporal consistency, and goal-driven planning. The video renderer transforms this latent simulation into realistic visual observations, effectively producing videos as a "window" into the simulated world. We trace the progression of video generation through four generations, in which the core capabilities advance step by step, ultimately culminating in a world model, built upon a video generation model, that embodies intrinsic physical plausibility, real-time multimodal interaction, and planning capabilities spanning multiple spatiotemporal scales. For each generation, we define its core characteristics, highlight representative works, and examine their application domains such as robotics, autonomous driving, and interactive gaming. Finally, we discuss open challenges and design principles for next-generation world models, including the role of agent intelligence in shaping and evaluating these systems. An up-to-date list of related works is maintained at this link.
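
The survey's two-component view (an implicit world model that advances a latent state, and a renderer that turns that state into observations) can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not taken from the survey: the `State`, `world_model_step`, `render`, and `rollout` names, the bouncing-ball dynamics, and the text "frames" that stand in for video.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class State:
    height: float    # latent state: ball height above the floor
    velocity: float  # latent state: vertical velocity


def world_model_step(state: State, push: float, dt: float = 0.1) -> State:
    """Hypothetical latent dynamics: gravity plus an agent action (an upward push)."""
    velocity = state.velocity + (push - 9.8) * dt
    height = state.height + velocity * dt
    if height < 0.0:
        # Bounce off the floor and lose half the speed.
        height, velocity = 0.0, -0.5 * velocity
    return State(height, velocity)


def render(state: State, rows: int = 5, max_height: float = 2.0) -> List[str]:
    """Hypothetical renderer: map the latent state to a frame of text rows."""
    clamped = min(max(state.height, 0.0), max_height)
    ball_row = rows - 1 - round(clamped / max_height * (rows - 1))
    return ["o" if r == ball_row else "." for r in range(rows)]


def rollout(state: State, actions: List[float]) -> List[List[str]]:
    """Simulate in latent space, rendering one frame per step."""
    frames = []
    for push in actions:
        state = world_model_step(state, push)
        frames.append(render(state))
    return frames


frames = rollout(State(height=2.0, velocity=0.0), [0.0] * 10)
for frame in frames[:3]:
    print("".join(frame))
```

The point of the split is that physical consistency lives in `world_model_step`, while `render` only decides how a given state looks.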

The landscape of video generation is shifting from creating visually appealing clips to building interactive virtual environments that maintain physical plausibility. A recent survey frames modern video foundation models as the combination of an implicit world model and a video renderer, which together enable coherent visual reasoning, long-term temporal consistency, and goal-driven planning.
