arXiv:2511.18746v1 Announce Type: new 
Abstract: While video-generation-based embodied world models have gained increasing attention, their reliance on large-scale embodied interaction data remains a key bottleneck. The scarcity, difficulty of collection, and high dimensionality of embodied data fundamentally limit the alignment granularity between language and actions and exacerbate the challenge of long-horizon video generation, hindering generative models from achieving a "GPT moment" in the embodied domain. We start from a simple observation: the diversity of embodied data far exceeds the relatively small space of possible primitive motions. Based on this insight, we propose Primitive Embodied World Models (PEWM), which restrict video generation to fixed, shorter horizons. This approach 1) enables fine-grained alignment between linguistic concepts and visual representations of robotic actions, 2) reduces learning complexity, 3) improves data efficiency in embodied data collection, and 4) decreases inference latency. Equipped with a modular Vision-Language Model (VLM) planner and a Start-Goal heatmap Guidance mechanism (SGG), PEWM further enables flexible closed-loop control and supports compositional generalization of primitive-level policies over extended, complex tasks. Our framework leverages the spatiotemporal vision priors in video models and the semantic awareness of VLMs to bridge the gap between fine-grained physical interaction and high-level reasoning, paving the way toward scalable, interpretable, and general-purpose embodied intelligence.

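The Any4D paper introduces Primitive Embodied World Models (PEWM), which restrict video generation to fixed, short horizons covering a single primitive motion. Embodied interaction data is scarce, hard to collect, and high-dimensional, which makes long-horizon video generation difficult; shorter horizons allow finer alignment between language and robot actions, lower learning complexity, better data efficiency, and lower inference latency. A modular VLM planner and a Start-Goal heatmap Guidance mechanism (SGG) then compose these primitives into longer tasks under closed-loop control.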

Any4D: Open-Prompt 4D Generation from Natural Language and Images

arXiv:2511.16669v2 Announce Type: replace 
Abstract: While language models have become impactful in many real-world applications, video generation remains largely confined to entertainment. Motivated by video's inherent capacity to demonstrate physical-world information that is difficult to convey through language alone (e.g., imagine teaching someone to tie a tie using only text), we identify an underutilized opportunity to extend video as a new answer modality for Next-Event Prediction (NEP), formalized as Video-Next-Event Prediction (VNEP). While the established NEP task takes a video with a procedural or predictive question as input and predicts the next event in text, VNEP requires dynamic video responses. This shift from telling to showing unlocks more intuitive and customized answers for procedural learning and creative exploration. However, this task remains challenging for existing models, as it demands an understanding of multimodal input, instruction-conditioned reasoning, and the generation of video with visual and semantic consistency. To address this, we introduce VANS, a model that leverages reinforcement learning to align a Vision-Language Model (VLM) with a Video Diffusion Model (VDM) for VNEP. The core of VANS is our proposed Joint-GRPO, which orchestrates the VLM and VDM to function as a unit. Driven by a shared reward on their respective outputs, it optimizes the VLM to produce captions that are both accurate and easy to visualize, while guiding the VDM to generate videos that are faithful to these captions and the input visual context. To enable this learning, we craft VANS-Data-100K, a dedicated dataset for the VNEP task. Experiments on procedural and predictive benchmarks demonstrate that VANS achieves state-of-the-art performance in both video event prediction and visualization. Code is released at https://github.com/KlingTeam/VANS.
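
As a rough illustration of the group-relative idea behind GRPO-style training, the sketch below scores a group of sampled (caption, video) pairs with one shared reward and normalizes each reward against its own group. It is a minimal sketch under stated assumptions, not the VANS implementation: the abstract does not specify the Joint-GRPO reward, so `shared_reward`, its `weight` parameter, and the example scores are hypothetical, and the use of the population standard deviation is likewise an assumption.

```python
from statistics import mean, pstdev


def shared_reward(caption_score, video_score, weight=0.5):
    """Hypothetical shared reward: a weighted blend of a caption score and a
    video score. The actual Joint-GRPO reward design is not given in the abstract."""
    return weight * caption_score + (1.0 - weight) * video_score


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: each reward minus the group mean, divided by the
    group standard deviation (population std assumed; eps avoids division by zero)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One sampling group: (caption_score, video_score) per sampled output (made-up values).
group = [(0.9, 0.8), (0.4, 0.7), (0.6, 0.2), (0.1, 0.3)]
rewards = [shared_reward(c, v) for c, v in group]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f} advantage={a:+.3f}")
```

Because the advantages are centered on the group mean, samples that beat their siblings are reinforced and the rest are discouraged, without a separately learned value function.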

A new task, Video-Next-Event Prediction (VNEP), has been introduced: given a video and a procedural or predictive question, the model answers with a generated video of the next event rather than a text description. The accompanying model, VANS, aligns a Vision-Language Model with a Video Diffusion Model through Joint-GRPO, with the aim of giving more intuitive visual answers for procedural learning than text-only predictions can.

Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO

arXiv:2511.00511v3 Announce Type: replace 
Abstract: Significant progress has been achieved in high-fidelity video synthesis, yet current paradigms often fall short in effectively integrating identity information from multiple subjects. This leads to semantic conflicts and suboptimal performance in preserving identities and interactions, limiting controllability and applicability. To tackle this issue, we introduce ID-Crafter, a framework for multi-subject video generation that achieves superior identity preservation and semantic coherence. ID-Crafter integrates three key components: (i) a hierarchical identity-preserving attention mechanism that progressively aggregates features at intra-subject, inter-subject, and cross-modal levels; (ii) a semantic understanding module powered by a pretrained Vision-Language Model (VLM) to provide fine-grained guidance and capture complex inter-subject relationships; and (iii) an online reinforcement learning phase to further refine the model for critical concepts. Furthermore, we construct a new dataset to facilitate robust training and evaluation. Extensive experiments demonstrate that ID-Crafter establishes new state-of-the-art performance on multi-subject video generation benchmarks, excelling in identity preservation, temporal consistency, and overall video quality.

ID-Crafter has been introduced as a novel framework for multi-subject video generation, significantly enhancing identity preservation and semantic coherence through a hierarchical attention mechanism and a pretrained Vision-Language Model (VLM). This framework also incorporates an online reinforcement learning phase to refine its capabilities further.
