arXiv:2511.18005v1 Announce Type: new 
Abstract: City-scale 3D generation is of great importance for the development of embodied intelligence and world models. Existing methods, however, face significant challenges regarding quality, fidelity, and scalability in 3D world generation. Thus, we propose RAISECity, a Reality-Aligned Intelligent Synthesis Engine that creates detailed, City-scale 3D worlds. We introduce an agentic framework that leverages diverse multimodal foundation tools to acquire real-world knowledge, maintain robust intermediate representations, and construct complex 3D scenes. This agentic design, featuring dynamic data processing, iterative self-reflection and refinement, and the invocation of advanced multimodal tools, minimizes cumulative errors and enhances overall performance. Extensive quantitative experiments and qualitative analyses validate the superior performance of RAISECity in real-world alignment, shape precision, texture fidelity, and aesthetics level, achieving over a 90% win-rate against existing baselines for overall perceptual quality. This combination of 3D quality, reality alignment, scalability, and seamless compatibility with computer graphics pipelines makes RAISECity a promising foundation for applications in immersive media, embodied intelligence, and world models.

RAISECity has been introduced as a multimodal agent framework designed to enhance city-scale 3D world generation, addressing challenges in quality, fidelity, and scalability that current methods face. This framework utilizes diverse multimodal foundation tools to create detailed 3D environments, aiming to improve embodied intelligence and world models.

RAISECity: A Multimodal Agent Framework for Reality-Aligned 3D World Generation at City-Scale

arXiv:2511.19836v1 Announce Type: new 
Abstract: World Generation Models are emerging as a cornerstone of next-generation multimodal intelligence systems. Unlike traditional 2D visual generation, World Models aim to construct realistic, dynamic, and physically consistent 3D/4D worlds from images, videos, or text. These models not only need to produce high-fidelity visual content but also maintain coherence across space, time, physics, and instruction control, enabling applications in virtual reality, autonomous driving, embodied intelligence, and content creation. However, prior benchmarks emphasize different evaluation dimensions and lack a unified assessment of world-realism capability. To systematically evaluate World Models, we introduce the 4DWorldBench, which measures models across four key dimensions: Perceptual Quality, Condition-4D Alignment, Physical Realism, and 4D Consistency. The benchmark covers tasks such as Image-to-3D/4D, Video-to-4D, Text-to-3D/4D. Beyond these, we innovatively introduce adaptive conditioning across multiple modalities, which not only integrates but also extends traditional evaluation paradigms. To accommodate different modality-conditioned inputs, we map all modality conditions into a unified textual space during evaluation, and further integrate LLM-as-judge, MLLM-as-judge, and traditional network-based methods. This unified and adaptive design enables more comprehensive and consistent evaluation of alignment, physical realism, and cross-modal coherence. Preliminary human studies further demonstrate that our adaptive tool selection achieves closer agreement with subjective human judgments. We hope this benchmark will serve as a foundation for objective comparisons and improvements, accelerating the transition from "visual generation" to "world generation." Our project can be found at https://yeppp27.github.io/4DWorldBench.github.io/.
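The abstract describes mapping every conditioning modality into a unified textual space and then choosing among LLM-as-judge, MLLM-as-judge, and network-based evaluators. The sketch below only illustrates that shape; it is not the benchmark's code. The `Condition` type, the `describe_media` captioning stub, and the dimension-to-judge table are all hypothetical placeholders, and the abstract does not say which evaluator is paired with which dimension.

```python
from dataclasses import dataclass


@dataclass
class Condition:
    modality: str  # "text", "image", or "video"
    payload: str   # prompt text, or a path to the media file


def describe_media(path: str) -> str:
    """Hypothetical stand-in for a captioning model; returns a stub string."""
    return f"[caption of {path}]"


def to_text(cond: Condition) -> str:
    """Map a condition of any supported modality into plain text."""
    if cond.modality == "text":
        return cond.payload
    if cond.modality in ("image", "video"):
        return describe_media(cond.payload)
    raise ValueError(f"unsupported modality: {cond.modality}")


# Assumed routing, for illustration only: the real benchmark selects tools
# adaptively, and this fixed table is NOT taken from the paper.
JUDGE_FOR_DIMENSION = {
    "perceptual_quality": "network",
    "condition_4d_alignment": "mllm",
    "physical_realism": "llm",
    "4d_consistency": "network",
}


def select_judge(dimension: str) -> str:
    """Return the kind of evaluator assumed for a given dimension."""
    return JUDGE_FOR_DIMENSION[dimension]
```

Once every condition is reduced to text, a single text-based judging interface can score outputs regardless of whether the original prompt was an image, a video, or a sentence.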

The introduction of 4DWorldBench marks a significant advancement in the evaluation of 3D/4D World Generation Models, which are crucial for developing realistic and dynamic environments for applications like virtual reality and autonomous driving. This framework assesses models based on perceptual quality, physical realism, and 4D consistency, addressing the need for a unified benchmark in a rapidly evolving field.

4DWorldBench: A Comprehensive Evaluation Framework for 3D/4D World Generation Models

arXiv:2506.01579v2 Announce Type: replace 
Abstract: Generating high-fidelity full-body human interactions with dynamic objects and static scenes remains a critical challenge in computer graphics and animation. Existing methods for human-object interaction often neglect scene context, leading to implausible penetrations, while human-scene interaction approaches struggle to coordinate fine-grained manipulations with long-range navigation. To address these limitations, we propose HOSIG, a novel framework for synthesizing full-body interactions through hierarchical scene perception. Our method decouples the task into three key components: 1) a scene-aware grasp pose generator that ensures collision-free whole-body postures with precise hand-object contact by integrating local geometry constraints, 2) a heuristic navigation algorithm that autonomously plans obstacle-avoiding paths in complex indoor environments via compressed 2D floor maps and dual-component spatial reasoning, and 3) a scene-guided motion diffusion model that generates trajectory-controlled, full-body motions with finger-level accuracy by incorporating spatial anchors and dual-space classifier-free guidance. Extensive experiments on the TRUMANS dataset demonstrate superior performance over state-of-the-art methods. Notably, our framework supports unlimited motion length through autoregressive generation and requires minimal manual intervention. This work bridges the critical gap between scene-aware navigation and dexterous object manipulation, advancing the frontier of embodied interaction synthesis. Codes will be available after publication. Project page: http://yw0208.github.io/hosig
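To make the second component more concrete, here is a minimal sketch of obstacle-avoiding path planning on a 2D floor map. It uses plain A* search on a 4-connected occupancy grid, a standard technique swapped in for illustration; the abstract does not specify HOSIG's heuristic algorithm or its map compression, so the grid encoding (0 = free, 1 = obstacle) and the function name `plan_path` are assumptions.

```python
import heapq


def plan_path(grid, start, goal):
    """A* on a 2D occupancy grid; returns a list of (row, col) or None."""
    rows, cols = len(grid), len(grid[0])

    def heuristic(cell):
        # Manhattan distance is admissible for 4-connected unit-cost moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(heuristic(start), 0, start)]
    came_from = {start: None}
    best_cost = {start: 0}

    while open_heap:
        _, cost, cell = heapq.heappop(open_heap)
        if cell == goal:
            # Walk parent links back to the start, then reverse.
            path = []
            while cell is not None:
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]
        if cost > best_cost[cell]:
            continue  # stale heap entry; a cheaper route was already found
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                new_cost = cost + 1
                if new_cost < best_cost.get(nxt, float("inf")):
                    best_cost[nxt] = new_cost
                    came_from[nxt] = cell
                    heapq.heappush(
                        open_heap, (new_cost + heuristic(nxt), new_cost, nxt)
                    )
    return None  # goal unreachable


# Example floor map: a wall across the middle row with a gap on the right.
floor = [
    [0, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
]
path = plan_path(floor, (0, 0), (2, 0))
```

In a pipeline like the one described, waypoints from such a planner could serve as trajectory control for a motion generator; how HOSIG actually couples the two is not detailed in the abstract.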

The introduction of HOSIG, a novel framework for generating full-body human interactions with dynamic objects and static scenes, addresses significant challenges in computer graphics and animation. By utilizing hierarchical scene perception, HOSIG enhances the realism of human-object interactions while ensuring collision-free postures and effective navigation in complex environments.
