arXiv:2510.24795v1 Announce Type: new 
Abstract: Vision-Language-Action models (VLAs) represent a significant frontier in embodied intelligence, aiming to bridge digital knowledge with physical-world interaction. While these models have demonstrated remarkable generalist capabilities, their deployment is severely hampered by the substantial computational and data requirements inherent to their underlying large-scale foundation models. Motivated by the urgent need to address these challenges, this survey presents the first comprehensive review of Efficient Vision-Language-Action models (Efficient VLAs) across the entire data-model-training process. Specifically, we introduce a unified taxonomy to systematically organize the disparate efforts in this domain, categorizing current techniques into three core pillars: (1) Efficient Model Design, focusing on efficient architectures and model compression; (2) Efficient Training, which reduces computational burdens during model learning; and (3) Efficient Data Collection, which addresses the bottlenecks in acquiring and utilizing robotic data. Through a critical review of state-of-the-art methods within this framework, this survey not only establishes a foundational reference for the community but also summarizes representative applications, delineates key challenges, and charts a roadmap for future research. We maintain a continuously updated project page to track our latest developments: https://evla-survey.github.io/

A recent survey reviews efficient Vision-Language-Action models (VLAs), which aim to advance embodied intelligence by connecting digital knowledge with physical-world interaction. Despite their generalist capabilities, the computational and data requirements of their underlying foundation models severely hamper deployment. The survey organizes existing techniques into three pillars: efficient model design, efficient training, and efficient data collection.

A Survey on Efficient Vision-Language-Action Models

arXiv:2511.19836v1 Announce Type: new 
Abstract: World Generation Models are emerging as a cornerstone of next-generation multimodal intelligence systems. Unlike traditional 2D visual generation, World Models aim to construct realistic, dynamic, and physically consistent 3D/4D worlds from images, videos, or text. These models not only need to produce high-fidelity visual content but also maintain coherence across space, time, physics, and instruction control, enabling applications in virtual reality, autonomous driving, embodied intelligence, and content creation. However, prior benchmarks emphasize different evaluation dimensions and lack a unified assessment of world-realism capability. To systematically evaluate World Models, we introduce the 4DWorldBench, which measures models across four key dimensions: Perceptual Quality, Condition-4D Alignment, Physical Realism, and 4D Consistency. The benchmark covers tasks such as Image-to-3D/4D, Video-to-4D, Text-to-3D/4D. Beyond these, we innovatively introduce adaptive conditioning across multiple modalities, which not only integrates but also extends traditional evaluation paradigms. To accommodate different modality-conditioned inputs, we map all modality conditions into a unified textual space during evaluation, and further integrate LLM-as-judge, MLLM-as-judge, and traditional network-based methods. This unified and adaptive design enables more comprehensive and consistent evaluation of alignment, physical realism, and cross-modal coherence. Preliminary human studies further demonstrate that our adaptive tool selection achieves closer agreement with subjective human judgments. We hope this benchmark will serve as a foundation for objective comparisons and improvements, accelerating the transition from "visual generation" to "world generation." Our project can be found at https://yeppp27.github.io/4DWorldBench.github.io/.

4DWorldBench is a benchmark for evaluating 3D/4D World Generation Models, which underpin realistic, dynamic environments for applications such as virtual reality and autonomous driving. It assesses models along four dimensions: perceptual quality, condition-4D alignment, physical realism, and 4D consistency, addressing the need for a unified benchmark in a rapidly evolving field.

4DWorldBench: A Comprehensive Evaluation Framework for 3D/4D World Generation Models

arXiv:2511.18005v1 Announce Type: new 
Abstract: City-scale 3D generation is of great importance for the development of embodied intelligence and world models. Existing methods, however, face significant challenges regarding quality, fidelity, and scalability in 3D world generation. Thus, we propose RAISECity, a Reality-Aligned Intelligent Synthesis Engine that creates detailed, City-scale 3D worlds. We introduce an agentic framework that leverages diverse multimodal foundation tools to acquire real-world knowledge, maintain robust intermediate representations, and construct complex 3D scenes. This agentic design, featuring dynamic data processing, iterative self-reflection and refinement, and the invocation of advanced multimodal tools, minimizes cumulative errors and enhances overall performance. Extensive quantitative experiments and qualitative analyses validate the superior performance of RAISECity in real-world alignment, shape precision, texture fidelity, and aesthetics level, achieving over a 90% win-rate against existing baselines for overall perceptual quality. This combination of 3D quality, reality alignment, scalability, and seamless compatibility with computer graphics pipelines makes RAISECity a promising foundation for applications in immersive media, embodied intelligence, and world models.

RAISECity is a multimodal agent framework for city-scale 3D world generation that addresses the quality, fidelity, and scalability challenges current methods face. It uses diverse multimodal foundation tools to build detailed 3D environments, with intended applications in immersive media, embodied intelligence, and world models.

RAISECity: A Multimodal Agent Framework for Reality-Aligned 3D World Generation at City-Scale

arXiv:2505.11579v2 Announce Type: replace-cross 
Abstract: As AI systems evolve from static tools to dynamic agents, traditional categorical governance frameworks -- based on fixed risk tiers, levels of autonomy, or human oversight models -- are increasingly insufficient on their own. Systems built on foundation models, self-supervised learning, and multi-agent architectures increasingly blur the boundaries that categories were designed to police. In this Perspective, we make the case for dimensional governance: a framework that tracks how decision authority, process autonomy, and accountability (the 3As) distribute dynamically across human-AI relationships. A critical advantage of this approach is its ability to explicitly monitor system movement toward and across key governance thresholds, enabling preemptive adjustments before risks materialize. This dimensional approach provides the necessary foundation for more adaptive categorization, enabling thresholds and classifications that can evolve with emerging capabilities. While categories remain essential for decision-making, building them upon dimensional foundations allows for context-specific adaptability and stakeholder-responsive governance that static approaches cannot achieve. We outline key dimensions, critical trust thresholds, and practical examples illustrating where rigid categorical frameworks fail -- and where a dimensional mindset could offer a more resilient and future-proof path forward for both governance and innovation at the frontier of artificial intelligence.

The evolution of AI systems from static tools to dynamic agents necessitates a shift in governance frameworks, as traditional categorical models are increasingly inadequate. The proposed dimensional governance framework focuses on the dynamic distribution of decision authority, process autonomy, and accountability in human-AI relationships, aiming to preemptively address risks before they materialize.


