OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model

arXiv — cs.CV · Monday, November 24, 2025 at 5:00:00 AM
  • OpenDriveVLA has been introduced as a Vision Language Action (VLA) model for end-to-end autonomous driving; it builds on open-source large language models to generate spatially grounded driving actions from multimodal inputs that combine visual environment representations with language commands (a rough architectural sketch follows this summary).
  • The work is significant because it strengthens the ability of autonomous driving systems to understand and react to complex environments, potentially leading to safer and more efficient vehicle navigation in real-world scenarios.
  • The advance reflects ongoing efforts in autonomous driving research to improve generalization and scene understanding, addressing challenges such as over-reliance on ego vehicle status and the integration of diverse data sources for better trajectory planning.
— via World Pulse Now AI Editorial System
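
To make the summarized pipeline concrete, below is a minimal sketch of a vision-language-action driving policy in PyTorch: a vision encoder turns camera images into tokens, a projector maps them into the language backbone's embedding space, they are concatenated with an embedded driving command, and an action head decodes future ego waypoints. All module names, dimensions, and the waypoint head are illustrative assumptions; this is not OpenDriveVLA's released code.

```python
# Hedged sketch of a vision-language-action (VLA) driving policy.
# Everything here (ToyDrivingVLA, layer sizes, the waypoint head) is an
# assumption for illustration, not the OpenDriveVLA implementation.
import torch
import torch.nn as nn

class ToyDrivingVLA(nn.Module):
    def __init__(self, d_model=256, n_waypoints=6, vocab_size=32000):
        super().__init__()
        # Vision encoder: stand-in for a pretrained multi-view image backbone.
        self.vision_encoder = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=8, stride=8),
            nn.ReLU(),
            nn.Conv2d(64, d_model, kernel_size=2, stride=2),
        )
        # Projector maps visual features into the language model's token space.
        self.projector = nn.Linear(d_model, d_model)
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # Stand-in for the open-source LLM backbone.
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        # Action head: regresses future ego waypoints (x, y) from the last token.
        self.action_head = nn.Linear(d_model, n_waypoints * 2)

    def forward(self, images, command_ids):
        # images: (B, 3, H, W) camera input; command_ids: (B, T) tokenized command
        vis = self.vision_encoder(images)                 # (B, D, h, w)
        vis_tokens = vis.flatten(2).transpose(1, 2)       # (B, h*w, D)
        vis_tokens = self.projector(vis_tokens)
        txt_tokens = self.text_embed(command_ids)         # (B, T, D)
        seq = torch.cat([vis_tokens, txt_tokens], dim=1)  # multimodal prefix
        hidden = self.backbone(seq)
        waypoints = self.action_head(hidden[:, -1])       # (B, n_waypoints * 2)
        return waypoints.view(-1, waypoints.shape[-1] // 2, 2)

model = ToyDrivingVLA()
traj = model(torch.randn(2, 3, 128, 128), torch.randint(0, 32000, (2, 12)))
print(traj.shape)  # torch.Size([2, 6, 2]) -> six (x, y) waypoints per sample
```

Regressing waypoints from the final hidden state is just one simple way to turn a multimodal prefix into an action; the actual model may instead decode action or trajectory tokens autoregressively.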

Continue Reading
A Unified Voxel Diffusion Module for Point Cloud 3D Object Detection
Positive · Artificial Intelligence
A Voxel Diffusion Module (VDM) has been proposed to enhance voxel-level representation and feature diffusion in point cloud data, addressing the detection-accuracy limitations of traditional voxel-based representations. The module integrates sparse 3D convolutions and residual connections to improve how point cloud data is processed in 3D object detection tasks.
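
As a rough illustration of the described design, the sketch below applies 3D convolutions with a residual connection over a voxel grid. Dense nn.Conv3d layers stand in for the sparse 3D convolutions mentioned in the summary (point cloud pipelines usually rely on a sparse convolution library), and all shapes and channel counts are assumptions rather than the paper's configuration.

```python
# Hedged sketch of a voxel residual block with 3D convolutions.
# Dense Conv3d is a stand-in for sparse 3D convolutions; shapes are assumed.
import torch
import torch.nn as nn

class VoxelResidualBlock(nn.Module):
    def __init__(self, channels=32):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.norm = nn.BatchNorm3d(channels)
        self.act = nn.ReLU()

    def forward(self, voxels):
        # voxels: (B, C, D, H, W) voxelized point-cloud features
        out = self.act(self.norm(self.conv1(voxels)))
        out = self.conv2(out)
        return self.act(out + voxels)  # residual connection preserves the input signal

block = VoxelResidualBlock()
feats = torch.randn(1, 32, 16, 128, 128)  # toy voxel grid
print(block(feats).shape)  # torch.Size([1, 32, 16, 128, 128])
```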