arXiv:2505.17685v3 Announce Type: replace 
Abstract: Vision-Language-Action (VLA) models offer significant potential for end-to-end driving, yet their reasoning is often constrained by textual Chains-of-Thought (CoT). This symbolic compression of visual information creates a modality gap between perception and planning by blurring spatio-temporal relations and discarding fine-grained cues. We introduce FSDrive, a framework that empowers VLAs to "think visually" using a novel visual spatio-temporal CoT. FSDrive first operates as a world model, generating a unified future frame that combines a predicted background with explicit, physically plausible priors such as future lane dividers and 3D object boxes. This imagined scene serves as the visual spatio-temporal CoT, capturing both spatial structure and temporal evolution in a single representation. The same VLA then functions as an inverse-dynamics model to plan trajectories conditioned on current observations and this visual CoT. We enable this with a unified pre-training paradigm that expands the model's vocabulary with visual tokens and jointly optimizes for semantic understanding (VQA) and future-frame prediction. A progressive curriculum first generates structural priors to enforce physical laws before rendering the full scene. Evaluations on nuScenes and NAVSIM show FSDrive improves trajectory accuracy and reduces collisions, while also achieving competitive FID for video generation with a lightweight autoregressive model and advancing scene understanding on DriveLM. These results confirm that our visual spatio-temporal CoT bridges the perception-planning gap, enabling safer, more anticipatory autonomous driving. Code is available at https://github.com/MIV-XJTU/FSDrive.
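
The vocabulary expansion and the progressive ordering described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how visual tokens are indexed, so the code assumes the common convention of appending one new id per image-codebook entry after the existing text ids. The names `UnifiedVocab` and `build_sequence` and all sizes are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class UnifiedVocab:
    """Text vocabulary extended with one id per visual codebook entry.

    Assumption (not from the paper): visual ids are appended directly
    after the text ids, so visual code c maps to text_size + c.
    """
    text_size: int      # hypothetical size of the original text vocabulary
    codebook_size: int  # hypothetical size of the image tokenizer codebook

    @property
    def total_size(self) -> int:
        return self.text_size + self.codebook_size

    def is_visual(self, token_id: int) -> bool:
        return self.text_size <= token_id < self.total_size

    def visual_to_id(self, code: int) -> int:
        if not 0 <= code < self.codebook_size:
            raise ValueError(f"visual code {code} outside codebook")
        return self.text_size + code

    def id_to_visual(self, token_id: int) -> int:
        if not self.is_visual(token_id):
            raise ValueError(f"token id {token_id} is not a visual token")
        return token_id - self.text_size


def build_sequence(vocab: UnifiedVocab,
                   prompt_ids: List[int],
                   prior_codes: List[int],
                   frame_codes: List[int]) -> List[int]:
    """Lay out one training sequence in the progressive order:
    text prompt, then structural priors (lanes, boxes), then full frame."""
    ids = list(prompt_ids)
    ids += [vocab.visual_to_id(c) for c in prior_codes]
    ids += [vocab.visual_to_id(c) for c in frame_codes]
    return ids


vocab = UnifiedVocab(text_size=1000, codebook_size=16)
seq = build_sequence(vocab, prompt_ids=[5, 7], prior_codes=[0, 3], frame_codes=[15])
print(seq)
```

Because text and visual ids share one index space, a single autoregressive head can emit either kind of token. That shared space is what would let the same model answer VQA prompts and predict future frames, under the assumptions stated above.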

FSDrive is a new framework for Vision-Language-Action (VLA) models that lets them 'think visually' through a visual spatio-temporal Chain-of-Thought (CoT). Rather than compressing the scene into a textual CoT, the model first imagines a future frame that captures both spatial structure and temporal evolution, then plans a trajectory conditioned on it. Evaluations on nuScenes and NAVSIM show improved trajectory accuracy and fewer collisions.

FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving
