arXiv:2511.08978v1 Announce Type: cross 
Abstract: Navigation and ride-sharing apps have collected large numbers of images tagged with spatio-temporal data. A core technology for analyzing such images is Traffic Scene Understanding (TSU), which aims to provide a comprehensive description of the traffic scene. Unlike traditional spatio-temporal data analysis tasks, TSU depends on both spatio-temporal and visual-textual data, which introduces distinct challenges. However, recent research often treats TSU as a generic image understanding task, ignoring the spatio-temporal information and overlooking the interrelations between different aspects of the traffic scene. To address these issues, we propose a novel SpatioTemporal Enhanced Model based on CLIP (ST-CLIP) for TSU. Our model uses the classic vision-language model CLIP as the backbone and designs a Spatio-temporal Context Aware Multi-aspect Prompt (SCAMP) learning method to incorporate spatio-temporal information into TSU. The prompt learning method consists of two components: a dynamic spatio-temporal context representation module that extracts representation vectors of spatio-temporal data for each traffic scene image, and a bi-level ST-aware multi-aspect prompt learning module that integrates the ST-context representation vectors into the word embeddings of prompts for the CLIP model. The second module also extracts low-level visual features and image-wise high-level semantic features to exploit interactive relations among different aspects of traffic scenes. To the best of our knowledge, this is the first attempt to integrate spatio-temporal information into vision-language models to facilitate the TSU task. Experiments on two real-world datasets demonstrate superior performance in complex scene understanding scenarios with a few-shot learning strategy.
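
The general idea of folding a spatio-temporal context vector into prompt word embeddings can be illustrated with a toy example. The sketch below is not the paper's implementation: the feature encoding, the additive shift, the dimensions, and every name in it (`st_features`, `linear`, `build_prompt`) are assumptions made for illustration, and random values stand in for parameters that would normally be learned.

```python
import math
import random

EMBED_DIM = 8       # toy embedding width; real CLIP text embeddings are much wider
NUM_CTX_TOKENS = 4  # number of learnable prompt tokens (hypothetical choice)


def st_features(lat, lon, hour):
    """Encode location and time of day as a small numeric feature vector."""
    angle = 2.0 * math.pi * (hour % 24) / 24.0
    # Cyclic encoding keeps 23:00 and 00:00 close together.
    return [lat / 90.0, lon / 180.0, math.sin(angle), math.cos(angle)]


def linear(weights, x):
    """Multiply a (rows x len(x)) weight matrix by the vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def build_prompt(ctx_tokens, st_vector, class_embedding):
    """Shift each learnable context token by the ST vector, then append the class token."""
    shifted = [[t + s for t, s in zip(token, st_vector)] for token in ctx_tokens]
    return shifted + [class_embedding]


rng = random.Random(0)
# Randomly initialised stand-ins for parameters that training would fit.
projection = [[rng.uniform(-0.1, 0.1) for _ in range(4)] for _ in range(EMBED_DIM)]
ctx_tokens = [[rng.uniform(-0.1, 0.1) for _ in range(EMBED_DIM)]
              for _ in range(NUM_CTX_TOKENS)]
class_embedding = [rng.uniform(-0.1, 0.1) for _ in range(EMBED_DIM)]

# Project the ST features of one image into the embedding space and build its prompt.
st_vector = linear(projection, st_features(lat=39.9, lon=116.4, hour=8))
prompt = build_prompt(ctx_tokens, st_vector, class_embedding)
```

In a full pipeline, the resulting token sequence would be passed through the CLIP text encoder and compared with the image embedding. Here the point is only that the same class word yields a different prompt for each image's place and time.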

A new model for Traffic Scene Understanding (TSU) has been proposed to address the challenges of integrating spatio-temporal and visual-textual data. The SpatioTemporal Enhanced Model based on CLIP (ST-CLIP) uses a novel prompt learning method to feed spatio-temporal context into the analysis of traffic scene images, such as those collected by navigation and ride-sharing apps. The authors report superior performance on two real-world datasets in complex scene understanding scenarios under a few-shot learning strategy.

Spatio-Temporal Data Enhanced Vision-Language Model for Traffic Scene Understanding
