DynamicVL: Benchmarking Multimodal Large Language Models for Dynamic City Understanding

arXiv — cs.CV — Tuesday, October 28, 2025 at 4:00:00 AM
DynamicVL introduces the DVL-Suite, a framework and benchmark for understanding long-term city dynamics from high-resolution remote sensing imagery. It evaluates how well multimodal large language models can analyze multi-temporal urban data, which could improve how urban changes are monitored and responded to over time. This matters because it opens new possibilities for urban planning and environmental management, helping cities adapt to challenges such as climate change.
— via World Pulse Now AI Editorial System


Continue Reading
SkyMoE: A Vision-Language Foundation Model for Enhancing Geospatial Interpretation with Mixture of Experts
Positive — Artificial Intelligence
SkyMoE has been introduced as a Mixture-of-Experts (MoE) vision-language model designed to improve geospatial interpretation, particularly for remote sensing tasks. It addresses a limitation of existing general-purpose vision-language models by employing an adaptive router that generates task-specific routing instructions, allowing the model to distinguish between different tasks and interpretation granularities.
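To make the idea of task-conditioned routing concrete, the sketch below shows a minimal Mixture-of-Experts layer whose gating depends on a task embedding as well as the token itself. This is a generic illustration of the technique, not SkyMoE's actual architecture; all class names, dimensions, and the top-k routing scheme are assumptions.

```python
# Minimal sketch of task-conditioned Mixture-of-Experts routing (illustrative only;
# names and dimensions are hypothetical, not taken from the SkyMoE paper).
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveRouter(nn.Module):
    """Produces expert gating weights conditioned on the token and a task embedding."""

    def __init__(self, hidden_dim: int, task_dim: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(hidden_dim + task_dim, num_experts)

    def forward(self, tokens: torch.Tensor, task_emb: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq, hidden_dim); task_emb: (batch, task_dim)
        task = task_emb.unsqueeze(1).expand(-1, tokens.size(1), -1)
        logits = self.gate(torch.cat([tokens, task], dim=-1))
        return F.softmax(logits, dim=-1)  # (batch, seq, num_experts)


class TaskConditionedMoE(nn.Module):
    """Routes each token to its top-k experts; routing depends on the task embedding."""

    def __init__(self, hidden_dim: int, task_dim: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.router = AdaptiveRouter(hidden_dim, task_dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden_dim, 4 * hidden_dim), nn.GELU(),
                          nn.Linear(4 * hidden_dim, hidden_dim))
            for _ in range(num_experts)
        ])
        self.top_k = top_k

    def forward(self, tokens: torch.Tensor, task_emb: torch.Tensor) -> torch.Tensor:
        weights = self.router(tokens, task_emb)               # (B, S, E)
        topk_w, topk_idx = weights.topk(self.top_k, dim=-1)   # keep top-k experts per token
        topk_w = topk_w / topk_w.sum(dim=-1, keepdim=True)    # renormalize kept weights
        out = torch.zeros_like(tokens)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[..., slot] == e               # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += topk_w[..., slot][mask].unsqueeze(-1) * expert(tokens[mask])
        return out


# Usage example: the same image-patch tokens can be routed differently under
# different (placeholder) task embeddings, e.g. classification vs. change detection.
moe = TaskConditionedMoE(hidden_dim=256, task_dim=32)
patches = torch.randn(2, 196, 256)
task = torch.randn(2, 32)
print(moe(patches, task).shape)  # torch.Size([2, 196, 256])
```

The key design choice illustrated here is that the gating network sees the task signal, so expert selection can shift with the task or interpretation granularity rather than being driven by token content alone.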