Aerial Vision-Language Navigation with a Unified Framework for Spatial, Temporal and Embodied Reasoning

arXiv — cs.CV · Wednesday, December 10, 2025 at 5:00:00 AM
  • A new framework for Aerial Vision-and-Language Navigation (VLN) has been introduced, enabling unmanned aerial vehicles (UAVs) to interpret natural language instructions and navigate urban environments using only egocentric monocular RGB observations. The approach simplifies the navigation pipeline by jointly optimizing spatial perception, trajectory reasoning, and action prediction within a single model via prompt-guided multi-task learning.
  • This development is significant as it reduces the complexity and cost associated with existing methods, which often rely on panoramic images and depth inputs. By streamlining the navigation process, the framework enhances the feasibility of deploying lightweight UAVs for various applications, including inspection, search-and-rescue, and delivery.
  • The advancement reflects a broader trend in UAV technology, where innovations such as large language models and enhanced tracking frameworks are being integrated to improve operational efficiency. As UAVs become increasingly vital in sectors like disaster response and logistics, the push for more autonomous and intelligent systems continues to grow, addressing challenges such as occlusion in search operations and the need for real-time data processing.
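The digest does not detail the paper's architecture, but "prompt-guided multi-task learning" generally means conditioning a shared model on a per-task prompt so one set of weights serves several objectives. The sketch below is a minimal, hypothetical illustration of that pattern for the three tasks named above; all names, dimensions, and the random linear head are stand-ins, not the paper's actual design.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- illustrative only, not from the paper.
FEAT_DIM, PROMPT_DIM, OUT_DIM = 16, 4, 3
TASKS = ["spatial_perception", "trajectory_reasoning", "action_prediction"]

# One prompt vector per task (learned in practice; fixed random stand-ins here).
prompts = {t: rng.normal(size=PROMPT_DIM) for t in TASKS}
# A single shared head applied to [image features ; task prompt].
W = rng.normal(size=(FEAT_DIM + PROMPT_DIM, OUT_DIM))

def task_output(features, task):
    """Concatenate the task prompt to shared features, apply the shared head."""
    x = np.concatenate([features, prompts[task]])
    return x @ W

def multi_task_loss(features, targets):
    """Joint objective: sum of per-task squared errors over all three tasks."""
    return sum(np.mean((task_output(features, t) - targets[t]) ** 2)
               for t in TASKS)

features = rng.normal(size=FEAT_DIM)            # stand-in for RGB encoder output
targets = {t: np.zeros(OUT_DIM) for t in TASKS}  # dummy supervision targets
loss = multi_task_loss(features, targets)
```

The key property this captures is that the same weights `W` produce different task behaviors purely because the prompt changes, which is what lets a single lightweight model replace separate per-task modules.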
— via World Pulse Now AI Editorial System

