arXiv:2510.22964v1 Announce Type: new 
Abstract: Foundation models have transformed natural language processing and computer vision, and their impact is now reshaping remote sensing image analysis. With powerful generalization and transfer learning capabilities, they align naturally with the multimodal, multi-resolution, and multi-temporal characteristics of remote sensing data. To address unique challenges in the field, multimodal geospatial foundation models (GFMs) have emerged as a dedicated research frontier. This survey delivers a comprehensive review of multimodal GFMs from a modality-driven perspective, covering five core visual and vision-language modalities. We examine how differences in imaging physics and data representation shape interaction design, and we analyze key techniques for alignment, integration, and knowledge transfer to tackle modality heterogeneity, distribution shifts, and semantic gaps. Advances in training paradigms, architectures, and task-specific adaptation strategies are systematically assessed alongside a wealth of emerging benchmarks. Representative multimodal visual and vision-language GFMs are evaluated across ten downstream tasks, with insights into their architectures, performance, and application scenarios. Real-world case studies, spanning land cover mapping, agricultural monitoring, disaster response, climate studies, and geospatial intelligence, demonstrate the practical potential of GFMs. Finally, we outline pressing challenges in domain generalization, interpretability, efficiency, and privacy, and chart promising avenues for future research.

A recent survey highlights the transformative impact of multimodal geospatial foundation models (GFMs) on remote sensing image analysis. These models build on advances in natural language processing and computer vision and offer strong generalization and transfer learning capabilities. That matters because remote sensing data is inherently multimodal and varies across resolutions and time, which poses challenges that GFMs are designed to address. The emergence of GFMs is set to enhance the accuracy and efficiency of geospatial analysis, making them a crucial development in the field.

Survey of Multimodal Geospatial Foundation Models: Techniques, Applications, and Challenges

arXiv:2503.21692v4 Announce Type: replace 
Abstract: The integration of multi-view imaging and pose estimation represents a significant advance in computer vision applications, offering new possibilities for understanding human movement and interactions. This work presents a new algorithm that improves multi-view multi-person pose estimation, focusing on fast triangulation speeds and good generalization capabilities. The approach extends to whole-body pose estimation, capturing details from facial expressions to finger movements across multiple individuals and viewpoints. Adaptability to different settings is demonstrated through strong performance across unseen datasets and configurations. To support further progress in this field, all of this work is publicly accessible.

A new algorithm named RapidPoseTriangulation has been introduced, enhancing multi-view multi-person whole-body human pose triangulation with remarkable speed and generalization capabilities. This advancement allows for detailed capture of human movements, including facial expressions and finger movements, across various individuals and viewpoints.

RapidPoseTriangulation: Multi-view Multi-person Whole-body Human Pose Triangulation in a Millisecond
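
For readers new to triangulation, the following is a minimal sketch of one classic way to recover a 3D keypoint from several camera views: least-squares intersection of viewing rays. It is not the RapidPoseTriangulation algorithm, whose details the abstract does not give. The function name `triangulate_rays` and the example camera positions are assumptions made for illustration only.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def _normalize(v: Sequence[float]) -> Vec3:
    """Scale a 3-vector to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    if n == 0.0:
        raise ValueError("ray direction must be non-zero")
    return (v[0] / n, v[1] / n, v[2] / n)


def _det3(m: List[List[float]]) -> float:
    """Determinant of a 3x3 matrix by cofactor expansion."""
    return (
        m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
        - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
        + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0])
    )


def triangulate_rays(origins: Sequence[Vec3], directions: Sequence[Vec3]) -> Vec3:
    """Return the 3D point closest (in least squares) to all viewing rays.

    Each ray starts at a camera centre `origins[i]` and points along
    `directions[i]` towards the observed keypoint. With P_i = I - d_i d_i^T,
    the solution x satisfies (sum P_i) x = sum (P_i o_i).
    """
    a = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for o, d in zip(origins, directions):
        d = _normalize(d)
        for r in range(3):
            for c in range(3):
                p = (1.0 if r == c else 0.0) - d[r] * d[c]
                a[r][c] += p
                b[r] += p * o[c]

    det = _det3(a)
    if abs(det) < 1e-12:
        raise ValueError("rays are parallel or too few; point is not determined")

    # Solve the 3x3 system with Cramer's rule.
    solution = []
    for k in range(3):
        a_k = [row[:] for row in a]
        for r in range(3):
            a_k[r][k] = b[r]
        solution.append(_det3(a_k) / det)
    return (solution[0], solution[1], solution[2])


if __name__ == "__main__":
    # Hypothetical setup: three cameras observing the same keypoint.
    target = (1.0, 2.0, 3.0)
    cams = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (0.0, 5.0, 0.0)]
    rays = [tuple(t - o for t, o in zip(target, cam)) for cam in cams]
    print(triangulate_rays(cams, rays))
```

In a multi-person, whole-body setting this solve would run once per keypoint per person, which is why per-point cost matters for the speeds the paper targets.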

arXiv:2503.09114v2 Announce Type: replace 
Abstract: The rapid rise of Language Models (LMs) has expanded the capabilities of natural language processing, powering applications from text generation to complex decision-making. While state-of-the-art LMs often boast hundreds of billions of parameters and are primarily deployed in data centers, recent trends show a growing focus on compact models, typically under 10 billion parameters, enabled by quantization and other model compression techniques. This shift paves the way for LMs on edge devices, offering potential benefits such as enhanced privacy, reduced latency, and improved data sovereignty. However, the inherent complexity of even these smaller models, combined with the limited computing resources of edge hardware, raises critical questions about the practical trade-offs in executing LM inference outside the cloud. To address these challenges, we present a comprehensive evaluation of generative LM inference on representative CPU-based and GPU-accelerated edge devices. Our study measures key performance indicators, including memory usage, inference speed, and energy consumption, across various device configurations. Additionally, we examine throughput-energy trade-offs, cost considerations, and usability, alongside an assessment of qualitative model performance. While quantization helps mitigate memory overhead, it does not fully eliminate resource bottlenecks, especially for larger models. Our findings quantify the memory and energy constraints that must be considered for practical real-world deployments, offering concrete insights into the trade-offs between model size, inference performance, and efficiency. The exploration of LMs at the edge is still in its early stages. We hope this study provides a foundation for future research, guiding the refinement of models, the enhancement of inference efficiency, and the advancement of edge-centric AI systems.

The rapid advancement of Language Models (LMs) has led to a shift towards compact models, typically under 10 billion parameters, which can be deployed on edge devices. This transition is driven by techniques like quantization and model compression, aiming to enhance privacy, reduce latency, and improve data sovereignty. However, the complexity of these models and the limited computing resources of edge hardware pose significant challenges for effective inference outside cloud environments.

Sometimes Painful but Certainly Promising: Feasibility and Trade-offs of Language Model Inference at the Edge
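
The abstract's point that quantization reduces but does not remove memory pressure can be illustrated with back-of-the-envelope arithmetic. The sketch below estimates weight storage only, as parameter count times bits per weight. The 7-billion-parameter model, the 8 GB device, and the 1 GB headroom are hypothetical figures, not measurements from the study, and the estimate ignores the KV cache, activations, and quantization metadata.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    if num_params < 0 or bits_per_weight <= 0:
        raise ValueError("num_params must be >= 0 and bits_per_weight > 0")
    return num_params * bits_per_weight / 8 / 1e9


def fits_on_device(
    num_params: float,
    bits_per_weight: float,
    device_memory_gb: float,
    headroom_gb: float = 1.0,  # assumed reserve for runtime, OS, and caches
) -> bool:
    """Rough check: do the weights plus a fixed headroom fit in device memory?"""
    needed = weight_memory_gb(num_params, bits_per_weight) + headroom_gb
    return needed <= device_memory_gb


if __name__ == "__main__":
    params = 7e9  # hypothetical compact model, under the 10B mark
    for bits in (16, 8, 4):
        size = weight_memory_gb(params, bits)
        ok = fits_on_device(params, bits, device_memory_gb=8.0)
        print(f"{bits:>2}-bit weights: {size:.1f} GB, fits in 8 GB: {ok}")
```

Under these assumptions, halving the bit width halves the weight footprint, but a fixed overhead remains, which is consistent with the abstract's remark that larger models stay constrained even after quantization.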

arXiv:2511.17184v1 Announce Type: new 
Abstract: News text classification is a crucial task in natural language processing, essential for organizing and filtering the massive volume of digital content. Traditional methods typically rely on statistical features like term frequencies or TF-IDF values, which are effective at capturing word-level importance but often fail to reflect contextual meaning. In contrast, modern deep learning approaches utilize semantic features to understand word usage within context, yet they may overlook simple, high-impact statistical indicators. This paper introduces an Attention-Guided Feature Fusion (AGFF) model that combines statistical and semantic features in a unified framework. The model applies an attention-based mechanism to dynamically determine the relative importance of each feature type, enabling more informed classification decisions. Through evaluation on benchmark news datasets, the AGFF model demonstrates superior performance compared to both traditional statistical models and purely semantic deep learning models. The results confirm that strategic integration of diverse feature types can significantly enhance classification accuracy. Additionally, ablation studies validate the contribution of each component in the fusion process. The findings highlight the model's ability to balance and exploit the complementary strengths of statistical and semantic representations, making it a practical and effective solution for real-world news classification tasks.

The Attention-Guided Feature Fusion (AGFF) model has been introduced to enhance news text classification by integrating both statistical and semantic features. This model employs an attention mechanism to assess the importance of each feature type, aiming to improve classification accuracy in the context of natural language processing.
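
To make the idea of attention-weighted fusion concrete, here is a minimal sketch in which a scoring vector assigns a softmax weight to each feature type and the fused representation is the weighted sum. This is not the AGFF architecture itself, which the abstract does not specify. The function `attention_fuse`, the scoring vector, and the toy feature vectors are assumptions for illustration, and in a real model the scoring parameters would be learned.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(
    features: Sequence[Sequence[float]],
    scoring_vector: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Fuse same-sized feature vectors with attention weights.

    Each feature vector gets a score (dot product with `scoring_vector`);
    the scores are normalized with softmax, and the fused vector is the
    weighted sum of the inputs. Returns (fused_vector, attention_weights).
    """
    dim = len(scoring_vector)
    if any(len(f) != dim for f in features):
        raise ValueError("all feature vectors must match the scoring vector size")
    scores = [sum(w * x for w, x in zip(scoring_vector, f)) for f in features]
    weights = softmax(scores)
    fused = [sum(a * f[i] for a, f in zip(weights, features)) for i in range(dim)]
    return fused, weights


if __name__ == "__main__":
    # Hypothetical inputs: a projected TF-IDF vector and a semantic embedding.
    statistical = [1.0, 0.0]
    semantic = [0.0, 1.0]
    fused, weights = attention_fuse([statistical, semantic], scoring_vector=[2.0, 0.5])
    print(fused, weights)
```

The weights sum to one, so the fused vector stays on the same scale as its inputs while the model shifts emphasis between statistical and semantic evidence per example.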

