arXiv:2512.10596v1 Announce Type: new 
Abstract: Semantic retrieval of remote sensing (RS) images is a critical task fundamentally challenged by the "semantic gap", the discrepancy between a model's low-level visual features and high-level human concepts. While large Vision-Language Models (VLMs) offer a promising path to bridge this gap, existing methods often rely on costly, domain-specific training, and there is a lack of benchmarks to evaluate the practical utility of VLM-generated text in a zero-shot retrieval context. To address this research gap, we introduce the Remote Sensing Rich Text (RSRT) dataset, a new benchmark featuring multiple structured captions per image. Based on this dataset, we propose a fully training-free, text-only retrieval reference called TRSLLaVA. Our methodology reformulates cross-modal retrieval as a text-to-text (T2T) matching problem, leveraging rich text descriptions as queries against a database of VLM-generated captions within a unified textual embedding space. This approach completely bypasses model training or fine-tuning. Experiments on the RSITMD and RSICD benchmarks show our training-free method is highly competitive with state-of-the-art supervised models. For instance, on RSITMD, our method achieves a mean Recall of 42.62%, nearly doubling the 23.86% of the standard zero-shot CLIP baseline and surpassing several top supervised models. This validates that high-quality semantic representation through structured text provides a powerful and cost-effective paradigm for remote sensing image retrieval.
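
The text-to-text formulation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the embedding model or the captioning prompt, so a bag-of-words count vector stands in for the text embedding, and the captions, queries, image ids, and function names (`embed`, `cosine`, `retrieve`, `recall_at_k`) are all hypothetical. The sketch only shows the shape of the pipeline: embed the query and every stored caption in the same text space, rank captions by cosine similarity, and score the ranking with Recall@K.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence-embedding model: a bag-of-words count vector.

    Assumption: a real system would call a learned text encoder here.
    """
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, captions, k=1):
    """Return the ids of the k images whose caption is most similar to the query."""
    q = embed(query)
    ranked = sorted(
        captions,
        key=lambda image_id: cosine(q, embed(captions[image_id])),
        reverse=True,
    )
    return ranked[:k]


def recall_at_k(queries, captions, k):
    """Fraction of (query, true image id) pairs whose true id is in the top k."""
    hits = sum(true_id in retrieve(q, captions, k) for q, true_id in queries)
    return hits / len(queries)


# Hypothetical captions, standing in for VLM-generated descriptions of each image.
captions = {
    "img_airport": "an airport with two runways and several parked planes",
    "img_harbor": "a harbor with many boats docked along a pier",
    "img_farmland": "green farmland divided into rectangular fields",
}

# Hypothetical text queries paired with the image each one should retrieve.
queries = [
    ("planes parked near runways", "img_airport"),
    ("boats at a pier", "img_harbor"),
    ("rectangular green fields", "img_farmland"),
]

print(retrieve("planes parked near runways", captions, k=1))
print(recall_at_k(queries, captions, k=1))
```

Because images are represented only by their captions, no image encoder is trained or even run at query time; retrieval quality depends on how well the captions and the text embedding capture the scene. Swapping `embed` for a learned sentence encoder would leave the rest of the sketch unchanged.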

A new training-free framework for remote sensing image retrieval, named TRSLLaVA, has been introduced. It uses the Remote Sensing Rich Text (RSRT) dataset, which provides multiple structured captions per image, to enhance semantic retrieval.

Beyond Pixels: A Training-Free, Text-to-Text Framework for Remote Sensing Image Retrieval
