arXiv:2502.07409v5 Announce Type: replace-cross 
Abstract: Whole slide pathology image classification is challenging because of gigapixel image sizes and scarce annotation labels, both of which hinder model generalization. This paper introduces a prompt learning method that adapts large vision-language models to few-shot pathology classification. We first extend the Prov-GigaPath vision foundation model, pre-trained on 1.3 billion pathology image tiles, into a vision-language model by adding adaptors and aligning it with medical text encoders via contrastive learning on 923K image-text pairs. The model is then used to extract visual features and text embeddings from few-shot annotations and is fine-tuned with learnable prompt embeddings. Unlike prior methods that combine prompts with frozen features using prefix embeddings or self-attention, we propose multi-granular attention, which models the interactions of learnable prompts with both individual image patches and groups of patches. This approach improves the model's ability to capture both fine-grained details and broader context, enhancing its recognition of complex patterns across sub-regions. To further improve accuracy, we use an (unbalanced) optimal transport-based visual-text distance that makes the model more robust by mitigating perturbations that might occur during data augmentation. Experiments on lung, kidney, and breast pathology modalities validate the effectiveness of our approach: it surpasses several recent competitors and consistently improves performance across diverse architectures, including CLIP, PLIP, and Prov-GigaPath integrated PLIP.
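To make the "(unbalanced) optimal transport-based visual-text distance" more concrete, here is a minimal NumPy sketch of entropic unbalanced OT with KL-relaxed marginals, solved by the generalized Sinkhorn scaling of Chizat et al. It is not MGPath's implementation: the abstract does not specify the cost, the marginal weights, or the regularization, so the cosine cost, uniform masses, `eps`, `tau`, and the function names `cosine_cost`, `unbalanced_sinkhorn`, and `ot_distance` are all hypothetical. Relaxing the marginals means a patch whose features were distorted (for example by augmentation) does not have to be fully matched to the prompts, which is one plausible reading of the robustness claim.

```python
import numpy as np


def cosine_cost(x: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Cost C[i, j] = 1 - cos(x_i, t_j) for patch features x (N, d) and prompt embeddings t (M, d)."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    return 1.0 - x @ t.T


def unbalanced_sinkhorn(C, a, b, eps=0.1, tau=1.0, n_iters=200):
    """Entropic unbalanced OT via generalized Sinkhorn scaling.

    Approximately solves
        min_P <C, P> + eps*KL(P | a b^T) + tau*KL(P 1 | a) + tau*KL(P^T 1 | b).
    """
    K = np.exp(-C / eps) * np.outer(a, b)  # Gibbs kernel relative to the reference measure a b^T
    fi = tau / (tau + eps)                  # exponent < 1 lets the marginals deviate from a and b
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = (a / (K @ v)) ** fi
        v = (b / (K.T @ u)) ** fi
    return u[:, None] * K * v[None, :]      # transport plan P = diag(u) K diag(v)


def ot_distance(patches, prompts, eps=0.1, tau=1.0):
    """Hypothetical visual-text distance: transport cost between patch features and prompt embeddings."""
    C = cosine_cost(patches, prompts)
    a = np.full(C.shape[0], 1.0 / C.shape[0])  # uniform mass on patches (assumption)
    b = np.full(C.shape[1], 1.0 / C.shape[1])  # uniform mass on prompts (assumption)
    P = unbalanced_sinkhorn(C, a, b, eps, tau)
    return float(np.sum(P * C)), P


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    patches = rng.normal(size=(16, 8))  # stand-in for 16 patch features from one slide
    class_prompts = {name: rng.normal(size=(4, 8)) for name in ("class_a", "class_b")}
    # A smaller transport cost means a better visual-text match, so -distance serves as the logit
    logits = {name: -ot_distance(patches, p)[0] for name, p in class_prompts.items()}
    print(max(logits, key=logits.get))
```

As `tau` grows, `fi` approaches 1 and the iterations reduce to ordinary balanced Sinkhorn; a smaller `tau` lets more mass go unmatched when the cost is high. In a real pipeline the random arrays would be replaced by the encoder's patch features and the learnable prompt embeddings for each class.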

The paper presents an approach to whole slide pathology image classification that uses prompt learning to adapt large vision-language models to few-shot scenarios. By extending the Prov-GigaPath model, which is pre-trained on 1.3 billion pathology image tiles, the work aims to improve model generalization despite the challenges posed by gigapixel images and limited annotations.

MGPATH: Vision-Language Model with Multi-Granular Prompt Learning for Few-Shot WSI Classification
