arXiv:2512.02700v1 Announce Type: cross 
Abstract: Vision-language models (VLMs) excel at image understanding tasks, but the large number of visual tokens imposes significant computational costs, hindering deployment on mobile devices. Many pruning methods rely solely on token importance and thus overlook inter-token redundancy, retaining numerous duplicated tokens and wasting capacity. Although some redundancy-aware approaches have been proposed, they often ignore the spatial relationships among visual tokens. This can lead to overly sparse selections of retained tokens that fail to adequately cover the regions of target objects. To address these limitations, we propose VLM-Pruner, a training-free token pruning algorithm that explicitly balances redundancy and spatial sparsity. We introduce a centrifugal token pruning paradigm that enables near-to-far selection while prioritizing the preservation of fine-grained object details. Moreover, we design a Buffering for Spatial Sparsity (BSS) criterion that defers the selection of spatially distant tokens. We further adopt a parallel greedy strategy to conduct token selection efficiently. To mitigate information loss from pruning, we selectively fuse salient information from the discarded tokens into the retained ones. Comprehensive comparisons demonstrate that VLM-Pruner consistently outperforms strong baselines across five VLMs with an 88.9\% pruning rate, while delivering an end-to-end inference speedup.

تم تقديم VLM-Pruner كخوارزمية لتقليص الرموز بدون تدريب، مصممة لتحسين كفاءة نماذج الرؤية-اللغة (VLM) من خلال معالجة التكاليف الحاسوبية المرتبطة بعدد كبير من الرموز البصرية. هذه الطريقة توازن بين التكرار والندرة المكانية، مما يضمن الحفاظ على التفاصيل المهمة للأجسام مع تقليل تكرار الرموز غير الضروري.

Se ha presentado VLM-Pruner como un algoritmo de poda de tokens sin entrenamiento, diseñado para mejorar la eficiencia de los modelos de visión-lenguaje (VLM) al abordar los costos computacionales asociados con un gran número de tokens visuales. Este método equilibra la redundancia y la escasez espacial, asegurando que se conserven los detalles importantes de los objetos mientras se reduce la duplicación innecesaria de tokens.

VLM-Pruner a été introduit comme un algorithme de taille de jetons sans entraînement, conçu pour améliorer l'efficacité des modèles de vision-langage (VLM) en s'attaquant aux coûts computationnels associés à un grand nombre de jetons visuels. Cette méthode équilibre la redondance et la rareté spatiale, garantissant que les détails importants des objets sont préservés tout en réduisant la duplication inutile des jetons.

VLM-Pruner has been introduced as a training-free token pruning algorithm designed to enhance the efficiency of vision-language models (VLMs) by addressing the computational costs associated with a large number of visual tokens. This method balances redundancy and spatial sparsity, ensuring that important object details are preserved while reducing unnecessary token duplication.

VLM-Pruner: Buffering for Spatial Sparsity in an Efficient VLM Centrifugal Token Pruning Paradigm

arXiv:2512.03477v1 Announce Type: cross 
Abstract: Vision-language models achieve expert-level performance on medical imaging tasks but exhibit significant diagnostic accuracy disparities across demographic groups. We introduce fairness-aware Low-Rank Adaptation for medical VLMs, combining parameter efficiency with explicit fairness optimization. Our key algorithmic contribution is a differentiable MaxAccGap loss that enables end-to-end optimization of accuracy parity across demographic groups. We propose three methods: FR-LoRA integrates MaxAccGap regularization into the training objective, GR-LoRA applies inverse frequency weighting to balance gradient contributions, and Hybrid-LoRA combines both mechanisms.Evaluated on 10,000 glaucoma fundus images, GR-LoRA reduces diagnostic accuracy disparities by 69% while maintaining 53.15% overall accuracy. Ablation studies reveal that strong regularization strength achieves optimal fairness with minimal accuracy trade-off, and race-specific optimization yields 60% disparity reduction. Our approach requires only 0.24% trainable parameters, enabling practical deployment of fair medical AI in resource-constrained healthcare settings.

قدمت دراسة حديثة تقنيات تكييف منخفضة الرتبة واعية بالعدالة لنماذج الرؤية-اللغة (VLM) تهدف إلى تحسين دقة التشخيص في التصوير الطبي، وخاصة لتشخيص الزرق. تركز الطرق المقترحة، بما في ذلك FR-LoRA و GR-LoRA، على تقليل الفجوات في دقة التشخيص بين المجموعات السكانية مع الحفاظ على الأداء العام. أظهرت التقييمات على 10,000 صورة قاع زرقية انخفاضًا كبيرًا في الفجوات التشخيصية بنسبة 69% باستخدام GR-LoRA.

Un estudio reciente ha introducido técnicas de adaptación de bajo rango conscientes de la equidad para modelos de visión-lenguaje (VLM) con el objetivo de mejorar la precisión diagnóstica en imágenes médicas, específicamente para el diagnóstico de glaucoma. Los métodos propuestos, incluidos FR-LoRA y GR-LoRA, se centran en reducir las disparidades de precisión entre grupos demográficos mientras mantienen el rendimiento general. Las evaluaciones en 10,000 imágenes de fondo de glaucoma demostraron una reducción significativa de las disparidades diagnósticas del 69% con GR-LoRA.

Une étude récente a introduit des techniques d'adaptation à faible rang conscientes de l'équité pour les modèles de vision-langage (VLM) visant à améliorer la précision diagnostique en imagerie médicale, spécifiquement pour le diagnostic du glaucome. Les méthodes proposées, y compris FR-LoRA et GR-LoRA, se concentrent sur la réduction des disparités de précision entre les groupes démographiques tout en maintenant la performance globale. Les évaluations sur 10 000 images de fondus de glaucome ont montré une réduction significative des disparités diagnostiques de 69 % avec GR-LoRA.

A recent study has introduced fairness-aware Low-Rank Adaptation techniques for vision-language models (VLMs) aimed at improving diagnostic accuracy in medical imaging, specifically for glaucoma diagnosis. The proposed methods, including FR-LoRA and GR-LoRA, focus on reducing accuracy disparities across demographic groups while maintaining overall performance. Evaluations on 10,000 glaucoma fundus images demonstrated a significant reduction in diagnostic disparities by 69% with GR-LoRA.

VLM-Pruner: Buffering for Spatial Sparsity in an Efficient VLM Centrifugal Token Pruning Paradigm

Was this article worth reading? Share it

LucidQuery AI

CodeSpaced

Blunge