arXiv:2510.27442v1 Announce Type: new 
Abstract: Vision Transformers (ViTs) have demonstrated strong potential in medical imaging; however, their high computational demands and tendency to overfit on small datasets limit their applicability in real-world clinical scenarios. In this paper, we present CoMViT, a compact and generalizable Vision Transformer architecture optimized for resource-constrained medical image analysis. CoMViT integrates a convolutional tokenizer, diagonal masking, dynamic temperature scaling, and pooling-based sequence aggregation to improve performance and generalization. Through systematic architectural optimization, CoMViT achieves robust performance across twelve MedMNIST datasets while maintaining a lightweight design with only ~4.5M parameters. It matches or outperforms deeper CNN and ViT variants, offering up to 5-20x parameter reduction without sacrificing accuracy. Qualitative Grad-CAM analyses show that CoMViT consistently attends to clinically relevant regions despite its compact size. These results highlight the potential of principled ViT redesign for developing efficient and interpretable models in low-resource medical imaging settings.
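The abstract names diagonal masking, dynamic temperature scaling, and pooling-based sequence aggregation without defining them. The following is a minimal sketch of one common reading of those terms, not the authors' implementation: it assumes diagonal masking means each token is barred from attending to itself, that the temperature is a scalar dividing the attention logits (fixed here, presumably learned during training), and that aggregation is a plain mean over tokens. The names `masked_attention_weights` and `mean_pool` are hypothetical.

```python
import math


def masked_attention_weights(queries, keys, temperature=1.0):
    """Row-wise softmax over dot-product logits with the diagonal masked out.

    Assumption: "diagonal masking" is read as removing each token's
    attention to itself; "temperature" is a scalar divisor on the logits.
    """
    n = len(queries)
    if n < 2:
        raise ValueError("need at least two tokens when the diagonal is masked")
    weights = []
    for i in range(n):
        logits = [
            sum(q * k for q, k in zip(queries[i], keys[j])) / temperature
            for j in range(n)
        ]
        logits[i] = float("-inf")  # token i cannot attend to itself
        peak = max(logits)  # subtract the row maximum for numerical stability
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


def mean_pool(tokens):
    """Aggregate a token sequence into one vector by averaging each dimension."""
    return [sum(column) / len(tokens) for column in zip(*tokens)]
```

Under these assumptions, a lower temperature sharpens each row's attention distribution, and the pooled vector stands in for a dedicated class token.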

CoMViT is a compact Vision Transformer architecture aimed at two obstacles to using ViTs in clinical settings: high computational demands and overfitting on small datasets. With roughly 4.5M parameters, it is reported to match or outperform deeper CNN and ViT variants across twelve MedMNIST datasets, which could make AI-based diagnostic tools more practical in resource-constrained environments.

CoMViT: An Efficient Vision Backbone for Supervised Classification in Medical Imaging

arXiv:2512.15372v1 Announce Type: cross 
Abstract: Vision transformers in vision-language models apply uniform computational effort across all images, expending 175.33 GFLOPs (ViT-L/14) whether analysing a straightforward product photograph or a complex street scene. We propose ICAR (Image Complexity-Aware Retrieval), which enables vision transformers to use less compute for simple images whilst processing complex images through their full network depth. The key challenge is maintaining cross-modal alignment: embeddings from different processing depths must remain compatible for text matching. ICAR solves this through dual-path training that produces compatible embeddings from both reduced-compute and full-compute processing. This maintains compatibility between image representations and text embeddings in the same semantic space, whether an image exits early or processes fully. Unlike existing two-stage approaches that require expensive reranking, ICAR enables direct image-text matching without additional overhead. To determine how much compute to use, we develop ConvNeXt-IC, which treats image complexity assessment as a classification task. By applying modern classifier backbones rather than specialised architectures, ConvNeXt-IC achieves state-of-the-art performance with 0.959 correlation with human judgement (Pearson) and 4.4x speedup. Evaluated on standard benchmarks augmented with real-world web data, ICAR achieves 20% practical speedup while maintaining category-level performance and 95% of instance-level performance, enabling sustainable scaling of vision-language systems.
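The routing idea in the abstract can be illustrated with a small sketch: a complexity score decides whether an image runs through every layer or only a prefix of them. This is an illustration under stated assumptions, not ICAR itself. The `threshold` and `early_depth` parameters, the function name `encode_with_early_exit`, and the toy layers are all hypothetical; the score is assumed to come from a separate classifier such as ConvNeXt-IC, and the dual-path training that keeps early-exit embeddings compatible with text embeddings is not shown.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Layer = Callable[[List[float]], List[float]]


def encode_with_early_exit(
    features: Sequence[float],
    layers: Sequence[Layer],
    complexity: float,
    threshold: float = 0.5,
    early_depth: Optional[int] = None,
) -> Tuple[List[float], int]:
    """Run all layers for complex images and only a prefix for simple ones.

    Returns the output features and the number of layers actually applied.
    """
    if early_depth is None:
        early_depth = len(layers) // 2  # hypothetical default exit point
    depth = len(layers) if complexity >= threshold else early_depth
    out = list(features)
    for layer in layers[:depth]:
        out = layer(out)
    return out, depth
```

A simple image (low score) stops at `early_depth`, while a complex one uses the full stack; the compute saved depends on where the exit point sits and how many images fall below the threshold.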

A new approach called Image Complexity-Aware Retrieval (ICAR) lets vision transformers in vision-language models allocate compute according to image complexity. Simple images are processed with less compute, while complex images pass through the full network depth. Dual-path training keeps the embeddings from both paths compatible with text embeddings, so image-text matching works directly without a reranking stage.

Image Complexity-Aware Adaptive Retrieval for Efficient Vision-Language Models

One More Thing in AI – Your Shortcut to AI Mastery
