A Study on Inference Latency for Vision Transformers on Mobile Devices

arXiv — cs.CV · Thursday, October 30, 2025, 4:00 AM
A recent study examines the inference latency of vision transformers (ViTs) on mobile devices, comparing them with traditional convolutional neural networks (CNNs). As machine learning becomes standard in mobile technology, understanding how these models perform under on-device constraints is crucial for developers and researchers. The study highlights the strengths and weaknesses of ViTs and offers insights that could make mobile computer-vision applications faster and more efficient.
— via World Pulse Now AI Editorial System
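The study's benchmark protocol is not reproduced here, but the basic measurement is straightforward. Below is a minimal sketch using desktop PyTorch and torchvision models as a stand-in for an on-device runtime (the model choices are assumptions; a real mobile benchmark would export the networks to TFLite, Core ML, or ExecuTorch and time them on the handset): warm up, then average wall-clock time over repeated single-image forward passes.

```python
# Minimal latency sketch (a desktop-CPU proxy, not the paper's protocol):
# compares a ViT against a mobile-oriented CNN from torchvision.
import time
import torch
import torchvision.models as models

def measure_latency(model, input_size=224, warmup=5, runs=20):
    """Return mean single-image inference latency in milliseconds."""
    model.eval()
    x = torch.randn(1, 3, input_size, input_size)
    with torch.no_grad():
        for _ in range(warmup):       # warm-up passes are discarded
            model(x)
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
        elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / runs

if __name__ == "__main__":
    vit = models.vit_b_16(weights=None)            # vision transformer
    cnn = models.mobilenet_v3_large(weights=None)  # mobile-oriented CNN
    print(f"ViT-B/16:          {measure_latency(vit):.1f} ms")
    print(f"MobileNetV3-Large: {measure_latency(cnn):.1f} ms")
```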


Continue Reading
Image Complexity-Aware Adaptive Retrieval for Efficient Vision-Language Models
Positive · Artificial Intelligence
A new approach called Image Complexity-Aware Retrieval (ICAR) has been proposed to enhance vision-language models by allowing vision transformers to allocate computational resources based on image complexity. This method enables simpler images to be processed with less compute while ensuring that complex images are analyzed in full detail, maintaining cross-modal alignment for effective text matching.
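ICAR's actual complexity estimator and routing mechanism are not detailed in this summary; the sketch below only illustrates the general idea with assumed components: a cheap proxy score (here, JPEG-compressed size) decides whether an image is handled by a small encoder or a full-capacity ViT.

```python
# Hypothetical sketch of complexity-aware routing (not the ICAR implementation):
# a cheap complexity proxy selects a low-compute or full-detail encoder.
import io
import torch
from PIL import Image
import torchvision.models as models
import torchvision.transforms as T

def complexity_score(image: Image.Image) -> float:
    """Proxy for visual complexity: JPEG-compressed size of the image in KB."""
    buf = io.BytesIO()
    image.convert("RGB").save(buf, format="JPEG", quality=85)
    return buf.tell() / 1024.0

preprocess = T.Compose([T.Resize((224, 224)), T.ToTensor()])
small_encoder = models.mobilenet_v3_small(weights=None)  # low-compute path
large_encoder = models.vit_b_16(weights=None)             # full-detail path

def encode(image: Image.Image, threshold_kb: float = 40.0) -> torch.Tensor:
    """Route simple images to the small encoder, complex ones to the ViT."""
    x = preprocess(image.convert("RGB")).unsqueeze(0)
    encoder = small_encoder if complexity_score(image) < threshold_kb else large_encoder
    encoder.eval()
    with torch.no_grad():
        return encoder(x)
```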
Assessing the Visual Enumeration Abilities of Specialized Counting Architectures and Vision-Language Models
Neutral · Artificial Intelligence
A recent study published on arXiv systematically compares specialized counting architectures with vision-language models (VLMs) in their ability to enumerate items in visual scenes. The research highlights the challenges of traditional counting methods that rely on domain-specific architectures, suggesting that VLMs may provide a more flexible solution for open-set object counting.
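The comparison itself is not shown here, but the VLM side of open-set counting can be approximated with an off-the-shelf visual-question-answering pipeline; the checkpoint and prompt below are assumptions for illustration, not the paper's setup.

```python
# Hedged sketch of open-set counting with a generic VQA model
# (an illustration of the VLM-based approach, not the paper's method).
from PIL import Image
from transformers import pipeline

vqa = pipeline("visual-question-answering",
               model="dandelin/vilt-b32-finetuned-vqa")

def count_objects(image_path: str, category: str) -> str:
    """Ask the model how many instances of an arbitrary category are visible."""
    image = Image.open(image_path).convert("RGB")
    answers = vqa(image=image, question=f"How many {category} are in the picture?")
    return answers[0]["answer"]   # top-scoring answer, e.g. "3"

# Example: count_objects("scene.jpg", "red apples")
```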
