CascadedViT: Cascaded Chunk-FeedForward and Cascaded Group Attention Vision Transformer

arXiv — cs.CV · Tuesday, November 25, 2025, 5:00 AM
  • The CascadedViT (CViT) architecture is a lightweight, compute-efficient Vision Transformer built around the Cascaded-Chunk Feed Forward Network (CCFFN), which improves parameter and FLOP efficiency while maintaining accuracy. On ImageNet-1K, the CViT-XL model achieves 75.5% Top-1 accuracy while reducing FLOPs by 15% and energy consumption by 3.3% compared to EfficientViT-M5 (a hedged sketch of the chunked feed-forward idea follows this list).
  • This development is significant as it addresses the high computational and energy demands of Vision Transformers, making them more viable for deployment on resource-constrained devices such as mobile phones and drones, thereby expanding their usability in real-world applications.
  • The advancements in CViT reflect a broader trend in AI towards optimizing model efficiency without compromising performance. As the demand for lightweight models grows, particularly for mobile and edge computing, innovations like feature-map knowledge distillation and data-free quantization are increasingly relevant, highlighting ongoing efforts to enhance the practicality of Vision Transformers in diverse applications.
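
The summary above does not spell out CCFFN's internals, but the name suggests splitting the feed-forward hidden dimension into chunks that are processed sequentially, each chunk's output cascading into the next, in the spirit of EfficientViT's cascaded group attention. The PyTorch sketch below is a guess at that structure; the class name CascadedChunkFFN, the num_chunks and expansion parameters, and the carry-forward wiring are illustrative assumptions, not the authors' code.

    import torch
    import torch.nn as nn

    class CascadedChunkFFN(nn.Module):
        """Hypothetical sketch of a cascaded-chunk feed-forward block.

        Assumption: channels are split into num_chunks groups, each run
        through a small two-layer MLP, with every chunk's output added to
        the next chunk's input (the cascade) before concatenation. This
        mirrors the cascaded-group-attention pattern but is not the
        authors' verified implementation.
        """

        def __init__(self, dim: int, num_chunks: int = 4, expansion: int = 2):
            super().__init__()
            assert dim % num_chunks == 0, "dim must divide evenly into chunks"
            self.num_chunks = num_chunks
            chunk_dim = dim // num_chunks
            hidden = chunk_dim * expansion
            # One small FFN per chunk; weights scale with dim / num_chunks,
            # which is where the parameter and FLOP savings would come from.
            self.ffns = nn.ModuleList(
                nn.Sequential(
                    nn.Linear(chunk_dim, hidden),
                    nn.GELU(),
                    nn.Linear(hidden, chunk_dim),
                )
                for _ in range(num_chunks)
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, tokens, dim)
            chunks = x.chunk(self.num_chunks, dim=-1)
            outputs = []
            carry = 0  # cascaded residual carried from the previous chunk
            for chunk, ffn in zip(chunks, self.ffns):
                out = ffn(chunk + carry)
                outputs.append(out)
                carry = out
            return torch.cat(outputs, dim=-1)

Under these assumptions, with dim=256 and num_chunks=4 each per-chunk MLP is 64→128→64, so the linear weights total roughly a quarter of a monolithic 256→512→256 FFN, which is the kind of saving the reported FLOP reduction points at.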
— via World Pulse Now AI Editorial System

Continue Reading
Cross-modal Proxy Evolving for OOD Detection with Vision-Language Models
Positive · Artificial Intelligence
A new framework named CoEvo has been proposed for zero-shot out-of-distribution (OOD) detection in vision-language models, addressing the challenges posed by the absence of labeled negatives. CoEvo employs a bidirectional adaptation mechanism for both textual and visual proxies, dynamically refining them based on contextual information from test images. This innovation aims to enhance the reliability of OOD detection in open-world applications.
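
The teaser leaves CoEvo's bidirectional proxy adaptation unspecified, so the snippet below only sketches the zero-shot setup such methods build on: score each image by its maximum softmax similarity to class-text proxies from a CLIP-like model, treating a low score as a sign of an OOD input. The function name, temperature value, and tensor shapes are illustrative assumptions, and this is the standard maximum-softmax baseline rather than CoEvo itself.

    import torch
    import torch.nn.functional as F

    def max_proxy_score(image_feat: torch.Tensor,
                        text_proxies: torch.Tensor,
                        temperature: float = 0.01) -> torch.Tensor:
        """Baseline zero-shot OOD score in the setting CoEvo targets.

        image_feat:   (batch, d) image embeddings from a CLIP-like encoder
        text_proxies: (num_classes, d) class-text embeddings
        Returns one confidence per image; low values suggest OOD.
        """
        # Cosine similarity via L2 normalization, then temperature-scaled
        # softmax over classes; the max probability is the ID confidence.
        img = F.normalize(image_feat, dim=-1)
        txt = F.normalize(text_proxies, dim=-1)
        probs = (img @ txt.T / temperature).softmax(dim=-1)
        return probs.max(dim=-1).values

Per the summary, CoEvo's contribution is to keep refining both the textual and visual proxies from test-time context rather than leaving them fixed, as this baseline does.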
DGAE: Diffusion-Guided Autoencoder for Efficient Latent Representation Learning
Positive · Artificial Intelligence
The introduction of the Diffusion-Guided Autoencoder (DGAE) marks a significant advancement in latent representation learning, enhancing the decoder's expressiveness and effectively addressing training instability associated with GANs. This model achieves state-of-the-art performance while utilizing a latent space that is twice as compact, thus improving efficiency in image and video generative tasks.
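
The summary only states that DGAE improves decoder expressiveness while avoiding GAN instability, so the following is a guess at the general recipe: an encoder compresses the image to a compact latent, and a conditional denoiser is trained with the standard DDPM noise-prediction loss given that latent, with no adversarial discriminator involved. The encoder and denoiser signatures, the linear noise schedule, and the function name are all assumptions, not the paper's objective.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def dgae_style_loss(encoder: nn.Module, denoiser: nn.Module,
                        x: torch.Tensor, num_steps: int = 1000) -> torch.Tensor:
        """Hedged training loss for a diffusion-guided autoencoder.

        The encoder maps x to a compact latent z; the denoiser predicts the
        injected noise from (noisy x, timestep, z). Minimizing this MSE is
        the usual DDPM objective, used here in place of an adversarial loss.
        """
        z = encoder(x)                                  # compact latent code
        t = torch.randint(0, num_steps, (x.size(0),), device=x.device)
        noise = torch.randn_like(x)
        # Illustrative linear schedule; real models use tuned schedules.
        alpha_bar = 1.0 - (t.float() + 1) / num_steps
        a = alpha_bar.view(-1, *([1] * (x.dim() - 1)))
        x_noisy = a.sqrt() * x + (1 - a).sqrt() * noise
        eps_pred = denoiser(x_noisy, t, z)              # conditioned on z
        return F.mse_loss(eps_pred, noise)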
