Suppressing VLM Hallucinations with Spectral Representation Filtering

arXiv — cs.LG · Tuesday, November 18, 2025 at 5:00:00 AM
  • A new method named Spectral Representation Filtering (SRF) has been developed to suppress hallucinations in vision-language models (VLMs).
  • The introduction of SRF represents a significant advancement in improving the reliability of VLMs, which are crucial for applications in AI and machine learning. By addressing hallucinations effectively, SRF enhances the performance of models such as LLaVA; a hedged sketch of one plausible mechanism appears below.
— via World Pulse Now AI Editorial System
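The summary does not describe SRF's internals, so the following is only a minimal sketch of what "spectral filtering of representations" could look like: damping the leading singular directions of a layer's hidden states, on the assumption that hallucination-correlated signal concentrates in a low-rank subspace. The function name, the choice of SVD, and the parameters k and alpha are all illustrative assumptions, not the paper's method.

```python
# Hypothetical sketch of spectral filtering on VLM hidden states.
# Assumes SRF-style filtering means attenuating the top spectral
# directions of the activation matrix; the paper's exact procedure
# is not reproduced here.
import numpy as np

def spectral_filter(hidden_states: np.ndarray, k: int = 4, alpha: float = 0.1) -> np.ndarray:
    """Dampen the top-k spectral directions of a batch of hidden states.

    hidden_states: (num_tokens, dim) activations from one layer.
    k: number of leading singular directions to attenuate.
    alpha: fraction of the attenuated component to keep (0 removes it).
    """
    mean = hidden_states.mean(axis=0, keepdims=True)
    centered = hidden_states - mean
    # SVD of the centered activations; rows of vt are spectral directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top = vt[:k]                       # (k, dim) leading directions
    proj = centered @ top.T @ top      # component inside the top-k subspace
    filtered = centered - (1.0 - alpha) * proj
    return filtered + mean

# Toy usage: filter 16 token embeddings of width 64.
states = np.random.randn(16, 64).astype(np.float32)
print(spectral_filter(states).shape)  # (16, 64)
```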

Continue Reading
The Potential and Limitations of Vision-Language Models for Human Motion Understanding: A Case Study in Data-Driven Stroke Rehabilitation
Neutral · Artificial Intelligence
Vision-language models (VLMs) have shown potential in various computer-vision tasks, prompting their application in data-driven stroke rehabilitation to address challenges like automatic quantification of rehabilitation dose and impairment from videos. A study involving 29 healthy controls and 51 stroke survivors revealed that current VLMs struggle with fine-grained motion understanding, leading to unreliable dose estimates and impairment scores.
Extreme Model Compression for Edge Vision-Language Models: Sparse Temporal Token Fusion and Adaptive Neural Compression
Positive · Artificial Intelligence
A new study introduces two innovative compression techniques, Sparse Temporal Token Fusion (STTF) and Adaptive Neural Compression (ANC), aimed at enhancing edge AI performance in vision-language tasks. These methods allow models to operate efficiently on devices with limited resources, achieving significant improvements in real-time performance metrics compared to existing models like LLaVA-1.5.
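The summary names Sparse Temporal Token Fusion but not its rule. A plausible reading is that visual tokens which barely change between video frames are reused rather than recomputed, so only novel tokens flow to the language model. The sketch below encodes that assumption; the function name, the cosine-similarity test, and the threshold tau are illustrative, not taken from the paper.

```python
# Hypothetical sketch of sparse temporal token fusion for video frames.
# Assumes tokens with high frame-to-frame similarity are reused and only
# sufficiently changed tokens are refreshed; the paper's actual fusion
# rule is not reproduced here.
import numpy as np

def fuse_tokens(prev: np.ndarray, curr: np.ndarray, tau: float = 0.95):
    """Return fused tokens and a mask marking which tokens were refreshed.

    prev, curr: (num_tokens, dim) visual tokens from consecutive frames.
    tau: cosine-similarity threshold above which a token is reused.
    """
    num = (prev * curr).sum(axis=1)
    den = np.linalg.norm(prev, axis=1) * np.linalg.norm(curr, axis=1) + 1e-8
    sim = num / den                      # per-token cosine similarity
    refresh = sim < tau                  # tokens that changed enough to refresh
    fused = np.where(refresh[:, None], curr, prev)
    return fused, refresh

# Toy usage: a frame nearly identical to the previous one refreshes few tokens.
prev = np.random.randn(8, 32).astype(np.float32)
curr = prev + 0.01 * np.random.randn(8, 32).astype(np.float32)
fused, refresh = fuse_tokens(prev, curr)
print(refresh.sum(), "of", len(refresh), "tokens refreshed")
```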
Self-Empowering VLMs: Achieving Hierarchical Consistency via Self-Elicited Knowledge Distillation
Positive · Artificial Intelligence
A recent study introduced Self-Elicited Knowledge Distillation (SEKD) as a method to enhance the performance of Vision-Language Models (VLMs) in hierarchical understanding tasks. This approach allows VLMs to reason step by step, improving their ability to maintain cross-level state and achieve hierarchical consistency without the need for human labels or external tools.
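Distillation where the teacher signal is elicited from the same model (here, via step-by-step prompting) typically still reduces to matching softened output distributions. The sketch below shows only that standard softened-KL loss; treating the step-by-step logits as the teacher is an assumption, and the paper's prompt design and hierarchy handling are not reproduced.

```python
# Hypothetical sketch of a self-distillation objective in the spirit of
# SEKD: the "teacher" logits stand in for the same model's step-by-step
# elicited predictions, and the student head is trained to match them.
import torch
import torch.nn.functional as F

def self_distill_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """Softened KL divergence between teacher and student distributions."""
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    p_teacher = F.softmax(teacher_logits.detach() / t, dim=-1)  # no teacher grads
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)

# Toy usage: 4 examples over 10 hierarchical classes.
student = torch.randn(4, 10, requires_grad=True)
teacher = torch.randn(4, 10)   # stands in for step-by-step elicited logits
loss = self_distill_loss(student, teacher)
loss.backward()
print(loss.item())
```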