Loss-Oriented Ranking for Automated Visual Prompting in LVLMs

arXiv — cs.CV · Monday, November 24, 2025 at 5:00:00 AM
  • A new approach called AutoV has been introduced to enhance the performance of large vision-language models (LVLMs) by automatically selecting the optimal visual prompt for a given textual query and input image, ranking candidate prompts by the loss they induce in the model. This addresses the difficulty of designing effective visual prompts by hand, which is time-consuming and often yields sub-optimal results (a minimal sketch of this ranking loop appears after the summary below).
  • The development of AutoV is significant as it streamlines the process of visual prompting, potentially improving the reasoning capabilities of LVLMs and making them more efficient in various applications, including image recognition and natural language processing.
  • This advancement reflects a broader trend in artificial intelligence where automated systems are increasingly utilized to optimize model performance. Similar innovations, such as self-evolving frameworks and enhanced reasoning capabilities in vision-language models, highlight the ongoing efforts to improve AI systems through automation and advanced learning techniques.
— via World Pulse Now AI Editorial System
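The summary does not reproduce AutoV's exact scoring procedure, but the loss-oriented idea admits a short illustration: score each candidate visual prompt by the loss the LVLM incurs when answering the query with that prompt applied to the image, then keep the lowest-loss candidate. The sketch below is hypothetical; `lvlm_loss` is a random stand-in stub, not the paper's actual scorer, and `Candidate` is an invented container type.

```python
import random
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str             # e.g. "red box", "arrow", "background blur"
    prompt_image: object  # the input image with this visual prompt applied

def lvlm_loss(prompt_image, query: str) -> float:
    """Stand-in for the LVLM's loss on (prompted image, query).

    In a loss-oriented ranking setup this would be the model's loss for
    the target answer given the visually prompted image; a random value
    is returned here only so the example runs standalone.
    """
    return random.random()

def rank_visual_prompts(candidates: list[Candidate], query: str) -> list[Candidate]:
    """Score every candidate visual prompt by the loss it induces and
    return the candidates best-first (lowest loss first)."""
    scored = [(lvlm_loss(c.prompt_image, query), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0])
    return [c for _, c in scored]

candidates = [Candidate("red box", None), Candidate("arrow", None), Candidate("circle", None)]
best = rank_visual_prompts(candidates, "Which object is the person holding?")[0]
print("selected visual prompt:", best.name)
```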

Continue Reading
VideoHEDGE: Entropy-Based Hallucination Detection for Video-VLMs via Semantic Clustering and Spatiotemporal Perturbations
Neutral · Artificial Intelligence
A new framework named VideoHEDGE has been introduced to detect hallucinations in video-capable vision-language models (Video-VLMs), addressing the frequent inaccuracies in video question answering. This system employs entropy-based reliability estimation and semantic clustering to evaluate the correctness of generated answers against video-question pairs.
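VideoHEDGE's full clustering and spatiotemporal-perturbation machinery is more involved, but the core entropy signal can be sketched: sample several answers for the same video-question pair, group them into meaning-equivalent clusters, and treat high entropy over the clusters as a hallucination flag. This is a hedged illustration only; the exact-match clustering below is a crude stand-in for the framework's semantic-equivalence check (which would typically use an NLI-style model), and the threshold is arbitrary.

```python
import math
from collections import Counter

def semantic_cluster(answers: list[str]) -> Counter:
    """Group sampled answers into meaning-equivalent clusters.

    Normalized exact match is used here only to keep the sketch
    self-contained; a real system would use a learned semantic
    equivalence check between answer pairs.
    """
    return Counter(a.strip().lower().rstrip(".") for a in answers)

def semantic_entropy(answers: list[str]) -> float:
    """Entropy over semantic clusters: high entropy means the sampled
    answers disagree in meaning, flagging a likely hallucination."""
    clusters = semantic_cluster(answers)
    total = sum(clusters.values())
    probs = [n / total for n in clusters.values()]
    return -sum(p * math.log(p) for p in probs)

samples = ["A dog.", "a dog", "A cat.", "A dog."]
h = semantic_entropy(samples)
print(f"semantic entropy: {h:.3f}")
print("hallucination suspected" if h > 0.5 else "answer looks stable")
```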
