arXiv:2601.08557v1 Announce Type: new 
Abstract: Hallucinations in video-capable vision-language models (Video-VLMs) remain frequent and high-confidence, while existing uncertainty metrics often fail to align with correctness. We introduce VideoHEDGE, a modular framework for hallucination detection in video question answering that extends entropy-based reliability estimation from images to temporally structured inputs. Given a video-question pair, VideoHEDGE draws a baseline answer and multiple high-temperature generations from both clean clips and photometrically and spatiotemporally perturbed variants, then clusters the resulting textual outputs into semantic hypotheses using either Natural Language Inference (NLI)-based or embedding-based methods. Cluster-level probability masses yield three reliability scores: Semantic Entropy (SE), RadFlag, and Vision-Amplified Semantic Entropy (VASE). We evaluate VideoHEDGE on the SoccerChat benchmark using an LLM-as-a-judge to obtain binary hallucination labels. Across three 7B Video-VLMs (Qwen2-VL, Qwen2.5-VL, and a SoccerChat-finetuned model), VASE consistently achieves the highest ROC-AUC, especially at larger distortion budgets, while SE and RadFlag often operate near chance. We further show that embedding-based clustering matches NLI-based clustering in detection performance at substantially lower computational cost, and that domain fine-tuning reduces hallucination frequency but yields only modest improvements in calibration. The hedge-bench PyPI library enables reproducible and extensible benchmarking, with full code and experimental resources available at https://github.com/Simula/HEDGE#videohedge .
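
As a rough illustration of how cluster-level probability masses can turn into a Semantic Entropy score, the sketch below computes Shannon entropy over the empirical share of sampled answers that fall in each semantic cluster. It assumes a count-based estimate and that clustering has already been done. The function name `semantic_entropy` and the toy cluster labels are hypothetical and are not taken from the hedge-bench library, and the exact formulas VideoHEDGE uses for SE, RadFlag, and VASE may differ.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def semantic_entropy(cluster_ids: Sequence[Hashable]) -> float:
    """Shannon entropy (in nats) over empirical cluster masses.

    Each element of cluster_ids is the semantic cluster assigned to one
    sampled answer. Assumption: a cluster's mass is simply its share of
    the samples (no token-level likelihood weighting).
    """
    if not cluster_ids:
        raise ValueError("need at least one sampled answer")
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    return sum(-(c / total) * math.log(c / total) for c in counts.values())


# Hypothetical cluster labels for six sampled answers to one question.
agreeing = ["goal", "goal", "goal", "goal", "goal", "goal"]
scattered = ["goal", "foul", "corner", "goal", "offside", "foul"]

print(semantic_entropy(agreeing))   # all samples agree: lowest possible score
print(semantic_entropy(scattered))  # samples disagree: higher score
```

Under this reading, a higher score means the sampled answers spread across more competing hypotheses, which is the signal an entropy-based detector would treat as a sign of a possible hallucination.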

VideoHEDGE is a new framework for detecting hallucinations in video-capable vision-language models (Video-VLMs), which frequently give confident but incorrect answers in video question answering. It clusters sampled answers into semantic hypotheses and uses entropy-based reliability scores to estimate whether the answer to a given video-question pair is likely to be correct.

VideoHEDGE: Entropy-Based Hallucination Detection for Video-VLMs via Semantic Clustering and Spatiotemporal Perturbations
