arXiv:2510.21786v1 Announce Type: new 
Abstract: Script event induction, which aims to predict the subsequent event from its context, is a challenging NLP task that has achieved remarkable success in practical applications. However, human events are mostly recorded and presented as videos rather than scripts, yet related research in the vision domain is lacking. To address this problem, we introduce AVEP (Action-centric Video Event Prediction), a task that differs from existing video prediction tasks in its more complex logic and richer semantic information. To support this task, we present a large structured dataset of about $35K$ annotated videos and more than $178K$ event video clips, built upon existing video event datasets. The dataset offers more fine-grained annotations, where the atomic unit is a multimodal event argument node, providing better-structured representations of video events. Due to the complexity of event structures, traditional visual models that take patches or frames as input are not well suited to AVEP. We propose EventFormer, a video event prediction model based on node-graph hierarchical attention, which captures both the relationships between events and their arguments and the coreferential relationships between arguments. We conducted experiments on AVEP using several SOTA video prediction models as well as LVLMs, demonstrating both the complexity of the task and the value of the dataset. Our approach outperforms all these video prediction models. We will release the dataset and code for replicating the experiments and annotations.
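
The structure the abstract describes (events made of argument nodes, with coreference links between arguments of different events) can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration: the class names, the `role` and `entity_id` fields, and the `coreference_edges` helper are hypothetical and are not the dataset's schema or EventFormer's implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ArgumentNode:
    """One event argument (hypothetical schema, not the AVEP format)."""
    node_id: str
    role: str                        # e.g. "agent", "action", "object"
    text: str                        # textual label of the argument
    entity_id: Optional[str] = None  # shared by coreferent arguments


@dataclass
class Event:
    """An event as a group of argument nodes."""
    event_id: str
    arguments: List[ArgumentNode] = field(default_factory=list)


def coreference_edges(events: List[Event]) -> List[Tuple[str, str]]:
    """Link argument nodes in different events that share an entity_id."""
    mentions_by_entity = {}
    for event in events:
        for arg in event.arguments:
            if arg.entity_id is not None:
                mentions_by_entity.setdefault(arg.entity_id, []).append(
                    (event.event_id, arg.node_id)
                )
    edges = []
    for mentions in mentions_by_entity.values():
        for i in range(len(mentions)):
            for j in range(i + 1, len(mentions)):
                # Only connect mentions that belong to different events.
                if mentions[i][0] != mentions[j][0]:
                    edges.append((mentions[i][1], mentions[j][1]))
    return edges


# Two toy events in which the same person appears as the agent.
events = [
    Event("e1", [
        ArgumentNode("e1.agent", "agent", "person", "ent1"),
        ArgumentNode("e1.action", "action", "pick up"),
        ArgumentNode("e1.object", "object", "knife", "ent2"),
    ]),
    Event("e2", [
        ArgumentNode("e2.agent", "agent", "person", "ent1"),
        ArgumentNode("e2.action", "action", "cut"),
        ArgumentNode("e2.object", "object", "tomato", "ent3"),
    ]),
]
edges = coreference_edges(events)
print(edges)
```

In this toy input only the two agent nodes share an entity, so a single cross-event edge is produced. A graph of this kind is the sort of input over which a hierarchical attention model could attend first within an event and then across events.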

EventFormer, a Node-graph Hierarchical Attention Transformer, targets action-centric video event prediction (AVEP): predicting the subsequent event from visual context. The work addresses a research gap: human events are mostly recorded as videos rather than scripts, yet event prediction has so far been studied mainly on text in NLP. According to the abstract, EventFormer outperforms the tested SOTA video prediction models on the accompanying AVEP dataset.

EventFormer: A Node-graph Hierarchical Attention Transformer for Action-centric Video Event Prediction

arXiv:2511.17254v1 Announce Type: new 
Abstract: Despite their impressive performance across a wide range of tasks, Large Vision-Language Models (LVLMs) remain prone to hallucination. In this study, we propose a comprehensive intervention framework aligned with the transformer's causal architecture in LVLMs, integrating the effects of different intervention paths on hallucination. We find that hallucinations in LVLMs do not arise from a single causal path, but rather from the interplay among image-to-input-text, image-to-output-text, and text-to-text pathways. For the first time, we also find that LVLMs rely on different pathways depending on the question-answer alignment format. Building on these insights, we propose simple yet effective methods to identify and intervene on critical hallucination heads within each pathway, tailored to discriminative and generative formats. Experiments across multiple benchmarks demonstrate that our approach consistently reduces hallucinations across diverse alignment types.
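
The idea of identifying critical heads and intervening on them can be sketched in plain Python. This is a hedged illustration only: the abstract does not say how heads are scored or how the intervention is applied, so the per-head scores, the top-k selection in `select_heads`, and the output scaling in `intervene` are all assumptions, not the paper's method.

```python
def select_heads(scores, k):
    """Return the k (layer, head) pairs with the highest score.

    `scores` maps (layer, head) to a hypothetical hallucination score.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def intervene(head_outputs, critical, scale=0.0):
    """Scale the output vectors of critical heads; copy the rest unchanged."""
    return {
        key: [scale * x for x in vec] if key in critical else list(vec)
        for key, vec in head_outputs.items()
    }


# Toy model with 2 layers x 2 heads; all numbers are made up.
scores = {(0, 0): 0.1, (0, 1): 0.9, (1, 0): 0.4, (1, 1): 0.7}
head_outputs = {
    (0, 0): [1.0, 2.0],
    (0, 1): [3.0, 4.0],
    (1, 0): [5.0, 6.0],
    (1, 1): [7.0, 8.0],
}

critical = select_heads(scores, k=2)
patched = intervene(head_outputs, critical)
print(sorted(critical))
print(patched)
```

Under the abstract's framing, a separate score table would presumably be kept for each pathway (image-to-input-text, image-to-output-text, text-to-text) and for each question format, with the selection step repeated per table.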

A new study introduces the Intervene-All-Paths framework, which aims to mitigate hallucinations in Large Vision-Language Models (LVLMs) by addressing the interplay among several causal pathways. The study finds that hallucinations arise from the interaction of image-to-input-text, image-to-output-text, and text-to-text pathways rather than from a single path, and that LVLMs rely on different pathways depending on the question-answer alignment format. It proposes interventions on critical hallucination heads within each pathway, tailored to discriminative and generative formats.
