arXiv:2511.02712v1 Announce Type: new 
Abstract: Understanding and predicting emotion from videos has garnered significant attention in recent studies, driven by advancements in video large language models (VideoLLMs). While advanced methods have made progress in video emotion analysis, the intrinsic nature of emotions poses significant challenges. Emotions are dynamic and cue-dependent, making it difficult to understand complex and evolving emotional states with a reasonable rationale. To tackle these challenges, we propose a novel affective cues-guided reasoning framework that unifies fundamental attribute perception, expression analysis, and high-level emotional understanding in a stage-wise manner. At the core of our approach is a family of video emotion foundation models (VidEmo), specifically designed for emotion reasoning and instruction-following. These models undergo a two-stage tuning process: first, curriculum emotion learning for injecting emotion knowledge, followed by affective-tree reinforcement learning for emotion reasoning. Moreover, we establish a foundational data infrastructure and introduce an emotion-centric fine-grained dataset (Emo-CFG) consisting of 2.1M diverse instruction-based samples. Emo-CFG includes explainable emotional question-answering, fine-grained captions, and associated rationales, providing essential resources for advancing emotion understanding tasks. Experimental results demonstrate that our approach achieves competitive performance, setting a new milestone across 15 face perception tasks.

VidEmo introduces a new approach to understanding emotions in videos, leveraging advancements in video large language models. This innovative method aims to tackle the complexities of emotional analysis, addressing the dynamic nature of emotions and their dependence on various cues.

VidEmo: Affective-Tree Reasoning for Emotion-Centric Video Foundation Models
