arXiv:2511.14749v1 Announce Type: new 
Abstract: Engagement recognition in video datasets, unlike traditional image classification, is hampered by subjective and noisy labels that limit model performance. To address this, we propose a framework that uses Vision Large Language Models (VLMs) to refine annotations and guide training. The framework uses a questionnaire to extract behavioral cues and split the data into high- and low-reliability subsets. We also introduce a training strategy that combines curriculum learning with soft label refinement, gradually incorporating ambiguous samples while adjusting supervision to reflect uncertainty. Classical computer vision models trained on the refined high-reliability subsets and enhanced with our curriculum strategy show improvements, highlighting the benefit of addressing label subjectivity with VLMs. The method surpasses the prior state of the art on engagement benchmarks: EngageNet (three of six feature settings, maximum improvement of +1.21%) and DREAMS / PAFE (F1 gains of +0.22 / +0.06).
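The abstract does not spell out the curriculum schedule or the soft-label rule, so the following is only a minimal sketch of the general idea, not the authors' method. Everything here is an assumption made for illustration: the names `soften_label`, `low_reliability_fraction`, and `curriculum_pool`, the linear ramp, the uniform smoothing of leftover probability mass, and the premise that the low-reliability list is already ordered from least to most ambiguous.

```python
from typing import List, Sequence


def soften_label(label: int, num_classes: int, confidence: float) -> List[float]:
    """Hypothetical soft label: `confidence` mass on the annotated class,
    the remainder spread uniformly over the other classes."""
    if num_classes < 2 or not 0.0 <= confidence <= 1.0:
        raise ValueError("need num_classes >= 2 and confidence in [0, 1]")
    rest = (1.0 - confidence) / (num_classes - 1)
    return [confidence if c == label else rest for c in range(num_classes)]


def low_reliability_fraction(epoch: int, warmup_epochs: int, ramp_epochs: int) -> float:
    """Assumed linear schedule: no ambiguous samples during warmup,
    then a growing share until all of them are included."""
    if epoch < warmup_epochs:
        return 0.0
    return min(1.0, (epoch - warmup_epochs + 1) / ramp_epochs)


def curriculum_pool(high: Sequence, low: Sequence, epoch: int,
                    warmup_epochs: int = 2, ramp_epochs: int = 4) -> list:
    """Training pool for one epoch: every high-reliability sample plus the
    first k low-reliability samples (assumed sorted, least ambiguous first)."""
    frac = low_reliability_fraction(epoch, warmup_epochs, ramp_epochs)
    k = int(frac * len(low))
    return list(high) + list(low[:k])


if __name__ == "__main__":
    high = ["h1", "h2"]
    low = ["l1", "l2", "l3", "l4"]
    for epoch in range(7):
        print(epoch, curriculum_pool(high, low, epoch))
    print(soften_label(label=1, num_classes=4, confidence=0.7))
```

In a real training loop the soft target would replace the one-hot label in a cross-entropy loss, with lower `confidence` for low-reliability samples; how that confidence should be set is left open here because the source does not say.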

Engagement recognition in video datasets faces challenges due to subjective labels and noise, which hinder model performance. A new framework utilizing Vision Large Language Models (VLMs) has been proposed to refine annotations and enhance the training process. This framework employs a questionnaire to extract behavioral cues and categorize data into high- and low-reliability subsets. Additionally, a training strategy combining curriculum learning with soft label refinement is introduced, allowing for the gradual inclusion of ambiguous samples while adjusting supervision to reflect uncertainty…
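How questionnaire answers become a reliability split is not described in this summary. One plausible reading, sketched below purely as an illustration, is to let the VLM's answers about behavioral cues cast a vote and to treat a sample as high-reliability when that vote agrees with the human label. The cue names, the weights in `CUE_WEIGHTS`, the binary label, and the agreement rule are all hypothetical; no actual VLM is called.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical questionnaire cues and their sign: positive cues suggest
# engagement, negative cues suggest disengagement.
CUE_WEIGHTS: Dict[str, int] = {
    "looking_at_screen": 1,
    "taking_notes": 1,
    "looking_away": -1,
    "eyes_closed": -1,
}


@dataclass
class Sample:
    video_id: str
    human_label: int          # assumed binary: 0 = not engaged, 1 = engaged
    answers: Dict[str, bool]  # yes/no answers a VLM gave to the questionnaire


def cue_vote(answers: Dict[str, bool]) -> int:
    """Turn the yes/no answers into a binary engagement vote."""
    score = sum(w for cue, w in CUE_WEIGHTS.items() if answers.get(cue, False))
    return 1 if score > 0 else 0


def split_by_reliability(samples: List[Sample]) -> Tuple[List[Sample], List[Sample]]:
    """Assumed rule: agreement between cue vote and human label means
    high reliability; disagreement means low reliability."""
    high: List[Sample] = []
    low: List[Sample] = []
    for s in samples:
        (high if cue_vote(s.answers) == s.human_label else low).append(s)
    return high, low


if __name__ == "__main__":
    data = [
        Sample("v1", 1, {"looking_at_screen": True, "taking_notes": True}),
        Sample("v2", 1, {"eyes_closed": True, "looking_away": True}),
    ]
    high, low = split_by_reliability(data)
    print([s.video_id for s in high], [s.video_id for s in low])
```

A graded agreement score instead of a hard match would give a continuous reliability value, which could also be used to order the low-reliability samples for a curriculum.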

Vision Large Language Models Are Good Noise Handlers in Engagement Analysis
