Listening without Looking: Modality Bias in Audio-Visual Captioning
Neutral · Artificial Intelligence
A recent study on audio-visual captioning examines how well current models combine sound and vision to generate scene descriptions. While progress has been made in fusing these modalities, the research highlights that two questions remain underexplored: how the modalities complement each other, and how robust the models are when one modality is impaired. Answering these questions could lead to more reliable systems across a range of applications and improve how multimedia content is interpreted.
— via World Pulse Now AI Editorial System
