arXiv:2503.10287v3
Abstract: Propelled by breakthroughs in deep generative models, audio-to-image generation has emerged as a pivotal cross-modal task that converts complex auditory signals into rich visual representations. However, previous work focuses only on single-source audio inputs for image generation, ignoring the multi-source nature of natural auditory scenes and thus limiting its ability to generate comprehensive visual content. To bridge this gap, we propose a method called MACS for multi-source audio-to-image generation. To the best of our knowledge, this is the first work that explicitly separates multi-source audio to capture the rich audio components before image generation. MACS is a two-stage method. In the first stage, multi-source audio inputs are separated by a weakly supervised method, in which the audio and text labels are semantically aligned by projecting them into a common space using the large pre-trained CLAP model. We introduce a ranking loss to account for the contextual significance of the separated audio signals. In the second stage, effective image generation is achieved by mapping the separated audio signals to the generation condition using only a trainable adapter and an MLP layer. We preprocess the LLP dataset as the first full multi-source audio-to-image generation benchmark. Experiments are conducted on multi-source, mixed-source, and single-source audio-to-image generation tasks. The proposed MACS outperforms the current state-of-the-art methods on 17 of the 21 evaluation metrics across all tasks and delivers superior visual quality.
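The abstract does not give the form of the ranking loss, so the following is only a minimal sketch of one common choice, a pairwise hinge (margin) ranking loss. It assumes each separated audio signal has a scalar relevance score and that the scores are listed from most to least contextually significant. The function name `pairwise_ranking_loss`, the `margin` parameter, and the example scores are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def pairwise_ranking_loss(scores: Sequence[float], margin: float = 0.1) -> float:
    """Hinge-style ranking loss (illustrative; not the paper's formulation).

    `scores` is assumed to be ordered from the most to the least
    contextually significant separated source. For every pair (i, j)
    with i ranked above j, the penalty is
    max(0, margin - (scores[i] - scores[j])); the result is the mean
    penalty over all such pairs.
    """
    n = len(scores)
    if n < 2:
        return 0.0  # nothing to rank

    total = 0.0
    pairs = 0
    for i in range(n):
        for j in range(i + 1, n):
            total += max(0.0, margin - (scores[i] - scores[j]))
            pairs += 1
    return total / pairs


if __name__ == "__main__":
    # Hypothetical scores for three separated sources, already in rank order.
    print(pairwise_ranking_loss([0.9, 0.5, 0.1]))
    # Same sources with the two top scores swapped, which violates the order.
    print(pairwise_ranking_loss([0.5, 0.9, 0.1]))
```

A loss of this shape is zero when every higher-ranked source beats every lower-ranked one by at least the margin, and grows as the ordering is violated. How MACS actually defines significance and combines the term with its separation objective is described in the paper itself.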

A new method named MACS has been introduced for multi-source audio-to-image generation, addressing the limitations of previous models that focused solely on single-source audio inputs. This two-stage approach utilizes weakly supervised techniques to separate multi-source audio, aligning audio and text labels semantically through the pre-trained CLAP model.
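To illustrate what semantic alignment in a shared embedding space can look like, here is a small sketch that matches one audio embedding to the closest text-label embedding by cosine similarity. It makes no calls to CLAP: the vectors are hand-written stand-ins for embeddings a pre-trained audio-text model might produce, and the helper names `cosine_similarity` and `best_matching_label` are hypothetical.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_matching_label(
    audio_embedding: Sequence[float],
    label_embeddings: Dict[str, Sequence[float]],
) -> str:
    """Return the label whose embedding is most similar to the audio embedding."""
    return max(
        label_embeddings,
        key=lambda name: cosine_similarity(audio_embedding, label_embeddings[name]),
    )


if __name__ == "__main__":
    # Stand-in 2-D embeddings; real audio-text embeddings are much higher-dimensional.
    labels = {"dog": [0.9, 0.0], "violin": [0.0, 1.0]}
    separated_audio = [1.0, 0.1]
    print(best_matching_label(separated_audio, labels))
```

In a weakly supervised setting, a similarity score like this could act as the training signal that pulls each separated audio component toward its text label. That is an assumption about the general approach, not a description of the MACS implementation.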

MACS: Multi-source Audio-to-image Generation with Contextual Significance and Semantic Alignment
