MultiSoundGen: Video-to-Audio Generation for Multi-Event Scenarios via SlowFast Contrastive Audio-Visual Pretraining and Direct Preference Optimization
MultiSoundGen is a video-to-audio generation method targeting complex multi-event scenarios, where several sound-producing events overlap within a single video. It combines two techniques: SlowFast contrastive audio-visual pretraining, which aligns audio representations with both the semantic and dynamic components of video features, and direct preference optimization (DPO), which steers the generator toward outputs judged better aligned with the visual content. This dual-technique design addresses the central difficulty of multi-event synthesis: generating audio that accurately reflects multiple overlapping visual events rather than a single dominant one. The authors report that the approach improves the precision of generated audio, making it more faithful to the visual content. Developed within the context of recent research on audio-visual learning, MultiSoundGen is positioned as a step toward better bridging the video and sound modalities.
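As a rough illustration of the two training objectives named above, the sketch below pairs an InfoNCE-style contrastive loss, aligning audio embeddings with slow (semantic) and fast (dynamic) video pathway embeddings, with a standard DPO loss over preferred/dispreferred audio pairs. All function names, tensor shapes, and hyperparameters (`temperature`, `beta`) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F


def slowfast_contrastive_loss(slow_feats, fast_feats, audio_feats, temperature=0.07):
    """InfoNCE-style contrastive loss aligning audio embeddings with the
    slow (semantic) and fast (dynamic) video pathways.

    All inputs are (batch, dim) embeddings; matching video/audio pairs
    share the same batch index.
    """
    slow = F.normalize(slow_feats, dim=-1)
    fast = F.normalize(fast_feats, dim=-1)
    audio = F.normalize(audio_feats, dim=-1)
    targets = torch.arange(audio.size(0), device=audio.device)

    def nce(v, a):
        # Symmetric cross-entropy over cosine-similarity logits.
        logits = v @ a.t() / temperature
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))

    # Align audio with both pathways so neither semantics nor motion dominates.
    return nce(slow, audio) + nce(fast, audio)


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss on (preferred, dispreferred) generated-audio pairs.

    logp_* are summed log-probabilities of each audio sample under the policy
    model; ref_logp_* are the same quantities under a frozen reference model.
    """
    policy_margin = logp_chosen - logp_rejected
    ref_margin = ref_logp_chosen - ref_logp_rejected
    # Maximize the policy's preference margin relative to the reference.
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```

In this hypothetical setup, the contrastive loss would be used during pretraining to shape the audio-visual embedding space, while the DPO loss would fine-tune the generator on pairs of audio outputs ranked by alignment with the video.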
