Reasoning-Enhanced Domain-Adaptive Pretraining of Multimodal Large Language Models for Short Video Content Governance
Positive · Artificial Intelligence
As short video platforms rapidly evolve, effective identification of inappropriate content becomes increasingly critical. Traditional approaches often rely on a separate, small classification model for each governance issue, which not only demands extensive human-labeled data but also generalizes poorly across content types. To tackle these challenges, researchers have introduced a reasoning-enhanced multimodal large language model (MLLM) pretraining paradigm. The approach incorporates three targeted pretraining tasks: Captioning, Visual Question Answering (VQA), and Chain-of-Thought (CoT), designed respectively to strengthen the model's perception, understanding, and reasoning capabilities. Experimental results indicate that this pretraining significantly improves the MLLM's performance in both zero-shot and supervised fine-tuning settings, with strong generalization to previously unseen issues. This advancement is crucial for improving content governance on short video platforms.
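To make the three-task setup concrete, here is a minimal sketch of how such a pretraining mixture might be assembled into a unified text-generation format. The `PretrainExample` dataclass, prompt templates, helper names, and sample data are illustrative assumptions, not the paper's actual schema.

```python
# Hypothetical sketch: unifying the three pretraining tasks (Captioning, VQA,
# CoT) into one text-generation format. All names and prompt templates here
# are assumptions for illustration, not the paper's actual data pipeline.
from dataclasses import dataclass

@dataclass
class PretrainExample:
    video_id: str  # handle to the video frames fed to the vision encoder
    prompt: str    # task instruction given to the MLLM
    target: str    # text the model is trained to generate

def caption_example(video_id: str, caption: str) -> PretrainExample:
    # Captioning task: strengthens low-level perception of what is on screen.
    return PretrainExample(video_id, "Describe this video in detail.", caption)

def vqa_example(video_id: str, question: str, answer: str) -> PretrainExample:
    # VQA task: strengthens grounded understanding of specific visual facts.
    return PretrainExample(video_id, question, answer)

def cot_example(video_id: str, question: str,
                rationale: str, verdict: str) -> PretrainExample:
    # CoT task: strengthens step-by-step reasoning toward a governance verdict.
    return PretrainExample(
        video_id, question, f"Reasoning: {rationale}\nConclusion: {verdict}"
    )

if __name__ == "__main__":
    batch = [
        caption_example("vid_001", "A person demonstrates a kitchen knife set."),
        vqa_example("vid_001", "Is a weapon used to threaten anyone?", "No."),
        cot_example(
            "vid_001",
            "Does this video violate the dangerous-goods policy?",
            "The knife appears in a cooking context with no aggressive intent.",
            "Compliant.",
        ),
    ]
    for ex in batch:
        print(ex.prompt, "->", ex.target)
```

The appeal of framing all three tasks as next-token prediction over a shared format is that a single model can acquire perception, understanding, and reasoning jointly, rather than splitting them across per-issue classifiers.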
— via World Pulse Now AI Editorial System
