Reasoning-Enhanced Domain-Adaptive Pretraining of Multimodal Large Language Models for Short Video Content Governance

arXiv — cs.CV · Wednesday, November 12, 2025 at 5:00:00 AM
As short video platforms rapidly evolve, effective identification of inappropriate content becomes increasingly critical. Traditional methods often rely on separate, small classification models for each issue, which not only demand extensive human-labeled data but also generalize poorly across content types. To tackle these challenges, researchers have introduced a reasoning-enhanced multimodal large language model (MLLM) pretraining paradigm. The approach incorporates three targeted pretraining tasks: Captioning, Visual Question Answering (VQA), and Chain-of-Thought (CoT), designed to strengthen the model's perception, understanding, and reasoning capabilities, respectively. Experimental results indicate that this pretraining significantly improves the MLLM's performance in both zero-shot and supervised fine-tuning settings, demonstrating strong generalization to previously unseen issues. This advancement is crucial for improving content governance on…
— via World Pulse Now AI Editorial System
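
To make the paradigm concrete, here is a minimal sketch of how the three pretraining task types named above might be mixed into a single instruction-tuning corpus. The record schema, prompts, and the build_sample/mixed_corpus helpers are hypothetical illustrations under stated assumptions, not the authors' actual data format.

```python
# Hypothetical sketch: assembling a mixed pretraining corpus from the three
# task types named in the paper (Captioning, VQA, Chain-of-Thought). The
# record schema and prompt wording are illustrative, not the authors' format.
import random

def build_sample(video_frames, task, annotation):
    """Wrap one annotated video into an instruction-tuning record."""
    if task == "caption":
        prompt = "Describe this video in detail."
        target = annotation["caption"]
    elif task == "vqa":
        prompt = annotation["question"]
        target = annotation["answer"]
    elif task == "cot":
        # Chain-of-Thought: the target includes the reasoning trace
        # before the final policy judgment.
        prompt = "Does this video violate platform policy? Think step by step."
        target = annotation["reasoning"] + "\nAnswer: " + annotation["label"]
    else:
        raise ValueError(f"unknown task: {task}")
    return {"frames": video_frames, "prompt": prompt, "target": target}

def mixed_corpus(annotated_videos, weights=(0.4, 0.3, 0.3)):
    """Sample task types so one corpus trains perception, understanding,
    and reasoning jointly, as the paradigm described above suggests."""
    tasks = ("caption", "vqa", "cot")
    for frames, ann in annotated_videos:
        task = random.choices(tasks, weights=weights, k=1)[0]
        yield build_sample(frames, task, ann)
```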


Recommended Readings
Can LLMs Detect Their Own Hallucinations?
Positive · Artificial Intelligence
Large language models (LLMs) are capable of generating fluent responses but can sometimes produce inaccurate information, referred to as hallucinations. A recent study investigates whether these models can recognize their own inaccuracies. The research formulates hallucination detection as a classification task and introduces a framework utilizing Chain-of-Thought (CoT) to extract knowledge from LLM parameters. Experimental results show that GPT-3.5 Turbo with CoT detected 58.2% of its own hallucinations, suggesting that LLMs can identify inaccuracies if they possess sufficient knowledge.
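
As a rough illustration of framing hallucination detection as a classification task with CoT, the sketch below prompts a model to audit its own answer and emit a verdict token. The query_llm stub and the prompt wording are assumptions; any chat-completion client can be plugged in (the study used GPT-3.5 Turbo).

```python
def query_llm(prompt: str) -> str:
    """Stub for any chat-completion client; the study used GPT-3.5 Turbo."""
    raise NotImplementedError("plug in your chat-completion client here")

def detect_hallucination(question: str, model_answer: str) -> bool:
    """Classify the model's own answer as hallucinated or faithful,
    using a Chain-of-Thought prompt to surface parametric knowledge."""
    prompt = (
        "Decide whether the answer below contains a hallucination.\n"
        f"Question: {question}\n"
        f"Answer: {model_answer}\n"
        "Think step by step, checking each claim against what you know, "
        "then end with exactly 'VERDICT: HALLUCINATED' or 'VERDICT: FAITHFUL'."
    )
    return "VERDICT: HALLUCINATED" in query_llm(prompt).upper()
```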
Benchmarking Visual LLMs Resilience to Unanswerable Questions on Visually Rich Documents
Neutral · Artificial Intelligence
The study titled 'Benchmarking Visual LLMs Resilience to Unanswerable Questions on Visually Rich Documents' explores the capabilities of Visual Large Language Models (VLLMs) in understanding Visually Rich Documents (VRDs). While VLLMs perform well in Visual Question Answering (VQA), their ability to identify unanswerable questions remains under-researched. The research introduces a benchmark called VRD-UQA to assess VLLMs' resilience against plausible yet unanswerable questions generated through subtle corruptions in document elements.
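
A hedged sketch of the kind of corruption such a benchmark might apply: swapping one document entity in an otherwise answerable question so it becomes plausible but unanswerable. The corrupt_question helper and the entity-swap strategy are illustrative assumptions, not VRD-UQA's actual generation pipeline.

```python
# Hypothetical sketch of deriving an unanswerable variant from an
# answerable VQA pair by corrupting a single document element.
import random

def corrupt_question(question: str, entities: list[str]) -> str:
    """Swap one entity mentioned in the question for a different entity
    from the same document, yielding a plausible but unanswerable query."""
    present = [e for e in entities if e in question]
    if not present:
        return question  # nothing to corrupt
    old = random.choice(present)
    replacements = [e for e in entities if e != old]
    if not replacements:
        return question  # no alternative entity available
    return question.replace(old, random.choice(replacements))

# Usage: a resilient model should abstain on the corrupted variant.
print(corrupt_question("What is the invoice total for ACME Corp?",
                       ["ACME Corp", "Globex Inc"]))
```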
Geospatial Chain of Thought Reasoning for Enhanced Visual Question Answering on Satellite Imagery
Positive · Artificial Intelligence
Geospatial chain of thought (CoT) reasoning is crucial for enhancing Visual Question Answering (VQA) on satellite imagery, especially in climate-related applications like disaster monitoring and urban resilience planning. Current VQA models can interpret remote sensing data but often lack the structured reasoning needed for complex geospatial queries. A new framework integrating CoT reasoning with Direct Preference Optimization (DPO) has been proposed, showing a 34.9% accuracy improvement in handling tasks such as detection and classification.
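
For context, DPO optimizes a policy directly from preference pairs without a separate reward model. The sketch below shows the standard DPO objective as it might be applied to paired geospatial reasoning traces; the function signature is illustrative, and the pairing of chain-of-thought answers is an assumption about the framework's setup.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective: widen the policy's log-prob margin between
    the preferred and dispreferred reasoning traces, measured relative to
    a frozen reference model."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```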
Efficient Reasoning via Thought-Training and Thought-Free Inference
Positive · Artificial Intelligence
Recent advancements in large language models (LLMs) have utilized Chain-of-Thought (CoT) prompting to enhance reasoning accuracy. However, existing methods that compress lengthy reasoning outputs still rely on explicit reasoning during inference. The 3TF framework (Thought-Training and Thought-Free inference) presents a Short-to-Long approach to efficient reasoning: it trains a hybrid model to operate in both reasoning and non-reasoning modes, internalizing structured reasoning while producing concise outputs at inference time.
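
A minimal sketch of what hybrid-mode supervision in the spirit of 3TF could look like: each example is rendered once with an explicit reasoning trace and once answer-only, with a control tag selecting the mode. The tag tokens and record layout are guesses for illustration, not the paper's exact format.

```python
def render(example: dict, mode: str) -> dict:
    """Render one example in 'think' (explicit trace) or 'nothink'
    (answer-only) mode; tag tokens are illustrative."""
    if mode == "think":
        # Thought-training: supervise the full reasoning trace.
        target = f"<think>{example['reasoning']}</think>{example['answer']}"
    else:
        # Thought-free: supervise only the concise answer, so the model
        # internalizes the reasoning without emitting it at inference.
        target = example["answer"]
    return {"input": f"[{mode}] {example['question']}", "target": target}

dataset = [{"question": "What is 2 + 3 * 4?",
            "reasoning": "3 * 4 = 12, then 2 + 12 = 14.",
            "answer": "14"}]
pairs = [render(ex, m) for ex in dataset for m in ("think", "nothink")]
```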