Reasoning-Enhanced Domain-Adaptive Pretraining of Multimodal Large Language Models for Short Video Content Governance

arXiv — cs.CV · Wednesday, November 12, 2025 at 5:00:00 AM
As short video platforms rapidly evolve, effective identification of inappropriate content becomes increasingly critical. Traditional methods often rely on separate, small classification models for each issue, which not only demand extensive human-labeled data but also generalize poorly across content types. To tackle these challenges, researchers have introduced a reasoning-enhanced multimodal large language model (MLLM) pretraining paradigm. The approach incorporates three targeted pretraining tasks: Captioning, Visual Question Answering (VQA), and Chain-of-Thought (CoT), designed to strengthen the model's perception, understanding, and reasoning capabilities, respectively. Experimental results indicate that this pretraining significantly improves the MLLM's performance in both zero-shot and supervised fine-tuning settings, demonstrating strong generalization to previously unseen issues. This advancement is crucial for improving content governance on…
— via World Pulse Now AI Editorial System
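
To make the paradigm concrete, here is a minimal sketch of how the three pretraining task types named above might be mixed into a single instruction-tuning corpus. The record schema, prompts, and the build_sample/mixed_corpus helpers are hypothetical illustrations under stated assumptions, not the authors' actual data format.

```python
# Hypothetical sketch: assembling a mixed pretraining corpus from the three
# task types named in the paper (Captioning, VQA, Chain-of-Thought). The
# record schema and prompt wording are illustrative, not the authors' format.
import random

def build_sample(video_frames, task, annotation):
    """Wrap one annotated video into an instruction-tuning record."""
    if task == "caption":
        prompt = "Describe this video in detail."
        target = annotation["caption"]
    elif task == "vqa":
        prompt = annotation["question"]
        target = annotation["answer"]
    elif task == "cot":
        # Chain-of-Thought: the target includes the reasoning trace
        # before the final policy judgment.
        prompt = "Does this video violate platform policy? Think step by step."
        target = annotation["reasoning"] + "\nAnswer: " + annotation["label"]
    else:
        raise ValueError(f"unknown task: {task}")
    return {"frames": video_frames, "prompt": prompt, "target": target}

def mixed_corpus(annotated_videos, weights=(0.4, 0.3, 0.3)):
    """Sample task types so one corpus trains perception, understanding,
    and reasoning jointly, as the paradigm described above suggests."""
    tasks = ("caption", "vqa", "cot")
    for frames, ann in annotated_videos:
        task = random.choices(tasks, weights=weights, k=1)[0]
        yield build_sample(frames, task, ann)
```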


Recommended Readings
Can LLMs Detect Their Own Hallucinations?
Positive · Artificial Intelligence
Large language models (LLMs) are capable of generating fluent responses but can sometimes produce inaccurate information, referred to as hallucinations. A recent study investigates whether these models can recognize their own inaccuracies. The research formulates hallucination detection as a classification task and introduces a framework utilizing Chain-of-Thought (CoT) to extract knowledge from LLM parameters. Experimental results show that GPT-3.5 Turbo with CoT detected 58.2% of its own hallucinations, suggesting that LLMs can identify inaccuracies if they possess sufficient knowledge.
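
As a rough illustration of framing hallucination detection as a classification task with CoT, the sketch below prompts a model to audit its own answer and emit a verdict token. The query_llm stub and the prompt wording are assumptions; any chat-completion client can be plugged in (the study used GPT-3.5 Turbo).

```python
def query_llm(prompt: str) -> str:
    """Stub for any chat-completion client; the study used GPT-3.5 Turbo."""
    raise NotImplementedError("plug in your chat-completion client here")

def detect_hallucination(question: str, model_answer: str) -> bool:
    """Classify the model's own answer as hallucinated or faithful,
    using a Chain-of-Thought prompt to surface parametric knowledge."""
    prompt = (
        "Decide whether the answer below contains a hallucination.\n"
        f"Question: {question}\n"
        f"Answer: {model_answer}\n"
        "Think step by step, checking each claim against what you know, "
        "then end with exactly 'VERDICT: HALLUCINATED' or 'VERDICT: FAITHFUL'."
    )
    return "VERDICT: HALLUCINATED" in query_llm(prompt).upper()
```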
Benchmarking Visual LLMs Resilience to Unanswerable Questions on Visually Rich Documents
Neutral · Artificial Intelligence
The study titled 'Benchmarking Visual LLMs Resilience to Unanswerable Questions on Visually Rich Documents' explores the capabilities of Visual Large Language Models (VLLMs) in understanding Visually Rich Documents (VRDs). While VLLMs perform well in Visual Question Answering (VQA), their ability to identify unanswerable questions remains under-researched. The research introduces a benchmark called VRD-UQA to assess VLLMs' resilience against plausible yet unanswerable questions generated through subtle corruptions in document elements.
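
A hedged sketch of the kind of corruption such a benchmark might apply: swapping one document entity in an otherwise answerable question so it becomes plausible but unanswerable. The corrupt_question helper and the entity-swap strategy are illustrative assumptions, not VRD-UQA's actual generation pipeline.

```python
# Hypothetical sketch of deriving an unanswerable variant from an
# answerable VQA pair by corrupting a single document element.
import random

def corrupt_question(question: str, entities: list[str]) -> str:
    """Swap one entity mentioned in the question for a different entity
    from the same document, yielding a plausible but unanswerable query."""
    present = [e for e in entities if e in question]
    if not present:
        return question  # nothing to corrupt
    old = random.choice(present)
    replacements = [e for e in entities if e != old]
    if not replacements:
        return question  # no alternative entity available
    return question.replace(old, random.choice(replacements))

# Usage: a resilient model should abstain on the corrupted variant.
print(corrupt_question("What is the invoice total for ACME Corp?",
                       ["ACME Corp", "Globex Inc"]))
```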
Geospatial Chain of Thought Reasoning for Enhanced Visual Question Answering on Satellite Imagery
Positive · Artificial Intelligence
Geospatial chain of thought (CoT) reasoning is crucial for enhancing Visual Question Answering (VQA) on satellite imagery, especially in climate-related applications like disaster monitoring and urban resilience planning. Current VQA models can interpret remote sensing data but often lack the structured reasoning needed for complex geospatial queries. A new framework integrating CoT reasoning with Direct Preference Optimization (DPO) has been proposed, showing a 34.9% accuracy improvement in handling tasks such as detection and classification.
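
For context, DPO optimizes a policy directly from preference pairs without a separate reward model. The sketch below shows the standard DPO objective as it might be applied to paired geospatial reasoning traces; the function signature is illustrative, and the pairing of chain-of-thought answers is an assumption about the framework's setup.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective: widen the policy's log-prob margin between
    the preferred and dispreferred reasoning traces, measured relative to
    a frozen reference model."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```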
Efficient Reasoning via Thought-Training and Thought-Free Inference
Positive · Artificial Intelligence
Recent advancements in large language models (LLMs) have utilized Chain-of-Thought (CoT) prompting to enhance reasoning accuracy. However, existing methods that compress lengthy reasoning outputs still rely on explicit reasoning during inference. The 3TF framework (Thought-Training and Thought-Free inference) presents a Short-to-Long approach to efficient reasoning: it trains a hybrid model to operate in both reasoning and non-reasoning modes, internalizing structured reasoning while producing concise outputs at inference time.
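
A minimal sketch of what hybrid-mode supervision in the spirit of 3TF could look like: each example is rendered once with an explicit reasoning trace and once answer-only, with a control tag selecting the mode. The tag tokens and record layout are guesses for illustration, not the paper's exact format.

```python
def render(example: dict, mode: str) -> dict:
    """Render one example in 'think' (explicit trace) or 'nothink'
    (answer-only) mode; tag tokens are illustrative."""
    if mode == "think":
        # Thought-training: supervise the full reasoning trace.
        target = f"<think>{example['reasoning']}</think>{example['answer']}"
    else:
        # Thought-free: supervise only the concise answer, so the model
        # internalizes the reasoning without emitting it at inference.
        target = example["answer"]
    return {"input": f"[{mode}] {example['question']}", "target": target}

dataset = [{"question": "What is 2 + 3 * 4?",
            "reasoning": "3 * 4 = 12, then 2 + 12 = 14.",
            "answer": "14"}]
pairs = [render(ex, m) for ex in dataset for m in ("think", "nothink")]
```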