OmniSafeBench-MM: A Unified Benchmark and Toolbox for Multimodal Jailbreak Attack-Defense Evaluation

arXiv — cs.CV•Tuesday, December 9, 2025 at 5:00:00 AM

NeutralArtificial Intelligence

OmniSafeBench-MM has been introduced as a comprehensive benchmark and toolbox for evaluating multimodal jailbreak attack-defense scenarios, addressing the vulnerabilities of multimodal large language models (MLLMs) that can be exploited through jailbreak attacks. This toolbox integrates various attack methods and defense strategies across multiple risk domains, enhancing the evaluation process for MLLMs.
The development of OmniSafeBench-MM is significant as it fills existing gaps in the evaluation of MLLMs, which have been susceptible to harmful behaviors due to insufficient benchmarks. By providing a unified and reproducible framework, it aims to improve the safety and reliability of MLLMs in real-world applications.
This initiative reflects a growing recognition of the need for robust evaluation frameworks in the AI field, particularly as MLLMs become increasingly integrated into various applications. The introduction of multiple benchmarks, such as CFG-Bench and RoadBench, highlights the ongoing efforts to assess different aspects of MLLMs, including action intelligence and spatial reasoning, indicating a broader trend towards enhancing AI safety and performance.

— via World Pulse Now AI Editorial System

Read Original

Was this article worth reading? Share it

Open Source Surveillance

Search social media, cameras, and IoT devices for public safety insights.

AI & DataView app details

SafeUtils

Access 110+ secure, native developer tools for MacOS, Linux, and Windows desktops.

Business & ProductivityView app details

Bypass

Route safely around high-risk areas and help improve community safety with real-time alerts.

Tech & Developer ToolsView app details

Continue Readings

arXiv — cs.CV3 days ago

PrefGen: Multimodal Preference Learning for Preference-Conditioned Image Generation

PositiveArtificial Intelligence

A new framework named PrefGen has been introduced, focusing on multimodal preference learning for preference-conditioned image generation. This approach aims to enhance generative models by adapting outputs to reflect individual user preferences, moving beyond traditional textual prompts. The framework utilizes multimodal large language models (MLLMs) to capture nuanced user representations and improve the quality of generated images.

Read full article

via arXiv — cs.CV

arXiv — cs.CV3 days ago

START: Spatial and Textual Learning for Chart Understanding

PositiveArtificial Intelligence

A new framework named START has been proposed to enhance chart understanding in multimodal large language models (MLLMs), focusing on the integration of spatial and textual learning. This initiative aims to improve the analysis of scientific papers and technical reports by enabling MLLMs to accurately interpret structured visual layouts and underlying data representations in charts.

Read full article

via arXiv — cs.CV

arXiv — cs.CV3 days ago

Math Blind: Failures in Diagram Understanding Undermine Reasoning in MLLMs

NeutralArtificial Intelligence

Recent research highlights significant shortcomings in Multimodal Large Language Models (MLLMs) regarding their ability to interpret diagrams, which are crucial for understanding abstract concepts and relationships. The study reveals that MLLMs struggle with basic perceptual tasks, exhibiting near-zero accuracy in fine-grained grounding and object identification.

Read full article

via arXiv — cs.CV

arXiv — cs.CV3 days ago

Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs

PositiveArtificial Intelligence

A novel framework named UniME has been introduced to enhance multimodal representation learning by addressing limitations in existing models like CLIP, particularly in text token truncation and isolated encoding. This two-stage approach utilizes Multimodal Large Language Models (MLLMs) to learn discriminative representations for various tasks, aiming to break the modality barrier in AI applications.

Read full article

via arXiv — cs.CV

arXiv — cs.CV3 days ago

MMRPT: MultiModal Reinforcement Pre-Training via Masked Vision-Dependent Reasoning

PositiveArtificial Intelligence

The introduction of MMRPT, a masked multimodal reinforcement pre-training framework, aims to enhance visual reasoning in Multimodal Large Language Models (MLLMs) by incorporating reinforcement learning directly into their pre-training. This approach addresses the limitations of traditional models that often rely on surface linguistic cues rather than grounded visual understanding.

Read full article

via arXiv — cs.CV

arXiv — cs.CV3 days ago

3DRS: MLLMs Need 3D-Aware Representation Supervision for Scene Understanding

PositiveArtificial Intelligence

Recent research has introduced 3DRS, a framework designed to enhance the 3D representation capabilities of multimodal large language models (MLLMs) by incorporating supervision from pretrained 3D foundation models. This approach addresses the limitations of MLLMs, which have struggled with explicit 3D data during pretraining, thereby improving their performance in scene understanding tasks.

Read full article

via arXiv — cs.CV

arXiv — cs.CV3 days ago

MM-SeR: Multimodal Self-Refinement for Lightweight Image Captioning

PositiveArtificial Intelligence

A new lightweight image captioning model, MM-SeR, has been developed to address the high computational costs associated with existing multimodal language models (MLLMs). By utilizing a compact 125M-parameter model, MM-SeR achieves comparable performance to larger models while significantly reducing size and complexity.

Read full article

via arXiv — cs.CV

arXiv — cs.CV3 days ago

HalluShift++: Bridging Language and Vision through Internal Representation Shifts for Hierarchical Hallucinations in MLLMs

NeutralArtificial Intelligence

A recent study introduces HalluShift++, a framework aimed at addressing hallucinations in Multimodal Large Language Models (MLLMs) by analyzing internal layer dynamics. This approach seeks to measure hallucinations not just as distributional shifts but through specific layer-wise analysis, enhancing the understanding of how these models generate outputs that may not align with visual content.

Read full article

via arXiv — cs.CV