Adversarial Confusion Attack: Disrupting Multimodal Large Language Models

arXiv — cs.CL · Wednesday, November 26, 2025 at 5:00:00 AM
  • The Adversarial Confusion Attack has been introduced as a new threat to multimodal large language models (MLLMs): adversarial images crafted to push a model into producing incoherent or confidently incorrect responses. The attack targets the reliability of MLLM-powered agents and is reported to be effective across a range of models, including proprietary ones such as GPT-5.1 (an illustrative, hypothetical sketch of this style of attack follows the summary below).
  • This development is significant as it highlights vulnerabilities in MLLMs, which are increasingly relied upon for various applications, including content generation and data analysis. The ability to induce systematic disruption raises concerns about the integrity and trustworthiness of AI systems in critical domains.
  • The emergence of such attacks underscores ongoing challenges in the field of AI, particularly regarding the optimization of MLLMs across different modalities and the need for robust defenses against adversarial threats. This situation reflects a broader discourse on the balance between innovation in AI technologies and the potential risks posed by malicious exploitation.
— via World Pulse Now AI Editorial System
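
The digest does not describe the paper's actual optimization objective or experimental setup, so the following is only a minimal, hypothetical sketch of how a "confusion"-style adversarial image might be produced against an open-weight surrogate MLLM: a PGD-style perturbation that maximizes the entropy of the model's next-token distribution. The surrogate model, prompt template, objective, and hyperparameters are all assumptions for illustration, not details taken from the paper.

```python
# Hypothetical sketch only: surrogate model, prompt template, entropy objective,
# and hyperparameters are assumptions, not the paper's method.
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

MODEL_ID = "llava-hf/llava-1.5-7b-hf"  # assumed open-weight surrogate MLLM

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16
).cuda().eval()
for p in model.parameters():           # freeze weights; only the image is optimized
    p.requires_grad_(False)

def confusion_attack(image, prompt, steps=100, eps=8 / 255, alpha=1 / 255):
    """PGD-style loop that perturbs the image to maximize the entropy of the
    model's next-token distribution (one plausible proxy for 'confusion')."""
    inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda")
    pixels = inputs["pixel_values"].detach()
    delta = torch.zeros_like(pixels, requires_grad=True)

    for _ in range(steps):
        out = model(**{**inputs, "pixel_values": pixels + delta})
        logits = out.logits[:, -1, :].float()                  # next-token logits
        probs = torch.softmax(logits, dim=-1)
        entropy = -(probs * torch.log(probs + 1e-12)).sum(-1).mean()
        entropy.backward()                                     # gradient ascent on entropy
        with torch.no_grad():
            delta += alpha * delta.grad.sign()
            delta.clamp_(-eps, eps)  # L_inf budget in normalized pixel space (simplification)
            delta.grad.zero_()
    return (pixels + delta).detach()

# Example call (prompt template assumed for this LLaVA checkpoint):
# from PIL import Image
# adv = confusion_attack(Image.open("photo.jpg"),
#                        "USER: <image>\nDescribe this image. ASSISTANT:")
```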

Continue Reading
ReMatch: Boosting Representation through Matching for Multimodal Retrieval
Positive · Artificial Intelligence
ReMatch has been introduced as a new framework that utilizes the generative capabilities of Multimodal Large Language Models (MLLMs) for enhanced multimodal retrieval. This approach trains the MLLM end-to-end, employing a chat-style generative matching stage that assesses relevance from various inputs, including raw data and projected embeddings.
CaptionQA: Is Your Caption as Useful as the Image Itself?
Positive · Artificial Intelligence
A new benchmark called CaptionQA has been introduced to evaluate the utility of model-generated captions in supporting downstream tasks across various domains, including Natural, Document, E-commerce, and Embodied AI. This benchmark consists of 33,027 annotated multiple-choice questions that require visual information to answer, aiming to assess whether captions can effectively replace images in multimodal systems.
LLaVA-UHD v3: Progressive Visual Compression for Efficient Native-Resolution Encoding in MLLMs
Positive · Artificial Intelligence
LLaVA-UHD v3 has been introduced as a new multi-modal large language model (MLLM) that utilizes Progressive Visual Compression (PVC) for efficient native-resolution encoding, enhancing visual understanding capabilities while addressing computational overhead. This model integrates refined patch embedding and windowed token compression to optimize performance in vision-language tasks.
Monet: Reasoning in Latent Visual Space Beyond Images and Language
Positive · Artificial Intelligence
A new training framework named Monet has been introduced to enhance multimodal large language models (MLLMs) by enabling them to reason directly within latent visual spaces, generating continuous embeddings as intermediate visual thoughts. This approach addresses the limitations of existing methods that rely heavily on external tools for visual reasoning.
CAPability: A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Thoroughness
Positive · Artificial Intelligence
CAPability has been introduced as a comprehensive visual caption benchmark designed to evaluate the correctness and thoroughness of captions generated by multimodal large language models (MLLMs). This benchmark addresses the limitations of existing visual captioning assessments, which often rely on brief ground-truth sentences and traditional metrics that fail to capture detailed captioning effectively.
Thinking With Bounding Boxes: Enhancing Spatio-Temporal Video Grounding via Reinforcement Fine-Tuning
Positive · Artificial Intelligence
A new framework named STVG-o1 has been introduced to enhance spatio-temporal video grounding (STVG) by enabling multimodal large language models (MLLMs) to achieve state-of-the-art performance without architectural changes. This framework employs a bounding-box chain-of-thought mechanism and a multi-dimensional reinforcement reward function to improve localization accuracy in untrimmed videos based on natural language descriptions.
Qwen3-VL Technical Report
Positive · Artificial Intelligence
Qwen3-VL has been introduced as the latest vision-language model in the Qwen series, showcasing enhanced capabilities across various multimodal benchmarks. It supports interleaved contexts of up to 256K tokens, integrating text, images, and video, with variants designed for different latency-quality trade-offs.
A weekend ‘vibe code’ hack by Andrej Karpathy quietly sketches the missing layer of enterprise AI orchestration
Positive · Artificial Intelligence
Andrej Karpathy, former director of AI at Tesla and a founding member of OpenAI, created a 'vibe code project' over the weekend, allowing multiple AI assistants to collaboratively read and critique a book, ultimately synthesizing a final answer under a designated 'Chairman.' The project, named LLM Council, was shared on GitHub with a disclaimer about its ephemeral nature.
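
For readers curious about the orchestration pattern this blurb describes, the sketch below shows one generic way a "council plus Chairman" loop could be wired up. It is not the code of the LLM Council repository; `ask_model`, the member labels, and the prompts are hypothetical placeholders under stated assumptions.

```python
# Generic sketch of a "council" orchestration pattern, not the LLM Council repo's code.
from dataclasses import dataclass

@dataclass
class CouncilMember:
    name: str  # e.g. "gpt", "claude", "gemini" (assumed labels)

def ask_model(member: CouncilMember, prompt: str) -> str:
    """Hypothetical stand-in for whatever chat-completion client each member uses."""
    raise NotImplementedError("plug in your own LLM client here")

def run_council(question: str, members: list[CouncilMember], chairman: CouncilMember) -> str:
    # Round 1: every member answers the question independently.
    answers = {m.name: ask_model(m, question) for m in members}

    # Round 2: each member critiques the other members' answers.
    critiques = {}
    for m in members:
        others = "\n\n".join(f"[{n}]\n{a}" for n, a in answers.items() if n != m.name)
        critiques[m.name] = ask_model(m, f"Critique these answers to: {question}\n\n{others}")

    # Final round: the Chairman synthesizes answers and critiques into one reply.
    dossier = "\n\n".join(
        f"[{n}] answer:\n{answers[n]}\n[{n}] critique:\n{critiques[n]}" for n in answers
    )
    return ask_model(
        chairman,
        f"Question: {question}\n\nCouncil materials:\n{dossier}\n\nSynthesize a final answer.",
    )
```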