HV-Attack: Hierarchical Visual Attack for Multimodal Retrieval Augmented Generation

arXiv — cs.CVThursday, November 20, 2025 at 5:00:00 AM
  • The study introduces a novel Hierarchical Visual Attack targeting multimodal Retrieval
  • The implications of this research are critical as it underscores the need for enhanced security measures in Large Multimodal Models (LMMs), which are increasingly used in various applications. The findings raise alarms about the safety and reliability of these systems in real
  • This development is part of a broader discourse on the vulnerabilities of AI systems, particularly in the context of adversarial attacks. The ongoing exploration of transferability of such attacks across different models indicates a pressing need for robust defenses against emerging threats in the AI landscape.
— via World Pulse Now AI Editorial System

Was this article worth reading? Share it

Recommended apps based on your readingExplore all apps
Continue Readings
Edge-Optimized Multimodal Learning for UAV Video Understanding via BLIP-2
NeutralArtificial Intelligence
A new study introduces an edge-optimized multimodal learning platform for unmanned aerial vehicles (UAVs) utilizing BLIP-2, integrated with YOLO-World and YOLOv8-Seg models. This approach addresses the challenge of high computational costs associated with large vision-language models while operating within the limited resources of UAV edge devices. The integration enhances the capabilities of UAVs in real-time visual understanding and interaction in complex environments.
ClimateIQA: A New Dataset and Benchmark to Advance Vision-Language Models in Meteorology Anomalies Analysis
PositiveArtificial Intelligence
A new dataset named ClimateIQA has been introduced to enhance the capabilities of Vision-Language Models (VLMs) in analyzing meteorological anomalies. This dataset, which includes 26,280 high-quality images, aims to address the challenges faced by existing models like GPT-4o and Qwen-VL in interpreting complex meteorological heatmaps characterized by irregular shapes and color variations.
Franca: Nested Matryoshka Clustering for Scalable Visual Representation Learning
PositiveArtificial Intelligence
Franca, the first fully open-source vision foundation model, has been introduced, showcasing performance that matches or exceeds proprietary models like DINOv2 and CLIP. This model utilizes a transparent training pipeline and publicly available datasets, addressing limitations in current self-supervised learning clustering methods through a novel nested Matryoshka clustering approach.
SWAGSplatting: Semantic-guided Water-scene Augmented Gaussian Splatting
PositiveArtificial Intelligence
The introduction of SWAGSplatting, a novel framework for underwater 3D reconstruction, addresses the challenges posed by light attenuation and limited visibility in aquatic environments. This approach integrates semantic understanding with 3D Gaussian Splatting, enhancing the accuracy and fidelity of underwater scene reconstruction.
FigEx2: Visual-Conditioned Panel Detection and Captioning for Scientific Compound Figures
PositiveArtificial Intelligence
The recent introduction of FigEx2, a visual-conditioned framework, aims to enhance the understanding of scientific compound figures by localizing panels and generating detailed captions directly from the images. This addresses the common issue of missing or inadequate captions that hinder panel-level comprehension.
MMLGNet: Cross-Modal Alignment of Remote Sensing Data using CLIP
PositiveArtificial Intelligence
A novel multimodal framework, MMLGNet, has been introduced to align heterogeneous remote sensing modalities, such as Hyperspectral Imaging and LiDAR, with natural language semantics using vision-language models like CLIP. This framework employs modality-specific encoders and bi-directional contrastive learning to enhance the understanding of complex Earth observation data.
Aligning by Misaligning: Boundary-aware Curriculum Learning for Multimodal Alignment
PositiveArtificial Intelligence
A new approach called Boundary-Aware Curriculum with Local Attention (BACL) has been proposed to enhance multimodal alignment in AI models. This method addresses the challenge of treating ambiguous negative pairs uniformly, introducing a curriculum signal that differentiates borderline cases and improves model performance.
Decentralized Autoregressive Generation
NeutralArtificial Intelligence
A theoretical analysis of decentralization in autoregressive generation has been presented, introducing the Decentralized Discrete Flow Matching objective, which expresses probability generating velocity as a linear combination of expert flows. Experiments demonstrate the equivalence between decentralized and centralized training settings for multimodal language models, specifically comparing LLaVA and InternVL 2.5-1B.

Ready to build your own newsroom?

Subscribe to unlock a personalised feed, podcasts, newsletters, and notifications tailored to the topics you actually care about