Intelligent Image Search Algorithms Fusing Visual Large Models

arXiv — cs.CV · Wednesday, November 26, 2025 at 5:00:00 AM
  • A new framework called DetVLM enhances fine-grained image retrieval by integrating object detection with Visual Large Models (VLMs). Its two-stage pipeline uses a YOLO detector for efficient component-level screening, addressing the weakness of conventional methods in state-specific retrieval and zero-shot search.
  • DetVLM matters because it targets both the accuracy and the efficiency of image retrieval in critical fields such as security and industrial inspection, where precise identification of object components and their states is essential.
  • The work reflects a broader trend in artificial intelligence: fusing different model types, such as YOLO detectors and VLMs, to improve performance. The continuing evolution of object detection frameworks and their applications across domains, including fashion and anomaly detection, underscores the value of combining such technologies to meet complex retrieval challenges.
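The screen-then-verify idea behind DetVLM can be illustrated with a small sketch. Everything below is a hypothetical stand-in, not the paper's implementation: `detect_components` mimics a cheap YOLO-style pass that filters a gallery by component, and `vlm_verify_state` mimics a VLM answering a state question (e.g. "Is the door open?") only for candidates that survive the filter.

```python
# Toy two-stage retrieval pipeline: cheap detector screening, then VLM
# state verification on the surviving candidates only. All names and the
# in-memory "index" are illustrative assumptions.

from dataclasses import dataclass


@dataclass
class Detection:
    label: str
    confidence: float


def detect_components(image_id, index):
    """Stage 1 stand-in: return precomputed detector hits for an image."""
    return [Detection(label, conf) for label, conf in index.get(image_id, [])]


def vlm_verify_state(image_id, component, state, vlm_answers):
    """Stage 2 stand-in: a VLM yes/no answer about a component's state."""
    return vlm_answers.get((image_id, component, state), False)


def search(component, state, gallery, index, vlm_answers, min_conf=0.5):
    """Return gallery images whose `component` is detected and in `state`."""
    hits = []
    for image_id in gallery:
        # Stage 1: component-level screening rejects most images cheaply.
        if not any(d.label == component and d.confidence >= min_conf
                   for d in detect_components(image_id, index)):
            continue
        # Stage 2: the (expensive) VLM check runs only on the survivors.
        if vlm_verify_state(image_id, component, state, vlm_answers):
            hits.append(image_id)
    return hits
```

In a real deployment, stage 1 would call a detector such as YOLO over an image gallery and stage 2 would prompt a VLM; the point of the split is that the expensive model only sees the small candidate set the detector lets through.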
— via World Pulse Now AI Editorial System

Continue Reading
IAG: Input-aware Backdoor Attack on VLM-based Visual Grounding
Neutral · Artificial Intelligence
A novel vulnerability in vision-language models (VLMs) has been identified through the introduction of IAG, a method that enables multi-target backdoor attacks on VLM-based visual grounding systems. This technique utilizes dynamically generated, input-aware triggers that are text-guided, allowing for imperceptible manipulation of visual inputs while maintaining normal performance on benign samples.
TRANSPORTER: Transferring Visual Semantics from VLM Manifolds
Positive · Artificial Intelligence
The paper introduces TRANSPORTER, a model-independent approach designed to enhance video generation by transferring visual semantics from Vision Language Models (VLMs). This method addresses the challenge of understanding how VLMs derive their predictions, particularly in complex scenes with various objects and actions. TRANSPORTER generates videos that reflect changes in captions across diverse attributes and contexts.
Zero-Reference Joint Low-Light Enhancement and Deblurring via Visual Autoregressive Modeling with VLM-Derived Modulation
Positive · Artificial Intelligence
A new generative framework has been proposed for enhancing low-light images and reducing blur, utilizing visual autoregressive modeling guided by perceptual priors from vision-language models. This approach addresses significant challenges in restoring dark images, which often suffer from low visibility, contrast, noise, and blur.
Peregrine: One-Shot Fine-Tuning for FHE Inference of General Deep CNNs
Positive · Artificial Intelligence
The recent paper titled 'Peregrine: One-Shot Fine-Tuning for FHE Inference of General Deep CNNs' addresses key challenges in adapting deep convolutional neural networks (CNNs) for fully homomorphic encryption (FHE) inference. It introduces a single-stage fine-tuning strategy and a generalized interleaved packing scheme to enhance the performance of CNNs while maintaining accuracy and supporting high-resolution image processing.
DiffSeg30k: A Multi-Turn Diffusion Editing Benchmark for Localized AIGC Detection
Positive · Artificial Intelligence
The introduction of DiffSeg30k marks a significant advancement in the detection of AI-generated content (AIGC) by providing a dataset of 30,000 diffusion-edited images with pixel-level annotations. This dataset allows for fine-grained detection of localized edits, addressing a gap in existing benchmarks that typically assess entire images without considering localized modifications.
From Pixels to Posts: Retrieval-Augmented Fashion Captioning and Hashtag Generation
Positive · Artificial Intelligence
A new framework has been introduced for automatic fashion captioning and hashtag generation, utilizing a retrieval-augmented approach that integrates multi-garment detection, attribute reasoning, and Large Language Model (LLM) prompting. This system aims to produce visually grounded and stylistically engaging text for fashion images, addressing the shortcomings of traditional end-to-end captioners in attribute fidelity and domain generalization.
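The detect-retrieve-prompt flow described above can be sketched as follows. The function names, the attribute-overlap scoring, and the toy catalog are all illustrative assumptions, not the framework's actual components: detected garments are matched against a reference catalog, and the retrieved descriptions ground the prompt handed to an LLM.

```python
# Hypothetical retrieval-augmented prompt construction for fashion captioning.
# Retrieval here is a simple attribute-overlap ranking over a toy catalog.

def retrieve_similar(garment, catalog, k=2):
    """Rank catalog entries by attribute overlap with a detected garment."""
    ranked = sorted(
        catalog,
        key=lambda entry: len(set(entry["attrs"]) & set(garment["attrs"])),
        reverse=True,
    )
    return ranked[:k]


def build_caption_prompt(garments, catalog):
    """Compose an LLM prompt grounded in detections and retrieved references."""
    lines = ["Write a visually grounded fashion caption for an outfit featuring:"]
    for garment in garments:
        refs = retrieve_similar(garment, catalog)
        ref_text = "; ".join(r["desc"] for r in refs) or "no close references"
        lines.append(
            f"- {garment['name']} ({', '.join(garment['attrs'])})"
            f" | similar items: {ref_text}"
        )
    return "\n".join(lines)
```

Grounding the prompt in retrieved, human-written descriptions is what pushes the LLM toward attribute-faithful text rather than generic captions; a production system would retrieve with visual embeddings rather than literal attribute overlap.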
Multimodal Real-Time Anomaly Detection and Industrial Applications
Positive · Artificial Intelligence
A comprehensive multimodal room-monitoring system has been developed, integrating synchronized video and audio processing for real-time activity recognition and anomaly detection. The system has undergone two iterations, with the advanced version featuring multi-model audio ensembles and hybrid object detection methods, significantly enhancing its accuracy and robustness.
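A minimal sketch of how a multi-model ensemble with hybrid audio/video fusion might flag anomalies. The majority-vote rule and the either-modality fusion logic are assumptions chosen for illustration, not the described system's design.

```python
# Toy hybrid anomaly detector: majority vote over audio-model labels,
# fused with object detections from video. All inputs are assumed shapes.

from collections import Counter


def ensemble_vote(predictions):
    """Majority vote across an ensemble of audio-model labels."""
    return Counter(predictions).most_common(1)[0][0]


def flag_anomaly(audio_labels, video_objects, normal_audio, normal_objects):
    """Raise an alert if either modality observes something unexpected."""
    audio_event = ensemble_vote(audio_labels)
    audio_anomaly = audio_event not in normal_audio
    video_anomaly = any(obj not in normal_objects for obj in video_objects)
    return audio_anomaly or video_anomaly
```

Fusing the modalities with an OR keeps recall high (either channel can trigger an alert); a real system would likely weight model confidences instead of counting votes.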