Cross-modal Proxy Evolving for OOD Detection with Vision-Language Models

arXiv (cs.CV) · Wednesday, January 14, 2026 at 5:00:00 AM
  • A new framework named CoEvo has been proposed for zero-shot out-of-distribution (OOD) detection in vision-language models, addressing the challenges posed by the absence of labeled negatives. CoEvo employs a bidirectional adaptation mechanism for both textual and visual proxies, dynamically refining them based on contextual information from test images. This innovation aims to enhance the reliability of OOD detection in open-world applications.
  • CoEvo matters because reliable detection of OOD inputs is a prerequisite for deploying vision-language models in real-world settings. By replacing static textual proxies with proxies that adapt to the test images, it aims to improve detection performance and stability, which in turn could widen the range of applications where these models can be used safely.
  • This advancement aligns with ongoing efforts in the AI community to enhance model adaptability and efficiency. Techniques such as Decorrelated Backpropagation and structured initialization for Vision Transformers are also gaining traction, indicating a trend towards optimizing model training processes. Furthermore, the exploration of methods like Channel-Aware Typical Set Refinement highlights a collective focus on improving OOD detection, underscoring the importance of robust machine learning frameworks in diverse applications.
— via World Pulse Now AI Editorial System
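
To make the idea of proxy-based zero-shot OOD scoring concrete, here is a minimal Python sketch. It is not CoEvo's algorithm, which the summary above does not specify. It assumes only that images and proxies are embedding vectors in a shared space, scores an image by the softmax mass its similarities place on in-distribution proxies, and uses a simple moving-average update as a stand-in for test-time proxy refinement. The names `id_score` and `refine_proxy`, the temperature, and the momentum value are all hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length (assumes v is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def id_score(image_emb, id_proxies, neg_proxies, temperature=0.1):
    """Softmax mass assigned to in-distribution proxies.

    Values near 1 suggest an in-distribution image, values near 0
    suggest an OOD image. Hypothetical scoring rule for illustration.
    """
    sims = [cosine(image_emb, p) for p in id_proxies + neg_proxies]
    top = max(sims)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in sims]
    return sum(exps[: len(id_proxies)]) / sum(exps)


def refine_proxy(proxy, image_emb, momentum=0.9):
    """Nudge a proxy toward a test image embedding (moving average).

    A stand-in for test-time proxy adaptation, not the paper's update.
    """
    p, x = normalize(proxy), normalize(image_emb)
    mixed = [momentum * pi + (1 - momentum) * xi for pi, xi in zip(p, x)]
    return normalize(mixed)


if __name__ == "__main__":
    # Toy 2-D embeddings: two in-distribution proxies, one negative proxy.
    id_proxies = [[1.0, 0.0], [0.0, 1.0]]
    neg_proxies = [[-1.0, -1.0]]
    print(id_score([0.9, 0.1], id_proxies, neg_proxies))
    print(id_score([-1.0, -0.8], id_proxies, neg_proxies))
    print(refine_proxy([1.0, 0.0], [0.0, 1.0]))
```

In a real system the embeddings would come from a vision-language model's image and text encoders, and a threshold on the score would separate in-distribution from OOD inputs. Both the encoder and the threshold are left out of this sketch.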

