VLM as Strategist: Adaptive Generation of Safety-critical Testing Scenarios via Guided Diffusion

arXiv — cs.LG · Wednesday, December 3, 2025 at 5:00:00 AM
  • A new framework for generating safety-critical testing scenarios for autonomous driving systems (ADSs) has been proposed, integrating Vision Language Models (VLMs) with adaptive guided diffusion models. The framework targets the scarcity of effective testing scenarios that expose system vulnerabilities, particularly in real-time dynamic environments (a minimal guidance sketch follows this summary).
  • The development is significant as it enhances the reliability and safety of ADSs, which are increasingly being deployed in real-world applications. By improving scenario generation, the framework could lead to more robust testing protocols, ultimately fostering greater public trust in autonomous technologies.
  • This advancement reflects ongoing efforts to enhance the capabilities of VLMs in various applications, including 3D spatial reasoning and object-interaction reasoning. The integration of VLMs into safety-critical systems underscores a broader trend in AI development, where the focus is on creating models that can effectively understand and interact with complex environments.
— via World Pulse Now AI Editorial System
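
The paper's implementation is not reproduced here, but the core mechanism, steering a diffusion sampler with an external criticality signal, can be sketched. Below is a minimal, illustrative PyTorch sketch assuming a toy denoiser and a differentiable stand-in for a VLM-derived risk score; `denoiser`, `risk_score`, and the update rule are our assumptions, not the paper's API.

```python
# Minimal sketch of score-guided diffusion sampling. All components
# here are toy stand-ins chosen so the sketch runs end to end.
import torch

def risk_score(x: torch.Tensor) -> torch.Tensor:
    # Toy differentiable criticality score; in the described framework
    # this signal would come from a VLM judging how safety-critical
    # the decoded driving scenario is.
    return -(x ** 2).mean()

def denoiser(x: torch.Tensor, t: int) -> torch.Tensor:
    # Stand-in for a pretrained diffusion denoiser eps_theta(x, t).
    return 0.1 * x

def guided_sample(steps: int = 50, scale: float = 2.0) -> torch.Tensor:
    x = torch.randn(1, 8)  # latent scenario parameters (toy dimension)
    for t in reversed(range(steps)):
        x = x.detach().requires_grad_(True)
        grad = torch.autograd.grad(risk_score(x), x)[0]
        with torch.no_grad():
            # Denoise, then nudge the sample toward higher risk.
            x = x - denoiser(x, t) + scale * grad + 0.01 * torch.randn_like(x)
    return x.detach()

print(guided_sample())
```

The guidance term plays the role the summary attributes to the VLM: it adapts each sampling step toward scenarios the scorer rates as more safety-critical, rather than sampling scenarios blindly.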

Continue Reading
Shape and Texture Recognition in Large Vision-Language Models
Neutral · Artificial Intelligence
The Large Shapes and Textures dataset (LAS&T) has been introduced to enhance the capabilities of Large Vision-Language Models (LVLMs) in recognizing and representing shapes and textures across various contexts. This dataset, created through unsupervised extraction from natural images, serves as a benchmark for evaluating the performance of leading models like CLIP and DINO in shape recognition tasks.
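As a concrete illustration of the kind of probe such a benchmark enables, the sketch below runs zero-shot shape classification with the public openai/clip-vit-base-patch32 checkpoint; the synthetic test image and prompt set are illustrative, not drawn from LAS&T.

```python
# Zero-shot shape probe with CLIP: classify a synthetic shape image
# against a handful of shape prompts. Stimulus and prompts are toy.
from PIL import Image, ImageDraw
from transformers import CLIPModel, CLIPProcessor

# Draw a white triangle on a black canvas as a test stimulus.
img = Image.new("RGB", (224, 224), "black")
ImageDraw.Draw(img).polygon([(112, 30), (30, 190), (194, 190)], fill="white")

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a triangle", "a square", "a circle", "a star"]
inputs = processor(text=labels, images=img, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]

for label, p in zip(labels, probs.tolist()):
    print(f"{label}: {p:.3f}")
```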
Pay Less Attention to Function Words for Free Robustness of Vision-Language Models
Neutral · Artificial Intelligence
A recent study introduces Function-word De-Attention (FDA) as a method to enhance the robustness of Vision-Language Models (VLMs) against cross-modal adversarial attacks by reducing the influence of function words. The FDA technique differentiates between original and function-word cross-attention, leading to improved alignment and robustness in VLMs. Comprehensive experiments demonstrate significant reductions in attack success rates with minimal performance drops across various models and tasks.
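The paper's exact FDA formulation is not reproduced here; the sketch below only illustrates the underlying idea as summarized, suppressing cross-attention mass assigned to function-word tokens and renormalizing. The function-word list, scale factor, and tensor shapes are assumptions.

```python
# Illustrative de-attention over text tokens in a cross-attention map.
import torch

# Deliberately incomplete function-word list, for illustration only.
FUNCTION_WORDS = {"a", "an", "the", "of", "to", "in", "on", "and", "is"}

def de_attend(attn: torch.Tensor, tokens: list, scale: float = 0.1) -> torch.Tensor:
    # attn: (num_visual_queries, num_text_tokens) cross-attention weights.
    mask = torch.tensor([t in FUNCTION_WORDS for t in tokens])
    out = attn.clone()
    out[:, mask] *= scale                   # suppress function-word columns
    return out / out.sum(-1, keepdim=True)  # renormalize each row

tokens = ["a", "dog", "on", "the", "grass"]
attn = torch.softmax(torch.randn(4, len(tokens)), dim=-1)
print(de_attend(attn, tokens))
```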
OpenSubject: Leveraging Video-Derived Identity and Diversity Priors for Subject-driven Image Generation and Manipulation
Positive · Artificial Intelligence
OpenSubject has been introduced as a large-scale video-derived dataset comprising 2.5 million samples and 4.35 million images, aimed at improving subject-driven image generation and manipulation. This dataset employs a four-stage pipeline that utilizes cross-frame identity priors to enhance the accuracy of generated images in complex scenes with multiple subjects.
VFM-VLM: Vision Foundation Model and Vision Language Model based Visual Comparison for 3D Pose Estimation
Positive · Artificial Intelligence
A recent study has conducted a visual comparison between Vision Foundation Models (VFMs) and Vision Language Models (VLMs) for 3D pose estimation, particularly in hand-object grasping scenarios. The research highlights the strengths of CLIP in semantic understanding and of DINOv2 in providing dense geometric features, demonstrating their complementary roles in enhancing 6D object pose estimation.
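The complementary-features idea can be sketched as simple feature concatenation feeding a pose head. The sketch below uses the public CLIP and DINOv2 checkpoints; the shared preprocessing and the linear pose head are simplifying assumptions, not the paper's architecture.

```python
# Fuse CLIP's global semantic embedding with DINOv2's geometric
# embedding, then feed a hypothetical 6D-pose regression head.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

img = Image.new("RGB", (224, 224), "gray")  # placeholder input frame

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
dino = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")

with torch.no_grad():
    pix = proc(images=img, return_tensors="pt").pixel_values
    sem = clip.get_image_features(pixel_values=pix)  # (1, 512) semantics
    geo = dino(pix)                                  # (1, 384) geometry
    # Sharing CLIP preprocessing with DINOv2 is a simplification;
    # DINOv2 normally uses ImageNet normalization.
    fused = torch.cat([sem, geo], dim=-1)            # (1, 896)

# Hypothetical pose head: quaternion (4) + translation (3) for 6D pose.
pose_head = torch.nn.Linear(fused.shape[-1], 7)
print(pose_head(fused).shape)  # torch.Size([1, 7])
```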
Task-Model Alignment: A Simple Path to Generalizable AI-Generated Image Detection
Positive · Artificial Intelligence
A recent study highlights the challenges faced by Vision Language Models (VLMs) in detecting AI-generated images (AIGI), revealing that fine-tuning on high-level semantic supervision improves performance, while low-level pixel-artifact supervision leads to poor results. This misalignment between task and model capabilities is a core issue affecting detection accuracy.
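One plausible reading of "high-level semantic supervision" is a classifier trained on frozen VLM image embeddings with real-versus-generated labels. The sketch below shows that setup with random stand-in features; it is an assumption about the general recipe, not the paper's protocol.

```python
# Linear probe on (stand-in) frozen CLIP image features for AIGI
# detection. Real usage would embed real and generated images first.
import torch

feats = torch.randn(256, 512)          # stand-in for CLIP ViT-B/32 embeddings
labels = torch.randint(0, 2, (256,))   # 0 = real, 1 = AI-generated

probe = torch.nn.Linear(512, 2)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

for _ in range(100):
    opt.zero_grad()
    loss = torch.nn.functional.cross_entropy(probe(feats), labels)
    loss.backward()
    opt.step()

print(f"final training loss: {loss.item():.3f}")
```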
VOST-SGG: VLM-Aided One-Stage Spatio-Temporal Scene Graph Generation
Positive · Artificial Intelligence
The VOST-SGG framework has been introduced as a one-stage spatio-temporal scene graph generation model that leverages vision-language models (VLMs) to enhance the understanding of object relationships in video frames. This approach addresses limitations in existing models, such as uninformed query initialization and reliance on unimodal features for predicate classification.
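One plausible reading of "informed" query initialization is seeding DETR-style object queries from VLM text embeddings of candidate categories rather than learning them from scratch. The sketch below illustrates that reading with CLIP text features; the category vocabulary and dimensions are illustrative assumptions, not VOST-SGG's actual design.

```python
# Seed object queries from CLIP text embeddings of candidate classes.
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

categories = ["person", "cup", "table", "dog"]  # illustrative vocabulary
with torch.no_grad():
    batch = tok(categories, padding=True, return_tensors="pt")
    priors = model.get_text_features(**batch)   # (4, 512) semantic priors

# Downstream, a DETR-style decoder would refine these informed queries
# instead of starting from randomly initialized ones.
object_queries = torch.nn.Parameter(priors.clone())
print(object_queries.shape)  # torch.Size([4, 512])
```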