Enhancing End-to-End Autonomous Driving with Risk Semantic Distillation from VLM

arXiv — cs.CV · Wednesday, November 19, 2025 at 5:00:00 AM
  • The introduction of Risk Semantic Distillation (RSD) aims to enhance end-to-end autonomous driving.
  • This development is significant as it addresses the critical challenge of generalization in autonomous driving, which is essential for the safe deployment of these technologies in real-world environments.
  • The advancement of RSD reflects a broader trend in the field of autonomous driving, where integrating language and vision models is becoming increasingly important. This approach also improves decision-making.
— via World Pulse Now AI Editorial System


Recommended Readings
Cheating Stereo Matching in Full-scale: Physical Adversarial Attack against Binocular Depth Estimation in Autonomous Driving
Neutral · Artificial Intelligence
A recent study has introduced a novel physical adversarial attack targeting stereo matching models used in autonomous driving. Unlike traditional attacks that utilize 2D patches, this method employs a 3D physical adversarial example (PAE) with global camouflage texture, enhancing visual consistency across various viewpoints of stereo cameras. The research also presents a new 3D stereo matching rendering module to align the PAE with real-world positions, addressing the disparity effects inherent in binocular vision.
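The "disparity effects" mentioned above refer to the standard relation between pixel disparity and metric depth in a rectified stereo pair, depth = focal length × baseline / disparity. A minimal sketch of that relation (with hypothetical camera parameters, not values from the paper) shows why perturbing disparity perturbs estimated depth:

```python
# Minimal illustration of the stereo disparity-depth relation; the camera
# parameters below are hypothetical, not taken from the paper.

def depth_from_disparity(disparity_px: float, focal_px: float, baseline_m: float) -> float:
    """Recover metric depth from a pixel disparity for a rectified stereo pair."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / disparity_px

# A larger disparity means a closer object, so an adversarial texture that
# shifts matched disparities also shifts the depth the system perceives.
near = depth_from_disparity(disparity_px=40.0, focal_px=800.0, baseline_m=0.5)  # 10.0 m
far = depth_from_disparity(disparity_px=8.0, focal_px=800.0, baseline_m=0.5)    # 50.0 m
```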
VLMs Guided Interpretable Decision Making for Autonomous Driving
Positive · Artificial Intelligence
Recent advancements in autonomous driving have investigated the application of vision-language models (VLMs) in visual question answering (VQA) frameworks for driving decision-making. However, these methods often rely on handcrafted prompts and exhibit inconsistent performance, which hampers their effectiveness in real-world scenarios. This study assesses state-of-the-art open-source VLMs on high-level decision-making tasks using ego-view visual inputs, revealing significant limitations in their ability to provide reliable, context-aware decisions.
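The VQA-style decision loop described above can be sketched schematically: an ego-view image plus a handcrafted prompt go to a VLM, and its free-form answer is parsed into a discrete high-level action. The model call below is a stub, and the prompt and action set are illustrative, not from the study:

```python
# Schematic sketch of a VQA decision loop; the VLM call is a stub and the
# action vocabulary is hypothetical.

ACTIONS = {"keep lane", "change lane left", "change lane right", "stop"}

def stub_vlm(image_path: str, prompt: str) -> str:
    # Placeholder for a real vision-language model call.
    return "The ego vehicle should keep lane because the road ahead is clear."

def decide(image_path: str) -> str:
    prompt = "Given this ego-view image, which high-level action should the vehicle take?"
    answer = stub_vlm(image_path, prompt).lower()
    # Match the longest action phrase first; brittle parsing of free-form
    # answers is one place inconsistency can creep in.
    for action in sorted(ACTIONS, key=len, reverse=True):
        if action in answer:
            return action
    return "stop"  # conservative fallback when no action phrase is found

print(decide("ego_view.png"))  # keep lane
```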
MedGEN-Bench: Contextually entangled benchmark for open-ended multimodal medical generation
Positive · Artificial Intelligence
MedGEN-Bench is a newly introduced multimodal benchmark aimed at enhancing medical AI research, particularly in the context of Vision-Language Models (VLMs). It addresses significant limitations in existing medical visual benchmarks, which often rely on ambiguous queries and oversimplified diagnostic reasoning. MedGEN-Bench includes 6,422 expert-validated image-text pairs across six imaging modalities and 16 clinical tasks, structured to improve the integration of AI-generated images into clinical workflows.
STONE: Pioneering the One-to-N Backdoor Threat in 3D Point Cloud
Positive · Artificial Intelligence
Backdoor attacks represent a significant risk to deep learning, particularly in critical 3D applications like autonomous driving and robotics. Current methods primarily focus on static one-to-one attacks, leaving the more versatile one-to-N backdoor threat largely unaddressed. The introduction of STONE (Spherical Trigger One-to-N Backdoor Enabling) marks a pivotal advancement, offering a configurable spherical trigger that can manipulate multiple output labels while maintaining high accuracy in clean data.
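The "configurable spherical trigger" idea can be sketched as a toy: points sampled inside a small sphere are appended to a point cloud, and a trigger parameter (here, the radius) selects which of N target labels a backdoored model should emit. The geometry and parameterization below are illustrative, not STONE's actual construction:

```python
# Toy sketch of a spherical point-cloud trigger; the radius-to-label mapping
# is a hypothetical stand-in for a one-to-N configuration.
import random

def add_spherical_trigger(points, center, radius, n_trigger_points=16, seed=0):
    """Append points sampled uniformly inside a sphere to a point cloud."""
    rng = random.Random(seed)
    triggered = list(points)
    while len(triggered) < len(points) + n_trigger_points:
        offset = [rng.uniform(-radius, radius) for _ in range(3)]
        if sum(c * c for c in offset) <= radius * radius:  # rejection sampling
            triggered.append(tuple(ci + oi for ci, oi in zip(center, offset)))
    return triggered

# One-to-N flavor: different trigger radii select different target labels.
RADIUS_TO_LABEL = {0.05: 0, 0.10: 1, 0.20: 2}

cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
poisoned = add_spherical_trigger(cloud, center=(0.5, 0.5, 0.5), radius=0.10)
assert len(poisoned) == len(cloud) + 16
```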
Agentic Video Intelligence: A Flexible Framework for Advanced Video Exploration and Understanding
Positive · Artificial Intelligence
Agentic Video Intelligence (AVI) is a proposed framework designed to enhance video understanding by integrating complex reasoning with visual recognition. Unlike traditional Vision-Language Models (VLMs) that process videos in a single-pass manner, AVI introduces a three-phase reasoning process: Retrieve-Perceive-Review. This approach allows for both global exploration and focused local analysis. Additionally, AVI utilizes a structured video knowledge base organized through entity graphs, aiming to improve video comprehension without extensive training.
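The three-phase Retrieve-Perceive-Review structure described above can be sketched as a minimal control-flow loop. The retrieval store, perception step, and review check below are stubs; only the phase structure comes from the summary:

```python
# Minimal sketch of a Retrieve-Perceive-Review loop; all three phases are
# stubbed, and the entity-keyed knowledge base is illustrative.

def retrieve(query, knowledge_base):
    """Global exploration: pick candidate segments from the knowledge base."""
    return [seg for seg in knowledge_base if query in seg["entities"]]

def perceive(segment):
    """Focused local analysis of one retrieved segment (stubbed)."""
    return f"observation about {segment['id']}"

def review(observations, query):
    """Decide whether the observations suffice to answer the query (stubbed)."""
    return len(observations) > 0

def answer(query, knowledge_base, max_rounds=3):
    observations = []
    for _ in range(max_rounds):
        for seg in retrieve(query, knowledge_base):
            observations.append(perceive(seg))
        if review(observations, query):  # stop once the review phase is satisfied
            break
    return observations

kb = [{"id": "clip1", "entities": {"car"}}, {"id": "clip2", "entities": {"dog"}}]
print(answer("car", kb))  # ['observation about clip1']
```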
MMEdge: Accelerating On-device Multimodal Inference via Pipelined Sensing and Encoding
Positive · Artificial Intelligence
MMEdge is a proposed framework designed to enhance real-time multimodal inference on resource-constrained edge devices, crucial for applications like autonomous driving and mobile health. It addresses the challenges of sensing dynamics and inter-modality dependencies by breaking down the inference process into fine-grained sensing and encoding units. This allows for incremental computation as data is received, while a lightweight temporal aggregation module ensures accuracy by capturing rich temporal dynamics across different units.
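The pipelining idea above can be sketched schematically: rather than waiting for a full sensing window, each fine-grained chunk is encoded as soon as it arrives, and a lightweight temporal aggregator combines the per-chunk encodings at the end. The chunking, encoder, and aggregator below are all illustrative stand-ins:

```python
# Schematic sketch of pipelined sensing and encoding; the mean-based encoder
# and aggregator are toy stand-ins for learned modules.

def encode_chunk(chunk):
    """Per-chunk encoder stub: summarize a chunk as its mean value."""
    return sum(chunk) / len(chunk)

def aggregate(encodings):
    """Lightweight temporal aggregation stub: average the chunk encodings."""
    return sum(encodings) / len(encodings)

def pipelined_inference(stream, chunk_size=4):
    """Encode incrementally as samples arrive, then aggregate once at the end."""
    encodings, chunk = [], []
    for sample in stream:             # samples arrive one at a time
        chunk.append(sample)
        if len(chunk) == chunk_size:  # encode as soon as a chunk is complete
            encodings.append(encode_chunk(chunk))
            chunk = []
    if chunk:                         # handle a trailing partial chunk
        encodings.append(encode_chunk(chunk))
    return aggregate(encodings)

print(pipelined_inference([1, 2, 3, 4, 5, 6, 7, 8]))  # 4.5
```

The benefit is latency: encoding overlaps with sensing instead of running only after the whole window has been captured.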
Semantic Document Derendering: SVG Reconstruction via Vision-Language Modeling
Positive · Artificial Intelligence
The article discusses the limitations of multimedia documents, which are often distributed in static raster formats, hindering their editability. To address this, a new framework called SliDer is introduced, utilizing Vision-Language Models (VLMs) to convert slide images into editable Scalable Vector Graphics (SVG) representations. This approach aims to preserve the semantic structure of documents, overcoming the shortcomings of traditional raster-vectorization methods that fail to maintain the distinction between image and text elements.
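The output side of such a pipeline can be sketched minimally: given text and image elements that a VLM might extract from a slide, assemble an editable SVG that keeps the two element types distinct. The element fields below are illustrative and not the SliDer format:

```python
# Minimal sketch of assembling an editable SVG from extracted elements;
# the element schema (type/x/y/...) is hypothetical.
from xml.sax.saxutils import escape

def to_svg(elements, width=1280, height=720):
    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">']
    for el in elements:
        if el["type"] == "text":      # text stays text, so it remains editable
            parts.append(
                f'<text x="{el["x"]}" y="{el["y"]}" font-size="{el["size"]}">'
                f'{escape(el["content"])}</text>'
            )
        elif el["type"] == "image":   # raster content stays a separate image element
            parts.append(
                f'<image x="{el["x"]}" y="{el["y"]}" width="{el["w"]}" '
                f'height="{el["h"]}" href="{el["href"]}"/>'
            )
    parts.append("</svg>")
    return "\n".join(parts)

svg = to_svg([
    {"type": "text", "x": 100, "y": 80, "size": 40, "content": "Title"},
    {"type": "image", "x": 100, "y": 200, "w": 400, "h": 300, "href": "fig.png"},
])
```

Keeping text as `<text>` elements rather than rasterized pixels is exactly the distinction the summary says raster-vectorization methods fail to maintain.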
Understanding World or Predicting Future? A Comprehensive Survey of World Models
Neutral · Artificial Intelligence
The article discusses the growing interest in world models, particularly in the context of advancements in multimodal large language models like GPT-4 and video generation models such as Sora. It provides a comprehensive review of the literature on world models, which serve to either understand the current state of the world or predict future dynamics. The review categorizes world models based on their functions: constructing internal representations and predicting future states, with applications in generative games, autonomous driving, robotics, and social simulacra.
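The review's two-way categorization maps onto two functions: encoding observations into an internal state representation, and predicting future states from that representation. A toy sketch under that framing (with illustrative constant-velocity dynamics):

```python
# Toy sketch of the two roles the survey attributes to world models;
# the state fields and dynamics are illustrative.

def encode(observation):
    """Construct an internal state representation from a raw observation."""
    return {"position": observation[0], "velocity": observation[1]}

def predict(state, dt=1.0):
    """Predict the next state: constant-velocity toy dynamics."""
    return {"position": state["position"] + state["velocity"] * dt,
            "velocity": state["velocity"]}

state = encode((0.0, 2.0))
for _ in range(3):            # roll the model three steps into the future
    state = predict(state)
print(state["position"])      # 6.0
```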