UniMM-V2X: MoE-Enhanced Multi-Level Fusion for End-to-End Cooperative Autonomous Driving

arXiv — cs.CVThursday, November 13, 2025 at 5:00:00 AM
UniMM-V2X represents a significant leap in the field of autonomous driving by addressing the limitations of existing systems that often operate in isolation. The framework's multi-level fusion strategy allows for enhanced cooperation among agents, improving their ability to perceive, predict, and plan collaboratively. By integrating a Mixture-of-Experts architecture, UniMM-V2X dynamically enhances representations, leading to notable performance gains. Experiments conducted on the DAIR-V2X dataset demonstrate its state-of-the-art capabilities, with improvements in perception accuracy by 39.7%, a 7.2% reduction in prediction error, and a 33.2% enhancement in planning performance. These advancements not only highlight the potential of cooperative autonomous driving but also pave the way for safer and more efficient transportation systems in the future.
— via World Pulse Now AI Editorial System

Was this article worth reading? Share it

Recommended Readings
MMEdge: Accelerating On-device Multimodal Inference via Pipelined Sensing and Encoding
PositiveArtificial Intelligence
MMEdge is a proposed framework aimed at enhancing real-time multimodal inference on resource-constrained edge devices, crucial for applications like autonomous driving and mobile health. It addresses the challenges of sensing dynamics and model execution by decomposing the inference process into fine-grained units, allowing incremental computation as data arrives. Additionally, a lightweight temporal aggregation module is introduced to capture temporal dynamics, ensuring accuracy across different modalities.
Understanding World or Predicting Future? A Comprehensive Survey of World Models
NeutralArtificial Intelligence
The article discusses the growing interest in world models, particularly in the context of advancements in multimodal large language models like GPT-4 and video generation models such as Sora. It provides a comprehensive review of the literature on world models, which serve to either understand the current state of the world or predict future dynamics. The review categorizes world models based on their functions: constructing internal representations and predicting future states, with applications in generative games, autonomous driving, robotics, and social simulacra.
One-to-N Backdoor Attack in 3D Point Cloud via Spherical Trigger
PositiveArtificial Intelligence
Backdoor attacks pose a significant risk to deep learning systems, especially in critical 3D applications like autonomous driving and robotics. This study introduces a novel one-to-N backdoor framework for 3D vision, utilizing a configurable spherical trigger. The research demonstrates that a single trigger can effectively encode multiple target classes, achieving high attack success rates of up to 100% while preserving accuracy on clean data.
FQ-PETR: Fully Quantized Position Embedding Transformation for Multi-View 3D Object Detection
PositiveArtificial Intelligence
The paper titled 'FQ-PETR: Fully Quantized Position Embedding Transformation for Multi-View 3D Object Detection' addresses the challenges of deploying PETR models in autonomous driving due to their high computational costs and memory requirements. It introduces FQ-PETR, a fully quantized framework that aims to enhance efficiency without sacrificing accuracy. Key innovations include a Quantization-Friendly LiDAR-ray Position Embedding and techniques to mitigate accuracy degradation typically associated with quantization methods.
Pre-Attention Expert Prediction and Prefetching for Mixture-of-Experts Large Language Models
PositiveArtificial Intelligence
The paper titled 'Pre-Attention Expert Prediction and Prefetching for Mixture-of-Experts Large Language Models' introduces a method to enhance the efficiency of Mixture-of-Experts (MoE) Large Language Models (LLMs). The authors propose a pre-attention expert prediction technique that improves accuracy and reduces computational overhead by utilizing activations before the attention block. This approach aims to optimize expert prefetching, achieving about a 15% improvement in accuracy over existing methods.
Invisible Triggers, Visible Threats! Road-Style Adversarial Creation Attack for Visual 3D Detection in Autonomous Driving
NeutralArtificial Intelligence
The article discusses advancements in autonomous driving systems that utilize 3D object detection through RGB cameras, which are more cost-effective than LiDAR. Despite their promising detection accuracy, these systems are vulnerable to adversarial attacks. The study introduces AdvRoad, a method to create realistic road-style adversarial posters that can deceive detection systems without being easily noticed. This approach aims to enhance the safety and reliability of autonomous driving technologies.
CATS-V2V: A Real-World Vehicle-to-Vehicle Cooperative Perception Dataset with Complex Adverse Traffic Scenarios
PositiveArtificial Intelligence
The CATS-V2V dataset introduces a pioneering real-world collection for Vehicle-to-Vehicle (V2V) cooperative perception, aimed at enhancing autonomous driving in complex adverse traffic scenarios. Collected using two time-synchronized vehicles, the dataset encompasses 100 clips featuring 60,000 frames of LiDAR point clouds and 1.26 million multi-view camera images across various weather and lighting conditions. This dataset is expected to significantly benefit the autonomous driving community by providing high-quality data for improved perception capabilities.
NTSFormer: A Self-Teaching Graph Transformer for Multimodal Isolated Cold-Start Node Classification
PositiveArtificial Intelligence
The paper titled 'NTSFormer: A Self-Teaching Graph Transformer for Multimodal Isolated Cold-Start Node Classification' addresses the challenges of classifying isolated cold-start nodes in multimodal graphs, which often lack edges and modalities. The proposed Neighbor-to-Self Graph Transformer (NTSFormer) employs a self-teaching paradigm to enhance model capacity by using a cold-start attention mask for dual predictions—one based on the node's own features and another guided by a teacher model. This approach aims to improve classification accuracy in scenarios where traditional methods fall sho…