Adjacent-view Transformers for Supervised Surround-view Depth Estimation

arXiv — cs.CV · Thursday, November 13, 2025 at 5:00:00 AM
The recent paper 'Adjacent-view Transformers for Supervised Surround-view Depth Estimation' introduces a new approach to depth estimation, a critical component of 3D perception in robotics and autonomous driving. Traditional methods have relied primarily on front-view cameras, particularly within the KITTI benchmark, which limits their applicability to the multi-camera surround-view rigs used on modern vehicles. The proposed AVT-SSDepth method uses a global-to-local feature extraction module that combines CNN and transformer layers to produce richer representations, and it introduces an adjacent-view attention mechanism that propagates features both within each view and across neighbouring views, improving depth estimation across multiple cameras. Extensive experiments show superior performance over existing state-of-the-art methods on the DDAD and nuScenes datasets, along with strong cross-dataset generalization. This advancement not only addresses previous research limitations but also h…
— via World Pulse Now AI Editorial System
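
To make the adjacent-view attention idea more concrete, the following is a minimal PyTorch sketch of how intra-view and inter-view feature propagation across a ring of surround cameras might look. It is not the authors' implementation: the class name AdjacentViewAttention, the (batch, views, tokens, channels) layout, and the choice of neighbouring views via a ring shift are assumptions made purely for illustration.

```python
# Illustrative sketch only (not the AVT-SSDepth code). All names and the
# ring-shift neighbour selection are assumptions for this example.
import torch
import torch.nn as nn

class AdjacentViewAttention(nn.Module):
    def __init__(self, embed_dim=256, num_heads=8):
        super().__init__()
        # intra-view self-attention over the tokens of a single camera
        self.intra_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        # inter-view cross-attention: queries from one view, keys/values
        # from its left and right neighbours on the surround-view rig
        self.inter_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)

    def forward(self, x):
        # x: (B, V, N, C) -- batch, surround views, tokens per view, channels
        B, V, N, C = x.shape

        # 1) intra-view propagation: standard self-attention per view
        flat = x.reshape(B * V, N, C)
        intra, _ = self.intra_attn(flat, flat, flat)
        x = self.norm1(flat + intra).reshape(B, V, N, C)

        # 2) inter-view propagation: each view attends to its two adjacent
        #    views (the cameras form a ring around the vehicle)
        left = torch.roll(x, shifts=1, dims=1)
        right = torch.roll(x, shifts=-1, dims=1)
        context = torch.cat([left, right], dim=2)           # (B, V, 2N, C)
        q = x.reshape(B * V, N, C)
        kv = context.reshape(B * V, 2 * N, C)
        inter, _ = self.inter_attn(q, kv, kv)
        return self.norm2(q + inter).reshape(B, V, N, C)

# toy usage: 6 surround cameras, 100 tokens per view, 256-dim features
feats = torch.randn(2, 6, 100, 256)
out = AdjacentViewAttention()(feats)
print(out.shape)  # torch.Size([2, 6, 100, 256])
```

The ring shift simply reflects that surround-view rigs arrange cameras around the vehicle, so each view has a natural left and right neighbour with overlapping fields of view; the actual neighbour selection and fusion in the paper may differ.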


Recommended Readings
Adaptive LiDAR Scanning: Harnessing Temporal Cues for Efficient 3D Object Detection via Multi-Modal Fusion
Positive · Artificial Intelligence
The article discusses a novel adaptive LiDAR scanning framework that enhances 3D object detection by utilizing temporal cues from past observations. Traditional LiDAR sensors often perform redundant scans, leading to inefficiencies in data acquisition and power consumption. The proposed method employs a lightweight predictor network to identify regions of interest, significantly reducing unnecessary data collection and improving overall efficiency.
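
To illustrate the kind of lightweight predictor described above, here is a hedged PyTorch sketch (not the paper's implementation) of a small CNN that scores which cells of a bird's-eye-view grid are worth scanning next, given a short history of past occupancy. The names ROIPredictor and bev_history, the three-frame history, and the 0.5 threshold are placeholders chosen for this example.

```python
# Illustrative sketch only: predict regions of interest from past
# observations and restrict the next sweep to them. All names and
# hyperparameters here are assumptions, not the paper's values.
import torch
import torch.nn as nn

class ROIPredictor(nn.Module):
    """Lightweight CNN mapping a stack of past BEV occupancy grids to a
    per-cell probability that the cell is worth scanning now."""
    def __init__(self, history_len=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(history_len, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 1),
        )

    def forward(self, bev_history):
        # bev_history: (B, T, H, W) past occupancy grids
        return torch.sigmoid(self.net(bev_history))  # (B, 1, H, W) ROI heatmap

# toy usage: keep only cells whose predicted relevance exceeds a threshold,
# so the (simulated) sensor can skip redundant regions on the next sweep
predictor = ROIPredictor(history_len=3)
bev_history = torch.rand(1, 3, 128, 128)
roi = predictor(bev_history)
scan_mask = roi > 0.5
print(f"scanning {scan_mask.float().mean().item():.1%} of cells")
```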
Semantic VLM Dataset for Safe Autonomous Driving
Positive · Artificial Intelligence
The CAR-Scenes dataset is a newly released frame-level dataset designed for autonomous driving, facilitating the training and evaluation of vision-language models (VLMs) for scene-level understanding. It comprises 5,192 images sourced from Argoverse 1, Cityscapes, KITTI, and nuScenes, annotated using a comprehensive 28-key category/sub-category knowledge base. The dataset includes over 350 attributes and employs a GPT-4o-assisted vision-language pipeline for annotation, ensuring high-quality data through human verification.
Cam4DOcc: Benchmark for Camera-Only 4D Occupancy Forecasting in Autonomous Driving Applications
Positive · Artificial Intelligence
The article presents Cam4DOcc, a new benchmark for camera-only 4D occupancy forecasting in autonomous driving applications. This benchmark aims to enhance the understanding of how environments change over time, which is vital for safe and reliable autonomous driving. It utilizes multiple publicly available datasets, including nuScenes, nuScenes-Occupancy, and Lyft-Level5, to evaluate the future states of surrounding objects, thereby extending current occupancy estimation techniques that primarily focus on present 3D representations.