Towards High-Efficient Object Completion in Cross Modality 3D Detection
Publisher: Université d'Ottawa | University of Ottawa
Abstract
Multimodal detection is an essential task in autonomous driving today. Given the complexity of the scenarios in which self-driving vehicles operate, using LiDAR and images as joint inputs provides a more accurate description of the environment. However, because of inter-modal domain differences, models that process these data are difficult to design. In this work, we analyze the key factor that has enabled multimodal models to make continuous performance breakthroughs in recent years: object completion (densification).
In pure LiDAR-based 3D perception, recent works have indicated the importance of object completion. Several methods have been proposed in which dedicated modules densify the point clouds produced by laser scanners, leading to better recall and more accurate results. Pursuing this direction, we present in this work a counter-intuitive perspective: the widely used full-shape completion approach actually leads to a higher error upper bound, especially for faraway objects and small objects such as pedestrians. Based on this observation, we introduce a visible-part completion method that requires only 11.3% of the prediction points that previous methods generate while achieving a lower error upper bound. In this way, we extend object completion from a pure LiDAR-based setting to a cross-modality one.
To recover a dense representation, we propose a mesh-deformation-based method to augment the point set associated with visible foreground objects. Because our approach focuses only on the visible part of foreground objects to achieve accurate 3D detection, we name our method What You See Is What You Detect (WYSIWYD). The proposed method is a detector-independent model consisting of two parts: an Intra-Frustum Segmentation Transformer (IFST) and a Mesh Depth Completion Network (MDCNet) that predicts foreground depth from mesh deformation. As a result, our model does not require the time-consuming full-depth completion task used by most pseudo-LiDAR-based methods. Our experimental evaluation shows that the approach provides up to 12.2% performance improvement over most public baseline models on the KITTI and nuScenes datasets, bringing the state of the art to a new level.
Keywords
Cross-modality 3D Object Detection, 3D Object Detection, Depth Prediction
