Towards High-Efficient Object Completion in Cross Modality 3D Detection

Publisher

Université d'Ottawa | University of Ottawa

Creative Commons

Attribution-NonCommercial-NoDerivatives 4.0 International

Abstract

Multimodal detection is an essential task in autonomous driving today. Given the complexity of the scenarios in which self-driving vehicles operate, using LiDAR and images as joint inputs provides a more accurate description of the environment. However, due to inter-modal domain differences, models that process both kinds of data are difficult to design. In this work, we analyze the key factor that has enabled multimodal models to make continuous performance breakthroughs in recent years: object completion (densification). In pure LiDAR-based 3D perception, recent works have demonstrated the importance of object completion: several methods use dedicated modules to densify the point clouds produced by laser scanners, leading to better recall and more accurate results. Pursuing this direction, we present a counter-intuitive perspective: the widely used full-shape completion approach actually leads to a higher error upper bound, especially for faraway objects and small objects such as pedestrians. Based on this observation, we introduce a visible-part completion method that requires only 11.3% of the prediction points that previous methods generate while achieving a lower error upper bound. In this way, we extend object completion from a pure LiDAR-based to a cross-modality setting. To recover a dense representation, we propose a mesh-deformation-based method that augments the point set associated with visible foreground objects. Because our approach focuses only on the visible parts of foreground objects to achieve accurate 3D detection, we name our method What You See Is What You Detect (WYSIWYD). The proposed method is a detector-independent model consisting of two parts: an Intra-Frustum Segmentation Transformer (IFST) and a Mesh Depth Completion Network (MDCNet) that predicts foreground depth from mesh deformation. Our model therefore does not require the time-consuming full-depth completion step used by most pseudo-LiDAR-based methods. Our experimental evaluation shows that our approach provides up to 12.2% performance improvement over most public baseline models on the KITTI and nuScenes datasets, bringing the state of the art to a new level.
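
To make the data flow concrete, below is a minimal PyTorch-style sketch of the detector-independent pipeline the abstract describes: an intra-frustum segmenter scores LiDAR points inside an image frustum as foreground, a mesh-deformation module densifies the visible surface, and the augmented point cloud is handed to any off-the-shelf 3D detector. All class names, layer sizes, and the centroid-based mesh placement are illustrative assumptions, not the thesis's actual IFST/MDCNet implementations.

```python
# Illustrative sketch only: hypothetical stand-ins for IFST and MDCNet.
import torch
import torch.nn as nn


class IntraFrustumSegmenter(nn.Module):
    """Stand-in for IFST: scores LiDAR points inside a 2D-box frustum as
    foreground/background with a small transformer encoder."""

    def __init__(self, d_model: int = 64, nhead: int = 4, num_layers: int = 2):
        super().__init__()
        self.embed = nn.Linear(3, d_model)            # xyz -> token
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, 1)             # per-point foreground logit

    def forward(self, frustum_points: torch.Tensor) -> torch.Tensor:
        # frustum_points: (B, N, 3) points falling inside one image frustum
        tokens = self.encoder(self.embed(frustum_points))
        return self.head(tokens).squeeze(-1)          # (B, N) logits


class MeshDepthCompleter(nn.Module):
    """Stand-in for MDCNet: predicts per-vertex offsets that deform a coarse
    template mesh toward the visible surface, densifying the object."""

    def __init__(self, num_vertices: int = 256, d_hidden: int = 128):
        super().__init__()
        self.template = nn.Parameter(torch.randn(num_vertices, 3) * 0.01)
        self.offset_net = nn.Sequential(
            nn.Linear(3, d_hidden), nn.ReLU(), nn.Linear(d_hidden, 3)
        )

    def forward(self, fg_points: torch.Tensor) -> torch.Tensor:
        # fg_points: (B, M, 3) segmented foreground points; place the template
        # at their centroid (a simplification) and predict vertex offsets.
        center = fg_points.mean(dim=1, keepdim=True)          # (B, 1, 3)
        verts = self.template.unsqueeze(0) + center           # (B, V, 3)
        return verts + self.offset_net(verts)                 # deformed mesh


def densify_frustum(points, segmenter, completer, threshold=0.0):
    """Keep visible foreground points and append deformed-mesh vertices, so an
    unchanged off-the-shelf 3D detector can consume the augmented cloud."""
    logits = segmenter(points)                                # (B, N)
    mask = logits > threshold
    fg = points * mask.unsqueeze(-1)                          # zero out background
    dense = completer(fg)                                     # (B, V, 3)
    return torch.cat([points, dense], dim=1)                  # augmented cloud


if __name__ == "__main__":
    pts = torch.randn(2, 100, 3)                              # toy frustum points
    out = densify_frustum(pts, IntraFrustumSegmenter(), MeshDepthCompleter())
    print(out.shape)                                          # torch.Size([2, 356, 3])
```

Because the densified points are simply concatenated with the raw cloud, the downstream detector needs no architectural changes, which is what the abstract means by "detector-independent."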

Keywords

Cross-modality 3D Object Detection, 3D Object Detection, Depth Prediction
