Title: Enhancing Object Detection with Transformer-Based Adaptive Sensor Fusion
Author: Sadeghian, Reza
Date: 2026-03-03
Type: Thesis
Language: en
URI: http://hdl.handle.net/10393/51425
DOI: https://doi.org/10.20381/ruor-31782
Subjects: 3D Object Detection; Adaptive Sensor Fusion; Transformers
License: Attribution 4.0 International (http://creativecommons.org/licenses/by/4.0/)

Abstract

Achieving reliable perception in dynamic environments while enabling real-time decision-making is critical for the practical deployment of autonomous vehicles. The objective of this research is to enhance the accuracy, robustness, and computational efficiency of object detection systems for autonomous driving. To address the need for efficient, low-latency perception, we first developed TransfuseNet, a lightweight LiDAR-camera fusion network designed for 2D object detection. TransfuseNet improves computational efficiency by leveraging self-attention mechanisms for mid-level feature fusion and by introducing a Multi-Convolutional Fusion (MCF) operator that prioritizes essential features. With its compact architecture and reduced resource consumption, TransfuseNet achieves inference latency below 40 ms, making it well suited for real-time applications where rapid action is required.

However, while TransfuseNet effectively balances accuracy and efficiency, it does not explicitly account for variations in sensor reliability or provide mechanisms to adapt to degraded sensor inputs. To overcome these limitations, we introduced ReliFusion, a reliability-focused LiDAR-camera fusion framework for 3D object detection. ReliFusion integrates LiDAR and camera data for enhanced perception and dynamically adjusts each sensor's contribution based on real-time reliability assessments. Unlike conventional fusion strategies that assume all modalities are equally reliable, ReliFusion incorporates adaptive mechanisms to ensure robustness under sensor degradation, occlusions, and environmental challenges. It integrates a Spatio-Temporal Feature Aggregation (STFA) module to improve temporal consistency, a Reliability module based on Cross-Modality Contrastive Learning (CMCL) to quantify the trustworthiness of sensor inputs, and a Confidence-Weighted Mutual Cross-Attention (CW-MCA) module to refine fusion weights according to the estimated reliability scores. This adaptive approach enables ReliFusion to maintain stable detection performance even in challenging real-world conditions.

Experimental evaluations on the KITTI and nuScenes datasets demonstrate that both TransfuseNet and ReliFusion achieve improved detection accuracy compared with existing fusion-based methods. While TransfuseNet provides an efficient solution for real-time 2D detection, ReliFusion advances multimodal 3D detection by addressing sensor degradation and incorporating dynamic, reliability-driven fusion strategies. The findings of this research contribute to the design of sensor-fusion-based object detection systems that enhance multimodal perception in autonomous vehicles by addressing key challenges such as sensor degradation, occlusions, and dynamic environmental conditions.
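To make the MCF idea concrete, the sketch below shows one plausible form of a multi-convolutional fusion operator: parallel convolutions at several kernel sizes summed together, followed by a learned channel gate that emphasizes informative features. The class name, kernel sizes, and gating design are illustrative assumptions, not the thesis implementation.

```python
# Hypothetical sketch of a multi-convolutional fusion operator: parallel
# convolutions at several receptive fields, followed by a squeeze-and-
# excitation-style gate that re-weights channels so salient features
# dominate. All design choices here are illustrative assumptions.
import torch
import torch.nn as nn


class MultiConvFusion(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # Parallel branches with different kernel sizes capture features
        # at multiple scales from the concatenated camera+LiDAR maps.
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, k, padding=k // 2) for k in (1, 3, 5)
        ])
        # Channel gate: global pooling -> bottleneck MLP -> sigmoid weights.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(out_ch, out_ch // 4, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch // 4, out_ch, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        fused = sum(branch(x) for branch in self.branches)
        return fused * self.gate(fused)  # emphasize informative channels


# Example: fuse a 512-channel concatenated camera+LiDAR feature map.
fuse = MultiConvFusion(in_ch=512, out_ch=256)
y = fuse(torch.randn(1, 512, 32, 32))  # -> (1, 256, 32, 32)
```

In this reading, the channel gate is the piece doing the "prioritization": it learns to suppress channels that carry little signal for the detection task.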
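The Reliability module is described as building on cross-modality contrastive learning. Below is a minimal sketch, assuming an InfoNCE-style symmetric objective over paired camera/LiDAR scene embeddings; the function name, pooling to per-scene vectors, and temperature value are assumptions.

```python
# Illustrative InfoNCE-style objective for cross-modality contrastive
# learning: paired camera/LiDAR embeddings of the same scene are pulled
# together, mismatched pairs pushed apart. The temperature and the
# symmetric two-direction loss are assumptions, not taken from the thesis.
import torch
import torch.nn.functional as F


def cmcl_loss(cam_emb, lidar_emb, temperature: float = 0.07):
    # cam_emb, lidar_emb: (B, D) pooled embeddings from the two modalities
    cam = F.normalize(cam_emb, dim=-1)
    lid = F.normalize(lidar_emb, dim=-1)
    logits = cam @ lid.t() / temperature  # (B, B) cross-modal similarities
    targets = torch.arange(cam.size(0), device=cam.device)
    # Symmetric loss: match camera->LiDAR and LiDAR->camera
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```

At inference time, the cosine similarity between a camera embedding and its paired LiDAR embedding could serve as a rough agreement signal: a drop in cross-modal agreement suggests that one modality has degraded.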
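Finally, a minimal sketch of confidence-weighted mutual cross-attention, assuming per-modality reliability scores in [0, 1] supplied by a separate reliability module; the residual formulation, tensor shapes, and class name are assumptions for illustration, not the thesis implementation.

```python
# Minimal sketch of confidence-weighted mutual cross-attention: each
# modality attends to the other, and the cross-attended stream is scaled
# by the reliability of the modality it was read from, so a degraded
# sensor contributes less to the fused representation.
import torch
import torch.nn as nn


class ConfidenceWeightedMutualCrossAttention(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Camera queries attend to LiDAR keys/values, and vice versa.
        self.cam_to_lidar = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.lidar_to_cam = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, cam_feat, lidar_feat, r_cam, r_lidar):
        # cam_feat: (B, N_c, dim), lidar_feat: (B, N_l, dim)
        # r_cam, r_lidar: (B, 1, 1) reliability scores in [0, 1]
        cam_enh, _ = self.cam_to_lidar(cam_feat, lidar_feat, lidar_feat)
        lidar_enh, _ = self.lidar_to_cam(lidar_feat, cam_feat, cam_feat)
        # Residual fusion, down-weighting information drawn from the
        # less reliable modality.
        cam_out = cam_feat + r_lidar * cam_enh
        lidar_out = lidar_feat + r_cam * lidar_enh
        return cam_out, lidar_out


# Example: camera judged reliable, LiDAR degraded (e.g., heavy rain).
fusion = ConfidenceWeightedMutualCrossAttention(dim=256, num_heads=8)
cam = torch.randn(2, 100, 256)
lidar = torch.randn(2, 200, 256)
cam_out, lidar_out = fusion(cam, lidar,
                            r_cam=torch.full((2, 1, 1), 0.9),
                            r_lidar=torch.full((2, 1, 1), 0.3))
```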