Author: Zhang, Yan
Date issued: 2026-01-06
Handle: http://hdl.handle.net/10393/51223
DOI: https://doi.org/10.20381/ruor-31646
Title: Towards Generalizable Few-Shot Object Detection via Enhanced Representation Learning
Type: Thesis
Language: en
License: Attribution-ShareAlike 4.0 International (http://creativecommons.org/licenses/by-sa/4.0/)
Subjects: Computer Vision; Object Detection; Few-Shot Learning

Abstract:
Few-shot object detection (FSOD), which aims to detect novel categories with minimal training examples, faces significant challenges in learning robust feature representations due to severe data scarcity. FSOD models also often struggle to distinguish objects from visually ambiguous backgrounds, which restricts their generalization capability. We propose a novel FSOD framework that addresses these challenges through two key innovations. First, we introduce Wavelet-Semantic Fusion Attention (WSFA), which enhances semantic ViT features by incorporating frequency-domain information obtained via the discrete wavelet transform, providing complementary edge and texture cues through cross-modal attention. Second, we propose a Learnable Background Prototype (LBP) that explicitly models background patterns, significantly improving foreground-background discrimination. These contributions are integrated into a unified single-stage transformer-based detection framework with inter-class contrastive learning. Comprehensive experiments on standard FSOD benchmarks (PASCAL VOC and MS COCO) demonstrate that our method achieves consistent improvements over strong baselines and outperforms existing state-of-the-art approaches. This work provides a practical solution for scenarios with limited annotated data, broadening the applicability of object detection in real-world settings.
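The abstract gives no implementation details, but the WSFA idea it describes can be illustrated with a small sketch. Below is a minimal, self-contained PyTorch example, assuming a single-level Haar DWT and standard multi-head cross-attention in which semantic ViT tokens query wavelet-domain tokens; all module names, shapes, and hyperparameters are illustrative assumptions, not the thesis implementation.

```python
# Hedged sketch of the Wavelet-Semantic Fusion Attention (WSFA) idea from the
# abstract. The Haar DWT choice and all shapes here are assumptions.
import torch
import torch.nn as nn

def haar_dwt(x):
    """Single-level 2D Haar DWT. x: (B, C, H, W) with even H and W.
    Returns the low-frequency LL subband and the stacked high-frequency
    (LH, HL, HH) subbands that carry edge/texture cues."""
    a = x[:, :, 0::2, 0::2]  # top-left of each 2x2 block
    b = x[:, :, 0::2, 1::2]  # top-right
    c = x[:, :, 1::2, 0::2]  # bottom-left
    d = x[:, :, 1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2  # approximation (low-pass)
    lh = (a - b + c - d) / 2  # horizontal detail
    hl = (a + b - c - d) / 2  # vertical detail
    hh = (a - b - c + d) / 2  # diagonal detail
    return ll, torch.cat([lh, hl, hh], dim=1)

class WSFA(nn.Module):
    """Fuses semantic ViT tokens with frequency-domain tokens via
    cross-modal attention: semantic tokens query the wavelet subbands."""
    def __init__(self, dim, in_ch=3, heads=8):
        super().__init__()
        self.freq_proj = nn.Linear(3 * in_ch, dim)  # embed high-freq subbands
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vit_tokens, image):
        # vit_tokens: (B, N, dim) semantic features; image: (B, C, H, W)
        _, high = haar_dwt(image)                      # (B, 3C, H/2, W/2)
        freq_tokens = high.flatten(2).transpose(1, 2)  # (B, HW/4, 3C)
        freq_tokens = self.freq_proj(freq_tokens)      # (B, HW/4, dim)
        # Cross-modal attention: semantic queries, wavelet keys/values.
        fused, _ = self.attn(vit_tokens, freq_tokens, freq_tokens)
        return self.norm(vit_tokens + fused)           # residual fusion

# Shape check on random data:
m = WSFA(dim=256)
out = m(torch.randn(2, 196, 256), torch.randn(2, 3, 224, 224))
print(out.shape)  # torch.Size([2, 196, 256])
```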
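The Learnable Background Prototype can likewise be sketched as a trainable embedding that competes with class prototypes in a similarity-based classifier, so ambiguous regions can be explicitly absorbed by the background slot. This is again a hedged sketch: the cosine formulation, temperature, and dimensions are assumptions for illustration.

```python
# Hedged sketch of the Learnable Background Prototype (LBP) idea: a trainable
# background embedding competes with class prototypes for each region feature.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeClassifier(nn.Module):
    def __init__(self, dim, num_classes, tau=0.1):
        super().__init__()
        self.class_protos = nn.Parameter(torch.randn(num_classes, dim))
        self.bg_proto = nn.Parameter(torch.randn(1, dim))  # learnable background prototype
        self.tau = tau  # temperature for cosine logits

    def forward(self, feats):
        # feats: (B, N, dim) region/token features
        protos = torch.cat([self.bg_proto, self.class_protos], dim=0)   # (1+K, dim)
        sim = F.normalize(feats, dim=-1) @ F.normalize(protos, dim=-1).T
        return sim / self.tau  # logits over [background, class_1 .. class_K]

clf = PrototypeClassifier(dim=256, num_classes=20)
logits = clf(torch.randn(2, 196, 256))
print(logits.shape)  # torch.Size([2, 196, 21])
```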
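The abstract also mentions inter-class contrastive learning without specifying the loss. A generic supervised contrastive objective of the kind commonly used for proposal embeddings is sketched below; the exact formulation in the thesis may differ.

```python
# Hedged sketch of an inter-class contrastive objective: pull same-class
# embeddings together, push different classes apart. Generic formulation,
# not the thesis loss.
import torch
import torch.nn.functional as F

def inter_class_contrastive_loss(embeds, labels, tau=0.2):
    """embeds: (N, dim) features; labels: (N,) class ids."""
    z = F.normalize(embeds, dim=-1)
    sim = z @ z.T / tau                                # (N, N) similarity logits
    mask_self = torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos = labels.unsqueeze(0) == labels.unsqueeze(1)   # same-class pairs
    pos = pos & ~mask_self
    # log-softmax over all non-self pairs, averaged over positives per anchor
    logits = sim.masked_fill(mask_self, float('-inf'))
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos_count = pos.sum(1).clamp(min=1)
    loss = -(log_prob * pos).sum(1) / pos_count
    return loss[pos.sum(1) > 0].mean()  # skip anchors with no positive pair

loss = inter_class_contrastive_loss(
    torch.randn(8, 128), torch.tensor([0, 0, 1, 1, 2, 2, 3, 3]))
print(loss.item())
```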