Scene and Graph Domain Adaptation to Tackle Domain Shifts in Real-world Image Semantic Segmentation and Human Activity Recognition
Publisher
Université d'Ottawa | University of Ottawa
Abstract
Deep learning has transformed computer vision by automating feature extraction and representation learning across multiple data modalities, surpassing traditional paradigms that relied on manual feature engineering. However, deep learning models deployed in real-world applications often generalize poorly due to domain shifts between the training dataset (source domain) and the deployment environment (target domain). When the source domain distribution is misaligned with, or covers only a subset of, the target domain, the model's performance tends to degrade on the target domain, limiting its scalability in many real-world applications.
The research in this thesis addresses the domain-shift challenge through the development of novel Domain Adaptation (DA) frameworks. Specifically, DA produces adaptive models that can be trained on the source domain and then adapted to a target domain where data is scarce. Moreover, unlike previous DA research that focuses on a single type of computer vision application, this thesis develops individual DA frameworks for two different applications, namely RGB image semantic segmentation and skeleton-based video human action analysis, detailed as follows.
1) The first proposed DA framework, named Enhanced Scene Domain Adaptation (ESDA), focuses on mitigating scene-based domain shifts in semantic segmentation of street images. Street scene understanding is critical for autonomous driving and robotic navigation, but outdoor scenes often exhibit scene-based visual discrepancies caused by variations in lighting, weather, and regional characteristics. ESDA therefore introduces three novel DA mechanisms: pixel-wise adversarial adaptation, prototypical knowledge adaptation, and a target-specific adaptation classifier, which together enable a segmentation model trained on images from one region (e.g., a city) to perform robustly on images from other regions (e.g., other cities).
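To give a concrete flavor of one such mechanism, prototype-based adaptation is commonly built on class prototypes: mean feature vectors per class computed on the labeled source domain, toward which unlabeled target pixel features are pulled. The sketch below is a minimal, hypothetical NumPy illustration of that general idea; the feature shapes, toy data, and function names are assumptions for exposition, not the ESDA implementation.

```python
import numpy as np

def class_prototypes(features, labels, num_classes):
    """Mean feature vector per class over labeled source pixels.

    features: (N, D) pixel features; labels: (N,) class ids.
    """
    return np.stack([features[labels == c].mean(axis=0)
                     for c in range(num_classes)])

def prototype_pseudo_labels(features, prototypes):
    """Assign each unlabeled target pixel to its nearest prototype."""
    # (N, C) squared distances between pixel features and prototypes
    d = ((features[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)

def alignment_loss(features, prototypes, pseudo_labels):
    """Mean squared distance of each pixel to its assigned prototype;
    minimizing this pulls target features toward source class centers."""
    diff = features - prototypes[pseudo_labels]
    return float((diff ** 2).sum(axis=-1).mean())

# Toy example: two well-separated classes in a 2-D feature space.
src_feats = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]])
src_labels = np.array([0, 0, 1, 1])
protos = class_prototypes(src_feats, src_labels, num_classes=2)

tgt_feats = np.array([[0.5, 0.1], [4.8, 5.1]])
pseudo = prototype_pseudo_labels(tgt_feats, protos)
print(pseudo)                                    # → [0 1]
print(alignment_loss(tgt_feats, protos, pseudo)) # → 0.135
```

In a real segmentation network the features would come from the encoder and the loss would be backpropagated; here the loss is only computed to show the alignment objective.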
2) The second proposed framework, named Enhanced Graph Domain Adaptation (EGDA), addresses graph domain shifts in skeleton-video-based human activity analysis. Skeleton data has been widely used in real-world video surveillance tasks, where human actions are represented by the trajectories of skeletal joints captured by acquisition systems. Through four novel methods, cross-view adaptation, cross-sensor adaptation, cross-sequence adaptation, and cross-permutation adaptation, EGDA encourages the deep learning model to aggregate domain-invariant action dynamics from skeletal joint trajectories while handling the domain shifts arising from varying camera perspectives, sensor configurations, and sequence lengths or activity ordering.
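As a hedged illustration of the view shift that cross-view adaptation must contend with, a common skeleton preprocessing step removes camera yaw by rotating each pose so a reference bone (e.g., the shoulder line) aligns with a fixed axis. The NumPy sketch below shows that standard normalization; the joint indices, toy pose, and function name are illustrative assumptions, not the EGDA method.

```python
import numpy as np

def view_normalize(joints, left_idx, right_idx):
    """Rotate a 3-D skeleton about the vertical (y) axis so the
    left->right shoulder vector lies along +x. This removes camera
    yaw, so the same pose seen from different viewpoints maps to
    one canonical orientation.

    joints: (J, 3) array of joint coordinates, hip-centered.
    """
    v = joints[right_idx] - joints[left_idx]
    yaw = np.arctan2(v[2], v[0])      # bone angle in the x-z plane
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, 0.0, s],        # rotation about y by +yaw
                  [0.0, 1.0, 0.0],
                  [-s, 0.0, c]])
    return joints @ R.T

# Two views of the same pose, differing only by camera yaw:
theta = 0.7
Ry = np.array([[np.cos(theta), 0.0, np.sin(theta)],
               [0.0, 1.0, 0.0],
               [-np.sin(theta), 0.0, np.cos(theta)]])
pose = np.array([[0.0, 0.0, 0.0],     # hip (origin)
                 [0.2, 0.5, 0.1],     # head
                 [-1.0, 0.3, 0.0],    # left shoulder
                 [1.0, 0.3, 0.0]])    # right shoulder
view_b = pose @ Ry.T                  # same pose, rotated camera

print(np.allclose(view_normalize(pose, 2, 3),
                  view_normalize(view_b, 2, 3)))  # → True
```

Normalization of this kind reduces, but does not eliminate, cross-view discrepancies (e.g., self-occlusion and depth noise still differ per view), which is why learned adaptation on top of it remains necessary.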
To evaluate the ESDA and EGDA frameworks, the research uses existing large-scale benchmarks to mimic various domain-shift situations by pairing a source domain with a target domain. For semantic segmentation, ESDA uses a dataset of synthetic street images as the source domain and evaluates the model on a real-world street image dataset. For human action analysis, EGDA leverages several large-scale skeleton datasets collected in different environments to mimic domain shifts in skeleton data. Experimental results demonstrate that the proposed frameworks effectively alleviate both scene and graph domain shifts across data modalities and improve the adaptability of the baseline deep learning models for image semantic segmentation and skeleton action analysis.
The original contributions of this thesis include a comprehensive investigation of two types of domain shifts commonly encountered in computer vision: scene domain shifts in RGB image semantic segmentation and graph domain shifts in skeleton-based video human action recognition. Novel methodologies are introduced to enhance domain adaptation performance in these computer vision applications. The proposed frameworks enable adaptive deep learning models to be trained on large-scale benchmarks while easily adapting to real-world target domains where labeled data is scarce or unavailable. The original domain adaptation approaches are designed to enhance the base networks without additional computational complexity, ensuring efficient operation on the real-world target domain.
Keywords
Domain Adaptation, Domain Shifts, Image Semantic Segmentation, Human Activity Recognition
