Multi-modal Feature Fusion Using Full Sequences for Dynamic Hand Gesture Recognition with Simulated Robotic Arm Control
Date
Authors
Publisher
Université d'Ottawa | University of Ottawa
Abstract
Dynamic hand gesture recognition (DHGR) enables accessible human-robot interaction by interpreting sequential hand movements rather than static poses. Previous DHGR systems focused on the RGB modality alone, ignoring the depth information available in many datasets. This thesis addresses that gap with a multi-modal classifier that preserves temporal integrity. The InceptionV3-LSTM architecture is recreated on a public RGB-depth dataset of six dynamic gestures. Full 40-frame sequences are used with stratified 5-fold cross-validation, applied at the sequence level so that no sequence is split across folds. The feature extraction pipeline fuses visual and landmark features from the RGB and depth modalities in parallel InceptionV3 streams, which feed a stacked LSTM-RNN. The results demonstrate that full-sequence multi-modal training reduces overfitting: validation loss decreases steadily while accuracy exceeds that of RGB-only training. This work contributes a multi-modal DHGR pipeline that is demonstrated in a simulated robotic arm control application.
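
A minimal sketch of the fusion architecture described above, assuming a TensorFlow/Keras implementation. The layer widths, the landmark feature dimension (21 hand landmarks x three coordinates), and the replication of the depth map to three channels are illustrative assumptions, not values taken from the thesis.

    import tensorflow as tf
    from tensorflow.keras import layers, models

    SEQ_LEN = 40                 # full gesture sequence length (from the abstract)
    IMG_SHAPE = (299, 299, 3)    # InceptionV3's default input size
    N_LANDMARKS = 63             # assumed: 21 hand landmarks x (x, y, z)
    N_CLASSES = 6                # six dynamic gestures

    def frame_encoder():
        # ImageNet-pretrained InceptionV3 backbone applied to each frame.
        return tf.keras.applications.InceptionV3(
            include_top=False, weights="imagenet", pooling="avg",
            input_shape=IMG_SHAPE)

    rgb_in = layers.Input((SEQ_LEN, *IMG_SHAPE), name="rgb")
    depth_in = layers.Input((SEQ_LEN, *IMG_SHAPE), name="depth")  # depth replicated to 3 channels
    lm_in = layers.Input((SEQ_LEN, N_LANDMARKS), name="landmarks")

    # Parallel InceptionV3 streams, one per modality, applied frame by frame.
    rgb_feats = layers.TimeDistributed(frame_encoder())(rgb_in)
    depth_feats = layers.TimeDistributed(frame_encoder())(depth_in)

    # Fuse visual features from both modalities with the landmark features.
    fused = layers.Concatenate()([rgb_feats, depth_feats, lm_in])

    # Stacked LSTM-RNN over the full 40-frame sequence.
    x = layers.LSTM(256, return_sequences=True)(fused)
    x = layers.LSTM(128)(x)
    out = layers.Dense(N_CLASSES, activation="softmax")(x)

    model = models.Model([rgb_in, depth_in, lm_in], out)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])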
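
The sequence-level data split can be sketched in the same spirit: each 40-frame sequence is treated as a single unit with one label, so stratified 5-fold cross-validation over sequences guarantees that no sequence's frames leak between training and validation folds. The dataset size and label array below are illustrative placeholders.

    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    n_sequences = 600                              # assumed dataset size
    labels = np.random.randint(0, 6, n_sequences)  # one class label per sequence
    sequence_ids = np.arange(n_sequences)

    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    for fold, (train_idx, val_idx) in enumerate(skf.split(sequence_ids, labels)):
        # Indices select whole sequences; frames are loaded per sequence
        # afterwards, so a sequence never appears in both folds.
        print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} val sequences")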
Description
Keywords
dynamic hand gesture recognition, multi-modal fusion, long short-term memory, full sequence data splitting, RGB modality, depth modality, convolutional neural network
