Title: Multimodal Emotion Recognition Using Physiological Signals
Author: Dhothar, Mehakdeep Kaur
Date: 2025-11-19
Type: Thesis
Language: en
URI: http://hdl.handle.net/10393/51064
DOI: https://doi.org/10.20381/ruor-31529
License: Attribution-NonCommercial 4.0 International (http://creativecommons.org/licenses/by-nc/4.0/)
Subjects: Machine Learning; Affect Recognition; Multimodal Machine Learning; Affective Computing; Emotion Recognition

Abstract:
Affective computing aims to develop systems capable of recognizing and interpreting human emotions, yet existing multimodal datasets frequently suffer from limitations such as poor signal quality, high inter-subject variability, and inconsistent evaluation protocols. To address these gaps, this thesis develops and validates a comprehensive framework for multimodal emotion recognition using physiological signals, namely electrocardiogram (ECG), electrodermal activity (EDA), and respiration (RSP), augmented with speech-based representations. The goal was to establish standardized preprocessing workflows, rigorous signal quality assessment (SQA), and reproducible baseline experiments to support the development and technical validation of a large-scale physiological dataset. The framework was applied to a dataset collected from 99 participants, containing synchronized physiological recordings, speech responses, and self-reported emotional annotations during exposure to validated video stimuli. To ensure data integrity, an SQA and artifact-removal pipeline was applied across modalities, integrating established ECG and respiration metrics with newly designed EDA-specific indicators. Using this refined dataset, multiple emotion-classification experiments were conducted under a strict subject-independent evaluation protocol, comparing fixed 30-second windows with emotion-triggered temporal segments. Across all tasks (binary arousal, binary valence, and multiclass emotion recognition), trigger-based segments consistently produced clearer and more discriminative physiological patterns. Random Forest achieved the strongest overall performance, including 78.8% multiclass accuracy using physiological features alone. To explore multimodal enhancement, speech embeddings were fused with handcrafted physiological features. This early-fusion approach led to substantial improvements across all tasks, most notably increasing multiclass accuracy from 78.8% to 97% on trigger-based segments. These findings demonstrate that speech provides complementary affective information that enhances physiological representations. A subject-wise evaluation was also conducted to examine emotion separability across individuals and to identify video-specific misclassification patterns, revealing how different stimuli elicit varying physiological responses. Overall, this thesis delivers a validated multimodal dataset, reproducible processing pipelines, and strong baseline benchmarks, providing a solid foundation for future research in physiological and multimodal emotion recognition.
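
The abstract describes a strict subject-independent evaluation protocol with a Random Forest classifier. The following is a minimal sketch of one common way to implement such a protocol (leave-one-subject-out cross-validation with scikit-learn); the function name, feature dimensions, and synthetic data are illustrative assumptions, not the thesis's actual code.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import LeaveOneGroupOut

def subject_independent_accuracy(X, y, subject_ids, n_estimators=200, random_state=42):
    """Leave-one-subject-out evaluation: each fold holds out every segment
    from one participant, so no subject appears in both train and test."""
    logo = LeaveOneGroupOut()
    fold_acc = []
    for train_idx, test_idx in logo.split(X, y, groups=subject_ids):
        clf = RandomForestClassifier(n_estimators=n_estimators, random_state=random_state)
        clf.fit(X[train_idx], y[train_idx])
        fold_acc.append(accuracy_score(y[test_idx], clf.predict(X[test_idx])))
    return float(np.mean(fold_acc))

# Toy usage with synthetic placeholders for handcrafted ECG/EDA/RSP features.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 24))            # one feature vector per segment
y = rng.integers(0, 4, size=300)          # multiclass emotion labels
subjects = np.repeat(np.arange(30), 10)   # 30 participants, 10 segments each
print(subject_independent_accuracy(X, y, subjects))
```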
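The multimodal experiments fuse speech embeddings with handcrafted physiological features before classification. A minimal sketch of that kind of early fusion (simple feature concatenation) is shown below; the array shapes, embedding dimensionality, and synthetic data are assumptions for illustration only, and evaluation should follow the subject-independent split sketched above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def early_fusion(phys_feats, speech_embs):
    """Early fusion: concatenate per-segment physiological features
    and speech embeddings into a single feature vector per segment."""
    return np.hstack([phys_feats, speech_embs])

# Toy usage with synthetic placeholders for the real features and embeddings.
rng = np.random.default_rng(1)
phys = rng.normal(size=(300, 24))     # handcrafted ECG/EDA/RSP features per segment
speech = rng.normal(size=(300, 768))  # e.g., one pooled speech embedding per segment
labels = rng.integers(0, 4, size=300)

clf = RandomForestClassifier(n_estimators=300, random_state=42)
clf.fit(early_fusion(phys, speech), labels)
```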