Capturing Spectral and Long-term Contextual Information for Speech Emotion Recognition Using Deep Learning Techniques


dc.contributor.author Haque, Md. Maksudul
dc.contributor.author Islam, Samiul
dc.contributor.author Sadat, Abu Jobayer Md.
dc.date.accessioned 2024-09-04T10:29:20Z
dc.date.available 2024-09-04T10:29:20Z
dc.date.issued 2023-05-30
dc.identifier.uri http://hdl.handle.net/123456789/2156
dc.description Supervised by Dr. Hasan Mahmud, Associate Professor; Mr. Fardin Saad, Lecturer; and Dr. Md. Kamrul Hasan, Professor, Department of Computer Science and Engineering (CSE), Islamic University of Technology (IUT), Board Bazar, Gazipur-1704, Bangladesh en_US
dc.description.abstract Traditional approaches to speech emotion recognition, such as LSTM, CNN, RNN, SVM, and MLP models, struggle to capture long-term dependencies in sequential data, to model temporal dynamics, and to represent complex patterns and relationships in multimodal data. This research addresses these shortcomings by proposing an ensemble model that combines Graph Convolutional Networks (GCN) for processing textual data with the HuBERT transformer for analyzing audio signals. We found that GCNs excel at capturing long-term contextual dependencies and relationships within textual data by leveraging graph-based representations of text, thereby detecting the contextual meaning and semantic relationships between words. HuBERT, in turn, uses self-attention to capture long-range dependencies, enabling it to model the temporal dynamics of speech and to capture the subtle nuances and variations that contribute to emotion recognition. By combining GCN and HuBERT, our ensemble model leverages the strengths of both approaches: the two modalities are analyzed simultaneously, and their fusion extracts complementary information that enhances the discriminative power of the emotion recognition system. The results indicate that the combined model overcomes the limitations of traditional methods, leading to improved accuracy in recognizing emotions from speech. en_US
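The score-level fusion described in the abstract can be illustrated with a minimal sketch, assuming PyTorch, torch_geometric, and Hugging Face transformers as the toolchain. The checkpoint name, graph construction, hidden sizes, class count, and fusion weight alpha below are illustrative assumptions, not the thesis's exact configuration.

```python
# Illustrative sketch: score-level fusion of a HuBERT audio branch and a GCN
# text branch. All hyperparameters here are assumptions for demonstration.
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv
from transformers import HubertForSequenceClassification

NUM_EMOTIONS = 8  # assumed class count (e.g. the RAVDESS emotion labels)

class TextGCN(torch.nn.Module):
    """Two-layer GCN over a word graph; node features are word embeddings."""
    def __init__(self, in_dim=300, hidden=128, num_classes=NUM_EMOTIONS):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden)
        self.conv2 = GCNConv(hidden, num_classes)

    def forward(self, x, edge_index):
        # x: [num_words, in_dim], edge_index: [2, num_edges]
        h = F.relu(self.conv1(x, edge_index))
        h = self.conv2(h, edge_index)
        # Mean-pool word-node scores into one utterance-level score vector.
        return h.mean(dim=0, keepdim=True)  # [1, num_classes]

# Classification head is randomly initialized here; in practice it would be
# fine-tuned on the emotion corpus before fusion.
audio_model = HubertForSequenceClassification.from_pretrained(
    "facebook/hubert-base-ls960", num_labels=NUM_EMOTIONS)
text_model = TextGCN()

def predict(waveform, word_embeddings, edge_index, alpha=0.5):
    """Score-level fusion: weighted average of the branches' softmax scores."""
    with torch.no_grad():
        audio_logits = audio_model(input_values=waveform).logits  # [1, C]
        text_logits = text_model(word_embeddings, edge_index)     # [1, C]
    probs = (alpha * audio_logits.softmax(-1)
             + (1 - alpha) * text_logits.softmax(-1))
    return probs.argmax(-1)  # predicted emotion index
```

In this sketch each branch produces its own class-score vector and only the softmax scores are combined, which is what distinguishes score-level fusion from feature-level fusion of the two modalities.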
dc.language.iso en en_US
dc.publisher Department of Computer Science and Engineering (CSE), Islamic University of Technology (IUT), Board Bazar, Gazipur-1704, Bangladesh en_US
dc.subject Speech Emotion Recognition (SER); Score-level Fusion; Self-supervised Learning; Hidden Unit Bidirectional Encoder Representations from Transformers (HuBERT); Graph Convolutional Network en_US
dc.title Capturing Spectral and Long-term Contextual Information for Speech Emotion Recognition Using Deep Learning Techniques en_US
dc.type Thesis en_US

