Joint Pose Based Sign Language Translation Using Graph Convolutional Network

dc.contributor.author Rashid, Md. Safirur
dc.contributor.author Shafique, Bhuiyan Sanjid
dc.contributor.author Mostahid, Tauhid
dc.date.accessioned 2023-04-03T07:49:24Z
dc.date.available 2023-04-03T07:49:24Z
dc.date.issued 2022-05-31
dc.identifier.uri http://hdl.handle.net/123456789/1806
dc.description Supervised by Dr. Md. Hasanul Kabir, Professor, Department of Computer Science and Engineering (CSE), Islamic University of Technology (IUT), Board Bazar, Gazipur-1704, Bangladesh. This thesis is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science and Engineering, 2022. en_US
dc.description.abstract Sign Language Translation (SLT) is the task of generating a meaningful spoken-language sentence from a sequence of signs. Prior work on SLT has focused on generating sequences of spoken words using RGB-based and, in some cases, skeletal-based architectures. Extracting robust spatiotemporal features is essential for generating meaningful translations. Skeletal feature extraction has gained traction in recent years because it captures complex human dynamics, discards noisy data efficiently, and yields more distinctive spatiotemporal attributes. In this work, we explore the feasibility of a skeletal-based feature extraction system for SLT in order to generate more distinctive features. To that end, we improve the estimation accuracy of particular existing pose estimation models, which are a key component of any skeletal-based feature extraction system. We also explore a sign video segment representation to enhance the detection accuracy of signs within a video and to generate more accurate translations. We evaluate our proposed pipeline on the PHOENIX14T dataset, the benchmark in this field. Although our model outperforms some prior work, it does not achieve state-of-the-art results. We also report performance analyses with respect to sign-segmentation size and the number of keypoints taken into consideration. en_US
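To make the skeletal pipeline described in the abstract concrete, below is a minimal illustrative sketch in PyTorch of how one video frame's 2D pose keypoints could be encoded with graph convolutions of the standard form H' = ReLU(A_hat H W). This is a sketch under assumptions, not the thesis's actual architecture: the keypoint count, the chain-shaped skeleton, and all identifiers (GCNLayer, normalized_adjacency, SkeletonGCN-style encoder) are hypothetical placeholders.

import torch
import torch.nn as nn

NUM_KEYPOINTS = 27   # hypothetical reduced upper-body + hand keypoint set
FEAT_IN = 2          # (x, y) coordinates per keypoint
FEAT_HIDDEN = 64

class GCNLayer(nn.Module):
    """One graph-convolution step: H' = ReLU(A_hat @ H @ W)."""
    def __init__(self, in_dim, out_dim, a_hat):
        super().__init__()
        self.register_buffer("a_hat", a_hat)  # normalized adjacency, fixed
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, h):
        # h: (batch, num_keypoints, in_dim)
        return torch.relu(self.a_hat @ self.linear(h))

def normalized_adjacency(edges, n):
    """Build A_hat = D^{-1/2} (A + I) D^{-1/2} from an edge list."""
    a = torch.eye(n)                     # self-loops (the +I term)
    for i, j in edges:
        a[i, j] = a[j, i] = 1.0          # undirected skeleton bones
    d = a.sum(dim=1).rsqrt()             # D^{-1/2} as a vector
    return d.unsqueeze(0) * a * d.unsqueeze(1)

# Hypothetical skeleton: chain the joints together for demonstration only;
# a real sign-language skeleton graph would follow the body/hand topology.
edges = [(i, i + 1) for i in range(NUM_KEYPOINTS - 1)]
a_hat = normalized_adjacency(edges, NUM_KEYPOINTS)

encoder = nn.Sequential(
    GCNLayer(FEAT_IN, FEAT_HIDDEN, a_hat),
    GCNLayer(FEAT_HIDDEN, FEAT_HIDDEN, a_hat),
)

# One frame's 2D keypoints -> a per-frame feature vector.
frame_keypoints = torch.randn(1, NUM_KEYPOINTS, FEAT_IN)
frame_feature = encoder(frame_keypoints).mean(dim=1)  # shape (1, 64)

In a full SLT system, per-frame (or per-segment) features produced this way would feed a sequence-to-sequence decoder, such as a transformer, that emits the spoken-language sentence.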
dc.language.iso en en_US
dc.publisher Department of Computer Science and Engineering (CSE), Islamic University of Technology (IUT), Board Bazar, Gazipur, Bangladesh en_US
dc.subject SLT, SL-GCN, Skeletal, SAMSLR, Joint, Keypoints en_US
dc.title Joint Pose Based Sign Language Translation Using Graph Convolutional Network en_US
dc.type Thesis en_US

