Abstract:
Sign Language Translation (SLT) is the task of generating a meaningful spoken-language sentence from a sequence of signs. Prior work on SLT focuses on generating sequences of spoken words using various RGB-based and skeleton-based architectures. Extracting robust spatiotemporal features is crucial for generating a meaningful sign translation. Skeleton-based feature extraction has gained traction in recent times, as such methods capture complex human dynamics, handle noisy data efficiently, and extract more distinctive spatiotemporal attributes. In this work, we explore the feasibility of a skeleton-based feature extraction system in the domain of SLT in order to generate more distinctive features. To accomplish this, we improve the estimation accuracy of existing pose estimation models, which are a key component of any skeleton-based feature extraction system. We also explore a sign video segment representation to enhance the detection accuracy of signs within a video and to generate more accurate translations. We assess the performance of our proposed pipeline on the PHOENIX14T dataset, the benchmark dataset in this field. Although our model performs better than some prior work, it does not achieve state-of-the-art results. We also present a performance analysis with respect to sign-segment size and the number of keypoints taken into consideration.
Description:
Supervised by
Dr. Md. Hasanul Kabir,
Professor,
Department of Computer Science and Engineering (CSE),
Islamic University of Technology (IUT)
Board Bazar, Gazipur-1704, Bangladesh.
This thesis is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science and Engineering, 2022.