dc.contributor.author |
Farabi, MD Shafkat Rahman |
|
dc.contributor.author |
Himel, S.M. Hadibul Haque |
|
dc.contributor.author |
Gazzali, Md. Fakhruddin |
|
dc.date.accessioned |
2022-04-16T13:24:45Z |
|
dc.date.available |
2022-04-16T13:24:45Z |
|
dc.date.issued |
2021-03-30 |
|
dc.identifier.uri |
http://hdl.handle.net/123456789/1321 |
|
dc.description |
Supervised by
Prof. Dr. Md. Hasanul Kabir,
Department of Computer Science and Engineering (CSE),
Islamic University of Technology (IUT),
Board Bazar, Gazipur-1704, Bangladesh |
en_US |
dc.description.abstract |
Action quality assessment (AQA) aims to automatically judge a human action from a video of that action and assign it a performance score. Judging the quality of human actions from videos holds great promise for computer vision. This thesis focuses on an improved method for action quality assessment. The majority of existing AQA works transform RGB videos into higher-level representations using a Convolutional 3D (C3D) network, and these representations are then used to assess action quality. Because C3D is relatively shallow, the quality of the extracted features is lower than what a deeper convolutional neural network could provide. Hence, we experiment with deeper convolutional neural networks with residual connections (ResNets) for learning representations for action quality assessment. We assess the effects of network depth and input clip size on the quality of action score predictions, and we examine the effect of using (2+1)D convolutions instead of 3D convolutions for feature extraction. We argue that the current clip-level feature aggregation technique of plain averaging is insufficient to capture the relative importance of features, and we propose a learning-based weighted-averaging technique that performs better. Using a 34-layer (2+1)D convolutional neural network that processes 32-frame clips, together with our proposed aggregation technique, we achieve a new state-of-the-art Spearman's rank correlation of 0.9315 on the MTL-AQA dataset (an improvement of 0.45% over the previous state of the art). |
en_US |
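
The abstract's learning-based weighted averaging of clip-level features can be illustrated with a minimal NumPy sketch: each clip feature gets a relevance score from a learnable vector, scores are normalized with a softmax, and the video-level feature is the weighted sum. The function names and the scoring vector `v` are hypothetical illustrations, not taken from the thesis; in the actual model such a vector would be trained jointly with the score-regression head.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def weighted_average_aggregate(clip_features, v):
    """Aggregate per-clip features into one video-level feature.

    clip_features: (num_clips, feat_dim) clip-level representations
    v:             (feat_dim,) scoring vector (learnable in the thesis;
                   fixed here purely for illustration)
    """
    scores = clip_features @ v      # one relevance score per clip
    weights = softmax(scores)       # normalized clip weights, sum to 1
    return weights @ clip_features  # (feat_dim,) weighted average

# Toy comparison against plain averaging over three clips
feats = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
v = np.array([1.0, -1.0])
agg = weighted_average_aggregate(feats, v)
mean = feats.mean(axis=0)
```

With `v` set to all zeros, every clip gets the same weight and the result reduces to the plain average the thesis improves upon; a trained `v` instead emphasizes the clips most informative for the final score.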
dc.language.iso |
en |
en_US |
dc.publisher |
Department of Computer Science and Engineering (CSE), Islamic University of Technology (IUT), Board Bazar, Gazipur, Bangladesh |
en_US |
dc.subject |
AQA, Computer Vision, Deep Learning |
en_US |
dc.title |
Assessment of Human Actions from Videos Using Deep Residual Neural Networks |
en_US |
dc.type |
Thesis |
en_US |