Abstract:
"Action quality assessment (AQA) aims to automatically judge a human action from
a video of that action and assign it a performance score. Judging the quality
of human actions from videos holds huge promise for the future of computer vision. This
thesis focuses on developing an improved method for action quality assessment. The
majority of existing work on AQA transforms RGB videos into higher-level
representations using a Convolutional 3D (C3D) network. These higher-level representations
are then used to assess action quality. Due to the relatively shallow nature of
C3D, the quality of the extracted features is lower than what a deeper
convolutional neural network could provide. Hence, we experiment with deeper convolutional
neural networks with residual connections (ResNets) for learning representations for action quality
assessment. We assess the effects of the depth and the input clip size of the convolutional
neural network on the quality of action score predictions. We also look at the effect of using
(2+1)D convolutions instead of 3D convolutions for feature extraction. We argue that the
current practice of aggregating clip-level feature representations by averaging is insufficient to
capture the relative importance of features. To overcome this limitation, we propose a learning-based
weighted-averaging technique that can perform better. We achieve a new state-of-the-art
Spearman’s rank correlation of 0.9315 (an improvement of 0.45% over the previous state of the art)
on the MTL-AQA dataset using a 34-layer (2+1)D convolutional neural network
that processes 32-frame clips, combined with our proposed aggregation technique."
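
The difference between plain averaging and a learned weighted average of clip-level features, as described in the abstract, can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the thesis: the linear scorer, the softmax normalization, and the names mean_aggregate, weighted_aggregate, and scoring_weights are hypothetical and introduced here only for illustration. In a real model the scoring weights would presumably be learned jointly with the score regressor.

```python
import math


def mean_aggregate(clip_features):
    """Plain averaging: every clip contributes equally to the video-level feature."""
    n = len(clip_features)
    dim = len(clip_features[0])
    return [sum(clip[d] for clip in clip_features) / n for d in range(dim)]


def weighted_aggregate(clip_features, scoring_weights):
    """Weighted averaging with per-clip weights.

    Assumption: a linear scorer (scoring_weights) gives each clip a relevance
    score, and a softmax turns the scores into weights that sum to 1.
    """
    scores = [
        sum(w * x for w, x in zip(scoring_weights, clip))
        for clip in clip_features
    ]
    # Subtract the maximum score before exponentiating for numerical stability.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    dim = len(clip_features[0])
    pooled = [
        sum(wt * clip[d] for wt, clip in zip(weights, clip_features))
        for d in range(dim)
    ]
    return pooled, weights


# Toy usage: three clips, each with a 2-dimensional feature vector.
clips = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
video_feature_mean = mean_aggregate(clips)
video_feature_weighted, clip_weights = weighted_aggregate(clips, [1.0, 0.0])
```

With all scoring weights set to zero, every clip receives the same weight and the weighted average reduces to the plain average, so averaging is a special case of this scheme.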
Description:
Supervised by
Prof. Dr. Md. Hasanul Kabir,
Department of Computer Science and Engineering (CSE),
Islamic University of Technology (IUT),
Board Bazar, Gazipur-1704, Bangladesh