dc.identifier.citation |
[1] M. Cheng, K. Cai, and M. Li, “Rwf-2000: An open large scale video database for violence detection,” arXiv preprint arXiv:1911.05913, 2019.
[2] E. B. Nievas, O. D. Suarez, G. B. García, and R. Sukthankar, “Violence detection in video using computer vision techniques,” in International Conference on Computer Analysis of Images and Patterns, pp. 332–339, Springer, 2011.
[3] Y. Gao, H. Liu, X. Sun, C. Wang, and Y. Liu, “Violence detection using oriented violent flows,” Image and Vision Computing, vol. 48, pp. 37–41, 2016.
[4] C. Ding, S. Fan, M. Zhu, W. Feng, and B. Jia, “Violence detection in video by using 3d convolutional neural networks,” in International Symposium on Visual Computing, pp. 551–558, Springer, 2014.
[5] B. Peixoto, B. Lavi, P. Bestagini, Z. Dias, and A. Rocha, “Multimodal violence detection in videos,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2957–2961, IEEE, 2020.
[6] Z. Dong, J. Qin, and Y. Wang, “Multi-stream deep networks for person to person violence detection in videos,” in Chinese Conference on Pattern Recognition, pp. 517–531, Springer, 2016.
[7] A. Hanson, K. Pnvr, S. Krishnagopal, and L. Davis, “Bidirectional convolutional lstm for the detection of violence in videos,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018.
[8] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520, 2018.
[9] H. Wang and C. Schmid, “Action recognition with improved trajectories,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 3551–3558, 2013.
[10] J. Carreira and A. Zisserman, “Quo vadis, action recognition? a new model and the kinetics dataset,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308, 2017.
[11] C. Yang, Y. Xu, J. Shi, B. Dai, and B. Zhou, “Temporal pyramid network for action recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 591–600, 2020.
[12] M. Li, S. Chen, X. Chen, Y. Zhang, Y. Wang, and Q. Tian, “Actional-structural graph convolutional networks for skeleton-based action recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3595–3603, 2019.
[13] A. Elkholy, M. E. Hussein, W. Gomaa, D. Damen, and E. Saba, “Efficient and robust skeleton-based quality assessment and abnormality detection in human action performance,” IEEE Journal of Biomedical and Health Informatics, vol. 24, no. 1, pp. 280–291, 2019.
[14] Q. Lei, H.-B. Zhang, J.-X. Du, T.-C. Hsiao, and C.-C. Chen, “Learning effective skeletal representations on rgb video for fine-grained human action quality assessment,” Electronics, vol. 9, no. 4, p. 568, 2020.
[15] T. Liu, R. Zhao, J. Xiao, and K.-M. Lam, “Progressive motion representation distillation with two-branch networks for egocentric activity recognition,” IEEE Signal Processing Letters, vol. 27, pp. 1320–1324, 2020.
[16] T. Senst, V. Eiselein, A. Kuhn, and T. Sikora, “Crowd violence detection using global motion-compensated lagrangian features and scale-sensitive video-level representation,” IEEE Transactions on Information Forensics and Security, vol. 12, pp. 2945–2956, 2017.
[17] D. Chen, H. Wactlar, M.-Y. Chen, C. Gao, A. Bharucha, and A. Hauptmann, “Recognition of aggressive human behavior using binary local motion descriptors,” in Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 5238–5241, IEEE, 2008.
[18] T. Deb, A. Arman, and A. Firoze, “Machine cognition of violence in videos using novel outlier-resistant vlad,” in 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 989–994, 2018.
[19] J. Li, X. Jiang, T. Sun, and K. Xu, “Efficient violence detection using 3d convolutional neural networks,” in 2019 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 1–8, IEEE, 2019.
[20] S. Sudhakaran and O. Lanz, “Learning to detect violent videos using convolutional long short-term memory,” in 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 1–6, IEEE, 2017.
[21] T. Hassner, Y. Itcher, and O. Kliper-Gross, “Violent flows: Real-time detection of violent crowd behavior,” in 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 1–6, IEEE, 2012.
[22] O. Deniz, I. Serrano, G. Bueno, and T.-K. Kim, “Fast violence detection in video,” in 2014 International Conference on Computer Vision Theory and Applications (VISAPP), vol. 2, pp. 478–485, IEEE, 2014.
[23] I. Serrano, O. Deniz, J. L. Espinosa-Aranda, and G. Bueno, “Fight recognition in video using hough forests and 2d convolutional neural network,” IEEE Transactions on Image Processing, vol. 27, no. 10, pp. 4787–4797, 2018.
[24] K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in Advances in Neural Information Processing Systems, pp. 568–576, 2014.
[25] Q. Dai, R.-W. Zhao, Z. Wu, X. Wang, Z. Gu, W. Wu, and Y.-G. Jiang, “Fudan-huawei at mediaeval 2015: Detecting violent scenes and affective impact in movies with deep learning,” in MediaEval, 2015.
[26] X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo, “Convolutional lstm network: A machine learning approach for precipitation nowcasting,” in Advances in Neural Information Processing Systems, pp. 802–810, 2015.
[27] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten, “Densely connected convolutional networks,” arXiv preprint arXiv:1608.06993, 2016.
[28] P. Wu, J. Liu, Y. Shi, Y. Sun, F. Shao, Z. Wu, and Z. Yang, “Not only look, but also listen: Learning multimodal violence detection under weak supervision,” in European Conference on Computer Vision, pp. 322–339, Springer, 2020.
[29] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” CoRR, vol. abs/1704.04861, 2017.
[30] A. Pfeuffer and K. Dietmayer, “Separable convolutional lstms for faster video segmentation,” in 2019 IEEE Intelligent Transportation Systems Conference (ITSC), pp. 1072–1078, IEEE, 2019.
[31] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.
[32] B. Xu, N. Wang, T. Chen, and M. Li, “Empirical evaluation of rectified activations in convolutional network,” arXiv preprint arXiv:1505.00853, 2015.
[33] X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo, “Convolutional lstm network: A machine learning approach for precipitation nowcasting,” arXiv preprint arXiv:1506.04214, 2015.
[34] L. Perez and J. Wang, “The effectiveness of data augmentation in image classification using deep learning,” arXiv preprint arXiv:1712.04621, 2017.
[35] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015. Software available from tensorflow.org.
[36] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256, 2010.
[37] S. J. Reddi, S. Kale, and S. Kumar, “On the convergence of adam and beyond,” arXiv preprint arXiv:1904.09237, 2019.
[38] P. Bilinski and F. Bremond, “Human violence recognition and detection in surveillance videos,” in 2016 13th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 30–36, IEEE, 2016.
[39] P. Zhou, Q. Ding, H. Luo, and X. Hou, “Violent interaction detection in video based on deep learning,” Journal of Physics: Conference Series, vol. 844, p. 012044, 2017.
[40] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, “Large-scale video classification with convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1725–1732, 2014.
[41] K. Soomro, A. R. Zamir, and M. Shah, “Ucf101: A dataset of 101 human actions classes from videos in the wild,” arXiv preprint arXiv:1212.0402, 2012.
[42] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[43] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137–1149, 2016. |
en_US |