Abstract:
Automatic detection of violence in surveillance footage holds special significance among general activity recognition tasks due to its broad applicability in areas such as autonomous security monitoring systems and web video censoring.
In this paper, we propose a two-stream deep learning architecture based on the Separable Convolutional LSTM (SepConvLSTM) and a pre-trained, truncated MobileNet, in which one stream processes the differences of adjacent frames and the other takes background-suppressed frames as input. Fast and efficient pre-processing techniques highlight moving objects by suppressing non-moving backgrounds and capturing motion between frames. These inputs help produce discriminative features, since violent activities are predominantly characterized by rapid movements. SepConvLSTM is built by replacing each ConvLSTM
gate’s convolution operation with a depthwise separable convolution, resulting in robust long-range spatio-temporal features with significantly fewer parameters. We
experimented with three fusion strategies to merge the output feature maps of the
two streams. Three standard public datasets were used to assess the proposed methods. On the larger and more challenging RWF-2000 dataset, our model improves on the previous best accuracy by more than 2%, while matching state-of-the-art results on the smaller datasets. Our studies demonstrate that the proposed models excel in terms of both computational efficiency and detection accuracy.
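The two input streams described above (frame differencing and background suppression) can be illustrated with a short NumPy sketch. Subtracting the per-clip mean frame is only one simple way to suppress a static background; the authors' exact pre-processing may differ, and the function name and array shapes here are assumptions.

```python
import numpy as np

def preprocess_clip(frames):
    """Build the two stream inputs from a video clip.

    frames: float32 array of shape (T, H, W, 3), values in [0, 1].
    Returns (frame_diffs, bg_suppressed).
    """
    # Stream 1: differences of adjacent frames highlight motion.
    frame_diffs = frames[1:] - frames[:-1]           # shape (T-1, H, W, 3)

    # Stream 2: suppress the (assumed static) background by removing
    # the average frame, leaving mostly the moving foreground.
    mean_frame = frames.mean(axis=0, keepdims=True)  # shape (1, H, W, 3)
    bg_suppressed = np.abs(frames - mean_frame)      # shape (T, H, W, 3)

    return frame_diffs, bg_suppressed
```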
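The core SepConvLSTM idea, replacing each ConvLSTM gate's convolution with a depthwise separable convolution, can likewise be sketched. Below is a minimal, illustrative single-step cell assuming a Keras-style implementation; the class name, gate layout, and hyper-parameters are placeholders, not the authors' code.

```python
import tensorflow as tf
from tensorflow.keras import layers

class SepConvLSTMCellSketch(layers.Layer):
    """One step of a ConvLSTM-style cell whose per-gate convolutions
    are depthwise separable, which is where the parameter savings
    mentioned in the abstract come from."""

    def __init__(self, filters, kernel_size=3, **kwargs):
        super().__init__(**kwargs)
        # One depthwise separable convolution per gate
        # (input, forget, output, candidate).
        self.gates = [
            layers.SeparableConv2D(filters, kernel_size, padding="same")
            for _ in range(4)
        ]

    def call(self, x_t, states):
        h_prev, c_prev = states                     # previous hidden / cell maps
        xh = tf.concat([x_t, h_prev], axis=-1)      # gate input: frame + recurrence
        i = tf.sigmoid(self.gates[0](xh))           # input gate
        f = tf.sigmoid(self.gates[1](xh))           # forget gate
        o = tf.sigmoid(self.gates[2](xh))           # output gate
        g = tf.tanh(self.gates[3](xh))              # candidate state
        c_t = f * c_prev + i * g                    # standard LSTM cell update
        h_t = o * tf.tanh(c_t)                      # spatio-temporal feature map
        return h_t, (h_t, c_t)
```

In a full model, such a cell would be iterated over the frame sequence on top of the truncated MobileNet features, and the output maps of the two streams would then be merged by one of the fusion strategies.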
Description:
Supervised by
Md. Hasanul Kabir, PhD,
Professor,
Department of Computer Science and Engineering (CSE),
Islamic University of Technology, Board Bazar, Gazipur-1704, Bangladesh.