Abstract:
In recent years, the amount of video data has grown exponentially. Video summarization has emerged as a process that supports efficient storage, indexing, and quick understanding of long videos. We treat video summarization as the task of finding visual cues in frames that lead to a sensible, human-understandable temporal order. Because attention models currently perform best at maintaining long-range temporal order, our research seeks a better way to apply the attention mechanism to video summarization. We approach the problem with a supervised method. We propose a novel architecture that uses a Global and Segmented Local Multi-Head Attention mechanism, which greatly helps maintain temporal and contextual consistency in the summarized video. From our architecture, we gain the insight that the segment size should be determined by the change points of the videos in a dataset, and that the number of heads in multi-head attention should be determined by the segment length. Our proposed method outperforms existing state-of-the-art methods, achieving improvements of 2% to 3% on two benchmark datasets.
Description:
Supervised by
Dr. Md. Hasanul Kabir,
Professor,
Department of Computer Science and Engineering (CSE),
Islamic University of Technology (IUT),
Board Bazar, Gazipur-1704, Bangladesh.
This thesis is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science and Engineering, 2022.