Abstract:
Multiple object tracking (MOT) is a crucial task in computer vision, with applications
in fields such as surveillance, robotics, and autonomous systems. Accurate MOT is
essential for maintaining situational awareness in complex environments, where objects
must be detected accurately and tracked in real time. In this paper, we present a novel
approach for MOT that combines joint detection and embedding (JDE), which detects and
identifies multiple objects simultaneously, with a Swin Transformer
for multi-scale feature extraction. The Swin Transformer, a variant of the popular
Transformer architecture, extracts rich, multi-scale features from the input
data with computational complexity linear in image size, enabling our method to handle
objects of varying sizes and shapes. We attached a prediction head to every stage of
Swin blocks to obtain multi-scale features. We also increased the number of Swin blocks
in the first stage to detect objects more accurately from large receptive fields. We
evaluated our approach on a test set drawn from our self-defined MIX dataset and
achieved an accuracy of 84.9%.
While this is a promising result, there is room for improvement, such as strengthening
the re-identification component or modifying the MLP layers of the Swin blocks.
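
As a rough illustration of the multi-scale and linear-complexity properties mentioned above, the sketch below computes the feature-map size at each stage and compares the cost of windowed and global self-attention. It follows the defaults of the original Swin Transformer design (patch size 4, window size 7, channel width doubling per stage). The 224x224 input and base width of 96 are illustrative assumptions, not the configuration used in this work, and the function names are hypothetical.

```python
# Sketch only: stage layout and cost formulas follow the original Swin
# Transformer design. Input size and base_channels are assumed values,
# not the settings used in this work.

def stage_shapes(height, width, base_channels=96, patch=4, num_stages=4):
    """Return (h, w, channels) of the feature map produced by each stage."""
    shapes = []
    for i in range(num_stages):
        stride = patch * 2 ** i  # downsampling factor: 4, 8, 16, 32
        shapes.append((height // stride, width // stride, base_channels * 2 ** i))
    return shapes


def global_attention_flops(h, w, c):
    """Global self-attention cost: quadratic in the number of tokens h*w."""
    return 4 * h * w * c ** 2 + 2 * (h * w) ** 2 * c


def window_attention_flops(h, w, c, window=7):
    """Windowed self-attention cost: linear in h*w for a fixed window size."""
    return 4 * h * w * c ** 2 + 2 * window ** 2 * h * w * c


if __name__ == "__main__":
    for i, (h, w, c) in enumerate(stage_shapes(224, 224), start=1):
        ratio = global_attention_flops(h, w, c) / window_attention_flops(h, w, c)
        print(f"stage {i}: {h}x{w}x{c}, global/windowed cost ratio = {ratio:.2f}")
```

Each stage yields a feature map at a different resolution, which is what allows a prediction head attached to every stage to see objects at a different scale.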
Description:
Supervised by
Prof. Dr. Md. Hasanul Kabir,
Co-supervisor,
Mr. Md. Bakhtiar Hasan,
Assistant Professor,
Department of Computer Science and Engineering (CSE),
Islamic University of Technology (IUT),
Board Bazar, Gazipur-1704, Bangladesh