Abstract:
Memes are ubiquitous on social media, serving as tools for both entertainment and communication, but they can also harbor offensive content. Given the sheer volume of online content, automated methods are essential for identifying and curbing the spread of offensive memes. This study addresses hateful meme identification, examining current limitations and exploring the integration of vision and language models to leverage both image and text data. We propose an end-to-end vision-language multimodal framework for classifying hateful memes using knowledge from both the image and the text modality. The framework comprises an 'Image Captioning block' that derives meaningful textual descriptions from meme images, followed by a 'Fusion and Classification block' that combines features from the image and text modalities and produces classification results from three transformer-based language models. The final decision is derived from an ensemble of these predictions in the 'Decision block'. Evaluating our framework on the Hateful Memes Challenge Dataset, we achieve an accuracy of 72.2% and an AUROC of 0.7708. Furthermore, we provide a thorough analysis of the characteristics of memes that make accurate classification difficult, offering insights into why certain memes are misclassified and guiding future research in this domain.
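To make the 'Decision block' concrete, the sketch below shows one plausible ensembling rule: averaging the per-meme 'hateful' probabilities produced by the three transformer-based classifiers and thresholding the result. This is an illustrative assumption only; the abstract does not specify the ensembling rule, and the function name, probabilities, and threshold here are hypothetical.

import numpy as np

def ensemble_decision(prob_a, prob_b, prob_c, threshold=0.5):
    """Combine 'hateful' probabilities from three classifiers by
    simple averaging (one plausible ensembling strategy; the thesis
    may use a different rule, e.g. majority voting)."""
    avg = np.mean([prob_a, prob_b, prob_c], axis=0)
    return (avg >= threshold).astype(int), avg

# Hypothetical per-meme 'hateful' probabilities from the three
# transformer-based classifiers over a small batch of memes.
p1 = np.array([0.81, 0.12, 0.55])
p2 = np.array([0.77, 0.30, 0.48])
p3 = np.array([0.69, 0.05, 0.61])

labels, scores = ensemble_decision(p1, p2, p3)
print(labels)  # [1 0 1] -> hateful / not hateful / hateful
print(scores)  # averaged ensemble scores per meme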
Description:
Supervised by
Prof. Dr. Md. Hasanul Kabir,
Department of Computer Science and Engineering (CSE)
Islamic University of Technology (IUT)
Board Bazar, Gazipur, Bangladesh
This thesis is submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science and Engineering, 2024.