Hate Speech Detection From Multimodal Memes Using Vision-Language Transformer Models


dc.contributor.author Ivan, Shahriar
dc.date.accessioned 2025-03-13T08:21:02Z
dc.date.available 2025-03-13T08:21:02Z
dc.date.issued 2024-06-21
dc.identifier.uri http://hdl.handle.net/123456789/2398
dc.description Supervised by Prof. Dr. Md. Hasanul Kabir, Department of Computer Science and Engineering (CSE), Islamic University of Technology (IUT), Board Bazar, Gazipur, Bangladesh. This thesis is submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science and Engineering, 2024. en_US
dc.description.abstract Memes are widely prevalent on social media, serving as tools for both entertainment and communication, but they can also harbor offensive content. Given the extensive amount of online content, automated methods are essential for categorizing and curbing the spread of offensive memes. This study delves into hateful meme identification, addressing current limitations and exploring the integration of vision and language models to leverage both image and text data. We propose an end-to-end vision-language multimodal framework for categorizing hateful memes using knowledge from both the image and textual modalities. The framework features an 'Image Captioning block' that derives meaningful textual descriptions from meme images, followed by a 'Fusion and Classification block', which combines features from both modalities and generates classification results from three transformer-based language models. The final decision is derived from an ensemble of these predictions in the 'Decision block'. Evaluating our framework on the Hateful Memes Challenge Dataset, we achieve an accuracy of 72.2% and an AUROC score of 0.7708. Furthermore, we provide a thorough analysis of the characteristics of memes that make accurate classification difficult, offering insights into why certain memes are misclassified and guiding future research in this domain. en_US
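The abstract above outlines a three-stage pipeline: caption the meme image, fuse the caption with the meme's overlaid text and classify with three transformer-based language models, then ensemble the predictions. The following is a minimal illustrative sketch of that flow, assuming a BLIP captioner, a BERT/ALBERT/RoBERTa trio, and simple soft voting; the exact checkpoints, fusion strategy, and fine-tuning details used in the thesis are not reproduced here.

```python
# Minimal sketch of the caption -> fuse -> classify -> ensemble pipeline described
# in the abstract. The checkpoints below (BLIP captioner; BERT, ALBERT, RoBERTa as
# the three text classifiers) and the soft-voting rule are illustrative assumptions,
# not necessarily the configuration used in the thesis. The classification heads are
# randomly initialised here and would need fine-tuning on the Hateful Memes training
# split before the outputs are meaningful.
import torch
from PIL import Image
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    BlipForConditionalGeneration,
    BlipProcessor,
)

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Image Captioning block: turn the meme image into a textual description.
cap_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
cap_model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
).to(DEVICE)

def caption_image(image_path: str) -> str:
    image = Image.open(image_path).convert("RGB")
    inputs = cap_processor(images=image, return_tensors="pt").to(DEVICE)
    ids = cap_model.generate(**inputs, max_new_tokens=30)
    return cap_processor.decode(ids[0], skip_special_tokens=True)

# Fusion and Classification block: concatenate the meme's overlaid text with the
# generated caption and score the fused text with three language transformers.
CLASSIFIER_NAMES = ["bert-base-uncased", "albert-base-v2", "roberta-base"]  # assumed trio

def predict_hateful_probability(meme_text: str, image_path: str) -> float:
    fused_text = f"{meme_text} [SEP] {caption_image(image_path)}"  # plain-text separator
    per_model_probs = []
    for name in CLASSIFIER_NAMES:
        tokenizer = AutoTokenizer.from_pretrained(name)
        classifier = AutoModelForSequenceClassification.from_pretrained(
            name, num_labels=2
        ).to(DEVICE)
        enc = tokenizer(fused_text, truncation=True, return_tensors="pt").to(DEVICE)
        with torch.no_grad():
            logits = classifier(**enc).logits
        per_model_probs.append(torch.softmax(logits, dim=-1)[0])
    # Decision block: soft-voting ensemble, averaging the three probability vectors.
    ensemble = torch.stack(per_model_probs).mean(dim=0)
    return ensemble[1].item()  # probability of the "hateful" class

if __name__ == "__main__":
    # Hypothetical inputs, for illustration only.
    print(predict_hateful_probability("sample overlaid meme text", "meme.png"))
```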
dc.language.iso en en_US
dc.publisher Department of Computer Science and Engineering (CSE), Islamic University of Technology (IUT), Board Bazar, Gazipur-1704, Bangladesh en_US
dc.subject Hateful memes, Automated content moderation, Image-text fusion, Image captioning, Text transformers en_US
dc.title Hate Speech Detection From Multimodal Memes Using Vision-Language Transformer Models en_US
dc.type Thesis en_US

