A Diverse Bengali-English Code-Mixed Dataset for Sentiment Analysis

Nishita, Sadia Alam; Alvee, Navid Hasin; Siddique, Md. Shahnewaz

dc.contributor.author	Nishita, Sadia Alam
dc.contributor.author	Alvee, Navid Hasin
dc.contributor.author	Siddique, Md. Shahnewaz
dc.date.accessioned	2025-06-03T06:43:22Z
dc.date.available	2025-06-03T06:43:22Z
dc.date.issued	2024-06-30
dc.identifier.uri	http://hdl.handle.net/123456789/2417
dc.description	Supervised by Dr. Md. Azam Hossain, Associate Professor, Department of Computer Science and Engineering (CSE), Islamic University of Technology (IUT), Board Bazar, Gazipur, Bangladesh. This thesis is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science and Engineering, 2024.	en_US
dc.description.abstract	Code-mixed data, blending two or more languages within sentences, offers valuable insights for low-resource languages like Bengali, which have limited annotated corpora. While sentiment analysis has been widely explored in various languages, code-mixed Bengali remains underrepresented, lacking a comprehensive benchmark dataset. To address this gap, we introduce BnSentMix, a sentiment analysis dataset for code-mixed Bengali-English, consisting of 20,000 samples annotated with four sentiment labels: positive, negative, neutral, and mixed. The data was sourced from e-commerce websites, YouTube, and Facebook, ensuring linguistic diversity and reflecting real-world, code-mixed scenarios. Our dataset captures a wide variety of user-generated content, providing robust coverage of both informal and formal language styles. We utilized a novel automated text filtering pipeline that employs fine-tuned pre-trained language models to detect and extract code-mixed samples, ensuring high-quality data. To evaluate the dataset, we applied 11 baseline approaches, ranging from traditional machine learning models to advanced transformer-based architectures. Our best model achieved an accuracy of 69.5% and an F1 score of 68.8%.	en_US
dc.language.iso	en	en_US
dc.publisher	Department of Computer Science and Engineering (CSE), Islamic University of Technology (IUT), Board Bazar, Gazipur-1704, Bangladesh	en_US
dc.title	A Diverse Bengali-English Code-Mixed Dataset for Sentiment Analysis	en_US
dc.type	Thesis	en_US