A Diverse Bengali-English Code-Mixed Dataset for Sentiment Analysis

Nishita, Sadia Alam; Alvee, Navid Hasin; Siddique, Md. Shahnewaz

URI: http://hdl.handle.net/123456789/2417

Date: 2024-06-30

Abstract:

Code-mixed data, blending two or more languages within sentences, offers valuable insights for low-resource languages like Bengali, which have limited annotated corpora. While sentiment analysis has been widely explored in various languages, code-mixed Bengali remains underrepresented, lacking a comprehensive benchmark dataset. To address this gap, we introduce BnSentMix, a sentiment analysis dataset for code-mixed Bengali-English, consisting of 20,000 samples annotated with four sentiment labels: positive, negative, neutral, and mixed. The data was sourced from e-commerce websites, YouTube, and Facebook, ensuring linguistic diversity and reflecting real-world, code-mixed scenarios. Our dataset captures a wide variety of user-generated content, providing robust coverage of both informal and formal language styles. We utilized a novel automated text filtering pipeline that employs fine-tuned pre-trained language models to detect and extract code-mixed samples, ensuring high-quality data. To evaluate the dataset, we applied 11 baseline approaches, ranging from traditional machine learning models to advanced transformer-based architectures. Our best model achieved an accuracy of 69.5% and an F1 score of 68.8%.

Description:

Supervised by Dr. Md. Azam Hossain, Associate Professor, Department of Computer Science and Engineering (CSE), Islamic University of Technology (IUT), Board Bazar, Gazipur, Bangladesh. This thesis is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science and Engineering, 2024.
