Abstract:
Code-mixed data, blending two or more languages within sentences, offers valuable
insights for low-resource languages like Bengali, which have limited annotated corpora. While sentiment analysis has been widely explored in various languages, code-mixed Bengali remains underrepresented, lacking a comprehensive benchmark dataset.
To address this gap, we introduce BnSentMix, a sentiment analysis dataset for code-mixed Bengali-English, consisting of 20,000 samples annotated with four sentiment
labels: positive, negative, neutral, and mixed. The data was sourced from e-commerce
websites, YouTube, and Facebook, ensuring linguistic diversity and reflecting real-world, code-mixed scenarios. Our dataset captures a wide variety of user-generated
content, providing robust coverage of both informal and formal language styles. We
utilized a novel automated text filtering pipeline that employs fine-tuned pre-trained
language models to detect and extract code-mixed samples, ensuring high-quality
data. To evaluate the dataset, we applied 11 baseline approaches, ranging from traditional machine learning models to advanced transformer-based architectures. Our
best model achieved an accuracy of 69.5% and an F1 score of 68.8%.
Description:
Supervised by
Dr. Md. Azam Hossain,
Associate Professor,
Department of Computer Science and Engineering (CSE),
Islamic University of Technology (IUT),
Board Bazar, Gazipur, Bangladesh
This thesis is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science and Engineering, 2024.