Abstract:
Data augmentation can be a valuable technique, particularly in resource-scarce linguistic
domains for improving the performance of natural language processing tasks by creating new synthetic data instances. This paper introduces a Bangla text Data Augmentation Framework (BDA) using pre-trained model-based and rule-based approaches,
along with a filtering pipeline to ensure semantic similarity and lexical variance between augmented and original text. We provide a comprehensive pipeline for the proposed framework and perform an in-depth analysis of how well it performs in Bangla
text classification tasks. Our framework improved the F1 score of classification tasks
by up to 13.92%, 8.58%, and 10.55% at the 15%, 50%, and 100% clipping ranges, respectively, across five different datasets. Training with BDA while using only 50% of
the available training set achieved an F1 score comparable to normal training with all
available data. We provide an extensive study of the performance of each augmentation
approach at the clipping ranges of datasets using BanglaBERT and variants of SVM.
Furthermore, we discuss the indicators for optimal performance of the BDA framework
and its shortcomings with in-depth analysis.
Description:
Supervised by
Mr. Md. Mohsinul Kabir,
Assistant Professor,
Dr. Hasan Mahmud,
Associate Professor,
Dr. Kamrul Hasan,
Professor,
Department of Computer Science and Engineering (CSE)
Islamic University of Technology (IUT)
Board Bazar, Gazipur, Bangladesh
This thesis is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science and Engineering, 2024.