BDA: Bangla Text Data Augmentation Framework


dc.contributor.author Tariquzzaman, Md.
dc.contributor.author Anam, Audwit Nafi
dc.contributor.author Haque, Naimul
dc.date.accessioned 2025-03-06T07:35:21Z
dc.date.available 2025-03-06T07:35:21Z
dc.date.issued 2024-06-23
dc.identifier.uri http://hdl.handle.net/123456789/2362
dc.description Supervised by Mr. Md. Mohsinul Kabir, Assistant Professor; Dr. Hasan Mahmud, Associate Professor; and Dr. Kamrul Hasan, Professor, Department of Computer Science and Engineering (CSE), Islamic University of Technology (IUT), Board Bazar, Gazipur, Bangladesh. This thesis is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science and Engineering, 2024. en_US
dc.description.abstract Data augmentation can be a valuable technique, particularly in resource-scarce linguistic domains, for improving the performance of natural language processing tasks by creating new synthetic data instances. This paper introduces a Bangla text Data Augmentation Framework (BDA) that combines pre-trained model-based and rule-based approaches with a filtering pipeline to ensure semantic similarity and lexical variance between augmented and original text. We provide a comprehensive pipeline for the proposed framework and perform an in-depth analysis of how well it performs on Bangla text classification tasks. Our framework improved the F1 score of classification tasks by up to 13.92%, 8.58%, and 10.55% at the 15%, 50%, and 100% clipping ranges respectively, across five different datasets. Training with BDA while using only 50% of the available training set achieved an F1 score comparable to normal training with all available data. We provide an extensive study of the performance of each augmentation approach at these clipping ranges using BanglaBERT and variants of SVM. Furthermore, we discuss the indicators of optimal performance of the BDA framework and analyze its shortcomings in depth. en_US
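The abstract describes a filtering pipeline that keeps an augmented sentence only if it stays semantically close to the original while still differing lexically. The sketch below illustrates that two-sided check. To stay self-contained it substitutes a bag-of-words cosine for the thesis's model-based semantic similarity, and every name and threshold here (`filter_augmented`, `sim_threshold=0.7`, `overlap_ceiling=0.9`) is an illustrative assumption, not taken from the thesis.

```python
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two bag-of-words vectors.
    # Stand-in for an embedding-based semantic similarity score.
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def lexical_overlap(orig: str, aug: str) -> float:
    # Jaccard overlap of token sets; lower means more lexical variance.
    s1, s2 = set(orig.split()), set(aug.split())
    return len(s1 & s2) / len(s1 | s2) if s1 | s2 else 1.0

def filter_augmented(original: str, candidates: list[str],
                     sim_threshold: float = 0.7,
                     overlap_ceiling: float = 0.9) -> list[str]:
    # Keep candidates that stay close in meaning (similarity at or
    # above the threshold) but are not near-verbatim copies (overlap
    # at or below the ceiling). Thresholds here are illustrative.
    ref = Counter(original.split())
    kept = []
    for cand in candidates:
        close = cosine(ref, Counter(cand.split())) >= sim_threshold
        varied = lexical_overlap(original, cand) <= overlap_ceiling
        if close and varied:
            kept.append(cand)
    return kept
```

With this filter, an exact copy of the original is rejected for lacking lexical variance, and an unrelated sentence is rejected for lacking semantic similarity; only genuine paraphrase-like candidates survive into the augmented training set.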
dc.language.iso en en_US
dc.publisher Department of Computer Science and Engineering (CSE), Islamic University of Technology (IUT), Board Bazar, Gazipur-1704, Bangladesh en_US
dc.subject Augmentation, NLP, Synthetic text generation, Data scarcity en_US
dc.title BDA: Bangla Text Data Augmentation Framework en_US
dc.type Thesis en_US

