A Machine Learning approach to Data Augmentation with Semantic Similarity on a Low-Resource Language

dc.contributor.author Islam, Shah Jawad
dc.contributor.author Chowdhury, Mohammad Abrar
dc.contributor.author Alam, Taufiqul
dc.date.accessioned 2024-08-28T09:58:56Z
dc.date.available 2024-08-28T09:58:56Z
dc.date.issued 2023-05-30
dc.identifier.citation
[1] M. Z. Hossain, M. A. Rahman, M. S. Islam, and S. Kar, “BanFakeNews: A dataset for detecting fake news in Bangla,” arXiv preprint arXiv:2004.08789, 2020.
[2] A. J. Keya, M. A. H. Wadud, M. Mridha, M. Alatiyyah, and M. A. Hamid, “AugFake-BERT: Handling imbalance through augmentation of fake news using BERT to enhance the performance of fake news classification,” Applied Sciences, vol. 12, no. 17, p. 8398, 2022.
[3] O. F. Rakib, S. Akter, M. A. Khan, A. K. Das, and K. M. Habibullah, “Bangla word prediction and sentence completion using GRU: An extended version of RNN on n-gram language model,” in 2019 International Conference on Sustainable Technologies for Industry 4.0 (STI). IEEE, 2019, pp. 1–6.
[4] Y. Kang, Z. Cai, C.-W. Tan, Q. Huang, and H. Liu, “Natural language processing (NLP) in management research: A literature review,” Journal of Management Analytics, vol. 7, no. 2, pp. 139–172, 2020.
[5] J. J. Webster and C. Kit, “Tokenization as the initial phase in NLP,” in COLING 1992 Volume 4: The 14th International Conference on Computational Linguistics, 1992.
[6] K. Chowdhary, “Natural language processing,” Fundamentals of Artificial Intelligence, pp. 603–649, 2020.
[7] R. Socher, Y. Bengio, and C. D. Manning, “Deep learning for NLP (without magic),” in Tutorial Abstracts of ACL 2012, 2012, pp. 5–5.
[8] K. Shu, A. Sliva, S. Wang, J. Tang, and H. Liu, “Fake news detection on social media: A data mining perspective,” ACM SIGKDD Explorations Newsletter, vol. 19, no. 1, pp. 22–36, 2017.
[9] X. Zhang and A. A. Ghorbani, “An overview of online fake news: Characterization, detection, and discussion,” Information Processing & Management, vol. 57, no. 2, p. 102025, 2020.
[10] J. C. Reis, A. Correia, F. Murai, A. Veloso, and F. Benevenuto, “Supervised learning for fake news detection,” IEEE Intelligent Systems, vol. 34, no. 2, pp. 76–81, 2019.
[11] X. Zhou and R. Zafarani, “A survey of fake news: Fundamental theories, detection methods, and opportunities,” ACM Computing Surveys (CSUR), vol. 53, no. 5, pp. 1–40, 2020.
[12] C. Shorten, T. M. Khoshgoftaar, and B. Furht, “Text data augmentation for deep learning,” Journal of Big Data, vol. 8, pp. 1–34, 2021.
[13] J. S. Liu and Y. N. Wu, “Parameter expansion for data augmentation,” Journal of the American Statistical Association, vol. 94, no. 448, pp. 1264–1274, 1999.
[14] S. Y. Feng, V. Gangal, J. Wei, S. Chandar, S. Vosoughi, T. Mitamura, and E. Hovy, “A survey of data augmentation approaches for NLP,” arXiv preprint arXiv:2105.03075, 2021.
[15] S. Li, M. Xie, K. Gong, C. H. Liu, Y. Wang, and W. Li, “Transferable semantic augmentation for domain adaptation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 11516–11525.
[16] Y. Wang, X. Pan, S. Song, H. Zhang, G. Huang, and C. Wu, “Implicit semantic data augmentation for deep networks,” Advances in Neural Information Processing Systems, vol. 32, 2019.
[17] Y. Nie, Y. Tian, X. Wan, Y. Song, and B. Dai, “Named entity recognition for social media texts with semantic augmentation,” arXiv preprint arXiv:2010.15458, 2020.
[18] S. C. Wong, A. Gatt, V. Stamatescu, and M. D. McDonnell, “Understanding data augmentation for classification: When to warp?” in 2016 International Conference on Digital Image Computing: Techniques and Applications (DICTA). IEEE, 2016, pp. 1–6.
[19] S. Kotsiantis, D. Kanellopoulos, P. Pintelas et al., “Handling imbalanced datasets: A review,” GESTS International Transactions on Computer Science and Engineering, vol. 30, no. 1, pp. 25–36, 2006.
[20] M. S. Rahman, F. B. Ashraf, and M. R. Kabir, “An efficient deep learning technique for Bangla fake news detection,” in 2022 25th International Conference on Computer and Information Technology (ICCIT). IEEE, 2022, pp. 206–211.
[21] D. Ramyachitra and P. Manikandan, “Imbalanced dataset classification and solutions: A review,” International Journal of Computing and Business Research (IJCBR), vol. 5, no. 4, pp. 1–29, 2014.
[22] D. Cohn, L. Atlas, and R. Ladner, “Improving generalization with active learning,” Machine Learning, vol. 15, pp. 201–221, 1994.
[23] Y. Wang, G. Huang, S. Song, X. Pan, Y. Xia, and C. Wu, “Regularizing deep networks with semantic data augmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 7, pp. 3733–3748, 2021.
[24] S. B. S. Mugdha, S. M. Ferdous, and A. Fahmin, “Evaluating machine learning algorithms for Bengali fake news detection,” in 2020 23rd International Conference on Computer and Information Technology (ICCIT). IEEE, 2020, pp. 1–6.
[25] K. Weiss, T. M. Khoshgoftaar, and D. Wang, “A survey of transfer learning,” Journal of Big Data, vol. 3, no. 1, pp. 1–40, 2016.
[26] A. Magueresse, V. Carles, and E. Heetderks, “Low-resource languages: A review of past work and future challenges,” arXiv preprint arXiv:2006.07264, 2020.
[27] G. A. Miller and W. G. Charles, “Contextual correlates of semantic similarity,” Language and Cognitive Processes, vol. 6, no. 1, pp. 1–28, 1991.
[28] D. Mimno, H. Wallach, E. Talley, M. Leenders, and A. McCallum, “Optimizing semantic coherence in topic models,” in Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, 2011, pp. 262–272.
[29] M. Q. Patton et al., “Qualitative evaluation methods,” 1980.
[30] G. Lee, I. Lee, H. Ha, K.-G. Lee, H. Hyun, A. Shin, and B.-G. Chun, “Refurbish your training data: Reusing partially augmented samples for faster deep neural network training,” in USENIX Annual Technical Conference, 2021, pp. 537–550.
[31] M. C. Ramos, “Some ethical implications of qualitative research,” Research in Nursing & Health, vol. 12, no. 1, pp. 57–63, 1989.
[32] T. S. Apon, R. Anan, E. A. Modhu, A. Suter, I. J. Sneha, and M. G. R. Alam, “BanglaSarc: A dataset for sarcasm detection,” in 2022 IEEE Asia-Pacific Conference on Computer Science and Data Engineering (CSDE). IEEE, 2022, pp. 1–5.
[33] M. E. Markiewicz and C. J. de Lucena, “Object oriented framework development,” XRDS: Crossroads, The ACM Magazine for Students, vol. 7, no. 4, pp. 3–9, 2001.
[34] L. Pevzner and M. A. Hearst, “A critique and improvement of an evaluation metric for text segmentation,” Computational Linguistics, vol. 28, no. 1, pp. 19–36, 2002.
[35] I. Hernández, S. Sawicki, F. Roos-Frantz, and R. Z. Frantz, “Cloud configuration modelling: A literature review from an application integration deployment perspective,” Procedia Computer Science, vol. 64, pp. 977–983, 2015.
[36] S. Frühwirth-Schnatter, “Data augmentation and dynamic linear models,” Journal of Time Series Analysis, vol. 15, no. 2, pp. 183–202, 1994.
[37] A. Antoniou, A. Storkey, and H. Edwards, “Data augmentation generative adversarial networks,” arXiv preprint arXiv:1711.04340, 2017.
[38] C. Shorten and T. M. Khoshgoftaar, “A survey on image data augmentation for deep learning,” Journal of Big Data, vol. 6, no. 1, pp. 1–48, 2019.
[39] J. Wei and K. Zou, “EDA: Easy data augmentation techniques for boosting performance on text classification tasks,” arXiv preprint arXiv:1901.11196, 2019.
[40] R. Keskisärkkä, “Automatic text simplification via synonym replacement,” 2012.
[41] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “mixup: Beyond empirical risk minimization,” arXiv preprint arXiv:1710.09412, 2017.
[42] L. Zhang, Z. Deng, K. Kawaguchi, A. Ghorbani, and J. Zou, “How does mixup help with robustness and generalization?” arXiv preprint arXiv:2010.04819, 2020.
[43] E. Alemayehu and Y. Fang, “A submodular optimization framework for imbalanced text classification with data augmentation,” IEEE Access, 2023.
[44] J. Chen, Z. Yang, and D. Yang, “MixText: Linguistically-informed interpolation of hidden space for semi-supervised text classification,” arXiv preprint arXiv:2004.12239, 2020.
[45] Y. Wang, J. Yan, Z. Yang, Y. Zhao, and T. Liu, “Optimizing GIS partial discharge pattern recognition in the ubiquitous power internet of things context: A MixNet deep learning model,” International Journal of Electrical Power & Energy Systems, vol. 125, p. 106484, 2021.
[46] R. Sennrich, B. Haddow, and A. Birch, “Improving neural machine translation models with monolingual data,” arXiv preprint arXiv:1511.06709, 2015.
[47] T. Stoev, A. Ferrario, B. Demiray, M. Luo, M. Martin, and K. Yordanova, “Coping with imbalanced data in the automated detection of reminiscence from everyday life conversations of older adults,” IEEE Access, vol. 9, pp. 116540–116551, 2021.
[48] S. Kobayashi, “Contextual augmentation: Data augmentation by words with paradigmatic relations,” arXiv preprint arXiv:1805.06201, 2018.
[49] Y. Yang, C. Malaviya, J. Fernandez, S. Swayamdipta, R. L. Bras, J.-P. Wang, C. Bhagavatula, Y. Choi, and D. Downey, “Generative data augmentation for commonsense reasoning,” arXiv preprint arXiv:2004.11546, 2020.
[50] S. Vosoughi, D. Roy, and S. Aral, “The spread of true and false news online,” Science, vol. 359, no. 6380, pp. 1146–1151, 2018.
[51] V. Pérez-Rosas, B. Kleinberg, A. Lefevre, and R. Mihalcea, “Automatic detection of fake news,” arXiv preprint arXiv:1708.07104, 2017.
[52] Y. Long, Q. Lu, R. Xiang, M. Li, and C.-R. Huang, “Fake news detection through multi-perspective speaker profiles,” in Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), 2017, pp. 252–256.
[53] F. Yang, A. Mukherjee, and E. Dragut, “Satirical news detection and analysis using attention mechanism and linguistic features,” arXiv preprint arXiv:1709.01189, 2017.
[54] G. Karadzhov, P. Nakov, L. Màrquez, A. Barrón-Cedeño, and I. Koychev, “Fully automated fact checking using external sources,” arXiv preprint arXiv:1710.00341, 2017.
[55] X. Dong, U. Victor, S. Chowdhury, and L. Qian, “Deep two-path semi-supervised learning for fake news detection,” arXiv preprint arXiv:1906.05659, 2019.
[56] P. S. Ray et al., “Bengali language handbook,” 1966.
[57] M. Lan, Z. Zhang, Y. Lu, and J. Wu, “Three convolutional neural network-based models for learning sentiment word vectors towards sentiment analysis,” in 2016 International Joint Conference on Neural Networks (IJCNN). IEEE, 2016, pp. 3172–3179.
[58] H. Saleh, A. Alharbi, and S. H. Alsamhi, “OPCNN-FAKE: Optimized convolutional neural network for fake news detection,” IEEE Access, vol. 9, pp. 129471–129489, 2021.
[59] M. Umer, Z. Imtiaz, S. Ullah, A. Mehmood, G. S. Choi, and B.-W. On, “Fake news stance detection using deep learning architecture (CNN-LSTM),” IEEE Access, vol. 8, pp. 156695–156706, 2020.
[60] O. Ajao, D. Bhowmik, and S. Zargari, “Fake news identification on Twitter with hybrid CNN and RNN models,” in Proceedings of the 9th International Conference on Social Media and Society, 2018, pp. 226–230.
[61] S. Singhania, N. Fernandez, and S. Rao, “3HAN: A deep neural network for fake news detection,” in International Conference on Neural Information Processing. Springer, 2017, pp. 572–581.
[62] N. Aloshban, “ACT: Automatic fake news classification through self-attention,” in 12th ACM Conference on Web Science, 2020, pp. 115–124.
[63] Y.-J. Lu and C.-T. Li, “GCAN: Graph-aware co-attention networks for explainable fake news detection on social media,” arXiv preprint arXiv:2004.11648, 2020.
[64] H. Jwa, D. Oh, K. Park, J. M. Kang, and H. Lim, “exBAKE: Automatic fake news detection model based on bidirectional encoder representations from transformers (BERT),” Applied Sciences, vol. 9, no. 19, p. 4062, 2019.
[65] T. Zhang, D. Wang, H. Chen, Z. Zeng, W. Guo, C. Miao, and L. Cui, “BDANN: BERT-based domain adaptation neural network for multi-modal fake news detection,” in 2020 International Joint Conference on Neural Networks (IJCNN). IEEE, 2020, pp. 1–8.
[66] R. K. Kaliyar, A. Goswami, and P. Narang, “FakeBERT: Fake news detection in social media with a BERT-based deep learning approach,” Multimedia Tools and Applications, vol. 80, no. 8, pp. 11765–11788, 2021.
[67] C.-L. Wu, H.-P. Hsieh, J. Jiang, Y.-C. Yang, C. Shei, and Y.-W. Chen, “MUFFLE: Multi-modal fake news influence estimator on Twitter,” Applied Sciences, vol. 12, no. 1, p. 453, 2022.
[68] S. Hiriyannaiah, A. Srinivas, G. K. Shetty, G. Siddesh, and K. Srinivasa, “A computationally intelligent agent for detecting fake news using generative adversarial networks,” in Hybrid Computational Intelligence. Elsevier, 2020, pp. 69–96.
[69] N. J. Ria, S. A. Khushbu, M. A. Yousuf, A. K. M. Masum, S. Abujar, and S. A. Hossain, “Toward an enhanced Bengali text classification using saint and common form,” in 2020 11th International Conference on Computing, Communication and Networking Technologies (ICCCNT). IEEE, 2020, pp. 1–5.
[70] K. Knight and J. Graehl, “Machine transliteration,” arXiv preprint cmp-lg/9704003, 1997.
[71] M. E. Peters, M. Neumann, R. L. Logan IV, R. Schwartz, V. Joshi, S. Singh, and N. A. Smith, “Knowledge enhanced contextual word representations,” arXiv preprint arXiv:1909.04164, 2019.
[72] L. M. Rose, N. Matragkas, D. S. Kolovos, and R. F. Paige, “A feature model for model-to-text transformation languages,” in 2012 4th International Workshop on Modeling in Software Engineering (MISE). IEEE, 2012, pp. 57–63.
[73] S. Edunov, M. Ott, M. Auli, and D. Grangier, “Understanding back-translation at scale,” arXiv preprint arXiv:1808.09381, 2018.
[74] C. A. Ferguson and M. Chowdhury, “The phonemes of Bengali,” Language, vol. 36, no. 1, pp. 22–59, 1960.
[75] B. Hayes and A. Lahiri, “Bengali intonational phonology,” Natural Language & Linguistic Theory, vol. 9, pp. 47–96, 1991.
[76] R. R. Chowdhury, M. S. Hossain, R. ul Islam, K. Andersson, and S. Hossain, “Bangla handwritten character recognition using convolutional neural network with data augmentation,” in 2019 Joint 8th International Conference on Informatics, Electronics & Vision (ICIEV) and 2019 3rd International Conference on Imaging, Vision & Pattern Recognition (icIVPR). IEEE, 2019, pp. 318–323.
[77] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
[78] J. Salazar, D. Liang, T. Q. Nguyen, and K. Kirchhoff, “Masked language model scoring,” arXiv preprint arXiv:1910.14659, 2019.
[79] Y. Sun, Y. Zheng, C. Hao, and H. Qiu, “NSP-BERT: A prompt-based zero-shot learner through an original pre-training task–next sentence prediction,” arXiv preprint arXiv:2109.03564, 2021.
[80] K. M. Hosny, M. A. Kassem, and M. M. Foaud, “Classification of skin lesions using transfer learning and augmentation with Alex-net,” PLoS ONE, vol. 14, no. 5, p. e0217293, 2019.
[81] A. Karadeniz, “Cohesion and coherence in written texts of students of faculty of education,” Journal of Education and Training Studies, vol. 5, no. 2, pp. 93–99, 2017.
[82] K. Clark, M.-T. Luong, Q. V. Le, and C. D. Manning, “ELECTRA: Pre-training text encoders as discriminators rather than generators,” arXiv preprint arXiv:2003.10555, 2020.
[83] Z. Chi, S. Huang, L. Dong, S. Ma, B. Zheng, S. Singhal, P. Bajaj, X. Song, X.-L. Mao, H. Huang et al., “XLM-E: Cross-lingual language model pre-training via ELECTRA,” arXiv preprint arXiv:2106.16138, 2021.
[84] S. Ni and H.-Y. Kao, “ELECTRA is a zero-shot learner, too,” arXiv preprint arXiv:2207.08141, 2022.
[85] A. Bhattacharjee, T. Hasan, W. U. Ahmad, K. Samin, M. S. Islam, A. Iqbal, M. S. Rahman, and R. Shahriyar, “BanglaBERT: Language model pretraining and benchmarks for low-resource language understanding evaluation in Bangla,” arXiv preprint arXiv:2101.00204, 2021.
[86] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “RoBERTa: A robustly optimized BERT pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.
[87] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le, “XLNet: Generalized autoregressive pretraining for language understanding,” Advances in Neural Information Processing Systems, vol. 32, 2019.
[88] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.
[89] W. Yin, K. Kann, M. Yu, and H. Schütze, “Comparative study of CNN and RNN for natural language processing,” arXiv preprint arXiv:1702.01923, 2017.
[90] A. Sherstinsky, “Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network,” Physica D: Nonlinear Phenomena, vol. 404, p. 132306, 2020.
[91] I. Rish et al., “An empirical study of the naive Bayes classifier,” in IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, vol. 3, no. 22, 2001, pp. 41–46.
[92] J. Kandola, N. Cristianini, and J. Shawe-Taylor, “Learning semantic similarity,” Advances in Neural Information Processing Systems, vol. 15, 2002.
[93] A. Das and D. Saha, “Deep learning based Bengali question answering system using semantic textual similarity,” Multimedia Tools and Applications, pp. 1–25, 2022.
[94] A. S. Bauer, P. Schmaus, F. Stulp, and D. Leidner, “Probabilistic effect prediction through semantic augmentation and physical simulation,” in 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 9278–9284.
[95] J. Li, X. Zhang, X. Zhou et al., “ALBERT-based self-ensemble model with semisupervised learning and data augmentation for clinical semantic textual similarity calculation: Algorithm validation study,” JMIR Medical Informatics, vol. 9, no. 1, p. e23086, 2021.
[96] G. Szlobodnyik and L. Farkas, “Data augmentation by guided deep interpolation,” Applied Soft Computing, vol. 111, p. 107680, 2021.
[97] M. A. Iqbal, O. Sharif, M. M. Hoque, and I. H. Sarker, “Word embedding based textual semantic similarity measure in Bengali,” Procedia Computer Science, vol. 193, pp. 92–101, 2021.
[98] M. Shajalal and M. Aono, “Semantic textual similarity in Bengali text,” in 2018 International Conference on Bangla Speech and Language Processing (ICBSLP). IEEE, 2018, pp. 1–5.
[99] A. Sarkar and M. S. Hossen, “Automatic Bangla text summarization using term frequency and semantic similarity approach,” in 2018 21st International Conference of Computer and Information Technology (ICCIT). IEEE, 2018, pp. 1–6.
[100] A. Akil, N. Sultana, A. Bhattacharjee, and R. Shahriyar, “BanglaParaphrase: A high-quality Bangla paraphrase dataset,” arXiv preprint arXiv:2210.05109, 2022.
[101] L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, and C. Raffel, “mT5: A massively multilingual pre-trained text-to-text transformer,” arXiv preprint arXiv:2010.11934, 2020.
[102] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” The Journal of Machine Learning Research, vol. 21, no. 1, pp. 5485–5551, 2020.
[103] X.-Y. Liu, J. Wu, and Z.-H. Zhou, “Exploratory undersampling for class-imbalance learning,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 39, no. 2, pp. 539–550, 2008.
[104] L. Rice, E. Wong, and Z. Kolter, “Overfitting in adversarially robust deep learning,” in International Conference on Machine Learning. PMLR, 2020, pp. 8093–8104. en_US
dc.identifier.uri http://hdl.handle.net/123456789/2136
dc.description Supervised by Dr. Hasan Mahmud, Associate Professor, Ms. Nafisa Sadaf, Lecturer, and Dr. Md. Kamrul Hasan, Professor, Department of Computer Science and Engineering (CSE), Islamic University of Technology (IUT), Board Bazar, Gazipur-1704, Bangladesh. en_US
dc.description.abstract The augmentation of data in low-resource languages has gained significant importance recently, primarily because of the scarcity of datasets or the presence of highly imbalanced datasets. In the Bengali language, the detection of fake news has emerged as a pressing problem, particularly in light of the surge in false information related to COVID-19 and the pandemic [1]. However, there has been a lack of adequately balanced datasets designed specifically for training Machine Learning (ML) and Deep Learning (DL) models to detect fake news in Bengali. Furthermore, previous attempts at augmenting fake news texts have yielded satisfactory results in lexical analysis but unsatisfactory results in terms of semantic relevance. To address these challenges, we propose a framework that applies text augmentation techniques with the assistance of the Bangla Text-to-Text Transfer Transformer (T5) model. This framework aims to balance an imbalanced Bengali fake news dataset while ensuring that the augmented text retains semantic similarity and structural accuracy. By employing this approach, we seek to strengthen the effectiveness and reliability of fake news detection models in the Bengali language. en_US
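A minimal sketch may help make the paraphrase-then-filter augmentation described in the abstract concrete. The Python sketch below pairs a Bangla T5 paraphrase model with a multilingual sentence encoder: it generates candidate paraphrases, then keeps only those whose embedding stays close to the original. The checkpoint names, generation settings, and the 0.8 similarity threshold are illustrative assumptions, not details taken from the thesis.

```python
# Sketch of paraphrase-then-filter augmentation for minority-class (fake) texts.
# All model choices and thresholds here are assumptions for illustration.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from sentence_transformers import SentenceTransformer, util

PARAPHRASER_ID = "csebuetnlp/banglat5_banglaparaphrase"  # assumed Bangla T5 checkpoint
ENCODER_ID = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"

tokenizer = AutoTokenizer.from_pretrained(PARAPHRASER_ID)
paraphraser = AutoModelForSeq2SeqLM.from_pretrained(PARAPHRASER_ID)
encoder = SentenceTransformer(ENCODER_ID)

def augment(text: str, n_candidates: int = 5, min_sim: float = 0.8) -> list[str]:
    """Generate paraphrases of `text`, keeping only semantically close ones."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    outputs = paraphraser.generate(
        **inputs,
        num_beams=n_candidates,
        num_return_sequences=n_candidates,
        max_length=128,
    )
    candidates = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    # Filter by embedding cosine similarity so augmented samples stay on-topic.
    ref_emb = encoder.encode(text, convert_to_tensor=True)
    cand_embs = encoder.encode(candidates, convert_to_tensor=True)
    sims = util.cos_sim(ref_emb, cand_embs)[0]
    return [c for c, s in zip(candidates, sims) if s.item() >= min_sim and c != text]

# Fake-class articles would be passed through augment() until the class
# distribution of the training set is roughly balanced.
```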
dc.language.iso en en_US
dc.publisher Department of Computer Science and Engineering (CSE), Islamic University of Technology (IUT), Board Bazar, Gazipur-1704, Bangladesh en_US
dc.subject Text Augmentation; Balanced Dataset; Lexical Analysis; Semantic Relevance; Bangla T5 en_US
dc.title A Machine Learning approach to Data Augmentation with Semantic Similarity on a Low-Resource Language en_US
dc.type Thesis en_US

