A Semi-Automated Approach to Generate Bangla Dataset for Question-Answering and Query-Based Text Summarization


dc.contributor.author Mushabbir, Mueeze Al
dc.contributor.author Alamgir, Refaat Mohammad
dc.contributor.author Humdoon, Ahmed Azaz
dc.date.accessioned 2023-04-28T05:16:13Z
dc.date.available 2023-04-28T05:16:13Z
dc.date.issued 2022-05-30
dc.identifier.citation [1] T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, et al., “Natural questions: a benchmark for question answering research,” Transactions of the Association for Computational Linguistics, vol. 7, pp. 453–466, 2019. [2] P. Rajpurkar, R. Jia, and P. Liang, “Know what you don’t know: Unanswerable questions for SQuAD,” arXiv preprint arXiv:1806.03822, 2018. [3] T. Tahsin Mayeesha, A. Md Sarwar, and R. M. Rahman, “Deep learning based question answering system in Bengali,” Journal of Information and Telecommunication, vol. 5, no. 2, pp. 145–178, 2021. [4] S. Kulkarni, S. Chammas, W. Zhu, F. Sha, and E. Ie, “AQuaMuSe: Automatically generating datasets for query-based multi-document summarization,” arXiv preprint arXiv:2010.12694, 2020. [5] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” arXiv preprint arXiv:1910.10683, 2019. [6] J. H. Clark, E. Choi, M. Collins, D. Garrette, T. Kwiatkowski, V. Nikolaev, and J. Palomaki, “TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages,” Transactions of the Association for Computational Linguistics, vol. 8, pp. 454–470, 2020. [7] L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, and C. Raffel, “mT5: A massively multilingual pre-trained text-to-text transformer,” arXiv preprint arXiv:2010.11934, 2020. [8] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018. [9] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov, “Unsupervised cross-lingual representation learning at scale,” arXiv preprint arXiv:1911.02116, 2019. [10] N. Reimers and I. Gurevych, “Sentence-BERT: Sentence embeddings using Siamese BERT-networks,” arXiv preprint arXiv:1908.10084, 2019. [11] F. Feng, Y. Yang, D. Cer, N. Arivazhagan, and W. Wang, “Language-agnostic BERT sentence embedding,” arXiv preprint arXiv:2007.01852, 2020. en_US
dc.identifier.uri http://hdl.handle.net/123456789/1860
dc.description Supervised by Dr. Kamrul Hasan, Professor, Department of Computer Science and Engineering (CSE), Islamic University of Technology (IUT), Board Bazar, Gazipur-1704, Bangladesh. This thesis is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science and Engineering, 2022. en_US
dc.description.abstract With the vast amount of information available on the Internet, finding answers to questions is as important as ever. In Natural Language Processing research, Question Answering (QA) and Query-based Text Summarization (QBSUM) address this challenge. However, most existing work neglects low-resource languages such as Bangla, so few quality datasets for them are available in the literature. To address this research gap, we propose a semi-automated methodology for generating a Bangla dataset with Natural Questions for three tasks: Question Answering (QA), Query-based Single-Document Text Summarization (SD-QBSUM), and Query-based Multi-Document Text Summarization (MD-QBSUM). We then provide baselines for this dataset on those tasks and compare our dataset with existing ones on various metrics. en_US
dc.language.iso en en_US
dc.publisher Department of Computer Science and Engineering (CSE), Islamic University of Technology (IUT), Board Bazar, Gazipur, Bangladesh en_US
dc.subject Question Answering, Query Based Single Document Summarization, Query Based Multi-Document Summarization, Semi-Automatic Approach, Semi-Supervised Method, Natural Question, mT5, mBERT, SBERT en_US
dc.title A Semi-Automated Approach to Generate Bangla Dataset for Question-Answering and Query-Based Text Summarization en_US
dc.type Thesis en_US

