Semi-Supervised Question Answering with Question-Answer Pair Generation in Bengali


dc.contributor.author Ehsan, Md. Amimul
dc.contributor.author Shahriar, Md. Shihab
dc.contributor.author Chowdhury, Ahmad Al Fayad
dc.date.accessioned 2023-01-27T05:33:12Z
dc.date.available 2023-01-27T05:33:12Z
dc.date.issued 2022-05-30
dc.identifier.uri http://hdl.handle.net/123456789/1666
dc.description Supervised by Dr. Abu Raihan Mostofa Kamal, Professor, Department of Computer Science and Engineering (CSE), Islamic University of Technology (IUT), Board Bazar, Gazipur-1704, Bangladesh. This thesis is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science and Engineering, 2022. en_US
dc.description.abstract Although deep learning architectures and large-scale datasets have led to strong performance on question answering tasks in high-resource languages like English, performance in low-resource languages such as Bengali remains considerably poorer. This gap stems from the scarcity of labeled data, which in turn reflects the substantial human effort and time required to create such datasets. We develop a Bengali translation of the Stanford Question Answering Dataset (SQuAD) 1.1 and ensure its quality by using a state-of-the-art translation model together with a novel embedding-based matching approach that aligns answer spans in the target language (Bengali) with those in the source language (English). We also introduce an end-to-end question-answer generation (QAG) system for Bengali that produces question answering (QA) datasets by incorporating roundtrip consistency into a sequence-to-sequence generation task built on Google's mT5 model. Additionally, we train three different QA models on our translated Bengali dataset, achieving EM and F1 scores of 46.1 and 66.2, respectively. Finally, we demonstrate the effectiveness of our QAG model by generating a domain-specific QA dataset from a sample collection of news articles. en_US
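The two mechanisms named in the abstract can be illustrated with short sketches. First, the embedding-based answer-span alignment: because the context and the answer are translated independently, the answer string rarely survives translation verbatim, so the aligned span can be recovered by embedding candidate spans of the translated context and choosing the one closest to the embedded translated answer. The sketch below is a minimal illustration under assumed choices; the LaBSE encoder, the whitespace tokenization, and the span window are placeholders, not the thesis's actual configuration.

```python
# Minimal sketch of embedding-based answer-span alignment. The encoder
# checkpoint, whitespace tokenization, and span window are illustrative
# assumptions, NOT the configuration used in the thesis.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/LaBSE")  # covers Bengali

def align_answer(bn_context: str, bn_answer: str, max_window: int = 8) -> str:
    """Return the span of the translated context whose embedding is most
    similar (by cosine) to the independently translated answer string."""
    tokens = bn_context.split()
    # Enumerate candidate spans of up to max_window tokens.
    candidates = [
        " ".join(tokens[i:i + n])
        for n in range(1, max_window + 1)
        for i in range(len(tokens) - n + 1)
    ]
    if not candidates:
        return bn_answer
    span_vecs = encoder.encode(candidates, normalize_embeddings=True)
    ans_vec = encoder.encode([bn_answer], normalize_embeddings=True)[0]
    # With unit-normalized vectors, the dot product equals cosine similarity.
    return candidates[int(np.argmax(span_vecs @ ans_vec))]
```

Second, roundtrip consistency: a generated (question, answer) pair is kept only if an independent QA model, asked the generated question over the same context, predicts the same answer. A hedged sketch, assuming a Hugging Face extractive-QA pipeline and an exact-match criterion (the model identifier below is a placeholder):

```python
from transformers import pipeline

# Placeholder checkpoint: any extractive QA model fine-tuned for Bengali
# could be slotted in here; this is not the model from the thesis.
qa = pipeline("question-answering", model="placeholder/bengali-qa-model")

def roundtrip_consistent(context: str, question: str, answer: str) -> bool:
    """Keep a generated QA pair only if the QA model recovers its answer."""
    predicted = qa(question=question, context=context)["answer"]
    return predicted.strip() == answer.strip()  # exact match; token-level F1 is an alternative
```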
dc.language.iso en en_US
dc.publisher Department of Computer Science and Engineering (CSE), Islamic University of Technology (IUT), Board Bazar, Gazipur-1704, Bangladesh en_US
dc.subject Question Answering, Question Answer Generation, Low Resource Language, Translated Dataset, Synthetic Dataset en_US
dc.title Semi-Supervised Question Answering with Question-Answer Pair Generation in Bengali en_US
dc.type Thesis en_US

