Abstract:
Although deep learning architectures and large-scale datasets have led to strong performance on question answering tasks in high-resource languages like English, their performance on lower-resource languages, like Bengali, is considerably poorer. This is due to the
scarcity of labeled data, which can be attributed to the massive amount of human effort
and time required to create such datasets. We work towards a Bengali translation of the
Stanford Question Answering Dataset (SQuAD) 1.1 and ensure that it is of high quality by
using a state-of-the-art translation model and a novel embedding-based matching approach
that aligns the answer spans in the target language (Bengali) with the corresponding spans
in the source language (English). We also introduce an end-to-end question-answer generation
(QAG) system for the Bengali language that generates question answering (QA) datasets for
QA models. The system incorporates roundtrip consistency into a sequence-to-sequence
generation task built on Google's mT5 model. Additionally, we train three different QA models
on our translated Bengali dataset, achieving exact match (EM) and F1 scores of 46.1 and
66.2, respectively. Finally, we
demonstrate the effectiveness of our QAG model on a sample dataset of news articles in
generating domain-specific QA datasets.
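For readers unfamiliar with the two evaluation ideas named above, the following is a minimal sketch, not the thesis implementation. It assumes SQuAD-style token-level EM and F1 with a simplified normalization (lowercasing, stripping ASCII punctuation and the Bengali danda, whitespace tokenization), and it treats roundtrip consistency as keeping a generated question-answer pair only when a QA model recovers the same answer. The names `normalize`, `roundtrip_filter`, and `answer_fn` are hypothetical and introduced only for illustration.

```python
import string
from collections import Counter
from typing import Callable, Iterable, List, Tuple

# Assumption: ASCII punctuation plus the Bengali danda is stripped before comparison.
_PUNCT = set(string.punctuation) | {"।"}


def normalize(text: str) -> List[str]:
    """Lowercase, drop punctuation, and split on whitespace (simplified)."""
    cleaned = "".join(ch for ch in text.lower() if ch not in _PUNCT)
    return cleaned.split()


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted and a reference answer."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


Triple = Tuple[str, str, str]  # (context, question, answer)


def roundtrip_filter(
    candidates: Iterable[Triple],
    answer_fn: Callable[[str, str], str],
) -> List[Triple]:
    """Keep generated triples whose answer a QA model reproduces.

    answer_fn is a hypothetical stand-in for any QA model: it takes
    (context, question) and returns a predicted answer string.
    """
    kept = []
    for context, question, answer in candidates:
        if exact_match(answer_fn(context, question), answer) == 1.0:
            kept.append((context, question, answer))
    return kept
```

Corpus-level EM and F1 are conventionally reported as the mean of these per-example values multiplied by 100. A softer roundtrip filter could accept pairs whose F1 exceeds a threshold instead of requiring an exact match.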
Description:
Supervised by
Dr. Abu Raihan Mostofa Kamal
Professor, Department of Computer Science and Engineering (CSE),
Islamic University of Technology (IUT),
Board Bazar, Gazipur-1704, Bangladesh.
This thesis is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science and Engineering, 2022.