Neural IR-based Approaches for Bangla Text Retrieval


dc.contributor.author Khan, Mohammed Sami
dc.contributor.author Sami, Khaja Abdus
dc.contributor.author Rafi, Prottoy
dc.date.accessioned 2025-03-10T05:40:26Z
dc.date.available 2025-03-10T05:40:26Z
dc.date.issued 2024-06-30
dc.identifier.citation
[1] Y. Bai, X. Li, G. Wang, et al., “Sparterm: Learning term-based sparse representation for fast text retrieval,” ArXiv, vol. abs/2010.00768, 2020. [Online]. Available: https://api.semanticscholar.org/CorpusID:222125038.
[2] N. J. Belkin and W. B. Croft, “Information filtering and information retrieval: Two sides of the same coin?” Commun. ACM, vol. 35, no. 12, pp. 29–38, Dec. 1992, issn: 0001-0782. doi: 10.1145/138859.138861. [Online]. Available: https://doi.org/10.1145/138859.138861.
[3] M. Bendersky, H. Zhuang, J. Ma, S. Han, K. B. Hall, and R. T. McDonald, “Rrf102: Meeting the trec-covid challenge with a 100+ runs ensemble,” ArXiv, vol. abs/2010.00200, 2020. [Online]. Available: https://api.semanticscholar.org/CorpusID:222090625.
[4] A. Bhattacharjee, T. Hasan, W. U. Ahmad, et al., “Banglabert: Language model pretraining and benchmarks for low-resource language understanding evaluation in bangla,” arXiv preprint arXiv:2101.00204, 2021.
[5] L. Bonifacio, H. Abonizio, M. Fadaee, and R. Nogueira, “Inpars: Data augmentation for information retrieval using large language models,” arXiv preprint arXiv:2202.05144, 2022.
[6] A. Bookstein, D. R. Swanson, et al., “Probabilistic models for automatic indexing,” J. Am. Soc. Inf. Sci., vol. 25, no. 5, pp. 312–316, 1974.
[7] L. Boualili and A. Yates, “A study of term-topic embeddings for ranking,” in European Conference on Information Retrieval, Springer, 2023, pp. 359–366.
[8] C. J. Burges, “From ranknet to lambdarank to lambdamart: An overview,” Learning, vol. 11, no. 23-581, p. 81, 2010.
[9] Z. Dai, V. Y. Zhao, J. Ma, et al., “Promptagator: Few-shot dense retrieval from 8 examples,” arXiv preprint arXiv:2209.11755, 2022.
[10] A. Das, J. Acharya, B. Kundu, and S. Chakraborti, “Revisiting anwesha: Enhancing personalised and natural search in bangla,” in Proceedings of the 19th International Conference on Natural Language Processing (ICON), 2022, pp. 183–193.
[11] A. Das, B. Kundu, L. Ghorai, A. K. Gupta, and S. Chakraborti, “Anwesha: A tool for semantic search in bangla,” 2021.
[12] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, “Indexing by latent semantic analysis,” Journal of the American Society for Information Science, vol. 41, no. 6, pp. 391–407, 1990.
[13] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Bidirectional encoder representations from transformers,” 2016.
[14] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
[15] F. Feng, Y. Yang, D. Cer, N. Arivazhagan, and W. Wang, “Language-agnostic BERT sentence embedding,” CoRR, vol. abs/2007.01852, 2020. arXiv: 2007.01852. [Online]. Available: https://arxiv.org/abs/2007.01852.
[16] T. Formal, C. Lassance, B. Piwowarski, and S. Clinchant, “SPLADE v2: Sparse lexical and expansion model for information retrieval,” CoRR, vol. abs/2109.10086, 2021. arXiv: 2109.10086. [Online]. Available: https://arxiv.org/abs/2109.10086.
[17] N. Fuhr, “Probabilistic models in information retrieval,” The Computer Journal, vol. 35, no. 3, pp. 243–255, 1992.
[18] E. Gabrilovich, S. Markovitch, et al., “Computing semantic relatedness using wikipedia-based explicit semantic analysis,” in IJCAI, vol. 7, 2007, pp. 1606–1611.
[19] L. Gao, Z. Dai, and J. Callan, “Understanding bert rankers under distillation,” in Proceedings of the 2020 ACM SIGIR on International Conference on Theory of Information Retrieval, ser. ICTIR ’20, Virtual Event, Norway: Association for Computing Machinery, 2020, pp. 149–152, isbn: 9781450380676. doi: 10.1145/3409256.3409838. [Online]. Available: https://doi.org/10.1145/3409256.3409838.
[20] L. Gao, X. Ma, J. Lin, and J. Callan, “Tevatron: An efficient and flexible toolkit for dense retrieval,” arXiv preprint arXiv:2203.05765, 2022.
[21] S. Haq, A. Sharma, and P. Bhattacharyya, “Indicirsuite: Multilingual dataset and neural information models for indian languages,” arXiv preprint arXiv:2312.09508, 2023.
[22] X. He, Y. Gong, A. Jin, et al., “Metric-guided distillation: Distilling knowledge from the metric to ranker and retriever for generative commonsense reasoning,” arXiv preprint arXiv:2210.11708, 2022.
[23] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
[24] S. Hofstätter, S. Althammer, M. Schröder, M. Sertkan, and A. Hanbury, “Improving efficient neural ranking models with cross-architecture knowledge distillation,” arXiv preprint arXiv:2010.02666, 2020.
[25] S. Hofstätter, S.-C. Lin, J.-H. Yang, J. Lin, and A. Hanbury, “Efficiently teaching an effective dense retriever with balanced topic aware sampling,” in Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021, pp. 113–122.
[26] A. I. Hossain and Asaduzzaman, “A bangla text search engine using pointwise approach of learn to rank (ltr) algorithm,” in 2022 IEEE International Women in Engineering (WIE) Conference on Electrical and Computer Engineering (WIECON-ECE), 2022, pp. 227–232. doi: 10.1109/WIECON-ECE57977.2022.10151362.
[27] M. R. Islam, J. Rahman, M. R. Talha, and F. Chowdhury, “Query expansion for bangla search engine pipilika,” in 2020 IEEE Region 10 Symposium (TENSYMP), IEEE, 2020, pp. 1367–1370.
[28] V. Karpukhin, B. Oğuz, S. Min, et al., “Dense passage retrieval for open-domain question answering,” arXiv preprint arXiv:2004.04906, 2020.
[29] O. Khattab and M. Zaharia, “Colbert: Efficient and effective passage search via contextualized late interaction over bert,” in Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 39–48.
[30] B. Koopman, A. Mourad, H. Li, et al., “Agask: An agent to help answer farmer’s questions from scientific documents,” International Journal on Digital Libraries, pp. 1–16, 2023.
[31] S. Levy, In the Plex: How Google Thinks, Works, and Shapes Our Lives. Simon & Schuster, 2011, isbn: 9781416596714. [Online]. Available: https://books.google.com.bd/books?id=V1u1f8sv3k8C.
[32] J. Lin, X. Ma, S.-C. Lin, J.-H. Yang, R. Pradeep, and R. Nogueira, “Pyserini: A python toolkit for reproducible information retrieval research with sparse and dense representations,” in Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021, pp. 2356–2362.
[33] S.-C. Lin, M. Li, and J. Lin, “Aggretriever: A simple approach to aggregate textual representations for robust dense passage retrieval,” Transactions of the Association for Computational Linguistics, vol. 11, pp. 436–452, 2023.
[34] S.-C. Lin, J.-H. Yang, and J. Lin, “In-batch negatives for knowledge distillation with tightly-coupled teachers for dense retrieval,” in Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021), 2021, pp. 163–173.
[35] T.-Y. Liu et al., “Learning to rank for information retrieval,” Foundations and Trends® in Information Retrieval, vol. 3, no. 3, pp. 225–331, 2009.
[36] X. Liu and W. B. Croft, “Passage retrieval based on language models,” in Proceedings of the Eleventh International Conference on Information and Knowledge Management, ser. CIKM ’02, McLean, Virginia, USA: Association for Computing Machinery, 2002, pp. 375–382, isbn: 1581134924. doi: 10.1145/584792.584854. [Online]. Available: https://doi.org/10.1145/584792.584854.
[37] C. Manning, P. Raghavan, and H. A. Schütze, “Introduction to information retrieval. Stanford NLP Group, Cambridge University Press,” 2009.
[38] I. Matveeva, C. Burges, T. Burkard, A. Laucius, and L. Wong, “High accuracy retrieval with multiple nested ranker,” in Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ser. SIGIR ’06, Seattle, Washington, USA: Association for Computing Machinery, 2006, pp. 437–444, isbn: 1595933697. doi: 10.1145/1148170.1148246. [Online]. Available: https://doi.org/10.1145/1148170.1148246.
[39] B. Mitra and N. Craswell. 2018.
[40] T. Nguyen, M. Rosenberg, X. Song, et al., “Ms marco: A human generated machine reading comprehension dataset,” choice, vol. 2640, p. 660, 2016.
[41] R. Nogueira and K. Cho, “Passage re-ranking with bert,” arXiv preprint arXiv:1901.04085, 2019.
[42] R. Nogueira and J. Lin, From doc2query to docTTTTTquery. Dec. 2019.
[43] R. Nogueira, J. Lin, and A. Epistemic, “From doc2query to doctttttquery,” Online preprint, vol. 6, no. 2, 2019.
[44] R. Nogueira, W. Yang, K. Cho, and J. Lin, “Multi-stage document ranking with bert,” arXiv preprint arXiv:1910.14424, 2019.
[45] Y. Qu, Y. Ding, J. Liu, et al., “Rocketqa: An optimized training approach to dense passage retrieval for open-domain question answering,” arXiv preprint arXiv:2010.08191, 2020.
[46] R. Ren, Y. Qu, J. Liu, et al., “Rocketqav2: A joint training method for dense passage retrieval and passage re-ranking,” arXiv preprint arXiv:2110.07367, 2021.
[47] S. Robertson and H. Zaragoza, “The probabilistic relevance framework: Bm25 and beyond,” Found. Trends Inf. Retr., vol. 3, no. 4, pp. 333–389, Apr. 2009, issn: 1554-0669. doi: 10.1561/1500000019. [Online]. Available: https://doi.org/10.1561/1500000019.
[48] S. E. Robertson, “The probability ranking principle in ir,” Journal of Documentation, vol. 33, no. 4, pp. 294–304, 1977.
[49] S. E. Robertson, S. Walker, S. Jones, M. M. Hancock-Beaulieu, M. Gatford, et al., “Okapi at trec-3,” Nist Special Publication Sp, vol. 109, p. 109, 1995.
[50] G. Salton, A. Wong, and C.-S. Yang, “A vector space model for automatic indexing,” Communications of the ACM, vol. 18, no. 11, pp. 613–620, 1975.
[51] K. Santhanam, O. Khattab, C. Potts, and M. Zaharia, “Plaid: An efficient engine for late interaction retrieval,” in Proceedings of the 31st ACM International Conference on Information & Knowledge Management, 2022, pp. 1747–1756.
[52] K. Santhanam, O. Khattab, J. Saad-Falcon, C. Potts, and M. Zaharia, “Colbertv2: Effective and efficient retrieval via lightweight late interaction,” arXiv preprint arXiv:2112.01488, 2021.
[53] C. Szegedy, W. Zaremba, I. Sutskever, et al., Intriguing properties of neural networks, Feb. 2014. [Online]. Available: https://arxiv.org/abs/1312.6199.
[54] N. Thakur, N. Reimers, A. Rücklé, A. Srivastava, and I. Gurevych, “Beir: A heterogenous benchmark for zero-shot evaluation of information retrieval models,” arXiv preprint arXiv:2104.08663, 2021.
[55] A. Vaswani, N. Shazeer, N. Parmar, et al., “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.
[56] E. Voorhees and D. K. Harman, TREC: Experiment and Evaluation in Information Retrieval. MIT Press, 2005.
[57] E. M. Voorhees and D. K. Harman, “Overview of the seventh text retrieval conference (trec-7) [on-line],” 1999. [Online]. Available: https://api.semanticscholar.org/CorpusID:13102016.
[58] S. Walker, S. Robertson, M. Boughanem, G. J. F. Jones, and K. S. Jones, “Okapi at trec-6: Automatic adhoc, vlc, routing, filtering and qsdr,” in The Sixth Text REtrieval Conference (TREC-6), Gaithersburg, MD: NIST, Jan. 1998, pp. 125–136. [Online]. Available: https://www.microsoft.com/en-us/research/publication/okapi-at-trec-6-automatic-adhoc-vlc-routing-filtering-and-qsdr/.
[59] L. Wang, J. Lin, and D. Metzler, “A cascade ranking model for efficient ranked retrieval,” in Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, ser. SIGIR ’11, Beijing, China: Association for Computing Machinery, 2011, pp. 105–114, isbn: 9781450307574. doi: 10.1145/2009916.2009934. [Online]. Available: https://doi.org/10.1145/2009916.2009934.
[60] R. W. White, Interactions with Search Systems. Cambridge University Press, New York, 2016.
[61] L. Xiong, C. Xiong, Y. Li, et al., “Approximate nearest neighbor negative contrastive learning for dense text retrieval,” arXiv preprint arXiv:2007.00808, 2020.
[62] A. Yates, R. Nogueira, and J. Lin, “Pretrained transformers for text ranking: Bert and beyond,” in Proceedings of the 14th ACM International Conference on Web Search and Data Mining, 2021, pp. 1154–1156.
[63] X. Zhang, X. Ma, P. Shi, and J. Lin, “Mr. tydi: A multi-lingual benchmark for dense retrieval,” arXiv preprint arXiv:2108.08787, 2021.
[64] X. Zhang, K. Ogueji, X. Ma, and J. Lin, “Toward best practices for training multilingual dense retrieval models,” ACM Transactions on Information Systems, vol. 42, no. 2, pp. 1–33, 2023.
[65] X. Zhang, N. Thakur, O. Ogundepo, et al., “Miracl: A multilingual retrieval dataset covering 18 diverse languages,” Transactions of the Association for Computational Linguistics, vol. 11, pp. 1114–1131, 2023.
[66] S. Zhuang and G. Zuccon, “Fast passage re-ranking with contextualized exact term matching and efficient passage expansion,” CoRR, vol. abs/2108.08513, 2021. arXiv: 2108.08513. [Online]. Available: https://arxiv.org/abs/2108.08513. en_US
dc.identifier.uri http://hdl.handle.net/123456789/2368
dc.description Supervised by Mr. Md. Mohsinul Kabir, Assistant Professor; Dr. Hasan Mahmud, Associate Professor; and Dr. Md. Kamrul Hasan, Professor, Department of Computer Science and Engineering (CSE), Islamic University of Technology (IUT), Board Bazar, Gazipur, Bangladesh. This thesis is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science and Engineering, 2024 en_US
dc.description.abstract Information retrieval (IR) for Bangla text has received relatively little attention despite the widespread global usage of the language. The rich morphology and lack of capitalization in Bangla present challenges for the direct application of standard IR models developed predominantly for English text. This report traces the gradual development of IR methods for text retrieval, including work on Bangla texts. The unsupervised nature of the methods used for Bangla text retrieval makes them unsuitable for specific domains. To mitigate the problems caused by the lack of domain-specific training, different modern neural information retrieval techniques need to be explored that can handle different data availability scenarios. In this report, we experiment with different neural information retrieval techniques at different percentages of available data and provide guidelines on building information retrieval pipelines for the Bangla language. We also introduce a dataset containing rice-related scientific texts along with human-annotated questions, which we used to train and evaluate the performance of domain-specific neural information retrieval architectures. en_US
dc.language.iso en en_US
dc.publisher Department of Computer Science and Engineering (CSE), Islamic University of Technology (IUT), Board Bazar, Gazipur-1704, Bangladesh en_US
dc.title Neural IR-based Approaches for Bangla Text Retrieval en_US
dc.type Thesis en_US

