Development of A Code Search Engine Using Natural Language Processing Technique

Khan, Mohammad Abdullah Matin

dc.contributor.author	Khan, Mohammad Abdullah Matin
dc.date.accessioned	2024-09-11T05:26:35Z
dc.date.available	2024-09-11T05:26:35Z
dc.date.issued	2023-12-30
dc.identifier.citation	[1] R. Agashe, S. Iyer, and L. Zettlemoyer, “Juice: A large scale distantly super vised dataset for open domain context-based code generation,” arXiv preprint arXiv:1910.02216, 2019. [2] W. U. Ahmad, S. Chakraborty, B. Ray, and K. Chang, “Unified pre-training for program understanding and generation,” in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tür, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, and Y. Zhou, Eds. Association for Computational Linguistics, 2021, pp. 2655–2668. [Online]. Available: https://doi.org/10.18653/v1/2021.naacl-main.211 [3] W. U. Ahmad, M. G. R. Tushar, S. Chakraborty, and K.-W. Chang, “Avatar: A parallel corpus for java-python program translation,” arXiv preprint arXiv:2108.11590, 2021. [4] B. Athiwaratkun, S. K. Gouda, Z. Wang, X. Li, Y. Tian, M. Tan, W. U. Ahmad, S. Wang, Q. Sun, M. Shang, S. K. Gonugondla, H. Ding, V. Kumar, N. Fulton, A. Farahani, S. Jain, R. Giaquinto, H. Qian, M. K. Ramanathan, R. Nallapati, B. Ray, P. Bhatia, S. Sengupta, D. Roth, and B. Xiang, “Multi-lingual evaluation of code generation models,” 2022. [Online]. Available: https://arxiv.org/abs/2210.14868 [5] J. Austin, A. Odena, M. I. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. J. Cai, M. Terry, Q. V. Le, and C. Sutton, “Program synthesis with large language models,” CoRR, vol. abs/2108.07732, 2021. [Online]. Available: https://arxiv.org/abs/2108.07732 [6] B. Berabi, J. He, V. Raychev, and M. T. Vechev, “Tfix: Learning to fix coding errors with a text-to-text transformer,” in Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, ser. Proceedings 95 of Machine Learning Research, M. Meila and T. Zhang, Eds., vol. 139. PMLR, 2021, pp. 780–791. [Online]. Available: http://proceedings.mlr.press/v139/berabi21a.html [7] A. Borji, “A categorical archive of chatgpt failures,” 2023. [8] J. Cambronero, H. Li, S. Kim, K. Sen, and S. Chandra, “When deep learning met code search,” in Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2019, pp. 964–974. [9] F. Cassano, J. Gouwar, D. Nguyen, S. Nguyen, L. Phipps-Costin, D. Pinckney, M.-H. Yee, Y. Zi, C. J. Anderson, M. Q. Feldman, A. Guha, M. Greenberg, and A. Jangda, “Multipl-e: A scalable and extensible approach to benchmarking neural code generation,” 2022. [10] S. Chandel, C. B. Clement, G. Serrato, and N. Sundaresan, “Training and evaluating a jupyter notebook data science assistant,” arXiv preprint arXiv:2201.12901, 2022. [11] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse, A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, and W. Zaremba, “Evaluating large language models trained on code,” CoRR, vol. abs/2107.03374, 2021. [Online]. Available: https://arxiv.org/abs/2107.03374 [12] X. Chen, K. Lakhotia, B. Oguz, A. Gupta, P. S. H. Lewis, S. Peshterliev, Y. Mehdad, S. Gupta, and W. Yih, “Salient phrase aware dense retrieval: Can a dense retriever imitate a sparse one?” CoRR, vol. abs/2110.06918, 2021. [Online]. Available: https://arxiv.org/abs/2110.06918 [13] Y. Cheng and L. Kuang, “Csrs: code search with relevance matching and seman tic matching,” in Proceedings of the 30th IEEE/ACM International Conference on Program Comprehension, 2022, pp. 533–542. [14] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, P. Schuh, K. Shi, S. Tsvyashchenko, 96 J. Maynez, A. Rao, P. Barnes, Y. Tay, N. Shazeer, V. Prabhakaran, E. Reif, N. Du, B. Hutchinson, R. Pope, J. Bradbury, J. Austin, M. Isard, G. Gur-Ari, P. Yin, T. Duke, A. Levskaya, S. Ghemawat, S. Dev, H. Michalewski, X. Garcia, V. Misra, K. Robinson, L. Fedus, D. Zhou, D. Ippolito, D. Luan, H. Lim, B. Zoph, A. Spiridonov, R. Sepassi, D. Dohan, S. Agrawal, M. Omernick, A. M. Dai, T. S. Pillai, M. Pellat, A. Lewkowycz, E. Moreira, R. Child, O. Polozov, K. Lee, Z. Zhou, X. Wang, B. Saeta, M. Diaz, O. Firat, M. Catasta, J. Wei, K. Meier-Hellstern, D. Eck, J. Dean, S. Petrov, and N. Fiedel, “Palm: Scaling language modeling with pathways,” CoRR, vol. abs/2204.02311, 2022. [Online]. Available: https://doi.org/10.48550/arXiv.2204.02311 [15] C. Cool and N. J. Belkin, Interactive information retrieval: history and background. Facet, 2011, p. 1–14. [16] J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio, Eds. Association for Computational Linguistics, 2019, pp. 4171–4186. [Online]. Available: https://doi.org/10.18653/v1/n19-1423 [17] L. Di Grazia and M. Pradel, “Code search: A survey of techniques for finding code,” ACM Computing Surveys, vol. 55, no. 11, pp. 1–31, 2023. [18] Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang, and M. Zhou, “Codebert: A pre-trained model for programming and natural languages,” in Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020, ser. Findings of ACL, T. Cohn, Y. He, and Y. Liu, Eds., vol. EMNLP 2020. Association for Computational Linguistics, 2020, pp. 1536–1547. [Online]. Available: https://doi.org/10.18653/v1/2020.findings-emnlp.139 [19] J. Finnie-Ansley, P. Denny, B. A. Becker, A. Luxton-Reilly, and J. Prather, “The robots are coming: Exploring the implications of openai codex on introductory programming,” in Proceedings of the 24th Australasian Computing Education Conference, ser. ACE ’22. New York, NY, USA: Association for Computing Machinery, 2022, p. 10–19. [Online]. Available: https://doi.org/10.1145/3511861.3511863 97 [20] L. Fu, H. Chai, S. Luo, K. Du, W. Zhang, L. Fan, J. Lei, R. Rui, J. Lin, Y. Fang, Y. Liu, J. Wang, S. Qi, K. Zhang, W. Zhang, and Y. Yu, “Codeapex: A bilingual programming evaluation benchmark for large language models,” CoRR, vol. abs/2309.01940, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2309.01940 [21] J. A. Goldstein, G. Sastry, M. Musser, R. DiResta, M. Gentzel, and K. Sedova, “Generative language models and automated influence operations: Emerging threats and potential mitigations,” 2023. [22] X. Gu, H. Zhang, and S. Kim, “Deep code search,” in Proceedings of the 40th International Conference on Software Engineering, 2018, pp. 933–944. [23] D. Guo, S. Lu, N. Duan, Y. Wang, M. Zhou, and J. Yin, “Unixcoder: Unified cross-modal pre-training for code representation,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, S. Muresan, P. Nakov, and A. Villavicencio, Eds. Association for Computational Linguistics, 2022, pp. 7212–7225. [Online]. Available: https://doi.org/10.18653/v1/2022.acl-long.499 [24] R. Guo, S. Kumar, K. Choromanski, and D. Simcha, “Quantization based fast inner product search,” in Artificial intelligence and statistics. PMLR, 2016, pp. 482–490. [25] R. Gupta, S. Pal, A. Kanade, and S. Shevade, “Deepfix: Fixing common c language er rors by deep learning,” in Proceedings of the aaai conference on artificial intelligence, vol. 31, no. 1, 2017. [26] D. Hendrycks, S. Basart, S. Kadavath, M. Mazeika, A. Arora, E. Guo, C. Burns, S. Puranik, H. He, D. Song, and J. Steinhardt, “Measuring coding challenge competence with APPS,” in Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual, J. Vanschoren and S. Yeung, Eds., 2021. [Online]. Available: https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/ c24cd76e1ce41366a4bbe8a49b02a028-Abstract-round2.html [27] C. Hidey and K. McKeown, “Identifying causal relations using parallel Wikipedia articles,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Berlin, Germany: Association for Computational Linguistics, Aug. 2016, pp. 1424–1433. [Online]. Available: https://aclanthology.org/P16-1135 98 [28] X. Hu, G. Li, X. Xia, D. Lo, S. Lu, and Z. Jin, “Summarizing source code with transferred api knowledge,” 2018. [29] J. Huang, C. Wang, J. Zhang, C. Yan, H. Cui, J. P. Inala, C. Clement, N. Duan, and J. Gao, “Execution-based evaluation for data science code generation models,” arXiv preprint arXiv:2211.09374, 2022. [30] H. Husain, H. Wu, T. Gazit, M. Allamanis, and M. Brockschmidt, “Codesearchnet challenge: Evaluating the state of semantic code search,” CoRR, vol. abs/1909.09436, 2019. [Online]. Available: http://arxiv.org/abs/1909.09436 [31] S. Iyer, I. Konstas, A. Cheung, and L. Zettlemoyer, “Mapping language to code in programmatic context,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, E. Riloff, D. Chiang, J. Hockenmaier, and J. Tsujii, Eds. Association for Computational Linguistics, 2018, pp. 1643–1652. [Online]. Available: https://doi.org/10.18653/v1/d18-1192 [32] G. Izacard, M. Caron, L. Hosseini, S. Riedel, P. Bojanowski, A. Joulin, and E. Grave, “Unsupervised dense information retrieval with contrastive learning,” Transactions on Machine Learning Research, 2022. [Online]. Available: https://openreview.net/forum?id=jKN1pXi7b0 [33] M. Izadi, R. Gismondi, and G. Gousios, “Codefill: Multi-token code completion by jointly learning from structure and naming sequences,” in 44th IEEE/ACM 44th International Conference on Software Engineering, ICSE 2022, Pittsburgh, PA, USA, May 25-27, 2022. ACM, 2022, pp. 401–412. [Online]. Available: https://doi.org/10.1145/3510003.3510172 [34] J. Johnson, M. Douze, and H. Jégou, “Billion-scale similarity search with GPUs,” IEEE Transactions on Big Data, vol. 7, no. 3, pp. 535–547, 2019. [35] R. Just, D. Jalali, and M. D. Ernst, “Defects4j: A database of existing faults to enable controlled testing studies for java programs,” in Proceedings of the 2014 International Symposium on Software Testing and Analysis, ser. ISSTA 2014. New York, NY, USA: Association for Computing Machinery, 2014, p. 437–440. [Online]. Available: https://doi.org/10.1145/2610384.2628055 99 [36] V. Karpukhin, B. Oguz, S. Min, L. Wu, S. Edunov, D. Chen, and W. Yih, “Dense passage retrieval for open-domain question answering,” CoRR, vol. abs/2004.04906, 2020. [Online]. Available: https://arxiv.org/abs/2004.04906 [37] O. Khattab and M. Zaharia, “Colbert: Efficient and effective passage search via contextualized late interaction over bert,” in Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, ser. SIGIR ’20. New York, NY, USA: Association for Computing Machinery, 2020, p. 39–48. [Online]. Available: https://doi.org/10.1145/3397271.3401075 [38] J. Kim, S. Lee, S.-w. Hwang, and S. Kim, “Towards an intelligent code search engine,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 24, no. 1, 2010, pp. 1358–1363. [39] K. Kim, D. Kim, T. F. Bissyandé, E. Choi, L. Li, J. Klein, and Y. L. Traon, “Facoy: a code-to-code search engine,” in Proceedings of the 40th International Conference on Software Engineering, 2018, pp. 946–957. [40] D. Kocetkov, R. Li, L. B. Allal, J. Li, C. Mou, C. M. Ferrandis, Y. Jernite, M. Mitchell, S. Hughes, T. Wolf, D. Bahdanau, L. von Werra, and H. de Vries, “The stack: 3 tb of permissively licensed source code,” 2022. [41] S. Kulal, P. Pasupat, K. Chandra, M. Lee, O. Padon, A. Aiken, and P. Liang, “Spoc: Search-based pseudocode to code,” CoRR, vol. abs/1906.04908, 2019. [Online]. Available: http://arxiv.org/abs/1906.04908 [42] Y. Lai, C. Li, Y. Wang, T. Zhang, R. Zhong, L. Zettlemoyer, S. W.-t. Yih, D. Fried, S. Wang, and T. Yu, “Ds-1000: A natural and reliable benchmark for data science code generation,” arXiv preprint arXiv:2211.11501, 2022. [43] K. Lee, M.-W. Chang, and K. Toutanova, “Latent retrieval for weakly supervised open domain question answering,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics, Jul. 2019, pp. 6086–6096. [Online]. Available: https://aclanthology.org/P19-1612 [44] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. tau Yih, T. Rocktäschel, S. Riedel, and D. Kiela, “Retrieval-augmented generation for knowledge-intensive nlp tasks,” 2021. 100 [45] D. Li, Y. Shen, R. Jin, Y. Mao, K. Wang, and W. Chen, “Generation-augmented query expansion for code retrieval,” arXiv preprint arXiv:2212.10692, 2022. [46] R. Li, L. B. Allal, Y. Zi, N. Muennighoff, D. Kocetkov, C. Mou, M. Marone, C. Akiki, J. Li, J. Chim, Q. Liu, E. Zheltonozhskii, T. Y. Zhuo, T. Wang, O. Dehaene, M. Davaadorj, J. Lamy-Poirier, J. Monteiro, O. Shliazhko, N. Gontier, N. Meade, A. Zebaze, M.-H. Yee, L. K. Umapathi, J. Zhu, B. Lipkin, M. Oblokulov, Z. Wang, R. Murthy, J. Stillerman, S. S. Patel, D. Abulkhanov, M. Zocca, M. Dey, Z. Zhang, N. Fahmy, U. Bhattacharyya, W. Yu, S. Singh, S. Luccioni, P. Villegas, M. Kunakov, F. Zhdanov, M. Romero, T. Lee, N. Timor, J. Ding, C. Schlesinger, H. Schoelkopf, J. Ebert, T. Dao, M. Mishra, A. Gu, J. Robinson, C. J. Anderson, B. Dolan-Gavitt, D. Contractor, S. Reddy, D. Fried, D. Bahdanau, Y. Jernite, C. M. Ferrandis, S. Hughes, T. Wolf, A. Guha, L. von Werra, and H. de Vries, “Starcoder: may the source be with you!” 2023. [47] Y. Li, D. Choi, J. Chung, N. Kushman, J. Schrittwieser, R. Leblond, T. Eccles, J. Keeling, F. Gimeno, A. D. Lago, T. Hubert, P. Choy, C. de Masson d’Autume, I. Babuschkin, X. Chen, P.-S. Huang, J. Welbl, S. Gowal, A. Cherepanov, J. Molloy, D. J. Mankowitz, E. S. Robson, P. Kohli, N. de Freitas, K. Kavukcuoglu, and O. Vinyals, “Competition-level code generation with alphacode,” Science, vol. 378, no. 6624, pp. 1092–1097, 2022. [Online]. Available: https://www.science.org/doi/abs/10.1126/science.abq1158 [48] Libretexts, “6.3: Equivalence relations and partitions,” Jul 2020. [Online]. Avail able: https://math.libretexts.org/Courses/Monroe_Community_College/MTH_220_ Discrete_Math/6%3A_Relations/6.3%3A_Equivalence_Relations_and_Partitions [49] S.-C. Lin and J. Lin, “A dense representation framework for lexical and semantic matching,” 2023. [50] C. Liu, X. Xia, D. Lo, C. Gao, X. Yang, and J. Grundy, “Opportunities and challenges in code search tools,” ACM Computing Surveys (CSUR), vol. 54, no. 9, pp. 1–40, 2021. [51] S. Liu, Y. Chen, X. Xie, J. K. Siow, and Y. Liu, “Retrieval-augmented generation for code summarization via hybrid GNN,” in International Conference on Learning Representations, 2021. [Online]. Available: https://openreview.net/forum?id= zv-typ1gPxA 101 [52] S. Lu, D. Guo, S. Ren, J. Huang, A. Svyatkovskiy, A. Blanco, C. B. Clement, D. Drain, D. Jiang, D. Tang, G. Li, L. Zhou, L. Shou, L. Zhou, M. Tufano, M. Gong, M. Zhou, N. Duan, N. Sundaresan, S. K. Deng, S. Fu, and S. Liu, “Codexglue: A machine learning benchmark dataset for code understanding and generation,” in Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual, J. Vanschoren and S. Yeung, Eds., 2021. [Online]. Available: https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/ c16a5320fa475530d9583c34fd356ef5-Abstract-round1.html [53] Y. Luan, J. Eisenstein, K. Toutanova, and M. Collins, “Sparse, Dense, and Attentional Representations for Text Retrieval,” Transactions of the Association for Computational Linguistics, vol. 9, pp. 329–345, 04 2021. [Online]. Available: https://doi.org/10.1162/tacl_a_00369 [54] Z. Manna and R. J. Waldinger, “Toward automatic program synthesis,” Commun. ACM, vol. 14, no. 3, p. 151–165, mar 1971. [Online]. Available: https://doi.org/10.1145/362566.362568 [55] A. V. Miceli Barone and R. Sennrich, “A parallel corpus of python functions and documentation strings for automated code documentation and code generation,” in Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers). Taipei, Taiwan: Asian Federation of Natural Language Processing, Nov. 2017, pp. 314–319. [Online]. Available: https://aclanthology.org/I17-2053 [56] D. Mount, “Lecture 17 network flow: Extensions,” 2017. [Online]. Available: https://www.cs.umd.edu/class/fall2017/cmsc451-0101/Lects/lect17-flow-circ.pdf [57] N. Muennighoff, Q. Liu, A. Zebaze, Q. Zheng, B. Hui, T. Y. Zhuo, S. Singh, X. Tang, L. von Werra, and S. Longpre, “Octopack: Instruction tuning code large language models,” arXiv preprint arXiv:2308.07124, 2023. [58] T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, and L. Deng, “MS MARCO: A human generated machine reading comprehension dataset,” CoRR, vol. abs/1611.09268, 2016. [Online]. Available: http://arxiv.org/abs/1611.09268 [59] E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang, Y. Zhou, S. Savarese, and C. Xiong, “Codegen: An open large language model for code with multi-turn program synthesis,” arXiv preprint, 2022. 102 [60] R. Nogueira, Z. Jiang, R. Pradeep, and J. Lin, “Document ranking with a pretrained sequence-to-sequence model,” in Findings of the Association for Computational Linguistics: EMNLP 2020. Online: Association for Computational Linguistics, Nov. 2020, pp. 708–718. [Online]. Available: https://aclanthology.org/2020.findings-emnlp.63 [61] Y. Oda, H. Fudaba, G. Neubig, H. Hata, S. Sakti, T. Toda, and S. Nakamura, “Learning to generate pseudo-code from source code using statistical machine translation,” in 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2015, pp. 574–584. [62] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al., “Training language models to follow instructions with human feedback,” arXiv preprint arXiv:2203.02155, 2022. [63] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Philadelphia, Pennsylvania, USA: Association for Computational Linguistics, Jul. 2002, pp. 311–318. [Online]. Available: https://aclanthology.org/P02-1040 [64] M. R. Parvez, S. Chakraborty, B. Ray, and K.-W. Chang, “Building language models for text with named entities,” arXiv preprint arXiv:1805.04836, 2018. [65] M. R. Parvez, W. Ahmad, S. Chakraborty, B. Ray, and K.-W. Chang, “Retrieval augmented code generation and summarization,” in Findings of the Association for Computational Linguistics: EMNLP 2021. Punta Cana, Dominican Republic: Association for Computational Linguistics, Nov. 2021, pp. 2719–2734. [Online]. Available: https://aclanthology.org/2021.findings-emnlp.232 [66] R. Puri, D. S. Kung, G. Janssen, W. Zhang, G. Domeniconi, V. Zolotov, J. Dolby, J. Chen, M. Choudhury, L. Decker et al., “Codenet: A large-scale ai for code dataset for learning a diversity of coding tasks,” arXiv preprint arXiv:2105.12655, 2021. [67] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever et al., “Improving language understanding by generative pre-training,” 2018. [68] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text 103 transformer,” J. Mach. Learn. Res., vol. 21, pp. 140:1–140:67, 2020. [Online]. Available: http://jmlr.org/papers/v21/20-074.html [69] N. Reimers and I. Gurevych, “Sentence-bert: Sentence embeddings using siamese bert-networks,” arXiv preprint arXiv:1908.10084, 2019. [70] S. P. Reiss, “Semantics-based code search,” in 2009 IEEE 31st International Confer ence on Software Engineering. IEEE, 2009, pp. 243–253. [71] S. Ren, D. Guo, S. Lu, L. Zhou, S. Liu, D. Tang, N. Sundaresan, M. Zhou, A. Blanco, and S. Ma, “Codebleu: a method for automatic evaluation of code synthesis,” CoRR, vol. abs/2009.10297, 2020. [Online]. Available: https://arxiv.org/abs/2009.10297 [72] K. H. Rosen, Discrete mathematics and its applications. The McGraw Hill Compa nies„ 2007. [73] B. Roziere, M.-A. Lachaux, L. Chanussot, and G. Lample, “Unsupervised translation of programming languages,” Advances in Neural Information Processing Systems, vol. 33, 2020. [74] B. Roziere, J. M. Zhang, F. Charton, M. Harman, G. Synnaeve, and G. Lample, “Leveraging automated unit tests for unsupervised code translation,” arXiv preprint arXiv:2110.06773, 2021. [75] R. L. Russell, L. Y. Kim, L. H. Hamilton, T. Lazovich, J. Harer, O. Ozdemir, P. M. Ellingwood, and M. W. McConley, “Automated vulnerability detection in source code using deep representation learning,” in 17th IEEE International Conference on Machine Learning and Applications, ICMLA 2018, Orlando, FL, USA, December 17-20, 2018, M. A. Wani, M. M. Kantardzic, M. S. Mouchaweh, J. Gama, and E. Lughofer, Eds. IEEE, 2018, pp. 757–762. [Online]. Available: https://doi.org/10.1109/ICMLA.2018.00120 [76] S. Sachdev, H. Li, S. Luan, S. Kim, K. Sen, and S. Chandra, “Retrieval on source code: a neural code search,” in Proceedings of the 2nd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, 2018, pp. 31–41. [77] C. Sadowski, K. T. Stolee, and S. Elbaum, “How developers search for code: a case study,” in Proceedings of the 2015 10th joint meeting on foundations of software engineering, 2015, pp. 191–201. 104 [78] H. Sajnani, “Large-scale code clone detection,” PhD Thesis, University of California, Irvine, 2016. [79] M. Sanderson and W. B. Croft, “The history of information retrieval research,” Pro ceedings of the IEEE, vol. 100, no. Special Centennial Issue, pp. 1444–1451, 2012. [80] A. Shrivastava and P. Li, “Asymmetric lsh (alsh) for sublinear time maximum inner product search (mips),” Advances in neural information processing systems, vol. 27, 2014. [81] J. Svajlenko, J. F. Islam, I. Keivanloo, C. K. Roy, and M. M. Mia, “Towards a big data curated benchmark of inter-project code clones,” in 30th IEEE International Conference on Software Maintenance and Evolution, Victoria, BC, Canada, September 29 - October 3, 2014. IEEE Computer Society, 2014, pp. 476–480. [Online]. Available: https://doi.org/10.1109/ICSME.2014.77 [82] X. Tang, B. Qian, R. Gao, J. Chen, X. Chen, and M. Gerstein, “Biocoder: A benchmark for bioinformatics code generation with contextual pragmatic knowledge,” CoRR, vol. abs/2308.16458, 2023. [Online]. Available: https: //doi.org/10.48550/arXiv.2308.16458 [83] THUDM, “Codegeex: A multilingual code generation model,” https://github.com/ THUDM/CodeGeeX, 2022. [84] M. Tufano, C. Watson, G. Bavota, M. D. Penta, M. White, and D. Poshyvanyk, “An empirical study on learning bug-fixing patches in the wild via neural machine translation,” ACM Trans. Softw. Eng. Methodol., vol. 28, no. 4, pp. 19:1–19:29, 2019. [Online]. Available: https://doi.org/10.1145/3340544 [85] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017. [86] Y. Wan, J. Shu, Y. Sui, G. Xu, Z. Zhao, J. Wu, and P. Yu, “Multi-modal attention network learning for semantic source code retrieval,” in 2019 34th IEEE/ACM Inter national Conference on Automated Software Engineering (ASE). IEEE, 2019, pp. 13–25. [87] H. Wang, J. Li, H. Wu, E. Hovy, and Y. Sun, “Pre-trained language models and their applications,” Engineering, 2022. [Online]. Available: https: //www.sciencedirect.com/science/article/pii/S2095809922006324 105 [88] K. Wang, N. Thakur, N. Reimers, and I. Gurevych, “GPL: Generative pseudo labeling for unsupervised domain adaptation of dense retrieval,” in Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Seattle, United States: Association for Computational Linguistics, Jul. 2022, pp. 2345–2360. [Online]. Available: https://aclanthology.org/2022.naacl-main.168 [89] Y. Wang, W. Wang, S. R. Joty, and S. C. H. Hoi, “Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, M. Moens, X. Huang, L. Specia, and S. W. Yih, Eds. Association for Computational Linguistics, 2021, pp. 8696–8708. [Online]. Available: https://doi.org/10.18653/v1/2021.emnlp-main.685 [90] Z. Wang, G. Cuenca, S. Zhou, F. F. Xu, and G. Neubig, “Mconala: a benchmark for code generation from multiple natural languages,” arXiv preprint arXiv:2203.08388, 2022. [91] Z. Wang, S. Zhou, D. Fried, and G. Neubig, “Execution-based evaluation for open domain code generation,” arXiv preprint arXiv:2212.10481, 2022. [92] J. Wei, M. Bosma, V. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le, “Finetuned language models are zero-shot learners,” in International Conference on Learning Representations, 2022. [Online]. Available: https://openreview.net/forum?id=gEZrGCozdqR [93] C. S. Xia and L. Zhang, “Less training, more repairing please: revisiting automated program repair via zero-shot learning,” in Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2022, Singapore, Singapore, November 14-18, 2022, A. Roychoudhury, C. Cadar, and M. Kim, Eds. ACM, 2022, pp. 959–971. [Online]. Available: https://doi.org/10.1145/3540250.3549101 [94] P. Yin, B. Deng, E. Chen, B. Vasilescu, and G. Neubig, “Learning to mine aligned code and natural language pairs from stack overflow,” in Proceedings of the 15th International Conference on Mining Software Repositories, 2018, pp. 476–486. 106 [95] P. Yin, W.-D. Li, K. Xiao, A. Rao, Y. Wen, K. Shi, J. Howland, P. Bailey, M. Catasta, H. Michalewski et al., “Natural language to code generation in interactive data science notebooks,” arXiv preprint arXiv:2212.09248, 2022. [96] H. Yu, B. Shen, D. Ran, J. Zhang, Q. Zhang, Y. Ma, G. Liang, Y. Li, T. Xie, and Q. Wang, “Codereval: A benchmark of pragmatic code generation with generative pre-trained models,” arXiv preprint arXiv:2302.00288, 2023. [97] J. Zhang, S. Panthaplackel, P. Nie, J. J. Li, and M. Gligoric, “Coditt5: Pretraining for source code and natural language editing,” 2022. [98] V. Zhong, C. Xiong, and R. Socher, “Seq2sql: Generating structured queries from natural language using reinforcement learning,” CoRR, vol. abs/1709.00103, 2017. [99] Y. Zhou, S. Liu, J. K. Siow, X. Du, and Y. Liu, “Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks,” in Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett, Eds., 2019, pp. 10 197–10 207. [Online]. Available: https://proceedings.neurips.cc/paper/2019/hash/ 49265d2447bc3bbfe9e76306ce40a31f-Abstract.html [100] M. Zhu, A. Jain, K. Suresh, R. Ravindran, S. Tipirneni, and C. K. Reddy, “Xlcost: A benchmark dataset for cross-lingual code intelligence,” 2022. [Online]. Available: https://arxiv.org/abs/2206.08474 [101] Q. Zhu, Z. Sun, Y. Xiao, W. Zhang, K. Yuan, Y. Xiong, and L. Zhang, “A syntax-guided edit decoder for neural program repair,” in ESEC/FSE ’21: 29th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Athens, Greece, August 23-28, 2021, D. Spinellis, G. Gousios, M. Chechik, and M. D. Penta, Eds. ACM, 2021, pp. 341–353. [Online]. Available: https://doi.org/10.1145/3468264.3468544 [102] A. Ziegler, E. Kalliamvakou, X. A. Li, A. Rice, D. Rifkin, S. Simister, G. Sittampalam, and E. Aftandilian, “Productivity assessment of neural code completion,” in Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming, ser. MAPS 2022. New York, NY, USA: Association for Computing Machinery, 2022, p. 21–29. [Online]. Available: https://doi.org/10.1145/3520312.3534864	en_US
dc.identifier.uri	http://hdl.handle.net/123456789/2192
dc.description	Supervised by Dr. Md. Moniruzzaman, Assistant Professor, Department of Computer Science and Engineering(CSE), Islamic University of Technology(IUT), Board Bazar, Gazipur-1704, Bangladesh	en_US
dc.description.abstract	The advent of large-scale pre-trained language models has revolutionized the field of natural language processing, enabling significant advancements in various applications, including code retrieval systems. This report presents a novel approach to code retrieval using the Dense Passage Retrieval (DPR) technique that captures the functional similarity between codes as a measure of relevance. DPR is a state-of-the-art method that combines the power of pre-trained language models with dense vector representations for efficient and accurate information retrieval. The objective of this research project is to develop a large scale multimodal, multilingual dataset and leverage the DPR framework to build a code retrieval system capable of retrieving functionally relevant codes given a source code or natural language description of the code in query. To accomplish this, the study first establishes a comprehensive dataset XCODEEVAL comprising large number of source codes downloaded from competitive programming platforms. The dataset is used to train a DPR model, employing a training process that involves large scale pre-trained masked language models called CodeBERT, Starencoder to learn contextual representations of codes that will facilitate the retrieval of similar codes given a query code. Experimental evaluation is conducted to assess the effectiveness of the proposed code retrieval system. The evaluation includes metrics such as accuracy@k. The results demonstrate that the DPR-based code retrieval system achieves notable performance gains compared to traditional information retrieval methods. The system effectively retrieves relevant code snippets for a wide range of code queries, highlighting its potential in facilitating retrieval augmented generation models, code reuse, software development, and programming education. Furthermore, the report investigates the impact of different factors, such as multilingual accuracy and batch size on the retrieval performance. Additionally, it explores the limitations and challenges associated with the proposed system, including the scalability of training and deployment, as well as potential biases in the training data. In conclusion, this report presents a comprehensive study on building a code retrieval system using the DPR framework. The experiments for code code retrieval suggest that albeit retrieval performance after training the base models gets boosted in all cases, monolingual retrieval with functional similarity is very accurate (>80% for accuracy@100)and the multilingual retrieval is bit poor (>56% for accuracy@100). For NL-code retrieval above 80% accuracy is observed for all languages except D. The results demonstrate the effectiveness of DPR in leveraging pre-trained language models to improve code retrieval performance. The findings of this research contribute to the advancement of code search and retrieval techniques, opening up new possibilities for efficient code reuse and software development practices.	en_US
dc.language.iso	en	en_US
dc.publisher	Department of Computer Science and Engineering(CSE), Islamic University of Technology(IUT), Board Bazar, Gazipur-1704, Bangladesh	en_US
dc.title	Development of A Code Search Engine Using Natural Language Processing Technique	en_US
dc.type	Thesis	en_US