Development of A Code Search Engine Using Natural Language Processing Technique

Khan, Mohammad Abdullah Matin

URI: http://hdl.handle.net/123456789/2192

Date: 2023-12-30

Abstract:

The advent of large-scale pre-trained language models has revolutionized natural language processing, enabling significant advances in many applications, including code retrieval systems. This report presents a novel approach to code retrieval based on the Dense Passage Retrieval (DPR) technique, using functional similarity between codes as the measure of relevance. DPR is a state-of-the-art method that combines pre-trained language models with dense vector representations for efficient and accurate information retrieval. The objective of this research project is to develop a large-scale multimodal, multilingual dataset and to leverage the DPR framework to build a code retrieval system that retrieves functionally relevant codes given either a source code or a natural language description of the code as the query. To accomplish this, the study first establishes a comprehensive dataset, XCODEEVAL, comprising a large number of source codes downloaded from competitive programming platforms. The dataset is used to train a DPR model built on the large-scale pre-trained masked language models CodeBERT and StarEncoder, which learn contextual representations of codes that facilitate the retrieval of similar codes given a query code. Experimental evaluation assesses the effectiveness of the proposed code retrieval system using metrics such as accuracy@k. The results demonstrate that the DPR-based code retrieval system achieves notable performance gains over traditional information retrieval methods. The system effectively retrieves relevant code snippets for a wide range of code queries, highlighting its potential for retrieval-augmented generation models, code reuse, software development, and programming education. Furthermore, the report investigates the impact of factors such as the multilingual setting and batch size on retrieval performance.
Additionally, it explores the limitations and challenges of the proposed system, including the scalability of training and deployment and potential biases in the training data. In conclusion, this report presents a comprehensive study on building a code retrieval system using the DPR framework. The code-to-code retrieval experiments suggest that, although training boosts the retrieval performance of the base models in all cases, monolingual retrieval by functional similarity is very accurate (>80% accuracy@100) while multilingual retrieval is considerably weaker (>56% accuracy@100). For NL-code retrieval, accuracy above 80% is observed for all languages except D. The results demonstrate the effectiveness of DPR in leveraging pre-trained language models to improve code retrieval performance. The findings contribute to the advancement of code search and retrieval techniques, opening up new possibilities for efficient code reuse and software development practices.
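
As an illustration of the evaluation described above, the following is a minimal sketch of dense retrieval scored with accuracy@k. It is not the report's implementation: the function names, the toy vectors, and the dot-product scoring are assumptions made for this example, and accuracy@k is assumed here to mean the fraction of queries for which at least one relevant item appears among the top k retrieved results. In a real system the vectors would come from a trained encoder rather than being written by hand.

```python
from typing import List, Sequence, Set


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product, used here as the (assumed) relevance score."""
    return sum(a * b for a, b in zip(u, v))


def top_k(query_vec: Sequence[float],
          corpus_vecs: List[Sequence[float]],
          k: int) -> List[int]:
    """Return indices of the k corpus vectors with the highest score."""
    ranked = sorted(range(len(corpus_vecs)),
                    key=lambda i: dot(query_vec, corpus_vecs[i]),
                    reverse=True)
    return ranked[:k]


def accuracy_at_k(query_vecs: List[Sequence[float]],
                  corpus_vecs: List[Sequence[float]],
                  relevant: List[Set[int]],
                  k: int) -> float:
    """Fraction of queries with at least one relevant item in the top k.

    relevant[i] holds the corpus indices considered relevant to query i
    (hypothetical labels for illustration).
    """
    if not query_vecs:
        return 0.0
    hits = 0
    for qi, q in enumerate(query_vecs):
        if any(i in relevant[qi] for i in top_k(q, corpus_vecs, k)):
            hits += 1
    return hits / len(query_vecs)


# Toy, hand-written embeddings (assumed, not taken from the report).
corpus = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
queries = [[1.0, 0.0], [0.0, 1.0]]
relevant = [{2}, {0}]

for k in (1, 2, 3):
    print(k, accuracy_at_k(queries, corpus, relevant, k))
```

Exhaustive scoring as shown is only practical for small corpora; large-scale dense retrieval typically relies on an approximate nearest-neighbor index instead.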

Description:

Supervised by Dr. Md. Moniruzzaman, Assistant Professor, Department of Computer Science and Engineering (CSE), Islamic University of Technology (IUT), Board Bazar, Gazipur-1704, Bangladesh
