Abstract:
The advent of large-scale pre-trained language models has revolutionized the field of natural
language processing, enabling significant advancements in various applications, including
code retrieval systems. This report presents a novel approach to code retrieval using the
Dense Passage Retrieval (DPR) technique that captures the functional similarity between
codes as a measure of relevance. DPR is a state-of-the-art method that combines the
power of pre-trained language models with dense vector representations for efficient and
accurate information retrieval. The objective of this research project is to develop a
large-scale, multimodal, multilingual dataset and to leverage the DPR framework to build a code
retrieval system capable of retrieving functionally relevant code given either source code or
a natural language description of the code as the query. To accomplish this, the study first
establishes a comprehensive dataset, XCODEEVAL, comprising a large number of source codes
downloaded from competitive programming platforms. The dataset is used to train a DPR
model, using the large-scale pre-trained masked language models CodeBERT and StarEncoder
to learn contextual representations of code that facilitate the retrieval of similar code
given a query code. Experimental evaluation is
conducted to assess the effectiveness of the proposed code retrieval system. The evaluation
includes metrics such as accuracy@k. The results demonstrate that the DPR-based code
retrieval system achieves notable performance gains compared to traditional information
retrieval methods. The system effectively retrieves relevant code snippets for a wide range of
code queries, highlighting its potential in facilitating retrieval augmented generation models,
code reuse, software development, and programming education. Furthermore, the report
investigates the impact of different factors, such as multilingual accuracy and batch size on
the retrieval performance. Additionally, it explores the limitations and challenges associated
with the proposed system, including the scalability of training and deployment, as well as
potential biases in the training data. In conclusion, this report presents a comprehensive study
on building a code retrieval system using the DPR framework. The code-code retrieval
experiments suggest that, although training boosts the retrieval performance of the base
models in all cases, monolingual retrieval by functional similarity is very accurate (>80%
accuracy@100), whereas multilingual retrieval is weaker (>56% accuracy@100). For
NL-code retrieval, accuracy above 80% is observed for all languages except D. The results
demonstrate the effectiveness of DPR in leveraging pre-trained language models to improve
code retrieval performance. The findings of this research contribute to the advancement of
code search and retrieval techniques, opening up new possibilities for efficient code reuse
and software development practices.
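
To make the accuracy@k figures above concrete, the following is a minimal sketch of how such a metric can be computed over dense embeddings. It assumes the common top-k retrieval accuracy definition (a query counts as a hit if at least one relevant item appears among its k highest-scoring candidates) and inner-product scoring; the report's exact definition and scoring function are not stated in this abstract, and the names `top_k`, `accuracy_at_k`, and the toy vectors are hypothetical.

```python
from typing import List, Sequence, Set


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k(query_vec: Sequence[float],
          corpus_vecs: Sequence[Sequence[float]],
          k: int) -> List[int]:
    """Indices of the k corpus vectors with the highest inner product."""
    scores = [dot(query_vec, c) for c in corpus_vecs]
    ranked = sorted(range(len(corpus_vecs)),
                    key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def accuracy_at_k(query_vecs: Sequence[Sequence[float]],
                  corpus_vecs: Sequence[Sequence[float]],
                  relevant: Sequence[Set[int]],
                  k: int) -> float:
    """Fraction of queries with at least one relevant item in the top k.

    Assumption: relevant[i] holds the corpus indices considered
    functionally relevant to query i.
    """
    hits = 0
    for q, rel in zip(query_vecs, relevant):
        if rel & set(top_k(q, corpus_vecs, k)):
            hits += 1
    return hits / len(query_vecs)


# Toy data (hypothetical): three corpus embeddings, two query embeddings.
corpus = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
queries = [[1.0, 0.0], [0.0, 1.0]]
relevant = [{2}, {0}]

acc_at_2 = accuracy_at_k(queries, corpus, relevant, k=2)
```

In a real system the embeddings would come from the trained encoders and the search would typically use an approximate nearest-neighbour index rather than a full sort; the metric itself is unchanged.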
Description:
Supervised by
Dr. Md. Moniruzzaman,
Assistant Professor,
Department of Computer Science and Engineering (CSE),
Islamic University of Technology (IUT),
Board Bazar, Gazipur-1704, Bangladesh