A Framework for Grammar Inferencing using LLM Heuristics


dc.contributor.author Noman, Abdullah Al
dc.contributor.author Shreshtho, Talimul Bari
dc.contributor.author Sakib, Esrat Ebtida
dc.date.accessioned 2025-03-11T05:58:24Z
dc.date.available 2025-03-11T05:58:24Z
dc.date.issued 2024-07-04
dc.identifier.citation [1] M. R. Arefin, S. Shetiya, Z. Wang, and C. Csallner, “Fast deterministic black-box context-free grammar inference,” in Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, ser. ICSE ’24, New York, NY, USA: Association for Computing Machinery, 2024, isbn: 9798400702174.
[2] O. Bastani, R. Sharma, A. Aiken, and P. Liang, “Synthesizing program input grammars,” in Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI 2017, Barcelona, Spain: Association for Computing Machinery, 2017, pp. 95–110, isbn: 9781450349888.
[3] B. Bendrissou, R. Gopinath, and A. Zeller, “‘Synthesizing input grammars’: A replication study,” in Proceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation, 2022, pp. 260–268.
[4] C. Cummins, P. Petoumenos, A. Murray, and H. Leather, “Compiler fuzzing through deep learning,” in Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis, 2018, pp. 95–105.
[5] R. Edelmann and V. Kunčak, “Neural-network guided expression transformation,” arXiv preprint arXiv:1902.02194, 2019.
[6] P. Godefroid, R. Singh, and H. Peleg, Machine learning for input fuzzing, US Patent 10,983,853, Apr. 2021.
[7] E. M. Gold, “Language identification in the limit,” Information and Control, vol. 10, no. 5, pp. 447–474, 1967.
[8] R. Gopinath, B. Mathis, and A. Zeller, “Mining input grammars from dynamic control flow,” in Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2020, pp. 172–183.
[9] M. Höschele and A. Zeller, “Mining input grammars from dynamic taints,” in Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering, 2016, pp. 720–725.
[10] B. Jean-Philippe et al., “Can recurrent neural networks learn nested recursion?” LiLT (Linguistic Issues in Language Technology), vol. 16, no. 1, pp. 1–20, 2018.
[11] T.-W. Kim, T.-G. Kim, and J.-H. Seu, “Specification and automated detection of code smells using OCL,” International Journal of Software Engineering and Its Applications, vol. 7, no. 4, pp. 35–44, 2013.
[12] N. Kulkarni, C. Lemieux, and K. Sen, “Learning highly recursive input grammars,” in 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2021, pp. 456–467.
[13] Z. Lin, X. Zhang, and D. Xu, “Reverse engineering input syntactic structure from program execution and its applications,” IEEE Transactions on Software Engineering, vol. 36, no. 5, pp. 688–703, 2009.
[14] L. Moonen, “Generating robust parsers using island grammars,” in Proceedings of the Eighth Working Conference on Reverse Engineering, IEEE, 2001, pp. 13–22.
[15] Y. Oda, H. Fudaba, G. Neubig, et al., “Learning to generate pseudo-code from source code using statistical machine translation,” in 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), IEEE, 2015, pp. 574–584.
[16] L. Sennhauser and R. C. Berwick, “Evaluating the ability of LSTMs to learn context-free grammars,” arXiv preprint arXiv:1811.02611, 2018.
[17] A. Shirani, A. P. Lopez-Monroy, F. Gonzalez, T. Solorio, and M. A. Alipour, “Evaluation of type inference with textual cues,” in Workshops at the Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[18] A. Stevenson and J. R. Cordy, “Grammatical inference in software engineering: An overview of the state of the art,” in Software Language Engineering: 5th International Conference, SLE 2012, Dresden, Germany, September 26-28, 2012, Revised Selected Papers 5, Springer, 2013, pp. 204–223.
[19] A. Stevenson and J. R. Cordy, “A survey of grammatical inference in software engineering,” Science of Computer Programming, vol. 96, pp. 444–459, 2014, Selected Papers from the Fifth International Conference on Software Language Engineering (SLE 2012), issn: 0167-6423.
[20] A. Vaswani, N. Shazeer, N. Parmar, et al., “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[21] D. M. Yellin and G. Weiss, “Synthesizing context-free grammars from recurrent neural networks,” in International Conference on Tools and Algorithms for the Construction and Analysis of Systems, Springer, 2021, pp. 351–369. en_US
dc.identifier.uri http://hdl.handle.net/123456789/2380
dc.description Supervised by Mr. Md. Jubair Ibna Mostafa, Assistant Professor, Department of Computer Science and Engineering (CSE), Islamic University of Technology (IUT), Board Bazar, Gazipur, Bangladesh. This thesis is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Software Engineering, 2024. en_US
dc.description.abstract Grammar inferencing is a major topic in the realm of programming languages. To create and enhance the ecosystem of any programming language (which generally consists of optimized compilers, defined maintenance, and comprehensive documentation), inferring a grammar for that language is unavoidable. Automated inference of grammars from strings of code has existed for some time. Grammar inference has progressed through various stages, from template-based systems to black-box-dependent models, and more recently, to neural network-based approaches. We introduce a novel technique in this continuum: LLM-based (Large Language Model) systems. This approach leverages the capabilities of large language models to enhance performance, providing a fresh perspective and a potentially transformative impact on grammar inference. We harness the LLM's ability to generalize and consume domain-specific knowledge to infer grammar in a context-aware manner. Our method outperforms the state-of-the-art TREEVADA approach in both precision and recall, and it also makes the entire process more generalized and constraint-free. Specifically, our method outperforms TREEVADA by 8% in precision and 28% in recall. However, our evaluation suffers from the absence of a language entirely unforeseen by the LLMs. Because our baseline approach, TREEVADA, requires a lexer-parser setup within its oracle, comparing the two methods on such a novel language is currently precluded from our work. en_US
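The precision and recall figures cited in the abstract are conventionally computed by cross-checking samples between the inferred grammar and a ground-truth oracle: precision is the fraction of strings generated from the inferred grammar that the oracle accepts, and recall is the fraction of known-valid inputs that the inferred grammar accepts. The following is a minimal Python sketch of that evaluation under those assumptions; the parser names and the toy language are illustrative placeholders, not the thesis's actual tooling.

# Minimal sketch of grammar-inference precision/recall (illustrative only;
# golden_parser, inferred_parser, and the toy language are hypothetical).
from typing import Callable, Iterable

def precision(samples_from_inferred: Iterable[str],
              golden_parser: Callable[[str], bool]) -> float:
    # Fraction of strings produced by the inferred grammar that the
    # ground-truth (golden) oracle accepts.
    samples = list(samples_from_inferred)
    return sum(golden_parser(s) for s in samples) / len(samples) if samples else 0.0

def recall(valid_inputs: Iterable[str],
           inferred_parser: Callable[[str], bool]) -> float:
    # Fraction of known-valid inputs that the inferred grammar accepts.
    inputs = list(valid_inputs)
    return sum(inferred_parser(s) for s in inputs) / len(inputs) if inputs else 0.0

if __name__ == "__main__":
    # Toy stand-ins: a crude parenthesis-count oracle and a finite "inferred" language.
    golden = lambda s: s.count("(") == s.count(")")
    inferred = lambda s: s in {"()", "(())", "()()"}
    print(precision(["()", "(())", "(()"], golden))    # 2 of 3 accepted -> 0.667
    print(recall(["()", "()()", "((()))"], inferred))  # 2 of 3 accepted -> 0.667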
dc.language.iso en en_US
dc.publisher Department of Computer Science and Engineering (CSE), Islamic University of Technology (IUT), Board Bazar, Gazipur-1704, Bangladesh en_US
dc.subject Grammar Inference, LLM, Black Box, CFG, LLM Heuristics en_US
dc.title A Framework for Grammar Inferencing using LLM Heuristics en_US
dc.type Thesis en_US

