dc.contributor.author | Dicko, Idrissa Mahamoudou | |
dc.contributor.author | Rageb, Sabry Said Sabry | |
dc.contributor.author | Kindzeka, Shalanyuy Nabil | |
dc.date.accessioned | 2025-03-10T08:08:32Z | |
dc.date.available | 2025-03-10T08:08:32Z | |
dc.date.issued | 2024-06-19 | |
dc.identifier.citation | [1] Z. Ahmad, A. Shahid Khan, C. Shiang, and F. Ahmad, "Network intrusion detection system: A systematic study of machine learning and deep learning approaches," Transactions on Emerging Telecommunications Technologies, vol. 32, Jan. 2021. doi: 10.1002/ett.4150.
[2] A. Alshammari and A. Aldribi, "Apply machine learning techniques to detect malicious network traffic in cloud computing," Journal of Big Data, vol. 8, no. 1, pp. 1–24, 2021.
[3] N. S. Altman, "An introduction to kernel and nearest-neighbor nonparametric regression," The American Statistician, vol. 46, no. 3, pp. 175–185, 1992.
[4] L. Ashiku and C. Dagli, "Network intrusion detection system using deep learning," Procedia Computer Science, vol. 185, pp. 239–247, 2021.
[5] R. Atefinia and M. Ahmadi, "Network intrusion detection using multi-architectural modular deep neural network," The Journal of Supercomputing, vol. 77, pp. 3571–3593, 2021.
[6] P. Baldi, "Autoencoders, unsupervised learning, and deep architectures," Proceedings of ICML Workshop on Unsupervised and Transfer Learning, pp. 37–49, 2012.
[7] Y. Bengio, "Learning deep architectures for AI," Foundations and Trends in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009.
[8] R. Bernard, L. Heutte, and S. Adam, "Selection of hyper-parameters in random forest modeling of high-dimensional data: Interest of sensitivity analysis," in International Workshop on Multiple Classifier Systems, Springer, 2009, pp. 184–193.
[9] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[10] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, Classification and Regression Trees. Wadsworth & Brooks/Cole Advanced Books & Software, 1984.
[11] N. Chawla, K. Bowyer, L. Hall, and W. Kegelmeyer, "SMOTE: Synthetic minority over-sampling technique," J. Artif. Intell. Res. (JAIR), vol. 16, pp. 321–357, Jun. 2002. doi: 10.1613/jair.953.
[12] T. Chen and C. Guestrin, "XGBoost: A scalable tree boosting system," arXiv preprint arXiv:1603.02754, 2016.
[13] T. Chen and C. Guestrin, "XGBoost: A scalable tree boosting system," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 785–794.
[14] T. Cover and P. Hart, "Nearest neighbor pattern classification," IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21–27, 1967.
[15] I. Dey and V. Pratap, "A comparative study of SMOTE, Borderline-SMOTE, and ADASYN oversampling techniques using different classifiers," in 2023 3rd International Conference on Smart Data Intelligence (ICSMDI), 2023, pp. 294–302. doi: 10.1109/ICSMDI57622.2023.00060.
[16] G. Douzas, F. Bacao, and F. Last, "Improving imbalanced learning through a heuristic oversampling method based on k-means and SMOTE," Information Sciences, vol. 465, pp. 1–20, 2018, issn: 0020-0255. doi: 10.1016/j.ins.2018.06.056.
[17] F. Esposito, D. Malerba, G. Semeraro, and H. Kay, "A comparative analysis of methods for pruning decision trees," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 5, pp. 476–491, 1997.
[18] Y. Fan, X. Cui, H. Han, and H. Lu, "Chiller fault detection and diagnosis by knowledge transfer based on adaptive imbalanced processing," Science and Technology for the Built Environment, vol. 26, pp. 1–23, Apr. 2020. doi: 10.1080/23744731.2020.1757327.
[19] E. Fix and J. L. Hodges, "Discriminatory analysis. Nonparametric discrimination: Consistency properties," International Statistical Review / Revue Internationale de Statistique, pp. 238–247, 1989.
[20] J. H. Friedman, "Greedy function approximation: A gradient boosting machine," Annals of Statistics, pp. 1189–1232, 2001.
[21] K. Fukushima, "Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position," Biological Cybernetics, vol. 36, no. 4, pp. 193–202, 1980.
[22] S. Gamage and J. Samarabandu, "Deep learning methods in network intrusion detection: A survey and an objective comparison," Journal of Network and Computer Applications, vol. 169, p. 102767, 2020.
[23] L. Göcs and Z. C. Johanyák, "Identifying relevant features of CSE-CIC-IDS2018 dataset for the development of an intrusion detection system," Intelligent Data Analysis, no. Preprint, pp. 1–27, 2023.
[24] A. Graves, Supervised Sequence Labelling with Recurrent Neural Networks. Springer, 2012.
[25] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, "LSTM: A search space odyssey," IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 10, pp. 2222–2232, 2017.
[26] A. Halbouni, T. S. Gunawan, M. H. Habaebi, M. Halbouni, M. Kartiwi, and R. Ahmad, "Machine learning and deep learning approaches for cybersecurity: A review," IEEE Access, vol. 10, pp. 19572–19585, 2022.
[27] H. Han, W.-Y. Wang, and B.-H. Mao, "Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning," in Advances in Intelligent Computing, D.-S. Huang, X.-P. Zhang, and G.-B. Huang, Eds., Berlin, Heidelberg: Springer Berlin Heidelberg, 2005, pp. 878–887, isbn: 978-3-540-31902-3.
[28] H. Han, W.-Y. Wang, and B.-H. Mao, "Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning," in Advances in Intelligent Computing, D.-S. Huang, X.-P. Zhang, and G.-B. Huang, Eds., Berlin, Heidelberg: Springer Berlin Heidelberg, 2005, pp. 878–887, isbn: 978-3-540-31902-3.
[29] H. He, Y. Bai, E. A. Garcia, and S. Li, "ADASYN: Adaptive synthetic sampling approach for imbalanced learning," in 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), 2008, pp. 1322–1328. doi: 10.1109/IJCNN.2008.4633969.
[30] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
[31] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504–507, 2006.
[32] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[33] D.-S. Huang, Advances in Intelligent Computing: International Conference on Intelligent Computing, ICIC 2005, Hefei, China, August 23–26, 2005, Proceedings. Springer Science & Business Media, 2005.
[34] G. Karatas, O. Demir, and O. K. Sahingoz, "Increasing the performance of machine learning-based IDSs on an imbalanced and up-to-date dataset," IEEE Access, vol. 8, pp. 32150–32162, 2020.
[35] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," arXiv preprint arXiv:1312.6114, 2013.
[36] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[37] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[38] X. Li, W. Chen, Q. Zhang, and L. Wu, "Building auto-encoder intrusion detection system based on random forest feature selection," Computers & Security, vol. 95, p. 101851, 2020.
[39] A. Liaw and M. Wiener, "Classification and regression by randomForest," R News, vol. 2, no. 3, pp. 18–22, 2002.
[40] P. Lin, K. Ye, and C.-Z. Xu, "Dynamic network anomaly detection system by using deep learning techniques," in Cloud Computing – CLOUD 2019: 12th International Conference, Held as Part of the Services Conference Federation, SCF 2019, San Diego, CA, USA, June 25–30, 2019, Proceedings 12, Springer, 2019, pp. 161–176.
[41] Z. C. Lipton, J. Berkowitz, and C. Elkan, "A critical review of recurrent neural networks for sequence learning," arXiv preprint arXiv:1506.00019, 2015.
[42] W.-Y. Loh, "Classification and regression trees," Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 1, no. 1, pp. 14–23, 2011.
[43] A. Ng, "Sparse autoencoder," CS294A Lecture Notes, vol. 72, no. 2011, pp. 1–19, 2011.
[44] A. Pattawaro and C. Polprasert, "Anomaly-based network intrusion detection system through feature selection and hybrid machine learning technique," in 2018 16th International Conference on ICT and Knowledge Engineering (ICT&KE), Nov. 2018, pp. 1–6. doi: 10.1109/ICTKE.2018.8612331.
[45] L. E. Peterson, "K-nearest neighbor," Scholarpedia, vol. 4, no. 2, p. 1883, 2009.
[46] J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, no. 1, pp. 81–106, 1986.
[47] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1, pp. 318–362, 1985.
[48] S. Satpathy, "SMOTE for imbalanced classification with Python," Analytics Vidhya, Oct. 2020. [Online]. Available: https://www.analyticsvidhya.com/blog/2020/10/overcoming-class-imbalance-using-smote-techniques/.
[49] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, "Extracting and composing robust features with denoising autoencoders," in Proceedings of the 25th International Conference on Machine Learning, 2008, pp. 1096–1103.
[50] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, "How transferable are features in deep neural networks?" in Advances in Neural Information Processing Systems, 2014, pp. 3320–3328.
[51] R. Zhao, Y. Mu, L. Zou, and X. Wen, "A hybrid intrusion detection system based on feature selection and weighted stacking classifier," IEEE Access, vol. 10, pp. 71414–71426, 2022. | en_US |
dc.identifier.uri | http://hdl.handle.net/123456789/2374 | |
dc.description | Supervised by Mr. Faisal Hussain, Assistant Professor, Department of Computer Science and Engineering (CSE), Islamic University of Technology (IUT), Board Bazar, Gazipur, Bangladesh. This thesis is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science and Engineering, 2024 | en_US |
dc.description.abstract | In contemporary cybersecurity research, the fusion of sampling techniques with state-of-the-art Machine Learning (ML) and Deep Learning (DL) models has emerged as a pivotal area of exploration, aimed at enhancing the efficacy of Intrusion Detection Systems (IDS). This thesis delves into the intersection of sampling methodologies and advanced learning algorithms to address the inherent challenge of class imbalance prevalent in network intrusion datasets. Class imbalance, a common issue in IDS datasets, often leads to suboptimal performance, as models tend to be biased towards the majority class, compromising their ability to detect instances of the minority class, which typically represents intrusions. The proposed research harnesses the power of sampling techniques, encompassing oversampling, undersampling, and hybrid approaches, to rectify this imbalance and create a more representative learning environment. Sampling methods are strategically combined with ML models such as Support Vector Machines (SVM), Random Forests, and k-Nearest Neighbors, along with DL models such as Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) networks. This combination seeks to leverage the capabilities of these models to uncover complex structures and connections within the data. Central to our research is the introduction of Feature Relevance and Adaptive Oversampling (FAAO), a novel approach that combines feature relevance assessment with adaptive oversampling to address class imbalance. FAAO evaluates the importance of different features to identify those most influential in distinguishing between classes, ensuring that the oversampling process focuses on the most relevant features and improves the quality of the synthetic samples. Our primary objectives include exploring various sampling techniques, including FAAO, and their impact on the performance of ML- and DL-based IDS models. We will implement and assess the effectiveness of these techniques in mitigating class imbalance, thereby enhancing the models' overall detection accuracy, sensitivity, and specificity. Furthermore, this study seeks to provide insights into the optimal pairing of sampling techniques with specific ML and DL architectures, while paying equal attention to feature relevance given the inherent characteristics of intrusion detection datasets. The findings are anticipated to provide valuable guidelines for practitioners and researchers seeking to deploy robust and adaptive IDS solutions in real-world scenarios. By outlining the collaborative relationship between feature relevance, sampling techniques, and advanced learning models, our work endeavors to pave the way for more adaptive, resilient, and accurate intrusion detection mechanisms, ultimately fortifying the cybersecurity landscape against evolving threats. | en_US |
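The abstract describes FAAO only at a high level, so the following is a hypothetical sketch of the general idea it gestures at: SMOTE-style interpolation between minority-class neighbours, with neighbourhoods computed under a distance metric weighted by feature relevance. All function names and the example weights are illustrative assumptions, not the thesis's actual FAAO implementation.

```python
# Hypothetical sketch (not the thesis's FAAO algorithm): SMOTE-style minority
# oversampling where nearest neighbours are found under a feature-relevance-
# weighted distance, so synthetic samples follow the most discriminative features.
import random
from math import sqrt

def weighted_distance(a, b, w):
    """Euclidean distance with each feature dimension scaled by its relevance weight."""
    return sqrt(sum(wi * (ai - bi) ** 2 for ai, bi, wi in zip(a, b, w)))

def oversample(minority, weights, n_new, k=2, seed=0):
    """Generate n_new synthetic samples: pick a minority point, find its k
    nearest (relevance-weighted) minority neighbours, and interpolate toward
    a randomly chosen one, as in SMOTE."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not x),
                            key=lambda p: weighted_distance(x, p, weights))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(xi + gap * (ni - xi) for xi, ni in zip(x, nb)))
    return synthetic

# Toy minority class (e.g. attack flows) with two features; assume feature 0
# is more class-discriminative, so it gets a larger relevance weight.
minority = [(1.0, 5.0), (1.2, 5.5), (0.9, 4.8), (1.1, 5.2)]
weights = (1.0, 0.25)
new_pts = oversample(minority, weights, n_new=3)
print(len(new_pts))  # 3 synthetic samples, each between two real minority points
```

Because each synthetic point is a convex combination of two real minority samples, it stays within the minority region; the weighting only changes which neighbour pairs get interpolated, which is one plausible way relevance assessment could "improve the quality of the synthetic samples" as the abstract puts it.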
dc.language.iso | en | en_US |
dc.publisher | Department of Computer Science and Engineering (CSE), Islamic University of Technology (IUT), Board Bazar, Gazipur-1704, Bangladesh | en_US |
dc.subject | NIDS, IPS, Sampling, Over-Sampling, Machine Learning (ML), Deep Learning (DL), SMOTE, ADASYN, Random Forest (RF), Decision Tree, Kernel SVM, KNN Classifier, Autoencoder, CNN, Long Short-Term Memory (LSTM), F1 Score, CSE-CIC-IDS2018, feature selection, Entropy, Information Gain, Borderline-SMOTE, KMeans, Feature Aware-Adaptive Oversampling (FAAO) | en_US |
dc.title | Utilising Sampling Techniques to Enhance Network Intrusion Detection System (NIDS) Performance in Imbalanced Data Scenarios | en_US |
dc.type | Thesis | en_US |