dc.identifier.citation |
[1] A. Fernández, S. García, M. Galar, R. C. Prati, B. Krawczyk, and F. Herrera, Learning from Imbalanced Data Sets. Springer, 2018, doi: 10.1007/978-3-319-98074-4.
[2] M. Dudjak and G. Martinović, “An empirical study of data intrinsic characteristics that make learning from imbalanced data difficult,” Expert Syst Appl, vol. 182, p. 115297, Nov. 2021, doi: 10.1016/J.ESWA.2021.115297.
[3] H. Y. J. Kang, E. Batbaatar, D. W. Choi, K. S. Choi, M. Ko, and K. S. Ryu, “Synthetic Tabular Data Based on Generative Adversarial Networks in Health Care: Generation and Validation Using the Divide-and-Conquer Strategy,” JMIR Med Inform, vol. 11, no. 1, Jan. 2023, doi: 10.2196/47859.
[4] P. Vuttipittayamongkol, E. Elyan, and A. Petrovski, “On the class overlap problem in imbalanced data classification,” Knowl Based Syst, vol. 212, p. 106631, Jan. 2021, doi: 10.1016/J.KNOSYS.2020.106631.
[5] V. García, R. A. Mollineda, and J. S. Sánchez, “On the k-NN performance in a challenging scenario of imbalance and overlapping,” Pattern Analysis and Applications, vol. 11, no. 3–4, pp. 269–280, Sep. 2008, doi: 10.1007/S10044-007-0087-5.
[6] G. Haixiang, L. Yijing, J. Shang, G. Mingyun, H. Yuanyue, and G. Bing, “Learning from class-imbalanced data: Review of methods and applications,” Expert Syst Appl, vol. 73, pp. 220–239, May 2017, doi: 10.1016/J.ESWA.2016.12.035.
[7] R. Mohammed, J. Rawashdeh, and M. Abdullah, “Machine Learning with Oversampling and Undersampling Techniques: Overview Study and Experimental Results,” 2020 11th International Conference on Information and Communication Systems, ICICS 2020, pp. 243–248, Apr. 2020, doi: 10.1109/ICICS49469.2020.239556.
[8] A. Fernández, S. García, F. Herrera, and N. V. Chawla, “SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary,” Journal of Artificial Intelligence Research, vol. 61, pp. 863–905, Apr. 2018, doi: 10.1613/JAIR.1.11192.
[9] A. Newaz, M. S. Mohosheu, and M. A. Al Noman, “Predicting complications of myocardial infarction within several hours of hospitalization using data mining techniques,” Inform Med Unlocked, vol. 42, p. 101361, Jan. 2023, doi: 10.1016/J.IMU.2023.101361.
[10] J. F. Díez-Pastor, J. J. Rodríguez, C. García-Osorio, and L. I. Kuncheva, “Random Balance: Ensembles of variable priors classifiers for imbalanced data,” Knowl Based Syst, vol. 85, pp. 96–111, Sep. 2015, doi: 10.1016/J.KNOSYS.2015.04.022.
[11] H. J. Kim, N. O. Jo, and K. S. Shin, “Optimization of cluster-based evolutionary undersampling for the artificial neural networks in corporate bankruptcy prediction,” Expert Syst Appl, vol. 59, pp. 226–234, Oct. 2016, doi: 10.1016/J.ESWA.2016.04.027.
[12] G. Kovács, “Smote-variants: A python implementation of 85 minority oversampling techniques,” Neurocomputing, vol. 366, pp. 352–354, Nov. 2019, doi: 10.1016/J.NEUCOM.2019.06.100.
[13] A. S. Tarawneh, A. B. Hassanat, G. A. Altarawneh, and A. Almuhaimeed, “Stop Oversampling for Class Imbalance Learning: A Review,” IEEE Access, vol. 10, pp. 47643–47660, 2022, doi: 10.1109/ACCESS.2022.3169512.
[14] Z. Xu, D. Shen, T. Nie, and Y. Kou, “A hybrid sampling algorithm combining M-SMOTE and ENN based on Random forest for medical imbalanced data,” J Biomed Inform, vol. 107, p. 103465, Jul. 2020, doi: 10.1016/J.JBI.2020.103465.
[15] A. Newaz, S. Hassan, and F. Shahriyar Haq, “An Empirical Analysis of the Efficacy of Different Sampling Techniques for Imbalanced Classification,” Aug. 2022, Accessed: Mar. 15, 2024. [Online]. Available: https://arxiv.org/abs/2208.11852v1
[16] J. J. Rodríguez, J. F. Díez-Pastor, Á. Arnaiz-González, and L. I. Kuncheva, “Random Balance ensembles for multiclass imbalance learning,” Knowl Based Syst, vol. 193, p. 105434, Apr. 2020, doi: 10.1016/J.KNOSYS.2019.105434.
[17] V. H. Alves Ribeiro and G. Reynoso-Meza, “Ensemble learning by means of a multi-objective optimization design approach for dealing with imbalanced data sets,” Expert Syst Appl, vol. 147, p. 113232, Jun. 2020, doi: 10.1016/J.ESWA.2020.113232.
[18] K. Yang et al., “Hybrid Classifier Ensemble for Imbalanced Data,” IEEE Trans Neural Netw Learn Syst, vol. 31, no. 4, pp. 1387–1400, Apr. 2020, doi: 10.1109/TNNLS.2019.2920246.
[19] A. Anaissi, P. J. Kennedy, M. Goyal, and D. R. Catchpoole, “A balanced iterative random forest for gene selection from microarray data,” BMC Bioinformatics, vol. 14, no. 1, pp. 1–10, Aug. 2013, doi: 10.1186/1471-2105-14-261.
[20] R. Blagus and L. Lusa, “SMOTE for high-dimensional class-imbalanced data,” BMC Bioinformatics, vol. 14, no. 1, pp. 1–16, Mar. 2013, doi: 10.1186/1471-2105-14-106.
[21] G. Kovács, “An empirical comparison and evaluation of minority oversampling techniques on a large number of imbalanced datasets,” Appl Soft Comput, vol. 83, p. 105662, Oct. 2019, doi: 10.1016/J.ASOC.2019.105662.
[22] H. Nugroho, K. Wikantika, S. Bijaksana, and A. Saepuloh, “Handling imbalanced data in supervised machine learning for lithological mapping using remote sensing and airborne geophysical data,” Open Geosciences, vol. 15, no. 1, Jan. 2023, doi: 10.1515/GEO-2022-0487.
[23] R. C. Prati, G. E. A. P. A. Batista, and D. F. Silva, “Class imbalance revisited,” Knowl Inf Syst, vol. 45, no. 1, pp. 247–270, Oct. 2015, doi: 10.1007/S10115-014-0794-3.
[24] A. N. Tarekegn, M. Giacobini, and K. Michalak, “A review of methods for imbalanced multi-label classification,” Pattern Recognit, vol. 118, p. 107965, Oct. 2021, doi: 10.1016/J.PATCOG.2021.107965.
[25] “Myocardial infarction complications - UCI Machine Learning Repository.” Accessed: Jul. 14, 2023. [Online]. Available: https://archive.ics.uci.edu/dataset/579/myocardial+infarction+complications
[26] J. Alcalá-Fdez et al., “KEEL Data-Mining Software Tool: Data Set Repository, Integration of Algorithms and Experimental Analysis Framework,” vol. 17, pp. 255–287, 2011, Accessed: Mar. 15, 2024. [Online]. Available: http://the-data-mine.com/bin/view/Softwar |
en_US |