HNIHA: Hybrid Nature-Inspired Imbalance Handling Algorithm to Addressing Imbalanced Datasets for Improved Classification: In Case of Anemia Identification

Authors

  • Dimas Chaerul Ekty Saputra, Telkom University Surabaya
  • Tri Ratnaningsih, Universitas Gadjah Mada
  • Irianna Futri, Khon Kaen University
  • Elvaro Islami Muryadi, Adiwangsa Jambi University
  • Raksmey Phann, Seoul National University of Science and Technology
  • Su Sandi Hla Tun, Khon Kaen University
  • Ritchie Natuan Caibigan, Batangas State University – The National Engineering University

DOI:

https://doi.org/10.12928/biste.v6i3.11306

Keywords:

Imbalanced Classification, Nature-Inspired Algorithm, MCC, SMOTE, SVM

Abstract

This study evaluates three ensemble models designed to handle imbalanced datasets. Each model combines the hybrid nature-inspired imbalance handling algorithm (HNIHA) with the Matthews correlation coefficient (MCC) and the synthetic minority oversampling technique (SMOTE), paired with a different base classifier: support vector machine (SVM), random forest, or LightGBM. The focus is on the challenges posed by imbalanced datasets, in particular the balance between sensitivity and specificity. The HNIHA-guided SVM ensemble performed best, achieving an MCC of 0.8739, which indicates that it balances true positives and true negatives well. Its F1-score, precision, and recall reached 0.9767, 0.9545, and 1.0, respectively. Its prediction error was low, with a mean squared error (MSE) of 0.0384 and a root mean squared error (RMSE) of 0.1961. The HNIHA-guided random forest and HNIHA-guided LightGBM ensembles also performed strongly.
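
The metrics quoted in the abstract are related by standard formulas, so they can be cross-checked against each other. The following is a minimal sketch, not the authors' evaluation code: it recomputes the F1-score from the reported precision and recall and the RMSE from the reported MSE. The confusion-matrix counts (TP = 21, FP = 1, FN = 0, TN = 4) are a hypothetical assumption, chosen only because they reproduce the reported figures. The abstract does not state the confusion matrix or the test-set size.

```python
import math


def f1_from_precision_recall(precision: float, recall: float) -> float:
    """F1-score as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def mcc_from_counts(tp: int, fp: int, fn: int, tn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is taken as 0 when any marginal total is zero.
    return numerator / denominator if denominator else 0.0


# Values reported in the abstract for the HNIHA-guided SVM ensemble.
reported_precision = 0.9545
reported_recall = 1.0
reported_mse = 0.0384

f1 = f1_from_precision_recall(reported_precision, reported_recall)
rmse = math.sqrt(reported_mse)

# HYPOTHETICAL counts (an assumption, not taken from the paper): one
# confusion matrix that happens to be consistent with the reported metrics.
tp, fp, fn, tn = 21, 1, 0, 4
mcc = mcc_from_counts(tp, fp, fn, tn)

print(f"F1 from reported precision/recall: {f1:.4f}")
print(f"RMSE from reported MSE:            {rmse:.4f}")
print(f"MCC from hypothetical counts:      {mcc:.4f}")
```

The recomputed F1-score matches the reported 0.9767 to four decimals, and the RMSE agrees with the reported 0.1961 to within rounding of the MSE. Under the hypothetical counts, the MCC also rounds to the reported 0.8739, which suggests the reported figures are mutually consistent.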

Author Biographies

Tri Ratnaningsih, Universitas Gadjah Mada

Department of Clinical Pathology and Laboratory Medicine, Faculty of Medicine, Public Health and Nursing

Irianna Futri, Khon Kaen University

Department of International Technology and Innovation Management, International College

Elvaro Islami Muryadi, Adiwangsa Jambi University

Department of Public Health, Faculty of Health Sciences

Raksmey Phann, Seoul National University of Science and Technology

Department of Data Science

Su Sandi Hla Tun, Khon Kaen University

Department of Human Movement Sciences, Faculty of Associated Medical Sciences

Ritchie Natuan Caibigan, Batangas State University – The National Engineering University

Department of Computer Science and Information Technology, College of Informatics and Computing Sciences

Published

2024-09-27

How to Cite

[1]
D. C. E. Saputra, “HNIHA: Hybrid Nature-Inspired Imbalance Handling Algorithm to Addressing Imbalanced Datasets for Improved Classification: In Case of Anemia Identification”, Buletin Ilmiah Sarjana Teknik Elektro, vol. 6, no. 3, pp. 254–270, Sep. 2024.

Section

Article