HNIHA: Hybrid Nature-Inspired Imbalance Handling Algorithm to Addressing Imbalanced Datasets for Improved Classification: In Case of Anemia Identification
DOI:
https://doi.org/10.12928/biste.v6i3.11306
Keywords:
Imbalanced Classification, Nature-Inspired Algorithm, MCC, SMOTE, SVM
Abstract
This study evaluates three ensemble models designed to handle imbalanced datasets. Each model incorporates the hybrid nature-inspired imbalance handling algorithm (HNIHA) with the Matthews correlation coefficient (MCC) and the synthetic minority oversampling technique (SMOTE), in conjunction with a different base classifier: support vector machine (SVM), random forest, or LightGBM. Our focus is on the challenges posed by imbalanced datasets, with emphasis on the balance between sensitivity and specificity. The HNIHA-guided SVM ensemble performed best, achieving an MCC of 0.8739, which indicates that it balances true positives and true negatives well. Its F1-score, precision, and recall reached 0.9767, 0.9545, and 1.0, respectively. Its prediction error was also low, with a mean squared error (MSE) of 0.0384 and a root mean squared error (RMSE) of 0.1961. The HNIHA-guided random forest and HNIHA-guided LightGBM ensembles also performed strongly.
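To make the reported metrics concrete, the sketch below computes MCC, F1-score, precision, recall, MSE, and RMSE from the four cells of a binary confusion matrix. The function name `binary_metrics` and the counts passed to it are assumptions for illustration: the abstract does not state the test-set size or the confusion matrix, and these counts were chosen only because they are consistent with the reported values.

```python
import math


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute common binary-classification metrics from confusion-matrix counts.

    Assumes labels are encoded as 0/1 and that no denominator is zero.
    """
    n = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also called sensitivity
    f1 = 2 * precision * recall / (precision + recall)

    # Matthews correlation coefficient: uses all four cells of the matrix.
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den

    # With 0/1 labels, each misclassification has a squared error of 1,
    # so the MSE equals the error rate.
    mse = (fp + fn) / n
    rmse = math.sqrt(mse)

    return {
        "mcc": mcc,
        "f1": f1,
        "precision": precision,
        "recall": recall,
        "mse": mse,
        "rmse": rmse,
    }


# Hypothetical counts (not taken from the paper), chosen to be
# consistent with the figures quoted in the abstract.
metrics = binary_metrics(tp=21, fp=1, fn=0, tn=4)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Under these assumed counts, every metric agrees with the abstract to within 0.0001. The MSE is 1/26, which rounds to 0.0385 and truncates to the quoted 0.0384. The example also shows why MCC is informative on imbalanced data: a single false positive among only five negatives lowers MCC far more than it lowers F1.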
License
Copyright (c) 2024 Dimas Chaerul Ekty Saputra, Tri Ratnaningsih, Irianna Futri, Elvaro Islami Muryadi, Raksmey Phann, Su Sandi Hla Tun, Ritchie Natuan Caibigan
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.