Data Pre-Processing Technique for Enhancing Healthcare Data Quality Using Artificial Intelligence

Authors

  • Arati K Kale  Department of Computer Engineering, Pune University, Pune, Maharashtra, India
  • Dr. Dev Ras Pandey  Department of Computer Science and Engineering, Kalinga University, Naya Raipur, Chhattisgarh, India

DOI:

https://doi.org/10.32628/IJSRST52411130

Keywords:

Data Pre-Processing, Healthcare Data, Artificial Intelligence, Data Quality, Medical

Abstract

Healthcare datasets are frequently high-dimensional, noisy, and irregular, and they often contain missing values and imbalanced classes. These problems can reduce the effectiveness of machine learning algorithms, so healthcare data must be pre-processed before learning to make it suitable for classification or prediction. This paper proposes a data pre-processing technique for enhancing healthcare data quality using artificial intelligence. The pre-processing comprises three steps: handling missing values, detecting outliers, and handling imbalanced data. Missing values are imputed using a KNN-based approach, outliers are detected using a cluster-based algorithm, and imbalanced data are rebalanced using SMOTE and random resampling. Several machine learning classification algorithms are then used to assess the resulting data quality. A real-time healthcare dataset is used to evaluate the proposed approach in terms of accuracy, sensitivity, specificity, precision, and F-measure. Comparing model performance with and without pre-processed data shows that the chosen pre-processing techniques have a considerable positive impact on the model's performance.
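
The three pre-processing steps named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify the algorithms' details, so the function names (knn_impute, kmeans_outliers, smote), the use of k-means as the cluster-based detector, the mean + 3 standard deviations distance threshold, the parameter defaults, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def knn_impute(X, k=3):
    """Fill each NaN with the mean of that column over the k nearest rows.
    Distances use only the features observed in both rows (assumed scheme)."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    for i in np.where(np.isnan(X).any(axis=1))[0]:
        miss = np.isnan(X[i])
        dists = np.full(len(X), np.inf)
        for j in range(len(X)):
            if j == i:
                continue
            shared = ~miss & ~np.isnan(X[j])
            if shared.any():
                dists[j] = np.sqrt(np.mean((X[i, shared] - X[j, shared]) ** 2))
        order = np.argsort(dists)
        for col in np.where(miss)[0]:
            donors = [j for j in order
                      if np.isfinite(dists[j]) and not np.isnan(X[j, col])][:k]
            if donors:  # a value with no usable donor stays NaN
                filled[i, col] = X[donors, col].mean()
    return filled

def kmeans_outliers(X, n_clusters=2, n_iter=20, z=3.0, seed=0):
    """Flag points unusually far from their k-means centroid.
    The mean + z*std threshold is an assumed rule, not taken from the paper."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), n_clusters, replace=False)]
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for c in range(n_clusters):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    dist = d.min(axis=1)  # distance to the nearest final centroid
    return dist > dist.mean() + z * dist.std()

def smote(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic minority samples by interpolating between a
    minority point and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    k = min(k, len(X_min) - 1)
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), n_new)
    nb = neighbours[base, rng.integers(0, k, n_new)]
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[nb] - X_min[base])

# Synthetic stand-in for a healthcare table: 50 majority and 10 minority rows.
rng = np.random.default_rng(42)
X = rng.normal(size=(60, 3))
y = np.array([0] * 50 + [1] * 10)
X[y == 1] += 2.0          # shift the minority class
X[5, 1] = np.nan          # inject missing values
X[20, 2] = np.nan

X_filled = knn_impute(X)                    # step 1: missing values
outlier_mask = kmeans_outliers(X_filled)    # step 2: outlier detection
X_clean, y_clean = X_filled[~outlier_mask], y[~outlier_mask]
n_new = int((y_clean == 0).sum() - (y_clean == 1).sum())
X_syn = smote(X_clean[y_clean == 1], n_new) # step 3: rebalancing
X_bal = np.vstack([X_clean, X_syn])
y_bal = np.concatenate([y_clean, np.ones(n_new, dtype=int)])
```

In practice, the oversampling step would be applied to the training split only, so that synthetic samples do not leak into the data used to measure accuracy, sensitivity, specificity, precision, and F-measure.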

Published

2024-02-29

Section

Research Articles

How to Cite

Arati K Kale, Dr. Dev Ras Pandey, "Data Pre-Processing Technique for Enhancing Healthcare Data Quality Using Artificial Intelligence", International Journal of Scientific Research in Science and Technology (IJSRST), Online ISSN: 2395-602X, Print ISSN: 2395-6011, Volume 11, Issue 1, pp. 299-309, January-February 2024. Available at DOI: https://doi.org/10.32628/IJSRST52411130