Effect of Feature Scaling Pre-processing Techniques on Machine Learning Algorithms to Predict Particulate Matter Concentration for Gandhinagar, Gujarat, India
DOI:
https://doi.org/10.32628/IJSRST52411150
Keywords:
Particulate Matter, Machine Learning, Pre-processing Techniques, Data Scaling
Abstract
Particulate matter (PM) has been widely recognized as the primary factor responsible for air pollution, posing significant health hazards, particularly cardiovascular and respiratory diseases. Major sources of particulate matter include construction sites, power plants, industries, automobiles, landfills, agriculture, wildfires, brush/waste burning, wind-blown dust from open lands, pollen, and fragments of bacteria. Although various studies have been carried out to predict particulate matter concentration, only a handful of papers focus on data scaling as a pre-processing step and how it affects the prediction. For this study, Gandhinagar Smart City Development Limited, Gandhinagar, Gujarat provided air quality data from 26-01-2022 to 16-01-2023. The provided data pose several challenges, such as missing data, inconsistent data, and mixed data (numerical and categorical). Data pre-processing is an essential step in machine learning regression problems; its techniques include missing value handling, data scaling, outlier detection, feature selection/engineering, and imputation. This paper therefore aims to identify the effect of data scaling pre-processing techniques on predicting the concentration of particulate matter (PM10) for Gandhinagar, Gujarat. Data scaling is performed based on whether the data are normally distributed. Four data scaling techniques (Normalizer, Robust Scaler, Min-Max Scaler, and Standard Scaler) in combination with six machine learning algorithms (Multiple Linear Regressor, Support Vector Regressor, K-Nearest Neighbour Regressor, Decision Tree Regressor, Random Forest Regressor, and XGBoost Regressor) were compared to identify the best prediction model for PM10 concentration.
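The four scaling techniques named in the abstract can be illustrated with a minimal sketch. It is written in plain Python under the common textbook definitions of each scaler and is not the authors' implementation. The function names and the sample PM10 readings are hypothetical, chosen only to show how each transform treats an outlier.

```python
import math
import statistics


def min_max_scale(column):
    """Min-Max scaling: map a feature column linearly onto [0, 1]."""
    lo, hi = min(column), max(column)
    span = hi - lo
    return [(x - lo) / span if span else 0.0 for x in column]


def standard_scale(column):
    """Standard scaling: subtract the mean, divide by the population std."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return [(x - mean) / std if std else 0.0 for x in column]


def robust_scale(column):
    """Robust scaling: subtract the median, divide by the interquartile range."""
    median = statistics.median(column)
    q1, _, q3 = statistics.quantiles(column, n=4, method="inclusive")
    iqr = q3 - q1
    return [(x - median) / iqr if iqr else 0.0 for x in column]


def normalize_rows(rows):
    """Normalizer: rescale each sample (row) to unit Euclidean length."""
    scaled = []
    for row in rows:
        norm = math.sqrt(sum(v * v for v in row))
        scaled.append([v / norm if norm else 0.0 for v in row])
    return scaled


# Hypothetical PM10 readings (not from the study), with one extreme value.
pm10 = [40.0, 50.0, 60.0, 70.0, 280.0]

if __name__ == "__main__":
    print("min-max :", min_max_scale(pm10))
    print("standard:", standard_scale(pm10))
    print("robust  :", robust_scale(pm10))
    print("rows    :", normalize_rows([[3.0, 4.0], [40.0, 30.0]]))
```

The first three functions operate on one feature column at a time, whereas the Normalizer operates on each sample (row). Because the robust scaler relies on the median and interquartile range, the single extreme reading shifts the remaining values far less than it does under min-max or standard scaling.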
License
Copyright (c) IJSRST

This work is licensed under a Creative Commons Attribution 4.0 International License.