Air pollution forecasting: Application of machine learning models to estimate PM2.5 index
DOI:
https://doi.org/10.54939/1859-1043.j.mst.FEE.2024.286-294
Keywords:
Air quality prediction; Machine learning; Random forest; SVR; XGBoost; PM2.5.
Abstract
In the context of ongoing industrialization, air pollution has become an urgent global problem that is particularly severe in large cities such as Hanoi (Vietnam) and Beijing (China). Air pollution, especially the concentration of fine particulate matter (PM2.5), is not only harmful to human health but also has significant negative impacts on the environment, the economy, and quality of life. This study aims to predict air pollution levels more accurately: with machine learning models, meteorologists can forecast pollution better and propose more effective mitigation solutions. The article uses a multivariate time series dataset of meteorological and air pollution indices from Beijing, China, covering 2010 to 2014. Four machine learning models (Lasso Regression, Support Vector Regression, Random Forest, and XGBoost) are evaluated, along with a Stack model that combines all four. Performance is measured with Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and the Coefficient of Determination (R²). Among these models, the Stack model provides the most accurate predictions of the PM2.5 index.
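The three evaluation metrics named in the abstract can be illustrated with a minimal sketch using their standard definitions. The function name `regression_metrics` and the sample values are hypothetical and chosen only for illustration; they are not taken from the article or its dataset.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (rmse, mae, r2) for two equal-length sequences of numbers."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    # RMSE: square root of the mean squared error.
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)

    # MAE: mean of the absolute errors.
    mae = sum(abs(e) for e in errors) / n

    # R^2: 1 - (residual sum of squares / total sum of squares).
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    # R^2 is undefined when every observed value is identical.
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return rmse, mae, r2


# Made-up PM2.5 values, for illustration only.
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
rmse, mae, r2 = regression_metrics(observed, predicted)
print(f"RMSE={rmse:.3f}  MAE={mae:.3f}  R2={r2:.3f}")
```

Lower RMSE and MAE indicate smaller prediction errors, while an R² closer to 1 indicates that the model explains more of the variance in the observed values. RMSE penalizes large errors more heavily than MAE, which is one reason to report both.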