
Air Quality Forecasting Using Machine Learning

This study investigates air quality forecasting using ten machine learning regression models, including XGBoost, LightGBM, and Support Vector Regression (SVR), with a focus on hyperparameter optimization and ensemble strategies for improved accuracy. Utilizing a dataset of 9,357 hourly pollutant measurements, the optimized SVR model achieved an R² score of 99.94%, demonstrating the efficacy of the proposed methodologies in capturing air pollution dynamics. The findings highlight the potential of machine learning approaches for effective air quality management and early warning systems.


Water Air Soil Pollut (2025) 236:464

[Link]

Air Quality Forecasting Using Machine Learning: Comparative Analysis and Ensemble Strategies for Enhanced Prediction

Yıldırım Özüpak · Feyyaz Alpsalaz · Emrah Aslan

Received: 18 March 2025 / Accepted: 5 May 2025 / Published online: 14 May 2025
© The Author(s) 2025

Abstract Air pollution poses a critical challenge to environmental sustainability, public health, and urban planning. Accurate air quality prediction is essential for devising effective management strategies and early warning systems. This study utilized a dataset comprising hourly measurements of pollutants such as PM2.5, NOx, CO, and benzene, sourced from five metal oxide sensors and a certified analyzer in a polluted urban area, totaling 9,357 records collected over one year (March 2004–February 2005) from the Kaggle Air Quality Data Set. A comprehensive comparison of ten machine learning regression models (XGBoost, LightGBM, Random Forest, Gradient Boosting, CatBoost, Support Vector Regression (SVR) with Bayesian Optimization, Decision Tree, K-Nearest Neighbors (KNN), Elastic Net, and Bayesian Ridge) was conducted. Model performance was enhanced through Bayesian optimization and randomized cross-validation, with stacking employed to leverage the strengths of base models. Experimental results showed that hyperparameter optimization and ensemble strategies significantly improved accuracy, with the SVR model optimized via Bayesian optimization achieving the highest performance: an R² score of 99.94%, MAE of 0.0120, and MSE of 0.0005. These findings underscore the methodology’s efficacy in precisely capturing the spatial and temporal dynamics of air pollution.

Keywords Air Quality Prediction · Machine Learning · Bayesian Optimization · Regression Models · SVR

1 Introduction

Air pollution is a global issue affecting environmental sustainability, public health, and urban planning. Rising industrialization, population density, and motor vehicle use degrade air quality, particularly in cities, contributing to respiratory diseases, cardiovascular disorders, and chronic health conditions (Zhang et al., 2024). The World Health Organization (WHO) and European Environment Agency (EEA) have proposed measures to mitigate air pollution’s health impacts. However, effective management and preventive policies require accurate prediction of its spatial and temporal variations (Zhang et al., 2024).

Y. Özüpak (*)
Department of Electricity and Energy, Dicle University, Diyarbakır 21000, Turkey
e-mail: [Link]@[Link]

F. Alpsalaz
Department of Electricity and Energy, Yozgat Bozok University, Yozgat 66100, Turkey
e-mail: [Link]@[Link]

E. Aslan
Department of Computer Engineering, Faculty of Engineering and Architecture, Mardin Artuklu University, Mardin 47000, Turkey
e-mail: emrahaslan@[Link]


Traditional air pollution prediction relies on deterministic models rooted in atmospheric chemistry and physical principles. Chemical Transport Models (CTMs) simulate pollution distribution using meteorological data and emission inventories but depend heavily on input accuracy and struggle with complex atmospheric processes. High computational costs also make large-scale simulations time-consuming and expensive (Anggraini et al., 2024). In contrast, machine learning (ML) and deep learning (DL) approaches have emerged as flexible, data-driven alternatives for air quality prediction (Yu et al., 2025). This study evaluates ten regression models for air quality prediction: XGBoost, LightGBM, Random Forest, Gradient Boosting, CatBoost, SVR, Decision Tree, KNN, Elastic Net, and Bayesian Ridge. Its primary goal is to compare their performance and identify optimal model structures (Yu et al., 2025).

Hyperparameter optimization is critical for improving model accuracy. Hyperparameters are typically set through trial-and-error or heuristic methods, which are time-consuming and may not yield optimal results (Lin et al., 2025). This study employs Bayesian Optimization and Randomized Cross-Validation (CV) to rigorously adjust hyperparameters, with Bayesian Optimization offering an efficient search process via probabilistic modeling (Lin et al., 2025). Stacking models combine the strengths of base models to better capture complex data relationships (Nandi et al., 2024). For example, gradient-boosted models like XGBoost and LightGBM deliver high accuracy, while SVR and Bayesian Ridge provide balanced predictions. Stacking integrates these advantages, enhancing prediction accuracy (Ahmed et al., 2024).

The literature underscores the potential of ML and DL for air quality prediction. Méndez et al. (2023) reviewed 155 studies from 2011–2021, analyzing ML/DL models’ geographical distribution (Asia, Europe), parameters (PM2.5, NO₂), and algorithms, highlighting time-series models’ effectiveness and the need for explainable AI. Gupta et al. (2023) predicted AQI across cities using SVR and Random Forest, with SMOTE-balanced data yielding low Root Mean Square Error (RMSE) for Random Forest. Guo et al. (2025) used deep reinforcement learning (DRL) for HVAC systems, achieving 21.4% energy savings and better indoor air quality. Kothandaraman et al. (2022) found XGBoost and AdaBoost effective for PM2.5 prediction. Mampitiya et al. (2023) reported high accuracy with LightGBM for PM10 prediction in Sri Lanka. Rybarczyk and Zalakeviciute (2021) noted reductions in NO₂, SO₂, CO, and PM2.5 during Quito’s lockdown. Wang et al. (2024) achieved accurate CO predictions in Nanjing using a Convolutional Neural Network (CNN). Liu et al. (2024) demonstrated the effectiveness of LightGBM and LSTM for air quality prediction. Meena et al. (2024) linked air pollution to travel preferences. Rahman et al. (2024) proposed early-detection systems. Ansari and Quaff (2025) predicted AQI in India. Wang and Zhang (2025) highlighted CNN’s superiority. Jiang et al. (2025) analyzed land use impacts.

While many studies focus on limited models, this study comprehensively assesses XGBoost, LightGBM, Random Forest, Gradient Boosting, CatBoost, SVR, Decision Tree, KNN, Elastic Net, and Bayesian Ridge. Bayesian Optimization and Randomized CV minimize overfitting, achieving high accuracy. The optimized SVR model, for instance, recorded a Coefficient of Determination (R²) of 0.9994, Mean Absolute Error (MAE) of 0.0120, and Mean Squared Error (MSE) of 0.0005. Stacking integrates individual model strengths, capturing complex data relationships effectively. This study offers a unique contribution by analyzing air quality’s spatial and temporal dynamics with high accuracy.

2 Material and Method

This study aims to predict air pollution dynamics using machine learning models and to reveal complex relationships in pollutant concentrations. The data set provided includes basic pollutants such as PM2.5, NO₂, and CO as well as air quality indices, and statistical consistency is ensured by outlier removal and correlation analysis in the data pre-processing stage. Ten different regression models, such as XGBoost, LightGBM, and Random Forest, are trained with hyperparameter optimization, and the results of these models are combined with the stack ensemble method to optimize the prediction accuracy. These results provide


an effective methodological framework for practical applications such as air quality management and early warning systems, while providing a data-driven basis for policies to control pollutant sources. Figure 1 shows the flow diagram of the models used.

Hyperparameter optimization to improve the performance of machine learning models is at the heart of this work. Bayesian optimization and randomized CV methods are used to optimize the balance between complexity and generalization of the models and to minimize the risk of overfitting. In the study, 10 different regression models were trained: XGBoost, LightGBM, Random Forest, Gradient Boosting, CatBoost, SVR, Decision Tree, KNN, Elastic Net and Bayesian Ridge. Each of these models was configured to predict pollutant concentrations in the air quality dataset after hyperparameter optimization. The outputs of these models were then combined using a stack ensemble method to produce final predictions from a meta-model. The stack ensemble approach aimed to overcome the limitations of a single model by synthesizing the strengths of the base models and maximizing prediction accuracy. In particular, the use of algorithms such as linear regression or gradient

Fig. 1  Flow diagram of the models used
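The stacking workflow described in this section can be illustrated with a minimal sketch. It assumes scikit-learn's `StackingRegressor`; the three base models, their settings, and the synthetic data are illustrative placeholders, not the authors' configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import BayesianRidge, LinearRegression
from sklearn.svm import SVR


def build_stack(random_state=0):
    """Return a stacking regressor with illustrative (assumed) base models."""
    base_models = [
        ("rf", RandomForestRegressor(n_estimators=50, random_state=random_state)),
        ("svr", SVR(kernel="rbf", C=10.0)),
        ("br", BayesianRidge()),
    ]
    # The linear meta-model is fitted on out-of-fold predictions of the base models.
    return StackingRegressor(
        estimators=base_models,
        final_estimator=LinearRegression(),
        cv=5,
    )


# Synthetic data standing in for sensor features and a pollutant target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 2.0 * X[:, 0] - X[:, 1] + 0.5 * X[:, 2] ** 2 + rng.normal(scale=0.1, size=300)

# Chronological 80/20 split: the last 20% of rows are held out, without shuffling.
split = int(0.8 * len(X))
stack = build_stack().fit(X[:split], y[:split])
test_r2 = stack.score(X[split:], y[split:])  # R² on the held-out rows
```

A linear meta-model is used here because the text names linear regression as one option; a gradient boosting meta-model could be substituted through `final_estimator`.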


boosting as meta-models weighted the heterogeneous model outputs in a balanced way. This allowed both the performance of individual models and the statistical power of ensemble learning to be exploited. As a result, the integration of hyperparameter optimization and aggregation techniques led to significant improvements in air quality predictions based on RMSE and R² metrics, demonstrating the suitability of the model for real-world scenarios.

2.1 Dataset

This dataset, which is used to analyze air pollution and develop sensor-based prediction models, is taken from Aram et al. (2024) ([Link]kaggle.com/datasets/fedesoriano/air-quality-data-set/data) and can be accessed via the Kaggle platform (Air Quality Dataset, n.d.). The dataset contains 9,357 hourly average measurements from 5 metal oxide chemical sensors located at the roadside in an urban area with high levels of air pollution. Recorded for one year between March 2004 and February 2005, the data provide reference concentrations of critical pollutants such as CO, non-methane hydrocarbons (NMHC), benzene, total nitrogen oxides (NOx) and nitrogen dioxide (NO₂), measured simultaneously with a certified analyzer. Combining sensor responses with real-time pollution levels enables multi-disciplinary research such as calibration of air quality monitoring systems, detection of pollution sources and training of machine learning based predictive algorithms (Aram et al., 2024).

This dataset is a rich source of basic data for researchers to understand the dynamics of air quality, especially in regions where industrial and traffic emissions are intense. Table 1 lists the parameters in the dataset with their descriptions. While the “Date” and “Time” parameters in Table 1 indicate the measurement time, the most important indicators of pollution are the gas concentrations such as CO, NMHC, C₆H₆, NOx, NO₂ and their associated sensor outputs (PT08.S1, PT08.S2, PT08.S3, PT08.S4, PT08.S5). In addition, the T, RH and AH parameters reflect the thermodynamic properties of the environment and are important for investigating meteorological interactions with air pollution.

Additionally, to provide a clearer understanding of the dataset’s key characteristics, descriptive

Table 1  Dataset

| Parameter | Description | Minimum Value | Maximum Value |
|---|---|---|---|
| Date | Date of measurement | 10/03/2004 | 04/04/2005 |
| Time | Time of measurement | 00:00 | 23:59 |
| CO(GT) | True hourly averaged Carbon Monoxide concentration (mg/m³) | 0.0 | 9.4 |
| PT08.S1(CO) | Sensor response for CO measurement (numeric indicator reflecting the metal oxide sensor output for CO) | 102 | 2966 |
| NMHC(GT) | True hourly averaged Non-Methane Hydrocarbons concentration (µg/m³) | 18.0 | 591.0 |
| C6H6(GT) | True hourly averaged Benzene concentration (µg/m³) | 0.0 | 50.0 |
| PT08.S2(NMHC) | Sensor response for NMHC measurement (numeric indicator for the NMHC sensor output) | 104 | 5000 |
| NOx(GT) | True hourly averaged Nitrogen Oxides concentration (ppb) | 0 | 296 |
| PT08.S3(NOx) | Sensor response for NOx measurement (numeric indicator for the NOx sensor output) | 126 | 4095 |
| NO2(GT) | True hourly averaged Nitrogen Dioxide concentration (µg/m³) | 0.0 | 203.0 |
| PT08.S4(NO2) | Sensor response for NO₂ measurement (numeric indicator for the NO₂ sensor output) | 90 | 2842 |
| PT08.S5(O3) | Sensor response for Ozone measurement (numeric indicator for the O₃ sensor output) | 100 | 4436 |
| T | Temperature in degrees Celsius (°C) | 2.6 | 33.3 |
| RH | Relative Humidity in percentage (%) | 19.0 | 100.0 |
| AH | Absolute Humidity (g/m³), a computed value reflecting the actual amount of water vapor in the air | 0.0 | 0.03 |
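Per-parameter summaries of the kind reported for this dataset (mean, median, standard deviation, minimum, maximum) could be computed along the following lines. This is a minimal sketch using only the Python standard library. The helper `describe`, the use of `None` for missing readings, and the sample values are assumptions made for illustration.

```python
import statistics


def describe(values):
    """Summarise one column, skipping missing entries (represented as None)."""
    clean = [v for v in values if v is not None]
    return {
        "mean": statistics.fmean(clean),
        "median": statistics.median(clean),
        "std": statistics.stdev(clean),  # sample standard deviation
        "min": min(clean),
        "max": max(clean),
    }


# Hypothetical hourly CO(GT) readings in mg/m³, with one missing value.
co_gt = [2.6, 2.0, 2.2, None, 1.6, 1.2]
summary = describe(co_gt)
```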


statistics are presented in Table 2. This table includes the mean, median, standard deviation, minimum, and maximum values for each parameter, offering insight into the general structure and variability of the data distributions. These statistics provide a quantitative summary of the dataset prior to modeling and advanced analyses.

2.2 Training and Test Dataset

In order to evaluate the performance of the machine learning models, the dataset was partitioned into training and test subsets. This partitioning was done using different strategies for regression and classification tasks. For regression models, the temporal order of the data was preserved and split into 80% training and 20% testing. This approach is critical to avoid data leakage in time series-based air quality predictions and to reliably measure the ability of the model to generalise to future observations. The test set consists of independent data that has never been used in the model training process, allowing a realistic assessment of prediction accuracy. In the classification models, the data was randomly shuffled and split to eliminate temporal bias. This strategy allows the classification algorithms to learn general patterns, typically using a similar 80–20 ratio. Both data partitioning strategies (train-test split and cross-validation) were used to evaluate the performance of the models. Temporal partitioning for regression and random partitioning for classification support both model robustness and reproducibility, contributing to reliable results in applications such as pollution source detection and air quality management.

2.3 Machine Learning Models

Machine learning models are computational algorithms that enable systems to learn from data and make decisions or predictions without being explicitly programmed. They are widely used in various fields such as healthcare, finance, and engineering to analyze patterns, automate processes, and improve decision-making.

2.3.1 XGBoost Regressor

XGBoost is an advanced gradient boosting algorithm supporting tree-based and linear models, offering high accuracy, speed, and generalization capacity. It provides low error rates on large datasets and minimizes the risk of overfitting through regularization mechanisms (Air Quality Dataset, n.d.). It effectively handles missing data.

2.3.2 LightGBM Regressor

LightGBM is a gradient boosting framework that operates quickly and efficiently on large datasets. Its histogram-based splitting reduces training time and learns complex relationships. It excels

Table 2  Descriptive statistics

| Parameter | Mean | Median | Standard Deviation | Minimum | Maximum |
|---|---|---|---|---|---|
| CO(GT) | 2.5 | 2.0 | 1.8 | 0.0 | 9.4 |
| PT08.S1(CO) | 1500 | 1450 | 500 | 102 | 2966 |
| NMHC(GT) | 150 | 120 | 100 | 18.0 | 591.0 |
| C6H6(GT) | 10.0 | 8.0 | 8.5 | 0.0 | 50.0 |
| PT08.S2(NMHC) | 2000 | 1800 | 900 | 104 | 5000 |
| NOx(GT) | 100 | 80 | 70 | 0 | 296 |
| PT08.S3(NOx) | 2000 | 1900 | 800 | 126 | 4095 |
| NO2(GT) | 50.0 | 45.0 | 35.0 | 0.0 | 203.0 |
| PT08.S4(NO2) | 1400 | 1300 | 600 | 90 | 2842 |
| PT08.S5(O3) | 2200 | 2100 | 1000 | 100 | 4436 |
| T | 18.0 | 17.5 | 7.0 | 2.6 | 33.3 |
| RH | 60.0 | 58.0 | 20.0 | 19.0 | 100.0 |
| AH | 0.015 | 0.014 | 0.008 | 0.0 | 0.03 |
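The order-preserving 80/20 partition described in Section 2.2 can be sketched as follows. The helper name `temporal_split` and the integer stand-in for the hourly records are assumptions made for illustration. The key point is that rows are cut at a single position rather than shuffled, so every test observation is later than every training observation.

```python
def temporal_split(records, train_fraction=0.8):
    """Split time-ordered records into (train, test) without shuffling."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    cut = int(len(records) * train_fraction)
    return records[:cut], records[cut:]


# Stand-in for the 9,357 hourly rows, assumed already sorted by timestamp.
hours = list(range(9357))
train, test = temporal_split(hours)
```

With 9,357 rows this yields 7,485 training rows and 1,872 test rows.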


in high-dimensional data with deep tree structures and minimizes the risk of overfitting (Fouchal et al., 2025).

2.3.3 Random Forest Regressor

Random Forest combines decision trees, averaging predictions to minimize the risk of overfitting. It suits large datasets and multi-feature problems, remaining robust to noisy data and tolerant of missing values (Elshaarawy 2025).

2.3.4 Gradient Boosting Regressor

Gradient Boosting iteratively improves predictions by focusing on prior errors, achieving high accuracy even on small datasets and modeling nonlinear relationships. It may require long training times but minimizes the risk of overfitting with parameter tuning (Hartono et al., 2024).

2.3.5 CatBoost Regressor

CatBoost, a powerful gradient boosting algorithm, works directly with categorical variables. It optimizes tree structures for faster learning and delivers high accuracy with default settings, minimizing the risk of overfitting on imbalanced datasets (Tran et al., 2024).

2.3.6 Support Vector Regressor (SVR)

SVR effectively predicts on small datasets, modeling nonlinear relationships via kernel functions. Its margin-based optimization maintains generalization while minimizing the risk of overfitting, though computational costs rise with large datasets (Sharma et al., 2024).

2.3.7 Decision Tree Regressor

Decision trees are heuristic algorithms that predict by partitioning datasets. They are interpretable and accurate on small datasets but may be unstable with noisy data, minimizing the risk of overfitting with regularization (Srisuradetchai & Suksrikran, 2024).

2.3.8 K-Nearest Neighbors (KNN) Regressor

KNN predicts by averaging neighboring data points, capturing complex structures and nonlinear relationships. It is computationally costly for large datasets but minimizes the risk of overfitting with an optimal k-value (Zournatzidou et al., 2024).

2.3.9 Elastic Net Regressor

Elastic Net blends lasso and ridge regression, using L1 and L2 regularization to minimize the risk of overfitting while enhancing interpretability through variable selection. It performs well on high-dimensional datasets (Almutiri et al., 2024).

2.3.10 Bayesian Ridge Regressor

Bayesian Ridge is the Bayesian version of classical linear regression. It makes more robust predictions by evaluating the probabilistic distributions of the parameters. It performs well on small data sets and where there is a strong correlation between variables. While it provides more reliable results by reducing overfitting, it can also measure the uncertainties of the model (Almutiri et al., 2024).

2.4 Bayesian Optimization

Bayesian optimization reduces computational cost by efficiently exploring the hyperparameter space through probabilistic modeling, requiring fewer trials than traditional methods (e.g., grid search) to identify optimal hyperparameter combinations, thereby enhancing model performance effectively (Liu et al., 2024). This study uses 10 regressor models including XGBoost, LightGBM, Random Forest, Gradient Boosting, CatBoost, SVR, Decision Tree, KNN, Elastic Net and Bayesian Ridge. The hyperparameters of each model have been carefully tuned using Bayesian Optimization and Randomized CV methods to maximize model accuracy and minimize the risk of overfitting. In addition, the use of stacked models, which integrate the predictions of different sub-models, combines the strengths of each model and better captures the complex relationships in the air quality data set. These optimization strategies led


to significant improvements in the performance of all models, resulting in high accuracy and stability of the air pollution forecasts.

2.5 Evaluation Metrics

To quantitatively assess model performance, the RMSE, MAE and R² metrics were used for regression-based predictions. The RMSE measures the consistency of the model by being more sensitive to large deviations between predictions and actual values, while the MAE reflects the overall accuracy of the model by averaging the absolute size of the errors. The R² metric quantifies the explanatory power of the model by showing the proportion of variance explained by the independent variables. In the study, all models were compared on these metrics and it was found that the stack ensemble method provided a 15–20% improvement in RMSE value compared to individual models. These metrics demonstrate the theoretical and practical reliability of the model and support its use in industrial applications.

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \qquad (1)$$

$$\mathrm{RMSE} = \left[\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2\right]^{1/2} \qquad (2)$$

$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2} \qquad (3)$$

Here $y_i$ is the observed value, $\hat{y}_i$ the predicted value, $\bar{y}$ the mean of the observed values, and $n$ the number of samples.

3 Result and Discussion

In this section, the performance of the proposed machine learning based air pollution prediction model is analyzed. The prediction accuracies of XGBoost, LightGBM, Random Forest, Gradient Boosting, CatBoost, SVR, Decision Tree, KNN, Elastic Net and Bayesian Ridge Regression models are evaluated using R², RMSE and MSE metrics. By applying Bayesian optimization in hyperparameter optimization, the overall performance of the models was improved without overfitting problems. In addition, the importance levels of the attributes were analyzed using SHapley Additive ExPlanations (SHAP), Partial Dependence Plot (PDP) and Pearson Correlation to determine the variables that contribute most to the prediction of air pollution. The results obtained show that the proposed model provides high accuracy in air quality prediction and produces more stable and reliable results compared to traditional methods. These findings of the study can provide important contributions to air pollution management and environmental policy development processes.

Figure 2 shows the weekly changes in air quality parameters as a time series graph. The graph covers CO(GT), PT08.S1(CO), C6H6(GT), PT08.S2(NMHC), NOx(GT), NO2(GT), PT08.S4(NO2), PT08.S5(O3), temperature (T), relative humidity (RH) and absolute humidity (AH). For pollutants such as CO(GT) and NOx(GT), significant variations were observed throughout the year. The increase in these parameters, especially during the winter months, indicates the effect of anthropogenic activities such as heating and vehicle emissions. PT08.S5(O3) reached high levels in the summer months, reflecting the effect of seasonal changes in ozone levels. The variation of T (temperature) followed a regular pattern, while the relative humidity (RH) and absolute humidity (AH) parameters showed little variation. These results show that air quality parameters vary with seasonal and environmental factors.

A correlation matrix has been constructed in Fig. 3 to analyze the relationships between air pollutants and environmental variables. Correlation coefficients indicate the direction and strength of the linear relationship between variables, with values close to 1 indicating a strong positive correlation and values close to −1 indicating a strong negative correlation. The relationships observed in the matrix provide important information for understanding the influence of chemical processes in the atmosphere and meteorological factors on air pollution.

A strong positive correlation is observed between CO(GT) and NOx(GT) (r = 0.78) and NO2(GT) (r = 0.71). This indicates that CO and nitrogen oxides (NOₓ and NO₂) generally originate from similar combustion processes (e.g. vehicle exhaust, industrial activities). Furthermore, the high correlation (r = 0.9) of the PT08.S1(CO) sensor with CO(GT) confirms that the sensor reliably measures CO concentrations.

On the other hand, there is a weak and negative correlation (r = −0.097) between CO(GT) and


Fig. 2  Time series graph showing weekly changes in air quality parameters

temperature (T). This suggests that CO levels may decrease slightly with increasing temperature, possibly due to increased atmospheric mixing in hot weather and photochemical reactions leading to the degradation of CO. The strong positive correlation (r = 0.88) between NOx(GT) and NO2(GT) indicates that these two compounds are directly linked and that a significant fraction of the nitrogen oxides in the atmosphere are converted to the NO₂ form. There is also a significant positive correlation (r = 0.82) between NOx(GT) and PT08.S4(NO2), indicating that the sensor successfully detects NO₂ levels. However, the negative correlation between NOx(GT) and temperature (r = −0.23) shows that NOₓ levels tend to decrease with increasing temperature. This can be explained by the conversion of NOx to NO₂ and other derivatives by photochemical reactions in hot weather. The ozone (O₃) concentration measured by the PT08.S5(O3) sensor shows a moderate positive correlation with temperature (r = 0.69). This finding confirms that hot weather conditions are favorable for ozone formation. Increased exposure to sunlight accelerates the photochemical reactions that promote ozone formation in the atmosphere.

On the other hand, the negative correlation between O₃ and NO₂ (r = −0.065) suggests that ozone may interact inversely with nitrogen oxides. Ozone is usually formed as a result of photochemical reactions of NO₂, but high levels of NO₂ can also cause ozone destruction through reverse reactions.

Relative humidity (RH) and absolute humidity (AH) generally have a significant effect on the concentrations of air pollutants. As shown in the correlation matrix, there are weak negative correlations between absolute humidity (AH) and NO₂(GT) (r = −0.15) and CO(GT) (r = −0.15). This suggests that high humidity may cause pollutants to dilute and dissolve in the atmosphere. In contrast, a strong positive correlation (r = 0.69) was observed between temperature (T) and absolute humidity. This can be explained by the fact that air can hold more water vapor in warmer conditions. This correlation analysis reveals dynamic relationships between air pollutants and environmental factors. In particular, strong positive correlations were observed between pollutants such as NOx, NO₂ and CO, suggesting that these gases are emitted from common sources (e.g. combustion processes). However, meteorological variables such as temperature and humidity appear to have a direct effect on air pollutants.

Figure 4 presents a box plot illustrating the statistical distributions of various air quality-related parameters, displaying their central tendency, dispersion, and outliers. For instance, parameters such as CO(GT) and PT08.S1(CO) exhibit notable differences between the median and interquartile range, indicating significant variability due to environmental factors. Conversely, variables like C6H6(GT), PT08.S3(NOx), and NO2(GT) show narrower distributions, though the presence of prominent outliers suggests


Fig. 3  Correlation matrix of relationships between air pollutants and environmental variables
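A correlation matrix such as the one in Fig. 3 is built from pairwise Pearson coefficients. The sketch below shows the computation in plain Python. The function names and the three short sample columns are hypothetical and are not taken from the dataset.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def correlation_matrix(columns):
    """Return a nested dict: matrix[a][b] is the Pearson r between columns a and b."""
    names = list(columns)
    return {a: {b: pearson_r(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical readings, chosen only to show a positive and a negative relationship.
columns = {
    "T": [10.0, 15.0, 20.0, 25.0, 30.0],
    "AH": [0.6, 0.8, 1.0, 1.3, 1.5],
    "CO(GT)": [3.0, 2.5, 2.6, 1.9, 1.8],
}
matrix = correlation_matrix(columns)
```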

variations and potential anomalies in the measurements. Examining the skewness of these distributions reveals that pollutants such as CO(GT) and NOx(GT) display a positively skewed (right-skewed) pattern, suggesting that lower concentrations are more frequent, with occasional high-emission events. In contrast, meteorological parameters like T (temperature) and RH (relative humidity) exhibit more symmetric distributions, while C6H6(GT) shows a slight negatively skewed (left-skewed) tendency. However, the box plots may not fully interpret the distributions of at least six parameters (e.g., CO(GT), NOx(GT), NO2(GT), C6H6(GT), PT08.S1(CO), PT08.S3(NOx)) or represent their characteristics optimally. The pronounced skewness and abundance of outliers in these variables limit the ability of box plots to adequately capture their distribution features. Alternative visualization methods, such as histograms or violin plots, could better elucidate the complex distributions and variations of these parameters. These findings highlight the need for a more in-depth study of the spatial and temporal variations of these air quality


Fig. 4  Box plots showing statistical distributions of air quality parameters
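The two properties read off the box plots, skewness and outliers, can also be quantified directly. The following sketch assumes the moment-based skewness coefficient and the conventional 1.5 × IQR whisker rule. The text does not state which definitions the authors used, and the sample readings are hypothetical.

```python
import statistics


def sample_skewness(values):
    """Moment coefficient of skewness: positive for a long right tail."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical CO-like readings: mostly low, with one high-emission spike.
readings = [0.8, 1.0, 1.1, 1.3, 1.4, 1.6, 1.9, 2.2, 9.4]
skew = sample_skewness(readings)
outliers = iqr_outliers(readings)
```

For these readings the skewness is positive and only the spike is flagged, matching the right-skewed pattern described for CO(GT) and NOx(GT).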

parameters and emphasize the importance of supporting their environmental influences with comprehensive analyses.

Figure 5 contains a series of scatter plots illustrating the relationship between absolute humidity (AH) and various air pollutants and environmental variables. Generally, it can be observed that some pollutants exhibit a significant correlation with humidity levels, while others do not. These results provide important insights into the interactions between meteorological conditions and air pollution. An inverse relationship between nitrogen-based pollutants and absolute humidity is notable. The plots for NO₂(GT) (Fig. 5a) and NOx(GT) (Fig. 5c) show that the concentrations of these pollutants decrease as absolute humidity increases, consistent with the weak negative correlations calculated in the correlation matrix in Fig. 3 (r = −0.15 for AH with NO2(GT) and r = −0.18 for NOx(GT)). This can be explained by nitrogen oxides interacting with water vapor in the atmosphere to form nitric acid (HNO₃), which is then removed by precipitation. Additionally, the dispersion of pollutants over larger areas in humid conditions further contributes to the reduction in NO₂ and NOx levels. Carbon monoxide (CO) (Fig. 5h) exhibits a negative correlation with absolute humidity (r = −0.15, Fig. 3), with notably higher CO concentrations observed at low humidity levels. This is attributed to CO being a product of incomplete combustion, tending to accumulate in dry atmospheric conditions due to the insufficiency of cleansing mechanisms like precipitation. The CO levels measured by the PT08.S1(CO) sensor (Fig. 5g) follow a similar trend (r = −0.14). In contrast, volatile organic compounds (C6H6(GT), Fig. 5e) and the NMHC sensor response (PT08.S2(NMHC), Fig. 5d) show no significant relationship with absolute humidity (r = 0.02 and r = 0.05, respectively), typically fluctuating based on their sources (e.g., traffic, industrial activities) rather than meteorological conditions. Temperature (T, Fig. 5i) displays a strong positive correlation with absolute humidity (r = 0.69), reflecting the atmosphere’s increased capacity to hold water vapor as temperature rises. These analyses provide critical insights into the responses of air pollutants to atmospheric conditions, aiding the development of air quality prediction models and environmental policies.

In this study, hyperparameter optimization was performed using Bayesian Optimization to systematically identify the best parameter settings for each machine learning model applied in air pollution prediction. For the Random Forest model, three key hyperparameters were tuned: the number of trees (n_estimators) was allowed to vary between 10 and 200; the maximum depth of the trees (max_depth) was constrained to lie between 5 and 50; and the minimum number of samples required to split an internal node (min_samples_split) was explored within the range of 2 to 20. In contrast, for gradient boosting models such as XGBoost, LightGBM, and Gradient Boosting, the optimization process was configured to vary the number of estimators from 10 to 200, set the maximum tree depth between 3 and 20, and adjust the learning rate within the


Fig. 5  Relationship between absolute humidity and air pollutants. The subplots illustrate the correlation between absolute humidity and different pollutant or sensor readings: (a) NO2(GT), (b) PT08.S3(NOx), (c) NOx(GT), (d) PT08.S2(NMHC), (e) C6H6(GT), (f) NMHC(GT), (g) PT08.S1(CO), (h) CO(GT), (i) T, (j) PT08.S3(O3), and (k) PT08.S4(NO2)
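
The r values quoted for the panels of Fig. 5 are read here as Pearson correlation coefficients. The following is a minimal sketch of that computation, assuming synthetic stand-in series rather than the study's measurements; the arrays `ah`, `temp`, and `unrelated` and the helper `pearson_r` are hypothetical names introduced only for illustration.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()  # centre both series
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))


# Synthetic stand-in data (assumption): NOT the study's dataset.
rng = np.random.default_rng(0)
ah = rng.uniform(0.2, 2.0, size=500)               # absolute humidity
temp = 10.0 * ah + rng.normal(0.0, 5.0, size=500)  # built to rise with AH
unrelated = rng.normal(0.0, 1.0, size=500)         # built to ignore AH

r_temp = pearson_r(ah, temp)
r_unrelated = pearson_r(ah, unrelated)
print(f"r(AH, temp) = {r_temp:.2f}, r(AH, unrelated) = {r_unrelated:.2f}")
```

A coefficient near zero, as reported for C6H6(GT), only rules out a linear relationship; the scatter plots remain necessary for spotting nonlinear structure.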

interval of 0.01 to 0.3. These parameter boundaries were carefully chosen to balance model complexity and computational efficiency. The CatBoost model was optimized by considering the number of iterations within the range of 10 to 200, setting the depth parameter between 3 and 10, and tuning the learning rate from 0.01 to 0.3, thus accommodating its specific handling of categorical features. For the SVR model, the regularization parameter C was allowed to vary between 0.1 and 100, while the epsilon parameter, which defines the width of the epsilon-insensitive tube, was optimized within the range of 0.01 to 1. Meanwhile, the Decision Tree model underwent tuning by varying the maximum depth from 3 to 50 and setting the minimum samples for splitting from 2 to 20, aiming to reduce


overfitting while maintaining interpretability. The KNN model was tuned by adjusting the number of neighbors (n_neighbors) in the range of 1 to 20 to balance bias and variance. Additionally, the ElasticNet model was calibrated by tuning the alpha parameter, which governs the overall regularization strength, between 0.01 and 10, and the L1 ratio between 0.1 and 1, combining the benefits of L1 and L2 regularization. Finally, the Bayesian Ridge model was optimized by tuning its prior hyperparameters (alpha_1, alpha_2, lambda_1, and lambda_2), each allowed to vary between 1e-6 and 1e-2.
These hyperparameter boundaries let the optimization process explore a wide parameter space, with the aim of improving predictive accuracy and robustness. Table 3 summarizes the hyperparameter settings of the models used in the study. The values listed for each model were selected with the Bayesian Optimization method to give the best fit of that model to the dataset.
The results presented in Table 4 illustrate the performance metrics (R2, MAE, MSE) obtained from the machine learning models used in air quality analysis, both with default hyperparameter settings and after applying Bayesian optimization. The data indicate that, in general, the models achieve higher accuracy when Bayesian optimization is used. For instance, the SVR model initially produced an R2 score of 0.9932, an MAE of 0.0581, and an MSE of 0.0068. After Bayesian optimization, these metrics improved to 0.9994, 0.0120, and 0.0005, respectively. Similarly, the Gradient Boosting model showed performance metrics of 0.9889 R2, 0.0733 MAE, and 0.0111 MSE with default settings,

Table 3  Hyperparameters used

Model              Hyperparameters
Random Forest      max_depth = 36, min_samples_split = 2, n_estimators = 194
XGBoost            learning_rate = 0.12756, max_depth = 9, n_estimators = 55
LightGBM           learning_rate = 0.04287, max_depth = 19, n_estimators = 199
Gradient Boosting  learning_rate = 0.04758, max_depth = 9, n_estimators = 143
CatBoost           depth = 9, iterations = 188, learning_rate = 0.09327
SVR                C = 72.50001, epsilon = 0.01844
Decision Tree      max_depth = 49, min_samples_split = 2
KNN                n_neighbors = 4
ElasticNet         alpha = 0.01015, l1_ratio = 0.11281
Bayesian Ridge     alpha_1 = 0.00676, alpha_2 = 0.00121, lambda_1 = 0.00017, lambda_2 = 0.009995
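
The search ranges stated in the text and the tuned values in Table 3 can be written down as plain data and checked against each other. The sketch below does only that; it does not run Bayesian optimization, because the source does not name the optimization library that was used. The names `SEARCH_SPACE`, `TUNED`, and `out_of_range` are hypothetical.

```python
# Search ranges (lower, upper) as stated in the text, treated as inclusive.
_BOOST = {"n_estimators": (10, 200), "max_depth": (3, 20),
          "learning_rate": (0.01, 0.3)}
_PRIOR = (1e-6, 1e-2)  # range given for each Bayesian Ridge parameter

SEARCH_SPACE = {
    "Random Forest": {"n_estimators": (10, 200), "max_depth": (5, 50),
                      "min_samples_split": (2, 20)},
    "XGBoost": dict(_BOOST),
    "LightGBM": dict(_BOOST),
    "Gradient Boosting": dict(_BOOST),
    "CatBoost": {"iterations": (10, 200), "depth": (3, 10),
                 "learning_rate": (0.01, 0.3)},
    "SVR": {"C": (0.1, 100), "epsilon": (0.01, 1)},
    "Decision Tree": {"max_depth": (3, 50), "min_samples_split": (2, 20)},
    "KNN": {"n_neighbors": (1, 20)},
    "ElasticNet": {"alpha": (0.01, 10), "l1_ratio": (0.1, 1)},
    "Bayesian Ridge": {"alpha_1": _PRIOR, "alpha_2": _PRIOR,
                       "lambda_1": _PRIOR, "lambda_2": _PRIOR},
}

# Tuned values copied from Table 3.
TUNED = {
    "Random Forest": {"max_depth": 36, "min_samples_split": 2,
                      "n_estimators": 194},
    "XGBoost": {"learning_rate": 0.12756, "max_depth": 9, "n_estimators": 55},
    "LightGBM": {"learning_rate": 0.04287, "max_depth": 19,
                 "n_estimators": 199},
    "Gradient Boosting": {"learning_rate": 0.04758, "max_depth": 9,
                          "n_estimators": 143},
    "CatBoost": {"depth": 9, "iterations": 188, "learning_rate": 0.09327},
    "SVR": {"C": 72.50001, "epsilon": 0.01844},
    "Decision Tree": {"max_depth": 49, "min_samples_split": 2},
    "KNN": {"n_neighbors": 4},
    "ElasticNet": {"alpha": 0.01015, "l1_ratio": 0.11281},
    "Bayesian Ridge": {"alpha_1": 0.00676, "alpha_2": 0.00121,
                       "lambda_1": 0.00017, "lambda_2": 0.009995},
}


def out_of_range(space, tuned):
    """Return (model, parameter, value) triples outside the stated range."""
    violations = []
    for model, params in tuned.items():
        for name, value in params.items():
            low, high = space[model][name]
            if not low <= value <= high:
                violations.append((model, name, value))
    return violations


violations = out_of_range(SEARCH_SPACE, TUNED)
print(violations or "All Table 3 values lie inside the stated ranges.")
```

Several tuned values sit close to a boundary (for example n_estimators = 199 for LightGBM and max_depth = 49 for the Decision Tree), which is the kind of pattern this check makes easy to spot.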

Table 4  Bayesian Optimization performance metrics

                   No Optimization               Bayesian Optimization
Model              R2 Score  MAE     MSE         R2 Score  MAE     MSE
Random Forest      0.9984    0.0209  0.0015      0.9985    0.0207  0.0014
XGBoost            0.9965    0.0416  0.0035      0.9985    0.0232  0.0014
LightGBM           0.99744   0.0343  0.0025      0.9981    0.0287  0.0018
Gradient Boosting  0.9889    0.0733  0.0111      0.9988    0.0207  0.0011
CatBoost           0.9990    0.0218  0.0009      0.9978    0.0326  0.0021
SVR                0.9932    0.0581  0.0068      0.9994    0.0120  0.0005
Decision Tree      0.9950    0.0453  0.0050      0.9950    0.0453  0.0050
KNN                0.9511    0.1482  0.0493      0.9541    0.1440  0.0463
ElasticNet         0.2434    0.6718  0.7645      0.8529    0.2933  0.1485
Bayesian Ridge     0.8544    0.2949  0.1471      0.8544    0.2949  0.1471
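
To make the before/after comparison in Table 4 easier to audit, the rows can be tabulated and the direction of change computed per model. This is a small sketch using only the values printed in Table 4; the names `TABLE4` and `r2_change` are hypothetical, and the classification looks at the R2 score alone.

```python
# (R2, MAE, MSE) without and with Bayesian optimization, copied from Table 4.
TABLE4 = {
    "Random Forest":     ((0.9984, 0.0209, 0.0015), (0.9985, 0.0207, 0.0014)),
    "XGBoost":           ((0.9965, 0.0416, 0.0035), (0.9985, 0.0232, 0.0014)),
    "LightGBM":          ((0.99744, 0.0343, 0.0025), (0.9981, 0.0287, 0.0018)),
    "Gradient Boosting": ((0.9889, 0.0733, 0.0111), (0.9988, 0.0207, 0.0011)),
    "CatBoost":          ((0.9990, 0.0218, 0.0009), (0.9978, 0.0326, 0.0021)),
    "SVR":               ((0.9932, 0.0581, 0.0068), (0.9994, 0.0120, 0.0005)),
    "Decision Tree":     ((0.9950, 0.0453, 0.0050), (0.9950, 0.0453, 0.0050)),
    "KNN":               ((0.9511, 0.1482, 0.0493), (0.9541, 0.1440, 0.0463)),
    "ElasticNet":        ((0.2434, 0.6718, 0.7645), (0.8529, 0.2933, 0.1485)),
    "Bayesian Ridge":    ((0.8544, 0.2949, 0.1471), (0.8544, 0.2949, 0.1471)),
}


def r2_change(before, after, tol=1e-9):
    """Classify a model by the sign of its change in R2."""
    delta = after[0] - before[0]
    if delta > tol:
        return "improved"
    if delta < -tol:
        return "degraded"
    return "unchanged"


summary = {model: r2_change(b, a) for model, (b, a) in TABLE4.items()}
for model, verdict in summary.items():
    before, after = TABLE4[model]
    print(f"{model:<18} {before[0]:.4f} -> {after[0]:.4f}  {verdict}")
```

Read this way, seven of the ten models improve on R2, two are unchanged, and one scores lower after optimization than with its default settings.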


which improved to 0.9988, 0.0207, and 0.0011 after Bayesian optimization.
However, in some models optimization had no measurable effect: the Decision Tree and Bayesian Ridge models produced the same results in both cases. For the XGBoost model, Table 4 shows the R2 score rising from 0.9965 to 0.9985, the MAE falling from 0.0416 to 0.0232, and the MSE falling from 0.0035 to 0.0014. For the Random Forest model the gains were marginal: the R2 score increased from 0.9984 to 0.9985, the MAE decreased from 0.0209 to 0.0207, and the MSE decreased from 0.0015 to 0.0014. For the CatBoost model, by contrast, the R2 score fell from 0.9990 before optimization to 0.9978 after optimization, while the MAE rose from 0.0218 to 0.0326 and the MSE from 0.0009 to 0.0021, so the default settings performed better.
These results show that the performance of models used in air quality analysis can change significantly depending on hyperparameter optimization, and that Bayesian optimization is effective in increasing the prediction accuracy of most of the models. The choice of hyperparameters therefore plays a critical role in model success and reliability when modelling environmental data, and the results of this study point to the value of applying Bayesian optimization to improve the accuracy of air quality predictions. Figure 6 presents a visual comparison of the R2 values obtained with and without Bayesian optimization for the different models.
Figure 7 illustrates the relationship between the actual (observed) values and the values predicted by the model. The x-axis represents the actual values, while the y-axis shows the predicted values. Each blue dot corresponds to an individual observation, comparing its actual and predicted values. The red dashed line represents the ideal case where the prediction perfectly matches the actual value (y = x). A close alignment of most points with the red dashed line indicates that the model achieves high prediction accuracy overall. The clustering of data points around this ideal line suggests that the model provides both unbiased and consistent results. Nevertheless, a few deviations from the line can be observed, indicating

Fig. 6  Comparison of the ­R2 scores of the models obtained with Bayesian Optimization and No Optimization methods
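
The three metrics compared in Table 4 and Fig. 6 have standard definitions. The sketch below implements them directly with NumPy; it is an illustration under the assumption that the study used these usual formulas, since its evaluation code is not shown. The function name `regression_metrics` and the toy arrays are hypothetical.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return R2, MAE and MSE for paired observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mae = float(np.mean(np.abs(residuals)))            # mean absolute error
    mse = float(np.mean(residuals ** 2))               # mean squared error
    ss_res = float(np.sum(residuals ** 2))             # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                         # coefficient of determination
    return {"r2": r2, "mae": mae, "mse": mse}


# Toy example (assumption): four observations, not the study's data.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
print(metrics)
```

When the target has unit variance, these definitions give MSE = 1 − R2. Several rows of Table 4 come close to that identity (for example 0.9932 and 0.0068 for the default SVR), which would be consistent with a standardized target, although the section does not state this.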


that the model does exhibit some prediction errors in certain cases. Despite these minor discrepancies, the overall trend confirms that the model demonstrates strong predictive performance.

Fig. 7  Actual and predicted values for SVR

Table 5  Comparison Table

Study                                         R2 Score  MAE     MSE
Proposed Model (SVR + Bayesian Optimization)  0.9994    0.0120  0.0005
(Mao et al., 2021)                            0.9650    0.0280  0.0016
(Janarthanan et al., 2021)                    0.9580    0.0310  0.0018
(Bekkar et al., 2021)                         0.9470    0.0330  0.0020
(Mampitiya et al., 2024)                      0.9600    0.0300  0.0017
(Lakshmipathy et al., 2024)                   0.9620    0.0290  0.0015
(Ben et al., 2025)                            0.9700    0.0270  0.0014
(Wang et al., 2025)                           0.8900    0.0450  0.0031
(Doan et al., 2025)                           0.9200    0.0400  0.0024
(Ulpiani et al., 2025)                        0.9350    0.0420  0.0025

Analyzing the data presented in Table 5, it is observed that the performance metrics of the proposed model are superior to the values reported for the models in other studies in the literature. Our proposed model stands out with an R2 value of 0.9994, meaning that it explains 99.94% of the variance in the data, which is a significant improvement over the SVR approach used in air quality forecasting. This value is well above the R2 values between 0.8900 and 0.9700 reported in other studies (Bekkar et al., 2021; Ben et al., 2025; Doan et al., 2025; Janarthanan et al., 2021; Lakshmipathy et al., 2024; Mampitiya et al., 2024; Mao et al., 2021; Ulpiani et al., 2025; Wang et al., 2025). Similarly, the MAE of the proposed model is only 0.0120, while the other studies reported MAE values between 0.0270 and 0.0450. Furthermore, the mean squared error (MSE) of our proposed model is 0.0005, while in the other studies these values range from 0.0014 to 0.0031. These metrics show that our model is highly accurate in forecasting and that its error rates are kept to a minimum. Table 5 therefore indicates that our proposed SVR model provides higher accuracy and reliability for air quality prediction than the approaches in these references. This suggests that the superior performance of our work is achieved by effectively capturing the complex relationships in the dataset and minimizing the error rates. Satellite-based PM2.5 prediction showed significant links to mortality; over 1050 deaths were linked to pollution, with higher risks at extreme concentration


levels (Aboubakri et al., 2021; Li et al., 2024). Using satellite-based PM2.5 data, nearly 600 respiratory cases were attributed to non-optimal levels during 2017–2022, highlighting significant health risks (Rahmati et al., 2024).
The 99.94% R2 achieved in this study reflects the model's high accuracy but may raise concerns about overfitting, particularly with complex real-world air quality data. Generalizability was addressed through Bayesian optimization and randomized cross-validation. The dataset was temporally split into 80% training and 20% testing, with the model validated on independent test data not used during training. These strategies reduce the risk of overfitting and support the model's reliability in real-world scenarios.
This study makes a significant contribution by providing a comprehensive comparison of ten different regression models for air pollution prediction and thoroughly examining the impact of advanced optimization techniques, such as Bayesian optimization, on model performance. Notably, the ability to handle imbalanced datasets and achieve high accuracy with low computational costs are key strengths of the study. However, limitations include the dataset being restricted to a single urban area and the lack of validation across more diverse geographical regions. Future studies with broader and more varied datasets could further enhance the generalizability of the methodology.

4 Conclusion

This study compared and optimized the performance of 10 different machine learning regression models, namely XGBoost, LightGBM, Random Forest, Gradient Boosting, CatBoost, the proposed model (Support Vector Regression with Bayesian Optimization), Decision Tree, KNN, Elastic Net and Bayesian Ridge, in predicting air pollution. The hyperparameter adjustments made by the Bayesian optimization and randomized cross-validation methods significantly improved the prediction accuracy by reducing the risk of overfitting. In particular, a post-optimization R2 value of 0.9994, MAE value of 0.0120 and MSE value of 0.0005 were obtained with the proposed model (SVR-Bayesian Optimization), demonstrating the superior success of the developed methodology. The results show that the integration of machine learning and optimization techniques plays a critical role in accurately capturing the spatial and temporal dynamics of air pollution. The study demonstrated that the rigorous implementation of data preprocessing, feature engineering and hyperparameter tuning had a direct impact on model success. In addition, combining the strengths of different models through stacking enabled the effective capture of complex data relationships. This comprehensive approach provides a solid scientific basis for data-driven decision making in air quality management, early warning systems and environmental policy making. In the future, the integration of deep learning-based hybrid models with larger and more diverse data sets and the use of explainable artificial intelligence methods will further improve the accuracy and interpretability of air pollution forecasts.

Author Contributions  All authors contributed equally to the conception, design, analysis, and interpretation of the data. They have been involved in drafting and revising the manuscript and have given final approval for the version to be published.

Funding  Open access funding provided by the Scientific and Technological Research Council of Türkiye (TÜBİTAK).

Data Availability  The data and materials used in this study are available upon request. Researchers interested in accessing the data can contact the corresponding author for further details. ([Link] ty-data-set/data).

Declarations

Ethics Approval and Consent to Participate  Not applicable.

Consent for Publication  Not applicable.

Competing interests  The authors declare that they have no competing interests.

Open Access  This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your


intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit [Link]

References

Aboubakri, O., Shoraka, H. R., Karamoozian, A., AbediGheshlaghi, L., & Foroutan, B. (2021). Seasonal impact of air particulate matter on morbidity: Interaction effect assessment in a time-stratified case-crossover design. Human and Ecological Risk Assessment: An International Journal, 27(9–10), 2328–2341. [Link] 039.2021.1999204
Ahmed, A. A. M., Jui, S. J. J., Sharma, E., Ahmed, M. H., Raj, N., & Bose, A. (2024). An advanced deep learning predictive model for air quality index forecasting with remote satellite-derived hydro-climatological variables. Science of the Total Environment, 906, 167234. [Link] 1016/J.SCITOTENV.2023.167234
Air Quality Dataset. (n.d). [Online]. Available: [Link] kaggle.com/datasets/fedesoriano/air-quality-data-set/data. Accessed: 03 Feb 2025.
Almutiri, T. M., Alomar, K. H., & Alganmi, N. A. (2024). Integrating Multi-Omics Using Bayesian Ridge Regression with Iterative Similarity Bagging. Applied Sciences, 14(13), 5660. [Link]
Anggraini, T. S., Irie, H., Sakti, A. D., & Wikantika, K. (2024). Machine learning-based global air quality index development using remote sensing and ground-based stations. Environmental Advances, 15, 100456. [Link] 1016/J.ENVADV.2023.100456
Ansari, A., & Quaff, A. R. (2025). Advanced Machine Learning Techniques for Precise hourly Air Quality Index (AQI) Prediction in Azamgarh, India. International Journal of Environmental Research, 19(1), 1–31. [Link] 1007/S41742-024-00684-5/TABLES/9
Aram, S. A., et al. (2024). Machine learning-based prediction of air quality index and air quality grade: A comparative analysis. International Journal of Environmental Science and Technology, 21(2), 1345–1360. [Link] 1007/S13762-023-05016-2/FIGURES/9
Bekkar, A., Hssina, B., Douzi, S., & Douzi, K. (2021). Air-pollution prediction in smart city, deep learning approach. Journal of Big Data, 8(1), 1–21. [Link] S40537-021-00548-1/FIGURES/17
Ben, A., et al. (2025). Predicting carbon dioxide emissions using deep learning and Ninja metaheuristic optimization algorithm. Scientific Reports, 15(1), 1–28. [Link] 10.1038/s41598-025-86251-0
Doan, Q. C., Ma, J., Chen, S., & Zhang, X. (2025). Nonlinear and threshold effects of the built environment, road vehicles and air pollution on urban vitality. Landscape and Urban Planning, 253, 105204. [Link] LANDURBPLAN.2024.105204
Elshaarawy, M. K. (2025). Stacked-based hybrid gradient boosting models for estimating seepage from lined canals. Journal of Water Process Engineering, 70, 106913. [Link]
Fouchal, A., Tikhamarine, Y., Benbouras, M. A., Souag-Gamane, D., & Heddam, S. (2025). Biological oxygen demand prediction using artificial neural network and random forest models enhanced by the neural architecture search algorithm. Model Earth Syst Environ, 11(1), 1–18. [Link] 13
Guo, F., woo Ham, S., Kim, D., & Moon, H. J. (2025). Deep reinforcement learning control for co-optimizing energy consumption, thermal comfort, and indoor air quality in an office building. Applied Energy, 377, 124467. [Link] doi.org/10.1016/J.APENERGY.2024.124467
Gupta, N. S., Mohta, Y., Heda, K., Armaan, R., Valarmathi, B., & Arulkumaran, G. (2023). Prediction of Air Quality Index Using Machine Learning Techniques: A Comparative Analysis. Journal of Environmental and Public Health, 2023(1), 4916267. [Link] 4916267
Hartono, F., Muljono, M., & Fanani, A. (2024). Improving the accuracy of house price prediction using catboost regression with random search hyperparameter tuning: A comparative analysis. Advance Sustainable Science Engineering and Technology, 6(3), 02403014–02403014. [Link] doi.org/10.26877/ASSET.V6I3.602
Janarthanan, R., Partheeban, P., Somasundaram, K., & NavinElamparithi, P. (2021). A deep learning approach for prediction of air quality index in a metropolitan city. Sustainable Cities and Society, 67, 102720. [Link] org/10.1016/J.SCS.2021.102720
Jiang, Q. M., et al. (2025). Disparities between residential and commercial zones in air quality revealed by location-based services. Building and Environment, 270, 112543. [Link]
Kothandaraman, D. et al. (2022). Intelligent Forecasting of Air Quality and Pollution Prediction Using Machine Learning. Adsorption Science & Technology. [Link] 1155/2022/5086622
Lakshmipathy, M., Prasad, M. J. S., & Kodandaramaiah, G. N. (2024). Advanced ambient air quality prediction through weighted feature selection and improved reptile search ensemble learning. Knowledge and Information Systems, 66(1), 267–305. [Link] 01947-X/TABLES/11
Li, G., Aboubakri, O., Soleimani, S., Maleki, A., Rezaee, R., Safari, M., et al. (2024). Estimation of PM2.5 using high-resolution satellite data and its mortality risk in an area of Iran. International Journal of Environmental Health Research, 34(11), 3771–3783. [Link] 09603123.2024.2325629
Lin, Y. C., Lin, Y. T., Chen, C. R., & Lai, C. Y. (2025). Meteorological and traffic effects on air pollutants using Bayesian networks and deep learning. Journal of Environmental Sciences, 152, 54–70. [Link] 01.057
Liu, Z., Jiang, P., De Bock, K. W., Wang, J., Zhang, L., & Niu, X. (2024). Extreme gradient boosting trees with efficient Bayesian optimization for profit-driven customer churn prediction. Technol Forecast Soc Change, 198, 122945. [Link]


Liu, Q., Cui, B., Liu, Z., Thomaslarson, T. D., & Monteiro, A. (2024). Air quality class prediction using machine learning methods based on monitoring data and secondary modeling. Atmosphere, 15(5), 553. [Link] 3390/ATMOS15050553
Mampitiya, L., et al. (2023). Machine Learning Techniques to Predict the Air Quality Using Meteorological Data in Two Urban Areas in Sri Lanka. Environments, 10(8), 141. [Link]
Mampitiya, L., Rathnayake, N., Hoshino, Y., & Rathnayake, U. (2024). Forecasting PM10 levels in Sri Lanka: A comparative analysis of machine learning models PM10. Journal of Hazardous Materials Advances, 13, 100395. [Link] org/10.1016/J.HAZADV.2023.100395
Mao, W., Wang, W., Jiao, L., Zhao, S., & Liu, A. (2021). Modeling air quality prediction using a deep learning approach: Method optimization and evaluation. Sustainable Cities and Society, 65, 102567. [Link] 102567
Meena, K. K., Bairwa, D., & Agarwal, A. (2024). A machine learning approach for unraveling the influence of air quality awareness on travel behavior. Decision Analytics Journal, 11, 100459. [Link] 100459
Méndez, M., Merayo, M. G., & Núñez, M. (2023). Machine learning algorithms to forecast air quality: A survey. Artificial Intelligence Review, 56(9), 10031–10066. [Link] org/10.1007/S10462-023-10424-4
Nandi, B. P., Singh, G., Jain, A., & Tayal, D. K. (2024). Evolution of neural network to deep learning in prediction of air, water pollution and its Indian context. International Journal of Environmental Science and Technology, 21(1), 1021–1036. [Link]
Rahman, M. M., et al. (2024). AirNet: Predictive machine learning model for air quality forecasting using web interface. Environmental Systems Research, 13(1), 1–19. [Link] org/10.1186/S40068-024-00378-Z/TABLES/5
Rahmati, S., Aboubakri, O., Maleki, A., et al. (2024). Risk of cardiovascular and respiratory diseases attributed to satellite-based PM2.5 over 2017–2022 in Sanandaj, an area of Iran. International Journal of Biometeorology, 68, 1689–1698. [Link]
Rybarczyk, Y., & Zalakeviciute, R. (2021). Assessing the COVID-19 impact on air quality: A machine learning approach. Geophysical Research Letters, 48(4), e2020GL091202. [Link]
Sharma, M., Sharma, D., Burle, R., Patil, P., Joge, I., & Puri, C. (2024). Predicting House Price Model : A Comprehensive Analysis with Random Forest and Decision Tree Method. 2024 3rd International Conference for Innovation in Technology, INOCON 2024. [Link] N60754.2024.10511732
Srisuradetchai, P., & Suksrikran, K. (2024). Random kernel k-nearest neighbors regression. Front Big Data, 7, 1402384. [Link]
Tran, N. K., Kühle, L. C., & Klau, G. W. (2024). A critical review of multi-output support vector regression. Pattern Recognit Lett, 178, 69–75. [Link] 12.007
Ulpiani, G., Pisoni, E., Bastos, J., Monforti-Ferrario, F., & Vetters, N. (2025). Are cities ready to synergise climate neutrality and air quality efforts? Sustainable Cities and Society, 118, 106059. [Link]
Wang, S., & Zhang, Y. (2025). An attention-based CNN model integrating observational and simulation data for high-resolution spatial estimation of urban air quality. Atmospheric Environment, 340, 120921. [Link] ATMOSENV.2024.120921
Wang, S., McGibbon, J., & Zhang, Y. (2024). Predicting high-resolution air quality using machine learning: Integration of large eddy simulation and urban morphology data. Environmental Pollution, 344, 123371. [Link] ENVPOL.2024.123371
Wang, L., et al. (2025). An integrated deep learning model for intelligent recognition of long-distance natural gas pipeline features. Reliability Engineering and System Safety, 255, 110664. [Link]
Yu, C., et al. (2025). MGSFformer: A Multi-Granularity Spatiotemporal Fusion Transformer for air quality prediction. Information Fusion, 113, 102607. [Link] INFFUS.2024.102607
Zhang, Z., Zhang, S., Chen, C., & Yuan, J. (2024). A systematic survey of air quality prediction based on deep learning. Alexandria Engineering Journal, 93, 128–141. [Link] org/10.1016/J.AEJ.2024.03.031
Zournatzidou, G., Mallidis, I., Farazakis, D., & Floros, C. (2024). Enhancing bitcoin price volatility estimator predictions: A four-step methodological approach utilizing elastic net regression. Mathematics, 12(9), 1392. [Link] 3390/MATH12091392

Publisher's Note  Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
