Air Quality Forecasting Using Machine Learning
Received: 18 March 2025 / Accepted: 5 May 2025 / Published online: 14 May 2025
© The Author(s) 2025
Abstract Air pollution poses a critical challenge to environmental sustainability, public health, and urban planning. Accurate air quality prediction is essential for devising effective management strategies and early warning systems. This study utilized a dataset comprising hourly measurements of pollutants such as PM2.5, NOx, CO, and benzene, sourced from five metal oxide sensors and a certified analyzer in a polluted urban area, totaling 9,357 records collected over one year (March 2004–February 2005) from the Kaggle Air Quality Data Set. A comprehensive comparison of ten machine learning regression models (XGBoost, LightGBM, Random Forest, Gradient Boosting, CatBoost, Support Vector Regression (SVR) with Bayesian Optimization, Decision Tree, K-Nearest Neighbors (KNN), Elastic Net, and Bayesian Ridge) was conducted. Model performance was enhanced through Bayesian optimization and randomized cross-validation, with stacking employed to leverage the strengths of base models. Experimental results showed that hyperparameter optimization and ensemble strategies significantly improved accuracy, with the SVR model optimized via Bayesian optimization achieving the highest performance: an R² score of 99.94%, MAE of 0.0120, and MSE of 0.0005. These findings underscore the methodology's efficacy in precisely capturing the spatial and temporal dynamics of air pollution.
Keywords Air Quality Prediction · Machine Learning · Bayesian Optimization · Regression Models · SVR
1 Introduction
464 Page 2 of 17 Water Air Soil Pollut (2025) 236:464
Traditional air pollution prediction relies on deterministic models rooted in atmospheric chemistry and physical principles. Chemical Transport Models (CTMs) simulate pollution distribution using meteorological data and emission inventories but depend heavily on input accuracy and struggle with complex atmospheric processes. High computational costs also make large-scale simulations time-consuming and expensive (Anggraini et al., 2024). In contrast, machine learning (ML) and deep learning (DL) approaches have emerged as flexible, data-driven alternatives for air quality prediction (Yu et al., 2025).
This study evaluates ten regression models for air quality prediction: XGBoost, LightGBM, Random Forest, Gradient Boosting, CatBoost, SVR, Decision Tree, KNN, Elastic Net, and Bayesian Ridge. Its primary goal is to compare their performance and identify optimal model structures (Yu et al., 2025). Hyperparameter optimization is critical for improving model accuracy. Hyperparameters are typically set through trial-and-error or heuristic methods, which are time-consuming and may not yield optimal results (Lin et al., 2025). This study employs Bayesian Optimization and Randomized Cross-Validation (CV) to rigorously adjust hyperparameters, with Bayesian Optimization offering an efficient search process via probabilistic modeling (Lin et al., 2025). Stacking models combine the strengths of base models to better capture complex data relationships (Nandi et al., 2024). For example, gradient-boosted models like XGBoost and LightGBM deliver high accuracy, while SVR and Bayesian Ridge provide balanced predictions. Stacking integrates these advantages, enhancing prediction accuracy (Ahmed et al., 2024).
The literature underscores the potential of ML and DL for air quality prediction. Méndez et al. (2023) reviewed 155 studies from 2011–2021, analyzing the geographical distribution of ML/DL models (Asia, Europe), parameters (PM2.5, NO₂), and algorithms, highlighting the effectiveness of time-series models and the need for explainable AI. Gupta et al. (2023) predicted AQI across cities using SVR and Random Forest, with SMOTE-balanced data yielding low Root Mean Square Error (RMSE) for Random Forest. Guo et al. (2025) used DRL for HVAC systems, achieving 21.4% energy savings and better indoor air quality. Kothandaraman et al. (2022) found XGBoost and AdaBoost effective for PM2.5 prediction. Mampitiya et al. (2023) reported high accuracy with LightGBM for PM10 prediction in Sri Lanka. Rybarczyk and Zalakeviciute (2021) noted reductions in NO₂, SO₂, CO, and PM2.5 during Quito's lockdown. Wang et al. (2024) achieved accurate CO predictions in Nanjing using a Convolutional Neural Network (CNN). Liu et al. (2024) demonstrated the effectiveness of LightGBM and LSTM for air quality prediction. Meena et al. (2024) linked air pollution to travel preferences. Rahman et al. (2024) proposed early-detection systems. Ansari and Quaff (2025) predicted AQI in India. Wang and Zhang (2025) highlighted CNN's superiority. Jiang et al. (2025) analyzed land use impacts.
While many studies focus on a limited set of models, this study comprehensively assesses XGBoost, LightGBM, Random Forest, Gradient Boosting, CatBoost, SVR, Decision Tree, KNN, Elastic Net, and Bayesian Ridge. Bayesian Optimization and Randomized CV are used to limit overfitting while achieving high accuracy. The optimized SVR model, for instance, recorded a coefficient of determination (R²) of 0.9994, a Mean Absolute Error (MAE) of 0.0120, and a Mean Squared Error (MSE) of 0.0005. Stacking integrates individual model strengths, capturing complex data relationships effectively. This study offers a unique contribution by analyzing the spatial and temporal dynamics of air quality with high accuracy.
2 Material and Method
This study aims to predict air pollution dynamics using machine learning models and to reveal complex relationships in pollutant concentrations. The dataset includes basic pollutants such as PM2.5, NO₂, and CO as well as air quality indices; statistical consistency is ensured by outlier removal and correlation analysis in the data pre-processing stage. Ten different regression models, such as XGBoost, LightGBM, and Random Forest, are trained with hyperparameter optimization, and the results of these models are combined with the stacking ensemble method to optimize the prediction accuracy. These results provide
an effective methodological framework for practical applications such as air quality management and early warning systems, while providing a data-driven basis for policies to control pollutant sources. Figure 1 shows the flow diagram of the models used.
Hyperparameter optimization to improve the performance of machine learning models is at the heart of this work. Bayesian optimization and randomized CV methods are used to balance the complexity and generalization of the models and to minimize the risk of overfitting. In the study, 10 different regression models were trained: XGBoost, LightGBM, Random Forest, Gradient Boosting, CatBoost, SVR, Decision Tree, KNN, Elastic Net and Bayesian Ridge. Each of these models was configured to predict pollutant concentrations in the air quality dataset after hyperparameter optimization. The outputs of these models were then combined using a stacking ensemble method to produce final predictions from a meta-model. The stacking ensemble approach aimed to overcome the limitations of a single model by synthesizing the strengths of the base models and maximizing prediction accuracy. In particular, the use of algorithms such as linear regression or gradient
boosting as meta-models weighted the heterogeneous model outputs in a balanced way. This allowed both the performance of individual models and the statistical power of ensemble learning to be exploited. As a result, the integration of hyperparameter optimization and aggregation techniques led to significant improvements in air quality predictions based on RMSE and R² metrics, demonstrating the suitability of the model for real-world scenarios.
2.1 Dataset
The dataset, which is used to analyze air pollution and develop sensor-based prediction models, is taken from Aram et al. (2024) and can be accessed via the Kaggle platform (kaggle.com/datasets/fedesoriano/air-quality-data-set/data; Air Quality Dataset, n.d.). The dataset contains 9,357 hourly average measurements from 5 metal oxide chemical sensors located at the roadside in an urban area with high levels of air pollution. Recorded for one year between March 2004 and February 2005, the data provide reference concentrations of critical pollutants such as CO, non-methane hydrocarbons (NMHC), benzene, total nitrogen oxides (NOx) and nitrogen dioxide (NO₂), measured simultaneously with a certified analyzer. Combining sensor responses with real-time pollution levels enables multi-disciplinary research such as calibration of air quality monitoring systems, detection of pollution sources and training of machine learning based predictive algorithms (Aram et al., 2024).
This dataset is a rich source of basic data for researchers to understand the dynamics of air quality, especially in regions where industrial and traffic emissions are intense. Table 1 lists the parameters in the dataset with their descriptions. While the "Date" and "Time" parameters in Table 1 indicate the measurement time, the most important indicators of pollution are the gas concentrations such as CO, NMHC, C6H6, NOx, NO₂ and their associated sensor outputs (PT08.S1, PT08.S2, PT08.S3, PT08.S4, PT08.S5). In addition, the T, RH and AH parameters reflect the thermodynamic properties of the environment and are important for investigating meteorological interactions with air pollution.
Additionally, to provide a clearer understanding of the dataset's key characteristics, descriptive
Table 1 Dataset
Parameter Description Minimum Value Maximum Value
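Per-parameter summary statistics (mean, median, standard deviation, minimum, maximum) of the kind used to characterize this dataset can be computed with Python's standard library. The following is a minimal sketch, not the authors' code: the function name `describe` and the sample readings are hypothetical and chosen only for illustration.

```python
import statistics


def describe(values):
    """Return mean, median, sample standard deviation, min and max of a numeric column."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std": statistics.stdev(values),  # sample standard deviation (n - 1 denominator)
        "min": min(values),
        "max": max(values),
    }


# Hypothetical hourly CO(GT) readings (illustration only, not taken from the dataset)
co_sample = [2.6, 2.0, 2.2, 2.2, 1.6]
summary = describe(co_sample)
```

Applying `describe` to every column of the dataset would yield one row per parameter, which is the layout a descriptive-statistics table typically follows.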
statistics are presented in Table 2. This table includes the mean, median, standard deviation, minimum, and maximum values for each parameter, offering insight into the general structure and variability of the data distributions. These statistics provide a quantitative summary of the dataset prior to modeling and advanced analyses.
2.2 Training and Test Dataset
In order to evaluate the performance of the machine learning models, the dataset was partitioned into training and test subsets. This partitioning was done using different strategies for regression and classification tasks. For regression models, the temporal order of the data was preserved and split into 80% training and 20% testing. This approach is critical to avoid data leakage in time series-based air quality predictions and to reliably measure the ability of the model to generalize to future observations. The test set consists of independent data that has never been used in the model training process, allowing a realistic assessment of prediction accuracy. In the classification models, the data was randomly shuffled and split to eliminate temporal bias. This strategy allows the classification algorithms to learn general patterns, typically using a similar 80–20 ratio. Both data partitioning strategies (train-test split and cross-validation) were used to evaluate the performance of the models. Temporal partitioning for regression and random partitioning for classification support both model robustness and reproducibility, contributing to reliable results in applications such as pollution source detection and air quality management.
2.3 Machine Learning Models
Machine learning models are computational algorithms that enable systems to learn from data and make decisions or predictions without being explicitly programmed. They are widely used in various fields such as healthcare, finance, and engineering to analyze patterns, automate processes, and improve decision-making.
2.3.1 XGBoost Regressor
XGBoost is an advanced gradient boosting algorithm supporting tree-based and linear models, offering high accuracy, speed, and generalization capacity. It provides low error rates on large datasets and minimizes the risk of overfitting through regularization mechanisms (Air Quality Dataset, n.d.). It effectively handles missing data.
2.3.2 LightGBM Regressor
LightGBM is a gradient boosting framework that operates quickly and efficiently on large datasets. Its histogram-based splitting reduces training time and learns complex relationships. It excels
in high-dimensional data with deep tree structures and minimizes the risk of overfitting (Fouchal et al., 2025).
2.3.3 Random Forest Regressor
Random Forest combines decision trees, averaging predictions to minimize the risk of overfitting. It suits large datasets and multi-feature problems, remaining robust to noisy data and tolerant of missing values (Elshaarawy, 2025).
2.3.4 Gradient Boosting Regressor
2.3.8 K-Nearest Neighbors (KNN) Regressor
KNN predicts by averaging neighboring data points, capturing complex structures and nonlinear relationships. It is computationally costly for large datasets but minimizes the risk of overfitting with an optimal k-value (Zournatzidou et al., 2024).
2.3.9 Elastic Net Regressor
Elastic Net blends lasso and ridge regression, using L1 and L2 regularization to minimize the risk of overfitting while enhancing interpretability through variable selection. It performs well on high-dimensional datasets (Almutiri et al., 2024).
to significant improvements in the performance of all models, resulting in high accuracy and stability of the air pollution forecasts.
2.5 Evaluation Metrics
To quantitatively assess model performance, the RMSE, MAE and R² metrics were used for regression-based predictions. The RMSE measures the consistency of the model by being more sensitive to large deviations between predictions and actual values, while the MAE reflects the overall accuracy of the model by averaging the absolute size of the errors. The R² metric quantifies the explanatory power of the model by showing the proportion of variance explained by the independent variables. In the study, all models were compared on these metrics and it was found that the stacking ensemble method provided a 15–20% improvement in RMSE value compared to individual models. These metrics demonstrate the theoretical and practical reliability of the model and support its use in industrial applications.

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \tag{1}$$

$$\mathrm{RMSE} = \left[\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2\right]^{1/2} \tag{2}$$

$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2} \tag{3}$$

Here $y_i$ is the observed value, $\hat{y}_i$ the predicted value, $\bar{y}$ the mean of the observed values, and $n$ the number of observations.
3 Result and Discussion
In this section, the performance of the proposed machine learning based air pollution prediction model is analyzed. The prediction accuracies of XGBoost, LightGBM, Random Forest, Gradient Boosting, CatBoost, SVR, Decision Tree, KNN, Elastic Net and Bayesian Ridge Regression models are evaluated using R², RMSE and MSE metrics. By applying Bayesian optimization in hyperparameter optimization, the overall performance of the models was improved without overfitting problems. In addition, the importance levels of the attributes were analyzed using SHapley Additive exPlanations (SHAP), Partial Dependence Plot (PDP) and Pearson Correlation to determine the variables that contribute most to the prediction of air pollution. The results obtained show that the proposed model provides high accuracy in air quality prediction and produces more stable and reliable results compared to traditional methods. These findings can contribute to air pollution management and environmental policy development processes.
Figure 2 shows the weekly changes in air quality parameters as a time series graph. The graph covers CO(GT), PT08.S1(CO), C6H6(GT), PT08.S2(NMHC), NOx(GT), NO2(GT), PT08.S4(NO2), PT08.S5(O3), temperature (T), relative humidity (RH) and absolute humidity (AH). For pollutants such as CO(GT) and NOx(GT), significant variations were observed throughout the year. The increase in these parameters, especially during the winter months, indicates the effect of anthropogenic activities such as heating and vehicle emissions. PT08.S5(O3) reached high levels in the summer months, reflecting the effect of seasonal changes in ozone levels. The variation of T (temperature) followed a regular pattern, while the relative humidity (RH) and absolute humidity (AH) parameters showed little variation. These results show that air quality parameters vary with seasonal and environmental factors.
A correlation matrix has been constructed in Fig. 3 to analyze the relationships between air pollutants and environmental variables. Correlation coefficients indicate the direction and strength of the linear relationship between variables, with values close to 1 indicating a strong positive correlation and values close to −1 indicating a strong negative correlation. The relationships observed in the matrix provide important information for understanding the influence of chemical processes in the atmosphere and meteorological factors on air pollution.
A strong positive correlation is observed between CO(GT) and NOx(GT) (r = 0.78) and NO2(GT) (r = 0.71). This indicates that CO and nitrogen oxides (NOₓ and NO₂) generally originate from similar combustion processes (e.g. vehicle exhaust, industrial activities). Furthermore, the high correlation (r = 0.9) of the PT08.S1(CO) sensor with CO(GT) confirms that the sensor reliably measures CO concentrations.
On the other hand, there is a weak and negative correlation (r = −0.097) between CO(GT) and
temperature (T). This suggests that CO levels may decrease slightly with increasing temperature, possibly due to increased atmospheric mixing in hot weather and photochemical reactions leading to the degradation of CO. The strong positive correlation (r = 0.88) between NOx(GT) and NO2(GT) indicates that these two compounds are directly linked and that a significant fraction of the nitrogen oxides in the atmosphere are converted to the NO₂ form. There is also a significant positive correlation (r = 0.82) between NOx(GT) and PT08.S4(NO2), indicating that the sensor successfully detects NO₂ levels. However, the negative correlation between NOx(GT) and temperature (r = −0.23) shows that NOₓ levels tend to decrease with increasing temperature. This can be explained by the conversion of NOₓ to NO₂ and other derivatives by photochemical reactions in hot weather. The ozone (O₃) concentration measured by the PT08.S5(O3) sensor shows a moderate positive correlation with temperature (r = 0.69). This finding confirms that hot weather conditions are favorable for ozone formation. Increased exposure to sunlight accelerates the photochemical reactions that promote ozone formation in the atmosphere.
On the other hand, the negative correlation between O₃ and NO₂ (r = −0.065) suggests that ozone may interact inversely with nitrogen oxides. Ozone is usually formed as a result of photochemical reactions of NO₂, but high levels of NO₂ can also cause ozone destruction through reverse reactions.
Relative humidity (RH) and absolute humidity (AH) generally have a significant effect on the concentrations of air pollutants. As shown in the correlation matrix, there are weak negative correlations between absolute humidity (AH) and NO2(GT) (r = −0.15) and CO(GT) (r = −0.15). This suggests that high humidity may cause pollutants to dilute and dissolve in the atmosphere. In contrast, a strong positive correlation (r = 0.69) was observed between temperature (T) and absolute humidity. This can be explained by the fact that air can hold more water vapor in warmer conditions. This correlation analysis reveals dynamic relationships between air pollutants and environmental factors. In particular, strong positive correlations were observed between pollutants such as NOₓ, NO₂ and CO, suggesting that these gases are emitted from common sources (e.g. combustion processes). However, meteorological variables such as temperature and humidity appear to have a direct effect on air pollutants.
Figure 4 presents a box plot illustrating the statistical distributions of various air quality-related parameters, displaying their central tendency, dispersion, and outliers. For instance, parameters such as CO(GT) and PT08.S1(CO) exhibit notable differences between the median and interquartile range, indicating significant variability due to environmental factors. Conversely, variables like C6H6(GT), PT08.S3(NOx), and NO2(GT) show narrower distributions, though the presence of prominent outliers suggests
Fig. 3 Correlation matrix of relationships between air pollutants and environmental variables
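Each cell of a correlation matrix such as the one in Fig. 3 is a Pearson coefficient between two variables. The following minimal sketch computes it from its definition using only the standard library; the function name `pearson_r` and the sample readings are hypothetical and are not taken from the dataset.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-deviations and sums of squared deviations
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # Note: a constant column (zero variance) would make this division undefined
    return cov / math.sqrt(var_x * var_y)


# Hypothetical readings (illustration only)
co = [1.0, 2.0, 3.0, 4.0]
nox = [10.0, 20.0, 30.0, 40.0]   # rises with CO
temp = [30.0, 25.0, 20.0, 15.0]  # falls as CO rises
r_pos = pearson_r(co, nox)
r_neg = pearson_r(co, temp)
```

Evaluating `pearson_r` for every pair of columns fills the full matrix; values near 1 or −1 indicate strong linear association, values near 0 a weak one.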
variations and potential anomalies in the measurements. Examining the skewness of these distributions reveals that pollutants such as CO(GT) and NOx(GT) display a positively skewed (right-skewed) pattern, suggesting that lower concentrations are more frequent, with occasional high-emission events. In contrast, meteorological parameters like T (temperature) and RH (relative humidity) exhibit more symmetric distributions, while C6H6(GT) shows a slight negatively skewed (left-skewed) tendency. However, the box plots may not fully convey the distributions of at least six parameters (e.g., CO(GT), NOx(GT), NO2(GT), C6H6(GT), PT08.S1(CO), PT08.S3(NOx)) or represent their characteristics optimally. The pronounced skewness and abundance of outliers in these variables limit the ability of box plots to adequately capture their distribution features. Alternative visualization methods, such as histograms or violin plots, could better elucidate the complex distributions and variations of these parameters. These findings highlight the need for a more in-depth study of the spatial and temporal variations of these air quality
parameters and emphasize the importance of supporting their environmental influences with comprehensive analyses.
Figure 5 contains a series of scatter plots illustrating the relationship between absolute humidity (AH) and various air pollutants and environmental variables. Generally, it can be observed that some pollutants exhibit a significant correlation with humidity levels, while others do not. These results provide important insights into the interactions between meteorological conditions and air pollution. An inverse relationship between nitrogen-based pollutants and absolute humidity is notable. The plots for NO2(GT) (Fig. 5a) and NOx(GT) (Fig. 5c) show that the concentrations of these pollutants decrease as absolute humidity increases, consistent with the weak negative correlations calculated in the correlation matrix in Fig. 3 (r = −0.15 for AH with NO2(GT) and r = −0.18 for NOx(GT)). This can be explained by nitrogen oxides interacting with water vapor in the atmosphere to form nitric acid (HNO₃), which is then removed by precipitation. Additionally, the dispersion of pollutants over larger areas in humid conditions further contributes to the reduction in NO₂ and NOₓ levels. Carbon monoxide (CO) (Fig. 5h) exhibits a negative correlation with absolute humidity (r = −0.15, Fig. 3), with notably higher CO concentrations observed at low humidity levels. This is attributed to CO being a product of incomplete combustion, tending to accumulate in dry atmospheric conditions due to the insufficiency of cleansing mechanisms like precipitation. The CO levels measured by the PT08.S1(CO) sensor (Fig. 5g) follow a similar trend (r = −0.14). In contrast, volatile organic compounds (C6H6(GT), Fig. 5e) and the non-methane hydrocarbon sensor (PT08.S2(NMHC), Fig. 5d) show no significant relationship with absolute humidity (r = 0.02 and r = 0.05, respectively), typically fluctuating based on their sources (e.g., traffic, industrial activities) rather than meteorological conditions. Temperature (T, Fig. 5i) displays a strong positive correlation with absolute humidity (r = 0.69), reflecting the atmosphere's increased capacity to hold water vapor as temperature rises. These analyses provide critical insights into the responses of air pollutants to atmospheric conditions, aiding the development of air quality prediction models and environmental policies.
In this study, hyperparameter optimization was performed using Bayesian Optimization to systematically identify the best parameter settings for each machine learning model applied in air pollution prediction. For the Random Forest model, three key hyperparameters were tuned: the number of trees (n_estimators) was allowed to vary between 10 and 200; the maximum depth of the trees (max_depth) was constrained to lie between 5 and 50; and the minimum number of samples required to split an internal node (min_samples_split) was explored within the range of 2 to 20. In contrast, for gradient boosting models such as XGBoost, LightGBM, and Gradient Boosting, the optimization process was configured to vary the number of estimators from 10 to 200, set the maximum tree depth between 3 and 20, and adjust the learning rate within the
Fig. 5 Relationship between absolute humidity and air pollutants. The subplots illustrate the correlation between absolute humidity and different pollutant or sensor readings: (a) NO2(GT), (b) PT08.S3(NOx), (c) NOx(GT), (d) PT08.S2(NMHC), (e) C6H6(GT), (f) NMHC(GT), (g) PT08.S1(CO), (h) CO(GT), (i) T, (j) PT08.S5(O3), and (k) PT08.S4(NO2)
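The hyperparameter ranges reported in this section can be written down as a search space and sampled. The sketch below is an illustration under stated assumptions, not the authors' implementation: it draws candidates by plain random sampling, whereas Bayesian optimization would choose each new candidate using a probabilistic model of the scores seen so far. The dictionary layout and the name `sample_config` are hypothetical; only the numeric bounds come from the text.

```python
import random

# Bounds as reported in the text; the dictionary structure is an assumption.
SEARCH_SPACE = {
    "random_forest": {"n_estimators": (10, 200), "max_depth": (5, 50), "min_samples_split": (2, 20)},
    "svr": {"C": (0.1, 100.0), "epsilon": (0.01, 1.0)},
    "knn": {"n_neighbors": (1, 20)},
    "elastic_net": {"alpha": (0.01, 10.0), "l1_ratio": (0.1, 1.0)},
}


def sample_config(space, rng):
    """Draw one candidate: integer bounds use randint, float bounds use uniform."""
    config = {}
    for name, (low, high) in space.items():
        if isinstance(low, int) and isinstance(high, int):
            config[name] = rng.randint(low, high)  # inclusive on both ends
        else:
            config[name] = rng.uniform(low, high)
    return config


rng = random.Random(42)  # fixed seed so the draws are reproducible
svr_candidates = [sample_config(SEARCH_SPACE["svr"], rng) for _ in range(5)]
rf_candidate = sample_config(SEARCH_SPACE["random_forest"], rng)
```

Each sampled configuration would then be scored (for example by cross-validated error) and the best one kept; wide ranges such as `C` from 0.1 to 100 are often sampled on a logarithmic scale instead, which is a design choice the text does not specify.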
interval of 0.01 to 0.3. These parameter boundaries were carefully chosen to balance model complexity and computational efficiency. The CatBoost model was optimized by considering the number of iterations within the range of 10 to 200, setting the depth parameter between 3 and 10, and tuning the learning rate from 0.01 to 0.3, thus accommodating its specific handling of categorical features.
For the SVR model, the regularization parameter C was allowed to vary between 0.1 and 100, while the epsilon parameter, which defines the width of the epsilon-insensitive tube, was optimized within the range of 0.01 to 1. Meanwhile, the Decision Tree model underwent tuning by varying the maximum depth from 3 to 50 and setting the minimum samples for splitting from 2 to 20, aiming to reduce
overfitting while maintaining interpretability. The KNN model was tuned by adjusting the number of neighbors (n_neighbors) in the range of 1 to 20 to strike an optimal balance between bias and variance. Additionally, the ElasticNet model was calibrated by tuning the alpha parameter, which governs the overall regularization strength, between 0.01 and 10, and the L1 ratio between 0.1 and 1, effectively merging the benefits of both L1 and L2 regularization. Finally, the Bayesian Ridge model was optimized by fine-tuning its inherent uncertainty parameters (alpha_1, alpha_2, lambda_1, and lambda_2), each allowed to vary between 1e-6 and 1e-2.
This comprehensive approach to setting hyperparameter boundaries ensures that the optimization process explores a wide parameter space, thereby enhancing predictive accuracy and robustness. Table 3 summarizes the hyperparameter settings of the models used in the study. The optimized hyperparameters for each model were selected to enhance model performance. These hyperparameters were optimized using the Bayesian Optimization method to ensure the best fit of each model to the dataset.
The results presented in Table 4 illustrate the performance metrics (R², MAE, MSE) obtained from various machine learning models used in air quality analysis, both with default hyperparameter settings and after applying Bayesian optimization. The data indicate that, in general, the models achieve higher accuracy when Bayesian optimization is utilized. For instance, the SVR model initially produced an R² score of 0.9932, an MAE of 0.0581, and an MSE of 0.0068. However, after Bayesian optimization, these metrics significantly improved to 0.9994, 0.0120, and 0.0005, respectively. Similarly, the Gradient Boosting model showed performance metrics of 0.9889 R², 0.0733 MAE, and 0.0111 MSE with default settings,
which improved to 0.9988, 0.0207, and 0.0011 after Bayesian optimization.
However, in some models the effect of optimization was not significant; for example, in the Decision Tree and Bayesian Ridge models the same results were obtained in both cases. In the XGBoost model, although there was an increase in the R² value after optimization, the desired reduction in the MAE value was not fully achieved and the MSE value remained similar. For the Random Forest model, small improvements were observed with Bayesian optimization: the R² score increased from 0.9984 to 0.9985, the MAE decreased from 0.0209 to 0.0207, and the MSE decreased from 0.0015 to 0.0014. For the CatBoost model, the R² score was 0.9990 before optimization and 0.9978 after optimization, a slight decrease; however, the differences in the MAE and MSE metrics were more pronounced.
These results show that the performance of models used in air quality analysis can change significantly depending on hyperparameter optimization. In particular, Bayesian optimization can be said to be effective in increasing the prediction accuracy of the models. These results show that the correct choice of hyperparameters when modelling environmental data plays a critical role in increasing model success and reliability. The data obtained in this study point to the importance of applying Bayesian optimization to maximize model performance and improve the accuracy of air quality predictions. Figure 6 presents a visual comparison of the R² values obtained with and without Bayesian optimization for the different models.
Figure 7 illustrates the relationship between the actual (observed) and predicted values generated by the model. The x-axis represents the actual values, while the y-axis shows the predicted values. Each blue dot corresponds to an individual observation, comparing its actual and predicted values. The red dashed line represents the ideal case where the prediction perfectly matches the actual value (y = x). A close alignment of most points with the red dashed line indicates that the model achieves high prediction accuracy overall. The clustering of data points around this ideal line suggests that the model provides both unbiased and consistent results. Nevertheless, a few deviations from the line can be observed, indicating
Fig. 6 Comparison of the R2 scores of the models obtained with Bayesian Optimization and No Optimization methods
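The R² scores compared in Fig. 6, together with the MAE, MSE and RMSE used elsewhere in the study, follow directly from their standard definitions. The sketch below implements them with the standard library only; the two prediction vectors are hypothetical stand-ins for a model before and after tuning and are not results from the paper.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error (square root of the MSE)."""
    return math.sqrt(mse(y_true, y_pred))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical observations and predictions (illustration only)
y_true = [1.0, 2.0, 3.0, 4.0]
pred_default = [1.5, 2.5, 2.5, 3.5]  # larger errors
pred_tuned = [1.1, 2.1, 2.9, 3.9]    # smaller errors

r2_default = r2_score(y_true, pred_default)
r2_tuned = r2_score(y_true, pred_tuned)
```

Because MSE squares each error, it (and RMSE) penalizes large deviations more heavily than MAE, which is why a model can improve on one metric while barely moving on another.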
that the model does exhibit some prediction errors in certain cases. Despite these minor discrepancies, the overall trend confirms that the model demonstrates strong predictive performance.
Analyzing the data presented in Table 5, it is observed that the performance metrics of the proposed model are significantly superior when compared to the values of the models in other studies reported in the literature. Our proposed model stands out with an R2 value of 0.9994, which explains 99.94% of the variance in the data, which is a significant improvement over the SVR approach used in air quality forecasting. This value is well above the R2 values between 0.8900 and 0.9700 reported in studies (Bekkar et al., 2021; Ben et al., 2025; Doan et al., 2025; Janarthanan et al., 2021; Lakshmipathy et al., 2024; Mampitiya et al., 2024; Mao et al., 2021; Ulpiani et al., 2025; Wang et al., 2025). Similarly, the MAE value was only 0.0120, while other studies reported MAE values between 0.0270 and 0.0450. Furthermore, the mean squared error (MSE) of our proposed model is 0.0005, while in other studies these values range from 0.0014 to 0.0031. These metrics show that our model is highly accurate in forecasting and the error rates are kept to a minimum. Therefore, Table 5 comprehensively shows that our proposed SVR model provides much higher accuracy and reliability for air quality prediction compared to the approaches in literature references (Bekkar et al., 2021; Ben et al., 2025; Doan et al., 2025; Janarthanan et al., 2021; Lakshmipathy et al., 2024; Mampitiya et al., 2024; Mao et al., 2021; Ulpiani et al., 2025; Wang et al., 2025). This proves that the superior performance of our work in practice is achieved by effectively capturing the complex relationships in the dataset and minimizing the error rates.
Satellite-based PM2.5 prediction showed significant links to mortality; over 1050 deaths were linked to pollution, with higher risks at extreme concentration
Table 5 Comparison Table
Study                                          R2 Score   MAE      MSE
Proposed Model (SVR + Bayesian Optimization)   0.9994     0.0120   0.0005
Mao et al. (2021)                              0.9650     0.0280   0.0016
Janarthanan et al. (2021)                      0.9580     0.0310   0.0018
Bekkar et al. (2021)                           0.9470     0.0330   0.0020
Mampitiya et al. (2024)                        0.9600     0.0300   0.0017
Lakshmipathy et al. (2024)                     0.9620     0.0290   0.0015
Ben et al. (2025)                              0.9700     0.0270   0.0014
Wang et al. (2025)                             0.8900     0.0450   0.0031
Doan et al. (2025)                             0.9200     0.0400   0.0024
Ulpiani et al. (2025)                          0.9350     0.0420   0.0025
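The ranges quoted for the literature models can be checked directly against the values in Table 5. The following is a minimal Python sketch of that arithmetic: the numbers are transcribed from Table 5, while the variable names and the "relative reduction" figures are illustrative additions and do not appear in the study.

```python
# (study, R2, MAE, MSE) transcribed from Table 5.
literature = [
    ("Mao et al., 2021", 0.9650, 0.0280, 0.0016),
    ("Janarthanan et al., 2021", 0.9580, 0.0310, 0.0018),
    ("Bekkar et al., 2021", 0.9470, 0.0330, 0.0020),
    ("Mampitiya et al., 2024", 0.9600, 0.0300, 0.0017),
    ("Lakshmipathy et al., 2024", 0.9620, 0.0290, 0.0015),
    ("Ben et al., 2025", 0.9700, 0.0270, 0.0014),
    ("Wang et al., 2025", 0.8900, 0.0450, 0.0031),
    ("Doan et al., 2025", 0.9200, 0.0400, 0.0024),
    ("Ulpiani et al., 2025", 0.9350, 0.0420, 0.0025),
]
proposed = ("Proposed Model (SVR + Bayesian Optimization)", 0.9994, 0.0120, 0.0005)

r2_values = [row[1] for row in literature]
mae_values = [row[2] for row in literature]
mse_values = [row[3] for row in literature]

# Ranges reported in the text for the literature models.
r2_range = (min(r2_values), max(r2_values))
mae_range = (min(mae_values), max(mae_values))
mse_range = (min(mse_values), max(mse_values))

# Relative error reduction of the proposed model versus the best literature value
# (an illustrative quantity, not one reported in the study).
mae_reduction = (min(mae_values) - proposed[2]) / min(mae_values)
mse_reduction = (min(mse_values) - proposed[3]) / min(mse_values)

print("R2 range:", r2_range)
print("MAE range:", mae_range)
print("MSE range:", mse_range)
print(f"MAE reduction vs best literature value: {mae_reduction:.1%}")
print(f"MSE reduction vs best literature value: {mse_reduction:.1%}")
```

A check of this kind only confirms that the quoted ranges match the table. The cited studies address different datasets and prediction targets, so the values are not measured on a common benchmark.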
levels (Aboubakri et al., 2021; Li et al., 2024). Using satellite-based PM2.5 data, nearly 600 respiratory cases were attributed to non-optimal levels during 2017–2022, highlighting significant health risks (Rahmati et al., 2024).
The 99.94% R2 achieved in this study reflects the model's high accuracy but may raise concerns about overfitting, particularly with complex real-world air quality data. Nevertheless, generalizability was addressed through Bayesian optimization and randomized cross-validation. The dataset was temporally split into 80% training and 20% testing, with the model validated on independent test data not used during training. These strategies reduced the risk of overfitting, enhancing the model's reliability in real-world scenarios.
This study makes a significant contribution by providing a comprehensive comparison of ten different regression models for air pollution prediction and thoroughly examining the impact of advanced optimization techniques, such as Bayesian optimization, on model performance. Notably, the ability to handle imbalanced datasets and achieve high accuracy with low computational costs are key strengths of the study. However, limitations include the dataset being restricted to a single urban area and the lack of validation across more diverse geographical regions. Future studies with broader and more varied datasets could further enhance the generalizability of the methodology.
4 Conclusion
This study compared and optimized the performance of 10 different machine learning regression models, namely XGBoost, LightGBM, Random Forest, Gradient Boosting, CatBoost, the Proposed model (Support Vector Regression-Bayesian Optimization), Decision Tree, KNN, Elastic Net and Bayesian Ridge, in predicting air pollution. The hyper-parameter adjustments made by the Bayesian optimization and randomized cross-validation methods significantly improved the prediction accuracy by reducing the risk of overfitting of the models. In particular, the post-optimization R2 value of 0.9994, MAE value of 0.0120 and MSE value of 0.0005 were obtained in the Proposed model (SVR-Bayesian Optimization), demonstrating the superior success of the developed methodology. The results show that the integration of machine learning and optimization techniques plays a critical role in accurately capturing the spatial and temporal dynamics of air pollution. The study demonstrated that the rigorous implementation of data preprocessing, feature engineering and hyper-parameter tuning processes had a direct impact on model success. In addition, combining the strengths of different models through stacking enabled the effective capture of complex data relationships. This comprehensive approach provides a solid scientific basis for data-driven decision making in air quality management, early warning systems and environmental policy making. In the future, the integration of deep learning-based hybrid models with larger and more diverse datasets and the use of explainable artificial intelligence methods could further improve the accuracy and interpretability of air pollution forecasts.
Author Contributions All authors contributed equally to the conception, design, analysis, and interpretation of the data. They have been involved in drafting and revising the manuscript and have given final approval for the version to be published.
Funding Open access funding provided by the Scientific and Technological Research Council of Türkiye (TÜBİTAK).
Data Availability The data and materials used in this study are available upon request. Researchers interested in accessing the data can contact the corresponding author for further details. ([Link] ty-data-set/data).
Declarations
Ethics Approval and Consent to Participate Not applicable.
Consent for Publication Not applicable.
Competing interests The authors declare that they have no competing interests.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your
intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit [Link]
References
Aboubakri, O., Shoraka, H. R., Karamoozian, A., AbediGheshlaghi, L., & Foroutan, B. (2021). Seasonal impact of air particulate matter on morbidity: Interaction effect assessment in a time-stratified case-crossover design. Human and Ecological Risk Assessment: An International Journal, 27(9–10), 2328–2341. [Link] 039.2021.1999204
Ahmed, A. A. M., Jui, S. J. J., Sharma, E., Ahmed, M. H., Raj, N., & Bose, A. (2024). An advanced deep learning predictive model for air quality index forecasting with remote satellite-derived hydro-climatological variables. Science of the Total Environment, 906, 167234. [Link] 1016/J.SCITOTENV.2023.167234
Air Quality Dataset. (n.d.). [Online]. Available: [Link] kaggle.com/datasets/fedesoriano/air-quality-data-set/data. Accessed: 03 Feb 2025.
Almutiri, T. M., Alomar, K. H., & Alganmi, N. A. (2024). Integrating Multi-Omics Using Bayesian Ridge Regression with Iterative Similarity Bagging. Applied Sciences, 14(13), 5660. [Link]
Anggraini, T. S., Irie, H., Sakti, A. D., & Wikantika, K. (2024). Machine learning-based global air quality index development using remote sensing and ground-based stations. Environmental Advances, 15, 100456. [Link] 1016/J.ENVADV.2023.100456
Ansari, A., & Quaff, A. R. (2025). Advanced Machine Learning Techniques for Precise hourly Air Quality Index (AQI) Prediction in Azamgarh, India. International Journal of Environmental Research, 19(1), 1–31. [Link] 1007/S41742-024-00684-5/TABLES/9
Aram, S. A., et al. (2024). Machine learning-based prediction of air quality index and air quality grade: A comparative analysis. International Journal of Environmental Science and Technology, 21(2), 1345–1360. [Link] 1007/S13762-023-05016-2/FIGURES/9
Bekkar, A., Hssina, B., Douzi, S., & Douzi, K. (2021). Air-pollution prediction in smart city, deep learning approach. Journal of Big Data, 8(1), 1–21. [Link] S40537-021-00548-1/FIGURES/17
Ben, A., et al. (2025). Predicting carbon dioxide emissions using deep learning and Ninja metaheuristic optimization algorithm. Scientific Reports, 15(1), 1–28. [Link] 10.1038/s41598-025-86251-0
Doan, Q. C., Ma, J., Chen, S., & Zhang, X. (2025). Nonlinear and threshold effects of the built environment, road vehicles and air pollution on urban vitality. Landscape and Urban Planning, 253, 105204. [Link] LANDURBPLAN.2024.105204
Elshaarawy, M. K. (2025). Stacked-based hybrid gradient boosting models for estimating seepage from lined canals. Journal of Water Process Engineering, 70, 106913. [Link]
Fouchal, A., Tikhamarine, Y., Benbouras, M. A., Souag-Gamane, D., & Heddam, S. (2025). Biological oxygen demand prediction using artificial neural network and random forest models enhanced by the neural architecture search algorithm. Model Earth Syst Environ, 11(1), 1–18. [Link] 13
Guo, F., woo Ham, S., Kim, D., & Moon, H. J. (2025). Deep reinforcement learning control for co-optimizing energy consumption, thermal comfort, and indoor air quality in an office building. Applied Energy, 377, 124467. [Link] doi.org/10.1016/J.APENERGY.2024.124467
Gupta, N. S., Mohta, Y., Heda, K., Armaan, R., Valarmathi, B., & Arulkumaran, G. (2023). Prediction of Air Quality Index Using Machine Learning Techniques: A Comparative Analysis. Journal of Environmental and Public Health, 2023(1), 4916267. [Link] 4916267
Hartono, F., Muljono, M., & Fanani, A. (2024). Improving the accuracy of house price prediction using catboost regression with random search hyperparameter tuning: A comparative analysis. Advance Sustainable Science Engineering and Technology, 6(3), 02403014–02403014. [Link] doi.org/10.26877/ASSET.V6I3.602
Janarthanan, R., Partheeban, P., Somasundaram, K., & NavinElamparithi, P. (2021). A deep learning approach for prediction of air quality index in a metropolitan city. Sustainable Cities and Society, 67, 102720. [Link] org/10.1016/J.SCS.2021.102720
Jiang, Q. M., et al. (2025). Disparities between residential and commercial zones in air quality revealed by location-based services. Building and Environment, 270, 112543. [Link]
Kothandaraman, D., et al. (2022). Intelligent Forecasting of Air Quality and Pollution Prediction Using Machine Learning. Adsorption Science & Technology. [Link] 1155/2022/5086622
Lakshmipathy, M., Prasad, M. J. S., & Kodandaramaiah, G. N. (2024). Advanced ambient air quality prediction through weighted feature selection and improved reptile search ensemble learning. Knowledge and Information Systems, 66(1), 267–305. [Link] 01947-X/TABLES/11
Li, G., Aboubakri, O., Soleimani, S., Maleki, A., Rezaee, R., Safari, M., et al. (2024). Estimation of PM2.5 using high-resolution satellite data and its mortality risk in an area of Iran. International Journal of Environmental Health Research, 34(11), 3771–3783. [Link] 09603123.2024.2325629
Lin, Y. C., Lin, Y. T., Chen, C. R., & Lai, C. Y. (2025). Meteorological and traffic effects on air pollutants using Bayesian networks and deep learning. Journal of Environmental Sciences, 152, 54–70. [Link] 01.057
Liu, Z., Jiang, P., De Bock, K. W., Wang, J., Zhang, L., & Niu, X. (2024). Extreme gradient boosting trees with efficient Bayesian optimization for profit-driven customer churn prediction. Technol Forecast Soc Change, 198, 122945. [Link]
Liu, Q., Cui, B., Liu, Z., Thomaslarson, T. D., & Monteiro, A. (2024). Air quality class prediction using machine learning methods based on monitoring data and secondary modeling. Atmosphere, 15(5), 553. [Link] 3390/ATMOS15050553
Mampitiya, L., et al. (2023). Machine Learning Techniques to Predict the Air Quality Using Meteorological Data in Two Urban Areas in Sri Lanka. Environments, 10(8), 141. [Link]
Mampitiya, L., Rathnayake, N., Hoshino, Y., & Rathnayake, U. (2024). Forecasting PM10 levels in Sri Lanka: A comparative analysis of machine learning models PM10. Journal of Hazardous Materials Advances, 13, 100395. [Link] org/10.1016/J.HAZADV.2023.100395
Mao, W., Wang, W., Jiao, L., Zhao, S., & Liu, A. (2021). Modeling air quality prediction using a deep learning approach: Method optimization and evaluation. Sustainable Cities and Society, 65, 102567. [Link] 102567
Meena, K. K., Bairwa, D., & Agarwal, A. (2024). A machine learning approach for unraveling the influence of air quality awareness on travel behavior. Decision Analytics Journal, 11, 100459. [Link] 100459
Méndez, M., Merayo, M. G., & Núñez, M. (2023). Machine learning algorithms to forecast air quality: A survey. Artificial Intelligence Review, 56(9), 10031–10066. [Link] org/10.1007/S10462-023-10424-4
Nandi, B. P., Singh, G., Jain, A., & Tayal, D. K. (2024). Evolution of neural network to deep learning in prediction of air, water pollution and its Indian context. International Journal of Environmental Science and Technology, 21(1), 1021–1036. [Link]
Rahman, M. M., et al. (2024). AirNet: Predictive machine learning model for air quality forecasting using web interface. Environmental Systems Research, 13(1), 1–19. [Link] org/10.1186/S40068-024-00378-Z/TABLES/5
Rahmati, S., Aboubakri, O., Maleki, A., et al. (2024). Risk of cardiovascular and respiratory diseases attributed to satellite-based PM2.5 over 2017–2022 in Sanandaj, an area of Iran. International Journal of Biometeorology, 68, 1689–1698. [Link]
Rybarczyk, Y., & Zalakeviciute, R. (2021). Assessing the COVID-19 impact on air quality: A machine learning approach. Geophysical Research Letters, 48(4), e2020GL091202. [Link]
Sharma, M., Sharma, D., Burle, R., Patil, P., Joge, I., & Puri, C. (2024). Predicting House Price Model: A Comprehensive Analysis with Random Forest and Decision Tree Method. 2024 3rd International Conference for Innovation in Technology, INOCON 2024. [Link] N60754.2024.10511732
Srisuradetchai, P., & Suksrikran, K. (2024). Random kernel k-nearest neighbors regression. Front Big Data, 7, 1402384. [Link]
Tran, N. K., Kühle, L. C., & Klau, G. W. (2024). A critical review of multi-output support vector regression. Pattern Recognit Lett, 178, 69–75. [Link] 12.007
Ulpiani, G., Pisoni, E., Bastos, J., Monforti-Ferrario, F., & Vetters, N. (2025). Are cities ready to synergise climate neutrality and air quality efforts? Sustainable Cities and Society, 118, 106059. [Link]
Wang, S., & Zhang, Y. (2025). An attention-based CNN model integrating observational and simulation data for high-resolution spatial estimation of urban air quality. Atmospheric Environment, 340, 120921. [Link] ATMOSENV.2024.120921
Wang, S., McGibbon, J., & Zhang, Y. (2024). Predicting high-resolution air quality using machine learning: Integration of large eddy simulation and urban morphology data. Environmental Pollution, 344, 123371. [Link] ENVPOL.2024.123371
Wang, L., et al. (2025). An integrated deep learning model for intelligent recognition of long-distance natural gas pipeline features. Reliability Engineering and System Safety, 255, 110664. [Link]
Yu, C., et al. (2025). MGSFformer: A Multi-Granularity Spatiotemporal Fusion Transformer for air quality prediction. Information Fusion, 113, 102607. [Link] INFFUS.2024.102607
Zhang, Z., Zhang, S., Chen, C., & Yuan, J. (2024). A systematic survey of air quality prediction based on deep learning. Alexandria Engineering Journal, 93, 128–141. [Link] org/10.1016/J.AEJ.2024.03.031
Zournatzidou, G., Mallidis, I., Farazakis, D., & Floros, C. (2024). Enhancing bitcoin price volatility estimator predictions: A four-step methodological approach utilizing elastic net regression. Mathematics, 12(9), 1392. [Link] 3390/MATH12091392
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.