Air Quality Prediction Using Machine Learning Techniques: A Comparative Analysis

Ryansh Rajput¹, Utsov Mohanty¹, Taniya Adak¹, Shubam Chakraborty¹, Samadrita Samanta¹

¹ Kalinga Institute of Industrial Technology, Bhubaneshwar, Odisha, 751024, India

Abstract: This paper presents an extensive assessment of machine learning algorithms
applied to air quality prediction. We tested several regression models, namely Linear
Regression, Lasso, K-Nearest Neighbors (KNN), Decision Trees, Random Forest, XGBoost,
and Artificial Neural Networks (ANN), and found that tree-based ensemble methods
significantly outperformed both linear and instance-based methods. Random Forest and
XGBoost showed the best prediction accuracy, with MAE values of approximately 19
compared to 44.5 for the linear models, indicating their greater capacity to capture
nonlinear interactions among environmental pollutants. PM₂.₅ and NO₂ together
contributed over 60% of the predictions across all models, consistent with the
methodology of official AQI computation. Hybrid algorithms combining XGBoost and
Random Forest reduced RMSE by an additional 12% relative to the best-performing
individual algorithms, highlighting the value of ensemble methods in air quality
assessment.

Keywords: XGBoost, AQI, ANN, KNN, CNN.

1 Introduction

Air quality is a major factor in both environmental sustainability and public health.
In recent years, increasing industrialization and vehicle emissions have degraded air
quality and worsened air pollution. Poor air quality contributes to respiratory
diseases, cardiovascular diseases, and even climate change [1]. Pollutants such as
PM₂.₅, PM₁₀, nitrogen oxides (NOₓ), sulfur dioxide (SO₂), carbon monoxide (CO), and
volatile organic compounds (VOCs) are studied and measured so that researchers can set
air quality standards, which in turn helps in developing methods to control pollution
effectively [1].
Traditional methods of assessing air quality are based on statistics and are not
accurate enough for present-day needs. This highlights the importance of more accurate
forecasting methods. In this study, we address this issue by building machine learning
models that predict air quality from currently available historical environmental
data [2].
2 Motivation

The main motivation for this air quality prediction study is the severe global health
and environmental harm associated with air pollution. Poor air quality is a common risk
factor for respiratory and cardiovascular diseases and a contributor to climate change.
Existing methods for assessing air quality, which rely on conventional statistical
techniques, are insufficient.
The study is further motivated by the fact that increasing industrialization and
vehicle emissions are degrading air quality globally. Studying pollutants such as
PM₂.₅ and PM₁₀ allows researchers to derive health-based air quality standards and
more effective pollution control measures.
The comparison of Lasso, KNN, Decision Trees, Random Forest, XGBoost, and ANN showed
that ensemble-based models performed significantly better than linear and
instance-based ones.

3 Contribution

This study formulates and evaluates the prediction of air quality via various machine
learning models to overcome the limitations of traditional statistical methods. Our
findings show that ensemble-based methods (Random Forest and XGBoost) clearly
outperform classical approaches. We present an extensive framework for using historical
ambient data to predict the complex behavior of multiple air pollutants.

4 Machine Learning Algorithms for Air Quality Prediction

This research evaluates several machine learning algorithms for air quality prediction:
Linear Regression models the relationship between independent variables and a dependent
variable using a linear equation [3].
Simple linear regression:

Y = a + bX

Multiple linear regression:

Y = a + b₁X₁ + b₂X₂ + ... + bₙXₙ
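
As a minimal sketch of how the coefficients a and b of the simple model could be estimated, the following Python snippet applies the ordinary least-squares closed form. The estimation method is an assumption, and the function name `fit_simple_linear` and the sample readings are hypothetical; none of them are taken from the study's data.

```python
from statistics import mean

def fit_simple_linear(xs, ys):
    """Return (a, b) that minimize the squared error of Y = a + bX."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Slope: covariance of X and Y divided by the variance of X.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    a = y_bar - b * x_bar
    return a, b

# Hypothetical readings: a pollutant concentration (X) and an AQI value (Y).
xs = [10.0, 20.0, 30.0, 40.0]
ys = [25.0, 45.0, 65.0, 85.0]
a, b = fit_simple_linear(xs, ys)
print(f"Y = {a:.2f} + {b:.2f}X")
```

The multiple regression case follows the same least-squares idea with one coefficient per predictor.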

Lasso Regression incorporates L1 regularization, which shrinks coefficients toward zero
and can set some of them exactly to zero, effectively performing feature selection by
eliminating less important variables [3].
Lasso objective = RSS + λΣ|βⱼ|, where RSS = Σ(yᵢ - ŷᵢ)²
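
To make the penalized objective concrete, the sketch below evaluates it for a given set of predictions and coefficients, with RSS taken as the sum of squared residuals. The function name `lasso_objective` and all numbers are illustrative assumptions, not values from the study.

```python
def lasso_objective(y_true, y_pred, coefs, lam):
    """Residual sum of squares plus the L1 penalty lam * sum(|beta_j|)."""
    rss = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    penalty = lam * sum(abs(beta) for beta in coefs)
    return rss + penalty

# Hypothetical targets, predictions, and coefficients.
y_true = [3.0, 5.0]
y_pred = [2.0, 7.0]
coefs = [0.5, -1.5]
value = lasso_objective(y_true, y_pred, coefs, lam=2.0)
print(value)  # RSS = 1 + 4 = 5, penalty = 2 * (0.5 + 1.5) = 4, total = 9.0
```

A larger λ makes the penalty term dominate, which pushes more coefficients to zero.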

KNN computes the distance between a new data point and the existing data points. For
classification the prediction is the majority class of the K nearest neighbors; for
regression it is the average of their target values.
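
A minimal sketch of KNN used for regression, which is the setting relevant to predicting a continuous AQI value, is shown below. It assumes Euclidean distance and an unweighted average of the neighbors' targets; the function name `knn_predict` and the two-feature readings are hypothetical.

```python
import math

def knn_predict(train_X, train_y, query, k):
    """Average the targets of the k training points closest to `query`."""
    by_distance = sorted(
        (math.dist(x, query), y) for x, y in zip(train_X, train_y)
    )
    nearest = [y for _, y in by_distance[:k]]
    return sum(nearest) / len(nearest)

# Hypothetical (PM2.5, NO2) readings with made-up AQI targets.
train_X = [(10, 5), (20, 10), (30, 15), (80, 40)]
train_y = [40.0, 60.0, 80.0, 200.0]
prediction = knn_predict(train_X, train_y, query=(22, 11), k=2)
print(prediction)  # the two nearest neighbors have targets 60 and 80
```

Because the prediction depends directly on distances, features on very different scales would normally be normalized first.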
Decision Tree splits the data on feature values, using the Gini Index (classification)
or Mean Squared Error (regression) as the split criterion. For regression, the
prediction is the average target value at the leaf node that a sample reaches.
Random Forest builds multiple trees using randomly selected subsets of the data, where
each tree makes a decision and the final verdict is passed based on majority voting or
averaging.
XGBoost is a gradient-boosting ensemble method in which trees are added one at a time
and each new tree is fitted to correct the residual errors of the trees before it.
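
For regression, one common way for each tree to correct the errors of the previous trees is to fit every new tree to the current residuals. The sketch below illustrates that idea with one-split trees (stumps) on a single feature. It is a simplified gradient-boosting illustration under squared-error loss, not XGBoost itself, which additionally uses regularization and second-order gradient information. All names and data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals."""
    best = None
    for t in sorted(set(xs))[:-1]:  # every split leaves both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, n_trees=20, lr=0.5):
    """Start from the mean, then add stumps fitted to the remaining residuals."""
    base = sum(ys) / len(ys)
    trees, preds = [], [base] * len(ys)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        preds = [p + lr * tree(x) for p, x in zip(preds, xs)]
    return lambda x: base + lr * sum(tree(x) for tree in trees)

# Hypothetical single-feature data.
xs = [1, 2, 3, 4, 5, 6]
ys = [10.0, 12.0, 30.0, 33.0, 60.0, 62.0]
model = boost(xs, ys)
base = sum(ys) / len(ys)
mse_before = sum((y - base) ** 2 for y in ys) / len(ys)
mse_after = sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
print(mse_before, mse_after)
```

The learning rate `lr` scales each tree's contribution, so the training error falls gradually as trees are added.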
Artificial Neural Network (ANN) consists of layers of neurons and works by adjusting the
weights using back propagation and gradient methods to minimize the loss function.

5 Related Works

Rahman et al. developed a web interface based on predictive machine learning models
such as RF, LR, DT, SVM, and LSVC for early prediction of air quality from air
pollutant data [1]. Samad et al. aimed to replace existing monitoring stations with
virtual monitoring stations using meteorological parameters from nearby monitoring
stations [2]. Singh et al. applied RF, GNB, and LR to the AQI dataset of Ahmedabad and
achieved the best accuracy with RF [3]. Gupta et al. used pollutant data from several
Indian cities and applied the SMOTE algorithm for dataset balancing [4]. Wang et al.
proposed the ILSTM model, which omits the output gate to improve training efficiency;
their CNN-ILSTM model uses a CNN for feature extraction to improve accuracy [5]. Suman
addressed concerns about the simplicity of current AQI models [6]. Zayed and Abbod
proposed improving AQI prediction using advanced regression [7]. Natarajan et al.
achieved better results by combining Grey Wolf Optimization for feature selection with
a Decision Tree for prediction on data already balanced using SMOTE [8].

7 Methodology

Our methodology primarily involved three stages: data collection, data pre-processing,
and model implementation, aiming to provide accurate methods for predicting air quality
and pollution levels [3].
Fig.1 ML Pipeline for AQI Prediction

Flow Chart Description:

• Data Collection - The initial stage where air quality data is gathered.
• Data Pre-processing - Raw data is cleaned, normalized, and prepared for analysis.
• Feature Engineering - Important features are selected.
• Model Implementation - Machine learning algorithms are applied.
• Performance Evaluation - The model's accuracy and effectiveness are assessed.
• Performance satisfactory? - Whether to proceed or return to model implementation.
• Results Analysis - Detailed analysis of the model's predictions is conducted.
• Model deployment ready? - Whether the model is ready for real-world application.
• Deployment - Final stage where the validated model is implemented in production.

The models used in this air quality prediction pipeline include Linear Regression and
Lasso Regression [11], KNN [12], DT, RF [13], XGB, SVM/SVR [14], ANN [15],
Gradient Boosting [16], CNN-ILSTM [17], and Stacking Ensemble methods [13].
After a model is implemented, the pipeline continues to performance evaluation,
in which the model's predictive ability is evaluated rigorously using metrics such as
RMSE, MAE, R-squared, and domain-specific metrics. The flowchart includes a decision
diamond marked "Performance satisfactory?" If not, the process returns to model
implementation for further iteration. If performance meets expectations, results
analysis follows. The final decision point asks "Model deployment ready?" If the answer
is "Yes," the model proceeds to deployment.
7.1 Performance Metrics Selection

For continuous AQI value prediction (regression), three complementary metrics were
applied [4]: the MAE, which measures the average prediction error in interpretable
units; the RMSE, which penalizes larger prediction errors more severely; and the
coefficient of determination (R²), which assesses the share of variance explained by
each model. For categorical prediction of AQI classes, the evaluation used Accuracy,
which measures overall correctness, and AUC-ROC, which assesses discrimination ability
across different decision thresholds.
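
The three regression metrics can be computed directly from their standard definitions, as in the sketch below. The function name `regression_metrics` and the sample AQI values are hypothetical and serve only to show the arithmetic.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R²) for paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2

# Hypothetical true and predicted AQI values.
y_true = [100.0, 150.0, 200.0, 250.0]
y_pred = [110.0, 140.0, 190.0, 270.0]
mae, rmse, r2 = regression_metrics(y_true, y_pred)
print(f"MAE={mae:.2f} RMSE={rmse:.2f} R2={r2:.3f}")
```

In this example the single 20-point error raises the RMSE above the MAE, which is the sense in which RMSE penalizes large errors more heavily.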

7.2 Comparative Analysis Framework

The comparative analysis was designed so that all models could be compared on both
predictive power and computational requirements. To test whether performance
differences between models were statistically significant, paired t-tests were
performed, checking that the observed differences were unlikely to be due to chance.
This gave a more robust framework for determining the best-suited algorithm [5].
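
A paired t-test compares two models that were evaluated on the same data partitions. The sketch below computes the t statistic using only the Python standard library. The per-fold MAE values are made up for illustration, and treating cross-validation folds as the pairing unit is an assumption, since the pairing unit is not specified here.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(errors_a, errors_b):
    """Return (t, degrees_of_freedom) for paired error measurements."""
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    # Mean difference divided by its standard error.
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1

# Hypothetical per-fold MAE for two models on the same five folds.
model_a_mae = [44.0, 46.0, 43.0, 45.0, 47.0]
model_b_mae = [19.0, 21.0, 20.0, 18.0, 22.0]
t_stat, dof = paired_t_statistic(model_a_mae, model_b_mae)
print(f"t = {t_stat:.2f} with {dof} degrees of freedom")
```

The statistic would then be compared against the t distribution with n - 1 degrees of freedom to obtain a p-value.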
8 Results and Performance Analysis

This assessment uncovered significant disparities in the predictive capabilities of
the algorithms evaluated. The results showed clear advantages of tree-based ensemble
methods over linear and instance-based methods.
Random Forest and XGBoost were markedly better than the other algorithms on every
evaluation metric. XGBoost achieved the lowest MAE and RMSE and the highest R², while
Random Forest achieved slightly higher classification accuracy (83% versus 81%).

Table 1. Comparative Model Performance.

Model               MAE     RMSE    R²      Accuracy (AQI Class)
Linear Regression   44.50   3627    0.50    52%
Lasso               44.50   3627    0.50    53%
KNN                 34.97   2970    0.54    61%
Decision Tree       22.09   1977    0.72    68%
Random Forest       19.39   1551    0.78    83%
XGBoost             19.02   1355    0.81    81%
ANN                 38.53   3314    0.52    72%

8.1 Key Performance Insights

The leading tree-based models performed better on most measures. The ANN performed
fairly poorly on this small dataset, which suggests the data-hungry nature of neural
network methods. PM₂.₅ and NO₂ contributed more than 60% to the predictions of all
models. The study also showed a clear benefit from ensemble approaches, which improved
upon single algorithms.
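
The MAE gap in Table 1 can also be expressed as a relative reduction against the Linear Regression baseline. The short Python sketch below only reuses the MAE column of Table 1; the variable names are illustrative.

```python
# MAE values copied from Table 1.
mae = {
    "Linear Regression": 44.50,
    "Lasso": 44.50,
    "KNN": 34.97,
    "Decision Tree": 22.09,
    "Random Forest": 19.39,
    "XGBoost": 19.02,
    "ANN": 38.53,
}

baseline = mae["Linear Regression"]
# Relative MAE reduction versus the linear baseline, in percent.
reduction = {
    name: 100.0 * (baseline - value) / baseline for name, value in mae.items()
}

for name, pct in sorted(reduction.items(), key=lambda item: -item[1]):
    print(f"{name:18s} {pct:5.1f}%")
```

Under this calculation, XGBoost and Random Forest reduce MAE by roughly 57% and 56% relative to Linear Regression, the Decision Tree by about 50%, and KNN and ANN by about 21% and 13%.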
Result Graphs

Of all the prediction plots, Random Forest gives the best result, with 83% accuracy,
and according to the plot its 0.025

Fig.2 Lasso Regression Prediction

Fig.3 ANN Prediction

Fig.4 Random Forest Regressor Prediction

9 Conclusion

Machine learning methods, especially tree-based ensemble methods, outperform
traditional methods for air quality prediction. Linear models struggle with temporal
dependencies, while KNN is highly sensitive to missing data. Even the top-performing
tree-based models have difficulty capturing sudden pollution events.
Next steps could include integrating real-time data streams and further work on deep
learning architectures that can account for, exploit, and generalise beyond the
temporal dependencies found in sequences. Integrating meteorological variables and
spatial relationships may provide even more accurate predictions.
References
1. Rahman, M.M., Nayeem, E.H., Ahmed, S. et al. AirNet: predictive machine learning model
for air quality forecasting using web interface. Environ Syst Res 13, 44 (2024)
2. Samad, A., Garuda, S., Vogt, U., Yang, B. Air pollution prediction using machine learning
techniques – An approach to replace existing monitoring stations with virtual monitoring
stations, Atmospheric Environment, Volume 310, 2023, 119987
3. Singh, R., Raghav, S., Maini, T., Singh, M., Arquam, Md. Air Quality Prediction using
Machine Learning. SSRN Electronic Journal (2022)
4. Gupta, N.S., Mohta, Y., Heda, K., Armaan, R., Valarmathi, B., Arulkumaran, G., Prediction
of Air Quality Index Using Machine Learning Techniques: A Comparative Analysis, Journal
of Environmental and Public Health, 2023, 4916267, 26 pages (2023)
5. Wang, J., Li, X., Jin, L. et al. An air quality index prediction model based on CNN-ILSTM.
Sci Rep 12, 8373 (2022)
6. Suman, Air quality indices: A review of methods to interpret air quality status, Materials
Today: Proceedings, Volume 34, Part 3, 2021, Pages 863-868
7. Zayed, R., & Abbod, M. Air Quality Index Prediction Using DNN-Markov Modeling.
Applied Artificial Intelligence, 38(1) (2024)
8. Natarajan, S.K., Shanmurthy, P., Arockiam, D. et al. Optimized machine learning model for
air quality index prediction in major cities in India. Sci Rep 14, 6795 (2024)
9. Méndez, M., Merayo, M. G., & Núñez, M. Machine learning algorithms to forecast air
quality: a survey. Artificial intelligence review, 1–36 (2023)
10. Liang, Y.C., Maimury, Y., Chen, A., Juarez, J. Machine Learning-Based Prediction of Air
Quality. Applied Sciences. 10. 9151 (2020)
11. Evangelista, Adelia et al. “High dimensional variable selection through group Lasso for
multiple function‐on‐function linear regression: A case study in PM10
monitoring.” Environmetrics (2024)
12. Mandvi, Hrishikesh Kumar Singh and Vipin Kumar. “Air quality index prediction for
Gorakhpur city using k-nearest neighbors: Model evaluation and analysis.” World Journal of
Advanced Research and Reviews (2024)
13. Resti, Yulia et al. “Ensemble of naive Bayes, decision tree, and random forest to predict air
quality.” IAES International Journal of Artificial Intelligence (IJ-AI) (2024)
14. Shafii, Nor Hayati Binti et al. “Forecasting of Air Pollution Index PM2.5 Using Support
Vector Machine (SVM).” (2020)
15. Purwanti, Cindy Ulan et al. “Portable Air Quality Monitoring System in ANN Using
Combination Hidden Layer Hyperparameters.” 2022 IEEE International Conference on
Communication, Networks and Satellite (COMNETSAT) (2022): 368-373.
16. Ksibi, Amel et al. “Insights for Wellbeing: Predicting Personal Air Quality Index Using
Regression Approach.” MediaEval Benchmarking Initiative for Multimedia Evaluation
(2020).
17. Wang, Jingyang et al. “An air quality index prediction model based on CNN-ILSTM.”
Scientific Reports 12 (2022)
