Air Quality Analysis Using Machine Learning
Abstract
This report presents a comprehensive analysis of air quality data using machine learning techniques.
The study utilizes two datasets—one detailing daily air pollutant measurements in various Indian cities
and another containing global AQI values with geographic coordinates.
By applying regression and classification models, the report aims to predict Air Quality Index (AQI)
values and categorize them into standard AQI buckets.
The linear regression model reaches an R² of 0.807 (RMSE 59.44), and the decision tree classifier reaches
72.6% accuracy, which suggests that machine learning can support air pollution monitoring and forecasting.
Introduction
Air pollution is one of the most serious environmental concerns worldwide. It harms human health,
damages ecosystems, and contributes to climate change.
Monitoring and predicting air quality are crucial for public safety and policy-making. This report uses
machine learning (ML) to analyze air quality data and build models to predict AQI and its categories,
which helps understand pollution patterns and anticipate hazardous conditions.
Dataset Description
2.1 city_day.csv
Total Records: 29,531
Main Features: PM2.5, PM10, NO, NO2, NOx, NH3, CO, SO2, O3, Benzene, Toluene, Xylene
Target Variables: AQI (for regression), AQI_Bucket (for classification)
Cities Covered: Various Indian cities
Time Range: Includes multiple dates for each city
2.2 AQI-and-Lat-Long-of-Countries.csv
Total Records: 16,695
Main Features: AQI Value, CO AQI Value, Ozone AQI Value, NO2 AQI Value, PM2.5 AQI Value
Other Data: Latitude and Longitude of monitoring points
Data Preprocessing
Removed rows with missing AQI or AQI_Bucket
Filled missing pollutant values using median imputation
Encoded AQI categories using Label Encoding for classification tasks
Split data into training and testing sets (80:20 split)
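The four preprocessing steps above can be sketched roughly as follows. This is a minimal illustration rather than the report's actual code: the column names come from the dataset description, while the synthetic `demo` frame, the `AQI_Bucket_enc` column name, and the random seed are assumptions made only so the example runs without city_day.csv.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

POLLUTANTS = ["PM2.5", "PM10", "NO", "NO2", "NOx", "NH3",
              "CO", "SO2", "O3", "Benzene", "Toluene", "Xylene"]

def preprocess(df, test_size=0.2, seed=42):
    """Apply the four steps listed above and return (train, test, encoder)."""
    # 1. Remove rows with a missing AQI or AQI_Bucket.
    df = df.dropna(subset=["AQI", "AQI_Bucket"]).copy()
    # 2. Fill missing pollutant values with each column's median.
    df[POLLUTANTS] = df[POLLUTANTS].fillna(df[POLLUTANTS].median())
    # 3. Label-encode the AQI categories (column name is hypothetical).
    encoder = LabelEncoder()
    df["AQI_Bucket_enc"] = encoder.fit_transform(df["AQI_Bucket"])
    # 4. Split 80:20 into training and testing sets.
    train, test = train_test_split(df, test_size=test_size, random_state=seed)
    return train, test, encoder

# Synthetic stand-in for city_day.csv (assumption, for illustration only).
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.uniform(0, 100, size=(50, len(POLLUTANTS))),
                    columns=POLLUTANTS)
demo.iloc[::7, 0] = np.nan                      # some missing PM2.5 readings
demo["AQI"] = rng.uniform(20, 400, size=50)
demo["AQI_Bucket"] = rng.choice(["Good", "Moderate", "Poor"], size=50)
demo.loc[[3, 10], ["AQI", "AQI_Bucket"]] = np.nan  # rows with missing targets

train, test, encoder = preprocess(demo)
```

The sketch follows the order given above, imputing before splitting. Computing the medians on the training portion only would avoid leaking test-set statistics into training, at the cost of a slightly longer pipeline.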
Machine Learning Techniques
4.1 Regression (AQI Prediction)
Model: Linear Regression
Features Used: All pollutant values
Target: AQI (continuous)
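A regression setup of this kind might look like the sketch below, using scikit-learn's LinearRegression. The feature matrix here is synthetic (12 random columns standing in for the pollutant values, with an invented linear relationship to the target), so the fitted coefficients say nothing about the real data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 12 pollutant columns (assumption).
rng = np.random.default_rng(0)
X = rng.uniform(0, 200, size=(500, 12))
true_w = rng.uniform(0.1, 1.0, size=12)          # invented weights
y = X @ true_w + rng.normal(0, 5, size=500)      # continuous AQI-like target

# 80:20 split, as described in the preprocessing section.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

reg = LinearRegression().fit(X_train, y_train)   # ordinary least squares
aqi_pred = reg.predict(X_test)                   # predicted AQI values
```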
4.2 Classification (AQI Category)
Model: Decision Tree Classifier
Features Used: All pollutant values
Target: AQI_Bucket (encoded)
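The classification setup could be sketched along these lines with scikit-learn's DecisionTreeClassifier. The data and the rule that assigns buckets are invented for the example, and the tree uses default hyperparameters because the report does not state any.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 12 pollutant columns (assumption).
rng = np.random.default_rng(1)
X = rng.uniform(0, 300, size=(600, 12))
# Invented bucket rule based on the first column, for illustration only.
labels = np.where(X[:, 0] < 100, "Good",
                  np.where(X[:, 0] < 200, "Moderate", "Poor"))

encoder = LabelEncoder()
y = encoder.fit_transform(labels)                # encoded AQI_Bucket

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
# Map integer predictions back to readable category names.
pred_labels = encoder.inverse_transform(clf.predict(X_test))
```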
Results
5.1 Regression Model
R² Score: 0.807
RMSE: 59.44
The model explains 80.7% of the variance in AQI values.
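For reference, both metrics can be computed directly from true and predicted values. The helper below is an illustrative sketch (the function name and the four sample values are made up); R² is one minus the ratio of residual to total sum of squares, and RMSE is the square root of the mean squared residual, expressed in AQI units.

```python
import numpy as np

def r2_and_rmse(y_true, y_pred):
    """Return (R², RMSE) for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual = y_true - y_pred
    ss_res = np.sum(residual ** 2)                     # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    rmse = np.sqrt(np.mean(residual ** 2))
    return r2, rmse

# Made-up values: every prediction is off by exactly 10 AQI points.
r2, rmse = r2_and_rmse([100, 150, 200, 250], [110, 140, 210, 240])
# r2 is 0.968 and rmse is 10.0 for these inputs
```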
5.2 Classification Model
Accuracy: 72.6%
Classification Report:
Category       Precision   Recall   F1-Score
Good           0.62        0.58     0.60
Moderate       0.76        0.75     0.75
Poor           0.57        0.57     0.57
Satisfactory   0.77        0.78     0.78
Severe         0.78        0.76     0.77
Very Poor      0.68        0.69     0.69
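A report of this shape is what scikit-learn's classification_report produces. The sketch below shows the calls on six made-up labels, not on the report's predictions, so the numbers differ from the table.

```python
from sklearn.metrics import accuracy_score, classification_report

# Made-up true and predicted categories (illustration only).
y_true = ["Good", "Good", "Moderate", "Moderate", "Poor", "Poor"]
y_pred = ["Good", "Moderate", "Moderate", "Moderate", "Poor", "Good"]

# Fraction of samples whose predicted category matches the true one.
acc = accuracy_score(y_true, y_pred)

# Per-category precision, recall, and F1 as a nested dictionary.
report = classification_report(y_true, y_pred, output_dict=True,
                               zero_division=0)

for category in ["Good", "Moderate", "Poor"]:
    row = report[category]
    print(f"{category:<10} {row['precision']:.2f} {row['recall']:.2f} "
          f"{row['f1-score']:.2f}")
```

Calling classification_report without output_dict=True returns a preformatted text table instead.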
Conclusion
This project demonstrates the usefulness of machine learning in environmental monitoring. The regression
model explains about 81% of the variance in AQI values (R² = 0.807), while the classification model assigns
the correct air quality category in 72.6% of cases, with weaker results for the Good and Poor categories
(F1-scores of 0.60 and 0.57). Models of this kind could be integrated into air pollution tracking systems to
provide real-time alerts and long-term insights.
References
Government of India, Central Pollution Control Board (CPCB) Data
Scikit-learn Documentation: https://scikit-learn.org
Python Pandas Documentation: https://pandas.pydata.org