Phase-2 Submission: Air Quality Prediction Project
| Student Name | Register Number |
|--------------|-----------------|
| K. SRITHIKA  | 621123205053    |
| A. SHIFANA   | 621123205051    |
| P. SWETHA    | 621123205055    |
| B. SUBASHINI | 621123205054    |
| J. SHOBANA   | 621123205052    |
**Institution:** Idhaya Engineering College for Women
**Department:** B.Tech Information Technology
**Date of Submission:** [Insert Date]
**GitHub Repository Link:** [Insert Link]
1. Problem Statement
Air pollution severely affects environmental and human health. Traditional air quality
monitoring systems report current conditions but lack predictive capabilities, so they
rarely offer actionable early warnings. This project aims to build a regression-based
machine learning model that predicts the Air Quality Index (AQI) from real-time
environmental and pollutant data. Predictive insights will help citizens and governments
make proactive decisions to mitigate health risks and environmental impact.
2. Project Objectives
- Develop a robust ML model to predict AQI levels based on environmental features.
- Compare performance of multiple algorithms (Linear Regression, Random Forest,
XGBoost).
- Identify key pollutants influencing AQI.
- Visualize trends and patterns in air quality data.
- Create a user-friendly dashboard or tool (optional deployment).
- Adjust project goals post-EDA for improved performance and interpretability.
3. Flowchart of the Project Workflow
1. Data Collection
2. Data Cleaning & Preprocessing
3. Exploratory Data Analysis
4. Feature Engineering
5. Model Building
6. Model Evaluation
7. Visualization & Insights
8. (Optional) Deployment
4. Data Description
- Dataset: Delhi Air Quality Dataset
- Source: Kaggle (https://www.kaggle.com/datasets)
- Type: Structured, time-series
- Features: PM2.5, PM10, NO2, CO, SO2, O3, temperature, humidity, wind speed
- Target Variable: AQI
- Records: ~30,000 rows, 15+ features
- Nature: Static dataset with potential for real-time API extension
5. Data Preprocessing
- Handled missing values using forward-fill and interpolation.
- Removed duplicate entries.
- Converted date columns to datetime format.
- Standardized pollutant values to common units.
- One-hot encoded categorical weather descriptions.
- Normalized numerical columns using Min-Max Scaling.
- Final cleaned dataset saved for modeling.
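The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration on a made-up five-row frame, not the project's actual code: the column names (`date`, `pm25`, `no2`, `weather`) and the choice to interpolate numeric columns while forward-filling the categorical one are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Tiny synthetic frame standing in for the real dataset (all values are made up).
raw = pd.DataFrame({
    "date": ["2021-01-01 00:00", "2021-01-01 01:00", "2021-01-01 01:00",
             "2021-01-01 02:00", "2021-01-01 03:00"],
    "pm25": [180.0, np.nan, np.nan, 210.0, 195.0],
    "no2": [40.0, 42.0, 42.0, np.nan, 50.0],
    "weather": ["fog", "fog", "fog", None, "haze"],
})

NUMERIC_COLS = ["pm25", "no2"]  # hypothetical column names


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Remove exact duplicate rows, then parse and sort by timestamp.
    out = df.drop_duplicates().copy()
    out["date"] = pd.to_datetime(out["date"])
    out = out.sort_values("date").reset_index(drop=True)

    # Assumption: numeric gaps are linearly interpolated,
    # categorical gaps are forward-filled.
    out[NUMERIC_COLS] = out[NUMERIC_COLS].interpolate(method="linear")
    out["weather"] = out["weather"].ffill()

    # One-hot encode the weather description.
    out = pd.get_dummies(out, columns=["weather"], prefix="weather")

    # Min-Max scale numeric columns to the [0, 1] range.
    out[NUMERIC_COLS] = MinMaxScaler().fit_transform(out[NUMERIC_COLS])
    return out


clean = preprocess(raw)
print(clean)
```

In a real pipeline the scaler would normally be fitted on the training split only and then applied to the test split, so that test-set statistics do not leak into training.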
6. Exploratory Data Analysis (EDA)
Univariate Analysis:
- PM2.5 and PM10 show right-skewed distributions.
- AQI ranges mostly from 100 to 350 (Moderate to Very Poor in India's National AQI bands).
Bivariate Analysis:
- Strong correlation between AQI and PM2.5 (r = 0.87).
- Seasonal variation: AQI increases during winter.
Insights:
- PM2.5, PM10, and NO2 are the most influential pollutants.
- Weekends show slightly lower pollution levels.
- AQI is affected by temperature and humidity to some extent.
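The kinds of checks behind these findings (skewness, correlation, seasonal averages) could be computed along the following lines. The data here is randomly generated purely for illustration, so the numbers it prints are not the project's results; the column names and the definition of "winter" as November to February are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic daily data: right-skewed PM2.5 with an artificial winter bump.
rng = np.random.default_rng(0)
dates = pd.date_range("2021-01-01", periods=365, freq="D")
is_winter = dates.month.isin([11, 12, 1, 2])  # assumed winter months
pm25 = rng.lognormal(mean=4.0, sigma=0.5, size=len(dates)) + np.where(is_winter, 80.0, 0.0)
aqi = 1.2 * pm25 + rng.normal(0, 15, size=len(dates))
df = pd.DataFrame({"date": dates, "pm25": pm25, "aqi": aqi})

# Univariate: a positive skewness value indicates a right-skewed distribution.
skewness = df["pm25"].skew()

# Bivariate: Pearson correlation between PM2.5 and AQI.
r = df["pm25"].corr(df["aqi"])

# Seasonal variation: mean AQI per calendar month.
monthly_aqi = df.groupby(df["date"].dt.month)["aqi"].mean()

# Weekday vs weekend comparison (Saturday = 5, Sunday = 6).
weekend_aqi = df.groupby(df["date"].dt.dayofweek >= 5)["aqi"].mean()

print(f"PM2.5 skewness: {skewness:.2f}")
print(f"Pearson r (PM2.5 vs AQI): {r:.2f}")
print(monthly_aqi.round(1))
```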
7. Feature Engineering
- Created new feature: Pollution Category (Good, Moderate, Poor, etc.).
- Extracted datetime components: hour, weekday, month.
- Combined PM2.5 and PM10 as a composite feature.
- Removed redundant columns (e.g., city names if constant).
- Considered polynomial features (PM2.5^2) for non-linear models.
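A possible implementation of these engineered features is sketched below. The category boundaries follow India's National AQI bands, which is an assumption because the report does not state its cut-offs, and the composite feature is shown as a simple mean of PM2.5 and PM10 since the exact combination is not specified. All column names are hypothetical.

```python
import pandas as pd

# Assumed category edges; the report lists only "Good, Moderate, Poor, etc.".
AQI_BINS = [0, 50, 100, 200, 300, 400, 500]
AQI_LABELS = ["Good", "Satisfactory", "Moderate", "Poor", "Very Poor", "Severe"]

df = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-02 08:00", "2021-06-15 14:00", "2021-11-20 23:00"]),
    "pm25": [40.0, 95.0, 260.0],
    "pm10": [80.0, 150.0, 410.0],
    "aqi": [85, 180, 390],
})


def add_features(frame: pd.DataFrame) -> pd.DataFrame:
    out = frame.copy()
    # Pollution category derived from the AQI value.
    out["pollution_category"] = pd.cut(
        out["aqi"], bins=AQI_BINS, labels=AQI_LABELS, include_lowest=True
    )
    # Datetime components.
    out["hour"] = out["date"].dt.hour
    out["weekday"] = out["date"].dt.dayofweek  # Monday = 0
    out["month"] = out["date"].dt.month
    # One possible composite: the mean of the two particulate readings.
    out["pm_composite"] = out[["pm25", "pm10"]].mean(axis=1)
    # Polynomial term for PM2.5.
    out["pm25_sq"] = out["pm25"] ** 2
    return out


features = add_features(df)
print(features)
```

Because the pollution category is computed from the target itself, it is better suited to reporting or stratified splitting than to use as a model input when predicting AQI.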
8. Model Building
Algorithms Tried:
- Linear Regression
- Random Forest Regressor
- XGBoost Regressor
Why These?
- Linear Regression for baseline
- Random Forest for robustness and interpretability
- XGBoost for performance in structured data
Data Split: 80% training, 20% test, stratified by binned AQI level where applicable
Evaluation Metrics (Random Forest, the best-performing model):
- MAE: ~28 AQI points
- RMSE: ~35 AQI points
- R² Score: ~0.85
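The comparison described in this section could be set up as in the sketch below. It uses synthetic data, so its scores will not match the figures reported above, and the hyperparameters and the AQI bin edges used for stratification are assumptions. XGBoost is left out to keep the example dependent on scikit-learn only; `xgboost.XGBRegressor` follows the same `fit`/`predict` interface and could be added to the `models` dictionary.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for three scaled features and an AQI-like target.
rng = np.random.default_rng(42)
n = 500
X = rng.uniform(0, 1, size=(n, 3))
y = 300 * X[:, 0] + 80 * X[:, 1] ** 2 + rng.normal(0, 10, size=n)

# Stratify a regression target by binning it first (bin edges are assumed).
strata = np.digitize(y, [100, 200, 300])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=strata
)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "MAE": mean_absolute_error(y_test, pred),
        "RMSE": float(np.sqrt(mean_squared_error(y_test, pred))),
        "R2": r2_score(y_test, pred),
    }

for name, scores in results.items():
    print(f"{name}: MAE={scores['MAE']:.1f} RMSE={scores['RMSE']:.1f} R2={scores['R2']:.2f}")
```

For time-ordered data, a chronological split (train on earlier records, test on later ones) is a common alternative to a random split, since it avoids evaluating on periods the model has effectively already seen.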
9. Visualization of Results & Model Insights
- Feature Importance Plot: PM2.5 and NO2 most significant
- Residual Plots: Random Forest shows the smallest residuals
- AQI Prediction vs Actual: Close alignment in most data segments
- Category Confusion: When predicted AQI is mapped to categories, misclassification occurs mainly in borderline cases (e.g., Moderate vs Poor)
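The borderline-category effect can be illustrated by binning actual and predicted AQI values and cross-tabulating them. The values below are invented for illustration, and the category edges (India's National AQI bands) are an assumption, as the report does not list its thresholds.

```python
import pandas as pd

# Assumed category edges and labels.
AQI_BINS = [0, 50, 100, 200, 300, 400, 500]
AQI_LABELS = ["Good", "Satisfactory", "Moderate", "Poor", "Very Poor", "Severe"]


def to_category(aqi_values) -> pd.Series:
    # Clip to the valid AQI range, then map each value to its band.
    clipped = pd.Series(aqi_values, dtype="float64").clip(0, 500)
    cats = pd.cut(clipped, bins=AQI_BINS, labels=AQI_LABELS, include_lowest=True)
    return cats.astype(str)


# Made-up values: two predictions sit just across the 200 boundary.
actual = [95, 150, 198, 205, 290, 310]
predicted = [90, 160, 210, 195, 280, 330]

actual_cat = to_category(actual)
predicted_cat = to_category(predicted)

# Rows are actual categories, columns are predicted categories.
confusion = pd.crosstab(actual_cat, predicted_cat, rownames=["actual"], colnames=["predicted"])
category_accuracy = float((actual_cat == predicted_cat).mean())

print(confusion)
print(f"Category accuracy: {category_accuracy:.2f}")
```

In this toy case every numeric error is at most 20 AQI points, yet two of the six samples change category because they lie near the Moderate/Poor boundary.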
10. Tools and Technologies Used
- Language: Python
- IDE: Google Colab
- Libraries: pandas, numpy, scikit-learn, seaborn, matplotlib, xgboost
- Visualization: Plotly, seaborn
- Version Control: GitHub
- (Optional): Streamlit for interface
11. Team Members and Contributions
| Name | Contribution |
|--------------|----------------------------------|
| K. Srithika | Data Collection & Integration |
| A. Shifana | Data Cleaning & Preprocessing |
| P. Swetha | EDA & Feature Engineering |
| B. Subashini | Model Training & Evaluation |
| J. Shobana | Documentation & Visualization |