Global Temperature and Pollution
Prediction Model
A Project Report (phase-3)
Submitted by
Rohan Swarnkar– 23BAI70577
Kashish– 23BAI70495
Arnav Mishra– 23BAI70585
Raj Kumar– 23BAI70573
Bachelor of Engineering
in
Computer Science and Engineering
CHAPTER 3. PRELIMINARY DESIGN
SNO. DESCRIPTION
1. FEATURES/CHARACTERISTICS IDENTIFICATION
2. CONSTRAINTS IDENTIFICATION
3. ANALYSIS OF FEATURES
4. DESIGN SELECTION
PRELIMINARY DESIGN
3.1. Features/Characteristics Identification
Environmental Factors:
• Temperature - One of the features on which the prediction model depends. It indicates how
hot a place is and is influenced by the amount of greenhouse gases present in the atmosphere.
• Pollution Level - Indicates how polluted a place is, taking all measured pollutants and
variables into account. It is one of the key features for evaluating a place.
• Air Quality Index (AQI) - A key parameter that indicates the level of pollutants present in
an area. It depends on the concentration of pollutants, such as smoke and gases released from
industrial and non-industrial sources.
3.2. Constraints Identification
Data Availability:
• Dataset Availability – The dataset available on the internet was very large, which made it
time-consuming to collect and process.
Time Constraints:
• The time available for data collection, preprocessing, model development, and deployment
was limited.
3.3. Analysis of Features and Finalization Subject to Constraints
Exploratory Data Analysis (EDA):
• Data cleaning and data visualization were carried out because the dataset was very large.
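As a minimal sketch of the cleaning step described above, the snippet below counts the missing values in each column and then drops incomplete rows before summarising the data. The small DataFrame is invented for illustration and is not taken from the project dataset; only the column names `PM2.5`, `PM10`, and `AQI` mirror those used in the model code.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for the real dataset (values are invented).
sample = pd.DataFrame({
    "PM2.5": [80.0, np.nan, 45.5, 120.3, 60.0],
    "PM10": [150.0, 95.2, np.nan, 210.8, 130.0],
    "AQI": [180.0, 110.0, 90.0, np.nan, 140.0],
})

# Count missing values in each column to see how much cleaning is needed.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Keep only the rows where every column has a value.
cleaned = sample.dropna()

# Summary statistics (count, mean, min, max, ...) of the cleaned rows.
print(cleaned.describe())
```

Checking the missing-value counts first helps decide whether dropping rows is acceptable or whether too much data would be lost.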
Feature Importance:
• AQI and temperature are the two key features of this model.
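Feature importance can also be checked from a fitted model. The following is a minimal sketch, assuming a random forest is used for the check. The data is synthetic, and the target is deliberately constructed so that `PM2.5` dominates, so the ranking it prints says nothing about the project dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)

# Synthetic pollutant readings (invented values, not project data).
X = pd.DataFrame({
    "PM2.5": rng.uniform(10, 200, 300),
    "PM10": rng.uniform(20, 300, 300),
    "NO2": rng.uniform(5, 80, 300),
})

# Synthetic target built so that PM2.5 contributes most (illustration only).
y = 1.5 * X["PM2.5"] + 0.3 * X["PM10"] + rng.normal(0, 5, 300)

model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(X, y)

# feature_importances_ sums to 1; a larger value means the feature
# contributed more to reducing error across the trees' splits.
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```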
Feature Engineering:
• Numeric values – Some features contained non-numeric values, which had to be converted to
numeric values.
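The conversion described above could be done in several ways. The sketch below shows two common ones with pandas, under stated assumptions: the rows are invented, and the `AQI_Bucket` column and the `n/a` placeholder are hypothetical examples of non-numeric content, not details confirmed by this report.

```python
import pandas as pd

# Hypothetical rows: numbers stored as text plus a categorical label (invented values).
raw = pd.DataFrame({
    "PM10": ["150.5", "98", "n/a", "210"],
    "AQI_Bucket": ["Poor", "Moderate", "Moderate", "Severe"],
})

# Text that cannot be parsed as a number becomes NaN instead of raising an error.
raw["PM10"] = pd.to_numeric(raw["PM10"], errors="coerce")

# Map each category label to an integer code (codes follow the sorted label order).
raw["AQI_Bucket_code"] = raw["AQI_Bucket"].astype("category").cat.codes

print(raw)
```

Values coerced to NaN still need to be handled afterwards, for example by dropping those rows as the model code does.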
Data Availability:
• The dataset was taken from Kaggle.
3.4. Design Selection
Model Selection:
• Out of three machine learning algorithms, the model with the lowest errors was selected.
Evaluation Metrics:
• The machine learning algorithms were compared on the basis of three metrics:
1. MAE - Mean Absolute Error (lower is better).
2. MSE - Mean Squared Error (lower is better).
3. R² score - Coefficient of determination (higher is better; 1 indicates a perfect fit).
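To make the three metrics concrete, the sketch below computes each one by hand with NumPy. The actual and predicted AQI values are invented for illustration only.

```python
import numpy as np

# Invented actual and predicted AQI values, for illustration only.
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 190.0, 260.0])

errors = y_true - y_pred

# MAE: average size of the errors, in the same units as AQI.
mae = np.mean(np.abs(errors))

# MSE: average of the squared errors, which penalises large misses more heavily.
mse = np.mean(errors ** 2)

# R²: 1 minus (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("MAE:", mae)
print("MSE:", mse)
print("R2:", r2)
```

For these values every prediction is off by 10, giving an MAE of 10, an MSE of 100, and an R² of about 0.968. Unlike MAE and MSE, R² is not an error: a higher value indicates a better fit.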
Linear Regression Model
Support Vector Regression Model
RandomForestRegressor Model
On the basis of these three metrics, we selected
the Random Forest model (RandomForestRegressor), as it
had lower errors than the other algorithms.
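The comparison described above can be sketched as follows. This is an illustrative outline rather than the project's actual experiment: the data is synthetic, the models use default settings apart from the random forest's `n_estimators` and `random_state`, and choosing by lowest MAE is an assumption about how "lower errors" was decided.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

rng = np.random.default_rng(42)

# Synthetic stand-in data (invented): two features and a nonlinear target.
X = rng.uniform(0, 100, size=(400, 2))
y = 0.02 * X[:, 0] ** 2 + 0.5 * X[:, 1] + rng.normal(0, 3, 400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Support Vector Regression": SVR(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

# Fit each model on the same split and record the three metrics.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "MAE": mean_absolute_error(y_test, y_pred),
        "MSE": mean_squared_error(y_test, y_pred),
        "R2": r2_score(y_test, y_pred),
    }
    print(name, results[name])

# Assumed selection rule: pick the model with the lowest MAE on the test set.
best = min(results, key=lambda name: results[name]["MAE"])
print("Lowest MAE:", best)
```

Using the same train/test split for all three models keeps the comparison fair. Which model wins on this synthetic data does not indicate which one wins on the project dataset.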
Block diagram:
Model Code:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
# Load your dataset (replace 'city_hour.csv' with your actual dataset file)
data = pd.read_csv('city_hour.csv')
# Convert the 'Datetime' column to datetime format
data['Datetime'] = pd.to_datetime(data['Datetime'])
# Drop rows with missing values
data.dropna(inplace=True)
# Prepare features (X) and target (y)
feature_columns = ['PM2.5', 'PM10', 'NO', 'NO2', 'NOx', 'NH3', 'CO', 'SO2', 'O3',
                   'Benzene', 'Toluene', 'Xylene']  # Pollutant columns used as inputs
X = data[feature_columns]
y = data['AQI']  # AQI is the target variable
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize the Random Forest Regressor model
model = RandomForestRegressor(n_estimators=100, random_state=42)
# Train the model
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Evaluate the model
print("Mean Absolute Error:", mean_absolute_error(y_test, y_pred))
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
print("R-squared Score:", r2_score(y_test, y_pred))
# Visualize actual vs. predicted values
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.5)
plt.xlabel('Actual AQI')
plt.ylabel('Predicted AQI')
plt.title('Actual vs. Predicted AQI')
plt.show()