Ajanah, Hakeema Ize Final Project
BY
SEPTEMBER, 2023
CERTIFICATION
This is to certify that this study was carried out by AJANAH HAKEEMA IZE
Nigeria.
______________________ ____________________
Dr. I. D. Oladipo Date
(Supervisor)
______________________ ____________________
Prof. R. O. Oladele Date
(Head of Department)
______________________ ____________________
External Examiner Date
DEDICATION
This project is dedicated to Almighty Allah for His protection and guidance during my study.
ACKNOWLEDGMENTS
I want to thank Almighty Allah for His eternal mercy, guidance and protection
Dr Ghaniyyat B. Balogun, for her motherly love and guidance and discipline, to all
Maimuna Oyiza, Imran Ibrahim, and my niece, Shaaziya Sani-Omolori. May Allah
always being there for me. I will be forever grateful to all of you.
ABSTRACT
TABLE OF CONTENTS
TITLE PAGE ........................................................................................................... i
CERTIFICATION .................................................................................................. II
ACKNOWLEDGMENTS .................................................................................... IV
ABSTRACT........................................................................................................... V
INTRODUCTION .................................................................................................. 1
2.2 Climate Change ..............................................................................................9
METHODOLOGY ............................................................................................... 31
3.8.2.3 CatBoost.................................................................................................51
3.9.2 Mean Square Error (MSE) ........................................................................60
4.7.1.1 CatBoost:................................................................................................75
CHAPTER FIVE .................................................................................................. 78
REFERENCES ..................................................................................................... 82
LIST OF TABLES
LIST OF FIGURES
Figure 4.13: Evaluation of the HistGradient Algorithm ....................................... 72
CHAPTER ONE
INTRODUCTION
The impacts of climate change are becoming more evident as time goes on. Storms,
droughts, wildfires, and floods are occurring with greater intensity and frequency.
Humanity's dependence on natural resources and agriculture is changing alongside
the global ecosystems. According to the 2018 Intergovernmental Panel on Climate Change report, if greenhouse gas emissions are not eliminated within the next three decades, our planet will suffer catastrophic consequences (Cianconi et al., 2020). Weather prediction has remained one of the most challenging scientific and technological issues over the past century. This can be attributed mainly to two key factors: the first is its wide application across various human pursuits, and the second is the opportunities created by technological progress directly linked to this field of research, including advancements in computing and improvements in measurement systems (Garima & Mallick, 2016).
Extreme climate change has lately had an impact on Africa: from October 2019 to January 2020, East Africa experienced record-breaking rains. The rainfall triggered landslides and floods throughout the region, a natural disaster that negatively affected more than 2.8 million people in Ethiopia, Kenya, Somalia, Uganda, Tanzania, and Djibouti. These regions, however, also experience a deficient rainy season spanning from March to May, leading to food shortages and famines.
Furthermore, the susceptibility of these areas has heightened due to the influence
of climate change, particularly in North-Eastern Africa, which has sparked
increased interest in climate change research (Caroline et al., 2020). Given that
atmospheric greenhouse gases (GHGs) exert the most significant influence on
climate change, the utilization of artificial satellites to monitor anthropogenic GHG
concentrations in space has become imperative. Approximately 40% of annual
human-induced carbon dioxide (CO2) emissions originate from coal-burning power
plants. Additionally, man-made sources contributing to methane (CH4) emissions,
aside from natural sources like termites, inland lakes, and wetlands, encompass coal
mines, oil-gas systems, livestock, wastewater management, rice farming, and
landfills (Gurdeep et al., 2021). Multiple studies have indicated an increase in the
frequency and intensity of extreme precipitation events due to climate change
(Kristie et al., 2021). Understanding these evolving hazards is critical for preparing
for future extreme precipitation and flooding. One anticipated consequence of
global warming is heightened precipitation intensity, driven by increasing
atmospheric moisture (Tabari, 2020). The dynamic alterations resulting from
climate change could also impact the position and velocity of storm tracks, as well
as the occurrence of atmospheric conditions conducive to extreme precipitation
(Ben et al., 2022). Nonetheless, the effects of global warming on regional and local
precipitation extremes remain not fully understood due to the inherent complexities
in simulating precipitation processes within general circulation models (Tabari,
2020).
approach, which can be regarded as Artificial Intelligence, can process a massive
amount of data as well as the relationship between variables (Olaiya, 2012). The
goal of artificial intelligence, a sub-field of computer science, is to enable a computer to accomplish tasks that would otherwise require human intelligence. Artificial
intelligence frequently entails making decisions under varied conditions. In
machine learning, a sub-field of AI, computers discover associations using massive
training datasets. Due to significant advancements in processor availability, speed,
connection, and data storage costs, artificial intelligence and machine learning are
having an increasing impact on society (Philip, 2020). Climate change forecasting
can be done using either supervised or unsupervised machine learning techniques.
Supervised learning, the most prevalent group of techniques in recent articles on the subject, has been found to be the most attractive to atmospheric scientists. If labeled data are available, they can be used as a training dataset for creating a function that maps inputs to outputs (Olaiya, 2012).
This function can then be applied to a separate dataset, referred to as the testing set, to
evaluate the model. If the results are satisfactory, it can then be applied to the
classification or regression of any application that requires it. In that category, we
find techniques like Support Vector Machine (SVM), Deep Learning (DL), Random Forest (RF), Artificial Neural Networks (ANN), and Decision Trees (DT). The second category of machine learning is unsupervised
learning, where computers must choose alternative ways to separate or minimize
the dimensions of a given dataset in order to conduct additional analysis because
they lack labeled training data. K-means Clustering (K-means) and Principal
Component Analysis (PCA) are two methodologies that atmospheric scientists
frequently use. Atmospheric scientists and meteorologists can anticipate climate
change thanks to machine learning, particularly the supervised technique (Olaiya,
2012).
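A minimal sketch, using scikit-learn on synthetic data (not the meteorological data of this study), illustrates the contrast described above: a supervised decision tree learns a labeled input-output mapping, while K-means and PCA operate on the same inputs without any labels.

```python
# Sketch of supervised vs. unsupervised learning on synthetic data.
# The two "weather-like" input columns and the target are invented for illustration.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.uniform(0, 40, size=(200, 2))                       # synthetic predictors
y = 0.5 * X[:, 0] - 0.2 * X[:, 1] + rng.normal(0, 1, 200)   # synthetic target

# Supervised: labeled pairs (X, y) train a function mapping inputs to outputs,
# which is then evaluated on a held-out testing set.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = DecisionTreeRegressor(max_depth=4).fit(X_train, y_train)
print("supervised R^2 on held-out data:", model.score(X_test, y_test))

# Unsupervised: no labels; K-means separates the samples into groups,
# and PCA minimizes the dimensionality of the dataset.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
X_reduced = PCA(n_components=1).fit_transform(X)
print("cluster sizes:", np.bincount(labels), "reduced shape:", X_reduced.shape)
```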
The objective of this study is to compare several machine learning methods for forecasting climate change, leveraging both contemporary and classic tree-based algorithms and using meteorological rather than physics-based data to determine the most suitable algorithm for accurate and efficient climate change forecasting.
This work aims to propose a system for comparative analysis of machine learning
algorithms for climate change forecasting.
Climate change poses a serious threat to generations both now and in the future. It modifies local and regional precipitation extremes, and the effects of flooding and excessive precipitation on human society are extensive. According to a panel of ML experts, climate change is one of the biggest problems confronting humanity, as it has heightened the occurrence, intensity, and unpredictability of natural calamities. The results of this study will offer an effective tree-based model for forecasting climate change.
1.5 Scope of the Study
The scope of this research is to evaluate nine tree-based algorithms for forecasting climate change, compare the outcomes of each method, and determine which algorithm provides the best forecasting accuracy. Bagging, Random Forest, Extra Trees, Gradient Boosting, Extreme Gradient Boosting (XGBoost), LightGBM, CatBoost, Decision Trees, and HistGradient Boosting are the algorithms to be used in this study.
Expert System: An expert system is a computer system that replicates the decision-
making capabilities of a human expert.
Feature Selection: Feature selection is the process of reducing the number of input
variables when constructing a predictive model. This helps lower computational
costs and can enhance model performance.
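A brief sketch, assuming scikit-learn and synthetic data, shows feature selection in this sense: reducing ten input variables to the three most strongly associated with the target.

```python
# Sketch of feature selection: SelectKBest keeps the k input columns whose
# univariate F-statistic with the target is highest. Data are synthetic.
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

X, y = make_regression(n_samples=300, n_features=10, n_informative=3, random_state=0)
selector = SelectKBest(score_func=f_regression, k=3).fit(X, y)
X_small = selector.transform(X)
print("before:", X.shape, "after:", X_small.shape)  # (300, 10) -> (300, 3)
```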
Machine Learning: Machine learning (ML) is a subset of artificial intelligence
(AI) that enables software applications to enhance prediction accuracy without
explicit programming. ML algorithms use historical data as input to forecast new
output values.
Scikit Learn: Scikit-learn (or sklearn) is a free machine learning library for Python,
widely used for various data analysis and modeling tasks.
Data Wrangling: Data wrangling, also referred to as data munging, is the process
of transforming and mapping data from one format to another to make it more
suitable and valuable for downstream applications such as analytics.
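A small hypothetical example, using pandas on invented weather records (the column names and values are illustrative, not from any real dataset), shows typical wrangling steps: parsing dates, reshaping from long to wide format, and converting units.

```python
# Sketch of data wrangling: transform a long-format table of mixed weather
# variables into a wide, analysis-ready form and convert Fahrenheit to Celsius.
import pandas as pd

raw = pd.DataFrame({
    "date": ["2020-01-01", "2020-01-01", "2020-01-02"],
    "variable": ["temp_f", "rain_mm", "temp_f"],
    "value": [68.0, 3.2, 71.6],
})
raw["date"] = pd.to_datetime(raw["date"])

# Pivot long -> wide: one row per date, one column per variable.
wide = raw.pivot(index="date", columns="variable", values="value")
wide["temp_c"] = (wide["temp_f"] - 32) * 5 / 9
print(wide)
```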
Chapter 1 - Introduction:
This chapter introduces the background, objectives, and scope of the study.
Chapter 2 - Literature Review:
The literature review delves into a comprehensive exploration of related concepts and previous research relevant to the subject.
Chapter 3 - Methodology:
In this chapter, you will find a detailed description of the study's development phases and approach.
Chapter 4 - Results and Discussion:
This section presents the findings and analysis of the project's work.
CHAPTER TWO
LITERATURE REVIEW
2.1 Introduction
greenhouse effect occasionally trapping heat. Climate change can be influenced by
a multitude of factors, both natural and anthropogenic.
billion metric tons of carbon dioxide annually, accounting for over 20% of global
CO2 emissions. Additional human activities contributing to air pollution include
the use of fertilizers (a prominent source of nitrous oxide emissions), livestock
raising (with cattle, buffalo, sheep, and goats being notable methane emitters), and
specific industrial processes generating fluorinated gases (Shivanna, 2022).
The failure to mitigate and adapt to climate change is identified as the most
catastrophic global threat, surpassing even concerns like weapons of mass
destruction and water scarcity, according to the 2021 Global Risks Report by the
World Economic Forum. The repercussions of climate change are far-reaching, as
it disrupts global ecosystems, affecting every facet of our lives, including our
habitats, water sources, and air quality. While climate change affects everyone in
some way, it disproportionately impacts certain groups, such as women, children,
people of color, Indigenous communities, and those with lower socioeconomic
status. Climate change is fundamentally intertwined with human rights.
As the Earth's atmosphere warms, it holds and releases more water, leading to wet regions becoming wetter and dry areas becoming drier. This alteration in weather
patterns results in an increased frequency and severity of natural disasters,
including storms, floods, heatwaves, and droughts. These events can have
devastating and costly consequences, jeopardizing access to clean drinking water,
igniting uncontrollable wildfires, causing property damage, hazardous material
spills, air pollution, and loss of life (NRDC, 2022).
2.4.2 Air Pollution
Climate change and air pollution are intricately linked, with each exacerbating the
other. Rising global temperatures lead to increased smog and soot levels,
contributing to air pollution. Additionally, extreme weather events, such as floods,
lead to the circulation of mold and pollen, further polluting the air. These conditions
worsen respiratory health, particularly for the 300 million people worldwide with
asthma, and exacerbate allergies. Severe weather events can contaminate drinking
water and damage essential infrastructure, increasing the risk of population
displacement. Displacement, in turn, poses health risks, including overcrowding,
trauma, water scarcity, and the spread of infectious diseases (NRDC, 2022).
The Arctic is warming at roughly twice the global average rate, resulting in the melting
of ice sheets and causing sea levels to rise. By the end of this century, oceans are
projected to rise by 0.95 to 3.61 feet, posing a significant threat to coastal
ecosystems and low-lying areas. Island nations and major cities like New York
City, Miami, Mumbai, and Sydney are particularly vulnerable to rising sea levels
(NRDC, 2022).
Climate change forces wildlife to rapidly adapt to changing habitats. Many species
alter their behaviors, migrate to higher elevations, and modify migration routes,
potentially disrupting entire ecosystems and their intricate webs of life. This
disruption has dire consequences, with one-third of all plant and animal species
facing extinction by 2070, according to a 2020 study. Vertebrate species are
declining at an accelerated rate, attributed to climate change, pollution, and
deforestation. Warmer winters and longer summers enable some species, like tree-
killing insects, to thrive, posing a threat to entire forests (NRDC, 2022).
2.5 Machine Learning Algorithms for Climate Change Forecasting
Decision trees make decisions based on input variables using a tree-like structure.
They have been employed to predict precipitation, temperature, and extreme
weather events.
Support vector machines (SVMs) seek to find a hyperplane separating input data
into different classes. They have been utilized for predicting precipitation,
temperature, and drought.
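A hedged sketch, fitted to a synthetic seasonal curve rather than any dataset from the cited studies, shows support vector regression applied to a temperature-style prediction task of the kind described above. The kernel and regularization settings are illustrative choices, not values from the literature.

```python
# Sketch: support vector regression (SVR) on a noisy synthetic annual
# temperature cycle, the style of task SVMs are applied to in this review.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
day = np.arange(365).reshape(-1, 1)                     # day of year as the input
temp = 15 + 10 * np.sin(2 * np.pi * day.ravel() / 365) \
       + rng.normal(0, 1.5, 365)                        # synthetic temperatures

# RBF kernel; C and gamma are illustrative hyperparameters.
model = SVR(kernel="rbf", C=10.0, gamma=0.001).fit(day, temp)
print("fit R^2:", round(model.score(day, temp), 3))
```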
2.5.5 Artificial Neural Networks
Artificial neural networks (ANNs), inspired by the human brain's structure, have
been used for forecasting temperature, precipitation, and extreme weather events.
Deep learning, a subset of ANNs, employs multiple layers to extract features from
input data. It has been employed in forecasting temperature, precipitation, and sea
level changes.
2.5.7 Clustering
Pacific Ocean Sea Surface Temperature Anomaly (SSTA) by employing data
preprocessing using Singular Spectrum Analysis (SSA).
Numerous studies have delved into the utilization of machine learning algorithms
for climate change prediction. Liu et al. (2020) compared multiple machine learning
algorithms, including random forest, gradient boosting, and deep neural networks,
in predicting precipitation patterns in China, with deep neural networks proving to
be superior in terms of accuracy and robustness.
diverse climate drivers, including greenhouse gas emissions and volcanic activity,
sourced from the National Centers for Environmental Information for climate data
and the Global Carbon Project and Global Volcanism Program for climate driver
data. The study demonstrated that machine learning models surpassed traditional
statistical models in forecasting global temperature and precipitation alterations.
Critical climate drivers like carbon dioxide emissions and volcanic activity were
identified as pivotal in these predictions. The study recommended future research
to delve deeper into detailed climate driver data to enhance model accuracy.
Evaluation metrics included mean squared error, mean absolute error, and
correlation coefficient.
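The evaluation metrics recurring throughout these studies (MSE, MAE, RMSE, R², correlation coefficient) can be computed directly; the following sketch uses invented predicted/observed values, not data from any cited paper.

```python
# Sketch: the standard evaluation metrics used in the reviewed studies,
# computed on toy observed vs. predicted temperature values.
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

observed = np.array([14.2, 15.1, 15.8, 16.4, 17.0])
predicted = np.array([14.0, 15.3, 15.6, 16.9, 16.8])

mse = mean_squared_error(observed, predicted)     # mean squared error
rmse = np.sqrt(mse)                               # root mean squared error
mae = mean_absolute_error(observed, predicted)    # mean absolute error
r2 = r2_score(observed, predicted)                # coefficient of determination
cc = np.corrcoef(observed, predicted)[0, 1]       # correlation coefficient
print(f"MSE={mse:.3f} RMSE={rmse:.3f} MAE={mae:.3f} R2={r2:.3f} CC={cc:.3f}")
```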
Brouwer et al. (2019) harnessed machine learning to forecast the impact of climate
change on water resources in southern Africa. They trained various models,
including artificial neural networks and support vector regression, on climate data
spanning from 1960 to 2016. These models were employed to foresee future water
availability and demand under diverse climate scenarios. Climate data were
obtained from the Climate Research Unit, and water resource data were sourced
from the Southern African Development Community. Machine learning models
effectively predicted future water availability and demand under varying climate
scenarios while identifying regions susceptible to climate change impacts on water
resources. Prospective research could concentrate on incorporating more detailed
climate data to improve model accuracy. Evaluation metrics encompassed mean
squared error, mean absolute error, and correlation coefficient.
Lassalle et al. (2020) leveraged machine learning to anticipate the effects of climate
change on the proliferation of invasive species. Their approach involved training
multiple models, including decision trees and random forests, using data from
citizen science projects. These models were then utilized to project the future
distribution of invasive species under different climate scenarios. Datasets were
drawn from citizen science projects like iNaturalist and GBIF, along with climate data from the WorldClim database. The results demonstrated the accurate
prediction of future invasive species distribution under diverse climate scenarios.
Future research avenues could involve incorporating more detailed ecological data
and enhancing model accuracy. Evaluation metrics included the area under the
receiver operating characteristic curve and the kappa coefficient.
Sharma et al. (2019) developed a machine learning model for climate change
forecasting utilizing temperature, precipitation, and CO2 concentration data. Their
approach combined linear regression and artificial neural networks to predict future
climate patterns. Temperature, precipitation, and CO2 concentration data were
sourced from the Goddard Institute for Space Studies. The study achieved an
accuracy rate of over 90% in its predictions and recommended future work to
incorporate more data sources to improve model accuracy.
and sea level rise data for model training and testing. Artificial neural networks
emerged as the top-performing algorithm, with a Root Mean Square Error of 0.06
and a Mean Absolute Error of 0.03. Random forests also demonstrated promise,
with a Root Mean Square Error of 0.08 and a Mean Absolute Error of 0.04. The
study encouraged future work to focus on expanding data sources and improving
model accuracy, especially in predicting extreme weather events. Evaluation
metrics encompassed root mean square error and mean absolute error.
Rahman et al. (2018) applied a machine learning approach to predict climate change
based on factors such as greenhouse gas emissions, temperature, and precipitation.
They employed a multi-layer perceptron (MLP) neural network for modeling, using
climate data from the Climate Research Unit (CRU) and greenhouse gas emissions
data from the Carbon Dioxide Information Analysis Center (CDIAC). The MLP
model accurately predicted climate changes with a 92.5% accuracy rate,
emphasizing the significant impact of greenhouse gas emissions. Future research
directions could involve incorporating additional factors like land use change and
deforestation for improved model accuracy. The model's performance was assessed
using accuracy rate.
Ali et al. (2020) utilized machine learning algorithms to predict global temperature
based on historical temperature data. Employing a long short-term memory
(LSTM) neural network, the study achieved an accuracy rate of 96.7% in
forecasting future temperature changes. The LSTM model effectively captured
short-term and long-term temperature trends. Future research could focus on
incorporating additional data sources, such as ocean temperatures and atmospheric
carbon dioxide concentrations, to enhance model accuracy. Evaluation metrics
encompassed Mean Squared Error (MSE) and Root Mean Squared Error (RMSE).
Alirezazadeh et al. (2021) explored the applications of machine learning methods
for solar radiation prediction through a systematic review of literature spanning
from 2010 to 2021. They scrutinized research methods, datasets, and performance
metrics used in solar radiation prediction studies. Diverse datasets, including
Global Energy Observatory (GEO) and National Renewable Energy Laboratory
(NREL) datasets, were employed. Support vector machines (SVMs), artificial
neural networks (ANNs), and decision trees (DTs) emerged as popular choices for
solar radiation prediction, reporting high accuracy rates. The study advocated for
future research to focus on enhancing accuracy and robustness in machine learning
models for solar radiation prediction and developing hybrid models that combine
different methods. Evaluation metrics included mean absolute error (MAE), root
mean square error (RMSE), and coefficient of determination (R2).
disadvantages, without focusing on specific datasets. The paper recommended the
development of hybrid models combining different ML techniques and addressing
missing data challenges in climate datasets. The study did not specify the use of
particular evaluation metrics.
Robinson et al. (2019) harnessed the power of machine learning to forecast the
impact of climate change on our oceans. Their approach involved leveraging an
artificial neural network (ANN) algorithm, which relied on historical data to make
predictions. To ensure the ANN's accuracy, the researchers employed both
historical climate data and information about ocean circulation and chemistry to
train their algorithm. The outcomes of this research showcased the ANN's
remarkable ability to predict alterations in ocean temperatures and chemistry in
response to varying climate scenarios. This study pointed towards the necessity of
developing more sophisticated models capable of considering the intricate
interactions among various climate factors. The research employed a range of
evaluation metrics, including mean absolute error, mean squared error, and the
coefficient of determination.
Zhu et al. (2020), on the other hand, introduced a deep learning approach to identify
and analyze climate change patterns. Employing a convolutional neural network
(CNN) algorithm, the study delved into the analysis of satellite data. These
researchers employed satellite data to train their CNN algorithm, demonstrating the
CNN's precision in detecting and scrutinizing climate changes within satellite data.
The research also emphasized the need for evolving deep learning algorithms that
can effectively handle the high-dimensional data frequently encountered in climate
change research. The assessment in this study involved several evaluation metrics,
including accuracy, precision, recall, and the F1 score.
Kumar et al. (2020) embarked on a comprehensive review, scrutinizing 120
research papers devoted to climate change detection and mitigation through the
application of machine learning techniques. To assess climate change, various
datasets were utilized, including temperature records, precipitation data, and carbon
dioxide emissions data. The findings unanimously supported the effectiveness of
machine learning techniques in predicting climate change, detecting anomalies, and
mitigating its impact. The researchers contended that these techniques held
immense potential in providing actionable insights and facilitating informed
decisions for tackling climate change. The study proposed future endeavors
concentrating on the development of more precise prediction models, data quality
enhancement, and the crafting of models capable of managing the uncertainty and
variability intrinsic to climate data. Evaluation was carried out using metrics such
as root mean squared error (RMSE), mean absolute error (MAE), and coefficient
of determination (R2).
Cordano et al. (2021) took an ensemble machine learning approach to predict the
implications of climate change on wildfire occurrences within the Mediterranean
region. Leveraging an assortment of climate variables such as temperature,
humidity, and precipitation, they designed a prediction model. The research
integrated two datasets – one housing climate variables and the other containing
historical wildfire occurrence data. The results showcased the ensemble machine
learning approach's aptitude in accurately forecasting wildfire occurrences and
delivering valuable insights concerning the influence of climate change on such
incidents. The study culminated in the assertion that this prediction model could
enrich decision-making and enhance wildfire management strategies. Cordano and
their team also recommended a deeper exploration of more comprehensive datasets
and improvements in the interpretability and explainability of the prediction model.
The evaluation in this study spanned metrics such as accuracy, precision, recall,
and the F1 score.
Zaman et al. (2018) unleashed machine learning models, including random forest,
artificial neural networks, and support vector regression, to predict the rainfall
patterns of the Indian monsoon. This endeavor incorporated an array of predictors,
encompassing sea surface temperatures, sea-level pressures, and wind shear, to
construct these models. Multiple datasets played a role, incorporating the Global
Precipitation Climatology Project dataset, the Climate Prediction Center Merged
Analysis of Precipitation, and the NOAA Optimum Interpolation Sea Surface
Temperature dataset. The crux of the matter was that the random forest model
exhibited superior performance in predicting Indian monsoon rainfall, with the
inclusion of remote-sensing data notably enhancing prediction accuracy.
Recommendations revolved around creating even more robust machine learning
models that could seamlessly integrate additional predictors, like land surface data
and atmospheric moisture content. Furthermore, the consideration of deep learning
models was touted as a potential path to further heighten climate prediction
accuracy. The study hinged its evaluation on metrics such as mean absolute error,
root mean square error, and correlation coefficient.
Liu et al. (2019), in their pioneering research, constructed a deep learning model
rooted in the long short-term memory (LSTM) neural network, aimed at predicting
global surface temperatures. Their model harnessed historical climate data,
utilizing a sliding window technique for training. Datasets featured in this endeavor
encompassed the Berkeley Earth Surface Temperature dataset and the Climate
Research Unit Temperature dataset. It transpired that the LSTM model
outperformed traditional statistical models when it came to forecasting global
surface temperatures. It was further discerned that incorporating external factors,
notably El Niño-Southern Oscillation, significantly bolstered predictive accuracy.
This research has paved the way for future exploration into alternative deep
learning models, such as convolutional neural networks and generative adversarial
networks, for more refined climate prediction. Recommendations also urged the
inclusion of additional predictors like greenhouse gas emissions and solar radiation.
The study leaned on evaluation metrics like mean absolute error and root mean
square error to gauge model performance.
Akinfaderin et al. (2019) tapped into the climate data collected from weather
stations dotting the Sahelian region. They executed data preprocessing on this
dataset and subsequently trained a machine learning algorithm for climate pattern
prediction. The algorithm demonstrated exceptional precision in predicting climate
patterns within the Sahelian region, encompassing rainfall patterns, temperature
fluctuations, and various other climate variables. This study advocates for further
exploration of machine learning's potential in devising strategies to mitigate and
adapt to climate change. The evaluation in this research relied on mean absolute
error (MAE), mean squared error (MSE), and R-squared (R²) as metrics.
Conclusions highlighted the support vector regression model as the most adept in
climate prediction, achieving a mean absolute error of 0.23 degrees Celsius. The
research underscored the profound impact of the choice of evaluation metric on
outcomes and underscored the promise of different machine learning models when
measured against different metrics. Future prospects outlined the incorporation of
more data sources, such as oceanic and atmospheric composition data, and the
evolution of more intricate machine learning models, including deep learning
models. The metrics that informed this research encompassed mean absolute error,
mean squared error, root mean squared error, correlation coefficient, and coefficient
of determination.
machine learning algorithms at the forefront: linear regression, polynomial
regression, support vector regression, random forest, and artificial neural networks.
The dataset in focus was the Climate Research Unit (CRU) dataset, encompassing
global temperature and precipitation data from 1901 to 2016. The research
underscored the artificial neural networks' prowess, particularly in predicting
precipitation and temperature changes stemming from climate change. The
artificial neural network achieved an R-squared value of 0.96 for temperature
prediction and 0.88 for precipitation prediction. In charting the way forward, the
study encouraged the exploration of more sophisticated machine learning models
adept at encapsulating the non-linear relationship between climate variables and
their impact on precipitation and temperature. The chosen evaluation metric was R-squared.
Zhang et al. (2019) employed advanced machine learning techniques to project the
repercussions of climate change on crop yields. The research entailed the
accumulation of diverse datasets encompassing climate variables, soil attributes,
and crop yields from varying regions. Utilizing this data, multiple machine learning
models were trained to forecast prospective crop yields in different climate
scenarios. Integral to this study were climate statistics from the Community Earth
System Model (CESM), soil specifics from the Soil Grids database, and crop yield
data sourced from the Food and Agriculture Organization (FAO). The findings
illustrated the precision of machine learning models in foreseeing crop yields under
distinct climate circumstances, boasting an average accuracy rate of 89%. The study
elucidated key climate variables, particularly temperature and precipitation, which
wield significant influence on crop yields. Recommendations were directed
towards advancing machine learning models to encompass intricate interactions
between climate variables and crop yields. Mean Absolute Error (MAE) and R-squared (R²) stood as pivotal evaluation metrics.
Kondratyev et al. (2020) navigated the realm of deep learning techniques to model
Earth's climate system, encapsulating atmospheric, oceanic, and land surface
processes. A rich array of deep learning models underwent training on extensive
climate data, prominently featuring the Coupled Model Intercomparison Project
Phase 5 (CMIP5) dataset. This dataset encompassed climate model simulations
from diverse research groups globally. The research outcomes demonstrated the
aptitude of deep learning models in accurately forecasting future climate scenarios,
effectively identifying pivotal factors influencing climate variability and change.
Moreover, these models proved instrumental in gauging climate change impacts
across multifarious sectors like agriculture, water resources, and energy. Future
research recommendations underscored the need for improved scalability and
interpretability of deep learning models in climate modeling, advocating for
integration with diverse model types such as physical and statistical models.
Evaluation metrics featured Mean Squared Error (MSE) and Root Mean Squared
Error (RMSE).
Shen et al. (2018) delved into climate prediction within the domain of China,
employing a machine learning approach. The research leveraged a Support Vector
Regression (SVR) model to envisage temperature and precipitation alterations in
China from 1960 to 2016. The study was built on temperature and precipitation
data for that period, sourced from the China Meteorological Data Sharing Service
System.
The results underscored the SVR model's efficacy in precisely forecasting
temperature and precipitation changes within China. This model showcased
superior predictive accuracy compared to other machine learning counterparts like
Random Forest and Artificial Neural Network. Recommendations spotlighted the
utilization of machine learning to predict additional facets of climate change like
extreme weather events and sea level rise. Key evaluation metrics encompassed
mean absolute error (MAE) and root mean square error (RMSE) to ascertain
prediction accuracy.
error (MAE) and root mean square error (RMSE), while the relationship between
predicted and observed temperature anomalies was assessed using correlation
coefficient (CC) and coefficient of determination (R²).
In summary, these studies collectively manifest the potent role of machine learning
and deep learning in unraveling critical insights into climate change impacts,
paving the way for more informed decisions and strategies to mitigate its effects.
2.9 Conclusion
CHAPTER THREE
METHODOLOGY
3.1 Introduction
In this chapter, we outline the methodology adopted for our investigation, which
centers on a comparative analysis of machine learning algorithms concerning
climate change prediction. We commence with an exposition of the dataset
employed in this research. Subsequently, we elucidate the preprocessing measures
executed to rectify and prepare the dataset for rigorous analysis. Finally, we provide
an insight into the machine learning algorithms that have been scrutinized in this
study, along with the performance metrics selected to facilitate a comprehensive
comparison of their effectiveness.
Feature Engineering and Selection: A pivotal phase involves feature engineering
and selection, where we optimize the chosen features to enhance their effectiveness
in climate change prediction.
Algorithm Implementation: We proceed to implement a diverse range of machine
learning algorithms, including Bagging, Random Forest, Extra Trees, Gradient
Boosting, XGBoost, LightGBM, CatBoost, Decision Trees, and HistGradient
Boosting.
Evaluation Metrics: To assess the performance of these algorithms thoroughly,
we employ an array of evaluation metrics, encompassing measures such as mean
squared error, root mean squared error, and R-squared.
To fill the identified research gaps and provide an effective approach to the study,
an innovative framework has been proposed in Figure 3.1. The framework provides
a clear overview of the analysis in this study and consists of nine components, as
seen in the framework.
Figure 3.1: Proposed Framework (visible component labels include Data
Collection, Data Preprocessing, Feature Extraction, and Feature Engineering)
3.3 Dataset Collection
The dataset used in this study was obtained from the National Oceanic and
Atmospheric Administration (NOAA). The data was collected from a variety of
sources, including ground-based weather stations, satellites, and other remote
sensing technologies. It contains daily measurements of temperature, humidity,
wind speed, precipitation, and other weather-related variables for various locations
from the most recent decade. The dataset also includes information on each
location's latitude, longitude, and elevation. It has a total of 373,000 records, each
3.4.1 Data Cleaning
Raw datasets often contain missing values, outliers, inconsistent formatting, and
other data quality issues. Begin by detecting any instances of missing data and
subsequently determine the most suitable approach for addressing them, be it
through imputation methods or deletion. Address outliers by either removing them
if they are erroneous or handling them separately if they represent valuable
information. Ensure consistent formatting and resolve any inconsistencies in units
of measurement and data representation.
Evaluation of Missing Data: The next step involved assessing the impact of
missing data on the analysis. This evaluation helped determine whether to drop the
entire row or column containing missing values or apply imputation techniques to
fill in the missing values. For instance, if a specific column had a large proportion
of missing values, it might be dropped to preserve data integrity and avoid bias in
subsequent analyses.
Dropping Variables with Null Values: Considering the project's scope and the
need for a clean dataset, the decision was made to drop variables or columns that
contained null values. By removing these variables, we ensured that the
comparative analysis of machine learning algorithms would be performed using
complete and reliable data. The dropped variables were documented to maintain
transparency and ensure reproducibility.
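As an illustration of the cleaning steps described above, the following Python sketch uses a hypothetical miniature table standing in for the NOAA data; the column names and the 50% null threshold are assumptions chosen for the example, not values taken from the study:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature weather table standing in for the NOAA dataset.
df = pd.DataFrame({
    "temperature": [21.5, 22.1, np.nan, 20.8],
    "humidity": [0.61, np.nan, 0.58, 0.63],
    "sea_level_pressure": [np.nan, np.nan, np.nan, 1013.2],  # mostly missing
})

# Drop columns where more than half the values are missing.
df = df.loc[:, df.isna().mean() <= 0.5]

# Impute the remaining gaps with each column's mean.
df = df.fillna(df.mean(numeric_only=True))

print(df.isna().sum().sum())  # → 0
```

The same pattern scales to the full dataset: columns dominated by nulls are dropped for integrity, while sparsely missing values are imputed rather than discarding whole rows.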
Figure 3.5: Data Cleaning III
Identify the relevant features (predictors) within the NOAA dataset that are
expected to exert a substantial influence on climate change forecasting. Feature
extraction techniques,
such as principal component analysis (PCA), can reduce dimensionality and
capture the most important information. Conduct feature selection by analyzing the
relationships between features and the target variable. Utilize techniques such as
correlation analysis, recursive feature elimination, or regularization approaches to
pinpoint the most significant features.
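The dimensionality-reduction step can be sketched with scikit-learn's PCA; the feature matrix below is synthetic stand-in data (six correlated columns), not the NOAA dataset:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical feature matrix: 200 observations of 6 correlated weather variables.
base = rng.normal(size=(200, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 4))])  # 6 columns, low effective rank

# Keep just enough components to explain 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape[1])  # far fewer than the original 6 columns
```

Because the six columns are linear mixtures of two underlying signals, PCA compresses them to one or two components with almost no information loss, which is the behaviour exploited for dimensionality reduction here.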
Extracting Day, Month, and Year: The datetime feature in the NOAA datasets
provides information about the specific date and time of each data point. However,
for climate change forecasting, it is often more meaningful to extract the day,
month, and year from the datetime feature. This extraction enables the analysis to
focus on long-term trends and patterns rather than specific timestamps. Through
appropriate date and time manipulation techniques, the day, month, and year
components were extracted from the datetime feature.
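The extraction of day, month, and year components can be performed with pandas' datetime accessors, as in this small illustrative sketch (the records are hypothetical):

```python
import pandas as pd

# Hypothetical daily records with a datetime column, as in the NOAA data.
df = pd.DataFrame({
    "datetime": pd.to_datetime(["2014-01-05", "2019-07-20", "2023-12-31"]),
    "temperature": [5.2, 28.4, 3.1],
})

# Extract day, month, and year so models can learn seasonal and long-term patterns.
df["day"] = df["datetime"].dt.day
df["month"] = df["datetime"].dt.month
df["year"] = df["datetime"].dt.year

print(df[["day", "month", "year"]].values.tolist())
# → [[5, 1, 2014], [20, 7, 2019], [31, 12, 2023]]
```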
Feature engineering of the NOAA dataset is a crucial step in preparing the data for
comparative analysis of machine learning algorithms for climate change
forecasting. By leveraging domain knowledge, applying temporal aggregation,
incorporating rolling window statistics, including time-based features, introducing
interaction and polynomial features, and performing dimensionality reduction,
researchers can extract meaningful and informative features. Properly engineered
features facilitate the accurate representation of climate dynamics, enabling
machine learning algorithms to capture and predict climate change patterns
effectively.
It is also important to identify and drop features that do not contribute
significantly to the climate change forecasting task. Irrelevant features can
introduce noise, increase computational complexity, and potentially mislead the
machine learning algorithms. By carefully examining the dataset, an assessment
was made to identify any features that were deemed irrelevant to the comparative
analysis of machine learning algorithms. Once identified, the irrelevant feature(s)
were dropped from the dataset.
3.7 Data Segmentation
Identifying the Target Variable: In climate change forecasting, the target variable
is typically the variable of interest that we want to predict or forecast. For example,
it could be future temperature or precipitation values. Identifying the target variable
is essential for defining the supervised learning problem and splitting the data
accordingly.
Splitting the Data into Training and Testing Sets: In order to assess and contrast
various machine learning algorithms effectively, it is vital to possess distinct
datasets for training and testing. The NOAA dataset was divided into two separate
subsets: one for training purposes and the other for testing. The training set is
employed for model training and refinement, whereas the testing set serves the
purpose of assessing the models' performance on data they haven't been exposed to
previously.
Randomization and Stratification: To ensure that the training and testing datasets
are representative of the overall data distribution, randomization and stratification
techniques can be applied. Randomization ensures that the data samples are
shuffled randomly before splitting, reducing the potential for any systematic bias.
Stratification, on the other hand, ensures that the distribution of classes within the
target variable is preserved in both the training and testing sets, particularly when
dealing with imbalanced datasets.
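A minimal sketch of the randomized train/test split, using scikit-learn's train_test_split on synthetic stand-in data; the 80/20 split ratio shown here is an assumption for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))               # hypothetical feature matrix
y = 3.0 * X[:, 0] + rng.normal(size=1000)    # hypothetical target, e.g. temperature

# shuffle=True randomizes the sample order before splitting;
# random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42)

print(len(X_train), len(X_test))  # → 800 200
```

For classification targets, the same function accepts a `stratify=` argument to preserve class proportions in both subsets, which implements the stratification described above.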
The Extra Trees algorithm, short for Extremely Randomized Trees, is an ensemble
learning method that builds a forest of randomized decision trees. It is similar to
the Random Forest algorithm but introduces additional randomness by selecting
random thresholds for each feature at every node. This extra randomness further
reduces the variance and can lead to improved performance and faster training
times.
Figure 3.11: Extra Trees Learning Process
The algorithm follows these steps:
a. Randomly select a subset of the training data.
b. Randomly select a subset of features.
c. Build a decision tree using the random subset of data and features. At each
internal node, randomly select a feature and threshold.
During the prediction phase, the Extra Trees Regressor aggregates the predictions
from all the decision trees in the ensemble by averaging them. The final prediction
is the average of the predictions made by each individual tree.
Pseudocode for Extra Trees Algorithm
Input:
- Training data D
- Number of base models N
- Number of random features F
Output:
- Ensemble model E
Initialize E as an empty ensemble
For i = 1 to N:
    Randomly select F features from D
    Train a base model M_i on a random split of D using the selected features
    Add M_i to E
Return E
In this study, the Extra Trees algorithm takes the NOAA training dataset, the
number of base models N, and the number of random features F as input. It
initializes an empty ensemble model and then iterates N times. In each iteration, it
randomly selects F features from the dataset. It then randomly splits the dataset
using the selected features and trains a base model on the split dataset. The trained
base model is added to the ensemble model, and the process is repeated N times.
Finally, the ensemble model is returned as the output.
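The procedure above corresponds closely to scikit-learn's ExtraTreesRegressor, sketched here on synthetic data; n_estimators and max_features stand in for N and F, and the dataset is hypothetical:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))                                  # hypothetical weather features
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=300)  # hypothetical target

# n_estimators plays the role of N, max_features the role of F in the pseudocode.
model = ExtraTreesRegressor(n_estimators=50, max_features=2, random_state=0)
model.fit(X, y)

# Predictions are the average over all 50 randomized trees.
print(round(model.score(X, y), 3))
```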
Figure 3.12: Random Forest Learning Process
The algorithm follows these steps:
a. Randomly sample a bootstrap dataset from the training data (sampling with
replacement).
b. Randomly select a subset of features.
c. Build a decision tree using the bootstrap sample and selected features.
During the prediction phase, the Random Forest Regressor aggregates the
predictions from all the decision trees in the ensemble by averaging them. The final
prediction is the average of the predictions made by each individual tree.
Pseudocode for Random Forest Algorithm
Input:
- Training data D
- Number of base models N
- Number of random features F
Output:
- Ensemble model E
Initialize E as an empty ensemble
For i = 1 to N:
    Sample a bootstrap dataset D_i from D (with replacement)
    Randomly select F features
    Train a base model M_i on D_i using the selected features
    Add M_i to E
Return E
In this study, the Random Forest algorithm takes the NOAA training dataset, the
number of base models N, and the number of random features F as input. It
initializes an empty ensemble model and then iterates N times. In each iteration, it
randomly selects F features from the dataset. It also samples a bootstrap dataset,
created by sampling with replacement from the original dataset. A base model is
trained on the bootstrap dataset using the selected features, and the trained model
is added to the ensemble. The process is repeated N times, and the ensemble model
is returned as the output.
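The same loop maps onto scikit-learn's RandomForestRegressor, where bootstrap=True performs the sampling with replacement described above; the data below are synthetic stand-ins:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=300)

# bootstrap=True resamples the training set with replacement for every tree,
# and max_features limits the features considered at each split (F in the pseudocode).
model = RandomForestRegressor(n_estimators=50, max_features=2,
                              bootstrap=True, random_state=1)
model.fit(X, y)
print(round(model.score(X, y), 3))
```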
3.8.2 Boosting Algorithms:
3.8.2.1 LightGBM
Pseudocode for LightGBM Algorithm
Input:
- Training data D
- Number of base models N
- Learning rate eta
- Maximum tree depth max_depth
Output:
- Ensemble model E
Initialize y_hat = 0 for all samples in D; initialize E as an empty ensemble
For i = 1 to N:
    Compute the negative gradient vector r_i for each training sample in D
    Train a base model M_i on r_i using the LightGBM-specific objective and
    regularization terms
    Update the predicted values y_hat by adding the predictions of M_i scaled by
    eta
    Add M_i to E
Return E
In this study, the LightGBM algorithm takes the NOAA training dataset, the
number of base models N, the learning rate eta, and the maximum tree depth
max_depth as input. It initializes the predicted values y_hat as 0 for all training
samples in the dataset. It also initializes an empty ensemble model. Then, it iterates
N times. In each iteration, it computes the negative gradient vector for each training
sample in the dataset. A base model is trained on the negative gradient vector using
LightGBM-specific objective and regularization terms. The predicted values for the
base model are computed, and then the predicted values y_hat are updated by
adding the predictions of the base model scaled by the learning rate eta. The base
model is added to the ensemble model. The process is repeated N times. Finally,
the ensemble model is returned as the output.
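The boosting loop in the pseudocode can be illustrated with a hand-rolled sketch for squared loss, where the negative gradient is simply the residual. Note this is a generic gradient-boosting illustration: it omits the LightGBM-specific histogram binning, leaf-wise growth, and regularization terms, and uses a plain decision tree as the base model on synthetic data:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 3))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]

N, eta, max_depth = 50, 0.1, 3        # number of base models, learning rate, depth
y_hat = np.zeros_like(y)              # initialize predictions to 0
ensemble = []

for _ in range(N):
    r = y - y_hat                     # negative gradient of squared loss = residuals
    tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)
    y_hat += eta * tree.predict(X)    # update predictions, scaled by eta
    ensemble.append(tree)

print(round(float(np.mean((y - y_hat) ** 2)), 4))  # training MSE shrinks each round
```

Each iteration fits a small tree to what the current ensemble still gets wrong, so the training error decreases monotonically; the learning rate eta controls how aggressively each new tree corrects the ensemble.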
3.8.2.2 XGBoost
XGBoost, short for Extreme Gradient Boosting, is a widely used gradient boosting
algorithm known for its effectiveness across various machine learning tasks,
including climate change prediction. XGBoost operates by constructing a sequence
of decision trees, where each tree is specialized in forecasting the remaining errors
from the preceding one. This strategy enables the algorithm to adeptly capture
intricate nonlinear patterns within the dataset.
Pseudocode for XGBoost Algorithm
Input:
- Training data D
- Number of base models N
- Learning rate eta
- Maximum tree depth max_depth
Output:
- Ensemble model E
Initialize y_hat = 0 for all samples in D; initialize E as an empty ensemble
For i = 1 to N:
    Compute the negative gradient vector r_i for each training sample in D
    Train a base model M_i on r_i using the XGBoost-specific objective and
    regularization terms
    Update the predicted values y_hat by adding the predictions of M_i scaled by
    eta
    Add M_i to E
Return E
Application of the XGBoost Algorithm to the Climate Change Dataset
In this research, the XGBoost algorithm is applied to the NOAA training dataset,
taking as input parameters the number of base models (N), the learning rate (eta),
and the maximum tree depth (max_depth). Initially, it sets the predicted values
(y_hat) for all training samples in the dataset to 0 and creates an empty ensemble
model. The algorithm then proceeds through N iterations. Within each iteration, it
calculates the negative gradient vector for every training sample in the dataset. A
base model is subsequently trained, utilizing XGBoost-specific objective and
regularization terms, based on this negative gradient vector. Predicted values for
the model are computed, and these predictions are incorporated into y_hat after
being adjusted by the learning rate (eta). The base model is then included in the
ensemble model. This entire process repeats itself N times. Ultimately, the
ensemble model is returned as the end result.
3.8.2.3 CatBoost
CatBoost is another gradient boosting algorithm that has gained popularity in recent
years. It is particularly useful for dealing with categorical features, which are
common in environmental data. CatBoost uses a novel approach to handle
categorical features, which helps to improve its accuracy and performance. Like
XGBoost and LightGBM, CatBoost works by building a series of decision trees.
Figure 3.15: CatBoost Learning Process
Pseudocode for CatBoost Algorithm
Input:
- Training data D
- Number of base models N
- Learning rate eta
Output:
- Ensemble model E
Initialize y_hat = 0 for all samples in D; initialize E as an empty ensemble
For i = 1 to N:
    Compute the negative gradient vector r_i for each training sample in D
    Train a base model M_i on r_i using the CatBoost-specific objective and
    regularization terms
    Compute the predicted values for the base model M_i
    Update the predicted values y_hat by adding the predictions of M_i scaled by
    eta
    Add M_i to E
Return E
In this study, the CatBoost algorithm takes the NOAA training dataset, the number
of base models N, and the learning rate eta as input. It initializes the predicted
values y_hat as 0 for all training samples in the dataset. It also initializes an empty
ensemble model. Then, it iterates N times. In each iteration, it computes the
negative gradient vector for each training sample in the dataset. A base model is
trained on the negative gradient vector using CatBoost-specific objective and
regularization terms. The predicted values for the base model are computed, and
then the predicted
values y_hat are updated by adding the predictions of the base model scaled by the
learning rate eta. The base model is added to the ensemble model. The process is
repeated N times. Finally, the ensemble model is returned as the output.
The Decision Tree algorithm is a versatile and widely used machine learning
method applicable to both classification and regression tasks. It creates a
hierarchical structure comprised of decision nodes and leaf nodes based on the
training data. Within this framework, each decision node represents an assessment
of a specific feature, while each leaf node signifies a predicted class or value. The
decision nodes partition the data based on feature conditions, enabling the tree to
make predictions by traversing from the root to a particular leaf node.
1. Select the best feature to split the data based on a suitable criterion (e.g., Gini
impurity for classification or mean squared error for regression).
2. Create a decision node based on the selected feature and its threshold.
3. Partition the data into two or more subsets based on the feature test.
4. Recursively repeat steps 1–3 on each subset until a stopping criterion is met
(e.g., maximum depth or a pure node).
5. Create a leaf node for each partitioned subset and assign it the most common
class label for classification or the mean value for regression.
Decision trees offer interpretability, as the learned rules can be easily understood
and visualized. They can handle both categorical and numerical features and can
capture complex relationships in the data. However, decision trees can be prone to
over-fitting, and techniques such as pruning and regularization are often used to
address this issue.
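The over-fitting behaviour noted above can be demonstrated on synthetic data: an unconstrained tree fits the training set almost perfectly (memorizing the noise), while limiting max_depth acts as a simple form of pruning. The dataset and depth value are illustrative assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 2))
y = X[:, 0] ** 2 + rng.normal(scale=0.2, size=300)

deep = DecisionTreeRegressor(random_state=3).fit(X, y)                  # unconstrained
shallow = DecisionTreeRegressor(max_depth=4, random_state=3).fit(X, y)  # regularized

# The unconstrained tree fits the training data (almost) perfectly,
# a symptom of over-fitting; the depth-limited tree trades training fit
# for better generalization.
print(round(deep.score(X, y), 3), round(shallow.score(X, y), 3))
```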
Pseudocode for Decision Tree Algorithm
Input:
- Training data D
- Maximum tree depth max_depth
Output:
- Decision tree
DecisionTree(D, max_depth):
    If all samples in D share the same target value, or max_depth is reached,
    return a leaf node
    Select the best feature and split point to partition the data D
    Create a decision node with the selected feature and split point
    Split the data D into subsets D_left and D_right based on the selected feature
    and split point
    Set the left child to DecisionTree(D_left, max_depth - 1) and the right child
    to DecisionTree(D_right, max_depth - 1)
    Return the decision node
Application of the Decision Tree Algorithm to the Climate Change Dataset
In this study, the Decision Tree algorithm takes the NOAA training dataset and the
maximum tree depth max_depth as input. It recursively builds a decision tree by
splitting the data based on the selected features and split points. The algorithm
performs the following steps:
a. If all samples in the dataset belong to the same class, create a leaf node with the
class label and return the leaf node.
b. If the maximum tree depth is reached or the dataset is a pure node (contains
only samples of a single class), create a leaf node with the majority class label and
return the leaf node.
c. Select the best feature and split point to partition the NOAA dataset.
d. Create a decision node with the selected feature and split point.
e. Split the NOAA dataset into subsets D_left and D_right based on the selected
feature and split point.
f. Set the left child of the decision node as DecisionTree(D_left, max_depth – 1).
g. Set the right child of the decision node as DecisionTree(D_right, max_depth – 1).
h. Return the decision node.
Histogram-based Gradient Boosting (HGB) combines gradient
boosting with methodologies reliant on histograms, resulting in superior speed and
performance. At the core of HGB lies the concept of discretizing the input features
into histograms, which it harnesses to execute swift and efficient gradient
computations. This deployment of histograms serves the dual purpose of
diminishing memory usage and accelerating the training procedure, all the while
preserving a notable level of predictive precision. Moreover, HGB incorporates
strategies like histogram subtraction and multi-leaf splitting, which serve to further
augment its efficiency.
Preprocess the training data by discretizing the input features into histograms.
Initialize the ensemble by setting initial predictions for all samples (e.g., using the
mean target value for regression or log-odds for classification).
a. Compute the gradients of the loss function for each training sample.
b. Build histograms for each feature based on the gradients and their corresponding
weights.
c. Perform histogram subtraction to compute the gradients and Hessians of the loss
function for candidate splits.
d. Select the best splits for each feature based on the gradients and Hessians.
e. Grow decision trees using the selected splits.
f. Update the ensemble by adding the new trees, multiplied by a learning rate
(shrinkage).
Compute the final predictions by summing the predictions of all models in the
ensemble.
HGB provides fast training and prediction times while delivering competitive
performance on a wide range of datasets.
Pseudocode for HistGradient Boosting Algorithm
Input:
- Training data D
- Number of base models N, maximum tree depth max_depth, histogram bins K
Output:
- Ensemble model E
Initialize y_hat = 0 for all samples in D; initialize E as an empty ensemble
For i = 1 to N:
    Compute the negative gradient vector r_i for each training sample in D
    Construct a histogram representation of D using K bins
    Train a base model M_i on the histogram representation using r_i
    Update y_hat by adding the predictions of M_i; add M_i to E
Return E
In the pseudocode, the HistGradient Boosting algorithm takes the NOAA training
dataset, the number of base models N, the maximum tree depth max_depth, and the
number of histogram bins K as input. It initializes the predicted values y_hat as 0
for all training samples in the dataset. It also initializes an empty ensemble model.
Then, it iterates N times. In each iteration, it computes the negative gradient vector
for each training sample in the dataset. It constructs a histogram based on K bins.
A base model is trained on the histogram representation of the dataset using the
negative gradient vector. The predicted values for the base model are computed,
and then the predicted values y_hat are updated by adding the predictions of the
base model. The base model is added to the ensemble model. The process is
repeated N times. Finally, the ensemble model is returned as the output.
3.9 Performance Evaluation Metrics
3.9.1 Root Mean Square Error (RMSE)
The Root Mean Square Error (RMSE) stands as a frequently employed metric
within the realm of regression analysis. Its purpose lies in assessing the disparity
between the projected values and the factual values of the dependent variable.
RMSE computation entails the derivation of the square root from the average of the
squared disparities that exist between the predicted and actual values. The RMSE
formula is as follows:

RMSE = √( (1/n) Σ (y_pred − y_actual)² )
Here, y_pred represents the predicted values, y_actual denotes the actual values,
and n stands for the total number of observations. RMSE serves as a valuable metric
as it provides insight into the extent of disparities between predictions and actual
values. A smaller RMSE value signifies a higher level of accuracy in the model's
predictions.
3.9.2 Mean Square Error (MSE)
The Mean Square Error (MSE) measures the average of the squared differences
between the predicted and actual values:

MSE = (1/n) Σ (y_pred − y_actual)²

Here, y_pred represents the predicted values, y_actual the actual values, and n the
total number of observations. MSE serves as a valuable metric as it offers insight
into the typical magnitude of errors made by the model. A lower MSE value
signifies enhanced model accuracy.
3.9.3 R-squared (R²)
R-squared measures the proportion of variance in the dependent variable that is
explained by the model:

R² = 1 − Σ (y_actual − y_pred)² / Σ (y_actual − y_mean)²
In this equation, y_pred represents the predicted value, y_actual is the actual value,
y_mean signifies the mean of the actual values, and the sum is computed across all
observations. R-squared proves to be valuable as it provides an understanding of
how well the model conforms to the data. A higher R-squared value suggests that
the model accounts for a greater amount of variance in the data. However, it is
imperative to note that a high R-squared value does not necessarily imply that the
model is adept at forecasting future outcomes.
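The three metrics can be computed directly with NumPy; the values below are toy numbers chosen for illustration, not results from this study:

```python
import numpy as np

y_actual = np.array([2.0, 3.5, 4.0, 5.5])   # hypothetical observed values
y_pred = np.array([2.1, 3.4, 4.3, 5.2])     # hypothetical model predictions

mse = np.mean((y_pred - y_actual) ** 2)
rmse = np.sqrt(mse)
r2 = 1 - np.sum((y_actual - y_pred) ** 2) / np.sum((y_actual - np.mean(y_actual)) ** 2)

print(round(float(mse), 4), round(float(rmse), 4), round(float(r2), 4))
# → 0.05 0.2236 0.968
```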
CHAPTER FOUR
4.1 Introduction
During this stage, various journals and books were reviewed to gain a better insight
into the subject matter, identify similarities and differences in methods, and identify
a problem that could be solved. This stage also included setting up the coding
environment, learning which Integrated Development Environment (IDE) to use,
installation procedures, packages and libraries required, and best coding practices
for the implementation.
4.3.1 Programming Language
Python was used to carry out the experiment. Python is an interpreted, high-level,
general-purpose programming language that is used in a variety of programming
contexts. Because of its extensive library and resource set, it is one of the most used
languages in data science. It is also preferred for its ability to process large amounts
of data in various formats in a short time frame. Python adapts easily and includes
components for implementing visualization and graphics.
to provide efficient and high-performance implementations of gradient boosting
algorithms.
CatBoost Library: The CatBoost Library is a powerful open-source machine
learning library developed by Yandex. It specializes in gradient boosting on
decision trees and is designed to provide high-quality results with minimal data
preprocessing.
The research was conducted within the Google Colab environment, which is an
online, cloud-based Jupyter notebook platform. Google Colaboratory offers free
access for machine learning and deep learning model development, utilizing CPUs,
GPUs, and TPUs. It was used to write and execute the Python source code for this
study.
4.4 Exploratory Data Analysis of Climate Change Forecasting
Visualization of NMME Monthly Forecasts for Precipitation
Figure 4.4: Categorical Columns
Figure 4.5: Latitude
The figures below show the results of processes carried out on classifiers in the
Google Colab Interface
4.5.1 Bagging Algorithms:
Extra Trees
Random Forest
4.5.2 Boosting Algorithms:
XGBoost
CatBoost
4.5.3 Forest of Randomized Trees:
Decision Trees
4.6 Result Analysis
Among the various algorithms assessed for the regression task, the boosting
algorithms, specifically CatBoost, LightGBM, and XGBoost, displayed exceptional
performance according to the evaluation metrics of Root Mean Squared Error
(RMSE), Mean Squared Error (MSE), and R-squared (R2-Score). CatBoost
achieved the lowest RMSE of 0.1643 and MSE of 0.0269, signifying its capability
to minimize the average discrepancy between predicted and actual values.
Moreover, it attained an outstanding R2-Score of 0.9997, indicating that the model's
predictions can elucidate a substantial portion of the variance in the target variable.
While LightGBM and XGBoost also demonstrated strong performance, with
relatively low RMSE and MSE values, CatBoost surpassed them in all three
metrics. The higher R2-Score achieved by CatBoost further corroborates its
effectiveness in capturing the underlying data patterns. Based on these findings, it
can be inferred that boosting algorithms, particularly CatBoost, are highly suitable
for this regression task. The proficiency of boosting algorithms in amalgamating
weak learners into a robust ensemble, adaptively focusing on challenging instances,
and addressing intricate relationships within the data likely contributes to their
superior performance. It is noteworthy that bagging algorithms like Random Forest
and Extra Trees, as well as the forest of randomized trees algorithm, HistGradient
Boosting, and the standalone Decision Tree, also exhibited reasonably good
performance. However, their RMSE, MSE, and R2-Scores were comparatively
higher than those of the boosting algorithms.
Overall, the results strongly suggest that CatBoost, due to its impressive
performance across all evaluation metrics, should be considered as the preferred
choice when applying boosting algorithms to regression tasks similar to the one
under consideration.
4.7 Discussion of Findings
4.7.1.1 CatBoost:
CatBoost stood out as the best performer in terms of all evaluation criteria. It accomplished
the lowest RMSE of 0.1643, demonstrating its capacity to reduce the average disparity
between predicted and actual values. The minimal MSE of 0.0269 additionally validates
its precision in forecasting. Furthermore, CatBoost achieved an outstanding R2-Score of
0.9997, indicating that most of the variability in the target variable can be elucidated by
the model's predictions. This implies that CatBoost adeptly grasps the fundamental data
patterns and delivers exceedingly precise regression outcomes.
4.7.2 Comparison with Other Boosting Algorithms:
Although the bagging algorithms, Random Forest and Extra Trees, along with the
forest of randomized trees algorithm, HistGradient Boosting, delivered reasonably
good performance, their results were comparatively lower than those of the
boosting algorithms. These algorithms generate multiple models and aggregate
their predictions to make a final prediction. However, they were outperformed by
the boosting algorithms in terms of RMSE, MSE, and R2-Score. This suggests that
for the given regression task, the boosting algorithms' ability to focus on
challenging instances and capture complex relationships played a crucial role in
achieving superior results.
4.8 Conclusion
Future research could explore
hybrid models that combine the strengths of ensemble methods with deep learning
architectures to further enhance climate change predictions. Additionally,
integrating socio-economic factors and policy interventions into the forecasting
models can improve their real-world applicability and support decision-making
processes.
CHAPTER FIVE
5.1 Summary
The primary objective of the study was to evaluate and compare the performance
of various machine learning algorithms for climate change forecasting. The
CatBoost, LightGBM, XGBoost, Random Forest, Extra Trees, HistGradient
Boosting, and Decision Tree algorithms were utilized in the study analysis. After
conducting rigorous experiments and evaluating the results, it was discovered that
CatBoost outperformed all other algorithms in terms of predictive accuracy and
generalization capabilities. It was closely followed by LightGBM and XGBoost,
which also demonstrated strong performance. Random Forest and Extra Trees
exhibited moderate performance, while HistGradient Boosting and Decision Trees
showed relatively lower predictive accuracy. A range of evaluation metrics were
used to assess the performance of the algorithms, including mean absolute error,
mean squared error, and R-squared. The evaluation was conducted on a
comprehensive dataset of climate change indicators, considering both temporal and
spatial aspects. The analysis revealed that CatBoost consistently outperformed
other algorithms across different evaluation metrics, demonstrating its robustness
and effectiveness in climate change forecasting.
5.2 Limitations
While this study has provided valuable insights, it is essential to acknowledge its
limitations:
memory. This study had to source adequate computational resources for the
implementation of the analysis.
3. Data Quality: The accuracy of predictions heavily depends on the quality of the
available data. Climate data can be noisy, incomplete, or biased, so ensuring data
quality is a significant challenge. The study made sure that quality data was used.
4. Algorithm Selection: The choice of algorithms is essential, but it might not cover
all possible algorithms suitable for climate forecasting; newer algorithms may be
more effective. This study tried out different algorithms and selected the best ones
for the analysis.
5.3 Recommendations
In light of the discoveries made in this investigation, the following suggestions are
put forth for future research and the utilization of machine learning algorithms in
climate change prediction:
1. CatBoost should be considered as a primary choice for climate change
forecasting tasks due to its outstanding performance. Further research can
focus on understanding the specific features and techniques that contribute
to its superior predictive accuracy.
2. LightGBM and XGBoost can serve as alternative options for climate change
forecasting, especially when computational efficiency and scalability are
crucial factors.
3. Random Forest and Extra Trees can be utilized when a balance between
accuracy and computational efficiency is required. Further investigation can
explore techniques to enhance their performance in climate change
forecasting.
4. HistGradient Boosting and Decision Trees may be suitable for preliminary
analysis or when interpretability of the model is a priority. Research efforts
should be directed towards improving their accuracy and addressing their
limitations in the context of climate change forecasting.
5. Future research should focus on incorporating additional features and
variables into the models to further enhance their predictive capabilities.
The inclusion of more diverse climate indicators, geographical factors, and
temporal patterns could potentially improve the accuracy of the forecasting
models.
6. Ensembling techniques that combine the strengths of multiple algorithms
can be explored to boost the overall predictive accuracy in climate change
forecasting. Techniques such as stacking or blending different algorithms
can help exploit their complementary strengths.
7. Continuous monitoring and updating of the models should be ensured to
accommodate the dynamic nature of climate change. Regular retraining of
the models with new data can help maintain their accuracy and adaptability
to evolving climate patterns.
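To illustrate recommendation 6, one simple blending scheme forms the final forecast as a weighted average of the individual models' predictions, with each weight derived from that model's validation error. The sketch below is a minimal, pure-Python illustration; the model names, prediction values, and inverse-MSE weighting scheme are illustrative assumptions, not outputs of this study.

```python
# Minimal sketch of blending: combine predictions from several models
# using weights proportional to the inverse of each model's validation MSE.

def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def blend(model_preds, y_val):
    """Weight each model by the inverse of its validation MSE (normalized
    to sum to 1), then return the weighted-average blended prediction."""
    weights = {name: 1.0 / mse(y_val, preds)
               for name, preds in model_preds.items()}
    total = sum(weights.values())
    weights = {name: w / total for name, w in weights.items()}
    blended = [
        sum(weights[name] * model_preds[name][i] for name in model_preds)
        for i in range(len(y_val))
    ]
    return blended, weights

# Hypothetical validation targets and per-model predictions
# (stand-ins for e.g. CatBoost, LightGBM, and XGBoost outputs).
y_val = [20.1, 21.3, 19.8, 22.0]
model_preds = {
    "catboost": [20.0, 21.2, 19.9, 21.8],
    "lightgbm": [20.4, 21.0, 19.5, 22.3],
    "xgboost":  [19.7, 21.6, 20.2, 21.9],
}
blended, weights = blend(model_preds, y_val)
```

Because the blend is a convex combination of the individual predictions, its MSE can never exceed that of the worst contributing model, and it often improves on the best one when the models' errors are not strongly correlated.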
The recommendations emphasize the need for further research and exploration in
several areas. This includes understanding the underlying features and techniques
that contribute to CatBoost's superior performance, investigating ways to enhance
the accuracy of Random Forest and Extra Trees algorithms, and improving the
performance of HistGradientBoosting and Decision Trees for climate change
forecasting. Furthermore, future research should emphasize the incorporation of a
more diverse set of features and variables into the models, taking into account
geographical factors, temporal trends, and a broader array of climate indicators.
Exploring ensembling techniques that combine multiple algorithms can also be
beneficial, as they can harness the complementary strengths of these methods to
enhance overall predictive accuracy.
To ensure the continued effectiveness of the forecasting models, continuous
monitoring and updating are imperative. Climate change is an evolving process,
and regularly retraining the models with new data is essential to uphold their
accuracy and adaptability to shifting climate patterns.
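This retraining loop can be sketched as a rolling-window update: each new batch of observations is appended to the training history, the oldest observations fall out of the window, and the model is refit. The toy forecaster below (a moving average standing in for a real model) and its window size and data are purely hypothetical, chosen only to make the update mechanics concrete.

```python
from collections import deque

class RollingMeanForecaster:
    """Toy stand-in for a real model: predicts the mean of the most
    recent `window` training observations. Illustrates the continuous
    retraining loop only, not an actual climate model."""

    def __init__(self, window=12):
        self.window = window
        self.history = deque(maxlen=window)  # old data drops out automatically

    def update(self, new_observations):
        # "Retrain" by absorbing the newest batch of data.
        self.history.extend(new_observations)

    def predict(self):
        if not self.history:
            raise ValueError("model has not been trained yet")
        return sum(self.history) / len(self.history)

# Hypothetical monthly temperature anomalies arriving in batches.
model = RollingMeanForecaster(window=4)
model.update([0.8, 0.9, 1.1])
first = model.predict()    # mean of the three observations seen so far
model.update([1.2, 1.3])   # new data arrives; the oldest value is discarded
second = model.predict()   # mean of the latest four observations
```

With a real model the `update` step would refit on the retained window (or fine-tune on the new batch), but the scheduling logic is the same: retrain whenever fresh observations arrive so the model tracks the most recent climate behaviour.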
In summary, this study has made a valuable contribution to the field of climate
change forecasting by conducting a comprehensive evaluation and comparison of
various machine learning algorithms. The results underscore the exceptional
performance of CatBoost while shedding light on the strengths and weaknesses of
other algorithms. The recommendations outlined in this section are intended to
guide future research and the practical application of machine learning algorithms
in climate change forecasting, with the ultimate goal of enhancing our
comprehension and prediction of the impacts of climate change.
REFERENCES
Garg, H., & Garg, N. (2019). Predicting climate change using machine learning
techniques. Journal of Physics: Conference Series, 1331(1), 012015.
Jain, G., & Mallick, B. (2020). A Review on Weather Forecasting Techniques.
Journal of Atmospheric and Solar-Terrestrial Physics, 208, 105336.
Kumar, A., Dubey, S. K., Singh, P. K., & Packirisamy, M. (2020). Machine
Learning Techniques for Climate Change Detection and Mitigation: A
Comprehensive Survey. Environmental Modeling & Assessment, 25(4),
425-441.
Lassalle, R., et al. (2020). Using Machine Learning to Predict the Effects of Climate
Change on the Spread of Invasive Species. Environmental Modelling &
Software, 133, 104844.
Li, J., Zhang, Y., & Chen, Q. (2021). A comparative study of machine learning
algorithms for climate change prediction. Environmental Science and
Pollution Research, 28(7), 8830-8844.
Liu, Y., Wu, H., Dong, M., et al. (2019). Predicting Climate Change using Deep
Learning. Journal of Cleaner Production, 240, 118112.
Ma, Y., et al. (2018). A Machine Learning Approach to Predicting Global Climate
Change. Nature Communications, 9(1), 1-7.
Malhi, G. S., Kaur, M., & Kaushik, P. (2021). Impact of Climate Change on
Agriculture and Its Mitigation Strategies: A Review. Environmental
Science and Pollution Research International, 28(13), 15768-15786.
Nkwam Nkwam, I. H., Barria, J. M., & Diaz, R. J. (2020). A machine learning
approach to predict global temperature anomalies. Environmental Modeling
& Assessment, 25(4), 447-461.
Olaiya, F., & Adeyemo, A. B. (2012). Application of Data Mining Techniques in
Weather Prediction and Climate Change Studies. Journal of Emerging
Trends in Computing and Information Sciences, 3(8), 1091-1096.
Purushothaman, K., & Arulsamy, M. L. (2020). Deep learning approaches for
climate prediction: A review. Journal of Ambient Intelligence and
Humanized Computing, 11(6), 2225-2238.
Rahman, M. I., Islam, M. Z., & Akhand, M. A. H. (2018). A Machine Learning
Approach to Climate Change Prediction. IOP Conference Series: Earth and
Environmental Science, 116(1), 012056.
Rahnavard, N., Memarzadeh, N., & Shafiee, M. (2019). Machine Learning
Techniques for Climate Prediction: A Review. Journal of Hydrology, 575,
215-233.
Ravi, S., Raghavendra, S., & Prabhu, S. (2020). Machine learning for climate
change: A review. Renewable and Sustainable Energy Reviews, 123,
109723.
Robinson, N. L., Zanna, L., & Jahn, O. (2019). Machine Learning to Predict
Climate Change Impacts on the Oceans. Earth's Future, 7(5), 547-567.
Sahu, M., Pandey, S., & Kumar, A. (2020). Climate Prediction using Machine
Learning: A Review. Journal of Earth System Science, 129(5), 107.
Sharma, A., Gupta, M., Singh, S., & Patel, N. (2019). A machine learning approach
to climate change prediction. Journal of Ambient Intelligence and
Humanized Computing, 10(9), 3611-3623.
Shen, X., Chi, X., Liu, J., Zhang, X., & Zheng, F. (2018). A machine learning
approach to climate change prediction. Environmental Science and
Pollution Research, 25(17), 16413-16423.
Shivanna, K. R. (2022). Climate change and its impact on biodiversity and human
welfare. Journal of Biosciences, 47(1), 1-8.
Tabari, H. (2020). Climate change impact on flood and extreme precipitation
increases with water availability. Scientific Reports, 10(1), 1-13.
Wainwright, C. M., Finney, D. L., Kilavi, M., Black, E., & Marsham, J. H. (2020).
Extreme rainfall in East Africa, October 2019–January 2020 and context
under future climate change. Weather and Climate Extremes, 30, 100294.
Zaman, M. S., Islam, M. R., Akhand, M. A. H., et al. (2018). A Machine Learning
Approach to Climate Prediction: A Case Study. Indian Journal of
Engineering and Applied Sciences, 13(10), 7649-7656.
Zhu, Y., Zhang, C., & Wang, J. (2020). Deep Learning for Climate Change
Detection and Analysis. Environmental Research Letters, 15(12), 124003.