Scientia Africana, Vol. 22 (No. 3), December, 2023. Pp 137-144


© Faculty of Science, University of Port Harcourt, Printed in Nigeria ISSN 1118 – 1931

DESIGN AND IMPLEMENTATION OF A LOAN DEFAULT PREDICTION SYSTEM USING RANDOM FOREST ALGORITHM

1Oghenekaro, L. U., and 2Chimela, M. C.
1,2Computer Science Department, Faculty of Computing, University of Port Harcourt, Nigeria
Emails: [email protected], [email protected]

Received: 20-09-2023
Accepted: 01-11-2023

https://dx.doi.org/10.4314/sa.v22i3.12
This is an Open Access article distributed under the terms of the Creative Commons Licenses [CC BY-NC-ND 4.0]
http://creativecommons.org/licenses/by-nc-nd/4.0.
Journal Homepage: http://www.scientia-african.uniportjournal.info
Publisher: Faculty of Science, University of Port Harcourt.

ABSTRACT
Loan default prediction is a crucial task in the lending industry; it helps financial institutions make
informed decisions about granting loans. Predicting which customers will default on a loan is usually a
daunting task for a bank or financial institution, especially when there are thousands of applicants.
The loan default prediction system presented here aimed to improve the Area Under the Curve (AUC)
score. It used various data sources, such as demographic information, credit history, and financial
performance, to predict the likelihood of a loan being defaulted. The system used a random forest (RF)
machine learning algorithm to analyze the data and build predictive models. The model was then used to
make predictions about new loan applicants and existing borrowers who may default in the future. The
system can be customized to meet the specific requirements of different lending institutions. It
enables lenders to make better decisions on loan approval, interest rate determination, and credit risk
management. The system also provides insights into risk factors that contribute to loan default and
helps lenders develop effective strategies to mitigate these risks, making it an indispensable tool for
lenders. The resultant system achieved an improved AUC score of 98%.
Keywords: AUC score, Loan Default, Loan Processing, Predictive Model, Random Forest
Algorithm

INTRODUCTION

Loan processing has been a crucial issue for banks in recent years. It is a way of checking if a
customer will default on a loan in the process of repayment, and this knowledge will determine if a
loan should be granted to the customer or not. Many financial institutions or banks approve and
disburse loans following a long authentication and validation process, but there is no assurance that
the selected candidate is the most eligible of all applicants (Purohit et al., 2011). Through this
process, the bank minimizes the losses that could be incurred from defaults, hence increasing the
profit generated from the interest on the loan. Loans produce the largest income but constitute a huge
risk and exposure. In order to fund viable projects, banks mobilize deposits and create loans. When
loans are of good quality, they generate revenue for the bank and at the same time help to stimulate
economic growth (Hussain and Shorouq, 2014). In finance, a loan is the lending of money by one or more
individuals, organizations, or other entities to individuals etc. In a lot of instances, the lenders
usually add some charges called interest to the amount borrowed, which the debtor must pay while
repaying the amount borrowed. The repayment of this loan by the debtor is usually within a fixed time
frame, which may be months or weeks. At times the debtors do not pay their loan as and when due,
resulting in a loan default. This leads to loss of money on the part of the lenders, since the debtors
might end up not paying part of the loan taken. Loan defaulting is a major financial risk for the
finance industry as it harms the interest of the financiers and destroys social trust (Twala, 2010).
Due to loss on the part of the lender (usually financial institutions), there have been efforts to
forecast the outcome of a loan before approval to curb instances of bad debts. In recent times, with
the introduction of new technologies, data is being generated with every click, and data scientists
have been researching and making progress in the finance and banking field (Hamid and Ahmad, 2011).
Research has been carried out to build systems that will predict if a customer will pay back his/her
loan on time. Previously, when an applicant filled out a form to get a loan from the bank, the
customer's credit score history was usually analyzed by the loan officers together with other things
like the amount to be loaned, the salary of the applicant, the reason for applying for the loan, the
amount currently in the bank, and whether the customer had any existing loan when applying for the new
one. This process was usually time consuming and tasking, especially when the number of loan
applicants was large.

Currently, with a lot of data being generated on a daily basis and with the aid of machine learning
algorithms, the processing of loans gets faster and more efficient, saving losses as the incidence of
bad debts is reduced. The traditional system is slow compared to the speed and accuracy achievable
with the help of machine learning.

LITERATURE REVIEW

Almamun et al. (2022) adopted six different machine learning (ML) algorithms to predict if a loan
applicant is eligible. The ML algorithms include Random Forest, Adaboost, XGBoost, Decision Tree,
K-Nearest Neighbor and Lightgbm. The algorithms were trained with secondary data obtained from the
Kaggle website; the dataset contained 10,128 applicants, 23 attributes and 1 class attribute. The data
was preprocessed using missing value handling, feature extraction and categorical variable
transformation. The authors adopted the hold-out approach to validate the dataset, where 70% of the
data was used for training the algorithm and 30% was for testing. With this approach, the performance
of all six machine learning algorithms adopted for the work was evaluated under the metrics of
precision, accuracy, recall, F1-Score and Area Under Curve (AUC). Lightgbm recorded the highest
accuracy with a score of 0.9189, and decision tree had the lowest accuracy score of 0.8497. In
addition to the evaluation of the models in terms of accuracy, the models were also evaluated using
the AUC metric, and AUC graphs were produced for all six classification algorithms. Lightgbm
outperformed the other ML algorithms with an AUC score of 75%. Based on the result from the test data,
it was concluded that applicants with low credit scores should be denied access to loan facilities as
they have a high probability of defaulting. The results showed that applicants with high income
requesting small loan amounts were ideal applicants to be granted loans. Their study showed that data
features such as gender and marital status were not determining factors for the prediction output.

Wu, W. J. (2022) applied the random forest algorithm and the XGBoost algorithm to build prediction
models. The dataset was obtained from Imperial College London and contained a total of 105,471 records
and 778 features. The work employed the variance threshold method at the feature engineering stage,
where unimportant features were filtered out of the dataset. Variance inflation factor (VIF) was used
to measure multicollinearity of the dataset. The pre-processed dataset was randomly separated into an
80-20 proportion, where 80% was the training dataset and 20% was the test dataset. The model
demonstrated that though the random forest and the XGBoost algorithms
are decision tree algorithms, the random forest model recorded a prediction accuracy of 0.90657, while
XGBoost recorded 0.90635. The result indicated an insignificant difference in accuracy between the two
decision tree algorithms. The study was able to demonstrate that the random forest as well as the
XGBoost algorithm are suitable algorithms for loan default prediction.

Uwais & Khaleghzadeh (2022) implemented the machine learning (ML) algorithms present on the Spark Big
Data Platform to build loan default prediction models. The work applied six different supervised ML
classification algorithms to predict loan default: Decision Tree, Logistic Regression, Gradient
Boosted Tree, Random Forest, Linear Support Vector Machine, and Factorization Machine. A secondary
dataset was adopted from the Kaggle website; the dataset contained 640,000 instances and 14 features.
The dataset was randomly separated. Income was plotted against education using a scatter plot to
identify correlation between these two features of the dataset, using pandas and matplotlib plotting
functions of the Python language, available on Spark. A positive correlation was seen between
applicant’s educational level and income, because as the level of education increases, the income
increases. The work adopted several histograms to visualize the information from the dataset based on
minority and gender status. Data was pre-processed by removal of null values and adjustment of
attribute data types. The pre-processed data was further prepared using the steps of feature
selection, addressing the class imbalance problem, converting categorical data to numerical data, and
randomly splitting the data into 70% training and 30% test data. The six supervised ML algorithms
present on Spark MLlib were applied to the training data and used to train the models, while the test
data was used to evaluate the models. Of all six ML classifiers, the decision tree and random forest
demonstrated the best performance, with a receiver operating characteristic (ROC) curve score of
99.56%, recall of 99.2%, F-Score of 99.5%, and precision of 99.8%. The work demonstrated success in
classifying loan defaulters in one of the available two classes.

Huang et al. (2023) attempted to increase the percentage accuracy of predicting loan defaulters by
adopting the ensemble learning algorithm. The paper selected the Adaboost algorithm as the best
performing model for loan default prediction. The secondary dataset was from the credit platform
provided in a Tianchi competition. The dataset originally contained 1.2 million records and 47 data
features. However, considering the time factor in processing the huge dataset, a total of 100,000
records were randomly selected for the purpose of model building. The data was cleaned for missing
values and outliers, and the feature selection technique was adopted to separate relevant features
from irrelevant features. At the model construction stage, the initial values of the parameters and
the tuned values were tabulated in the work. The proposed model recorded an accuracy of 88%.

Li et al. (2021) aimed to improve prediction accuracy by using the blending method to fuse 3 models:
Random Forest (RF), CatBoost and Logistic Regression (LR). The blending method involved training a new
learner, and the model of the blending method was a two-layer framework. Loan data was obtained from a
lending club for Q4 2019, as made publicly available on the Kaggle website. The data contained 128,262
records and 150 attributes; however, over 40% of the data was removed as it was insignificant to the
study. The adaptive synthetic sampling approach (ADASYN) was adopted to address the class imbalance
problem of the dataset and the performance degradation it causes. The RF, CatBoost, and LR models
served as benchmarks for the proposed fused model. Validation metrics of accuracy, ROC curve, F1-score
and recall demonstrated that the fused model outperformed the other three individual models.

Odegua (2020) adopted Extreme Gradient Boosting (XGBoost) to build a predictive model to predict loan
defaulters. The dataset was obtained from Data Science Nigeria, hosted on the Zindi platform. The
dataset contained 26,897 records and 31 attributes, which underwent data pre-processing and wrangling
stages, before being
used for training with the XGBoost classifier algorithm. The system was implemented with the Python
programming language, and the classifier was trained on the cleaned dataset, using the good_bad_flag
feature as target. Five metrics (Recall, Accuracy, F1-Score, ROC value, and Precision) were used to
evaluate the model.

The literature review has shown several attempts made by researchers to improve the accuracy of
predicting loan defaulters automatically.

MATERIALS AND METHOD

The dataset used in the loan default prediction system was compiled by M. Yasser and uploaded to the
Kaggle Data Science community data repository. The data contained 148,670 records and 34 features.

The following are the processes used to build the loan default prediction system:

1) Data Loading;
2) Data Cleaning;
3) Data Processing;
4) Feature Extraction;
5) Model Training;
6) Model Evaluation.

1. Data Loading

The data was loaded into the Google Colab environment using the read_csv method from the pandas
library. It can be seen in figure 1.

Figure 1: Data loading using read_csv function
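
The loading step can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the inline CSV text, the column names and the helper name load_loans are all assumptions made so that the example runs without the Kaggle file.

```python
import io

import pandas as pd

# Tiny inline stand-in for the Kaggle CSV file.
# The column names and values are illustrative assumptions only.
SAMPLE_CSV = """loan_amount,income,credit_score,gender,status
116500,1740.0,758,Male,1
206500,4980.0,552,Female,1
406500,9480.0,834,Male,0
456500,,587,Female,0
"""


def load_loans(source):
    """Read loan records from a file path or file-like object (hypothetical helper)."""
    return pd.read_csv(source)


df = load_loans(io.StringIO(SAMPLE_CSV))
print(df.shape)         # number of rows and columns
print(df.isna().sum())  # missing values per column
```

In a Colab session the same helper would be called with the path of the uploaded CSV file instead of the in-memory sample.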


2. Data Cleaning

The data was cleaned of missing values and outliers to make it fit for training, using the
SimpleImputer class in the scikit-learn library, as seen in figure 2.

Figure 2: Data Cleaning
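
A minimal sketch of imputation with scikit-learn's SimpleImputer is given below. The paper does not state which strategies were used, so the median (numeric) and most-frequent (categorical) strategies, as well as the column names, are assumptions; outlier handling is not shown.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Illustrative frame with one missing value per column (assumed column names).
df = pd.DataFrame({
    "income": [1740.0, np.nan, 9480.0, 4980.0],
    "gender": ["Male", "Female", np.nan, "Female"],
})

num_cols = ["income"]
cat_cols = ["gender"]

# Assumed strategies: median for numeric data, most frequent value for categories.
num_imputer = SimpleImputer(strategy="median")
cat_imputer = SimpleImputer(strategy="most_frequent")

df[num_cols] = num_imputer.fit_transform(df[num_cols])
df[cat_cols] = cat_imputer.fit_transform(df[cat_cols])

print(df)
```

The median is a common choice for skewed financial fields such as income because it is less sensitive to extreme values than the mean.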



3. Data Processing
The data was processed to remove duplicate columns or features. One-hot encoding was done, as seen in
figure 3, to convert categorical columns into numerical columns filled with 0’s and 1’s, since the
random forest classifier used to train on the data cannot find patterns in categorical values.

Figure 3: Code for data processing
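
The two operations described in this step could look like the sketch below. It assumes pandas is used for both the duplicate-column removal and the encoding (the paper does not say which function performed the one-hot encoding), and the column names are illustrative.

```python
import pandas as pd

# Illustrative frame: one duplicated feature and one categorical column.
df = pd.DataFrame({
    "loan_amount": [116500, 206500, 406500],
    "loan_amount_copy": [116500, 206500, 406500],  # exact duplicate of loan_amount
    "gender": ["Male", "Female", "Male"],
})

# Drop columns whose values exactly duplicate an earlier column.
df = df.loc[:, ~df.T.duplicated()]

# One-hot encode the categorical column into 0/1 indicator columns.
encoded = pd.get_dummies(df, columns=["gender"], dtype=int)

print(list(encoded.columns))
```

Each category becomes its own indicator column, so a record with gender "Male" carries a 1 in gender_Male and a 0 in gender_Female.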


4. Feature Extraction

Some features were expunged in this phase since they had little or no effect on the target (label) or
they were duplicates. In this phase, a total of twenty-four features were dropped, and ten remained
as seen in figure 4.

Figure 4: Code for Feature Extraction
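
As a rough illustration of this step, the sketch below drops an identifier column and any constant column before separating the features from the label. The paper does not list which features were removed or how they were chosen, so the selection rule and all column names here are assumptions.

```python
import pandas as pd

# Illustrative frame (assumed column names and values).
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],                          # identifier, no predictive value
    "year": [2019, 2019, 2019, 2019],            # constant, carries no signal
    "credit_score": [758, 552, 834, 587],
    "income": [1740.0, 4980.0, 9480.0, 11880.0],
    "status": [1, 1, 0, 0],                      # target label
})

# Assumed rule: drop identifiers and columns with a single unique value.
constant_cols = [c for c in df.columns if df[c].nunique() <= 1]
to_drop = ["ID"] + constant_cols

X = df.drop(columns=to_drop + ["status"])  # retained features
y = df["status"]                           # target

print(list(X.columns))
```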


5. Model Training
In this phase the data was fed into the random forest classifier in the scikit-learn library in Python.
The model was trained using 70% of the dataset. The code snippet can be seen in figure 5.

Figure 5: Code for model training
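
A 70/30 hold-out split followed by random forest training can be sketched as below. Synthetic data stands in for the ten retained features, and the hyperparameters (100 trees, fixed random seeds, stratified split) are assumptions, since the paper does not report the settings used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the ten retained features and the binary target.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, random_state=0
)

# Hold out 30% of the records for testing; train on the remaining 70%.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Assumed settings: 100 trees and a fixed seed for reproducibility.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

print(model.score(X_test, y_test))  # accuracy on the held-out 30%
```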


6. Model Evaluation
The model was tested with the test dataset and evaluated using the area under the curve score, recall
and precision. The following evaluation scores were generated as demonstrated in figure 6.

Figure 6: Code for model evaluation
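
The evaluation metrics named above are available in scikit-learn's metrics module. The sketch below applies them to four made-up predictions rather than to the paper's test set; the labels, the scores and the 0.5 decision threshold are all illustrative assumptions, with label 1 treated as the positive class.

```python
from sklearn.metrics import (
    confusion_matrix,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Made-up ground truth and predicted probabilities for the positive class.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Assumed decision threshold of 0.5 to turn scores into class labels.
y_pred = [int(s >= 0.5) for s in y_score]

auc = roc_auc_score(y_true, y_score)   # uses the scores, not the labels
rec = recall_score(y_true, y_pred)     # TP / (TP + FN)
prec = precision_score(y_true, y_pred) # TP / (TP + FP)
cm = confusion_matrix(y_true, y_pred)  # rows: actual class, columns: predicted class

print(auc, rec, prec)
print(cm)
```

Note that the AUC is computed from the predicted scores, whereas recall, precision and the confusion matrix depend on the chosen threshold.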


RESULT DISCUSSION

Figure 7 shows the confusion matrix of the loan default prediction system. The number of true
positives was 32,664 observations, which implies that 32,664 instances that will not default were
predicted as such. The number of true negatives was 10,930 observations, meaning that 10,930 instances
that will default were correctly predicted as default. Table 1 shows some performance metrics of the
model, such as precision, recall, and F1_score. The F1_score is interpreted as the harmonic mean of
precision and recall, where an F1 score reaches its best value at 1 and its worst at 0. The relative
contributions of precision and recall to the F1 score are equal. The F1_score of 0.98 means that the
model is close to being optimal. Recall is the ratio of true positives to the sum of true positives
and false negatives. This is the ability of the classifier to classify all positive observations as
positive. The recall of the proposed system is 0.9965; that is, the model correctly identified about
99.6% of the positive observations. Precision is the ratio of true positives to the sum of true
positives and false positives. It represents the ability of this loan default classifier not to label
as non-default a sample that is default. The precision for the trained random forest model is 0.9742,
showing that about 97 out of every 100 instances the model labels as non-default are truly
non-default. The AUC score represents the area under the curve. It reflects how well a model predicts
the correct category a loan will fall into. The AUC score for the RF model was evaluated to be 0.9823.
This represents the ability of the loan default system to make accurate predictions, and gives an
additional indication of the quality of the predictions made by the model.
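
The metric definitions in this section can be checked with a few lines of arithmetic. In the sketch below the function names are hypothetical, the small confusion-matrix counts are made up for illustration, and only the precision (0.9742) and recall (0.9965) values are taken from the text.

```python
def precision(tp, fp):
    """Share of predicted positives that are truly positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of actual positives that the model identifies."""
    return tp / (tp + fn)


def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


# Illustrative counts only (not taken from the paper).
p_demo = precision(tp=90, fp=10)
r_demo = recall(tp=90, fn=30)
f1_demo = f1_score(p_demo, r_demo)

# Harmonic mean of the precision and recall values reported in the text.
f1_from_reported = f1_score(0.9742, 0.9965)

print(round(f1_demo, 4), round(f1_from_reported, 4))
```

With the reported precision and recall, the harmonic mean comes out at about 0.985, close to the F1_score of 0.98 quoted in the text.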

Figure 7: Confusion matrix of the model

CONCLUSION

The study was aimed at achieving a higher AUC score by adopting the random forest algorithm in
building the predictive model for predicting loan defaulters. Secondary data was sourced for the
research, preprocessed, and used to train the algorithm. The resultant model was evaluated using
performance metrics and the area under curve score. The results revealed that the predictive system
built with the random forest algorithm recorded high performance in both the accuracy metrics and the
AUC score. Further work can be done on creating a graphical user interface for the application, to
make the system more user-friendly.

REFERENCES

Almamun, M., Farjana, A., Mamun, M. (2022). Predicting Bank Loan Eligibility using Machine Learning
Models and Comparison Analysis, Proceedings of the 7th North American International Conference on
Industrial Engineering and Operations Management, Florida. 1423 – 1432.
Hamid, E. N. and Ahmad, N. (2011). A New Approach for Labeling the Class of Bank Credit Customers via
Classification Method in Data Mining, International Journal of Information and Education Technology,
1(2): 150-155.
Huang, Y., Shao, Y., Tang, D., Huang, J., and Chen, S. (2023). Loan Default Prediction Based on
Ensemble Learning, International Journal of Innovation and Research in Educational Sciences, 10(3):
149 – 159.
Hussain, A. B. and Shorouq, F. K. E. (2014). Credit risk assessment model for Jordanian commercial
banks: Neural scoring approach, Review of Development Finance, Elsevier, 4(10): 20–28.
Li, X., Ergu, D., Zhang, D., Qiu, D., Cai, Y. and Ma, B. (2021). Prediction of Loan Default Based on
Multi-model Fusion, Procedia Computer Science.
Odegua, R. (2020). Predicting Bank Loan Default with Extreme Gradient Boosting, Preprint Cornell
University.
Purohit, S. U., Mahadevan, V. and Kulkarni, A. N. (2011). Credit Evaluation Models of Loan Proposals
for Indian Banks, International Journal of Modelling and Optimization. 2(4): 529 – 534.
Twala, B. (2010). Multiple classifier Application to Credit Risk Assessment, Expert Systems with
Applications. 37(4): 3326–3336.
Uwais, A. M. and Khaleghzadeh, H. (2022). Loan Default Prediction using Spark Machine Learning
Algorithms, AIAI 29th Irish Conference on Artificial Intelligence and Cognitive Science, Dublin,
118-129.
Wu, W. J. (2022). Machine Learning Approaches to Predict Loan Default, Intelligent Information
Management. 14(3), 157-164.
