ABSTRACT
In the modern world, banks are crucial for providing financial support to individuals and
companies. Through the loan approval process, banks assess the borrower's creditworthiness to
make sure the loans they provide are secure. However, the traditional manual assessment-based
loan approval process is laborious and prone to human error. Thus, the development of an
automated and reliable loan approval system is required. The research paper's objective is to
evaluate various machine learning algorithms and identify the most effective algorithm for
determining whether or not to approve a bank loan for an individual. The study uses a dataset
from Kaggle to train various machine learning models, such as logistic regression, decision trees,
KNN, random forests, and neural networks. The performance of these models is then evaluated
using the criteria of accuracy, precision, recall, and F1 score to determine which loan prediction
method is the most effective. The proposed model compares favourably because it takes into account
the factors needed to accurately estimate the probability of loan default, such as customer
characteristics like tax liens, loan purpose, credit history, credit amount, and home ownership, in
addition to account information that discloses a customer's financial status. Our logistic regression
model performed better than the others, with an accuracy of 81.89%. To determine the most
pertinent factors that affect loan approval, a feature importance
analysis is also carried out. According to the research, machine learning models outperform
traditional statistical methods in loan prediction and can significantly increase accuracy.
Keywords: loan approval prediction, logistic regression, KNN, Neural Network, Random Forest,
Decision trees
INTRODUCTION
In the financial sector, approving bank loans is a crucial process that involves assessing the risk and
creditworthiness of loan applicants. The traditional loan approval process can be slow, biased, and
inaccurate because it relies on manual review of financial and personal data. The advancement of
machine learning (ML) and
artificial intelligence (AI) technologies has made it possible to automate and enhance the loan
approval process. In recent years, ML-based models have been highly successful in many industries,
including finance. ML algorithms trained on historical loan application data can identify patterns and
predict loan approval outcomes with high accuracy.
First, the dataset—which consists of 19 columns and 100,514 entries—is cleaned and pre-processed
to eliminate missing data and identify the key variables that are crucial for predicting the approval
status of a loan. The following criteria are the last set of factors that are used to forecast whether a
loan will be approved: the amount of the loan, its term, credit score, annual income, years of
employment, home ownership, purpose, monthly debt, number of open accounts, number of credit
problems, current credit balance, maximum amount of open credit, bankruptcies, tax liens, and loan
status.
The paper's primary goal is to evaluate five distinct algorithms and determine which is the most
accurate. In addition, it compares the effectiveness of each loan approval algorithm according to
predetermined metrics and parameters and identifies the advantages and disadvantages of each
algorithm.
LITERATURE SURVEY
To improve loan approval systems, machine learning and artificial intelligence (AI) techniques have
become more and more common in recent years. These methods can greatly improve credit scoring
and risk assessment models, according to the research. A concern with these approaches is the
potential ethical and legal fallout, especially in light of issues like bias and discrimination. However, a
system that predicts loan approval is advantageous since it will boost bank productivity and save
applicants' time, as they no longer need to wait five to seven business days for their loan to be
approved.
A machine learning method for predicting loan approval using the logistic regression algorithm has
been proposed by Mohammad Ahmad Sheikh and his co-authors (2020) [3]. The data used in this
approach was gathered from Kaggle. The model's conclusions imply that a bank should take into
account additional customer attributes that are critical for making credit decisions and identifying
loan defaulters in addition to lending to wealthy clients. Furthermore, it implied that applicants with
the lowest credit scores would not be approved for loans because they are more likely to default on
the loan. The model achieved an accuracy of 81.1%. A Customer Loan Approval
Prediction Using Logistic Regression was also proposed by Mahankali Gopinath, K. Srinivas Shankar
Maheep, and R. Sethuraman (2021) [2] in a related study. The purpose of the paper is to identify the
qualified applicants so that the bank can contact them and offer loans to those who can repay them
within the allotted time. While we have used 15 features to predict loan approval, they have only
used 12 attributes. The accuracy provided by the model was 80.945%. A.U. Farouk, A. Abdulkadir,
and A.Y. Abarshi (2022) [4] provided an enhanced performance evaluation of the classification
algorithm for loan approval prediction in a different study. The purpose of the study was to assess
the effectiveness of five classification models—Naïve Bayes, Random Forest algorithms, support
vector machines, logistic regression, and decision trees—that are used to predict loan approval.
The dataset used contains 689 data instances and 12 variables. In addition, R software v4.1.1 was
used to carry out all analyses for the research project. The study
concluded that Naïve Bayes with six predictors had an overall accuracy of 83.2% and an AUC of
79.2%. Second place went to six-predictor logistic regression, which had an AUC of 73.7% and an
overall accuracy of 81.6%. The results also showed accuracies of 81.7%, 82.1%, and 99.7% for
decision trees, SVM, and random forests, respectively. Because of the AUC and the
quantity of features in the model, logistic regression is ranked as the second-best method even
though Random Forest has a higher accuracy rate. The study found that Naïve Bayes is the best
algorithm for predicting loan approval based on the evaluation metric used.
In their research, Sharayu Dosalwar, Ketki Kinkar, Rahul Sannat, and Dr. Nitin Pise (2021) [5]
suggested utilizing machine learning techniques to analyse loan availability. The goal of the paper
was to forecast loan defaulters to lower the bank's non-performing assets. They compared several
algorithms for this, including XGBoost Classifier, Naive Bayes, K Neighbours Classifier, Random Forest
Classifier, Support Vector Machine, and Logistic Regression. With an accuracy of 78.5%, logistic
regression performed better than the other models. In their proposed modern approach to loan
sanctioning in banks, Golak Bihari Rath, Debasish Das, and Biswa Ranjan Acharya (2021) [6] tested
and compared various machine learning algorithms, including logistic regression, decision trees, and
SVM. To lower risk and human error in the loan sanction process and decide whether or not an
applicant is eligible for loan approval, they use a machine learning approach. Since logistic regression
performed better than the other models, it was thought that it could be applied as a predictive
model to forecast the loan applicants' future payment patterns. The model's accuracy was close to
79%.
In a different study, Mehul Madann (2021) [7] compared the use of decision trees and random
forests for loan default prediction. The study's objective was to predict, through the evaluation of
specific attributes, whether an individual should be granted a loan, thereby assisting the banking
authorities in their process of choosing qualified applicants from a pre-compiled list. This paper
suggests two machine learning models that use certain attributes to predict whether or not a person
should be granted a loan. The two algorithms—Random Forest and Decision Trees—are thoroughly
and comparatively analyzed in this work. The findings showed that the Random Forest algorithm
performed significantly more accurately than the Decision Tree algorithm: 80% of decisions were
made correctly with Random Forest and 73% with Decision Trees.
A comparison between logistic regression and binary trees for predicting applicants' loan status was
suggested in a different study by T. Sunitha, M. Chandravallika, M. Ranganayak, G. Suma Sri, T.V.S.
Jagadeesh and A. Tejaswi (2020) [8]. Reducing the time and effort required by bank employees to
separate loan applications based on their status was the driving force behind the study. Both
accuracy and precision score are given more weight in their model. The lender's risk factor is reduced
by a higher Precision Score. To execute this, the credit score factor was fed with data that has a
greater influence on the loan status and reduces ambiguity because it only has two unique values,
{0,1}, where {0} denotes a lower credit score and {1} denotes a higher credit score. Although the
accuracy of both models was 84%, the logistic regression model was selected due to the
characteristics of the output variable. In addition, this model was successful in lowering the risk
factor by producing fewer false predictions. Ultimately, these tests show that machine learning
methods, such as logistic regression, decision trees, random forests, and support vector machines,
can effectively forecast loan approval decisions. How well these methods work may depend on the
particular dataset and model features used.
PROPOSED METHODOLOGY
The suggested methodology attempts to construct an effective and precise loan approval prediction
system using machine learning techniques, which can aid banks and financial institutions in decision-
making. The steps taken are explained below.
1) Data Collection
The gathering of information on loan applications is the first step in the procedure. The dataset is
taken from Kaggle and contains 100,514 entries and 19 columns. Some of the attributes are the
Number of Open Accounts, Number of Credit Problems, Current Credit Balance, Maximum Open
Credit, etc.
2) Data Preprocessing
Data must be cleaned, processed, and organized after it has been gathered. In this step, we remove
any redundant or unnecessary information, fill in missing values, and transform categorical
information into numerical values.
3) Feature Selection
The following step involves choosing the dataset's most pertinent and crucial features using methods
like feature ranking and correlation analysis. This process aids in cutting down on the number of
features, improving the effectiveness of the prediction model.
4) Model Selection
After comparing different machine learning techniques, such as Decision Tree, Random Forest, Neural
Networks, KNN, and Logistic Regression, we choose the best algorithm to develop the prediction
model based on the chosen features.
5) Model Training
A portion of the dataset is used to train the model, and the remaining data is used for testing and
validation.
6) Model Evaluation
The trained model is then assessed using various metrics, including F1 score, recall, accuracy, and
precision. The evaluation assists in determining how well the model predicts loan acceptance.
Entries from 100,514 clients who have obtained bank loans are included in the dataset. In addition to
using a correlation matrix, we have graphed each feature of the data to make it easier to understand.
The dataset is further separated into training and testing subsets. The division ratio is 80:20,
meaning that 80 percent of the data is used for training and 20 percent for testing; this allows the
accuracy of each model to be computed.
1) Correlation Matrix
A correlation matrix quantifies the relationship between two variables and indicates whether it is
proportional (coefficient greater than 0.5), inversely proportional (coefficient smaller than -0.5), or
negligible (coefficient close to zero). Knowing how the columns relate to one another helps us build a
better model, because including less significant columns can skew the results or otherwise harm the model.
The correlation coefficient is calculated as

r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}

where
r = correlation coefficient,
x_i = values of the x-variable in a sample,
\bar{x} = mean of the values of the x-variable,
y_i = values of the y-variable in a sample,
\bar{y} = mean of the values of the y-variable.
Below is the correlation graph that we plotted to understand our data better. From the graph we
found that bankruptcies have a high correlation with the number of credit problems, likely because
an individual's bank account is frozen soon after filing for bankruptcy. Tax liens also have a high
correlation with credit problems. Moreover, we found that the current credit balance, annual income,
and the number of open accounts have a high correlation with monthly debt.
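As a minimal sketch, the correlation matrix described above can be reproduced with pandas; the file name and column labels below are assumptions based on the attribute list and may differ slightly in the actual Kaggle dataset.

```python
import pandas as pd

# Load the loan dataset (file name assumed; adjust to the actual Kaggle CSV).
df = pd.read_csv("credit_train.csv")

# Pearson correlation over the numeric columns only.
corr = df.select_dtypes("number").corr()

# Inspect the pairs discussed above (column labels assumed).
print(corr.loc["Bankruptcies", "Number of Credit Problems"])
print(corr.loc["Tax Liens", "Number of Credit Problems"])
print(corr.loc["Monthly Debt",
               ["Current Credit Balance", "Annual Income", "Number of Open Accounts"]])
```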
DATA PRE-PROCESSING
1) Missing data
During data pre-processing, we first removed unnecessary columns, such as the loan and customer
IDs, that were not required to forecast loan approval. We then looked for missing data and created a
table showing the percentage of missing data for each feature. We discovered that the "Months since
last delinquent" column had 54% missing data and that the credit score and annual income columns
each had 19% missing data. We eliminated the columns that had more than 50% missing data, which
removed the "Months since last delinquent" column.
Next, using a threshold of eight, we eliminated the rows with more than two missing values; if there
are too many missing values in a single row, filling them all could corrupt our model. In addition, we
removed the last 514 rows because they were entirely NaN. After these steps, the percentage of
missing data per column ranges from nearly zero to less than 19%.
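A sketch of this cleaning step with pandas is given below; the file name and identifier column labels are assumptions, and the 50% and threshold-of-eight rules follow the description above.

```python
import pandas as pd

df = pd.read_csv("credit_train.csv")               # file name assumed
df = df.drop(columns=["Loan ID", "Customer ID"])   # identifier columns (labels assumed)

# Percentage of missing values per feature.
missing_pct = df.isna().mean().sort_values(ascending=False) * 100
print(missing_pct)

# Drop any column with more than 50% missing data
# (this removes "Months since last delinquent").
df = df.drop(columns=missing_pct[missing_pct > 50].index)

# Remove the trailing all-NaN rows, then rows with too few non-missing values
# (thresh = minimum number of non-missing entries a row must have to be kept).
df = df.dropna(how="all")
df = df.dropna(thresh=8)
```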
2) Filling Missing Values
To fill the missing values, we first converted all categorical columns to numerical values and then
examined the central tendency of each column that had missing values. For some columns the mean
was the better choice, for others the median, and for others the mode, so we filled the missing
values with the mean, median, or mode accordingly.
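A hedged sketch of this imputation, continuing from the cleaned frame df above; the column names and the specific mean/median/mode choices are illustrative assumptions, not the exact mapping used in the paper.

```python
# Convert categorical columns to numeric codes (column names assumed).
for col in ["Term", "Home Ownership", "Purpose", "Years in current job"]:
    df[col] = df[col].astype("category").cat.codes

# Fill remaining gaps with a central tendency suited to each column
# (the choices below are illustrative).
df["Credit Score"] = df["Credit Score"].fillna(df["Credit Score"].median())
df["Annual Income"] = df["Annual Income"].fillna(df["Annual Income"].mean())
df["Bankruptcies"] = df["Bankruptcies"].fillna(df["Bankruptcies"].mode()[0])
```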
3) Outliers
To detect outliers, we used the interquartile range (IQR) method: we multiply the IQR by 1.5, add
this value to the third quartile, and subtract it from the first quartile. Any value above the resulting
upper bound or below the lower bound is a suspected outlier. To handle them, we deleted the
detected outliers and used the Robust Scaler, as it reduces the effect of any remaining outliers.
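The rule can be sketched as follows, again continuing from df; the 1.5 multiplier matches the description above.

```python
import numpy as np

# Bounds: [Q1 - 1.5*IQR, Q3 + 1.5*IQR] per numeric column.
num = df.select_dtypes(include=np.number)
q1, q3 = num.quantile(0.25), num.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only rows with no suspected outlier in any numeric column.
df = df[~((num < lower) | (num > upper)).any(axis=1)]
```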
4) Balancing Data
To make sure the data is balanced, we used SMOTE (Synthetic Minority Over-sampling Technique)
together with scaling techniques.
i) SMOTE:
Synthetic Minority Over-sampling Technique, or SMOTE, is used when there are imbalanced
classification problems, or when one class has a disproportionately smaller number of examples than
the other class. When a bank loan approval prediction system finds itself in a scenario where the
number of loans approved exceeds the number of loans rejected, SMOTE can be used to mitigate the
problem of class imbalance.
By interpolating between real examples of the minority class, the SMOTE method generates artificial
examples of the minority class. The method first selects at random an example from a minority class,
after which it finds its k-nearest neighbours in feature space. Next, new examples are generated by
randomly selecting one of the closest neighbours and interpolating between the two instances. Until
the ideal balance between the two classes is reached, this process is repeated.
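A minimal, self-contained sketch of SMOTE with the imbalanced-learn library is given below; the toy data generated here merely stands in for the pre-processed loan features and status labels.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Toy imbalanced data standing in for the loan features (X) and status labels (y).
X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=42)

# SMOTE interpolates between each minority sample and one of its
# k nearest minority neighbours (k_neighbors=5 by default).
smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X, y)

print(np.bincount(y), np.bincount(y_res))  # class counts before and after
```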
ii) Scaling:
In order to guarantee the accuracy and efficiency of the predictive machine learning algorithms,
scaling is an essential step in the development of a bank loan approval system. Scaling is the process
of modifying data so that it falls into a specific range or scale. This is important because different
machine learning techniques perform differently depending on the size of the features. To lessen the
impact of outliers, we have utilised the robust scaler from the sklearn library.
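A small sketch of the robust scaler on made-up numbers, included only to show how an extreme income value has limited influence on the scaled features.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Illustrative [annual income, credit score] rows; the last income is extreme.
X_demo = np.array([[45000.0, 690.0],
                   [52000.0, 710.0],
                   [61000.0, 700.0],
                   [1_000_000.0, 720.0]])

# RobustScaler centres each feature on its median and divides by its IQR,
# so the outlier shifts the scaled values far less than standardisation would.
scaler = RobustScaler()
print(scaler.fit_transform(X_demo))
```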
MODEL SELECTION
After pre-processing the data and selecting the features required for prediction, we now split our
dataset in a 70:30 ratio, that is, 70% of the data is used for training and 30% is used for testing.
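A sketch of this split with scikit-learn; X_scaled and y_res are assumed to be the scaled, balanced features and labels produced by the earlier pre-processing sketches.

```python
from sklearn.model_selection import train_test_split

# 70:30 split, stratified so both sets keep the same approve/reject proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y_res, test_size=0.3, random_state=42, stratify=y_res
)
```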
The algorithms used for prediction are as follows:
1) Logistic Regression
Logistic regression is the first algorithm we used to forecast loan approval. It is a statistical technique
used to resolve problems involving binary classification, such as the approval of bank loans. It
predicts the chance that an event (in this case, loan default) will transpire based on multiple
independent variables (credit score, income, etc.). A probability score between 0 and 1 is the result
of logistic regression, and a threshold value can be used to transform this score into a binary
prediction. Logistic regression is a widely utilised technique in bank loan approval systems because of
its ease of use, interpretability, and effectiveness when dealing with binary classification problems.
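A minimal sketch of fitting this model with scikit-learn, continuing from the split sketch above; the iteration limit is an illustrative setting, not a value taken from the paper.

```python
from sklearn.linear_model import LogisticRegression

# Fit the classifier and turn predicted probabilities into approve/reject
# decisions using the default 0.5 threshold.
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)
y_pred_lr = log_reg.predict(X_test)
y_proba_lr = log_reg.predict_proba(X_test)[:, 1]  # probability of the positive class
```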
2) K-Nearest Neighbors (KNN)
KNN is a simple yet effective technique for bank loan approval systems; however, the choice of K and
the quality of the input data may have a significant impact on the system's performance. The method
finds the class that most of the K neighbours are in by comparing a new input data point to the K
nearest data points. KNN is widely utilised as a foundational algorithm for comparison with more
complex ones. Pre-processing is required to remove or decrease the influence of noise and irrelevant
features in the input data because it can be sensitive to them.
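A sketch of the KNN classifier under the same assumptions as above; K=5 is illustrative and should be tuned, as noted.

```python
from sklearn.neighbors import KNeighborsClassifier

# Classify each test applicant by majority vote among its K nearest training points.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)
```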
3) Decision Tree
Using a recursive process, the Decision Tree algorithm separates the data into homogenous subsets
according to the most informative feature. Each decision node represents a feature, and each leaf
node of the resulting tree-like structure represents a class label. Because
it is easy to comprehend and can be visualised, the algorithm is a useful tool for decision-making.
Decision trees are a flexible technique that can handle multiple data sources, is easy to understand,
and can be used for both regression and classification problems in bank loan approval systems.
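A sketch of the decision tree classifier, continuing from the split above; the depth limit is an illustrative assumption to keep the tree readable.

```python
from sklearn.tree import DecisionTreeClassifier

# A depth limit keeps the tree interpretable and curbs overfitting.
tree = DecisionTreeClassifier(max_depth=6, random_state=42)
tree.fit(X_train, y_train)
y_pred_tree = tree.predict(X_test)
```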
4) Random Forest
Several Decision Trees in Random Forest are trained with bootstrapped data samples, and the
majority vote or the average of each tree's predictions is used to make the final forecast. The method
is easy to apply and can handle large datasets. In addition to supporting categorical and numerical
data, Random Forest can also handle binary and multi-class classification problems. Random Forest is
a popular technique in bank loan approval systems because of its high accuracy, robustness, and
ability to handle various types of data. However, Random Forest may be more difficult to understand
than individual Decision Trees due to its ensemble nature.
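A sketch of the random forest classifier under the same assumptions; the number of trees is illustrative.

```python
from sklearn.ensemble import RandomForestClassifier

# An ensemble of trees trained on bootstrapped samples; predictions are
# combined by majority vote across the trees.
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)
y_pred_forest = forest.predict(X_test)
```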
5) Neural Network
Neural networks consist of multiple layers of interconnected nodes, or
neurons, that learn over time how to map input data to output labels. Neural networks are capable
of handling intricate and non-linear interactions between the input variables and the output labels.
Neural networks are a popular technique in bank loan approval systems because of their high accuracy
and ability to handle complex data. Because neural networks can be difficult to understand,
additional safety measures may be required to ensure the model's transparency and fairness.
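A sketch of a small feed-forward network using scikit-learn's MLPClassifier, continuing from the split above; the layer sizes and iteration budget are illustrative assumptions and do not describe the architecture used in the paper.

```python
from sklearn.neural_network import MLPClassifier

# Two hidden layers of interconnected neurons mapping the inputs to the labels.
mlp = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=42)
mlp.fit(X_train, y_train)
y_pred_mlp = mlp.predict(X_test)
```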
The accuracy, F1 score, precision, and recall values of these various algorithms are further assessed
and covered in detail in the following section.
MODEL EVALUATION AND COMPARISON
The process of evaluating a machine learning model's performance is known as model evaluation. It
is crucial to assess models to ascertain their correctness and efficacy. Depending on the kind of
problem being handled, several evaluation measures might be applied.
Metrics like accuracy, precision, recall and F1 score can be utilized to solve classification problems
whereas metrics like mean squared error, root mean squared error, mean absolute error, and R-
squared can be applied to regression issues. Testing, adjusting, and improving the model until it
performs well is all part of the iterative process of model evaluation.
Since we are dealing with a classification problem, we will be evaluating our model based on
accuracy, precision, recall and F1 score.
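These metrics can be computed with scikit-learn as sketched below, assuming the prediction arrays produced by the model sketches above; the loop simply prints the four scores for each model on the held-out test set.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

models = [("Logistic Regression", y_pred_lr), ("KNN", y_pred_knn),
          ("Decision Tree", y_pred_tree), ("Random Forest", y_pred_forest),
          ("Neural Network", y_pred_mlp)]

for name, y_hat in models:
    print(name,
          accuracy_score(y_test, y_hat),
          precision_score(y_test, y_hat),
          recall_score(y_test, y_hat),
          f1_score(y_test, y_hat))
```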
After comparing the five algorithms, it is seen that Logistic Regression outperforms the other models
with an accuracy of 81.89%. Hence, we have used the logistic regression model for our loan approval
prediction application.
CONCLUSION AND FUTURE WORK
In summary, we evaluated and compared the performance of five distinct algorithms: Neural
Network, KNN, Random Forest, Decision Tree, and Logistic Regression. Based on our data, we
discovered that the Logistic Regression method yielded the best results in terms of accuracy and F1
score, making it the best choice for this application. Our classifier model is superior to the majority of
existing models, whose accuracies fall in the range of 70-80%; our model achieved an accuracy of
81.89% and an F1 score of 0.89. Furthermore, we have included more parameters than in previous papers; these
parameters cover the customer's personal information as well as financial information, such as home
ownership, tax liens, and purpose. It is important to keep in mind that the choice of algorithm may
vary based on the project's requirements and the available data. All things considered, our analysis
demonstrates how important it is to appropriately select and evaluate multiple algorithms to
maximise the performance of a machine learning application. Because of its high accuracy, F1 score,
and precision score, we have decided to use the Logistic Regression model to predict whether or not
the applicants will be approved for bank loans.