Machine Learning Techniques for Predicting Credit Approvals

Prawar Mundra
2018IMG-037
Introduction

● The accurate assessment of consumer credit risk is of utmost importance for lending organizations.

● Credit scoring is a widely used technique that helps financial institutions evaluate the likelihood that a credit applicant will default on a financial obligation and decide whether or not to grant credit.

● The credit industry has experienced tremendous growth in the past few decades. The increased number of potential applicants has spurred the development of sophisticated techniques that automate the credit approval procedure and supervise the financial health of the borrower.
● In the last few decades, various quantitative methods have been proposed in the literature to evaluate consumer loans and improve credit scoring accuracy (for a review, see e.g. Crook et al., 2007).
● The goal of a credit scoring model is to classify credit applicants into two classes: the “good credit” class, which is likely to repay the financial obligation, and the “bad credit” class, which should be denied credit due to the high probability of defaulting on the financial obligation.
● The classification is contingent on the sociodemographic characteristics of the borrower (such as age, education level, occupation, and income), the repayment performance on previous loans, and the type of loan.
● This paper proposes a credit scoring model for consumer loans based on several analytical models.
Objective

● The aim of this paper is to build a machine learning model that can be used to aid credit card approval decisions, using data from the UCI (University of California, Irvine) Machine Learning Repository.

This analysis is organized as follows:


1. Generate several data visualizations to understand the underlying data;

2. Perform data transformations as needed;

3. Develop research questions about the data; and

4. Generate and apply the model to answer the research questions.


Exploratory Data Analysis

In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets in order to summarize their main characteristics, often with visual methods. A statistical model may or may not be used, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis-testing task.

EDA tackles specific tasks such as:

● Spotting mistakes and missing data
● Mapping out the underlying structure of the data
● Identifying the most important variables
● Listing anomalies and outliers
● Testing a hypothesis / checking assumptions related to a specific model

● Clustering and dimension-reduction techniques, which help you create graphical displays of high-dimensional data containing many variables
● Univariate visualization of each field in the raw dataset, with summary statistics
● Bivariate visualizations and summary statistics that allow you to assess the relationship between each variable in the dataset and the target variable you’re looking at
● Multivariate visualizations, for mapping and understanding interactions between different fields in the data
● K-means clustering (creating “centres” for each cluster, based on the nearest mean)
● Predictive models, e.g. linear regression
Dataset and codebook for credit approval data
● The first step in any analysis is to obtain the dataset and codebook. Both the dataset and the
codebook can be downloaded for free from the UCI website.
● Once the dataset is loaded, we’ll use the str() function to quickly understand the types of data in the dataset, as sketched below.
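A minimal sketch of both steps in R, assuming the standard crx.data file; the descriptive column names are an assumption (the raw file only labels the columns A1-A16), chosen to match the variables referenced later in this deck:

    # Read the raw UCI credit approval file; it ships without a header row
    url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data"
    credit <- read.csv(url, header = FALSE, stringsAsFactors = FALSE)

    # Assumed descriptive names; the codebook labels these A1 through A16
    names(credit) <- c("Male", "Age", "Debt", "Married", "BankCustomer",
                       "EducationLevel", "Ethnicity", "YearsEmployed",
                       "PriorDefault", "Employed", "CreditScore",
                       "DriversLicense", "Citizen", "ZipCode", "Income",
                       "Approved")

    # str() prints each column's type and a preview of its values
    str(credit)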
Data Transformations
The binary values, such as Approved, need to be converted to 1s and 0s. We’ll also need to do additional transformations, such as filling in missing values. That process begins by first identifying which values are missing and then determining the best way to address them: we can remove them, zero them out, or estimate a plug value. A scan through the dataset shows that missing values are labeled with ‘?’. For each variable, we’ll convert these to NA, which R interprets differently from an ordinary character value.
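A sketch of these transformations, assuming the data frame loaded above (the "+"/"-" coding of the raw Approved column comes from the UCI codebook):

    # Recode the binary target: "+" (approved) becomes 1, "-" (denied) becomes 0
    credit$Approved <- ifelse(credit$Approved == "+", 1, 0)

    # Replace every "?" placeholder with NA so R treats it as missing
    credit[credit == "?"] <- NA

    # Age was read as character because of the "?" entries; make it numeric
    # (the same conversion applies to any other column read as character)
    credit$Age <- as.numeric(credit$Age)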

Continuous Values:

We will use the summary() function to see descriptive statistics of the numeric values, such as the min, max, mean, and median. The range is the difference between the minimum and maximum values and can be calculated from the summary() output. For the B (Age) variable, the range is 66.5 and the standard deviation is 11.9667.
Missing Values:

One method is to check the relationships among the numeric values and use a linear regression to fill them in. The table below shows the correlations between all of the variables. The largest value in the first row is 0.396, meaning Age is most closely correlated with YearsEmployed. Similarly, Debt is most closely correlated with YearsEmployed.
We can use this information to create a linear regression model between the two variables. The model produces the two coefficients below: the intercept and the YearsEmployed coefficient. These coefficients are used to predict the missing values: the YearsEmployed coefficient is multiplied by the value of YearsEmployed, and the intercept is added.
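A sketch of this imputation, assuming the numeric column names used above; lm() silently drops the rows with a missing Age when fitting, and predict() then fills exactly those rows from their YearsEmployed values:

    # Pairwise correlations among the numeric variables, ignoring NAs
    num_cols <- c("Age", "Debt", "YearsEmployed", "CreditScore", "Income")
    cor(credit[, num_cols], use = "pairwise.complete.obs")

    # Regress Age on its most correlated partner, then plug in fitted values
    fit <- lm(Age ~ YearsEmployed, data = credit)
    missing_age <- is.na(credit$Age)
    credit$Age[missing_age] <- predict(fit, newdata = credit[missing_age, ])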
Descriptive Statistics
Descriptive statistics are used to describe the basic features of the data in a study. They provide
simple summaries about the sample and the measures. Together with simple graphics analysis, they
form the basis of virtually every quantitative analysis of data.
● First, we take the mean and standard deviation calculated above.
● Then we subtract the mean from each value and divide by the standard deviation. The end result is the z-score.
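A sketch of that standardization for Age; the AgeNorm name matches the variable that reappears in the logistic regression summary later in this deck:

    # z-score: subtract the mean, then divide by the standard deviation
    # (Age has no NAs left after the imputation above)
    age_mean <- mean(credit$Age)
    age_sd   <- sd(credit$Age)
    credit$AgeNorm <- (credit$Age - age_mean) / age_sd

    summary(credit$AgeNorm)   # now centered near 0 with unit variance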

We did similar transformations on the other continuous variables and then plotted them.
Categorical Variables (Association Rules)

A categorical variable is a discrete variable that captures qualitative outcomes by placing observations into fixed groups (or levels).

The data is distributed across the factors ‘1’ and ‘0’, plus 12 of the values are missing. Again, missing values will not work well in classifier models, so we’ll need to fill them in. The simplest way to do so is to use the most common value. For example, since the ‘0’ factor is the most common, we could replace all missing values with ‘0’, as sketched below.
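A sketch of that mode imputation, assuming Male is the 1/0-coded column with the 12 missing entries:

    # Count the occurrences of each level; table() excludes NAs by default
    tab <- table(credit$Male)
    mode_level <- names(tab)[which.max(tab)]   # "0" in the deck's recoding

    # Fill every missing value with the most common level
    credit$Male[is.na(credit$Male)] <- mode_level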

Generate Analytic Models


In order to prepare and apply a model to this dataset, we’ll first have to break it into two subsets. The first will be the training set, on which we will develop the model. The second will be the test set, which we will use to measure the accuracy of our model. We will allocate 75% of the items to the training set and 25% to the test set, as sketched below.
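A sketch of the split; the seed is an assumption, added only so the partition is reproducible, and the character columns are converted to factors so the models below treat them as categorical:

    # Convert remaining character columns to factors for modeling
    credit[] <- lapply(credit, function(x) if (is.character(x)) factor(x) else x)

    set.seed(123)                                   # assumed seed
    n <- nrow(credit)
    train_idx <- sample(n, size = floor(0.75 * n))  # 517 of the 690 rows
    train <- credit[train_idx, ]                    # 75% for model development
    test  <- credit[-train_idx, ]                   # 25% held out for testing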
Baseline
There are 517 applications in the training set, 287 (56%) of which were denied. Since more applications were denied than approved, our baseline model will predict that all applications are denied. This simple model would be correct 56% of the time, so our models have to be more accurate than 56% to add value to the business.
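The baseline can be read directly off the training labels:

    table(train$Approved)       # the deck reports 287 denied (0) vs. 230 approved (1)
    mean(train$Approved == 0)   # about 0.56: the accuracy bar any model must beat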

Logistic Regression - Create the Model


Regression models are useful for predicting continuous (numeric) variables. However, the target variable Approved is binary and can only take the values 1 or 0. We could use linear regression to predict the approval decision with a threshold, assigning anything below it to 0 and anything above it to 1. Unfortunately, the predicted values could fall well outside the expected 0-to-1 range, so linear or multivariate regression will not be effective for predicting these values. Instead, logistic regression is more useful because it produces the probability that the target value is 1. Probabilities are always between 0 and 1, so the output matches the target value range much more closely than linear regression does.
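A minimal sketch of the fit; the predictor list is an assumption (the deck does not spell out its exact formula), and 0.5 is the conventional classification threshold:

    # family = binomial makes glm() fit a logistic regression for P(Approved = 1)
    logit_fit <- glm(Approved ~ AgeNorm + Debt + YearsEmployed + PriorDefault +
                       Employed + CreditScore + Income,
                     data = train, family = binomial)
    summary(logit_fit)   # coefficients, p-values, and significance stars

    # Turn predicted probabilities into 1/0 classes at a 0.5 threshold
    train_prob <- predict(logit_fit, type = "response")
    train_pred <- ifelse(train_prob > 0.5, 1, 0)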
The model summary shows the p-values for each coefficient. Alongside these coefficients, the summary gives R’s usual at-a-glance scale of asterisks for significance.

Using this scale, we can see that the coefficients for AgeNorm and Debt3 are not significant. We could likely simplify the model by removing these two variables and get nearly the same accuracy.

The confusion matrix shows the distribution of actual values and predicted values.

Of the 517 observations, the model correctly predicted 398 approval decisions (249 + 149), or about 77% accuracy.
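That matrix and accuracy come straight out of table():

    # Rows are the actual classes, columns are the model's predictions
    conf <- table(actual = train$Approved, predicted = train_pred)
    conf

    sum(diag(conf)) / sum(conf)   # correct / total; about 0.77 in the deck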
Classification and Regression Tree - Create the Model
Classification and Regression Trees (CART) can be used for purposes similar to logistic regression: both can classify the items in a dataset into a binary class attribute. Trees work by splitting the dataset at a series of nodes that eventually segregate the data into the target classes. These models are sometimes referred to as decision trees because, at each node, the model determines which path an item should take. They have an advantage over logistic regression models in that the splits, or decisions, are more easily interpreted than a collection of numerical coefficients and log-odds scores.
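A minimal sketch using the rpart package, R's standard CART implementation; the formula mirrors the logistic model above and is likewise an assumption:

    library(rpart)

    # method = "class" grows a classification tree for the binary target
    cart_fit <- rpart(factor(Approved) ~ AgeNorm + Debt + YearsEmployed +
                        PriorDefault + Employed + CreditScore + Income,
                      data = train, method = "class")

    # Training-set predictions and the resulting confusion matrix
    cart_pred <- predict(cart_fit, train, type = "class")
    cart_conf <- table(actual = train$Approved, predicted = cart_pred)
    sum(diag(cart_conf)) / sum(cart_conf)   # about 0.861 in the deck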

The confusion matrix resulting from this CART model shows that we correctly classified 231 denied credit applications and 214 approved applications. The accuracy score for this model is 86.1%, which is better than the 77% accuracy the logistic regression model scored and significantly better than the baseline model.
Apply the Model
We’ll now apply our classifier model to the test dataset and determine how effective it is. Our confusion matrix shows that 144 items were correctly predicted, for 83% accuracy. We can see that this model is both more effective and easier to interpret than the logistic regression model.
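Scoring the held-out set follows the same pattern:

    # Predict on the unseen test rows and compare against the true labels
    test_pred <- predict(cart_fit, test, type = "class")
    test_conf <- table(actual = test$Approved, predicted = test_pred)
    test_conf

    sum(diag(test_conf)) / sum(test_conf)   # about 0.83 in the deck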

Conclusion and Future Enhancements


In this paper, data preprocessing and transformation techniques were applied and results were generated by implementing analytical models. The performance was analyzed using the confusion matrix. We can also use this model to make detailed testing selections: any credit application that does not have the same outcome as predicted by the model is a potential audit exception. The inherent risk is that a credit card was issued to someone who should have been denied; such an account is more likely to default than a properly approved account, which, in turn, exposes the company to loss. As a future enhancement, different machine learning models can be explored to further improve the prediction accuracy.
