Credit Risk Analysis Using EDA Techniques
Presented by:
Shakti Singh
Introduction to Exploratory Data Analysis
• “While fraud reduction is a common goal for banks and financial institutions, analytics
can be used to manage risk instead of simply detecting fraud” (Janaha, Data
Analytics in banking and Financial Services 2023).
• “Analytics can be used to identify and rate individual customers who are at risk of
fraud and then apply different levels of monitoring and verification to those accounts.
Analyzing the risk of the accounts allows banks and financial institutions to know
what to prioritize in their fraud detection efforts” (Janaha, Data Analytics in banking
and Financial Services 2023).
• “With the rise of computing power and new analytical techniques, banks can now
extract deeper and more valuable insights from their ever-growing mountains of
data.” Moreover, “The recent dramatic increases in computing power have allowed
banks to deploy advanced analytical techniques at an industrial scale” (Dash et al.,
Risk analytics enters its prime 2017).
Introduction to Exploratory Data Analysis
• “The loan providing companies find it hard to give loans to the people due to their
insufficient or non-existent credit history. Because of that, some consumers use it to their
advantage by becoming a defaulter” (upGrad, Credit EDA Assignment 2023).
• “When the company receives a loan application, the company has to decide for loan
approval based on the applicant’s profile” (upGrad, Credit EDA Assignment 2023).
• “This case study aims to identify patterns which indicate if a client has difficulty paying
their instalments which may be used for taking actions such as denying the loan, reducing
the amount of loan, lending (to risky applicants) at a higher interest rate, etc.” (upGrad,
Credit EDA Assignment 2023).
• Problem Statement: “The company wants to understand the driving factors (or driver
variables) behind loan default, i.e. the variables which are strong indicators of default.
The company can utilise this knowledge for its portfolio and risk assessment” (upGrad,
Credit EDA Assignment 2023).
Procedure followed from the Beginning
• The warnings module and the required libraries (numpy, pandas, matplotlib.pyplot and seaborn) were imported.
• The CSV files ‘application_data.csv’ and ‘previous_application.csv’ were then loaded.
• Each file was read into a pandas data frame and stored in its own variable.
• The analysis was performed first on the current application dataset and then on the previous application dataset.
• The first and last rows were inspected using the head() and tail() functions.
• The columns were then checked, and the data frame info was printed.
• The dataset was then checked for null values.
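The loading and inspection steps above can be sketched roughly as follows. This is a minimal illustration, not the presenter's notebook: the inline CSV text stands in for ‘application_data.csv’, the column names SK_ID_CURR, TARGET and OCCUPATION_TYPE are assumed for the example, and filtering warnings is an assumption (the slides only say that warnings was imported).

```python
import io
import warnings

import pandas as pd

# Assumption: warnings are silenced; the slides only mention the import.
warnings.filterwarnings("ignore")

# Hypothetical stand-in for 'application_data.csv'. With the real file this
# would be pd.read_csv("application_data.csv").
sample_csv = io.StringIO(
    "SK_ID_CURR,TARGET,AMT_INCOME_TOTAL,AMT_CREDIT,OCCUPATION_TYPE\n"
    "100001,0,202500.0,406597.5,Laborers\n"
    "100002,1,270000.0,1293502.5,\n"
    "100003,0,67500.0,135000.0,Sales staff\n"
)
app_df = pd.read_csv(sample_csv)

print(app_df.head())             # first rows
print(app_df.tail())             # last rows
print(app_df.columns.tolist())   # column names
app_df.info()                    # dtypes and non-null counts

# Number of null values per column
null_counts = app_df.isnull().sum()
print(null_counts)
```

The same calls would be repeated for the data frame built from ‘previous_application.csv’.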
Null Value Treatment
• The null counts were converted to percentages of each column.
• Columns with more than 35 percent null values were removed from the dataset.
• Columns with a smaller share of missing values (below roughly 13 to 20 percent) were identified and labeled as minor_missing_values.
• The unique values of the columns were counted using the nunique() function, which showed that these columns were categorical in nature.
• Columns that were not needed for the analysis were dropped.
• Data types were corrected: columns stored with the wrong type were converted to object, categorical or numerical types as appropriate.
• Columns containing negative values were converted to positive values where required.
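A minimal sketch of the null value treatment described above, using a small made-up data frame. The column names MOSTLY_MISSING, SLIGHTLY_MISSING and DAYS_VALUE are hypothetical, and the 20 percent upper bound for minor_missing_values is one reading of the "13 to 20 percent" range given on the slide.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only AMT_CREDIT is a column named in the slides.
df = pd.DataFrame({
    "AMT_CREDIT": [1e5, 2e5, 3e5, 4e5, 5e5, 6e5, 7e5, 8e5, 9e5, 1e6],
    "MOSTLY_MISSING": [np.nan] * 6 + [1.0, 2.0, 3.0, 4.0],     # 60% null
    "SLIGHTLY_MISSING": [np.nan] + [1.0] * 9,                  # 10% null
    "DAYS_VALUE": [-10, -20, -30, -40, -50, -60, -70, -80, -90, -100],
})

# Share of null values per column, as a percentage
null_pct = df.isnull().mean() * 100

# Drop columns with more than 35 percent null values
cols_to_drop = null_pct[null_pct > 35].index
df = df.drop(columns=cols_to_drop)

# Columns with a small but non-zero share of missing values
# (assumed upper bound: 20 percent)
minor_missing_values = null_pct[(null_pct > 0) & (null_pct <= 20)]

# Number of distinct values per column; a low count suggests a categorical column
print(df.nunique())

# Convert a column stored as negative numbers to positive values
df["DAYS_VALUE"] = df["DAYS_VALUE"].abs()
```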
Data Binning
Inference: Many outliers are clearly present. They can be imputed with the median or mode, because these measures are less affected by extreme values than the mean and so represent the dataset as a whole more accurately.
Box Plot for AMT_INCOME_TOTAL
Inference: Outliers are present here as well. They can be imputed with the median or mode, because these measures are less affected by extreme values than the mean and so represent the dataset as a whole more accurately.
Box Plot for AMT_CREDIT
Inference: There is an outlier here as well. It can be imputed with the median or mode, because these measures are less affected by extreme values than the mean and so represent the dataset as a whole more accurately.
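One way to act on the box plot inferences is sketched below. It flags outliers with the usual box plot rule (values more than 1.5 times the interquartile range beyond the quartiles) and replaces them with the median. The slides do not say which rule was used, so the 1.5 × IQR threshold and the sample values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical income values with one extreme observation
income = pd.Series(
    [90.0, 100.0, 110.0, 120.0, 130.0, 140.0, 5000.0],
    name="AMT_INCOME_TOTAL",
)

# Quartiles and interquartile range
q1 = income.quantile(0.25)
q3 = income.quantile(0.75)
iqr = q3 - q1

# Box plot whisker limits (assumed rule: 1.5 * IQR)
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (income < lower) | (income > upper)

# Replace flagged values with the median of the column
median = income.median()
income_imputed = income.where(~is_outlier, median)

print(is_outlier.sum(), "outlier(s) replaced with", median)
```

The median is computed before replacement here; because it is barely affected by the extreme value, the result changes little if it is recomputed afterwards.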
Heat Map of Target Variable Zero
Inference: The correlation matrix is symmetric, so the upper triangle mirrors the lower one. Amount credit and amount goods price have the highest positive correlation (0.99), which means they are strongly positively linearly correlated. In contrast, ext. source 3 and amount income total have a negative correlation. Amount goods price and amount annuity also show a good positive correlation.
Heat Map of Target Variable Zero
Inference: The heat map above shows that amount credit and amount goods price have the highest positive correlation, which indicates a positive linear relationship between the two variables. In contrast, ext. source 3 and amount income total have a negative correlation. A negative correlation only means that one variable tends to decrease as the other increases; it does not by itself show that the two variables depend on each other.
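The correlation matrices behind the two heat maps could be computed roughly as follows. This is a sketch on made-up numbers: the column names TARGET and AMT_GOODS_PRICE are assumed, and the helper top_pair is hypothetical, written only to pull the most correlated pair out of the upper triangle.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; goods price is set to 0.9 * credit so the pair is
# perfectly correlated in this example.
df = pd.DataFrame({
    "TARGET": [0, 0, 0, 0, 1, 1, 1, 1],
    "AMT_CREDIT": [100.0, 200.0, 300.0, 400.0, 150.0, 250.0, 350.0, 450.0],
    "AMT_GOODS_PRICE": [90.0, 180.0, 270.0, 360.0, 135.0, 225.0, 315.0, 405.0],
    "AMT_ANNUITY": [10.0, 18.0, 33.0, 39.0, 16.0, 24.0, 30.0, 47.0],
})

# Split by target and compute one correlation matrix per group
target0 = df[df["TARGET"] == 0].drop(columns="TARGET")
target1 = df[df["TARGET"] == 1].drop(columns="TARGET")
corr0 = target0.corr()
corr1 = target1.corr()


def top_pair(corr):
    """Return the most positively correlated pair from the upper triangle."""
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)  # exclude diagonal
    pairs = corr.where(mask).stack()
    return pairs.idxmax(), pairs.max()


top0 = top_pair(corr0)
top1 = top_pair(corr1)
print(top0)
print(top1)
```

Each matrix can then be drawn with seaborn, for example `sns.heatmap(corr0, annot=True)`.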
Numerical Univariate Analysis
Box Plot comparing YEARS_BIRTH for clients with Target 0 and 1
Interestingly, clients who have been paying regularly have moved more frequently, while clients who have had payment problems have moved less often.
Numerical Univariate Analysis
Box Plot comparing AMT_ANNUITY for clients with Target 0 and 1
An interesting observation: in both categories, males have defaulted less than females. However, more females than males have taken a loan.
Categorical Univariate Analysis
Count Plot comparing EDUCATION_TYPE for clients with Target 0 and 1
Inference: People without payment difficulties take more credit for the annuity.
Bi-variate Analysis
Count Plot on Contract Type and Credit Range
Inference: Clients in both categories take cash loans more often than revolving loans, which may be because most clients belong to the laborer and sales classes.
Bi-variate Analysis
Count Plot on Gender and Credit Range
Inference: In both categories, females have taken out more loans than males, and the loan amounts are also greater for females.
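The "Credit Range" used in these count plots implies that the credit amount was binned into ranges. A minimal sketch of such binning with pd.cut follows; the bin edges, the labels and the column names CODE_GENDER and CREDIT_RANGE are assumptions, since the slides do not list them.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "CODE_GENDER": ["F", "M", "F", "F", "M", "F"],
    "AMT_CREDIT": [120000.0, 450000.0, 800000.0, 300000.0, 1500000.0, 650000.0],
})

# Assumed bin edges and labels; each bin includes its upper edge
bins = [0, 250000, 500000, 750000, 1000000, float("inf")]
labels = ["0-250k", "250k-500k", "500k-750k", "750k-1M", "1M+"]
df["CREDIT_RANGE"] = pd.cut(df["AMT_CREDIT"], bins=bins, labels=labels)

# Number of loans per credit range and gender (the data behind a count plot)
counts = pd.crosstab(df["CREDIT_RANGE"], df["CODE_GENDER"])
print(counts)
```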
Numerical and Categorical Bi-variate Analysis
Box Plot on Credit Amount and Education Type
Inference: Among clients with difficulty paying the loan, people with higher education struggled more with repayment. This may be due to the state of the job market, the amount of the loan, or their employment status. On the other hand, higher education also leads among clients who do not have payment difficulties.
Numerical and Categorical Bi-variate Analysis
Box Plot on Total Income and Education Type
Inference: In both cases there are numerous outliers. The academic degree group is the smallest in both cases because few clients in the dataset hold an academic degree.
Numerical and Categorical Bi-variate Analysis
Box Plot on Credit Amount and Occupation Type
Inference: The amount of credit taken is higher among clients with no payment difficulties, and defaulters tend to take smaller amounts of credit in every occupation type. Accountants and Managers tend to take more loans and also have more difficulty paying them back.
Previous Application Data
• We followed the same data cleaning steps as for the current application data.
• After cleaning the data, we merged the current and previous application data to perform the final analysis.
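The merge step could look roughly like the sketch below. The slides do not state the join key or join type, so the shared client identifier SK_ID_CURR, the column NAME_CONTRACT_STATUS, the inner join and the suffixes are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical cleaned current-application data
app_df = pd.DataFrame({
    "SK_ID_CURR": [1, 2, 3],
    "TARGET": [0, 1, 0],
})

# Hypothetical cleaned previous-application data (a client can appear many times)
prev_df = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2, 4],
    "NAME_CONTRACT_STATUS": ["Approved", "Refused", "Approved", "Canceled"],
})

# Assumed: inner join on the client identifier; suffixes separate any
# column names the two files share
final_df = app_df.merge(
    prev_df,
    on="SK_ID_CURR",
    how="inner",
    suffixes=("_CURR", "_PREV"),
)
print(final_df)
```

An inner join keeps only clients present in both files; a left join would instead keep every current application, with nulls where no previous application exists.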
Final Dataset Analysis
Count Plot and Pie Plot showing Different status of Loan Offered
Final Dataset Analysis
Count Plot for Contract Type with four subcategories
Inference: In all cases, working people have been approved more often and hold more loans than other classes of people, followed by commercial associates.
Final Dataset Analysis
Count Plot for Family Status with four subcategories
Inference: We can observe that married clients are the biggest borrowers from the bank.
Final Dataset Analysis
Count Plot for Payment Type with four subcategories
Inference: Laborers, sales staff, core staff and drivers are the leading borrowers, with the highest numbers of both approvals and rejections.
Univariate Analysis on Final Dataset
Logarithmic Comparison of Purpose of Borrowing
Inference: The majority of rejected loans fall in the category 'repairs'. Education has an equal number of approvals and rejections. Paying other loans and buying a new car have significantly more rejections than approvals.
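The counts behind such a comparison can be tabulated as sketched below. The data are made up, and the column names NAME_CASH_LOAN_PURPOSE and NAME_CONTRACT_STATUS are assumed; the log transform mirrors the logarithmic scale of the plot, which keeps rare purposes visible next to very common ones.

```python
import numpy as np
import pandas as pd

# Hypothetical merged data
final_df = pd.DataFrame({
    "NAME_CASH_LOAN_PURPOSE": [
        "Repairs", "Repairs", "Repairs",
        "Education", "Education",
        "Buying a new car", "Buying a new car",
    ],
    "NAME_CONTRACT_STATUS": [
        "Refused", "Refused", "Approved",
        "Approved", "Refused",
        "Refused", "Refused",
    ],
})

# Approvals and refusals per purpose
counts = pd.crosstab(
    final_df["NAME_CASH_LOAN_PURPOSE"],
    final_df["NAME_CONTRACT_STATUS"],
)

# Log-scaled counts; 1 is added so that zero counts stay defined
log_counts = np.log10(counts + 1)

# Share of refused applications per purpose
refused_share = counts["Refused"] / counts.sum(axis=1)
print(counts)
print(refused_share)
```

For the plot itself, a seaborn count plot combined with `plt.yscale("log")` gives a logarithmic count axis.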
Univariate Analysis on Final Dataset
Logarithmic Comparison of Contract Status
Inference: