0% found this document useful (0 votes)
76 views23 pages

EDA-project Notes-1

Uploaded by

aurorajashri
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
76 views23 pages

EDA-project Notes-1

Uploaded by

aurorajashri
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 23

Banking Project

(Capstone Project Notes - 1)


DSBA

By:
E. AuroRajashri

0
List of Content
1) Introduction of the business
problem..................................................................................................3
1.1 Defining problem statement
1.2 Need of the study/project
1.3 Understanding business/social opportunity

2) Data
Report……………………………………………………….……………………………….….5
2.1 Understanding how data was collected in terms of time, frequency and
methodology
2.2 Visual inspection of data (rows, columns, descriptive details)
2.3 Understanding of attributes (variable info, renaming if required)

3) Exploratory data
analysis…………………………………………………………………………….………….…7
3.1 Univariate analysis (distribution and spread for every continuous attribute,
distribution of data in categories for categorical ones)
3.2 Bivariate analysis (relationship between different variables, correlations)
3.3 Removal of unwanted variables (if applicable)
3.4 Missing Value treatment (if applicable)
3.5 Outlier treatment (if required)
3.6 Variable transformation (if applicable)
3.7 Addition of new variables (if required)

4) Business insights from


EDA ……………………………………………………………………………………...………21
4.1 Is the data unbalanced? If so, what can be done? Please explain in the context of the
business
4.2 Any business insights using clustering (if applicable)
4.3 Any other business insights

1
List of Tables
2.2 Descriptive Statistics…………………………………………………………………………….………5

2.3 Data Info……………………………………………………………………………………………….…….6

3.3.1 Removed User id variable from the data frame………………………………………….….15

3.3.2 Removed Name in email variable from the data frame…………………………….…….15

3.4.1 Percentage of missing value per column……………………………………………………….16

3.4.2 Post dropping off columns with 25% threshold……………………………………………..17

3.4.3 Post Imputation- Missing values……………………………………………………..…………..17

3.6.1 One-hot encoding……………………………………………………………………………..………..19

4.2.1 Post scaling treatment……………………………………………………………………….……….20

4.2.2 Inertia of various n_clusters………………………………………………………………….……20

4.2.4 Final dataset post clustering……………………………………………………………….……….21

List of Figures
3.1.1 Histogram of age………………………………………………………………………………………….7

3.1.2 Histogram of Time_hours…………………………………………………………………………….8

3.1.3 Number of Defaulters and Non-defaulters………………………………………………….….8

3.1.4 Top 10 Merchant Categories………………………………………………………………………….9

3.1.5 Top 10 Merchant groups………………………………………………………………………………10

3.1.6 Histogram of all numerical variables…………………………………………………………….11

3.2.1 Average Account Amount Added (12-24m) by Default status……………………………12

3.2.2 Distribution of Max paid invoice(0-12m) by Default status………………………………13

3.2.3 Violin plot: Age distribution by default status…………………………………………………13

3.2.4 Heat Map -Correlation………………………………………………………………………………….14

3.5.1 Outliers using box plot………………………………………………………………………………….18

3.5.2 Post Outliers Treatment……………………………………………………………………………….18

4.2.3 Elbow graph………………………………………………………………………………………………..21

2
1. Introduction of the business problem
1.1 Defining problem statement
Problem Statement: This business problem is a supervised learning
example for a credit card company. The objective is to predict the probability
of default (whether the customer will pay the credit card bill or not) based on
the variables provided. There are multiple variables on the credit card
account, purchase and delinquency information which can be used in the
modelling.
PD modelling problems are meant for understanding the riskiness of the
customers and how much credit is at stake in case the customer defaults.
This is an extremely critical part in any organization that lends money [both
secured and unsecured loans].
 The objective of this project is to develop a predictive model that
estimates the probability of default for credit card customers. This
involves using the provided dataset, which contains various variables
related to credit card accounts, purchases, and delinquency
information, to understand the riskiness of customers.
 By accurately predicting the likelihood of default, the credit card
company can better assess the credit risk associated with each
customer and make informed decisions regarding credit limits,
interest rates, and other lending terms. This is crucial for minimizing
potential losses and managing the overall credit risk portfolio of the
organization.

1.2 Need of the study/project


The need for this study or project arises from the critical role that predicting the
probability of default (PD) plays in the financial industry, particularly for credit
card companies and other lending institutions. Here are some key reasons why
this study is essential:
1. Risk Management: Understanding the riskiness of customers is crucial for
managing the overall risk portfolio of a lending institution. By predicting the
likelihood of default, companies can make informed decisions about whom to
lend to and under what terms.
2. Credit Allocation: Accurate PD models help in determining the appropriate
amount of credit to extend to each customer. This ensures that credit is
allocated efficiently, maximizing returns while minimizing risk.
3. Loss Mitigation: By identifying high-risk customers, companies can take
proactive measures to mitigate potential losses. This might include adjusting
credit limits, changing interest rates, or implementing stricter repayment
terms.
4. Regulatory Compliance: Financial institutions are often required to maintain
certain levels of capital reserves based on the riskiness of their loan portfolios.
Accurate PD models help in meeting these regulatory requirements by
providing a clear picture of potential defaults.

3
5. Profitability: By minimizing defaults and optimizing credit allocation,
companies can improve their profitability. This is achieved by reducing bad
debt expenses and increasing the overall efficiency of the lending process.
6. Customer Relationship Management: Understanding customer behavior and
risk profiles allows companies to tailor their products and services to meet the
needs of different customer segments, enhancing customer satisfaction and
loyalty.
7. Strategic Planning: PD models provide valuable insights that can inform
strategic decisions, such as entering new markets, developing new products,
or adjusting business models to better align with customer risk profiles.
Overall, this study is essential for enhancing the financial stability and operational
efficiency of lending institutions, ultimately contributing to their long-term success.

1.3 Understanding business/social opportunity


 The business and social opportunities of this project are significant, as it
not only enhances the financial stability and profitability of lending
institutions but also promotes financial inclusion and economic growth.

 By accurately predicting the probability of default, credit card companies


can extend credit more responsibly, reaching a broader range of
customers, including those who may have been previously underserved.
 This can lead to reduced financial risks, lower costs of credit, and
improved customer satisfaction, ultimately contributing to economic
stability and growth.
 Additionally, by understanding customer risk profiles, companies can
tailor their products and services to better meet the needs of different
customer segments, enhancing customer relationships and loyalty.

 Recent example: Yes Bank, once one of India's fastest-growing private


sector banks, faced a severe crisis in 2020 due to its inability to manage
credit risk effectively.
 Recent example: In 2019, SBI implemented an AI-powered credit
scoring system to assess loan applications and predict the probability of
default. This system has helped SBI better manage its non-performing
assets (NPAs) by more accurately predicting which borrowers are likely to
default

4
2. Data Report
2.1 Understanding how data was collected in terms of
time, frequency and methodology
 The data provided by a credit card company about its customer’s credit
activity and defaulters information.
 There are 99,979 customers and the observations are divided into 36
variables.

2.2 Visual inspection of data (rows, columns, descriptive


details)
 The data has 36 variables with 99,979 observations. And 99979
rows and 36 columns as shown in the result below.

 out of the data provided 3 discrete variables and 33 continuous


variables.
 There are 700,141 missing cells
 Default is the dependent variable which captures if the customer
has defaulted or not.

2.2 Descriptive Statistics

 Most users do not have severe debt problems, as indicated by the


low average values for acct_days_in_dc_12_24m,

5
acct_days_in_rem_12_24m, and other delinquency-related
metrics.
 The higher values of the worst status in the past (compared to
recent months) suggest that either users are improving their
financial behaviour, or perhaps recent data is still too new to reflect
long-term issues.
 acct_days_in_dc_12_24m and
acct_days_in_rem_12_24m: These features have low mean
values (3.75 and 1.58, respectively), which suggests that most users
are not spending much time in debt collection or remediation
 sum_capital_paid_acct_0_12m (mean of 351): This
measure of capital paid over the past 12 months is much lower than
the sum of invoices, indicating that many payments may be focused
on interest or smaller amounts, with fewer users paying off large
amounts of principal.
 The average age of users is around 42 years, with most users
between 34 (25th percentile) and 50 (75th percentile). This is a
relatively mature population, likely implying they have had time to
accumulate financial responsibilities (e.g., mortgages, loans). The
low minimum of 18 and the high maximum of 75 indicate a broad
age range, which could suggest different behavioural patterns
based on life stage (e.g., younger users might be less financially
stable).

2.3 Understanding of attributes (variable info, renaming


if required)
 It has 33 numerical variables and 3 categorical variables

2.3 Data Info

 The type of variables present in the data are:

6
1. Demographic variables: userid, age, name_in_email
2. Loan variables:
acct_amt_added_12_24m,acct_days_in_dc_12_24m,acct_days_in_rem_12_24
m,acct_days_in_term_12_24m,acct_incoming_debt_vs_paid_0_24m,acct_stat
us,has_paid,max_paid_inv_0_12m,max_paid_inv_0_24m,num_active_inv,rec
overy_debt,sum_capital_paid_acct_0_12m,sum_capital_paid_acct_12_24m,
sum_paid_inv_0_12m
3. Credit variables:
default,acct_worst_status_0_3m,acct_worst_status_12_24m,acct_worst_status
_3_6m,acct_worst_status_6_12m,avg_payment_span_0_12m,avg_payment_s
pan_0_3m,merchant_category,merchant_group,num_active_div_by_paid_inv
_0_12m,num_arch_dc_0_12m,num_arch_dc_12_24m,num_arch_ok_0_12m,
num_arch_ok_12_24m,num_arch_rem_0_12m,status_max_archived_0_6_m
onths,status_max_archived_0_12_months,status_max_archived_0_24_month.

3.Exploratory data analysis


3.1 Univariate analysis

3.1.1 Histogram of age

This histogram shows the distribution of age in the dataset. We can observe that:
 The age distribution is right-skewed, with most customers falling in the range
of 25-45 years old.
 The peak of the distribution is around 30-35 years old.

7
 There are fewer customers in the older age ranges (above 60).

3.1.2 Histogram of Time_hours

 The data shows a higher frequency of occurrences around the 15 to 21-hour


mark. This suggests that most of the recorded times fall within this range.
 The distribution appears to be right-skewed, with more values concentrated
in the later hours (past 10 hours) and fewer in the earlier hours.

3.1.3 Number of Defaulters and Non-defaulters

The dataset is highly imbalanced in terms of default status:


 88688 customers did not default (98.57%)

8
 1,288 customers defaulted (1.43%)

3.1.4 Top 10 Merchant category

Key insights:
 Concentration of Transactions: The Direct selling establishments category
dominates with the highest count, nearly 40,000, far exceeding the other
categories. This indicates a large number of transactions or significant activity
in this category.
 Moderate Activity: Categories like Books & Magazines and Youthful Shoes &
Clothing have moderate counts (around 10,000–15,000), showing significant
but not overwhelming activity compared to the leader.

 Low Activity Categories: Categories like Dietary Supplements, Prints &


Photos, and Diversified electronics have much lower counts (under 10,000).
These are niche categories with fewer transactions.
 Category Variety: The top 10 categories represent a broad range of industries,
including electronics, apparel, outdoor gear, books, and general merchandise.
This indicates diverse customer interests.

9
3.1.5 Top 10 Merchant groups

The bar chart you shared shows the top 10 merchant groups and the count of
transactions or occurrences associated with each group. Here's a breakdown of the
insights:
 Entertainment is by far the dominant category, with significantly more counts
(around 50,000) than the other categories. This suggests that consumers
engage with or spend more in this group.

 Clothing & Shoes follows as the second-highest group, though it's much lower
than Entertainment.
 The groups with the lowest counts are Jewelry & Accessories, Home & Garden,
Intangible Products, and Automotive Products.

 The distribution shows that spending or transaction volume is concentrated


heavily in Entertainment, with other categories having relatively smaller but
still notable volumes.

10
3.1.6 Histogram of all numerical variables

 Many variables, such as acct_worst_status_0_24m,


acct_worst_status_1_24m, and num_active_rev_tl, show high frequencies at

11
zero or low values with a steep decline as the values increase. This suggests
that most data points fall in the lower range, with fewer high values.
 Variables like sum_capital_paid_account_0_12m and num_active_tl also
show extreme right-skewness, where the majority of data points are
concentrated at lower values.
 Some histograms, like time_hours, show a bimodal distribution with
significant peaks around certain values, possibly indicating two common time
ranges in the data.
 Many variables, like num_tl_90g_dpd_24m, num_actv_bc_tl, and
max_bal_bc, have a significant concentration of values near zero, indicating
that for these variables, the majority of the data points reflect minimal activity
or involvement (e.g., low number of transactions or minimal balance).
 In many histograms (e.g., recovery_label,
sum_capital_paid_account_0_12m), there are long tails indicating the
presence of outliers or extreme values. This implies that there are a few cases
where the values are much higher than the rest of the data.

3.2 Bivariate analysis

3.2.1 Average Account Amount Added (12-24m) by Default status

This barplot compares the average account amount added in the last 12-24 months
for customers who defaulted (1) versus those who didn't (0). We can see that:
 Customers who defaulted (1) tend to have a higher average account amount
added compared to those who didn't default (0).
 This could suggest that customers who add larger amounts to their accounts
might be at a higher risk of default, possibly due to overextending their
financial capabilities.

12
3.2.2 Distribution of Max paid invoice(0-12m) by Default status

This strip plot shows the distribution of the maximum paid invoice in the last 12 months for
defaulted and non-defaulted customers. Observations:
 The distribution for non-defaulted customers (0) appears to be more concentrated in
the lower range, with some high-value outliers.

 Non-defaulted accounts (status 0) show a wider and higher distribution of max paid
invoices, while defaulted accounts (status 1) have smaller invoice amounts. This
pattern could be used for risk assessment or to better understand customer payment
behaviour

3.2.3 Violin plot: Age distribution by default status

This violin plot displays the age distribution for defaulted and non-defaulted
customers.
 The age distributions are fairly similar for both groups.

13
 Both distributions are slightly right-skewed, with most customers between
25-45 years old.
 There's a slight indication that defaulted customers might be younger on
average, but the difference doesn't appear to be substantial.

3.2.4 Heat Map -Correlation

Key Insights:
1. Highly Correlated Features:
 Features with a correlation coefficient close to 1 or -1 have a very
strong linear relationship, either positively or negatively
correlated.

 For example, if max_paid_inv_0_12m and


num_active_inv_0_12m show high positive correlation, it
implies that as the number of active invoices increases, the
maximum paid invoice also tends to increase.

 Similarly, features like acct_worst_status_12_24m might be


strongly correlated with acct_worst_status_6_12m, indicating a
consistency in worst account status over different periods.
2. Clusters of Features:
 Features that are highly correlated with each other may form
"clusters." For instance, all account status variables or payment-
related features might be grouped together, showing that they
are related aspects of customer behavior.

14
 Clustering often reveals related features that can be treated
similarly in model building or analysis, as they provide
overlapping information.
3. Negative Correlations:

 Strong negative correlations (close to -1) indicate an inverse


relationship. For example, if default_status has a negative
correlation with max_paid_inv_0_12m, it means that
customers with higher max paid invoices are less likely to
default.

 Similarly, a negative correlation between


acct_incoming_debt_vs_paid_0_24m and
acct_days_in_rem_12_24m might show that the more days a
person remains in arrears, the less they manage to reduce their
outstanding debt.
4. Redundancy:
 Features that are almost perfectly correlated (near 1) may
represent redundant information. For example, if
acct_worst_status_6_12m and acct_worst_status_3_6m are
highly correlated, it may be redundant to include both in certain
analyses. One of these features can potentially be dropped in a
model without losing valuable information.
5. Outliers in Correlation:
 If there are features that stand out with unexpectedly high or low
correlations compared to others, they may warrant deeper
investigation. These outliers could represent key insights into
behavior or relationships between variables that are not
immediately obvious.

3.3 Removal of unwanted variables


 Removed userid variable

3.3.1 Removed Userid from the data frame

 Removed name in email variable

3.3.2 Removed Name in email variable from the data frame

15
3.4 Missing Value treatment

 There are 615512 missing values. The percentage of missing value in each
variable calculated and the result is below:

3.4.1 Percentage of missing value per column

 Dropping off the columns which has missing value greater than 25% and below
are the missing values in remaining columns

16
3.4.2 Post dropping off columns with 25% threshold

 By importing SimpleImputer, missing values are imputed by median. Below is


the result post imputation.

3.4.3Post Imputation- Missing values

17
3.5 Outlier treatment
 For outlier treatment, we are separated the data into object and non-object to
visualize the outliers

3.5.1 Outliers using box plot

 Post outlier treatment,

3.5.2 Post Outliers Treatment

3.6 Variable transformation


 One-hot encoding done for Merchant Category and Merchant group (categorical
columns)
 Post that, shape of dataset is shown below:

18
3.6.1 One-hot encoding

3.7 Addition of new variables


 As the provided variables are good enough for the processing and modelling. So
there was no need aroused to add new variables for this dataset.

4. Business insights from EDA


4.1 Is the data unbalanced? If so, what can be done? Please
explain in the context of the business
 Imbalanced data refers to datasets where the target class has an uneven
distribution of observations, i.e., one class label has a very high number
of observations, and the other has a deficient number of observations.
 The dataset is highly imbalanced in terms of default status:
86,035 customers did not default (98.5%) and 1,280 customers defaulted (1.5%)
 This imbalance is important to consider when interpreting the results for
predictive modelling. We might need to use techniques like oversampling,

19
under sampling, or SMOTE (Synthetic Minority Oversampling Technique) to
balance the classes for better model performance.
 The general approach to applying SMOTE on this dataset:

1. Separate features and target: Use default as the target variable and
the remaining columns as features.
2. Apply SMOTE: Use SMOTE to oversample the minority class. In this,
new instances are synthesized from the existing data
3. Train a model: You can then use the balanced data to train your machine
learning model.

4.2 Any business insights using clustering (if


applicable)
 For clustering, target labels like default are not needed, as clustering is
unsupervised.
 K-means clustering performed on the dataset as shown below:
 Before performing k-means clustering, Scaling is important. Below is the
result of the scaled data

4.2.1 Post scaling treatment

 Different number of Clusters are performed and inertia is calculated

4.2.2 Inertia of various n_clusters

 It is visually viewed in elbow graph as shown below

20
4.2.3 Elbow graph

 Silhouette score is high at 5 number of clusters. Score is shown below

 Sil width and clus_kmeans are added to the dataset.

4.2.4 Final dataset post clustering

4.3 Any other business insights


1. Imbalanced Dataset: The data is highly imbalanced with 98.5% non-defaulters
and only 1.5% defaulters. This can affect the performance of predictive
models. Techniques like oversampling (SMOTE) should be applied to handle
the imbalance and improve prediction accuracy.
2. Customer Risk Segmentation: Customers who defaulted tend to have higher
average account amounts added in the last 12-24 months, which may indicate
that higher financial activity could be associated with default risk. Non-
defaulting customers generally have higher maximum paid invoices compared
to defaulters, implying better financial behaviour in terms of invoice
payments.
3. Age and Default Probability: While age distributions of defaulters and non-
defaulters are quite similar, younger customers might have a slightly higher
default tendency, though this difference isn't large.

21
4. Merchant Category Insights: The "Direct Selling Establishments" category has
the highest transaction count, indicating significant customer spending in this
area. Spending is heavily concentrated in the entertainment sector, followed
by clothing and shoes, highlighting key areas of customer expenditure.
5. Clustering Insights: The optimal number of customer clusters was identified
as five using K-means clustering, suggesting distinct customer segments based
on financial behaviour. This can help in targeted marketing and risk
assessment strategies.
6. Correlation Insights: Features related to account status over different periods
(e.g., 0-3 months vs. 6-12 months) are highly correlated, indicating
consistency in customer payment behaviour. Strong negative correlations
between default status and variables like the maximum paid invoice suggest
that higher payments are linked to lower default risk.
These insights can guide credit risk management, customer segmentation, and
business strategies for the credit card company.

22

You might also like