EDA-project Notes-1
By:
E. AuroRajashri
List of Content
1) Introduction of the business problem
1.1 Defining problem statement
1.2 Need of the study/project
1.3 Understanding business/social opportunity
2) Data Report
2.1 Understanding how data was collected in terms of time, frequency and
methodology
2.2 Visual inspection of data (rows, columns, descriptive details)
2.3 Understanding of attributes (variable info, renaming if required)
3) Exploratory data analysis
3.1 Univariate analysis (distribution and spread for every continuous attribute,
distribution of data in categories for categorical ones)
3.2 Bivariate analysis (relationship between different variables, correlations)
3.3 Removal of unwanted variables (if applicable)
3.4 Missing Value treatment (if applicable)
3.5 Outlier treatment (if required)
3.6 Variable transformation (if applicable)
3.7 Addition of new variables (if required)
List of Tables
2.2 Descriptive Statistics
List of Figures
3.1.1 Histogram of age
1. Introduction of the business problem
1.1 Defining problem statement
Problem Statement: This business problem is a supervised learning
example for a credit card company. The objective is to predict the probability
of default (whether or not the customer will pay the credit card bill) based on
the variables provided. The dataset contains multiple variables on credit card
accounts, purchases and delinquency that can be used in the modelling.
Probability of default (PD) modelling is meant to quantify the riskiness of
customers and how much credit is at stake if a customer defaults.
This is critical for any organization that lends money, whether through
secured or unsecured loans.
The objective of this project is to develop a predictive model that
estimates the probability of default for credit card customers. This
involves using the provided dataset, which contains various variables
related to credit card accounts, purchases, and delinquency
information, to understand the riskiness of customers.
By accurately predicting the likelihood of default, the credit card
company can better assess the credit risk associated with each
customer and make informed decisions regarding credit limits,
interest rates, and other lending terms. This is crucial for minimizing
potential losses and managing the overall credit risk portfolio of the
organization.
5. Profitability: By minimizing defaults and optimizing credit allocation,
companies can improve their profitability. This is achieved by reducing bad
debt expenses and increasing the overall efficiency of the lending process.
6. Customer Relationship Management: Understanding customer behavior and
risk profiles allows companies to tailor their products and services to meet the
needs of different customer segments, enhancing customer satisfaction and
loyalty.
7. Strategic Planning: PD models provide valuable insights that can inform
strategic decisions, such as entering new markets, developing new products,
or adjusting business models to better align with customer risk profiles.
Overall, this study is essential for enhancing the financial stability and operational
efficiency of lending institutions, ultimately contributing to their long-term success.
2. Data Report
2.1 Understanding how data was collected in terms of
time, frequency and methodology
The data was provided by a credit card company and describes its customers'
credit activity and default information.
There are 99,979 customers (rows), each described by 36 variables
(columns).
acct_days_in_rem_12_24m, and other delinquency-related
metrics.
The higher values of the worst status in the past (compared to
recent months) suggest that either users are improving their
financial behaviour, or perhaps recent data is still too new to reflect
long-term issues.
acct_days_in_dc_12_24m and
acct_days_in_rem_12_24m: These features have low mean
values (3.75 and 1.58, respectively), which suggests that most users
are not spending much time in debt collection or remediation.
sum_capital_paid_acct_0_12m (mean of 351): This
measure of capital paid over the past 12 months is much lower than
the sum of invoices, indicating that many payments may be focused
on interest or smaller amounts, with fewer users paying off large
amounts of principal.
The average age of users is around 42 years, with most users
between 34 (25th percentile) and 50 (75th percentile). This is a
relatively mature population, likely implying they have had time to
accumulate financial responsibilities (e.g., mortgages, loans). The
low minimum of 18 and the high maximum of 75 indicate a broad
age range, which could suggest different behavioural patterns
based on life stage (e.g., younger users might be less financially
stable).
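The summary figures quoted above (means, quartiles, minimum and maximum) are the kind of output that pandas' describe() produces. The following is a minimal sketch on synthetic data, not the project's actual code: the DataFrame name df and the generated values are assumptions made for illustration, and only the column names are taken from the report.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (values are made up for illustration).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(18, 76, size=1_000),              # ages 18..75
    "acct_days_in_dc_12_24m": rng.poisson(3.75, size=1_000),
})

# describe() reports count, mean, std, min, quartiles and max per numeric column.
summary = df.describe().T
print(summary[["mean", "min", "25%", "50%", "75%", "max"]])
```

Transposing the result puts one variable per row, which is easier to read when there are many columns.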
1. Demographic variables: userid, age, name_in_email
2. Loan variables: acct_amt_added_12_24m, acct_days_in_dc_12_24m, acct_days_in_rem_12_24m, acct_days_in_term_12_24m, acct_incoming_debt_vs_paid_0_24m, acct_status, has_paid, max_paid_inv_0_12m, max_paid_inv_0_24m, num_active_inv, recovery_debt, sum_capital_paid_acct_0_12m, sum_capital_paid_acct_12_24m, sum_paid_inv_0_12m
3. Credit variables: default, acct_worst_status_0_3m, acct_worst_status_12_24m, acct_worst_status_3_6m, acct_worst_status_6_12m, avg_payment_span_0_12m, avg_payment_span_0_3m, merchant_category, merchant_group, num_active_div_by_paid_inv_0_12m, num_arch_dc_0_12m, num_arch_dc_12_24m, num_arch_ok_0_12m, num_arch_ok_12_24m, num_arch_rem_0_12m, status_max_archived_0_6_months, status_max_archived_0_12_months, status_max_archived_0_24_month
This histogram shows the distribution of age in the dataset. We can observe that:
The age distribution is right-skewed, with most customers falling in the range
of 25-45 years old.
The peak of the distribution is around 30-35 years old.
There are fewer customers in the older age ranges (above 60).
1,288 customers defaulted (1.43%)
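A default rate depends on the denominator used. Dividing 1,288 by all 99,979 customers would give about 1.29%, so the 1.43% figure presumably uses a smaller base, such as only the customers with a recorded default label; the report does not state this, so it is an assumption. The sketch below uses toy data to show both ways of computing the rate.

```python
import numpy as np
import pandas as pd

# Toy target column: 1 = default, 0 = no default, NaN = label not available.
default = pd.Series([0, 0, 0, 1, 0, np.nan, 0, 1, 0, np.nan], name="default")

n_defaults = int((default == 1).sum())
rate_labelled = n_defaults / default.notna().sum()  # share among labelled rows
rate_all = n_defaults / len(default)                # share among all rows

print(n_defaults, round(rate_labelled, 3), round(rate_all, 3))
```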
Key insights:
Concentration of Transactions: The Direct selling establishments category
dominates with the highest count, nearly 40,000, far exceeding the other
categories. This indicates a large number of transactions or significant activity
in this category.
Moderate Activity: Categories like Books & Magazines and Youthful Shoes &
Clothing have moderate counts (around 10,000–15,000), showing significant
but not overwhelming activity compared to the leader.
3.1.5 Top 10 Merchant groups
The bar chart shows the top 10 merchant groups and the number of
transactions associated with each group. Key insights:
Entertainment is by far the dominant category, with significantly more counts
(around 50,000) than the other categories. This suggests that consumers
engage with or spend more in this group.
Clothing & Shoes follows as the second-highest group, though it's much lower
than Entertainment.
The groups with the lowest counts are Jewelry & Accessories, Home & Garden,
Intangible Products, and Automotive Products.
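Counts like the ones behind this bar chart can be obtained with value_counts(). This is a minimal sketch on toy data; the DataFrame df and its rows are made up, and only the column name merchant_group comes from the dataset description.

```python
import pandas as pd

# Toy data; the real column is assumed to be called merchant_group.
df = pd.DataFrame({
    "merchant_group": ["Entertainment"] * 5
    + ["Clothing & Shoes"] * 3
    + ["Home & Garden"]
})

# value_counts() sorts by frequency, so head(10) gives the ten largest groups.
top_groups = df["merchant_group"].value_counts().head(10)
print(top_groups)
```

The same call on merchant_category would give the category-level counts.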
3.1.6 Histogram of all numerical variables
zero or low values with a steep decline as the values increase. This suggests
that most data points fall in the lower range, with fewer high values.
Variables like sum_capital_paid_account_0_12m and num_active_tl also
show extreme right-skewness, where the majority of data points are
concentrated at lower values.
Some histograms, like time_hours, show a bimodal distribution with
significant peaks around certain values, possibly indicating two common time
ranges in the data.
Many variables, like num_tl_90g_dpd_24m, num_actv_bc_tl, and
max_bal_bc, have a significant concentration of values near zero, indicating
that for these variables, the majority of the data points reflect minimal activity
or involvement (e.g., low number of transactions or minimal balance).
In many histograms (e.g., recovery_label,
sum_capital_paid_account_0_12m), there are long tails indicating the
presence of outliers or extreme values. This implies that there are a few cases
where the values are much higher than the rest of the data.
This barplot compares the average account amount added in the last 12-24 months
for customers who defaulted (1) versus those who didn't (0). We can see that:
Customers who defaulted (1) tend to have a higher average account amount
added compared to those who didn't default (0).
This could suggest that customers who add larger amounts to their accounts
might be at a higher risk of default, possibly due to overextending their
financial capabilities.
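The comparison shown in the barplot amounts to a group-wise mean. The sketch below reproduces that calculation on made-up numbers; it also prints the median, which is less sensitive to the long right tails noted earlier. The DataFrame df and its values are hypothetical.

```python
import pandas as pd

# Toy rows: default flag and amount added to the account (values are made up).
df = pd.DataFrame({
    "default": [0, 0, 0, 1, 1],
    "acct_amt_added_12_24m": [100.0, 0.0, 200.0, 500.0, 300.0],
})

grouped = df.groupby("default")["acct_amt_added_12_24m"]
mean_by_default = grouped.mean()      # what the barplot displays
median_by_default = grouped.median()  # robust check against skew

print(mean_by_default)
print(median_by_default)
```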
3.2.2 Distribution of Max paid invoice(0-12m) by Default status
This strip plot shows the distribution of the maximum paid invoice in the last 12 months for
defaulted and non-defaulted customers. Observations:
The distribution for non-defaulted customers (0) appears to be more concentrated in
the lower range, with some high-value outliers.
Non-defaulted accounts (status 0) show a wider and higher distribution of max paid
invoices, while defaulted accounts (status 1) have smaller invoice amounts. This
pattern could be used for risk assessment or to better understand customer payment
behaviour.
This violin plot displays the age distribution for defaulted and non-defaulted
customers.
The age distributions are fairly similar for both groups.
Both distributions are slightly right-skewed, with most customers between
25-45 years old.
There's a slight indication that defaulted customers might be younger on
average, but the difference doesn't appear to be substantial.
Key Insights:
1. Highly Correlated Features:
Features with a correlation coefficient close to 1 or -1 have a very
strong linear relationship, either positively or negatively
correlated.
Clustering often reveals related features that can be treated
similarly in model building or analysis, as they provide
overlapping information.
3. Negative Correlations:
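Strongly correlated pairs, positive or negative, can be listed directly from the correlation matrix instead of being read off a heatmap. The following is a sketch on synthetic columns a to d; the 0.8 cut-off is an arbitrary choice for illustration, not a value taken from this analysis.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=500)
df = pd.DataFrame({
    "a": x,
    "b": x + rng.normal(scale=0.05, size=500),   # nearly a copy of a
    "c": -x + rng.normal(scale=0.05, size=500),  # nearly the negative of a
    "d": rng.normal(size=500),                   # unrelated noise
})

corr = df.corr()
# Keep only the upper triangle so each pair appears once, without the diagonal.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()
# Filter on absolute value so that strong negative pairs are kept as well.
strong = pairs[pairs.abs() > 0.8].sort_values(key=np.abs, ascending=False)
print(strong)
```

Pairs that appear here carry overlapping information, so one feature of each pair is a candidate for removal before modelling.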
3.4 Missing Value treatment
There are 615,512 missing values in total. The percentage of missing values in
each variable was calculated, and the result is below:
Columns with more than 25% missing values were dropped. Below
are the missing values in the remaining columns.
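The missing-value steps described above can be sketched as follows. The data is a toy example and df is a hypothetical name; the 25% threshold is the one stated in this section.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, 40, 31, 58],                              # 0% missing
    "acct_status": [1.0, np.nan, np.nan, 1.0],            # 50% missing
    "avg_payment_span_0_12m": [12.0, np.nan, 8.0, 20.0],  # 25% missing
})

total_missing = int(df.isna().sum().sum())  # missing cells in the whole table
missing_pct = df.isna().mean() * 100        # percentage missing per column

# Drop columns whose missing share is strictly greater than 25%.
to_drop = missing_pct[missing_pct > 25].index
df_reduced = df.drop(columns=to_drop)

print(total_missing)
print(missing_pct)
print(df_reduced.isna().sum())
```

A column sitting exactly at 25% is kept here because the comparison is strict; whether the original analysis treated the boundary this way is not stated.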
3.4.2 After dropping columns above the 25% threshold
3.5 Outlier treatment
For outlier treatment, the data was separated into object and non-object
(numeric) columns so that the outliers in the numeric columns could be visualized.
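A minimal sketch of this split, followed by one common way to flag outliers. The 1.5 x IQR rule used here is an assumption, since the report does not say which rule was applied, and the data is made up. The split is done on numeric versus non-numeric dtypes, which stands in for the object and non-object split.

```python
import pandas as pd

df = pd.DataFrame({
    "merchant_group": ["Entertainment", "Clothing & Shoes", "Entertainment",
                       "Home & Garden", "Entertainment", "Entertainment"],
    "max_paid_inv_0_12m": [100.0, 120.0, 90.0, 110.0, 105.0, 5000.0],
})

# Split columns by dtype: numeric ones can be box-plotted, the rest cannot.
numeric_cols = df.select_dtypes(include="number").columns
other_cols = df.select_dtypes(exclude="number").columns

# Flag values beyond 1.5 * IQR from the quartiles (assumed rule).
q1 = df[numeric_cols].quantile(0.25)
q3 = df[numeric_cols].quantile(0.75)
iqr = q3 - q1
is_outlier = (df[numeric_cols] < q1 - 1.5 * iqr) | (df[numeric_cols] > q3 + 1.5 * iqr)
outlier_counts = is_outlier.sum()
print(outlier_counts)
```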
3.6.1 One-hot encoding
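One-hot encoding replaces a categorical column with one indicator column per category. The sketch below uses pandas' get_dummies on toy data; which columns were actually encoded in this project is not stated, so merchant_group is used only as a plausible example.

```python
import pandas as pd

df = pd.DataFrame({
    "merchant_group": ["Entertainment", "Clothing & Shoes", "Entertainment"],
    "age": [25, 40, 31],
})

# Each category becomes its own 0/1 column; the original column is removed.
encoded = pd.get_dummies(df, columns=["merchant_group"], dtype=int)
print(encoded)
```

For high-cardinality columns such as merchant_category this can create many sparse columns, so grouping rare categories first may be worth considering.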
undersampling, or SMOTE (Synthetic Minority Oversampling Technique) to
balance the classes for better model performance.
The general approach to applying SMOTE on this dataset:
1. Separate features and target: Use default as the target variable and
the remaining columns as features.
2. Apply SMOTE: Use SMOTE to oversample the minority class by
synthesizing new instances from the existing minority-class data.
3. Train a model: You can then use the balanced data to train your machine
learning model.
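The steps above can be sketched as follows. This is a hand-written NumPy illustration of the interpolation idea behind SMOTE, not the implementation from a library such as imbalanced-learn; the function name smote_sketch, the toy data and the choice of k are all assumptions. In practice the resampling is normally applied to the training split only, so that synthetic rows do not leak into the evaluation data.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority rows by interpolating between a
    minority row and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    k = min(k, len(X_min) - 1)
    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)           # a row is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))              # position along the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Step 1: separate features and target (toy data, 5% defaults).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = np.array([1] * 10 + [0] * 190)

# Step 2: synthesize enough minority rows to balance the classes.
n_new = int((y == 0).sum() - (y == 1).sum())
X_syn = smote_sketch(X[y == 1], n_new)
X_bal = np.vstack([X, X_syn])
y_bal = np.concatenate([y, np.ones(n_new, dtype=int)])

# Step 3: X_bal and y_bal can now be passed to a classifier.
print(X_bal.shape, int((y_bal == 1).sum()), int((y_bal == 0).sum()))
```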
4.2.3 Elbow graph
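An elbow graph plots the K-means inertia (within-cluster sum of squares) against the number of clusters; the "elbow" is where adding clusters stops reducing inertia substantially. The sketch below computes the values that such a graph would show, using scikit-learn on synthetic data with three obvious groups. The data, the range of k and the settings are assumptions for illustration and do not reproduce the five-cluster result of this project.

```python
import numpy as np
from sklearn.cluster import KMeans

# Three well-separated synthetic blobs (toy data).
rng = np.random.default_rng(0)
centres = np.array([[0, 0], [8, 8], [0, 8]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centres])

# Fit K-means for several k and record the inertia of each fit.
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

On real data the features would usually be scaled first, since K-means is distance-based and variables with large ranges would otherwise dominate.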
4. Merchant Category Insights: The "Direct Selling Establishments" category has
the highest transaction count, indicating significant customer spending in this
area. Spending is heavily concentrated in the entertainment sector, followed
by clothing and shoes, highlighting key areas of customer expenditure.
5. Clustering Insights: The optimal number of customer clusters was identified
as five using K-means clustering, suggesting distinct customer segments based
on financial behaviour. This can help in targeted marketing and risk
assessment strategies.
6. Correlation Insights: Features related to account status over different periods
(e.g., 0-3 months vs. 6-12 months) are highly correlated, indicating
consistency in customer payment behaviour. Strong negative correlations
between default status and variables like the maximum paid invoice suggest
that higher payments are linked to lower default risk.
These insights can guide credit risk management, customer segmentation, and
business strategies for the credit card company.