EDA SummaryReport

Uploaded by Atul Kumar Singh
Copyright: © All Rights Reserved

Exploratory Data Analysis (EDA) Summary Report

1. Introduction
This report outlines the exploratory data analysis conducted on Geldium's customer
dataset. The primary purpose of this EDA is to understand the dataset's structure,
assess its quality, identify potential issues, and uncover early risk indicators related to
loan delinquency before predictive modeling can commence.

2. Dataset Overview
This section summarizes the dataset, including the number of records, key variables,
and data types. It also highlights any anomalies, duplicates, or inconsistencies observed
during the initial review.

Key dataset attributes:

• Number of records: 500

• Key variables: Customer_ID, Age, Income, Credit_Score, Credit_Utilization,
Missed_Payments, Delinquent_Account, Loan_Balance, Debt_to_Income_Ratio,
Employment_Status, Account_Tenure, Credit_Card_Type, Location, Month_1 to
Month_6 (payment history).

• Data types:

  o Categorical: Customer_ID, Employment_Status, Credit_Card_Type,
  Location, Month_1 to Month_6

  o Numerical: Age, Income, Credit_Score, Credit_Utilization,
  Missed_Payments, Delinquent_Account, Loan_Balance,
  Debt_to_Income_Ratio, Account_Tenure
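The duplicate check mentioned in this section can be illustrated with a minimal sketch. It assumes the records are held as a list of dictionaries keyed by column name; the sample rows, the IDs, and the helper name duplicate_ids are hypothetical and are not taken from the Geldium dataset.

```python
from collections import Counter

# Hypothetical rows standing in for the customer dataset (IDs are made up).
rows = [
    {"Customer_ID": "ID_A", "Age": 34},
    {"Customer_ID": "ID_B", "Age": 51},
    {"Customer_ID": "ID_A", "Age": 34},  # repeated ID
]


def duplicate_ids(rows, id_field="Customer_ID"):
    """Return the sorted IDs that appear on more than one row."""
    counts = Counter(row[id_field] for row in rows)
    return sorted(cid for cid, n in counts.items() if n > 1)


duplicates = duplicate_ids(rows)
print(f"{len(rows)} records, duplicated IDs: {duplicates}")
```

An empty result would suggest that each record corresponds to a distinct customer.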

3. Missing Data Analysis


Identifying and addressing missing data is critical to ensuring model accuracy. This
section outlines missing values in the dataset, the approach taken to handle them, and
justifications for the chosen method.

Key missing data findings:

• Variables with missing values: Income, Loan_Balance, Debt_to_Income_Ratio,
Credit_Score.

• Specific instances observed: CUST0009 (Loan_Balance), CUST0041 (Income),
CUST0043 (Income), CUST0046 (Loan_Balance), CUST0052 (Loan_Balance),
CUST0053 (Loan_Balance), CUST0060 (Income), CUST0067 (Income),
CUST0069 (Income), CUST0077 (Income), CUST0103 (Loan_Balance),
CUST0105 (Loan_Balance), CUST0114 (Income), CUST0116 (Credit_Score,
Loan_Balance, Debt_to_Income_Ratio), CUST0121 (Loan_Balance,
Debt_to_Income_Ratio), CUST0125 (Loan_Balance), CUST0154 (Income),
CUST0165 (Loan_Balance), CUST0178 (Loan_Balance,
Debt_to_Income_Ratio), CUST0188 (Income), CUST0192 (Loan_Balance),
CUST0196 (Loan_Balance), CUST0209 (Income, Loan_Balance,
Debt_to_Income_Ratio), CUST0225 (Income), CUST0229 (Income), CUST0244
(Loan_Balance), CUST0261 (Debt_to_Income_Ratio), CUST0266
(Loan_Balance), CUST0273 (Income, Loan_Balance), CUST0277 (Income),
CUST0279 (Income), CUST0282 (Loan_Balance), CUST0291 (Income),
CUST0294 (Income), CUST0299 (Income), CUST0300 (Income), CUST0308
(Income), CUST0343 (Loan_Balance), CUST0370 (Income), CUST0371
(Income), CUST0376 (Income), CUST0377 (Income), CUST0379 (Income,
Credit_Score), CUST0387 (Income), CUST0394 (Income), CUST0396
(Loan_Balance), CUST0408 (Income), CUST0413 (Loan_Balance), CUST0419
(Income), CUST0451 (Income), CUST0454 (Loan_Balance), CUST0459
(Income), CUST0460 (Loan_Balance), CUST0464 (Income), CUST0466
(Income, Loan_Balance), CUST0482 (Loan_Balance, Debt_to_Income_Ratio),
CUST0483 (Income, Loan_Balance), CUST0495 (Income).
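A per-customer listing like the one above could be produced along the following lines. This is a sketch under stated assumptions: missing values are represented as None, the rows and IDs are invented for illustration, and missing_report is a hypothetical helper rather than the method actually used for this report.

```python
# Hypothetical sample rows; None marks a missing value (all values are made up).
rows = [
    {"Customer_ID": "ID_A", "Income": 52000, "Loan_Balance": None, "Credit_Score": 640},
    {"Customer_ID": "ID_B", "Income": None, "Loan_Balance": 12000, "Credit_Score": 710},
    {"Customer_ID": "ID_C", "Income": None, "Loan_Balance": None, "Credit_Score": 580},
]


def missing_report(rows, id_field="Customer_ID"):
    """Map each field that has gaps to the list of customer IDs missing it."""
    report = {}
    for row in rows:
        for field, value in row.items():
            if field != id_field and value is None:
                report.setdefault(field, []).append(row[id_field])
    return report


report = missing_report(rows)
for field, ids in report.items():
    print(f"{field}: {len(ids)} missing ({', '.join(ids)})")
```

Grouping by field rather than by customer makes it easier to see which columns need an imputation decision.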

Missing data treatment:

| Issue | Handling Method | Justification |
|---|---|---|
| Missing Income values | Imputation (Median) | Median imputation is robust to outliers and maintains distribution for skewed income data. |
| Missing Loan_Balance values | Imputation (Median) | Median imputation is suitable for numerical data and less sensitive to extreme loan values. |
| Missing Debt_to_Income_Ratio values | Imputation (Median) | Median imputation helps preserve the overall distribution of this ratio, which can be sensitive to outliers. |
| Missing Credit_Score values | Imputation (Median) | Median imputation is a reasonable approach for credit scores, which can have a wide range and potential outliers. |
| Inconsistent Employment_Status values | Standardization | Standardizing terms (e.g., "EMP", "employed", "Employed" to "Employed") ensures consistent categorization and accurate analysis. |
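The two treatments in the table can be sketched as follows, assuming missing values are stored as None. The sample incomes are invented, and impute_median, standardize_status, and STATUS_MAP are hypothetical names; only the three label variants ("EMP", "employed", "Employed") come from the report.

```python
from statistics import median


def impute_median(rows, field):
    """Replace None in `field` with the median of the observed values."""
    observed = [row[field] for row in rows if row[field] is not None]
    fill_value = median(observed)
    for row in rows:
        if row[field] is None:
            row[field] = fill_value
    return fill_value


# Variants named in the report; any other label is left unchanged.
STATUS_MAP = {"emp": "Employed", "employed": "Employed"}


def standardize_status(value):
    """Map known variants of an employment label to one canonical form."""
    cleaned = value.strip()
    return STATUS_MAP.get(cleaned.lower(), cleaned)


# Hypothetical rows (values are made up); one income is an extreme outlier.
rows = [
    {"Income": 30000, "Employment_Status": "EMP"},
    {"Income": None, "Employment_Status": "employed"},
    {"Income": 50000, "Employment_Status": "Employed"},
    {"Income": 400000, "Employment_Status": "Retired"},
]

fill_value = impute_median(rows, "Income")
for row in rows:
    row["Employment_Status"] = standardize_status(row["Employment_Status"])

print(fill_value, [row["Employment_Status"] for row in rows])
```

In this made-up sample the median fill is 50000, whereas the mean of the observed incomes would be 160000 because of the single outlier, which illustrates the robustness argument given in the table.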

4. Key Findings and Risk Indicators


This section identifies trends and patterns that may indicate risk factors for delinquency.
Feature relationships and statistical correlations are explored to uncover insights
relevant to predictive modeling.

Key findings:

• Correlations observed between key variables:

  o Missed_Payments: Positively associated with Delinquent_Account.
  Customers with more missed payments are more likely to have a
  delinquent account.

  o Credit_Utilization: Higher credit utilization rates are frequently
  associated with Late or Missed statuses in the Month_1 to Month_6
  columns, indicating financial strain.

  o Debt_to_Income_Ratio: Customers with higher debt-to-income ratios
  tend to have more Late or Missed payments, suggesting a reduced
  capacity to manage debt.

  o Credit_Score: Lower credit scores are consistently observed among
  customers with Delinquent_Account status and a history of Late or
  Missed payments.

• Unexpected anomalies:

  o Several accounts show extremely low Credit_Utilization (e.g., 0.05) while
  still exhibiting Late or Missed payments. This could indicate specific
  product types or data anomalies requiring further investigation.
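The checks behind these findings can be sketched as below. Several points are assumptions rather than facts from the dataset: Delinquent_Account is treated as a 0/1 flag, "On-time" is an assumed label for a payment made on schedule, the 0.10 utilization threshold is illustrative, and all rows and helper names are hypothetical.

```python
MONTH_FIELDS = [f"Month_{i}" for i in range(1, 7)]
PROBLEM_STATUSES = {"Late", "Missed"}


def delinquency_rate_by(rows, field):
    """Share of delinquent accounts for each distinct value of `field`."""
    totals, delinquent = {}, {}
    for row in rows:
        key = row[field]
        totals[key] = totals.get(key, 0) + 1
        delinquent[key] = delinquent.get(key, 0) + row["Delinquent_Account"]
    return {key: delinquent[key] / totals[key] for key in sorted(totals)}


def low_utilization_anomalies(rows, threshold=0.10):
    """IDs with utilization below `threshold` and at least one Late/Missed month."""
    return [
        row["Customer_ID"]
        for row in rows
        if row["Credit_Utilization"] < threshold
        and any(row[month] in PROBLEM_STATUSES for month in MONTH_FIELDS)
    ]


def make_row(cid, missed, delinquent, utilization, months):
    """Build one hypothetical record; `months` holds six status labels."""
    row = {
        "Customer_ID": cid,
        "Missed_Payments": missed,
        "Delinquent_Account": delinquent,  # assumed 0/1 flag
        "Credit_Utilization": utilization,
    }
    row.update(zip(MONTH_FIELDS, months))
    return row


# Made-up records for illustration only.
rows = [
    make_row("A", 0, 0, 0.05, ["On-time"] * 6),
    make_row("B", 0, 0, 0.05, ["On-time", "Late", "On-time", "On-time", "Missed", "On-time"]),
    make_row("C", 3, 1, 0.80, ["Late", "Missed", "Late", "On-time", "Missed", "Late"]),
    make_row("D", 3, 0, 0.60, ["On-time"] * 5 + ["Late"]),
]

rates = delinquency_rate_by(rows, "Missed_Payments")
anomalies = low_utilization_anomalies(rows)
print(rates, anomalies)
```

Comparing delinquency rates across groups is a coarse substitute for a correlation coefficient, but it is easy to read and works with a binary target.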

High-risk indicators and their importance:

• High Missed_Payments count: A direct and strong indicator, as past behavior
is often the best predictor of future behavior.

• High Credit_Utilization: Suggests a customer is heavily reliant on credit,
increasing their vulnerability to financial shocks.

• Low Credit_Score: Reflects a history of poor credit management, making future
delinquency more probable.

• High Debt_to_Income_Ratio: Indicates a significant portion of income is
consumed by debt, leaving less disposable income for new obligations.

• Recent Late or Missed payments (in Month_1 to Month_6): These are
immediate and highly relevant signals of deteriorating financial health.
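The last indicator, recent Late or Missed payments, could be turned into a numeric feature along these lines. This is a sketch only: the "On-time" label and the sample record are assumptions, and the helper names are hypothetical.

```python
from collections import Counter

MONTH_FIELDS = [f"Month_{i}" for i in range(1, 7)]


def payment_status_counts(row):
    """Count how often each status label appears across Month_1 to Month_6."""
    return Counter(row[month] for month in MONTH_FIELDS)


def recent_problem_months(row):
    """Number of months in the six-month window marked Late or Missed."""
    counts = payment_status_counts(row)
    return counts["Late"] + counts["Missed"]


# Hypothetical record; "On-time" is an assumed label for a payment made on schedule.
statuses = ["On-time", "Late", "Missed", "On-time", "Late", "On-time"]
row = dict(zip(MONTH_FIELDS, statuses))

problem_months = recent_problem_months(row)
print(problem_months)
```

A count like this condenses six categorical columns into one value that can be compared directly with Missed_Payments during feature engineering.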

5. AI & GenAI Usage


Generative AI tools were used to summarize the dataset, identify patterns, and
suggest imputation strategies. This section documents AI-generated insights and
the prompts used to obtain them.

Example AI prompts used:

• 'Summarize key patterns, outliers, and missing values in this dataset. Highlight
any fields that might present problems for modeling delinquency.'

• 'Identify the top 3 variables most likely to predict delinquency based on this
dataset. Provide brief reasoning.'

• 'Suggest an imputation strategy for missing values in this dataset based on
industry best practices.'

• 'Propose best-practice methods to handle missing credit utilization data for
predictive modeling.'

6. Conclusion & Next Steps


The initial EDA revealed that Missed_Payments, Credit_Utilization,
Debt_to_Income_Ratio, Credit_Score, and recent payment history (Month_1 to
Month_6) are strong early indicators of delinquency risk. While the dataset is relatively
clean, addressing missing values in key financial columns and standardizing categorical
data like Employment_Status are crucial next steps for robust predictive modeling. The
observed anomalies, such as unusually low credit utilization with missed payments,
warrant further investigation to ensure data accuracy and uncover specific risk profiles.

Recommended next steps:

1. Implement the proposed imputation strategies for missing numerical data.

2. Standardize the Employment_Status column to ensure consistency.

3. Further investigate the anomalies in Credit_Utilization to understand their
underlying causes.

4. Proceed with feature engineering and model selection for delinquency prediction.
