Exploratory Data Analysis (EDA) Summary
1. Introduction
This report reviews the dataset used for predicting customer delinquency. The goal
is to identify data quality issues, detect key trends or anomalies, and highlight the
top predictors of delinquency risk.
2. Dataset Overview
Number of records: 110 (sample from CUST0001–CUST0110)
Key variables:
- Demographics: Age, Employment_Status, Location
- Financials: Income, Credit_Score, Credit_Utilization, Loan_Balance,
Debt_to_Income_Ratio
- Behavioral: Missed_Payments, Account_Tenure, Credit_Card_Type, Month_1–
Month_6 (repayment behavior)
- Target: Delinquent_Account (1 = delinquent, 0 = non-delinquent)
Data types: Mostly numerical (Age, Income, Credit_Score, Ratios), some categorical
(Employment_Status, Credit_Card_Type, Location).
3. Missing Data Analysis
Variables with missing or inconsistent values:
- Income: Missing for several customers (e.g., CUST0041, CUST0043,
CUST0049, CUST0060).
- Loan_Balance: Missing in multiple rows.
- Employment_Status: Inconsistencies such as ‘EMP’, ‘employed’, and
‘Employed’.
- Credit_Utilization: Contains values at both extremes (as low as 0.05 and
above 1.0).
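The checks behind this list can be reproduced with a short audit. The following is a minimal sketch in Python with pandas, run on a handful of made-up rows rather than the real data; the Customer_ID column name and all values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy rows; values are illustrative, not taken from the dataset.
# "Customer_ID" is an assumed column name.
df = pd.DataFrame({
    "Customer_ID": ["C1", "C2", "C3", "C4"],
    "Income": [52000.0, None, 61000.0, None],
    "Loan_Balance": [12000.0, 8000.0, None, 5000.0],
    "Employment_Status": ["EMP", "employed", "Employed", "Unemployed"],
    "Credit_Utilization": [0.05, 0.45, 1.025, 0.30],
})

# Count missing values per column.
missing_counts = df.isna().sum()

# List the distinct raw labels to expose casing and abbreviation variants.
status_variants = sorted(df["Employment_Status"].dropna().unique())

# Flag utilization ratios outside the expected 0-1 range.
util = df["Credit_Utilization"]
out_of_range = df[(util < 0) | (util > 1)]

print(missing_counts)
print(status_variants)
print(out_of_range[["Customer_ID", "Credit_Utilization"]])
```

Running the same three checks on the full dataset would give exact counts per column instead of "several" or "multiple".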
Recommended treatment:
- Impute Income and Loan_Balance using median values by
Employment_Status.
- Standardize categorical text (e.g., unify Employment_Status casing).
Identifying and addressing missing data is critical to ensuring model accuracy.
Median imputation is less sensitive to extreme values than the mean, and grouping
by Employment_Status keeps imputed values closer to the typical level of each
customer segment.
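The recommended treatment could look roughly like the sketch below (Python with pandas, toy data). Two choices here are assumptions rather than part of the report: 'EMP' is treated as an abbreviation of 'Employed', and the overall median is used as a fallback when every value in a group is missing.

```python
import pandas as pd

# Hypothetical toy rows; values are illustrative only.
df = pd.DataFrame({
    "Employment_Status": ["EMP", "employed", "Employed",
                          "Unemployed", "unemployed", "Self-employed"],
    "Income": [50000.0, None, 60000.0, 20000.0, None, None],
    "Loan_Balance": [10000.0, 12000.0, None, 4000.0, 6000.0, 9000.0],
})

# Assumption: 'EMP' is an abbreviation of 'Employed'.
status_map = {"emp": "employed"}
cleaned = df["Employment_Status"].str.strip().str.lower().replace(status_map)
df["Employment_Status"] = cleaned.str.capitalize()

for col in ["Income", "Loan_Balance"]:
    # Median of the observed values within each employment group.
    group_median = df.groupby("Employment_Status")[col].transform("median")
    # Overall median of the observed values, used only when a whole
    # group has no observed value (assumed fallback).
    overall_median = df[col].median()
    df[col] = df[col].fillna(group_median).fillna(overall_median)

print(df)
```

Standardizing the labels first matters: without it, 'EMP', 'employed', and 'Employed' would form three separate groups and each would get its own median.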
4. Key Findings and Risk Indicators
Correlations and patterns:
- Higher Missed_Payments and frequent ‘Late’/‘Missed’ entries across months
align with delinquency.
- Low Credit_Score (<450) and high Credit_Utilization (>0.6) correlate with
delinquency.
- Unemployed and Self-employed customers show higher risk.
- Short Account_Tenure (<5 years) may indicate instability.
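One way to check the patterns above is to turn each threshold into a flag and compare delinquency rates between flagged and unflagged customers. The sketch below uses Python with pandas on made-up rows; the 'On-time' label and the `delinquency_rate` helper are hypothetical, and only two of the six month columns are shown.

```python
import pandas as pd

# Hypothetical toy rows; 'On-time' is an assumed label for a normal payment.
df = pd.DataFrame({
    "Credit_Score": [420, 700, 430, 650, 610, 440],
    "Credit_Utilization": [0.80, 0.20, 0.70, 0.65, 0.30, 0.25],
    "Month_1": ["Late", "On-time", "Missed", "On-time", "On-time", "Late"],
    "Month_2": ["Missed", "On-time", "Late", "Late", "On-time", "On-time"],
    "Delinquent_Account": [1, 0, 1, 0, 0, 1],
})

# Count 'Late' or 'Missed' entries per customer across the month columns.
month_cols = ["Month_1", "Month_2"]
df["Late_Or_Missed_Count"] = df[month_cols].isin(["Late", "Missed"]).sum(axis=1)

# Flags based on the thresholds quoted in the findings.
low_score = df["Credit_Score"] < 450
high_util = df["Credit_Utilization"] > 0.6


def delinquency_rate(mask):
    """Share of delinquent accounts among the rows selected by mask."""
    return df.loc[mask, "Delinquent_Account"].mean()


rates = {
    "low_score": delinquency_rate(low_score),
    "not_low_score": delinquency_rate(~low_score),
    "high_util": delinquency_rate(high_util),
    "not_high_util": delinquency_rate(~high_util),
}
print(rates)
```

With only 110 records, differences in these rates should be read as indicative rather than conclusive.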
Unexpected anomalies:
- Credit_Utilization values above 1.0 (e.g., 1.025) suggest scaling errors or
balances over the credit limit.
- Mixed casing for Employment_Status.
- Missing financial data reduces completeness.
5. AI & GenAI Usage
AI-assisted analysis was used to summarize key data patterns, identify missing
value trends, and detect anomalies. Prompts focused on identifying predictors of
delinquency and highlighting quality issues.
Prompts used:
- ‘Summarize key patterns, outliers, and missing values in this dataset.’
- ‘Identify early indicators of delinquency risk.’
- ‘Recommend imputation and data cleaning strategies for inconsistent
categorical fields.’
6. Conclusion & Next Steps
Top 3 variables most likely to predict delinquency:
1. Credit_Utilization – Higher utilization rates often signal financial strain.
2. Missed_Payments – Direct behavioral evidence of repayment issues.
3. Credit_Score – Reflects long-term creditworthiness.
Next Steps:
- Clean and standardize categorical variables.
- Impute missing financial values.
- Cap or normalize outlier Credit_Utilization values.
- Conduct correlation and feature importance analysis before model training.
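The capping and correlation steps might be sketched as follows (Python with pandas, toy data). Capping at 1.0 is only one option and assumes values above 1.0 are not meaningful. A feature importance analysis would additionally require a fitted model, which is not shown here.

```python
import pandas as pd

# Hypothetical toy rows; values are illustrative only.
df = pd.DataFrame({
    "Credit_Utilization": [0.05, 0.45, 1.025, 0.80, 0.30, 0.90],
    "Missed_Payments": [0, 1, 5, 4, 0, 3],
    "Credit_Score": [720, 640, 410, 430, 690, 445],
    "Delinquent_Account": [0, 0, 1, 1, 0, 1],
})

# Cap utilization at 1.0 (assumed treatment for out-of-range values).
df["Credit_Utilization"] = df["Credit_Utilization"].clip(upper=1.0)

# Pearson correlation of each candidate predictor with the binary target.
features = ["Credit_Utilization", "Missed_Payments", "Credit_Score"]
target_corr = df[features].corrwith(df["Delinquent_Account"])

# Order the predictors by absolute correlation strength, strongest first.
order = target_corr.abs().sort_values(ascending=False).index
target_corr = target_corr.reindex(order)

print(target_corr)
```

A negative sign for Credit_Score and positive signs for the other two would be consistent with the findings in section 4.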
Overall, the dataset provides a rich mix of behavioral and financial indicators for
delinquency modeling. However, several quality issues were observed, including
missing income and loan data, inconsistent text formatting, and extreme credit
utilization values. The monthly repayment patterns and missed payment counts are
strong early indicators of risk. Once missing values and anomalies are addressed,
the dataset will be suitable for reliable delinquency prediction modeling.