5/13/25, 7:12 AM    Tata Data Analytics.ipynb - Colab
Exploratory Data Analysis
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ML Projects/Tata Data Analysis/Deli
# Missing Value Analysis
missing_values = df.isnull().sum().sort_values(ascending=False)
missing_percent = (missing_values / len(df)) * 100
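The two series above are easier to read side by side. A minimal sketch of one way to combine them, assuming a small stand-in frame: `toy` and its values are hypothetical and are not taken from the notebook's dataset, and `missing_summary` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the notebook's df (values invented for illustration)
toy = pd.DataFrame({
    "Income": [50000.0, np.nan, 72000.0, np.nan],
    "Credit_Score": [610.0, 580.0, np.nan, 700.0],
    "Age": [34, 51, 29, 45],
})

# Same calculation as in the notebook cell
missing_values = toy.isnull().sum().sort_values(ascending=False)
missing_percent = (missing_values / len(toy)) * 100

# Combine counts and percentages, keeping only columns that have gaps
missing_summary = pd.DataFrame({
    "missing_count": missing_values,
    "missing_percent": missing_percent,
})
missing_summary = missing_summary[missing_summary["missing_count"] > 0]
print(missing_summary)
```

Filtering on `missing_count > 0` keeps the table short when most columns are complete.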
# Univariate Analysis: Numerical Features
# Get numerical columns from the dataframe
numerical_cols = df.select_dtypes(include=['number']).columns
# Calculate the summary statistics
numerical_summary = df[numerical_cols].describe()
# Univariate Analysis: Categorical Features
categorical_cols = df.select_dtypes(include=['object']).columns  # Define categorical_cols
categorical_summary = df[categorical_cols].describe()
# Identify numerical columns
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns.tolist()
numerical_cols.remove('Delinquent_Account')  # Exclude the target
# Apply median imputation
for col in numerical_cols:
    median_value = df[col].median()
    df[col].fillna(median_value, inplace=True)
:4: FutureWarning: A value is trying to be set on a copy
The behavior will change in pandas 3.0. This inplace method will never work because the
For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col
  df[col].fillna(median_value, inplace=True)
# Optional: Check if missing values remain
print(df[numerical_cols].isnull().sum())
Age                     0
Income
Credit_Score
Credit_Utilization
Missed_Payments
Loan_Balance
Debt_to_Income_Ratio
Account_Tenure
dtype: int64
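The FutureWarning shown for the imputation cell comes from calling `fillna(..., inplace=True)` on `df[col]`. A minimal sketch of two assignment-based alternatives that avoid that pattern, assuming a small stand-in frame: `toy`, `filled_loop`, and `filled_vec` are hypothetical names with invented values, not the notebook's data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the notebook's df (values invented for illustration)
toy = pd.DataFrame({
    "Income": [50000.0, np.nan, 72000.0, 64000.0],
    "Credit_Score": [610.0, 580.0, np.nan, 700.0],
    "Delinquent_Account": [0, 1, 0, 0],
})

numerical_cols = toy.select_dtypes(include="number").columns.tolist()
numerical_cols.remove("Delinquent_Account")  # Exclude the target

# Option 1: assign the filled column back instead of using inplace=True
filled_loop = toy.copy()
for col in numerical_cols:
    filled_loop[col] = filled_loop[col].fillna(filled_loop[col].median())

# Option 2: fill all numerical columns at once with their own medians
filled_vec = toy.copy()
medians = filled_vec[numerical_cols].median()
filled_vec[numerical_cols] = filled_vec[numerical_cols].fillna(medians)

print(filled_loop[numerical_cols].isnull().sum())
```

Both options write the result back through the parent DataFrame, so they do not rely on the chained in-place behavior that the warning says will stop working in pandas 3.0.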
# Plotting distributions for numerical features
fig, axes = plt.subplots(len(numerical_cols), 1, figsize=(8, len(numerical_cols)*3))
for i, col in enumerate(numerical_cols):
    sns.histplot(df[col], kde=True, ax=axes[i])
    axes[i].set_title(f"Distribution of {col}")
plt.tight_layout()
plt.show()
[Figure: eight stacked histograms with KDE curves (y-axis: count), titled "Distribution of Age", "Distribution of Income", "Distribution of Credit_Score", "Distribution of Credit_Utilization", "Distribution of Missed_Payments", "Distribution of Loan_Balance", "Distribution of Debt_to_Income_Ratio", and "Distribution of Account_Tenure".]
# Target Variable Distribution
plt.figure(figsize=(10, 8))
sns.countplot(x='Delinquent_Account', data=df)
plt.title("Target Variable Distribution")
plt.show()
[Figure: count plot titled "Target Variable Distribution"; x-axis: Delinquent_Account, y-axis: count.]
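The count plot shows the class balance visually. The same information can be read as numbers with `value_counts`. A minimal sketch, assuming a hypothetical stand-in frame `toy` with invented labels, so the proportions below say nothing about the real dataset:

```python
import pandas as pd

# Hypothetical target column (labels invented for illustration)
toy = pd.DataFrame({"Delinquent_Account": [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]})

# Absolute counts per class
class_counts = toy["Delinquent_Account"].value_counts()

# Share of each class, useful for spotting class imbalance
class_share = toy["Delinquent_Account"].value_counts(normalize=True)

print(class_counts)
print(class_share)
```

If one class dominates, accuracy alone can look good for a model that rarely predicts the minority class, which is one reason to also look at precision, recall, and F1.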
# Correlation Heatmap
plt.figure(figsize=(10, 8))
# numerical_cols is already a list, no need to call tolist()
sns.heatmap(df[numerical_cols + ['Delinquent_Account']].corr(), annot=True, cmap="coolwarm")
plt.title("Correlation Matrix")
plt.show()
[Figure: annotated heatmap titled "Correlation Matrix" covering the numerical features (including Credit_Score, Credit_Utilization, Missed_Payments, Loan_Balance, Debt_to_Income_Ratio, Account_Tenure) and Delinquent_Account.]
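The heatmap shows every pairwise correlation. When the question is which features move with the target, it can help to pull out just the target's column and sort it. A minimal sketch, assuming a hypothetical stand-in frame `toy` with invented values; `target_corr` is an illustrative name.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's df (values invented for illustration)
toy = pd.DataFrame({
    "Missed_Payments": [0, 1, 4, 5, 2, 6],
    "Credit_Score": [720, 690, 540, 500, 650, 480],
    "Delinquent_Account": [0, 0, 1, 1, 0, 1],
})

# Pearson correlation matrix, the same call the heatmap cell uses
corr = toy.corr()

# Correlation of each feature with the target, strongest first by magnitude
target_corr = (
    corr["Delinquent_Account"]
    .drop("Delinquent_Account")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(target_corr)
```

Sorting by absolute value keeps strong negative relationships near the top instead of at the bottom of the list.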
!pip install ace_tools
Collecting ace_tools
  Downloading ace_tools-0.0-py3-none-any.whl.metadata (300 bytes)
Downloading ace_tools-0.0-py3-none-any.whl (1.1 kB)
Installing collected packages: ace_tools
Successfully installed ace_tools-0.0
df.head(10)
   Customer_ID  Age    Income  Credit_Score  Credit_Utilization
0     CUST0001   56  165580.0         398.0            0.390502
1     CUST0002   69  100999.0         493.0            0.312444
2     CUST0003   46  188416.0         500.0            0.359930
3     CUST0004   32  101672.0         413.0            0.371400
4     CUST0005   60   38524.0         487.0            0.234716
5     CUST0006   25   84042.0         700.0            0.650540
6     CUST0007   38   35056.0         364.0            0.390581
7     CUST0008   56  123215.0         415.0            0.532715
8     CUST0009   36   66991.0         405.0            0.413035
9     CUST0010   40   34870.0         679.0            0.361824
[Remaining columns, beginning with Missed_Payments, are cut off in the printed output.]

Predicting Delinquency with AI

# Re-import necessary packages after kernel reset
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Define features and target
X = df.drop(columns=['Delinquent_Account', 'Customer_ID'])
y = df['Delinquent_Account']
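The imports in this cell point toward a preprocessing-plus-classifier workflow. A minimal sketch of how those pieces could fit together follows. Everything specific in it is an assumption for illustration: the synthetic `toy` data, the `Employment_Status` column (a hypothetical categorical feature), the median and most-frequent imputation strategies, and the choice of `LogisticRegression` as the classifier. It is not the notebook's actual configuration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (invented; not the notebook's dataset)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Credit_Utilization": rng.uniform(0.05, 1.0, n),
    "Missed_Payments": rng.integers(0, 7, n),
    # Hypothetical categorical feature, included only to exercise OneHotEncoder
    "Employment_Status": rng.choice(["employed", "self-employed", "unemployed"], n),
})
# Synthetic target loosely tied to Missed_Payments
toy["Delinquent_Account"] = (
    toy["Missed_Payments"] + rng.normal(0, 1.5, n) > 4
).astype(int)

X = toy.drop(columns=["Delinquent_Account"])
y = toy["Delinquent_Account"]

numeric_features = ["Credit_Utilization", "Missed_Payments"]
categorical_features = ["Employment_Status"]

# Impute and scale numeric columns; impute and one-hot encode categorical ones
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_features),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_features),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Stratified split keeps the class ratio similar in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred, zero_division=0),
    "recall": recall_score(y_test, y_pred, zero_division=0),
    "f1": f1_score(y_test, y_pred, zero_division=0),
}
print(metrics)
```

Keeping imputation and scaling inside the pipeline means they are fitted on the training split only, which avoids leaking test-set statistics into preprocessing. Swapping the `clf` step for `DecisionTreeClassifier` or `MLPClassifier` would reuse the same preprocessing unchanged.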