Topic
PREDICTING POSSIBLE LOAN DEFAULTERS
OUR TEAM
KUSHALA B GOWDA    1KS23AI023    Coding and Data Analysis
MANISHA T P        1KS23AI028    Coding and Data collection
ANUSHA C           1KS23AI003    Presentation Layout and Design
LIKHITHA M         1KS23AI025    Report and editing
SOUJANYA                         Presentation, Coding, Typing
Objectives
The program aims to predict the likelihood of
loan defaults among borrowers using statistical,
probabilistic, and machine learning techniques.
This helps financial institutions make informed
lending decisions and manage risk effectively.
Language
The programming language used is Python.
LOAN
In the last class we discussed:
1. PROBLEMS FACED BY BANKS
2. SOLUTIONS
3. MATHEMATICAL TOOLS
* STATISTICS
* GRAPHS
* PROBABILITY
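As a small illustration of how statistics and probability apply to this problem, the sketch below estimates the overall default probability and the conditional probability of default given house ownership. The records in `toy` are hypothetical values invented for this example; only the column names follow the dataset.

```python
import pandas as pd

# Hypothetical toy records, invented for illustration only (not taken from the dataset).
toy = pd.DataFrame({
    "House_Ownership": ["rented", "rented", "owned", "rented",
                        "owned", "rented", "owned", "rented"],
    "Risk_Flag":       [1, 0, 0, 1, 0, 0, 0, 1],
})

# Empirical probability of default: the share of rows with Risk_Flag == 1.
p_default = toy["Risk_Flag"].mean()

# Conditional probability P(default | House_Ownership):
# the mean of the flag inside each ownership group.
p_default_given_house = toy.groupby("House_Ownership")["Risk_Flag"].mean()

print(f"P(default) = {p_default:.3f}")
print(p_default_given_house)
```

Comparing the conditional values with the overall value gives a first hint of which borrower groups carry more risk.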
SAMPLE DATA
ID  Income   Age  Experience  Married/Single  House_Ownership  Car_Ownership  Profession        CITY         STATE           CURRENT_JOB_YRS  CURRENT_HOUSE_YRS
1   7393090  59   19          single          rented           no             Geologist         Malda        West Bengal     4                13
2   1215004  25   5           single          rented           no             Firefighter       Jalna        Maharashtra     5                10
3   8901342  50   12          single          rented           no             Lawyer            Thane        Maharashtra     9                14
4   1944421  49   9           married         rented           yes            Analyst           Latur        Maharashtra     3                12
5   13429    25   18          single          rented           yes            Comedian          Berhampore   West Bengal     13               11
6   3437621  78   14          single          rented           no             Economist         Ramgarh      Jharkhand       3                10
7   5101498  55   0           married         rented           no             Artist            Pallavaram   Tamil Nadu      0                14
8   6716946  70   15          single          rented           yes            Flight attendant  Yamunanagar  Haryana         14               13
9   8369802  43   7           single          rented           no             Secretary         Anand        Gujarat         6                13
10  9565457  65   5           single          rented           yes            Engineer          Nandyal      Andhra Pradesh  3                12
CODE
LANGUAGE -->PYTHON
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_theme(style = "darkgrid")
data = pd.read_csv("/kaggle/input/loan-prediction-based-on-customer-behavior/Training Data.csv")
data.head()
rows, columns = data.shape #understanding the data set
print('Rows:', rows)
print('Columns:', columns)
data.info()
data.isnull().sum()
data.columns
data.describe() #Analysing Numerical columns
data.corr(numeric_only=True) # numeric columns only; pandas >= 2.0 raises an error on text columns otherwise
data.hist( figsize = (22, 20) )
plt.show()
data["Risk_Flag"].value_counts()
fig, ax = plt.subplots( figsize = (12,8) )
corr_matrix = data.corr(numeric_only=True) # numeric columns only; the text columns are not encoded yet
corr_heatmap = sns.heatmap( corr_matrix, cmap = "flare", annot=True, ax=ax, annot_kws={"size": 14})
plt.show()
def categorical_valcount_hist(feature): #Analysing the categorical features
    print(data[feature].value_counts())
    fig, ax = plt.subplots( figsize = (6,6) )
    sns.countplot(x=feature, ax=ax, data=data)
    plt.show()
categorical_valcount_hist("Married/Single")
categorical_valcount_hist("House_Ownership")
print( "Total categories in STATE:", len(data["STATE"].unique() ) )
print()
print(data["STATE"].value_counts() )
print( "Total categories in Profession:", len( data["Profession"].unique() ) )
print()
data["Profession"].value_counts()
data.info() #Data Analysis
for col in ["Income", "Age", "Experience", "CURRENT_JOB_YRS", "CURRENT_HOUSE_YRS"]:
    sns.boxplot(x="Risk_Flag", y=col, data=data)
    plt.show() # draw each boxplot on its own figure instead of overlaying them
fig, ax = plt.subplots( figsize = (8, 6) )
sns.countplot(x='House_Ownership', hue='Risk_Flag', ax=ax, data=data)
plt.show()
fig, ax = plt.subplots( figsize = (8,6) )
sns.countplot(x='Car_Ownership', hue='Risk_Flag', ax=ax, data=data)
plt.show()
fig, ax = plt.subplots( figsize = (8,6) )
sns.countplot(x='Married/Single', hue='Risk_Flag', ax=ax, data=data)
plt.show()
fig, ax = plt.subplots( figsize = (10,8) )
sns.boxplot(x = "Risk_Flag", y = "CURRENT_JOB_YRS", hue='House_Ownership', ax=ax, data = data)
plt.show()
#Feature Engineering
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
import category_encoders as ce
data.info()
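label_encoder = LabelEncoder()
for col in ['Married/Single', 'Car_Ownership']:
    data[col] = label_encoder.fit_transform( data[col] )
# sparse_output replaces the older `sparse` argument (scikit-learn >= 1.2)
onehot_encoder = OneHotEncoder(sparse_output = False)
house_encoded = onehot_encoder.fit_transform( data[['House_Ownership']] )
# One-hot encoding yields one column per category, so join them as new columns
house_cols = onehot_encoder.get_feature_names_out(['House_Ownership'])
data = data.join( pd.DataFrame(house_encoded, columns=house_cols, index=data.index) )
data = data.drop(columns=['House_Ownership'])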
high_card_features = ['Profession', 'CITY', 'STATE']
count_encoder = ce.CountEncoder()
# Transform the features, rename the columns with the _count suffix, and join to dataframe
count_encoded = count_encoder.fit_transform( data[high_card_features] )
data = data.join(count_encoded.add_suffix("_count"))
data.head()
data= data.drop(labels=['Profession', 'CITY', 'STATE'], axis=1)
data.head()
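The count encoding applied above to Profession, CITY and STATE replaces each category with the number of rows in which it appears. A minimal sketch of the same idea in plain pandas follows; the `toy` frame is hypothetical, and `map` with `value_counts` stands in here for `ce.CountEncoder`.

```python
import pandas as pd

# Hypothetical toy column, invented for illustration.
toy = pd.DataFrame({
    "STATE": ["Maharashtra", "West Bengal", "Maharashtra",
              "Gujarat", "Maharashtra", "West Bengal"]
})

# Count encoding: map each category to the number of rows that carry it.
state_counts = toy["STATE"].value_counts()
toy["STATE_count"] = toy["STATE"].map(state_counts)

print(toy)
```

This keeps a single numeric column per feature, which is why it suits features with many categories better than one-hot encoding.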
#Splitting the data into train and test splits
x = data.drop("Risk_Flag", axis=1)
y = data["Risk_Flag"]
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y,
    test_size = 0.2, stratify = y, random_state = 7)
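The `stratify` argument keeps the share of defaulters the same in the train and test parts, which matters when one class is rare. The sketch below checks this on hypothetical labels (90 non-defaulters and 10 defaulters, invented for illustration); `x_toy` is only a placeholder feature.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 non-defaulters (0) and 10 defaulters (1).
y_toy = np.array([0] * 90 + [1] * 10)
x_toy = np.arange(100).reshape(-1, 1)  # placeholder feature, illustration only

x_tr, x_te, y_tr, y_te = train_test_split(
    x_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=7
)

# With stratification both parts keep the original 10% defaulter share.
print("Defaulter share, train:", y_tr.mean())
print("Defaulter share, test: ", y_te.mean())
```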
#Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
rf_clf = RandomForestClassifier(criterion='gini', bootstrap=True, random_state=100)
smote_sampler = SMOTE(random_state=9)
pipeline = Pipeline(steps = [['smote', smote_sampler], ['classifier', rf_clf]])
pipeline.fit(x_train, y_train)
y_pred = pipeline.predict(x_test)
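SMOTE balances the training data by creating synthetic minority-class rows that lie between an existing minority row and one of its nearest minority neighbours. The sketch below is a simplified illustration of that interpolation step in NumPy, not the imblearn implementation; the function name `smote_like_samples` and the points in `minority_toy` are hypothetical.

```python
import numpy as np

def smote_like_samples(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority point as the starting point.
        i = rng.integers(len(minority))
        # Distances from the chosen point to every minority point.
        dist = np.linalg.norm(minority - minority[i], axis=1)
        # Indices of the k nearest neighbours, skipping the point itself
        # (assumes no duplicate rows, so the point itself sorts first).
        neighbours = np.argsort(dist)[1:k + 1]
        j = rng.choice(neighbours)
        lam = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(minority[i] + lam * (minority[j] - minority[i]))
    return np.array(synthetic)

# Hypothetical minority-class points with two features, invented for illustration.
minority_toy = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [2.5, 2.5]])
new_points = smote_like_samples(minority_toy, n_new=3)
print(new_points)
```

Because every synthetic point sits on a line segment between two real minority points, the new rows stay inside the region the minority class already occupies.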
from sklearn.metrics import (confusion_matrix, precision_score, recall_score, f1_score,
                             accuracy_score, roc_auc_score)
print("-------------------------TEST SCORES-----------------------")
print(f"Recall: {round(recall_score(y_test, y_pred)*100, 4)}")
print(f"Precision: {round(precision_score(y_test, y_pred)*100, 4)}")
print(f"F1-Score: {round(f1_score(y_test, y_pred)*100, 4)}")
print(f"Accuracy score: {round(accuracy_score(y_test, y_pred)*100, 4)}")
print(f"AUC Score: {round(roc_auc_score(y_test, y_pred)*100, 4)}")
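The scores printed above all derive from the four cells of the confusion matrix. As a worked example, the sketch below computes them by hand; the labels in `y_true` and `y_hat` are hypothetical and are not results from the model.

```python
# Hypothetical true and predicted labels, invented for illustration (1 = defaulter).
y_true = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0]
y_hat  = [1, 0, 0, 1, 0, 1, 0, 1, 0, 0]

pairs = list(zip(y_true, y_hat))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # defaulters correctly flagged
fp = sum(t == 0 and p == 1 for t, p in pairs)  # good borrowers wrongly flagged
fn = sum(t == 1 and p == 0 for t, p in pairs)  # defaulters that were missed
tn = sum(t == 0 and p == 0 for t, p in pairs)  # good borrowers correctly passed

precision = tp / (tp + fp)          # of those flagged, how many really default
recall = tp / (tp + fn)             # of the real defaulters, how many were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / len(y_true)

print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, "
      f"F1: {f1:.2f}, Accuracy: {accuracy:.2f}")
```

For a bank, recall on the defaulter class is usually the number to watch, since a missed defaulter costs more than a wrongly rejected applicant.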
References
YOUTUBE CHANNELS
@NYCDataScienceAcademy
@PyDataTV
WEBSITES
global.pydata.org
numfocus.org
https://www.kaggle.com/
THANK YOU