1. Read the column description and ensure you understand each attribute well
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Load the dataset and inspect its structure
data = pd.read_csv('Bank_Personal_Loan_Modelling.csv')
data.head()
data.dtypes
data.shape
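Before plotting, a quick data-quality check helps in understanding each attribute; a minimal sketch, assuming the standard columns of this dataset (it also flags the negative Experience entries that are corrected in step 5):
# Summary statistics for every attribute
print(data.describe().T)
# Missing values per column
print(data.isnull().sum())
# Count of negative Experience values (data-entry errors, corrected in step 5)
print((data['Experience'] < 0).sum())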
2. Perform univariate analysis of each and every attribute - use an appropriate plot
for a given attribute and mention your insights
sns.histplot(x=data['Age'], hue =data['Personal Loan'], bins = 5 )
sns.countplot(x=data['Family'], hue=data['Personal Loan'])
sns.countplot(x=data['Education'], hue=data['Personal Loan'])
sns.countplot(x=data['Online'], hue=data['Personal Loan'])
sns.countplot(x=data['Securities Account'], hue=data['Personal Loan'])
sns.countplot(x=data['CD Account'], hue=data['Personal Loan'])
sns.countplot(x=data['CreditCard'], hue=data['Personal Loan'])
sns.histplot(x=data['Income'], hue=data['Personal Loan'], hue_order = [1,0])
Inference:
i. Customers in the 35-45 age group take the maximum number of personal loans.
ii. The bank has the most customers with a family size of 1, but this group has the lowest
personal loan count, whereas customers with families of 3 and 4 members have taken a
greater number of personal loans.
iii. More customers with advanced/professional education take personal loans compared to
customers with undergraduate or graduate education.
iv. A greater number of customers who use internet banking avail personal loans than those
who do not, although the percentage for both groups is approximately the same (see the
check after this list).
v. Customers without a securities account with the bank avail more personal loans.
vi. A greater number of customers without a CD account use personal loans, although a
higher percentage of customers with a CD account do so.
vii. A higher number of customers without credit cards avail personal loans.
viii. The count of people who have taken a personal loan is highest in the income range of
$120,000 – $140,000 and rises again in the $170,000 – $190,000 range. Lower income
groups have not taken personal loans.
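The percentage claims in points iv and vi can be verified directly with row-normalised crosstabs; a minimal sketch, assuming the column names used above:
# Loan take-up rate within each Online banking group
print(pd.crosstab(data['Online'], data['Personal Loan'], normalize='index'))
# Loan take-up rate within each CD Account group
print(pd.crosstab(data['CD Account'], data['Personal Loan'], normalize='index'))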
3. Perform correlation analysis among all the variables - you can use Pairplot and
Correlation coefficients of every attribute with every other attribute
datacor = data.corr()
datacor
sns.pairplot(data, diag_kind='kde')
plt.subplots(figsize=(12,10))
sns.heatmap(datacor,annot=True)
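To read the heatmap more easily, the correlation of every attribute with the target can be pulled out and ranked; a minimal sketch, assuming the Personal Loan column name:
# Rank attributes by the strength of their correlation with Personal Loan
loan_corr = datacor['Personal Loan'].drop('Personal Loan')
print(loan_corr.reindex(loan_corr.abs().sort_values(ascending=False).index))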
4. One hot encode the Education variable (3 points)
OHED = pd.get_dummies(data, columns =['Education'])
OHED
5. Separate the data into dependent and independent variables and create training
and test sets out of them (X_train, y_train, X_test, y_test) (2 points)
Before separating, the redundant columns are dropped and the one-hot-encoded Education
columns are used in place of the original Education column.
Here, the dependent variable and variable of interest is Personal Loan; therefore, the
Personal Loan column becomes the target (y) and the remaining columns are treated as
independent variables (X).
# Correct negative Experience values (applied to the encoded frame so the fix carries into dt)
OHED['Experience'] = OHED['Experience'].abs()
dt = OHED
# Drop identifier columns that carry no predictive information
dt = dt.drop(['ID'], axis=1)
dt = dt.drop(['ZIP Code'], axis=1)
X = dt.drop(['Personal Loan'], axis=1)
Y = dt['Personal Loan']
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.7, test_size=0.3, random_state=100)
X_train
Y_train
print("Training set: {0:0.2f}% ".format((len(X_train)/len(data.index)) * 100))
print("Test set: {0:0.2f}% ".format((len(X_test)/len(data.index)) * 100))
6. Use StandardScaler( ) from sklearn, to transform the training and test data into
scaled values ( fit the StandardScaler object to the train data and transform
train and test data using this object, making sure that the test set does not
influence the values of the train set)
from sklearn.preprocessing import StandardScaler
std_scaler = StandardScaler()
# Fit the scaler on the training data only, then apply the same transformation to the test data
X_train = std_scaler.fit_transform(X_train)
X_test = std_scaler.transform(X_test)
X_train
X_test
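The same leakage-free scaling can also be expressed as a single estimator; a minimal sketch using scikit-learn's Pipeline (the LogisticRegression step is only illustrative, and in practice the pipeline would be fitted on the unscaled X_train):
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
# The pipeline fits the scaler on the training data only, so the test set never
# influences the scaling parameters
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('clf', LogisticRegression(solver='liblinear')),
])
pipe.fit(X_train, Y_train)
print(pipe.score(X_test, Y_test))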
7. Write a function which takes a model, X_train, X_test, y_train and y_test as
input and returns the accuracy, recall, precision, specificity, f1_score of the
model trained on the train set and evaluated on the test set
def funct(confusion_matrix):
    total = sum(sum(confusion_matrix))
    accuracy = (confusion_matrix[0, 0] + confusion_matrix[1, 1]) / total
    print('Accuracy : {:.2%}'.format(accuracy))
    specificity = confusion_matrix[0, 0] / (confusion_matrix[0, 0] + confusion_matrix[0, 1])
    print('Specificity : {:.2%}'.format(specificity))
    sensitivity = confusion_matrix[1, 1] / (confusion_matrix[1, 0] + confusion_matrix[1, 1])
    print('Sensitivity : {:.2%}'.format(sensitivity))
    precision = confusion_matrix[1, 1] / (confusion_matrix[1, 1] + confusion_matrix[0, 1])
    print('Precision : {:.2%}'.format(precision))
    F1 = 2 * ((precision * sensitivity) / (precision + sensitivity))
    print('F1 Score : {:.2%}'.format(F1))
    return accuracy, sensitivity, specificity, precision, F1
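The question asks for a function that takes the model and the data splits directly; a minimal sketch of a hypothetical wrapper, evaluate_model, that trains the model, builds the confusion matrix and reuses funct for the metrics:
from sklearn.metrics import confusion_matrix

def evaluate_model(model, X_train, X_test, y_train, y_test):
    # Train on the training set and evaluate on the test set
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    cm = confusion_matrix(y_test, y_pred)
    # funct returns (accuracy, sensitivity, specificity, precision, F1)
    accuracy, recall, specificity, precision, f1 = funct(cm)
    return accuracy, recall, precision, specificity, f1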
8. Employ multiple Classification models (Logistic, K-NN, Naïve Bayes etc) and use
the function from step 7 to train and get the metrics of the model
Logistic Regression:
# Import libraries
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
# Model
Log_model = LogisticRegression(solver="liblinear")
Log_model.fit(X_train, Y_train)
predict = Log_model.predict(X_test)
coef_df = pd.DataFrame(Log_model.coef_)
coef_df['intercept'] = Log_model.intercept_
print(coef_df)
# Score
score = Log_model.score(X_test, Y_test)
print(score)
# Confusion matrix
from sklearn.metrics import confusion_matrix
conf_mat_Log = confusion_matrix(Y_test, predict)
print(conf_mat_Log)
test_mat = funct(conf_mat_Log)
Naïve Bayes
# Import required libraries
from sklearn.naive_bayes import GaussianNB
# Model
GNB1 = GaussianNB()
GNB1.fit(X_train, Y_train)
predicted_labels_GNB = GNB1.predict(X_test)
GNB1.score(X_test, Y_test)
# Confusion matrix
con_mat = metrics.confusion_matrix(Y_test, predicted_labels_GNB)
print(con_mat)
u = funct(con_mat)
K-NN
# Import required libraries
from sklearn.neighbors import KNeighborsClassifier
# Model
NNH = KNeighborsClassifier(n_neighbors= 5 , weights = 'uniform' )
NNH.fit(X_train, Y_train)
predicted_labels_KNN = NNH.predict(X_test)
NNH.score(X_test, Y_test)
# Create confusion matrix
Conf_matrix_KNN = metrics.confusion_matrix(Y_test, predicted_labels_KNN)
print(Conf_matrix_KNN)
tests = funct(Conf_matrix_KNN)
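The choice of n_neighbors = 5 above can be sanity-checked by scanning a few values of k with cross-validation on the training set; a minimal sketch, assuming 5-fold cross-validation (the range of k is only illustrative):
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
# Compare mean 5-fold CV accuracy on the training set for odd values of k
for k in range(1, 16, 2):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train, Y_train, cv=5)
    print('k = {:2d}: mean CV accuracy = {:.4f}'.format(k, scores.mean()))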
9. Create a dataframe with the columns - “Model”, “accuracy”, “recall”,
“precision”, “specificity”, “f1_score”. Populate the dataframe accordingly
df = pd.DataFrame({'Model': ['Logistic', 'Naive Bayes', 'KNN'],
                   'accuracy': [95.47, 90.53, 95.60],
                   'recall': [63.29, 62.03, 62.03],
                   'precision': [90.91, 54.44, 94.23],
                   'specificity': [99.25, 93.89, 99.55],
                   'f1_score': [74.63, 57.99, 74.81]
                   })
df
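Rather than typing the percentages in by hand, the dataframe can also be populated from the tuples returned by funct in step 8; a minimal sketch, assuming the variables test_mat, u and tests defined above:
# funct returns (accuracy, sensitivity, specificity, precision, F1) as fractions
rows = []
for name, m in [('Logistic', test_mat), ('Naive Bayes', u), ('KNN', tests)]:
    accuracy, recall, specificity, precision, f1 = m
    rows.append({'Model': name,
                 'accuracy': round(accuracy * 100, 2),
                 'recall': round(recall * 100, 2),
                 'precision': round(precision * 100, 2),
                 'specificity': round(specificity * 100, 2),
                 'f1_score': round(f1 * 100, 2)})
df = pd.DataFrame(rows)
df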
10. Give your reasoning on which is the best model in this case
In this case, Naïve Bayes is the worst-performing model, with the lowest accuracy, F1 score,
precision and specificity.
K-NN has the highest accuracy (95.60%), specificity (99.55%), precision (94.23%)
and F1 score (74.81%). Although logistic regression has a slightly higher recall (63.29%)
than K-NN, K-NN still shows an acceptable recall of 62.03%.
Therefore, based on the metrics above, K-NN is the best model for the given bank data.