11722, 2332 AM.
Scaler DSML Feature Engg Class Code - Jupyter Notebook

In [ ]: # Home Loan decision automation

In [8]: import pandas as pd
        import numpy as np
        import matplotlib.pyplot as plt
        import seaborn as sns

In [9]: local_data_path = r"E:\DATA_SCIENCE\Scaler\Data\loan-prediction\train.csv"

In [10]: # https://drive.google.com/drive/folders/1QFDGIHCZPqS5kD7_Uo8SCH9BSAZAVBQj?usp=st

In [12]: # Step 1: Data exploration (Basic)

In [14]: data = pd.read_csv(local_data_path)

In [15]: data.info()
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype
 0   Loan_ID            614 non-null    object
 1   Gender             601 non-null    object
 2   Married            611 non-null    object
 3   Dependents         599 non-null    object
 4   Education          614 non-null    object
 5   Self_Employed      582 non-null    object
 6   ApplicantIncome    614 non-null    int64
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Property_Area      614 non-null    object
 12  Loan_Status        614 non-null    object
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB

In [16]: data = data.drop("Loan_ID", axis=1)
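Before dropping a column such as Loan_ID, it can help to confirm that it really is a pure identifier. The sketch below is a minimal illustration on a made-up frame (the values and the `toy` name are assumptions, not rows from train.csv): a column with one distinct value per row carries no pattern a model could generalize from.

```python
import pandas as pd

# Hypothetical miniature of the loan table; values are made up for illustration.
toy = pd.DataFrame({
    "Loan_ID": ["LP001", "LP002", "LP003", "LP004"],
    "Gender": ["Male", "Female", None, "Male"],
    "ApplicantIncome": [5000, 3000, 4000, 6000],
})

# An identifier has exactly one distinct value per row.
is_identifier = toy["Loan_ID"].nunique() == len(toy)

if is_identifier:
    toy = toy.drop(columns=["Loan_ID"])

print(toy.columns.tolist())
```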
localhost:8888/notebooks/Jupyter Notebooks/Scaler/Scaler DSML Feature Engg Class Code.ipynb
In [17]: data.describe()
Out[17]:
       ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  Credit_History
count       614.000000         614.000000  592.000000         600.00000      564.000000
mean       5403.459283        1621.245798  146.412162         342.00000        0.842199
std        6109.041673        2926.248369   85.587325          65.12041        0.364878
min         150.000000           0.000000    9.000000          12.00000        0.000000
25%        2877.500000           0.000000  100.000000         360.00000        1.000000
50%        3812.500000        1188.500000  128.000000         360.00000        1.000000
75%        5795.000000        2297.250000  168.000000         360.00000        1.000000
max       81000.000000       41667.000000  700.000000         480.00000        1.000000
In [19]: data.describe(include=["object"]).transpose()
Out[19]:
               count  unique        top  freq
Gender           601       2       Male   489
Married          611       2        Yes   398
Dependents       599       4          0   345
Education        614       2   Graduate   480
Self_Employed    582       2         No   500
Property_Area    614       3  Semiurban   233
Loan_Status      614       2          Y   422
In [25]: data.Loan_Status
Out[25]:
      ..
609    Y
610    Y
611    Y
612    Y
613    N
Name: Loan_Status, Length: 614, dtype: object
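The target is imbalanced: the describe() output above reports 422 "Y" values out of 614 rows. As a rough sketch, the snippet below rebuilds a series with just those counts (the row order is made up) to show the class shares and the accuracy a model would get by always predicting the majority class, which is a useful baseline to keep in mind later.

```python
import pandas as pd

# Counts taken from the describe() output: 422 "Y" out of 614 rows.
# The ordering of rows here is artificial; only the counts matter.
status = pd.Series(["Y"] * 422 + ["N"] * (614 - 422), name="Loan_Status")

share = status.value_counts(normalize=True)

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = share.max()

print(share.round(3).to_dict())
print(round(baseline_accuracy, 3))
```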
In [26]: # Step 2: Brainstorming
In [27]: data.info()
RangeIndex: 614 entries, 0 to 613
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype
 0   Gender             601 non-null    object
 1   Married            611 non-null    object
 2   Dependents         599 non-null    object
 3   Education          614 non-null    object
 4   Self_Employed      582 non-null    object
 5   ApplicantIncome    614 non-null    int64
 6   CoapplicantIncome  614 non-null    float64
 7   LoanAmount         592 non-null    float64
 8   Loan_Amount_Term   600 non-null    float64
 9   Credit_History     564 non-null    float64
 10  Property_Area      614 non-null    object
 11  Loan_Status        614 non-null    object
dtypes: float64(4), int64(1), object(7)
memory usage: 57.7+ KB

In [28]: # Step 3: Look at basic distributions (univariates)

In [29]: # Step 4: Handle missing values
         data.isna().sum()
Out[29]: Gender               13
         Married               3
         Dependents           15
         Education             0
         Self_Employed        32
         ApplicantIncome       0
         CoapplicantIncome     0
         LoanAmount           22
         Loan_Amount_Term     14
         Credit_History       50
         Property_Area         0
         Loan_Status           0
         dtype: int64

In [33]: def missing_to_df(df):
             # Missing-value count and fraction per column, largest first
             total_missing_df = df.isnull().sum().sort_values(ascending=False)
             percent_missing_df = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)
             missing_data_df = pd.concat(
                 [total_missing_df, percent_missing_df], axis=1,
                 keys=["Total", "Percent"]
             )
             return missing_data_df
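The same missing-value summary can be built more compactly, since the mean of a boolean mask is the fraction of True values. The sketch below uses a small made-up frame (the `toy` data is an assumption for illustration). Note that the "Percent" column is a fraction between 0 and 1, so 0.081 means 8.1% of rows.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few gaps; not taken from train.csv.
toy = pd.DataFrame({
    "Credit_History": [1.0, np.nan, 0.0, np.nan],
    "LoanAmount": [120.0, 66.0, np.nan, 141.0],
    "Education": ["Graduate", "Graduate", "Not Graduate", "Graduate"],
})

summary = pd.DataFrame({
    "Total": toy.isna().sum(),     # missing cells per column
    "Percent": toy.isna().mean(),  # same count divided by the number of rows
}).sort_values("Total", ascending=False)

# Keep only the columns that actually have gaps.
print(summary[summary["Total"] > 0])
```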
In [34]: missing_df = missing_to_df(data)
         missing_df[missing_df["Total"] > 0]
Out[34]:
                  Total   Percent
Credit_History       50  0.081433
Self_Employed        32  0.052117
LoanAmount           22  0.035831
Dependents           15  0.024430
Loan_Amount_Term     14  0.022801
Gender               13  0.021173
Married               3  0.004886

In [36]: # Missing credit history becomes its own value (2) next to 0 and 1
         data["Credit_History"] = data["Credit_History"].fillna(2)

In [38]: data["Self_Employed"].unique()
Out[38]: array(['No', 'Yes', nan], dtype=object)

In [39]: # Missing employment status becomes a separate "Other" category
         data["Self_Employed"] = data["Self_Employed"].fillna("Other")

In [40]: from sklearn.impute import SimpleImputer

In [42]: # Numeric columns: fill gaps with the column median
         num_missing = ["LoanAmount", "Loan_Amount_Term"]
         median_imputer = SimpleImputer(strategy="median")
         for col in num_missing:
             data[col] = pd.DataFrame(median_imputer.fit_transform(pd.DataFrame(data[col])))[0]

In [46]: # Categorical columns: fill gaps with the most frequent value
         cat_missing = ["Gender", "Married", "Dependents"]
         freq_imputer = SimpleImputer(strategy="most_frequent")
         for col in cat_missing:
             data[col] = pd.DataFrame(freq_imputer.fit_transform(pd.DataFrame(data[col])))[0]
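For comparison, the two strategies used above (median for numeric columns, most frequent value for categorical ones) can also be written with pandas alone. This is only a sketch on made-up data, not the notebook's method; the `toy` frame and its values are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with gaps; values are illustrative only.
toy = pd.DataFrame({
    "LoanAmount": [100.0, np.nan, 140.0, 120.0],
    "Loan_Amount_Term": [360.0, 360.0, np.nan, 180.0],
    "Gender": ["Male", np.nan, "Female", "Male"],
})

num_cols = ["LoanAmount", "Loan_Amount_Term"]
cat_cols = ["Gender"]

# fillna with a Series fills each column with its own statistic.
toy[num_cols] = toy[num_cols].fillna(toy[num_cols].median())

# mode() returns one row per tied mode; row 0 holds the first mode of each column.
toy[cat_cols] = toy[cat_cols].fillna(toy[cat_cols].mode().iloc[0])

print(toy)
```

One design note: statistics computed on the full table leak information from rows that may later be held out for testing, so in a modelling pipeline the medians and modes would normally be learned on the training split only.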
In [47]: data.isnull().sum()
Out[47]: Gender               0
         Married              0
         Dependents           0
         Education            0
         Self_Employed        0
         ApplicantIncome      0
         CoapplicantIncome    0
         LoanAmount           0
         Loan_Amount_Term     0
         Credit_History       0
         Property_Area        0
         Loan_Status          0
         dtype: int64
In [48]: # Removing or replacing redundant or erroneous values
         # e.g. if income was negative
         # e.g. if Married held a number

In [49]: # Detect and handle outliers

In [ ]: # More EDA: univariates and bivariates
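The notebook leaves the erroneous-value and outlier cells as notes. One common way to fill them in, shown here only as a sketch, is to flag impossible values first and then apply Tukey's 1.5 × IQR rule. The rule, the made-up income values, and the choice to cap rather than drop are all assumptions, not something the notebook prescribes.

```python
import pandas as pd

# Hypothetical incomes, including one impossible value and one extreme value.
income = pd.Series([150, 2500, 3800, 4200, 5800, 81000, -100], name="ApplicantIncome")

# Erroneous values: an income cannot be negative.
invalid = income < 0
clean = income[~invalid]

# Tukey fences: anything beyond 1.5 * IQR from the quartiles is flagged.
q1, q3 = clean.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = clean[(clean < lower) | (clean > upper)]

# One way to handle them: cap at the fences instead of dropping rows.
capped = clean.clip(lower=lower, upper=upper)

print(outliers.tolist())
```

Capping keeps every applicant in the table, which matters on a dataset of only 614 rows; dropping or log-transforming are equally reasonable alternatives.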
In [50]: sns.countplot(data=data, x="Loan_Status")
Out[50]:
[Figure: count plot of Loan_Status; y-axis: count]
In [51]: sns.distplot(data["ApplicantIncome"])
C:\Users\nitis\anaconda3\lib\site-packages\seaborn\distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)
Out[51]:
[Figure: density plot of ApplicantIncome]
In [52]: data.boxplot(column="ApplicantIncome", by="Education")
         plt.show()
[Figure: boxplot of ApplicantIncome grouped by Education]
In [53]: sns.distplot(data["CoapplicantIncome"])
C:\Users\nitis\anaconda3\lib\site-packages\seaborn\distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)
Out[53]:
[Figure: density plot of CoapplicantIncome; y-axis: Density]

In [54]: # Select the column before aggregating so non-numeric columns are ignored
         data.groupby("Loan_Status")["ApplicantIncome"].mean()
Out[54]: Loan_Status
         N    5446.078125
         Y    5384.068720
         Name: ApplicantIncome, dtype: float64

In [58]: bins = [0, 2500, 4000, 6000, 81000]
         group_name = ["Low", "Average", "High", "Very High"]
         data["Income_bin"] = pd.cut(data["ApplicantIncome"], bins=bins, labels=group_name)
         data["Income_bin"]
Out[58]: 0           High
         1           High
         2        Average
         3        Average
         4           High
                  ...
         609      Average
         610         High
         611    Very High
         612    Very High
         613         High
         Name: Income_bin, Length: 614, dtype: category
         Categories (4, object): [Low < Average < High < Very High]
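A detail of `pd.cut` worth knowing when choosing bin edges: by default each interval is open on the left and closed on the right, so a value equal to the lowest edge falls outside every bin. The sketch below reuses the notebook's edges and labels on a few made-up incomes (the `incomes` values are assumptions chosen to sit on the boundaries).

```python
import pandas as pd

bins = [0, 2500, 4000, 6000, 81000]
labels = ["Low", "Average", "High", "Very High"]

# Hypothetical incomes placed on and around the bin edges.
incomes = pd.Series([0, 150, 2500, 2501, 6000, 81000])

# Default: intervals are (0, 2500], (2500, 4000], ... so 0 itself is not binned.
binned = pd.cut(incomes, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left as well.
binned_incl = pd.cut(incomes, bins=bins, labels=labels, include_lowest=True)

print(binned.tolist())
print(binned_incl.tolist())
```

The same applies at the top: any value above the last edge (81000 here) would also become NaN, which is relevant when the edges are reused for a column with a larger range.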
In [61]: Income_bin = pd.crosstab(data["Income_bin"], data["Loan_Status"])
         Income_bin.div(Income_bin.sum(axis=1), axis=0).plot(kind="bar", figsize=(4,4))
         plt.xlabel("ApplicantIncome")
         plt.ylabel("Percentage")
         plt.show()
[Figure: bar chart of Loan_Status share per ApplicantIncome bin (Low, Average, High, Very High); y-axis: Percentage]

In [62]: # above is also not useful, because the approval rate across income bins is very

In [ ]: # Feature Engineering

In [63]: data["TotalIncome"] = data["ApplicantIncome"] + data["CoapplicantIncome"]

In [64]: data["TotalIncome_bin"] = pd.cut(data["TotalIncome"], bins=bins, labels=group_name)
         data["TotalIncome_bin"]
Out[64]: 0           High
         1      Very High
         2        Average
         3           High
         4           High
                  ...
         609      Average
         610         High
         611    Very High
         612    Very High
         613         High
         Name: TotalIncome_bin, Length: 614, dtype: category
         Categories (4, object): [Low < Average < High < Very High]
In [66]: TotalIncome_bin = pd.crosstab(data["TotalIncome_bin"], data["Loan_Status"])
         TotalIncome_bin.div(TotalIncome_bin.sum(axis=1), axis=0).plot(kind="bar", stacked=True)
         plt.xlabel("TotalIncome")
         plt.ylabel("Percentage")
         plt.show()
[Figure: stacked bar chart of Loan_Status share per TotalIncome bin (Low, Average, High, Very High); y-axis: Percentage]
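Dividing each crosstab row by its row total, as done above, turns counts into per-bin approval shares. `pd.crosstab` can produce the same table directly through its `normalize="index"` argument. The sketch below checks the equivalence on a tiny made-up frame (the `toy` data is an assumption for illustration).

```python
import pandas as pd

# Hypothetical bin/status pairs; not taken from the loan data.
toy = pd.DataFrame({
    "TotalIncome_bin": ["Low", "Low", "High", "High", "High", "High"],
    "Loan_Status": ["Y", "N", "Y", "Y", "Y", "N"],
})

counts = pd.crosstab(toy["TotalIncome_bin"], toy["Loan_Status"])

# Manual row normalization, as in the notebook.
manual = counts.div(counts.sum(axis=1), axis=0)

# Built-in row normalization.
builtin = pd.crosstab(toy["TotalIncome_bin"], toy["Loan_Status"], normalize="index")

print(builtin)
```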
In [67]: data = data.drop(["Income_bin", "TotalIncome_bin"], axis=1)

In [68]: data["Loan_Amount_Term"].nunique()
Out[68]: 10

In [69]: data["Loan_Amount_Term"].value_counts()
Out[69]: 360.0    526
         180.0     44
         480.0     15
         300.0     13
         84.0       4
         240.0      4
         120.0      3
         36.0       2
         60.0       2
         12.0       1
         Name: Loan_Amount_Term, dtype: int64

In [70]: data["Loan_Amount_Term"] = (data["Loan_Amount_Term"]/12).astype('float')
In [71]: pd.crosstab(data["Loan_Amount_Term"], data["Loan_Status"])
Out[71]:
Loan_Status         N    Y
Loan_Amount_Term
1.0                 0    1
3.0                 2    0
5.0                 0    2
7.0                 1    3
10.0                0    3
15.0               15   29
20.0                1    3
25.0                5    8
30.0              159  367
40.0                9    6
In [72]: data["Loan_Amount_per_year"] = data["LoanAmount"]/data["Loan_Amount_Term"]

In [75]: data["EMI"] = data["Loan_Amount_per_year"]*1000/12

In [76]: data
Out[76]:
     Gender Married Dependents     Education Self_Employed  ApplicantIncome  CoapplicantIncome  ...
0      Male      No          0      Graduate            No             5849                0.0  ...
1      Male     Yes          1      Graduate            No             4583             1508.0  ...
2      Male     Yes          0      Graduate           Yes             3000                0.0  ...
3      Male     Yes          0  Not Graduate            No             2583             2358.0  ...
4      Male      No          0      Graduate            No             6000                0.0  ...
..      ...     ...        ...           ...           ...              ...                ...  ...
609  Female      No          0      Graduate            No             2900                0.0  ...
610    Male     Yes         3+      Graduate            No             4106                0.0  ...
611    Male     Yes          1      Graduate            No             8072              240.0  ...
612    Male     Yes          2      Graduate            No             7583                0.0  ...
613  Female      No          0      Graduate           Yes             4583                0.0  ...
614 rows × 15 columns
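The EMI feature is easier to sanity-check when traced by hand for one applicant. The sketch below assumes, as the `*1000` factor suggests, that LoanAmount is recorded in thousands, and it uses illustrative inputs (a 128-thousand loan over 360 months with a combined income of 4583 + 1508). The function name `emi_features` is hypothetical. The last step applies the one-tenth-of-income rule that the notebook uses in its next cell.

```python
def emi_features(loan_amount_thousands, term_months, total_income):
    """Principal-only EMI, following the notebook's arithmetic (no interest)."""
    term_years = term_months / 12                       # 360 months -> 30 years
    amount_per_year = loan_amount_thousands / term_years
    emi = amount_per_year * 1000 / 12                   # monthly instalment
    able_to_pay = int(total_income / 10 > emi)          # 1 if a tenth of income covers it
    return term_years, amount_per_year, emi, able_to_pay


term_years, per_year, emi, able = emi_features(128, 360, 4583 + 1508)
print(term_years, round(per_year, 4), round(emi, 2), able)
```

Because interest is ignored, this EMI understates a real instalment; it works as a relative affordability signal rather than a financial calculation.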
In [77]: data["able_to_pay_EMI"] = (data["TotalIncome"]/10 > data["EMI"]).astype('int')

In [78]: sns.countplot(x='able_to_pay_EMI', data=data, hue="Loan_Status")
Out[78]:
[Figure: count plot of able_to_pay_EMI (0, 1), split by Loan_Status; y-axis: count]

In [79]: data["Dependents"].value_counts()
Out[79]: 0     360
         1     102
         2     101
         3+     51
         Name: Dependents, dtype: int64

In [85]: # Assign the result back; in-place replace on a selected column may not update data
         data["Dependents"] = data["Dependents"].replace('3+', 3)

In [87]: data["Dependents"] = data["Dependents"].astype("float")
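The Dependents conversion can be tried in isolation. In this sketch (the `dependents` values are made up), "3+" is replaced by the string "3" before casting, which keeps the column a single type until the final `astype`. After conversion, 3.0 stands for "three or more", so the number is a floor rather than an exact count.

```python
import pandas as pd

# Hypothetical Dependents column as read from CSV: all strings.
dependents = pd.Series(["0", "1", "3+", "2", "3+"], name="Dependents")

# Map the open-ended category to its lower bound, then cast to numeric.
numeric = dependents.replace("3+", "3").astype(float)

print(numeric.tolist())
```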
In [88]: sns.countplot(data=data, x="Dependents", hue="Loan_Status")
Out[88]:
[Figure: count plot of Dependents (0.0, 1.0, 2.0, 3.0), split by Loan_Status; y-axis: count]

In [89]: # bivariate with credit_history, you will find that better credit history has bet

In [90]: data.info()
RangeIndex: 614 entries, 0 to 613
Data columns (total 16 columns):
 #   Column                Non-Null Count  Dtype
 0   Gender                614 non-null    object
 1   Married               614 non-null    object
 2   Dependents            614 non-null    float64
 3   Education             614 non-null    object
 4   Self_Employed         614 non-null    object
 5   ApplicantIncome       614 non-null    int64
 6   CoapplicantIncome     614 non-null    float64
 7   LoanAmount            614 non-null    float64
 8   Loan_Amount_Term      614 non-null    float64
 9   Credit_History        614 non-null    float64
 10  Property_Area         614 non-null    object
 11  Loan_Status           614 non-null    object
 12  TotalIncome           614 non-null    float64
 13  Loan_Amount_per_year  614 non-null    float64
 14  EMI                   614 non-null    float64
 15  able_to_pay_EMI       614 non-null    int32
dtypes: float64(8), int32(1), int64(1), object(6)
memory usage: 74.5+ KB
In [91]: data["Self_Employed"].value_counts()
Out[91]: No       500
         Yes       82
         Other     32
         Name: Self_Employed, dtype: int64

In [95]: data = pd.get_dummies(data, drop_first=True)

In [96]: data.info()
RangeIndex: 614 entries, 0 to 613
Data columns (total 18 columns):
 #   Column                   Non-Null Count  Dtype
 0   Dependents               614 non-null    float64
 1   ApplicantIncome          614 non-null    int64
 2   CoapplicantIncome        614 non-null    float64
 3   LoanAmount               614 non-null    float64
 4   Loan_Amount_Term         614 non-null    float64
 5   Credit_History           614 non-null    float64
 6   TotalIncome              614 non-null    float64
 7   Loan_Amount_per_year     614 non-null    float64
 8   EMI                      614 non-null    float64
 9   able_to_pay_EMI          614 non-null    int32
 10  Gender_Male              614 non-null    uint8
 11  Married_Yes              614 non-null    uint8
 12  Education_Not Graduate   614 non-null    uint8
 13  Self_Employed_Other      614 non-null    uint8
 14  Self_Employed_Yes        614 non-null    uint8
 15  Property_Area_Semiurban  614 non-null    uint8
 16  Property_Area_Urban      614 non-null    uint8
 17  Loan_Status_Y            614 non-null    uint8
dtypes: float64(8), int32(1), int64(1), uint8(8)
memory usage: 50.5 KB

Feature engineering
+ feature transformation
+ new features
+ one hot encoding and so on.

In [ ]: # Dimensionality reduction
        # removing unwanted features
        # check corr and remove features
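To see what `drop_first=True` does, it helps to encode a very small frame. In this sketch (made-up rows; `dtype=int` is passed explicitly because the default dummy dtype differs between pandas versions), the categories are sorted and the first one, "No", is dropped, so a row with zeros in both dummy columns means "No".

```python
import pandas as pd

# Hypothetical rows; only the category labels mirror the notebook.
toy = pd.DataFrame({
    "Self_Employed": ["No", "Yes", "Other", "No"],
    "LoanAmount": [128.0, 66.0, 120.0, 141.0],
})

# Numeric columns pass through; each text column becomes k-1 indicator columns.
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Dropping one level avoids perfectly collinear indicator columns, which matters for linear models; tree-based models are usually indifferent to it.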
In [97]: plt.figure(figsize=(20,20))
         sns.heatmap(data.corr(), annot=True)
         plt.show()

In [98]: # Spearman's rank correlation coefficient
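Pearson correlation measures linear association, while Spearman's coefficient is the Pearson correlation of the ranks, so it also picks up monotone non-linear relationships. The sketch below shows the difference on made-up columns and then lists feature pairs above a cutoff as candidates for removal. The data, the 0.9 threshold, and the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

x = np.arange(1, 11, dtype=float)
toy = pd.DataFrame({
    "LoanAmount": x,
    "EMI": x ** 3,  # monotone but non-linear in LoanAmount
    "Credit_History": [1, 0, 1, 1, 0, 1, 0, 0, 1, 1],
})

pearson = toy.corr()
spearman = toy.corr(method="spearman")
rank_pearson = toy.rank().corr()  # Spearman computed by hand: Pearson on ranks

# Keep only the upper triangle so each pair appears once, then filter.
threshold = 0.9
mask = np.triu(np.ones(spearman.shape, dtype=bool), k=1)
pairs = spearman.abs().where(mask).stack()
high_pairs = pairs[pairs > threshold]

print(pearson.loc["LoanAmount", "EMI"], spearman.loc["LoanAmount", "EMI"])
print(high_pairs)
```

For each pair that exceeds the cutoff, one of the two features is a candidate to drop; which one to keep is a judgment call, often made in favor of the more interpretable column.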
In [99]: plt.figure(figsize=(20,20))
         sns.heatmap(data.corr(method="spearman"), annot=True)
         plt.show()

In [100]: # Feature scaling

In [101]: from sklearn.preprocessing import StandardScaler, MinMaxScaler

In [102]: normalizer = MinMaxScaler()
In [104]: pd.DataFrame(normalizer.fit_transform(data), columns=data.columns)
Out[104]:
     Dependents  ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  ...
0      0.000000         0.070489           0.000000    0.172214          0.743590  ...
1      0.333333         0.054830           0.036192    0.172214          0.743590  ...
2      0.000000         0.035250           0.000000    0.082489          0.743590  ...
3      0.000000         0.030093           0.056592    0.160637          0.743590  ...
4      0.000000         0.072356           0.000000    0.191027          0.743590  ...
..          ...              ...                ...         ...               ...  ...
609    0.000000         0.034014           0.000000    0.089725          0.743590  ...
610    1.000000         0.048930           0.000000    0.044863          0.358974  ...
611    0.333333         0.097984           0.005760    0.353111          0.743590  ...
612    0.666667         0.091936           0.000000    0.257598          0.743590  ...
613    0.000000         0.054830           0.000000    0.179450          0.743590  ...
614 rows × 18 columns
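Min-max scaling with the default range maps each column to [0, 1] through (x - min) / (max - min). The sketch below applies that formula by hand with pandas on a few illustrative values; it assumes column extremes of 150 and 81000 for ApplicantIncome and of 1 and 40 years for Loan_Amount_Term, and the `toy` frame is made up.

```python
import pandas as pd

# Hypothetical rows that include the assumed column minimum and maximum.
toy = pd.DataFrame({
    "ApplicantIncome": [150, 5849, 4583, 81000],
    "Loan_Amount_Term": [1.0, 30.0, 15.0, 40.0],
})

# Column-wise min-max scaling: subtract the minimum, divide by the range.
scaled = (toy - toy.min()) / (toy.max() - toy.min())

print(scaled.round(6))
```

An income of 5849 scales to (5849 - 150) / (81000 - 150), about 0.070489, and a 30-year term to 29/39, about 0.743590. Because the range is set by the extremes, a single very large income compresses most applicants into a narrow band near zero.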