PROJECT - ASSIGNMENT
Financial Risk Analytics
Submitted by: Bibin Vadakkekara Bhaskaran (G1 - PGP BABI)
Great Lakes Institute of Management
Table of Contents
1. Project Objective, Background
2. Methodology
3. Solutions/Question Answer (Rubric Based)
3.1 Missing Value Treatment
3.2 Outlier Treatment
3.3 New Variables Creation (one ratio each for profitability, leverage, liquidity and company's size)
3.4 Check for multicollinearity
3.5 Univariate & bivariate analysis
3.6 Build Logistic Regression Model on most important variables
3.7 Analyze coefficients & their signs
3.8 Predict accuracy of model on dev and validation datasets
3.9 Sort the data in descending order based on probability of default and then divide into 10 deciles based on probability & check how well the model has performed
Appendix 1 – Source Code
1. Project Objective, Background
The objective is to create an India credit risk (default) model using the financial data provided in the spreadsheet raw-data.xlsx, and to validate it on validation_data.xlsx. The logistic regression framework is used to develop the credit default model.
Variable explanations as per the data dictionary are as follows:
Variable Name – Description
Networth Next Year – Net worth of the customer in the next year
Total assets – Total assets of the customer
Net worth – Net worth of the customer in the present year
Total income – Total income of the customer
Change in stock – Difference between the value of current stock and the value of stock on the last trading day
Total expenses – Total expenses of the customer
Profit after tax – Profit after tax deduction
PBDITA – Profit before depreciation, income tax and amortization
PBT – Profit before tax deduction
Cash profit – Total cash profit
PBDITA as % of total income – PBDITA / Total income
PBT as % of total income – PBT / Total income
PAT as % of total income – PAT / Total income
Cash profit as % of total income – Cash profit / Total income
PAT as % of net worth – PAT / Net worth
Sales – Sales done by the customer
Income from financial services – Income from financial services
Other income – Income from other sources
Total capital – Total capital of the customer
Reserves and funds – Total reserves and funds of the customer
Deposits (accepted by commercial banks) – All blank values
Borrowings – Total amount borrowed by the customer
Current liabilities & provisions – Current liabilities of the customer
Deferred tax liability – Future income tax the customer will pay on current transactions
Shareholders funds – Amount of equity in the company that belongs to shareholders
Cumulative retained profits – Total cumulative profit retained by the customer
Capital employed – Current assets minus current liabilities
TOL/TNW – Total liabilities of the customer divided by total net worth
Total term liabilities / tangible net worth – Short- plus long-term liabilities divided by tangible net worth
Contingent liabilities / Net worth (%) – Contingent liabilities / Net worth
Contingent liabilities – Liabilities arising from uncertain events
Net fixed assets – Purchase price of all fixed assets
Investments – Total invested amount
Current assets – Assets expected to be converted to cash within a year
Net working capital – Difference between current assets and current liabilities
Quick ratio (times) – Total cash divided by current liabilities
Current ratio (times) – Current assets divided by current liabilities
Debt to equity ratio (times) – Total liabilities divided by shareholder equity
Cash to current liabilities (times) – Total liquid cash divided by current liabilities
Cash to average cost of sales per day – Total cash divided by the average cost of sales
Creditors turnover – Net credit purchases divided by average trade creditors
Debtors turnover – Net credit sales divided by average accounts receivable
Finished goods turnover – Annual sales divided by average inventory
WIP turnover – Cost of goods sold for a period divided by the average inventory for that period
Raw material turnover – Cost of goods sold divided by the average inventory for the same period
Shares outstanding – Number of issued shares minus the number of shares held in the company
Equity face value – Cost of the equity at the time of issue
EPS – Net income divided by the total number of outstanding shares
Adjusted EPS – Adjusted net earnings divided by the weighted average number of common shares outstanding on a diluted basis during the plan year
Total liabilities – Sum of all types of liabilities
PE on BSE – Company's current stock price divided by its earnings per share
2. Methodology
EDA (Datatype corrections -> Missing value treatment -> Outlier corrections -> Univariate and Bivariate
analysis -> Creating new variables for Profitability, Leverage, Liquidity & Company’s size)
Modelling (Logistic Regression model to be built and analysed)
Model performance (Accuracy of the model on training and test dataset, 10 decile sorting)
3. Solutions/Question Answer (Rubric Based)
While importing the datasets using the import option in RStudio, variables read in as character type were converted to numeric.
A new column, Default, is derived from Networth Next Year: companies with a positive net worth next year are labelled non-defaulters ('0'), and those with a negative value are labelled defaulters ('1').
In the training dataset (named rawdata), 6.86% of records fall under the defaulter category.
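A minimal sketch of this labelling step (consistent with the full script in Appendix 1):
#flag defaulters: negative Networth Next Year -> 1, else 0
rawdata$Default = ifelse(rawdata$`Networth Next Year` > 0, 0, 1)
round(prop.table(table(rawdata$Default)) * 100, 2) #about 6.86% defaulters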
3.1 Missing Value Treatment
Missing values are first visualised using the plot_intro function from the DataExplorer package.
Each missing value in the training dataset is replaced with the median of its column, using a loop over all columns.
In both the training and test datasets, one column (Deposits (accepted by commercial banks)) is entirely blank, so it is removed from both datasets.
The test dataset is treated with the same procedure as the training set, as sketched below.
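A condensed sketch of the imputation loop (assuming, as in Appendix 1, that all remaining columns are numeric):
#replace every NA with the median of its column
for (a in 1:ncol(rawdata_new)) {
  rawdata_new[, a] = as.numeric(rawdata_new[, a])
  rawdata_new[is.na(rawdata_new[, a]), a] = median(rawdata_new[, a], na.rm = TRUE)
}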
3.2 Outlier Treatment
Outlier capping is done using the squish function from the scales library.
Observations below the 1st percentile are set to the value of the 1st percentile, and observations above the 95th percentile are set to the value of the 95th percentile (see the sketch after this list).
Redundant variables are removed: Number and Networth Next Year (now captured by Default) from the training set, and Number from the test set.
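The capping step, sketched with scales::squish() (the column range 2:51 covers the numeric predictors, as in Appendix 1):
library(scales)
for (c in 2:51) {
  caps = quantile(rawdata_new[, c], c(0.01, 0.95), na.rm = TRUE)
  rawdata_new[, c] = squish(rawdata_new[, c], caps) #values outside [P1, P95] are set to the boundary
}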
3.3 New Variables Creation (one ratio each for profitability, leverage, liquidity and company's size)
Profitability ratio (variable name: Profitability) = Profit after tax / Sales
Leverage ratio (variable name: leverage) = Total assets / (Total liabilities / Debt to equity ratio)
Liquidity ratio (variable name: liquidity) = Net working capital / Total assets
Company's size ratio (variable name: comp.size) = Total assets / Total liabilities
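The four ratios as created in Appendix 1 (rawdata_new1 is the cleaned training frame):
rawdata_new1$Profitability = rawdata_new1$`Profit after tax`/rawdata_new1$Sales
rawdata_new1$leverage = rawdata_new1$`Total assets`/(rawdata_new1$`Total liabilities`/rawdata_new1$`Debt to equity ratio (times)`)
rawdata_new1$liquidity = rawdata_new1$`Net working capital`/rawdata_new1$`Total assets`
rawdata_new1$comp.size = rawdata_new1$`Total assets`/rawdata_new1$`Total liabilities`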
3.4 Check for multicollinearity
Correlation is plotted between all variables in the training dataset.
Significant correlation exists between certain independent variables; clusters of red in the correlation plot indicate groups of correlated variables.
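The plot itself is a one-liner with DataExplorer (the same call appears in Appendix 1):
library(DataExplorer)
plot_correlation(rawdata_new1, title = "Correlation Plot between all variables")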
3.5 Univariate & bivariate analysis
Univariate analysis – histograms of the independent variables (plot_histogram output).
Univariate analysis – density plots of the independent variables (plot_density output).
Bivariate analysis – box plots of selected (log-transformed) financial variables split by Default.
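A minimal sketch of the bivariate box plots (log scales tame the heavy right skew of the financial variables; Appendix 1 holds the full set):
rawdata_new1$Default = as.factor(rawdata_new1$Default)
par(mfrow = c(1, 2))
plot(log(rawdata_new1$`Net worth`) ~ rawdata_new1$Default, main = "Net worth") #boxplot split by Default
plot(log(rawdata_new1$Sales) ~ rawdata_new1$Default, main = "Sales")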
3.6 Build Logistic Regression Model on most important variables
A logistic regression model is first built on the dataset with all variables.
Analysing the importance of variables by their z-scores, a secondary model is then generated on the most significant ones.
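In outline (object names as in Appendix 1):
#full model on all variables, then a reduced model on the significant ones
rawdata.logit.train = glm(Default ~ ., data = rawdata_new1, family = binomial)
summary(rawdata.logit.train) #z-values identify the variables worth keeping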
3.7 Analyze coefficients & their signs
The AIC value of the initial model is 1060.5.
The secondary model, fitted on the significant variables only, reduces the AIC to 975.16, indicating a better fit.
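The two fits can be compared directly; lower AIC indicates the better trade-off between fit and complexity (object names as in Appendix 1):
AIC(rawdata.logit.train, rawdata.logit.train.2) #initial ~1060.5, reduced ~975.16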
3.8 Predict accuracy of model on dev and validation datasets
Accuracy of the model on the training dataset is found to be about 94%; when the same model is applied to the testing/validation dataset, accuracy is about 93%.
Confusion matrices are generated for both sets and depicted below.
Training dataset
Testing dataset
ROC curves are calculated and plotted for both the training and test datasets, as sketched below.
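A condensed sketch of the evaluation (0.9 is the classification cutoff used in this project; object names as in Appendix 1):
library(pROC)
pred = predict(trainmodel, newdata = rawdata_new1, type = "response")
pred.class = ifelse(pred > 0.9, 1, 0)
cm = table(actual = rawdata_new1$Default, predicted = pred.class)
sum(diag(cm))/sum(cm) #accuracy, about 0.94 on training
plot(roc(rawdata_new1$Default, pred), main = "ROC curve - Training Dataset")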
3.9 Sort the data in descending order based on probability of default and then divide into 10 deciles based on probability & check how well the model has performed
The dataset is deciled based on the predicted probability of default.
Plots compare observed defaults against predictions for both the training and test datasets (see the sketch below).
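A compact alternative to the hand-rolled decile() function in Appendix 1, using quantile breaks (decile 10 holds the highest predicted probabilities of default):
breaks = unique(quantile(rawdata_new1$predictvalue, probs = seq(0, 1, 0.1), na.rm = TRUE))
rawdata_new1$decile = cut(rawdata_new1$predictvalue, breaks = breaks, include.lowest = TRUE, labels = FALSE)
table(decile = rawdata_new1$decile, default = rawdata_new1$Default) #defaulters should concentrate in the top deciles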
Appendix 1 – Source Code
####intro ####
getwd()
setwd("C:/Users/bibin/OneDrive/Great Lakes/FRA/Project Works/Project Works -RCode/ProjectV1.0")
#importing train dataset (rawdata) using inbuilt import dataset function
#correcting datatypes from character to numeric in the UI itself
rawdata=readxl::read_excel("raw-data.xlsx")
rawdata
str(rawdata)
rawdata=as.data.frame(rawdata)
for(i in 1:ncol(rawdata))
{
rawdata[,i]=as.numeric(rawdata[,i])
}
str(rawdata)
#importing test data (validate) set
validate=readxl::read_excel("validation_data.xlsx")
str(validate)
validate=as.data.frame(validate)
for(i in 1:ncol(validate))
{
validate[,i]=as.numeric(validate[,i])
}
str(validate)
#dataset while importing via -> import function was corrected for datatype from character to numeric
library(DataExplorer)
library(ggplot2) #attach ggplot2 for the theme_* functions used with the DataExplorer plots below
dim(rawdata)
names(rawdata)
####missing value & outlier treatment ####
rawdata_new=rawdata
validate_new=validate
str(validate_new)
rawdata_new$Default=ifelse(rawdata_new$`Networth Next Year`>0,0,1)
#drop rows with Total assets <= 3
rawdata_new=rawdata_new[!rawdata_new$`Total assets`<=3,]
summary(rawdata_new$Default)
summary(rawdata_new[,53])
#3298 are non defaulters and 243 are defaulters
(243/(3298+243))*100
#6.86 % of total are defaulters
summary(rawdata_new)
summary(rawdata_new$`Total assets`)
#Deposits (accepted by commercial banks) is blank for both rawdata_new and validate_new
str(rawdata_new)
#22nd column in both datasets
#removing the column from both datasets
rawdata_new=rawdata_new[,-22]
validate_new=validate_new[,-22]
plot_intro(rawdata_new)
plot_intro(validate_new)
summary(rawdata_new$Default)
sum(is.na(rawdata_new$Default))
#replace missing values of rawdata using median
rawdata_new=as.data.frame(rawdata_new)
#understanding the dataset under consideration
plot_intro(rawdata_new)
dim(rawdata_new)
#replace the missing NA with Median
for(a in 1:51)
{
rawdata_new[,a]=as.numeric(rawdata_new[,a])
rawdata_new[is.na(rawdata_new[,a]),a]=median(rawdata_new[,a],na.rm=TRUE)
}
str(validate_new)
#replacing the test dataset - validate_new with median
validate_new=as.data.frame(validate_new)
#removing the first two columns from replacing the omitted values
for(b in 3:ncol(validate_new))
{
validate_new[,b]=as.numeric(validate_new[,b])
validate_new[is.na(validate_new[,b]),b]=median(validate_new[,b],na.rm=TRUE)
}
summary(rawdata_new)
summary(validate_new)
plot_intro(rawdata_new)
plot_intro(validate_new)
#all observations are present / replaced for both the training and test datasets
####outlier identification ####
ncol(rawdata_new)
boxplot(rawdata_new)
library(scales)
ncol(rawdata_new)
str(rawdata_new)
summary(rawdata_new[,52])
#squish() caps each predictor at its 1st and 95th percentiles
for(c in 2:51)
{
quan=quantile(rawdata_new[,c],c(0.01,0.95))
rawdata_new[,c]=squish(rawdata_new[,c],quan)
}
summary(rawdata_new)
rawdata_new1=rawdata_new[,-c(1,2)]
validate_new1 = validate_new[,-1]
summary(rawdata_new1)
summary(validate_new1)
#rawdata_new1 -> does not contain the company id or Networth Next Year
#validate_new1 -> does not contain the company id; starts with the Default column
#### univariate & bi variate analysis #####
plot_str(rawdata_new)
plot_intro(rawdata_new)
plot_missing(rawdata_new)
dev.off()
plot_histogram(rawdata_new1,ggtheme = theme_light())
dev.off()
plot_density(rawdata_new1,ggtheme= theme_minimal())
?plot_correlation
plot_correlation(rawdata_new1,title = "Correlation Plot between all variables", ggtheme = theme_dark())
#bi variate
??plot
dev.off()
rawdata_new1$Default=as.factor(rawdata_new1$Default)
par(mfrow=c(3,2))
plot(log(rawdata_new1$`Net worth`)~rawdata_new1$Default, main = "Networth")
plot(log(rawdata_new1$`Total expenses`)~rawdata_new1$Default, main = "Tot. Exp")
plot(log(rawdata_new1$Sales)~rawdata_new1$Default, main = "Sales")
plot(log(rawdata_new1$`Total liabilities`)~rawdata_new1$Default, main = "liabilities")
plot(log(rawdata_new1$`Cash to average cost of sales per day`)~rawdata_new1$Default, main = "Cash to avg cost of sale")
plot(log(rawdata_new1$`Current assets`)~rawdata_new1$Default, main = "Curnt Assts")
plot(log(rawdata_new1$`Capital employed`)~rawdata_new1$Default, main = "Cap. employed")
plot(log(rawdata_new1$`Other income`)~rawdata_new1$Default, main = "Oth Income")
plot((rawdata_new1$`Reserves and funds`)~rawdata_new1$Default, main = "Rsrv and Fund")
plot((rawdata_new1$`Net fixed assets`)~rawdata_new1$Default)
#multi collinearity check
####new variable creation #####
#profitability, leverage, liquidity and company's size
#profitability -> new variable
# profit after tax / sales
rawdata_new1$Profitability=rawdata_new1$`Profit after tax`/rawdata_new1$Sales
boxplot(log(rawdata_new1$Profitability),main="Profitability")
#leverage variable
# total assets / (ratio of liabilities and debt to equity ratio)
rawdata_new1$leverage = rawdata_new1$`Total assets`/(rawdata_new1$`Total liabilities`/rawdata_new1$`Debt to equity ratio (times)`)
plot(density(rawdata_new1$leverage))
#liquidity variable
#net working capital / total assets
rawdata_new1$liquidity=rawdata_new1$`Net working capital`/rawdata_new1$`Total assets`
#company's size
#size = Total Assets / total liabilities
rawdata_new1$comp.size=rawdata_new1$`Total assets`/rawdata_new1$`Total liabilities`
#same variables generated for testing dataset - validate_new1
#profitability -> new variable
# profit after tax / sales
validate_new1$Profitability=validate_new1$`Profit after tax`/validate_new1$Sales
boxplot(log(validate_new1$Profitability),main="Profitability")
#leverage variable
# total assets / (ratio of liabilities and debt to equity ratio)
validate_new1$leverage = validate_new1$`Total assets`/(validate_new1$`Total liabilities`/validate_new1$`Debt to equity ratio (times)`)
plot(density(validate_new1$leverage))
#liquidity variable
#net working capital / total assets
validate_new1$liquidity=validate_new1$`Net working capital`/validate_new1$`Total assets`
#company's size
#size = Total Assets / total liabilities
validate_new1$comp.size=validate_new1$`Total assets`/validate_new1$`Total liabilities`
boxplot(log(validate_new1$comp.size))
####modeling #####
#log reg
summary(rawdata_new1$Default)
set.seed(1000)
rawdata.logit.train = glm(Default~ ., data=rawdata_new1,family=binomial) #bare column name on the LHS so "." excludes the response
summary(rawdata.logit.train)
attach(rawdata_new1)
rawdata.logit.train.1=glm(Default~ `PAT as % of net worth`+`Current liabilities & provisions`
+`Cumulative retained profits`+ `Cash to current liabilities (times)`+ `Current ratio (times)`
+`Debt to equity ratio (times)`+`Debtors turnover`+Profitability,
data = rawdata_new1,family=binomial)
summary(rawdata.logit.train.1)
rawdata.logit.train.2 =glm(Default~ `PAT as % of net worth`+`Current liabilities & provisions`
+ `Cash to current liabilities (times)`+ `Current ratio (times)`
+`Debt to equity ratio (times)`+`Debtors turnover`+Profitability,
data = rawdata_new1,family=binomial)
summary(rawdata.logit.train.2)
trainmodel=rawdata.logit.train.2
summary(trainmodel)
library(pROC)
library(caret)
library(e1071)
#library(SDMTools) #not needed for this script
?predict.glm
prediction = predict.glm(trainmodel, newdata=rawdata_new1,type='response')
## install.packages("caret", repos="http://cran.rstudio.com/", dependencies=TRUE)
rawdata_new1$predictvalue= predict.glm(trainmodel, newdata=rawdata_new1,type='response')
rawdata_new1$prediction=ifelse(rawdata_new1$predictvalue>0.9,1,0)
library(caTools)
conf.mat.table= table(rawdata_new1$Default,rawdata_new1$prediction)
print(conf.mat.table)
accuracy.prediction = sum(diag(conf.mat.table))/sum(conf.mat.table)
print(accuracy.prediction*100) #accuracy as a percentage
roc.train = roc(rawdata_new1$Default,prediction)
plot(roc.train, main="ROC curve - Training Dataset")
#94 percent accuracy in the training dataset
validate_new1$predictvalue= predict.glm(trainmodel, newdata=validate_new1,type='response')
prediction.test = predict.glm(trainmodel, newdata=validate_new1,type='response')
validate_new1$prediction=ifelse(prediction.test>0.9,1,0)
conf.mat.table.test= table(validate_new1$`Default - 1`,validate_new1$prediction)
print(conf.mat.table.test)
accuracy.prediction.test = sum(diag(conf.mat.table.test))/sum(conf.mat.table.test)
print(accuracy.prediction.test*100) #accuracy as a percentage
roc.test = roc(validate_new1$`Default - 1`,prediction.test)
plot(roc.test, main="ROC curve - Test Dataset")
#93 percent accurate in the test data set
####deciling#####
#for training
decile<-function(x)
{ decile = vector(length=10)
for(i in seq(0.1,1,.1))
{
decile[i*10]<-quantile(x,i,na.rm=T)
}
return(
ifelse(x<decile[1],1,
ifelse(x<decile[2],2,
ifelse(x<decile[3],3,
ifelse(x<decile[4],4,
ifelse(x<decile[5],5,
ifelse(x<decile[6],6,
ifelse(x<decile[7],7,
ifelse(x<decile[8],8,
ifelse(x<decile[9],9,10 ))))))))))
}
rawdata_new1$decile= decile(rawdata_new1$predictvalue)
temporary_table=data.table::data.table(rawdata_new1)
#refer to columns by bare name inside data.table so counts are computed per decile group
ranking = temporary_table[,list(cnt=length(prediction),
cnt_resp=sum(prediction==1),
cnt_non_resp=sum(prediction==0)
)
, by =decile][order(-decile)]
View(ranking)
#for testing
decile2<-function(x)
{ decile = vector(length=10)
for(i in seq(0.1,1,.1))
{
decile[i*10]<-quantile(x,i,na.rm=T)
}
return(
ifelse(x<decile[1],1,
ifelse(x<decile[2],2,
ifelse(x<decile[3],3,
ifelse(x<decile[4],4,
ifelse(x<decile[5],5,
ifelse(x<decile[6],6,
ifelse(x<decile[7],7,
ifelse(x<decile[8],8,
ifelse(x<decile[9],9,10 ))))))))))
}
validate_new1$decile= decile2(validate_new1$predictvalue)
temporary_table.test=data.table::data.table(validate_new1)
ranking.test = temporary_table.test[,list(cnt=length(prediction),
cnt_resp=sum(prediction==1),
cnt_non_resp=sum(prediction==0)
)
, by =decile][order(-decile)]
View(ranking.test)
#training
#Default is a factor here, so convert it back to numeric before averaging
mean.observations.training = aggregate(as.numeric(as.character(Default))~decile, data=rawdata_new1, mean)
mean.predictions.training = aggregate(predictvalue~decile, data=rawdata_new1, mean)
par(mfrow=c(1,2))
plot(mean.observations.training, main = "training dataset , default ")
lines(mean.observations.training)
plot(mean.predictions.training, main = "training dataset , prediction")
lines(mean.predictions.training)
#test
mean.observations.test = aggregate(validate_new1$`Default - 1`~validate_new1$decile, data=validate_new1, mean)
mean.predictions.test = aggregate(validate_new1$predictvalue~validate_new1$decile, data=validate_new1, mean)
par(mfrow=c(1,2))
plot(mean.observations.test, main = "test dataset , default ")
lines(mean.observations.test)
plot(mean.predictions.test, main = "test dataset , prediction")
lines(mean.predictions.test)