
PROJECT - ASSIGNMENT

Financial Risk Analytics


Submitted by: Bibin Vadakkekara Bhaskaran (G1 - PGP BABI)

Great Lakes Institute of Management


Table of Contents
List of Tables....................................................................................................................................0
List of Figures...................................................................................................................................0
1. Project Objective, Background................................................................................................1
2. Methodology...........................................................................................................................2
3. Solutions/Question Answer (Rubric Based).............................................................................2
3.1 Missing Value Treatment..................................................................................................3
3.2 Outlier Treatment............................................................................................................3
3.3 New Variable Creation (one ratio each for profitability, leverage, liquidity and company's
size)............................................................................................................................................3
3.4 Check for multicollinearity................................................................................................3
3.5 Univariate & bivariate analysis.........................................................................................4
3.6 Build Logistic Regression Model on most important variables........................................8
3.7 Analyze coefficient & their signs.......................................................................................8
3.8 Predict accuracy of model on dev and validation datasets..............................................9
3.9 Sort the data in descending order based on probability of default and then divide into
10 deciles based on probability & check how well the model has performed.........................10
Appendix 1 – Source Code.............................................................................................................11
List of Tables
No table of figures entries found.
List of Figures

Figure 1 : Box plot of independent variables.................................................................................................2


Figure 2 : Education Level density bar plot....................................................................................................2
Figure 3 : Age in years vs Personal Loan........................................................................................................3
Figure 4 : Experience in years vs Personal loan.............................................................................................3
Figure 5 : Credit card spending vs Personal Loan..........................................................................................4
Figure 6 : Income per month vs Education level............................................................................................4
Figure 7 : Confusion Matrix – Credit Card holders vs Personal loan.............................................................5
Figure 8 : Confusion Matrix – Cash deposit holders vs Personal loan...........................................................5
Figure 9 : Confusion Matrix – Securities account holders vs Personal loan..................................................5
Figure 10 : Confusion Matrix – Internet banking customers vs Personal loan..............................................5
Figure 11 : Initial CART plot tree....................................................................................................................6
Figure 12 : Before pruning – Rel. error vs cp vs size of tree..........................................................................7
Figure 13 : After pruning - CART plot.............................................................................................................7
Figure 14 : Random forest error rate and characteristics..............................................................................8
Figure 15 : Random forest - Error rate vs trees.............................................................................................8
Figure 16 : TPR vs FPR -CART........................................................................................................................10
Figure 17: TPR vs FPR - Random Forest.......................................................................................................10
Figure 18 : TPR vs FPR - CHAID model..........................................................................................................11
1. Project Objective, Background
To create an India credit risk (default) model using the financial data provided in the spreadsheet
raw-data.xlsx, and to validate it on validation_data.xlsx.
The logistic regression framework is used to develop the credit default model.
Variable explanations as per the data dictionary are as follows:

Variable Name : Description
Networth Next Year : Net worth of the customer in the next year
Total assets : Total assets of the customer
Net worth : Net worth of the customer in the present year
Total income : Total income of the customer
Change in stock : Difference between the value of current stock and the value of stock on the last trading day
Total expenses : Total expenses incurred by the customer
Profit after tax : Profit after tax deduction
PBDITA : Profit before depreciation, income tax and amortization
PBT : Profit before tax deduction
Cash profit : Total cash profit
PBDITA as % of total income : PBDITA / Total income
PBT as % of total income : PBT / Total income
PAT as % of total income : PAT / Total income
Cash profit as % of total income : Cash profit / Total income
PAT as % of net worth : PAT / Net worth
Sales : Sales made by the customer
Income from financial services : Income from financial services
Other income : Income from other sources
Total capital : Total capital of the customer
Reserves and funds : Total reserves and funds of the customer
Deposits (accepted by commercial banks) : All blank values
Borrowings : Total amount borrowed by the customer
Current liabilities & provisions : Current liabilities of the customer
Deferred tax liability : Future income tax the customer will pay because of current transactions
Shareholders funds : Amount of equity in a company that belongs to the shareholders
Cumulative retained profits : Total cumulative profit retained by the customer
Capital employed : Current assets minus current liabilities
TOL/TNW : Total liabilities of the customer divided by total net worth
Total term liabilities / tangible net worth : Short and long term liabilities divided by tangible net worth
Contingent liabilities / Net worth (%) : Contingent liabilities / Net worth
Contingent liabilities : Liabilities arising from uncertain events
Net fixed assets : Purchase price of all fixed assets
Investments : Total invested amount
Current assets : Assets expected to be converted to cash within a year
Net working capital : Difference between current assets and current liabilities
Quick ratio (times) : Total cash divided by current liabilities
Current ratio (times) : Current assets divided by current liabilities
Debt to equity ratio (times) : Total liabilities divided by shareholder equity
Cash to current liabilities (times) : Total liquid cash divided by current liabilities
Cash to average cost of sales per day : Total cash divided by the average cost of sales
Creditors turnover : Net credit purchases divided by average trade creditors
Debtors turnover : Net credit sales divided by average accounts receivable
Finished goods turnover : Annual sales divided by average inventory
WIP turnover : Cost of goods sold for a period divided by the average inventory for that period
Raw material turnover : Cost of goods sold divided by the average inventory for the same period
Shares outstanding : Number of issued shares minus the number of shares held in the company
Equity face value : Cost of the equity at the time of issuing
EPS : Net income divided by the total number of outstanding shares
Adjusted EPS : Adjusted net earnings divided by the weighted average number of common shares outstanding on a diluted basis during the plan year
Total liabilities : Sum of all types of liabilities
PE on BSE : Company's current stock price divided by its earnings per share

2. Methodology

EDA (Datatype corrections -> Missing value treatment -> Outlier correction -> Univariate and bivariate
analysis -> Creation of new variables for profitability, leverage, liquidity and company's size)
Modelling (a logistic regression model is built and analysed)
Model performance (accuracy of the model on the training and test datasets, and sorting into 10 probability deciles)

3. Solutions/Question Answer (Rubric Based)


 While importing the datasets using the import option in RStudio, variables read in as character
type were converted to numeric.
 A new column, Default, is created using Networth Next Year as the criterion: companies with a
positive net worth next year are labelled non-defaulters ('0') and those with a negative net worth
are labelled defaulters ('1'), as sketched below.
 In the full training dataset (named rawdata), 6.8% of the records fall into the defaulter
category.
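A minimal sketch of this step, condensed from the code in Appendix 1 (prop.table is used here only to show the defaulter share):

#flag defaulters: positive Networth Next Year -> 0 (non-defaulter), otherwise -> 1 (defaulter)
rawdata_new$Default=ifelse(rawdata_new$`Networth Next Year`>0,0,1)
#share of defaulters in the training data (roughly 6.8%)
prop.table(table(rawdata_new$Default))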

3.1 Missing Value Treatment
 Missing values are first visualised using the plot_intro function from the
DataExplorer package.
 Each missing value is replaced with the median of its column, looping over all columns of the
training dataset (see the sketch below).
 In both the training and test datasets an entire column, Deposits (accepted by
commercial banks), is blank, so that column is removed from both datasets.
 The test dataset is treated with the same procedure as the training set.
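A condensed sketch of the median imputation loop from Appendix 1 (columns 1 to 51 are the numeric columns of the cleaned training set):

#replace every NA in a column with that column's median
for(a in 1:51)
{
rawdata_new[,a]=as.numeric(rawdata_new[,a])
rawdata_new[is.na(rawdata_new[,a]),a]=median(rawdata_new[,a],na.rm=TRUE)
}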
3.2 Outlier Treatment
 Treatment is done with the squish function from the scales library (sketched below).
 Observations below the 1st percentile are capped at the 1st percentile value, and observations
above the 95th percentile are capped at the 95th percentile value.
 Redundant variables are removed: Number and Networth Next Year (now captured by Default) from the
training dataset, and Number from the test dataset.
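A condensed sketch of the capping step, mirroring Appendix 1 (columns 2 to 51 are the numeric predictors):

library(scales)
#cap each predictor at its 1st and 95th percentile values
for(c in 2:51)
{
quan=quantile(rawdata_new[,c],c(0.01,0.95))
rawdata_new[,c]=squish(rawdata_new[,c],quan)
}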
3.3 New Variable Creation (one ratio each for profitability, leverage, liquidity and company's size)
 Profitability ratio (variable name: Profitability) = Profit after tax / Sales
 Leverage ratio (variable name: leverage) = Total assets / (Total liabilities / Debt to equity
ratio)
 Liquidity ratio (variable name: liquidity) = Net working capital / Total assets
 Company's size ratio (variable name: comp.size) = Total assets / Total liabilities (see the sketch below)
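A condensed sketch of the four new ratios as created in Appendix 1 for the training set (the same code is repeated for the test set):

rawdata_new1$Profitability=rawdata_new1$`Profit after tax`/rawdata_new1$Sales
rawdata_new1$leverage=rawdata_new1$`Total assets`/(rawdata_new1$`Total liabilities`/rawdata_new1$`Debt to equity ratio (times)`)
rawdata_new1$liquidity=rawdata_new1$`Net working capital`/rawdata_new1$`Total assets`
rawdata_new1$comp.size=rawdata_new1$`Total assets`/rawdata_new1$`Total liabilities`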
3.4 Check for multicollinearity
 Correlations are plotted between all the variables in the training dataset (see the sketch below).
 There is significant correlation between several independent variables.


 Clusters of red in the correlation plot indicate correlated variables.
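A minimal sketch of the correlation check, using the DataExplorer package as in Appendix 1:

library(DataExplorer)
#correlation heat map across all variables of the training set
plot_correlation(rawdata_new1,title="Correlation Plot between all variables")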
3.5 Univariate & bivariate analysis
 Univariate Analysis – Histogram of independent variables

 Univariate Analysis – Density plot of independent variables



 Bivariate analysis – independent variables plotted against Default (see the sketch below)
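A minimal sketch of the univariate and bivariate plots (DataExplorer for the univariate views; one bivariate example is shown here, the remaining predictors are plotted the same way in Appendix 1):

library(DataExplorer)
plot_histogram(rawdata_new1)   #histograms of all numeric variables
plot_density(rawdata_new1)     #density plots of all numeric variables
#bivariate view: a log-scaled predictor against the Default factor (box plot)
plot(log(rawdata_new1$`Net worth`)~as.factor(rawdata_new1$Default),main="Networth")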


3.6 Build Logistic Regression Model on most important variables


 A logistic regression model is first built on the dataset with all variables.
 Analysing the significance of the variables based on their z values, a secondary model is generated on the most important variables, as sketched below.
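A minimal sketch of the two-stage modelling step. The object names full.model and reduced.model are illustrative; Appendix 1 builds the same models as rawdata.logit.train and rawdata.logit.train.2:

#full model on all variables
full.model=glm(Default~.,data=rawdata_new1,family=binomial)
#reduced model on the most significant variables
reduced.model=glm(Default~`PAT as % of net worth`+`Current liabilities & provisions`
+`Cash to current liabilities (times)`+`Current ratio (times)`
+`Debt to equity ratio (times)`+`Debtors turnover`+Profitability,
data=rawdata_new1,family=binomial)
summary(reduced.model)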
3.7 Analyze coefficient & their signs
 The AIC value of the initial model is 1060.5.
 The secondary model, generated using the significant variables, shows a reduction in the AIC value to
975.16; the coefficients and their signs can be read from the model summary, as sketched below.
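A short sketch for comparing the two models and reading the coefficient signs (full.model and reduced.model are the illustrative objects from the sketch above):

AIC(full.model,reduced.model)        #AIC drops from about 1060.5 to about 975.16
summary(reduced.model)$coefficients  #estimates, their signs and z values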
3.8 Predict accuracy of model on dev and validation datasets
 The accuracy of the model on the training dataset is found to be about 94%, while on the
testing/validation dataset it is about 93% (see the sketch below).
 Confusion matrices are generated for both the training and validation sets and are depicted below.
 Training dataset


 Testing dataset


 ROC curves are calculated and plotted for both the training and test datasets, as shown below.
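A condensed sketch of the accuracy calculation on the training data, using the 0.9 probability cut-off from Appendix 1 (object names are illustrative; the same steps are repeated on the validation set):

library(pROC)
pred.prob=predict(reduced.model,newdata=rawdata_new1,type="response")
pred.class=ifelse(pred.prob>0.9,1,0)
conf.mat=table(rawdata_new1$Default,pred.class)   #confusion matrix
sum(diag(conf.mat))/sum(conf.mat)                 #accuracy, roughly 0.94 on the training set
plot(roc(rawdata_new1$Default,pred.prob),main="ROC curve - Training Dataset")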


3.9 Sort the data in descending order based on probability of default and then divide into 10 deciles
based on probability & check how well the model has performed
 The datasets are deciled based on the predicted probability of default (a rank-based sketch follows this list).
 Plots compare the mean actual default rate against the mean predicted probability across deciles for both the training and test datasets.
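A rank-based sketch of the deciling step (Appendix 1 implements this with a custom decile() helper; object names here are illustrative and pred.prob comes from the sketch in section 3.8):

rawdata_new1$predictvalue=pred.prob
#decile 10 holds the records with the highest predicted probability of default
rawdata_new1$decile=ceiling(10*rank(rawdata_new1$predictvalue,ties.method="first")/nrow(rawdata_new1))
#mean predicted probability and mean actual default rate per decile
aggregate(predictvalue~decile,data=rawdata_new1,FUN=mean)
aggregate(as.numeric(as.character(Default))~decile,data=rawdata_new1,FUN=mean)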

Appendix 1 – Source Code

####intro ####
getwd()
setwd("C:/Users/bibin/OneDrive/Great Lakes/FRA/Project Works/Project Works -RCode/ProjectV1.0")
#importing train dataset (rawdata) using inbuilt import dataset function
#correcting datatypes from character to numeric in the UI itself
rawdata=readxl::read_excel("raw-data.xlsx")
rawdata
str(rawdata)
rawdata=as.data.frame(rawdata)
for(i in 1:ncol(rawdata))
{
rawdata[,i]=as.numeric(rawdata[,i])
}
str(rawdata)

#importing test data (validate) set


validate=readxl::read_excel("validation_data.xlsx")
str(validate)
validate=as.data.frame(validate)
for(i in 1:ncol(validate))
{
validate[,i]=as.numeric(validate[,i])
}
str(validate)

#dataset while importing via -> import function was corrected for datatype from character to numeric

library(DataExplorer)

dim(rawdata)
names(rawdata)

####missing value & outlier treatment ####

rawdata_new=rawdata
validate_new=validate
str(validate_new)
rawdata_new$Default=ifelse(rawdata_new$`Networth Next Year`>0,0,1)
rawdata_new=rawdata_new[!rawdata_new$`Total assets`<=3,]
summary(rawdata_new$Default)

summary(rawdata_new$Default)
summary(rawdata_new[,53])
#3298 are non defaulters and 243 are defaulters
(243/(3298+243))*100
#6.86 % of total are defaulters

summary(rawdata_new)
summary(rawdata_new$`Total assets`)

#Deposits (accepted by commercial banks) is blank for both rawdata_new and validate_new

str(rawdata_new)
#22nd column in both datasets

#removing the column from both datasets

rawdata_new=rawdata_new[,-22]
validate_new=validate_new[,-22]
plot_intro(rawdata_new)
plot_intro(validate_new)

summary(rawdata_new$Default)
sum(is.na(rawdata_new$Default))

#replace missing values of rawdata using median


rawdata_new=as.data.frame(rawdata_new)

#understanding the dataset under consideration


plot_intro(rawdata_new)
dim(rawdata_new)
#replace the missing NA with Median
for(a in 1:51)
{
rawdata_new[,a]=as.numeric(rawdata_new[,a])
rawdata_new[is.na(rawdata_new[,a]),a]=median(rawdata_new[,a],na.rm=TRUE)
}
str(validate_new)
#replacing the test dataset - validate_new with median
validate_new=as.data.frame(validate_new)
#removing the first two columns from replacing the omitted values
for(b in 3:ncol(validate_new))
{

validate_new[,b]=as.numeric(validate_new[,b])
validate_new[is.na(validate_new[,b]),b]=median(validate_new[,b],na.rm=TRUE)
}

summary(rawdata_new)
summary(validate_new)
plot_intro(rawdata_new)
plot_intro(validate_new)

#all observations are present / replaced for both the training dataset and the test dataset

####outlier identification ####


ncol(rawdata_new)

boxplot(rawdata_new)

library(scales)
ncol(rawdata_new)

str(rawdata_new)

summary(rawdata_new[,52])

#squish function used to cap outliers at the 1st and 95th percentiles

for(c in 2:51)
{

quan=quantile(rawdata_new[,c],c(0.01,0.95))
rawdata_new[,c]=squish(rawdata_new[,c],quan)
}

summary(rawdata_new)

rawdata_new1=rawdata_new[,-c(1,2)]
validate_new1 = validate_new[,-1]
summary(rawdata_new1)
summary(validate_new1)

#rawdata_new1 -> does not contain the company id or Networth Next Year
#validate_new1 -> does not contain the company id; it starts with the Default column

#### univariate & bi variate analysis #####

plot_str(rawdata_new)
plot_intro(rawdata_new)
plot_missing(rawdata_new)

dev.off()
plot_histogram(rawdata_new1,ggtheme = theme_light())
dev.off()
plot_density(rawdata_new1,ggtheme= theme_minimal())

?plot_correlation
plot_correlation(rawdata_new1,title = "Correlation Plot between all variables", ggtheme = theme_dark())

#bi variate
??plot
dev.off()
rawdata_new1$Default=as.factor(rawdata_new1$Default)
par(mfrow=c(3,2))
plot(log(rawdata_new1$`Net worth`)~rawdata_new1$Default, main = "Networth")
plot(log(rawdata_new1$`Total expenses`)~rawdata_new1$Default, main = "Tot. Exp")
plot(log(rawdata_new1$Sales)~rawdata_new1$Default, main = "Sales")
plot(log(rawdata_new1$`Total liabilities`)~rawdata_new1$Default, main = "liabilities")
plot(log(rawdata_new1$`Cash to average cost of sales per day`)~rawdata_new1$Default, main = "Cash to avg cost of sale")
plot(log(rawdata_new1$`Current assets`)~rawdata_new1$Default, main = "Curnt Assts")
plot(log(rawdata_new1$`Capital employed`)~rawdata_new1$Default, main = "Cap. employed")
plot(log(rawdata_new1$`Other income`)~rawdata_new1$Default, main = "Oth Income")
plot((rawdata_new1$`Reserves and funds`)~rawdata_new1$Default, main = "Rsrv and Fund")
plot((rawdata_new1$`Net fixed assets`)~rawdata_new1$Default)

#multi collinearity check

####new variable creation #####


#profitability, leverage, liquidity and company's size

#profitability -> new variable


# profit after tax / sales

rawdata_new1$Profitability=rawdata_new1$`Profit after tax`/rawdata_new1$Sales
boxplot(log(rawdata_new1$Profitability),main="Profitability")

#leverage variable
# total assets / (ratio of liabilities and debt to equity ratio)

rawdata_new1$leverage = rawdata_new1$`Total assets`/(rawdata_new1$`Total liabilities`/rawdata_new1$`Debt to equity ratio (times)`)
plot(density(rawdata_new1$leverage))

#liquidity variable
#net working capital / total assets
rawdata_new1$liquidity=rawdata_new1$`Net working capital`/rawdata_new1$`Total assets`

#company's size
#size = Total Assets / total liabilities
rawdata_new1$comp.size=rawdata_new1$`Total assets`/rawdata_new1$`Total liabilities`

#same variables generated for testing dataset - validate_new1

#profitability -> new variable


# profit after tax / sales

validate_new1$Profitability=validate_new1$`Profit after tax`/validate_new1$Sales


boxplot(log(validate_new1$Profitability),main="Profitability")
#leverage variable
# total assets / (ratio of liabilities and debt to equity ratio)

validate_new1$leverage = validate_new1$`Total assets`/(validate_new1$`Total liabilities`/validate_new1$`Debt to equity ratio (times)`)
plot(density(validate_new1$leverage))

#liquidity variable
#net working capital / total assets
validate_new1$liquidity=validate_new1$`Net working capital`/validate_new1$`Total assets`

#company's size
#size = Total Assets / total liabilities
validate_new1$comp.size=validate_new1$`Total assets`/validate_new1$`Total liabilities`
boxplot(log(validate_new1$comp.size))

####modeling #####
#log reg
summary(rawdata_new1$Default)
set.seed(1000)
rawdata.logit.train = glm(rawdata_new1$Default~ ., data=rawdata_new1,family=binomial)
summary(rawdata.logit.train)
attach(rawdata_new1)
rawdata.logit.train.1=glm(rawdata_new1$Default~ `PAT as % of net worth`+`Current liabilities & provisions`
+`Cumulative retained profits`+ `Cash to current liabilities (times)`+ `Current ratio (times)`
+`Debt to equity ratio (times)`+`Debtors turnover`+Profitability, data = rawdata_new1,family=binomial)

summary(rawdata.logit.train.1)

rawdata.logit.train.2 =glm(rawdata_new1$Default~ `PAT as % of net worth`+`Current liabilities & provisions`
+ `Cash to current liabilities (times)`+ `Current ratio (times)`
+`Debt to equity ratio (times)`+`Debtors turnover`+Profitability , data = rawdata_new1,family=binomial)
summary(rawdata.logit.train.2)

trainmodel=rawdata.logit.train.2
summary(trainmodel)

library(pROC)
library(caret)
library(e1071)
library(SDMTools)
?predict.glm
prediction = predict.glm(trainmodel, newdata=rawdata_new1,type='response')
## install.packages("caret", repos="http://cran.rstudio.com/", dependencies=TRUE)
rawdata_new1$predictvalue= predict.glm(trainmodel, newdata=rawdata_new1,type='response')

rawdata_new1$prediction=ifelse(rawdata_new1$predictvalue>0.9,1,0)

library(caTools)

conf.mat.table= table(rawdata_new1$Default,rawdata_new1$prediction)

print(conf.mat.table)

accuracy.prediction = sum(diag(conf.mat.table)/sum(conf.mat.table))

print(accuracy.prediction*100)   #accuracy as a percentage

roc.train = roc(rawdata_new1$Default,prediction)
plot(roc.train, main="ROC curve - Training Dataset")

#94 percent accuracy in the training dataset

validate_new1$predictvalue= predict.glm(trainmodel, newdata=validate_new1,type='response')

prediction.test = predict.glm(trainmodel, newdata=validate_new1,type='response')


validate_new1$prediction=ifelse(prediction.test>0.9,1,0)

conf.mat.table.test= table(validate_new1$`Default - 1`,validate_new1$prediction)


print(conf.mat.table.test)

accuracy.prediction.test = sum(diag(conf.mat.table.test)/sum(conf.mat.table.test))
print(accuracy.prediction.test*100)   #accuracy as a percentage

roc.test = roc(validate_new1$`Default - 1`,prediction.test)


plot(roc.test, main="ROC curve - Test Dataset")

#93 percent accurate in the test data set

####deciling#####

#for training
decile<-function(x)
{ decile = vector(length=10)
for(i in seq(0.1,1,.1))
{
decile[i*10]<-quantile(x,i,na.rm=T)
}
return(
ifelse(x<decile[1],1,
ifelse(x<decile[2],2,
ifelse(x<decile[3],3,
ifelse(x<decile[4],4,
ifelse(x<decile[5],5,
ifelse(x<decile[6],6,
ifelse(x<decile[7],7,
ifelse(x<decile[8],8,
ifelse(x<decile[9],9,10 ))))))))))
}

rawdata_new1$decile= decile(rawdata_new1$predictvalue)

temporary_table=data.table::data.table(rawdata_new1)

#count records, predicted defaulters and predicted non-defaulters within each decile
ranking = temporary_table[,list(cnt=length(prediction),
cnt_resp=sum(prediction==1),
cnt_non_resp=sum(prediction==0)
)
, by =decile][order(-decile)]

View(ranking)

#for testing

decile2<-function(x)
{ decile = vector(length=10)
for(i in seq(0.1,1,.1))
{
decile[i*10]<-quantile(x,i,na.rm=T)
}
return(
ifelse(x<decile[1],1,
ifelse(x<decile[2],2,
ifelse(x<decile[3],3,
ifelse(x<decile[4],4,
ifelse(x<decile[5],5,
ifelse(x<decile[6],6,
ifelse(x<decile[7],7,

ifelse(x<decile[8],8,
ifelse(x<decile[9],9,10 ))))))))))
}

validate_new1$decile= decile2(validate_new1$predictvalue)
temporary_table.test=data.table::data.table(validate_new1)

#count records, predicted defaulters and predicted non-defaulters within each decile (test set)
ranking.test = temporary_table.test[,list(cnt=length(prediction),
cnt_resp=sum(prediction==1),
cnt_non_resp=sum(prediction==0)
)
, by =decile][order(-decile)]

View(ranking.test)

#training

mean.observations.training =
aggregate(rawdata_new1$Default~rawdata_new1$decile,data=rawdata_new1,mean)
mean.predictions.training = aggregate(rawdata_new1$predictvalue
~rawdata_new1$decile,data=rawdata_new1,mean)

par(mfrow=c(1,2))
plot(mean.observations.training, main = "training dataset , default ")
lines(mean.observations.training)

plot(mean.predictions.training, main = "training dataset , prediction")
lines(mean.predictions.training)

#test

mean.observations.test = aggregate(validate_new1$`Default - 1`~validate_new1$decile,data=validate_new1,mean)
mean.predictions.test = aggregate(validate_new1$predictvalue
~validate_new1$decile,data=validate_new1,mean)

par(mfrow=c(1,2))
plot(mean.observations.test, main = "test dataset , default ")
lines(mean.observations.test)

plot(mean.predictions.test, main = "test dataset , prediction")


lines(mean.predictions.test)
