Loan Default Prediction - Course Slides
Course Introduction
Machine Learning models are used to make predictions and recommendations using data.
• Load, clean, and explore the data set.
• Use feature engineering to modify and improve data.
• Build & evaluate logistic regression and random forest classification models.
• Evaluate the accuracy of each model and explore ways to improve performance (i.e. class balancing).

This course is an end-to-end experience of how to implement a machine learning model, in Python, from scratch.
Supervised ML
• Helps us understand how a given variable is related to other variables in our dataset.
• Starts with a defined question: how do the loan & customer characteristics affect the likelihood of loan default? The model can then predict the likelihood of a default and classify loans accordingly.

Unsupervised ML
• Starts with a dataset of assumed real data, without a given variable in mind.
• No specific question: the model reveals how the data is related.
• Load a CSV file of loan default data into a Python notebook using pandas
• Investigate the row and column counts
• Identify columns that need cleaning
• Identify and deal with missing values
• Manipulate data and prepare for exploration
• Save cleaned data as a new CSV file
• Before we go any further there are a few questions we need to answer about our data
– How many data points do we have?
– How many variables are there?
– What are the variable types?
– What timespan does the data cover?
– Are there any missing values?
• Answering these basic questions provides context from which we can investigate further
• It will also help us identify individual columns that may need to be cleaned
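A minimal sketch of how these questions can be answered with pandas; the file name loan_data.csv is an assumption, so substitute your own path.

```python
import pandas as pd

# Load the loan default data (file name is an assumed example)
df = pd.read_csv("loan_data.csv")

print(df.shape)         # how many data points (rows) and variables (columns)?
print(df.dtypes)        # what are the variable types?
print(df.isna().sum())  # are there any missing values, and in which columns?
df.head()               # preview the first few rows
```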
What is a DataFrame?
• Columns have associated variable types (e.g. integer, string, date, list)
Why DataFrames?
• Missing data can reduce the predictive power of a machine learning model
• Moreover, empty or null values can cause many algorithms to produce errors
• First step is to identify columns which have ‘NaN’, ‘NA’ or ‘None’ values and count their frequency
• Numerical missing values can be replaced with the mean or median value, depending on the distribution of the data
• Categorical missing values can be replaced with some placeholder category e.g. ‘Missing’ or filled with the most
frequent value
• More advanced imputation techniques use algorithms such as k-nearest neighbors or linear regression to replace missing values
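A short sketch of the simple imputation approaches above using pandas; the column names (asset_cost, employment_type) are assumed examples, not necessarily those in the course data set.

```python
# Numerical column: replace missing values with the median (or the mean,
# depending on the distribution of the data)
df["asset_cost"] = df["asset_cost"].fillna(df["asset_cost"].median())

# Categorical column: replace missing values with a placeholder category...
df["employment_type"] = df["employment_type"].fillna("Missing")

# ...or, alternatively, with the most frequent value:
# df["employment_type"] = df["employment_type"].fillna(df["employment_type"].mode()[0])
```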
• Columns which store dates can not be included directly in machine learning models which perform mathematical
operations/comparisons on features
• Date formatted columns often contain valuable information which can be preserved by converting the date into a
numerical representation
– E.g. We can calculate a person's age in years based on their date of birth
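For example, a date-of-birth column can be converted into a numerical age feature; the column name date_of_birth and the reference date below are assumptions.

```python
import pandas as pd

# Parse the date column, coercing badly formatted values to NaT
df["date_of_birth"] = pd.to_datetime(df["date_of_birth"], errors="coerce")

# Age in whole years at an assumed snapshot date
reference_date = pd.Timestamp("2019-01-01")
df["age"] = (reference_date - df["date_of_birth"]).dt.days // 365
```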
String Manipulation
• String columns can also contain poorly formatted information which should be cleaned
– E.g. Time period information stored in the format “1yrs 3mon”
• We can use string manipulation techniques such as splitting or regular expressions to identify patterns and extract
data
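A sketch of extracting the time-period example above with a regular expression; the column name account_age is an assumption.

```python
# Strings like "1yrs 3mon": capture the number of years and months
extracted = df["account_age"].str.extract(r"(\d+)yrs\s*(\d+)mon")

years = extracted[0].astype(float)
months = extracted[1].astype(float)

# Store the period as a single numerical feature (total months)
df["account_age_months"] = years * 12 + months
```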
Loading & Cleaning data is incredibly important, as it will make our data easier to work with.
• Pick variables for initial investigation
• Create summary stats and plots
• Check assumptions, spot anomalies, and form hypotheses
• Create reusable functions in code, allowing you to repeat basic EDA far quicker next time around

Using EDA, our understanding of the data will help guide our model to the right variables.
01 Unordered Categorical Features
E.g. Employment Type: categorizes people into one of many groups.

02 Ordered Categorical Features
E.g. groups of house size, such as Small, Medium, and Large; there is a clear order to these categories.

These can be analyzed to see if they have an impact on the rate of loan defaults.

03 Continuous Variables
Measurable variables such as asset costs and customer age. The data shows numbers in the asset column are far bigger than those in the age column, which needs to be fixed.

04 Binary Features
Define whether a certain condition is met. Also known as truth values or flags. E.g. whether or not the loan applicant holds a passport: 1 = true, 0 = false.
Boxplots
Distributions
Hypothesis forming
• Using exploratory techniques we can start to form hypotheses about our data
• Pandas and seaborn provide very useful functions that allow us to visualize these relationships and answer questions such as:
– Are loans for cars from certain manufacturers more likely to default?
– Does the distribution of age change among people who defaulted on their loan?
– What was the average value of loans which defaulted?
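A few example plots with pandas and seaborn for questions of this kind; the column names (manufacturer_id, age, disbursed_amount, loan_default) are assumptions.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Default rate per manufacturer (barplot shows the mean of the 0/1 target)
sns.barplot(x="manufacturer_id", y="loan_default", data=df)
plt.show()

# Distribution of age among defaulters vs. non-defaulters
sns.boxplot(x="loan_default", y="age", data=df)
plt.show()

# Average value of loans that defaulted vs. those that did not
print(df.groupby("loan_default")["disbursed_amount"].mean())
```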
Chapter Summary
EDA is particularly important for supervised ML, where our model will benefit from some guidance as to which
variables are most important.
• Make sure columns are in the optimal format for feature engineering
• Learn how to use binning to create categorical variables from continuous ones
• Create new features from existing ones and use scaling to normalize continuous variables
• Create categorical variables from continuous data by grouping instances into bins and assigning labels to the bins
– Split the continuous variables into bins with the same range/width
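A minimal binning sketch using pandas; the column name and labels are illustrative assumptions.

```python
import pandas as pd

# Split a continuous variable into 4 equal-width bins with labels
df["age_group"] = pd.cut(df["age"], bins=4,
                         labels=["Young", "Adult", "Middle-aged", "Senior"])

# Or define the bin edges explicitly
df["age_group"] = pd.cut(df["age"], bins=[18, 30, 45, 60, 100],
                         labels=["18-30", "30-45", "45-60", "60+"])
```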
• Depending on the choice of algorithm, models can end up weighted heavily towards variables with high magnitudes
• Imagine you have two columns describing different weights, with one in kilograms and another in grams.
– Taken at face value, some models will place more importance on the weight measured in grams
• Feature scaling aims to reduce this effect by bringing features into the same level of magnitude
Scaling Techniques
– Standardization: rescale features to have a mean of 0 and a standard deviation of 1, essentially making features unitless
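One way to standardize, sketched with scikit-learn's StandardScaler; the column list is an assumption.

```python
from sklearn.preprocessing import StandardScaler

numeric_cols = ["asset_cost", "disbursed_amount", "age"]  # assumed example columns

scaler = StandardScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

# Each scaled column now has a mean of ~0 and a standard deviation of ~1
print(df[numeric_cols].describe().round(2))
```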
Feature engineering is important for modifying the structure of the input data to help our models be as effective as possible. Examples include binning continuous variables and scaling features to comparable magnitudes.
[Figure: linear vs. logistic fit to binary data (y = 0 / y = 1). With a straight-line fit, predicted Y can exceed the 0 and 1 range; with the S-shaped logistic fit, predicted Y lies within the 0 and 1 range.]
01 Split the data frame into inputs (X) and outputs (y).
02 Split both X and y into training and test data sets.
We need to reserve a portion of our data to test our model in order to see how well it performs on
data it has not seen before.
Row ID  State        ID  AB  NY  NJ  WS  OK  OH  HA  NX
1       AB            1   1   0   0   0   0   0   0   0
2       NY            2   0   1   0   0   0   0   0   0
3       NJ            3   0   0   1   0   0   0   0   0
4       WS            4   0   0   0   1   0   0   0   0
5       OK            5   0   0   0   0   1   0   0   0
6       OH            6   0   0   0   0   0   1   0   0
7       HA            7   0   0   0   0   0   0   1   0
8       NX            8   0   0   0   0   0   0   0   1
Logistic regression can’t interpret string data. One-hot encoding converts each category into its own binary column; these columns can be used like any other binary column to determine if they affect the results.
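A sketch of one-hot encoding with pandas, reproducing the State example; applying it to the loan data's own string columns (names assumed) works the same way.

```python
import pandas as pd

states = pd.DataFrame({"State": ["AB", "NY", "NJ", "WS", "OK", "OH", "HA", "NX"]})

# Each state becomes its own 0/1 column
encoded = pd.get_dummies(states, columns=["State"], prefix="", prefix_sep="", dtype=int)
print(encoded)

# On the loan data (column names are assumptions):
# df = pd.get_dummies(df, columns=["employment_type", "branch_id"])
```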
Logistic Regression
• Logistic regression uses an S-shaped logistic function to output a probability value between 0 and 1
• A data point can be classified by putting it through the function and classifying it according to a threshold
– E.g. if the output probability is > 0.5 it belongs to class 1
• The function is fitted to the training data using a concept called maximum likelihood
– Simply put, choose the curve that maximizes the likelihood of observing the training data
• We can answer these questions by creating a linear function based on a weighted combination of
our input variables
• Variable weights determine how much influence a particular variable has over the predicted output
• Draw a line through the data using variable weights which minimize the Mean Squared Error (MSE) for our training
data
– MSE is the average of the squared differences between the predicted outputs and the actual values
• Training Data?
– In supervised machine learning we train our models on a labelled subset of the total data
• Machine learning algorithms make predictions by adjusting certain values based on the data they are shown
• To do this we need to split the data into training and test sets
– Test Data: used to test the predictive power of the trained model
• In some cases we may also create Validation Data which is used to fine-tune the model before testing
Dataset → Train the model → Predictions → Check performance
Stratification
• It is important that class distribution in the test set matches the natural distribution within the data
• This must be kept in mind when splitting the data into training and test sets
Stratification
• The process of sampling the data to match the distribution of a certain variable
• Particularly useful for classification problems where the classes are unevenly distributed
– Ensure that the class distribution in the test data is representative of the natural class distribution
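A stratified split sketch with scikit-learn's train_test_split; the X/y names and the 70/30 split size are assumptions.

```python
from sklearn.model_selection import train_test_split

# stratify=y keeps the default/non-default ratio in the test set
# representative of the full data set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Class distributions should now be very similar
print(y.mean(), y_train.mean(), y_test.mean())
```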
Congratulations! You have built a basic logistic regression model, which allows the user to predict whether loans are
going to default or not.
We split data into X and Y dataframes, before using the train_test_split function to feed the logistic regression
with the training data set.
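A minimal sketch of that workflow, assuming the train/test split from above; the solver settings are assumptions, not necessarily the course's exact code.

```python
from sklearn.linear_model import LogisticRegression

# Fit the logistic regression on the training data
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)

# Hard class predictions (0.5 threshold) and predicted default probabilities
y_pred = log_reg.predict(X_test)
y_prob = log_reg.predict_proba(X_test)[:, 1]
```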
These metrics, charts, and techniques can give a good indication of the predictive
power of the model, whether it should be used for live data, and assist with
comparisons between models.
Familiarize yourself with several metrics,
charts and techniques that will help
decide if the model is performing well or
not.
• Learn and use the formulas for Accuracy, Precision, Recall, and F1 score.
• Plot the ROC curve & confusion matrices as additional ways to interpret model results.
• Explore more advanced evaluation techniques.
[Figures: ROC curve plotting TPR against FPR, and a confusion matrix comparing actual vs. predicted positive and negative classes.]
• Accuracy tells us the percentage of data points our model classified correctly
– Imagine building a model for disease classification where only 1% of patients had the disease
– A model that predicts no one ever has the disease would be 99% accurate
Recall
• How many of the actual positive cases did we correctly identify?
$$\text{Recall} = \frac{TP}{TP + FN}$$
• Useful when the cost of false negatives is high, e.g. in disease detection
F1 Score
• Harmonic mean of precision and recall:
$$F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
• Useful when we need a balance between precision and recall
• Less affected by large numbers of true negatives than accuracy
$$TPR = \text{Recall} = \frac{TP}{TP + FN} \qquad FPR = \frac{FP}{TN + FP}$$
• The real power of the ROC curve comes from calculating the AUC
• The higher the AUC the better the model is at separating instances
of the target variable
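These metrics can be computed with scikit-learn; a sketch assuming the y_test, y_pred, and y_prob variables from the logistic regression example above.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, roc_auc_score)

print(confusion_matrix(y_test, y_pred))
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1 score :", f1_score(y_test, y_pred))

# AUC is computed from predicted probabilities, not hard class labels
print("ROC AUC  :", roc_auc_score(y_test, y_prob))
```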
• Sometimes we might need to dig a little deeper, especially if the model is not performing well
• We can expand on the confusion matrix idea by looking at the percentage classification splits
– In other words, we have a list of values between 0 and 1 that tell us how likely the model thinks it is that a loan
belongs to the default class
• We can visualize this by splitting the test data into groups according to the true class labels and plotting the predicted
probability distributions for both groups
• In a perfect world these two distributions would meet at the classification threshold (0.5)
– Instances of class 0 would have a predicted probability < 0.5 of belonging to class 1
– Instances of class 1 would have a predicted probability > 0.5 of belonging to class 1
– These plots give us another way of examining the model's ability to separate the two classes
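A sketch of such a plot with seaborn, assuming y_test and y_prob from the earlier examples.

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

y_test_arr = np.asarray(y_test)

# Predicted probability of default, split by the true class label
sns.kdeplot(y_prob[y_test_arr == 0], fill=True, label="Actual class 0 (no default)")
sns.kdeplot(y_prob[y_test_arr == 1], fill=True, label="Actual class 1 (default)")
plt.axvline(0.5, linestyle="--", color="grey")  # classification threshold
plt.xlabel("Predicted probability of default")
plt.legend()
plt.show()
```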
• Learn the basic theory on decision trees & random forests, before building a random forest classification model.
• Create functions and review results of the model.
• Identify overfitting in your results and learn how to adjust hyperparameters in your models to improve performance.
Ensemble model: model where results are based on the combined output of multiple other models
to [hopefully] achieve a more reliable prediction.
In this course, we are using information on car loans to predict which ones will default.
By guessing, our separation score is 50%. The model improves the separation of loan outcomes.
Decision Trees
A decision tree is made up of a root node, decision nodes, and leaf nodes.
Once the tree is completed, it can be used to evaluate each loan by answering the questions until a leaf node
is reached and a prediction is made.
• A random forest classifier is a machine learning technique used for predictive analysis.
• Its predictions are based on the predictions of a number of underlying decision tree models.
• This approach of combining multiple models often improves accuracy, but makes the models more difficult to interpret.
• To predict the class of some input data, each tree produces a prediction and the Random Forest chooses the most
popular class as its output
• Individual trees in the random forest should be uncorrelated; this means that they can counteract each other’s errors
– The output of some trees may be wrong but others will be right
– As a group the forest can produce better results than individual trees
• Unique training sets are generated for each tree by randomly sampling the original training set with replacement
• This exploits the tendency of decision trees to overfit on their training data
• Each tree is trained on a unique training set, creating diversity among the forest
Feature Randomness
• Unlike regular decision trees, trees in the random forest choose which feature to split on from a random subset of all
the training features
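A minimal random forest sketch with scikit-learn; the parameter values are illustrative, not the course's choices, and X_train is assumed to be a pandas DataFrame.

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rf.fit(X_train, y_train)

y_pred_rf = rf.predict(X_test)              # predicted class, aggregated across trees
y_prob_rf = rf.predict_proba(X_test)[:, 1]  # averaged probability of default across trees

# Which variables does the forest rely on most?
print(sorted(zip(rf.feature_importances_.round(3), X_train.columns), reverse=True)[:5])
```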
Bias
• How far are the predicted values from the actual values?
• High bias usually means the model is too simple and misses important relationships between input features and the
target variable
Variance
• The variability of predicted values if the model is shown different training data
• High variance means the model performs well on training data but does not generalize out to test/unseen data
[Figure: error vs. model complexity. Bias² falls and variance rises as complexity increases; the optimum model sits between the two, and high variance can be resolved by reducing complexity. A model that over-generalizes the data is too simple, the optimum model generalizes enough to predict future patterns, and a model that does not generalize enough is overfitted.]
• By combining multiple models together ensemble methods can minimize bias and variance
Ensemble Techniques
• Imagine we have a group of models, each outputting its own classification results
• Ensemble learners can combine these models to produce more accurate, stable, and robust results
• However, the resulting models often have high computational cost and can be difficult to interpret
Random Forest models tend to perform best when we pay attention to hyperparameters, such as those shown in the sketch below.
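A hedged tuning sketch with GridSearchCV; the parameters and value grids below are common examples, not the specific hyperparameters or search method used in the course.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 200],     # number of trees in the forest
    "max_depth": [5, 10, None],     # maximum depth of each tree
    "max_features": ["sqrt", 0.5],  # size of the random feature subset per split
}

search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, scoring="roc_auc", cv=3)
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 3))
```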
Created functions to help speed up the model build and evaluation process
The random forest classification model is an important and popular ML tool that is widely used in data science.
This is a common problem which applies to loan defaults, fraud detection, disease detection, among
other areas.
• Learn how to use class balancing in the random forest & logistic regression models, as well as how to interpret the results.
• We want to change the weighting of the model to focus more on the defaulting loans.
• In most imbalanced classification problems we are mainly interested in predicting the minority class
• This poses a problem: we want our model to pick up on the characteristics and patterns of the minority class, but there are fewer instances to learn from
– Fraud Detection
– Disease Detection
• Weight balancing negates the effect of imbalanced data by changing the weight that each class carries when
computing the loss/error
• To balance the classes we can assign weights that are inverse to the class distribution
– For our loan data we could assign a weight of 0.217 (21.7%) to class 0
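A sketch of class weighting in scikit-learn: class_weight="balanced" computes inverse-frequency weights automatically, while the explicit dictionary mirrors the 21.7% figure above (the 0.783 complement for class 1 is my assumption).

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Let scikit-learn derive weights inversely proportional to class frequencies
log_reg_bal = LogisticRegression(max_iter=1000, class_weight="balanced")
rf_bal = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                                random_state=42)

# Or set the weights explicitly
log_reg_manual = LogisticRegression(max_iter=1000,
                                    class_weight={0: 0.217, 1: 0.783})

log_reg_bal.fit(X_train, y_train)
```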
• When using resampling techniques it is essential that the test data is not resampled.
• Up-sampling means that we randomly duplicate instances of the minority class to create a balanced data set
• Most commonly we select a sample to duplicate without removing the original sample from the data set
• Down-sampling creates a balanced data set by randomly removing instances from the majority class without replacement
• Standard up-sampling creates duplicate instances which do not add any new information to the model
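A sketch of standard up- and down-sampling with sklearn.utils.resample, applied to the training data only; the target column name loan_default is an assumption.

```python
import pandas as pd
from sklearn.utils import resample

train = pd.concat([X_train, y_train], axis=1)
majority = train[train["loan_default"] == 0]
minority = train[train["loan_default"] == 1]

# Up-sampling: duplicate minority rows (with replacement) up to the majority size
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=42)
train_up = pd.concat([majority, minority_up])

# Down-sampling: randomly drop majority rows (without replacement) down to the minority size
majority_down = resample(majority, replace=False, n_samples=len(minority), random_state=42)
train_down = pd.concat([majority_down, minority])
```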
SMOTE
• Up-sample the minority class by creating synthetic samples with their own features
• Draw a line between similar samples and create a new synthetic sample at a point along the line
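A minimal SMOTE sketch using the imbalanced-learn package (an assumption: the course may implement balancing differently); again, only the training data is resampled.

```python
from imblearn.over_sampling import SMOTE  # pip install imbalanced-learn

smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)

print(y_train.value_counts())      # original imbalanced class counts
print(y_train_res.value_counts())  # balanced counts after SMOTE
```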
• Covered the basics of the entire machine learning process
• Imported & cleaned data before conducting exploratory data analysis
• Written code to create a simple logistic regression & random forest classification model