SMLDS_BAD702_Module5_Notes

The document covers Discriminant Analysis and its applications in data science, focusing on classification techniques such as Naive Bayes, Linear Discriminant Analysis (LDA), and Logistic Regression. It explains the concepts of probability scores, multiclass classification, and the importance of covariance matrices in LDA. Additionally, it provides practical examples and Python code snippets for implementing these classification methods.

Uploaded by

Shashank R
Copyright
© All Rights Reserved

SML for DS [BAD702]

Module 5
Discriminant Analysis

Syllabus:

Covariance Matrix, Fisher’s Linear Discriminant, Generalized Linear Models, Interpreting the Coefficients and Odds Ratios, Strategies for Imbalanced Data.

Textbook: Chapter 5

Introduction
Data scientists automate business decisions through classification, a type of supervised
learning where a model is trained on labeled data (known outcomes) and used to predict
unknown outcomes.
Examples:
• Detecting phishing emails (phishing / not phishing)
• Predicting customer churn (churn, i.e. the customer stops using a service / no churn)
• Estimating ad clicks (click / no click)
Types:
• Binary classification: two possible outcomes (1 or 0).
• Multiclass classification: more than two categories (e.g., Gmail inbox labels: primary, social, promotions, forums).
Probability Scores:
Instead of just assigning classes, models can output probability estimates (propensities) that indicate how likely it is that a record belongs to a particular class.
• Logistic regression models the log-odds, which are converted to probabilities.
• In Python’s scikit-learn, predict() returns class labels, while predict_proba() returns class probabilities.
The General Decision Process:
1. Set a cutoff probability (threshold).
2. Estimate the probability of belonging to the target class.
3. Classify the record as belonging to that class if the estimated probability exceeds the
cutoff.
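The decision process above can be sketched with scikit-learn. This is a minimal illustration on simulated data; the data, the variable names, and the choice of LogisticRegression as the model are assumptions made only for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one numeric predictor, binary outcome (1 = target class).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = (x + rng.normal(scale=0.5, size=200) > 0).astype(int)
X = x.reshape(-1, 1)

model = LogisticRegression().fit(X, y)

# predict_proba() returns one column per class, ordered as in model.classes_.
p_target = model.predict_proba(X)[:, list(model.classes_).index(1)]

# Apply an explicit cutoff to the estimated probabilities.
cutoff = 0.5
labels_from_cutoff = (p_target > cutoff).astype(int)

# For a binary model, a 0.5 cutoff reproduces what predict() returns.
labels_default = model.predict(X)
```

Changing `cutoff` to another value (for example 0.3) is how the threshold is adapted when the target class is rare.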

Dept. of CSE-DS, RNSIT Smitha B A 1



More Than Two Categories?


While most classification problems are binary (two possible outcomes), some involve multiple
outcomes—known as multiclass classification.
Example:
At a customer’s contract renewal, three outcomes are possible:
• Y = 0: Signs a new long-term contract
• Y = 1: Switches to a month-to-month contract
• Y = 2: Leaves or “churns”
The goal is to predict which category (Y = j) applies for j = 0, 1, or 2.
Most classification algorithms can handle multiclass problems directly or with slight modifications.
Sometimes, the problem can be recast into multiple binary problems using conditional probabilities, for example:
1. Predict whether Y = 0 or Y > 0 (signs a new long-term contract vs. does not).
2. Given Y > 0, predict whether Y = 1 or Y = 2 (switches to month-to-month vs. churns).
This stepwise approach simplifies model fitting and is especially helpful when one class is much
more frequent than others.

5.1 Naive Bayes


The Naive Bayes algorithm is based on Bayes’ theorem, which relates the probability of an outcome given the predictors, P(Y | X), to the probability of the predictors given the outcome, P(X | Y).
It estimates the probability of a class (Y = i) given a set of predictor values.
Concept:
• It uses the probability of observing the predictor values given an outcome, P(X | Y), to compute the probability of the outcome given the predictors, P(Y | X).
• This helps determine which class a new record most likely belongs to.
Exact Bayesian Classification (Conceptual Idea):
For each new record:
1. Identify all training records with the same predictor values (predictor profile).
2. Observe the class distribution among these identical records.
3. Assign the class that occurs most frequently (most probable class).
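The three steps of exact Bayesian classification can be sketched in pandas as follows. The training data, the column names, and the helper `exact_bayes_classify` are hypothetical and serve only to illustrate the idea.

```python
import pandas as pd

# Hypothetical training data with two categorical predictors.
train = pd.DataFrame({
    "home":    ["rent", "rent", "own", "rent", "own", "rent"],
    "purpose": ["car", "car", "car", "car", "house", "house"],
    "outcome": ["default", "paid off", "paid off", "default", "paid off", "default"],
})

def exact_bayes_classify(train, profile, outcome_col="outcome"):
    """Return the most frequent class among records matching `profile` exactly,
    or None when no training record has that predictor profile."""
    mask = pd.Series(True, index=train.index)
    for column, value in profile.items():
        mask &= train[column] == value
    matches = train.loc[mask, outcome_col]
    if matches.empty:
        return None
    return matches.value_counts().idxmax()

# A profile that occurs in the training data, and one that does not.
seen = exact_bayes_classify(train, {"home": "rent", "purpose": "car"})
unseen = exact_bayes_classify(train, {"home": "own", "purpose": "boat"})
```

The second call returns `None` because no training record matches the profile, which is the situation that becomes typical as the number of predictors grows.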


Limitation:
• In reality, it’s rare to find exact matches of predictor profiles, especially when there are many predictors.
• The Naive Bayes approach overcomes this by assuming that predictors are conditionally independent, allowing probability estimation even when exact matches don’t exist.

5.1.1 Why Exact Bayesian Classification Is Impractical


When the number of predictor variables increases, finding exact matches for a new record
becomes highly unlikely.
Example:
In predicting voting behaviour using demographic variables, even with a large dataset, it’s rare
to find another record exactly matching a new individual’s detailed profile (e.g., gender,
ethnicity, income, location, voting history, family structure, marital status).
This happens because:
• Each additional variable increases the number of unique combinations of predictor values.
• As the number of predictors grows, the chance of finding an exact match drops sharply, a problem known as the curse of dimensionality.
• For instance, adding just one new variable with five equally frequent categories reduces the probability of an exact match by a factor of 5.
Hence, exact Bayesian classification becomes impractical for high-dimensional data,
motivating the use of Naive Bayes, which assumes predictors are conditionally independent to
estimate probabilities efficiently.
5.1.2 The Naive Solution
• The Naive Bayes algorithm is a probabilistic classifier based on Bayes’ theorem, which estimates the probability of a class given a set of predictor values.
• It is called “naive” because it assumes that all predictors are independent of each other given the outcome. This assumption is rarely true in real data, but it often works well in practice.
• Instead of looking only at records that exactly match a new case (as in exact Bayes classification), Naive Bayes uses the entire dataset to estimate probabilities.


Steps in the Naive Bayes Algorithm
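The standard Naive Bayes computation can be sketched on toy categorical data: multiply the class prior by the per-predictor conditional probabilities, repeat for every class, then normalise. The data, column names, and the helper `naive_bayes_posterior` are hypothetical, and no smoothing is applied, so this is an illustration rather than a replacement for a library implementation.

```python
import pandas as pd

# Hypothetical toy data: two categorical predictors and a binary outcome.
train = pd.DataFrame({
    "home":    ["rent", "rent", "own", "rent", "own", "rent"],
    "purpose": ["car", "car", "car", "house", "house", "house"],
    "outcome": ["default", "default", "paid off", "default", "paid off", "paid off"],
})

def naive_bayes_posterior(train, profile, outcome_col="outcome"):
    """Posterior class probabilities for one predictor profile (no smoothing)."""
    scores = {}
    for cls, group in train.groupby(outcome_col):
        # Prior: proportion of training records in this class.
        score = len(group) / len(train)
        # Multiply by P(X_j = value | Y = cls) for each predictor, treating
        # the predictors as conditionally independent given the class.
        for column, value in profile.items():
            score *= (group[column] == value).mean()
        scores[cls] = score
    total = sum(scores.values())
    # Normalise so that the class scores sum to 1.
    return {cls: score / total for cls, score in scores.items()}

posterior = naive_bayes_posterior(train, {"home": "rent", "purpose": "house"})
predicted_class = max(posterior, key=posterior.get)
```

Unlike exact matching, every training record contributes to the estimate, because each conditional probability is computed one predictor at a time.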

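In Python (using scikit-learn):

import pandas as pd
from sklearn.naive_bayes import MultinomialNB

# loan_data is assumed to be a DataFrame of loan records that is already loaded
predictors = ['purpose_', 'home_', 'emp_len_']
outcome = 'outcome'

# One-hot encode the categorical predictors
X = pd.get_dummies(loan_data[predictors], prefix='', prefix_sep='')
y = loan_data[outcome]

# A small alpha applies only light smoothing; fit_prior=True learns class priors from y
naive_model = MultinomialNB(alpha=0.01, fit_prior=True)
naive_model.fit(X, y)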
• The Naive Bayes classifier provides not only a class prediction (e.g., default or paid off) but also a posterior probability estimate: the predicted likelihood of an outcome such as default (Y = 1).
• However, these probability estimates are often biased because of the naive independence assumption among predictors.
• Despite this, Naive Bayes remains effective when the objective is ranking records by their probability of belonging to a class rather than obtaining perfectly accurate probabilities.

5.1.3 Numeric Predictor Variables


The Naive Bayes classifier is designed primarily for categorical predictors, such as in spam
detection where features represent the presence or absence of specific words.
When dealing with numerical predictors, the standard Naive Bayes algorithm cannot be applied directly. Two main approaches are used:
1. Discretization (Binning):
o Convert continuous (numerical) variables into categorical bins or intervals.
o Then apply the standard Naive Bayes algorithm for categorical data.
o Example: Age can be divided into bins such as <20, 20–39, 40–59, 60+.
2. Assume a Probability Distribution:
o Model each numerical predictor using a known probability distribution (commonly the normal distribution).
o Estimate the conditional probability $P(X_j \mid Y = i)$ using that distribution’s parameters (mean and variance) computed from the training data.
o This approach leads to Gaussian Naive Bayes, often used when predictors are continuous.
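The second approach is available in scikit-learn as GaussianNB. The sketch below uses simulated data; the class centres, sample sizes, and variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy data: two numeric predictors, two well-separated classes.
rng = np.random.default_rng(1)
X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))
X1 = rng.normal(loc=[3.0, 3.0], scale=1.0, size=(100, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

model = GaussianNB().fit(X, y)

# Per-class mean of each predictor, estimated from the training data
# (one row per class, one column per predictor).
class_means = model.theta_

# Posterior probabilities for a new record near the class-1 centre.
posterior = model.predict_proba([[3.0, 3.0]])[0]
```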
5.2 Discriminant Analysis
• Linear Discriminant Analysis (LDA) is the most common form of discriminant analysis.
• Fisher’s original method (1936) differs slightly from modern LDA, but the underlying mechanics are similar.
• Usage: Less common today due to more advanced methods like decision trees and logistic regression.
• Relevance: LDA is still used in certain applications and has connections to other techniques, such as Principal Components Analysis (PCA).

5.2.1 Covariance Matrix


• To understand discriminant analysis, it is important to know covariance, which measures how two variables vary together.
• The covariance between variables x and z is calculated as:

$$s_{x,z} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(z_i - \bar{z})}{n - 1}$$

Here,
• $\bar{x}$, $\bar{z}$: means of x and z
• $n$: number of records
• The division by (n – 1) rather than n adjusts for degrees of freedom.
Covariance and Covariance Matrix
• Covariance measures how two variables change together.
o Positive → the variables tend to increase or decrease together.
o Negative → one tends to increase while the other decreases.
o Zero → no linear relationship.
• Relation to Correlation:
o Correlation is the scaled (standardized) form of covariance.
o Correlation ∈ [–1, +1]; covariance has no fixed range (it depends on the scale of the variables).
• Covariance Matrix (Σ):
For variables x and z:

$$\hat{\Sigma} = \begin{bmatrix} s_x^2 & s_{x,z} \\ s_{z,x} & s_z^2 \end{bmatrix}$$

Diagonal elements: variances ($s_x^2$, $s_z^2$)
Off-diagonal elements: covariances ($s_{x,z} = s_{z,x}$)
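As a quick check of these definitions, the covariance can be computed both from the formula and with NumPy. The two arrays below are hypothetical measurements chosen only for illustration.

```python
import numpy as np

# Hypothetical measurements of two variables x and z.
x = np.array([0.65, 0.32, 0.72, 0.28, 0.81, 0.40])
z = np.array([5.2, 10.8, 4.1, 12.3, 3.9, 9.7])
n = len(x)

# Covariance from the definition, dividing by n - 1.
s_xz = np.sum((x - x.mean()) * (z - z.mean())) / (n - 1)

# np.cov treats each row as a variable and divides by n - 1 by default.
cov_matrix = np.cov(np.vstack([x, z]))

# Correlation is the covariance scaled by the two standard deviations.
corr_xz = s_xz / (x.std(ddof=1) * z.std(ddof=1))
```

The diagonal of `cov_matrix` holds the two variances and the off-diagonal entries both equal `s_xz`.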

5.2.2 Fisher’s Linear Discriminant


The goal is to classify records into two groups (binary outcome y) using two continuous variables (x, z).
Assumptions:
Predictor variables are continuous and normally distributed (though LDA is fairly robust to mild violations).
Fisher’s Linear Discriminant seeks a linear combination of predictors:
$$y' = w_x x + w_z z$$
that best separates the two groups by maximizing the ratio:
$$\frac{SS_{\text{between}}}{SS_{\text{within}}}$$
where,
$SS_{\text{between}}$: variation between group means (squared distance between the groups)
$SS_{\text{within}}$: variation within groups (spread around the group means), weighted by the covariance matrix
Intuition:
Maximize between-group separation
Minimize within-group overlap
→ gives the best linear boundary for distinguishing the two classes.
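For two groups, the direction that maximizes this ratio has a well-known closed form: the weights are proportional to the inverse within-group scatter matrix times the difference of the group means. The sketch below checks this numerically on simulated data; the data and all names are assumptions made for illustration.

```python
import numpy as np

# Hypothetical two-group data with a shared covariance structure.
rng = np.random.default_rng(2)
cov = np.array([[1.0, 0.6], [0.6, 1.0]])
group0 = rng.multivariate_normal([0.0, 0.0], cov, size=200)
group1 = rng.multivariate_normal([2.0, 1.0], cov, size=200)

mean0, mean1 = group0.mean(axis=0), group1.mean(axis=0)

# Within-group scatter: sum of the two groups' centred cross-product matrices.
S_within = ((group0 - mean0).T @ (group0 - mean0)
            + (group1 - mean1).T @ (group1 - mean1))

def separation_ratio(w):
    """SS_between / SS_within for the projection onto direction w."""
    between = (w @ (mean1 - mean0)) ** 2
    within = w @ S_within @ w
    return between / within

# Closed-form maximiser: w proportional to S_within^{-1} (mean1 - mean0).
w_fisher = np.linalg.solve(S_within, mean1 - mean0)

best_ratio = separation_ratio(w_fisher)
random_ratios = [separation_ratio(rng.normal(size=2)) for _ in range(1000)]
```

No randomly drawn direction achieves a larger ratio than `w_fisher`, which is what makes it the discriminant direction.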
5.2.3 A Simple Example
Predicting Loan Default
Problem
We want to predict whether a loan applicant will default (y = 1) or pay off (y = 0) using two
numeric predictors:
 borrower_score – measure of creditworthiness (0–1 scale)
 payment_inc_ratio – ratio of monthly payment to income

borrower_score    payment_inc_ratio    outcome
0.65              5.2                  paid off
0.32              10.8                 default
0.72              4.1                  paid off
0.28              12.3                 default
0.81              3.9                  paid off
0.40              9.7                  default

import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Sample data
data = pd.DataFrame({
    'borrower_score': [0.65, 0.32, 0.72, 0.28, 0.81, 0.40],
    'payment_inc_ratio': [5.2, 10.8, 4.1, 12.3, 3.9, 9.7],
    'outcome': ['paid off', 'default', 'paid off', 'default', 'paid off', 'default']
})

X = data[['borrower_score', 'payment_inc_ratio']]
y = data['outcome']

# Fit LDA model
lda = LinearDiscriminantAnalysis()
lda.fit(X, y)

# Show discriminant weights (a single column, because there are two classes)
weights = pd.DataFrame(lda.scalings_, index=X.columns, columns=['Weight'])
print(weights)

# Predict class probabilities; the columns follow the order of lda.classes_
pred = pd.DataFrame(lda.predict_proba(X), columns=lda.classes_)
print(pred)

LDA creates a straight line that separates the “paid off” and “default” regions:
$$w_1 x_1 + w_2 x_2 + c = 0$$
• Points to the left of the line → predicted as default
• Points to the right of the line → predicted as paid off
Confidence increases as a point moves farther from the boundary.
In Python (matplotlib/seaborn), a scatter plot with a color gradient shows:
• Predicted probabilities (e.g., default risk).
• Decision boundary (solid diagonal line).

Key Ideas:
o LDA works for continuous or categorical predictors with categorical outcomes.
o Uses the covariance matrix to compute a linear discriminant function.
o Produces scores/weights that classify each record into a likely group.
o Core objective: maximize between-group variance and minimize within-group variance for optimal class separation.


5.3 Logistic Regression


• Logistic regression is similar to multiple linear regression, but the outcome variable is binary (e.g., 0/1, yes/no).
• It uses a transformation (logit function) to model the relationship between predictors and a binary outcome.
• The model estimates the probability of an event occurring based on input variables.
• Like discriminant analysis, logistic regression is a structured model-based approach (not purely data-driven like KNN or naive Bayes).
• It is computationally efficient and provides an interpretable model for predicting new data quickly.
• Widely used for classification problems and risk prediction.

5.3.1 Logistic Response Function and Logit


In logistic regression, the main components are the logistic response function and the logit transformation. These map probabilities (which range from 0 to 1) onto an unbounded scale suitable for linear modelling.
Rather than directly modelling the probability p of the outcome being “1” as a linear combination of predictors,
$$p = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_q x_q$$
which could yield values outside the valid probability range, logistic regression uses the logistic (inverse logit) function to constrain p between 0 and 1:
$$p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_q x_q)}}$$
This transformation ensures valid probability estimates and forms the basis for interpreting the relationship between predictors and the likelihood of an event.

To better interpret this relationship, we use odds, defined as the ratio of the probability of success to the probability of failure:
$$\text{Odds}(Y = 1) = \frac{p}{1 - p}$$
From this, the probability can be expressed as:
$$p = \frac{\text{Odds}}{1 + \text{Odds}}$$
Combining this with the logistic model gives:
$$\text{Odds}(Y = 1) = e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_q x_q}$$
Taking the natural logarithm of both sides gives the logit function (log-odds form):
$$\log(\text{Odds}(Y = 1)) = \log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_q x_q$$
Thus, logistic regression models a linear relationship between the predictors and the log of the odds of the event.
This allows us to predict probabilities within (0, 1) and classify observations using a chosen cutoff (e.g., $p > 0.5 \Rightarrow$ class 1).
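The relationships between log-odds, odds, and probability can be verified with a few lines of Python. The coefficients and predictor values below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math

def logistic(log_odds):
    """Logistic response (inverse logit): map log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))

def logit(p):
    """Logit: map a probability in (0, 1) to log-odds."""
    return math.log(p / (1.0 - p))

# Hypothetical coefficients and predictor values for a two-predictor model.
beta0, beta1, beta2 = -1.0, 0.8, -0.5
x1, x2 = 2.0, 1.0

log_odds = beta0 + beta1 * x1 + beta2 * x2   # linear predictor
p = logistic(log_odds)                       # probability in (0, 1)
odds = p / (1.0 - p)                         # equals exp(log_odds)
predicted_class = 1 if p > 0.5 else 0
```

Applying `logit` to `p` recovers `log_odds`, which shows that the two functions are inverses of each other.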

5.3.2 Logistic Regression and the GLM


In logistic regression, the response variable represents the log-odds of a binary outcome (e.g.,
success = 1, failure = 0). Since we only observe the binary outcome itself—not the log-odds—
special estimation methods such as maximum likelihood estimation (MLE) are used to fit the
model.
Logistic regression is a special case of the Generalized Linear Model (GLM), which extends
linear regression to handle different types of response variables by using a suitable link function
and error distribution.
• For logistic regression:
o Link function: Logit (log of the odds)
o Error distribution: Binomial
In this logistic regression example, the response variable is outcome, which equals 0 if the loan is paid off and 1 if the loan defaults.
• The predictors include both numeric and categorical (factor) variables:
o purpose_ → purpose of the loan (factor variable)
o home_ → home ownership status (factor variable)
o emp_len_ → employment length (factor variable)
o payment_inc_ratio and borrower_score → numeric predictors
In Python, the logistic regression model is built using the LogisticRegression class from sklearn.linear_model:

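import pandas as pd
from sklearn.linear_model import LogisticRegression

# loan_data is assumed to be a DataFrame of loan records that is already loaded
predictors = ['payment_inc_ratio', 'purpose_', 'home_', 'emp_len_', 'borrower_score']
outcome = 'outcome'
# One-hot encode the factor variables, dropping one level of each as the reference
X = pd.get_dummies(loan_data[predictors], prefix='', prefix_sep='', drop_first=True)
y = loan_data[outcome]
# A very large C makes the L2 penalty negligible
logit_reg = LogisticRegression(penalty='l2', C=1e42, solver='liblinear')
logit_reg.fit(X, y)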

• pd.get_dummies() creates dummy variables for categorical predictors; drop_first=True drops one level of each as the reference category.
• scikit-learn applies L2 regularization by default to prevent overfitting (the penalty and C arguments control it); setting C=1e42 (a very large value) effectively removes it.
• solver='liblinear' specifies the optimization algorithm.
• scikit-learn sorts the class labels, so 'default' comes before 'paid off' and the fitted coefficients describe the log-odds of 'paid off' (the second entry of logit_reg.classes_). Coefficient signs therefore appear reversed compared to R.
• predict() gives class labels, while predict_proba() returns predicted probabilities corresponding to the order in logit_reg.classes_.

5.3.3 Generalized Linear Models


Generalized Linear Models (GLMs) extend traditional linear regression to handle different
types of response variables and relationships. They are defined by two main components:
1. Probability distribution (family): Determines the type of response variable.
o For logistic regression, the family is binomial (binary outcome).
o Other families include Poisson (for count data), negative binomial, and gamma (for modeling time or duration).
2. Link function: A transformation that connects the linear predictor to the expected value of the response variable.
o Logistic regression uses the logit link:
$$\text{logit}(p) = \log\left(\frac{p}{1 - p}\right)$$
o Sometimes, a log link is used instead of a logit, but results are often similar in practical cases.
While logistic regression is the most common GLM and widely applicable, using other GLMs (e.g., Poisson, Gamma) requires deeper statistical understanding, as these models involve more complex assumptions and are sensitive to data characteristics.
5.3.4 Predicted Values from Logistic Regression
In logistic regression, the model predicts the log-odds of the outcome being 1:
$$\hat{Y} = \log(\text{Odds}(Y = 1))$$
To convert this linear prediction into a probability, the logistic response function is applied:
$$\hat{p} = \frac{1}{1 + e^{-\hat{Y}}}$$
This transformation ensures that the predicted probabilities lie between 0 and 1.
In Python (scikit-learn):
• Log-odds (the linear predictor) are obtained using decision_function(); predict_log_proba() returns the logarithm of the class probabilities, not the log-odds.
• Probabilities are obtained directly using predict_proba():
pred = pd.DataFrame(logit_reg.predict_proba(X),
                    columns=logit_reg.classes_)
pred.describe()
The predicted probabilities indicate the likelihood of the outcome (e.g., loan default).
Typically, a cutoff of 0.5 is used to classify outcomes (≥0.5 → default, <0.5 → paid off).
However, when identifying rare events, a lower threshold may be chosen to improve detection of the minority class.
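The link between log-odds, probabilities, class order, and the cutoff can be checked on a small simulated example. The data and variable names are hypothetical; only the scikit-learn methods are real.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data with string class labels, as in the loan example.
rng = np.random.default_rng(3)
x = rng.normal(size=300)
y = np.where(x + rng.normal(scale=0.7, size=300) > 0, "default", "paid off")
X = x.reshape(-1, 1)

model = LogisticRegression().fit(X, y)

# Classes are sorted, so column 0 is 'default' and column 1 is 'paid off'.
classes = list(model.classes_)
proba = model.predict_proba(X)

# decision_function() returns the log-odds of the second class (classes_[1]).
log_odds = model.decision_function(X)
p_from_log_odds = 1.0 / (1.0 + np.exp(-log_odds))

# A lower cutoff flags more records as 'default'.
p_default = proba[:, classes.index("default")]
flagged_at_05 = int((p_default >= 0.5).sum())
flagged_at_03 = int((p_default >= 0.3).sum())
```

Because larger x makes 'default' more likely here, the fitted coefficient (which refers to 'paid off') comes out negative, illustrating how class order affects coefficient signs.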

5.3.5 Interpreting the Coefficients and Odds Ratios


One of the main advantages of logistic regression is that it provides a model that can be easily
applied (scored) to new data and is straightforward to interpret. The interpretation centers on
the concept of the odds ratio, which measures how changes in predictor variables affect the odds
of an outcome (e.g., default vs. paid off).
For a binary predictor variable X:
$$\text{Odds Ratio} = \frac{\text{Odds}(Y = 1 \mid X = 1)}{\text{Odds}(Y = 1 \mid X = 0)}$$
• If the odds ratio > 1, the event Y = 1 has higher odds when X = 1.
• If the odds ratio < 1, the event has lower odds when X = 1.
The logistic regression coefficient $\beta_j$ represents the log of the odds ratio for the variable $X_j$:
$$\beta_j = \log(\text{odds ratio})$$
Hence,
$$\text{odds ratio} = e^{\beta_j}$$
Example interpretations:
• For a categorical variable purpose_small_business with coefficient 1.21526,
$$e^{1.21526} \approx 3.4$$
The odds of default for loans to small businesses are about 3.4 times the odds for credit card loans (the reference category).
• For a numeric variable like payment_inc_ratio with coefficient 0.08244,
$$e^{0.08244} \approx 1.09$$
Each one-unit increase in the payment-to-income ratio multiplies the odds of default by about 1.09, an increase of about 9%.
• For borrower_score with coefficient –4.61264,
$$e^{-4.61264} \approx 0.01$$
Because borrower_score ranges from 0 to 1, moving from the lowest score to the highest multiplies the odds of default by about 0.01: borrowers with excellent creditworthiness have odds of defaulting about 100 times lower than those with poor credit.
Because coefficients are expressed on the log scale, an increase of 1 in a coefficient multiplies the corresponding odds ratio by $e^1 \approx 2.72$.
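The conversions above are easy to reproduce. The sketch below exponentiates the three coefficients quoted in the text; the dictionary layout and variable names are just one convenient arrangement.

```python
import math

# Coefficients quoted in the text (log-odds scale).
coefficients = {
    "purpose_small_business": 1.21526,
    "payment_inc_ratio": 0.08244,
    "borrower_score": -4.61264,
}

# Exponentiating a coefficient gives the odds ratio for a one-unit increase.
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

# Percentage change in the odds for a one-unit increase in each predictor.
pct_change = {name: 100.0 * (ratio - 1.0) for name, ratio in odds_ratios.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio = {ratio:.3f}, change in odds = {pct_change[name]:+.1f}%")
```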

5.3.6 Linear and Logistic Regression: Similarities and Differences


Like linear regression, logistic regression uses a parametric linear form to relate predictors to
the response variable. Model exploration, feature selection, and transformations (such as splines)
can be applied in similar ways to improve model fit and flexibility.
However, logistic regression differs from linear regression in two key aspects:
1. Model Fitting:
• Least squares estimation (used in linear regression) is not applicable because the response variable is binary, not continuous.
• Instead, logistic regression uses maximum likelihood estimation (MLE) to determine the coefficients that best explain the observed outcomes.
2. Residual Analysis:
• The residuals (differences between observed and predicted values) have a different distribution and interpretation compared to linear regression.
• Specialized residual diagnostics (like deviance or Pearson residuals) are used to assess model fit and identify outliers or influential observations.
Thus, while logistic regression retains the linear relationship in form, its estimation and diagnostic procedures are fundamentally different from those of ordinary least squares regression.
Fitting the Model:
In linear regression, the model is fit using least squares, and the quality of fit is measured with metrics like RMSE and R-squared.
In logistic regression, however:
• There is no closed-form solution for the coefficients, because the likelihood equations are nonlinear in the coefficients.
• The model is fit using Maximum Likelihood Estimation (MLE), which finds the parameter values that make the observed data most probable.
• The logistic regression response is modeled as the log-odds of the outcome being 1, rather than 0 or 1 directly.
• MLE iteratively updates coefficients using algorithms like quasi-Newton optimization or Fisher scoring, improving the fit at each step.
For most practitioners, the software handles the optimization, so it is sufficient to understand
that MLE finds the best-fitting logistic model under certain assumptions.
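To make the iterative fitting concrete, here is a minimal sketch of maximum likelihood estimation by Newton's method, which for the logit link coincides with Fisher scoring. The simulated data, the function name `fit_logistic_mle`, and the stopping rule are assumptions made for illustration; production software adds safeguards omitted here.

```python
import numpy as np

def fit_logistic_mle(X, y, n_iter=25, tol=1e-10):
    """Fit logistic regression by Newton's method; X must include an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))         # current fitted probabilities
        gradient = X.T @ (y - p)                    # gradient of the log-likelihood
        weights = p * (1.0 - p)
        information = X.T @ (X * weights[:, None])  # Fisher information matrix
        step = np.linalg.solve(information, gradient)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Hypothetical data simulated from known coefficients (intercept -0.5, slope 1.5).
rng = np.random.default_rng(4)
x = rng.normal(size=2000)
X = np.column_stack([np.ones_like(x), x])
true_beta = np.array([-0.5, 1.5])
y = (rng.random(2000) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)

beta_hat = fit_logistic_mle(X, y)
p_hat = 1.0 / (1.0 + np.exp(-X @ beta_hat))
gradient_at_optimum = X.T @ (y - p_hat)
```

At the maximum likelihood estimate the gradient is numerically zero, and `beta_hat` lands close to the coefficients used to simulate the data.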

5.3.7 Assessing the Model


1. Model Evaluation:
Logistic regression is evaluated on classification performance rather than RMSE or R-squared.
Metrics commonly used include accuracy, precision, recall, F1-score, and AUC-ROC.
2. Coefficient Interpretation:
Summary outputs (e.g., in R using summary(logistic_model) or in Python using statsmodels GLM) provide:
Estimate (β) – regression coefficients
Standard Error (SE) – uncertainty of the coefficient estimates
z-value – ratio of the estimate to its SE
p-value – best read as a relative indicator of variable importance rather than a formal test of significance
Example: purpose_small_business coefficient of 1.21526 → odds of default about 3.4 times those of the reference category.
3. Extensions from Linear Regression:
Stepwise selection, interaction terms, spline terms, and generalized additive models (GAMs) are applicable.
In R: gam() function with family='binomial'
In Python: statsmodels.formula.api.glm() supports spline terms using bs() (B-splines).
4. Residual Analysis:
• Residuals differ from linear regression due to the binary nature of the outcome.
• Partial residuals help visualize the effect of a predictor and detect nonlinear behavior.
• In logistic regression, residuals lie in two clouds corresponding to 0s and 1s because the observed outcomes are binary, while predictions are log-odds.
• Partial residual plots can still identify influential observations and nonlinear patterns.
• R supports partial residuals; Python requires custom implementation.
5. Note
• The dispersion parameter in the R summary is not relevant for logistic regression.
• Residual deviance and the number of scoring iterations relate to maximum likelihood fitting.
Key Takeaway: Logistic regression combines interpretability of coefficients, flexibility with
GLM extensions, and classification-based evaluation, but residual analysis and goodness-of-fit
metrics differ fundamentally from linear regression.

5.4 Evaluating Classification Models


In predictive modeling, it is standard practice to:
1. Train multiple models on a training dataset.
2. Validate each model on a holdout sample to assess performance.
3. If enough data are available, use a third holdout (test) sample to estimate performance
on completely new data.
o The terms validation and test are often used interchangeably across disciplines.
Measuring Accuracy:
• Accuracy is a simple and common metric for classification performance:

$$\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Sample Size}} = \frac{\sum \text{True Positives} + \sum \text{True Negatives}}{\text{Sample Size}}$$
Decision Cutoff:
• Most classification algorithms predict a probability of being 1.
• The default cutoff is usually 0.5:
o Probability ≥ 0.5 → class 1
o Probability < 0.5 → class 0
• Alternative cutoffs can be used based on the prevalence of 1s in the data or specific business goals, especially when dealing with imbalanced classes.

5.4.1 Confusion Matrix


The confusion matrix is a fundamental tool for evaluating classification models. It summarizes
correct and incorrect predictions for each class in a table format.
• Binary classification convention:
o Y = 1 → event of interest (e.g., loan default)
o Y = 0 → negative or usual event (e.g., loan paid off)
The confusion matrix has four key components:

            Predicted 1            Predicted 0
Actual 1    True Positive (TP)     False Negative (FN)
Actual 0    False Positive (FP)    True Negative (TN)

• True Positive (TP): Correctly predicted 1s
• True Negative (TN): Correctly predicted 0s
• False Positive (FP): Predicted 1 but actual 0
• False Negative (FN): Predicted 0 but actual 1
Usage:
 Compute by hand or using packages in R (table(), caret::confusionMatrix()) or Python
(sklearn.metrics.confusion_matrix).
 For example, applying it to the logistic_gam model trained on a balanced dataset of
defaulted vs. paid-off loans allows you to see how well the model classifies each
outcome.
From the confusion matrix, you can derive other performance metrics like accuracy, precision,
recall, and F1-score.
In Python:

import numpy as np
import pandas as pd

# Boolean masks for predicted and actual defaults
pred_y = logit_reg.predict(X) == 'default'
true_y = y == 'default'
true_pos = true_y & pred_y
true_neg = ~true_y & ~pred_y
false_pos = ~true_y & pred_y
false_neg = true_y & ~pred_y

# Rows are actual outcomes, columns are predicted outcomes
conf_mat = pd.DataFrame([[np.sum(true_pos), np.sum(false_neg)],
                         [np.sum(false_pos), np.sum(true_neg)]],
                        index=['Y = default', 'Y = paid off'],
                        columns=['Yhat = default', 'Yhat = paid off'])
conf_mat

In a confusion matrix for binary classification, laid out as above with the positive class (default) listed first:
• Rows correspond to the actual outcomes.
• Columns correspond to the predicted outcomes.
• Diagonal elements represent correct predictions:
o Upper-left: True Positives (TP)
o Lower-right: True Negatives (TN)
• Off-diagonal elements represent incorrect predictions:
o Upper-right: False Negatives (FN)
o Lower-left: False Positives (FP)
Example (Loan Defaults):
• 14,295 defaulted loans correctly predicted as defaults (TP)
• 8,376 defaulted loans incorrectly predicted as paid off (FN)
Key Metrics Derived from the Confusion Matrix:
• Accuracy: (TP + TN) / total → overall proportion of correct predictions
• Precision: TP / (TP + FP) → proportion of predicted positives that are correct
• Recall (Sensitivity): TP / (TP + FN) → proportion of actual positives correctly identified
• Specificity: TN / (TN + FP) → proportion of actual negatives correctly identified
• False Positive Rate: FP / (FP + TN) = 1 – Specificity → proportion of actual negatives incorrectly flagged as positive
Important Note:
When 1s (positives) are rare, false positives can outnumber true positives even when the false positive rate is low, so a predicted positive may be more likely to be a negative in reality. This phenomenon occurs in medical screening tests, such as mammograms, where most positive results are false positives due to the rarity of the condition.
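A short calculation shows how this happens. All numbers below (population size, prevalence, sensitivity, specificity) are hypothetical and chosen only to illustrate the effect.

```python
# Hypothetical screening scenario: all numbers below are illustrative assumptions.
population = 100_000
prevalence = 0.01      # 1% of records are actual positives (1s)
sensitivity = 0.90     # recall: share of actual 1s flagged as 1
specificity = 0.90     # share of actual 0s correctly left as 0

actual_pos = population * prevalence
actual_neg = population - actual_pos

true_pos = sensitivity * actual_pos
false_neg = actual_pos - true_pos
true_neg = specificity * actual_neg
false_pos = actual_neg - true_neg

false_positive_rate = false_pos / (false_pos + true_neg)
precision = true_pos / (true_pos + false_pos)
```

Under these assumed numbers, false positives outnumber true positives eleven to one, so only about 8% of predicted positives are actual positives even though the classifier is right 90% of the time within each class.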

5.4.2 The Rare Class Problem


In many predictive modelling problems, the classes are imbalanced, meaning one class (e.g.,
legitimate claims, non-purchasers) is much more prevalent than the other (e.g., fraudulent claims,
purchasers).
• The rare class is often the class of interest and is usually labeled 1.
• Misclassifying the rare class (false negative) is often more costly than misclassifying the common class (false positive).
o Example: Identifying a fraudulent insurance claim saves significant money, while misclassifying a legitimate claim as fraudulent incurs smaller costs.
Problem with Imbalance:
• A naive model that predicts only the majority class (0) can achieve high accuracy but is practically useless.
o Example: If only 0.1% of website visitors purchase, predicting all as non-purchasers yields 99.9% accuracy but fails to detect actual purchasers.
Practical Approach:
• Focus on metrics that emphasize the rare class, such as:
o Recall / Sensitivity for the rare class
o Precision for the rare class
o F1-score (harmonic mean of precision and recall)
• Accept some loss in overall accuracy in order to better detect the important rare events.
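The website-visitor example can be reproduced in a few lines. The data are simulated to match the 0.1% purchase rate mentioned above; the array sizes and names are assumptions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical data: 10,000 visitors, 0.1% of whom purchase (class 1).
y_true = np.zeros(10_000, dtype=int)
y_true[:10] = 1

# A naive "model" that predicts the majority class for everyone.
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
recall_rare = recall_score(y_true, y_pred, pos_label=1)
```

Accuracy is 99.9% while recall for the rare class is zero, which is why accuracy alone is a poor guide for imbalanced data.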

5.4.3 Precision, Recall, and Specificity


In addition to overall accuracy, several nuanced metrics are used to evaluate classification
models, especially when dealing with imbalanced classes. These metrics are widely used in
statistics and biostatistics for diagnostic tests.
1. Precision (Positive Predictive Value):
Measures the accuracy of predicted positives:
$$\text{Precision} = \frac{\text{True Positives (TP)}}{\text{TP} + \text{False Positives (FP)}}$$
2. Recall (Sensitivity, True Positive Rate):
Measures the ability to correctly identify actual positives:
$$\text{Recall} = \frac{\text{True Positives (TP)}}{\text{TP} + \text{False Negatives (FN)}}$$
The term sensitivity is common in biostatistics; recall is used in machine learning.
3. Specificity (True Negative Rate):
Measures the ability to correctly identify actual negatives:
$$\text{Specificity} = \frac{\text{True Negatives (TN)}}{\text{TN} + \text{False Positives (FP)}}$$
In Python:
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Labels are sorted, so index 0 is 'default' (the positive class) and index 1 is 'paid off'
conf_mat = confusion_matrix(y, logit_reg.predict(X))
precision = conf_mat[0, 0] / conf_mat[:, 0].sum()    # TP / all predicted defaults
recall = conf_mat[0, 0] / conf_mat[0, :].sum()       # TP / all actual defaults
specificity = conf_mat[1, 1] / conf_mat[1, :].sum()  # TN / all actual paid-off loans

# Alternatively, calculate all at once
precision_recall_fscore_support(y, logit_reg.predict(X), labels=['default', 'paid off'])

Note: These metrics are especially valuable in cases with rare positive events, where overall
accuracy can be misleading.

5.4.5 ROC Curve: Receiver Operating Characteristic

In classification, there is often a trade-off between recall (sensitivity) and specificity:
• Increasing recall (capturing more positives) generally increases false positives, lowering specificity.
• An ideal classifier maximizes recall without sacrificing specificity.
The Receiver Operating Characteristic (ROC) curve visualizes this trade-off.

Key Features of the ROC Curve:
• Y-axis: Recall / Sensitivity (True Positive Rate)
• X-axis, either of:
o Specificity (1 on the left, 0 on the right)
o 1 – Specificity, i.e. the False Positive Rate (0 on the left, 1 on the right)
• Both representations produce identical curves.
Steps to Compute an ROC Curve:
1. Sort records by predicted probability of being a positive (1), from highest to lowest.
2. Compute cumulative recall and specificity as you move down the sorted list,
generating the ROC points.
The ROC curve allows you to select an appropriate cutoff based on the trade-off between
detecting positives and avoiding false positives.
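The two steps can be carried out directly with NumPy. The sketch below uses simulated scores (any score that ranks records works, not only probabilities) and compares the area under the hand-built curve with scikit-learn's roc_auc_score; the data and names are assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical scores: positives tend to score higher than negatives.
rng = np.random.default_rng(5)
y_true = np.array([1] * 100 + [0] * 400)
scores = np.concatenate([rng.normal(1.0, 1.0, 100), rng.normal(0.0, 1.0, 400)])

# Step 1: sort records by predicted score, highest first.
order = np.argsort(-scores)
y_sorted = y_true[order]

# Step 2: cumulative recall and false positive rate (1 - specificity) down the list.
recall = np.concatenate([[0.0], np.cumsum(y_sorted) / y_sorted.sum()])
fpr = np.concatenate([[0.0], np.cumsum(1 - y_sorted) / (1 - y_sorted).sum()])

# Area under the curve by the trapezoid rule.
auc_manual = np.sum(np.diff(fpr) * (recall[1:] + recall[:-1]) / 2.0)
auc_sklearn = roc_auc_score(y_true, scores)
```

Each position in the sorted list corresponds to one candidate cutoff, so plotting `recall` against `fpr` traces the ROC curve point by point.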

5.4.6 AUC: Area under the Curve


While the ROC curve visually shows the trade-off
between recall (sensitivity) and specificity, it does not
provide a single performance metric. The Area Under
the Curve (AUC) is a summary statistic derived from
the ROC curve:
 AUC = 1: Perfect classifier (all 1s correctly
classified, no 0s misclassified).
 AUC = 0.5: Classifier no better than random
guessing (diagonal line).
 Higher AUC → better classifier.
In Python:
from sklearn.metrics import roc_auc_score
# Column 0 of predict_proba holds the probability of 'default', the class coded as 1 here
roc_auc_score([1 if yi == 'default' else 0 for yi in y], logit_reg.predict_proba(X)[:, 0])
Interpretation:
 The loan model example has AUC ≈ 0.69, indicating a relatively weak classifier.
 AUC provides a single, threshold-independent measure of model discriminative
ability.
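
One way to build intuition for AUC: it equals the probability that a randomly chosen 1 receives a higher score than a randomly chosen 0. The sketch below computes AUC that way on toy data. The function name `auc_by_pairs` and the data are assumptions for illustration; in practice `roc_auc_score` does the work.

```python
def auc_by_pairs(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities.
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.35, 0.3, 0.2, 0.1]

auc = auc_by_pairs(y_true, scores)
auc_perfect = auc_by_pairs([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
print(round(auc, 3), auc_perfect)
```

Here 11 of the 15 positive/negative pairs are ordered correctly, so AUC = 11/15; a classifier that ranks every 1 above every 0 reaches AUC = 1.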


5.4.7 Lift
Using AUC improves model evaluation over simple accuracy because it considers the trade-off
between identifying positives (1s) and overall accuracy, but it does not fully solve rare-class
problems:
 When positives are rare, a cutoff < 0.5 may be necessary to avoid classifying all records
as 0.
o Example: Classifying records with probability ≥ 0.3 as 1 to catch more rare
events.
 Lowering the cutoff increases recall for the rare class but also increases false positives.
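
The effect of lowering the cutoff can be checked on a small example. The labels and probabilities below are invented for illustration, and the helper names `classify` and `recall_and_fp` are hypothetical:

```python
# Hypothetical data: 3 rare positives among 10 records.
y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
probs = [0.62, 0.41, 0.08, 0.35, 0.12, 0.33, 0.05, 0.28, 0.22, 0.15]

def classify(probs, cutoff):
    """Label a record 1 when its predicted probability reaches the cutoff."""
    return [1 if p >= cutoff else 0 for p in probs]

def recall_and_fp(y_true, y_pred):
    """Return (recall, number of false positives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), fp

recall_05, fp_05 = recall_and_fp(y_true, classify(probs, 0.5))
recall_03, fp_03 = recall_and_fp(y_true, classify(probs, 0.3))
print(recall_05, fp_05)   # cutoff 0.5
print(recall_03, fp_03)   # cutoff 0.3
```

With the cutoff at 0.5 only one of the three positives is caught and there are no false positives; at 0.3 two are caught, at the price of two false positives.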
Lift (or Gains) Metric:
 Measures how much better the model performs in identifying 1s compared to random
selection.
 Example: Top 10% of records by predicted probability may yield 0.3% positive rate vs.
0.1% if selected randomly → lift = 3.
 Lift chart / Gains chart:
o X-axis: Cumulative records (or deciles)
o Y-axis: Cumulative recall (percentage of 1s captured)
o Lift curve: Ratio of cumulative gains to the diagonal (random selection)
 Useful for identifying optimal probability cutoff in practice, especially under resource
constraints.
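
A decile lift table along these lines can be sketched as follows. The function name `decile_lift` and the toy data are assumptions; the sketch also assumes the number of records divides evenly into the bins.

```python
def decile_lift(y_true, scores, n_bins=10):
    """Return (bin number, lift, cumulative gain) for each bin of sorted records."""
    # Rank records from highest to lowest predicted probability.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [y_true[i] for i in order]
    total_pos = sum(ranked)
    overall_rate = total_pos / len(ranked)   # positive rate under random selection
    bin_size = len(ranked) // n_bins         # assumes an even split

    rows = []
    captured = 0
    for b in range(n_bins):
        chunk = ranked[b * bin_size:(b + 1) * bin_size]
        captured += sum(chunk)
        lift = (sum(chunk) / len(chunk)) / overall_rate
        cumulative_gain = captured / total_pos   # share of all 1s captured so far
        rows.append((b + 1, lift, cumulative_gain))
    return rows

# Hypothetical data: 20 records with 4 positives, mostly near the top of the ranking.
scores = [1 - i / 20 for i in range(20)]
y_true = [1 if i in (0, 1, 3, 8) else 0 for i in range(20)]

rows = decile_lift(y_true, scores)
for decile, lift, gain in rows:
    print(decile, round(lift, 2), round(gain, 2))
```

In this toy ranking the top decile has a positive rate five times the overall rate (lift = 5) and already captures half of all 1s.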
Applications:
 Direct mail marketing: Target top prospects efficiently.
 Tax audits: Select returns most likely to be fraudulent given limited audit resources.
 Marketing / political campaigns: Determine uplift—improvement in outcome due to
treatment A vs. B for individual cases.
Note:
 Lift charts quantify model effectiveness for the rare class and help decide a practical
probability cutoff aligned with business or resource priorities.


5.5 Strategies for Imbalanced Data


When dealing with imbalanced datasets—where the outcome of interest (e.g., purchase, fraud)
is rare—standard evaluation metrics like accuracy can be misleading. Beyond using metrics such
as precision, recall, specificity, AUC, and lift, additional strategies can improve model
performance for the rare class.

5.5.1 Undersampling
When the dataset is large, one effective strategy to handle imbalanced classes is undersampling
the majority class (0s):
o The dominant class often contains redundant records.
o Removing some of these records creates a more balanced dataset, improving
model performance and simplifying data preparation.
 Benefits:
o Reduces computational burden.
o Makes it easier to explore and pilot models.
o Helps the model better learn patterns for the minority class (1s).
 How much data is enough?
o Depends on the application.


o Generally, having tens of thousands of records for the less dominant class is
sufficient.
o If the classes are easily distinguishable, less data may suffice.
 Example (Loan Data):
o Training set was balanced: 50% paid off, 50% defaulted.
o Predicted probabilities roughly split around 0.5.
o In the full dataset, only ~19% of loans were in default, illustrating the original
imbalance.
Undersampling is particularly useful when the majority class is overwhelming, allowing the
model to focus on learning the characteristics of the minority class effectively.
In Python:
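import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# full_train_set is assumed to be the full (imbalanced) loan DataFrame loaded earlier
predictors = ['payment_inc_ratio', 'purpose_', 'home_', 'emp_len_',
              'dti', 'revol_bal', 'revol_util']
outcome = 'outcome'

# One-hot encode the categorical predictors, dropping the first level of each
X = pd.get_dummies(full_train_set[predictors], prefix='', prefix_sep='',
                   drop_first=True)
y = full_train_set[outcome]

# A very large C effectively switches off the L2 penalty
full_model = LogisticRegression(penalty='l2', C=1e42, solver='liblinear')
full_model.fit(X, y)
print('percentage of loans predicted to default: ',
      100 * np.mean(full_model.predict(X) == 'default'))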

Training on imbalanced data can severely underpredict rare events.


 Example: Only 0.39% of loans predicted as default vs. 19% actual.
Reason: Majority class dominates; defaulting loans may resemble nondefaulting ones.
Balanced sampling (50% defaults, 50% paid off) improves predictions, giving ~50%
predicted defaults.
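
Balanced sampling by undersampling the majority class might look like the sketch below. The label array is synthetic (190 defaults and 810 paid-off loans, chosen to mirror the ~19% default rate mentioned above), and all variable names are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic outcome labels: about 19% 'default', the rest 'paid off'.
y = np.array(['default'] * 190 + ['paid off'] * 810)

minority_idx = np.flatnonzero(y == 'default')
majority_idx = np.flatnonzero(y == 'paid off')

# Keep every minority record; draw an equally sized random subset of the majority.
keep_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)

balanced_idx = np.concatenate([minority_idx, keep_majority])
rng.shuffle(balanced_idx)            # avoid having all defaults first

y_balanced = y[balanced_idx]
print(len(y_balanced), np.mean(y_balanced == 'default'))
```

The same `balanced_idx` would be used to select the matching rows of the predictor matrix before fitting the model.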

5.5.2 Oversampling and Up/Down Weighting


Issue with Undersampling: Removes data from the dominant class, risking loss of useful
information, especially in small datasets.
Alternatives:
1. Oversampling: Duplicate or bootstrap rare-class records to balance the dataset.


2. Weighting: Assign higher weights to rare-class records during model training.


Example (Loan Data):
 Without weighting: only 0.39% of loans predicted as default.
 With weighting (or upsampling): about 58% of loans are predicted as default, because the
rare and common classes now have balanced influence on the fit.
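
A minimal sketch of up-weighting with scikit-learn, using the `sample_weight` argument of `fit`. The data are synthetic (one predictor, 5% positives) and are not the loan data; the weighting scheme shown, which gives both classes equal total weight, is one reasonable choice rather than the only one.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic, overlapping classes: 950 zeros and 50 ones.
n_neg, n_pos = 950, 50
X = np.concatenate([rng.normal(0.0, 1.0, size=(n_neg, 1)),
                    rng.normal(1.0, 1.0, size=(n_pos, 1))])
y = np.array([0] * n_neg + [1] * n_pos)

# Unweighted fit: the majority class dominates.
plain = LogisticRegression().fit(X, y)

# Up-weight each rare-class record so both classes carry equal total weight.
wt = np.where(y == 1, n_neg / n_pos, 1.0)
weighted = LogisticRegression().fit(X, y, sample_weight=wt)

frac_plain = plain.predict(X).mean()
frac_weighted = weighted.predict(X).mean()
print('share predicted as 1 (unweighted):', frac_plain)
print('share predicted as 1 (weighted):  ', frac_weighted)
```

The weighted model predicts the rare class far more often than the unweighted one, which is the same effect described for the loan data.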

5.5.3 Data Generation


Create new synthetic records by perturbing existing minority-class records, giving the
algorithm more examples to learn robust classification rules.
SMOTE (Synthetic Minority Oversampling Technique):
 Selects a minority-class record and a similar neighbor.
 Generates a synthetic record as a randomly weighted average of the two.
 Number of synthetic records depends on the desired oversampling ratio.
Purpose: Improves model learning without simply duplicating records.
Implementation:
 R: the unbalanced package, or a direct implementation of SMOTE using the FNN package.
 Python: imbalanced-learn package, compatible with scikit-learn, supports SMOTE,
oversampling, undersampling, and integration with ensemble methods like boosting and
bagging.
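
The core SMOTE idea (pick a minority record, pick a near neighbor, take a randomly weighted average) can be sketched with NumPy. This is a simplified illustration, not the imbalanced-learn implementation; the function name `smote_like`, its parameters, and the toy records are assumptions.

```python
import numpy as np

def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic records by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    synthetic = []
    for _ in range(n_new):
        # Select a minority-class record at random.
        base = minority[rng.integers(len(minority))]
        # Find its k nearest minority neighbors (position 0 is the record itself,
        # assuming there are no exact duplicates).
        dists = np.linalg.norm(minority - base, axis=1)
        neighbors = np.argsort(dists)[1:k + 1]
        neighbor = minority[rng.choice(neighbors)]
        # Randomly weighted average of the record and its neighbor.
        w = rng.random()
        synthetic.append(base + w * (neighbor - base))
    return np.array(synthetic)

# Hypothetical minority-class records with two numeric predictors.
minority = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5], [1.8, 2.6]])
synthetic = smote_like(minority, n_new=5)
print(synthetic)
```

Because each synthetic record lies on the line segment between two real minority records, it stays inside the region the minority class already occupies.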

5.5.4 Cost-Based Classification


Limitation of Accuracy/AUC: They do not account for the different costs of misclassification.
Cost-Based Decision: Assign costs/returns to outcomes, e.g.,
 C: expected cost of a loan default (entered as a negative value, since it is a loss)
 R: return from paid-off loan
 Expected return:
Expected return = 𝑃(𝑌 = 0) × 𝑅 + 𝑃(𝑌 = 1) × 𝐶
Application: Use expected return to decide whether to approve a loan, rather than just
classifying based on probability.
Benefit: Incorporates business value, allowing decisions that maximize profit rather than just
prediction accuracy.
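
A small worked example of the expected-return rule. The monetary figures (R = +1,000, C = -5,000) and the function name `expected_return` are hypothetical; C is entered as a negative number because it is a loss.

```python
def expected_return(p_default, return_if_paid, cost_if_default):
    """P(Y = 0) * R + P(Y = 1) * C, where Y = 1 means the loan defaults."""
    return (1 - p_default) * return_if_paid + p_default * cost_if_default

R = 1000     # hypothetical return from a paid-off loan
C = -5000    # hypothetical cost of a default, as a negative return

for p in (0.05, 0.15, 0.30):
    value = expected_return(p, R, C)
    decision = 'approve' if value > 0 else 'decline'
    print(f"P(default) = {p:.2f}   expected return = {value:8.2f}   {decision}")
```

With these figures a loan is worth approving only while the default probability stays below R / (R - C) = 1,000 / 6,000, about 0.17, which is a much lower cutoff than 0.5.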


5.5.5 Exploring the Predictions


A single metric, such as AUC, cannot evaluate all aspects of the suitability of a model for a
situation. Figure 5-8 displays the decision rules for four different models fit to the loan data
using just two predictor variables: borrower_score and payment_inc_ratio.

Example (Loan Data):


 Linear Discriminant Analysis (LDA) and Logistic Linear Regression: Produce nearly
identical, smooth linear decision boundaries.
 Tree Model: Produces a less regular, stepwise decision boundary.
 Logistic GAM: Provides a compromise—more flexible than linear models, smoother
than tree models.
Note:
Different modeling approaches can yield different decision rules, highlighting the need to
consider interpretability, smoothness, and flexibility alongside metrics like AUC.
