
Military Institute of Science and Technology

Department of Electrical, Electronic & Communication Engineering

Course Code: 408


Course: Artificial Intelligence & Machine Learning Laboratory

Exp. No.-2

Name of the Exp.: A Theoretical Deep Dive into Logistic Regression.

Introduction
In the field of machine learning, supervised learning is a fundamental category where an algorithm
learns from a labeled dataset. This means each data point is tagged with a correct output, or "label."
The goal is to learn a mapping function that can predict the output for new, unseen data. Supervised
learning problems are primarily divided into two types: regression, which predicts continuous numerical
values, and classification, which assigns data to discrete categories. This experiment provides a detailed
theoretical exploration of Logistic Regression, a cornerstone algorithm for solving binary classification
problems, where the goal is to determine which of two groups a data point belongs to.

Objective of the Experiment


1. To understand the fundamental concepts of supervised machine learning for classification tasks.
2. To learn the detailed theory behind the Logistic Regression model, including its mathematical
formulation.
3. To deeply understand the role and properties of the sigmoid activation function.

4. To perform a technical analysis of the Log-Loss (Binary Cross-Entropy) cost function.


5. To learn how to interpret the coefficients of logistic regression models with continuous features, binary features, and multiple features.
6. To understand the optimization process for finding the model’s parameters.

Theory
1. The Logistic Regression Model
Despite its name, Logistic Regression is a supervised learning algorithm used for classification. It works
by predicting the probability that an observation falls into a particular class based on its features. While
it can be extended for more than two categories, its primary use is for binary classification.
The model’s operation is a two-step process. First, it calculates a linear predictor, denoted as z,
which is a weighted sum of the input features. This is mathematically identical to the equation used in
linear regression:
z = β̂0 + β̂1x1 + β̂2x2 + ... + β̂kxk
Second, this linear predictor z is transformed by the sigmoid function to produce an output between
0 and 1, which can be interpreted as a probability. The final equation gives the probability that the
outcome y belongs to class 1, given the input features X:
P(y = 1 | X) = sigmoid(z) = 1 / (1 + e^(−z))
This resulting probability is then compared to a classification threshold (typically 0.5) to assign the
observation to a class. If the probability is higher than the threshold, the model predicts class 1;
otherwise, it predicts class 0.
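As a minimal numerical sketch of this two-step process (the coefficient values and feature vector below are made up purely for illustration):

import numpy as np

# illustrative (made-up) coefficients: intercept beta_0 and weights beta_1, beta_2
beta = np.array([-1.0, 0.8, 0.3])
# one observation, with a leading 1 so the intercept is handled by the dot product
x = np.array([1.0, 2.5, -0.4])

# Step 1: linear predictor z = beta_0 + beta_1*x1 + beta_2*x2
z = np.dot(beta, x)

# Step 2: squash z through the sigmoid to obtain a probability
p = 1.0 / (1.0 + np.exp(-z))

# apply the 0.5 classification threshold
predicted_class = 1 if p > 0.5 else 0
print(f"z = {z:.3f}, P(y=1|X) = {p:.3f}, predicted class = {predicted_class}")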

2. Activation Function: The Sigmoid


The sigmoid function is the characteristic activation function of logistic regression. It is responsible for
converting the unbounded output of the linear predictor, z, into a bounded probability. The formula for
the sigmoid function is:
f(x) = 1 / (1 + e^(−x))
Key properties of the sigmoid function include:
• Output Range: It squashes any real-valued input into a range between 0 and 1, which is essential
for interpreting the output as a probability.
• Asymptotic Behavior: As the input x approaches positive infinity, the output f (x) approaches
1. Conversely, as x approaches negative infinity, f (x) approaches 0.
• Monotonicity: The function is strictly increasing, meaning a higher input value z always results in a higher predicted probability.
While other activation functions like ReLU (f(x) = max(0, x)) and Tanh (f(x) = (e^x − e^(−x)) / (e^x + e^(−x))) are common in more complex models like neural networks, the sigmoid function is fundamental to standard logistic regression.
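These properties can be checked numerically with a short sketch (the sample inputs are arbitrary):

import numpy as np

def sigmoid(x):
    # f(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

z = np.array([-20.0, -2.0, 0.0, 2.0, 20.0])
p = sigmoid(z)

print(p)                        # every output lies strictly between 0 and 1
print(p[0], p[-1])              # close to 0 and close to 1 at the extremes
print(np.all(np.diff(p) > 0))   # True: monotonically increasing on these inputs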

3. The Cost Function: Log-Loss (Binary Cross-Entropy)


To train the model, we need a way to measure how well it is performing. This is done using a cost
function (or loss function). For logistic regression, the appropriate cost function is called Log-Loss or
Binary Cross-Entropy. The goal is to find the model parameters (βi ) that minimize this function.
The combined Log-Loss function for a single prediction is:

Loss = −[y log(p̂) + (1 − y) log(1 − p̂)]

Where y is the true label (0 or 1) and p̂ is the predicted probability. Let’s analyze its two parts:

• Case 1: True Class is 1 (y = 1): The loss function simplifies to Loss = − log(p̂).
– If the model correctly predicts a probability p̂ close to 1, the loss, − log(p̂), approaches 0.
– If the model incorrectly predicts a probability p̂ close to 0, the loss, − log(p̂), approaches
infinity. This heavily penalizes confident wrong predictions.

• Case 2: True Class is 0 (y = 0): The loss function simplifies to Loss = − log(1 − p̂).
– If the model correctly predicts a probability p̂ close to 0, the term (1 − p̂) is close to 1, and
the loss, − log(1 − p̂), approaches 0.
– If the model incorrectly predicts a probability p̂ close to 1, the term (1 − p̂) is close to 0, and
the loss approaches infinity.

The total cost over all n samples in the dataset is the sum of the individual losses:

Log-Loss = − Σ (i = 1 to n) [yi log(p̂i) + (1 − yi) log(1 − p̂i)]

Minimizing this Log-Loss function is equivalent to maximizing the Log-Likelihood of the parameters, a
concept from statistical estimation.
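As a small hedged sketch, the per-sample and total Log-Loss can be computed directly and checked against scikit-learn's log_loss (the labels and probabilities below are invented toy values):

import numpy as np
from sklearn.metrics import log_loss

# toy true labels and predicted probabilities (made up for illustration)
y_true = np.array([1, 0, 1, 1, 0])
p_hat  = np.array([0.9, 0.2, 0.6, 0.4, 0.1])

# per-sample loss: -[y*log(p_hat) + (1 - y)*log(1 - p_hat)]
per_sample = -(y_true * np.log(p_hat) + (1 - y_true) * np.log(1 - p_hat))
total = per_sample.sum()

print(per_sample.round(3))
print("total Log-Loss:", round(total, 3))
# sklearn's log_loss with normalize=False returns the same sum
print("sklearn:", round(log_loss(y_true, p_hat, normalize=False), 3))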

4. Model Training and Optimization


The process of finding the optimal coefficients (β̂0, β̂1, ..., β̂k) that minimize the Log-Loss function is called
training or optimization. The two primary methods are Gradient Descent and Maximum Likelihood
Estimation (MLE).
• Gradient Descent: This is a common iterative optimization algorithm used in machine learning.
The process involves:
1. Selecting initial values for the parameters.
2. Calculating the gradient of the Log-Loss cost function with respect to each parameter. The
gradient is a vector that points in the direction of the steepest increase of the function.
3. Updating the parameters by taking a small step in the direction opposite to the gradient.
This step size is controlled by a value called the learning rate.
4. Repeating this process until the parameters converge to values that minimize the cost function,
resulting in a sigmoid curve that best fits the data.
• Maximum Likelihood Estimation (MLE): This is a statistical method that aims to find the
parameter values that maximize the likelihood of observing the actual data. For logistic regres-
sion, minimizing the Log-Loss is equivalent to maximizing the Log-Likelihood, making these two
approaches two sides of the same coin.
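The following is a minimal gradient-descent sketch on a small synthetic dataset; it is not the optimizer used by scikit-learn or statsmodels, but it illustrates steps 1 to 4 above (the data, learning rate, and iteration count are assumptions for demonstration):

import numpy as np

rng = np.random.default_rng(0)

# synthetic dataset: 200 samples, 2 features, labels generated from a simple rule
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)

# prepend a column of ones so beta[0] plays the role of the intercept
Xb = np.hstack([np.ones((X.shape[0], 1)), X])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

beta = np.zeros(Xb.shape[1])    # step 1: initial parameter values
learning_rate = 0.1

for _ in range(1000):
    p = sigmoid(Xb @ beta)                  # current predicted probabilities
    gradient = Xb.T @ (p - y) / len(y)      # step 2: gradient of the (mean) Log-Loss
    beta = beta - learning_rate * gradient  # step 3: step opposite to the gradient

print("estimated coefficients:", beta.round(3))  # step 4: approximately converged values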

5. Interpreting Model Coefficients


The coefficients (β) in a logistic regression model have a specific and powerful interpretation related to
the "log-odds." The odds are the ratio of the probability of an event occurring to the probability of it
not occurring (p/(1 − p)).

Model with One Continuous Feature


Consider a model predicting a sunny day based on temperature:

P(Day = Sunny | Temperature) = 1 / (1 + e^(−(β̂0 + β̂1·Temperature)))

• Intercept (β̂0): This is the log-odds of a sunny day when the temperature is 0 degrees. The probability at 0 degrees can be calculated as p = e^(β̂0) / (1 + e^(β̂0)).

• Weight (β̂1 ): This is the change in the log-odds for a one-unit increase in the feature (e.g., a
1-degree increase in temperature). The exponentiated coefficient, eβ̂1 , is the odds ratio. It tells
us how the odds of the outcome are multiplied for every one-unit increase in the feature. For
example, if β̂1 = 0.7, then e0.7 ≈ 2.01. This means that for each additional degree of temperature,
the odds of it being a sunny day are multiplied by 2.01 (i.e., they approximately double).
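A quick numerical check of this odds-ratio interpretation, using the β̂1 = 0.7 example (the intercept value is an arbitrary assumption):

import numpy as np

beta_0, beta_1 = -2.0, 0.7   # beta_0 is an arbitrary illustrative intercept

def prob_sunny(temp):
    return 1.0 / (1.0 + np.exp(-(beta_0 + beta_1 * temp)))

def odds(p):
    return p / (1.0 - p)

# the odds ratio for a one-degree increase equals e^beta_1 regardless of the temperature
print(odds(prob_sunny(11)) / odds(prob_sunny(10)))   # ~2.01
print(np.exp(beta_1))                                # ~2.01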

Model with One Binary Feature


Consider a model predicting a sunny day based on whether it is foggy (where Foggy = 1 if true, 0 if false):

P(Day = Sunny | Foggy) = 1 / (1 + e^(−(β̂0 + β̂1·Foggy)))

• Intercept (β̂0 ): This is the log-odds of a sunny day when the feature is 0 (i.e., when it is not
foggy).
• Weight (β̂1 ): This is the change in the log-odds when the day is foggy relative to when it is not
foggy. The odds ratio, eβ̂1 , indicates how the odds of a sunny day change if it is foggy. For example,
if β̂1 = −0.7, then e−0.7 ≈ 0.50. This means the odds of it being a sunny day are halved if it is
foggy compared to if it is not.

Multivariate Logistic Regression Model
Typically, a model will include multiple features. Example:

P(Day = Sunny | Temp, Foggy) = 1 / (1 + e^(−(β̂0 + β̂1·Temp + β̂2·Foggy)))

• Intercept (β̂0 ): This represents the log-odds of the outcome when all predictor variables are zero
(e.g., a non-foggy day with a temperature of 0 degrees).

• Weights (β̂1 , β̂2 , ...): Each coefficient, say β̂j , is the change in the log-odds for a one-unit change
in its corresponding feature xj , holding all other features constant. For example, β̂1 is the change
in the log-odds of a sunny day for a one-unit change in temperature, assuming the foggy/not-foggy
status does not change.
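To make the "holding all other features constant" interpretation concrete, here is a small sketch that fits a two-feature model with the statsmodels formula API (imported as smf in the Codes section below) on an invented weather-like dataset and exponentiates the fitted coefficients into odds ratios; the data, column names, and "true" coefficients are all assumptions for illustration:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

# invented weather-like data: temperature in degrees and a 0/1 fog indicator
n = 500
weather = pd.DataFrame({
    "temp": rng.normal(25, 5, n),
    "foggy": rng.integers(0, 2, n),
})

# generate sunny/not-sunny labels from an assumed "true" model
true_logit = -6 + 0.25 * weather["temp"] - 0.7 * weather["foggy"]
weather["sunny"] = (rng.random(n) < 1 / (1 + np.exp(-true_logit))).astype(int)

# fit the multivariate logistic regression: sunny ~ temp + foggy
model = smf.logit("sunny ~ temp + foggy", data=weather).fit(disp=0)

print(model.params)          # estimated log-odds coefficients (intercept and weights)
print(np.exp(model.params))  # odds ratios: multiplicative change in the odds per one-unit increase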

Codes
# import dependencies
# data cleaning and manipulation
import pandas as pd
import numpy as np

# data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# machine learning
from sklearn.preprocessing import StandardScaler

import sklearn.linear_model as skl_lm
from sklearn import preprocessing
from sklearn import neighbors
from sklearn.metrics import confusion_matrix, classification_report, precision_score
from sklearn.model_selection import train_test_split

import statsmodels.api as sm
import statsmodels.formula.api as smf

# initialize some package settings
sns.set(style="whitegrid", color_codes=True, font_scale=1.3)

%matplotlib inline

# read in the data and check the first 5 rows
df = pd.read_csv('../input/data.csv', index_col=0)
df.head()

# general summary of the dataframe
df.info()

# remove the 'Unnamed: 32' column
df = df.drop('Unnamed: 32', axis=1)

# check the data type of each column
df.dtypes

# visualize distribution of classes
plt.figure(figsize=(8, 4))
sns.countplot(x='diagnosis', data=df, palette='RdBu')

# count number of observations in each class
benign, malignant = df['diagnosis'].value_counts()
print('Number of cells labeled Benign: ', benign)
print('Number of cells labeled Malignant: ', malignant)
print('')
print('% of cells labeled Benign', round(benign / len(df) * 100, 2), '%')
print('% of cells labeled Malignant', round(malignant / len(df) * 100, 2), '%')

# generate a scatter plot matrix with the "mean" columns
cols = ['diagnosis',
        'radius_mean',
        'texture_mean',
        'perimeter_mean',
        'area_mean',
        'smoothness_mean',
        'compactness_mean',
        'concavity_mean',
        'concave points_mean',
        'symmetry_mean',
        'fractal_dimension_mean']

sns.pairplot(data=df[cols], hue='diagnosis', palette='RdBu')

# generate a correlation matrix (numeric_only excludes the 'diagnosis' text column)
corr = df.corr(numeric_only=True).round(2)

# mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# set up the figure
f, ax = plt.subplots(figsize=(20, 20))

# generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# draw the heatmap
sns.heatmap(corr, mask=mask, cmap=cmap, vmin=-1, vmax=1, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5}, annot=True)

plt.tight_layout()

# create a new dataframe with the "mean" columns
df_mean = df[['diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
              'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean',
              'symmetry_mean', 'fractal_dimension_mean']]

# create a new dataframe with the "se" columns
df_se = df[['diagnosis', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
            'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
            'fractal_dimension_se']]

# create a new dataframe with the "worst" columns
df_worst = df[['diagnosis', 'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst',
               'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst',
               'symmetry_worst', 'fractal_dimension_worst']]

# create a correlation matrix for the "mean" columns
corr_mean = df_mean.corr(numeric_only=True).round(2)

# mask for the upper triangle
mask_mean = np.zeros_like(corr_mean, dtype=bool)
mask_mean[np.triu_indices_from(mask_mean)] = True
# set up the figure
f, ax = plt.subplots(figsize=(10, 10))

# generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# draw the heatmap
sns.heatmap(corr_mean, mask=mask_mean, cmap=cmap, vmin=-1, vmax=1, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5}, annot=True)

plt.tight_layout()

# create a correlation matrix for the "se" columns
corr_se = df_se.corr(numeric_only=True).round(2)

# mask for the upper triangle
mask_se = np.zeros_like(corr_se, dtype=bool)
mask_se[np.triu_indices_from(mask_se)] = True

# set up the figure
f, ax = plt.subplots(figsize=(10, 10))

# generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# draw the heatmap
sns.heatmap(corr_se, mask=mask_se, cmap=cmap, vmin=-1, vmax=1, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5}, annot=True)

plt.tight_layout()

# create a correlation matrix for the "worst" columns
corr_worst = df_worst.corr(numeric_only=True).round(2)

# mask for the upper triangle
mask_worst = np.zeros_like(corr_worst, dtype=bool)
mask_worst[np.triu_indices_from(mask_worst)] = True

# set up the figure
f, ax = plt.subplots(figsize=(10, 10))

# generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# draw the heatmap
sns.heatmap(corr_worst, mask=mask_worst, cmap=cmap, vmin=-1, vmax=1, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5}, annot=True)

plt.tight_layout()

# create X and y
X = df.drop('diagnosis', axis=1)
y = df['diagnosis']

# create a LabelEncoder object
le = preprocessing.LabelEncoder()

# fit and transform the y variable
y = le.fit_transform(y)

# create training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# create a StandardScaler object
scaler = StandardScaler()

# fit and transform the training data
X_train_scaled = scaler.fit_transform(X_train)

# transform the testing data
X_test_scaled = scaler.transform(X_test)

# create a logistic regression model
log_reg = skl_lm.LogisticRegression()

# fit the model
log_reg.fit(X_train_scaled, y_train)

# make predictions
y_pred = log_reg.predict(X_test_scaled)

# print the accuracy score
print("Accuracy: ", log_reg.score(X_test_scaled, y_test))

# print the confusion matrix
print(confusion_matrix(y_test, y_pred))

# print the classification report
print(classification_report(y_test, y_pred))

# plot the ROC curve
from sklearn.metrics import roc_curve

# get the predicted probabilities
y_pred_prob = log_reg.predict_proba(X_test_scaled)[:, 1]

# get the fpr, tpr, and thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

# plot the ROC curve
plt.plot(fpr, tpr)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.title('ROC curve for breast cancer classifier')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.grid(True)

# create a logistic regression model with statsmodels
log_reg2 = sm.Logit(y_train, X_train_scaled)

# fit the model
log_reg2_res = log_reg2.fit()

# print the summary
print(log_reg2_res.summary())
