UNIVERSITY OF ENGINEERING AND TECHNOLOGY, TAXILA
FACULTY OF TELECOMMUNICATION AND INFORMATION ENGINEERING
COMPUTER ENGINEERING DEPARTMENT
Machine Learning
Logistic Regression
Dated: 29th Jan, 2024 to 2nd Feb, 2024
Semester: 2024
Lab Instructor: Sheharyar Khan
Objectives:
The objectives of this session are:
1. Train/Test split
2. Saving and loading a trained model
3. Pickle and sklearn joblib
4. Logistic Regression
What is Train/Test
Train/Test is a method to measure the accuracy of your model.
It is called Train/Test because you split the data set into two sets: a training set and a testing set.
A common split is 80% for training and 20% for testing.
You train the model using the training set.
You test the model using the testing set.
Start With a Data Set
Start with a data set you want to test.
Our data set describes 100 customers in a shop and their shopping habits.
import numpy
import matplotlib.pyplot as plt

numpy.random.seed(2)
x = numpy.random.normal(3, 1, 100)         # minutes spent in the shop before a purchase
y = numpy.random.normal(150, 40, 100) / x  # money spent on the purchase

plt.scatter(x, y)
plt.show()
Split Into Train/Test
The training set should be a random selection of 80% of the original data.
The testing set should be the remaining 20%.
# The data was generated in random order, so slicing the first
# 80 points gives an effectively random 80/20 split here.
train_x = x[:80]
train_y = y[:80]
test_x = x[80:]
test_y = y[80:]
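For real data sets, where the rows may be ordered, a shuffled split is safer. A minimal sketch using scikit-learn's train_test_split as an alternative to the manual slicing above:

from sklearn.model_selection import train_test_split

# Shuffle and split: 80% training, 20% testing.
# random_state makes the split reproducible.
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.2, random_state=2)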
Display the Training Set
Display the same scatter plot with the training set:
plt.scatter(train_x, train_y)
plt.show()
Display the Testing Set
To make sure the testing set is not completely different, we will take a look at the testing set as
well.
plt.scatter(test_x, test_y)
plt.show()
Example
Draw a polynomial regression line through the data points:
import numpy
import matplotlib.pyplot as plt
numpy.random.seed(2)
x = numpy.random.normal(3, 1, 100)
y = numpy.random.normal(150, 40, 100) / x
train_x = x[:80]
train_y = y[:80]
test_x = x[80:]
test_y = y[80:]
mymodel = numpy.poly1d(numpy.polyfit(train_x, train_y, 4))
myline = numpy.linspace(0, 6, 100)
plt.scatter(train_x, train_y)
plt.plot(myline, mymodel(myline))
plt.show()
The result supports the suggestion that the data set fits a polynomial regression, even though it would give us some strange results if we tried to predict values outside of the data set. Example: the line indicates that a customer spending 6 minutes in the shop would make a purchase worth 200. That is probably a sign of overfitting.
But what about the R-squared score? The R-squared score is a good indicator of how well the data set fits the model.
R2
Remember R2, also known as R-squared?
It measures the strength of the relationship between the x and y values, and the value ranges from 0 to 1, where 0 means no relationship and 1 means totally related.
The sklearn module has a method called r2_score() that will help us find this relationship.
In this case we would like to measure the relationship between the minutes a customer stays in
the shop and how much money they spend.
Example
How well does my training data fit in a polynomial regression?
import numpy
from sklearn.metrics import r2_score
numpy.random.seed(2)
x = numpy.random.normal(3, 1, 100)
y = numpy.random.normal(150, 40, 100) / x
train_x = x[:80]
train_y = y[:80]
test_x = x[80:]
test_y = y[80:]
mymodel = numpy.poly1d(numpy.polyfit(train_x, train_y, 4))
r2 = r2_score(train_y, mymodel(train_x))
print(r2)
Bring in the Testing Set
Now we have made a model that is OK, at least when it comes to training data.
Now we want to test the model with the testing data as well, to see if it gives us the same result.
Example
Let us find the R2 score when using testing data:
import numpy
from sklearn.metrics import r2_score
numpy.random.seed(2)
x = numpy.random.normal(3, 1, 100)
y = numpy.random.normal(150, 40, 100) / x
train_x = x[:80]
train_y = y[:80]
test_x = x[80:]
test_y = y[80:]
mymodel = numpy.poly1d(numpy.polyfit(train_x, train_y, 4))
r2 = r2_score(test_y, mymodel(test_x))
print(r2)
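A training score that is much higher than the testing score is the classic symptom of overfitting, so it helps to print both side by side. A minimal sketch reusing the variables defined above:

# Compare the fit on data the model has seen vs. data it has not.
print("train R2:", r2_score(train_y, mymodel(train_x)))
print("test R2: ", r2_score(test_y, mymodel(test_x)))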
Predict Values
Now that we have established that our model is OK, we can start predicting new values.
Example
How much money will a buying customer spend, if she or he stays in the shop for 5 minutes?
print(mymodel(5))
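To see the extrapolation problem mentioned earlier, evaluate the model outside the range of the observed data; a 4th-degree polynomial quickly produces implausible values there:

# Predictions at the edge of the data (6 minutes) and well
# beyond it (10 minutes) should be treated with suspicion.
print(mymodel(6))
print(mymodel(10))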
Save and Load trained Model
Solving a problem in ML typically consists of two steps. The first step is to train a model using your training dataset; the second step is to ask questions of the trained model, which, somewhat like a human brain, gives you the answers. The training dataset is often quite large, because as its size increases the model becomes more accurate. It is like football training: the more you train, the better you become at the game. When the training dataset is huge, often gigabytes in size, the training step becomes time-consuming. If you save the trained model to a file, you can later use that same model to make predictions, so you don't need to retrain it every time you want to ask a question.
Quick Task: Load Saved Model Example
Open the Linear Regression Python file predicting home prices and make the following changes.
In the above file, add a new code cell in Google Colab and write the code below:
import pickle

# Save the trained model to a file.
with open('model_pickle', 'wb') as file:
    pickle.dump(model, file)

# Load the model back; it can be used exactly like the original.
with open('model_pickle', 'rb') as file:
    mp = pickle.load(file)

mp.coef_
mp.intercept_
mp.predict([[5000]])
Save Trained Model Using joblib (Second Way to Save a Model)
import joblib  # in recent scikit-learn versions, sklearn.externals.joblib has been removed
joblib.dump(model, 'model_joblib')
Load Saved Model
mj = joblib.load('model_joblib')
mj.coef_
mj.intercept_
mj.predict([[5000]])
Question to think about: What is the difference between joblib and pickle?
Logistic Regression
Logistic regression aims to solve classification problems. It does this by predicting categorical
outcomes, unlike linear regression that predicts a continuous outcome.
In the simplest case there are two outcomes, which is called binomial; an example is predicting whether a tumor is malignant or benign. Other cases have more than two outcomes to classify, in which case it is called multinomial. A common example of multinomial logistic regression would be predicting the class of an iris flower among 3 different species.
Here we will be using basic logistic regression to predict a binomial variable. This means it has
only two possible outcomes.
How does it work?
In Python we have modules that will do the work for us. Start by importing the NumPy module.
import numpy
Store the independent variables in X.
Store the dependent variable in y.
Below is a sample dataset:
#X represents the size of a tumor in centimeters.
X = numpy.array([3.78, 2.44, 2.09, 0.14, 1.72, 1.65, 4.92, 4.37, 4.96, 4.52, 3.69, 5.88]).reshape(-1,1)
#Note: X has to be reshaped into a column from a row for the LogisticRegression() function to work.
#y represents whether or not the tumor is cancerous (0 for "No", 1 for "Yes").
y = numpy.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
We will use a method from the sklearn module, so we will have to import that module as well:
from sklearn import linear_model
From the sklearn module we will use the LogisticRegression() method to create a logistic
regression object.
This object has a method called fit() that takes the independent and dependent values as
parameters and fills the regression object with data that describes the relationship:
logr = linear_model.LogisticRegression()
logr.fit(X,y)
Now we have a logistic regression object that is ready to predict whether a tumor is cancerous based on the tumor size:
#predict if tumor is cancerous where the size is 3.46cm:
predicted = logr.predict(numpy.array([3.46]).reshape(-1,1))
Example
import numpy
from sklearn import linear_model
#Reshaped for Logistic function.
X = numpy.array([3.78, 2.44, 2.09, 0.14, 1.72, 1.65, 4.92, 4.37, 4.96, 4.52, 3.69, 5.88]).reshape(-1,1)
y = numpy.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
logr = linear_model.LogisticRegression()
logr.fit(X,y)
#predict if tumor is cancerous where the size is 3.46cm:
predicted = logr.predict(numpy.array([3.46]).reshape(-1,1))
print(predicted)
Coefficient
In logistic regression the coefficient is the expected change in log-odds of having the outcome
per unit change in X.
This does not have the most intuitive understanding so let's use it to create something that makes
more sense, odds.
Example
See the whole example in action:
import numpy
from sklearn import linear_model
#Reshaped for Logistic function.
X = numpy.array([3.78, 2.44, 2.09, 0.14, 1.72, 1.65, 4.92, 4.37, 4.96, 4.52, 3.69, 5.88]).reshape(-1,1)
y = numpy.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
logr = linear_model.LogisticRegression()
logr.fit(X,y)
log_odds = logr.coef_
odds = numpy.exp(log_odds)
print(odds)
This tells us that as the size of a tumor increases by 1cm, the odds of it being a cancerous tumor increase by roughly 4x.
Probability
The coefficient and intercept values can be used to find the probability that each tumor is
cancerous.
Create a function that uses the model's coefficient and intercept values to return a new value. This new value represents the probability that the given tumor is cancerous:
def logit2prob(logr, x):
    log_odds = logr.coef_ * x + logr.intercept_
    odds = numpy.exp(log_odds)
    probability = odds / (1 + odds)
    return probability
Function Explained
To find the log-odds for each observation, we must first create a formula that looks similar to the
one from linear regression, extracting the coefficient and the intercept.
log_odds = logr.coef_ * x + logr.intercept_
To then convert the log-odds to odds we must exponentiate the log-odds.
odds = numpy.exp(log_odds)
Now that we have the odds, we can convert it to probability by dividing it by 1 plus the odds.
probability = odds / (1 + odds)
Let us now use the function with what we have learned to find out the probability that each tumor is cancerous.
Example
See the whole example in action:
import numpy
from sklearn import linear_model
X = numpy.array([3.78, 2.44, 2.09, 0.14, 1.72, 1.65, 4.92, 4.37, 4.96, 4.52, 3.69, 5.88]).reshape(-1,1)
y = numpy.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
logr = linear_model.LogisticRegression()
logr.fit(X,y)
def logit2prob(logr, X):
    log_odds = logr.coef_ * X + logr.intercept_
    odds = numpy.exp(log_odds)
    probability = odds / (1 + odds)
    return probability
print(logit2prob(logr, X))
Results Explained
3.78 → 0.61: The probability that a tumor with the size 3.78cm is cancerous is 61%.
2.44 → 0.19: The probability that a tumor with the size 2.44cm is cancerous is 19%.
2.09 → 0.13: The probability that a tumor with the size 2.09cm is cancerous is 13%.
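Note that scikit-learn can produce these probabilities directly: predict_proba() returns, for each observation, the probability of each class, and its second column should match the output of logit2prob above. A minimal sketch reusing the fitted logr:

# Column 0: P(not cancerous), column 1: P(cancerous).
probs = logr.predict_proba(X)
print(probs[:, 1])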
Task 1
Find the 7_Logistic Regression Python file and upload it to Google Colab along with the data file Insurance.csv.
Run all code cells and write down your understanding of each along with its output.
Task 2
Download the employee retention dataset from here: https://www.kaggle.com/giripujar/hr-analytics.
1. Do some exploratory data analysis to figure out which variables have a direct and clear impact on employee retention (i.e. whether they leave the company or continue to work).
2. Plot bar charts showing the impact of employee salaries on retention.
3. Plot bar charts showing the correlation between department and employee retention.
4. Build a logistic regression model using the variables that were narrowed down in step 1 (a starting sketch follows this list).
5. Measure the accuracy of the model.
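A possible starting point for steps 4 and 5 is sketched below. The file name and column names (satisfaction_level, average_montly_hours, salary, left) are assumptions based on the HR Analytics dataset; verify them against the file you actually download.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumed file and column names -- check the downloaded CSV.
df = pd.read_csv('HR_comma_sep.csv')

# One-hot encode the categorical salary column (low/medium/high).
features = pd.get_dummies(df[['satisfaction_level', 'average_montly_hours', 'salary']], columns=['salary'])
target = df['left']  # 1 = employee left, 0 = stayed

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on held-out data (step 5)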
Task 3: Run All Examples in the Lab Manual