Logistic Regression
Used to predict classes for binary classification problems
Terminology
• Types of classification outcomes:
  – True positive (m11): example of class 1 predicted as class 1.
  – False positive (m01): example of class 0 predicted as class 1. Type I error.
  – True negative (m00): example of class 0 predicted as class 0.
  – False negative (m10): example of class 1 predicted as class 0. Type II error.
• Total number of instances: m = m00 + m01 + m10 + m11
• Error rate = (m01 + m10) / m
  – If the classes are imbalanced (e.g. 10% from class 1, 90% from class 0), one can achieve a low error rate (e.g. 10%) simply by classifying everything as class 0!
Confusion matrix
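As a rough illustration of how these counts are arranged, the sketch below builds a 2x2 confusion matrix with scikit-learn from a made-up set of labels (the vectors y_true and y_pred here are toy values, not the notebook's data):

import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels, for illustration only
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Rows are actual classes, columns are predicted classes:
# [[TN (m00), FP (m01)],
#  [FN (m10), TP (m11)]]
cm = confusion_matrix(y_true, y_pred)
print(cm)

tn, fp, fn, tp = cm.ravel()
print('error rate =', (fp + fn) / cm.sum())   # (m01 + m10) / m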
Common measures
• Accuracy = (TP + TN) / (TP + FP + FN + TN)
• Precision = true positives / total number of declared positives = TP / (TP + FP)
• Recall = true positives / total number of actual positives = TP / (TP + FN)
• Sensitivity is the same as recall.
• Specificity = true negatives / total number of actual negatives = TN / (FP + TN)
• False positive rate = FP / (FP + TN)
• F1 measure = 2 * Precision * Recall / (Precision + Recall)
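To make the formulas concrete, here is a small sketch that evaluates each measure for one hypothetical confusion matrix (the counts TP, FP, FN, TN below are made up for illustration):

# Hypothetical counts, chosen only to exercise the formulas above
TP, FP, FN, TN = 14, 2, 0, 4

accuracy    = (TP + TN) / (TP + FP + FN + TN)
precision   = TP / (TP + FP)
recall      = TP / (TP + FN)              # sensitivity
specificity = TN / (FP + TN)
fpr         = FP / (FP + TN)              # false positive rate = 1 - specificity
f1          = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, specificity, fpr, f1)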
Receiver operating characteristic (ROC) curve
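The ROC curve plots the true positive rate (recall) against the false positive rate as the decision threshold on the predicted probability is swept from 1 down to 0. A self-contained sketch of how such a curve is typically produced with scikit-learn (the synthetic data here is for illustration only):

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, auc

# Synthetic two-feature data, purely for illustration
X, y = make_classification(n_samples=200, n_features=2, n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
p1 = clf.predict_proba(X_test)[:, 1]       # predicted probability of class 1

# Sweep the threshold over p1 to trace the curve
fpr, tpr, thresholds = roc_curve(y_test, p1)
print('AUC =', auc(fpr, tpr))

plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], 'k--')            # chance line
plt.xlabel('False positive rate')
plt.ylabel('True positive rate (recall)')
plt.show()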
K Fold Cross Validation
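In K-fold cross validation the data are split into K folds; the model is trained on K-1 folds and scored on the remaining fold, and this is repeated so every fold serves once as the validation set. A minimal sketch with scikit-learn (synthetic data, for illustration only):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data, for illustration only
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# 5 folds: each fold is held out once while the other 4 are used for training
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(), X, y, cv=kf)

print('fold accuracies:', scores)
print('mean accuracy  :', scores.mean())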
In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.optimize as opt

path = 'data1.txt'
data = pd.read_csv(path, header=None, names=['Exam 1', 'Exam 2', 'Admitted'])
print('data = ')
print(data.head(10))
print()
print('data.describe = ')
print(data.describe())
positive = data[data['Admitted'].isin([1])]
negative = data[data['Admitted'].isin([0])]
fig, ax = plt.subplots(figsize=(5,5))
ax.scatter(positive['Exam 1'], positive['Exam 2'], s=50, c='b', marker='o', label='Admitted')
ax.scatter(negative['Exam 1'], negative['Exam 2'], s=50, c='r', marker='x', label='Not Admitted')
ax.legend()
ax.set_xlabel('Exam 1 Score')
ax.set_ylabel('Exam 2 Score')
data['Admitted'].value_counts()
data =
Exam 1 Exam 2 Admitted
0 34.623660 78.024693 0
1 30.286711 43.894998 0
2 35.847409 72.902198 0
3 60.182599 86.308552 1
4 79.032736 75.344376 1
5 45.083277 56.316372 0
6 61.106665 96.511426 1
7 75.024746 46.554014 1
8 76.098787 87.420570 1
9 84.432820 43.533393 1
data.describe =
Exam 1 Exam 2 Admitted
count 100.000000 100.000000 100.000000
mean 65.644274 66.221998 0.600000
std 19.458222 18.582783 0.492366
min 30.058822 30.603263 0.000000
25% 50.919511 48.179205 0.000000
50% 67.032988 67.682381 1.000000
75% 80.212529 79.360605 1.000000
max 99.827858 98.869436 1.000000
Out[4]: 1    60
        0    40
        Name: Admitted, dtype: int64
In [5]:
def sigmoid(z):
    # Logistic function: maps any real value into (0, 1)
    return 1 / (1 + np.exp(-z))

def cost(theta, X, y):
    # Cross-entropy (negative log-likelihood) cost, averaged over the m examples
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)
    first = np.multiply(-y, np.log(sigmoid(X * theta.T)))
    second = np.multiply((1 - y), np.log(1 - sigmoid(X * theta.T)))
    return np.sum(first - second) / (len(X))

def gradient(theta, X, y):
    # Partial derivative of the cost with respect to each parameter theta_j
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)
    parameters = int(theta.ravel().shape[1])
    grad = np.zeros(parameters)
    error = sigmoid(X * theta.T) - y
    for i in range(parameters):
        term = np.multiply(error, X[:,i])
        grad[i] = np.sum(term) / len(X)
    return grad

def predict(theta, X):
    # Classify as 1 when the predicted probability is at least 0.5
    probability = sigmoid(X * theta.T)
    return [1 if x >= 0.5 else 0 for x in probability]
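The gradient above loops over the parameters one at a time. An equivalent vectorized form (a sketch, not the notebook's own implementation) computes all partial derivatives in a single matrix product, (1/m) * X^T (sigmoid(X theta) - y):

def gradient_vectorized(theta, X, y):
    # Same quantity as the loop above, computed in one matrix product
    theta = np.asarray(theta).reshape(-1, 1)
    X = np.asarray(X)
    y = np.asarray(y).reshape(-1, 1)
    error = 1 / (1 + np.exp(-X @ theta)) - y     # h(x) - y for every example
    return (X.T @ error).ravel() / len(X)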
In [6]:
nums = np.arange(-10, 10, step=1)
fig, ax = plt.subplots(figsize=(5,5))
ax.plot(nums, sigmoid(nums), 'r')
# add a ones column - this makes the matrix multiplication work out easier
data.insert(0, 'Ones', 1)
# set X (training data) and y (target variable)
cols = data.shape[1]
X = data.iloc[:,0:cols-1]
y = data.iloc[:,cols-1:cols]
# convert to numpy arrays and initialize the parameter array theta
X = np.array(X.values)
y = np.array(y.values)
theta = np.zeros(3)
print()
print('X.shape = ', X.shape)
print('theta.shape = ', theta.shape)
print('y.shape = ', y.shape)
thiscost = cost(theta, X, y)
print()
print('cost = ', thiscost)

result = opt.fmin_tnc(func=cost, x0=theta, fprime=gradient, args=(X, y))
costafteroptimize = cost(result[0], X, y)
print()
print('cost after optimize = ', costafteroptimize)
print()

theta_min = np.matrix(result[0])
predictions = predict(theta_min, X)
correct = [1 if ((a == 1 and b == 1) or (a == 0 and b == 0)) else 0 for (a, b) in zip(predictions, y)]
# percentage of examples classified correctly (integer percent)
accuracy = sum(map(int, correct)) * 100 // len(correct)
print('accuracy = {0}%'.format(accuracy))
X.shape = (100, 3)
theta.shape = (3,)
y.shape = (100, 1)
cost = 0.6931471805599453
cost after optimize = 0.20349770158947486
accuracy = 89%
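With the fitted parameters, the decision boundary is the line where theta_0 + theta_1*x_1 + theta_2*x_2 = 0, i.e. where the predicted probability equals 0.5. A sketch of how it could be drawn over the scatter plot, assuming theta_min, positive and negative from the cells above (the plotted x-range 30-100 simply matches the exam scores):

# Boundary: exam2 = -(theta0 + theta1 * exam1) / theta2
t = np.array(theta_min).ravel()
exam1 = np.linspace(30, 100, 100)
exam2 = -(t[0] + t[1] * exam1) / t[2]

fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(positive['Exam 1'], positive['Exam 2'], c='b', marker='o', label='Admitted')
ax.scatter(negative['Exam 1'], negative['Exam 2'], c='r', marker='x', label='Not Admitted')
ax.plot(exam1, exam2, 'g-', label='Decision boundary')
ax.legend()
ax.set_xlabel('Exam 1 Score')
ax.set_ylabel('Exam 2 Score')
plt.show()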
In [7]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import classification_report
from sklearn.metrics import roc_curve
from sklearn.metrics import auc
from sklearn.metrics import roc_auc_score
from sklearn.metrics import zero_one_loss
from sklearn.metrics import plot_roc_curve

path = 'data1.txt'
data = pd.read_csv(path, header=None, names=['Exam 1', 'Exam 2', 'Admitted'])
positive = data[data['Admitted'].isin([1])]
negative = data[data['Admitted'].isin([0])]
fig, ax = plt.subplots(figsize=(5,5))
ax.scatter(positive['Exam 1'], positive['Exam 2'], s=50, c='b', marker='o', label='Admitted')
ax.scatter(negative['Exam 1'], negative['Exam 2'], s=50, c='r', marker='x', label='Not Admitted')
ax.legend()
ax.set_xlabel('Exam 1 Score')
ax.set_ylabel('Exam 2 Score')
cols = data.shape[1]
X = data.iloc[:,0:cols-1]
y = data.iloc[:,cols-1:cols]
X = np.array(X.values)
y = np.array(y.values)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
In [8]:
LogisticRegressionModel = LogisticRegression(solver='liblinear')
LogisticRegressionModel.fit(X_train, y_train)
# Calculating details
print('LogisticRegressionModel Train Score is : ', LogisticRegressionModel.score(X_train, y_train))
print('LogisticRegressionModel Test Score is : ', LogisticRegressionModel.score(X_test, y_test))
print('LogisticRegressionModel Classes are : ', LogisticRegressionModel.classes_)
print('LogisticRegressionModel No. of iterations is : ', LogisticRegressionModel.n_iter_)
print('----------------------------------------------------')
LogisticRegressionModel Train Score is : 0.8625
LogisticRegressionModel Test Score is : 0.9
LogisticRegressionModel Classes are : [0 1]
LogisticRegressionModel No. of iterations is : [13]
----------------------------------------------------
C:\Users\ralhm\anaconda3\lib\site-packages\sklearn\utils\validation.py:63: DataConversionWarning:
A column-vector y was passed when a 1d array was expected. Please change the shape of y to
(n_samples, ), for example using ravel().
  return f(*args, **kwargs)
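The DataConversionWarning is raised because y_train is a column vector of shape (n_samples, 1); flattening it with ravel() silences the warning without changing the fitted model. A one-line sketch, assuming X_train and y_train from the cell above:

# Equivalent fit without the DataConversionWarning
LogisticRegressionModel.fit(X_train, y_train.ravel())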
In [9]:
# Calculating predictions
y_pred = LogisticRegressionModel.predict(X_test)
y_pred_prob = LogisticRegressionModel.predict_proba(X_test)
print('Predicted Value for LogisticRegressionModel is : ', y_pred[:10])
print('Prediction Probabilities Value for LogisticRegressionModel is : ', y_pred_prob[:10])

#----------------------------------------------------
Predicted Value for LogisticRegressionModel is : [1 0 1 1 0 0 1 1 1 1]
Prediction Probabilities Value for LogisticRegressionModel is : [[0.16384175 0.83615825]
 [0.60151281 0.39848719]
 [0.28609122 0.71390878]
 [0.20357648 0.79642352]
 [0.59404439 0.40595561]
 [0.50512972 0.49487028]
 [0.3937627  0.6062373 ]
 [0.35861105 0.64138895]
 [0.25021584 0.74978416]
 [0.19457158 0.80542842]]
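Each row of y_pred_prob holds [P(class 0), P(class 1)], and predict() returns class 1 exactly when the second column reaches the default 0.5 threshold (up to ties at exactly 0.5). A quick check, assuming y_pred and y_pred_prob from above:

# predict() is equivalent to thresholding P(class 1) at 0.5
manual_pred = (y_pred_prob[:, 1] >= 0.5).astype(int)
print(np.array_equal(manual_pred, y_pred))   # expected: True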
In [11]:
# Calculating the confusion matrix
CM = confusion_matrix(y_test, y_pred)
print('Confusion Matrix is : \n', CM)

# drawing the confusion matrix
sns.heatmap(CM, center=True)
plt.show()
plot_roc_curve(LogisticRegressionModel, X_test, y_test)
plt.show()
print(y_test)

#----------------------------------------------------
Confusion Matrix is :
[[ 4 2]
[ 0 14]]
[[1]
[0]
[0]
[1]
[0]
[0]
[1]
[1]
[1]
[1]
[0]
[1]
[1]
[0]
[1]
[1]
[1]
[1]
[1]
[1]]
In [47]:
# Calculating Accuracy Score : (TP + TN) / float(TP + TN + FP + FN)
AccScore = accuracy_score(y_test, y_pred, normalize=True)
print('Accuracy Score is : ', AccScore)

#----------------------------------------------------
# Calculating F1 Score : 2 * (precision * recall) / (precision + recall)
# f1_score(y_true, y_pred, labels=None, pos_label=1, average='binary', sample_weight=None)

F1Score = f1_score(y_test, y_pred, average='micro')  # average can be: binary, micro, macro, weighted, samples
print('F1 Score is : ', F1Score)

#----------------------------------------------------
# Calculating Recall Score (Sensitivity) : TP / float(TP + FN)
# recall_score(y_true, y_pred, labels=None, pos_label=1, average='binary', sample_weight=None)

RecallScore = recall_score(y_test, y_pred, average='micro')  # average can be: binary, micro, macro, weighted, samples
print('Recall Score is : ', RecallScore)

#----------------------------------------------------
# Calculating Precision Score : TP / float(TP + FP)
# precision_score(y_true, y_pred, labels=None, pos_label=1, average='binary', sample_weight=None)

PrecisionScore = precision_score(y_test, y_pred, average='micro')  # average can be: binary, micro, macro, weighted, samples
print('Precision Score is : ', PrecisionScore)

#----------------------------------------------------
# Calculating Precision / Recall / F-score / Support together :
# precision_recall_fscore_support(y_true, y_pred, beta=1.0, labels=None, pos_label=1,
#     average=None, warn_for=('precision', 'recall', 'f-score'), sample_weight=None)

PrecisionRecallScore = precision_recall_fscore_support(y_test, y_pred, average='micro')
print('Precision Recall Score is : ', PrecisionRecallScore)

#----------------------------------------------------
# Calculating Precision-Recall Curve :
# precision_recall_curve(y_true, probas_pred, pos_label=None, sample_weight=None)

PrecisionValue, RecallValue, ThresholdsValue = precision_recall_curve(y_test, y_pred)
print('Precision Value is : ', PrecisionValue)
print('Recall Value is : ', RecallValue)
print('Thresholds Value is : ', ThresholdsValue)

#----------------------------------------------------
# Calculating Classification Report :
# classification_report(y_true, y_pred, labels=None, target_names=None, sample_weight=None, digits=2)

ClassificationReport = classification_report(y_test, y_pred)
print('Classification Report is : ', ClassificationReport)

#----------------------------------------------------
# Calculating Area Under the Curve :

fprValue2, tprValue2, thresholdsValue2 = roc_curve(y_test, y_pred)
AUCValue = auc(fprValue2, tprValue2)
print('AUC Value : ', AUCValue)

#----------------------------------------------------
# Calculating Receiver Operating Characteristic :
# roc_curve(y_true, y_score, pos_label=None, sample_weight=None, drop_intermediate=True)

fprValue, tprValue, thresholdsValue = roc_curve(y_test, y_pred)
print('fpr Value : ', fprValue)
print('tpr Value : ', tprValue)
print('thresholds Value : ', thresholdsValue)

#----------------------------------------------------
# Calculating ROC AUC Score :
# roc_auc_score(y_true, y_score, average='macro', sample_weight=None, max_fpr=None)

ROCAUCScore = roc_auc_score(y_test, y_pred, average='micro')  # average can be: micro, macro, samples, weighted
print('ROCAUC Score : ', ROCAUCScore)
Accuracy Score is : 0.9
F1 Score is : 0.9
Recall Score is : 0.9
Precision Score is : 0.9
Precision Recall Score is : (0.9, 0.9, 0.9, None)
Precision Value is : [0.875 1. ]
Recall Value is : [1. 0.]
Thresholds Value is : [1]
Classification Report is :                precision    recall  f1-score   support

           0       1.00      0.67      0.80         6
           1       0.88      1.00      0.93        14

    accuracy                           0.90        20
   macro avg       0.94      0.83      0.87        20
weighted avg       0.91      0.90      0.89        20
AUC Value : 0.8333333333333334
fpr Value : [0. 0.33333333 1. ]
tpr Value : [0. 1. 1.]
thresholds Value : [2 1 0]
ROCAUC Score : 0.8333333333333334
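Note that roc_curve above is given the hard 0/1 predictions, so the curve has only a single interior point (the thresholds are [2 1 0]). Passing the predicted probability of class 1 instead traces the full curve that plot_roc_curve drew earlier. A sketch, assuming y_test and y_pred_prob from the previous cells:

# ROC computed from probabilities rather than hard predictions
p1 = y_pred_prob[:, 1]                        # probability of class 1
fprP, tprP, thrP = roc_curve(y_test, p1)
print('AUC (from probabilities) = ', auc(fprP, tprP))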
In [67]:
# Multiple classes
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(random_state=10, solver='lbfgs', max_iter=1000)
#clf = LogisticRegression(random_state=10, solver='liblinear')
#clf = LogisticRegression(random_state=10, solver='saga')
print(X.shape)
print(X[:2, :])
clf.fit(X, y)
clf.predict(X[:2, :])
print(clf.predict_proba(X[:2, :]))

score = clf.score(X, y)

print('score = ', score)
print('No of iterations = ', clf.n_iter_)
print('Classes = ', clf.classes_)
y_pred = clf.predict(X)
CM = confusion_matrix(y, y_pred)
print(CM)
(150, 4)
[[5.1 3.5 1.4 0.2]
[4.9 3. 1.4 0.2]]
[[9.69810844e-01 3.01885609e-02 5.94808016e-07]
[9.55989484e-01 4.40095854e-02 9.30437140e-07]]
score = 0.9666666666666667
No of iterations = [85]
Classes = [0 1 2]
[[50 0 0]
[ 0 47 3]
[ 0 2 48]]
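The score above is measured on the same data used for fitting. A held-out split, mirroring the binary example earlier, gives a less optimistic estimate (a sketch, assuming X, y, train_test_split and confusion_matrix from the cells above; random_state=10 is just an illustrative choice):

# Evaluate the multiclass model on data not seen during fitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)
clf = LogisticRegression(random_state=10, solver='lbfgs', max_iter=1000)
clf.fit(X_train, y_train)
print('held-out accuracy = ', clf.score(X_test, y_test))
print(confusion_matrix(y_test, clf.predict(X_test)))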