0% found this document useful (0 votes)

273 views62 pages

Lecture - 4 - Logistic Regression

Logistic regression is an important machine learning tool for classification. It uses gradient descent and cross-entropy loss to learn weights during training, and then uses the learned weights to compute probabilities and make predictions on test data. Logistic regression is a discriminative classifier that directly models the conditional probability of the labels given the features.

Uploaded by

Adegboye Olaoluwa

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

273 views62 pages

Lecture - 4 - Logistic Regression

Uploaded by

Adegboye Olaoluwa

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 62

Logistic Regression

Lin Li
CS@PVAMU
Logistic Regression
 Important analytic tool in natural and social sciences
 Baseline supervised machine learning tool for
classification
 Is also the foundation of neural networks
 Two phases of logistic regression:
 Training: we learn weights w and b using stochastic
gradient descent and cross-entropy loss.
 Test: Given a test example x we compute p(y|x) using
learned weights w and b, and return whichever label (y = 1
or y = 0) is higher probability

2
Generative vs. Discriminative Classifiers
Suppose we're distinguishing cat from dog images

ImageNet ImageNet

 Naive Bayes is a generative classifier

 K-Nearest Neighbor, Rocchio Classifier, Logistic regression,
and Linear Regression (while being used for classification, e.g.,
Perceptron) are discriminative classifiers
3
Generative Classifiers
 Build a model of what's in a cat image
 Knows about whiskers, ears, eyes
 Assigns a probability to any image
 how cat-y is this image?

 Also, build a model of what's in a dog image

 Given a new image, run both models

to see which one fits better

4
Discriminative Classifiers
 Just try to distinguish dogs from cats

Look, dogs have collars! Let's ignore everything else

5
Generative vs. Discriminative Classifiers
Generative models Discriminative models
 Specifying joint distribution  Specifying conditional
 Full probabilistic distribution
specification for all the  Only explain the target variable
random variables  Arbitrary features can be
 Dependence assumption incorporated for modeling
has to be specified for 𝑝 𝒕𝒘
𝑝 𝒘 𝒕 and 𝑝(𝒕)  Need labeled data, only
 Flexible, can be used in suitable for (semi-) supervised
unsupervised learning learning
 e.g., Naïve Bayes  e.g. Logistic Regression
𝑙𝑖𝑘𝑒𝑙𝑖ℎ𝑜𝑜𝑑 𝑝𝑟𝑖𝑜𝑟 𝑝𝑜𝑠𝑡𝑒𝑟𝑖𝑜𝑟
𝑐 = argmax 𝑃(𝑑|𝑐) 𝑃(𝑐) 𝑐 = argmax 𝑃(𝑐|𝑑)
𝑐∈𝐶 𝑐∈𝐶
6
Probabilistic Learning Classifier
 Given m input/output pairs (x(i),y(i)), the
components include:
 A feature representation of the input. For each input
observation x(i), a vector of features [x1, x2, ... , xn].
Feature j for input x(i) is xj, more completely xj(i), or
sometimes fj(x).
 A classification function that computes 𝑦, the
estimated class, via p(y|x), like the sigmoid or softmax
functions.
 An objective function for learning, like mean squared
error or cross-entropy loss.
 An algorithm for optimizing the objective function:
stochastic gradient descent.

7
Error-based Learning

 Linear Regression
 Logistic Regression

8
Regression
 Given a dataset, how to use a parameterized prediction model to
represent the relationship between the descriptive features and the
target feature?

 Note: all chapter indexed pictures (e.g., Table 7.1) in this lecture are from the textbook of John
Kelleher, et al, “Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and
Case Studies”
9
Linear Regression

 Initiate a linear equation representing the relation between

descriptive features and the target feature
 E.g., between size (x) and price (y), y = w × x + b; which is a line with
w being the slope, and b being the y intercept.
 Key → find w and b (e.g., after exploration, price = 6.47 + 0.62 × size)
10
Measuring Prediction Error
 Linear regression formula for multiple descriptive features (e.g.,
k features)
t = w1 × d1 + w2 × d2 + … + wk × dk + w0
where w1 , w2 , …, wk are the weights of the respective features
 Measure the sum of the squared error (SSE, difference between
prediction and target value) or the sum of the squared residual1.
e.g., SSE is calculated as: 𝑛
1
𝐿2 𝑀𝑤 , 𝐷 = (𝑡𝑖 − 𝑀𝑤 (𝑑𝑖 ))2
2
𝑖=1

e.g., for the size (d1) and price (t) model, after expansion,
𝑛
1
𝐿2 𝑀𝑤 , 𝐷 = (𝑡𝑖 − (𝑤 0 + 𝑤[1] × 𝑑𝑖 [1]))2
2
𝑖=1

11
Error Surface

 Weight space and error surface: for example, t = w1 × d1 + w0

 Surface is convex and there is a global minimum
12
Gradient Descent
 Goal: adjust the weights to minimize the residual
 How to: find the optimal weights at the point where the partial
derivatives of the error surface with respect to w0 and w1 are
equal to 0, i is the index of the data with range [1, n]
𝑛
𝜕 1
(𝑡𝑖 − (𝑤 0 + 𝑤[1] × 𝑑𝑖 [1]))2 = 0
𝜕𝑤[0] 2
𝑖=1
𝑛
𝜕 1
(𝑡𝑖 − (𝑤 0 + 𝑤[1] × 𝑑𝑖 [1]))2 = 0
𝜕𝑤[1] 2
𝑖=1

 For multiple variable linear regression,

𝑚

𝑀𝑤 𝑑 = 𝑤 0 + 𝑤 1 × 𝑑 1 + ⋯ + 𝑤 𝑚 × 𝑑 𝑚 = 𝑤 0 + 𝑤[𝑗] × 𝑑 [𝑗]
𝑖=1

apply partial derivatives similarly

13
Gradient Descent (cont’d)

 Idea: starts by initiate the weights to “random” values, then adjust it along the
deepest direction (slope) of the current point toward the minimum.
14
Gradient Descent (cont’d)
 Gradient with respect to each weight: (partial derivative by
chain rule) 𝜕
𝐿 𝑀 ,𝐷 =
𝜕 1
(𝑡 − 𝑀 (𝑑))2
2 𝑤 𝑤
𝜕𝑤 𝑗 𝜕𝑤 𝑗 2
𝜕
= (𝑡 − 𝑀𝑤 (𝑑) × 𝑡 − 𝑀𝑤 (𝑑)
𝜕𝑤 𝑗
𝜕
= (𝑡 − 𝑀𝑤 (𝑑) × 𝑡 − (𝑤 ∙ 𝑑)
𝜕𝑤 𝑗
= (𝑡 − 𝑀𝑤 (𝑑) × −𝑑[𝑗]
 Update the weights one by one by tuning them toward the
opposite direction of the calculated gradient, using a learning
rate (α) to decide the change amount
𝑛

𝑤 𝑗 ←𝑤 𝑗 +𝑎× ((𝑡𝑖 − 𝑀𝑤 𝑑𝑖 ) × 𝑑𝑖 [𝑗])

𝑖=1
𝑒𝑟𝑟𝑜𝑟𝐷𝑒𝑙𝑡𝑎(𝐷,𝑤[𝑗])

 Then, re-measure the errors and keep updating the weights

as above until the error is less than a threshold (e.g., 0.1)
15
Error Change with Weights Update

16
Example
Rental Price = w0 + w1 × size + w2 × floor + w3 × Broadband Rate
Set α = 0.00000002, and initialize all weights between [-0.2, +0.2]

17
Example (cont’d)

After 100 iterations, w0 = -0.1513, w1 = 0.6270, w2 = -0.1781, w3 = 0.0714, and

the squared error of the final model is 2913.5
Rental Price = w0 + w1 × size + w2 × floor + w3 × Broadband Rate
18
Batch and Stochastic Gradient Descent
 Batch Gradient Descent: weight adjustment at each iteration
based on summation of the squared error made by the
candidate model (linear regression formula) for the whole
training set
 Stochastic Gradient Descent: weight adjustment based on the
error in the predication made by the candidate model (linear
regression formula) for each training instance individually
 Mini-batch (a compromise of both): train/adjust the weights
each time on a group of m examples that is less than the whole
dataset (if m = 1, same as stochastic gradient descent; if m = size
of the whole dataset, same as batch gradient descent)
 e.g., batch size is 512 or 1024

19
Interpreting Multivariable Linear Models
 Signs of Weights: indicating whether the feature has a positive or
negative impact on the prediction
 Magnitudes of Weights: how much the target value changes
according to the descriptive feature’s unit
 Mistake: the features with higher weights are more important
than those with lower weights
 Each descriptive feature has a different varying scale! Need to
performance statistical significance test before comparison.
 A test-statistic is computed (e.g., t-test) to decide whether the
feature has significant impact on the target
 P-value: the probability that the test-statistic is greater than or
equal to the one computed being the result of chance
 Null-hypothesis: the feature has no significant impact on the target
 P-value is compared with pre-defined significance threshold. If it is
≤ threshold (e.g., 5% or 1%), null hypothesis is rejected.
20
Interpreting Multivariable Linear Models (Example)

 First: calculate the standard error of the

model (n is sample size)

 Second: calculate the standard error on a

specific descriptive feature

 Third : calculate t-statistic for this feature

 Finally: check p-value associated with the
calculated t-test value in a look-up table,
and then Compare results

21
Interpreting Multivariable Linear Models (Example)

 Often shown in data analytics (e.g., R or Python)

22
Impact of Learning Rate (α)
 Large α:
converge
fast at
beginning,
but may
miss the
global
optimal
 Small α:
won’t miss
the global
optimal,
but may
converge
very slowly

23
Learning Rate (α) with Weight Decay

 Reason: reduce the residual (sum of squared error) fast at beginning, and
slow down gradually to converge to the global optimal
α0 is the initial learning rate (e.g., 0.5)
c is a constant which controls how quickly the rate decays (e.g., 10),
τ is the current iteration number of the gradient descent
24
Handling Categorical Descriptive Features

 Rental Price = w0 + w1 × size + w2 × floor + w3 × broadband rate +

w4 × energy rating A + w5 × energy rating B +
w6 × energy rating C
 Note: you can save one extra feature (e.g., use ratings A and B only)
Rental Price = w0⃰ + w1⃰ × size + w2⃰ × floor + w3 ⃰ × broadband rate +
w4 ⃰ × energy rating A + w5⃰ × energy rating B
where w0 + w6 = w0⃰ , w1 = w1⃰ , w2 = w2⃰ , w3 = w3⃰ , w4 - w6 = w4⃰ , w5 - w6 = w5⃰
25
Predicting Categorical Targets

 Train a linear model (line or hyper plane), named decision surface, to classify
the data.
f (RPM, vibration) = w0 + w1 × RPM + w2 × vibration
Status = f(RPM, vibration) > 0 ? Good : Faulty

26
Error-based Learning

 Linear Regression
 Logistic Regression

27
Regression for Classification
 Linear Regression
𝑦 ← 𝑤𝑇𝑋
 Relationship between a scalar dependent variable 𝑦 and one
or more explanatory variables Y is discrete in a
classification problem!

y
1 𝑤 𝑇 𝑋 > 0.5 1.00
𝑦=
0 𝑤 𝑇 𝑋 ≤ 0.5
0.75

What if we have 0.50 Optimal

an outlier? 0.25 regression
model
0.00 x

28
Regression for Classification
 Logistic Regression
1
𝑝 𝑦𝑥 =𝜎 𝑤𝑇𝑋 =
1 + exp(−𝑤 𝑇 𝑋)
 Directly modeling of class posterior Sigmoid function

P(y|x)

1.00

0.75
What if we have 0.50
an outlier?
0.25

0.00 x

29
Why Sigmoid function?
 Logistic Regression
𝑃 𝑋 𝑦 = 1 𝑃(𝑦 = 1)
𝑃 𝑦=1𝑋 =
𝑃 𝑋 𝑦 = 1 𝑃 𝑦 = 1 + 𝑃 𝑋 𝑦 = 0 𝑃(𝑦 = 0)
1
=
𝑃 𝑋 𝑦 = 0 𝑃(𝑦 = 0)
1+
𝑃 𝑋 𝑦 = 1 𝑃(𝑦 = 1)

Binomial P(y|x)

𝑃 𝑦=1 =𝛼 1.00

0.75 𝑃 𝑋 𝑦 = 1 = 𝑁(𝜇1 , 𝛿 2 )
𝑃 𝑋 𝑦 = 0 = 𝑁(𝜇0 , 𝛿 2 ) 0.50
0.25 Normal with identical variance
x
30
Why Sigmoid function?
 Logistic Regression
𝑃 𝑋 𝑦 = 1 𝑃(𝑦 = 1)
𝑃 𝑦=1𝑋 =
𝑃 𝑋 𝑦 = 1 𝑃 𝑦 = 1 + 𝑃 𝑋 𝑦 = 0 𝑃(𝑦 = 0)
1 1
= =
𝑃 𝑋 𝑦 = 0 𝑃(𝑦 = 0) 𝑃 𝑋 𝑦 = 1 𝑃(𝑦 = 1)
1+ 1 + exp − ln
𝑃 𝑋 𝑦 = 1 𝑃(𝑦 = 1) 𝑃 𝑋 𝑦 = 0 𝑃(𝑦 = 0)
𝑉
𝑃 𝑋 𝑦 = 1 𝑃(𝑦 = 1) 𝑃(𝑦 = 1) 𝑃(𝑥𝑖 |𝑦 = 1)
ln = ln + ln
𝑃 𝑋 𝑦 = 0 𝑃(𝑦 = 0) 𝑃(𝑦 = 0) 𝑃(𝑥𝑖 |𝑦 = 0)
𝑖=1
𝑉 2 2
𝛼 𝜇1𝑖 − 𝜇0𝑖 𝜇1𝑖 − 𝜇0𝑖
= ln + 2 𝑥𝑖 −
1−𝛼 𝛿𝑖 2𝛿𝑖2
𝑖=1
𝑉
Origin of the name: 𝜇1𝑖 − 𝜇0𝑖 1 𝑥−𝜇 2
= 𝑤0 + 𝑥𝑖 𝑃 𝑥𝑦 = 𝑒
−
2𝛿 2
logit function 𝛿𝑖2 𝛿 2𝜋
𝑖=1
= 𝑤0 + 𝑤𝑇𝑋 = 𝑤𝑇𝑋
31
Why Sigmoid function?
 Logistic Regression
𝑃 𝑋 𝑦 = 1 𝑃(𝑦 = 1)
𝑃 𝑦=1𝑋 =
𝑃 𝑋 𝑦 = 1 𝑃 𝑦 = 1 + 𝑃 𝑋 𝑦 = 0 𝑃(𝑦 = 0)
1
=
𝑃 𝑋 𝑦 = 0 𝑃(𝑦 = 0)
1+
𝑃 𝑋 𝑦 = 1 𝑃(𝑦 = 1)
1
=
𝑃 𝑋 𝑦 = 1 𝑃(𝑦 = 1)
1 + exp − ln
𝑃 𝑋 𝑦 = 0 𝑃(𝑦 = 0)
1
=
1 + exp(−𝑤 𝑇 𝑋)
Generalized Linear Model

Note: it is still a linear relation among the features!

32
Another Dataset

 logistic model is more functional

in derivative deduction
 Output range [0, 1], represent
either the normalized numerical
result or the probability of the
query falling into one target
category
1
𝑙𝑜𝑔𝑖𝑠𝑡𝑖𝑐 𝑥 =
1 + 𝑒 −𝑥
33
Logistic Regression

1
 Sigmoid function: replace x in the logistic formula 𝑙𝑜𝑔𝑖𝑠𝑡𝑖𝑐 𝑥 = with
1 + 𝑒 −𝑥
the linear model described before, i.e.,
1
𝑀 𝑅𝑃𝑀, 𝑉𝑖𝑏𝑟𝑎𝑡𝑖𝑜𝑛 =
1 + 𝑒 −1∗(𝑤0+𝑤1∗𝑅𝑃𝑀+𝑤2∗𝑉𝑖𝑏𝑟𝑎𝑡𝑖𝑜𝑛)
 Then, use gradient descent method to train the weights of the logistic model
34
Turning Probability into a Classifier
1
𝑃 𝑦 =1 =𝜎 𝑤∙𝑥+𝑏 =
1 + exp(− 𝑤 ∙ 𝑥 + 𝑏 )

1
𝑃 𝑦 =0 =1−𝜎 𝑤∙𝑥+𝑏 =1−
1 + exp(− 𝑤 ∙ 𝑥 + 𝑏 )
exp(− 𝑤 ∙ 𝑥 + 𝑏 )
= = 𝜎 −(𝑤 ∙ 𝑥 + 𝑏)
1 + exp(− 𝑤 ∙ 𝑥 + 𝑏 )

1 𝑖𝑓 𝑃 𝑦 = 1 𝑥 > 0.5 0.5 here is called the decision boundary

𝑦=
0 𝑂𝑡𝑕𝑒𝑟𝑤𝑖𝑠𝑒

𝑁𝑜𝑡𝑒: 1 − 𝜎 𝑥 = 𝜎 −𝑥

35
Weight Adjustment
 Goal: adjust the weights to minimize the residual
 Apply gradient descent to the sum of squared error with
respect to wj equal to 0 (partial derivative by chain rule)
𝜕
𝐿 𝑀 , 𝐷 = (𝑡 − 𝑀𝑤 𝑑 ) × 𝑀𝑤 (𝑑) × (1 − 𝑀𝑤 𝑑 ) × 𝑑[𝑗]
𝜕𝑤 𝑗 2 𝑤

Mw(d) represent the calculated output (prediction), t is the

target value coming with the training data.
 The weight adjustment amount by each training data is:
𝑤 𝑗 ← 𝑤 𝑗 + 𝛼 × (𝑡 − 𝑀𝑤 𝑑 ) × 𝑀𝑤 (𝑑) × (1 − 𝑀𝑤 𝑑 ) × 𝑑[𝑗]
 Considering all training data, the weight will be adjusted by
𝑛

𝑤 𝑗 ←𝑤 𝑗 +𝛼× ((𝑡 − 𝑀𝑤 𝑑 ) × 𝑀𝑤 (𝑑) × (1 − 𝑀𝑤 𝑑 ) × 𝑑[𝑗])

𝑖=1
i is the index of the training data, range [1,n]
36
Weight Update through Batch Gradient Descent

37
Logistic Model Change with Weights Update

1
Final model: M𝑤 RPM, Vibration =
1 + 𝑒 −(−0.4077+4.1697∗𝑅𝑃𝑀+6.0460∗𝑉𝑖𝑏𝑟𝑎𝑡𝑖𝑜𝑛)
38
Logistic Regression in Text Mining

 Sentiment Analysis: suppose we are doing binary

sentiment classification on movie review text, and
we would like to know whether to assign the
sentiment class 1 = positive or 0 = negative to the
following review

 e.g., what is the sentiment of the follow text?

It's hokey. There are virtually no surprises, and the writing is
second-rate. So why was it so enjoyable? For one thing, the
cast is great. Another nice touch is the music. I was overcome
with the urge to get off the couch and start dancing. It sucked
me in, and it'll do the same to you.

39
Feature Extraction

40
Classifying Sentiment for Input
 Let’s assume for the moment that we’ve already learned a
real-valued weight for each of the features, and that the 6
weights corresponding to the 6 features are as follows:
[2.5, -5.0, -1.2, 0.5, 2.0, 0.7]
and b = 0.1
 Thus, we have the following and classify the text as “positive”
𝑝 + 𝑥 =𝑝 𝑦 =1 𝑥 =𝜎 𝑤∙𝑥+𝑏
= 𝜎 2.5, −5.0, −1.2, 0.5, 2.0,0.7 ∙ [3,2,1,3,0,4.19] + 0.1
= 𝜎 .833 = 0.70

𝑝 − 𝑥 = 𝑝 𝑦 = 0 𝑥 = 1 − 𝜎 𝑤 ∙ 𝑥 + 𝑏 = 0.30

41
Build Features
 By examining the training set with an eye to linguistic
intuitions and the linguistic literature on the domain
 Created automatically via feature templates
 Example: Period Disambiguation
End of sentence
This ends in a period.
The house at 465 Main St. is new.
Not end

42
Weight Adjustment: Revisit
 If the loss function f is twice differentiable and the domain is
real, then we can characterize it as follows:
 f is convex if and only if f ”(x) ≥ 0 for all x. Hence if the double
derivative of the loss function is ≥ 0, then we can claim it to be
convex (i.e., guarantee to find global optima)
 For logistic regression, if the loss function is sum of squared
error (SSE), it is NOT convex! (i.e., may stuck at local optima)
2
1
𝑓 𝑥 = (𝑦 − 𝑦)2 = 𝑦 −
1 + exp − 𝑤 ∙ 𝑥 + 𝑏
𝜕2𝑓 𝑥 2
Not always ≥0
= −2 𝑦 − 2𝑦𝑦 − 2𝑦 + 3𝑦 ∗ 𝑥𝑗 ∗ 𝑥𝑗 ∗ 𝑦(1 − 𝑦)
𝜕𝑤𝑗 2
 Is there a convex loss function? —Yes, cross-entropy loss!
https://towardsdatascience.com/why-not-mse-as-a-loss-function-for-logistic-regression-589816b5e03c

43
Training of Logistic Regression
 Goal: maximize probability of the correct label p(y|x)
 Since there are only 2 discrete outcomes (0 or 1), we can
express the probability p(y|x) from our classifier as:
𝑝(𝑦|𝑥) = 𝑦 𝑦 (1 − 𝑦)1−𝑦
If y =1, simplifies to 𝑦, if y =0, simplifies to 1- 𝑦.
 Take the log of both sides — mathematically handy maximize
log p(y|x) will maximize p(y|x)
log 𝑝 𝑦 𝑥 = log 𝑦 𝑦 (1 − 𝑦)1−𝑦 = [𝑦 ∗ log 𝑦 + (1 − 𝑦) ∗ log(1 − 𝑦)]
 For a training dataset of N instances, maximize the log
conditional likelihood:
𝑁 𝑁 𝑁

log 𝑝(𝑦𝑖 |𝑥𝑖 ) = log(𝑦𝑖 𝑦𝑖 (1 − 𝑦𝑖 )1−𝑦𝑖 ) = 𝑦𝑖 ∗ log 𝑦𝑖 + (1 − 𝑦𝑖 ) ∗ log(1 − 𝑦𝑖 )

𝑖=1 𝑖=1 𝑖=1

44
Cross-Entropy Loss: 𝑳𝑪𝑬 𝒚, 𝒚
 Flip the sign of target function to turn it into a loss
function 𝐿𝐶𝐸 𝑦, 𝑦 : (to minimize)
𝐿𝐶𝐸 𝑦, 𝑦 = −log 𝑝 𝑦 𝑥 = −[𝑦 ∗ log 𝑦 + (1 − 𝑦) ∗ log(1 − 𝑦)]

Plugging in the definition of 𝑦:

𝐿𝐶𝐸 𝑦, 𝑦 = −[𝑦 ∗ log 𝜎 𝑤 ∙ 𝑥 + 𝑏 + (1 − 𝑦) ∗ log(1 − 𝜎 𝑤 ∙ 𝑥 + 𝑏 )]
 For a training dataset of N instances, 𝐿𝐶𝐸 𝑦, 𝑦 is:
𝑁

𝐿𝐶𝐸 𝑦, 𝑦 = − [𝑦𝑖 ∗ log 𝑦𝑖 + (1 − 𝑦𝑖 ) ∗ log(1 − 𝑦𝑖 )]

𝑖=1
𝑁 x(i): the ith instance
=− [𝑦𝑖 ∗ log 𝜎 𝑤 ∙ 𝑥(𝑖) + 𝑏 + (1 − 𝑦𝑖 ) ∗ log(1 − 𝜎 𝑤 ∙ 𝑥(𝑖) + 𝑏 )]
𝑖=1

45
Cross-Entropy Loss: Example

 True value is y =1, given the weights values as [2.5, -5.0,

-1.2, 0.5, 2.0, 0.7] and b = 0.1, How well is our model?
𝑝 + 𝑥 = 𝑝 𝑦 = 1 𝑥 = 0.70 very good!

 How about the loss?

𝐿𝐶𝐸 𝑦, 𝑦 = −[𝑦 ∗ log 𝜎 𝑤 ∙ 𝑥 + 𝑏 + (1 − 𝑦) ∗ log(1 − 𝜎 𝑤 ∙ 𝑥 + 𝑏 )]
= − log 𝜎 𝑤 ∙ 𝑥 + 𝑏 = − log 0.70 = 0.36
What about the loss if the true value is y = 0?
𝐿𝐶𝐸 𝑦, 𝑦 = −[𝑦 ∗ log 𝜎 𝑤 ∙ 𝑥 + 𝑏 + (1 − 𝑦) ∗ log(1 − 𝜎 𝑤 ∙ 𝑥 + 𝑏 )]
= − log 1 − 𝜎 𝑤 ∙ 𝑥 + 𝑏 = − log 0.30 = 1.2

46
Gradient for Cross-Entropy Loss Function
 What’s the partial derivative of the loss function?
𝐿𝐶𝐸 𝑦, 𝑦 = −[𝑦 ∗ log 𝜎 𝑤 ∙ 𝑥 + 𝑏 + (1 − 𝑦) ∗ log(1 − 𝜎 𝑤 ∙ 𝑥 + 𝑏 )]

 Apply chain rule, the elegant derivative is:

𝜕𝐿𝐶𝐸 𝑦, 𝑦
= 𝜎 𝑤 ∙ 𝑥 + 𝑏 − 𝑦 ∗ 𝑥𝑗
𝜕𝑤𝑗
Note: do the math, use textbook or the Appendix as reference
 For a training dataset of N instances, the gradient is:
𝑁 𝑁
𝜕 𝑖=1 𝐿𝐶𝐸 𝑦𝑖 , 𝑦𝑖
= 𝜎 𝑤 ∙ 𝑥(𝑖) + 𝑏 − 𝑦𝑖 ∗ 𝑥(𝑖)𝑗
𝜕𝑤𝑗
𝑖=1
x(i)j: the jth feature of the ith instance

47
Weight Adjustment for Logistic Regression
 Given a learning rate η (hyperparameter), train the model:

48
Stochastic Weight Adjustment: One-Step
 A mini-sentiment example, where y=1 (positive)
 Two features:
x1 = 3 (count of positive lexicon words)
x2 = 2 (count of negative lexicon words)
 Assume the three parameters (2 weights and 1 bias) of the
logistic regression model Θ are initialized to zeros:
w1 = w2 = b = 0, and η = 0.1
𝜕𝐿𝐶𝐸 𝑦, 𝑦
 Calculate gradient descent using: = 𝜎 𝑤∙𝑥+𝑏 −𝑦 ∗ 𝑥𝑗
𝜕𝑤𝑗
𝜕𝐿𝐶𝐸 (𝑦, 𝑦)
𝜕𝑤1
𝜎 𝑤 ∙ 𝑥 + 𝑏 − 𝑦 𝑥1 𝜎 0 − 1 𝑥1 −0.5𝑥1 −1.5
𝜕𝐿𝐶𝐸 (𝑦, 𝑦)
𝛻𝑤,𝑏 = = 𝜎 𝑤 ∙ 𝑥 + 𝑏 − 𝑦 𝑥2 = 𝜎 0 − 1 𝑥2 = −0.5𝑥2 = −1.0
𝜕𝑤2 𝜎 𝑤∙𝑥+𝑏 −𝑦 −0.5
𝜎 0 −1 −0.5
𝜕𝐿𝐶𝐸 (𝑦, 𝑦)
𝜕𝑏 𝑤1 0 −1.5 .15
 Update weights and bias in mode: 𝜃 = 𝑤2 = 0 − η −1.0 = .1
𝑏 0 −0.5 .05
49
Overfitting
 A model that perfectly match the training data may
overfit to the data, modeling noise
 A random word that perfectly predicts y (it happens to only occur in
one class) will get a very high weight.
 Failing to generalize to a test set without this word
 A good model should be able to generalize!
 e.g., 4-gram model on tiny data will just memorize the
data, but be surprised by novel 4-grams in testing data!

50
Regularization
 How to avoid overfitting
 Regularization in regression
 Dropout in neural networks (will talk about it later)
 Regularization ― Add a regularization term R(θ) to the
loss function (for now written as maximizing log prob
rather than minimizing loss)
𝑚

𝜃 = argmax log 𝑃 𝑦 (𝑖) |𝑥 (𝑖) − 𝛼𝑅(𝜃)

𝜃
𝑖=1

 Idea: choose an R(θ) that penalizes large weights

 Fitting the data well with lots of big weights not as good as
fitting the data a little less well, with small weights

51
L2 Regularization (= Ridge Regression)
 The sum of the squares of the weights
 The name is because this is the (square of the) L2 norm
‖θ‖2, = Euclidean distance of θ to the origin.

 L2 regularized objective function:

𝑚 𝑛

𝜃 = argmax log 𝑃 𝑦 (𝑖) |𝑥 (𝑖) − 𝛼 𝜃𝑗 2

𝜃
𝑖=1 𝑗=1

52
L1 Regularization (= Lasso Regression)
 The sum of the (absolute value of the) weights
 Named after the L1 norm ‖W‖1, = sum of the absolute
values of the weights, = Manhattan distance

 L1 regularized objective function:

𝑚 𝑛

𝜃 = argmax log 𝑃 𝑦 (𝑖) |𝑥 (𝑖) − 𝛼 |𝜃𝑗 |

𝜃
𝑖=1 𝑗=1

53
Multinomial Logistic Regression
 Often we need more than 2 classes
 Positive/negative/neutral
 Parts of speech (noun, verb, adjective, adverb, preposition, etc.)
 Classify emergency SMSs into different actionable classes
 If >2 classes, we use multinomial logistic regression,
which is also named as:
= Softmax regression
= Multinomial logistic regression
= (defunct names : Maximum entropy modeling or MaxEnt)
 So "logistic regression" just means binary classification

54
Multinomial Logistic Regression
 The probability of everything must still sum to 1
P(positive|doc) + P(negative|doc) + P(neutral|doc) = 1
 A generalization of the sigmoid called softmax, takes a
vector z = [z1, z2, ..., zk] of k arbitrary values, and outputs a
probability distribution
 each value in the range [0,1]
 all the values summing to 1
exp 𝑧𝑖
𝑠𝑜𝑓𝑡𝑚𝑎𝑥 𝑧𝑖 = 𝑘 1≤𝑖≤𝑘
𝑗=1 exp 𝑧𝑗

𝑘 𝑧𝑖
 Denominator 𝑖=1 𝑒 normalizes values into probabilities
exp 𝑧1 exp 𝑧2 exp 𝑧𝑘
𝑠𝑜𝑓𝑡𝑚𝑎𝑥 𝑧 = 𝑘
, 𝑘
,…, 𝑘
𝑖=1 exp 𝑧𝑖 𝑖=1 exp 𝑧𝑖 𝑖=1 exp 𝑧𝑖

55
Example of Softmax
 Given a vector z = [z1, z2, ..., zk] of k arbitrary values:
𝑧 = [0.6, 1.1, −1.5, 1.2, 3.2, −1.1]
 Using softmax function:
exp 𝑧1 exp 𝑧2 exp 𝑧𝑘
𝑠𝑜𝑓𝑡𝑚𝑎𝑥 𝑧 = 𝑘 , 𝑘 ,…, 𝑘
𝑖=1 exp 𝑧𝑖 𝑖=1 exp 𝑧𝑖 𝑖=1 exp 𝑧𝑖
 The probabilities of the class are:
[0.055, 0.090, 0.0067, 0.10, 0.74, 0.010]

56
Features in Multinomial Logistic Regression
 Softmax in multinomial logistic regression
 Input is still the dot product between weight vector w and
input vector x
 But now we’ll need separate weight vectors for each of the
K classes
 Features in binary logistic regression: one weight each
feature
w5 = 3.0

 Features in multinomial logistic regression: k weights

each feature (i.e., one weight per each class)

57
Binary vs. Multinomial Logistic Regression

58
What you should know
 Linear Regression
 Loss Function and Gradient Descent Optimization
 Mean Square Error
 Cross-Entropy
 Logistic Regression
 Binary Classes
 Multinomial Classes
 Overfitting and Regularization

59
Reading
 SLP Chapter 5
 Chapter 7 of “Machine Learning for Predictive Data
Analytics: Algorithms, Worked Examples, and Case
Studies”, MIT Press, 2015

Note: SLP: “Speech and Language Processing”, check

slide 30 of lecture 1 for the acronyms of the textbooks

60
Acknowledgement
 The slides of this lecture were prepared based on the
course materials of the following scholars:
 Text contents of John Kelleher, et al, “Machine Learning for Predictive
Data Analytics: Algorithms, Worked Examples, and Case Studies”
 Textbook and slides of Dan Jurafsky, Stanford University, “Speech and
Language Processing”
 Course materials of many other higher educators (e.g., Hongning Wang
from University of Virginia, Greg Durrett from UT Austin, etc.)

61
Appendix A: Deriving the Gradient Equation

Linear Regression
No ratings yet
Linear Regression
38 pages
Error Based Learning
No ratings yet
Error Based Learning
48 pages
Brief Summary ML
No ratings yet
Brief Summary ML
25 pages
2EL1730 ML Lecture02 Linear and Logistic Regression
No ratings yet
2EL1730 ML Lecture02 Linear and Logistic Regression
65 pages
Lecture3 - Linear Regression and Logistic Regression
No ratings yet
Lecture3 - Linear Regression and Logistic Regression
60 pages
Linear Regression
No ratings yet
Linear Regression
91 pages
Linear Regression for Beginners
No ratings yet
Linear Regression for Beginners
36 pages
Module 2 MLA
No ratings yet
Module 2 MLA
33 pages
Lecture 2
No ratings yet
Lecture 2
66 pages
Machine Learning Overview and Techniques
No ratings yet
Machine Learning Overview and Techniques
38 pages
Supervised Learning & Regression
No ratings yet
Supervised Learning & Regression
41 pages
ML Day2
No ratings yet
ML Day2
7 pages
Lecture 8: Gradient Descent and Logistic Regression
No ratings yet
Lecture 8: Gradient Descent and Logistic Regression
39 pages
Machine Learning
No ratings yet
Machine Learning
9 pages
Predictive Maintenance
No ratings yet
Predictive Maintenance
66 pages
3-LG Eval
No ratings yet
3-LG Eval
52 pages
Lecture 3 - Regression
No ratings yet
Lecture 3 - Regression
47 pages
Lec4 Oct12 2022 PracticalNotes LinearRegression
No ratings yet
Lec4 Oct12 2022 PracticalNotes LinearRegression
34 pages
Linear Regression Basics and Model Fitting
No ratings yet
Linear Regression Basics and Model Fitting
35 pages
Module I Complete Notes
No ratings yet
Module I Complete Notes
136 pages
MECH4403 LR Week04
No ratings yet
MECH4403 LR Week04
25 pages
Regression
No ratings yet
Regression
25 pages
Machine Learning Notes Cs229 1
No ratings yet
Machine Learning Notes Cs229 1
217 pages
Lecture Slides Week11
No ratings yet
Lecture Slides Week11
33 pages
Introduction To Machine Learning: ETH Zurich Janik Schuettler Marcel Graetz FS18
No ratings yet
Introduction To Machine Learning: ETH Zurich Janik Schuettler Marcel Graetz FS18
18 pages
Lecture Slides-Week9
No ratings yet
Lecture Slides-Week9
46 pages
(MLP) Lecture Notes
No ratings yet
(MLP) Lecture Notes
22 pages
Lecture Slides-Week9,10
No ratings yet
Lecture Slides-Week9,10
66 pages
Machine Learning Concepts Explained
No ratings yet
Machine Learning Concepts Explained
17 pages
ML Lecture - 3
No ratings yet
ML Lecture - 3
47 pages
Neural Networks
No ratings yet
Neural Networks
38 pages
Lecture Slides-Week11
No ratings yet
Lecture Slides-Week11
32 pages
ML Day3
No ratings yet
ML Day3
10 pages
GR 1 Report Week 7
No ratings yet
GR 1 Report Week 7
6 pages
Machine Learning Basics and Techniques
No ratings yet
Machine Learning Basics and Techniques
56 pages
Cs229 ML Notes
No ratings yet
Cs229 ML Notes
192 pages
Lecture Notes 5 Linear Regression
No ratings yet
Lecture Notes 5 Linear Regression
11 pages
Developing A Machining Learning Models From Start To Finish.
No ratings yet
Developing A Machining Learning Models From Start To Finish.
59 pages
Supervised Machine Learning Overview
100% (1)
Supervised Machine Learning Overview
111 pages
Wk05 Machine Learning
No ratings yet
Wk05 Machine Learning
6 pages
Machine Learning Guide 2017
No ratings yet
Machine Learning Guide 2017
15 pages
12 - Bài Toán Phân L P - LR - v2
No ratings yet
12 - Bài Toán Phân L P - LR - v2
130 pages
Stanford ML
No ratings yet
Stanford ML
168 pages
Experiment N1
No ratings yet
Experiment N1
7 pages
Generalized Linear Model
No ratings yet
Generalized Linear Model
67 pages
Week 4 Linear Regression
No ratings yet
Week 4 Linear Regression
38 pages
Machine Learning With Python
No ratings yet
Machine Learning With Python
30 pages
Linear Regression
No ratings yet
Linear Regression
130 pages
3.linear Regression
No ratings yet
3.linear Regression
18 pages
Module I-Part 1
No ratings yet
Module I-Part 1
48 pages
Linear Regression Practice Lab Guide
No ratings yet
Linear Regression Practice Lab Guide
14 pages
Statistical Machine Learning Overview
No ratings yet
Statistical Machine Learning Overview
45 pages
A Tutorial of Machine Learning
No ratings yet
A Tutorial of Machine Learning
16 pages
Lecture3 Supervised Learning I
No ratings yet
Lecture3 Supervised Learning I
84 pages
Complete Chapter Revision Takeaways Supervised ML Regression
No ratings yet
Complete Chapter Revision Takeaways Supervised ML Regression
22 pages
MLA TAB Lecture3
No ratings yet
MLA TAB Lecture3
70 pages
Cambridge Lower Secondary Science Workbook 9
78% (32)
Cambridge Lower Secondary Science Workbook 9
200 pages
Lower Secondary Science 7 End-Of-Year Test
82% (28)
Lower Secondary Science 7 End-Of-Year Test
9 pages
Cambridge Lower Secondary Science Workbook 7
76% (62)
Cambridge Lower Secondary Science Workbook 7
25 pages
Cambridge Year 7 Math Learner Book Answers
87% (157)
Cambridge Year 7 Math Learner Book Answers
57 pages
States of Matter Worksheet
80% (44)
States of Matter Worksheet
2 pages
Year 7 Science End-of-Year Test
88% (16)
Year 7 Science End-of-Year Test
9 pages
Year 7 End-of-Year Maths Test
90% (30)
Year 7 End-of-Year Maths Test
7 pages
S7 End-Of-unit Test 1
87% (15)
S7 End-Of-unit Test 1
2 pages
Year 8 Science Exam Revision Notes
100% (10)
Year 8 Science Exam Revision Notes
7 pages
Stage7 Science Paper1 2024
75% (36)
Stage7 Science Paper1 2024
16 pages
Balance The Equations Below:: Chapter 7 Worksheet #1 Balancing Chemical Equations
91% (11)
Balance The Equations Below:: Chapter 7 Worksheet #1 Balancing Chemical Equations
4 pages
Lower Secondary Science 7 End-Of-Year Test
92% (13)
Lower Secondary Science 7 End-Of-Year Test
9 pages
SCIENCE SECONDARY - Cambridge Stage 7 Work Book 2021 (Collins)
56% (9)
SCIENCE SECONDARY - Cambridge Stage 7 Work Book 2021 (Collins)
172 pages
Science Unit Tests All
86% (7)
Science Unit Tests All
32 pages
Dichotomous Key Practice Worksheets
91% (11)
Dichotomous Key Practice Worksheets
5 pages
Mid-Year Examination, 2023 Science Year 7 1 Hour
88% (8)
Mid-Year Examination, 2023 Science Year 7 1 Hour
23 pages
Lower - Secondary - Science - 8 - End-Of-Year Test
95% (21)
Lower - Secondary - Science - 8 - End-Of-Year Test
9 pages
IGCSE Chemistry Notes
93% (61)
IGCSE Chemistry Notes
46 pages
Ls Maths8 2ed TR Workbook Answers-2
88% (182)
Ls Maths8 2ed TR Workbook Answers-2
60 pages
Separation Techniques Worksheet
100% (4)
Separation Techniques Worksheet
2 pages
S7 U1 Worksheets
100% (12)
S7 U1 Worksheets
19 pages
1416752cambridge Y7 WB MS
82% (82)
1416752cambridge Y7 WB MS
60 pages
States of Matter Practice Test: Multiple Choice
91% (11)
States of Matter Practice Test: Multiple Choice
4 pages
States of Matter Worksheet For STE
73% (11)
States of Matter Worksheet For STE
2 pages
Cambridge Checkpoint Lower Secondary Science Students Book 8
88% (59)
Cambridge Checkpoint Lower Secondary Science Students Book 8
517 pages
Year 8 Science Exam Practice 2016
79% (24)
Year 8 Science Exam Practice 2016
39 pages
Elements Compounds and Mixtures HW For Assessment
100% (1)
Elements Compounds and Mixtures HW For Assessment
9 pages
Chemical Reaction and Equation
88% (67)
Chemical Reaction and Equation
15 pages
IGCSE Chemistry Notes
93% (42)
IGCSE Chemistry Notes
70 pages
Cambridge Past Paper Stage 7 + Their Answer
92% (26)
Cambridge Past Paper Stage 7 + Their Answer
160 pages
Ultra-Wideband Matching for RVD Antennas
No ratings yet
Ultra-Wideband Matching for RVD Antennas
5 pages
Energy Efficient Design of Low-Profile Wideband Microstrip Patch Antennas Using Deep Learning
No ratings yet
Energy Efficient Design of Low-Profile Wideband Microstrip Patch Antennas Using Deep Learning
3 pages
Lecture - 3 - Statistical Language Models
No ratings yet
Lecture - 3 - Statistical Language Models
56 pages
Mobility Management
No ratings yet
Mobility Management
21 pages
Winning Software Project Pitch Guide
No ratings yet
Winning Software Project Pitch Guide
14 pages
RF24 Arduino Communication Code
No ratings yet
RF24 Arduino Communication Code
2 pages
Abuja Project
No ratings yet
Abuja Project
4 pages
Pay Electricity Bill Online Guide
No ratings yet
Pay Electricity Bill Online Guide
1 page
Patch Report
No ratings yet
Patch Report
48 pages
Sierpinski Carpet Fractal Antenna Design
No ratings yet
Sierpinski Carpet Fractal Antenna Design
16 pages