Logistic Regression
Lin Li
CS@PVAMU
Logistic Regression
Important analytic tool in natural and social sciences
Baseline supervised machine learning tool for
classification
Is also the foundation of neural networks
Two phases of logistic regression:
Training: we learn weights w and b using stochastic
gradient descent and cross-entropy loss.
Test: Given a test example x we compute p(y|x) using
learned weights w and b, and return whichever label (y = 1
or y = 0) is higher probability
2
Generative vs. Discriminative Classifiers
Suppose we're distinguishing cat from dog images
ImageNet ImageNet
Naive Bayes is a generative classifier
K-Nearest Neighbor, Rocchio Classifier, Logistic regression,
and Linear Regression (while being used for classification, e.g.,
Perceptron) are discriminative classifiers
3
Generative Classifiers
Build a model of what's in a cat image
Knows about whiskers, ears, eyes
Assigns a probability to any image
how cat-y is this image?
Also, build a model of what's in a dog image
Given a new image, run both models
to see which one fits better
4
Discriminative Classifiers
Just try to distinguish dogs from cats
Look, dogs have collars! Let's ignore everything else
5
Generative vs. Discriminative Classifiers
Generative models Discriminative models
Specifying joint distribution Specifying conditional
Full probabilistic distribution
specification for all the Only explain the target variable
random variables Arbitrary features can be
Dependence assumption incorporated for modeling
has to be specified for 𝑝 𝒕𝒘
𝑝 𝒘 𝒕 and 𝑝(𝒕) Need labeled data, only
Flexible, can be used in suitable for (semi-) supervised
unsupervised learning learning
e.g., Naïve Bayes e.g. Logistic Regression
𝑙𝑖𝑘𝑒𝑙𝑖ℎ𝑜𝑜𝑑 𝑝𝑟𝑖𝑜𝑟 𝑝𝑜𝑠𝑡𝑒𝑟𝑖𝑜𝑟
𝑐 = argmax 𝑃(𝑑|𝑐) 𝑃(𝑐) 𝑐 = argmax 𝑃(𝑐|𝑑)
𝑐∈𝐶 𝑐∈𝐶
6
Probabilistic Learning Classifier
Given m input/output pairs (x(i),y(i)), the
components include:
A feature representation of the input. For each input
observation x(i), a vector of features [x1, x2, ... , xn].
Feature j for input x(i) is xj, more completely xj(i), or
sometimes fj(x).
A classification function that computes 𝑦, the
estimated class, via p(y|x), like the sigmoid or softmax
functions.
An objective function for learning, like mean squared
error or cross-entropy loss.
An algorithm for optimizing the objective function:
stochastic gradient descent.
7
Error-based Learning
Linear Regression
Logistic Regression
8
Regression
Given a dataset, how to use a parameterized prediction model to
represent the relationship between the descriptive features and the
target feature?
Note: all chapter indexed pictures (e.g., Table 7.1) in this lecture are from the textbook of John
Kelleher, et al, “Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and
Case Studies”
9
Linear Regression
Initiate a linear equation representing the relation between
descriptive features and the target feature
E.g., between size (x) and price (y), y = w × x + b; which is a line with
w being the slope, and b being the y intercept.
Key → find w and b (e.g., after exploration, price = 6.47 + 0.62 × size)
10
Measuring Prediction Error
Linear regression formula for multiple descriptive features (e.g.,
k features)
t = w1 × d1 + w2 × d2 + … + wk × dk + w0
where w1 , w2 , …, wk are the weights of the respective features
Measure the sum of the squared error (SSE, difference between
prediction and target value) or the sum of the squared residual1.
e.g., SSE is calculated as: 𝑛
1
𝐿2 𝑀𝑤 , 𝐷 = (𝑡𝑖 − 𝑀𝑤 (𝑑𝑖 ))2
2
𝑖=1
e.g., for the size (d1) and price (t) model, after expansion,
𝑛
1
𝐿2 𝑀𝑤 , 𝐷 = (𝑡𝑖 − (𝑤 0 + 𝑤[1] × 𝑑𝑖 [1]))2
2
𝑖=1
11
Error Surface
Weight space and error surface: for example, t = w1 × d1 + w0
Surface is convex and there is a global minimum
12
Gradient Descent
Goal: adjust the weights to minimize the residual
How to: find the optimal weights at the point where the partial
derivatives of the error surface with respect to w0 and w1 are
equal to 0, i is the index of the data with range [1, n]
𝑛
𝜕 1
(𝑡𝑖 − (𝑤 0 + 𝑤[1] × 𝑑𝑖 [1]))2 = 0
𝜕𝑤[0] 2
𝑖=1
𝑛
𝜕 1
(𝑡𝑖 − (𝑤 0 + 𝑤[1] × 𝑑𝑖 [1]))2 = 0
𝜕𝑤[1] 2
𝑖=1
For multiple variable linear regression,
𝑚
𝑀𝑤 𝑑 = 𝑤 0 + 𝑤 1 × 𝑑 1 + ⋯ + 𝑤 𝑚 × 𝑑 𝑚 = 𝑤 0 + 𝑤[𝑗] × 𝑑 [𝑗]
𝑖=1
apply partial derivatives similarly
13
Gradient Descent (cont’d)
Idea: starts by initiate the weights to “random” values, then adjust it along the
deepest direction (slope) of the current point toward the minimum.
14
Gradient Descent (cont’d)
Gradient with respect to each weight: (partial derivative by
chain rule) 𝜕
𝐿 𝑀 ,𝐷 =
𝜕 1
(𝑡 − 𝑀 (𝑑))2
2 𝑤 𝑤
𝜕𝑤 𝑗 𝜕𝑤 𝑗 2
𝜕
= (𝑡 − 𝑀𝑤 (𝑑) × 𝑡 − 𝑀𝑤 (𝑑)
𝜕𝑤 𝑗
𝜕
= (𝑡 − 𝑀𝑤 (𝑑) × 𝑡 − (𝑤 ∙ 𝑑)
𝜕𝑤 𝑗
= (𝑡 − 𝑀𝑤 (𝑑) × −𝑑[𝑗]
Update the weights one by one by tuning them toward the
opposite direction of the calculated gradient, using a learning
rate (α) to decide the change amount
𝑛
𝑤 𝑗 ←𝑤 𝑗 +𝑎× ((𝑡𝑖 − 𝑀𝑤 𝑑𝑖 ) × 𝑑𝑖 [𝑗])
𝑖=1
𝑒𝑟𝑟𝑜𝑟𝐷𝑒𝑙𝑡𝑎(𝐷,𝑤[𝑗])
Then, re-measure the errors and keep updating the weights
as above until the error is less than a threshold (e.g., 0.1)
15
Error Change with Weights Update
16
Example
Rental Price = w0 + w1 × size + w2 × floor + w3 × Broadband Rate
Set α = 0.00000002, and initialize all weights between [-0.2, +0.2]
17
Example (cont’d)
After 100 iterations, w0 = -0.1513, w1 = 0.6270, w2 = -0.1781, w3 = 0.0714, and
the squared error of the final model is 2913.5
Rental Price = w0 + w1 × size + w2 × floor + w3 × Broadband Rate
18
Batch and Stochastic Gradient Descent
Batch Gradient Descent: weight adjustment at each iteration
based on summation of the squared error made by the
candidate model (linear regression formula) for the whole
training set
Stochastic Gradient Descent: weight adjustment based on the
error in the predication made by the candidate model (linear
regression formula) for each training instance individually
Mini-batch (a compromise of both): train/adjust the weights
each time on a group of m examples that is less than the whole
dataset (if m = 1, same as stochastic gradient descent; if m = size
of the whole dataset, same as batch gradient descent)
e.g., batch size is 512 or 1024
19
Interpreting Multivariable Linear Models
Signs of Weights: indicating whether the feature has a positive or
negative impact on the prediction
Magnitudes of Weights: how much the target value changes
according to the descriptive feature’s unit
Mistake: the features with higher weights are more important
than those with lower weights
Each descriptive feature has a different varying scale! Need to
performance statistical significance test before comparison.
A test-statistic is computed (e.g., t-test) to decide whether the
feature has significant impact on the target
P-value: the probability that the test-statistic is greater than or
equal to the one computed being the result of chance
Null-hypothesis: the feature has no significant impact on the target
P-value is compared with pre-defined significance threshold. If it is
≤ threshold (e.g., 5% or 1%), null hypothesis is rejected.
20
Interpreting Multivariable Linear Models (Example)
First: calculate the standard error of the
model (n is sample size)
Second: calculate the standard error on a
specific descriptive feature
Third : calculate t-statistic for this feature
Finally: check p-value associated with the
calculated t-test value in a look-up table,
and then Compare results
21
Interpreting Multivariable Linear Models (Example)
Often shown in data analytics (e.g., R or Python)
22
Impact of Learning Rate (α)
Large α:
converge
fast at
beginning,
but may
miss the
global
optimal
Small α:
won’t miss
the global
optimal,
but may
converge
very slowly
23
Learning Rate (α) with Weight Decay
Reason: reduce the residual (sum of squared error) fast at beginning, and
slow down gradually to converge to the global optimal
α0 is the initial learning rate (e.g., 0.5)
c is a constant which controls how quickly the rate decays (e.g., 10),
τ is the current iteration number of the gradient descent
24
Handling Categorical Descriptive Features
Rental Price = w0 + w1 × size + w2 × floor + w3 × broadband rate +
w4 × energy rating A + w5 × energy rating B +
w6 × energy rating C
Note: you can save one extra feature (e.g., use ratings A and B only)
Rental Price = w0⃰ + w1⃰ × size + w2⃰ × floor + w3 ⃰ × broadband rate +
w4 ⃰ × energy rating A + w5⃰ × energy rating B
where w0 + w6 = w0⃰ , w1 = w1⃰ , w2 = w2⃰ , w3 = w3⃰ , w4 - w6 = w4⃰ , w5 - w6 = w5⃰
25
Predicting Categorical Targets
Train a linear model (line or hyper plane), named decision surface, to classify
the data.
f (RPM, vibration) = w0 + w1 × RPM + w2 × vibration
Status = f(RPM, vibration) > 0 ? Good : Faulty
26
Error-based Learning
Linear Regression
Logistic Regression
27
Regression for Classification
Linear Regression
𝑦 ← 𝑤𝑇𝑋
Relationship between a scalar dependent variable 𝑦 and one
or more explanatory variables Y is discrete in a
classification problem!
y
1 𝑤 𝑇 𝑋 > 0.5 1.00
𝑦=
0 𝑤 𝑇 𝑋 ≤ 0.5
0.75
What if we have 0.50 Optimal
an outlier? 0.25 regression
model
0.00 x
28
Regression for Classification
Logistic Regression
1
𝑝 𝑦𝑥 =𝜎 𝑤𝑇𝑋 =
1 + exp(−𝑤 𝑇 𝑋)
Directly modeling of class posterior Sigmoid function
P(y|x)
1.00
0.75
What if we have 0.50
an outlier?
0.25
0.00 x
29
Why Sigmoid function?
Logistic Regression
𝑃 𝑋 𝑦 = 1 𝑃(𝑦 = 1)
𝑃 𝑦=1𝑋 =
𝑃 𝑋 𝑦 = 1 𝑃 𝑦 = 1 + 𝑃 𝑋 𝑦 = 0 𝑃(𝑦 = 0)
1
=
𝑃 𝑋 𝑦 = 0 𝑃(𝑦 = 0)
1+
𝑃 𝑋 𝑦 = 1 𝑃(𝑦 = 1)
Binomial P(y|x)
𝑃 𝑦=1 =𝛼 1.00
0.75 𝑃 𝑋 𝑦 = 1 = 𝑁(𝜇1 , 𝛿 2 )
𝑃 𝑋 𝑦 = 0 = 𝑁(𝜇0 , 𝛿 2 ) 0.50
0.25 Normal with identical variance
x
30
Why Sigmoid function?
Logistic Regression
𝑃 𝑋 𝑦 = 1 𝑃(𝑦 = 1)
𝑃 𝑦=1𝑋 =
𝑃 𝑋 𝑦 = 1 𝑃 𝑦 = 1 + 𝑃 𝑋 𝑦 = 0 𝑃(𝑦 = 0)
1 1
= =
𝑃 𝑋 𝑦 = 0 𝑃(𝑦 = 0) 𝑃 𝑋 𝑦 = 1 𝑃(𝑦 = 1)
1+ 1 + exp − ln
𝑃 𝑋 𝑦 = 1 𝑃(𝑦 = 1) 𝑃 𝑋 𝑦 = 0 𝑃(𝑦 = 0)
𝑉
𝑃 𝑋 𝑦 = 1 𝑃(𝑦 = 1) 𝑃(𝑦 = 1) 𝑃(𝑥𝑖 |𝑦 = 1)
ln = ln + ln
𝑃 𝑋 𝑦 = 0 𝑃(𝑦 = 0) 𝑃(𝑦 = 0) 𝑃(𝑥𝑖 |𝑦 = 0)
𝑖=1
𝑉 2 2
𝛼 𝜇1𝑖 − 𝜇0𝑖 𝜇1𝑖 − 𝜇0𝑖
= ln + 2 𝑥𝑖 −
1−𝛼 𝛿𝑖 2𝛿𝑖2
𝑖=1
𝑉
Origin of the name: 𝜇1𝑖 − 𝜇0𝑖 1 𝑥−𝜇 2
= 𝑤0 + 𝑥𝑖 𝑃 𝑥𝑦 = 𝑒
−
2𝛿 2
logit function 𝛿𝑖2 𝛿 2𝜋
𝑖=1
= 𝑤0 + 𝑤𝑇𝑋 = 𝑤𝑇𝑋
31
Why Sigmoid function?
Logistic Regression
𝑃 𝑋 𝑦 = 1 𝑃(𝑦 = 1)
𝑃 𝑦=1𝑋 =
𝑃 𝑋 𝑦 = 1 𝑃 𝑦 = 1 + 𝑃 𝑋 𝑦 = 0 𝑃(𝑦 = 0)
1
=
𝑃 𝑋 𝑦 = 0 𝑃(𝑦 = 0)
1+
𝑃 𝑋 𝑦 = 1 𝑃(𝑦 = 1)
1
=
𝑃 𝑋 𝑦 = 1 𝑃(𝑦 = 1)
1 + exp − ln
𝑃 𝑋 𝑦 = 0 𝑃(𝑦 = 0)
1
=
1 + exp(−𝑤 𝑇 𝑋)
Generalized Linear Model
Note: it is still a linear relation among the features!
32
Another Dataset
logistic model is more functional
in derivative deduction
Output range [0, 1], represent
either the normalized numerical
result or the probability of the
query falling into one target
category
1
𝑙𝑜𝑔𝑖𝑠𝑡𝑖𝑐 𝑥 =
1 + 𝑒 −𝑥
33
Logistic Regression
1
Sigmoid function: replace x in the logistic formula 𝑙𝑜𝑔𝑖𝑠𝑡𝑖𝑐 𝑥 = with
1 + 𝑒 −𝑥
the linear model described before, i.e.,
1
𝑀 𝑅𝑃𝑀, 𝑉𝑖𝑏𝑟𝑎𝑡𝑖𝑜𝑛 =
1 + 𝑒 −1∗(𝑤0+𝑤1∗𝑅𝑃𝑀+𝑤2∗𝑉𝑖𝑏𝑟𝑎𝑡𝑖𝑜𝑛)
Then, use gradient descent method to train the weights of the logistic model
34
Turning Probability into a Classifier
1
𝑃 𝑦 =1 =𝜎 𝑤∙𝑥+𝑏 =
1 + exp(− 𝑤 ∙ 𝑥 + 𝑏 )
1
𝑃 𝑦 =0 =1−𝜎 𝑤∙𝑥+𝑏 =1−
1 + exp(− 𝑤 ∙ 𝑥 + 𝑏 )
exp(− 𝑤 ∙ 𝑥 + 𝑏 )
= = 𝜎 −(𝑤 ∙ 𝑥 + 𝑏)
1 + exp(− 𝑤 ∙ 𝑥 + 𝑏 )
1 𝑖𝑓 𝑃 𝑦 = 1 𝑥 > 0.5 0.5 here is called the decision boundary
𝑦=
0 𝑂𝑡𝑒𝑟𝑤𝑖𝑠𝑒
𝑁𝑜𝑡𝑒: 1 − 𝜎 𝑥 = 𝜎 −𝑥
35
Weight Adjustment
Goal: adjust the weights to minimize the residual
Apply gradient descent to the sum of squared error with
respect to wj equal to 0 (partial derivative by chain rule)
𝜕
𝐿 𝑀 , 𝐷 = (𝑡 − 𝑀𝑤 𝑑 ) × 𝑀𝑤 (𝑑) × (1 − 𝑀𝑤 𝑑 ) × 𝑑[𝑗]
𝜕𝑤 𝑗 2 𝑤
Mw(d) represent the calculated output (prediction), t is the
target value coming with the training data.
The weight adjustment amount by each training data is:
𝑤 𝑗 ← 𝑤 𝑗 + 𝛼 × (𝑡 − 𝑀𝑤 𝑑 ) × 𝑀𝑤 (𝑑) × (1 − 𝑀𝑤 𝑑 ) × 𝑑[𝑗]
Considering all training data, the weight will be adjusted by
𝑛
𝑤 𝑗 ←𝑤 𝑗 +𝛼× ((𝑡 − 𝑀𝑤 𝑑 ) × 𝑀𝑤 (𝑑) × (1 − 𝑀𝑤 𝑑 ) × 𝑑[𝑗])
𝑖=1
i is the index of the training data, range [1,n]
36
Weight Update through Batch Gradient Descent
37
Logistic Model Change with Weights Update
1
Final model: M𝑤 RPM, Vibration =
1 + 𝑒 −(−0.4077+4.1697∗𝑅𝑃𝑀+6.0460∗𝑉𝑖𝑏𝑟𝑎𝑡𝑖𝑜𝑛)
38
Logistic Regression in Text Mining
Sentiment Analysis: suppose we are doing binary
sentiment classification on movie review text, and
we would like to know whether to assign the
sentiment class 1 = positive or 0 = negative to the
following review
e.g., what is the sentiment of the follow text?
It's hokey. There are virtually no surprises, and the writing is
second-rate. So why was it so enjoyable? For one thing, the
cast is great. Another nice touch is the music. I was overcome
with the urge to get off the couch and start dancing. It sucked
me in, and it'll do the same to you.
39
Feature Extraction
40
Classifying Sentiment for Input
Let’s assume for the moment that we’ve already learned a
real-valued weight for each of the features, and that the 6
weights corresponding to the 6 features are as follows:
[2.5, -5.0, -1.2, 0.5, 2.0, 0.7]
and b = 0.1
Thus, we have the following and classify the text as “positive”
𝑝 + 𝑥 =𝑝 𝑦 =1 𝑥 =𝜎 𝑤∙𝑥+𝑏
= 𝜎 2.5, −5.0, −1.2, 0.5, 2.0,0.7 ∙ [3,2,1,3,0,4.19] + 0.1
= 𝜎 .833 = 0.70
𝑝 − 𝑥 = 𝑝 𝑦 = 0 𝑥 = 1 − 𝜎 𝑤 ∙ 𝑥 + 𝑏 = 0.30
41
Build Features
By examining the training set with an eye to linguistic
intuitions and the linguistic literature on the domain
Created automatically via feature templates
Example: Period Disambiguation
End of sentence
This ends in a period.
The house at 465 Main St. is new.
Not end
42
Weight Adjustment: Revisit
If the loss function f is twice differentiable and the domain is
real, then we can characterize it as follows:
f is convex if and only if f ”(x) ≥ 0 for all x. Hence if the double
derivative of the loss function is ≥ 0, then we can claim it to be
convex (i.e., guarantee to find global optima)
For logistic regression, if the loss function is sum of squared
error (SSE), it is NOT convex! (i.e., may stuck at local optima)
2
1
𝑓 𝑥 = (𝑦 − 𝑦)2 = 𝑦 −
1 + exp − 𝑤 ∙ 𝑥 + 𝑏
𝜕2𝑓 𝑥 2
Not always ≥0
= −2 𝑦 − 2𝑦𝑦 − 2𝑦 + 3𝑦 ∗ 𝑥𝑗 ∗ 𝑥𝑗 ∗ 𝑦(1 − 𝑦)
𝜕𝑤𝑗 2
Is there a convex loss function? —Yes, cross-entropy loss!
https://towardsdatascience.com/why-not-mse-as-a-loss-function-for-logistic-regression-589816b5e03c
43
Training of Logistic Regression
Goal: maximize probability of the correct label p(y|x)
Since there are only 2 discrete outcomes (0 or 1), we can
express the probability p(y|x) from our classifier as:
𝑝(𝑦|𝑥) = 𝑦 𝑦 (1 − 𝑦)1−𝑦
If y =1, simplifies to 𝑦, if y =0, simplifies to 1- 𝑦.
Take the log of both sides — mathematically handy maximize
log p(y|x) will maximize p(y|x)
log 𝑝 𝑦 𝑥 = log 𝑦 𝑦 (1 − 𝑦)1−𝑦 = [𝑦 ∗ log 𝑦 + (1 − 𝑦) ∗ log(1 − 𝑦)]
For a training dataset of N instances, maximize the log
conditional likelihood:
𝑁 𝑁 𝑁
log 𝑝(𝑦𝑖 |𝑥𝑖 ) = log(𝑦𝑖 𝑦𝑖 (1 − 𝑦𝑖 )1−𝑦𝑖 ) = 𝑦𝑖 ∗ log 𝑦𝑖 + (1 − 𝑦𝑖 ) ∗ log(1 − 𝑦𝑖 )
𝑖=1 𝑖=1 𝑖=1
44
Cross-Entropy Loss: 𝑳𝑪𝑬 𝒚, 𝒚
Flip the sign of target function to turn it into a loss
function 𝐿𝐶𝐸 𝑦, 𝑦 : (to minimize)
𝐿𝐶𝐸 𝑦, 𝑦 = −log 𝑝 𝑦 𝑥 = −[𝑦 ∗ log 𝑦 + (1 − 𝑦) ∗ log(1 − 𝑦)]
Plugging in the definition of 𝑦:
𝐿𝐶𝐸 𝑦, 𝑦 = −[𝑦 ∗ log 𝜎 𝑤 ∙ 𝑥 + 𝑏 + (1 − 𝑦) ∗ log(1 − 𝜎 𝑤 ∙ 𝑥 + 𝑏 )]
For a training dataset of N instances, 𝐿𝐶𝐸 𝑦, 𝑦 is:
𝑁
𝐿𝐶𝐸 𝑦, 𝑦 = − [𝑦𝑖 ∗ log 𝑦𝑖 + (1 − 𝑦𝑖 ) ∗ log(1 − 𝑦𝑖 )]
𝑖=1
𝑁 x(i): the ith instance
=− [𝑦𝑖 ∗ log 𝜎 𝑤 ∙ 𝑥(𝑖) + 𝑏 + (1 − 𝑦𝑖 ) ∗ log(1 − 𝜎 𝑤 ∙ 𝑥(𝑖) + 𝑏 )]
𝑖=1
45
Cross-Entropy Loss: Example
True value is y =1, given the weights values as [2.5, -5.0,
-1.2, 0.5, 2.0, 0.7] and b = 0.1, How well is our model?
𝑝 + 𝑥 = 𝑝 𝑦 = 1 𝑥 = 0.70 very good!
How about the loss?
𝐿𝐶𝐸 𝑦, 𝑦 = −[𝑦 ∗ log 𝜎 𝑤 ∙ 𝑥 + 𝑏 + (1 − 𝑦) ∗ log(1 − 𝜎 𝑤 ∙ 𝑥 + 𝑏 )]
= − log 𝜎 𝑤 ∙ 𝑥 + 𝑏 = − log 0.70 = 0.36
What about the loss if the true value is y = 0?
𝐿𝐶𝐸 𝑦, 𝑦 = −[𝑦 ∗ log 𝜎 𝑤 ∙ 𝑥 + 𝑏 + (1 − 𝑦) ∗ log(1 − 𝜎 𝑤 ∙ 𝑥 + 𝑏 )]
= − log 1 − 𝜎 𝑤 ∙ 𝑥 + 𝑏 = − log 0.30 = 1.2
46
Gradient for Cross-Entropy Loss Function
What’s the partial derivative of the loss function?
𝐿𝐶𝐸 𝑦, 𝑦 = −[𝑦 ∗ log 𝜎 𝑤 ∙ 𝑥 + 𝑏 + (1 − 𝑦) ∗ log(1 − 𝜎 𝑤 ∙ 𝑥 + 𝑏 )]
Apply chain rule, the elegant derivative is:
𝜕𝐿𝐶𝐸 𝑦, 𝑦
= 𝜎 𝑤 ∙ 𝑥 + 𝑏 − 𝑦 ∗ 𝑥𝑗
𝜕𝑤𝑗
Note: do the math, use textbook or the Appendix as reference
For a training dataset of N instances, the gradient is:
𝑁 𝑁
𝜕 𝑖=1 𝐿𝐶𝐸 𝑦𝑖 , 𝑦𝑖
= 𝜎 𝑤 ∙ 𝑥(𝑖) + 𝑏 − 𝑦𝑖 ∗ 𝑥(𝑖)𝑗
𝜕𝑤𝑗
𝑖=1
x(i)j: the jth feature of the ith instance
47
Weight Adjustment for Logistic Regression
Given a learning rate η (hyperparameter), train the model:
48
Stochastic Weight Adjustment: One-Step
A mini-sentiment example, where y=1 (positive)
Two features:
x1 = 3 (count of positive lexicon words)
x2 = 2 (count of negative lexicon words)
Assume the three parameters (2 weights and 1 bias) of the
logistic regression model Θ are initialized to zeros:
w1 = w2 = b = 0, and η = 0.1
𝜕𝐿𝐶𝐸 𝑦, 𝑦
Calculate gradient descent using: = 𝜎 𝑤∙𝑥+𝑏 −𝑦 ∗ 𝑥𝑗
𝜕𝑤𝑗
𝜕𝐿𝐶𝐸 (𝑦, 𝑦)
𝜕𝑤1
𝜎 𝑤 ∙ 𝑥 + 𝑏 − 𝑦 𝑥1 𝜎 0 − 1 𝑥1 −0.5𝑥1 −1.5
𝜕𝐿𝐶𝐸 (𝑦, 𝑦)
𝛻𝑤,𝑏 = = 𝜎 𝑤 ∙ 𝑥 + 𝑏 − 𝑦 𝑥2 = 𝜎 0 − 1 𝑥2 = −0.5𝑥2 = −1.0
𝜕𝑤2 𝜎 𝑤∙𝑥+𝑏 −𝑦 −0.5
𝜎 0 −1 −0.5
𝜕𝐿𝐶𝐸 (𝑦, 𝑦)
𝜕𝑏 𝑤1 0 −1.5 .15
Update weights and bias in mode: 𝜃 = 𝑤2 = 0 − η −1.0 = .1
𝑏 0 −0.5 .05
49
Overfitting
A model that perfectly match the training data may
overfit to the data, modeling noise
A random word that perfectly predicts y (it happens to only occur in
one class) will get a very high weight.
Failing to generalize to a test set without this word
A good model should be able to generalize!
e.g., 4-gram model on tiny data will just memorize the
data, but be surprised by novel 4-grams in testing data!
50
Regularization
How to avoid overfitting
Regularization in regression
Dropout in neural networks (will talk about it later)
Regularization ― Add a regularization term R(θ) to the
loss function (for now written as maximizing log prob
rather than minimizing loss)
𝑚
𝜃 = argmax log 𝑃 𝑦 (𝑖) |𝑥 (𝑖) − 𝛼𝑅(𝜃)
𝜃
𝑖=1
Idea: choose an R(θ) that penalizes large weights
Fitting the data well with lots of big weights not as good as
fitting the data a little less well, with small weights
51
L2 Regularization (= Ridge Regression)
The sum of the squares of the weights
The name is because this is the (square of the) L2 norm
‖θ‖2, = Euclidean distance of θ to the origin.
L2 regularized objective function:
𝑚 𝑛
𝜃 = argmax log 𝑃 𝑦 (𝑖) |𝑥 (𝑖) − 𝛼 𝜃𝑗 2
𝜃
𝑖=1 𝑗=1
52
L1 Regularization (= Lasso Regression)
The sum of the (absolute value of the) weights
Named after the L1 norm ‖W‖1, = sum of the absolute
values of the weights, = Manhattan distance
L1 regularized objective function:
𝑚 𝑛
𝜃 = argmax log 𝑃 𝑦 (𝑖) |𝑥 (𝑖) − 𝛼 |𝜃𝑗 |
𝜃
𝑖=1 𝑗=1
53
Multinomial Logistic Regression
Often we need more than 2 classes
Positive/negative/neutral
Parts of speech (noun, verb, adjective, adverb, preposition, etc.)
Classify emergency SMSs into different actionable classes
If >2 classes, we use multinomial logistic regression,
which is also named as:
= Softmax regression
= Multinomial logistic regression
= (defunct names : Maximum entropy modeling or MaxEnt)
So "logistic regression" just means binary classification
54
Multinomial Logistic Regression
The probability of everything must still sum to 1
P(positive|doc) + P(negative|doc) + P(neutral|doc) = 1
A generalization of the sigmoid called softmax, takes a
vector z = [z1, z2, ..., zk] of k arbitrary values, and outputs a
probability distribution
each value in the range [0,1]
all the values summing to 1
exp 𝑧𝑖
𝑠𝑜𝑓𝑡𝑚𝑎𝑥 𝑧𝑖 = 𝑘 1≤𝑖≤𝑘
𝑗=1 exp 𝑧𝑗
𝑘 𝑧𝑖
Denominator 𝑖=1 𝑒 normalizes values into probabilities
exp 𝑧1 exp 𝑧2 exp 𝑧𝑘
𝑠𝑜𝑓𝑡𝑚𝑎𝑥 𝑧 = 𝑘
, 𝑘
,…, 𝑘
𝑖=1 exp 𝑧𝑖 𝑖=1 exp 𝑧𝑖 𝑖=1 exp 𝑧𝑖
55
Example of Softmax
Given a vector z = [z1, z2, ..., zk] of k arbitrary values:
𝑧 = [0.6, 1.1, −1.5, 1.2, 3.2, −1.1]
Using softmax function:
exp 𝑧1 exp 𝑧2 exp 𝑧𝑘
𝑠𝑜𝑓𝑡𝑚𝑎𝑥 𝑧 = 𝑘 , 𝑘 ,…, 𝑘
𝑖=1 exp 𝑧𝑖 𝑖=1 exp 𝑧𝑖 𝑖=1 exp 𝑧𝑖
The probabilities of the class are:
[0.055, 0.090, 0.0067, 0.10, 0.74, 0.010]
56
Features in Multinomial Logistic Regression
Softmax in multinomial logistic regression
Input is still the dot product between weight vector w and
input vector x
But now we’ll need separate weight vectors for each of the
K classes
Features in binary logistic regression: one weight each
feature
w5 = 3.0
Features in multinomial logistic regression: k weights
each feature (i.e., one weight per each class)
57
Binary vs. Multinomial Logistic Regression
58
What you should know
Linear Regression
Loss Function and Gradient Descent Optimization
Mean Square Error
Cross-Entropy
Logistic Regression
Binary Classes
Multinomial Classes
Overfitting and Regularization
59
Reading
SLP Chapter 5
Chapter 7 of “Machine Learning for Predictive Data
Analytics: Algorithms, Worked Examples, and Case
Studies”, MIT Press, 2015
Note: SLP: “Speech and Language Processing”, check
slide 30 of lecture 1 for the acronyms of the textbooks
60
Acknowledgement
The slides of this lecture were prepared based on the
course materials of the following scholars:
Text contents of John Kelleher, et al, “Machine Learning for Predictive
Data Analytics: Algorithms, Worked Examples, and Case Studies”
Textbook and slides of Dan Jurafsky, Stanford University, “Speech and
Language Processing”
Course materials of many other higher educators (e.g., Hongning Wang
from University of Virginia, Greg Durrett from UT Austin, etc.)
61
Appendix A: Deriving the Gradient Equation
62