Data Mining
Logistic Regression and Neural Networks
CS 584 :: Fall 2024
Ziwei Zhu
Department of Computer Science
George Mason University
Some slides are from Dr. Theodora Chaspari.
1
• Quiz 2 today.
• Solutions to Quizzes 1 and 2 will be explained next week.
• Midterm exam is on 10/08
• HW2 out, due in 3 weeks (10/14).
2
Linear Classification
RentPrice = w0 + w1 × Size + w2 × DistanceFromGMU + ...

Predict whether the apartment will be rented or not.

[Figure: apartments plotted with axes Size and Distance to GMU]
3
Outline
• Brief review of probability
• Binary Logistic Regression
• Multi-class Logistic Regression
• Perceptron
• Multi-layer Perceptron (MLP)
• Design Issues of Neural Networks
4
Bernoulli Distribution
Toss a biased coin. A single experiment outputs head/tail.
[Figure: the probability distribution function (PDF) p(Y = y): p(Y = 1) = θ and p(Y = 0) = 1 − θ]
6
Bernoulli Distribution: Likelihood
• Suppose we toss this biased coin n times.
• Assume the result of the i-th toss is y_i (i = 1, 2, …, n).
• What’s the probability of observing these results?
7
Bernoulli Distribution: Likelihood
Suppose we toss this biased coin n times and the result of the i-th toss is y_i (i = 1, 2, …, n). The probability of observing these results is the likelihood

p(y_1, …, y_n | θ) = ∏_{i=1}^{n} p(y_i | θ) = ∏_{i=1}^{n} θ^{y_i} (1 − θ)^{1 − y_i}

and the log-likelihood is

log p(y_1, …, y_n | θ) = Σ_{i=1}^{n} [ y_i log θ + (1 − y_i) log(1 − θ) ]
9
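As a quick illustration (not from the slides), here is a minimal NumPy sketch that evaluates this log-likelihood; the outcome array and the θ values below are made-up examples:

```python
import numpy as np

def bernoulli_log_likelihood(y, theta):
    """Log-likelihood of observing the 0/1 outcomes y under Bernoulli(theta)."""
    y = np.asarray(y, dtype=float)
    return np.sum(y * np.log(theta) + (1 - y) * np.log(1 - theta))

y = np.array([1, 0, 1, 1, 0, 1])              # example outcomes (1 = head, 0 = tail)
print(bernoulli_log_likelihood(y, 0.5))       # log-likelihood under a fair coin
print(bernoulli_log_likelihood(y, y.mean()))  # larger value at the MLE theta* = mean(y)
```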
Maximum Likelihood Estimation (MLE)
If we don’t know the parameter 𝜃 of the biased
coin but we observe 𝑦1 , . . , 𝑦𝑛 from 𝑛 independent
experiments, how can we estimate 𝜃?
10
Maximum Likelihood Estimation (MLE)
If we don’t know the parameter 𝜃 of the biased coin
but we observe 𝑦1 , . . , 𝑦𝑛 from 𝑛 independent
experiments, then we can estimate 𝜃 by maximizing the
log likelihood:
θ* = arg max_θ Σ_{i=1}^{n} [ y_i log θ + (1 − y_i) log(1 − θ) ]

Setting the derivative with respect to θ to zero gives

θ* = (Σ_{i=1}^{n} y_i) / n
11
Outline
• Brief review of probability
• Binary Logistic Regression
• Multi-class Logistic Regression
• Perceptron
• Multi-layer Perceptron (MLP)
• Design Issues of Neural Networks
12
Logistic Regression
Key idea: given one input sample, represented by its
features 𝒙 ∈ ℝ𝐷 , we use a linear model to predict
the probability that the label 𝑦 of this sample is
positive (y = 1).
13
Logistic Regression
Example:
Classification task: whether a student passes the course or not
Features: SAT scores
Logistic Regression: estimates the "pass" probability, i.e.,
f(score) = p(pass). If p(pass) = f(score) > 0.5, predict "pass";
otherwise predict "fail".
14
How exactly do we do it?
Input: 𝒙 ∈ ℝ^{D+1}
Predict probability: p(y = 1 | 𝒙) = 𝒘^T 𝒙
𝒘 = [w_0, w_1, …, w_D]^T, the same as 𝒘 in the linear regression model
p(y = 1 | 𝒙) = 𝒘^T 𝒙 ∈ ℝ
However, a probability should lie between 0 and 1.
15
How exactly do we do it?
Input: 𝒙 ∈ ℝ^{D+1}
Predict probability: p(y = 1 | 𝒙) = σ(𝒘^T 𝒙)
𝒘 = [w_0, w_1, …, w_D]^T, the same as 𝒘 in the linear regression model
Sigmoid function: σ(x) = 1 / (1 + e^{−x}) = e^x / (1 + e^x) ∈ (0, 1)
16
The Sigmoid Function

dσ(x)/dx = σ(x)(1 − σ(x))
17
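To make the shape concrete, here is a minimal NumPy sketch (not from the slides) of the sigmoid together with a numerical check of the derivative identity above; the test points are arbitrary:

```python
import numpy as np

def sigmoid(x):
    """Sigmoid: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-3.0, 0.0, 3.0])
# Finite-difference check that d sigma / dx equals sigma(x) * (1 - sigma(x)).
h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
analytic = sigmoid(x) * (1 - sigmoid(x))
print(numeric, analytic)  # the two should agree closely
```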
How exactly do we do it?
Input: 𝒙 ∈ ℝ^D
Predict probability: p(y = 1 | 𝒙) = σ(𝒘^T 𝒙)
Predict class: ŷ = 1 if p(y = 1 | 𝒙) > 0.5, and ŷ = 0 otherwise
(the 0.5 threshold can be changed as needed)
21
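A minimal sketch of this prediction rule, assuming the bias is folded in as a leading constant feature; the weights and input below are made-up values, not learned parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, x):
    """p(y = 1 | x) for a single sample x (with a leading constant 1 for the bias)."""
    return sigmoid(np.dot(w, x))

def predict_class(w, x, threshold=0.5):
    return 1 if predict_proba(w, x) > threshold else 0

w = np.array([-1.0, 0.8, 0.3])   # [w0 (bias), w1, w2] -- illustrative values only
x = np.array([1.0, 2.0, -0.5])   # [1, x1, x2]
print(predict_proba(w, x), predict_class(w, x))
```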
Logistic Regression
[Figure: a constant feature +1 is appended to the input, so the augmented 𝒙 ∈ ℝ^{D+1}]
22
Logistic Regression
Binary classification can be viewed as predicting the parameter of a Bernoulli distribution:
y ~ Bernoulli(θ), with θ = σ(𝒘^T 𝒙) = p(y = 1 | 𝒙)
23
Likelihood as the Model Evaluation
24
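Combining the Bernoulli likelihood above with p(y = 1 | 𝒙) = σ(𝒘^T 𝒙) gives the natural evaluation criterion: the negative log-likelihood of the training data, often called the cross-entropy loss. A minimal NumPy sketch, assuming a design matrix X with a leading column of ones and 0/1 labels y:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_loss(w, X, y):
    """Negative log-likelihood of logistic regression:
    -sum_i [ y_i * log(p_i) + (1 - y_i) * log(1 - p_i) ], where p_i = sigmoid(w^T x_i)."""
    p = sigmoid(X @ w)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
```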
Optimization
𝒘* = arg min_𝒘 ε(𝒘)
27
Recap: Gradient Descent
29
Optimization: Gradient Descent
Update at the k-th step: 𝒘^{(k+1)} = 𝒘^{(k)} − η ∇_𝒘 ε(𝒘^{(k)}), where η is the learning rate
30
Optimization: Gradient Descent
31
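As an illustration of how gradient descent could be applied to the cross-entropy loss above, here is a minimal NumPy sketch; the data, learning rate, and iteration count are arbitrary choices for the example, not the course's reference implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, n_steps=1000):
    """Batch gradient descent on the negative log-likelihood.

    X: (n, D+1) matrix whose first column is all ones (bias feature).
    y: (n,) vector of 0/1 labels.
    """
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        p = sigmoid(X @ w)          # predicted p(y = 1 | x) for every sample
        grad = X.T @ (p - y)        # gradient of the cross-entropy loss w.r.t. w
        w -= lr * grad / len(y)     # gradient descent step
    return w

# Tiny made-up dataset: one feature plus a bias column.
X = np.array([[1, 0.5], [1, 1.5], [1, 3.0], [1, 4.0]])
y = np.array([0, 0, 1, 1])
w = fit_logistic_regression(X, y)
print(w, sigmoid(X @ w))
```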
Recap: Non-Linear Regression
32
Can be Applied to Logistic Regression
Just replace 𝑿𝑛 with Φ(𝑿𝑛 )
33
Recap: Overfitting
34
Overfitting in Logistic Regression
35
Regularization in Logistic Regression
36
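As one common option (assuming an L2 penalty; other penalties are possible), regularization adds a term that discourages large weights. A minimal sketch extending the cross-entropy loss above:

```python
import numpy as np

def regularized_loss(w, X, y, lam=0.1):
    """Cross-entropy loss plus an L2 penalty lam * ||w||^2 on the weights.
    Excluding the bias w[0] from the penalty is a common convention."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    data_loss = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return data_loss + lam * np.sum(w[1:] ** 2)
```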
Outline
• Brief review of probability
• Binary Logistic Regression
• Multi-class Logistic Regression
37
Multi-class Logistic Regression
38
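One standard way to extend the binary model to C classes is to keep one weight vector 𝒘_c per class and turn the C scores 𝒘_c^T 𝒙 into probabilities with the softmax function. A minimal NumPy sketch; the weights and input below are made-up values:

```python
import numpy as np

def softmax(scores):
    """Convert a vector of class scores w_c^T x into probabilities that sum to 1."""
    scores = scores - np.max(scores)      # subtract the max for numerical stability
    e = np.exp(scores)
    return e / e.sum()

W = np.array([[ 0.2, -1.0,  0.5],        # one row of weights per class (illustrative values)
              [ 0.0,  0.3, -0.2],
              [-0.1,  0.8,  0.1]])
x = np.array([1.0, 2.0, -1.0])            # input with a leading bias feature
print(softmax(W @ x))                     # p(y = c | x) for each class c
```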
Optimization
43
What have we learnt so far
47
Outline
• Brief review of probability
• Binary Logistic Regression
• Multi-class Logistic Regression
• Perceptron
• Multi-layer Perceptron (MLP)
• Design Issues of Neural Networks
48
Another Way to View Logistic Regression
ŷ = σ(w_0 + w_1 x_1 + w_2 x_2 + ⋯ + w_D x_D)
49
Perceptron
• Feature inputs x_d ∈ ℝ, d = 1, …, D
• Each input is associated with a connection weight w_d ∈ ℝ, d = 1, …, D
• One additional bias term w_0, also denoted b
• The output is some function (called the activation function) of the linear combination of the inputs: ŷ = s(w_0 + w_1 x_1 + w_2 x_2 + ⋯ + w_D x_D) = s(𝒘^T 𝒙), where s(·) has many choices, e.g., the sigmoid function σ(·)
• A perceptron can be used for classification and regression (a minimal forward-pass sketch follows below)
50
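A minimal sketch of a single perceptron's forward pass, assuming a sigmoid activation; the weights below are illustrative values, not learned:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def perceptron(x, w, b, activation=sigmoid):
    """One perceptron: activation of the linear combination of the inputs."""
    return activation(np.dot(w, x) + b)

x = np.array([0.5, -1.0, 2.0])   # D = 3 feature inputs
w = np.array([0.4, 0.1, -0.3])   # connection weights (illustrative values)
b = 0.2                          # bias term w0
print(perceptron(x, w, b))
```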
Perceptron: A Basic Layer
55
Perceptron: Training
56
Perceptron: Approximate Linear Functions
Example: Boolean AND
57
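As one illustration, a single perceptron with a step activation and hand-picked weights (one of many valid choices) can realize Boolean AND:

```python
def step(z):
    return 1 if z > 0 else 0

def and_perceptron(x1, x2):
    # w1 = w2 = 1, bias = -1.5: the sum exceeds the threshold only when both inputs are 1.
    return step(1.0 * x1 + 1.0 * x2 - 1.5)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, and_perceptron(x1, x2))
```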
Perceptron: Approximate Linear Functions
Example: Boolean XOR (not linearly separable, so a single perceptron cannot represent it)
58
Outline
• Perceptron
• Multi-layer Perceptron (MLP)
• Design Issues of Neural Networks
59
Multi-layer Perceptron
60
Multi-layer Perceptron: Example
61
Multi-layer Perceptron
[Figure: network diagram showing the input layer, a hidden layer, and the output layer]
62
Multi-layer Perceptron
63
Multi-layer Perceptron
Flexible: add hidden layers and nodes per layer to increase the complexity of the model.
Theoretically, an MLP can approximate any function (a small example follows below).
65
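To make this concrete, here is a minimal sketch of a two-layer MLP whose hand-picked weights compute XOR, the function a single perceptron cannot represent; the weights are one illustrative choice, not learned:

```python
import numpy as np

def step(z):
    return (z > 0).astype(float)

def xor_mlp(x1, x2):
    x = np.array([x1, x2], dtype=float)
    # Hidden layer: two units computing OR(x1, x2) and AND(x1, x2).
    W1 = np.array([[1.0, 1.0],
                   [1.0, 1.0]])
    b1 = np.array([-0.5, -1.5])
    h = step(W1 @ x + b1)
    # Output layer: OR minus AND gives XOR.
    w2 = np.array([1.0, -1.0])
    b2 = -0.5
    return step(np.array([w2 @ h + b2]))[0]

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, int(xor_mlp(x1, x2)))
```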
Multi-layer Perceptron
Loss for classification (e.g., cross-entropy)
Loss for regression (e.g., squared error)
How to train the model?
66
Optimization: Backpropagation
• Forward propagation to calculate the training loss
• Backpropagation to measure how much each node is "responsible" for the training loss; we then update the corresponding weights by gradient descent
67
Optimization: Backpropagation
• Not easy, especially for neural networks with complex structures.
• Fortunately, we have excellent tools (e.g., automatic-differentiation frameworks such as PyTorch or TensorFlow)!
68
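As one illustration of such tools (assuming PyTorch here; other frameworks work similarly), a small MLP can be trained without deriving any gradients by hand, since autograd performs backpropagation automatically. A minimal sketch with made-up data:

```python
import torch
import torch.nn as nn

# Tiny made-up binary-classification dataset: 4 samples, 2 features (the XOR pattern).
X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [0.]])

# Two-layer MLP: 2 inputs -> 4 hidden units -> 1 output probability.
model = nn.Sequential(nn.Linear(2, 4), nn.ReLU(), nn.Linear(4, 1), nn.Sigmoid())
loss_fn = nn.BCELoss()                      # cross-entropy loss for binary labels
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

for _ in range(2000):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)             # forward propagation
    loss.backward()                         # backpropagation (gradients via autograd)
    optimizer.step()                        # gradient descent update

print(model(X).detach().round())            # should approximate the XOR labels
```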
Outline
• Perceptron
• Multi-layer Perceptron (MLP)
• Design Issues of Neural Networks
69
Determine number of layers and sizes
• More layers and larger layers increase the capacity of the network (i.e., the number and complexity of representable functions).
• How to avoid overfitting?
70
Determine number of layers and sizes
71
Determine Activation Function
Linear: 𝑠 𝑥 = 𝑥
• Cannot introduce non-linearity into the neural network.
73
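For reference, minimal NumPy versions of a few commonly used non-linear activations (common choices in practice; the slides may cover a different set):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # output in (0, 1)

def tanh(x):
    return np.tanh(x)                  # output in (-1, 1)

def relu(x):
    return np.maximum(0.0, x)          # 0 for negative inputs, identity otherwise

x = np.linspace(-3, 3, 7)
print(sigmoid(x))
print(tanh(x))
print(relu(x))
```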
What have we learnt so far
• Logistic regression can be viewed as a perceptron, the basic processing unit of neural networks, which can represent linear functions.
• Multi-layer perceptrons can approximate non-linear functions.
• We need to determine the number of layers, the size of each layer, and the activation function for a neural network.
78