Data Mining

Principal Component Analysis,
Logistic Regression and Neural Networks

CS 584 :: Fall 2024

Ziwei Zhu
Department of Computer Science
George Mason University

Some slides are from Dr. Theodora Chaspari.
• Quiz 2 today.
• Solutions to Quizzes 1 and 2 will be explained next week.
• The midterm exam is on 10/08.
• HW2 is out, due in 3 weeks (10/14).
Linear Classification

RentPrice = w_0 + w_1 × Size + w_2 × DistanceFromGMU + ...

Task: predict whether the apartment will be rented or not.

[Figure: apartments plotted with axes Size and Distance to GMU]
Outline

• Brief review of probability


• Binary Logistic Regression
• Multi-class Logistic Regression
• Perceptron
• Multi-layer Perceptron (MLP)
• Design Issues of Neural Networks

Bernoulli Distribution

Toss a biased coin. A single experiment outputs head/tail.

Probability distribution function (PDF), with parameter θ = p(Y = 1):

p(Y = y | θ) = θ^{y} (1 − θ)^{1−y},  y ∈ {0, 1}

[Figure: bar plot of p(Y = y) against y]
Bernoulli Distribution: Likelihood

If we toss this biased coin n times and the result of the i-th toss is y_i (i = 1, 2, …, n), what is the probability of observing these results?

likelihood:
p(y_1, …, y_n | θ) = Π_{i=1}^{n} p(y_i | θ) = Π_{i=1}^{n} θ^{y_i} (1 − θ)^{1−y_i}

log-likelihood:
log p(y_1, …, y_n | θ) = Σ_{i=1}^{n} [ y_i log θ + (1 − y_i) log(1 − θ) ]
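To make this concrete, here is a small NumPy sketch (my own, not from the slides) that evaluates the log-likelihood for a sequence of tosses:

import numpy as np

def bernoulli_log_likelihood(y, theta):
    # Σ_i [ y_i log θ + (1 − y_i) log(1 − θ) ]
    y = np.asarray(y)
    return np.sum(y * np.log(theta) + (1 - y) * np.log(1 - theta))

tosses = np.array([1, 0, 1, 1, 0, 1])          # 1 = head, 0 = tail
print(bernoulli_log_likelihood(tosses, 0.5))   # ≈ −4.16 under a fair-coin hypothesis
print(bernoulli_log_likelihood(tosses, 4/6))   # ≈ −3.82: the sample mean fits better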
Maximum Likelihood Estimation (MLE)

If we don't know the parameter θ of the biased coin but we observe y_1, …, y_n from n independent experiments, we can estimate θ by maximizing the log-likelihood:

θ* = arg max_θ Σ_{i=1}^{n} [ y_i log θ + (1 − y_i) log(1 − θ) ]

Setting the derivative with respect to θ to zero yields the closed-form solution

θ* = (Σ_{i=1}^{n} y_i) / n
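A small sketch (again my own) checking the closed form against a brute-force grid search over the log-likelihood:

import numpy as np

tosses = np.array([1, 0, 1, 1, 0, 1, 1, 0])

theta_mle = tosses.mean()                      # closed form: the sample mean

grid = np.linspace(0.01, 0.99, 999)            # brute-force search over θ
log_lik = [np.sum(tosses * np.log(t) + (1 - tosses) * np.log(1 - t)) for t in grid]
theta_grid = grid[int(np.argmax(log_lik))]

print(theta_mle, theta_grid)                   # both ≈ 0.625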
Outline

• Brief review of probability


• Binary Logistic Regression
• Multi-class Logistic Regression
• Perceptron
• Multi-layer Perceptron (MLP)
• Design Issues of Neural Networks

Logistic Regression

Key idea: given one input sample, represented by its features x ∈ ℝ^{D}, we use a linear model to predict the probability that the label y of this sample is positive (y = 1).
Logistic Regression

Example:
Classification task: whether a student passes the course or not
Features: SAT scores
Logistic Regression: estimates the "pass" probability, i.e., f(score) = p(pass). If p(pass) = f(score) > 0.5, predict "pass"; otherwise "fail".
How exactly do we do this?

Input: x ∈ ℝ^{D+1}

A first attempt is to predict the probability with the linear model itself:

p(y = 1 | x) = wᵀx, where w = [w_0, w_1, …, w_D]ᵀ is the same as w in the linear regression model

However, wᵀx ∈ ℝ, while a probability must lie between 0 and 1. So we squash the linear score with the sigmoid function:

p(y = 1 | x) = σ(wᵀx), where σ(x) = 1 / (1 + e^{−x}) = e^{x} / (1 + e^{x}) ∈ (0, 1)
The Sigmoid Function

dσ(x)/dx = σ(x)(1 − σ(x))
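A quick numerical check of this derivative identity (my own sketch, not from the slides):

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))        # σ(x) ∈ (0, 1)

# Compare dσ/dx = σ(x)(1 − σ(x)) against a central finite difference.
x, h = 0.3, 1e-6
analytic = sigmoid(x) * (1 - sigmoid(x))
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
print(analytic, numeric)               # both ≈ 0.2445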
How exactly do we do this?

Input: x ∈ ℝ^{D+1} (the feature vector augmented with a constant +1 entry for the bias term w_0)

Predicted probability: p(y = 1 | x) = σ(wᵀx)

Predicted class: ŷ = 1 if p(y = 1 | x) > 0.5, otherwise ŷ = 0
(the 0.5 threshold can be changed as we need)
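Putting the prediction rule into code (a minimal sketch of mine; the weight and feature values are made up):

import numpy as np

def predict_proba(w, x):
    return 1 / (1 + np.exp(-w @ x))    # p(y = 1 | x) = σ(wᵀx)

def predict_class(w, x, threshold=0.5):
    return int(predict_proba(w, x) > threshold)

w = np.array([-1.0, 2.0, 0.5])         # [w_0, w_1, w_2], illustrative values
x = np.array([1.0, 0.8, -0.3])         # leading 1 is the bias entry
print(predict_proba(w, x), predict_class(w, x))   # ≈ 0.61 → class 1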
Logistic Regression

Binary classification can be considered as predicting the parameter of a Bernoulli distribution:

y ~ Bernoulli(σ(wᵀx))
Likelihood as the Model Evaluation

Given training samples (x_i, y_i), i = 1, …, n, we can evaluate w by the log-likelihood of the observed labels, exactly as in the Bernoulli coin example:

log p(y_1, …, y_n | w) = Σ_{i=1}^{n} [ y_i log σ(wᵀx_i) + (1 − y_i) log(1 − σ(wᵀx_i)) ]

Maximizing this log-likelihood is equivalent to minimizing the error ε(w) = −log-likelihood (the cross-entropy loss).
Optimization

w* = arg min_w ε(w)
Optimization: Gradient Descent (recap)

At the k-th step, update:

w^{(k+1)} = w^{(k)} − η ∇_w ε(w^{(k)}),  where η is the learning rate
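Applied to logistic regression, a compact training-loop sketch of mine (the data is made up):

import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 2))])  # +1 bias column
y = (X[:, 1] + 2 * X[:, 2] > 0).astype(float)                  # toy labels

w = np.zeros(3)
eta = 0.1                                  # learning rate η
for _ in range(1000):
    p = 1 / (1 + np.exp(-X @ w))           # σ(Xw): predicted probabilities
    grad = X.T @ (p - y) / len(y)          # averaged ∇ε(w) for the cross-entropy loss
    w -= eta * grad                        # one gradient descent step

print(w)                                   # direction roughly proportional to [0, 1, 2]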
Recap: Non-Linear Regression

The same trick can be applied to logistic regression: just replace X_n with Φ(X_n), a non-linear transformation of the features.
Recap: Overfitting

Overfitting in Logistic Regression

Regularization in Logistic Regression

As with linear regression, a penalty term such as λ‖w‖² can be added to ε(w) to discourage overly complex models.
Outline

• Brief review of probability


• Binary Logistic Regression
• Multi-class Logistic Regression

Multi-class Logistic Regression

For C classes, keep one weight vector w_c per class (with the input x again augmented by a constant +1 entry for the bias) and turn the C scores into probabilities with the softmax function:

p(y = c | x) = exp(w_cᵀx) / Σ_{c'=1}^{C} exp(w_{c'}ᵀx)
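A minimal sketch of mine of softmax prediction (the class count and all values are made up):

import numpy as np

def softmax(z):
    z = z - z.max()                    # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

W = np.array([[0.1, 1.0, -0.5],        # one row of weights per class (C = 3)
              [0.0, -0.2, 0.8],
              [0.3, 0.4, 0.1]])
x = np.array([1.0, 0.5, -1.0])         # leading 1 is the bias entry
p = softmax(W @ x)
print(p, p.sum(), p.argmax())          # probabilities sum to 1; argmax is the predicted class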
Optimization

As in the binary case, the weights are learned by minimizing the cross-entropy loss, w* = arg min_w ε(w), with gradient descent.
What have we learnt so far

Outline

• Brief review of probability


• Binary Logistic Regression
• Multi-class Logistic Regression
• Perceptron
• Multi-layer Perceptron (MLP)
• Design Issues of Neural Networks

Another Way to View Logistic Regression

ŷ = σ(w_0 + w_1 x_1 + w_2 x_2 + ⋯ + w_D x_D)
Perceptron

• Feature inputs x_d ∈ ℝ, d = 1, …, D
• Each input is associated with a connection weight w_d ∈ ℝ, d = 1, …, D
• One additional bias term w_0, also denoted b
• The output is some function (called the activation function) of the linear combination of the inputs: ŷ = s(w_0 + w_1 x_1 + w_2 x_2 + ⋯ + w_D x_D) = s(wᵀx), where s(·) has many choices, e.g., the sigmoid function σ(·)
• A perceptron can be used for both classification and regression
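A minimal sketch of mine of a single perceptron's forward computation, with the activation function s(·) passed in as a parameter:

import numpy as np

def perceptron(x, w, b, s):
    # Output s(wᵀx + b) for inputs x, weights w, bias b, activation s.
    return s(np.dot(w, x) + b)

sigmoid = lambda z: 1 / (1 + np.exp(-z))
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.3, 0.8, -0.1])
print(perceptron(x, w, b=0.2, s=sigmoid))      # classification-style output in (0, 1)
print(perceptron(x, w, b=0.2, s=lambda z: z))  # linear activation for regression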
Perceptron: A Basic Layer

Perceptron: Training
Perceptron: Approximate Linear Functions

Example: Boolean AND. AND is linearly separable, so a single perceptron can represent it.

Example: Boolean XOR. XOR is not linearly separable, so a single perceptron cannot represent it.
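A small demonstration of mine (the weights are one valid choice, not taken from the slides):

import numpy as np

step = lambda z: int(z > 0)            # threshold activation

def and_perceptron(x1, x2):
    # w = [1, 1], b = −1.5 puts the boundary between (0,1)/(1,0) and (1,1).
    return step(1.0 * x1 + 1.0 * x2 - 1.5)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, and_perceptron(x1, x2))   # only (1, 1) outputs 1

# No single (w, b) reproduces XOR this way: no line separates
# {(0,1), (1,0)} from {(0,0), (1,1)}.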
Outline

• Perceptron
• Multi-layer Perceptron (MLP)
• Design Issues of Neural Networks

Multi-layer Perceptron

[Figure: example network with an input layer, a hidden layer, and an output layer]
Multi-layer Perceptron

It is flexible to add hidden layers, and nodes within layers, to increase the complexity of the model.

Theoretically, an MLP can approximate any function.
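A minimal sketch of mine of the forward pass through an MLP with one hidden layer (sizes and weights are arbitrary):

import numpy as np

sigmoid = lambda z: 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # input (3 features) → hidden (4 nodes)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)   # hidden (4 nodes) → output (1 node)

def mlp_forward(x):
    h = sigmoid(W1 @ x + b1)       # hidden layer: a layer of perceptrons
    return sigmoid(W2 @ h + b2)    # output layer

print(mlp_forward(np.array([0.5, -1.0, 2.0])))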
Multi-layer Perceptron

Loss for classification: cross-entropy (negative log-likelihood), as in logistic regression.

Loss for regression: squared error, as in linear regression.

How to train the model?
Optimization: Backpropagation

• Forward propagation to calculate the training loss.
• Backpropagation to measure how much each node is "responsible" for the training loss; the corresponding weights are then updated by gradient descent.
Optimization: Backpropagation

• Not easy, especially for neural networks with complex structures.
• Fortunately, we have awesome tools!
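For instance, automatic differentiation computes all the gradients for us. A minimal sketch of mine using PyTorch (assuming it is installed; the data is made up):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(3, 4), nn.Sigmoid(), nn.Linear(4, 1), nn.Sigmoid())
loss_fn = nn.BCELoss()                       # binary cross-entropy loss
opt = torch.optim.SGD(model.parameters(), lr=0.1)

X = torch.randn(100, 3)                      # toy inputs
y = (X[:, 0] > 0).float().unsqueeze(1)       # toy binary labels

for _ in range(200):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()    # backpropagation: gradients of the loss w.r.t. every weight
    opt.step()         # gradient descent update
print(loss.item())     # training loss after 200 steps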
Outline

• Perceptron
• Multi-layer Perceptron (MLP)
• Design Issues of Neural Networks

Determine number of layers and sizes

• The capacity of the network (i.e., the number and complexity of representable functions) grows with the number of layers and nodes.
• How to avoid overfitting?
Determine Activation Function

Linear: s(x) = x

• Cannot introduce non-linearity to the neural network: stacking linear layers still yields a linear function overall.
Determine Activation Function

Common non-linear choices include the sigmoid σ(x) = 1 / (1 + e^{−x}), the hyperbolic tangent tanh(x), and ReLU(x) = max(0, x).
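A short sketch of mine comparing these activations on a few inputs:

import numpy as np

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(1 / (1 + np.exp(-x)))   # sigmoid: squashes into (0, 1)
print(np.tanh(x))             # tanh: squashes into (−1, 1), zero-centered
print(np.maximum(0, x))       # ReLU: zero for negatives, identity for positives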
What have we learnt so far

• Logistic regression can be viewed as a perceptron, the basic processing unit of neural networks, which can represent linear functions.
• Multi-layer perceptrons can approximate non-linear functions.
• We need to determine the number of layers, the size of each layer, and the activation function for a neural network.
