CZ4041/SC4000:

Machine Learning
Lesson 7: Artificial Neural Networks
LI Boyang, Albert
School of Computer Science and Engineering,
NTU, Singapore
Machine Learning
Most SCSE courses discuss computer systems,
which are artificial in nature.
Programming Languages, Data Structures, Operating
Systems, Databases, Compilers, etc.
Most natural sciences study the objective reality
and various types of phenomena within.
Physics, Chemistry, Biology, etc.
Machine Learning is where computer systems
meet the objective world; a subject where we use
the former to model the latter.
ML Requires Math
Typical ML Approach
Specify (explicitly or implicitly) a family of functions that contain some free parameters (e.g., $w_1$, $w_2$, and $b$ below).

$y = w_1 x_1 + w_2 x_2 + b$

Determine the values of these free parameters from data.
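A minimal sketch of this idea in Python (the dataset and the least-squares fitting choice here are illustrative assumptions, not from the slides): specify the family $y = w_1 x_1 + w_2 x_2 + b$, then determine the free parameters from data.

```python
import numpy as np

# A hypothetical dataset: inputs (x1, x2) and observed outputs y.
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0], [4.0, 3.0]])
y = np.array([5.0, 4.5, 7.0, 11.0])

# Family of functions: y = w1*x1 + w2*x2 + b, with free parameters w1, w2, b.
# One simple way to determine the parameters from data: ordinary least squares.
A = np.hstack([X, np.ones((len(X), 1))])        # append a column of 1s for b
w1, w2, b = np.linalg.lstsq(A, y, rcond=None)[0]
print(w1, w2, b)
```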
Purposes of Machine Learning
Make Accurate Predictions
Which team will win a soccer match?
Which stock will see its price skyrocket?
Which patient is at higher risk?
It is usually difficult to write down such rules manually.
Rather, we learn to make the predictions from paired (input, output) data $(x, y)$.
Purposes of Machine Learning
Make Accurate Predictions
Artificial Neural Networks (Week 7)
Support Vector Machines (Week 8)
Regression (Week 8)
Ensemble Learning (Week 9)

Purposes of Machine Learning
Build Probabilistic Models of the World
Germany will probably beat Japan, but what are the
odds? 60/40, 70/30, or 80/20?
I’m willing to bet more money if the odds are in my
favor.
Often translates to: what is the shape of the Gaussian probability distribution?
Purposes of Machine Learning
Build Probabilistic Models of the World
Density Estimation (Week 11)

Purposes of Machine Learning
Pattern Discovery
Imagine you are an alien from another planet. You
watch a soccer match. What do you observe?
Two groups of humans. One ball.
Behavior change when the ball goes into the net.

Purposes of Machine Learning
Pattern Discovery
We often have little prior experience, knowledge, or
insight into the causal mechanisms that generated the
data.
Still, with only statistical tools, we can identify many important data characteristics.
Obviously, domain knowledge can enrich and complement statistical tools.
Purposes of Machine Learning
Pattern Discovery
Clustering (Week 10)
Dimensionality Reduction (Week 12)

Artificial Neural Networks:
Perceptron

Artificial Neural Networks (ANN)
The study of ANN was inspired by biological
neural systems
(Figure: example images of a cat and a dog.)
“Biology”
The human brain is a densely interconnected network of neurons, connected to one another via dendrites and axons.
Dendrites and axons transmit electrical signals from one neuron to another.
The human brain learns by changing the strength of the synaptic connections between neurons.
An ANN is composed of an interconnected assembly of nodes and directed links.
“Biology”
A neuron sends out a spike from the axon after receiving enough input from the dendrites.
Artificial Neural Networks (cont.)

(Figure: inputs X1, X2, X3 feed into a black box that produces output y.)

X1 X2 X3 | y
 1  0  0 | -1
 1  0  1 |  1
 1  1  0 |  1
 1  1  1 |  1
 0  0  1 | -1
 0  1  0 | -1
 0  1  1 |  1
 0  0  0 | -1

Output y is 1 if at least two of the three inputs are equal to 1.
ANN: Perceptron
(Figure: a data table with inputs X1, X2, ..., Xd, and a perceptron diagram: input nodes X1, ..., Xd connect through weights w1, ..., wd to a single output node y, which applies an activation function to the weighted sum of its inputs.)

Each input node is connected via a weighted link to the output node.
Weights can be positive, negative, or zero (no connection).
ANN: Perceptron (cont.)
(Figure: the same perceptron diagram, now with a bias factor $b$ added to the weighted sum.)

$z = w_1 x_1 + w_2 x_2 + \dots + w_d x_d + b$

$\hat{y} = f(z)$, where the activation function is $f(z) = \mathrm{sign}(z) = \begin{cases} 1, & z \ge 0 \\ -1, & \text{otherwise} \end{cases}$
(Figure: the same perceptron diagram.)

The model is an assembly of interconnected nodes and weighted links.
The output node first sums up each of its input values according to the weights of its links.
Compare the weighted sum against some threshold.
Produce an output based on the sign of the result.
Perceptron Model

(Figure: the same perceptron diagram.)

$\hat{y} = f(z) = \mathrm{sign}\left(\sum_{i=1}^{d} w_i x_i + b\right)$
ANN: Perceptron (cont.)
Mathematically, the output of a perceptron model can be expressed in a more compact form:

$\hat{y} = \mathrm{sign}\left(\sum_{i=0}^{d} w_i x_i\right) = \mathrm{sign}(\mathbf{w}^{T}\mathbf{x})$  (an inner product, where $d$ is the input dimension)

where $\mathbf{w} = (w_0, w_1, w_2, \dots, w_d)$, $\mathbf{x} = (x_0, x_1, x_2, \dots, x_d)$, $w_0 = b$, and $x_0 = 1$.
Inner Product: Review
Given two vectors $\mathbf{a}$ and $\mathbf{b}$, which are both of $d$ dimensions, the inner product between $\mathbf{a}$ and $\mathbf{b}$ is defined as

$\mathbf{a}^{T}\mathbf{b} = \sum_{i=1}^{d} a_i b_i$, where $\mathbf{a} = (a_1, a_2, \dots, a_d)$ and $\mathbf{b} = (b_1, b_2, \dots, b_d)$.
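A quick numeric check of this definition (the vectors are chosen arbitrarily for illustration):

```python
import numpy as np

a = np.array([1.0, -2.0, 3.0])
b = np.array([4.0, 0.5, -1.0])

# Inner product: sum of element-wise products.
manual = sum(a_i * b_i for a_i, b_i in zip(a, b))
print(manual, np.dot(a, b))   # both give 1*4 + (-2)*0.5 + 3*(-1) = 0.0
```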
ANN: Perceptron (cont.)
$\hat{y} = \mathrm{sign}(\mathbf{w}^{T}\mathbf{x})$, where $\mathbf{w} = (w_0, w_1, w_2, \dots, w_d)$, $\mathbf{x} = (x_0, x_1, x_2, \dots, x_d)$, $w_0 = b$, and $x_0 = 1$.

$\mathbf{w}^{T}\mathbf{x} = \sum_{i=0}^{d} w_i x_i = \sum_{i=1}^{d} w_i x_i + b$

$\hat{y} = \mathrm{sign}(\mathbf{w}^{T}\mathbf{x}) = \mathrm{sign}\left(\sum_{i=1}^{d} w_i x_i + b\right)$
ANN: Sign Function

$\mathrm{sign}(z) = \begin{cases} 1, & z \ge 0 \\ -1, & z < 0 \end{cases}$

Note: the sign(z) function has derivative = 0 everywhere, except at z = 0.
Perceptron: Making Prediction
Given a learned perceptron with $w_1 = 0.5$, $w_2 = -1$, and $b = 0$.

Test data: $x_1 = 1$, $x_2 = 1$. Predicted label:

$\hat{y} = \mathrm{sign}(1 \times 0.5 + 1 \times (-1)) = \mathrm{sign}(-0.5) = -1$
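A small sketch of this prediction step in Python, using the weight values as reconstructed above ($w_1 = 0.5$, $w_2 = -1$, $b = 0$); the helper name is ours, not the slides':

```python
def perceptron_predict(x, w, b):
    """Return sign(w.x + b), with sign(0) treated as +1 as in the slides."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return 1 if z >= 0 else -1

# Weights from the worked example (as reconstructed here): w1 = 0.5, w2 = -1, b = 0.
w, b = [0.5, -1.0], 0.0
print(perceptron_predict([1, 1], w, b))   # sign(0.5 - 1) = sign(-0.5) = -1
```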
Perceptron: Learning
During training, the weight parameters $\mathbf{w}$ are adjusted until the outputs of the perceptron become consistent with the true outputs of the training data.
The weight parameters are updated iteratively, in an online learning manner.

(Figure: the training examples x1, ..., x8 over inputs X1, X2, X3 with labels y, as in the earlier table. Example x1 is fed into the perceptron with weights $\mathbf{w}^{(0)}$; $\mathbf{w}^{(0)}$ is updated if necessary.)
Perceptron: Learning (cont.)

(Figure: example x2 is fed into the perceptron with weights $\mathbf{w}^{(1)}$; $\mathbf{w}^{(1)}$ is updated if necessary.)

Perceptron: Learning (cont.)

(Figure: example x8 is fed into the perceptron with weights $\mathbf{w}^{(7)}$; $\mathbf{w}^{(7)}$ is updated if necessary.)

End of the 1st Epoch; the 2nd Epoch starts.

Perceptron: Learning (cont.)

(Figure: in the 2nd epoch, example x1 is fed again into the perceptron, now with weights $\mathbf{w}^{(8)}$; $\mathbf{w}^{(8)}$ is updated if necessary.)

Perceptron: Learning (cont.)
Algorithm:
1. Let $D = \{(\mathbf{x}_i, y_i) \mid i = 1, 2, \dots, N\}$ be the set of training examples; $k = 0$
2. Initialize $\mathbf{w}^{(0)}$ with random values
3. Repeat
4.    for each training example $(\mathbf{x}_i, y_i)$ do
5.       Compute the predicted output $\hat{y}_i^{(k)}$ and compare it with the ground truth $y_i$
6.       Update the weights by $\mathbf{w}^{(k+1)} = \mathbf{w}^{(k)} + \lambda\,(y_i - \hat{y}_i^{(k)})\,\mathbf{x}_i$
7.       $k = k + 1$
8.    end for
9. Until stopping condition is met
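A minimal Python sketch of this learning loop. The stopping condition (a fixed number of epochs, with early exit when an epoch makes no updates) and the learning-rate value are our own simple choices, not prescribed by the slides:

```python
import numpy as np

def train_perceptron(X, y, lam=0.1, max_epochs=100, seed=0):
    """Online perceptron learning: w <- w + lam * (y_i - yhat_i) * x_i."""
    rng = np.random.default_rng(seed)
    X = np.hstack([np.ones((len(X), 1)), X])           # x0 = 1 absorbs the bias w0 = b
    w = rng.normal(scale=0.01, size=X.shape[1])         # initialize w with random values
    for _ in range(max_epochs):
        updated = False
        for x_i, y_i in zip(X, y):
            y_hat = 1 if np.dot(w, x_i) >= 0 else -1    # predicted output
            if y_hat != y_i:
                w = w + lam * (y_i - y_hat) * x_i        # weight update rule
                updated = True
        if not updated:                                  # stopping condition
            break
    return w

# The 3-input example from earlier: y = 1 if at least two inputs are equal to 1.
X = np.array([[1,0,0],[1,0,1],[1,1,0],[1,1,1],[0,0,1],[0,1,0],[0,1,1],[0,0,0]])
y = np.array([-1, 1, 1, 1, -1, -1, 1, -1])
print(train_perceptron(X, y))
```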
Perceptron: Learning (cont.)
Why use the following weight update rule?

$\mathbf{w}^{(k+1)} = \mathbf{w}^{(k)} + \lambda\,(y_i - \hat{y}_i^{(k)})\,\mathbf{x}_i$

It is induced based on a gradient descent method:

$\mathbf{w}^{(k+1)} = \mathbf{w}^{(k)} - \lambda\,\nabla_{\mathbf{w}} E(\mathbf{w})$

where $E(\mathbf{w})$ is the function to be minimized (e.g., the error/loss function) and $\lambda \in (0, 1]$ is the learning rate. The goal is to find $\mathbf{w}$ that minimizes $E(\mathbf{w})$.
Gradient Descent

$\mathbf{w}^{*} = \arg\min_{\mathbf{w}} E(\mathbf{w})$

Take small steps in the direction of the negative gradient:

$\mathbf{w}^{(k+1)} = \mathbf{w}^{(k)} - \lambda\,\nabla_{\mathbf{w}} E\big(\mathbf{w}^{(k)}\big)$

(Figure: gradient descent on $y = x^2$, whose derivative is $dy/dx = 2x$. Where the gradient is positive, applying the step $-\lambda\,dy/dx$ moves $x$ down toward the minimum at the origin; where the gradient is negative, the step is positive and again moves $x$ toward the origin.)
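A tiny numeric illustration of these steps on $E(w) = w^2$ (the learning rate, starting point, and step count are arbitrary choices for illustration):

```python
def gradient_descent(grad, w0, lam=0.1, steps=50):
    """Repeatedly take small steps against the gradient: w <- w - lam * grad(w)."""
    w = w0
    for _ in range(steps):
        w = w - lam * grad(w)
    return w

# E(w) = w^2 has gradient dE/dw = 2w and its minimum at w = 0.
print(gradient_descent(grad=lambda w: 2 * w, w0=5.0))   # approaches 0
```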
Perceptron: Learning (cont.)
Weight update rule:

$\mathbf{w}^{(k+1)} = \mathbf{w}^{(k)} + \lambda\,(y - \hat{y})\,\mathbf{x} = \mathbf{w}^{(k)} - \lambda\,\frac{\partial E}{\partial \mathbf{w}}$

Consider the loss function for each training example, where $\hat{y} = \mathrm{sign}(\mathbf{w}^{T}\mathbf{x}) = \mathrm{sign}(z)$ is the predicted label (ideally the loss goes to 0):

$E = \frac{1}{2}(y - \hat{y})^2 = \frac{1}{2}\big(y - \mathrm{sign}(\mathbf{w}^{T}\mathbf{x})\big)^2 = \frac{1}{2}\big(y - \mathrm{sign}(z)\big)^2$

Update the weight using a gradient descent method, applying the chain rule:

$\frac{\partial E}{\partial \mathbf{w}} = \frac{\partial E}{\partial z}\,\frac{\partial z}{\partial \mathbf{w}}, \qquad \frac{\partial E}{\partial z} = \frac{1}{2}\,\frac{\partial \big(y - \mathrm{sign}(z)\big)^2}{\partial z}$
Chain Rule of Calculus (Review)
Suppose that $y = g(x)$ and $z = f(g(x)) = f(y)$.
Chain rule of calculus:
$\frac{dz}{dx} = \frac{dz}{dy}\,\frac{dy}{dx}$
Generalized to the vector case: suppose $\mathbf{x} \in \mathbb{R}^m$, $\mathbf{y} \in \mathbb{R}^n$, $\mathbf{y} = g(\mathbf{x})$ and $z = f(\mathbf{y})$; then
$\frac{\partial z}{\partial x_i} = \sum_{j} \frac{\partial z}{\partial y_j}\,\frac{\partial y_j}{\partial x_i}$
Perceptron: Learning (cont.)
The weight update formula for perceptron:

$\frac{\partial E}{\partial \mathbf{w}} = \frac{\partial E}{\partial z}\,\frac{\partial z}{\partial \mathbf{w}} = \frac{1}{2}\,\frac{\partial\big(y - \mathrm{sign}(z)\big)^2}{\partial z}\,\frac{\partial(\mathbf{w}^{T}\mathbf{x})}{\partial \mathbf{w}} = \big(y - \mathrm{sign}(z)\big) \times (-1) \times \mathbf{x} = -(y - \hat{y})\,\mathbf{x}$

Note: the sign() function has derivative = 0 except at z = 0. Therefore, to compute a meaningful derivative, we remove sign() from consideration.

$\mathbf{w}^{(k+1)} = \mathbf{w}^{(k)} - \lambda\,\frac{\partial E}{\partial \mathbf{w}} = \mathbf{w}^{(k)} + \lambda\,(y - \hat{y})\,\mathbf{x}$
Approximating the derivative?
The equation used to compute $\hat{y}$ from $z$:  $\hat{y} = \mathrm{sign}(z)$
The equation used to compute the derivative:  $\frac{\partial \hat{y}}{\partial z} = 1$
Why? This is approximating the step function with a linear function.
While the derivative itself is incorrect, its direction is correct.
That is, if you want to increase $\hat{y}$, you should increase $z$, and vice versa.
Perceptron Weights Update
$\mathbf{w}^{(k+1)} = \mathbf{w}^{(k)} + \lambda\,(y - \hat{y})\,\mathbf{x}$

If the prediction is correct, $(y - \hat{y}) = 0$, then the weights remain unchanged (ideal).
If $y = +1$ and $\hat{y} = -1$, then $(y - \hat{y}) = 2$:
The weights of all links with positive inputs need to be updated by increasing their values.
The weights of all links with negative inputs need to be updated by decreasing their values.

(Figure: input nodes grouped by $x_i > 0$, $x_i = 0$, and $x_i < 0$.)
Perceptron Weights Update (cont.)
$\mathbf{w}^{(k+1)} = \mathbf{w}^{(k)} + \lambda\,(y - \hat{y})\,\mathbf{x}$

If $y = -1$ and $\hat{y} = +1$, then $(y - \hat{y}) = -2$:
The weights of all links with positive inputs need to be updated by decreasing their values.
The weights of all links with negative inputs need to be updated by increasing their values.

(Figure: input nodes grouped by $x_i > 0$, $x_i = 0$, and $x_i < 0$.)
Convergence
The decision boundary of a perceptron is a linear hyperplane.

(Figure: a two-input perceptron with weights $w_1$, $w_2$ and its linear decision boundary.)

The perceptron learning algorithm is guaranteed to converge to a solution for linearly separable classification problems.
Perceptron Limitation
If the problem is not linearly separable, the algorithm fails to converge.
An example of non-linearly separable data is given by the XOR function.
Multi-layer Perceptron

Example: Not Linearly Separable

(Figure: a 2D dataset over X1 and X2 that is not linearly separable, together with three lines L1, L2, L3 that jointly enclose the positive region, and a network with one hidden node per line.)
Example: Not Linearly Separable

(Figure: the equations of the three separating lines L1, L2, L3 drawn over the data; each line has the form $w_1 x_1 + w_2 x_2 + b = 0$, and the corresponding hidden node outputs 1 on one side of the line and 0 on the other.)
Example: Not Linearly Separable
(Figure: the three lines L1, L2, L3 divide the plane into regions, and each region is labeled with a 3-bit code, one bit per line: 010, 011, 110, 111, 001, 101, 100.)
Example: Not Linearly Separable
(Figure: an output node combines the outputs of the three hidden nodes for L1, L2, L3 with a weighted sum and a threshold, so that it fires only for the region code of the enclosed positive region.)

Example: Not Linearly Separable …

(Figure: the resulting two-layer network outputs $\hat{y} = 1$ inside the region enclosed by the three lines and $\hat{y} = 0$ outside.)
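The same idea, combining several linear units to handle data a single perceptron cannot separate, can also be illustrated on the XOR function mentioned earlier. The weights below are hand-picked for illustration (one of many possible choices), not the ones from the slide's figure, and 0/1 labels are used for readability:

```python
def step(z):
    """Threshold activation: 1 if z >= 0 else 0."""
    return 1 if z >= 0 else 0

def two_layer_xor(x1, x2):
    h1 = step(x1 + x2 - 0.5)      # hidden unit 1: fires if x1 OR x2
    h2 = step(x1 + x2 - 1.5)      # hidden unit 2: fires if x1 AND x2
    return step(h1 - h2 - 0.5)    # output: OR but not AND, i.e. XOR

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, two_layer_xor(x1, x2))   # prints 0, 1, 1, 0
```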
General Structure: Multilayer ANN

(Figure: an input layer, one or more hidden layers, and an output layer. Each neuron applies an integration function followed by an activation function to the outputs of the previous layer.)
Integration Functions
Weighted sum: $z = \sum_i w_i x_i + b$
Quadratic function
Spherical function
Activation Functions
Sign function (threshold function):

$f(z) = \mathrm{sign}(z) = \begin{cases} 1, & z \ge 0 \\ -1, & z < 0 \end{cases}$

Unipolar sigmoid function:

$f(z) = \dfrac{1}{1 + e^{-\lambda z}}$

(Figure: plots of the two activation functions over $z \in [-4, 4]$.)

When $\lambda = 1$, it is called the sigmoid function.
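A short sketch of these two activation functions (assuming NumPy; function names are ours):

```python
import numpy as np

def sign_fn(z):
    """Threshold (sign) activation: +1 for z >= 0, -1 otherwise."""
    return np.where(z >= 0, 1.0, -1.0)

def unipolar_sigmoid(z, lam=1.0):
    """Unipolar sigmoid f(z) = 1 / (1 + exp(-lam * z)); lam = 1 gives the usual sigmoid."""
    return 1.0 / (1.0 + np.exp(-lam * z))

z = np.linspace(-4, 4, 9)
print(sign_fn(z))
print(unipolar_sigmoid(z))
```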
Update Weights for Multi-layer NNs
Initialize the weights in each layer: $W^{(1)}, \dots, W^{(l)}, \dots, W^{(L)}$
Adjust the weights such that the output of the ANN is consistent with the class labels of the training examples.
Loss function for each training instance:
$E = \frac{1}{2}(y - \hat{y})^2$
For each layer $l$, update the weights $W^{(l)}$ by gradient descent at each iteration, using the differentiated loss function:
$W^{(l)} \leftarrow W^{(l)} - \lambda\,\frac{\partial E}{\partial W^{(l)}}$
Computing the gradient w.r.t. the weights in each layer is computationally expensive!
The Backpropagation Algorithm
A Multi-layer Feed-forward NN
(Figure: a feed-forward network with an input layer, hidden layer 1 with weights $W^{(1)}$, hidden layer 2 with weights $W^{(2)}$, and an output layer with weights $W^{(3)}$; each unit behaves like a perceptron.)
Backpropagation: Basic Idea
Initialize the weights $(W^{(1)}, \dots, W^{(L)})$.
Forward pass: each training example $(\mathbf{x}_i, y_i)$ is used to compute the outputs of each hidden layer and generate the final output of the ANN.

(Figure: the forward pass flows from the input layer through hidden layers 1 and 2 to the output layer.)
Backpropagation: Basic Idea (cont.)
Backpropagation: starting with the output layer, propagate the error back to the previous layer in order to update the weights between the two layers, until the earliest hidden layer is reached.

(Figure: the error is calculated at the output layer and propagated backwards through hidden layer 2 and then hidden layer 1.)
The Computational Graph
(Figure: the network drawn as a computational graph. In each layer, a summation is followed by an activation; the output of hidden layer 1 feeds the summation of hidden layer 2, and so on up to the output.)
Backpropagation (BP)
Consider the case where each layer contains a single unit, with summations $z^{(1)}, z^{(2)}, z^{(3)}$, activations $a^{(1)}, a^{(2)}$, output $\hat{y}$, and weights $w^{(1)}, w^{(2)}, w^{(3)}$.

Gradient of $E$ w.r.t. $w^{(3)}$:
$\frac{\partial E}{\partial w^{(3)}} = \frac{\partial E}{\partial \hat{y}}\,\frac{\partial \hat{y}}{\partial z^{(3)}}\,\frac{\partial z^{(3)}}{\partial w^{(3)}}$

Gradient of $E$ w.r.t. $w^{(2)}$:
$\frac{\partial E}{\partial w^{(2)}} = \frac{\partial E}{\partial \hat{y}}\,\frac{\partial \hat{y}}{\partial z^{(3)}}\,\frac{\partial z^{(3)}}{\partial a^{(2)}}\,\frac{\partial a^{(2)}}{\partial z^{(2)}}\,\frac{\partial z^{(2)}}{\partial w^{(2)}}$

Gradient of $E$ w.r.t. $w^{(1)}$:
$\frac{\partial E}{\partial w^{(1)}} = \frac{\partial E}{\partial \hat{y}}\,\frac{\partial \hat{y}}{\partial z^{(3)}}\,\frac{\partial z^{(3)}}{\partial a^{(2)}}\,\frac{\partial a^{(2)}}{\partial z^{(2)}}\,\frac{\partial z^{(2)}}{\partial a^{(1)}}\,\frac{\partial a^{(1)}}{\partial z^{(1)}}\,\frac{\partial z^{(1)}}{\partial w^{(1)}}$

The factors computed for later layers are reused when propagating the error to earlier layers.
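A minimal sketch of these chain-rule products for a chain of single-unit layers. A sigmoid activation is assumed here so the derivatives are well-defined (the sign function used elsewhere in the slides has zero derivative almost everywhere); the input, target, and weight values are arbitrary:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def backprop_single_unit(x, y, w1, w2, w3):
    """One forward/backward pass through a 3-layer chain of single units."""
    # Forward pass
    z1 = w1 * x;  a1 = sigmoid(z1)
    z2 = w2 * a1; a2 = sigmoid(z2)
    z3 = w3 * a2; y_hat = sigmoid(z3)
    E = 0.5 * (y - y_hat) ** 2

    # Backward pass: the factor delta from each later layer is reused.
    dE_dyhat = -(y - y_hat)
    delta3 = dE_dyhat * y_hat * (1 - y_hat)   # dE/dz3
    dE_dw3 = delta3 * a2
    delta2 = delta3 * w3 * a2 * (1 - a2)      # dE/dz2
    dE_dw2 = delta2 * a1
    delta1 = delta2 * w2 * a1 * (1 - a1)      # dE/dz1
    dE_dw1 = delta1 * x
    return E, (dE_dw1, dE_dw2, dE_dw3)

print(backprop_single_unit(x=1.0, y=1.0, w1=0.5, w2=-0.3, w3=0.8))
```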
An Example
Consider an ANN of 1 hidden layer as follows. Suppose the sign function and the weighted sum function are used for both hidden and output nodes.

(Figure: inputs $x_1, x_2$; hidden units $a_1$ and $a_2$; a single output $\hat{y}$. The weights are labeled here as $w_1, \dots, w_4$ for the input-to-hidden links and $w_5, w_6$ for the hidden-to-output links.)
With $z_1 = w_1 x_1 + w_2 x_2$, $a_1 = \mathrm{sign}(z_1)$, $z_2 = w_3 x_1 + w_4 x_2$, $a_2 = \mathrm{sign}(z_2)$, $z = w_5 a_1 + w_6 a_2$, and $\hat{y} = \mathrm{sign}(z)$:

$E = \frac{1}{2}(y - \hat{y})^2 = \frac{1}{2}\big(y - \mathrm{sign}(z)\big)^2$

Output-layer weights (treating $\partial \hat{y}/\partial z = 1$, as before):

$\frac{\partial E}{\partial w_5} = -(y - \hat{y})\,a_1 \quad\Rightarrow\quad w_5 \leftarrow w_5 + \lambda\,(y - \hat{y})\,a_1$, and similarly $w_6 \leftarrow w_6 + \lambda\,(y - \hat{y})\,a_2$.

Hidden-layer weights, reusing the factor $(y - \hat{y})$ already obtained when updating $w_5$ and $w_6$ (and treating $\partial a_1/\partial z_1 = 1$):

$\frac{\partial E}{\partial w_1} = \frac{\partial E}{\partial z}\,\frac{\partial z}{\partial a_1}\,\frac{\partial a_1}{\partial z_1}\,\frac{\partial z_1}{\partial w_1} = -(y - \hat{y})\,w_5\,x_1 \quad\Rightarrow\quad w_1 \leftarrow w_1 + \lambda\,(y - \hat{y})\,w_5\,x_1$,

and analogously for $w_2$, $w_3$, and $w_4$.
BP Algorithm: Example
Activation function: sign()
Integration function: weighted sum
$\lambda = 0.4$, $b = 0$

Training data:
X1 X2 | y
 0  0 | -1
 1  0 |  1
 0  1 |  1
 1  1 |  1

Initialization: $w_1 = 1$, $w_2 = 1$, $w_3 = 1$, $w_4 = 1$, $w_5 = 1$, $w_6 = 1$
For the 1st example: $x_1 = 0$ and $x_2 = 0$
BP Algorithm: Example (cont.)
Forward pass:
$a_1 = \mathrm{sign}(0 \times 1 + 0 \times 1) = 1$ and $a_2 = \mathrm{sign}(0 \times 1 + 0 \times 1) = 1$
Then $\hat{y} = \mathrm{sign}(1 \times 1 + 1 \times 1) = 1$
BP Algorithm: Example (cont.)

Update rules: $w_5 \leftarrow w_5 + \lambda\,(y - \hat{y})\,a_1$, $w_6 \leftarrow w_6 + \lambda\,(y - \hat{y})\,a_2$, and, for the hidden-layer weights, $w_1 \leftarrow w_1 + \lambda\,(y - \hat{y})\,w_5\,x_1$ and analogously for $w_2$, $w_3$, $w_4$.

Here $(y - \hat{y}) = -1 - 1 = -2$.

Backpropagation:
$w_5 = 1 + 0.4 \times (-2) \times 1 = 0.2$,  $w_6 = 1 + 0.4 \times (-2) \times 1 = 0.2$
$w_1 = 1 + 0.4 \times (-2) \times 1 \times 0 = 1$,  $w_2 = 1 + 0.4 \times (-2) \times 1 \times 0 = 1$
$w_3 = 1 + 0.4 \times (-2) \times 1 \times 0 = 1$,  $w_4 = 1 + 0.4 \times (-2) \times 1 \times 0 = 1$
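A sketch that replays this numeric step in code, following the reconstruction above (sign activations, all six weights initialized to 1, $\lambda = 0.4$, activation derivatives treated as 1, and our weight labeling $w_1, \dots, w_6$):

```python
def sign(z):
    return 1 if z >= 0 else -1

# First training example: x = (0, 0), y = -1; all six weights start at 1, lam = 0.4.
x1, x2, y = 0, 0, -1
w1 = w2 = w3 = w4 = w5 = w6 = 1.0
lam = 0.4

# Forward pass
a1 = sign(w1 * x1 + w2 * x2)        # = 1
a2 = sign(w3 * x1 + w4 * x2)        # = 1
y_hat = sign(w5 * a1 + w6 * a2)     # = 1

# Backpropagation (derivatives of sign treated as 1, as in the slides)
err = y - y_hat                     # = -2
new_w5 = w5 + lam * err * a1        # 1 + 0.4 * (-2) * 1 = 0.2
new_w6 = w6 + lam * err * a2        # 1 + 0.4 * (-2) * 1 = 0.2
new_w1 = w1 + lam * err * w5 * x1   # 1 + 0.4 * (-2) * 1 * 0 = 1 (input is 0)
new_w2 = w2 + lam * err * w5 * x2   # unchanged for the same reason
new_w3 = w3 + lam * err * w6 * x1
new_w4 = w4 + lam * err * w6 * x2
print(new_w1, new_w2, new_w3, new_w4, new_w5, new_w6)   # 1.0 1.0 1.0 1.0 0.2 0.2
```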
BP Algorithm: Example (cont.)
For the 2nd example: $x_1 = 1$ and $x_2 = 0$
$a_1 = \mathrm{sign}(1 \times 1 + 0 \times 1) = 1$ and $a_2 = \mathrm{sign}(1 \times 1 + 0 \times 1) = 1$
Then $\hat{y} = \mathrm{sign}(1 \times 0.2 + 1 \times 0.2) = 1$
Since $\hat{y} = y = 1$, $(y - \hat{y}) = 0$ and the weights remain unchanged.
BP Algorithm: Example (cont.)
For the 3rd example: $x_1 = 0$ and $x_2 = 1$
$a_1 = \mathrm{sign}(0 \times 1 + 1 \times 1) = 1$ and $a_2 = \mathrm{sign}(0 \times 1 + 1 \times 1) = 1$
Then $\hat{y} = \mathrm{sign}(1 \times 0.2 + 1 \times 0.2) = 1 = y$, so again no update.
BP Algorithm: Example (cont.)
For the 4th example: $x_1 = 1$ and $x_2 = 1$
$a_1 = \mathrm{sign}(1 \times 1 + 1 \times 1) = 1$ and $a_2 = \mathrm{sign}(1 \times 1 + 1 \times 1) = 1$
Then $\hat{y} = \mathrm{sign}(1 \times 0.2 + 1 \times 0.2) = 1 = y$, so again no update.

End of the 1st Epoch; the 2nd Epoch starts.

BP Algorithm: Example (cont.)


For the 1st example again: $x_1 = 0$ and $x_2 = 0$
$a_1 = \mathrm{sign}(0 \times 1 + 0 \times 1) = 1$ and $a_2 = \mathrm{sign}(0 \times 1 + 0 \times 1) = 1$
Then $\hat{y} = \mathrm{sign}(0.2 \times 1 + 0.2 \times 1) = 1$, while $y = -1$, so $(y - \hat{y}) = -2$.
The weights need to be further updated via backpropagation.
Design Issues for ANN
The number of nodes in the input layer:
Assign an input node to each numerical or binary input variable.
The number of nodes in the output layer:
A binary class problem needs a single output node.
A k-class problem needs k output nodes (one-hot encoding).
How many nodes in the hidden layer(s)?
Too many parameters result in networks that are too complex and overfit the data.
Design Issues for ANN
How many nodes in the hidden layer(s)?
Too many parameters result in networks that are too complex and overfit the data.
If the network underfits, try to increase the number of hidden units.
If the network overfits, try to decrease the number of hidden units.