CZ4041/SC4000:
Machine Learning
Lesson 7: Artificial Neural Networks
LI Boyang, Albert
School of Computer Science and Engineering,
NTU, Singapore
Machine Learning
Most SCSE courses discuss computer systems, which are artificial in nature.
Programming Languages, Data Structures, Operating
Systems, Databases, Compilers, etc.
Most natural sciences study objective reality and the various types of phenomena within it.
Physics, Chemistry, Biology, etc.
Machine Learning is where computer systems meet the objective world: a subject where we use the former to model the latter.
ML Requires Math
Typical ML Approach
Specify (explicitly or implicitly) a family of functions that contain some free parameters (e.g., the weights $w_1$, $w_2$ and the bias $b$ below):

output $= w_1 x_1 + w_2 x_2 + b$
Determine the values of these free parameters from
data.
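As a concrete, illustrative sketch of this two-step recipe (not taken from the slides): the snippet below specifies the linear family $y = w_1 x_1 + w_2 x_2 + b$ and determines the free parameters from synthetic data with an off-the-shelf least-squares fit.

```python
import numpy as np

# Family of functions: y = w1*x1 + w2*x2 + b, with free parameters w1, w2, b.
# Synthetic training data generated with w1=2.0, w2=-1.0, b=0.5 plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                  # inputs (x1, x2)
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 + 0.1 * rng.normal(size=100)

# Determine the parameter values from data (here: ordinary least squares).
A = np.hstack([X, np.ones((100, 1))])          # extra column of 1s multiplies the bias b
params, *_ = np.linalg.lstsq(A, y, rcond=None)
w1, w2, b = params
print(f"w1={w1:.2f}, w2={w2:.2f}, b={b:.2f}")  # recovers values close to 2.0, -1.0, 0.5
```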
Purposes of Machine Learning
Make Accurate Predictions
Which team will win a soccer match?
Which stock will see its price skyrocket?
Which patient is at higher risk?
Usually it is difficult to write down such prediction rules manually
Rather, we learn to make the predictions from paired (input, output) data
Purposes of Machine Learning
Make Accurate Predictions
Artificial Neural Networks (Week 7)
Support Vector Machines (Week 8)
Regression (Week 8)
Ensemble Learning (Week 9)
Purposes of Machine Learning
Build Probabilistic Models of the World
Germany will probably beat Japan, but what are the
odds? 60/40, 70/30, or 80/20?
I’m willing to bet more money if the odds are in my
favor.
This often translates to: what is the shape of the Gaussian probability distribution?
Purposes of Machine Learning
Build Probabilistic Models of the World
Density Estimation (Week 11)
Purposes of Machine Learning
Pattern Discovery
Imagine you are an alien from another planet. You
watch a soccer match. What do you observe?
Two groups of humans. One ball.
Behavior changes when the ball goes into the net.
Purposes of Machine Learning
Pattern Discovery
We often have little prior experience, knowledge, or
insight into the causal mechanisms that generated the
data.
Still, with only statistical tools, we can identify
many important data characteristics
Obviously, domain knowledge can enrich and
complement statistical tools.
Purposes of Machine Learning
Pattern Discovery
Clustering (Week 10)
Dimensionality Reduction (Week 12)
Artificial Neural Networks:
Perceptron
Artificial Neural Networks (ANN)
The study of ANN was inspired by biological
neural systems
[Figure: example images of a cat and a dog, a classification task that biological neural systems solve effortlessly.]
“Biology”
The human brain is a densely interconnected network of neurons; each neuron is connected to others via dendrites and axons.
Dendrites and axons transmit electrical signals from one neuron to another.
The human brain learns by changing the strength of the synaptic connections between neurons.
An ANN is composed of an interconnected assembly of nodes and directed links.
“Biology”
A neuron sends out a spike from the axon after
receiving enough input from the dendrites.
Artificial Neural Networks (cont.)
[Figure: three input nodes X1, X2, X3 feed a black box that produces the output y.]

X1 X2 X3 |  y
 1  0  0 | -1
 1  0  1 |  1
 1  1  0 |  1
 1  1  1 |  1
 0  0  1 | -1
 0  1  0 | -1
 0  1  1 |  1
 0  0  0 | -1

Output y is 1 if at least two of the three inputs are equal to 1 (and -1 otherwise)
ANN: Perceptron
[Figure: a perceptron. Input nodes X1, X2, ..., Xd are connected to a single output node y via weighted links w1, w2, ..., wd; the output node computes a weighted sum of its inputs (① input, ② weight, ③ weighted sum) and passes it through an activation function. The training table from the previous slide is shown alongside.]
Each input node is connected via a weighted link to the output node.
Weights can be positive, negative, or zero (no connection).
ANN: Perceptron (cont.)
[Figure: the same perceptron, now including a bias factor b feeding the output node.]
The output node computes a weighted sum plus the bias factor,

$z = w_1 x_1 + w_2 x_2 + \cdots + w_d x_d + b$

and passes it through the activation function:

$\hat{y} = f(z)$, where $f(z) = \mathrm{sign}(z) = \begin{cases} 1, & z \ge 0 \\ -1, & \text{otherwise} \end{cases}$
[Figure: the perceptron diagram - input nodes X1, ..., Xd, weighted links w1, ..., wd, one output node y.]
The model is an assembly of inter-connected nodes and weighted links.
The output node first sums up its input values according to the weights of its links.
The weighted sum is compared against some threshold.
An output is produced based on the sign of the result.
Perceptron Model
[Figure: the same perceptron diagram.]

$\hat{y} = \mathrm{sign}(w_1 x_1 + w_2 x_2 + \cdots + w_d x_d + b) = \mathrm{sign}\!\left(\sum_{i=1}^{d} w_i x_i + b\right)$
ANN: Perceptron (cont.)
Mathematically, the output of a perceptron model can be expressed in a more compact form:

$\hat{y} = \mathrm{sign}\!\left(\sum_{i=0}^{d} w_i x_i\right) = \mathrm{sign}(\mathbf{w}^\top \mathbf{x})$   (an inner product, with d the input dimension)

where $\mathbf{w} = (w_0, w_1, w_2, \ldots, w_d)^\top$, $\mathbf{x} = (x_0, x_1, x_2, \ldots, x_d)^\top$, $w_0 = b$, and $x_0 = 1$.
Inner Product: Review
Given two vectors $\mathbf{u}$ and $\mathbf{v}$, which are both of d dimensions, the inner product between $\mathbf{u}$ and $\mathbf{v}$ is defined as

$\mathbf{u}^\top \mathbf{v} = (u_1, u_2, \ldots, u_d)\,(v_1, v_2, \ldots, v_d)^\top = u_1 v_1 + u_2 v_2 + \cdots + u_d v_d$
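A quick sanity check of this definition in Python (a sketch, not part of the slides):

```python
import numpy as np

u = np.array([1.0, -2.0, 3.0])
v = np.array([4.0, 0.5, -1.0])

# Sum of element-wise products vs. NumPy's built-in inner product.
manual = sum(ui * vi for ui, vi in zip(u, v))
print(manual, np.dot(u, v))   # both print 0.0:  1*4 + (-2)*0.5 + 3*(-1)
```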
ANN: Perceptron (cont.)
With $\mathbf{w} = (w_0, w_1, \ldots, w_d)^\top$, $\mathbf{x} = (x_0, x_1, \ldots, x_d)^\top$, $w_0 = b$, and $x_0 = 1$:

$\mathbf{w}^\top \mathbf{x} = \sum_{i=0}^{d} w_i x_i = \sum_{i=1}^{d} w_i x_i + b$

so $\hat{y} = \mathrm{sign}(\mathbf{w}^\top \mathbf{x}) = \mathrm{sign}\!\left(\sum_{i=1}^{d} w_i x_i + b\right)$.
ANN: Sign Function

$\mathrm{sign}(z) = \begin{cases} 1, & z \ge 0 \\ -1, & z < 0 \end{cases}$

Note: the sign(z) function has derivative = 0 everywhere, except at z = 0 (where it is not differentiable).
Perceptron: Making Prediction
Given a learned perceptron with $w_1 = 0.5$, $w_2 = -1$, and $b = 0$.
Test data: $(x_1, x_2) = (1, 1)$.
Predicted label:

$\hat{y} = \mathrm{sign}(1 \times 0.5 + 1 \times (-1)) = \mathrm{sign}(-0.5) = -1$
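A minimal sketch of this prediction step in Python; the reading of the weights as $w_1 = 0.5$, $w_2 = -1$, $b = 0$ follows the slide above, and the helper function name is mine:

```python
import numpy as np

def perceptron_predict(x, w, b):
    """Return sign(w.x + b), with sign(0) treated as +1 as on the slides."""
    z = np.dot(w, x) + b
    return 1 if z >= 0 else -1

w = np.array([0.5, -1.0])    # learned weights w1, w2
b = 0.0                      # bias
x = np.array([1.0, 1.0])     # test example (x1, x2) = (1, 1)
print(perceptron_predict(x, w, b))   # sign(0.5 - 1.0) = sign(-0.5) = -1
```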
Perceptron: Learning
During training, the weight parameters w are adjusted until the outputs of the perceptron become consistent with the true outputs of the training data.
The weight parameters are updated iteratively, in an online learning manner (one training example at a time).
[Figure: the training table (examples x1 to x8 with inputs X1, X2, X3 and label y, as on the earlier slide) beside the perceptron. Example x1 is presented with the initial weights $w^{(0)}$; $w^{(0)}$ is updated if necessary.]
Perceptron: Learning (cont.)
[Figure: example x2 is presented with the current weights $w^{(1)}$; $w^{(1)}$ is updated if necessary.]
…
Perceptron: Learning (cont.)
[Figure: example x8 is presented with the current weights $w^{(7)}$; $w^{(7)}$ is updated if necessary.]
End of the 1st Epoch. The 2nd Epoch starts.
Perceptron: Learning (cont.)
[Figure: example x1 is presented again with the current weights $w^{(8)}$; $w^{(8)}$ is updated if necessary.]
…
Perceptron: Learning (cont.)
Algorithm:
1. Let $D = \{(\mathbf{x}_i, y_i) \mid i = 1, 2, \ldots, N\}$ be the set of training examples ($\mathbf{x}_i$ has d dimensions); let $k = 0$
2. Initialize the weight vector $\mathbf{w}^{(0)}$ with random values
3. Repeat
4.   for each training example $(\mathbf{x}_i, y_i)$ do
5.     Compute the predicted output $\hat{y}_i^{(k)}$ and compare it with the ground truth $y_i$
6.     Update the weights by $\mathbf{w}^{(k+1)} = \mathbf{w}^{(k)} + \lambda\,(y_i - \hat{y}_i^{(k)})\,\mathbf{x}_i$
7.     $k = k + 1$
8.   end for
9. Until stopping condition is met
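Below is a compact Python sketch of this algorithm; the function name, the mistake-based stopping condition, and the learning rate are my choices (the slides leave the stopping condition unspecified). It is applied to the "at least two of the three inputs are 1" data set from the earlier slide, which is linearly separable.

```python
import numpy as np

def train_perceptron(X, y, lam=0.1, max_epochs=100):
    """Perceptron learning: w <- w + lam*(y_i - y_hat_i)*x_i (a no-op on correct examples)."""
    n, d = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])          # x0 = 1 absorbs the bias (w0 = b)
    w = np.random.default_rng(0).normal(scale=0.1, size=d + 1)   # step 2: random init
    for _ in range(max_epochs):                   # steps 3/9: repeat until stopping condition
        mistakes = 0
        for xi, yi in zip(Xb, y):                 # step 4: loop over training examples
            y_hat = 1 if np.dot(w, xi) >= 0 else -1   # step 5: predicted output
            if y_hat != yi:
                w = w + lam * (yi - y_hat) * xi   # step 6: weight update
                mistakes += 1
        if mistakes == 0:                         # stop once every example is classified correctly
            break
    return w

# The "at least two of the three inputs are 1" data set from the earlier slide.
X = np.array([[1,0,0],[1,0,1],[1,1,0],[1,1,1],[0,0,1],[0,1,0],[0,1,1],[0,0,0]])
y = np.array([-1, 1, 1, 1, -1, -1, 1, -1])
w = train_perceptron(X, y)
print([1 if np.dot(w, np.r_[1, xi]) >= 0 else -1 for xi in X])   # should reproduce y exactly
```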
Perceptron: Learning (cont.)
Why use the following weight update rule?

$\mathbf{w}^{(k+1)} = \mathbf{w}^{(k)} + \lambda\,(y_i - \hat{y}_i^{(k)})\,\mathbf{x}_i$

It is induced based on a gradient descent method:
$E(\mathbf{w})$ is the function to be minimized, e.g., the error (loss) function;
$\lambda \in (0, 1]$ is the learning rate.
We want to find the w that minimises the loss function.
Gradient Descent

$\mathbf{w}^* = \arg\min_{\mathbf{w}} E(\mathbf{w})$

Take small steps in the direction opposite to the gradient:

$\mathbf{w}^{(k+1)} = \mathbf{w}^{(k)} - \lambda\,\frac{\partial E(\mathbf{w})}{\partial \mathbf{w}}$

Intuition (e.g., $E(w) = w^2$, so $\frac{dE}{dw} = 2w$): where the gradient is positive (w > 0), the update decreases w; where the gradient is negative (w < 0), the update increases w. Either way, w moves toward the minimum at the origin.
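A tiny numeric sketch of this intuition (the $E(w) = w^2$ example follows the annotation above; the starting point and learning rate are mine):

```python
# Gradient descent on E(w) = w**2, whose gradient is dE/dw = 2*w.
lam = 0.1            # learning rate
w = 3.0              # arbitrary starting point
for k in range(25):
    grad = 2 * w              # positive when w > 0, negative when w < 0
    w = w - lam * grad        # step in the direction opposite to the gradient
print(w)                      # ~0.01: w has moved close to the minimiser w* = 0
```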
Perceptron: Learning (cont.)
Weight update rule:

$w_j^{(k+1)} = w_j^{(k)} + \lambda\,(y - \hat{y})\,x_j$, which comes from $w_j^{(k+1)} = w_j^{(k)} - \lambda\,\frac{\partial E}{\partial w_j}$

Consider the loss function for each training example, with $\hat{y} = \mathrm{sign}(\mathbf{w}^\top \mathbf{x})$ the predicted label (ideally the loss goes to 0):

$E = \frac{1}{2}(y - \hat{y})^2 = \frac{1}{2}\big(y - \mathrm{sign}(\mathbf{w}^\top \mathbf{x})\big)^2$

Update the weights using a gradient descent method. With $z = \mathbf{w}^\top \mathbf{x}$, the chain rule gives

$\frac{\partial E}{\partial w_j} = \frac{\partial E}{\partial z}\,\frac{\partial z}{\partial w_j}$, where $\frac{\partial E}{\partial z} = \frac{1}{2} \cdot 2\,\big(y - \mathrm{sign}(z)\big) \cdot \left(-\frac{\partial\,\mathrm{sign}(z)}{\partial z}\right)$
Chain Rule of Calculus (Review)
Suppose that $y = g(x)$ and $z = f(g(x)) = f(y)$.
Chain rule of calculus:

$\dfrac{dz}{dx} = \dfrac{dz}{dy}\,\dfrac{dy}{dx}$

Generalized to the vector case: suppose $\mathbf{x} \in \mathbb{R}^m$, $\mathbf{y} \in \mathbb{R}^n$, $\mathbf{y} = g(\mathbf{x})$ and $z = f(\mathbf{y})$; then

$\dfrac{\partial z}{\partial x_i} = \sum_{j=1}^{n} \dfrac{\partial z}{\partial y_j}\,\dfrac{\partial y_j}{\partial x_i}$
Perceptron: Learning (cont.)
The weight update formula for the perceptron:

$\dfrac{\partial E}{\partial w_j} = \dfrac{\partial E}{\partial z}\,\dfrac{\partial z}{\partial w_j}$, with $E = \tfrac{1}{2}(y - \hat{y})^2$, $\hat{y} = \mathrm{sign}(z)$ and $z = \mathbf{w}^\top \mathbf{x}$

Note: the sign() function has derivative = 0 except at z = 0. Therefore, to compute a meaningful derivative, we remove sign() from consideration, i.e., we treat $\hat{y} \approx z$ when differentiating:

$\dfrac{\partial E}{\partial w_j} \approx \tfrac{1}{2} \times 2\,(y - \hat{y}) \times (-1) \times x_j = -(y - \hat{y})\,x_j$

Substituting into gradient descent gives

$w_j^{(k+1)} = w_j^{(k)} - \lambda\,\dfrac{\partial E}{\partial w_j} = w_j^{(k)} + \lambda\,(y - \hat{y})\,x_j$
Approximating the derivative?
The equation used to compute $\hat{y}$ from $\mathbf{x}$:

$\hat{y} = \mathrm{sign}(\mathbf{w}^\top \mathbf{x})$

The equation used to compute the derivative:

$\dfrac{\partial \hat{y}}{\partial w_j} \approx x_j$

Why? This is approximating the step function with a linear function.
While the derivative itself is incorrect, its direction is correct.
That is, if you want to increase $\hat{y}$, you should increase $\mathbf{w}^\top \mathbf{x}$, and vice versa.
Perceptron Weights Update

$\mathbf{w}^{(k+1)} = \mathbf{w}^{(k)} + \lambda\,(y_i - \hat{y}_i^{(k)})\,\mathbf{x}_i$

If the prediction is correct, $y_i - \hat{y}_i^{(k)} = 0$, and the weights remain unchanged (the ideal case).
If $y_i = +1$ and $\hat{y}_i^{(k)} = -1$, then $(y_i - \hat{y}_i^{(k)}) = 2$:
The weights of all links with positive inputs need to be updated by increasing their values.
The weights of all links with negative inputs need to be updated by decreasing their values.
[Figure: input nodes grouped by input sign: $x_j > 0$, $x_j = 0$, $x_j < 0$.]
Perceptron Weights Update (cont.)

$\mathbf{w}^{(k+1)} = \mathbf{w}^{(k)} + \lambda\,(y_i - \hat{y}_i^{(k)})\,\mathbf{x}_i$

If $y_i = -1$ and $\hat{y}_i^{(k)} = +1$, then $(y_i - \hat{y}_i^{(k)}) = -2$:
The weights of all links with positive inputs need to be updated by decreasing their values.
The weights of all links with negative inputs need to be updated by increasing their values.
[Figure: input nodes grouped by input sign: $x_j > 0$, $x_j = 0$, $x_j < 0$.]
Convergence
The decision boundary of a perceptron is a linear hyperplane.
[Figure: a perceptron and its linear decision boundary in the input space.]
The perceptron learning algorithm is guaranteed to converge to an optimal solution (a separating hyperplane) for linearly separable classification problems.
Perceptron Limitation
If the problem is not linearly separable, the algorithm
fails to converge
Nonlinearly separable data given by the XOR function
Multi-layer Perceptron
Example: Not Linearly Separable
[Figures: a worked example spanning several slides. A data set in the (X1, X2) plane cannot be separated by a single line, but three lines L1, L2, L3 can be drawn through the plane, each realizable by one perceptron on the inputs X1 and X2. Evaluating the three perceptrons assigns every region of the plane a 3-bit code (the regions shown are labelled 001, 010, 011, 100, 101, 110, 111) according to which side of L1, L2 and L3 it lies on. A second-layer perceptron then takes the three bits as inputs and, with a suitable weighted sum and threshold, outputs +1 exactly for the regions belonging to the positive class, producing a non-linear decision boundary overall.]
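To make this concrete, here is a small sketch of the same idea applied to the XOR data from the previous section: two first-layer perceptrons define two lines, and a second-layer perceptron combines their outputs. The specific weights are mine, not the ones drawn on the slides.

```python
import numpy as np

def sign(z):
    # sign(0) treated as +1, as on the slides
    return np.where(z >= 0, 1, -1)

def two_layer_xor(x1, x2):
    # Layer 1: two perceptrons, each a linear decision boundary in the (x1, x2) plane.
    h1 = sign(1.0 * x1 + 1.0 * x2 - 0.5)   # fires when x1 + x2 >= 0.5 ("at least one input is 1")
    h2 = sign(-1.0 * x1 - 1.0 * x2 + 1.5)  # fires when x1 + x2 <= 1.5 ("not both inputs are 1")
    # Layer 2: one perceptron combining the region code (h1, h2).
    return sign(1.0 * h1 + 1.0 * h2 - 1.5) # fires only when both first-layer units fire

for x1, x2 in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print((x1, x2), two_layer_xor(x1, x2))   # -1, +1, +1, -1: the XOR pattern
```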
General Structure: Multilayer ANN
[Figure: a multilayer network. The input layer feeds a hidden layer, which feeds the output layer. Each neuron combines its weighted inputs with an integration function and passes the result through an activation function.]
Integration Functions
Weighted sum
Quadratic function
Spherical function
Activation Functions
Sign function (threshold function):

$f(z) = \mathrm{sign}(z) = \begin{cases} 1, & z \ge 0 \\ -1, & z < 0 \end{cases}$

Unipolar sigmoid function:

$f(z) = \dfrac{1}{1 + e^{-\lambda z}}$

[Figure: the unipolar sigmoid curve, rising smoothly from 0 to 1 as z goes from -4 to 4.]
When $\lambda = 1$, it is called the sigmoid function.
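A short sketch of these two activation functions in NumPy (the function names are mine; λ is the steepness parameter from the slide):

```python
import numpy as np

def sign_activation(z):
    """Threshold (sign) activation: 1 if z >= 0, -1 otherwise."""
    return np.where(z >= 0, 1.0, -1.0)

def unipolar_sigmoid(z, lam=1.0):
    """Unipolar sigmoid: 1 / (1 + exp(-lam * z)); lam = 1 gives the usual sigmoid."""
    return 1.0 / (1.0 + np.exp(-lam * z))

z = np.linspace(-4, 4, 9)
print(sign_activation(z))        # jumps from -1 to 1 at z = 0
print(unipolar_sigmoid(z))       # rises smoothly from ~0.02 to ~0.98
```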
Update Weights for Multi-layer NNs
Initialize the weights in each layer $(\mathbf{w}^{(1)}, \ldots, \mathbf{w}^{(l)}, \ldots, \mathbf{w}^{(L)})$.
Adjust the weights such that the output of the ANN is consistent with the class labels of the training examples.
Loss function for each training instance:

$E = \frac{1}{2}(y - \hat{y})^2$

For each layer $l$, update the weights $\mathbf{w}^{(l)}$ by gradient descent at each iteration:

$\mathbf{w}^{(l)} \leftarrow \mathbf{w}^{(l)} - \lambda\,\dfrac{\partial E}{\partial \mathbf{w}^{(l)}}$

Computing the gradient w.r.t. the weights in each layer is computationally expensive!
The Backpropagation Algorithm
A Multi-layer Feed-forward NN
[Figure: a network with an input layer, two hidden layers, and an output layer; the weights between consecutive layers are $W^{(1)}$, $W^{(2)}$, and $W^{(3)}$. Each hidden and output unit is a perceptron-like node.]
Backpropagation: Basic Idea
Initialize the weights $(W^{(1)}, \ldots, W^{(3)})$.
Forward pass: each training example $(\mathbf{x}, y)$ is used to compute the outputs of each hidden layer and generate the final output of the ANN.
[Figure: the same network, with activations flowing from the input layer through hidden layers 1 and 2 to the output layer.]
Backpropagation: Basic Idea (cont.)
Backpropagation: starting with the output layer, propagate the error back to the previous layer in order to update the weights between the two layers, until the earliest hidden layer is reached.
[Figure: the same network; the error is calculated at the output layer and propagated backwards.]
The Computational Graph
[Figure: the network unrolled as a computational graph. In each layer, the inputs are combined by a summation (a weighted sum with that layer's weights $W^{(l)}$) and passed through an activation function to produce the layer's output, e.g., the output of hidden layer 1, which then feeds the next layer.]
Backpropagation (BP)
Consider the case where each layer contains a single unit, with $z^{(l)} = w^{(l)} a^{(l-1)}$ (summation) and $a^{(l)} = f(z^{(l)})$ (activation), and the final output $\hat{y} = a^{(3)}$.

Gradient of $E$ w.r.t. $w^{(3)}$:
$\dfrac{\partial E}{\partial w^{(3)}} = \dfrac{\partial E}{\partial a^{(3)}}\,\dfrac{\partial a^{(3)}}{\partial z^{(3)}}\,\dfrac{\partial z^{(3)}}{\partial w^{(3)}}$

Gradient of $E$ w.r.t. $w^{(2)}$:
$\dfrac{\partial E}{\partial w^{(2)}} = \dfrac{\partial E}{\partial a^{(3)}}\,\dfrac{\partial a^{(3)}}{\partial z^{(3)}}\,\dfrac{\partial z^{(3)}}{\partial a^{(2)}}\,\dfrac{\partial a^{(2)}}{\partial z^{(2)}}\,\dfrac{\partial z^{(2)}}{\partial w^{(2)}}$

Gradient of $E$ w.r.t. $w^{(1)}$:
$\dfrac{\partial E}{\partial w^{(1)}} = \dfrac{\partial E}{\partial a^{(3)}}\,\dfrac{\partial a^{(3)}}{\partial z^{(3)}}\,\dfrac{\partial z^{(3)}}{\partial a^{(2)}}\,\dfrac{\partial a^{(2)}}{\partial z^{(2)}}\,\dfrac{\partial z^{(2)}}{\partial a^{(1)}}\,\dfrac{\partial a^{(1)}}{\partial z^{(1)}}\,\dfrac{\partial z^{(1)}}{\partial w^{(1)}}$

The factors shared between these products are computed once at the output layer and reused as the error is propagated backwards, layer by layer.
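The sketch below spells out these chain-rule products for a three-layer network with one unit per layer. It uses the (differentiable) sigmoid activation so that every factor exists; the weights, input, and variable names are mine.

```python
import numpy as np

def f(z):                 # sigmoid activation
    return 1.0 / (1.0 + np.exp(-z))

def f_prime(z):           # its derivative
    return f(z) * (1.0 - f(z))

# One unit per layer: z(l) = w(l) * a(l-1), a(l) = f(z(l)); loss E = 0.5*(y - a3)^2.
x, y = 1.0, 1.0
w1, w2, w3 = 0.5, -0.3, 0.8

# Forward pass, storing the intermediate values.
z1 = w1 * x;  a1 = f(z1)
z2 = w2 * a1; a2 = f(z2)
z3 = w3 * a2; a3 = f(z3)

# Backward pass: reuse dE/dz(l) ("delta") when moving to the earlier layer.
dE_da3 = -(y - a3)
delta3 = dE_da3 * f_prime(z3)        # dE/dz3
dE_dw3 = delta3 * a2                 # dE/dw3 = dE/dz3 * dz3/dw3
delta2 = delta3 * w3 * f_prime(z2)   # dE/dz2, reusing delta3
dE_dw2 = delta2 * a1
delta1 = delta2 * w2 * f_prime(z1)   # dE/dz1, reusing delta2
dE_dw1 = delta1 * x
print(dE_dw1, dE_dw2, dE_dw3)
```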
An Example
Consider an ANN with 1 hidden layer as follows. Suppose the sign function and the weighted-sum function are used as the activation and integration functions for both the hidden and output nodes.
[Figure: inputs $x_1, x_2$; hidden nodes $n_1$ (incoming weights $w_1, w_2$) and $n_2$ (incoming weights $w_3, w_4$); a single output node with weights $w_5, w_6$ on the links from $n_1, n_2$.]
The output of the network and the loss for one training example $(x_1, x_2, y)$:

$\hat{y} = \mathrm{sign}(z)$, where $z = w_5 n_1 + w_6 n_2$, and $E = \frac{1}{2}(y - \hat{y})^2$

Treating the derivative of sign(·) as 1 when differentiating (as for the perceptron), the updates for the output-layer weights are

$w_5 \leftarrow w_5 + \lambda\,(y - \hat{y})\,n_1, \qquad w_6 \leftarrow w_6 + \lambda\,(y - \hat{y})\,n_2$

[Figure: the same 2-2-1 network.]
The updates for the hidden-layer weights feeding $n_1$ follow from the chain rule through $n_1$; the term $(y - \hat{y})$ was already obtained when updating $w_5$:

$w_1 \leftarrow w_1 + \lambda\,(y - \hat{y})\,w_5\,x_1, \qquad w_2 \leftarrow w_2 + \lambda\,(y - \hat{y})\,w_5\,x_2$

[Figure: the same 2-2-1 network.]
Similarly, the updates for the hidden-layer weights feeding $n_2$ are

$w_3 \leftarrow w_3 + \lambda\,(y - \hat{y})\,w_6\,x_1, \qquad w_4 \leftarrow w_4 + \lambda\,(y - \hat{y})\,w_6\,x_2$

[Figure: the same 2-2-1 network.]
BP Algorithm: Example
Activation function: sign(); integration function: weighted sum.
Learning rate $\lambda = 0.4$; bias $b = 0$.
[Figure: the 2-2-1 network - inputs $x_1, x_2$; hidden nodes $n_1, n_2$; one output node.]

Training data:
X1 X2 |  y
 0  0 | -1
 1  0 |  1
 0  1 |  1
 1  1 |  1

Initialization:
$w_1 = 1,\; w_2 = 1,\; w_3 = 1,\; w_4 = 1,\; w_5 = 1,\; w_6 = 1$

For the 1st example: $x_1 = 0$ and $x_2 = 0$ (label $y = -1$)
BP Algorithm: Example (cont.)
Forward pass for the 1st example ($x_1 = 0$, $x_2 = 0$, $y = -1$):

$n_1 = \mathrm{sign}(0 \times 1 + 0 \times 1) = 1$ and $n_2 = \mathrm{sign}(0 \times 1 + 0 \times 1) = 1$
Then $\hat{y} = \mathrm{sign}(w_5 n_1 + w_6 n_2) = \mathrm{sign}(1 \times 1 + 1 \times 1) = 1$
BP Algorithm: Example (cont.)
Backpropagation for the 1st example. The error term is $y - \hat{y} = -1 - 1 = -2$.

Output-layer weights ($w \leftarrow w + \lambda(y - \hat{y})\,n$):
$w_5 = 1 + 0.4 \times (-2) \times 1 = 0.2 \qquad w_6 = 1 + 0.4 \times (-2) \times 1 = 0.2$

Hidden-layer weights ($w \leftarrow w + \lambda(y - \hat{y})\,w_{5/6}\,x$), unchanged here because $x_1 = x_2 = 0$:
$w_1 = 1 + 0.4 \times (-2) \times 1 \times 0 = 1 \qquad w_2 = 1 + 0.4 \times (-2) \times 1 \times 0 = 1$
$w_3 = 1 + 0.4 \times (-2) \times 1 \times 0 = 1 \qquad w_4 = 1 + 0.4 \times (-2) \times 1 \times 0 = 1$
BP Algorithm: Example (cont.)
For the 2nd example: $x_1 = 1$ and $x_2 = 0$ (label $y = 1$)
$n_1 = \mathrm{sign}(1 \times 1 + 0 \times 1) = 1$ and $n_2 = \mathrm{sign}(1 \times 1 + 0 \times 1) = 1$
Then $\hat{y} = \mathrm{sign}(1 \times 0.2 + 1 \times 0.2) = 1$
The prediction is correct: $y - \hat{y} = 0$, so no weights are updated.
BP Algorithm: Example (cont.)
For the 3rd example: $x_1 = 0$ and $x_2 = 1$ (label $y = 1$)
$n_1 = \mathrm{sign}(0 \times 1 + 1 \times 1) = 1$ and $n_2 = \mathrm{sign}(0 \times 1 + 1 \times 1) = 1$
Then $\hat{y} = \mathrm{sign}(1 \times 0.2 + 1 \times 0.2) = 1$
The prediction is correct: $y - \hat{y} = 0$, so no weights are updated.
BP Algorithm: Example (cont.)
For the 4th example: $x_1 = 1$ and $x_2 = 1$ (label $y = 1$)
$n_1 = \mathrm{sign}(1 \times 1 + 1 \times 1) = 1$ and $n_2 = \mathrm{sign}(1 \times 1 + 1 \times 1) = 1$
Then $\hat{y} = \mathrm{sign}(1 \times 0.2 + 1 \times 0.2) = 1$
The prediction is correct: $y - \hat{y} = 0$, so no weights are updated.
End of the 1st Epoch. The 2nd Epoch starts.
BP Algorithm: Example (cont.)
For the 1st example again: $x_1 = 0$ and $x_2 = 0$ (label $y = -1$)
$n_1 = \mathrm{sign}(0 \times 1 + 0 \times 1) = 1$ and $n_2 = \mathrm{sign}(0 \times 1 + 0 \times 1) = 1$
Then $\hat{y} = \mathrm{sign}(0.2 \times 1 + 0.2 \times 1) = 1$
The error term is $y - \hat{y} = -2$ again, so the weights need to be further updated via backpropagation.
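For reference, the following sketch reproduces this worked example in Python using the simplified update rules from the preceding slides (sign(0) treated as +1, λ = 0.4). It assumes, as the slides appear to, that the hidden-layer updates reuse the pre-update values of $w_5$ and $w_6$; in this particular example the distinction does not matter, because the inputs of the example that triggers an update are zero.

```python
def sgn(z):
    return 1 if z >= 0 else -1   # sign(0) treated as +1

# 2-2-1 network: w1..w4 feed the hidden nodes n1, n2; w5, w6 feed the output node.
w1 = w2 = w3 = w4 = w5 = w6 = 1.0
lam = 0.4
data = [((0, 0), -1), ((1, 0), 1), ((0, 1), 1), ((1, 1), 1)]

for epoch in range(2):
    for (x1, x2), y in data:
        # Forward pass
        n1 = sgn(w1 * x1 + w2 * x2)
        n2 = sgn(w3 * x1 + w4 * x2)
        y_hat = sgn(w5 * n1 + w6 * n2)
        err = y - y_hat
        # Backpropagation with the slides' simplified (sign-ignoring) derivatives
        w5_old, w6_old = w5, w6
        w5 += lam * err * n1
        w6 += lam * err * n2
        w1 += lam * err * w5_old * x1
        w2 += lam * err * w5_old * x2
        w3 += lam * err * w6_old * x1
        w4 += lam * err * w6_old * x2
        print(epoch + 1, (x1, x2), y, y_hat, round(w5, 2), round(w6, 2))
# After the 1st example of epoch 1: w5 = w6 = 0.2, the other weights stay at 1,
# matching the values on the slides.
```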
Design Issues for ANN
The number of nodes in the input layer
Assign an input node to each numerical or binary input variable
The number of nodes in the output layer
Binary class problem: a single node
k-class problem: k output nodes (one-hot encoding)
How many nodes in the hidden layer(s)?
Too many parameters result in networks that are too complex and overfit the data
Design Issues for ANN
How many nodes in the hidden layer(s)?
Too many parameters result in networks that are
too complex and overfit the data
If the network underfits
Try to increase the number of hidden units
If the network overfits
Try to decrease the number of hidden units