CHAPTER 7 | Neural Networks and
Neural Language Models
“[M]achines of this character can behave in a very
complicated manner when the number of units is large.”
Alan Turing (1948) "Intelligent Machinery", page 6
Introduction
They are called neural because:
Their origins lie in the biological neuron
A simplified model of the human neuron as a computing element
Described in terms of propositional logic
Introduction
A neural network is a
Network of small computing units
Each unit takes a vector of input values
And produces a single output value
Called a feedforward network
• Because the computation proceeds iteratively from one layer
of units to the next
Unit
Takes a weighted sum of its inputs
Plus one additional bias term
Using vector notation:
z = w · x + b
[w: weight vector, b: scalar bias, x: input vector]
Activation
Apply a non-linear function f to z
The output of this function is the activation value, a
y = a = f(z)
Different activation functions:
• Sigmoid
• Tanh
• rectified linear unit or ReLU
Sigmoid
σ(z) = 1 / (1 + e^(−z)): it maps the output into the range (0, 1)
The output of a neural unit:
y = σ(w · x + b) = 1 / (1 + exp(−(w · x + b)))
Example:
weight vector: w = [0.2, 0.3, 0.9], bias: b = 0.5 and x = [0.5, 0.6, 0.1]
z = w · x + b = 0.1 + 0.18 + 0.09 + 0.5 = 0.87
y = σ(0.87) ≈ 0.70
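A minimal NumPy sketch of this single sigmoid unit, using the example weights and input above (the sigmoid helper is ours, not from the slides):

    import numpy as np

    def sigmoid(z):
        # Squash a real value into the range (0, 1).
        return 1 / (1 + np.exp(-z))

    w = np.array([0.2, 0.3, 0.9])   # weight vector
    b = 0.5                          # scalar bias
    x = np.array([0.5, 0.6, 0.1])   # input vector

    z = np.dot(w, x) + b             # weighted sum plus bias: 0.87
    y = sigmoid(z)                   # activation value: about 0.70
    print(z, y)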
Sigmoid
Used in the output layer for binary classification
Disadvantage:
• Non-zero centered output
Tanh
Advantage: the mean of the activations is closer to zero (output range (−1, 1))
ReLU (Rectified Linear Unit)
y = ReLU(z) = max(z,0)
Advantages:
• Avoids vanishing gradient problem
• Doesn't become saturated
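A quick NumPy sketch of the three activation functions mentioned above (the sample z values are just for illustration):

    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))      # range (0, 1), not zero-centered

    def tanh(z):
        return np.tanh(z)                # range (-1, 1), roughly zero-centered

    def relu(z):
        return np.maximum(z, 0)          # max(z, 0); does not saturate for z > 0

    z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
    print(sigmoid(z))
    print(tanh(z))
    print(relu(z))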
Vanishing Gradient:
Networks are trained by
Propagating an error signal backwards through the layers
Gradients that are almost 0 cause the error signal to become
too small to be useful for training
This is called the vanishing gradient problem
The XOR problem
Can neural units compute simple functions of input?
    x1  x2 | AND | OR | XOR
     0   0 |  0  |  0 |  0
     0   1 |  0  |  1 |  1
     1   0 |  0  |  1 |  1
     1   1 |  1  |  1 |  0
Perceptrons
A very simple neural unit
• Binary output (0 or 1)
• No non-linear activation function
Easy to build AND or OR with perceptrons
AND: weights w1 = 1, w2 = 1, bias b = −1 (output 1 if w1x1 + w2x2 + b > 0, else 0)
• [0, 0]: 0 + 0 − 1 = −1 → output 0
• [0, 1]: 0 + 1 − 1 = 0 → output 0
• [1, 0]: 1 + 0 − 1 = 0 → output 0
• [1, 1]: 1 + 1 − 1 = 1 → output 1
OR: weights w1 = 1, w2 = 1, bias b = 0
• [0, 0]: 0 + 0 + 0 = 0 → output 0
• [0, 1]: 0 + 1 + 0 = 1 → output 1
• [1, 0]: 1 + 0 + 0 = 1 → output 1
• [1, 1]: 1 + 1 + 0 = 2 → output 1
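A small Python sketch of these perceptrons, checking the AND and OR truth tables with the weight and bias values shown above:

    def perceptron(x1, x2, w1, w2, b):
        # Perceptron rule: output 1 if the weighted sum plus bias is positive, else 0.
        return 1 if w1 * x1 + w2 * x2 + b > 0 else 0

    for x1 in (0, 1):
        for x2 in (0, 1):
            and_out = perceptron(x1, x2, w1=1, w2=1, b=-1)  # AND
            or_out = perceptron(x1, x2, w1=1, w2=1, b=0)    # OR
            print(x1, x2, and_out, or_out)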
Not possible to capture XOR with a perceptron!
The perceptron equation, given inputs x1 and x2, is the equation of a line:
w1x1 + w2x2 + b = 0
in standard linear format: x2 = (−w1/w2)x1 + (−b/w2)
This line acts as a decision boundary
• Output 0 if the input is on one side of the line
• Output 1 if it is on the other side
• For example, with the AND weights above (w1 = w2 = 1, b = −1), the boundary is the line x2 = −x1 + 1
Decision boundaries
[Figure: decision boundaries in the x1–x2 plane for (a) AND, (b) OR, (c) XOR; each panel plots the four inputs (0,0), (0,1), (1,0), (1,1)]
Filled circles represent perceptron outputs of 1,
white circles perceptron outputs of 0
There is no way to draw a single line that correctly separates
the two categories for XOR
The solution: neural networks
XOR can be computed by a layered network of units
For example, using two layers of ReLU-based units:
• The middle (hidden) layer, called h, has two units
• The output layer, called y, has one unit
The solution: neural networks
Worked example, input x = [0, 0]:
• Hidden layer pre-activations: [0, −1]
• After ReLU, h = [0, 0]
• Final output: 0
Worked example, input x = [0, 1]:
• Hidden layer pre-activations: [1, 0]
• After ReLU, h = [1, 0]
• Final output: 1
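A minimal NumPy sketch of this XOR network. The specific weights (W = [[1, 1], [1, 1]], b = [0, −1], U = [1, −2]) are an assumption, chosen to be consistent with the worked values above:

    import numpy as np

    def relu(z):
        return np.maximum(z, 0)

    # Assumed weights, consistent with the worked examples above:
    # each hidden unit sums both inputs, the second has bias -1,
    # and the output unit computes h1 - 2*h2.
    W = np.array([[1.0, 1.0],
                  [1.0, 1.0]])
    b = np.array([0.0, -1.0])
    U = np.array([1.0, -2.0])

    def xor_net(x):
        h = relu(W @ np.asarray(x, dtype=float) + b)   # hidden layer
        return U @ h                                    # output layer

    for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
        print(x, xor_net(x))   # prints 0, 1, 1, 0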
The hidden representation h
The inputs x = [0, 1] and x = [1, 0] are
both mapped to the same hidden
representation h = [1, 0]
This merger makes it easy
to linearly separate the
positive and negative
cases of XOR
Feedforward Neural Networks
Simplest kind of neural network
Multilayer network
Units are connected with no cycles
Outputs from units in each layer are passed to units in the
next higher layer
No outputs are passed back to lower layers
Sometimes called multi-layer perceptrons
Feedforward Neural Networks
Feedforward networks have three
kinds of nodes
• input units, hidden units, and
output units
Layers are fully-connected
• Each unit takes as input the outputs from
the previous layer
• There is a link between every pair of units
from two adjacent layers
Hidden layer computation
3 steps: multiply the weight matrix W by the input vector x, add the
bias vector b, apply the activation function g (here, the sigmoid σ)
h = σ(Wx + b)
The activation function applies elementwise:
g([z1, z2, z3]) = [g(z1), g(z2), g(z3)]
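A minimal NumPy sketch of the hidden-layer computation, with an assumed 3-unit hidden layer over a 2-dimensional input (the weight and bias values are made up for illustration):

    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    # Assumed shapes: 2 inputs, 3 hidden units.
    W = np.array([[0.1, 0.2],
                  [0.3, 0.4],
                  [0.5, 0.6]])          # weight matrix, shape (3, 2)
    b = np.array([0.1, 0.1, 0.1])       # bias vector, shape (3,)
    x = np.array([1.0, 2.0])            # input vector, shape (2,)

    h = sigmoid(W @ x + b)              # h = sigma(Wx + b), applied elementwise
    print(h)                            # vector of 3 activations in (0, 1)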
Output layer computation
weight matrix U, U ∈ ℝ^(n2×n1)
input vector h
intermediate output z, z ∈ ℝ^n2, where z = Uh
z is a vector of real-valued numbers
It can't directly be the output of the classifier
Softmax
Normalizes a vector of real values
into a vector that encodes a probability
distribution: values between 0 and 1
that sum to 1
softmax(zi) = exp(zi) / Σj exp(zj)
Used for multiclass classification
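A short NumPy sketch of softmax (subtracting the maximum is a standard trick for numerical stability, not part of the definition; the sample z values are made up):

    import numpy as np

    def softmax(z):
        # Exponentiate and normalize so outputs lie in (0, 1) and sum to 1.
        exp_z = np.exp(z - np.max(z))   # shift by max(z) for numerical stability
        return exp_z / exp_z.sum()

    z = np.array([0.6, 1.1, -1.5, 1.2, 3.2, -1.1])
    print(softmax(z), softmax(z).sum())   # probabilities summing to 1.0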
Final Equation
h = σ(Wx+b)
z = Uh
y = softmax(z)
where x ∈ ℝ^n0, h ∈ ℝ^n1, b ∈ ℝ^n1, W ∈ ℝ^(n1×n0), U ∈ ℝ^(n2×n1), and the
output vector y ∈ ℝ^n2
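Putting it together, a minimal sketch of the full forward pass, with assumed dimensions n0 = 2, n1 = 3, n2 = 4 and randomly initialized weights for illustration:

    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    def softmax(z):
        exp_z = np.exp(z - np.max(z))
        return exp_z / exp_z.sum()

    rng = np.random.default_rng(0)
    n0, n1, n2 = 2, 3, 4               # input, hidden, and output sizes (assumed)
    W = rng.normal(size=(n1, n0))      # hidden-layer weights
    b = rng.normal(size=n1)            # hidden-layer bias
    U = rng.normal(size=(n2, n1))      # output-layer weights

    x = np.array([1.0, 0.5])           # input vector, x in R^n0
    h = sigmoid(W @ x + b)             # hidden layer
    z = U @ h                          # intermediate output
    y = softmax(z)                     # probability distribution over n2 classes
    print(y, y.sum())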