Lecture 1: About Deep Learning
State of the Art of AI with Deep Learning
Geoffrey Hinton is known by many to be the godfather of deep learning. Aside from his seminal 1986 paper on backpropagation, Hinton has invented several foundational deep learning techniques throughout his decades-long career. Hinton currently splits his time between the University of Toronto and Google Brain.
AI: Emulates Human Intelligence
ML: Emulates Human Learning
DL: Emulates Networks of Neurons in the Human Brain
Popular deep learning languages and frameworks
1943, Warren McCulloch and Walter Pitts: Neuron
[Figure: a single artificial neuron. Inputs $x_1, x_2, x_3$ with weights $w_1, w_2, w_3$ and a bias $b$ feed a sum $\Sigma$ followed by an activation function $f$, producing the output $y$.]
A perceptron (a specific type of artificial neuron) is a single-layer linear model plus an activation function $f$. We can find weights ($w$) and a bias ($b$) that minimize a loss function using gradient descent.
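To make this concrete, here is a minimal sketch (not from the slides) of a single sigmoid neuron trained with gradient descent on a toy AND-gate dataset; the data, learning rate, and number of epochs are illustrative assumptions.

import numpy as np

# Toy dataset (assumed for illustration): the logical AND function
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([0, 0, 0, 1], dtype=float)

w = np.zeros(2)    # weights
b = 0.0            # bias
lr = 0.5           # learning rate (assumed)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(10000):
    z = X @ w + b                      # weighted sum plus bias
    y = sigmoid(z)                     # activation function f
    grad_z = (y - t) * y * (1 - y)     # d(squared loss)/dz for each sample
    w -= lr * X.T @ grad_z             # gradient descent step on the weights
    b -= lr * grad_z.sum()             # gradient descent step on the bias

print(np.round(sigmoid(X @ w + b), 2))   # outputs move toward [0, 0, 0, 1]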
1986, David Everett Rumelhart: Backpropagation
[Figure: a feedforward network with an input layer ($x_1, x_2, x_3, \dots, x_n$), hidden layers ($h_1^{(1)}, h_2^{(1)}, \dots, h_m^{(1)}, \dots$), and an output layer ($y_1, \dots, y_k$).]
A feedforward network trained with backpropagation uses a layered architecture in which information flows in one direction, from input to output; errors are then backpropagated to adjust the network's weights.
Single Neuron
inputs = [1, 2, 3]
weights = [0.2, 0.8, -0.5]
bias = 2

When modelling a single neuron, we have one bias (one bias value per neuron). The bias is an additional tunable value, but in contrast to the weights it is not associated with any input.
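Putting the numbers above together, the neuron's output is the sum of each input times its weight, plus the bias (no activation is applied in this minimal sketch):

inputs = [1, 2, 3]
weights = [0.2, 0.8, -0.5]
bias = 2

output = (inputs[0] * weights[0] +
          inputs[1] * weights[1] +
          inputs[2] * weights[2] + bias)
print(output)   # 0.2 + 1.6 - 1.5 + 2 = 2.3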
A Layer of Neurons
Neural networks typically have layers that consist of more than one neuron. Layers are nothing more than groups of neurons. Each neuron in a layer takes exactly the same input (the input given to the layer, which can be either the training data or the output from the previous layer), but has its own set of weights and its own bias, producing its own unique output. The layer's output is a set of these outputs, one per neuron. Let's say we have a scenario with 3 neurons in a layer and 4 inputs:
inputs = [1, 2, 3, 2.5]

weights1 = [0.2, 0.8, -0.5, 1]
weights2 = [0.5, -0.91, 0.26, -0.5]
weights3 = [-0.26, -0.27, 0.17, 0.87]

bias1 = 2
bias2 = 3
bias3 = 0.5

outputs = [
    # Neuron 1:
    inputs[0] * weights1[0] +
    inputs[1] * weights1[1] +
    inputs[2] * weights1[2] +
    inputs[3] * weights1[3] + bias1,
    # Neuron 2:
    inputs[0] * weights2[0] +
    inputs[1] * weights2[1] +
    inputs[2] * weights2[2] +
    inputs[3] * weights2[3] + bias2,
    # Neuron 3:
    inputs[0] * weights3[0] +
    inputs[1] * weights3[1] +
    inputs[2] * weights3[2] +
    inputs[3] * weights3[3] + bias3,
]
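The same layer can be computed more compactly with a dot product; a short sketch using NumPy (NumPy is an addition here, since the slide code is plain Python):

import numpy as np

inputs = [1, 2, 3, 2.5]
weights = [[0.2, 0.8, -0.5, 1],
           [0.5, -0.91, 0.26, -0.5],
           [-0.26, -0.27, 0.17, 0.87]]
biases = [2, 3, 0.5]

# One row of `weights` per neuron; np.dot produces all three weighted sums at once.
outputs = np.dot(weights, inputs) + biases
print(outputs)   # [4.8   1.21  2.385]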
https://nnfs.io/mxo
Neural Network Activation

$$a_1^{(1)} = \sigma\!\left(w_{1,1}\, a_1^{(0)} + w_{1,2}\, a_2^{(0)} + \dots + w_{1,n}\, a_n^{(0)} + b_1^{(0)}\right) = \sigma\!\left(\sum_{i=1}^{n} w_{1,i}\, a_i^{(0)} + b_1^{(0)}\right)$$

In matrix form:

$$\begin{bmatrix} a_1^{(1)} \\ a_2^{(1)} \\ \vdots \\ a_m^{(1)} \end{bmatrix} = \sigma\!\left(\begin{bmatrix} w_{1,1} & w_{1,2} & \cdots & w_{1,n} \\ w_{2,1} & w_{2,2} & \cdots & w_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ w_{m,1} & w_{m,2} & \cdots & w_{m,n} \end{bmatrix} \begin{bmatrix} a_1^{(0)} \\ a_2^{(0)} \\ \vdots \\ a_n^{(0)} \end{bmatrix} + \begin{bmatrix} b_1^{(0)} \\ b_2^{(0)} \\ \vdots \\ b_m^{(0)} \end{bmatrix}\right)$$

or, compactly,

$$a^{(1)} = \sigma\!\left(W^{(0)} a^{(0)} + b^{(0)}\right)$$

Number of inputs = $n$, number of neurons = $m$; number of weights = $n \times m$.
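A small sketch of this matrix form with $n = 4$ inputs and $m = 3$ neurons (NumPy and the choice of a sigmoid activation are assumptions for illustration):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n, m = 4, 3                     # inputs, neurons
a0 = np.random.randn(n)         # previous-layer activations a^(0)
W0 = np.random.randn(m, n)      # weight matrix W^(0): one row per neuron, n x m weights in total
b0 = np.random.randn(m)         # one bias per neuron, b^(0)

a1 = sigmoid(W0 @ a0 + b0)      # a^(1) = sigma(W^(0) a^(0) + b^(0))
print(a1.shape)                 # (3,)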
Multilayer Perceptron
Activation Functions
Sigmoid:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Tanh:
$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

ReLU:
$$f(x) = \max(0, x)$$

Softmax (3-class example):
$$\sigma(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}, \qquad j = 1, 2, 3$$

[Plots of $\sigma(x)$, $\tanh(x)$, $f(x) = \max(0, x)$, and $\mathrm{softmax}(x_i)$.]
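The four activation functions can be written directly; a short NumPy sketch (NumPy is an assumption, not shown on the slide):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def relu(x):
    return np.maximum(0, x)

def softmax(x):
    e = np.exp(x - np.max(x))    # subtract the max for numerical stability
    return e / e.sum()

x = np.array([1.0, 2.0, 3.0])    # 3-class example
print(softmax(x))                # approximately [0.09 0.24 0.67], sums to 1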
Shape Recognition: Concept
Shape Recognition: Example
Digit Recognition: Concept
Layers break the problem into pieces
Backpropagation: chain rule
Conceptually:
Backpropagation: chain rule
$a^{(L-1)}$ is influenced by its own weight and bias, which means our tree actually extends up higher...
Computing the first derivative
How sensitive the cost $C_0$ is to small changes in the weight $w^{(L)}$, i.e., $\frac{\partial C_0}{\partial w^{(L)}}$.
The Constituent Derivatives
1. To compute each derivative, we use the relevant formula from the way we've defined our neural network:

$$z^{(L)} = w^{(L)} a^{(L-1)} + b^{(L)} \;\Longrightarrow\; \frac{\partial z^{(L)}}{\partial w^{(L)}} = a^{(L-1)}$$

$$a^{(L)} = \sigma(z^{(L)}) \;\Longrightarrow\; \frac{\partial a^{(L)}}{\partial z^{(L)}} = \sigma'(z^{(L)})$$

$$C_0 = (a^{(L)} - y)^2 \;\Longrightarrow\; \frac{\partial C_0}{\partial a^{(L)}} = 2(a^{(L)} - y)$$

2. By the chain rule,

$$\frac{\partial C_0}{\partial w^{(L)}} = \frac{\partial z^{(L)}}{\partial w^{(L)}} \, \frac{\partial a^{(L)}}{\partial z^{(L)}} \, \frac{\partial C_0}{\partial a^{(L)}}$$

3. Putting this together with our constituent derivatives:

$$\frac{\partial C_0}{\partial w^{(L)}} = a^{(L-1)} \, \sigma'(z^{(L)}) \, 2(a^{(L)} - y)$$
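A quick numerical check of this result (the values of $a^{(L-1)}$, $w^{(L)}$, $b^{(L)}$, and $y$ below are arbitrary assumptions), comparing the analytic gradient with a finite-difference estimate:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

a_prev, b, y = 0.6, -0.3, 1.0        # a^(L-1), b^(L), target y (assumed)
w = 1.5                              # w^(L) (assumed)

def cost(w):
    z = w * a_prev + b               # z^(L) = w^(L) a^(L-1) + b^(L)
    a = sigmoid(z)                   # a^(L) = sigma(z^(L))
    return (a - y) ** 2              # C0 = (a^(L) - y)^2

z = w * a_prev + b
a = sigmoid(z)
analytic = a_prev * sigmoid(z) * (1 - sigmoid(z)) * 2 * (a - y)   # a^(L-1) sigma'(z) 2(a - y)

eps = 1e-6
numeric = (cost(w + eps) - cost(w - eps)) / (2 * eps)             # central finite difference

print(analytic, numeric)             # the two values closely agree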
Ex1: Chain rule of differentiation
Ex1: Compute the gradient of the cost with respect to the initial weight, $\frac{\partial c}{\partial w_4}$.
Ex2: Weight Initialization
1. Zero initialization
2. Constant initialization
3. Random initialization with very small values
4. Random initialization with very large values
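As a concrete sketch (not slide code; the scale factors 0.01 and 10 are illustrative assumptions), here is what the four strategies look like for a layer with n inputs and m neurons:

import numpy as np

n, m = 4, 3   # inputs, neurons in the layer

# 1. Zero initialization: every neuron computes the same output and receives the
#    same gradient, so the neurons never differentiate (symmetry problem).
W_zero = np.zeros((m, n))

# 2. Constant initialization: same symmetry problem as zeros.
W_const = np.full((m, n), 0.5)

# 3. Random initialization with very small values: breaks symmetry, but
#    activations and gradients can shrink toward zero in deep networks.
W_small = 0.01 * np.random.randn(m, n)

# 4. Random initialization with very large values: tends to saturate
#    sigmoid/tanh units, which makes gradients vanish.
W_large = 10 * np.random.randn(m, n)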
Functional Gradients
Taking the Gradient – Review
$$f(x) = (-x + 3)^2$$

Decompose: $f = q^2$, $\; q = r + 3$, $\; r = -x$, so that

$$\frac{\partial f}{\partial q} = 2q, \qquad \frac{\partial q}{\partial r} = 1, \qquad \frac{\partial r}{\partial x} = -1$$

Chain rule:

$$\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q}\,\frac{\partial q}{\partial r}\,\frac{\partial r}{\partial x} = 2q \cdot 1 \cdot (-1) = -2(-x + 3) = 2x - 6$$
Let’s Do This Another Way
[Diagram: a function node $f$ receives an input $n$ from an upstream node $g$ and outputs $f(n)$; during the backward pass the gradient $\frac{\partial f}{\partial n}$ flows back from $f$ to $g$.]
Let’s Do This Another Way: Functional Diagrams
$$f(x) = (-x + 3)^2$$

[Computational graph: $x \rightarrow (-n) \rightarrow (n + 3) \rightarrow (n^2)$, with forward values $-x$, $-x + 3$, and $(-x + 3)^2$.]

Backward pass, starting at the output with gradient 1: the squaring node gives $\frac{d}{dn}\, n^2 = 2n = 2(-x + 3) = -2x + 6$, so the gradient passed back is $(-2x + 6) \cdot 1$.
Let’s Do This Another Way
Next, the $n + 3$ node: $\frac{d}{dn}(n + 3) = 1$, so the gradient passed back is $1 \cdot (-2x + 6) = -2x + 6$.
Let’s Do This Another Way
Then the $-n$ node: $\frac{d}{dn}(-n) = -1$, so the gradient passed back to $x$ is $-1 \cdot (-2x + 6) = 2x - 6$.
Let’s Do This Another Way
[Completed graph for $f(x) = (-x + 3)^2$: forward values $x$, $-x$, $-x + 3$, $(-x + 3)^2$; backward gradients $2x - 6$ (at $x$), $-2x + 6$, $-2x + 6$, and $1$ (at the output).]
Functional Gradients: Gates
Once more, with numbers!
$$f(x, y, z) = (x + y)\,z$$

[Graph: the inputs $1$ and $4$ feed an add node ($n + m$), giving $5$; that result and $z = 10$ feed a multiply node ($n * m$), giving $50$.]
Backward through the multiply node: $\frac{\partial}{\partial n}(nm) = m \rightarrow 10 \cdot 1 = 10$ is the gradient passed back to the add node, and $\frac{\partial}{\partial m}(nm) = n \rightarrow 5 \cdot 1 = 5$ is the gradient passed back to $z$.
Backward through the add node: $\frac{\partial}{\partial n}(n + m) = 1 \rightarrow 1 \cdot 10 \cdot 1 = 10$ and $\frac{\partial}{\partial m}(n + m) = 1 \rightarrow 1 \cdot 10 \cdot 1 = 10$; the add node passes the gradient $10$ unchanged to both of its inputs.
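A small sketch of the same example in plain Python, running the forward pass and then backpropagating through the two gates by hand (variable names are illustrative):

x, y, z = 1.0, 4.0, 10.0    # the input values used above

# Forward pass
q = x + y                   # add gate:      q = 5
f = q * z                   # multiply gate: f = 50

# Backward pass, starting with df/df = 1 at the output
df = 1.0
dq = z * df                 # multiply gate passes back the *other* input: 10
dz = q * df                 #                                               5
dx = 1.0 * dq               # add gate passes the gradient through unchanged: 10
dy = 1.0 * dq               #                                                  10
print(f, dx, dy, dz)        # 50.0 10.0 10.0 5.0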
Something More Complex
$$f(w, x) = \frac{1}{1 + e^{-(w_0 x_0 + w_1 x_1 + w_2)}}$$

[Graph: multiply nodes ($n * m$) compute $w_0 x_0$ and $w_1 x_1$; two add nodes ($n + m$) sum them with $w_2$; the result then passes through the gates $n \cdot (-1)$, $e^{n}$, $n + 1$, and $1/n$.]
Forward pass with $w_0 = 2$, $x_0 = -1$, $w_1 = -3$, $x_1 = -2$, $w_2 = -3$:

$$w_0 x_0 = -2, \quad w_1 x_1 = 6, \quad -2 + 6 = 4, \quad 4 + w_2 = 1, \quad 1 \cdot (-1) = -1, \quad e^{-1} = 0.37, \quad 0.37 + 1 = 1.37, \quad 1/1.37 = 0.73$$
The local derivative rules used in the backward pass:

(a) $\frac{\partial}{\partial n}(m + n) = 1$
(b) $\frac{\partial}{\partial n}(mn) = m$
(c) $\frac{\partial}{\partial n}(e^{n}) = e^{n}$
(d) $\frac{\partial}{\partial n}(n^{-1}) = -n^{-2}$
(e) $\frac{\partial}{\partial n}(an) = a$
(f) $\frac{\partial}{\partial n}(c + n) = 1$

Backward pass, gate by gate (the upstream gradient at the output is 1):

$1/n$ gate (rule d): $-(1.37)^{-2} \cdot 1 = -0.53$. Where does the 1.37 come from? It is the value that entered this gate during the forward pass.
$n + 1$ gate (rule f): $1 \cdot (-0.53) = -0.53$.
$e^{n}$ gate (rule c): $e^{-1} \cdot (-0.53) = -0.2$.
$n \cdot (-1)$ gate (rule e, with $a = -1$): $-1 \cdot (-0.2) = 0.2$.
Add gate with $w_2$ (rule a): $1 \cdot 0.2 = 0.2$, passed back both to $w_2$ and to the output of the first add gate.
First add gate (rule a): $1 \cdot 0.2 = 0.2$, passed back to both $w_0 x_0$ and $w_1 x_1$; the gradient at $w_2$ is $0.2$.
Multiply gate for $w_0 x_0$ (rule b): $\frac{\partial f}{\partial w_0} = x_0 \cdot 0.2 = -1 \cdot 0.2 = -0.2$ and $\frac{\partial f}{\partial x_0} = w_0 \cdot 0.2 = 2 \cdot 0.2 = 0.4$.
Multiply gate for $w_1 x_1$ (rule b): $\frac{\partial f}{\partial w_1} = x_1 \cdot 0.2 = -2 \cdot 0.2 = -0.4$ and $\frac{\partial f}{\partial x_1} = w_1 \cdot 0.2 = -3 \cdot 0.2 = -0.6$.
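The whole gate-by-gate computation can be checked with a short script; this is a sketch rather than slide code, using the same inputs ($w_0 = 2$, $x_0 = -1$, $w_1 = -3$, $x_1 = -2$, $w_2 = -3$):

import math

# Same inputs as the worked example above
w0, x0, w1, x1, w2 = 2.0, -1.0, -3.0, -2.0, -3.0

# Forward pass, one gate at a time
s   = w0 * x0 + w1 * x1 + w2   # two multiply gates and two add gates: 1.0
neg = -1.0 * s                 # *(-1) gate: -1.0
ex  = math.exp(neg)            # e^n gate:   ~0.37
den = ex + 1.0                 # +1 gate:    ~1.37
f   = 1.0 / den                # 1/n gate:   ~0.73

# Backward pass (upstream gradient at the output is 1)
dden = -1.0 / den ** 2         # 1/n gate:   ~ -0.53
dex  = 1.0 * dden              # +1 gate:    ~ -0.53
dneg = math.exp(neg) * dex     # e^n gate:   ~ -0.2
ds   = -1.0 * dneg             # *(-1) gate: ~ +0.2
dw2  = 1.0 * ds                # add gate passes the gradient through
dw0, dx0 = x0 * ds, w0 * ds    # multiply gate swaps inputs: ~ -0.2, ~ 0.4
dw1, dx1 = x1 * ds, w1 * ds    # ~ -0.4, ~ -0.6

print(f, dw0, dx0, dw1, dx1, dw2)   # ~0.73, -0.2, 0.4, -0.4, -0.6, 0.2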
Does It Have To Be So Painful?
$$f(w, x) = \frac{1}{1 + e^{-(w_0 x_0 + w_1 x_1 + w_2)}}$$

[Graph: as before, multiply and add nodes compute $w_0 x_0 + w_1 x_1 + w_2$; the tail of gates $n \cdot (-1) \rightarrow e^{n} \rightarrow n + 1 \rightarrow 1/n$ can be collapsed into a single sigmoid gate, $\sigma(n) = \frac{1}{1 + e^{-n}}$.]
Does It Have To Be So Painful?
$$\sigma(n) = \frac{1}{1 + e^{-n}}$$

$$\frac{\partial}{\partial n}\,\sigma(n) = \frac{e^{-n}}{(1 + e^{-n})^2} = \frac{1 + e^{-n} - 1}{1 + e^{-n}} \cdot \frac{1}{1 + e^{-n}} = \left(\frac{1 + e^{-n}}{1 + e^{-n}} - \frac{1}{1 + e^{-n}}\right)\sigma(n) = \big(1 - \sigma(n)\big)\,\sigma(n)$$
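A quick numerical sanity check of this shortcut (the test point n = 0.5 is arbitrary), comparing $(1 - \sigma(n))\,\sigma(n)$ against a finite-difference estimate:

import math

def sigmoid(n):
    return 1.0 / (1.0 + math.exp(-n))

n = 0.5                                              # arbitrary test point
shortcut = (1.0 - sigmoid(n)) * sigmoid(n)           # (1 - sigma(n)) * sigma(n)

eps = 1e-6
numeric = (sigmoid(n + eps) - sigmoid(n - eps)) / (2 * eps)   # finite difference

print(shortcut, numeric)   # the two values agree to several decimal places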
Ex3: Compute the upstream and downstream gradients of the following functional graph.

Given $w_0 = 2$, $x_0 = 1$, $w_1 = -3$, $x_1 = 4$, $w_2 = -5$:

$$f(w, x) = \frac{1}{1 + e^{-(w_0 x_0 + w_1 x_1 + w_2)}}$$

[Graph: multiply nodes ($n * m$) compute $w_0 x_0$ and $w_1 x_1$; add nodes ($n + m$) sum them with $w_2$; the result passes through a single sigmoid node $\sigma(n)$.]

$$\sigma(n) = \frac{1}{1 + e^{-n}}, \qquad \frac{\partial \sigma(n)}{\partial n} = (1 - \sigma(n))\,\sigma(n)$$
Any Questions