Lecture 1: About Deep Learning

The document provides an introduction to deep learning, highlighting key figures like Geoffrey Hinton and foundational concepts such as neural networks and backpropagation. It discusses the architecture of neural networks, including perceptrons, activation functions, and the role of weights and biases in learning. Additionally, it covers practical aspects of implementing deep learning techniques, including gradient computation and weight initialization strategies.


Introduction to Deeplearning

Introduction to Deeplearning 1 / 46
State-of-the-art of AI with Deeplearning

Introduction to Deeplearning 2 / 46
Geoffrey Hinton is known by many to be the godfather of deep learning. Aside from his seminal 1986 paper on backpropagation, Hinton has invented several foundational deep learning techniques throughout his decades-long career. Hinton currently splits his time between the University of Toronto and Google Brain.

3 / 46
AI: Emulates Human Intelligence
ML: Emulates Human Learning
DL: Emulates Networks of Neurons in the Human Brain
4 / 46
Popular deep learning language and framework

5 / 46
1943, Warren McCulloch and Walter Pitts: Neuron

(a) Biological neuron (b) Artificial neuron

Neural nets/perceptrons are loosely inspired by biology. But they certainly are not a model of how the brain works, or even how neurons work.¹

¹ Diagram credit: Karpathy and Fei-Fei


Introduction to Deeplearning 6 / 46
1957, Frank Rosenblatt: Perceptron

[Diagram: a perceptron with inputs x1, x2, x3, weights w1, w2, w3, a bias b, a summation Σ, an activation function f, and an output y.]
A perceptron (a specific type of artificial neuron) is a single-layer linear model plus an activation function f. We can find weights (w) and a bias (b) that minimize a loss function using gradient descent.
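
A minimal sketch of that idea (my own example, not from the slides): a single sigmoid neuron trained with gradient descent on a squared-error loss for the AND function.

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # inputs
y = np.array([0, 0, 0, 1], dtype=float)                      # AND targets

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=2)   # weights
b = 0.0                             # bias
lr = 1.0                            # learning rate

for epoch in range(10000):
    z = X @ w + b                        # weighted sum plus bias
    a = sigmoid(z)                       # activation f
    grad_z = 2 * (a - y) * a * (1 - a)   # d(squared error)/dz per sample
    w -= lr * X.T @ grad_z / len(X)      # gradient descent step on the weights
    b -= lr * grad_z.mean()              # ... and on the bias

print(np.round(sigmoid(X @ w + b), 2))   # outputs move toward [0, 0, 0, 1]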

Introduction to Deeplearning 7 / 46
1986, David Everett Rumelhart: Backpropagation
[Diagram: a feedforward network with an input layer (x1 ... xn), hidden layers (h1(1) ... hm(1), ...), and an output layer (y1 ... yk).]
A feedforward network trained with backpropagation uses a layered architecture in which information flows in one direction, from input to output; errors are then backpropagated to adjust the network's weights.

Introduction to Deeplearning 16 / 46
Single Neuron

inputs = [1, 2, 3]
weights = [0.2, 0.8, -0.5]
bias = 2

# Modelling a single neuron, we have one bias (one bias value per neuron).
# The bias is an additional tuneable value, but unlike the weights it is not
# associated with any input.
output = (inputs[0] * weights[0] +
          inputs[1] * weights[1] +
          inputs[2] * weights[2] + bias)

https://nnfs.io/bkr

17 / 46
A Layer of Neurons
Neural networks typically have layers that consist of more than one neuron. Layers are nothing more than groups of neurons. Each neuron in a layer takes exactly the same input (the input given to the layer, which can be either the training data or the output of the previous layer), but has its own set of weights and its own bias, producing its own unique output. The layer's output is the set of these outputs, one per neuron. Let's say we have a scenario with 3 neurons in a layer and 4 inputs:

18 / 46
inputs = [1, 2, 3, 2.5]
weights1 = [0.2, 0.8, -0.5, 1]
weights2 = [0.5, -0.91, 0.26, -0.5]
weights3 = [-0.26, -0.27, 0.17, 0.87]
bias1 = 2
bias2 = 3
bias3 = 0.5

outputs = [
    # Neuron 1:
    inputs[0] * weights1[0] +
    inputs[1] * weights1[1] +
    inputs[2] * weights1[2] +
    inputs[3] * weights1[3] + bias1,
    # Neuron 2:
    inputs[0] * weights2[0] +
    inputs[1] * weights2[1] +
    inputs[2] * weights2[2] +
    inputs[3] * weights2[3] + bias2,
    # Neuron 3:
    inputs[0] * weights3[0] +
    inputs[1] * weights3[1] +
    inputs[2] * weights3[2] +
    inputs[3] * weights3[3] + bias3]
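
The same layer can be written as one dot product per neuron. A possible NumPy version of the snippet above (a sketch, using the same numbers):

import numpy as np

inputs = np.array([1.0, 2.0, 3.0, 2.5])
weights = np.array([[0.2, 0.8, -0.5, 1.0],      # each row holds one neuron's weights
                    [0.5, -0.91, 0.26, -0.5],
                    [-0.26, -0.27, 0.17, 0.87]])
biases = np.array([2.0, 3.0, 0.5])

outputs = np.dot(weights, inputs) + biases      # one output per neuron
print(outputs)                                  # 4.8, 1.21, 2.385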
19 / 46
https://nnfs.io/mxo
20 / 46
Neural Network Activation

For the first neuron of layer 1:

a_1^{(1)} = σ( w_{1,1} a_1^{(0)} + w_{1,2} a_2^{(0)} + ... + w_{1,n} a_n^{(0)} + b_1^{(0)} )
          = σ( Σ_{i=1}^{n} w_{1,i} a_i^{(0)} + b_1^{(0)} )

For the whole layer, in matrix form:

\begin{bmatrix} a_1^{(1)} \\ a_2^{(1)} \\ \vdots \\ a_m^{(1)} \end{bmatrix}
= \sigma\left(
\begin{bmatrix} w_{1,1} & w_{1,2} & \cdots & w_{1,n} \\ w_{2,1} & w_{2,2} & \cdots & w_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ w_{m,1} & w_{m,2} & \cdots & w_{m,n} \end{bmatrix}
\begin{bmatrix} a_1^{(0)} \\ a_2^{(0)} \\ \vdots \\ a_n^{(0)} \end{bmatrix}
+ \begin{bmatrix} b_1^{(0)} \\ b_2^{(0)} \\ \vdots \\ b_m^{(0)} \end{bmatrix}
\right)

a^{(1)} = σ( W^{(0)} a^{(0)} + b^{(0)} )

# Inputs = n, # Neurons = m; # Weights = n × m.
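
A minimal NumPy sketch of the matrix form a(1) = σ(W(0) a(0) + b(0)); the layer sizes and random values are illustrative, not from the slides.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n, m = 4, 3                        # n inputs, m neurons
rng = np.random.default_rng(42)
W = rng.normal(size=(m, n))        # one row of weights per neuron (m x n)
a0 = rng.normal(size=n)            # previous layer's activations a(0)
b = np.zeros(m)                    # one bias per neuron

a1 = sigmoid(W @ a0 + b)           # a(1) = sigma(W(0) a(0) + b(0))
print(a1.shape)                    # (3,)  -- one activation per neuron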
Introduction to Deeplearning 21 / 46
Multilayer Perceptron

22 / 46
Activation Functions

Sigmoid:  σ(x) = 1 / (1 + e^{−x})

Tanh:  tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x})

ReLU:  f(x) = max(0, x)

Softmax (3-class example):  σ(x_i) = e^{x_i} / Σ_j e^{x_j},  j = 1, 2, 3

[Plots: σ(x), tanh(x), f(x) = max(0, x), and softmax(x_i), each over roughly x ∈ [−5, 5].]
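
Minimal NumPy implementations of the four activations above (a sketch for experimentation, not library code):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))   # same as np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - np.max(x))     # subtract the max for numerical stability
    return e / e.sum()

x = np.array([-2.0, 0.0, 3.0])
print(sigmoid(x), tanh(x), relu(x), softmax(x), sep="\n")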

Introduction to Deeplearning 23 / 46
Shape Recognition: Concept

Introduction to Deeplearning 24 / 46
Shape Recognition: Example

Introduction to Deeplearning 25 / 46
Digit Recognition: Concept

Introduction to Deeplearning 26 / 46
Layers break problem in pieces

Introduction to Deeplearning 27 / 46
Backpropagation: chain rule

1. The last activation a(L) = σ(w(L) a(L−1) + b(L)) is determined by a weight, a bias, and the previous neuron's activation (through a nonlinear function like sigmoid/ReLU).

2. Weighted sum: z(L) = w(L) a(L−1) + b(L)  =⇒  a(L) = σ(z(L))

3. Conceptually, [diagram of the dependency chain: w(L), b(L), a(L−1) → z(L) → a(L) → C0]

Introduction to Deeplearning 28 / 46
Backpropagation: chain rule

1. a(L−1) is influenced by its own weight and bias, which means our tree actually extends up higher...

Introduction to Deeplearning 29 / 46
Computing the first derivative

1. How sensitive is the cost C0 to small changes in the weight w(L), i.e., what is ∂C0/∂w(L)?

Introduction to Deeplearning 30 / 46
The Constituent Derivatives

1. To compute each derivative, we'll use some relevant formula from the way we've defined our neural network:

   z(L) = w(L) a(L−1) + b(L)   =⇒   ∂z(L)/∂w(L) = a(L−1)
   a(L) = σ(z(L))              =⇒   ∂a(L)/∂z(L) = σ′(z(L))
   C0 = (a(L) − y)^2           =⇒   ∂C0/∂a(L) = 2(a(L) − y)

2. By the chain rule,

   ∂C0/∂w(L) = (∂z(L)/∂w(L)) (∂a(L)/∂z(L)) (∂C0/∂a(L))

3. Putting this together with our constituent derivatives,

   ∂C0/∂w(L) = a(L−1) σ′(z(L)) · 2(a(L) − y)
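
A scalar sketch (my own numbers) that checks the assembled derivative against a finite-difference estimate:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b, a_prev, y = 0.7, -0.3, 0.5, 1.0    # made-up scalar values

def cost(w):
    z = w * a_prev + b                   # z(L)
    a = sigmoid(z)                       # a(L)
    return (a - y) ** 2                  # C0

z = w * a_prev + b
a = sigmoid(z)
analytic = a_prev * sigmoid(z) * (1 - sigmoid(z)) * 2 * (a - y)   # a(L-1) * sigma'(z) * 2(a - y)

eps = 1e-6
numeric = (cost(w + eps) - cost(w - eps)) / (2 * eps)
print(analytic, numeric)                 # the two values should agree closely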

Introduction to Deeplearning 31 / 46
Ex1: Chain rule of differentiation

1. The rate of change of a function of a function is the product of the derivatives of those functions.

2. ∂/∂x f(g(x)) = (∂g/∂x) (∂f/∂g)

3. ∂/∂x f(g(h(i(j(k(x)))))) = (∂k/∂x) (∂j/∂k) (∂i/∂j) (∂h/∂i) (∂g/∂h) (∂f/∂g)

Ex1: Compute the gradient of the cost with respect to the initial weight, ∂c/∂w4.

Introduction to Deeplearning 32 / 46
Ex1: Chain rule of differentiation

Ex1: Compute the gradient of the cost with respect to the initial weight, ∂c/∂w4.

Introduction to Deeplearning 33 / 46
Ex2: Weights Initialization

Experiment with weight initialization of the NN for ReLU, Tanh, and Sigmoid activation functions and comment (a minimal sketch follows the list below):

1 Zero initialization
2 Constant initialization
3 Random initialization with very small values
4 Random initialization with very large values.
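
A minimal sketch for this exercise (depth, width and the 0.01 / 10.0 scales are my own illustrative choices): push the same data through a stack of linear + activation layers under each initialization and watch whether the spread of the activations collapses, stays healthy, or saturates.

import numpy as np

def relu(x):    return np.maximum(0.0, x)
def tanh(x):    return np.tanh(x)
def sigmoid(x): return 1.0 / (1.0 + np.exp(-np.clip(x, -500, 500)))  # clip to avoid overflow

def run(init, act, depth=10, width=100, seed=0):
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(1000, width))          # a batch of inputs
    for _ in range(depth):
        if init == "zero":
            W = np.zeros((width, width))
        elif init == "constant":
            W = np.full((width, width), 0.5)
        elif init == "small_random":
            W = rng.normal(size=(width, width)) * 0.01
        else:                                    # "large_random"
            W = rng.normal(size=(width, width)) * 10.0
        a = act(a @ W)                           # linear layer + activation
    return a.std()                               # spread of the final activations

for init in ["zero", "constant", "small_random", "large_random"]:
    for name, act in [("relu", relu), ("tanh", tanh), ("sigmoid", sigmoid)]:
        print(f"{init:13s} {name:8s} final activation std = {run(init, act):.4g}")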

Introduction to Deeplearning 34 / 46
Functional Gradients

Introduction to Deeplearning 35 / 46
Taking the Gradient – Review

f(x) = (−x + 3)^2

Decompose:  f = q^2,  q = r + 3,  r = −x

∂f/∂q = 2q,   ∂q/∂r = 1,   ∂r/∂x = −1

Chain rule:
∂f/∂x = (∂f/∂q)(∂q/∂r)(∂r/∂x) = 2q · 1 · (−1) = −2(−x + 3) = 2x − 6

Introduction to Deeplearning 36 / 46
Let’s Do This Another Way

Suppose we have a box representing a function f .

This box does two things:


Forward: Given forward input n, compute f(n)
Backwards: Given backwards input g, return g · ∂f /∂n

[Diagram: a box f; the forward pass takes n in and sends f(n) out; the backward pass takes g in and sends g · ∂f/∂n back.]
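
A tiny sketch of the box idea in Python (class and gate names are my own):

class Box:
    def __init__(self, f, dfdn):
        self.f, self.dfdn = f, dfdn
    def forward(self, n):
        self.n = n                         # remember the forward input
        return self.f(n)
    def backward(self, g):
        return g * self.dfdn(self.n)       # backward input times local derivative

neg    = Box(lambda n: -n,     lambda n: -1.0)
plus3  = Box(lambda n: n + 3,  lambda n: 1.0)
square = Box(lambda n: n ** 2, lambda n: 2.0 * n)

x = 5.0
out = square.forward(plus3.forward(neg.forward(x)))       # f(x) = (-x + 3)^2 = 4.0
grad = neg.backward(plus3.backward(square.backward(1.0)))
print(out, grad)                                          # grad = 2x - 6 = 4.0

Chaining the three boxes reproduces f(x) = (−x + 3)^2 from the next slides, and feeding the backward inputs through them in reverse order reproduces the gradient 2x − 6.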

Introduction to Deeplearning 37 / 46
Let’s Do This Another Way: Functional Diagrams

f(x) = (−x + 3)^2

[Diagram: x feeds a chain of boxes −n → n+3 → n^2; the wires carry −x, then −x + 3, then (−x + 3)^2. The backward pass is seeded with 1 at the output.]

At the n^2 box:  d/dn n^2 = 2n = 2(−x + 3) = −2x + 6,
so the box passes back (−2x + 6) · 1.

Introduction to Deeplearning 38 / 46
Let’s Do This Another Way

f(x) = (−x + 3)^2

[Diagram: the same chain −n → n+3 → n^2, with the backward signal −2x + 6 now arriving at the n+3 box.]

At the n+3 box:  d/dn (n + 3) = 1,
so the box passes back 1 · (−2x + 6).

Introduction to Deeplearning 39 / 46
Let’s Do This Another Way

f(x) = (−x + 3)^2

[Diagram: the same chain, with the backward signal −2x + 6 arriving at the −n box.]

At the −n box:  d/dn (−n) = −1,
so the box passes back −1 · (−2x + 6) = 2x − 6.

Introduction to Deeplearning 40 / 46
Let’s Do This Another Way

f(x) = (−x + 3)^2

[Diagram: the full chain with forward values −x, −x + 3, (−x + 3)^2 and backward signals 1, −2x + 6, −2x + 6; the gradient delivered to x is 2x − 6.]

Introduction to Deeplearning 41 / 46
Functional Gradients: Gates

Introduction to Deeplearning 42 / 46
Once more, with numbers!

Introduction to Deeplearning 43 / 46
f(x, y, z) = (x + y) z

[Diagram: x = 1 and y = 4 enter an n+m gate, producing 5; that 5 and z = 10 enter an n∗m gate, producing 50.]

Introduction to Deeplearning 44 / 46
f(x, y, z) = (x + y) z

Backward from the output (seed gradient 1) through the multiply gate:

∂(nm)/∂n = m → 10 ∗ 1 = 10 flows back to the n+m output,
∂(nm)/∂m = n → 5 ∗ 1 = 5 flows back to z.

Introduction to Deeplearning 45 / 46
f(x, y, z) = (x + y) z

Continuing through the add gate (upstream gradient 10):

∂(n+m)/∂n = 1 → 1 ∗ 10 ∗ 1 = 10 flows back to x,
∂(n+m)/∂m = 1 → 1 ∗ 10 ∗ 1 = 10 flows back to y.

So ∂f/∂x = 10, ∂f/∂y = 10, ∂f/∂z = 5.
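
A quick numeric check of this example in plain Python (variable names are mine):

x, y, z = 1.0, 4.0, 10.0

q = x + y             # forward through the add gate: 5
f = q * z             # forward through the multiply gate: 50

df_df = 1.0           # seed gradient at the output
df_dq = z * df_df     # multiply gate: d(qz)/dq = z  -> 10
df_dz = q * df_df     # multiply gate: d(qz)/dz = q  -> 5
df_dx = 1.0 * df_dq   # add gate: d(x+y)/dx = 1      -> 10
df_dy = 1.0 * df_dq   # add gate: d(x+y)/dy = 1      -> 10

print(f, df_dx, df_dy, df_dz)   # 50.0 10.0 10.0 5.0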

Introduction to Deeplearning 46 / 46
Something More Complex

f(w, x) = 1 / (1 + e^{−(w0 x0 + w1 x1 + w2)})

[Circuit: w0 and x0 enter an n∗m gate, w1 and x1 enter another n∗m gate; the two products and w2 are combined by two n+m gates; the sum then passes through n∗(−1), e^n, n+1, and 1/n gates to produce f.]

Introduction to Deeplearning 47 / 46
f(w, x) = 1 / (1 + e^{−(w0 x0 + w1 x1 + w2)})

Forward pass with w0 = 2, x0 = −1, w1 = −3, x1 = −2, w2 = −3:

w0 x0 = −2,  w1 x1 = 6,  sum = 4,  + w2 → 1,  ×(−1) → −1,  e^n → 0.37,  +1 → 1.37,  1/n → 0.73.

Introduction to Deeplearning 48 / 46
f(w, x) = 1 / (1 + e^{−(w0 x0 + w1 x1 + w2)})

[Same circuit and forward values as the previous slide, with the reciprocal gate written as n^{−1}.]

Example Credit: Karpathy and Fei-Fei

Introduction to Deeplearning 49 / 46
f(w, x) = 1 / (1 + e^{−(w0 x0 + w1 x1 + w2)})

Local derivative rules used below:
(a) ∂(m + n)/∂n = 1
(b) ∂(mn)/∂n = m
(c) ∂(e^n)/∂n = e^n
(d) ∂(n^{−1})/∂n = −n^{−2}
(e) ∂(an)/∂n = a
(f) ∂(c + n)/∂n = 1

Backward through the n^{−1} gate (rule d), seed gradient 1:
−(1.37)^{−2} · 1 = −0.53

Where does 1.37 come from? (It is the forward value flowing into that gate.)
Introduction to Deeplearning 50 / 46
f(w, x) = 1 / (1 + e^{−(w0 x0 + w1 x1 + w2)})

Backward through the +1 gate (rule f), upstream gradient −0.53:
1 ∗ (−0.53) = −0.53
Introduction to Deeplearning 51 / 46
f(w, x) = 1 / (1 + e^{−(w0 x0 + w1 x1 + w2)})

Backward through the e^n gate (rule c), upstream gradient −0.53; the forward input to this gate was −1:
e^{−1} ∗ (−0.53) ≈ −0.2
Introduction to Deeplearning 52 / 46
f(w, x) = 1 / (1 + e^{−(w0 x0 + w1 x1 + w2)})

Backward through the ×(−1) gate (rule e), upstream gradient −0.2:
(−1) ∗ (−0.2) = 0.2

Introduction to Deeplearning 53 / 46
f(w, x) = 1 / (1 + e^{−(w0 x0 + w1 x1 + w2)})

Backward through the + gate that adds w2 (rule a), upstream gradient 0.2:
1 ∗ 0.2 = 0.2 flows back to w2 and to the other + gate.

Introduction to Deeplearning 54 / 46
f(w, x) = 1 / (1 + e^{−(w0 x0 + w1 x1 + w2)})

Backward through the first + gate (rule a), upstream gradient 0.2:
1 ∗ 0.2 = 0.2 flows back to each multiply gate; and ∂f/∂w2 = 0.2.

Introduction to Deeplearning 55 / 46
f(w, x) = 1 / (1 + e^{−(w0 x0 + w1 x1 + w2)})

Backward through the w0 · x0 multiply gate (rule b), upstream gradient 0.2:
∂f/∂w0 = x0 ∗ 0.2 = (−1) ∗ 0.2 = −0.2
∂f/∂x0 = w0 ∗ 0.2 = 2 ∗ 0.2 = 0.4

Introduction to Deeplearning 56 / 46
f(w, x) = 1 / (1 + e^{−(w0 x0 + w1 x1 + w2)})

Backward through the w1 · x1 multiply gate (rule b), upstream gradient 0.2:
∂f/∂w1 = x1 ∗ 0.2 = (−2) ∗ 0.2 = −0.4
∂f/∂x1 = w1 ∗ 0.2 = (−3) ∗ 0.2 = −0.6

Collected gradients: ∂f/∂w0 = −0.2, ∂f/∂x0 = 0.4, ∂f/∂w1 = −0.4, ∂f/∂x1 = −0.6, ∂f/∂w2 = 0.2.
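
A sketch that retraces the whole circuit numerically with the slide's values, gate by gate (variable names are mine):

import math

w0, x0, w1, x1, w2 = 2.0, -1.0, -3.0, -2.0, -3.0

# forward pass
p0 = w0 * x0              # -2
p1 = w1 * x1              #  6
s  = p0 + p1 + w2         #  1
u  = -1.0 * s             # -1
v  = math.exp(u)          #  0.37
t  = v + 1.0              #  1.37
f  = 1.0 / t              #  0.73

# backward pass (seed gradient 1 at the output)
dt = -1.0 / t ** 2 * 1.0  # 1/n gate:   -0.53
dv = 1.0 * dt             # +1 gate:    -0.53
du = math.exp(u) * dv     # e^n gate:   -0.2
ds = -1.0 * du            # *(-1) gate:  0.2
dw2 = ds                  #              0.2
dw0, dx0 = x0 * ds, w0 * ds   # -0.2, 0.4
dw1, dx1 = x1 * ds, w1 * ds   # -0.4, -0.6

print(round(f, 2), [round(g, 1) for g in (dw0, dx0, dw1, dx1, dw2)])
# 0.73 [-0.2, 0.4, -0.4, -0.6, 0.2]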
Introduction to Deeplearning 58 / 46
Does It Have To Be So Painful?

f(w, x) = 1 / (1 + e^{−(w0 x0 + w1 x1 + w2)})

[Circuit: the same graph as before, but the four gates n∗(−1), e^n, n+1, 1/n are grouped into a single sigmoid gate σ(n) = 1 / (1 + e^{−n}).]

Introduction to Deeplearning 59 / 46
Does It Have To Be So Painful?

σ(n) = 1 / (1 + e^{−n})

∂σ(n)/∂n = e^{−n} / (1 + e^{−n})^2
         = [(1 + e^{−n}) − 1] / (1 + e^{−n}) · 1 / (1 + e^{−n})
         = [ (1 + e^{−n}) / (1 + e^{−n}) − 1 / (1 + e^{−n}) ] · σ(n)
         = (1 − σ(n)) σ(n)

For the curious, line 1 to line 2 uses the chain rule:
∂σ(n)/∂n = −1 / (1 + e^{−n})^2 · 1 · e^{−n} · (−1)
i.e. d/dx (1/x) · d/dx (1 + x) · d/dx (e^x) · d/dx (−x), evaluated along the chain.
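
A quick numerical check (my own, not from the slides) that the closed form matches a finite-difference estimate:

import math

def sigma(n):
    return 1.0 / (1.0 + math.exp(-n))

n, eps = 0.8, 1e-6
analytic = (1.0 - sigma(n)) * sigma(n)
numeric = (sigma(n + eps) - sigma(n - eps)) / (2.0 * eps)
print(analytic, numeric)   # both approximately 0.2139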

Introduction to Deeplearning 60 / 46
Ex3: Compute the upstream and downstream gradients of
the following functional graph
Given w0 = 2, x0 = 1, w1 = −3, x1 = 4, w2 = −5

f(w, x) = 1 / (1 + e^{−(w0 x0 + w1 x1 + w2)})

[Circuit: w0, x0 and w1, x1 feed two n∗m gates; their outputs and w2 feed two n+m gates; the sum passes through a single σ(n) gate.]

σ(n) = 1 / (1 + e^{−n}),   ∂σ(n)/∂n = (1 − σ(n)) σ(n)
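
A possible starting point for Ex3, treating the sigmoid as a single gate (code and variable names are mine):

import math

def sigma(n):
    return 1.0 / (1.0 + math.exp(-n))

w0, x0, w1, x1, w2 = 2.0, 1.0, -3.0, 4.0, -5.0

# forward: two multiply gates, two add gates, then the sigmoid gate
s = w0 * x0 + w1 * x1 + w2          # = -15
f = sigma(s)

# backward: one local derivative for the whole sigmoid gate
ds = (1.0 - sigma(s)) * sigma(s)    # gradient flowing back into the sum
dw0, dx0 = x0 * ds, w0 * ds         # gradients through the first multiply gate
dw1, dx1 = x1 * ds, w1 * ds         # gradients through the second multiply gate
dw2 = ds                            # the add gate passes the gradient through

print(f, dw0, dx0, dw1, dx1, dw2)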
Introduction to Deeplearning 61 / 46
Any Questions

Introduction to Deeplearning 62 / 46
