Chapter 7 Part 2


Backpropagation – Gradients w.r.t. Output Units
Consider an ANN structure.

Some observations from this network


1) It is a classification ANN
2) There are 3 layers = 2 hidden layers + 1 output layer.
3) There are L=3 total layers in this network.
4) There are L-1=2 Hidden Layers in this network.

5) $x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \in \mathbb{R}^n$ is the input to the network.

6) $y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_k \end{bmatrix}$ is the output of this network (the $i$-th element is the probability of belonging to the $i$-th class).
7) There are n inputs
8) There are n neurons in each hidden layer.
9) There are k nodes in the output layer.
10) The activation function used at each hidden-layer neuron is $g(x)$.
11) Weights and biases are the learnable parameters.
12) $\theta$ is the vector containing the weight matrices and bias vectors, $\theta = \begin{bmatrix} W \\ B \end{bmatrix}$.
13) $w_{nm}^{L}$ = weight of the link connecting the $n$-th node of the $(L-1)$-th layer with the $m$-th node of the $L$-th layer.
14) $b_{m}^{L}$ = bias added to the $m$-th node of the $L$-th layer.
15) $a_{i}^{L}$ = pre-activation of the $i$-th node of the $L$-th layer.
16) $h_{i}^{L}$ = activation of the $i$-th node of the $L$-th layer.

17) $y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_k \end{bmatrix}$ is the vector of probabilities of each class predicted by the ANN (the $Q$ distribution).

18) $\hat{y} = \begin{bmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \vdots \\ \hat{y}_k \end{bmatrix}$ is the vector of actual probabilities of each class (the $P$ distribution).
19) $y_i = \dfrac{e^{a_i^{3}}}{\sum_{j=1}^{k} e^{a_j^{3}}} = \mathrm{softmax}(a_i^{3})$
20) If there were $L$ layers and $L-1$ hidden layers, then
$y_i = \dfrac{e^{a_i^{L}}}{\sum_{j=1}^{k} e^{a_j^{L}}} = \mathrm{softmax}(a_i^{L})$
21) Loss function used = cross-entropy loss:
$L(\theta) = \sum_{j=1}^{k} P_j \log\left(\dfrac{1}{Q_j}\right)$
Here $P$ is the actual probability distribution over all $k$ classes for the $i$-th input of the dataset, and $Q$ is the predicted probability distribution over all $k$ classes for the $i$-th input of the dataset (see the numerical sketch just below this list).
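As a sanity check on items 19–21, here is a minimal NumPy sketch of these two definitions. The helper names (`softmax`, `cross_entropy`) and the example numbers are illustrative assumptions, not something defined in the chapter.

```python
import numpy as np

def softmax(a):
    """Softmax over the pre-activations a^L of the output layer."""
    a = a - np.max(a)             # shift for numerical stability
    exp_a = np.exp(a)
    return exp_a / np.sum(exp_a)

def cross_entropy(p, q):
    """L(theta) = sum_j p_j * log(1 / q_j) for actual p and predicted q."""
    return np.sum(p * np.log(1.0 / q))

# Example: k = 3 classes, true class l = 1 (0-indexed); values are made up.
a_L = np.array([2.0, 1.0, 0.1])        # pre-activations of the output layer
y = softmax(a_L)                        # predicted distribution Q
p = np.array([0.0, 1.0, 0.0])           # actual one-hot distribution P
print(y, cross_entropy(p, y))
```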

Some pre-calculations

1) The actual probability distribution over the $k$ classes for a given input vector $x_i$ has $P(X = l) = 1$ for the correct class $l$, and $P(X \neq l) = 0$ for all other classes.
2) This means
$L(\theta) = P_1 \log\left(\dfrac{1}{Q_1}\right) + P_2 \log\left(\dfrac{1}{Q_2}\right) + \dots + P_l \log\left(\dfrac{1}{Q_l}\right) + \dots + P_k \log\left(\dfrac{1}{Q_k}\right) = 1 \cdot \log\left(\dfrac{1}{Q_l}\right)$
Hence $L(\theta) = \log\left(\dfrac{1}{Q_l}\right)$.
3) But $Q_l$ is the probability of the $l$-th class predicted by the ANN:
$Q_l = y_l = \dfrac{e^{a_l^{L}}}{\sum_{j=1}^{k} e^{a_j^{L}}}$
Hence $L(\theta) = \log\left(\dfrac{1}{y_l}\right) = -\log(y_l)$, where $l$ is the true class label (a quick numerical check of this reduction appears right after this list).

4) And previously we saw that, to backpropagate the loss and update the weights and biases, we need to find the gradient of $L(\theta)$ with respect to $\theta$.
But $\nabla_\theta L(\theta)$ would contain gradients with respect to intermediate members of the network. For example,
$\nabla_\theta L(\theta) = \begin{bmatrix} \dfrac{\partial L(\theta)}{\partial W} \\[2mm] \dfrac{\partial L(\theta)}{\partial B} \end{bmatrix}$, with $\dfrac{\partial L(\theta)}{\partial W} = \dfrac{\partial L(\theta)}{\partial y} \cdot \dfrac{\partial y}{\partial W}$
So before going ahead, we first need to find $\dfrac{\partial L(\theta)}{\partial y}$.
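The reduction in pre-calculations 1–3 can be checked numerically. In this sketch the distribution values and the true class are made up; the point is only that the full sum $\sum_j P_j \log(1/Q_j)$ with a one-hot $P$ equals $-\log(Q_l)$.

```python
import numpy as np

# Hypothetical example: k = 4 classes, true class l = 2 (0-indexed),
# q is some predicted distribution Q from the network.
q = np.array([0.1, 0.2, 0.6, 0.1])
p = np.zeros(4)
p[2] = 1.0                               # one-hot actual distribution P

full_sum = np.sum(p * np.log(1.0 / q))   # sum_j P_j * log(1 / Q_j)
reduced = -np.log(q[2])                  # -log(Q_l)
print(full_sum, reduced)                 # both approximately 0.5108
```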

Gradient With Respect to Output Units

$\dfrac{\partial}{\partial y} L(\theta) = \dfrac{\partial}{\partial y}\left(-\log(y_l)\right)$

If $y_i$ is the $i$-th element of the output vector $y$, then

$\dfrac{\partial}{\partial y_i} L(\theta) = \begin{cases} \dfrac{-1}{y_l}, & \text{if } i = l \\[2mm] 0, & \text{otherwise} \end{cases}$

More compactly this can be written as

$\dfrac{\partial}{\partial y_i} L(\theta) = \dfrac{-1_{i=l}}{y_l}$

This uses the indicator (Kronecker delta) function $1_{i=l}$, which is equal to 1 when $i = l$ and 0 otherwise.

So,

$\dfrac{\partial}{\partial y}\left(-\log(y_l)\right) = \dfrac{\partial}{\partial \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_k \end{bmatrix}}\left(-\log(y_l)\right) = \begin{bmatrix} \dfrac{\partial}{\partial y_1}\left(-\log(y_l)\right) \\[2mm] \dfrac{\partial}{\partial y_2}\left(-\log(y_l)\right) \\ \vdots \\ \dfrac{\partial}{\partial y_k}\left(-\log(y_l)\right) \end{bmatrix} = \begin{bmatrix} \dfrac{-1_{l=1}}{y_l} \\[2mm] \dfrac{-1_{l=2}}{y_l} \\[2mm] \dfrac{-1_{l=3}}{y_l} \\ \vdots \\ \dfrac{-1_{l=k}}{y_l} \end{bmatrix}$

This means that, for a given input vector $x$, only one element of $\dfrac{\partial}{\partial y}\left(-\log(y_l)\right)$ is non-zero (the $l$-th element, equal to $\dfrac{-1}{y_l}$); all other elements are 0.
A vector with a 1 in a single position and 0 everywhere else is called a "one-hot vector", so this gradient can be denoted as

$\dfrac{\partial}{\partial y}\left(-\log(y_l)\right) = \dfrac{-1}{y_l}\, e(l)$

Hence

$\dfrac{\partial}{\partial y} L(\theta) = \dfrac{-1}{y_l}\, e(l)$

where $e(l)$ is a $k$-dimensional one-hot vector whose $l$-th element is 1, and $l$ is the true class label.
Till now we have backpropagated up to the dark green part of the diagram: $\dfrac{\partial}{\partial y} L(\theta) = \dfrac{-1}{y_l}\, e(l)$.
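As a quick hedge against algebra slips, this result can be compared with a finite-difference estimate. The vector `y`, the class index `l`, and the `loss` helper below are illustrative assumptions, not part of the chapter.

```python
import numpy as np

def loss(y, l):
    """L(theta) = -log(y_l), viewed as a function of the output vector y."""
    return -np.log(y[l])

# Hypothetical output vector and true class (values chosen for illustration)
y = np.array([0.2, 0.5, 0.3])
l = 1

# Analytic gradient: (-1 / y_l) * e(l)
e_l = np.zeros_like(y)
e_l[l] = 1.0
analytic = -(1.0 / y[l]) * e_l

# Numerical gradient via central finite differences
eps = 1e-6
numeric = np.zeros_like(y)
for i in range(len(y)):
    y_plus, y_minus = y.copy(), y.copy()
    y_plus[i] += eps
    y_minus[i] -= eps
    numeric[i] = (loss(y_plus, l) - loss(y_minus, l)) / (2 * eps)

print(analytic)  # [ 0. -2.  0.]
print(numeric)   # close to [ 0. -2.  0.]
```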

Now let's move to the light green part and calculate

$\dfrac{\partial}{\partial a^{L}} L(\theta) = \nabla_{a^{L}} L(\theta)$

where $a^{L}$ is the vector containing the pre-activations at the $L$-th layer.

$\dfrac{\partial}{\partial a_i^{L}} L(\theta) = \dfrac{\partial}{\partial a_i^{L}}\left(-\log(y_l)\right) = \dfrac{\partial\left(-\log(y_l)\right)}{\partial y_l} \cdot \dfrac{\partial y_l}{\partial a_i^{L}} = -\dfrac{1}{y_l} \cdot \dfrac{\partial y_l}{\partial a_i^{L}}$

But is $y_l$ dependent on $a_i^{L}$?

Yes, because

$y = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_k \end{bmatrix} = \begin{bmatrix} \dfrac{e^{a_1^{L}}}{\sum_{j=1}^{k} e^{a_j^{L}}} \\[3mm] \dfrac{e^{a_2^{L}}}{\sum_{j=1}^{k} e^{a_j^{L}}} \\[3mm] \dfrac{e^{a_3^{L}}}{\sum_{j=1}^{k} e^{a_j^{L}}} \\ \vdots \\ \dfrac{e^{a_k^{L}}}{\sum_{j=1}^{k} e^{a_j^{L}}} \end{bmatrix} = \begin{bmatrix} \mathrm{softmax}(a_1^{L}) \\ \mathrm{softmax}(a_2^{L}) \\ \mathrm{softmax}(a_3^{L}) \\ \vdots \\ \mathrm{softmax}(a_k^{L}) \end{bmatrix}$

This means

$y_i = \mathrm{softmax}(a_i^{L}) = \dfrac{e^{a_i^{L}}}{e^{a_1^{L}} + e^{a_2^{L}} + \dots + e^{a_k^{L}}}$

so each element of $y$ involves every element of $a^{L}$.


And
L

ea l

y l=softmax ( a ) = L
l k

∑ ea
L
j

j=1
So,

$\dfrac{\partial}{\partial a_i^{L}} L(\theta) = \dfrac{-1}{y_l} \cdot \dfrac{\partial y_l}{\partial a_i^{L}}$, where $\dfrac{\partial y_l}{\partial a_i^{L}} = \dfrac{\partial}{\partial a_i^{L}}\left(\dfrac{e^{a_l^{L}}}{\sum_{j=1}^{k} e^{a_j^{L}}}\right)$

This is of the form

$\dfrac{\partial}{\partial x}\left(\dfrac{g(x)}{h(x)}\right) = \dfrac{h(x)\,\dfrac{\partial g(x)}{\partial x} - g(x)\,\dfrac{\partial h(x)}{\partial x}}{\left(h(x)\right)^2} = \left(\dfrac{\partial g(x)}{\partial x} \cdot \dfrac{1}{h(x)}\right) - \left(\dfrac{g(x)}{\left(h(x)\right)^2} \cdot \dfrac{\partial h(x)}{\partial x}\right)$
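If SymPy is available, the two forms of the quotient rule above can be verified symbolically for abstract $g(x)$ and $h(x)$; this is only a verification aid, not part of the original derivation.

```python
import sympy as sp

x = sp.symbols('x')
g = sp.Function('g')(x)
h = sp.Function('h')(x)

lhs = sp.diff(g / h, x)                                # d/dx [ g(x) / h(x) ]
rhs = sp.diff(g, x) / h - (g / h**2) * sp.diff(h, x)   # expanded quotient-rule form
print(sp.simplify(lhs - rhs))                          # prints 0, so both forms agree
```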
Here the correspondence is:
- $e^{a_l^{L}}$ plays the role of $g(x)$: it is a function of $a_i^{L}$ only through the case $i = l$, so $g(a_i^{L}) = e^{a_l^{L}}$.
- $\sum_{j=1}^{k} e^{a_j^{L}} = e^{a_1^{L}} + e^{a_2^{L}} + \dots + e^{a_k^{L}}$ plays the role of $h(x)$: it is a function of $a_i^{L}$ through the $j = i$ term, so $h(a_i^{L}) = \sum_{j=1}^{k} e^{a_j^{L}}$.
- $a_i^{L}$ plays the role of $x$.

Hence,

$\dfrac{\partial}{\partial a_i^{L}}\left(\dfrac{e^{a_l^{L}}}{\sum_{j=1}^{k} e^{a_j^{L}}}\right) = \dfrac{\partial}{\partial x}\left(\dfrac{g(x)}{h(x)}\right) = \left(\dfrac{\partial g(x)}{\partial x} \cdot \dfrac{1}{h(x)}\right) - \left(\dfrac{g(x)}{\left(h(x)\right)^2} \cdot \dfrac{\partial h(x)}{\partial x}\right)$
( )
aLl
e

) (( )

( )
k k
∑e (a Lj )
∂ ∑ e(
a Lj )

(
L L
a a
j=1 ∂e l
1 e l
j=1
= . − .

)
L L k k 2 L
∂ ai ∂ ai ∂ ai
∑e ( aLj )
∑e ( aLj )
j=1 j=1

)( )
( )
k

∑ e(a )
L

∂ j

(
L L
a a
∂e l
1 e l
1 j=1
¿ L
. k
− k
. k
. L
∂ ai ( aLj ) ( aLj ) ( aLj ) ∂ ai
∑e ∑e ∑e
j=1 j=1 j=1

( )( )
aLl
∂e 1 1
. e(
ai )
L

¿ . − softmax ( aLl ) .
∂ a Li k k

∑ e( a ) ∑ aLj
L
j

j=1 j=1

( )( )
al
L
( aLi )
∂e 1 e
− softmax ( a ) .
L
¿ L
. k l k
∂ ai ( aLj )
∑e ∑ aLj
j=1 j=1

Evaluating $\dfrac{\partial e^{a_l^{L}}}{\partial a_i^{L}}$ (which equals $e^{a_l^{L}}$ when $i = l$ and 0 otherwise) gives

$\left(\dfrac{\partial e^{a_l^{L}}}{\partial a_i^{L}} \cdot \dfrac{1}{\sum_{j=1}^{k} e^{a_j^{L}}}\right) - \left(\mathrm{softmax}(a_l^{L}) \cdot \mathrm{softmax}(a_i^{L})\right) = \begin{cases} \dfrac{e^{a_l^{L}}}{\sum_{j=1}^{k} e^{a_j^{L}}} - \mathrm{softmax}(a_l^{L}) \cdot \mathrm{softmax}(a_i^{L}), & \text{if } l = i \\[3mm] -\mathrm{softmax}(a_l^{L}) \cdot \mathrm{softmax}(a_i^{L}), & \text{otherwise} \end{cases}$

Therefore

$\dfrac{\partial}{\partial a_i^{L}}\left(\dfrac{e^{a_l^{L}}}{\sum_{j=1}^{k} e^{a_j^{L}}}\right) = \begin{cases} \mathrm{softmax}(a_l^{L}) - \mathrm{softmax}(a_l^{L}) \cdot \mathrm{softmax}(a_i^{L}), & \text{if } l = i \\[3mm] -\mathrm{softmax}(a_l^{L}) \cdot \mathrm{softmax}(a_i^{L}), & \text{otherwise} \end{cases}$

Which can be written more compactly as

$\dfrac{\partial}{\partial a_i^{L}}\left(\dfrac{e^{a_l^{L}}}{\sum_{j=1}^{k} e^{a_j^{L}}}\right) = 1_{(l=i)}\,\mathrm{softmax}(a_l^{L}) - \mathrm{softmax}(a_l^{L}) \cdot \mathrm{softmax}(a_i^{L})$
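The compact expression $1_{(l=i)}\,\mathrm{softmax}(a_l^{L}) - \mathrm{softmax}(a_l^{L}) \cdot \mathrm{softmax}(a_i^{L})$ is the $(l, i)$ entry of the softmax Jacobian. A small numerical sketch (the pre-activation values are made up) compares it against finite differences.

```python
import numpy as np

def softmax(a):
    exp_a = np.exp(a - np.max(a))
    return exp_a / np.sum(exp_a)

# Hypothetical pre-activations a^L for k = 3 output nodes
a = np.array([0.5, 1.5, -0.3])
s = softmax(a)
k = len(a)

# Analytic Jacobian: J[l, i] = 1(l == i) * s_l  -  s_l * s_i
analytic = np.diag(s) - np.outer(s, s)

# Numerical Jacobian via central finite differences (column i = d softmax / d a_i)
eps = 1e-6
numeric = np.zeros((k, k))
for i in range(k):
    a_plus, a_minus = a.copy(), a.copy()
    a_plus[i] += eps
    a_minus[i] -= eps
    numeric[:, i] = (softmax(a_plus) - softmax(a_minus)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # tiny, e.g. ~1e-10
```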

But $y_l = \mathrm{softmax}(a_l^{L})$ and $y_i = \mathrm{softmax}(a_i^{L})$.

Therefore

$\dfrac{\partial y_l}{\partial a_i^{L}} = \dfrac{\partial}{\partial a_i^{L}}\left(\dfrac{e^{a_l^{L}}}{\sum_{j=1}^{k} e^{a_j^{L}}}\right) = 1_{(l=i)}\,y_l - y_l \cdot y_i$

And

$\dfrac{\partial}{\partial a_i^{L}} L(\theta) = \dfrac{-1}{y_l} \cdot \left[\,1_{(l=i)}\,y_l - y_l \cdot y_i\,\right]$
Finally,

$\dfrac{\partial}{\partial a_i^{L}} L(\theta) = -\left(1_{(l=i)} - y_i\right)$

We can now write the gradient with respect to the vector $a^{L}$:

$\nabla_{a^{L}} L(\theta) = \begin{bmatrix} \dfrac{\partial L(\theta)}{\partial a_1^{L}} \\[2mm] \dfrac{\partial L(\theta)}{\partial a_2^{L}} \\[2mm] \dfrac{\partial L(\theta)}{\partial a_3^{L}} \\ \vdots \\ \dfrac{\partial L(\theta)}{\partial a_k^{L}} \end{bmatrix} = \begin{bmatrix} -\left(1_{(l=1)} - y_1\right) \\[2mm] -\left(1_{(l=2)} - y_2\right) \\[2mm] -\left(1_{(l=3)} - y_3\right) \\ \vdots \\ -\left(1_{(l=k)} - y_k\right) \end{bmatrix}$

Hence,

$\nabla_{a^{L}} L(\theta) = -\left(e(l) - y\right)$

where $e(l)$ is a $k$-dimensional one-hot vector whose $l$-th element is 1 ($l$ being the true class label), and $y$ is the vector containing the predicted probability of the input belonging to each of the $k$ classes.
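To confirm the final expression end to end, the sketch below (with made-up pre-activations and true class label; `softmax` and `loss` are assumed helpers) compares $-(e(l) - y)$ against a finite-difference gradient of $-\log\big(\mathrm{softmax}(a^{L})_l\big)$ with respect to $a^{L}$.

```python
import numpy as np

def softmax(a):
    exp_a = np.exp(a - np.max(a))
    return exp_a / np.sum(exp_a)

def loss(a, l):
    """Cross-entropy loss -log(softmax(a)_l) as a function of the pre-activations a^L."""
    return -np.log(softmax(a)[l])

# Hypothetical pre-activations a^L and true class label l (0-indexed)
a = np.array([2.0, 1.0, 0.1])
l = 0

# Analytic gradient: -(e(l) - y) = y - e(l)
y = softmax(a)
e_l = np.zeros_like(a)
e_l[l] = 1.0
analytic = -(e_l - y)

# Numerical gradient via central finite differences
eps = 1e-6
numeric = np.zeros_like(a)
for i in range(len(a)):
    a_plus, a_minus = a.copy(), a.copy()
    a_plus[i] += eps
    a_minus[i] -= eps
    numeric[i] = (loss(a_plus, l) - loss(a_minus, l)) / (2 * eps)

print(analytic)  # approximately [-0.341  0.242  0.099] for these values
print(numeric)   # agrees with the analytic gradient to ~1e-9
```

The agreement illustrates why this form is so convenient in practice: at the output layer, the backward signal is simply the predicted distribution minus the one-hot target.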
