Backpropagation – Gradients w.r.t. Output Units
Consider an ANN structure.
Some observations about this network:
1) It is a classification ANN
2) There are 3 layers = 2 hidden layers + 1 output layer.
3) There are L=3 total layers in this network.
4) There are L-1=2 Hidden Layers in this network.
5) $x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \in \mathbb{R}^n$ is the input to the network.
6) $y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_k \end{bmatrix}$ is the output of this network ($y_i$ = probability of belonging to the $i$-th class).
7) There are n inputs
8) There are n neurons in each hidden layer.
9) There are k nodes in the output layer.
10) The activation function used at each hidden-layer neuron is $g(x)$.
11) Weights and Biases are the learnable parameters
12) $\theta$ is the vector containing the weight matrix and bias vector: $\theta = \begin{bmatrix} W \\ B \end{bmatrix}$
13) $w^L_{nm}$ = weight of the link connecting the $n$-th node of the $(L-1)$-th layer with the $m$-th node of the $L$-th layer.
14) $b^L_m$ = bias added to the $m$-th node of the $L$-th layer.
15) $a^L_i$ = pre-activation of the $i$-th node of the $L$-th layer.
16) $h^L_i$ = activation of the $i$-th node of the $L$-th layer.
17) $y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_k \end{bmatrix}$ is the vector of probabilities of each class predicted by the ANN (the $Q$ distribution).
18) $\hat{y} = \begin{bmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \vdots \\ \hat{y}_k \end{bmatrix}$ is the vector of actual probabilities of each class (the $P$ distribution).
19) $y_i = \dfrac{e^{a^3_i}}{\sum_{j=1}^{k} e^{a^3_j}} = \text{softmax}(a^3_i)$
20) In general, if there are $L$ layers ($L-1$ of them hidden), then
$$y_i = \frac{e^{a^L_i}}{\sum_{j=1}^{k} e^{a^L_j}} = \text{softmax}(a^L_i)$$
(a small code sketch of the softmax follows this list).
21) Loss function used = cross-entropy loss:
$$L(\theta) = \sum_{j=1}^{k} P_j \log\left(\frac{1}{Q_j}\right)$$
Here $P$ is the actual probability distribution over all $k$ classes for the $i$-th input of the dataset, and $Q$ is the predicted probability distribution over all $k$ classes for the $i$-th input of the dataset.
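As a quick illustration of the softmax in items 19–20, here is a minimal NumPy sketch (the max-subtraction stability trick and the example values are my own, not from the notes):

```python
import numpy as np

def softmax(a):
    # subtract the max for numerical stability; the result is unchanged
    e = np.exp(a - np.max(a))
    return e / e.sum()

a3 = np.array([2.0, 1.0, 0.1])  # example pre-activations of the output layer
y = softmax(a3)                 # predicted class probabilities (Q distribution)
print(y)                        # approximately [0.659 0.242 0.099]
print(y.sum())                  # 1.0
```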
Some pre-calculations
1) The actual probability distribution over the $k$ classes for a given input vector $x_i$ has $P(X = l) = 1$ for the correct class $l$, and $P(X \neq l) = 0$ for all other classes.
2) This means
$$L(\theta) = P_1 \log\left(\frac{1}{Q_1}\right) + P_2 \log\left(\frac{1}{Q_2}\right) + \dots + P_l \log\left(\frac{1}{Q_l}\right) + \dots + P_k \log\left(\frac{1}{Q_k}\right)$$
$$= 1 \cdot \log\left(\frac{1}{Q_l}\right) \quad \text{(every term except the } l\text{-th vanishes, since } P_{j \neq l} = 0\text{)}$$
Hence $L(\theta) = \log\left(\dfrac{1}{Q_l}\right)$
3) But $Q_l$ is the predicted probability of the $l$-th class by the ANN:
$$Q_l = y_l = \frac{e^{a^L_l}}{\sum_{j=1}^{k} e^{a^L_j}}$$
Hence $L(\theta) = \log\left(\dfrac{1}{y_l}\right) = -\log(y_l)$, where $l$ = true class label (a numerical sketch of this collapse follows these pre-calculations).
4) And previously we saw that to backpropagate the loss and update the weights and biases we need to find the gradient of $L(\theta)$ w.r.t. $\theta$.
But $\nabla_\theta L(\theta)$ would contain gradients with respect to intermediate members of the network. For example,
$$\nabla_\theta L(\theta) = \begin{bmatrix} \dfrac{\partial L(\theta)}{\partial W} \\ \dfrac{\partial L(\theta)}{\partial B} \end{bmatrix}, \qquad \frac{\partial L(\theta)}{\partial W} = \frac{\partial L(\theta)}{\partial y} \cdot \frac{\partial y}{\partial W}$$
OK, before going ahead, we first need to find $\dfrac{\partial L(\theta)}{\partial y}$.
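A small numerical sketch of pre-calculations 2) and 3): with a one-hot $P$ distribution, the full cross-entropy sum collapses to $-\log(y_l)$ (variable names and values are illustrative):

```python
import numpy as np

y = np.array([0.2, 0.7, 0.1])     # predicted Q distribution (k = 3 classes)
l = 1                             # index of the true class
P = np.zeros_like(y); P[l] = 1.0  # one-hot actual P distribution

full_sum = np.sum(P * np.log(1.0 / y))  # L(theta) = sum_j P_j log(1/Q_j)
collapsed = -np.log(y[l])               # L(theta) = -log(y_l)
print(full_sum, collapsed)              # both ~0.3567
```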
Gradient With Respect to Output Units
$$\frac{\partial}{\partial y} L(\theta) = \frac{\partial}{\partial y}\left(-\log(y_l)\right)$$
If $y_i$ is the $i$-th element of the output vector $y$,
$$\frac{\partial}{\partial y_i} L(\theta) = \begin{cases} \dfrac{-1}{y_l}, & \text{if } i = l \\ 0, & \text{otherwise} \end{cases}$$
More compactly this can be written as
$$\frac{\partial}{\partial y_i} L(\theta) = \frac{-\mathbb{1}_{(i=l)}}{y_l}$$
This uses the indicator (Kronecker delta) function $\mathbb{1}_{(i=l)}$, which is equal to 1 when $i = l$ and 0 otherwise.
So,
$$\frac{\partial}{\partial y}\left(-\log(y_l)\right) = \frac{\partial}{\partial \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_k \end{bmatrix}}\left(-\log(y_l)\right) = \begin{bmatrix} \dfrac{\partial}{\partial y_1}\left(-\log(y_l)\right) \\ \dfrac{\partial}{\partial y_2}\left(-\log(y_l)\right) \\ \vdots \\ \dfrac{\partial}{\partial y_k}\left(-\log(y_l)\right) \end{bmatrix}$$
$$\frac{\partial}{\partial y}\left(-\log(y_l)\right) = \begin{bmatrix} \dfrac{-\mathbb{1}_{(l=1)}}{y_l} \\ \dfrac{-\mathbb{1}_{(l=2)}}{y_l} \\ \dfrac{-\mathbb{1}_{(l=3)}}{y_l} \\ \vdots \\ \dfrac{-\mathbb{1}_{(l=k)}}{y_l} \end{bmatrix}$$
This means that for a given input vector $x$ there can be only one non-zero element in $\dfrac{\partial}{\partial y}\left(-\log(y_l)\right)$ (the $l$-th one), and all others will be 0.
Factoring out $\dfrac{-1}{y_l}$ leaves a vector with a single 1 and zeros elsewhere; this kind of vector is called a "one-hot vector".
It can be denoted as
$$\frac{\partial}{\partial y}\left(-\log(y_l)\right) = \frac{-1}{y_l} e(l)$$
Hence
$$\frac{\partial}{\partial y} L(\theta) = \frac{-1}{y_l} e(l)$$
where $e(l)$ is a $k$-dimensional one-hot vector whose $l$-th element is 1, $l$ being the true class label.
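A minimal sketch to sanity-check this result, comparing the closed form $\frac{-1}{y_l} e(l)$ with a finite-difference gradient (the helper names are mine):

```python
import numpy as np

y = np.array([0.2, 0.7, 0.1])  # predicted probabilities
l, k = 1, y.size               # true class label, number of classes

e_l = np.zeros(k); e_l[l] = 1.0  # one-hot vector e(l)
grad = (-1.0 / y[l]) * e_l       # closed form: dL/dy = (-1/y_l) e(l)

# finite-difference check, component by component
eps = 1e-6
num = np.zeros(k)
for i in range(k):
    yp, ym = y.copy(), y.copy()
    yp[i] += eps
    ym[i] -= eps
    num[i] = (-np.log(yp[l]) + np.log(ym[l])) / (2 * eps)
print(grad)  # [ 0.     -1.4286  0.    ]
print(num)   # matches up to ~1e-9
```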
Till now we have backpropagated up to the dark green part: $\dfrac{\partial}{\partial y} L(\theta) = \dfrac{-1}{y_l} e(l)$
Now let's move to the light green part and calculate
$$\frac{\partial}{\partial a^L} L(\theta) = \nabla_{a^L} L(\theta)$$
$a^L$ is the vector containing the pre-activations at the $L$-th layer.
$$\frac{\partial}{\partial a^L_i} L(\theta) = \frac{\partial}{\partial a^L_i}\left(-\log(y_l)\right) = \frac{\partial\left(-\log(y_l)\right)}{\partial y_l} \cdot \frac{\partial y_l}{\partial a^L_i} = -\frac{1}{y_l} \cdot \frac{\partial y_l}{\partial a^L_i}$$
But is $y_l$ dependent on $a^L_i$? Yes, as
$$y = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_k \end{bmatrix} = \begin{bmatrix} \dfrac{e^{a^L_1}}{\sum_{j=1}^{k} e^{a^L_j}} \\ \dfrac{e^{a^L_2}}{\sum_{j=1}^{k} e^{a^L_j}} \\ \vdots \\ \dfrac{e^{a^L_k}}{\sum_{j=1}^{k} e^{a^L_j}} \end{bmatrix} = \begin{bmatrix} \text{softmax}(a^L_1) \\ \text{softmax}(a^L_2) \\ \text{softmax}(a^L_3) \\ \vdots \\ \text{softmax}(a^L_k) \end{bmatrix}$$
This means
$$y_i = \text{softmax}(a^L_i) = \frac{e^{a^L_i}}{e^{a^L_1} + e^{a^L_2} + \dots + e^{a^L_k}}$$
Each element of $y$ involves each element of $a^L$.
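A tiny illustrative sketch of this coupling: perturbing a single pre-activation changes every entry of $y$, because each output shares the same denominator (the example values are assumptions for demonstration):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

a = np.array([2.0, 1.0, 0.1])
a_perturbed = a.copy()
a_perturbed[0] += 0.5        # change only the first pre-activation

print(softmax(a))            # approximately [0.659 0.242 0.099]
print(softmax(a_perturbed))  # every entry changes, not just the first
```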
And
$$y_l = \text{softmax}(a^L_l) = \frac{e^{a^L_l}}{\sum_{j=1}^{k} e^{a^L_j}}$$
So,
$$\frac{\partial}{\partial a^L_i} L(\theta) = \frac{-1}{y_l} \cdot \frac{\partial y_l}{\partial a^L_i}, \qquad \frac{\partial y_l}{\partial a^L_i} = \frac{\partial}{\partial a^L_i}\left(\frac{e^{a^L_l}}{\sum_{j=1}^{k} e^{a^L_j}}\right)$$
This is of the form
$$\frac{\partial}{\partial x}\left(\frac{g(x)}{h(x)}\right) = \frac{\dfrac{\partial g(x)}{\partial x} h(x) - g(x) \dfrac{\partial h(x)}{\partial x}}{\left(h(x)\right)^2} = \left(\frac{\partial g(x)}{\partial x} \cdot \frac{1}{h(x)}\right) - \left(\frac{g(x)}{\left(h(x)\right)^2} \cdot \frac{\partial h(x)}{\partial x}\right)$$
Here,
$e^{a^L_l}$ plays the role of $g$: $g(a^L_i) = e^{a^L_l}$ (its derivative w.r.t. $a^L_i$ is non-zero only when $i = l$);
$\sum_{j=1}^{k} e^{a^L_j} = e^{a^L_1} + e^{a^L_2} + \dots + e^{a^L_k}$ plays the role of $h$: $h(a^L_i)$ (it contains the term with $j = i$);
and $a^L_i = x$.
Hence,
$$\frac{\partial}{\partial a^L_i}\left(\frac{e^{a^L_l}}{\sum_{j=1}^{k} e^{a^L_j}}\right) = \frac{\partial}{\partial x}\left(\frac{g(x)}{h(x)}\right) = \left(\frac{\partial g(x)}{\partial x} \cdot \frac{1}{h(x)}\right) - \left(\frac{g(x)}{\left(h(x)\right)^2} \cdot \frac{\partial h(x)}{\partial x}\right)$$
$$\frac{\partial}{\partial a^L_i}\left(\frac{e^{a^L_l}}{\sum_{j=1}^{k} e^{a^L_j}}\right) = \left(\frac{\partial e^{a^L_l}}{\partial a^L_i} \cdot \frac{1}{\sum_{j=1}^{k} e^{a^L_j}}\right) - \left(\frac{e^{a^L_l}}{\left(\sum_{j=1}^{k} e^{a^L_j}\right)^2} \cdot \frac{\partial \sum_{j=1}^{k} e^{a^L_j}}{\partial a^L_i}\right)$$
$$= \left(\frac{\partial e^{a^L_l}}{\partial a^L_i} \cdot \frac{1}{\sum_{j=1}^{k} e^{a^L_j}}\right) - \left(\frac{e^{a^L_l}}{\sum_{j=1}^{k} e^{a^L_j}} \cdot \frac{1}{\sum_{j=1}^{k} e^{a^L_j}} \cdot \frac{\partial \sum_{j=1}^{k} e^{a^L_j}}{\partial a^L_i}\right)$$
$$= \left(\frac{\partial e^{a^L_l}}{\partial a^L_i} \cdot \frac{1}{\sum_{j=1}^{k} e^{a^L_j}}\right) - \left(\text{softmax}(a^L_l) \cdot \frac{e^{a^L_i}}{\sum_{j=1}^{k} e^{a^L_j}}\right)$$
using $\dfrac{\partial}{\partial a^L_i} \sum_{j=1}^{k} e^{a^L_j} = e^{a^L_i}$, so that
$$= \left(\frac{\partial e^{a^L_l}}{\partial a^L_i} \cdot \frac{1}{\sum_{j=1}^{k} e^{a^L_j}}\right) - \left(\text{softmax}(a^L_l) \cdot \text{softmax}(a^L_i)\right)$$
$$\left(\frac{\partial e^{a^L_l}}{\partial a^L_i} \cdot \frac{1}{\sum_{j=1}^{k} e^{a^L_j}}\right) - \left(\text{softmax}(a^L_l) \cdot \text{softmax}(a^L_i)\right) = \begin{cases} \dfrac{e^{a^L_l}}{\sum_{j=1}^{k} e^{a^L_j}} - \left(\text{softmax}(a^L_l) \cdot \text{softmax}(a^L_i)\right), & \text{if } l = i \\ -\left(\text{softmax}(a^L_l) \cdot \text{softmax}(a^L_i)\right), & \text{otherwise} \end{cases}$$
since $\dfrac{\partial e^{a^L_l}}{\partial a^L_i} = e^{a^L_l}$ if $i = l$ and 0 otherwise.
$$\frac{\partial}{\partial a^L_i}\left(\frac{e^{a^L_l}}{\sum_{j=1}^{k} e^{a^L_j}}\right) = \begin{cases} \text{softmax}(a^L_l) - \left(\text{softmax}(a^L_l) \cdot \text{softmax}(a^L_i)\right), & \text{if } l = i \\ -\left(\text{softmax}(a^L_l) \cdot \text{softmax}(a^L_i)\right), & \text{otherwise} \end{cases}$$
Which can be written more compactly as
$$\frac{\partial}{\partial a^L_i}\left(\frac{e^{a^L_l}}{\sum_{j=1}^{k} e^{a^L_j}}\right) = \mathbb{1}_{(l=i)}\,\text{softmax}(a^L_l) - \text{softmax}(a^L_l) \cdot \text{softmax}(a^L_i)$$
But $y_l = \text{softmax}(a^L_l)$ and $y_i = \text{softmax}(a^L_i)$.
Therefore
$$\frac{\partial y_l}{\partial a^L_i} = \frac{\partial}{\partial a^L_i}\left(\frac{e^{a^L_l}}{\sum_{j=1}^{k} e^{a^L_j}}\right) = \mathbb{1}_{(l=i)}\, y_l - y_l \cdot y_i$$
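This is exactly the softmax Jacobian, $J_{li} = \mathbb{1}_{(l=i)}\, y_l - y_l\, y_i$. A short sketch that builds it with NumPy and checks one column against finite differences (the diag/outer construction is my own phrasing of the same formula):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

a = np.array([2.0, 1.0, 0.1])
s = softmax(a)

# J[l, i] = 1(l == i) * y_l - y_l * y_i
J = np.diag(s) - np.outer(s, s)

# finite-difference check of the column i = 0
eps = 1e-6
ap, am = a.copy(), a.copy()
ap[0] += eps
am[0] -= eps
col0 = (softmax(ap) - softmax(am)) / (2 * eps)
print(np.max(np.abs(J[:, 0] - col0)))  # ~1e-10, the two forms agree
```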
And
$$\frac{\partial}{\partial a^L_i} L(\theta) = \frac{-1}{y_l} \cdot \left[\mathbb{1}_{(l=i)}\, y_l - y_l \cdot y_i\right]$$
Finally,
$$\frac{\partial}{\partial a^L_i} L(\theta) = -\left(\mathbb{1}_{(l=i)} - y_i\right)$$
We can now write the gradient with respect to the vector $a^L$:
$$\nabla_{a^L} L(\theta) = \begin{bmatrix} \dfrac{\partial L(\theta)}{\partial a^L_1} \\ \dfrac{\partial L(\theta)}{\partial a^L_2} \\ \dfrac{\partial L(\theta)}{\partial a^L_3} \\ \vdots \\ \dfrac{\partial L(\theta)}{\partial a^L_k} \end{bmatrix} = \begin{bmatrix} -\left(\mathbb{1}_{(l=1)} - y_1\right) \\ -\left(\mathbb{1}_{(l=2)} - y_2\right) \\ -\left(\mathbb{1}_{(l=3)} - y_3\right) \\ \vdots \\ -\left(\mathbb{1}_{(l=k)} - y_k\right) \end{bmatrix}$$
Hence,
$$\nabla_{a^L} L(\theta) = -(e(l) - y)$$
where $e(l)$ is a $k$-dimensional one-hot vector whose $l$-th element is 1 ($l$ being the true class label), and $y$ is the vector containing the predicted probability of the input belonging to each of the $k$ classes.
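Putting the whole derivation together, $\nabla_{a^L} L(\theta) = -(e(l) - y) = y - e(l)$. A minimal end-to-end check against a finite-difference gradient of $-\log(\text{softmax}(a^L)_l)$ (names and values are illustrative):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

a = np.array([2.0, 1.0, 0.1])  # output-layer pre-activations a^L
l = 1                          # true class label
k = a.size

e_l = np.zeros(k); e_l[l] = 1.0
grad = softmax(a) - e_l        # closed form: -(e(l) - y) = y - e(l)

# finite-difference gradient of L(theta) = -log(softmax(a)_l)
eps = 1e-6
num = np.zeros(k)
for i in range(k):
    ap, am = a.copy(), a.copy()
    ap[i] += eps
    am[i] -= eps
    num[i] = (-np.log(softmax(ap)[l]) + np.log(softmax(am)[l])) / (2 * eps)
print(grad)  # approximately [ 0.659 -0.758  0.099]
print(num)   # matches the closed form
```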