Backpropagation
Designing, Visualizing and Understanding Deep Neural Networks
CS W182/282A
Instructor: Sergey Levine
UC Berkeley
Neural networks
Drawing computation graphs
what expression does this compute?
equivalently, what program does this correspond to?
this is an MSE loss with a linear regression model
neural networks are computation graphs
if we design generic tools for computation graphs, we can train many kinds of neural networks
Drawing computation graphs
what expression does this compute?
equivalently, what program does this correspond to?
a simpler way to draw the same thing: a dot product
this is an MSE loss with a linear regression model
neural networks are computation graphs
if we design generic tools for computation graphs, we can train many kinds of neural networks
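To make the "computation graph = program" correspondence concrete, here is a minimal sketch (my own example, not the slide's notation) that evaluates the linear-regression-plus-MSE graph one node at a time; the variable names are illustrative.

```python
import numpy as np

# Each node of the computation graph is one line of the program.
x = np.array([1.0, 2.0])        # input vector
W = np.array([[0.5, -0.3]])     # weights of the linear regression model
y = np.array([1.5])             # regression target

z = W @ x                       # node 1: linear model, z = W x
e = z - y                       # node 2: residual
loss = 0.5 * np.sum(e ** 2)     # node 3: MSE loss
```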
Logistic regression
remember this is a vector!
let's draw the computation graph for logistic regression with the negative log-likelihood loss
what does this produce? a "one-hot" vector, e.g., [1, 0] or [0, 1]
Logistic regression
a simpler way to draw the same thing (the weights form a matrix)
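As an illustration of this graph (a sketch under my own naming, for a 2-class problem): a weight matrix, a softmax, and the negative log-likelihood against a one-hot label.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))    # subtract max for numerical stability
    return e / np.sum(e)

x = np.array([1.0, 2.0])         # input vector
W = np.random.randn(2, 2) * 0.1  # weight matrix: one row per class
y = np.array([1.0, 0.0])         # one-hot label: [1, 0] or [0, 1]

z = W @ x                        # linear layer
p = softmax(z)                   # class probabilities
loss = -np.sum(y * np.log(p))    # negative log-likelihood loss
```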
Drawing it even more concisely
Notice that we have two types of variables: inputs/intermediate values and parameters
the parameters usually affect one specific operation
(though there is often parameter sharing, e.g., conv nets – more on this later)
also called a fully connected layer
Neural network diagrams
(simplified) computation graph diagram vs. neural network diagram
[diagram: linear layer (2x1 -> 2x1) -> softmax -> cross-entropy loss]
often we don't draw the parameters, because every layer has parameters
often we don't draw the cross-entropy loss, because cross-entropy always follows softmax
simplified drawing: linear layer -> softmax (2x1 -> 2x1)
Logistic regression with features
which layer
Learning the features
which feature
Problem: how do we represent the learned features? (= rows of the weight matrix)
Idea: what if each feature is a (binary) logistic regression output?
per-element sigmoid
not the same as softmax
each feature is independent
Let’s draw this!
[diagram: input (2x1) -> linear layer (3x2 weights) -> sigmoid (3x1) -> linear layer -> softmax -> cross-entropy loss]
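A minimal forward-pass sketch of this graph (dimensions and names are illustrative, not the lecture's notation): a linear layer, a per-element sigmoid producing independent features, a second linear layer, softmax, and the cross-entropy loss.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

x  = np.array([1.0, -0.5])        # input (2x1)
W1 = np.random.randn(3, 2) * 0.1  # first linear layer (3x2)
W2 = np.random.randn(2, 3) * 0.1  # second linear layer
y  = np.array([0.0, 1.0])         # one-hot label

a   = W1 @ x                      # linear layer
phi = sigmoid(a)                  # per-element sigmoid: each feature is independent
z   = W2 @ phi                    # second linear layer
p   = softmax(z)                  # softmax (unlike sigmoid, normalizes across classes)
loss = -np.sum(y * np.log(p))     # cross-entropy loss
```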
Simpler drawing
[diagram: linear layer -> sigmoid -> linear layer -> softmax -> cross-entropy loss (2x1 -> 3x1 -> 3x1)]
a simpler way to draw the same thing: drop the explicit loss node (linear layer -> sigmoid -> linear layer -> softmax, 2x1 -> 3x1 -> 2x1)
even simpler: merge each linear layer with its non-linearity into a single "sigmoid layer" / "softmax layer" box (2x1 -> 3x1 -> 2x1)
Doing it multiple times
[diagram: sigmoid layer -> sigmoid layer -> sigmoid layer -> linear layer -> softmax; activations 2x1 -> 3x1 -> 3x1 -> 3x1 -> 2x1, weight matrices 3x2, 3x3, 3x3]
Activation functions
we don’t have to use a sigmoid!
a wide range of non-linear functions will work; these are called activation functions (we'll discuss specific choices later)
why non-linear? multiple linear layers = one linear layer
enough layers = we can represent anything (so long as they're nonlinear)
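A few common activation functions, sketched in NumPy (illustrative choices; specific recommendations come later in the course), plus the reason non-linearity matters.

```python
import numpy as np

# Common elementwise activation functions.
def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def tanh(a):
    return np.tanh(a)

def relu(a):
    return np.maximum(0.0, a)

# Why non-linear? Without an activation, stacked linear layers collapse:
# W2 @ (W1 @ x) == (W2 @ W1) @ x, which is just one linear layer.
```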
Demo time!
Source: [Link]
Aside: what’s so neural about it?
biological neuron:
dendrites receive signals from other neurons
the neuron "decides" whether to fire based on incoming signals
the axon transmits the signal to downstream neurons
artificial "neuron" (also referred to as a "unit"):
sums up signals (upstream activations) from upstream neurons
"decides" how much to fire based on incoming signals (the activation function)
activations are transmitted to downstream units
Training neural networks
What do we need?
1. Define your model class
[diagram: sigmoid layer -> sigmoid layer -> sigmoid layer -> linear layer -> softmax (2x1 -> 3x1 -> 3x1 -> 3x1 -> 2x1)]
2. Define your loss function: negative log-likelihood, just like before
3. Pick your optimizer: stochastic gradient descent (what do we need to run it?)
4. Run it on a big GPU
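Putting the four steps together, here is a hedged sketch of the training loop for the simplest case (logistic regression), since its gradient is easy to write down; the dataset and names are made up for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

# 1. model class: a single linear layer + softmax (logistic regression)
W = np.zeros((2, 2))

# toy dataset (illustrative)
xs = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
ys = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]   # one-hot labels

# 2. loss: negative log-likelihood   3. optimizer: stochastic gradient descent
lr = 0.1
for step in range(100):
    i = np.random.randint(len(xs))       # sample one example (the "stochastic" part)
    x, y = xs[i], ys[i]
    p = softmax(W @ x)                   # forward pass
    grad_W = np.outer(p - y, x)          # dL/dW for softmax + cross-entropy
    W -= lr * grad_W                     # SGD update
# 4. for real networks, run this on a big GPU with many more steps
```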
Aside: chain rule
High-dimensional chain rule
Row or column? The convention used in this lecture differs from the one in some textbooks.
Just two different conventions!
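For reference, a sketch of the high-dimensional chain rule in the Jacobian (numerator-layout) convention; the other convention simply transposes each factor and reverses the order.

```latex
% for y = f(x) and z = g(y):
\frac{\partial z}{\partial x}
  = \frac{\partial z}{\partial y}\,\frac{\partial y}{\partial x},
\qquad
\left(\frac{\partial y}{\partial x}\right)_{ij}
  = \frac{\partial y_i}{\partial x_j}
```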
Chain rule for neural networks
A neural network is just a composition of functions
So we can use chain rule to compute gradients!
[diagram: the two-layer network from before (linear -> sigmoid -> linear -> softmax -> cross-entropy loss)]
Does it work?
We can calculate each of these Jacobians!
Example:
Why might this be a bad idea? (multiplying full Jacobians left to right builds large intermediate matrices)
Doing it more efficiently
Idea: start on the right
this is always true because the loss is scalar-valued!
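A quick numerical sketch (my own example) of why starting on the right is cheaper: because the loss is a scalar, its gradient with respect to the network output is a single vector, so multiplying from the right uses only vector-matrix products instead of matrix-matrix products.

```python
import numpy as np

n = 1000
J1 = np.random.randn(n, n)        # Jacobian of an earlier layer
J2 = np.random.randn(n, n)        # Jacobian of a later layer
dL = np.random.randn(1, n)        # gradient of the scalar loss w.r.t. the output (a row vector)

grad_slow = dL @ (J2 @ J1)        # left-to-right: an n x n matrix-matrix product, O(n^3)
grad_fast = (dL @ J2) @ J1        # right-to-left: two vector-matrix products, O(n^2)

assert np.allclose(grad_slow, grad_fast)
```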
The backpropagation algorithm
"Classic" version
[diagram: sigmoid layer -> sigmoid layer -> sigmoid layer -> linear layer -> softmax (input 2x1)]
Let’s walk through it…
[diagram: linear layer -> sigmoid -> linear layer -> softmax -> cross-entropy loss (2x1 -> 3x1 -> 3x1)]
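To make the walkthrough concrete, here is a hedged end-to-end sketch of backpropagation on this exact graph (linear layer, sigmoid, linear layer, softmax, cross-entropy); dimensions and variable names are illustrative, and a finite-difference check is included as a sanity test.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

x  = np.array([1.0, -0.5])          # input
y  = np.array([0.0, 1.0])           # one-hot label
W1 = np.random.randn(3, 2) * 0.1
W2 = np.random.randn(2, 3) * 0.1

# forward pass: cache every intermediate value
a    = W1 @ x                       # linear layer
phi  = sigmoid(a)                   # sigmoid layer
z    = W2 @ phi                     # linear layer
p    = softmax(z)
loss = -np.sum(y * np.log(p))       # cross-entropy loss

# backward pass: walk the graph from the loss back to the parameters
dz   = p - y                        # dL/dz for softmax + cross-entropy
dW2  = np.outer(dz, phi)            # dL/dW2
dphi = W2.T @ dz                    # dL/dphi, passed back through the linear layer
da   = dphi * phi * (1.0 - phi)     # dL/da: sigmoid derivative is phi * (1 - phi)
dW1  = np.outer(da, x)              # dL/dW1

# sanity check: compare one entry of dW1 against a finite difference
eps = 1e-5
W1p = W1.copy(); W1p[0, 0] += eps
loss_p = -np.sum(y * np.log(softmax(W2 @ sigmoid(W1p @ x))))
print(dW1[0, 0], (loss_p - loss) / eps)   # the two numbers should roughly agree
```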
Practical implementation
Neural network architecture details
Some things we should figure out:
How many layers?
How big are the layers?
What type of activation function?
Bias terms
additional parameters in each linear layer
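A small sketch of what the bias adds (illustrative names): each linear layer gets one extra parameter vector, added after the matrix multiply.

```python
import numpy as np

# Linear layer with a bias term: z = W x + b
W = np.random.randn(3, 2) * 0.1   # weight matrix
b = np.zeros(3)                   # bias: the additional parameter vector
x = np.array([1.0, -0.5])

z = W @ x + b
```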
What else do we need for backprop?
Backpropagation recipes: linear layer
(just to simplify notation!)
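A hedged sketch of the linear-layer recipe in code (my notation): given the upstream gradient delta = dL/dz for z = W x + b, the layer produces gradients for its own parameters and passes a gradient back to its input.

```python
import numpy as np

def linear_backward(W, x, delta):
    """Backward recipe for z = W x + b, given delta = dL/dz."""
    dW = np.outer(delta, x)   # dL/dW: outer product of upstream gradient and input
    db = delta                # dL/db: the bias gradient is the upstream gradient
    dx = W.T @ delta          # dL/dx: passed back to the previous layer
    return dW, db, dx
```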
Backpropagation recipes: sigmoid
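A hedged sketch of the sigmoid recipe (my notation): the sigmoid has no parameters, so the recipe only transforms the upstream gradient using sigmoid'(a) = phi (1 - phi), applied elementwise.

```python
import numpy as np

def sigmoid_backward(phi, delta):
    """Backward recipe for phi = sigmoid(a), given delta = dL/dphi."""
    return delta * phi * (1.0 - phi)   # dL/da (elementwise)
```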
Backpropagation recipes: ReLU
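A hedged sketch of the ReLU recipe (my notation): the derivative of max(0, a) is 1 where a > 0 and 0 elsewhere, so the recipe simply masks the upstream gradient.

```python
import numpy as np

def relu_backward(a, delta):
    """Backward recipe for r = max(0, a), given delta = dL/dr."""
    return delta * (a > 0)   # dL/da: zero out gradients where the unit was inactive
```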
Summary
backpropagation: run a forward pass caching every intermediate value, then apply each layer's backward recipe from the loss back to the parameters.