Neural Networks Homework Guide
Start Here
• Collaboration policy:
– You are expected to comply with the University Policy on Academic Integrity and Plagiarism.
– You are allowed to help your friends debug
– You are allowed to look at your friend's code
– You are allowed to copy math equations from any source that are not in code form
– You are not allowed to type code for your friend
– You are not allowed to look at your friend's code while typing your solution
– You are not allowed to copy and paste solutions off the internet
– You are not allowed to import pre-built or pre-trained models
– You can share ideas but not code; you must submit your own code. All code submitted will be
compared with all code submitted this semester and in previous semesters using MOSS.
We encourage you to meet regularly with your study group to discuss and work on your homework.
You will not only learn more, you will also be more efficient that way. However, as noted above, the
actual code used to obtain the final submission must be entirely your own.
• Directions:
– You are required to do this assignment in the Python (version 3) programming language. Do
not use any auto-differentiation toolboxes (PyTorch, TensorFlow, Keras, etc.) - you are only
permitted and recommended to vectorize your computation using the Numpy library.
– We recommend that you look at all the problems before trying to solve the first one. However,
we recommend that you complete the problems in order, as the difficulty increases, and questions
often rely on the completion of previous questions.
Homework Objectives
In this homework, you will learn how to implement and train an entire MLP from scratch, on your own.
You will learn
• to write code for all the components that comprise a simple MLP;
• to chain these components up to actually compose a complete MLP of any depth;
• to implement losses to train the network parameters;
• how to backpropagate the derivatives of those losses through the network, to compute loss derivatives
with respect to all network parameters;
• how to incorporate those derivatives into stochastic gradient descent (SGD) to update network param-
eters;
• how to implement at least one common regularization method, namely batch normalization, to improve
training.
This homework comes with an optional, separately posted bonus part, in which you will also learn to
implement other optimizers including ADAM, and another key regularization technique, dropout.
Checklist
Here is a checklist page that you can use to keep track of your progress as you go through the write-up
and implement the corresponding sections in your starter notebook. As you complete each function in the
notebook, you can check the corresponding boxes aligned with each section.
1. Getting Started
Download code handout and extract the file
Install required python libraries
Read the whole assignment write-up for an overview
Watch the 10-minute video "Backpropagation" by 3Blue1Brown (3B1B) to understand the backward methods¹.
2. Complete the Components of a Multilayer Perceptron Model
Revisit lecture 2 about linear classifier, activation function, and perceptron
Complete the linear layer class
Complete the 4 activation functions
3. Complete 3 Multilayer Perceptron Models using Components Built
Revisit lecture 2 about MLP
Write an MLP model with 0 hidden layers
Write an MLP model with 1 hidden layer
Write an MLP model with 4 hidden layers
4. Implement the Criterion Functions to evaluate a machine learning model
Revisit lecture 3 about Loss
Implement Mean Squared Error (MSE) Loss for regression models
Implement Cross-Entropy Loss for classification models
5. Implement an Optimizer to train a machine learning model
Revisit lecture 6 about momentum, and lecture 7 about SGD
Implement SGD optimizer
6. Implement a Regularization method: Batch Normalization
Revisit lecture 8 about Batch normalization
Translate the element-wise equations to matrix equations
Write the code based on the matrix equations you wrote
7. Hand-in
Set all flags to True in hw1p1_autograder_flags.py
Make sure you pass all test cases in the local autograder
Make the handin.tar file and submit to autolab
Contents
1 Introduction to MyTorch series
3 Notation
10 Regularization [20 points]
10.1 Batch Normalization [mytorch.nn.BatchNorm1d]
10.1.1 Batch Normalization Forward Training Equations (When eval = False)
10.1.2 Batch Normalization Forward Inference Equations (When eval = True)
10.1.3 Batch Normalization Backward Equations
11 Appendix
11.1 Anaconda Installation and Setup Instructions
1 Introduction to MyTorch series
In this series of homework assignments, you will implement your own deep-learning library from scratch.
Inspired by PyTorch, your library – MyTorch – will be used to create everything from multilayer perceptrons
(MLPs) and convolutional neural networks (CNNs) to recurrent neural networks with gated recurrent units
(GRUs) and long short-term memory (LSTM) structures. This is an ambitious undertaking, and we are here
to help you through the entire process. At the end of this work, you will understand forward propagation,
loss calculation, backward propagation, and gradient descent.
The culmination of all of the Homework Part 1s will be your own custom deep learning library, MyTorch©,
along with detailed examples. It is structured similarly to popular deep learning libraries like PyTorch
and TensorFlow, and you can easily import and reuse code modules for subsequent homeworks.
In this assignment, we will start by creating the core components of multilayer perceptrons: linear layers,
activation functions, and batch normalization. Then, you will implement loss functions and a stochastic
gradient descent (SGD) optimizer in MyTorch. The auto-grader tests will compare the outputs of your MyTorch
methods and class attributes with a reference PyTorch solution. We have made the necessary components
of these classes and class functions as explicit as possible. Your job is to understand how all the components
are related and implement the mathematics into code.
In looking at the mathematics, you will be coding the equations needed to build a simple Neural Network
Layer. This includes forward and backward propagation for the activations, loss functions, linear layers, and
batch normalization. If you have challenges going from math to code, consider the shapes involved and do
what you can to make the operations possible.
Welcome, and we are grateful to be with you on this journey!
Extracting the handout will create a directory called HW1P1 with the following file structure:
HW1P1
    mytorch
        nn
            linear.py
            activation.py
            loss.py
            batchnorm.py
        optim
            sgd.py
    models
        mlp.py
    hw1p1_autograder_flags.py
    hw1p1_autograder.py
    requirements.txt
Apart from the above major files, there are files named __init__.py that don’t have to be edited. You may
also see other files like .DS_STORE or __pycache__ folders that you should not be concerned about.
² The handout might have an extension like handout.tar.112. In such a case, you will first have to rename the downloaded
file as handout.tar by removing the .112 extension and then untar the file.
• Install Anaconda and Setup Environment
Please refer to the Appendix section for detailed instructions on setting up Anaconda environment on different
operating systems (Windows, macOS, and Linux).
• Follow the writeup and edit code files
The sections of the writeup are ordered to help you build the MyTorch library incrementally. Each section
has a corresponding Python file that contains code for classes implementing the theory in that section.
For instance, the section on activation functions corresponds to activation.py, the neural network models
section to mlp.py, etc. You need to edit these files according to the writeup.
Another thing to note is the file structure. The mytorch folder contains code for individual components like
linear layers, optimizer, losses, etc. These components are independent of each other. The models folder
on the other hand has code for an entire neural network that uses some of these independent components.
You can follow a similar structure if you try to write the entire code from scratch.
Lastly, these files do not test your code by themselves. There is a separate hw1p1_autograder.py to help
you do that. It runs some local tests as a preliminary check of code correctness. Instructions for using it are
given below.
• Autograde your code by
– Step 1: Open your preferred IDE or code editor, locate and open the desired .py file, make the
necessary edits, and save the changes.
– Step 2 (IMPORTANT): Set the flags in hw1p1_autograder_flags.py to True to test any
individual component on your local autograder. For example, if you have only implemented the sigmoid
activation function, set DEBUG_AND_GRADE_SIGMOID_flag = True and everything else to False.
– Step 3: Run the local autograder: confirm that you are in the top-level directory and execute
the following in the Anaconda Prompt or terminal:
python hw1p1_autograder.py
Please keep in mind that the local autograder only has a few tests as a preliminary check. The entire suite
of tests is run on Autolab after you hand in your code, as described below.
• Hand in your code by running the following command from the top-level directory, then SUBMIT the
created handin.tar file to Autolab³:
tar -cvf handin.tar models mytorch
• DO
– Make sure you understand the concept of each function; we don't want you to "translate" math
equations into code without understanding them.
– Go through the examples we provide to have a better visualization of the matrix calculations. If
you ask TAs for help, we will ask you to explain the example to us before giving you more hints.
• DO NOT
– Import external libraries other than numpy in your submission, as extra packages that do
not exist on Autolab will cause submission failures.⁴ Libraries like PyTorch, TensorFlow, and Keras
are not allowed.
– Add, move, or remove any files or change any filenames.
³ If you are a Windows user, navigate to the HW1P1 directory and run "create_tarball.sh" in the terminal.
⁴ We are not intending to make the numpy restriction arbitrarily prohibitive. You can use os, sys, matplotlib, and other
libraries needed to get familiar with your environment and what is going on. However, Autolab expects only numpy, math,
and scipy. Remove other libraries when making the submission.
• Scoring:
The homework comprises several sections. You get points for each section. Within any individual section,
however, you are expected to pass all tests within the section to get the score for it. Sections do not have
partial credit.
The local autograder provided to you is very detailed. You will be able to isolate and verify individual
components of the sections with it. This can help you identify any issues or bugs in your code that need to
be addressed. Make sure you get full points on the local autograder for any section before submitting it to
Autolab.
3 Notation
Numpy Tips:
• Use A * B for element-wise multiplication A ⊙ B.
• Use A @ B for matrix multiplication A · B.
• Use A / B for element-wise division A ⊘ B.
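For instance, a quick illustrative session (the matrices here are made up purely for demonstration):

import numpy as np

A = np.array([[1., 2.], [3., 4.]])
B = np.array([[5., 6.], [7., 8.]])

print(A * B)   # element-wise product  A ⊙ B -> [[ 5. 12.] [21. 32.]]
print(A @ B)   # matrix product        A · B -> [[19. 22.] [43. 50.]]
print(A / B)   # element-wise division A ⊘ B -> [[0.2  0.333...] [0.428... 0.5 ]]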
4 The Big Picture
We can think of a neural network (NN) as a mathematical function which takes input data x and computes
an output y:
y = fNN (x)
For example, a model trained to identify spam emails takes in an email as input data x, and outputs 0 or 1
indicating whether the email is spam.
The function fNN has a particular form: it is a nested function. In lecture, we learnt the concept of network
layers. So, for a 3-layer neural network that returns a scalar, fNN looks like this:
y = fNN (x) = f3 (f2 (f1 (x)))
In the above equation, f1 and f2 are vector functions of the following form:
fl (z) = gl (Wl · z + bl )
where l is called the layer index. The function gl is called an activation function (e.g. ReLU, Sigmoid).
The parameters Wl (weight matrix) and bl (bias vector) for each layer are learnt using gradient descent
by optimizing a particular loss function⁵ depending on the task.
In this assignment, we will create one neural network architecture, the multilayer perceptron
(MLP). Refer to Figure A.
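As a purely illustrative sketch of this nested structure (the layer sizes and random weights below are made up; they are not part of the assignment):

import numpy as np

def make_layer(W, b, g):
    # f_l(z) = g_l(W_l · z + b_l): an affine map followed by an activation g_l
    return lambda z: g(W @ z + b)

relu = lambda z: np.maximum(z, 0.0)
identity = lambda z: z

# a toy 3-layer f_NN acting on a 4-dimensional input
f1 = make_layer(np.random.randn(5, 4), np.random.randn(5), relu)
f2 = make_layer(np.random.randn(3, 5), np.random.randn(3), relu)
f3 = make_layer(np.random.randn(1, 3), np.random.randn(1), identity)

x = np.random.randn(4)
y = f3(f2(f1(x)))   # y = f_NN(x): a nested composition of per-layer functions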
5 Neural Network Layers [15 Points]
5.1 Linear Layer [mytorch.nn.Linear]
Linear layers, also known as fully-connected layers, connect every input neuron to every output neuron
and are commonly used in neural networks. Refer to Figure A for the visual representation of a linear layer.
In this section, your task is to implement the Linear class in file linear.py:
• Class attributes:
– Learnable model parameters weight W, bias b.
– Variables stored during forward-propagation to compute derivatives during back-propagation:
layer input A, batch size N.⁶
– Variables stored during back-propagation to train model parameters: dLdW, dLdb.
• Class methods:
– __init__: Two parameters define a linear layer: in_features (Cin ) and out_features (Cout ). Zero-
initialize weight W and bias b based on these inputs. Refer to Table 1 to see how the shapes of W
and b are related to the inputs.
– forward: forward method takes in a batch of data A of shape N × Cin (representing N samples
where each sample has Cin features), and computes output Z of shape N × Cout – each data
sample is now represented by Cout features.
– backward: backward method takes in dLdZ, how changes in the layer output Z affect loss L. It
calculates and stores dLdW and dLdb – how changes in the layer weights and bias affect the
loss – which are used to improve the model. It returns dLdA, how changes in the layer inputs affect the
loss, to enable downstream computation.
Please consider the following class structure:
class Linear:

    def __init__(self, in_features, out_features):
        # TODO

    def forward(self, A):
        # TODO
        return Z

    def backward(self, dLdZ):
        # TODO
        return dLdA
⁶ Important: We will introduce the concept of "batch" in lecture 7; for now, think of the batch size as the number of input samples.
Table 1: Linear Layer Components

| Code Name    | Math  | Type   | Shape       | Meaning                                              |
|--------------|-------|--------|-------------|------------------------------------------------------|
| N            | N     | scalar | -           | batch size                                           |
| in_features  | Cin   | scalar | -           | number of input features                             |
| out_features | Cout  | scalar | -           | number of output features                            |
| A            | A     | matrix | N × Cin     | batch of N inputs, each represented by Cin features  |
| Z            | Z     | matrix | N × Cout    | batch of N outputs, each represented by Cout features |
| W            | W     | matrix | Cout × Cin  | weight parameters                                    |
| b            | b     | matrix | Cout × 1    | bias parameters                                      |
| dLdZ         | ∂L/∂Z | matrix | N × Cout    | how changes in outputs affect loss                   |
| dLdA         | ∂L/∂A | matrix | N × Cin     | how changes in inputs affect loss                    |
| dLdW         | ∂L/∂W | matrix | Cout × Cin  | how changes in weights affect loss                   |
| dLdb         | ∂L/∂b | matrix | Cout × 1    | how changes in bias affect loss                      |
Z = A · W^T + ιN · b^T ∈ R^(N × Cout)    (1)
In the above equations, dZdA, dZdW, and dZdb represent how the input, weight matrix, and bias, respectively,
affect the output of the linear layer.
Now, Z, A, and W are all two-dimensional matrices (see Table 1 above). dZdA would have derivative terms
corresponding to each term of Z with respect to each term of A, and hence would be a 4-dimensional tensor.
Similarly, dZdW would be 4-dimensional and dZdb would be 3-dimensional (since b is 1-dimensional). These
high-dimensional matrices would be sparse (many terms would be 0) as only some pairs of terms have a
dependence. So, to make things simpler and avoid dealing with high-dimensional intermediate tensors, the
derivative equations given above are simplified to the below form:
∂L/∂A = (∂L/∂Z) · W ∈ R^(N × Cin)    (5)

∂L/∂W = (∂L/∂Z)^T · A ∈ R^(Cout × Cin)    (6)

∂L/∂b = (∂L/∂Z)^T · ιN ∈ R^(Cout × 1)    (7)
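To see how equations (1) and (5)–(7) might map to Numpy, here is a minimal sketch (a rough illustration, not the reference solution; the attribute names follow Table 1, and the column of ones stands in for ιN):

import numpy as np

class Linear:
    def __init__(self, in_features, out_features):
        # zero-initialize the learnable parameters
        self.W = np.zeros((out_features, in_features))
        self.b = np.zeros((out_features, 1))

    def forward(self, A):
        self.A = A                              # store input for the backward pass
        self.N = A.shape[0]                     # batch size
        ones = np.ones((self.N, 1))             # iota_N, a column of ones
        Z = A @ self.W.T + ones @ self.b.T      # Eq. (1)
        return Z

    def backward(self, dLdZ):
        ones = np.ones((self.N, 1))
        dLdA = dLdZ @ self.W                    # Eq. (5)
        self.dLdW = dLdZ.T @ self.A             # Eq. (6)
        self.dLdb = dLdZ.T @ ones               # Eq. (7)
        return dLdA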
6 Activation Functions [10 points]
Congratulations on finishing the first section! Here, we will introduce a few popular activation
functions and show how to implement them.
As a machine learning engineer, you can theoretically choose any differentiable function as the activation
function. The primary purpose of having nonlinear components in the neural network (fNN ) is to allow it to
approximate nonlinear functions. Without activation functions, fNN will always be linear, no matter
how deep it is. The reason is that A · W + b is a linear function, and a linear function of a linear function is
also linear.
Activation functions can either take scalar or vector arguments. Scalar activations apply a function to
a single number. Thus, when they are applied to a vector, they operate element-wise. This one-to-one
dependence between the input and output makes calculating derivatives easier. Popular choices of scalar
activation functions are Sigmoid, ReLU, Tanh, and GELU, as shown in Table 2. More details about
these functions are provided in their respective subsections.
In the case of vector activations, however, each output element may depend on all of the input elements. This
makes calculating derivatives trickier. A popular vector activation is the Softmax, which you will be imple-
menting in addition to the scalar activations mentioned above.
In this section, your task is to implement the Activation class in file activation.py:
• Class attributes:
– Activation functions have no trainable parameters.
– Variables stored during forward-propagation to compute derivatives during back-propagation:
layer output A.
• Class methods:
– forward: forward method takes in a batch of data Z of shape N × C (representing N samples
where each sample has C features), and applies the activation function to Z to compute output
A of shape N × C.
– backward: backward method takes in dLdA, a measure of how the post-activations (output) affect
the loss. Using this and the derivative of the activation function itself, the method calculates and
returns dLdZ, how changes in pre-activation features (input) Z affect the loss L. In the case of
scalar activations, dLdZ is computed as:
dLdZ = dLdA ⊙ ∂A/∂Z    (8)
Here, ∂A/∂Z is the element-wise derivative of A with respect to the corresponding element of Z. In
other words, for one input of size 1 × C, it represents the diagonal of the Jacobian matrix in a
vector of size 1 × C (recall from the lecture that the Jacobian of a scalar activation function is a
diagonal matrix). For a batch of size N , the size of ∂A/∂Z is N × C. ∂A/∂Z is calculated differently for
different scalar activation functions, as you'll see in the respective subsections.
The Jacobian of a vector activation function is not a diagonal matrix. For each input vector
Z(i) (1 × C) and corresponding output vector A(i) (also 1 × C) in the batch, you would calculate
the Jacobian matrix J(i) separately. This would be of size C × C. Then, dLdZ(i) is given by:
dLdZ(i) = dLdA(i) · J(i)    (9)
After calculating each of the 1 × C dLdZ(i) vectors, you can stack them vertically to get the final
N × C dLdZ matrix to return.
Please consider the following class structure for the scalar activations:
class Activation:

    def forward(self, Z):
        self.A = # TODO
        return self.A

    def backward(self, dLdA):
        dLdZ = # TODO
        return dLdZ
The activation function topology is visualized in Figure C; revisit Figure A to see where it fits in the bigger
picture.
Note: By convention in this class, Z is the output of a linear layer and A is the input of a linear layer. Here,
Z is the output from the previous linear layer and A is the input to the next linear layer; i.e., if fl is the
activation function of layer l, then Al+1 = fl (Zl ).
6.1 Sigmoid [ mytorch.nn.Sigmoid ]
6.1.1 Sigmoid Forward Equation
During forward propagation, pre-activation features Z are passed to the activation function Sigmoid to
calculate their post-activation values A.
A = sigmoid.forward(Z)    (10)
  = σ(Z)    (11)
  = 1 / (1 + e^(−Z))    (12)
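As a small illustration, the forward equation above is a one-liner in Numpy (a sketch only; your class should also store A for the backward pass):

import numpy as np

def sigmoid_forward(Z):
    # Eq. (12); np.exp applies element-wise to the N x C input
    return 1.0 / (1.0 + np.exp(-Z))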
6.2 Tanh [mytorch.nn.Tanh]
6.2.1 Tanh Forward Equation
A = tanh.forward(Z)    (17)
  = tanh(Z)    (18)
  = (e^Z − e^(−Z)) / (e^Z + e^(−Z))    (19)
Figure E: Tanh Activation Forward Example
6.3 ReLU [mytorch.nn.ReLU]
6.3.1 ReLU Forward Equation
A = relu.forward(Z)    (22)
  = ?    (23)
6.4 GELU [mytorch.nn.GELU]
6.4.1 GELU Forward Equation
The GELU (Gaussian Error Linear Unit) activation function is defined in terms of the cumulative distribution
function of the standard Gaussian distribution Φ(Z) = P(X ≤ Z) where X ∼ N (0, 1):
A = gelu.forward(Z)    (26)
  = Z Φ(Z)    (27)
  = Z ∫_{−∞}^{Z} (1/√(2π)) exp(−x²/2) dx    (28)
  = (Z/2) [1 + erf(Z/√2)]    (29)
Here, erf refers to the error function which is frequently seen in probability and statistics. It can also take
complex arguments but will take real ones here. Hint: Search the docs of the math and scipy libraries for
help with implementation.
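Following that hint, one possible way to evaluate Eq. (29) is with scipy's vectorized error function (a sketch, not the required implementation):

import numpy as np
from scipy.special import erf   # math.erf works only on scalars; scipy's erf is vectorized

def gelu_forward(Z):
    # Eq. (29): A = (Z / 2) * (1 + erf(Z / sqrt(2)))
    return 0.5 * Z * (1.0 + erf(Z / np.sqrt(2.0)))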
6.5 Softmax [mytorch.nn.Softmax]
6.5.1 Softmax Forward Equation
am = exp(zm) / Σ_{k=1}^{C} exp(zk)    (37)
Here Z was a single vector. Similar calculations can be done for a batch of N vectors.
6.5.2 Softmax Backward Equation
As discussed in the description of the backward method for vector activations earlier in the section, the first
step in backpropagating the derivatives is to calculate the Jacobian for each vector in the batch. Let’s take
the example of an input vector Z (a row of the input data matrix) and corresponding output vector A (a
row of the output matrix calculated by softmax.forward). The Jacobian J is a C × C matrix. Its element
at the m-th row and n-th column is given by:
Jmn = am (1 − am)   if m = n
Jmn = −am an        if m ≠ n    (38)
Similar derivative calculation can be done for all the N vectors in the batch and the resulting vectors can
be stacked up vertically to give the final N × C derivatives matrix.
Some code hints for Softmax are given in the handout to help you with implementation.
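For illustration, one straightforward (per-sample) way to apply Eq. (38) looks like the following; this is only a sketch and may differ from the hints in the handout:

import numpy as np

def softmax_backward(A, dLdA):
    # A, dLdA: N x C. Build the C x C Jacobian for each sample (Eq. 38),
    # then left-multiply it by the corresponding row of dLdA.
    N, C = A.shape
    dLdZ = np.zeros((N, C))
    for i in range(N):
        a = A[i]                            # softmax output for sample i
        J = np.diag(a) - np.outer(a, a)     # J_mn = a_m(1 - a_m) if m == n, else -a_m a_n
        dLdZ[i] = dLdA[i] @ J
    return dLdZ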
7 Neural Network Models [35 points]
In this section, you will bring together the different components you have made so far – linear layers and
activation functions – and create your own Model Class in file models/mlp.py!
• Class attributes:
– layers: a list storing all linear and activation layers in the correct order.
• Class methods:
– forward: forward method takes input data A0 and applies the transformations corresponding to the
layers (linear and activation) sequentially as self.layers[i].forward for i = 0, ..., l − 1,⁸ where
l is the total number of layers, to compute output Al .
– backward: backward method takes in dLdAl , how changes in loss L affect model output Al , and
performs back-propagation from the last layer to the first layer by calling self.layers[i].backward
for i = l − 1, ..., 0. It does not return anything. Note that activation and linear layers don’t need
to be treated differently as both take in the derivative of the loss with respect to the layer’s output
and give back the derivative of the loss with respect to the layer’s input.
Please consider the following class structure:
class Model:

    def __init__(self):
        self.layers = # TODO

    def forward(self, A0):
        # TODO
        return A

    def backward(self, dLdA):
        # TODO
        return dLdA
Note that the variable A in the forward pseudo-code above is reused at every step so that one variable always
holds the current output. In the case of linear layers, it is the same quantity that was written as Z in the
linear layer section. The dLdA in the backward pseudo-code is similar: in the case of activation functions,
after the current dLdA is passed through the activation layer's backward function, it is the same quantity
that was called dLdZ in the activation functions section.
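To make the sequential pattern concrete, the two loops might look roughly like this (a sketch only; it assumes each entry of self.layers exposes forward and backward as described in the previous sections):

def forward(self, A0):
    A = A0
    for layer in self.layers:               # apply linear and activation layers in order
        A = layer.forward(A)
    return A

def backward(self, dLdA):
    for layer in reversed(self.layers):     # propagate derivatives from the last layer to the first
        dLdA = layer.backward(dLdA)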
We will start by building a shallow network with 0 hidden layers in subsection 7.1, and then a slightly deeper
network with 1 hidden layer in subsection 7.2. Finally, we will build a deep neural network with 4 hidden
layers in subsection 7.3. Note: all models have one additional layer for the output mapping, i.e. the total
number of layers l for a model with 1 hidden layer is actually 2.
We do not provide a reference table here. Using what you have learned so far, we encourage you to make a
reference table yourself. Though it takes time, it will aid the debugging process and help make clear your
understanding of the relevant components. If you ask for help, we will likely ask to see the reference table
you have created before attempting to diagnose your issue.
⁸ Python lists are 0-indexed.
7.2 MLP (Hidden Layers = 1) [mytorch.models.MLP1] [10 points]
In this section, your task is to implement the forward and backward attribute functions of the MLP1
class.
The MLP1 topology is visualized in Figure H. You must use the diagram to deduce what the model specifi-
cation is for the linear layers. To facilitate understanding, you should try labelling the graph to show which
parts correspond to which linear layers and activation functions.
• A1 is passed to the next linear layer, and we apply self.layers[2].forward to obtain Z1 .
• Finally, we apply activation function self.layers[3].forward on Z1 to compute model output A2 .
7.3 MLP (Hidden Layers = 4) [mytorch.models.MLP4] [15 points]
In this section, your task is to initialize the MLP4 class and implement the forward and backward attribute
functions.
The MLP4 topology is visualized in Figure I. You must use the diagram to deduce what the model speci-
fication is for the linear layers. To facilitate understanding, you can try labelling the graph to show which
parts correspond to which linear layers and activation functions.
7.3.1 MLP Forward Equations (Hidden Layers = 4)
Given the math equations, can you figure out which class methods of Linear class and Activation class
perform the calculation of which equation?
8 Criterion - Loss Functions [10 points]
Much as you did for activation functions, you will now program some simple loss functions. Different loss
functions may become useful depending on the type of neural network and the type of data you are using. Here
we will program Mean Squared Error (MSE) loss and Cross-Entropy loss. It is important to know how
these are calculated, and how they will be used to update your network. As before, we will provide the
formulas. Each of these functions can be written in fewer than 10 lines of code, so if your code
begins to get more complex than that, you may be overthinking the problem.
In this section, your task is to implement the forward and backward attribute functions of the Loss class
in file loss.py:
• Class attributes:
– Stores model prediction A to compute back-propagation.
– Stores desired output Y to compute back-propagation.
• Class methods:
– forward: forward method takes in model prediction A and desired output Y of the same shape
to calculate and return a loss value L. The loss value is a scalar quantity used to quantify the
mismatch between the network output and the desired output.
– backward: backward method calculates and returns dLdA, how changes in model outputs A affect
loss L. It is used to enable downstream computation, as seen in previous sections.
Please consider the following class structure:
class Loss:

    def forward(self, A, Y):
        # TODO
        return L

    def backward(self):
        dLdA = # TODO
        return dLdA
The loss function topology is visualized in Figure J, which is referenced throughout this document.
Figure J: Loss Function Topology
8.1 MSE Loss [mytorch.nn.MSELoss]
First, we calculate the squared error SE(A, Y ) between the model prediction A and the target Y :
SE(A, Y ) = (A − Y ) ⊙ (A − Y )    (50)
Then we calculate the sum of the squared error SSE, where ιN , ιC are column vectors of size N and C which
contain all 1s:
SSE(A, Y ) = ιN^T · SE(A, Y ) · ιC    (51)
Here, we are calculating the sum of all elements of the N × C matrix SE(A, Y ). The pre-multiplication
with ιN^T sums across rows. Then, the post-multiplication of this product with ιC sums the row sums across
columns to give the final sum as a single number.
Lastly, we calculate the per-component Mean Squared Error (MSE) loss:
MSELoss(A, Y ) = SSE(A, Y ) / (N · C)    (52)
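A minimal Numpy sketch of equations (50)–(52) is given below; the backward expression is simply the derivative of Eq. (52) with respect to A and is included here only as an illustration:

import numpy as np

class MSELoss:
    def forward(self, A, Y):
        self.A, self.Y = A, Y
        N, C = A.shape
        SE = (A - Y) * (A - Y)        # Eq. (50), element-wise squared error
        SSE = np.sum(SE)              # Eq. (51): iota_N^T · SE(A, Y) · iota_C
        return SSE / (N * C)          # Eq. (52)

    def backward(self):
        N, C = self.A.shape
        # derivative of Eq. (52) with respect to A
        return 2.0 * (self.A - self.Y) / (N * C)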
8.2 Cross-Entropy Loss [mytorch.nn.CrossEntropyLoss]
Cross-entropy loss is one of the most commonly used loss functions for probability-based classification prob-
lems. In this course, most of the Part 2 homework problems involve classification, hence you will
use this loss function very often.
Now, each row of A represents the model's prediction of a probability distribution, while each row of Y
represents the target distribution of an input in the batch.
Then, we calculate the cross-entropy H(A, Y) of the distribution Ai relative to the target distribution Yi
for i = 1, ..., N :
Remember that the output of a loss function is a scalar, but now we have a column matrix of size N . To
transform it into a scalar, we can use either the sum or the mean of all the cross-entropies.
Here, we choose to use the mean cross-entropy as the cross-entropy loss, since that is the default for PyTorch
as well:
¹⁰ The matrix division in Equation 55 is element-wise (the formal symbol for the element-wise division operator of two matrices
is ⊘, as noted in the Notation section).
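As a rough illustration only (it assumes, as is common, that A holds raw model outputs which are first pushed through a softmax; check the handout's equations for the exact convention used in this assignment):

import numpy as np

def cross_entropy_loss(A, Y):
    # A, Y: N x C; each row of Y is a target distribution (e.g. one-hot)
    N = A.shape[0]
    expA = np.exp(A - A.max(axis=1, keepdims=True))   # max-shift for numerical stability
    softmax = expA / expA.sum(axis=1, keepdims=True)  # row-wise softmax of A
    H = -np.sum(Y * np.log(softmax), axis=1)          # length-N vector of per-sample cross-entropies
    return H.mean()                                   # mean cross-entropy over the batch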
9 Optimizers [10 points]
In deep learning, optimizers are used to adjust the parameters of a model. The purpose of an optimizer is
to adjust model weights to minimize a loss function.
To recap, we built our own MLP models in Section 7 using the Linear class we built in Section 5 and the
activation classes we built in Section 6, and we have seen how to do forward propagation and backward
propagation for the core components used in neural networks. Forward propagation is used for estimation,
and backward propagation informs us of how changes in parameters affect the loss. In Section 8, we coded
some loss functions, which are the criteria we use to evaluate the quality of our model's estimates. The last
step is to improve our model using the information we learned about how changes in parameters affect the loss.
Your task is to implement the step attribute function of the SGD class in file sgd.py:
• Class attributes:
– l: list of model layers
– L: number of model layers
– lr: learning rate, tunable hyperparameter scaling the size of an update.
– mu: momentum rate µ, tunable hyperparameter controlling how much the previous updates affect
the direction of current update. µ = 0 means no momentum.
– v W: list of weight velocity for each layer
– v b: list of bias velocity for each layer
• Class methods:
– step: Updates W and b of each of the model layers:
∗ Because parameter gradients tell us which direction makes the model worse, we move opposite
the direction of the gradient to update parameters.
∗ When momentum is non-zero, update the velocities v W and v b, which accumulate past gradients
to smooth the direction of the updates. The velocity of the previous update is scaled by the hyperparameter
µ; refer to the lecture slides for more details.
Please consider the following class structure:
class SGD:

    def step(self):
        for i in range(self.L):
            if self.mu == 0:
                self.l[i].W = # TODO
                self.l[i].b = # TODO
            else:
                self.v_W[i] = # TODO
                self.v_b[i] = # TODO
                self.l[i].W = # TODO
                self.l[i].b = # TODO
When µ = 0 (no momentum), the parameters are updated as:

W := W − λ ∂L/∂W    (62)
b := b − λ ∂L/∂b    (63)

When µ ≠ 0, the velocities are updated first and then used to update the parameters:

vW := µ vW + ∂L/∂W    (64)
vb := µ vb + ∂L/∂b    (65)
W := W − λ vW    (66)
b := b − λ vb    (67)
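One way the TODOs in the class structure above might be filled in, directly following equations (62)–(67) (a sketch; it assumes each linear layer stores its gradients as dLdW and dLdb, as in Section 5):

def step(self):
    for i in range(self.L):
        if self.mu == 0:
            # plain SGD, Eqs. (62)-(63)
            self.l[i].W = self.l[i].W - self.lr * self.l[i].dLdW
            self.l[i].b = self.l[i].b - self.lr * self.l[i].dLdb
        else:
            # SGD with momentum, Eqs. (64)-(67)
            self.v_W[i] = self.mu * self.v_W[i] + self.l[i].dLdW
            self.v_b[i] = self.mu * self.v_b[i] + self.l[i].dLdb
            self.l[i].W = self.l[i].W - self.lr * self.v_W[i]
            self.l[i].b = self.l[i].b - self.lr * self.v_b[i]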
10 Regularization [20 points]
Regularization is a set of techniques that can prevent overfitting in neural networks and thus improve the
accuracy of a Deep Learning model when facing completely new data from the problem domain.
Batch normalization is a method used to make training of artificial neural networks faster and more stable
through normalization of the layers' inputs by re-centering and re-scaling. It comes from the paper Batch
Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift; we encourage
you to read the paper for a better understanding. You can find pseudocode and explanations in the paper if
you are stuck!
In this section, your task is to implement the forward and backward attribute functions of the BatchNorm1d
class in file batchnorm.py.
• Class attributes:
– alpha: a hyperparameter used for the running mean and running var computation.
– eps: a value added to the denominator for numerical stability.
– BW: learnable parameter of a BN (batch norm) layer to scale features.
– Bb: learnable parameter of a BN (batch norm) layer to shift features.
– dLdBW: how changes in γ affect loss
– dLdBb: how changes in β affect loss
– running M: running estimate of the mean of the training data
– running V: running estimate of the variance of the training data
• Class methods:
– forward: takes in a batch of data Z, computes the batch-normalized data Ẑ, and returns the
scaled and shifted data Z̃. In addition:
∗ During training, forward calculates the mean and standard deviation of each feature over the
mini-batch and uses them to update the running M E[Z] and running V V ar[Z], which are
running estimates updated during forward propagation (not by gradient descent). By default, the
elements of E[Z] are initialized to 0 and the elements of V ar[Z] are initialized to 1.
∗ During inference, the learnt mean running M E[Z] and variance running V V ar[Z] over the
entire training dataset are used to normalize Z.
– backward: takes in dLdBZ, how changes in the BN layer output affect the loss; computes and stores
the gradients dLdBW and dLdBb needed to train the learnable parameters BW and Bb; and returns
dLdZ, how changes in the BN layer input Z affect the loss L, for downstream computation.
Please consider the following class structure:
31
class BatchNorm1d:

    def __init__(self, num_features, alpha=0.9):  # signature assumed; check your handout
        self.alpha = alpha
        self.eps = 1e-8

    def forward(self, Z, eval=False):
        # TODO
        return self.BZ

    def backward(self, dLdBZ):
        # TODO
        return dLdZ
Table 6: Batch Normalization Components
The batchnorm topology is visualized in Figure L, which is referenced throughout this document. In
the figure, V0 and M0 correspond to V and M during training, and to running V and running M during
inference.
Note: In the following sections, we are providing you with element-wise equations instead of matrix equa-
tions. As a deep learning ninja, please don’t use for loops to implement them – that will be extremely
slow!
Your task is first to come up with a matrix equation for each element-wise equation we provide, then im-
plement them as code. If you ask TAs for help in this section, we will ask you to provide your matrix
equations.
Hint: check the documentation for np.sum and apply it along the right axis.
µj = (1/N) Σ_{i=1}^{N} Zij ,    j = 1, ..., C    (69)

σj² = (1/N) Σ_{i=1}^{N} (Zij − µj)² ,    j = 1, ..., C    (70)
Using the mean and variance, we normalize the input Z to get the normalized data Ẑ. Note: we add ϵ in the
denominator for numerical stability and to prevent division-by-zero errors.

Ẑi = (Zi − µ) / √(σ² + ϵ) ,    i = 1, ..., N    (71)
Here, we give you an example for the above equation to facilitate understanding:
Hint: In your matrix equation, first broadcast γ and β to make them have the same shape N × C as Ẑ.
During training (and only during training), your forward method should maintain running averages of the
mini-batch mean and variance. These running averages should be used during inference. The hyperparameter
α is used to compute the weighted running averages.
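A minimal sketch of the training-time forward pass is shown below. The scale-and-shift step Z̃ = γ ⊙ Ẑ + β and the direction of the α update follow the standard batch-norm formulation and are assumptions here; use the exact matrix equations and attribute names from the handout:

import numpy as np

def batchnorm_forward_train(self, Z):
    self.Z = Z
    self.M = Z.mean(axis=0, keepdims=True)                # Eq. (69): 1 x C mini-batch mean
    self.V = Z.var(axis=0, keepdims=True)                 # Eq. (70): 1 x C mini-batch variance
    self.NZ = (Z - self.M) / np.sqrt(self.V + self.eps)   # Eq. (71): normalized data Z-hat
    self.BZ = self.BW * self.NZ + self.Bb                 # assumed scale/shift: gamma ⊙ Z-hat + beta
    # assumed running-average convention; check the handout's equations
    self.running_M = self.alpha * self.running_M + (1 - self.alpha) * self.M
    self.running_V = self.alpha * self.running_V + (1 - self.alpha) * self.V
    return self.BZ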
10.1.3 Batch Normalization Backward Equations

∂L/∂Ẑ = (∂L/∂Z̃) · (∂Z̃/∂Ẑ) = (∂L/∂Z̃) ⊙ γ    (79)

(∂L/∂σ²)j = Σ_{i=1}^{N} ( (∂L/∂Ẑ) ⊙ (∂Ẑ/∂σ²) )ij ,    j = 1, ..., C    (80)

(∂L/∂σ²)j = −(1/2) Σ_{i=1}^{N} ( (∂L/∂Ẑ) ⊙ (Z − µ) ⊙ (σ² + ϵ)^(−3/2) )ij    (81)

∂Ẑi/∂µ = ∂/∂µ [ (Zi − µ)(σ² + ϵ)^(−1/2) ] ,    i = 1, ..., N    (82)

∂Ẑi/∂µ = −(σ² + ϵ)^(−1/2) − (1/2)(Zi − µ) ⊙ (σ² + ϵ)^(−3/2) ⊙ ( −(2/N) Σ_{i=1}^{N} (Zi − µ) )    (83)

∂L/∂µ = Σ_{i=1}^{N} (∂L/∂Ẑi) ⊙ (∂Ẑi/∂µ)    (84)

Now for the grand finale, let's compute ∂L/∂Z. For clarity, we present the derivation of ∂L/∂Zi for one data
sample Zi :

∂L/∂Zi = (∂L/∂Ẑi) · (∂Ẑi/∂Zi)
       = (∂L/∂Ẑi) ⊙ (σ² + ϵ)^(−1/2) + (∂L/∂σ²) ⊙ (2/N)(Zi − µ) + (1/N)(∂L/∂µ)    (85)
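Putting equations (79)–(85) together, a vectorized sketch of the backward pass could look like this (it assumes forward stored Z, M, V, and NZ as in the earlier sketch, and the dLdBW/dLdBb lines use the standard batch-norm parameter gradients; verify both against Table 6 and your own matrix equations):

import numpy as np

def batchnorm_backward(self, dLdBZ):
    N = self.Z.shape[0]
    self.dLdBW = np.sum(dLdBZ * self.NZ, axis=0, keepdims=True)   # standard gradient w.r.t. gamma
    self.dLdBb = np.sum(dLdBZ, axis=0, keepdims=True)             # standard gradient w.r.t. beta

    dLdNZ = dLdBZ * self.BW                                       # Eq. (79)
    dLdV = -0.5 * np.sum(dLdNZ * (self.Z - self.M)
                         * (self.V + self.eps) ** (-1.5),
                         axis=0, keepdims=True)                   # Eqs. (80)-(81)
    dNZdM = (-(self.V + self.eps) ** (-0.5)
             - 0.5 * (self.Z - self.M) * (self.V + self.eps) ** (-1.5)
               * (-2.0 / N) * np.sum(self.Z - self.M, axis=0, keepdims=True))   # Eqs. (82)-(83)
    dLdM = np.sum(dLdNZ * dNZdM, axis=0, keepdims=True)           # Eq. (84)

    dLdZ = (dLdNZ * (self.V + self.eps) ** (-0.5)
            + dLdV * (2.0 / N) * (self.Z - self.M)
            + dLdM / N)                                           # Eq. (85)
    return dLdZ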
In Figure O, we present an illustration of batchnorm in a 0-hidden-layer MLP model. Since the
variables are color-coded, it should be clear which variable is used in which equations, which will help
you apply the chain rule and understand where the backward equations come from.
11 Appendix
11.1 Anaconda Installation and Setup Instructions
1. Download the Anaconda installer specific to your operating system¹¹:
• Windows Installer
• macOS Installer
• Linux Installer
2. Once the installation is complete, open the Anaconda Prompt or the terminal in Linux and macOS.
• Windows: Click Start, Search for Anaconda Prompt, and click to open.
• macOS: Open the Terminal application. You can find it by going to "Applications" > "Utilities" > "Terminal".
• Linux: Open the Dash by clicking the Ubuntu icon, then type “Terminal”.
3. In the Anaconda Prompt, use the cd command to navigate to the directory where you have the
“HW1P1” directory. For example:
cd /path/to/HW1P1
4. Create a new Anaconda environment named “idlf23” with Python version 3.8 by running the following
command:
conda create -n idlf23 python=3.8
You may be prompted to answer “y” for a couple of prompts. Respond accordingly.
5. Activate the newly created “idlf23” environment using the following command:
conda activate idlf23
¹¹ If you are using a non-Linux system and have issues installing the exact same versions of the packages, it is fine to install
a slightly newer/older version that is compatible with your OS. In case your code gives you full marks on your local machine
but raises an issue on Autolab, read Autolab's feedback to figure out which functions are not supported by Autolab and replace
them.