

DEEP LEARNING
LECTURE NOTES
(3-1 CSE(AI), R20, JNTUA)

Prepared By

Dr.P.GANGADHARA REDDY

Associate Professor

Aditya College of Engineering, Madanapalle,

Annamayya(Dist.), Andhra Pradesh

JAWAHARLAL NEHRU TECHNOLOGICAL UNIVERSITY ANANTAPUR
B.Tech CSE (AI) – III-I Sem    L T P C: 3 0 0 3
(20A05703c) DEEP LEARNING
Course Objectives:

• …parameters and hyper parameters in a neural network's architecture

Course Outcomes:
After completion of the course, students will be able to

• …iate architecture of deep neural network

UNIT I Lecture 8Hrs


Linear Algebra: Scalars, Vectors, Matrices and Tensors, Matrix operations, types of matrices, Norms, Eigen
decomposition, Singular Value Decomposition, Principal Components Analysis.
Probability and Information Theory: Random Variables, Probability Distributions, Marginal Probability,
Conditional Probability, Expectation, Variance and Covariance, Bayes' Rule, Information Theory. Numerical
Computation: Overflow and Underflow, Gradient-Based Optimization, Constrained Optimization, Linear
Least Squares.
UNIT II Lecture 9Hrs
Machine Learning: Basics and Underfitting, Hyperparameters and Validation Sets, Estimators, Bias and
Variance, Maximum Likelihood, Bayesian Statistics, Supervised and Unsupervised Learning, Stochastic
Gradient Descent, Challenges Motivating Deep Learning. Deep Feedforward Networks: Learning XOR,
Gradient-Based Learning, Hidden Units, Architecture Design, Back-Propagation and other Differentiation
Algorithms.
UNIT III Lecture 8Hrs
Regularization for Deep Learning: Parameter Norm Penalties, Norm Penalties as Constrained Optimization,
Regularization and Under-Constrained Problems, Dataset Augmentation, Noise Robustness, Semi-Supervised
Learning, Multi-Task Learning, Early Stopping, Parameter Tying and Parameter Sharing, Sparse
Representations, Bagging and Other Ensemble Methods, Dropout, Adversarial Training, Tangent Distance,
Tangent Prop and Manifold Tangent Classifier. Optimization for Training Deep Models: Pure Optimization,
Challenges in Neural Network Optimization, Basic Algorithms, Parameter Initialization Strategies, Algorithms
with Adaptive Learning Rates, Approximate Second-Order Methods, Optimization Strategies and Meta-
Algorithms.
UNIT IV Lecture 9Hrs
Convolutional Networks: The Convolution Operation, Pooling, Convolution, Basic Convolution Functions,
Structured Outputs, Data Types, Efficient Convolution Algorithms, Random or Unsupervised Features, Basis
for Convolutional Networks.
UNIT V Lecture 8Hrs
Sequence Modeling: Recurrent and Recursive Nets: Unfolding Computational Graphs, Recurrent Neural
Networks, Bidirectional RNNs, Encoder-Decoder Sequence-to-Sequence Architectures, Deep Recurrent
Networks, Recursive Neural Networks, Echo State Networks, LSTM, Gated RNNs, Optimization for Long-
Term Dependencies, Autoencoders, Deep Generative Models.

Textbooks:
1. Ian Goodfellow, Yoshua Bengio, Aaron Courville, “Deep Learning”, MIT Press, 2016.
2. Josh Patterson and Adam Gibson, “Deep Learning: A Practitioner's Approach”, O'Reilly Media, First Edition, 2017.

Reference Books:
1. Nikhil Buduma, “Fundamentals of Deep Learning: Designing Next-Generation Machine Intelligence Algorithms”, O'Reilly, Shroff Publishers, 2019.
2. Douwe Osinga, “Deep Learning Cookbook: Practical Recipes to Get Started Quickly”, O'Reilly, Shroff Publishers, 2019.

Online Learning Resources:


o/datasets/

https://nptel.ac.in/courses/106105215


Preface

Deep Learning algorithms learn multi-level representations of data, with each level explaining
the data in a hierarchical manner. Such algorithms have been effective at uncovering underlying
structure in data, e.g., features to discriminate between classes. They have been successful in
many artificial intelligence problems including image classification, speech recognition and
natural language processing. The course, which will be taught through lectures and PPTs, will
cover the underlying theory, the range of applications to which it has been applied, and learning
from very large data sets. The course will cover connectionist architectures commonly associated
with deep learning, e.g., basic neural networks, convolutional neural networks and recurrent
neural networks. Methods to train and optimize the architectures and methods to perform
effective inference with them, will be the main focus. Students will be encouraged to use open
source software libraries such as Tensorflow.

Pre-requisites: Introductory Machine Learning (ML), Linear Algebra, Probabilistic Models & Statistics.

Course material loosely follows the organization of the text: I. Goodfellow, Y. Bengio and A.
Courville, Deep Learning, MIT Press, 2016. There are ten chapters organized into two major
subdivisions (after the Overview): I. Applied Math and Machine Learning basics, II. Deep
Networks: Modern Practice.

Lecture Slides

The lecture slides loosely follow those of Prof. Sargur Srihari, Department of Computer Science and
Engineering, University at Buffalo (updated in Spring 2020).

Dr.P.GANGADHARA REDDY
Associate Professor
Aditya College of Engineering, Madanapalle,
Annamayya(Dist.), Andhra Pradesh


UNIT I
Linear Algebra & Probability and Information Theory
Syllabus:

Linear Algebra: Scalars, Vectors, Matrices and Tensors, Matrix operations, types of matrices, Norms, Eigen decomposition,
Singular Value Decomposition, Principal Components Analysis.
Probability and Information Theory: Random Variables, Probability Distributions, Marginal Probability, Conditional Probability,
Expectation, Variance and Covariance, Bayes’ Rule, Information Theory. Numerical Computation: Overflow and Underflow,
Gradient-Based Optimization, Constrained Optimization, Linear Least Squares.

Linear Algebra:
The concepts of linear algebra are crucial for understanding the theory behind machine learning,
especially for deep learning.
Linear algebra is the branch of mathematics that focuses on linear equations. It is often applied to the
science and engineering fields, specifically machine learning. Linear algebra is also central to almost
all areas of mathematics like geometry and functional analysis.

Mathematical Objects in Linear Algebra

Scalar
A scalar is simply a single number.
Vector
A vector is an ordered array of numbers and can be in a row or a column. A vector has just a single
index, which can point to a specific value within the vector. For example, V2 refers to the second
value within the vector.

Matrix
A matrix is an ordered 2D array of numbers and it has two indices. The first one points to the row and

the second one to the column. A matrix can have multiple numbers of rows and columns. Note that a
vector is also a matrix, but with only one row or one column.
The matrix in the example above is a 2 × 3 matrix (rows × columns). Below is another example of a
matrix along with its notation:

Tensor
A tensor is an array of numbers arranged on a regular grid, with a variable number of axes. A third-order
tensor has three indices: the first points to the row, the second to the column, and the third to
the axis. For example, T₂,₃,₂ points to the second row, the third column, and the second axis; this
refers to the value 0 in the right tensor in the graphic below:

Tensor is the most general term for all of these concepts above because a tensor is a multidimensional
array and it can be a Vector and a Matrix, depending on the number of indices it has. For example, a
first-order tensor would be a vector (1 index). A second-order tensor is a matrix (2 indices) and third-
order tensors (3 indices) and higher are called higher-order tensors (3 or more indices).
Computational Rules of Linear Algebra
1. Matrix-Scalar Operations
When we multiply, divide, add, or subtract a scalar and a matrix, the operation is applied to every
element of the matrix. The image below illustrates this for multiplication:

2. Matrix-Vector Multiplication
Multiplying a matrix by a vector can be thought of as multiplying each row of the matrix by the
column of the vector. The output will be a vector that has the same number of rows as the matrix. The
image below shows how this works:

To better understand the concept, work through the calculation shown in the second image.

Here is another example:

And here is a kind of cheat sheet:
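As a quick check of this rule, here is a minimal NumPy sketch (the matrix and vector values are invented for illustration):

```python
import numpy as np

A = np.array([[1, 3],
              [4, 0],
              [2, 1]])   # 3x2 matrix
v = np.array([1, 5])     # vector with 2 components

# The product combines each row of A with v; the result has
# as many rows as the matrix.
print(A @ v)             # [16  4  7]
```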

3. Matrix-Matrix Addition and Subtraction


Matrix-matrix addition and subtraction is fairly easy and straightforward. The requirement is that the
matrices have the same dimensions, and the result is a matrix with the same dimensions. We just
add or subtract each value of the first matrix and its corresponding value in the second matrix.
See below:

4. Matrix-Matrix Multiplication
Multiplying two matrices together isn't that hard either if you know how to multiply a matrix by a
vector. Note that you can only multiply matrices together if the number of the first matrix's columns
matches the number of the second matrix's rows. The result will be a matrix with the same number of
rows as the first matrix and the same number of columns as the second matrix. It works as follows:
simply split the second matrix into column-vectors and multiply the first matrix separately by each of
these vectors. Then put the results in a new matrix (without adding them up!). The image below
explains this step by step:

And here is again some kind of cheat sheet:
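The column-by-column view described above can be verified with a short NumPy sketch (values are illustrative):

```python
import numpy as np

A = np.array([[1, 3],
              [4, 0]])       # 2x2
B = np.array([[1, 5, 2],
              [0, 1, 3]])    # 2x3: A's column count matches B's row count

C = A @ B                    # 2x3: rows of A, columns of B
print(C)                     # [[ 1  8 11]
                             #  [ 4 20  8]]

# Column j of C is A multiplied by column j of B, as described above:
print(A @ B[:, 0])           # [1 4] == C[:, 0]
```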

Matrix Multiplication Properties


Matrix multiplication has several properties that allow us to bundle a lot of computation into one
matrix multiplication. We will start by explaining these concepts with Scalars and then with matrices
because this will give a better understanding of the process.

1. Not Commutative
Scalar multiplication is commutative but matrix multiplication is not. This means that when we are
multiplying scalars, 7*3 is the same as 3*7. But when we multiply matrices by each other, A*B isn't
the same as B*A.
2. Associative
Scalar and matrix multiplication are both associative. This means that the scalar multiplication 3(5*3)
is the same as (3*5)3 and that the matrix multiplication A(B*C) is the same as (A*B)C.

3. Distributive
Scalar and matrix multiplication are also both distributive. This means that
3(5 + 3) is the same as 3*5 + 3*3 and that A(B+C) is the same as A*B + A*C.
4. Identity Matrix
The identity matrix is a special kind of matrix. The number 1 is an identity because everything
multiplied by 1 equals itself; therefore, every matrix multiplied by an identity matrix is equal to
itself. For example, matrix A times its identity matrix is equal to A.
An identity matrix has ones along its diagonal, and every other value is zero. It is also a
"square matrix," meaning that its number of rows matches its number of columns.

We previously discussed that matrix multiplication is not commutative but there is one exception,
namely if we multiply a matrix by an identity matrix. Therefore, the following equation is true: A*I =
I*A = A
Inverse and Transpose
The matrix inverse and the matrix transpose are two special kinds of matrix properties. Again, we
will start by discussing how these properties relate to real numbers and then how they relate to
matrices.
1. Inverse
A number multiplied by its inverse is equal to 1. Note that every number except 0 has an
inverse. If we multiply a matrix by its inverse, the result is the identity matrix. The example below shows
what the inverse of scalars looks like:

But not every matrix has an inverse: we can only compute the inverse of a matrix if it is a square
matrix and it is non-singular (its determinant is non-zero).
Why do we need an inverse? Because we cannot divide matrices. There is no concept of dividing by a matrix, but
we can multiply a matrix by an inverse, which results in essentially the same thing.
The image below shows a matrix multiplied by its inverse, which results in a 2-by-2 identity matrix.
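A minimal NumPy sketch of the inverse (the matrix is an arbitrary invertible example):

```python
import numpy as np

A = np.array([[4., 7.],
              [2., 6.]])
A_inv = np.linalg.inv(A)   # defined only for square, non-singular matrices

print(A @ A_inv)           # ~2x2 identity matrix (up to rounding error)

# "Dividing" by A really means multiplying by its inverse:
b = np.array([1., 2.])
x = A_inv @ b              # solves A x = b
```

In practice, `np.linalg.solve(A, b)` is preferred over forming the inverse explicitly, since it is cheaper and numerically more stable.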

2. Transpose
The transpose of a matrix is basically its mirror image along the main diagonal. It is
fairly simple to obtain: the first column of the matrix becomes the first row of the matrix transpose,
and the second column becomes the second row. An m×n matrix is transformed into
an n×m matrix, and the Aᵢⱼ element of A is equal to the Aⱼᵢ element of Aᵀ. The image below
illustrates this:

3. Determinants
The determinant of a matrix is a single numerical value which is used when calculating the inverse
or when solving systems of linear equations. The determinant of a matrix A is denoted |A|, or
sometimes det(A).
Properties:
The determinant can be expanded along any row or column with the same result. The value of the
determinant is 0 if all of the elements of a row (or column) are zeros. The determinant of the identity
matrix Iₙ is 1. If the rows and columns are swapped (i.e., the matrix is transposed), the value of the
determinant does not change.
Norm:
A norm is a way to measure the size of a vector, a matrix, or a tensor. In other words, norms are a
class of functions that enable us to quantify the magnitude of a vector. For instance, the norm of the
vector X drawn below is a measure of its length from the origin.
Vector norms are defined as functions that take a vector as input and output a non-negative value
against it, called the magnitude of the vector.
The norm of any vector X is denoted by double bars around it and is written as follows:
Norm of vector x: ||X||
A diagram showing the family of vector norm functions and their output.

Properties of Norm:
Consider two vectors X and Y having the same size. A function is considered a norm if and only if it
satisfies the following properties:
 Non-negativity: It should always be non-negative.
 Definiteness: It is zero if and only if the vector is zero, i.e., zero vector.
 Triangle inequality: The norm of a sum of two vectors is no more than the sum of their norms.

 Homogeneity: Multiplying a vector by a scalar multiplies the norm of the vector by the
absolute value of the scalar.
Let's see these properties represented mathematically.
Non-negativity: ||x|| ≥ 0
Definiteness: ||x|| = 0 if and only if x = 0
Triangle inequality: ||x + y|| ≤ ||x|| + ||y||
Homogeneity: ||λx|| = |λ| ||x||, where λ ∈ R
Any real value function of a vector that satisfies the above four properties is called a norm.
Standard Norms:
Lᵖ Norm
All norm functions originate from a standard equation known as the p-norm. For different values of
the parameter p (a real number greater than or equal to 1), we obtain a different norm function;
substituting p = 1, 2, and ∞ yields the L¹, L², and L∞ norms respectively.

Mathematical Notation
The Lᵖ norm can be mathematically written as:

||x||ₚ = (∑ᵢ |xᵢ|ᵖ)^(1/p)

This takes an n-dimensional vector x, raises the absolute value of each element to its p-th power,
sums the results, and takes the p-th root to obtain the p-norm of the vector, also known as its
magnitude. With different values of the parameter p, we obtain different norm functions. Let's discuss
them one by one below.
L0 Norm:
Although p=0 lies outside the domain of the p-norm function, substituting p=0 in the above equation
gives us the individual vector elements raised to the power 0, which is 1 (provided the number is not
zero). Furthermore, we also have a p-th root in the equation, which is not defined for p=0. To handle
this, the standard way of defining the L0 norm is to count the number of non-zero elements in the
given vector. The image below shows the output of the L-0 norm function for the given vector:

L1 Norm/Manhattan Norm:
Substituting p = 1 in the standard equation of the p-norm, we get:

||x||₁ = ∑ᵢ |xᵢ|

 When used to compute the loss, the L1 norm is also referred to as the Mean Absolute Error.
 L1 norm varies linearly for all locations, whether far or near the origin.
The image below shows the output of the L1 norm function for the given vector:

The L¹ norm is defined as the sum of the absolute values of the components of a given vector. Since
we have a vector X with only two components, the L¹ norm of X can be written as ||X||₁ = |x₁| + |x₂|.
Properties of the L1/Manhattan Norm
 The L¹ norm is used in situations when it is helpful to distinguish between zero and non-zero
values.
 The L¹ norm increases linearly around the origin.
L² Norm / Euclidean Norm / Squared L2 Norm:
As the name indicates, the squared L2 norm is the same as the L2 norm, but squared:

||x||₂² = ∑ᵢ xᵢ²

 The above equation is often referred to as the mean squared error when used to compute the
error in machine learning.
The squared L2 Norm is relatively computationally inexpensive to use compared to the L2 Norm.
This is because:
1. It is missing the square root.
2. Within machine learning applications, the derivative of the squared L2 norm is easier to
compute and store: the derivative with respect to an element requires only that element
itself, whereas in the case of the L2 norm, the entire vector is needed.


L² is the most commonly used norm and the one most encountered in real life. The L² norm measures
the shortest distance from the origin. It is defined as the root of the sum of the squares of the
components of the vector, so for our given vector X:

||X||₂ = √(x₁² + x₂²)
The Euclidean norm essentially means we are referring to the Euclidean distance.
Properties of the L2 Norm:
 The L² norm is the most commonly used one in machine learning
 Since it entails squaring of each component of the vector, it is not robust to outliers.
 The L² norm increases slowly near the origin, e.g., 0.1² = 0.01
 It is used in ridge regression, which involves adding the coefficient of the L² norm as a
penalty term to the loss function.

L∞ / Max Norm
The L∞ norm is defined as the absolute value of the largest-magnitude component of the vector;
therefore, it is also called the max norm. Continuing with our example of a 2D vector X having two
components x₁ and x₂, where |x₂| > |x₁|, the ∞-norm is simply the absolute value of x₂.
As infinity is an abstract concept in mathematics, we can't just substitute p = ∞ in the standard p-norm
equation. However, studying the function's behavior as p approaches infinity using limits gives

||x||∞ = maxᵢ |xᵢ|

Max norm returns the absolute value of the largest magnitude element. The image below shows
the output of the Max norm function for the given vector:

Properties of the Max Norm:


 The L∞ norm simplifies to the absolute value of the largest element in the vector.
Concluding notes:
1. Vector norm is a function that takes a vector as an input and outputs a positive value.
2. All norm functions can be derived from a single equation. The family of norm functions is
known as p-norm.

3. When used as a loss, the L1 norm corresponds to the Mean Absolute Error.
4. The L2 norm corresponds to the Root Mean Squared Error.
5. The squared L2 norm corresponds to the Mean Squared Error.
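All of the norms summarized above can be computed with NumPy; the sketch below uses an arbitrary example vector:

```python
import numpy as np

x = np.array([3., -4., 0.])

l0   = np.count_nonzero(x)         # "L0 norm": number of non-zero entries -> 2
l1   = np.linalg.norm(x, 1)        # L1 / Manhattan norm: |3|+|-4|+|0| = 7
l2   = np.linalg.norm(x)           # L2 / Euclidean norm: sqrt(9+16) = 5
sql2 = x @ x                       # squared L2 norm (no square root) = 25
linf = np.linalg.norm(x, np.inf)   # max norm: largest-magnitude element = 4
print(l0, l1, l2, sql2, linf)
```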

Eigenvalues and Eigenvectors:


Eigenvalues and Eigenvectors are fundamental concepts in linear algebra that play a significant role
in machine learning algorithms and applications.
Eigenvalues and Eigenvectors are the scalar and vector quantities associated with matrices used for
linear transformations. The vector that only changes by a scalar factor after applying a
transformation is called an eigenvector, and the scalar value attached to the eigenvector is
called the eigenvalue.

Eigenvalues
Eigenvalues are the scalar values associated with the eigenvectors in a linear transformation. The
word 'eigen' is of German origin and means 'characteristic'. Hence, these characteristic values
indicate the factor by which eigenvectors are stretched in their direction.
The equation for eigenvalue is given by
Av = λv
Where,
 A is the matrix,
 v is associated eigenvector, and
 λ is scalar eigenvalue.
Eigenvectors:
Eigenvectors are also called characteristic vectors and can only be found for square
matrices. An eigenvector of a square matrix A is defined as a non-zero vector v which, when
multiplied by the matrix, gives a scalar multiple of itself, i.e. it satisfies the condition Av = λv.
The scalar multiple λ in the above case is called the eigenvalue of the square matrix. We always
have to find the eigenvalues of the square matrix first before finding the eigenvectors of the matrix.
Eigenvector Equation
The Eigenvector equation is the equation that is used to find the eigenvector of any square matrix.
The eigenvector equation is,
Av = λv
Where,
 A is the given square matrix,
 v is the eigenvector of matrix A, and
 λ is the corresponding scalar eigenvalue.
How to get Eigenvalues and Eigenvectors?
If A is a square matrix of order n × n then we can easily find the eigenvector of the square matrix
by following the method discussed below,
We know that the eigenvector is given using the equation Av = λv, for the identity matrix of order
same as the order of A i.e. n × n we use the following equation,
(A-λI)v = 0
Solving the above equation we get various values of λ as λ1, λ2, …, λn these values are called the
eigenvalues and we get individual eigenvectors related to each eigenvalue.
Simplifying the above equation we get v which is a column matrix of order n × 1 and v is written
as,
v = [v₁, v₂, v₃, …, vₙ]ᵀ
How to Find an Eigenvector?
The eigenvectors of a square matrix can be calculated using the steps below:
Step 1: Find the eigenvalues of the matrix A using the equation det(A − λI) = 0, where I is the
identity matrix of the same order as A.
Step 2: Name the values obtained in Step 1 as λ₁, λ₂, λ₃, …
Step 3: Find the eigenvector (X) associated with the eigenvalue λ1 using the equation,
(A – λ1I) X = 0
Step 4: Repeat step 3 to find the eigenvector associated with other remaining eigenvalues
λ2, λ3….
Following these steps gives the eigenvector related to the given square matrix.
Types of Eigenvector
The eigenvectors calculated for the square matrix are of two types which are,
 Right Eigenvector
 Left Eigenvector
Right Eigenvector
The eigenvector which is multiplied by the given square matrix from the right-hand side is called
the right eigenvector.
For any square matrix, A of order n × n the eigenvector is the column matrix of order n × 1. If we
find the eigenvector of the matrix A by, Av = λv, “v” in this is called the right eigenvector of the
matrix A and is always multiplied to the right-hand side as matrix multiplication is not
commutative in nature. In general, when we find the eigenvector it is always the right eigenvector.
It is calculated by using the following equation,
AVR = λVR
Where,
 A is given square matrix of order n×n,
 λ is one of the eigenvalues, and
 VR is the column vector matrix
The value of VR is,
V_R = [v₁, v₂, v₃, …, vₙ]ᵀ
Left Eigenvector
The eigenvector which is multiplied by the given square matrix from the left-hand side is called the
left eigenvector.
We can find the left eigenvector of the square matrix A by using the relation, vA = vλ
Here, v is the left eigenvector and is always multiplied on the left-hand side. If matrix A is of order
n × n, then v is a row matrix of order 1 × n.
It is calculated by using the following equation,
VLA = VLλ
Where,
 A is given square matrix of order n×n,
 λ is one of the eigenvalues, and
 VL is the row vector matrix.
The value of VL is,
VL = [v1, v2, v3,…, vn]
Eigenvectors of a Square Matrix
We can easily find the eigenvectors of square matrices of order n × n. Now, let's find them for the
following square matrices:
 Eigenvectors of a 2 × 2 matrix
 Eigenvectors of a 3 × 3 matrix.
Eigenvector of a 2 × 2 matrix
The eigenvector of a 2 × 2 matrix can be calculated using the above-mentioned steps. An example
of the same is:
Example: Find the eigenvalues and the eigenvectors of the matrix A = [1  2; 5  4] (rows separated by semicolons).
Solution:
If the eigenvalues are represented by λ and the eigenvector by v = [a  b]ᵀ, the eigenvalues are found from
|A − λI| = 0
|1−λ  2; 5  4−λ| = 0
(1−λ)(4−λ) − 2·5 = 0
⇒ 4 − λ − 4λ + λ² − 10 = 0
⇒ λ² − 5λ − 6 = 0
⇒ λ² − 6λ + λ − 6 = 0
⇒ λ(λ−6) + 1(λ−6) = 0
⇒ (λ−6)(λ+1) = 0
λ = 6 and λ = −1
Thus, the eigenvalues are 6 and −1, and the respective eigenvectors are found as follows.
For λ = 6:
(A − λI)v = 0
[−5  2; 5  −2] [a  b]ᵀ = 0
⇒ −5a + 2b = 0
⇒ 5a − 2b = 0
Simplifying the above equation, we get 5a = 2b.
The required eigenvector is [a  b]ᵀ = [2  5]ᵀ.

For λ = −1:
(A − λI)v = 0
[2  2; 5  5] [a  b]ᵀ = 0
⇒ 2a + 2b = 0
⇒ 5a + 5b = 0
Simplifying the above equation, we get a = −b.
The required eigenvector is [a  b]ᵀ = [1  −1]ᵀ.
The eigenvectors of the given 2 × 2 matrix are therefore [a  b]ᵀ = [2  5]ᵀ and [a  b]ᵀ = [1  −1]ᵀ.
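The worked example can be verified numerically. Note that np.linalg.eig returns unit-length eigenvectors in no guaranteed order, so they appear as scaled versions of [2  5]ᵀ and [1  −1]ᵀ:

```python
import numpy as np

A = np.array([[1., 2.],
              [5., 4.]])              # matrix from the worked example
vals, vecs = np.linalg.eig(A)
print(vals)                           # eigenvalues 6 and -1 (order not guaranteed)

# Each column of vecs pairs with the eigenvalue at the same index;
# check A v = lambda v for the first pair:
v, lam = vecs[:, 0], vals[0]
print(np.allclose(A @ v, lam * v))    # True
```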
Eigendecomposition
In linear algebra, eigendecomposition is the factorization of a matrix into a canonical form, whereby the
matrix is represented in terms of its eigenvalues and eigenvectors. When the matrix being factorized
is a normal or real symmetric matrix, the decomposition is called "spectral decomposition", derived
from the spectral theorem.
Suppose that matrix A has n linearly independent eigenvectors {v⁽¹⁾, …, v⁽ⁿ⁾} with eigenvalues
{λ₁, …, λₙ}. Concatenate the eigenvectors column-wise to form the matrix V = [v⁽¹⁾, …, v⁽ⁿ⁾], and
likewise concatenate the eigenvalues to form the vector λ = [λ₁, …, λₙ]. The eigendecomposition of A is given by
A = V diag(λ) V⁻¹
Equivalently, let A be a square n × n matrix with n linearly independent eigenvectors qᵢ (where i = 1, …, n).
Then A can be factorized as
A = Q Λ Q⁻¹
where Q is the square n × n matrix whose i-th column is the eigenvector qᵢ of A, and Λ is the diagonal
matrix whose diagonal elements are the corresponding eigenvalues, Λᵢᵢ = λᵢ. Note that
only diagonalizable matrices can be factorized in this way.
The eigendecomposition of a matrix tells us many useful facts about the matrix. The matrix is
singular if and only if any of the eigenvalues are zero. The eigendecomposition of a real symmetric
matrix can also be used to optimize quadratic expressions of the form f(x) = xᵀAx subject to ||x||₂ = 1.
Whenever x equals an eigenvector of A, f takes on the value of the corresponding eigenvalue.
The maximum value of f within the constraint region is the maximum eigenvalue, and its minimum
value within the constraint region is the minimum eigenvalue.
A matrix whose eigenvalues are all positive is called positive definite.
A matrix whose eigenvalues are all positive or zero-valued is called positive semidefinite.
Likewise, if all eigenvalues are negative, the matrix is negative definite,
and if all eigenvalues are negative or zero-valued, it is negative semidefinite.
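A short NumPy sketch of the eigendecomposition and the definiteness test (the matrices are invented examples):

```python
import numpy as np

A = np.array([[4., 1.],
              [2., 3.]])                 # diagonalizable example matrix
lam, V = np.linalg.eig(A)                # columns of V are eigenvectors

# Eigendecomposition: A = V diag(lambda) V^-1
A_rebuilt = V @ np.diag(lam) @ np.linalg.inv(V)
print(np.allclose(A, A_rebuilt))         # True

# For a symmetric matrix, the signs of the eigenvalues give definiteness:
S = np.array([[2., -1.],
              [-1., 2.]])
print(np.linalg.eigvalsh(S))             # [1. 3.]: all positive -> positive definite
```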

Singular Value Decomposition


Eigendecomposition has the form A = V diag(λ) V⁻¹. If A is not square, the eigendecomposition is undefined.

 SVD is more general than eigendecomposition: it can be used with any matrix, not only square
diagonalizable ones
 Every real matrix has an SVD, which factorizes it into singular vectors and singular values
 SVD is a decomposition of the form A = U D Vᵀ
The Singular Value Decomposition (SVD) of a matrix is a factorization of that matrix into three
matrices. It has some interesting algebraic properties and conveys important geometrical and
theoretical insights about linear transformations. It also has some important applications in data
science. This section explains the mathematical intuition behind SVD and its geometrical meaning.
Mathematics behind SVD:
The SVD of an m×n matrix A is given by the formula A = U D Vᵀ, where:
 U: m×m matrix of the orthonormal eigenvectors of AAᵀ.
 Vᵀ: transpose of an n×n matrix containing the orthonormal eigenvectors of AᵀA.
 D: m×n diagonal matrix with r elements equal to the square roots of the positive eigenvalues of AAᵀ or
AᵀA (both matrices have the same positive eigenvalues anyway).
Each of these matrices is defined to have a special structure. The matrices U and V are both defined to
be orthogonal matrices. The matrix D is defined to be a diagonal matrix; note that D is not
necessarily square. The elements along the diagonal of D are known as the singular values of the
matrix A.
The columns of U are known as the left-singular vectors.
The columns of V are known as the right-singular vectors.
We can actually interpret the singular value decomposition of A in terms of the eigendecompositions
of functions of A:
The left-singular vectors of A are the eigenvectors of AAᵀ.
The right-singular vectors of A are the eigenvectors of AᵀA.
The non-zero singular values of A are the square roots of the eigenvalues of AᵀA; the same is true for AAᵀ.
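These relationships can be checked numerically; the following sketch uses an arbitrary non-square matrix:

```python
import numpy as np

A = np.array([[1., 0., 2.],
              [0., 3., 0.]])             # 2x3: not square, so no eigendecomposition
U, s, Vt = np.linalg.svd(A)              # A = U D V^T, s holds the singular values

D = np.zeros_like(A)                     # rebuild the 2x3 diagonal matrix D
D[:len(s), :len(s)] = np.diag(s)
print(np.allclose(A, U @ D @ Vt))        # True

# Singular values are the square roots of the eigenvalues of A A^T:
print(np.sqrt(np.linalg.eigvalsh(A @ A.T))[::-1])   # matches s (descending)
```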

Principal Component Analysis(PCA):


Principal component analysis (PCA) is a dimensionality reduction and machine learning method used
to simplify a large data set into a smaller set while still maintaining significant patterns and trends.
Principal component analysis can be broken down into five steps.
1. Standardize the range of continuous initial variables
2. Compute the covariance matrix to identify correlations
3. Compute the eigenvectors and eigenvalues of the covariance matrix to identify the principal
components
4. Create a feature vector to decide which principal components to keep
5. Recast the data along the principal components axes
The idea of PCA is simple: reduce the number of variables of a data set, while preserving as
much information as possible.
Principal Components:
Principal components are new variables that are constructed as linear combinations or mixtures of
the initial variables. These combinations are done in such a way that the new variables (i.e., principal
components) are uncorrelated and most of the information within the initial variables is squeezed or
compressed into the first components. So, the idea is 10-dimensional data gives 10 principal
components, but PCA tries to put maximum possible information in the first component, then
maximum remaining information in the second and so on, until having something like shown in the
scree plot below.

Percentage of variance (information) explained by each PC.


Organizing information in principal components this way allows us to reduce dimensionality without
losing much information, by discarding the components with low information and treating the
remaining components as the new variables.
How PCA Constructs the Principal Components
As there are as many principal components as there are variables in the data, principal components
are constructed in such a manner that the first principal component accounts for the largest possible
variance in the data set.

The second principal component is calculated in the same way, with the condition that it is
uncorrelated with (i.e., perpendicular to) the first principal component and that it accounts for the next
highest variance.
This continues until a total of p principal components have been calculated, equal to the original
number of variables.
Step-by-Step Explanation of PCA
Step 1: Standardization
The aim of this step is to standardize the range of the continuous initial variables so that each one of
them contributes equally to the analysis.
More specifically, the reason why it is critical to perform standardization prior to PCA, is that the
latter is quite sensitive regarding the variances of the initial variables. That is, if there are large
differences between the ranges of initial variables, those variables with larger ranges will dominate
over those with small ranges, which will lead to biased results. So, transforming the data to
comparable scales can prevent this problem.

Mathematically, this can be done by subtracting the mean and dividing by the standard deviation for
each value of each variable: z = (value − mean) / standard deviation.

Once the standardization is done, all the variables will be transformed to the same scale.
Step 2: Covariance Matrix Computation
The aim of this step is to understand how the variables of the input data set are varying from the mean
with respect to each other, or in other words, to see if there is any relationship between them.
Sometimes variables are highly correlated in such a way that they contain redundant information. So, in order to
identify these correlations, we compute the covariance matrix.
The covariance matrix is a p × p symmetric matrix (where p is the number of dimensions) that has as
entries the covariances associated with all possible pairs of the initial variables. For example, for a 3-
dimensional data set with 3 variables x, y, and z, the covariance matrix is a 3×3 matrix of this
form:

Covariance Matrix for 3-Dimensional Data.


Since the covariance of a variable with itself is its variance (Cov(a,a)=Var(a)), in the main diagonal
(Top left to bottom right) we actually have the variances of each initial variable. And since the
covariance is commutative (Cov(a,b)=Cov(b,a)), the entries of the covariance matrix are symmetric
with respect to the main diagonal, which means that the upper and the lower triangular portions are
equal.
Step 3: Compute the eigenvectors and eigenvalues of the covariance matrix to identify the
principal components
Eigenvectors and eigenvalues are the linear algebra concepts that we need to compute from the
covariance matrix in order to determine the principal components of the data.
What you first need to know about eigenvectors and eigenvalues is that they always come in pairs, so
that every eigenvector has an eigenvalue. Also, their number is equal to the number of dimensions of
the data. For example, for a 3-dimensional data set, there are 3 variables, therefore there are 3
eigenvectors with 3 corresponding eigenvalues.
It is eigenvectors and eigenvalues that are behind all the magic of principal components: the
eigenvectors of the covariance matrix are actually the directions of the axes with the most
variance (most information), and these are what we call principal components. Eigenvalues are simply the
coefficients attached to the eigenvectors, giving the amount of variance carried in each principal
component. By ranking the eigenvectors in order of their eigenvalues, highest to lowest, you get the
principal components in order of significance.
Step 4: Create a Feature Vector
As we saw in the previous step, computing the eigenvectors and ordering them by their eigenvalues in
descending order, allow us to find the principal components in order of significance. In this step, what

we do is, to choose whether to keep all these components or discard those of lesser significance (of
low eigenvalues), and form with the remaining ones a matrix of vectors that we call Feature vector.
So, the feature vector is simply a matrix that has as columns the eigenvectors of the components that
we decide to keep. This makes it the first step towards dimensionality reduction, because if we
choose to keep only p eigenvectors (components) out of n, the final data set will have
only p dimensions.
Step 5: Recast the Data Along the Principal Components Axes
In the previous steps, apart from standardization, you do not make any changes on the data, you just
select the principal components and form the feature vector, but the input data set remains always in
terms of the original axes (i.e, in terms of the initial variables).
In this step, which is the last one, the aim is to use the feature vector formed using the eigenvectors of
the covariance matrix, to reorient the data from the original axes to the ones represented by the
principal components (hence the name Principal Components Analysis). This can be done by
multiplying the transpose of the original data set by the transpose of the feature vector.
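The five steps can be collected into a short NumPy sketch; the data here is randomly generated purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # 100 samples, 3 variables

# Step 1: standardize each variable
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
# Step 2: covariance matrix (3x3)
C = np.cov(Xs, rowvar=False)
# Step 3: eigenvectors/eigenvalues of the covariance matrix
vals, vecs = np.linalg.eigh(C)
order = np.argsort(vals)[::-1]           # rank PCs from high to low variance
vals, vecs = vals[order], vecs[:, order]
# Step 4: feature vector = the eigenvectors we decide to keep (top 2 here)
W = vecs[:, :2]
# Step 5: recast the data onto the principal component axes
X_pca = Xs @ W                           # 100x2 reduced data set
print(vals / vals.sum())                 # fraction of variance per component
```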

Problems

Find the singular value decomposition of the matrix

A = [−4  −7; 1  4]

Solution: Given A = [−4  −7; 1  4], we have

Aᵀ = [−4  1; −7  4]

AAᵀ = [−4  −7; 1  4] [−4  1; −7  4] = [65  −32; −32  17]

Finding the eigenvalues and eigenvectors of AAᵀ: the eigenvalues of AAᵀ are λ = 1 and λ = 81.

Eigenvectors for λ = 81:

v₁ = [−2  1]ᵀ
Eigenvectors for λ = 1:
v₂ = [0.5  1]ᵀ

Similarly, we can find the eigenvectors for AᵀA:
Eigenvectors for λ = 81: v₁ = [0.5  1]ᵀ
Eigenvectors for λ = 1: v₂ = [−2  1]ᵀ

Using these values, we can write the solution. The singular values are the square roots of the eigenvalues, σ₁ = √81 = 9 and σ₂ = √1 = 1. Normalizing the eigenvectors of AAᵀ gives the columns of U, and normalizing the eigenvectors of AᵀA gives the columns of V (with signs chosen so that Avᵢ = σᵢuᵢ):

A = U D Vᵀ = [−2/√5  1/√5; 1/√5  2/√5] [9  0; 0  1] [1/√5  2/√5; −2/√5  1/√5]


Probability
Probability is a concept which numerically measures the degree of certainty or uncertainty of the
occurrence or non-occurrence of events. In other words, probability is a measure of the degree to
which an event happens or does not happen.
Basic definitions:
Experiment: An experiment is any physical action or process that is observed and whose result is noted.
Examples: tossing a coin, rolling a die, drawing a card from a deck of 52 cards.
Trial: A single performance of an experiment.
Outcome/output: The result of an experiment.
Sample Space: The set of all possible outcomes in any Experiment is called the sample space. And it
is represented by the letter “S”. The sample space is a universal set for the experiment. The elements
of S are called sample points.
Ex. 1: For the coin-toss experiment, the possible results are "Head" and "Tail", so the sample space is
S = {H, T}.
Ex. 2: If we roll a die, the sample space, i.e. the set of all possible outcomes, is S = {1, 2, 3, 4, 5, 6}.
Types of Sample Space:
1. Discrete Sample Space: Consider the experiment of tossing a coin twice. The sample space is
S = {HH, HT, TH, TT}. This sample space has a finite number of sample points and is
called a finite sample space.
2. Continuous Sample Space: Obtaining a number from a spinning pointer is an example of a
continuous sample space: the outcome can be any real value in an interval, e.g. S = [0, 1], so
the sample space has infinitely many sample points.

Event: An event is defined as a subset of the sample space. Events can be represented with
capital letters like A, B, C, etc. As with sample spaces, events may be either discrete or
continuous.
Some examples of events:
Example 1: Tossing a fair coin.
The possible outcomes are H (head) and T (tail). The associated sample space is S = {H, T}; it is a
finite sample space. Events associated with S include {H}, {T}, S itself, and the empty set ɸ.
Example 2: Throwing a fair die.
The 6 possible outcomes are 1, 2, 3, 4, 5, 6, so the associated finite sample space is S = {1, 2, 3, 4, 5, 6}.
Some events are A = getting an odd number = {1, 3, 5}
and B = getting a 6 = {6}.
The classical definition of Probability:
Let the sample space (denoted by S) be the set of all possible distinct outcomes of an experiment.
Classically, probability is defined as the ratio of the number of favorable outcomes to the total
number of possible outcomes of the experiment. Mathematically,
P(A) = F/T.
Where: P(A) is the probability of event A.
F is the number of favorable outcomes and
T is the Total number of possible outcomes.
The classical definition fails when the total number of outcomes becomes infinity.
Example 1: A fair die is rolled once. What is the probability of getting a '6'?
Here S = {1, 2, 3, 4, 5, 6} and A = {6}, so
n(S) = 6 and n(A) = 1
P(A) = n(A)/n(S) = 1/6
Definition of probability from Set theory and Axioms: In the axiomatic definition, the probability
P(A) of an event is always a non negative real number which satisfies the following three Axioms.
Axiom 1: P(A) ≥ 0.Which means that the probability of event is always a non negative number
Axiom 2: P(S) =1.Which means that the probability of a sample space consisting of all possible
outcomes of experiment is always unity or one.
Axiom 3: For mutually exclusive events, P(A₁ ∪ A₂ ∪ … ∪ A_N) = P(A₁) + P(A₂) + … + P(A_N).
This means that the probability of the union of N mutually exclusive events is the sum of the individual
probabilities of those N events.
Probability as a relative frequency: The use of common sense and engineering and scientific
observations leads to a definition of probability as a relative frequency of occurrence of some event.
Suppose that a random experiment is repeated n times and the event A occurs n(A) times. The
probability of event A is then defined as the relative frequency of event A as the number of trials n tends
to infinity. Mathematically, P(A) = lim (n→∞) n(A)/n,
where n(A)/n is called the relative frequency of event A.
Joint Probability: If a sample space consists of two events A and B which are not mutually
exclusive, and then the probability of these events occurring jointly or simultaneously is called the
Joint Probability. In other words the joint probability of events A and B is equal to the relative

frequency of the joint occurrence. If the experiment is repeated n times and the joint occurrence
of events A and B happens n(AB) times, then the joint probability of events A and B is

P(A∩B) = lim (n→∞) n(AB)/n

and P(A∪B) = P(A) + P(B) − P(A∩B), i.e. the probability of the union of two events is always less than
or equal to the sum of the event probabilities.
Conditional Probability: If an experiment repeats n times and a sample space contains only two
events A and B and event A occurs n(A) times, event B occurs n(B) times and the joint event of A
and B occurs n(AB) times then the conditional probability of event A given event B is equal to the
relative frequency of the joint occurrence n(AB) with respect to n(B) as n tends to infinity.
Mathematically,

P(A|B) = lim (n→∞) n(AB)/n(B) = P(A∩B)/P(B)

That is, the conditional probability P(A|B) is the probability of event A occurring given that event B
has occurred. Similarly, the conditional probability of occurrence of B when event A is given can be expressed as

P(B|A) = P(A∩B)/P(A)

From the conditional probabilities, the joint probability of the events A and B can be expressed as

P(A∩B) = P(A|B) P(B) = P(B|A) P(A)
Total Probability Theorem: Consider a sample space, S that has n mutually exclusive events Bn,
n=1, 2, 3,…,N. such that Bm∩Bn=ɸ for m ≠n=1, 2, 3, ….,N. The probability of any event A, defined
on this sample space can be expressed in terms of the Conditional probabilities of events Bn. This
probability is known as the total probability of event A. Mathematically,

P(A) = ∑ P(A|Bₙ) P(Bₙ),  summed over n = 1, 2, …, N

Proof: The sample space S consists of N mutually exclusive events Bₙ, n = 1, 2, …, N, as shown in the figure.
These events have the properties Bₘ∩Bₙ = ɸ for m ≠ n and ⋃ₙ Bₙ = S. Any event A can therefore be
written as the union of the mutually exclusive events A∩Bₙ, so that P(A) = ∑ₙ P(A∩Bₙ) = ∑ₙ P(A|Bₙ) P(Bₙ).
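A small numeric sketch of the theorem (the probabilities are invented for illustration). Note that Bayes' rule, P(Bₙ|A) = P(A|Bₙ)P(Bₙ)/P(A), follows directly from the conditional-probability definitions above together with the total probability of A:

```python
# B1, B2, B3 partition the sample space (mutually exclusive, exhaustive).
p_B = [0.5, 0.3, 0.2]            # P(B_n); must sum to 1
p_A_given_B = [0.1, 0.4, 0.8]    # P(A | B_n)

# Total probability: P(A) = sum over n of P(A|B_n) P(B_n)
p_A = sum(pa * pb for pa, pb in zip(p_A_given_B, p_B))
print(p_A)                       # 0.33

# Bayes' rule: P(B_n | A) = P(A|B_n) P(B_n) / P(A)
posteriors = [pa * pb / p_A for pa, pb in zip(p_A_given_B, p_B)]
print(posteriors, sum(posteriors))   # posteriors sum to 1
```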


Random Variable Definition


A random variable is a function that maps outcomes of a random experiment to real numbers; it
associates the points in the sample space with real numbers and is represented as X(s) or X.
A (real-valued) random variable, often denoted by X (or some other capital letter), is a function
mapping a probability space (S, P) into the real line R, as shown in the figure below. Associated
with each point s in the domain S, the function X assigns one and only one value X(s) in the range R.


Typical random variables are the number of hits in a shooting game, the number of heads when
tossing coins, temperature/pressure variations of a physical system, etc. For example, suppose an experiment
consists of tossing two coins, and let the random variable X be the number of heads shown. Then X
maps the event "showing no head" to 0, the event "any one head" to 1, and the event "both heads" to 2.
The random variable is therefore X = {0, 1, 2}, and its elements are x₁ = 0, x₂ = 1 and x₃ = 2.
Classification of Random Variables: Random variables are classified into continuous, discrete and
mixed random variables.
The values of a continuous random variable are continuous in a given continuous sample space. A
continuous sample space has an infinite range of values; a continuous random variable cannot be
produced from a discrete sample space, because its values are discrete.
Types of continuous random variables: Normal/Gaussian R.V, Exponential R.V, Gamma R.V,
Rayleigh R.V, Uniform R.V
The values of a discrete random variable are only the discrete values in a given sample space. The
sample space for a discrete random variable can be continuous, discrete or even both continuous and
discrete points .They may be also finite or infinite.
Types of Discrete random variables: Bernoulli R.V, Binomial R.V, Poisson R.V, Geometric R.V
The values of a mixed random variable are both continuous and discrete in a given sample space. The
mixed random variable has the least practical significance or importance.


Probability Mass Function:


The probability distribution of a discrete random variable is a list of probabilities associated with
each of its possible values. It is also sometimes called the probability function or the probability
mass function.
More formally, the probability distribution of a discrete random variable X is a function which
gives the probability p(xᵢ) that the random variable equals xᵢ, for each value xᵢ:
p(xᵢ) = P(X = xᵢ)
It satisfies the following conditions:
a. 0 ≤ p(xᵢ) ≤ 1
b. ∑ᵢ p(xᵢ) = 1


Expectation & Variance:
Mean of a random variable: If X is the random variable and P is the respective probabilities, the
mean of a random variable is defined by:

Mean (μ) = ∑ XP

where variable X consists of all possible values and P consist of respective probabilities.

Variance of Random Variable: The variance tells how much the spread of random variable X is
around the mean value. The formula for the variance of a random variable is given;

Var(X) = σ2 = E(X2) – [E(X)]2

where E(X²) = ∑ X²P and E(X) = ∑ XP. The standard deviation is σ(X) = √Var(X).
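A minimal sketch computing these quantities for a small discrete distribution (X = number of heads in two fair coin tosses):

```python
import numpy as np

x = np.array([0, 1, 2])            # values of the discrete random variable
p = np.array([0.25, 0.5, 0.25])    # P(X = x_i)

mean = np.sum(x * p)               # E(X) = sum XP
var = np.sum(x**2 * p) - mean**2   # Var(X) = E(X^2) - [E(X)]^2
std = np.sqrt(var)                 # standard deviation
print(mean, var, std)              # 1.0 0.5 0.7071...
```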


2. If two random variables X and Y are independent, then the covariance is zero. i.e. CXY = 0. But
the converse is not true.
Proof: Consider two random variables X and Y. If X and Y are independent, we know that
E[XY] = E[X]E[Y], and the covariance of X and Y is
C_XY = E[XY] − E[X]E[Y] = E[X]E[Y] − E[X]E[Y] = 0.

INFORMATION THEORY
Information theory is based on probability theory and statistics and often concerns itself with
measures of information of the distributions associated with random variables. Important quantities of
information are entropy, a measure of information in a single random variable, and mutual
information, a measure of information in common between two random variables.
Information theory revolves around quantifying how much information is present in a signal. It
was originally invented to study sending messages from discrete alphabets over a noisy channel,
such as communication via radio transmission. In the case of deep learning, the most common use
case for information theory is to characterize probability distributions and to quantify the
similarity between two probability distributions.
Basic Concepts
Note: All of these concepts are used in Prof. Tishby's work; they are a part of information theory and
are often used in deep learning.
1. Entropy
The Shannon entropy H, in units of bits (per symbol), is given by

H = −∑ᵢ pᵢ log₂ pᵢ

where pᵢ is the probability of occurrence of the i-th possible value of the source symbol. Intuitively, the
entropy H(X) of a random variable X gives us a measure of the amount of uncertainty associated
with the value of X when only its distribution is known. Below is an example of the entropy of a
Bernoulli trial as a function of the success probability (called the binary entropy function):

Fig.1 Entropy H(X) (i.e. the expected surprisal) of a coin flip, measured in bits, graphed versus the
bias of the coin Pr(X=1), where X=1 represents a result of heads.
In general, for any random variable X, the entropy H is defined as H(X) = −∑_{x∈X} p(x) log p(x) = ∑_{x∈X} p(x) I(x),
where X is the set of all messages {x₁, …, xₙ} that X could be and I(x) = −log p(x) is the self-information,
which is the entropy contribution of an individual message.
2. Joint Entropy
This concept of entropy extends to the case of two random variables (say X and Y). The joint
entropy of (X, Y) is defined as:
H(X,Y) = −∑_{x,y} p(x,y) log₂ p(x,y)
This entropy is equal to the sum of the individual entropies of X and Y when they are independent.
3. Conditional Entropy
The conditional entropy (also called conditional uncertainty) of a random variable X given a random
variable Y is the average conditional entropy over Y:
H(X∣Y)=−∑y∈Yp(y)∑x∈Xp(x∣y)logp(x∣y)=−∑x,yp(x,y)logp(x∣y)
One basic property that the conditional entropy satisfies is given by:
H(X∣Y)=H(X,Y)−H(Y)
4. Mutual Information
Using mutual information, we can quantify the amount of information that can be obtained for one
random variable using the information from the other. The mutual information of X relative to Y is
given by:
I(X;Y) = ∑_{x,y} p(x,y) log( p(x,y) / (p(x)p(y)) ) = ∑_{x,y} p(x,y) · pmi(x;y)
where pmi(x;y) is the specific mutual information (also called pointwise mutual information), defined as:
pmi(x;y) ≡ log( p(x,y) / (p(x)p(y)) ) = log( p(x|y)/p(x) ) = log( p(y|x)/p(y) )
A basic property of the mutual information is that
I(X;Y)=H(X)−H(X∣Y)
Intuitively, this means that knowing Y, we can save an average of I(X;Y) bits in
encoding X compared to not knowing Y.
Mutual information is symmetric: I(X;Y)=I(Y;X)=H(X)+H(Y)−H(X,Y).
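A short Python sketch of these definitions; the joint table below is an illustrative example of two independent fair bits, so the mutual information comes out to zero:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

p_xy = np.array([[0.25, 0.25],       # joint distribution p(x, y)
                 [0.25, 0.25]])
p_x = p_xy.sum(axis=1)               # marginal p(x)
p_y = p_xy.sum(axis=0)               # marginal p(y)

H_X, H_Y, H_XY = entropy(p_x), entropy(p_y), entropy(p_xy)
I_XY = H_X + H_Y - H_XY              # mutual information (symmetric form)
print(H_X, H_XY, I_XY)               # 1.0 2.0 0.0
```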


Deep learning book Ch. 4- Numerical computation notes


James Chuang

February 3, 2017

Contents

DL 4.1 overflow and underflow
DL 4.2 poor conditioning
DL 4.3 gradient-based optimization
DL 4.4 Constrained Optimization
DL 4.5 Example: Linear Least Squares

My notes on Deep Learning Book Chapter 4 on Numerical Computation.

ML algorithms usually require a high amount of numerical computation, which typically refers to algorithms that solve mathematical
problems by methods that update estimates of a solution iteratively, rather than solving analytically. Evaluating a function involving
real numbers on a computer can be difficult, because the real numbers must be represented with a finite amount of memory.

DL 4.1 overflow and underflow

The fundamental difficulty in performing continuous math on a digital computer is the need to represent infinitely many real numbers
with a finite number of bit patterns. This leads to approximation error, which can just be rounding error. Rounding error is especially
problematic when it compounds across many operations, and can cause algorithms to fail.

Forms of rounding error:

• underflow : when numbers near zero are rounded to zero. This is especially a problem when it leads to dividing by zero or
taking the log of zero.
• overflow : when numbers with large magnitude are approximated as ∞ or −∞.

The softmax function is an example of a function that must be stabilized against underflow and overflow:

softmax(x)_i = exp(x_i) / ∑_{j=1}^{n} exp(x_j)

Analytically, when all of the x_i are equal to a constant c, all of the outputs should be 1/n. Numerically, this may not occur when c has
large magnitude. For very negative c, exp(c) will underflow, causing the denominator to become 0, and making the output undefined.
For very positive c, exp(c) will overflow, again making the output undefined. Both of these difficulties can be resolved by instead
evaluating softmax(z), where z = x − max_i x_i. The value of the softmax function is not changed analytically by adding or
subtracting a scalar from the input vector:

softmax(z)_i = exp(x_i − max_k x_k) / ∑_{j=1}^{n} exp(x_j − max_k x_k)
             = (exp(x_i) / exp(max_k x_k)) / ∑_{j=1}^{n} (exp(x_j) / exp(max_k x_k))
             = exp(x_i) / ∑_{j=1}^{n} exp(x_j)
             = softmax(x)_i


Subtracting max_i x_i makes the largest argument to exp be 0, which protects against overflow. Likewise, at least one term in the
denominator has a value of 1, protecting against underflow in the denominator and subsequent division by zero. However, underflow
in the numerator can still cause the expression to erroneously evaluate to zero, so that taking log softmax(x) would erroneously return
−∞. This can be stabilized using the same trick used to stabilize the softmax function above.
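A minimal NumPy sketch of the stabilized softmax and log softmax described above:

```python
import numpy as np

def softmax(x):
    # Shift by max(x): largest argument to exp becomes 0, preventing overflow,
    # and at least one denominator term is 1, preventing division by zero.
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

def log_softmax(x):
    # log-sum-exp trick: avoids taking log of an underflowed (zero) softmax.
    z = x - np.max(x)
    return z - np.log(np.sum(np.exp(z)))

x = np.array([1000., 1000., 1000.])   # naive exp(1000) would overflow to inf
print(softmax(x))                     # [1/3 1/3 1/3], as the analysis predicts
print(log_softmax(x))                 # finite values, no -inf
```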

DL 4.2 poor conditioning

Conditioning : how rapidly a function changes with respect to small changes in its inputs.

Functions that change rapidly in response to small input perturbations can be problematic because rounding error in the inputs can
result in large changes in the output.

Consider the function f(x) = A⁻¹x. When A ∈ R^{n×n} has an eigenvalue decomposition, its condition number is

max_{i,j} |λ_i / λ_j|

the ratio of the magnitudes of the largest and smallest eigenvalues. When the condition number is large, matrix inversion is particularly sensitive to error
in the input.
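A small NumPy sketch of this sensitivity; the diagonal matrix is an artificial worst case:

```python
import numpy as np

A = np.diag([1e6, 1.0])         # eigenvalues 1e6 and 1 -> condition number 1e6
print(np.linalg.cond(A))        # 1000000.0

b = np.array([1.0, 0.0])
db = np.array([0.0, 1e-6])      # tiny perturbation of the input
x = np.linalg.solve(A, b)
dx = np.linalg.solve(A, b + db) - x

# Relative output change divided by relative input change ~ condition number:
print((np.linalg.norm(dx) / np.linalg.norm(x)) /
      (np.linalg.norm(db) / np.linalg.norm(b)))   # ~1e6
```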

DL 4.3 gradient-based optimization

Optimization: minimizing a function f (x) by altering x. (Maximization is simply minimizing −f (x)).

• objective function or criterion: the function to be minimized or maximized


• When the function is to be minimized, it can also be called the cost function, loss function, or error function.
The value that minimizes or maximizes a function can be denoted with a superscript ∗, i.e. x∗ = arg min f (x).
The derivative of a function is useful for function minimization because it tells us how to change x in order to make a small improvement
in y . For small enough ϵ, f (x − ϵ sign(f ′ (x))) is less than f (x). This technique is called gradient descent .

• critical points or stationary points: points where f′(x) = 0
  • can be a local minimum, local maximum, or saddle point
• global minimum: the point with the absolute lowest value of f(x) (there can be more than one)
• In deep learning, the functions may have many suboptimal local minima, and many saddle points surrounded by flat
regions. This makes optimization difficult, especially for high-dimensional functions. We therefore usually settle for finding
a "very low" value of f, which is not necessarily a global minimum.

The above can be generalized to functions with multiple inputs f : R^n → R. (In order to minimize the function, the output must still
be a single scalar.) In this case, the gradient ∇_x f(x) is analogous to the derivative. The directional derivative in a direction
u (a unit vector) is the slope of the function f in direction u, i.e. \frac{\partial}{\partial \alpha} f(x + \alpha u) evaluated at α = 0, which equals u^T ∇_x f(x). The function f can be minimized by
moving in the direction of the negative gradient (this is known as the method of steepest descent or gradient descent):

x' = x − ϵ ∇_x f(x),

where ϵ is the learning rate, a positive scalar determining the step size.

How to set the learning rate ϵ?

• a popular approach is to set ϵ to a small constant
• or, solve for the step size that makes the directional derivative zero
• or, line search: evaluate f(x − ϵ∇_x f(x)) for several values of ϵ and choose the one that results in the smallest objective function value

Steepest descent converges when ∇_x f(x) = 0.

• in cases where there is an analytical solution, this point can be jumped to directly
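A toy NumPy sketch of the update rule and this convergence test (the quadratic objective is purely illustrative):

import numpy as np

def gradient_descent(grad, x0, eps=0.1, delta=1e-8, max_iter=10000):
    # Step against the gradient with a fixed learning rate eps,
    # stopping once the gradient norm falls below the tolerance delta.
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < delta:
            break
        x = x - eps * g
    return x

# Minimize f(x) = x_1^2 + 4*x_2^2, whose gradient is (2*x_1, 8*x_2).
print(gradient_descent(lambda x: np.array([2 * x[0], 8 * x[1]]), [3.0, 2.0]))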


Gradient descent is limited to optimization in continuous spaces

• the concept of making small moves towards better configurations can be generalized to discrete spaces (this is called hill climbing)

DL 4.3.1 Jacobian and Hessian Matrices

Sometimes we need the partial derivatives of a function f : R^m → R^n.

Jacobian matrix J(f)(x): the matrix of all partial derivatives,

J \in \mathbb{R}^{n \times m}, \quad J_{i,j} = \frac{\partial}{\partial x_j} f(x)_i

Second derivatives indicate curvature:

• important because curvature tells us whether a gradient step will cause as much of an improvement as we would expect based on the
gradient alone; consider a step of size ϵ along the negative gradient when the gradient has magnitude 1:
• f″(x) = 0: no curvature, so the cost function decreases by exactly ϵ, as the gradient alone predicts
• f″(x) < 0: the function curves downward, so the cost function decreases by more than ϵ
• f″(x) > 0: the function curves upward, so the cost function decreases by less than ϵ
Hessian matrix H(f)(x): the matrix of all second derivatives,

H(f)(x)_{i,j} = \frac{\partial^2}{\partial x_i \, \partial x_j} f(x)

Anywhere that the second partial derivatives are continuous,

\frac{\partial^2}{\partial x_i \, \partial x_j} f(x) = \frac{\partial^2}{\partial x_j \, \partial x_i} f(x),

implying that H_{i,j} = H_{j,i}, meaning that the Hessian is symmetric at such points.

• Most of the functions we will encounter have symmetric Hessian almost everywhere
• because the Hessian matrix is real and symmetric, it can be decomposed into a set of real eigenvalues and an orthogonal basis
of eigenvectors
• the second derivative in the direction of a unit vector d is given by d^T H d (see the sketch after this list)
• when d is an eigenvector of H, the second derivative in the direction of d is given by the corresponding eigenvalue
• for other directions, the second derivative is a weighted average of all of the eigenvalues, with weights between 0
and 1
• max eigenvalue → max second derivative
• min eigenvalue → min second derivative
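A quick NumPy check of these facts, using a small made-up symmetric matrix as the Hessian:

import numpy as np

# A symmetric "Hessian" with eigenvalues 1 and 4 (illustrative values).
H = np.array([[2.5, 1.5],
              [1.5, 2.5]])
vals, vecs = np.linalg.eigh(H)   # eigenvalues in ascending order

d = vecs[:, 1]                   # eigenvector for the largest eigenvalue
print(d @ H @ d, vals[1])        # directional 2nd derivative equals that eigenvalue

u = np.array([1.0, 0.0])         # an arbitrary unit direction
print(u @ H @ u)                 # 2.5: a weighted average, between 1 and 4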
The directional second derivative measures how well a gradient descent step is expected to perform. The second-order Taylor series
approximation of the function f(x) about the current point x^{(0)} is

f(x) \approx f(x^{(0)}) + (x - x^{(0)})^T g + \frac{1}{2} (x - x^{(0)})^T H (x - x^{(0)}),

where g = \nabla_x f(x^{(0)}) and H = H(f)(x^{(0)}). Using gradient descent, the new point is x = x^{(0)} - \epsilon g. Substituting this in:

f(x^{(0)} - \epsilon g) \approx f(x^{(0)}) + (x^{(0)} - \epsilon g - x^{(0)})^T g + \frac{1}{2} (x^{(0)} - \epsilon g - x^{(0)})^T H (x^{(0)} - \epsilon g - x^{(0)})
                        \approx f(x^{(0)}) - \epsilon g^T g + \frac{1}{2} \epsilon^2 g^T H g

new value of f ≈ original value − expected improvement due to slope + correction for curvature


• g^T H g very positive → the curvature term can dominate, so a gradient descent step with too large ϵ can actually move uphill
• g^T H g zero or negative → the Taylor series approximation predicts that ϵ → ∞ will decrease f infinitely
  • in practice, one must then resort to heuristic choices of ϵ
• g^T H g positive → solving for the optimal step size yields

\epsilon^* = \frac{g^T g}{g^T H g}

  • worst case: g aligns with the eigenvector of H corresponding to the maximal eigenvalue λ_max
    • then the optimal step size is 1/λ_max
  • thus, to the extent that the function we minimize can be approximated well by a quadratic function, the eigenvalues
    of the Hessian determine the scale of the learning rate
• second derivative test : determine whether a critical point is a local maximum, local minimum, or saddle point
• at critical points, f ′ (x) = 0
• if f ′′ (x) > 0, f ′ (x) increases to the right, and f ′ (x) decreases to the left
• i.e., f ′ (x − ϵ) < 0 and f ′ (x + ϵ) > 0 for small ϵ
• therefore, x is a local minimum
• similarly, if f ′ = 0 and f ′′ (x) < 0, x is a local maximum
• if f ′′ = 0, test is inconclusive
• x may be a saddle point, or part of a flat region
• using the eigendecomposition of the Hessian matrix, the second derivative test can be generalized to multiple dimensions (a code sketch follows this list)
• at critical points, ∇x f (x) = 0
• if Hessian is positive definite → all eigenvalues positive → directional second derivative in any direction is positive
→ x is a local minimum
• if Hessian is negative definite → all eigenvalues negative → directional second derivative in any direction is
negative → x is a local maximum
• if Hessian has at least one positive eigenvalue and one negative eigenvalue, x is a local max in at least one
dimension and a local min in at least one dimension, and therefore is a saddle point
• if all non-zero eigenvalues have the same sign, but at least one eigenvalue is zero, the test is inconclusive,
because the univariate second derivative test is inconclusive in the cross section corresponding to the zero eigenvalue
• in multiple dimensions, there is a different second derivative for each direction
• the condition number of the Hessian measures how much the second derivatives vary
• poor condition number means gradient descent performs poorly, because gradient descent is unaware that it should
preferentially explore only in the steep directions
• this also makes it difficult to choose a step size:
• step size must be small to avoid overshooting the minimum and going uphill in directions with strong positive
curvature
• this small step size is usually too small to make significant progress in directions with little curvature
• this can be resolved by using the Hessian to guide the search, e.g. with the Newton-Raphson method
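The multivariate second derivative test above can be coded directly from the Hessian's eigenvalues. A hypothetical helper, with a made-up tolerance for "numerically zero":

import numpy as np

def classify_critical_point(H, tol=1e-10):
    # Classify a critical point from the eigenvalues of a symmetric Hessian.
    vals = np.linalg.eigvalsh(H)
    if np.all(vals > tol):
        return "local minimum"       # positive definite
    if np.all(vals < -tol):
        return "local maximum"       # negative definite
    if np.any(vals > tol) and np.any(vals < -tol):
        return "saddle point"        # mixed signs
    return "inconclusive"            # some eigenvalue is (numerically) zero

print(classify_critical_point(np.array([[2.0, 0.0], [0.0, -3.0]])))  # saddle point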
The Newton-Raphson method is based on using the second-order Taylor series expansion to approximate f(x) near some point x^{(0)}:

f(x) \approx f(x^{(0)}) + (x - x^{(0)})^T \nabla_x f(x^{(0)}) + \frac{1}{2} (x - x^{(0)})^T H(f)(x^{(0)}) (x - x^{(0)})

Solving for the critical point:

x^* = x^{(0)} - H(f)(x^{(0)})^{-1} \nabla_x f(x^{(0)})
• For a positive definite quadratic function, Newton’s method arrives at the minimum in one step
• if f is not truly quadratic but can be locally approximated as positive definite quadratic, Newton’s method is iteratively applied
• this can arrive at the minimum much faster than gradient descent
• this is useful near a local minimum, but can be harmful near a saddle point
• Newton’s method is only appropriate when the nearby critical point is a minimum
• in contrast, gradient descent is not attracted to saddle points unless the gradient points toward them
• gradient descent is a first order optimization algorithm


• methods such as the Newton-Raphson method that take the Hessian into account are called second order optimization
algorithms
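A minimal sketch of a single Newton-Raphson step; the quadratic objective and its data are illustrative, and for a positive definite quadratic one step reaches the minimum:

import numpy as np

def newton_step(x, grad, hess):
    # One Newton-Raphson update: jump to the critical point of the local
    # second-order Taylor approximation.
    return x - np.linalg.solve(hess(x), grad(x))

# f(x) = 0.5 x^T A x - b^T x with A positive definite (made-up values).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
grad = lambda x: A @ x - b
hess = lambda x: A

x = newton_step(np.zeros(2), grad, hess)
print(np.allclose(A @ x, b))  # True: the minimum is reached in a single step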

Optimization algorithms come with almost no guarantees, because the family of functions encountered is very complicated

• in deep learning, can sometimes gain some guarantees by restricting to functions that are Lipschitz continuous or have
Lipschitz continuous derivatives
• a Lipschitz continuous function is a function f whose rate of change is bounded by a Lipschitz constant L:

∀x, y, |f (x) − f (y)| ≤ L∥x − y∥2

• this allows us to quantify the assumption that a small change in the input made by an algorithm such as gradient descent
will produce a small change in the output
• Lipschitz continuity is also a fairly weak constraint
• many optimization problems in deep learning can be made Lipschitz continuous with relatively minor modifications
The most successful field of specialized optimization is convex optimization

• convex optimization algorithms are able to provide more guarantees by making stronger restrictions
• applicable only to convex functions, i.e. functions for which the Hessian is positive semidefinite everywhere
• these functions are well-behaved:
• they lack saddle points
• all local minima are necessarily global minima
• most problems in deep learning are difficult to express in terms of convex optimization
• used only as a subroutine of some DL algorithms

DL 4.4 Constrained Optimization

• constrained optimization: optimizing f (x) for values of x in some set S


• feasible points: points x in the set S
• often wish to find a solution that is “small” in some sense
• common solution is a norm constraint, e.g. ∥x∥ ≤ 1
• the Karush-Kuhn-Tucker (KKT) approach: a very general solution to constrained optimization (also see CS229 notes 3)
• design a different, unconstrained optimization problem whose solution can be converted into a solution to the original,
constrained optimization problem
• do this by introducing a new function- the generalized Lagrangian or generalized Lagrange function
• first, describe S in terms of m functions g^{(i)} and n functions h^{(j)} such that

S = \{ x \mid \forall i, \, g^{(i)}(x) = 0 \text{ (equality constraints)}, \; \forall j, \, h^{(j)}(x) \le 0 \text{ (inequality constraints)} \}

• introduce KKT multipliers λ_i and α_j for each constraint, then define the generalized Lagrangian:

L(x, \lambda, \alpha) = f(x) + \sum_i \lambda_i g^{(i)}(x) + \sum_j \alpha_j h^{(j)}(x)

• consider the following quantity:

\max_{\lambda, \alpha, \alpha \ge 0} L(x, \lambda, \alpha)

Any time a constraint is violated (g^{(i)}(x) ≠ 0 or h^{(j)}(x) > 0),

\max_{\lambda, \alpha, \alpha \ge 0} L(x, \lambda, \alpha) = \infty,

while when the constraints are all satisfied,

\max_{\lambda, \alpha, \alpha \ge 0} L(x, \lambda, \alpha) = f(x).

Downloaded by Gangadharareddy Peddamallu ([email protected])


lOMoARcPSD|29808885

This means that

\min_x \max_{\lambda, \alpha, \alpha \ge 0} L(x, \lambda, \alpha) = \min_{x \in S} f(x),

i.e., the solution to the new, unconstrained problem is the same as the solution to the original constrained problem.
• regarding the inequality constraints:
• a constraint h(i) (x) ≤ 0 is active if it holds with equality, i.e. h(i) (x) = 0
• if a constraint is not active, then the solution to the problem using that constraint would remain at least a local
solution if that constraint were removed
• because an inactive h^{(i)} has a negative value, the solution to

\min_x \max_{\lambda, \alpha, \alpha \ge 0} L(x, \lambda, \alpha)

will have α_i = 0
  • thus, α ⊙ h(x) = 0
  • i.e., at least one of the constraints α_i ≥ 0 or h^{(i)}(x) ≤ 0 must be active at the solution
  • the solution is either on the boundary imposed by the inequality, and its KKT multiplier influences the
    solution, or the inequality has no influence on the solution, and its KKT multiplier is zero
• the above properties are the Karush-Kuhn-Tucker (KKT) conditions, i.e.:
  • the gradient of the generalized Lagrangian is zero
  • all constraints on both x and the KKT multipliers are satisfied
  • α ⊙ h(x) = 0

DL 4.5 Example: Linear Least Squares

Minimize the squared error loss function

f(x) = \frac{1}{2} \|Ax - b\|_2^2
This can be analytically solved by taking the gradient and setting it to zero, since it is a positive definite quadratic function. However,
it can also be used as a simple test case for gradient-based optimization.

Finding the gradient:

f(x) = \frac{1}{2} \|Ax - b\|_2^2
     = \frac{1}{2} (Ax - b)^T (Ax - b)
     = \frac{1}{2} \left[ (Ax)^T (Ax) - (Ax)^T b - b^T (Ax) + b^T b \right]
     = \frac{1}{2} \left[ x^T A^T A x - 2 x^T A^T b + b^T b \right]

\nabla_x f(x) = \frac{1}{2} \left[ 2 A^T A x - 2 A^T b \right] = A^T A x - A^T b

Pseudocode for minimization by gradient descent:

• initialize the step size ϵ and convergence tolerance δ to small, positive numbers
• while ∥A^T A x − A^T b∥_2 > δ do
  • x ← x − ϵ (A^T A x − A^T b)
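A runnable NumPy version of this pseudocode (the test matrix A and vector b are arbitrary illustrative values):

import numpy as np

def least_squares_gd(A, b, eps=0.01, delta=1e-8, max_iter=100000):
    # Gradient descent on f(x) = 0.5 * ||Ax - b||_2^2 using the gradient
    # A^T A x - A^T b derived above.
    x = np.zeros(A.shape[1])
    for _ in range(max_iter):
        grad = A.T @ A @ x - A.T @ b
        if np.linalg.norm(grad) <= delta:
            break
        x = x - eps * grad
    return x

A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
b = np.array([1.0, 2.0, 2.0])
x = least_squares_gd(A, b)
print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))  # True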
If solving by Newton’s method, the quadratic approximation is exact (since f is a positive definite quadratic function), so the minimum
is reached in one step.

Suppose we want to minimize the same function, but subject to the constraint x^T x ≤ 1. This constraint can be rewritten as
x^T x − 1 ≤ 0. Therefore, the Lagrangian is

L(x, λ) = f(x) + λ (x^T x − 1),


and the solution to the original (primal) problem can be found by solving

\min_x \max_{\lambda \ge 0} L(x, \lambda).

The smallest-norm solution to the unconstrained least squares problem may be found using the Moore-Penrose pseudoinverse,
x = A^+ b. If this point is feasible, then it is the solution to the constrained problem. Otherwise, we must find a solution where the
constraint is active. Start by differentiating the Lagrangian w.r.t. x and setting it equal to the zero vector:

L(x, λ) = f(x) + λ (x^T x − 1)
        = \frac{1}{2} \left[ x^T A^T A x - 2 x^T A^T b + b^T b \right] + λ x^T x − λ

\nabla_x L(x, λ) = A^T A x − A^T b + 2λx = 0

(A^T A + 2λI) x = A^T b

x = (A^T A + 2λI)^{-1} A^T b

The solution must take the above form, with the magnitude of λ chosen such that the result obeys the constraint (i.e., x^T x ≤ 1).
The value can be found by performing gradient ascent on λ:

L(x, λ) = \frac{1}{2} \left[ x^T A^T A x - 2 x^T A^T b + b^T b \right] + λ x^T x − λ

\frac{\partial}{\partial λ} L(x, λ) = x^T x − 1

If x^T x > 1, then \frac{\partial L}{\partial λ} > 0, so gradient ascent increases λ.

An increased λ means the weight of the x^T x penalty in the Lagrangian has increased, so solving for x will then yield a
solution with a smaller norm. The process of solving the linear equation and adjusting λ continues until x has the correct norm and
the derivative with respect to λ is 0.
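A minimal sketch of this alternating procedure, assuming made-up data where the unconstrained minimizer violates the constraint (the step size and tolerances are illustrative):

import numpy as np

def constrained_least_squares(A, b, lam_step=0.1, tol=1e-8, max_iter=10000):
    # Alternate: solve (A^T A + 2*lam*I) x = A^T b for x, then perform gradient
    # ascent on lam using dL/dlam = x^T x - 1, until the KKT conditions hold.
    n = A.shape[1]
    lam = 0.0
    for _ in range(max_iter):
        x = np.linalg.solve(A.T @ A + 2.0 * lam * np.eye(n), A.T @ b)
        dlam = x @ x - 1.0
        if lam == 0.0 and dlam < 0:
            break                      # constraint inactive: multiplier stays 0
        if abs(dlam) < tol:
            break                      # x has the correct norm
        lam = max(0.0, lam + lam_step * dlam)
    return x, lam

A = np.eye(2)
b = np.array([3.0, 4.0])               # unconstrained minimizer [3, 4] is infeasible
x, lam = constrained_least_squares(A, b)
print(x, x @ x)                        # ~[0.6, 0.8], on the unit sphere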
