DEEP LEARNING
LECTURE NOTES
(3-1 CSE(AI), R20, JNTUA)
Prepared By
Dr.P.GANGADHARA REDDY
Associate Professor
Course Outcomes:
After completion of the course, students will be able to
Reference Books:
1. Fundamentals of Deep Learning: Designing Next-Generation Machine Intelligence Algorithms, Nikhil Buduma, O'Reilly, Shroff Publishers, 2019.
2. Deep Learning Cookbook: Practical Recipes to Get Started Quickly, Douwe Osinga, O'Reilly, Shroff Publishers, 2019.
https://nptel.ac.in/courses/106105215
Preface
Deep Learning algorithms learn multi-level representations of data, with each level explaining
the data in a hierarchical manner. Such algorithms have been effective at uncovering underlying
structure in data, e.g., features to discriminate between classes. They have been successful in
many artificial intelligence problems including image classification, speech recognition and
natural language processing. The course, which will be taught through lectures and PPTs, will
cover the underlying theory, the range of applications to which it has been applied, and learning
from very large data sets. The course will cover connectionist architectures commonly associated
with deep learning, e.g., basic neural networks, convolutional neural networks and recurrent
neural networks. Methods to train and optimize the architectures and methods to perform
effective inference with them, will be the main focus. Students will be encouraged to use open
source software libraries such as Tensorflow.
Course material loosely follows the organization of the text: I. Goodfellow, Y. Bengio and A.
Courville, Deep Learning, MIT Press, 2016. There are ten chapters organized into two major
subdivisions (after the Overview): I. Applied Math and Machine Learning basics, II. Deep
Networks: Modern Practice.
Lecture Slides
Lecture slides loosely follow those of Prof. Sargur Srihari, Department of Computer Science and Engineering, University at Buffalo. Updated in Spring 2020.
Dr.P.GANGADHARA REDDY
Associate Professor
Aditya College of Engineering, Madanapalle,
Annamayya(Dist.), Andhra Pradesh
UNIT I
Linear Algebra & Probability and Information Theory
Syllabus:
Linear Algebra: Scalars, Vectors, Matrices and Tensors, Matrix operations, types of matrices, Norms, Eigen decomposition,
Singular Value Decomposition, Principal Components Analysis.
Probability and Information Theory: Random Variables, Probability Distributions, Marginal Probability, Conditional Probability,
Expectation, Variance and Covariance, Bayes’ Rule, Information Theory. Numerical Computation: Overflow and Underflow,
Gradient-Based Optimization, Constrained Optimization, Linear Least Squares.
Linear Algebra:
The concepts of linear algebra are crucial for understanding the theory behind machine learning,
especially for deep learning.
Linear algebra is the branch of mathematics that focuses on linear equations. It is often applied to the
science and engineering fields, specifically machine learning. Linear algebra is also central to almost
all areas of mathematics like geometry and functional analysis.
Scalar
A scalar is simply a single number.
Vector
A vector is an ordered array of numbers and can be in a row or a column. A vector has just a single
index, which can point to a specific value within the vector. For example, V2 refers to the second
value within the vector.
Matrix
A matrix is an ordered 2D array of numbers and it has two indices. The first one points to the row and the second to the column. For example, M23 refers to the value in the second row and the third column.
Tensor
A tensor is an array of numbers arranged on a regular grid with a variable number of axes. A third-order tensor has three indices, where the first one points to the row, the second to the column and the third one to the axis. For example, T232 points to the value in the second row, the third column, and the second axis.
Tensor is the most general term for all of these concepts, because a tensor is a multidimensional array and it can be a vector or a matrix, depending on the number of indices it has. A first-order tensor is a vector (1 index), a second-order tensor is a matrix (2 indices), and tensors with three or more indices are called higher-order tensors.
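These objects map directly onto arrays in code. A minimal NumPy sketch (the values are illustrative):

import numpy as np

s = 3.5                                   # scalar: a single number
v = np.array([1, 2, 3])                   # vector: one index; v[1] is the second value
M = np.array([[1, 2, 3],
              [4, 5, 6]])                 # matrix: two indices (row, column)
T = np.arange(24).reshape(2, 3, 4)        # third-order tensor: three indices
print(v.ndim, M.ndim, T.ndim)             # 1 2 3: the order of each object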
Computational Rules of Linear Algebra
1. Matrix-Scalar Operations
When we multiply, divide, subtract, or add a scalar to a matrix, the operation is applied to every element of the matrix. For example, multiplying a matrix by a scalar multiplies each entry by that scalar.
1. Not Commutative
Scalar multiplication is commutative, but matrix multiplication is not. This means that when we multiply scalars, 7*3 is the same as 3*7. But when we multiply matrices by each other, A*B is not the same as B*A.
2. Associative
Scalar and matrix multiplication are both associative. This means that the scalar multiplication 3(5*3)
is the same as (3*5)3 and that the matrix multiplication A(B*C) is the same as (A*B)C.
We previously discussed that matrix multiplication is not commutative but there is one exception,
namely if we multiply a matrix by an identity matrix. Therefore, the following equation is true: A*I =
I*A = A
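These rules are easy to verify numerically. A minimal NumPy sketch, with illustrative matrices:

import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 1], [1, 0]])
C = np.array([[2, 0], [0, 2]])
I = np.eye(2, dtype=int)

print(np.array_equal(A @ B, B @ A))              # False: not commutative
print(np.array_equal(A @ (B @ C), (A @ B) @ C))  # True: associative
print(np.array_equal(A @ I, A), np.array_equal(I @ A, A))  # True True: A*I = I*A = A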
Inverse and Transpose
The matrix inverse and the matrix transpose are two special kinds of matrix properties. Again, we
will start by discussing how these properties relate to real numbers and then how they relate to
matrices.
1. Inverse
A number multiplied by its inverse equals 1. Note that every number except 0 has an inverse. If we multiply a matrix by its inverse, the result is the identity matrix. For example, the inverse of the scalar 5 is 1/5, since 5 × (1/5) = 1.
But not every matrix has an inverse. We can compute the inverse of a matrix only if it is a square matrix and its determinant is non-zero.
Why do we need an inverse? Because we cannot divide matrices. There is no concept of dividing by a matrix, but we can multiply a matrix by an inverse, which achieves essentially the same thing. For example, multiplying a 2 × 2 matrix by its inverse yields the 2 × 2 identity matrix.
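A brief NumPy sketch of this idea (the matrix below is an illustrative invertible example):

import numpy as np

A = np.array([[4.0, 7.0],
              [2.0, 6.0]])
A_inv = np.linalg.inv(A)        # exists because det(A) = 10 != 0
print(A @ A_inv)                # ~ 2x2 identity matrix (up to rounding)
print(np.linalg.det(A))         # 10.0: non-zero, so A is invertible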
2. Transpose
The transpose A^T of a matrix A is obtained by interchanging its rows and columns, i.e., (A^T)ij = Aji. Transposing twice returns the original matrix: (A^T)^T = A.
3. Determinants
The determinant of a matrix is a single numerical value which is used when calculating the inverse or when solving systems of linear equations. The determinant of a matrix A is denoted |A|, or sometimes det(A).
Properties:
The determinant can be expanded along any row or column and yields the same value. The value of the determinant is 0 if all of the elements of a row (or column) are zeros. The determinant of the identity matrix In is 1. If the rows and columns are swapped (i.e., the matrix is transposed), the determinant's value remains the same: det(A^T) = det(A).
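These properties can be checked numerically; a small NumPy sketch with illustrative matrices:

import numpy as np

A = np.array([[1.0, 2.0],
              [5.0, 4.0]])
print(np.linalg.det(A))                 # -6.0
print(np.linalg.det(A.T))               # -6.0: det(A^T) = det(A)
print(np.linalg.det(np.eye(3)))         # 1.0: det(I) = 1
B = np.array([[0.0, 0.0],
              [5.0, 4.0]])              # a matrix with a zero row
print(np.linalg.det(B))                 # 0.0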
Norm:
A norm is a way to measure the size of a vector, a matrix, or a tensor. In other words, norms are a class of functions that enable us to quantify the magnitude of a vector. For instance, the norm of a vector x is a measure of its length from the origin.
Vector norms are defined as functions that take a vector as input and output a non-negative value, called the magnitude of the vector.
The norm of a vector x is denoted by a double bar around it and is written as follows:
Norm of vector x: ||x||
Properties of a Norm:
Consider two vectors x and y of the same size. A function is considered a norm if and only if it satisfies the following properties:
Non-negativity: ||x|| ≥ 0, i.e., it is always non-negative.
Definiteness: ||x|| = 0 if and only if x is the zero vector.
Triangle inequality: ||x + y|| ≤ ||x|| + ||y||, i.e., the norm of a sum of two vectors is no more than the sum of their norms.
Homogeneity: ||λx|| = |λ| ||x||, where λ ∈ ℝ.
Any real-valued function of a vector that satisfies the above four properties is called a norm.
Standard Norms:
Lᵖ Norm
We can generalize to the idea of the p-norm: all the other norms can be derived from it by varying the value of p. Substituting p = 1, 2, and ∞ in the formula below yields the L¹, L², and L∞ norms respectively.
All norm functions originate from a standard equation, known as the p-norm. For a real number p ≥ 1, the Lᵖ norm is mathematically written as:
||x||p = ( ∑i |xi|^p )^(1/p)
This takes an n-dimensional vector x, raises the absolute value of each element to its p-th power, sums the results, and takes the p-th root to obtain the p-norm of the vector, also known as its magnitude. Different values of the parameter p give different norm functions, which we discuss one by one below.
L0 Norm:
Although p = 0 lies outside the domain of the p-norm function, substituting p = 0 in the above equation raises each vector element to the power 0, which gives 1 for every non-zero element; furthermore, the p-th root in the equation is not defined for p = 0. To handle this, the standard way of defining the L0 "norm" is to count the number of non-zero elements in the given vector. For example, the vector x = (3, 0, −2) has ||x||₀ = 2.
L1 Norm / Manhattan Norm:
Substituting p = 1 in the standard equation of the p-norm, we get:
||x||₁ = ∑i |xi|
The L¹ norm is defined as the sum of the absolute values of the components of a given vector. For a vector x with only two components, the L¹ norm is ||x||₁ = |x₁| + |x₂|.
Properties of the L1/Manhattan Norm:
The L¹ norm is used in situations when it is helpful to distinguish between zero and non-zero values.
The L¹ norm increases linearly around the origin.
L² Norm / Euclidean Norm / Squared L2 Norm:
L² is the most commonly used norm and the one most encountered in real life. The L² norm measures the shortest (straight-line) distance from the origin. It is defined as the square root of the sum of the squares of the components of the vector:
||x||₂ = sqrt( ∑i xi² )
For our two-component vector x, ||x||₂ = sqrt(x₁² + x₂²). The Euclidean norm essentially means we are referring to the Euclidean distance. As the name indicates, the squared L2 norm, ||x||₂² = ∑i xi², is the same as the L2 norm but squared.
Properties of the L2 Norm:
The L² norm is the most commonly used one in machine learning.
Since it entails squaring of each component of the vector, it is not robust to outliers.
The L² norm increases slowly near the origin, e.g., 0.1² = 0.01.
It is used in ridge regression, which adds the squared L² norm of the coefficient vector as a penalty term to the loss function.
L∞ Norm / Max Norm:
The L∞ norm is defined as the absolute value of the largest-magnitude component of the vector; therefore, it is also called the max norm. Continuing with our example of a 2D vector x having two components x₁ and x₂, where |x₂| > |x₁|, the L∞ norm would simply be |x₂|.
As infinity is an abstract concept in mathematics, we cannot just substitute p = ∞ in the standard p-norm equation. However, we can study the function's behavior as p approaches infinity using limits, which yields:
||x||∞ = maxi |xi|
That is, the max norm returns the absolute value of the largest-magnitude element. For example, the vector x = (3, 0, −2) has ||x||∞ = 3.
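The following NumPy sketch computes each of the norms discussed above for the illustrative vector x = (3, 0, −2):

import numpy as np

x = np.array([3.0, 0.0, -2.0])

l0 = np.count_nonzero(x)          # L0 "norm": number of non-zero elements -> 2
l1 = np.linalg.norm(x, 1)         # L1 norm: |3| + |0| + |-2| -> 5.0
l2 = np.linalg.norm(x)            # L2 norm: sqrt(9 + 0 + 4) -> ~3.606
linf = np.linalg.norm(x, np.inf)  # L-infinity (max) norm -> 3.0
print(l0, l1, l2, linf)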
Eigen values
Eigenvalues are the scalar values associated with the eigenvectors in a linear transformation. The word 'eigen' is of German origin and means 'characteristic'. Hence, these characteristic values indicate the factor by which eigenvectors are stretched in their direction.
The equation for eigenvalue is given by
Av = λv
Where,
A is the matrix,
v is associated eigenvector, and
λ is scalar eigenvalue.
Eigen vectors:
Eigenvectors are also called characteristic vectors and can only be found for square matrices. Eigenvectors for square matrices are defined as non-zero vectors which, when multiplied by the square matrix, give a scalar multiple of the vector, i.e., we define an eigenvector of matrix A to be "v" if it satisfies the condition Av = λv.
The scalar multiple λ in the above case is called the eigenvalue of the square matrix. We always
have to find the eigenvalues of the square matrix first before finding the eigenvectors of the matrix.
Eigenvector Equation
The Eigenvector equation is the equation that is used to find the eigenvector of any square matrix.
The eigenvector equation is,
Av = λv
Where,
A is the given square matrix,
v is the eigenvector of matrix A, and
λ is the scalar eigenvalue.
How to get Eigenvalues and Eigenvectors?
If A is a square matrix of order n × n then we can easily find the eigenvector of the square matrix
by following the method discussed below,
We know that the eigenvector is given by the equation Av = λv. Using the identity matrix I of the same order as A (i.e., n × n), this can be rewritten as
(A − λI)v = 0
Solving the above equation we get various values of λ as λ1, λ2, …, λn; these values are called the eigenvalues, and we get an individual eigenvector related to each eigenvalue.
Simplifying the above equation, v is a column matrix of order n × 1, written as
v = [v1 v2 v3 … vn]ᵀ
How to Find an Eigenvector?
The eigenvector of the following square matrix can be easily calculated using the steps below,
Step 1: Find the eigenvalues of the matrix A using the equation det(A − λI) = 0, where "I" is the identity matrix of the same order as matrix A.
Step 2: The values obtained in Step 1 are named λ1, λ2, λ3, ….
Step 3: Find the eigenvector (X) associated with the eigenvalue λ1 using the equation,
(A – λ1I) X = 0
Step 4: Repeat step 3 to find the eigenvector associated with other remaining eigenvalues
λ2, λ3….
Following these steps gives the eigenvector related to the given square matrix.
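In practice these steps are performed by a library routine. A short NumPy sketch (the matrix chosen here is the one used in the worked 2 × 2 example below):

import numpy as np

A = np.array([[1.0, 2.0],
              [5.0, 4.0]])
eigvals, eigvecs = np.linalg.eig(A)        # solves det(A - lambda*I) = 0 internally
print(eigvals)                              # 6 and -1 (order may vary)

i = int(np.argmax(eigvals))                 # pick the eigenvalue 6
v = eigvecs[:, i]                           # its (unit-length) eigenvector
print(np.allclose(A @ v, eigvals[i] * v))   # True: A v = lambda v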
Types of Eigenvector
The eigenvectors calculated for the square matrix are of two types which are,
Right Eigenvector
Left Eigenvector
Right Eigenvector
The eigenvector which is multiplied by the given square matrix from the right-hand side is called
the right eigenvector.
For any square matrix, A of order n × n the eigenvector is the column matrix of order n × 1. If we
find the eigenvector of the matrix A by, Av = λv, “v” in this is called the right eigenvector of the
matrix A and is always multiplied to the right-hand side as matrix multiplication is not
commutative in nature. In general, when we find the eigenvector it is always the right eigenvector.
It is calculated by using the following equation,
A vR = λ vR
Where,
A is the given square matrix of order n × n,
λ is one of the eigenvalues, and
vR is the column eigenvector of order n × 1,
vR = [v1 v2 v3 … vn]ᵀ
Left Eigenvector
The eigenvector which is multiplied by the given square matrix from the left-hand side is called the
left eigenvector.
We can find the left eigenvector of the square matrix A by using the relation, vA = λv.
Here, v is the left eigenvector and is always multiplied on the left-hand side. If matrix A is of order n × n, then v is a row matrix of order 1 × n.
It is calculated by using the following equation,
vL A = λ vL
Where,
A is the given square matrix of order n × n,
λ is one of the eigenvalues, and
vL is the row eigenvector of order 1 × n,
vL = [v1, v2, v3, …, vn]
Eigenvectors of a Square Matrix
We can easily find the eigenvectors of square matrices of order n × n. Now, let's find the eigenvectors of:
Eigenvectors of a 2 × 2 matrix
Eigenvectors of a 3 × 3 matrix.
Eigenvector of a 2 × 2 matrix
The eigenvector of a 2 × 2 matrix can be calculated using the above-mentioned steps. An example of the same is,
Example: Find the eigenvalues and the eigenvectors for the matrix A = [1 2; 5 4] (rows separated by semicolons).
Solution:
If the eigenvalues are represented using λ and the eigenvector is represented as v = [a b]ᵀ, then the eigenvalues are calculated by using the equation,
|A − λI| = 0
|1−λ 2; 5 4−λ| = 0
(1−λ)(4−λ) − (2)(5) = 0
⇒ 4 − 5λ + λ² − 10 = 0
⇒ λ² − 5λ − 6 = 0
⇒ λ² − 6λ + λ − 6 = 0
⇒ λ(λ − 6) + 1(λ − 6) = 0
⇒ (λ − 6)(λ + 1) = 0
λ = 6 and λ = -1
Thus, the eigenvalues are 6 and −1. The respective eigenvectors are,
For λ = 6:
(A − λI)v = 0
[1−6 2; 5 4−6][a b]ᵀ = [−5 2; 5 −2][a b]ᵀ = 0
⇒ −5a + 2b = 0
⇒ 5a − 2b = 0
Simplifying the above equations we get,
5a = 2b
The required eigenvector is,
[a b]ᵀ = [2 5]ᵀ
For λ = −1:
(A + I)v = 0 gives 2a + 2b = 0 and 5a + 5b = 0, i.e., b = −a, so the required eigenvector is [a b]ᵀ = [1 −1]ᵀ.
Principal Components Analysis (PCA)
Principal Components Analysis is a technique that transforms a set of possibly correlated variables into a set of uncorrelated variables called principal components. The first principal component is the direction along which the data varies the most, i.e., it accounts for the largest possible variance. The second principal component is calculated in the same way, with the condition that it is uncorrelated with (i.e., perpendicular to) the first principal component and that it accounts for the next highest variance.
This continues until a total of p principal components have been calculated, equal to the original
number of variables.
Step-by-Step Explanation of PCA
Step 1: Standardization
The aim of this step is to standardize the range of the continuous initial variables so that each one of
them contributes equally to the analysis.
More specifically, the reason why it is critical to perform standardization prior to PCA, is that the
latter is quite sensitive regarding the variances of the initial variables. That is, if there are large
differences between the ranges of initial variables, those variables with larger ranges will dominate
over those with small ranges, which will lead to biased results. So, transforming the data to
comparable scales can prevent this problem.
Once the standardization is done, all the variables will be transformed to the same scale.
Step 2: Covariance Matrix Computation
The aim of this step is to understand how the variables of the input data set are varying from the mean
with respect to each other, or in other words, to see if there is any relationship between them.
Sometimes variables are highly correlated in such a way that they contain redundant information. So, in order to identify these correlations, we compute the covariance matrix.
The covariance matrix is a p × p symmetric matrix (where p is the number of dimensions) that has as entries the covariances associated with all possible pairs of the initial variables. For example, for a 3-dimensional data set with 3 variables x, y, and z, the covariance matrix is a 3 × 3 matrix of this form:
[Cov(x,x) Cov(x,y) Cov(x,z)]
[Cov(y,x) Cov(y,y) Cov(y,z)]
[Cov(z,x) Cov(z,y) Cov(z,z)]
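A compact NumPy sketch of the PCA procedure described above (standardize, compute the covariance matrix, then eigendecompose it to obtain the principal components); the data here is randomly generated for illustration:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # 100 samples, 3 variables (illustrative)

# Step 1: standardization (zero mean, unit variance per variable)
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix (3 x 3, symmetric)
C = np.cov(Z, rowvar=False)

# Step 3: eigendecomposition; eigenvectors are the principal components,
# eigenvalues give the variance explained along each component
eigvals, eigvecs = np.linalg.eigh(C)     # eigh: for symmetric matrices
order = np.argsort(eigvals)[::-1]        # sort by decreasing variance
components = eigvecs[:, order]
print(eigvals[order])                    # variance explained by each PC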
Problems
Problem: Find the eigenvalues and eigenvectors of A = [−4 −7; 1 4].
Solution: Given A = [−4 −7; 1 4],
det(A − λI) = (−4 − λ)(4 − λ) − (−7)(1) = λ² − 9 = 0
so the eigenvalues are λ = 3 and λ = −3.
For λ = 3: (A − 3I)v = 0 gives −7a − 7b = 0, i.e., b = −a, so v₁ = [1 −1]ᵀ.
For λ = −3: (A + 3I)v = 0 gives −a − 7b = 0, i.e., a = −7b, so v₂ = [−7 1]ᵀ.
Probability
Probability is a concept which numerically measures the degree of certainty or uncertainty of the occurrence or non-occurrence of events. In other words, probability is a measure of the degree of happening or not happening of an event.
Basic definitions:
Experiment: An experiment is any physical action or process that is observed and whose result is noted.
Examples: tossing a coin, throwing or rolling a die, and drawing a card from a deck of 52 cards.
Trial: A single performance of an experiment.
Outcome/output: It is the result of an experiment
Sample Space: The set of all possible outcomes in any Experiment is called the sample space. And it
is represented by the letter “S”. The sample space is a universal set for the experiment. The elements
of S are called sample points.
Ex. 1: For the coin-toss experiment, the sample space would be the results "Head" and "Tail", which we may represent by S = {H, T}.
Ex. 2: If we toss a die, the sample space (the set of all possible outcomes) is S = {1, 2, 3, 4, 5, 6}.
Types of Sample Space:
1. Discrete Sample Space: Consider the experiment of tossing a coin twice. The sample space is S = {HH, HT, TH, TT}. This sample space has a finite number of sample points and is called a discrete (finite) sample space.
2. Continuous Sample Space: Obtaining a number on a spinning pointer is an example of a continuous sample space. The pointer can come to rest at any value in an interval, e.g.
S = {x : 0 < x ≤ 1.0}
Such a sample space has an uncountably infinite number of sample points and is called a continuous sample space.
P(A ∪ B) ≤ P(A) + P(B),
i.e., the probability of the union of two events is always less than or equal to the sum of the event probabilities.
Conditional Probability: If an experiment repeats n times and a sample space contains only two
events A and B and event A occurs n(A) times, event B occurs n(B) times and the joint event of A
and B occurs n(AB) times then the conditional probability of event A given event B is equal to the
relative frequency of the joint occurrence n(AB) with respect to n(B) as n tends to infinity.
Mathematically,
P(A|B) = lim(n→∞) n(AB)/n(B) = P(A∩B)/P(B), provided P(B) ≠ 0.
That is, the conditional probability P(A|B) is the probability of event A occurring on the condition that event B is known to have occurred. Similarly, the conditional probability of the occurrence of B when event A is given can be expressed as
P(B|A) = P(A∩B)/P(A), provided P(A) ≠ 0.
From the conditional probabilities, the joint probability of the events A and B can be expressed as
P(A∩B) = P(A|B)P(B) = P(B|A)P(A)
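The relative-frequency definition above can be checked by simulation. A small NumPy sketch, with hypothetical events defined on a single die roll:

import numpy as np

rng = np.random.default_rng(1)
rolls = rng.integers(1, 7, size=100_000)    # n die rolls

B = rolls >= 4                              # event B: roll is 4, 5 or 6
A = rolls % 2 == 0                          # event A: roll is even
n_B = B.sum()
n_AB = (A & B).sum()

print(n_AB / n_B)                           # relative frequency ~ P(A|B)
# exact: P(A∩B)/P(B) = (2/6)/(3/6) = 2/3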
Total Probability Theorem: Consider a sample space S that has N mutually exclusive and exhaustive events Bn, n = 1, 2, …, N, such that Bm ∩ Bn = ∅ for m ≠ n and B1 ∪ B2 ∪ … ∪ BN = S. The probability of any event A defined on this sample space can be expressed in terms of the conditional probabilities of A given the events Bn. This probability is known as the total probability of event A. Mathematically,
P(A) = ∑n P(A|Bn)P(Bn), for n = 1, 2, …, N.
Proof: The sample space S is partitioned into N mutually exclusive events Bn, n = 1, 2, …, N.
Random Variables:
A random variable is a real-valued function that assigns a number to every outcome in the sample space of an experiment. Typical random variables are the number of hits in a shooting game, the number of heads when tossing coins, temperature/pressure variations of a physical system, etc. For example, consider an experiment that consists of tossing two coins.
Let the random variable X be the number of heads shown. Then X maps the event "showing no head" to zero, the event "any one head" to one, and "both heads" to two. Therefore the range of the random variable is X = {0, 1, 2}, with elements x1 = 0, x2 = 1 and x3 = 2.
Classification of Random Variables: Random variables are classified into continuous, discrete and
mixed random variables.
The values of a continuous random variable are continuous in a given continuous sample space. A continuous sample space has an infinite range of values, so a continuous random variable cannot be defined on a discrete sample space, which contains only discrete values.
Types of continuous random variables: Normal/Gaussian R.V, Exponential R.V, Gamma R.V,
Rayleigh R.V, Uniform R.V
The values of a discrete random variable are only the discrete values in a given sample space. The
sample space for a discrete random variable can be continuous, discrete or even both continuous and
discrete points .They may be also finite or infinite.
Types of Discrete random variables: Bernoulli R.V, Binomial R.V, Poisson R.V, Geometric R.V
The values of mixed random variable are both continuous and discrete in a given sample space. The
mixed random variable has least practical significance or importance.
Expectation &Variance:
Mean of a random variable: If X is the random variable and P is the respective probabilities, the
mean of a random variable is defined by:
Mean (μ) = ∑ XP
where variable X consists of all possible values and P consist of respective probabilities.
Variance of a Random Variable: The variance tells how much the random variable X spreads around the mean value. The formula for the variance of a random variable is given by:
Var(X) = σ² = E[(X − μ)²] = ∑ (X − μ)² P = E[X²] − (E[X])²
2. If two random variables X and Y are independent, then the covariance is zero, i.e., CXY = 0. But the converse is not true.
Proof: Consider two random variables X and Y. If X and Y are independent, we know that E[XY] = E[X]E[Y], and the covariance of X and Y is
CXY = E[XY] − E[X]E[Y] = E[X]E[Y] − E[X]E[Y] = 0.
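A small NumPy sketch of these definitions, using a fair die as an illustrative discrete random variable and independent Gaussian samples for the covariance property:

import numpy as np

# discrete random variable: outcome of a fair die
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
p = np.full(6, 1/6)

mean = np.sum(x * p)                    # E[X] = sum X*P -> 3.5
var = np.sum(x**2 * p) - mean**2        # Var(X) = E[X^2] - (E[X])^2 -> ~2.917
print(mean, var)

# covariance of independent samples is close to 0
rng = np.random.default_rng(2)
X = rng.normal(size=100_000)
Y = rng.normal(size=100_000)            # generated independently of X
print(np.cov(X, Y)[0, 1])               # ~0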
INFORMATION THEORY
Information theory is based on probability theory and statistics and often concerns itself with
measures of information of the distributions associated with random variables. Important quantities of
information are entropy, a measure of information in a single random variable, and mutual
information, a measure of information in common between two random variables.
Information theory revolves around quantifying how much information is present in a signal. It
was originally invented to study sending messages from discrete alphabets over a noisy channel,
such as communication via radio transmission. In the case of deep learning, the most common use
case for information theory is to characterize probability distributions and to quantify the
similarity between two probability distributions.
Basic Concepts
Note: All of these concepts are used in Prof. Tishby's work; they are a part of information theory and are often used in deep learning.
1. Entropy
The Shannon entropy H, in units of bits (per symbol), is given by
H = −∑i pi log₂ pi
where pi is the probability of occurrence of the i-th possible value of the source symbol. Intuitively, the entropy H(X) of a random variable X gives us a measure of the amount of uncertainty associated with the value of X when only its distribution is known. Below is an example of the entropy of a Bernoulli trial as a function of the success probability (called the binary entropy function):
Fig. 1: Entropy H(X) (i.e., the expected surprisal) of a coin flip, measured in bits, graphed versus the bias of the coin Pr(X=1), where X=1 represents a result of heads.
In general, for any random variable X, the entropy H is defined as H(X) = −∑x∈X p(x) log p(x), where X is the set of all messages {x1, …, xn} that X could be and I(x) = −log p(x) is the self-information, which is the entropy contribution of an individual message.
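The binary entropy function of Fig. 1 can be computed directly. A short NumPy sketch:

import numpy as np

def entropy(p):
    """Shannon entropy in bits of a discrete distribution p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # 0 * log(0) is taken as 0
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))             # fair coin -> 1.0 bit (the maximum)
print(entropy([0.9, 0.1]))             # biased coin -> ~0.469 bits
print(entropy([1.0, 0.0]))             # deterministic -> 0.0 bits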
2. Joint Entropy
This concept of entropy applies when there are two random variables (say X and Y). The joint entropy of (X, Y) is defined as:
H(X,Y) = −∑x,y p(x,y) log₂ p(x,y)
This entropy is equal to the sum of the individual entropies of X and Y when they are independent.
3. Conditional Entropy
The conditional entropy (also called conditional uncertainty) of a random variable X given a random
variable Y is the average conditional entropy over Y:
H(X∣Y)=−∑y∈Yp(y)∑x∈Xp(x∣y)logp(x∣y)=−∑x,yp(x,y)logp(x∣y)
One basic property that the conditional entropy satisfies is given by:
H(X∣Y)=H(X,Y)−H(Y)
4. Mutual Information
Using mutual information, we can quantify the amount of information that can be obtained for one
random variable using the information from the other. The mutual information of X relative to Y is
given by:
I(X;Y) = ∑x,y p(x,y) log( p(x,y) / (p(x)p(y)) )
Each term in this sum is the Specific Mutual Information (also called pointwise mutual information), defined as:
pmi(x;y) ≡ log( p(x,y) / (p(x)p(y)) ) = log( p(x|y) / p(x) ) = log( p(y|x) / p(y) )
A basic property of the mutual information is that
I(X;Y)=H(X)−H(X∣Y)
Intuitively, this means that knowing Y, we can save an average of I(X;Y) bits in
encoding X compared to not knowing Y.
Mutual information is symmetric: I(X;Y)=I(Y;X)=H(X)+H(Y)−H(X,Y).
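A minimal NumPy sketch that computes the mutual information of a small, illustrative joint distribution and checks the identity above:

import numpy as np

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# illustrative 2x2 joint distribution p(x, y)
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])
px = pxy.sum(axis=1)                   # marginal p(x)
py = pxy.sum(axis=0)                   # marginal p(y)

mask = pxy > 0
I = np.sum(pxy[mask] * np.log2(pxy[mask] / np.outer(px, py)[mask]))
print(I)                                          # mutual information I(X;Y)
print(H(px) + H(py) - H(pxy.ravel()))             # same value: H(X)+H(Y)-H(X,Y)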
NUMERICAL COMPUTATION
Topics: overflow and underflow; poor conditioning; gradient-based optimization; constrained optimization; example: linear least squares.
ML algorithms usually require a high amount of numerical computation, which typically refers to algorithms that solve mathematical
problems by methods that update estimates of a solution iteratively, rather than solving analytically. Evaluating a function involving
real numbers on a computer can be difficult, because the real numbers must be represented with a finite amount of memory.
Overflow and Underflow:
The fundamental difficulty in performing continuous math on a digital computer is the need to represent infinitely many real numbers
with a finite number of bit patterns. This leads to approximation error, which can just be rounding error. Rounding error is especially
problematic when it compounds across many operations, and can cause algorithms to fail.
• underflow : when numbers near zero are rounded to zero. This is especially a problem when it leads to dividing by zero or
taking the log of zero.
• overflow : when numbers with large magnitude are approximated as ∞ or −∞.
The softmax function is an example of a function that must be stabilized against underflow and overflow:
softmax(x)i = exp(xi) / ∑j=1..n exp(xj)
Analytically, when all of the xi are equal to a constant c, all of the outputs should be 1/n. Numerically, this may not occur when c has
large magnitude. For very negative c, exp(c) will underflow, causing the denominator to become 0, and making the output undefined.
For very positive c, exp(c) will overflow, again making the output undefined. Both of these difficulties can be resolved by instead
evaluating softmax(z), where z = x − maxi xi. The value of the softmax function is not changed analytically by adding or subtracting a scalar from the input vector:
softmax(z)i = exp(xi − maxk xk) / ∑j=1..n exp(xj − maxk xk)
            = [exp(xi) / exp(maxk xk)] / ∑j=1..n [exp(xj) / exp(maxk xk)]
            = exp(xi) / ∑j=1..n exp(xj)
            = softmax(x)i
Subtracting maxi xi makes the largest argument to exp be 0, which protects against overflow. Likewise, at least one term in the
denominator has a value of 1, protecting against underflow in the denominator and subsequent division by zero. However, underflow
in the numerator can still cause the expression to erroneously evaluate to zero. Then, taking log softmax(x) would erroneously return
−∞. This can be stabilized using the same trick used to stabilize the softmax function above.
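A short NumPy sketch of the stabilization trick described above:

import numpy as np

def softmax(x):
    """Numerically stable softmax: subtract the max before exponentiating."""
    z = x - np.max(x)                  # largest argument to exp becomes 0
    e = np.exp(z)
    return e / e.sum()

x = np.array([1000.0, 1000.0, 1000.0])
print(softmax(x))                      # [1/3 1/3 1/3]; a naive version gives nan

def log_softmax(x):
    """Stable log softmax: avoids taking the log of an underflowed zero."""
    z = x - np.max(x)
    return z - np.log(np.sum(np.exp(z)))

print(log_softmax(np.array([-1000.0, 0.0])))  # finite; naive log(softmax) gives -inf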
Poor Conditioning:
Conditioning refers to how rapidly a function changes with respect to small changes in its inputs. Functions that change rapidly in response to small input perturbations can be problematic, because rounding error in the inputs can result in large changes in the output.
Consider the function f(x) = A⁻¹x. When A ∈ ℝⁿˣⁿ has an eigenvalue decomposition, its condition number is
maxi,j |λi / λj|,
the ratio of the magnitudes of the largest and smallest eigenvalues. When the condition number is large, matrix inversion is particularly sensitive to error in the input.
Gradient-Based Optimization:
The idea of using the derivative to minimize a function generalizes to functions with multiple inputs, f : ℝⁿ → ℝ (in order to minimize the function, the output must still be a single scalar). In this case, the gradient ∇x f(x) is analogous to the derivative. The directional derivative in a direction u (a unit vector) is the slope of the function f in direction u, i.e.
∂/∂α f(x + αu) = uᵀ∇x f(x) (evaluated at α = 0).
The function f can be minimized by moving in the direction of the negative gradient (this is known as the method of steepest descent or gradient descent):
x′ = x − ϵ∇x f(x),
where ϵ is the learning rate, a positive scalar determining the step size.
• the concept of making small moves towards better configurations can be generalized to discrete spaces (this is called hill climbing); a small gradient descent sketch follows below
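A minimal NumPy sketch of gradient descent on an illustrative quadratic f(x) = ½xᵀAx − bᵀx, whose gradient is Ax − b:

import numpy as np

A = np.array([[3.0, 0.0],
              [0.0, 1.0]])             # illustrative positive definite matrix
b = np.array([1.0, 1.0])

def f(x):
    return 0.5 * x @ A @ x - b @ x     # quadratic cost function

def grad(x):
    return A @ x - b                   # gradient of f

x = np.zeros(2)
eps = 0.2                              # learning rate (step size)
for _ in range(100):
    x = x - eps * grad(x)              # x' = x - eps * grad f(x)

print(x)                               # ~[1/3, 1], the minimizer A^-1 b
print(f(x))                            # minimal value of f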
Jacobian matrix: for a function f : ℝᵐ → ℝⁿ, the Jacobian is the matrix of all partial derivatives,
J ∈ ℝⁿˣᵐ, with Ji,j = ∂f(x)i / ∂xj
Second derivative f″(x): measures curvature, i.e. how the first derivative changes as the input varies
• important because it tells us whether a gradient step will cause as much of an improvement as we would expect based on the gradient alone
• f″(x) = 0: no curvature; the value can be predicted using only the gradient
• f″(x) < 0: the function curves downward, so the cost function decreases by more than ϵ
• f″(x) > 0: the function curves upward, so the cost function decreases by less than ϵ
Hessian matrix: H(f)(x), the matrix of all second derivatives,
H(f)(x)i,j = ∂²f(x) / (∂xi ∂xj)
Anywhere that the second partial derivatives are continuous, ∂²f(x)/(∂xi ∂xj) = ∂²f(x)/(∂xj ∂xi), implying that Hi,j = Hj,i, meaning that the Hessian is symmetric at such points.
• Most of the functions we will encounter have symmetric Hessian almost everywhere
• because the Hessian matrix is real and symmetric, it can be decomposed into a set of real eigenvalues and an orthogonal basis
of eigenvectors
• the second derivative in the direction of a unit vector d is given by dT Hd
• when d is an eigenvector of H, the second derivative in the direction of d is given by the corresponding eigenvalue
• for other directions, the second derivative is a weighted average of all of the eigenvalues, with weights between 0
and 1
• max eigenvalue → max second derivative
• min eigenvalue → min second derivative
The directional second derivative measures how well a gradient descent step is expected to perform. The second-order Taylor series approximation of the function f(x) about the current point x⁽⁰⁾ is:
f(x) ≈ f(x⁽⁰⁾) + (x − x⁽⁰⁾)ᵀ g + ½ (x − x⁽⁰⁾)ᵀ H (x − x⁽⁰⁾),
where g = ∇x f(x⁽⁰⁾) and H = H(f)(x⁽⁰⁾).
Using gradient descent, the new point is x = x⁽⁰⁾ − ϵg. Substituting this in:
f(x⁽⁰⁾ − ϵg) ≈ f(x⁽⁰⁾) − ϵgᵀg + ½ϵ²gᵀHg
new value of f ≈ original value − expected improvement due to slope + correction for curvature
When gᵀHg is positive, solving for the step size that decreases the approximation the most yields the optimal step size
ϵ* = gᵀg / (gᵀHg)
• worst case: g aligns with the eigenvector of H corresponding to the maximal eigenvalue λmax
• then, the optimal step size is 1/λmax
• thus, to the extent that the function we minimize can be approximated well by a quadratic function, the eigenvalues of the Hessian determine the scale of the learning rate
• second derivative test : determine whether a critical point is a local maximum, local minimum, or saddle point
• at critical points, f ′ (x) = 0
• if f ′′ (x) > 0, f ′ (x) increases to the right, and f ′ (x) decreases to the left
• i.e., f ′ (x − ϵ) < 0 and f ′ (x + ϵ) > 0 for small ϵ
• therefore, x is a local minimum
• similarly, if f ′ = 0 and f ′′ (x) < 0, x is a local maximum
• if f ′′ = 0, test is inconclusive
• x may be a saddle point, or part of a flat region
• using the eigendecomposition of the Hessian matrix, the second derivative test can be generalized to multiple dimensions
• at critical points, ∇x f (x) = 0
• if Hessian is positive definite → all eigenvalues positive → directional second derivative in any direction is positive
→ x is a local minimum
• if Hessian is negative definite → all eigenvalues negative → directional second derivative in any direction is
negative → x is a local maximum
• if Hessian has at least one positive eigenvalue and one negative eigenvalue, x is a local max in at least one
dimension and a local min in at least one dimension, and therefore is a saddle point
• if all non-zero eigenvalues have the same sign, but at least one eigenvalue is zero, the test is inconclusive, because the univariate second derivative test is inconclusive in the cross section corresponding to the zero eigenvalue
• in multiple dimensions, there is a different second derivative for each direction
• the condition number of the Hessian measures how much the second derivatives vary
• poor condition number means gradient descent performs poorly, because gradient descent is unaware that it should
preferentially explore only in the steep directions
• this also makes it difficult to choose a step size:
• step size must be small to avoid overshooting the minimum and going uphill in directions with strong positive
curvature
• this small step size is usually too small to make significant progress in directions with little curvature
• this can be resolved by using the Hessian to guide the search, e.g. with the Newton-Raphson method
The Newton-Raphson method is based on using the second-order Taylor series expansion to approximate f(x) near some point x⁽⁰⁾:
f(x) ≈ f(x⁽⁰⁾) + (x − x⁽⁰⁾)ᵀ ∇x f(x⁽⁰⁾) + ½ (x − x⁽⁰⁾)ᵀ H(f)(x⁽⁰⁾) (x − x⁽⁰⁾)
Solving for the critical point:
x* = x⁽⁰⁾ − [H(f)(x⁽⁰⁾)]⁻¹ ∇x f(x⁽⁰⁾)
• For a positive definite quadratic function, Newton’s method arrives at the minimum in one step
• if f is not truly quadratic but can be locally approximated as positive definite quadratic, Newton’s method is iteratively applied
• this can arrive at the minimum much faster than gradient descent
• this is useful near a local minimum, but can be harmful near a saddle point
• Newton’s method is only appropriate when the nearby critical point is a minimum
• in contrast, gradient descent is not attracted to saddle points unless the gradient points toward them
• gradient descent is a first order optimization algorithm
• methods such as the Newton-Raphson method that take the Hessian into account are called second order optimization algorithms; a small sketch contrasting the two follows below
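A minimal NumPy sketch contrasting the two on an illustrative quadratic (for which Newton's method reaches the minimum in one step):

import numpy as np

A = np.array([[3.0, 0.5],
              [0.5, 1.0]])            # positive definite Hessian (illustrative)
b = np.array([1.0, 2.0])

def grad(x):
    return A @ x - b                   # gradient of f(x) = 1/2 x^T A x - b^T x

H = A                                  # Hessian of a quadratic is constant

x0 = np.array([5.0, -5.0])
x_newton = x0 - np.linalg.inv(H) @ grad(x0)    # one Newton-Raphson step
print(np.allclose(grad(x_newton), 0))          # True: minimum reached in one step

# gradient descent needs many steps on the same function
x = x0.copy()
for _ in range(200):
    x = x - 0.1 * grad(x)
print(np.allclose(x, x_newton, atol=1e-4))     # True, but only after many iterations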
Optimization algorithms come with almost no guarantees, because the family of functions encountered in deep learning is very complicated
• in deep learning, we can sometimes gain some guarantees by restricting to functions that are Lipschitz continuous or have Lipschitz continuous derivatives
• a Lipschitz continuous function is a function f whose rate of change is bounded by a Lipschitz constant L:
|f(x) − f(y)| ≤ L ||x − y||₂ for all x and y
• this allows us to quantify the assumption that a small change in the input made by an algorithm such as gradient descent will have a small change in the output
• Lipschitz continuity is also a fairly weak constraint
• many optimization problems in deep learning can be made Lipschitz continuous with relatively minor modifications
The most successful field of specialized optimization is convex optimization
• convex optimization algorithms are able to provide more guarantees by making stronger restrictions
• applicable only to convex functions, i.e. functions for which the Hessian is positive semidefinite everywhere
• these functions are well-behaved:
• they lack saddle points
• all local minima are necessarily global minima
• most problems in deep learning are difficult to express in terms of convex optimization
• used only as a subroutine of some DL algorithms
Constrained Optimization:
Sometimes we wish to minimize f(x) only over points x satisfying equality constraints g⁽ⁱ⁾(x) = 0 and inequality constraints h⁽ʲ⁾(x) ≤ 0. The Karush-Kuhn-Tucker (KKT) approach introduces the generalized Lagrangian
L(x, λ, α) = f(x) + ∑i λi g⁽ⁱ⁾(x) + ∑j αj h⁽ʲ⁾(x),
and solves the unconstrained problem min over x of max over λ and α with α ≥ 0 of L(x, λ, α). Whenever the constraints are satisfied, this inner maximum equals f(x), and whenever any constraint is violated,
max over λ, α with α ≥ 0 of L(x, λ, α) = ∞,
i.e., the solution to the new, unconstrained problem is the same as that of the original constrained problem.
• regarding the inequality constraints:
• a constraint h(i) (x) ≤ 0 is active if it holds with equality, i.e. h(i) (x) = 0
• if a constraint is not active, then the solution to the problem using that constraint would remain at least a local
solution if that constraint were removed
• because an inactive h⁽ⁱ⁾ has a negative value, the solution to min over x of max over λ, α with α ≥ 0 of L(x, λ, α) will have αi = 0
• thus, α ⊙ h(x) = 0
• i.e., for each i, at least one of the constraints αi ≥ 0 or h⁽ⁱ⁾(x) ≤ 0 must be active (hold with equality) at the solution
• the solution is either on the boundary imposed by the inequality and its KKT multiplier influences the
solution, or the inequality has no influence on the solution and its KKT multiplier is zero
• the above properties are the Karush-Kuhn-Tucker (KKT) conditions, i.e.:
• the gradient of the generalized Lagrangian is zero
• all constraints on both x and the KKT multipliers are satisfied
• α ⊙ h(x) = 0
Example: Linear Least Squares
Suppose we want to find the value of x that minimizes f(x) = ½ ||Ax − b||₂², subject to the constraint xᵀx ≤ 1. This constraint can be rewritten as xᵀx − 1 ≤ 0. Therefore, the Lagrangian is
L(x, λ) = f(x) + λ(xᵀx − 1),
and the solution to the original (primal) problem can be found by solving min over x of max over λ ≥ 0 of L(x, λ).
The smallest-norm solution to this unconstrained least squares problem may be found using the Moore-Penrose pseudoinverse x =
A+ b. If this point is feasible, then it is the solution to the constrained problem. Otherwise, we must find a solution where the constraint
is active. Start by differentiating the Lagrangian w.r.t. x and setting it equal to the zero vector:
L(x, λ) = f(x) + λ(xᵀx − 1)
        = ½ [xᵀAᵀAx − 2xᵀAᵀb + bᵀb] + λ(xᵀx − 1)
∇x L(x, λ) = AᵀAx − Aᵀb + 2λx = 0
(AᵀA + 2λI) x = Aᵀb
x = (AᵀA + 2λI)⁻¹ Aᵀb
The solution must take the above form, and the magnitude of λ must be chosen such that the result obeys the constraint (i.e., xᵀx ≤ 1). The value can be found by performing gradient ascent on λ. Since
L(x, λ) = ½ [xᵀAᵀAx − 2xᵀAᵀb + bᵀb] + λ(xᵀx − 1),
we have
∂/∂λ L(x, λ) = xᵀx − 1.
If xᵀx > 1, then ∂L/∂λ > 0, so gradient ascent increases λ.
An increased λ means the weight of the xT x penalty in the Lagrangian has increased. Therefore, solving for x will then yield a
solution with a smaller norm. The process of solving the linear equation and adjusting λ continues until x has the correct norm and
the derivative on λ is 0.
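A small NumPy sketch of this procedure (A and b are illustrative), implementing the pseudoinverse check and the λ-adjustment loop described above:

import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(5, 2))            # illustrative overdetermined system
b = rng.normal(size=5) * 10.0          # scaled so the optimum is typically infeasible

x = np.linalg.pinv(A) @ b              # Moore-Penrose pseudoinverse solution
if x @ x > 1:                          # infeasible: the constraint must be active
    lam = 0.0
    for _ in range(1000):
        # solve (A^T A + 2*lambda*I) x = A^T b for the current lambda
        x = np.linalg.solve(A.T @ A + 2 * lam * np.eye(2), A.T @ b)
        lam += 0.1 * (x @ x - 1)       # gradient ascent on lambda
        lam = max(lam, 0.0)            # the KKT multiplier stays non-negative
print(x @ x)                           # ~1.0: solution lies on the constraint boundary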