Lecture Notes
Lecture Notes
The learning algorithm attempts to find the best parameters (w1, w2…wd and b) that yields the correct
value of f(x) for any given x vector. In other words, it chooses the model that gives the least loss among
many potential models. Choosing the potential models is typically a manual process.
Evaluation shouldn’t be done on the training data, and instead on test data. Similarly, the selection of
potential models that are input to the learning model, should be done using validation data.
Unsupervised learning
Typically, unsupervised learning is a pre-processing step, and the interpretation of the output is
performed manually.
In dimensionality reduction technique, the goal is to compress and simplify data. Instead of using all of
it, we choose part of it.
In the density estimation technique, it assigns a probability value to each valid input, such that the sum
of all of it is 1.
The training set is used to fit the model, the validation set is used for model selection and the test set is
used for computing the generalization error
Week-2
• Definition of an open ball B.
Note that D(x, y) indicates the distance between the points x and y, and is represented
mathematically as follows.
.
The above two statements are equivalent, and the bottom one is a definition using the concept of
an open ball centered at x* and having radius ξ
• Vector space is a set of vectors wherein linear combinations of any two of its vectors (u and v) is
another vector in the same space. , where α and β are two real numbers.
• Dot product of vectors is sum of product of its parts.
Univariate calculus
• Function is said to be continuous at x* if and only if
Also written as
we computed the linear approximation of e^3x and 1/sqrt(1+x) separately, multiply them and ignore
the quadratic term.
• The points where the derivative of a function is zero are called critical points.
Multivariate calculus
•
•
•
• As an example, consider this:
Here, w is the vector all of whose components x1, x2 and x3 are 1. Applying the previous
mathematical definition of hyperplane normal to the vector w, we get x1 + x2 + x3 = 1
• Partial derivative of function f evaluated at the point v is defined as
• Gradient of a function evaluated at v is defined as a column vector containing Its partial derivatives
based on each component of x.
• The points where the gradient of a function is a zero vector are called critical points.
• For a multi-variable function f (dimension d), linear approximation is given as
where x and v are vectors in Rd and are approximately equal to each other.
• Higher order approximation is given by
• The gradient of a function f (denoted as ) is a collection of all its partial derivatives (on each
dimension) into a vector.
• A worked-out example of linear approximation of a 2-dimension function
• Gradient is perpendicular to the function’s contour plot at a specific point.
• The most important thing to remember about the gradient: The gradient of f, if evaluated at an
input (x0, y0) points in the direction of the steepest ascent. When the function f accepts more than
two inputs, the interpretation of a gradient is similar.
• For a function f that varies with x, y and z, the directional derivative along the unit vector v is
NOTE: If the vector v is not a unit vector, normalize it before applying the above method, or divide
the above expression by magnitude of vector v.
• In order to find the maximal directional derivative, find the unit vector along the gradient. Then
apply the same formula
.
For example, to find the maximal derivative of f(x,y) = x2y at (3,2), refer to the following working.
• Maximum derivative can also be computed by finding the magnitude of the gradient vector.
• Cauchy-Shwarz inequality states that given 2 d-dimensional vectors a and b,
Note that the equality holds when vector a is a scalar multiple of vector b.
Week-3
Linear Algebra
• C(A) = Vector space spanned by columns of given matrix. This is also called rank of the matrix.
• R(A) = C(AT) = Vector space spanned by rows of given matrix (or by columns of the transpose of
the matrix)
• N(A) = basis for the solution set of a homogeneous linear system derived from the given matrix.
• Left null space = basis for the solution set of a homogeneous linear system derived from the
transpose of the given matrix.
• Null space of a matrix is a vector space, which implies that all linear combinations of its basis
also belong to the null space.
• If the matrix is invertible, the column vectors are linearly independent. In this case, the null
space only has ‘zero’ vector, and column space is the whole space.
•
•
•
• Rank + Nullity = number of columns
• Rank + Left nullity = number of rows
• Column rank = Row rank = Rank of the matrix.
• A set consisting of mutually orthogonal vectors is a linearly independent set.
• Two subspaces are orthogonal to each other, when each element in first subspace is orthogonal
to each element in the second subspace. For example, row subspace of a given matrix is
orthogonal to its null subspace. Similarly, column subspace of a given matrix is orthogonal to
the left null subspace of its transpose.
• If {v1, v2…vk} are mutually orthogonal (non-trivial) set of vectors, it’s a linearly independent set.
• Inverse of the transpose of matrix A = Transpose of the inverse of matrix A
(https://www.youtube.com/watch?v=MsIvs_6vC38)
Projections
• When vector b is not in column space of A, we typically project b into the column space of A.
• Projecting a vector b onto vector a.
For example,
Third observation is that column space of the projection matrix is a line passing through the
vector a. Null space of the projection matrix is a plane orthogonal to vector a.
Further, the projection matrix will not change if vector a is a scalar multiple of itself. Thus,
The following diagram shows vector b projected onto the column space of A. Projection is
denoted by (p = AxHAT). E denotes the error and can be represented as vector (b–p)
Calculations leads to the following equation, from which we can compute the projection.
• To find the least square error of (Ax – b)2 is to find the projection of vector b into column
space of matrix A.
Thus,
Using the above equation for the line on each dimension of the input vector ([-1,1,2]), we get
Given b = [1,1,3]
Rough sketch of these points on the X-Y plane looks like this. From this sketch, it’s clear that e is
orthogonal to both input vectors.
NOTE: Least square error is well covered in these Khan Academy videos:
Least squares examples | Alternate coordinate systems (bases) | Linear Algebra | Khan
Academy - YouTube
Week4
• For Aθ = Y, loss function is represented by the following equation.
In the case of polynomial regression, solution can be obtained by performing linear regression on
, where A is a matrix with transformed features and
•
Eigen values/Eigen-vectors
• For a matrix A, eigen-value λ can be obtained by equating determinant of (A–λI) to 0, and the
corresponding eigenvector lies in the null space of (A–λI).
• If an eigenvalue of a matrix A is zero, then its corresponding eigen-vector belongs to the null
space of A.
• For a real symmetric matrix, all its eigen-values are real, but there could be imaginary eigen-
vectors.
• If matrix A results in distinct eigen values, it’ll have linearly independent eigen vectors.
• If A = aI + B where A and B are matrices and I is the identity matrix, both A and B will have same
eigen-vectors.
• Eigen-value corresponding to every non-zero vector in the column-space of projection matrix is
1.
• Eigen-value corresponding to every non-zero vector orthogonal to the column-space of
projection matrix is 0.
• A permutation matrix has eigen-value(s) 1 or -1.
• If A has r non-zero eigenvalues, then rank of A is at least r.
• If x is an eigenvector of A corresponding to eigenvalue λ1 and x is also an eigenvector
of B corresponding to eigenvalue λ2, then x is also an eigenvector of (A+B).
• Some important properties of eigen-values and eigen-vectors:
o Sum of the eigen-values of a matrix is equal to the sum of its diagonal elements (trace).
o The product of the eigenvalues of a matrix equals the determinant of the matrix.
o Matrix is singular (zero determinant) if and only if it has 0 eigen-value.
o The eigenvalues of an upper (or lower) triangular matrix are the elements on the main
diagonal.
o If λ is an eigenvalue of A, then λ is an eigenvalue of AT.
o If λ is an eigenvalue of A, then λk is an eigenvalue of Ak, for any positive integer k.
o If λ is an eigenvalue of A and if A is invertible, then 1/λ is an eigenvalue of A−1 (follows
from above statement)
o If λ is an eigenvalue of A, then αλ is an eigenvalue of αA, where α is any arbitrary scalar.
o If x is an eigenvector of A corresponding to the eigenvalue λ, then x is an eigenvector of
αA corresponding to eigenvalue αλ.
o If x is an eigenvector of A corresponding to the eigenvalue λ, then x is an eigenvector of
Ak corresponding to the eigenvalue λk, for any positive integer k
• For more detailed discussion refer to
https://www.adelaide.edu.au/mathslearning/system/files/media/documents/2020-03/evalue-
magic-tricks-handout.pdf.
• For details on eigen-values and eigen-vectors refer to
https://www.khanacademy.org/math/linear-algebra/alternate-bases/eigen-everything/v/linear-
algebra-introduction-to-eigenvalues-and-eigenvectors
or
https://ocw.mit.edu/courses/mathematics/18-06-linear-algebra-spring-2010/video-
lectures/lecture-21-eigenvalues-and-eigenvectors/
Matrix Diagonalizability
o
• S is not unique, since S could contain scaled up eigen vectors as its columns.
• Not all matrices are diagonalizable.
• As per linear algebra, a good approximation of the kth Fibonacci number is
Week5
• If U and V are two symmetric matrices, UV is not symmetric, but U + V is symmetric.
• Complex conjugate of a + ib is a – ib. Magnitude of the former is given by re^iθ and that of the
latter is given by re^-iθ
• In the case of complex vector space, the vector is defined using its complex conjugate. Thus,
length of the vector (1, i) is calculated from its complex conjugate (1, -i) and equals 2.
• Given x and y are two vectors in the complex vector space, . Thus, x.y is not equal
to y.x. Recollect, if the vectors belong to the real space, dot product is commutative.
• Following equations are true when x and y belong to complex vector space.
Also,
Also,
• Unitary matrices are square matrices with orthonormal columns. For such matrices,
or
• If U is unitary matrix, then U* is also unitary matrix.
• All Hermitian matrices are unitarily diagonalizable but, all unitarily diagonalizable matrices are
not Hermitian. Watch https://www.youtube.com/watch?v=VYS9EYZ3gCo for an example
diagonalization of a Hermitian matrix.
• Properties of Unitary matrices
o ||Ux|| = ||x|| (Length unchanged)
o Eigen-values of unitary matrices have an absolute value equal to 1, although not
necessarily real.
o Eigen-vectors corresponding to the eigen-values of a unitary matrix are orthogonal.
• Given a non-symmetric n * n matrix A, it cannot be diagonalized, but it can be upper-
triangularized. There exists a unitary matrix U and an upper triangular matrix T such that A =
https://www.khanacademy.org/math/linear-algebra/alternate-bases/orthonormal-
basis/v/linear-algebra-the-gram-schmidt-process
•
•
• In the case of diagonalizable matrices, we already know how to decompose them using spectral
theorem.
• If the matrix is not diagonalizable, we’ll still be able to decompose it, as follows.
NOTE: Note that this works for any real m * n matrix. In this case, Q1 is m * m and Q2 is n * n.
Here Q1 is made up of normalized eigen vectors of AAT. Q2 is made up of normalized eigen vectors
of ATA. Note that both are real symmetric square matrices. These matrices are also orthogonal.
denotes the square root of the eigen values of ATA. The diagonal of the matrix ∑ are normally in a
sorted (descending) order.
• Q2T rotates the input, ∑ stretches it ( gives the semi-major and semi-minor axis length of a
stretched unit circle) and Q1T further rotates it. Effectively, this will produce the same
transformation as matrix A will cause.
Week6
• Every quadratic function of the form ax2+bxy+cy2 has a stationary point at (0, 0). First derivative
of the function becomes zero at this point.
• A function f that vanishes at (0, 0) and is strictly positive at all other points is called positive
definite, and is denoted as f > 0
• ax2+bxy+cy2 is positive definite if and only if a > 0 and ac > b2
• if ac = b2, ax2+bxy+cy2 is positive semi-definite if a > 0
• if ac = b2, ax2+bxy+cy2 is negative semi-definite if a < 0
• If ac < b2, ax2+bxy+cy2 has a saddle point at (0, 0)
• Following is the matrix-based representation of a function in two variables using matrices.
• In general, function f (in n variables) is positive, if all pivots are positive in its reduced row
echelon form.
• Definiteness of a bi-variate function f represented as a 2 x 2 matrix A can be determined using
the following table, from its elements:
• Definiteness of a function f represented as a matrix A can be determined using the following
table, from its eigen values:
where fxx is the second order derivative of f and D is the determinant of the second order
derivatives at the given point (given below)
• Example:
o
o Step3: Subtract mean vector from the given data points.
•
• Rank of the matrix is less than or equal to n.
•
• Appending a 1 to the end of every data point does not change the results of performing
PCA, except that the useful principal component vectors have an extra 0 at the end and
there is one useless component with eigen value 0.
• If you perform a 90-degree clockwise rotation of the data points before performing PCA, the
largest eigenvalue does not change.
• If you perform a 90-degree clockwise rotation of the data points before performing PCA, the
variance along each eigen vector does not change.
Week7
• Pillars of machine learning are Linear algebra, Probability, Optimization
• The structure and relationship between data points is dealt within Linear Algebra.
• Modelling noise/uncertainty in data is done using Probability.
• Optimization is the mathematical tool that helps in converting data to decisions.
• The “best” of something often means the least loss or maximum reward, and found using
calculus.
• A typical optimization problem looks like this:
How close can a cow tied to point (20, 30) using a rope that measures 10 units get to the grass at
(40, 40), given that a fence is placed along the perpendicular at x = 25.
Given the problem, following equation mathematically represent the objective function that must
be minimized.
And, following are the two equations that mathematically represent the constraints.
• In general, constraints could include inequality and equality constraints and can be represented
pictorially as follows:
• One method to find the minimum/maximum parameters of an equation is to find the first
derivative and equate to 0. As an example, to minimize y = (x - 5)2
NOTE: However, not all models can be minimized like this, since this might require us to solve
equations with degrees higher than 2.
• One of the algorithms that could solve the above equation is as follows:
o Start with x = x0
o Find the derivative of the function at the current x.
o Add negative of the derivative to the current value of x, to get the new x.
NOTE: This implies that the function value at any x_hat, can be calculated if the local
information (at x) is known.
For small-enough η, ignore the higher-order terms (from the 3rd term onwards) and f at the new
x can be approximated as follows:
Since the LHS is negative quantity (due to decreasing function value), it’s required that
When dealing with higher dimensions of the input data, this can be rewritten as .
Pictorially, this constitutes the region to the left of the perpendicular to the gradient vector of
the function.
• In the case of the cow-grass example above, in order to move towards the grass from (x1, x2), say (5,
2), following calculations should be done.
Now, this vector gives the direction that takes the cow (presently at (5, 2)) towards the grass at (40,
40). In order to compute the vector, choose an appropriate η to scale this direction vector before
adding to (5, 2).
As mentioned before, this is not the only direction to move so that f reduces. It can take any of the
infinite directions to the left side of the perpendicular to the gradient vector of the function.
• Newton method is an alternative (to gradient descent) algorithm to calculate the minimum value of
a function in iterations. Following is the update rule in this case.
This method might seem to have more precision than the gradient descent method, but in the case
of higher dimensions, it requires computing the second order derivative of the function (Hessian
matrices) that can get tough to compute.
• Taylor’s series can be used to derive the linear approximation formula from week-2 as follows.
Assuming x = a + Δ, Taylor’s series can be written as
f(x) = f(a) + Δ.f`(a) + Δ2.f``(a) + …
Week8
• Given a problem that requires to minimize objective function f(x) such that g(x) satisfies the
inequality g(x) <= 0, point x* will solve it optimally, if and only if there is no further descent direction
that’s also feasible. In the above context, descent direction is dependent on f(x) and is given by
direction d such dT∇f(x*) < 0, and feasible direction is given by the same direction d such that
dT∇g(x*) <= 0. In other words, the problem is solved only if there is no direction d available such
that both dT∇f(x*) < 0 and dT∇g(x*) <= 0.
• The feasible descent directions are depicted below.
From the above picture, it’s clear that the intersection of green and yellow regions comprises of all
feasible descent directions.
• Now, x* is considered an optimal solution only when dT∇f(x*) and dT∇g(x*) are anti-parallel as
depicted below.
This is mathematically represented in an equation as , where λ is a positive
scalar value.
• If the inequality in the above problem is changed to an equality condition (g(x) = 0), x* is optimal
only when dT∇g(x*) = 0, where d denotes the direction (vector) of movement. This is pictorially
shown as follows.
Thus, all feasible directions lie on a line (with the blue arrow)
• Extending this argument, the best possible (optimal) solutions for the constrained (by equality)
optimization problem defined above occur when f and g move parallel or anti-parallel as shown in
the picture below.
Now, substituting this value of x2 into the equation representing constraint g, we have
Now, the solution to the problem is worked out from these points, as follows:
Thus, the point (0, 1) is called the maximizer and the points are called minimizers.
• In general, it may not be always possible to solve such constrained optimization problems using
Lagrange method, in which case it can be solved through the Projected Lagrange method.
• As far the objective function is a convex function, the gradient descent can be projected onto the
feasible set to solve the problem. This is pictorially represented as
• The corresponding algorithm is called Projected Gradient Descent and rewritten as follows:
Convexity
• A set of points containing x1 and x2 is called a convex set when λx1 + (1-λ) x2 is also in the set, where
λ is in [0, 1]
• A pictorial representation of this concept is as follows:
Note that the line joining the two points x1 and x2 will contain all points in the convex set.
• A hyperplane is convex, since a line joining any two points in the hyperplane produces a convex set.
• If two sets S1 and S2 are convex, S1∩S2 is also convex.
• Set defined as {x ∈ Rd: Ax = b} given A ∈ Rm*d and b ∈ Rd*1 is convex.
• Convex combinations is defined as follows:
• Set consisting of all such convex combinations is called a convex hull and lies inside the area
bounded by points x1, x2…xn. This is mathematically represented as follows:
• Alternatively, convex hull is defined to be the intersection of all sets containing the points x1, x2…xn.
• A convex hull is also a convex set.
• epigraph epi(f) is defined as [x, z], where z > f(x). Note that epigraph is Rd+1. An example in R1 is
shown below.
• Function is a convex function, when its domain and its epigraph is a convex set. If the function’s
domain is not a convex set, the function is not convex.
• Here’s another definition of convex function.
• In yet another definition, the function is convex if the value at any point on its domain is greater
than the linear approximation at that point. It’s stated mathematically as
NOTE: For this definition, it’s assumed that the function is differentiable.
• A differentiable function f: Rd->R is convex, if
o The Hessian matrix H is positive definite or positive semi-definite matrix, det(H) > 0
o Eigenvalues of the Hessian matrix H are non-negative, Eigenvalues(H) >= 0.
• If f is a convex function, all its local minima are also global minima. This implies that optimization
logic that minimizes to local minima also minimizes to global minima, in the case of convex
functions.
• f is a convex function, if the determinant of the Hessian matrix formed by fxx, fxy and fyy is positive. If
the determinant is negative, the function is not convex.
• In order to find the interval over which the function is convex, calculate its double-derivative f`` and
solve f`` > 0. If f`` < 0, the function is concave.
• Function f: Rd -> R, f(x) = xTAx is convex, if matrix A is positive definite or positive semi-definite.
• To find the maximum/minimum of an objective function f given one or more constraints (g and h),
start with the Lagrange multiplier method . Solve for variables,
satisfying the above equation and all the given constraints. Example follows.
linear equation h(x) = wTx (w∈ Rd), sum of squares error is represented as where n is
the number of items in the data set.
• Specific goal of a linear regression is to minimize the sum of squares error above. This can be
mathematically represented as
• Computation of the inverse is O(d3), where d is the number of dimensions and is highly expensive.
Hence, it’s advisable to use the iterative algorithm which computes gradient descent on each step.
• Note that the gradient calculation doesn’t involve an inverse computation and hence
much more efficient computationally, and can be further simplified by approximating it using a
technique called stochastic gradient.
• The stochastic gradient technique essentially samples a smaller subset of datapoints (uniformly, at
random) and computes gradient (over several iterations) on this subset, instead of using all
datapoints. When averaged over all iterations, the resultant w should be equal to w*
• Finding the optimum value of f that’s constrained by say, h(x) <= 0 is same as minimizing the
maximum of a Lagrangian L(x, λ). This is called primal problem and is represented as follows:
NOTE:
NOTE2: λ is called the lagrangian multiplier.
• The min-max (primal) problem can be converted to a max-min (dual) problem as follows:
.
Note that the min (inner function) problem is a concave function.
• Solution for the min-max problem is as follows.
• When the function value at the dual optimum is less than or equal to function value at the primal
optimum, it’s called weak duality.
• If f and h are convex functions, then we get strong duality. At this point, f(x*) = h(x*). In this case,
we can solve either primal problem or the dual problem to arrive at the optimal solution x* for f.
• Thus, for the objective function f and inequality constraint h, the (local) optimal solution (x*, λ*) is
given by the Karush-Kuhn-Tucker (KKT) conditions, which are enlisted as follows:
NOTE: If the functions f and h are convex, these conditions ensure optimal solution, that’re not just
local.
• If the list of constraints includes equality constraints (represented as l) in addition to the inequality
conditions (represented as h), the KKT conditions can be re-written as
NOTE: Vectors u and v represent the lagrangian multipliers for the inequality and the equality
constraints respectively.
NOTE2: If the (primal) inequality constraints use > or >=, the dual feasibility should use <=
• Dual of dual is primal
• If either the primal or dual problem has an infeasible solution, then the value of the objective
function of the other is unbounded.
• If either the primal or dual problem has a solution then the other also has a solution and their
optimum values are equal.
• If one of the variables in the primal has unrestricted sign, the corresponding constraint in the dual is
satisfied with equality.
Week11
• Experiment in a sample space is represented as (Ω, F, P), where Ω is the sample space, F is the set of
experiments and P is the probability
• Axioms of probability with continuous random variables
o P(A) >= 0
o P(Ω) = 1
o P (A1 U A2 …U An) = ∑P(Ai) for I = 1 to n
• In the case of continuous variables, domain and range of sample space Ω is uncountably finite (set
all real numbers)
• When the continuous random variable X takes an exact value x, the probability is 0 by definition.
• PDF and CDF of continuous random variable is defined as follows:
• When PDF is integrated, you get the CDF. Likewise, when CDF is differentiated, you get PDF.
• Expectation of a continuous random variable is given by
• Variance is given by
• Properties of variance are
• Standard deviation is the square root of the variance
• In the case of uniform distribution in the interval [a, b], expectation is (b - a)2 / 12
• Total expectation law is as follows:
• Joint distribution
• Cumulative distribution
• If X and Y are independent random variables, covariance is 0, implying that the correlation
coefficient is 0 and hence they are uncorrelated. However, the reverse is not true. Uncorrelated
random variables need not be independent.
• To get the joint distribution of derived random variables from original random variables, use
Week12
• Covariance matrix is always a square matrix. For a system with two variables, it’s a 2 x 2 matrix.
•
• Variances are in the position of diagonal elements in a covariance matrix.
• If Z = CY, where Y is a normal distribution and C is the coefficient matrix, then
• Cov[X1, X2] = Cov[X2, X1]
• Maximum log likelihood is given by
• In the case of multi-variate normal distribution, this method of estimation computes the mean
and variance as follows: