Lecture Notes

The document covers various concepts in supervised and unsupervised learning, including regression and classification techniques, as well as evaluation methods using training, validation, and test data. It also delves into mathematical foundations such as vector spaces, linear algebra, eigenvalues, and eigenvectors, along with their properties and applications in matrix diagonalization and projections. Additionally, it discusses the importance of continuity and differentiability in calculus, providing a comprehensive overview of these topics across multiple weeks.

Uploaded by

Sudhaarsun Baburaj

Week1

Supervised learning: Regression

The learning algorithm attempts to find the best parameters (w1, w2…wd and b) that yield the correct
value of f(x) for any given x vector. In other words, it chooses the model that gives the least loss among
many potential models. Choosing the potential models is typically a manual process.

The output of a regression model is continuous and can take any real value.

Supervised Learning: Classification

Evaluation shouldn’t be done on the training data; it should be done on test data. Similarly, the selection among the
potential models that are given to the learning algorithm should be done using validation data.
Unsupervised learning

Typically, unsupervised learning is a pre-processing step, and the interpretation of the output is
performed manually.

In the dimensionality reduction technique, the goal is to compress and simplify data. Instead of using all of
it, we keep only part of it.

In the density estimation technique, the model assigns a probability value to each valid input, such that the
probabilities sum to 1.

The training set is used to fit the model, the validation set is used for model selection and the test set is
used for computing the generalization error.
Week-2
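• Definition of an open ball B: the open ball of radius ε centered at x is B(x, ε) = {y ∈ Rd : D(x, y) < ε}.

Note that D(x, y) indicates the distance between the points x and y, and is represented
mathematically as D(x, y) = ||x - y|| = sqrt((x1 - y1)² + … + (xd - yd)²).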

• De-Morgan’s laws of sets: (A ∪ B)^c = A^c ∩ B^c and (A ∩ B)^c = A^c ∪ B^c, where ^c denotes the complement.

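• A sequence {xn} is said to converge to x*, if for every ε > 0 there exists N such that D(xn, x*) < ε for all n >= N.

Equivalently, for every ε > 0 there exists N such that xn ∈ B(x*, ε) for all n >= N.
The above two statements are equivalent, and the bottom one is a definition using the concept of
an open ball centered at x* and having radius ε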
• Vector space is a set of vectors wherein any linear combination of any two of its vectors (u and v) is
another vector in the same space: αu + βv, where α and β are two real numbers.
• Dot product of two vectors x and y is the sum of the products of their corresponding components: x.y = x1y1 + x2y2 + … + xdyd

• Norm of a vector x is defined as ||x|| = sqrt(x.x) = sqrt(x1² + x2² + … + xd²)

• Two vectors are said to be orthogonal (perpendicular), if their dot product is 0

• Graph of a function f defined on d-dimensional vectors x is a set of points in (d+1)-dimensional space and is
mathematically represented as {(x, f(x)) : x ∈ Rd}
• In order to plot two dimensional functions, use a range of values that the function can assume and
draw contours for each value. Here’s a contour plot for the function f(x) = x1 + x2

Univariate calculus
• Function f is said to be continuous at x* if and only if lim(x→x*) f(x) = f(x*)

Can also be written as

• Function is said to be continuous if it’s continuous at all points in the domain.

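• When is a function differentiable? A function f is differentiable at x* if the limit lim(x→x*) (f(x) - f(x*)) / (x - x*) exists.

Also written as lim(h→0) (f(x* + h) - f(x*)) / h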

• A discontinuous function is not differentiable.

• Slope of a function is given by its derivative at a given point.
• Linear approximation of a function f at x* is given by the equation of the tangent line at x*.

This is represented mathematically as f(x) ≈ f(x*) + f`(x*).(x - x*)

• Typically, linear approximations are taken around x*=0, so f(x) ≈ f(0) + f`(0).(x - 0) = f(0) + f`(0).x
• Quadratic approximation is given by the equation f(x) ≈ f(x*) + f`(x*).(x - x*) + ½.f``(x*).(x - x*)²

• Thus, to solve the following problem,

we compute the linear approximations of e^3x and 1/sqrt(1+x) separately (1 + 3x and 1 - x/2 around x = 0), multiply them and ignore
the quadratic term, which gives 1 + (5/2)x.
• The points where the derivative of a function is zero are called critical points.
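
The linear-approximation recipe above can be checked numerically. The following is a minimal sketch, assuming the function being approximated is the product e^3x . 1/sqrt(1+x) (the original problem statement is not reproduced in these notes); the helper names exact and linear_approx are illustrative only.

```python
import math

def exact(x):
    # Assumed target function: e^(3x) * 1/sqrt(1 + x)
    return math.exp(3 * x) / math.sqrt(1 + x)

def linear_approx(x):
    # (1 + 3x)(1 - x/2) = 1 + (5/2)x - (3/2)x^2; the quadratic term is dropped
    return 1 + 2.5 * x

# The gap between the two shrinks roughly like x^2 as x approaches 0
for x in (0.1, 0.01, 0.001):
    print(x, exact(x), linear_approx(x), abs(exact(x) - linear_approx(x)))
```

The printed gap should shrink by roughly a factor of 100 each time x shrinks by a factor of 10, which is what one expects when only the quadratic and higher terms are ignored.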

Multivariate calculus

• As an example, consider this:

Here, w is the vector all of whose components w1, w2 and w3 are 1. Applying the previous
mathematical definition of hyperplane normal to the vector w, we get x1 + x2 + x3 = 1
• Partial derivative of function f with respect to xi, evaluated at the point v, is defined as ∂f/∂xi (v) = lim(h→0) (f(v + h.ei) - f(v)) / h, where ei is the i-th standard basis vector.

• Gradient of a function evaluated at v is defined as a column vector containing its partial derivatives
with respect to each component of x.

• The points where the gradient of a function is a zero vector are called critical points.
• For a multi-variable function f (dimension d), linear approximation is given as f(x) ≈ f(v) + ∇f(v)T(x - v),

where x and v are vectors in Rd and are approximately equal to each other.
• Higher order approximation is given by

• The gradient of a function f (denoted as ∇f) is a collection of all its partial derivatives (on each
dimension) into a vector.
• A worked-out example of linear approximation of a 2-dimension function
• Gradient is perpendicular to the function’s contour line at a specific point.
• The most important thing to remember about the gradient: The gradient of f, if evaluated at an
input (x0, y0), points in the direction of the steepest ascent. When the function f accepts more than
two inputs, the interpretation of a gradient is similar.

• For a function f that varies with x, y and z, the directional derivative along the unit vector v is ∇f.v

More details on this follows:

If v is denoted by the vector (v1, v2, v3), the directional derivative is v1.∂f/∂x + v2.∂f/∂y + v3.∂f/∂z

NOTE: If the vector v is not a unit vector, normalize it before applying the above method, or divide
the above expression by the magnitude of vector v.
• In order to find the maximal directional derivative, find the unit vector along the gradient. Then
apply the same formula.
For example, to find the maximal derivative of f(x, y) = x²y at (3, 2): ∇f = (2xy, x²) = (12, 9), the unit vector along the gradient is (12, 9)/15 = (4/5, 3/5), and the maximal directional derivative is 12.(4/5) + 9.(3/5) = 15.

• Maximum derivative can also be computed by finding the magnitude of the gradient vector.
• Cauchy-Schwarz inequality states that given 2 d-dimensional vectors a and b, |a.b| <= ||a||.||b||

Note that the equality holds when vector a is a scalar multiple of vector b.
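
The maximal directional derivative for f(x, y) = x²y at (3, 2) can be verified with a short NumPy sketch. The function names f and grad_f are illustrative, and the gradient is coded by hand from the partial derivatives 2xy and x².

```python
import numpy as np

def f(p):
    x, y = p
    return x**2 * y

def grad_f(p):
    # Hand-derived partial derivatives of x^2 * y
    x, y = p
    return np.array([2 * x * y, x**2])

point = np.array([3.0, 2.0])
g = grad_f(point)                     # gradient at (3, 2)
unit = g / np.linalg.norm(g)          # unit vector along the gradient
max_dir_derivative = g @ unit         # directional derivative along that unit vector

# Finite-difference check of the directional derivative along `unit`
h = 1e-6
numeric = (f(point + h * unit) - f(point)) / h

print(g, unit, max_dir_derivative, np.linalg.norm(g), numeric)
```

The directional derivative along the gradient and the magnitude of the gradient agree, which is the equality case of the Cauchy-Schwarz inequality.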

Week-3
Linear Algebra
• C(A) = Vector space spanned by columns of given matrix. Its dimension is called the rank of the matrix.
• R(A) = C(AT) = Vector space spanned by rows of given matrix (or by columns of the transpose of
the matrix)
• N(A) = solution set of the homogeneous linear system Ax = 0 derived from the given matrix.
• Left null space = solution set of the homogeneous linear system ATy = 0 derived from the
transpose of the given matrix.
• Null space of a matrix is a vector space, which implies that all linear combinations of its basis vectors
also belong to the null space.
• If the matrix is invertible, the column vectors are linearly independent. In this case, the null
space only has ‘zero’ vector, and column space is the whole space.

• Rank + Nullity = number of columns
• Rank + Left nullity = number of rows
• Column rank = Row rank = Rank of the matrix.
• A set consisting of mutually orthogonal non-zero vectors is a linearly independent set.
• Two subspaces are orthogonal to each other, when each element in first subspace is orthogonal
to each element in the second subspace. For example, row subspace of a given matrix is
orthogonal to its null subspace. Similarly, column subspace of a given matrix is orthogonal to
its left null subspace (the null subspace of its transpose).
• If {v1, v2…vk} is a set of mutually orthogonal non-zero vectors, it’s a linearly independent set.
• Inverse of the transpose of matrix A = Transpose of the inverse of matrix A
(https://www.youtube.com/watch?v=MsIvs_6vC38)

Projections
• When vector b is not in column space of A, we typically project b onto the column space of A.
• Projecting a vector b onto vector a: the projection is p = (aTb / aTa).a = Pb, where P = aaT / (aTa) is the projection matrix.

For example,

Following are a few observations, while projecting.

The second observation (P² = P) follows from the fact that if you project twice, it’s not going to alter the
projection, and hence the projection matrix multiplied by itself gives the same matrix.

Third observation is that column space of the projection matrix is a line passing through the
vector a. Null space of the projection matrix is a plane orthogonal to vector a.

Further, the projection matrix will not change if vector a is replaced by a scalar multiple of itself. Thus, the projection matrix for a and for ca (c ≠ 0) is the same.
Least square error

• How to find the least square error?

The following diagram shows vector b projected onto the column space of A. Projection is
denoted by p = Ax̂ (x̂ is read as x-hat). e denotes the error and can be represented as vector (b – p)

Calculations lead to the following equation, from which we can compute the projection: ATAx̂ = ATb

• To minimize the least square error ||Ax – b||² is to find the projection of vector b onto the column
space of matrix A.

Case 1: When columns of A are linearly independent, ATA is invertible, so x̂ = (ATA)^-1 ATb and the projection matrix is P = A(ATA)^-1 AT.

Case 2: If vector b belongs to the column space of A, the projection is same as vector b itself (Pb = b).

Case 3: If vector b is in the null space of transpose of A, projection is 0.

Case 4: If matrix A is invertible, projection matrix is identity matrix and the projection is vector b itself.

Case 5: If matrix is of rank 1, projection is the same as that on a line.

• Projection matrix P is symmetric (PT = P), and satisfies P² = P. Converse is also true.

• Example of projection and least squares method applied.
This system of linear equations is not solvable, since it’s inconsistent (can be proved by row-
reducing).

Let’s apply the equation above

Solving the above set of linear equations, we get

Thus,

Using the above equation for the line on each dimension of the input vector ([-1,1,2]), we get

Given b = [1,1,3]
Rough sketch of these points on the X-Y plane looks like this. From this sketch, it’s clear that e is
orthogonal to both input vectors.
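
The least squares computation can be sketched in NumPy. This is a hedged reconstruction: it assumes the example fits a line y = c + d.x through the points with x-values [-1, 1, 2] and targets b = [1, 1, 3], so that A has a column of ones and a column of x-values. The original worked equations are not reproduced in these notes, so the parametrization is an assumption.

```python
import numpy as np

x = np.array([-1.0, 1.0, 2.0])      # input values from the notes
b = np.array([1.0, 1.0, 3.0])       # target vector from the notes

# Assumed model: y = c + d * x, so A = [1, x]
A = np.column_stack([np.ones_like(x), x])

# Normal equations: (A^T A) x_hat = A^T b
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

p = A @ x_hat                        # projection of b onto the column space of A
e = b - p                            # error vector

# Projection matrix for linearly independent columns: P = A (A^T A)^-1 A^T
P = A @ np.linalg.inv(A.T @ A) @ A.T

print("x_hat:", x_hat)
print("p:", p)
print("e:", e)
print("A^T e:", A.T @ e)             # numerically zero: e is orthogonal to both columns
```

Under this assumption the error vector is orthogonal to both columns of A, and P is symmetric with P² = P, matching the observations about projection matrices.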

NOTE: Least square error is well covered in these Khan Academy videos:

Least squares approximation | Linear Algebra | Khan Academy - YouTube

Least squares examples | Alternate coordinate systems (bases) | Linear Algebra | Khan
Academy - YouTube

Week4
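• For Aθ = Y, loss function is represented by the following equation: L(θ) = ||Aθ – Y||²

and the least squares solution is given by θ = (ATA)^-1 ATY

We can solve ATAθ = ATY for θ, if A is a full (column) rank matrix.

Minimizing the loss is equivalent to maximizing the likelihood function, when the noise in Y is assumed to be Gaussian.

In the case of polynomial regression, solution can be obtained by performing linear regression on
Aθ = Y, where A is a matrix with transformed features (for example, the powers 1, x, x², … of the input).

Solution also can be represented in a regularized form as: θ = (ATA + λI)^-1 ATY, where λ > 0 is the regularization strength.

NOTE: This method is called ridge regression.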
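
A minimal sketch of the closed-form least squares and ridge solutions, on synthetic data. The data, the true parameter vector and the helper name ridge_fit are all hypothetical, chosen only for illustration.

```python
import numpy as np

def ridge_fit(A, Y, lam):
    """Solve (A^T A + lam * I) theta = A^T Y; lam = 0 gives ordinary least squares."""
    d = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ Y)

# Synthetic data (illustrative only)
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 3))
theta_true = np.array([1.0, -2.0, 0.5])
Y = A @ theta_true + 0.1 * rng.normal(size=50)

theta_ols = ridge_fit(A, Y, lam=0.0)
theta_ridge = ridge_fit(A, Y, lam=10.0)

print("least squares:", theta_ols)
print("ridge        :", theta_ridge)
```

The ridge solution has a smaller norm than the unregularized one: the λI term shrinks the parameters towards zero.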

Eigen values/Eigen-vectors
• For a matrix A, eigen-value λ can be obtained by equating determinant of (A–λI) to 0, and the
corresponding eigenvector lies in the null space of (A–λI).
• If an eigenvalue of a matrix A is zero, then its corresponding eigen-vector belongs to the null
space of A.
• For a real symmetric matrix, all its eigen-values are real, and the eigen-vectors can be chosen to be real.
• If an n x n matrix A has n distinct eigen values, it’ll have n linearly independent eigen vectors.
• If A = aI + B where A and B are matrices and I is the identity matrix, both A and B will have same
eigen-vectors.
• Eigen-value corresponding to every non-zero vector in the column-space of projection matrix is
1.
• Eigen-value corresponding to every non-zero vector orthogonal to the column-space of
projection matrix is 0.
• All eigen-values of a permutation matrix have absolute value 1; its real eigen-values can only be 1 or -1.
• If A has r non-zero eigenvalues, then rank of A is at least r.
• If x is an eigenvector of A corresponding to eigenvalue λ1 and x is also an eigenvector
of B corresponding to eigenvalue λ2, then x is also an eigenvector of (A+B), with eigenvalue λ1 + λ2.
• Some important properties of eigen-values and eigen-vectors:
o Sum of the eigen-values of a matrix is equal to the sum of its diagonal elements (trace).
o The product of the eigenvalues of a matrix equals the determinant of the matrix.
o Matrix is singular (zero determinant) if and only if it has 0 eigen-value.
o The eigenvalues of an upper (or lower) triangular matrix are the elements on the main
diagonal.
o If λ is an eigenvalue of A, then λ is an eigenvalue of AT.
o If λ is an eigenvalue of A, then λ^k is an eigenvalue of A^k, for any positive integer k.
o If λ is an eigenvalue of A and if A is invertible, then 1/λ is an eigenvalue of A^-1 (since
Ax = λx implies A^-1x = (1/λ)x)
o If λ is an eigenvalue of A, then αλ is an eigenvalue of αA, where α is any arbitrary scalar.
o If x is an eigenvector of A corresponding to the eigenvalue λ, then x is an eigenvector of
αA corresponding to eigenvalue αλ.
o If x is an eigenvector of A corresponding to the eigenvalue λ, then x is an eigenvector of
A^k corresponding to the eigenvalue λ^k, for any positive integer k
• For more detailed discussion refer to
https://www.adelaide.edu.au/mathslearning/system/files/media/documents/2020-03/evalue-magic-tricks-handout.pdf.
• For details on eigen-values and eigen-vectors refer to
https://www.khanacademy.org/math/linear-algebra/alternate-bases/eigen-everything/v/linear-algebra-introduction-to-eigenvalues-and-eigenvectors
or
https://ocw.mit.edu/courses/mathematics/18-06-linear-algebra-spring-2010/video-lectures/lecture-21-eigenvalues-and-eigenvectors/
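
Several of the eigen-value properties listed above can be checked numerically. The 2 x 2 matrix below is a hypothetical example chosen so that the eigen-values come out as 2 and 5.

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])           # hypothetical example matrix

eigvals, eigvecs = np.linalg.eig(A)

# Sum of eigen-values equals the trace; product equals the determinant
print(eigvals.sum(), np.trace(A))
print(eigvals.prod(), np.linalg.det(A))

# Each column of eigvecs satisfies A x = lambda x
for lam, x in zip(eigvals, eigvecs.T):
    print(np.allclose(A @ x, lam * x))

# Eigen-values of A^3 are lambda^3, and those of A^-1 are 1/lambda
eig_cubed = np.linalg.eigvals(np.linalg.matrix_power(A, 3))
eig_inverse = np.linalg.eigvals(np.linalg.inv(A))
print(np.sort(eig_cubed), np.sort(eigvals**3))
print(np.sort(eig_inverse), np.sort(1 / eigvals))
```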

Matrix Diagonalizability

• A matrix A is diagonalizable if there exists an invertible matrix S such that S^-1AS = Λ, where Λ
represents a diagonal matrix with eigen-values along its diagonal. Each column of matrix S is an
eigen vector of the matrix A.
• Matrix A is diagonalizable only when there are enough (n linearly independent) eigen vectors. If the matrix has repeated
eigen-values, then it may not be diagonalizable.
• Adding a scalar multiple of identity matrix (cI) to a matrix shifts its eigen-values by c but leaves its eigen vectors
unchanged, so it does not change whether the matrix is diagonalizable.
• S can be used to diagonalize square of the matrix A. Proof: A² = (SΛS^-1)(SΛS^-1) = SΛ²S^-1

• As an extension to the above argument, following also holds: A^k = SΛ^kS^-1, for any positive integer k.
• S is not unique, since S could contain scaled up eigen vectors as its columns.
• Not all matrices are diagonalizable.
• As per linear algebra, a good approximation of the kth Fibonacci number is (1/√5).((1 + √5)/2)^k

• Any real symmetric matrix A satisfies the following properties:

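• Orthogonal matrices satisfy the property Q^-1 = QT. Since QQ^-1 = I for all invertible matrices, in the case of
orthogonal matrices, QQT = I
• Orthogonal diagonalizability can be written as: A = QΛQT, where Q is an orthogonal matrix and Λ is a diagonal matrix of eigen-values.

This is called spectral theorem.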

• In order to find Q, find the eigen-vectors and divide them by their corresponding lengths.
Following is an example that demonstrates this on a 2*2 matrix.
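
The orthogonal diagonalization of a real symmetric matrix can be sketched with NumPy. The 2 x 2 matrix below is a hypothetical example, not necessarily the one from the lecture; np.linalg.eigh already returns eigen-vectors normalized to unit length.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])           # hypothetical real symmetric matrix

# eigh is meant for symmetric/Hermitian matrices; eigen-values come back in ascending order
lam, Q = np.linalg.eigh(A)
Lambda = np.diag(lam)

print("eigen-values:", lam)
print("Q^T Q = I   :", np.allclose(Q.T @ Q, np.eye(2)))
print("A = Q Λ Q^T :", np.allclose(Q @ Lambda @ Q.T, A))

# The same Q diagonalizes powers of A: A^2 = Q Λ^2 Q^T
print("A^2 check   :", np.allclose(Q @ Lambda**2 @ Q.T, A @ A))
```

Since Λ is diagonal, squaring it element-wise (Lambda**2) is the same as the matrix product ΛΛ.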

Week5
• If U and V are two symmetric matrices, UV is not necessarily symmetric, but U + V is symmetric.
• Complex conjugate of a + ib is a – ib. In polar form, the former is given by re^iθ and the
latter is given by re^-iθ; both have magnitude r.
• In the case of complex vector space, the length of a vector is defined using its complex conjugate. Thus,
length of the vector (1, i) is calculated from its complex conjugate (1, -i): the squared length is 1.1 + (-i).(i) = 2, so the length equals √2.

• Given x and y are two vectors in the complex vector space, x.y is the complex conjugate of y.x. Thus, x.y is not equal
to y.x in general. Recollect, if the vectors belong to the real space, dot product is commutative.
• Following equations are true when x and y belong to complex vector space.

• Other properties of the complex vector space are

o (x + y).z = x.z + y.z
o cx.cy = |c|²(x.y)
• In the case of a complex vector space, conjugate transpose A* is defined as the transpose of A with every entry replaced by its complex conjugate.

Also,

• In the case of a real vector space, conjugate transpose equivalent is the ordinary transpose: A* = AT

Also,

• Inner product in a complex space is defined as

• Matrix A is called Hermitian matrix, if A* = A. In other words, Hermitian matrix is the equivalent
of real symmetric matrices in complex space.
• Properties of Hermitian matrices
o Diagonal entries of a Hermitian matrix are real numbers.
o All eigen values of a Hermitian matrix are real numbers
o Eigen vectors corresponding to distinct eigen-values are orthogonal to
each other. Hence, the matrix is unitarily diagonalizable.
o Matrix that diagonalizes a Hermitian matrix is called a Unitary matrix. Thus, A = UΛU*

In the above case, A is a Hermitian and U is a Unitary matrix.

o A = UΛU* and Λ = U*AU are equivalent expressions, where A is the Hermitian matrix, U
and U* are unitary matrices and Λ is the diagonal matrix.
• Given upper triangular matrix A, AA* and A*A are Hermitian matrices.
• Every real diagonal matrix is Hermitian. However, a diagonal matrix with non-real entries is not
Hermitian, since the diagonal elements of Hermitian matrices cannot be complex.

• Unitary matrices are square matrices with orthonormal columns. For such matrices, U*U = I

or U^-1 = U*
• If U is unitary matrix, then U* is also unitary matrix.
• All Hermitian matrices are unitarily diagonalizable, but not all unitarily diagonalizable matrices are
Hermitian. Watch https://www.youtube.com/watch?v=VYS9EYZ3gCo for an example
diagonalization of a Hermitian matrix.
• Properties of Unitary matrices
o ||Ux|| = ||x|| (Length unchanged)
o Eigen-values of unitary matrices have an absolute value equal to 1, although not
necessarily real.
o Eigen-vectors corresponding to distinct eigen-values of a unitary matrix are orthogonal.
• Any square n * n matrix A, even one that cannot be diagonalized, can be upper-triangularized.
There exists a unitary matrix U and an upper triangular matrix T such that A = UTU*
• To upper-triangularize a matrix, follow these steps.

o Find the characteristic polynomial of the matrix
o Find 1st eigen vector. Extend with (n-1) basis, where n is the number of columns
o Use Gram-schmidt process and arrive at its orthonormal basis. Call this U1.
o Extract the square matrix at the right-bottom of U1.
o Repeat all of the above steps on the new matrix, until you get U2.

o Now, U*AU, where U is the unitary matrix built from U1 and U2, will result in an upper triangular matrix.

• Given an arbitrary basis {u1, u2, …un} for an n-dimensional inner product space V, Gram-Schmidt
algorithm constructs an orthogonal basis {v1, v2, …vn} for V: v1 = u1, and for k = 2…n, vk = uk – Σ(j<k) (<uk, vj> / <vj, vj>).vj

https://www.khanacademy.org/math/linear-algebra/alternate-bases/orthonormal-basis/v/linear-algebra-the-gram-schmidt-process
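
The Gram-Schmidt process can be sketched for real vectors as follows. The function name gram_schmidt and the example basis are illustrative, and the sketch uses the ordinary real dot product as the inner product.

```python
import numpy as np

def gram_schmidt(U):
    """Columns of U are the basis vectors u1..un; returns orthogonal columns v1..vn."""
    V = np.zeros_like(U, dtype=float)
    for k in range(U.shape[1]):
        v = U[:, k].astype(float)
        # Subtract the projection of u_k onto each previously computed v_j
        for j in range(k):
            v = v - (U[:, k] @ V[:, j]) / (V[:, j] @ V[:, j]) * V[:, j]
        V[:, k] = v
    return V

# Hypothetical basis of R^3 (columns are u1, u2, u3)
U = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])

V = gram_schmidt(U)
Q = V / np.linalg.norm(V, axis=0)    # normalize columns to get an orthonormal basis

print(V)
print(np.round(Q.T @ Q, 10))         # identity matrix for an orthonormal basis
```

Dividing each column by its length at the end turns the orthogonal basis into an orthonormal one, which is the form needed for the columns of a unitary (or orthogonal) matrix.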

Singular Value Decomposition

• For a real matrix A, AAT always results in a real symmetric matrix.

• In the case of diagonalizable matrices, we already know how to decompose them using spectral
theorem.
• If the matrix is not diagonalizable, we’ll still be able to decompose it, as follows: A = Q1∑Q2T
NOTE: This works for any real m * n matrix. In this case, Q1 is m * m and Q2 is n * n.

Here Q1 is made up of normalized eigen vectors of AAT. Q2 is made up of normalized eigen vectors
of ATA. Note that both AAT and ATA are real symmetric square matrices. Q1 and Q2 are also orthogonal matrices.
∑ contains the square roots of the eigen values of ATA (the singular values) along its diagonal. The diagonal entries of the matrix ∑ are normally in a
sorted (descending) order.

• Q2T rotates the input, ∑ stretches it (∑ gives the semi-major and semi-minor axis length of a
stretched unit circle) and Q1 further rotates it. Effectively, this will produce the same
transformation as matrix A will cause.
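
A short NumPy sketch of the decomposition, on a hypothetical 2 * 3 matrix. np.linalg.svd returns Q1, the singular values as a 1-D array, and Q2T directly, so ∑ has to be rebuilt as an m * n matrix.

```python
import numpy as np

A = np.array([[3.0, 2.0, 2.0],
              [2.0, 3.0, -2.0]])     # hypothetical 2 x 3 matrix

# With the default full_matrices=True, Q1 is 2 x 2 and Q2T is 3 x 3
Q1, s, Q2T = np.linalg.svd(A)

# Rebuild the m x n matrix Sigma with the singular values on its diagonal
Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)

print("singular values:", s)
print("A = Q1 Σ Q2^T  :", np.allclose(Q1 @ Sigma @ Q2T, A))

# Singular values are the square roots of the eigen-values of A A^T
eig_AAT = np.linalg.eigvalsh(A @ A.T)[::-1]   # descending order
print(np.sqrt(eig_AAT))
```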

Week6
• Every quadratic function of the form ax² + 2bxy + cy² has a stationary point at (0, 0). First derivatives
of the function become zero at this point.
• A function f that vanishes at (0, 0) and is strictly positive at all other points is called positive
definite, and is denoted as f > 0
• ax² + 2bxy + cy² is positive definite if and only if a > 0 and ac > b²
• if ac = b², ax² + 2bxy + cy² is positive semi-definite if a > 0
• if ac = b², ax² + 2bxy + cy² is negative semi-definite if a < 0
• If ac < b², ax² + 2bxy + cy² has a saddle point at (0, 0)
• Following is the matrix-based representation of a function in two variables: f(x, y) = [x y] A [x y]T, where A = [[a, b], [b, c]]

NOTE: A is real-symmetric matrix.

• In general, function in n variables can be represented as f(x) = xTAx

• In general, function f (in n variables) is positive definite, if all pivots of A are positive in its
row echelon form (obtained by elimination without scaling rows).
• Definiteness of a bi-variate function f represented as a 2 x 2 matrix A can be determined using
the following table, from its elements:
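• Definiteness of a function f represented as a matrix A can be determined from its eigen values as follows:
o All eigen values > 0: positive definite
o All eigen values >= 0: positive semi-definite
o All eigen values < 0: negative definite
o All eigen values <= 0: negative semi-definite
o Both positive and negative eigen values: indefinite (saddle point)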

NOTE: Matrix A represents the function f in matrix form.

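• For any function f with a critical point at (p, q), following are true:
o If D > 0 and fxx > 0, f has a local minimum at (p, q)
o If D > 0 and fxx < 0, f has a local maximum at (p, q)
o If D < 0, f has a saddle point at (p, q)
o If D = 0, the test is inconclusive

where fxx is the second order derivative of f with respect to x and D is the determinant of the matrix of second order
derivatives at the given point: D = fxx.fyy – (fxy)²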

• Example:
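
As a hedged illustration (not the example from the lecture), the definiteness of a symmetric matrix can be classified from the signs of its eigen values. The function name classify and the tolerance are hypothetical choices; the tolerance only guards against floating-point round-off around zero.

```python
import numpy as np

def classify(A, tol=1e-12):
    """Classify a real symmetric matrix by the signs of its eigen values."""
    eig = np.linalg.eigvalsh(A)
    if np.all(eig > tol):
        return "positive definite"
    if np.all(eig >= -tol):
        return "positive semi-definite"
    if np.all(eig < -tol):
        return "negative definite"
    if np.all(eig <= tol):
        return "negative semi-definite"
    return "indefinite"

# Each matrix [[a, b], [b, c]] represents the quadratic form ax² + 2bxy + cy²
print(classify(np.array([[2.0, 1.0], [1.0, 2.0]])))     # a > 0, ac > b²
print(classify(np.array([[1.0, 1.0], [1.0, 1.0]])))     # a > 0, ac = b²
print(classify(np.array([[-1.0, 1.0], [1.0, -1.0]])))   # a < 0, ac = b²
print(classify(np.array([[1.0, 2.0], [2.0, 1.0]])))     # ac < b²: saddle point
```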

Principal Component Analysis

• PCA aims to project given data onto a lower-dimensional subspace, such that reconstruction
error is minimized and variance of the projected data is maximized.
• Following are the steps involved:
o Step1: Data matrix X has n data points x1…xn, each with m features (dimensions)
o Step2: Find the mean vector of data points: x̄ = (1/n).(x1 + x2 + … + xn)

o Step3: Subtract mean vector from the given data points.

o Step4: Find the covariance matrix C (A symmetric m x m matrix): C = (1/n).Σi (xi – x̄)(xi – x̄)T

o Step5: Find the eigen-values and eigen-vectors uj of C.

o Step6: Choose the eigen-vectors corresponding to the k largest eigen-values and derive the
transformed data points.
o Step7: Calculate the reconstruction error

and projected variance

• Rank of the covariance matrix is less than or equal to n.
• Appending a 1 to the end of every data point does not change the results of performing
PCA, except that the useful principal component vectors have an extra 0 at the end and
there is one useless component with eigen value 0.
• If you perform a 90-degree clockwise rotation of the data points before performing PCA, the
largest eigenvalue does not change.
• If you perform a 90-degree clockwise rotation of the data points before performing PCA, the
variance along each eigen vector does not change.
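
The PCA steps can be sketched in NumPy on synthetic data. The assumptions here are that data points are stored as rows of X, the covariance matrix uses a 1/n factor, and the reconstruction error is measured as the mean squared distance between each point and its reconstruction; the lecture's exact formulas for Step7 are not reproduced in these notes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 200, 3, 2                                        # points, features, kept components
X = rng.normal(size=(n, m)) * np.array([3.0, 1.0, 0.1])    # synthetic data, rows are points

mean = X.mean(axis=0)                  # Step2: mean vector
Xc = X - mean                          # Step3: centre the data
C = Xc.T @ Xc / n                      # Step4: m x m covariance matrix (1/n convention)

eigvals, eigvecs = np.linalg.eigh(C)   # Step5: eigen-values in ascending order
order = np.argsort(eigvals)[::-1]      # re-sort in descending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

U_k = eigvecs[:, :k]                   # Step6: top-k eigen-vectors
Z = Xc @ U_k                           # transformed (projected) data points

X_rec = Z @ U_k.T + mean               # Step7: reconstruct and measure the error
recon_error = np.mean(np.sum((X - X_rec) ** 2, axis=1))
proj_variance = np.sum(np.var(Z, axis=0))

print("reconstruction error:", recon_error, "discarded eigen-values:", eigvals[k:].sum())
print("projected variance  :", proj_variance, "kept eigen-values:", eigvals[:k].sum())
```

With these conventions the reconstruction error equals the sum of the discarded eigen-values and the projected variance equals the sum of the kept ones, which is why minimizing one is the same as maximizing the other.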

Week7
• Pillars of machine learning are Linear algebra, Probability, Optimization
• The structure and relationship between data points is dealt with in Linear Algebra.
• Modelling noise/uncertainty in data is done using Probability.
• Optimization is the mathematical tool that helps in converting data to decisions.
• The “best” of something often means the least loss or maximum reward, and is found using
calculus.
• A typical optimization problem looks like this:

How close can a cow tied to point (20, 30) using a rope that measures 10 units get to the grass at
(40, 40), given that a fence is placed along the vertical line x = 25?

Given the problem, following equation mathematically represents the objective function that must
be minimized: f(x1, x2) = (x1 – 40)² + (x2 – 40)²

And, following are the two inequalities that mathematically represent the constraints: (x1 – 20)² + (x2 – 30)² <= 100 and x1 <= 25
• In general, constraints could include inequality and equality constraints and can be represented
pictorially as follows:

• One method to find the minimum/maximum parameters of an equation is to find the first
derivative and equate to 0. As an example, to minimize y = (x - 5)², solve dy/dx = 2(x - 5) = 0, which gives x = 5.

NOTE: However, not all models can be minimized like this, since this might require us to solve
equations with degrees higher than 2.
• One of the algorithms that could solve the above equation is as follows:
o Start with x = x0
o Find the derivative of the function at the current x.
o Add negative of the derivative to the current value of x, to get the new x: xt+1 = xt – f`(xt)

o Repeat above two steps multiple times.

• Unfortunately, for y = (x - 5)² the above algorithm oscillates between xt and xt+1. This can be solved by
introducing a scalar factor (step-size η) for the direction d = –f`(xt).
• Step size of η = (½)^t fails to work if x0 (starting value) is very far away from the actual value (in
fact, the sum of all the step sizes can never cross a value of 2). Hence, it’s a better idea to work with η = 1/(t + 1), whose sum grows without bound, so the iterates will
eventually converge to the actual value.
• Considering the above aspects, an acceptable algorithm is as follows: start with x = x0 and repeat xt+1 = xt – ηt.f`(xt) with ηt = 1/(t + 1)
• The afore mentioned Gradient Descent algorithm converges to a local minimum of the given
function. Even though no general purpose algo exists that’ll converge to a global minimum,
converging to a local minimum serves well in most machine learning exercises. For functions
called convex functions, every local minimum is also a global minimum.
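
A minimal sketch of both behaviours on y = (x - 5)²: the plain update without a step size oscillates, while the update with step size 1/(t + 1) settles at the minimum. The starting value 100 and the function names are illustrative choices.

```python
def f_prime(x):
    # Derivative of y = (x - 5)^2
    return 2 * (x - 5)

# Update without a step size: x <- x - f'(x) bounces between two values
x = 100.0
history = []
for _ in range(4):
    x = x - f_prime(x)
    history.append(x)
print("no step size:", history)

def gradient_descent(f_prime, x0, num_steps=100):
    """Gradient descent with the decaying step size eta_t = 1 / (t + 1)."""
    x = x0
    for t in range(num_steps):
        eta = 1.0 / (t + 1)
        x = x - eta * f_prime(x)
    return x

x_min = gradient_descent(f_prime, x0=100.0)
print("with step size 1/(t+1):", x_min)
```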
• Taylor’s series can be used to understand why derivative appears in the update rule: f(x + ηd) = f(x) + ηd.f`(x) + (η²d²/2).f``(x) + …

NOTE: This implies that the function value at any x_hat = x + ηd, can be calculated if the local
information (at x) is known.

For small-enough η, ignore the higher-order terms (from the 3rd term onwards) and f at the new
x can be approximated as follows: f(x + ηd) – f(x) ≈ ηd.f`(x)

Since the LHS is negative quantity (due to decreasing function value), it’s required that ηd.f`(x) < 0

Since η is a positive quantity (step-size), d.f`(x) < 0. If we choose d to be -f`(x), then it
satisfies this condition, since negative of the square of derivative of the function is always less
than 0 (wherever the derivative is non-zero). Thus proved.

When dealing with higher dimensions of the input data, this can be rewritten as dT∇f(x) < 0.
Pictorially, this constitutes the region to the left of the perpendicular to the gradient vector of
the function.
• In the case of the cow-grass example above, in order to move towards the grass from (x1, x2), say (5,
2), following calculations should be done.

Now, this vector gives the direction that takes the cow (presently at (5, 2)) towards the grass at (40,
40). In order to compute the new position, choose an appropriate η to scale this direction vector before
adding to (5, 2).

Generalizing this algo for a d-dimensional space, we get xt+1 = xt – ηt.∇f(xt)

As mentioned before, this is not the only direction to move so that f reduces. It can take any of the
infinite directions to the left side of the perpendicular to the gradient vector of the function.

• Newton method is an alternative (to gradient descent) algorithm to calculate the minimum value of
a function in iterations. Following is the update rule in this case: xt+1 = xt – f`(xt) / f``(xt)
This method might seem to have more precision than the gradient descent method, but in the case
of higher dimensions, it requires computing the second order derivatives of the function (Hessian
matrix), which can get tough to compute.
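
A one-dimensional sketch of Newton's update, assuming the hypothetical function f(x) = x⁴/4 - x, whose derivative x³ - 1 vanishes at x = 1. The function name newton_minimize and the starting point are illustrative.

```python
def newton_minimize(f_prime, f_double_prime, x0, num_steps=20):
    """Newton's method for minimization: x <- x - f'(x) / f''(x)."""
    x = x0
    for _ in range(num_steps):
        x = x - f_prime(x) / f_double_prime(x)
    return x

# Hypothetical example: f(x) = x^4 / 4 - x, so f'(x) = x^3 - 1 and f''(x) = 3x^2
x_min = newton_minimize(lambda x: x**3 - 1, lambda x: 3 * x**2, x0=2.0)
print(x_min)
```

Unlike gradient descent, no step size has to be chosen here; the second derivative scales the step automatically, at the cost of having to compute it.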
• Taylor’s series can be used to derive the linear approximation formula from week-2 as follows.
Assuming x = a + Δ, Taylor’s series can be written as
f(x) = f(a) + Δ.f`(a) + (Δ²/2!).f``(a) + …

The above can be rewritten as

f(x) = f(a) + (x – a).f`(a) + ((x - a)²/2!).f``(a) + …
Ignoring the terms from the 3rd term onwards, this equation is the linear approximation formula learnt during week-2.

Week8
• Given a problem that requires minimizing objective function f(x) such that g(x) satisfies the
inequality g(x) <= 0, point x* will solve it optimally, if and only if there is no further descent direction
that’s also feasible. In the above context, descent direction is dependent on f(x) and is given by
direction d such that dT∇f(x*) < 0, and feasible direction is given by the same direction d such that
dT∇g(x*) <= 0. In other words, the problem is solved only if there is no direction d available such
that both dT∇f(x*) < 0 and dT∇g(x*) <= 0.
• The feasible descent directions are depicted below.

From the above picture, it’s clear that the intersection of green and yellow regions comprises all
feasible descent directions.
• Now, x* is considered an optimal solution only when ∇f(x*) and ∇g(x*) are anti-parallel as
depicted below.
This is mathematically represented in an equation as ∇f(x*) = –λ∇g(x*), where λ is a positive
scalar value.
• If the inequality in the above problem is changed to an equality condition (g(x) = 0), a direction d is feasible
only when dT∇g(x*) = 0, where d denotes the direction (vector) of movement. This is pictorially
shown as follows.

Thus, all feasible directions lie on a line (with the blue arrow)
• Extending this argument, the best possible (optimal) solutions for the constrained (by equality)
optimization problem defined above occur when the gradients of f and g are parallel or anti-parallel as shown in
the picture below.

and mathematically represented as ∇f(x*) = λ∇g(x*), where λ (called Lagrange
multiplier) is any non-zero scalar value.
• For example, consider a problem defined by the following bi-variate objective function f and
constraint g.
The gradient of these functions are as follows:

Solution per Lagrange method is as follows:

Substituting x1 = 0 into the second equation,

Now, substituting this value of x2 into the equation representing constraint g, we have

thus, yielding points are

thus, yielding points

Now, the solution to the problem is worked out from these points, as follows:
Thus, the point (0, 1) is called the maximizer and the points are called minimizers.

• In general, it may not be always possible to solve such constrained optimization problems using
Lagrange method, in which case it can be solved through an iterative method such as projected gradient descent.
• As long as the objective function is a convex function and the feasible set is a convex set, each gradient descent step can be projected onto the
feasible set to solve the problem. This is pictorially represented as

• The corresponding algorithm is called Projected Gradient Descent and written as follows: xt+1 = Π(xt – ηt.∇f(xt)), where Π denotes the projection onto the feasible set.
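
A minimal sketch of projected gradient descent, using a simplified version of the cow-grass problem: the squared distance to the grass at (40, 40) is minimized subject only to the rope constraint (a disc of radius 10 around (20, 30)). The fence constraint is deliberately dropped here so that the projection has a simple closed form; the fixed step size 0.1 is also an illustrative choice.

```python
import numpy as np

target = np.array([40.0, 40.0])      # grass
center = np.array([20.0, 30.0])      # point where the cow is tied
radius = 10.0                        # rope length

def grad_f(x):
    # Gradient of the squared distance f(x) = ||x - target||^2
    return 2 * (x - target)

def project(x):
    """Project x onto the disc of the given radius around center."""
    offset = x - center
    dist = np.linalg.norm(offset)
    if dist <= radius:
        return x
    return center + radius * offset / dist

x = center.copy()
eta = 0.1
for _ in range(200):
    x = project(x - eta * grad_f(x))   # gradient step followed by projection

# Closest feasible point: on the rope's boundary, in the direction of the grass
expected = center + radius * (target - center) / np.linalg.norm(target - center)
print(x, expected)
```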

Convexity
• A set S is called a convex set when, for any two points x1 and x2 in S, λx1 + (1-λ) x2 is also in S, for every
λ in [0, 1]
• A pictorial representation of this concept is as follows:
Note that the line segment joining any two points x1 and x2 of a convex set lies entirely within the set.
• A hyperplane is convex, since the line segment joining any two points in the hyperplane lies in the hyperplane.
• If two sets S1 and S2 are convex, S1∩S2 is also convex.
• Set defined as {x ∈ Rd: Ax = b} given A ∈ Rm*d and b ∈ Rm*1 is convex.
• A convex combination of points x1, x2…xn is defined as follows: λ1x1 + λ2x2 + … + λnxn, where every λi >= 0 and λ1 + λ2 + … + λn = 1

• Set consisting of all such convex combinations is called a convex hull and lies inside the area
bounded by points x1, x2…xn. This is mathematically represented as follows: conv(x1, x2…xn) = {λ1x1 + … + λnxn : λi >= 0, λ1 + … + λn = 1}

• Alternatively, convex hull is defined to be the intersection of all convex sets containing the points x1, x2…xn.
• A convex hull is also a convex set.
• Epigraph epi(f) is defined as the set of points [x, z], where z >= f(x). Note that epigraph is a subset of Rd+1. An example in R1 is
shown below.

• Function is a convex function, when its domain and its epigraph are convex sets. If the function’s
domain is not a convex set, the function is not convex.
• Here’s another definition of convex function: f(λx1 + (1-λ)x2) <= λf(x1) + (1-λ)f(x2), for all x1, x2 in the domain and all λ in [0, 1]

• In yet another definition, the function is convex if the value at any point on its domain is greater
than or equal to the linear approximation taken around any other point. It’s stated mathematically as f(y) >= f(x) + ∇f(x)T(y – x), for all x and y in the domain

NOTE: For this definition, it’s assumed that the function is differentiable.
• A twice-differentiable function f: Rd->R is convex, if and only if at every point
o The Hessian matrix H is positive definite or positive semi-definite matrix, or equivalently
o Eigenvalues of the Hessian matrix H are non-negative, Eigenvalues(H) >= 0.
• If f is a convex function, all its local minima are also global minima. This implies that optimization
logic that minimizes to local minima also minimizes to global minima, in the case of convex
functions.
• A function f of two variables is convex, if fxx >= 0, fyy >= 0 and the determinant of the Hessian matrix formed by fxx, fxy and fyy is non-negative everywhere. If
the determinant is negative, the function is not convex.
• In order to find the interval over which the function is convex, calculate its double-derivative f`` and
solve f`` >= 0. Where f`` <= 0, the function is concave.
• Function f: Rd -> R, f(x) = xTAx is convex, if matrix A is positive definite or positive semi-definite.
• To find the maximum/minimum of an objective function f given one or more constraints (g and h),
start with the Lagrange multiplier method: ∇f = λ∇g + μ∇h. Solve for variables,
satisfying the above equation and all the given constraints. Example follows.

From the above equation, we get

Week9
• For a convex function, every local minimum is a global minimum too. It means that there could be
multiple local minima, but all with the same value.
• Set of all global minima of a convex function is a convex set.
• For any function f (Rd->R) that is both differentiable and convex, if x*∈Rd is a global minimum of f,
then ∇f(x*) = 0. Note that this is true, whether the function is convex or not. This is called the
first-order optimality condition.
• Converse of the above theorem is also true; thus, if ∇f(x*) = 0, then x* is a global minimum.
However, this is true only when the function is convex.
• Properties of convex functions:
o If f and g (both Rd->R) are convex functions, then h(x) = f(x) + g(x) is also a convex function.
o If f and g (both Rd->R) are convex functions, then h(x) = f(x) * g(x) is not necessarily a convex function (for example, f(x) = x and g(x) = x² are convex, but x³ is not).
o If f (R->R) and g (Rd->R) are convex functions, where f is non-decreasing, then h(x) = f(g(x)) is
also a convex function.
o If f (R->R) and g (Rd->R) are convex functions, where g is linear, then h(x) = f(g(x)) is also a
convex function.
o It follows from the above that if f is not non-decreasing, or g is linear, f(g(x)) isn’t necessarily
convex.
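A quick numerical sanity check of these closure properties is sketched below, assuming NumPy. The helper `looks_convex` and the chosen test functions are illustrative assumptions. The helper only tests midpoint convexity on a finite grid, so a True result is evidence rather than proof, while a False result is a genuine counterexample.

```python
import numpy as np

def looks_convex(func, grid):
    """Check f((a+b)/2) <= (f(a)+f(b))/2 for every pair (a, b) on a 1-D grid."""
    a, b = np.meshgrid(grid, grid)
    return bool(np.all(func((a + b) / 2) <= (func(a) + func(b)) / 2 + 1e-12))

grid = np.linspace(-3.0, 3.0, 61)

# Sum of two convex functions: exp(x) + x^2.
sum_is_convex = looks_convex(lambda x: np.exp(x) + x**2, grid)

# Product of two convex (linear) functions x and -x gives -x^2, which is concave.
product_is_convex = looks_convex(lambda x: x * (-x), grid)

# Composition f(g(x)) with f = exp (convex, non-decreasing) and g = x^2 (convex).
composition_is_convex = looks_convex(lambda x: np.exp(x**2), grid)
```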
• In a linear regression problem of machine learning, where the regression line is represented by a linear equation h(x) = wTx (w∈ Rd), the sum of squares error is represented as $\sum_{i=1}^{n} (w^T x_i - y_i)^2$, where n is the number of items in the data set and (x_i, y_i) is the i-th data point.
• The specific goal of linear regression is to minimize the sum of squares error above. This can be mathematically represented as
$\min_{w \in \mathbb{R}^d} f(w) = \frac{1}{2} \sum_{i=1}^{n} (w^T x_i - y_i)^2$
NOTE: ½ is a scaling factor, applied merely to render the calculations simpler.

• The above sum of squares error is a convex function: each term is a composition of the convex function f(z) = z^2 with g(w) = w^T x_i - y_i. Since g(w) is linear in w, each composition f(g(w)) is convex, and a sum of convex functions is convex.
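• In order to find the minimizing w, equate the gradient to 0 to get $w^* = (X^T X)^{-1} X^T y$, where X is the n x d matrix whose rows are the x_i^T, y is the vector of the y_i, and $(X^T X)^{-1} X^T$ is called the pseudo-inverse of X. This is the analytical solution; it requires X^T X to be invertible.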
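The analytical solution can be illustrated on synthetic data. This is a minimal sketch assuming NumPy; the data, the seed and the variable names are made up for illustration. It solves the normal equations as a linear system rather than forming the inverse explicitly, and cross-checks against `np.linalg.pinv`.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))                  # rows are the data points x_i^T
w_true = np.array([2.0, -1.0, 0.5])          # hypothetical ground-truth weights
y = X @ w_true + 0.01 * rng.normal(size=n)   # targets with a little noise

# Normal equations (X^T X) w = X^T y, solved as a linear system.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# The same solution obtained through the pseudo-inverse of X.
w_pinv = np.linalg.pinv(X) @ y

# The gradient X^T (X w - y) should vanish at the solution.
gradient_at_solution = X.T @ (X @ w_normal - y)
```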
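• To find the global minimum of the composition, take its gradient, which yields
$\nabla f(w) = \sum_{i=1}^{n} (w^T x_i - y_i)\, x_i = X^T (Xw - y)$, where X is the n x d data matrix with rows x_i^T and y is the vector of the y_i.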
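• Computation of the inverse is O(d^3), where d is the number of dimensions, and is highly expensive. Hence, it’s advisable to use an iterative algorithm, gradient descent, which takes a step along the negative gradient in each iteration:
$w_{t+1} = w_t - \eta \nabla f(w_t)$, where η > 0 is the step-size.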
• Note that the gradient calculation doesn’t involve an inverse computation and is hence much more efficient computationally. It can be made cheaper still by approximating the gradient using a technique called stochastic gradient descent.
• The stochastic gradient technique samples a small subset of datapoints (uniformly, at random) in each iteration and computes the gradient on this subset, instead of using all datapoints. In expectation, the sampled gradient equals the full gradient (up to scaling), so with a suitable step-size the iterates, averaged over the iterations, approach w*.
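Both iterative schemes can be sketched in a few lines. The example below assumes NumPy and uses synthetic, noise-free data so that w* is known exactly. The step-sizes, batch size and iteration counts are illustrative choices, and the gradient is divided by the batch size (a rescaling of the sum used in the notes) to keep the step-size independent of the batch.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                      # noise-free targets, so w* equals w_true

def gradient(w, X_batch, y_batch):
    """Gradient of the mean of 0.5 * (w^T x_i - y_i)^2 over the batch."""
    return X_batch.T @ (X_batch @ w - y_batch) / len(y_batch)

def gradient_descent(steps=500, eta=0.1):
    w = np.zeros(d)
    for _ in range(steps):
        w = w - eta * gradient(w, X, y)              # uses all datapoints
    return w

def stochastic_gradient_descent(steps=5000, eta=0.05, batch_size=10):
    w = np.zeros(d)
    for _ in range(steps):
        idx = rng.integers(0, n, size=batch_size)    # uniform random subset
        w = w - eta * gradient(w, X[idx], y[idx])    # gradient on the subset only
    return w

w_gd = gradient_descent()
w_sgd = stochastic_gradient_descent()
```

Neither loop forms or inverts X^T X; each step costs O(nd) for full gradient descent and O(batch_size * d) for the stochastic variant.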
• Finding the minimum of f constrained by, say, h(x) <= 0 is the same as minimizing the maximum of a Lagrangian L(x, λ). This is called the primal problem and is represented as follows:
$\min_{x} \max_{\lambda \ge 0} L(x, \lambda)$
NOTE: $L(x, \lambda) = f(x) + \lambda h(x)$
NOTE2: λ is called the Lagrange multiplier.
• The min-max (primal) problem can be converted to a max-min (dual) problem as follows:
$\max_{\lambda \ge 0} \min_{x} L(x, \lambda)$
Note that the value of the inner min problem, $g(\lambda) = \min_{x} L(x, \lambda)$, is a concave function of λ.
• Solution for the min-max problem is as follows.

• Solution for max-min problem (dual) is as follows

• The function value at the dual optimum is always less than or equal to the function value at the primal optimum; this is called weak duality.
• If f and h are convex functions (and a mild regularity condition such as Slater's condition holds), then we get strong duality: the two optimal values are equal, i.e. $f(x^*) = \min_{x} L(x, \lambda^*)$. In this case, we can solve either the primal problem or the dual problem to arrive at the optimal solution x* for f.
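Weak and strong duality can be observed on a small convex problem. The sketch below assumes NumPy and uses a toy problem chosen purely for illustration: minimize f(x) = x^2 subject to h(x) = 1 - x <= 0. Both the primal and the dual are evaluated by brute force on grids, so the values are approximate.

```python
import numpy as np

f = lambda x: x**2            # objective (convex)
h = lambda x: 1.0 - x         # constraint h(x) <= 0, i.e. x >= 1 (convex)

xs = np.linspace(-3.0, 3.0, 6001)      # grid over x
lambdas = np.linspace(0.0, 5.0, 501)   # grid over lambda >= 0

# Primal: minimize f over the feasible grid points (small tolerance for rounding).
feasible = xs[h(xs) <= 1e-9]
primal_value = f(feasible).min()

# Dual function g(lambda) = min_x L(x, lambda), with L = f + lambda * h.
lagrangian = f(xs)[None, :] + lambdas[:, None] * h(xs)[None, :]
dual_function = lagrangian.min(axis=1)

# Dual: maximize g over lambda >= 0.
dual_value = dual_function.max()
best_lambda = lambdas[dual_function.argmax()]
```

For this problem the dual function works out to g(λ) = λ - λ^2/4, which is concave and peaks at λ = 2 with value 1, matching the primal optimum f(1) = 1.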
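• Thus, for the objective function f and inequality constraint h(x) <= 0, the (local) optimal solution (x*, λ*) satisfies the Karush-Kuhn-Tucker (KKT) conditions, which are listed as follows:
o Stationarity: $\nabla f(x^*) + \lambda^* \nabla h(x^*) = 0$
o Complementary slackness: $\lambda^* h(x^*) = 0$
o Primal feasibility: $h(x^*) \le 0$
o Dual feasibility: $\lambda^* \ge 0$
NOTE: If the functions f and h are convex, these conditions ensure an optimal solution that is global, not just local.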
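The KKT conditions for a single inequality constraint can be checked mechanically for a candidate pair (x, λ). The sketch below assumes NumPy and a made-up problem: minimize (x1 - 2)^2 + (x2 - 1)^2 subject to x1 + x2 - 2 <= 0. The function names and the tolerance are illustrative.

```python
import numpy as np

def grad_f(x):
    """Gradient of f(x) = (x1 - 2)^2 + (x2 - 1)^2."""
    return 2.0 * (x - np.array([2.0, 1.0]))

def h(x):
    """Inequality constraint h(x) = x1 + x2 - 2 <= 0."""
    return x[0] + x[1] - 2.0

def grad_h(x):
    return np.array([1.0, 1.0])

def satisfies_kkt(x, lam, tol=1e-9):
    stationarity = np.allclose(grad_f(x) + lam * grad_h(x), 0.0, atol=tol)
    complementary_slackness = abs(lam * h(x)) <= tol
    primal_feasible = h(x) <= tol
    dual_feasible = lam >= -tol
    return bool(stationarity and complementary_slackness
                and primal_feasible and dual_feasible)

# Candidate on the constraint boundary with multiplier 1.
candidate_ok = satisfies_kkt(np.array([1.5, 0.5]), 1.0)
# The unconstrained minimizer (2, 1) with multiplier 0 violates h(x) <= 0.
unconstrained_min_ok = satisfies_kkt(np.array([2.0, 1.0]), 0.0)
```

Since f and h are convex here, the candidate that passes all four checks is the global constrained minimum.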
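• If the list of constraints includes equality constraints (represented as l_j(x) = 0) in addition to the inequality constraints (represented as h_i(x) <= 0), the KKT conditions can be re-written as
o Stationarity: $\nabla f(x^*) + \sum_i u_i \nabla h_i(x^*) + \sum_j v_j \nabla l_j(x^*) = 0$
o Complementary slackness: $u_i h_i(x^*) = 0$ for all i
o Primal feasibility: $h_i(x^*) \le 0$ for all i and $l_j(x^*) = 0$ for all j
o Dual feasibility: $u_i \ge 0$ for all i
NOTE: Vectors u and v represent the Lagrange multipliers for the inequality and the equality constraints respectively.
NOTE2: If the (primal) inequality constraints use > or >=, the dual feasibility should use <=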
• For linear programs, the dual of the dual is the primal.
• If either the primal or the dual problem is unbounded, then the other is infeasible. If either is infeasible, the other is either unbounded or also infeasible.
• If either the primal or the dual problem has a finite optimal solution, then the other also has one and their optimum values are equal.
• If one of the variables in the primal has unrestricted sign, the corresponding constraint in the dual is an equality constraint.

Week11
• A random experiment is modelled by a probability space (Ω, F, P), where Ω is the sample space, F is the set of events (subsets of Ω) and P is the probability measure
• Axioms of probability with continuous random variables
o P(A) >= 0
o P(Ω) = 1
o P (A1 U A2 …U An) = ∑P(Ai) for i = 1 to n, when the events A1, …, An are pairwise disjoint
• In the case of continuous random variables, the set of possible values is uncountably infinite (for example, the set of all real numbers)
• The probability that a continuous random variable X takes an exact value x is 0, i.e. P(X = x) = 0.
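• PDF and CDF of a continuous random variable X are defined as follows:
o CDF: $F_X(x) = P(X \le x) = \int_{-\infty}^{x} f_X(t)\,dt$
o PDF: $f_X(x) = \frac{d}{dx} F_X(x)$
• Following are the properties of PDF and CDF
o $f_X(x) \ge 0$ for all x
o $\int_{-\infty}^{\infty} f_X(x)\,dx = 1$
o $F_X$ is non-decreasing, with $F_X(x) \to 0$ as $x \to -\infty$ and $F_X(x) \to 1$ as $x \to \infty$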
• Also, Fx(b) – Fx(a) = P (a < X <= b)

• For a continuous random variable X with PDF fx, an event A is a subset of a real line and its
probability is computed as P(A) = ∫A f(x)dx
• Conditional probability is given by the following formula
$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$, for P(B) > 0
• When PDF is integrated, you get the CDF. Likewise, when CDF is differentiated, you get PDF.
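The relationship between PDF and CDF can be checked numerically. The sketch below assumes NumPy and picks an exponential distribution with rate 2 purely as an example. It integrates the PDF with the trapezoidal rule to approximate the CDF, differentiates the exact CDF to recover the PDF, and evaluates an interval probability as a difference of CDF values.

```python
import numpy as np

rate = 2.0                                   # illustrative rate parameter
x = np.linspace(0.0, 10.0, 100001)
pdf = rate * np.exp(-rate * x)               # exponential PDF on x >= 0

# Integrate the PDF (cumulative trapezoidal rule) to approximate the CDF.
dx = x[1] - x[0]
cdf_numeric = np.concatenate(([0.0], np.cumsum((pdf[1:] + pdf[:-1]) / 2.0 * dx)))
cdf_exact = 1.0 - np.exp(-rate * x)

# Differentiate the CDF (finite differences) to recover the PDF.
pdf_recovered = np.gradient(cdf_exact, x)

# P(a < X <= b) = F(b) - F(a)
a, b = 0.5, 1.5
prob_interval = np.interp(b, x, cdf_numeric) - np.interp(a, x, cdf_numeric)
prob_exact = np.exp(-rate * a) - np.exp(-rate * b)
```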
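• Expectation of a continuous random variable is given by
$E[X] = \int_{-\infty}^{\infty} x\, f_X(x)\,dx$
• Properties of expectation are
o $E[aX + b] = aE[X] + b$
o $E[X + Y] = E[X] + E[Y]$
o $E[g(X)] = \int_{-\infty}^{\infty} g(x)\, f_X(x)\,dx$
• Variance is given by $Var(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2$
• Properties of variance are $Var(X) \ge 0$ and $Var(aX + b) = a^2\, Var(X)$
• Standard deviation is the square root of the variance
• In the case of uniform distribution in the interval [a, b], expectation is (a + b) / 2 and variance is (b - a)^2 / 12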
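The moments of the uniform distribution can be verified by numerical integration of the defining integrals. This is a minimal sketch assuming NumPy; the interval [2, 5] and the helper `trapezoid` are illustrative choices.

```python
import numpy as np

def trapezoid(values, grid):
    """Trapezoidal approximation of the integral of values over grid."""
    return float(np.sum((values[1:] + values[:-1]) / 2.0 * np.diff(grid)))

a, b = 2.0, 5.0
x = np.linspace(a, b, 200001)
pdf = np.full_like(x, 1.0 / (b - a))         # uniform density on [a, b]

total_probability = trapezoid(pdf, x)        # should integrate to 1
mean = trapezoid(x * pdf, x)                 # E[X] = integral of x f(x) dx
variance = trapezoid((x - mean) ** 2 * pdf, x)   # E[(X - E[X])^2]

expected_mean = (a + b) / 2.0
expected_variance = (b - a) ** 2 / 12.0
```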
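• Total expectation law is as follows:
$E[X] = \sum_{i} E[X \mid A_i]\, P(A_i)$, where the events A_i form a partition of the sample space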
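• Joint distribution: random variables X and Y have a joint PDF $f_{XY}(x, y)$ such that $P((X, Y) \in A) = \iint_A f_{XY}(x, y)\,dx\,dy$
• Properties of joint distribution: $f_{XY}(x, y) \ge 0$ and $\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f_{XY}(x, y)\,dx\,dy = 1$
• Cumulative distribution: $F_{XY}(x, y) = P(X \le x, Y \le y) = \int_{-\infty}^{x}\int_{-\infty}^{y} f_{XY}(s, t)\,dt\,ds$
• Properties of cumulative distribution: $F_{XY}(x, y)$ is non-decreasing. The cumulative probability increases (or stays the same) when either x or y increases.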
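• Marginal densities are given by
$f_X(x) = \int_{-\infty}^{\infty} f_{XY}(x, y)\,dy$ and $f_Y(y) = \int_{-\infty}^{\infty} f_{XY}(x, y)\,dx$
• Conditional density is given by
$f_{X \mid Y}(x \mid y) = \frac{f_{XY}(x, y)}{f_Y(y)}$, for $f_Y(y) > 0$
• X and Y are independent if
$f_{XY}(x, y) = f_X(x)\, f_Y(y)$ for all x and y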
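• Covariance of random variables X and Y is defined as
$Cov(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y]$
• Correlation coefficient is given by
$\rho_{XY} = \frac{Cov(X, Y)}{\sigma_X \sigma_Y}$, where σ_X and σ_Y are the standard deviations of X and Y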
• If X and Y are independent random variables, covariance is 0, implying that the correlation
coefficient is 0 and hence they are uncorrelated. However, the reverse is not true. Uncorrelated
random variables need not be independent.
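A standard example of uncorrelated but dependent variables is X uniform on [-1, 1] with Y = X^2. The simulation below is a sketch assuming NumPy; the sample size, seed and thresholds are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200_000)
y = x**2                                   # Y is a deterministic function of X

# Sample covariance and correlation coefficient: both should be close to 0.
covariance = np.mean(x * y) - np.mean(x) * np.mean(y)
correlation = covariance / (np.std(x) * np.std(y))

# Dependence: knowing that |X| > 0.5 rules out Y <= 0.25 entirely.
p_y_small = np.mean(y <= 0.25)
p_y_small_given_x_large = np.mean(y[np.abs(x) > 0.5] <= 0.25)
```

If X and Y were independent, the conditional probability would equal the unconditional one (about 0.5); instead it is exactly 0.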
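• To get the joint distribution of derived random variables (U, V) from original random variables (X, Y), use
$f_{UV}(u, v) = f_{XY}(x(u, v), y(u, v))\, |J|$, where J represents the Jacobian determinant $\frac{\partial(x, y)}{\partial(u, v)}$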

• Memoryless property of exponential distribution
$P(X > s + t \mid X > s) = P(X > t)$ for all s, t >= 0

• Expectation of exponential distribution exp(λ) is 1/λ and variance is (1/λ)^2
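The memoryless property and the moments of the exponential distribution can be checked by simulation. This is a sketch assuming NumPy; the rate, the values of s and t, and the sample size are illustrative. Note that NumPy parameterizes the exponential by its scale 1/λ rather than by the rate λ.

```python
import numpy as np

rng = np.random.default_rng(0)
rate = 2.0
samples = rng.exponential(scale=1.0 / rate, size=500_000)

sample_mean = samples.mean()          # should be close to 1 / rate
sample_variance = samples.var()       # should be close to (1 / rate)^2

# Memoryless property: P(X > s + t | X > s) should match P(X > t).
s, t = 0.3, 0.4
p_conditional = np.mean(samples[samples > s] > s + t)
p_unconditional = np.mean(samples > t)
```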

• When and , then

Week12
• Covariance matrix is always a square matrix. For a system with two variables, it’s a 2 x 2 matrix:
$\Sigma = \begin{bmatrix} Var(X_1) & Cov(X_1, X_2) \\ Cov(X_2, X_1) & Var(X_2) \end{bmatrix}$
• The diagonal elements of a covariance matrix are the variances.
• If Z = CY, where Y is a normally distributed random vector with mean μ and covariance matrix Σ, and C is the coefficient matrix, then $Z \sim N(C\mu,\; C \Sigma C^T)$
• Cov[X1, X2] = Cov[X2, X1]
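The structure of the covariance matrix, and its behaviour under a linear map Z = CY, can be seen on simulated data. The sketch below assumes NumPy; the mean vector, covariance matrix and coefficient matrix C are made-up values for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
mean = np.array([1.0, -2.0])
cov = np.array([[2.0, 0.6],
                [0.6, 1.0]])
Y = rng.multivariate_normal(mean, cov, size=200_000)   # each row is one sample

C = np.array([[1.0, 2.0],
              [0.0, 3.0]])
Z = Y @ C.T                                            # applies Z = C Y to every sample

sample_cov_Y = np.cov(Y, rowvar=False)   # 2 x 2, symmetric, variances on the diagonal
sample_cov_Z = np.cov(Z, rowvar=False)
expected_cov_Z = C @ cov @ C.T           # covariance of a linear transform
expected_mean_Z = C @ mean
```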
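• Maximum log likelihood estimate is given by
$\hat{\theta} = \arg\max_{\theta} \sum_{i=1}^{n} \log f(x_i; \theta)$, where x_1, …, x_n are independent samples and f(x; θ) is the density with parameter θ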
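• In the case of multi-variate normal distribution, this method of estimation computes the mean
and covariance as follows: $\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i$ and $\hat{\Sigma} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})(x_i - \hat{\mu})^T$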
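The maximum likelihood estimates for a multivariate normal can be computed directly from samples. This is a sketch assuming NumPy; the true parameters and the sample size are invented for the demonstration. The covariance estimate divides by n (not n - 1), which is what maximizing the likelihood gives.

```python
import numpy as np

rng = np.random.default_rng(1)
true_mean = np.array([0.5, -1.0])
true_cov = np.array([[1.0, 0.3],
                     [0.3, 2.0]])
X = rng.multivariate_normal(true_mean, true_cov, size=50_000)
n = X.shape[0]

# MLE of the mean: the sample average.
mu_hat = X.mean(axis=0)

# MLE of the covariance: average outer product of the centered samples.
centered = X - mu_hat
sigma_hat = centered.T @ centered / n
```

`np.cov(X, rowvar=False, bias=True)` computes the same 1/n-normalized estimate and can serve as a cross-check.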

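• Markov’s inequality is given by the formula:
$P(X \ge a) \le \frac{E[X]}{a}$, for a non-negative random variable X and a > 0

• Chebyshev’s inequality is given by the formula:
$P(|X - \mu| \ge k) \le \frac{\sigma^2}{k^2}$, for k > 0, where μ and σ^2 are the mean and variance of X

• WLLN (weak law of large numbers) is given by the formula: $\lim_{n \to \infty} P(|\bar{X}_n - \mu| \ge \epsilon) = 0$ for every ε > 0, where $\bar{X}_n$ is the sample mean of n independent, identically distributed random variables with mean μ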
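The two inequalities and the weak law can be illustrated by simulation. The sketch below assumes NumPy and uses an exponential distribution with mean 1 and variance 1 as an arbitrary non-negative example; the thresholds, ε and the trial counts are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, var = 1.0, 1.0                                   # mean and variance of Exp(1)
samples = rng.exponential(scale=1.0, size=200_000)   # non-negative samples

# Markov: P(X >= a) <= E[X] / a
a = 3.0
markov_lhs = np.mean(samples >= a)
markov_bound = mu / a

# Chebyshev: P(|X - mu| >= k) <= var / k^2
k = 2.0
chebyshev_lhs = np.mean(np.abs(samples - mu) >= k)
chebyshev_bound = var / k**2

def deviation_probability(n, eps=0.1, trials=2000):
    """Estimate P(|sample mean of n draws - mu| >= eps) from repeated trials."""
    means = rng.exponential(scale=1.0, size=(trials, n)).mean(axis=1)
    return float(np.mean(np.abs(means - mu) >= eps))

# WLLN: the deviation probability shrinks as n grows.
p_small_n = deviation_probability(10)
p_large_n = deviation_probability(1000)
```

Both bounds hold with plenty of slack here, which is typical: Markov and Chebyshev are worst-case bounds that use only the mean and variance.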
