MATHEMATICS IN MACHINE LEARNING - I
ROHAN PILLAI
ASSISTANT PROFESSOR
DEPARTMENT OF ELECTRICAL ENGINEERING, DTU
Table of Contents
❑ Mathematics in Data Science vs Machine Learning
❑ LINEAR ALGEBRA
  • Tensors
  • Vectors
  • Matrices
  • Special Matrices
  • Transpose of a Matrix
  • Inverse of a Matrix
  • Matrix Differentiation
❑ CALCULUS REVISITED
  • Convex Sets
  • Convex Functions
  • Extrema
  • Derivatives
  • Rules of Derivatives
  • Gradient
  • Hessian
  • Jacobian
  • Stationary Points
  • Stationary Points in Higher Dimensions
  • Example
❑ References
Mathematics in Data Science vs Machine Learning
How can we learn mathematics without getting bogged down in the theory?
Mathematics in data science and machine learning is not about crunching numbers, but about what is happening, why it is happening, and how we can play around with different things to obtain the results we want.
1. LINEAR ALGEBRA
Tensors
A tensor is an array of numbers that may be
❖ a scalar (0th Order Tensor)
❖ a vector (1st Order Tensor)
❖ a matrix (2nd Order Tensor)
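A minimal sketch, assuming NumPy is available, of the three tensor orders listed above:

```python
import numpy as np

scalar = np.array(3.0)              # 0th-order tensor: a single number
vector = np.array([1, 5, 3, 4, 2])  # 1st-order tensor: an ordered list of numbers
matrix = np.array([[1, 2],
                   [3, 4],
                   [5, 6]])         # 2nd-order tensor: a 2-D array of numbers

# ndim reports the order of each tensor
print(scalar.ndim, vector.ndim, matrix.ndim)  # 0 1 2
```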
Vectors
A vector in ℝⁿ is an ordered set of n real numbers.
o e.g. v = (1, 5, 3, 4, 2) is in ℝ⁵
o A column vector:
    1
    5
    3
    4
    2
o A row vector: (1 5 3 4 2)
Vector operations: scalar multiplication and vector addition
Any point on the plane containing the vectors u and v is some linear combination au + bv
Vectors…
Vector norms: A norm ||x|| of a vector is informally a measure of the "length" of the vector.
Common norms:
o L1 (Manhattan): ||x||₁ = Σᵢ |xᵢ|
o L2 (Euclidean): ||x||₂ = √(Σᵢ xᵢ²)
For matrices, the Frobenius norm: ||A||F = √(Σᵢ,ⱼ Aᵢⱼ²)
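A small sketch, assuming NumPy, computing these norms for the running example vector and an example matrix:

```python
import numpy as np

x = np.array([1, 5, 3, 4, 2])
A = np.array([[1, 2], [3, 4], [5, 6]])

l1 = np.sum(np.abs(x))            # L1 (Manhattan) norm: sum of absolute values
l2 = np.sqrt(np.sum(x ** 2))      # L2 (Euclidean) norm: sqrt of sum of squares
frob = np.sqrt(np.sum(A ** 2))    # Frobenius norm: L2-style norm over all matrix entries

# Same values via NumPy's built-in norm routine
assert np.isclose(l2, np.linalg.norm(x))
assert np.isclose(frob, np.linalg.norm(A, 'fro'))
print(l1, l2, frob)
```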
Vectors…
Vector dot (inner) product: u•v = Σᵢ uᵢvᵢ
Geometric interpretation of the dot product: the projection of one vector onto the other
❑ If u•v = 0, ||u||₂ ≠ 0, ||v||₂ ≠ 0 → u and v are orthogonal
❑ If u•v = 0, ||u||₂ = 1, ||v||₂ = 1 → u and v are orthonormal
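A short numerical check of the orthogonality and orthonormality conditions, again assuming NumPy:

```python
import numpy as np

u = np.array([1.0, 0.0])
v = np.array([0.0, 2.0])

print(u @ v)   # 0.0 -> u and v are orthogonal

# Orthonormal additionally requires unit length
u_hat, v_hat = u / np.linalg.norm(u), v / np.linalg.norm(v)
print(u_hat @ v_hat, np.linalg.norm(u_hat), np.linalg.norm(v_hat))  # 0.0 1.0 1.0
```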
Matrices
An N×M matrix has N rows and M columns, i.e. it is a 2-D array of numbers. Here is an example matrix A with 3 rows and 2 columns:
    A = [ 1  2
          3  4
          5  6 ]
Linear transformations can be represented as matrices.
Multiplying a matrix with a vector has the geometric meaning of applying a linear
transformation to the vector
Multiplying two matrices has the geometric meaning of applying one linear
transformation after another
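A brief sketch, assuming NumPy, of a matrix acting as a linear transformation (here a 90° rotation) and of composing two transformations by multiplying their matrices:

```python
import numpy as np

R = np.array([[0.0, -1.0],
              [1.0,  0.0]])   # rotation by 90 degrees counter-clockwise

x = np.array([1.0, 0.0])
print(R @ x)                  # [0. 1.] -> the vector has been rotated

# Applying R twice equals applying the single matrix R @ R (a 180 degree rotation)
print(R @ (R @ x), (R @ R) @ x)   # both give [-1.  0.]
```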
Matrices…
Matrix Multiplication: C = AB
▪ Cᵢⱼ = Σₖ AᵢₖBₖⱼ
▪ Sizes must match (columns of A = rows of B)
Hadamard Product: C = A ⊙ B
▪ Element-wise multiplication
▪ A, B, C must be of the same size
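The difference between the two products, sketched with NumPy:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

C_mat = A @ B   # matrix product: C_ij = sum_k A_ik * B_kj
C_had = A * B   # Hadamard product: element-wise, shapes must match exactly

print(C_mat)    # [[19 22] [43 50]]
print(C_had)    # [[ 5 12] [21 32]]
```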
Special Matrices
Diagonal matrix, example:
    [ a  0  0
      0  b  0
      0  0  c ]
Upper triangular matrix, example:
    [ a  b  c
      0  d  e
      0  0  f ]
Lower triangular matrix, example:
    [ a  0  0
      b  c  0
      d  e  f ]
Symmetric matrix: A = Aᵀ
Transpose of a Matrix
Transpose of a matrix: we can think of it as
o "flipping" the rows and columns
OR
o "reflecting" the vector/matrix about its main diagonal
Examples:
    [ a ]ᵀ              [ a  b ]ᵀ   [ a  c ]
    [ b ]   = (a  b)    [ c  d ]  = [ b  d ]
Inverse of a matrix
The inverse of a square matrix A, denoted by A⁻¹, is the unique matrix s.t.
o AA⁻¹ = A⁻¹A = I (identity matrix)
If A⁻¹ and B⁻¹ exist, then
o (AB)⁻¹ = B⁻¹A⁻¹
o (Aᵀ)⁻¹ = (A⁻¹)ᵀ
For orthogonal matrices: A⁻¹ = Aᵀ
For diagonal matrices: D⁻¹ = diag{d₁⁻¹, …, dₙ⁻¹}
The pseudo-inverse of a matrix A ∈ ℝᵐˣⁿ is defined as A⁺ ∈ ℝⁿˣᵐ, where A⁺ = (AᵀA)⁻¹Aᵀ
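A sketch, assuming NumPy, checking the inverse and pseudo-inverse definitions above (the formula A⁺ = (AᵀA)⁻¹Aᵀ requires AᵀA to be invertible, i.e. A must have full column rank):

```python
import numpy as np

# Square, invertible matrix
A = np.array([[2.0, 1.0], [1.0, 3.0]])
A_inv = np.linalg.inv(A)
print(np.allclose(A @ A_inv, np.eye(2)))       # True: A A^-1 = I

# Tall matrix: use the pseudo-inverse A+ = (A^T A)^-1 A^T
B = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
B_plus = np.linalg.inv(B.T @ B) @ B.T
print(np.allclose(B_plus, np.linalg.pinv(B)))  # True: matches NumPy's pinv
print(np.allclose(B_plus @ B, np.eye(2)))      # True: A+ A = I (a left inverse)
```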
Matrix Differentiation
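As a minimal illustration, two standard matrix-calculus identities, ∇ₓ(aᵀx) = a and ∇ₓ(xᵀAx) = (A + Aᵀ)x, can be checked numerically with finite differences; a sketch assuming NumPy is available:

```python
import numpy as np

def num_grad(f, x, h=1e-6):
    """Central-difference gradient of a scalar function f at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

rng = np.random.default_rng(0)
a = rng.standard_normal(3)
A = rng.standard_normal((3, 3))
x = rng.standard_normal(3)

# d(a^T x)/dx = a
print(np.allclose(num_grad(lambda x: a @ x, x), a, atol=1e-4))
# d(x^T A x)/dx = (A + A^T) x
print(np.allclose(num_grad(lambda x: x @ A @ x, x), (A + A.T) @ x, atol=1e-4))
```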
2. CALCULUS REVISITED
Convex Sets
A set C ⊆ ℝᵈ is convex if
∀ x, y ∈ C and ∀ λ ∈ [0, 1],
the point z = λ·x + (1 − λ)·y also lies in C
Convex Functions
A function f : ℝᵈ → ℝ is convex if
∀ x, y ∈ ℝᵈ and ∀ λ ∈ [0, 1],
with z = λ·x + (1 − λ)·y,
f(z) ≤ λ·f(x) + (1 − λ)·f(y)
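A small sketch, assuming NumPy, that checks the defining inequality for a convex function (f(x) = x²) and a non-convex one at many values of λ:

```python
import numpy as np

def violates_convexity(f, x, y, lambdas):
    """Return True if f(lam*x + (1-lam)*y) > lam*f(x) + (1-lam)*f(y) for some lam."""
    z = lambdas * x + (1 - lambdas) * y
    return np.any(f(z) > lambdas * f(x) + (1 - lambdas) * f(y) + 1e-12)

lam = np.linspace(0, 1, 101)
x, y = -2.0, 3.0
print(violates_convexity(lambda t: t ** 2, x, y, lam))         # False: x^2 is convex
print(violates_convexity(lambda t: np.sin(3 * t), x, y, lam))  # True: sin(3t) is not convex
```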
Extrema
Since we always seek the "best" values of a function, we are usually looking for its maxima or minima
Global extremum: a point which achieves the best value of the function (max/min) among all possible points
Local extremum: a point which achieves the best value of the function only in a small region surrounding that point
Most machine learning algorithms aim to find a global extremum
o E.g. we saw that SVM wanted to find the model with max margin
Sometimes this is difficult, so we settle for local extrema (e.g. deep nets)
Derivatives
The derivative measures how much one variable (y = f(x)) changes when there is a small change in the other variable (x)
o f′(x) = dy/dx
It is zero when the function is flat (horizontal), such as at a minimum or maximum of f(x)
It is positive when f(x) is strictly increasing and negative when f(x) is strictly decreasing
Double derivatives:
o The double derivative is f″(x), or y″ = d²y/dx²
o It is positive for convex functions (having a single minimum) and negative for concave functions (having a single maximum)
In higher dimensions (functions of many variables / vectors), we use the idea of partial derivatives
Rules of Derivatives
Sum Rule: (f(x) + g(x))′ = f′(x) + g′(x)
Scaling Rule: (a·f(x))′ = a·f′(x), if a is not a function of x
Product Rule: (f(x)·g(x))′ = f′(x)·g(x) + g′(x)·f(x)
Quotient Rule: (f(x)/g(x))′ = (f′(x)·g(x) − g′(x)·f(x)) / g(x)²
Chain Rule: (f(g(x)))′ ≝ (f∘g)′(x) = f′(g(x))·g′(x)
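These rules can also be verified symbolically; a minimal sketch, assuming SymPy is available:

```python
import sympy as sp

x = sp.symbols('x')
f = sp.sin(x)
g = x ** 2 + 1

# Product rule: (f g)' = f' g + g' f
assert sp.simplify(sp.diff(f * g, x) - (sp.diff(f, x) * g + sp.diff(g, x) * f)) == 0
# Quotient rule: (f / g)' = (f' g - g' f) / g^2
assert sp.simplify(sp.diff(f / g, x) - (sp.diff(f, x) * g - sp.diff(g, x) * f) / g ** 2) == 0
# Chain rule: (f(g(x)))' = f'(g(x)) g'(x)
assert sp.simplify(sp.diff(f.subs(x, g), x) - sp.diff(f, x).subs(x, g) * sp.diff(g, x)) == 0
print("all rules verified")
```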
Gradient
Gradient: the multivariate generalisation of the derivative
For multivariate functions with d-dimensional inputs, the gradient simply records how much the function would change if we move a little bit along each one of the d axes:
    ∇f(x) = (∂f/∂x₁, ∂f/∂x₂, …, ∂f/∂x_d)ᵀ
The gradient also has the distinction of offering the direction of steepest ascent, i.e. if we want the maximum increase in function value, we must move a little bit along the gradient. Similarly, we must move a little bit in the direction opposite to the gradient to get the maximum decrease in function value, i.e. the negative gradient offers the direction of steepest descent
At a minimum or maximum, the gradient is the zero vector
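A sketch, assuming NumPy, that computes a gradient numerically for f(x, y) = x² + y² and illustrates steepest ascent and descent:

```python
import numpy as np

def f(p):
    return p[0] ** 2 + p[1] ** 2

def num_grad(f, p, h=1e-6):
    """Central-difference gradient along each axis."""
    g = np.zeros_like(p)
    for i in range(p.size):
        e = np.zeros_like(p)
        e[i] = h
        g[i] = (f(p + e) - f(p - e)) / (2 * h)
    return g

p = np.array([1.0, 2.0])
g = num_grad(f, p)
print(g)                        # approximately [2. 4.] = (2x, 2y)

step = 0.01
print(f(p + step * g) > f(p))   # True: a small step along the gradient increases f
print(f(p - step * g) < f(p))   # True: a small step against it decreases f (steepest descent)
```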
Hessian
Hessian: the gradient of the gradient
The 2nd derivative of f : ℝⁿ → ℝ is an n × n matrix called the Hessian:
    ∇²f(x) = [ ∂²f/∂x₁²      ∂²f/∂x₁∂x₂   …   ∂²f/∂x₁∂xₙ
               ∂²f/∂x₂∂x₁    ∂²f/∂x₂²      …   ∂²f/∂x₂∂xₙ
               ⋮              ⋮              ⋱   ⋮
               ∂²f/∂xₙ∂x₁    ∂²f/∂xₙ∂x₂   …   ∂²f/∂xₙ²   ]
Note that the Hessian matrix is symmetric
It is the equivalent of the 2nd derivative in scalar calculus and has similar uses
All rules of derivatives (chain, product etc.) apply here as well
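A sketch, assuming NumPy, using finite differences on the quadratic f(x) = ½ xᵀAx with symmetric A, whose Hessian is exactly A:

```python
import numpy as np

A = np.array([[2.0, 0.5],
              [0.5, 1.0]])       # symmetric matrix

def f(x):
    return 0.5 * x @ A @ x       # quadratic form: the Hessian should equal A

def num_hessian(f, x, h=1e-4):
    """Finite-difference Hessian; symmetric by construction of the formula."""
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = np.zeros(n), np.zeros(n)
            ei[i], ej[j] = h, h
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * h ** 2)
    return H

print(np.round(num_hessian(f, np.array([1.0, -2.0])), 3))   # approximately A, and symmetric
```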
Jacobian
The Jacobian is the equivalent of the gradient for vector-valued functions
The Hessian can be seen as the gradient (Jacobian) of the gradient (which is a vector)
For f : ℝⁿ → ℝᵐ, the Jacobian is an m × n matrix with Jᵢ,ⱼ = ∂fᵢ/∂xⱼ:
    J = [ ∂f₁/∂x₁   ∂f₁/∂x₂   …   ∂f₁/∂xₙ
          ∂f₂/∂x₁   ∂f₂/∂x₂   …   ∂f₂/∂xₙ
          ⋮                     ⋱   ⋮
          ∂fₘ/∂x₁   ∂fₘ/∂x₂   …   ∂fₘ/∂xₙ ]
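A sketch, assuming NumPy, of a finite-difference Jacobian for a vector-valued function f : ℝ² → ℝ³:

```python
import numpy as np

def f(x):
    # f: R^2 -> R^3
    return np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])

def num_jacobian(f, x, h=1e-6):
    """J[i, j] = d f_i / d x_j via central differences; shape (m, n)."""
    m, n = f(x).size, x.size
    J = np.zeros((m, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = h
        J[:, j] = (f(x + e) - f(x - e)) / (2 * h)
    return J

x = np.array([1.0, 2.0])
print(np.round(num_jacobian(f, x), 3))
# Analytic Jacobian: [[x1, x0], [cos(x0), 0], [0, 2*x1]] = [[2, 1], [0.540, 0], [0, 4]]
```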
Stationary Points
These are places where the derivative vanishes, i.e. is 0
They can be local/global extrema or saddle points
The derivative being zero is its way of telling us that at that point, the function looks flat
Saddle points can be tedious in ML
We can find out whether a stationary point is a saddle or an extremum using the 2nd derivative
Just as the sign of the derivative tells us if the function is increasing or decreasing at a given point, the 2nd derivative tells us if the derivative is increasing or decreasing
o If f″(x) > 0 and f′(x) = 0, the derivative moves from −ve to +ve around this point – local/global min!
o If f″(x) < 0 and f′(x) = 0, the derivative moves from +ve to −ve around this point – local/global max!
o If f″(x) = 0 and f′(x) = 0, this may be an extremum or a saddle – higher derivatives, e.g. f‴(x), are needed
Stationary Points in higher dimensions
These are places where the gradient vanishes, i.e. is a zero vector!
We can still find out whether a stationary point is a saddle or an extremum using the 2nd derivative test, just as in 1D
A bit more complicated to visualize, but the Hessian tells us how the surface of the function is curved at a point
Recall that if a square d × d symmetric matrix A satisfies xᵀAx > 0 for all non-zero x ∈ ℝᵈ, then it is called positive definite (PD); if it satisfies xᵀAx < 0 for all non-zero x ∈ ℝᵈ, then it is negative definite (ND)
o If ∇f(x) = 0 and ∇²f(x) is a PD matrix, then x is a local/global min
o If ∇f(x) = 0 and ∇²f(x) is a ND matrix, then x is a local/global max
o If neither of these is true, then either x is a saddle point or the test fails; higher order derivatives are needed to verify
Whether the point is a saddle or the test has failed depends on the eigenvalues of ∇²f(x), as in the sketch below
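A sketch, assuming NumPy, that classifies a stationary point from the eigenvalues of its Hessian (all positive → PD → min, all negative → ND → max, mixed signs → saddle, any zero → the test is inconclusive):

```python
import numpy as np

def classify_stationary_point(H, tol=1e-10):
    """Classify a stationary point from its (symmetric) Hessian H."""
    eigvals = np.linalg.eigvalsh(H)
    if np.all(eigvals > tol):
        return "local/global min (PD Hessian)"
    if np.all(eigvals < -tol):
        return "local/global max (ND Hessian)"
    if np.any(eigvals > tol) and np.any(eigvals < -tol):
        return "saddle point (indefinite Hessian)"
    return "test fails: need higher-order derivatives"

# The three matrices match the Example - Hessians slide later in the deck
print(classify_stationary_point(np.array([[2.0, 0.125], [0.125, 2.0]])))     # min
print(classify_stationary_point(np.array([[-2.0, -0.25], [-0.25, -2.0]])))   # max
print(classify_stationary_point(np.array([[-2.0, -0.125], [-0.125, 2.0]])))  # saddle
```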
Example – Function Values
In this discrete example, we can calculate the gradient at a point (x₀, y₀) as
o ∇f(x₀, y₀) = (Δf/Δx, Δf/Δy), where
o Δf/Δx = [f(x₀+1, y₀) − f(x₀−1, y₀)] / 2
o Δf/Δy = [f(x₀, y₀+1) − f(x₀, y₀−1)] / 2
Example – Gradients
(Figure annotations: local max, local min, saddle point marked on the grid.)
In this discrete example, we can calculate the gradient at a point (x₀, y₀) as
o ∇f(x₀, y₀) = (Δf/Δx, Δf/Δy), where
o Δf/Δx = [f(x₀+1, y₀) − f(x₀−1, y₀)] / 2
o Δf/Δy = [f(x₀, y₀+1) − f(x₀, y₀−1)] / 2
Example – Gradients
(Figure annotations: gradients converge toward the local max; gradients diverge away from the local min; at saddle points, both can happen along different axes.)
In this discrete example, we can calculate the gradient at a point (x₀, y₀) as
o ∇f(x₀, y₀) = (Δf/Δx, Δf/Δy), where
o Δf/Δx = [f(x₀+1, y₀) − f(x₀−1, y₀)] / 2
o Δf/Δy = [f(x₀, y₀+1) − f(x₀, y₀−1)] / 2
o We can visualize these gradients using simple arrows as well
Example – Hessians
(Figure annotations:
  ∇²f = [ −2  −0.25 ; −0.25  −2 ], which is ND, i.e. a local max
  ∇²f = [ 2  0.125 ; 0.125  2 ], which is PD, i.e. a local min
  ∇²f = [ −2  −0.125 ; −0.125  2 ], which is neither PD nor ND, i.e. a saddle point)
In this discrete example, we can calculate the Hessian at (x₀, y₀) as
o ∇²f(x₀, y₀) = [ Δ²f/Δx²  Δ²f/ΔxΔy ; Δ²f/ΔxΔy  Δ²f/Δy² ], where
o Δ²f/Δx² = f(x₀+1, y₀) + f(x₀−1, y₀) − 2f(x₀, y₀)
o Δ²f/Δy² = f(x₀, y₀+1) + f(x₀, y₀−1) − 2f(x₀, y₀)
o Δ²f/ΔxΔy = (f_xy + f_yx) / 2, where
  ❑ f_xy = [Δf/Δx(x₀, y₀+1) − Δf/Δx(x₀, y₀−1)] / 2
  ❑ f_yx = [Δf/Δy(x₀+1, y₀) − Δf/Δy(x₀−1, y₀)] / 2
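The discrete formulas above translate directly into code; a sketch, assuming NumPy and a 2-D array F of function values sampled on an integer grid (so that F[x, y] = f(x, y)):

```python
import numpy as np

def discrete_gradient(F, x0, y0):
    """Central-difference gradient of the grid F at interior point (x0, y0)."""
    dfdx = (F[x0 + 1, y0] - F[x0 - 1, y0]) / 2
    dfdy = (F[x0, y0 + 1] - F[x0, y0 - 1]) / 2
    return np.array([dfdx, dfdy])

def discrete_hessian(F, x0, y0):
    """Second differences of F at (x0, y0), averaging the two mixed terms."""
    dxx = F[x0 + 1, y0] + F[x0 - 1, y0] - 2 * F[x0, y0]
    dyy = F[x0, y0 + 1] + F[x0, y0 - 1] - 2 * F[x0, y0]
    fxy = (discrete_gradient(F, x0, y0 + 1)[0] - discrete_gradient(F, x0, y0 - 1)[0]) / 2
    fyx = (discrete_gradient(F, x0 + 1, y0)[1] - discrete_gradient(F, x0 - 1, y0)[1]) / 2
    dxy = (fxy + fyx) / 2
    return np.array([[dxx, dxy], [dxy, dyy]])

# Example: sample f(x, y) = -(x - 5)^2 - (y - 5)^2, which has a local max at (5, 5)
xs, ys = np.meshgrid(np.arange(11), np.arange(11), indexing='ij')
F = -(xs - 5.0) ** 2 - (ys - 5.0) ** 2
print(discrete_gradient(F, 5, 5))   # [0. 0.] -> a stationary point
print(discrete_hessian(F, 5, 5))    # [[-2. 0.] [0. -2.]] -> ND, so a local max
```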
References
https://github.com/purushottamkar/ml19-20w/tree/master/lecture_slides
https://www.analyticsvidhya.com/blog/2019/10/mathematics-behind-machine-learning/
https://tminka.github.io/papers/matrix/minka-matrix.pdf