
CSE465

Lecture 2

Linear Algebra and Probability


Review

CSE465: Pattern Recognition and Neural Network


Sec: 3
Faculty: Silvia Ahmed (SvA)
Summer 2025
Today’s Topics
• Linear Algebra Essentials
• Vectors
• Matrices
• Dot Product
• Matrix Calculus for Deep Learning
• Scalar derivatives
• Vector derivatives
• Matrix derivatives
• DL context
• Probability Distributions for DL
• Review of basic probability concepts
• Key distributions



Why are Linear Algebra & Probability Critical for DL?
• Linear Algebra:
• The "language" of neural networks.
• Data represented as vectors and matrices.
• Network operations (transformations, weights) are matrix
multiplications.
• Probability:
• Understanding data distributions.
• Interpreting model outputs (e.g., classification probabilities).
• Formulating loss functions and regularization.
• Uncertainty modeling.




Linear Algebra Essentials

Vectors: Definition
• Ordered list of numbers (e.g., [x1, x2, x3]).
• In deep learning, vectors are commonly used to represent individual data points,
features of an input, or learned representations (embeddings).
• Definition: A vector $v$ of dimension $n$ can be written as
$v = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix}$
• This is a column vector (conventionally used in DL for inputs and features).
• A row vector would be $v^T = \begin{bmatrix} v_1 & v_2 & \cdots & v_n \end{bmatrix}$.
• Geometric interpretation:
• a point in n-dimensional space
• or a directed line segment from the origin to that point.



Vectors: Operations
• Vector Addition: Element-wise sum of two vectors of the same dimension.

• Scalar Multiplication: Multiplying a vector by a scalar (a single number)


scales its magnitude.

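A minimal NumPy sketch of these two operations; the vectors and the scalar are made-up values, not taken from the slides:

```python
import numpy as np

v = np.array([1.0, 2.0, 3.0])   # a 3-dimensional vector
u = np.array([4.0, 5.0, 6.0])

# Vector addition: element-wise sum (requires equal dimensions)
print(v + u)        # [5. 7. 9.]

# Scalar multiplication: scales the magnitude of the vector
print(2.5 * v)      # [2.5 5. 7.5]
```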


Vectors: DL context
• An image can be "flattened" into a vector of pixel values
• A word can be represented by a "word embedding" vector
• The features describing a data point (e.g., height, weight, age)
form a feature vector



Matrices
• Rectangular arrays of numbers (e.g., 2x3 matrix)
• Dimensions: rows x columns
• In deep learning, matrices are predominantly used to represent collections
of data points (mini-batches), weights connecting layers in a neural
network, or learned transformations.
• Definition: A matrix A with m rows and n columns (an m×n matrix) is written as:
$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}$



Special Matrices
• Identity matrix (I): A square matrix with ones on the main diagonal and
zeros elsewhere. When multiplied by another matrix, it leaves the matrix
unchanged.

• Diagonal Matrix: A square matrix where all off-diagonal elements are zero.
• Symmetric Matrix: A square matrix where $A = A^T$ (the transpose of A equals A).

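A quick NumPy illustration of these special matrices (the sizes and values are arbitrary):

```python
import numpy as np

I = np.eye(3)                      # 3x3 identity matrix
D = np.diag([2.0, 5.0, 7.0])       # diagonal matrix built from a vector

A = np.array([[1.0, 2.0],
              [2.0, 3.0]])         # symmetric: A equals its transpose
print(np.array_equal(A, A.T))      # True

# Multiplying by the identity leaves a matrix unchanged
print(np.allclose(I @ D, D))       # True
```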


Matrices: Operations
• Matrix Addition/Subtraction: Element-wise operations,
requiring matrices to have the same dimensions.
• Scalar Multiplication: Multiplying every element of the matrix
by a scalar.
• Matrix Transpose ($A^T$): Swapping rows and columns of a matrix. If A is m × n, then $A^T$ is n × m.



Matrix Multiplication
• Not element-wise! Dot product of rows and columns
• Definition: The product of two matrices A (size: m × k) and B (size: k × n) results in a matrix C (size: m × n). Each element $c_{ij}$ of C is the dot product of the i-th row of A and the j-th column of B:
$c_{ij} = \sum_{l=1}^{k} a_{il} b_{lj}$

• Rules for Multiplication: For A×B, the number of columns in A must equal
the number of rows in B.

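A small NumPy sketch of the shape rule, using arbitrary 2×3 and 3×4 matrices:

```python
import numpy as np

A = np.arange(6).reshape(2, 3)     # shape (2, 3): m=2, k=3
B = np.arange(12).reshape(3, 4)    # shape (3, 4): k=3, n=4

C = A @ B                          # each c_ij is row i of A dotted with column j of B
print(C.shape)                     # (2, 4)

# B @ A would raise an error here: (3, 4) x (2, 3) has mismatched inner dimensions
```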


Matrix Multiplication (contd.)
• Non-Commutativity: In general, AB ≠ BA. The order of multiplication
matters.
• Geometric Interpretation: Matrix multiplication can represent various
geometric transformations such as scaling, rotation, reflection, and
projection.
• DL Context:
• The core operation within a neural network layer: y=Wx+b, where x is the input vector,
W is the weight matrix, b is the bias vector, and y is the output vector.
• Processing mini-batches: If X is a matrix where each row is a data point (or each
column, depending on convention), and W is a weight matrix, then XW or WX
processes the entire batch efficiently.

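A hedged sketch of the layer operation y = Wx + b and of mini-batch processing. The sizes (3 inputs, 2 outputs, a batch of 4) are arbitrary, and rows-as-data-points is just one of the two conventions the slide mentions:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))        # weight matrix: 3 inputs -> 2 outputs
b = rng.normal(size=2)             # bias vector
x = rng.normal(size=3)             # a single input vector

y = W @ x + b                      # single-example layer output, shape (2,)

X = rng.normal(size=(4, 3))        # mini-batch: each row is one data point
Y = X @ W.T + b                    # whole batch processed at once, shape (4, 2)
print(y.shape, Y.shape)
```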


Dot Product (Vector Inner Product)
• Definition: For two vectors $a, b \in \mathbb{R}^n$, the dot product is the scalar $a \cdot b = a^T b = \sum_{i=1}^{n} a_i b_i$.
• Geometric interpretation: $a \cdot b = \lVert a \rVert \, \lVert b \rVert \cos\theta$, so it measures how strongly the two vectors align.


Dot Product (Vector Inner Product) (contd.)
• DL Context:
• The "weighted sum" in a neuron: the dot product of the input features x and the weights w (i.e., w⋅x) is the pre-activation value to which a non-linearity is then applied.
• Attention mechanisms in advanced architectures use dot products to compute similarity scores between different parts of the input.

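A minimal NumPy sketch of both uses; the feature, weight, query, and key vectors are illustrative, and the sigmoid non-linearity is an assumed example, not something the slide prescribes:

```python
import numpy as np

x = np.array([0.5, -1.2, 3.0])     # input features
w = np.array([0.8,  0.1, -0.4])    # neuron weights

z = np.dot(w, x)                   # weighted sum (pre-activation)
a = 1.0 / (1.0 + np.exp(-z))       # e.g., a sigmoid non-linearity applied afterwards

q = np.array([1.0, 0.0, 1.0])      # a "query" vector
k = np.array([0.9, 0.2, 0.8])      # a "key" vector
score = np.dot(q, k)               # dot-product similarity, as used in attention
print(z, a, score)
```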



Matrix Calculus for Deep Learning

Matrix Calculus - Introduction
• Why is this hard but crucial?
• Deep Learning involves optimizing functions (loss functions) with millions of
parameters.
• We need to compute gradients to update these parameters efficiently.
• "Backpropagation" is essentially repeated application of the chain rule.

• Review:
• Scalar Derivatives: $\frac{dy}{dx}$ (the slope)
• Chain Rule: $\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}$ (how changes propagate)

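A tiny numeric check of the scalar chain rule, using the made-up composition z = y² with y = 3x + 2:

```python
# z = y**2 with y = 3x + 2, so dz/dx = (dz/dy)*(dy/dx) = 2y * 3
x = 1.5
y = 3 * x + 2
analytic = 2 * y * 3               # chain rule: dz/dy = 2y, dy/dx = 3

eps = 1e-6                         # central finite-difference approximation of dz/dx
numeric = ((3 * (x + eps) + 2) ** 2 - (3 * (x - eps) + 2) ** 2) / (2 * eps)
print(analytic, numeric)           # both ~39.0
```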


Matrix Calculus - Vector Derivatives
• Gradient ($\nabla_x f(x)$): Derivative of a scalar function $f$ with respect to a vector input $x$.
• Result: a vector of partial derivatives, $\nabla_x f(x) = \left[ \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \dots, \frac{\partial f}{\partial x_n} \right]^T$.



Vector Derivatives (contd.)
• Interpretation: The gradient vector points in the direction of the
steepest increase of the function 𝑓. In optimization algorithms
like gradient descent, we move in the opposite direction of the
gradient to find a minimum.
• DL Context: This is used to compute the gradients of the loss
function (a scalar value) with respect to the network's
parameters (weights and biases, which are vectors or matrices).
For instance, $\nabla_w \mathcal{L}$ tells us how to adjust the weight vector w to reduce the loss $\mathcal{L}$.

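A minimal sketch of one gradient-descent step. The loss f(w) = ‖w‖² (whose gradient is 2w) and the learning rate are illustrative choices, not from the slides:

```python
import numpy as np

w = np.array([2.0, -1.0, 0.5])          # current weight vector
grad = 2 * w                            # gradient of f(w) = ||w||^2

lr = 0.1                                # learning rate
w_new = w - lr * grad                   # move opposite to the gradient (steepest descent)
print(np.sum(w**2), np.sum(w_new**2))   # loss decreases: 5.25 -> 3.36
```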


Vector Derivatives (contd.)
• Jacobian Matrix (J): Derivative of a vector-valued function with respect to a vector input. For $f: \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian is the $m \times n$ matrix of partial derivatives with entries $J_{ij} = \frac{\partial f_i}{\partial x_j}$.



Vector Derivatives (contd.)
• DL Context: Jacobians are essential for understanding how errors propagate between layers. If we have a sequence of operations like $z = g(y)$ and $y = f(x)$, then $\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}$ (using the chain rule), where these "derivatives" are actually Jacobian matrices.

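A small numeric sketch of the Jacobian chain rule for the special case of two linear maps y = Ax and z = By, where the Jacobians are simply the matrices A and B (chosen arbitrarily):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [3.0, -1.0]])        # Jacobian of y = A x  (3x2)
B = np.array([[2.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])    # Jacobian of z = B y  (2x3)

J = B @ A                          # chain rule: dz/dx = (dz/dy)(dy/dx), a 2x2 matrix
print(J)
```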


Matrix Derivatives
• In deep learning, we frequently need to compute the derivative of a scalar
loss function with respect to a matrix of weights.
• Derivative of a scalar with respect to a matrix: $\left( \frac{\partial \mathcal{L}}{\partial W} \right)_{ij} = \frac{\partial \mathcal{L}}{\partial W_{ij}}$, a matrix of partial derivatives with the same shape as $W$.
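A hedged NumPy sketch: for the illustrative loss L = ‖XW − Y‖² (not a formula from the slides), the gradient with respect to W is 2Xᵀ(XW − Y), which has the same shape as W; a finite-difference check on one entry confirms it:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))
Y = rng.normal(size=(5, 2))
W = rng.normal(size=(3, 2))

loss = lambda W: np.sum((X @ W - Y) ** 2)
grad = 2 * X.T @ (X @ W - Y)       # dL/dW, same shape as W (3x2)

eps = 1e-6                         # finite-difference check of one entry
E = np.zeros_like(W); E[0, 1] = eps
print(grad[0, 1], (loss(W + E) - loss(W - E)) / (2 * eps))
```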


Backpropagation as Chain Rule Application
[Diagram: a 2-input, 2-hidden-unit, 1-output network with inputs $x_{i1}, x_{i2}$, first-layer weights $w^{(1)}_{11}, w^{(1)}_{12}, w^{(1)}_{21}, w^{(1)}_{22}$ and biases $b_{11}, b_{12}$, second-layer weights $w^{(2)}_{11}, w^{(2)}_{21}$ and bias $b_{21}$, and prediction $\hat{y}_i$.]
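A minimal end-to-end sketch of backpropagation for a 2-input, 2-hidden-unit, 1-output network like the one in the diagram. The sigmoid activations and squared-error loss are assumptions; the slide does not specify them:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)   # hidden-layer parameters
W2, b2 = rng.normal(size=(1, 2)), np.zeros(1)   # output-layer parameters
x, y = np.array([0.5, -1.0]), np.array([1.0])   # one training example

# Forward pass
z1 = W1 @ x + b1; h = sigmoid(z1)
z2 = W2 @ h + b2; y_hat = sigmoid(z2)
loss = 0.5 * np.sum((y_hat - y) ** 2)

# Backward pass: repeated application of the chain rule
dz2 = (y_hat - y) * y_hat * (1 - y_hat)         # dL/dz2
dW2, db2 = np.outer(dz2, h), dz2                # gradients for the output layer
dh = W2.T @ dz2                                 # propagate the error to the hidden layer
dz1 = dh * h * (1 - h)                          # dL/dz1
dW1, db1 = np.outer(dz1, x), dz1                # gradients for the hidden layer
print(loss, dW1.shape, dW2.shape)
```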



Probability Distributions for Deep Learning

Probability Review - Basics
• Random Variables (RV): Outcomes of random phenomena.
• Discrete: Countable outcomes (e.g., coin flip).
• Continuous: Infinite outcomes within a range (e.g., height,
temperature).
• Probability Mass Function (PMF): For discrete RVs, it gives
the probability that the random variable takes on a specific
value. P(X = x).
• Probability Density Function (PDF): For continuous RVs, it describes the likelihood of the random variable falling within a particular range of values. The probability of X being in an interval [a, b] is $\int_a^b f(x)\,dx$.



Probability Review – Basics (contd.)
• Expectation (E[X]): The average or mean value of a random
variable.
• For a discrete RV: $E[X] = \sum_x x \, P(X = x)$.
• For a continuous RV: $E[X] = \int x f(x) \, dx$.
• Variance (Var[X]): A measure of how spread out the values of a random variable are from its mean.
• $\mathrm{Var}[X] = E\left[ (X - E[X])^2 \right]$.
• Standard Deviation: $\sigma = \sqrt{\mathrm{Var}[X]}$.

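A quick NumPy check of these formulas on a made-up discrete random variable, a fair six-sided die:

```python
import numpy as np

x = np.arange(1, 7)                # possible outcomes of a fair die
p = np.full(6, 1 / 6)              # P(X = x) for each outcome

E = np.sum(x * p)                  # E[X] = sum_x x * P(X = x)   -> 3.5
Var = np.sum((x - E) ** 2 * p)     # Var[X] = E[(X - E[X])^2]    -> ~2.917
sigma = np.sqrt(Var)               # standard deviation
print(E, Var, sigma)
```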


Key Probability Distributions for DL - Bernoulli
• Describes a single experiment with only two possible outcomes (e.g.,
success/failure, 0/1), where the probability of "success" is p.
• PMF:
$P(X = k) = \begin{cases} p, & \text{if } k = 1 \\ 1 - p, & \text{if } k = 0 \end{cases}$
This can also be written as $P(X = k) = p^k (1 - p)^{1-k}$.

• DL context:
• Binary Classification: Output of a sigmoid layer can be interpreted as p.
• Dropout: Each neuron is independently "dropped" (set to 0) with a Bernoulli
probability.

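A small sketch of both DL uses. The drop probability and activations are arbitrary, and the dropout shown is the simplest inverted-dropout form, not a full framework implementation:

```python
import numpy as np

rng = np.random.default_rng(3)

# Bernoulli PMF: P(X = k) = p**k * (1 - p)**(1 - k) for k in {0, 1}
p = 0.7
print(p**1 * (1 - p)**0, p**0 * (1 - p)**1)     # 0.7, 0.3

# Dropout: each activation is kept with probability keep_prob (a Bernoulli mask)
keep_prob = 0.8
h = rng.normal(size=5)                          # some layer activations
mask = rng.binomial(1, keep_prob, size=h.shape)
h_dropped = h * mask / keep_prob                # rescale to preserve the expected value
print(h_dropped)
```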


Key Probability Distributions for DL - Categorical
• A generalization of the Bernoulli distribution for discrete random variables
with more than two possible outcomes (e.g., rolling a die, classifying an
image into one of several categories).
• It has k mutually exclusive possible outcomes, each with a specific probability $p_i$, where $\sum_{i=1}^{k} p_i = 1$.
• PMF: $P(X = i) = p_i$ for $i \in \{1, 2, \dots, k\}$.
• DL context:
• Multi-class Classification: Output of a softmax layer gives probabilities for each
class.
• Language Modeling: Predicting the next word from a vocabulary.

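A minimal softmax sketch turning arbitrary logits into categorical class probabilities that sum to 1, then sampling one outcome:

```python
import numpy as np

logits = np.array([2.0, 1.0, 0.1])             # raw scores for k = 3 classes

z = logits - np.max(logits)                    # subtract the max for numerical stability
probs = np.exp(z) / np.sum(np.exp(z))          # categorical probabilities p_i
print(probs, probs.sum())                      # probabilities sum to 1.0

rng = np.random.default_rng(4)
label = rng.choice(len(probs), p=probs)        # sample one outcome ~ Categorical(p)
print(label)
```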


Key Probability Distributions for DL - Normal (Gaussian)
• The "bell curve" distribution.
• Parameters: Mean (μ) and Standard Deviation (σ).
• PDF: $f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$

• DL context:
• Weight Initialization: Often initialized from Normal distributions (e.g., Xavier, He).
• Generative Models: VAEs often model latent spaces or outputs as Gaussian.
• Loss Functions: L2 Loss (MSE) is related to maximizing likelihood under Gaussian
noise.
• Regularization: Adding Gaussian noise to inputs/activations.

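A hedged sketch of drawing weights from a Normal distribution. The He-style standard deviation sqrt(2 / fan_in) and the layer sizes are illustrative assumptions; the slide only names the initialization schemes:

```python
import numpy as np

rng = np.random.default_rng(5)
fan_in, fan_out = 256, 128                      # layer sizes (arbitrary)

std = np.sqrt(2.0 / fan_in)                     # He-style standard deviation (assumed formula)
W = rng.normal(loc=0.0, scale=std, size=(fan_out, fan_in))

print(W.mean(), W.std())                        # roughly 0 and ~0.088
```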


Summary
• Linear Algebra:
• Matrix Multiplication: The fundamental operation of neural networks.
• Matrix Calculus / Gradients: How neural networks learn
(backpropagation).
• Chain Rule: The mathematical backbone of backpropagation.
• Probability:
• Bernoulli: Binary outcomes, dropout.
• Categorical (Softmax): Multi-class classification.
• Normal: Weight initialization, generative models, noise.

