CSE465
Lecture 2
Linear Algebra and Probability Review
CSE465: Pattern Recognition and Neural Network
Sec: 3
Faculty: Silvia Ahmed (SvA)
Summer 2025
Today’s Topics
• Linear Algebra Essentials
• Vectors
• Matrices
• Dot Product
• Matrix Calculus for Deep Learning
• Scalar derivatives
• Vector derivatives
• Matrix derivatives
• DL context
• Probability Distributions for DL
• Review of basic probability concepts
• Key distributions
Why are Linear Algebra & Probability Critical for DL?
• Linear Algebra:
• The "language" of neural networks.
• Data represented as vectors and matrices.
• Network operations (transformations, weights) are matrix
multiplications.
• Probability:
• Understanding data distributions.
• Interpreting model outputs (e.g., classification probabilities).
• Formulating loss functions and regularization.
• Uncertainty modeling.
Linear Algebra Essentials
Vectors: Definition
• Ordered list of numbers (e.g., [x1, x2, x3]).
• In deep learning, vectors are commonly used to represent individual data points,
features of an input, or learned representations (embeddings).
• Definition: A vector v of dimension n can be written as:
$v = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix}$
• This is a column vector
[conventionally used in DL for inputs and features]
• A row vector would be $v^T = \begin{bmatrix} v_1 & v_2 & \cdots & v_n \end{bmatrix}$
• Geometric interpretation:
• a point in n-dimensional space
• or a directed line segment from the origin to that point.
Vectors: Operations
• Vector Addition: Element-wise sum of two vectors of the same dimension.
• Scalar Multiplication: Multiplying a vector by a scalar (a single number)
scales its magnitude.
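A minimal NumPy sketch of these two operations (the vector values are arbitrary examples, not from the slides):

    import numpy as np

    v = np.array([1.0, 2.0, 3.0])    # a 3-dimensional vector (as a 1-D array)
    w = np.array([4.0, 5.0, 6.0])

    print(v + w)      # vector addition: element-wise sum -> [5. 7. 9.]
    print(2.5 * v)    # scalar multiplication: scales each component -> [2.5 5.  7.5]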
Vectors: DL context
• An image can be "flattened" into a vector of pixel values
• A word can be represented by a "word embedding" vector
• The features describing a data point (e.g., height, weight, age)
form a feature vector
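To illustrate the first point, a small sketch (using a made-up 28×28 grayscale image) of flattening an image into a feature vector:

    import numpy as np

    image = np.random.rand(28, 28)   # a toy 28x28 grayscale image
    x = image.reshape(-1)            # flatten into a 784-dimensional vector
    print(x.shape)                   # (784,)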
Matrices
• Rectangular arrays of numbers (e.g., 2x3 matrix)
• Dimensions: rows x columns
• In deep learning, matrices are predominantly used to represent collections
of data points (mini-batches), weights connecting layers in a neural
network, or learned transformations.
• Definition: A matrix A with m rows and n columns (an m×n matrix) is written as:
$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}$
Special Matrices
• Identity matrix (I): A square matrix with ones on the main diagonal and
zeros elsewhere. When multiplied by another matrix, it leaves the matrix
unchanged.
• Diagonal Matrix: A square matrix where all off-diagonal elements are
zero.
• Symmetric Matrix: A square matrix where A = AT (transpose of A equals
A).
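A quick NumPy sketch of these special matrices (the entries are arbitrary examples):

    import numpy as np

    I = np.eye(3)                      # identity: ones on the main diagonal, zeros elsewhere
    D = np.diag([1.0, 2.0, 3.0])       # diagonal matrix
    A = np.array([[2.0, 1.0, 0.0],
                  [1.0, 3.0, 4.0],
                  [0.0, 4.0, 5.0]])    # symmetric matrix

    print(np.allclose(I @ A, A))       # True: multiplying by I leaves A unchanged
    print(np.allclose(A, A.T))         # True: A equals its transpose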
Matrices: Operations
• Matrix Addition/Subtraction: Element-wise operations,
requiring matrices to have the same dimensions.
• Scalar Multiplication: Multiplying every element of the matrix
by a scalar.
• Matrix Transpose (AT): Swapping rows and columns of a
matrix. If A is m x n, then AT is n x m.
Matrix Multiplication
• Not element-wise! Dot product of rows and columns
• Definition: The product of two matrices A (size: m × k) and B (size: k × n) results in a matrix C (size: m × n). Each element $c_{ij}$ of C is the dot product of the i-th row of A and the j-th column of B:
$c_{ij} = \sum_{l=1}^{k} a_{il} b_{lj}$
• Rules for Multiplication: For A×B, the number of columns in A must equal
the number of rows in B.
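A short NumPy check of this definition with arbitrary small matrices; the loop computes each $c_{ij}$ from the row-times-column rule and compares it with the built-in product:

    import numpy as np

    A = np.random.rand(2, 3)   # m x k  (m=2, k=3)
    B = np.random.rand(3, 4)   # k x n  (k=3, n=4)

    C = A @ B                  # resulting m x n matrix, here 2 x 4

    # c_ij = dot product of the i-th row of A and the j-th column of B
    C_manual = np.zeros((2, 4))
    for i in range(2):
        for j in range(4):
            C_manual[i, j] = np.sum(A[i, :] * B[:, j])

    print(np.allclose(C, C_manual))   # True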
Matrix Multiplication (contd.)
• Non-Commutativity: In general, AB ≠ BA. The order of multiplication
matters.
• Geometric Interpretation: Matrix multiplication can represent various
geometric transformations such as scaling, rotation, reflection, and
projection.
• DL Context:
• The core operation within a neural network layer: y=Wx+b, where x is the input vector,
W is the weight matrix, b is the bias vector, and y is the output vector.
• Processing mini-batches: If X is a matrix where each row is a data point (or each
column, depending on convention), and W is a weight matrix, then XW or WX
processes the entire batch efficiently.
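A sketch of the layer operation y = Wx + b and its mini-batch form, with made-up sizes (4 inputs, 3 outputs, batch of 8); rows-as-data-points is just one common convention:

    import numpy as np

    W = np.random.randn(3, 4)   # weight matrix: 3 outputs x 4 inputs
    b = np.random.randn(3)      # bias vector
    x = np.random.randn(4)      # a single input vector

    y = W @ x + b               # single-example output, shape (3,)

    X = np.random.randn(8, 4)   # mini-batch: 8 data points as rows
    Y = X @ W.T + b             # whole batch processed at once, shape (8, 3)

    print(y.shape, Y.shape)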
Dot Product (Vector Inner Product)
• Definition: For two vectors $a, b \in \mathbb{R}^n$, the dot product is the scalar $a \cdot b = a^T b = \sum_{i=1}^{n} a_i b_i$.
• Geometric interpretation: $a \cdot b = \|a\| \, \|b\| \cos\theta$, where θ is the angle between the vectors, so it measures how aligned they are.
Dot Product (Vector Inner Product) (contd.)
• DL Context:
• The "weighted sum" in a neuron's activation: The dot product of input features x and
weights w (i.e., w⋅x) forms the core of many activation functions before a non-linearity
is applied.
• Attention mechanisms in advanced architectures use dot products to compute
similarity scores between different parts of the input.
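A minimal sketch of the weighted-sum idea, with arbitrary weights and inputs and a sigmoid as the example non-linearity:

    import numpy as np

    w = np.array([0.5, -1.2, 0.3])   # weights of a single neuron
    x = np.array([1.0, 2.0, -1.0])   # input features
    b = 0.1                          # bias

    z = np.dot(w, x) + b             # pre-activation: w . x + b
    a = 1.0 / (1.0 + np.exp(-z))     # sigmoid non-linearity applied afterwards
    print(z, a)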
Matrix Calculus for Deep Learning
Matrix Calculus - Introduction
• Why is this hard but crucial?
• Deep Learning involves optimizing functions (loss functions) with millions of
parameters.
• We need to compute gradients to update these parameters efficiently.
• "Backpropagation" is essentially repeated application of the chain rule.
• Review:
• Scalar Derivatives: $\frac{dy}{dx}$ (slope)
• Chain Rule: $\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}$ (how changes propagate)
Matrix Calculus - Vector Derivatives
• Gradient ($\nabla_x f(x)$): Derivative of a scalar function (f) with respect to a vector input (x).
• Result: A vector of partial derivatives:
$\nabla_x f(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1} & \frac{\partial f}{\partial x_2} & \cdots & \frac{\partial f}{\partial x_n} \end{bmatrix}^T$
Vector Derivatives (contd.)
• Interpretation: The gradient vector points in the direction of the
steepest increase of the function 𝑓. In optimization algorithms
like gradient descent, we move in the opposite direction of the
gradient to find a minimum.
• DL Context: This is used to compute the gradients of the loss
function (a scalar value) with respect to the network's
parameters (weights and biases, which are vectors or matrices).
For instance, ∇𝑤 ℒ tells us how to adjust the weight vector w to
reduce the loss ℒ.
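A toy sketch of "move opposite to the gradient" for f(x) = x1² + x2², whose gradient is [2x1, 2x2]; the step size 0.1 and the starting point are arbitrary choices:

    import numpy as np

    def f(x):
        return x[0]**2 + x[1]**2

    def grad_f(x):
        return np.array([2 * x[0], 2 * x[1]])   # gradient of f

    x = np.array([3.0, -2.0])
    for _ in range(50):
        x = x - 0.1 * grad_f(x)   # gradient descent: step against the gradient

    print(x, f(x))                # approaches the minimum at (0, 0)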
Vector Derivatives (contd.)
• Jacobian Matrix (J): Derivative of a vector-valued function with respect to a vector input. For $f: \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian is the $m \times n$ matrix with entries $J_{ij} = \frac{\partial f_i}{\partial x_j}$.
Vector Derivatives (contd.)
• DL Context: Jacobians are essential for understanding how errors propagate between layers. If we have a sequence of operations like $z = g(y)$ and $y = f(x)$, then $\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}$ (using the chain rule), where these "derivatives" are actually Jacobian matrices.
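A small numerical sketch of composing Jacobians for z = g(y), y = f(x); the two maps here are hypothetical linear functions, so their Jacobians are just the matrices A and B themselves:

    import numpy as np

    A = np.random.randn(3, 4)   # y = f(x) = A x, so dy/dx (Jacobian of f) is A
    B = np.random.randn(2, 3)   # z = g(y) = B y, so dz/dy (Jacobian of g) is B

    J = B @ A                   # chain rule: dz/dx = (dz/dy)(dy/dx), a 2 x 4 Jacobian

    x = np.random.randn(4)
    print(np.allclose(B @ (A @ x), J @ x))   # True: same overall linear map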
Matrix Derivatives
• In deep learning, we frequently need to compute the derivative of a scalar
loss function with respect to a matrix of weights.
• Derivative of a scalar with respect to a matrix: $\left( \frac{\partial \mathcal{L}}{\partial W} \right)_{ij} = \frac{\partial \mathcal{L}}{\partial W_{ij}}$, which has the same shape as W.
Backpropagation as Chain Rule Application
[Figure: a small two-layer network. Inputs $x_{i1}, x_{i2}$ feed two hidden units through layer-1 weights $w^{(1)}_{11}, w^{(1)}_{12}, w^{(1)}_{21}, w^{(1)}_{22}$ and biases $b^{(1)}_1, b^{(1)}_2$; the hidden units feed the output $\hat{y}_i$ through layer-2 weights $w^{(2)}_{11}, w^{(2)}_{21}$ and bias $b^{(2)}_1$.]
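A sketch of backpropagation for a tiny 2-2-1 network like the one in the figure, assuming a sigmoid hidden layer, a linear output, and a squared-error loss (the slide does not fix these choices, and all numbers are arbitrary):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    x = np.array([0.5, -1.0])        # one input (x_i1, x_i2)
    y = 1.0                          # its target

    W1 = np.array([[0.1, 0.4],
                   [-0.3, 0.2]])     # layer-1 weights
    b1 = np.array([0.0, 0.1])        # layer-1 biases
    W2 = np.array([0.7, -0.5])       # layer-2 weights
    b2 = 0.2                         # layer-2 bias

    # forward pass
    z1 = W1 @ x + b1
    h = sigmoid(z1)
    y_hat = W2 @ h + b2
    loss = 0.5 * (y_hat - y) ** 2

    # backward pass: repeated application of the chain rule
    d_yhat = y_hat - y               # dL/dy_hat
    dW2 = d_yhat * h                 # dL/dW2
    db2 = d_yhat                     # dL/db2
    dh = d_yhat * W2                 # dL/dh
    dz1 = dh * h * (1 - h)           # through the sigmoid: dL/dz1
    dW1 = np.outer(dz1, x)           # dL/dW1
    db1 = dz1                        # dL/db1

    print(loss, dW1, db1, dW2, db2)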
Probability Distributions for Deep Learning
Probability Review - Basics
• Random Variables (RV): Outcomes of random phenomena.
• Discrete: Countable outcomes (e.g., coin flip).
• Continuous: Infinite outcomes within a range (e.g., height,
temperature).
• Probability Mass Function (PMF): For discrete RVs, it gives
the probability that the random variable takes on a specific
value. P(X = x).
• Probability Density Function (PDF): For continuous RVs, it describes the likelihood of the random variable falling within a particular range of values. The probability of X being in an interval [a, b] is $\int_a^b f(x)\,dx$.
Probability Review – Basics (contd.)
• Expectation (E[X]): The average or mean value of a random
variable.
• For discrete RV: E[X]=∑xP(X=x).
• For continuous RV: E[X]=∫xf(x)dx.
• Variance (Var[X]): A measure of how spread out the values of a
random variable are from its mean.
• $\mathrm{Var}[X] = E[(X - E[X])^2]$.
• Standard Deviation is $\sigma = \sqrt{\mathrm{Var}[X]}$
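A quick NumPy check of these definitions for a fair six-sided die (a standard example, not from the slides):

    import numpy as np

    values = np.arange(1, 7)             # possible outcomes of a fair die
    p = np.full(6, 1 / 6)                # uniform PMF

    E = np.sum(values * p)               # expectation: sum of x * P(X = x)
    Var = np.sum((values - E)**2 * p)    # variance: E[(X - E[X])^2]
    sigma = np.sqrt(Var)                 # standard deviation

    print(E, Var, sigma)                 # 3.5, ~2.9167, ~1.7078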
Key Probability Distributions for DL - Bernoulli
• Describes a single experiment with only two possible outcomes (e.g.,
success/failure, 0/1), where the probability of "success" is p.
• PMF:
$P(X = k) = \begin{cases} p & \text{if } k = 1 \\ 1 - p & \text{if } k = 0 \end{cases}$
This can also be written as $P(X = k) = p^k (1 - p)^{1-k}$
• DL context:
• Binary Classification: Output of a sigmoid layer can be interpreted as p.
• Dropout: Each neuron is independently "dropped" (set to 0) with a Bernoulli
probability.
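A minimal sketch of the dropout connection: each unit is kept with Bernoulli probability p (0.8 here is an arbitrary choice), and the 1/p rescaling follows the common "inverted dropout" convention:

    import numpy as np

    rng = np.random.default_rng(0)
    p = 0.8                          # probability of keeping a unit
    h = rng.standard_normal(10)      # some layer activations

    mask = rng.random(10) < p        # one Bernoulli(p) sample per unit
    h_dropped = h * mask / p         # inverted dropout: rescale the kept units

    print(mask.astype(int))
    print(h_dropped)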
Key Probability Distributions for DL - Categorical
• A generalization of the Bernoulli distribution for discrete random variables
with more than two possible outcomes (e.g., rolling a die, classifying an
image into one of several categories).
• It has k mutually exclusive possible outcomes, each with a specific probability $p_i$, where $\sum_{i=1}^{k} p_i = 1$.
• PMF: $P(X = i) = p_i$ for $i \in \{1, 2, \ldots, k\}$.
• DL context:
• Multi-class Classification: Output of a softmax layer gives probabilities for each
class.
• Language Modeling: Predicting the next word from a vocabulary.
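A short sketch of turning raw scores (logits) into Categorical class probabilities with a softmax; the logit values are arbitrary:

    import numpy as np

    logits = np.array([2.0, 1.0, 0.1])      # raw scores for 3 classes
    probs = np.exp(logits - logits.max())   # subtract the max for numerical stability
    probs /= probs.sum()                    # softmax: probabilities sum to 1

    print(probs, probs.sum())               # ~[0.659 0.242 0.099], 1.0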
Key Probability Distributions for DL - Normal (Gaussian)
• The "bell curve" distribution.
• Parameters: Mean (μ) and Standard Deviation (σ).
• PDF: $f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$
• DL context:
• Weight Initialization: Often initialized from Normal distributions (e.g., Xavier, He).
• Generative Models: VAEs often model latent spaces or outputs as Gaussian.
• Loss Functions: L2 Loss (MSE) is related to maximizing likelihood under Gaussian
noise.
• Regularization: Adding Gaussian noise to inputs/activations.
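A sketch of Normal-based weight initialization; the sqrt(2 / fan_in) standard deviation is the He-style choice mentioned above, and the layer sizes are made up:

    import numpy as np

    rng = np.random.default_rng(42)
    fan_in, fan_out = 256, 128

    # zero-mean Gaussian with std = sqrt(2 / fan_in)
    W = rng.normal(loc=0.0, scale=np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

    print(W.mean(), W.std())   # close to 0 and sqrt(2/256) ≈ 0.088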
Summary
• Linear Algebra:
• Matrix Multiplication: The fundamental operation of neural networks.
• Matrix Calculus / Gradients: How neural networks learn
(backpropagation).
• Chain Rule: The mathematical backbone of backpropagation.
• Probability:
• Bernoulli: Binary outcomes, dropout.
• Categorical (Softmax): Multi-class classification.
• Normal: Weight initialization, generative models, noise.