Big Data Visual Analytics (CS 661)
Instructor: Soumya Dutta
Department of Computer Science and Engineering
Indian Institute of Technology Kanpur (IITK)
email: soumyad@[Link]
Study Materials for Lecture 13
• [Link]
• [Link]
• A Gentle Tutorial of the EM Algorithm and its Application to
Parameter Estimation for Gaussian Mixture and Hidden Markov
Models
• EM Algorithm:
[Link]
IITK CS661: Big Data Visual Analytics: Soumya Dutta 2
Final Project
• Form your project team by Feb 28 and update the google sheet with
details of project members
• [Link]
PE_4u6pNb4FzzjbA/edit?usp=sharing
• Group size: 8-9 (8 preferred)
• I will randomly assign anyone who is not part of a team to a new team
• Check the guideline document on HelloIITK for more details about the
project and how to proceed
• Carefully read the instructions
Random Variables and Distributions
Random Variable
• Let S be a sample space of an experiment
• S is associated with a probability measure P
• A random variable X is a real valued function on S
• Key property: it is a function whose values have probabilities attached
to them
Random Variable: Example
• Let us flip a fair coin three times
• Sample space S = {hhh, hht, hth, htt, thh, tht, tth, ttt}
• Assume X is a function on S, so that X is the number of heads (h)
• So, we have,
• {hhh → 3, hht → 2, hth → 2, htt → 1, thh → 2, tht → 1, tth → 1, ttt → 0}
• X is a random variable
Random Variable: Example
• We can answer questions like:
• P(X=0) = P(ttt) = 1/8
• P(X = 1) = P(htt ) + P(tht ) + P(tth) = 3/8
• P(X = 2) = P(hht ) + P(hth) + P(thh) = 3/8
• P(X = 3) = P(hhh) = 1/8
• We can tabulate it:
x          0     1     2     3
P(X = x)   1/8   3/8   3/8   1/8
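The coin-flip example above can be checked by brute-force enumeration. The following is a minimal Python sketch that is not part of the original slides; the names `sample_space`, `num_heads`, and `pmf` are illustrative choices.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space S: all sequences of three flips, each equally likely (fair coin).
sample_space = ["".join(flips) for flips in product("ht", repeat=3)]

def num_heads(outcome):
    """The random variable X: maps an outcome in S to its number of heads."""
    return outcome.count("h")

# P(X = k) = (number of outcomes with k heads) / |S|
counts = Counter(num_heads(outcome) for outcome in sample_space)
pmf = {k: Fraction(c, len(sample_space)) for k, c in sorted(counts.items())}

for k, p in pmf.items():
    print(f"P(X = {k}) = {p}")
```

Exact fractions are used so the result can be compared directly with the hand-computed values 1/8, 3/8, 3/8, 1/8.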
Random Variable (RV): Example
• Rolling a fair die
• Assume an RV: X = the number that comes up
• X takes the values 1, 2, 3, 4, 5, 6, each with probability 1/6
Discrete and Continuous Random Variable
• A random variable is said to be discrete if its set of possible values is a
discrete set
• Example: Rolling a fair die and measuring the value that shows up
• A random variable is said to be continuous when it can assume an
uncountable number of values
• Example: depth of a pool, height of an adult male, etc.
Expected Value and Variance of a Discrete RV
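• Expected Value (mean): $\mu = E[X] = \sum_i x_i \, p(x_i)$
• Variance: $\sigma^2 = \mathrm{Var}(X) = \sum_i (x_i - \mu)^2 \, p(x_i)$
• Standard Deviation: $\sigma = \sqrt{\mathrm{Var}(X)}$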
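As a worked instance, the sketch below computes these three quantities for the fair die from the earlier example, assuming the usual definitions E[X] = Σ x·p(x) and Var(X) = Σ (x − μ)²·p(x). The function names and the dictionary representation of the PMF are illustrative assumptions, not something the slides prescribe.

```python
import math
from fractions import Fraction

def expected_value(pmf):
    """E[X] = sum of x * p(x) over all values x."""
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """Var(X) = sum of (x - mu)^2 * p(x) over all values x."""
    mu = expected_value(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

# PMF of a fair six-sided die: each face has probability 1/6.
die_pmf = {x: Fraction(1, 6) for x in range(1, 7)}

mean = expected_value(die_pmf)   # exact value: 7/2
var = variance(die_pmf)          # exact value: 35/12
std = math.sqrt(var)             # standard deviation as a float

print(mean, var, round(std, 4))
```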
Expected Value and Variance of a Continuous RV
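• Expected Value (mean): $\mu = E[X] = \int_{-\infty}^{\infty} x \, f(x)\,dx$
• Variance: $\sigma^2 = \mathrm{Var}(X) = \int_{-\infty}^{\infty} (x - \mu)^2 \, f(x)\,dx$
• Standard Deviation: $\sigma = \sqrt{\mathrm{Var}(X)}$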
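For a continuous RV the sums become integrals, E[X] = ∫ x·f(x) dx and Var(X) = ∫ (x − μ)²·f(x) dx. The sketch below approximates them numerically for an exponential density f(x) = λ·exp(−λx) on x ≥ 0. The rate λ = 2, the truncation point, and the midpoint-rule helper `integrate` are all assumptions made for illustration; the closed-form answers for this density are 1/λ and 1/λ².

```python
import math

def integrate(g, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / n
    return h * sum(g(a + (i + 0.5) * h) for i in range(n))

rate = 2.0     # hypothetical rate parameter (lambda)
upper = 40.0   # truncation point; the tail beyond it is negligible for rate = 2

def pdf(x):
    """Exponential density on x >= 0."""
    return rate * math.exp(-rate * x)

total = integrate(pdf, 0.0, upper)                              # should be close to 1
mean = integrate(lambda x: x * pdf(x), 0.0, upper)              # close to 1 / rate
var = integrate(lambda x: (x - mean) ** 2 * pdf(x), 0.0, upper) # close to 1 / rate**2
std = math.sqrt(var)

print(round(total, 4), round(mean, 4), round(var, 4), round(std, 4))
```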
Probability Distribution Function
• A probability distribution function is a mathematical function that
provides probabilities of occurrence for the possible outcomes of a
random variable
• Probability Mass Function (PMF): The probability distribution of a
discrete random variable is called a probability mass function
• Probability Density Function (PDF): The probability distribution of a
continuous random variable is called a probability density function
Probability Distribution Function: Properties
• Discrete case: PMF
1. $0 \le p(x) \le 1$ for all $x$, and $\sum_x p(x) = 1$
2. $p(x) = 0$ for all $x$ outside a discrete range
• Continuous case: PDF
1. $f(x) \ge 0$ for all $x$, and $\int_{-\infty}^{\infty} f(x)\,dx = 1$
2. Probability is evaluated as area under the curve: $P(a \le X \le b) = \int_a^b f(x)\,dx$
3. $P(X = c) = 0$: the probability that $X$ takes on any individual value is zero. The area below the curve between $x = c$ and $x = c$ has no width, and therefore no area.
[Figure: PMF (probability vs. data values) and PDF (probability density vs. data values)]
Cumulative Distribution Function (CDF)
• Discrete RV: $F(x) = P(X \le x) = \sum_{x_i \le x} p(x_i)$, a non-decreasing function
• The CDF is a right-continuous function for a discrete RV
[Figure: PMF (left) and CDF (right)]
Probabilities of Events Via Discrete CDF
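A small sketch of how event probabilities can be read off a discrete CDF, assuming the standard definition F(x) = P(X ≤ x) and the fair die as the example. The identities used are P(a < X ≤ b) = F(b) − F(a) and P(X > a) = 1 − F(a); the variable names are illustrative.

```python
from fractions import Fraction

values = [1, 2, 3, 4, 5, 6]
pmf = {v: Fraction(1, 6) for v in values}   # fair die

def cdf(x):
    """F(x) = P(X <= x): sum of the PMF over all values not exceeding x."""
    return sum((p for v, p in pmf.items() if v <= x), Fraction(0))

p_interval = cdf(5) - cdf(2)   # P(2 < X <= 5) = P(X in {3, 4, 5})
p_greater = 1 - cdf(4)         # P(X > 4) = P(X in {5, 6})
p_exact = cdf(3) - cdf(2)      # P(X = 3): the size of the jump of F at 3

print(p_interval, p_greater, p_exact)
```

Because the CDF of a discrete RV is a step function, `cdf(2.5)` equals `cdf(2)`; the value only changes at the points where the PMF is nonzero.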
Cumulative Distribution Function (CDF)
• Continuous RV: $F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt$, a non-decreasing function
• The CDF is a continuous function here
[Figure: PDF (left) and CDF (right)]
Probabilities of Events Via Continuous CDF
Discrete: Uniform Distribution
• Distribution assigns equal probabilities to a finite set of values
Continuous: Exponential Distribution
Continuous: Beta Distribution
Continuous: Normal (Gaussian) Distribution
Reading a Normal (Gaussian) Distribution
Continuous: Standard Normal Distribution
• It is the normal distribution with a mean equal to 0 and a standard
deviation (also variance) equal to 1
• The standard normal distribution is often abbreviated to Z. It is
frequently used to simplify working with normal distributions.
[Figure: standard normal PDF (left) and standard normal CDF (right)]
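One way Z simplifies work with normal distributions is standardization: if X is normal with mean μ and standard deviation σ, then Z = (X − μ)/σ is standard normal, so P(a ≤ X ≤ b) = Φ(z_b) − Φ(z_a), where Φ is the standard normal CDF. The sketch below uses `statistics.NormalDist` from the Python standard library; the values of `mu` and `sigma` are hypothetical.

```python
from statistics import NormalDist

Z = NormalDist(mu=0.0, sigma=1.0)   # standard normal distribution

def prob_between(a, b, mu, sigma):
    """P(a <= X <= b) for X ~ N(mu, sigma^2), computed via Z = (X - mu) / sigma."""
    z_a = (a - mu) / sigma
    z_b = (b - mu) / sigma
    return Z.cdf(z_b) - Z.cdf(z_a)

mu, sigma = 170.0, 8.0   # hypothetical mean and standard deviation

within_one = prob_between(mu - sigma, mu + sigma, mu, sigma)          # about 0.68
within_two = prob_between(mu - 2 * sigma, mu + 2 * sigma, mu, sigma)  # about 0.95

# Cross-check against the CDF of the non-standardized distribution.
X = NormalDist(mu, sigma)
direct = X.cdf(mu + sigma) - X.cdf(mu - sigma)

print(round(within_one, 4), round(within_two, 4), round(direct, 4))
```

The result does not depend on the particular `mu` and `sigma`, which is the point of standardizing.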
Joint Probability Distribution Function
• If we have multiple random variables defined over the same
probability space S, then the joint probability distribution is the
distribution function defined over all possible event combinations
of all the random variables
• The joint probability density function for two continuous random
variables $X$ and $Y$ can be represented as $f_{X,Y}(x, y)$
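Joint Probability Distribution Function
• The concept of a joint probability distribution function generalizes
beyond two variables: $f_{X_1, \ldots, X_n}(x_1, \ldots, x_n)$
• For the two-variable case, $f_{X,Y}(x, y)$ must be a nonnegative function and the
following must hold: $\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dx\,dy = 1$
• Joint Cumulative Distribution Function (CDF): $F_{X,Y}(x, y) = P(X \le x, Y \le y) = \int_{-\infty}^{x} \int_{-\infty}^{y} f_{X,Y}(u, v)\,dv\,du$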
Marginal Probability Distribution Functions
• From the joint probability distribution function, we can find the
marginal probability distributions by integrating the joint distribution
function:
$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy$ for all $x$
$f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dx$ for all $y$
• Marginal distribution functions (also known as univariate
distributions) are probability distribution functions of individual
random variables
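For discrete random variables the integrals become sums, which makes marginalization easy to demonstrate. The sketch below sums a small joint PMF over one variable to obtain the marginal of the other. The joint table is a made-up example and the helper `marginal` is an illustrative name.

```python
from fractions import Fraction

# Hypothetical joint PMF p(x, y) with x in {0, 1} and y in {0, 1, 2}.
joint = {
    (0, 0): Fraction(1, 8), (0, 1): Fraction(1, 4), (0, 2): Fraction(1, 8),
    (1, 0): Fraction(1, 4), (1, 1): Fraction(1, 8), (1, 2): Fraction(1, 8),
}

def marginal(joint, axis):
    """Sum the joint PMF over the other variable; axis 0 gives p(x), axis 1 gives p(y)."""
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], Fraction(0)) + p
    return out

marginal_x = marginal(joint, axis=0)
marginal_y = marginal(joint, axis=1)

print(marginal_x)
print(marginal_y)
```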
Independence
• The continuous random variables $X$ and $Y$ are statistically independent if their
joint probability distribution function factors into a product of their
marginal distributions: $f_{X,Y}(x, y) = f_X(x)\, f_Y(y)$ for all $x, y$
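The factorization test can be illustrated in the discrete setting, where it reads p(x, y) = p(x)·p(y) for every pair (x, y). The sketch below checks this for two made-up joint PMFs: two independent fair coin flips, and a pair in which Y always equals X. All names are illustrative.

```python
from fractions import Fraction
from itertools import product

def marginals(joint):
    """Return the marginal PMFs (p_x, p_y) of a joint PMF keyed by (x, y)."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0) + p
        py[y] = py.get(y, 0) + p
    return px, py

def is_independent(joint):
    """True if the joint PMF equals the product of its marginals everywhere."""
    px, py = marginals(joint)
    return all(joint.get((x, y), 0) == px[x] * py[y] for x, y in product(px, py))

# Two independent fair coin flips (0 = tails, 1 = heads).
independent_joint = {(x, y): Fraction(1, 4) for x, y in product((0, 1), repeat=2)}

# Y is a copy of X, so the pairs (0, 1) and (1, 0) have probability zero.
dependent_joint = {(0, 0): Fraction(1, 2), (1, 1): Fraction(1, 2)}

print(is_independent(independent_joint), is_independent(dependent_joint))
```

Both tables have the same marginals (1/2, 1/2), which shows that marginals alone do not determine the joint distribution.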
Conditional Probability and Bayes’ Rule
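• Conditional probability: It is the probability of an event given that another
event has occurred: $P(A \mid B) = P(A \cap B) / P(B)$, for $P(B) > 0$
• Bayes’ Rule: $P(Y = y \mid X = x) = \dfrac{P(X = x \mid Y = y)\, P(Y = y)}{P(X = x)}$
$P(Y = y \mid X = x)$ = conditional probability of $Y = y$ given $X = x$. This is also called the posterior probability
$P(X = x \mid Y = y)$ = conditional probability of $X = x$ given $Y = y$. This is called the likelihood
$P(Y = y)$ = marginal of $Y$, also the prior probability of $Y = y$
$P(X = x)$ = marginal probability of $X = x$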
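A worked instance of Bayes' rule, posterior = likelihood × prior / marginal, may help fix the roles of the four terms. The numbers below (a 1% prior and a test with 95% sensitivity and a 5% false-positive rate) are hypothetical and chosen only for illustration, as is the function name `posterior`.

```python
def posterior(prior, likelihood):
    """Apply Bayes' rule for one observed value x.

    prior:      {y: P(Y = y)}
    likelihood: {y: P(X = x | Y = y)} for the observed x
    returns:    {y: P(Y = y | X = x)}
    """
    # Marginal P(X = x), obtained by summing likelihood * prior over all y.
    evidence = sum(likelihood[y] * prior[y] for y in prior)
    return {y: likelihood[y] * prior[y] / evidence for y in prior}

# Hypothetical numbers: Y is the true condition, X = "test is positive".
prior = {"disease": 0.01, "healthy": 0.99}
likelihood = {"disease": 0.95, "healthy": 0.05}

post = posterior(prior, likelihood)
print({y: round(p, 4) for y, p in post.items()})
```

Even with a fairly accurate test, the posterior probability of disease stays well below one half here because the prior is so small.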
Representations of Distribution Functions
• Non-parametric models
• Histogram
• Kernel Density Estimation (KDE)
• Parametric models
• Gaussian (Normal)
• Gaussian mixture models (GMM)
Non-parametric Distributions: Histogram
• Histogram: A histogram is an approximate representation of a statistical
distribution. The area under a histogram can be normalized and used as a
discrete probability distribution function.
[Figure: univariate histogram (left) and joint histogram (right)]
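A minimal sketch of the normalization step, assuming equal-width bins over the sample range. Dividing counts by the sample size gives bin probabilities that sum to 1; dividing those by the bin width gives bar heights whose total area is 1. The function name, the bin count, and the synthetic Gaussian samples are illustrative assumptions.

```python
import random

def histogram_density(samples, num_bins):
    """Equal-width histogram over [min, max], normalized two ways."""
    lo, hi = min(samples), max(samples)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for s in samples:
        # Clamp so that the maximum sample falls into the last bin.
        idx = min(int((s - lo) / width), num_bins - 1)
        counts[idx] += 1
    n = len(samples)
    probs = [c / n for c in counts]        # bin probabilities, sum to 1
    density = [p / width for p in probs]   # bar heights, total area is 1
    edges = [lo + i * width for i in range(num_bins + 1)]
    return edges, probs, density

random.seed(0)
samples = [random.gauss(0.0, 1.0) for _ in range(10_000)]   # synthetic data
edges, probs, density = histogram_density(samples, num_bins=20)

area = sum(d * (edges[i + 1] - edges[i]) for i, d in enumerate(density))
print(round(sum(probs), 6), round(area, 6))
```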
Non-parametric Distributions: KDE
• KDE: Kernel Density Estimation is a popular technique for estimating a distribution
from sample data. Formally, it is defined as follows:
$f(x) = \dfrac{1}{n b} \sum_{i=1}^{n} K\!\left(\dfrac{x - x_i}{b}\right)$
• f(x) is the KDE function
• n = number of data points
• $x_i$ = the i-th data point
• b = bandwidth
• K(.) = non-negative symmetric kernel function
such as uniform, triangular, Gaussian, etc.
[Figure: univariate KDE (left) and joint KDE (right)]
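A direct, unoptimized sketch of the KDE formula f(x) = (1 / (n·b)) Σ K((x − x_i) / b) with a Gaussian kernel. The data values and the bandwidth are hypothetical, and the evaluation loops over every data point for clarity rather than speed.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used as the kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(x, data, bandwidth):
    """f(x) = 1 / (n * b) * sum over i of K((x - x_i) / b)."""
    n = len(data)
    return sum(gaussian_kernel((x - xi) / bandwidth) for xi in data) / (n * bandwidth)

data = [1.0, 1.3, 2.1, 2.2, 4.8, 5.0, 5.1]   # hypothetical sample
b = 0.5                                       # hypothetical bandwidth

# Riemann-sum check that the estimate integrates to roughly 1 over [-5, 10].
step = 0.01
grid = [-5.0 + i * step for i in range(1501)]
area = step * sum(kde(x, data, b) for x in grid)

print(round(kde(2.0, data, b), 4), round(kde(3.5, data, b), 4), round(area, 4))
```

A smaller bandwidth gives a spikier estimate that follows individual points; a larger one gives a smoother estimate that can blur separate clusters together.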
Parametric Distribution: GMM
• Gaussian Mixture Model (GMM): Represent a probability distribution
function as a convex combination of multiple Gaussian functions
$p(X) = \sum_{i=1}^{K} \pi_i \, N(X \mid \mu_i, \Sigma_i)$
• $\pi_i$ = weights of the Gaussian components ($\pi_i \ge 0$, $\sum_{i=1}^{K} \pi_i = 1$)
• $K$ = number of Gaussian components in the mixture model
Fig. source: [Link]
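To make the convex-combination idea concrete, the sketch below evaluates a one-dimensional mixture density, written here as p(x) = Σ w_i · N(x | μ_i, σ_i²) with weights that sum to 1. The two components and their parameters are hypothetical, and the one-dimensional case is assumed for simplicity.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def gmm_pdf(x, weights, means, sigmas):
    """Mixture density: weighted sum of the component Gaussian densities."""
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sigmas))

# Hypothetical two-component mixture; the weights are nonnegative and sum to 1.
weights = [0.3, 0.7]
means = [-2.0, 3.0]
sigmas = [0.5, 1.5]

# Riemann-sum check that the mixture integrates to roughly 1 over [-10, 15].
step = 0.01
grid = [-10.0 + i * step for i in range(2501)]
area = step * sum(gmm_pdf(x, weights, means, sigmas) for x in grid)

print(round(gmm_pdf(-2.0, weights, means, sigmas), 4), round(area, 4))
```

Because the weights sum to 1 and each component integrates to 1, the mixture is itself a valid density.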
Parameter Estimation Techniques
• Estimating the parameters of a single Gaussian distribution is straightforward
• Maximum Likelihood Estimate (MLE)
• Same as computing the sample mean and variance
• Estimating GMM parameters requires the Expectation Maximization
(EM) algorithm
• Iterative technique to fit GMM parameters
• Incremental schemes for GMM parameter estimation
• Fast and approximate method to estimate GMM parameters
• Can model streaming time-varying data
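The iterative EM idea can be sketched for a one-dimensional GMM as follows. This is a textbook-style illustration under stated assumptions, not the course's reference implementation: the quantile-based initialization, the fixed iteration count, the variance floor, and the synthetic two-cluster data are all choices made here for the example. It does not cover the incremental schemes mentioned above.

```python
import math
import random

def normal_pdf(x, mu, var):
    """Density of N(mu, var) at x."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def em_gmm_1d(data, k, iters=200):
    """Fit a k-component 1-D GMM with plain EM; returns (weights, means, variances)."""
    n = len(data)
    # MLE of a single Gaussian: sample mean and (biased) sample variance.
    mean_all = sum(data) / n
    var_all = sum((x - mean_all) ** 2 for x in data) / n
    # Assumed initialization: equal weights, means at evenly spaced quantiles,
    # and every component starting from the overall variance.
    sorted_data = sorted(data)
    weights = [1.0 / k] * k
    means = [sorted_data[int((i + 0.5) * n / k)] for i in range(k)]
    variances = [var_all] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            joint = [w * normal_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: re-estimate parameters from responsibility-weighted data.
        for i in range(k):
            n_i = sum(r[i] for r in resp)
            weights[i] = n_i / n
            means[i] = sum(r[i] * x for r, x in zip(resp, data)) / n_i
            var_i = sum(r[i] * (x - means[i]) ** 2 for r, x in zip(resp, data)) / n_i
            variances[i] = max(var_i, 1e-6)   # floor to avoid a collapsed component
    return weights, means, variances

# Synthetic data: 30% from N(0, 1) and 70% from N(5, 1).
rng = random.Random(42)
data = [rng.gauss(0.0, 1.0) for _ in range(300)]
data += [rng.gauss(5.0, 1.0) for _ in range(700)]

weights, means, variances = em_gmm_1d(data, k=2)
print([round(w, 3) for w in weights], [round(m, 3) for m in means])
```

In practice EM is sensitive to initialization and is usually stopped by monitoring the change in log-likelihood rather than by a fixed iteration count.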