Lecture13 Stats Refresher

The document outlines the course 'Big Data Visual Analytics' (CS 661) taught by Soumya Dutta at IIT Kanpur, including study materials and project guidelines. It covers concepts of random variables, probability distributions, and estimation techniques such as the EM algorithm for Gaussian Mixture Models. Key topics include discrete and continuous random variables, expected value, variance, and various distribution functions.

Big Data Visual Analytics (CS 661)

Instructor: Soumya Dutta


Department of Computer Science and Engineering
Indian Institute of Technology Kanpur (IITK)
email: soumyad@[Link]
Study Materials for Lecture 13
• [Link]
• [Link]
• A Gentle Tutorial of the EM Algorithm and its Application to
Parameter Estimation for Gaussian Mixture and Hidden Markov
Models
• EM Algorithm:
[Link]

IITK CS661: Big Data Visual Analytics: Soumya Dutta 2


Final Project
• Form your project team by Feb 28 and update the google sheet with
details of project members
• [Link]
PE_4u6pNb4FzzjbA/edit?usp=sharing
• Group size: 8-9 (8 preferred)
• Students who are not part of a team will be assigned to new teams at random
• Check the guideline document in HelloIITK for more details about the
project and how to proceed with the project
• Carefully read the instructions



Random Variables and Distributions



Random Variable
• Let S be a sample space of an experiment
• S is associated with a probability measure P
• A random variable X is a real-valued function on S
• Key property: it is a function whose values have probabilities attached to them



Random Variable: Example
• Let us flip a fair coin three times
• Sample space S = {hhh, hht, hth, htt, thh, tht, tth, ttt}
• Assume X is a function on S, so that X is the number of heads (h)
• So, we have,
• {hhh → 3, hht → 2, hth → 2, htt → 1, thh → 2, tht → 1, tth → 1, ttt → 0}
• X is a random variable



Random Variable: Example
• We can answer questions like:
• P(X=0) = P(ttt) = 1/8
• P(X = 1) = P(htt ) + P(tht ) + P(tth) = 3/8
• P(X = 2) = P(hht ) + P(hth) + P(thh) = 3/8
• P(X = 3) = P(hhh) = 1/8
• We can tabulate it:
  x          0     1     2     3
  P(X = x)   1/8   3/8   3/8   1/8
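
The probabilities for the three-coin example can be checked by enumerating the sample space directly. The following is a minimal Python sketch; the names `sample_space`, `X`, and `pmf` are illustrative and not part of the lecture.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Enumerate the sample space of three fair coin flips: hhh, hht, ..., ttt.
sample_space = ["".join(flips) for flips in product("ht", repeat=3)]

def X(outcome: str) -> int:
    """The random variable: map an outcome to its number of heads."""
    return outcome.count("h")

# All 8 outcomes are equally likely, so P(X = k) is the fraction of
# outcomes that contain exactly k heads.
counts = Counter(X(outcome) for outcome in sample_space)
pmf = {k: Fraction(counts[k], len(sample_space)) for k in sorted(counts)}

for k, p in pmf.items():
    print(f"P(X = {k}) = {p}")
```

Exact fractions are used so that the probabilities can be compared with 1/8 and 3/8 without rounding.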



Random Variable (RV): Example
• Rolling a fair die
• Let the RV X be the number that comes up
• X takes each of the values 1, 2, 3, 4, 5, 6 with probability 1/6



Discrete and Continuous Random Variables
• A random variable is said to be discrete if its set of possible values is finite or countably infinite
• Example: Rolling a fair die and recording the value that shows up

• A random variable is said to be continuous if it can take any value in an interval, that is, an uncountable number of values
• Example: The depth of a pool, the height of an adult male, etc.



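Expected Value and Variance of a Discrete RV
• Expected Value (mean): $\mu = E[X] = \sum_i x_i \, P(X = x_i)$

• Variance: $\sigma^2 = \mathrm{Var}(X) = E[(X - \mu)^2] = \sum_i (x_i - \mu)^2 \, P(X = x_i)$

• Standard Deviation: $\sigma = \sqrt{\mathrm{Var}(X)}$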
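
As a worked instance, the mean, variance, and standard deviation of a fair die roll can be computed from its PMF. This is a small Python sketch; the variable names are illustrative.

```python
import math
from fractions import Fraction

# PMF of a fair six-sided die: each face has probability 1/6.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

# Expected value: sum of each value weighted by its probability.
mean = sum(x * p for x, p in pmf.items())

# Variance: probability-weighted squared deviation from the mean.
variance = sum((x - mean) ** 2 * p for x, p in pmf.items())

# Standard deviation: square root of the variance.
std_dev = math.sqrt(variance)

print(mean, variance, std_dev)
```

For the fair die the mean is 7/2 and the variance is 35/12.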



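Expected Value and Variance of a Continuous RV
• Expected Value (mean): $\mu = E[X] = \int_{-\infty}^{\infty} x \, f(x) \, dx$, where $f(x)$ is the probability density function of X

• Variance: $\sigma^2 = \mathrm{Var}(X) = \int_{-\infty}^{\infty} (x - \mu)^2 \, f(x) \, dx$

• Standard Deviation: $\sigma = \sqrt{\mathrm{Var}(X)}$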
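
For a continuous RV the sums become integrals. The sketch below approximates them numerically for a hypothetical density f(x) = 2x on [0, 1], chosen only because its exact mean (2/3) and variance (1/18) are easy to verify by hand. The helper `integrate` is a plain midpoint rule, not a library routine.

```python
import math

def integrate(g, a, b, n=10_000):
    """Approximate the integral of g over [a, b] with the midpoint rule."""
    h = (b - a) / n
    return h * sum(g(a + (i + 0.5) * h) for i in range(n))

def f(x):
    """Hypothetical density: f(x) = 2x on [0, 1], zero elsewhere."""
    return 2.0 * x

total = integrate(f, 0.0, 1.0)                                    # should be 1
mean = integrate(lambda x: x * f(x), 0.0, 1.0)                    # exact value 2/3
variance = integrate(lambda x: (x - mean) ** 2 * f(x), 0.0, 1.0)  # exact value 1/18
std_dev = math.sqrt(variance)

print(total, mean, variance, std_dev)
```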



Probability Distribution Function
• A probability distribution function is a mathematical function that gives the probabilities of occurrence of the possible outcomes of a random variable

• Probability Mass Function (PMF): The probability distribution of a discrete random variable is called a probability mass function

• Probability Density Function (PDF): The probability distribution of a continuous random variable is called a probability density function



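Probability Distribution Function: Properties
• Discrete case: PMF
1. $0 \le p(x) \le 1$ for every possible value $x$, and $\sum_x p(x) = 1$
2. $p(x) = 0$ for all $x$ outside the discrete range of values
• Continuous case: PDF
1. $f(x) \ge 0$ for all $x$, and $\int_{-\infty}^{\infty} f(x) \, dx = 1$
2. Probability is evaluated as the area under the curve: $P(a \le x \le b) = \int_a^b f(x) \, dx$
3. $P(x = c) = 0$: the probability that $x$ takes on any individual value is zero. The area below the curve between $x = c$ and $x = c$ has no width, and therefore no area.
[Figure: a PMF (probability vs. data values) and a PDF (probability density vs. data values)]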



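Cumulative Distribution Function (CDF)
• The CDF gives the probability that the RV takes a value less than or equal to x: $F(x) = P(X \le x)$
• Discrete RV: non-decreasing step function, $F(x) = \sum_{x_i \le x} P(X = x_i)$
• The CDF of a discrete RV is a right-continuous function
[Figure: a PMF and the corresponding CDF]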



Probabilities of Events Via Discrete CDF
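
The slide body is not preserved in this transcript. As a hedged illustration of the idea in the title, the sketch below uses the standard identities P(a < X ≤ b) = F(b) − F(a) and P(X > a) = 1 − F(a) for a fair die; the function name `cdf` and the chosen events are assumptions made for the example.

```python
from fractions import Fraction

# PMF of a fair six-sided die.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def cdf(x):
    """F(x) = P(X <= x): sum the PMF over all values not exceeding x."""
    return sum((p for value, p in pmf.items() if value <= x), Fraction(0))

p_between = cdf(5) - cdf(2)   # P(2 < X <= 5) = P(X in {3, 4, 5})
p_greater = 1 - cdf(4)        # P(X > 4)      = P(X in {5, 6})
p_exactly = cdf(3) - cdf(2)   # P(X = 3), the size of the jump of F at 3

print(p_between, p_greater, p_exactly)
```

Because the CDF of a discrete RV is a step function, `cdf(3.5)` returns the same value as `cdf(3)`.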



Cumulative Distribution Function (CDF)
• Continuous RV: non-decreasing function, $F(x) = P(X \le x) = \int_{-\infty}^{x} f(t) \, dt$
• The CDF is a continuous function here
[Figure: a PDF and the corresponding CDF]



Probabilities of Events Via Continuous CDF
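
The slide body is not preserved in this transcript. For a continuous RV, interval probabilities follow from the CDF as P(a ≤ X ≤ b) = F(b) − F(a). The sketch below illustrates this with Python's `statistics.NormalDist`; the normal model and its parameters (mean 170, standard deviation 10) are hypothetical values chosen for the example.

```python
from statistics import NormalDist

# Hypothetical model: X ~ Normal(mean=170, std=10).
X = NormalDist(mu=170.0, sigma=10.0)

# P(165 <= X <= 185) is the difference of the CDF at the two endpoints.
p_between = X.cdf(185.0) - X.cdf(165.0)

# P(X > 190) is the complement of the CDF.
p_above = 1.0 - X.cdf(190.0)

print(round(p_between, 4), round(p_above, 4))
```

Since P(X = c) = 0 for a continuous RV, including or excluding the endpoints does not change these probabilities.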



Discrete: Uniform Distribution
• The distribution assigns equal probability to each of a finite set of n values: $P(X = x_i) = 1/n$



Continuous: Exponential
Distribution
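
Only the slide title survives here. For reference, the standard exponential distribution with rate λ > 0 has density f(x) = λe^(−λx) and CDF F(x) = 1 − e^(−λx) for x ≥ 0. The Python sketch below evaluates both; the function names and the rate value are illustrative assumptions.

```python
import math

def exp_pdf(x: float, rate: float) -> float:
    """Density of the exponential distribution: rate * exp(-rate * x) for x >= 0."""
    return rate * math.exp(-rate * x) if x >= 0 else 0.0

def exp_cdf(x: float, rate: float) -> float:
    """CDF of the exponential distribution: 1 - exp(-rate * x) for x >= 0."""
    return 1.0 - math.exp(-rate * x) if x >= 0 else 0.0

rate = 2.0                      # hypothetical rate parameter
mean = 1.0 / rate               # the mean of an exponential RV is 1 / rate
median = math.log(2.0) / rate   # the value at which the CDF equals 0.5
p_within_one = exp_cdf(1.0, rate)

print(mean, median, p_within_one)
```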



Continuous: Beta Distribution



Continuous: Normal (Gaussian)
Distribution



Reading a Normal (Gaussian)
Distribution
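
The figure for this slide is not preserved. One common way to read a normal distribution is by the probability mass that lies within 1, 2, and 3 standard deviations of the mean. The sketch below computes those coverages with `statistics.NormalDist`; it is an illustration and may differ from what the original figure showed.

```python
from statistics import NormalDist

# Standard normal: mean 0, standard deviation 1. The coverage within
# k standard deviations is the same for any normal distribution.
Z = NormalDist()

# P(mu - k*sigma <= X <= mu + k*sigma) for k = 1, 2, 3.
coverage = {k: Z.cdf(k) - Z.cdf(-k) for k in (1, 2, 3)}

for k, p in coverage.items():
    print(f"within {k} standard deviation(s): {p:.4f}")
```

These values are roughly 68%, 95%, and 99.7%.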



Continuous: Standard Normal Distribution
• It is the normal distribution with mean 0 and standard deviation 1 (so the variance is also 1)
• A standard normal random variable is usually denoted Z. Any normal variable X with mean $\mu$ and standard deviation $\sigma$ can be standardized as $Z = (X - \mu)/\sigma$, which simplifies working with normal distributions.
[Figure: standard normal PDF and standard normal CDF]
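
The following sketch shows why the standard normal simplifies calculations: converting a value to its z-score and looking it up in the standard normal CDF gives the same probability as using the original distribution. The parameters (mean 50, standard deviation 8) and the query value are hypothetical.

```python
from statistics import NormalDist

mu, sigma = 50.0, 8.0        # hypothetical parameters of X
X = NormalDist(mu, sigma)
Z = NormalDist()             # standard normal: mean 0, standard deviation 1

x = 62.0
z = (x - mu) / sigma         # z-score: distance from the mean in standard deviations

p_direct = X.cdf(x)          # P(X <= x) from the original distribution
p_standard = Z.cdf(z)        # P(Z <= z) from the standard normal

print(z, p_direct, p_standard)
```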



Joint Probability Distribution Function
• If we have multiple random variables defined over the same probability space S, then the joint probability distribution is the distribution function defined over all possible combinations of values of those random variables
• The joint probability density function for two continuous random variables $X$ and $Y$ can be represented as $f_{X,Y}(x, y)$



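Joint Probability Distribution Function
• The concept of a joint probability distribution function generalizes beyond two variables: $f_{X_1, \ldots, X_n}(x_1, \ldots, x_n)$
• For the two-variable case, $f_{X,Y}(x, y)$ must be a nonnegative function and the following must hold: $\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{X,Y}(x, y) \, dx \, dy = 1$

• Joint Cumulative Distribution Function (CDF): $F_{X,Y}(x, y) = P(X \le x, Y \le y) = \int_{-\infty}^{x} \int_{-\infty}^{y} f_{X,Y}(s, t) \, dt \, ds$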



Marginal Probability Distribution Functions
• From the joint probability distribution function, we can find the marginal probability distributions by integrating the joint distribution function over the other variable

$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y) \, dy$ for all $x$
$f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x, y) \, dx$ for all $y$

• Marginal distribution functions (also known as univariate distributions) are the probability distribution functions of the individual random variables
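
Marginalization is easiest to see in the discrete analogue, where the integrals become sums over the other variable. The sketch below uses a hypothetical joint PMF over X in {0, 1} and Y in {0, 1, 2}; the table values are made up for illustration.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical joint PMF P(X = x, Y = y); the entries sum to 1.
joint = {
    (0, 0): Fraction(1, 10), (0, 1): Fraction(2, 10), (0, 2): Fraction(1, 10),
    (1, 0): Fraction(3, 10), (1, 1): Fraction(1, 10), (1, 2): Fraction(2, 10),
}

# Marginal of X: sum the joint over all y. Marginal of Y: sum over all x.
marginal_x = defaultdict(Fraction)
marginal_y = defaultdict(Fraction)
for (x, y), p in joint.items():
    marginal_x[x] += p
    marginal_y[y] += p

print(dict(marginal_x))
print(dict(marginal_y))
```

Each marginal is itself a valid distribution, so its values sum to 1.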



Independence
• Continuous random variables $X$ and $Y$ are statistically independent if their joint probability distribution function factors into the product of their marginal distributions: $f_{X,Y}(x, y) = f_X(x) \, f_Y(y)$ for all $x$ and $y$
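
The factorization test can be sketched for the discrete analogue, where the joint PMF must equal the product of the marginal PMFs for every pair of values. The two joint tables below are hypothetical examples: two independent fair coins, and a pair in which the second coin always copies the first.

```python
from fractions import Fraction
from itertools import product

def marginals(joint):
    """Return the marginal PMFs of X and Y from a joint PMF keyed by (x, y)."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0) + p
        py[y] = py.get(y, 0) + p
    return px, py

def is_independent(joint):
    """True if P(X = x, Y = y) == P(X = x) * P(Y = y) for every pair (x, y)."""
    px, py = marginals(joint)
    return all(joint.get((x, y), 0) == px[x] * py[y] for x, y in product(px, py))

# Two independent fair coins: every pair has probability 1/4.
two_fair_coins = {(x, y): Fraction(1, 4) for x, y in product((0, 1), repeat=2)}

# Fully dependent pair: Y always equals X.
copied_coin = {(0, 0): Fraction(1, 2), (1, 1): Fraction(1, 2)}

print(is_independent(two_fair_coins), is_independent(copied_coin))
```

Both tables have the same marginals (1/2 for each value), which shows that the marginals alone do not determine whether two variables are independent.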



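Conditional Probability and Bayes’ Rule
• Conditional probability: the probability of an event given that another event has occurred, $P(A \mid B) = P(A \cap B) / P(B)$ for $P(B) > 0$

• Bayes’ Rule: $P(X = x \mid Y = y) = \dfrac{P(Y = y \mid X = x) \, P(X = x)}{P(Y = y)}$

$P(X = x \mid Y = y)$ = conditional probability of $X = x$ given $Y = y$. This is also called the posterior probability
$P(Y = y \mid X = x)$ = conditional probability of $Y = y$ given $X = x$. This is called the likelihood
$P(X = x)$ = marginal of $X$, also the prior probability of $X = x$
$P(Y = y)$ = marginal probability of $Y$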
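
A small numeric example makes the roles of prior, likelihood, marginal, and posterior concrete. The scenario and all numbers below (a condition with 1% prevalence, a test with 95% sensitivity and a 5% false-positive rate) are hypothetical.

```python
# Hypothetical inputs.
prior = 0.01                  # P(condition)
likelihood = 0.95             # P(positive | condition)
false_positive_rate = 0.05    # P(positive | no condition)

# Marginal probability of a positive result (law of total probability).
evidence = likelihood * prior + false_positive_rate * (1.0 - prior)

# Bayes' rule: posterior = likelihood * prior / evidence.
posterior = likelihood * prior / evidence

print(f"P(positive) = {evidence:.4f}")
print(f"P(condition | positive) = {posterior:.4f}")
```

Even with an accurate test, the posterior stays well below the likelihood because the prior is small.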



Representations of Distribution Functions
• Non-parametric models
• Histogram
• Kernel Density Estimation (KDE)
• Parametric models
• Gaussian (Normal)
• Gaussian mixture models (GMM)



Non-parametric Distributions: Histogram
• Histogram: A histogram is an approximate representation of a statistical distribution. The area under a histogram can be normalized to 1, after which the histogram can be used as a discrete probability distribution function.

[Figure: univariate histogram and joint histogram]
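
Normalizing a histogram can be sketched as follows: count the samples per bin, then divide by the total number of samples so that the bin probabilities sum to 1. The function `normalized_histogram`, the sample values, and the bin layout are assumptions for illustration; the sketch assumes every sample lies in [lo, hi].

```python
def normalized_histogram(data, num_bins, lo, hi):
    """Return per-bin probabilities for equal-width bins covering [lo, hi]."""
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for value in data:
        # Clamp so that value == hi falls into the last bin.
        index = min(int((value - lo) / width), num_bins - 1)
        counts[index] += 1
    return [count / len(data) for count in counts]

# Hypothetical samples and 4 bins: [1, 2), [2, 3), [3, 4), [4, 5].
data = [1.2, 1.9, 2.1, 2.4, 2.8, 3.3, 3.7, 4.5]
probabilities = normalized_histogram(data, num_bins=4, lo=1.0, hi=5.0)

print(probabilities)
```

Dividing each probability by the bin width would instead give a density estimate whose total area is 1.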


Non-parametric Distributions: KDE
• KDE: Kernel Density Estimation is a popular technique for estimating a distribution from sample data. Formally, it is defined as follows: $f(x) = \frac{1}{n b} \sum_{i=1}^{n} K\left(\frac{x - x_i}{b}\right)$

• f(x) is the KDE function
• x_i = the i-th data point
• n = number of data points
• b = bandwidth
• K(.) = non-negative symmetric kernel function such as uniform, triangular, Gaussian, etc.

[Figure: univariate KDE and joint KDE]
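
A univariate KDE with a Gaussian kernel can be sketched in a few lines using the standard form f(x) = (1 / (n b)) Σ K((x − x_i) / b). The sample values and the bandwidth below are hypothetical, and the bandwidth is fixed by hand rather than selected by any rule.

```python
import math

def gaussian_kernel(u: float) -> float:
    """Standard normal density, used here as the kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(x: float, samples, b: float) -> float:
    """Kernel density estimate at x with bandwidth b."""
    n = len(samples)
    return sum(gaussian_kernel((x - xi) / b) for xi in samples) / (n * b)

samples = [1.2, 1.9, 2.1, 2.4, 2.8, 3.3, 3.7, 4.5]  # hypothetical data points
bandwidth = 0.5                                      # hypothetical bandwidth

# Sanity check: the estimate should integrate to about 1 (midpoint rule on [-5, 10]).
step = 0.01
grid = [-5.0 + (i + 0.5) * step for i in range(1500)]
area = step * sum(kde(x, samples, bandwidth) for x in grid)

print(kde(2.5, samples, bandwidth), area)
```

A smaller bandwidth gives a spikier estimate that follows individual samples, while a larger one gives a smoother curve.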


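Parametric Distribution: GMM
• Gaussian Mixture Model (GMM): Represent a probability distribution function as a convex combination of multiple Gaussian functions: $p(X) = \sum_{i=1}^{K} \omega_i \, N(X \mid \mu_i, \sigma_i)$
• $\omega_i$ = weights of the Gaussian components (nonnegative and summing to 1, since the combination is convex)
• $\mu_i$, $\sigma_i$ = mean and standard deviation of the i-th Gaussian component
• K = number of Gaussian components in the mixture model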
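
Evaluating a GMM density amounts to a weighted sum of Gaussian densities. The sketch below does this for a hypothetical two-component univariate mixture; the weights, means, and standard deviations are made-up values.

```python
from statistics import NormalDist

# Hypothetical mixture parameters (K = 2). The weights must sum to 1.
weights = [0.3, 0.7]
means = [-2.0, 3.0]
sigmas = [1.0, 1.5]

components = [NormalDist(m, s) for m, s in zip(means, sigmas)]

def gmm_pdf(x: float) -> float:
    """Mixture density: weighted sum of the component Gaussian densities."""
    return sum(w * c.pdf(x) for w, c in zip(weights, components))

# Sanity check: the mixture density should integrate to about 1
# (midpoint rule on [-10, 12]).
step = 0.01
area = step * sum(gmm_pdf(-10.0 + (i + 0.5) * step) for i in range(2200))

print(gmm_pdf(-2.0), gmm_pdf(3.0), area)
```

Because the weights form a convex combination, the result is a valid density that can have several modes, which a single Gaussian cannot represent.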

Parameter Estimation Techniques
• Estimating the parameters of a single Gaussian distribution is straightforward
• Maximum Likelihood Estimate (MLE)
• Same as computing the sample mean and variance

• Estimating GMM parameters requires the Expectation Maximization (EM) algorithm
• Iterative technique to fit GMM parameters

• Incremental schemes for GMM parameter estimation
• Fast and approximate method to estimate GMM parameters
• Can model streaming time-varying data
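
The EM iteration for a univariate GMM can be sketched as below. This is a textbook-style illustration under stated assumptions, not the course's implementation: the function `em_gmm_1d`, the synthetic data, the initial guesses, and the fixed iteration count are all hypothetical, and no convergence test or incremental update is included.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    """Density of a univariate Gaussian with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def em_gmm_1d(data, weights, means, sigmas, iterations=100):
    """Fit a univariate GMM with EM, starting from the given initial parameters."""
    weights, means, sigmas = list(weights), list(means), list(sigmas)
    n, k = len(data), len(weights)
    for _ in range(iterations):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            joint = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sigmas)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: responsibility-weighted MLE of weight, mean, and variance.
        for i in range(k):
            r_i = sum(r[i] for r in resp)
            weights[i] = r_i / n
            means[i] = sum(r[i] * x for r, x in zip(resp, data)) / r_i
            var = sum(r[i] * (x - means[i]) ** 2 for r, x in zip(resp, data)) / r_i
            sigmas[i] = math.sqrt(max(var, 1e-6))  # floor avoids a collapsed component
    return weights, means, sigmas

# Synthetic data drawn from a known two-component mixture (hypothetical values).
random.seed(0)
data = [random.gauss(-2.0, 1.0) for _ in range(300)]
data += [random.gauss(3.0, 1.5) for _ in range(700)]

weights, means, sigmas = em_gmm_1d(data, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
print(weights, means, sigmas)
```

With a single component (K = 1) every responsibility equals 1, and the M-step reduces to the sample mean and variance, which is the MLE for a single Gaussian.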
