Lecture 3: Statistical Methods for Exploratory Data Analysis
Lecture #3
Part 1 – Outline
• Graphical Methods for Statistical Data Analysis
• An Example with Histogram
• Relation of Histogram with Probability Density Function (PDF)
• Continuous Random Variables and Normal Distribution
• PDF
• Cumulative Distribution Function (CDF)
Statistical Methods for Exploratory Data Analysis
Example on Histogram and its Relation with PDF
The free end of a cantilever beam is subjected to two transverse
loads X and Y along the orthogonal directions.
[Figure: cantilever beam with labels L (length), X and Y (transverse loads at the free end), t, and w]
The length (L) of the beam is 100 inches and its manufacturing
tolerance is 0.1 inch.
100 sample measurements (in inches) of the length are obtained.
99.99 100.02 99.97 100.04 100.06 99.99 100.00 100.02 100.04 100.02
100.01 99.98 99.99 100.02 100.06 100.02 99.96 100.01 100.00 99.99
99.99 99.97 99.99 99.97 99.99 100.00 100.01 99.99 99.99 100.00
100.02 100.04 100.02 99.99 100.02 99.99 99.94 99.92 99.94 99.99
99.98 99.98 100.00 99.98 99.98 100.02 100.03 100.08 100.01 100.00
99.98 99.99 100.00 99.94 99.99 100.03 99.98 100.02 100.03 100.04
100.03 99.92 99.98 100.03 100.01 100.00 100.01 100.04 100.01 99.97
99.99 99.99 100.03 100.01 100.02 100.00 99.99 100.00 100.01 99.95
100.01 100.03 100.00 100.05 100.02 99.99 99.99 99.99 100.01 100.03
99.99 100.02 100.06 100.03 100.03 99.97 99.98 100.01 100.00 99.95
The same 100 measurements, sorted in ascending order (read down each column):
99.92 99.97 99.98 99.99 99.99 100.00 100.01 100.02 100.03 100.04
99.92 99.97 99.98 99.99 99.99 100.00 100.01 100.02 100.03 100.04
99.94 99.97 99.99 99.99 99.99 100.00 100.01 100.02 100.03 100.04
99.94 99.98 99.99 99.99 100.00 100.00 100.01 100.02 100.03 100.04
99.94 99.98 99.99 99.99 100.00 100.00 100.01 100.02 100.03 100.04
99.95 99.98 99.99 99.99 100.00 100.01 100.01 100.02 100.03 100.05
99.95 99.98 99.99 99.99 100.00 100.01 100.01 100.02 100.03 100.06
99.96 99.98 99.99 99.99 100.00 100.01 100.02 100.02 100.03 100.06
99.97 99.98 99.99 99.99 100.00 100.01 100.02 100.02 100.03 100.06
99.97 99.98 99.99 99.99 100.00 100.01 100.02 100.02 100.03 100.08
Plot a histogram based on the 100 measurements.
Normalize the frequency for each bin by the total number of
measurements and the bin width.
Overlay the histogram with the PDF of a normal distribution.
[Figure: two histogram panels of the 100 measurements, each with x-axis "Beam length (in)" from 99.9 to 100.1]
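The three steps above can be sketched in MATLAB using the 100 measurements from the table. This is a minimal sketch rather than the code behind the slide's figure: the variable names are hypothetical, the 10-bin default of hist is an assumption, and the overlaid normal PDF uses the sample mean and sample standard deviation because the slide does not say which parameters were used.

```matlab
% 100 beam-length measurements (in), copied from the table above
L_meas = [
   99.99 100.02  99.97 100.04 100.06  99.99 100.00 100.02 100.04 100.02
  100.01  99.98  99.99 100.02 100.06 100.02  99.96 100.01 100.00  99.99
   99.99  99.97  99.99  99.97  99.99 100.00 100.01  99.99  99.99 100.00
  100.02 100.04 100.02  99.99 100.02  99.99  99.94  99.92  99.94  99.99
   99.98  99.98 100.00  99.98  99.98 100.02 100.03 100.08 100.01 100.00
   99.98  99.99 100.00  99.94  99.99 100.03  99.98 100.02 100.03 100.04
  100.03  99.92  99.98 100.03 100.01 100.00 100.01 100.04 100.01  99.97
   99.99  99.99 100.03 100.01 100.02 100.00  99.99 100.00 100.01  99.95
  100.01 100.03 100.00 100.05 100.02  99.99  99.99  99.99 100.01 100.03
   99.99 100.02 100.06 100.03 100.03  99.97  99.98 100.01 100.00  99.95];
L_meas = L_meas(:)';                       % flatten to a 1-by-100 row vector
n = numel(L_meas);

[counts,centers] = hist(L_meas);           % counts in 10 equal-width bins
width = centers(2) - centers(1);           % bin width (in)
density = (counts/n)/width;                % relative frequency per inch
area = sum(density)*width;                 % total bar area, equals 1 up to round-off

bar(centers,density,'hist')
xlabel('Beam length (in)'), ylabel('Probability density (in^{-1})')

% Assumption: plug the sample mean and standard deviation into the normal PDF
mu_s = mean(L_meas);
sigma_s = std(L_meas);
xg = linspace(min(L_meas),max(L_meas));
fg = exp(-0.5*((xg - mu_s)/sigma_s).^2)/(sqrt(2*pi)*sigma_s);
line(xg,fg,'color','r')
```

Dividing by both the number of measurements and the bin width is what makes the bar heights comparable to a PDF: the bar areas then sum to 1, just as a density integrates to 1.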
[Figure: histogram of data for component thickness X, overlaid with a normal distribution model (PDF); y-axis: probability density]
$$P(6 < X \le 7) = \int_{6}^{7} f_X(x)\,dx$$
Random Variable
Numerical outcome of a random experiment.
The probability distribution of a random variable is the collection
of possible outcomes along with their probabilities:
Discrete case (probability mass function): $P(X = x_k) = p_X(x_k)$, with $\sum_{k=1}^{M} p_X(x_k) = 1$
Continuous case (PDF): $P(x_1 < X \le x_2) = \int_{x_1}^{x_2} f_X(x)\,dx$
PDF: probability density function
Cumulative distribution function (CDF) $F_X(x)$: $P(x_1 < X \le x_2) = F_X(x_2) - F_X(x_1)$
Probability density function (PDF) $f_X(x)$: $P(x_1 < X \le x_2) = \int_{x_1}^{x_2} f_X(x)\,dx$
[Figure: plots of the CDF $F_X(x)$ and the PDF $f_X(x)$ against X, with $x_1$ and $x_2$ marked]
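As a numerical check of the two expressions for P(x1 < X ≤ x2), the sketch below evaluates both for a normal random variable. The parameters (mean 100, standard deviation 0.1/3, taken from the MATLAB procedure in this lecture) and the interval endpoints 99.95 and 100.05 are illustrative assumptions, and the CDF is written with the base-MATLAB erfc function instead of a toolbox call.

```matlab
mu = 100; sigma = 0.1/3;                   % assumed normal parameters (in)
x1 = 99.95; x2 = 100.05;                   % hypothetical interval endpoints (in)

f = @(x) exp(-0.5*((x - mu)/sigma).^2)/(sqrt(2*pi)*sigma);   % PDF f_X(x)
F = @(x) 0.5*erfc(-(x - mu)/(sigma*sqrt(2)));                % CDF F_X(x)

p_pdf = integral(f,x1,x2);                 % area under the PDF between x1 and x2
p_cdf = F(x2) - F(x1);                     % difference of CDF values

fprintf('PDF integral: %.4f, CDF difference: %.4f\n',p_pdf,p_cdf)
```

Both numbers come out at about 0.8664, since the interval spans ±1.5 standard deviations around the mean.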
Procedure to Plot a PDF on top of a Histogram in MATLAB
Step 1: Generate random samples and compute the histogram bin counts
>> ns = 1000;                              % number of samples
>> L_beam = normrnd(100,0.1/3,1,ns);       % mean 100 in, standard deviation 0.1/3 in
>> [heights,locations] = hist(L_beam);     % counts and centers of 10 equal-width bins
Step 2: Normalize the histogram so that its total area is 1
>> width = locations(2)-locations(1);      % bin width
>> heights = (heights/ns)/width;           % relative frequency per unit length
>> bar(locations,heights,'hist')
Step 3: Superimpose the normal density
>> grid = linspace(min(L_beam),max(L_beam));   % 100 points; this variable shadows the built-in grid command
>> line(grid,normpdf(grid,100,0.1/3),'color','r')
Lecture #3
Part 2 – Outline
• Continuous Random Variables and Normal Distribution
• Mean and Variance
• Probability Density Function (PDF)
• Cumulative Distribution Function (CDF)
• Quantitative Methods for Statistical Data Analysis
• Maximum Likelihood Estimation (MLE)
• An Example with Artificially Generated Measurements
Mean and Variance
Consider a continuous random variable X with a PDF $f_X(x)$.
The mean $\mu_X$, or expected value $E(X)$, is computed by
$$E(X) = \int_{-\infty}^{+\infty} x\, f_X(x)\,dx$$
The variance of X, denoted by $\sigma^2$ or $\mathrm{Var}(X)$, takes the form
$$\sigma^2 = \mathrm{Var}(X) = E\left[(X - \mu_X)^2\right] = \int_{-\infty}^{+\infty} (x - \mu_X)^2 f_X(x)\,dx$$
The standard deviation of X is $\sigma$.
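The two integrals can be checked numerically. A minimal sketch, assuming a normal PDF with mean 100 and standard deviation 0.1/3 (the values used in this lecture's MATLAB procedure) and replacing the infinite limits with μ ± 10σ, outside of which the density is negligible:

```matlab
mu = 100; sigma = 0.1/3;                   % assumed parameters of the PDF
x = linspace(mu - 10*sigma, mu + 10*sigma, 20001);   % truncated integration grid
f = exp(-0.5*((x - mu)/sigma).^2)/(sqrt(2*pi)*sigma);

m = trapz(x, x.*f);                        % E(X)   = integral of x f_X(x) dx
v = trapz(x, (x - m).^2.*f);               % Var(X) = integral of (x - mu_X)^2 f_X(x) dx
s = sqrt(v);                               % standard deviation

fprintf('mean = %.6f, variance = %.6e, std = %.6f\n',m,v,s)
```

The trapezoidal rule is used here only because it needs no toolbox; any quadrature routine would serve.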
[Figure: probability distributions of random variables with different means (μ) and standard deviations (σ)]
Group discussion (4 min): Assume that X follows the distribution
specified by the following PDF. Show that E(X) = μ.
$$f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right]$$
This is the normal distribution (the most widely used continuous distribution).
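Solution
It is equivalent to show that E(X − μ) = 0.
$$E(X-\mu) = \int_{-\infty}^{+\infty} (x-\mu)\, f_X(x)\,dx = \int_{-\infty}^{+\infty} (x-\mu)\, \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right] dx$$
$$= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} \frac{x-\mu}{\sigma} \exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right] dx$$
If we replace the term (x − μ)/σ with a new variable z, so that dx = σ dz, we get
$$E(X-\mu) = \frac{\sigma}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} z \exp\left(-\frac{1}{2}z^2\right) dz = -\frac{\sigma}{\sqrt{2\pi}} \left[\exp\left(-\frac{1}{2}z^2\right)\right]_{-\infty}^{+\infty} = 0$$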
Normal Distribution
A normal random variable with μ = 0 and σ² = 1 is called a
standard normal random variable, denoted by Z.
The CDF of a standard normal random variable is
$$\Phi(z) = P(Z \le z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}\tau^2\right) d\tau$$
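The integral defining Φ(z) has no closed form, but it can be written with the complementary error function as Φ(z) = ½ erfc(−z/√2). A minimal sketch using base MATLAB follows; the name Phi and the test points are illustrative, and with the Statistics and Machine Learning Toolbox, normcdf(z) serves the same purpose.

```matlab
Phi = @(z) 0.5*erfc(-z/sqrt(2));           % standard normal CDF

z = [-1.96 0 1.96];
p = Phi(z);                                % approximately [0.025 0.5 0.975]

% A general normal X with mean mu and standard deviation sigma is handled by
% standardizing: P(X <= x) = Phi((x - mu)/sigma)
mu = 100; sigma = 0.1/3;                   % assumed values from the beam example
p_x = Phi((100.05 - mu)/sigma);            % P(L <= 100.05 in) = Phi(1.5)
```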
Group discussion (4 min): Assume that a system performance
function G follows a normal distribution with μ = −3
and σ² = 4.
What is the reliability of the system?
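The slide does not state how reliability is defined in terms of G. A sketch of the calculation under a labeled assumption: if the system is taken to be safe when G ≤ 0, the reliability is R = P(G ≤ 0) = Φ((0 − μ)/σ) with σ = √4 = 2. Under the opposite convention (safe when G > 0) the answer would be 1 − R.

```matlab
Phi = @(z) 0.5*erfc(-z/sqrt(2));           % standard normal CDF (base MATLAB)

mu_G = -3;                                 % mean of G
sigma_G = sqrt(4);                         % standard deviation (variance = 4)

R_le = Phi((0 - mu_G)/sigma_G);            % ASSUMPTION: safe when G <= 0, R = Phi(1.5)
R_gt = 1 - R_le;                           % alternative convention: safe when G > 0

fprintf('P(G <= 0) = %.4f, P(G > 0) = %.4f\n',R_le,R_gt)
```

This gives P(G ≤ 0) ≈ 0.9332 and P(G > 0) ≈ 0.0668; which of the two is the reliability depends on the sign convention adopted for G.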
Example on Histogram and its Relation with PDF
How can the parameters of a normal distribution be estimated?
Use maximum likelihood estimation (MLE) to estimate the
distributional parameters.
Consider the case of a continuous random variable X and assume that we
have a set of random samples x1, x2,…, xM of X.
MLE finds a point estimate of distributional parameters θ that maximizes the
probability (or likelihood) of obtaining the set of random samples.
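The likelihood function takes the form
$$L(\boldsymbol{\theta}) = f_X(x_1;\boldsymbol{\theta})\, f_X(x_2;\boldsymbol{\theta}) \cdots f_X(x_M;\boldsymbol{\theta}) = \prod_{i=1}^{M} f_X(x_i;\boldsymbol{\theta})$$
Log transform for convenience:
$$\ln L(\boldsymbol{\theta}) = \sum_{i=1}^{M} \ln f_X(x_i;\boldsymbol{\theta})$$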
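For the normal distribution, setting the derivatives of ln L(θ) with respect to μ and σ to zero gives the closed-form estimates μ̂ = (1/M) Σ x_i and σ̂² = (1/M) Σ (x_i − μ̂)². The sketch below, an illustration with artificially generated measurements rather than the lecture's own code, evaluates the log-likelihood at these estimates and at a few perturbed parameter values to confirm that the estimates give the largest value. The sample size, the generating parameters (100 and 0.1/3), and all variable names are assumptions.

```matlab
M = 1000;
x = 100 + (0.1/3)*randn(1,M);              % artificial measurements (assumed true values)

% Log-likelihood of a normal distribution: sum of ln f_X(x_i; mu, sigma)
logL = @(mu,sigma) sum(-0.5*log(2*pi) - log(sigma) - 0.5*((x - mu)/sigma).^2);

mu_hat = mean(x);                          % closed-form MLE of mu
sigma_hat = sqrt(mean((x - mu_hat).^2));   % closed-form MLE of sigma (divides by M)
L_hat = logL(mu_hat,sigma_hat);

% Log-likelihood at perturbed parameter values; each should be below L_hat
L_near = [logL(mu_hat + 0.1*sigma_hat, sigma_hat), ...
          logL(mu_hat - 0.1*sigma_hat, sigma_hat), ...
          logL(mu_hat, 1.1*sigma_hat), ...
          logL(mu_hat, 0.9*sigma_hat)];

fprintf('mu_hat = %.4f, sigma_hat = %.4f\n',mu_hat,sigma_hat)
disp(all(L_near < L_hat))                  % displays 1 (true)
```

Working with ln L instead of L avoids multiplying 1000 density values together, which would overflow or underflow in floating point.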
Procedure to Plot a PDF on top of a Histogram in MATLAB
Step 4: Fit a normal distribution to the random samples (i.e.,
estimate the parameters of a normal distribution using MLE).
>> pd = fitdist(L_beam','Normal');         % fitdist expects a column vector, hence the transpose
Step 5: Superimpose the density of the fitted normal distribution
>> line(grid,pdf(pd,grid),'color','b');
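One caveat worth checking against the MATLAB documentation: for uncensored data, fitdist with 'Normal' is documented as returning the square root of the unbiased variance estimate (division by M − 1) for sigma, whereas the MLE divides by M; the mle function returns the latter. The two differ by a factor of sqrt((M − 1)/M), which is negligible for 1000 samples. A sketch in base MATLAB, with samples generated as in Step 1 and hypothetical variable names:

```matlab
ns = 1000;
L_beam = 100 + (0.1/3)*randn(1,ns);        % same distribution as Step 1, via randn

mu_hat = mean(L_beam);                     % identical for both estimators
s_unbiased = std(L_beam);                  % divides by ns - 1
s_mle = std(L_beam,1);                     % divides by ns (maximum likelihood)

ratio = s_mle/s_unbiased;                  % equals sqrt((ns - 1)/ns)
fprintf('unbiased: %.6f, MLE: %.6f, ratio: %.6f\n',s_unbiased,s_mle,ratio)
```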
[Figure: normalized histogram of the random samples with two curves overlaid, the true distribution (used to generate the random samples) and the fitted distribution; y-axis: probability density (in⁻¹)]