Slides Data Analysis
Slides Data Analysis
Eric Marsden
<[Link]@[Link]>
curve
fitting
data probabilistic model consequence model
risks
Where does this fit into risk engineering?
curve
fitting
data probabilistic model consequence model
risks costs
criteria
decision-making
Where does this fit into risk engineering?
curve
fitting
data probabilistic model consequence model
These slides
risks costs
criteria
decision-making
Descriptive statistics
Descriptive statistics
Source: [Link]
Measures of central tendency
▷ The median is the point in the distribution where half the values are
lower, and half higher
• it’s the 0.5 quantile
▷ The mode is the element that occurs most frequently in your data
Illustration: fatigue life of aluminium sheeting
Source: Birnbaum, Z. W. and Saunders, S. C. (1958), A statistical model for life-length of materials, Journal of the American Statistical
Association, 53(281)
Aside: sensitivity to outliers
Note: the mean is quite sensitive to outliers, the median much less.
▷ the median is what’s called a robust measure of central tendency
20
2 Count the number of observations in each
interval 15
interval
5
▷ The first quartile, at the 25th percentile, divides the 25% of observations
25% of
first ¼ of cases from the latter ¾ observations
▷ The second quartile, the median, divides the dataset in 25% of observations
half
25% of observations
A “box and whisker” plot or boxplot shows the spread of import [Link] as plt
1000
▷ the lower whisker: first datum > Q1 - 1.5×IQR
500
outliers
plot whiskers
Note that some people th
tly, to rep rese nt the 5th and 95
differen and
e, or eve n the min
percentiles for exampl
max valu es…
Violin plot
2500
Adds a kernel density estimation to a boxplot
2000
1500
import seaborn as sns
1000
[Link](cycles, orient="v")
[Link]("Cycles until failure") 500
0
Cycles until failure
Bias and precision
Precise Imprecise
Biased
Unbiased
A good estimator should be unbiased, precise and consistent (converge as sample size
increases).
Estimating values
population
mean
m
m
m A 90% confidence interval means that 10% of
m the time, the parameter of interest will not be
m included in that interval.
m Here, for a two-sided confidence interval.
m
m
m
m
Interpreting confidence intervals
population
mean
m
m
m A 90% confidence interval means that 10% of
m the time, the parameter of interest will not be
m included in that interval.
m Here, for a one-sided confidence interval.
m
m
m
m
Illustration: fatigue life of aluminium sheeting
population
sample
Statistical inference means deducing
information about a population by
examining only a subset of the
population (the sample).
▷ If you don’t want to make too many assumptions, a technique called the
bootstrap can help
▷ General idea:
• I want to determine how much uncertainty is generated by the fact that I only
have a limited sample of my full population
• If I had a “full” sample, I could extract a large number of limited samples and
examine the amount of variability in those samples (how much does my
parameter of interest vary across these samples?)
• I only have a limited sample, but I can look at a large number of limited
samples from my own limited sample, by sampling with replacement
Bootstrap methods
▷ Instead of fitting your model to the original 𝑋 and 𝑦, you fit your model to
resampled versions of 𝑋 and 𝑦 many times
1 Take a large number of samples of the same size as our original dataset, by
sampling with replacement
2 For each sample, calculate the parameter of interest (eg. the mean)
3 Calculate the relevant percentile from the distribution of the parameter of interest
Let’s calculate the bootstrapped mean and associated 95% confidence interval for
the Birnbaum and Saunders (1958) fatigue data.
▷ For example
• [Link](data)
• [Link](data)
• [Link](data)
• [Link](data)
0.04
0.03
import pandas
import [Link] as plt 0.02
data = pandas.read_csv("[Link]", header=None)
vangel = data[0] 0.01
N = len(vangel)
0.00
[Link](vangel, density=True, alpha=0.5) 10 20 30 40 50 60
[Link]("Vangel material data (n={})".format(N)) Specimen strength
[Link]("Specimen strength")
for a Weibull distribution that shows the best fit to the 0.03
0.00
from [Link] import weibull_min 10 20 30 40 50 60
[Link](vangel, density=True, alpha=0.5) Specimen strength
0.6
import [Link]
0.4
ecdf = [Link](vangel)
[Link](x, ecdf(x), label="Empirical CDF") 0.2
50
Distributions generally differ the most in the tails, so the
Ordered Values
40
position of the points at the left and right extremes of the
plot are the most important to examine. 30
20
10 20 30 40 50
from [Link] import probplot, weibull_min Theoretical quantiles
probplot(vangel, \
dist=weibull_min(shape,loc,scale),\
plot=[Link]().add_subplot(111))
[Link]("Weibull probability plot of Vangel data")
Example: material strength data
We can use the fitted distribution (the model for our observations)
to estimate confidence intervals concerning the material strength.
For example, what is the minimal strength of this material, with a
99% confidence level? It’s the 0.01 quantile, or the first percentile
of our distribution.
(The K-S test also returns a p-value, which describes the statistical significance of the D
statistic. However, the p-value is not valid when testing against a model that has been
fitted to the data, so we will ignore it here.)
Exercise
Problem
We have the following field data for time to failure of a pump (in hours): 3568
2599 3681 3100 2772 3272 3529 1770 2951 3024 3761 3671 2977 3110 2567 3341
2825 3921 2498 2447 3615 2601 2185 3324 2892 2942 3992 2924 3544 3359 2702
3658 3089 3272 2833 3090 1879 2151 2371 2936
What is the probability that the pump will fail after it has worked for at least
2000 hours? Provide a 95% confidence interval for your estimate.
Exercise
Solution (1/3)
import [Link]
import [Link] as plt
Distribution of failure times (hours)
0.0008
obs = [3568, 2599, 3681, 3100, 2772, 3272, 3529, 1770, 2951, \ 0.0006
3024, 3761, 3671, 2977, 3110, 2567, 3341, 2825, 3921, 2498, \ 0.0004
2447, 3615, 2601, 2185, 3324, 2892, 2942, 3992, 2924, 3544, \ 0.0002
3359, 2702, 3658, 3089, 3272, 2833, 3090, 1879, 2151, 2371, 2936] 0.0000
2000 2500 3000 3500 4000 4500 5000
Solution (2/3)
Distribution of failure times (hours)
0.0008
0.0004
Ordered Values
3500
2500
fig = [Link]().add_subplot(111)
2000
2000 2500 3000 3500 4000
Theoretical quantiles
Solution (3/3)
def failure_prob(observations):
mu, sigma = [Link](observations)
return [Link](mu, sigma).cdf(2000)
0.05
We can attempt to fit a normal distribution to the data, and 0.06 K-S test: D=1.0, p=0.0
0.03
not really a very good choice).
0.02
0.01
0.00
shape, loc = [Link](wind) 10 20 30 40 50
fitted = [Link](shape, loc) Wind speed (km/h)
[Link](wind, density=True, alpha=0.5)
x = [Link]([Link](), [Link](), 100)
[Link](x, [Link](x))
Normal probability plot of wind speed
[Link]("Normal fit on TLS 2013 wind speed data")
50
[Link]("Wind speed (km/h)")
40
[Link](wind, dist=fitted,
Ordered Values
plot=[Link]().add_subplot(111))
30
[Link]("Normal probability plot of wind speed")
20
Indeed, the probability plot shows quite a poor fit for the 10
distributions. 0 10 20 30
Theoretical quantiles
Illustration: fitting a distribution to wind speed data
Lognormal fit on TLS 2013 wind speed data
0.07
0.04
and examine a probability plot. 0.03
0.02
shape, loc, scale = [Link](wind) 0.01
fitted = [Link](shape, loc, scale)
0.00
[Link](wind, density=True, alpha=0.5) 10 20 30 40 50
x = [Link]([Link](), [Link](), 100) Wind speed (km/h)
[Link](x, [Link](x))
[Link]("Lognormal fit on TLS 2013 wind speed data") Lognormal probability plot of wind speed
[Link]("Wind speed (km/h)") 50
[Link](wind, dist=fitted, 40
Ordered Values
plot=[Link]().add_subplot(111))
30
[Link]("Lognormal probability plot of wind speed")
20
0
0 10 20 30 40
Theoretical quantiles
Exercise
▷ Download heat flow meter data collected by B. Zarr (NIST, 1990)
→ https:
//[Link]/div898/handbook/eda/section4/[Link]
▷ Calculate the sample mean and the estimated population mean using the
bootstrap technique
▷ Estimate the 95% confidence interval for the population mean, using the
bootstrap technique
Analyzing
data
TLS 2013 wind speed data
0.07
0.05
0.04
0.03
0.02
0.01
0.00
10 20 30 40 50
Wind speed (km/h)
▷ Import wind speed data for Toulouse airport 50
Normal probability plot of wind speed
40
Ordered Values
▷ 30
20
10
0 10 20 30
Theoretical quantiles
▷ Does the data seem to follow a normal distribution?
• use a probability plot to check 50
Weibull probability plot of wind speed
40
Ordered Values
▷ 30
20
0
0 10 20 30
Theoretical quantiles
[Link]/hadobs/hadcrut4/
−0.5
@LearnRiskEng
[Link]/RiskEngineering
Was some of the content unclear? Which parts were most useful to
you? Your comments to feedback@[Link]
(email) or @LearnRiskEng (Twitter) will help us to improve these
materials. Thanks!