Datasets can be standardized using the mean
and standard deviation
• Each data point minus the mean, divided by the standard deviation:
  (x − mean) / SD
• Standardization allows for comparison of data with different units and
variances
• “My test 1 score was 12 points higher than the class average (mean 72, SD 8).
My test 2 score was also 12 points higher than the class average (mean 72, SD
4)”
• I scored 1.5 SD higher and 3 SD higher than the class average on tests 1 and 2,
respectively.
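The two test scores from the example can be standardized in a couple of lines of R (the raw score of 84 is just mean + 12 from the slide):

```r
# Standardize the two test scores: both are 12 points above a mean of 72
z1 <- (84 - 72) / 8   # test 1: SD 8
z2 <- (84 - 72) / 4   # test 2: SD 4
z1  # 1.5 SDs above the class average
z2  # 3 SDs above the class average
```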
When data are assumed to be normal, standardization becomes
even more powerful because the total area under the curve
(AUC) is equal to 1
• You can know the probability of
values more extreme than a
certain criterion
• For instance, the probability of a
value being more than 3
standard deviations above the
mean is only 0.001 (0.1%)
• p(>0sd)=0.500
• p(>1sd)=0.159
• p(>2sd)=0.023
• p(>3sd)=0.001
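The tail probabilities listed above can be reproduced with R's pnorm(), which returns the AUC to the left of a given z-score:

```r
# Probability of a value falling more than 0, 1, 2, or 3 SDs above the mean
p <- 1 - pnorm(c(0, 1, 2, 3))
round(p, 3)  # 0.500 0.159 0.023 0.001
```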
Inferential statistics starts with descriptive
statistics
                     Population   Sample
Mean                     μ          X̄
Standard deviation       σ          s
The cost of using a sample = 1 degree of freedom
• Population variance (𝜎 2 ) is just the average of SS (i.e., SS divided by
N). But we can almost never calculate 𝜎 2 , so instead we estimate 𝜎 2
by calculating a sample’s variance (𝑠 2 ).
• Luckily, the equation for 𝑠 is almost identical to 𝜎:
  s = √( Σᵢ₌₁ᴺ (xᵢ − X̄)² / (n − 1) )
• The denominator represents the degrees of freedom
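A quick simulation (a sketch; the population and sample size here are arbitrary) shows why the denominator is n − 1 rather than n: dividing SS by n systematically underestimates σ²:

```r
set.seed(1)          # for reproducibility
sigma2 <- 4          # true population variance (SD = 2)
n <- 5               # small samples exaggerate the bias
reps <- 10000

biased   <- numeric(reps)
unbiased <- numeric(reps)
for (i in 1:reps) {
  x <- rnorm(n, mean = 0, sd = 2)
  ss <- sum((x - mean(x))^2)     # sum of squares around the sample mean
  biased[i]   <- ss / n          # divides by n: too small on average
  unbiased[i] <- ss / (n - 1)    # divides by df: unbiased estimate of sigma2
}
mean(biased)    # noticeably below 4
mean(unbiased)  # close to 4
```

The biased version comes up short because the sample mean is itself estimated from the data, which eats one degree of freedom.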
Degrees of freedom
• Describes the number of
scores in a sample that are
independent and free to
vary.
•https://www.youtube.com/watch?v=wsvfasNpU2s
Degrees of Freedom: Real-World Scenario
Equations for sample statistics:
• Mean:
  X̄ = Σᵢ₌₁ᴺ xᵢ / n = (sum of all observations) / (# of observations)
• Variance:
  s² = Σᵢ₌₁ᴺ (xᵢ − X̄)² / (n − 1) = (sum of squares) / (degrees of freedom)
• Standard deviation:
  s = √(s²)
• Degrees of freedom = # of observations minus # of estimates used
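These equations map directly onto R's built-in functions; a small check on a made-up vector:

```r
x <- c(2, 4, 6, 8, 10)
n <- length(x)

m <- sum(x) / n                 # mean: sum of observations / n
v <- sum((x - m)^2) / (n - 1)   # variance: SS / degrees of freedom
s <- sqrt(v)                    # standard deviation

m  # 6
v  # 10; matches var(x), which also divides by n - 1
```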
Does a sample represent the population?
• We want our data to be normally distributed so that we can use things
like the standard normal distribution to calculate probabilities.
• We can assess the data’s normality using Q-Q plots.
• https://youtu.be/okjYjClSjOg
• We hope the sample’s distribution reflects the underlying population’s
distribution (i.e., that it’s normal)
• The central limit theorem allows us to be fairly relaxed about how
strictly the data must meet the definition of normal.
Central Limit Theorem
https://gallery.shinyapps.io/CLT_mean/
If you have a population with mean μ and standard deviation σ and
take sufficiently large random samples from the population with
replacement, then the distribution of the sample means will be
approximately normally distributed.
Central Limit Theorem
• The sampling distribution of sample means will have…
• the same mean as the population
• a standard deviation (i.e., the standard error) that gets smaller as the sample
size increases
• a shape that becomes more and more normal as the sample size increases
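The three properties above can be seen in a small simulation (a sketch with an arbitrary, deliberately skewed population; exact numbers will vary with the seed):

```r
set.seed(42)
# A decidedly non-normal population: exponential with mean 1 and SD 1
n <- 50        # sample size
reps <- 10000  # number of samples

sample_means <- replicate(reps, mean(rexp(n, rate = 1)))

mean(sample_means)  # close to the population mean (1)
sd(sample_means)    # close to sigma / sqrt(n) = 1 / sqrt(50), about 0.141
hist(sample_means)  # roughly bell-shaped despite the skewed population
```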
Practice tasks/questions
1. In R, type: sample(1:100,10,replace=TRUE) to get a random set of
numbers. Manually calculate the mean, median, range, variance, and
standard deviation of these 10 numbers.
• Change 1:100 and 10 to whatever values you want!
• Keep doing this until you’re comfortable with all calculations!
2. In the above examples, variance and standard deviation estimates should
use df=n-1. Why is that? In your answer, use a reference to the R code
used to create the samples.
3. Using the CLT app, assess the impact of changing sample size and
number of samples. Why do we care more about one of these two?
(hint: in reality, which of these two do we have more control of?)
Relevant Code
mean() # calculate mean
median() # calculate median
var() # calculate sample variance
sd() # calculate sample standard deviation
length() # calculate number of observations
boxplot() # make boxplot
hist() # make histogram
rnorm() # randomly sample from a normal population
sample() # sample from provided dataset
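The functions above cover everything needed for practice task 1; a worked sketch with a fixed seed so the draw is reproducible (your manual calculations should match the built-ins):

```r
set.seed(3)
x <- sample(1:100, 10, replace = TRUE)  # 10 random integers from 1 to 100

mean(x)    # compare against your manual calculation
median(x)
range(x)   # smallest and largest values
var(x)     # sample variance (divides SS by n - 1)
sd(x)      # sample standard deviation
hist(x)    # quick look at the distribution
```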
Questions/Comments
Foundations for inference
PNB 3XE3 – Fall 2024
Crump et al. (5)
Illowsky & Dean (6-8)
Lane (7, 9-11)
Objectives
• Understand Central Limit Theorem
• Explain confidence intervals
Sampling Distribution of Sample Means (SDSM)
• We collect data from a sample of n observations.
• We estimate the mean and standard deviation of the SDSM using that
sample's mean and standard deviation (divide s by √n to get the
SDSM's standard deviation).
• We estimate the population's mean (μ) using the SDSM.
Why even deal with the SDSM?
Two reasons:
1. We can create “confidence intervals”
2. We can assess the likelihood that two (or more) samples belong to
a single population*
*actually, this is technically not true. We will revisit it later.
Confidence intervals
• Recall that the sampling distribution of sample means:
• Is approximately normally distributed
• Has a standard deviation called the standard error of the mean (SEM)
• Reflects a distribution that includes the population mean
• Since the SDSM includes the population
mean, and we know its AUC, then we
can identify a range that probably
contains the population mean.
• This is the confidence interval.
Calculating Confidence Intervals
• Formula: CI = X̄ ± z × (s / √n)
• Where X̄ is the sample mean, s is the sample standard deviation, n is
the sample size, and z is the z-score for the desired probability.
• Given that s/√n is the standard error, we can also say:
  CI = X̄ ± (z × s_X̄)
• Where s_X̄ is the standard error, AKA the standard deviation of the
sampling distribution of sample means.
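The formula translates into a few lines of R; conf_int here is a hypothetical helper name (not a built-in), and the data are made up for illustration:

```r
# z-based confidence interval for a sample mean: CI = mean ± z * SEM
conf_int <- function(x, z = 1.96) {
  m   <- mean(x)
  sem <- sd(x) / sqrt(length(x))  # standard error of the mean
  c(lower = m - z * sem, upper = m + z * sem)
}

conf_int(c(4.1, 5.3, 4.8, 5.0, 4.6, 5.2))  # illustrative data
```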
Deriving the CI formula
• Recall the formula for z-score:
  z = (observation − mean) / (standard deviation) = (X − X̄) / s_X̄
• Re-arrange: z × s_X̄ = X − X̄, so X = X̄ + z × s_X̄
• Critically, we want to define an interval that centres on the mean, so let's find a z-score
whose ± will result in the desired probability:
  X₁ = X̄ − z × s_X̄    X₂ = X̄ + z × s_X̄    →    CI = X̄ ± z × s_X̄
• Typically, we like to cover 95% probability, so the z-score for that would be ±1.96
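The ±1.96 figure comes straight from the normal quantile function: for 95% coverage, 2.5% of the probability sits in each tail.

```r
# z-score leaving 2.5% in the upper tail (so ±z covers the middle 95%)
z <- qnorm(0.975)
round(z, 2)  # 1.96
```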
Example: Calculating a confidence interval
Stanley Milgram measured the level of obedience in individuals. In an effort to
replicate it, we brought students into the lab and measured their obedience using
an adapted, more ethical version of Milgram’s study. Obedience scores were
collected from 30 participants.
E.g., calculating a confidence interval
Data: 10.5 22.9 14.6 18.5 24.9 17.1 13.0 23.9 19.4 20.1 25.5 22.9 15.0 17.4 19.0
15.9 9.3 15.7 13.8 14.3 19.9 9.8 3.4 15.2 17.2 9.0 24.8 2.7 12.0 16.4
1. Calculate mean: X̄ = Σᵢ₌₁ᴺ xᵢ / n = 16.14
2. Calculate sample SD: s = √( Σᵢ₌₁ᴺ (xᵢ − X̄)² / (n − 1) ) = 5.89
3. Calculate SEM: s_X̄ = s / √n = 1.07
4. Calculate 95% CI: CI₉₅% = X̄ ± 1.96 × s_X̄ = [14.03, 18.24]
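The worked example can be reproduced in R with the obedience scores from the slide (small rounding differences aside):

```r
scores <- c(10.5, 22.9, 14.6, 18.5, 24.9, 17.1, 13.0, 23.9, 19.4, 20.1,
            25.5, 22.9, 15.0, 17.4, 19.0, 15.9, 9.3, 15.7, 13.8, 14.3,
            19.9, 9.8, 3.4, 15.2, 17.2, 9.0, 24.8, 2.7, 12.0, 16.4)

m   <- mean(scores)              # step 1: sample mean
s   <- sd(scores)                # step 2: sample SD (divides SS by n - 1)
sem <- s / sqrt(length(scores))  # step 3: standard error of the mean
ci  <- m + c(-1.96, 1.96) * sem  # step 4: 95% CI

round(m, 2)   # 16.14
round(ci, 2)  # approximately [14.03, 18.24]
```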
E.g., interpreting a confidence interval
• The 95% CI for the sample mean is from 14.03 to 18.24.
• One way to think about the calculated CI: in reality, the population
mean is NOT the exact same as this sample's mean. Imagine we
magically knew that the population mean was exactly 17, and we
repeated the experiment 100 times. The calculated CI [14.03, 18.24]
would be just one of these possibilities.