ConREM Master Programme
Helsinki Metropolia University of Applied Sciences / HTW Berlin
Advanced Mathematical Methods in Economics and Management
Sampling, estimation and confidence intervals
A population is the set of objects that a statistical analysis is intended to describe. If we
want to know, for example, the average of some quantity attached to each object in a large
population, it is usually impossible or at least inconvenient (too laborious or too expensive)
to observe every object separately. Instead, a limited number of objects, a sample, can be
picked from the population, and the average (and other statistical parameters) can be
estimated from the sample.
Point estimates
The best estimate (“the best guess”) or, in correct terminology, the unbiased estimate of the
mean (average) of the values in the whole population is the mean of the values in the
sample. This is the sample mean, defined as
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]
where $n$ is the sample size and $x_i$, $i = 1, \dots, n$, are the values in the sample.
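The usual estimate of the population standard deviation is the sample standard deviation
(its square, the sample variance, is an unbiased estimate of the population variance):
\[ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} \]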
This differs slightly from the formula for the population standard deviation (which is to be
used if all values in the population are known):
\[ \sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}} \]
With large sample sizes the difference between them is small, but with small sample sizes the
latter formula, applied to a sample, underestimates the standard deviation. In Excel, STDEV.S
gives the sample standard deviation and STDEV.P gives the population standard deviation.
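As a minimal sketch of the point estimates above, the following Python snippet computes the sample mean and both standard deviations, first directly from the formulas and then with the standard library's statistics module. The data values are made up for illustration; statistics.stdev divides by n − 1 (like STDEV.S) and statistics.pstdev divides by n (like STDEV.P).

```python
import math
import statistics

# Hypothetical sample values, chosen only for illustration.
sample = [4.0, 7.0, 5.0, 9.0, 5.0]
n = len(sample)

x_bar = sum(sample) / n                              # sample mean
squared_dev = sum((x - x_bar) ** 2 for x in sample)  # sum of squared deviations

s = math.sqrt(squared_dev / (n - 1))   # sample standard deviation
sigma_n = math.sqrt(squared_dev / n)   # population formula (divides by n)

# The standard library functions give the same results.
assert math.isclose(s, statistics.stdev(sample))
assert math.isclose(sigma_n, statistics.pstdev(sample))

print(f"mean = {x_bar:.3f}, s = {s:.3f}, sigma = {sigma_n:.3f}")
# prints: mean = 6.000, s = 2.000, sigma = 1.789
```

With only five values the two results differ visibly (2.000 versus 1.789); the gap shrinks as n grows.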
Note that the formula of the population standard deviation has the same form as the formula
of the standard deviation of a probability distribution,
\[ DX = \sqrt{\sum_{i=1}^{n} p_i (x_i - EX)^2}, \]
but with the expected value $EX$ replaced by the mean $\bar{x}$ and the probabilities $p_i$ by $\frac{1}{n}$.
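To make the substitution explicit: setting $p_i = \frac{1}{n}$ and $EX = \bar{x}$ in the distribution formula gives the population formula directly.

```latex
\[
DX = \sqrt{\sum_{i=1}^{n} \frac{1}{n}\,(x_i - \bar{x})^2}
   = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}
   = \sigma
\]
```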
Interval estimates
If the process is repeated and another sample is picked, its average will probably differ
slightly from that of the first sample. When several samples are collected, different sample
means (and sample standard deviations) are obtained. So there is uncertainty in the
estimated mean (and in the standard deviation and any other estimated parameters, but here
we concentrate on estimating the mean). That is why it makes sense to estimate the
population mean with an interval rather than a single value.
It can be proved (the central limit theorem) that the sample means of several random samples
are approximately normally distributed when the sample size is large enough, regardless of
the original distribution of the sampled variable itself, provided its variance is finite. This
is the basis of mean estimation methods. The expected value of the sample mean is the
population mean. That is why the sample mean is called an unbiased estimate of the
population mean. (The file CI_illustration.pdf illustrates the difference between the original
distribution and the distribution of the sample mean over several samples.) The standard
deviation of the distribution of sample means, called the standard error, is not equal to the
standard deviation of the original distribution. If the standard deviation of the population (or
whole distribution) is σ, then the standard error is $\frac{\sigma}{\sqrt{n}}$, where $n$ is the sample size.
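The standard error formula can be checked with a small simulation. The sketch below is only an illustration under assumed settings: it draws repeated samples from an exponential distribution (chosen because it is clearly not normal, with mean 1 and standard deviation 1), and the sample size and number of repetitions are arbitrary choices.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

n = 50              # sample size (illustrative choice)
num_samples = 5000  # number of repeated samples (illustrative choice)
sigma = 1.0         # standard deviation of the exponential distribution with rate 1

# Draw many samples and record the mean of each one.
sample_means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(num_samples)
]

observed_se = statistics.stdev(sample_means)  # spread of the sample means
theoretical_se = sigma / math.sqrt(n)         # sigma / sqrt(n)

print(f"mean of sample means:       {statistics.fmean(sample_means):.3f}")
print(f"observed standard error:    {observed_se:.4f}")
print(f"theoretical standard error: {theoretical_se:.4f}")
```

The mean of the sample means should land close to the population mean 1, and the observed spread close to σ/√n, even though the individual values are far from normally distributed.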
For instance, to investigate the mean income in a country we can pick a sample of, say,
2000 employees and ask them their salary. The average of the salaries in the sample is then
the estimate of the mean income in the population (the set of all employees in the
country). It is the best estimate available, but an interesting question is: how good is the
estimate?
First, the sample should be selected sensibly: it should represent the population well. If all
employees are selected from the same company or the same city, the sample isn’t
representative. This topic, proper sampling methods, is not discussed in detail in this course.
Let’s suppose that the sampling is done properly. Then the quality of the estimate
depends on the sample size and on chance. The risk of error can be calculated and expressed
exactly using confidence intervals. A confidence interval with confidence level p is an
interval, centered on the sample mean, that includes the population mean with estimated
probability p. (Note: The previous sentence, which associates the confidence level with probability, has been
criticized. More exactly, the confidence level should be understood in the frequentist sense: if sampling and
calculation of a confidence interval of confidence level p are repeated many times, then a proportion p of all the
confidence intervals contain the population mean. Once we have determined a specific confidence interval, we
never know whether that interval contains the population mean or not.)
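The frequentist reading in the note can be illustrated by simulation. The following sketch rests on assumptions made only for illustration: the population is normal with a known standard deviation, and the values of mu, sigma, n and the number of repetitions are invented. It counts how often the interval "sample mean ± margin" contains the population mean.

```python
import math
import random
from statistics import NormalDist, fmean

random.seed(2)  # fixed seed so the run is repeatable

mu, sigma = 100.0, 15.0  # hypothetical population mean and standard deviation
n = 40                   # sample size
level = 0.95             # confidence level p
repeats = 4000           # number of repeated samples

# Two-sided critical value of the standard normal distribution (about 1.96).
z = NormalDist().inv_cdf(1 - (1 - level) / 2)
margin = z * sigma / math.sqrt(n)  # margin of error with known sigma

hits = 0
for _ in range(repeats):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    x_bar = fmean(sample)
    # Does this particular interval contain the population mean?
    if x_bar - margin <= mu <= x_bar + margin:
        hits += 1

coverage = hits / repeats
print(f"proportion of intervals containing the population mean: {coverage:.3f}")
```

The printed proportion should be close to 0.95. In practice mu is unknown, so for any single interval we cannot tell whether it is one of the hits.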
For mathematical details to determine confidence intervals and examples, see file
Determining CI for [Link], and website [Link]
(especially "Practical example") and the Excel file CI_example.xlsx.
Confidence intervals can also be determined for other population parameters, e.g.
percentages. For example, if a poll finds that 16% of voters support party A and the
margin of error of the poll is two percentage points at confidence level 0.95, then the real
proportion of supporters of party A very probably lies in the interval [14%, 18%], with
estimated probability 0.95. There is still a 5% chance that the proportion is not in
that interval.
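The text does not say how the poll's margin of error was obtained. A common approach, assumed here, is the normal approximation for a proportion, margin = z·√(p(1 − p)/n). In the sketch below the function name proportion_margin and the sample size of 1291 respondents are hypothetical; the sample size was chosen so that the margin comes out at about two percentage points.

```python
import math
from statistics import NormalDist

def proportion_margin(p_hat: float, n: int, level: float = 0.95) -> float:
    """Margin of error for a sample proportion (normal approximation)."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

p_hat = 0.16  # poll result from the text
n = 1291      # hypothetical number of respondents (not given in the text)

margin = proportion_margin(p_hat, n)
lower, upper = p_hat - margin, p_hat + margin
print(f"margin of error: {margin:.4f}")
print(f"interval: [{lower:.3f}, {upper:.3f}]")
```

Under this approximation, halving the margin of error requires roughly four times as many respondents, because n appears under a square root.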
A few words about Excel commands
In current versions of Excel (from version 2010 onwards) there are two commands relating
to confidence intervals for the mean, both of which return the margin of error (the difference
between the sample mean and the upper/lower limit of the confidence interval):
- [Link] can be used when the population standard deviation is
known. The command CONFIDENCE from older Excel versions is still available for
compatibility, and it gives the same results as [Link].
- CONFIDENCE.T can be used when the population standard deviation is
unknown. It is based on the so-called Student t-distribution. This command isn’t included in
earlier versions of Excel. However, with a rather large sample size
[Link] and CONFIDENCE.T give approximately equal results.
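The margin of error in both cases has the form critical value × standard deviation / √n; only the critical value differs. The snippet below is an illustrative sketch, not Excel's implementation: the function name margin_of_error and the numbers are assumptions, the normal critical value comes from Python's statistics.NormalDist, and the Student t critical values are copied from a standard t-table because the Python standard library has no t-distribution.

```python
import math
from statistics import NormalDist

def margin_of_error(critical_value: float, std_dev: float, n: int) -> float:
    """Margin of error = critical value * standard deviation / sqrt(n)."""
    return critical_value * std_dev / math.sqrt(n)

alpha = 0.05                             # 1 - confidence level
z = NormalDist().inv_cdf(1 - alpha / 2)  # normal critical value, about 1.96

# Two-sided 95% Student t critical values from a t-table, keyed by sample
# size n (the degrees of freedom are n - 1).
t_table = {10: 2.262, 100: 1.984}

std_dev = 2.0  # illustrative standard deviation
for n, t in t_table.items():
    normal_margin = margin_of_error(z, std_dev, n)
    t_margin = margin_of_error(t, std_dev, n)
    print(f"n = {n:3d}: normal {normal_margin:.3f}, t {t_margin:.3f}")
```

For n = 10 the t-based margin is noticeably wider than the normal-based one, while for n = 100 the two nearly coincide, in line with the remark about large sample sizes.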