Introduction to Statistics
Simone Tonin
1/109
Suggested Reading
Wackerly, Mendenhall III, and Scheaffer (2008).
Mathematical Statistics with Applications, 7th ed.,
Cengage.
Rice (2006). Mathematical Statistics and Data Analysis,
3rd ed., Cengage.
2/109
Some basic concepts (Chapter 1)
3/109
Some basic concepts
Data consist of information coming from observations,
counts, measurements, or responses.
Statistics is the science of collecting, organizing,
analyzing, and interpreting data in order to make decisions.
A population is the collection of all outcomes, responses,
measurements, or counts that are of interest. Populations
may be finite or infinite. If a population of values consists
of a fixed number of these values, the population is said to
be finite. If, on the other hand, a population consists of an
endless succession of values, the population is an infinite
one.
A sample is a subset of a population.
A parameter is a numerical description of a population
characteristic.
A statistic is a numerical description of a sample
characteristic.
4/109
Notation
                     Population   Sample
Size                 N            n
                     Parameter    Statistic
Mean                 µ            x̄
Variance             σ²           s²
Standard deviation   σ            s
Proportion           π            π̂
Correlation          ρ            r
5/109
Data collection
There are several ways of collecting data:
Take a census: a census is a count or measure of an entire
population. Taking a census provides complete
information, but it is often costly and difficult to perform.
Use sampling: a sample is a count or measure of a part of a
population. Statistics calculated from a sample (e.g. a
random sample) are used to estimate population parameters.
Use a simulation: collecting data often involves the use of
computers. Simulations allow studying situations that are
impractical or even dangerous to create in real life and
often save time and money.
Perform an experiment, e.g. to test the effect of imposing a
new marketing strategy, one could perform an experiment
by using the new marketing strategy in a certain region.
6/109
Branches of statistics
The study of statistics has two major branches, descriptive
statistics and inferential statistics:
Descriptive statistics is the branch of statistics that
involves the organization, summarization, and display of
data.
Inferential statistics is the branch of statistics that
involves using a sample to draw conclusions about a
population, e.g. estimation and hypothesis testing.
7/109
Types of data
Data sets can consist of two types of data:
Qualitative (categorical) data consist of attributes,
labels, or nonnumerical entries, e.g. names of cities, gender,
etc.
Quantitative data consist of numerical measurements or
counts, e.g. heights, weights, age. Quantitative data can
be distinguished as:
Discrete data result when the number of possible values is
either a finite number or a "countable" number, e.g. the
number of phone calls you receive on any given day.
Continuous data result from infinitely many possible
values that correspond to some continuous scale that covers
a range of values without gaps, interruptions, or jumps, e.g.
height, weight, sales and market shares.
8/109
Types of data (Econometrics)
Cross-sectional data: Data on different entities (e.g.
workers, consumers, firms, governmental units) for a single
time period. For example, data on test scores in different
school districts.
Time series data: Data for a single entity (e.g. person,
firm, country) collected at multiple time periods. For
example, the rate of inflation and unemployment for a
country over the last 10 years.
Panel data (also called longitudinal data): Data for
multiple entities in which each entity is observed at two or
more time periods. For example, the daily prices of a
number of stocks over two years.
9/109
Levels of measurement
Nominal: Categories only, data cannot be arranged in an
ordering scheme. (e.g. Marital status: single, married etc.)
Ordinal: Categories are ordered, but differences cannot be
determined or they are meaningless (e.g. poor, average,
good)
Interval: differences between values are meaningful, but
there is no natural starting point and ratios are meaningless
(e.g. we cannot say that a temperature of 80°F is twice as
hot as 40°F)
Ratio: like the interval level, but there is a natural zero
starting point and ratios are meaningful (e.g. £20 is twice
as much as £10)
10/109
Measures of Central Tendency (Chapter 2)
Measures of central tendency provide numerical information
about a ‘typical’ observation in the data.
The Mean (also called the average) of a data set is the sum
of the data values divided by the number of observations.
Sample mean: x̄ = (1/n) ∑_{i=1}^{n} x_i
The Median is the middle observation when the data set
is sorted in ascending or descending order. If the data set
has an even number of observations, the median is the
mean of the two middle observations.
The Mode is the data value that occurs with the greatest
frequency. If no entry is repeated, the data set has no
mode. If two (more than two) values occur with the same
greatest frequency, each value is a mode and the data set is
called bimodal (multimodal).
11/109
Measure of Dispersion
The variation (dispersion) of a set of observations refers to the
variability that they exhibit.
Range = maximum data value - minimum data value
The variance measures the variability or spread of the
observations from the mean.
Sample variance: s² = 1/(n−1) ∑_{i=1}^{n} (x_i − x̄)²
The standard deviation (s) of a data set is the square
root of the sample variance:
Standard deviation: s = √[ 1/(n−1) ∑_{i=1}^{n} (x_i − x̄)² ]
12/109
Coefficient of Variation
The coefficient of variation (CV) is defined as the ratio of the
standard deviation to the mean:
CV = σ/µ
or for the sample
CV = s/x̄
It is a measure of the dispersion of a distribution relative to its
mean. It also helps to compare two variables that are measured
with different scales (as CV has no unit).
13/109
Example 1: Accounting final exam grades
The accounting final exam grades of 10 students are:
88 51 63 85 79 65 79 70 73 77
The sample mean grade is
x̄ = (1/n) ∑_{i=1}^{n} x_i = (1/10)(88 + 51 + … + 77) = 73
Next we arrange the data from the lowest to the largest
grade,
51 63 65 70 73 77 79 79 85 88
The median grade is 75, which lies midway between the
5th and 6th ordered data points: (73 + 77)/2 = 75.
14/109
Example 1: Accounting final exam grades (Cont.)
The mode is 79 since it appears twice and all other grades
appeared only once.
The range is 88 − 51 = 37.
The variance
s² = 1/(n−1) ∑_{i=1}^{n} (x_i − x̄)² = (1/9)((88−73)² + … + (77−73)²) = 123.78
The standard deviation: s = √123.78 = 11.13
The coefficient of variation: CV = s/x̄ = 11.13/73 = 0.1525
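These numbers can be reproduced with a few lines of Python; the sketch below uses only the standard-library statistics module and the grade list from the slide.

    import statistics

    grades = [88, 51, 63, 85, 79, 65, 79, 70, 73, 77]

    mean = statistics.mean(grades)              # 73
    median = statistics.median(grades)          # 75.0 (average of the two middle values)
    mode = statistics.mode(grades)              # 79
    data_range = max(grades) - min(grades)      # 37
    variance = statistics.variance(grades)      # sample variance (n - 1 divisor): 123.78
    stdev = statistics.stdev(grades)            # 11.13
    cv = stdev / mean                           # about 0.152 (the slide rounds s to 11.13 first, giving 0.1525)

    print(mean, median, mode, data_range, round(variance, 2), round(stdev, 2), round(cv, 4))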
15/109
Shape of a distribution: Skewness
Skewness is a measure of the asymmetry of the distribution.
16/109
Shape of a distribution: Kurtosis
Kurtosis measures the degree of peakedness or flatness of the
distribution.
17/109
Measure of Position: z-score
Measure of position of a value relative to the mean of the
distribution.
The z-score of an observation tells us the number of standard
deviations that the observation is from the mean, i.e., how far
the observation is from the mean in units of standard deviation.
z = (x − x̄)/s
As the z-score has no unit, it can be used to compare values
from different data sets or to compare values within the same
data set. The mean of z-scores is 0 and the standard deviation
is 1.
Note that s > 0 so if z is negative, the corresponding x-value is
below the mean. If z is positive, the corresponding x-value is
above the mean. And if z = 0, the corresponding x-value is
equal to the mean.
18/109
Measure of Position: Percentiles and Quartiles
Measure of position of a value relative to the entire set of data.
Given an ordered set of observations, the kth percentile,
Pk, is the value of X such that k% or less of the
observations are less than Pk and (100 − k)% or less of the
observations are greater than Pk.
The 25th percentile, Q1, is often referred to as the first
quartile.
The 50th percentile (the median), Q2, is referred to as the
second or middle quartile.
The 75th percentile, Q3, is referred to as the third quartile.
19/109
Percentiles and Quartiles
The quartiles divide a data set into quarters (four equal
parts).
The interquartile range (IQR) of a data set is the difference
between the third and first quartiles (IQR = Q3 − Q1)
The IQR is a measure of variation that gives you an idea of
how much the middle 50% of the data varies.
20/109
Five-number summary & Boxplots
To graph a boxplot (a box-and-whisker plot), we need the
following values (called the five-number summary):
1) The minimum entry
2) The first quartile Q1
3) The median (second quartile) Q2
4) The maximum entry
5) The third quartile Q3
The box represents the interquartile range (IQR), which
contains the middle 50% of values.
21/109
Outliers & Extreme values
Some data sets contain outliers or extreme values: observations
that fall well outside the overall pattern of the data. Boxplots
can help us to identify such values if some rules of thumb are
used, e.g.:
Outlier: Cases with values between 1.5 and 3 box lengths
(the box length is the interquartile range) from the upper
or lower edge of the box.
Extremes: Cases with values more than 3 box lengths from
the upper or lower edge of the box.
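A minimal Python sketch of this rule of thumb is given below; the data set (the exam grades plus an added value of 150) and the use of numpy are assumptions made here for illustration, not part of the slides.

    import numpy as np

    data = np.array([51, 63, 65, 70, 73, 77, 79, 79, 85, 88, 150])  # 150 added as a hypothetical unusual value

    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1                                    # the box length

    # Outliers: between 1.5 and 3 box lengths from the box edges.
    # Extremes: more than 3 box lengths from the box edges.
    low_out = (data < q1 - 1.5 * iqr) & (data >= q1 - 3 * iqr)
    high_out = (data > q3 + 1.5 * iqr) & (data <= q3 + 3 * iqr)
    extreme = (data < q1 - 3 * iqr) | (data > q3 + 3 * iqr)

    print("Q1 =", q1, "Q3 =", q3, "IQR =", iqr)
    print("outliers:", data[low_out | high_out])     # none in this data set
    print("extremes:", data[extreme])                # [150]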
22/109
Example 1: Accounting final exam grades (cont.)
Histograms present frequencies for values grouped into
intervals.
Box plots
23/109
Descriptive statistics for qualitative variables
Frequency distributions are tabular or graphical
presentations of data that show each category for a variable
and the frequency of the category’s occurrence in the data
set. Percentages for each category are often reported
instead of, or in addition to, the frequencies.
The Mode can be used in this case as a measure of central
tendency.
Bar charts and Pie charts are often used to display the
results of categorical or qualitative variables. Pie charts are
more useful for displaying results of variables that have
relatively few categories, in that pie charts become
cluttered and difficult to read if variables have many
categories.
24/109
Descriptive statistics for qualitative variables (cont.)
25/109
Probability Theory: Basic concepts (Chapter 3)
We use probability theory to quantify uncertainty
The aim is to derive probability statements regarding the
likelihood of occurrence of possible outcomes
Random experiment: a process leading to at least two
possible outcomes with uncertainty as to which will occur
(e.g. rolling a die)
Basic outcomes: the possible outcomes of a random
experiment, e.g. (1, 2, 3, 4, 5 and 6)
Sample space: the set of all basic outcomes, e.g.
S = {1, 2, 3, 4, 5, 6}
Event: a set of basic outcomes from the sample space, that
is, a subset of the sample space (e.g. “number is odd”, or
“number is less than 5”)
26/109
Example 2: Tossing a coin
Tossing a coin:
There are two possible outcomes: head (H) or tail (T)
Sample space: S = {H, T }
E = {H} is the event of getting a head.
Assuming we have a fair coin, the probability of getting a
head is 1/2, or P (E) = 1/2.
Tossing a coin twice:
The sample space is S = {HH, HT, T H, T T }.
E = {HH, HT } is the event that the first toss results in a
head.
P (E) = 2/4 = 1/2.
27/109
Example 3: Rolling Dice
Rolling a die:
There are six possible outcomes: 1, 2, 3, 4, 5, 6.
The sample space is S = {1, 2, 3, 4, 5, 6}.
E = {2, 4, 6} is the event of rolling an even number.
The probability of rolling an even number is 3/6.
Rolling a die twice:
The sample space is S = {(i, j) : i, j = 1, 2, . . . , 6}, which
contains 36 elements.
E = {(4, 6), (5, 5), (6, 4)} is the event that the sum of the
two dice is equal to 10.
P (E) = 3/36 = 1/12.
28/109
Probability Theory: Basic concepts
Let A and B be two events in the sample space S. We
define the intersection between these two events as the set
of basic outcomes that belong to both and we denote it as
A ∩ B.
If two events share no common basic outcomes, they are
said to be mutually exclusive and their intersection is said
to equal the “empty set”, i.e. A ∩ B = ∅
29/109
Probability Theory: Basic concepts
Let A and B be two events in the sample space S. We
define the union between these events as the set of basic
outcomes that belong to at least one of these two events as
A ∪ B.
Complementary event: A ∪ Ā = S
30/109
Probability Measure
A probability measure on S is a function P from subsets of S
(events) to the real numbers that satisfies the following axioms
(Kolmogorov axioms):
For any event A, P (A) ≥ 0
P (S) = 1
If A1, A2, . . . , An, . . . are mutually disjoint, then
P(∪_{i=1}^{∞} Ai) = ∑_{i=1}^{∞} P(Ai)
In particular, if A1 and A2 are disjoint (A1 ∩ A2 = ∅), then
P(A1 ∪ A2) = P(A1) + P(A2)
31/109
Probability Measure: some properties
These properties are consequences of the previous axioms.
P (Ā) = 1 − P (A)
P (∅) = 0
If A ⊂ B, then P (A) ≤ P (B)
P (A ∪ B) = P (A) + P (B) − P (A ∩ B) (addition law)
If events A and B are mutually exclusive, then
P (A ∩ B) = 0
32/109
Conditional Probability
The probability of an event may depend on whether or not
another event has occurred.
Let A and B be two events with P(B) ≠ 0. The conditional
probability of A given B is defined to be
P(A|B) = P(A ∩ B) / P(B)
or as
P(A ∩ B) = P(B)P(A|B)    (multiplication law)
Similarly (for P(A) ≠ 0),
P(B|A) = P(A ∩ B) / P(A)  ⇐⇒  P(A ∩ B) = P(A)P(B|A)
33/109
Independent Events
Two events A and B are statistically independent if knowing
that the event B has occurred does not change the probability
of event A. Formally:
P (A|B) = P (A)
P (B|A) = P (B)
Another definition:
Two events A and B are statistically independent of each other
if and only if
P (A ∩ B) = P (A)P (B)
Note: it is important to distinguish between “disjoint events”
and “independent events”
34/109
Example 4: Two dice
Two dice are thrown sequentially, and the face values that
come up are recorded. The sample space:
S ={(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6),
(2, 1), (2, 2), (2, 3), (2, 4), (2, 5), (2, 6),
(3, 1), (3, 2), (3, 3), (3, 4), (3, 5), (3, 6),
(4, 1), (4, 2), (4, 3), (4, 4), (4, 5), (4, 6),
(5, 1), (5, 2), (5, 3), (5, 4), (5, 5), (5, 6),
(6, 1), (6, 2), (6, 3), (6, 4), (6, 5), (6, 6)}
A= the sum of two values is at least 5
A ={ (1, 4), (1, 5), (1, 6),
(2, 3), (2, 4), (2, 5), (2, 6),
(3, 2), (3, 3), (3, 4), (3, 5), (3, 6),
(4, 1), (4, 2), (4, 3), (4, 4), (4, 5), (4, 6),
(5, 1), (5, 2), (5, 3), (5, 4), (5, 5), (5, 6),
(6, 1), (6, 2), (6, 3), (6, 4), (6, 5), (6, 6)}
35/109
Example 4: Two dice (cont.)
B= the value of the first die is higher than the value of the
second
B ={(2, 1), (3, 1), (3, 2), (4, 1), (4, 2), (4, 3), (5, 1), (5, 2),
(5, 3), (5, 4), (6, 1), (6, 2), (6, 3), (6, 4), (6, 5)}
C= the first value is 4
C = {(4, 1), (4, 2), (4, 3), (4, 4), (4, 5), (4, 6)}
The event that the first value is 4 AND the sum of two
values is at least 5
A ∩ C = {(4, 1), (4, 2), (4, 3), (4, 4), (4, 5), (4, 6)}
The event that the value of the first die is higher than the
value of the second OR the first value is 4,
B ∪ C ={(2, 1), (3, 1), (3, 2), (4, 1), (4, 2), (4, 3), (4, 4), (4, 5), (4, 6),
(5, 1), (5, 2), (5, 3), (5, 4), (6, 1), (6, 2), (6, 3), (6, 4), (6, 5)}
36/109
Example 4: Two dice (cont.)
A ∩ (B ∪ C) ={(3, 2), (4, 1), (4, 2), (4, 3), (4, 4), (4, 5), (4, 6), (5, 1),
(5, 2), (5, 3), (5, 4), (6, 1), (6, 2), (6, 3), (6, 4), (6, 5)}
P(A ∩ (B ∪ C)) = 16/36 = 4/9
Given that the sum of the face values is less than six, what is
the probability that at least one of the dice came up a three?
D= the sum of the face values is less than six
E= at least one of the dice came up a three
D = {(1, 1), (1, 2), (1, 3), (1, 4), (2, 1), (2, 2), (2, 3), (3, 1), (3, 2), (4, 1)}
E = {(1, 3), (2, 3), (3, 1), (3, 2), (3, 3), (3, 4), (3, 5), (3, 6), (4, 3), (5, 3), (6, 3)}
E ∩ D = {(1, 3), (2, 3), (3, 1), (3, 2)}
P(E|D) = P(E ∩ D)/P(D) = (4/36)/(10/36) = 4/10 = 2/5
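This conditional probability can also be checked by brute-force enumeration of the 36 equally likely outcomes; a short Python sketch (standard library only):

    from itertools import product

    outcomes = list(product(range(1, 7), repeat=2))     # all 36 equally likely (i, j) pairs

    D = [o for o in outcomes if sum(o) < 6]             # sum of the face values is less than six
    E_and_D = [o for o in D if 3 in o]                  # ...and at least one die shows a three

    print(len(D), len(E_and_D), len(E_and_D) / len(D))  # 10, 4, 0.4 = 2/5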
37/109
Law of Total Probability
Assume that S = E1 ∪ E2 ∪ . . . ∪ Ek, where Ei ∩ Ej = ∅ for i ≠ j
and P(Ei) > 0 for all i. Then for any event A ⊆ S,
P(A) = ∑_{i=1}^{k} P(A|Ei)P(Ei)
38/109
Bayes Theorem
If A and B are two events with positive probabilities, then
P(A|B) = P(A ∩ B)/P(B) = P(A)P(B|A)/P(B)
In applications you usually observe P(A), P(B|A), and P(B). In
Bayesian inference this formula is used to update beliefs
(the sample information is combined with prior information).
P(A): prior probability
P(A|B): posterior probability
39/109
Example 5
According to the Association of British Insurers, 49% of UK
residents have life insurance while 16% have a personal pension
scheme. Suppose that 27% of UK residents with life insurance
also have a personal pension scheme. If a UK resident is
randomly selected, determine the following probabilities:
a. The UK resident does not have a personal pension scheme
given that he does have life insurance.
b. The UK resident has a personal pension scheme given that
he does not have life insurance.
c. The UK resident does not have life insurance if it is known
that he does have a personal pension scheme.
d. The UK resident does not have life insurance if it is known
that he does not have a personal pension scheme.
40/109
Example 5: Solution
P(LI) = 0.49, P(PPS) = 0.16, P(PPS|LI) = 0.27
(below, LIᶜ and PPSᶜ denote the complements "no life insurance"
and "no personal pension scheme")
a. P(PPSᶜ|LI) = 1 − P(PPS|LI) = 1 − 0.27 = 0.73
b. We need to find P(PPS|LIᶜ); from the Law of Total
Probability,
P(PPS) = P(PPS|LI)P(LI) + P(PPS|LIᶜ)P(LIᶜ)
0.16 = (0.27 × 0.49) + P(PPS|LIᶜ) × 0.51
thus
P(PPS|LIᶜ) = (0.16 − 0.27 × 0.49) / 0.51 = 0.0543
41/109
Example 5: Solution
c. We need to find P(LIᶜ|PPS); from Bayes' Theorem,
P(LIᶜ|PPS) = P(LIᶜ ∩ PPS)/P(PPS) = P(PPS|LIᶜ)P(LIᶜ)/P(PPS)
= (0.0543 × 0.51) / 0.16 = 0.1731
d. We need to find P(LIᶜ|PPSᶜ); from the Law of Total
Probability,
P(LIᶜ) = P(LIᶜ|PPS)P(PPS) + P(LIᶜ|PPSᶜ)P(PPSᶜ)
0.51 = (0.1731 × 0.16) + P(LIᶜ|PPSᶜ) × 0.84
Thus,
P(LIᶜ|PPSᶜ) = (0.51 − 0.1731 × 0.16) / 0.84 = 0.5742
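The four answers can be verified with a short Python sketch; the probabilities are those given in the example, and ~ is used in the comments to denote the complement.

    p_li = 0.49             # P(LI): has life insurance
    p_pps = 0.16            # P(PPS): has a personal pension scheme
    p_pps_given_li = 0.27   # P(PPS | LI)

    a = 1 - p_pps_given_li                              # P(~PPS | LI) = 0.73
    b = (p_pps - p_pps_given_li * p_li) / (1 - p_li)    # P(PPS | ~LI), law of total probability, ~0.0543
    c = b * (1 - p_li) / p_pps                          # P(~LI | PPS), Bayes' theorem, ~0.1731
    d = ((1 - p_li) - c * p_pps) / (1 - p_pps)          # P(~LI | ~PPS), law of total probability, ~0.5742

    print(round(a, 4), round(b, 4), round(c, 4), round(d, 4))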
0.84
42/109
Random Variables
A random variable is a variable whose possible values are
numerical outcomes of a random experiment.
The term ‘random’ is used here to imply the uncertainty
associated with the occurrence of each outcome.
Random variables can be either discrete or continuous.
A realisation of a random variable is the value that is
actually observed.
A random variable is often denoted by a capital letter (say
X, Y , Z) and its realisation by a small letter (say x, y, z).
43/109
Discrete Random Variables
A discrete random variable (X) is a random variable
that can take on only a finite or at most a countably
infinite number of values.
The probability (mass) function of the random variable
describes completely the probability properties of the
random variable.
That is, it assigns a probability to each outcome.
Let the various values of X be denoted by x1, x2, . . .; then
there is a function p such that
P(X = xi) = p(xi) ≥ 0 and ∑_i p(xi) = 1
This function is called the probability (mass) function
of the random variable X.
44/109
Discrete Random Variables
The cumulative distribution function (cdf) of a random
variable represents the probability that X does not exceed the
value x, as a function of x.
That is, it is defined as
F(x) = P(X ≤ x) = ∑_{xi ≤ x} p(xi)
The cumulative distribution function is non-decreasing and
satisfies:
lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1
45/109
Properties of random variables
If X is a discrete random variable with probability mass
function P , the expected value of X (or the mean) is
E(X) = ∑_i xi p(xi)
E(X) is also referred to as the mean of X and is often denoted
by µ or µX .
46/109
Properties of random variables (cont.)
The variance of X is
Var(X) = E[(X − E(X))²] = ∑_i (xi − µ)² p(xi)
or equivalently
Var(X) = E[X²] − [E(X)]²
The standard deviation of X is the square root of the
variance.
The variance (or the standard deviation) can be used as an
indication of how dispersed the probability distribution is
about its center (i.e. mean).
47/109
Example 6: Tossing a coin
Define the random variable:
Tossing a single fair coin: let X be the number of heads, so
X can take only two values (1 if the coin shows heads, 0 if it
shows tails).
The probability mass function of X is
P(X = x) = 1/2 for x ∈ {0, 1}, and 0 otherwise.
The cumulative distribution function is
F(x) = 0 for x < 0;  F(x) = 1/2 for 0 ≤ x < 1;  F(x) = 1 for x ≥ 1.
48/109
Example 6: Tossing a coin
x        p(x)   F(x)   x·p(x)   x²·p(x)
x1 = 0   1/2    1/2    0        0
x2 = 1   1/2    1      1/2      1/2
Total    1      -      1/2      1/2
E(X) = ∑_i xi p(xi) = x1 p(x1) + x2 p(x2) = 1/2
E(X²) = ∑_i xi² p(xi) = x1² p(x1) + x2² p(x2) = 1/2
Var(X) = E(X²) − [E(X)]² = 1/2 − (1/2)² = 1/4
49/109
Example 7: Tossing a coin three times
A coin is thrown three times, and the sequence of heads
(H) and tails (T ) is observed,
S = {HHH, HHT, HT T, HT H, T T T, T T H, T HH, T HT }
Let the random variable X be defined as the total number
of heads, i.e. X = {0, 1, 2, 3}.
If the coin is fair, then each of the outcomes in S has
probability 1/8, and the probabilities that X takes on the
values 0, 1, 2, and 3 are
P(X = 0) = 1/8, P(X = 1) = 3/8, P(X = 2) = 3/8, P(X = 3) = 1/8
The cumulative distribution function at these values is
F(0) = 1/8, F(1) = 4/8, F(2) = 7/8, F(3) = 8/8 = 1
8 8 8 8
50/109
Example 7: Tossing a coin three times (cont.)
Probability mass function & Cumulative distribution function
Find E(X) and Var(X).
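One way to check your answer is to compute both moments directly from the pmf above; a minimal Python sketch using exact fractions (standard library only):

    from fractions import Fraction

    # pmf of X = number of heads in three tosses of a fair coin
    pmf = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}

    mean = sum(x * p for x, p in pmf.items())              # E(X)
    second_moment = sum(x**2 * p for x, p in pmf.items())  # E(X^2)
    variance = second_moment - mean**2                     # Var(X) = E(X^2) - [E(X)]^2

    print(mean, variance)   # 3/2 and 3/4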
51/109
Some useful discrete
distributions
52/109
Bernoulli random variables
A Bernoulli random variable (X) takes on only two values:
0 (failure) and 1 (success), with probabilities 1 − p and p,
respectively.
The probability mass function of the Bernoulli distribution is
(for x = 0, 1 and 0 ≤ p ≤ 1)
P(X = x) = p^x (1 − p)^{1−x}
Find E(X) and Var(X).
53/109
Example 8: Bernoulli Distribution
Suppose X has a Bernoulli distribution, that is, X takes on values 0
and 1 with probabilities 1 − p and p, respectively. Then
E(X) =
and the variance,
Var(X) =
54/109
Binomial distribution
This is a generalisation of the Bernoulli distribution
Suppose that n independent experiments, or trials, are
performed, where n is a fixed number, and that each
experiment results in a “success” with probability p and a
“failure” with probability 1 − p. The total number of
successes, X, is a binomial random variable with
parameters n and p.
For example, a coin is tossed 10 times and the total number
of heads is counted (“head” is identified with “success”).
The probability mass function (for x = 0, 1, . . . , n and
0 ≤ p ≤ 1) can be found as
p(x) = (n choose x) p^x (1 − p)^{n−x} = [n! / (x!(n − x)!)] p^x (1 − p)^{n−x}
55/109
Binomial Distribution
If X has a binomial distribution with n independent
trials and success probability p, that is,
X ∼ B(n, p), then
E(X) = np
and the variance,
Var(X) = np(1 − p)
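As a quick numerical check of these two formulas, the following sketch simulates a binomial random variable and compares the simulated moments with np and np(1 − p); numpy and the values n = 10, p = 0.3 are assumptions made here for illustration.

    import numpy as np

    n, p = 10, 0.3
    rng = np.random.default_rng(seed=0)

    draws = rng.binomial(n, p, size=200_000)     # 200,000 replications of X ~ B(n, p)

    print("simulated mean:", draws.mean(), "   theory np        =", n * p)
    print("simulated var: ", draws.var(ddof=1), "   theory np(1 - p) =", n * p * (1 - p))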
57/109
Poisson distribution
The Poisson distribution is useful to study events of the
following types:
the number of calls coming into an exchange service during
a unit of time
the number of vehicles that pass a marker on a roadway
during a unit of time.
It expresses the probability of a given number of events
occurring in a fixed interval of time or space.
The Poisson probability mass function with parameter λ > 0 is
P(X = x) = (λ^x / x!) e^{−λ},   x = 0, 1, . . .
58/109
Example 10: Poisson Distribution
Let X be a Poisson random variable with parameter λ > 0,
then the expected value of X is
E(X) = ∑_{x=0}^{∞} x (λ^x / x!) e^{−λ}
= λ e^{−λ} ∑_{x=1}^{∞} λ^{x−1} / (x − 1)!
= λ e^{−λ} ∑_{k=0}^{∞} λ^k / k!
Since ∑_{k=0}^{∞} λ^k / k! = e^λ, we have E(X) = λ.
The parameter λ of the Poisson distribution can thus be
interpreted as the average count.
The variance is Var(X) = λ.
60/109
Example 11: Poisson Distribution
Suppose that an office receives telephone calls as a Poisson
process with λ = 0.5 per min. The number of calls in a 5-min
interval follows a Poisson distribution with parameter 5λ = 2.5.
a. What is the probability of no calls in a 5-min interval?
b. What is the probability of exactly one call in a 5-min
interval?
Solution:
(a) The probability of no calls in a 5-min interval is
e^{−2.5} = 0.082.
(b) The probability of exactly one call is 2.5e^{−2.5} = 0.205.
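Both numbers follow directly from the Poisson pmf with parameter 2.5; a quick check in Python (standard library only):

    import math

    lam = 2.5                          # rate over a 5-minute interval: 5 * 0.5
    p0 = math.exp(-lam)                # P(X = 0) = e^{-2.5} ≈ 0.082
    p1 = lam * math.exp(-lam)          # P(X = 1) = 2.5 e^{-2.5} ≈ 0.205
    print(round(p0, 3), round(p1, 3))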
61/109
Continuous Random Variables (Chapter 5)
For a continuous random variable, the role of the
probability mass function is taken by a density function,
f (x), which has the properties that:
f(x) ≥ 0
∫_{−∞}^{∞} f(x) dx = 1
For any a < b, the probability that X falls in the interval
(a, b) is the area under the density function between a and b:
P(a < X < b) = ∫_a^b f(x) dx
62/109
Continuous Random Variables
Thus the probability that a continuous random variable X
takes on any particular value is 0:
P(X = c) = ∫_c^c f(x) dx = 0
Although this may seem strange initially, it is really quite
natural. If a uniform random variable on [0, 1] had
a positive probability of being any particular number, it
would have to have the same probability for every number in
[0, 1], in which case the sum of the probabilities of any
countably infinite subset of [0, 1] (for example, the rational
numbers) would be infinite.
If X is a continuous random variable, then
P (a < X < b) = P (a ≤ X < b) = P (a < X ≤ b)
Note that this is not true for a discrete random variable.
63/109
Cumulative distribution function
The cumulative distribution function (cdf) of a
continuous random variable X is defined as:
F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(u) du
The cdf can be used to evaluate the probability that X
falls in an interval:
P(a ≤ X ≤ b) = ∫_a^b f(x) dx = F(b) − F(a)
64/109
Characteristics of probability distributions
If X is a continuous random variable with density f(x), then
µ = E(X) = ∫_{−∞}^{∞} x f(x) dx
or in general, for any function g,
E(g(X)) = ∫_{−∞}^{∞} g(x) f(x) dx
The variance of X is
σ² = Var(X) = E[(X − E(X))²] = ∫_{−∞}^{∞} (x − µ)² f(x) dx
The variance of X is the average value of the squared
deviation of X from its mean.
The variance of X can also be expressed as
Var(X) = E(X²) − [E(X)]².
65/109
Inverse distribution function
The pth quantile of the distribution F is defined to be the
smallest value x_p such that F(x_p) = p, or P(X ≤ x_p) = p.
That is, x_p is uniquely defined as x_p = F⁻¹(p).
Special cases are p = 1/2, which corresponds to the median
of F; and p = 1/4 and p = 3/4, which correspond to the lower
and upper quartiles of F.
66/109
Inverse distribution function
67/109
Some useful continuous
distributions
68/109
Uniform distribution
A random variable X with the density function
f(x) = 1/(b − a),   a ≤ x ≤ b
is said to have the uniform distribution on the interval [a, b].
The cumulative distribution function is
F(x) = 0 for x < a;  F(x) = (x − a)/(b − a) for a ≤ x < b;  F(x) = 1 for x ≥ b.
A special case is the standard uniform distribution on [0, 1],
with f(x) = 1 for 0 ≤ x ≤ 1.
69/109
Exponential distribution
The exponential density function is
f (x) = λe−λx , x ≥ 0 and λ > 0
The cumulative distribution function is
Z x
F (x) = f (u)du = 1 − e−λx
−∞
The exponential distribution is often used to model
lifetimes or waiting times data.
70/109
Exponential distribution: probability density function
71/109
Normal distribution
The normal (Gaussian) distribution plays a central role in
probability and statistics; it is probably the most widely
known and used of all distributions.
The normal distribution fits many natural phenomena, e.g.
human height, weight, and IQ scores, and, in business, the
annual cost of household insurance, among others.
The density function of the normal distribution depends on
two parameters, µ and σ (where −∞ < µ < ∞, σ > 0):
f(x) = (1 / (σ√(2π))) e^{−(x−µ)²/(2σ²)},   −∞ < x < ∞
The parameters µ and σ are the mean and standard
deviation of the normal density.
We write X ∼ N (µ, σ 2 ) as short way of saying ‘X follows a
normal distribution with mean µ and variance σ 2 ’.
72/109
Normal distribution, N (µ, σ 2 )
73/109
Standard normal distribution N (µ = 0, σ 2 = 1)
The probability density function of the standardized
normal distribution is given by:
f(z) = (1/√(2π)) e^{−z²/2},   −∞ < z < ∞
We write Z ∼ N(0, 1) as a short way of saying ‘Z follows a
standard normal distribution with mean 0 and variance 1’.
To standardize any variable X (into Z) we calculate Z as:
Z = (X − µ)/σ
The Z-score calculated above indicates how many standard
deviations X is from the mean.
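As a small illustration of the standardisation step, the sketch below converts a hypothetical observation to a Z-score and looks up the corresponding probability; the values µ = 100, σ = 15, x = 130 are made up, and scipy is assumed to be available.

    from scipy.stats import norm

    mu, sigma = 100, 15        # hypothetical population mean and standard deviation
    x = 130                    # hypothetical observation

    z = (x - mu) / sigma       # 2.0: the observation is two standard deviations above the mean
    print(z, round(norm.cdf(z), 4))   # P(Z <= 2.0) ≈ 0.9772, as in a standard normal table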
74/109
Standard normal distribution N (µ = 0, σ 2 = 1)
75/109
Standard normal distribution N (µ = 0, σ 2 = 1)
76/109
Log-normal distribution and its properties
If X ∼ N(µ, σ²), then Y = e^X (y ≥ 0) has a log-normal
distribution with mean E(Y) = e^{µ+σ²/2} and variance
V(Y) = (e^{σ²} − 1) e^{2µ+σ²}.
77/109
Distributions derived from the normal distribution
We consider here 3 probability distributions derived from the
normal distribution:
Chi-square distribution (χ2 )
T or t distribution
F distribution
These distributions are mainly useful for statistical inference,
e.g. hypothesis testing and confidence intervals (to follow).
78/109
Chi-square distribution, χ2(df )
If Z is a standard normal random variable, the distribution
of U = Z² is called the chi-square distribution with 1
degree of freedom and is denoted by χ²_1.
If U1, U2, . . . , Un are independent chi-square random
variables with 1 degree of freedom, the distribution of
V = U1 + U2 + . . . + Un is called the chi-square distribution
with n degrees of freedom and is denoted by χ²_n.
79/109
Chi-square distribution, χ2(df )
80/109
T distribution, t(df )
If Z ∼ N(0, 1) and U ∼ χ²_n, and Z and U are independent, then
the distribution of Z/√(U/n) is called the t distribution with n
degrees of freedom.
81/109
F distribution, F(df1 ,df2 )
Let U and V be independent chi-square random variables with
m and n degrees of freedom, respectively. The distribution of
W = (U/m) / (V/n)
is called the F distribution with m and n degrees of freedom
and is denoted by Fm,n .
82/109
F distribution, F(df1 ,df2 )
83/109
Joint distribution (Chapter 4)
The joint probability mass function:
Let X and Y be two discrete random variables. Then the
function f (x, y) = P (X = x, Y = y) gives the (joint)
probability that X takes the value x and Y takes the value
y.
And,
∑_x ∑_y f(x, y) = 1
The marginal probability mass functions of X and Y,
respectively, are
f_X(x) = ∑_y f(x, y)
f_Y(y) = ∑_x f(x, y)
84/109
Joint distribution (Cont.)
The joint density function f (x, y) of two continuous random
variables X and Y is such that
f(x, y) ≥ 0
∫_c^d ∫_a^b f(x, y) dx dy = P(a ≤ X ≤ b, c ≤ Y ≤ d)
∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dx dy = 1
The marginal density function of X is
f_X(x) = ∫_{−∞}^{∞} f(x, y) dy
Similarly, the marginal density function of Y is
f_Y(y) = ∫_{−∞}^{∞} f(x, y) dx
85/109
Joint distribution (Cont.)
The joint cumulative distribution function of two random
variables X and Y is
F (x, y) = P (X ≤ x, Y ≤ y)
The cdf gives the probability that the point (X, Y ) belongs to a
semi-infinite rectangle in the plane.
86/109
Joint distribution (Cont.)
The probability that (X, Y ) belongs to a given rectangle is
P (x1 < X ≤ x2 , y1 < Y ≤ y2 ) = F (x2 , y2 )−F (x2 , y1 )−F (x1 , y2 )+F (x1 , y1 )
87/109
Joint distribution (Cont.)
The cdf of two continuous random variables X and Y can
be obtained as
F(x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} f(u, v) dv du
and
f(x, y) = ∂²F(x, y) / ∂x∂y
wherever the derivative is defined.
88/109
Conditional probability density function PDF (Chapter 5)
The conditional probability (density) functions may be
obtained as follows:
f_{X|Y}(x|y) = f(x, y) / f_Y(y)   (conditional PDF of X)
f_{Y|X}(y|x) = f(x, y) / f_X(x)   (conditional PDF of Y)
Two random variables X and Y are statistically
independent if and only if
f(x, y) = f_X(x) f_Y(y)
That is, if the joint PDF can be expressed as the product
of the marginal PDFs. So,
f_{X|Y}(x|y) = f_X(x) and f_{Y|X}(y|x) = f_Y(y)
89/109
Properties of Expected values and Variance
The expected value of a constant is the constant itself, i.e.
if c is a constant, E(c) = c.
The variance of a constant is zero, i.e. if c is a constant,
V ar(c) = 0.
If a and b are constants, and Y = aX + b, then
E(Y ) = aE(X) + b and V ar(Y ) = a2 V ar(X) (if V ar(X)
exists).
If X and Y are independent, then E(XY ) = E(X)E(Y )
and
V ar(X + Y ) = V ar(X) + V ar(Y )
V ar(X − Y ) = V ar(X) + V ar(Y )
If X and Y are independent random variables and g and h
are fixed functions, then
E[g(X)h(Y )] = E[g(X)]E[h(Y )]
90/109
Covariance
Let X and Y be two random variables with means µx and
µy , respectively. Then the covariance between the two
variables is defined as
cov(X, Y ) = E {(X − µx )(Y − µy )} = E(XY ) − µx µy
If X and Y are independent, then cov(X, Y ) = 0.
If two variables are uncorrelated, that does not in general
imply that they are independent.
V ar(X) = cov(X, X)
cov(bX + a, dY + c) = bd cov(X, Y ), where a, b, c, and d are
constants.
91/109
Correlation Coefficient
The (population) correlation coefficient ρ is defined as
ρ = cov(X, Y) / √(Var(X)Var(Y)) = cov(X, Y) / (σx σy)
Thus, ρ is a measure of linear association between two
variables and lies between −1 (indicating perfect negative
association) and +1 (indicating perfect positive
association).
cov(X, Y ) = ρ σx σy
Variances of correlated variables,
V ar(X ± Y ) = V ar(X) + V ar(Y ) ± 2cov(X, Y )
V ar(X ± Y ) = V ar(X) + V ar(Y ) ± 2ρ σx σy
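A short numpy sketch relating the sample covariance to the sample correlation coefficient; the two data vectors are hypothetical.

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # roughly 2x, so the correlation should be close to +1

    cov_xy = np.cov(x, y, ddof=1)[0, 1]        # sample covariance
    corr = np.corrcoef(x, y)[0, 1]             # sample correlation coefficient r

    # r equals the covariance divided by the product of the standard deviations
    check = cov_xy / (x.std(ddof=1) * y.std(ddof=1))
    print(round(cov_xy, 3), round(corr, 4), round(check, 4))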
92/109
Conditional expectation and conditional variance
Let f (x, y) be the joint PDF of random variables X and Y . The
conditional expectation of X, given Y = y, is defined as
E(X|Y = y) = ∑_x x f_{X|Y}(x|y)   if X is discrete
E(X|Y = y) = ∫_{−∞}^{∞} x f_{X|Y}(x|y) dx   if X is continuous
The conditional variance of X given Y = y is defined as, if X is
discrete,
Var(X|Y = y) = ∑_x [x − E(X|Y = y)]² f_{X|Y}(x|y)
and if X is continuous,
Var(X|Y = y) = ∫_{−∞}^{∞} [x − E(X|Y = y)]² f_{X|Y}(x|y) dx
−∞
93/109
Conditional expectation & conditional variance: Properties
If f (X) is a function of X, then E[f (X)|X] = f (X)
If f (X) and g(X) are functions of X, then
E[f (X)Y + g(X)|X] = f (X)E(Y |X) + g(X)
If X and Y are independent, then E(Y |X) = E(Y )
The law of iterated expectations, E(Y ) = E[E(Y |X)]
If X and Y are independent, then V ar(Y |X) = V ar(Y )
V ar(Y ) = E[V ar(Y |X)] + V ar[E(Y |X)].
94/109
Sampling
95/109
Sampling
Sampling is widely used as a means of gathering useful
information about a population.
Data are gathered from samples and conclusions are drawn
about the population as a part of the inferential statistics
process.
Often, a sample provides a reasonable means of gathering
useful decision-making information that might otherwise be
unattainable or unaffordable.
Sampling error occurs when the sample is not
representative of the population.
96/109
Random versus non-random sampling
In random sampling every unit of the population has the
same probability of being selected into the sample.
In non-random sampling not every unit of the
population has the same probability of being selected into
the sample.
97/109
Simple random sampling
Simple random sampling: is the basic sampling technique
where we select a group of subjects (a sample) from a larger
group (a population). Each individual is chosen entirely by
chance and each member of the population has an equal chance
of being included in the sample.
98/109
Sample mean X̄
Let the random variables X1, X2, . . . , Xn denote a random
sample from a population. The sample mean of these
random variables is defined as follows:
X̄_n = (1/n) ∑_{i=1}^{n} X_i
99/109
Sampling distribution of the sample mean X̄
100/109
Sampling distribution of the sample mean X̄
There are two cases:
1. Sampling is from a normally distributed population with
mean µ and variance σ 2 :
X̄ ∼ N(µ, σ²/n)
That is, the sampling distribution of the sample mean is
normal with mean µ_X̄ = µ and standard deviation
σ_X̄ = σ/√n.
101/109
Sampling distribution of the sample mean X̄
2. Sampling is from a non-normally distributed population
with mean µ and variance σ 2 and n is large, then the mean
of X̄,
µX̄ = µ
and the variance,
σ²_X̄ = σ²/n   with replacement (infinite population)
σ²_X̄ = (σ²/n) × (N − n)/(N − 1)   without replacement (finite population)
If the sample size is large, the central limit theorem applies
and the sampling distribution of X̄ will be approximately
normal.
102/109
Central limit theorem
Let X1 , X2 , . . . be independent and identically distributed
(i.i.d.) random variables with mean µ and variance σ². Then as
n increases indefinitely (i.e. n → ∞), X̄_n approaches the normal
distribution with mean µ and variance σ²/n. That is,
X̄_n ∼ N(µ, σ²/n)   as n → ∞
Note that this result holds true regardless of the form
of the underlying distribution. As a result, it follows that
Z = (X̄_n − µ) / (σ/√n) ∼ N(0, 1)   as n → ∞
That is, Z is a standardized normal variable.
103/109
Central limit theorem
Roughly, the Central Limit Theorem states that whenever a
random sample of size n is taken from any distribution with
mean µ and variance σ², the sample mean X̄ will be
approximately normally distributed with mean µ and variance
σ²/n. The larger the value of the sample size n, the better the
approximation to the normal.
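The theorem is easy to see in a simulation: draw many samples from a clearly non-normal population and look at the mean and spread of the resulting sample means. The sketch below (numpy assumed, with a made-up exponential population and n = 50) is one such check.

    import numpy as np

    rng = np.random.default_rng(seed=1)
    lam, n, reps = 0.5, 50, 100_000            # exponential rate, sample size, number of samples

    mu = sigma = 1 / lam                       # for the exponential, mean and standard deviation are both 1/lambda

    # reps samples of size n, and the sample mean of each
    sample_means = rng.exponential(scale=1 / lam, size=(reps, n)).mean(axis=1)

    print("mean of sample means:", sample_means.mean(), "   theory mu            =", mu)
    print("sd of sample means:  ", sample_means.std(ddof=1), "   theory sigma/sqrt(n) =", sigma / np.sqrt(n))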
104/109
Sampling distribution of the sample mean X̄
The standard deviation of the sampling distribution of the
sample mean, σX̄ , is called the standard error of the
mean or, simply, the standard error
If X̄ is normally distributed (or approximately normally
distributed), we can use the following formula to transform
X̄ to a Z-score:
Z = (X̄ − µ_X̄) / σ_X̄
where Z ∼ N (0, 1).
105/109
Sampling distribution of the sample proportion
Let X be the number in the sample with the characteristic of
interest, and let π̂ = X/n be the proportion of the sample with
the characteristic of interest.
When the sample size n is large, the distribution of the
sample proportion, π̂, is approximately normally
distributed by the central limit theorem:
π̂ ≈ N(π, π(1 − π)/n)
then
Z = (π̂ − π) / √(π(1 − π)/n) ≈ N(0, 1)
A widely used criterion is that both nπ and n(1 − π) must
be greater than 5 for this approximation to be reasonable.
106/109
Sampling distribution of the sample variance
Sampling is from a normally distributed population with mean
µ and variance σ 2 . The sample variance is
s² = 1/(n−1) ∑_{i=1}^{n} (x_i − x̄)²
and
E(s²) = σ²
Var(s²) = 2σ⁴/(n − 1)
Then
(n − 1)s²/σ² ∼ χ²_{n−1}
107/109
Example 11
Suppose that during any hour in a large department store, the
average number of shoppers is 448, with a standard deviation of
21 shoppers. What is the probability that a random sample of
49 different shopping hours will yield a sample mean between
441 and 446 shoppers?
µ = 448, σ = 21, n = 49
P(441 ≤ X̄ ≤ 446) = P((441 − 448)/(21/√49) ≤ (X̄ − µ)/(σ/√n) ≤ (446 − 448)/(21/√49))
= P(−2.33 ≤ Z ≤ −0.67) = P(Z ≤ −0.67) − P(Z ≤ −2.33)
= 0.2514 − 0.0099 = 0.2415
(we used the standard normal table to obtain these probabilities.)
That is, there is a 24.15% chance of randomly selecting 49
hourly periods for which the sample mean is between 441 and
446 shoppers.
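The same probability can be computed directly (scipy assumed available); the exact cdf values differ slightly from the rounded table values used above.

    from math import sqrt
    from scipy.stats import norm

    mu, sigma, n = 448, 21, 49
    se = sigma / sqrt(n)                       # standard error of the mean: 21/7 = 3

    prob = norm.cdf((446 - mu) / se) - norm.cdf((441 - mu) / se)
    print(round(prob, 4))                      # about 0.2427 (the table-based answer is 0.2415)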
108/109
Example 12
Suppose 60% of the electrical contractors in a region use a
particular brand of wire. What is the probability of taking a
random sample of size 120 from these electrical contractors and
finding that 0.50 or less use that brand of wire?
π = 0.60, π̂ = 0.50, n = 120
Z = (π̂ − π) / √(π(1 − π)/n) = (0.50 − 0.60) / √(0.60(1 − 0.60)/120) = −0.10/0.0447 = −2.24
From the standard normal table, the probability
P (π̂ ≤ 0.5) = P (Z ≤ −2.24) = 0.0125
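A direct check in Python (scipy assumed available); the small difference from 0.0125 comes from rounding the z-score to −2.24.

    from math import sqrt
    from scipy.stats import norm

    p, n = 0.60, 120                           # population proportion pi and sample size
    se = sqrt(p * (1 - p) / n)                 # standard error of the sample proportion

    z = (0.50 - p) / se
    print(round(z, 2), round(norm.cdf(z), 4))  # z ≈ -2.24, P(pi_hat <= 0.5) ≈ 0.0127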
109/109