Measures of central
tendency and dispersion
Measures of central tendency
Central tendency
• The tendency of statistical data to get
concentrated at certain values
Measures of central Tendency
• The various methods of determining
the actual value at which the data tend
to concentrate
• Summary value
Measures of central tendency
• Data are condensed to a single value.
• Such a single value, used to express or represent the data, is called
the central value or measure of central tendency.
• Example: the mean, or average.
• An average is a single value within the
range of the data that is used to represent
all the values of the data.
The Arithmetic Mean (Mean)
Definition: the arithmetic mean is the sum
of all observations divided by the number
of observations. It is written in statistical
terms as:
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
The arithmetic mean is, in general, a very natural measure of
central tendency.
One of its principal limitations, however, is that it is overly
sensitive to extreme values.
In such cases it may not be representative of the location of
the great majority of the sample points.
Example
Find the mean of the following data:
34, 56, 78, 10, 9, 10
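As a quick check of this example, the definition (sum of all observations divided by the number of observations) can be applied directly. This is only a minimal sketch; the variable names are illustrative.

```python
# Observations from the example above.
data = [34, 56, 78, 10, 9, 10]

# Arithmetic mean: sum of the observations divided by how many there are.
mean = sum(data) / len(data)

print(sum(data), len(data))  # 197 6
print(round(mean, 2))        # 32.83
```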
Median
• An alternative measure of central location,
perhaps second in popularity to the arithmetic
mean, is the median.
• Suppose there are n observations in a
sample. If these observations are ordered
from smallest to largest, then the median is
defined as follows:
A) The $\left(\frac{n+1}{2}\right)$th observation if n is odd.
B) The average of the $\left(\frac{n}{2}\right)$th and $\left(\frac{n}{2}+1\right)$th observations if
n is even.
Example
• What is the median value for each of the following
data sets?
1. 10, 15, 25, 6, 8, 23, 4, 9, 11
2. 25, 67, 24, 8, 19, 12
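The odd/even rule from the definition can be sketched as follows. The function name `median` and the variable names are illustrative only; the logic simply orders the data and picks the middle observation, or the average of the two middle ones.

```python
def median(values):
    """Median by the ordered-position rule for odd and even n."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # n odd: the ((n + 1) / 2)th observation (0-based index n // 2).
        return ordered[mid]
    # n even: average of the (n / 2)th and (n / 2 + 1)th observations.
    return (ordered[mid - 1] + ordered[mid]) / 2

set_1 = [10, 15, 25, 6, 8, 23, 4, 9, 11]   # n = 9 (odd)
set_2 = [25, 67, 24, 8, 19, 12]            # n = 6 (even)

print(median(set_1))  # 10
print(median(set_2))  # 21.5
```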
Median
• The principal strength of the sample
median is that it is insensitive to very
large or very small values.
• The principal weakness of the sample
median is that it is determined mainly by
the middle points in a sample and is less
sensitive to the actual numerical values of
the remaining data points.
Mode
• It is the value of the observation that occurs
with the greatest frequency.
• A particular disadvantage is that, with a small
number of observations, there may be no
mode. In addition, sometimes, there may be
more than one mode such as when dealing with
a bimodal (two-peaks) distribution.
• It is even less amenable (responsive) to
mathematical treatment than the median. The
mode is not often used in biological or medical
data.
Mode
Exercise
Find the modal values for the following data
a) 22, 66, 69, 70, 73.
b) 1.8, 3.0, 3.3, 2.8, 2.9, 3.6, 3.0, 1.9, 3.2,
3.5
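A small sketch for this exercise, assuming that "no mode" is reported when every value occurs only once and that all tied values are returned when there is more than one mode. The helper name `modes` is illustrative.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s); an empty list if nothing repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # every value occurs once, so there is no mode
    return [value for value, count in counts.items() if count == highest]

set_a = [22, 66, 69, 70, 73]
set_b = [1.8, 3.0, 3.3, 2.8, 2.9, 3.6, 3.0, 1.9, 3.2, 3.5]

print(modes(set_a))  # []
print(modes(set_b))  # [3.0]
```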
Geometric mean
• It is obtained by taking the nth root of the
product of “n” values, i.e., if the values of
the observations are denoted by $x_1, x_2, \ldots, x_n$
then
$$GM = \sqrt[n]{x_1 x_2 \cdots x_n}$$
• The geometric mean is preferable to the
arithmetic mean if the series of
observations contains one or more
unusually large values.
Geometric mean
• The above method of calculating
geometric mean is satisfactory only if there
are a small number of items.
• But if n is large, computing the nth root of
the product of these values by simple
arithmetic is tedious.
• To facilitate the computation of geometric
mean we make use of logarithms.
Geometric mean
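$$
\begin{aligned}
GM &= \sqrt[n]{x_1 x_2 \cdots x_n} = (x_1 x_2 \cdots x_n)^{1/n} \\
\log GM &= \log (x_1 x_2 \cdots x_n)^{1/n} \\
&= \frac{1}{n} \log (x_1 x_2 \cdots x_n) \\
&= \frac{1}{n} \{\log x_1 + \log x_2 + \cdots + \log x_n\} \\
&= \frac{\sum \log x_i}{n}
\end{aligned}
$$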
Geometric mean
• The logarithm of the geometric mean is equal to
the arithmetic mean of the logarithms of individual
values.
• The actual process involves obtaining logarithm of
each value, adding them and dividing the sum by
the number of observations.
• The quotient so obtained is then looked up in the
tables of anti-logarithms which will give us the
geometric mean.
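The log-based procedure described above can be sketched in a few lines, with `10 ** x` standing in for the table of anti-logarithms. The three sample values are hypothetical and chosen only so that the direct nth root is easy to check by hand.

```python
import math

def geometric_mean_via_logs(values):
    """Antilog of the arithmetic mean of the base-10 logarithms.

    Only defined for strictly positive values.
    """
    mean_log = sum(math.log10(x) for x in values) / len(values)
    return 10 ** mean_log  # the "anti-logarithm" step

values = [2, 8, 4]                       # hypothetical data, product = 64
direct = (2 * 8 * 4) ** (1 / 3)          # nth root of the product
via_logs = geometric_mean_via_logs(values)

# Both routes agree up to floating-point rounding (both are close to 4).
print(math.isclose(direct, via_logs))
```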
Geometric mean
Example: The geometric mean may be calculated for the following
parasite counts per 100 fields of thick films.
7 8 3 14 2 1 440 15 52 6 2 1 1 25
12 6 9 2 1 6 7 3 4 70 20 200 2 50
21 15 10 120 8 4 70 3 1 103 20 90 1 237
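$$GM = \sqrt[42]{7 \times 8 \times 3 \times \cdots \times 1 \times 237}$$
$$
\begin{aligned}
\log GM &= \frac{1}{42}(\log 7 + \log 8 + \log 3 + \cdots + \log 237) \\
&= \frac{1}{42}(0.8451 + 0.9031 + 0.4771 + \cdots + 2.3747) \\
&= \frac{1}{42}(41.9985) \\
&= 0.9999 \approx 1.0000
\end{aligned}
$$
$$\text{antilog}(1) = 10$$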
Geometric mean
♣ The anti-log of 0.9999 is 9.9992 ≈ 10, and
this is the required geometric mean.
♣ By contrast, the arithmetic mean, which
is inflated by the high values of 440, 237
and 200, is 39.8 ≈ 40.
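The comparison can be reproduced with Python's standard `statistics` module (`geometric_mean` is available from Python 3.8). This is a sketch using the 42 parasite counts as listed in the example; small differences from the hand calculation in the last decimal places are to be expected, since the slide works from four-figure logarithm tables.

```python
import statistics

# Parasite counts per 100 fields of thick film, as listed in the example.
counts = [
    7, 8, 3, 14, 2, 1, 440, 15, 52, 6, 2, 1, 1, 25,
    12, 6, 9, 2, 1, 6, 7, 3, 4, 70, 20, 200, 2, 50,
    21, 15, 10, 120, 8, 4, 70, 3, 1, 103, 20, 90, 1, 237,
]

gm = statistics.geometric_mean(counts)  # antilog of the mean of the logs
am = statistics.mean(counts)            # pulled upward by 440, 237 and 200

print(len(counts))                 # 42
print(round(gm, 1), round(am, 1))  # geometric mean near 10, arithmetic mean near 39.8
```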
Mean
i) Advantages
• It is based on all values given in the distribution.
• It is easily understood.
• It is most amenable to algebraic treatment.
ii) Disadvantages
• It may be greatly affected by extreme items, and its
usefulness as a “summary of the whole” may be
considerably reduced.
Median
i) Advantages
• It is easily calculated and is not much disturbed by
extreme values.
• It is more typical of the series.
ii) Disadvantages
• The median is not so well suited to algebraic treatment as
the arithmetic and geometric means.
• It is not so generally familiar as the arithmetic mean.
Mode
i) Advantages
• Since it is the most typical value, it is the most descriptive
average.
• Since the mode is usually an “actual value”, it indicates
the precise value of an important part of the series.
ii) Disadvantages
• It is not capable of mathematical treatment.
• In a small number of items the mode may not exist.
Geometric mean
i) Advantages
• Since it is less affected by extremes, it can be a
preferable average to the arithmetic mean.
• It is capable of algebraic treatment.
• It is based on all values given in the distribution.
ii) Disadvantages
• Its computation is relatively difficult.
• It cannot be determined if there is any negative
value in the distribution, or if one of the
items has a zero value.
Measures of Variation
• While the mean, median, etc. give useful information
about the centre of the data, we also need to know how
“spread out” the numbers are about the centre.
Consider the following data sets:
         Values                  Mean
Set 1:   60 40 30 50 60 40 70    50
Set 2:   50 49 49 51 48 50 53    50
• The two data sets given above have a mean of 50, but
obviously set 1 is more “spread out” than set 2. How do
we express this numerically?
Measures of Variation
• The object of measuring this scatter or
dispersion is to obtain a single summary
figure which adequately exhibits whether
the distribution is compact or spread out.
• Some of the commonly used measures of
dispersion (variation) are: Range,
interquartile range and standard deviation.
RANGE
• The range is defined as the difference between the
largest and smallest observations in the data.
• It is the crudest measure of dispersion.
Range = $x_{\max} - x_{\min}$
In the example given above (the two data sets):
* The range of data in set 1 is 70 - 30 = 40
* The range of data in set 2 is 53 - 48 = 5
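A short sketch of the same comparison, assuming each set consists of seven observations and that the final 50 in each row of the earlier listing is the stated mean rather than an observation (seven values summing to 350 give a mean of exactly 50). The helper name `value_range` is illustrative.

```python
set_1 = [60, 40, 30, 50, 60, 40, 70]
set_2 = [50, 49, 49, 51, 48, 50, 53]

def value_range(values):
    """Range = largest observation minus smallest observation."""
    return max(values) - min(values)

for name, data in (("Set 1", set_1), ("Set 2", set_2)):
    print(name, "mean =", sum(data) / len(data), "range =", value_range(data))
# Set 1 mean = 50.0 range = 40
# Set 2 mean = 50.0 range = 5
```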
Interquartile range
• The interquartile range is the difference
between the third and the first quartiles.
• The second quartile is the median.
• To compute it, we first sort the data, in ascending order,
then find the data values corresponding to the first
quarter of the numbers (first quartile), and then the third
quartile.
• The interquartile range (IQR) is the distance (difference)
between these quartiles.
E.g., given the following data set (age of patients):
18, 59, 24, 42, 21, 23, 24, 32
Find the interquartile range.
Interquartile range
18 21 23 24 24 32 42 59
1st quartile = the {(n+1)/4}th observation = (2.25)th observation
= 21 + (23 - 21) x 0.25 = 21.5
3rd quartile = the {3(n+1)/4}th observation = (6.75)th observation
= 32 + (42 - 32) x 0.75 = 39.5
Hence, IQR = 39.5 - 21.5 = 18
The interquartile range is a preferable measure to the range
because it is less prone to distortion by a single large or small
value. That is, isolated outliers in the data have little or no effect
on the interquartile range.
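The positional rule used in this worked example, the (n + 1)p-th ordered observation with linear interpolation between neighbours, can be sketched as below. The function name `quartile` is illustrative, and other software may use slightly different quartile conventions.

```python
def quartile(sorted_values, p):
    """Value at 1-based position (n + 1) * p, interpolating between neighbours."""
    n = len(sorted_values)
    position = (n + 1) * p
    k = int(position)            # rank of the lower neighbour
    fraction = position - k      # how far to move towards the upper neighbour
    if k < 1:
        return sorted_values[0]
    if k >= n:
        return sorted_values[-1]
    lower, upper = sorted_values[k - 1], sorted_values[k]
    return lower + fraction * (upper - lower)

ages = sorted([18, 59, 24, 42, 21, 23, 24, 32])
q1 = quartile(ages, 0.25)   # (2.25)th observation
q3 = quartile(ages, 0.75)   # (6.75)th observation

print(q1, q3, q3 - q1)  # 21.5 39.5 18.0
```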
Exercise
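• Data: 5 6 8 9 10 12
• Find the interquartile range of the above data.
• 1st quartile:
(n+1)/4 = 7/4 = 1.75, i.e. the (1.75)th observation
5 + (6 - 5) x 0.75 = 5.75
• 3rd quartile:
3(n+1)/4 = 21/4 = 5.25, i.e. the (5.25)th observation
10 + (12 - 10) x 0.25 = 10.5
• IQR = 10.5 - 5.75 = 4.75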
Standard Deviation
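The sample and population standard deviations, denoted by S
and σ (by convention) respectively, are defined as follows:
$$S = \sqrt{\frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}} = \sqrt{\text{sample variance}} = \text{sample standard deviation}$$
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N}(X_i - \mu)^2}{N}} = \text{population standard deviation}$$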
Exercise
• Find the standard deviation for the
following data
• 5 6 8 9 10 12
Exercise
• Descriptive Statistics
  n    Minimum    Maximum    Mean      Std. Deviation
  6    5.00       12.00      8.3333    2.58199
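The figures in this table can be reproduced from the sample formula (divisor n - 1). The sketch below computes S step by step and then cross-checks it against `statistics.stdev` from the Python standard library, which also uses the n - 1 divisor.

```python
import math
import statistics

data = [5, 6, 8, 9, 10, 12]
n = len(data)

mean = sum(data) / n
sum_sq_dev = sum((x - mean) ** 2 for x in data)   # sum of squared deviations
s = math.sqrt(sum_sq_dev / (n - 1))               # sample standard deviation

print(min(data), max(data))                     # 5 12
print(round(mean, 4), round(s, 5))              # 8.3333 2.58199
print(math.isclose(s, statistics.stdev(data)))  # True
```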
Standard Deviation
This measure of variation is universally
used to show the scatter of the individual
measurements around the mean of all the
measurements in a given distribution.
Note that the sum of the deviations of the
individual observations of a sample about
the sample mean is always 0.
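The zero-sum property can be verified exactly with rational arithmetic, which avoids floating-point rounding. This sketch reuses the data 5 6 8 9 10 12 from the preceding exercise purely as an illustration.

```python
from fractions import Fraction

# Exact rational arithmetic, so the result is exactly zero, not just close to it.
data = [Fraction(x) for x in (5, 6, 8, 9, 10, 12)]
mean = sum(data) / len(data)            # 25/3
deviations = [x - mean for x in data]   # each observation minus the sample mean

print(mean)             # 25/3
print(sum(deviations))  # 0
```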
Example 1: Serum albumin concentration
in 216 patients with liver disease (grams/litre):
(Symmetric distribution)
21.1, 21.7, 22.5, 22.5, 22.6, ....
[Figure: Distribution of serum albumin concentration in 216 patients with liver disease; x-axis: Serum Albumin (g/l), y-axis: Frequency]
Describing the serum albumin data
$$\text{Mean} = \frac{\text{Total}}{\text{Number of patients}} = \frac{7506}{216} = 34.75 \text{ g/l}$$
Standard Deviation = “Average Deviation from the Mean”
= 6.0 g/l
That is, on average, the serum albumin concentration
of each patient varies from the mean (centre) by 6.0 g/l.
Example 2
Age of 40 persons who attended a meeting
on one of the health days
20 30 27 30 31 33 55 29 32 38 33 29 49 46
35 59 49 42 23 75 35 29 35 58 49 35 40 64
21 25 24 70 22 35 40 67 47 33 29 84
Example 2 ….
Calculate the range, IQR, median, mode and SD.
When the figures (ages) are rearranged in ascending order:
20 21 22 23 24 25 27 29 29 29 30 30 31 32
33 33 33 35 35 35 35 35 38 40 40 42 42 46
47 49 49 49 55 58 59 64 67 70 75 84
Example 2 ….
Median age = 35 years
1st quartile = Q1 = {(n + 1)/4}th observation
= (41/4)th = (10.25)th observation
= 29 + (30 - 29) x 0.25
= 29.25
3rd quartile = Q3 = {3(n + 1)/4}th observation
= (3 x 41/4)th = (30.75)th observation
= 49 + (49 - 49) x 0.75
= 49
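The requested summary values can be sketched with the standard `statistics` module, taking the ascending list of 40 ages exactly as printed in this example. `statistics.quantiles` with its default `method="exclusive"` places the quartiles at the (n + 1)p-th positions, which matches the rule used here; the SD is computed but not quoted, since the slides give no reference value for it.

```python
import statistics

# The 40 ages in ascending order, as printed in the example.
ages = [
    20, 21, 22, 23, 24, 25, 27, 29, 29, 29, 30, 30, 31, 32,
    33, 33, 33, 35, 35, 35, 35, 35, 38, 40, 40, 42, 42, 46,
    47, 49, 49, 49, 55, 58, 59, 64, 67, 70, 75, 84,
]

age_range = max(ages) - min(ages)
median_age = statistics.median(ages)
modal_age = statistics.mode(ages)
q1, q2, q3 = statistics.quantiles(ages, n=4)   # default method is "exclusive"
iqr = q3 - q1
sd = statistics.stdev(ages)                    # sample SD (divisor n - 1)

print(age_range, median_age, modal_age)  # 64 35.0 35
print(q1, q3, iqr)                       # 29.25 49.0 19.75
print(round(sd, 2))
```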
Computation of Summary values from
Grouped Frequency Distribution
• Computing the arithmetic mean:
$$\bar{X} = \frac{\sum f X_c}{\sum f} = \frac{\sum f X_c}{n}$$
Example: A frequency distribution of the amount of time (in hours)
that 80 college students devoted to leisure activities during a week.
Time spent (hours)    Frequency
10 – 14               8
15 – 19               28
20 – 24               27
25 – 29               12
30 – 34               4
35 – 39               1
Calculation example
Time spent (hours)    Frequency (f)    Mid-point (Xc)    f x Xc
10 – 14               8                12                96
15 – 19               28               17                476
20 – 24               27               22                594
25 – 29               12               27                324
30 – 34               4                32                128
35 – 39               1                37                37
Total                 80                                 1655
Computation of Summary values from Grouped
Frequency Distribution
• For the time data, the mean time spent
by students on leisure activities was:
$$\bar{X} = \frac{\sum f X_c}{\sum f} = \frac{1{,}655}{80} \approx 20.7 \text{ hours}$$
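The grouped-mean calculation can be sketched as follows. Each class is written as a hypothetical `(lower limit, upper limit, frequency)` tuple, and the mid-point is taken as the average of the two class limits, as in the calculation table for the time data.

```python
# (lower limit, upper limit, frequency) for each class of the time data.
time_classes = [
    (10, 14, 8), (15, 19, 28), (20, 24, 27),
    (25, 29, 12), (30, 34, 4), (35, 39, 1),
]

total_f = sum(f for _, _, f in time_classes)
# Sum of frequency x class mid-point.
total_fx = sum(f * (lower + upper) / 2 for lower, upper, f in time_classes)
grouped_mean = total_fx / total_f

print(total_f, total_fx)       # 80 1655.0
print(round(grouped_mean, 1))  # 20.7
```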
Exercise: calculate the mean
Age        Frequency
10 – 14    7
15 – 19    13
20 – 24    25
25 – 29    10
30 – 34    5
35 – 39    8
40 – 44    10
45 – 50    22
Computation of Median from a
Grouped Frequency Distribution
$$\tilde{X} = l + \frac{\left(\frac{1}{2}n - F_i\right)}{f_{50}}\, w$$
Where,
l = true lower limit of the interval containing the median, i.e., the median class
w = length of the interval,
n = total frequency of the sample
F_i = cumulative frequency of all intervals below the median class
f_50 = frequency of the interval containing the median.
Computation of Median from a Grouped
Frequency Distribution
• For the time data the median time spent by
students for leisure activities was:
$$\tilde{X} = 19.5 + \frac{40 - 36}{27} \times 5 = 19.5 + 0.7 = 20.2 \text{ hours}$$
• In the calculation of the median from a grouped
frequency table, the basic assumption is that
within each class of the frequency distribution,
observations are uniformly or evenly distributed
over the class interval.
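The grouped-median formula can be sketched as below. It assumes integer class limits, so that the true lower limit is the stated lower limit minus 0.5 and the class width is upper - lower + 1; the function name and the `(lower, upper, frequency)` layout are illustrative.

```python
def grouped_median(classes):
    """Median from (lower, upper, frequency) classes listed in ascending order."""
    n = sum(f for _, _, f in classes)
    cumulative = 0  # cumulative frequency of all classes below the current one
    for lower, upper, f in classes:
        if cumulative + f >= n / 2:
            true_lower = lower - 0.5       # true lower limit of the median class
            width = upper - lower + 1      # class width
            return true_lower + (n / 2 - cumulative) / f * width
        cumulative += f
    raise ValueError("no classes given")

time_classes = [
    (10, 14, 8), (15, 19, 28), (20, 24, 27),
    (25, 29, 12), (30, 34, 4), (35, 39, 1),
]

print(round(grouped_median(time_classes), 1))  # 20.2
```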
Computation of the standard deviation
from a Grouped Frequency Distribution
♣ In a grouped frequency distribution, the SD is
computed as:
$$S = \sqrt{\frac{\sum f_i (X_{ci} - \bar{X})^2}{\sum f_i - 1}}$$
where $X_{ci}$ is the mid-point of the ith class.
Find the standard deviation for the time data
given earlier.
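One way to work this out is sketched below, applying the grouped formula to the time data with class mid-points standing in for the individual observations. The tuple layout and variable names are illustrative.

```python
import math

# (lower limit, upper limit, frequency) for each class of the time data.
time_classes = [
    (10, 14, 8), (15, 19, 28), (20, 24, 27),
    (25, 29, 12), (30, 34, 4), (35, 39, 1),
]

midpoints = [(lower + upper) / 2 for lower, upper, _ in time_classes]
freqs = [f for _, _, f in time_classes]

n = sum(freqs)
mean = sum(f * x for f, x in zip(freqs, midpoints)) / n
# Frequency-weighted sum of squared deviations of the mid-points from the mean.
weighted_sq_dev = sum(f * (x - mean) ** 2 for f, x in zip(freqs, midpoints))
sd = math.sqrt(weighted_sq_dev / (n - 1))

print(round(mean, 4))  # 20.6875
print(round(sd, 2))    # roughly 5.38 hours
```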
Individual assignment
Calculate the median and standard
deviation for the following data.
Table 1: Age of the patients treated in X health center
Age        Frequency
10 – 14    7
15 – 19    13
20 – 24    25
25 – 29    10
30 – 34    4
35 – 39    1
Summary
• Symbols
• Population mean
• population variance
• Population standard deviation
• Range
• Sample mean
• Sample standard deviation
• Sample variance