Quantitative Methods Course Text
Professor David Targett
The courses are updated on a regular basis to take account of errors, omissions and recent
developments. If you'd like to suggest a change to this course, please contact
us: [email protected].
Quantitative Methods
The Quantitative Methods programme is written by David Targett, Professor of Information Systems at
the School of Management, University of Bath and formerly Senior Lecturer in Decision Sciences at the
London Business School. Professor Targett has many years’ experience teaching executives to add
numeracy to their list of management skills and become balanced decision makers. His style is based on
demystifying complex techniques and demonstrating clearly their practical relevance as well as their
shortcomings. His books, including Coping with Numbers and The Economist Pocket Guide to Business
Numeracy, have stressed communication rather than technical rigour and have sold throughout the world.
He has written over fifty case studies which confirm the increasing integration of Quantitative Methods
with other management topics. The cases cover a variety of industries, illustrating the changing nature of
Quantitative Methods and the growing impact it is having on decision makers in the Information Technol-
ogy age. They also demonstrate Professor Targett’s wide practical experience in international
organisations in both public and private sectors.
One of his many articles, a study on the provision of management information, won the Pergamon Prize
in 1986.
He was part of the team that designed London Business School’s highly successful part-time MBA
Programme of which he was the Director from 1985 to 1988. During this time he extended the interna-
tional focus of the teaching by leading pioneering study groups to Hong Kong, Singapore and the United
States of America. He has taught on all major programmes at the London Business School and has
developed and run management education courses involving scores of major companies including:
British Rail
Citicorp
Marks and Spencer
Shell
First Published in Great Britain in 1990.
© David Targett 1990, 2000, 2001
The rights of Professor David Targett to be identified as Author of this Work have been asserted in
accordance with the Copyright, Designs and Patents Act 1988.
All rights reserved; students may print and download these materials for their own private study only and
not for commercial use. Except for this permitted use, no materials may be used, copied, shared, lent,
hired or resold in any way, without the prior consent of Edinburgh Business School.
Contents
Module 6 4/31
Module 7 4/37
Module 8 4/45
Module 9 4/53
Module 10 4/59
Module 11 4/66
Module 12 4/72
Module 13 4/79
Module 14 4/85
Module 15 4/96
Index I/1
Learning Objectives
This module gives an overview of statistics, introducing basic ideas and concepts at
a general level, before dealing with them in greater detail in later modules. The
purpose is to provide a gentle way into the subject for those without a statistical
background, in response to the cynical view that it is not possible for anyone to read
a statistical text unless they have read it before. For those with a statistical back-
ground, the module will provide a broad framework for studying the subject.
1.1 Introduction
The word statistics can refer to a collection of numbers or it can refer to the
science of studying collections of numbers. Under either definition the subject has
received far more than its share of abuse (‘lies, damned lies…’). A large part of the
reason for this may well be the failure of people to understand that statistics is like a
language. Just as verbal languages can be misused (for example, by politicians and
journalists?) so the numerical language of statistics can be misused (by politicians
and journalists?). To blame statistics for this is as sensible as blaming the English
language when election promises are not kept.
One does not have to be skilled in statistics to misuse them deliberately (‘figures
can lie and liars can figure’), but misuses often remain undetected because fewer
people seem to have the knowledge and confidence to handle numbers than have
similar abilities with words. Fewer people are numerate than are literate. What is
needed to see through the misuse of statistics, however, is common sense with the
addition of only a small amount of technical knowledge.
The difficulties are compounded by the unrealistic attitudes of those who do
have statistical knowledge. For instance, when a company’s annual accounts report
that the physical stock level is £34 236 417 (or even £34 236 000), it conveys an aura
of truth because the figure is so precise. Yet anyone who had accompanied the accountants
as they estimated the figure might well have concluded that the method by which the data
were collected did not warrant such precision. For market research to say that 9 out of 10
dogs prefer Bonzo dog food is also misleading, but in a far more overt fashion. The
statement is utterly meaningless, as is seen by asking the questions: ‘Prefer it to
what?’, ‘Prefer it under what circumstances?’, ‘9 out of which 10 dogs?’
Such examples and many, many others of greater or lesser subtlety have generat-
ed a poor reputation for statistics which is frequently used as an excuse for
remaining in ignorance of it. Unfortunately, it is impossible to avoid statistics in
business. Decisions are based on information; information is often in numerical
form. To make good decisions it is necessary to organise and understand numbers.
This is what statistics is about and this is why it is important to have some
knowledge of the subject.
Statistics can be split into two parts. The first part can be called descriptive
statistics. Broadly, this element handles the problem of sorting a large amount of
collected data in ways which enable its main features to be seen immediately. It is
concerned with turning numbers into real and useful information. Included here are
simple ideas such as organising and arranging data so that their patterns can be seen,
summarising data so that they can be handled more easily and communicating data
to others. Also included is the now very important area of handling computerised
business statistics as provided by management information systems and decision
support systems.
The second part can be referred to broadly as inferential statistics. This element
tackles the problem of how the small amount of data that has been collected (called
the sample) may be analysed to infer general conclusions about the total amount of
similar data that exist uncollected in the world (called the population). For instance,
opinion polls use inferential statistics to make statements about the opinions of the
whole electorate of a country, given the results of perhaps just a few hundred
interviews.
Both types of statistics are open to misuse. However, with a little knowledge and
a great deal of common sense, the errors can be spotted and the correct procedures
seen. In this module the basic concepts of statistics will be introduced. Later, some
abuses of statistics and how to counter them will be discussed.
The first basic concept to look at is that of probability, which is fundamental to
statistical work. Statistics deals with approximations and ‘best guesses’ because of
the inaccuracy and incompleteness of most of the data used. It is rare to make
statements that are 100 per cent certain; probability is the means by which this
uncertainty can be measured.
1.2 Probability
All future events are uncertain to some degree. That the present government will
still be in power in the UK in a year’s time (given that it is not an election year) is
likely, but far from certain; that a communist government will be in power in a
year’s time is highly unlikely, but not impossible. Probability theory enables the
difference in the uncertainty of events to be made more precise by measuring their
likelihood on a scale.
(a) ‘A priori’ approach. Some probabilities can be calculated in advance, from the
logic of the situation. An unbiased coin has two equally likely outcomes, so:
P(Heads) = 0.5
P(Tails) = 0.5
(b) ‘Relative frequency’ approach. When the event has been or can be repeated a
large number of times, its probability can be measured from the formula:
P(Event) = (No. of times the event occurred)/(No. of times the event could have occurred)
For example, to estimate the probability of rain on a given day in September in
London, look at the last 10 years’ records to find that it rained on 57 days. Then:
P(Rain) = 57/(30 × 10)
= 57/300
= 0.19
(c) Subjective approach. A certain group of statisticians (Bayesians) would argue
that the degree of belief that an individual has about a particular event may be
expressed as a probability. Bayesian statisticians argue that in certain circum-
stances a person’s subjective assessment of a probability can and should be used.
The traditional view, held by classical statisticians, is that only objective probabil-
ity assessments are permissible. Specific areas and techniques that use subjective
probabilities will be described later. At this stage it is important to know that
probabilities can be assessed subjectively but that there is discussion amongst
statisticians as to the validity of doing so. As an example of the subjective ap-
proach, let the event be the achievement of political unity in Europe by the year
2020 AD. There is no way that either of the first two approaches could be em-
ployed to calculate this probability. However, an individual can express his own
feelings on the likelihood of this event by comparing it with an event of known
probability: for example, is it more or less likely than obtaining a head on the
spin of a coin? After a long process of comparison and checking, the result
might be:
P(Political unity in Europe by 2020 AD) = 0.10
The process of accurately assessing a subjective probability is a field of study in
its own right and should not be regarded as pure guesswork.
The three methods of determining probabilities have been presented here as an
introduction and the approach has not been rigorous. Once probabilities have been
calculated by whatever method, they are treated in exactly the same way.
Examples
1. What is the probability of throwing a six with one throw of a die?
With the a priori approach there are six possible outcomes: 1, 2, 3, 4, 5 or 6 show-
ing. All outcomes are equally likely. Therefore:
P(throwing a 6) = 1/6
2. What is the probability of a second English Channel tunnel for road vehicles being
completed by 2025 AD?
The subjective approach is the only one possible, since logical thought alone cannot
lead to an answer and there are no past observations. My assessment is a small one,
around 0.02.
3. How would you calculate the probability of obtaining a head on one spin of a biased
coin?
The a priori approach may be possible if one had information on the aerodynamical
behaviour of the coin. A more realistic method would be to conduct several trial
spins and count the number of times a head appeared:
P(obtaining a head) = (No. of heads obtained)/(No. of trial spins)
4. What is the probability of drawing an ace in one cut of a pack of playing cards?
Use the a priori method. There are 52 possible outcomes (one for each card in the
deck) and the probability of picking any one card, say the ace of diamonds, must
therefore be 1/52. There are four aces in the deck, hence:
P(drawing an ace) = 4/52 = 1/13
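The trial-spins procedure of Example 3 can be mimicked in a few lines of code. The
sketch below, in Python, estimates a probability by relative frequency; the bias of 0.6
and all the names in it are assumptions made purely for illustration, not part of the
course text.

```python
import random

# Assumed bias: the coin comes down heads 60% of the time.
TRUE_P_HEAD = 0.6

def spin() -> bool:
    """Simulate one spin of the biased coin; True means a head."""
    return random.random() < TRUE_P_HEAD

n_spins = 10_000
heads = sum(spin() for _ in range(n_spins))

# Relative frequency: P(head) = (number of heads) / (number of spins).
print(f"Estimated P(head) = {heads / n_spins:.3f}")  # close to 0.6
```

With more spins the estimate settles ever closer to the true probability, which is
exactly the sense in which the relative frequency formula measures likelihood.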
[The raw data: 100 numbers as originally collected, in no particular order. Table 1.1
shows the same 100 numbers arranged in ascending order, from 40 at the smallest to
110 at the largest.]
Table 1.1 is an ordered array. The numbers look neater now but it is still not
possible to get a feel for the data (the average, for example) as they stand. The next
step is to classify the data and then arrange the classes in order. Classifying means
grouping the numbers in bands (e.g. 50–54) to make them easier to handle. Each
class has a frequency, which is the number of data points that fall within that class.
This is called a frequency table and is shown in Table 1.2. This shows that seven
data points were greater than or equal to 40 but less than 50, 12 were greater than or
equal to 50 but less than 60 and so on. There were 100 data points in all.
It is now much easier to get an overall conception of what the data mean. For
example, most of the numbers are between 60 and 90 with extremes of 40 and 110.
Of course, it is likely that at some time there may be a need to perform detailed
calculations with the numbers to provide specific information, but at present the
objective is merely to get a feel for the data in the shortest possible time. Another
arrangement with greater visual impact, called a frequency histogram, will help
meet this objective.
[Figure: frequency histogram of the data in Table 1.2. The class frequencies, from the
lowest class to the highest, are 7, 12, 19, 27, 22, 10 and 3.]
e.g. P(40 ≤ x < 50) = 7/100 = 0.07
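Each class probability follows the same pattern. A minimal sketch in Python, assuming
the classes run in tens from 40 up to 110 as in the histogram:

```python
# Class frequencies read from the histogram; they sum to the 100 data points.
classes = [(40, 50), (50, 60), (60, 70), (70, 80), (80, 90), (90, 100), (100, 110)]
frequencies = [7, 12, 19, 27, 22, 10, 3]

total = sum(frequencies)
for (low, high), f in zip(classes, frequencies):
    # The relative frequency of a class serves as the probability of that class.
    print(f"P({low} <= x < {high}) = {f}/{total} = {f / total:.2f}")
```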
As the class intervals are made narrower and the variable takes more values, the
distribution becomes smoother until, ultimately, the continuous distribution (d) will
be achieved.
[Figure 1.6 Variable classes (c) and a continuous variable (d). The area under each
part of the curve is shown; the total area is equal to 1.0. Between 50 and 60 the area
is 0.12, made up of 0.05 for 50 < x < 55 and 0.07 for 55 < x < 60.]
Example
Using the continuous distribution in Figure 1.6, what are the probabilities that a
particular value of the variable falls within the following ranges?
1. x ≤ 60
2. x ≤ 100
3. 60 ≤ x ≤ 110
4. x ≥ 135
5. x ≥ 110
Answers
1. P(x ≤ 60) = 0.01
2. P(x ≤ 100) = 0.01 + 0.49 = 0.5
3. P(60 ≤ x ≤ 110) = 0.49 + 0.27 = 0.76
4. P(x ≥ 135) = 0.02
5. P(x ≥ 110) = 0.21 + 0.02 = 0.23
In practice, the problems with the use of continuous distributions are, first, that
one can never collect sufficient data, sufficiently accurately measured, to
establish a continuous distribution. Second, were this possible, the accurate
measurement of areas under the curve would be difficult. Their greatest
practical use is where continuous distributions appear as standard distributions, a
topic discussed in the next section.
For example, one standard distribution, the normal, is derived from the follow-
ing theoretical situation. A variable is generated by a process which should give the
variable a constant value, but does not do so because it is subject to many small
disturbances. As a result, the variable is distributed around the central value (see
Figure 1.7). This situation (central value, many small disturbances) can be expressed
mathematically and the resulting distribution can be anticipated mathematically (i.e.
a formula describing the shape of the distribution can be found).
[Figure 1.8 Salaries: (a) hospital – high standard deviation; (b) school – low
standard deviation]
[Figure 1.9 Areas under the normal curve: 68 per cent of the distribution lies within
one standard deviation of the mean, 95 per cent within two standard deviations and
99 per cent within three.]
Example
A machine is set to produce steel components of a given length. A sample of 1000
components is taken and their lengths measured. From the measurements the average
and standard deviation of all components produced are estimated to be 2.96 cm and
0.025 cm respectively. Within what limits would 95 per cent of all components pro-
duced by the machine be expected to lie?
Take the following steps:
1. Assume that the lengths of all components produced follow a normal distribution.
This is reasonable since this situation is typical of the circumstances in which normal
distributions arise.
2. The parameters of the distribution are the mean = 2.96 cm and the standard
deviation = 0.025 cm. The distribution of the lengths of the components will there-
fore be as in Figure 1.10.
[Figure 1.10 Normal distribution of component lengths, mean 2.96 cm, showing the
central 95 per cent of the area.]
3. Since 95 per cent of a normal distribution lies within two standard deviations of
the mean, 95 per cent of all components would be expected to lie within
2.96 ± (2 × 0.025), i.e. between 2.91 cm and 3.01 cm.
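A minimal sketch of the same arithmetic in Python; the ‘2 standard deviations’
figure is the usual approximation for the central 95 per cent of a normal distribution:

```python
# Parameters estimated from the sample of 1000 components.
mean = 2.96   # cm
sd = 0.025    # cm

# About 95% of a normal distribution lies within 2 standard deviations.
lower = mean - 2 * sd
upper = mean + 2 * sd
print(f"95% of components between {lower:.2f} cm and {upper:.2f} cm")  # 2.91 to 3.01
```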
1.6.1 Definitions
Statistical expressions and the variables themselves may not have precise definitions.
The user may assume the producer of the data is working with a different definition
than is the case. By assuming a wrong definition, the user will draw a wrong
conclusion. The statistical expression ‘average’ is capable of many interpretations. A
firm of accountants advertises in its recruiting brochure that the average salary of
qualified accountants in the firm is £44 200. A prospective employee may conclude
that financially the firm is attractive to work for. A closer look shows that the
accountants in the firm and their salaries are as follows:
All the figures could legitimately be said to be the average salary. The firm has
doubtless chosen the one that best suited its purposes. Even if it were certain that
the correct statistical definition was being used, it would still be necessary to ask just
how the variable (salary) is defined. Is share of profits included in the partners’
salaries? Are bonuses included in the accountants’ salaries? Are allowances (a car,
for example) included in the accountants’ salaries? If these items are removed, the
situation might be:
The mean salary is now £36 880. Remuneration at this firm is suddenly not quite
so attractive.
1.6.2 Graphics
Statistical pictures are intended to communicate data very rapidly. This speed means
that first impressions are important. If the first impression is wrong then it is
unlikely to be corrected.
There are many ways of representing data pictorially, but the most frequently
used is probably the graph. If the scale of a graph is concealed or not shown at all,
the wrong conclusion can be drawn. Figure 1.11 shows the sales figures for a
company over the last three years. The company would appear to have been
successful.
[Figure 1.11 Sales over the last three years, drawn with no scale shown: the line
appears to rise steeply.]
A more informative graph showing the scale is given in Figure 1.12. Sales have
hardly increased at all. Allowing for inflation, they have probably decreased in real
terms.
[Figure 1.12 The same sales (£ million), drawn with the scale shown: the line is
almost flat, rising only from about 10 to 12.]
1.6.3 Sample Bias
Sample bias can enter statistical work at several stages.
First, it arises in the collection of the data. The left-wing politician who states that
80 per cent of the letters he receives are against a policy of the right-wing govern-
ment and concludes that a majority of all the electorate oppose the government on
this issue is drawing a conclusion from a biased sample.
Second, sample bias arises through the questions that elicit the data. Questions
such as ‘Do you go to church regularly?’ will provide unreliable information. There
may be a tendency for people to exaggerate their attendance since, generally, it is
regarded as a worthy thing to do. The word ‘regularly’ also causes problems. Twice a
year, at Christmas and Easter, is regular. So is twice every Sunday. It would be
difficult to draw any meaningful conclusions from the question as posed. The
question should be more explicit in defining regularity.
Third, the sample information may be biased by the interviewer. For example,
supermarket interviews about buying habits may be conducted by a young male
interviewer who questions 50 shoppers. It would not be surprising if the resultant
sample comprised a large proportion of young attractive females.
The techniques of sampling which can overcome most of these problems will be
described later in the course.
1.6.4 Omissions
The statistics that are not given can be just as important as those that are. A
television advertiser boasts that nine out of ten dogs prefer Bonzo dog food. The
viewer may conclude that 90 per cent of all dogs prefer Bonzo to any other dog
food. The conclusion might be different if it were known that:
(a) The sample size was exactly ten.
(b) The dogs had a choice of Bonzo or the cheapest dog food on the market.
(c) The sample quoted was the twelfth sample used and the first in which as many
as nine dogs preferred Bonzo.
Two variables may move closely together without one causing the other. The high
correlation often quoted between clergymen’s salaries and the price of rum, for
example, is explained by a third factor, inflation. The variables have increased
together as the cost of living has increased, but they are unlikely to be causally
related. This consideration is important when decisions are based on statistical
association. To take the example further, holding down clergymen’s salaries in order
to hold down the price of rum would work if the relationship were causal, but not if
it were mere association.
Where a truthful answer would amount to an admission, answers may well be biased.
The figure of 2.38 is likely to be higher than
the true figure. Even so, a comparison with 20 years ago can still be made, but only
provided the bias is the same now as then. It may not be. Where did the 20-year-old
data come from? Most likely from a differently structured survey of different sample
size, with different questions and in a different social environment. The comparison
with 20 years ago, therefore, is also open to suspicion.
One is also misled in this case by the accuracy of the data. The figure of 2.38
suggests a high level of accuracy, completely unwarranted by the method of data
collection. When numbers are presented to many decimal places, one should
question the relevance of the claimed degree of accuracy.
Learning Summary
The purpose of this introduction has been twofold. The first aim has been to
present some statistical concepts as a basis for more detailed study of the subject.
All the concepts will be further explored. The second aim has been to encourage a
healthy scepticism and atmosphere of constructive criticism, which are necessary
when weighing statistical evidence.
The healthy scepticism can be brought to bear on applications of the concepts
introduced so far as much as elsewhere in statistics. Probability and distributions can
both be subject to misuse.
Logical errors are often made with probability. For example, suppose a ques-
tionnaire about marketing methods is sent to a selection of companies. From the
200 replies, it emerges that 48 of the respondents are not in the area of marketing. It
also emerges that 30 are at junior levels within their companies. What is the proba-
bility that any particular questionnaire was filled in by someone who was either not
in marketing or at a junior level? It is tempting to suppose that:
Probability = (48 + 30)/200 = 78/200 = 39%
This is almost certainly wrong because of double counting. Some of the 48 non-
marketers are also likely to be at a junior level. If 10 respondents were non-
marketers and at a junior level, then:
Probability = (48 + 30 − 10)/200 = 68/200 = 34%
Only in the rare case where none of those at a junior level were outside the mar-
keting area would the first calculation have been correct.
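The corrected sum is the general rule P(A or B) = P(A) + P(B) − P(A and B). A
minimal sketch of the questionnaire calculation, with the overlap of 10 assumed as
in the text:

```python
replies = 200
non_marketing = 48   # respondents not in the marketing area
junior = 30          # respondents at junior levels
both = 10            # assumed overlap: junior AND non-marketing

naive = (non_marketing + junior) / replies             # double-counts the overlap
corrected = (non_marketing + junior - both) / replies  # inclusion-exclusion
print(f"naive = {naive:.0%}, corrected = {corrected:.0%}")  # 39% vs 34%
```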
[Figure 1.13 Distribution of the salaries of civil servants in a government
department, drawn with unequal class intervals.]
Graphical errors can frequently be seen with distributions. Figure 1.13 shows an
observed distribution relating to the salaries of civil servants in a government
department. The figures give a wrong impression of the spread of salaries because
the class intervals are not all equal. One could be led to suppose that salaries are
higher than they are. The lower bands are of width £8000 (0–8, 8–16, 16–24). The
higher ones are of a much larger size. The distribution should be drawn with all the
intervals of equal size, as in Figure 1.14.
Statistical concepts are open to misuse and wrong interpretation just as verbal
reports are. The same vigilance should be exercised in the former as in the latter.
[Figure 1.14 The same distribution of civil servants’ salaries (£000s), redrawn with
equal class intervals.]
Review Questions
1.1 One of the reasons probability is important in statistics is that, if data being dealt with
are in the form of a sample, any conclusions drawn cannot be 100 per cent certain. True
or false?
1.2 A randomly selected card drawn from a pack of cards was an ace. It was not returned
to the pack. What is the probability that a second card drawn will also be an ace?
A. 1/4
B. 1/13
C. 3/52
D. 1/17
E. 1/3
1.4 A coin is known to be unbiased (i.e. it is just as likely to come down ‘heads’ as ‘tails’). It
has just been tossed eight times and each time the result has been ‘heads’. On the ninth
throw, what is the probability that the result will be ‘tails’?
A. Less than 1/2
B. 1/2
C. More than 1/2
D. 1
[Frequency table for Questions 1.5 to 1.7: a company’s daily sales over 78 days,
grouped in £10 000 classes with frequencies including 25, 22, 17, 8 and 6 days.]
1.5 On how many days were sales not less than £50 000?
A. 17
B. 55
C. 23
D. 48
1.6 What is the probability that on any day sales are £60 000 or more?
A. 1/13
B. 23/78
C. 72/78
D. 0
1.7 What is the sales level that was exceeded on 90 per cent of all days?
A. £20 000
B. £30 000
C. £40 000
D. £50 000
E. £60 000
1.9 A normal distribution has mean 60 and standard deviation 10. What percentage of
readings will be in the range 60–70?
A. 68%
B. 50%
C. 95%
D. 34%
E. 84%
1.10 A police checkpoint recorded the speeds of motorists over a one-week period. The
speeds had a normal distribution with a mean 82 km/h and standard deviation 11 km/h.
What speed was exceeded by 97.5 per cent of motorists?
A. 49
B. 60
C. 71
D. 104
0.9 3.5 0.8 1.0 1.3 2.3 1.0 2.4 0.7 1.0
2.3 0.2 1.6 1.7 5.2 1.1 3.9 5.4 8.2 1.5
1.1 2.8 1.6 3.9 3.8 6.1 0.3 1.1 2.4 2.6
4.0 4.3 2.7 0.2 0.3 3.1 2.7 4.1 1.4 1.1
3.4 0.9 2.2 4.2 21.7 3.1 1.0 3.3 3.3 5.5
0.9 4.5 3.5 1.2 0.7 4.6 4.8 2.6 0.5 3.6
6.3 1.6 5.0 2.1 5.8 7.4 1.7 3.8 4.1 6.9
3.5 2.1 0.8 7.8 1.9 3.2 1.3 1.4 3.7 0.6
1.0 7.5 1.2 2.0 2.0 11.0 2.9 6.5 2.0 8.6
1.5 1.2 2.9 2.9 2.0 4.6 6.6 0.7 5.8 2.0
1 Classify the data in intervals one minute wide. Form a frequency histogram. What
service time is likely to be exceeded by only 10 per cent of customers?
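One possible software sketch of the analysis asked for, in Python (the variable names
and the use of a simple sorted-list percentile are choices made here, not part of the
case study):

```python
from collections import Counter

# The 100 service times given above, in minutes.
service_times = [
    0.9, 3.5, 0.8, 1.0, 1.3, 2.3, 1.0, 2.4, 0.7, 1.0,
    2.3, 0.2, 1.6, 1.7, 5.2, 1.1, 3.9, 5.4, 8.2, 1.5,
    1.1, 2.8, 1.6, 3.9, 3.8, 6.1, 0.3, 1.1, 2.4, 2.6,
    4.0, 4.3, 2.7, 0.2, 0.3, 3.1, 2.7, 4.1, 1.4, 1.1,
    3.4, 0.9, 2.2, 4.2, 21.7, 3.1, 1.0, 3.3, 3.3, 5.5,
    0.9, 4.5, 3.5, 1.2, 0.7, 4.6, 4.8, 2.6, 0.5, 3.6,
    6.3, 1.6, 5.0, 2.1, 5.8, 7.4, 1.7, 3.8, 4.1, 6.9,
    3.5, 2.1, 0.8, 7.8, 1.9, 3.2, 1.3, 1.4, 3.7, 0.6,
    1.0, 7.5, 1.2, 2.0, 2.0, 11.0, 2.9, 6.5, 2.0, 8.6,
    1.5, 1.2, 2.9, 2.9, 2.0, 4.6, 6.6, 0.7, 5.8, 2.0,
]

# Frequency table with classes one minute wide: [0,1), [1,2), ...
counts = Counter(int(t) for t in service_times)
for minute in sorted(counts):
    print(f"{minute}-{minute + 1} min: {counts[minute]} customers")

# The service time exceeded by only 10% of customers is the 90th
# percentile: here taken as the 90th value of the 100 ordered times.
ordered = sorted(service_times)
print("Exceeded by only 10% of customers:", ordered[89], "minutes")
```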
It has long been the practice at JPC to negotiate employee benefits on a company-wide
basis, but to negotiate wages for each
class of work in a plant separately. For years, however, this antiquated practice has been
little more than a ritual. Supposedly, the system gives workers the opportunity to
express their views, but the fact is that the wages settlement in the first group invariably
sets the pattern for all other groups within a particular company. The Door Trim Line
at JPC was the key group in last year’s negotiations. Being first in line, the settlement in
Door Trim would set the pattern for JPC that year.
Annie Smith is forewoman for the Door Trim Line. There are many variations of
door trim and Annie’s biggest job is to see that they get produced in the right mix. The
work involved in making the trim is about the same regardless of the particular variety.
That is to say, it is a straight piecework operation and the standard price is 72p per unit
regardless of variety. The work itself, while mainly of an assembly nature, is quite
intricate and requires a degree of skill.
Last year’s negotiations started with the usual complaint from the union about piece
prices in general. There was then, however, an unexpected move. Here is the union’s
demand for the Door Trim Line according to the minutes of the meeting:
We’ll come straight to the point. A price of 72p a unit is diabolical… A fair
price is 80p.
The women average about 71 units/day. Therefore, the 8p more that we want
amounts to an average of £5.68 more per woman per day…
This is the smallest increase we’ve demanded recently and we will not accept
less than 80p.
(It was the long-standing practice in the plant to calculate output on an average daily
basis. Although each person’s output is in fact tallied daily, the bonus is paid on daily
output averaged over the week. The idea is that this gives a person a better chance to
recoup if she happens to have one or two bad days.)
The union’s strategy in this meeting was a surprise. In the past the first demand was
purposely out of line and neither side took it too seriously. This time their demand was
in the same area as the kind of offer that JPC’s management was contemplating.
At their first meeting following the session with the union, JPC’s management heard
the following points made by the accountant:
a. The union’s figure of 71 units per day per person is correct. I checked it against the
latest Production Report. It works out like this:
Average weekly output for the year to date is 7100 units; thus, average daily output
is 7100/5 =1420 units/day.
The number of women directly employed on the line is 20, so that average daily
output is 1420/20 = 71 units/day/woman.
b. The union’s request amounts to an 11.1 per cent increase: (80 − 72)/72 × 100 =
11.1.
c. Direct labour at current rates is estimated at £26 million. Assuming an 11.1 per cent
increase across the board, which, of course, is what we have to anticipate, total
annual direct labour would increase by about £2.9 million: £26 000 000 × 11.1% =
£2 886 000.
Prior to the negotiations management had thought that 7 per cent would be a rea-
sonable offer, being approximately the rate at which productivity and inflation had been
increasing in recent years. Privately they had set 10 per cent as the upper limit to their
final offer. At this level they felt some scheme should be introduced as an incentive to
better productivity, although they had not thought through the details of any such
scheme.
As a result of the union’s strategy, however, JPC’s negotiating team decided not to
hesitate any longer. Working late, they put together their ‘best’ package using the 10
per cent criterion. The main points of the plan were as follows:
a. Maintain the 72p per unit standard price but provide a bonus of 50p for each unit
above a daily average of 61 units/person.
b. Since the average output per day per person is 71, this implies that on average 10
bonus units per person per day would be paid.
c. The projected weekly cost then is £5612:
(71 × 0.72) + (10 × 0.50) = £56.12 per woman per day
56.12 × 5 × 20 = £5612
d. The current weekly cost is £5112:
71 × 0.72 × 5 × 20 = £5112
e. This amounts to an average increase of £500 per week, slightly under the 10 per
cent upper limit:
500/5112 × 100 = 9.78%
f. The plan offers the additional advantage that the average worker gets 10 bonus units
immediately, making the plan seem attractive.
g. Since the output does not vary much from week to week, and since the greatest
improvement should come from those who are currently below average, the largest
portion of any increase should come from units at the lower cost of 72p each. Those
currently above average probably cannot improve very much. To the extent that this
occurs, of course, there is a tendency to reduce the average cost below the 79p per
unit that would result if no change at all occurs:
5612/(71 × 5 × 20) = 79.0p
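As a check on the plan’s arithmetic, a short Python sketch (the names are assumed
for illustration; the figures are those used by the negotiating team):

```python
price = 0.72          # standard price per unit (£)
bonus = 0.50          # bonus per unit above the threshold (£)
threshold = 61        # units/day/person before the bonus applies
average_output = 71   # units/day/person
women, days = 20, 5

bonus_units = average_output - threshold                  # 10 units/day/person
daily_pay = average_output * price + bonus_units * bonus  # £56.12 per person
projected_weekly = daily_pay * days * women               # £5612
current_weekly = average_output * price * days * women    # £5112

increase = projected_weekly - current_weekly              # £500
print(f"Offer = {increase / current_weekly:.2%}")         # 9.78%
```

The printed increase, 9.78 per cent, is the figure the team rounded to 9.8 per cent.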
At this point management had to decide whether they should play all their cards at
once or whether they should stick to the original plan of a 7 per cent offer. Two further
issues had to be considered:
a. How good were the rates?
b. Could a productivity increase as suggested by the 9.8 per cent offer plan really be
anticipated?
Annie Smith, the forewoman, was called into the meeting, and she gave the following
information:
a. A few workers could improve their own average a little, but the rates were too tight
for any significant movement in the daily outputs.
b. This didn’t mean that everyone worked at the same level, but that individually they
were all close to their own maximum capabilities.
c. A number did average fewer than 61 units per day. Of the few who could show a
sustained improvement, most would be in this fewer-than-61 category.
This settled it. JPC decided to go into the meeting with their ‘best’ offer of 9.8 per
cent. Next day the offer was made. The union asked for time to consider it and the next
meeting was set for the following afternoon.
On the morning of the following day Annie Smith reported that her Production Per-
formance Report (see Table 1.4) was missing. She did not know who had taken it but
was pretty sure it was the union steward.
The next meeting with the union lasted only a few minutes. A union official stated his
understanding of the offer and, after being assured that he had stated the details
correctly, he announced that the union approved the plan and intended to recommend
its acceptance to its membership. He also added that he expected this to serve as the
basis for settlement in the other units as usual and that the whole wage negotiations
could probably be completed in record time.
And that was that. Or was it? Some doubts remained in the minds of JPC’s negotiat-
ing team. Why had the union been so quick to agree? Why had the Production
Performance Report been stolen? While they were still puzzling over these questions,
Annie Smith phoned to say that the Production Performance Report had been returned.
1 In the hope of satisfying their curiosity, the negotiating team asked Annie to bring the
Report down to the office. Had any mistakes been made?
Was JPC’s offer really 9.8 per cent? If not, what was the true offer?
year. It is a safety record which cannot be matched by any other form of general
anaesthesia.
Yours faithfully,
Mr Y, (President-Elect) Society for the Advancement of Anaesthesia in Dentis-
try.
1 Comment upon the evidence and reasoning (as given in the letters) that lead to these
two conclusions.
Learning Objectives
This module describes some basic mathematics and associated notation. Some
management applications are described but the main purpose of the module is to lay
the mathematical foundations for later modules. It will be preferable to encounter
the shock of the mathematics at this stage rather than later when it might detract
from the management concepts under consideration. For the mathematically literate
the module will serve as a review; for those in a rush it could be omitted altogether.
2.1 Introduction
The quantitative courses people take at school, although usually entitled ‘mathemat-
ics’, probably cover several quantitative subjects, including algebra, geometry and
trigonometry. Some of these areas are useful as a background to numerical methods
in management. These include graphs, functions, simultaneous equations and
exponents. Usually, the mathematics is a precise means of expressing concepts and
techniques. A technique may not be complex, but the presence of mathematics,
especially notation, can cause difficulties and arouse fears.
Most of the mathematics met in a management course will be reviewed here.
Although their usefulness in management will be indicated, this is not the prime
purpose. Relevance is not an issue at this stage; the main objective is to deal with
basic mathematical ideas now so that they do not interfere with comprehension of
more directly applicable techniques at a later stage.
[Figure 2.1 Plotting the point (3,2): 3 units along the x-axis and 2 units up the
y-axis.]
[Figure 2.2 The four quadrants of a graph: x values are negative to the left of the
y-axis and positive to the right; y values are negative below the x-axis and positive
above it.]
[Figure 2.3 Examples of plotted points: (−7,4), (3,2), (6½,1), (−4,0), (4,0), (0,−3)
and (−4,−3).]
This can be written more concisely with letters in place of the words. Suppose x
is the variable number of products sold, y is the variable direct profit, p the set price
and q the set cost per unit. Then:
y = (p − q)x (2.1)
The equation is given a label (Equation 2.1) so that it can be referred to later.
Note that multiplication can be shown in several different ways. For example,
Price (p) times Volume (x) can be written as:
p · x    px    (p)(x)
The multiplication sign (×) used in arithmetic tends not to be used with letters
because of possible confusion with the use of x as a variable.
The use of symbols to represent numbers is the dictionary definition of algebra.
It is intended not to confuse but to simplify. The symbols (as opposed to verbal
labels, e.g. ‘y’ instead of ‘Direct profit’) shorten the description of a complex
relationship; the symbols (as opposed to numbers, e.g. y instead of 2.1, 3.7, etc.)
allow the general properties of the variables to be investigated instead of particular
properties when the symbols take particular numerical values.
The relationship (Equation 2.1) above is an equation. Since price and cost are
fixed in this example, p and q are constants. Depending upon the quantities sold, x
and y may take on any of a range of values; therefore, they are variables. Once x is
known, y is automatically determined, so y is a function of x. Whenever the value
of a variable can be calculated given the values of other variables, it is said to be a
function of the other variables.
If the values of the constants are known, say p = 5, q = 3, then the equation
becomes:
y = 2x (2.2)
A graph can now be made of this function. The graph is the set of all points
satisfying Equation 2.2, i.e. all the points for which Equation 2.2 is true. By looking
at some of the points, the shape of the graph can be seen:
when x = 0, y = 2 × 0 = 0
when x = 1, y = 2 × 1 = 2
when x = 2, y = 2 × 2 = 4, etc.
Therefore, points (0,0), (1,2), (2,4), etc. all lie on this function. Joining together a
sample of such points shows the shape of this graph. This has been done in
Figure 2.5, which shows the graph of the function y = 2x.
[Figure 2.5 Graph of the function y = 2x: a straight line through (0,0), (1,2) and
(2,4).]
[Figure: graph of y = x² − 2, a curve plotted from the table of values:
x:  −3  −2  −1   0   1   2   3
y:   7   2  −1  −2  −1   2   7]
[Figure: graph of y = x³ + 3x² − 2, a curve plotted from the table of values:
x:  −3  −2  −1   0   1   2   3
y:  −2   2   0  −2   2  18  52]
In Figure 2.5 only two points need be plotted since a straight line is defined
completely by any two points lying on it. The number of points that need to be
plotted varies with the complexity of the function.
When we are working with functions, they are usually restricted to their algebraic
form. It is neater and more economical to use them in this form. They are generally
put in graphical form only for illustrative purposes. The behaviour of complex
equations is often difficult to imagine from the mathematical form itself.
where
Q1 is the quantity sold at price P1
Q2 is the quantity sold at price P2
Suppose the product currently sells at the price P1 and the quantity sold per
month is Q1. A new price is mooted. What is likely to be the quantity sold at this
price? If the elasticity is known (or can be estimated), then the equation can be
rearranged and solved for Q2, i.e. put in the form:
Q2 = function of E,Q1,P1,P2
The likely quantity sold (Q2) at the new price (P2) can then be calculated. But first
the equation would have to be rearranged.
The four rules by which equations can be rearranged are:
(a) Addition. If the same quantity is added to both sides of an equation, the
resulting equation is equivalent to the original equation.
Examples
1 Solve x − 1 = 2 x−1 = 2
Add 1 to both sides of the equation x−1+1 = 2+1
x = 3
(b) Subtraction. If the same quantity is subtracted from both sides of an equation,
the resulting equation is equivalent to the original.
Examples
1 Solve x + 4 = 14 x + 4 = 14
Subtract 4 from both sides of the equation x + 4 − 4 = 14 − 4
x = 10
(c) Division. If both sides of an equation are divided by the same number, except
zero, the resulting equation is equivalent to the original.
Examples
1 Solve 8x = 72
Divide both sides by 8: 8x/8 = 72/8
x = 9
2 Solve for x: 2y − 4x + 5 = 6x − 3y − 5
Add 5 and 3y to both sides: 5y − 4x + 10 = 6x
Add 4x to both sides: 5y + 10 = 10x
Divide both sides by 10: y/2 + 1 = x
This illustrates that the solved variable can appear on either side of the equation.
(d) Multiplication. If both sides of an equation are multiplied by the same number,
except zero, the resulting equation is equivalent to the original.
Examples
1 Solve x/3 = 6
Multiply both sides by 3: x = 18
2 Solve (2y + 3)/(4 − y) = 1
Multiply both sides by (4 − y): 2y + 3 = 4 − y
Add y to both sides: 3y + 3 = 4
Subtract 3 from both sides: 3y = 1
Divide both sides by 3: y = 1/3
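These four rules are exactly what a computer algebra system applies when it solves
an equation. A sketch using the sympy library (an assumption made here; the module
itself expects the rearrangement to be done by hand, and any CAS would do):

```python
from sympy import Eq, solve, symbols

x, y = symbols("x y")

print(solve(Eq(x - 1, 2), x))    # [3]   addition rule
print(solve(Eq(x + 4, 14), x))   # [10]  subtraction rule
print(solve(Eq(8 * x, 72), x))   # [9]   division rule
print(solve(Eq(x / 3, 6), x))    # [18]  multiplication rule
print(solve(Eq(2 * y - 4 * x + 5, 6 * x - 3 * y - 5), x))  # [y/2 + 1]
```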
[Figure: the line y = 2x + 1, crossing the axes at (−½,0) and (0,1). Two points on
the line, A at (x₁,y₁) and B at (x₂,y₂), illustrate the slope: the y distance y₁ − y₂
divided by the x distance x₁ − x₂.]
The same reasoning applies to any two points along the line, confirming the
obvious fact that the slope (and therefore m) is constant along a straight line.
If A and B are two particular points (2,5) and (1,3) (i.e. x1 = 2, y1 = 5, x2 = 1, y2 =
3) then:
Slope AB = (5 − 3)/(2 − 1) = 2
For example, if the sales volume of a company were expressed as a linear func-
tion of time, y would be sales and x would be time (x = 1 for the first time period,
x = 2 for the second time period and so on). Then m would be the constant change
in sales volume from one time period to the next. If m = 3, then sales volume would
be increasing by 3 each time period.
A few additional facts are worthy of note.
(a) It is possible for m to be negative. If this is the case, the line leans in a backward
direction, since as x increases, y decreases. This is illustrated in Figure 2.9.
(b) It is possible for m to take the value 0. The equation of the line is then
y = constant, and the line is parallel to the x axis.
(c) Similarly, the line x = constant is parallel to the y axis and the slope can be
regarded as being infinite. The last two lines are examples of constant functions
and are also shown in Figure 2.9.
[Figure 2.9 Lines with negative and zero slopes, and a line parallel to the y-axis:
y = −x + 3, y = 1 and x = −2.]
2. What is the equation of the line with intercept −3 and which passes through the
point (1,1)?
Since the intercept is −3, the line is:
y = mx − 3
Since it passes through (1,1):
1=m–3
m=4
The line is y = 4x – 3.
3. What is the equation of the line passing through the points (3,1) and (1,5)?
The slope between these two points is:
Slope = (1 − 5)/(3 − 1) = −4/2 = −2
Therefore, the line must be y = −2x + c.
Since it passes through (3,1) (NB (1,5) could just as easily be used):
1 = −6 + c
c=7
The line is y = −2x + 7.
The values of x and y that satisfy both equations are found from the point of
intersection of the lines (see Figure 2.10). Since this point is on both lines, the x and
y values here must satisfy both equations. From the graph these values can be read:
x = 4, y = 3. That these values do fit both equations can be checked by substituting
x = 4, y = 3 into the equations of the lines.
[Figure 2.10 Lines (2.4) and (2.5), intersecting at the point x = 4, y = 3.]
[Figure: the lines 2x + 3y = 12 and 2x + 3y = 24 are parallel and never cross, so no
pair of x and y values satisfies both equations simultaneously.]
Example
Solve the two simultaneous equations:
5x + 2y = 17 (2.6)
2x − 3y = 3 (2.7)
Multiply Equation 2.6 by 3 and Equation 2.7 by 2 so that the coefficients of y are the
same in both equations. (Equation 2.6 could just as well have been multiplied by 2 and
Equation 2.7 by 5 and then x eliminated.)
15x + 6y = 51
4x − 6y = 6
Add the two equations to eliminate y:
19x = 57
x=3
Substitute x = 3 in Equation 2.7 to find the y value:
6 − 3y = 3
3y = 3
y=1
The solution is x = 3, y = 1.
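For larger systems the same elimination is carried out numerically. A sketch using
the numpy library (an assumption; the module itself works entirely by hand):

```python
import numpy as np

# Equations 2.6 and 2.7 in matrix form: A . [x, y] = b.
A = np.array([[5.0, 2.0],    # 5x + 2y = 17
              [2.0, -3.0]])  # 2x - 3y = 3
b = np.array([17.0, 3.0])

x, y = np.linalg.solve(A, b)  # systematic elimination (LU factorisation)
print(x, y)  # 3.0 1.0
```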
[Figure: the balance (£) of 1000 invested at 10 per cent per annum compound
interest grows with time: 1100, 1210 and 1331 after one, two and three years.]
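The balances in the figure grow by a constant 10 per cent a year, i.e. they follow
1000 × 1.1ᵗ, an exponential function of time. A sketch, assuming the £1000 and
10 per cent suggested by the figure:

```python
# Assumed from the figure: £1000 invested at 10% per annum compound interest.
balance = 1000.0
for year in range(1, 5):
    balance *= 1.10
    print(f"End of year {year}: £{balance:.0f}")  # 1100, 1210, 1331, 1464
```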
2.6.1 Exponents
Consider an expression of the form aˣ. The base is a and the exponent is x. If x is a
whole number, then the expression has an obvious meaning (e.g. a² = a × a,
a³ = a × a × a, 3⁴ = 81, etc.). It also has meaning for values of x that are not whole
numbers. To see what this meaning is, it is necessary to look at the rules for working
with exponents.
(a) Multiplication. The rule is:
aˣ × aʸ = aˣ⁺ʸ
For example:
a² × a³ = a⁵
It can be seen that this makes good sense if one substitutes whole numbers for a,
x and y. For instance:
2² × 2³ = 4 × 8
= 32
= 2⁵
Note that the exponents can only be added if the bases are the same. For exam-
ple, a³ × b² cannot be simplified.
(b) Division. The rule is similar to that for multiplication:
aˣ/aʸ = aˣ⁻ʸ
For example:
a⁵/a² = a⁵⁻² = a³
Again, the reasonableness of the rule is confirmed by resorting to a specific nu-
merical example.
(c) Raising to a power. The rule is:
(aˣ)ʸ = aˣʸ
For example:
(a²)³ = a⁶
Several points of detail follow from the rules:
(a) a⁰ = 1, since 1 = aˣ/aˣ = aˣ⁻ˣ = a⁰
(b) a⁻ˣ = 1/aˣ, since 1/aˣ = a⁰/aˣ = a⁰⁻ˣ = a⁻ˣ
(c) a^(1/2) = √a, since a^(1/2) × a^(1/2) = a^(1/2 + 1/2) = a¹ = a; similarly,
a^(1/3) = ∛a, the cube root of a.
This last point demonstrates that fractional or decimal exponents do have mean-
ing.
Examples
1. Simplify (a⁴ × a³)/a²
= a⁷/a²
= a⁵
2. Evaluate: 27^(4/3)
= (27^(1/3))⁴
= (∛27)⁴
= 3⁴
= 81
3. Evaluate: 4^(−3/2)
= 1/4^(3/2)
= 1/(√4)³
= 1/2³
= 1/8
4. Evaluate: (2²)³
= 2⁶
= 64
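A quick numerical check of these examples, using Python’s power operator
(floating-point results are approximate):

```python
print(27 ** (4 / 3))              # ~81.0  (example 2)
print(4 ** (-3 / 2))              # 0.125 = 1/8  (example 3)
print((2 ** 2) ** 3)              # 64  (example 4)
print(2 ** 2 * 2 ** 3 == 2 ** 5)  # True: the multiplication rule
```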
2.6.2 Logarithms
In pursuing the objective of understanding exponential functions, it is also helpful
to look at logarithms. At school, logarithms are used for multiplying and dividing
large numbers, but this is not the purpose here. A logarithm is simply an exponent.
For example, if y = ax then x is said to be the logarithm of y to the base a. This is
written as logay = x.
Examples
1. 1000 = 10³ and therefore the logarithm of 1000 to the base 10 is 3 (i.e.
3 = log₁₀1000). Logarithms to the base 10 are known as common logarithms.
2. 8 = 2³ and therefore the logarithm of 8 to the base 2 is 3 (i.e. 3 = log₂8). Logarithms
to the base 2 are binary logarithms.
3. e is a constant frequently found in mathematics (just as π is). e has the value 2.718
approximately. Logarithms to the base e are called natural logarithms and are writ-
ten ln (i.e. x = lny means x = logey).
e has other properties which make it of interest in mathematics.
The rules for the manipulation of logarithms follow from the rules for exponents:
(a) Addition: log x + log y = log xy
(b) Subtraction: log x − log y = log(x/y)
(c) Multiplication by a constant: n log x = log xⁿ
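The rules can be confirmed numerically; a minimal sketch with Python’s math
module:

```python
import math

x, y = 100.0, 1000.0
print(math.isclose(math.log10(x) + math.log10(y), math.log10(x * y)))  # True (addition)
print(math.isclose(math.log10(x) - math.log10(y), math.log10(x / y)))  # True (subtraction)
print(math.isclose(3 * math.log10(x), math.log10(x ** 3)))             # True (constant)
print(math.log(math.e))  # natural logarithm: ln(e) = 1.0
```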
[Figure: exponential functions. (a) Growth: y = 2 × 2ˣ, with k = 2. (b) Decay:
y = 5 × 2⁻ˣ.]
[Figure: an exponentially growing variable – new cases plotted against time, rising
from an initial level.]
A statistical technique, called regression analysis, is the usual way of dealing with this
second problem. (Regression analysis is a topic covered in Module 11 and Module 12.)
Review Questions
2.1 Which point on the graph shown below is (−1,2)?
[Graph for Question 2.1: four points labelled A, B, C and D.]
A. Point A
B. Point B
C. Point C
D. Point D
2.2 What is the equation of the line shown on the graph below?
[Graph for Question 2.2: a straight line.]
A. y = x + 1
B. y = 1 − x
C. y = −x − 1
D. y = x − 1
2.3 Which of the ‘curves’ shown below is most likely to have the equation y = x² − 6x + 4?
[Graph for Question 2.3: three curves labelled A, B and C.]
A. A
B. B
C. C
2.6 What is the equation of the line with intercept 3 and that goes through point (3,9)?
A. y = 3x + 2
B. y = 6x + 3
C. y = 4x + 3
D. y = 2x + 3
2.7 What is the equation of the line that goes through the points (−1,6) and (3,−2)?
A. y = 2x + 8
B. y = 4 − 2x
C. y = −x + 5
D. y = 2x − 8
2.12 What is the equation of the curve shown in the graph below?
[Graph for Question 2.12: a curve rising steeply, with y-axis gridlines at 10, 20 and
30 and x values marked at 1 and 2.]
A. = + 10
B. = 10 · 10 .
C. = 10 · 10 .
D. = 100 · 10 .
1 What is the breakeven point for each system (i.e. how many of each system need to be
sold so that revenue equals cost)?
Handling Numbers
Module 3 Data Communication
Module 4 Data Analysis
Module 5 Summary Measures
Module 6 Sampling Methods
Data Communication
Contents
3.1 Introduction.............................................................................................3/1
3.2 Rules for Data Presentation ..................................................................3/3
3.3 The Special Case of Accounting Data ............................................... 3/12
3.4 Communicating Data through Graphs.............................................. 3/16
Learning Summary ......................................................................................... 3/21
Review Questions ........................................................................................... 3/22
Case Study 3.1: Local Government Performance Measures ..................... 3/24
Case Study 3.2: Multinational Company’s Income Statement.................. 3/25
Case Study 3.3: Country GDPs ..................................................................... 3/25
Case Study 3.4: Energy Efficiency ................................................................. 3/26
Learning Objectives
By the end of the module the reader should know how to improve data presenta-
tion. This is important both in communicating data to others and in analysing data.
The emphasis is on the visual aspects of data presentation. Special reference is made
to accounting data and graphs.
3.1 Introduction
Data communication means the transmission of information through the medium
of numbers. Its reputation is mixed. Sometimes it is thought to be done dishonestly
(‘There are lies, damned lies and statistics’); at other times it is thought to be done
confusingly so that the numbers appear incomprehensible and any real information
is obscured. Thus far, numbers and words are similar. Words can also mislead and
confuse. The difference seems to be that numbers are treated with less tolerance and
are quickly abandoned as a lost cause. More effort is made with words. One hears,
for instance, of campaigns for the plain and efficient use of words by bureaucrats,
lawmakers, etc., but not for the plain and efficient use of numbers by statisticians,
computer scientists and accountants. Furthermore, while experts spend much time
devising advanced numerical techniques, little effort is put into methods for better
data communication.
This module attempts to redress the balance by looking closely at the question of
data communication. In management, numbers are usually produced in the form of
tables or graphs. The role of both these modes of presentation will be discussed.
The case of accounting data will be given separate treatment.
It may seem facile to say that data should be presented in a form that is suitable
for the receiver rather than convenient for the producer. Yet computerised man-
agement data often relate more to the capabilities of the computer than to the needs
of the managers – many times accounting information seems to presuppose that all
the receivers are accountants; statistics frequently can only be understood by the
highly numerate. A producer of data should have the users at the forefront of his or
her mind, and should also not assume that the receiver has a similar technical
background to him- or herself.
In the context that the requirements of the users of data are paramount, the aim
now is to show how data might be presented better. How is ‘better’ to be defined?
A manager meets data in just a few general situations:
(a) Business reports. The data are usually the supporting evidence for conclusions
or suggestions made verbally in the text.
(b) Management information systems. Large amounts of data are available on
screen or delivered to the manager at regular intervals, usually in the form of
computer printouts.
(c) Accounting data. Primarily, for a manager, these will indicate the major
financial features of an organisation. The financial analyst will have more detailed
requirements.
(d) Self-generated data. The manager may wish to analyse his own data: sales
figures, delivery performance, invoice payments, etc.
In all these situations speed is essential. A manager is unlikely to have the time to
carry out a detailed analysis of every set of data that crosses his or her desk. The
data should be communicated in such a way that its features are immediately
obvious. Moreover, the features should be the main ones rather than the points of
detail. These requirements suggest that the criterion that distinguishes well-
presented data should be: ‘The main patterns and exceptions in the data should be
immediately evident.’
The achievement of this objective is made easier since, in all the situations above,
the manager will normally be able to anticipate the patterns in the data. In the first
case above, the pattern will have been described in the text; in the other three, it is
unlikely that the manager will be dealing with raw data in a totally new set of
circumstances and he or she will therefore have some idea of what to expect. The
methods of improving data presentation are put forward with this criterion in mind.
In looking at this subject it must be stressed that it is not just communicating
data to others that is important. Communicating data to oneself is a step in coming
to understand them. The role of data communication in analysis is perhaps the most
valuable function of the ideas proposed here. Important results in the sciences,
medicine and economics are not usually discovered through the application of a
sophisticated computerised technique. They are more likely to be discovered
because someone has noticed an interesting regularity or irregularity in a small
amount of data. Such features are more likely to be noticed when the data are well
presented. Sophistication may be introduced later when one is trying to verify the
result rigorously, but this should not be confused with the original analysis. In short,
simple, often visual, methods of understanding numbers are highly important. The
role of data communication as part of the analysis of data will be explored in the
next module.
Original Rounded
(2 effective figures)
1382 1400
721 720
79.311 79
17.1 17
4.2 4.2
2.32 2.3
These numbers have been rounded to the first two figures. Contrast this to fixed
rounding, such as rounding always to the first decimal place. For example, if the
above numbers were rounded to the first decimal place, the result would be:
Original Rounded
(1st decimal place)
1382 1382.0
721 721.0
79.311 79.3
17.1 17.1
4.2 4.2
2.32 2.3
Rounding to the same number of decimal places may appear to be more con-
sistent, but it does not make the numbers easier to manipulate and communicate.
Rounding to two effective figures puts numbers in the form in which mental
arithmetic is naturally done. The numbers are therefore assimilated more quickly.
The situation is slightly different when a series of similar numbers, all of which
have, say, the first two figures in common, are being compared. The rounding
would then be to the first two figures that are effective in making the comparison
(i.e. the first two that differ from number to number). This is the meaning of
‘effective’ in ‘two effective figures’. For example, the column of numbers below
would be rounded as shown:
Original Rounded
1142 1140
1327 1330
1489 1490
1231 1230
1588 1590
The numbers have been rounded to the second and third figures because all the
numbers have the first figure, 1, in common. Rounding to the first two figures
would be over-rounding, making comparisons too approximate.
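A minimal sketch of this rounding in code; the function name and the use of
significant-figure rounding as the mechanism are choices made here for illustration:

```python
import math

def round_significant(x: float, figures: int = 2) -> float:
    """Round x to the given number of significant figures."""
    x = float(x)
    if x == 0:
        return 0.0
    magnitude = math.floor(math.log10(abs(x)))
    return round(x, figures - 1 - magnitude)

# The first table above: two effective figures.
print([round_significant(v) for v in [1382, 721, 79.311, 17.1, 4.2, 2.32]])
# -> [1400.0, 720.0, 79.0, 17.0, 4.2, 2.3]

# The second table: all numbers share the leading 1, so round to three
# significant figures, i.e. the first two figures that differ.
print([round_significant(v, 3) for v in [1142, 1327, 1489, 1231, 1588]])
# -> [1140.0, 1330.0, 1490.0, 1230.0, 1590.0]
```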
Many managers may be concerned that rounding leads to inaccuracy. It is true, of
course, that rounding does lose accuracy. The important questions are: ‘Would the
presence of extra digits affect the decision being taken?’ and ‘Just how accurate are
the data anyway – is the accuracy being lost spurious accuracy?’
Often, one finds that eight-figure data are being insisted upon in a situation
where the decision being taken rests only on the first two figures, and where the
method of data collection was such that only the first two figures can be relied upon
as being accurate. Monitoring financial budgets is a case in point. During a financial
year actual costs are continuously compared with planned. There is room for
approximation in the comparison. If the budget were £11 500, it is enough to know
that the actual costs are £10 700 (rounded to two effective figures). No different
decision would be taken had the actual costs been specified as £10 715. The
conclusion would be the same: actual costs are about 7 per cent below budget. Even
if greater accuracy were required it may not be possible to give it. At such an early
stage actual costs will almost certainly rest in some part on estimates and be subject
to error. To quote actual costs to the nearest £1 is misleading, suggesting a level of
accuracy that has not been achieved. In this situation high levels of accuracy are
therefore neither necessary nor obtainable, yet the people involved may insist on
issuing actual cost data to the nearest £1. Where there is argument about the level of
accuracy required, A. S. C. Ehrenberg (1975) suggests in his book that the numbers
should be rounded but a note at the bottom of the table should be provided to
indicate a source from which data specified to a greater precision can be obtained.
Then wait for the rush.
It is much easier to see the main features of the data from Table 3.2. East now
stands out as a clear exception, its profit being out of line with the other divisions.
The rounding also facilitates the calculation of accounting ratios. The profit margin
is about 14 per cent for all divisions except East, where it is over 20 per cent. While
it is possible to see these features in Table 3.1, they are not so immediately apparent
as in the amended table. When managers have many such tables crossing their
desks, it is essential that attention should be attracted quickly to important infor-
mation that may require action.
Frequently, tables are ordered alphabetically. This is helpful in a long reference
table that is unfamiliar to the user, but not so helpful when management infor-
mation is involved. Indeed, it may be a hindrance. In management information it is
the overall pattern, not the individual entries, that is of interest. Alphabetical order is
more likely to obscure than highlight the pattern. In addition, managers are usually
not totally unfamiliar with the data they receive. For instance, anyone looking at
some product’s sales figures by state in the USA would probably be aware that, for a
table in population order, California would be close to the top and Alaska close to
the bottom. In other words, the loss from not using alphabetical order is small
whereas the gain in data communication is large.
Mental arithmetic is easier when numbers are stacked vertically. For example, the
subtraction
57
−23
34
is carried out more quickly in this form than when written along a row.
In a table, then, the more important comparison should be presented down col-
umns, not along rows. Taking the ‘Capital employed’ data from Table 3.2, it could
be presented horizontally, as in Table 3.3, or vertically, as in Table 3.4. In Table 3.3
the differences between adjacent figures are 390, 140 and 190 respectively. When the
data are in a column, as in Table 3.4, such calculations are made much more quickly.
In many tables, however, comparisons across rows and down columns are equal-
ly important and no interchange is possible.
The summary measure is usually an average since it is important that the sum-
mary should be of the same size order as the rest of the numbers. A column total,
for example, is not a good summary, being of a different order of magnitude from
the rest of the column and therefore not a suitable basis for comparison. The
summary measure can also be the basis for ordering the rows and columns (see
Section 3.2.2).
type of number from another (e.g. to separate a summary row from the rest of the
numbers). Table 3.6 shows the data of Table 3.2 but with white space and gridlines
introduced. Table 3.7 is a repeat of Table 3.6 but with an acceptable use of space
and gridlines.
Table 3.6 Data of Table 3.2 with white space and gridlines
Division    Capital employed    Turnover    Profit
South       1870                730         96
West        1480                560         82
North       1340                530         78
East        1150                430         89
The patterns and exceptions in these data are much more clearly evident once the
white space and gridlines have been removed. The purpose of many tables is to
compare numbers. White space and gridlines have the opposite effect. They
separate numbers and make the comparison more difficult.
3.2.6 Rule 6: Labelling Should Be Clear but Unobtrusive
Care should be taken when labelling data; otherwise the labels may confuse and
detract from the numbers. This seems an obvious point, yet in practice two labelling
faults are regularly seen in tables. First, the constructor of a table may use abbreviat-
ed or obscure labels, having been working on the project for some time and falsely
assuming that the reader has the same familiarity with the numbers and their
definitions. Second, gaps may be introduced in a column of numbers merely to
accommodate extra-long labels. Labels should be clear and not interfere with the
understanding of the numbers.
Table 3.8 is an extract from a table of historical data relating to United Kingdom
utilities prior to their privatisation in the 1980s and 1990s. The extract shows ‘Gross
income as % of net assets’ for a selection of these organisations. First, the length of
the label relating to Electricity results in gaps in the column of numbers; second,
this same label includes abbreviations, the meaning of which may not be apparent to
the uninitiated. In Table 3.9 the labels have been shortened and unclear terms
eliminated. If necessary, a footnote or appendix could provide an exact definition of
the organisations concerned.
In Table 3.9 the numbers and labels are clearer. Watertight and lengthy defini-
tions of the organisations do not belong within the table. The purpose of this rule is
to assert the primary importance of tables in communicating numbers. As far as
possible, the labels should give unambiguous definitions of the numbers but should
not obscure the information contained in the numbers.
Table 3.10 GDP of nine EC countries plus Japan and the USA (€ thousand million)

                   1965     1975     1987     1990     1995     1997
United Kingdom     99.3    172.5    598.8    766.4    846.3   1133.3
Belgium            16.6     46.2    123.5    154.4    209.0    213.7
Denmark            10.1     26.9     88.8     99.6    129.4    140.2
France             96.8    253.3    770.2    940.8   1169.1   1224.4
Germany*          114.3    319.9    960.9   1182.2   1846.4   1853.9
Ireland             2.7      5.9     27.2     35.9     49.4     65.1
Italy              53.4    130.2    657.4    861.2    832.0   1011.1
Luxembourg          0.7      1.7      6.0      8.1     13.2     13.9
Netherlands        18.7     61.2    188.9    222.3    301.9    315.6
Japan              83.0    372.8   2099.4   2341.5   3917.9   3712.1
USA               690.0   1149.7   3922.3   4361.5   5374.3   6848.2
* Germany includes the former GDR in 1995 and 1997.
(f) Rule 6. The labelling is already clear. No changes have been made.
(g) Rule 7. It would be very difficult to make a simple verbal summary of these
data. Moreover, in the context, the publishers would probably not wish to be
appearing to lead the reader’s thinking by suggesting what the patterns were.
The typical questions that might be asked of these data can now be applied to
Table 3.11. It is possible to see quickly that Germany’s GDP increased by 1900/110
= just over 17 times; Italy’s by 1000/53 = just under 19 times; Japan’s by just under
45 times; the UK’s by 11; Japan has overtaken Germany, France and the UK;
Ireland is over four times the size of Luxembourg economically. The information is
more readily apparent from Table 3.11 than from Table 3.10.
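The same multiples can be checked against the unrounded figures of Table 3.10. A minimal Python sketch (the selection of countries is arbitrary):

# Growth multiples, 1965-97, from the unrounded figures of Table 3.10.
gdp = {  # country: (1965, 1997), EUR thousand million
    'United Kingdom': (99.3, 1133.3),
    'Germany': (114.3, 1853.9),
    'Italy': (53.4, 1011.1),
    'Japan': (83.0, 3712.1),
}
for country, (y1965, y1997) in gdp.items():
    print(f'{country}: grew about {y1997 / y1965:.0f} times')

The unrounded ratios (about 16 for Germany, 19 for Italy, 45 for Japan, 11 for the UK) differ little from the mental estimates made from the rounded table, which is exactly the point of the two-figure rule.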
                                          2015             2016
                                             £             £000
Operating Expenses:
  Crew Wages and Social Security     7 685 965          (9 010)
  Other Crew Expenses                  541 014            (633)
  Insurance Premiums                 1 161 943          (1 367)
  Provisions and Stores              1 693 916          (2 268)
  Repairs and Maintenance            1 685 711          (3 297)
  Other Operating Expenses              60 835             (27)
                                   (12 829 384)
                                    (8 131 706)
Net Profit/(Loss) Currency Exch.      (190 836)           (680)
Dividends                               35 732               47
                                    (8 286 810)
                                     6 967 179
These features of the company’s finances are remarkable. Even when one knows
what they are, it is very difficult to see them in the original table (Table 3.12). Yet it
is this volatility that is of major interest to shareholders and managers.
The question of rounding creates special difficulties with accounting data. The
reason is that rounding and exact adding up are not always consistent. It has to be
decided which is the more important – the better communication of the data or the
need to allow readers to check the arithmetic. The balance of argument must weigh
in favour of the rounding. Checking totals is a trivial matter in published accounts
(although not, of course, in the process of auditing). If a mistake were found in the
published accounts of such a large company, the fault would almost certainly lie
with a printer’s error. But ‘total checking’ is an obsessive pastime and few compa-
nies would risk the barrage of correspondence that would undoubtedly ensue even
though a note to the accounts explained that rounding was the cause. Because of
this factor the two effective figures rule may have to be broken so that adding and
subtracting are exact. This has been done in Table 3.13. The only remaining
question is why totals are exactly right in company accounts that have in any case
usually been rounded to some extent (to the nearest £ million for many companies).
The answer is that the figures have been ‘fudged’ to make it so. The same
considerations apply, but to a lesser degree, with internal company financial infor-
mation.
Communicating financial data is an especially challenging area. The guiding prin-
ciple is that the main features should be evident to the users of the data. It should
not be necessary to be an expert in the field nor to have to carry out a complex
analysis in order to appreciate the prime events in a company’s financial year. Some
organisations are recognising these problems by publishing two sets of (entirely
consistent) final accounts. One is source material, covering legal requirements and
suitable for financial experts; the other is a communicating document, fulfilling the
purpose of accounts (i.e. providing essential financial information). Other organisa-
tions may, of course, have reasons for wanting to obscure the main features of their
financial year.
[Figure: Rate (%) by year, 1986–96. Countries shown: Italy, USA, Canada, France, Netherlands, Denmark, Belgium, Germany, Japan.]
[Figure: Monthly coal imports, January 2011 to January 2015. Key: Denmark, Spain, Italy, Netherlands, Belgium.]
        2012                                   2014
Jan.    570   450   230   420   380    Jan.    930   470   470   300   360
Feb.    800   550   300   280   310    Feb.    780   510   530   390   420
Mar.   1100   770   330   400   430    Mar.    900   590   440   260   440
Apr.    910   690   540   270   380    Apr.    780   530   440   490   440
May     970   690   290   390   420    May    1100   510   420   400   440
June    910   660   350   520   240    June   1000   650   440   510   350
July    900   690   370   430   300    July   1100   550   390   350   400
Aug.    900   580   330   240   300    Aug.    870   580   570   350   360
Sept.   750   520   480   430   340    Sept.  1100   610   460   380   360
Oct.   1400   640   410   380   430    Oct.   1100   750   730   360   530
Nov.   1100   590   360   560   430    Nov.    950   660   750   530   410
Dec.   1000   450   340   570   430    Dec.   1200   600   650   500   400
If the general pattern over the years or a comparison between countries is re-
quired, Table 3.15 is suitable. This shows the average monthly imports of coal for
each year. It can now be seen that four of the countries have increased their imports
by 30–45 per cent. Italy is the exception, having decreased coal imports
by 20 per cent. The level of imports in the countries can be compared.
In these terms, Italy is the largest, followed by Belgium, followed by the other
three at approximately the same level.
Table 3.15 can be transferred to a graph, as shown in Figure 3.4. General patterns
are evident. Italy has decreased its imports, the others have increased theirs; the level
of imports is in the order Italy, Belgium … The difference between the table and the
graph becomes clear when magnitudes have to be estimated. The percentage change
(−20 per cent for Italy, etc.) is readily calculated from the table, but not from the
graph. In general, graphs show the sign of changes but a table is needed to make an
estimate of the size of the changes. The purpose of the data and personal prefer-
ence would dictate which of the two were used.
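As a minimal Python sketch of the table-based calculation, take the first column of the monthly figures above (which column belongs to which country is an assumption this extract does not settle):

# Percentage change in average monthly imports between the two years shown.
imports_2012 = [570, 800, 1100, 910, 970, 910, 900, 900, 750, 1400, 1100, 1000]
imports_2014 = [930, 780, 900, 780, 1100, 1000, 1100, 870, 1100, 1100, 950, 1200]

avg_2012 = sum(imports_2012) / 12
avg_2014 = sum(imports_2014) / 12
change = 100 * (avg_2014 - avg_2012) / avg_2012
print(f'{avg_2012:.0f} -> {avg_2014:.0f} per month ({change:+.0f}%)')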
[Figure 3.4: Average monthly coal imports per year for Italy, Belgium, Netherlands, Denmark and Spain.]
Pictures are useful for attracting attention and for showing very general patterns.
They are not useful for showing complex patterns or for extracting actual numbers.
Learning Summary
The communication of data is an area that has been neglected, presumably because
it is technically simple and there is a tendency in quantitative areas (and perhaps
elsewhere) to believe that only the complex can be useful. Yet in modern organisa-
tions there can be few things more in need of improvement than data
communication.
Although the area is technically simple, it does involve immense difficulties. What
exactly is the readership for a set of data? What is the purpose of the data? How can
the common insistence on data specified to a level of accuracy that is not needed by
the decision maker and is not merited by the collection methods be overcome? How
much accounting convention should be retained in communicating financial
information to the layperson? What should be done about the aspects of data
presentation that are a matter of taste? The guiding principle among the problems is
that the data should be communicated according to the needs of the receiver rather
than the producer. Furthermore, they should be communicated so that the main
features can be seen quickly. The seven rules of data presentation described in this
module seek to accomplish this.
Rule 1: round to two effective digits.
Rule 2: reorder the numbers.
Rule 3: interchange rows and columns.
Rule 4: use summary measures.
Rule 5: minimise use of space and lines.
Rule 6: clarify labelling.
Rule 7: use a verbal summary.
Producers of data are accustomed to presenting them in their own style. As al-
ways there will be resistance to changing an attitude and presenting data in a
different way. The idea of rounding especially is usually not accepted instantly.
Surprisingly, however, while objections are raised against rounding, graphs tend to
be universally acclaimed, even when not appropriate. Yet the graphing of data is the
grossest form of rounding. There is evidently a need for clear and consistent
thinking in regard to data communication.
This issue has been of increasing importance because of the growth in usage of
all types and sizes of computers and the development of large-scale management
information systems. The benefits of this technological revolution should be
enormous but the potential has yet to be realised. The quantities of data that
circulate in many organisations are vast. It is supposed that the data provide
information which in turn leads to better decision making. Sadly, this is frequently
not the case. The data circulate, not providing enlightenment, but causing at best
indifference and at worst tidal waves of confusion. Poor data communication is a
prime cause of this. It could be improved. Otherwise, one must question the
wisdom of the large expenditures many organisations make in providing untouched
and bewildering management data. One thing is clear: if information can be assimi-
lated quickly, it will be used; if not, it will be ignored.
Review Questions
3.1 In communicating management data, which of the following principles should be adhered
to?
A. The requirements of the user of the data are paramount.
B. Patterns in the data should be immediately evident.
C. The data should be specified to two decimal places.
D. The data should be analysed before being presented.
3.2 The specification of the data (the number of decimal places) indicates the accuracy. True
or false?
3.3 The accuracy required of data should be judged in the context of the decisions that are
to be based upon the data. True or false?
A. 1732   1256.3   988.42   38.1
B. 1730   1260     990      38
C. 1700   1300     990      38
3.6 Which are correct reasons? It is easier to compare numbers in a column than in a row
because:
A. The difference between two- and three-figure numbers is quickly seen.
B. Subtractions of one number from another are made more quickly.
C. The numbers are likely to be closer together and thus easier to analyse quickly.
3.7 When the rows (each referring to the division of a large company) of a table of numbers
are ordered by size, the basis for the ordering should be:
A. The numbers in the left-hand column.
B. The averages of the rows.
C. The capital employed in the division.
D. The level of manpower employed in the division.
3.9 Only some of the presentation rules can be applied to financial accounts. This is
because:
A. Rounding cannot be done because the reader may want to check that the
auditing has been correct.
B. Rounding cannot be done because it is illegal.
C. An income statement cannot be ordered by size since it has to build up to a
final profit.
D. Published annual accounts are for accountants; therefore their presentation is
dictated by accounting convention.
1 This table is one of many that elected representatives have to consider at their monthly
meetings. The representatives need, therefore, to be able to appreciate and understand
the main features very quickly. In these circumstances, how could the data be presented
better? Redraft the table to illustrate the changes.
1 Compared to many accounting statements Table 3.17 is already well presented, but
what further improvements might be made?
1 Table 3.18 gives the results of this sensitivity analysis. It shows the extent to which the
assumptions have been varied and the new IRR for each variation. The ‘base rate’ is the
IRR for the original calculation. How could it be better presented? (Note that it is not
necessary to understand the situation fully in order to propose improvements to the
data communication.)
References
Ehrenberg, A. S. C. (1975). Data Reduction. New York: John Wiley and Sons.
Data Analysis
Contents
4.1 Introduction.............................................................................................4/1
4.2 Management Problems in Data Analysis .............................................4/2
4.3 Guidelines for Data Analysis ..................................................................4/6
Learning Summary ......................................................................................... 4/15
Review Questions ........................................................................................... 4/16
Case Study 4.1: Motoring Correspondent ................................................... 4/17
Case Study 4.2: Geographical Accounts ...................................................... 4/18
Case Study 4.3: Wages Project ..................................................................... 4/19
Learning Objectives
By the end of this module the reader should know how to analyse data systematical-
ly. The methodology suggested is simple, relying very much on visual interpretation,
but it is suitable for most data analysis problems in management. It carries implica-
tions for the ways information is produced and used.
4.1 Introduction
What constitutes successful data analysis? There is apparently some uncertainty on
this point. If a group of managers are given a table of numbers and asked to analyse
it, most probably they will ‘number pick’. Individual numbers from somewhere in
the middle of the table which look interesting or which support a long-held view
will be selected for discussion. If the data are profit figures, remarks will be made
such as: ‘I see Western region made £220 000 last year. I always said that the new
cost control system would work.’ A quotation from Andrew Lang, a Scottish poet,
could be applied to quite a few managers: ‘He uses statistics as a drunken man uses
lamp posts – for support rather than illumination.’
Real data analysis is concerned with seeking illumination, not support, from a set
of numbers. Analysis is defined as ‘finding the essence’. A successful data analysis
must therefore involve deriving the fundamental patterns and eliciting the real
information contained in the entire table. This must happen before sensible remarks
can be made about individual numbers. To know whether the cost system in the
above example really did work requires the £220 000 to be put in the context of
profit and cost patterns in all regions.
The purpose of this module is to give some guidelines showing how illumination
might be derived from numbers. The guidelines give five steps to follow in order to
find what real information, if any, a set of numbers contains. They are intended to
provide a framework to help a manager understand the numbers he or she encoun-
ters.
One might have thought that understanding numbers is what the whole subject
of statistics is about, and so it is. But statistics was not developed for use in man-
agement. It was developed in other fields such as the natural sciences. When it is
transferred to management, there is a gap between what is needed and what
statistics can offer. Certainly, many managers, having attended courses or read
books on statistics, feel that something is missing and that the root of their problem
has not been tackled. This and other difficulties involved in the analysis of manage-
ment data will be pursued in the following section, before some examples of the
types of data managers face are examined. Next, the guidelines, which are intended
to help fill the statistics gap, will be described and illustrated. Finally, the implica-
tions of this gap for the producers of statistics will be discussed.
(b) Analysing data is rather like reading. When looking at a business report, a manager will usually
read it carefully, work out exactly what the author is trying to say and then decide
whether it is correct. The process is similar with a table of numbers. The data
have to be sifted, thought about and weighed. To do this, good presentation (as
stressed in Module 3 in the rules for data presentation) may be more important
than sophisticated techniques. Most managers could do excellent data analyses
provided they had the confidence to treat numbers more like words. It is only
because most people are less familiar with numbers than words that the analysis
process needs to be made more explicit (via guidelines such as those in Section
4.3 below) in the case of numbers.
(c) Over-complication by the experts. The attitude of numbers experts (and other
sorts of experts as well) can confuse managers. The experts use jargon, which is
fine when talking to their peers but not when talking to a layperson; they try
sophisticated methods of analysis before simple ones; they communicate results
in a complicated form, paying little regard to the users of the data. For example,
vast and indigestible tables of numbers, all to five decimal places, are often the
output of a management information system. The result can be that the experts
distance themselves from management problems. In some companies specialist
numbers departments have adopted something akin to a research and develop-
ment role, undertaking solely long-term projects. Managers come to believe that
they have not the skills to help themselves while at the same time believing that
no realistic help is available from experts.
Accounting Data
In Module 3 Table 3.12 showed the income statement of a multinational shipping
company. It is difficult to analyse (i.e. it is difficult to say what the significant
features of the company’s business were). Some important happenings are obscured
in Table 3.12, but they were revealed when the table was re-presented in Table 3.13.
                              MONTH                                      CUMULATIVE
TERMINAL COSTS                ESTIMATE  STANDARD  VARIANCE  VAR %        ESTIMATE  STANDARD  VARIANCE  VAR %     BUDGET
LO-LO
STEVEDORING
STRAIGHT TIME - FULL 131 223 143 611 12 388 8.6 1 237 132 1 361 266 124 134 9.1 1 564 896
STRAIGHT TIME - M.T. 13 387 14 651 1 264 8.6 256 991 281 399 24 408 8.7
(UN)LASHING 78 (78) 78 (78)
SHIFTING 801 (801) 11 594 (11 594)
OVERTIME, SHIFT TIME OF
WAITING & DEAD TIME 7 102 (7 102) 190 620 (190 620)
RO-RO
STEVEDORING
TRAILERS
STRAIGHT FULL 20 354 26 136 5 782 22.1 167 159 215 161 48 002 22.3 330 074
STRAIGHT M.T. 178 228 50 21.9 14 846 18 993 4 147 21.8
RO-RO COST PLUS
VOLVO CARGO
ROLLING VEHICLES 14 326 19 515 5 189 26.6 98 210 157 163 58 951 37.5
BLOCKSTOWED 29 27 (2) (7.4) 613 674 61 9.1
(UN) LASHING RO-RO 355 (355) 355 (355)
SHIFTING 977 (977) 3 790 (3 790)
OVERTIME, SHIFT TIME OF
WAITING & DEAD TIME 1 417 (1 417) 28 713 (28 713)
HEAVY LIFTS (OFF STANDARD) 2 009 (2 009)
CARS
STEVEDORING
STRAIGHT TIME 6 127 6 403 276 4.3 38 530 35 328 (3 202) (9.1) 168 000
(UN) LASHING 2 (2)
SHIFTING 795 (795) 1 288 (1 288)
OVERTIME, SHIFT TIME OF
WAITING & DEAD TIME 7 573 (7 573)
OTHER SHIPSIDE COSTS 3 422 (3 422) 24 473 (24 473)
TOTAL TERMINAL COSTS 200 571 210 571 10 000 4.7 2 083 976 2 069 984 (13 992) (.7) 2 062 970
Market Research
Table 4.2 indicates what can happen when experts over-complicate an analysis. The
original data came from interviews of 700 television viewers who were asked which
British television programmes they really like to watch. The table is the result of the
analysis of this relatively straightforward data. It is impossible to see what the real
information is, even if one knows what correlation means. However, a later and
simpler analysis of the original data revealed a result of wide-ranging importance in
the field of television research. (See Ehrenberg, 1975, for further comment on this
example.)
In all three examples any message in the data is obscured. They were produced by
accountants, computer scientists and statisticians respectively. What managers
would have the confidence to fly in the face of experts and produce their own
analysis? Even if they had the confidence, how could they attempt an analysis? The
guidelines described below indicate, at a general level, how data might be analysed.
They provide a starting point for data analysis.
The re-presentation being recommended does not refer just to data that are liter-
ally a random jumble. On the contrary, the assumption is that the data have already
been assembled in a neat table. Neatness is preferable to messiness but the patterns
may still be obscured. When confronted by apparent orderliness one should take
steps to re-present the table in a fashion which makes it easy to see any patterns
contained in it. The ways in which data can be rearranged were explored in detail in
the previous module. Recall that the seven steps were:
(a) Round the numbers to two effective figures.
(b) Put rows and columns in size order.
(c) Interchange rows and columns where necessary.
(d) Use summary measures.
(e) Minimise use of gridlines and white space.
(f) Make the labelling clear and do not allow it to hinder comprehension of the
numbers.
(g) Use a verbal summary.
No model can encapsulate every nuance of reality, but models have long been able to
summarise and predict reality to a generally acceptable level of approximation. In
management the objective is usually no more than this.
Only if the simple approach fails are complex methods necessary, and then ex-
pert knowledge may be required. As a last resort, even if the numbers are random
(random means they have no particular pattern or order), this is a model of a sort
and can be useful. For example, the fact that the day-to-day movements in the
returns from quoted shares are random is an important part of modern financial
theory.
The final stage is to set the results of the analysis against other results available for
comparison. The other results may be from another year, from another company,
be made to a wider set of information. In consequence, questions may be prompted:
Why is the sales mix different this year from the previous five? Why do other
companies have less brand switching for their products? Why is productivity higher
in the west of Germany? Making comparisons such as these provides a context in
which to evaluate results and also suggests the consistencies or anomalies which
may in turn lead to appropriate management action.
If the results coincide with others, then this further establishes the model and
may mean that in future fewer data may need to be collected – only enough to see
whether the already established model still holds. This is especially true of manage-
ment information systems where managers receive regular printouts of sets of
numbers and they are looking for changes from what has gone before. It is more
efficient for a manager to carry an established model from one time period to the
next rather than the raw data.
Example: Consumption of Distilled Spirits in the USA
As an example of an analysis of numbers that a manager might have to carry out,
consider Table 4.3 showing the consumption of distilled spirits in different states of the
USA. The objective of the analysis would be to measure the variation in consumption
across the states and to detect any areas where there were distinct differences. How
can the table be analysed and what information can be gleaned from it? The five stages
of the guidelines are followed.
Stage 1: Reduce the data. Many of the data are redundant. Are percentage fig-
ures really necessary when per capita figures are given? It is certainly possible, with
some imaginative effort, to conceive of uses of percentage data, but they are not
central to the purposes of the table. It can be reduced to a fraction of its original
size without any loss of real information.
Stage 2: Re-present. To understand the table more quickly, the numbers can be
rounded to two effective figures. The original table has numbers, in places, to eight
figures. No analyst could possibly make use of this level of specification. What con-
clusion would be affected if an eighth figure were, say, a 7 instead of a 4? In any
event, the data are not accurate to eight figures. If the table were a record docu-
ment (which it is not) then more than two figures may be required, but not eight.
Putting the states in order of decreasing population is more helpful than alphabetical
order. Alphabetical order is useful for finding names in a long list, but it adds nothing
to the analysis process. The new order means that states are just as easy to find.
Most people will know that California has a large population and Alaska a small one,
especially since no one using the table will be totally ignorant of the demographic
attributes of the USA. At the same time, the new order makes it easy to spot states
whose consumption is out of line with their population.
The end result of these changes, together with some of a more cosmetic nature, is
Table 4.4. Contrast this table with the original, Table 4.3.
Alaska 46 47 1 391 172 1 359 422 2.3 0.33 0.32 3.64 3.86
Arizona 29 30 4 401 883 4 144 521 6.2 1.03 0.98 1.94 1.86
Arkansas 38 38 2 534 826 2 366 429 7.1 0.60 0.56 1.20 1.12
California 1 1 52 529 142 52 054 429 0.9 12.33 12.32 2.44 2.46
Colorado 22 22 6 380 783 6 310 566 1.1 1.50 1.49 2.47 2.49
Connecticut 18 18 7 194 684 7 271 320 (−1.1) 1.69 1.72 2.31 2.35
Delaware 45 43 1 491 652 1 531 688 (−2.6) 0.35 0.36 2.56 2.65
Dist. of Columbia 27 27 4 591 448 4 828 422 (−4.9) 1.08 1.14 6.54 6.74
Florida 4 4 22 709 209 22 239 555 1.7 5.33 5.28 2.70 2.67
Georgia 13 13 10 717 681 9 944 846 7.8 2.52 2.35 2.16 2.02
Hawaii 41 40 2 023 730 1 970 089 2.7 0.48 0.47 2.28 2.28
Illinois 3 3 26 111 587 26 825 876 (−2.7) 6.13 6.35 2.33 2.41
Indiana 19 20 7 110 382 7 005 511 1.5 1.67 1.66 1.34 1.32
Kansas 35 35 2 913 422 2 935 121 (−0.7) 0.68 0.70 1.26 1.29
Kentucky 26 26 4 857 094 5 006 481 (−3.0) 1.14 1.19 1.42 1.47
Louisiana 21 21 7 073 283 6 699 853 5.6 1.66 1.59 1.84 1.77
Maryland 12 12 10 833 966 10 738 731 0.9 2.54 2.54 2.61 2.62
Massachusetts 10 10 13 950 268 14 272 695 (−2.3) 3.28 3.38 2.40 2.45
Minnesota 15 15 8 528 284 8 425 567 1.2 2.00 1.99 2.15 2.15
Missouri 20 17 7 074 614 7 697 871 (−7.9) 1.66 1.82 1.48 1.61
Nebraska 36 36 2 733 497 2 717 859 0.6 0.64 0.64 1.76 1.76
Nevada 30 31 4 360 172 4 095 910 6.5 1.02 0.97 7.15 6.92
New Jersey 8 8 15 901 587 16 154 975 (−1.6) 3.73 3.82 2.17 2.21
New Mexico 42 41 1 980 372 1 954 139 1.3 0.47 0.46 1.70 1.70
New York 2 2 41 070 005 41 740 341 (−1.6) 9.64 9.88 2.27 2.30
North Dakota 47 46 1 388 475 1 384 311 0.3 0.33 0.33 2.16 2.16
Oklahoma 33 29 3 904 574 4 187 527 (−6.8) 0.92 0.99 1.41 1.54
Rhode Island 39 39 2 073 075 2 131 329 (−2.7) 0.49 0.50 2.24 2.30
South Carolina 23 25 5 934 427 5 301 054 11.9 1.39 1.26 2.08 1.88
South Dakota 48 48 1 312 160 1 242 021 5.6 0.31 0.29 1.91 1.82
Tennessee 24 24 5 618 774 5 357 160 4.9 1.32 1.27 1.33 1.28
Texas 5 6 17 990 532 17 167 560 4.8 4.22 4.06 1.44 1.40
Wisconsin 11 11 10 896 455 10 739 261 1.5 2.56 2.54 2.36 2.33
Total licence states 319 583 215 317 874 435 0.5 75.04 75.22 2.13 2.13
Stage 3: Build a model. The pattern is evident from the transformed table. Con-
sumption varies with the population of the state. Per capita consumption in each
state is about equal to the figure for all licence states with some variation (±30 per
cent) about this level. The pattern a year earlier was the same except that overall
consumption increased slightly (1 per cent) between the two years. Refer back to
Table 4.3 and see if this pattern is evident even when it is known to be there. There
may of course be other patterns but this one is central to the objectives of the anal-
ysis.
Stage 4: Exceptions. The overall pattern of approximately equal per capita con-
sumption in each state allows the exceptions to be seen. From Table 4.4, three
states stand out as having a large deviation from the pattern. The states are District
of Columbia, Nevada and Alaska. These states were exceptions to the pattern in the
earlier year as well. Explanations in the cases of District of Columbia and Nevada are
readily found, probably being to do with the large non-resident populations. People
live, and drink, in these states who are not included in the population figures (diplo-
mats in DC, tourists in Nevada). An explanation for Alaska may be to do with the
lack of leisure opportunities. Whatever the explanations, the analytical method has
done its job. The patterns and exceptions in the data have been found. Explanations
are the responsibility of experts in the marketing of distilled spirits in the USA.
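Once the pattern has been chosen, Stage 4 can even be automated. A minimal Python sketch, assuming roughly constant per capita consumption across states and using an (arbitrary) factor-of-1.5 band as the definition of ‘out of line’:

# Flag states whose per capita consumption deviates markedly from the
# overall level (figures from the table above; the 1.5x band is an
# illustrative choice, not from the text).
overall = 2.13  # all licence states
per_capita = {'California': 2.44, 'New York': 2.27, 'Florida': 2.70,
              'Texas': 1.44, 'Dist. of Columbia': 6.54, 'Nevada': 7.15,
              'Alaska': 3.64}
for state, value in per_capita.items():
    if value > overall * 1.5 or value < overall / 1.5:
        print(f'{state}: {value} per capita (overall level {overall})')

Run on this selection, the sketch flags exactly the three states named above: District of Columbia, Nevada and Alaska.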
Stage 5: Comparison. A comparison between the two years is provided by the
table. Other comparisons will be relevant to the task of gaining an understanding of
the USA spirits market. The following data would be useful:
(i) earlier years, say, five and ten years before;
(ii) a breakdown of aggregate data into whisky, gin, vodka, etc.;
(iii) other alcoholic beverages: wine, beer, etc.
Once data from these other sources have been collected they would be analysed in
the manner described, but of course the process would be shorter because the
pattern can be anticipated. Care would need to be taken that like was being com-
pared with like. For example, it would have to be checked that an equivalent
definition of consumption was in force ten years earlier.
The second implication is more direct. Data should be presented in forms which
enable them to be analysed speedily and accurately. Much of the reduction and re-
presentation stages of the guidelines could, in most instances, be carried out just as
well by the producer of the data as by the user. It would then need to be done only
once rather than many times by the many users of the data. Unfortunately, when
time is spent thinking about the presentation of statistics, it is usually spent in
making the tables look neat or attractive rather than making them amenable to
analysis.
There is much that the producers of data can do by themselves. For example,
refer back to the extract from a management information system shown in Ta-
ble 4.1: if thought were given to the analysis of the data through the application of
the guidelines, a different presentation would result (see Table 4.5).
(a) Some data might be eliminated. For instance, full details on minor categories of
expenditure may not be necessary. This step has not been taken in Table 4.5
since full consultation with the receivers would be necessary.
(b) The table should be re-presented using the rules of data presentation. In
particular, some rounding is helpful. This is an information document, not an
auditing one, and thus rounding is appropriate. In any case, no different conclu-
sions would be drawn if any of the expenditures were changed by one unit. In
addition, improvement is brought about by use of summary measures and a
clearer distinction between such measures and the detail of the table.
(c) A model derived from previous time periods would indicate when changes were
taking place. There is a good case for including a model or summary of previous
time periods with all MIS data. This has not been done for Table 4.5 since previ-
ous data were not available.
(d) Exceptions can be clearly marked. It is, after all, a prime purpose of budget data
to indicate where there have been deviations from plan. This can be an automatic
process. For example, all variances greater than 10 per cent could be marked (a
sketch of this follows the list below). This might even obviate the need for variance
figures to be shown.
(e) The making of comparisons is probably not the role of the data producer in this
example, involving as it does the judgement of the receivers in knowing what the
relevant comparisons are. The task of the producer has been to facilitate these
comparisons.
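A minimal sketch of the automatic marking suggested in item (d), in Python, using three lines from Table 4.5:

# Flag any line whose variance exceeds 10 per cent of standard cost.
lines = [('LO-LO stevedoring (full)', 131_000, 144_000),
         ('RO-RO stevedoring (full)', 20_400, 26_100),
         ('CARS stevedoring', 6_100, 6_400)]  # (category, estimate, standard)
for name, estimate, standard in lines:
    variance_pct = 100 * (standard - estimate) / standard
    flag = '  <-- exception' if abs(variance_pct) > 10 else ''
    print(f'{name}: {variance_pct:+.0f}%{flag}')

Only the RO-RO line (about +22 per cent) is marked; the other two fall inside the band.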
Making the suggested changes does of course have a cost attached in terms of
management time. However, the cost is a small fraction of the cost of setting up and
operating the information system. The changes can transform the system and make
it fully operational. If an existing system is being largely ignored by managers, there
may be no alternative.
Table 4.5 Budgeting data from an MIS (amended from Table 4.1)
Port: Liverpool OCEAN PORT TERMINAL COSTS – SHIPSIDE OPERATIONS
Period: December (US dollars; conversion rate 1.60)

                                      MONTH                                     CUMULATIVE
                                      ESTIMATE  STANDARD  VARIANCE  VAR %       ESTIMATE  STANDARD  VARIANCE  VAR %    BUDGET
LO-LO: Stevedoring (STR-FULL) 131 000 144 000 12 000 9 1 240 000 1 360 000 120 000 9 1 560 000
Stevedoring (STR-MT) 13 400 14 700 1 300 9 257 000 281 000 24 000 9
Unlashing 78 0 −78 * 78 0 −78 *
Shifting 200 0 −200 * 12 000 0 −12 000 *
Overtime, etc. 7 100 0 −7 100 * 191 000 0 −191 000 *
RO-RO: Stevedoring TR (STR-FULL) 20 400 26 100 5 800 22 167 000 215 000 48 000 22 330 000
Stevedoring TR (STR-MT) 180 230 50 22 15 000 19 000 4 100 22
Stevedoring cost plus 0 0 0 0 0 0 0 0
Stevedoring Volvo 0 0 0 0 0 0 0
Stevedoring rolling 14 300 19 500 5 200 27 98 000 157 000 59 000 37
Stevedoring blockstow 29 27 −2 −7 610 670 60 9
Unlashing 350 0 −350 * 350 0 −350 *
Shifting 980 0 −980 * 3 800 0 −3 800 *
Overtime, etc. 1 400 0 −1 400 * 29 000 0 −29 000 *
Heavy lifts 0 0 0 0 2 000 0 −2 000 *
CARS: Stevedoring (STR) 6 100 6 400 280 4 38 000 35 000 −3 000 −9 170 000
Unlashing 0 0 0 0 2 0 −2 *
Shifting 800 0 −800 * 1 300 0 −1 300 *
Overtime, etc. 0 0 0 0 7 600 0 −7 600 *
TOTALS: LO-LO 152 000 158 000 5 900 4 1 700 000 1 640 000 −59 000 4
RO-RO 38 000 46 000 8 300 18 315 000 392 000 77 000 20
CARS 6 900 6 400 −520 −8 47 000 35 000 −12 000 −34
OTHER 3 400 0 −3 400 * 24 000 0 −24 000 *
GRAND TOTAL 201 000 211 000 10 000 5 2 080 000 2 070 000 −14 000 −1 2 060 000
Totals may not agree because of rounding.
*Zero standard cost, therefore variance not calculable.
Few managers would deny that there is currently a problem with the provision
and analysis of data, but they rarely say so to their IT systems manager or whoever
sends them data. Without feedback, inexpensive yet effective changes are never
made. It must be the responsibility of users to criticise constructively the form and
content of the data they receive. The idea that computer scientists/statisticians
always know best, and if they bother to provide data then they must be useful, is
false. The users must make clear their requirements, and even resort to a little
persistence if alterations are not forthcoming.
Learning Summary
Every manager sees the problem of handling numbers differently because each sees
it mainly in the (probably) narrow context with which he or she is familiar in his or
her own work. One manager sees numbers only in the financial area; another sees
them only in production management. The guidelines suggested here are intended
to be generally applicable to the analysis of business data in many different situa-
tions and with a range of different requirements. The key points are:
(a) In most situations managers without statistical backgrounds can carry out
satisfactory analyses themselves.
(b) Simple methods are preferable to complex ones.
(c) Visual inspection of well-arranged data can play a role in coming to understand
them.
(d) Data analysis is like verbal analysis.
(e) The guidelines merely make explicit what comes naturally when dealing with
words.
The need for better skills to turn data into real information in managerial situa-
tions is not new. What has made the need so urgent in recent times is the
exceedingly rapid development of computers and associated management infor-
mation systems. The ability to provide vast amounts of data very quickly has grown
enormously. It has far outstripped the ability of management to make use of the
data. The result has been that in many organisations managers have been swamped
with so-called information which in fact is no more than mere numbers. The
problem of general data analysis is no longer a small one that can be ignored. When
companies are spending large amounts of money on data provision, the question of
how to turn the data into information and use them in decision making is one that
has to be faced.
Review Questions
4.1 Traditional statistical techniques do not help managers in analysing data. True or false?
4.2 The need for new management skills in data analysis arises because so many data come
from computers, which means that they have to be presented in a more complicated
style. True or false?
4.3 Which of the following reasons is correct? The first step in data analysis is to reduce the
data. It is done because:
A. Most data sets contain some inaccuracies.
B. One can only analyse a small amount of data at a time.
C. Most data sets contain some items which are irrelevant.
4.4 Data recorded to eight decimal places can be rounded down since such a degree of
accuracy will not affect the decision being taken. True or false?
A. True
B. False
4.5 Which of the following reasons are correct? A model or pattern is used to summarise a
table because:
A. Exceptions can be seen more easily and accurately.
B. It is easier to make comparisons with other sets of data.
C. The model will be more accurate than the original data.
4.7 A company has four divisions. The profit and capital employed by each are given in the
table below. Which division is the exception?
A. Division 1
B. Division 2
C. Division 3
D. Division 4
4.8 A confectionery manufacturer’s production level for a new chocolate bar is believed to
have increased by 5 per cent per month over the last 36 months. However, for 11 of
these months this model does not fit. The exceptions were as follows: for five months
strikes considerably reduced production; the three Decembers had lower figures, as did
the three Augusts, when the production plant is closed for two weeks. You would be
right in concluding that the 5 per cent model is not a good one because 11 exceptions
out of 36 is too many. True or false?
4.9 Towards the completion of an analysis of the consumption of distilled spirits across the
different states of the USA in a particular year, the results are compared with those of
similar studies. Which of the following other analyses would be useful?
A. Consumption of distilled spirits across the départements of France.
B. Consumption of wine across the départements of France.
C. Consumption of wine across the states of the USA.
D. Consumption of whisky across the states of the USA.
4.10 A simple model is used in preference to a sophisticated one in the analysis of data
because sophisticated models obscure patterns. True or false?
Ten years ago, in 2004, the death rate on the roads of this country was running
at 0.1 death for every 1 million miles driven. By 2009 a death occurred every
12 million miles. Last year, according to figures just released, there were 6400
deaths, whilst a total of 92 000 million miles were driven.
References
Ehrenberg, A. S. C. (1975). Data Reduction. New York: John Wiley and Sons.
Summary Measures
Contents
5.1 Introduction.............................................................................................5/1
5.2 Usefulness of the Measures ....................................................................5/2
5.3 Measures of Location..............................................................................5/5
5.4 Measures of Scatter ............................................................................. 5/14
5.5 Other Summary Measures ................................................................. 5/20
5.6 Dealing with Outliers .......................................................................... 5/21
5.7 Indices ................................................................................................... 5/22
Learning Summary ......................................................................................... 5/29
Review Questions ........................................................................................... 5/30
Case Study 5.1: Light Bulb Testing............................................................... 5/33
Case Study 5.2: Smith’s Expense Account .................................................. 5/34
Case Study 5.3: Monthly Employment Statistics ........................................ 5/34
Case Study 5.4: Commuting Distances ........................................................ 5/34
Case Study 5.5: Petroleum Products ........................................................... 5/35
Learning Objectives
By the end of the module, the reader should know how large quantities of numbers
can be reduced to a few simple summary measures that are much easier to handle
than the raw data. The most common measures are those of location and scatter.
The special case of summarising time series data with indices is also described.
5.1 Introduction
When trying to understand and remember the important parts of a lengthy verbal
report, it is usual to summarise. This may be done by expressing the essence of the
report in perhaps a few sentences, by underlining key phrases or by listing the main
subsection headings. Each individual has his own method, which may be physical (a
written précis) or mental (some way of registering the main facts in the mind).
Whatever the method, the point is that it is easier to handle information in this way,
by summarising and storing these brief summaries in one’s memory. On the few
occasions that details are required, it is necessary to turn to the report itself.
The situation is no different when it is numerical rather than verbal information
that is being handled. It is still better to form a summary to capture the salient
characteristics. The summary may be a pattern, simple or complex, revealed when
analysing the data, or it may be based on one or more of the standard summary
measures described in this module.
Inevitably some accuracy is lost. In the extreme, if the summarising is badly done,
it can be wholly misleading. (How often do report writers claim to have been totally
misunderstood after hearing someone else’s summary of their work?) Take the case
of a recent labour strike in the UK about wages payment. In reporting the current
levels of payment, newspapers/union leaders/employers could not, of course, state
the payments for all 173 000 employees in the industry. They had to summarise.
Five statements as to the ‘average’ weekly wage were made:
The average weekly wage is …
All these quoted wages were said to be the same thing: the average weekly wage.
Are the employees in the industry grossly underpaid or overpaid? It is not difficult
to choose an amount that reinforces one’s prejudices. The discrepancies are not
because of miscalculations but because of definitions. Quote 1 is the basic weekly
wage without overtime, shift allowances and unsocial hours allowance, and it has
been reduced for tax and other deductions. Since the industry is one that requires
substantial night-time working for all employees, no one actually takes home the
amount quoted. Quote 2 is the same as the first but without the tax deduction.
Quote 3 is the average take-home pay of a sample of 30 employees in a particular
area. Quote 4 is basic pay plus unsocial hours allowance but without any overtime
or tax deductions. Quote 5 is basic pay plus allowances plus maximum overtime
pay, without tax and other deductions.
It is important when using summary measures (and in all of statistics) to apply
common sense and not be intimidated by complex calculations. Just because
something that sounds statistical is quoted (‘the average is £41.83’) does not mean
that its accuracy and validity should be accepted without question. When summary
measures fail, it is usually not because of poor arithmetic or poor statistical
knowledge but because common sense has been lacking.
In this context, the remainder of this module goes on to describe ways of sum-
marising numbers and to discuss their effectiveness and their limitations.
Consider a production manager who receives monthly a computer report of the two previous months’ production. The report for
June (22 working days) and May (19 working days) is given in Table 5.1.
The data as shown are useful for reference purposes (e.g. what was the produc-
tion on 15 May?) or for the background to a detailed analysis (e.g. is production
always lower on a Friday and, if so, by how much?). Both these types of use revolve
around the need for detail. For more general information purposes (e.g. was May a
good month for production? What is the trend of production this year?) the amount
of data contained in the table is too large and unwieldy for the manager to be able to
make the necessary comparisons. It would be rather difficult to gauge the trend of
production levels so far this year, from six reports, one for each month, such as that
in Table 5.1. If summary measures were provided, then most questions, apart from
the ones that require detail, could be answered readily. A summary of Table 5.1
might be as shown in Table 5.2.
The summary information provided in Table 5.2 enables a wide variety of man-
agement questions to be answered and, more importantly, answered quickly.
Comparisons are made much more easily if summary data for several months, or
years, are available on one report.
Three types of summary measure are used in Table 5.2. The first, average pro-
duction, measures the location of the numbers and tells at what general level the
data are. The second, the range of production, measures scatter and indicates how
widely spread the data are. The third indicates the shape of the data. In this case,
the answer ‘symmetrical’ says that the data fall equally on either side of the average.
The three measures reflect the important attributes of the data. No important
general features of the data are omitted. If, on the other hand, the measure of scatter
had been omitted, the two months could have appeared similar. In actual fact, their
very different ranges provide an important piece of information that reflects
production planning problems.
For each type of measure (location, scatter, shape) there is a choice of measure to
use (for location, the choice is between arithmetic mean and other measures). The
different types of measures are described below. The measures have many uses
other than as summaries and these will be indicated. They will also be found in
other subject areas. For example, the variance, a measure of scatter, plays a central
role in modern financial theory.
5.3.1 Arithmetic Mean
The arithmetic mean is defined as:

x̄ = ∑x/n

where:
x refers to the data in the set
x̄ is standard notation for the arithmetic mean
∑ is the Greek capital sigma and, mathematically, means ‘sum of’
n is standard notation for the number of readings in the set.
5.3.2 Median
The median is the middle value when the numbers are arranged in ascending order.
For example:

3, 3, 4, 5, 5, 6, 6, 6, 7
            ↑
      Middle number

Median = 5
If there is an even number of readings, then there can be no one middle number.
In this case, it is usual to take the arithmetic mean of the middle two numbers as the
median.
For example, if the set of nine numbers above was increased to ten by the pres-
ence of ‘8’, the set would become:
3, 3, 4, 5, 5, 6, 6, 6, 7, 8
︸
Middle two numbers
Median = (5+6)/2
Median = 5.5
5.3.3 Mode
The third measure of location is the mode. This is the most frequently occurring
value. Again, there is no mathematical formula for the mode. The frequency with
which each value occurs is noted and the value with the highest frequency is the
mode.
Again, using the same nine numbers as an example: 3, 3, 4, 5, 5, 6, 6, 6, 7
Number Frequency
3 2
4 1
5 2
6 3
7 1
Mode = 6
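All three measures of location are available directly in most software. A minimal Python sketch with the same nine numbers:

import statistics

data = [3, 3, 4, 5, 5, 6, 6, 6, 7]
print(statistics.mean(data))    # 5
print(statistics.median(data))  # 5
print(statistics.mode(data))    # 6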
Treating all data in a class as if each observation were equal to the mid-point is of
course an approximation, but it is done to simplify the calculations. However, on
some occasions the data may only be available in this form anyway. For example, in
measuring the lengths of machined car components as part of a quality check, the
observations would probably be recorded in groups such as ‘100.5 to 101.0’ rather
than as individual measurements such as ‘100.634’. The most serious approximation
in Table 5.4 is in taking the mid-point of the 90+ class as 94.5, since this class could
include days when complaints had been much higher, say 150, because of some
special circumstances such as severe disruption on account of a derailment. For
open-ended groups such as this it may be necessary to examine the outliers to test
the validity of the mid-point approximation.
Calculating the mode and median from a frequency table is more straightforward.
The median class is the one in which the middle observation lies. In this case the
175th and 176th observations lie in the 30–39 class (i.e. the median is 34.5). The
mode is the mid-point of the class with the highest frequency. In this case the class
is 20–29 and the mode is therefore 24.5.
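The mid-point approximation is easy to reproduce in code. A minimal Python sketch with invented class frequencies (Table 5.4 itself is not reproduced here), chosen so that n = 350, the mode class is 20–29 and the median class is 30–39, as in the text:

# Grouped mean via the mid-point approximation: every observation in a
# class is treated as if it sat at the class mid-point. Frequencies
# below are hypothetical.
classes = [(0, 9, 30), (10, 19, 50), (20, 29, 90), (30, 39, 70),
           (40, 49, 40), (50, 89, 60), (90, 99, 10)]  # (low, high, count)

n = sum(count for _, _, count in classes)
mean = sum((lo + hi) / 2 * count for lo, hi, count in classes) / n
print(n, round(mean, 1))  # 350 35.4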
[Figure 5.1: Histogram of marks (1–12) against number of readings: a symmetrical distribution.]
For the symmetrical distribution (Figure 5.1) all three measures are equal. This is
always approximately the case for symmetrical data. Whichever measure is chosen, a
similar answer results. Consequently, it is best to use the most well-known measure
(i.e. the arithmetic mean) to summarise location for this set of data.
Calculations for Figure 5.1:
Mean = ∑x/n = 8
[Figure 5.2: Histogram of episodes seen (0–19) against number of readings: a distribution with two peaks, at 0 and 19.]
When a distribution has more than one peak, it is usual to quote more than one mode, each mode corresponding to one peak. For example,
in Figure 5.2, had the frequencies for 0 and 19 episodes been 5 and 4 respectively,
technically there would have been one mode at 0, but because the histogram still
would have two peaks, the data should be reported as having two modes.
Calculations for Figure 5.2:
Mean = 160/20
     = 8
Median = middle value of set
       = average of 10th and 11th values of 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 4, 17, 18, 18, 19, 19, 19, 19, 19
       = 2
Mode = most frequently occurring values
= 0 and 19
Context: Weeks away from work through sickness in a one-year period for a sample of 20 employees in
a particular company.
Readings: 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 5, 18, 28, 44, 52
Shape: reverse J-shaped
[Figure 5.3: Histogram of weeks of sickness (0–52) against number of readings.]
A more representative value would be obtained if the outlier of 52 weeks had not been present. Then the
mean would have been reduced from 8.0 to 5.7.
In all situations, the arithmetic mean can be misleading if there are just one or
two extremes in the data. The mode is not misleading, just unhelpful. Most sickness
records have a mode of 0, therefore to quote ‘mode 0’ is not providing any more
information than merely saying that the data concern sickness records.
Calculations for Figure 5.3:
Mean = 160/20
     = 8
Median = middle value of set
       = average of 10th and 11th values of 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 5, 18, 28, 44, 52
       = 1
Mode = most frequently occurring value
=0
Mean, median and mode are the major measures of location and are obviously
useful as summaries of data. Equally obviously, they do not capture all aspects of a
set of numbers. Other types of summary measure are necessary. However, before
we leave measures of location, their uses, other than as summarisers, will be
described.
Adding the mean to each set of data allows the shape of the distribution to
become apparent more quickly to the eye. In the first case, the focus enables one
to see that the numbers are scattered
closely and about equally either side of the mean; in the second case, one sees that
most numbers are below the mean with just a few considerably above.
In fact, the two sets are the symmetrical data and the reverse J-shaped data intro-
duced earlier in Figure 5.1 and Figure 5.3. In the latter case the arithmetic mean was
judged not to be the most useful measure to act as a summary; nevertheless it has a
value when used as a focus for the eye. One meets this usage with row and column
averages in tables of numbers.
For Comparisons
Measures of location can be used to compare two (or more) sets of data regardless
of whether the measure is the best summary measure for that set.
Set 1: 5, 7, 8, 9, 9, 10 Mean = 8
Set 2: 5, 5, 5, 6, 6, 7, 8, 10 Mean = 6.5
The two sets of data above contain a different number of readings. The arithme-
tic mean may or may not be the correct summary measure for either set.
Nevertheless, a useful comparison between them can be effected through the mean.
Similarly, the sickness records of a group of people (reverse J shape) over several
years can be compared using the arithmetic mean, even though one would not use
this measure purely to summarise the data.
Salary
(inc. bonuses)
1 Founder/managing director £60 000
4 Skilled workers £14 000
5 Unskilled workers £12 000
Arithmetic mean salary = £17 600
The lesson is: when averaging averages where groups of different size are in-
volved, go back to the basic definition of the average.
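In code the lesson is a one-liner: weight each group’s average by its size. A minimal Python sketch with the salary figures above:

# Average salary from the basic definition (total pay / total people),
# not as the unweighted mean of the three group averages.
groups = [(1, 60_000), (4, 14_000), (5, 12_000)]  # (people, salary)

total_pay = sum(people * salary for people, salary in groups)
total_people = sum(people for people, _ in groups)
print(total_pay / total_people)                 # 17600.0
print(sum(salary for _, salary in groups) / 3)  # about 28 667 -- misleading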
Except where one of the difficulties described above applies, the arithmetic mean
is the first choice of measure of location.
5.4.1 Range
The best-known and certainly the simplest measure of scatter is the range, which is
the total interval covered by the numbers.
Range = Largest reading − Smallest reading
For example, for the nine numbers 3, 3, 4, 5, 5, 6, 6, 6, 7:
Range = 7 − 3
=4
5.4.2 Interquartile Range
The interquartile range is the range covered by the middle 50 per cent of the data,
found by removing the lowest quarter and the highest quarter of the readings. For
the same nine numbers:

3, 3, 4, 5, 5, 6, 6, 6, 7
(remove the lowest quarter: 3, 3; remove the highest quarter: 6, 7)

Interquartile range = 6 − 4
                    = 2
5.4.3 Mean Absolute Deviation
The mean absolute deviation (MAD) is the average absolute distance of the
readings from the arithmetic mean:

MAD = ∑|x − x̄|/n

where:
x̄ is the arithmetic mean
n is the number of readings in the set.
(The notation |…| (pronounced ‘the absolute value of’) means the size of the
number disregarding its sign.)
For example, calculate the MAD of: 3, 3, 4, 5, 5, 6, 6, 6, 7.
From the previous work: x̄ = 5

x          3   3   4   5   5   6   6   6   7
x − x̄    −2  −2  −1   0   0   1   1   1   2
|x − x̄|   2   2   1   0   0   1   1   1   2

∑|x − x̄| = 2 + 2 + 1 + 0 + 0 + 1 + 1 + 1 + 2
         = 10
MAD = 10/9
    = 1.1
The concept of absolute value used in the MAD is to overcome the fact that
(x − x̄) is sometimes positive, sometimes negative and sometimes zero. The
absolute value gets rid of the sign. Why is this necessary? Try the calculation without
taking absolute values and see what happens.
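In Python the check takes two lines, and shows what happens: the signed deviations always sum to zero, which is why the absolute value is needed.

data = [3, 3, 4, 5, 5, 6, 6, 6, 7]
mean = sum(data) / len(data)  # 5.0
print(sum(x - mean for x in data))                   # 0.0 -- the signs cancel
print(sum(abs(x - mean) for x in data) / len(data))  # 1.11... -- the MAD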
5.4.4 Variance
An alternative way of eliminating the sign of deviations from the mean is to square
them, since the square of any number is never negative. The variance is the average
squared distance of readings from the arithmetic mean:
Variance = ∑(x − x̄)²/n
population. We can see this intuitively because the one-in-a-million extreme outlier
that is present in the population will not usually be present in the sample. Extreme
outliers have a large impact on the variance since it is based on squared deviations.
Consequently, when the variance is calculated from a sample, it tends to underesti-
mate the true population variance.
Dividing by n − 1 instead of n increases the size of the calculated figure, and this
increase offsets the underestimate by just the right amount. ‘Just the right amount’
has the following meaning. Calculating the variance with n − 1 as the denominator
will give, on average, the best estimate of the population variance. That is, if we
were to repeat the calculation for many, many samples (in fact, an infinite number
of samples) and take the average, the result would be equal to the true population
variance. If we used n as the denominator this would not be the case. This can be
verified mathematically but goes beyond what a manager needs to know – consult a
specialist statistical text if you are interested.
Section 9.3 on ‘Degrees of Freedom’ in Module 9 gives an alternative and more
technical explanation.
Unless you are sure you are in the rare situation of dealing with the whole popu-
lation of a variable, you should use the n − 1 version of the formula. Calculators and
popular spreadsheet packages that have a function for calculating the variance
automatically nearly always use n − 1, although there may be exceptions.
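As a quick illustration of the two denominators, Python's standard statistics module provides both versions; this sketch assumes nothing beyond the standard library:

import statistics

data = [3, 3, 4, 5, 5, 6, 6, 6, 7]
print(statistics.pvariance(data))  # divides by n:     16/9 = 1.78 (population formula)
print(statistics.variance(data))   # divides by n - 1: 16/8 = 2.0  (sample formula)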
Taking the usual example set of numbers, calculate the variance of 3, 3, 4, 5, 5, 6,
6, 6, 7 (Mean = 5).
x         3   3   4   5   5   6   6   6   7
x − x̄    −2  −2  −1   0   0   1   1   1   2
(x − x̄)²  4   4   1   0   0   1   1   1   4

∑(x − x̄)² = 4 + 4 + 1 + 0 + 0 + 1 + 1 + 1 + 4 = 16

Variance = ∑(x − x̄)²/(n − 1) = 16/8 = 2
The variance has many applications, particularly in financial theory. However, as
a pure description of scatter, it suffers from the disadvantage that it involves
squaring. The variance of the number of weeks of sickness of 20 employees is
measured in square weeks. However, it is customary to quote the variance in
ordinary units (e.g. in the above example the variance is said to be two weeks).
In practice the variance is more easily calculated from an equivalent short-cut formula, which avoids working out the individual deviations:

Variance = (∑x² − n·x̄²)/(n − 1)

x      x²
3       9
3       9
4      16
5      25
5      25
6      36
6      36
6      36
7      49
Total  45  241

Mean = 45/9 = 5
Variance = (∑x² − n·x̄²)/(n − 1)
         = (241 − 9 × 25)/8
         = 2
To summarise the comparison: the standard deviation is easy to handle mathematically and is used in other statistical theories, but it is too involved for purely descriptive purposes.
All the measures have their particular uses. No single one stands out. When a
measure of scatter is required purely for descriptive purposes, the best measure is
probably the mean absolute deviation, although it is not as well known as it deserves
to be. When a measure of scatter is needed as part of some wider statistical or
mathematical theory, then the variance and standard deviation are frequently
encountered.
Further Example
A company’s 12 salespeople in a particular region last month drove the following
number of kilometres:
Salesperson Kilometres
(hundreds)
1 34
2 47
3 30
4 32
5 38
6 39
7 36
8 43
Salesperson Kilometres
(hundreds)
9 31
10 40
11 42
12 32
Calculate:
(a) range
(b) interquartile range
(c) MAD
(d) variance
(e) standard deviation.
Which measure is the most representative of the scatter in these data?
(a) Range = Highest – Lowest = 47 – 30 = 17
(b) Putting the numbers in ascending order:
30, 31, 32, 32, 34, 36, 38, 39, 40, 42, 43, 47

The lowest quarter (30, 31, 32) and the highest quarter (42, 43, 47) are removed, leaving 32, 34, 36, 38, 39, 40.

Interquartile range = 40 − 32 = 8
(c) To calculate MAD, it is first necessary to find the arithmetic mean:

Mean = ∑x/n = 444/12 = 37
Next calculate the deviations:
x        34  47  30  32  38  39  36  43  31  40  42  32
x − x̄    −3  10  −7  −5   1   2  −1   6  −6   3   5  −5
|x − x̄|   3  10   7   5   1   2   1   6   6   3   5   5

∑|x − x̄| = 54

MAD = ∑|x − x̄|/n = 54/12 = 4.5
(d) The variance
x          34   47  30  32  38  39  36  43  31  40  42  32
x − x̄      −3   10  −7  −5   1   2  −1   6  −6   3   5  −5
(x − x̄)²    9  100  49  25   1   4   1  36  36   9  25  25

∑(x − x̄)² = 320

Variance = ∑(x − x̄)²/(n − 1) = 320/11 = 29.1
(e) Standard deviation
= √Variance
= √29.1
= 5.4
The best descriptive measure of scatter in this situation is the mean absolute
deviation. The average difference between a salesperson’s travel and the average travel
is 4.5 (450 kilometres). This is a sensible measure that involves all data points. The range
is of great interest, but not as a measure of scatter. Its interest lies in indicating the
discrepancy between the most and least travelled. It says nothing about the ten in-
between salespeople. The interquartile range is probably the second choice. The
variance and standard deviation are probably too complex conceptually to be descrip-
tive measures in this situation, where further statistical analysis is not likely.
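The whole example can be verified with a short Python sketch. Note that the interquartile range below follows this module's rule of discarding a quarter of the readings at each end; other texts define quartiles slightly differently:

km = [34, 47, 30, 32, 38, 39, 36, 43, 31, 40, 42, 32]
n = len(km)
mean = sum(km) / n                                # 37.0

rng = max(km) - min(km)                           # (a) range = 17
s = sorted(km)
iqr = s[n - n // 4 - 1] - s[n // 4]               # (b) drop 3 from each end: 40 - 32 = 8
mad = sum(abs(x - mean) for x in km) / n          # (c) MAD = 54/12 = 4.5
var = sum((x - mean) ** 2 for x in km) / (n - 1)  # (d) variance = 320/11 = 29.1
sd = var ** 0.5                                   # (e) standard deviation = 5.4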
The coefficient of variation relates the standard deviation to the arithmetic mean:

Coefficient of variation = Standard deviation / Mean

This is useful when sets of data with very different characteristics are being compared. For example, suppose one is looking at the number of passengers per day passing through two airports. Over a one-year period the average number of passengers per day, the standard deviations and the coefficients of variation are calculated.
Consideration of the standard deviations alone would suggest that there was more
scatter at Airport 2. In relation to the number of passengers using the two airports, the
scatter is smaller at Airport 2 as revealed by the coefficient of variation being 0.14 as
against 0.25 at Airport 1.
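The airport figures themselves are not reproduced above, so the numbers in this sketch are invented purely to match the two coefficients quoted (0.25 and 0.14):

mean1, sd1 = 20_000, 5_000     # hypothetical Airport 1: passengers/day
mean2, sd2 = 100_000, 14_000   # hypothetical Airport 2: passengers/day

cv1 = sd1 / mean1              # coefficient of variation = sd / mean = 0.25
cv2 = sd2 / mean2              # 0.14 - less relative scatter despite the larger sd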
5.5.1 Skew
Skew measures the extent to which a distribution is non-symmetrical. Figure 5.4(a)
is a distribution that is left-skewed; Figure 5.4(b) is a symmetrical distribution with
zero skew; Figure 5.4(c) is a distribution that is right-skewed.
Figure 5.4 Skew: (a) left skew; (b) zero skew; (c) right skew
The concept of skew is normally used purely descriptively and is assessed visually
(i.e. one looks at the distribution and assesses whether it is symmetrical or right- or
left-skewed). Skew can be measured quantitatively but the formula is complex and
the accuracy it gives (over and above a verbal description) is rarely necessary in
practice. However, the measurement of skew gives rise to the alternative labels
positively skewed (right-skewed) and negatively skewed (left-skewed).
5.5.2 Kurtosis
Kurtosis measures the extent to which a distribution is ‘pinched in’ or ‘filled out’.
Figure 5.5 shows three distributions displaying increasing levels of kurtosis. As with
skew, a qualitative approach is sufficient for most purposes (i.e. when one looks at
the distribution, one can describe it as having a low, medium or high level of
kurtosis). Kurtosis can also be measured quantitatively, but, again, the formula is
complex.
An outlier is a reading lying well away from the main body of the data, and even one outlier can distort summary measures. This is particularly true of the variance and standard deviation, which use squared values. How does one deal with the outliers? Are they to be included or excluded? Three basic situations arise.
(a) Twyman’s Law. This only half-serious law states that any piece of data that
looks interesting or unusual is wrong. The first consideration when confronted
by an outlier is whether the number is incorrect, perhaps because of an error in
collection or a typing mistake. There is an outlier in these data, which are the
week’s overtime payments to a group of seven workers (in £s):
13.36, 17.20, 16.78, 15.98, 1432, 19.12, 15.37
Twyman’s Law suggests that the outlier showing a payment of 1432 occurs be-
cause of a dropped decimal point rather than a fraudulent claim. The error
should be corrected and the number retained in the set.
(b) Part of the pattern. An outlier may be a definite and regular part of the pattern
and should be neither changed nor excluded. Such was the case with the sickness
record data of Figure 5.3. The outliers were part of the pattern and similar ef-
fects were likely to be seen in other time periods and with other groups of
employees.
(c) Isolated events. Outliers occur that are not errors but that are unlikely to be
repeated (i.e. they are not part of the pattern). Usually they are excluded from
calculations of summary measures, but their exclusion is noted. For example, the
following data, recorded by trip wire, show the number of vehicles travelling
down a major London road during a ten-day period:
5271, 5960, 6322, 6011, 7132, 5907, 6829, 741, 7098, 6733
The outlier is the 741. Further checking shows that this day was the occasion of
a major royal event and that the road in question was closed to all except state
coaches, police vehicles, etc. This is an isolated event, perhaps not to be repeated
for several years. For traffic control purposes, the number should be excluded
from calculations since it is misleading, but a note should be made of the exclu-
sion. Hence, one would report:
Mean vehicles/day = 57 263/9
                  = 6363 (excluding day of Royal Event)
The procedure for outliers is first to look for mistakes and correct them; and
second, to decide whether the outlier is part of the pattern and should be includ-
ed in calculations or an isolated event that should be excluded.
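In code, the procedure amounts to correcting known mistakes and then computing summaries with any isolated-event outliers excluded, noting the exclusion. A minimal sketch using the traffic data:

traffic = [5271, 5960, 6322, 6011, 7132, 5907, 6829, 741, 7098, 6733]

# 741 is a genuine reading but an isolated event (road closed for a royal event),
# so it is excluded from the summary and the exclusion is reported
kept = [v for v in traffic if v != 741]
mean = round(sum(kept) / len(kept))   # 6363
print(f"Mean vehicles/day = {mean} (excluding day of Royal Event)")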
5.7 Indices
An index is a particular type of measure used for summarising the movement of a
variable over time. When a series of numbers is converted into indices, it makes the
numbers easier to understand and to compare with other series.
The best-known type of index is probably the cost of living index. The cost of
living comprises the cost of many different goods: foods, fuel, transport, etc.
Instead of using the miscellaneous and confusing prices of all these purchases, we
use an index number, which summarises them for us. If the index in 2014 is 182.1
compared with 165.3 in 2013, we can calculate that the cost of living has risen by:
(182.1 − 165.3)/165.3 × 100 = 10.2%
This is rather easier than having to cope with the range of individual price rises
involved.
Every index has a base year when the index was 100 (i.e. the starting point for
the series). If the base year for the cost of living index was 2008, then the cost of
living has risen 82.1 per cent between 2008 and 2014. This could be said in a
different way: the 2014 cost of living is 182.1 per cent of its 2008 value.
The index very quickly gives a feeling for what has happened to the cost of living.
Comparisons are also easier. For example, if the Wages and Salaries Index has 2008
as its base year and stood at 193.4 in 2014, then, over the six years, wages out-
stripped the cost of living: 93.4 per cent as against 82.1 per cent.
The cost of living index is based on a complicated calculation. However, there
are some more basic indices.
For example, the price of a product over the ten years 2005 to 2014 can be converted into an index with 2010 as the base year:

Year           2005  2006  2007  2008  2009  2010  2011  2012  2013  2014
Price (£000s)   6.1   8.2   8.6  10.1  11.8  12.4  16.9  19.0  19.7  19.4
Index            49    66    69    81    95   100   136   153   159   156
The index for 2010, being the base year, is 100. The other data in the series are
scaled up accordingly. For instance, the index for 2007 is:
8.6 × 100/12.4 = 69

where 8.6 is the original datum and 12.4 is the original datum for the base year. And for 2013:

19.7 × 100/12.4 = 159
The choice of the base year is important. It should be such that individual index
numbers during the time span being studied are never too far away from 100. As a
rule of thumb, the index numbers are not usually allowed to differ from 100 by
more than a factor of 3 (i.e. the numbers are in the range 30 to 300). If the base
year for the numbers in the series above had been chosen as 2005, then the index
series would have been from 100 to 318.
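The scaling can be checked with a minimal Python sketch, which reproduces the index series above:

prices = {2005: 6.1, 2006: 8.2, 2007: 8.6, 2008: 10.1, 2009: 11.8,
          2010: 12.4, 2011: 16.9, 2012: 19.0, 2013: 19.7, 2014: 19.4}
base = prices[2010]                                         # base year 2010 = 100
index = {year: round(p / base * 100) for year, p in prices.items()}
print(index[2007], index[2013])                             # 69 159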
In long series, there might be more than one base year. For example, a series covering more than 30 years from 1982 to 2014 might have 1982 as a base year, with the series rising to 291 in 2001; 2001 could then be taken as a second base year (2001 = 100), with the rebased series rising to 213 by 2014.
Care obviously needs to be taken in interpreting this series. The increase from
1982 to 2014 is not 113 per cent. If the original series had been allowed to continue,
the 2014 index would have been 213 × 2.91 = 620. The increase is thus 520 per
cent.
If January is taken as the base, then the indices for the months are:
One possible disadvantage of this index is that livestock with a low price will
have much less influence than livestock with a high price. For instance, in February
a 20 per cent change in the price of cattle would change the index much more than
a 20 per cent change in the pig price:
However, this may be a desirable feature. If the price level of each factor in the
index reflects its importance, then the higher-priced elements should have more
effect on the index. On the other hand, this feature may not be desirable. One way
to counter this is to construct a price relative index. This means that the individual
prices are first converted into an index and then these individual indices are aver-
aged to give the overall index.
Instead of adding them up, we first weight the prices by the quantities, and the
final index is formed from the resulting monthly total. The quantities used for the
weighting should be the same for each month, since this is a price index. Otherwise
the index would measure price and volume changes. If the quantities used for
weighting are the base month (January) quantities, then the index is known as a
Laspeyres Index and is calculated as follows:

Laspeyres price index = (∑pₙq₀ / ∑p₀q₀) × 100

where p₀ and q₀ are the base-period prices and quantities and pₙ the prices in the current period.
A Laspeyres Index, like other indices, can be used for quantities as well as prices.
For a quantity index the role of price and quantity in the above example (of a price
index) would be reversed, with prices providing the weightings to measure changes
in quantities. The weights (prices) remain at the constant level of some base period
while the quantities change from time period to time period. For example, the UK
Index of Manufacturing Production shows how the level of production in the
country is changing as time goes by. The quantities refer to different types of
product – consumer goods, industrial equipment, etc. – and the prices are those of
the products in a base year. The use of prices as weights for quantities gives the
most expensive products a heavier weighting.
A major criticism of the Laspeyres Index is that the weights in the base year may
soon become out of date and no longer representative. An alternative is the
Paasche Index, which takes the weights from the most recent time period – the
weightings therefore change from each time period to the next. In the livestock
example a Paasche Index would weight the prices in each month with the quantities
relating to December, the most recent month. A Paasche Index always uses the
most up-to-date weightings, but it has the serious practical disadvantage that, if it is
to be purely a price index, every time new data arrive (and the weightings change)
the entire past series must also be revised.
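The contrast between the two indices is easiest to see in code. The livestock figures are not reproduced above, so the prices and quantities below are invented for illustration only:

def laspeyres(p0, pn, q0):
    # base-period quantities q0 weight both numerator and denominator
    return 100 * sum(p * q for p, q in zip(pn, q0)) / sum(p * q for p, q in zip(p0, q0))

def paasche(p0, pn, qn):
    # current-period quantities qn are used instead (so past values shift as qn changes)
    return 100 * sum(p * q for p, q in zip(pn, qn)) / sum(p * q for p, q in zip(p0, qn))

p0, q0 = [100, 60], [50, 80]   # hypothetical base-month prices and quantities
pn, qn = [110, 75], [45, 90]   # hypothetical current-month prices and quantities
print(round(laspeyres(p0, pn, q0), 1))   # 117.3
print(round(paasche(p0, pn, qn), 1))     # 118.2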
A fixed weight index may also be used. Its weights are from neither the base
period nor the most recent period. They are from some intermediate period or from
the average of several periods. It is a matter of judgement to decide which weights
to use.
The cost of living index has already been introduced. It indicates how the cost of
a typical consumer’s lifestyle changes as time goes by, and it has many practical uses.
For example, it is usually the starting point in wage negotiations, since it shows how
big a wage increase is needed if an employee’s current standard of living is to be
maintained. A wage increase lower than the cost of living index would imply a
decrease in real wages.
What type of index should be used for the cost of living index? Table 5.8 and
Table 5.9 show the simplified example from Economics Module 12 (Tables 12.1 and
12.2).
The weights are multiplied by the prices to give the index, as in the livestock
example. The weights are changed regularly as a result of government surveys of
expenditure patterns. It is important to note that, for the cost of living index,
previous values of the index are not changed as the weights change (i.e. the index
remains as it was when first calculated). Consequently, in comparing the cost of
living now with that of 20 years ago, a cost of living index reflects changes in
purchasing behaviour as well as the inevitably increasing prices.
A price index such as the cost of living index can be used to deflate economic
data. For example, the measure of a country’s economic activity is its GNP (gross
national product), the total value of the goods and services an economy produces in
a year. The GNP is measured from statistical returns made to the government and is
calculated in current prices (i.e. it is measured in terms of the prices for the goods
and services that apply in that year). Consequently, the GNP can rise from one year
to another because prices have risen through inflation, even though actual economic
activity has decreased. It would be helpful to neutralise the effect of prices in order
to have a more realistic measure of economic activity. This is done by deflating the
series so that the GNP is in real, not current, terms. The deflation is carried out by
using a price index. Table 5.11 shows the GNP of a fictional country for 2008–14,
together with a price index for those years. Current GNP is converted to real GNP
by dividing by the price index.
The changes in GNP(real) over the years 2008–14 show that economic activity
did increase in each of those years but not by as much as GNP(current) suggested.
It is important that an appropriate price index is used. It would not have been
appropriate to use the cost of living index for GNP since that index deals only with
consumer expenditure. A price index that incorporates the prices used in the GNP –
that is, investment and government goods as well as consumer goods – should be
used.
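Deflation itself is a simple division. Table 5.11 is not reproduced here, so the GNP and index figures in this sketch are illustrative only:

gnp_current = {2012: 480, 2013: 520, 2014: 570}   # hypothetical GNP, current prices
price_index = {2012: 100, 2013: 104, 2014: 110}   # hypothetical price index

gnp_real = {y: round(gnp_current[y] / price_index[y] * 100) for y in gnp_current}
print(gnp_real)   # {2012: 480, 2013: 500, 2014: 518} - real growth is smaller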
Learning Summary
In the process of analysing data, at some stage the analyst tries to form a model of
the data, as suggested previously. ‘Pattern’ or ‘summary’ are close synonyms for
‘model’. The model may be simple (all rows are approximately equal) or complex
(the data are related via a multiple regression model). Often specifying the model
requires intuition and imagination. At the very least, summary measures can provide
a model based on specifying for the data set:
(a) number of readings;
(b) a measure of location;
(c) a measure of scatter;
(d) the shape of the distribution.
In the absence of other inspiration, these four attributes provide a useful model
of a set of numbers. If the data consist of two or more distinct sets (as, for example,
a table), then this basic model can be applied to each. This will give a means of
comparison between the rows or columns of the table or between one time period
and another.
The first attribute (number of readings) is easily supplied. Measures of location
and scatter have already been discussed. The shape of the distribution can be found
by drawing a histogram and literally describing its shape (as with the symmetrical, U
and reverse-J distributions seen earlier). A short verbal statement about the shape is
often an important factor in summarising or forming a model of a set of data.
Verbal statements have a more general role in summarising data. They should be
short, no more than one sentence, and used only when they can add to the sum-
mary. They are used in two ways: first, they are used when the quantitative measures
are inadequate; second, they are used to point out important features in the data.
For example, a table of a company’s profits over several years might indicate that
profits had doubled. Or a table of the last two months’ car production figures might
have a note stating that 1500 cars were lost because of a strike.
It is important, in using verbal summaries, to distinguish between helpful state-
ments pointing out major features and unhelpful statements dealing with trivial
exceptions and details. A verbal summary should always contribute to the objective
of adding to the ease and speed with which the data can be handled.
Review Questions
5.1 Which of the following statements about summary measures are true?
A. They give greater accuracy than the original data.
B. It is easier to handle information in summary form.
C. They are never misleading.
D. Measures of location and scatter together capture all the main features of data.
Questions 5.2 to 5.4 refer to the following data:
1, 5, 4, 2, 7, 1, 0, 8, 6, 6, 5, 2, 4, 5, 3, 5
5.5 Which of the following applies? As a measure of location, the arithmetic mean:
A. is always better than the median and mode.
B. is usually a misleading measure.
C. is preferable to mode and median except when all three are approximately
equal.
D. should be used when the data distribution is U shaped.
E. None of the statements applies.
5.6 An aircraft’s route requires it to fly along the sides of a 200 km square (see figure
below). Because of prevailing conditions, the aircraft flies from A to B at 200 km/h, from
B to C at 300 km/h, from C to D at 400 km/h and from D to A at 600 km/h. What is
the average speed for the entire journey from A to A?
A B
200 km
D C
A. 325 km/h
B. 320 km/h
C. 375 km/h
D. 350 km/h
5.7 Which of the following statements about measures of scatter are true?
A. Measures of scatter must always be used when measures of location are used.
B. A measure of scatter is an alternative to a measure of location as a summary.
C. One would expect a measure of scatter to be low when readings are close
together, and high when they are further apart.
D. A measure of scatter should be used in conjunction with a measure of disper-
sion.
5.14 Which of the following statements are true regarding the presence of an outlier in a
data set of which summary measures are to be calculated?
A. An outlier should either be retained in or excluded from the set.
B. An outlier that is part of the pattern of the data should always be used in
calculating the arithmetic mean.
C. An outlier that is not part of the pattern should usually be excluded from any
calculations.
Which of the following is correct? Between 2013 and 2014, the growth of the cost of
living compared to wages and salaries was:
A. much greater.
B. slightly greater.
C. equal.
D. slightly less.
E. much less.
Mr Smith’s boss felt that these expenses were excessive because, he said, the average
expense per day was £35. Other salespeople away on weekly trips submitted expenses
that averaged at around £20 a day. Where did the £35 come from? How can Smith
argue his case?
Department                       A       B      C     D
Mean monthly employment level  10 560   4 891   220   428
Standard deviation                606     302    18    32
Is the monthly employment level more stable in some departments than in others?
The mean distance from work was found to be 10.5 miles. Calculate the mode, the
median and two measures of scatter. How would you summarise these data succinctly?
                    Prices               Quantities
             2012   2013   2014     2012   2013   2014
Car petrol   26.2   27.1   27.4      746    768    811
Kerosene     24.8   28.9   26.5       92     90    101
Paraffin     23.0   24.1   24.8      314    325    348
a. Use 2012 = 100 to construct a simple aggregate index for the years 2012 to 2014
for the prices of the three petroleum products.
b. Use 2012 quantities as weights and 2012 = 100 to construct a weighted aggregate
index for the years 2012 to 2014 for the prices of the three petroleum products.
c. Does it matter which index is used? If so, which one should be used?
d. How else could our index be constructed?
Sampling Methods
Contents
6.1 Introduction.............................................................................................6/1
6.2 Applications of Sampling .......................................................................6/3
6.3 The Ideas behind Sampling ....................................................................6/3
6.4 Random Sampling Methods ...................................................................6/4
6.5 Judgement Sampling ........................................................................... 6/10
6.6 The Accuracy of Samples ................................................................... 6/12
6.7 Typical Difficulties in Sampling .......................................................... 6/13
6.8 What Sample Size? .............................................................................. 6/15
Learning Summary ......................................................................................... 6/16
Review Questions ........................................................................................... 6/18
Case Study 6.1: Business School Alumni ..................................................... 6/20
Case Study 6.2: Clearing Bank ...................................................................... 6/20
Learning Objectives
By the end of this module the reader should know the main principles underlying
sampling methods. Most managers have to deal with sampling in some way. It may
be directly in commissioning a sampling survey, or it may be indirectly in making
use of information based on sampling. For both purposes it is necessary to know
something of the techniques and, more importantly, the factors critical to their
success.
6.1 Introduction
Statistical information in management is usually obtained from samples. The
complete set of all conceivable observations of a variable is a population; a subset
of a population is a sample. It is rarely possible to study a population. Nor is it
desirable, since sample information is much less expensive yet proves sufficient to
take decisions, solve problems and answer questions in most situations. For
example, to know what users of soap powder in the UK think about one brand of
soap powder, it is hardly possible to ask each one of the 15 million of them his or
her opinion, nor would the expense of such an exercise be warranted. A sample
would be taken. A few hundred would be interviewed and from their answers an
estimate of what the full 15 million are thinking would be made. The 15 million are
the population, the few hundred are the sample. A population does not have to refer
to people. In sampling agricultural crops for disease, the population might be 50 000
hectares of wheat.
In practice, information is nearly always collected from samples as opposed to
populations, for a wide variety of reasons.
(a) Economic advantages. Collecting information is expensive. Preparing ques-
tionnaires, paying postage, travelling to interviews and analysing the data are just
a few examples of the costs. Taking a sample is cheaper than observing the
whole population.
(b) Timeliness. Collecting information from a whole population can be slow,
especially waiting for the last few questionnaires to be filled in or for appoint-
ments with the last few interviewees to be granted. Sample information can be
obtained more quickly, and sometimes this is vital, for example, in electoral opin-
ion polls when voters’ intentions may swing significantly as voting day
approaches.
(c) Size and accessibility. Some populations are so large that information could
not be collected from the whole population. For example, a marketing study
might be directed at all teenagers in a country and it would be impossible to
approach them all. Even in smaller populations there may be parts that cannot
be reached. For example, surveys of small businesses are complicated by there
being no up-to-date lists because small businesses are coming into and going out
of existence all the time.
(d) Observation and destruction. Recording data can destroy the item being
observed. For example, in a quality test on electrical fuses, the test ruins the fuse.
It would not make much sense to destroy all fuses immediately after production
just to see whether they worked. Sampling is the only possible approach to some
situations.
There are two distinct areas of theory associated with sampling, illustrated in
Figure 6.1. In the example of market research into soap powder above, the first area
of theory would show how to choose the few hundred interviewees; the second
would show how to use the sample information to draw conclusions about the
population.
Figure 6.1 The two areas of sampling theory: how to choose a sample, and how to make inferences about the population from the sample
Only the first area, methods of choosing samples, is the topic here. The methods
will be described, some applications will be illustrated and technical aspects will be
introduced. The second area, making inferences, is the subject of Module 8.
random sampling is equivalent to this procedure. The selection of the sample is left
to chance and one would suppose that, on average (but not always), the sample
would be reasonably representative.
The major disadvantage of simple random sampling is that it can be expensive. If
the question to be answered is the likely result of a UK general election, then the
sampler must find some means of listing the whole electorate, choose a sample at
random and then visit the voters chosen (perhaps one in Inverness, one in Pen-
zance, two in Norwich, one in Blackpool, etc.). Both selection and questioning
would be costly.
Variations on simple random sampling can be used to overcome this problem.
For example, multi-stage sampling in the opinion poll example above would
permit the sample to be collected in just a few areas of the country, cutting down on
the travelling and interviewing expenses.
The variations on simple random sampling also make it possible to use other
information to make the sample more representative. In the soap powder example a
stratified sample would let the sampler make use of the fact that known percent-
ages of households use automatic machines, semi-automatic machines, manual
methods and launderettes.
Some situations do not allow or do not need any randomisation in sampling.
(How would a hospital patient feel about a ‘random’ sample of blood being taken
from his body?) Judgement sampling refers to all methods that are not essentially
random in character and in which personal judgement plays a large role. They can
be representative in many circumstances.
Figure 6.2 shows diagrammatically the main sampling methods and how they are
linked. In the next sections these methods will be described in more detail.
To choose the sample, the random numbers are taken one at a time from Ta-
ble 6.1 and each associated with the corresponding number and name from
Table 6.2. The first number is 31, so the corresponding employee is Lester, E. The
second number is 03, so the employee is Binks, J. This is continued until the
necessary sample is collected. The third number is 62, and thus the name is Sutcliffe,
H. The fourth number is 98, which does not correspond to any name and is
ignored. Table 6.3 shows the sample of five.
The drawbacks of simple random sampling are that the listing of the population
can prove very expensive, or even impossible, and that the collection of data from
the sample can also be expensive, as with opinion polls. Variations on simple
random sampling have been developed to try to overcome these problems. These
methods still have a sizeable element of random selection in them and the random
part uses a procedure such as that described above.
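In practice the random number table can be replaced by a pseudo-random generator. A minimal sketch with Python's standard library (the employee list is a stand-in for the numbered list of Table 6.2):

import random

employees = [f"Employee {i:02d}" for i in range(1, 76)]   # stand-in numbered list
random.seed(1)                          # fixed seed only so the sketch is repeatable
sample = random.sample(employees, 5)    # five distinct names, each equally likely
print(sample)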
Multi-Stage
In multi-stage sampling, the population is split into groups and each group is split
into subgroups, each subgroup into subsubgroups, etc. A random sample is taken at
each stage of the breakdown. First a simple random sample of the groups is taken;
of the groups chosen a simple random sample of their subgroups is taken and so on.
For example, suppose the sample required is of 2000 members of the workforce of
a large company, which has 250 000 employees situated in offices and factories over
a large geographical area. The objective is to investigate absenteeism. The company
has 15 regional divisions; each division has an average of ten locations at which an
office or factory is situated. Each location has its own computerised payroll. Since
the company operates on a decentralised basis, no company-wide list of employees
exists. Figure 6.3 illustrates how the population is split.
Figure 6.3 How the population is split: the company → 15 geographical divisions → about ten locations per division → the employees at each location
Cluster Sampling
Cluster sampling is closely linked with multi-stage sampling in that the population
is divided into groups, the groups into subgroups and so on. The groups are
sampled, the subgroups sampled, etc. The difference is that, at the final stage, each
individual of the chosen groups is included in the final sample. Pictorially, the final
sample would look like a series of clusters drawn from the population.
For example, suppose the company described in the previous section on multi-
stage sampling is slightly different. The locations are not offices and factories but
small retail establishments at which an average of six staff are employed. The four
divisions and two locations per division could be sampled as before, but, because
there are only six staff, further samples at each location would not be taken. All six
would be included in the sample, carrying the further advantage of ensuring that all
grades of employee are in the sample. In this case the total sample size is 4 × 2 × 6
= 48. This is two-stage cluster sampling, because two stages of sampling, of
divisions and locations, are involved. If all locations, ignoring divisions, had been
listed and sampled, the process would involve only one sampling stage and would be
called ordinary cluster sampling. If a total sample larger than 48 were required
then more divisions or more than two locations per division would have to be
selected.
Stratified Sampling
In stratified sampling prior knowledge of a population is used to make the sample
more representative. If the population can be divided into subpopulations (or strata)
of known size and distinct characteristics then a simple random sample is taken of
each subpopulation such that there is the same proportion of subpopulation
members in the sample as in the whole population. If, for example, 20 per cent of a
population forms a particular stratum, then 20 per cent of the sample will be of that
stratum. The sample is therefore constrained to be representative at least as regards
the occurrence of the different strata. Note the different role played by the strata
compared to the role of the groups in multi-stage and cluster sampling. In the
former case, all strata are represented in the final sample; in the latter, only a few of
the groups are represented in the final sample.
For an example of stratified sampling, let’s return to the previous situation of
taking a sample of 2000 from the workforce of a large company. Suppose the
workforce comprises 9 per cent management staff, 34 per cent clerical, 21 per cent
skilled manual and 36 per cent unskilled manual. It is desirable that these should all
be represented in the final sample and in these proportions. Stratified sampling
involves first taking a random sample of 180 management staff (9 per cent of 2000),
then 680 clerical, then 420 skilled and finally 720 unskilled. This does not preclude
the use of multi-stage or cluster sampling. In multi-stage sampling the divisions and
locations would be selected first, then samples taken from the subpopulations at
each location (i.e. take 22 management staff – 9 per cent of 250 – at location one, 22
at location two and so on for the eight locations).
The final sample is then structured in the same way as the population as far as
staff grades are concerned. Note that stratification is only worth doing if the strata
are likely to differ in regard to the measurements being made in the sample. Other-
wise the sample is not being made more representative. For instance, absenteeism
results are likely to be different among managers, clerks, skilled workers and
unskilled workers. If the population had been stratified according to, say, colour of
eyes, it is unlikely that absenteeism would differ from stratum to stratum, and
stratified sampling would be inapplicable.
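The arithmetic of allocating a sample across strata is straightforward; a sketch for the workforce example:

proportions = {"management": 0.09, "clerical": 0.34,
               "skilled": 0.21, "unskilled": 0.36}
total = 2000
allocation = {grade: round(p * total) for grade, p in proportions.items()}
print(allocation)   # management 180, clerical 680, skilled 420, unskilled 720

# each stratum is then sampled at random within itself, e.g. with random.sample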
Weighting
Weighting is a method of recognising the existence of strata in the population after
the sampling has been carried out instead of before, as with stratification. Weighting
means that the measurements made on individual elements of the sample are
weighted so that the net effect is as if the proportions of each stratum in the sample
had been the same as those in the population. In the absenteeism investigation,
suppose that at each location the computerised payroll records did not indicate to
which staff grade those selected for the sample belonged. Only when the personnel
records were examined could this be known. Stratification before sampling is
therefore impossible. Or, at least, it would be extremely expensive to amend the
computer records. The sample of 2000 must be collected first. Suppose the strata
proportions are as in Table 6.4.
The table shows the weighting that should be given to the elements of the sam-
ple. The weighting allocated to each stratum means that the influence each stratum
has on the results is the same as its proportion in the population. If the measure-
ments being made are of days absent for each member of the sample, then these
measurements are multiplied by the appropriate weighting before calculating average
days absent for the whole sample.
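Table 6.4 is not reproduced here, so the proportions and stratum averages below are invented for illustration. The weight for each stratum is its population proportion divided by its sample proportion, which makes the weighted average behave as if the sample had the population's structure:

pop_prop    = {"mgmt": 0.09, "clerical": 0.34, "skilled": 0.21, "unskilled": 0.36}
sample_prop = {"mgmt": 0.15, "clerical": 0.40, "skilled": 0.20, "unskilled": 0.25}
days_absent = {"mgmt": 3.1,  "clerical": 4.8,  "skilled": 6.0,  "unskilled": 7.2}

weights = {g: pop_prop[g] / sample_prop[g] for g in pop_prop}
weighted_mean = sum(sample_prop[g] * weights[g] * days_absent[g] for g in pop_prop)
print(round(weighted_mean, 2))   # equals weighting each stratum by its population share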
Probability Sampling
In simple random sampling each element of the population has an equal chance of
being selected for the sample. There are circumstances when it is desirable for
elements to have differing chances of being selected. Such sampling is called
probability sampling.
For example, when a survey is carried out into the quality and nutritional value of
school meals, a random sample of schools will have to be taken and their menus
inspected. If every school has an equal chance of being chosen, children at large
schools will be under-represented in the sample. The probability of choosing menus
from small schools is greater when a sample of schools is taken than when a sample
of schoolchildren is taken. This may be important if there is a variation in meals
between large and small establishments.
The issue hinges on what is being sampled. If it is menus, then school size is
irrelevant; if it is children subjected to different types of menu, then school size is
relevant. Probability sampling would give schools different chances of being chosen,
proportional to the size of the school (measured in terms of the number of children
attending). In the final sample, children from every size of school would have an
equal chance that their menu was selected.
Variable Sampling
In variable sampling some special subpopulation is over-sampled (i.e. deliberately
over-represented). This is done when the subpopulation is of great importance and
a normal representation in the sample would be too small for accurate information
to be gleaned.
Suppose the project is to investigate the general health levels of children who
have suffered from measles. The population is all children, of specified ages, who
have had measles. A small subpopulation (perhaps 1 per cent) is formed of children
who have suffered permanent brain damage as part of their illness. Even a large
sample of 500 would contain only five such children. Just five children would be
insufficient to assess the very serious and greatly variable effects of brain damage.
Yet this is a most important part of the survey. More of such children would
purposely be included in the sample than was warranted by the subpopulation size.
If calculations of, for instance, IQ measurements were to be made from the sample,
weighting would be used to restore the sample to representativeness.
Area Sampling
Sometimes very little may be known about the population. Simple random sampling
may be difficult and expensive, but, at the same time, lack of knowledge may
prevent a variation on simple random sampling being employed. Area sampling is
an artificial breaking down of the population to make sampling easier. The popula-
tion is split into geographical areas. A sample of the areas is taken and then further
sampling is done in the few areas selected.
The survey of users’ attitudes to a particular soap powder is such a case. A listing
of the whole population of users is impossible, yet not enough is known to be able
to use multi-stage or cluster sampling. The country/district where the sampling is to
be conducted is split into geographical areas. A sample of the areas is selected at
random and then a further sample is taken in the areas chosen, perhaps by listing
households or using the electoral roll. The difficult operation of listing the popula-
tion is reduced to just a few areas.
(a) Systematic sampling. A systematic sample takes every nth name from a list of the population. To avoid the bias of always starting with the first name on the list, the starting point would be selected at random from the first n names (from the first 50, say, if every 50th name were being taken). A code sketch of this selection appears after this list of methods.
To return to the workforce absenteeism survey, the computerised payroll could
be sampled systematically. If the payroll had 1750 names and a sample of 250
was required, then every seventh name could be taken. Provided that the list is in
random order, the resulting sample is, in effect, random, but all the trouble of
numbering the list and using random numbers has been avoided. Systematic
sampling would usually provide a random sample if the payroll were in alphabet-
ical order.
There are, however, dangers in systematic sampling. If the payroll were listed in
workgroups, each group having six workers and one foreman, then taking one
name in seven might result in a sample consisting largely of foremen or consist-
ing of no foremen at all.
A systematic sample can therefore result in a sample that is, in effect, a random
sample, but time and effort has been saved. At the other extreme, if care is not
taken, the sample can be hopelessly biased.
(b) Convenience sampling. This means that the sample is selected in the easiest
way possible. This might be because a representative sample will result or it
might be that any other form of sampling is impossible.
In medicine, a sample of blood is taken from the arm. This is convenience sam-
pling. Because of knowledge of blood circulation, it is known that the sample is
as good as random.
As a further example, consider the case of a researcher into the psychological
effects on a family when a member of the family suffers from a major illness.
The researcher will probably conduct the research only on those families that are
in such a position at the time of the research and who are willing to participate.
The sample is obviously not random, but the researcher has been forced into
convenience sampling because there is no other choice. He or she must analyse
and interpret the findings in the knowledge that the sample is likely to be biased.
Mistaken conclusions are often drawn from such research by assuming the sam-
ple is random. For instance, it is likely that families agreeing to be interviewed
have a different attitude to the problem of major illness from families who are
unwilling.
(c) Quota sampling. Quota sampling is used to overcome interviewer bias. It is
frequently used in market research street interviews. If the interviewer is asked to
select people at random, it is difficult not to show bias. A young male interviewer
may, for instance, select a disproportionate number of attractive young females
for questioning.
Quota sampling gives the interviewer a list of types of people to interview. The
list may be of the form:
10 males age 35–50
10 females age 35–50
10 males age 50+
10 females age 50+
Total sample = 40
Discretion is left to the interviewer to choose the people in each category.
Note that quota sampling differs from stratified sampling. A stratum forms a
known proportion of the population; quota proportions are not known. Any
conclusions must be interpreted in the context of an artificially structured, non-
random sample.
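Here is the code sketch of the systematic selection described in (a), using the payroll example (the names are stand-ins):

import random

payroll = [f"Name {i}" for i in range(1, 1751)]   # stand-in list of 1750 names
step = len(payroll) // 250                        # every 7th name
start = random.randrange(step)                    # random start within the first 7
sample = payroll[start::step]                     # 250 names, evenly spaced
print(len(sample))                                # 250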
If samples of size 2 are taken and the average salary of each sample is computed,
then using simple random sampling there will be six possible samples:
Under stratified sampling, where each sample of two comprises one manager
from each stratum, there are four possible samples:
If a sample of size 2 were being used to estimate the population average salary
(23), simple random sampling could give an answer in the range 18 → 28, but
stratified sampling is more accurate and could only give a possible answer in the
range 20 → 26. On a large scale (in terms of population and sample size), this is the
way in which stratified sampling improves the accuracy of estimates.
In summary, simple random sampling allows (in a way as yet unspecified) the
accuracy of a sample estimate to be calculated. Stratified sampling, besides its other
advantages, improves the accuracy of sample estimates.
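The salary table itself is not reproduced above, so the four salaries in this sketch are assumed, chosen to be consistent with the quoted population mean of 23 and the two ranges:

from itertools import combinations, product

salaries = [16, 20, 24, 32]                 # hypothetical, in £000s; mean = 23
stratum_a, stratum_b = [16, 20], [24, 32]   # two strata of two managers each

srs = sorted(sum(pair) / 2 for pair in combinations(salaries, 2))
strat = sorted((a + b) / 2 for a, b in product(stratum_a, stratum_b))

print(srs)     # [18.0, 20.0, 22.0, 24.0, 26.0, 28.0] - six samples, range 18 to 28
print(strat)   # [20.0, 22.0, 24.0, 26.0]             - four samples, range 20 to 26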
6.7.2 Non-response
Envisage a sample of households with which an interviewer has to make contact to
ask questions concerning the consumption of breakfast cereals. If there is no one at
home when a particular house is visited, then there is non-response. If the house is
revisited (perhaps several times are necessary) then the sampling becomes very
expensive; if the house is not revisited then it will be omitted from the sample and
the sample is likely to be biased. The bias occurs because the households where no
one is in may not be typical of the population. For instance, if the interviewer made
his calls during the daytime then households where both marriage partners are at
work during the day would tend to be omitted, possibly creating bias in the results.
The information from the sample could not properly be applied to the whole
population but only parts of it.
Non-response does not occur solely when house visits are required, but whenev-
er the required measurements cannot be made for some elements of the sample. In
the absenteeism example, missing personnel records for some of the staff selected
for the sample would constitute a non-response problem. If efforts are made to find
or reconstruct the missing records it will be expensive; if they are left out of the
sample it will be biased since, for instance, the missing records may belong mainly to
new employees.
6.7.3 Bias
Bias is a systematic (i.e. consistent) tendency for a sampling method to produce
samples that over- or under-represent elements of the population. It can come from
many sources, and sampling frame error and non-response, as described above, are
two of them. Other sources include:
(a) Inaccurate measurement. This might be physical inaccuracy, such as a
thermometer that is calibrated 1 degree too high; it might be conceptual, such as
a salary survey that excludes perks and commission bonuses.
(b) Interviewer bias. This is when the interviewer induces biased answers. For
example, an interviewer may pose a question aggressively: ‘You don’t believe the
new job conditions are advantageous, do you?’, signalling that a particular answer
is required.
(c) Interviewee bias. Here the interviewee injects the bias. The interviewee may be
trying to impress with the extent of his knowledge and falsely claim to know
more than he does. Famously, a survey of young children’s video viewing habits
seemed to indicate that most were familiar with ‘adult’ videos until it was realised
that the children interviewed were anxious to appear more grown-up than they
were and made exaggerated claims. People of all ages tend to give biased answers
in areas relating to age, salary and sexual practice.
(d) Instrument bias. The instrument refers to the means of collecting data, such as
a questionnaire. Poorly constructed instruments can lead to biased results. For
example, questions in the questionnaire may be badly worded. ‘Why is Brand X
superior?’ raises the question of whether Brand X actually is thought to be supe-
rior. Bad questions may not just be a matter of competence: they may indicate
a desire to ‘rig’ the questionnaire to provide the answers that are wanted. It is
always advisable to ask about the source of funding for any ‘independent’ survey.
Bias is most dangerous when it exists but is not recognised. When this is the case
the results may be interpreted as if the sample were truly representative of the
population when it is not, possibly leading to false conclusions.
When bias is recognised, it does not mean that the sample is useless, merely that
care should be taken when making inferences about the population. In the example
concerned with family reactions to major illness, although the sample was likely to
be biased, the results could be quoted with the qualification that a further 20 per
cent of the population was unwilling to be interviewed and the results for them
might be quite different.
In other circumstances a biased sample is useful as a pilot for a more representa-
tive sampling study.
The second answer to the sample size question is not theoretical but is more
frequently met in practice. It is to collect the largest sample that the available budget
allows. Then the results can be interpreted in the light of the accuracy given by the
sample size.
It may seem surprising that the population size has no bearing on the sample
size. However, the population size has an effect in a different way. The definition of
simple random sampling (which is also involved in other types of sampling) is that
each member of the population should have an equal chance of being chosen.
Suppose the population is small, with only 50 members. The probability of the first
member being chosen is 1/50, the second is 1/49, the third is 1/48 and so on (i.e.
the probabilities are not equal). For a large population the difference in probabilities
is negligible and the problem is ignored. For a small population there are two
options. First, sampling could be done with replacement. This means that when an
element of the sample has been chosen, it is then returned to the population. The
probabilities of selection will then all be equal. This option means that an element
may be included in the sample twice. The second option is to use a different theory
for calculating accuracy and sample size. This second option is based on sampling
without replacement and is more complicated. A small population, therefore,
while not affecting sample size, does make a difference to the nature of sampling.
Learning Summary
It is most surprising that information collection should be so often done in apparent
ignorance of the concept of sampling. Needing information about invoices, one
large company investigated every single invoice issued and received over a three-
month period: a monumental task. A simple sampling exercise would have reduced
the cost to around 1 per cent of the actual cost with little or no loss of accuracy.
Even after it is decided to use sampling, there is still, obviously, a need for careful
planning. This should include a precise timetable of what and how things are to be
done. The crucial questions are: ‘What are the exact objectives of the study?’ and
‘Can the information be provided from any other source?’ Without this careful
planning it is possible to collect a sample and then find the required measurements
cannot be made. For example, having obtained a sample of 2000 of the workforce,
it may be found that absence records do not exist, or it may be found that another
group in the company carried out a similar survey 18 months before and their
information merely needs updating.
The range of uses of sampling is extremely wide. Whenever information has to
be collected, sampling can prove valuable. The following list gives a guide to the
applications that are frequently encountered:
(a) opinion polls of political and organisational issues;
(b) market research of consumer attitudes and preferences;
(c) medical investigations;
(d) agriculture (crop studies);
(e) accounting;
(f) quality control (inspection of manufactured output);
Review Questions
6.1 Which reason is correct?
The need to take samples arises because:
A. it is impossible to take measurements of whole populations.
B. sampling gives more accurate results.
C. sampling requires less time and money than measuring the whole population.
6.3 A British company divides the country into nine sales regions. Three of them are to be
selected at random and included in an information exercise. Using the table of random
numbers, what are the regions chosen? (Take the numbers starting at the top left, a row
at a time.)
Random Numbers
5 8 5 0 4
7 2 6 9 3
6 1 4 7 8
Sales regions
1. North West England
2. Eastern England
3. Midlands
4. London
5. South East England
6. South West England
7. Wales
8. Scotland
9. Northern Ireland
The sample will comprise:
A. SE England, Scotland, SE England.
B. SE England, Scotland, London.
C. Scotland, SE England, NW England.
D. SE England, Wales, SW England.
E. NW England, N Ireland, NW England.
6.4 The advantages of multi-stage sampling over simple random sampling are:
A. fewer observations need to be made.
B. the entire population does not have to be listed.
C. it can save time and effort by restricting the observations to a few areas of the
population only.
D. the accuracy of the results can be calculated more easily.
6.5 Which of the following statements about stratified sampling are true?
A. It is usually more representative than a simple random sample.
B. It cannot also be a cluster sample.
C. It may be more expensive than a simple random sample.
6.6 Which of the following statements about variable sampling fractions is true?
A. Measurements on one part of the sample are taken extra carefully.
B. A section of the population is deliberately over-represented in the sample.
C. The size of the sample is varied according to the type of items so far selected.
D. The results from the sample are weighted so that the sample is more repre-
sentative of the population.
6.8 What is the essential difference between stratified and quota sampling?
A. Quota sampling refers only to interviewing, whereas stratification can apply to
any sampling situation.
B. Stratification is associated with random selection within strata, whereas quota
sampling is unconnected with random methods.
C. The strata sizes in the sample correspond to the population strata sizes,
whereas the quota sizes are fixed without necessarily considering the popula-
tion.
6.9 A sample of trees is to be taken (not literally) from a forest and their growth monitored
over several years. In the forest 20 per cent of the trees are on sloping ground and 80
per cent are on level ground. A sample is taken by first dividing the forest into geo-
graphical areas of approximately equal size. There are 180 such areas (36 sloping, 144
level). Three sloping areas and 12 level ones are selected at random. In each of these 15
areas, 20 trees are selected at random for inclusion in the sample.
The resulting sample of 300 (20 × 15) trees can be said to have been selected using
which of the following sample methods?
A. Multi-stage.
B. Cluster.
C. Stratification.
D. Weighting.
E. Probability.
F. Area.
G. Systematisation.
H. Convenience.
6.10 A random sample of 25 children aged 10 years is taken and used to measure the average
height of 10-year-old children with an accuracy of ±12 cm (with 95 per cent probability
of being correct). What would have been the accuracy had the sample size been 400?
A. ±3 cm
B. ±0.75 cm
C. ±48 cm
D. ±4 cm
It is thought that branch size affects the profitability of customer accounts, because of
differing staff ratios and differing ranges of services being available at branches of
different size. Each branch has between 100 and 15 000 chequebook accounts. All
accounts are computerised regionally. Each region has its own computer, which can
provide a chronological (date of opening account) list of chequebook accounts and from
which all the necessary information on average balances and so on can be retrieved.
a. Why should the bank adopt a sampling approach rather than taking the information
from all account holders?
b. In general terms, what factors would influence the choice of sample size?
c. If a total sample of 2000 is to be collected, what sampling method would you
recommend? Why?
d. In addition, the bank wants to use the information from the sample to compare the
profitability of chequebook accounts for customers of different socioeconomic
groups. How would you do this? What extra information would you require? NB:
An account holder’s socioeconomic group can be specified by reference to his/her
occupation. The bank will classify into five socioeconomic groups.
e. What practical difficulties do you foresee?
Statistical Methods
Module 7 Distributions
Module 8 Statistical Inference
Module 9 More Distributions
Module 10 Analysis of Variance
Distributions
Contents
7.1 Introduction.............................................................................................7/1
7.2 Observed Distributions ..........................................................................7/2
7.3 Probability Concepts ..............................................................................7/8
7.4 Standard Distributions ........................................................................ 7/14
7.5 Binomial Distribution .......................................................................... 7/15
7.6 The Normal Distribution .................................................................... 7/19
Learning Summary ......................................................................................... 7/27
Review Questions ........................................................................................... 7/29
Case Study 7.1: Examination Grades ........................................................... 7/31
Case Study 7.2: Car Components................................................................. 7/31
Case Study 7.3: Credit Card Accounts ........................................................ 7/32
Case Study 7.4: Breakfast Cereals ................................................................ 7/32
7.1 Introduction
Several examples of distributions have been encountered already. In Module 5, on
summary measures, driving competition scores formed a symmetrical distribution;
viewing of television serial episodes formed a U-shaped distribution; and sickness records formed a reverse J-shaped distribution.
Figure 7.1 Market share (%) of soft drink product throughout Europe
Figure 7.1 is a mess. At first, most data are. They may have been taken from large
market research reports, dog-eared production dockets or mildewed sales invoices;
they may be the output of a computer system where no effort has been made at data
communication. Some sorting out must be done. A first attempt might be to arrange
the numbers in an ordered array, as in Table 7.1.
The numbers look neater now, but it is still difficult to get a feel for the data –
the average, the variability, etc. – as they stand. The next step is to classify the data.
Classifying means grouping the numbers in bands, such as 20–25, to make them
easier to appreciate. Each class has a frequency, which is the number of data points
(market shares) that fall within that class. A frequency table is shown in Table 7.2.
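This sorting and classifying is easily mechanised. The short Python sketch below is an illustrative aside, not part of the original text; the class width of 5 and the sample figures are assumptions made purely for the illustration.

    from collections import Counter

    # Raw figures as they might arrive, in no particular order (illustrative values).
    shares = [6, 5, 9, 11, 8, 12, 17, 23, 15, 18, 19, 21, 26, 34, 28, 36, 4, 30]

    WIDTH = 5  # assumed class width, giving bands 0-4, 5-9, 10-14, ...

    # Assign each value to the lower bound of its band, then count per band.
    counts = Counter(WIDTH * (x // WIDTH) for x in shares)

    for lower in sorted(counts):
        print(f"{lower}-{lower + WIDTH - 1}: frequency {counts[lower]}")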
[Histogram of the market share data: frequency (vertical axis, up to 30) against market share % (horizontal axis, 0 to 40).]
Table 7.3   Deliveries of food to a hypermarket (300 days)
Deliveries per day    Frequency (days)    % frequency    Probability
0–9                          54                 18            0.18
10–19                        81                 27            0.27
20–29                        75                 25            0.25
30–39                        42                 14            0.14
40–49                        29                 10            0.10
50+                          19                  6            0.06
Total                       300                100            1.00
[Histogram of the delivery data in probability terms: probability (vertical axis, 0.05 to 0.25) for each delivery class.]
Consequently:
P(deliveries exceed 45) = P(deliveries 46–49) + P(deliveries 50 or more)
= 0.04 + 0.06
= 0.10
Therefore, X = 45.
The capacity of 45 will result in overtime being needed on 10 per cent of days.
An alternative representation of a frequency distribution is as a cumulative frequency
distribution. Instead of showing the frequency for each class, a cumulative frequency
distribution shows the frequency for that class and all smaller classes. For example,
instead of recording the number of days on which the deliveries of food to a hypermar-
ket were in the ranges 0–9, 10–19, 20–29, etc., a cumulative distribution records the
number of days when deliveries were 9 or fewer, 19 or fewer, 29 or fewer, etc.
Table 7.4 turns Table 7.3 into cumulative form.
Table 7.4 shows cumulative frequencies in ‘less than or equal to’ format. They could just
as easily be in ‘more than or equal to’ format, as in Table 7.5.
These cumulative frequency tables can be put into the form of graphs. They are then
known as ogives. Figure 7.4 shows Table 7.4 and Table 7.5 as ogives.
[Figure 7.4: 'less than' and 'more than' ogives for the delivery data; cumulative frequency (0 to 300) plotted against deliveries (0 to 49).]
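Cumulative tables of this kind are simply running totals, so they can be produced directly from the class frequencies. The following Python fragment is a minimal sketch, not from the original text, using the delivery frequencies of Table 7.3; itertools.accumulate forms the running total in each direction.

    from itertools import accumulate

    # Delivery classes and their frequencies (as in Table 7.3).
    classes = ["0-9", "10-19", "20-29", "30-39", "40-49", "50+"]
    freqs = [54, 81, 75, 42, 29, 19]

    less_than = list(accumulate(freqs))              # 'less than or equal to' totals
    more_than = list(accumulate(freqs[::-1]))[::-1]  # 'more than or equal to' totals

    for c, lt, mt in zip(classes, less_than, more_than):
        print(f"{c:>6}: <= {lt:3d}   >= {mt:3d}")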
possible for the company to make a loss (event A) and be made bankrupt (event B).
The addition law for mutually exclusive events is:
P(A or B or C or …) = P(A) + P(B) + P(C) + …
This law has already been used implicitly in the example on hypermarket deliver-
ies. The probabilities of different classes were added together to give the probability
of an amalgamated class. The law was justified by relating the probabilities to the
frequencies from which they had been derived.
Example
Referring to the example on hypermarket deliveries (Table 7.3), what is the probability
of fewer than 40 deliveries on any day?
The events (the classes) in Table 7.3 are mutually exclusive. For example, given that the
number of deliveries on a day was in the range 10–19, it is not possible for that same
number of deliveries to belong to any other class. The addition law, therefore, can be
applied:
P(fewer than 40 deliveries) = P(0–9 deliveries or 10–19 or 20–29 or 30–39)
= P(0–9) + P(10–19) + P(20–29) + P(30–39)
= 0.18 + 0.27 + 0.25 + 0.14
= 0.84
Equally, the result could have been obtained by working with the frequencies from
which the probabilities were derived.
Note that if events are not mutually exclusive, this form of the addition law cannot be
used.
A conditional probability is the probability of an event under the condition that
another event has occurred or will occur. It is written in the form P(A/B), meaning the
probability of A given the occurrence of B. For example, if P(rain) is the probability of
rain later today, P(rain/dark clouds now) is the conditional probability of rain later
given that the sky is cloudy now.
The probabilities of independent events are unaffected by the occurrence or non-
occurrence of the other events. For example, the probability of rain later today is not
affected by dark clouds a month ago and so the events are independent. On the other
hand, the probability of rain later today is affected by the presence of dark clouds now.
These events are not independent. The definition of independence is based on condi-
tional probability. Event A is independent of B if:
P(A) = P(A/B)
This equation is merely the mathematical way of saying that the probability of A is not
affected by the occurrence of B. The idea of independence leads to the multiplication
law of probability for independent events:
P(A and B and C and …) = P(A) × P(B) × P(C) × …
The derivation of this law will be explained in the following example.
Example
Twenty per cent of microchips produced by a certain process are defective. What is the
probability of picking at random three chips that are defective, defective, OK, in that
order?
The three events are independent since the chips were chosen at random. The multipli-
cation law can therefore be applied.
P(1st chip defective and 2nd chip defective and 3rd chip OK)
= P(defective) × P(defective) × P(OK)
= 0.2 × 0.2 × 0.8
= 0.032
This result can be verified intuitively by thinking of choosing three chips 1000 times.
According to the probabilities, the 1000 selections are likely to break down as in
Figure 7.5. Note that in a practical experiment the probability of, for instance, 20 per
cent would not guarantee that exactly 200 of the first chips would be defective and 800
OK, but it is the most likely outcome. Figure 7.5 shows 32 occasions when two
defectives are followed by an OK; 32 in 1000 is a proportion of 0.032 as given by the
multiplication law.
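The same verification can be carried out by actually simulating the 1000 selections rather than imagining them. The fragment below is a rough sketch, not part of the original text; with a different random seed the count will vary around the theoretical 3.2 per cent.

    import random

    random.seed(1)  # fixed seed so the illustration is repeatable

    P_DEFECTIVE = 0.2
    TRIALS = 1000

    # Count selections showing the pattern defective, defective, OK, in that order.
    hits = 0
    for _ in range(TRIALS):
        chips = [random.random() < P_DEFECTIVE for _ in range(3)]
        if chips == [True, True, False]:
            hits += 1

    print(hits / TRIALS)  # close to, though rarely exactly, 0.032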
[Table 7.6: the 24 possible orderings of the four candidates A, B, C and D, built up from four possibilities for first place, three for second place, two for third place and one for fourth place.]
Second, repetition occurs because the order of the two selected for the subcom-
mittee does not affect its constitution. It is the same subcommittee of A and B
whether they were selected first A then B or vice versa. In Table 7.6 each pair of
orderings 1,7; 2,8; 3,14 etc. provides just one subcommittee. Consequently, the
number of subcommittees can be further halved from 12 to 6. Since the process
started with all possible sequences of the four candidates, and since allowance has
now been made for all repetitions, this is the final number of different subcommit-
tees. They are:
AB, AC, AD, BC, BD, CD
Put more concisely, the number of ways of choosing these six was calculated
from:

4!/(2! × (4 − 2)!) = 24/(2 × 2) = 6

More generally, the number of ways of selecting r objects from a larger group of
n objects is:

n!/(r! × (n − r)!)
These calculations are not of immediate practical value, but they are the basis for
the derivation of the binomial distribution later in the module.
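Although not needed for hand calculation here, the formula is built into Python's standard library. As a brief aside (not part of the original text), the sketch below checks the subcommittee count both ways:

    from math import comb, factorial

    # Two-person subcommittees from four candidates: 4!/(2! * 2!) = 6
    print(comb(4, 2))  # 6

    def n_choose_r(n: int, r: int) -> int:
        """Ways of selecting r objects from n, by the factorial formula."""
        return factorial(n) // (factorial(r) * factorial(n - r))

    print(n_choose_r(4, 2))  # 6, agreeing with math.comb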
7.5.1 Characteristics
The binomial distribution is discrete (the values taken by the variable are distinct),
giving rise to stepped shapes. Figure 7.7 illustrates this and also that the shape can
vary from right-skewed through symmetrical to left-skewed depending upon the
situation in which the data are collected.
The elements of a statistical population are of two types. Each element must be
of one but only one type. The proportion, p, of the population that is of the
first type is known (and the proportion of the second type is therefore 1 − p).
A random sample of size n is taken. Because the sample is random, the number
of elements of the first type it contains is not certain (it could be 0, 1, 2, 3, …
or n) depending upon chance.
From this theoretical situation the probabilities that the sample contains given
numbers (from 0 up to n) of elements of the first type can be calculated. If a great
many samples were actually collected, a histogram would be gradually built up, the
variable being the number of elements of the first type in the sample. Probabilities
measured from the histogram should match those theoretically calculated (approxi-
mately but not necessarily exactly, as with the coin tossing example). The
probabilities calculated theoretically are binomial probabilities; the distribution
formed from these probabilities is a binomial distribution.
For example, a machine produces microchip circuits for use in children’s toys.
The circuits can be tested and found to be defective or OK. The machine has been
designed to produce no more than 20 per cent defective chips. A sample of 30 chips
is collected. Assuming that overall exactly 20 per cent of chips are defective, the
probabilities that the sample contains 0, 1, 2, … or 30 defective chips can be
calculated as described above.
The following example shows how this formula can be used. If 40 per cent of the
USA electorate are Republican voters, what is the probability that a randomly
assembled group of three will contain two Republicans?
Using the binomial formula:
P(2 Republicans in group of 3)
= P(r = 2)
= n!/(r! × (n − r)!) × p^r × (1 − p)^(n − r)
= 3!/(2! × (3 − 2)!) × p^2 × (1 − p)^1    since sample size, n, equals 3
= 3!/(2! × 1!) × 0.4^2 × 0.6^1    since the proportion of type 1 (Republicans) is 0.4
= 3 × 0.16 × 0.6
= 0.288
There is a 28.8 per cent chance that a group of three will contain two Republican
voters.
Making similar calculations for r = 0, 1 and 3, we obtain the binomial distribution
of Figure 7.8.
[Figure 7.8: the binomial distribution for n = 3, p = 0.4: P(r = 0) = 21.6%, P(r = 1) = 43.2%, P(r = 2) = 28.8%, P(r = 3) = 6.4%.]
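The four percentages in Figure 7.8 can be confirmed with a few lines of Python (an illustrative aside, not part of the original text), math.comb supplying the number-of-ways term:

    from math import comb

    n, p = 3, 0.4  # group size and proportion of Republicans in the electorate

    for r in range(n + 1):
        prob = comb(n, r) * p**r * (1 - p)**(n - r)
        print(f"P(r = {r}) = {prob:.3f}")  # 0.216, 0.432, 0.288, 0.064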
Example
A manufacturer has a contract with a supplier that no more than 5 per cent of the
supply of a particular component will be defective. The component is delivered in lorry
loads of several hundred. From each delivery a random sample of 20 is taken and
inspected. If three or more of the 20 are defective, the load is rejected. What is the
probability of a load being rejected even though the supplier is sending an acceptable
proportion of 5 per cent defective?
The binomial distribution is applicable, since the population is split two ways and the
samples are random. Table A1.1 in Appendix 1 is a binomial table for samples of size 20.
In Table A1.1 the rows refer to the number of defective components in the
sample; the columns refer to the proportion of defectives in the population of all
components supplied (0.05 here). The column headed 0.05 in Table A1.1 in
Appendix 1 gives a probability of 6.0 per cent for three defectives in the sample,
and 1.6 per cent for four or more defectives.
The manufacturer rejects loads when there are three or more defectives in the sample.
From the above, the probability of a load being rejected even though the supplier is
sending an acceptable proportion of defectives is (6.0% + 1.6% =) 7.6 per cent.
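The same answer can be reached without tables. The sketch below is an aside, not from the original text; it computes the probability exactly and gives 7.5 per cent, the small difference from 7.6 arising because the table's entries are rounded to one decimal place before being added.

    from math import comb

    n, p = 20, 0.05  # sample size and the contractual proportion of defectives

    # P(0, 1 or 2 defectives), then the complement: P(3 or more) means rejection.
    p_accept = sum(comb(n, r) * p**r * (1 - p)**(n - r) for r in range(3))
    print(f"P(reject) = {1 - p_accept:.4f}")  # about 0.0754, i.e. 7.5 per cent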
7.5.5 Parameters
The distributions in Figure 7.7 differ because the parameters differ. Parameters fix
the context within which a variable varies. The binomial distribution has two
parameters: the sample size, n, and the population proportion of elements of the
first type, p. Two binomial distributions with different parameters, while still having
the same broad characteristics shown in Figure 7.7, will be different; two binomial
distributions with the same sample size and the same proportion, p, will be identical.
The variable is still free to take different values but the parameters place restrictions
on the extent to which different values can be taken.
The binomial probability formula in Section 7.5.3 demonstrates why this is so.
Once n and p are fixed, the binomial probability is fixed for each r value; if n and p
were changed to different values, the situation would still be binomial (the same
general formula is being used), but the probabilities and the histogram would differ.
Right-skewed shapes occur when p is small (close to 0), left-skewed shapes when p
is large (close to 1), and symmetrical shapes when p is near 0.5 or n is large (how
large depends upon the p value). This can be checked by drawing histograms for
different parameter values.
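Such a check need not involve drawing. The fragment below is illustrative only, with n = 10 chosen arbitrarily; it prints the distribution for three p values, and the probabilities bunch at the low end for small p (a right skew), sit symmetrically for p = 0.5, and bunch at the high end for large p (a left skew).

    from math import comb

    n = 10
    for p in (0.1, 0.5, 0.9):
        pmf = [comb(n, r) * p**r * (1 - p)**(n - r) for r in range(n + 1)]
        print(f"p = {p}: " + " ".join(f"{q:.2f}" for q in pmf))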
7.6.1 Characteristics
The normal distribution is bell shaped and symmetrical, as in Figure 7.9. It is also
continuous. The distributions met up to now have been discrete. A distribution is
discrete if the variable takes distinct values such as 1, 2, 3, … but not those in
between. The delivery distribution (Table 7.3) is discrete since the data are in
groups; the binomial distribution is discrete since the variable (r) can take only
whole number values. Continuous means that all values are possible for the variable.
It could take the values 41.576, 41.577, 41.578, … rather than all these values being
grouped together in the class 40–49 or the variable being limited to whole numbers
only.
[Figure 7.9: the normal distribution, a bell-shaped symmetrical curve; about 68 per cent of the area lies within one standard deviation of the mean, with the two and three standard deviation bands also marked.]
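The areas marked in Figure 7.9 can be reproduced with the standard library's statistics.NormalDist; the fragment below is a brief aside, not part of the original text.

    from statistics import NormalDist

    z = NormalDist()  # standard normal distribution: mean 0, standard deviation 1

    for k in (1, 2, 3):
        area = z.cdf(k) - z.cdf(-k)
        print(f"within {k} standard deviation(s) of the mean: {area:.1%}")
    # about 68.3%, 95.4% and 99.7% respectively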
of disturbance. Each source gives rise to