BIOMETRY
THE PRINCIPLES AND PRACTICE OF
STATISTICS IN BIOLOGICAL RESEARCH
THIRD EDITION
Robert R. SOKAL and F. James ROHLF
State University of New York at Stony Brook
W. H. FREEMAN AND COMPANY
New York

CONTENTS
PREFACE
NOTES ON THE THIRD EDITION
INTRODUCTION
1.1 Some Definitions
1.2 The Development of Biometry
1.3 The Statistical Frame of Mind
DATA IN BIOLOGY
2.1 Samples and Populations
2.2 Variables in Biology
2.3 Accuracy and Precision of Data
2.4 Derived Variables
2.5 Frequency Distributions
THE HANDLING OF DATA
3.1 Computers
3.2 Software
3.3 Efficiency and Economy in Data Processing
DESCRIPTIVE STATISTICS
4.1 The Arithmetic Mean
4.2 Other Means
4.3 The Median
4.4 The Mode
4.5 The Range
4.6 The Standard Deviation
Library of Congress Cataloging-in-Publication Data
Sokal, Robert R.
  Biometry: the principles and practice of statistics in biological research /
  Robert R. Sokal and F. James Rohlf.--3rd ed.
    p. cm.
  Includes bibliographical references (p. 850) and index.
  ISBN 0-7167-2411-1
  1. Biometry. I. Rohlf, F. James, 1936-  II. Title.
QH323.5.S63 1995
574'.01'5195--dc20    94-11120
CIP
© 1995, 1981, 1969 by W. H. Freeman and Company

No part of this book may be reproduced by any mechanical, photographic, or electronic
process, or in the form of a phonograph recording, nor may it be stored in a retrieval
system, transmitted, or otherwise copied for public or private use, without written
permission from the publisher.

Printed in the United States of America

123456789 VB 9987654
To our parents of blessed memory
Klara and Siegfried Sokal
Harriet and Gilbert Rohlf
13 ASSUMPTIONS OF ANALYSIS OF VARIANCE
13.1 A Fundamental Assumption
13.2 Independence
13.3 Homogeneity of Variances
13.4 Normality
13.5 Additivity
13.6 Transformations
13.7 The Logarithmic Transformation
13.8 The Square-Root Transformation
13.9 The Box-Cox Transformation
13.10 The Arcsine Transformation
13.11 Nonparametric Methods in Lieu of Single-Classification Anovas
13.12 Nonparametric Methods in Lieu of Two-Way Anova
14 LINEAR REGRESSION
14.1 Introduction to Regression
14.2 Models in Regression
14.3 The Linear Regression Equation
14.4 Tests of Significance in Regression
14.5 More Than One Value of Y for Each Value of X
14.6 The Uses of Regression
14.7 Estimating X from Y
14.8 Comparing Regression Lines
14.9 Analysis of Covariance
14.10 Linear Comparisons in Anovas
14.11 Examining Residuals and Transformations in Regression
14.12 Nonparametric Test for Regression
14.13 Model II Regression
15 CORRELATION
15.1 Correlation and Regression
15.2 The Product-Moment Correlation Coefficient
15.3 The Variance of Sums and Differences
15.4 Computing the Product-Moment Correlation Coefficient
15.5 Significance Tests in Correlation
15.6 Applications of Correlation
15.7 Principal Axes and Confidence Regions
15.8 Nonparametric Tests for Association
16 MULTIPLE AND CURVILINEAR REGRESSION
16.1 Multiple Regression: Computation
16.2 Multiple Regression: Significance Tests
16.3 Path Analysis
16.4 Partial and Multiple Correlation
16.5 Choosing Predictor Variables
16.6 Curvilinear Regression
16.7 Advanced Topics in Regression and Correlation
17 ANALYSIS OF FREQUENCIES
17.1 Introduction to Tests for Goodness of Fit
17.2 Single-Classification Tests for Goodness of Fit
17.3 Replicated Tests of Goodness of Fit
17.4 Tests of Independence: Two-Way Tables
17.5 Analysis of Three-Way and Multiway Tables
17.6 Analysis of Proportions
17.7 Randomized Blocks for Frequency Data
18 MISCELLANEOUS METHODS
18.1 Combining Probabilities From Tests of Significance
18.2 Tests for Randomness of Nominal Data: Runs Tests
18.3 Randomization Tests
18.4 The Jackknife and the Bootstrap
18.5 The Future of Biometry: Data Analysis
APPENDIX: MATHEMATICAL PROOFS
BIBLIOGRAPHY
AUTHOR INDEX
SUBJECT INDEX
4.7 Sample Statistics and Parameters
4.8 Coding Data Before Computation
4.9 Computing Means and Standard Deviations
4.10 The Coefficient of Variation
5 INTRODUCTION TO PROBABILITY DISTRIBUTIONS:
BINOMIAL AND POISSON
5.1 Probability, Random Sampling, and Hypothesis Testing
5.2 The Binomial Distribution
5.3 The Poisson Distribution
5.4 Other Discrete Probability Distributions
6 THE NORMAL PROBABILITY DISTRIBUTION
6.1 Frequency Distributions of Continuous Variables
6.2 Properties of the Normal Distribution
6.3 A Model for the Normal Distribution
6.4 Applications of the Normal Distribution
6.5 Fitting a Normal Distribution to Observed Data
6.6 Skewness and Kurtosis
6.7 Graphic Methods
6.8 Other Continuous Distributions
7 ESTIMATION AND HYPOTHESIS TESTING
7.1 Distribution and Variance of Means
7.2 Distribution and Variance of Other Statistics
7.3 Introduction to Confidence Limits
7.4 The t-Distribution
7.5 Confidence Limits Based on Sample Statistics
7.6 The Chi-Square Distribution
7.7 Confidence Limits for Variances
7.8 Introduction to Hypothesis Testing
7.9 Tests of Simple Hypotheses Using the Normal and t Distributions
7.10 Testing the Hypothesis H₀: σ² = σ₀²
8 INTRODUCTION TO THE ANALYSIS OF VARIANCE
8.1 Variances of Samples and Their Means
8.2 The F-Distribution
8.3 The Hypothesis H₀: σ₁² = σ₂²
8.4 Heterogeneity Among Sample Means
8.5 Partitioning the Total Sum of Squares and Degrees of Freedom
8.6 Model I Anova
8.7 Model II Anova
9 SINGLE-CLASSIFICATION ANALYSIS OF VARIANCE
9.1 Computational Formulas
9.2 General Case: Unequal n
9.3 Special Case: Equal n
9.4 Special Case: Two Groups
9.5 Special Case: A Single Specimen Compared With a Sample
9.6 Comparisons Among Means: Planned Comparisons
9.7 Comparisons Among Means: Unplanned Comparisons
9.8 Finding the Sample Size Required for a Test
10 NESTED ANALYSIS OF VARIANCE
10.1 Nested Anova: Design
10.2 Nested Anova: Computation
10.3 Nested Anovas With Unequal Sample Sizes
10.4 The Optimal Allocation of Resources
11 TWO-WAY ANALYSIS OF VARIANCE
11.1 Two-Way Anova: Design
11.2 Two-Way Anova With Equal Replication: Computation
11.3 Two-Way Anova: Significance Testing
11.4 Two-Way Anova Without Replication
11.5 Paired Comparisons
11.6 Unequal Subclass Sizes
11.7 Missing Values in a Randomized-Blocks Design
12 MULTIWAY ANALYSIS OF VARIANCE
12.1 The Factorial Design
12.2 A Three-Way Factorial Anova
12.3 Higher-Order Factorial Anovas
12.4 Other Designs
12.5 Anovas by Computer
2 DATA IN BIOLOGY
In Section 2.1 we explain the statistical meaning of "sample" and
"population," terms used throughout this book. Then we come to the types of
observations obtained from biological research material, with which we shall
perform the computations in the rest of this book (Section 2.2). In obtaining data
we shall run into the problem of the degree of accuracy necessary for recording
the data. This problem and the procedure for rounding off figures are discussed
in Section 2.3, after which we will be ready to consider in Section 2.4 certain
kinds of derived data, such as ratios and indices, frequently used in biological
science, which present peculiar problems with respect to their accuracy and
distribution. Knowing how to arrange data as frequency distributions is impor-
tant, because such arrangements permit us to get an overall impression of the
shape of the variation present in a sample. Frequency distributions, as well as the
presentation of numerical data, are discussed in the last section (2.5) of this
chapter.
2.1 SAMPLES AND POPULATIONS
We shall now define a number of important terms necessary for an understanding
of biological data. The data in a biometric study are generally based on individ-
ual observations, which are observations or measurements taken on the small-
est sampling unit. These smallest sampling units frequently, but not necessarily,
are also individuals in the ordinary biological sense. If we measure weight in 100
rats, then the weight of each rat is an individual observation; the hundred rat
weights together represent the sample of observations, defined as a collection of
individual observations selected by a specified procedure. In this instance, one
individual observation is based on one individual in a biological sense—that is,
one rat. However, if we had studied weight in a single rat over a period of time,
the sample of individual observations would be all the weights recorded on one
rat at successive times. In a study of temperature in ant colonies, where each
colony is a basic sampling unit, each temperature reading for one colony is an
individual observation, and the sample of observations is the temperatures for all
the colonies considered. An estimate of the DNA content of a single mammalian
sperm cell is an individual observation, and the corresponding sample of obser-
vations is the estimates of DNA content of all other sperm cells studied in one
individual mammal. A synonym for individual observation is "item."
Up to now we have carefully avoided specifying the particular variable being
studied because "individual observation" and "sample of observations" as we
just used them define only the structure but not the nature of the data in a study.
The actual property measured by the individual observations is the variable, or
character. The more common term employed in general statistics is variable. In
evolutionary and systematic biology, however, character is frequently used syn-
onymously. More than one variable can be measured on each smallest sampling
unit. Thus, in a group of 25 mice we might measure the blood pH and the
erythrocyte count. The mouse (a biological individual) would be the smallest
sampling unit; blood pH and cell count would be the two variables studied. In
this example the pH readings and cell counts are individual observations, and
two samples of 25 observations on pH and erythrocyte count would result. Al-
ternatively, we may call this example a bivariate sample of 25 observations,
each referring to a pH reading paired with an erythrocyte count.
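The sampling structure just described (one bivariate sample of 25 observations, or two univariate samples of 25) can be sketched in code. This is only an illustration: the mice and their readings below are invented, and the value ranges are plausible placeholders, not data from the text.

```python
import random

# Hypothetical bivariate sample: 25 mice, two variables measured on each
# smallest sampling unit (the mouse). All values are invented for illustration.
random.seed(1)
mice = [
    {"blood_pH": round(random.uniform(7.2, 7.5), 2),
     "erythrocytes_per_uL": random.randint(8_000_000, 11_000_000)}
    for _ in range(25)
]

# The same data viewed as two univariate samples of 25 observations each:
ph_sample = [m["blood_pH"] for m in mice]
count_sample = [m["erythrocytes_per_uL"] for m in mice]

assert len(ph_sample) == len(count_sample) == 25
```

Each dictionary is one smallest sampling unit; each of its two entries is one individual observation.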
Next we define population. The biological definition of this term is well
known: It refers to all the individuals of a given species (perhaps of a given life
history stage or sex) found in a circumscribed area at a given time. In statistics,
population always means the totality of individual observations about which
inferences are to be made, existing anywhere in the world or at least within a
definitely specified sampling area limited in space and time. If you take five
humans and study the number of leucocytes in their peripheral blood and you are
prepared to draw conclusions about all humans from this sample of five, then the
population from which the sample has been drawn represents the leucocyte
counts of all humankind—that is, all extant members of the species Homo
sapiens. If, on the other hand, you restrict yourself to a more narrowly specified
sample, such as five male Chinese, aged 20, and you are restricting your conclu-
sions to this particular group, then the population from which you are sampling
will be leucocyte numbers of all Chinese males of age 20. The population in this
statistical sense is sometimes referred to as the universe. A population may refer
to variables of a concrete collection of objects or creatures—such as the tail
lengths of all the white mice in the world, the leucocyte counts of all the Chinese
men in the world of age 20, or the DNA contents of all the hamster sperm cells in
existence—or it may refer to the outcomes of experiments—such as all the
heartbeat frequencies produced in guinea pigs by injections of adrenalin. In the
first three cases the population is finite. Although in practice it would be impos-
sible to collect, count, and examine all white mice, all Chinese men of age 20, or
all hamster sperm cells in the world, these populations are finite. Certain smaller
populations, such as all the whooping cranes in North America or all the pocket
gophers in a given colony, may lie within reach of a total census. By contrast, an
experiment can be repeated an infinite number of times (at least in theory). An
experiment such as the administration of adrenalin to guinea pigs could be re-
peated as long as the experimenter could obtain material and his or her health and
patience held out. The sample of experiments performed is a sample from an
infinite number that could be performed. Some of the statistical methods to be
developed later make a distinction between sampling from finite and from infi-
nite populations. However, although populations are theoretically finite in most
applications in biology, they are generally so much larger than samples drawn
from them that they can be considered as de facto infinitely sized populations.
2.2 VARIABLES IN BIOLOGY
Enumerating all the possible kinds of variables that can be studied in biological
research would be a hopeless task. Each discipline has its own set of variables,
which may include conventional morphological measurements, concentrations
of chemicals in body fluids, rates of certain biological processes, frequencies of
certain events (as in genetics and radiation biology), physical readings of optical
or electronic machinery used in biological research, and many more. We assume
that persons reading this book already have special interests and have become
acquainted with the methodologies of research in their areas of interest, so that
the variables commonly studied in their fields are at least somewhat familiar. In
any case, the problems for measurement and enumeration must suggest them-
selves to the researcher; statistics will not, in general, contribute to the discovery
and definition of such variables.
Some exception must be made to this statement. Once a variable has been
chosen, statistical analysis may demonstrate it to be unreliable. If several vari-
ables are studied, certain elaborate procedures of multivariate analysis can assign
weights to them, indicating their value for a given procedure. For example, in
taxonomy and various other applications, the method of discriminant functions
can identify the combination of a series of variables that best distinguishes be-
tween two groups (see Section 16.7). Other multivariate techniques, such as
canonical variates analysis, principal components analysis, or factor analysis,
can specify characters that best represent or summarize certain patterns of varia-
tion (Krzanowski, 1988; Jackson, 1991). As a general rule, however, and partic-
ularly within the framework of this book, choosing a variable as well as defining
the problem to be solved is primarily the responsibility of the biological re-
searcher.
A more precise definition of variable than the one given earlier is desirable:
It is a property with respect to which individuals in a sample differ in some
ascertainable way. If the property does not differ within a sample at hand or
at least among the samples being studied, it cannot be of statistical interest.
Being entirely uniform, such a property would also not be a variable from the
etymological point of view and should not be so called. Length, height, weight,
number of teeth, vitamin C content, and genotypes are examples of variables in
ordinary, genetically and phenotypically diverse groups of organisms. Warm-
bloodedness in a group of mammals is not a variable because they are all alike in
this regard, although body temperature of individual mammals would, of course,
be a variable. Also, if we had a heterogeneous group of animals, of which some
were homeothermic and others were not, then body temperature regulation (with
its two states or forms of expression, "warm-blooded" and "cold-blooded")
would be a variable.
We can classify variables as follows:

Variables
    Measurement variables
        Continuous variables
        Discontinuous variables
    Ranked variables
    Attributes
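The classification above can be rendered as a simple data structure, a sketch only. The grouping is the one from the text; the example variables listed under each heading are ones discussed in this chapter.

```python
# The classification of variables, as a nested dictionary. Keys follow the
# outline in the text; the examples are drawn from this chapter's discussion.
variable_kinds = {
    "measurement": {
        "continuous": ["length", "weight", "temperature", "percentage"],
        "discontinuous (meristic)": ["number of segments",
                                     "number of offspring",
                                     "number of plants in a quadrat"],
    },
    "ranked": ["order of emergence of pupae"],
    "attributes": ["warm-blooded vs. cold-blooded"],
}

for kind in variable_kinds:
    print(kind)
```

The two-level nesting under "measurement" mirrors the subdivision into continuous and discontinuous variables defined in the paragraphs that follow.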
Measurement variables are those whose differing states can be expressed in
a numerically ordered fashion. There are two types of measurement variables:
continuous and discontinuous. Continuous variables at least theoretically can
assume an infinite number of values between any two fixed points. For example,
between the two length measurements 1.5 cm and 1.6 cm an infinite number of
lengths could be measured if one were so inclined and had a measuring instru-
ment with sufficiently precise calibration. Any given reading of a continuous
variable, such as a length of 1.57 cm, is an approximation to the exact reading,
which in practice cannot be known. For purposes of computation, however, these
approximations are usually sufficient and, as will be seen below, may be made
even more approximate by rounding. Many of the variables studied in biology
are continuous. Examples are length, area, volume, weight, angle, temperature,
period of time, percentage, and rate.

Discontinuous variables, also known as meristic variables (the term we use
in this book) or discrete variables, are variables that have only certain fixed
numerical values, with no intermediate values possible. The number of segments
in a certain insect appendage, for instance, may be 4 or 5 or 6 but never 5½ or 4.3.
Examples of discontinuous variables are number of a certain structure (such as
segments, bristles, teeth, or glands), number of offspring, number of colonies of
microorganisms or animals, or number of plants in a given quadrat.
A word of caution: not all variables restricted to integral numerical values are
meristic. An example will illustrate this point. If an animal behaviorist were to
code the reactions of animals in a series of experiments as (1) very aggressive,
(2) aggressive, (3) neutral, (4) submissive, and (5) very submissive, we might be
tempted to believe that these five different states of the variable were meristic
because they assume integral values. However, they are clearly only arbitrary
points (class marks, see Section 2.5) along a continuum of aggressiveness; the
only reason that no values such as 1.5 occur is that the experimenter did not wish
to subdivide the behavior classes too finely, either for convenience or because of
an inability to determine more than five subdivisions of this spectrum of behav-
ior with precision. Thus, this variable is clearly continuous rather than meristic,
as it might have appeared at first sight.
Some variables cannot be measured but at least can be ordered or ranked by
their magnitude. Thus, in an experiment one might record the rank order of
emergence of ten pupae without specifying the exact time at which each pupa
emerged. In such a case we would code the data as a ranked variable, the order
of emergence. Special methods for dealing with ranked variables have been
developed, and several are described in this book. By expressing a variable as a
series of ranks, such as 1, 2, 3, 4, 5, we do not imply that the difference in
magnitude between, say, ranks 1 and 2 is identical to or even proportional to the
difference between ranks 2 and 3.
BOX 2.1 PREPARATION OF FREQUENCY DISTRIBUTION AND GROUPING INTO FEWER CLASSES WITH WIDER CLASS INTERVALS
Twenty-five femur lengths of stem mothers of the aphid Pemphigus: the original measurements, the original frequency distribution, and the same distribution grouped into fewer classes with wider class intervals, with histograms of the ungrouped and grouped distributions.
If the original data provide us with fewer classes than we think we should
have, then nothing can be done if the variable is meristic, since this is the nature
of the data in question. With a continuous variable, however, a scarcity of classes
indicates that we probably have not made our measurements with sufficient
precision. If we had followed the rule on number of significant figures stated in
Section 2.3, this could not have happened.

Whenever there are more than the desired number of classes, grouping should
be undertaken. When the data are meristic, the implied limits of continuous
variables are meaningless. Yet with many meristic variables, such as a bristle
number varying from 13 to 81, it would probably be wise to group them into
classes, each containing several counts. This grouping can best be done by using
an odd number as a class interval so that the class mark representing the data is a
whole rather than a fractional number. Thus if we were to group the bristle
numbers 13, 14, 15, and 16 into one class, the class mark would have to be 14.5,
a meaningless value in terms of bristle number. It would therefore be better to use
a class ranging over 3 bristles or 5 bristles, giving the integral values 14 or 15 as
class marks.
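The arithmetic behind the odd-interval rule is easy to verify with a short sketch; `class_mark` is a hypothetical helper written for this illustration, not a routine from the book.

```python
# Why an odd class interval gives whole-number class marks for meristic data:
# grouping bristle counts 13..16 (an even width of 4) puts the class mark at
# (13 + 16) / 2 = 14.5, impossible as a bristle number, whereas odd widths of
# 3 (13..15) or 5 (13..17) give the integral marks 14 and 15.

def class_mark(low, high):
    """Midpoint of a class running from `low` to `high` (inclusive counts)."""
    return (low + high) / 2

assert class_mark(13, 16) == 14.5   # even width 4 -> fractional mark
assert class_mark(13, 15) == 14.0   # odd width 3  -> whole-number mark
assert class_mark(13, 17) == 15.0   # odd width 5  -> whole-number mark
```

In general, an interval spanning an odd number of consecutive integers always has an integer midpoint, which is why the text recommends odd class intervals for meristic data.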
Grouping data into frequency distributions was necessary when computations
were done by pencil and paper or with mechanical calculators. Nowadays even
thousands of variates can be processed efficiently by computer without prior
grouping. However, frequency distributions are still extremely useful as a tool
for data analysis, especially in an age when it is all too easy for a researcher to
obtain a numerical result from a computer program without ever really examin-
ing the data for outliers (extreme values) or for other ways in which the sample
may not conform to the assumptions of the statistical methods. For this reason
most modern statistical computer programs furnish some graphic output of the
frequency distribution of the observations.
An alternative to setting up a frequency distribution with tally marks as shown
in Box 2.1 is the so-called stem-and-leaf display suggested by Tukey (1977).
The advantage of this technique is that it not only results in a frequency distribu-
tion of the variates of a sample, but also presents them in a form that makes a
ranked (ordered) array very easy to construct. It also effectively creates a list of
these values—and is very easy to check, unlike the tallies, which can be checked
only by repeating the procedure. This technique is therefore useful in computing
the median of a sample (see Section 4.3) and in computing various nonparamet-
ric statistics that require ordered arrays of the sample variates (see Sections 13.11
and 13.12).
Let us first learn how to construct the stem-and-leaf display. In Box 15.6 we
feature measurements of total length recorded for a sample of 15 aphid stem
mothers. The unordered measurements are reproduced here: 8.7, 8.5, 9.4, 10.0,
6.3, 7.8, 11.9, 6.5, 6.6, 10.6, 10.2, 7.2, 8.6, 11.1, 11.6. To prepare a stem-and-leaf
display we write down the leading digit or digits of the variates in the sample to
the left of a vertical line (the "stem") as shown below; we then put the next digit
of the first variate (a "leaf") at that level of the stem corresponding to its leading
digit(s).
          Step 1   Step 2   ...   Step 7   ...   Completed array   Ordered
                                                 (step 15)         array
   6 |                            3              356               356
   7 |                            8              82                28
   8 |   7        75              75             756               567
   9 |                            4              4                 4
  10 |                            0              062               026
  11 |                            9              916               169
The first observation in our sample is 8.7. We therefore place a 7 next to the 8.
The next variate is 8.5. It is entered by finding the stem level for the leading digit
8 and recording a 5 next to the 7 that is already there. Similarly for the third
variate, 9.4, we record a 4 next to the 9, and so on until all 15 variates have been
entered (as "leaves") in sequence along the appropriate leading digits of the
stem. Finally, we order the leaves from smallest (0) to largest (9).

The ordered array is equivalent to a frequency distribution and has the ap-
pearance of a histogram or bar diagram (see below), but it also is an efficient
ordering of the variates. Thus from the ordered array it becomes obvious that the
appropriate ordering of the 15 variates is: 6.3, 6.5, 6.6, 7.2, 7.8, 8.5, 8.6, 8.7, 9.4,
10.0, 10.2, 10.6, 11.1, 11.6, 11.9. The median, the observation having an equal
number of variates on either side, can easily be read off the stem-and-leaf dis-
play. It is 8.7.
The computational advantage of this procedure is that it orders the variates by
their leading digits in a single pass through the data set, relying on further
ordering of the trailing digits in each leaf by eye, whereas a direct ordering
technique could require up to n passes through the data before all items are
correctly arrayed. A FORTRAN program for preparing stem-and-leaf displays is
given by McNeil (1977). If most variates are expressed in three digits, the lead-
ing digits (to the left of the stem) can be expressed as two-digit numbers (as was
done for the last two lines in the example above), or two-digit values can be
displayed as leaves to the right of the stem. In the latter case, however, the
two-digit leaves would have to be enclosed in parentheses or separated by
commas to prevent confusion. Thus if the observations ranged from 1.17 to 8.53,
for example, the leading digits to the left of the stem could range from 1 to 8, and
the leaves of one class, say 7, might read 26, 31, 47, corresponding to 7.26, 7.31,
and 7.47.
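The single-pass construction described above can be sketched as follows. This is an illustrative Python version, not McNeil's FORTRAN program; it assumes one-decimal data such as the 15 aphid measurements from the text.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """One pass through the data: split each variate into a leading-digit
    stem and a one-digit leaf, appending leaves in the order encountered."""
    stems = defaultdict(list)
    for v in values:
        stem, leaf = divmod(round(v * 10), 10)  # e.g. 8.7 -> stem 8, leaf 7
        stems[stem].append(leaf)
    return stems

# The 15 unordered aphid measurements from the text:
aphids = [8.7, 8.5, 9.4, 10.0, 6.3, 7.8, 11.9, 6.5, 6.6,
          10.6, 10.2, 7.2, 8.6, 11.1, 11.6]

display = stem_and_leaf(aphids)
for stem in sorted(display):          # order the leaves only at the end
    leaves = "".join(str(leaf) for leaf in sorted(display[stem]))
    print(f"{stem:2d} | {leaves}")
# The median (the 8th of the 15 ordered variates) can be read off: 8.7
```

Within each stem the leaves arrive in input order, exactly as in the step-by-step array above; sorting happens only per stem, never over the whole sample, which is the source of the single-pass advantage.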
In biometric work we frequently wish to compare two samples to see if they
differ. In such cases, we may employ back-to-back stem-and-leaf displays, as
illustrated here:
    Sample A |    | Sample B
          96 | 10 | 0578
      555542 | 11 | 16
    55500551 | 12 | 013
This example is taken from Box 13.7 and describes a morphological measure-
ment obtained from two samples of chiggers. By setting up the stem in such a
way that it may serve for both samples, we can easily compare the frequencies.
Even though these data furnish only three classes for the stem, we can readily see
that the samples differ in their distributions. Sample A has by far the higher
readings.
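A back-to-back display of this kind can be sketched in the same single-pass way. The two integer samples below are invented for illustration and are not the chigger data of Box 13.7.

```python
from collections import defaultdict

def back_to_back(sample_a, sample_b):
    """Two stem-and-leaf displays sharing one stem; leaves of the left-hand
    sample run right-to-left, as in the text. Assumes integer variates split
    as tens (stem) and units (leaf)."""
    leaves = {"a": defaultdict(list), "b": defaultdict(list)}
    for key, sample in (("a", sample_a), ("b", sample_b)):
        for v in sample:
            stem, leaf = divmod(v, 10)
            leaves[key][stem].append(leaf)
    lines = []
    for stem in sorted(set(leaves["a"]) | set(leaves["b"])):
        left = "".join(str(l) for l in sorted(leaves["a"][stem], reverse=True))
        right = "".join(str(l) for l in sorted(leaves["b"][stem]))
        lines.append(f"{left:>10} | {stem:2d} | {right}")
    return lines

# Invented integer samples for illustration (not the chigger data):
for line in back_to_back([109, 116, 121, 125, 118], [100, 105, 107, 113]):
    print(line)
```

Reversing the sort for the left sample reproduces the printed convention in which both sets of leaves grow outward from the shared stem.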
When the shape of a frequency distribution is of particular interest, we may
often wish to present the distribution in graphic form when discussing the results.
This is generally done with frequency diagrams, of which there are two common
types. For a distribution of meristic data we use a bar diagram, as shown in
Figure 2.2 for the sedge data of Table 2.3. The abscissa represents the variable
(in our case the number of plants per quadrat), and the ordinate represents the
frequencies. What is important about such a diagram is that the bars do not touch
each other, which indicates that the variable is not continuous.
By contrast, continuous variables, such as the frequency distribution of the
femur lengths of aphid stem mothers, are graphed as a histogram, in which the
width of each bar along the abscissa represents a class interval of the frequency
distribution and the bars touch each other to show that the actual limits of the
classes are contiguous. The midpoint of the bar corresponds to the class mark. At
the bottom of Box 2.1 are histograms of the frequency distribution of the aphid
data, ungrouped and grouped. The height of the bars represents the frequency of
each class. To illustrate that histograms are appropriate approximations to the
continuous distributions found in nature, we may take a histogram and make the
class intervals more narrow, producing more classes. The histogram would then
clearly fit more closely to a continuous distribution. We can continue this pro-
cess until the class intervals approach the limit of infinitesimal width. At this
point the histogram becomes the continuous distribution of the variable. Occa-
sionally the class intervals of a grouped continuous frequency distribution are
unequal. For instance, in a frequency distribution of ages, we might have more
detail on the different stages of young individuals and less accurate identifi-
cation of the ages of old individuals. In such cases, the class intervals for the
older age groups would be wider, those for the younger age groups, narrower.
In representations of such data, the bars of the histogram are drawn with
different widths.
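The grouping that underlies such a histogram can be sketched as follows. The equal-width classes and the half-open class-limit convention here are illustrative choices, applied to the 15 aphid total lengths used earlier in the chapter.

```python
def frequency_distribution(values, lower, width, n_classes):
    """Group continuous variates into equal-width classes; returns a list of
    (class_mark, frequency) pairs. Class limits follow the half-open
    convention [lower, upper)."""
    counts = [0] * n_classes
    for v in values:
        k = int((v - lower) // width)      # index of the class containing v
        if 0 <= k < n_classes:
            counts[k] += 1
    return [(lower + width * (k + 0.5), counts[k]) for k in range(n_classes)]

# The 15 aphid total lengths used earlier in the chapter:
aphids = [8.7, 8.5, 9.4, 10.0, 6.3, 7.8, 11.9, 6.5, 6.6,
          10.6, 10.2, 7.2, 8.6, 11.1, 11.6]

# A crude text histogram: class mark on the left, one star per variate.
for mark, f in frequency_distribution(aphids, 6.0, 1.0, 6):
    print(f"{mark:5.1f}  {'*' * f}")
```

Narrowing `width` (and raising `n_classes`) reproduces the limiting process described in the text, by which the histogram approaches the continuous distribution.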
FIGURE 2.3 Frequency polygon. Birth weights of 9465 male infants; Chinese third-
class patients in Singapore, 1950 and 1951. Abscissa: birth weight (oz); ordinate:
frequency. (Data from Millis and Seng, 1954.)
Figure 2.3 shows another graphic mode of representing a frequency distribu-
tion of a continuous variable—birth weight in infants. This is a frequency poly-
gon, in which the heights of the class marks in a histogram are connected by
straight lines. As we shall see later, the shapes of frequency distributions as seen
in these various graphs can reveal much about the biological situations affecting
a given variable.
EXERCISES 2
2.1 Differentiate between the following pairs of terms and give an example of each. (a)
Statistical and biological populations. (b) Variate and individual. (c) Accuracy and
precision (repeatability). (d) Class interval and class mark. (e) Bar diagram and
histogram. (f) Abscissa and ordinate.
2.2 Round the following numbers to three significant figures: 106.85, 0.06819, 3.0495,
7815.01, 2.9149, and 20.1500. What are the implied limits before and after round-
ing? Round these same numbers to one decimal place.
2.3 Given 200 measurements ranging from 1.32 to 2.95 mm, how would you group them
into a frequency distribution? Give class limits as well as class marks.
2.4 Group the following 40 measurements of interorbital width of a sample of domestic
pigeons into a frequency distribution and draw its histogram (data from Olson and
Miller, 1958). Measurements are in millimeters.
12.2  12.9  11.8  11.9  11.6  11.1  12.3  12.2  11.8  11.8
10.7  11.5  11.3  11.2  11.6  11.9  13.3  11.2  10.5  11.1
12.1  11.9  10.4  10.7  10.8  11.0  11.9  10.2  10.9  11.6
10.8  11.6  10.4  10.7  12.0  12.4  11.7  11.8  11.3  11.1
2.5 How precisely should you measure the wing length of a species of mosquitoes in a
study of geographic variation, if the smallest specimen has a length of about 2.8 mm
and the largest a length of about 3.5 mm?
2.6 Transform the 40 measurements in Exercise 2.4 into common logarithms with your
calculator and make a frequency distribution of these transformed variates. Comment
on the resulting change in the pattern of the frequency distribution from that found
before.
2.7 In Exercise 4.3, we feature 120 percentages of butterfat from Ayrshire cows. Make a
frequency distribution of these values using a stem-and-leaf display. Prepare an
ordered array of these variates from the display. Save the display for use later in
Exercise 4.3.
3 THE HANDLING OF DATA
We have already stressed in the preface that the skillful and expe-
ditious handling of data is essential to the successful practice of statistics. Since
the emphasis of this book is to a large extent on practical statistics, it is necessary
to acquaint readers with the various techniques available for carrying out statis-
tical computations. We discuss computer hardware in Section 3.1, software in
Section 3.2. In Section 3.3 we focus on the important question of which compu-
tational devices and types of software are most appropriate in terms of efficiency
and economy.
Lacking mechanical computational aids, we would be reduced to carrying out
statistical computations by the so-called pencil-and-paper methods. Textbooks
of statistics used to contain extensive sections dealing with clever ways to make
computation by hand feasible. At present, however, the use of shortcut or ap-
proximate computations by pencil-and-paper methods isa very inefficient use of
time and energy. Even handheld calculators are now used much less than they
were before, although the more advanced models are capable of performing
‘many of the simpler analyses presented in this text, With microcomputers on
‘most desktops and battery-powered portable computers, computation by hand or
with calculators is used mostly to verify the initial results produced by computer.
The benefit from these developments is that researchers can concentrate on
the more important problems of the proper design of an experiment and the
interpretation of the statistical results rather than spending time on computational
details and special tricks to make computations less tedious. We have eliminated
explanation of all the special methods for hand or calculator computation and
expect that this will make the material in this text easier to learn.
Another benefit of the routine use of computers is that more powerful, complex
analyses now are usually feasible. Some of these methods are able to give
more exact probability values for standard tests of significance. Other methods
(see Chapter 18) enable one to make significance tests for nonstandard statistics
or for situations in which the usual distributional assumptions are not likely to
be true.
3.1 COMPUTERS
Since the publication of the first edition of Biometry, a revolution in the types of
equipment available for performing statistical computations has occurred. The
once standard, electrically driven, mechanical desk calculators that were used for
routine statistical calculations in biology have completely disappeared. They
were replaced first by a wide variety of electronic calculators (ranging from
pocket to desktop models). Except for very simple computations, their use has
now been largely replaced by computers. Although some calculators are capable
of performing t-tests, regression analysis, and similar calculations, they are now
used most often just for making rough checks or for manipulating results from a
computer.
Calculators range from devices that can only add, subtract, multiply, and
divide to scientific calculators with many mathematical and statistical functions
built in. Their principal limitation is that they usually can store only limited
amounts of data. This means that one must reenter the data in order to try
an alternative analysis (e.g., with various transformations, with outliers removed,
and so on). Reentry is very tedious and greatly increases the chance
for error.
It is difficult to know where to draw the line between the more sophisticated
electronic calculators and digital computers. There is a continuous range of
capabilities from programmable calculators through palm-top microcomputers,
notebook and laptop computers, desktop microcomputers, workstations, and
minicomputers, to large-scale mainframe and supercomputers. The largest are
usually located in a central computation center of a university or research
laboratory.
Programmable calculators are usually programmed in languages unique to the
manufacturer of the device. This means that considerable effort may be required
to implement an analysis on a new device, which limits the development of
software. Even so, collections of statistical programs (often donated by other
users) are usually available. These collections do not, however, compare to the
depth of libraries of programs available for most computers. Although computers
can be programmed in "machine language" specific to each line of computers,
programs are now usually written using standard high-level languages
such as BASIC, C, FORTRAN, or Pascal, enabling the accumulation of large
libraries of programs.
Computers consist of three main components: the central processing unit
(which performs the calculations and controls the other components), the memory
(which stores both data and instructions), and peripheral devices (which
handle the input of data and instructions, the output of results, and perhaps
intermediate storage of data and instructions). Different devices vary considerably
in the capabilities of these three components. In a simple calculator, the
processor is the arithmetic unit that adds, subtracts, multiplies, and divides,
and the memory may consist of only a few registers, each capable of storing a
10-digit number. The only peripheral devices may be the keyboard for entering
the data and a light-emitting diode (LED) or liquid crystal (LCD) display.
At present, standard microcomputers usually have 4 × 10⁶ to 8 × 10⁶ eight-bit
numbers or characters (bytes) of main memory and 300 × 10⁶ bytes of disk
storage, and can perform about 1.8 × 10⁶ average floating-point operations
(arithmetic operations on numbers with a decimal place) per second. Large mainframe
computers now have capabilities such as 500 × 10⁶ bytes of main memory
and 2 × 10¹⁰ bytes of secondary storage. They can process 800 × 10⁶
floating-point operations per second. There are also specialized computers consisting
of many processors tied together to work on a single problem that can
achieve even faster combined processing speeds. Of course, with the rapid developments
in computing technology over the past few years, we expect these
numbers to be out of date by the time this book is in print and to seem very small
by the time this book is next revised. These figures are given just for comparison.
Because of aggressive marketing, most users are quite impressed with recent
developments in microcomputers and workstations. The equally impressive developments
in supercomputers needed for the solution of many large-scale problems
are not as visible.
The increasing speed of computation is important for many applications.
Computers perform no operations that could not, in principle, be done by hand.
However, tasks that large-scale computers can perform in a matter of seconds
might take thousands of years for humans to complete. Computer hardware is
also very reliable. It is unlikely that a person working on a very long calculation
would not make an error somewhere along the way unless time were allowed for
independent checking of all results. It is now possible to solve problems that one
would scarcely even have contemplated before computers were developed.
Generally speaking, the largest and fastest computers are the most efficient
and economical for carrying out a given computation, but other factors often
influence the decision of what computing hardware is used. Availability is, of
course, very important. Because of their cost, very large-scale computers are
located only at major universities or laboratories, and their use is shared among
hundreds or thousands of authorized users. These computers can, however, be
accessed easily from anywhere in the world via terminals or microcomputers
attached to high-speed networks. Most statistical analyses require relatively little
computation, so the capabilities of a large-scale computer are not needed. (There
may be more overhead associated with making the connection to such a computer
than with the actual arithmetic.) Often the most important factor is the
availability of appropriate software, which is discussed in the next section.
3.2 SOFTWARE
The instructions to a computer are known as the program, and the persons who
specialize in the writing of computer programs are known as programmers.
Writing instructions in the generally unique language for each model of computer
is a very tedious task and is now seldom done by programmers concerned
with software written for statistical applications. Fortunately, programs called
compilers have been developed to translate instructions written in more standard,
problem-oriented languages into the machine language for a particular
computer. Some of the best-known compiled languages are BASIC, C, FORTRAN,
and Pascal. FORTRAN is one of the oldest computer languages and is still very
popular for numerical calculations, including large-scale vector and parallel
computations on many supercomputers. There are also programs called interpreters
that translate and execute each instruction as it is entered, which can be
very convenient for developing software that is changed often and adapted to
different applications. The best-known interpreter is for the BASIC language.
Many new computer languages, which simplify the development of software
for particular types of applications, are now available. There are, for example,
several powerful systems with built-in commands for statistical and database
operations. Recently, systems have been developed that allow the user
to write a program by pointing to objects (icons representing input devices or
files, particular mathematical operations, statistical analyses, various types of
graphs, etc.) on the computer screen and then drawing connections between them
to indicate the flow of information between objects (e.g., from an input file, to a
particular statistical analysis, to a graph of the results). Menus and dialogue
boxes are used to specify various options for the properties of the objects. This
object-oriented programming approach means that a user no longer has to be a
programmer in order to create software appropriate for a certain specialized task.
It does, however, make it essential for a researcher to be computer literate.
The material presented in this book consists of relatively standard statistical
computations, most of which have been programmed as part of many different
statistical packages. The BIOM-pc package of software for IBM PC-compatible computers
was developed by one of us (F.J.R.) to carry out the computations for most of the
topics covered in this book. An older version, written in FORTRAN, can be
adapted to run on other computers, including mini- and mainframe computers.
(These programs are available from Exeter Software, 100 North Country Road,
Setauket, New York 11733.)
The specific steps used during computations on the computer depend on the
particular computer and software used. At one time, when using centralized
computer centers, one had to employ a batch mode of computation. In such an
environment one left card decks of programs and data to be run when computer
time was available; printed output was returned hours later or even the next day.
Now large computer centers can be accessed via computer terminals (simple
devices that consist of just a keyboard and a CRT screen and operate only under
the control of the remote computer) or microcomputers that use software that
emulates the properties of a computer terminal.
This new environment permits an interactive mode of computation in which
the computer responds rapidly to each user input. The connections can be via
directly connected lines, dial-up phone lines, or various types of network con-
nections allowing convenient access to the resources of a large computer no
matter where the computer is physically located. An important limitation of this
mode of computing is the speed of communication between the user and the
remote computer. Speed limits applications such as interactive graphics (real-time
rotation of three-dimensional objects), as well as the practicality of applications
such as word processing, in which the screen is reformatted as each character
is typed and in which very little actual computation is performed.
Although communication speeds are increasing rapidly, these operations are
more efficient if performed locally. This is one of the reasons for the popularity
of powerful microcomputers and personal workstations, where it is practical to
have convenient, easy-to-use, and elegant graphical user interfaces. Other reasons
for the popularity of small, powerful computers are more psychological:
the user generally feels more free to experiment because there are no usage
charges, nor the formality of passwords, computer allocations, and logging on to
a computer at a remote center. The most important factor in choosing a system
should be simply the availability of software to perform the desired tasks. An
enormous quantity of software for personal computers has been developed in
recent years.
3.3 EFFICIENCY AND ECONOMY IN DATA PROCESSING
From the foregoing description it may seem to the reader that a computer should
always be preferred over a calculator. This is usually the case, but simpler calculations
can often be done more quickly on a calculator. (One may be done before
most computers finish booting.) On the other hand, even simple computations
often need to be checked or repeated in a slightly different way. It is very
convenient not to have to reenter the input more than once and to have a printed
record of all input and operations performed leading to a given result.
The difficult decision is not whether to use a computer but which computer
and software to use. Software for most of the computations described in this
book is available on most computers. Thus, the choice of hardware usually is
determined by factors such as the type of computer already present in one's lab.
The computations described in this book can be performed on, but do not require,
large-scale computing systems (except, possibly, for very large randomization
tests, Mantel tests, and other complex computations). It is no longer necessary or
even efficient for most users to write their own software, since there is usually an
overwhelming array of choices of software that can be used for statistical computations.
Not all software, however, is of the same quality. There can be different
degrees of rounding error, or the software may sometimes produce erroneous
results because of bugs in the program. It is a good idea to check a software
package by using standard data sets (such as the examples in the boxes in this
text) before running one's own data. One should also read published reviews of
the software before purchasing it.
Most statistical software can be classified into one of the following major
groups (some major packages combine the features of two or more groups):
specialized single-operation programs, command-language-driven programs,
menu-driven programs, and spreadsheets. The earliest statistical software was
developed for the batch-computing environment. The program would read a
previously prepared file of data, perform a single type of analysis, and then
output the results (which usually would include many possible analyses in case
they might interest the user).
Alternatively, large statistical packages were developed that combined many
types of analysis. With such a package, the user selects the particular operations
desired using statements in an artificial command language unique to that software.
This type of software requires time to set up the data and to learn the
commands and various options needed for a particular analysis, but it is usually a
very efficient mode for processing large datasets. These programs also usually
offer the widest selection of options. User-friendly, menu-driven programs permit
rapid data input and feature easy options for performing standard statistical
analyses. Options can be conveniently selected from lists in menus or in dialogue
boxes. Some packages combine the advantages of these last two modes by letting
the user employ a menu system to prepare the command file needed to carry out
the desired analysis.
Although spreadsheet programs are not often used for statistical calculations
in biology, their very different mode of computation is often useful. They might
be viewed as the computer generalization of the calculator. They simulate a large
sheet of paper that has been divided into cells. Each cell can contain a value
entered by the user or a result of a computation based on other cells in the
spreadsheet. The operations used to produce a particular numerical result or
graph are remembered by the program. If an input cell is changed by the user, the
consequent changes in the results are immediately recomputed and displayed.
This feature makes it very easy to experiment with data and see the results of
alternative analyses.
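The recomputation model just described can be sketched in a few lines of Python. This is a toy illustration, not any actual spreadsheet product; the cell names, values, and formulas are all invented.

```python
# Toy sketch of the spreadsheet model described above: cells hold input
# values, formulas compute results from cells, and changing an input cell
# changes the recomputed results. Entirely illustrative.
cells = {"A1": 10.0, "A2": 4.0}                      # input cells
formulas = {"B1": lambda c: c["A1"] + c["A2"],       # B1 = A1 + A2
            "B2": lambda c: c["A1"] / c["A2"]}       # B2 = A1 / A2

def recompute(cells, formulas):
    """Recompute every formula cell from the current input cells."""
    return {name: f(cells) for name, f in formulas.items()}

print(recompute(cells, formulas))   # {'B1': 14.0, 'B2': 2.5}
cells["A2"] = 5.0                   # change an input cell ...
print(recompute(cells, formulas))   # ... and the results change on recomputation
```

A real spreadsheet tracks dependencies between cells so that only affected results are recomputed, but the idea of automatic recalculation is the same.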
A danger inherent in computer processing of data is that the user may simply
obtain the final statistical results without observing the distribution of the variates
(in fact, without even seeing the data if they are collected automatically by
various data-acquisition devices). One should take advantage of any options
available to display the data, which could lead to interesting new insights into their
nature, to the rejection of some outlying observations, or to suggestions that
the data do not conform to the assumptions of a particular statistical test. Most
statistical packages are capable of providing such graphics. We strongly urge
research workers to make use of such options.
Another danger is that it is too easy to blindly use whatever tests are provided
with a particular program without understanding their meanings and assumptions
(Searle, 1989). The availability of computers relieves the tedium of computation,
but not the necessity of understanding the methods being employed.
DESCRIPTIVE STATISTICS
An early and fundamental stage in any science is the descriptive
stage. Until the facts can be described accurately, analysis of their causes is
premature. The question what must come before how. The application of statistics
to biology has followed these general trends. Before Francis Galton could
begin to think about the relations between the heights of fathers and those of their
sons, he had to have adequate tools for measuring and describing heights in a
population. Similarly, unless we know something about the usual distribution of
the sugar content of blood in a population of guinea pigs, as well as its fluctuations
from day to day and within days, we cannot ascertain the effect of a given
dose of a drug upon this variable.
In a sizable sample, obtaining knowledge of the material by contemplating all
the individual observations would be tedious. We need some form of summary to
deal with the data in manageable form, as well as to share our findings with
others in scientific talks and publications. A histogram or bar diagram of the
frequency distribution is one type of summary. For most purposes, however, a
numerical summary is needed to describe the properties of the observed frequency
distribution concisely and accurately. Quantities providing such a summary
are called descriptive statistics. This chapter will introduce you to some of
them and show how they are computed.
Two kinds of descriptive statistics will be discussed in this chapter: statistics
of location and statistics of dispersion. Statistics of location describe the position
of a sample along a given dimension representing a variable. For example,
we might like to know whether the sample variates measuring the length of
certain animals lie in the vicinity of 2 cm or 20 cm. A statistic of location will
yield a representative value for the sample of observations. However, such a
statistic (sometimes also known as a measure of central tendency) does not
describe the shape of a frequency distribution. This distribution may be long or
very narrow; it may be humped or U-shaped; it may contain two humps, or it
may be markedly asymmetrical. Quantitative measures of such aspects of frequency
distributions are required. To this end we need to define and study
statistics of dispersion.
The arithmetic mean, described in Section 4.1, is undoubtedly the most important
single statistic of location, but others (the geometric mean, the harmonic
mean, the median, and the mode) are mentioned briefly in Sections 4.2, 4.3, and
4.4. A simple statistic of dispersion, the range, is briefly noted in Section 4.5, and
the standard deviation, the most common statistic for describing dispersion, is
explained in Section 4.6. Our first encounter with contrasts between sample
statistics and population parameters occurs in Section 4.7, in connection with
statistics of location and dispersion. Section 4.8 contains a description of methods
of coding data to simplify the computation of the mean and standard deviation,
which is discussed in Section 4.9. The coefficient of variation (a statistic
that permits us to compare the relative amount of dispersion in different samples)
is explained in the last section (4.10).
For n > 30, the approximation C₄ ≈ 1 + [4(n − 1)]⁻¹ is sufficiently accurate.
4.8 CODING DATA BEFORE COMPUTATION
Coding the original data is a far less important subject at present, when computers
carry out statistical analyses rapidly and easily, than it was in the days
when coding was essential for carrying out most computations. By coding we mean
the addition or subtraction of a constant number to the original data and/or the
multiplication or division of these data by a constant. Data may need to be coded
because they were originally expressed in too many digits or are very large
numbers that may cause difficulties and errors during data handling. Coding can
therefore simplify computation appreciably, and for certain techniques, such as
polynomial regression (see Section 16.6), it can be very useful to reduce rounding
error (Bradley and Srivastava, 1979). The types of coding shown here are
linear transformations of the variables. Persons using statistics should know the
effects of such transformations on means, standard deviations, and any other
statistics they intend to employ.
Additive coding is the addition or subtraction of a constant (since subtraction
is addition of a negative number). Similarly, multiplicative coding is the multiplication
or division by a constant (since division is multiplication by the reciprocal
of the divisor). Combination coding is the application of both additive and
multiplicative coding to the same set of data. In Section A.2 of the appendix we
examine the consequences of the three types of coding for computing means,
variances, and standard deviations.
For the case of means, the formula for combination coding and decoding is the
most generally applicable one. If the coded variable is Yc = D(Y + C), then

    Ȳ = Ȳc/D − C

where C is an additive code and D is a multiplicative code. Additive codes have
no effect, however, on the sums of squares, variances, or standard deviations.
The mathematical proof is given in Section A.2, but this can be seen intuitively,
because an additive code has no effect on the distance of an item from its mean.
For example, the distance from an item of 15 to its mean of 10 would be 5. If we
were to code the variates by subtracting a constant of 10, the item would now be
5 and the mean zero, but the difference between them would still be 5. Thus if
only additive coding is employed, the only statistic in need of decoding is the
mean. Multiplicative coding, on the other hand, does have an effect on sums of
squares, variances, and standard deviations. The standard deviations have to be
divided by the multiplicative code, just as had to be done for the mean; the sums
of squares and variances have to be divided by the multiplicative code squared,
because they are squared terms, and the multiplicative factor became squared
during the operations. In combination coding the additive code can be ignored.
An example of coding and decoding data is shown in Box 4.3.
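As a concrete sketch of combination coding and decoding, consider the following Python fragment. The five variates are invented for the illustration; the codes C = −59.5 and D = 1/8 mirror the subtract-59.5, divide-by-8 scheme used in Box 4.3.

```python
# A minimal sketch of combination coding, Yc = D(Y + C), and of decoding:
# Ybar = Ybar_c/D - C for the mean, s = s_c/D for the standard deviation
# (the additive code drops out of s). The variates are invented.
import statistics

Y = [59.5, 67.5, 75.5, 83.5, 91.5]   # hypothetical variates (class marks)
C = -59.5                            # additive code: subtract 59.5
D = 1 / 8                            # multiplicative code: divide by 8

Yc = [D * (y + C) for y in Y]        # coded variates: 0, 1, 2, 3, 4

mean_decoded = statistics.mean(Yc) / D - C   # decode the mean
sd_decoded = statistics.stdev(Yc) / D        # decode the standard deviation

# Decoding recovers the statistics of the original, uncoded data.
assert abs(mean_decoded - statistics.mean(Y)) < 1e-9
assert abs(sd_decoded - statistics.stdev(Y)) < 1e-9
```

Note that the additive code C appears in the decoding of the mean but not of the standard deviation, exactly as argued above.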
4.9 COMPUTING MEANS AND STANDARD DEVIATIONS
Three steps are necessary for computing the standard deviation: (1) finding Σy²,
the sum of squares; (2) dividing by n − 1 to give the variance; and (3) taking the
square root of the variance to obtain the standard deviation. The procedure used
to compute the sum of squares in Section 4.6 can be expressed by the following
formula:

    Σy² = Σ(Y − Ȳ)²    (4.7)
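The three steps can be sketched directly in Python; the seven sample values below are invented for the illustration, not the aphid measurements.

```python
# Sketch of the three-step computation of the standard deviation described
# above. The variates are invented for the illustration.
import math

Y = [3.8, 4.1, 3.6, 4.4, 4.0, 3.9, 4.2]       # hypothetical variates
n = len(Y)
ybar = sum(Y) / n                             # arithmetic mean (here 4.0)

ss = sum((y - ybar) ** 2 for y in Y)          # step 1: sum of squares
variance = ss / (n - 1)                       # step 2: divide by n - 1
sd = math.sqrt(variance)                      # step 3: take the square root
```

The same result is returned by a library routine such as Python's `statistics.stdev`, which is a convenient cross-check.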
When the data are unordered, the computation proceeds as in Box 4.2, which is
based on the unordered aphid femur length data shown at the head of Box 2.1.
Occasionally original data are already in the form of a frequency distribution,
or the person computing the statistics may want to avoid manual entry of large
numbers of individual variates, in which case setting up a frequency distribution
is also advisable. Data already arrayed in a frequency distribution speed up the
computations considerably. An example is shown in Box 4.3. Hand calculations
are simplified, and data entry into computers is less tedious (and hence there will
be less chance for input errors), by coding to remove the awkward class marks.
We coded each class mark in Box 4.3 by subtracting 59.5, the lowest class mark
of the array. The resulting class marks are the values 0, 8, 16, 24, 32, and so on.
Dividing these values by 8 changes them to 0, 1, 2, 3, 4, and so on, which is the
desired format, shown in column (3). Details of the computation are given in Box
4.3.
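A frequency-distribution version of the computation can be sketched as follows; the class marks and frequencies are invented for the illustration, not the Box 4.3 data.

```python
# Sketch of computing Ybar and s from a frequency distribution: each class
# mark Y is weighted by its frequency f. Class marks and frequencies are
# invented for the illustration.
import math

classes = [(59.5, 2), (67.5, 6), (75.5, 39), (83.5, 25), (91.5, 8)]  # (Y, f)
n = sum(f for _, f in classes)
ybar = sum(f * y for y, f in classes) / n          # weighted mean

ss = sum(f * (y - ybar) ** 2 for y, f in classes)  # weighted sum of squares
sd = math.sqrt(ss / (n - 1))                       # standard deviation
```

The weighted computation gives the same result as expanding each class into f repeated variates, which is how it can be checked.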
Computer programs that compute the basic statistics Ȳ, s², s, and others that
we have not yet discussed are furnished in many commercially available programs.
The BIOM-pc program version 3 accepts raw, unordered observations as
input, as well as data in the form of a frequency distribution.
An approximate method for estimating statistics is useful when checking the
results of calculations because it enables the detection of gross errors in computation.
A simple method for estimating the mean is to average the largest and
smallest observations to obtain the midrange. For the aphid stem mother data of
Box 2.1, this value is (4.7 + 3.3)/2 = 4.0, which happens to fall almost exactly
on the computed sample mean (but, of course, this will not be true of other data
sets). Standard deviations can be estimated from ranges by appropriate division
of the range:
    For samples of      divide the range by
          10                    3
          30                    4
         100                    5
         500                    6
        1000                    6.5
The range of the aphid data is 1.4. When this value is divided by 4, we get an
estimate of the standard deviation of 0.35, which compares not too badly with the
calculated value of 0.3657 in Box 4.2.
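The rule of thumb is easy to check by simulation. The sketch below draws an invented random sample of 30 values and compares the midrange with the mean and range/4 with the computed standard deviation.

```python
# Sketch: for a random sample of about 30, the midrange should roughly
# approximate the mean, and range/4 should roughly approximate s. The
# sample is randomly generated for the illustration.
import random
import statistics

random.seed(1)
Y = [random.gauss(4.0, 0.35) for _ in range(30)]   # hypothetical sample

midrange = (max(Y) + min(Y)) / 2     # rough estimate of the mean
s_rough = (max(Y) - min(Y)) / 4      # range divided by 4 for n = 30

print(midrange, statistics.mean(Y))  # the two should be close
print(s_rough, statistics.stdev(Y))  # likewise
```

These are only rough checks for detecting gross errors, as the text emphasizes, not substitutes for the full computation.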
A more accurate procedure for estimating statistics is to use Statistical Table
I, which furnishes the mean range for different sample sizes of a normal distribution
(see Chapter 6) with a variance of one. When we divide the range of
a sample by a mean range from Table I, we obtain an estimate of the standard
deviation of the population from which the sample was taken. Thus, for
BOX 4.2  CALCULATION OF Ȳ AND s FROM UNORDERED DATA

Based on aphid femur length data, unordered, as shown at the head of Box 2.1.

Computation

    n = 25;  ΣY = 100.1;  Ȳ = 4.004

    Σy² = ΣY² − (ΣY)²/n = 3.2096

    s² = Σy²/(n − 1) = 3.2096/24 = 0.1337

    s = √0.1337 = 0.3657
BOX 4.3  CALCULATION OF Ȳ, s, AND V FROM A FREQUENCY DISTRIBUTION

Birth weights of male Chinese in ounces (from Box 4.1).

[Table of class marks Y (59.5, 67.5, 75.5, …), their frequencies f (n = 9465), and coded class marks Yc = (Y − 59.5)/8 = 0, 1, 2, 3, …]

Source: Millis and Seng (1954).
BOX 4.3  CONTINUED

    ΣfY = 1,040,199.5;  Ȳ = 1,040,199.5/9465 = 109.9

    V = (s/Ȳ) × 100 = (13.59/109.9) × 100 = 12.37%
the aphid data we look up n = 25 in Table I and obtain 3.931. We estimate
s = 1.4/3.931 = 0.356, a value closer to the sample standard deviation than
that obtained by the rougher method discussed above (which, however, is
based on the same principle and assumption).
4.10 THE COEFFICIENT OF VARIATION
Having obtained the standard deviation as a measure of the amount of variation
in the data, you may legitimately ask, What can I do with it? At this stage in our
comprehension of statistical theory, nothing really useful comes of the computations
we have carried out, although the skills learned are basic to all statistical
work. So far, the only use that we might have for the standard deviation is as
an estimate of the amount of variation in a population. Thus we may wish to
compare the magnitudes of the standard deviations of similar populations to see
whether population A is more or less variable than population B. When populations
differ appreciably in their means, however, the direct comparison of their
variances or standard deviations is less useful, since larger organisms usually
vary more than smaller ones. For instance, the standard deviation of the tail
lengths of elephants is obviously much greater than the entire tail length of a
mouse. To compare the relative amounts of variation in populations having
different means, the coefficient of variation, symbolized by V (or occasionally
CV), has been developed. This coefficient is simply the standard deviation expressed
as a percentage of the mean. Its formula is

    V = (s × 100)/Ȳ    (4.8)
For example, the coefficient of variation of the birth weights in Box 4.3 is
12.37%, as shown at the bottom of that box. The coefficient of variation is
independent of the unit of measurement and is expressed as a percentage. The
coefficient of variation as computed above is a biased estimator of the population
V. The following estimate, V*, is corrected for bias:

    V* = (1 + 1/4n)V    (4.9)
In small samples this correction can make an appreciable difference. Note that
when using Expression (4.9), the standard deviation used to compute V should
not be corrected using C₄, because this would result in an overcorrection. Note
also that the correction factor approximates C₄, the correction factor used in
Section 4.7.
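In code, Expressions (4.8) and (4.9) amount to the following sketch; the seven sample values are invented for the illustration.

```python
# Sketch of the coefficient of variation V = 100*s/Ybar, Expression (4.8),
# and its small-sample bias correction V* = (1 + 1/4n)V, Expression (4.9).
# The sample values are invented for the illustration.
import statistics

Y = [4.7, 3.9, 4.2, 3.6, 4.1, 4.4, 3.8]   # hypothetical measurements
n = len(Y)

V = statistics.stdev(Y) * 100 / statistics.mean(Y)   # Expression (4.8)
V_star = (1 + 1 / (4 * n)) * V                       # Expression (4.9)
```

For n = 7 the correction factor is 1 + 1/28, so V* exceeds V by about 3.6%; as n grows the two converge.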
Coefficients of variation are used extensively when comparing the variation
of two populations independent of the magnitude of their means. Whether the
birth weights of the Chinese children (see Box 4.1) are more or less variable than
the femur lengths of the aphid stem mothers (see Box 2.1) is probably of little
interest, but we can calculate the latter as 0.3657 × 100/4.004 = 9.13%, which
would suggest that the birth weights are more variable. More commonly, we
might wish to test whether a given biological sample is more variable for one
character than for another. For example, for a sample of rats, is body weight
more variable than blood sugar content? Another frequent type of comparison,
especially in systematics, is among different populations for the same character.
If, for instance, we had measured wing length in samples of birds from several
localities, we might wish to know whether any one of these populations is more
variable than the others. An answer to this question can be obtained by examining
the coefficients of variation of wing length in these samples.
Employing the coefficient of variation in a comparison between two variables
or two populations assumes that the variable in the second sample is proportional
to that in the first. Thus we could write Y₂ = kY₁, where k is a constant of
proportionality. If two variables are related in this manner, their coefficients of
variation should be identical. This should be obvious if we remember that k is a
multiplicative code. Thus Ȳ₂ = kȲ₁ and s₂ = ks₁, and consequently

    V₂ = 100s₂/Ȳ₂ = 100ks₁/kȲ₁ = 100s₁/Ȳ₁ = V₁
If the variables are transformed to logarithms (see Section 13.7), the relationship
between them can be written as ln Y₂ = ln k + ln Y₁, and since ln k is a constant,
the variances of ln Y₂ and ln Y₁ are identical. This relation can lead to a test of
equality of coefficients of variation (see Lewontin, 1966; Sokal and Braumann,
1980).
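The proportionality argument is easy to verify numerically; the sketch below uses invented values and an arbitrary constant k.

```python
# Sketch: if Y2 = k * Y1, the coefficients of variation of the two variables
# are identical, since k cancels in 100*s2/Ybar2 = 100*(k*s1)/(k*Ybar1).
# Values and k are invented for the illustration.
import statistics

def cv(data):
    """Coefficient of variation, V = 100 * s / Ybar."""
    return statistics.stdev(data) * 100 / statistics.mean(data)

Y1 = [2.1, 2.5, 1.9, 2.8, 2.2]   # hypothetical variates
k = 9.3                          # arbitrary constant of proportionality
Y2 = [k * y for y in Y1]

# k cancels, so both variables have the same coefficient of variation.
assert abs(cv(Y1) - cv(Y2)) < 1e-9
```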
At one time systematists put great stock in coefficients of variation and even
based some classification decisions on the magnitude of these coefficients.
However, there is little, if any, foundation for such actions. More extensive
discussion of the coefficient of variation, especially as it relates to systematics,
can be found in Simpson et al. (1960), Lewontin (1966), Lande (1977), and
Sokal and Braumann (1980).
EXERCISES 4
4.1 Find the mean, standard deviation, and coefficient of variation for the pigeon data
given in Exercise 2.4. Group the data into ten classes, compute Ȳ and s, and
compare them with the results obtained from the ungrouped data. Compute the median
for the grouped data. Answer: For ungrouped data, Ȳ = 11.48 and s =
0.69178.
4.2 Find Ȳ, s, V, and the median for the following data (milligrams of glycine per
milligram of creatinine in the urine of 37 chimpanzees; from Gartler et al., 1956).
OS 0180550535052 077026. MD
025 036 043.100 120,110 «100350100300
OL 060 070030, 080110110 120,138 100
“oo 155.370 019100100116
4.3 The following are percentages of butterfat from 120 registered three-year-old Ayrshire
cows selected at random from a Canadian stock record book.
432 424 429 400 396 448 389 402 378 442
420 387 410 400 433-381 433 416 38K 4A
423 467 374 425 «428 403 442 409 GIS 429
427 438 449 403 397 432 467 4.11 424 5:00
400 438 372 399 400 446 482391 «4713.96
366 410 438 416 377 440 406 408 3.66 4.70
397 397 420 441 431 370 383 428 4300 IT
397 420 451 386 436 418 424 405 4053.56
a4 389 458 399 4.17 382 370 433 406 3D
407358 393 420 389 460 438 4.14 40 397
422 347 392 491 395° 438 412 452 43531
410 409 409 434 409 488428 3.98 3K SK60 CHAPTER 4 DESCRIPTIVE STATISTICS
ina foqueney
fo CaeateF, sand Vary rom the aa (0) Group the at
Co Gant Og clei Pa an V- Compare te ests wih ae (0)
Sisto a i aed by erouping? Alo calelate te mein. An-
How ie pooped aia P= 4.16008, # = O0238,V = 725815.
4.4 What effect would adding a constant 5.2 to all observations have upon the numerical
values of the following statistics: Ȳ, s, V, average deviation, median, mode, and
range? What would be the effect of adding 5.2 and then multiplying the sums by 8.0?
Would these effects be the same if we multiplied by 8.0 first and
then added 5.2?
4.5 Estimate μ and σ using the midrange and the range (see Section 4.9) for the data
in Exercise 4.2. How well do these estimates agree with the estimates given by Ȳ
and s? Answer: Estimates of μ and σ are ? and 0.224,
respectively.
4.6 Show that the equation for the variance can also be written as

s² = [ΣY² − (ΣY)²/n] / (n − 1)
4.7 Apply the C correction to the estimated standard deviation and to the coefficient
of variation of the data in Exercise 4.2. Answer: C = 0.99621, V* = ?.
INTRODUCTION TO PROBABILITY
DISTRIBUTIONS: BINOMIAL AND
POISSON
Section 2.5 contained our first discussion of frequency distributions. For
example, Table 2.3 shows a distribution for a meristic, or discrete (discontinuous),
variable: the number of sedge plants per quadrat. Examples of distributions
for continuous variables are the femur lengths of aphid stem mothers in
Box 2.1 or the human birth weights in Box 4.3. Each of these distributions
informs us about the absolute frequency of any given class and permits computation
of the relative frequencies of any class of variable. Thus, most of the
quadrats contained either no sedges or just one or two plants. In the 139.5-oz
class of birth weights, we find only 201 out of the 9465 babies recorded; that is,
approximately only 2.1% of the infants are in that birth-weight class.
Of course, these frequency distributions are only samples from given populations.
The birth weights represent a population of male Chinese infants from a
given geographical area. If we knew our sample to be representative of that
population, however, we could make all sorts of predictions based on the frequency
distribution of the sample. For instance, we could say that approximately
2.1% of male Chinese babies born in this population weigh between 135.5 and
143.5 oz at birth. Similarly, we might say that the probability of the weight at
birth of any one baby in this population being in the 139.5-oz birth class is quite
low. If each of the 9465 weights were mixed up in a hat and we pulled one out,
the probability that we would pull out one of the 201 in the 139.5-oz class would
be very low indeed, only 2.1%. It would be much more probable that we would
sample an infant of 107.5 or 115.5 oz, since the infants in these classes are
represented by the frequencies 2240 and 2007, respectively.
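These relative frequencies are simply class counts divided by the total count; a minimal sketch (in Python, using the figures just quoted from Box 4.3):

```python
# Class counts quoted in the text for the 9465 recorded birth weights (Box 4.3)
total = 9465
counts = {"139.5 oz": 201, "107.5 oz": 2240, "115.5 oz": 2007}

for klass, f in counts.items():
    rel = f / total   # relative frequency = estimated probability of that class
    print(f"{klass}: {rel:.4f}")
```

The 139.5-oz class yields about 0.021, the "approximately 2.1%" cited above.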
Finally, if we were to sample from an unknown population of babies and find
that the very first individual sampled had a birth weight of 170 oz, we would
probably reject any hypothesis that the unknown population was the same as that
sampled in Box 4.3. We would arrive at this conclusion because in the distribution
in Box 4.3 only one out of almost 10,000 infants had a birth weight that
high. Although it is possible to have sampled from the population of male Chinese
babies and obtained a birth weight of 170 oz, the probability that the first
individual sampled would have such a value is very low indeed. It is much more
reasonable to suppose that the unknown population from which we are sampling
is different in mean and possibly in variance from the one in Box 4.3.
We have used this empirical frequency distribution to make certain predictions
(with what frequency a given event will occur) or to make judgments and
decisions (whether it is likely that an infant of a given birth weight belongs to this
population). In many cases in biology, however, we will make such predictions
not from empirical distributions, but on the basis of theoretical considerations
that in our judgment are pertinent. We may feel that the data should be distributed
in a certain way because of basic assumptions about the nature of the forces
acting on the example at hand. If our observed data do not sufficiently conform
to the values expected on the basis of these assumptions, we will have serious
doubts about our assumptions. This is a common use of frequency distributions
in biology. The assumptions being tested generally lead to a theoretical frequency
distribution, known also as a probability distribution.
A probability distribution may be a simple two-valued distribution, such as the
3:1 ratio in a Mendelian cross, or it may be a more complicated function
intended to predict the number of plants in a quadrat. If we find that the observed
data do not fit the expectations on the basis of theory, we are often led to the
discovery of some biological mechanism causing this deviation from expectation.
The phenomena of linkage in genetics, of preferential mating between
different phenotypes in animal behavior, of congregation of animals at certain
favored places or, conversely, their territorial dispersion are cases in point. We
will thus make use of probability theory to test our assumptions about the laws of
occurrence of certain biological phenomena. Probability theory underlies the
entire structure of statistics, a fact which, because of the nonmathematical orientation
of this book, may not be entirely obvious.
In Section 5.1 we present an elementary discussion of probability, limited to
what is necessary for comprehension of the sections that follow. The binomial
frequency distribution, which not only is important in certain types of studies,
such as genetics, but also is fundamental to an understanding of the kinds of probability
distributions discussed in this book, is covered in Section 5.2. The Poisson
distribution, which follows in Section 5.3, is widely applicable in biology, especially
for tests of randomness of the occurrence of certain events. Both the
binomial and Poisson distributions are discrete probability distributions. Some
other discrete distributions are mentioned briefly in Section 5.4. The entire
chapter therefore deals with discrete probability distributions. The most common
continuous probability distribution, the normal frequency distribution, is discussed
in Chapter 6.
5.1 PROBABILITY, RANDOM SAMPLING, AND HYPOTHESIS TESTING
We will start this discussion with an example that is not biometrical or biological
in the strict sense. We have often found it pedagogically effective to introduce
new concepts through situations thoroughly familiar to the student, even if the
example is not relevant to the general subject matter of biometry.
Let us imagine ourselves at Matchless University, a state institution somewhere
between the Appalachians and the Rockies. Its enrollment figures yield
the following breakdown of the student body: 70% of the students are American
undergraduates (AU), 26% are American graduate students (AG), and the remaining
4% are from abroad. Of these, 1% are foreign undergraduates (FU) and
3% are foreign graduate students (FG). In much of our work we use proportions
rather than percentages as a useful convention. Thus the enrollment consists of
0.70 AUs, 0.26 AGs, 0.01 FUs, and 0.03 FGs. The total student body, corresponding
to 100%, is represented by the figure 1.0.
If we sample 100 students at random, we expect that, on the average, 3 will be
foreign graduate students. The actual outcome might vary: there might not be a
single FG student among the 100 sampled, or there may be quite a few more than
3, and the estimate of the proportion of FGs may therefore range from 0 to
considerably greater than 0.03. If we increase our sample size to 500 or 1000, it is less likely
that the ratio will fluctuate widely around 0.03. The larger the sample, the closer
to 0.03 the ratio of FG students sampled to total students sampled will be. In fact,
the probability of sampling a foreign graduate student can be defined as the limit reached
by the ratio of FG students to the total number of students sampled, as
sample size increases. Thus, we may formally summarize by stating that the
probability of a student at Matchless University being a foreign graduate student
is P[FG] = 0.03. Similarly, the probability of sampling a foreign undergraduate
is P[FU] = 0.01, that of sampling an American undergraduate is P[AU] = 0.70,
and that for American graduate students, P[AG] = 0.26.
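The settling of the sampled ratio toward 0.03 as sample size grows can be illustrated by simulation (a hypothetical sketch of ours, using the enrollment proportions given above):

```python
import random

random.seed(1)  # fixed seed so the run is reproducible
categories = ["AU", "AG", "FU", "FG"]
probs = [0.70, 0.26, 0.01, 0.03]

# As n grows, the estimated P[FG] fluctuates less and less widely around 0.03.
for n in (100, 1000, 100_000):
    students = random.choices(categories, weights=probs, k=n)
    p_fg = students.count("FG") / n   # estimated probability of sampling an FG
    print(n, p_fg)
```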
Now imagine the following experiment: We try to sample a student at random
from the student body at Matchless University. This task is not as easy as might
be imagined. If we wanted to do the operation physically, we would have to set
up a collection or trapping station somewhere on campus. And to make certain
that the sample is truly random with respect to the entire student population, we
would have to know the ecology of students on campus very thoroughly, so that
we could locate our trap at a station where each student had an equal probability
of passing. Few, if any, such places can be found in a university. The student
union facilities are likely to be frequented more by independent and foreign
students, less by those living in organized houses and dormitories. Fewer foreign
and graduate students might be found along fraternity row. Clearly we would not
wish to place our trap near the International Club or House, because our probability
of sampling a foreign student would be greatly enhanced. In front of the
bursar's window we might sample students paying tuition. The time of sampling
is equally important, in the seasonal as well as the diurnal cycle. There seems no
easy solution in sight.
Those of you who are interested in sampling organisms from nature will
already have perceived parallel problems in your work. If we were to sample
only students wearing turbans or saris, their probability of being foreign students
would be almost 1. We could no longer speak of a random sample. In the familiar
ecosystem of the university these violations of proper sampling procedure are
obvious to all of us, but they are not nearly so obvious in real biological instances
that are less well known.
How should we proceed to obtain a random sample of leaves from a tree, of
insects from a field, or of mutations in a culture? In sampling at random we are
attempting to permit the frequencies of various events occurring in nature to be
reproduced unaltered in our records; that is, we hope that on the average the
frequencies of these events in our sample will be the same as they are in the
natural situation. Another way of stating this goal is that in a random sample we
want every individual in the population being sampled to have an equal probability
of being included in the sample.
We might go about obtaining a random sample by using records representing
the student body, such as the student directory, selecting a page from it at random
and a name at random from the page. Or we could assign an arbitrary number to
each student, write each number on a chip or disk, put these in a large container, stir well,
and then pull out a number.
Imagine now that we sample a single student by the best random procedure
we can devise. What are the possible outcomes? Clearly the student could be
either an AU, AG, FU, or FG. This set of four possible outcomes exhausts the
possibilities of the experiment. This set, which we can represent as {AU, AG,
FU, FG}, is called the sample space. Any single experiment would result in only
one of the four possible outcomes (elements) in the set. Such an element in a
sample space is called a simple event. It is distinguished from an event, which is
any subset of the sample space. Thus in the sample space defined above, {AU},
{AG}, {FU}, and {FG} are each simple events. The following sampling results
are some of the possible events: {AU, AG, FU}, {AU, AG, FG}, {AG, FG},
{AU, FG}. The meanings of these events, in order, are: being an American, an
undergraduate, or both; being an American, a graduate student, or both; being a
graduate student; and being an American undergraduate or a foreign graduate student.
By the definition of event, simple events, as well as the entire sample space,
are also events.
Given the sample space described above, the event A = {AU, AG} encompasses
all possible outcomes in the space yielding an American student. Similarly,
the event B = {AG, FG} summarizes the possibilities for obtaining a
graduate student. The intersection of events A and B, written A ∩ B, describes
only the simple events that are shared by A and B. Clearly only AG qualifies, as can be
seen here:

A = {AU, AG}
B = {AG, FG}

Thus A ∩ B is the event in the sample space that gives rise to the sampling of an
American graduate student. When the intersection of two events is empty, as in
B ∩ C, where C = {AU, FU}, the events B and C are mutually exclusive; these
two events have no common element in the sample space.
We may also define events that are unions of two other events in the sample
space. Thus A ∪ B indicates that A or B or both A and B occur. As A and B are defined
above, A ∪ B describes all students who are American students, graduate students,
or both (American graduate students).
This discussion of events makes the following relations almost self-evident.
Probabilities are necessarily bounded by 0 and 1. Thus

0 ≤ P[A] ≤ 1        (5.1)

The probability of the entire sample space is

P[S] = 1        (5.2)

where S is the sample space of all possible events. Also, for any event A, the
probability of A not occurring is 1 − P[A]. This is known as the complement rule:

P[Aᶜ] = 1 − P[A]        (5.3)

where Aᶜ stands for all events that are not A.
Why are we concerned with defining sample spaces and events? Because
these concepts lead us to useful definitions and operations regarding the probability
of various outcomes. If we can assign a number 0 ≤ p ≤ 1 to each simple
event in a sample space such that the sum of these p's over all simple events in
the space equals unity, then the space becomes a (finite) probability space. In
our example, the following numbers are associated with the appropriate simple
events in the sample space:

{AU, AG, FU, FG}
{0.70, 0.26, 0.01, 0.03}
Given this probability space, we are now able to make statements regarding the
probability of given events. For example, what is the probability of a student
sampled at random being an American graduate student? Clearly, it is
P[{AG}] = 0.26. What is the probability that a student is either an American or a
graduate student? The answer in terms of two of the events defined earlier uses the general formula

P[A ∪ B] = P[A] + P[B] − P[A ∩ B]        (5.4)

In terms of the events defined earlier, the answer is

P[A ∪ B] = P[{AU, AG}] + P[{AG, FG}] − P[{AG}]
         = 0.96 + 0.29 − 0.26
         = 0.99
We can see that we must subtract P[A ∩ B] = P[{AG}], because if we did not
do so it would be included twice, once in P[A] and once in P[B], and would lead
to the absurd result of a probability greater than 1.
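Expression (5.4) can be checked directly against the probability space. In the sketch below (a representation of ours, not the book's), events are Python sets of simple events:

```python
# Probability space of the Matchless University example
P = {"AU": 0.70, "AG": 0.26, "FU": 0.01, "FG": 0.03}

def prob(event):
    """Probability of an event, i.e., of a subset of the sample space."""
    return sum(P[e] for e in event)

A = {"AU", "AG"}   # being an American student
B = {"AG", "FG"}   # being a graduate student

# Inclusion-exclusion, Expression (5.4) ...
p_union = prob(A) + prob(B) - prob(A & B)
print(p_union)

# ... agrees with summing the union of the two events directly
print(prob(A | B))
```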
Now let us assume that we have sampled our single student from the student
body of Matchless University. He turns out to be a foreign graduate student.
What can we conclude? By chance alone, this result has a probability of 0.03; it
would happen 3% of the time, that is, not very frequently. The assumption that
we have sampled at random should probably be rejected, since if we accept the
hypothesis of random sampling, the outcome of the experiment is improbable.
Please note that we said improbable, not impossible. We could have chanced
upon an FG as the first student to be sampled, but it is not very likely. The probability
is 0.97 that a single student sampled would be a non-FG. If we could be
certain that our sampling method was random (as when drawing student numbers
out of a container), we would of course have to reach the conclusion that an
improbable event has occurred. The decisions of this paragraph are all based on
our definite knowledge that the proportions of students at Matchless University are
as specified by the probability space. If we were uncertain about this, we would
be led to assume a higher proportion of foreign graduate students as a consequence
of the outcome of our sampling experiment.
Now we will sample two students rather than just one. What are the possible
outcomes of this sampling experiment? The new sample space can best be depicted
by a diagram that shows the set of the sixteen possible simple events as
points in a lattice (Figure 5.1). The possible combinations, ignoring which student
was sampled first, are {AU, AU}, {AU, AG}, {AU, FU}, {AU, FG}, {AG,
AG}, {AG, FU}, {AG, FG}, {FU, FU}, {FU, FG}, and {FG, FG}.
What would be the expected probabilities of these new outcomes? Now the
nature of the sampling procedure becomes quite important. We may sample with
FIGURE 5.1  Sample space for sampling two students from Matchless University.
or without replacement; that is, we may return the first student sampled to the
population, or we may keep him out of the pool of individuals to be sampled. If we do
not replace the first individual sampled, the probability of sampling a foreign
graduate student will no longer be exactly 0.03. We can visualize this easily.
Assume that Matchless University has 10,000 students. Since 3% are foreign
graduate students, there must be 300 FG students at the university. After obtaining
a foreign graduate student in the first sample, this number is reduced to 299
out of 9999 students. Consequently, the probability of sampling an FG student
now becomes 299/9999 = 0.0299, a slightly lower probability than the value of
0.03 for sampling the first FG student. If, on the other hand, we return the
original foreign student to the student population and make certain that the
population is thoroughly randomized before being sampled again (that is, give
him a chance to lose himself among the campus crowd or, in drawing student
numbers out of a container, mix up the disks with the numbers on them), the
probability of sampling a second FG student is the same as before, 0.03. In
fact, if we continue to replace the sampled individuals in the original population,
we can sample from it as though it were an infinitely sized population.
Biological populations are, of course, finite, but they are frequently so large
that for purposes of sampling experiments we can consider them effectively
infinite, whether we replace sampled individuals or not. After all, even in this
relatively small population of 10,000 students, the probability of sampling a
second foreign graduate student (without replacement) is only minutely different
from 0.03. For the rest of this section we will assume that sampling is with
replacement, so the probability of obtaining a foreign student will not change.
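A quick numerical check of the figures in this paragraph, assuming the hypothetical enrollment of 10,000 students with 300 FGs:

```python
N, fg = 10_000, 300   # hypothetical enrollment, of which 3% are FG students

p_first = fg / N                         # 0.03: probability the first student is an FG
p_second_without = (fg - 1) / (N - 1)    # without replacement, after removing one FG

# Only minutely different: 0.03 versus 299/9999 = 0.0299
print(p_first, round(p_second_without, 4))
```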
There is a second potential source of difficulty in this design. We not only
have to assume that the probability of sampling a second foreign student is equal
to that of the first, but also that these events are independent. By independence
of events we mean that the probability of one event is not affected by whether or
not another event has occurred. In the case of the students, having sampled one
foreign student, is it more or less likely that the second student we sample is
also a foreign student? Independence of the events may depend on where we
sample the students or on the method of sampling. If we sampled students on
campus, it is quite likely that the events would not be independent; that is, having
sampled one foreign student, the probability that the second student we sample is
foreign would be increased, since foreign students tend to congregate. Thus, at Matchless
University the probability that a student walking with a foreign graduate student
is also an FG would be greater than 0.03.
Events D and E in a sample space are defined as independent whenever

P[D ∩ E] = P[D]P[E]        (5.5)

The probability values assigned to the sixteen points in the sample space of
Figure 5.1 have been computed to satisfy this condition. Thus, letting P[D] equal
the probability that the first student sampled is an AU, that is,
P[{AU₁AU₂, AU₁AG₂, AU₁FU₂, AU₁FG₂}], and letting P[E] equal the probability that the second
student is an FG, that is, P[{AU₁FG₂, AG₁FG₂, FU₁FG₂, FG₁FG₂}], we note
that the intersection D ∩ E is {AU₁FG₂}. This event has a value of 0.0210 in the
probability space of Figure 5.1. We find that this value is the product
P[{AU}]P[{FG}] = 0.70 × 0.03 = 0.0210. These independent relations
have been deliberately imposed upon all points in the probability space.
Therefore, if the sampling probabilities for the second student are independent of
the type of student sampled first, we can compute the probabilities of the outcomes
as the products of the independent probabilities. Thus the probability
of obtaining two FG students is P[{FG}]P[{FG}] = 0.03 × 0.03 = 0.0009.
The probability of obtaining one AU and one FG student in the sample might
seem to be the product 0.70 × 0.03. However, it is twice that probability. It is
easy to see why: there is only one way of obtaining two FG students, namely by
sampling first one FG and then again another FG. Similarly, there is only one
way to sample two AU students. However, sampling one of each type of student
can be done by first sampling an AU followed by an FG, or by first sampling an
FG followed by an AU. Thus the probability is 2P[{AU}]P[{FG}] = 2 ×
0.70 × 0.03 = 0.0420.
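Under independence, each of the sixteen points of Figure 5.1 carries the product of its two simple-event probabilities. The sketch below enumerates them and recovers the figures just derived:

```python
from itertools import product

P = {"AU": 0.70, "AG": 0.26, "FU": 0.01, "FG": 0.03}

# All 16 ordered outcomes for two students sampled with replacement,
# each with the product probability required by Expression (5.5)
pairs = {(a, b): P[a] * P[b] for a, b in product(P, repeat=2)}

total = sum(pairs.values())                           # the 16 points exhaust the space
p_two_fg = pairs[("FG", "FG")]                        # only one way: 0.0009
p_au_fg = pairs[("AU", "FG")] + pairs[("FG", "AU")]   # two orders: 0.0420

print(total, p_two_fg, p_au_fg)
```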
If we conducted such an experiment and obtained a sample of two FG students,
we would be led to the following conclusions: Since only 0.0009 of the
samples (9/100 of 1%, or 9 out of 10,000 cases) would be expected to consist
of two foreign graduate students, obtaining such a result by chance alone is quite
improbable. Given that P[{FG}] = 0.03 is true, we would suspect that sampling
was not random or that the events were not independent (or that both
assumptions, random sampling and independence of events, were incorrect).
Random sampling is sometimes confused with randomness in nature. The
former is the faithful representation in the sample of the distribution of the events
in nature; the latter is the independence of the events in nature. Random sampling
generally is, or should be, under the control of the experimenter and is
related to the strategy of good sampling. Randomness in nature generally describes
an innate property of the objects being sampled and thus is of greater
biological interest. The confusion between random sampling and independence
of events arises because lack of either can yield observed frequencies of events
differing from expectation. We have already seen how lack of independence in
samples of foreign students can be interpreted from both points of view in our
example from Matchless University.
Expression (5.5) can be used to test whether two events are independent. It is
well known that foreign students at American universities are predominantly
graduate students, quite unlike the situation for American students. Can we
demonstrate this for Matchless University? The proportion of foreign students at
the university is P[F] = P[FU] + P[FG] = 0.04, and that of graduate students is
P[G] = P[AG] + P[FG] = 0.29. If the properties of being a foreign student and
of being a graduate student were independent, then P[FG] would equal
P[F]P[G]. In fact, it does not, since 0.04 × 0.29 = 0.0116, not 0.03, the actual
proportion of foreign graduate students. Thus nationality and academic rank are
not independent. The probability of being a graduate student differs depending
on the nationality: Among American students the proportion of graduate students
is 0.26/0.96 = 0.27, but among foreign students it is 0.03/0.04 = 0.75.
Perhaps this relation is seen more clearly in terms of conditional probability.
We speak of the conditional probability of event A, given event B, symbolized as
P[A|B]. This quantity can be computed as follows:

P[A|B] = P[A ∩ B] / P[B]        (5.6)

Let us apply this formula to our Matchless University example. What is the
probability of a student being a graduate student, given that the student is foreign?
We calculate P[G|F] = P[G ∩ F]/P[F] = 0.03/0.04 = 0.75. Thus
75% of the foreign students are graduate students. Similarly, the conditional
probability of being a graduate student, given that the student is American, is a
much lower value: only 27.08% of the American students
are graduate students.
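The independence check of Expression (5.5) and the conditional probabilities of Expression (5.6) reduce to a few lines of arithmetic; a sketch using the Matchless University proportions:

```python
P = {"AU": 0.70, "AG": 0.26, "FU": 0.01, "FG": 0.03}

p_foreign = P["FU"] + P["FG"]   # P[F] = 0.04
p_grad = P["AG"] + P["FG"]      # P[G] = 0.29
p_fg = P["FG"]                  # P[F and G] = 0.03

# Independence would require P[F and G] == P[F] * P[G]
print(p_foreign * p_grad)       # 0.0116, not 0.03: the events are not independent

# Conditional probabilities, Expression (5.6)
p_grad_given_foreign = p_fg / p_foreign                 # 0.03 / 0.04 = 0.75
p_grad_given_american = P["AG"] / (P["AU"] + P["AG"])   # 0.26 / 0.96, about 0.2708
print(p_grad_given_foreign, p_grad_given_american)
```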
Here is another example that requires Expression (5.6) for conditional probability.
Let C be the event that a person has cancer, and let P[C] denote the probability
that a member of a particular population has cancer (with P[Cᶜ] = 1 − P[C], the probability
of not having cancer). In epidemiology, P[C] is known as the prevalence
of the disease. Let T be the event of a positive result on a diagnostic test for
cancer. What we would like to know is P[C|T], the probability that a
person who tests positive actually has cancer. Applying Expression (5.6), we obtain

P[C|T] = P[C ∩ T] / P[T]        (5.7)

The sensitivity of the test is P[T|C], the probability of a positive
result among those individuals known to have cancer; the specificity is P[Tᶜ|Cᶜ], the probability of a negative result among those known to be cancer-free.
It follows that P[T|Cᶜ], the probability of a positive result in cancer-free
patients, equals 1 − P[Tᶜ|Cᶜ], the one-complement of the specificity.
By the definition of conditional probability, P[C ∩ T] = P[T|C]P[C]. The probability of the event T can be written

P[T] = P[T|C]P[C] + P[T|Cᶜ]P[Cᶜ]

which is the sum of the probabilities of having a positive test among those who
have cancer and among those who do not have cancer, each weighted by the
frequencies of the two populations. Substituting these two results into Expression
(5.7) yields

P[C|T] = P[T|C]P[C] / (P[T|C]P[C] + P[T|Cᶜ]P[Cᶜ])        (5.8)
This expression is known as Bayes' theorem and can be generalized to allow
for an event C having more than just two states (the denominator is then summed over
all states of C, rather than just C and its complement). This famous formula,
published posthumously by the eighteenth-century English clergyman Thomas
Bayes, has led to much controversy over the interpretation of the quantity
P[C|T].
Earlier we defined probability as the proportion of times an event occurs out of
a large number of trials. In the current example we have only a single patient,
who either does or does not have cancer. The patient does not have cancer some
proportion of the time. Thus the meaning of P[C|T] in this case is the degree of
one's belief, or the likelihood, that this patient has cancer. It is this alternative
interpretation of probability and the question of how it should be applied to
statistics that is controversial. Kotz and Stroup (1983) give a good introduction
to the idea that probability refers to uncertainty of knowledge rather than of
events.
Consider the following example, in which Bayes' theorem was applied to a
diagnostic test. The figures are based on Watson and Tang (1980). The sensitivity
of the radioimmunoassay for prostatic acid phosphatase (RIA-PAP) as a test
for prostatic cancer is 0.70. Its specificity is 0.94. The prevalence of prostatic
cancer in the white male population is 35 per 100,000, or 0.00035. Applying
these values to Expression (5.8), we find

P[C|T] = P[T|C]P[C] / (P[T|C]P[C] + P[T|Cᶜ]P[Cᶜ])
       = (0.70 × 0.00035) / [(0.70 × 0.00035) + (1 − 0.94)(1 − 0.00035)]
       = 0.0041

The rather surprising result is that the likelihood that a white male who tests
positive on the RIA-PAP test actually has prostate cancer is only 0.41%. This
probability is known in epidemiology as the positive predictive value. Even if
the test had been much more sensitive, say 0.95 rather than 0.70, the positive
predictive value would still have been low: 0.55 percent. Only for a perfect test
(i.e., sensitivity and specificity both = 1) would a positive test imply with certainty
that the patient had prostate cancer.
The paradoxically low positive predictive value is a consequence of its dependence
on the prevalence of the disease. Only if the prevalence of prostatic
cancer were 7895 per 100,000 would there be a 50:50 chance that a patient with
a positive test result has cancer. This is more than 127 times the highest prevalence
ever reported for a population in the United States. Watson and Tang
(1980) use these findings (erroneously reported as 1440 per 100,000) and further
analyses to make the point that using the RIA-PAP test as a routine screening
procedure for prostate cancer is not worthwhile.
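Expression (5.8) is easily packaged as a small function; the sketch below (the function name is ours) reproduces the positive predictive values discussed for the RIA-PAP example:

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Bayes' theorem, Expression (5.8): P[C|T] for a diagnostic test."""
    p_pos_cancer = sensitivity * prevalence            # P[T|C] P[C]
    p_pos_free = (1 - specificity) * (1 - prevalence)  # P[T|not-C] P[not-C]
    return p_pos_cancer / (p_pos_cancer + p_pos_free)

# RIA-PAP figures from Watson and Tang (1980)
print(round(positive_predictive_value(0.70, 0.94, 0.00035), 4))  # 0.0041
print(round(positive_predictive_value(0.95, 0.94, 0.00035), 4))  # 0.0055
# Prevalence at which a positive result becomes a 50:50 proposition
print(round(positive_predictive_value(0.70, 0.94, 0.07895), 2))  # 0.5
```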
Readers interested in extending their knowledge of probability should refer to
general texts such as Galambos (1984) or Kotz and Stroup (1983) for a simple
introduction.
5.2 THE BINOMIAL DISTRIBUTION
For the discussion to follow, we will simplify our sample space to consist of only
two elements, foreign and American students, represented by {F, A}, and ignore
whether they are undergraduates or graduates. Let us symbolize the probability
space by {p, q}, where p = P[F], the probability of being a foreign student, and
q = P[A], the probability of being an American student. As before, we can
compute the probability space of samples of two students as follows:

{FF, FA, AA}
{p², 2pq, q²}
If we were to sample three students independently, the probability space of the
sample would be

{FFF, FFA, FAA, AAA}
{p³, 3p²q, 3pq², q³}

Samples of three foreign or three American students can be obtained in only one
way, and their probabilities are p³ and q³, respectively. In samples of three,
however, there are three ways of obtaining two students of one kind and one
student of the other. As before, if A stands for American and F stands for foreign,
then the sampling sequence could be AFF, FAF, or FFA for two foreign students
and one American. Thus the probability of this outcome will be 3p²q.
Similarly, the probability for two Americans and one foreign student is 3pq².
A convenient way to summarize these results is by the binomial expansion,
which is applicable to samples of any size from populations in which objects
occur independently in only two classes: students who may be foreign or
American, individuals who may be dead or alive, male or female, black or white,
rough or smooth, and so forth. This summary is accomplished by expanding the
binomial term (p + q)^k, where k equals sample size, p equals the probability of
occurrence of the first class, and q equals the probability of occurrence of the
second class. By definition, p + q = 1; hence q is a function of p: q = 1 − p.
We will expand the expression for samples of k from 1 to 3:

For samples of 1:  (p + q)¹ = p + q
For samples of 2:  (p + q)² = p² + 2pq + q²
For samples of 3:  (p + q)³ = p³ + 3p²q + 3pq² + q³
These expressions yield the same outcomes discussed previously. The coefficients
(the numbers before the powers of p and q) express the number of ways a
particular outcome is obtained.
A general formula that gives both the powers of p and q, as well as the
binomial coefficients, is

C(k, Y) p^Y q^(k−Y) = [k!/(Y!(k − Y)!)] p^Y (1 − p)^(k−Y)        (5.9)

where Y is the number or count of "successes," the items that interest us and whose probability
of occurrence is symbolized by p. In our example, Y designates the number of
foreign students. The term C(k, Y) stands for the number of combinations
that can be formed from k items taken Y at a time. This expression can be
evaluated as k!/[Y!(k − Y)!], where ! means factorial. In mathematics, k factorial
is the product of all the integers from 1 up to and including k. Thus 5! = 1 ×
2 × 3 × 4 × 5 = 120. By convention, 0! = 1. In working out fractions
containing factorials, note that a factorial always cancels against a higher factorial.
Thus 5!/3! = (5 × 4 × 3!)/3! = 5 × 4. For example, the binomial coefficient
for the expected frequency of samples of 5 students containing 2 foreign
students is C(5, 2) = 5!/(2!3!) = (5 × 4)/2 = 10.
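These combinatorial computations can be checked with Python's standard library, which evaluates binomial coefficients directly (a sketch; the helper function is ours):

```python
import math

# k! is the product of the integers 1 through k; by convention 0! = 1
assert math.factorial(5) == 120
assert math.factorial(0) == 1

# Binomial coefficient C(5, 2) = 5!/(2! 3!) = 10
assert math.comb(5, 2) == 10
assert math.comb(5, 2) == math.factorial(5) // (math.factorial(2) * math.factorial(3))

def binomial_term(k, Y, p):
    """Expression (5.9): probability of exactly Y successes in a sample of k."""
    return math.comb(k, Y) * p**Y * (1 - p)**(k - Y)

print(binomial_term(5, 2, 0.03))  # e.g., exactly 2 foreign students among 5 sampled
```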
Let us now turn to a biological example. Suppose we have a population of
insects, exactly 40% of which are infected with a given virus X. If we take
samples of k = 5 insects each and examine each insect separately for the presence
of the virus, what distribution of samples could we expect if the probability of
infection of each insect in a sample were independent of that of the other insects
in the sample? In this case p = 0.4, the proportion infected, and q = 0.6, the
proportion not infected. The population is assumed to be so large that the question
of whether sampling is with or without replacement is irrelevant for practical
purposes. The expected proportions of samples would be the expansion of the binomial:

(p + q)^k = (0.4 + 0.6)⁵
With the aid of Expression (5.9) this expansion is

p⁵ + 5p⁴q + 10p³q² + 10p²q³ + 5pq⁴ + q⁵

or

(0.4)⁵ + 5(0.4)⁴(0.6) + 10(0.4)³(0.6)² + 10(0.4)²(0.6)³ + 5(0.4)(0.6)⁴ + (0.6)⁵

representing the expected proportions of samples of five infected insects, four
infected and one noninfected insect, three infected and two noninfected insects,
and so on.
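The terms of this expansion can be generated mechanically from Expression (5.9); the values below should match the relative expected frequencies of Table 5.1:

```python
import math

p, q, k = 0.4, 0.6, 5   # infection rate, its complement, sample size

# One term per outcome, from Y = 5 infected down to Y = 0 infected
terms = [math.comb(k, Y) * p**Y * q**(k - Y) for Y in range(k, -1, -1)]
print(terms)        # approximately 0.01024, 0.0768, 0.2304, 0.3456, 0.2592, 0.07776
print(sum(terms))   # the terms of (p + q)^5 sum to 1, up to rounding
```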
By now you have probably realized that the terms of the binomial expansion
yield a type of frequency distribution for these different outcomes. Associated
with each outcome, such as "five infected insects," is a probability of
occurrence, in this case (0.4)⁵ = 0.01024. This is a theoretical frequency distribution,
or probability distribution, of events that can occur in two classes. It
describes the expected distribution of outcomes in random samples of five insects,
40% of which are infected. The probability distribution described here is
known as the binomial distribution; the binomial expansion yields the expected
frequencies of the classes of the binomial distribution.
‘A convenient layout for presentation and computation of a binomial distribu-
‘ion is shown in Table 5.1, based on Expression (5.9). Inthe first column, which
lists the number of infected insects per sample, note that we have revised the
order of the terms to indicate a progression from ¥ = 0 successes (infected
insects) to ¥ = k successes. The second columa features the binomial coefficient
as given by the combinatorial portion of Expression (5.9). Column 3 shows
Table 5.1 EXPECTED FREQUENCIES OF INFECTED INSECTS IN 2423 SAMPLES OF 5 INSECTS SAMPLED FROM AN INFINITELY LARGE POPULATION WITH AN ASSUMED INFECTION RATE OF 40%.

(1)           (2)          (3)        (4)        (5)          (6)          (7)
Number of
infected                                         Relative     Absolute
insects       Binomial     Powers     Powers     expected     expected     Observed
per sample    coefficient  of         of         frequencies  frequencies  frequencies
Y             C(5, Y)      p = 0.4    q = 0.6    f_rel        f^           f
0                  1       1.00000    0.07776    0.07776       188.4        202
1                  5       0.40000    0.12960    0.25920       628.0        643
2                 10       0.16000    0.21600    0.34560       837.4        817
3                 10       0.06400    0.36000    0.23040       558.3        535
4                  5       0.02560    0.60000    0.07680       186.1        197
5                  1       0.01024    1.00000    0.01024        24.8         29
Sum (Σf)                                         1.00000      2423.0       2423
Mean                                                          2.00000      1.98721
Standard deviation                                            1.09545      1.11934
increasing powers of p from p^0 to p^5, and column (4) shows decreasing powers of q from q^5 to q^0. The relative expected frequencies, which are the probabilities of the various outcomes, are shown in column (5). We label such expected frequencies f_rel. They are the product of columns (2), (3), and (4), and their sum is equal to 1.0, since the events in column (1) exhaust the possible outcomes. We see from column (5) that only about 1% of the samples are expected to consist of 5 infected insects, and 25.9% are expected to contain 1 infected and 4 noninfected insects. We will now test whether these predictions hold in an actual experiment.
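Columns (5) and (6) of Table 5.1 can be reproduced with a short Python sketch based on Expression (5.9):

```python
from math import comb

# Expression (5.9): f_rel = C(k, Y) p^Y q^(k - Y).  Parameters of Table 5.1:
# samples of k = 5 insects, infection rate p = 0.4, n = 2423 samples taken.
k, p, q, n = 5, 0.4, 0.6, 2423

f_rel = [comb(k, Y) * p**Y * q**(k - Y) for Y in range(k + 1)]
f_abs = [n * fr for fr in f_rel]          # absolute expected frequencies

for Y in range(k + 1):
    print(f"Y = {Y}: f_rel = {f_rel[Y]:.5f}  f_abs = {f_abs[Y]:.1f}")
print(f"sum of relative expected frequencies = {sum(f_rel):.5f}")
```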
EXPERIMENT 5.1. Simulate the case of the infected insects by using a table of random numbers such as Statistical Table FF. These are randomly chosen one-digit numbers in which each digit 0 through 9 has an equal probability of appearing. The numbers are grouped in blocks of 25 for convenience. Such numbers can also be obtained from random number keys on some pocket calculators and by pseudorandom number-generating algorithms in computer programs. Since there is an equal probability for any one digit to appear, you can let any four digits (say 0, 1, 2, 3) stand for the infected insects and the remaining digits (4, 5, 6, 7, 8, 9) stand for the noninfected insects. The probability that any one digit selected from the table represents an infected insect (that is, a 0, 1, 2, or 3) is therefore 40%, or 0.4, since these are four of the ten possible digits. Also, successive digits are assumed to be independent of the values of previous digits. Thus the assumptions of the binomial distribution should be met in this experiment. Enter the table of random numbers at an arbitrary point (not always at the beginning!) and look at successive groups of five digits, noting in each group how many of the digits are 0, 1, 2, or 3. Take as many groups of five as you can find time to do, but no fewer than 100 groups. (Persons with computer experience can easily generate the data required by this exercise without using Table FF. There are also some programs that specialize in simulating sampling experiments.)
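For readers generating the data by computer, a minimal Python sketch of Experiment 5.1 (pseudorandom digits standing in for Table FF) might look like this:

```python
import random
from collections import Counter

# Simulation of Experiment 5.1 with pseudorandom digits in place of Table FF.
# Digits 0-3 stand for infected insects (probability 0.4), 4-9 for noninfected.
random.seed(1)                      # arbitrary seed, so the run is repeatable
n_samples = 2423                    # same number of samples as in Table 5.1

counts = Counter()
for _ in range(n_samples):
    group = [random.randrange(10) for _ in range(5)]    # a group of five digits
    counts[sum(d <= 3 for d in group)] += 1             # infected per sample

for Y in range(6):
    print(Y, counts[Y])
```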
Column (7) in Table 5.1 shows the results of such an experiment by a biometry class. A total of 2423 samples of five numbers were obtained from Statistical Table FF, and the distribution of the four digits simulating the percentage of infection is shown in this column. The observed frequencies are labeled f. To calculate the expected frequencies for this example, we multiplied the relative expected frequencies f_rel of column (5) by n = 2423, the number of samples taken. These calculations resulted in absolute expected frequencies, f^, shown in column (6). When we compare the observed frequencies in column (7) with the expected frequencies in column (6), we note general agreement between the two columns of figures. The two distributions are illustrated in Figure 5.2. If the observed frequencies did not fit expected frequencies, we might believe that the lack of fit was due to chance alone. Or we might be led to reject one or more of the following hypotheses: (1) that the true proportion of digits 0, 1, 2, and 3 is 0.4 (rejection of this hypothesis would normally not be reasonable, for we may rely on the fact that the proportion of digits 0, 1, 2, and 3 in a table of random numbers is 0.4 or very close to it); (2) that sampling was random; and (3) that events are independent.

FIGURE 5.2 Bar diagram of observed and expected frequencies given in Table 5.1. (Abscissa: number of infected insects per sample; ordinate: frequency.)
These statements can be reinterpreted in terms of the original infection model with which we started this discussion. If, instead of a sampling experiment of digits by a biometry class, this had been a real sampling experiment of insects, we would conclude that the insects had indeed been randomly sampled and that we had no evidence to reject the hypothesis that the proportion of infected insects was 40%. If the observed frequencies had not fit the expected frequencies, the lack of fit might be attributed to chance, or to the conclusion that the true proportion of infection is not 0.4, or we would have to reject one or both of the following assumptions: (1) that sampling was at random, and (2) that the occurrence of infected insects in these samples was independent.
Experiment 5.1 was designed to yield random samples and independent events. How could we simulate a sampling procedure in which the occurrences of the digits 0, 1, 2, and 3 were not independent? We could, for example, instruct the sampler to sample as indicated previously, but every time he found a 3, to search through the succeeding digits until he found another one of the four digits standing for infected individuals and to incorporate this in the sample. Thus, once a 3 was found, the probability would be 1.0 that another one of the indicated digits would be included in the sample. After repeated samples, this procedure would result in higher frequencies than expected (on the basis of the binomial distribution) of classes of two or more indicated digits and in lower frequencies of classes of one event. Many such sampling schemes could be devised. It should be clear that the probability of the second event's occurring would be different from, and dependent on, that of the first.
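One possible reading of this dependent scheme can be sketched in a few lines of Python (an illustration only; the forced digit is counted as part of the same sample of five):

```python
import random

# Sketch of the dependent sampling scheme described above: whenever a 3 turns
# up, the sampler scans ahead until another "infected" digit (0-3) appears and
# forces it into the sample, so the second event is no longer independent.
random.seed(2)                                # arbitrary seed for repeatability

def dependent_sample(k=5):
    sample = []
    while len(sample) < k:
        d = random.randrange(10)
        sample.append(d)
        if d == 3 and len(sample) < k:
            while True:                       # search the succeeding digits
                nxt = random.randrange(10)
                if nxt <= 3:                  # next infected digit found
                    sample.append(nxt)
                    break
    return sum(d <= 3 for d in sample)        # infected count in this sample

counts = [dependent_sample() for _ in range(5000)]
# Under independence the mean count would be kp = 5 * 0.4 = 2.0; the forced
# pairing of infected digits shifts the distribution toward clumping.
print(sum(counts) / len(counts))
```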
How would we interpret a large departure of the observed frequencies from expectation in another example? We have not yet learned techniques for testing whether observed frequencies differ from those expected by more than can be attributed to chance alone. This topic will be taken up in Chapter 17. Assume that such a test has been carried out and that it has shown that our observed frequencies are significantly different from the expected frequencies. Two types of departure from expectation are likely: (1) clumping and (2) repulsion, shown in fictitious examples in Table 5.2. In real examples we would rarely have notions about the magnitude of p, the probability of one of the two possible outcomes. In such cases it is customary to obtain p from the observed sample and to calculate the expected frequencies using this sample p. This means that the hypothesis that p is a given value cannot be tested, since by design the expected frequencies all have the same p value as the observed frequencies.

The coefficient of dispersion CD is greater than 1 in clumped samples and less than 1 in cases of repulsion. In the yeast cell example, CD = 1.092. Computer program BIOM-pc includes an option for computing expected Poisson frequencies.
Figure 5.3 will give you an idea of the shapes of Poisson distributions of different means. It shows frequency polygons (the lines connecting the midpoints of a bar diagram) for five Poisson distributions. For the low value of μ = 0.1 the frequency polygon is extremely L-shaped, but with an increase in the value of μ the distributions become humped and eventually nearly symmetrical.

FIGURE 5.3 Frequency polygons of the Poisson distributions for various values of the mean. (Abscissa: number of rare events per sample; ordinate: relative expected frequency.)
We conclude our study of the Poisson distribution by considering several examples from diverse fields. The first example, which is analogous to that of the yeast cells, is from an ecological study of mosses of the species Hypnum schreberi invading mica residues of china clay (Table 5.5). These residues occur on deposited "dams" (often 5000 yd² in area), on which the ecologists laid out 126 quadrats. In each quadrat they counted the number of moss shoots. Expected frequencies are again computed, using the mean number of moss shoots, Ȳ = 0.4841, as an estimate of μ. There are many more quadrats than expected at the two tails of the distribution than at its center. Thus, although we would expect only approximately 78 quadrats without a moss plant, we find 100. Similarly, while there are 11 quadrats containing 3 or more moss shoots, the Poisson distribution predicted only 1.7 quadrats. By way of compensation the central classes are shy of expectation. Instead of the nearly 38 expected quadrats with one moss plant each, only 9 were found, and there is a slight deficiency also in the 2-mosses-per-quadrat class. This case is another illustration of clumping, which we encountered first in the binomial distribution. The sample variance, s² = 1.308, much larger than Ȳ = 0.4841, yields a coefficient of dispersion CD = 2.702.
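These expected frequencies are easy to reproduce; the Python sketch below computes the Poisson terms for Ȳ = 0.4841 and the 126 quadrats of Table 5.5:

```python
from math import exp, factorial

# Poisson expected frequencies for the moss quadrats of Table 5.5.
# The sample mean Ybar = 0.4841 is used as the estimate of mu.
n, mu = 126, 0.4841

def poisson_rel(Y, mu):
    """Relative expected frequency of observing Y rare events."""
    return exp(-mu) * mu**Y / factorial(Y)

for Y in range(3):
    print(Y, round(n * poisson_rel(Y, mu), 1))

# Expected number of quadrats with 3 or more moss shoots (the upper tail):
tail = n * (1 - sum(poisson_rel(Y, mu) for Y in range(3)))
print("3 or more:", round(tail, 1))

# Coefficient of dispersion CD = s^2 / Ybar; clumping gives CD > 1.
print("CD =", round(1.308 / 0.4841, 3))
```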
Searching for a biological explanation of the clumping, the investigators found that the protonemata, or spores, of the moss were carried in by water and
Table 5.5 NUMBER OF MOSS SHOOTS (HYPNUM SCHREBERI) PER QUADRAT ON CHINA CLAY RESIDUES (MICA).

(1)                (2)            (3)            (4)
Number of moss                    Poisson
shoots per         Observed       expected       Deviation from
quadrat Y          frequencies f  frequencies f^ expectation
0                  100            77.7           +
1                    9            37.6           −
2                    6             9.1           −
3 or more           11             1.7           +
Total              126           126.0

Source: Data from Barnes and Stanbury (1951).
Table 5.6 MITES (ARRENURUS SP.) INFESTING 589 CHIRONOMID FLIES (CALOPSECTRA AKRINA).

[Columns: (1) number of mites per fly Y; (2) observed frequencies f; (3) Poisson expected frequencies f^; (4) deviation from expectation. Total: 589 flies observed, 589.0 expected.]
deposited at random, but that each protonema gave rise to a number of shoots; the shoots can therefore be regarded as a clumped distribution. In fact, when the clones instead of individual shoots were used as sampling units, the investigators found that the clones followed a Poisson distribution, that is, were randomly distributed. There are problems in applying this approach, since the scale of the sampling units can influence the result. Thus it could happen, for instance, that the plants produced substances that prevented other plants from growing very close (hence a repulsed distribution), but that on a larger scale the plants were clustered in favorable regions (a clumped distribution). Organisms may tend to clump because they are social or because of limited dispersal, or they may have accumulated in clumps in response to environmental forces.
The second example features the distribution of water mites on the legs of chironomid flies (Table 5.6). This example is similar to that of the mosses, except that here the sampling unit is a fly instead of a quadrat. The rare event is a mite infesting the fly. The coefficient of dispersion, 2.22, reflects the pattern of observed frequencies, which are greater than expected in the tails and smaller than expected in the center. This relationship is made clear in the last column of the frequency distribution, which shows the signs of the deviations (observed frequencies minus expected frequencies) and shows a characteristic clumped pattern. A possible explanation is that the densities of mites in the several ponds from which the chironomid flies emerged differed considerably. Chironomids that emerged from ponds with many mites would be parasitized by more than one mite, but those from ponds where mites were scarce would show little or no infestation.
The third example tests the randomness of the distribution of weed seeds in samples of grass seed (Table 5.7). Here the total number of seeds in a quarter-ounce sample could be counted, so we could estimate k (which is several thousand) and q, which represents the large proportion of grass seed, as contrasted with p, the small proportion of weed seed. The data are therefore structured as in a binomial distribution with alternative states, weed seed and grass seed. For purposes of computation, however, only the number of weed seeds must be considered. This is an example of a case mentioned at the beginning of this section, a binomial in which the frequency of one outcome is very much smaller than that of the other and the sample size is large. We can use the Poisson distribution as a useful approximation of the expected binomial frequencies for the tail of the distribution. We use the average number of weed seeds per sample of seeds as our estimate of the mean and calculate expected Poisson frequencies for this mean. Although the pattern of the deviations and the coefficient of dispersion again indicate clumping, this tendency is not pronounced, and the methods of Chapter 17 would indicate insufficient evidence for the conclusion that the distribution is not a Poisson. We conclude that the weed seeds are randomly distributed through the samples. If clumping had been significant, it
Table 5.7 POTENTILLA (WEED) SEEDS IN 98 QUARTER-OUNCE SAMPLES OF GRASS SEED (PHLEUM PRATENSE).

[Columns: (1) number of weed seeds per sample of seeds Y; (2) observed frequencies f; (3) Poisson expected frequencies f^; (4) deviation from expectation.]

Source: Modified from Leggatt (1935).
Table 5.8 EXPECTED FREQUENCIES COMPARED FOR BINOMIAL AND POISSON DISTRIBUTIONS.

(1)      (2)             (3)
         Binomial        Expected
         expected        frequencies
         frequencies     approximated
         p = 0.001       by Poisson
         k = 1000        μ = 1
Y        f_rel           f_rel
0        0.367695        0.367879
1        0.368063        0.367879
2        0.184032        0.183940
3        0.061283        0.061313
4        0.015290        0.015328
5        0.003049        0.003066
6        0.000506        0.000511
7        0.000072        0.000073
8        0.000009        0.000009
9        0.000001        0.000001
Total    1.000000        0.999999
might have been because the weed seeds stuck together, possibly because of interlocking hairs, sticky envelopes, or the like.

Perhaps you need to be convinced that the binomial under these conditions does approach the Poisson distribution. The mathematical proof is too involved for this text, but we can give an empirical example (Table 5.8). Here the expected binomial frequencies for the expression (0.001 + 0.999)^1000 are given in column (2), and in column (3) these frequencies are approximated by a Poisson distribution with mean equal to 1, since the expected value for p is 0.001, which is one event for a sample size of k = 1000. The two columns of expected frequencies are extremely similar.
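Readers can regenerate Table 5.8 directly; the sketch below computes both columns:

```python
from math import comb, exp, factorial

# Empirical check of Table 5.8: the binomial (0.001 + 0.999)^1000 against a
# Poisson distribution with mean mu = kp = 1000 * 0.001 = 1.
k, p = 1000, 0.001
q, mu = 1 - p, k * p

binom_f = [comb(k, Y) * p**Y * q**(k - Y) for Y in range(10)]
poiss_f = [exp(-mu) * mu**Y / factorial(Y) for Y in range(10)]

for Y in range(10):
    print(f"{Y}: binomial {binom_f[Y]:.6f}  Poisson {poiss_f[Y]:.6f}")
```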
The next distribution is extracted from an experimental study of the effects of different densities of parents of the azuki bean weevil (Table 5.9). Larvae of these weevils enter the beans, feed and pupate inside them, and then emerge through a hole. Thus the number of emergence holes per bean is a good measure of the number of adults that have emerged. The rare event in this case is a weevil present in a bean. We note that the distribution is strongly repulsed, a far rarer occurrence than a clumped distribution. There are many more beans containing one weevil than the Poisson distribution would predict. A statistical finding of this sort leads us to investigate the biology of the phenomenon. In this case it was found that the adult female weevils tended to deposit their eggs evenly rather than randomly over the available beans. This prevented too many
Table 5.9 AZUKI BEAN WEEVILS (CALLOSOBRUCHUS CHINENSIS) EMERGING FROM 112 AZUKI BEANS (PHASEOLUS RADIATUS).

(1)                 (2)            (3)            (4)
Number of weevils                  Poisson
emerging            Observed       expected       Deviation from
per bean Y          frequencies f  frequencies f^ expectation
0                    61            70.4           −
1                    50            32.7           +
2                     1             7.6           −
3 or more             0             1.3           −
Total               112           112.0
Ȳ = 0.4643   s² = 0.269   CD = 0.579

Source: Data from Utida (1943).

Table 5.10 MEN KILLED BY BEING KICKED BY A HORSE IN 10 PRUSSIAN ARMY CORPS IN THE COURSE OF 20 YEARS.

(1)                  (2)            (3)            (4)
Number of men                       Poisson
killed per year      Observed       expected       Deviation from
per army corps Y     frequencies f  frequencies f^ expectation
0                    109            108.7          +
1                     65             66.3          −
2                     22             20.2          +
3                      3              4.1          −
4                      1              0.6          +
Total                200            199.9
Ȳ = 0.610   s² = 0.611   CD = 1.002
eggs being placed on any one bean and precluded heavy competition among the developing larvae. A contributing factor was competition among larvae feeding in the same bean, generally resulting in all but one being killed or driven out. Thus it is easy to understand how these biological phenomena would give rise to a repulsed distribution.

We will end this discussion with a classic, tragicomic application of the Poisson distribution. Table 5.10 is a frequency distribution of men killed by being kicked by a horse in 10 Prussian army corps over 20 years. The basic sampling unit is temporal in this case: one army corps per year. We are not certain exactly how many men are involved, but most likely a considerable number. The 0.610 men killed per army corps per year is the rare event. If we knew the number of men in each army corps, we could calculate the probability of not being killed by a horse in one year. This calculation would give us a binomial that could be approximated by the Poisson distribution. Knowing that the sample size (the number of men in an army corps) is large, however, we need not concern ourselves with the values of p and k and can consider the example simply in terms of the Poisson model, using the observed mean number of men killed per army corps per year as an estimate of μ.
Source (Table 5.10): Data from Bortkiewicz (1898).

This example is an almost perfect fit of expected frequencies to observed ones. What would clumping have meant in such an example? An unusual number of deaths in a certain army corps in a given year could have been due to poor discipline in that particular corps or to a particularly vicious horse that killed several men before the corps got rid of it. Repulsion might mean that one death per year per corps occurred more frequently than expected. This might have been so if men in each corps were careless each year until someone had been killed, after which time they all became more careful for a while.
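The fit is easy to verify; the sketch below uses the classic counts reported by Bortkiewicz (corps-years with 0 through 4 deaths) and the sample mean as the estimate of μ:

```python
from math import exp, factorial

# Poisson fit to the horse-kick data of Table 5.10 (a sketch using the classic
# counts reported by Bortkiewicz: corps-years with 0, 1, 2, 3, and 4 deaths).
observed = [109, 65, 22, 3, 1]
n = sum(observed)                                        # 200 corps-years
ybar = sum(Y * f for Y, f in enumerate(observed)) / n    # 0.61 deaths per unit

expected = [n * exp(-ybar) * ybar**Y / factorial(Y) for Y in range(5)]
for Y, (f, fe) in enumerate(zip(observed, expected)):
    print(f"{Y}: observed {f:3d}  expected {fe:6.1f}")
```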
Occasionally samples of items that could be distributed in Poisson fashion are taken in such a way that the zero class is missing. For example, an entomologist studying the number of mites per leaf of a tree may not sample leaves at random from the tree, but collect only leaves containing mites, excluding those without mites. For these so-called truncated Poisson distributions, special methods can be used to estimate parameters and to calculate expected frequencies. Cohen (1960) gives tables and graphs for rapid estimation.
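As a sketch of the idea (not Cohen's estimation procedure), the expected relative frequencies of a zero-truncated Poisson are the ordinary Poisson terms rescaled by the probability of seeing at least one event; the mean μ = 2 below is hypothetical:

```python
from math import exp, factorial

# Zero-truncated Poisson: with the zero class unobservable, each ordinary
# Poisson term is divided by 1 - e^(-mu), the probability of observing at
# least one event (e.g., at least one mite) per sampling unit.
def truncated_poisson(r, mu):
    return (exp(-mu) * mu**r / factorial(r)) / (1 - exp(-mu))

mu = 2.0                              # hypothetical mean, for illustration only
probs = [truncated_poisson(r, mu) for r in range(1, 30)]
print(round(sum(probs), 6))           # the truncated terms sum to 1
```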
5.4 OTHER DISCRETE PROBABILITY DISTRIBUTIONS

The binomial and Poisson distributions are both examples of discrete probability distributions. For the moss example of the previous section (Table 5.5) this means that there is either no moss plant per quadrat, or one moss plant, two plants, three plants, and so forth, but no values in between. Other discrete distributions are known in probability theory, and in recent years many have been suggested as suitable for one application or another. As a beginner, you need
50) samples (see Box 6.3, part II). In smaller samples a difference of one item per class would make a substantial difference in the cumulative percentage in the tails. For small samples (<50) the method of ranked normal deviates, or rankits, is preferable. With this method, instead of quantiles we use the ranks of each observation in the sample, and instead of NEDs we plot values from a table of rankits, the average positions
Box 6.3 GRAPHIC TEST FOR NORMALITY OF SAMPLES BY MEANS OF NORMAL QUANTILE PLOTS.
I. Routine computer-processed samples

Computation
1. Array samples in order of increasing magnitude of variates.
2. For each variate compute the quantity p = (i − ½)/n, where i is the rank order of the ith variate in the array and n is the sample size. These values will be used for computing NEDs. The correction of ½ prevents the last variate from yielding p = 1.0, for which the NED would be positive infinity. For tied values of Y calculate an average p.
3. For each value of p evaluate the corresponding NED. If no computer program is available, the NEDs can be looked up in a table of the inverse cumulative normal distribution (e.g., Table 4 in Pearson and Hartley, 1958, or Table 1.2 in Owen, 1962), or by inverse interpolation in Statistical Table A.
4. Plot the NEDs against the original variates and examine the scatterplot visually for linearity. Some programs also plot the straight line expected when the sample is perfectly normally distributed, to serve as a benchmark against which the observed scatterplot can be judged. Alternatively, a straight line is fitted to the points by eye, preferably using a transparent plastic ruler, which permits all the points to be seen as the line is drawn. In drawing the line, most weight should be given to the points between cumulative frequencies of 25% to 75%, because a difference of a single item may make appreciable changes in the percentages at the tails. Some programs plot NEDs along the abscissa and the variable along the ordinate. We prefer the more common arrangement shown in Figure 6.6 and the figures in this box.
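Steps 1 through 3 are easily carried out by computer; the sketch below (with made-up variates) uses Python's inverse cumulative normal distribution in place of the published tables:

```python
from statistics import NormalDist

# Sketch of steps 1-3: p = (i - 1/2)/n for the i-th ordered variate, then the
# NED is the inverse cumulative standard normal distribution evaluated at p.
# The sample values here are made up for illustration.
sample = [4.1, 3.8, 4.6, 4.3, 3.9, 4.4, 4.0, 4.2]
n = len(sample)
inv_cdf = NormalDist().inv_cdf        # standard normal: mean 0, sd 1

neds = []
for i, y in enumerate(sorted(sample), start=1):
    p_val = (i - 0.5) / n             # the 1/2 correction keeps p below 1.0
    ned = inv_cdf(p_val)              # normal equivalent deviate
    neds.append(ned)
    print(f"Y = {y}  p = {p_val:.4f}  NED = {ned:+.3f}")
```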
Figure A shows 1400 housefly wing lengths randomly sampled from Table 6.1 as a normal quantile plot. Since these are approximately normal data, it is no surprise that the scatterplot forms a nearly straight line.
FIGURE A Normal quantile plot of 1400 housefly wing lengths randomly sampled from Table 6.1. (Abscissa: housefly wing lengths; ordinate: normal quantiles.)
II. Large frequency distributions

When the sample is large, plotting every observation may not be worthwhile, and the normal quantile plot can be applied to the frequency distribution as follows. We employ the by now thoroughly familiar Chinese birth weights.

Birth weights of male Chinese in ounces, from Box 4.3.
(1)        (2)        (3)        (4)            (5)
Class      Upper                 Cumulative
mark       class                 frequencies
Y          limit      f          F              p = (F − ½)/n
59.5       63.5       2          2              0.0002
67.5       71.5       6          8              0.0008
75.5       79.5       39         47             0.0049
83.5       87.5       385        432            0.0456
91.5       95.5       888        1320           0.1394
99.5       103.5      1729       3049           0.3221
107.5      111.5      2240       5289           0.5587
115.5      119.5      2007       7296           0.7708
123.5      127.5      1233       8529           0.9011
131.5      135.5      641        9170           0.9688
139.5      143.5      201        9371           0.9900
147.5      151.5      74         9445           0.9978
155.5      159.5      14         9459           0.9993
163.5      167.5      5          9464           0.9998
171.5      175.5      1          9465           0.9999
Computation
1. Prepare a frequency distribution as shown in columns (1), (2), and (3).
2. Form a cumulative frequency distribution as shown in column (4). It is obtained by successive summation of the frequency values. In column (5) express the cumulative frequencies as p-values, using the formula in step 2 of part I.
3. Graph the NEDs (quantiles) corresponding to these p-values against the upper class limit of each class (Figure B). Notice that the upper frequencies deviate to the right of the straight line. This is typical of data that are skewed to the right (see Figure 6.6C).
4. The graph from step 3 (and those described in the other parts of this box) can also be used for rapid estimation of the mean and standard deviation of a sample. The mean is approximated by a graphic estimation of the median. The more normal the distribution is, the closer the median will be to the mean. The median is estimated by dropping a perpendicular from the intersection of the 0 point on the ordinate and the cumulative frequency curve to the abscissa (see Figure B). The estimate of the mean, 110.7 oz, is quite close to the computed mean of 109.9 oz.

FIGURE B Graphic analysis of the birth weight data. (Abscissa: birth weights of Chinese males (oz); ordinate: normal equivalent deviates.)
5. The standard deviation is estimated by dropping similar perpendiculars from the intersections of the −1 and the +1 points with the cumulative curve, respectively. These points enclose the portion of a normal curve represented by μ ± σ. By measuring the difference between these perpendiculars and dividing this value by 2, we obtain an estimate of one standard deviation. In this instance the estimate is s = 13.6, since the difference is 27.2 oz, divided by 2. This is a close approximation to the computed value of 13.59 oz.
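For part II, the same computation applied to the cumulative frequencies of the birth-weight distribution reproduces column (5); a sketch:

```python
from statistics import NormalDist

# Part II sketch for the Chinese birth weights: cumulative frequencies F are
# converted to p = (F - 1/2)/n and then to NEDs, to be plotted against the
# upper class limits of column (2).
freqs = [2, 6, 39, 385, 888, 1729, 2240, 2007, 1233, 641, 201, 74, 14, 5, 1]
n = sum(freqs)                        # 9465 birth weights in all

inv_cdf = NormalDist().inv_cdf
cum, p_vals, neds = 0, [], []
for f in freqs:
    cum += f                          # running cumulative frequency F
    p = (cum - 0.5) / n               # same 1/2 correction as in part I
    p_vals.append(p)
    neds.append(inv_cdf(p))
    print(f"F = {cum:5d}  p = {p:.4f}  NED = {inv_cdf(p):+.3f}")
```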
III. Small samples (n ≤ 50)

Femur lengths of aphid stem mothers from Box 2.1. n = 25.

(1)    (2)        (3)       |  (1)    (2)        (3)
       Rankits    Rankits   |         Rankits    Rankits
       from       allowing  |         from       allowing
Y      table      for ties  |  Y      table      for ties
3.3    −1.97                |  4.1     0.10
3.5    −1.52                |  4.2     0.21       0.26
3.6    −1.26     −1.00      |  4.2     0.31       0.26
3.6    −1.07     −1.00      |  4.3     0.41       0.58
3.6    −0.91     −1.00      |  4.3     0.52       0.58
3.6    −0.76     −1.00      |  4.3     0.64       0.58
3.8    −0.64     −0.52      |  4.3     0.76       0.58
3.8    −0.52     −0.52      |  4.4     0.91       1.08
3.8    −0.41     −0.52      |  4.4     1.07       1.08
3.9    −0.31     −0.16      |  4.4     1.26       1.08
3.9    −0.21     −0.16      |  4.5     1.52
3.9    −0.10     −0.16      |  4.7     1.97
3.9     0.00     −0.16      |
Computation
1. Enter the sample arrayed in increasing order of magnitude in column (1). In column (2) put the corresponding rankits from Statistical Table HH. The table gives only the rankits for the half of each distribution greater than the median for any sample size. The other half is the same but negative in sign. All samples containing an odd number of variates (such as this one) have 0 at the median value. The rankits for this example are looked up under sample size n = 25.
A special problem illustrated in this example is the case of ties, or variates of identical magnitude. In such a case we sum the rankit values for the corresponding ranks and find their mean. Thus the −1.00 occupying lines 3 to 6 in column (3) is the