Sokal, Rohlf 1995 (applied statistical methods)
BIOMETRY
THE PRINCIPLES AND PRACTICE OF STATISTICS IN BIOLOGICAL RESEARCH
THIRD EDITION

Robert R. Sokal and F. James Rohlf
State University of New York at Stony Brook

W. H. FREEMAN AND COMPANY
New York

CONTENTS

PREFACE
NOTES ON THE THIRD EDITION

1 INTRODUCTION
1.1 Some Definitions
1.2 The Development of Biometry
1.3 The Statistical Frame of Mind

2 DATA IN BIOLOGY
2.1 Samples and Populations
2.2 Variables in Biology
2.3 Accuracy and Precision of Data
2.4 Derived Variables
2.5 Frequency Distributions

3 THE HANDLING OF DATA
3.1 Computers
3.2 Software
3.3 Efficiency and Economy in Data Processing

4 DESCRIPTIVE STATISTICS
4.1 The Arithmetic Mean
4.2 Other Means
4.3 The Median
4.4 The Mode
4.5 The Range
4.6 The Standard Deviation

Library of Congress Cataloging-in-Publication Data

Sokal, Robert R.
Biometry: the principles and practice of statistics in biological research / Robert R. Sokal and F. James Rohlf. 3rd ed.
Includes bibliographical references (p. 850) and index.
1. Biometry. I. Rohlf, F. James, 1936- . II. Title.

© 1995, 1981, 1969 by W. H. Freeman and Company

No part of this book may be reproduced by any mechanical, photographic, or electronic process, or in the form of a phonograph recording, nor may it be stored in a retrieval system, transmitted, or otherwise copied for public or private use, without written permission from the publisher.

Printed in the United States of America

To our parents of blessed memory
Klara and Siegfried Sokal
Harriet and Gilbert Rohlf

13 ASSUMPTIONS OF ANALYSIS OF VARIANCE
13.1 A Fundamental Assumption
13.2 Independence
13.3 Homogeneity of Variances
13.4 Normality
13.5 Additivity
13.6 Transformations
13.7 The Logarithmic Transformation
13.8 The Square-Root Transformation
13.9 The Box-Cox Transformation
13.10 The Arcsine Transformation
13.11 Nonparametric Methods in Lieu of Single-Classification Anovas
13.12 Nonparametric Methods in Lieu of Two-Way Anova

14 LINEAR REGRESSION
14.1 Introduction to Regression
14.2 Models in Regression
14.3 The Linear Regression Equation
14.4 Tests of Significance in Regression
14.5 More Than One Value of Y for Each Value of X
14.6 The Uses of Regression
14.7 Estimating X from Y
14.8 Comparing Regression Lines
14.9 Analysis of Covariance
14.10 Linear Comparisons in Anovas
14.11 Examining Residuals and Transformations in Regression
14.12 Nonparametric Tests for Regression
14.13 Model II Regression

15 CORRELATION
15.1 Correlation and Regression
15.2 The Product-Moment Correlation Coefficient
15.3 The Variance of Sums and Differences
15.4 Computing the Product-Moment Correlation Coefficient
15.5 Significance Tests in Correlation
15.6 Applications of Correlation
15.7 Principal Axes and Confidence Regions
15.8 Nonparametric Tests for Association

16 MULTIPLE AND CURVILINEAR REGRESSION
16.1 Multiple Regression: Computation
16.2 Multiple Regression: Significance Tests
16.3 Path Analysis
16.4 Partial and Multiple Correlation
16.5 Choosing Predictor Variables
16.6 Curvilinear Regression
16.7 Advanced Topics in Regression and Correlation

17 ANALYSIS OF FREQUENCIES
17.1 Introduction to Tests for Goodness of Fit
17.2 Single-Classification Tests for Goodness of Fit
17.3 Replicated Tests of Goodness of Fit
17.4 Tests of Independence: Two-Way Tables
17.5 Analysis of Three-Way and Multiway Tables
17.6 Analysis of Proportions
17.7 Randomized Blocks for Frequency Data

18 MISCELLANEOUS METHODS
18.1 Combining Probabilities From Tests of Significance
18.2 Tests for Randomness of Nominal Data: Runs Tests
18.3 Randomization Tests
18.4 The Jackknife and the Bootstrap
18.5 The Future of Biometry: Data Analysis

APPENDIX: MATHEMATICAL PROOFS
BIBLIOGRAPHY
AUTHOR INDEX
SUBJECT INDEX

4.7 Sample Statistics and Parameters
4.8 Coding Data Before Computation
4.9 Computing Means and Standard Deviations
4.10 The Coefficient of Variation

5 INTRODUCTION TO PROBABILITY DISTRIBUTIONS: BINOMIAL AND POISSON
5.1 Probability, Random Sampling, and Hypothesis Testing
5.2 The Binomial Distribution
5.3 The Poisson Distribution
5.4 Other Discrete Probability Distributions

6 THE NORMAL PROBABILITY DISTRIBUTION
6.1 Frequency Distributions of Continuous Variables
6.2 Properties of the Normal Distribution
6.3 A Model for the Normal Distribution
6.4 Applications of the Normal Distribution
6.5 Fitting a Normal Distribution to Observed Data
6.6 Skewness and Kurtosis
6.7 Graphic Methods
6.8 Other Continuous Distributions

7 ESTIMATION AND HYPOTHESIS TESTING
7.1 Distribution and Variance of Means
7.2 Distribution and Variance of Other Statistics
7.3 Introduction to Confidence Limits
7.4 The t-Distribution
7.5 Confidence Limits Based on Sample Statistics
7.6 The Chi-Square Distribution
7.7 Confidence Limits for Variances
7.8 Introduction to Hypothesis Testing
7.9 Tests of Simple Hypotheses Using the Normal and t-Distributions
7.10 Testing the Hypothesis H0: σ² = σ0²

8 INTRODUCTION TO THE ANALYSIS OF VARIANCE
8.1 Variances of Samples and Their Means
8.2 The F-Distribution
8.3 The Hypothesis H0: σ1² = σ2²
8.4 Heterogeneity Among Sample Means
8.5 Partitioning the Total Sum of Squares and Degrees of Freedom
8.6 Model I Anova
8.7 Model II Anova

9 SINGLE-CLASSIFICATION ANALYSIS OF VARIANCE
9.1 Computational Formulas
9.2 General Case: Unequal n
9.3 Special Case: Equal n
9.4 Special Case: Two Groups
9.5 Special Case: A Single Specimen Compared With a Sample
9.6 Comparisons Among Means: Planned Comparisons
9.7 Comparisons Among Means: Unplanned Comparisons
9.8 Finding the Sample Size Required for a Test

10 NESTED ANALYSIS OF VARIANCE
10.1 Nested Anova: Design
10.2 Nested Anova: Computation
10.3 Nested Anovas With Unequal Sample Sizes
10.4 The Optimal Allocation of Resources

11 TWO-WAY ANALYSIS OF VARIANCE
11.1 Two-Way Anova: Design
11.2 Two-Way Anova With Equal Replication: Computation
11.3 Two-Way Anova: Significance Testing
11.4 Two-Way Anova Without Replication
11.5 Paired Comparisons
11.6 Unequal Subclass Sizes
11.7 Missing Values in a Randomized-Blocks Design

12 MULTIWAY ANALYSIS OF VARIANCE
12.1 The Factorial Design
12.2 A Three-Way Factorial Anova
12.3 Higher-Order Factorial Anovas
12.4 Other Designs
12.5 Anovas by Computer

DATA IN BIOLOGY

In Section 2.1 we explain the statistical meaning of "sample" and "population," terms used throughout this book. Then we come to the types of observations obtained from biological research material, with which we shall perform the computations in the rest of this book (Section 2.2).
In obtaining data we shall run into the problem of the degree of accuracy necessary for recording the data. This problem and the procedure for rounding off figures are discussed in Section 2.3, after which we will be ready to consider in Section 2.4 certain kinds of derived data, such as ratios and indices, frequently used in biological science, which present peculiar problems with respect to their accuracy and distribution. Knowing how to arrange data as frequency distributions is important, because such arrangements permit us to get an overall impression of the shape of the variation present in a sample. Frequency distributions, as well as the presentation of numerical data, are discussed in the last section (2.5) of this chapter.

2.1 SAMPLES AND POPULATIONS

We shall now define a number of important terms necessary for an understanding of biological data. The data in a biometric study are generally based on individual observations, which are observations or measurements taken on the smallest sampling unit. These smallest sampling units frequently, but not necessarily, are also individuals in the ordinary biological sense. If we measure weight in 100 rats, then the weight of each rat is an individual observation; the hundred rat weights together represent the sample of observations, defined as a collection of individual observations selected by a specified procedure. In this instance, one individual observation is based on one individual in a biological sense, that is, one rat. However, if we had studied weight in a single rat over a period of time, the sample of individual observations would be all the weights recorded on this rat at successive times. In a study of temperature in ant colonies, where each colony is a basic sampling unit, each temperature reading for one colony is an individual observation, and the sample of observations is the temperatures for all the colonies considered.
An estimate of the DNA content of a single mammalian sperm cell is an individual observation, and the corresponding sample of observations is the estimates of DNA content of all other sperm cells studied in one individual mammal. A synonym for individual observation is "item."

Up to now we have carefully avoided specifying the particular variable being studied because "individual observation" and "sample of observations" as we just used them define only the structure but not the nature of the data in a study. The actual property measured by the individual observations is the variable, or character. The more common term employed in general statistics is variable. In evolutionary and systematic biology, however, character is frequently used synonymously. More than one variable can be measured on each smallest sampling unit. Thus, in a group of 25 mice we might measure the blood pH and the erythrocyte count. The mouse (a biological individual) would be the smallest sampling unit; blood pH and cell count would be the two variables studied. In this example the pH readings and cell counts are individual observations, and two samples of 25 observations (on pH and on erythrocyte count) would result. Alternatively, we may call this example a bivariate sample of 25 observations, each referring to a pH reading paired with an erythrocyte count.

Next we define population. The biological definition of this term is well known: It refers to all the individuals of a given species (perhaps of a given life history stage or sex) found in a circumscribed area at a given time. In statistics, population always means the totality of individual observations about which inferences are to be made, existing anywhere in the world or at least within a definitely specified sampling area limited in space and time.
If you take five humans and study the number of leucocytes in their peripheral blood and you are prepared to draw conclusions about all humans from this sample of five, then the population from which the sample has been drawn represents the leucocyte counts of all humankind, that is, all extant members of the species Homo sapiens. If, on the other hand, you restrict yourself to a more narrowly specified sample, such as five male Chinese, aged 20, and you are restricting your conclusions to this particular group, then the population from which you are sampling will be leucocyte numbers of all Chinese males of age 20. The population in this statistical sense is sometimes referred to as the universe.

A population may refer to variables of a concrete collection of objects or creatures, such as the tail lengths of all the white mice in the world, the leucocyte counts of all the Chinese men in the world of age 20, or the DNA contents of all the hamster sperm cells in existence, or it may refer to the outcomes of experiments, such as all the heartbeat frequencies produced in guinea pigs by injections of adrenalin. In the first three cases the population is finite. Although in practice it would be impossible to collect, count, and examine all white mice, all Chinese men of age 20, or all hamster sperm cells in the world, these populations are finite. Certain smaller populations, such as all the whooping cranes in North America or all the pocket gophers in a given colony, may lie within reach of a total census. By contrast, an experiment can be repeated an infinite number of times (at least in theory). An experiment such as the administration of adrenalin to guinea pigs could be repeated as long as the experimenter could obtain material and his or her health and patience held out. The sample of experiments performed is a sample from an infinite number that could be performed.
Some of the statistical methods to be developed later make a distinction between sampling from finite and from infinite populations. However, although populations are theoretically finite in most applications in biology, they are generally so much larger than samples drawn from them that they can be considered de facto infinitely sized populations.

2.2 VARIABLES IN BIOLOGY

Enumerating all the possible kinds of variables that can be studied in biological research would be a hopeless task. Each discipline has its own set of variables, which may include conventional morphological measurements, concentrations of chemicals in body fluids, rates of certain biological processes, frequencies of certain events (as in genetics and radiation biology), physical readings of optical or electronic machinery used in biological research, and many more. We assume that persons reading this book already have special interests and have become acquainted with the methodologies of research in their areas of interest, so that the variables commonly studied in their fields are at least somewhat familiar. In any case, the problems for measurement and enumeration must suggest themselves to the researcher; statistics will not, in general, contribute to the discovery and definition of such variables.

Some exception must be made to this statement. Once a variable has been chosen, statistical analysis may demonstrate it to be unreliable. If several variables are studied, certain elaborate procedures of multivariate analysis can assign weights to them, indicating their value for a given procedure. For example, in taxonomy and various other applications, the method of discriminant functions can identify the combination of a series of variables that best distinguishes between two groups (see Section 16.7).
Other multivariate techniques, such as canonical variates analysis, principal components analysis, or factor analysis, can specify characters that best represent or summarize certain patterns of variation (Krzanowski, 1988; Jackson, 1991). As a general rule, however, and particularly within the framework of this book, choosing a variable as well as defining the problem to be solved is primarily the responsibility of the biological researcher.

A more precise definition of variable than the one given earlier is desirable. It is a property with respect to which individuals in a sample differ in some ascertainable way. If the property does not differ within a sample at hand or at least among the samples being studied, it cannot be of statistical interest. Being entirely uniform, such a property would also not be a variable from the etymological point of view and should not be so called. Length, height, weight, number of teeth, vitamin C content, and genotypes are examples of variables in ordinary, genetically and phenotypically diverse groups of organisms. Warm-bloodedness in a group of mammals is not a variable because they are all alike in this regard, although body temperature of individual mammals would, of course, be a variable. Also, if we had a heterogeneous group of animals, of which some were homeothermic and others were not, then body temperature regulation (with its two states or forms of expression, "warm-blooded" and "cold-blooded") would be a variable.

We can classify variables as follows:

Variables
    Measurement variables
        Continuous variables
        Discontinuous variables
    Ranked variables
    Attributes

Measurement variables are those whose differing states can be expressed in a numerically ordered fashion. There are two types of measurement variables: continuous and discontinuous. Continuous variables at least theoretically can assume an infinite number of values between any two fixed points.
For example, between the two length measurements 1.5 cm and 1.6 cm an infinite number of lengths could be measured if one were so inclined and had a measuring instrument with sufficiently precise calibration. Any given reading of a continuous variable, such as a length of 1.57 cm, is an approximation to the exact reading, which in practice cannot be known. For purposes of computation, however, these approximations are usually sufficient and, as will be seen below, may be made even more approximate by rounding. Many of the variables studied in biology are continuous. Examples are length, area, volume, weight, angle, temperature, period of time, percentage, and rate.

Discontinuous variables, also known as meristic variables (the term we use in this book) or discrete variables, are variables that have only certain fixed numerical values, with no intermediate values possible. The number of segments in a certain insect appendage, for instance, may be 4 or 5 or 6, but never 5½ or 4.3. Examples of discontinuous variables are number of a certain structure (such as segments, bristles, teeth, or glands), number of offspring, number of colonies of microorganisms or animals, or number of plants in a given quadrat.

A word of caution: not all variables restricted to integral numerical values are meristic. An example will illustrate this point. If an animal behaviorist were to code the reactions of animals in a series of experiments as (1) very aggressive, (2) aggressive, (3) neutral, (4) submissive, and (5) very submissive, we might be tempted to believe that these five different states of the variable were meristic because they assume integral values.
However, they are clearly only arbitrary points (class marks; see Section 2.5) along a continuum of aggressiveness; the only reason that no values such as 1.5 occur is that the experimenter did not wish to subdivide the behavior classes too finely, either for convenience or because of an inability to determine more than five subdivisions of this spectrum of behavior with precision. Thus, this variable is clearly continuous rather than meristic, as it might have appeared at first sight.

Some variables cannot be measured but at least can be ordered or ranked by their magnitude. Thus, in an experiment one might record the rank order of emergence of ten pupae without specifying the exact time at which each pupa emerged. In such a case we would code the data as a ranked variable, the order of emergence. Special methods for dealing with ranked variables have been developed, and several are described in this book. By expressing a variable as a series of ranks, such as 1, 2, 3, 4, 5, we do not imply that the difference in magnitude between, say, ranks 1 and 2 is identical to or even proportional to the difference between ranks 2 and 3.

[Box 2.1: Preparation of a frequency distribution and grouping into fewer classes with wider class intervals. Twenty-five femur lengths of stem mothers of the aphid Pemphigus; the original measurements are tallied into a frequency distribution, which is then regrouped into fewer classes with wider class intervals.]

If the original data provide us with fewer
classes than we think we should have, then nothing can be done if the variable is meristic, since this is the nature of the data in question. With a continuous variable, however, a scarcity of classes indicates that we probably have not made our measurements with sufficient precision. If we had followed the rule on number of significant figures stated in Section 2.3, this could not have happened.

Whenever there are more than the desired number of classes, grouping should be undertaken. When the data are meristic, the implied limits of continuous variables are meaningless. Yet with many meristic variables, such as a bristle number varying from 13 to 81, it would probably be wise to group them into classes, each containing several counts. This grouping can best be done by using an odd number as a class interval so that the class mark representing the data is a whole rather than a fractional number. Thus if we were to group the bristle numbers 13, 14, 15, and 16 into one class, the class mark would have to be 14.5, a meaningless value in terms of bristle number. It would therefore be better to use a class ranging over 3 bristles or 5 bristles, giving the integral values 14 or 15 as class marks.

Grouping data into frequency distributions was necessary when computations were done by pencil and paper or with mechanical calculators. Nowadays even thousands of variates can be processed efficiently by computer without prior grouping. However, frequency distributions are still extremely useful as a tool for data analysis, especially in an age when it is all too easy for a researcher to obtain a numerical result from a computer program without ever really examining the data for outliers (extreme values) or for other ways in which the sample may not conform to the assumptions of the statistical methods. For this reason most modern statistical computer programs furnish some graphic output of the frequency distribution of the observations.
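The odd-interval rule above is easy to verify mechanically. The short Python sketch below (the helper name and the bristle counts are ours, for illustration only) groups hypothetical bristle counts into classes of a chosen interval and reports each class with its class mark and tally; an even interval such as 4 produces the fractional mark 14.5, while an odd interval such as 3 keeps every mark integral.

```python
from collections import Counter

def grouped_distribution(counts, lowest, interval):
    """Group integer counts into classes of width `interval`, returning
    (lower limit, upper limit, class mark, frequency) for each class."""
    freq = Counter(counts)
    classes = []
    lo = lowest
    while lo <= max(counts):
        hi = lo + interval - 1                       # class spans `interval` integers
        mark = (lo + hi) / 2                         # class mark = midpoint of the class
        n = sum(freq[v] for v in range(lo, hi + 1))  # tally for this class
        classes.append((lo, hi, mark, n))
        lo = hi + 1
    return classes

# Hypothetical bristle counts:
bristles = [13, 14, 14, 15, 16, 16, 16, 17, 18, 19, 20, 21]

for lo, hi, mark, n in grouped_distribution(bristles, 13, 3):
    print(f"{lo}-{hi}  mark {mark:5.1f}  {'#' * n}")
# With interval 3 every class mark (14.0, 17.0, 20.0) is a whole number;
# rerun with interval 4 and the first mark becomes the meaningless 14.5.
```

The tally column of `#` characters is a crude text version of the graphic output of the frequency distribution that the paragraph above recommends inspecting.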
An alternative to setting up a frequency distribution with tally marks as shown in Box 2.1 is the so-called stem-and-leaf display suggested by Tukey (1977). The advantage of this technique is that it not only results in a frequency distribution of the variates of a sample, but also presents them in a form that makes a ranked (ordered) array very easy to construct. It also effectively creates a list of these values, and is very easy to check, unlike the tallies, which can be checked only by repeating the procedure. This technique is therefore useful in computing the median of a sample (see Section 4.3) and in computing various nonparametric statistics that require ordered arrays of the sample variates (see Sections 13.11 and 13.12).

Let us first learn how to construct the stem-and-leaf display. In Box 15.6 we feature measurements of total length recorded for a sample of 15 aphid stem mothers. The unordered measurements are reproduced here: 8.7, 8.5, 9.4, 10.0, 6.3, 7.8, 11.9, 6.5, 6.6, 10.6, 10.2, 7.2, 8.6, 11.1, 11.6. To prepare a stem-and-leaf display we write down the leading digit or digits of the variates in the sample to the left of a vertical line (the "stem") as shown below; we then put the next digit of the first variate (a "leaf") at that level of the stem corresponding to its leading digit(s).

Step 1       Completed array   Ordered array
 6 |          6 | 356           6 | 356
 7 |          7 | 82            7 | 28
 8 | 7        8 | 756           8 | 567
 9 |          9 | 4             9 | 4
10 |         10 | 062          10 | 026
11 |         11 | 916          11 | 169

The first observation in our sample is 8.7. We therefore place a 7 next to the 8. The next variate is 8.5. It is entered by finding the stem level for the leading digit 8 and recording a 5 next to the 7 that is already there.
Similarly for the third variate, 9.4, we record a 4 next to the 9, and so on until all 15 variates have been entered (as "leaves") in sequence along the appropriate leading digits of the stem. Finally, we order the leaves from smallest (0) to largest (9).

The ordered array is equivalent to a frequency distribution and has the appearance of a histogram or bar diagram (see below), but it also is an efficient ordering of the variates. Thus from the ordered array it becomes obvious that the appropriate ordering of the 15 variates is: 6.3, 6.5, 6.6, 7.2, 7.8, 8.5, 8.6, 8.7, 9.4, 10.0, 10.2, 10.6, 11.1, 11.6, 11.9. The median, the observation having an equal number of variates on either side, can easily be read off the stem-and-leaf display. It is 8.7.

The computational advantage of this procedure is that it orders the variates by their leading digits in a single pass through the data set, relying on further ordering of the trailing digits in each leaf by eye, whereas a direct ordering technique could require up to n passes through the data before all items are correctly arrayed. A FORTRAN program for preparing stem-and-leaf displays is given by McNeil (1977). If most variates are expressed in three digits, the leading digits (to the left of the stem) can be expressed as two-digit numbers (as was done for the last two lines in the example above), or two-digit values can be displayed as leaves to the right of the stem. In the latter case, however, the two-digit leaves would have to be enclosed in parentheses or separated by commas to prevent confusion. Thus if the observations ranged from 1.17 to 8.53, for example, the leading digits to the left of the stem could range from 1 to 8, and the leaves of one class, say 7, might read 26, 31, 47, corresponding to 7.26, 7.31, and 7.47.

In biometric work we frequently wish to compare two samples to see if they differ.
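The single-pass construction just described is straightforward to code. Here is a minimal Python sketch (our own, not the McNeil FORTRAN program cited above) that builds the ordered stem-and-leaf display for the 15 aphid measurements and then reads the median off the resulting ordered array:

```python
from collections import defaultdict

def stem_and_leaf(data):
    """Single pass: file each trailing digit (leaf) under its leading digit(s) (stem)."""
    stems = defaultdict(list)
    for x in data:
        stem, leaf = divmod(round(x * 10), 10)   # 8.7 -> stem 8, leaf 7
        stems[stem].append(leaf)
    # Order the leaves within each stem, and the stems themselves:
    return {s: sorted(leaves) for s, leaves in sorted(stems.items())}

aphids = [8.7, 8.5, 9.4, 10.0, 6.3, 7.8, 11.9, 6.5, 6.6,
          10.6, 10.2, 7.2, 8.6, 11.1, 11.6]

display = stem_and_leaf(aphids)
for stem, leaves in display.items():
    print(f"{stem:2d} | {''.join(map(str, leaves))}")

# The ordered array falls out of the display; its middle value is the median.
ordered = [s + l / 10 for s, leaves in display.items() for l in leaves]
print("median:", ordered[len(ordered) // 2])   # median: 8.7
```

Python dicts preserve insertion order, so sorting the stems once is enough to make both the printed display and the derived ordered array come out in rank order, mirroring the by-eye procedure in the text.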
To compare two samples, we may employ back-to-back stem-and-leaf displays, illustrated here:

[Back-to-back stem-and-leaf display: a common stem with three classes (10, 11, 12), Sample A leaves extending to the left, Sample B leaves to the right.]

This example is taken from Box 13.7 and describes a morphological measurement obtained from two samples of chiggers. By setting up the stem in such a way that it may serve for both samples, we can easily compare the frequencies. Even though these data furnish only three classes for the stem, we can readily see that the samples differ in their distributions. Sample A has by far the higher readings.

When the shape of a frequency distribution is of particular interest, we may often wish to present the distribution in graphic form when discussing the results. This is generally done with frequency diagrams, of which there are two common types. For a distribution of meristic data we use a bar diagram, as shown in Figure 2.2 for the sedge data of Table 2.3. The abscissa represents the variable (in our case the number of plants per quadrat), and the ordinate represents the frequencies. What is important about such a diagram is that the bars do not touch each other, which indicates that the variable is not continuous. By contrast, continuous variables, such as the frequency distribution of the femur lengths of aphid stem mothers, are graphed as a histogram, in which the width of each bar along the abscissa represents a class interval of the frequency distribution and the bars touch each other to show that the actual limits of the classes are contiguous. The midpoint of the bar corresponds to the class mark. At the bottom of Box 2.1 are histograms of the frequency distribution of the aphid data, ungrouped and grouped. The height of the bars represents the frequency of each class. To illustrate that histograms are appropriate approximations to the continuous distributions found in nature, we may take a histogram and make the class intervals more narrow, producing more classes.
The histogram would then clearly fit more closely to a continuous distribution. We can continue this process until the class intervals approach the limit of infinitesimal width. At this point the histogram becomes the continuous distribution of the variable. Occasionally the class intervals of a grouped continuous frequency distribution are unequal. For instance, in a frequency distribution of ages, we might have more detail on the different stages of young individuals and less accurate identification of the ages of old individuals. In such cases, the class intervals for the older age groups would be wider, those for the younger age groups, narrower. In representations of such data, the bars of the histogram are drawn with different widths.

[Figure 2.3: Frequency polygon. Birth weights of 9465 male infants of Chinese third-class patients in Singapore, 1950 and 1951. Abscissa: birth weight (oz). (Data from Millis and Seng, 1954.)]

Figure 2.3 shows another graphic mode of representing a frequency distribution of a continuous variable, birth weight in infants. This is a frequency polygon, in which the heights of the class marks in a histogram are connected by straight lines. As we shall see later, the shapes of frequency distributions as seen in these various graphs can reveal much about the biological situations affecting a given variable.

EXERCISES 2

2.1 Differentiate between the following pairs of terms and give an example of each.
(a) Statistical and biological populations.
(b) Variate and individual.
(c) Accuracy and precision (repeatability).
(d) Class interval and class mark.
(e) Bar diagram and histogram.
(f) Abscissa and ordinate.

2.2 Round the following numbers to three significant figures: 106.85, 0.06819, 3.0495, 7815.01, 2.9149, and 20.1500. What are the implied limits before and after rounding?
Round these same numbers to one decimal place.

2.3 Given 200 measurements ranging from 1.32 to 2.95 mm, how would you group them into a frequency distribution? Give class limits as well as class marks.

2.4 Group the following 40 measurements of interorbital width of a sample of domestic pigeons into a frequency distribution and draw its histogram (data from Olson and Miller, 1958). Measurements are in millimeters. [The 40 values are illegible in this copy.]

2.5 How precisely should you measure the wing length of a species of mosquitoes in a study of geographic variation, if the smallest specimen has a length of about 2.8 mm and the largest a length of about 3.5 mm?

2.6 Transform the 40 measurements in Exercise 2.4 into common logarithms with your calculator and make a frequency distribution of these transformed variates. Comment on the resulting change in the pattern of the frequency distribution from that found before.

2.7 In Exercise 4.3 we feature 120 percentages of butterfat from Ayrshire cows. Make a frequency distribution of these values using a stem-and-leaf display. Prepare an ordered array of these variates from the display. Save the display for use later in Exercise 4.3.

THE HANDLING OF DATA

We have already stressed in the preface that the skillful and expeditious handling of data is essential to the successful practice of statistics. Since the emphasis of this book is to a large extent on practical statistics, it is necessary to acquaint readers with the various techniques available for carrying out statistical computations. We discuss computer hardware in Section 3.1, software in Section 3.2. In Section 3.3 we focus on the important question of which computational devices and types of software are most appropriate in terms of efficiency and economy.
Lacking mechanical computational aids, we would be reduced to carrying out statistical computations by the so-called pencil-and-paper methods. Textbooks of statistics used to contain extensive sections dealing with clever ways to make computation by hand feasible. At present, however, the use of shortcut or approximate computations by pencil-and-paper methods is a very inefficient use of time and energy. Even handheld calculators are now used much less than they were before, although the more advanced models are capable of performing many of the simpler analyses presented in this text. With microcomputers on most desktops and battery-powered portable computers, computation by hand or with calculators is used mostly to verify the initial results produced by computer.

The benefit from these developments is that researchers can concentrate on the more important problems of the proper design of an experiment and the interpretation of the statistical results rather than spending time on computational details and special tricks to make computations less tedious. We have eliminated explanation of all the special methods for hand or calculator computation and expect that this will make the material in this text easier to learn.

Another benefit of the routine use of computers is that more powerful, complex analyses now are usually feasible. Some of these methods are able to give more exact probability values for standard tests of significance. Other methods (see Chapter 18) enable one to make significance tests for nonstandard statistics or for situations in which the usual distributional assumptions are not likely to be true.

3.1 COMPUTERS

Since the publication of the first edition of Biometry, a revolution in the types of equipment available for performing statistical computations has occurred.
The once standard, electrically driven, mechanical desk calculators that were used for routine statistical calculations in biology have completely disappeared. They were replaced first by a wide variety of electronic calculators (ranging from pocket to desktop models). Except for very simple computations, their use has now been largely replaced by computers. Although some calculators are capable of performing t-tests, regression analysis, and similar calculations, they are now used most often just for making rough checks or for manipulating results from a computer. Calculators range from devices that can only add, subtract, multiply, and divide to scientific calculators with many mathematical and statistical functions built in. Their principal limitation is that they usually can store only limited amounts of data. This means that one must reenter the data in order to try an alternative analysis (e.g., with various transformations, with outliers removed, and so on). Reentry is very tedious and greatly increases the chance for error.

It is difficult to know where to draw the line between the more sophisticated electronic calculators and digital computers. There is a continuous range of capabilities from programmable calculators through palm-top microcomputers, notebook and laptop computers, desktop microcomputers, workstations, and minicomputers, to large-scale mainframe and supercomputers. The largest are usually located in a central computation center of a university or research laboratory.

Programmable calculators are usually programmed in languages unique to the manufacturer of the device. This means that considerable effort may be required to implement an analysis on a new device, which limits the development of software. Even so, collections of statistical programs (often donated by other users) are usually available. These collections do not, however, compare to the depth of libraries of programs available for most computers.
Although computers can be programmed in a "machine language" specific to each line of computers, programs are now usually written using standard high-level languages such as BASIC, C, FORTRAN, or Pascal, enabling the accumulation of large libraries of programs.

Computers consist of three main components: the central processing unit (which performs the calculations and controls the other components), the memory (which stores both data and instructions), and peripheral devices (which handle the input of data and instructions, the output of results, and perhaps intermediate storage of data and instructions). Different devices vary considerably in the capabilities of these three components. In a simple calculator, the processor is the arithmetic unit that adds, subtracts, multiplies, and divides, and the memory may consist of only a few registers, each capable of storing a 10-digit number. The only peripheral devices may be the keyboard for entering the data and a light-emitting diode (LED) or liquid crystal (LCD) display.

At present, standard microcomputers usually have 4 × 10⁶ to 8 × 10⁶ eight-bit numbers or characters (bytes) of main memory and 300 × 10⁶ bytes of disk storage, and can perform about 1.8 × 10⁶ average floating-point operations (arithmetic operations on numbers with a decimal place) per second. Large mainframe computers now have capabilities such as 500 × 10⁶ bytes of main memory and 2 × 10¹¹ bytes of secondary storage. They can process 800 × 10⁶ floating-point operations per second. There are also specialized computers consisting of many processors tied together to work on a single problem that can achieve even faster combined processing speeds. Of course, with the rapid developments in computing technology over the past few years, we expect these numbers to be out of date by the time this book is in print and to seem very small by the time this book is next revised. These figures are given just for comparison.
Because of aggressive marketing, most users are quite impressed with recent developments in microcomputers and workstations. The equally impressive development in supercomputers needed for the solution of many large-scale problems is not as visible, but the increasing speed of computation is important for many applications.

Computers perform no operations that could not, in principle, be done by hand. However, tasks that large-scale computers can perform in a matter of seconds might take thousands of years for humans to complete. Computer hardware is also very reliable. It is unlikely that a person working on a very long calculation would not make an error somewhere along the way unless time were allowed for independent checking of all results. It is now possible to solve problems that one would scarcely even have contemplated before computers were developed.

Generally speaking, the largest and fastest computers are the most efficient and economical for carrying out a given computation, but other factors often influence the decision of what computing hardware is used. Availability is, of course, very important. Because of their cost, very large-scale computers are located only at major universities or laboratories, and their use is shared among hundreds or thousands of authorized users. These computers can, however, be accessed easily from anywhere in the world via terminals or microcomputers attached to high-speed networks. Most statistical analyses require relatively little computation, so the capabilities of a large-scale computer are not needed. (There may be more overhead associated with making the connection to such a computer than with the actual arithmetic.) Often the most important factor is the availability of appropriate software, which is discussed in the next section.
3.2 SOFTWARE

The instructions to a computer are known as a program, and the persons who specialize in the writing of computer programs are known as programmers. Writing instructions in the generally unique language for each model of computer is a very tedious task and is now seldom done by programmers concerned with software written for statistical applications. Fortunately, programs called compilers have been developed to translate instructions written in general problem-oriented languages into the machine language for a particular computer. Some of the best-known compilers are for BASIC, C, FORTRAN, and Pascal. FORTRAN is one of the oldest computer languages and is still very popular for numerical calculations, including large-scale vector and parallel computations on many supercomputers. There are also programs called interpreters that compile and execute each instruction as it is entered, which can be very convenient for developing software that is changed often and adapted to different applications. The best-known interpreter is for the BASIC language.

Many new computer languages, which simplify the development of software for particular types of applications, are now available. There are, for example, several powerful systems with built-in commands for statistical data and matrix operations. Recently, systems have been developed that allow the user to write a program by pointing to objects (icons representing input devices or files, particular mathematical operations, statistical analyses, various types of graphs, etc.) on the computer screen and then drawing connections between them to indicate the flow of information between objects (e.g., from an input file, to a particular statistical analysis, to a graph of the results). Menus and dialogue boxes are used to specify various options for the properties of the objects. This object-oriented programming approach means that a user no longer has to be a programmer in order to create software appropriate for a certain specialized task. It does, however, make it essential for a researcher to be computer literate.
The material presented in this book consists of relatively standard statistical computations, most of which have been programmed as part of many different statistical packages. The BIOM-pc package of software for IBM PC compatibles was developed by one of us (F. J. Rohlf) to carry out the computations for most of the topics covered in this book. An older version, written in FORTRAN, can be adapted to run on other computers, including mini and mainframe computers. (These programs are available from Exeter Software, 100 North Country Road, Setauket, New York 11733.)

The specific steps used during computations on the computer depend on the particular computer and software used. At one time, when using centralized computer centers, one had to employ a batch mode of computation. In such an environment one left card decks of programs and data to be run when computer time was available; printed output was returned hours later or even the next day. Now large computer centers can be accessed via computer terminals (simple devices that consist of just a keyboard and a CRT screen and operate only under the control of the remote computer) or microcomputers that use software that emulates the properties of such a terminal. This new environment permits an interactive mode of computation in which the computer responds rapidly to each user input. The connections can be via directly connected lines, dial-up phone lines, or various types of network connections allowing convenient access to the resources of a large computer no matter where the computer is physically located. An important limitation of this mode of computing is the speed of communication between the user and the remote computer. Speed limits applications such as interactive graphics (real-time rotation of three-dimensional objects), as well as the practicality of applications such as word processing, in which the screen is reformatted as each character is typed and in which very little actual computation is performed. Although communication speeds are increasing rapidly, these operations are more efficient if performed locally.
This is one of the reasons for the popularity of powerful microcomputers and personal workstations, where it is practical to have convenient, easy-to-use, and elegant graphical user interfaces. Other reasons for the popularity of small, powerful computers are more psychological: The user generally feels more free to experiment because there are no usage charges, nor the formality of passwords, computer allocations, and logging on to a computer at a remote center. The most important factor in choosing a system should be simply the availability of software to perform the desired tasks. An enormous quantity of software for personal computers has been developed in recent years.

3.3 EFFICIENCY AND ECONOMY IN DATA PROCESSING

From the foregoing description it may seem to the reader that a computer should always be preferred over a calculator. This is usually the case, but simpler calculations can often be done more quickly on a calculator. (One may be done before most computers finish booting.) On the other hand, even simple computations often need to be checked or repeated in a slightly different way. It is very convenient not to have to reenter the input more than once and to have a printed record of all input and operations performed leading to a given result.

The difficult decision is not whether to use a computer but which computer and software to use. Software for most of the computations described in this book is available on most computers. Thus, the choice of hardware usually is determined by factors such as the type of computer already present in one's lab. The computations described in this book can be performed on, but do not require, large-scale computing systems (except, possibly, for very large randomization tests, Mantel tests, and other complex computations). It is no longer necessary or even efficient for most users to write their own software, since there is usually an overwhelming array of choices of software that can be used for statistical computations.
Not all software, however, is of the same quality. There can be different degrees of rounding error, or the software may sometimes produce erroneous results because of bugs in the program. It is a good idea to check a software package by using standard data sets (such as the examples in the boxes in this text) before running one's own data. One should also read published reviews of the software before purchasing it.

Most statistical software can be classified into one of the following major groups (some major packages combine the features of two or more groups): specialized single-operation programs, command language-driven programs, menu-driven programs, and spreadsheets. The earliest statistical software was developed for the batch-computing environment. The program would read a previously prepared file of data, perform a single type of analysis, and then output the results (which usually would include many possible analyses in case they might interest the user).

Alternatively, large statistical packages were developed that combined many types of analysis. With such a package, the user selects the particular operations desired using statements in an artificial command language unique to that software. This type of software requires time to set up the data and to learn the commands and various options needed for a particular analysis, but it is usually a very efficient mode for processing large datasets. These programs also usually offer the widest selection of options. User-friendly, menu-driven programs permit rapid data input and feature easy options for performing standard statistical analyses. Options can be conveniently selected from lists in menus or in dialogue boxes. Some packages combine the advantages of these last two modes by letting the user employ a menu system to prepare the command file needed to carry out the desired analysis.
Although spreadsheet programs are not often used for statistical calculations in biology, their very different mode of computation is often useful. They might be viewed as the computer generalization of the calculator. They simulate a large sheet of paper that has been divided into cells. Each cell can contain a value entered by the user or a result of a computation based on other cells in the spreadsheet. The operations used to produce a particular numerical result or graph are remembered by the program. If an input cell is changed by the user, the consequent changes in the results are immediately recomputed and displayed. This feature makes it very easy to experiment with data and see the results of alternative analyses.

A danger inherent in computer processing of data is that the user may simply obtain the final statistical results without observing the distribution of the variates (in fact, without even seeing the data if they are collected automatically by various data acquisition devices). One should take advantage of any options available to display the data that could lead to interesting new insights into their nature, or to the rejection of some outlying observations, or to suggestions that the data do not conform to the assumptions of a particular statistical test. Most statistical packages are capable of providing such graphics. We strongly urge research workers to make use of such operations.

Another danger is that it is too easy to blindly use whatever tests are provided with a particular program without understanding their meanings and assumptions (Searle, 1989). The availability of computers relieves the tedium of computation, but not the necessity to understand the methods being employed.

DESCRIPTIVE STATISTICS

An early and fundamental stage in any science is the descriptive stage. Until the facts can be described accurately, analysis of their causes is premature. The question what must come before how.
The application of statistics to biology has followed these general trends. Before Francis Galton could begin to think about the relations between the heights of fathers and those of their sons, he had to have adequate tools for measuring and describing heights in a population. Similarly, unless we know something about the usual distribution of the sugar content of blood in a population of guinea pigs, as well as its fluctuations from day to day and within days, we cannot ascertain the effect of a given dose of a drug upon this variable.

In a sizable sample, obtaining knowledge of the material by contemplating all the individual observations would be tedious. We need some form of summary to deal with the data in manageable form, as well as to share our findings with others in scientific talks and publications. A histogram or bar diagram of the frequency distribution is one type of summary. For most purposes, however, a numerical summary is needed to describe the properties of the observed frequency distribution concisely and accurately. Quantities providing such a summary are called descriptive statistics. This chapter will introduce you to some of them and show how they are computed.

Two kinds of descriptive statistics will be discussed in this chapter: statistics of location and statistics of dispersion. Statistics of location describe the position of a sample along a given dimension representing a variable. For example, we might like to know whether the sample variates measuring the length of certain animals lie in the vicinity of 2 cm or 20 cm. A statistic of location should yield a representative value for the sample of observations. However, such a statistic (sometimes also known as a measure of central tendency) does not describe the shape of a frequency distribution. This distribution may be long or very narrow; it may be humped or U-shaped; it may contain two humps, or it may be markedly asymmetrical.
Quantitative measures of such aspects of frequency distributions are required. To this end we need to define and study statistics of dispersion.

CHAPTER 4  DESCRIPTIVE STATISTICS

The arithmetic mean described in Section 4.1 is undoubtedly the most important single statistic of location, but others (the geometric mean, the harmonic mean, the median, and the mode) are mentioned briefly in Sections 4.2, 4.3, and 4.4. A simple statistic of dispersion, the range, is briefly noted in Section 4.5, and the standard deviation, the most common statistic for describing dispersion, is explained in Section 4.6. Our first encounter with contrasts between sample statistics and population parameters occurs in Section 4.7, in connection with statistics of location and dispersion. Section 4.8 contains a description of methods of coding data to simplify the computation of the mean and standard deviation, which is discussed in Section 4.9. The coefficient of variation (a statistic that permits us to compare the relative amount of dispersion in different samples) is explained in the last section (4.10).

For n > 30, the approximation C₄ ≈ 1 + [4(n − 1)]⁻¹ is sufficiently accurate.

4.8 CODING DATA BEFORE COMPUTATION

Coding the original data is a far less important subject at the present time, when most computations are carried out by computer, than in earlier years, when it was essential for carrying out most computations. By coding we mean the addition or subtraction of a constant number to the original data and/or the multiplication or division of these data by a constant. Data may need to be coded because they were originally expressed in too many digits or are very large numbers that may cause difficulties and errors during data handling. Coding can therefore simplify computation appreciably, and for certain techniques, such as polynomial regression (see Section 16.6), it can be very useful to reduce rounding error (Bradley and Srivastava, 1979).
The types of coding shown here are linear transformations of the variables. Persons using statistics should know the effects of such transformations on means, standard deviations, and any other statistics they intend to employ. Additive coding is the addition or subtraction of a constant (since subtraction is addition of a negative number). Similarly, multiplicative coding is the multiplication or division by a constant (since division is multiplication by the reciprocal of the divisor). Combination coding is the application of both additive and multiplicative coding to the same set of data. In Section A.2 of the appendix we examine the consequences of the three types of coding for computing means, variances, and standard deviations.

For the case of means, the formula for combination coding and decoding is the most generally applicable one. If the coded variable is Yc = D(Y + C), then

    Ȳ = Ȳc/D − C

where C is an additive code and D is a multiplicative code. Additive codes have no effect, however, on the sums of squares, variances, or standard deviations. The mathematical proof is given in Section A.2, but this can be seen intuitively because an additive code has no effect on the distance of an item from its mean. For example, the distance from an item of 15 to its mean of 10 would be 5. If we were to code the variates by subtracting a constant of 10, the item would now be 5 and the mean zero, but the difference between them would still be 5. Thus if only additive coding is employed, the only statistic in need of decoding is the mean. Multiplicative coding, on the other hand, does have an effect on sums of squares, variances, and standard deviations. The standard deviations have to be divided by the multiplicative code, just as had to be done for the mean; the sums of squares and variances have to be divided by the multiplicative code squared because they are squared terms, and the multiplicative factor becomes squared during the operations.
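These effects can be checked numerically. The following sketch uses invented sample values (not data from the text) to verify that additive coding leaves the standard deviation unchanged, that multiplicative coding scales it by the code, and that the combination-coded mean decodes as Ȳ = Ȳc/D − C:

```python
# Illustrative sketch (invented data): effects of additive and
# multiplicative coding on the mean and standard deviation.

def mean(data):
    return sum(data) / len(data)

def std_dev(data):
    # sample standard deviation with n - 1 in the denominator
    ybar = mean(data)
    return (sum((y - ybar) ** 2 for y in data) / (len(data) - 1)) ** 0.5

Y = [12.0, 15.0, 10.0, 13.0, 11.0]     # hypothetical variates

C, D = -10.0, 2.0                      # combination code: Yc = D*(Y + C)
Yc = [D * (y + C) for y in Y]

# Decoding the mean: Ybar = (mean of coded values)/D - C
assert abs((mean(Yc) / D - C) - mean(Y)) < 1e-9

# Additive coding alone leaves s unchanged; multiplicative coding scales it by D.
assert abs(std_dev([y + C for y in Y]) - std_dev(Y)) < 1e-9
assert abs(std_dev(Yc) - D * std_dev(Y)) < 1e-9
```

The assertions mirror the proofs in Section A.2: shifting all variates leaves deviations from the mean untouched, while multiplying them rescales both the mean and the deviations.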
In combination coding the additive code can be ignored. An example of coding and decoding data is shown in Box 4.3.

4.9 COMPUTING MEANS AND STANDARD DEVIATIONS

Three steps are necessary for computing the standard deviation: (1) finding Σy², the sum of squares; (2) dividing by n − 1 to give the variance; and (3) taking the square root of the variance to obtain the standard deviation. The procedure used to compute the sum of squares in Section 4.6 can be expressed by the following formula:

    Σy² = Σ(Y − Ȳ)²

When the data are unordered, the computation proceeds as in Box 4.2, which is based on the unordered aphid femur length data shown at the head of Box 2.1.

Occasionally original data are already in the form of a frequency distribution, or the person computing the statistics may want to avoid manual entry of large numbers of individual variates, in which case setting up a frequency distribution is also advisable. Data already arrayed in a frequency distribution speed up the computations considerably. An example is shown in Box 4.3. Hand calculations are simplified, and data entry into computers is less tedious (and hence there will be less chance for input errors), by coding to remove the awkward class marks. We coded each class mark in Box 4.3 by subtracting 59.5, the lowest class mark of the array. The resulting class marks are the values 0, 8, 16, 24, 32, and so on. Dividing these values by 8 changes them to 0, 1, 2, 3, 4, and so on, which is the desired format, shown in column (3). Details of the computation are given in Box 4.3.

Computer programs that compute the basic statistics Ȳ, s², s, and others that we have not yet discussed are furnished in many commercially available programs. The BIOM-pc program version 3 accepts raw, unordered observations as input, as well as data in the form of a frequency distribution.
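The frequency-distribution procedure can be sketched in a few lines. This sketch assumes the class marks and frequencies tabulated for the birth weights in Box 4.3 and uses the coding described above, (Y − 59.5)/8:

```python
# Sketch of the Box 4.3 computation: mean, standard deviation, and V
# from a frequency distribution, with coded class marks (Y - 59.5)/8.
# Class marks (oz) and frequencies are those of the Box 4.3 table.

marks = [59.5 + 8 * i for i in range(15)]
freqs = [2, 6, 39, 385, 888, 1729, 2240, 2007, 1233, 641, 201, 74, 14, 5, 1]

n = sum(freqs)                                   # 9465
coded = [(y - 59.5) / 8 for y in marks]          # 0, 1, 2, ..., 14

sum_fc = sum(f * c for f, c in zip(freqs, coded))
sum_fc2 = sum(f * c * c for f, c in zip(freqs, coded))

mean_c = sum_fc / n
var_c = (sum_fc2 - sum_fc ** 2 / n) / (n - 1)    # coded variance

mean_Y = mean_c * 8 + 59.5    # decode: multiply by D, restore the constant
s_Y = var_c ** 0.5 * 8        # additive code does not affect s

V = 100 * s_Y / mean_Y
print(round(mean_Y, 1), round(s_Y, 2), round(V, 2))   # 109.9 13.59 12.37
```

The decoded results reproduce the values quoted at the bottom of Box 4.3 (Ȳ = 109.9 oz, V = 12.37%).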
An approximate method for estimating statistics is useful when checking the results of calculations because it enables the detection of gross errors in computation. A simple method for estimating the mean is to average the largest and smallest observations to obtain the midrange. For the aphid stem mother data of Box 2.1, this value is (4.7 + 3.3)/2 = 4.0, which happens to fall almost exactly on the computed sample mean (but, of course, this will not be true of other data sets). Standard deviations can be estimated from ranges by appropriate division of the range:

    For samples of      divide the range by
         10                     3
         30                     4
        100                     5
        500                     6
       1000                     6.5

The range of the aphid data is 1.4. When this value is divided by 4 we get an estimate of the standard deviation of 0.35, which compares not too badly with the calculated value of 0.3657 in Box 4.2.

A more accurate procedure for estimating statistics is to use Statistical Table I, which furnishes the mean range for different sample sizes of a normal distribution (see Chapter 6) with a variance of one. When we divide the range of a sample by a mean range from Table I, we obtain an estimate of the standard deviation of the population from which the sample was taken.

BOX 4.2  CALCULATION OF Ȳ AND s FROM UNORDERED DATA.
Based on aphid femur length data, unordered, as shown at the head of Box 2.1.
Computation:
    n = 25    ΣY = 100.1    Ȳ = 4.004
    Σy² = ΣY² − (ΣY)²/n = 3.2096
    s² = Σy²/(n − 1) = 0.13373
    s = √0.13373 = 0.3657

BOX 4.3  CALCULATION OF Ȳ, s, AND V FROM A FREQUENCY DISTRIBUTION.
Birth weights of male Chinese in ounces (from Box 4.1).

    (1) Class mark, Y (oz)    (2) f        (3) Coded class mark, Yc
            59.5                  2              0
            67.5                  6              1
            75.5                 39              2
            83.5                385              3
            91.5                888              4
            99.5               1729              5
           107.5               2240              6
           115.5               2007              7
           123.5               1233              8
           131.5                641              9
           139.5                201             10
           147.5                 74             11
           155.5                 14             12
           163.5                  5             13
           171.5                  1             14
                           n = 9465

SOURCE: Millis and Seng (1954).

BOX 4.3  CONTINUED
    ΣfY = 1,040,199.5    Ȳ = ΣfY/n = 109.9
    Σy² = ΣfY² − (ΣfY)²/n
    s = 13.59
    V = (s/Ȳ) × 100 = (13.59/109.9) × 100 = 12.37%

Thus, for the aphid data we look up n = 25 in Table I and obtain 3.931.
We estimate s = 1.4/3.931 = 0.356, a value closer to the sample standard deviation than that obtained by the rougher method discussed above (which, however, is based on the same principle and assumption).

4.10 THE COEFFICIENT OF VARIATION

Having obtained the standard deviation as a measure of the amount of variation in the data, you may legitimately ask, What can I do with it? At this stage in our comprehension of statistical theory, nothing really useful comes of the computations we have carried out, although the skills learned are basic to all statistical work. So far, the only use that we might have for the standard deviation is as an estimate of the amount of variation in a population. Thus we may wish to compare the magnitudes of the standard deviations of similar populations to see whether population A is more or less variable than population B. When populations differ appreciably in their means, however, the direct comparison of their variances or standard deviations is less useful, since larger organisms usually vary more than smaller ones. For instance, the standard deviation of the tail lengths of elephants is obviously much greater than the entire tail length of a mouse. To compare the relative amounts of variation in populations having different means, the coefficient of variation, symbolized by V (or occasionally CV), has been developed. This coefficient is simply the standard deviation expressed as a percentage of the mean. Its formula is

    V = (s × 100)/Ȳ     (4.8)

For example, the coefficient of variation of the birth weights in Box 4.3 is 12.37%, as shown at the bottom of that box. The coefficient of variation is independent of the unit of measurement and is expressed as a percentage.

The coefficient of variation as computed above is a biased estimator of the population V. The following estimate V* is corrected for bias:

    V* = (1 + 1/4n)V     (4.9)

In small samples this correction can make an appreciable difference.
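Expressions (4.8) and (4.9) can be sketched directly, here applied to the aphid summary values quoted in the text (Ȳ = 4.004, s = 0.3656, n = 25):

```python
# Sketch: coefficient of variation V = 100*s/Ybar and the bias-corrected
# estimate V* = (1 + 1/(4n)) * V of Expression (4.9), using the aphid
# femur-length summary values from the text.

n = 25
ybar = 4.004
s = 0.3656

V = 100 * s / ybar
V_star = (1 + 1 / (4 * n)) * V

print(round(V, 2), round(V_star, 2))   # 9.13 9.22
```

With n = 25 the correction factor is 1.01, so V* exceeds V by one percent of its value; as the text notes, the adjustment matters mainly in small samples.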
Note that when using Expression (4.9), the standard deviation used to compute V should not be corrected using C₄, because this would result in an overcorrection. Note also that the correction factor approximates C₄, the correction factor used in Section 4.7.

Coefficients of variation are used extensively when comparing the variation of two populations independent of the magnitude of their means. Whether the birth weights of the Chinese children (see Box 4.1) are more or less variable than the femur lengths of the aphid stem mothers (see Box 2.1) is probably of little interest, but we can calculate the latter as 0.3656 × 100/4.004 = 9.13%, which would suggest that the birth weights are more variable. More commonly, we might wish to test whether a given biological sample is more variable for one character than for another. For example, for a sample of rats, is body weight more variable than blood sugar content? Another frequent type of comparison, especially in systematics, is among different populations for the same character. If, for instance, we had measured wing length in samples of birds from several localities, we might wish to know whether any one of these populations is more variable than the others. An answer to this question can be obtained by examining the coefficients of variation of wing length in these samples.

Employing the coefficient of variation in a comparison between two variables or two populations assumes that the variable in the second sample is proportional to that in the first. Thus we could write Y₂ = kY₁, where k is a constant of proportionality. If two variables are related in this manner, their coefficients of variation should be identical.
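This identity is easy to confirm numerically. The sketch below uses invented variates (not data from the text) and an arbitrary constant of proportionality k:

```python
# Numerical check (illustrative data): multiplying every variate by a
# constant k changes the mean and standard deviation by the same factor,
# so the coefficient of variation is unchanged.

def coeff_var(data):
    ybar = sum(data) / len(data)
    s = (sum((y - ybar) ** 2 for y in data) / (len(data) - 1)) ** 0.5
    return 100 * s / ybar

Y1 = [2.1, 2.5, 1.9, 2.3, 2.2]   # hypothetical sample
k = 7.5                          # constant of proportionality
Y2 = [k * y for y in Y1]         # a proportional second sample

assert abs(coeff_var(Y1) - coeff_var(Y2)) < 1e-9
```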
This should be obvious if we remember that k is a multiplicative code. Thus Ȳ₂ = kȲ₁ and s₂ = ks₁, and consequently

    V₂ = 100s₂/Ȳ₂ = 100ks₁/kȲ₁ = V₁

If the variables are transformed to logarithms (see Section 13.7), the relationship between them can be written as ln Y₂ = ln k + ln Y₁, and since ln k is a constant, the variances of ln Y₂ and ln Y₁ are identical. This relation can lead to a test of equality of coefficients of variation (see Lewontin, 1966; Sokal and Braumann, 1980).

At one time systematists put great stock in coefficients of variation and even based some classification decisions on the magnitude of these coefficients. However, there is little, if any, foundation for such actions. More extensive discussion of the coefficient of variation, especially as it relates to systematics, can be found in Simpson et al. (1960), Lewontin (1966), Lande (1977), and Sokal and Braumann (1980).

EXERCISES 4

4.1 Find the mean, standard deviation, and coefficient of variation for the pigeon data given in Exercise 2.4. Group the data into ten classes, compute Ȳ and s, and compare them with the results obtained from the ungrouped data. Compute the median for the grouped data. Answer: For ungrouped data, Ȳ = 11.48 and s = 0.69178.

4.2 Find Ȳ, s, V, and the median for the following data (milligrams of glycine per milligram of creatinine in the urine of 37 chimpanzees; from Gartler et al., 1956).
OS 0180550535052 077026. MD 025 036 043.100 120,110 «100350100300 OL 060 070030, 080110110 120,138 100 "oo 155.370 019100100116 143

4.3 The following are percentages of butterfat from 120 registered three-year-old Ayrshire cows selected at random from a Canadian stock record book.
432 424 429 400 396 448 389 402 378 442 420 387 410 400 433-381 433 416 38K 4A 423 467 374 425 «428 403 442 409 GIS 429 427 438 449 403 397 432 467 4.11 424 5:00 400 438 372 399 400 446 482391 «4713.96 366 410 438 416 377 440 406 408 3.66 4.70 397 397 420 441 431 370 383 428 4300 IT 397 420 451 386 436 418 424 405 4053.56 a4 389 458 399 4.17 382 370 433 406 3D 407358 393 420 389 460 438 4.14 40 397 422 347 392 491 395° 438 412 452 43531 410 409 409 434 409 488428 3.98 3K SK

(a) Calculate Ȳ, s, and V directly from the data. (b) Group the data in a frequency distribution and again calculate Ȳ, s, and V. Compare the results with those of (a). How much precision is lost by grouping? Also calculate the median. Answer: For grouped data, Ȳ = 4.16608, s = 0.30238, V = 7.25815.

4.4 What effect would adding a constant 5.2 to all observations have upon the numerical values of the following statistics: Ȳ, s, V, average deviation, median, mode, range? What would be the effect of adding 5.2 and then multiplying the sums by 8.0? Would these effects be different if we multiplied by 8.0 first and then added 5.2?

4.5 Estimate Ȳ and s using the midrange and the range (see Section 4.9) for the data in the previous exercises. How well do these estimates agree with the computed values? Answer: For Exercise 4.2, the estimates are 0.224 and …, respectively.

4.6 Show that the equation for the variance can also be written as

    s² = (ΣY² − nȲ²)/(n − 1)

4.7 Using the estimated standard deviation of the data in Exercise 4.2, apply the C₄ correction and compute the corrected estimate of the coefficient of variation, V*. Answer: C₄ = …, V* = ….

INTRODUCTION TO PROBABILITY DISTRIBUTIONS: BINOMIAL AND POISSON

Section 2.5 was our first discussion of frequency distributions. For example, Table 2.3 shows a distribution for a meristic, or discrete (discontinuous), variable, the number of sedge plants per quadrat. Examples of distributions for continuous variables are the femur lengths of aphid stem mothers in Box 2.1 or the human birth weights in Box 4.3. Each of these distributions informs us about the absolute frequency of any given class and permits computation of the relative frequencies of any class of variable. Thus, most of the quadrats contained either no sedges or just one or two plants.
In the 139.5-oz class of birth weights, we find only 201 out of the 9465 babies recorded; that is, approximately only 2.1% of the infants are in that birth-weight class.

Of course, these frequency distributions are only samples from given populations. The birth weights, for example, represent a population of male Chinese infants from a given geographical area. If we knew our sample to be representative of that population, however, we could make all sorts of predictions based on the frequency distribution of the sample. For instance, we could say that approximately 2.1% of male Chinese babies born in this population weigh between 135.5 and 143.5 oz at birth. Similarly, we might say that the probability of the weight at birth of any one baby in this population being in the 139.5-oz birth class is quite low. If each of the 9465 weights were mixed up in a hat and we pulled one out, the probability that we would pull out one of the 201 in the 139.5-oz class would be very low indeed, only 2.1%. It would be much more probable that we would sample an infant of 107.5 or 115.5 oz, since the infants in these classes are represented by the frequencies 2240 and 2007, respectively.

Finally, if we were to sample from an unknown population of babies and find that the very first individual sampled had a birth weight of 170 oz, we would probably reject any hypothesis that the unknown population was the same as that sampled in Box 4.3. We would arrive at this conclusion because in the distribution in Box 4.3 only one out of almost 10,000 infants had a birth weight that high. Although it is possible to have sampled from the population of male Chinese babies and obtained a birth weight of 170 oz, the probability that the first individual sampled would have such a value is very low indeed.
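The class-frequency arithmetic above is easily sketched in code. The following fragment is illustrative only (it is not part of the original text); the counts are the three class frequencies quoted from Box 4.3:

```python
# Relative frequency of a birth-weight class estimates the probability
# of drawing an infant from that class (counts quoted from Box 4.3).
counts = {107.5: 2240, 115.5: 2007, 139.5: 201}  # class mark (oz) -> frequency
n = 9465  # total number of recorded birth weights

rel_freq = {oz: f / n for oz, f in counts.items()}
print(round(rel_freq[139.5], 4))  # 0.0212, i.e., about 2.1%
```

The 107.5- and 115.5-oz classes, being roughly ten times as frequent, are correspondingly more probable draws.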
It is much more reasonable to suppose that the unknown population from which we are sampling is different in mean and possibly in variance from the one in Box 4.3.

We have used this empirical frequency distribution to make certain predictions (with what frequency a given event will occur) or to make judgments and decisions (whether it is likely that an infant of a given birth weight belongs to this population). In many cases in biology, however, we will make such predictions not from empirical distributions, but on the basis of theoretical considerations that in our judgment are pertinent. We may feel that the data should be distributed in a certain way because of basic assumptions about the nature of the forces acting on the example at hand. If our observed data do not sufficiently conform to the values expected on the basis of these assumptions, we will have serious doubts about our assumptions. This is a common use of frequency distributions in biology. The assumptions being tested generally lead to a theoretical frequency distribution, known also as a probability distribution.

A probability distribution may be a simple two-valued distribution, such as the 3:1 ratio in a Mendelian cross, or it may be a more complicated function that is intended to predict the number of plants in a quadrat. If we find that the observed data do not fit the expectations on the basis of theory, we are often led to the discovery of some biological mechanism causing this deviation from expectation. The phenomena of linkage in genetics, of preferential mating between different phenotypes in animal behavior, of congregation of animals at certain favored places or, conversely, their territorial dispersion are cases in point.
We will thus make use of probability theory to test our assumptions about the laws of occurrence of certain biological phenomena. Probability theory underlies the entire structure of statistics, a fact which, because of the nonmathematical orientation of this book, may not be entirely obvious. In Section 5.1 we present an elementary discussion of probability, limited to what is necessary for comprehension of the sections that follow. The binomial frequency distribution, which not only is important in certain types of studies, such as genetics, but also is fundamental to an understanding of the kinds of probability distributions discussed in this book, is covered in Section 5.2. The Poisson distribution, which follows in Section 5.3, is widely applicable in biology, especially for tests of randomness of the occurrence of certain events. Both the binomial and Poisson distributions are discrete probability distributions. Some other discrete distributions are mentioned briefly in Section 5.4. The entire chapter therefore deals with discrete probability distributions. The most common continuous probability distribution is the normal frequency distribution, discussed in Chapter 6.

5.1 PROBABILITY, RANDOM SAMPLING, AND HYPOTHESIS TESTING

We will start this discussion with an example that is not biometrical or biological in the strict sense. We have often found it pedagogically effective to introduce new concepts through situations thoroughly familiar to the student, even if the example is not relevant to the general subject matter of biometry. Let us imagine ourselves at Matchless University, a state institution somewhere between the Appalachians and the Rockies.
Its enrollment figures yield the following breakdown of the student body: 70% of the students are American undergraduates (AU), 26% are American graduate students (AG), and the remaining 4% are from abroad. Of these, 1% are foreign undergraduates (FU) and 3% are foreign graduate students (FG). In much of our work we use proportions rather than percentages, a useful convention. Thus the enrollment consists of 0.70 AUs, 0.26 AGs, 0.01 FUs, and 0.03 FGs. The total student body, corresponding to 100%, is represented by the figure 1.0.

If we sample 100 students at random, we expect that, on the average, 3 will be foreign graduate students. The actual outcome might vary: There might not be a single FG student among the 100 sampled, or there might be quite a few more than 3, and the estimate of the proportion of FGs may therefore range from 0 to greater than 0.03. If we increase our sample size to 500 or 1000, it is less likely that the ratio will fluctuate widely around 0.03. The larger the sample, the closer to 0.03 the ratio of FG students sampled to total students sampled will be. In fact, the probability of sampling a foreign student can be defined as the limit reached by the ratio of foreign students to the total number of students sampled, as sample size keeps increasing. Thus, we may formally summarize the situation by stating that the probability that a student at Matchless University is a foreign graduate student is P[FG] = 0.03. Similarly, the probability of sampling a foreign undergraduate is P[FU] = 0.01, that of sampling an American undergraduate is P[AU] = 0.70, and that for American graduate students is P[AG] = 0.26.

Now imagine the following experiment: We try to sample a student at random from the student body at Matchless University. This task is not as easy as might be imagined. If we wanted to do the operation physically, we would have to set up a collection or trapping station somewhere on campus.
And to make certain that the sample is truly random with respect to the entire student population, we would have to know the ecology of students on campus very thoroughly, so that we could locate our trap at a station where each student had an equal probability of passing. Few, if any, such places can be found in a university. The student union facilities are likely to be frequented more by independent and foreign students, less by those living in organized houses and dormitories. Fewer foreign and graduate students might be found along fraternity row. Clearly, we would not wish to place our trap near the International Club or House, because our probability of sampling a foreign student would be greatly enhanced. In front of the bursar's window we might sample mainly students paying tuition. The time of sampling is equally important, in the seasonal as well as the diurnal cycle. There seems no easy solution in sight.

Those of you who are interested in sampling organisms from nature will already have perceived parallel problems in your work. If we were to sample only students wearing turbans or saris, their probability of being foreign students would be almost 1. We could no longer speak of a random sample. In the familiar ecosystem of the university these violations of proper sampling procedure are obvious to all of us, but they are not nearly so obvious in real biological instances that are less well known.

How should we proceed to obtain a random sample of leaves from a tree, of insects from a field, or of mutations in a culture? In sampling at random we are attempting to permit the frequencies of various events occurring in nature to be reproduced unalteredly in our records; that is, we hope that on the average the frequencies of these events in our sample will be the same as they are in the natural situation.
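The limiting-ratio definition of probability given earlier can be sketched numerically. The simulation below is illustrative only and not part of the original text; the function name fg_ratio and the seed are arbitrary choices. It draws students with the Matchless University proportions and shows the FG ratio settling near 0.03 as sample size grows:

```python
import random

# Matchless University probability space for a single student.
CATEGORIES = ["AU", "AG", "FU", "FG"]
WEIGHTS = [0.70, 0.26, 0.01, 0.03]

def fg_ratio(n, seed=1):
    """Proportion of FG students in a random sample of n students."""
    rng = random.Random(seed)
    sample = rng.choices(CATEGORIES, weights=WEIGHTS, k=n)
    return sample.count("FG") / n

# The ratio fluctuates widely for small n but approaches P[FG] = 0.03.
for n in (100, 1000, 100000):
    print(n, fg_ratio(n))
```

Unlike the campus trap, the pseudorandom generator gives every "student" the same chance of capture, which is exactly the property a physical sampling scheme struggles to achieve.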
Another way of stating this goal is that in a random sample we want every individual in the population being sampled to have an equal probability of being included in the sample.

We might go about obtaining a random sample by using records representing the student body, such as the student directory, selecting a page from it at random and a name at random from the page. Or we could assign an arbitrary number to each student, write each on a chip or disk, put these in a large container, stir well, and then pull out a number.

Imagine now that we sample a single student by the best random procedure we can devise. What are the possible outcomes? Clearly, the student could be either an AU, AG, FU, or FG. This set of four possible outcomes exhausts the possibilities of the experiment. This set, which we can represent as {AU, AG, FU, FG}, is called the sample space. Any single experiment would result in only one of the four possible outcomes (elements) in the set. Such an element in a sample space is called a simple event. It is distinguished from an event, which is any subset of the sample space. Thus, in the sample space defined above, {AU}, {AG}, {FU}, and {FG} are each simple events. The following sampling results are some of the possible events: {AU, AG, FU}, {AU, AG, FG}, {AG, FG}, {AU, FG}. The meaning of these events, in order, implies being an American, an undergraduate, or both; being an American, a graduate student, or both; being a graduate student; and being an American undergraduate or a foreign graduate student. By the definition of event, simple events, as well as the entire sample space, are also events.

Given the sample space described above, the event A = {AU, AG} encompasses all possible outcomes in the space yielding an American student. Similarly, the event B = {AG, FG} summarizes the possibilities for obtaining a graduate student. The intersection of events A and B, written A ∩ B, describes only the events that are shared by A and B.
Clearly only AG qualifies, as can be seen here:

A = {AU, AG}
B = {AG, FG}

Thus A ∩ B is the event in the sample space that gives rise to the sampling of an American graduate student. When the intersection of two events is empty, as in B ∩ C, where C = {AU, FU}, the events B and C are mutually exclusive. These two events have no common element in the sample space.

We may also define events that are unions of two other events in the sample space. Thus A ∪ B indicates that A or B or both A and B occur. As defined above, A ∪ B describes all students who are American students, graduate students, or both (American graduate students).

This discussion of events makes the following relations almost self-evident. Probabilities are necessarily bounded by 0 and 1. Therefore

0 ≤ P[A] ≤ 1    (5.1)

The probability of an entire sample space is

P[S] = 1    (5.2)

where S is the sample space of all possible events. Also, for any event A, the probability of A not occurring is 1 − P[A]. This is known as the complement rule:

P[Aᶜ] = 1 − P[A]    (5.3)

where Aᶜ stands for all events that are not A.

Why are we concerned with defining sample spaces and events? Because these concepts lead us to useful definitions and operations regarding the probability of various outcomes. If we can assign a number 0 ≤ p ≤ 1 to each simple event in a sample space such that the sum of these p's over all simple events in the space equals unity, then the space becomes a (finite) probability space. In our example, the following numbers were associated with the appropriate simple events in the sample space:

{AU, AG, FU, FG}
{0.70, 0.26, 0.01, 0.03}

Given this probability space, we are now able to make statements regarding the probability of given events. For example, what is the probability that a student sampled at random will be an American graduate student? Clearly, it is P[{AG}] = 0.26. What is the probability that a student is either an American or a graduate student? The answer is given by the probability of the union of the two events:

P[A ∪ B] = P[A] + P[B] − P[A ∩ B]    (5.4)

In terms of the events defined earlier, the answer is
P[A ∪ B] = P[{AU, AG}] + P[{AG, FG}] − P[{AG}]
         = 0.96 + 0.29 − 0.26
         = 0.99

We can see that we must subtract P[A ∩ B] = P[{AG}], because if we did not do so, it would be included twice, once in P[A] and once in P[B], and would lead to the absurd result of a probability greater than 1.

Now let us assume that we have sampled our single student from the student body of Matchless University. He turns out to be a foreign graduate student. What can we conclude? By chance alone, this result has a probability of 0.03; that is, it would happen only 3% of the time, not very frequently. The assumption that we have sampled at random should probably be rejected, since if we accept the hypothesis of random sampling, the outcome of the experiment is improbable. Please note that we said improbable, not impossible. We could have chanced upon an FG as the first student to be sampled, but it is not very likely. The probability is 0.97 that a single student sampled would be a non-FG. If we could be certain that our sampling method was random (as when drawing student numbers out of a container), we would of course have to reach the conclusion that an improbable event has occurred. The decisions of this paragraph are all based on our definite knowledge that the proportion of students at Matchless University is as specified by the probability space. If we were uncertain about this, we would be led to assume a higher proportion of foreign graduate students as a consequence of the outcome of our sampling experiment.

Now we will sample two students rather than just one. What are the possible outcomes of this sampling experiment? The new sample space can best be depicted by a diagram that shows the set of the sixteen possible simple events as points in a lattice (Figure 5.1). The possible combinations, ignoring which student was sampled first, are {AU, AU}, {AU, AG}, {AU, FU}, {AU, FG}, {AG, AG}, {AG, FU}, {AG, FG}, {FU, FU}, {FU, FG}, and {FG, FG}.
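The union computation above can be verified with a short sketch. The fragment below is illustrative only (not part of the original text); it represents events as Python sets over the sample space:

```python
# Probability space for a single Matchless University student.
P = {"AU": 0.70, "AG": 0.26, "FU": 0.01, "FG": 0.03}

def prob(event):
    """Probability of an event, i.e., a subset of the sample space."""
    return sum(P[e] for e in event)

A = {"AU", "AG"}  # American student
B = {"AG", "FG"}  # graduate student

# Expression (5.4): subtracting the intersection keeps AG
# from being counted twice.
p_union = prob(A) + prob(B) - prob(A & B)
print(round(p_union, 2))  # 0.99
```

Representing events as sets makes the intersection A & B and the double-counting argument explicit.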
FIGURE 5.1  Sample space for sampling two students from Matchless University.

What would be the expected probabilities of these new outcomes? Now the nature of the sampling procedure becomes quite important. We may sample with or without replacement; that is, we may return the first student sampled to the population or may keep him out of the pool of individuals to be sampled. If we do not replace the first individual sampled, the probability of sampling a foreign graduate student will no longer be exactly 0.03. We can visualize this easily: Assume that Matchless University has 10,000 students. Since 3% are foreign graduate students, there must be 300 FG students at the university. After obtaining a foreign graduate student in the first sample, this number is reduced to 299 out of 9999 students. Consequently, the probability of sampling an FG student now becomes 299/9999 = 0.0299, a slightly lower probability than the value of 0.03 for sampling the first FG student. If, on the other hand, we return the original foreign student to the student population and make certain that the population is thoroughly randomized before being sampled again (that is, give him a chance to lose himself among the campus crowd or, in drawing student numbers out of a container, mix up the disks with the numbers on them), the probability of sampling a second FG student is the same as before, 0.03. In fact, if we continue to replace the sampled individuals in the original population, we can sample from it as though it were an infinitely sized population. Biological populations are, of course, finite, but they are frequently so large that for purposes of sampling experiments we can consider them effectively infinite, whether we replace sampled individuals or not.
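The with- versus without-replacement contrast can be sketched exactly with rational arithmetic. The fragment below is an illustration only (not part of the original text); the population counts are those assumed above:

```python
from fractions import Fraction

# Matchless University: 10,000 students, 3% (300) foreign graduates.
N, FG = 10000, 300

# With replacement the probability of drawing an FG stays constant.
p_first = Fraction(FG, N)                   # 3/100 = 0.03

# Without replacement, one FG already drawn leaves 299 of 9999.
p_second_no_repl = Fraction(FG - 1, N - 1)  # 299/9999, about 0.0299

print(float(p_first), float(p_second_no_repl))
```

The difference is minute here, which is why large biological populations may be treated as effectively infinite whether or not sampled individuals are replaced.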
After all, even in this relatively small population of 10,000 students, the probability of sampling a second foreign graduate student (without replacement) is only minutely different from 0.03. For the rest of this section we will assume that sampling is with replacement, so that the probability of obtaining a foreign student does not change.

There is a second potential source of difficulty in this design. We not only have to assume that the probability of sampling a second foreign student is equal to that of the first, but also that these events are independent. By independence of events we mean that the probability of one event is not affected by whether or not another event has occurred. In the case of the students, having sampled one foreign student, is it now more or less likely that the second student we sample is also a foreign student? Independence of the events may depend on where we sample the students or on the method of sampling. If we sampled students on campus, it is quite likely that the events would not be independent; that is, having sampled one foreign student, the probability that the second student we sample is foreign would increase, since foreign students tend to congregate. Thus, at Matchless University the probability that a student walking with a foreign graduate student is also an FG would be greater than 0.03.

Events D and E in a sample space are defined as independent whenever

P[D ∩ E] = P[D]P[E]    (5.5)

The probability values assigned to the sixteen points in the sample space of Figure 5.1 have been computed to satisfy this condition. Thus, letting P[D] equal the probability that the first student sampled is an AU, that is, P[{AU₁AU₂, AU₁AG₂, AU₁FU₂, AU₁FG₂}], and P[E] equal the probability that the second student sampled is an FG, that is, P[{AU₁FG₂, AG₁FG₂, FU₁FG₂, FG₁FG₂}], we note that the intersection D ∩ E is {AU₁FG₂}. This event has a value of 0.0210 in the probability space of Figure 5.1.
We find that this value is the product P[{AU}]P[{FG}] = 0.70 × 0.03 = 0.0210. These mutually independent relations have been deliberately imposed upon all points in the probability space. Therefore, if the sampling probabilities for the second student are independent of the type of student sampled first, we can compute the probabilities of the outcomes as the products of the independent probabilities. Thus the probability of obtaining two FG students is P[{FG}]P[{FG}] = 0.03 × 0.03 = 0.0009.

The probability of obtaining one AU and one FG student in the sample might seem to be the product 0.70 × 0.03. However, it is twice that probability. It is easy to see why: There is only one way of obtaining two FG students, namely, by sampling first one FG and then again another FG. Similarly, there is only one way to sample two AU students. However, sampling one of each type of student can be done by first sampling an AU followed by an FG or by first sampling an FG followed by an AU. Thus the probability is 2P[{AU}]P[{FG}] = 2 × 0.70 × 0.03 = 0.0420.

If we conducted such an experiment and obtained a sample of two FG students, we would be led to the following conclusions: Since only 0.0009 of the samples (9/100 of 1%, or 9 out of 10,000 cases) would be expected to consist of two foreign graduate students, obtaining such a result by chance alone is quite improbable. Given that P[{FG}] = 0.03 is true, we would suspect that sampling was not random or that the events were not independent (or that both assumptions, random sampling and independence of events, were incorrect).

Random sampling is sometimes confused with randomness in nature. The former is the faithful representation in the sample of the distribution of the events in nature; the latter is the independence of the events in nature. Random sampling is generally, or should be, under the control of the experimenter and is related to the strategy of good sampling.
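The product rule used above, including the factor of 2 for unordered outcomes, can be sketched as follows (an illustrative fragment, not part of the original text):

```python
from itertools import product

# Single-draw probability space (sampling with replacement).
P = {"AU": 0.70, "AG": 0.26, "FU": 0.01, "FG": 0.03}

# Independence: each ordered pair has probability P[first] * P[second].
pairs = {(a, b): P[a] * P[b] for a, b in product(P, repeat=2)}

p_two_fg = pairs[("FG", "FG")]                       # one ordered way
p_au_fg = pairs[("AU", "FG")] + pairs[("FG", "AU")]  # two ordered ways

print(round(p_two_fg, 4), round(p_au_fg, 4))  # 0.0009 0.042
```

The sixteen ordered pairs are exactly the lattice points of Figure 5.1, and their probabilities sum to 1.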
Randomness in nature generally describes an innate property of the objects being sampled and thus is of greater biological interest. The confusion between random sampling and independence of events arises because lack of either can yield observed frequencies of events differing from expectation. We have already seen how lack of independence in samples of foreign students can be interpreted from both points of view in our example from Matchless University.

Expression (5.5) can be used to test whether two events are independent. It is well known that foreign students at American universities are predominantly graduate students, quite unlike the situation for American students. Can we demonstrate this for Matchless University? The proportion of foreign students at the university is P[F] = P[FU] + P[FG] = 0.04, and that of graduate students is P[G] = P[AG] + P[FG] = 0.29. If the properties of being a foreign student and of being a graduate student were independent, then P[FG] would equal P[F]P[G]. In fact, it does not, since 0.04 × 0.29 = 0.0116, not 0.03, the actual proportion of foreign graduate students. Thus nationality and academic rank are not independent. The probability of being a graduate student differs depending on the nationality: Among American students the proportion of graduate students is 0.26/0.96 = 0.27, but among foreign students it is 0.03/0.04 = 0.75.

This dependence illustrates the idea of conditional probability. We speak of the conditional probability of event A, given event B, symbolized as P[A|B]. This quantity can be computed as follows:

P[A|B] = P[A ∩ B] / P[B]    (5.6)

Let us apply this formula to our Matchless University example. What is the conditional probability that a student is a graduate student, given that he is a foreign student? We calculate P[G|F] = P[G ∩ F]/P[F] = 0.03/0.04 = 0.75. Thus 75% of the foreign students are graduate students. Similarly, the conditional probability of being a graduate student, given that the student is American, is P[G|A] = 0.26/0.96 = 0.2708, a much lower value. Only 27.08% of the American students are graduate students.
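The independence check and the conditional probabilities above can be sketched as follows (an illustrative fragment, not part of the original text; the event definitions follow the ones just given):

```python
# Probability space for a single student.
P = {"AU": 0.70, "AG": 0.26, "FU": 0.01, "FG": 0.03}

p_foreign = P["FU"] + P["FG"]  # P[F] = 0.04
p_grad = P["AG"] + P["FG"]     # P[G] = 0.29
p_both = P["FG"]               # P[F and G] = 0.03

# Independence, Expression (5.5), would require P[F and G] = P[F]P[G].
print(round(p_foreign * p_grad, 4))  # 0.0116, not 0.03: dependent

# Conditional probabilities from Expression (5.6).
print(round(p_both / p_foreign, 2))             # P[G|F] = 0.75
print(round(P["AG"] / (P["AU"] + P["AG"]), 4))  # P[G|A] = 0.2708
```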
Here is another example that requires Expression (5.6) for conditional probability. Let C be the event that a person has cancer, and let P[C] denote the probability that a particular member of a population has cancer (and P[Cᶜ] = 1 − P[C] the probability of not having it). In epidemiology P[C] is known as the prevalence of cancer in that population. Now suppose that persons can be given a diagnostic test for cancer, and let T be the event of a positive test result. The sensitivity of the test, P[T|C], is the probability of a positive result among those individuals known to have cancer; its specificity, P[Tᶜ|Cᶜ], is the probability of a negative result among those known to be free of cancer. It follows that P[T|Cᶜ], the probability of a positive result in cancer-free patients, equals 1 − P[Tᶜ|Cᶜ], the one-complement of the specificity. What we wish to know is P[C|T], the probability that a patient who tests positive actually has cancer. Applying Expression (5.6), we obtain

P[C|T] = P[C ∩ T] / P[T]    (5.7)

By Expression (5.6), the numerator can be written P[C ∩ T] = P[T|C]P[C]. The probability of an event such as T, the denominator, can be written

P[T] = P[T|C]P[C] + P[T|Cᶜ]P[Cᶜ]

which is the sum of the probabilities of having a positive test among those who have cancer and among those who do not have cancer, each weighted by the frequencies of the two populations. Substituting these two results into Expression (5.7) yields

P[C|T] = P[T|C]P[C] / (P[T|C]P[C] + P[T|Cᶜ]P[Cᶜ])    (5.8)

This expression is known as Bayes' theorem and can be generalized to allow for an event C having more than just two states (the denominator is summed over all states of C, rather than just C and its complement). This famous formula, published posthumously by the eighteenth-century English clergyman Thomas Bayes, has led to much controversy over the interpretation of the quantity P[C|T]. Earlier we defined "probability" as the proportion of times that an event occurs out of a large number of trials. In the current example, however, we have only a single patient, who either does or does not have cancer; a single patient cannot have cancer "some proportion of the time." Thus the meaning of P[C|T] in this case is the degree of one's belief, or the likelihood, that this patient has cancer. It is this alternative interpretation of probability, and the question of how it should be applied to statistics, that is controversial.
Kotz and Stroup (1983) give a good introduction to the idea that probability refers to uncertainty of knowledge rather than of events.

Consider the following example, in which Bayes' theorem was applied to a diagnostic test. The figures are based on Watson and Tang (1980). The sensitivity of the radioimmunoassay for prostatic acid phosphatase (RIA-PAP), a test for prostatic cancer, is 0.70. Its specificity is 0.94. The prevalence of prostatic cancer in the white male population is 35 per 100,000, or 0.00035. Applying these values to Expression (5.8), we find

P[C|T] = P[T|C]P[C] / (P[T|C]P[C] + P[T|Cᶜ]P[Cᶜ])
       = (0.70 × 0.00035) / [(0.70 × 0.00035) + (1 − 0.94)(1 − 0.00035)]
       = 0.0041

The rather surprising result is that the likelihood that a white male who tests positive on the RIA-PAP test actually has prostatic cancer is only 0.41%. This probability is known in epidemiology as the positive predictive value. Even if the test had been much more sensitive, say 0.95 rather than 0.70, the positive predictive value would still have been low: 0.55%. Only for a perfect test (i.e., sensitivity and specificity both = 1) would a positive test imply with certainty that the patient had prostatic cancer.

The paradoxically low positive predictive value is a consequence of its dependence on the prevalence of the disease. Only if the prevalence of prostatic cancer were 7895 per 100,000 would there be a 50:50 chance that a patient with a positive test result has cancer. This is more than 127 times the highest prevalence ever reported for a population in the United States. Watson and Tang (1980) use these findings (erroneously reported as 1440 per 100,000) and further analyses to make the point that using the RIA-PAP test as a routine screening procedure for prostatic cancer is not worthwhile.
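The Bayes' theorem computation above can be sketched as follows (an illustrative fragment, not part of the original text; the figures are those quoted from Watson and Tang, 1980):

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """P[C|T] from Bayes' theorem, Expression (5.8), for a two-state test."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# RIA-PAP: sensitivity 0.70, specificity 0.94, prevalence 35 per 100,000.
ppv = positive_predictive_value(0.70, 0.94, 0.00035)
print(round(ppv, 4))  # 0.0041

# Even a far more sensitive test barely helps at this prevalence.
print(round(positive_predictive_value(0.95, 0.94, 0.00035), 4))  # 0.0055
```

Varying the prevalence argument makes the dependence of the positive predictive value on prevalence immediately visible.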
Readers interested in extending their knowledge of probability should refer to general texts such as Galambos (1984) or Kotz and Stroup (1983) for a simple introduction.

5.2 THE BINOMIAL DISTRIBUTION

For the discussion to follow, we will simplify our sample space to consist of only two elements, foreign and American students, represented by {F, A}, and ignore whether the students are undergraduates or graduates. Let us symbolize the probability space by {p, q}, where p = P[F], the probability of being a foreign student, and q = P[A], the probability of being an American student. As before, we can compute the probability space of samples of two students as follows:

{FF, FA, AA}
{p², 2pq, q²}

If we were to sample three students independently, the probability space of the sample would be

{FFF, FFA, FAA, AAA}
{p³, 3p²q, 3pq², q³}

Samples of three foreign or three American students can be obtained in only one way, and their probabilities are p³ and q³, respectively. In samples of three, however, there are three ways of obtaining two students of one kind and one student of the other. As before, if A stands for American and F stands for foreign, then the sampling sequence could be AFF, FAF, or FFA for two foreign students and one American. Thus the probability of this outcome will be 3p²q. Similarly, the probability for two Americans and one foreign student is 3pq².

A convenient way to summarize these results is by the binomial expansion, which is applicable to samples of any size from populations in which objects occur independently in only two classes: students who may be foreign or American, individuals who may be dead or alive, male or female, black or white, rough or smooth, and so forth. This summary is accomplished by expanding the binomial term (p + q)^k, where k equals sample size, p equals the probability of occurrence of the first class, and q equals the probability of occurrence of the second class.
By definition, p + q = 1; hence q is a function of p: q = 1 − p. We will expand the expression for samples of k from 1 to 3:

For samples of 1:  (p + q)¹ = p + q
For samples of 2:  (p + q)² = p² + 2pq + q²
For samples of 3:  (p + q)³ = p³ + 3p²q + 3pq² + q³

These expressions yield the same outcomes discussed previously. The coefficients (the numbers before the powers of p and q) express the number of ways a particular outcome is obtained.

A general formula that gives both the powers of p and q and the binomial coefficients is

C(k, Y) p^Y q^(k−Y)    (5.9)

where Y is the number or count of "successes," the items that interest us and whose probability of occurrence is symbolized by p. In our example, Y designates the number of foreign students. The term C(k, Y) stands for the number of combinations that can be formed from k items taken Y at a time. This expression can be evaluated as k!/[Y!(k − Y)!], where ! means factorial. In mathematics, k factorial is the product of all the integers from 1 up to and including k. Thus 5! = 1 × 2 × 3 × 4 × 5 = 120. By convention, 0! = 1. In working out fractions containing factorials, note that a factorial always cancels against a higher factorial. Thus 5!/3! = (5 × 4 × 3!)/3! = 5 × 4. For example, the binomial coefficient for the expected frequency of samples of 5 students containing 2 foreign students is C(5, 2) = 5!/(2!3!) = (5 × 4)/2 = 10.

Let us now turn to a biological example. Suppose we have a population of insects, exactly 40% of which are infected with a given virus X. If we take samples of k = 5 insects each and examine each insect separately for the presence of the virus, what distribution of samples could we expect if the probability of infection of each insect in a sample were independent of that of the other insects in the sample? In this case p = 0.4, the proportion infected, and q = 0.6, the proportion not infected. The population is assumed to be so large that the question of whether sampling is with or without replacement is irrelevant for practical purposes. The expected proportions would be the expansion of the binomial:

(p + q)^k = (0.4 + 0.6)⁵

With the aid of Expression (5.9) this expansion is

p⁵ + 5p⁴q + 10p³q² + 10p²q³ + 5pq⁴ + q⁵

or

(0.4)⁵
+ 5(0.4)⁴(0.6) + 10(0.4)³(0.6)² + 10(0.4)²(0.6)³ + 5(0.4)(0.6)⁴ + (0.6)⁵

representing the expected proportions of samples of five infected insects, of four infected and one noninfected insect, of three infected and two noninfected insects, and so on.

By now you have probably realized that the terms of the binomial expansion yield a type of frequency distribution for these different outcomes. Associated with each outcome, such as "five infected insects," is a probability of occurrence, in this case (0.4)⁵ = 0.01024. This is a theoretical frequency distribution, or probability distribution, of events that can occur in two classes. It describes the expected distribution of outcomes in random samples of five insects, 40% of which are infected. The probability distribution described here is known as the binomial distribution; the binomial expansion yields the expected frequencies of the classes of the binomial distribution.

A convenient layout for presentation and computation of a binomial distribution is shown in Table 5.1, based on Expression (5.9). In the first column, which lists the number of infected insects per sample, note that we have revised the order of the terms to indicate a progression from Y = 0 successes (infected insects) to Y = k successes. The second column features the binomial coefficients as given by the combinatorial portion of Expression (5.9). Column (3) shows increasing powers of p from p⁰ to p⁵, and column (4) shows decreasing powers of q from q⁵ to q⁰. The relative expected frequencies, which are the probabilities of the various outcomes, are shown in column (5). We label such expected frequencies f_rel. They are the product of columns (2), (3), and (4), and their sum is equal to 1.0, since the events in column (1) exhaust the possible outcomes. We see from column (5) that only about 1% of the samples are expected to consist of 5 infected insects, and 25.9% are expected to contain 1 infected and 4 noninfected insects. We will now test whether these predictions hold in an actual experiment.

Table 5.1  EXPECTED FREQUENCIES OF INFECTED INSECTS IN SAMPLES OF 5 INSECTS SAMPLED FROM AN INFINITELY LARGE POPULATION WITH AN ASSUMED INFECTION RATE OF 40%.

(1)          (2)        (3)       (4)       (5)         (6)        (7)
Number of    Binomial   Powers    Powers    Relative    Absolute   Observed
infected     coeff.     of        of        expected    expected   frequencies
insects                 p = 0.4   q = 0.6   frequencies frequencies
per sample Y                                f_rel       f̂          f
0            1          1.00000   0.07776   0.07776     188.4      202
1            5          0.40000   0.12960   0.25920     628.0      643
2            10         0.16000   0.21600   0.34560     837.4      817
3            10         0.06400   0.36000   0.23040     558.3      535
4            5          0.02560   0.60000   0.07680     186.1      197
5            1          0.01024   1.00000   0.01024     24.8       29

Sum (n = 2423)                              1.00000     2423.0     2423
Mean                                        2.00000     2.00000    1.98721
Standard deviation                          1.09545     1.09545    1.11934

EXPERIMENT 5.1. Simulate the case of the infected insects by using a table of random numbers such as Statistical Table FF. These are randomly chosen one-digit numbers in which each digit 0 through 9 has an equal probability of appearing. The numbers are grouped in blocks of 25 for convenience. Such numbers can also be obtained from random-number keys on some pocket calculators and by pseudorandom-number-generating algorithms in computer programs. Since there is an equal probability for any one digit to appear, you can let any four digits (say 0, 1, 2, 3) stand for the infected insects and the remaining digits (4, 5, 6, 7, 8, 9) stand for the noninfected insects. The probability that any one digit selected from the table represents an infected insect (that is, a 0, 1, 2, or 3) is therefore 40%, or 0.4, since these are four of the ten possible digits. Also, successive digits are assumed to be independent of the values of previous digits. Thus the assumptions of the binomial distribution should be met in this experiment. Enter the table of random numbers at an arbitrary point (not always at the beginning!) and look at successive groups of five digits, noting in each group how many of the digits are 0, 1, 2, or 3.
‘Take as many groups of five as you can find time to-do, but no fewer than 100 groups (Persons with computer experience can easily generate the Jaa required by this exereise Without using Table FF. There are also some programs that specialize in simulating sampling experiments.) Column (7) in Table 5.1 shows the results of such an experiment by a bi ety clas. A total of 2423 samples of five numbers were obtained from Stats- tical Table FF and the distribution ofthe four digits simulating the percentage of infection is shown in this column. The observed frequencies are labeled f To calculate the expected frequencies for this example, we multiplied the relative expected frequencies, of column (5) by n = 2423, the number of samples taken, These calculations resulted in absolute expected frequencies, f, shown in column (6). When we compare the observed frequencies in column (7) with the expected frequencies in column (6), we note general agreement between the two columns of figures. The two distributions ae illustrated in Figure 52.1 the observed frequencies didnot ft expected frequencies, we might believe that the lack offi was due to chance alone. Or we might be Ie to reject one or more of the following hypotheses: (1) thatthe tue proportion of digits 1,2, and 3s 0.4 (rejection of this hypothesis would normally not be reasonable, for we may rely fn the fact thatthe proportion of digits 0, 1 2, and 3 ina tale of random 15 Bh Observed frequencies 1D Expected feuencies Frequency cERESESRZES o 1 2 Number of infeted insects per sample FIGURE 5.2 Bar diagram of observed amd expected frequencies given n Table 5.1, pee numbers is 0.4 oF very elose to it); (2) that sampling was random; and (3) that ‘events are independent. “These statements ean be reinterpreted in terms of the original infection mode! with which we started this discussion. 
If, instead of a sampling experiment of igits by a biometry class, this had been i real sampling experiment of insects, ‘we would conclude thatthe inscets had indeed been randomly sampled and that ‘we had no evidence to reject the hypothesis that the proportion of infected inscets ‘was 40%, IF the observed frequencies had not fit the expected frequencies, the Jack of fit might be attributed to chance or to the conclusion that the true propor- tion of infection is not 0.4, or we would have to reject one or both the following assumptions: (1) that sampling was at random, and (2) that the occurrence of infected inscets in these samples was independent. Experiment 5.1 was designed to yield random samples and independent events. How could we simulate a sampling procedure in which the occurrences of the digits 0, 1,2, and 3 were not independent? We could, for example. instruct the sampler to sample as indicated previously, but every time he found a 3, to search through the succeeding digits until he found another one of the four digits standing for infected individuals and to incorporate this in the sample. Thus. ‘once a3 was found, the probability would be I.0 that another one ofthe indicated digits would be included in the sample. After repeated samples, this procedure ‘would result in higher frequencies of classes of two or mote indicated digits and in lower frequencies than expected (on the basis of the binomial distribution) of classes of one event. Many such sampling schemes could be devised. It should be lear thatthe probability ofthe second event occurring would be different from ‘and dependent on that ofthe frst. 16 CHAPTER 5 BINOMIAL AND POISSON DISTRIBUTIONS, How would we interpret large departure ofthe observed frequencies from extete fronvncin in anther example? We have ot yet ead techies coerce whathercbxerved frequencies difer from those expected by more see eae bated to chance alone This topic wil be taken up in Chapter 17. 
rao at sucha test has been eared ot and tht it has shown thal our ase roquenies ae significantly different from the expected frequencies Tree ges of deparare fom expetaion are ikely (1 clumping and 2) ein, own i fettios examples in Table 5.2. In el examples we would reeset rt ntions about the magaitade of p, the probability of one of the mat te athe outcomes. In such eases x customary to obtain p fom the ob weet sample and to caleulat the expected Frequencies sing te samplep. This a a ia ibe hypsess hat pis a given vue cannot be std. since by dame i pnpcctediequencies all Rave the ame p value asthe Observed desig th. 1 in clumped samples, and <1 in cases of repulsion. In the yeast cell example, CD = 1.092. Computer program BIOM-pe includes an option for expected Poisson frequencies. Figure 5.3 will give you an idea of the shapes of the Poisson distributions of different means. It shows frequency polygons (the lines connecting the mid 10 Relative expected frequeney 2 4 6 8 0 2 4 16 [Number of rare evens per sample FIGURE 5.3 Frequency polygons of the Poisson distributions for various values of the mean points of a bar diagram) for five Poisson distributions. For the low value of r= OL, the frequency polygon is extremely U-shaped, but with an increase in the value of , the distributions become humped and eventually nearly syrmmet- rical ‘We conclude our study of the Poisson distribution by considering several examples from diverse fields. The fist example, whichis analogous to that ofthe Yeast cells, is from an ecological study of mosses ofthe species Hypnum schre- Jeri invading mica residues of china clay (Table 5.5). These residues occur on ‘deposited *'dams” (often 5000 yd? in area), on which the ecologists laid out 126 quadrats, In each quadrat they counted the number of moss shoots. Expected Frequencies are again computed, using the mean number of moss shoots, Y= 0.4841, as an estimate of ps. 
There are many more quadrats than expected at the two tals of the distribution than atts center. Thus, although we would expect only approximately 78 quadrats without a moss plant, we find 100. Similarly, while there are 11 quadrats containing 3 or more moss shoots, the Poisson Uistribution predicted only 1.7 quadrats. By way of compensation the central ‘lasses are shy of expectation. Instead of the nearly 38 expected quadrats ‘one moss plant each, only 9 were found, and there is a slight deficiency also the 2-mosses-per-quadrat class. This case is another illustration of clumping, ‘which was encountered first in the binomial distribution. The sample variance y= '1308, much larger than P= 0.4841, yields a coeflicient of dispersion cD = 2.702. ‘Searching for a biological explanation of the clumping, the investigators found that the protonemata, or spores, of the moss were carried in by water and Table 557 NUMBER OF MOSS SHOOTS (HYPHUM SCHREBER? PER ble 5.5. QuaDRAT ON CHINA CLAY RESIDUES (MICA). wo @ @ © unter of tte Detation EES onenet don et egenees Gegpences expectation nant 7 wt : o 100 mI + 1 2 M6 > ° : 3 sl 13 4 ‘ thr ation he a 3 2 ; Source: Ba fom Bans sd Sty (950, MITES (ARRENURUS sp.) INFESTING 589 CHIRONOMID Table 5.6 ruics (cALoPsEcTRA AKRINA) > a sinter pao Deaton Chmics—Oteeved pen tem Pe cee | eee eee 7 7 Tt oo + 1 1 wes : 2 > 362 E 3 i 33 * 3 ‘ os 1 $ ‘ a1 + 2 Shar gy thy ? Q a0 ° h 1 29 4 Total 389 589.0 deposited at anon bul hat cch proton gave ie ta number of up show, com be ate eda ped atrbuon Inf when he clon instead of indvidal stots were used vara che vestigins found tat the chm followed a Poisson dtibtion that were randomly db ted. 
Thete ae prblete nr eppyng is aprsch sine tee a th tanling units Can vnc heel Ths woul app, for nao the plans produced substances tha prevented other las fom growing very clone (other pulsed daibun)tur ona age scale the pans were lured in favorable regions (clinpd dabuton), Grganss ving unde thelr ovn font wh td o cmp on ay Beep 1 rn Socialists or may have accumulated in clump, aa eoul fo in rxponso tecenvironmena foes Temps asarsalichor nee "The seond example features the dsbtion of water mite on as of chromomi fy Tale 5. Ths exams inl toe Geof the moses ‘except that here the sampling unit is a fly instead of a quadrat, The rare event is a Ines nesting they. The coffe of person, 2.22, eet the paern ofbecredfrojuencc, whic ar reser han aperiodic al ade han expesed nthe centr Thi slaonsip a made lear in the lt clu of the feuney distibuen, which shows th signe of the deviations (observed = quences minus eaeced reqenis) and shows characte clumped pt tk A ponible eplanation I tha he dense of mites Inthe several poms 90 CHAPTER 5 BINOMIAL AND POISSON DISTRIBUTIONS. from which the chironomid flies emerged differed considerably. Chironomids that emerged from ponds with many mites would be parasitized by more than ‘one mite, but those from ponds where mites were scarce would show little or no infestation, ‘The third example tests the randomness of the distribution of weed seeds in samples of grass seed (Table 5.7). Here the total number of seeds in a quarter- ‘ounce sample could be counted, so we could estimate k (which is several thou- sand), and , which represents the large proportion of grass seed, as contrasted with p, the small proportion of weed seed. The data are therefore structured as in a binomial distribution with alternative states, weed seed and grass seed. For purposes of computation, however, only the number of weed seeds must be ‘considered. 
This is an example of a case mentioned at the beginning of this section, a binomial in which the frequency of one outcome is very much smaller than that of the other and the sample size is large, We can use the Poisson distribution as a useful approximation of the expected binomial frequencies for the tal ofthe distribution. We use the average number of weed seeds per sample of seeds as our estimate of the mean and calculate expected Poisson frequencies for this mean. Although the pattern of the deviations and the coefficient of dispersion again indicate clumping, this tendency is not pronounced, and the methods of Chapter 17 would indicate insufficient evidence for the conclusion that the distribution is not a Poisson. We conclude that the weed seeds are randomly distributed through the samples. If clumping had been significant it Table 5:1: POTENTILLA (WEED) SEEDS IN 98 QUARTER-OUNCE fable 3.1 samples OF GRASS SEEDS (PHLEUM PRATENSE). , @ ° @ Number of weed Poisson Deviation seeds per Observed expected from sample of seds frequencies frequencies expectation Y f f ia 7 33 + 2 387 7 16 208 S 9 18 + 2 22 5 Oh oshi06 = = be 1 on + 1 00) + 580 Yau f= 17 CD socr: Motil rom Lega (539. 5.3 THE POISSON DISTRIBUTION Bi} EXPECTED FREQUENCIES COMPARED FOR BINOMIAL Table 5.8. ano poisson oisTRIBUTIONS. o @ @ Expected frequencies frequencies approximated = 0001 “by Poisson = 100 wel y Jes Jr 00367695 (0367879 10368063 0367879 2 oeos2 183940 30061283 a0si313 4 0013290 o.1s328 50003049 0.003066 6 0000508 000511 7 gwar a.gov073 Koo? 0.000009 9 poo! 0.000001 Total 1.000000 099999 might have been because the weed seeds stuck together, possibly because of interlocking hairs, sticky envelopes, or the like. Perhaps you need to be convinced that the binomial under these conditions| {does approach the Poisson distribution. The mathematical proof is too involved for this text, ut we can give an empirical example (Table 5.8). 
Here the expected binomial frequencies for the expression (0.001 + 0.999)!" are given in column (2), and in column (3) these frequencies are approximated by a Poisson distribu- tion with mean equal to I, since the expected value for p is 0.001, which is one event for a sample size of k = 1000. The two columns of expected frequencies are extremely similar. “The next distribution is extracted from an experimental study of the effects of different densities of parents of the azuki bean weevil (Table 5.9). Larvae of se weevils enter the beans, feed and pupate inside them, and then emerge through a hole, Thus the number of emergence holes per bean is a good measure of the number of adults that have emerged. The rare event in this case is the ‘weevil present in the bean, We note that the distribution is strongly repulsed, a far rarer occurrence than a clumped distribution. There are many more beans containing one weevil than the Poisson distribution would predict. A statistical finding of this sort leads us to investigate the biology of the phenomenon. In this. case it was found that the adult female weevils tended to deposit their eggs evenly rather than randomly over the available beans. This prevented too many 9 CHAPTER 5 BINOMIAL AND POISSON DISTRIBUTIONS. Table 5,9 2ZUKI_BEAN WEEVILS (CALLOSOBRUCHUS CHINENSIS) fable 5.9 | emencinG FROM 112 AZUKI BEANS (PHASEOLUS RADIATUS). 3.4 OTHER DISCRETE PROBAUILITY DISTRIBUTIONS 93 Table 5,10 MEN KILLED BY BEING KICKED BY A HORSE IN 10 PRUSSIAN ARMY CORPS IN THE COURSE OF 20 YEARS. © @ @ o o @ 5 Number of Poston Delton ‘evi Poison Deviation Namberofmen Obered expected om merging Observed capeted tom kiledperyear” —feqwencesfeaeanees eapeson Terbean —fequenclesGoquenciss_eapeaton reraimycome ’ f 7 = oo SF vost 3 a 7a = t «3 1 50 m7 a 2 mn 2 1 6 3 a 3 ob, tabey = 4 os} an ae oon . 
5+ au Tout Tia Tot x00 0.269 CD = 0.579 ¥ = 0610 ost CD = 7 SoUNcE Du or Ue 60) eggs being placed on any one bean and precluded heavy competition among the developing larvae. A contributing factor was competition between larvac feeding fn the same bean, generally resulting in all but one being killed or driven out. ‘Thus its easy to understand how these biological phenomena would give rise 10 a repulsed distribution, ‘We will end this discussion with a classic, ragicomic application of the Pois- son distribution. Table 5.10 is a frequency distribution of men killed by being kicked by a horse in 10 Prussian army corps over 20 years. The basic sampling unit is temporal inthis case, one army comps per year. We are not certain exactly how many men are involved, but most likely a considerable number. The 0.610 ‘men killed per army corps per year isthe rare event. If we knew the number of ‘men in each army corps, we could calculate the probability of not being killed by 1 horse in one year. This calculation would give us a binomial that could be approximated by the Poisson distribution. Knowing that the sample size (the ‘number of men in an army corps) is large, however, we need not concern our- selves with values of p and k and can consider the example simply from the Poisson model, using the observed mean number of men killed per army corps per year as an estimate of ‘This example is an almost perfect fit of expected frequencies to observed cones. What would clumping have meant in such an example? An unusual num- ber of deaths in a certain army corps ina given year could have been due 10 poor discipline in the particular corps or to @ particularly vicious horse that killed SUH: Daa ome Bonowcr (98) Several men before the corps got rd of it. Repulsion might mean that one death per year per corps occurred more frequently than expected. 
This might have been 50 if men in each corps were careless each year until someone had been killed, after which time they all became more cazeful for a while. { Occasionally samples of tems that could be distributed in Poisson fashion are ‘taken in such a way that the zero class is missing. For example, an entomologist studying the number of mites per leaf of a tree may not sample leaves at random {From the tree, but collect only leaves containing mites, excluding those without | nites. For these so-called truncated Poisson distributions, special methods can De used to estimate parameters and to calculate expected frequencies. Cohen © (1960) gives tables and graphs for rapid estimation. 5.4 OTHER DiscRETE PROBABILITY DisTRIBUTIONS ‘The binoril and Poisson dstittions ate both examples of sree probabil istibutions Forte moss example of te previous secon (Table 99) th means that tere is ether no mons pan pot quadrat er one ors plan td plants thee plans, aso forth bu ho vales in between. Other dees dines butions are known in probability theory, and in reen yrs any have taeg Suggested as slab for one application canoe. A abepinnor, you reed sex 9h CHAPTER 5 BINOMIAL AND POISSON DISTRIBUTIONS. 50) samples (see Box 63, part Il), In smaller samples a difference of one item per class would make a substantial difference inthe cumulative percentage in the tails. For small samples (<50) the method of ranked normal deviates or ranks, is preferable. With this, ‘method. instead of quantiles we use the ranks ofeach observation in the sample, and instead of NEDs we plot values from a table of rankits, the average positions PYRE] GRAPHIC TEST FOR NORMALITY OF SAMPLES BY ° MEANS OF NORMAL QUANTILE PLOTS. 1. Routine computer processed samples Computation 1. Array samples in order of increasing magritude of variates. 
2, For each variate compute the quantity p = (i~ 49m, where is the rank onder of the ‘th variate in the aay, and is the sample site, These Yalues will be used for computing NEDs. Th correction of} prevents the last variate from yielding p = 1.0, for which the NED would be postive infinity. For ied values of calculate an average p. 13. For each value ofp evaluate the coresponding NED. If ne computer program is Available, the NEDs can be looked up ina table ofthe inverse cumulative normal distribution (eg. Table 4 in Pearson and Harley, 1958, or Table 1.2 in Owen, 1962), or by inverse interpolation in Statistical Table A. 4, Plot the NEDs against the original variates and examine the scatterpot visually for Tineaiy. Some programs also plot the straight line expected when the sample i perfectly normally distribute, to serve as a benchmark agaist which the observed Scatterpot can be judged. Alternatively straight line is fte to the points by eye, Preferably using a ransparent plastic ler, which permits ll he points to be seen 8 the line is drawn, In drawing the line, most weight should be given to the poins between cumulative frequencies of25% to 75% because a difference ofa single item may make appreciable changes in the percentages atthe tis. Some programs plo INEDs along the abscissa and the variable along the ordinate. We prefer the more ‘common arrangement shown in Figure 6.6 andthe figures inthis box Figure A shows 1400 housely wig lengths randomly sampled from Table 61 a8 3 normal quantile plot Since these are approximately normal dat, itis wo surpeise that the seaterplot forms «nearly seaight line. ng Box 6.3 (CONTINUED) 23 20 15 10 os! 
00) 05 -10] [Normal quantiles -15) 20) ee ae eee eee So WT 40 a2 44 a6 a8 50 52 34 56 Housel wing lengths FIGURE A Normal quantile plot of 1400 random samples of 5 housefly wing Teng from Table 6.1 Large frequency distributions ‘When the sample is large, ploting every observation may nt be wore, athe rormal quantile plot can be applied tothe frequency distribution as ftlows, We employ the by now thoroughly fail Chinese bith weights. Birth weights of male Chinese in ounces, rom Box 43, o ® ® @ o Class Upper Cumulative ‘mark class frequencies Y tim Sf F poe Hin ws Ss 2 ‘0.0002 os ms 6 8 0.0008 sa) = se 2 0) a 0.0049 BS 87S ORS 432 0.0856, 31S 983 RR 1320, 0.1394 9S 103s 129309 o3z21 wors11n3 2240589 03587 ss 195 20077296 07108 ass 12338829 sort is 1385 Git 9170 0.9688 nes 3s 0 TL 0.9900 175 StS 749445 9078 Iss 195148459. 0.9993 1633 9464 0.9998 mms 93685 09999) 10 2, Form a.com Box 6.3 (conTiNvED) Compuation 1. Prepare a frequency distribution as shown in columns (1), (2), and (3) sive frequeney distribution as shown in column (). Its obtained by successive summation ofthe Frequency values Incolumn (5) express the cumoltive frequencies asp-alucs using the formula instep 2of part. 53. Graph the NEDs, quantiles, comesponding these p-values against the upper cass Tmt ofeach claws (Figure B) Notice tht the upper frequencies deviate othe right ofthe straight ine. This is (ical of data that ae skewed tothe right (se Figure 660). 4. 
The graph from step 3 (and those described inthe other pars ofthis box) can ao BE Teter pid estimation ofthe ean and saan devin of sample The nea sSSpmonitat bya papi estimation ofthe mein, The more era he dst Sot te closer the ean wil be fo the mean, Te moan etme by Normal equivalent deviates BSS 713 B15 HONS 1195 1353 1313 1675 ‘Binh weights of Chinese males (oz) FIGURE B Graphic analysis of data 7 GRarmie METHODS ni Box 6.3 (continueD) dropping a perpendicular from the intersection ofthe 0 pont onthe ordinate and the cumulative Frequency curve othe abscissa (see Figure B). The estimate ofthe mean of 10.7 02 is quite close wo the computed mean of 1099 0, ‘5. The standard deviation is estimated by dropping similar perpendiculars from the intersections ofthe = and the +1 points with the cumulative curve, espotivel. These points enclose the portion ofa normal curve represented by ys 2 0. By mea. suring the difference between these perpeniculas and dividing this value by 2, we ‘obtain an estimate of one standant deviation. In this instance the estimate i + = 136since the difference is27.2 oz divided by 2. This close approximation tothe computed value of 13.59 02 ML, Smal samples (n= 50) Femur lengths of aphid stem mothers fom Ix 2.1 # = 25, Ta (oe enor Ranks Rankie Ranks Rankis ‘rom allowing. fiom allowing ¥ Table “fortes | Y Table ‘forties tor 97e eo emer ars 35-12-12 | 1020s 36-136 42° 03000 36-107 43 oat O58 36-091 le 0s ose 36-076 430688 Be 064 43 076 sk 38-082 44 oot 1.08 3K “0i wy rt 38-030 44 126 Low 39 “020 450 1s 32 39-010 47st a7 39 000 Computation 1. Enter the sample arayed in increasing order of magni in column (1) ncolunin (@) put coresponding rankits from Slalistical Table MH, The table gives only the anki forthe al of each distribution greater than the median for any sample size. ‘The ther half s the same bot is negative in sign All samples conaining an od ‘umber of variates (suchas this one) hae Oa the median vale. 
The ranks fr this example are looked up under sample size n= 25, ‘A special problem illustrated inthis example i the ease of tes, oF variates of ‘dentical magnitude, In sich a case we sum the rant vals forthe corresponding ranks and ind their mean, Thus the ~ 1.00 occupying lines 3 06 in column (3) isthe

Das könnte Ihnen auch gefallen