The paired t-statistic is
\[ t = \frac{\bar{d} - \mu_0}{s/\sqrt{n}} , \]
where $\bar{d}$ is the mean of the data (here the differences $d_1, d_2, \ldots, d_n$), n is the
sample size, and s is the sample standard deviation (of the differences)
\[ s = \sqrt{ \frac{1}{n-1} \sum_{i=1}^{n} (d_i - \bar{d})^2 } . \]
If our null hypothesis is correct and our assumptions are true, then the t-
statistic follows a t-distribution with $n-1$ degrees of freedom.
The p-value for a test is the probability, assuming that the null hypothesis
is true, of observing a test statistic as extreme or more extreme than the one
we did observe. Extreme means away from the null hypothesis towards
the alternative hypothesis. Our alternative here is that the true average is
larger than the null hypothesis value, so larger values of the test statistic are
extreme. Thus the p-value is the area under the t-curve with $n-1$ degrees of
freedom from the observed t-value to the right. (If the alternative had been
$\mu < \mu_0$, then the p-value is the area under the curve to the left of our test
Table 2.3: Paired t-test results for runstitching times (standard - ergonomic)
for the last 10 and all 30 workers.

                n    df    $\bar d$     s      t      p
    Last 10    10     9    .023     .695    .10   .459
    All 30     30    29    .175     .645   1.49   .074
statistic. For a two-sided alternative, the p-value is the area under the curve
at a distance from 0 as great or greater than our test statistic.)
To illustrate the t-test, let's use the data for the last 10 workers and all
30 workers. Table 2.3 shows the results. Looking at the last ten workers,
the p-value is .46, meaning that we would observe a t-statistic this large or
larger in 46% of all tests when the null hypothesis is true. Thus there is little
evidence against the null here. When all 30 workers are considered, the p-
value is .074; this is mild evidence against the null hypothesis. The fact that
these two differ probably indicates that the workers are not listed in random
order. In fact, Figure 2.1 shows box-plots for the differences by groups of ten
workers; the lower numbered differences tend to be greater.
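The entries in Table 2.3 can be checked directly from the summary statistics. Here is a minimal sketch, assuming Python with scipy for the t-distribution tail area (Python is used here and in the sketches below; the book itself works with MacAnova, Minitab, and SAS):

    from math import sqrt
    from scipy import stats

    # (n, mean of differences, sd of differences) from Table 2.3
    for label, n, dbar, s in [("Last 10", 10, .023, .695),
                              ("All 30", 30, .175, .645)]:
        t = dbar / (s / sqrt(n))      # paired t-statistic
        p = stats.t.sf(t, df=n - 1)   # one-sided p-value: area to the right
        print(f"{label}: t = {t:.2f}, p = {p:.3f}")

This reproduces t = .10, p = .459 and t = 1.49, p = .074.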
Now consider a randomization-based analysis. The randomization null
hypothesis is that the two workplaces are completely equivalent and merely
act to label the responses that we observed. For example, the first worker
had responses of 4.90 and 3.87, which we have labeled as standard and er-
gonomic. Under the randomization null, the responses would be 4.90 and
3.87 no matter how the random assignment of treatments turned out. The
only thing that could change is which of the two is labeled as standard, and
which as ergonomic. Thus, under the randomization null hypothesis, we
could, with equal probability, have observed 3.87 for standard and 4.90 for
ergonomic.
What does this mean in terms of the differences? We observed a differ-
ence of 1.03 for worker 1. Under the randomization null, we could just as
easily have observed the difference -1.03, and similarly for all the other dif-
ferences. Thus in the randomization analogue to a paired t-test, the absolute
values of the differences are taken to be fixed, and the signs of the differ-
ences are random, with each sign independent of the others and having equal
probability of positive and negative.
To construct a randomization test, we choose a descriptive statistic for
the data and then get the distribution of that statistic under the randomization
null hypothesis. The randomization p-value is the probability (under this
randomization distribution) of getting a descriptive statistic as extreme or
more extreme than the one we observed.
[Figure 2.1: Box-plots of differences in runstitching times by groups of 10
workers, using MacAnova. Stars and diamonds indicate potential outlier
points.]
For this problem, we take the sum of the differences as our descriptive
statistic. (The average would lead to exactly the same p-values, and we could
also form tests using the median or other measures of center.) Start with
the last 10 workers. The sum of the last 10 observed differences is .23. To
get the randomization distribution, we have to get the sum for all possible
combinations of signs for the differences. There are two possibilities for
each difference, and 10 differences, so there are $2^{10} = 1024$ different equally
likely values for the sum in the randomization distribution. We must look at
all of them to get the randomization p-value.
Figure 2.2 shows a histogram of the randomization distribution for the
last 10 workers. The observed value of .23 is clearly in the center of this
distribution, so we expect a large p-value. In fact, 465 of the 1024 values are
.23 or larger, so the randomization p-value is 465/1024 = .454, very close to
the t-test p-value.
We only wanted to do a test on a mean of 10 numbers, and we had to
compute 1024 different sums of 10 numbers; you can see one reason why
randomization tests have not had a major following. For some data sets, you
can compute the randomization p-value by hand fairly simply. Consider the
last 10 differences in Table 2.2 (reading across rows, rather than columns).
[Figure 2.2: Histogram of randomization distribution of the sum of the last
10 worker differences for runstitching, with vertical line added at the
observed sum.]
These differences are
.62  1.75  .71  .21  .01  .42  -.21  .42  .43  .82

Only one of these values is negative (-.21), and seven of the positive differ-
ences have absolute value greater than .21. Any sign change among these
seven values can only make the sum less, so we don't have to consider
changing their signs, only the signs of .21, .01, and -.21. This is a much
smaller problem, and it is fairly easy to work out that four of the 8 possible
sign arrangements for these three differences lead to sums as large or larger
than the observed sum. Thus the randomization p-value is 4/1024 = .004,
similar to the .007 p-value we would get if we used the t-test.
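This hand computation is easy to confirm by brute force. A sketch that enumerates all $2^{10}$ equally likely sign assignments for these ten differences:

    from itertools import product

    diffs = [.62, 1.75, .71, .21, .01, .42, -.21, .42, .43, .82]
    observed = sum(diffs)   # 5.18

    # every equally likely assignment of signs to the absolute differences
    count = 0
    for signs in product([1, -1], repeat=len(diffs)):
        s = sum(sign * abs(d) for sign, d in zip(signs, diffs))
        if s >= observed - 1e-9:   # as large or larger (float tolerance)
            count += 1
    print(count, count / 2 ** len(diffs))   # 4, about .004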
Looking at the entire data set, we have $2^{30} = 1{,}073{,}741{,}824$ different
sets of signs. That is too many to do comfortably, even on a computer. What
is done instead is to have the computer choose a random sample from this
complete distribution by choosing random sets of signs, and then use this
sample for computing randomization p-values as if it were the complete dis-
tribution. For a reasonably large sample, say 10,000, the approximation is
usually good enough. I took a random sample of size 10,000 and got a p-
value of .069, reasonably close to the t-test p-value. Two additional samples
of 10,000 gave p-values of .073 and .068; the binomial distribution suggests
that these approximate p-values have a standard deviation of about
\[ \sqrt{p(1-p)/10000} \approx \sqrt{.07 \times .93/10000} = .0026 . \]
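A sketch of this subsampling approach; the name diffs30 stands in for the 30 observed differences, which are not reproduced in this excerpt:

    import numpy as np

    rng = np.random.default_rng()

    def randomization_pvalue(diffs, nsamp=10_000):
        """Monte Carlo randomization p-value for the sum of paired differences."""
        diffs = np.asarray(diffs)
        observed = diffs.sum()
        signs = rng.choice([-1.0, 1.0], size=(nsamp, diffs.size))
        sums = (signs * np.abs(diffs)).sum(axis=1)   # random-sign sums
        return np.mean(sums >= observed)

    # diffs30 = [...]   # the 30 standard - ergonomic differences of Table 2.2
    # randomization_pvalue(diffs30)   # about .07, give or take .003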
2.4.2 Two-sample t-test

Figure 2 of Hunt (1973) provides data from an experiment looking at the
absorption of phosphorus by Rumex acetosa. Table 2.4 is taken from Figure
2 of Hunt and gives the log phosphorus content of 8 whole plants, 4 each at
15 and 28 days after first harvest. These are 8 plants randomly divided into
two groups of 4, with each group getting a different treatment. One natural
question is whether the average phosphorus content is the same at the two
sampling times. Formally, we test the null hypothesis that the two sampling
times have the same average.

Table 2.4: Log whole plant phosphorus (ln $\mu$g/plant) 15 and 28 days
after first harvest.

    15 Days              28 Days
    4.3  4.6  4.8  5.4   5.3  5.7  6.0  6.3
A two-sample t-test is the standard method for addressing this question.
Let $y_{11}, \ldots, y_{14}$ be the responses from the first sample, and let $y_{21}, \ldots, y_{24}$
be the responses from the second sample. The usual assumptions for a two-
sample t-test are that the data $y_{11}, \ldots, y_{14}$ are a sample from a normal dis-
tribution with mean $\mu_1$ and variance $\sigma^2$, the data $y_{21}, \ldots, y_{24}$ are a sample
from a normal distribution with mean $\mu_2$ and variance $\sigma^2$, and the two sam-
ples are independent. Note that while the means may differ, the variances
are assumed to be the same. The null hypothesis is $H_0: \mu_1 = \mu_2$ and our
alternative is $H_1: \mu_1 < \mu_2$ (presumably growing plants will accumulate
phosphorus).
The two-sample t-statistic is
\[ t = \frac{\bar y_2 - \bar y_1}{s_p \sqrt{1/n_1 + 1/n_2}} , \]
where $\bar y_1$ and $\bar y_2$ are the means of the first and second samples, $n_1$ and $n_2$
are the sample sizes, and $s_p^2$ is the pooled estimate of variance defined by
\[ s_p^2 = \frac{\sum_{i=1}^{n_1} (y_{1i} - \bar y_1)^2 + \sum_{i=1}^{n_2} (y_{2i} - \bar y_2)^2}{n_1 + n_2 - 2} . \]
If our null hypothesis is correct and our assumptions are true, then the t-
statistic follows a t-distribution with $n_1 + n_2 - 2$ degrees of freedom. The
p-value for our one-sided alternative is the area under the t-distribution curve
with $n_1 + n_2 - 2$ degrees of freedom that is to the right of our observed
t-statistic.
For these data $\bar y_1 = 4.775$, $\bar y_2 = 5.825$, $s_p = .446$, and $n_1 = n_2 = 4$.
The t-statistic is then
\[ t = \frac{5.825 - 4.775}{.446 \sqrt{1/4 + 1/4}} = 3.33 , \]
and the p-value is .008, the area under a t-curve with 6 degrees of freedom to
the right of 3.33. This is strong evidence against the null hypothesis, and we
would probably conclude that the null is false.
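These values are simple to verify; a sketch using scipy's pooled two-sample t-test (the alternative keyword assumes a reasonably recent scipy):

    from scipy import stats

    y15 = [4.3, 4.6, 4.8, 5.4]   # 15 days, from Table 2.4
    y28 = [5.3, 5.7, 6.0, 6.3]   # 28 days

    # pooled-variance test of H1: mean at 28 days exceeds mean at 15 days
    t, p = stats.ttest_ind(y28, y15, equal_var=True, alternative="greater")
    print(round(t, 2), round(p, 3))   # 3.33, 0.008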
Now consider a randomization analysis. The randomization null hypoth-
esis is that growing time treatments are completely equivalent and serve only
as labels. In particular, the responses we observed for the 8 units would be
the same no matter which treatments had been applied, and any subset of four
units is equally likely to be the 15-day treatment group. For example, under
the randomization null with the 15-day treatment, the responses (4.3, 4.6, 4.8,
5.4), (4.3, 4.6, 5.3, 5.7), and (5.4, 5.7, 6.0, 6.3) are all equally likely.
To construct a randomization test, we choose a descriptive statistic for
the data and then get the distribution of that statistic under the randomization
null hypothesis. The randomization p-value is the probability (under this
randomization distribution) of getting a descriptive statistic as extreme or
more extreme than the one we observed.
For this problem, we take the average response at 28 days minus the aver-
age response at 15 days as our statistic. The observed value of this statistic is
1.05. There are $\binom{8}{4} = 70$ different ways that the 8 plants can be split
between the two treatments. Only two of those 70 ways give a difference of
averages as large as or larger than the one we observed. Thus the randomization
p-value is 2/70 = .029. This p-value is a bit bigger than that computed from
the t-test, but both give evidence against the null hypothesis. Note that the
smallest possible randomization p-value for this experiment is 1/70 = .014.
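The 70 splits are few enough to enumerate directly; a sketch:

    from itertools import combinations

    y = [4.3, 4.6, 4.8, 5.4, 5.3, 5.7, 6.0, 6.3]   # the 8 responses
    observed = 5.825 - 4.775                        # 1.05

    count = 0
    for grp28 in combinations(range(8), 4):   # choose the 28-day group
        m28 = sum(y[i] for i in grp28) / 4
        m15 = (sum(y) - 4 * m28) / 4
        if m28 - m15 >= observed - 1e-9:
            count += 1
    print(count, count / 70)   # 2, about .029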
2.4.3 Randomization inference and standard inference
We have seen a couple of examples where the p-values for randomization
tests were very close to those of t-tests, and a couple where the p-values
differed somewhat. Generally speaking, randomization p-values are close to
standard p-values. The two tend to be very close when the sample size is
large and the assumptions of the standard methods are met. For small sample
sizes, randomization inference is coarser, in the sense that there are relatively
few obtainable p-values.
We will only mention randomization testing in passing in the remainder
of this book. Normal theory methods such as ANOVA and t-tests are much
easier to implement and generalize; furthermore, we get essentially the same
inference as the randomization tests, provided we take some care to ensure
that the assumptions made by the standard procedures are met. We should
consider randomization methods when the assumptions of normal theory can-
not be met.
2.5 Further Reading and Extensions
Randomization tests, sometimes called permutation tests, were introduced
by Fisher (1935) and further developed by Pitman (1937, 1938) and others.
Some of the theory behind these tests can be found in Kempthorne (1955) and
Lehmann (1959). Fisher's book is undoubtedly a classic and the granddaddy
of all modern books on the design of experiments. It is, however, difficult
for mere mortals to comprehend and has been debated and discussed since
it appeared (see, for example, Kempthorne 1966). Welch (1990) presents a
fairly general method for constructing randomization tests.
The randomization distribution for our test statistic is discrete, so there
is a nonzero lump of probability on the observed value. We have computed
the p-value by including all of this probability at the observed value as being
in the tail area (as extreme or more extreme than that we observed). One
potential variation on the p-value is to split the probability at the observed
value in half, putting only half in the tail. This can sometimes improve the
agreement between randomization and standard methods.
While randomization is traditional in experimental design and its use is
generally prescribed, it is only fair to point out that there is an alternative
model for statistical inference in which randomization is not necessary for
valid experimental design, and under which randomization does not form
the basis for inference. This is the Bayesian model of statistical inference.
The drawback is that the Bayesian analysis must model all the miscellaneous
factors which randomization is used to avoid.
28 Randomization and Design
The key assumption in many Bayesian analyses is the assumption of ex-
changeability, which is like the assumption of independence in a classical
analysis. Many Bayesians will concede that randomization can assist in mak-
ing exchangeability a reasonable approximation to reality. Thus, some would
do randomization to try to get exchangeability. However, Bayesians do not
need to randomize and so are free to consider other criteria, such as ethical
criteria, much more strongly. Berry (1989) has expounded this view rather
forcefully.
Bayesians believe in the likelihood principle, which here implies basing
your inference on the data you have instead of the data you might have had.
Randomization inference compares the observed results to results that would
have been obtained under other randomizations. This is a clear violation
of the likelihood principle. Of course, Bayesians don't generally believe in
testing or p-values to begin with.
A fairly recent cousin of randomization inference is bootstrapping (see
Efron 1979; Efron and Tibshirani 1993; and many others). Bootstrap infer-
ence in the present context does not rerandomize the assignment of treat-
ments to units; rather, it randomly reweights the observations in each treat-
ment group in an effort to determine the distribution of statistics of interest.
2.6 Problems
Exercise 2.1  We wish to evaluate a new textbook for a statistics class. There
are seven sections; four are chosen at random to receive the new book, three
receive the old book. At the end of the semester, student evaluations show the
following percentages of students rate the textbook as very good or excellent:

    Section   1   2   3   4   5   6   7
    Book      N   O   O   N   N   O   N
    Rating   46  37  47  45  32  62  56
Find the one-sided randomization p-value for testing the null hypothesis that
the two books are equivalent versus the alternative that the new book is better
(receives higher scores).
Exercise 2.2  Dairy cows are bred by selected bulls, but not all cows become
pregnant at the first service. A drug is proposed that is hoped to increase the
bulls' fertility. Each of seven bulls will be bred to 2 herds of 100 cows each (a
total of 14 herds). For one herd (selected randomly) the bulls will be given
the drug, while no drug will be given for the second herd. Assume the drug
has no residual effect. The response we observe for each bull is the number
of impregnated cows under drug therapy minus the number of impregnated
cows without the drug. The observed differences are -1, 6, 4, 6, 2, -3, 5. Find
the p-value for the randomization test of the null hypothesis that the drug has
no effect versus a one-sided alternative (the drug improves fertility).
Exercise 2.3  Suppose we are studying the effect of diet on height of children,
and we have two diets to compare: diet A (a well balanced diet with lots of
broccoli) and diet B (a diet rich in potato chips and candy bars). We wish to
find the diet that helps children grow (in height) fastest. We have decided to
use 20 children in the experiment, and we are contemplating the following
methods for matching children with diets:

1. Let them choose.

2. Take the first 10 for A, the second 10 for B.

3. Alternate A, B, A, B.

4. Toss a coin for each child in the study: heads A, tails B.

5. Get 20 children; choose 10 at random for A, the rest for B.

Describe the benefits and risks of using these five methods.
Exercise 2.4  As part of a larger experiment, Dale (1992) looked at six samples
of a wetland soil undergoing a simulated snowmelt. Three were randomly se-
lected for treatment with a neutral pH snowmelt; the other three got a reduced
pH snowmelt. The observed response was the number of Copepoda removed
from each microcosm during the first 14 days of snowmelt.

    Reduced pH       Neutral pH
    256  159  149    54  123  248
Using randomization methods, test the null hypothesis that the two treatments
have equal average numbers of Copepoda versus a two-sided alternative.
Exercise 2.5  Chu (1970) studied the effect of the insecticide chlordane on the
nervous systems of American cockroaches. The coxal muscles from one meso-
and one metathoracic leg on opposite sides were surgically extracted from
each of six roaches. The roaches were then treated with 50 micrograms of
chlordane, and coxal muscles from the two remaining meso- and metatho-
racic legs were removed about two hours after treatment. The Na$^+$-K$^+$ ATPase
activity was measured in each muscle, and the percentage changes for the six
roaches are given here:

    15.3  -31.8  -35.6  -14.5  3.1  -24.5

Test the null hypothesis that the chlordane treatment has not affected the
Na$^+$-K$^+$ ATPase activity. What experimental technique (not mentioned in the
description above) must have been used to justify a randomization test?
Problem 2.1  McElhoe and Conner (1986) use an instrument called a Visiplume
to measure ultraviolet light. By comparing absorption in clear air and absorp-
tion in polluted air, the concentration of SO$_2$ in the polluted air can be es-
timated. The EPA has a standard method for measuring SO$_2$, and we wish
to compare the two methods across a range of air samples. The recorded
response is the ratio of the Visiplume reading to the EPA standard reading.
There were six observations on coal plant number 2: .950, .978, .762, .733,
.823, and 1.011. If we make the null hypothesis be that the Visiplume and
standard measurements are equivalent (and the Visiplume and standard labels
are just labels and nothing more), then the ratios could (with equal probabil-
ity) have been observed as their reciprocals. That is, the ratio of .950 could
with equal probability have been 1/.950 = 1.053, since the labels are equiva-
lent and assigned at random. Suppose we take as our summary of the data the
sum of the ratios. We observe .95 + ... + 1.011 = 5.257. Test (using random-
ization methods) the null hypothesis of equivalent measurement procedures
against the alternative that Visiplume reads higher than the standard. Report
a p-value.
Problem 2.2  In this problem, a data set of size 5 consists of the numbers 1
through 5; a data set of size 6 consists of the numbers 1 through 6; and so on.
(a) For data sets of size 5 and 6, compute the complete randomization distri-
bution for the mean of samples of size 3. (There will be 10 and 20 members
respectively in the two distributions.) How normal do these distributions
look?
(b) For data sets of size 4 and 5, compute the complete randomization distri-
bution for the mean of samples of any size (size 1, size 2, . . ., up to all the
data in the sample). Again, compare these to normal.
(c) Compare the size 5 distributions from parts (a) and (b). How do they com-
pare for mean, median, variance, and so on?
Question 2.1  Let $X_1, X_2, \ldots, X_N$ be independent, uniformly distributed,
random k-digit integers (that is, less than $10^k$). Find the probability of having
no duplicates in N draws.
Chapter 3
Completely Randomized
Designs
The simplest randomized experiment for comparing several treatments is the
Completely Randomized Design, or CRD. We will study CRDs and their
analysis in some detail, before considering any other designs, because many
of the concepts and methods learned in the CRD context can be transferred
with little or no modification to more complicated designs. Here, we define
completely randomized designs and describe the initial analysis of results.
3.1 Structure of a CRD
We have g treatments to compare and N units to use in our experiment. For
a completely randomized design:

1. Select sample sizes $n_1, n_2, \ldots, n_g$ with $n_1 + n_2 + \cdots + n_g = N$.

2. Choose $n_1$ units at random to receive treatment 1, $n_2$ units at random
from the $N - n_1$ remaining to receive treatment 2, and so on.

This randomization produces a CRD; all possible arrangements of the N
units into g groups with sizes $n_1$ through $n_g$ are equally likely. Note that
complete randomization only addresses the assignment of treatments to units;
selection of treatments, experimental units, and responses is also required.
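The randomization step itself is simple to carry out in software. A minimal sketch with numpy, with the units numbered 0 through N-1:

    import numpy as np

    def crd_assignment(group_sizes, rng=None):
        """Randomly split units 0..N-1 into treatment groups of given sizes."""
        rng = rng or np.random.default_rng()
        units = rng.permutation(sum(group_sizes))  # random order of the units
        groups, start = [], 0
        for n_i in group_sizes:    # first n_1 units get treatment 1, and so on
            groups.append(units[start:start + n_i])
            start += n_i
        return groups

    print(crd_assignment([3, 3, 4]))   # e.g., g = 3 treatments, N = 10 units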
Completely randomized designs are the simplest, most easily understood,
most easily analyzed designs. For these reasons, we consider the CRD first
when designing an experiment. The CRD may prove to be inadequate for
some reason, but I always consider the CRD when developing an experimen-
tal design before possibly moving on to a more sophisticated design.
Example 3.1 Acid rain and birch seedlings
Wood and Bormann (1974) studied the effect of acid rain on trees. Clean
precipitation has a pH in the 5.0 to 5.5 range, but observed precipitation pH
in northern New Hampshire is often in the 3.0 to 4.0 range. Is this acid rain
harming trees, and if so, does the amount of harm depend on the pH of the
rain?
One of their experiments used 240 six-week-old yellow birch seedlings.
These seedlings were divided into five groups of 48 at random, and the
seedlings within each group received an acid mist treatment 6 hours a week
for 17 weeks. The five treatments differed by mist pH: 4.7, 4.0, 3.3, 3.0, and
2.3; otherwise, the seedlings were treated identically. After the 17 weeks, the
seedlings were weighed, and total plant (dry) weight was taken as response.
Thus we have a completely randomized design, with five treatment groups
and each $n_i$ fixed at 48. The seedlings were the experimental units, and plant
dry weight was the response.
This is a nice, straightforward experiment, but let's look over the steps
in planning the experiment and see where some of the choices and compro-
mises were made. It was suspected that damage might vary by pH level, plant
developmental stage, and plant species, among other things. This particu-
lar experiment only addresses pH level (other experiments were conducted
separately). Many factors affect tree growth. The experiment specifically
controlled for soil type, seed source, and amounts of light, water, and fer-
tilizer. The desired treatment was real acid rain, but the available treatment
was a synthetic acid rain consisting of distilled water and sulfuric acid (rain
in northern New Hampshire is basically a weak mixture of sulfuric and ni-
tric acids). There was no placebo per se. The experiment used yellow birch
seedlings; what about other species or more mature trees? Total plant weight
is an important response, but other responses (possibly equally important) are
also available. Thus we see that the investigators have narrowed an enormous
question down to a workable experiment using artificial acid rain on seedlings
of a single species under controlled conditions. A considerable amount of
nonstatistical background work and compromise goes into the planning of
even the simplest (from a statistical point of view) experiment.
Example 3.2 Resin lifetimes
Mechanical parts such as computer disk drives, light bulbs, and glue bonds
eventually fail. Buyers of these parts want to know how long they are likely
Table 3.1: $\log_{10}$ times till failure of a resin under stress.

    Temperature (°C)
     175          194          213          231          250
    2.04  1.85   1.66  1.66   1.53  1.35   1.15  1.21   1.26  1.02
    1.91  1.96   1.71  1.61   1.54  1.27   1.22  1.28    .83  1.09
    2.00  1.88   1.42  1.55   1.38  1.26   1.17  1.17   1.08  1.06
    1.92  1.90   1.76  1.66   1.31  1.38   1.16
to last, so manufacturers perform tests to determine average lifetime, some-
times expressed as mean time to failure, or mean time between failures for
repairable items. The last computer disk drive I bought had a mean time to
failure of 800,000 hours (over 90 years). Clearly the manufacturer did not
have disks on test for over 90 years; how do they make such claims?
One experimental method for reliability is called an accelerated life test.
Parts under stress will usually fail sooner than parts that are unstressed. By
modeling the lifetimes of parts under various stresses, we can estimate (ex-
trapolate to) the lifetime of parts that are unstressed. That way we get an
estimate of the unstressed lifetime without having to wait the complete un-
stressed lifetime.
Nelson (1990) gave an example where the goal was to estimate the life-
time (in hours) of an encapsulating resin for gold-aluminum bonds in inte-
grated circuits operating at 120°C. Since the lifetimes were expected to be
rather long, an accelerated test was used. Thirty-seven units were assigned
at random to one of five different temperature stresses, ranging from 175°
to 250°. Table 3.1 gives the $\log_{10}$ lifetimes in hours for the test units.
For this experiment, the choice of units was rather clear: integrated cir-
cuits with the resin bond of interest. Choice of treatments, however, de-
pended on knowing that temperature stress reduced resin bond lifetime. The
actual choice of temperatures probably benefited from knowledge of the re-
sults of previous similar experiments. Once again, experimental design is a
combination of subject matter knowledge and statistical methods.
3.2 Preliminary Exploratory Analysis
It is generally advisable to conduct a preliminary exploratory or graphical
analysis of the data prior to any formal modeling, testing, or estimation. Pre-
liminary analysis could include:
• Simple descriptive statistics such as means, medians, standard errors,
interquartile ranges;

• Plots, such as stem and leaf diagrams, box-plots, and scatter-plots; and

• The above procedures applied separately to each treatment group.
See, for example, Moore and McCabe (1999) for a description of these ex-
ploratory techniques.
This preliminary analysis presents several possibilities. For example, a
set of box-plots with one box for each treatment group can show us the rel-
ative sizes of treatment mean differences and experimental error. This often
gives us as much understanding of the data as any formal analysis proce-
dure. Preliminary analysis can also be a great help in discovering unusual
responses or problems in the data. For example, we might discover an outly-
ing value, perhaps due to data entry error, that was difficult to spot in a table
of numbers.
Example 3.3 Resin lifetimes, continued
We illustrate preliminary analysis by using Minitab to make box-plots of
the resin lifetime data of Example 3.2, with a separate box-plot for each
treatment; see Figure 3.1. The data in neighboring treatments overlap, but
there is a consistent change in the response from treatments one through five,
and the change is fairly large relative to the variation within each treatment
group. Furthermore, the variation is roughly the same in the different treat-
ment groups (achieving this was a major reason for using log lifetimes).

A second plot shows us something of the challenge we are facing. Fig-
ure 3.2 shows the average log lifetimes per treatment group plotted against
the stress temperature, with a regression line superimposed. We are trying to
extrapolate over to a temperature of 120°, well beyond the range of the data.
If the relationship is nonlinear (and it looks curved), the linear fit will give
a poor prediction and the average log lifetime at 120° could be considerably
higher than that predicted by the line.
3.3 Models and Parameters
A model for data is a specification of the statistical distribution for the data.
For example, the number of heads in ten tosses of a fair coin would have a
Binomial(10,.5) distribution, where .5 gives the probability of a success and
10 is the number of trials. In this instance, the distribution depends on two
numbers, called parameters: the success probability and the number of trials.
For ten tosses of a fair coin, we know both parameters. In the analysis of
[Figure 3.1: Box-plots of $\log_{10}$ times till failure of a resin under five
different temperature stresses, using Minitab.]
[Figure 3.2: Average $\log_{10}$ time till failure versus temperature, with
linear regression line added, using MacAnova.]
experimental data, we may posit several different models for the data, all
with unknown parameters. The objectives of the experiment can often be
described as deciding which model is the best description of the data, and
making inferences about the parameters in the models.
Our models for experimental data have two basic parts. The first part
describes the average or expected values for the data. This is sometimes
called a model for the means or structure for the means. For example,
consider the birch tree weights from Example 3.1. We might assume that
all the treatments have the same mean response, or that each treatment has
its own mean, or that the means in the treatments are a straight line function
of the treatment pH. Each one of these models for the means has its own
parameters, namely the common mean, the five separate treatment means,
and the slope and intercept of the linear relationship, respectively.

The second basic part of our data models is a description of how the
data vary around the treatment means. This is the model for the errors
or structure for the errors. We assume that deviations from the treatment
means are independent for different data values, have mean zero, and all the
deviations have the same variance, denoted by $\sigma^2$.
This description of the model for the errors is incomplete, because we
have not described the distribution of the errors. We can actually go a fair
way with descriptive statistics using our mean and error models without ever
assuming a distribution for the deviations, but we will need to assume a dis-
tribution for the deviations in order to do tests, confidence intervals, and other
forms of inference. We assume, in addition to independence, zero mean, and
constant variance, that the deviations follow a Normal distribution.
The standard analysis for completely randomized designs is concerned
with the structure of the means. We are trying to learn whether the means
are all the same, or if some differ from the others, and the nature of any
differences that might be present. The error structure is assumed to be known,
except for the variance $\sigma^2$, which must be estimated and dealt with but is
otherwise of lesser interest.

Let me emphasize that these models in the standard analysis may not
be the only models of interest; for example, we may have data that do not
follow a normal distribution, or we may be interested in variance differences
rather than mean differences (see Example 3.4). However, the usual analysis
looking at means is a reasonable place to start.
Example 3.4 Luria, Delbrück, and variances
In the 1940s it was known that some strains of bacteria were sensitive to a
particular virus and would be killed if exposed. Nonetheless, some members
of those strains did not die when exposed to the virus and happily proceeded
to reproduce. What caused this phenomenon? Was it spontaneous mutation,
or was it an adaptation that occurred after exposure to the virus? These two
competing theories for the phenomenon led to the same average numbers
of resistant bacteria, but to different variances in the numbers of resistant
bacteria, with the mutation theory leading to a much higher variance. Ex-
periments showed that the variances were high, as predicted by the mutation
theory. This was an experiment where all the important information was in
the variance, not in the mean. It was also the beginning of a research collab-
oration that eventually led to the 1969 Nobel Prize for Luria and Delbrück.
There are many models for the means; we start with two basic models.
We have g treatments and N units. Let $y_{ij}$ be the jth response in the ith
treatment group. Thus i runs between 1 and g, and j runs between 1 and $n_i$,
in treatment group i. The model of separate group means (the full model) as-
sumes that every treatment has its own mean response $\mu_i$. Combined with the
error structure, the separate means model implies that all the data are inde-
pendent and normally distributed with constant variance, but each treatment
group may have its own mean:
\[ y_{ij} \sim N(\mu_i, \sigma^2) . \]
Alternatively, we may write this model as
\[ y_{ij} = \mu_i + \epsilon_{ij} , \]
where the $\epsilon_{ij}$'s are errors or deviations that are independent, normally
distributed with mean zero and variance $\sigma^2$.
The second basic model for the means is the single mean model (the
reduced model). The single mean model assumes that all the treatments have
the same mean $\mu$. Combined with the error structure, the single mean model
implies that the data are independent and normally distributed with mean $\mu$
and constant variance,
\[ y_{ij} \sim N(\mu, \sigma^2) . \]
Alternatively, we may write this model as
\[ y_{ij} = \mu + \epsilon_{ij} , \]
where the $\epsilon_{ij}$'s are independent, normally distributed errors with mean zero
and variance $\sigma^2$.

Note that the single mean model is a special case or restriction of the
group means model, namely the case when all of the $\mu_i$'s equal each other.
Model comparison is easiest when one of the models is a restricted or reduced
form of the other.
We sometimes express the group means $\mu_i$ as $\mu_i = \mu + \alpha_i$, where $\mu$ is
a constant and the $\alpha_i$'s are treatment effects. In this formulation, the single
mean model is the situation where all the $\alpha_i$ values are equal to each other:
for example, all zero. This introduction of $\mu$ and $\alpha_i$ seems like a needless
complication, and at this stage of the game it really is. However, the treatment
effect formulation will be extremely useful later when we look at factorial
treatment structures.
Note that there is something a bit fishy here. There are g means $\mu_i$,
one for each of the g treatments, but we are using g + 1 parameters ($\mu$
and the $\alpha_i$'s) to describe the g means. This implies that $\mu$ and the $\alpha_i$'s are
not uniquely determined. For example, if we add 15 to $\mu$ and subtract 15
from all the $\alpha_i$'s, we get the same treatment means $\mu_i$: the 15's just cancel.
However, $\alpha_i - \alpha_j$ will always equal $\mu_i - \mu_j$, so the differences between
treatment effects will be the same no matter how we define $\mu$.
We got into this embarrassment by imposing an additional mathematical
structure (the overall mean $\mu$) on the g treatment means; to get back out,
we must decide how $\mu$ is to be defined. Once we know $\mu$, then we can
determine the treatment effects $\alpha_i$ by $\alpha_i = \mu_i - \mu$.
These decisions typically take the form of some mathematical restriction on
the values for $\mu$ or the $\alpha_i$. Restricting $\mu$ or the $\alpha_i$ is really two sides of
the same coin.

Mathematically, all choices for defining $\mu$ are acceptable; the standard
choice is
\[ \mu = \sum_{i=1}^{g} \mu_i / g . \]
For this choice, the sum of the treatment effects is zero:
\[ \sum_{i=1}^{g} \alpha_i = 0 . \]
An alternative that makes some hand work simpler assumes that $\mu$ is the
weighted average of the treatment means, with the sample sizes $n_i$ used as
weights:
\[ \mu = \sum_{i=1}^{g} n_i \mu_i / N . \]
For this choice, the weighted sum of the treatment effects is zero:
\[ \sum_{i=1}^{g} n_i \alpha_i = 0 . \]
When the sample sizes are equal, these two choices coincide. The computa-
tional formulae we give in this book will use the restriction that the weighted
sum of the $\alpha_i$'s is zero, because it leads to somewhat simpler hand computa-
tions. Some of the formulae in later chapters are only valid when the sample
sizes are equal.

Our restriction that the treatment effects $\alpha_i$ add to zero (either weighted
or not) implies that the treatment effects are not completely free to vary. We
can set $g-1$ of them however we wish, but the remaining treatment effect is
then determined because it must be whatever value makes the zero sum true.
We express this by saying that the treatment effects have $g-1$ degrees of
freedom.
3.4 Estimating Parameters
Most data analysis these days is done using a computer. Few of us sit down
and crunch through the necessary calculations by hand. Nonetheless, know-
ing the basic formulae and ideas behind our analysis helps us understand and
interpret the quantities that come out of the software black box. If we don't
understand the quantities printed by the software, we cannot possibly use
them to understand the data and answer our questions.
The parameters of our group means model are the treatment means $\mu_i$
and the variance $\sigma^2$, plus the derived parameters $\mu$ and the $\alpha_i$'s. We will
be computing unbiased estimates of these parameters. Unbiased means
that when you average the values of the estimates across all potential random
errors $\epsilon_{ij}$, you get the true parameter values.

It is convenient to introduce a notation to indicate the estimator of a pa-
rameter. The usual notation in statistics is to put a hat over the parameter to
indicate the estimator; thus $\hat\mu$ is an estimator of $\mu$. Because we have parame-
ters that satisfy $\mu_i = \mu + \alpha_i$, we will use estimators that satisfy $\hat\mu_i = \hat\mu + \hat\alpha_i$.
Let's establish some notation for sample averages and the like. The sum
of the observations in the ith treatment group is
\[ y_{i\bullet} = \sum_{j=1}^{n_i} y_{ij} . \]
The mean of the observations in the ith treatment group is
\[ \bar y_{i\bullet} = \frac{1}{n_i} \sum_{j=1}^{n_i} y_{ij} = y_{i\bullet}/n_i . \]
The overbar indicates averaging, and the dot ($\bullet$) indicates that we have aver-
aged (or summed) over the indicated subscript. The sum of all observations
is
\[ y_{\bullet\bullet} = \sum_{i=1}^{g} \sum_{j=1}^{n_i} y_{ij} = \sum_{i=1}^{g} y_{i\bullet} , \]
and the grand mean of all observations is
\[ \bar y_{\bullet\bullet} = \frac{1}{N} \sum_{i=1}^{g} \sum_{j=1}^{n_i} y_{ij} = y_{\bullet\bullet}/N . \]
The sum of squared deviations of the data from the group means is
\[ SS_E = \sum_{i=1}^{g} \sum_{j=1}^{n_i} (y_{ij} - \bar y_{i\bullet})^2 . \]
The $SS_E$ measures total variability in the data around the group means.
Consider first the separate means model, with each treatment group hav-
ing its own mean $\mu_i$. The natural estimator of $\mu_i$ is $\bar y_{i\bullet}$, the average of the
observations in that treatment group. We estimate the expected (or average)
response in the ith treatment group by the observed average in the ith treat-
ment group. Thus we have
\[ \hat\mu_i = \bar y_{i\bullet} . \]
The sample average is an unbiased estimator of the population average, so
$\hat\mu_i$ is an unbiased estimator of $\mu_i$.

In the single mean model, the only parameter in the model for the means
is $\mu$. The natural estimator of $\mu$ is $\bar y_{\bullet\bullet}$, the grand mean of all the responses.
The grand mean is an unbiased estimate of $\mu$ when the data all come from a
single population.
We use the restriction that $\mu = \sum_{i} n_i \mu_i / N$; an unbiased estimate of $\mu$
is
\[ \hat\mu = \sum_{i=1}^{g} \frac{n_i \hat\mu_i}{N} = \sum_{i=1}^{g} \frac{n_i \bar y_{i\bullet}}{N} = \frac{y_{\bullet\bullet}}{N} = \bar y_{\bullet\bullet} . \]
This is the same as the estimator we use for $\mu$ in the single mean model.
Because $\mu_i = \mu + \alpha_i$, we have $\alpha_i = \mu_i - \mu$; these can be estimated by
\[ \hat\alpha_i = \hat\mu_i - \hat\mu = \bar y_{i\bullet} - \bar y_{\bullet\bullet} . \]
These treatment effects and estimates satisfy the restriction
\[ \sum_{i=1}^{g} n_i \alpha_i = \sum_{i=1}^{g} n_i \hat\alpha_i = 0 . \]
The only parameter remaining to estimate is $\sigma^2$. Our estimator of $\sigma^2$ is
\[ \hat\sigma^2 = MS_E = \frac{SS_E}{N-g} = \frac{\sum_{i=1}^{g} \sum_{j=1}^{n_i} (y_{ij} - \bar y_{i\bullet})^2}{N-g} . \]
We sometimes use the notation s in place of $\hat\sigma$, in analogy with the sample
standard deviation s. This estimator $\hat\sigma^2$ is unbiased for $\sigma^2$ in both the separate
means and single means models. (Note that $\hat\sigma$ is not unbiased for $\sigma$.)
The deviations from the group mean $y_{ij} - \bar y_{i\bullet}$ add to zero in any treatment
group, so that any $n_i - 1$ of them determine the remaining one. Put another
way, there are $n_i - 1$ degrees of freedom for error in each group, or $N - g =
\sum_i (n_i - 1)$ degrees of freedom for error for the experiment. There are thus
$N - g$ degrees of freedom for our estimate $\hat\sigma^2$. This is analogous to the
    Model            Parameter     Estimator
    Single mean      $\mu$         $\bar y_{\bullet\bullet}$
                     $\sigma^2$    $\sum_{i}\sum_{j} (y_{ij} - \bar y_{i\bullet})^2/(N-g)$
    Separate means   $\mu_i$       $\bar y_{i\bullet}$
                     $\alpha_i$    $\bar y_{i\bullet} - \bar y_{\bullet\bullet}$
                     $\sigma^2$    $\sum_{i}\sum_{j} (y_{ij} - \bar y_{i\bullet})^2/(N-g)$

Display 3.1: Point estimators in the CRD.
formula $n_1 + n_2 - 2$ for the degrees of freedom in a two-sample t-test. Another
way to think of $N - g$ is the number of data values minus the number of mean
parameters estimated.
The formulae for these estimators are collected in Display 3.1. The next
example illustrates their use.
Example 3.5 Resin lifetimes, continued
Most of the work for computing point estimates is done once we get the av-
erage responses overall and in each treatment group. Using the resin lifetime
data from Table 3.1, we get the following means and counts:

    Treatment (°C)   175    194    213    231    250   All data
    Average         1.933  1.629  1.378  1.194  1.057     1.465
    Count               8      8      8      7      6        37

The estimates $\hat\mu_i$ and $\hat\mu$ can be read from the table:

    $\hat\mu_1 = 1.933$  $\hat\mu_2 = 1.629$  $\hat\mu_3 = 1.378$  $\hat\mu_4 = 1.194$  $\hat\mu_5 = 1.057$  $\hat\mu = 1.465$

Get the $\hat\alpha_i$ values by subtracting the grand mean from the group means:

    $\hat\alpha_1 = 1.933 - 1.465 = .467$
    $\hat\alpha_2 = 1.629 - 1.465 = .164$
    $\hat\alpha_3 = 1.378 - 1.465 = -.088$
    $\hat\alpha_4 = 1.194 - 1.465 = -.271$
    $\hat\alpha_5 = 1.057 - 1.465 = -.408$
Notice that $\sum_{i=1}^{g} n_i \hat\alpha_i = 0$ (except for roundoff error).
The computation for $\hat\sigma^2$ is a bit more work, because we need to compute
the $SS_E$. For the resin data, $SS_E$ is
\begin{align*}
SS_E &= (2.04 - 1.933)^2 + (1.91 - 1.933)^2 + \cdots + (1.90 - 1.933)^2 + {}\\
&\quad (1.66 - 1.629)^2 + (1.71 - 1.629)^2 + \cdots + (1.66 - 1.629)^2 + {}\\
&\quad (1.53 - 1.378)^2 + (1.54 - 1.378)^2 + \cdots + (1.38 - 1.378)^2 + {}\\
&\quad (1.15 - 1.194)^2 + (1.22 - 1.194)^2 + \cdots + (1.17 - 1.194)^2 + {}\\
&\quad (1.26 - 1.057)^2 + (.83 - 1.057)^2 + \cdots + (1.06 - 1.057)^2 \\
&= .29369
\end{align*}
Thus we have
\[ \hat\sigma^2 = SS_E/(N-g) = .29369/(37-5) = .009178 . \]
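These computations take only a few lines in software. A sketch with numpy, using the Table 3.1 data grouped by temperature:

    import numpy as np

    groups = {   # log10 lifetimes from Table 3.1, by temperature
        175: [2.04, 1.91, 2.00, 1.92, 1.85, 1.96, 1.88, 1.90],
        194: [1.66, 1.71, 1.42, 1.76, 1.66, 1.61, 1.55, 1.66],
        213: [1.53, 1.54, 1.38, 1.31, 1.35, 1.27, 1.26, 1.38],
        231: [1.15, 1.22, 1.17, 1.16, 1.21, 1.28, 1.17],
        250: [1.26, 0.83, 1.08, 1.02, 1.09, 1.06],
    }
    alldata = np.concatenate([np.asarray(y) for y in groups.values()])
    grand = alldata.mean()                       # mu-hat = 1.465
    for temp, y in groups.items():               # mu_i-hat and alpha_i-hat
        print(temp, round(np.mean(y), 3), round(np.mean(y) - grand, 3))
    sse = sum(((np.asarray(y) - np.mean(y)) ** 2).sum() for y in groups.values())
    print(sse / (alldata.size - len(groups)))    # MSE = .009178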
A point estimate gives our best guess as to the value of a parameter. A
confidence interval gives a plausible range for the parameter, that is, a set of
parameter values that are consistent with the data. Confidence intervals for $\mu$
and the $\mu_i$'s are useful and straightforward to compute. Confidence intervals
for the $\alpha_i$'s are only slightly more trouble to compute, but are perhaps less
useful because there are several potential ways to define the $\alpha$'s. Differences
between $\mu_i$'s, or equivalently, differences between $\alpha_i$'s, are extremely useful;
these will be considered in depth in Chapter 4. Confidence intervals for the
error variance $\sigma^2$ will be considered in Chapter 11.
Confidence intervals for parameters in the mean structure have the gen-
eral form:

unbiased estimate ± multiplier × (estimated) standard error of estimate.

The standard errors for the averages $\bar y_{\bullet\bullet}$ and $\bar y_{i\bullet}$ are $\sigma/\sqrt{N}$ and $\sigma/\sqrt{n_i}$ re-
spectively. We do not know $\sigma$, so we use $\hat\sigma = s = \sqrt{MS_E}$ as an estimate
and obtain $s/\sqrt{N}$ and $s/\sqrt{n_i}$ as estimated standard errors for $\bar y_{\bullet\bullet}$ and $\bar y_{i\bullet}$.

For an interval with coverage $1 - \mathcal{E}$, we use the upper $\mathcal{E}/2$ percent point
of the t-distribution with $N - g$ degrees of freedom as the multiplier. This is
denoted $t_{\mathcal{E}/2, N-g}$. We use the $\mathcal{E}/2$ percent point because we are constructing
a two-sided confidence interval, and we are allowing error rates of $\mathcal{E}/2$ on
both the low and high ends. For example, we use the upper 2.5% point (or
97.5% cumulative point) of t for 95% coverage. The degrees of freedom for
the t-distribution come from $\hat\sigma^2$, our estimate of the error variance. For the
CRD, the degrees of freedom are $N - g$, the number of data points minus the
number of treatment groups.
    Parameter    Estimator                                       Standard Error
    $\mu$        $\bar y_{\bullet\bullet}$                       $s/\sqrt{N}$
    $\mu_i$      $\bar y_{i\bullet}$                             $s/\sqrt{n_i}$
    $\alpha_i$   $\bar y_{i\bullet} - \bar y_{\bullet\bullet}$   $s\sqrt{1/n_i - 1/N}$

Display 3.2: Standard errors of point estimators in the CRD.
The standard error of an estimated treatment effect $\hat\alpha_i$ is $\sigma\sqrt{1/n_i - 1/N}$.
Again, we must use an estimate of $\sigma$, yielding $s\sqrt{1/n_i - 1/N}$ for the esti-
mated standard error. Keep in mind that the treatment effects $\hat\alpha_i$ are nega-
tively correlated, because they must add to zero.
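For example, a 95% confidence interval for the mean of the first resin treatment group, sketched with scipy supplying the t quantile:

    from math import sqrt
    from scipy import stats

    s, N, g = 0.0958, 37, 5        # root MSE and counts for the resin data
    ybar1, n1 = 1.9325, 8          # mean and size of the 175-degree group

    tmult = stats.t.ppf(0.975, df=N - g)   # upper 2.5% point, 32 df, about 2.04
    half = tmult * s / sqrt(n1)
    print(ybar1 - half, ybar1 + half)      # roughly (1.864, 2.001)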
3.5 Comparing Models: The Analysis of Variance
In the standard analysis of a CRD, we are interested in the mean responses
of the treatment groups. One obvious place to begin is to decide whether the
means are all the same, or if some of them differ. Restating this question in
terms of models, we ask whether the data can be adequately described by the
model of a single mean, or if we need the model of separate treatment group
means. Recall that the single mean model is a special case of the group means
model. That is, we can choose the parameters in the group means model so
that we actually get the same mean for all groups. The single mean model is
said to be a reduced or restricted version of the group means model. Analysis
of Variance, usually abbreviated ANOVA, is a method for comparing the fit
of two models, one a reduced version of the other.

Strictly speaking, ANOVA is an arithmetic procedure for partitioning the
variability in a data set into bits associated with different mean structures
plus a leftover bit. (It's really just the Pythagorean Theorem, though we've
chosen our right triangles pretty carefully in N-dimensional space.) When in
addition the error structure for the data is independent normal with constant
variance, we can use the information provided by an ANOVA to construct
statistical tests comparing the different mean structures or models for means
that are represented in the ANOVA. The link between the ANOVA decom-
position for the variability and tests for models is so tight, however, that we
sometimes speak of testing via ANOVA even though the test is not really part
of the ANOVA.
Our approach to model comparison is Occam's Razor: we use the sim-
plest model that is consistent with the data. We only move to the more com-
plicated model if the data indicate that the more complicated model is needed.
How is this need indicated? The residuals $r_{ij}$ are the differences between
the data $y_{ij}$ and the fitted mean model. For the single mean model, the
fitted values are all $\bar y_{\bullet\bullet}$; for the separate means model, the fitted values
are the group means $\bar y_{i\bullet}$. The fit of a model is measured by its sum of
squared residuals, SSR. In parameter testing terms, the null hypothesis is that
all the $\alpha_i$'s are zero; because of the restriction $\sum_i n_i \alpha_i = 0$, one nonzero
$\alpha_i$ implies that the $\alpha_i$'s are not all equal to each other. The alternative
hypothesis does not mean that all the $\alpha_i$'s are different, just that they are
not all the same.
The model comparison point of view opts for the separate means model if
that model has sufficiently less residual variation, while the parameter testing
view opts for the separate means model if there is sufficiently great variation
between the observed group means. These seem like different ideas, but we
will see in the ANOVA decomposition that they are really saying the same
thing, because less residual variation implies more variation between group
means when the total variation is fixed.
3.6 Mechanics of ANOVA
ANOVA works by partitioning the total variability in the data into parts that
mimic the model. The separate means model says that the data are not all
equal to the grand mean because of treatment effects and random error:
\[ y_{ij} = \mu_i + \epsilon_{ij} . \]
ANOVA decomposes the data similarly into a part that deals with group
means, and a part that deals with deviations from group means:
\[ y_{ij} - \bar y_{\bullet\bullet} = (\bar y_{i\bullet} - \bar y_{\bullet\bullet}) + (y_{ij} - \bar y_{i\bullet}) = \hat\alpha_i + r_{ij} . \]
The difference on the left is the deviation of a response from the grand mean.
If you square all such differences and add them up you get $SS_T$, the total
sum of squares.¹
The first difference on the right is the estimated treatment effect $\hat\alpha_i$. If
you squared all these (one for each of the N data values) and added them up,
you would get $SS_{Trt}$, the treatment sum of squares:
\[ SS_{Trt} = \sum_{i=1}^{g} \sum_{j=1}^{n_i} (\bar y_{i\bullet} - \bar y_{\bullet\bullet})^2 = \sum_{i=1}^{g} n_i (\bar y_{i\bullet} - \bar y_{\bullet\bullet})^2 = \sum_{i=1}^{g} n_i \hat\alpha_i^2 . \]
I think of this as:
1. Square the treatment effect,
2. Multiply by the number of units receiving that effect, and
3. Add over the levels of the effect.
This three-step pattern will appear again frequently.
The second difference on the right is the ijth residual from the model,
which gives us some information about $\epsilon_{ij}$. If you squared and added the
$r_{ij}$'s you would get $SS_E$, the error sum of squares:
\[ SS_E = \sum_{i=1}^{g} \sum_{j=1}^{n_i} (y_{ij} - \bar y_{i\bullet})^2 . \]
This is the same $SS_E$ that we use in estimating $\sigma^2$.
¹For pedants in the readership, this quantity is the corrected total sum of squares. There
is also an uncorrected total sum of squares. The uncorrected total is the sum of the squared
observations; the uncorrected total sum of squares equals $SS_T$ plus $N \bar y_{\bullet\bullet}^2$. In this book,
total sum of squares will mean corrected total sum of squares.
\[ SS_{Trt} = \sum_{i=1}^{g} n_i \hat\alpha_i^2 \qquad
   SS_E = \sum_{i=1}^{g} \sum_{j=1}^{n_i} (y_{ij} - \bar y_{i\bullet})^2 \qquad
   SS_T = SS_{Trt} + SS_E \]

Display 3.3: Sums of squares in the CRD.
Recall that
\[ y_{ij} - \bar y_{\bullet\bullet} = \hat\alpha_i + r_{ij} , \]
so that
\[ (y_{ij} - \bar y_{\bullet\bullet})^2 = \hat\alpha_i^2 + r_{ij}^2 + 2\hat\alpha_i r_{ij} . \]
Adding over i and j we get
\[ SS_T = SS_{Trt} + SS_E + 2\sum_{i=1}^{g} \sum_{j=1}^{n_i} \hat\alpha_i r_{ij} . \]
We can show (see Question 3.2) that the sum of the cross-products is zero, so
that
\[ SS_T = SS_{Trt} + SS_E . \]
Now we can see the link between testing equality of group means and com-
paring models via SSR. For a given data set (and thus a fixed $SS_T$), more
variation between the group means implies a larger $SS_{Trt}$, which in turn im-
plies that the $SS_E$ must be smaller, which is the SSR for the separate means
model.
Display 3.3 summarizes the sums of squares formulae for the CRD. I
should mention that there are numerous calculator or shortcut formulae
for computing sums of squares quantities. In my experience, these formulae
are more difficult to remember than the ones given here, provide little insight
into what the ANOVA is doing, and are in some circumstances more prone
to roundoff errors. I do not recommend them.
ANOVA computations are summarized in a table with columns for source
of variation, degrees of freedom, sum of squares, mean squares, and F-
statistics. There is a row in the table for every source of variation in the full
model. In the CRD, the sources of variation are treatments and errors, some-
times called between- and within-groups variation. Some tables are written
with rows for either or both of the grand mean and the total variation, though
these rows do not affect the usual model comparisons.
The following is a generic ANOVA table for a CRD.

    Source       DF       SS           MS                 F
    Treatments   $g-1$    $SS_{Trt}$   $SS_{Trt}/(g-1)$   $MS_{Trt}/MS_E$
    Error        $N-g$    $SS_E$       $SS_E/(N-g)$
The degrees of freedom are $g-1$ for treatments and $N-g$ for error. We
saw the rationale for these in Section 3.4. The formulae for sums of squares
were given above, and mean squares are always sums of squares divided by
their degrees of freedom. The F-statistic is the ratio of two mean squares: the
numerator mean square for a source of variation that we wish to assess, and
a denominator (or error) mean square that estimates error variance.
We use the F-statistic (or F-ratio) in the ANOVA table to make a test of
the null hypothesis that all the treatment means are the same (all the $\alpha_i$ values
are zero) versus the alternative that some of the treatment means differ (some
of the $\alpha_i$ values are nonzero). When the null hypothesis is true, the F-statistic
is about 1, give or take some random variation; when the alternative is true,
the F-statistic tends to be bigger than 1. To complete the test, we need to be
able to tell how big is too big for the F-statistic. If the null hypothesis is true
and our model and distributional assumptions are correct, then the F-statistic
follows the F-distribution with $g-1$ and $N-g$ degrees of freedom. Note
that the F-distribution has two degrees of freedom, one from the numerator
mean square and one from the denominator mean square.
To do the test, we compute the F-statistic and the degrees of freedom, and
then we compute the probability of observing an F-statistic as large or larger
than the one we observed, assuming all the $\alpha_i$'s were zero. This probability is
called the p-value or observed significance level of the test, and is computed
as the area under an F-distribution from the observed F-statistic on to the
right, when the F-distribution has degrees of freedom equal to the degrees of
freedom for the numerator and denominator mean squares. This p-value is
usually obtained from a table of the F-distribution (for example, Appendix
Table D.5) or via the use of statistical software.

Small values of the p-value are evidence that the null may be incorrect:
either we have seen a rare event (big F-statistics when the null is actually
true, leading to a small p-value), or an assumption we used to compute the
p-value is wrong, namely the assumption that all the $\alpha_i$'s are zero. Given
the choice of unlucky or incorrect assumption, most people choose incorrect
assumption.
Table 3.2: Approximate Type I error probabilities for
different p-values using the Sellke et al. lower bound.
    p       .05    .01    .001    .0001
    P(p)    .29    .11    .018    .0025
We have now changed the question from "How big is too big an F?" to
"How small is too small a p-value?" By tradition, p-values less than .05
are termed statistically significant, and those less than .01 are termed highly
statistically significant. These values are reasonable (one chance in 20, one
chance in 100), but there is really no reason other than tradition to prefer
them over other similar values, say one chance in 30 and one chance in 200.
It should also be noted that a person using the traditional values would declare
one test with p-value of .049 to be significant and another test with a p-
value of .051 not to be significant, but the two tests are really giving virtually
identical results. Thus I prefer to report the p-value itself rather than simply
report significance or lack thereof.
As with any test, remember that statistical significance is not the same
as real-world importance. A tiny p-value may be obtained with relatively
small $\alpha_i$'s if the sample size is large enough or $\sigma^2$ is small enough. Likewise,
large, important differences between means may not appear significant if the
sample size is small or the error variance large.
It is also important not to overinterpret the p-value. Reported p-values of
.05 or .01 carry the magnificent labels of statistically significant or highly sta-
tistically significant, but they actually are not terribly strong evidence against
the null. What we would really like to know is the probability that rejecting
the null is an error; the p-value does not give us that information. Sellke,
Bayarri, and Berger (1999) define an approximate lower bound on this prob-
ability. They call their bound a calibrated p-value, but I do not like the name
because their quantity is not really a p-value. Suppose that before seeing any
data you thought that the null and alternative each had probability .5 of being
true. Then for p-values less than $e^{-1} \approx .37$, the Sellke et al. approximate
error probability is
\[ P(p) = \frac{-e\,p\,\log(p)}{1 - e\,p\,\log(p)} . \]
The interpretation of the approximate error probability P(p) is that having
seen a p-value of p, the probability that rejecting the null hypothesis is an
error is at least P(p). Sellke et al. show that this lower bound is pretty
good in a wide variety of problems. Table 3.2 shows that the probability that
rejection is a Type I error is more than .1, even for a p-value of .01.
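The bound is a one-line function; a sketch reproducing Table 3.2 (log is the natural logarithm):

    from math import e, log

    def sellke_bound(p):
        """Sellke et al. lower bound on the error probability, for p < 1/e."""
        q = -e * p * log(p)     # positive, since log(p) < 0
        return q / (1 + q)

    for p in [.05, .01, .001, .0001]:
        print(p, round(sellke_bound(p), 4))
        # .2894, .1113, .0184, .0025: Table 3.2 after rounding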
Listing 3.1: Minitab output for resin lifetimes.
One-way Analysis of Variance
Analysis of Variance for Lifetime
Source DF SS MS F P x
Temp 4 3.53763 0.88441 96.36 0.000
Error 32 0.29369 0.00918
Total 36 3.83132
Individual 95% CIs For Mean y
Based on Pooled StDev
Level N Mean StDev --------+---------+---------+--------
1 8 1.9325 0.0634 (-*--)
2 8 1.6288 0.1048 (-*--)
3 8 1.3775 0.1071 (-*-)
4 7 1.1943 0.0458 (--*-)
5 6 1.0567 0.1384 (-*--)
--------+---------+---------+--------
Pooled StDev = 0.0958 1.20 1.50 1.80
Example 3.6: Resin lifetimes, continued

For our resin data, the treatment sum of squares is
\[ SS_{Trt} = \sum_{i=1}^{g} n_i \hat{\alpha}_i^2 = 8(.467)^2 + 8(.164)^2 + 8(-.088)^2 + 7(-.271)^2 + 6(-.408)^2 = 3.5376 . \]
We have g = 5 treatments, so there are g - 1 = 4 degrees of freedom between treatments. We computed the SS_E in Example 3.5; it was .29369 with 32 degrees of freedom. The ANOVA table is

    Source        DF    SS       MS         F
    treatments     4   3.5376   .88441     96.4
    error         32   .29369   .0091779
    total         36   3.8313
The F-statistic is about 96 with 4 and 32 degrees of freedom. There is
essentially no probability under the F-curve with 4 and 32 degrees of freedom
Listing 3.2: SAS output for resin lifetimes
Analysis of Variance Procedure
Dependent Variable: LIFETIME
Sum of Mean
Source DF Squares Square F Value Pr > F x
Model 4 3.53763206 0.88440802 96.36 0.0001
Error 32 0.29369226 0.00917788
Corrected Total 36 3.83132432
R-Square C.V. Root MSE LIFETIME Mean
0.923344 6.538733 0.09580 1.46514
Level of -----------LIFETIME---------- y
TEMPER N Mean SD
1 8 1.93250000 0.06341473
2 8 1.62875000 0.10480424
3 8 1.37750000 0.10713810
4 7 1.19428571 0.04577377
5 6 1.05666667 0.13837148
to the right of 96. (There is only .00001 probability to the right of 11.) Thus
the p-value for this test is essentially zero, and we would conclude that not all
the treatments yield the same mean lifetime. From a practical point of view,
the experimenters already knew this; the experiment was run to determine
the nature of the dependence of lifetime on temperature, not whether there
was any dependence.
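To make the arithmetic concrete, here is a short sketch (my own, using only the group summaries quoted in the listings) that rebuilds the treatment sum of squares and the F-test from the means and sample sizes.

# A minimal sketch: rebuild the resin ANOVA from group summaries.
# Means and sample sizes are taken from Listing 3.1; SS_E from Example 3.5.
import numpy as np
from scipy.stats import f

n = np.array([8, 8, 8, 7, 6])
means = np.array([1.9325, 1.6288, 1.3775, 1.1943, 1.0567])
ss_e, df_e = 0.29369, 32

grand = np.sum(n * means) / n.sum()          # weighted grand mean
ss_trt = np.sum(n * (means - grand) ** 2)    # = sum of n_i * alpha_hat_i^2
df_trt = len(n) - 1
f_stat = (ss_trt / df_trt) / (ss_e / df_e)
print(ss_trt, f_stat, f.sf(f_stat, df_trt, df_e))
# approximately 3.5376, 96.4, and a p-value of essentially zero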
Different statistics software packages give slightly different output for the ANOVA of the resin lifetime data. For example, Listing 3.1 gives Minitab ANOVA output. In addition to the ANOVA table x, the standard Minitab output includes a table of treatment means and a plot of 95% confidence intervals for those means y. Listing 3.2 gives SAS output (edited to save space) for these data x. SAS does not automatically print group means, but you can request them as shown here y.
There is a heuristic for the degrees-of-freedom formulae. Degrees of freedom for a model count the number of additional parameters used for the mean structure when moving from the next simpler model to this model. For example, the degrees of freedom for treatment are g - 1. The next simpler model is the model of a single mean for all treatments; the full model has a different mean for each of the g treatments. That is g - 1 more parameters. Alternatively, look at the $\alpha_i$'s. Under the null, they are all zero. Under the alternative, they may be nonzero, but only g - 1 of them can be set freely, because the last one is then determined by the restriction that their weighted sum must be zero. Degrees of freedom for error are the number of data values less the number of (mean) parameters estimated.
3.7 Why ANOVA Works

The mean square for error is a random variable; it depends on the random errors in the data. If we repeated the experiment, we would get different random errors and thus a different mean square for error. However, the expected value of the mean square for error, averaged over all the possible outcomes of the random errors, is the variance of the random errors: $E(MS_E) = \sigma^2$. Thus, the mean square for error estimates the error variance, no matter what the values of the $\alpha_i$'s.
The mean square for treatments is also a random variable, but $MS_{Trt}$ has expectation
\[ E(MS_{Trt}) = EMS_{Trt} = \sigma^2 + \sum_{i=1}^{g} n_i \alpha_i^2 / (g-1) . \]
The important things to get from this expression are:

1. When all of the $\alpha_i$'s are zero, the mean square for treatments also estimates $\sigma^2$.

2. When some of the $\alpha_i$'s are nonzero, the mean square for treatments tends to be bigger than $\sigma^2$.
When the null hypothesis is true, both $MS_{Trt}$ and $MS_E$ vary around $\sigma^2$, so their ratio (the F-statistic) is about one, give or take some random variation. When the null hypothesis is false, $MS_{Trt}$ tends to be bigger than $\sigma^2$, and the F-statistic tends to be bigger than one. We thus reject the null hypothesis for sufficiently large values of the F-statistic.
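A small simulation (not in the original text, using made-up effects and unit error variance) makes this behavior visible: under a true null the F-statistic hovers around one, while nonzero effects inflate the mean square for treatments.

# A minimal simulation sketch: behavior of MS_Trt, MS_E, and F
# under a true null versus nonzero treatment effects.
import numpy as np

rng = np.random.default_rng(1)
g, n, sigma = 4, 10, 1.0

def f_statistic(effects):
    y = rng.normal(0, sigma, (g, n)) + effects[:, None]
    ms_e = y.var(axis=1, ddof=1).mean()                     # pooled within-group variance
    ms_trt = n * ((y.mean(axis=1) - y.mean()) ** 2).sum() / (g - 1)
    return ms_trt / ms_e

null_fs = [f_statistic(np.zeros(g)) for _ in range(2000)]
alt_fs = [f_statistic(np.array([0., 0., .5, 1.])) for _ in range(2000)]
print(np.mean(null_fs))  # near 1, as the theory says
print(np.mean(alt_fs))   # noticeably larger than 1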
3.8 Back to Model Comparison

The preceding section described Analysis of Variance as a test of the null hypothesis that all the $\alpha_i$ values are zero. Another way to look at ANOVA is as a comparison of two models for the data. The reduced model is the model that all treatments have the same expected value (that is, the $\alpha_i$ values are all zero); the full model allows the treatments to have different expected values. From this point of view, we are not testing whether a set of parameters is all zero; we are comparing the adequacy of two different models for the mean structure.
Analysis of Variance uses sums of squared deviations from a model, just as sample standard deviations use squared deviations from a sample mean. For the reduced model (null hypothesis), the estimated model is $\hat{\mu} = \overline{y}_{\bullet\bullet}$. For the data value $y_{ij}$, the residual is
\[ r_{ij} = y_{ij} - \hat{\mu} = y_{ij} - \overline{y}_{\bullet\bullet} . \]
The residual sum of squares for the reduced model is then
\[ SSR_0 = \sum_{ij} r_{ij}^2 = \sum_{ij} (y_{ij} - \overline{y}_{\bullet\bullet})^2 . \]
For the full model (alternative hypothesis), the estimated model is $\hat{\mu}_i = \overline{y}_{i\bullet}$, and the residuals are
\[ r_{ij} = y_{ij} - \hat{\mu}_i = y_{ij} - \overline{y}_{i\bullet} . \]
The residual sum of squares for the full model is then
\[ SSR_A = \sum_{ij} r_{ij}^2 = \sum_{ij} (y_{ij} - \overline{y}_{i\bullet})^2 . \]
$SSR_A$ can never be bigger than $SSR_0$ and will almost always be smaller. We would prefer the full model if $SSR_A$ is sufficiently smaller than $SSR_0$.
How does this terminology for ANOVA mesh with what we have already seen? The residual sum of squares from the full model, $SSR_A$, is the error sum of squares $SS_E$ in the usual formulation. The residual sum of squares from the reduced model, $SSR_0$, is the total sum of squares $SS_T$ in the usual formulation. The difference $SSR_0 - SSR_A$ is equal to the treatment sum of squares $SS_{Trt}$. Thus the treatment sum of squares is the additional amount of variation in the data that can be explained by using the more complicated full model instead of the simpler reduced model.
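The identity $SSR_0 - SSR_A = SS_{Trt}$ is easy to check numerically; the sketch below (mine, on simulated data) computes both sides.

# A minimal sketch checking SSR_0 - SSR_A = SS_Trt on simulated data.
import numpy as np

rng = np.random.default_rng(2)
groups = [rng.normal(mu, 1.0, size=8) for mu in (5.0, 5.5, 6.5)]
y_all = np.concatenate(groups)

ssr_0 = np.sum((y_all - y_all.mean()) ** 2)               # reduced model: one mean
ssr_a = sum(np.sum((g - g.mean()) ** 2) for g in groups)  # full model: group means
ss_trt = sum(len(g) * (g.mean() - y_all.mean()) ** 2 for g in groups)
print(ssr_0 - ssr_a, ss_trt)   # the two numbers agree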
This idea of comparing models instead of testing hypotheses about pa-
rameters is a fairly subtle distinction, and here is why the distinction is im-
portant: in our heart of hearts, we almost never believe that the null hypoth-
esis could be true. We usually believe that at some level of precision, there
[Figure 3.3: Side-by-side plot for resin lifetime data, using MacAnova. The vertical axis shows effect or residual (-0.4 to 0.5); one column gives the Temp treatment effects (labeled 1 through 5) and the other a box-plot of the residuals.]
is a difference between the mean responses of the treatments. So why the charade of testing the null hypothesis?
The answer is that we are choosing a model for the data from a set of potential models. We want a model that is as simple as possible yet still consistent with the data. A more realistic null hypothesis is that the means are so close to being equal that the differences are negligible. When we reject the null hypothesis we are making the decision that the data are demonstrably inconsistent with the simpler model, that the differences between the means are not negligible, and that the more complicated model is required. Thus we use the F-test to guide us in our choice of model. This distinction between testing hypotheses on parameters and selecting models will become more important later.
3.9 Side-by-Side Plots

Hoaglin, Mosteller, and Tukey (1991) introduce the side-by-side plot as a method for visualizing treatment effects and residuals. Figure 3.3 shows a side-by-side plot for the resin lifetime data of Example 3.2. We plot the estimated treatment effects $\hat{\alpha}_i$ in one column and the residuals $r_{ij}$ in a second column. (There will be more columns in more complicated models we will see later.) The vertical scale is in the same units as the response. In this plot, we have used a box-plot for the residuals rather than plotting them individually; this will usually be more understandable when there are relatively many points to be put in a single column.
What we see from the side-by-side plot is that the treatment effects are large compared to the size of the residuals. We were also able to see this in the parallel box-plots in the exploratory analysis, but the side-by-side plots will generalize better to more complicated models.
3.10 Dose-Response Modeling

In some experiments, the treatments are associated with numerical levels such as drug dose, baking time, or reaction temperature. We will refer to such levels as doses, no matter what they actually are, and the numerical value of the dose for treatment i will be denoted $z_i$. When we have numerical doses, we may reexpress the treatment means as a function of the dose $z_i$:
\[ \mu + \alpha_i = f(z_i; \theta) , \]
where $\theta$ is some unknown parameter of the function. For example, we could express the mean weight of yellow birch seedlings as a function of the pH of acid rain.
The most commonly used functions f are polynomials in the dose $z_i$:
\[ \mu + \alpha_i = \beta_0 + \beta_1 z_i + \beta_2 z_i^2 + \cdots + \beta_{g-1} z_i^{g-1} . \]
We use the power g - 1 because the means at g different doses determine a polynomial of order g - 1. Polynomials are used so often because they are simple and easy to understand; they are not always the most appropriate choice.
If we know the polynomial coefficients $\beta_0, \beta_1, \ldots, \beta_{g-1}$, then we can determine the treatment means $\mu + \alpha_i$, and vice versa. If we know the polynomial coefficients except for the constant $\beta_0$, then we can determine the treatment effects $\alpha_i$, and vice versa. The g - 1 parameters $\beta_1$ through $\beta_{g-1}$ in this full polynomial model correspond to the g - 1 degrees of freedom between the treatment groups. Thus polynomials in dose are not inherently better or worse than the treatment-effects model, just another way to describe the differences between means.
Polynomial modeling is useful in two contexts. First, if only a few of the polynomial coefficients are needed (that is, the others can be set to zero without significantly decreasing the quality of the fit), then this reduced polynomial model represents a reduction in the complexity of our model. For example, learning that the response is linear or quadratic in dose is useful, whereas a polynomial of degree six or seven will be difficult to comprehend (or sell to anyone else). Second, if we wish to estimate the response at some dose other than one used in the experiment, the polynomial model provides a mechanism for generating the estimates. Note that these estimates may be poor if we are extrapolating beyond the range of the doses in our experiment or if the degree of the polynomial is high. High-order polynomials will fit our observed treatment means exactly, but these high-order polynomials can have bizarre behavior away from our data points.
Consider a sequence of regression models for our data, regressing the responses on dose, dose squared, and so on. The first model just includes the constant $\beta_0$; that is, it fits a single value for all responses. The second model includes the constant $\beta_0$ and a linear term $\beta_1 z_i$; this model fits the responses as a simple linear regression in dose. The third model includes the constant $\beta_0$, a linear term $\beta_1 z_i$, and the quadratic term $\beta_2 z_i^2$; this model fits the responses as a quadratic function (parabola) of dose. Additional models include additional powers of dose up to g - 1.
Let $SSR_k$ be the residual sum of squares for the model that includes powers up to k, for k = 0, ..., g - 1. Each successive model will explain a little more of the variability between treatments, so that $SSR_k \ge SSR_{k+1}$. When we arrive at the full polynomial model, we will have explained all of the between-treatment variability using polynomial terms; that is, $SSR_{g-1} = SS_E$. The linear sum of squares is the reduction in residual variability going from the constant model to the model with the linear term:
\[ SS_{linear} = SS_1 = SSR_0 - SSR_1 . \]
Similarly, the quadratic sum of squares is the reduction in residual variability going from the linear model to the quadratic model,
\[ SS_{quadratic} = SS_2 = SSR_1 - SSR_2 , \]
and so on through the remaining orders.
Each of these polynomial sums of squares has 1 degree of freedom, because each is the result of adding one more parameter $\beta_k$ to the model for the means. Thus their mean squares are equal to their sums of squares. In a model with terms up through order k, we can test the null hypothesis that $\beta_k = 0$ by forming the F-statistic $SS_k / MS_E$ and comparing it to an F-distribution with 1 and N - g degrees of freedom.
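The sequential improvement sums of squares are straightforward to compute by least squares; here is a sketch of that bookkeeping (my own, using numpy on hypothetical dose-response data).

# A minimal sketch: sequential polynomial improvement sums of squares.
# Hypothetical data: g = 4 equally spaced doses, 5 observations each.
import numpy as np

rng = np.random.default_rng(3)
doses = np.repeat([10., 20., 30., 40.], 5)
y = 2.0 + 0.1 * doses - 0.002 * doses**2 + rng.normal(0, 0.3, doses.size)

def ssr(order):
    """Residual SS after fitting a polynomial of the given order in dose."""
    X = np.vander(doses, order + 1)          # columns: dose^order, ..., dose, 1
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return np.sum(resid ** 2)

ssr_k = [ssr(k) for k in range(4)]           # k = 0, ..., g-1
improvements = [ssr_k[k] - ssr_k[k + 1] for k in range(3)]
print(improvements)  # SS_linear, SS_quadratic, SS_cubic; each has 1 df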
One method for choosing a polynomial model is to choose the smallest order such that no significant terms are excluded. (More sophisticated model selection methods exist.) It is important to know that the estimated
Listing 3.3: MacAnova output for resin lifetimes polynomial model.
DF SS MS F P-value x
CONSTANT 1 79.425 79.425 8653.95365 0
{temperature} 1 3.4593 3.4593 376.91283 0
{(temperature)^2} 1 0.078343 0.078343 8.53610 0.0063378
{(temperature)^3} 1 1.8572e-05 1.8572e-05 0.00202 0.9644
{(temperature)^4} 1 8.2568e-06 8.2568e-06 0.00090 0.97626
ERROR1 32 0.29369 0.0091779
CONSTANT y
(1) 0.96995
{temperature}
(1) 0.075733
{(temperature)^2}
(1) -0.00076488
{(temperature)^3}
(1) 2.6003e-06
{(temperature)^4}
(1) -2.9879e-09
DF SS MS F P-value z
CONSTANT 1 79.425 79.425 9193.98587 0
{temperature} 1 3.4593 3.4593 400.43330 0
{(temperature)^2} 1 0.078343 0.078343 9.06878 0.0048787
ERROR1 34 0.29372 0.0086388
CONSTANT {
(1) 7.418
{temperature}
(1) -0.045098
{(temperature)^2}
(1) 7.8604e-05
coefficients $\hat{\beta}_i$ depend on which terms are in the model when the model is estimated. Thus if we decide we only need $\beta_0$, $\beta_1$, and $\beta_2$ when g is 4 or more, we should refit using just those terms to get appropriate parameter estimates.
Example 3.7: Resin lifetimes, continued

The treatments in the resin lifetime data are different temperatures (175, 194, 213, 231, and 250 degrees C), so we can use these temperatures as doses $z_i$ in a dose-response relationship. With g = 5 treatments, we can use polynomials up to power 4.
Listing 3.3 shows output for a polynomial dose-response modeling of the resin lifetime data. The first model fits up to temperature to the fourth power. From the ANOVA x we can see that neither the third nor fourth powers are significant, but the second power is, so a quadratic model seems appropriate. The ANOVA for the reduced model is at z. The linear and quadratic sums of squares are the same as in x, but the SS_E in z is increased by the cubic and quartic sums of squares in x. We can also see that the intercept, linear, and quadratic coefficients change dramatically from the full model y to the reduced model using just those terms {. We cannot simply take the intercept, linear, and quadratic coefficients from the fourth-power model and use them as if they were coefficients in a quadratic model.
One additional trick to remember when building a dose-response model is that we can transform or reexpress the dose $z_i$. That is, we can build models using log of dose or square root of dose as easily as we can using dose. For some data it is much simpler to build a model as a function of a transformation of the dose.
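As an illustration (mine, not from the text, on hypothetical doses that are equally spaced on a log scale), transforming the dose is just a change to the regressor before fitting.

# A minimal sketch: fit a dose-response model in log(dose) instead of dose.
import numpy as np

doses = np.repeat([1., 10., 100., 1000.], 3)   # equally spaced on a log scale
y = 5.0 + 2.0 * np.log10(doses) + np.random.default_rng(4).normal(0, .2, doses.size)

z = np.log10(doses)                            # reexpressed dose
coef = np.polyfit(z, y, deg=1)                 # linear model in log dose
print(coef)                                    # approximately [2.0, 5.0]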
3.11 Further Reading and Extensions

There is a second randomization that is used occasionally, and unfortunately it also is sometimes called completely randomized.

1. Choose probabilities $p_1$ through $p_g$ with $p_1 + p_2 + \cdots + p_g = 1$.

2. Choose a treatment independently for each unit, choosing treatment i with probability $p_i$.

Now we wind up with $n_i$ units getting treatment i, with $n_1 + n_2 + \cdots + n_g = N$, but the sample sizes $n_i$ are random. This randomization is different from the standard CRD randomization. ANOVA procedures do not distinguish between the fixed and random sample size randomizations, but if we were to do randomization testing, we would use different procedures for the two different randomizations. As a practical matter, we should note that even though we may design for certain fixed sample sizes, we do not always achieve those sample sizes when test tubes get dropped, subjects withdraw from studies, or drunken statistics graduate students drive through experimental fields (you know who you are!).
The estimates we have used for mean parameters are least squares estimates, meaning that they minimize the sum of squared residuals. Least squares estimation goes back to Legendre (1806) and Gauss (1809), who developed the procedure for working with astronomical data. Formal tests based on the t-distribution were introduced by Gosset, who wrote under the pseudonym "Student" (Student 1908). Gosset worked at the Guinness Brewery, and he was allowed to publish only under a pseudonym so that the competition would not be alerted to the usefulness of the procedure. What Gosset actually did was posit the t-distribution; proof was supplied later by Fisher (1925a).
The Analysis of Variance was introduced by Fisher in the context of population genetics (Fisher 1918); he quickly extended the scope (Fisher 1925b). The 1918 paper actually introduces the terms "variance" and "analysis of variance." Scheffé (1956) describes how models for data essentially the same as those used for ANOVA were in use decades earlier, though the analysis methods were different.
From a more theoretical perspective, the $SS_E$ is distributed as $\sigma^2$ times a chi-square random variable with N - g degrees of freedom; $SS_{Trt}$ is distributed as $\sigma^2$ times a possibly noncentral chi-square random variable with g - 1 degrees of freedom; and these two sums of squares are independent. When the null hypothesis is true, $SS_{Trt}$ is a multiple of an ordinary (central) chi-square; noncentrality arises under the alternative, when the expected value of $MS_{Trt}$ is greater than $\sigma^2$. The ratio of two independent central chi-squares, each divided by its degrees of freedom, is defined to have an F-distribution. Thus the null-hypothesis distribution of the F-statistic is F. Chapter 7 and Appendix A discuss this distribution theory in more detail. Scheffé (1959), Hocking (1985), and others provide book-length expositions of linear models and their related theory.
We have described model selection via testing a null hypothesis. An alternative approach is prediction; for example, we can choose the model that we believe will give us the lowest average squared error of prediction. Mallows (1973) defined a quantity called $C_p$:
\[ C_p = \frac{SSR_p}{MS_E} + 2p - N , \]
where $SSR_p$ is the residual sum of squares for a means model with p parameters (degrees of freedom including any overall constant), $MS_E$ is the error mean square from the separate-means model, and N is the number of observations. We prefer models with small $C_p$.
The separate-means model (with p = g parameters) has $C_p = g$. The single-mean model, dose-response models, and other models can have $C_p$ values greater or less than g. The criterion rewards models with smaller SSR and penalizes models with larger p. When comparing two models, one a reduced form of the other, $C_p$ will prefer the larger model if the F-statistic comparing the models is 2 or greater. Thus we see that it generally takes less evidence to choose a larger model when using a predictive criterion than when doing testing at the traditional levels.
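A sketch of the $C_p$ bookkeeping (my own, reusing the polynomial-fitting idea from the earlier sketch on hypothetical dose-response data):

# A minimal sketch: Mallows' Cp for a sequence of polynomial mean models.
import numpy as np

rng = np.random.default_rng(5)
doses = np.repeat([10., 20., 30., 40., 50.], 4)   # g = 5 groups, N = 20
y = 1.0 + 0.05 * doses + rng.normal(0, 0.2, doses.size)
N, g = doses.size, 5

# MS_E from the separate-means (full) model
ss_e = sum(np.sum((y[doses == d] - y[doses == d].mean()) ** 2)
           for d in np.unique(doses))
ms_e = ss_e / (N - g)

def cp(order):
    """Cp for the polynomial model with p = order + 1 mean parameters."""
    X = np.vander(doses, order + 1)
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return np.sum(resid ** 2) / ms_e + 2 * (order + 1) - N

print([round(cp(k), 2) for k in range(g)])
# the order-4 model reproduces the separate means, so its Cp is exactly g = 5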
Quantitative dose-response models as described here are an instance of
polynomial regression. Weisberg (1985) is a good general source on regres-
sion, including polynomial regression. We have used polynomials because
they are simple and traditional, but there are many other sets of functions we
could use instead. Some interesting alternatives include sines and cosines,
B-splines, and wavelets.
3.12 Problems

Exercise 3.1
Rats were given one of four different diets at random, and the response measure was liver weight as a percentage of body weight. The responses were:

        Treatment
    1      2      3      4
  3.52   3.47   3.54   3.74
  3.36   3.73   3.52   3.83
  3.57   3.38   3.61   3.87
  4.19   3.87   3.76   4.08
  3.88   3.69   3.65   4.31
  3.76   3.51   3.51   3.98
  3.94   3.35          3.86
         3.64          3.71

(a) Compute the overall mean and treatment effects.
(b) Compute the Analysis of Variance table for these data. What would you conclude about the four diets?
Exercise 3.2
An experimenter randomly allocated 125 male turkeys to five treatment groups: control and treatments A, B, C, and D. There were 25 birds in each group, and the mean results were 2.16, 2.45, 2.91, 3.00, and 2.71, respectively. The sum of squares for experimental error was 153.4. Test the null hypothesis that the five group means are the same against the alternative that one or more of the treatments differs from the control.
Exercise 3.3
Twelve orange pulp silage samples were divided at random into four groups of three. One of the groups was left as an untreated control, while the other three groups were treated with formic acid, beet pulp, and sodium chloride, respectively. One of the responses was the moisture content of the silage. The observed moisture contents of the silage are shown below (data from Caro et al. 1990):

             NaCl   Formic acid   Beet pulp   Control
             80.5       89.1         77.8       76.7
             79.3       75.7         79.5       77.2
             79.0       81.2         77.0       78.6
  Means      79.6       82.0         78.1       77.5
  Grand mean 79.3

Compute an analysis of variance table for these data and test the null hypothesis that all four treatments yield the same average moisture contents.
Exercise 3.4
We have five groups and three observations per group. The group means are 6.5, 4.5, 5.7, 5.7, and 5.1, and the mean square for error is .75. Compute an ANOVA table for these data.
Exercise 3.5
The leaves of certain plants in the genus Albizzia will fold and unfold in various light conditions. We have taken fifteen different leaves and subjected them to red light for 3 minutes. The leaves were divided into three groups of five at random. The leaflet angles were then measured 30, 45, and 60 minutes after light exposure in the three groups. Data from W. Hughes.

  Delay (minutes)   Angle (degrees)
        30          140 138 140 138 142
        45          140 150 120 128 130
        60          118 130 128 118 118

Analyze these data to test the null hypothesis that delay after exposure does not affect leaflet angle.
Problem 3.1
Cardiac pacemakers contain electrical connections that are platinum pins soldered onto a substrate. The question of interest is whether different operators produce solder joints with the same strength. Twelve substrates are randomly assigned to four operators. Each operator solders four pins on each substrate, and then these solder joints are assessed by measuring the shear strength of the pins. Data from T. Kerkow.

                                Strength (lb)
  Operator      Substrate 1            Substrate 2            Substrate 3
     1     5.60 6.80 8.32 8.70    7.64 7.44 7.48 7.80    7.72 8.40 6.98 8.00
     2     5.04 7.38 5.56 6.96    8.30 6.86 5.62 7.22    5.72 6.40 7.54 7.50
     3     8.36 7.04 6.92 8.18    6.20 6.10 2.75 8.14    9.00 8.64 6.60 8.18
     4     8.30 8.54 7.68 8.92    8.46 7.38 8.08 8.12    8.68 8.24 8.09 8.06

Analyze these data to determine if there is any evidence that the operators produce different mean shear strengths. (Hint: what are the experimental units?)
Problem 3.2
Scientists are interested in whether the energy costs involved in reproduction affect longevity. In this experiment, 125 male fruit flies were divided at random into five sets of 25. In one group, the males were kept by themselves. In two groups, the males were supplied with one or eight receptive virgin female fruit flies per day. In the final two groups, the males were supplied with one or eight unreceptive (pregnant) female fruit flies per day. Other than the number and type of companions, the males were treated identically. The longevity of the flies was observed. Data from Hanley and Shapiro (1994).

  Companions    Longevity (days)
  None          35 37 49 46 63 39 46 56 63 65 56 65 70
                63 65 70 77 81 86 70 70 77 77 81 77
  1 pregnant    40 37 44 47 47 47 68 47 54 61 71 75 89
                58 59 62 79 96 58 62 70 72 75 96 75
  1 virgin      46 42 65 46 58 42 48 58 50 80 63 65 70
                70 72 97 46 56 70 70 72 76 90 76 92
  8 pregnant    21 40 44 54 36 40 56 60 48 53 60 60 65
                68 60 81 81 48 48 56 68 75 81 48 68
  8 virgin      16 19 19 32 33 33 30 42 42 33 26 30 40
                54 34 34 47 47 42 47 54 54 56 60 44

Analyze these data to test the null hypothesis that reproductive activity does not affect longevity. Write a report on your analysis. Be sure to describe the experiment as well as your results.
Problem 3.3
Park managers need to know how resistant different vegetative types are to trampling so that the number of visitors can be controlled in sensitive areas. The experiment deals with alpine meadows in the White Mountains of New Hampshire. Twenty lanes were established, each .5 m wide and 1.5 m long. These twenty lanes were randomly assigned to five treatments: 0, 25, 75, 200, or 500 walking passes. Each pass consists of a 70-kg individual wearing lug-soled boots walking in a natural gait down the lane. The response measured is the average height of the vegetation along the lane one year after trampling. Data based on Table 16 of Cole (1993).

  Number of passes    Height (cm)
          0           20.7 15.9 17.8 17.6
         25           12.9 13.4 12.7  9.0
         75           11.8 12.6 11.4 12.1
        200            7.6  9.5  9.9  9.0
        500            7.8  9.0  8.5  6.7

Analyze these data to determine if trampling has an effect after one year, and if so, describe that effect.
Problem 3.4
Caffeine is a common drug that affects the central nervous system. Among the issues involved with caffeine are how it gets from the blood to the brain, and whether the presence of caffeine alters the ability of similar compounds to move across the blood-brain barrier. In this experiment, 43 lab rats were randomly assigned to one of eight treatments. Each treatment consisted of an arterial injection of C$^{14}$-labeled adenine together with a concentration of caffeine (0 to 50 mM). Shortly after injection, the concentration of labeled adenine in the rat brains is measured as the response (data from McCall, Millington, and Wurtman 1982).

  Caffeine (mM)    Adenine
        0          5.74 6.90 3.86 6.94 6.49 1.87
       0.1         2.91 4.14 6.29 4.40 3.77
       0.5         5.80 5.84 3.18 3.18
        1          3.49 2.16 7.36 1.98 5.51
        5          5.92 3.66 4.62 3.47 1.33
       10          3.05 1.94 1.23 3.45 1.61 4.32
       25          1.27  .69  .85  .71 1.04  .84
       50           .93 1.47 1.27 1.13 1.25  .55

The main issues in this experiment are whether the amount of caffeine present affects the amount of adenine that can move from the blood to the brain, and if so, what the dose-response relationship is. Analyze these data.
Problem 3.5
Engineers wish to know the effect of polypropylene fibers on the compressive strength of concrete. Fifteen concrete cubes are produced and randomly assigned to five levels of fiber content (0, .25, .50, .75, and 1%). Data from Figure 2 of Paskova and Meyer (1997).

  Fiber content (%)    Strength (ksi)
          0            7.8 7.4 7.2
        .25            7.9 7.5 7.3
        .50            7.4 6.9 6.3
        .75            7.0 6.7 6.4
          1            5.9 5.8 5.6

Analyze these data to determine if fiber content has an effect on concrete strength, and if so, describe that effect.
Question 3.1
Prove that $\hat{\mu} = \sum_{i=1}^{g} \hat{\mu}_i / g$ is equivalent to $\sum_{i=1}^{g} \hat{\alpha}_i = 0$.

Question 3.2
Prove that
\[ 0 = \sum_{i=1}^{g} \sum_{j=1}^{n_i} \hat{\alpha}_i r_{ij} . \]
Chapter 4

Looking for Specific Differences: Contrasts

An Analysis of Variance can give us an indication that not all the treatment groups have the same mean response, but an ANOVA does not, by itself, tell us which treatments are different or in what ways they differ. To do this, we need to look at the treatment means, or equivalently, at the treatment effects. One method to examine treatment effects is called a contrast.
ANOVA is like background lighting that dimly illuminates all of our data, but does not give enough light to see details. Using a contrast is like using a spotlight; it enables us to focus in on a specific, narrow feature of the data. But the contrast has such a narrow focus that it does not give the overall picture. By using several contrasts, we can move our focus around and see more features. Intelligent use of contrasts involves choosing our contrasts so that they highlight interesting features in our data.
4.1 Contrast Basics

Contrasts take the form of a difference between means or averages of means. For example, here are two contrasts:
\[ (\mu + \alpha_6) - (\mu + \alpha_3) \]
and
\[ \frac{(\mu+\alpha_2) + (\mu+\alpha_4)}{2} - \frac{(\mu+\alpha_1) + (\mu+\alpha_3) + (\mu+\alpha_5)}{3} . \]
The first compares the means of treatments 6 and 3, while the second compares the mean response in groups 2 and 4 with the mean response in groups 1, 3, and 5.
Formally, a contrast is a linear combination of treatment means or effects,
\[ \sum_{i=1}^{g} w_i \mu_i = w(\{\mu_i\}) \quad \text{or} \quad \sum_{i=1}^{g} w_i \alpha_i = w(\{\alpha_i\}) , \]
where the coefficients $w_i$ satisfy
\[ \sum_{i=1}^{g} w_i = 0 . \]

Contrast coefficients add to zero.

Less formally, we sometimes speak of the set of contrast coefficients $\{w_i\}$ as being a contrast; we will try to avoid ambiguity. Notice that because the sum of the coefficients is zero, we have that
\[ w(\{\alpha_i\}) = \sum_{i=1}^{g} w_i \alpha_i = x \sum_{i=1}^{g} w_i + \sum_{i=1}^{g} w_i \alpha_i = \sum_{i=1}^{g} w_i (x + \alpha_i) = \sum_{i=1}^{g} w_i (\mu + \alpha_i) = w(\{\mu_i\}) \]
for any fixed constant x (say $\mu$ or $\hat{\mu}$). We may also make contrasts in the observed data:
\[ w(\{\overline{y}_{i\bullet}\}) = \sum_{i=1}^{g} w_i \overline{y}_{i\bullet} = \sum_{i=1}^{g} w_i (\overline{y}_{i\bullet} - \overline{y}_{\bullet\bullet}) = \sum_{i=1}^{g} w_i \hat{\alpha}_i = w(\{\hat{\alpha}_i\}) . \]
A contrast depends on the differences between the values being contrasted, but not on the overall level of the values. In particular, a contrast in treatment means depends on the $\alpha_i$'s but not on $\mu$. A contrast in the treatment means or effects will be the same regardless of whether we assume that $\alpha_1 = 0$, or $\sum_i \alpha_i = 0$, or $\sum_i n_i \alpha_i = 0$. Recall that with respect to restrictions on the treatment effects, we said that the important things don't depend on which set of restrictions we use. In particular, contrasts don't depend on the restrictions.
We may use several different kinds of contrasts in any one analysis. The trick is to find or construct contrasts that focus in on interesting features of the data.
Probably the most common contrasts are pairwise comparisons, where we contrast the mean response in one treatment with the mean response in a second treatment. For a pairwise comparison, one contrast coefficient is 1, a second contrast coefficient is -1, and all other contrast coefficients are 0. For example, in an experiment with g = 4 treatments, the coefficients (0, 1, -1, 0) compare the means of treatments 2 and 3, and the coefficients (-1, 0, 1, 0) compare the means of treatments 1 and 3. For g treatments, there are g(g - 1)/2 different pairwise comparisons. We will consider simultaneous inference for pairwise comparisons in Section 5.4.
A second classic example of contrasts occurs in an experiment with a control and two or more new treatments. Suppose that treatment 1 is a control, and treatments 2 and 3 are new treatments. We might wish to compare the average response in the new treatments to the average response in the control; that is, on average do the new treatments have the same response as the control? Here we could use coefficients (-1, .5, .5), which would subtract the average control response from the average of treatments 2 and 3's average responses. As discussed below, this contrast applied to the observed treatment means, $(\overline{y}_{2\bullet} + \overline{y}_{3\bullet})/2 - \overline{y}_{1\bullet}$, would estimate the contrast in the treatment effects, $(\alpha_2 + \alpha_3)/2 - \alpha_1$. Note that we would get the same kind of information from contrasts with coefficients (1, -.5, -.5) or (-6, 3, 3); we've just rescaled the result with no essential loss of information. We might also be interested in the pairwise comparisons, including a comparison of the new treatments to each other (0, 1, -1) and comparisons of each of the new treatments to control (1, -1, 0) and (1, 0, -1).
Consider next an experiment with four treatments examining the growth rate of lambs. The treatments are four different food supplements. Treatment 1 is soy meal and ground corn, treatment 2 is soy meal and ground oats, treatment 3 is fish meal and ground corn, and treatment 4 is fish meal and ground oats. Again, there are many potential contrasts of interest. A contrast with coefficients (.5, .5, -.5, -.5) would take the average response for fish meal treatments and subtract it from the average response for soy meal treatments. This could tell us how the protein source affects the response. Similarly, a contrast with coefficients (.5, -.5, .5, -.5) would take the average response for ground oats and subtract it from the average response for ground corn, telling us about the effect of the carbohydrate source.
Finally, consider an experiment with three treatments examining the effect of development time on the number of defects in computer chips produced using photolithography. The three treatments are 30, 45, and 60 seconds of developing. If we think of the responses as lying on a straight-line function of development time, then the contrast with coefficients (-1/30, 0, 1/30) will estimate the slope of the line relating response and time. If instead we think that the responses lie on a quadratic function of development time, then the contrast with coefficients (1/450, -2/450, 1/450) will estimate the quadratic term in the response function. Don't worry for now about where these coefficients come from; they will be discussed in more detail in Section 4.4. For now, consider that the first contrast compares the responses at the ends to get a rate of change, and the second contrast compares the ends to the middle (which yields a 0 comparison for responses on a straight line) to assess curvature.
4.2 Inference for Contrasts

We use contrasts in observed treatment means or effects to make inference about the corresponding contrasts in the true treatment means or effects. The kinds of inference we work with here are point estimates, confidence intervals, and tests of significance. The procedures we use for contrasts are similar to the procedures we use when estimating or testing means.
The observed treatment mean $\overline{y}_{i\bullet}$ is an unbiased estimate of $\mu_i = \mu + \alpha_i$, so a sum or other linear combination of observed treatment means is an unbiased estimate of the corresponding combination of the $\mu_i$'s. In particular, a contrast in the observed treatment means is an unbiased estimate of the corresponding contrast in the true treatment means. Thus we have:
\[ E[w(\{\overline{y}_{i\bullet}\})] = E[w(\{\hat{\mu}_i\})] = w(\{\mu_i\}) = w(\{\alpha_i\}) . \]
The variance of $\overline{y}_{i\bullet}$ is $\sigma^2 / n_i$, and the treatment means are independent, so the variance of a contrast in the observed means is
\[ \mathrm{Var}[w(\{\overline{y}_{i\bullet}\})] = \sigma^2 \sum_{i=1}^{g} \frac{w_i^2}{n_i} . \]
We will usually not know $\sigma^2$, so we estimate it by the mean square for error from the ANOVA.
We compute a confidence interval for a mean parameter with the general form: unbiased estimate plus or minus t-multiplier times estimated standard error. Contrasts are linear combinations of mean parameters, so we use the same basic form. We have already seen how to compute an estimate and standard error, so
\[ w(\{\overline{y}_{i\bullet}\}) \pm t_{E/2,\,N-g} \sqrt{MS_E \sum_{i=1}^{g} \frac{w_i^2}{n_i}} \]
forms a 1 - E confidence interval for $w(\{\alpha_i\})$. As usual, the degrees of freedom for our t-percent point come from the degrees of freedom for our estimate of error variance, here N - g. We use the E/2 percent point because we are forming a two-sided confidence interval, with E/2 error on each side.
The usual t-test statistic for a mean parameter takes the form
\[ \frac{\text{unbiased estimate} - \text{null hypothesis value}}{\text{estimated standard error of estimate}} . \]
This form also works for contrasts. If we have the null hypothesis $H_0: w(\{\alpha_i\}) = \delta$, then we can do a t-test of that null hypothesis by computing the test statistic
\[ t = \frac{w(\{\overline{y}_{i\bullet}\}) - \delta}{\sqrt{MS_E \sum_{i=1}^{g} w_i^2 / n_i}} . \]
Under $H_0$, this t-statistic will have a t-distribution with N - g degrees of freedom. Again, the degrees of freedom come from our estimate of error variance. The p-value for this t-test is computed by getting the area under the t-distribution with N - g degrees of freedom for the appropriate region: either less or greater than the observed t-statistic for one-sided alternatives, or twice the tail area for a two-sided alternative.
We may also compute a sum of squares for any contrast $w(\{\overline{y}_{i\bullet}\})$:
\[ SS_w = \frac{\left( \sum_{i=1}^{g} w_i \overline{y}_{i\bullet} \right)^2}{\sum_{i=1}^{g} w_i^2 / n_i} . \]
This sum of squares has 1 degree of freedom, so its mean square is $MS_w = SS_w / 1 = SS_w$. We may use $MS_w$ to test the null hypothesis that $w(\{\alpha_i\}) = 0$ by forming the F-statistic $MS_w / MS_E$. If $H_0$ is true, this F-statistic will have an F-distribution with 1 and N - g degrees of freedom (the N - g coming from the $MS_E$). It is not too hard to see that this F is exactly equal to the square of the t-statistic computed for the same null hypothesis $\delta = 0$. Thus the F-test and two-sided t-tests are equivalent for the null hypothesis of zero contrast mean. It is also not too hard to see that if you multiply the contrast coefficients by a nonzero constant (for example, change from (-1, .5, .5) to (2, -1, -1)), then the contrast sum of squares is unchanged. The squared constant cancels from the numerator and denominator of the formula.
Example 4.1: Rat liver weights

Exercise 3.1 provided data on the weight of rat livers as a percentage of body weight for four different diets. Summary statistics from those data follow:

  i          1      2      3      4
  ybar_i    3.75   3.58   3.60   3.92
  n_i        7      8      6      8       MS_E = .04138
If diets 1, 2, and 3 are rations made by one manufacturer, and diet 4 is a ration made by a second manufacturer, then it may be of interest to compare the responses from the diets of the two manufacturers to see if there is any difference.
The contrast with coefficients (1/3, 1/3, 1/3, -1) will compare the mean response in the first three diets with the mean response in the last diet. Note that we intend "the mean response in the first three diets" to denote the average of the treatment averages, not the simple average of all the data from those three treatments. The simple average will not be the same as the average of the averages because the sample sizes are different.
Our point estimate of this contrast is
\[ w(\{\overline{y}_{i\bullet}\}) = \frac{1}{3}\,3.75 + \frac{1}{3}\,3.58 + \frac{1}{3}\,3.60 + (-1)\,3.92 = -.277 \]
with standard error
\[ SE(w(\{\overline{y}_{i\bullet}\})) = \sqrt{.04138 \left( \frac{(1/3)^2}{7} + \frac{(1/3)^2}{8} + \frac{(1/3)^2}{6} + \frac{(-1)^2}{8} \right)} = .0847 . \]
The mean square for error has 29 - 4 = 25 degrees of freedom. To construct a 95% confidence interval for $w(\{\alpha_i\})$, we need the upper 2.5% point of a t-distribution with 25 degrees of freedom; this is 2.06, as can be found in Appendix Table D.3 or using software. Thus our 95% confidence interval is
\[ -.277 \pm 2.06 \times .0847 = -.277 \pm .174 = (-.451, -.103) . \]
Suppose that we wish to test the null hypothesis $H_0: w(\{\alpha_i\}) = \delta$. Here we will use the t-test and F-test to test $H_0: w(\{\alpha_i\}) = \delta = 0$, but the t-test can test other values of $\delta$. Our t-statistic is
\[ \frac{-.277 - 0}{.0847} = -3.27 , \]
with 25 degrees of freedom. For a two-sided alternative, we compute the p-value by finding the tail area under the t-curve and doubling it. Here we get twice .00156, or about .003. This is rather strong evidence against the null hypothesis.
Because our null hypothesis value is zero with a two-sided alternative, we can also test our null hypothesis by computing a mean square for the contrast
Listing 4.1: SAS PROC GLM output for the rat liver contrast.
Source DF Type I SS Mean Square F Value Pr > F
DIET 3 0.57820903 0.19273634 4.66 0.0102
Contrast DF Contrast SS Mean Square F Value Pr > F
1,2,3 vs 4 1 0.45617253 0.45617253 11.03 0.0028
Listing 4.2: MacAnova output for the rat liver contrast.
component: estimate
(1) -0.28115
component: ss
(1) 0.45617
component: se
(1) 0.084674
and forming an F-statistic. The sum of squares for our contrast is
\[ \frac{\left( \frac{1}{3}\,3.75 + \frac{1}{3}\,3.58 + \frac{1}{3}\,3.60 + (-1)\,3.92 \right)^2}{\frac{(1/3)^2}{7} + \frac{(1/3)^2}{8} + \frac{(1/3)^2}{6} + \frac{(-1)^2}{8}} = \frac{(-.277)^2}{.1733} = .443 . \]
The mean square is also .443, so the F-statistic is .443/.04138 = 10.7. We compute a p-value by finding the area to the right of 10.7 under the F-distribution with 1 and 25 degrees of freedom, getting .003 as for the t-test.
Listing 4.1 shows output from SAS for computing the sum of squares for this contrast; Listing 4.2 shows corresponding MacAnova output. The sums of squares in these two listings differ from what we obtained above due to rounding at several steps.
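These hand computations are easy to script; the sketch below (mine, using scipy) reproduces the estimate, standard error, confidence interval, and tests from this example.

# A minimal sketch of contrast inference, using the rat liver summaries.
import numpy as np
from scipy.stats import t, f

w = np.array([1/3, 1/3, 1/3, -1.0])        # contrast coefficients (sum to zero)
means = np.array([3.75, 3.58, 3.60, 3.92])
n = np.array([7, 8, 6, 8])
ms_e, df_e = 0.04138, 25

est = w @ means                            # point estimate, about -.277
se = np.sqrt(ms_e * np.sum(w**2 / n))      # standard error, about .0847
tcrit = t.ppf(0.975, df_e)                 # about 2.06
print(est - tcrit * se, est + tcrit * se)  # 95% CI, about (-.451, -.103)

t_stat = est / se                          # about -3.27
print(2 * t.sf(abs(t_stat), df_e))         # two-sided p-value, about .003

ss_w = est**2 / np.sum(w**2 / n)           # contrast SS, about .44
print(f.sf(ss_w / ms_e, 1, df_e))          # F-test gives the same p-value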
4.3 Orthogonal Contrasts

Two contrasts $\{w_i\}$ and $\{w_i^\star\}$ are orthogonal if
\[ \sum_{i=1}^{g} \frac{w_i w_i^\star}{n_i} = 0 . \]
If there are g treatments, you can find a set of g - 1 contrasts that are mutually orthogonal, that is, each one is orthogonal to all of the others. However, there are infinitely many sets of g - 1 mutually orthogonal contrasts, and there are no mutually orthogonal sets with more than g - 1 contrasts. There is an analogy from geometry. In a plane, you can have two lines that are perpendicular (orthogonal), but you can't find a third line that is perpendicular to both of the others. On the other hand, there are infinitely many pairs of perpendicular lines.
The important feature of orthogonal contrasts applied to observed means is that they are independent (as random variables). Thus, the random error of one contrast is not correlated with the random error of an orthogonal contrast. An additional useful fact about orthogonal contrasts is that they partition the between-groups sum of squares. That is, if you compute the sums of squares for a full set of orthogonal contrasts (g - 1 contrasts for g groups), then adding up those g - 1 sums of squares will give you exactly the between-groups sum of squares (which also has g - 1 degrees of freedom).
Example 4.2: Orthogonal contrast inference

Suppose that we have an experiment with three treatments (a control and two new treatments) with group sizes 10, 5, and 5, and treatment means 6.3, 6.4, and 6.5. The $MS_E$ is .0225 with 17 degrees of freedom. The contrast w with coefficients (1, -.5, -.5) compares the mean response in the control treatment with the average of the mean responses in the new treatments. The contrast with coefficients (0, 1, -1) compares the two new treatments. These contrasts are orthogonal, because
\[ \frac{1 \times 0}{10} + \frac{(-.5) \times 1}{5} + \frac{(-.5) \times (-1)}{5} = 0 . \]
We have three groups, so there are 2 degrees of freedom between groups, and we have described above a set of orthogonal contrasts. The sum of squares for the first contrast is
\[ \frac{(6.3 - .5 \times 6.4 - .5 \times 6.5)^2}{\frac{1^2}{10} + \frac{(-.5)^2}{5} + \frac{(-.5)^2}{5}} = .1125 , \]
and the sum of squares for the second contrast is
\[ \frac{(0 + 6.4 - 6.5)^2}{\frac{0^2}{10} + \frac{1^2}{5} + \frac{(-1)^2}{5}} = \frac{.01}{.4} = .025 . \]
The between-groups sum of squares is
\[ 10(6.3 - 6.375)^2 + 5(6.4 - 6.375)^2 + 5(6.5 - 6.375)^2 = .1375 , \]
which equals .1125 + .025.
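A quick numerical check of the orthogonal partition (my sketch, using the same summary numbers):

# A minimal sketch: orthogonal contrasts partition the between-groups SS.
import numpy as np

n = np.array([10, 5, 5])
means = np.array([6.3, 6.4, 6.5])
contrasts = [np.array([1.0, -0.5, -0.5]), np.array([0.0, 1.0, -1.0])]

# orthogonality: the sum of w_i * w*_i / n_i is zero
print(np.sum(contrasts[0] * contrasts[1] / n))          # 0.0

ss = [(w @ means) ** 2 / np.sum(w**2 / n) for w in contrasts]
grand = np.sum(n * means) / n.sum()
ss_between = np.sum(n * (means - grand) ** 2)
print(ss, ss_between)   # [.1125, .025] and .1375; the pieces add up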
We can see from Example 4.2 one of the advantages of contrasts over the full between-groups sum of squares. The control-versus-new contrast has a sum of squares which is 4.5 times larger than the sum of squares for the difference of the new treatments. This indicates that the responses from the new treatments are substantially farther from the control responses than they are from each other. Such indications are not possible using the between-groups sum of squares.
The actual contrasts one uses in an analysis arise from the context of the problem. Here we had new versus old and the difference between the two new treatments. In a study on the composition of ice cream, we might compare artificial flavorings with natural flavorings, or expensive flavorings with inexpensive flavorings. It is often difficult to construct a complete set of meaningful orthogonal contrasts, but that should not deter you from using an incomplete set of orthogonal contrasts, or from using contrasts that are nonorthogonal.

Use contrasts that address the questions you are trying to answer.
4.4 Polynomial Contrasts

Section 3.10 introduced the idea of polynomial modeling of a response when the treatments have a quantitative dose structure. We selected a polynomial model by looking at the improvement sums of squares obtained by adding each polynomial term to the model in sequence. Each of these additional terms in the polynomial has a single degree of freedom, just like a contrast. In fact, each of these improvement sums of squares can be obtained as a contrast sum of squares. We call the contrast that gives us the sum of squares for the linear term the linear contrast, the contrast that gives us the improvement sum of squares for the quadratic term the quadratic contrast, and so on.
When the doses are equally spaced and the sample sizes are equal, the contrast coefficients for polynomial terms are fairly simple and can be found, for example, in Appendix Table D.6; these contrasts are orthogonal and have been scaled to be simple integer values. Equally spaced doses means that the gaps between successive doses are the same, as in 1, 4, 7, 10. Using these tabulated contrast coefficients, we may compute the linear, quadratic, and higher-order sums of squares as contrasts without fitting a separate polynomial model. Doses such as 1, 10, 100, 1000 are equally spaced on a logarithmic scale, so we can again use the simple polynomial contrast coefficients, provided we interpret the polynomial as a polynomial in the logarithm of dose.
When the doses are not equally spaced or the sample sizes are not equal, contrasts for polynomial terms exist but are rather complicated to derive. In this situation, it is more trouble to derive the coefficients for the polynomial contrasts than it is to fit a polynomial model.
Example 4.3: Leaflet angles

Exercise 3.5 introduced the leaflet angles of plants at 30, 45, and 60 minutes after exposure to red light. Summary information for this experiment is given here:

  Delay time (min)     30      45      60
  ybar_i             139.6   133.6   122.4
  n_i                   5       5       5       MS_E = 58.13

With three equally spaced groups, the linear and quadratic contrasts are (-1, 0, 1) and (1, -2, 1).
The sum of squares for linear is
\[ \frac{((-1)139.6 + (0)133.6 + (1)122.4)^2}{\frac{(-1)^2}{5} + \frac{0^2}{5} + \frac{1^2}{5}} = 739.6 , \]
and that for quadratic is
\[ \frac{((1)139.6 + (-2)133.6 + (1)122.4)^2}{\frac{1^2}{5} + \frac{(-2)^2}{5} + \frac{1^2}{5}} = 22.53 . \]
Thus the F-tests for linear and quadratic are 739.6/58.13 = 12.7 and 22.53/58.13 = .39, both with 1 and 12 degrees of freedom; there is a strong linear trend in the means and almost no nonlinear trend.
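The same numbers fall out of a few lines of code; a sketch (mine) using the tabulated coefficients:

# A minimal sketch: polynomial contrast SS and F-tests for the leaflet data.
import numpy as np
from scipy.stats import f

means = np.array([139.6, 133.6, 122.4])
n = np.array([5, 5, 5])
ms_e, df_e = 58.13, 12

for name, w in [("linear", np.array([-1., 0., 1.])),
                ("quadratic", np.array([1., -2., 1.]))]:
    ss = (w @ means) ** 2 / np.sum(w**2 / n)
    f_stat = ss / ms_e
    print(name, round(ss, 1), round(f_stat, 2), round(f.sf(f_stat, 1, df_e), 3))
# linear: SS 739.6, F 12.72; quadratic: SS 22.5, F 0.39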
4.5 Further Reading and Extensions

Contrasts are a special case of estimable functions, which are described in some detail in Appendix Section A.6. Treatment means and averages of treatment means are other estimable functions. Estimable functions are those features of the data that do not depend on how we choose to restrict the treatment effects.
4.6 Problems

Exercise 4.1
An experimenter randomly allocated 125 male turkeys to five treatment groups: 0 mg, 20 mg, 40 mg, 60 mg, and 80 mg of estradiol. There were 25 birds in each group, and the mean results were 2.16, 2.45, 2.91, 3.00, and 2.71, respectively. The sum of squares for experimental error was 153.4. Test the null hypothesis that the five group means are the same against the alternative that they are not all the same. Find the linear, quadratic, cubic, and quartic sums of squares (you may lump the cubic and quartic together into a "higher than quadratic" if you like). Test the null hypothesis that the quadratic effect is zero. Be sure to report a p-value.

Exercise 4.2
Use the data from Exercise 3.3. Compute a 99% confidence interval for the difference in response between the average of the three treatment groups (acid, pulp, and salt) and the control group.
Exercise 4.3
Refer to the data in Problem 3.1. Workers 1 and 2 were experienced, whereas workers 3 and 4 were novices. Find a contrast to compare the experienced and novice workers and test the null hypothesis that experienced and novice workers produce the same average shear strength.

Exercise 4.4
Consider an experiment taste-testing six types of chocolate chip cookies: 1 (brand A, chewy, expensive), 2 (brand A, crispy, expensive), 3 (brand B, chewy, inexpensive), 4 (brand B, crispy, inexpensive), 5 (brand C, chewy, expensive), and 6 (brand D, crispy, inexpensive). We will use twenty different raters randomly assigned to each type (120 total raters).
(a) Design contrasts to compare chewy with crispy, and expensive with inexpensive.
(b) Are your contrasts in part (a) orthogonal? Why or why not?
Problem 4.1
A consumer testing agency obtains four cars from each of six makes: Ford, Chevrolet, Nissan, Lincoln, Cadillac, and Mercedes. Makes 3 and 6 are imported while the others are domestic; makes 4, 5, and 6 are expensive while 1, 2, and 3 are less expensive; 1 and 4 are Ford products, while 2 and 5 are GM products. We wish to compare the six makes on their oil use per 100,000 miles driven. The mean responses by make of car were 4.6, 4.3, 4.4, 4.7, 4.8, and 6.2, and the sum of squares for error was 2.25.
(a) Compute the Analysis of Variance table for this experiment. What would you conclude?
(b) Design a set of contrasts that seem meaningful. For each contrast, outline its purpose and compute a 95% confidence interval.

Problem 4.2
Consider the data in Problem 3.2. Design a set of contrasts that seem meaningful. For each contrast, outline its purpose and test the null hypothesis that the contrast has expected value zero.

Problem 4.3
Consider the data in Problem 3.5. Use polynomial contrasts to choose a quantitative model to describe the effect of fiber proportion on the response.

Question 4.1
Show that orthogonal contrasts in the observed treatment means are uncorrelated random variables.
Chapter 5

Multiple Comparisons

When we make several related tests or interval estimates at the same time, we need to make multiple comparisons or do simultaneous inference. The issue of multiple comparisons is one of error rates. Each of the individual tests or confidence intervals has a Type I error rate $E_i$ that can be controlled by the experimenter. If we consider the tests together as a family, then we can also compute a combined Type I error rate for the family of tests or intervals. When a family contains more and more true null hypotheses, the probability that one or more of these true null hypotheses is rejected increases, and the probability of any Type I errors in the family can become quite large. Multiple comparisons procedures deal with Type I error rates for families of tests.
Example 5.1: Carcinogenic mixtures

We are considering a new cleaning solvent that is a mixture of 100 chemicals. Suppose that regulations state that a mixture is safe if all of its constituents are safe (pretending we can ignore chemical interactions). We test the 100 chemicals for causing cancer, running each test at the 5% level. This is the individual error rate that we can control.
What happens if all 100 chemicals are harmless and safe? Because we are testing at the 5% level, we expect 5% of the nulls to be rejected even when all the nulls are true. Thus, on average, 5 of the 100 chemicals will be declared to be carcinogenic, even when all are safe. Moreover, if the tests are independent, then one or more of the chemicals will be declared unsafe in 99.4% of all sets of experiments we run, even if all the chemicals are safe. This 99.4% is a combined Type I error rate; clearly we have a problem.
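The 99.4% figure follows from independence: the chance of no false rejections in 100 independent 5%-level tests is $.95^{100}$. A one-line check (mine):

# Combined Type I error rate for 100 independent tests at the 5% level.
print(1 - 0.95 ** 100)   # about 0.994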
5.1 Error Rates

When we have more than one test or interval to consider, there are several ways to define a combined Type I error rate for the family of tests. This variety of combined Type I error rates is the source of much confusion in the use of multiple comparisons, as different error rates lead to different procedures. People sometimes ask "Which procedure should I use?" when the real question is "Which error rate do I want to control?" As data analyst, you need to decide which error rate is appropriate for your situation and then choose a method of analysis appropriate for that error rate. This choice of error rate is not so much a statistical decision as a scientific decision in the particular area under consideration.
Data snooping is a practice related to having many tests. Data snooping occurs when we first look over the data and then choose the null hypotheses to be tested based on "interesting" features in the data. What we tend to do is consider many potential features of the data and discard those with uninteresting or null behavior. When we data snoop and then perform a test, we tend to see the smallest p-value from the ill-defined family of tests that we considered when we were snooping; we have not really performed just one test. Some multiple comparisons procedures can actually control for data snooping.

Simultaneous inference is deciding which error rate we wish to control, and then using a procedure that controls the desired error rate.
Let's set up some notation for our problem. We have a set of K null hypotheses $H_{01}, H_{02}, \ldots, H_{0K}$. We also have the combined, overall, or intersection null hypothesis $H_0$, which is true if all of the $H_{0i}$ are true. In formula,
\[ H_0 = H_{01} \cap H_{02} \cap \cdots \cap H_{0K} . \]
The collection $H_{01}, H_{02}, \ldots, H_{0K}$ is sometimes called a family of null hypotheses. We reject $H_0$ if any of the null hypotheses $H_{0i}$ is rejected. In Example 5.1, K = 100, $H_{0i}$ is the null hypothesis that chemical i is safe, and $H_0$ is the null hypothesis that all chemicals are safe so that the mixture is safe.
We now define five combined Type I error rates. The definitions of these error rates depend on numbers or fractions of falsely rejected null hypotheses $H_{0i}$, which will never be known in practice. We set up the error rates here and later give procedures that can be shown mathematically to control the error rates.
The per comparison error rate or comparisonwise error rate is the probability of rejecting a particular $H_{0i}$ in a single test when that $H_{0i}$ is true. Controlling the per comparison error rate at E means that the expected fraction of individual tests that reject $H_{0i}$ when $H_0$ is true is E. This is just the usual error rate for a t-test or F-test; it makes no correction for multiple comparisons. The tests in Example 5.1 controlled the per comparison error rate at 5%.
The per experiment error rate or experimentwise error rate or familywise error rate is the probability of rejecting one or more of the $H_{0i}$ (and thus rejecting $H_0$) in a series of tests when all of the $H_{0i}$ are true. Controlling the experimentwise error rate at E means that the expected fraction of experiments in which we would reject one or more of the $H_{0i}$ when $H_0$ is true is E. In Example 5.1, the per experiment error rate is the fraction of times we would declare one or more of the chemicals unsafe when in fact all were safe. Controlling the experimentwise error rate at E necessarily controls the comparisonwise error rate at no more than E. The experimentwise error rate considers all individual null hypotheses that were rejected; if any one of them was correctly rejected, then there is no penalty for any false rejections that may have occurred.
A statistical "discovery" is the rejection of an $H_{0i}$. The false discovery fraction is 0 if there are no rejections; otherwise it is the number of false discoveries (Type I errors) divided by the total number of discoveries. The false discovery rate (FDR) is the expected value of the false discovery fraction. If $H_0$ is true, then all discoveries are false and the FDR is just the experimentwise error rate. Thus controlling the FDR at E also controls the experimentwise error rate at E. However, the FDR also controls at E the average fraction of rejections that are Type I errors when some $H_{0i}$ are true and some are false, a control that the experimentwise error rate does not provide. With the FDR, we are allowed more incorrect rejections as the number of true rejections increases, but the ratio is limited. For example, with FDR at .05, we are allowed just one incorrect rejection with 19 correct rejections.
The strong familywise error rate is the probability of making any false discoveries, that is, the probability that the false discovery fraction is greater than zero. Controlling the strong familywise error rate at E means that the probability of making any false rejections is E or less, regardless of how many correct rejections are made. Thus one true rejection cannot make any false rejections more likely. Controlling the strong familywise error rate at E controls the FDR at no more than E. In Example 5.1, a strong familywise error rate of E would imply that in a situation where 2 of the chemicals were carcinogenic, the probability of declaring one of the other 98 to be carcinogenic would be no more than E.
Finally, suppose that each null hypothesis relates to some parameter (for example, a mean), and we put confidence intervals on all these parameters. An error occurs when one of our confidence intervals fails to cover the true parameter value. If this true parameter value is also the null hypothesis value, then an error is a false rejection. The simultaneous confidence intervals criterion states that all of our confidence intervals must cover their true parameters simultaneously with confidence $1 - E$. Simultaneous $1 - E$ confidence intervals also control the strong familywise error rate at no more than E. (In effect, the strong familywise criterion only requires simultaneous intervals for the null parameters.) In Example 5.1, we could construct simultaneous confidence intervals for the cancer rates of each of the 100 chemicals. Note that a single confidence interval in a collection of intervals with simultaneous coverage $1 - E$ will have coverage greater than $1 - E$.
There is a trade-off between Type I error and Type II error (failing to reject a null when it is false). As we go to more and more stringent Type I error rates, we become more confident in the rejections that we do make, but it also becomes more difficult to make rejections. Thus, when using the more stringent Type I error controls, we are more likely to fail to reject some null hypotheses that should be rejected than when using the less stringent rates. In simultaneous inference, controlling stronger error rates leads to less powerful tests.
Example 5.2 Functional magnetic resonance imaging
Many functional Magnetic Resonance Imaging (fMRI) studies are interested in determining which areas of the brain are activated when a subject is engaged in some task. Any one image slice of the brain may contain 5000 voxels (individual locations to be studied), and one analysis method produces a t-test for each of the 5000 voxels. Null hypothesis $H_{0i}$ is that voxel $i$ is not activated. Which error rate should we use?

If we are studying a small, narrowly defined brain region and are unconcerned with other brain regions, then we would want to test individually the voxels in the brain regions of interest. The fact that there are 4999 other voxels is unimportant, so we would use a per comparison method.

Suppose instead that we are interested in determining if there are any activations in the image. We recognize that by making many tests we are likely to find one that is significant, even when all nulls are true; we want to protect ourselves against that possibility, but otherwise need no stronger control. Here we would use a per experiment error rate.

Suppose that we believe that there will be many activations, so that $H_0$ is not true. We don't want some correct discoveries to open the floodgates for many false discoveries, but we are willing to live with some false discoveries
as long as they are a controlled fraction of the total made. This is acceptable because we are going to investigate several subjects; the truly activated voxels should be rejected in most subjects, and the false rejections will be scattered. Here we would use the FDR.

Suppose that in addition to expecting true activations, we are also only looking at a single subject, so that we can't use multiple subjects to determine which activations are real. Here we don't want false activations to cloud our picture, so we use the strong familywise error rate.

Finally, we might want to be able to estimate the amount of activation in every voxel, with simultaneous accuracy for all voxels. Here we would use simultaneous confidence intervals.
A multiple comparisons procedure is a method for controlling a Type I error rate other than the per comparison error rate.

The literature on multiple comparisons is vast, and despite the length of this chapter, we will only touch the highlights. I have seen quite a bit of nonsense regarding these methods, so I will try to set out rather carefully what the methods are doing. We begin with a discussion of Bonferroni-based methods for combining generic tests. Next we consider the Scheffé procedure, which is useful for contrasts suggested by data (data snooping). Then we turn our attention to pairwise comparisons, for which there are dozens of methods. Finally, we consider comparing treatments to a control or to the best response.
5.2 Bonferroni-Based Methods
The Bonferroni technique is the simplest, most widely applicable multiple comparisons procedure. The Bonferroni procedure works for a fixed set of K null hypotheses to test or parameters to estimate. Let $p_i$ be the p-value for testing $H_{0i}$. The Bonferroni procedure says to obtain simultaneous $1 - E$ confidence intervals by constructing individual confidence intervals with coverage $1 - E/K$, or to reject $H_{0i}$ (and thus $H_0$) if

$$p_i < E/K .$$

That is, simply run each test at level $E/K$. The testing version controls the strong familywise error rate, and the confidence intervals are simultaneous. The tests and/or intervals need not be independent, of the same type, or related in any way.
  Reject $H_{0(i)}$ if                                Method       Control
  $p_{(i)} < E/K$                                     Bonferroni   Simultaneous confidence intervals
  $p_{(j)} \le E/(K-j+1)$ for all $j = 1, \ldots, i$  Holm         Strong familywise error rate
  $p_{(j)} \le jE/K$ for some $j \ge i$               FDR          False discovery rate; needs independent tests

Display 5.1: Summary of Bonferroni-style methods for K comparisons.
The Holm procedure is a modification of Bonferroni that controls the strong familywise error rate, but does not produce simultaneous confidence intervals (Holm 1979). Let $p_{(1)}, \ldots, p_{(K)}$ be the p-values for the K tests sorted into increasing order, and let $H_{0(i)}$ be the null hypotheses sorted along with the p-values. Then reject $H_{0(i)}$ if

$$p_{(j)} \le E/(K - j + 1) \quad \text{for all } j = 1, \ldots, i .$$

Thus we start with the smallest p-value; if it is rejected we consider the next smallest, and so on. We stop when we reach the first nonsignificant p-value. This is a little more complicated, but we gain some power since only the smallest p-value is compared to $E/K$.
The FDR method of Benjamini and Hochberg (1995) controls the false discovery rate. Once again, sort the p-values and the hypotheses. For the FDR, start with the largest p-value and work down. Reject $H_{0(i)}$ if

$$p_{(j)} \le jE/K \quad \text{for some } j \ge i .$$

This procedure is correct when the tests are statistically independent. It controls the FDR, but not the strong familywise error rate.
The three Bonferroni methods are summarized in Display 5.1. Example 5.3 illustrates their use.
Example 5.3 Sensory characteristics of cottage cheeses

Table 5.1 shows the results of an experiment comparing the sensory characteristics of nonfat, 2% fat, and 4% fat cottage cheese (Michicich 1995). The table shows the characteristics grouped by type and p-values for testing the null hypothesis that there was no difference between the three cheeses in the various sensory characteristics. There are 21 characteristics in three groups of sizes 7, 6, and 8.
How do we do multiple comparisons here? First we need to know:

1. Which error rate is of interest?

2. If we do choose an error rate other than the per comparison error rate, what is the appropriate family of tests? Is it all 21 characteristics, or separately within each group of characteristics?

There is no automatic answer to either of these questions. The answers depend on the goals of the study, the tolerance of the investigator to Type I error, how the results of the study will be used, whether the investigator views the three groups of characteristics as distinct, and so on.
The last two columns of Table 5.1 give the results of the Bonferroni, Holm, and FDR procedures applied at the 5% level to all 21 comparisons and within each group. The p-values are compared to the criteria in Display 5.1 using K = 21 for the overall family and K of 7, 6, or 8 for by-group comparisons.

Consider the characteristic cheesy flavor with a .01 p-value. If we use the overall family, this is the tenth smallest p-value out of 21 p-values. The results are

Bonferroni: The critical value is .05/21 = .0024; not significant.
Holm: The critical value is .05/(21 - 10 + 1) = .0042; not significant.
FDR: The critical value is 10 × .05/21 = .024; significant.

If we use the flavor family, this is the fourth smallest p-value out of six p-values. Now the results are

Bonferroni: The critical value is .05/6 = .008; not significant.
Holm: The critical value is .05/(6 - 4 + 1) = .017 (and all smaller p-values meet their critical values); significant.
FDR: The critical value is 4 × .05/6 = .033; significant.
Table 5.1: Sensory attributes of three cottage cheeses: p-values, with 5% significant results overall and familywise by type of attribute marked for the Bonferroni (B), Holm (H), and FDR (F) methods.

  Characteristic       p-value   Overall   By group
  Appearance
    White              .004      F         B H F
    Yellow             .002      B H F     B H F
    Gray               .13
    Curd size          .29
    Size uniformity    .73
    Shape uniformity   .08
    Liquid/solid ratio .02       F         F
  Flavor
    Sour               .40
    Sweet              .24
    Cheesy             .01       F         H F
    Rancid             .0001     B H F     B H F
    Cardboard          .0001     B H F     B H F
    Storage            .001      B H F     B H F
  Texture
    Breakdown rate     .001      B H F     B H F
    Firm               .0001     B H F     B H F
    Sticky             .41
    Slippery           .07
    Heavy              .15
    Particle size      .42
    Runny              .002      B H F     B H F
    Rubbery            .006      F         B H F
These results illustrate that more null hypotheses are rejected considering each group of characteristics to be a family of tests rather than overall (the K is smaller for the individual groups), and fewer rejections are made using the more stringent error rates. Again, the choices of error rate and family of tests are not purely statistical, and controlling an error rate within a group of tests does not control that error rate for all tests.
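The three rules in Display 5.1 are short enough to implement directly. The following sketch is my own (the function names and structure are assumptions; only the rejection rules come from the display), and it reproduces the flavor-family results of Example 5.3.

def bonferroni(pvals, E=0.05):
    """Reject H_0i when p_i < E/K."""
    K = len(pvals)
    return [p < E / K for p in pvals]

def holm(pvals, E=0.05):
    """Step down from the smallest p-value; stop at the first failure."""
    K = len(pvals)
    order = sorted(range(K), key=lambda i: pvals[i])
    reject = [False] * K
    for j, i in enumerate(order):        # j = 0, 1, ... over sorted p-values
        if pvals[i] <= E / (K - j):      # this is E/(K - j + 1) with 1-based j
            reject[i] = True
        else:
            break
    return reject

def fdr(pvals, E=0.05):
    """Benjamini-Hochberg: find the largest j with p_(j) <= jE/K."""
    K = len(pvals)
    order = sorted(range(K), key=lambda i: pvals[i])
    reject = [False] * K
    cutoff = 0
    for j, i in enumerate(order, start=1):
        if pvals[i] <= j * E / K:
            cutoff = j
    for i in order[:cutoff]:
        reject[i] = True
    return reject

# Flavor family from Table 5.1: Sour, Sweet, Cheesy, Rancid, Cardboard, Storage
flavor = [0.40, 0.24, 0.01, 0.0001, 0.0001, 0.001]
print(bonferroni(flavor))  # cheesy (p = .01) not rejected
print(holm(flavor))        # cheesy rejected, as in Example 5.3
print(fdr(flavor))         # cheesy rejected, as in Example 5.3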
5.3 The Scheffé Method for All Contrasts
The Scheffé method is a multiple comparisons technique for contrasts that produces simultaneous confidence intervals for any and all contrasts, including contrasts suggested by the data. Thus Scheffé is the appropriate technique for assessing contrasts that result from data snooping. This sounds like the ultimate in error rate control: arbitrarily many comparisons, even ones suggested from the data! The downside of this amazing protection is low power. Thus we only use the Scheffé method in those situations where we have a contrast suggested by the data, or many, many contrasts that cannot be handled by other techniques. In addition, pairwise comparison contrasts $\bar{y}_i - \bar{y}_j$, even pairwise comparisons suggested by the data, are better handled by methods specifically designed for pairwise comparisons.
We begin with the Scheffé test of the null hypothesis $H_0: w(\{\mu_i\}) = 0$ against a two-sided alternative. The Scheffé test statistic is the ratio

$$\frac{SS_w/(g-1)}{MS_E} ;$$

we get a p-value as the area under an F-distribution with $g-1$ and $\nu$ degrees of freedom to the right of the test statistic. The degrees of freedom $\nu$ are from our denominator $MS_E$; $\nu = N - g$ for the completely randomized designs we have been considering so far. Reject the null hypothesis if this p-value is less than our Type I error rate E. In effect, the Scheffé procedure treats the mean square for any single contrast as if it were the full $g-1$ degrees of freedom between-groups mean square.
There is also a Scheffé t-test for contrasts. Suppose that we are testing the null hypothesis $H_0: w(\{\mu_i\}) = \delta$ against a two-sided alternative. The Scheffé t-test controls the Type I error rate at E by rejecting the null hypothesis when

$$\frac{|w(\{\bar{y}_i\}) - \delta|}{\sqrt{MS_E \sum_{i=1}^{g} w_i^2/n_i}} > \sqrt{(g-1) F_{E,g-1,\nu}} ,$$

where $F_{E,g-1,\nu}$ is the upper E percent point of an F-distribution with $g-1$ and $\nu$ degrees of freedom. Again, $\nu$ is the degrees of freedom for $MS_E$. For the usual null hypothesis value $\delta = 0$, this is equivalent to the ratio-of-mean-squares version given above.
We may also use the Scheffé approach to form simultaneous confidence intervals for any $w(\{\mu_i\})$:

$$w(\{\bar{y}_i\}) \pm \sqrt{(g-1) F_{E,g-1,\nu}} \sqrt{MS_E \sum_{i=1}^{g} w_i^2/n_i} .$$
These Scheffé intervals have simultaneous $1 - E$ coverage over any set of contrasts, including contrasts suggested by the data.
Example 5.4 Acid rain and birch seedlings, continued

Example 3.1 introduced an experiment in which birch seedlings were exposed to various levels of artificial acid rain. The following table gives some summaries for the data:

  pH       4.7    4.0    3.3    3.0    2.3
  weight   .337   .296   .320   .298   .177
  n        48     48     48     48     48

The $MS_E$ was .0119 with 235 degrees of freedom.

Inspection of the means shows that most of the response means are about .3, but the response for the pH 2.3 treatment is much lower. This suggests that a contrast comparing the pH 2.3 treatment with the mean of the other treatments would have a large value. The coefficients for this contrast are (.25, .25, .25, .25, -1). This contrast has value

$$\frac{.337 + .296 + .320 + .298}{4} - .177 = .1357$$
and standard error

$$\sqrt{.0119 \left( \frac{.0625}{48} + \frac{.0625}{48} + \frac{.0625}{48} + \frac{.0625}{48} + \frac{1}{48} \right)} = .0176 .$$
We must use the Scheffé procedure to construct a confidence interval or assess the significance of this contrast, because the contrast was suggested by the data. For a 99% confidence interval, the Scheffé multiplier is

$$\sqrt{4 \, F_{.01,4,235}} = 3.688 .$$

Thus the 99% confidence interval for this contrast is $.1357 - 3.688 \times .0176$ up to $.1357 + 3.688 \times .0176$, or (.0708, .2006). Alternatively, the t-statistic for testing the null hypothesis that the mean response in the last group is equal to the average of the mean responses in the other four groups is $.1357/.0176 = 7.71$. The Scheffé critical value for testing the null hypothesis at the $E = .001$ level is

$$\sqrt{(g-1) F_{E,g-1,N-g}} = \sqrt{4 \, F_{.001,4,235}} = \sqrt{4 \times 4.782} = 4.37 ,$$

so we can reject the null at the .001 level.
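For readers who want to check these numbers, here is a sketch of the Example 5.4 computations (my own code, not from the text; it assumes scipy for the F percent point):

import numpy as np
from scipy import stats

means = np.array([0.337, 0.296, 0.320, 0.298, 0.177])
n = np.full(5, 48)
w = np.array([0.25, 0.25, 0.25, 0.25, -1.0])   # contrast suggested by the data
MSE, g, nu = 0.0119, 5, 235                     # nu = N - g = 240 - 5

value = w @ means                               # .1357
se = np.sqrt(MSE * np.sum(w**2 / n))            # .0176
mult = np.sqrt((g - 1) * stats.f.ppf(0.99, g - 1, nu))  # Scheffe multiplier, about 3.69

print(value, se)
print(value - mult * se, value + mult * se)     # about (.071, .201)
print(value / se)                               # t-statistic, about 7.7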
Remember, it is not fair to hunt around through the data for a big contrast, test it, and think that you've only done one comparison.
5.4 Pairwise Comparisons
A pairwise comparison is a contrast that examines the difference between two treatment means $\bar{y}_i - \bar{y}_j$. For $g$ treatment groups, there are

$$\binom{g}{2} = \frac{g(g-1)}{2}$$

different pairwise comparisons. Pairwise comparisons procedures control a Type I error rate at E for all pairwise comparisons. If we data snoop, choose the biggest and smallest $\bar{y}_i$'s, and take the difference, we have not made just one comparison; rather, we have made all $g(g-1)/2$ pairwise comparisons and selected the largest. Controlling a Type I error rate for this greatest difference is one way to control the error rate for all differences.
As with many other inference problems, pairwise comparisons can be approached using confidence intervals or tests. That is, we may compute confidence intervals for the differences $\mu_i - \mu_j$ or $\alpha_i - \alpha_j$, or test the null hypotheses $H_{0ij}: \mu_i = \mu_j$ or $H_{0ij}: \alpha_i = \alpha_j$. Confidence regions for the differences of means are generally more informative than tests.
A pairwise comparisons procedure can generally be viewed as a critical value (or set of values) for the t-tests of the pairwise comparison contrasts. Thus we would reject the null hypothesis that $\mu_i - \mu_j = 0$ if

$$\frac{|\bar{y}_i - \bar{y}_j|}{\sqrt{MS_E (1/n_i + 1/n_j)}} > u ,$$

where $u$ is a critical value. Various pairwise comparisons procedures differ in how they define the critical value $u$, and $u$ may depend on several things, including E, the degrees of freedom for $MS_E$, the number of treatments, the number of treatments with means between $\bar{y}_i$ and $\bar{y}_j$, and the number of treatment comparisons with larger t-statistics.
An equivalent form of the test will reject if

$$|\bar{y}_i - \bar{y}_j| > u \sqrt{MS_E (1/n_i + 1/n_j)} = D_{ij} .$$
If all sample sizes are equal and the critical value $u$ is constant, then $D_{ij}$ will be the same for all $i, j$ pairs, and we would reject the null if any pair of treatments had mean responses that differed by $D$ or more. This quantity $D$ is called a significant difference; for example, using a Bonferroni adjustment to the $g(g-1)/2$ pairwise comparisons tests leads to a Bonferroni significant difference (BSD).
Confidence intervals for pairwise differences $\mu_i - \mu_j$ can be formed from the pairwise tests via

$$(\bar{y}_i - \bar{y}_j) \pm u \sqrt{MS_E (1/n_i + 1/n_j)} .$$
The remainder of this section presents methods for displaying the results of pairwise comparisons, introduces the Studentized range, discusses several pairwise comparisons methods, and then illustrates the methods with an example.
5.4.1 Displaying the results
Pairwise comparisons generate a lot of tests, so we need convenient and compact ways to present the results. An underline diagram is a graphical presentation of pairwise comparison results; construct the underline diagram in the following steps (a code sketch after the sample diagram below automates them).

1. Sort the treatment means into increasing order and write out treatment labels (numbers or names) along a horizontal axis. The $\bar{y}_i$ values may be added if desired.

2. Draw a line segment under a group of treatments if no pair of treatments in that group is significantly different. Do not include short lines that are implied by long lines. That is, if treatments 4, 5, and 6 are not significantly different, only use one line under all of them, not a line under 4 and 5, and a line under 5 and 6, and a line under 4, 5, and 6.
Here is a sample diagram for three treatments that we label A, B, and C:

  C      A      B
  ------------
         ------------

This diagram includes treatment labels, but not treatment means. From this summary we can see that C can be distinguished from B (there is no underline that covers both B and C), but A cannot be distinguished from either B or C (there are underlines under C and A, and under A and B).
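Here is a sketch automating those two steps for the common case of a single significant difference D shared by all pairs, as with the HSD below (my own construction; with a constant D and sorted means, a group is underlined exactly when its range is below D, and any stretch contained in a longer drawn stretch is skipped):

def underline_diagram(labels, means, D):
    order = sorted(range(len(means)), key=lambda i: means[i])
    labels = [labels[i] for i in order]
    means = [means[i] for i in order]
    print("  ".join(labels))
    last_end = -1
    for i in range(len(means)):
        # longest stretch starting at i whose range is below D
        j = i
        while j + 1 < len(means) and means[j + 1] - means[i] < D:
            j += 1
        if j > i and j > last_end:   # skip lines implied by longer lines
            print("  ".join("-" * len(lab) if i <= k <= j else " " * len(lab)
                            for k, lab in enumerate(labels)))
            last_end = j

# The cheese means of Example 5.5 below, with its HSD of 1.286:
underline_diagram(["C", "A", "B", "A&B"], [4.185, 4.43, 5.305, 6.3215], 1.286)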
Note that there can be some confusion after pairwise comparisons. You must not confuse "is not significantly different from" or "cannot be distinguished from" with "is equal to." Treatment mean A cannot be equal to treatment means B and C while treatment means B and C are not equal to each other; such a pattern can, however, hold for the results of significance tests.
There are also several nongraphical methods for displaying pairwise comparisons results. In one method, we sort the treatments into order of increasing means and print the treatment labels. Each treatment label is followed by one or more numbers (letters are sometimes used instead). Any treatments sharing a number (or letter) are not significantly different. Thus treatments sharing common numbers or letters are analogous to treatments being connected by an underline. The grouping letters are often put in parentheses or set as sub- or superscripts. The results in our sample underline diagram might thus be presented as one of the following:

  C (1)   A (12)   B (2)        C (a)   A (ab)   B (b)
  C_1     A_12     B_2          C_a     A_ab     B_b
There are several other variations on this theme.
A third way to present pairwise comparisons is as a table, with treatments labeling both rows and columns. Table elements can flag significant differences or contain confidence intervals for the differences. Only entries above or below the diagonal of the table are needed.
5.4.2 The Studentized range
The range of a set is the maximum value minus the minimum value, and Studentization means dividing a statistic by an estimate of its standard error. Thus the Studentized range for a set of treatment means is

$$\frac{\max_i \bar{y}_i - \min_j \bar{y}_j}{\sqrt{MS_E/n}} .$$

Note that we have implicitly assumed that all the sample sizes $n_i$ are the same.
If all the treatments have the same mean, that is, if $H_0$ is true, then the Studentized range statistic follows the Studentized range distribution. Large values of the Studentized range are less likely under $H_0$ and more likely under the alternative when the means are not all equal, so we may use the Studentized range as a test statistic for $H_0$, rejecting $H_0$ when the Studentized range statistic is sufficiently large. This Studentized range test is a legitimate alternative to the ANOVA F-test.

The Studentized range distribution is important for pairwise comparisons because it is the distribution of the biggest (scaled) difference between treatment means when the null hypothesis is true. We will use it as a building block in several pairwise comparisons methods.
The Studentized range distribution depends only on $g$ and $\nu$, the number of groups and the degrees of freedom for the error estimate $MS_E$. The quantity $q_E(g, \nu)$ is the upper E percent point of the Studentized range distribution for $g$ groups and $\nu$ error degrees of freedom; it is tabulated in Appendix Table D.8.
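The percent points need not come from a table; for example, a brief sketch assuming scipy 1.7 or later, which added studentized_range:

from scipy.stats import studentized_range

q = studentized_range.ppf(1 - 0.1, 4, 4)   # q_.1(4, 4): g = 4 groups, nu = 4
print(q)                                   # about 4.59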
5.4.3 Simultaneous confidence intervals

The Tukey honest significant difference (HSD) is a pairwise comparisons technique that uses the Studentized range distribution to construct simultaneous confidence intervals for differences of all pairs of means. If we reject the null hypothesis $H_{0ij}$ when the (simultaneous) confidence interval for $\mu_i - \mu_j$ does not include 0, then the HSD also controls the strong familywise error rate.

The HSD uses the critical value

$$u(E, \nu, g) = \frac{q_E(g, \nu)}{\sqrt{2}} ,$$
leading to

$$HSD = \frac{q_E(g, \nu)}{\sqrt{2}} \sqrt{MS_E \left( \frac{1}{n} + \frac{1}{n} \right)} = q_E(g, \nu) \sqrt{\frac{MS_E}{n}} .$$
Form simultaneous $1 - E$ confidence intervals via

$$\bar{y}_i - \bar{y}_j \pm \frac{q_E(g, \nu)}{\sqrt{2}} \sqrt{MS_E \left( \frac{1}{n} + \frac{1}{n} \right)} .$$

The degrees of freedom $\nu$ are the degrees of freedom for the error estimate $MS_E$.
Strictly speaking, the HSD is only applicable to the equal sample size situation. For the unequal sample size case, the approximate HSD is

$$HSD_{ij} = q_E(g, \nu) \sqrt{MS_E \, \frac{1}{2 n_i n_j/(n_i + n_j)}}$$
Table 5.2: Total free amino acids in cheeses after 168 days of ripening.

           Strain added
  None     A        B        A&B
  4.195    4.125    4.865    6.155
  4.175    4.735    5.745    6.488
or, equivalently,

$$HSD_{ij} = \frac{q_E(g, \nu)}{\sqrt{2}} \sqrt{MS_E \left( \frac{1}{n_i} + \frac{1}{n_j} \right)} .$$

This approximate HSD, often called the Tukey-Kramer form, tends to be slightly conservative (that is, the true error rate is slightly less than E).
The Bonferroni significant difference (BSD) is simply the application of the Bonferroni technique to the pairwise comparisons problem to obtain

$$u = u(E, \nu, K) = t_{E/(2K), \nu} ,$$
$$BSD_{ij} = t_{E/(2K), \nu} \sqrt{MS_E (1/n_i + 1/n_j)} ,$$

where K is the number of pairwise comparisons. We have $K = g(g-1)/2$ for all pairwise comparisons between $g$ groups. BSD produces simultaneous confidence intervals and controls the strong familywise error rate.
When making all pairwise comparisons, the HSD is less than the BSD. Thus we prefer the HSD to the BSD for all pairwise comparisons, because the HSD will produce shorter confidence intervals that are still simultaneous. When only a preplanned subset of all the pairs is being considered, the BSD may be less than and thus preferable to the HSD.
Example 5.5 Free amino acids in cheese

Cheese is produced by bacterial fermentation of milk. Some bacteria in cheese are added by the cheese producer. Other bacteria are present but were not added deliberately; these are called nonstarter bacteria. Nonstarter bacteria vary from facility to facility and are believed to influence the quality of cheese.

Two strains (A and B) of nonstarter bacteria were isolated at a premium cheese facility. These strains will be added experimentally to cheese to determine their effects. Eight cheeses are made. These cheeses all get a standard starter bacteria. In addition, two cheeses will be randomly selected for each of the following four treatments: control, add strain A, add strain B, or add both strains A and B. Table 5.2 gives the total free amino acids in the cheeses after 168 days of ripening. (Free amino acids are thought to contribute to flavor.)
Listing 5.1 gives Minitab output showing an Analysis of Variance for these data (x), as well as HSD comparisons (called "Tukey's pairwise comparisons") using E = .1 (y); we use the $MS_E$ from this ANOVA in constructing the HSD. HSD is appropriate if we want simultaneous confidence intervals on the pairwise differences. The HSD is

$$\frac{q_E(g, \nu)}{\sqrt{2}} \sqrt{MS_E \left( \frac{1}{n_i} + \frac{1}{n_j} \right)} = \frac{q_{.1}(4, 4)}{\sqrt{2}} \sqrt{.1572 \left( \frac{1}{2} + \frac{1}{2} \right)} = 4.586 \times .3965 / 1.414 = 1.286 .$$

We form confidence intervals as the observed difference in treatment means, plus or minus 1.286; so for A&B minus control, we have

$$6.322 - 4.185 \pm 1.286 , \quad \text{or} \quad (.851, 3.423) .$$

In fact, only two confidence intervals for pairwise differences do not include zero (see Listing 5.1 (y)). The underline diagram is:

  C      A      B      A&B
  4.19   4.43   5.31   6.32
  ---------------
                ------------

Note in Listing 5.1 (y) that Minitab displays pairwise comparisons as a table of confidence intervals for differences.
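A sketch of the HSD arithmetic (my own code, not from the text, reusing the Table 5.2 data; values should match the text up to rounding):

import numpy as np
from scipy.stats import studentized_range

data = {"control": [4.195, 4.175], "A": [4.125, 4.735],
        "B": [4.865, 5.745], "A&B": [6.155, 6.488]}
g, n = 4, 2
N = g * n
means = {t: np.mean(v) for t, v in data.items()}
MSE = sum(sum((x - means[t])**2 for x in v) for t, v in data.items()) / (N - g)

q = studentized_range.ppf(0.90, g, N - g)          # q_.1(4, 4), about 4.59
hsd = q / np.sqrt(2) * np.sqrt(MSE * (1/n + 1/n))  # about 1.286

diff = means["A&B"] - means["control"]             # 2.137
print(hsd)
print(diff - hsd, diff + hsd)                      # about (.85, 3.42)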
5.4.4 Strong familywise error rate
A step-down method is a procedure for organizing pairwise comparisons starting with the most extreme pair and then working in. Relabel the groups so that the sample means are in increasing order, with $\bar{y}_{(1)}$ smallest and $\bar{y}_{(g)}$ largest. (The relabeled estimated effects $\hat{\alpha}_{(i)}$ will also be in increasing order, but the relabeled true effects $\alpha_{[i]}$ may or may not be in increasing order.) With this ordering, $\bar{y}_{(1)}$ to $\bar{y}_{(g)}$ is a stretch of $g$ means, $\bar{y}_{(1)}$ to $\bar{y}_{(g-1)}$ is a stretch of $g-1$ means, and $\bar{y}_{(i)}$ to $\bar{y}_{(j)}$ is a stretch of $j - i + 1$ means. In a step-down procedure, all comparisons for stretches of $k$ means use the same critical value, but we may use different critical values for different $k$.
Listing 5.1: Minitab output for free amino acids in cheese.
Source DF SS MS F P (x)
Trt 3 5.628 1.876 11.93 0.018
Error 4 0.629 0.157
Total 7 6.257
Individual 95% CIs For Mean
Based on Pooled StDev
Level N Mean StDev ------+---------+---------+---------+
A 2 4.4300 0.4313 (------*-------)
A+B 2 6.3215 0.2355 (-------*-------)
B 2 5.3050 0.6223 (-------*-------)
control 2 4.1850 0.0141 (-------*-------)
------+---------+---------+---------+
Pooled StDev = 0.3965 4.0 5.0 6.0 7.0
Tukey's pairwise comparisons (y)
Family error rate = 0.100
Individual error rate = 0.0315
Critical value = 4.59
Intervals for (column level mean) - (row level mean)
A A+B B
A+B -3.1784
-0.6046
B -2.1619 -0.2704
0.4119 2.3034
control -1.0419 0.8496 -0.1669
1.5319 3.4234 2.4069
Fisher's pairwise comparisons (z)
Family error rate = 0.283
Individual error rate = 0.100
Critical value = 2.132
Intervals for (column level mean) - (row level mean)
A A+B B
A+B -2.7369
-1.0461
B -1.7204 0.1711
-0.0296 1.8619
control -0.6004 1.2911 0.2746
1.0904 2.9819 1.9654
This has the advantage that we can use larger critical values for long stretches and smaller critical values for short stretches.

Begin with the most extreme pair (1) and (g). Test the null hypothesis that all the means for (1) up through (g) are equal. If you fail to reject, declare all means equal and stop. If you reject, declare (1) different from (g) and go on to the next step. At the next step, we consider the stretches (1) through (g-1) and (2) through (g). If one of these rejects, we declare its ends to be different and then look at shorter stretches within it. If we fail to reject for a stretch, we do not consider any substretches within the stretch. We repeat this subdivision till there are no more rejections. In other words, we declare that means (i) and (j) are different if the stretch from (i) to (j) rejects its null hypothesis and all stretches containing (i) to (j) also reject their null hypotheses.
The REGWR procedure is a step-down range method that controls the strong familywise error rate without producing simultaneous confidence intervals. The awkward name REGWR abbreviates the Ryan-Einot-Gabriel-Welsch range test, named for the authors who worked on it. The REGWR critical value for testing a stretch of length $k$ depends on E, $\nu$, $k$, and $g$. Specifically, we use

$$u = u(E, \nu, k, g) = q_E(k, \nu)/\sqrt{2} , \qquad k = g, g-1,$$

and

$$u = u(E, \nu, k, g) = q_{kE/g}(k, \nu)/\sqrt{2} , \qquad k = g-2, g-3, \ldots, 2.$$

This critical value derives from a Studentized range with $k$ groups, and we use percent points with smaller tail areas as we move in to smaller stretches. As with the HSD, REGWR error rate control is approximate when the sample sizes are not equal.
Example 5.6 Free amino acids in cheese, continued
Suppose that we only wished to control the strong familywise error rate instead of producing simultaneous confidence intervals. Then we could use REGWR instead of HSD and could potentially see additional significant differences. Listing 5.2 (y) gives SAS output for REGWR (called REGWQ in SAS) for the amino acid data.

REGWR is a step-down method that begins like the HSD. Comparing C and A&B, we conclude as in the HSD that they are different. We may now compare C with B and A with A&B. These are comparisons that involve
Listing 5.2: SAS output for free amino acids in cheese.
Student-Newman-Keuls test for variable: FAA (x)
Alpha= 0.1 df= 4 MSE= 0.157224
Number of Means 2 3 4
Critical Range 0.84531 1.1146718 1.2859073
Means with the same letter are not significantly different.
SNK Grouping Mean N TRT
A 6.3215 2 4
B 5.3050 2 3
C 4.4300 2 2
C
C 4.1850 2 1
Ryan-Einot-Gabriel-Welsch Multiple Range Test for variable: FAA (y)
Alpha= 0.1 df= 4 MSE= 0.157224
Number of Means 2 3 4
Critical Range 1.0908529 1.1146718 1.2859073
Means with the same letter are not significantly different.
REGWQ Grouping Mean N TRT
A 6.3215 2 4
A
B A 5.3050 2 3
B
B C 4.4300 2 2
C
C 4.1850 2 1
stretches of $k = 3$ means; since $k = g - 1$, we still use E as the error rate. The significant difference for these comparisons is

$$\frac{q_E(k, \nu)}{\sqrt{2}} \sqrt{MS_E \left( \frac{1}{n_i} + \frac{1}{n_j} \right)} = \frac{q_{.1}(3, 4)}{\sqrt{2}} \sqrt{.1572 \left( \frac{1}{2} + \frac{1}{2} \right)} = 1.115 .$$
Both the B-C and A&B-A differences (1.12 and 1.89) exceed this cutoff, so REGWR concludes that B differs from C, and A differs from A&B. Recall that the HSD did not distinguish C from B.

Having concluded that there are B-C and A&B-A differences, we can now compare stretches of means within them, namely C to A, A to B, and B to A&B. These are stretches of $k = 2$ means, so for REGWR we use the error rate $kE/g = .05$. The significant difference for these comparisons is

$$\frac{q_{kE/g}(k, \nu)}{\sqrt{2}} \sqrt{MS_E \left( \frac{1}{n_i} + \frac{1}{n_j} \right)} = \frac{q_{.05}(2, 4)}{\sqrt{2}} \sqrt{.1572 \left( \frac{1}{2} + \frac{1}{2} \right)} = 1.101 .$$

None of the three differences exceeds this cutoff, so we fail to conclude that those treatments differ, and we finish. The underline diagram is:

  C      A      B      A&B
  4.19   4.43   5.31   6.32
  ------------
         ------------
                ------------

Note in Listing 5.2 (y) that SAS displays pairwise comparisons using what amounts to an underline diagram turned on its side, with vertical lines formed by letters.
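The step-down bookkeeping is the only subtle part, so here is a sketch (my own implementation, not the text's; it follows the text's kE/g rule for short stretches, whereas SAS's REGWQ uses a slightly different Ryan adjustment, which is why SAS shows 1.0909 where the text computes 1.101):

import numpy as np
from scipy.stats import studentized_range

def regwr(labels, means, MSE, n, nu, E=0.1):
    """Means must already be sorted into increasing order."""
    g = len(means)

    def crit(k):
        # Stretch of k means: level E for k = g, g-1, and kE/g below that.
        # (SNK would use alpha = E for every k.)
        alpha = E if k >= g - 1 else k * E / g
        q = studentized_range.ppf(1 - alpha, k, nu)
        return q / np.sqrt(2) * np.sqrt(MSE * (2 / n))

    # sig[(i, j)] is True when stretch (i)..(j) rejects AND every stretch
    # containing it rejects; work from the longest stretches down.
    sig = {}
    for k in range(g, 1, -1):
        for i in range(g - k + 1):
            j = i + k - 1
            rejects = means[j] - means[i] > crit(k)
            parents = [p for p in ((i - 1, j), (i, j + 1)) if p in sig]
            sig[(i, j)] = rejects and all(sig[p] for p in parents)
    return {(labels[i], labels[j]) for (i, j), s in sig.items() if s}

print(regwr(["C", "A", "B", "A&B"], [4.185, 4.430, 5.305, 6.3215],
            MSE=0.1572, n=2, nu=4))
# {('C', 'A&B'), ('C', 'B'), ('A', 'A&B')}, matching Example 5.6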
5.4.5 False discovery rate
The Student-Newman-Keuls (SNK) procedure is a step-down method that uses the Studentized range test with critical value

$$u = u(E, \nu, k, g) = q_E(k, \nu)/\sqrt{2}$$

for a stretch of $k$ means. This is similar to REGWR, except that we keep the percent point of the Studentized range constant as we go to shorter stretches. The SNK controls the false discovery rate, but not the strong familywise error rate. As with the HSD, SNK error rate control is approximate when the sample sizes are not equal.
Example 5.7 Free amino acids in cheese, continued
Suppose that we only wished to control the false discovery rate; now we would use SNK instead of the more stringent HSD or REGWR. Listing 5.2 (x) gives SAS output for SNK for the amino acid data.

SNK is identical to REGWR in the first two stages, so SNK will also get to the point of making the comparisons of the three pairs C to A, A to B, and B to A&B. However, the SNK significant difference for these pairs is less than that used in REGWR:

$$\frac{q_E(k, \nu)}{\sqrt{2}} \sqrt{MS_E \left( \frac{1}{n_i} + \frac{1}{n_j} \right)} = \frac{q_{.1}(2, 4)}{\sqrt{2}} \sqrt{.1572 \left( \frac{1}{2} + \frac{1}{2} \right)} = .845 .$$

Both the A-B and B-A&B differences (.88 and 1.02) exceed the cutoff, but the A-C difference (.25) does not. The underline diagram for SNK is:

  C      A      B      A&B
  4.19   4.43   5.31   6.32
  ------------
5.4.6 Experimentwise error rate
The Analysis of Variance F-test for equality of means controls the experimentwise error rate. Thus investigating pairwise differences only when the F-test has a p-value less than E will control the experimentwise error rate. This is the basis for the protected least significant difference, or protected LSD. If the F-test rejects at level E, then do simple t-tests at level E among the different treatments.

The critical values are from a t-distribution:

$$u(E, \nu) = t_{E/2, \nu} ,$$

leading to the significant difference

$$LSD = t_{E/2, \nu} \sqrt{MS_E (1/n_i + 1/n_j)} .$$
As usual, $\nu$ is the degrees of freedom for $MS_E$, and $t_{E/2, \nu}$ is the upper $E/2$ percent point of a t-curve with $\nu$ degrees of freedom.

Confidence intervals produced from the protected LSD do not have the anticipated $1 - E$ coverage rate, either individually or simultaneously. See Section 5.7.
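A protected-LSD sketch (my own code, not from the text; it gates the pairwise tests behind scipy's f_oneway F-test and reuses the cheese data):

import numpy as np
from scipy import stats
from itertools import combinations

data = {"control": [4.195, 4.175], "A": [4.125, 4.735],
        "B": [4.865, 5.745], "A&B": [6.155, 6.488]}
E, g, n = 0.1, 4, 2
nu = g * (n - 1)                                   # error df = N - g = 4

F, p = stats.f_oneway(*data.values())
if p < E:                                          # the F-test protects the t-tests
    means = {t: np.mean(v) for t, v in data.items()}
    MSE = sum(np.var(v, ddof=1) * (n - 1) for v in data.values()) / nu
    lsd = stats.t.ppf(1 - E / 2, nu) * np.sqrt(MSE * (1/n + 1/n))  # .845
    for a, b in combinations(data, 2):
        if abs(means[a] - means[b]) > lsd:
            print(a, "differs from", b)            # all pairs except control-A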
Example 5.8 Free amino acids in cheese, continued

Finally, suppose that we only wish to control the experimentwise error rate. Protected LSD will work here. Listing 5.1 (x) shows that the ANOVA F-test is significant at level E, so we may proceed with pairwise comparisons. Listing 5.1 (z) shows Minitab output for the LSD (called "Fisher's pairwise comparisons") as confidence intervals.

LSD uses the same significant difference for all pairs:

$$t_{E/2, \nu} \sqrt{MS_E \left( \frac{1}{n_i} + \frac{1}{n_j} \right)} = t_{.05, 4} \sqrt{.1572 \left( \frac{1}{2} + \frac{1}{2} \right)} = .845 .$$

This is the same as the SNK comparison for a stretch of length 2. All differences except A-C exceed the cutoff, so the underline diagram for LSD is:

  C      A      B      A&B
  4.19   4.43   5.31   6.32
  ------------
5.4.7 Comparisonwise error rate
Ordinary t-tests and confidence intervals without any adjustment control the comparisonwise error rate. In the context of pairwise comparisons, this is called the least significant difference (LSD) method.

The critical values are the same as for the protected LSD:

$$u(E, \nu) = t_{E/2, \nu} ,$$

and

$$LSD = t_{E/2, \nu} \sqrt{MS_E (1/n_i + 1/n_j)} .$$
5.4.8 Pairwise testing reprise
It is easy to get overwhelmed by the abundance of methods, and there are still more that we haven't discussed. Your anchor in all this is your error rate. Once you have determined your error rate, the choice of method is reasonably automatic, as summarized in Display 5.2. Your choice of error rate is determined by the needs of your study, bearing in mind that the more stringent error rates have fewer false rejections, and also fewer correct rejections.
5.4.9 Pairwise comparisons methods that do not control combined
Type I error rates
There are many other pairwise comparisons methods beyond those already mentioned. In this section we discuss two methods that are motivated by completely different criteria than controlling a combined Type I error rate. These two techniques do not control the experimentwise error rate or any of the more stringent error rates, and you should not use them with the expectation that they do. You should only use them when the situation and assumptions under which they were developed are appropriate for your experimental analysis.

  Error rate                          Method
  Simultaneous confidence intervals   BSD or HSD
  Strong familywise                   REGWR
  False discovery rate                SNK
  Experimentwise                      Protected LSD
  Comparisonwise                      LSD

Display 5.2: Summary of pairwise comparison methods.
Suppose that you believe a priori that the overall null hypothesis $H_0$ is less and less likely to be true as the number of treatments increases. Then the strength of evidence required to reject $H_0$ should decrease as the number of groups increases. Alternatively, suppose that there is a quantifiable penalty for each incorrect (pairwise comparison) decision we make, and that the total loss for the overall test is the sum of the losses from the individual decisions. Under either of these assumptions, the Duncan multiple range (given below) or something like it is appropriate. Note by comparison that the procedures that control combined Type I error rates require more evidence to reject $H_0$ as the number of groups increases, while Duncan's method requires less. Also, a procedure that controls the experimentwise error rate has a penalty of 1 if there are any rejections when $H_0$ is true and a penalty of 0 otherwise; this is very different from the summed loss that leads to Duncan's multiple range.
Duncan's multiple range (sometimes called Duncan's test or Duncan's new multiple range) is a step-down Studentized range method. You specify a protection level E and proceed in step-down fashion using

$$u = u(E, \nu, k, g) = q_{1-(1-E)^{k-1}}(k, \nu)/\sqrt{2}$$

for the critical values. Notice that E is the comparisonwise error rate for testing a stretch of length 2, and the experimentwise error rate will be $1 - (1-E)^{g-1}$, which can be considerably more than E; for example, with $g = 10$ and $E = .05$, the experimentwise rate is $1 - .95^9 \approx .37$. Thus fixing Duncan's protection level at E does not control the experimentwise error rate or any more stringent rate. Do not use Duncan's procedure if you are interested in controlling any of the combined Type I error rates.
As a second alternative to combined Type I error rates, suppose that our interest is in predicting future observations from the treatment groups, and that we would like to have a prediction method that makes the average squared prediction error small. One way to do this prediction is to first partition the $g$ treatments into $p$ classes, $1 \le p \le g$; second, find the average response in each of these $p$ classes; and third, predict a future observation from a treatment by the observed mean response of the class for the treatment. We thus look for partitions that will lead to good predictions.
One way to choose among the partitions is to use Mallows' $C_p$ statistic:

$$C_p = \frac{SSR_p}{MS_E} + 2p - N ,$$

where $SSR_p$ is the sum of squared errors for the Analysis of Variance partitioning the data into $p$ groups. Partitions with low values of $C_p$ should give better predictions.
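A sketch of this search (my own construction, not from the text, reusing the Table 5.2 cheese data; the partition generator and names are assumptions):

import numpy as np

data = {"C": [4.195, 4.175], "A": [4.125, 4.735],
        "B": [4.865, 5.745], "A&B": [6.155, 6.488]}
N = 8
MSE = 0.1572                                  # from the full 4-group ANOVA

def partitions(items):
    """Generate all set partitions of a list."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i+1:]
        yield [[first]] + part

def cp(part):
    ssr = 0.0
    for cls in part:                          # pool the data within each class
        pooled = np.concatenate([data[t] for t in cls])
        ssr += np.sum((pooled - pooled.mean())**2)
    return ssr / MSE + 2 * len(part) - N

# The lowest C_p pools C with A, consistent with the pairwise comparisons.
for part in sorted(partitions(list(data)), key=cp)[:3]:
    print(round(cp(part), 2), part)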
This predictive approach makes no attempt to control any Type I error rate; in fact, the Type I error rate is .15 or greater even for $g = 2$ groups! This approach is useful when prediction is the goal, but can be quite misleading if interpreted as a test of $H_0$.
5.4.10 Confident directions
In our heart of hearts, we often believe that all treatment means differ when examined sufficiently precisely. Thus our concern with null hypotheses $H_{0ij}$ is misplaced. As an alternative, we can make statements of direction. After having collected data, we consider $\mu_i$ and $\mu_j$; assume $\mu_i < \mu_j$. We could decide from the data that $\mu_i < \mu_j$, or that $\mu_i > \mu_j$, or that we don't know, that is, we don't have enough information to decide. These decisions correspond in the testing paradigm to rejecting $H_{0ij}$ in favor of $\mu_i < \mu_j$, rejecting $H_{0ij}$ in favor of $\mu_j < \mu_i$, and failing to reject $H_{0ij}$. In the confident directions framework, only the decision $\mu_i > \mu_j$ is an error. See Tukey (1991).

Confident directions procedures are pairwise comparisons testing procedures, but with results interpreted in a directional context. Confident directions procedures bound error rates when making statements about direction.
Condent directions procedures are pairwise comparisons testing proce-
dures, but with results interpreted in a directional context. Condent direc-
tions procedures bound error rates when making statements about direction.
If a testing procedure bounds an error rate at E, then the corresponding confident directions procedure bounds a confident directions error rate at $E/2$, the factor of 2 arising because we cannot falsely reject in the correct direction.

Let us reinterpret our usual error rates in terms of directions. Suppose that we use a pairwise comparisons procedure with error rate bounded at E. In a confident directions setting, we have the following:

Strong familywise: The probability of making any incorrect statements of direction is bounded by $E/2$.

FDR: Incorrect statements of direction will on average be no more than a fraction $E/2$ of the total number of statements of direction.

Experimentwise: The probability of making any incorrect statements of direction when all the means are very nearly equal is bounded by $E/2$.

Comparisonwise: The probability of making an incorrect statement of direction for a given comparison is bounded by $E/2$.

There is no directional analog of simultaneous confidence intervals, so procedures that produce simultaneous intervals should be considered procedures that control the strong familywise error rate (which they do).
5.5 Comparison with Control or the Best
There are some situations where we do not do all pairwise comparisons, but rather make comparisons between a control and the other treatments, or between the best responding treatment (highest or lowest average) and the other treatments. For example, you may be producing new standardized mathematics tests for elementary school children, and you need to compare the new