
A First Course in
Design and Analysis
of Experiments
Gary W. Oehlert
University of Minnesota
Cover design by Victoria Tomaselli
Cover illustration by Peter Hamlin
Minitab is a registered trademark of Minitab, Inc.
SAS is a registered trademark of SAS Institute, Inc.
S-Plus is a registered trademark of Mathsoft, Inc.
Design-Expert is a registered trademark of Stat-Ease, Inc.
Library of Congress Cataloging-in-Publication Data
Oehlert, Gary W.
A first course in design and analysis of experiments / Gary W. Oehlert.
p. cm.
Includes bibliographical references and index.
ISBN 0-7167-3510-5
1. Experimental design. I. Title.
QA279.O34 2000
519.5–dc21 99-059934
Copyright © 2010 Gary W. Oehlert. All rights reserved.
This work is licensed under a Creative Commons license. Briefly, you are free to
copy, distribute, and transmit this work provided the following conditions are met:
1. You must properly attribute the work.
2. You may not use this work for commercial purposes.
3. You may not alter, transform, or build upon this work.
A complete description of the license may be found at
http://creativecommons.org/licenses/by-nc-nd/3.0/.
For Becky
who helped me all the way through
and for Christie and Erica
who put up with a lot while it was getting done
Contents
Preface xvii
1 Introduction 1
1.1 Why Experiment? . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Components of an Experiment . . . . . . . . . . . . . . . 4
1.3 Terms and Concepts . . . . . . . . . . . . . . . . . . . . . 5
1.4 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.5 More About Experimental Units . . . . . . . . . . . . . . 8
1.6 More About Responses . . . . . . . . . . . . . . . . . . . 10
2 Randomization and Design 13
2.1 Randomization Against Confounding . . . . . . . . . . . 14
2.2 Randomizing Other Things . . . . . . . . . . . . . . . . . 16
2.3 Performing a Randomization . . . . . . . . . . . . . . . . 17
2.4 Randomization for Inference . . . . . . . . . . . . . . . . 19
2.4.1 The paired t-test . . . . . . . . . . . . . . . . . . 20
2.4.2 Two-sample t-test . . . . . . . . . . . . . . . . . 25
2.4.3 Randomization inference and standard inference . 26
2.5 Further Reading and Extensions . . . . . . . . . . . . . . 27
2.6 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3 Completely Randomized Designs 31
3.1 Structure of a CRD . . . . . . . . . . . . . . . . . . . . . 31
3.2 Preliminary Exploratory Analysis . . . . . . . . . . . . . 33
3.3 Models and Parameters . . . . . . . . . . . . . . . . . . . 34
3.4 Estimating Parameters . . . . . . . . . . . . . . . . . . . 39
3.5 Comparing Models: The Analysis of Variance . . . . . . . 44
3.6 Mechanics of ANOVA . . . . . . . . . . . . . . . . . . . 45
3.7 Why ANOVA Works . . . . . . . . . . . . . . . . . . . . 52
3.8 Back to Model Comparison . . . . . . . . . . . . . . . . . 52
3.9 Side-by-Side Plots . . . . . . . . . . . . . . . . . . . . . 54
3.10 Dose-Response Modeling . . . . . . . . . . . . . . . . . . 55
3.11 Further Reading and Extensions . . . . . . . . . . . . . . 58
3.12 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4 Looking for Specific Differences: Contrasts 65
4.1 Contrast Basics . . . . . . . . . . . . . . . . . . . . . . . 65
4.2 Inference for Contrasts . . . . . . . . . . . . . . . . . . . 68
4.3 Orthogonal Contrasts . . . . . . . . . . . . . . . . . . . . 71
4.4 Polynomial Contrasts . . . . . . . . . . . . . . . . . . . . 73
4.5 Further Reading and Extensions . . . . . . . . . . . . . . 75
4.6 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5 Multiple Comparisons 77
5.1 Error Rates . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.2 Bonferroni-Based Methods . . . . . . . . . . . . . . . . . 81
5.3 The Scheffé Method for All Contrasts . . . . . . . . . 85
5.4 Pairwise Comparisons . . . . . . . . . . . . . . . . . . . . 87
5.4.1 Displaying the results . . . . . . . . . . . . . . . 88
5.4.2 The Studentized range . . . . . . . . . . . . . . . 89
5.4.3 Simultaneous confidence intervals . . . . . . . . . 90
5.4.4 Strong familywise error rate . . . . . . . . . . . . 92
5.4.5 False discovery rate . . . . . . . . . . . . . . . . 96
5.4.6 Experimentwise error rate . . . . . . . . . . . . . 97
5.4.7 Comparisonwise error rate . . . . . . . . . . . . . 98
5.4.8 Pairwise testing reprise . . . . . . . . . . . . . . 98
5.4.9 Pairwise comparisons methods that do not control
combined Type I error rates . . . . . . . . . . . . 98
5.4.10 Confident directions . . . . . . . . . . . . . . . . 100
5.5 Comparison with Control or the Best . . . . . . . . . . . . 101
5.5.1 Comparison with a control . . . . . . . . . . . . . 101
5.5.2 Comparison with the best . . . . . . . . . . . . . 104
5.6 Reality Check on Coverage Rates . . . . . . . . . . . . . 105
5.7 A Warning About Conditioning . . . . . . . . . . . . . . . 106
5.8 Some Controversy . . . . . . . . . . . . . . . . . . . . . . 106
5.9 Further Reading and Extensions . . . . . . . . . . . . . . 107
5.10 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 108
6 Checking Assumptions 111
6.1 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . 111
6.2 Transformations . . . . . . . . . . . . . . . . . . . . . . . 113
6.3 Assessing Violations of Assumptions . . . . . . . . . . . . 114
6.3.1 Assessing nonnormality . . . . . . . . . . . . . . 115
6.3.2 Assessing nonconstant variance . . . . . . . . . . 118
6.3.3 Assessing dependence . . . . . . . . . . . . . . . 120
6.4 Fixing Problems . . . . . . . . . . . . . . . . . . . . . . . 124
6.4.1 Accommodating nonnormality . . . . . . . . . . 124
6.4.2 Accommodating nonconstant variance . . . . . . 126
6.4.3 Accommodating dependence . . . . . . . . . . . 133
6.5 Effects of Incorrect Assumptions . . . . . . . . . . . . . . 134
6.5.1 Effects of nonnormality . . . . . . . . . . . . . . 134
6.5.2 Effects of nonconstant variance . . . . . . . . . . 136
6.5.3 Effects of dependence . . . . . . . . . . . . . . . 138
6.6 Implications for Design . . . . . . . . . . . . . . . . . . . 140
6.7 Further Reading and Extensions . . . . . . . . . . . . . . 141
6.8 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 143
7 Power and Sample Size 149
7.1 Approaches to Sample Size Selection . . . . . . . . . . . 149
7.2 Sample Size for Confidence Intervals . . . . . . . . . . . . 151
7.3 Power and Sample Size for ANOVA . . . . . . . . . . . . 153
7.4 Power and Sample Size for a Contrast . . . . . . . . . . . 158
7.5 More about Units and Measurement Units . . . . . . . . . 158
7.6 Allocation of Units for Two Special Cases . . . . . . . . . 160
7.7 Further Reading and Extensions . . . . . . . . . . . . . . 161
7.8 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 162
8 Factorial Treatment Structure 165
8.1 Factorial Structure . . . . . . . . . . . . . . . . . . . . . . 165
8.2 Factorial Analysis: Main Effect and Interaction . . . . . . 167
8.3 Advantages of Factorials . . . . . . . . . . . . . . . . . . 170
8.4 Visualizing Interaction . . . . . . . . . . . . . . . . . . . 171
8.5 Models with Parameters . . . . . . . . . . . . . . . . . . . 175
8.6 The Analysis of Variance for Balanced Factorials . . . . . 179
8.7 General Factorial Models . . . . . . . . . . . . . . . . . . 182
8.8 Assumptions and Transformations . . . . . . . . . . . . . 185
8.9 Single Replicates . . . . . . . . . . . . . . . . . . . . . . 186
8.10 Pooling Terms into Error . . . . . . . . . . . . . . . . . . 191
8.11 Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . 192
8.12 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 197
9 A Closer Look at Factorial Data 203
9.1 Contrasts for Factorial Data . . . . . . . . . . . . . . . . . 203
9.2 Modeling Interaction . . . . . . . . . . . . . . . . . . . . 209
9.2.1 Interaction plots . . . . . . . . . . . . . . . . . . 209
9.2.2 One-cell interaction . . . . . . . . . . . . . . . . 210
9.2.3 Quantitative factors . . . . . . . . . . . . . . . . 212
9.2.4 Tukey one-degree-of-freedom for nonadditivity . . 217
9.3 Further Reading and Extensions . . . . . . . . . . . . . . 220
9.4 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 222
10 Further Topics in Factorials 225
10.1 Unbalanced Data . . . . . . . . . . . . . . . . . . . . . . 225
10.1.1 Sums of squares in unbalanced data . . . . . . . . 226
10.1.2 Building models . . . . . . . . . . . . . . . . . . 227
10.1.3 Testing hypotheses . . . . . . . . . . . . . . . . . 230
10.1.4 Empty cells . . . . . . . . . . . . . . . . . . . . . 233
10.2 Multiple Comparisons . . . . . . . . . . . . . . . . . . . 234
10.3 Power and Sample Size . . . . . . . . . . . . . . . . . . . 235
10.4 Two-Series Factorials . . . . . . . . . . . . . . . . . . . . 236
10.4.1 Contrasts . . . . . . . . . . . . . . . . . . . . . . 237
10.4.2 Single replicates . . . . . . . . . . . . . . . . . . 240
10.5 Further Reading and Extensions . . . . . . . . . . . . . . 244
10.6 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 245
11 Random Effects 253
11.1 Models for Random Effects . . . . . . . . . . . . . . . . . 253
11.2 Why Use Random Effects? . . . . . . . . . . . . . . . . . 256
11.3 ANOVA for Random Effects . . . . . . . . . . . . . . . . 257
11.4 Approximate Tests . . . . . . . . . . . . . . . . . . . . . 260
11.5 Point Estimates of Variance Components . . . . . . . . . . 264
11.6 Confidence Intervals for Variance Components . . . . . . 267
11.7 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . 271
11.8 Power . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
11.9 Further Reading and Extensions . . . . . . . . . . . . . . 274
11.10 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 275
12 Nesting, Mixed Effects, and Expected Mean Squares 279
12.1 Nesting Versus Crossing . . . . . . . . . . . . . . . . . . 279
12.2 Why Nesting? . . . . . . . . . . . . . . . . . . . . . . . . 283
12.3 Crossed and Nested Factors . . . . . . . . . . . . . . . . . 283
12.4 Mixed Effects . . . . . . . . . . . . . . . . . . . . . . . . 285
12.5 Choosing a Model . . . . . . . . . . . . . . . . . . . . . . 288
12.6 Hasse Diagrams and Expected Mean Squares . . . . . . . 289
12.6.1 Test denominators . . . . . . . . . . . . . . . . . 290
12.6.2 Expected mean squares . . . . . . . . . . . . . . 293
12.6.3 Constructing a Hasse diagram . . . . . . . . . . . 296
12.7 Variances of Means and Contrasts . . . . . . . . . . . . . 298
12.8 Unbalanced Data and Random Effects . . . . . . . . . . . 304
12.9 Staggered Nested Designs . . . . . . . . . . . . . . . . . 306
12.10 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 307
13 Complete Block Designs 315
13.1 Blocking . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
13.2 The Randomized Complete Block Design . . . . . . . . . 316
13.2.1 Why and when to use the RCB . . . . . . . . . . 318
13.2.2 Analysis for the RCB . . . . . . . . . . . . . . . 319
13.2.3 How well did the blocking work? . . . . . . . . . 322
13.2.4 Balance and missing data . . . . . . . . . . . . . 324
13.3 Latin Squares and Related Row/Column Designs . . . . . 324
13.3.1 The crossover design . . . . . . . . . . . . . . . . 326
13.3.2 Randomizing the LS design . . . . . . . . . . . . 327
13.3.3 Analysis for the LS design . . . . . . . . . . . . . 327
13.3.4 Replicating Latin Squares . . . . . . . . . . . . . 330
13.3.5 Efficiency of Latin Squares . . . . . . . . . . . . 335
13.3.6 Designs balanced for residual effects . . . . . . . 338
13.4 Graeco-Latin Squares . . . . . . . . . . . . . . . . . . . . 343
13.5 Further Reading and Extensions . . . . . . . . . . . . . . 344
13.6 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 345
14 Incomplete Block Designs 357
14.1 Balanced Incomplete Block Designs . . . . . . . . . . . . 358
14.1.1 Intrablock analysis of the BIBD . . . . . . . . . . 360
14.1.2 Interblock information . . . . . . . . . . . . . . . 364
14.2 Row and Column Incomplete Blocks . . . . . . . . . . . . 368
14.3 Partially Balanced Incomplete Blocks . . . . . . . . . . . 370
14.4 Cyclic Designs . . . . . . . . . . . . . . . . . . . . . . . 372
14.5 Square, Cubic, and Rectangular Lattices . . . . . . . . . . 374
14.6 Alpha Designs . . . . . . . . . . . . . . . . . . . . . . . . 376
14.7 Further Reading and Extensions . . . . . . . . . . . . . . 378
14.8 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 379
15 Factorials in Incomplete Blocks: Confounding 387
15.1 Confounding the Two-Series Factorial . . . . . . . . . . . 388
15.1.1 Two blocks . . . . . . . . . . . . . . . . . . . . . 389
15.1.2 Four or more blocks . . . . . . . . . . . . . . . . 392
15.1.3 Analysis of an unreplicated confounded two-series 397
15.1.4 Replicating a confounded two-series . . . . . . . 399
15.1.5 Double confounding . . . . . . . . . . . . . . . . 402
15.2 Confounding the Three-Series Factorial . . . . . . . . . . 403
15.2.1 Building the design . . . . . . . . . . . . . . . . 404
15.2.2 Confounded effects . . . . . . . . . . . . . . . . 407
15.2.3 Analysis of confounded three-series . . . . . . . . 408
15.3 Further Reading and Extensions . . . . . . . . . . . . . . 409
15.4 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 410
16 Split-Plot Designs 417
16.1 What Is a Split Plot? . . . . . . . . . . . . . . . . . . . . 417
16.2 Fancier Split Plots . . . . . . . . . . . . . . . . . . . . . . 419
16.3 Analysis of a Split Plot . . . . . . . . . . . . . . . . . . . 420
16.4 Split-Split Plots . . . . . . . . . . . . . . . . . . . . . . . 428
16.5 Other Generalizations of Split Plots . . . . . . . . . . . . 434
16.6 Repeated Measures . . . . . . . . . . . . . . . . . . . . . 438
16.7 Crossover Designs . . . . . . . . . . . . . . . . . . . . . 441
16.8 Further Reading and Extensions . . . . . . . . . . . . . . 441
16.9 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 442
17 Designs with Covariates 453
17.1 The Basic Covariate Model . . . . . . . . . . . . . . . . . 454
17.2 When Treatments Change Covariates . . . . . . . . . . . . 460
17.3 Other Covariate Models . . . . . . . . . . . . . . . . . . . 462
17.4 Further Reading and Extensions . . . . . . . . . . . . . . 466
17.5 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 466
18 Fractional Factorials 471
18.1 Why Fraction? . . . . . . . . . . . . . . . . . . . . . . . . 471
18.2 Fractioning the Two-Series . . . . . . . . . . . . . . . . . 472
18.3 Analyzing a 2^(k-q) . . . . . . . . . . . . . . . . . . . . . 479
18.4 Resolution and Projection . . . . . . . . . . . . . . . . . . 482
18.5 Confounding a Fractional Factorial . . . . . . . . . . . . . 485
18.6 De-aliasing . . . . . . . . . . . . . . . . . . . . . . . . . 485
18.7 Fold-Over . . . . . . . . . . . . . . . . . . . . . . . . . . 487
18.8 Sequences of Fractions . . . . . . . . . . . . . . . . . . . 489
18.9 Fractioning the Three-Series . . . . . . . . . . . . . . . . 489
18.10 Problems with Fractional Factorials . . . . . . . . . . . . 492
18.11 Using Fractional Factorials in Off-Line Quality Control . . 493
18.11.1 Designing an off-line quality experiment . . . . . 494
18.11.2 Analysis of off-line quality experiments . . . . . . 495
18.12 Further Reading and Extensions . . . . . . . . . . . . . . 498
18.13 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 499
19 Response Surface Designs 509
19.1 Visualizing the Response . . . . . . . . . . . . . . . . . . 509
19.2 First-Order Models . . . . . . . . . . . . . . . . . . . . . 511
19.3 First-Order Designs . . . . . . . . . . . . . . . . . . . . . 512
19.4 Analyzing First-Order Data . . . . . . . . . . . . . . . . . 514
19.5 Second-Order Models . . . . . . . . . . . . . . . . . . . . 517
19.6 Second-Order Designs . . . . . . . . . . . . . . . . . . . 522
19.7 Second-Order Analysis . . . . . . . . . . . . . . . . . . . 526
19.8 Mixture Experiments . . . . . . . . . . . . . . . . . . . . 529
19.8.1 Designs for mixtures . . . . . . . . . . . . . . . . 530
19.8.2 Models for mixture designs . . . . . . . . . . . . 533
19.9 Further Reading and Extensions . . . . . . . . . . . . . . 535
19.10 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 536
20 On Your Own 543
20.1 Experimental Context . . . . . . . . . . . . . . . . . . . . 543
20.2 Experiments by the Numbers . . . . . . . . . . . . . . . . 544
20.3 Final Project . . . . . . . . . . . . . . . . . . . . . . . . . 548
Bibliography 549
A Linear Models for Fixed Effects 563
A.1 Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 563
A.2 Least Squares . . . . . . . . . . . . . . . . . . . . . . . . 566
A.3 Comparison of Models . . . . . . . . . . . . . . . . . . . 568
A.4 Projections . . . . . . . . . . . . . . . . . . . . . . . . . 570
A.5 Random Variation . . . . . . . . . . . . . . . . . . . . . . 572
A.6 Estimable Functions . . . . . . . . . . . . . . . . . . . . . 576
A.7 Contrasts . . . . . . . . . . . . . . . . . . . . . . . . . . 578
A.8 The Scheffé Method . . . . . . . . . . . . . . . . . . . . . 579
A.9 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 580
B Notation 583
C Experimental Design Plans 607
C.1 Latin Squares . . . . . . . . . . . . . . . . . . . . . . . . 607
C.1.1 Standard Latin Squares . . . . . . . . . . . . . . 607
C.1.2 Orthogonal Latin Squares . . . . . . . . . . . . . 608
C.2 Balanced Incomplete Block Designs . . . . . . . . . . . . 609
C.3 Efficient Cyclic Designs . . . . . . . . . . . . . . . . . 615
C.4 Alpha Designs . . . . . . . . . . . . . . . . . . . . . . . . 616
C.5 Two-Series Confounding and Fractioning Plans . . . . . . 617
D Tables 621
Index 647
Preface
This text covers the basic topics in experimental design and analysis and
is intended for graduate students and advanced undergraduates. Students
should have had an introductory statistical methods course at about the level
of Moore and McCabe's Introduction to the Practice of Statistics (Moore and
McCabe 1999) and be familiar with t-tests, p-values, confidence intervals,
and the basics of regression and ANOVA. Most of the text soft-pedals theory
and mathematics, but Chapter 19 on response surfaces is a little tougher sled-
ding (eigenvectors and eigenvalues creep in through canonical analysis), and
Appendix A is an introduction to the theory of linear models. I use the text
in a service course for non-statisticians and in a course for first-year Master's
students in statistics. The non-statisticians come from departments scattered
all around the university including agronomy, ecology, educational psychol-
ogy, engineering, food science, pharmacy, sociology, and wildlife.
I wrote this book for the same reason that many textbooks get written:
there was no existing book that did things the way I thought was best. I start
with single-factor, fixed-effects, completely randomized designs and cover
them thoroughly, including analysis, checking assumptions, and power. I
then add factorial treatment structure and random effects to the mix. At this
stage, we have a single randomization scheme, a lot of different models for
data, and essentially all the analysis techniques we need. I next add block-
ing designs for reducing variability, covering complete blocks, incomplete
blocks, and confounding in factorials. After this I introduce split plots, which
can be considered incomplete block designs but really introduce the broader
subject of unit structures. Covariate models round out the discussion of vari-
ance reduction. I finish with special treatment structures, including fractional
factorials and response surface/mixture designs.
This outline is similar in content to a dozen other design texts; how is this
book different?
I include many exercises where the student is required to choose an
appropriate experimental design for a given situation, or recognize the
design that was used. Many of the designs in question are from earlier
chapters, not the chapter where the question is given. These are impor-
tant skills that often receive short shrift. See examples on pages 500
and 502.
I use Hasse diagrams to illustrate models, find test denominators, and
compute expected mean squares. I feel that the diagrams provide a
much easier and more understandable approach to these problems than
the classic approach with tables of subscripts and live and dead indices.
I believe that Hasse diagrams should see wider application.
I spend time trying to sort out the issues with multiple comparisons
procedures. These confuse many students, and most texts seem to just
present a laundry list of methods and no guidance.
I try to get students to look beyond saying main effects and/or interac-
tions are significant and to understand the relationships in the data. I
want them to learn that understanding what the data have to say is the
goal. ANOVA is a tool we use at the beginning of an analysis; it is not
the end.
I describe the difference in philosophy between hierarchical model
building and parameter testing in factorials, and discuss how this be-
comes crucial for unbalanced data. This is important because the dif-
ferent philosophies can lead to different conclusions, and many texts
avoid the issue entirely.
There are three kinds of problems in this text, which I have denoted
exercises, problems, and questions. Exercises are intended to be sim-
pler than problems, with exercises being more drill on mechanics and
problems being more integrative. Not everyone will agree with my
classification. Questions are not necessarily more difficult than prob-
lems, but they cover more theoretical or mathematical material.
Data files for the examples and problems can be downloaded from the
Freeman web site at http://www.whfreeman.com/. A second re-
source is Appendix B, which documents the notation used in the text.
This text contains many formulae, but I try to use formulae only when I
think that they will increase a reader's understanding of the ideas. In several
settings where closed-form expressions for sums of squares or estimates ex-
ist, I do not present them because I do not believe that they help (for example,
the Analysis of Covariance). Similarly, presentations of normal equations do
not appear. Instead, I approach ANOVA as a comparison of models fit by
least squares, and let the computing software take care of the details of fit-
ting. Future statisticians will need to learn the process in more detail, and
Appendix A gets them started with the theory behind fixed effects.
Speaking of computing, examples in this text use one of four packages:
MacAnova, Minitab, SAS, and S-Plus. MacAnova is a homegrown package
that we use here at Minnesota because we can distribute it freely; it runs
on Macintosh, Windows, and Unix; and it does everything we need. You can
download MacAnova (any version and documentation, even the source) from
http://www.stat.umn.edu/gary/macanova. Minitab and SAS
are widely used commercial packages. I hadn't used Minitab in twelve years
when I started using it for examples; I found it incredibly easy to use. The
menu/dialog/spreadsheet interface was very intuitive. In fact, I only opened
the manual once, and that was when I was trying to figure out how to do
general contrasts (which I was never able to figure out). SAS is far and away
the market leader in statistical software. You can do practically every kind of
analysis in SAS, but as a novice I spent many hours with the manuals trying
to get SAS to do any kind of analysis. In summary, many people swear by
SAS, but I found I mostly swore at SAS. I use S-Plus extensively in research;
here I've just used it for a couple of graphics.
I need to acknowledge many people who helped me get this job done.
First are the students and TAs in the courses where I used preliminary ver-
sions. Many of you made suggestions and pointed out mistakes; in particular
I thank John Corbett, Alexandre Varbanov, and Jorge de la Vega Gongora.
Many others of you contributed data; your footprints are scattered throughout
the examples and exercises. Next I have benefited from helpful discussions
with my colleagues here in Minnesota, particularly Kit Bingham, Kathryn
Chaloner, Sandy Weisberg, and Frank Martin. I thank Sharon Lohr for in-
troducing me to Hasse diagrams, and I received much helpful criticism from
reviewers, including Larry Ringer (Texas A&M), Morris Southward (New
Mexico State), Robert Price (East Tennessee State), Andrew Schaffner (Cal
Poly-San Luis Obispo), Hiroshi Yamauchi (Hawaii-Manoa), and William
Notz (Ohio State). My editor Patrick Farace and others at Freeman were a
great help. Finally, I thank my family and parents, who supported me in this
for years (even if my father did say it looked like a foreign language!).
They say you should never let the camel's nose into the tent, because
once the nose is in, there's no stopping the rest of the camel. In a similar
vein, student requests for copies of lecture notes lead to student requests for
typed lecture notes, which lead to student requests for more complete typed
lecture notes, which lead . . . well, in my case it leads to a textbook on de-
sign and analysis of experiments, which you are reading now. Over the years
my students have preferred various more primitive incarnations of this text to
other texts; I hope you find this text worthwhile too.
Gary W. Oehlert
Chapter 1
Introduction
Researchers use experiments to answer questions. Typical questions might
be:
Is a drug a safe, effective cure for a disease? This could be a test of
how AZT affects the progress of AIDS.
Which combination of protein and carbohydrate sources provides the
best nutrition for growing lambs?
How will long-distance telephone usage change if our company offers
a different rate structure to our customers?
Will an ice cream manufactured with a new kind of stabilizer be as
palatable as our current ice cream?
Does short-term incarceration of spouse abusers deter future assaults?
Under what conditions should I operate my chemical refinery, given
this month's grade of raw material?
This book is meant to help decision makers and researchers design good
experiments, analyze them properly, and answer their questions.
1.1 Why Experiment?
Consider the spousal assault example mentioned above. Justice officials need
to know how they can reduce or delay the recurrence of spousal assault. They
are investigating three different actions in response to spousal assaults. The
assailant could be warned, sent to counseling but not booked on charges,
or arrested for assault. Which of these actions works best? How can they
compare the effects of the three actions?
This book deals with comparative experiments. We wish to compare
some treatments. For the spousal assault example, the treatments are the three
actions by the police. We compare treatments by using them and comparing
the outcomes. Specifically, we apply the treatments to experimental units
and then measure one or more responses. In our example, individuals who
assault their spouses could be the experimental units, and the response could
be the length of time until recurrence of assault. We compare treatments by
comparing the responses obtained from the experimental units in the different
treatment groups. This could tell us if there are any differences in responses
between the treatments, what the estimated sizes of those differences are,
which treatment has the greatest estimated delay until recurrence, and so on.
An experiment is characterized by the treatments and experimental units to
be used, the way treatments are assigned to units, and the responses that are
measured.
Experiments help us answer questions, but there are also nonexperimen-
tal techniques. What is so special about experiments? Consider that:
1. Experiments allow us to set up a direct comparison between the treat-
ments of interest.
2. We can design experiments to minimize any bias in the comparison.
3. We can design experiments so that the error in the comparison is small.
4. Most important, we are in control of experiments, and having that con-
trol allows us to make stronger inferences about the nature of differ-
ences that we see in the experiment. Specifically, we may make infer-
ences about causation.
This last point distinguishes an experiment from an observational study. An
observational study also has treatments, units, and responses. However, in
the observational study we merely observe which units are in which treatment
groups; we don't get to control that assignment.
Example 1.1 Does spanking hurt?
Let's contrast an experiment with an observational study described in Straus,
Sugarman, and Giles-Sims (1997). A large survey of women aged 14 to 21
years was begun in 1979; by 1988 these same women had 1239 children
between the ages of 6 and 9 years. The women and children were inter-
viewed and tested in 1988 and again in 1990. Two of the items measured
were the level of antisocial behavior in the children and the frequency of
spanking. Results showed that children who were spanked more frequently
in 1988 showed larger increases in antisocial behavior in 1990 than those who
were spanked less frequently. Does spanking cause antisocial behavior? Per-
haps it does, but there are other possible explanations. Perhaps children who
were becoming more troublesome in 1988 may have been spanked more fre-
quently, while children who were becoming less troublesome may have been
spanked less frequently in 1988.
The drawback of observational studies is that the grouping into treat-
ments is not under the control of the experimenter and its mechanism is
usually unknown. Thus observed differences in responses between treatment
groups could very well be due to these other hidden mechanisms, rather than
the treatments themselves.
It is important to say that while experiments have some advantages, ob-
servational studies are also useful and can produce important results. For ex-
ample, studies of smoking and human health are observational, but the link
that they have established is one of the most important public health issues
today. Similarly, observational studies established an association between
heart valve disease and the diet drug fen-phen that led to the withdrawal
of the drugs fenfluramine and dexfenfluramine from the market (Connolly
et al. 1997 and US FDA 1997).
Mosteller and Tukey (1977) list three concepts associated with causation
and state that two or three are needed to support a causal relationship:
Consistency
Responsiveness
Mechanism.
Consistency means that, all other things being equal, the relationship be-
tween two variables is consistent across populations in direction and maybe
in amount. Responsiveness means that we can go into a system, change the
causal variable, and watch the response variable change accordingly. Mech-
anism means that we have a step-by-step mechanism leading from cause to
effect.
In an experiment, we are in control, so we can achieve responsiveness.
Thus, if we see a consistent difference in observed response between the
various treatments, we can infer that the treatments caused the differences
in response. We don't need to know the mechanism; we can demonstrate
causation by experiment. (This is not to say that we shouldn't try to learn
mechanisms; we should. It's just that we don't need mechanism to infer
causation.)
We should note that there are times when experiments are not feasible,
even when the knowledge gained would be extremely valuable. For example,
we can't perform an experiment proving once and for all that smoking causes
cancer in humans. We can observe that smoking is associated with cancer in
humans; we have mechanisms for this and can thus infer causation. But we
cannot demonstrate responsiveness, since that would involve making some
people smoke, and making others not smoke. It is simply unethical.
1.2 Components of an Experiment
An experiment has treatments, experimental units, responses, and a method
to assign treatments to units.
Treatments, units, and assignment method specify the experimental design.
Some authors make a distinction between the selection of treatments to be
used, called treatment design, and the selection of units and assignment of
treatments, called experiment design.
Note that there is no mention of a method for analyzing the results.
Strictly speaking, the analysis is not part of the design, though a wise exper-
imenter will consider the analysis when planning an experiment. Whereas
the design determines the proper analysis to a great extent, we will see that
two experiments with similar designs may be analyzed differently, and two
experiments with different designs may be analyzed similarly. Proper analy-
sis depends on the design and the kinds of statistical model assumptions we
believe are correct and are willing to assume.
Not all experimental designs are created equal. A good experimental
design must
Avoid systematic error
Be precise
Allow estimation of error
Have broad validity.
We consider these in turn.
Comparative experiments estimate differences in response between treat-
ments. If our experiment has systematic error, then our comparisons will be
biased, no matter how precise our measurements are or how many experi-
mental units we use. For example, if responses for units receiving treatment
one are measured with instrument A, and responses for treatment two are
measured with instrument B, then we don't know if any observed differences
are due to treatment effects or instrument miscalibrations. Randomization, as
will be discussed in Chapter 2, is our main tool to combat systematic error.
Even without systematic error, there will be random error in the responses,
and this will lead to random error in the treatment comparisons. Experiments
are precise when this random error in treatment comparisons is small. Preci-
sion depends on the size of the random errors in the responses, the number of
units used, and the experimental design used. Several chapters of this book
deal with designs to improve precision.
Experiments must be designed so that we have an estimate of the size
of random error. This permits statistical inference: for example, confidence
intervals or tests of significance. We cannot do inference without an estimate
of error. Sadly, experiments that cannot estimate error continue to be run.
The conclusions we draw from an experiment are applicable to the exper-
imental units we used in the experiment. If the units are actually a statistical
sample from some population of units, then the conclusions are also valid
for the population. Beyond this, we are extrapolating, and the extrapolation
might or might not be successful. For example, suppose we compare two
different drugs for treating attention deficit disorder. Our subjects are pread-
olescent boys from our clinic. We might have a fair case that our results
would hold for preadolescent boys elsewhere, but even that might not be true
if our clinic's population of subjects is unusual in some way. The results are
even less compelling for older boys or for girls. Thus if we wish to have
wide validity (for example, broad age range and both genders), then our ex-
perimental units should reflect the population about which we wish to draw
inference.
We need to realize that some compromise will probably be needed be-
tween these goals. For example, broadening the scope of validity by using a
variety of experimental units may decrease the precision of the responses.
1.3 Terms and Concepts
Let's define some of the important terms and concepts in design of exper-
iments. We have already seen the terms treatment, experimental unit, and
response, but we define them again here for completeness.
Treatments are the different procedures we want to compare. These could
be different kinds or amounts of fertilizer in agronomy, different long-
distance rate structures in marketing, or different temperatures in a re-
actor vessel in chemical engineering.
Experimental units are the things to which we apply the treatments. These
could be plots of land receiving fertilizer, groups of customers receiv-
ing different rate structures, or batches of feedstock processing at dif-
ferent temperatures.
Responses are outcomes that we observe after applying a treatment to an
experimental unit. That is, the response is what we measure to judge
what happened in the experiment; we often have more than one re-
sponse. Responses for the above examples might be nitrogen content
or biomass of corn plants, prot by customer group, or yield and qual-
ity of the product per ton of raw material.
Randomization is the use of a known, understood probabilistic mechanism
for the assignment of treatments to units. Other aspects of an exper-
iment can also be randomized: for example, the order in which units
are evaluated for their responses.
Experimental Error is the random variation present in all experimental re-
sults. Different experimental units will give different responses to the
same treatment, and it is often true that applying the same treatment
over and over again to the same unit will result in different responses
in different trials. Experimental error does not refer to conducting the
wrong experiment or dropping test tubes.
Measurement units (or response units) are the actual objects on which the
response is measured. These may differ from the experimental units.
For example, consider the effect of different fertilizers on the nitrogen
content of corn plants. Different field plots are the experimental units,
but the measurement units might be a subset of the corn plants on the
field plot, or a sample of leaves, stalks, and roots from the field plot.
Blinding occurs when the evaluators of a response do not know which treat-
ment was given to which unit. Blinding helps prevent bias in the evalu-
ation, even unconscious bias from well-intentioned evaluators. Double
blinding occurs when both the evaluators of the response and the (hu-
man subject) experimental units do not know the assignment of treat-
ments to units. Blinding the subjects can also prevent bias, because
subject responses can change when subjects have expectations for cer-
tain treatments.
Control has several different uses in design. First, an experiment is con-
trolled because we as experimenters assign treatments to experimental
units. Otherwise, we would have an observational study.
Second, a control treatment is a standard treatment that is used as a
baseline or basis of comparison for the other treatments. This control
treatment might be the treatment in common use, or it might be a null
treatment (no treatment at all). For example, a study of new pain-killing
drugs could use a standard pain killer as a control treatment, or a study
on the efficacy of fertilizer could give some fields no fertilizer at all.
This would control for average soil fertility or weather conditions.
Placebo is a null treatment that is used when the act of applying a treatment,
any treatment, has an effect. Placebos are often used with human
subjects, because people often respond to any treatment: for example,
reduction in headache pain when given a sugar pill. Blinding is impor-
tant when placebos are used with human subjects. Placebos are also
useful for nonhuman subjects. The apparatus for spraying a field with
a pesticide may compact the soil. Thus we drive the apparatus over the
field, without actually spraying, as a placebo treatment.
Factors combine to form treatments. For example, the baking treatment for
a cake involves a given time at a given temperature. The treatment is
the combination of time and temperature, but we can vary the time and
temperature separately. Thus we speak of a time factor and a temper-
ature factor. Individual settings for each factor are called levels of the
factor.
Confounding occurs when the effect of one factor or treatment cannot be
distinguished from that of another factor or treatment. The two factors
or treatments are said to be confounded. Except in very special cir-
cumstances, confounding should be avoided. Consider planting corn
variety A in Minnesota and corn variety B in Iowa. In this experiment,
we cannot distinguish location effects from variety effects; the variety
factor and the location factor are confounded.
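The Factors entry above explains that treatments are formed as combinations
of factor levels. As a minimal sketch of that idea, not from the text, here is a
hypothetical Python fragment (the time and temperature values are made up
for illustration):

    from itertools import product

    # Hypothetical levels for the cake-baking example: a time factor and a
    # temperature factor combine to form the treatments.
    times = [25, 30, 35]       # baking times in minutes (assumed values)
    temps = [325, 350, 375]    # oven temperatures in degrees F (assumed values)

    # Each treatment is one combination of a time level and a temperature level.
    treatments = list(product(times, temps))
    print(len(treatments))     # 9 treatments from two 3-level factors
    for time, temp in treatments:
        print("time=%d min, temp=%d F" % (time, temp))

Crossing two three-level factors yields nine treatments; the factorial treatment
structures of Chapter 8 build on this kind of construction.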
1.4 Outline
Here is a road map for this book, so that you can see how it is organized.
The remainder of this chapter gives more detail on experimental units and
responses. Chapter 2 elaborates on the important concept of randomiza-
tion. Chapters 3 through 7 introduce the basic experimental design, called
the Completely Randomized Design (CRD), and describe its analysis in con-
siderable detail. Chapters 8 through 10 add factorial treatment structure to
the CRD, and Chapters 11 and 12 add random effects to the CRD. The idea
is that we learn these different treatment structures and analyses in the sim-
plest design setting, the CRD. These structures and analysis techniques can
then be used almost without change in the more complicated designs that
follow.
We begin learning new experimental designs in Chapter 13, which in-
troduces complete block designs. Chapter 14 introduces general incomplete
blocks, and Chapters 15 and 16 deal with incomplete blocks for treatments
with factorial structure. Chapter 17 introduces covariates. Chapters 18 and
19 deal with special treatment structures, including fractional factorials and
response surfaces. Finally, Chapter 20 provides a framework for planning an
experiment.
1.5 More About Experimental Units
Experimentation is so diverse that there are relatively few general statements
that can be made about experimental units. A common source of difficulty is
the distinction between experimental units and measurement units. Consider
an educational study, where six classrooms of 25 first graders each are as-
signed at random to two different reading programs, with all the first graders
evaluated via a common reading exam at the end of the school year. Are there
six experimental units (the classrooms) or 150 (the students)?
One way to determine the experimental unit is via the consideration that
an experimental unit should be able to receive any treatment. Thus if students
were the experimental units, we could see more than one reading program in
each classroom. However, the nature of the experiment makes it clear that all
the students in the classroom receive the same program, so the classroom as
a whole is the experimental unit. We don't measure how a classroom reads,
though; we measure how students read. Thus students are the measurement
units for this experiment.
There are many situations where a treatment is applied to a group of ob-
jects, some of which are later measured for a response. For example,
Fertilizer is applied to a plot of land containing corn plants, some of
which will be harvested and measured. The plot is the experimental
unit and the plants are the measurement units.
Ingots of steel are given different heat treatments, and each ingot is
punched in four locations to measure its hardness. Ingots are the ex-
perimental units and locations on the ingot are measurement units.
Mice are caged together, with different cages receiving different nutri-
tional supplements. The cage is the experimental unit, and the mice
are the measurement units.
Treating measurement units as experimental units usually leads to an over-
optimistic analysis; we will reject null hypotheses more often than we
should, and our confidence intervals will be too short and will not have their
claimed coverage rates. The usual way around this is to determine a single
response for each experimental unit. This single response is typically the
average or total of the responses for the measurement units within an exper-
imental unit, but the median, maximum, minimum, variance or some other
summary statistic could also be appropriate depending on the goals of the
experiment.
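As a minimal sketch of this advice, not from the text, here is a hypothetical
Python fragment (classroom names and scores are invented) that collapses the
measurement-unit responses to one summary response per experimental unit:

    from statistics import mean

    # Hypothetical reading-exam scores: each classroom (experimental unit)
    # contains several students (measurement units).
    scores = {
        "classroom_1": [72, 81, 77, 69],
        "classroom_2": [85, 79, 90, 88],
        "classroom_3": [64, 70, 68, 73],
    }

    # Analyze one summary per experimental unit, not the raw student scores,
    # so the analysis does not overstate the true sample size.
    unit_responses = {room: mean(vals) for room, vals in scores.items()}
    print(unit_responses)

The mean is used here, but as noted above, a total, median, or variance may be
the appropriate summary depending on the goals of the experiment.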
A second issue with units is determining their size or shape. For
agricultural experiments, a unit is generally a plot of land, so size and shape
have an obvious meaning. For an animal feeding study, size could be the
number of animals per cage. For an ice cream formulation study, size could
be the number of liters in a batch of ice cream. For a computer network
configuration study, size could be the length of time the network is observed
under load conditions.
Not all measurement units in an experimental unit will be equivalent.
For the ice cream, samples taken near the edge of a carton (unit) may have
more ice crystals than samples taken near the center. Thus it may make sense
to plan the units so that the ratio of edge to center is similar to that in the
product's intended packaging. Similarly, in agricultural trials, guard rows
are often planted to reduce the effect of being on the edge of a plot. You
don't want to construct plots that are all edge, and thus all guard row. For
experiments that occur over time, such as the computer network study, there
may be a transient period at the beginning before the network moves to steady
state. You don't want units so small that all you measure is transient.
One common situation is that there is a fixed resource available, such as
a fixed area, a fixed amount of time, or a fixed number of measurements.
This fixed resource needs to be divided into units (and perhaps measurement
units). How should the split be made? In general, more experimental units
with fewer measurement units per experimental unit works better (see, for
example, Fairfield Smith 1938). However, smaller experimental units are
inclined to have greater edge effect problems than are larger units, so this
recommendation needs to be moderated by consideration of the actual units.
A third important issue is that the response of a given unit should not de-
pend on or be influenced by the treatments given other units or the responses
of other units. This is usually ensured through some kind of separation of
the units, either in space or time. For example, a forestry experiment would
provide separation between units, so that a fast-growing tree does not shade
trees in adjacent units and thus make them grow more slowly; and a drug trial
giving the same patient different drugs in sequence would include a washout
period between treatments, so that a drug would be completely out of a pa-
tient's system before the next drug is administered.
When the response of a unit is influenced by the treatment given to other
units, we get confounding between the treatments, because we cannot esti-
mate treatment response differences unambiguously. When the response of
a unit is influenced by the response of another unit, we get a poor estimate
of the precision of our experiment. In particular, we usually overestimate
the precision. Failure to achieve this independence can seriously affect the
quality of any inferences we might make.
A final issue with units is determining how many units are required. We
consider this in detail in Chapter 7.
1.6 More About Responses
We have been discussing the response, but it is a rare experiment that mea-
sures only a single response. Experiments often address several questions,
and we may need a different response for each question. Responses such as
these are often called primary responses, since they measure the quantity of
primary interest for a unit.
We cannot always measure the primary response. For example, a drug
trial might be used to find drugs that increase life expectancy after initial
heart attack: thus the primary response is years of life after heart attack.
This response is not likely to be used, however, because it may be decades
before the patients in the study die, and thus decades before the study is
completed. For this reason, experimenters use surrogate responses. (It isn't
only impatience; it becomes more and more difficult to keep in contact with
subjects as time goes on.)
Surrogate responses are responses that are supposed to be related to,
and predictive for, the primary response. For example, we might measure
the fraction of patients still alive after five years, rather than wait for their
actual lifespans. Or we might have an instrumental reading of ice crystals in
ice cream, rather than use a human panel and get their subjective assessment
of product graininess.
Surrogate responses are common, but not without risks. In particular, we
may find that the surrogate response turns out not to be a good predictor of
the primary response.
Example 1.2 Cardiac arrhythmias
Acute cardiac arrhythmias can cause death. Encainide and flecainide acetate
are two drugs that were known to suppress acute cardiac arrhythmias and
stabilize the heartbeat. Chronic arrhythmias are also associated with sud-
den death, so perhaps these drugs could also work for nonacute cases. The
Cardiac Arrhythmia Suppression Trial (CAST) tested these two drugs and
a placebo (CAST Investigators 1989). The real response of interest is sur-
vival, but regularity of the heartbeat was used as a surrogate response. Both
of these drugs were shown to regularize the heartbeat better than the placebo
did. Unfortunately, the real response of interest (survival) indicated that the
regularized pulse was too often 0. These drugs did improve the surrogate
response, but they were actually worse than placebo for the primary response
of survival.
By the way, the investigators were originally criticized for including a
placebo in this trial. After all, the drugs were known to work. It was only the
placebo that allowed them to discover that these drugs should not be used for
chronic arrhythmias.
In addition to responses that relate directly to the questions of interest,
some experiments collect predictive responses. We use predictive responses
to model the primary response. The modeling is done for two reasons. First,
such modeling can be used to increase the precision of the experiment and
the comparisons of interest. In this case, we call the predictive responses
covariates (see Chapter 17). Second, the predictive responses may help us
understand the mechanism by which the treatment is affecting the primary
response. Note, however, that since we observed the predictive responses
rather than setting them experimentally, the mechanistic models built using
predictive responses are observational.
A final class of responses is audit responses. We use audit responses to
ensure that treatments were applied as intended and to check that environ-
mental conditions have not changed. Thus in a study looking at nitrogen
fertilizers, we might measure soil nitrogen as a check on proper treatment
application, and we might monitor soil moisture to check on the uniformity
of our irrigation system.
Chapter 2
Randomization and Design
We characterize an experiment by the treatments and experimental units to be
used, the way we assign the treatments to units, and the responses we mea-
sure. An experiment is randomized if the method for assigning treatments
to units involves a known, well-understood probabilistic scheme. The prob-
abilistic scheme is called a randomization. As we will see, an experiment
may have several randomized features in addition to the assignment of treat-
ments to units. Randomization is one of the most important elements of a
well-designed experiment.
Let's emphasize first the distinction between a random scheme and a
haphazard scheme. Consider the following potential mechanisms for as-
signing treatments to experimental units. In all cases suppose that we have
four treatments that need to be assigned to 16 units.
We use sixteen identical slips of paper, four marked with A, four with
B, and so on to D. We put the slips of paper into a basket and mix them
thoroughly. For each unit, we draw a slip of paper from the basket and
use the treatment marked on the slip.
Treatment A is assigned to the first four units we happen to encounter,
treatment B to the next four units, and so on.
As each unit is encountered, we assign treatments A, B, C, and D based
on whether the seconds reading on the clock is between 1 and 15, 16
and 30, 31 and 45, or 46 and 60.
The first method clearly uses a precisely-defined probabilistic method. We
understand how this method makes its assignments, and we can use this method
to obtain statistically equivalent randomizations in replications of the exper-
iment.
The second two methods might be described as haphazard; they are not
predictable and deterministic, but they do not use a randomization. It is dif-
ficult to model and understand the mechanism that is being used. Assignment
here depends on the order in which units are encountered, the elapsed time
between encountering units, how the treatments were labeled A, B, C, and
D, and potentially other factors. I might not be able to replicate your experi-
ment, simply because I tend to encounter units in a different order, or I tend
to work a little more slowly. The second two methods are not randomization.
Haphazard is not randomized.
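To make the distinction concrete, here is a hypothetical Python sketch of the
first, properly randomized scheme, not from the text: mixing the basket and
drawing one slip per unit is equivalent to shuffling the sixteen labels and
assigning them to the units in order.

    import random

    random.seed(1)  # any seed works; fixing it makes this run reproducible

    # Sixteen "slips of paper": four each of treatments A, B, C, and D.
    slips = ["A", "B", "C", "D"] * 4

    # Shuffle the slips (mix the basket), then assign one per unit.
    random.shuffle(slips)
    for unit, treatment in enumerate(slips, start=1):
        print("unit %2d gets treatment %s" % (unit, treatment))

Because the probabilistic mechanism is known, anyone can reproduce a
statistically equivalent randomization; no such model exists for the two
haphazard schemes.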
Introducing more randomness into an experiment may seem like a per-
verse thing to do. After all, we are always battling against random exper-
imental error. However, random assignment of treatments to units has two
useful consequences:
1. Randomization protects against confounding.
2. Randomization can form the basis for inference.
Randomization is rarely used for inference in practice, primarily due to com-
putational difficulties. Furthermore, some statisticians (Bayesian statisticians
in particular) disagree about the usefulness of randomization as a basis for
inference.[1] However, the success of randomization in the protection against
confounding is so overwhelming that randomization is almost universally
recommended.

[1] Statisticians don't always agree on philosophy or methodology. This is the
first of several ongoing little debates that we will encounter.
2.1 Randomization Against Confounding
We defined confounding as occurring when the effect of one factor or treat-
ment cannot be distinguished from that of another factor or treatment. How
does randomization help prevent confounding? Let's start by looking at the
trouble that can happen when we dont randomize.
Consider a new drug treatment for coronary artery disease. We wish to
compare this drug treatment with bypass surgery, which is costly and inva-
sive. We have 100 patients in our pool of volunteers that have agreed via
informed consent to participate in our study; they need to be assigned to the
two treatments. We then measure ve-year survival as a response.
What sort of trouble can happen if we fail to randomize? Bypass surgery is a major operation, and patients with severe disease may not be strong enough to survive the operation. It might thus be tempting to assign the stronger patients to surgery and the weaker patients to the drug therapy. This confounds strength of the patient with treatment differences. The drug therapy would likely have a lower survival rate because it is getting the weakest patients, even if the drug therapy is every bit as good as the surgery.
Alternatively, perhaps only small quantities of the drug are available early in the experiment, so that we assign more of the early patients to surgery, and more of the later patients to drug therapy. There will be a problem if the early patients are somehow different from the later patients. For example, the earlier patients might be from your own practice, and the later patients might be recruited from other doctors and hospitals. The patients could differ by age, socioeconomic status, and other factors that are known to be associated with survival.
There are several potential randomization schemes for this experiment; here are two:

• Toss a coin for every patient; heads, the patient gets the drug; tails, the patient gets surgery.

• Make up a basket with 50 red balls and 50 white balls well mixed together. Each patient gets a randomly drawn ball; red balls lead to surgery, white balls lead to drug therapy.

Note that for coin tossing the numbers of patients in the two treatment groups are random, while the numbers are fixed for the colored ball scheme.
Here is how randomization has helped us. No matter which features of the population of experimental units are associated with our response, our randomizations put approximately half the patients with these features in each treatment group. Approximately half the men get the drug; approximately half the older patients get the drug; approximately half the stronger patients get the drug; and so on. These are not exactly 50/50 splits, but the deviation from an even split follows rules of probability that we can use when making inference about the treatments.

This example is, of course, an oversimplification. A real experimental design would include considerations for age, gender, health status, and so on. The beauty of randomization is that it helps prevent confounding, even for factors that we do not know are important.
Here is another example of randomization. A company is evaluating two
different word processing packages for use by its clerical staff. Part of the
evaluation is how quickly a test document can be entered correctly using the
two programs. We have 20 test secretaries, and each secretary will enter the
document twice, using each program once.
As expected, there are potential pitfalls in nonrandomized designs. Sup-
pose that all secretaries did the evaluation in the order A first and B second.
Does the second program have an advantage because the secretary will be
familiar with the document and thus enter it faster? Or maybe the second
program will be at a disadvantage because the secretary will be tired and
thus slower.
Two randomized designs that could be considered are:
1. For each secretary, toss a coin: the secretary will use the programs in
the orders AB and BA according to whether the coin is a head or a tail,
respectively.
2. Choose 10 secretaries at random for the AB order, the rest get the BA
order.
Both these designs are randomized and will help guard against confounding, but the designs are slightly different and we will see that they should be analyzed differently.
Cochran and Cox (1957) draw the following analogy:

    Randomization is somewhat analogous to insurance, in that it is a precaution against disturbances that may or may not occur and that may or may not be serious if they do occur. It is generally advisable to take the trouble to randomize even when it is not expected that there will be any serious bias from failure to randomize. The experimenter is thus protected against unusual events that upset his expectations.

Randomization generally costs little in time and trouble, but it can save us from disaster.
2.2 Randomizing Other Things
We have taken a very simplistic view of experiments; "assign treatments to units and then measure responses" hides a multitude of potential steps and choices that will need to be made. Many of these additional steps can be randomized, as they could also lead to confounding. For example:
• If the experimental units are not used simultaneously, you can randomize the order in which they are used.

• If the experimental units are not used at the same location, you can randomize the locations at which they are used.

• If you use more than one measuring instrument for determining response, you can randomize which units are measured on which instruments.
When we anticipate that one of these might cause a change in the response,
we can often design that into the experiment (for example, by using blocking;
see Chapter 13). Thus I try to design for the known problems, and randomize
everything else.
Example 2.1 One tale of woe
I once evaluated data from a study that was examining cadmium and other
metal concentrations in soils around a commercial incinerator. The issue was
whether the concentrations were higher in soils near the incinerator. They
had eight sites selected (matched for soil type) around the incinerator, and
took ten random soil samples at each site.
The samples were all sent to a commercial lab for analysis. The analysis was long and expensive, so they could only do about ten samples a day. Yes indeed, there was almost a perfect match of sites and analysis days. Several elements, including cadmium, were only present in trace concentrations, concentrations that were so low that instrument calibration, which was done daily, was crucial. When the data came back from the lab, we had a very good idea of the variability of their calibrations, and essentially no idea of how the sites differed.
The lab was informed that all the trace analyses, including cadmium, would be redone, all on one day, in a random order that we specified. Fortunately I was not a party to the question of who picked up the $75,000 tab for reanalysis.
2.3 Performing a Randomization
Once we decide to use randomization, there is still the problem of actually doing it. Randomizations usually consist of choosing a random order for a set of objects (for example, doing analyses in random order) or choosing random subsets of a set of objects (for example, choosing a subset of units for treatment A). Thus we need methods for putting objects into random orders and choosing random subsets. When the sample sizes for the subsets are fixed and known (as they usually are), we will be able to choose random subsets by first choosing random orders.
Randomization methods can be either physical or numerical. Physical randomization is achieved via an actual physical act that is believed to produce random results with known properties. Examples of physical randomization are coin tosses, card draws from shuffled decks, rolls of a die, and tickets in a hat. I say "believed to produce random results with known properties" because cards can be poorly shuffled, tickets in the hat can be poorly mixed, and skilled magicians can toss coins that come up heads every time. Large scale embarrassments due to faulty physical randomization include poor mixing of Selective Service draft induction numbers during World War II (see Mosteller, Rourke, and Thomas 1970). It is important to make sure that any physical randomization that you use is done well.
Physical generation of random orders is most easily done with cards or tickets in a hat. We must order N objects. We take N cards or tickets, numbered 1 through N, and mix them well. The first object is then given the number of the first card or ticket drawn, and so on. The objects are then sorted so that their assigned numbers are in increasing order. With good mixing, all orders of the objects are equally likely.
Once we have a random order, random subsets are easy. Suppose that the N objects are to be broken into g subsets with sizes n_1, ..., n_g, with n_1 + ... + n_g = N. For example, eight students are to be grouped into one group of four and two groups of two. First arrange the objects in random order. Once the objects are in random order, assign the first n_1 objects to group one, the next n_2 objects to group two, and so on. If our eight students were randomly ordered 3, 1, 6, 8, 5, 7, 2, 4, then our three groups would be (3, 1, 6, 8), (5, 7), and (2, 4).
Numerical randomization uses numbers taken from a table of random numbers or generated by a random number generator in computer software. For example, Appendix Table D.1 contains random digits. We use the table or a generator to produce a random ordering for our N objects, and then proceed as for physical randomization if we need random subsets.

We get the random order by obtaining a random number for each object, and then sorting the objects so that the random numbers are in increasing order. Start arbitrarily in the table and read numbers of the required size sequentially from the table. If any number is a repeat of an earlier number, replace the repeat by the next number in the list so that you get N different numbers. For example, suppose that we need 5 numbers and that the random numbers in the table are (4, 3, 7, 4, 6, 7, 2, 1, 9, ...). Then our 5 selected numbers would be (4, 3, 7, 6, 2), the duplicates of 4 and 7 being discarded.
Now arrange the objects so that their selected numbers are in ascending order. For the sample numbers, the objects A through E would be reordered E, B, A, D, C. Obviously, you need numbers with more digits as N gets larger.
Getting rid of duplicates makes this procedure a little tedious. You will have fewer duplicates if you use numbers with more digits than are absolutely necessary. For example, for 9 objects, we could use two- or three-digit numbers, and for 30 objects we could use three- or four-digit numbers. The probabilities of 9 random one-, two-, and three-digit numbers having no duplicates are .004, .690, and .965; the probabilities of 30 random two-, three-, and four-digit numbers having no duplicates are .008, .644, and .957 respectively.
Many computer software packages (and even calculators) can produce "random" numbers. Some produce "random" integers, others "random" numbers between 0 and 1. In either case, you use these numbers as you would numbers formed by a sequence of digits from a random number table. Suppose that we needed to put 6 units into random order, and that our random number generator produced the following numbers: .52983, .37225, .99139, .48011, .69382, .61181. Associate the 6 units with these random numbers. The second unit has the smallest random number, so the second unit is first in the ordering; the fourth unit has the next smallest random number, so it is second in the ordering; and so on. Thus the random order of the units is B, D, A, F, E, C.
The word "random" is quoted above because these numbers are not truly random. The numbers in the table are the same every time you read it; they don't change unpredictably when you open the book. The numbers produced by the software package are from an algorithm; if you know the algorithm you can predict the numbers perfectly. They are technically pseudorandom numbers; that is, numbers that possess many of the attributes of random numbers so that they appear to be random and can usually be used in place of random numbers.
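The procedure just described is easy to mimic in software. Here is a minimal sketch in Python (the function name random_order is ours, not from the text): it attaches a pseudorandom number to each object and then sorts by those numbers, exactly as described above.

    import random

    def random_order(objects, seed=None):
        """Random order via the method described above: attach a random
        number to each object, then sort the objects by their numbers."""
        rng = random.Random(seed)
        keyed = [(rng.random(), obj) for obj in objects]
        keyed.sort()  # ascending by the random numbers; ties have probability zero
        return [obj for _, obj in keyed]

    # Six units A-F, as in the text's illustration:
    print(random_order(list("ABCDEF"), seed=4))

Once the objects are in a random order, random subsets of fixed sizes follow by taking the first n_1 objects, the next n_2, and so on, just as with physical randomization.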
2.4 Randomization for Inference
Nearly all the analysis that we will do in this book is based on the normal distribution and linear models and will use t-tests, F-tests, and the like. As we will see in great detail later, these procedures make assumptions such as "The responses in treatment group A are independent from unit to unit and follow a normal distribution with mean μ and variance σ²." Nowhere in the design of our experiment did we do anything to make this so; all we did was randomize treatments to units and observe responses.
Table 2.1: Auxiliary manual times runstitching a collar for 30
workers under standard (S) and ergonomic (E) conditions.
# S E # S E # S E
1 4.90 3.87 11 4.70 4.25 21 5.06 5.54
2 4.50 4.54 12 4.77 5.57 22 4.44 5.52
3 4.86 4.60 13 4.75 4.36 23 4.46 5.03
4 5.57 5.27 14 4.60 4.35 24 5.43 4.33
5 4.62 5.59 15 5.06 4.88 25 4.83 4.56
6 4.65 4.61 16 5.51 4.56 26 5.05 5.50
7 4.62 5.19 17 4.66 4.84 27 5.78 5.16
8 6.39 4.64 18 4.95 4.24 28 5.10 4.89
9 4.36 4.35 19 4.75 4.33 29 4.68 4.89
10 4.91 4.49 20 4.67 4.24 30 6.06 5.24
In fact, randomization itself can be used as a basis for inference. The advantage of this randomization approach is that it relies only on the randomization that we performed. It does not need independence, normality, and the other assumptions that go with linear models. The disadvantage of the randomization approach is that it can be difficult to implement, even in relatively small problems, though computers make it much easier. Furthermore, the inference that randomization provides is often indistinguishable from that of standard techniques such as ANOVA.

Now that computers are powerful and common, randomization inference procedures can be done with relatively little pain. These ideas of randomization inference are best shown by example. Below we introduce the ideas of randomization inference using two extended examples, one corresponding to a paired t-test, and one corresponding to a two-sample t-test.
2.4.1 The paired t-test
Bezjak and Knez (1995) provide data on the length of time it takes garment workers to runstitch a collar on a man's shirt, using a standard workplace and a more ergonomic workplace. Table 2.1 gives the "auxiliary manual time" per collar in seconds for 30 workers using both systems.

One question of interest is whether the times are the same on average for the two workplaces. Formally, we test the null hypothesis that the average runstitching time for the standard workplace is the same as the average runstitching time for the ergonomic workplace.
Table 2.2: Differences in runstitching times (standard − ergonomic).
1.03 -.04 .26 .30 -.97 .04 -.57 1.75 .01 .42
.45 -.80 .39 .25 .18 .95 -.18 .71 .42 .43
-.48 -1.08 -.57 1.10 .27 -.45 .62 .21 -.21 .82
A paired t-test is the standard procedure for testing this null hypothesis. We use a paired t-test because each worker was measured twice, once for each workplace, so the observations on the two workplaces are dependent. Fast workers are probably fast for both workplaces, and slow workers are slow for both. Thus what we do is compute the difference (standard − ergonomic) for each worker, and test the null hypothesis that the average of these differences is zero using a one-sample t-test on the differences.
Table 2.2 gives the differences between standard and ergonomic times. Recall the setup for a one-sample t-test. Let d_1, d_2, ..., d_n be the n differences in the sample. We assume that these differences are independent samples from a normal distribution with mean μ and variance σ², both unknown. Our null hypothesis is that the mean μ equals a prespecified value μ_0 = 0 (H_0: μ = μ_0 = 0), and our alternative is H_1: μ > 0 because we expect the workers to be faster in the ergonomic workplace.
The formula for a one-sample t-test is

    t = (d̄ − μ_0) / (s/√n),

where d̄ is the mean of the data (here the differences d_1, d_2, ..., d_n), n is the sample size, and s is the sample standard deviation (of the differences)

    s = √[ (1/(n − 1)) Σ_{i=1}^{n} (d_i − d̄)² ].
If our null hypothesis is correct and our assumptions are true, then the t-statistic follows a t-distribution with n − 1 degrees of freedom.

The p-value for a test is the probability, assuming that the null hypothesis is true, of observing a test statistic as extreme or more extreme than the one we did observe. "Extreme" means away from the null hypothesis towards the alternative hypothesis. Our alternative here is that the true average is larger than the null hypothesis value, so larger values of the test statistic are extreme. Thus the p-value is the area under the t-curve with n − 1 degrees of freedom from the observed t-value to the right. (If the alternative had been μ < μ_0, then the p-value is the area under the curve to the left of our test statistic. For a two-sided alternative, the p-value is the area under the curve at a distance from 0 as great or greater than our test statistic.)

    Table 2.3: Paired t-test results for runstitching times
    (standard − ergonomic) for the last 10 and all 30 workers.

               n   df    d̄     s     t     p
    Last 10   10    9  .023  .695   .10  .459
    All 30    30   29  .175  .645  1.49  .074
To illustrate the t-test, let's use the data for the last 10 workers and all 30 workers. Table 2.3 shows the results. Looking at the last ten workers, the p-value is .46, meaning that we would observe a t-statistic this large or larger in 46% of all tests when the null hypothesis is true. Thus there is little evidence against the null here. When all 30 workers are considered, the p-value is .074; this is mild evidence against the null hypothesis. The fact that these two differ probably indicates that the workers are not listed in random order. In fact, Figure 2.1 shows box-plots for the differences by groups of ten workers; the lower-numbered differences tend to be greater.
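For readers following along in software, here is a minimal sketch of the same paired analysis in Python (assuming SciPy version 1.6 or later for the alternative= argument; the variable names are ours):

    from scipy import stats

    # Differences (standard - ergonomic) for the 30 workers of Table 2.2,
    # in worker order (workers 1-10, 11-20, 21-30).
    d = [1.03, -0.04, 0.26, 0.30, -0.97, 0.04, -0.57, 1.75, 0.01, 0.42,
         0.45, -0.80, 0.39, 0.25, 0.18, 0.95, -0.18, 0.71, 0.42, 0.43,
         -0.48, -1.08, -0.57, 1.10, 0.27, -0.45, 0.62, 0.21, -0.21, 0.82]

    # One-sample t-test of H0: mean difference = 0 vs H1: mean > 0.
    for label, data in [("Last 10", d[20:]), ("All 30", d)]:
        t, p = stats.ttest_1samp(data, 0.0, alternative="greater")
        print(f"{label}: t = {t:.2f}, one-sided p = {p:.3f}")

This should reproduce the t and p values shown in Table 2.3.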
Now consider a randomization-based analysis. The randomization null hypothesis is that the two workplaces are completely equivalent and merely act to label the responses that we observed. For example, the first worker had responses of 4.90 and 3.87, which we have labeled as standard and ergonomic. Under the randomization null, the responses would be 4.90 and 3.87 no matter how the random assignment of treatments turned out. The only thing that could change is which of the two is labeled as standard, and which as ergonomic. Thus, under the randomization null hypothesis, we could, with equal probability, have observed 3.87 for standard and 4.90 for ergonomic.
What does this mean in terms of the differences? We observed a difference of 1.03 for worker 1. Under the randomization null, we could just as easily have observed the difference −1.03, and similarly for all the other differences. Thus in the randomization analogue to a paired t-test, the absolute values of the differences are taken to be fixed, and the signs of the differences are random, with each sign independent of the others and having equal probability of positive and negative.
To construct a randomization test, we choose a descriptive statistic for
the data and then get the distribution of that statistic under the randomization
null hypothesis. The randomization p-value is the probability (under this
randomization distribution) of getting a descriptive statistic as extreme or
more extreme than the one we observed.
[Figure 2.1: Box-plots of differences in runstitching times by groups of 10 workers, using MacAnova. Stars and diamonds indicate potential outlier points.]
For this problem, we take the sum of the differences as our descriptive statistic. (The average would lead to exactly the same p-values, and we could also form tests using the median or other measures of center.) Start with the last 10 workers. The sum of the last 10 observed differences is .23. To get the randomization distribution, we have to get the sum for all possible combinations of signs for the differences. There are two possibilities for each difference, and 10 differences, so there are 2¹⁰ = 1024 different equally likely values for the sum in the randomization distribution. We must look at all of them to get the randomization p-value.
Figure 2.2 shows a histogram of the randomization distribution for the last 10 workers. The observed value of .23 is clearly in the center of this distribution, so we expect a large p-value. In fact, 465 of the 1024 values are .23 or larger, so the randomization p-value is 465/1024 = .454, very close to the t-test p-value.
We only wanted to do a test on a mean of 10 numbers, and we had to
compute 1024 different sums of 10 numbers; you can see one reason why
randomization tests have not had a major following. For some data sets, you
can compute the randomization p-value by hand fairly simply. Consider the
last 10 differences in Table 2.2 (reading across rows, rather than columns).
[Figure 2.2: Histogram of the randomization distribution of the sum of the last 10 worker differences for runstitching, with a vertical line added at the observed sum.]
These differences are

    .62  1.75  .71  .21  .01  .42  −.21  .42  .43  .82

Only one of these values is negative (−.21), and seven of the positive differences have absolute value greater than .21. Any sign change for those seven values can only make the sum less, so we don't have to consider changing their signs, only the signs of .21, .01, and −.21. This is a much smaller problem, and it is fairly easy to work out that four of the 8 possible sign arrangements for these three differences lead to sums as large or larger than the observed sum. Thus the randomization p-value is 4/1024 = .004, similar to the .007 p-value we would get if we used the t-test.
Looking at the entire data set, we have 2³⁰ = 1,073,741,824 different sets of signs. That is too many to do comfortably, even on a computer. What is done instead is to have the computer choose a random sample from this complete distribution by choosing random sets of signs, and then use this sample for computing randomization p-values as if it were the complete distribution. For a reasonably large sample, say 10,000, the approximation is usually good enough. I took a random sample of size 10,000 and got a p-value of .069, reasonably close to the t-test p-value. Two additional samples of 10,000 gave p-values of .073 and .068; the binomial distribution suggests that these approximate p-values have a standard deviation of about

    √(p(1 − p)/10000) ≈ √(.07 × .93/10000) = .0026 .
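A minimal sketch of this sampling approach in Python (ours; the seed and sample size are arbitrary choices) follows:

    import random

    # All 30 differences (Table 2.2), in hundredths of a second.
    d = [103, -4, 26, 30, -97, 4, -57, 175, 1, 42,
         45, -80, 39, 25, 18, 95, -18, 71, 42, 43,
         -48, -108, -57, 110, 27, -45, 62, 21, -21, 82]
    observed = sum(d)                                  # 526, i.e., 5.26

    rng = random.Random(0)
    reps = 10_000
    # Flip a fair coin for the sign of each difference, and count how
    # often the resulting sum is as large or larger than the observed sum.
    hits = sum(sum(x if rng.random() < 0.5 else -x for x in d) >= observed
               for _ in range(reps))
    print(hits / reps)  # should come out around .07, as in the text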
2.4.2 Two-sample t-test

Figure 2 of Hunt (1973) provides data from an experiment looking at the absorption of phosphorus by Rumex acetosa. Table 2.4 is taken from Figure 2 of Hunt and gives the log phosphorus content of 8 whole plants, 4 each at 15 and 28 days after first harvest. These are 8 plants randomly divided into two groups of 4, with each group getting a different treatment. One natural question is whether the average phosphorus content is the same at the two sampling times. Formally, we test the null hypothesis that the two sampling times have the same average.

    Table 2.4: Log whole plant phosphorus (ln μg/plant)
    15 and 28 days after first harvest.

         15 Days                 28 Days
    4.3  4.6  4.8  5.4      5.3  5.7  6.0  6.3
A two-sample t-test is the standard method for addressing this question. Let y_11, ..., y_14 be the responses from the first sample, and let y_21, ..., y_24 be the responses from the second sample. The usual assumptions for a two-sample t-test are that the data y_11, ..., y_14 are a sample from a normal distribution with mean μ_1 and variance σ², the data y_21, ..., y_24 are a sample from a normal distribution with mean μ_2 and variance σ², and the two samples are independent. Note that while the means may differ, the variances are assumed to be the same. The null hypothesis is H_0: μ_1 = μ_2 and our alternative is H_1: μ_1 < μ_2 (presumably growing plants will accumulate phosphorus).
The two-sample t-statistic is

    t = (ȳ_2 − ȳ_1) / (s_p √(1/n_1 + 1/n_2)),

where ȳ_1 and ȳ_2 are the means of the first and second samples, n_1 and n_2 are the sample sizes, and s_p² is the pooled estimate of variance defined by

    s_p² = [ Σ_{i=1}^{n_1} (y_1i − ȳ_1)² + Σ_{i=1}^{n_2} (y_2i − ȳ_2)² ] / (n_1 + n_2 − 2).
If our null hypothesis is correct and our assumptions are true, then the t-statistic follows a t-distribution with n_1 + n_2 − 2 degrees of freedom. The p-value for our one-sided alternative is the area under the t-distribution curve with n_1 + n_2 − 2 degrees of freedom that is to the right of our observed t-statistic.

For these data, ȳ_1 = 4.775, ȳ_2 = 5.825, s_p = .446, and n_1 = n_2 = 4. The t-statistic is then

    t = (5.825 − 4.775) / (.446 √(1/4 + 1/4)) = 3.33,

and the p-value is .008, the area under a t-curve with 6 degrees of freedom to the right of 3.33. This is strong evidence against the null hypothesis, and we would probably conclude that the null is false.
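For reference, here is a minimal Python sketch of this computation (assuming SciPy 1.6 or later for the alternative= argument; variable names are ours):

    from scipy import stats

    d15 = [4.3, 4.6, 4.8, 5.4]   # log phosphorus at 15 days (Table 2.4)
    d28 = [5.3, 5.7, 6.0, 6.3]   # log phosphorus at 28 days

    # Pooled-variance two-sample t-test of H1: mean(28-day) > mean(15-day).
    t, p = stats.ttest_ind(d28, d15, equal_var=True, alternative="greater")
    print(f"t = {t:.2f}, one-sided p = {p:.3f}")   # t = 3.33, p = .008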
Now consider a randomization analysis. The randomization null hypothesis is that the growing time treatments are completely equivalent and serve only as labels. In particular, the responses we observed for the 8 units would be the same no matter which treatments had been applied, and any subset of four units is equally likely to be the 15-day treatment group. For example, under the randomization null, the 15-day treatment responses (4.3, 4.6, 4.8, 5.4), (4.3, 4.6, 5.3, 5.7), and (5.4, 5.7, 6.0, 6.3) are all equally likely.
As in the paired example, we construct the randomization test by choosing a descriptive statistic for the data and finding the distribution of that statistic under the randomization null hypothesis; the randomization p-value is the probability (under this randomization distribution) of getting a descriptive statistic as extreme or more extreme than the one we observed.
For this problem, we take the average response at 28 days minus the average response at 15 days as our statistic. The observed value of this statistic is 1.05. There are C(8,4) = 70 different ways that the 8 plants can be split between the two treatments. Only two of those 70 ways give a difference of averages as large as or larger than the one we observed. Thus the randomization p-value is 2/70 = .029. This p-value is a bit bigger than that computed from the t-test, but both give evidence against the null hypothesis. Note that the smallest possible randomization p-value for this experiment is 1/70 = .014.
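This enumeration is small enough to sketch directly. A minimal Python illustration (ours, not from the text), with the responses scaled to tenths so the comparisons are exact:

    from itertools import combinations

    # All 8 responses (Table 2.4) in tenths; the first four are the
    # 15-day group as actually assigned.
    y = [43, 46, 48, 54, 53, 57, 60, 63]
    total = sum(y)
    # (28-day group sum) - (15-day group sum); equivalent to comparing
    # the difference of averages, since the groups have equal size.
    observed = total - 2 * sum(y[:4])                  # 42, i.e., 1.05

    # Enumerate all C(8,4) = 70 equally likely 15-day groups.
    count = sum(total - 2 * sum(g) >= observed for g in combinations(y, 4))
    print(count, count / 70)                           # 2 and 2/70 = .029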
2.4.3 Randomization inference and standard inference
We have seen a couple of examples where the p-values for randomization tests were very close to those of t-tests, and a couple where the p-values differed somewhat. Generally speaking, randomization p-values are close to standard p-values. The two tend to be very close when the sample size is large and the assumptions of the standard methods are met. For small sample sizes, randomization inference is coarser, in the sense that there are relatively few obtainable p-values.

    Randomization p-values are usually close to normal theory p-values.

We will only mention randomization testing in passing in the remainder of this book. Normal theory methods such as ANOVA and t-tests are much easier to implement and generalize; furthermore, we get essentially the same inference as the randomization tests, provided we take some care to ensure that the assumptions made by the standard procedures are met. We should consider randomization methods when the assumptions of normal theory cannot be met.
2.5 Further Reading and Extensions
Randomization tests, sometimes called permutation tests, were introduced by Fisher (1935) and further developed by Pitman (1937, 1938) and others. Some of the theory behind these tests can be found in Kempthorne (1955) and Lehmann (1959). Fisher's book is undoubtedly a classic and the granddaddy of all modern books on the design of experiments. It is, however, difficult for mere mortals to comprehend and has been debated and discussed since it appeared (see, for example, Kempthorne 1966). Welch (1990) presents a fairly general method for constructing randomization tests.
The randomization distribution for our test statistic is discrete, so there
is a nonzero lump of probability on the observed value. We have computed
the p-value by including all of this probability at the observed value as being
in the tail area (as extreme or more extreme than that we observed). One
potential variation on the p-value is to split the probability at the observed
value in half, putting only half in the tail. This can sometimes improve the
agreement between randomization and standard methods.
While randomization is traditional in experimental design and its use is
generally prescribed, it is only fair to point out that there is an alternative
model for statistical inference in which randomization is not necessary for
valid experimental design, and under which randomization does not form
the basis for inference. This is the Bayesian model of statistical inference.
The drawback is that the Bayesian analysis must model all the miscellaneous
factors which randomization is used to avoid.
The key assumption in many Bayesian analyses is the assumption of exchangeability, which is like the assumption of independence in a classical analysis. Many Bayesians will concede that randomization can assist in making exchangeability a reasonable approximation to reality. Thus, some would do randomization to try to get exchangeability. However, Bayesians do not need to randomize and so are free to consider other criteria, such as ethical criteria, much more strongly. Berry (1989) has expounded this view rather forcefully.
Bayesians believe in the likelihood principle, which here implies basing your inference on the data you have instead of the data you might have had. Randomization inference compares the observed results to results that would have been obtained under other randomizations. This is a clear violation of the likelihood principle. Of course, Bayesians don't generally believe in testing or p-values to begin with.
A fairly recent cousin of randomization inference is bootstrapping (see Efron 1979; Efron and Tibshirani 1993; and many others). Bootstrap inference in the present context does not rerandomize the assignment of treatments to units; rather, it randomly reweights the observations in each treatment group in an effort to determine the distribution of statistics of interest.
2.6 Problems
Exercise 2.1  We wish to evaluate a new textbook for a statistics class. There are seven sections; four are chosen at random to receive the new book, three receive the old book. At the end of the semester, student evaluations show the following percentages of students rate the textbook as very good or excellent:

    Section   1   2   3   4   5   6   7
    Book      N   O   O   N   N   O   N
    Rating   46  37  47  45  32  62  56

Find the one-sided randomization p-value for testing the null hypothesis that the two books are equivalent versus the alternative that the new book is better (receives higher scores).
Exercise 2.2  Dairy cows are bred by selected bulls, but not all cows become pregnant at the first service. A drug is proposed that is hoped to increase the bulls' fertility. Each of seven bulls will be bred to 2 herds of 100 cows each (a total of 14 herds). For one herd (selected randomly) the bulls will be given the drug, while no drug will be given for the second herd. Assume the drug has no residual effect. The response we observe for each bull is the number of impregnated cows under drug therapy minus the number of impregnated cows without the drug. The observed differences are −1, 6, 4, 6, 2, −3, 5. Find the p-value for the randomization test of the null hypothesis that the drug has no effect versus a one-sided alternative (the drug improves fertility).
Exercise 2.3  Suppose we are studying the effect of diet on height of children, and we have two diets to compare: diet A (a well-balanced diet with lots of broccoli) and diet B (a diet rich in potato chips and candy bars). We wish to find the diet that helps children grow (in height) fastest. We have decided to use 20 children in the experiment, and we are contemplating the following methods for matching children with diets:

1. Let them choose.

2. Take the first 10 for A, the second 10 for B.

3. Alternate A, B, A, B.

4. Toss a coin for each child in the study: heads, A; tails, B.

5. Get 20 children; choose 10 at random for A, the rest for B.

Describe the benefits and risks of using these five methods.
Exercise 2.4  As part of a larger experiment, Dale (1992) looked at six samples of a wetland soil undergoing a simulated snowmelt. Three were randomly selected for treatment with a neutral pH snowmelt; the other three got a reduced pH snowmelt. The observed response was the number of Copepoda removed from each microcosm during the first 14 days of snowmelt.

    Reduced pH          Neutral pH
    256  159  149       54  123  248

Using randomization methods, test the null hypothesis that the two treatments have equal average numbers of Copepoda versus a two-sided alternative.
Exercise 2.5  Chu (1970) studied the effect of the insecticide chlordane on the nervous systems of American cockroaches. The coxal muscles from one meso- and one metathoracic leg on opposite sides were surgically extracted from each of six roaches. The roaches were then treated with 50 micrograms of chlordane, and coxal muscles from the two remaining meso- and metathoracic legs were removed about two hours after treatment. The Na⁺-K⁺ ATPase activity was measured in each muscle, and the percentage changes for the six roaches are given here:

    15.3  −31.8  −35.6  −14.5  3.1  −24.5

Test the null hypothesis that the chlordane treatment has not affected the Na⁺-K⁺ ATPase activity. What experimental technique (not mentioned in the description above) must have been used to justify a randomization test?
Problem 2.1  McElhoe and Conner (1986) use an instrument called a Visiplume to measure ultraviolet light. By comparing absorption in clear air and absorption in polluted air, the concentration of SO₂ in the polluted air can be estimated. The EPA has a standard method for measuring SO₂, and we wish to compare the two methods across a range of air samples. The recorded response is the ratio of the Visiplume reading to the EPA standard reading. There were six observations on coal plant number 2: .950, .978, .762, .733, .823, and 1.011. If we make the null hypothesis be that the Visiplume and standard measurements are equivalent (and the Visiplume and standard labels are just labels and nothing more), then the ratios could (with equal probability) have been observed as their reciprocals. That is, the ratio of .950 could with equal probability have been 1/.950 = 1.053, since the labels are equivalent and assigned at random. Suppose we take as our summary of the data the sum of the ratios. We observe .95 + ... + 1.011 = 5.257. Test (using randomization methods) the null hypothesis of equivalent measurement procedures against the alternative that Visiplume reads higher than the standard. Report a p-value.
Problem 2.2  In this problem, a data set of size 5 consists of the numbers 1 through 5; a data set of size 6 consists of the numbers 1 through 6; and so on.

(a) For data sets of size 5 and 6, compute the complete randomization distribution for the mean of samples of size 3. (There will be 10 and 20 members respectively in the two distributions.) How "normal" do these distributions look?

(b) For data sets of size 4 and 5, compute the complete randomization distribution for the mean of samples of any size (size 1, size 2, ..., up to all the data in the sample). Again, compare these to normal.

(c) Compare the size 5 distributions from parts (a) and (b). How do they compare for mean, median, variance, and so on?
Question 2.1  Let X_1, X_2, ..., X_N be independent, uniformly distributed, random k-digit integers (that is, less than 10^k). Find the probability of having no duplicates in N draws.
Chapter 3

Completely Randomized Designs
The simplest randomized experiment for comparing several treatments is the Completely Randomized Design, or CRD. We will study CRDs and their analysis in some detail, before considering any other designs, because many of the concepts and methods learned in the CRD context can be transferred with little or no modification to more complicated designs. Here, we define completely randomized designs and describe the initial analysis of results.
3.1 Structure of a CRD
We have g treatments to compare and N units to use in our experiment. For a completely randomized design:

1. Select sample sizes n_1, n_2, ..., n_g with n_1 + n_2 + ... + n_g = N.

2. Choose n_1 units at random to receive treatment 1, n_2 units at random from the N − n_1 remaining to receive treatment 2, and so on.

This randomization produces a CRD; all possible arrangements of the N units into g groups with sizes n_1 through n_g are equally likely. Note that complete randomization only addresses the assignment of treatments to units; selection of treatments, experimental units, and responses is also required.
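A minimal sketch of this two-step randomization in Python (the function name crd is ours, not standard terminology):

    import random

    def crd(units, sizes, seed=None):
        """Completely randomized design: shuffle the units, then give the
        first n_1 to treatment 1, the next n_2 to treatment 2, and so on."""
        assert sum(sizes) == len(units)
        rng = random.Random(seed)
        pool = list(units)
        rng.shuffle(pool)            # all arrangements equally likely
        groups, start = [], 0
        for n in sizes:
            groups.append(pool[start:start + n])
            start += n
        return groups

    # For example, 37 units into five treatment groups of sizes 8, 8, 8, 7, 6:
    for trt, group in enumerate(crd(range(1, 38), [8, 8, 8, 7, 6], seed=3), 1):
        print("treatment", trt, sorted(group))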
Completely randomized designs are the simplest, most easily understood, most easily analyzed designs. For these reasons, we consider the CRD first when designing an experiment. The CRD may prove to be inadequate for some reason, but I always consider the CRD when developing an experimental design before possibly moving on to a more sophisticated design.
Example 3.1 Acid rain and birch seedlings
Wood and Bormann (1974) studied the effect of acid rain on trees. Clean precipitation has a pH in the 5.0 to 5.5 range, but observed precipitation pH in northern New Hampshire is often in the 3.0 to 4.0 range. Is this acid rain harming trees, and if so, does the amount of harm depend on the pH of the rain?

One of their experiments used 240 six-week-old yellow birch seedlings. These seedlings were divided into five groups of 48 at random, and the seedlings within each group received an acid mist treatment 6 hours a week for 17 weeks. The five treatments differed by mist pH: 4.7, 4.0, 3.3, 3.0, and 2.3; otherwise, the seedlings were treated identically. After the 17 weeks, the seedlings were weighed, and total plant (dry) weight was taken as response. Thus we have a completely randomized design, with five treatment groups and each n_i fixed at 48. The seedlings were the experimental units, and plant dry weight was the response.

This is a nice, straightforward experiment, but let's look over the steps in planning the experiment and see where some of the choices and compromises were made. It was suspected that damage might vary by pH level, plant developmental stage, and plant species, among other things. This particular experiment only addresses pH level (other experiments were conducted separately). Many factors affect tree growth. The experiment specifically controlled for soil type, seed source, and amounts of light, water, and fertilizer. The desired treatment was real acid rain, but the available treatment was a synthetic acid rain consisting of distilled water and sulfuric acid (rain in northern New Hampshire is basically a weak mixture of sulfuric and nitric acids). There was no placebo per se. The experiment used yellow birch seedlings; what about other species or more mature trees? Total plant weight is an important response, but other responses (possibly equally important) are also available. Thus we see that the investigators have narrowed an enormous question down to a workable experiment using artificial acid rain on seedlings of a single species under controlled conditions. A considerable amount of nonstatistical background work and compromise goes into the planning of even the simplest (from a statistical point of view) experiment.
Example 3.2 Resin lifetimes
Mechanical parts such as computer disk drives, light bulbs, and glue bonds eventually fail. Buyers of these parts want to know how long they are likely to last, so manufacturers perform tests to determine average lifetime, sometimes expressed as mean time to failure, or mean time between failures for repairable items. The last computer disk drive I bought had a mean time to failure of 800,000 hours (over 90 years). Clearly the manufacturer did not have disks on test for over 90 years; how do they make such claims?

One experimental method for reliability is called an accelerated life test. Parts under stress will usually fail sooner than parts that are unstressed. By modeling the lifetimes of parts under various stresses, we can estimate (extrapolate to) the lifetime of parts that are unstressed. That way we get an estimate of the unstressed lifetime without having to wait the complete unstressed lifetime.
Nelson (1990) gave an example where the goal was to estimate the lifetime (in hours) of an encapsulating resin for gold-aluminum bonds in integrated circuits operating at 120°C. Since the lifetimes were expected to be rather long, an accelerated test was used. Thirty-seven units were assigned at random to one of five different temperature stresses, ranging from 175° to 250°. Table 3.1 gives the log₁₀ lifetimes in hours for the test units.

    Table 3.1: log₁₀ times till failure of a resin under stress.

                        Temperature (°C)
       175          194          213          231          250
    2.04  1.85   1.66  1.66   1.53  1.35   1.15  1.21   1.26  1.02
    1.91  1.96   1.71  1.61   1.54  1.27   1.22  1.28    .83  1.09
    2.00  1.88   1.42  1.55   1.38  1.26   1.17  1.17   1.08  1.06
    1.92  1.90   1.76  1.66   1.31  1.38   1.16

For this experiment, the choice of units was rather clear: integrated circuits with the resin bond of interest. Choice of treatments, however, depended on knowing that temperature stress reduced resin bond lifetime. The actual choice of temperatures probably benefited from knowledge of the results of previous similar experiments. Once again, experimental design is a combination of subject matter knowledge and statistical methods.
3.2 Preliminary Exploratory Analysis
It is generally advisable to conduct a preliminary exploratory or graphical analysis of the data prior to any formal modeling, testing, or estimation. Preliminary analysis could include:

• Simple descriptive statistics such as means, medians, standard errors, interquartile ranges;

• Plots, such as stem and leaf diagrams, box-plots, and scatter-plots; and

• The above procedures applied separately to each treatment group.

See, for example, Moore and McCabe (1999) for a description of these exploratory techniques.
This preliminary analysis presents several possibilities. For example, a set of box-plots with one box for each treatment group can show us the relative sizes of treatment mean differences and experimental error. This often gives us as much understanding of the data as any formal analysis procedure. Preliminary analysis can also be a great help in discovering unusual responses or problems in the data. For example, we might discover an outlying value, perhaps due to data entry error, that was difficult to spot in a table of numbers.
Example 3.3 Resin lifetimes, continued
We illustrate preliminary analysis by using Minitab to make box-plots of the resin lifetime data of Example 3.2, with a separate box-plot for each treatment; see Figure 3.1. The data in neighboring treatments overlap, but there is a consistent change in the response from treatments one through five, and the change is fairly large relative to the variation within each treatment group. Furthermore, the variation is roughly the same in the different treatment groups (achieving this was a major reason for using log lifetimes).

A second plot shows us something of the challenge we are facing. Figure 3.2 shows the average log lifetimes per treatment group plotted against the stress temperature, with a regression line superimposed. We are trying to extrapolate over to a temperature of 120°, well beyond the range of the data. If the relationship is nonlinear (and it looks curved), the linear fit will give a poor prediction and the average log lifetime at 120° could be considerably higher than that predicted by the line.

[Figure 3.1: Box-plots of log₁₀ times till failure of a resin under five different temperature stresses, using Minitab.]

[Figure 3.2: Average log₁₀ time till failure versus temperature, with linear regression line added, using MacAnova.]
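Readers who want to reproduce a plot like Figure 3.1 can do so in most packages; here is a minimal sketch using Python and Matplotlib (an assumption on our part; the original figures were made with Minitab and MacAnova):

    import matplotlib.pyplot as plt

    # log10 failure times from Table 3.1, one list per temperature stress.
    resin = {
        175: [2.04, 1.85, 1.91, 1.96, 2.00, 1.88, 1.92, 1.90],
        194: [1.66, 1.66, 1.71, 1.61, 1.42, 1.55, 1.76, 1.66],
        213: [1.53, 1.35, 1.54, 1.27, 1.38, 1.26, 1.31, 1.38],
        231: [1.15, 1.21, 1.22, 1.28, 1.17, 1.17, 1.16],
        250: [1.26, 1.02, 0.83, 1.09, 1.08, 1.06],
    }

    # labels= is accepted by Matplotlib (newer versions prefer tick_labels=).
    plt.boxplot(list(resin.values()), labels=list(resin.keys()))
    plt.xlabel("Temperature stress (C)")
    plt.ylabel("log10 lifetime (hours)")
    plt.show()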
3.3 Models and Parameters
A model for data is a specification of the statistical distribution for the data. For example, the number of heads in ten tosses of a fair coin would have a Binomial(10, .5) distribution, where .5 gives the probability of a success and 10 is the number of trials. In this instance, the distribution depends on two numbers, called parameters: the success probability and the number of trials. For ten tosses of a fair coin, we know both parameters. In the analysis of experimental data, we may posit several different models for the data, all with unknown parameters. The objectives of the experiment can often be described as deciding which model is the best description of the data, and making inferences about the parameters in the models.
Our models for experimental data have two basic parts. The first part describes the average or expected values for the data. This is sometimes called a "model for the means" or "structure for the means." For example, consider the birch tree weights from Example 3.1. We might assume that all the treatments have the same mean response, or that each treatment has its own mean, or that the means in the treatments are a straight line function of the treatment pH. Each one of these models for the means has its own parameters, namely the common mean, the five separate treatment means, and the slope and intercept of the linear relationship, respectively.

The second basic part of our data models is a description of how the data vary around the treatment means. This is the "model for the errors" or "structure for the errors." We assume that deviations from the treatment means are independent for different data values, have mean zero, and all the deviations have the same variance, denoted by σ².

This description of the model for the errors is incomplete, because we have not described the distribution of the errors. We can actually go a fair way with descriptive statistics using our mean and error models without ever assuming a distribution for the deviations, but we will need to assume a distribution for the deviations in order to do tests, confidence intervals, and other forms of inference. We assume, in addition to independence, zero mean, and constant variance, that the deviations follow a Normal distribution.

The standard analysis for completely randomized designs is concerned with the structure of the means. We are trying to learn whether the means are all the same, or if some differ from the others, and the nature of any differences that might be present. The error structure is assumed to be known, except for the variance σ², which must be estimated and dealt with but is otherwise of lesser interest.
Let me emphasize that these models in the standard analysis may not be the only models of interest; for example, we may have data that do not follow a normal distribution, or we may be interested in variance differences rather than mean differences (see Example 3.4). However, the usual analysis looking at means is a reasonable place to start.
Example 3.4 Luria, Delbrück, and variances
In the 1940s it was known that some strains of bacteria were sensitive to a particular virus and would be killed if exposed. Nonetheless, some members of those strains did not die when exposed to the virus and happily proceeded to reproduce. What caused this phenomenon? Was it spontaneous mutation, or was it an adaptation that occurred after exposure to the virus? These two competing theories for the phenomenon led to the same average numbers of resistant bacteria, but to different variances in the numbers of resistant bacteria, with the mutation theory leading to a much higher variance. Experiments showed that the variances were high, as predicted by the mutation theory. This was an experiment where all the important information was in the variance, not in the mean. It was also the beginning of a research collaboration that eventually led to the 1969 Nobel Prize for Luria and Delbrück.
There are many models for the means; we start with two basic models. We have g treatments and N units. Let y_ij be the jth response in the ith treatment group. Thus i runs between 1 and g, and j runs between 1 and n_i in treatment group i. The model of separate group means (the full model) assumes that every treatment has its own mean response μ_i. Combined with the error structure, the separate means model implies that all the data are independent and normally distributed with constant variance, but each treatment group may have its own mean:

    y_ij ~ N(μ_i, σ²).

Alternatively, we may write this model as

    y_ij = μ_i + ε_ij,

where the ε_ij's are "errors" or deviations that are independent, normally distributed with mean zero and variance σ².
The second basic model for the means is the single mean model (the reduced model). The single mean model assumes that all the treatments have the same mean μ. Combined with the error structure, the single mean model implies that the data are independent and normally distributed with mean μ and constant variance:

    y_ij ~ N(μ, σ²).

Alternatively, we may write this model as

    y_ij = μ + ε_ij,

where the ε_ij's are independent, normally distributed errors with mean zero and variance σ².

Note that the single mean model is a special case or restriction of the group means model, namely the case when all of the μ_i's equal each other. Model comparison is easiest when one of the models is a restricted or reduced form of the other.
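One way to internalize the two models is to simulate from them. A minimal Python sketch (ours; the means, sigma, and seed are arbitrary illustrative choices):

    import random

    def simulate(mus, n, sigma, seed=None):
        """Simulate y_ij = mu_i + e_ij with independent N(0, sigma^2)
        errors and n observations per treatment group."""
        rng = random.Random(seed)
        return [[mu + rng.gauss(0, sigma) for _ in range(n)] for mu in mus]

    # Separate means model with g = 3 treatments; setting all the mu_i
    # equal would give the single mean (reduced) model as a special case.
    for i, group in enumerate(simulate([10.0, 12.0, 15.0], n=4, sigma=1.0, seed=2), 1):
        print("treatment", i, [round(y, 2) for y in group])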
We sometimes express the group means μ_i as μ_i = μ* + α_i. The constant μ* is called the overall mean, and α_i is called the ith treatment effect. In this formulation, the single mean model is the situation where all the α_i values are equal to each other: for example, all zero. This introduction of μ* and α_i seems like a needless complication, and at this stage of the game it really is. However, the treatment effect formulation will be extremely useful later when we look at factorial treatment structures.
Note that there is something a bit fishy here. There are g means μ_i, one for each of the g treatments, but we are using g + 1 parameters (μ* and the α_i's) to describe the g means. This implies that μ* and the α_i's are not uniquely determined. For example, if we add 15 to μ* and subtract 15 from all the α_i's, we get the same treatment means μ_i: the 15's just cancel. However, α_i − α_j will always equal μ_i − μ_j, so the differences between treatment effects will be the same no matter how we define μ*.
We got into this embarrassment by imposing an additional mathematical structure (the overall mean μ*) on the set of g group means. We can get out of this embarrassment by deciding what we mean by μ*; once we know μ*, then we can determine the treatment effects α_i by α_i = μ_i − μ*. Alternatively, we can decide what we mean by α_i; then we can get μ* by μ* = μ_i − α_i. These decisions typically take the form of some mathematical restriction on the values for μ* or α_i. Restricting μ* or α_i is really two sides of the same coin.
Mathematically, all choices for defining μ* are equally good. In practice, some choices are more convenient than others. Different statistical software packages use different choices, and different computational formulae use different choices; our major worry is keeping track of which particular choice is in use at any given time. Fortunately, the important things don't depend on which set of restrictions we use. Important things are treatment means, differences of treatment means (or equivalently, differences of α_i's), and comparisons of models.

One classical choice is to define μ* as the mean of the treatment means:

    μ* = Σ_{i=1}^{g} μ_i / g.

For this choice, the sum of the treatment effects is zero:

    Σ_{i=1}^{g} α_i = 0.
An alternative that makes some hand work simpler assumes that μ* is the weighted average of the treatment means, with the sample sizes n_i used as weights:

    μ* = Σ_{i=1}^{g} n_i μ_i / N.

For this choice, the weighted sum of the treatment effects is zero:

    Σ_{i=1}^{g} n_i α_i = 0.

When the sample sizes are equal, these two choices coincide. The computational formulae we give in this book will use the restriction that the weighted sum of the α_i's is zero, because it leads to somewhat simpler hand computations. Some of the formulae in later chapters are only valid when the sample sizes are equal.
Our restriction that the treatment effects α_i add to zero (either weighted or not) implies that the treatment effects are not completely free to vary. We can set g − 1 of them however we wish, but the remaining treatment effect is then determined because it must be whatever value makes the zero sum true. We express this by saying that the treatment effects have g − 1 degrees of freedom.
3.4 Estimating Parameters
Most data analysis these days is done using a computer. Few of us sit down and crunch through the necessary calculations by hand. Nonetheless, knowing the basic formulae and ideas behind our analysis helps us understand and interpret the quantities that come out of the software black box. If we don't understand the quantities printed by the software, we cannot possibly use them to understand the data and answer our questions.

The parameters of our group means model are the treatment means μ_i and the variance σ², plus the derived parameters μ* and the α_i's. We will be computing unbiased estimates of these parameters. "Unbiased" means that when you average the values of the estimates across all potential random errors ε_ij, you get the true parameter values.

It is convenient to introduce a notation to indicate the estimator of a parameter. The usual notation in statistics is to put a hat over the parameter to indicate the estimator; thus μ̂ is an estimator of μ. Because we have parameters that satisfy μ_i = μ* + α_i, we will use estimators that satisfy μ̂_i = μ̂* + α̂_i.
Let's establish some notation for sample averages and the like. The sum of the observations in the ith treatment group is

    y_i• = Σ_{j=1}^{n_i} y_ij.

The mean of the observations in the ith treatment group is

    ȳ_i• = (1/n_i) Σ_{j=1}^{n_i} y_ij = y_i• / n_i.

The overbar indicates averaging, and the dot (•) indicates that we have averaged (or summed) over the indicated subscript. The sum of all observations is

    y•• = Σ_{i=1}^{g} Σ_{j=1}^{n_i} y_ij = Σ_{i=1}^{g} y_i•,

and the grand mean of all observations is

    ȳ•• = (1/N) Σ_{i=1}^{g} Σ_{j=1}^{n_i} y_ij = y•• / N.

The sum of squared deviations of the data from the group means is

    SS_E = Σ_{i=1}^{g} Σ_{j=1}^{n_i} (y_ij − ȳ_i•)².

The SS_E measures total variability in the data around the group means.
Consider first the separate means model, with each treatment group having its own mean $\mu_i$. The natural estimator of $\mu_i$ is $\bar y_{i\bullet}$, the average of the observations in that treatment group. We estimate the expected (or average) response in the ith treatment group by the observed average in the ith treatment group responses. Thus we have
\[ \hat\mu_i = \bar y_{i\bullet} . \]
The sample average is an unbiased estimator of the population average, so $\hat\mu_i$ is an unbiased estimator of $\mu_i$.
In the single mean model, the only parameter in the model for the means is $\mu$. The natural estimator of $\mu$ is $\bar y_{\bullet\bullet}$, the grand mean of all the responses. That is, if we felt that all the data were responses from the same population, we would estimate the mean of that single population by the grand mean of the data. Thus we have
\[ \hat\mu = \bar y_{\bullet\bullet} . \]
The grand mean is an unbiased estimate of $\mu$ when the data all come from a single population.
We use the restriction that $\mu^\star = \sum_i n_i \mu_i / N$; an unbiased estimate of $\mu^\star$ is
\[ \hat\mu^\star = \frac{\sum_{i=1}^g n_i \hat\mu_i}{N} = \frac{\sum_{i=1}^g n_i \bar y_{i\bullet}}{N} = \frac{y_{\bullet\bullet}}{N} = \bar y_{\bullet\bullet} . \]
This is the same as the estimator we use for $\mu$ in the single mean model. Because $\mu$ and $\mu^\star$ are both estimated by the same value, we will drop the $\star$ notation and just use the single notation $\hat\mu$ for both roles.
The treatment effects $\alpha_i$ are
\[ \alpha_i = \mu_i - \mu^\star ; \]
these can be estimated by
\[ \hat\alpha_i = \hat\mu_i - \hat\mu = \bar y_{i\bullet} - \bar y_{\bullet\bullet} . \]
These treatment effects and estimates satisfy the restriction
\[ \sum_{i=1}^g n_i \alpha_i = \sum_{i=1}^g n_i \hat\alpha_i = 0 . \]
The only parameter remaining to estimate is $\sigma^2$. Our estimator of $\sigma^2$ is
\[ \hat\sigma^2 = MS_E = \frac{SS_E}{N-g} = \frac{\sum_{i=1}^g \sum_{j=1}^{n_i} (y_{ij} - \bar y_{i\bullet})^2}{N-g} . \]
We sometimes use the notation $s$ in place of $\hat\sigma$ in analogy with the sample standard deviation $s$. This estimator $\hat\sigma^2$ is unbiased for $\sigma^2$ in both the separate means and single means models. (Note that $\hat\sigma$ is not unbiased for $\sigma$.)
The deviations from the group mean $y_{ij} - \bar y_{i\bullet}$ add to zero in any treatment group, so that any $n_i - 1$ of them determine the remaining one. Put another way, there are $n_i - 1$ degrees of freedom for error in each group, or $N - g = \sum_i (n_i - 1)$ degrees of freedom for error for the experiment. There are thus N - g degrees of freedom for our estimate $\hat\sigma^2$. This is analogous to the formula $n_1 + n_2 - 2$ for the degrees of freedom in a two-sample t-test. Another way to think of N - g is the number of data values minus the number of mean parameters estimated.
Model            Parameter     Estimator
Single mean      $\mu$         $\bar y_{\bullet\bullet}$
                 $\sigma^2$    $\sum_{i=1}^g \sum_{j=1}^{n_i} (y_{ij} - \bar y_{i\bullet})^2 / (N-g)$
Separate means   $\mu$         $\bar y_{\bullet\bullet}$
                 $\mu_i$       $\bar y_{i\bullet}$
                 $\alpha_i$    $\bar y_{i\bullet} - \bar y_{\bullet\bullet}$
                 $\sigma^2$    $\sum_{i=1}^g \sum_{j=1}^{n_i} (y_{ij} - \bar y_{i\bullet})^2 / (N-g)$

Display 3.1: Point estimators in the CRD.
The formulae for these estimators are collected in Display 3.1. The next example illustrates their use.
Example 3.5 Resin lifetimes, continued
Most of the work for computing point estimates is done once we get the average responses overall and in each treatment group. Using the resin lifetime data from Table 3.1, we get the following means and counts:

Treatment (°C)   175     194     213     231     250     All data
Average          1.933   1.629   1.378   1.194   1.057   1.465
Count            8       8       8       7       6       37
The estimates $\hat\mu_i$ and $\hat\mu$ can be read from the table:
\[ \hat\mu_1 = 1.933 \quad \hat\mu_2 = 1.629 \quad \hat\mu_3 = 1.378 \quad \hat\mu_4 = 1.194 \quad \hat\mu_5 = 1.057 \quad \hat\mu = 1.465 \]
Get the $\hat\alpha_i$ values by subtracting the grand mean from the group means:
\[ \hat\alpha_1 = 1.932 - 1.465 = .467 \qquad \hat\alpha_2 = 1.629 - 1.465 = .164 \]
\[ \hat\alpha_3 = 1.378 - 1.465 = -.088 \qquad \hat\alpha_4 = 1.194 - 1.465 = -.271 \]
\[ \hat\alpha_5 = 1.057 - 1.465 = -.408 \]
Notice that $\sum_{i=1}^g n_i \hat\alpha_i = 0$ (except for roundoff error).
The computation for $\hat\sigma^2$ is a bit more work, because we need to compute the $SS_E$. For the resin data, $SS_E$ is
\[
SS_E = (2.04 - 1.933)^2 + (1.91 - 1.933)^2 + \cdots + (1.90 - 1.933)^2
     + (1.66 - 1.629)^2 + (1.71 - 1.629)^2 + \cdots + (1.66 - 1.629)^2
     + (1.53 - 1.378)^2 + (1.54 - 1.378)^2 + \cdots + (1.38 - 1.378)^2
     + (1.15 - 1.194)^2 + (1.22 - 1.194)^2 + \cdots + (1.17 - 1.194)^2
     + (1.26 - 1.057)^2 + (.83 - 1.057)^2 + \cdots + (1.06 - 1.057)^2
     = .29369
\]
Thus we have
\[ \hat\sigma^2 = SS_E/(N - g) = .29369/(37 - 5) = .009178 . \]
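The arithmetic in this example is easy to script. Below is a minimal sketch in plain Python (an illustration, not the book's software) of the point-estimate formulas in Display 3.1; the three-treatment data set in the example call is hypothetical, but the resin lifetimes of Table 3.1 could be supplied the same way.

def crd_estimates(groups):
    """Point estimates for a CRD under the weighted sum-to-zero restriction.

    groups: dict mapping treatment label -> list of responses y_ij.
    Returns grand mean, group means, effects alpha-hat_i, SS_E, and MS_E.
    """
    N = sum(len(y) for y in groups.values())
    g = len(groups)
    grand = sum(sum(y) for y in groups.values()) / N          # y-bar..
    means = {t: sum(y) / len(y) for t, y in groups.items()}   # y-bar_i.
    effects = {t: means[t] - grand for t in groups}           # alpha-hat_i
    ss_e = sum((yij - means[t]) ** 2
               for t, y in groups.items() for yij in y)
    ms_e = ss_e / (N - g)                                     # sigma-hat^2
    return grand, means, effects, ss_e, ms_e

# Hypothetical three-treatment example, just to show the mechanics:
data = {"A": [3.1, 2.9, 3.4], "B": [3.8, 4.0, 3.7], "C": [2.5, 2.8, 2.6]}
print(crd_estimates(data))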
A point estimate gives our best guess as to the value of a parameter. A confidence interval gives a plausible range for the parameter, that is, a set of parameter values that are consistent with the data. Confidence intervals for $\mu$ and the $\mu_i$'s are useful and straightforward to compute. Confidence intervals for the $\alpha_i$'s are only slightly more trouble to compute, but are perhaps less useful because there are several potential ways to define the $\alpha$'s. Differences between $\alpha_i$'s, or equivalently, differences between $\mu_i$'s, are extremely useful; these will be considered in depth in Chapter 4. Confidence intervals for the error variance $\sigma^2$ will be considered in Chapter 11.
Confidence intervals for parameters in the mean structure have the general form:

unbiased estimate ± multiplier × (estimated) standard error of estimate.

The standard errors for the averages $\bar y_{\bullet\bullet}$ and $\bar y_{i\bullet}$ are $\sigma/\sqrt{N}$ and $\sigma/\sqrt{n_i}$ respectively. We do not know $\sigma$, so we use $\hat\sigma = s = \sqrt{MS_E}$ as an estimate and obtain $s/\sqrt{N}$ and $s/\sqrt{n_i}$ as estimated standard errors for $\bar y_{\bullet\bullet}$ and $\bar y_{i\bullet}$.

For an interval with coverage $1 - \mathcal{E}$, we use the upper $\mathcal{E}/2$ percent point of the t-distribution with N - g degrees of freedom as the multiplier. This is denoted $t_{\mathcal{E}/2, N-g}$. We use the $\mathcal{E}/2$ percent point because we are constructing a two-sided confidence interval, and we are allowing error rates of $\mathcal{E}/2$ on both the low and high ends. For example, we use the upper 2.5% point (or 97.5% cumulative point) of t for 95% coverage. The degrees of freedom for the t-distribution come from $\hat\sigma^2$, our estimate of the error variance. For the CRD, the degrees of freedom are N - g, the number of data points minus the number of treatment groups.
Parameter     Estimator                                       Standard Error
$\mu$         $\bar y_{\bullet\bullet}$                       $s/\sqrt{N}$
$\mu_i$       $\bar y_{i\bullet}$                             $s/\sqrt{n_i}$
$\alpha_i$    $\bar y_{i\bullet} - \bar y_{\bullet\bullet}$   $s\sqrt{1/n_i - 1/N}$

Display 3.2: Standard errors of point estimators in the CRD.
The standard error of an estimated treatment effect $\hat\alpha_i$ is $\sigma\sqrt{1/n_i - 1/N}$. Again, we must use an estimate of $\sigma$, yielding $s\sqrt{1/n_i - 1/N}$ for the estimated standard error. Keep in mind that the treatment effects $\hat\alpha_i$ are negatively correlated, because they must add to zero.
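The interval recipes in Display 3.2 translate directly to code. This is a hedged Python sketch, assuming SciPy is available for the t percent point; the values of s, n_i, N, and g would come from the estimates computed above.

import math
from scipy import stats

def ci_group_mean(ybar_i, s, n_i, N, g, coverage=0.95):
    """CI for mu_i: ybar_i +/- t * s/sqrt(n_i), with N-g error df."""
    tmult = stats.t.ppf(1 - (1 - coverage) / 2, N - g)
    half = tmult * s / math.sqrt(n_i)
    return ybar_i - half, ybar_i + half

def se_effect(s, n_i, N):
    """Estimated standard error of alpha-hat_i: s * sqrt(1/n_i - 1/N)."""
    return s * math.sqrt(1 / n_i - 1 / N)

# Resin data, treatment 1: ybar = 1.933, s = sqrt(.009178), n_1 = 8:
print(ci_group_mean(1.933, math.sqrt(0.009178), n_i=8, N=37, g=5))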
3.5 Comparing Models: The Analysis of Variance
In the standard analysis of a CRD, we are interested in the mean responses of the treatment groups. One obvious place to begin is to decide whether the means are all the same, or if some of them differ. Restating this question in terms of models, we ask whether the data can be adequately described by the model of a single mean, or if we need the model of separate treatment group means. Recall that the single mean model is a special case of the group means model. That is, we can choose the parameters in the group means model so that we actually get the same mean for all groups. The single mean model is said to be a reduced or restricted version of the group means model. Analysis of Variance, usually abbreviated ANOVA, is a method for comparing the fit of two models, one a reduced version of the other.

Strictly speaking, ANOVA is an arithmetic procedure for partitioning the variability in a data set into bits associated with different mean structures plus a leftover bit. (It's really just the Pythagorean Theorem, though we've chosen our right triangles pretty carefully in N-dimensional space.) When in addition the error structure for the data is independent normal with constant variance, we can use the information provided by an ANOVA to construct statistical tests comparing the different mean structures or models for means that are represented in the ANOVA. The link between the ANOVA decomposition for the variability and tests for models is so tight, however, that we sometimes speak of testing via ANOVA even though the test is not really part of the ANOVA.
Our approach to model comparison is Occam's Razor: we use the simplest model that is consistent with the data. We only move to the more complicated model if the data indicate that the more complicated model is needed.

How is this need indicated? The residuals $r_{ij}$ are the differences between the data $y_{ij}$ and the fitted mean model. For the single mean model, the fitted values are all $\bar y_{\bullet\bullet}$, so the residuals are $r_{ij} = y_{ij} - \bar y_{\bullet\bullet}$; for the separate means model, the fitted values are the group means $\bar y_{i\bullet}$, so the residuals are $r_{ij} = y_{ij} - \bar y_{i\bullet}$. We measure the closeness of the data to a fitted model by looking at the sum of squared residuals (SSR). The point estimators we have chosen for the mean parameters in our models are least squares estimators, which implies that they are the parameter estimates that make these sums of squared residuals as small as possible.

The sum of squared residuals for the separate means model is usually smaller than that for the single mean model; it can never be larger. We will conclude that the more complicated separate means model is needed if its SSR is sufficiently less than that of the single mean model. We still need to construct a criterion for deciding when the SSR has been reduced sufficiently.
One way of constructing a criterion to compare models is via a statistical test, with the null hypothesis that the single mean model is true versus the alternative that the separate means model is true. In common practice, the null and alternative hypotheses are usually expressed in terms of parameters rather than models. Using the $\mu_i = \mu + \alpha_i$ notation for group means, the null hypothesis $H_0$ of a single mean can be expressed as $H_0: \alpha_i = 0$ for all i, and the alternative can be expressed as $H_A: \alpha_i \neq 0$ for some i. Note that since we have assumed that $\sum n_i \alpha_i = 0$, one nonzero $\alpha_i$ implies that the $\alpha_i$'s are not all equal to each other. The alternative hypothesis does not mean that all the $\alpha_i$'s are different, just that they are not all the same.
The model comparison point of view opts for the separate means model if that model has sufficiently less residual variation, while the parameter testing view opts for the separate means model if there is sufficiently great variation between the observed group means. These seem like different ideas, but we will see in the ANOVA decomposition that they are really saying the same thing, because less residual variation implies more variation between group means when the total variation is fixed.
3.6 Mechanics of ANOVA
ANOVA works by partitioning the total variability in the data into parts that mimic the model. The separate means model says that the data are not all equal to the grand mean because of treatment effects and random error:
\[ y_{ij} = \mu_i + \epsilon_{ij} . \]
ANOVA decomposes the data similarly into a part that deals with group means, and a part that deals with deviations from group means:
\[ y_{ij} - \bar y_{\bullet\bullet} = (\bar y_{i\bullet} - \bar y_{\bullet\bullet}) + (y_{ij} - \bar y_{i\bullet}) = \hat\alpha_i + r_{ij} . \]
The difference on the left is the deviation of a response from the grand mean. If you square all such differences and add them up you get $SS_T$, the total sum of squares.¹

The first difference on the right is the estimated treatment effect $\hat\alpha_i$. If you squared all these (one for each of the N data values) and added them up, you would get $SS_{Trt}$, the treatment sum of squares:
\[ SS_{Trt} = \sum_{i=1}^g \sum_{j=1}^{n_i} (\bar y_{i\bullet} - \bar y_{\bullet\bullet})^2 = \sum_{i=1}^g n_i (\bar y_{i\bullet} - \bar y_{\bullet\bullet})^2 = \sum_{i=1}^g n_i \hat\alpha_i^2 . \]
I think of this as
1. Square the treatment effect,
2. Multiply by the number of units receiving that effect, and
3. Add over the levels of the effect.
This three-step pattern will appear again frequently.
The second difference on the right is the ijth residual from the model, which gives us some information about $\epsilon_{ij}$. If you squared and added the $r_{ij}$'s you would get $SS_E$, the error sum of squares:
\[ SS_E = \sum_{i=1}^g \sum_{j=1}^{n_i} (y_{ij} - \bar y_{i\bullet})^2 . \]
This is the same $SS_E$ that we use in estimating $\sigma^2$.
¹ For pedants in the readership, this quantity is the "corrected" total sum of squares. There is also an uncorrected total sum of squares. The uncorrected total is the sum of the squared observations; the uncorrected total sum of squares equals $SS_T$ plus $N \bar y_{\bullet\bullet}^2$. In this book, total sum of squares will mean corrected total sum of squares.
\[ SS_{Trt} = \sum_{i=1}^g n_i \hat\alpha_i^2 \]
\[ SS_E = \sum_{i=1}^g \sum_{j=1}^{n_i} (y_{ij} - \bar y_{i\bullet})^2 \]
\[ SS_T = SS_{Trt} + SS_E \]

Display 3.3: Sums of squares in the CRD
Recall that
\[ y_{ij} - \bar y_{\bullet\bullet} = \hat\alpha_i + r_{ij} \]
so that
\[ (y_{ij} - \bar y_{\bullet\bullet})^2 = \hat\alpha_i^2 + r_{ij}^2 + 2 \hat\alpha_i r_{ij} . \]
Adding over i and j we get
\[ SS_T = SS_{Trt} + SS_E + 2 \sum_{i=1}^g \sum_{j=1}^{n_i} \hat\alpha_i r_{ij} . \]
We can show (see Question 3.2) that the sum of the cross-products is zero, so that
\[ SS_T = SS_{Trt} + SS_E . \]
Now we can see the link between testing equality of group means and comparing models via SSR. For a given data set (and thus a fixed $SS_T$), more variation between the group means implies a larger $SS_{Trt}$, which in turn implies that the $SS_E$ must be smaller, which is the SSR for the separate means model.

Display 3.3 summarizes the sums of squares formulae for the CRD. I should mention that there are numerous "calculator" or "shortcut" formulae for computing sums of squares quantities. In my experience, these formulae are more difficult to remember than the ones given here, provide little insight into what the ANOVA is doing, and are in some circumstances more prone to roundoff errors. I do not recommend them.
ANOVA computations are summarized in a table with columns for source of variation, degrees of freedom, sum of squares, mean squares, and F-statistics. There is a row in the table for every source of variation in the full model. In the CRD, the sources of variation are treatments and errors, sometimes called between- and within-groups variation. Some tables are written with rows for either or both of the grand mean and the total variation, though these rows do not affect the usual model comparisons.

The following is a generic ANOVA table for a CRD.

Source       DF      SS            MS                   F
Treatments   g - 1   $SS_{Trt}$    $SS_{Trt}/(g-1)$     $MS_{Trt}/MS_E$
Error        N - g   $SS_E$        $SS_E/(N-g)$

The degrees of freedom are g - 1 for treatments and N - g for error. We saw the rationale for these in Section 3.4. The formulae for sums of squares were given above, and mean squares are always sums of squares divided by their degrees of freedom. The F-statistic is the ratio of two mean squares, the numerator mean square for a source of variation that we wish to assess, and a denominator (or error) mean square that estimates error variance.
We use the F-statistic (or F-ratio) in the ANOVA table to make a test of the null hypothesis that all the treatment means are the same (all the $\alpha_i$ values are zero) versus the alternative that some of the treatment means differ (some of the $\alpha_i$ values are nonzero). When the null hypothesis is true, the F-statistic is about 1, give or take some random variation; when the alternative is true, the F-statistic tends to be bigger than 1. To complete the test, we need to be able to tell how big is too big for the F-statistic. If the null hypothesis is true and our model and distributional assumptions are correct, then the F-statistic follows the F-distribution with g - 1 and N - g degrees of freedom. Note that the F-distribution has two degrees of freedom, one from the numerator mean square and one from the denominator mean square.
To do the test, we compute the F-statistic and the degrees of freedom, and then we compute the probability of observing an F-statistic as large or larger than the one we observed, assuming all the $\alpha_i$'s were zero. This probability is called the p-value or observed significance level of the test, and is computed as the area under an F-distribution from the observed F-statistic on to the right, when the F-distribution has degrees of freedom equal to the degrees of freedom for the numerator and denominator mean squares. This p-value is usually obtained from a table of the F-distribution (for example, Appendix Table D.5) or via the use of statistical software.
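The whole ANOVA table is easy to compute in software. The sketch below is a minimal Python version (an illustration, not the book's code); it assumes SciPy is available for the F tail area, and reuses the grouped-data layout of the earlier sketches.

from scipy import stats

def one_way_anova(groups):
    """ANOVA quantities for a CRD; groups maps treatment -> responses."""
    N = sum(len(y) for y in groups.values())
    g = len(groups)
    grand = sum(sum(y) for y in groups.values()) / N
    ss_trt = sum(len(y) * (sum(y) / len(y) - grand) ** 2
                 for y in groups.values())
    ss_e = sum((yij - sum(y) / len(y)) ** 2
               for y in groups.values() for yij in y)
    ms_trt, ms_e = ss_trt / (g - 1), ss_e / (N - g)
    fstat = ms_trt / ms_e
    pvalue = stats.f.sf(fstat, g - 1, N - g)   # area to the right of F
    return {"df": (g - 1, N - g), "SS": (ss_trt, ss_e),
            "MS": (ms_trt, ms_e), "F": fstat, "p": pvalue}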
Small values of the p-value are evidence that the null may be incorrect: either we have seen a rare event (big F-statistics when the null is actually true, leading to a small p-value), or an assumption we used to compute the p-value is wrong, namely the assumption that all the $\alpha_i$'s are zero. Given the choice of unlucky or incorrect assumption, most people choose incorrect assumption.
Table 3.2: Approximate Type I error probabilities for
different p-values using the Sellke et al. lower bound.
p .05 .01 .001 .0001
P(p) .29 .11 .018 .0025
We have now changed the question from "How big is too big an F?" to "How small is too small a p-value?" By tradition, p-values less than .05 are termed statistically significant, and those less than .01 are termed highly statistically significant. These values are reasonable (one chance in 20, one chance in 100), but there is really no reason other than tradition to prefer them over other similar values, say one chance in 30 and one chance in 200. It should also be noted that a person using the traditional values would declare one test with p-value of .049 to be significant and another test with a p-value of .051 not to be significant, but the two tests are really giving virtually identical results. Thus I prefer to report the p-value itself rather than simply report significance or lack thereof.
As with any test, remember that statistical significance is not the same as real world importance. A tiny p-value may be obtained with relatively small $\alpha_i$'s if the sample size is large enough or $\sigma^2$ is small enough. Likewise, large important differences between means may not appear significant if the sample size is small or the error variance large.
It is also important not to overinterpret the p-value. Reported p-values of .05 or .01 carry the magnificent labels of statistically significant or highly statistically significant, but they actually are not terribly strong evidence against the null. What we would really like to know is the probability that rejecting the null is an error; the p-value does not give us that information. Sellke, Bayarri, and Berger (1999) define an approximate lower bound on this probability. They call their bound a "calibrated" p-value, but I do not like the name because their quantity is not really a p-value. Suppose that before seeing any data you thought that the null and alternative each had probability .5 of being true. Then for p-values less than $e^{-1} \approx .37$, the Sellke et al. approximate error probability is
\[ P(p) = \frac{-e\,p\,\log(p)}{1 - e\,p\,\log(p)} . \]
The interpretation of the approximate error probability P(p) is that having seen a p-value of p, the probability that rejecting the null hypothesis is an error is at least P(p). Sellke et al. show that this lower bound is pretty good in a wide variety of problems. Table 3.2 shows that the probability that rejection is a Type I error is more than .1, even for a p-value of .01.
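As a check on Table 3.2, here is a minimal Python sketch (not from the text) of the bound; log is the natural logarithm, and the formula is valid for p < 1/e.

import math

def sellke_bound(p):
    """Sellke et al. approximate lower bound on the error probability."""
    b = -math.e * p * math.log(p)   # -e p log(p), positive for p < 1
    return b / (1 + b)              # equals -e p log(p)/(1 - e p log(p))

for p in [0.05, 0.01, 0.001, 0.0001]:
    print(p, round(sellke_bound(p), 4))  # gives .29, .11, .018, .0025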
Listing 3.1: Minitab output for resin lifetimes.
One-way Analysis of Variance
Analysis of Variance for Lifetime
Source DF SS MS F P [1]
Temp 4 3.53763 0.88441 96.36 0.000
Error 32 0.29369 0.00918
Total 36 3.83132
Individual 95% CIs For Mean [2]
Based on Pooled StDev
Level N Mean StDev --------+---------+---------+--------
1 8 1.9325 0.0634 (-*--)
2 8 1.6288 0.1048 (-*--)
3 8 1.3775 0.1071 (-*-)
4 7 1.1943 0.0458 (--*-)
5 6 1.0567 0.1384 (-*--)
--------+---------+---------+--------
Pooled StDev = 0.0958 1.20 1.50 1.80
Example 3.6 Resin lifetimes, continued
For our resin data, the treatment sum of squares is
\[ SS_{Trt} = \sum_{i=1}^g n_i \hat\alpha_i^2 = 8 \times .467^2 + 8 \times .164^2 + 8 \times (-.088)^2 + 7 \times (-.271)^2 + 6 \times (-.408)^2 = 3.5376 . \]
We have g = 5 treatments so there are g - 1 = 4 degrees of freedom between treatments. We computed the $SS_E$ in Example 3.5; it was .29369 with 32 degrees of freedom. The ANOVA table is

Source       DF   SS       MS          F
treatments   4    3.5376   .88441      96.4
error        32   .29369   .0091779
total        36   3.8313

The F-statistic is about 96 with 4 and 32 degrees of freedom. There is essentially no probability under the F-curve with 4 and 32 degrees of freedom to the right of 96. (There is only .00001 probability to the right of 11.)
Listing 3.2: SAS output for resin lifetimes
Analysis of Variance Procedure
Dependent Variable: LIFETIME
Sum of Mean
Source DF Squares Square F Value Pr > F [1]
Model 4 3.53763206 0.88440802 96.36 0.0001
Error 32 0.29369226 0.00917788
Corrected Total 36 3.83132432
R-Square C.V. Root MSE LIFETIME Mean
0.923344 6.538733 0.09580 1.46514
Level of -----------LIFETIME---------- [2]
TEMPER N Mean SD
1 8 1.93250000 0.06341473
2 8 1.62875000 0.10480424
3 8 1.37750000 0.10713810
4 7 1.19428571 0.04577377
5 6 1.05666667 0.13837148
Thus the p-value for this test is essentially zero, and we would conclude that not all the treatments yield the same mean lifetime. From a practical point of view, the experimenters already knew this; the experiment was run to determine the nature of the dependence of lifetime on temperature, not whether there was any dependence.
Different statistics software packages give slightly different output for the ANOVA of the resin lifetime data. For example, Listing 3.1 gives Minitab ANOVA output. In addition to the ANOVA table [1], the standard Minitab output includes a table of treatment means and a plot of 95% confidence intervals for those means [2]. Listing 3.2 gives SAS output (edited to save space) for these data [1]. SAS does not automatically print group means, but you can request them as shown here [2].
There is a heuristic for the degrees-of-freedom formulae. Degrees of freedom for a model count the number of additional parameters used for the mean structure when moving from the next simpler model to this model. For example, the degrees of freedom for treatment are g - 1. The next simpler model is the model of a single mean for all treatments; the full model has a different mean for each of the g treatments. That is g - 1 more parameters. Alternatively, look at the $\alpha_i$'s. Under the null, they are all zero. Under the alternative, they may be nonzero, but only g - 1 of them can be set freely, because the last one is then set by the restriction that their weighted sum must be zero. Degrees of freedom for error are the number of data less the number of (mean) parameters estimated.
3.7 Why ANOVA Works
The mean square for error is a random variable; it depends on the random errors in the data. If we repeated the experiment, we would get different random errors and thus a different mean square for error. However, the expected value of the mean square for error, averaged over all the possible outcomes of the random errors, is the variance of the random errors $\sigma^2$. Thus, the mean square for error estimates the error variance, no matter what the values of the $\alpha_i$'s.
The mean square for treatments is also a random variable, but the $MS_{Trt}$ has expectation:
\[ E(MS_{Trt}) = EMS_{Trt} = \sigma^2 + \sum_{i=1}^g n_i \alpha_i^2 / (g-1) . \]
The important things to get from this expression are

1. When all of the $\alpha_i$'s are zero, the mean square for treatments also estimates $\sigma^2$.

2. When some of the $\alpha_i$'s are nonzero, the mean square for treatments tends to be bigger than $\sigma^2$.
When the null hypothesis is true, both $MS_{Trt}$ and $MS_E$ vary around $\sigma^2$, so their ratio (the F-statistic) is about one, give or take some random variation. When the null hypothesis is false, $MS_{Trt}$ tends to be bigger than $\sigma^2$, and the F-statistic tends to be bigger than one. We thus reject the null hypothesis for sufficiently large values of the F-statistic.
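If the expected mean squares seem abstract, a small simulation can make them concrete. The sketch below (not from the text) simulates a balanced CRD with normal errors; the treatment effects, group size, and error standard deviation are hypothetical inputs.

import random

def simulate_ms(alphas, n, sigma, reps=2000):
    """Average MS_Trt and MS_E over many simulated balanced CRDs."""
    g = len(alphas)
    N = n * g
    ms_trt = ms_e = 0.0
    for _ in range(reps):
        data = [[a + random.gauss(0, sigma) for _ in range(n)]
                for a in alphas]
        grand = sum(map(sum, data)) / N
        means = [sum(y) / n for y in data]
        ms_trt += sum(n * (m - grand) ** 2 for m in means) / (g - 1) / reps
        ms_e += sum((yij - m) ** 2
                    for y, m in zip(data, means) for yij in y) / (N - g) / reps
    return ms_trt, ms_e

print(simulate_ms([0, 0, 0], n=8, sigma=1))     # both near sigma^2 = 1
print(simulate_ms([-.5, 0, .5], n=8, sigma=1))  # MS_Trt near 1 + 4/2 = 3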
3.8 Back to Model Comparison
The preceding section described Analysis of Variance as a test of the null hypothesis that all the $\alpha_i$ values are zero. Another way to look at ANOVA is as a comparison of two models for the data. The reduced model is the model that all treatments have the same expected value (that is, the $\alpha_i$ values are all zero); the full model allows the treatments to have different expected values. From this point of view, we are not testing whether a set of parameters is all zero; we are comparing the adequacy of two different models for the mean structure.
Analysis of Variance uses sums of squared deviations from a model, just as sample standard deviations use squared deviations from a sample mean. For the reduced model (null hypothesis), the estimated model is $\hat\mu = \bar y_{\bullet\bullet}$. For the data value $y_{ij}$, the residual is
\[ r_{ij} = y_{ij} - \hat\mu = y_{ij} - \bar y_{\bullet\bullet} . \]
The residual sum of squares for the reduced model is then
\[ SSR_0 = \sum_{ij} r_{ij}^2 = \sum_{ij} (y_{ij} - \bar y_{\bullet\bullet})^2 . \]
For the full model (alternative hypothesis), the estimated model is $\hat\mu_i = \bar y_{i\bullet}$, and the residuals are
\[ r_{ij} = y_{ij} - \hat\mu_i = y_{ij} - \bar y_{i\bullet} . \]
The residual sum of squares for the full model is then
\[ SSR_A = \sum_{ij} r_{ij}^2 = \sum_{ij} (y_{ij} - \bar y_{i\bullet})^2 . \]
$SSR_A$ can never be bigger than $SSR_0$ and will almost always be smaller. We would prefer the full model if $SSR_A$ is sufficiently smaller than $SSR_0$.
How does this terminology for ANOVA mesh with what we have already seen? The residual sum of squares from the full model, $SSR_A$, is the error sum of squares $SS_E$ in the usual formulation. The residual sum of squares from the reduced model, $SSR_0$, is the total sum of squares $SS_T$ in the usual formulation. The difference $SSR_0 - SSR_A$ is equal to the treatment sum of squares $SS_{Trt}$. Thus the treatment sum of squares is the additional amount of variation in the data that can be explained by using the more complicated full model instead of the simpler reduced model.
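In code, the model-comparison view is just two residual sums of squares. This minimal sketch (an illustration, not the book's software) uses the grouped-data layout from the earlier sketches; the difference of the two SSRs recovers SS_Trt.

def ssr_reduced(groups):
    """SSR_0: squared deviations from the single fitted grand mean."""
    N = sum(len(y) for y in groups.values())
    grand = sum(sum(y) for y in groups.values()) / N
    return sum((yij - grand) ** 2 for y in groups.values() for yij in y)

def ssr_full(groups):
    """SSR_A: squared deviations from the group means (equals SS_E)."""
    return sum((yij - sum(y) / len(y)) ** 2
               for y in groups.values() for yij in y)

# ss_trt = ssr_reduced(data) - ssr_full(data)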
This idea of comparing models instead of testing hypotheses about parameters is a fairly subtle distinction, and here is why the distinction is important: in our heart of hearts, we almost never believe that the null hypothesis could be true. We usually believe that at some level of precision, there is a difference between the mean responses of the treatments. So why the charade of testing the null hypothesis?

The answer is that we are choosing a model for the data from a set of potential models. We want a model that is as simple as possible yet still consistent with the data. A more realistic null hypothesis is that the means are so close to being equal that the differences are negligible. When we reject the null hypothesis we are making the decision that the data are demonstrably inconsistent with the simpler model, the differences between the means are not negligible, and the more complicated model is required. Thus we use the F-test to guide us in our choice of model. This distinction between testing hypotheses on parameters and selecting models will become more important later.

[Figure 3.3: Side-by-side plot for resin lifetime data, using MacAnova. The vertical axis shows effect or residual (about -0.4 to 0.5); one column shows the treatment effects for the five temperature levels, the other a box-plot of the residuals.]
3.9 Side-by-Side Plots
Hoaglin, Mosteller, and Tukey (1991) introduce the side-by-side plot as a method for visualizing treatment effects and residuals. Figure 3.3 shows a side-by-side plot for the resin lifetime data of Example 3.2. We plot the estimated treatment effects $\hat\alpha_i$ in one column and the residuals $r_{ij}$ in a second column. (There will be more columns in more complicated models we will see later.) The vertical scale is in the same units as the response. In this plot, we have used a box-plot for the residuals rather than plot them individually; this will usually be more understandable when there are relatively many points to be put in a single column.

What we see from the side-by-side plot is that the treatment effects are large compared to the size of the residuals. We were also able to see this in the parallel box-plots in the exploratory analysis, but the side-by-side plots will generalize better to more complicated models.
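The book draws Figure 3.3 with MacAnova. As a rough substitute, here is a hedged sketch using matplotlib (an assumption, not software used in the text) that draws the two columns from the effects and residuals computed earlier.

import matplotlib.pyplot as plt

def side_by_side(effects, residuals):
    """One column of treatment effects, one box-plot of residuals."""
    fig, ax = plt.subplots()
    ax.scatter([1] * len(effects), effects)  # effects plotted individually
    ax.boxplot(residuals, positions=[2])     # residuals summarized by a box
    ax.set_xticks([1, 2])
    ax.set_xticklabels(["Effects", "Residuals"])
    ax.axhline(0, linewidth=0.5)
    ax.set_ylabel("Effect or residual")
    plt.show()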
3.10 Dose-Response Modeling
In some experiments, the treatments are associated with numerical levels such as drug dose, baking time, or reaction temperature. We will refer to such levels as doses, no matter what they actually are, and the numerical value of the dose for treatment i will be denoted $z_i$. When we have numerical doses, we may reexpress the treatment means as a function of the dose $z_i$:
\[ \mu + \alpha_i = f(z_i; \theta) , \]
where $\theta$ is some unknown parameter of the function. For example, we could express the mean weight of yellow birch seedlings as a function of the pH of acid rain.
The most commonly used functions f are polynomials in the dose $z_i$:
\[ \mu + \alpha_i = \beta_0 + \beta_1 z_i + \beta_2 z_i^2 + \cdots + \beta_{g-1} z_i^{g-1} . \]
We use the power g - 1 because the means at g different doses determine a polynomial of order g - 1. Polynomials are used so often because they are simple and easy to understand; they are not always the most appropriate choice.
If we know the polynomial coefficients $\beta_0, \beta_1, \ldots, \beta_{g-1}$, then we can determine the treatment means $\mu + \alpha_i$, and vice versa. If we know the polynomial coefficients except for the constant $\beta_0$, then we can determine the treatment effects $\alpha_i$, and vice versa. The g - 1 parameters $\beta_1$ through $\beta_{g-1}$ in this full polynomial model correspond to the g - 1 degrees of freedom between the treatment groups. Thus polynomials in dose are not inherently better or worse than the treatment effects model, just another way to describe the differences between means.
Polynomial modeling is useful in two contexts. First, if only a few of the polynomial coefficients are needed (that is, the others can be set to zero without significantly decreasing the quality of the fit), then this reduced polynomial model represents a reduction in the complexity of our model. For example, learning that the response is linear or quadratic in dose is useful, whereas a polynomial of degree six or seven will be difficult to comprehend (or sell to anyone else). Second, if we wish to estimate the response at some dose other than one used in the experiment, the polynomial model provides a mechanism for generating the estimates. Note that these estimates may be poor if we are extrapolating beyond the range of the doses in our experiment or if the degree of the polynomial is high. High-order polynomials will fit our observed treatment means exactly, but these high-order polynomials can have bizarre behavior away from our data points.
Consider a sequence of regression models for our data, regressing the responses on dose, dose squared, and so on. The first model just includes the constant $\beta_0$; that is, it fits a single value for all responses. The second model includes the constant $\beta_0$ and a linear term $\beta_1 z_i$; this model fits the responses as a simple linear regression in dose. The third model includes the constant $\beta_0$, a linear term $\beta_1 z_i$, and the quadratic term $\beta_2 z_i^2$; this model fits the responses as a quadratic function (parabola) of dose. Additional models include additional powers of dose up to g - 1.
Let $SSR_k$ be the residual sum of squares for the model that includes powers up to k, for k = 0, ..., g - 1. Each successive model will explain a little more of the variability between treatments, so that $SSR_k > SSR_{k+1}$. When we arrive at the full polynomial model, we will have explained all of the between-treatment variability using polynomial terms; that is, $SSR_{g-1} = SS_E$. The "linear" sum of squares is the reduction in residual variability going from the constant model to the model with the linear term:
\[ SS_{linear} = SS_1 = SSR_0 - SSR_1 . \]
Similarly, the "quadratic" sum of squares is the reduction in residual variability going from the linear model to the quadratic model,
\[ SS_{quadratic} = SS_2 = SSR_1 - SSR_2 , \]
and so on through the remaining orders.
Each of these polynomial sums of squares has 1 degree of freedom, because each is the result of adding one more parameter $\beta_k$ to the model for the means. Thus their mean squares are equal to their sums of squares. In a model with terms up through order k, we can test the null hypothesis that $\beta_k = 0$ by forming the F-statistic $SS_k/MS_E$, and comparing it to an F-distribution with 1 and N - g degrees of freedom.

One method for choosing a polynomial model is to choose the smallest order such that no significant terms are excluded. (More sophisticated model selection methods exist.)
Listing 3.3: MacAnova output for resin lifetimes polynomial model.
DF SS MS F P-value [1]
CONSTANT 1 79.425 79.425 8653.95365 0
{temperature} 1 3.4593 3.4593 376.91283 0
{(temperature)^2} 1 0.078343 0.078343 8.53610 0.0063378
{(temperature)^3} 1 1.8572e-05 1.8572e-05 0.00202 0.9644
{(temperature)^4} 1 8.2568e-06 8.2568e-06 0.00090 0.97626
ERROR1 32 0.29369 0.0091779
CONSTANT [2]
(1) 0.96995
{temperature}
(1) 0.075733
{(temperature)^2}
(1) -0.00076488
{(temperature)^3}
(1) 2.6003e-06
{(temperature)^4}
(1) -2.9879e-09
DF SS MS F P-value [3]
CONSTANT 1 79.425 79.425 9193.98587 0
{temperature} 1 3.4593 3.4593 400.43330 0
{(temperature)^2} 1 0.078343 0.078343 9.06878 0.0048787
ERROR1 34 0.29372 0.0086388
CONSTANT [4]
(1) 7.418
{temperature}
(1) -0.045098
{(temperature)^2}
(1) 7.8604e-05
It is important to know that the estimated coefficients $\hat\beta_i$ depend on which terms are in the model when the model is estimated. Thus if we decide we only need $\beta_0$, $\beta_1$, and $\beta_2$ when g is 4 or more, we should refit using just those terms to get appropriate parameter estimates.
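One way to obtain the $SSR_k$ values is ordinary least squares on powers of dose. The sketch below is one possible implementation using NumPy (an assumption; the book itself uses MacAnova), with doses and y flattened so each observation carries the dose of its treatment.

import numpy as np

def ssr_poly(doses, y, k):
    """Residual SS after regressing y on a degree-k polynomial in dose."""
    z = np.asarray(doses, dtype=float)
    yv = np.asarray(y, dtype=float)
    X = np.vander(z, k + 1)                 # columns z^k, ..., z, 1
    beta, *_ = np.linalg.lstsq(X, yv, rcond=None)
    resid = yv - X @ beta
    return float(resid @ resid)

# Improvement sums of squares, each on 1 degree of freedom, e.g.:
# ss_linear = ssr_poly(doses, y, 0) - ssr_poly(doses, y, 1)
# then compare ss_linear / MS_E to an F-distribution with 1 and N - g df.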
Example 3.7 Resin lifetimes, continued
The treatments in the resin lifetime data are different temperatures (175, 194, 213, 231, and 250 degrees C), so we can use these temperatures as doses $z_i$ in a dose-response relationship. With g = 5 treatments, we can use polynomials up to power 4.

Listing 3.3 shows output for a polynomial dose-response modeling of the resin lifetime data. The first model fits up to temperature to the fourth power. From the ANOVA [1] we can see that neither the third nor fourth powers are significant, but the second power is, so a quadratic model seems appropriate. The ANOVA for the reduced model is at [3]. The linear and quadratic sums of squares are the same as in [1], but the $SS_E$ in [3] is increased by the cubic and quartic sums of squares in [1]. We can also see that the intercept, linear, and quadratic coefficients change dramatically from the full model [2] to the reduced model using just those terms [4]. We cannot simply take the intercept, linear, and quadratic coefficients from the fourth power model and use them as if they were coefficients in a quadratic model.
One additional trick to remember when building a dose-response model is that we can transform or reexpress the dose $z_i$. That is, we can build models using log of dose or square root of dose as simply as we can using dose. For some data it is much simpler to build a model as a function of a transformation of the dose.
3.11 Further Reading and Extensions
There is a second randomization that is used occasionally, and unfortunately it also is sometimes called completely randomized.

1. Choose probabilities $p_1$ through $p_g$ with $p_1 + p_2 + \cdots + p_g = 1$.

2. Choose a treatment independently for each unit, choosing treatment i with probability $p_i$.
Now we wind up with $n_i$ units getting treatment i, with $n_1 + n_2 + \cdots + n_g = N$, but the sample sizes $n_i$ are random. This randomization is different than the standard CRD randomization. ANOVA procedures do not distinguish between the fixed and random sample size randomizations, but if we were to do randomization testing, we would use different procedures for the two different randomizations. As a practical matter, we should note that even though we may design for certain fixed sample sizes, we do not always achieve those sample sizes when test tubes get dropped, subjects withdraw from studies, or drunken statistics graduate students drive through experimental fields (you know who you are!).
The estimates we have used for mean parameters are least squares estimates, meaning that they minimize the sum of squared residuals. Least squares estimation goes back to Legendre (1806) and Gauss (1809), who developed the procedure for working with astronomical data. Formal tests based on the t-distribution were introduced by Gosset, who wrote under the pseudonym "Student" (Student 1908). Gosset worked at the Guinness Brewery, and he was allowed to publish only under a pseudonym so that the competition would not be alerted to the usefulness of the procedure. What Gosset actually did was posit the t-distribution; proof was supplied later by Fisher (1925a).

The Analysis of Variance was introduced by Fisher in the context of population genetics (Fisher 1918); he quickly extended the scope (Fisher 1925b). The 1918 paper actually introduces the terms "variance" and "analysis of variance." Scheffé (1956) describes how models for data essentially the same as those used for ANOVA were in use decades earlier, though analysis methods were different.
From a more theoretical perspective, the $SS_E$ is distributed as $\sigma^2$ times a chi-square random variable with N - g degrees of freedom; $SS_{Trt}$ is distributed as $\sigma^2$ times a possibly noncentral chi-square random variable with g - 1 degrees of freedom; and these two sums of squares are independent. When the null hypothesis is true, $SS_{Trt}$ is a multiple of an ordinary (central) chi-square; noncentrality arises under the alternative when the expected value of $MS_{Trt}$ is greater than $\sigma^2$. The ratio of two independent central chi-squares, each divided by their degrees of freedom, is defined to have an F-distribution. Thus the null-hypothesis distribution of the F-statistic is F. Chapter 7 and Appendix A discuss this distribution theory in more detail. Scheffé (1959), Hocking (1985), and others provide book-length expositions of linear models and their related theory.
We have described model selection via testing a null hypothesis. An alternative approach is prediction; for example, we can choose the model that we believe will give us the lowest average squared error of prediction. Mallows (1973) defined a quantity called $C_p$:
\[ C_p = \frac{SSR_p}{MS_E} + 2p - N , \]
where $SSR_p$ is the residual sum of squares for a means model with p parameters (degrees of freedom including any overall constant), $MS_E$ is the error mean square from the separate means model, and N is the number of observations. We prefer models with small $C_p$.
The separate means model (with p = g parameters) has $C_p = g$. The single mean model, dose-response models, and other models can have $C_p$ values greater or less than g. The criterion rewards models with smaller SSR and penalizes models with larger p. When comparing two models, one a reduced form of the other, $C_p$ will prefer the larger model if the F-statistic comparing the models is 2 or greater. Thus we see that it generally takes less evidence to choose a larger model when using a predictive criterion than when doing testing at the "traditional" levels.
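The $C_p$ computation itself is one line. Here is a hedged Python sketch; SSR_p, p, MS_E, and N are supplied from whatever candidate models are under consideration.

def mallows_cp(ssr_p, p, mse, N):
    """Mallows Cp = SSR_p / MS_E + 2p - N; smaller is preferred."""
    return ssr_p / mse + 2 * p - N

# Sanity check: the separate means model has SSR = SS_E and p = g,
# so ssr/mse = N - g and Cp = (N - g) + 2g - N = g.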
Quantitative dose-response models as described here are an instance of
polynomial regression. Weisberg (1985) is a good general source on regres-
sion, including polynomial regression. We have used polynomials because
they are simple and traditional, but there are many other sets of functions we
could use instead. Some interesting alternatives include sines and cosines,
B-splines, and wavelets.
3.12 Problems
Exercise 3.1 Rats were given one of four different diets at random, and the response measure was liver weight as a percentage of body weight. The responses were
Treatment
1 2 3 4
3.52 3.47 3.54 3.74
3.36 3.73 3.52 3.83
3.57 3.38 3.61 3.87
4.19 3.87 3.76 4.08
3.88 3.69 3.65 4.31
3.76 3.51 3.51 3.98
3.94 3.35 3.86
3.64 3.71
(a) Compute the overall mean and treatment effects.
(b) Compute the Analysis of Variance table for these data. What would
you conclude about the four diets?
Exercise 3.2 An experimenter randomly allocated 125 male turkeys to five treatment groups: control and treatments A, B, C, and D. There were 25 birds in each group, and the mean results were 2.16, 2.45, 2.91, 3.00, and 2.71, respectively. The sum of squares for experimental error was 153.4. Test the null hypothesis that the five group means are the same against the alternative that one or more of the treatments differs from the control.
Exercise 3.3 Twelve orange pulp silage samples were divided at random into four groups of three. One of the groups was left as an untreated control, while the other three groups were treated with formic acid, beet pulp, and sodium chloride, respectively. One of the responses was the moisture content of the silage. The observed moisture contents of the silage are shown below (data from Caro et al. 1990):
NaCl Formic acid Beet pulp Control
80.5 89.1 77.8 76.7
79.3 75.7 79.5 77.2
79.0 81.2 77.0 78.6
Means 79.6 82.0 78.1 77.5
Grand mean 79.3
Compute an analysis of variance table for these data and test the null hypothesis that all four treatments yield the same average moisture contents.
Exercise 3.4 We have five groups and three observations per group. The group means are 6.5, 4.5, 5.7, 5.7, and 5.1, and the mean square for error is .75. Compute an ANOVA table for these data.
Exercise 3.5 The leaves of certain plants in the genus Albizzia will fold and unfold in various light conditions. We have taken fifteen different leaves and subjected them to red light for 3 minutes. The leaves were divided into three groups of five at random. The leaflet angles were then measured 30, 45, and 60 minutes after light exposure in the three groups. Data from W. Hughes.
Delay (minutes) Angle (degrees)
30 140 138 140 138 142
45 140 150 120 128 130
60 118 130 128 118 118
Analyze these data to test the null hypothesis that delay after exposure does not affect leaflet angle.
Problem 3.1 Cardiac pacemakers contain electrical connections that are platinum pins soldered onto a substrate. The question of interest is whether different operators produce solder joints with the same strength. Twelve substrates are randomly assigned to four operators. Each operator solders four pins on each substrate, and then these solder joints are assessed by measuring the shear strength of the pins. Data from T. Kerkow.
Strength (lb)
Operator Substrate 1 Substrate 2 Substrate 3
1 5.60 6.80 8.32 8.70 7.64 7.44 7.48 7.80 7.72 8.40 6.98 8.00
2 5.04 7.38 5.56 6.96 8.30 6.86 5.62 7.22 5.72 6.40 7.54 7.50
3 8.36 7.04 6.92 8.18 6.20 6.10 2.75 8.14 9.00 8.64 6.60 8.18
4 8.30 8.54 7.68 8.92 8.46 7.38 8.08 8.12 8.68 8.24 8.09 8.06
Analyze these data to determine if there is any evidence that the operators
produce different mean shear strengths. (Hint: what are the experimental
units?)
Problem 3.2 Scientists are interested in whether the energy costs involved in reproduction affect longevity. In this experiment, 125 male fruit flies were divided at random into five sets of 25. In one group, the males were kept by themselves. In two groups, the males were supplied with one or eight receptive virgin female fruit flies per day. In the final two groups, the males were supplied with one or eight unreceptive (pregnant) female fruit flies per day. Other than the number and type of companions, the males were treated identically. The longevity of the flies was observed. Data from Hanley and Shapiro (1994).
Companions Longevity (days)
None 35 37 49 46 63 39 46 56 63 65 56 65 70
63 65 70 77 81 86 70 70 77 77 81 77
1 pregnant 40 37 44 47 47 47 68 47 54 61 71 75 89
58 59 62 79 96 58 62 70 72 75 96 75
1 virgin 46 42 65 46 58 42 48 58 50 80 63 65 70
70 72 97 46 56 70 70 72 76 90 76 92
8 pregnant 21 40 44 54 36 40 56 60 48 53 60 60 65
68 60 81 81 48 48 56 68 75 81 48 68
8 virgin 16 19 19 32 33 33 30 42 42 33 26 30 40
54 34 34 47 47 42 47 54 54 56 60 44
Analyze these data to test the null hypothesis that reproductive activity does
not affect longevity. Write a report on your analysis. Be sure to describe the
experiment as well as your results.
Problem 3.3 Park managers need to know how resistant different vegetative types are to trampling so that the number of visitors can be controlled in sensitive areas. The experiment deals with alpine meadows in the White Mountains of New Hampshire. Twenty lanes were established, each .5 m wide and 1.5 m long. These twenty lanes were randomly assigned to five treatments: 0, 25, 75, 200, or 500 walking passes. Each pass consists of a 70-kg individual wearing lug-soled boots walking in a natural gait down the lane. The response measured is the average height of the vegetation along the lane one year after trampling. Data based on Table 16 of Cole (1993).
Number
of passes Height (cm)
0 20.7 15.9 17.8 17.6
25 12.9 13.4 12.7 9.0
75 11.8 12.6 11.4 12.1
200 7.6 9.5 9.9 9.0
500 7.8 9.0 8.5 6.7
Analyze these data to determine if trampling has an effect after one year, and
if so, describe that effect.
Problem 3.4 Caffeine is a common drug that affects the central nervous system. Among the issues involved with caffeine are how does it get from the blood to the brain, and does the presence of caffeine alter the ability of similar compounds to move across the blood-brain barrier? In this experiment, 43 lab rats were randomly assigned to one of eight treatments. Each treatment consisted of an arterial injection of $C^{14}$-labeled adenine together with a concentration of caffeine (0 to 50 mM). Shortly after injection, the concentration of labeled adenine in the rat brains is measured as the response (data from McCall, Millington, and Wurtman 1982).
Caffeine (mM) Adenine
0 5.74 6.90 3.86 6.94 6.49 1.87
0.1 2.91 4.14 6.29 4.40 3.77
0.5 5.80 5.84 3.18 3.18
1 3.49 2.16 7.36 1.98 5.51
5 5.92 3.66 4.62 3.47 1.33
10 3.05 1.94 1.23 3.45 1.61 4.32
25 1.27 .69 .85 .71 1.04 .84
50 .93 1.47 1.27 1.13 1.25 .55
The main issues in this experiment are whether the amount of caffeine present
affects the amount of adenine that can move from the blood to the brain, and
if so, what is the dose response relationship. Analyze these data.
Problem 3.5 Engineers wish to know the effect of polypropylene fibers on the compressive strength of concrete. Fifteen concrete cubes are produced and randomly assigned to five levels of fiber content (0, .25, .50, .75, and 1%). Data from Figure 2 of Paskova and Meyer (1997).
Fiber
content (%) Strength (ksi)
0 7.8 7.4 7.2
.25 7.9 7.5 7.3
.50 7.4 6.9 6.3
.75 7.0 6.7 6.4
1 5.9 5.8 5.6
Analyze these data to determine if fiber content has an effect on concrete strength, and if so, describe that effect.
Question 3.1 Prove that $\mu^\star = \sum_{i=1}^g \mu_i / g$ is equivalent to $\sum_{i=1}^g \alpha_i = 0$.

Question 3.2 Prove that
\[ 0 = \sum_{i=1}^g \sum_{j=1}^{n_i} \hat\alpha_i r_{ij} . \]
Chapter 4

Looking for Specific Differences: Contrasts
An Analysis of Variance can give us an indication that not all the treatment groups have the same mean response, but an ANOVA does not, by itself, tell us which treatments are different or in what ways they differ. To do this, we need to look at the treatment means, or equivalently, at the treatment effects. One method to examine treatment effects is called a contrast.

ANOVA is like background lighting that dimly illuminates all of our data, but not giving enough light to see details. Using a contrast is like using a spotlight; it enables us to focus in on a specific, narrow feature of the data. But the contrast has such a narrow focus that it does not give the overall picture. By using several contrasts, we can move our focus around and see more features. Intelligent use of contrasts involves choosing our contrasts so that they highlight interesting features in our data.
4.1 Contrast Basics
Contrasts take the form of a difference between means or averages of means. For example, here are two contrasts:
\[ (\mu + \alpha_6) - (\mu + \alpha_3) \]
and
\[ \frac{(\mu + \alpha_2) + (\mu + \alpha_4)}{2} - \frac{(\mu + \alpha_1) + (\mu + \alpha_3) + (\mu + \alpha_5)}{3} . \]
The first compares the means of treatments 6 and 3, while the second compares the mean response in groups 2 and 4 with the mean response in groups 1, 3, and 5.

Formally, a contrast is a linear combination of treatment means or effects
\[ \sum_{i=1}^g w_i \mu_i = w(\{\mu_i\}) \quad \text{or} \quad \sum_{i=1}^g w_i \alpha_i = w(\{\alpha_i\}) , \]
where the coefficients $w_i$ satisfy $\sum_{i=1}^g w_i = 0$. Contrast coefficients add to zero.
Less formally, we sometimes speak of the set of contrast coefficients $\{w_i\}$ as being a contrast; we will try to avoid ambiguity. Notice that because the sum of the coefficients is zero, we have that
\[ w(\{\alpha_i\}) = \sum_{i=1}^g w_i \alpha_i = x \sum_{i=1}^g w_i + \sum_{i=1}^g w_i \alpha_i = \sum_{i=1}^g w_i (x + \alpha_i) = \sum_{i=1}^g w_i (\mu + \alpha_i) = w(\{\mu_i\}) \]
for any fixed constant x (say $\mu$ or $\bar y_{\bullet\bullet}$). We may also make contrasts in the observed data:
\[ w(\{\bar y_{i\bullet}\}) = \sum_{i=1}^g w_i \bar y_{i\bullet} = \sum_{i=1}^g w_i (\bar y_{i\bullet} - \bar y_{\bullet\bullet}) = \sum_{i=1}^g w_i \hat\alpha_i = w(\{\hat\alpha_i\}) . \]
A contrast depends on the differences between the values being contrasted, but not on the overall level of the values. In particular, a contrast in treatment means depends on the $\alpha_i$'s but not on $\mu$. A contrast in the treatment means or effects will be the same regardless of whether we assume that $\alpha_1 = 0$, or $\sum \alpha_i = 0$, or $\sum n_i \alpha_i = 0$. Recall that with respect to restrictions on the treatment effects, we said that the important things don't depend on which set of restrictions we use. In particular, contrasts don't depend on the restrictions.

We may use several different kinds of contrasts in any one analysis. The trick is to find or construct contrasts that focus in on interesting features of the data.
Probably the most common contrasts are pairwise comparisons, where we contrast the mean response in one treatment with the mean response in a second treatment. For a pairwise comparison, one contrast coefficient is 1, a second contrast coefficient is -1, and all other contrast coefficients are 0. For example, in an experiment with g = 4 treatments, the coefficients (0, 1, -1, 0) compare the means of treatments 2 and 3, and the coefficients (-1, 0, 1, 0) compare the means of treatments 1 and 3. For g treatments, there are g(g - 1)/2 different pairwise comparisons. We will consider simultaneous inference for pairwise comparisons in Section 5.4.
A second classic example of contrasts occurs in an experiment with a control and two or more new treatments. Suppose that treatment 1 is a control, and treatments 2 and 3 are new treatments. We might wish to compare the average response in the new treatments to the average response in the control; that is, on average do the new treatments have the same response as the control? Here we could use coefficients (-1, .5, .5), which would subtract the average control response from the average of treatments 2 and 3's average responses. As discussed below, this contrast applied to the observed treatment means, $(\bar y_{2\bullet} + \bar y_{3\bullet})/2 - \bar y_{1\bullet}$, would estimate the contrast in the treatment effects, $(\alpha_2 + \alpha_3)/2 - \alpha_1$. Note that we would get the same kind of information from contrasts with coefficients (1, -.5, -.5) or (-6, 3, 3); we've just rescaled the result with no essential loss of information. We might also be interested in the pairwise comparisons, including a comparison of the new treatments to each other (0, 1, -1) and comparisons of each of the new treatments to control (1, -1, 0) and (1, 0, -1).
Consider next an experiment with four treatments examining the growth rate of lambs. The treatments are four different food supplements. Treatment 1 is soy meal and ground corn, treatment 2 is soy meal and ground oats, treatment 3 is fish meal and ground corn, and treatment 4 is fish meal and ground oats. Again, there are many potential contrasts of interest. A contrast with coefficients (.5, .5, -.5, -.5) would take the average response for fish meal treatments and subtract it from the average response for soy meal treatments. This could tell us about how the protein source affects the response. Similarly, a contrast with coefficients (.5, -.5, .5, -.5) would take the average response for ground oats and subtract it from the average response for ground corn, telling us about the effect of the carbohydrate source.
Finally, consider an experiment with three treatments examining the effect of development time on the number of defects in computer chips produced using photolithography. The three treatments are 30, 45, and 60 seconds of developing. If we think of the responses as lying on a straight line function of development time, then the contrast with coefficients (-1/30, 0, 1/30) will estimate the slope of the line relating response and time. If instead we think that the responses lie on a quadratic function of development time, then the contrast with coefficients (1/450, -2/450, 1/450) will estimate the quadratic term in the response function. Don't worry for now about where these coefficients come from; they will be discussed in more detail in Section 4.4. For now, consider that the first contrast compares the responses at the ends to get a rate of change, and the second contrast compares the ends to the middle (which yields a 0 comparison for responses on a straight line) to assess curvature.
4.2 Inference for Contrasts
We use contrasts in observed treatment means or effects to make inference about the corresponding contrasts in the true treatment means or effects. The kinds of inference we work with here are point estimates, confidence intervals, and tests of significance. The procedures we use for contrasts are similar to the procedures we use when estimating or testing means.
The observed treatment mean $\bar y_{i\bullet}$ is an unbiased estimate of $\mu_i = \mu + \alpha_i$, so a sum or other linear combination of observed treatment means is an unbiased estimate of the corresponding combination of the $\mu_i$'s. In particular, a contrast in the observed treatment means is an unbiased estimate of the corresponding contrast in the true treatment means. Thus we have:
\[ E[w(\{\bar y_{i\bullet}\})] = E[w(\{\hat\mu_i\})] = w(\{\mu_i\}) = w(\{\alpha_i\}) . \]
The variance of $\bar y_{i\bullet}$ is $\sigma^2/n_i$, and the treatment means are independent, so the variance of a contrast in the observed means is
\[ Var[w(\{\bar y_{i\bullet}\})] = \sigma^2 \sum_{i=1}^g \frac{w_i^2}{n_i} . \]
We will usually not know $\sigma^2$, so we estimate it by the mean square for error from the ANOVA.
We compute a confidence interval for a mean parameter with the general form: unbiased estimate ± t-multiplier × estimated standard error. Contrasts are linear combinations of mean parameters, so we use the same basic form. We have already seen how to compute an estimate and standard error, so
\[ w(\{\bar y_{i\bullet}\}) \pm t_{\mathcal{E}/2, N-g} \sqrt{MS_E} \sqrt{\sum_{i=1}^g \frac{w_i^2}{n_i}} \]
forms a $1 - \mathcal{E}$ confidence interval for $w(\{\mu_i\})$. As usual, the degrees of freedom for our t-percent point come from the degrees of freedom for our estimate of error variance, here N - g. We use the $\mathcal{E}/2$ percent point because we are forming a two-sided confidence interval, with $\mathcal{E}/2$ error on each side.
The usual t-test statistic for a mean parameter takes the form
\[ \frac{\text{unbiased estimate} - \text{null hypothesis value}}{\text{estimated standard error of estimate}} . \]
This form also works for contrasts. If we have the null hypothesis $H_0: w(\{\alpha_i\}) = \delta$, then we can do a t-test of that null hypothesis by computing the test statistic
\[ t = \frac{w(\{\bar y_{i\bullet}\}) - \delta}{\sqrt{MS_E} \sqrt{\sum_{i=1}^g w_i^2/n_i}} . \]
Under $H_0$, this t-statistic will have a t-distribution with N - g degrees of freedom. Again, the degrees of freedom come from our estimate of error variance. The p-value for this t-test is computed by getting the area under the t-distribution with N - g degrees of freedom for the appropriate region: either less or greater than the observed t-statistic for one-sided alternatives, or twice the tail area for a two-sided alternative.
We may also compute a sum of squares for any contrast w({ȳ_i•}):

$$SS_w = \frac{\left(\sum_{i=1}^{g} w_i \bar y_{i\bullet}\right)^2}{\sum_{i=1}^{g} w_i^2/n_i} .$$

This sum of squares has 1 degree of freedom, so its mean square is MS_w = SS_w/1 = SS_w. We may use MS_w to test the null hypothesis that w({μ_i}) = 0 by forming the F-statistic MS_w/MS_E. If H_0 is true, this F-statistic will have an F-distribution with 1 and N - g degrees of freedom (N - g from the MS_E). It is not too hard to see that this F is exactly equal to the square of the t-statistic computed for the same null hypothesis δ = 0. Thus the F-test and two-sided t-tests are equivalent for the null hypothesis of zero contrast mean. It is also not too hard to see that if you multiply the contrast coefficients by a nonzero constant (for example, change from (-1, .5, .5) to (2, -1, -1)), then the contrast sum of squares is unchanged. The squared constant cancels from the numerator and denominator of the formula.
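These formulas are easy to compute from summary statistics alone. Here is a minimal Python sketch (the function name and interface are my own, and SciPy is assumed to be available) implementing the estimate, standard error, confidence interval, t-test, and contrast sum of squares just described:

from math import sqrt
from scipy import stats

def contrast_inference(w, ybar, n, MSE, N, g, alpha=0.05, delta=0.0):
    """Estimate, CI, t-test, and SS/F for a contrast, given treatment
    means ybar, sample sizes n, and the ANOVA mean square for error."""
    est = sum(wi * yi for wi, yi in zip(w, ybar))      # w({ybar_i})
    scale = sum(wi**2 / ni for wi, ni in zip(w, n))    # sum of w_i^2 / n_i
    se = sqrt(MSE * scale)                             # standard error
    df = N - g                                         # error degrees of freedom
    tmult = stats.t.ppf(1 - alpha / 2, df)             # upper alpha/2 t point
    ci = (est - tmult * se, est + tmult * se)          # 1-alpha confidence interval
    tstat = (est - delta) / se
    pval = 2 * stats.t.sf(abs(tstat), df)              # two-sided p-value
    SSw = est**2 / scale                               # contrast sum of squares
    F = SSw / MSE                                      # equals t^2 when delta = 0
    return est, se, ci, tstat, pval, SSw, F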
Example 4.1  Rat liver weights

Exercise 3.1 provided data on the weight of rat livers as a percentage of body weight for four different diets. Summary statistics from those data follow:

    Diet i      1      2      3      4
    ȳ_i•     3.75   3.58   3.60   3.92
    n_i         7      8      6      8

    MS_E = .04138
If diets 1, 2, and 3 are rations made by one manufacturer, and diet 4 is a ration made by a second manufacturer, then it may be of interest to compare the responses from the diets of the two manufacturers to see if there is any difference.

The contrast with coefficients (1/3, 1/3, 1/3, -1) will compare the mean response in the first three diets with the mean response in the last diet. Note that we intend the mean response in the first three diets to denote the average of the treatment averages, not the simple average of all the data from those three treatments. The simple average will not be the same as the average of the averages because the sample sizes are different.
Our point estimate of this contrast is

$$w(\{\bar y_{i\bullet}\}) = \frac{1}{3}3.75 + \frac{1}{3}3.58 + \frac{1}{3}3.60 + (-1)3.92 = -.277$$

with standard error

$$SE(w(\{\bar y_{i\bullet}\})) = \sqrt{.04138}\;\sqrt{\frac{(1/3)^2}{7} + \frac{(1/3)^2}{8} + \frac{(1/3)^2}{6} + \frac{(-1)^2}{8}} = .0847 .$$
The mean square for error has 29 - 4 = 25 degrees of freedom. To construct a 95% confidence interval for w({μ_i}), we need the upper 2.5% point of a t-distribution with 25 degrees of freedom; this is 2.06, as can be found in Appendix Table D.3 or using software. Thus our 95% confidence interval is

$$-.277 \pm 2.06 \times .0847 = -.277 \pm .174 = (-.451, -.103) .$$
Suppose that we wish to test the null hypothesis H_0: w({μ_i}) = δ. Here we will use the t-test and F-test to test H_0: w({μ_i}) = δ = 0, but the t-test can test other values of δ. Our t-test is

$$\frac{-.277 - 0}{.0847} = -3.27 ,$$

with 25 degrees of freedom. For a two-sided alternative, we compute the p-value by finding the tail area under the t-curve and doubling it. Here we get twice .00156, or about .003. This is rather strong evidence against the null hypothesis.
Because our null hypothesis value is zero with a two-sided alternative, we can also test our null hypothesis by computing a mean square for the contrast and forming an F-statistic. The sum of squares for our contrast is
$$\frac{\left(\frac{1}{3}3.75 + \frac{1}{3}3.58 + \frac{1}{3}3.60 + (-1)3.92\right)^2}{\frac{(1/3)^2}{7} + \frac{(1/3)^2}{8} + \frac{(1/3)^2}{6} + \frac{(-1)^2}{8}} = \frac{(-.277)^2}{.1733} = .443 .$$

The mean square is also .443, so the F-statistic is .443/.04138 = 10.7. We compute a p-value by finding the area to the right of 10.7 under the F-distribution with 1 and 25 degrees of freedom, getting .003 as for the t-test.
Listing 4.1 shows output from SAS for computing the sum of squares for this contrast; Listing 4.2 shows corresponding MacAnova output. The sum of squares in these two listings differs from what we obtained above due to rounding at several steps.

Listing 4.1: SAS PROC GLM output for the rat liver contrast.

    Source       DF    Type I SS     Mean Square   F Value   Pr > F
    DIET          3    0.57820903    0.19273634       4.66   0.0102

    Contrast     DF    Contrast SS   Mean Square   F Value   Pr > F
    1,2,3 vs 4    1    0.45617253    0.45617253      11.03   0.0028

Listing 4.2: MacAnova output for the rat liver contrast.

    component: estimate
    (1)   -0.28115
    component: ss
    (1)    0.45617
    component: se
    (1)    0.084674
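Applying the sketch from the end of Section 4.2 to the rounded summary statistics of this example reproduces the hand computations (and, up to the rounding just noted, the listings):

est, se, ci, t, p, SSw, F = contrast_inference(
    w=[1/3, 1/3, 1/3, -1],
    ybar=[3.75, 3.58, 3.60, 3.92],
    n=[7, 8, 6, 8],
    MSE=0.04138, N=29, g=4)
# est approx -0.277, se approx 0.0847
# ci approx (-0.451, -0.102)
# t approx -3.27, p approx 0.003
# SSw approx 0.442, F approx 10.7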
4.3 Orthogonal Contrasts
Two contrasts {w} and {w*} are said to be orthogonal if

$$\sum_{i=1}^{g} w_i w_i^\star / n_i = 0 .$$
If there are g treatments, you can find a set of g - 1 contrasts that are mutually orthogonal, that is, each one is orthogonal to all of the others. However, there are infinitely many sets of g - 1 mutually orthogonal contrasts, and there are no mutually orthogonal sets with more than g - 1 contrasts. There is an analogy from geometry. In a plane, you can have two lines that are perpendicular (orthogonal), but you can't find a third line that is perpendicular to both of the others. On the other hand, there are infinitely many pairs of perpendicular lines.

The important feature of orthogonal contrasts applied to observed means is that they are independent (as random variables). Thus, the random error of one contrast is not correlated with the random error of an orthogonal contrast. An additional useful fact about orthogonal contrasts is that they partition the between-groups sum of squares. That is, if you compute the sums of squares for a full set of orthogonal contrasts (g - 1 contrasts for g groups), then adding up those g - 1 sums of squares will give you exactly the between-groups sum of squares (which also has g - 1 degrees of freedom).
Example 4.2  Orthogonal contrast inference

Suppose that we have an experiment with three treatments, a control and two new treatments, with group sizes 10, 5, and 5, and treatment means 6.3, 6.4, and 6.5. The MS_E is .0225 with 17 degrees of freedom. The contrast w with coefficients (1, -.5, -.5) compares the mean response in the control treatment with the average of the mean responses in the new treatments. The contrast with coefficients (0, 1, -1) compares the two new treatments. In our example, the control has 10 units and the two new treatments have 5 units each; these contrasts are orthogonal, because

$$\frac{1 \times 0}{10} + \frac{(-.5) \times 1}{5} + \frac{(-.5) \times (-1)}{5} = 0 .$$
We have three groups, so there are 2 degrees of freedom between groups, and we have described above a set of orthogonal contrasts. The sum of squares for the first contrast is

$$\frac{(6.3 - .5 \times 6.4 - .5 \times 6.5)^2}{\frac{1^2}{10} + \frac{(-.5)^2}{5} + \frac{(-.5)^2}{5}} = .1125 ,$$

and the sum of squares for the second contrast is

$$\frac{(0 + 6.4 - 6.5)^2}{\frac{0}{10} + \frac{1^2}{5} + \frac{(-1)^2}{5}} = \frac{.01}{.4} = .025 .$$

The between-groups sum of squares is

$$10(6.3 - 6.375)^2 + 5(6.4 - 6.375)^2 + 5(6.5 - 6.375)^2 = .1375 ,$$

which equals .1125 + .025.
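Both the orthogonality condition and the partitioning property are easy to verify numerically; the following sketch (names are my own) checks them against the numbers in Example 4.2:

w1 = [1, -0.5, -0.5]           # control vs. average of new treatments
w2 = [0, 1, -1]                # new treatment 1 vs. new treatment 2
n = [10, 5, 5]
ybar = [6.3, 6.4, 6.5]

# orthogonality: sum of w_i * w*_i / n_i should be zero
print(sum(a * b / ni for a, b, ni in zip(w1, w2, n)))       # 0.0

def contrast_ss(w):
    est = sum(wi * yi for wi, yi in zip(w, ybar))
    return est**2 / sum(wi**2 / ni for wi, ni in zip(w, n))

grand = sum(ni * yi for ni, yi in zip(n, ybar)) / sum(n)    # 6.375
ss_between = sum(ni * (yi - grand)**2 for ni, yi in zip(n, ybar))
print(contrast_ss(w1), contrast_ss(w2))                     # 0.1125 and 0.025
print(ss_between)                                           # 0.1375 = 0.1125 + 0.025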
We can see from Example 4.2 one of the advantages of contrasts over the full between-groups sum of squares. The control-versus-new contrast has a sum of squares which is 4.5 times larger than the sum of squares for the difference of the new treatments. This indicates that the responses from the new treatments are substantially farther from the control responses than they are from each other. Such indications are not possible using the between-groups sum of squares.

The actual contrasts one uses in an analysis arise from the context of the problem. Here we had new versus old and the difference between the two new treatments. In a study on the composition of ice cream, we might compare artificial flavorings with natural flavorings, or expensive flavorings with inexpensive flavorings. It is often difficult to construct a complete set of meaningful orthogonal contrasts, but that should not deter you from using an incomplete set of orthogonal contrasts, or from using contrasts that are nonorthogonal.

Use contrasts that address the questions you are trying to answer.
4.4 Polynomial Contrasts
Section 3.10 introduced the idea of polynomial modeling of a response when the treatments had a quantitative dose structure. We selected a polynomial model by looking at the improvement sums of squares obtained by adding each polynomial term to the model in sequence. Each of these additional terms in the polynomial has a single degree of freedom, just like a contrast. In fact, each of these improvement sums of squares can be obtained as a contrast sum of squares. We call the contrast that gives us the sum of squares for the linear term the linear contrast, the contrast that gives us the improvement sum of squares for the quadratic term the quadratic contrast, and so on.
When the doses are equally spaced and the sample sizes are equal, the contrast coefficients for polynomial terms are fairly simple and can be found, for example, in Appendix Table D.6; these contrasts are orthogonal and have been scaled to be simple integer values. Equally spaced doses means that the gaps between successive doses are the same, as in 1, 4, 7, 10. Using these tabulated contrast coefficients, we may compute the linear, quadratic, and higher-order sums of squares as contrasts without fitting a separate polynomial model. Doses such as 1, 10, 100, 1000 are equally spaced on a logarithmic scale, so we can again use the simple polynomial contrast coefficients, provided we interpret the polynomial as a polynomial in the logarithm of dose.

When the doses are not equally spaced or the sample sizes are not equal, contrasts for polynomial terms exist, but are rather complicated to derive. In this situation, it is more trouble to derive the coefficients for the polynomial contrasts than it is to fit a polynomial model.
Example 4.3  Leaflet angles

Exercise 3.5 introduced the leaflet angles of plants at 30, 45, and 60 minutes after exposure to red light. Summary information for this experiment is given here:

    Delay time (min)     30      45      60
    ȳ_i•              139.6   133.6   122.4
    n_i                   5       5       5

    MS_E = 58.13

With three equally spaced groups, the linear and quadratic contrasts are (-1, 0, 1) and (1, -2, 1).
The sum of squares for linear is

$$\frac{\left((-1)139.6 + (0)133.6 + (1)122.4\right)^2}{\frac{(-1)^2}{5} + \frac{0}{5} + \frac{1^2}{5}} = 739.6 ,$$

and that for quadratic is

$$\frac{\left((1)139.6 + (-2)133.6 + (1)122.4\right)^2}{\frac{1^2}{5} + \frac{(-2)^2}{5} + \frac{1^2}{5}} = 22.53 .$$
Thus the F-tests for linear and quadratic are 739.6/58.13 = 12.7 and 22.53/58.13 = .39, both with 1 and 12 degrees of freedom; there is a strong linear trend in the means and almost no nonlinear trend.
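A short sketch reproducing these two sums of squares and F-statistics (assuming SciPy is available for the F tail area):

from scipy import stats

ybar = [139.6, 133.6, 122.4]   # means at 30, 45, 60 minutes
n, MSE, df_err = 5, 58.13, 12

def contrast_ss(w):
    est = sum(wi * yi for wi, yi in zip(w, ybar))
    return est**2 / sum(wi**2 / n for wi in w)

for name, w in [("linear", [-1, 0, 1]), ("quadratic", [1, -2, 1])]:
    ss = contrast_ss(w)
    F = ss / MSE
    p = stats.f.sf(F, 1, df_err)
    print(name, ss, F, p)
# linear:    SS approx 739.6,  F approx 12.7
# quadratic: SS approx 22.53,  F approx 0.39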
4.5 Further Reading and Extensions
Contrasts are a special case of estimable functions, which are described in some detail in Appendix Section A.6. Treatment means and averages of treatment means are other estimable functions. Estimable functions are those features of the data that do not depend on how we choose to restrict the treatment effects.
4.6 Problems
Exercise 4.1  An experimenter randomly allocated 125 male turkeys to five treatment groups: 0 mg, 20 mg, 40 mg, 60 mg, and 80 mg of estradiol. There were 25 birds in each group, and the mean results were 2.16, 2.45, 2.91, 3.00, and 2.71 respectively. The sum of squares for experimental error was 153.4. Test the null hypothesis that the five group means are the same against the alternative that they are not all the same. Find the linear, quadratic, cubic, and quartic sums of squares (you may lump the cubic and quartic together into a "higher than quadratic" if you like). Test the null hypothesis that the quadratic effect is zero. Be sure to report a p-value.

Exercise 4.2  Use the data from Exercise 3.3. Compute a 99% confidence interval for the difference in response between the average of the three treatment groups (acid, pulp, and salt) and the control group.

Exercise 4.3  Refer to the data in Problem 3.1. Workers 1 and 2 were experienced, whereas workers 3 and 4 were novices. Find a contrast to compare the experienced and novice workers and test the null hypothesis that experienced and novice workers produce the same average shear strength.

Exercise 4.4  Consider an experiment taste-testing six types of chocolate chip cookies: 1 (brand A, chewy, expensive), 2 (brand A, crispy, expensive), 3 (brand B, chewy, inexpensive), 4 (brand B, crispy, inexpensive), 5 (brand C, chewy, expensive), and 6 (brand D, crispy, inexpensive). We will use twenty different raters randomly assigned to each type (120 total raters).

(a) Design contrasts to compare chewy with crispy, and expensive with inexpensive.

(b) Are your contrasts in part (a) orthogonal? Why or why not?

Problem 4.1  A consumer testing agency obtains four cars from each of six makes: Ford, Chevrolet, Nissan, Lincoln, Cadillac, and Mercedes. Makes 3 and 6 are imported while the others are domestic; makes 4, 5, and 6 are expensive while 1, 2, and 3 are less expensive; 1 and 4 are Ford products, while 2 and 5 are GM products. We wish to compare the six makes on their oil use per 100,000 miles driven. The mean responses by make of car were 4.6, 4.3, 4.4, 4.7, 4.8, and 6.2, and the sum of squares for error was 2.25.

(a) Compute the Analysis of Variance table for this experiment. What would you conclude?

(b) Design a set of contrasts that seem meaningful. For each contrast, outline its purpose and compute a 95% confidence interval.

Problem 4.2  Consider the data in Problem 3.2. Design a set of contrasts that seem meaningful. For each contrast, outline its purpose and test the null hypothesis that the contrast has expected value zero.

Problem 4.3  Consider the data in Problem 3.5. Use polynomial contrasts to choose a quantitative model to describe the effect of fiber proportion on the response.

Question 4.1  Show that orthogonal contrasts in the observed treatment means are uncorrelated random variables.
Chapter 5
Multiple Comparisons
When we make several related tests or interval estimates at the same time, we need to make multiple comparisons or do simultaneous inference. The issue of multiple comparisons is one of error rates. Each of the individual tests or confidence intervals has a Type I error rate E_i that can be controlled by the experimenter. If we consider the tests together as a family, then we can also compute a combined Type I error rate for the family of tests or intervals. When a family contains more and more true null hypotheses, the probability that one or more of these true null hypotheses is rejected increases, and the probability of any Type I errors in the family can become quite large. Multiple comparisons procedures deal with Type I error rates for families of tests.
Example 5.1  Carcinogenic mixtures

We are considering a new cleaning solvent that is a mixture of 100 chemicals. Suppose that regulations state that a mixture is safe if all of its constituents are safe (pretending we can ignore chemical interaction). We test the 100 chemicals for causing cancer, running each test at the 5% level. This is the individual error rate that we can control.

What happens if all 100 chemicals are harmless and safe? Because we are testing at the 5% level, we expect 5% of the nulls to be rejected even when all the nulls are true. Thus, on average, 5 of the 100 chemicals will be declared to be carcinogenic, even when all are safe. Moreover, if the tests are independent, then one or more of the chemicals will be declared unsafe in 99.4% of all sets of experiments we run, even if all the chemicals are safe. This 99.4% is a combined Type I error rate; clearly we have a problem.
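The 99.4% figure is simple to verify: with 100 independent 5%-level tests, the chance of at least one false rejection is one minus the chance of no rejections, as a one-line check shows:

>>> 1 - 0.95**100    # probability of at least one false rejection
0.994079...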
5.1 Error Rates
When we have more than one test or interval to consider, there are several ways to define a combined Type I error rate for the family of tests. This variety of combined Type I error rates is the source of much confusion in the use of multiple comparisons, as different error rates lead to different procedures. People sometimes ask "Which procedure should I use?" when the real question is "Which error rate do I want to control?". As data analyst, you need to decide which error rate is appropriate for your situation and then choose a method of analysis appropriate for that error rate. This choice of error rate is not so much a statistical decision as a scientific decision in the particular area under consideration.

Data snooping is a practice related to having many tests. Data snooping occurs when we first look over the data and then choose the null hypotheses to be tested based on "interesting" features in the data. What we tend to do is consider many potential features of the data and discard those with uninteresting or null behavior. When we data snoop and then perform a test, we tend to see the smallest p-value from the ill-defined family of tests that we considered when we were snooping; we have not really performed just one test. Some multiple comparisons procedures can actually control for data snooping.
Simultaneous inference is deciding which error rate we wish to control, and
then using a procedure that controls the desired error rate.
Let's set up some notation for our problem. We have a set of K null hypotheses H_01, H_02, ..., H_0K. We also have the combined, overall, or intersection null hypothesis H_0, which is true if all of the H_0i are true. In formula,

$$H_0 = H_{01} \cap H_{02} \cap \cdots \cap H_{0K} .$$

The collection H_01, H_02, ..., H_0K is sometimes called a family of null hypotheses. We reject H_0 if any of the null hypotheses H_0i is rejected. In Example 5.1, K = 100, H_0i is the null hypothesis that chemical i is safe, and H_0 is the null hypothesis that all chemicals are safe so that the mixture is safe.
We now define five combined Type I error rates. The definitions of these error rates depend on numbers or fractions of falsely rejected null hypotheses H_0i, which will never be known in practice. We set up the error rates here and later give procedures that can be shown mathematically to control the error rates.
The per comparison error rate or comparisonwise error rate is the probability of rejecting a particular H_0i in a single test when that H_0i is true. Controlling the per comparison error rate at E means that the expected fraction of individual tests that reject H_0i when H_0 is true is E. This is just the usual error rate for a t-test or F-test; it makes no correction for multiple comparisons. The tests in Example 5.1 controlled the per comparison error rate at 5%.
The per experiment error rate or experimentwise error rate or familywise error rate is the probability of rejecting one or more of the H_0i (and thus rejecting H_0) in a series of tests when all of the H_0i are true. Controlling the experimentwise error rate at E means that the expected fraction of experiments in which we would reject one or more of the H_0i when H_0 is true is E. In Example 5.1, the per experiment error rate is the fraction of times we would declare one or more of the chemicals unsafe when in fact all were safe. Controlling the experimentwise error rate at E necessarily controls the comparisonwise error rate at no more than E. The experimentwise error rate considers all individual null hypotheses that were rejected; if any one of them was correctly rejected, then there is no penalty for any false rejections that may have occurred.
A statistical discovery is the rejection of an H_0i. The false discovery fraction is 0 if there are no rejections; otherwise it is the number of false discoveries (Type I errors) divided by the total number of discoveries. The false discovery rate (FDR) is the expected value of the false discovery fraction. If H_0 is true, then all discoveries are false and the FDR is just the experimentwise error rate. Thus controlling the FDR at E also controls the experimentwise error rate at E. However, the FDR also controls at E the average fraction of rejections that are Type I errors when some H_0i are true and some are false, a control that the experimentwise error rate does not provide. With the FDR, we are allowed more incorrect rejections as the number of true rejections increases, but the ratio is limited. For example, with FDR at .05, we are allowed just one incorrect rejection with 19 correct rejections.
The strong familywise error rate is the probability of making any false discoveries, that is, the probability that the false discovery fraction is greater than zero. Controlling the strong familywise error rate at E means that the probability of making any false rejections is E or less, regardless of how many correct rejections are made. Thus one true rejection cannot make any false rejections more likely. Controlling the strong familywise error rate at E controls the FDR at no more than E. In Example 5.1, a strong familywise error rate of E would imply that in a situation where 2 of the chemicals were carcinogenic, the probability of declaring one of the other 98 to be carcinogenic would be no more than E.
Finally, suppose that each null hypothesis relates to some parameter (for example, a mean), and we put confidence intervals on all these parameters. An error occurs when one of our confidence intervals fails to cover the true parameter value. If this true parameter value is also the null hypothesis value, then an error is a false rejection. The simultaneous confidence intervals criterion states that all of our confidence intervals must cover their true parameters simultaneously with confidence 1 - E. Simultaneous 1 - E confidence intervals also control the strong familywise error rate at no more than E. (In effect, the strong familywise criterion only requires simultaneous intervals for the null parameters.) In Example 5.1, we could construct simultaneous confidence intervals for the cancer rates of each of the 100 chemicals. Note that a single confidence interval in a collection of intervals with simultaneous coverage 1 - E will have coverage greater than 1 - E.
There is a trade-off between Type I error and Type II error (failing to reject a null when it is false). As we go to more and more stringent Type I error rates, we become more confident in the rejections that we do make, but it also becomes more difficult to make rejections. Thus, when using the more stringent Type I error controls, we are more likely to fail to reject some null hypotheses that should be rejected than when using the less stringent rates. In simultaneous inference, controlling stronger error rates leads to less powerful tests.
Example 5.2  Functional magnetic resonance imaging

Many functional Magnetic Resonance Imaging (fMRI) studies are interested in determining which areas of the brain are "activated" when a subject is engaged in some task. Any one image slice of the brain may contain 5000 voxels (individual locations to be studied), and one analysis method produces a t-test for each of the 5000 voxels. Null hypothesis H_0i is that voxel i is not activated. Which error rate should we use?

If we are studying a small, narrowly defined brain region and are unconcerned with other brain regions, then we would want to test individually the voxels in the brain regions of interest. The fact that there are 4999 other voxels is unimportant, so we would use a per comparison method.

Suppose instead that we are interested in determining if there are any activations in the image. We recognize that by making many tests we are likely to find one that is significant, even when all nulls are true; we want to protect ourselves against that possibility, but otherwise need no stronger control. Here we would use a per experiment error rate.

Suppose that we believe that there will be many activations, so that H_0 is not true. We don't want some correct discoveries to open the flood gates for many false discoveries, but we are willing to live with some false discoveries as long as they are a controlled fraction of the total made. This is acceptable because we are going to investigate several subjects; the truly activated rejections should be rejections in most subjects, and the false rejections will be scattered. Here we would use the FDR.

Suppose that in addition to expecting true activations, we are also only looking at a single subject, so that we can't use multiple subjects to determine which activations are real. Here we don't want false activations to cloud our picture, so we use the strong familywise error rate.

Finally, we might want to be able to estimate the amount of activation in every voxel, with simultaneous accuracy for all voxels. Here we would use simultaneous confidence intervals.
A multiple comparisons procedure is a method for controlling a Type I error rate other than the per comparison error rate.
The literature on multiple comparisons is vast, and despite the length of this chapter, we will only touch the highlights. I have seen quite a bit of nonsense regarding these methods, so I will try to set out rather carefully what the methods are doing. We begin with a discussion of Bonferroni-based methods for combining generic tests. Next we consider the Scheffé procedure, which is useful for contrasts suggested by data (data snooping). Then we turn our attention to pairwise comparisons, for which there are dozens of methods. Finally, we consider comparing treatments to a control or to the best response.
5.2 Bonferroni-Based Methods
The Bonferroni technique is the simplest, most widely applicable multiple comparisons procedure. The Bonferroni procedure works for a fixed set of K null hypotheses to test or parameters to estimate. Let p_i be the p-value for testing H_0i. The Bonferroni procedure says to obtain simultaneous 1 - E confidence intervals by constructing individual confidence intervals with coverage 1 - E/K, or reject H_0i (and thus H_0) if

$$p_i < E/K .$$

That is, simply run each test at level E/K. The testing version controls the strong familywise error rate, and the confidence intervals are simultaneous. The tests and/or intervals need not be independent, of the same type, or related in any way.
Display 5.1: Summary of Bonferroni-style methods for K comparisons.

    Reject H_0(i) if                 Method        Control
    p_(i) < E/K                      Bonferroni    Simultaneous confidence intervals
    p_(j) < E/(K - j + 1)            Holm          Strong familywise error rate
      for all j = 1, ..., i
    p_(j) ≤ jE/K                     FDR           False discovery rate;
      for some j ≥ i                               needs independent tests
The Holm procedure is a modification of Bonferroni that controls the strong familywise error rate, but does not produce simultaneous confidence intervals (Holm 1979). Let p_(1), ..., p_(K) be the p-values for the K tests sorted into increasing order, and let H_0(i) be the null hypotheses sorted along with the p-values. Then reject H_0(i) if

$$p_{(j)} \le E/(K - j + 1) \quad\text{for all } j = 1, \ldots, i .$$

Thus we start with the smallest p-value; if it is rejected we consider the next smallest, and so on. We stop when we reach the first nonsignificant p-value. This is a little more complicated, but we gain some power since only the smallest p-value is compared to E/K.
The FDR method of Benjamini and Hochberg (1995) controls the false discovery rate. Once again, sort the p-values and the hypotheses. For the FDR, start with the largest p-value and work down. Reject H_0(i) if

$$p_{(j)} \le jE/K \quad\text{for some } j \ge i .$$

This procedure is correct when the tests are statistically independent. It controls the FDR, but not the strong familywise error rate.
The three Bonferroni methods are summarized in Display 5.1. Example 5.3 illustrates their use.
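The three rules in Display 5.1 are simple enough to implement directly; here is a minimal Python sketch (the function names are my own) that applies each rule to a list of p-values:

def bonferroni_reject(pvals, E=0.05):
    """Reject H_0i when p_i < E/K (Display 5.1)."""
    K = len(pvals)
    return [p < E / K for p in pvals]

def holm_reject(pvals, E=0.05):
    """Step down from the smallest p; stop at the first failure."""
    K = len(pvals)
    order = sorted(range(K), key=lambda i: pvals[i])
    reject = [False] * K
    for step, i in enumerate(order):          # step = j - 1 for j = 1, ..., K
        if pvals[i] <= E / (K - step):        # threshold E/(K - j + 1)
            reject[i] = True
        else:
            break
    return reject

def fdr_reject(pvals, E=0.05):
    """Benjamini-Hochberg; assumes statistically independent tests."""
    K = len(pvals)
    order = sorted(range(K), key=lambda i: pvals[i])
    reject = [False] * K
    for j in range(K, 0, -1):                 # start from the largest p
        if pvals[order[j - 1]] <= j * E / K:
            for i in order[:j]:               # reject this p and all smaller ones
                reject[i] = True
            break
    return reject

# e.g., the six flavor p-values of Example 5.3 (which follows):
print(sum(holm_reject([.40, .24, .01, .0001, .0001, .001], E=0.05)))   # 4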
Example 5.3  Sensory characteristics of cottage cheeses

Table 5.1 shows the results of an experiment comparing the sensory characteristics of nonfat, 2% fat, and 4% fat cottage cheese (Michicich 1995). The table shows the characteristics grouped by type and p-values for testing the null hypothesis that there was no difference between the three cheeses in the various sensory characteristics. There are 21 characteristics in three groups of sizes 7, 6, and 8.

How do we do multiple comparisons here? First we need to know:

1. Which error rate is of interest?

2. If we do choose an error rate other than the per comparison error rate, what is the appropriate family of tests? Is it all 21 characteristics, or separately within group of characteristic?

There is no automatic answer to either of these questions. The answers depend on the goals of the study, the tolerance of the investigator to Type I error, how the results of the study will be used, whether the investigator views the three groups of characteristics as distinct, and so on.

The last two columns of Table 5.1 give the results of the Bonferroni, Holm, and FDR procedures applied at the 5% level to all 21 comparisons and within each group. The p-values are compared to the criteria in Display 5.1 using K = 21 for the overall family and K of 7, 6, or 8 for by-group comparisons.
Consider the characteristic "cheesy flavor" with a .01 p-value. If we use the overall family, this is the tenth smallest p-value out of 21 p-values. The results are

    Bonferroni  The critical value is .05/21 = .0024; not significant.
    Holm        The critical value is .05/(21 - 10 + 1) = .0042; not significant.
    FDR         The critical value is 10 × .05/21 = .024; significant.

If we use the flavor family, this is the fourth smallest p-value out of six p-values. Now the results are

    Bonferroni  The critical value is .05/6 = .008; not significant.
    Holm        The critical value is .05/(6 - 4 + 1) = .017 (and all smaller p-values meet their critical values); significant.
    FDR         The critical value is 4 × .05/6 = .033; significant.
Table 5.1: Sensory attributes of three cottage cheeses: p-values and 5% significant results overall and familywise by type of attribute, using the Bonferroni (B), Holm (H), and FDR (F) methods.

    Characteristic        p-value   Overall   By group
    Appearance
      White                .004     F         BHF
      Yellow               .002     BHF       BHF
      Gray                 .13
      Curd size            .29
      Size uniformity      .73
      Shape uniformity     .08
      Liquid/solid ratio   .02      F         F
    Flavor
      Sour                 .40
      Sweet                .24
      Cheesy               .01      F         HF
      Rancid               .0001    BHF       BHF
      Cardboard            .0001    BHF       BHF
      Storage              .001     BHF       BHF
    Texture
      Breakdown rate       .001     BHF       BHF
      Firm                 .0001    BHF       BHF
      Sticky               .41
      Slippery             .07
      Heavy                .15
      Particle size        .42
      Runny                .002     BHF       BHF
      Rubbery              .006     F         BHF
These results illustrate that more null hypotheses are rejected considering each group of characteristics to be a family of tests rather than overall (the K is smaller for the individual groups), and fewer rejections are made using the more stringent error rates. Again, the choices of error rate and family of tests are not purely statistical, and controlling an error rate within a group of tests does not control that error rate for all tests.
5.3 The Scheffé Method for All Contrasts
The Scheffé method is a multiple comparisons technique for contrasts that produces simultaneous confidence intervals for any and all contrasts, including contrasts suggested by the data. Thus Scheffé is the appropriate technique for assessing contrasts that result from data snooping. This sounds like the ultimate in error rate control: arbitrarily many comparisons, even ones suggested from the data! The downside of this amazing protection is low power. Thus we only use the Scheffé method in those situations where we have a contrast suggested by the data, or many, many contrasts that cannot be handled by other techniques. In addition, pairwise comparison contrasts ȳ_i• - ȳ_j•, even pairwise comparisons suggested by the data, are better handled by methods specifically designed for pairwise comparisons.
We begin with the Scheffé test of the null hypothesis H_0: w({μ_i}) = 0 against a two-sided alternative. The Scheffé test statistic is the ratio

$$\frac{SS_w/(g-1)}{MS_E} ;$$

we get a p-value as the area under an F-distribution with g - 1 and ν degrees of freedom to the right of the test statistic. The degrees of freedom ν are from our denominator MS_E; ν = N - g for the completely randomized designs we have been considering so far. Reject the null hypothesis if this p-value is less than our Type I error rate E. In effect, the Scheffé procedure treats the mean square for any single contrast as if it were the full g - 1 degrees of freedom between-groups mean square.
There is also a Scheffé t-test for contrasts. Suppose that we are testing the null hypothesis H_0: w({μ_i}) = δ against a two-sided alternative. The Scheffé t-test controls the Type I error rate at E by rejecting the null hypothesis when

$$\frac{|w(\{\bar y_{i\bullet}\}) - \delta|}{\sqrt{MS_E \sum_{i=1}^{g} w_i^2/n_i}} > \sqrt{(g-1)F_{E,g-1,\nu}} ,$$

where F_{E,g-1,ν} is the upper E percent point of an F-distribution with g - 1 and ν degrees of freedom. Again, ν is the degrees of freedom for MS_E. For the usual null hypothesis value δ = 0, this is equivalent to the ratio-of-mean-squares version given above.
We may also use the Scheffé approach to form simultaneous confidence intervals for any w({μ_i}):

$$w(\{\bar y_{i\bullet}\}) \pm \sqrt{(g-1)F_{E,g-1,\nu}}\;\sqrt{MS_E \sum_{i=1}^{g} \frac{w_i^2}{n_i}} .$$
These Scheffé intervals have simultaneous 1 - E coverage over any set of contrasts, including contrasts suggested by the data.
Example 5.4  Acid rain and birch seedlings, continued

Example 3.1 introduced an experiment in which birch seedlings were exposed to various levels of artificial acid rain. The following table gives some summaries for the data:

    pH         4.7    4.0    3.3    3.0    2.3
    weight    .337   .296   .320   .298   .177
    n           48     48     48     48     48

The MS_E was .0119 with 235 degrees of freedom.

Inspection of the means shows that most of the response means are about .3, but the response for the pH 2.3 treatment is much lower. This suggests that a contrast comparing the pH 2.3 treatment with the mean of the other treatments would have a large value. The coefficients for this contrast are (.25, .25, .25, .25, -1). This contrast has value

$$\frac{.337 + .296 + .320 + .298}{4} - .177 = .1357$$

and standard error

$$\sqrt{.0119}\;\sqrt{\frac{.0625}{48} + \frac{.0625}{48} + \frac{.0625}{48} + \frac{.0625}{48} + \frac{1}{48}} = .0176 .$$

We must use the Scheffé procedure to construct a confidence interval or assess the significance of this contrast, because the contrast was suggested by the data. For a 99% confidence interval, the Scheffé multiplier is

$$\sqrt{4 \times F_{.01,4,235}} = 3.688 .$$

Thus the 99% confidence interval for this contrast is .1357 - 3.688 × .0176 up to .1357 + 3.688 × .0176, or (.0708, .2006). Alternatively, the t-statistic for testing the null hypothesis that the mean response in the last group is equal to the average of the mean responses in the other four groups is .1357/.0176 = 7.71. The Scheffé critical value for testing the null hypothesis at the E = .001 level is

$$\sqrt{(g-1)F_{E,g-1,N-g}} = \sqrt{4 \times F_{.001,4,235}} = \sqrt{4 \times 4.782} = 4.37 ,$$

so we can reject the null at the .001 level.
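These computations are easy to script; a minimal sketch (my own function, assuming SciPy for the F percent point) that reproduces the interval of Example 5.4:

from math import sqrt
from scipy import stats

def scheffe_ci(w, ybar, n, MSE, df_err, g, alpha=0.05):
    """Simultaneous 1-alpha Scheffe interval for a contrast, valid
    even for contrasts chosen after looking at the data."""
    est = sum(wi * yi for wi, yi in zip(w, ybar))
    se = sqrt(MSE * sum(wi**2 / ni for wi, ni in zip(w, n)))
    mult = sqrt((g - 1) * stats.f.ppf(1 - alpha, g - 1, df_err))
    return est - mult * se, est + mult * se

# pH 2.3 vs. the average of the other four treatments, 99% coverage
lo, hi = scheffe_ci(w=[.25, .25, .25, .25, -1],
                    ybar=[.337, .296, .320, .298, .177],
                    n=[48] * 5, MSE=.0119, df_err=235, g=5, alpha=0.01)
print(lo, hi)   # approx (.071, .201)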
Remember, it is not fair to hunt around through the data for a big contrast, test it, and think that you've only done one comparison.
5.4 Pairwise Comparisons
A pairwise comparison is a contrast that examines the difference between two treatment means, ȳ_i• - ȳ_j•. For g treatment groups, there are

$$\binom{g}{2} = \frac{g(g-1)}{2}$$

different pairwise comparisons. Pairwise comparisons procedures control a Type I error rate at E for all pairwise comparisons. If we data snoop, choose the biggest and smallest ȳ_i•'s, and take the difference, we have not made just one comparison; rather we have made all g(g-1)/2 pairwise comparisons and selected the largest. Controlling a Type I error rate for this greatest difference is one way to control the error rate for all differences.
As with many other inference problems, pairwise comparisons can be approached using confidence intervals or tests. That is, we may compute confidence intervals for the differences μ_i - μ_j or α_i - α_j, or test the null hypotheses H_0ij: μ_i = μ_j or H_0ij: α_i = α_j. Confidence regions for the differences of means are generally more informative than tests.
A pairwise comparisons procedure can generally be viewed as a critical value (or set of values) for the t-tests of the pairwise comparison contrasts. Thus we would reject the null hypothesis that μ_i - μ_j = 0 if

$$\frac{|\bar y_{i\bullet} - \bar y_{j\bullet}|}{\sqrt{MS_E}\,\sqrt{1/n_i + 1/n_j}} > u ,$$

where u is a critical value. Various pairwise comparisons procedures differ in how they define the critical value u, and u may depend on several things, including E, the degrees of freedom for MS_E, the number of treatments, the number of treatments with means between ȳ_i• and ȳ_j•, and the number of treatment comparisons with larger t-statistics.
An equivalent form of the test will reject if

$$|\bar y_{i\bullet} - \bar y_{j\bullet}| > u\,\sqrt{MS_E}\,\sqrt{1/n_i + 1/n_j} = D_{ij} .$$
If all sample sizes are equal and the critical value u is constant, then D_ij will be the same for all i, j pairs, and we would reject the null if any pair of treatments had mean responses that differed by D or more. This quantity D is called a significant difference; for example, using a Bonferroni adjustment to the g(g-1)/2 pairwise comparisons tests leads to a Bonferroni significant difference (BSD).

Confidence intervals for pairwise differences μ_i - μ_j can be formed from the pairwise tests via

$$(\bar y_{i\bullet} - \bar y_{j\bullet}) \pm u\,\sqrt{MS_E}\,\sqrt{1/n_i + 1/n_j} .$$
The remainder of this section presents methods for displaying the results of pairwise comparisons, introduces the Studentized range, discusses several pairwise comparisons methods, and then illustrates the methods with an example.
5.4.1 Displaying the results
Pairwise comparisons generate a lot of tests, so we need convenient and compact ways to present the results. An underline diagram is a graphical presentation of pairwise comparison results; construct the underline diagram in the following steps.

1. Sort the treatment means into increasing order and write out treatment labels (numbers or names) along a horizontal axis. The ȳ_i• values may be added if desired.

2. Draw a line segment under a group of treatments if no pair of treatments in that group is significantly different. Do not include short lines that are implied by long lines. That is, if treatments 4, 5, and 6 are not significantly different, only use one line under all of them, not a line under 4 and 5, and a line under 5 and 6, and a line under 4, 5, and 6.

Here is a sample diagram for three treatments that we label A, B, and C:

    C   A   B
    -----
        -----

This diagram includes treatment labels, but not treatment means. From this summary we can see that C can be distinguished from B (there is no underline that covers both B and C), but A cannot be distinguished from either B or C (there are underlines under A and C, and under A and B).
Note that there can be some confusion after pairwise comparisons. You must not confuse "is not significantly different from" or "cannot be distinguished from" with "is equal to". Treatment mean A cannot be equal to treatment means B and C and still have treatment means B and C not equal each other; such a pattern can hold, however, for results of significance tests.

There are also several nongraphical methods for displaying pairwise comparisons results. In one method, we sort the treatments into order of increasing means and print the treatment labels. Each treatment label is followed by one or more numbers (letters are sometimes used instead). Any treatments sharing a number (or letter) are not significantly different. Thus treatments sharing common numbers or letters are analogous to treatments being connected by an underline. The grouping letters are often put in parentheses or set as sub- or superscripts. The results in our sample underline diagram might thus be presented as one of the following:

    C (1)   A (12)   B (2)        C (a)   A (ab)   B (b)
    C_1     A_12     B_2          C_a     A_ab     B_b
There are several other variations on this theme.
A third way to present pairwise comparisons is as a table, with treatments labeling both rows and columns. Table elements can flag significant differences or contain confidence intervals for the differences. Only entries above or below the diagonal of the table are needed.
5.4.2 The Studentized range
The range of a set is the maximum value minus the minimum value, and Studentization means dividing a statistic by an estimate of its standard error. Thus the Studentized range for a set of treatment means is

$$\frac{\max_i \bar y_{i\bullet}}{\sqrt{MS_E/n}} - \frac{\min_j \bar y_{j\bullet}}{\sqrt{MS_E/n}} .$$

Note that we have implicitly assumed that all the sample sizes n_i are the same.
If all the treatments have the same mean, that is, if H_0 is true, then the Studentized range statistic follows the Studentized range distribution. Large values of the Studentized range are less likely under H_0 and more likely under the alternative when the means are not all equal, so we may use the Studentized range as a test statistic for H_0, rejecting H_0 when the Studentized range statistic is sufficiently large. This Studentized range test is a legitimate alternative to the ANOVA F-test.
The Studentized range distribution is important for pairwise comparisons because it is the distribution of the biggest (scaled) difference between treatment means when the null hypothesis is true. We will use it as a building block in several pairwise comparisons methods.

The Studentized range distribution depends only on g and ν, the number of groups and the degrees of freedom for the error estimate MS_E. The quantity q_E(g, ν) is the upper E percent point of the Studentized range distribution for g groups and ν error degrees of freedom; it is tabulated in Appendix Table D.8.
5.4.3 Simultaneous confidence intervals
The Tukey honest significant difference (HSD) is a pairwise comparisons technique that uses the Studentized range distribution to construct simultaneous confidence intervals for differences of all pairs of means. If we reject the null hypothesis H_0ij when the (simultaneous) confidence interval for μ_i - μ_j does not include 0, then the HSD also controls the strong familywise error rate.
The HSD uses the critical value

$$u(E, \nu, g) = \frac{q_E(g, \nu)}{\sqrt 2} ,$$

leading to

$$HSD = \frac{q_E(g,\nu)}{\sqrt 2}\sqrt{MS_E\left(\frac{1}{n} + \frac{1}{n}\right)} = q_E(g,\nu)\,\frac{\sqrt{MS_E}}{\sqrt n} .$$

Form simultaneous 1 - E confidence intervals via

$$\bar y_{i\bullet} - \bar y_{j\bullet} \pm \frac{q_E(g,\nu)}{\sqrt 2}\sqrt{MS_E\left(\frac{1}{n} + \frac{1}{n}\right)} .$$

The degrees of freedom ν are the degrees of freedom for the error estimate MS_E.
Strictly speaking, the HSD is only applicable to the equal sample size situation. For the unequal sample size case, the approximate HSD is

$$HSD_{ij} = q_E(g,\nu)\,\sqrt{\frac{MS_E}{2 n_i n_j / (n_i + n_j)}}$$

or, equivalently,

$$HSD_{ij} = \frac{q_E(g,\nu)}{\sqrt 2}\sqrt{MS_E\left(\frac{1}{n_i} + \frac{1}{n_j}\right)} .$$

This approximate HSD, often called the Tukey-Kramer form, tends to be slightly conservative (that is, the true error rate is slightly less than E).
The Bonferroni significant difference (BSD) is simply the application of the Bonferroni technique to the pairwise comparisons problem to obtain

$$u = u(E, \nu, K) = t_{E/(2K),\,\nu} ,$$

$$BSD_{ij} = t_{E/(2K),\,\nu}\,\sqrt{MS_E}\,\sqrt{1/n_i + 1/n_j} ,$$

where K is the number of pairwise comparisons. We have K = g(g-1)/2 for all pairwise comparisons between g groups. BSD produces simultaneous confidence intervals and controls the strong familywise error rate.
When making all pairwise comparisons, the HSD is less than the BSD. Thus we prefer the HSD to the BSD for all pairwise comparisons, because the HSD will produce shorter confidence intervals that are still simultaneous. When only a preplanned subset of all the pairs is being considered, the BSD may be less than and thus preferable to the HSD.
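Tables of q_E(g, ν) can be replaced by software. A minimal sketch (my own function, assuming SciPy 1.7 or later, which provides stats.studentized_range) computing the Tukey-Kramer significant difference for one pair:

from math import sqrt
from scipy import stats

def tukey_hsd(g, df_err, MSE, n_i, n_j, alpha=0.05):
    """Tukey(-Kramer) significant difference for one pair of means."""
    q = stats.studentized_range.ppf(1 - alpha, g, df_err)  # q_E(g, nu)
    return (q / sqrt(2)) * sqrt(MSE * (1 / n_i + 1 / n_j))

# Example 5.5 (which follows): g=4 treatments, 4 error df, MSE=.1572, n=2
print(tukey_hsd(g=4, df_err=4, MSE=.1572, n_i=2, n_j=2, alpha=0.10))
# approx 1.286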
Example 5.5  Free amino acids in cheese

Cheese is produced by bacterial fermentation of milk. Some bacteria in cheese are added by the cheese producer. Other bacteria are present but were not added deliberately; these are called nonstarter bacteria. Nonstarter bacteria vary from facility to facility and are believed to influence the quality of cheese.

Two strains (A and B) of nonstarter bacteria were isolated at a premium cheese facility. These strains will be added experimentally to cheese to determine their effects. Eight cheeses are made. These cheeses all get a standard starter bacteria. In addition, two cheeses will be randomly selected for each of the following four treatments: control, add strain A, add strain B, or add both strains A and B. Table 5.2 gives the total free amino acids in the cheeses after 168 days of ripening. (Free amino acids are thought to contribute to flavor.)

Table 5.2: Total free amino acids in cheeses after 168 days of ripening.

    Strain added
    None      A       B      A&B
    4.195   4.125   4.865   6.155
    4.175   4.735   5.745   6.488
Listing 5.1 gives Minitab output showing an Analysis of Variance for these data (x), as well as HSD comparisons (called "Tukey's pairwise comparisons") using E = .1 (y); we use the MS_E from this ANOVA in constructing the HSD. HSD is appropriate if we want simultaneous confidence intervals on the pairwise differences. The HSD is

$$\frac{q_E(g,\nu)}{\sqrt 2}\sqrt{MS_E\left(\frac{1}{n_i} + \frac{1}{n_j}\right)} = \frac{q_{.1}(4,4)}{\sqrt 2}\sqrt{.1572\left(\frac{1}{2} + \frac{1}{2}\right)} = 4.586 \times .3965/1.414 = 1.286 .$$

We form confidence intervals as the observed difference in treatment means, plus or minus 1.286; so for A&B minus control, we have

$$6.322 - 4.185 \pm 1.286 \quad\text{or}\quad (.851, 3.423) .$$

In fact, only two confidence intervals for pairwise differences do not include zero (see Listing 5.1 y). The underline diagram is:

    C      A      B      A&B
    4.19   4.43   5.31   6.32
    --------------------
                  -----------

Note in Listing 5.1 (y) that Minitab displays pairwise comparisons as a table of confidence intervals for differences.

Listing 5.1: Minitab output for free amino acids in cheese.

    Source   DF      SS      MS      F      P                          x
    Trt       3   5.628   1.876  11.93  0.018
    Error     4   0.629   0.157
    Total     7   6.257

                                    Individual 95% CIs For Mean
                                    Based on Pooled StDev
    Level    N    Mean   StDev  ------+---------+---------+---------+
    A        2  4.4300  0.4313    (------*-------)
    A+B      2  6.3215  0.2355                  (-------*-------)
    B        2  5.3050  0.6223         (-------*-------)
    control  2  4.1850  0.0141  (-------*-------)
                                ------+---------+---------+---------+
    Pooled StDev = 0.3965          4.0       5.0       6.0       7.0

    Tukey's pairwise comparisons                                      y

    Family error rate = 0.100
    Individual error rate = 0.0315
    Critical value = 4.59

    Intervals for (column level mean) - (row level mean)
                   A        A+B        B
    A+B      -3.1784
             -0.6046
    B        -2.1619    -0.2704
              0.4119     2.3034
    control  -1.0419     0.8496   -0.1669
              1.5319     3.4234    2.4069

    Fisher's pairwise comparisons                                     z

    Family error rate = 0.283
    Individual error rate = 0.100
    Critical value = 2.132

    Intervals for (column level mean) - (row level mean)
                   A        A+B        B
    A+B      -2.7369
             -1.0461
    B        -1.7204     0.1711
             -0.0296     1.8619
    control  -0.6004     1.2911    0.2746
              1.0904     2.9819    1.9654
5.4.4 Strong familywise error rate
A step-down method is a procedure for organizing pairwise comparisons starting with the most extreme pair and then working in. Relabel the groups so that the sample means are in increasing order, with ȳ_(1)• smallest and ȳ_(g)• largest. (The relabeled estimated effects will also be in increasing order, but the relabeled true effects may or may not be in increasing order.) With this ordering, ȳ_(1)• to ȳ_(g)• is a stretch of g means, ȳ_(1)• to ȳ_(g-1)• is a stretch of g - 1 means, and ȳ_(i)• to ȳ_(j)• is a stretch of j - i + 1 means. In a step-down procedure, all comparisons for stretches of k means use the same critical value, but we may use different critical values for different k. This has the advantage that we can use larger critical values for long stretches and smaller critical values for short stretches.
Begin with the most extreme pair (1) and (g). Test the null hypothesis that all the means for (1) up through (g) are equal. If you fail to reject, declare all means equal and stop. If you reject, declare (1) different from (g) and go on to the next step. At the next step, we consider the stretches (1) through (g - 1) and (2) through (g). If one of these rejects, we declare its ends to be different and then look at shorter stretches within it. If we fail to reject for a stretch, we do not consider any substretches within the stretch. We repeat this subdivision till there are no more rejections. In other words, we declare that means (i) and (j) are different if the stretch from (i) to (j) rejects its null hypothesis and all stretches containing (i) to (j) also reject their null hypotheses.
The REGWR procedure is a step-down range method that controls the strong familywise error rate without producing simultaneous confidence intervals. The awkward name REGWR abbreviates the Ryan-Einot-Gabriel-Welsch range test, named for the authors who worked on it. The REGWR critical value for testing a stretch of length k depends on E, ν, k, and g. Specifically, we use

$$u = u(E, \nu, k, g) = q_E(k, \nu)/\sqrt 2 \qquad k = g,\, g-1,$$

and

$$u = u(E, \nu, k, g) = q_{kE/g}(k, \nu)/\sqrt 2 \qquad k = g-2,\, g-3,\, \ldots,\, 2.$$

This critical value derives from a Studentized range with k groups, and we use percent points with smaller tail areas as we move in to smaller stretches. As with the HSD, REGWR error rate control is approximate when the sample sizes are not equal.
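A sketch of the REGWR significant differences as just defined (my own function; equal sample sizes n, and SciPy's studentized_range assumed):

from math import sqrt
from scipy import stats

def regwr_significant_difference(k, g, df_err, MSE, n, alpha=0.05):
    """REGWR significant difference for a stretch of k of g means."""
    a = alpha if k >= g - 1 else k * alpha / g    # smaller tail area inward
    q = stats.studentized_range.ppf(1 - a, k, df_err)
    return (q / sqrt(2)) * sqrt(MSE * (1 / n + 1 / n))

# Example 5.6 (which follows): g=4, 4 error df, MSE=.1572, n=2, E=.10
for k in (2, 3, 4):
    print(k, regwr_significant_difference(k, 4, 4, .1572, 2, alpha=.10))
# approx 1.101, 1.115, 1.286 using the kE/g rule above; Listing 5.2's
# REGWQ shows 1.091 for k=2 because SAS uses a slightly different
# small-stretch error rate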
Example 5.6  Free amino acids in cheese, continued

Suppose that we only wished to control the strong familywise error rate instead of producing simultaneous confidence intervals. Then we could use REGWR instead of HSD and could potentially see additional significant differences. Listing 5.2 (y) gives SAS output for REGWR (called REGWQ in SAS) for the amino acid data.

Listing 5.2: SAS output for free amino acids in cheese.

    Student-Newman-Keuls test for variable: FAA                       x
    Alpha= 0.1 df= 4 MSE= 0.157224
    Number of Means          2          3          4
    Critical Range     0.84531  1.1146718  1.2859073
    Means with the same letter are not significantly different.
    SNK Grouping      Mean   N   TRT
    A               6.3215   2   4
    B               5.3050   2   3
    C               4.4300   2   2
    C
    C               4.1850   2   1

    Ryan-Einot-Gabriel-Welsch Multiple Range Test for variable: FAA   y
    Alpha= 0.1 df= 4 MSE= 0.157224
    Number of Means          2          3          4
    Critical Range   1.0908529  1.1146718  1.2859073
    Means with the same letter are not significantly different.
    REGWQ Grouping    Mean   N   TRT
    A               6.3215   2   4
    A
    B A             5.3050   2   3
    B
    B C             4.4300   2   2
    C
    C               4.1850   2   1

REGWR is a step-down method that begins like the HSD. Comparing C and A&B, we conclude as in the HSD that they are different. We may now compare C with B and A with A&B. These are comparisons that involve stretches of k = 3 means; since k = g - 1, we still use E as the error rate. The significant difference for these comparisons is

$$\frac{q_E(k,\nu)}{\sqrt 2}\sqrt{MS_E\left(\frac{1}{n_i} + \frac{1}{n_j}\right)} = \frac{q_{.1}(3,4)}{\sqrt 2}\sqrt{.1572\left(\frac{1}{2} + \frac{1}{2}\right)} = 1.115 .$$
Both the B-C and A&B-A differences (1.12 and 1.89) exceed this cutoff, so REGWR concludes that B differs from C, and A differs from A&B. Recall that the HSD did not distinguish C from B.

Having concluded that there are B-C and A&B-A differences, we can now compare stretches of means within them, namely C to A, A to B, and B to A&B. These are stretches of k = 2 means, so for REGWR we use the error rate kE/g = .05. The significant difference for these comparisons is

$$\frac{q_{kE/g}(k,\nu)}{\sqrt 2}\sqrt{MS_E\left(\frac{1}{n_i} + \frac{1}{n_j}\right)} = \frac{q_{.05}(2,4)}{\sqrt 2}\sqrt{.1572\left(\frac{1}{2} + \frac{1}{2}\right)} = 1.101 .$$
None of the three differences exceeds this cutoff, so we fail to conclude that those treatments differ, and we are finished. The underline diagram is:

    C      A      B      A&B
    4.19   4.43   5.31   6.32
    ------------
           -------------
                  -----------

Note in Listing 5.2 (y) that SAS displays pairwise comparisons using what amounts to an underline diagram turned on its side, with vertical lines formed by letters.
5.4.5 False discovery rate
The Student-Newman-Keuls (SNK) procedure is a step-down method that uses the Studentized range test with critical value

$$u = u(E, \nu, k, g) = q_E(k, \nu)/\sqrt 2$$

for a stretch of k means. This is similar to REGWR, except that we keep the percent point of the Studentized range constant as we go to shorter stretches. The SNK controls the false discovery rate, but not the strong familywise error rate. As with the HSD, SNK error rate control is approximate when the sample sizes are not equal.
Example 5.7  Free amino acids in cheese, continued

Suppose that we only wished to control the false discovery rate; now we would use SNK instead of the more stringent HSD or REGWR. Listing 5.2 (x) gives SAS output for SNK for the amino acid data.

SNK is identical to REGWR in the first two stages, so SNK will also get to the point of making the comparisons of the three pairs C to A, A to B, and B to A&B. However, the SNK significant difference for these pairs is less than that used in REGWR:

$$\frac{q_E(k,\nu)}{\sqrt 2}\sqrt{MS_E\left(\frac{1}{n_i} + \frac{1}{n_j}\right)} = \frac{q_{.1}(2,4)}{\sqrt 2}\sqrt{.1572\left(\frac{1}{2} + \frac{1}{2}\right)} = .845 .$$

Both the B-A and A&B-B differences (.88 and 1.02) exceed the cutoff, but the A-C difference (.25) does not. The underline diagram for SNK is:

    C      A      B      A&B
    4.19   4.43   5.31   6.32
    ------------
5.4.6 Experimentwise error rate
The Analysis of Variance F-test for equality of means controls the experimentwise error rate. Thus investigating pairwise differences only when the F-test has a p-value less than E will control the experimentwise error rate. This is the basis for the protected least significant difference, or protected LSD. If the F-test rejects at level E, then do simple t-tests at level E among the different treatments.

The critical values are from a t-distribution:

$$u(E, \nu) = t_{E/2,\,\nu} ,$$

leading to the significant difference

$$LSD = t_{E/2,\,\nu}\,\sqrt{MS_E}\,\sqrt{1/n_i + 1/n_j} .$$
As usual, ν is the degrees of freedom for MS_E, and t_{E/2,ν} is the upper E/2 percent point of a t-curve with ν degrees of freedom.

Confidence intervals produced from the protected LSD do not have the anticipated 1 - E coverage rate, either individually or simultaneously. See Section 5.7.
Example 5.8  Free amino acids in cheese, continued

Finally, suppose that we only wish to control the experimentwise error rate. Protected LSD will work here. Listing 5.1 (x) shows that the ANOVA F-test is significant at level E, so we may proceed with pairwise comparisons. Listing 5.1 (z) shows Minitab output for the LSD (called "Fisher's pairwise comparisons") as confidence intervals.

LSD uses the same significant difference for all pairs:

$$t_{E/2,\,\nu}\,\sqrt{MS_E\left(\frac{1}{n_i} + \frac{1}{n_j}\right)} = t_{.05,4}\,\sqrt{.1572\left(\frac{1}{2} + \frac{1}{2}\right)} = .845 .$$
This is the same as the SNK comparison for a stretch of length 2. All differences except A-C exceed the cutoff, so the underline diagram for LSD is:

    C      A      B      A&B
    4.19   4.43   5.31   6.32
    ------------
5.4.7 Comparisonwise error rate
Ordinary t-tests and confidence intervals without any adjustment control the comparisonwise error rate. In the context of pairwise comparisons, this is called the least significant difference (LSD) method.

The critical values are the same as for the protected LSD:

$$u(E, \nu) = t_{E/2,\,\nu} ,$$

and

$$LSD = t_{E/2,\,\nu}\,\sqrt{MS_E}\,\sqrt{1/n_i + 1/n_j} .$$
5.4.8 Pairwise testing reprise
It is easy to get overwhelmed by the abundance of methods, and there are still more that we haven't discussed. Your anchor in all this is your error rate. Once you have determined your error rate, the choice of method is reasonably automatic, as summarized in Display 5.2. Your choice of error rate is determined by the needs of your study, bearing in mind that the more stringent error rates have fewer false rejections, and also fewer correct rejections.

Display 5.2: Summary of pairwise comparison methods.

    Error rate                            Method
    Simultaneous confidence intervals     BSD or HSD
    Strong familywise                     REGWR
    False discovery rate                  SNK
    Experimentwise                        Protected LSD
    Comparisonwise                        LSD
5.4.9 Pairwise comparisons methods that do not control combined Type I error rates

There are many other pairwise comparisons methods beyond those already mentioned. In this section we discuss two methods that are motivated by completely different criteria than controlling a combined Type I error rate. These two techniques do not control the experimentwise error rate or any of the more stringent error rates, and you should not use them with the expectation that they do. You should only use them when the situation and assumptions under which they were developed are appropriate for your experimental analysis.
Suppose that you believe a priori that the overall null hypothesis H_0 is less and less likely to be true as the number of treatments increases. Then the strength of evidence required to reject H_0 should decrease as the number of groups increases. Alternatively, suppose that there is a quantifiable penalty for each incorrect (pairwise comparison) decision we make, and that the total loss for the overall test is the sum of the losses from the individual decisions. Under either of these assumptions, the Duncan multiple range (given below) or something like it is appropriate. Note by comparison that the procedures that control combined Type I error rates require more evidence to reject H_0 as the number of groups increases, while Duncan's method requires less. Also, a procedure that controls the experimentwise error rate has a penalty of 1 if there are any rejections when H_0 is true and a penalty of 0 otherwise; this is very different from the summed loss that leads to Duncan's multiple range.
Duncan's multiple range (sometimes called Duncan's test or Duncan's new multiple range) is a step-down Studentized range method. You specify a protection level E and proceed in step-down fashion using
$$u = u(E,\nu,k,g) = q_{1-(1-E)^{k-1}}(k,\nu)/\sqrt{2}$$
for the critical values. Notice that E is the comparisonwise error rate for testing a stretch of length 2, and the experimentwise error rate will be $1-(1-E)^{g-1}$, which can be considerably more than E. Thus fixing Duncan's protection level at E does not control the experimentwise error rate or any more stringent rate. Do not use Duncan's procedure if you are interested in controlling any of the combined Type I error rates.
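For illustration, Duncan's critical values can be computed from the Studentized range distribution. This sketch assumes scipy 1.7 or later (which provides scipy.stats.studentized_range); the settings E = .05, ν = 20, g = 6 are arbitrary illustration values:

```python
# Duncan step-down critical values u(E, nu, k, g); a sketch assuming
# scipy >= 1.7 for scipy.stats.studentized_range.
from math import sqrt
from scipy.stats import studentized_range

def duncan_u(E, nu, k):
    alpha_k = 1 - (1 - E) ** (k - 1)  # upper-tail level for a stretch of length k
    return studentized_range.ppf(1 - alpha_k, k, nu) / sqrt(2)

E, nu, g = 0.05, 20, 6  # arbitrary illustration values
for k in range(2, g + 1):
    print(f"k = {k}: u = {duncan_u(E, nu, k):.3f}")

# The implied experimentwise error rate is much larger than E:
print(f"experimentwise rate: {1 - (1 - E) ** (g - 1):.3f}")  # about 0.226
```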
As a second alternative to combined Type I error rates, suppose that our interest is in predicting future observations from the treatment groups, and that we would like to have a prediction method that makes the average squared prediction error small. One way to do this prediction is to first partition the g treatments into p classes, $1 \le p \le g$; second, find the average response in each of these p classes; and third, predict a future observation from a treatment by the observed mean response of the class for the treatment. We thus look for partitions that will lead to good predictions.
One way to choose among the partitions is to use Mallows' $C_p$ statistic:
$$C_p = \frac{SSR_p}{MS_E} + 2p - N\,,$$
where $SSR_p$ is the sum of squared errors for the Analysis of Variance partitioning the data into p groups. Partitions with low values of $C_p$ should give better predictions.
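As an illustration, we can search every partition of the cheese example's four treatments. This is a sketch under the assumption that $SSR_p$ for a partition equals the full-model SSE plus the between-treatment sum of squares within each class (which holds for these balanced data); the helper names are our own:

```python
# C_p search over all partitions of the cheese treatments; a sketch using
# summary statistics only (means, n = 2 per treatment, MS_E = .1572 on 4 df).
def set_partitions(items):
    """Generate all partitions of a list into nonempty classes."""
    if len(items) == 1:
        yield [items]
        return
    first, rest = items[0], items[1:]
    for part in set_partitions(rest):
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        yield [[first]] + part

means = {"C": 4.19, "A": 4.43, "B": 5.31, "A&B": 6.32}
n, mse, err_df = 2, 0.1572, 4
N = n * len(means)
sse_full = err_df * mse  # residual SS when every treatment is its own class

results = []
for part in set_partitions(list(means)):
    # SSR_p = full-model SSE + between-treatment SS within each class
    extra = 0.0
    for cls in part:
        ybar = sum(means[trt] for trt in cls) / len(cls)
        extra += sum(n * (means[trt] - ybar) ** 2 for trt in cls)
    p = len(part)
    results.append(((sse_full + extra) / mse + 2 * p - N, part))

for cp, part in sorted(results)[:3]:
    print(f"C_p = {cp:5.2f}  classes = {part}")
```

Under these numbers the lowest $C_p$ groups C with A and leaves B and A&B in classes of their own, echoing the LSD underline diagram above.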
This predictive approach makes no attempt to control any Type I error rate; in fact, the Type I error rate is .15 or greater even for g = 2 groups! This approach is useful when prediction is the goal, but can be quite misleading if interpreted as a test of $H_0$.
5.4.10 Confident directions
In our heart of hearts, we often believe that all treatment means differ when examined sufficiently precisely. Thus our concern with null hypotheses $H_{0ij}$ is misplaced. As an alternative, we can make statements of direction. After having collected data, we consider $\mu_i$ and $\mu_j$; assume $\mu_i < \mu_j$. We could decide from the data that $\mu_i < \mu_j$, or that $\mu_i > \mu_j$, or that we don't know; that is, we don't have enough information to decide. These decisions correspond in the testing paradigm to rejecting $H_{0ij}$ in favor of $\mu_i < \mu_j$, rejecting $H_{0ij}$ in favor of $\mu_j < \mu_i$, and failing to reject $H_{0ij}$. In the confident directions framework, only the decision $\mu_i > \mu_j$ is an error. See Tukey (1991).
Confident directions procedures are pairwise comparisons testing procedures, but with results interpreted in a directional context. Confident directions procedures bound error rates when making statements about direction.
If a testing procedure bounds an error rate at E, then the corresponding confident directions procedure bounds a confident directions error rate at E/2, the factor of 2 arising because we cannot falsely reject in the correct direction.
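A small simulation (a hypothetical setup, not from the text) illustrates the factor of 2: with two equal means, a two-sided test at level E rejects about E of the time, but only about half of those rejections assert any particular direction:

```python
# Monte Carlo illustration of the E/2 directional bound; hypothetical
# settings (normal data, n = 10 per group, E = .10).
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
E, reps, claims_gt = 0.10, 20000, 0
for _ in range(reps):
    x = rng.normal(size=10)  # group i
    y = rng.normal(size=10)  # group j, same mean as group i
    stat, p = ttest_ind(x, y)
    if p < E and stat > 0:   # reject and declare mu_i > mu_j
        claims_gt += 1

# With equal means, this particular directional claim occurs ~E/2 of the time.
print(claims_gt / reps)  # close to 0.05
```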
Let us reinterpret our usual error rates in terms of directions. Suppose that we use a pairwise comparisons procedure with error rate bounded at E. In a confident directions setting, we have the following:

Strong familywise: The probability of making any incorrect statements of direction is bounded by E/2.

FDR: Incorrect statements of direction will on average be no more than a fraction E/2 of the total number of statements of direction.

Experimentwise: The probability of making any incorrect statements of direction when all the means are very nearly equal is bounded by E/2.

Comparisonwise: The probability of making an incorrect statement of direction for a given comparison is bounded by E/2.
There is no directional analog of simultaneous confidence intervals, so procedures that produce simultaneous intervals should be considered procedures that control the strong familywise error rate (which they do).
5.5 Comparison with Control or the Best
There are some situations where we do not do all pairwise comparisons, but rather make comparisons between a control and the other treatments, or the best responding treatment (highest or lowest average) and the other treatments. For example, you may be producing new standardized mathematics tests for elementary school children, and you need to compare the new