0% found this document useful (0 votes)

100 views223 pages

Book

This document provides an introduction and overview of nonparametric statistics. Chapter 1 covers probability background and elementary frequentist inference tasks. Chapter 2 discusses one-sample nonparametric inference methods, including median-based tests and confidence intervals. Chapter 3 focuses on two-sample testing and ranking methods like the Mann-Whitney-Wilcoxon test. Chapter 4 extends these methods to tests for three or more groups, including the Kruskal-Wallis test. Chapter 5 presents approaches for blocked and paired data, including nonparametric analysis of variance methods.

Uploaded by

Alvaro

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

100 views223 pages

Book

Uploaded by

Alvaro

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 223

John E.

Kolassa

An Introduction to
Nonparametric Statistics
Contents

Introduction vii

1 Background 1
1.1 Probability Background . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Probability Distributions for Observations . . . . . . . 1
1.1.1.1 Gaussian Distribution . . . . . . . . . . . . . 1
1.1.1.2 Uniform Distribution . . . . . . . . . . . . . 2
1.1.1.3 Laplace Distribution . . . . . . . . . . . . . . 3
1.1.1.4 Cauchy Distribution . . . . . . . . . . . . . . 3
1.1.1.5 Logistic Distribution . . . . . . . . . . . . . . 4
1.1.1.6 Exponential Distribution . . . . . . . . . . . 4
1.1.2 Location and Scale Families . . . . . . . . . . . . . . . 4
1.1.3 Sampling Distributions . . . . . . . . . . . . . . . . . 5
1.1.3.1 Binomial Distribution . . . . . . . . . . . . . 5
1.1.4 χ2 -distribution . . . . . . . . . . . . . . . . . . . . . . 5
1.1.5 T -distribution . . . . . . . . . . . . . . . . . . . . . . . 6
1.1.6 F -distribution . . . . . . . . . . . . . . . . . . . . . . 6
1.2 Elementary Tasks in Frequentist Inference . . . . . . . . . . 6
1.2.1 Hypothesis Testing . . . . . . . . . . . . . . . . . . . . 6
1.2.1.1 One-Sided Hypothesis Tests . . . . . . . . . 7
1.2.1.2 Two-sided Hypothesis Tests . . . . . . . . . . 9
1.2.1.3 P -values . . . . . . . . . . . . . . . . . . . . 10
1.2.2 Confidence Intervals . . . . . . . . . . . . . . . . . . . 11
1.2.2.1 P -value Inversion . . . . . . . . . . . . . . . 11
1.2.2.2 Test Inversion with Pivotal Statistics . . . . 12
1.2.2.3 A Problematic Example . . . . . . . . . . . . 12
1.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2 One-Sample Nonparametric Inference 15

2.1 Parametric Inference . . . . . . . . . . . . . . . . . . . . . . 15
2.1.1 Estimation using Averages . . . . . . . . . . . . . . . 15
2.1.2 One-Sample Testing for Gaussian Observations . . . . 16
2.2 The Need for Distribution-Free Tests . . . . . . . . . . . . . 16
2.3 One-Sample Median Methods . . . . . . . . . . . . . . . . . 17
2.3.1 Estimates of the Population Median . . . . . . . . . . 18
2.3.2 Hypothesis Tests Concerning the Population Median . 19

i
ii Contents

2.3.3 Confidence Intervals for the Median . . . . . . . . . . 24

2.3.4 Inference for Other Quantiles . . . . . . . . . . . . . . 27
2.4 Comparing Tests . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.4.1 Power, Sample Size, and Effect Size . . . . . . . . . . 29
2.4.1.1 Power . . . . . . . . . . . . . . . . . . . . . . 29
2.4.1.2 Sample and Effect Sizes . . . . . . . . . . . . 30
2.4.2 Efficiency Calculations . . . . . . . . . . . . . . . . . . 31
2.4.3 Examples of Power Calculations . . . . . . . . . . . . 33
2.5 Distribution Function Estimation . . . . . . . . . . . . . . . 34
2.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

3 Two-Sample Testing 39
3.1 Two-Sample Approximately Gaussian Inference . . . . . . . 39
3.1.1 Inference on Expectations . . . . . . . . . . . . . . . . 39
3.1.2 Inference on Dispersions . . . . . . . . . . . . . . . . . 40
3.2 General Two-Sample Rank Tests . . . . . . . . . . . . . . . . 41
3.2.1 Null Distributions of General Rank Statistics . . . . . 41
3.2.2 Moments of Rank Statistics . . . . . . . . . . . . . . . 42
3.3 A First Distribution-Free Test . . . . . . . . . . . . . . . . . 43
3.4 The Mann-Whitney-Wilcoxon Test . . . . . . . . . . . . . . 46
3.4.1 Exact and Approximate Mann-Whitney Probabilities 48
3.4.1.1 Moments and Approximate Normality . . . . 48
3.4.2 Other Scoring Schemes . . . . . . . . . . . . . . . . . . 50
3.4.3 Using Data as Scores: the Permutation Test . . . . . . 51
3.5 Empirical Levels and Powers of Two-Sample Tests . . . . . . 52
3.6 Adaptation to the Presence of Tied Observations . . . . . . . 53
3.7 Mann-Whitney-Wilcoxon Null Hypotheses . . . . . . . . . . 54
3.8 Efficiency and Power of Two-Sample Tests . . . . . . . . . . 55
3.8.1 Efficacy of the Gaussian-Theory Test . . . . . . . . . . 55
3.8.2 Efficacy of the Mann-Whitney-Wilcoxon Test . . . . . 55
3.8.3 Summarizing Asymptotic Relative Efficiency . . . . . 56
3.8.4 Power for Mann-Whitney-Wilcoxon Testing . . . . . . 56
3.9 Testing Equality of Dispersion . . . . . . . . . . . . . . . . . 58
3.10 Two-Sample Estimation and Confidence Intervals . . . . . . 59
3.10.1 Inversion of the Mann-Whitney-Wilcoxon Test . . . . 60
3.11 Tests for Broad Alternatives . . . . . . . . . . . . . . . . . . 62
3.12 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

4 Methods for Three or More Groups 67

4.1 Gaussian-Theory Methods . . . . . . . . . . . . . . . . . . . 67
4.1.1 Contrasts . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.1.2 Multiple Comparisons . . . . . . . . . . . . . . . . . . 69
4.2 General Rank Tests . . . . . . . . . . . . . . . . . . . . . . . 71
4.2.1 Moments of General Rank Sums . . . . . . . . . . . . 71
4.2.2 Construction of a Chi-Square-Distributed Statistic . . 72
Contents iii

4.3 The Kruskal-Wallis Test . . . . . . . . . . . . . . . . . . . . . 74

4.3.1 Kruskal-Wallis Approximate Critical Values . . . . . . 74
4.4 Other Scores for Multi-Sample Rank Based Tests . . . . . . 76
4.5 Multiple Comparisons . . . . . . . . . . . . . . . . . . . . . . 78
4.6 Ordered Alternatives . . . . . . . . . . . . . . . . . . . . . . 80
4.7 Powers of Tests . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.7.1 Power of Tests for Ordered Alternatives . . . . . . . . 82
4.7.2 Power of Tests for Unordered Alternatives . . . . . . . 83
4.8 Efficiency Calculations . . . . . . . . . . . . . . . . . . . . . 88
4.8.1 Ordered Alternatives . . . . . . . . . . . . . . . . . . . 89
4.8.2 Unordered Alternatives . . . . . . . . . . . . . . . . . 89
4.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

5 Group Differences with Blocking 93

5.1 Gaussian Theory Approaches . . . . . . . . . . . . . . . . . . 93
5.1.1 Paired Comparisons . . . . . . . . . . . . . . . . . . . 93
5.1.2 Multiple Group Comparisons . . . . . . . . . . . . . . 93
5.2 Nonparametric Paired Comparisons . . . . . . . . . . . . . . 94
5.2.1 Estimating the Population Median Difference . . . . . 97
5.2.2 Confidence Intervals . . . . . . . . . . . . . . . . . . . 98
5.2.3 Signed-Rank Statistic Alternative Distribution . . . . 99
5.3 Two-Way Non-Parametric Analysis of Variance . . . . . . . 100
5.3.1 Distribution of Rank Sums . . . . . . . . . . . . . . . 100
5.4 A Generalization of the Test of Friedman . . . . . . . . . . . 101
5.4.1 The Balanced Case . . . . . . . . . . . . . . . . . . . . 102
5.4.2 The Unbalanced Case . . . . . . . . . . . . . . . . . . 103
5.5 Multiple Comparisons and Scoring . . . . . . . . . . . . . . . 105
5.6 Tests for a Putative Ordering in Two-Way Layouts . . . . . 106
5.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

6 Bivariate Methods 111

6.1 Parametric Approach . . . . . . . . . . . . . . . . . . . . . . 111
6.2 Permutation Inference . . . . . . . . . . . . . . . . . . . . . . 112
6.3 Nonparametric Correlation . . . . . . . . . . . . . . . . . . . 113
6.3.1 Rank Correlation . . . . . . . . . . . . . . . . . . . . . 113
6.3.1.1 Alternative Expectation of the Spearman Cor-
relation . . . . . . . . . . . . . . . . . . . . . 116
6.3.2 Kendall’s τ . . . . . . . . . . . . . . . . . . . . . . . . 116
6.4 Bivariate Semi-Parametric Estimation via Correlation . . . . 119
6.4.1 Inversion of the Test of Zero Correlation . . . . . . . . 119
6.4.1.1 Inversion of the Pearson Correlation . . . . . 119
6.4.1.2 Inversion of Kendall’s τ . . . . . . . . . . . . 120
6.4.1.3 Inversion of the Spearman Correlation . . . . 121
6.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
iv Contents

7 Multivariate Analysis 127

7.1 Standard Parametric Approaches . . . . . . . . . . . . . . . 127
7.1.1 Multivariate Estimation . . . . . . . . . . . . . . . . . 127
7.1.2 One-Sample Testing . . . . . . . . . . . . . . . . . . . 127
7.1.3 Two-Sample Testing . . . . . . . . . . . . . . . . . . . 128
7.2 Nonparametric Multivariate Estimation . . . . . . . . . . . . 128
7.2.1 Equivariance Properties . . . . . . . . . . . . . . . . . 130
7.3 Nonparametric One-Sample Testing Approaches . . . . . . . 130
7.3.1 More General Permutation Solutions . . . . . . . . . . 133
7.4 Confidence Regions for a Vector Shift Parameter . . . . . . . 134
7.5 Two-Sample Methods . . . . . . . . . . . . . . . . . . . . . . 134
7.5.1 Hypothesis Testing . . . . . . . . . . . . . . . . . . . . 135
7.5.1.1 Permutation Testing . . . . . . . . . . . . . . 135
7.5.1.2 Permutation Distribution Approximations . . 138
7.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139

8 Density Estimation 141

8.1 Histograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
8.2 Kernel Density Estimates . . . . . . . . . . . . . . . . . . . . 142
8.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146

9 Regression Function Estimates 149

9.1 Standard Regression . . . . . . . . . . . . . . . . . . . . . . . 149
9.2 Kernel and Local Regression Smoothing . . . . . . . . . . . . 150
9.3 Isotonic Regression . . . . . . . . . . . . . . . . . . . . . . . 154
9.4 Splines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
9.5 Quantile Regression . . . . . . . . . . . . . . . . . . . . . . . 157
9.5.1 Fitting the Quantile Regression Model . . . . . . . . . 158
9.6 Resistant Regression . . . . . . . . . . . . . . . . . . . . . . . 162
9.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163

10 Resampling Techniques 167

10.1 The Bootstrap Idea . . . . . . . . . . . . . . . . . . . . . . . 167
10.1.1 The Bootstrap Sampling Scheme . . . . . . . . . . . . 168
10.2 Univariate Bootstrap Techniques . . . . . . . . . . . . . . . . 170
10.2.1 The Normal Method . . . . . . . . . . . . . . . . . . . 170
10.2.2 Basic Interval . . . . . . . . . . . . . . . . . . . . . . . 170
10.2.3 The Percentile Method . . . . . . . . . . . . . . . . . . 171
10.2.4 BCa Method . . . . . . . . . . . . . . . . . . . . . . . 172
10.2.5 Summary So Far, and More Examples . . . . . . . . . 174
10.3 Bootstrapping Multivariate Data Sets . . . . . . . . . . . . . 175
10.3.1 Regression Models and the Studentized Bootstrap
Method . . . . . . . . . . . . . . . . . . . . . . . . . . 176
10.3.2 Fixed X Bootstrap . . . . . . . . . . . . . . . . . . . . 178
10.4 The Jackknife . . . . . . . . . . . . . . . . . . . . . . . . . . 181
Contents v

10.4.1 Examples of Biases of the Proper Order . . . . . . . . 181

10.4.2 Bias Correction . . . . . . . . . . . . . . . . . . . . . . 182
10.4.2.1 Correcting the Bias in Mean Estimators . . . 182
10.4.2.2 Correcting the Bias in Quantile Estimators . 182
10.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185

A Analysis Using the SAS System 187

B Construction of Heuristic Tables and Figures Using R 197

Bibliography 201

Index 211
Introduction

Preface
This book is intended to accompany a one-semester MS-level course in non-
parametric statistics. Prerequisites for the course are calculus through mul-
tivariate Taylor series, elementary matrix algebra including matrix inversion,
and a first course in frequentist statistical methods including some basic prob-
ability. Most of the techniques described in this book apply to data with only
minimal restrictions placed on their probability distributions, but performance
of these techniques, and the performance of analogous parametric procedures,
depends on these probability distributions. The first chapter below reviews
probability distributions. It also reviews some objectives of standard frequen-
tist analyses. Chapters covering methods that have elementary parametric
counterparts begin by reviewing those counterparts. These introductions are
intended to give a common terminology for later comparisons with new meth-
ods, and are not intended to reflect the richness of standard statistical analysis,
or to substitute for an intentional study of these techniques.

Computational Tools and Data Sources

Conceptual developments in this text are intended to be independent of
the computational tools used in practice, but analyses used to illustrate
techniques developed in this book will be facilitated using the program
R (R Core Team, 2018). This program may be downloaded for free from
https://cran.r-project.org/ . This course will heavily depend on the R
package MultNonParam. This package is part of the standard R repository
CRAN, and is installed by typing inside of R:
library(MultNonParam)
Other packages will be called as needed; if your system does not have
these installed, install them as above, substituting the package name for
MultNonParam.
Calculations of a more heuristic nature, and not intended for routine anal-
ysis, are performed using the package NonparametricHeuristic. Since this

vii
viii Introduction

package is so tightly tied to the presentation in this book, and hence of less
general interest, it is hosted on the github repository, and installed via

library(devtools)
install_github("kolassa-dev/NonparametricHeuristic")
and, once installed, loaded into R using the library command.
An appendix gives guidance on performing some of these calculations using
SAS.
Errata and other materials will be posted at
http://stat.rutgers.edu/home/kolassa/NonparametricBook
as they become available.

Acknowledgments
In addition to works referenced in the following chapters, I consulted Stigler
(1986) and Hald (1998) for early bibliographical references. Bibliographic trails
have been tracked through documentation of software packages R and SAS,
and bibliography resources from JSTOR, Citulike.org, Project Euclid, and
various publishers’ web sites, have been used to construct the bibliography.
I am grateful to the Rutgers Physical Sciences librarian Melanie Miller, the
Rutgers Department of Statistics Administrative Assistant Lisa Curtin, and
the work study students that she supervised, and the Rutgers Interlibrary
Loan staff, for assistance in locating reference material.
I am grateful to my students at both Rutgers, and at the University of
Rochester, to whom I taught this material over the years. Halley Constantino,
Jianning Yang, and Peng Zhang used a preliminary version of this manuscript,
and were generous with their suggestions for improvements. My experience
teaching this material helped me to select material for this volume, and to de-
termine its level and scope. I consulted various text books during this time, in-
cluding those of Hettmansperger and McKean (2011), Hettmansperger (1984),
and Higgins (2004).
I thank my family for their patience with me during preparation of this
volume. I thank my editor and proofreader for their important contributions.
I dedicate this volume to my wife, Yodit.
1
Background

Statistics is the solution to an inverse problem: given the outcome from a

random process, the statistician infers aspects of the underlying probabilistic
structure that generated the data. This chapter reviews some elementary as-
pects of probability, and then reviews some classical tools for inference about
a distribution’s location parameter.

1.1 Probability Background

This section first reviews some important elementary probability distribu-
tions, and then reviews a tool for embedding a probability distribution into
a larger family that allows for the distribution to be recentered and rescaled.
Most statistical techniques described in this volume are best suited to contin-
uous distributions, and so all of these examples of plausible data sources are
continuous.

1.1.1 Probability Distributions for Observations

Some common probability distributions are shown in Figure 1.1. The con-
tinuous distributions described below might plausibly give rise to a data set
of independent observations. This volume is intended to direct statistical in-
ference on a data set without knowing the family from which it came. The
behavior of various statistical procedures, including both standard parametric
analyses, and nonparametric techniques forming the subject of this volume,
may depend on the distribution generating the data, and knowledge of these
families will be used to explore this behavior.

1.1.1.1 Gaussian Distribution

The normal distribution, or Gaussian distribution, has density
√
fG (x) = exp(−(x − µ)2 /(2σ 2 ))/(σ 2π).

1
2 Background

FIGURE 1.1: Comparison of Three Densities

.......
.... ....
... ....
.. ...
0.3 .
..
. ..
..
.
... ..
...
.
.... . . ....
... . . ....
.
. . . ..
.... ...
0.25 ...
............................................................................................................................
.. . .... ... .
... .
..
..
.. . ... ... ..
. . .... ... . ..
.. ..
... . .. . .
... . ..
.. . ... . .
... . ..
... ... . ..
0.2 ...
.. .
.
.
.
.
...
.
. .
...
.
...
.
..
..
..
... . .... ... . ..
. ..
dens- ...
.. .
. .
...
. ..
..
..
. ..
..
... . ..
ity ... .
. .
.
.
. .
..
.
.
. ..
..
0.15 .. .
... . .
.
.
.
.. ..
.
...
..
.
. ....
..
... ... .
.. .. . ...
... . .. ... . ...
.. . .. ..
... .. ... ....
..... .. ... ...
... ... .. .
0.1 . ..
. ..
.
. .
..
..
.
.
.. ....
....
..
.. .
.
. .
.. ..... ... .
.
. .. ....
. ......... . ..... ... .
. .
... ..
..... .
..
. . ........ .
. ..
.......
. .. .
......
.....
. .
... .
.. ....... Cauchy .
. ...
.
. ...... ... .. ......... .
0.05 . ..............
.
..............
.
.
.
..
.
..
........ .
......... .

.................
.
.............
.
. .
.
..
..
. . . . . Gaussian .
..
..
........... .
.............
................
...................... . . ... .. . .......................
. .
. .
. .
..
.
....................... Uniform ..
..
...........................................................
.
..........................................................
0
-4 -2 0 2 4
ordinate

The parameter µ is both the expectation and the median, and σ is the standard
deviation. The Gaussian cumulative distribution function is
Z x
FG (x) = fG (y) dy.
−∞

There is no closed form for this integral. This distribution is symmetric

about µ; that is, fG (x) = fG (2µ − x), and FG (x) = 1 − FG (2µ − x). The
specific member of this family of distributions with µ = 0 and σ = 1 is
called standard normal distribution or standard Gaussian distribution. The
standard Gaussian cumulative distribution function is denoted by Φ(x). The
distribution in its generality will be denoted by G(µ, σ 2 ).
This Gaussian distribution may be extended to higher dimensions; a mul-
tivariate Gaussian random variable, or multivariate normal random variable,
X, in a space of dimension d, has a density of form exp(−(x − µ)> Υ−1 (x −
µ)/2) det Υ−1/2 (2π)−d/2 . Here µ is the expectation E [X], and Υ is the
variance-covariance matrix E (X − µ)> (X − µ) .

1.1.1.2 Uniform Distribution

The uniform distribution has density
(
1/λ for x ∈ (θ − λ/2, θ + λ/2)
fU (x) = .
0 otherwise
Probability Background 3

The cumulative distribution function of this distribution is


0
 for x ≤ θ − λ/2
FU (x) = 1/2 + (x − θ)/λ for x ∈ (θ − λ/2, θ + λ/2) .

1 for x ≥ θ + λ/2


Again, the expectation and median for this distribution are both √ θ, and the
distribution is symmetric about θ. The standard deviation is λ/ 12. A com-
mon canonical member of this family is the distribution uniform on [0, 1],
with θ = 1/2 and λ = 1. The distribution in its generality will be denoted by
U(θ, λ).

1.1.1.3 Laplace Distribution

The the double exponential distribution or Laplace distribution has density
√ √
fLa (x) = exp(−|x − θ| 2/σ))/(σ 2).
The cumulative distribution function for this distribution is
( √
exp((x − θ) 2/σ))/2 for x ≤ θ
FLa (x) = √ .
1 − exp(−(x − θ) 2/σ))/2 for x > θ

As before, the expectation and median of this distribution are both θ. The
standard deviation of this distribution is σ. The distribution is symmetric
about θ. A canonical member of this family is the one with θ = 0 and σ = 1.
The distribution in its generality will be denoted by La(θ, σ 2 ).

1.1.1.4 Cauchy Distribution

Consider the family of distributions
fC (x) = ς −1 /(π(1 + (x − θ)2 /ς 2 )), (1.1)
with θ real and ς positive. The cumulative distribution function for such a
distribution is
FC (x) = 1/2 + arctan((x − θ)/ς)/π.
This distribution is symmetric about its median θ, but, unlike the Gaussian,
uniform, and Laplace examples, does not have either an expectation nor a
variance; the quantity ς represents not a standard deviation but a more gen-
eral scaling parameter. Its upper and lower quartiles are θ ± ς, and so the
interquartile range is 2ς. This distribution is continuous. The family member
with θ = 0 and ς = 1 is the Cauchy distribution (and, when necessary to
distinguish it from other members (1.1), will be called standard Cauchy), and
a member with ς = 1 but θ 6= 0 is called a non-central Cauchy distribution.
An interesting and important property of the Cauchy relates to the distri-
bution of sums of independent and identical copies of members of this fam-
ily. If X and Y are independent standard Cauchy, then Z = (X + Y )/2
4 Background

is standard Cauchy. One may see this by first noting that P [Z ≤ z] =

R ∞ R 2z−y R∞
−∞ −∞
fC (x)fC (y) dx dy, and hence that fZ (z) = −∞ fC (2z−y)fC (y) dy.
Dwass (1985) evaluates this integral using partial fractions. Alternatively, this
fact might be verified using characteristic functions.

1.1.1.5 Logistic Distribution

The logistic distribution has density

fL (x) = (1 + exp(−(x − θ)/σ))−1 (1 + exp((x − θ)/σ))−1 σ −1 .

This distribution is symmetric about θ, and has expectation and variance θ

and σ 2 π 2 /3 respectively. The cumulative distribution function of this distri-
bution is
FL (x) = exp((x − θ)/σ)/(1 + exp((x − θ)).
The distribution in its generality will be denoted by Lo(θ, σ 2 ).

1.1.1.6 Exponential Distribution

The exponential distribution has density
(
exp(−x) for x > 0
fE (x) = .
0 otherwise

The cumulative distribution function of this distribution is

(
0 for x ≤ 0
FE (x) = .
1 − exp(−x) for x > 0

The expectation is 1, and the median is log(2). The inequality of these values
is an indication of the asymmetry of the distribution. The standard deviation
is 1. The distribution will be denoted by E.

1.1.2 Location and Scale Families

Most of the distributions above are or can be naturally extended to a family of
distributions, called a location-scale family, by allowing an unknown location
constant to shift the center of the distribution, and a second unknown scale
constant to shrink or expand the scale. That is, suppose that X has density
f (x) and cumulative distribution function F (x). Then a + bX has density
f ((y − a)/b)/b and the cumulative distribution function F ((y − a)/b). If X has
a standard distribution with location and scale 0 and 1, then Y has location
a and scale b.
This argument does not apply to the exponential distribution, because the
lower endpoint of the support of the distribution often is fixed at zero by the
structure of the application in which it is used.
Probability Background 5

1.1.3 Sampling Distributions

The distributions presented above in §1.1.1 represent mechanisms for gener-
ating observations that might potentially be analyzed nonparametricly. Dis-
tributions in this subsection will be used in this volume primarily as sampling
distributions, or approximate sampling distributions, of test statistics.

1.1.3.1 Binomial Distribution

The binomial distribution will be of use, not to model data plausibly arising
from it, but because it will be used to generate the first of the nonparametric
tests considered below. This distribution is supported on {0, 1, . . . , n} for some
integer n, and has an additional parameter π ∈ [0, 1]. Its probability mass
function is nx π x (1 − π)n−x , and its cumulative distribution function is
x
X n y
FB (x) = π (1 − π)n−y .
y=0
y

Curiously, the binomial cumulative distribution function can be expressed in

terms of the cumulative distribution function of the F distribution, to be
discussed below. The expectation is nπ, and the variance is nπ(1 − π). The
median does not have a closed-form expression. This distribution is symmetric
only if π = 1/2. The distribution will be denoted by Bin(n, π).
The multinomial distribution extends the binomial distribution to the dis-
tribution of counts of objects independently classified into more than two
categories according to certain probabilities.

1.1.4 χ2 -distribution
If X1 , . . . , Xk are independent random variables, each with a standard
Gaussian distribution (that is, Gaussian with expectation zero and vari-
ance one), then the distribution of the sum of their squares is called the
chi-square distribution, and is denoted χ2k . Here the index k is called the
degrees of freedom.
Distributions of quadratic forms of correlated summands sometimes have
a χ2 distribution as well. When Y has a multivariate Gaussian distribution
with dimension k, expectation 0 and variance matrix Υ, and if Υ has an
inverse, then
Y > Υ−1 Y ∼ χ2k .
One can see this by noting that Υ may be written as ΘΘ> . Then X =
Θ−1 Y is multivariate Gaussian with expectation 0, and variance matrix
Θ−1 ΘΘ> Θ−1> = I, where I is the identity matrix with k rows and
columns. Then X is a vector of independent standard Gaussian variables,
and Y > Υ−1 Y = X > X.
Furthermore, still assuming
Xj ∼ G(0, 1), independent (1.2)
6 Background

the distribution of
k
X
W = (Xi − δi )2 (1.3)
i=1

is called a non-central chi-square distribution (Johnson et al., 1995). The den-

sity and distribution function of this distribution are complicated. The most
important property
Pk of this distribution is that it depends on δ1 , . . . , δk only
through ξ = i=1 δk2 ; this quantity is known as the non-centrality parameter,
and the distribution of W will be denoted by χ2k (ξ). This dependence on
nonzero expectations only through the simple non-centrality parameter may
be seen by calculating the moment generating function of this distribution.
When Y has a multivariate Gaussian distribution with dimension k, ex-
pectation 0 and variance matrix Υ, and if Υ has an inverse, then X =
Θ−1 (Y − µ) is multivariate Gaussian with expectation −Θ−1 µ, and vari-
ance matrix I. Hence

(Y − µ)> Υ−1 (Y − µ) ∼ χ2k (ξ) with ξ = µ> Υ−1 µ. (1.4)

1.1.5 T -distribution
2
When U has a standard Gaussian distribution,
p V has a χk distribution, and
U and V are independent, then T = U/ V /k has a distribution called
Student’s t distribution, denoted here by Tk , with k called the degrees of
freedom.

1.1.6 F -distribution
When U and V are independent random variables, with χ2k and χ2m distribu-
tions respectively, then F = (U/k)/(V /m) is said to have an F distribution
with k and m degrees of freedom; denote this distribution by Fk,m . If a vari-
able T has a Tm distribution, then T 2 has an F1,m distribution.

1.2 Elementary Tasks in Frequentist Inference

This section reviews elementary frequentist tasks of hypothesis testing and
producing confidence intervals.

1.2.1 Hypothesis Testing

Statisticians are often asked to choose among potential hypotheses about the
mechanism generating a set of data. This choice is often phrased as between
a null hypothesis, generally implying the absence of a potential effect, and
Elementary Tasks in Frequentist Inference 7

an alternative hypothesis, generally the presence of a potential effect. These

hypotheses in turn are expressed in terms of sets of probability distributions,
or equivalently, in terms of restrictions on a summary (such as the median)
of a probability distribution. A hypothesis test is a rule that takes a data set
and returns “Reject the null hypothesis that θ = θ0 ” or “Do not reject the
null hypothesis.”
In many applications studied in this manuscript, hypothesis tests are im-
plemented by constructing a test statistic T , depending on data, and a con-
stant t◦ , such that the statistician rejects the null hypothesis (1.6) if T ≥ t◦ ,
and fails to reject it otherwise. The constant t◦ is called a critical value, and
the collection of data sets in
{data|T (data) ≥ t◦ } (1.5)
for which the null hypothesis is rejected is called the critical region.

1.2.1.1 One-Sided Hypothesis Tests

For example, if data represent the changes in some physiological measure after
receiving some therapy, measured on subjects acting independently, then a null
hypothesis might be that each of the changes in measurements comes from a
distribution with median zero, and the alternative hypothesis might be that
each of the changes in measurements comes from a distribution with median
greater than zero. In symbols, if θ represents the median of the distribution
of changes, then the null hypothesis is θ = 0, and the alternative is
θ ≥ 0. (1.6)
Such an alternative hypothesis is typically called a one-sided hypothesis. The
null hypothesis might then be thought of as the set of all possible distribu-
tions for observations with median zero, and the alternative is the set of all
possible distributions for observations with positive median. If larger values
of θ make larger values of T more likely, and smaller values less likely, then a
test rejecting the null hypothesis for data in (1.5) is reasonable.
Because null hypotheses are generally smaller in dimension than alterna-
tive hypotheses, frequencies of errors are generally easier to control for null
hypotheses than for alternative hypotheses.
Tests are constructed so that, in cases in which the null hypothesis is actu-
ally true, it is rejected with no more than a fixed probability. This probability
is called the test level or type I error rate, and is commonly denoted α.
Hence the critical value in such cases is defined to be the smallest value t◦
satisfying
◦
Pθ0 [T ≥ t ] ≤ α. (1.7)
When the distribution of T is continuous then the critical value satisfies (1.7)
with ≤ replaced by equality. Many applications in this volume feature test
statistics with a discrete distribution; in this case, the ≤ in (1.7) is generally
<.
8 Background

The other type of possible error occurs when the alternative hypothesis is
true, but the null hypothesis is not rejected. The probability of erroneously
failing to reject the null hypothesis is called the type II error rate, and is
denoted β. More commonly, the behavior of the test under an alternative
hypothesis is described in terms of the probability of a correct answer, rather
than of an error; this probability is called power; power is 1 − β.
One might attempt to control power as well, but, unfortunately, in the
common case in which an alternative hypothesis contains probability distri-
butions arbitrarily close to those in the null hypothesis, the type II error rate
will come arbitrarily close to one minus the test level, which is quite large.
Furthermore, for a fixed sample size and mechanism for generating data, once
a particular distribution in the alternative hypothesis is selected, the smallest
possible type II error rate is fixed, and cannot be independently controlled.
Hence generally, tests are constructed primarily to control level. Under this
paradigm, then, a test is constructed by specifying α, choosing a critical value
to give the test this type I error, and determining whether this test rejects the
null hypothesis or fails to reject the null hypothesis.
In the one-sided alternative hypothesis formulation {θ > θ0 }, the inves-
tigator is, at least in principle, interested in detecting departures from the
null hypothesis that vary in proximity to the null hypothesis. (The same
observation will hold for two-sided tests in the next subsection). For plan-
ning purposes, however, investigators often pick a particular value within the
alternative hypothesis. This particular value might be the minimal value of
practical interest, or a value that other investigators have estimated. They
then calculate the power at this alternative, to ensure that it is large enough
to meet their needs. A power that is too small indicates that there is a sub-
stantial chance that the investigator’s alternative hypothesis is correct, but
that they will fail to demonstrate it. Powers near 80% are typical targets.
Consider a test with a null hypothesis of form θ = θ0 and an alternative
hypothesis of form θ = θA , using a statistic T such that under the null hy-
pothesis T ∼ G(θ0 , σ02 ), and under the alternative hypothesis T ∼ G(θA , σA 2
).
0
Test with a level α, and without loss of generality assume that θA > θ . In
this case, the critical value is approximately t◦ = θ0 + σ0 zα . Here zα is the
number such that Φ(zα ) = 1 − α. A common level for such a one-sided test is
α = 0.025; z0.025 = 1.96. The power for such a one-sided test is
1 − Φ((t◦ − θA )/σA ) = 1 − Φ((θ0 + σ0 zα − θA )/σA ). (1.8)
One might plan an experiment by substituting null hypothesis values θ0 and
σ0 and alternative hypothesis values θA and σA into (1.8), and verifying that
this power is high enough to meet the investigator’s needs; alternatively, one
might require power to be 1 − β, and solve for the effect size necessary to give
this power. This effect size is
θA = θ0 + σA zβ + σ0 zα . (1.9)
One can then ask whether this effect size is plausible. More commonly, σ0 and
Elementary Tasks in Frequentist Inference 9

σA both may be made to depend on a parameter representing sample size,

with both decreasing to zero as sample size increases. Then (1.8) is solved for
sample size.

1.2.1.2 Two-sided Hypothesis Tests

In contrast to one-sided alternatives, consider a two-sided hypothesis for θ 6=
θ0 , and, in this case, one often uses a test that rejects the null hypothesis for
data sets in
{T ≤ t◦L } ∪ {T ≥ t◦U }. (1.10)
In this case, there are two critical values, chosen so that

α ≥ Pθ0 [T ≤ t◦L or T ≥ t◦U ] = Pθ0 [T ≤ t◦L ] + Pθ0 [T ≥ t◦U ] , (1.11)

and so that the first inequality is close to equality.

Many of the statistics constructed in this volume require an arbitrary
choice by the analyst of direction of effect, and choosing the direction of effect
differently typically changes the sign of T . In order keep the analytical results
invariant to this choice of direction, the critical values are chosen to make
the two final probabilities in (1.11) equal. Then the critical values solve the
equations
◦ ◦
Pθ0 [T ≥ tU ] ≤ α/2, and Pθ0 [T ≤ tL ] ≤ α/2, (1.12)
with t◦U chosen as small as possible, and t◦L chosen as large as possible consis-
tent with (1.12). Comparing with (1.7), the two-sided critical value is calcu-
lated exactly as is the one-sided critical value for an alternative hypothesis in
the appropriate direction, and for a test level half that of the one-sided test.
Hence a two-sided test of level 0.05 is constructed in the same way as two
one-sided tests of level 0.025.
Often the critical region implicit in (1.11) can be represented by creating
a new statistic that is large when T is either large or small. That is, one might
set W = |T − Eθ0 [T ] |. In this case, if t◦L and t◦U are symmetric about Eθ0 [T ],
then the two-sided critical region might be expressed as

{W ≥ w◦ } (1.13)

for w◦ = t◦U − Eθ0 [T ]. Alternatively, one might define W = |T − Eθ0 [T ] |2 ,

and, under the same symmetry condition, use as the critical region (1.13) for
w◦ = (t◦U − Eθ0 [T ])2 . In the absence of such a symmetry condition, w◦ may
be calculated from the distribution of W directly, by choosing w◦ to make the
probability of the set in (1.13) equal to α.
Statistics T for which (1.10) is a reasonable critical region are inherently
one-sided, since the two-sided test is constructed from one-sided tests com-
bining evidence pointing in opposite directions. Similarly, statistics W for
which (1.13) is a reasonable critical region for the two-sided alternative are
inherently two-sided.
Power for the two-sided test is the same probability as calculated in (1.11),
10 Background

with the θA substituted for θ0 , and power substituted for α. Again, assume
that large values of θ make larger values of T more likely. Then, for alternatives
θA greater than θ0 , the first probability added in

1 − β = PθA [T ≤ t◦L ] + PθA [T ≥ t◦U ]

is quite small, and is typically ignored for power calculations. Additionally,

rejection of the null hypothesis because the evidence is in the opposite di-
rection of that anticipated will result in conclusions from the experiment not
comparable to those for which the power calculation is constructed. Hence
power for the two-sided test is generally approximated as the power for the
one-sided test with level half that of the two-sided tests, and α in (1.8) is often
0.025, corresponding to half of the two-sided test level.
PkSome2 tests to be constructed in this volume may be expressed as W =
j=1 Uj for variables Uj which are, under the null hypothesis, approximately
standard Gaussian and independent; furthermore, in such cases, the critical
region for such tests is often of the form {W ≥ w◦ }. In such cases, the test
of level α rejects the null hypothesis when W ≥ χ2k,α , for χ2k,α the 1 − α
quantile of the χ2k distribution. If the standard Gaussian approximation for
the distribution of Uj is only approximately correct, then the resulting test
will have level α approximately, but not exactly.
If, under the alternative hypothesis, the variables Uj have expectations µj
and standard deviations εj , the alternative distribution of W will be a compli-
cated weighted sum of χ21 (µ2j ) variables. Usually, however, the impact of the
move from the null distribution to the alternative distribution is much higher
on the component expectations than on the standard deviations, and one
might treat these alternative standard deviations fixed at 1. With this Pksimpli-
fication, the sampling distribution of W under the alternative is χ2k ( i=1 δi2 ),
the non-central chi-square distribution.

1.2.1.3 P -values
Alternatively, one might calculate a test statistic, and determine the test level
at which one transitions from rejecting to not rejecting the null hypothesis.
This quantity is called a p-value. For a one-sided test with critical region of
form (1.5), the p-value is given by

P0 [T ≥ tobs ] , (1.14)

for tobs the observed value of the test statistic. For two-sided critical values of
form (1.10), with condition (1.12), the p-value is given by

2 min(P0 [T ≥ tobs ] , P0 [T ≤ tobs ]). (1.15)

These p-values are interpreted as leading to rejection of the null hypothesis

when they are as low as or lower than the test level specified in advance by
the investigator before data collection.
Elementary Tasks in Frequentist Inference 11

Inferential procedures that highlight p-values are indicative of the inferen-

tial approach of Fisher (1925), while those that highlight pre-specified test lev-
els and powers are indicative of the approach of Neyman and Pearson (1933).
I refer readers to a thorough survey (Lehmann, 1993), and note here only
that while I find the pre-specified test level arguments compelling, problem-
atic examples leading to undesirable interpretations of p-values are rare using
the techniques developed in this volume, and, more generally, the contrasts
between techniques advocated by these schools of thought are not central to
the questions investigated here.

1.2.2 Confidence Intervals

A confidence interval of level 1 − α for parameter θ is defined as a set (L, U )
such that L and U depend on data, and such that for any θ,

Pθ [L < θ < U ] ≥ 1 − α.

The most general method for constructing a confidence interval is test

inversion. For every possible null value θ0 , find a test of the null hypothesis
θ = θ0 , versus the two-sided alternative, of level no larger than α. Then the
confidence set is

{θ0 |The null hypothesis θ = θ0 is not rejected.}. (1.16)

In many cases, (1.16) is an interval. In such cases, one attempts to determine

the lower and upper bounds of the interval, either analytically or numerically.
Often, such tests are phrased in terms of a quantity W (θ) depending on
both the data and the parameter, such that the test rejects the null hypothesis
θ = θ0 if and only if W (θ0 ) ≥ w◦ (θ0 ) for some critical value c that might
depend on the null hypothesis.

1.2.2.1 P -value Inversion

One might construct confidence intervals through tail probability inversion.
Suppose that one can find a univariate statistic T whose distribution de-
pends on the unknown parameter θ, such that potential one-sided p-values
are monotonic in θ for each potential statistic value t. Typical applications
have

Pθ [T ≥ t] nondecreasing in θ, Pθ [T ≤ t] non-increasing in θ ∀t, (1.17)

with probabilities in (1.17) continuous in θ. Let t be the observed value of T .

Under (1.17),
{θ|Pθ [T ≥ t] > α/2, Pθ [T ≤ t] > α/2}
is an interval, of form (θL , θU ), with endpoints satisfying

PθL [T ≥ t] = α/2, PθU [T ≤ t] = α/2. (1.18)

12 Background

There may be t such that the equation PθL [T ≥ t] = α/2 has no solution,
because Pθ [T ≥ t] > α/2 for all θ. In such cases, take θL to be the lower
bound on possible values for θ. For example, if π ∈ [0, 1], and T ∼ Bin(n, π),
then Pπ [T ≥ 0] = 1 > α/2 for all π, Pπ [T ≥ 0] = α/2 has no solution, and
π L = 0. Alternatively, if θ can take any real value, and T ∼ Bin(n, exp(θ)/(1+
exp(θ))), then Pθ [T ≥ 0] = α/2 has no solution, and θL = −∞. Similarly,
there may be t such that the equation PθU [T ≤ t] = α/2 has no solution,
because Pθ [T ≤ t] > α/2 for all θ. In such cases, take θU to be the upper
bound on possible values for θ.
Construction of intervals for the binomial proportion represents a simple
example in which p-values may be inverted (Clopper and Pearson, 1934).

1.2.2.2 Test Inversion with Pivotal Statistics

Confidence interval construction is simplified when there exists a random
quantity, generally involving an unknown parameter θ, with a distribution that
does not depend on θ. Such a quantity is called a pivot. For instance, in the case
of independent and identically distributed observations with average X̄ √ and
standard deviation s from a G(θ, σ 2 ) distribution, then T = (X̄ − θ)/(s/ n)
has a t distribution with n − 1 degrees of freedom, regardless of θ.
One may construct a confidence interval using a pivot by finding quantiles
t◦L and t◦U such that
◦ ◦
P [tL < T (θ, data) < tU ] ≥ 1 − α. (1.19)

Then
{θ|t◦L < T (θ, data) < t◦U } (1.20)
is a confidence interval, if it is really an interval. In the case when (1.20) is
an interval, and when T (θ, data) is continuous in θ, then the interval is of the
form (L, U ); that is, the interval does not include the endpoints.

1.2.2.3 A Problematic Example

One should use this test inversion technique with care, as the following prob-
lematic case shows. Suppose that X and Y are Gaussian random variables,
with expectations µ and ν respectively, and common known variances σ 2 .
Suppose that one desires a confidence p interval for ρ = µ/ν (Fieller, 1954).
√
The quantity T = n(X̄ − ρȲ )/(σ 1 + ρ2 ) has a standard Gaussian dis-
tribution, independent of ρ, and hence is pivotal. A confidence region is
{ρ : n(X̄ − ρȲ )2 /σ 2 (1 + ρ2 )) ≤ zα/2
2
}. Equivalently, the region is

{ρ|Q(ρ) < 0} for Q(ρ) = (X 2 − υ 2 )ρ2 − 2XY ρ + Y 2 − υ 2 (1.21)

for υ = σzα/2 .
If X 2 + Y 2 < υ 2 , then Q(ρ) in (1.21) has a negative coefficient for
ρ , and the maximum value is at ρ = XY /(X 2 − υ 2 ). The maximum is
2
Exercises 13

(υ 2 −υ 2 + X 2 + Y 2 )/(υ 2 − X 2 ) < 0, and so the inequality in (1.21) holds
for all ρ, and the confidence interval is the entire real line.
If X 2 + Y 2 > υ 2 > X 2 , then the quadratic form in (1.21) has a negative
coefficient for ρ2 , and the maximum is positive. Hence values satisfying the
inequality in (1.21) are very large and very small values of ρ; that is, the
confidence interval is
√ ! √ !
−XY − υ X 2 + Y 2 − υ 2 −XY + υ X 2 + Y 2 − υ 2
−∞, ∪ ,∞ .
υ2 − X 2 υ2 − X 2

If X 2 > υ 2 , then the quadratic form in (1.21) has a positive coefficient for
2
ρ , and the minimum is negative. Then the values of ρ satisfying the inequality
in (1.21) are those near the minimizer XY /(X 2 − υ 2 ). Hence the interval is
√ √ !
XY − υ X 2 + Y 2 − υ 2 XY + υ X 2 + Y 2 − υ 2
, .
X 2 − υ2 X 2 − υ2

1.3 Exercises
1. Demonstrate that the moment generating function for P the statistic
k
(1.3), under (1.2), depends on δ1 , . . . , δk only through j=1 δj2 .
2
One-Sample Nonparametric Inference

This chapter first reviews standard Gaussian-theory inference on one sample

location models. It then presents motivation for why a distribution-free ap-
proach to location testing is necessary, and presents nonparametric techniques
for inference on quantiles. Later in this chapter, techniques for comparing the
efficiencies of tests are introduced, and these are applied to various parametric
and nonparametric tests. Finally, techniques for estimating a single cumulative
distribution function are discussed.

2.1 Parametric Inference on Means

Suppose one wants to learn about θ = E [Xj ], from a sample
X1 , . . . , Xj , . . . , Xn of independent and identically distributed random vari-
ables. When one knows the parametric family generating this set of indepen-
dent data, this information may be used to construct testing and estimation
methods tailored to the individual distribution. The variety of such techniques
is so large that only those presuming approximately a Gaussian model will be
reviewed in this volume, and in what follows, parametric analyses for compar-
ison purposes will be taken to assume approximate Gaussian distributions.

2.1.1 Estimation using Averages

Practitioners often estimate the location of a distribution using the sample
average
Xn
X̄ = Xj /n. (2.1)
j=1

If a new data set is created using an affine transformation Yj = a + bXj , then

Ȳ = a + bX̄, and the sample average is equivariant under affine transforma-
tions. For example, average temperature in degrees Fahrenheit Ȳ may be cal-
culated from average temperature in degrees Celsius X̄ using Ȳ = 32 + 1.8X̄,
without needing access to the original measurements.
If these variables have a finite variance σ 2 , then the central limit theorem
(CLT) ensures that X̄ is approximately G(θ, σ 2 /n); however, many com-

15
16 One-Sample Nonparametric Inference

mon techniques designed for data with a Gaussian distribution require con-
sequences of this distribution beyond the marginal distribution of the sample
average.

2.1.2 One-Sample Testing for Gaussian Observations

To test the null hypothesis θ√= θ0 versus the alternative θ > θ0 , reject the null
hypothesis if X̄ > θ0 + zα σ/ n. To test the null hypothesis θ = θ0 versus√ the
0 0
two-sided alternative θ 6=
√ θ , reject the null hypothesis if X̄ > θ + zα/2 σ/ n,
or if X̄ < θ0 − zα/2 σ/ n. If σ is not known, substitute the estimate s =
qP
n 2
j=1 (Xj − X̄) /(n − 1), and compare this quantity to the t distribution
with n − 1 degrees of freedom.

2.2 The Need for Distribution-Free Tests

Table 2.1 contains actual test levels for some tests of location parameters for
four of the families described in §1.1.1. True levels were determined via sim-
ulation; a large number of samples were drawn from each of the distributions
under the null hypothesis, the specified test statistic was calculated, the test
of §2.1.2 was performed for each simulated data set, and the proportion of
times the null hypothesis was rejected was tabulated. For now, restrict at-
tention to the first line in each subtable, corresponding to the t-test. Null
hypotheses in Table 2.1 are in terms of the distribution median. The t-test,
however, is appropriate for hypotheses involving the expectation. In the
Gaussian, Laplace, and uniform cases, the median coincides with the expecta-
tion, and so standard asymptotic theory justifies the use of the t-test. In the
Cauchy example, as noted before, even though the distribution is symmetric,
no expectation exists, and the t-test is inappropriate. However, generally, data
analysts do not have sufficient information to distinguish the Cauchy example
from the set of distributions having enough moments to justify the t-test, and
so it is important to study the implications of such an inappropriate use of
methodology.
For both sample sizes, observations from a Gaussian distribution give the
targeted level, as expected. Observations from the Laplace distribution give
a level close to the targeted level. Observations from the Cauchy distribution
give a level much smaller than the targeted level, which is paradoxical, because
one might expect heavy tails to make it anti-conservative. Figure 2.1 shows
the density resulting from Studentizing the average of independent Cauchy
variables. The resulting density is bimodal, with tails lighter than one would
otherwise expect. This shows that larger values of the sample standard devi-
One-Sample Median Methods 17

TABLE 2.1: True levels for the T Test, and Sign Test, and Exact Sign Test,
nominal level 0.05

(a) Sample size 10, Two-Sided

Gaussian Cauchy Laplace Uniform

T 0.05028 0.01879 0.04098 0.05382
Approximate Sign 0.02127 0.02166 0.02165 0.02060
Exact Sign 0.02127 0.02166 0.02165 0.02060

(b) Sample size 17, Two-Sided

Gaussian Cauchy Laplace Uniform

T 0.05017 0.02003 0.04593 0.05247
Approximate Sign 0.01234 0.01299 0.01274 0.01310
Exact Sign 0.04847 0.04860 0.04871 0.04898

(c) Sample size 40, Two-Sided

Gaussian Cauchy Laplace Uniform

T 0.04938 0.02023 0.04722 0.05029
Approximate Sign 0.03915 0.03952 0.03892 0.03904
Exact Sign 0.03915 0.03952 0.03892 0.03904

ation in the denominator of the Studentized statistic act more strongly than
larger values of components of the average in the numerator.
In all cases above, the t-test succeeds in providing a test level not much
larger than the target nominal level. On the other hand, in some cases the
true level is significantly below that expected.
This effect decreases as sample level increases.

2.3 One-Sample Median Methods

For moderate sample sizes, then, the standard one-sample t-test fails to con-
trol test level as the distribution of summands changes. Techniques that avoid
this problem are developed in this section. These methods apply in broad gen-
erality, including in cases when the expectation of the individual observations
does not exist. Because of this, inference about the population median rather
than the expectation is pursued. Recall that the median θ of random variable
Xj is defined so that

P [Xj ≥ θ] ≥ 1/2, P [Xj ≤ θ] ≥ 1/2. (2.2)

18 One-Sample Nonparametric Inference

FIGURE 2.1: Density of Studentized Cauchy made Symmetric, Sample Size

.. ...
..... .. .
0.4 .. .. .. ..
.. ... .. ....
.... .... .... ....
.. . .. .
.. .. .. ..
.. .. .. ..
.. .... .. ....
.... .. ..
. ...
.. ... .. ...
... .. ... ...
.. ..
.... ...
..
.
.
..
...
...
... .. . ...
... ..
0.3 ...
..
..
... .
.
.
. ...
..
... .. .
. ...
... ... ... ..
..
.... .. ... ...
.. ... ..
. ..
.... . ..
.... ... ...... ...
.. ...... ..
... ..
Density ...
...
..
... ..
..
0.2 .... ...
.. ...
... ..
... ..
..
... ..
... ..
...
... ...
... ...
... ..
... ..
..
... ..
... ...
0.1 ... ..
..
... ...
... ..
..
.. ...
..
. ..
..
. ...
.
.. ...
...
. ...
....
....
. ....
..
...
... ........
.
.... ...................
......................................... .........................
0
-4 -2 0 2 4
Data Value

Below, the term median refers to the population version, unless otherwise
specified.

2.3.1 Estimates of the Population Median

An estimator smed [X1 , . . . , Xn ] of the population median may be constructed
by applying (2.2) to the empirical distribution of Xi , formed by putting point
mass on each of the n values. In this case, with n odd, the median is the
middle value, and, with n even, (2.2) fails to uniquely define the estimator.
In this case, the estimator is conventionally defined to be the average of the
middle two values. By this convention, with X(1) , . . . , X(n) the ordered values
in the sample,
(
X((n+1)/2) for n odd
smed [X1 , . . . , Xn ] = . (2.3)
(X(n/2) + X(n/2+1) )/2 for n even
Alternatively, one might define the sample median to minimize the sum of
distances from the median:
n
X
smed [X1 , . . . , Xn ] = argmin |Xi − η|; (2.4)
η
i=1

that is, the estimate minimizes the sum of distances from data points to
One-Sample Median Methods 19

the potential median value, with distance measured by the sum of absolute
values. This definition (2.4) exactly coincides with the earlier definition (2.3)
for n odd, shares in the earlier definition’s lack of uniqueness for even sample
sizes, and typically shares the opposite resolution (averaging the middle two
observations) of this non-uniqueness. In contrast, the sample mean X̄ of (2.1)
satisfies
Xn
X̄ = argmin |Xi − η|2 . (2.5)
η
i=1

Under certain circumstances, the sample median is approximately Gaus-

sian. Central limit theorems for the sample median generally require only
that the density of the raw observations be positive in a neighborhood of the
population median.
In §2.1.1 it was claimed that the sample average is equivariant for affine
transformation. A stronger property holds for medians; if Yi = h(Xi ), for h
monotonic, then for n odd, smed [Y1 , . . . , Yn ] = h(smed [X1 , . . . , Xn ]). For n
even, this is approximately true, except that the averaging of the middle two
observations undermines exact equivariance for non-affine transformations.
Both (2.4) and (2.5) are special cases of an estimator defined by
n
X
argmin %(Xi − η), (2.6)
η
i=1

for some convex function %; the sample mean uses %(z) = z 2 /2 and the sample
median uses %(z) = |z|. Huber (1964) suggests an alternative estimator com-
bining the behavior of the mean and median, by taking ρ quadratic for small
values, and continuing linearly for larger values, thus balancing increased effi-
ciency of the mean and the smaller dependence on outliers of the median; he
suggests (
z 2 /2 for |z| < k
%(z) = 2
, (2.7)
k|z| − k /2 for |z| ≥ k
and recommends a value of the tuning parameter k between 1 and 2.

2.3.2 Hypothesis Tests Concerning the Population Median

Techniques in this section can be traced to Arbuthnott (1712), as described
in the example below. Fisher (1930) treats this test as too obvious to require
comment.
Consider independent identically distributed random variables Xi for i =
1, . . . , n. To test whether a putative median value θ0 is the true value, define
new random variables
(
1 if Xj − θ0 ≤ 0
Yj = . (2.8)
0 if Xj − θ0 > 0
20 One-Sample Nonparametric Inference

Then under H0 : θ = θ0 , Yj ∼ Bin(1/2, 1). This logic only works if

0

P Xj = θ = 0; (2.9)
assume this. It is usually easier to assess this continuity assumption than it is
for distributional assumptions. Then the Pnmedian inference problem reduces to
one of binomial testing. Let T (θ0 ) = j=1 Yj be the number of observations
Ptu −1
(1/2)n nj ≥ 1 − α. One

less than or equal to θ0 . Pick tl and tu so that j=t l
might choose tl and tu symmetrically, so that tl is the largest value such that
tXl −1
n n
(1/2) ≤ α/2. (2.10)
j=0
j

That is, tl is that potential value for T such that not more than α/2 probability
sits below it. The largest such tl has probability at least 1 − α/2 equal to or
larger than it, and at least α/2 equal to or smaller than it; hence tl is the α/2
quantile of the Bin(n, 1/2) distribution. Generally, the inequality in (2.10) is
strict; that is, ≤ is actually <. For combinations of n and α for which this
inequality holds with equality, the quantile is not uniquely defined, and take
the quantile to be the lowest candidate. Symmetrically, one might choose the
smallest tu so that
n
n n
X
(1/2) ≤ α/2; (2.11)
j=t
j
u

n + 1 − tu is the α/2 quantile of the Bin(n, 1/2) distribution, with the opposite
convention used in case the quantile is not uniquely defined.
Then, reject the null hypothesis if T ≤ t◦L or T ≥ t◦U for t◦L = tl − 1 and
◦
tU = tu . This test is called the (exact) sign test, or the binomial test (Higgins,
2004). An approximate version of the sign test might be created by selecting
critical values from the Gaussian approximation to the distribution of T (θ0 ).
Again, direct attention to Table 2.1. Both variants of the sign test succeed
in keeping the test level no larger than the nominal value. However, the sign
test variants, because of the discreteness of the binomial distribution, in some
cases achieve levels much smaller than the nominal target. Subtable (a), for
sample size 10, is the most extreme example of this; subtable (b), for sample
size 17, represents the smallest reduction in actual sample size, and subtable
(c), for sample size 40, is intermediate. Note further that, while the asymp-
totic sign test, based on the Gaussian approximation, is not identical to the
exact version, for subtables (a) and (c) the levels coincide exactly, since for
all simulated data sets, the p-values either exceed 0.05 or fail to exceed 0.05
for both tests. Subtable (b) exhibits a case in which for one data value, the
exact and approximate sign tests disagree on whether p-values exceed 0.05.
Table 2.2 presents characteristics of the exact two-sided binomial test of
the null hypothesis that the probability of success is half, with level α = 0.05,
applied to small samples. In this case, the two-sided p-value is obtained by
doubling the one-sided p value.
One-Sample Median Methods 21

TABLE 2.2: Exact levels and exact and asymptotic lower critical values for
symmetric two-sided binomial tests of nominal level 0.05

Critical Value tl − 1 Exact Critical Value tl − 1 Exact

n Exact Asymptotic Levels n Exact Asymptotic Levels
6 0 0 0.0313 24 6 6 0.0227
7 0 0 0.0156 25 7 7 0.0433
8 0 0 0.0078 26 7 7 0.0290
9 1 1 0.0391 27 7 7 0.0192
10 1 1 0.0215 28 8 8 0.0357
11 1 1 0.0117 29 8 8 0.0241
12 2 2 0.0386 30 9 9 0.0428
13 2 2 0.0225 31 9 9 0.0294
14 2 2 0.0129 32 9 9 0.0201
15 3 3 0.0352 33 10 10 0.0351
16 3 3 0.0213 34 10 10 0.0243
17 4 3 0.0490 35 11 11 0.0410
18 4 4 0.0309 36 11 11 0.0288
19 4 4 0.0192 37 12 12 0.0470
20 5 5 0.0414 38 12 12 0.0336
21 5 5 0.0266 39 12 12 0.0237
22 5 5 0.0169 40 13 13 0.0385
23 6 6 0.0347 41 13 13 0.0275

For small samples (n < 6), the smallest one-sided p-value, 1/2n , is greater
than .025, and the null hypothesis is never rejected. Such small samples are
omitted from Table 2.2. This table consists of two subtables side by side, for
n ≤ 23, and for n > 23. The first column of each subtable is sample size. The
second is tl − 1 from (2.10). The third is the value taken from performing the
same operation on the Gaussian approximation; that is, it is the largest a such
that √
Φ((tl − 1 − n/2 + 0.5)/(0.5 n)) ≤ α/2. (2.12)
The fourth is the observed test level; that is, it is double the right side of
(2.10). Observations here agree with those from Table 2.1; for sample size 10,
the level of the binomial test is severely too small, for sample size 17, the
binomial test has close to the optimal level, and for sample size 40, the level
for the binomial test is moderately too small.
A complication (or, to an optimist, an opportunity for improved approx-
imation) arises when approximating a discrete distribution by a continuous
distribution. Consider the case with n = 10, exhibited in Figure 2.2. Bar areas
represent the probability under the null hypothesis of observing the number of
successes. Table 2.2 indicates that the one-sided test of level 0.05 rejects the
null hypothesis for W ≤ 1. The actual test size is 0.0215, which is graphically
represented as the sum of the areas in the bar centered at 1, and the very small
area of the neighboring bar centered at 0. Expression (2.12) approximates the
22 One-Sample Nonparametric Inference

sum of these two bar areas by the area under the dotted curve, representing
the Gaussian
√ density with the appropriate expectation n/2 = 5 and standard
deviation n/2 = 1.58. In order to align the areas of the bars most closely
with the area under the curve, the Gaussian area should be taken to extend to
the upper end of the bar containing 1; that is, evaluate the Gaussian distribu-
tion function at 1.5, explaining the 0.5 in (2.12). More generally, for a discrete
distribution with potential values ∆ units apart, the ordinate is shifted by
∆/2 before applying a Gaussian approximation; this adjustment is called a
correction for continuity.

FIGURE 2.2: Approximate Probability Calculation for Sign Test

.
0.25 ..........
...............................
..... ....
. .....
... .......
......
.
. . ... ...
.... .... ... ..
.. ... ... ...
... ... ... ...
... ... ... ..
............................
. ..
. ............................
. . ... .. ..
0.2 ... ...
.... ...
...
.
...
...
...
.. ..
.. ...
.. ..
... ... ... ... .. ..
.. ..
... ... ... ... .. ..
... ... ... ... .. ..
.. ...
... .. ... ... ....
...... ... ... ....
...... ... ... ....
... ... ... ...
.
...... ..
. .
. ........
...... ... .... .....
0.15 .. ...
... ...
...
...
.
.
..
.
... ...
... ...
... ... ... .... ... ..
... ... ... ... ... ...
.. ... .
. .
. ... ...
Probability ... ... .... .... ... ...
... ... ... ... ... ..
.......................... .
. .
. .............................
.... ... .... ..
. ..
. ... ... ...
.. .. .
. ..
. ..
. ... .. ...
.... .... .... .... .... ... .. ..
.. ..
0.1 ... ...
... ...
...
...
...
...
...
...
...
...
.. ..
.. ..
... .. ... ... ... ... .. ...
.. ..
... ... ... ... ... ... .. ..
...... ... ... ... ... ....
..... . .....
... .
. .
. . ..
.
. .
. ..
. ..
. ..
. ....
..
....... .... .... .... .... ......
.. . .
. .
. .
. .
. ... ...
.. ...
. ..
. ..
. ..
. .
.
. ... ...
... ... ..
. ..
. ..
. .
.
. ... ...
0.05 .
..
..............................
.
.
. .
.
.
...
.
.
.
.
....
.
.
.
....
.
.
.
..
.
... ...
..............................
... ... .
. .
. .
. .
. .
.
. ... .. ...
... ... .... .... .... .... ..
. ... .. ..
.. ..
... .... ... ... ... .
. ..
. ... .. ..
..... ... ... ... .... .... ... ....
..
. .
. .
. .
. .
. .
. .
. .
. ......
.
........ .
.
. .
.
. .
.
. .
.
. .
.
.
.
.
. ... ......
... ... . . . . . .
.
.
..
.................................
.
.
..
.
.
..
.
.
..
.
.
..
.
.
..
.
.. .................................
.... . .
. .
. .
. .
. .
. .
. .. ........
.. ... .
.. .
. .
. .
. .
. .
. .
. .
. .
. ... ...............................
....................... ..........
0 . . . . . . .

0 1 2 3 4 5 6 7 8 9 10
Sample Size 10, Target Level 0.05, and from Table 2.2 a − 1 = 1

The power of the sign test is determined
by PθA Xj ≤ θ0 for values
of θA 6= θ0 . Since θA > θ0 if PθA Xj ≤ θ0 < 1/2, alternatives θA > θ0
correspond to one sided alternatives P [Yj = 1] < 1/2.
If θ0 is the true population median of the Xj , and if there exists
a set of
form (θ0 − , θ0 + ), with > 0, such that P Xj ∈ (θ0 − , θ0 + ) = 0, then
any other θ in this set is also a population median for Xj , and hence the test
will have power against such alternatives no larger than the test level. Such
occurrences are rare.
Table 2.3 represents powers for these various tests for various sample levels.
The alternative is chosen to make the t-test have power approximately .80 for
the Gaussian and Laplace distributions, using (1.9). In this case both σ0 and
One-Sample Median Methods 23

TABLE 2.3: Power for the T Test, Sign Test, and Exact Sign Test, nominal
level 0.05

(a) Sample size 10, Two-Sided

Gaussian Cauchy Laplace

T 0.70593 0.14345 0.73700
Approximate Sign 0.41772 0.20506 0.57222
Exact Sign 0.41772 0.20506 0.57222

(b) Sample size 17, Two-Sided

Gaussian Cauchy Laplace

T 0.74886 0.10946 0.76456
Approximate Sign 0.35747 0.17954 0.58984
Exact Sign 0.57893 0.35759 0.79011

(c) Sample size 40, Two-Sided

Gaussian Cauchy Laplace

T 0.78152 0.06307 0.78562
Approximate Sign 0.55462 0.35561 0.84331
Exact Sign 0.55462 0.35561 0.84331

√
σA for the Gaussian and Laplace distributions are 1/ n. Formula (1.9) is
inappropriate for the Cauchy distribution, since in this case X̄ does not have
a distribution that is approximately Gaussian. For the Cauchy distribution,
the same alternative as for the Gaussian and Laplace distributions is used.
Results in Table 2.3 show that for a sample size for which the sign test
level approximates the nominal level (n = 17), use of the sign test for Gaussian
data results in a moderate loss in power relative to the t-test, while use of the
sign test results in a moderate gain in power for Laplace observations, and in
a substantial gain in power for Cauchy observations.

Example 2.3.1 An early (and very simple) application of this test was
to test whether the proportion of boys born in a given year is the same as
the proportion of girls born that year (Arbuthnott, 1712). Number of births
was determined for a period of 82 years. Let Xj represent the number
of births of boys, minus the number of births of girls, in year j. The
parameter θ represents the median amount by which the number of girls
exceeds the number of boys; its null value is 0. Let Yj take the value 0 for
years in which more girls than boys are born, and 1 otherwise. Note that
in this case, (2.9) is violated, but P [Xj = 0] is small, and this violation
is not important. Test at level 0.05.
24 One-Sample Nonparametric Inference

Values in (2.10) and (2.11) are tl = 32 and tu = 51, obtained as the

0.025 and 0.975 quantiles of the binomial distribution with 82 trials and
success probability .5. Reject the null hypothesis if T < 32 or if T ≥ 51.
(The asymmetry in the treatment of the lower and upper critical values
is intentional, and is the result of the asymmetry in the definition of the
distribution function for discrete variables.)
In each of these years Xj < 0, and so Yj = 1, and T = 82. Reject the
null hypothesis of equal proportion of births. The original analysis of this
data presented what is now considered the p-value; the one sided value of
(1.14) is trivially P [T ≥= 82] = (1/2)82 , which is tiny. The two-sided
p-value of (1.15) is 2 × (1/2)82 = (1/2)81 , which is still tiny.

2.3.3 Confidence Intervals for the Median

Apply the test inversion approach of §1.2 to the sign test that rejects H0 :
θ = θ0 if fewer than tl or at least tu data points are less than or equal to θ0 .
Let X(·) referring to the data values after ordering. When θ0 ≤ X(1) , then
T (θ0 ) = 0. For θ0 ∈ (X(1) , X(2) ], T (θ0 ) = 1. For θ0 ∈ (X(2) , X(3) ], T (θ0 ) = 2.
In each case, the ( at the beginning of the interval and the ] at the end of the
interval arises from (2.8), because observations that are exactly equal to θ0
are coded as one. Hence the test rejects H0 if θ0 ≤ X(tl ) or θ0 > X(tu ) , and,
for any θ0 ,
0

Pθ0 X(tl ) < θ ≤ X(tu ) ≥ 1 − α.
This relation leads to the confidence interval is (X(tl ) , X(tu ) ]. However, since
the data have a continuous distribution, then X(tu ) also has a continuous
distribution, and 0
Pθ0 θ = X(tu ) = 0

for any θ0 . Hence Pθ0 X(tl ) < θ0 < X(tu ) ≥ 1 − α, and one might exclude
the upper end point, to obtain the interval (X(tl ) , X(tu ) ).

Example 2.3.2 Consider data from

http://lib.stat.cmu.edu/datasets/Arsenic
from a pilot study on the uptake of arsenic from drinking water. Column
six of this file gives arsenic concentrations in toenail clippings, in parts
per million. The link above is to a Word file; the file
http://stat.rutgers.edu/home/kolassa/Data/arsenic.dat
contains a plain text version. Sorted nail arsenic values are
0.073, 0.080, 0.099, 0.105, 0.118, 0.118, 0.119, 0.135,
0.141, 0.158, 0.175, 0.269, 0.275, 0.277, 0.310, 0.358,
0.433, 0.517, 0.832, 0.851, 2.252.
One-Sample Median Methods 25

We construct a confidence interval for the (natural) log of toenail arsenic.

The sign test statistic has a Bin(21, .5) distribution under the null hypoth-
esis. We choose the largest tl such that P0 [T < tl ] ≤ α/2. The first few
terms in (2.10) are

4.76 × 10−7 , 1.00 × 10−5 , 1.00 × 10−4 , 6.34 × 10−4 , 2.85 × 10−3 ,
9.70 × 10−3 , 2.59 × 10−2 , 5.54 × 10−2 , 9.70 × 10−2 , 1.40 × 10−1 ,

and cumulative probabilities are

4.77 × 10−7 , 1.05 × 10−5 , 1.11 × 10−4 , 7.45 × 10−4 , 3.60 × 10−3 ,
1.33 × 10−2 , 3.91 × 10−2 , 9.46 × 10−2 , 1.92 × 10−1 , 3.32 × 10−1 .

The largest of these cumulative sums smaller than 0.025 is the sixth,
corresponding to T < 6. Hence tl = 6. Similarly, tu = 16. Reject the null
hypothesis that the mean is 0.26 if T < 6 or if T ≥ 16. Since 11 of the
observations are greater than the null median 0.26, T = 11. Do not reject
the null hypothesis.
Alternatively, one might calculate a p-value. Using (1.15), the p-value
is 2 min(P0 [T ≥ 11] , P0 [T ≤ 11]) = 1.
Furthermore, the confidence interval for the median is (X(6) , X(16) ) =
(0.118, 0.358).
The values tl and tu may be calculated in R by

a<-qbinom(0.025,21,.5); b<-21+1-qbinom(0.025,21,.5)
and the ensemble of calculations might also have been performed in R
using
arsenic<-as.data.frame(scan(’arsenic.dat’,
what=list(age=0,sex=0,drink=0,cook=0,water=0,nails=0)))
library(BSDA)#Gives sign test.
SIGN.test(arsenic$nails,md=0.26)#Argument md gives null hyp.
Graphical construction of a confidence interval for the median is calcu-
lated by

library(NonparametricHeuristic)
invertsigntest(log(arsenic$nails),maint="Log Nail Arsenic")
and is given in Figure 2.3. Instructions for installing this last library are
given in the introduction, and in Appendix B.

Figure 2.3 exhibits construction of the confidence interval in the previous

example; I apply these techniques on the log scale. The confidence interval is
the set of log medians that yield a test statistic for which the null hypoth-
esis is not rejected. Values of the statistic for which the null hypothesis is
26 One-Sample Nonparametric Inference

not rejected are between the horizontal lines; log medians in the confidence
intervals are values of the test statistic within this region.

FIGURE 2.3: Construction of CI for Log Nail Arsenic Location

-3 -2 -1 0 1 2

In this construction, order statistics (that is, the ordered values) are first
plotted on the horizontal axis, with the place in the ordered data set on the
vertical axis. These points are represented by the points in Figure 2.3 where
the step function transitions from vertical to horizontal, as one moves from
lower left to upper right. Next, draw horizontal lines at the values tl and tu ,
given by (2.10) and (2.11) respectively. Finally, draw vertical lines through
the data points that these horizontal lines hit.
For this particular example, the exact one-sided binomial test of level 0.025
rejects the null hypothesis that the event probability is half if the sum of event
indicators is 0, 1, 2, 3, 4, or 5; tl = 6. For Yj of (2.8), the sum is less than
6 for all θ to the left of the point marked X(tl ) . Similarly, the one-sided level
0.025 test in the other direction rejects the null hypothesis if the sum of event
indicators is at least tu = 16. The sum of the Yj exceeds 15 for θ to the right
of the point marked X(tu ) .
By symmetry, one might expect tl = n − tu , but this is not the case. The
asymmetry in definitions (2.10) and (2.11) arises because construction of the
confidence interval requires counting not the data points, but the n − 1 spaces
One-Sample Median Methods 27

between them, plus the regions below the minimum and above the maximum,
for a total of n + 1 ranges. Then tl = n + 1 − tu .
√This interval is not of the usual form θ̂ ± 2σ̂, for σ̂ with a factor of
1/ n. Cramér (1946, pp. 368f.) shows that if X1 , . . . , Xn is a set of indepen-
dent random variables, each having density f , then Var [smed [X1 , . . . , Xn ]] ≈
1/(4f (θ)2 n). Chapter 8 investigates estimation of this density; this estimate
can be used to estimate the median variance, but density estimation is harder
than the earlier confidence interval rule.

2.3.4 Inference for Other Quantiles

The quantile θ corresponding to probability γ is defined by Pθ [Xj ≤ θ] =
γ. Suppose that θ is quantile γ ∈ (0, 1) of distribution of independent and
identically distributed continuous random variables X1 , . . . , Xn . Then one can
produce a generalized sign test. Define the null and alternative hypotheses
H0 : θ = θ0 and HA : θ 6= θ0 . As before, T (θ) is the number of observations
Ptu −1 j n−j n
the quantile, T ∼ Bin(n, γ).
smaller than or equal to θ. For the true value θ of
Choose tl and tu so that j=t l
γ (1 − γ) j ≥ 1 − α. Often, one chooses
the largest tl and smallest tu so that
l −1
tX n
j n−j n X
j n−j n
γ (1 − γ) < α/2, γ (1 − γ) < α/2 (2.13)
j=0
j j=t
j
u

this tl is α/2 quantile of the Bin(n, γ) distribution, and n + 1p − tu is the α/2

quantile of the Bin(n, 1 − γ) distribution. Hence tl ≈ nγ − nγ(1 − γ)zα/2
p
and tu ≈ nγ + nγ(1 − γ)zα/2 . One then rejects H0 if T < tl or T ≥ tu .
This test is then inverted to obtain (X(tl ) , X(tu ) ) as the
confidence interval

for θ. Note thatthe confidence
level
is conservative: P X(tl ) ≤ θ ≤ X(tu ) =
1 − P X(tl ) ≥ θ − P θ ≥ X(tu ) ≥ 1 − α. For any given θ, the inequality is
generally strict.

Example 2.3.3 Test the null hypothesis that the upper quartile (that is,
the 0.75 quantile) of the arsenic nail data from Example 2.3.2 is the ref-
erence value 0.26, and give a confidence interval for this quantile. The
analysis is the same as before, except that tl and tu are different. We de-
termine tl and tu in (2.13). Direct calculation, or using the R commands
a<-qbinom(0.025,21,.75);b<-21+1-qbinom(0.025,21,1-0.75)

shows tl = 12 and tu = 20. Since T = 10 < tl , reject the null hypothesis

that the upper quartile is 0.26. Furthermore, the confidence interval is the
region between the twelfth and twentieth ordered values, (X(12) , X(20) ) =
(0.269, 0.851). With data present in the R workspace, one calculates a
confidence interval as
28 One-Sample Nonparametric Inference

sort(arsenic$nails)[c(a,b)]

and the p-value as

tt<-10
2*min(c(pbinom(tt,21,.75), pbinom(21+1-tt,21,1-.75)))
to give 0.0128.

Dependence of the test statistic T (θ) on θ is relatively simple. Later in-

versions of more complicated statistics will make use of the simplifying device
of first, shifting all or part of the data by subtracting θ, and then testing the
null hypothesis that the location parameter for this shifted variable is zero.

2.4 Comparing Tests

For fixed level, alternative, and power, the test with a smaller sample size is
better. Consider two families of one-sided tests, indexed by sample size, using
statistics T1 and T2 , both with test level α, and determine the sample sizes
required to give power 1 − β, for the same alternative. Compare the tests by
taking ratio of these two sample sizes. The ratio is called relative efficiency; the
notation dates back at least as far as Noether (1950), citing Pitman (1948).
Let t◦j,n represent the critical value for test j based on n observations; that
is, the test based on statistic Tj and using n observations, rejects the null
hypothesis if Tj ≥ t◦j,n . Hence t◦j,n satisfies Pθ0 Tj ≥ t◦j,n = α. Let $j,n (θA )

represent the power for test Tj using n observations, under the alternative θA :

$j,n (θA ) = PθA Tj ≥ t◦j,n .

Assume that

$j,n (θA ) is continuous and increasing in θA for all j, n, 
limθA →∞ $j,n (θA ) = 1, (2.14)
limn→∞ $j,n (θA ) = 1 for all θA > θ0 .


Two tests, tests 1 and 2, involving hypotheses about a parameter θ, taking

the value θ0 under the null hypothesis, and with a simple alternative hypoth-
esis of form {θA }, for some θA > θ0 , with similar level and power, will be
compared. Pick a test level α and a power 1 − β, and the sample size n1 for
A
test 1. The power and level conditions on T1 imply a value
for θ under the
A ◦
alternative hypothesis; that is, θ solves PθA T1 ≥ t1,n1 = 1 − β. Note that
θA is a function of n1 , α, and β. Under conditions (2.14), one can determine
the minimal value of n2 so that test 2 has power at least 1 − β, under the
alternative given by θA . Report n1 /n2 as the relative efficiency of test 2 to
test 1; this depends on n1 , α, and β.
Comparing Tests 29

Define the asymptotic relative efficiency AREα,β [T1 , T2 ] as

lim n1 /n2 ,
n1 →∞

when this limit exists. Considering this quantity removes dependence on n1 .

This measure comparing efficiencies of two tests takes on a particularly
easy form in a special, yet common, case, in which both statistics are asymp-
totically Gaussian. In this case, the relative efficiency can be approximated
in terms of standard deviations and derivatives of means under alternative
hypotheses. General approximations for sample size, power, and effect sizes
are investigated first; these are applied to relative efficiency later.

2.4.1 Power, Sample Size, and Effect Size

This subsection presents formulas for power, sample size, and effect size, that
may be used for efficiency comparisons, but are also useful on their own. Gaus-
sian approximations earlier in this chapter often applied a continuity correc-
tion; this correction will not be applied for large-sample power and sample
size calculations, as the effect of this correction quickly becomes negligible as
the sample size increases. Without loss of generality, take θ0 = 0.

2.4.1.1 Power
Consider test statistics satisfying

Tj ∼ G(µj (θ), ςj2 (θ)), for ςj (θ) > 0, µj (θ) increasing in θ. (2.15)

The Gaussian distribution in (2.15) does not need to hold exactly; holding
approximately is sufficient. In this
h case, onei can find the critical values for
◦ ◦
the two tests, tj,nj , such that P0 Tj ≥ tj,nj = α. Since (Tj − µj (0))/ςj (0) is
approximately standard Gaussian under the null hypothesis, then

α = P0 [(Tj − µj (0))/ςj (0) ≥ zα ] = P0 [Tj ≥ µj (0) + ςj (0)zα ] .

Hence
t◦j,nj = µj (0) + ςj (0)zα . (2.16)
The power for test j is approximately

$j,nj (θA ) ≈ PθA [Tj ≥ µj (0) + ςj (0)zα ]

= 1 − Φ µj (0) + ςj (0)zα − µj (θA ) /ςj (θA ) .

(2.17)

Often the variance of the test statistic changes slowly as one moves away from
the null hypothesis; in this case, the power for test j is approximately

$j,nj (θA ) ≈ 1 − Φ µj (0) − µj (θA ) /ςj (0) + zα .

(2.18)
30 One-Sample Nonparametric Inference

2.4.1.2 Sample and Effect Sizes

When the test statistic variance decreases in a regular way with sample size,
one can invert the power relationship to determine the sample size needed for
a given power and effect size. Consider tests satisfying, in addition to (2.15),

ςj2 (θ) = σj2 (θ)/nj . (2.19)

Then

A √ σj (0)zα A A
$j,nj (θ ) = 1 − Φ nj µj (0) + √ − µj (θ ) /σj (θ ) . (2.20)
nj

As sample sizes increase, power increases for a fixed alternative, and calcula-
tions will consider a class of alternatives moving towards the null. Calculations
below will consider behavior of the expectation of the test statistic near the
null hypothesis (which is taken as θ = 0). Suppose that

µj (θ), σj (θ) are differentiable on some set θ ∈ (−, ). (2.21)

(These conditions are somewhat simpler than considered by Noether (1950);

in particular, note that (2.15), (2.19), and (2.21) together are not enough
to demonstrate the second condition of (2.14).) Without loss of generality,
continue to take θ0 = 0. In this case, critical values for the two tests are given
in (2.16). The power expression (2.17) may be simplified by approximating
the variances at the alternative hypothesis by quantities at the null. For large
nj , alternatives with power less than 1 will have alternative hypotheses near
the null, and so σj (θA ) ≈ σj (θ0 ). Hence

µj (0) − µj (θA )

A √
$j,nj (θ ) ≈ 1 − Φ nj + zα
σj (0)
µj (θA ) − µj (0)

√
= Φ nj − zα . (2.22)
σj (0)
This expression for approximate power may be solved for sample size, by
noting that if
$j,nj (θA ) = 1 − β, (2.23)
√ A
h i
µ (θ )−µ (0)
then $j,nj (θA ) = Φ(zβ ), and (2.22) holds if nj j σj (0) j − zα = zβ , or

nj = σj (0)2 (zα + zβ )2 /(µj (θA ) − µj (0))2 . (2.24)

Common values for α and β are 0.025 and 0.2, giving upper Gaussian quantiles
of zα = 1.96 and zβ = 0.84. Recall that z with a subscript strictly between
0 and 1 indicates that value for which a standard Gaussian random variable
has that probability above it.
It may be of use in practice, and will be essential in the efficiency cal-
culations below, to approximate which member of the alternative hypothesis
Comparing Tests 31

corresponds with a test of a given power, with sample size held fixed. Solving
(2.24) exactly for θA is difficult, since the function µ is generally non-linear.
Approximating this function using a one-term Taylor approximation,

µj (θA ) ≈ µj (0) + µ0j (0)θA .

(Contrast this with approximation of the alternative standard deviation by

the null standard deviation, as in the transition from (2.20) to (2.22). Ap-
proximation by the leading term alone cannot be applied to the expectation
µj (θ), since it would remove all of the effect of the difference between null and
alternative.) The power for test j is approximately
√ √
1 − Φ (σj (0)zα − nj µ0j (0)θA )/σj (0) = 1 − Φ zα − nj ej θA

for
ej = µ0j (0)/σ(0).
The quantity ej is called the efficacy of test j. Setting this power to 1 − β,
√
zα − nj ej θA = z1−β . Solving this equation for θA ,
√
θA ≈ (zα + zβ )/ nj ej , (2.25)

verifying requirement that θA get close to zero. This expression can be used
to approximate an effect size needed to obtain a certain power with a certain
sample size and test level, and will be used in the context of asymptotic relative
efficiency.

2.4.2 Efficiency Calculations

Equating the alternative hypothesis parameter values (2.25) corresponding to
√ √
power 1 − β, then (zα − z1−β )/ n1 e1 = (zα − z1−β )/ n2 e2 , or

AREα,β [T1 , T2 ] = n2 /n1 = e21 /e22 .

Note that this relative efficiency doesn’t depend on α or β, or on n1 . As an

example, suppose X1 , . . . , Xn are independent observations from a symmetric
distribution with finite variance ρ2 and mean θ. Then θ is also the median of
these observations. Compare tests T1 , the t-test, and T2 , the sign test. Then T1
has a distribution depending on the distribution of Xj , and T2 has a binomial
distribution. Note that T1 has approximately a standard Gaussian distribution
for large n1 . That is, T1 ∼ G(θ/ρ, 1/n1 ), and

µ01 (0) = 1/ρ, σ1 (0) = 1, and e1 = 1/ρ. (2.26)

On the other hand, T2 ∼ G(µ2 (θ), σ2 (θ)2 /n2 ) for

p
µ2 (θ) = F (θ), σ2 (θ) = F (θ)(1 − F (θ)). (2.27)
32 One-Sample Nonparametric Inference

TABLE 2.4: Empirical powers for one-sample location tests with sample size
ratios indicated by asymptotic relative efficiency

Larger sample size t test sign test

20 0.5623 0.2241
100 0.5647 0.3897
1000 0.5594 0.5040
10000 0.5621 0.5438

Hence µ02 (0) = f (0), and σ2 (0) = 1/2, and e2 = 2f (0).

The asymptotic relative efficiencies of these statistics depends on the dis-
tribution
√ that generates the data. p If data come from G(θ, ρ2 ), √ then µ02 (0) =
1/( 2πρ), σ2 (0) = 1/2, and e2 = 2/π/ρ. Then n1 /n2 ≈ (2/ 2π)2 = 2/π.
Hence, as expected, the t-test is more powerful; the sign test requires more
than 50% more observations to obtain the same power against the same al-
ternative, for large samples.
If the data come from a Laplace distribution, then ρ = 1, since the Laplace
distribution has variance 1. Substituting
√ into (2.26), µ01 (0)
√ = 1, σ1 (0) = 1,
0
and
√ e 1 = 1. Also µ 2 (0) = 1/ 2, σ 2 (0) = 1/2, and e 2 = 2. Hence n1 /n2 ≈
( 2)2 = 2; in this case, the sign test is more powerful, requiring roughly half
the sample size as does the t-test.
Table 2.4 contains results of a simulation to check actual powers that the
asymptotic relative efficiency calculations show should be approximately the
same. The table shows powers of the level 0.05 two-sided t and exact sign tests
for Laplace data sets, of size n1 and n2 = n1 /2 respectively, shifted to have
expectation 3/n1 . Data sets need to be quite large in order for sample sizes in
the ratio of the asymptotic relative efficiency to give equal power.
Now suppose these data come from a Cauchy distribution shifted to have
point of symmetry θ. In this case, the expectation of the distribution does
not exist, the standard deviation ρ is infinite, and the distribution is not
approximately Gaussian even in large samples. In fact, the distribution of the
the mean of Cauchy random variables is again a Cauchy random variable,
with no change in the spread of the distribution. Plugging into the definition
of efficacy, without worrying about regularity conditions, gives µ01 (0) = 1,
σ1 (0) = ∞, and e1 = 0. On the other hand, the quantities for the sign test
are
µ02 (0) = π −1 , σ2 (0) = 1/2, and e2 = 2/π.
Hence, for Cauchy responses, the efficiency of the sign test relative to the t-
test is n1 /n2 ≈ ∞. This abuse of notation retains the interpretation that the
sign test is infinitely more efficient for Cauchy observations.
Table 2.5 summarizes these calculations.
Comparing Tests 33

TABLE 2.5: Efficacies for one-sample location tests

t-test Sign
√ test Relative
Gaussian µ0 (0) 1/ρ 1/( 2πρ)
σ(0) 1 p 1/2 p
e 1/ρ 2/π/ρ
√ π/2
Laplace µ0 (0) 1 1/ 2
σ(0) 1 1/2
√ √
e 1 2 1/ 2
Cauchy µ0 (0) 1 π −1
σ(0) ∞ 1/2
e 0 2/π 0

2.4.3 Examples of Power Calculations

The Gaussian approximations to power (2.17), to sample size (2.24), and to
effect size (2.25), may be used to assist in planning an experiment.

Example 2.4.1 In this example I calculate power for a sign test ap-
plied to 49 observations from a Gaussian distribution with unit vari-
ance. Suppose X1 , . . . , X49 ∼ G(θ, 1), with null hypothesis θ = 0 and
alternative hypothesis θ = 1/2. The sign test statistic, divided by
n, approximately satisfies (2.15) and √ (2.19), with µ and σ given by
(2.27). Then√ µ1 (0) = .5, σ1 (0) = 0.5 × 0.5 = .5, µ1 (0.5) = 0.691,
σ1 (0.5) = .691 × 0.309 = 0.462, and power for a one-sided test of
level 0.025, or a two-sided test of level 0.05, is approximated by (2.17):
1−Φ(7×(0.5+0.5×1.96/7−0.691)/0.462) = 1−Φ(−0.772) = 0.780. The
null and alternative standard deviations are close enough to motivate the
use of the simpler approximation (2.22), approximating power as

1 − Φ(7 × (0.5 − 0.691)/0.5 + 1.96) = 1 − Φ(−0.714) = 0.769.

If, instead, a test of power 0.85 were desired for alternative expectation
1/2, with a one-sided test of level 0.025, zα = 1.96, and zβ = 1.036. From
(2.24), one needs at least

n = (0.462)2 (1.96 + 1.036)2 /(0.691 − 0.5)2 = 52.52

observations; choose 53.

Finally, one might determine how large an effect one might detect
using the original 49 observations
p with a test of level 0.025 and power
0.85. One could use e = 2/π = 0.797, from the box in Table 2.5 specific
to the sign test and the Gaussian distribution. Expression (2.25) gives
this number as (1.96 + 1.036)/(7 × 0.797) = 0.537.
34 One-Sample Nonparametric Inference

2.5 Distribution Function Estimation

Suppose one wishes to estimate a common distribution function of X1 , . . . , Xn
independent variables. For x in the range of Xj , let F̂ (x) be the number of
data points less than or equal to x, divided by n. Since the observations are
independent, F̂ (x) ∼ n−1 Bin(n, F (x)). A confidence interval for F (x) is
q
F̂ (x) ± zα/2 F̂ (x)(1 − F̂ (x))/n. (2.28)

The above intervals will extend outside [0, 1], which is not reasonable; this can
be circumvented by transforming the probability scale.
Figure 2.4 represents the bounds from (2.28), without any rescaling, to
be discussed further in the next example. Confidence bounds in Figure 2.4
exhibit occurrences of larger estimates being associated with upper confi-
dence bounds that are smaller (ex., in Figure 2.4, the region between the
second-to-largest and the largest observations), and for the region with the
cumulative distribution function estimated at zero or one (that is, the region
below the smallest observed value, and the region above the largest observed
value), confidence limits lie on top of the estimates, indicating no uncertainty.
Both of these phenomena are unrealistic. The first phenomenon, that of non-
monotonic confidence bounds, cannot be reliably avoided through rescaling;
the second, with the upper confidence bounds outside the range of the data,
can never be repaired through rescaling. A preferred solution is to substitute
the intervals of Clopper and Pearson (1934), described in §1.2.2.1, to avoid
all three of these problems (viz., bounds outside (0, 1), bounds ordered differ-
ently than the estimate, and bounds with zero variability). Such intervals are
exhibited in Figure 2.5.
Finally, the confidence associated with these bounds is point-wise, and not
simultaneous. That is, if (L1 , U1 ) and (L2 , U2 ) and are 1−α confidence bounds
associated with two ordinates x1 and x2 , then P [L1 ≤ F (x1 ) ≤ U1 ] ≥ 1 − α
and P [L2 ≤ F (x2 ) ≤ U2 ] ≥ 1 − α, at least approximately, but the preced-
ing argument does not bound P [L1 ≤ F (x1 ) ≤ U1 and L2 ≤ F (x2 ) ≤ U2 ] any
higher than 1 − 2α.

Example 2.5.1 Consider the arsenic data of Example 2.3.2. For every
real x, one counts the number of data points less than this x. For any
x less than the smallest value 0.073, this estimate is F̂ (x) = 0. For x
greater than or equal to this smallest value and smaller than the next
smallest value 0.080, the estimate is F̂ (x) = 1/21. This data set con-
tains one duplicate value 0.118. For values below, but close to, 0.118 (for
example, x = 0.1179), F̂ (x) = 4/21, since 21 of the observations are
less than x. However, F̂ (x) = 6/21; the jump here is twice what it is
at other data values, since there are two observations here. This esti-
Exercises 35

mate is sketched in both Figures 2.4 and 2.5, and may be constructed in
R using ecdf(arsenic$nails), presuming the data of Example 2.3.2 is
still present to R. The command ecdf(arsenic$nails) does not produce
confidence intervals; use
library(MultNonParam); ecdfcis(arsenic$nails,exact=FALSE)
to add confidence bounds, and changing exact to TRUE forces exact in-
tervals.

FIGURE 2.4: Empirical CDF and Confidence Bounds for Arsenic in Nails
. . . . . . . . . . . . . . ..
.
1 .. . . ◦......................................................................
.
.
◦..........................................................................................•..
.. .
. ◦.•. .
. .
. .. . . . . . . . . . . . . . . .
. ◦...................• .
. ◦ ....•
.. .
0.8 .
.◦ ....•
. .
.
. . .
. ◦ ..•
.. . . . .
. .
. ◦ .•
.. .
.
. ..
. ◦ • .
0.6 . . ..
. ◦ •
.
Probab- .◦
.
.....•
.. .
.
ility .◦.. .
•
. .
.. .
.◦
•
0.4 .. .
◦
.• .
.. .
.•
◦
. ..
•. .
.◦
..
..
.. . Empirical CDF
0.2 .◦...
•
...
. ◦
•
.
. . Confidence Bounds
..•
◦ .
..
..
◦
•
..
..
0 ..................................................................•
..
.

-1 0 1 2 3
Arsenic in Nails
Confidence Level 0.95 Approximate

2.6 Exercises
1. Calculate the asymptotic relative efficiency for the sign statistic
relative to the one-sample t-test (which you should approximate
using the one-sample z-test). Do this for observations from the
a. uniform distribution, on [−1/2, 1/2] with variance 1/12 and mean
under the null hypothesis of 0, and
b. logistic distribution, symmetric about 0, with variance π 2 /3 and
density exp(x)/(1 + exp(x))2 .
36 One-Sample Nonparametric Inference

FIGURE 2.5: Empirical CDF and Confidence Bounds for Arsenic in Nails

1 .
. . . . . . . . . . . . . . .• ....................................................................
. . . . ...........................................................................................
.. • ◦
.
. •.◦. .
. •...................◦
..
. .. . . . . . . . . .
. • ....◦
.. .
0.8 .
.• ....◦
. . . . . . . . . . . . . . . .
.
. . .
. • ..◦
.. .
. ... .
. • ◦ .
. . . . . .
. • ◦ .
0.6 . ◦ . . .
. • .
Probab- .•
.
.. . .
.....◦

ility .•.. . .
◦
. ..
.•..
◦
0.4 .. .
.•
◦ .
.•.. .
◦
. ..
.• ◦. ..
.
. .
..
Empirical CDF .
0.2 . . . . . . . . . . . .
.• ...
◦
Confidence Bounds
•
◦..
.
.
•.◦..
•
◦...
..
0 ..................................................................◦

-1 0 1 2 3
Arsenic in Nails
Confidence Level 0.95 Exact

2. The data set

HTTP://ftp.uni-bayreuth.de/math/statlib/datasets/lupus

gives data on 87 lupus patients. The third column gives duration,

and the fourth column gives transformed disease duration.
a. Give a 90% confidence interval for the median duration, through
inverting the sign test, and compare this to the normal theory inter-
val for the mean. Keep in mind that the normal theory and sign test
approach are only comparable if you can argue that the mean and
the median for the distribution are plausibly the same. Comment
on this.
b. Give a 90% point-wise confidence bound for the distribution func-
tion for disease duration.
3. The data set

HTTP://lib.stat.cmu.edu/datasets/bodyfat

gives data on body fat in 252 men. The second column gives propor-
tion of lean body tissue. Give a 95% confidence interval for upper
quartile proportion of lean body tissue. Note that the first 116 lines
Exercises 37

and last 10 lines are data set description, and should be deleted.
(Line 117 is blank, and should also be deleted).
4. Suppose 49 observations are drawn from a Cauchy distribution, dis-
placed to have location parameter 1.
a. What is the power of the sign test at level 0.05 to test the null
hypothesis of expectation zero for these observations?
b. What size sample is needed to distinguish between a null hy-
pothesis of median 0 and an alternative hypothesis of median 1, for
independent Cauchy variables with a one-sided level 0.025 sign test
to give 80% power?
3
Two-Sample Testing

This chapter addresses the question of two-sample testing. Data will gener-
ally consist of observations X1 , . . . , XM1 from continuous distribution function
F , and observations Y1 , . . . , YM2 from a continuous distribution function G.
Model these observations as independent, and unless otherwise specified, treat
their distributions as identical, up to some known shift θ; that is,

F (z) = G(z − θ) ∀z. (3.1)

Techniques for testing null a hypothesis of form θ = θ0 in (3.1), vs. the al-
ternative that (3.1) holds for some alternative θ 6= θ0 , and for estimating θ
assuming (3.1), are presented. Techniques for tests in which (3.1) is the null
hypothesis, for an unspecified θ, are also presented.

3.1 Two-Sample Approximately Gaussian Inference

Two-sample Gaussian-theory inference primarily concerns expectations; how-
ever, one might also compare other aspects of distributions. Under the as-
sumption of an approximate Gaussian distribution, the only additional aspect
of the distributions to be compared is their dispersion.

3.1.1 Two-Sample Approximately Gaussian Inference on

Expectations
If the observations X1 , . . . , XM1 and Y1 , . . . , YM2 are approximately Gaussian
distributed, one might use the test statistic
q
T (θ) = (Ȳ − X̄ − θ)/ s2p (1/M2 + 1/M1 ), (3.2)

for PM1 PM2

2 − X̄)2
i=1 (Xi − Ȳ )2
i=1 (Yi
SX = , SY2 = (3.3)
M1 − 1 M2 − 1
and
2
(M1 − 1)SX + (M2 − 1)SY2
s2p = ,
N −2
39
40 Two-Sample Testing

for N = M1 + M2 . Statistic (3.2) is called the two-sample pooled t statistic,

and the associated test is the two-sample pooled t-test. When θ is correctly
specified,
T (θ) ∼ TN −2 , (3.4)
and so the standard test of level α rejects the null hypothesis θ = θ0 in (3.1)
when
|T (θ0 )| ≥ TN −2,α/2 , (3.5)
for T (θ) of (3.2). Here TN −2,α/2 is the 1 − α/2 quantile of the TN −2 distribu-
tion.
Estimates of θ, under the assumption (3.1), may be constructed by setting
T (θ̂) = 0; that is, θ̂ = Ȳ − X̄. Confidence intervals are generally constructed
by inverting the pooled two-sample t-test (3.2)p and (3.5) to obtain the interval
{θ||T | ≤ TN −2,α/2 } = Ȳ − X̄ ± sp TN −2,α/2 1/M1 + 1/M2 .

3.1.2 Approximately Gaussian Dispersion Inference

One might consider the formerly-alternative hypothesis (3.1), with θ unspec-
ified, as a null hypothesis. Under the Gaussian hypothesis, (3.1) fails to hold
only if the variances of the distributions are unequal. In order to compare
variances of Gaussian variables, one might compare the separate variance es-
timates. Under the model (3.1), and with the distributions approximately
Gaussian,
T = SY2 /SX
2
∼ FM2 −1,M1 −1 (3.6)
2
for SX and SY2 of (3.3) (Fisher, 1925, p. 808), although Fisher (1930) rec-
ommended transforming this ratio by taking logs to obtain an approximately
Gaussian test statistic. A simpler test, with E [X] and E [Y ] known, is also
available (Fisher, 1926).
Pearson (1931) notes that, under the hypothesis of equal variances, the log
of differences in estimated standard deviations is a monotonic transformation
of (1 + M2 T /M1 )−1 , and that this quantity follows a Pearson I family; he fur-
ther considers the test arising from comparing this statistic to this exact null
sampling distribution. Fisher (1973), crediting Snedecor (1934), recommends
comparing T of (3.6) to the FM2 −1,M1 −1 distribution.
When the underlying distribution of X1 , . . . , XM1 and Y1 , . . . , YM2 is not
exactly Gaussian, the distributional result in (3.4) and in (3.6) are approxi-
mate rather than exact. This approximation of (3.6) was observed to be poor
for even moderate deviations from the Gaussian distribution, in cases when
(3.4) remains entirely adequate (Pearson, 1931). Testing for equality of dis-
persion is revisited in §3.9, and in some sense a nonparametric dispersion test
is of more urgency than the test of location.
General Two-Sample Rank Tests 41

3.2 General Two-Sample Rank Tests

Nonparametric alternatives to the two-sample pooled t-test, to be developed
in this chapter, will reduce to rank tests of the form
N
{k} {k}
X
TG = aj Ij , (3.7)
j=1

{k}
for Ij equal to the 1 if the item ranked j in the combined sample comes
{k}
from the group k, and 0 otherwise. The superscript in Ij refers to group,
and does not represent power.
{2}
The statistic TG is designed to take on large values when items in group
two are generally larger than the remainder of the observations (that is, the
items in group one), and to take small values when items in group two are
{1}
generally smaller than the remainder of the observations. The statistic TG is
designed to take on large values when items in group one are generally larger
than the remainder of the observations (that is, the items in group two), and
to take small values when items in group one are generally smaller than the
{1}
remainder of the observations. The statistic TG provides no information not
{2} {2} P N {1}
also captured in TG , since TG = j=1 aj − TG .

3.2.1 Null Distributions of General Rank Statistics

In order to use this statistic in a hypothesis test, one first needs to know
its distribution under the null hypothesis. That h is, one i needs first to be
{2} {2}
able to calculate one-sided p-values of form P0 TG ≥ t , for TG of (3.7)
and various values of t, with the subscript 0 on the probability indicating
calculation under the null hypothesis. One might calculate p-values for these
N

test statistics exactly: List all M2
possible ways to divide the N objects into
two groups, one of size M1 . Calculate the test statistic for each rearrangement.
Count the number of rearrangements giving a test statistic as extreme or
N
more extreme than the one observed. Divide this count by M 2
to obtain
{2}
the p-value. If there exist different rankings giving the same value for TG ,
this information might be used to simplify calculations, but in general one
cannot depend on such a simplification. Hence these calculations are likely
quite slow. A common simplification is to round the scores so that the scores
become integer multiples of a common value; these simplified scores may then
be more amenable to exact analysis.
More commonly, the distribution of rank statistics is approximated by the
{2}
Gaussian distribution. Since TG is the sum of random variables that are
neither identically distributed nor independent, a variant of the usual central
limit theorem due to Erdös and Réyni (1959), specific to finite population
42 Two-Sample Testing

sampling without replacement, is used. There are some conditions on the set
of scores needed to ensure that they are approximately Gaussian. Critical
values for the two-sided test of level α of the null hypothesis of equality of
distribution, vs. the alternative, relying on the Gaussian approximation, are
given by
h i r h i
{2} {2}
E0 G ± Var0 TG zα/2
T (3.8)

and the p-value is given by

!
h i r h i
{2} {2} {2}
2Φ̄ |TG − E0 TG |/ Var0 TG . (3.9)

Moments needed to evaluate (3.8) and (3.9) are given in the next section.

3.2.2 Moments of Rank Statistics

{1} {2}
The Gaussian approximation to the null distribution of TG and TG requires
calculation of the null expectation and variance of the statistic. This subsection
determines moments for a statistic of the form (3.7). Under the null hypothesis,
M1 of these scores are assigned to individuals in the first group, and M2 scores
are assigned to individuals in the the second group, with all rearrangements
having equal probability. The scores for the Y group may be thought PN of as
sampled without h replacement
i P from h a finite
i population.
h i Let ā = k=1 k /N .
a
{k} N {k} {k}
In this case, E TG = j=1 aj E Ij , and E Ij = Mk /N . Hence

h i N
{k}
X
E0 TG = Mk aj /N = Mk ā; (3.10)
j=1

the subscript 0 on the expectation operator indicates that the expectation is

taken under the null hypothesis.
{k}
Calculating the variance is harder, since the Ij are not independent. Let
h i
PN {k}
â = k=1 a2k /N . Let b1 = Var Ij = Mk (N − Mk )/N 2 . When j 6= i, then
h i h i
{k} {k} {k} {k}
E Ij Ii = P Ij = 1, Ii = 1
h i h i M −1M
{k} {k} {k} k k
= P Ij = 1|Ii = 1 P Ii = 1 = .
N −1 N
Let
h i M (M − 1) M 2 (N − Mk )Mk
{k} {k} k k
b2 = Cov I2 , I1 = − k2 = − .
N (N − 1) N N 2 (N − 1)
A First Distribution-Free Test 43

TABLE 3.1: Reduction of Two-Sample Testing Problem to Fisher’s Exact

Test, via Mood’s Test, for an even total sample size

Y X Total
Greater than Median A N/2 − A N/2
Less than Median B N/2 − B N/2
Total M2 M1 N

So
h i N X
N h i
{k} {k} {k}
X
Var TG = ai aj Cov Ii , Ij
i=1 j=1
N
X X
= a2i b1 + ai aj b2 = (b1 − b2 )N â + N 2 b2 ā2
i=1 i6=j

= (N − Mk )Mk (â − ā2 )/(N − 1). (3.11)

3.3 A First Distribution-Free Test

The nonparametric approach analogous to the sign test is Mood’s median test.
One first calculates the combined sample median. Let A be number of obser-
vations from Y ’s above the combined median. Let B be number of obser-
vations from Y ’s below the combined median. Under the null hypothesis
F (x) = G(x)∀x, and when N is even, A has a hypergeometric distribution,
and Mood’s test reduces the two-sample equality of distribution problem to
Fisher’s exact test, as exhibited in Table 3.1. Mood’s test TM is equivalent to
{2}
the score test TG of (3.7) with

1
 for j ≥ (N + 1)/2
aj = 0 for j = (N + 1)/2 (3.12)

−1 for j ≤ (N + 1)/2,


although, as originally formulated, this test was applied only in the case of
even sample sizes, and so the score 0 would not be used.
Mood’s test, and other rank tests, ignore the ordering of the observations
from the first group among themselves, and similarly ignore the orderings of
the second group among themselves. Represent the data set as a vector of N
symbols, M1 of them X and M2 of them Y . The letter X in position j indicates
that, after ordering a combined set of N observations, the observation ranked
j comes from the first group, and the letter Y in position j indicates that
the observation ranked j comes from the second group. The advantage of
44 Two-Sample Testing

TABLE 3.2: Reduction of Two-Sample Testing Problem to Fisher’s Exact

Test, via Mood’s Test, for an odd total sample size

Y X Total
Greater than Median A (N − 1)/2 − A (N − 1)/2
Equal to Median C 1−C 1
Less than Median B (N − 1)/2 − B (N − 1)/2
Total M2 M1 N

Mood’s test lies in its simplicity, and its disadvantage is its low power. To
see why its power is low, consider the test with M1 = M2 = 3, for a total
of six observations. A data set label X, Y, X, Y, X, Y indicates that the lowest
observation is from the first group, the second lowest is from the second group,
the third lowest is from the first group, the fourth lowest is from the second
group, the fifth lowest (that is, the second highest) is from the first group,
and the highest is from the second group. Mood’s test treats X, Y, X, Y, X, Y
and X, Y, X, X, Y, Y as having equal evidence against H0 , but the second
should be treated as having more evidence. Furthermore, Mood’s test takes a
value between 0 and min(M2 , bN/2c). This high degree of discreteness in the
statistic’s support undermines power.
Westenberg (1948) presented the equal-sample case of the statistic, and
Mood (1950) detailed the use in the case with an even combined sample size.
Mood’s test is of sufficiently low importance in applications that the early
references did not bother to present the slight complication that arises when
the combined sample size is odd.
When the total sample size is odd, one might represent the median test
as inference from a 2 × 3 contingency table with ordered categories, as in
Table 3.2. Then TM = A − B. Then the one-sided p-values may be calculated
as
M1 M2
P0 [A − B ≥ t] = P0 [A − B ≥ t|C = 0] + P0 [A − B ≥ t|C = 1] .
N N
The probabilities P0 [A − B ≥ t|C = c] are calculated from the hypergeomet-
ric distribution.
The null expectation and variance of TM are given by (3.10) and (3.11)
respectively. Note ā = 0, and
(
(N − 1)/N if N odd
â =
1 if N even.

Then (3.10) shows that

(
M1 M2 /N if N odd
E0 [TM ] = 0, Var0 [TM ] = (3.13)
M1 M2 /(N − 1) if N even.
A First Distribution-Free Test 45

Critical values and p-values for Mood’s test may be calculated from (3.8) and
(3.9) respectively.

Example 3.3.1 Cox and Snell (1981, Example Q) present data on

breaking loads (in oz.) of yarn, of two types (A and B) coming from
six bobbins. Data may be found at
http://stat.rutgers.edu/home/kolassa/Data/yarn.dat .
Each combination of bobbin and type is represented four times, for 48 ob-
servations in a balanced design. Gaussian quantile plots may be calculated
using
yarn<-as.data.frame(scan("yarn.dat",what=
list(strength=0, bobbin=0,type="")))
par(mfrow=c(2,1))
for(yt in c("A","B"))
qqnorm(yarn$strength[yarn$type==yt],
pch=yarn$bobbin[yarn$type==yt],
main=paste("Gaussian QQ plot for Yarn Type",yt))
par(mfrow=c(1,1))
and are presented in Figure 3.1. The deviation from a straight line for
these points indicates departure from the Gaussian distribution; a poten-
tial thresholding associated with bobbin indicates a role for bobbin as a
second factor, to be addressed in Chapter 5. Boxplots by type are shown
in Figure 3.2. Use Mood’s test for a difference in median, ignoring the
effect of bobbin. The median strength is 15.75; 15 yarn samples of type B
are above the joint median. This leads to Table 3.3, and TM = 15−9 = 6.
In this even-sample case, from (3.13), Var0 [TM ] = 24 × 24/47√= 12.26.
Hence the approximately standard Gaussian statistic
√ is (6 − 0)/ 12.26 =
1.71 (and, corrected for continuity, is (5−0)/ 12.26 = 1.43). The correc-
tion for continuity here is 1, because possible values of TM are two units
apart. The p-value 2 × Φ̄(1.43) = 0.153. Do not reject the null hypothesis
of equal yarn strength. You might also do this using

library(MultNonParam)
attach(yarn)
mood.median.test(strength[type=="A"],strength[type=="B"])
or

genscorestat((strength>median(strength))*2-1,type,correct=1)
Compare this with the two-sample t-test:
46 Two-Sample Testing

t.test(strength[type=="A"],strength[type=="B"])
detach(yarn)

to give a p-value 0.029.

FIGURE 3.1: Normal Quantile Plot for Yarn

M
18
♦
♦
17

Sample × Symbol Bobbin

×
Quantile 16 M5
M 1
♦5♦
Yarn 5 ♦ 2
Type 15 +5 M 3
M ×+
A + 4
+
14
× 5
5 6
13 +
×

-2 -1 0 1 2
Normal Quantile
×
19
♦

18 × Symbol Bobbin
Sample 1
Quantile 17 ♦
×× ♦ 2
Yarn
M 3
Type 5♦
B 16 M5+ + 4
5+ ×
M+ 5
15 5 + ♦ 5 6
M
M
14
-2 -1 0 1 2
Normal Quantile
Yarn Type B

Mood’s test, using (3.7) in conjunction with (3.12), is almost never used,
and is presented here only as a point of departure for more powerful tests.
Mood’s test looses power primarily because of the discreteness of the scores
(3.12). The balance of this chapter explores tests with less discrete scores.
The Mann-Whitney-Wilcoxon Test 47

TABLE 3.3: Mood’s test for the yarn example

Type A Type B Total

Greater than Median 9 15 24
Less than Median 15 9 24
Total 24 24 48

FIGURE 3.2: Boxplots of Yarn Strength (oz) by Type

Type ......................................................................
.... ... .
....................................... ..................................................................................
B .
..
.
...........................................................................
.

12 13 14 15 16 17 18 19 20
Strength

3.4 The Mann-Whitney-Wilcoxon Test

{2}
The Wilcoxon rank-sum test is defined as TG of (3.7), with aj = j
(Wilcoxon, 1945). That is, the statistic is the sum of ranks of observations
coming from the second group. Denote this specific example of the general
rank statistic as TW . The term “rank-sum” is used to differentiate this test
from another test proposed by the same author, to be discussed in a later
chapter.
An alternative test statistic for detecting group differences is
M1 X
X M2
TU = I(Xi < Yj ). (3.14)
i=1 j=1

This new statistic TU and the previous statistic TW can be shown to be iden-
tical. For j ∈ {1, . . . , M2 } indexing a member of the second sample, define Rj
as the rank of observation j among all of the observations from the combined
samples. Note that
M2
X M2
X
TW = Rj = #(sample entries less than or equal to Yj )
j=1 j=1
M2
X
= #(X values less than or equal to Yj )
j=1
M2
X
+ #(Y values less than or equal to Yj ) (3.15)
j=1
48 Two-Sample Testing

The first sum in (3.15) is TU , and the second is M2 (M2 + 1)/2, and so

TW = TU + M2 (M2 + 1)/2.

The test based on TU is called the Mann-Whitney test (Mann and Whitney,
1947). This statistic is a U statistic; that is, a statistic formed by summing
over pairs of observations in a data set.

3.4.1 Exact and Approximate Mann-Whitney Probabilities

The distribution of test statistic (3.14) can be calculated exactly, via recursion
(Festinger, 1946). Let cW (t, M1 , M2 ) the number of ways that M1 symbols X
and M2 symbols Y can be written in a vector to give TU = t, as in §3.3. Then

N
PM1 ,M2 U[T = t] = c W (t, M1 , M2 )/ . (3.16)
M2

The collection of vectors giving statistic value t can be divided according to

whether the last symbol is X or Y . If the last symbol was X, then ignoring
this final value, the vector still gives the same statistic value t, and there
are cW (t, M1 − 1, M2 ) such vectors. If the last symbol was Y , then ignoring
this final value, the factor gives the statistic value t − M1 , and there are
cW (t − M1 , M1 , M2 − 1) such vectors. So

cW (t, M1 , M2 ) = cW (t, M1 − 1, M2 ) + cW (t − M1 , M1 , M2 − 1). (3.17)

The recursion stops once either sample size hits zero:

( (
1 if t = 0 1 if t = 0
cW (t, M1 , 0) = , and cW (t, 0, M2 ) = . (3.18)
0 if t 6= 0 0 6 0
if t =

The maximal value for t is

N (N + 1)/2 − M1 (M1 + 1)/2 = M2 (2M1 + M2 + 1)/2;

hence the recursion can be stopped early by noting that

cW (t, M1 , M2 ) = 0 if t < M2 (M2 +1)/2 or if t > M2 (2M1 +M2 +1)/2. (3.19)

A natural way to perform these calculations is with recursive calls to a com-

puter routine to calculate lower-order probabilities, although the algorithm
can be implemented without such explicit recursion (Dinneen and Blakesley,
1973).

3.4.1.1 Moments and Approximate Normality

Using this recursion can be slow, and the argument at the end of §3.2.1 can
be used to show that the distribution of the test statistic is approximately
The Mann-Whitney-Wilcoxon Test 49

Gaussian. Fortunately, a central limit theorem applies to this statistic (Erdös

and Réyni, 1959).
The Wilcoxon version Pof the Mann-Whitney-Wilcoxon statistic is given by
N PN
(3.7) with aj = j. Then j=1 aj = j=1 j = N (N + 1)/2, and ā = (N + 1)/2.
Hence E0 [TW ] is M2 (N + 1)/2, using (3.10). Pw
In order to calculate Var0 [TW ] using (3.11), one needs g(w) = j=1 j 2 .
One might guess it must be cubic in N . Examine functions g(w) = aw3 +
bw2 + cw + d so that g(0) = 0 and g(w) − g(w − 1) = w2 . Then d = 0, and

w2 = (aw3 + bw2 + cw) −

aw3 + 3aw2 − 3aw + a − bw2 + 2bw − b − cw + c
= 3aw2 − 3aw + a + 2bw − b + c.

Equating quadratic terms above gives a = 1/3. Setting the linear term to zero
gives b = 1/2, and setting the constant term to zero gives c = 1/6. Then
w
X
j2 = g(w) = w(2w + 1)(w + 1)/6, (3.20)
j=1
â = (2N + 1)(N + 1)/6,
2
â − ā = (2N + 1)(N + 1)/6 − (N + 1)2 /4 = (N 2 − 1)/12, (3.21)

and from (3.10), (3.11) and (3.21),

E [TW ] = M2 (N + 1)/2, Var [TW ] = M1 M2 (N + 1)/12. (3.22)

In conjunction with the central limit theorem argument described above, one
can test for equality of distributions, with critical values and p-values given
by (3.8) and (3.9) respectively.

Example 3.4.1 Refer again to the yarn data of Example 3.3.1. Consider
yard strengths for bobbin 3.

14.2(B), 14.5(B), 14.8(A), 15.2(B), 15.8(A), 15.9(B), 16.0(A), 18.2(A).

Sum the ranks associated with Type B, to get TW = 1 + 2 + 4 + 6 = 13.

Here M1 = M2 = 4, and N + 1 = M1 + M2 + 1 = 9. From (3.22),
under the null hypothesis of equality of distributions, the expected value
of the rank sum is 4 × 9/2 = 18, and the variance is 4 × 4 × 9/12 = 12.
Hence the statistic,
√ after standardizing to zero mean and unit variance,
is (13 − 18)/ 12 = −1.44. The p − value is 0.149. This may be done
using R by

wilcox.test(strength~type,data=yarn[yarn$bobbin==3,],
exact=FALSE, correct=FALSE)
50 Two-Sample Testing
√
The continuity-corrected p-value uses statistic (13 + 0.5 − 18)/ 12, and
is 0.194, and might be done by

wilcox.test(strength~type,data=yarn[yarn$bobbin==3,],
exact=FALSE)
Finally, p-values might be calculated exactly using (3.17), (3.18), and
(3.19), and in R by

wilcox.test(strength~type,data=yarn[yarn$bobbin==3,],
exact=TRUE)
Moments (3.22) apply to the statistic given by scores aj = j. By contrast,
the Mann-Whitney statistic TU is constructed using (3.7) from scores aj = j −
(N +1)/2. The variance of this statistic is still given by (3.22); the expectation
is E [TU ] = M2 (N + 1)/2 − M2 (M2 + 1)/2 = M2 M1 /2.
The Wilcoxon variance in (3.22) increases far more quickly than that of
Mood’s test as the sample size N increases; relative to this variance, the
continuity correction is quite small, and is of little importance.

3.4.2 Other Scoring Schemes

One might construct tests using other scores aj . A variety of techniques
are available for use. One could use scores equal to expected value of order
statistics from Gaussian distribution; these are called normal scores. Alterna-
tively, one could use scores calculated from the Gaussian quantile function
aj = Φ−1 (j/(N + 1)) (Waerden, 1952), called van der Waerden scores, or
PN
scores of form aj = i=j i−1 (Savage, 1956), called Savage scores, or scores
equal to expected value of order statistics from exponential distribution, called
exponential scores. Van der Waerden scores are an approximation to normal
scores. Calculating exact probabilities for general score tests, and the difficul-
ties that this entails, was discussed at the end of §3.2.1.
Scores may be chosen to be optimal for certain distributions. Normal scores
are optimal for Gaussian observations. Exponential scores are optimal for
exponential observations. Original ranks are optimal for logistic observations.
Savage scores are optimal for Lehmann alternatives, discussed below at (3.28).

Example 3.4.2 Consider the nail arsenic data of Example 2.3.2. One
might perform an analysis using these scoring methods.

library(exactRankTests)#Gives savage and vw scores

arsenic$savagenails<-as.numeric(cscores(
arsenic$nails,type="Savage"))
arsenic$vwnails<-as.numeric(cscores(arsenic$nails,
type="Normal"))
The Mann-Whitney-Wilcoxon Test 51

The Savage scores are

0.603, 0.669, 0.850∗ , 0.669, −0.056∗ , −0.366, 0.902∗ ,

0.371, −0.199, 0.794, 0.952, −1.149, −0.816∗ , −2.649,
−1.649, 0.180, −0.566∗ , 0.454, 0.069∗ , 0.531∗ , 0.280∗ .

Asterisks denote men. The mean of these scores is ā = −0.005, and

the mean of the squares is â = 0.823. Test for equality of arsenic in
nails between sexes. Here M1 = 8 and M2 = 13. The expectation and
variance of the test statistic are given by (3.10), as 13ā = −0.065, and
13 × 8 × (â − ā2 )/20 = 4.28. Sum scores for women, the second gender
{k}
group; here k = 2, and TG = −1.32, the sum of √ scores above without
the asterisk. The z-statistic is (−1.32 − (−0.065))/ 4.28 = −0.60. The
p-value is 0.548. Do not reject the null hypothesis. These calculations
may be done using
library(MultNonParam)#Contains genscorestat
genscorestat(arsenic$vwnails,arsenic$sex)
genscorestat(arsenic$savagenails,arsenic$sex)
giving the same results for Savage scores, and the p-value 0.7834 for van
der Waerden scores.

3.4.3 Using Data as Scores: the Permutation Test

One might instead use the original data as scores. That is, sort the combined
data set (X1 , . . . , XM1 , Y1 , . . . , YM2 ) to obtain (Z(1) , . . . , Z(N ) ), with Z(i) ≤
Z(i+1) for all i; still assuming continuity, each inequality is strict. Then use
aj = Z(j) . Hence the test statistic is

N M2
{2}
X X
TP = Z(j) Ij = Yj = M2 Ȳ . (3.23)
j=1 j=1

The analysis is performed conditionally on (Z(1) , . . . , Z(N ) ); note that both

the statistic, and its reference distribution, depend on these order statistics.
Compare TP with the numerator of the two-sample pooled t-test (3.2):

N Z̄ − M2 Ȳ N Ȳ − N Z̄
Ȳ − X̄ = Ȳ − = = N TP − M2 Z̄ /(M1 M2 ),
M1 M1
PN
where Z̄ = i=1 Z(i) /N . The pooled variance estimate for the two-sample t
52 Two-Sample Testing

statistic is
PM1 PM2
j=1 (Xj − X̄)2 + j=1 (Yj − Ȳ )2
s2p =
N −2
PM1 2
PM2
j=1 (Xj − Z̄) − M1 (X̄ − Z̄)2 + j=1 (Yj − Z̄)2 − M2 (Ȳ − Z̄)2
=
N −2
(N − 1)s2Z − M1 (X̄ − Z̄)2 − M2 (Ȳ − Z̄)2
= .
N −2
Some algebra shows this to be
N − 1 (TP − M2 Z̄)2 (1/M1 + 1/M2 )
s2p = s2Z − .
N −2 (N − 2)
Hence, conditional on (Z(1) , . . . , Z(N ) ), the two-sample pooled t statistic is
p
(N − 2)N TP − M2 Z̄
q ,
(s2Z (N − 1)M1 M2 − (TP − M2 Z̄)2 N

for sZ the sample standard deviation of (Z(1) , . . . , Z(N ) ).

Hence the pooled two-sample t statistic is a strictly increasing function of
the score statistic TP with ordered data used as scores. However, while the
pooled t statistic is typically compared to a T distribution, the rank statistic
is compared to the distribution of values arising from random permutations of
the group labels; this is the same mechanism that generates the distribution
for the rank statistics
with scores determined in advance. In the two-sample
N
case, there are M 1
ways to assign M1 labels 1, and M2 labels 2, to the or-
der statistics (Z(1) , . . . , Z(N ) ). A less-efficient way to think of this process is
to specify N labels, the first M1 of them 1 and the remaining M2 of them
2, and randomly assign, or randomly permute, (Z(1) , . . . , Z(N ) ) without re-
N

placement; there are N ! such assignments, leading to at most M 1
distinct
values. The observed value of TP is then compared with the sampling distri-
bution arising from this random permutation of values; such a test is called a
permutation test. The same permutation concept coincides with the desired
reference distribution for all of the rank statistics in this chapter.

Example 3.4.3 Again consider the nail arsenic data of Example 2.3.2.
Recall that there are 21 subjects in this data set, of whom 8 are male.
The permutation test testing the null hypothesis of equality of distribution
across gender may be performed in R using

library(MultNonParam)
aov.P(dattab=arsenic$nails,treatment=arsenic$sex)
to give a two-sided p-value of 0.482. In this case, all 21

8 = 203490 ways
to reassign arsenic nail levels to the various groups were considered. The
Empirical Levels and Powers of Two-Sample Tests 53

TABLE 3.4: Levels for various Two-Sample Two-Sided Tests, Nominal level
0.05, from 100,000 random data sets each, sample size 10 each

Test Gaussian Laplace Cauchy

T-test 0.04815 0.04414 0.01770
Exact Wilcoxon 0.04231 0.04413 0.04424
Approximate Wilcoxon 0.05134 0.05317 0.05318
Normal Scores 0.04693 0.04744 0.04871
Savage Scores 0.04191 0.04340 0.04319
Mood 0.02198 0.02314 0.02314

TABLE 3.5: Powers for various Two-Sample Two-Sided Tests, Nominal level
0.05, from 100,000 random data sets each, sample size 10 each, samples offset
by one unit

Test Gaussian Laplace Cauchy

T-test 0.55445 0.35116 0.06368
Approximate Wilcoxon 0.54661 0.41968 0.20765
Normal Scores 0.53442 0.37277 0.16978
Savage Scores 0.47270 0.33016 0.15225

statistic TP of (3.23) was calculated for each assignment, this value was
subtracted from the null expectation Z̄, and the difference was squared
to provide a two-sided statistic. The p-value reported is the proportion of
these for which the squared differences among the reassignments meets or
exceeds that seen in the original data.

3.5 Empirical Levels and Powers of Two-Sample Tests

As in Table 2.1, one might simulate data from a variety of distributions, and
compare levels of the various two-sample tests. Results are in Table 3.4.
Table 3.4 shows that the extreme conservativeness of Mood’s test justifies
its exclusion from practical consideration. We see that the Wilcoxon test, cali-
brated exactly using its exact null distribution, falls short of the desired level;
a less-conservative equal-tailed test would have a level exceeding the nominal
target of 0.05. The conservativeness of the Savage Score test is somewhat sur-
prising. The close agreement between the level of the t-test and the nominal
level with Gaussian data is as expected, as is the poor agreement between the
level of the t-test and the nominal level with Cauchy data.
As was done in Table 2.3, one might perform a similar simulation under the
alternative hypotheses to calculate power. In this case, alternative hypotheses
were generated by offsetting one group by one unit. Results are in Table 3.5.
54 Two-Sample Testing

Table 3.5 excludes the exact version of the Wilcoxon test and Mood’s test,
since for these sample sizes (Mj = 10 for j = 1, 2), they fail to achieve the
desired level for any data distribution. The approximate Wilcoxon test has
comparable power to that of the t-test under the conditions optimal for the
t-test, and also maintains high power throughout.

3.6 Adaptation to the Presence of Tied Observations

The Mann-Whitney-Wilcoxon statistic is designed to be used for variables
arising from a continuous distribution. Processes expected to produce data
with distinct values, however, sometimes produced tied values, frequently be-
cause of limits on measurement precision. Sometimes an observation from
the first group is tied with one from the second group. Then the scheme for
assigning scores must be modified. Tied observations are frequently assigned
scores averaged over the scores that would have been assigned if the data had
been distinct; for example, if Z(1) , . . . , Z(N ) are the ordered values from the
combination of the two samples, and if Z(j+1) = Z(j) , then both observation
j and observation j + 1 are assigned score (aj + aj+1 )/2. The variance of the
test statistic must be adjusted for this change in scores.
When both tied observations come from the first group, or both from
the second group, then one might assume that the tie arises because of
imprecise measurement of a process that, measured more precisely, would
have produced untied individuals. The test statistic is unaffected by as-
signment of scores to observations according to either of the potential or-
derings. However, the permutation distribution is affected, because many
of the permutations considered will split the tied observations into differ-
ent groups. Return to variance formula (3.11). The average rank ā is un-
changed by modification of ranks, but the average squared rank â changes by
a2j + a2j+1 − (aj + aj+1 )2 /2 = (aj − aj+1 )2 /2. Then, for each pair of ties in
the data, the variance (3.11) is reduced by M1 M2 (aj − aj+1 )2 /(N − 1). This
process could be continued for triplets, etc., with more complicated expres-
sions for the correction. Lehmann (2006) derives these corrections for generic
numbers of replicated values, in the simpler case in which aj = j; in this case,
the correction is applied to the simpler variance expression (3.22).
It is simpler, however, to bypass (3.22), and, instead of correcting (3.11),
recalculating (3.11) using the new scores.
When the assumption of continuity of the distributions of underlying mea-
surements does not hold, the distribution of rank statistics is no longer inde-
pendent of the underlying data distribution, since the rank statistic distribu-
tion will then depend on the probability of ties. Hence no exact calculation of
the form in §3.4.1 is possible.
It was noted at the end of §3.4.1.1 that continuity correction is of little
Mann-Whitney-Wilcoxon Null Hypotheses 55

importance in case of rank tests. When average ranks replace the original
ranks, the continuity correction argument using Figure 2.2 no longer holds.
Potential values of the test statistic are in some cases closer together than 1
unit apart, and, in such cases, the continuity correction might be abandoned.

3.7 Mann-Whitney-Wilcoxon Null Hypotheses

The Mann-Whitney-Wilcoxon test was constructed to test whether the dis-
tribution F of the X variables is the same as the distribution G of the Y
variables. This null hypothesis implies that P [Xk ≤ Yj ] = 1/2. Unequal pairs
F and G violate the null hypothesis of this test. However, certain distribution
pairs violating the null hypothesis fall in the alternative hypothesis, but the
Mann-Whitney-Wilcoxon test has no power to distinguish these. This is true
if F and G are unequal but symmetric about the same point. In this case,
the standard error of the Mann-Whitney-Wilcoxon test statistic (3.11) is no
longer correct, and the expectation under this alternative
R∞ is the same as it is
under the null. The same phenomenon arises if −∞ F (y)g(y) dy = 1/2.
As an example, suppose that Yj ∼ E(1), Xi ∼ G(θ, 1). We now determine
the θ for which the above
R ∞ alternative hypothesis has power no larger than the
test size. Solve 1/2 = 0 (1 − exp(−y)) exp(−(y − θ)2 /2)(2π)−1/2 dy to obtain
θ = .876.

3.8 Efficiency and Power of Two-Sample Tests

In this section, consider models of the form (3.1), with the null hypothesis
θ = θ0 . Without loss of generality, one may take θ0 = 0; otherwise, shift Yj
by θ0 .
Relative efficiency
√ has already been defined for test statistics Ti such that
(Ti − µi (θ))/(σi (θ)/ N ) ≈ G(0, 1), for N the total sample size. Asymptotic
relative efficiency calculations require specification of how M1 and M2 move
together. Let M1 = λN , M2 = (1 − λ)N , for λ ∈ (0, 1).

3.8.1 Efficacy of the Gaussian-Theory Test

As in the one-sample case, the large sample behavior of this test will be
approximated by a version with known variance. Here µ(θ) = θ, and Var [T ] =
56 Two-Sample Testing

1 1 1 1
ρ2 + = ρ2 + ; hence
M2 M1 N (1 − λ) N λ
p
σ(θ) = ρ 1/λ + 1/(1 − λ) = ρ/ζ,
p
for ρ2 the variance of each observation, and ζ = λ(1 − λ).
For example, suppose that Yj ∼ G(0, 1), and Xi ∼ G(θ, 1). In this case,
the efficacy is e = ζ.
Alternatively, suppose that the observations are logistically √ distributed.
Each observation has variance is π 2 /3, and the efficacy is e = ζ 3/π = .551ζ.

The analysis of §2.4.1, for tests as in (2.15)and variance scaled as in (2.19),

allows for calculation of asymptotic relative efficiency, in terms of the separate
efficacies, defined as the ratio of µ0 (θ) to σ(θ).

3.8.2 Efficacy of the Mann-Whitney-Wilcoxon Test

In order to apply the results for the asymptotic relative efficiency of §2.4, the
test statistic must be scaled so that the asymptotic variance is approximately
equal to a constant divided by the sample size, and must be such that the
derivative of the mean function is available at
Pzero. Using the Mann-Whitney
M1 PM2
formulation, and rescaling so that the T = i=1 j=1 I(Xi < Yj )/(M1 M2 ),
then
N +1 1
Var [T ] = ≈ ,
12M1 M2 N (12λ(1 − λ))
and so √
σ(0) = 1/( 12ζ). (3.24)
Also,

µ(θ) = Pθ [Y > X] = P0 [Y + θ > X] = P0 [θ > X − Y ] . (3.25)

For example, suppose that Yj ∼ G(0, 1), and Xi ∼ G(θ, 1). The differences
Xi − Yj ∼ G(θ, 2), and so √
µ(θ) = Φ(θ/ 2). (3.26)
0
√
Hence µ (0) = 1/(2 π). Also, (3.24) still holds, and
1 √ p
e= √ 12ζ = 3/πζ = .977ζ.
2 π

Alternatively, suppose that these distributions have a logistic distribution.

In this case,
Z ∞Z ∞
exp(x) exp(y)
µ(θ) = 2 2
dy dx
−∞ x−θ (1 + exp(x)) (1 + exp(y))
−2
= eθ eθ − θ − 1 eθ − 1

, (3.27)
Efficiency and Power of Two-Sample Tests 57

TABLE 3.6: Asymptotic relative efficiencies for Two Sample Tests

Test Pooled T MWW ARE of MWW to T

General µ(θ) = θ µ(θ) = P [θ > X √−Y] 12ε2 µ0 (0)2
σ(θ) = εζ −1 σ(θ) = ζ −1 /(2
√ 3)
e = ζ/ε e = ζµ0 (0)(2 3)
√ −1 3
Normal, µ0 (0) = 1 µ0 (0) = (2 π)√ = .95
unit σ(0) = ζ −1 σ(0)p −1
= ζ /(2 3) π
variance e=ζ e = 3/πζ

π2
Logistic µ0 (0) = 1 0
√ µ (0) = 6
1
√ = 1.10
−1 9
σ(0) = √ πζ / 3 σ(0) =√ζ −1 /(2 3)
e = ζ 3/π e = ζ/ 3

p
ζ = (λ(1 − λ))1/2 , ε = Var [Xi ].

and √ √
µ0 (0) = 1/6, e = (1/6) 12ζ = (1/ 3)ζ = .577ζ.
Efficacies for more general rank statistics may be obtained using calcula-
tions involving expectations of derivatives of underlying densities, with respect
to the model parameter, evaluated at order statistics under the null hypoth-
esis, without providing rank expectations away from the null (Dwass, 1956).

3.8.3 Summarizing Asymptotic Relative Efficiency

Table 3.6 contains results of calculations for asymptotic relative efficiencies of
the Mann-Whitney-Wilcoxon test to the Pooled t-test. For Gaussian vari-
ables, as expected, the Pooled t-test is more efficient, but only by 5%. For
a distribution with moderate tails, the logistic, the Mann-Whitney-Wilcoxon
test is 10% more efficient.

3.8.4 Power for Mann-Whitney-Wilcoxon Testing

Power may be calculated for Mann-Whitney-Wilcoxon testing, using (2.22) in
conjunction with (3.24) for the null variance of the rescaled test, and (3.25),
adapted to the particular distribution of interest. Application to Gaussian
and Laplace observations are given by (3.26) and (3.27) respectively. Zhong
and Kolassa (2017) give second moments for this statistic under the alterna-
tive hypothesis, and allow for calculation of σ1 (θ) for non-null θ. The second
58 Two-Sample Testing

moment depends not only on the probability (3.25), but also on probabilities
involving two independent copies of X and one copy of Y , and of two inde-
pendent copies of Y and one copy of X. This additional calculation allows the
use of (2.17); calculations below involve the simpler formula.

Example 3.8.1 Consider using two independent sets of 40 observations

each to test the null hypothesis of equal distributions vs. the alternative
that (3.1) holds, with θ = 1, and with observations having a Laplace
distribution. Then, using (3.27), µ(1) = e(e − 2)(e − 1)−2 = 0.661. The
function µ(θ) has a removable singularity at zero; fortunately the null
probability is easily seenpto be 1/2. Then λ = 1/2,√N = 80, µ(0) = 1/2,
µ(1) = 0.661, σ(0) = 1/ 12 × (1/2) × (1/2) √ = 1/ 3. The power for √ the
one-sided level 0.025 test, from (2.22), is Φ( 80(0.661 − 0.5)/(1/ 3) −
1.96) = Φ(0.534) = .707.
One could also determine the total
√ sample size needed to obtain 80%
power. Using (2.24), one needs (1/ 3)2 (z0.025 + z0.2 )2 /(0.661 − 0.5)2 =
151.4; choose 76 per group.

In contrast with the shift alternative (3.1), one might consider the
Lehmann alternative

1 − F (z) = (1 − G(z))k ∀z, (3.28)

for some k 6= 1. Power calculations for Mann-Whitney-Wilcoxon tests for this

alternative have the advantage that power does not depend on the underlying
G (Lehmann, 1953).
As noted above, while efficacy calculations are available for more general
rank statistics, the non-asymptotic expectation of the test statistic under the
alternative is difficult enough that it is omitted here.

3.9 Testing Equality of Dispersion

One can adapt the above rank tests to test whether two populations have
equal dispersion, assuming a common center. If one population is more spread
out than another, then the members of one sample would tend to lie outside
the points from the other sample. This motivates the Siegel-Tukey test. Rank
the points, with the minimum getting rank 1, the maximum getting rank 2,
then the second to the maximum getting rank 3, the second to the minimum
getting rank 4, the third from the minimum getting rank 5 and continuing to
alternate. Then sum the ranks associated with one of the samples. Under the
null hypothesis, this statistic has the same distribution as the Wilcoxon rank-
sum test. Alternately, one might perform the Ansari-Bradley test, by ranking
Two-Sample Estimation and Confidence Intervals 59

from the outside in, with extremes getting equal rank, and again summing the
ranks from one sample.
The Ansari-Bradley test has a disadvantage with respect to the Siegel-
Tukey test, in that one can’t use off-the-shelf Wilcoxon tail calculations. On
the other hand, the Ansari-Bradley test is exactly invariant to reflection.

Example 3.9.1 Consider again the yarn data of Example 3.3.1. Test
equality of dispersion between the two types of yarn. Ranks are given in
Table 3.7, and are calculated in package NonparametricHeuristic as
yarn$ab<-pmin(rank(yarn$strength),rank(-yarn$strength))
yarn$st<-round(siegel.tukey.ranks(yarn$strength),2)
yarnranks<-yarn[order(yarn$strength),
c("strength","type","ab","st")]
R functions may be used to perform the test.

library(DescTools)#For SiegelTukeyTest
SiegelTukeyTest(strength~type,data=yarn)
yarnsplit<-split(yarn$strength,yarn$type)
ansari.test(yarnsplit[[1]],yarnsplit[[2]])
to find the Siegel-Tukey p-value as 0.7179, and the Ansari-Bradley p-value
as 0.6786. There is no evidence of inequality of dispersion.

3.10 Two-Sample Estimation and Confidence Intervals

Practitioners often ask what two samples can tell us about how a population
location parameter (be it mean, median, or another quantile) differs in two
populations. In the restricted case in which the populations are assumed to
be the same except for the location parameter, this question does not depend
on what location measure is intended.
Our treatment of two-sample confidence intervals will mirror that of the
one-sample interval. That is, this parameter will index a family of test statis-
tics, such that the distribution of the family member is independent of the
parameter, when the statistic is evaluated at the correct parameter value. In-
vert the test by determining for which parameters the null hypothesis is not
rejected.
Denote the samples as X1 , . . . , XM1 and Y1 , . . . , YM2 as before. Let θ repre-
sent the amount by which a location parameter for the population from which
the second sample exceeds that of the first sample. Under the assumption
60 Two-Sample Testing

TABLE 3.7: Yarn data with rankings for testing dispersion

strength type ab st strength type ab st

12.8 A 1.0 1.00 15.8 A 24.0 47.00
13.0 A 2.0 4.00 15.9 A 22.0 43.67
13.8 A 3.0 5.00 15.9 B 22.0 43.67
14.2 A 4.5 8.50 15.9 B 22.0 43.67
14.2 B 4.5 8.50 16.0 A 19.5 38.50
14.5 B 6.0 12.00 16.0 B 19.5 38.50
14.8 A 7.5 14.50 16.2 A 17.0 33.33
14.8 A 7.5 14.50 16.2 B 17.0 33.33
14.9 A 10.0 19.33 16.2 B 17.0 33.33
14.9 B 10.0 19.33 16.4 A 15.0 30.00
14.9 B 10.0 19.33 16.8 B 14.0 27.00
15.0 A 13.5 26.50 16.9 B 13.0 26.00
15.0 A 13.5 26.50 17.0 A 11.0 21.33
15.0 A 13.5 26.50 17.0 B 11.0 21.33
15.0 B 13.5 26.50 17.0 B 11.0 21.33
15.2 B 16.5 32.50 17.1 A 9.0 18.00
15.2 B 16.5 32.50 17.2 B 8.0 15.00
15.5 A 19.0 37.67 17.6 A 7.0 14.00
15.5 A 19.0 37.67 18.0 B 6.0 11.00
15.5 B 19.0 37.67 18.1 B 5.0 10.00
15.6 A 22.0 43.33 18.2 A 3.5 6.50
15.6 A 22.0 43.33 18.2 B 3.5 6.50
15.6 B 22.0 43.33 18.5 B 2.0 3.00
15.7 A 24.0 48.00 19.2 B 1.0 2.00
Two-Sample Estimation and Confidence Intervals 61

(3.1) that the distributions are identical up to shift, then

X1 , . . . , XM1 , Y1 − θ, . . . , YM2 − θ (3.29)
{2}
all have the same distribution. Then let TG (θ) be the general rank statistic
(3.7) calculated from this data set (3.29).
{2}
Most commonly the scores are chosen to make TG (θ) the Wilcoxon rank-
sum statistic, or equivalently the Mann-Whitney statistic, but conceptually
this could be done by inverting, for example, Mood’s median test or any other
{2} PN
rank test. For general scores aj , TG (θ) = j=1 aj Zj (θ), where Zj (θ) is 1 if
item ranked j among (3.29) came from Y , and 0 otherwise.
One can define an estimator as that value of θ that makes this test statistic
equal to its null expectation; that is, θ̂ solves
{2}
TG (θ̂) = M2 ā. (3.30)
Furthermore, one can determine the largest integer tl and smallest integer tu
such that
h i h i
{2} {2}
P0 TG (0) < tl ≤ α/2, P0 TG (0) ≥ tu ≤ α/2, (3.31)

in close parallel with definitions of §2.3.2. Then, reject the null hypothesis if
{2} {2}
TG (θ0 ) ≤ t◦L or TG (θ0 ) ≥ t◦U for t◦L = tl − 1 and t◦U = tu , and use as the
confidence interval
{2}
{θ|tl ≤ TG (θ0 ) < tu }. (3.32)
{2}
Applying the Gaussian approximation to TG (θ),
r h i
{2}
tl , tu ≈ M2 ā ± zα/2 Var TG (0) . (3.33)

3.10.1 Inversion of the Mann-Whitney-Wilcoxon Test

{2}
When aj are ranks j, TG (θ) is the Wilcoxon P version
P of the test. The cor-
responding Mann-Whitney version TU (θ) = i j I(Xi < Yj − θ) gives the
estimator and confidence interval end points more easily. In this case, the null
expectation of the statistic is M1 M2 /2, and TU (θ) = M2 M1 /2 if and only if
M2 × M1 even, and exactly M2 × M1 /2 of Vij = Yj − Xi are greater than θ,
or M2 × M1 odd, and (M2 × M1 − 1)/2 of Vij = Yj − Xi are greater than θ,
(M2 × M1 − 1)/2 are less than θ, and one is equal to θ. Hence the estimator
(3.30), specific to the Mann-Whitney test, is the median of differences of pairs
Yj − Xi . This estimator is given by Hodges and Lehmann (1963), in the same
paper giving the analogous estimator for the one-sample symmetric problem
of §5.2.1. The confidence interval created by inverting the Mann-Whitney
statistic
M1 X
M2
{2}
X
TU (θ) = I(Yi − Xj > θ)
i=1 j=1
62 Two-Sample Testing
{2}
is {θ|tl ≤ TU (θ) < tu }, for the largest tl and smallest tu as in (3.31), made
specific to the Mann-Whitney statistic:
l −1
tX h i α MX
1 M2 h i α
{2} {2}
PM1 ,M2 TU (0) = k ≤ , PM1 ,M2 TU (0) = k ≤ , (3.34)
2 2
k=0 k=tu

with probabilities given by (3.16). This defines the Hodges-Lehmann estimator

in terminology of Higgins (2004) and SAS Institute Inc. (2017). More standard
definitions in the one-sample case follows later. One conventionally uses the
mean of middle two paired differences for even M2 × M1 ; the theory does not
technically require this.
When ties are present, (3.16) no longer gives the exact sampling distri-
{2}
bution of TU (0); exact intervals may be determined using the permutation
distribution of the tie-adjusted ranks, as in §3.4.3, or (3.33) may be employed.
In this case of confidence intervals derived by inverting the Mann-Whitney
(or equivalently Wilcoxon rank-sum) test, using (3.32), θ satisfies TU (θ) = tl
if there are tl pairs Xi , Yj − θ such that the second component is greater
than the first, and the remainder of the pairs have the inequality reversed,
or equivalently if there are tl pairs with differences Vij = Yj − Xi such that
Vij > θ, and the remainder reversed.
Hence the confidence interval is formed by, first, finding tl and tu as in
(3.34), and, second, calculating and sorting all Y − X pairs, and, third, re-
porting differences tl and tu on this list. Let V(k) represent ordered value k
from the differences Vij . For θ larger than the largest of these differences,
V(M1 M2 ) , the test statistic TU (θ) is zero. Each time θ decreases past one of the
pairwise differences, TU (θ) increases by one. Hence the confidence interval is
(V(M1 M2 +1−tu ) , V(M1 M2 +1−tl ) ). By symmetry, this is (V(tl ) , V(tu ) ). If (3.33), or
some other approximation to the Mann-Whitney critical values, is employed,
then the above technique indicates an order statistic that is generally not
an integer. In this case, many procedures interpolate between adjacent order
statistics.
Note parallels between this construction and that of the confidence inter-
val for the median in §2.3.3. In this section, pairwise differences replace raw
observations, and the Wilcoxon quantiles replace those of the binomial distri-
bution. Furthermore, the direction of the test was changed. The rest of the
construction is identical.

Example 3.10.1 Again consider the arsenic data of Example 2.3.2. We

estimate the median of the distribution of men’s nail arsenic levels minus
women’s arsenic levels, and begin by calculating the 13 × 8 = 104 pair-
wise differences. One woman had an unusually large nail arsenic level;
hence there is a cluster of particularly negative differences. Figure 3.3 dis-
plays construction of the confidence interval. Confidence intervals given
correspond to those order statistics determined from the quantiles of the
Two-Sample Estimation and Confidence Intervals 63

Mann-Whitney distribution, as in (3.34). These quantiles are tl = 25

and tu = 80. The confidence interval is then entries 25 and 80 among
the list of ordered pairwise differences, or (−0.278, 0.158). Figure 3.3 was
constructed using
library(NonparametricHeuristic)
invertsigntest(split(arsenic$nails,arsenic$sex))

The above analysis did not reflect the fact of a small number of ties among
these pairwise differences. The code
wilcox.test(nails~sex,data=arsenic,conf.int=TRUE)
gives intervals found by using approximation (3.33) to the test critical val-
ues, and interpolating between the appropriate order statistics, to obtain
an identical result to the same accuracy.

FIGURE 3.3: Construction of the Confidence Interval for Median of Differ-

ences in Nail Arsenic by Sex
... ...
... ...
..................... .. ..
......... .
. ...
..
100 .............
...................
....................................................................................................................
.
...
...
..
.... .
. ...
.... .
.
. ...
..... .
.
...... . ..
.......... .
. ...
.
... ... ...
................. .. ..
.......... .. ..
. . . . . . . . . . . . . . . . . . . . . . . . .............. . . . ..... . . . . . . . . . . . tu
80 ........ ..
... .... ...
... ... ...
.. ...... ..
. . ...
. . . . . . . Critical Values .
. .......
... .. ...
.. ...
. ..
. . ...
60
............................... CI Endpoints .
.
...
..
... ...
.. .. ..
Mann ...
...
..
..
..
...
...
Whitney ..
...
...
.... ...
.. ...
... .. ..
Statistic ..
...
.. ..
.. .
... ..
... .. ...
40 ..
...
.... ..
... .
.. ...
... ..... ..
... ... ..
.. ....
.....
... ....
. .
. . . . . . . . . . . . . . . . . . . . . . . . ... . . . . ..... . . . . . . . . . . . .
tl
. .
... ...
20 ...
..
......
.......
... ... .......
... ... ........
.. .. ..
... ... .......
... ... ...
.. .. ....
... ... ...
.....
... ...
0 ..
...
..
...
..........................................................

-2 -1 0 1
Location Difference
Compare Figure 3.3 to Figure 2.3. The test statistic in Figure 2.3 is con-
structed to be non-decreasing in θ; this simplified the discussion of the interval
construction, at the cost of employing a slightly non-standard version of the
sign test statistic. Figure 3.3 uses a more standard version of the test statistic.
64 Two-Sample Testing

3.11 Tests for Broad Alternatives

The above development motivates different tests for equality of distribution,
depending on the kind of departure of greatest interest. For example, a Mann-
Whitney test is appropriate for shift alternatives, and an Ansari-Bradley test
is appropriate for differences in spread. This section describes a test sensitive
to departures in various directions.
Use the empirical cumulative distribution function estimator as in §2.5.
The Kolmogorov-Smirnov test uses the largest difference between these as
the test statistic. The null hypothesis under the permutation distribution
is formed from all permutations of data between the two samples. The
p-value is the portion with as large or larger difference. If the number of
such permutations is quite large, one might use a random sample instead.
Asymptotic approximations to these distributions exist as well. Calculation
of these statistics can be simplified by noting that the maximum may be
calculated from differences in the empirical cumulative distribution functions
evaluated exclusively at jumps in one or the other curve.
Alternatively, one might use the integral of difference between these,
squared, as the test statistic; this statistic is called the Cramér-von Mises test.
That is, if F̂ and Ĝ are the empirical
R∞ distribution functions for the two sam-
ples, then the test statistic is −∞ |F̂ − Ĝ|2 (M1 dF̂ + M2 dĜ)/N ; this integral
is in the sense of Stieltjes (1894), and is calculated as the average over all val-
ues in the combined sample of the difference between empirical distributions,
squared. Conceptually, to implement this test, use all permutations of data
between the two samples, and count the proportion with as large or larger
integrated difference as the p-value. When the number of permutations is
excessive, one might also do this with a random sample of permutations. Al-
ternatively, one might use results from Stochastic Processes to approximate
tail areas.

Example 3.11.1 Again consider the the yarn data of Example 3.3.1.
Figure 3.4 shows the cumulative distribution functions of strengths for
the two types of yarn. This figure might be generated using
par(mfrow=c(1,1))
plot(range(yarn$strength),c(0,1),type="n",
main="Yarn Strength",xlab="Yarn Strength",
ylab="Probability")
yarnsplit<-split(yarn$strength,yarn$type)
lines(ecdf(yarnsplit[[1]]),col=1)
lines(ecdf(yarnsplit[[2]]),col=2)
legend(17,.2,lty=c(1,1),col=c(1,2),legend=c("A","B"))
Tests for Broad Alternatives 65

The largest difference between distribution functions happens at strengths

slightly larger than 16; this gives the Kolmogorov-Smirnov statistic. The
Cramér-von Mises statistic is the integral of the squared distance between
these curves. Statistics may be calculated using
ks.test(yarnsplit[[1]],yarnsplit[[2]])
library(CvM2SL2Test)
cvmstat<-cvmts.test(yarnsplit[[1]],yarnsplit[[2]])
cvmts.pval(cvmstat,length(yarnsplit[[1]]),
length(yarnsplit[[2]]))
Note that the library CvM2SL2Test, giving cvmts.test and cvmts.pval,
is no longer supported, and must be installed from archives, using
library(devtools);install_version("CvM2SL2Test"). These calcu-
lations can also be performed using package MultNonParam. Significance
tests are based on permutation distributions; the Kolmogorov-Smirnov p-
value is 0.2591, and the Cramér-von Mises p-value is 0.09474.

FIGURE 3.4: Empirical CDF for Yarn

1 5 .............................
.................................5
..
...................
5 ... 5 .................. ...
................
5 .. .......
5 ..
.............. Type A ..
5 .. 5 ..
.
...................
5 ... ..
5 .
............ Type B
0.8 5 .....
... .....................
5 ...
.....
5 ... ....
5 ..

5 ..
..

5 ..
.. 5 ..
.
..
5 .. ..
5 .
0.6 ..
5 .. ................
5 ..

Probab- ..
5 .. 5 .....
.
ility ..
5 .
................
5 ..
0.4 .......
5 ..
..
5 .

5 ..
.. 5 .......
..
..
5 ..

5 .....
.
0.2 ................... ..
...5 .
5
............
5 ...
..........................
5 ... ..........
5 ..
.....
5 ... .......
5 ..

0 .................................
... ...........................
..

12 14 16 18 20
Yarn
66 Two-Sample Testing

3.12 Exercises
1. The data set

HTTP://ftp.uni-bayreuth.de/math/statlib/datasets/schizo

reflects an an experiment using measurements used to detect

schizophrenia, in both schizophrenic and non-schizophrenic pa-
tients. Results of various eye-tracking tests are given. The second
column gives eye tracking target type. For all parts of this question
select subjects with target type CS. Gain ratios are given in vari-
ous columns; for the balance of this question, use the first of these,
found in the third column.
a. Select only non-schizophrenic patients. The comments at the top
of the data file tell you which subjects are female. Compare the first
gain ratios for women with those for men. Test whether the gain
ratios come from the same distribution, using the Wilcoxon rank
sum test.
b. Test whether the first gain ratios for women and men come from
the same distribution. Use data values as scores.
For the remainder of this question, select only female patients.
c. Estimate the poplulation median of the difference between the
gain ratios for schizophrenic patients (in the lower part of the file)
and for the non-schizophrenic patients (in the upper part of the
file). Calculate a 95% confidence interval for this median.
d. Test the null hypothesis that the dispersion of the first gain
ratios is the same for schizophrenic patients (in the lower part of
the file) as it is for the non-schizophrenic patients (in the upper part
of the file). Use the Ansari-Bradley test.
e. Test the null hypothesis that the distribution of the first gain
ratios is the same for schizophrenic patients (in the lower part of the
file) as it is for the non-schizophrenic patients (in the upper part of
the file). Use the Kolmogorov-Smirnov test.
2. The data set

HTTP://lib.stat.cmu.edu/datasets/biomed.desc

gives values of certain biological markers of a certain disease, on in-

dividuals known to be carriers, and control individuals. The carrier
subjects are in lines 4 through 78, and the control individuals are in
lines 82 through 215. Ignore other lines. Focus attention on column
5 of these data, the first measurement.
Exercises 67

a. Graphically assess the quality of a normal model for these mea-

surements.
b. Graphically investigate differences in this variable by car-
rier/control status.
c. Test the null hypothesis of equality of distribution using the
Mann-Whitney-Wilcoxon test.
d. Repeat part (c) using the Normal Scores rank test.
3. Calculate the asymptotic relative efficiency for the Mann-Whitney-
Wilcoxon test to the two-sample T -test. Do this for observations
from the
a. Laplace distribution of §1.1.1.3.
b. Cauchy distribution of §1.1.1.4, again abusing notation to extend
to the Cauchy’s property that the central limit theorem does not ap-
ply to sums of independent and identically-distributed observations
from this distribution.
4. Determine a pair of distributions F and G satisfying both the
Lehmann alternative (3.28) and the shift alternative (3.1).
5. Proofs that the two-sample t-statistic for Gaussian data follows the
expected distribution uses the fact that the mean differences are
independent of the pooled standard deviation estimate, and that
the pooled variance estimate times degrees of freedom has a χ2
distribution with the expected number of degrees of freedom. Using
simulation, for tests arising from samples of size M1 = M2 = 10,
evaluate both of these assumptions for data coming from a Laplace
distribution, and from a Cauchy distribution.
6. Consider using a one-sided Mann-Whitney-Wilcoxon test of level
0.025 using two samples of size 25, both from a Cauchy distribution.
a. Approximate the power for this test to successfully detect an
offset of 1 unit.
b. Check the quality of this approximation via simulation.
c. Compare the result from part (a) to the power of the same test,
performed two-sided with a level 0.05, to detect the same alterna-
tive.
4
Methods for Three or More Groups

This chapter develops nonparametric techniques for one way analysis of vari-
ance.
Suppose Xki are samples from K potentially different populations. That is,
for fixed k, Xk1 , . . . , XkMk are independent and identically distributed, each
with cumulative distribution function Fk . Here k ∈ {1, . . . , K} indexes group,
and Mk represents the number of observations in each group. In order to
determine whether all populations are the same, test a null hypothesis

H0 : F1 (x) = · · · = FK (x)∀x, (4.1)

vs. the alternative hypothesis HA : there exists j, k, and x such that Fj (x) 6=
Fk (x). Most tests considered in this chapter, however, are most powerful
against alternatives of the form HA : Fk (x) ≤ Fj (x)∀x for some indices k, j,
with strict inequality at some k, j, and x. Of particular interest, particularly
for power calculations, are alternatives of the form

Fi (x − θi ) = Fj (x − θj ) (4.2)

for some constants θ1 , . . . , θK .

4.1 Gaussian-Theory Methods

Under the assumptions that the data Xki are Gaussian and homoscedastic
(that is, having equal variances) under both the null and alternative hypothe-
ses, the null hypothesis H0 is equivalent to µj = µk for all pairs j, k for
µj = E [Xji ]. One might test H0 vs. HA via analysis of variance (ANOVA).
PMk PK PMk PK
Let X̄k. = i=1 Xki /Mk , X̄.. = k=1 i=1 Xki / k=1 Mk , and
PK
( k=1 Mk (X̄k. − X̄.. )2 )/(K − 1)
WA = , (4.3)
σ̃ 2
for
Mk
K X
! K
!
X X
2
σ̃ = (Xki − X̄k. )2 / Mk − K . (4.4)
k=1 i=1 k=1

69
70 Methods for Three or More Groups

When the data have a Gaussian distribution, and (4.1) holds, the numerator
and denominator of (4.3) have χ2 distributions, and are independent; hence
the ratio W has an F distribution.
When the data are not Gaussian, the central limit theorem implies that the
numerator is still approximately χ2K−1 , as long as the minimal Mk is large, and
as long as the distribution of the data is not too far from Gaussian. However,
neither the χ2 distribution for the denominator of (4.3), nor the independence
of numerator and denominator, are guaranteed in this case. Fortunately, again
for large sample sizes and data not too far from Gaussian, the strong law of
large numbers indicates that the denominator of (4.3) is close to the population
variance of the observations, and the denominator degree of freedom for the
F distribution is large enough to make the F distribution close the the χ2
distribution. Hence in this large-sample close-to-Gaussian case, the standard
analysis of variance results will not mislead.

4.1.1 Contrasts
Let µk = E [Xki ]. Continue considering the null hypothesis that µj = µk for
all j, k pairs, and consider alternative hypotheses in which Xki all have the
same finite variance σ 2 , but the means differ, in a more structured way than
for standard ANOVA. Consider alternatives such that µk+1 − µk are the same
for all k, and denote the common value by ∆ > 0. One might construct a
test particularly sensitive to this departure from the null hypothesis using the
estimate ∆.ˆ If ∆ ˆ is approximately Gaussian, then the associated test of the
null hypothesis (4.1) vs.rthe ordered and equidistant alternative is constructed
h i h i
as T = (∆ ˆ − E0 ∆ ˆ )/ Var0 ∆ ˆ ; this statistic is compared to the standard
Gaussian distribution in the usual way.
An intuitive estimator ∆ ˆ is the least squares estimators; for example, when
ˆ
K = 3 then ∆ = (X̄3. − X̄1. )/2, and when K = 4 then ∆ ˆ = (3X̄4. + X̄3. −
X̄2. − 3X̄1. )/10. Generally, the least squares estimator is a linear combination
PK
of group means, of form k=1 ck X̄k. for a set of constants ck such that
K
X
ck = 0, (4.5)
k=1
h i
ˆ = 0 and
with ck evenly spaced. In this case, E0 ∆

h i K
X
ˆ
Var0 ∆ = Var0 [Xk ] c2k /Mk ,
k=1

and one may use the test statistic

q
PK PK 2
T = c X̄
k=1 k k. / σ c
k=1 k /M k .
Gaussian-Theory Methods 71

If σ is known, then the null distribution of T is the standard Gaussian distri-

bution. If σ is unknown, an estimator as in (4.4) is substituted, and then the
null distribution of T is the TN −K distribution.
In this case, W of (4.3) retains its level, but has less power against this
ordered alternative. PK
A linear combination k=1 ck X̄k. of group means, with constants summing
to zero as in (4.5), is called a contrast.
Standard parametric methods will be compared to nonparametric methods
below. In order to make these comparisons using methods of efficiency, the pat-
tern of numbers of observations in various groups must be expressed in terms
of a single sample size N ; presume that Mk = λk N for all k ∈ {1, . . . , K}.
Under alternative (4.2), and with ε2 = Var [Xij ] known, then
K K K
!
X X X
ck X̄k ∼ G ck θk , ε2 ( c2k /λk )/N .
k=1 k=1 k=1

In the case when the shift parameters and the contrast coefficients are equally
spaced, and for groups of equal size (that is, θk = (k − 1)∆ for some ∆ > 0,
ck = 2k − (K + 1), and λ = 1/K),
K
X
ck X̄k ∼ G(K 2 (K − 1)∆/6, (ε2 /3)K 2 (K 2 − 1)/N ).
k=1
p
Then µ0 (∆) = (K 2 − 1)K/6, and σ(0) = εK (K 2 − 1)/3, and the efficacy,
as defined in §2.4.1.2, is
√
K(K 2 − 1)/6 K2 − 1
e= p = √ . (4.6)
εK (K 2 − 1)/3 2 3ε

4.1.2 Multiple Comparisons

In the event that the null hypothesis of equal distributions is rejected, one nat-
urally asks which distributions differ. Testing for group differences pairwise
(perhaps using the two-sample t-test) allows for K(K − 1)/2 chances to find
a significant result. If each test is done at nominal level, this will inflate the
family-wise error rate, or the proportion of experiments that provide any in-
correct result. This family-wise error rate will be bounded by the nominal level
used for each separate test multiplied by number of possible comparisons per-
formed (1/2K(K −1)), but such a procedure, called the Bonferroni procedure,
will usually result in a very conservative bound.
Alternatively, consider Fisher’s Least Significant Difference (LSD) meth-
od. First, perform the standard analysis of variance test (4.3). If this test
rejects H0 , then test on each pairwise comparison between mean ranks, and
report those pairs whose otherwise uncorrected p-values are smaller than the
72 Methods for Three or More Groups

nominal size. Tests in this second stage may be performed using the two-
sample pooled t-test (3.2), except that the standard deviation estimate of
(4.4) may be substituted for sp , with the corresponding increase in degrees of
freedom in (3.5).
Fisher’s LSD method fails to control family-wise error rate if K > 3. To
see this, suppose F1 (x) = F2 (x) = · · · = FK−1 (x) = FK (x − ∆) for ∆ 6= 0.
Then null hypotheses Fj (x) = Fi (x) are true, for i, j < K. One can make ∆
so large that the analysis of variance test rejects equality of all distributions
with probability close to 1. Then multiple true null hypotheses are tested,
without control for multiplicity. If K = 3, there is only one such test with a
true hull hypothesis, and so no problem with multiple comparisons.
Contrast this with Tukey’s Honest Significant Difference (HSD) method
(Tukey, 1953, 1993). Suppose that Yj ∼ G(0, 1/Mj ) for j ∈ {1, . . . , K}, U ∼
χ2m , and that the Yj and U are independent. Assume p further that Mj are all
p
equal. The distribution of max1≤i,j≤K (|Xj − Xi |/( U/n/ Mj ) is called the
Studentized range distribution with K and m degrees of freedom. If Mj are
not all equal,
√ p q
2 max ((Xj − Xi )/( U/m 1/Mj + 1/Mk )
1≤i,j≤K

has the Studentized range distribution with K and m degrees of freedom,

approximately (Kramer, 1956); extensions also exist to correlated means
(Kramer, 1957). Let qK,m,α be the 1 − α quantile of this distribution, and
let ΞK,m be its cumulative distribution function.
One then applies this distribution with Yj = (X̄j − µj )/σ and U/m the
standard sample variance S 2 . Here σ is the common standard deviation. If
one then sets
√ q
Pjk = Ξ̄K,N −K ( 2|X̄j − X̄k |/(S (1/Mj + 1/Mk ))), (4.7)
PK
for N = k=1 Mk , then for any α ∈ (0, 1),

P [Pjk ≤ α for any j 6= k such that µj = µk ] ≤ α, (4.8)

and the collection of tests that rejects the hypothesis µi = µj if Pjk ≤ α

provides simultaneous test level less than or equal to α. Furthermore, if
q √
Cjk = X̄k − X̄j ± qK,m,α S 1/Mj + 1/Mk / 2, (4.9)

then
P [µk − µj ∈
/ Cjk for some j, k] ≤ α. (4.10)
This method had been suggested before the Studentized range distribution
had been derived (Tukey, 1949).
We now proceed to analogs of one-way analysis of variance that preserve
nominal test size for small samples and highly non-Gaussian data.
General Rank Tests 73

4.2 General Rank Tests

By analogy with test (4.3), and with the scoring ideas of §3.2, create a test
by first ranking all of the observations in a data set, to obtain rank Rki for
replicate i in group k. One then creates non-decreasing scores a1 , . . . , aN ,
assigns scores Aki = aRki to the ranked observations, and calculates the
PMk
score sums Aki . One might express the score sums as in (3.7), as
{k} PN i=1 {k} {k}
TG = j=1 aj Ij , for Ij equal to the 1 if the item ranked j in the
combined sample comes from the group k, and 0 otherwise. Analogously to
the numerator in (4.3), let
K h i2
{k} {k}
X
WG = uk TG − E0 TG , (4.11)
k=1
h i
{k}
for null expectations E0 TG as calculated in (3.10), and quantities uk to be
{k}
determined later. The next subsection will calculate covariances of the TG ,
and the following subsection will demonstrate that WG has an approximate
χ2K−1 null distribution, if

N −1
uk = .
(N 2 (â
− ā2 )Mk
{k}
The remainder of this section considers the joint distribution of the TG ,
calculates their moments, and confirms the asymptotic distribution for WG .

4.2.1 Moments of General Rank Sums

{k}
First and univariate second moments of TG are as in §3.2.2, and are given
{k} {j}
by (3.10) and (3.11), respectively. The covariance between TG and TG , for
{j,k}
j 6= k, can be calculated by forming a general rank statistic TG combining
both groups j and k, to obtain the sum of ranks for individuals in either
h groupi
{j,k} {j} {k} {j,k}
j or group k. Note that TG = TG + TG , and, furthermore, Var TG
may be found by applying (3.11), with the number of observations whose ranks
are summed being Mk + Mj . Then

(N − Mk − Mj )(Mk + Mj ) h
{j,k}
i
(â − ā2 ) = Var TG
(N − 1)
h i h i h i
{j} {k} {j} {k}
= Var TG + Var TG + 2Cov TG , TG
(N − Mj )Mj (N − Mk )Mk h
{j} {k}
i
= (â − ā2 ) + (â − ā2 ) + 2Cov TG , TG
(N − 1) (N − 1)
74 Methods for Three or More Groups

and h i −M M
{j} {k} j k
Cov TG , TG = (â − ā2 ).
(N − 1)

4.2.2 Construction of a Chi-Square-Distributed Statistic

This subsection shows that the distribution of WG is well-approximated by a
χ2K−1 distribution.
The use of the Gaussian approximation for the distribution of two-sample
general rank statistics was justified at the end of §3.4.1. This argument ad-
dressed the distributions of rank sums associated with each group separately;
the argument below requires that the joint distribution of the sums of ranks
over the various groups be multivariate Gaussian. Hájek (1960), while prov-
ing more general distributional finite population sampling results, notes that
the results similar to those of Erdös and Réyni (1959) can be applied to all
linear combinations of rank sums from separate groups, and hence the col-
lection of rank sums is approximately multivariate Gaussian. This result
requires some condition forcing all group proportions to stay away from zero;
lim inf N →∞ Mk /N > 0 should suffice.
Statistics formed by squaring observed deviations of data summaries from
their expected values, and summing, arise in various contexts in statistics.
Often these statistics have distributions that are well-approximated as χ2 .
For example, in standard parametric one-way analysis of variance, for Gaus-
sian data, sums of squared centered group means have a χ2 distribution, after
rescaling by the population variance. Tests involving the multinomial distri-
bution also often have the χ2 distribution as an approximate referent. In both
these cases, as with the rank test developed below, the χ2 distribution has as
its degrees of freedom something less than the number of quantities added.
In the analysis of variance case, the χ2 approximation to the test statistic
may be demonstrated by considering the full joint distribution of all group
means, and integrating out the distribution of the grand mean. In the multi-
nomial case, the χ2 approximation to the test statistic may be demonstrated
by treating the underlying cell counts as Poisson, and conditioning on the
table total. In the present case, one might embed the fixed rank sum for all
groups taken together into a larger multivariate distribution, and either con-
ditioning or marginalizing, but this approach is unnatural. Instead, below a
smaller set of summaries, formed by dropping rank sums for one of the groups,
is considered. This yields a distribution with a full-rank variance matrix, and
calculations follow. PK
The total number of observations is N = k=1 Mk . Let Y be the K − 1
by 1 matrix
 h i h i >
(1) (1) (K−1) (K−1)
TG − E TG TG − E TG
 √ ,..., p  ω
M1 MK−1
General Rank Tests 75
p
(note excluding the final group rank sum), for ω = (N − 2
p1)/[N (â − ā )].
The covariances between components j and k of Y are − Mj Mk /N , and
p variance ofpcomponent j is 1 − Mj /N . Let ν be the K − 1 by 1 matrix
the
( M1 /N , . . . , MK−1 /N )> . Then

Var [Y ] = I − νν > . (4.12)

This proof will proceed by analytically inverting Var [Y ]. Note that

Var [Y ] (I + (N/MK )νν > ) = I + (−1 + (N/MK )(1 − ν > ν))νν > = I, (4.13)
PK−1
since ν > ν = j=1 Mj /N = 1 − MK /N . Then
−1
Var [Y ] = I + (N/MK )νν > . (4.14)

Hence
Y > (I + (N/MK )νν > )Y ∼ χ2K−1 . (4.15)
Also,
K−1
N X {j}
Y > (I + νν > )Y = ω2 (TG − Mj ā)2 /Mj
MK j=1
 2
K−1
{j}
X
2
+ ω  (TG − Mj ā) /MK
j=1
 
K−1 {j} 2 {K} 2
X (TG − Mj ā) (TG − MK ā)) 
= ω2  +
j=1
Mj MK
K {j}
N − 1 X (TG − Mj ā)2
= . (4.16)
(â − ā2 )N j=1 Mj

The above calculation required some notions from linear algebra. The cal-
culation (4.13) requires an understanding of the definition of matrix multipli-
cation, and the associative and distributive properties of matrices, and (4.14)
requires an understanding of the definition of a matrix inverse. Observation
(4.15) is deeper; it requires knowing that a symmetric non-negative definite
matrix may be decomposed as V = D > D, for a square matrix D, and an
understanding that variances matrices in the multivariate case transform as
do scalar variances in the one-dimensional case.
One might compare this procedure to either the standard analysis of vari-
ance procedure, which is heavily reliant on distribution of responses. Alterna-
tively, one might perform the ANOVA analysis on ranks; this procedure does
not depend on distribution of responses.
76 Methods for Three or More Groups

4.3 The Kruskal-Wallis Test

A simple case of the general multivariate rank statistic (4.16) may be con-
structed by choosing the scores for the rank statistics to be the identity, with
the ranks themselves as the scores.
Kruskal and Wallis (1952) introduced the test that rejects the null hypoth-
esis of equal distributions when the test statistic (4.16) exceeds the appropriate
quantile from the null χ2K−1 distribution. They apply this with scores equal
to ranks. Using (3.21), â − ā2 = (N 2 − 1)/12, and the statistic simplifies to
K
X
WH = (12/[(N + 1)N ]) (Rk. − Mk (N + 1)/2)2 /Mk . (4.17)
k=1

This test is called the Kruskal-Wallis test, and is often referred to as the
H test.PHere, again, Rki is the rank of Xki within the combined sample, and
Mk
Rk. = i=1 Rki , and (3.21) gives the first multiplicative factor.

4.3.1 Kruskal-Wallis Approximate Critical Values

Critical values for the Kruskal-Wallis test are often a χ2K−1 quantile. Let
Gk (w; ξ) represent the cumulative distribution function for the χ2 distribution
with k degrees of freedom and non-centrality parameter ξ, evaluated at w.
Let G−1k (π, ξ) represent the quantile function for this distribution. Then the
critical value for the level α test given by statistic (4.17) is

G−1
K−1 (1 − α; 0), (4.18)

and the p-value is given by GK−1 (WH ; 0).

Example 4.3.1 The data at

http://lib.stat.cmu.edu/datasets/Andrews/T58.1
represent corn (maize) yields resulting from various fertilizer treatments
(Andrews and Herzberg, 1985, Example 58). Test the null hypothesis
that corn weights associated with various fertilizer combinations have the
same distribution, vs. the alternative hypothesis that a measure of location
varies among these groups. Treatment is a three-digit string representing
three fertilizer components. Fields in this file are separated by space. The
first three fields are example, table, and observation number. The fourth
and following fields are location, block, plot, treatment, ears of corn, and
weight of corn. Location TEAN has no ties; restrict attention to that
location. The yield for one of the original observations 36 was missing
(denoted by -9999 in the file), and is omitted in this analysis. We cal-
The Kruskal-Wallis Test 77

culate the Kruskal-Wallis test, with 12 groups, and hence 11 degrees of

freedom, with 35 observations. Rank sums by treatment are in Table 4.1.
Subtracting expected rank sums from observed rank sums, squaring, and
dividing by the number of observations in the group gives 971.875. Hence
WH = 971.875 × 12/(35 × 36) = 9.256. Comparing this to a χ211 distribu-
tion gives the p-value 0.598. Do not reject the null hypothesis of equality
of distribution. This might have been done in R using

maize<-as.data.frame(scan("T58.1",what=list(exno=0,tabno=0,
lineno=0,loc="",block="",plot=0,trt="",ears=0, wght=0)))
maize$wght[maize$wght==-9999]<-NA
maize$nitrogen<-as.numeric(substring(maize$trt,1,1))
#Location TEAN has no tied values. R treats ranks of
#missing values nonintuitively. Remove missing values.
tean<-maize[(maize$loc=="TEAN")&(!is.na(maize$wght)),]
cat(’\n Kruskal Wallis H Test for Maize Data \n’)
kruskal.test(split(tean$wght,tean$trt))
#Alternative R syntax:
#kruskal.test(tean$wght,tean$trt)

This might be compared with analysis of variance:

#Note that treatment is already a factor.
anova(lm(wght~trt,data=tean))
and with analysis of variance of the ranks:

anova(lm(rank(wght,na.last=NA)~trt,data=tean))
These last two tests have p-values 0.705 and 0.655 respectively. Note the
difference between these Gaussian theory results and the Kruskal-Wallis
test.

Figure 4.1 shows the support of the normal scores statistic on the set of
possible group rank sums for a hypothetical very small data set; the contour
of the approximate critical region for the test of level 0.05 is superimposed.
As is the case for the chi-square test for contingency tables, points enter the
critical region as the level increases in an irregular way, and so constructing an
additive continuity correction to (4.17) is difficult. Yarnold (1972) constructs a
continuity correction that is additive on the probability, rather on the statistic,
scale. Furthermore, even though group sizes are very small, the sample space
for the group-wise rank sums is quite rich. This richness of the sample space,
as manifest by the small ratio of the point separation (in this case, 1) to
the marginal standard deviations (the square roots of the variance in (3.22)),
implies that continuity correction will have only very limited utility (Chen
and Kolassa, 2018).
78 Methods for Three or More Groups

TABLE 4.1: Rank sums for the maize production example

Treatment Replicates Rank Sum Expected Rank Sum

000 2 24 36
002 2 48 36
020 1 7 18
022 2 15 36
111 8 167 144
113 4 70 72
131 4 63 72
200 2 38 36
202 2 28 36
220 2 62 36
222 2 42 36
311 4 66 72

4.4 Other Scores for Multi-Sample Rank Based Tests

One might generalize the Kruskal-Wallis test in many of the same ways as one
generalized the Mann-Whitney-Wilcoxon test. One might use scoring ideas as
before. In (4.17) replace Rki with the scores aRki . Options include van der
Waerden scores, Savage scores, and others as described earlier. This provides
an adjustment for ties, by letting the scores for the untied entries be the
original ranks, and the scores for the tied entries be the average ranks.
Figure 4.2 shows the support of the Kruskal-Wallis statistic on the set of
possible normal scores sums for a hypothetical very small data set.
Compare this figure to Figure 4.1, in which sample points for group-wise
score sums are far fewer, because more rearrangements of group identifiers
lead to the same scores sums. Hence the normal scores distribution shows less
discreteness.

Example 4.4.1 Revisiting the TEAN subset of the maize data of Ex-
ample 4.3.1, one might perform the van der Waerden and Savage score
tests,
library(exactRankTests)#Gives savage, normal scores
library(NonparametricHeuristic)#Gives genmultscore
cat("Other scoring schemes: Normal Scores\n")
genmultscore(tean$wght,tean$trt,
cscores(tean$wght,type="Normal"))
cat("Other scoring schemes: Savage Scores\n")
genmultscore(tean$wght,tean$trt,
cscores(tean$wght,type="Savage"))
Other Scores for Multi-Sample Rank Based Tests 79

FIGURE 4.1: Asymptotic Critical Region for Kruskal Wallis Test, level 0.05

5 10 15 20 25
Group 1 rank sum
Group sizes 3, 3, 4

The p-values for normal and Savage scores are 0.9544 and 0.9786 respec-
tively.

One might also apply a permutation test in this context. In this case, the
original data are used in place of ranks. The reference distribution arises from
the random redistribution of group labels among the observed responses. A
Gaussian approximation to this sampling distribution leads to analysis similar
to that of an analysis of variance.

Example 4.4.2 Revisiting the TEAN subset of the maize data of Exam-
ple 4.3.1, the following syntax performs the exact permutation test, again
using package MultNonPram.
#date()
#aov.P(tean$wght[!is.na(tean$wght)],
# tean$trt[!is.na(tean$wght)])
#date()
80 Methods for Three or More Groups

FIGURE 4.2: Distribution of Normal Scores

3 . . .. .. .... ... ... .... ... .. ... .. . .

. . .. .. ... .. ... .. .. .. .. .. .. . .
.. .. .... ... ...... .... ..... ...... ..... ..... ..... ... .... ... .. ..
.. .. ... .. ...... ..... .... ....... ..... ..... ..... ..... ... .... .. . ..
2 .. .. .... ... ....... ..... ..... ....... ....... ..... ........ ....... ...... ...... ..... ... ... ...
.. .. ... ... ..... ..... .... ....... ....... ....... ........ ....... ...... ...... .... .... ... .. ..
.. .. ... ... ....... ..... ..... ....... ....... ..... ....... ....... ...... ....... .... ..... ...... ... .. ...
.. .. .... .... ...... ...... .... ......... ...... ....... ....... ...... ....... ....... ..... ..... ..... ... ... ... ..
. . .. . ... .. .. ... .... .. .. .. . . .
1 .. .. ....
. .... ......... .....
. ....... .......... ........ ......... .......... ........ ............ .......... ........ ...... ....... ....
.
...
.
.... . ..
.. .. .... ..... ....... ......
. ...... ......... ...... ...... ...... ....... ....... ........ ...... ....... ....... ... .... .... .. .. .
.. .. .. ... ....... ...... ...... ........... ......... ........ .......... ........ ........... ........ ....... ......... ........ .....
.
..... ...... ... .. . .
. . . .. ... . .. ... . . . . .
Group 2 . .. .. ... ...... ....
. ..... ....... ........ ........ ........ ......... ....... ........ ...... ........ ........ ...
.. ...... ....
.
.. ... . .
score 0 . . .... ... ....... ..... .
.....
.. .. . .. . .. ... .. . .
........ .......... ....... ........... ......... .......... ......... ......... .......... ......... ..... ... ....... .... .... .. .
. . . .. . .. .. .. .
sum . .. ... ...... ..... ..... ...... ..... ...... ...... ....... ...... ....... ..... ..... .....
.. ... .. .. .... ... ... ... ... . . ....... ...... ....... ... ... .. ..
. .. .. ..... .... ...
. ......... ......... ........ .......... ......... .......... ......... ........ ........ ............ ...... ....... .
...... ..... .... .. ..
.. . .... .... ....
.
.... ... ...... ....... ........ ....... ...... ...... ...... .......
... ... .. .. .. .. .... ... .. ...
.
.....
.
..... ......... .
...
.
.... .. ..
-1 .. ... ... .... ...... ...... ....... ......... ........ ......... ......... ....... ......... .......... .... .....
... ........ .... .... .. ..
... .. ... ...... ..... .... ....... ...... ....... ....... ..... ....... ....... ..... ..... ....... ... ... .. ..
.. .. ... .... .... ...... ...... ....... ........ ....... ....... ........ .... ..... ...... ... .... .. ..
... ... ... ..... ...... ...... ....... ....... ..... ....... ....... ...... ..... ....... .... .... .. ..
-2 .. . .. .... ... ..... ..... ..... ..... ....... .... ..... ...... .. ... .. ..
.. .. ... .... ... ..... ..... ..... ...... ..... .... ...... ... .... .. ..
. . .. .. .. .. .. .. ... .. ... .. .. . .
. . .. ... .. ... .... ... ... .... .. .. . .
-3
-3 -2 -1 0 1 2 3
Group 1 score sum
Group sizes 3 3 4

The date commands bracketing the call to aov.P allow calculation of

elapsed time. However, these calculations are quite slow, and hence are
commented out. The commands below approximate the p-value via simu-
lation.
obsp<-anova(lm(wght~trt,data=tean))[[4]][1]
out<-rep(NA,10000)
#Monte Carlo approximation to the permutation distribution
for(i in seq(length(out))){
out[i]<-anova(lm(sample(wght)~trt,data=tean))[[4]][1]
}
mean(out>=obsp)

giving an approximation
p to the p-value of 0.7254, or, more carefully,
0.7254 ± 1.96 0.7254 × 0.2746/10000 = (0.717, 0.734).

4.5 Multiple Comparisons

The Gaussian-theory multiple comparison techniques of §4.1.2 may be adapted
to rank-based testing. The LSD method is adapted by substituting the
Multiple Comparisons 81

Kruskal-Wallis test (4.17) of §4.3 for the analysis of variance test (4.3) in the
first stage of the procedure, and substituting the Mann-Whitney-Wilcoxon
test (3.14) for the two-sample t-test, with the same lack of Type I error con-
trol.
In the case of rank testing, when sample sizes are equal, the Studentized
range method may be applied to rank means for simultaneous population dif-
ferentiation (Dunn, 1964); Conover and Iman (1979) credits this to Nemenyi
(1963). This technique may be used to give corrected p-values and corrected
simultaneous confidence intervals for rank means. Since rank mean expecta-
tions are generally not of interest, the application of the Studentized range
distribution to rank means is typically of direct interest only for testing. Use
{k}
TG /Mk in place of X̄k . Note that for j 6= k,
h i h i h i
" # {k} {j} {j} {k}
T
{k}
T
{j} Var TG Var TG Cov TG , TG
Var G − G = 2 + 2 −2
Mk Mj Mk Mj Mj Mk
2

N − Mj N − Mk â − ā
= + +2
Mj Mk N −1
2

1 1 N (â − ā )
= + .
Mj Mk N −1
p
Hence S in (4.7) may be replaced with N (â − ā2 )/(N − 1) to obtain si-
multaneous p-values to satisfy (4.8). Also, take the denominator degrees of
freedom to be ∞ as the second argument to q. The same substitution may be
made in (4.9) to obtain (4.10), but the parameters bounded in these intervals
are differences in average rank, which are seldom of interest.

Example 4.5.1 Consider again the yarn data of Example 3.3.1. Con-
sider just type A, and explore pairwise bobbin differences. One might do
all pairwise Mann-Whitney-Wilcoxon tests.
yarna<-yarn[yarn$type=="A",]
cat(’\nMultiple Comparisons for Yarn with No Correction\n’)
pairwise.wilcox.test(yarna$strength,yarna$bobbin,exact=F,
p.adjust.method="none")
This gives pairwise p-values
1 2 3 4 5
2 0.112 - - - -
3 0.470 1.000 - - -
4 0.245 0.030 0.112 - -
5 0.885 0.312 0.772 0.470 -
6 0.661 0.146 0.470 0.042 1.000
82 Methods for Three or More Groups

Bobbins 2 and 4 seem to be the most different, followed by 4 and 6, but

the above comparison is not adjusted for multiple comparisons.
One might perform the version of the Fisher LSD approach using
the Kruskal-Wallis test for pairwise comparisons, as described by Higgins
(2004):
library(MultNonParam)
higgins.fisher.kruskal.test(yarna$strength,yarna$bobbin)

In this case, the initial Kruskal-Wallis test fails to reject the null hypoth-
esis of equality of distribution, and no further exploration is performed.

Applying the Bonferroni adjustment,

# pairwise.wilcox.test corrects for multiple comparisons

# methods using only on p-values. This requirement
# excludes methods of Tukey and Scheffe.
cat(’\nBonferroni Comparisons for Yarn Type A Data\n’)
pairwise.wilcox.test(yarna$strength,yarna$bobbin,
exact=F,p.ajust.method="bonferroni")

This gives corrected p-values:

1 2 3 4 5
2 1.00 - - - -
3 1.00 1.00 - - -
4 1.00 0.46 1.00 - -
5 1.00 1.00 1.00 1.00 -
6 1.00 1.00 1.00 0.59 1.00
None are significant after correction for multiplicity. One might also use
Tukey’s method for calculating p-values respecting multiple comparisons:

library(MultNonParam)
tukey.kruskal.test(yarna$strength,yarna$bobbin)
indicating no significant differences.

4.6 Ordered Alternatives

Again consider the null hypothesis of equal distributions. Test this hypothesis
vs. the ordered alternative hypothesis

HA : Fi (x) ≥ Fi+1 (x)∀x

Ordered Alternatives 83

for all indices i ∈ {1, . . . , K − 1}, with strict equality at some x and some i.
This alternative reduces parameter space to 1/2K of former size. One might
use as the test statistic X
J= Uij , (4.19)
i<j

where Uij is the Mann-Whitney-Wilcoxon statistic for testing groups i vs. j.

Reject
PK the null hypothesis when J is large. This statistic may be expressed
as k=1 ck R̄k. plus a constant, for some ck satisfying (4.5); that is, J may be
defined as a contrast of the rank means, and the approach of this subsection
may be viewed as the analog of the parametric approach of §4.1.1.
Critical values for J can be calibrated using a Gaussian approximation.
Under the null hypothesis, the expectation of J is
X X
E0 [J] = Mi Mj /2 = N 2 /4 − Mi2 /4,
i<j i

and the variance is

K K
1 X 1 X
Var0 [J] = Var0 [Ui ] = Mi mi−1 (mi + 1); (4.20)
12 i=2 12 i=2

here Ui is the Mann-Whitney Pstatistic for testing group i vs. all preceding
i
groups combined, and mi = j=1 Mj . The second equality in (4.20) follows
from independence of the values Ui (Terpstra, 1952). A simpler expression for
this variance is
" K
#
1 X
Var0 [J] = N (N + 1)(2N + 1) − Mi (Mi + 1)(2Mi + 1) . (4.21)
72 i=1

This test might be corrected for ties, and has certain other desirable properties
(Terpstra, 1952).
Jonckheere (1954), apparently independently, suggested a statistic that
is twice J, centered to have zero expectation, and calculated the vari-
ance, skewness, and kurtosis. The resulting test is generally called the
Jonckheere-Terpstra test.

Example 4.6.1 Consider again the Maize data from area TEAN in Ex-
ample 4.3.1. The treatment variable contains three digits; the first in-
dicated nitrogen level, with four levels, and is extracted in the code in
Example 4.3.1. Apply the Jonckheere-Terpstra test:
library(clinfun)# For the Jonckheere-Terpstra test
jonckheere.test(tean$wght,tean$nitrogen)
cat(’\n K-W Test for Maize, to compare with JT \n’)
kruskal.test(tean$wght,tean$nitrogen)
to perform this test, and the comparative three degree of freedom Kruskal-
84 Methods for Three or More Groups

Wallis test. The Jonckheere-Terpstra Test gives a p-value 0.3274, as com-

pared with the Kruskal-Wallis p-value 0.4994.

4.7 Powers of Tests

This section considers powers of tests calculated from linear and quadratic
combinations of of indicators
(
1 if Xik < Xjl
Iik,jl = .
0 if Xik > Xjl

The Jonckheere-Terpstra statistic (4.19) is of this form, as is the Kruskal-

Wallis statistic (4.17), since the constituent rank sums can be written in terms
of pairwise variable comparisons. Powers will be expressed in terms of the
expectations

κij = PA [Xi1 < Xj1 ] for i 6= j, and κii = 1/2. (4.22)

Under the null hypothesis of equal populations, κij = 1/2 for all i 6= j.
In the case of multidimensional alternative hypotheses, effect size and ef-
ficiency calculations are more difficult than in earlier one-dimensional cases.
In the case with K ordered categories, there are effectively K − 1 identifi-
able parameters, since, because the location of the underlying common null
distribution for the data is unspecified, group location parameters θj can all
be increased or decreased by the same constant amount while leaving the
underlying model unchanged. On the other hand, the notion of relative effi-
ciency requires calculating an alternative parameter value corresponding to,
at least approximately, the desired power, and as specified by (2.23). This
single equation can determine only a single parameter value, and so relative
efficiency calculations in this section will consider alternative hypotheses of
the form
θ A = ∆θ † (4.23)
for a fixed direction θ † . Arguments requiring solving for the alternative will
reduce to solving for ∆. The null hypothesis is still θ = 0.

4.7.1 Power of Tests for Ordered Alternatives

The statistic J of (4.19) has an approximate Gaussian distribution, and so
powers of tests of ordered alternatives based on J are approximated by (2.18).
Under both null and alternative hypotheses,
X
E [J] = Mi Mj κij , (4.24)
i<j
Powers of Tests 85

with κij defined as in (4.22). Alternative values for κij under shift models (4.2)
are calculated as in (3.25). Without loss of generality, one may take θ1 = 0.
Consider parallels with the two-group setup of §3. The cumulative distri-
bution function F1 of (4.2) corresponds to F of (3.1), and F2 corresponds to G
of (3.1). Then µ(θ) of (3.25) corresponds to κ12 . Calculation of κkl , defined in
(4.22), and applied to particular pairs of distributions, such as the Gaussian
in (3.26) and the logistic in (3.27), and other calculations from the exercises
of §3, hold in this case as well. Each of the difference probabilities κkl , for
k 6= l, depends on the alternative distribution only through θl − θk .
Power may be calculated from (2.18).

Example 4.7.1 Consider K = 3 groups of observations, Gaussian, with

unit variance and expectations θ1 = 0, θ2 = 1/2, and θ3 = 1, and all
groups of size Mi = 20. Consider
√ a one-sided level 0.025
√ test. Applying
(3.26), κ12 = κ23 = Φ(.5/ 2) = 0.638, and κ13 = Φ(1/ 2) = 0.760. The
null and alternative expectations of J are

602 [(1/3)(1/3) × 0.5 + (1/3)(1/3) × 0.5 + (1/3)(1/3) × 0.5] = 600

and

602 [(1/3)(1/3) × 0.638 + (1/3)(1/3) × 0.638 + (1/3)(1/3) × 0.760] = 814.6

respectively, from (4.24). The null variance of J, from (4.21), is

1 XK
ς 2 (0) = N (N + 1)(2N + 1) − Mi (Mi + 1)(2Mi + 1)
72 i=1

= (60 × 61 × 1211 − 3 × 20 × 21 × 41)/72 = 5433.3.

√
Applying (2.18), power is 1 − Φ((600 − 814.6)/ 5433.3 + 1.96) = 0.829.
I used (2.18), rather than (2.20), since the variance of the distribution of
the statistic was most naturally given above without division by sample
size, and rather than (2.17), because calculating the statistic variance
under the alternative is tedious.
This may be computed in R using
library(MultNonParam)
terpstrapower(rep(20,3),(0:2)/2,"normal")

and the approximate power may compared to a value determined by Monte

Carlo; this value is 0.857.

4.7.2 Power of Tests for Unordered Alternatives

Power for unordered alternatives is not most directly calculated as an exten-
sion of (2.22). In this unordered case, as noted above, the approximate null
86 Methods for Three or More Groups

distribution for WH is χ2K−1 . One might attempt to act by analogy with (2.17),
and calculate power using alternative hypothesis expectation and variance
matrix. This is possible, but difficult, since WH (and the other approximately
χ2K−1 statistics in this chapter) are standardized using the null hypothesis
variance structure. Rescaling this under the alternative hypothesis gives a
statistic that may be represented approximately as the sum of squares of inde-
pendent Gaussian random variables; however, not only do these variables have
non-zero means, which is easily addressed using a non-central χ2 argument
as in (1.3), but also have unequal variances for the variables comprising the
summands; an analog to (1.3) would have unequal weights attached to the
squared summands.
A much easier path to power involves analogy with (2.22): Approximate
the alternative distribution using the null variance structure. This results in
the non-central chi-square approximation using (1.4).
Because the Mann-Whitney and Wilcoxon statistics differ only by an ad-
ditive constant, the Kruskal-Wallis test may be re-expressed as
(K )
X
◦ 2
(Tk − Mk (N − Mk )κ ) /Mk /[ψ 2 (N + 1)N ]. (4.25)
k=1

Here κ◦ = 1/2; this is the null value of κkl , and

√ the null hypothesis specifies
that this does not depend on k or l, and ψ = 1/ 12, a multiplicative constant
arising in the variance of the Mann-Whitney-Wilcoxon statistic. Many of the
following equations follow from (4.25); furthermore, (4.25) also approximately
describes other statistics to be considered later, and analogous consequences
may be drawn for these statistics as well, with a different value for ψ. Hence
the additional complication of leaving a variable in (4.25) whose value is known
will be justified by using consequences of (4.25) later without deriving them
again. Here,
Mk X
X K XMl
Tk = I(Xkj > Xli ), (4.26)
j=1 l=1,l6=k i=1

the Mann-Whitney statistic for testing whether group k differs from all of the
other groups, with all of the other groups collapsed.
The variance matrix for rank sums comprising WH is singular (that is,
it does not have an inverse), and the argument justifying (1.4) relied on the
presence of an inverse. The argument of §4.2.2 calculated the appropriate
quadratic form, dropping one of the categories to obtain an invertible vari-
ance matrix, and then showed that this quadratic form is the same as that
generating t. The same argument shows that the appropriate non-centrality
parameter is
(K )
X
ξ= (EA [Tk ] − Mk (N − Mk )(1/2)) /Mk /[ψ 2 (N + 1)N ],
2

k=1
Powers of Tests 87
PK
where EA [Tk ] = Mk l=1,l6=k Ml κkl . The non-centrality parameter is
 2
K K
1 X X
ξ = Mk  Ml (κkl − κ◦ )
ψ 2 (N + 1)N
k=1 l=1,l6=k

K K
!2
1 X X
◦
= Mk Ml (κkl − κ ) . (4.27)
ψ 2 (N + 1)N
k=1 l=1

The restriction on the range of the inner summation l 6= k may be dropped

in (4.27), because the additional term is zero.
Let Gk and G−1 k be the chi-square cumulative distribution and quantile
functions respectively, as in §4.3.1. The power for a the Kruskal-Wallis test
with K groups, under alternative (4.2), is approximately

1 − GK−1 (G−1
K−1 (1 − α, 0); ξ), (4.28)

with ξ given by (4.27).

Example 4.7.2 Continue example 4.7.1. Again, consider K = 3 groups

of observations, Gaussian, with unit variance and expectations θ1 = 0,
θ2 = 1/2, and θ3 = 1, and all groups of size Mi = 20. The three inner
sums in (4.27) are 20×(0.5−0.5)+20×(0.638−0.5)+20×(0.760−0.5) =
7.968, 20 × (0.362 − 0.5) + 20 × (0.5 − 0.5) + 20 × (0.638 − 0.5) = 0,
and 20 × (0.240 − 0.5) + 20 × (0.362 − 0.5) + 20 × (0.5 − 0.5) = −7.968.
Squaring, multiplying each of these by Mk = 20, and adding gives 2539.6.
Multiplying by 12/(60 × 61) gives 8.327. The critical value for a test of
level 0.05 is given by the χ2 distribution with 2 degrees of freedom, and is
5.99. The tail probability associated with the non-central χ2 distribution
with non-centrality parameter 8.327 and two degrees of freedom, beyond
5.99, is 0.736; this is the power for the test. As expected, this power is
less than that given in example (4.7.1) for the Jonckheere-Terpstra test.
This might be compared with a Monte Carlo approximation of 0.770.
This may be computed using the R package MultNonParam using

kwpower(rep(20,3),(0:2)/2,"normal")

Approximation (4.27) may be approximated to give a simpler relation between

the non-centrality parameter and sample size, allowing for the calculation of
the sample size producing a desired power, denoted in this subsection as 1−β.
In (4.28), sample size enters only through the non-centrality parameter. As
in standard one-dimensional sample size calculations, re-express the relation
between power and non-centrality as

G−1 −1
K−1 (β; ξ) = GK−1 (1 − α, 0). (4.29)
88 Methods for Three or More Groups

From (4.27), 
K K
!2 
X X 
ξ≈ λk λl (κkl − κ◦ ) N/ψ 2 , (4.30)
 
k=1 l=1
√
for ψ = 1/ 12, and
 !2 
XK K
X 
N ≈ ξψ 2 / λk λl (κkl − κ◦ ) , (4.31)
 
k=1 l=1

for λk = limN →∞ Mk /N . Assume that λk > 0 for all k. Hence, to determine

the sample size needed for a level α test to obtain power 1−β, for an alternative
with group differences κkl , and with K groups in proportions λk , first solve
(4.29) for ξ, and then apply (4.31). An old-style approach to solving (4.29)
involves examining tables of the sort in Haynam et al. (1982) and Haynam
et al. (1982).

Example 4.7.3 Again, consider K = 3 groups of observations, Gaus-

sian, with unit variance and expectations θ1 = 0, θ2 = 1/2, and θ3 = 1
Calculate the sample size needed for the level α Kruskal-Wallis test in-
volving equal-sized groups to obtain power 0.8. Quantities κkl were calcu-
lated in example 4.7.1 to be 0.362, 0.5, and 0.638. The three inner sums
in (4.31) are (0.5 − 0.5)/3 + (0.638 − 0.5)/3 + (0.760 − 0.5)/3 = 0.133,
(0.362 − 0.5)/3 + (0.5 − 0.5)/3 + (0.638 − 0.5)/3 = 0, and (0.240 − 0.5)/3 +
(0.362 − 0.5)/3 + (0.5 − 0.5)/3 = −0.133. Squaring, multiplying each of
these by λk = 1/3, adding, and multiplying by 12 gives 0.1408. Solving the
equation (4.29) gives ξ = 9.63, and the sample size is 9.63/0.1408 ≈ 69,
indicating 23 subjects per group.
This may be computed using the R package MultNonParam using

kwsamplesize((0:2)/2,"normal")

As in the one-dimensional case, one can solve for effect size by approximating
the probabilities as linear in the shift parameters. Express
κkl − κ◦ ≈ κ0 (θkA − θlA ), (4.32)
and explore the multiplier ∆ from (4.23) giving the alternative hypothesis in
a given direction. Then
 
K XK X K
 X  ξψ 2
N ≈ ξψ 2 / (κ0 )2 λk λj λl (θkA − θlA )2 = 2 0 2 2 , (4.33)
  ∆ (κ ) ζ
k=1 j=1 l=1
s
2 P 2
PK † K †
for ζ = k=1 λk θk − k=1 λk θk ; ζ plays a role analogous to its
role in §3.8, except that here it incorporates the vector giving the direction of
departure from the null hypothesis.
Powers of Tests 89

Example 4.7.4 The sum with respect to j disappears from ζ 2 , since the
sum of the proportions λj is 1. Under the same conditions in Example
4.7.3, one might use (4.33) rather than (4.31). Take θ † = θ A = (0, 1/2, 1)
and ∆ = 1. The derivative κ0 is tabulated, for the Gaussian and logistic
distributions,
√ −1 in Table 3.6 as µ0 (0). In this Gaussian case, κ0 = µ0 (0) =
(2 π) = 0.282. Also, ζ 2 = (02 /3 + (1/2)2 /3 + 12 /3 − (0/3 + (1/2)/3 +
1/3)2 ) = 5/12 − 1/4 = 1/6. The non-centrality parameter, solving (4.29),
is ξ = 9.63. The approximate sample size is 9.63/(12 × 12 × 0.2822 /6) =
61, or approximately 21 observations per group; compare this with an
approximation of 23 per group using a Gaussian approximation, without
expanding the κij .
This may be computed using the R package MultNonParam using

kwsamplesize((0:2)/2,"normal",taylor=TRUE)

Relation (4.33) may be solved for the effect size ∆, to yield

p
∆ = ψ ξ/ κ0 ζN 1/2 . (4.34)

In order to determine the effect size necessary to give a level α test power
1 − β to detect an alternative hypothesis ∆θ † , determine ξ from (4.28), and
then ∆ from (4.34).

Example 4.7.5 Again use the same conditions as in Example 4.7.3.

Consider three
√ groups with 20 observations each; then N = 60. Recall
0
that ψ = 1/ 12. Example p 4.7.4 demonstrates that ξ = 9.63, κ = 0.282,
ζ 2 = 1/6, and ∆ = ( 9.63/(60 × 12 × (1/6)))/0.282 = 1.004. As ∆
is almost exactly 1, the alternative parameter vector in the direction of
(0, 1/2, 1) and corresponding to a level 0.05 test of power 0.80 with 60
total observations is (0, 1/2, 1).
This may be computed using the R package MultNonParam using
kweffectsize(60,(0:2)/2,"normal")

Figure 4.3 reflects the accuracy of the various approximations in this sec-
tion. It reflects performance of approximations to the power of the Kruskal-
Wallis test. Tests with K = 3 groups, with the same number of observations
per group, with tests of level 0.05, were considered. Group sizes between 5 and
30 were considered (corresponding to total sample sizes between 15 and 90).
For each sample size, (4.34) was used to generate an alternative hypothesis
with power approximately 0.80, in the direction of equally-spaced alternatives.
The dashed line represents a Monte Carlo approximation to the power, based
on 50,000 observations. The dotted line represents the standard non-central
chi-square approximation (4.28). The solid line represents this same approxi-
mation, except also incorporating the linear approximation (4.32) representing
90 Methods for Three or More Groups

the exceedance probabilities as linear in the alternative parameters, and hence

the non-centrality parameter as quadratic in the alternative parameters.
The solid line is almost exactly horizontal at the target power. The dis-
crepancy from horizontal arises from the error in approximating (4.27) by
(4.30). For small sample sizes (that is, group sizes of less than 25, or group
sizes of less than 75), (4.28) is not sufficiently accurate; for larger sample sizes
it should be accurate enough.
All curves in Figure 4.3 use the approximate critical value in (4.18).

FIGURE 4.3: Approximate Powers for the Kruskal-Wallis Test

0.80 ........
............................................................................................................................................
.......................................................
..............................
...........
................... ...... .......................................
.. ....
.......... ........................................
...............
... .........
.......... .............
........ . . . .
........ . . . .
0.75 ............. . . . .
.... . . .
........ . . . . .
...
...... . .
.
.... . .
..... . .
...... . .
......... .
.. . .
....
... .
0.70 ...
..
.. .
.
.
.
...
Approx- ...
... .
.
.
.
imate .
...
.
.
.
... .
Power 0.65 ..
...
.
.
.
.
.
. .................................................................... Approx. (4.28) with (4.27)
.
. and (4.32)
0.60 .
.

.
.
. . . . . . . . . . . Approx. (4.28) with (4.27)
. ...................................................... Monte Carlo approx.,
0.55 50000 observations.

5 10 15 20 25 30
Number per Group
3 groups, level 0.05, power 0.8

4.8 Efficiency Calculations

Power calculations for one-dimensional alternative hypotheses made use of
(2.22), applying a Gaussian approximation with exact values for means under
the null and alternatives, and approximating the variance under the alterna-
tive by the variance under the null. Efficiency calculations of §2.4 and §3.8
approximated means at the alternative linearly using the derivative of the
mean function at the null. Tests for ordered alternatives from this chapter
will be addressed similarly.
Efficiency Calculations 91

4.8.1 Ordered Alternatives

Consider first the one-sided Jonckheere-Terpstra test of level α. Let TJ =
J/N 2 . In this case, the subscript J represents a label, and not an index.
Denote the critical value by t◦J , satisfying Pθ0 [TJ ≥ t◦J ] = 1 − α.
As in (4.23), reduce the alternative hypothesis to a single dimension by
letting the alternative parameter vector be a fixed vector times a multiplier
∆. The power function $J,n (∆) = PθA [TJ ≥ t◦J ] satisfies (2.15), (2.19), and
(2.21), and hence the efficiency tools for one-dimensional hypotheses developed
in §2.4.2 may be used. Expressing µJ (∆) as a Taylor series with constant and
linear terms,
K−1
X X K
µJ (∆) ≈ λi λj (κ◦ + κ0 [θjA − θiA ])
i=1 j=i+1

for λi = Mi /N , where again κ◦ is the common value of κjk under the null
hypothesis, and κ0 is the derivative of the probability in (3.25), as a function
of the location shift between two groups, evaluated at the null hypothesis, and
calculated for various examples in §3.8.2. Hence
K−1 K
λi λj κ0 [θj† − θi† ].
X X
µ0J (0) =
i=1 j=i+1

Recall that κij depended on two group indices i and j only because the lo-
cations were potentially shifted relative to one another; the value κ◦ and
its derivative κ0 are evaluated at the null hypothesis of equality of distri-
butions, and hence
h do not depend
i on the indices. Furthermore, from (4.21),
1
PK 3
Var [TJ ] ≈ 36 1 − k=1 λk /N . Consider the simple case in which λk = 1/K
for all k, and in which θj† − θi† = (j − i). Then µ0J (0) = κ0 (K 2 − 1)/6K,
1

Var [TJ ] ≈ 36 1 − 1/K 2 /N , and

(K 2 − 1)/6K κ0 (K 2 − 1) p
eJ = κ0 q = √ = κ0 K 2 − 1.
1 2 K2 − 1
36 [1 − 1/K ]
√ √
The efficacy of the Gaussian-theory contrast, from (4.6), is K 2 − 1/(2 3ε).
Hence the asymptotic relative efficiency of the Jonckheere-Terpstra test to
the contrast of means is
(κ0 )2 (12ε2 ). (4.35)
This is the same as the asymptotic relative efficiency in the two-sample case
in Table 3.6.

4.8.2 Unordered Alternatives

While techniques exist to consider transform non-central chi-square statistics
to an approximate Gaussian distribution (Sankaran, 1963), the most direct
92 Methods for Three or More Groups

approach to comparing efficiency of various tests based on approximately χ2

statistics is to compare approximate sample sizes for a fixed level and power,
as in (4.33). Prentice (1979) does this using ratios of the non-centrality pa-
rameter. For test j, let Nj , be the sample size needed to provide power 1 − β
for the test of level j against alternative θA .
Calculations (4.27), and (4.30) through (4.34) were motivated specifically
for the Kruskal-Wallis test, using Mann-Whitney sums (4.26), but hold more
broadly for any group summary statistics replacing (4.26), so long as E0 [Tk ] ≈
PK
Mk l=1,l6=k Ml κ◦ for some differentiable κ◦ , and as long as (4.25), with the
new definition of Tk , is approximately χ2K−1 . In particular, a rescaled version
PK
of the F -test statistic (4.3) (K − 1)WA = k=1 Mk (X̄k. − X̄.. )2 /σ̃ 2 is such
a test, with ψ the variance of the Xkj , Tk = X̄k. − X̄.. , κ◦ = 0, and κ0 = 1,
approximated as the χ2 statistic by treating the variance as known. Suppose
that the value of ζ 2 remains unchanged for two such tests, and suppose these
tests have values of κ0 at the null hypothesis distinguished by indices (for
concreteness, label these as κ0H and κ0A = 1 for the correctly-scaled Kruskal-
Wallis and F tests respectively), and similarly √ having the norming factors
ψ distinguished by indices (again, ψH = 1/ 12 and ψA = ε2 ). Equate the
alternatives in (4.34), to obtain
p 1/2
p 1/2

ψH ξ/ κ0H ζNH ≈ ψA ξ/ κ0A ζNA .

The ratio of sample sizes needed for approximately the same power for the
same alternative from the F and Kruskal-Wallis tests is approximately
2 2
) (κ0H )2 /(κ0A )2 = 12ε2 (κ0H )2 ,

NA /NH ≈ (ψA /ψH

which is the same as in the unordered case (4.35) and in the two-sample case.

4.9 Exercises
1. The data set

HTTP://ftp.uni-
bayreuth.de/math/statlib/datasets/federalistpapers.txt

gives data from an analysis of a series of documents. The first col-

above, has odd line breaks. A reformatted version can be found at

stat.rutgers.edu/home/kolassa/Data/federalistpapers.txt).
a. Test the null hypothesis that negativity is equally distributed
across the groups using a Kruskal-Wallis test.
b. Test at α = .05 the pairwise comparisons for negativity between
groups using the Bonferroni adjustment, and repeat for Tukey’s
HSD.
2. The data set

HTTP://ftp.uni-
bayreuth.de/math/statlib/datasets/Plasma Retinol

gives the results of a study on concentrations of certain nutrients

among potential cancer patients. This set gives data relating various
quantities, including smoking status (1 never, 2 former, 3 current)
in column 3 and beta plasma in column 13. Perform a nonparamet-
ric test to investigate an ordered effect of smoking status on beta
plasma.
3. Consider using a Kruskal-Wallis test of level 0.05, testing for equal-
ity of distribution for four groups of size 15 each. Consider the
alternative that two of these groups have observations centered at
zero, and the other two have observations centered at 1, with all
observations from a Cauchy distribution.
a. Approximate the power for this test to successfully detect the
above group differences.
b. Check the quality of this approximation via simulation.
c. Approximate the size in each group if a new study were planned,
with four equally-sized groups, and with the groups centered as
above, with a target power of 90%.
4. The data at

HTTP://lib.stat.cmu.edu/datasets/CPS 85 Wages

This chapter concerns testing and estimation of differences in location among

multiple groups, as raised in the previous chapter, in the presence of blocking.
Techniques useful when data are known to be approximately Gaussian are
first reviewed, for comparison purposes. The simplest example of blocking is
that of paired observations; paired observations are considered after Gaussian
techniques. More general multi-group inference questions are addressed later.

5.1 Gaussian Theory Approaches

Techniques appropriate when observations are well-approximated by a Gaus-
sian distribution are well-established. These are first reviewed to give a point
of departure for nonparametric techniques, which are the focus of this volume.

5.1.1 Paired Comparisons

Suppose pairs of values (Xi1 , Xi2 ) are observed on n subjects, indexed by i,
and suppose further that (Xi1 , Xi2 ) is independent of (Xj1 , Xj2 ) for i, j ∈
{1, . . . , n}, i 6= j, and that all of the vectors (Xi1 , Xi2 ) have the same distri-
bution. Under the assumption that the observations have a Gaussian distri-
bution, one often calculates differences Zi = Xi1 − Xi2 , and applies either the
testing procedure of §2.1.2, or the associated confidence interval procedure.
Such a test is called a paired t-test.

5.1.2 Multiple Group Comparisons

Suppose one observes independent random variable Xkli , with k ∈ {1, · · · , K},
l ∈ {1, · · · , L}, and i ∈ {1, · · · , Mkl }. Here index k represents treatment, and
index l represents block. A test of equality of treatment means is desired,
without an assumption about the effect of block. When these variables are
Gaussian, and have equal variance, one can construct a statistic analogous to
the F test (4.3), in which the numerator is a quadratic form in differences
PL PMkl PL
between the means specific to group, X̄k.. = l=1 i=1 Xkli / l=1 nkl , and
the overall mean. Formulas are available for the statistic in closed form only

95
96 Group Differences with Blocking

when the replicate numbers satisfy

Mkl = Mk. M.l /M.. (5.1)

PL PK PK PL
for Mk. = l=1 Mkl , M.l = k=1 Mkl , and M.. = k=1 l=1 Mkl .

5.2 Nonparametric Paired Comparisons

Suppose pairs of values (Xi1 , Xi2 ) are observed on n subjects, indexed by i,
and suppose further that (Xi1 , Xi2 ) is independent of (Xj1 , Xj2 ) for i, j ∈
{1, . . . , n}, i 6= j, and that all of the vectors (Xi1 , Xi2 ) have the same distri-
bution. Consider the null hypothesis that the marginal distribution of {Xi }
is the same as that of {Yi }, versus the alternative hypothesis that the dis-
tributions are different. This null hypothesis often, but not always, implies
that the differences Zi = Xi1 − Xi2 have zero location. One might test these
hypotheses by calculating the difference Zi between values, and apply the one-
sample technique of Chapter 2, the sign test; this reduces the problem to one
already solved.
After applying the differencing operation, one might expect the differences
to be more symmetric than either of the two original variables, and one might
exploit this symmetry. To derive a test under the assumption of symmetry,
Let Rj be the rank of |Zj | among all absolute values. Let
(
1 if item j is positive
Sj = .
0 if item j is negative

Define the Wilcoxon signed-rank test in terms of the statistic

X
TSR = Rj Sj . (5.2)

Under the null hypothesis that the distribution of the Xi is the same as the
distribution of the Yi , and again, assuming symmetry of the differences, then
(, Sj , ) and (, Rj , ) are independent random vectors, because Sj and |Xj | are
pairwise independent under H0 .
Components of the random vector (R1 , . . . , Rn ) are dependent, and hence
calculation of the variance from TSR via (5.2) requires calculation of the sum of
identically distributed but not independent random variables. An alternative
formulation, as the sum of independent but not identically distributed random
variables, will prove more tractable. Let
(
1 if the item whose absolute value is ranked j is positive
Vj = .
0 if the item whose absolute value is ranked j is negative
Nonparametric Paired Comparisons 97
P
Hence TSR = jVj , and the null expectation and variance are
X
E0 [TSR ] = j E0 [Vj ] = n(n + 1)/4 (5.3)

and
X X
Var0 [TSR ] = j 2 Var0 [Vj ] = j 2 /4 = n(2n + 1)(n + 1)/24. (5.4)

One can also calculate exact probabilities for TSR recursively, as one could
for the two-sample statistic. There are 2n ways to assign signs to ranks 1, . . . , n.
Let f (t, n) be the number of such assignments yielding TSR = t with n obser-
vations. Again, as in §3.4.1, summing the counts for shorter random vectors
with alternative final values,

0
 for t < 0 or t > n(n + 1)/2
f (t, n) = 1 if n = 1 and t ∈ {0, 1} .

f (t, n − 1) + f (t − n, n − 1) otherwise


This provides a recursion that can be used to calculate exact p-values.

Example 5.2.1 Consider data calculated on the size of brains of twins

(Tramo et al., 1998). This data set from
http://lib.stat.cmu.edu/datasets/IQ Brain Size
contains data on 10 sets of twins. Each child is represented by a separate
line in the data file, for a total of 20 lines. We investigate whether brain
volume (in field 9) is influenced by birth order (in field 4). Brain volumes
for the first and second child, and their difference, are given in Table 5.1.
The rank sum statistic is 3+4+10+8+7 = 32, the null expected rank sum
is 10×11/4 = 27.5, the null variance is 10×21×11/24
√ = 96.25, and so the
two-sided p-value is 2 × Φ(−(32 − .5 − 27.5)/ 96.25) = 2 × Φ(−0.408) =
.683. There is no evidence that twin brain volume differs by birth order.
This calculation might also have been done in R using
twinbrain<-as.data.frame(scan("IQ_Brain_Size",
what=list(CCMIDSA=0,FIQ=0,HC=0,ORDER=0,PAIR=0,SEX=0,
TOTSA=0, TOTVOL=0,WEIGHT=0),skip=27,nmax=20))
fir<-twinbrain[twinbrain$ORDER==1,]
fir$v1<-fir$TOTVOL
sec<-twinbrain[twinbrain$ORDER==2,]
sec$v2<-sec$TOTVOL
brainpairs<-merge(fir,sec,by="PAIR")[,c("v1","v2")]
brainpairs$diff<-brainpairs$v2-brainpairs$v1
wilcox.test(brainpairs$diff)
98 Group Differences with Blocking

TABLE 5.1: Twin brain volume

Pair 1 2 3 4 5 5 7 8 9 10
First 1005 1035 1281 1051 1034 1079 1104 1439 1029 1160
Second 963 1027 1272 1079 1070 1173 1067 1347 1100 1204
Diff. -42 -8 -9 28 36 94 -37 -92 71 44
Rank 6 1 2 3 4 10 5 9 8 7

giving an exact p-value of 0.695. Compare these to the results of the sign
test and t-test:
library(BSDA)#Need for sign test.
SIGN.test(brainpairs$diff)
t.test(brainpairs$diff)

giving p-values of 1 and 0.647 respectively.

As with the extension from two-sample testing to multi-sample testing referred

to in §3.4.2, one can extend the other rank-based modifications of §3.4.2 and
§3.6 to the blocking context as well. Ties may be handled by using mean
ranks, and testing against a specific distribution for the alternative may be
tuned using the Pappropriate scores. The more general
P score statistic for paired
data is TGP = j aj Sj ; in this case, E [TGP ] = j aj /2, and Var [TGP ] =
P 2
j aj /4.

Example 5.2.2 Perform the asymptotic score tests on the brain volume
differences of Example 5.2.1.
library(MultNonParam)
cat("Asymptotic test using normal scores\n")
brainpairs$normalscores<-qqnorm(seq(length(
brainpairs$diff)),plot.it=F)$x
symscorestat(brainpairs$diff,brainpairs$normalscores)
cat("Asymptotic test using savage scores\n")
brainpairs$savagescores<-cumsum(
1/rev(seq(length(brainpairs$diff))))
symscorestat(brainpairs$diff,brainpairs$savagescores)
giving p-values of 0.863 and 0.730 respectively.

Permutation testing can also be done, using raw data values as scores.
This procedure uses the logic that Xj and Yj having the same marginal
distribution implies Yj − Xj has a symmetric distribution. Joint distributions
exist for which this is not true, but these examples tend to be contrived.
Nonparametric Paired Comparisons 99

5.2.1 Estimating the Population Median Difference

Estimation of the median of the observations, which is their point of symmetry,
based on inversion of the Wilcoxon signed-rank statistic mirrors that based
on inversion of the Mann-Whitney-Wilcoxon statistic of §3.10. Let TSR (θ)
be the Wilcoxon signed-rank statistic calculated from the data set Zj (θ) =
Zj − θ = Yj − Xj − θ, after ranking the Zj (θ) by their absolute values, and
summing the ranks with positive values of Zj (θ). Estimate θ as the quantity
θ̂ equating TSR to the median of its sampling distribution; that is, θ̂ satisfies
TSR (θ̂) = n(n + 1)/4.
After sorting the values Zj , denote value at position j in the ordered list by
the order statistic Z(j) . One can express θ̂ in terms of Z(j) , by considering the
behavior of TSR (θ) as θ varies. For θ > Z(n) , TSR (θ) = 0. When θ decreases
past Z(n) , the Zi − θ with the lowest absolute value switches from negative to
positive, and TSR (θ) moves up to 1. When θ decreases past (Z(n) + Z(n−1) )/2,
then Z(n) −θ goes from having the smallest absolute value to having the second
lowest absolute value, and TSR goes to 2. The next jump in TSR occurs if one
more observation becomes positive (at Z(n−1) ) or if the absolute value of the
lowest shifted observation passes the absolute value of the next observation
(at Z(n) + Z(n−2) /2).
Generally, the jumps happen at averages of two observations, including av-
erages of an observation with itself. These averages are called Walsh averages.
These play the same role as differences in the two-sample case. First, note that
TSR is the number of positive Walsh averages. This can be seen by letting Z[i]
be observation ordered by absolute value, and letting Wij = (Z[i] + Z[j] )/2 for
i ≤ j. Suppose that Z[j] > 0. Then Z[i] + Z[j] > 0 for i < j, and all Wij > 0
for i < j, and Wjj > 0. On the other hand, if Z[j] < 0, then Z[i] + Z[j] < 0 for
i < j, and then all Wij < 0 for i < j, and Wjj < 0.
So θ̂ has half of the Walsh averages below it, and half above, and
θ̂ is the median of Walsh averages. This estimator is given by Hodges
and Lehmann (1963), in the same paper giving the analogous estima-
tor for the one-sample symmetric context of §3.10, and is called the
Hodges-Lehmann estimator.

Example 5.2.3 Walsh averages may be extracted in R using

aves<-outer(brainpairs$diff,brainpairs$diff,"+")/2
sort(aves[upper.tri(aves,diag=TRUE)])
to obtain the 10 × (10 + 1)/2 = 55 pairwise averages

-92.0 -67.0 -64.5 -50.5 -50.0 -42.0 -39.5 -37.0 -32.0 -28.0
-25.5 -25.0 -24.0 -23.0 -22.5 -10.5 -9.0 -8.5 -8.0 -7.0
-4.5 -3.0 -0.5 1.0 1.0 3.5 9.5 10.0 13.5 14.0
14.5 17.0 17.5 18.0 26.0 28.0 28.5 31.0 31.5 32.0
100 Group Differences with Blocking

36.0 36.0 40.0 42.5 43.0 44.0 49.5 53.5 57.5 61.0
65.0 69.0 71.0 82.5 94.0

Their median is observation 28, which is 10.0. Estimate the median dif-
ference as 10.0.

5.2.2 Confidence Intervals

Confidence intervals may also be constructed using the device of (1.16), sim-
ilarly as with the one-sample location interval construction of §2.3.3 and the
two-sample location shift interval construction of §3.10. Let W̃j be the or-
dered Walsh averages. Find the largest tl such that P [TSR < tl ] < α/2; then
tl is the α/2 quantile of the distribution of the Signed Rank statistic. Using
(5.3) and (5.4), one might approximate p the critical value using a Gaussian
approximation tl ≈ n(n + 1)/2 − zα/2 n(2n + 1)(n + 1)/24; note that this
approximation uses the distribution of the signed-rank statistic, which does
not depend on the distribution of the underlying data as long as symmetry
holds. Recall that zβ is the 1 − β quantile of the standard Gaussian distribu-
tion, for any β ∈ (0, 1); in particular, zα/2 is positive for any α ∈ (0, 1), and
z0.05/2 = 1.96. By symmetry, P [TSR ≥ tu ] ≤ α/2 for tu = n(n − 1)/2 − tl + 1.
As noted above, TSR (θ) jumps by one each time θ passes a Walsh average.
Hence the confidence interval is

(W̃n(n−1)/2+1−tl , W̃n(n−1)/2+1−tu ). (5.5)

By symmetry, this interval is (W̃tl , W̃tu ). See Figure 5.1.

Example 5.2.4 Refer again to the brain volume data of Example 5.2.1.
Find a 95% confidence interval for the difference in brain volume. The
0.025 quantile of the Wilcoxon signed-rank statistic with 10 observations
is t◦ = 9; this can be calculated from R using
qsignrank(0.025, 10)
and confidence interval endpoints are observations 9 and 55+1-9=47. As
tabulated above, Walsh average 9 and 47 are -32.0 and 49.5 respectively.
Hence the confidence interval is (-32.0, 49.5). This might have been cal-
culated directly in R using
wilcox.test(brainpairs$diff,conf.int=TRUE)

Similar techniques to those of this section were used in §2.3.3 to give con-
fidence intervals for the median, and in §3.10 to give confidence intervals for
median differences. In each case, a statistic dependent on the unknown param-
eter was constructed, and estimates and confidence intervals were constructed
by finding values of the parameter equating the statistic to appropriate quan-
tiles. However, the treatment of this section and that of §2.3.3 differ, in that
Nonparametric Paired Comparisons 101

FIGURE 5.1: Construction of Median Estimator

... ...
... ...
.. ...
.... ..
....................... ...
... ..
. . . . . . . ....... . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .... . . . . . tu
20 .. ...
... .... ...
.. ..
... .. ...
... ...
.. . ..
... ...
... .... ...
.. ..
... ...
15 ...
..
. ...
..
... ...... ...
.. ...
.. . ...
... ..
Signed ...
...
. ...
...
.. .......................................... ..
Rank ...
...
...
...
Statistic 10 ..
...
....
......
..
...
... ...
.. ..
.. . ...
.. ...
... ............................. ..
.. ...
.. ...
. ...
... ..
... ...
5 ...
..
......
...
..
.
... ...
... ............. ...
.. ..
... ....................................... ...
... ...
.. .
. . . . . . . .... . . . . . . . . . . . . . . . . . . . . . . . . . ............................................. . . . . . tl
... ...
0 ... ........................
.. ..
.. ..

-2 0 2 4 6 8 10
Potential Point of Symmetry
Hypothetical Data Set with 6 Observations

the resulting statistic in this section is non-increasing in θ in this section, while

in §2.3.3 it was non-decreasing. The increasing parameterization of §2.3.3 was
necessary to allow the same construction to be used in §2.3.4, when quantiles
other than the median were estimated; these quantiles estimates are an in-
creasing function of each observation separately. A parallel construction might
have been used in the current section, by inverting the statistic formed by sum-
ming ranks of negative observations; as this definition runs contrary to the
common definition of the Wilcoxon Signed Rank statistic, it was avoided.

5.2.3 Signed-Rank Statistic Alternative Distribution

Consider alternative hypotheses HA , specifying that the median of the distri-
bution of differences is θ. Hence
n X
X i
Eθ [T ] = Pθ [(Zi + Zj − 2θ) ≤ 0]
i=1 j=1
= n(n − 1)P [Z1 + Z2 ≤ 2θ] /2 + nP [Z1 ≤ θ] .

To apply asymptotic relative efficiency calculations, scale the test statistic to

have an expectation that varies with the parameter value specifying the null
102 Group Differences with Blocking

hypothesis, and approximately independent of sample size, as in (2.15), by

switching to S = 2TSR /(n(n − 1)). In this case, µ(θ) ≈ P0 [Z1 + Z2 ≥ 2θ], and
the statistic variance times the sample size is

σ 2 (0) = 4 × n × n(n + 1)(2n + 1)/(n2 (n + 1)2 24) ≈ 1/6,

to match the notation of (2.15) and (2.19). Note that µ0 (0) is now twice the
value from the Mann-Whitney-Wilcoxon statistic, for distributions symmetric
about 0. This will be used for asymptotic relative efficiency calculations in the
exercises.

5.3 Two-Way Non-Parametric Analysis of Variance

Consider the same data description as in §5.1.2: one observes random variable
Xkli , with k ∈ {1, · · · , K}, l ∈ {1, · · · , L}, and i ∈ {1, · · · , Mkl }. Suppose
the distribution of Xkli are independent, with a distribution that is allowed to
depend on k and l; that is, Xkli ∼ Fkl . We wish to test the null hypothesis that
treatment has no effect, while allowing different blocks to behave differently,
versus the alternative that the distribution depends on treatment as well.
That is, the null hypothesis is that Fkl does not depend on k, although it
may depend on l, and the alternative hypothesis is that for some l, some j
and k, and some x, Fkl (x) 6= Fjl (x). However, the test to be constructed
will not have reasonable power unless this difference is more systematic; that
is, consider alternatives for which for each pair of treatments j and k, either
Fjl (x) ≤ Fjl (x) for all x and all l, or Fjl (x) ≥ Fjl (x) for all x and all l. The
direction of the effect must be constant across blocks. Heuristically, treat k as
indexing treatment and l as indexing a blocking factor.
In order to build a statistic respecting treatment order across blocks, rank
the observations within blocks; that is, let Rkli represent the rank of Xkli
within X1l1 , . . . , XKlMKl , and sum them separately by block and treatment.

5.3.1 Distribution of Rank Sums

In order to use these ranks to detect treatment differences that are in a
consistent direction across blocks, one should consolidatePL rank
PMklsums over
blocks before comparing across treatments. Let Rk.. = l=1 i=1 Rkli and
PL
R̄k.. = Rk.. / l=1 Mkl . Consolidation via averaging, rather than via summing,
implies a choice in weighting the contributions of the various blocks.
Moments of these rank sums may be calculated as an extension of (3.22).
Under the null hypothesis of no treatment effect, the expectation of one rank
is X
K
[R
E kli ] = M jl + 1 /2,
j=1
A Generalization of the Test of Friedman 103

and hence
L X
X K
E [Rk.. ] = Mkl Mjl + 1 /2 . (5.6)
j=1
l=1
Variances of rank sums depend on the covariance structure of the ranks.
Ranks that make up each sum within a block are independent, but sums of
ranks across blocks are dependent. Within a treatment-block cell, Var [Rkl. ]
is the same as for the Mann-Whitney-Wilcoxon statistic:
X XK
Var [Rkl. ] = Mkl Mjl Mjl + 1 /12.
j6=k j=1

By independence across blocks,

L
X X XK
Var [Rk.. ] = Mkl Mjl Mjl + 1 /12 . (5.7)
j6=k j=1
l=1

Covariances may be calculated by comparing the variance of the sum to

the sum of the variances, as in (4.12), to obtain
X
K
Cov [Rkl. , Rml. ] = −[Mkl Mml ] Mjl + 1 /12.
j=1

Since blocks are ranked independently, variances and covariances for rank
sums add across blocks:
L X
X K
Cov [Rk.. , Rm.. ] = − [Mkl Mml ] Mjl + 1 /12. (5.8)
j=1
l=1

5.4 A Generalization of the Test of Friedman

From these rank sums one might calculate separate Kruskal-Wallis statistics
for each block, and, since statistics from separate blocks are independent, one
might add these Kruskal-Wallis statistics to get χ2L(K−1) degree of freedom
statistic under the null hypothesis. Such a statistic, however, is equally sen-
sitive to deviations from the null hypothesis in opposite directions in different
blocks, and so has low power against all alternatives. Interesting alternative
hypotheses involve treatment distributions in different blocks ordered in the
same way.
As with the Kruskal-Wallis test, a statistic is constructed by setting Y =
(R1.. − E0 [R1.. ] , . . . , RK−1.. − E0 [RK−1.. ])> , using (5.6), calculating Var0 [Y ]
from (5.7) and (5.8), and using
−1
WF = Y > Var0 [Y ] Y (5.9)
as the test statistic.
104 Group Differences with Blocking

5.4.1 The Balanced Case

−1
The inverse Var0 [Y ] is tractable in closed form under (5.1), including the
special case with balanced replicates.
Friedman (1937) addresses the case in which Mkl = 1∀k, l. The generaliza-
tion to the balanced case with
replicates
is trivial, and this section assumes
Mkl = M for all k, l. Here E R̄k.. = (KM + 1)/2, and

Var R̄k.. = (K − 1)(KM + 1)/[12L].

Covariances between sets of rank means are

Cov R̄k.. , R̄m.. = −(KM + 1)/[12L]

for k 6= m. In this balanced case, the test statistic (5.9) can be shown to equal
the sum of squares of ranks away from their average value per treatment
group:
XK
WF = 12L [R̄k.. − (M K + 1)/2]2 /[K(KM + 1)].
k=1

This demonstration is as for the Kruskal-Wallis test in §4.2.2. The limiting

distribution of WF is also demonstrated by observing the parallelism between
this covariance structure and the covariance structure of rank √ sums in §4.2.2,
since the covariance structure of Y = (R̄1.. , . . . , R̄K−1.. )/ K has the same
form as (4.12), for some vector ν and constant ω. A relationship like that
of §4.2.2 can then be used to demonstrate that WF is approximately χ2K−1 .
Again, some condition on the sample sizes being large enough to ensure that
the rank averages are approximately multivariate Gaussian is required; for
example, in the case with the same number of replicates per treatment-block
combination, having the number of blocks going to infinity suffices, and in
the balanced case with a fixed number of blocks, having the number of ob-
servations in each block-treatment combination go to infinity suffices. This
requirement is in addition to (5.1) needed to find a generalized inverse for the
variance matrix in closed form, and hence to exhibit the test statistic as a
sum of squares.

Example 5.4.1 Friedman (1937) provides data on variability of ex-

penses in various categories for families of different income levels. This
data is provided in Table 5.2, and at
http://stat.rutgers.edu/home/kolassa/Data/friedman.dat .
This may be read into R using

expensesd<-as.data.frame(scan("friedman.dat",
what=list(cat="",g1=0,g2=0,g3=0,g4=0,g5=0,g6=0,g7=0)))
A Generalization of the Test of Friedman 105

Here L = 14, M = 1, and K = 7. Ranking within groups (for example,

in R, using

expenserank<-t(apply(as.matrix(expensesd[,-1]),1,rank))
rownames(expenserank)<-expensesd[[1]]
gives the table of ranks
g1 g2 g3 g4 g5 g6 g7
Housing 5 1 3 2 4 6 7
Operations 1 3 4 6 2 5 7
Food 1 2 7 3 5 4 6
Clothing 1 3 2 4 5 6 7
Furnishings 2 1 6 3 7 5 4
Transportation 1 2 3 6 5 4 7
Recreation 1 2 3 4 7 5 6
Personal 1 2 3 6 4 7 5
Medical 1 2 4 5 7 3 6
Education 1 2 4 5 3 6 7
Community 1 5 2 3 7 6 4
Vocation 1 5 2 4 3 6 7
Gifts 1 2 3 4 5 6 7
Other 5 4 7 2 6 1 3
Rank sums are 23, 36, 53, 57, 70, 70, 83, and group means (for example,
via

apply(expenserank,2,mean)
gives 1.643, 2.571, 3.786, 4.071, 5.000, 5.000, 5.929. Subtracting the overall
mean, 4, squaring, and adding, gives 13.367. Multiplying by 12L/(K(K +
1)) = 12×14/(7×8) = 3 gives the statistic value WF = 40.10. Comparing
to the χ26 distribution gives a p-value of 4.35×10−7 . This might also have
been done entirely in R using
friedman.test(as.matrix(expensesd[,-1]))
The development of Friedman’s test in the two-way layout remarkably pre-
ceded analogous testing in one-way layouts. When K = 2 and Mkl are all 1,
Friedman’s test is equivalent to the sign test applied to differences within a
block.

5.4.2 The Unbalanced Case

This analysis fails in the unbalanced case, because no expression for the vari-
ance as simple as (4.12) holds, and so the variance matrix cannot be inverted
in closed form. Benard and Elteren (1953) present a numerical method that
uses the rule of Cramer to produce the matrix product involved in T . Pren-
106 Group Differences with Blocking

TABLE 5.2: Expense variability by income group, from Friedman (1937)

Category Gr. 1 Gr. 2 Gr. 3 Gr. 4 Gr. 5 Gr. 6 Gr. 7

Housing 103.30 68.42 89.53 77.94 100.00 108.20 184.90
Operations 42.19 44.31 60.91 73.90 43.87 61.74 102.30
Food 71.27 81.88 100.71 86.52 100.30 90.75 100.60
Clothing 37.59 60.05 56.97 60.79 71.82 83.04 117.10
Furnishings 58.31 52.73 96.04 60.42 104.33 89.78 85.77
Transport 46.27 82.18 129.80 181.00 172.33 164.80 246.80
Recreation 19.00 23.07 38.70 45.81 59.03 50.69 55.18
Personal 8.31 8.43 9.16 14.28 10.63 15.84 12.50
Medical 20.15 33.48 60.08 69.35 114.34 45.28 101.60
Education 3.16 4.12 12.73 18.95 8.89 41.52 66.33
Community 4.12 18.87 8.54 12.92 25.30 19.85 16.76
Vocation 7.68 11.18 10.44 10.95 10.54 13.96 14.39
Gifts 5.29 10.91 11.22 25.26 42.25 48.80 69.38
Other 6.00 5.57 22.23 2.45 6.24 1.00 4.00

tice (1979) performs these calculations in somewhat more generality, and the
general unbalanced two-way test is generally called the Prentice test. Skillings
and Mack (1981) address this question using explicit numerical matrix inver-
sion.

Example 5.4.2 Consider data on weight gain among chickens, given by

Cox and Snell (1981, Example K), and at

http://stat.rutgers.edu/home/kolassa/Data/chicken.dat .
Weight gain after 16 weeks, protein level, protein source, and fish soluble
level. The dependence of weight gain on protein source might be graphi-
cally depicted using box plots (Figure 5.2):

temp1<-temp<-as.data.frame(scan("chicken.dat",what=
list(source="", lev=0,fish=0,weight=0,w2=0)))
temp1$weight<-temp1$w2;temp$h<-0;temp1$h<-1
temp$w2<-NULL;temp1$w2<-NULL
chicken<-rbind(temp,temp1)
attach(chicken)
boxplot(split(weight,source),horizontal=TRUE,ylab="g.",
main="Weight gain for Chickens", xlab="Protein Source")
detach(chicken)
Hence blocking on source is justified. Test for dependence of weight gain
on protein level, blocking on source. Perform these calculations in R using

library(muStat)#Need for Prentice test.

Multiple Comparisons and Scoring 107

attach(chicken)
prentice.test(weight,source,blocks=lev)
detach(chicken)
to obtain the p-value 0.1336.

FIGURE 5.2: Weight Gain for Chickens

Source ............................................
... .. ..
.............................
G ..
................................................
..............................................

6000 6500 7000 7500 8000 8500

Weight Gain (g.)

5.5 Multiple Comparisons and Scoring

One might adjust for multiple comparisons in the presence of blocking in the
same way as was done without blocking, as discussed in §4.5. That is, one
might use the variances
p and covariances to show that under the null hypoth-
esis, (R̄i. − R̄j. )/ K(LK + 1)/6 is approximately Gaussian, and employ the
method of Bonferroni, Fisher’s LSD, or Tukey’s HSD.
One might also use scoring methods, including normal and Savage scores,
as before. Recall also that treating ranks as general scores, with tied observa-
tions given average ranks, makes tie handling automatic.
One might also use scores that are data values. In this case, the test statistic
is similar to that of the Gaussian-theory F test, but the reference distribution
is generated from all possible permutations of the data within blocks and
among treatments.

Example 5.5.1 Refer again to Example 5.4.2. Test dependence of

weight gain on protein source, blocking on level. There are eight ob-
3
servations at each protein level. Cycle through all 84 = 34300 ways
to associate source labels within block, and compute the between sum of
squares each time. The p-value is the number of rearrangements having
this statistic meet or exceed the observed.
library(MultNonParam)
#aov.P requires data sorted by block. Put block ends as the
108 Group Differences with Blocking

#third argument.
chicken<-chicken[order(chicken$lev),]
aov.P(chicken$weight,as.numeric(as.factor(chicken$source)),
c(8,16,24))
The p-value is 0.182.

5.6 Tests for a Putative Ordering in Two-Way Layouts

PK
Recall Friedman’s statistic of (5.9), WF ∝ k=1 [R̄k.. − 1/2(M K + 1)]2 , where
R̄k.. are mean rankings for treatment k when observations are ranked within
blocks. The statistic WF treats deviation of any treatment in any direction
similarly. If one wants to look specifically for alternatives suspected a priori
of being ordered according to the index k, in the balanced case use instead
K
X
TL = kRk.. . (5.10)
k=1

This test was proposed by Page (1963) in the balanced case with one replicate
per block-treatment pair, and is called Page’s test.
For more general replication patterns, the null expectation and variance
for the statistic may be calculated, using (5.6), (5.7), and (5.8), and
K
X
E0 [TL ] = k E0 [Rk.. ]
k=1
K
X K
X K
X
Var0 [TL ] = k 2 Var0 [Rk.. ] + ikCov0 [Ri.. , Rk.. ] .
k=1 i=1 k=1,k6=i

In the balanced case, with Mkl = M ∀k, l, moments simplify to

E0 [TL ] = (K + 1)KLM (KM + 1)/4

 
K
X K
X K
X
Var0 [TL ] = LM 2 (KM + 1)  k 2 (K − 1) − ik 
k=1 i=1 k=1,k6=i
K K
!
X X
= LM 2 (KM + 1) K k2 − ( i)2
k=1 i=1
= K 2 (K + 1)LM 2 (KM + 1) (K − 1) /12,

using (3.20).
Tests for a Putative Ordering in Two-Way Layouts 109

Example 5.6.1 Refer again to the expense variability data of Exam-

ple 5.4.1. Test for an ordered effect of income, treating categories as
blocks. Rank sums are 23, 36, 53, 57, 70, 70, 83, and the statistic is 23 ×
1 + 36 × 2 + 53 × 3 + 57 × 4 + 70 × 5 + 70 × 6 + 83 × 7 = 1833, with ex-
pected value 1568 and variance under the null hypothesis 1829.333. The
two-sided p-value is 2.90 × 10−10 . Page’s test may be performed in R
using

library(crank); page.trend.test(expensesd[,-1],FALSE)

Page (1963) presents only the case with Mkl all 1, and provides a table of
the distribution of TL for small values of K and L. For larger values of K and
L than appear in the table, Page (1963) suggests a Gaussian approximation.
The scores in (5.10) are equally spaced. This is often a reasonable choice
in practice. When the Mkl are not all the same, imbalance among numbers of
ranks summed may change the interpretation of these scores, and a preferred
statistic definition to replace (5.10) is
K
X
TL∗ = k R̄k.. .
k=1

with moments given by

   
L
X XK L
X

E R̄k.. = Mkl  Mjl + 1 /2 / Mkl ,
l=1 j=1 l=1
PL P PK
l=1 {Mkl ( j6=k Mjl )( j=1 Mjl + 1)}
Var R̄k.. = PL ,
12( l=1 Mkl )2
and
  !
L
X XK L
X L
X

Cov R̄k.. , R̄m.. = − [Mkl Mml ]  Mjl + 1 / 12 Mkl Mml .
l=1 j=1 l=1 l=1

Example 5.6.2 Refer again to the chicken weight gain data of Exam-
ple 5.4.2. This data set is balanced, but one could ignore the balance and
numerically invert the variance matrix of all rank sums except the last.
In this case, test for an ordered effect of protein level on weight gain,
blocking on protein source. Perform the test using
library(MultNonParam)
attach(chicken)
cat(’\n Page test with replicates \n’)
page.test.unbalanced(weight,lev,source)
110 Group Differences with Blocking

detach(chicken)

In this balanced case, rank mean expectations are all 6.5, variances are
1.083, covariances are −0.542. The rank means by treatment level are
7.625, 7.250, 4.625, giving an overall statistic value of 36, compared with
a null expectation of 39 and null variance of 3.25; the p-value is 0.096.
Do not reject the null hypothesis.

5.7 Exercises
1. The data set

HTTP://ftp.uni-bayreuth.de/math/statlib/datasets/schizo

reflects an experiment using measurements used to detect

schizophrenia, in both schizophrenic and non-schizophrenic pa-
tients. Various tests are given. Pick those data points with CS in
the second column and compare the first and second gain rations, in
the third and fourth columns, respectively. For this question, select
only non-schizophrenic patients.
a. Test the null hypothesis that first and second gain ratios (in the
third and fourth columns, respectively) have the same median using
the sign test.
b. Test the null hypothesis that first and second gain ratios (in the
third and fourth columns, respectively) have the same median using
the signed rank test.
2. The data set

HTTP://ftp.uni-bayreuth.de/math/statlib/datasets/schizo

contains data from an experiment using measurements used to de-

tect schizophrenia, on non-schizophrenic patients. Various tests are
given. Pick those data points with CS in the second column and
compare the first and second gain ratios, in the third and fourth
columns, by taking their difference. Test the null hypothesis of zero
median difference using the Wilcoxon signed rank test, for the non-
schizophrenic patients.
3. Calculate the asymptotic relative efficiency for the Wilcoxon signed
rank statistic relative to the one-sample t-test (which you should
approximate using the one-sample z-test). Do this for observations
from
a. the distribution uniform on the interval [θ − 1/2, θ + 1/2],
Exercises 111

b. the logistic distribution, symmetric about θ, with variance π 2 /3

and density exp(x − θ)/(1 + exp(x − θ))2 ,
c. and the standard normal distribution, with variance 1 and ex-
pectation θ.
4. The data set at

HTTP://stat.rutgers.edu/home/kolassa/Data/yarn.dat

represents strengths of yarn of two types from six bobbins. This

file has three fields: strength, bobbin, and type. Apply the balanced
variant of Friedman’s test to determine whether strength varies by
type, controlling for bobbin.
5. The data at

HTTP://lib.stat.cmu.edu/datasets/CPS 85 Wages

reflects wages from 1985. The first 42 lines of this file contain a
description of the data set, and an explanation of variables; delete
these lines first, or skip them when you read the file. All fields are
numeric. The tenth field is sector, the fifth field is union member-
ship, and the sixth field is hourly wage; you can skip everything
else. Test for a difference in wage between union members and non-
members, blocking on sector.
6
Bivariate Methods

Suppose that independent random vectors (Xi , Yi ) all have the same joint
density fX,Y (x, y). This chapter investigates the relationship between Xi and
Yi for each i. In particular consider testing the null hypothesis fX,Y (x, y) =
fX (x)fY (y), without specifying an alternative hypothesis for fX,Y , or even
without knowledge of the null marginal densities.
This null hypothesis is most easily tested against the alternative hypothesis
that, vaguely, large values of X are associated with large values of Y (or vice
versa). Furthermore, if the null hypothesis is not true, the strength of the
association between Xi and Yi must be measured.

6.1 Parametric Approach

Before developing nonparametric approaches to assessing relationships be-
tween variables, this section reviews a standard parametric approach, that of
the Pearson correlation (Edgeworth, 1893):
Pn
j=1 (Xj − X̄)(Yj − Ȳ )
rP = qP , (6.1)
n 2
Pn 2
j=1 (Xj − X̄) j=1 (Yj − Ȳ )

derived as what would now be termed the maximum likelihood estimator of

the correlation for the bivariate Gaussian distribution. This gives the slope of
the least squares line fitting Y to X, after scaling both variables by standard
deviation. The Cauchy-Schwartz Theorem says that rP is always in [−1, 1].
Perfect positive or negative linear association is reflected in a value for rP
of 1 or −1 respectively. Furthermore, there are a variety of early exact and
approximate distributional results for rP assuming that the observations are
multivariate Gaussian.

113
114 Bivariate Methods

6.2 Permutation Inference

Even when observations are not multivariate Gaussian, the measure (6.1)
remains a plausible summary of association between variables. This section
presents a distributional result for the Pearson correlation that does not re-
quire knowledge of the multivariate distribution of observations.
Under the null hypothesis that Xj is independent of Yj , for all j, then every
permutation of the Y values among experimental units, with the X values
held fixed, is equally likely. Under this permutation distribution, conditional
on the collections {X1 , . . . , Xn } and {Y1 , . . . , Yn }, the variance of rP may be
calculated directly. Note that rP depends only on the differences between the
observations and their means, and so, without loss of generality, assume that
X̄ = Ȳ = 0.PLet (Z1 , . . . , Zn ) be a random permutation of {Y1 , . . . , Yn }. Then
Var [Z1 ] = i Yi2 /n; denote this by σY2 . Also
X 1
Cov [Z1 , Z2 ] = Yi Yj
n(n − 1)
i6=j
X 1 X 1 1
= Yi Yj − Yi2 = − σY2 .
i,j
n(n − 1) i
n(n − 1) n − 1

So
 
X X X
Var  Xj Zj  = Xj2 Var [Zj ] + Xi Xj Cov [Zj , Zi ]
j j j6=i

[ j,i Xi Xj − i Xi2 ]σY2

P P
X
= Xj2 σY2 −
j
n−1
X X
= Xj2 σY2 + Xi2 σY2 /(n − 1)
j i

= n2 σX
2 2
σY /(n − 1),
2 1 2
P
where σX = i n Xi . So
Var [rP ] = 1/(n − 1), (6.2)
under the permutation distribution (Hotelling and Pabst, 1936). This value
was determined by Student (1908), after fitting a parametric model to em-
pirical data, rounding parameter estimates to values more consistent with
intuition, and calculating the moments of this empirical distribution. Higher-
order moments were determined by David et al. (1951) using a method of
proof similar to that presented above for the variance.
This result suggests a test of the null hypothesis of independence versus
the two-sided alternative at level α using rP , that rejects the null hypothesis
if √
|rP | > zα/2 / n − 1. (6.3)
Nonparametric Correlation 115

6.3 Nonparametric Correlation

The Pearson correlation pr is designed to measure linear association between
two variables, and fails to adequately reflect non-linear association. Further-
more, a family of distributional results, not recounted in this volume, depend
on data summarized being multivariate Gaussian. Various nonparametric al-
ternatives to the Pearson correlation have been developed to avoid these draw-
backs.

6.3.1 Rank Correlation

Instead of calculating the correlation for the original variables, calculate the
correlation of the ranks of the variables (Spearman, 1904). Under the null
hypothesis, each X rank should be be equally likely to be associated with
each Y rank. Under the alternative, extreme ranks should be associated with
each other. Let Rj be the rank of the Y value associated with X(j) . Define
the Spearman Rank correlation as
Pn
j=1 (j − (n + 1)/2)(Rj − (n + 1)/2)
rS = r P ;
P n 2 n 2
j=1 (j − (n + 1)/2) j=1 (Rj − (n + 1)/2)

that is, rS is the Pearson correlation on ranks. Exact agreement for ordering
of ranks results in a rank correlation of 1, exact agreement in the opposite
direction results in a rank correlation of -1, and the Cauchy-Schwartz theorem
indicates that these two vales are the extreme possible values for the rank
correlation.
The sums of squares in the denominator have the same value for every
data set, and the numerator can be simplified. Note that
n
X n
X
(j − (n + 1)/2)2 = j 2 − n(n + 1)2 /4
j=1 j=1

= n(n + 1)(2n + 1)/6 − n(n + 1)2 /4

= n n2 − 1 /12.

Pn Pn
Similarly, j=1 (Rj − (n + 1)/2)2 = n n2 − 1 /12. Furthermore, j=1 (j −
(n + 1)/2)(n + 1)/2 = 0, and
 
n n
12 X 12 X
rS = (j − (n + 1)/2)Rj =  jRj − n(n + 1)2 /4 .
n(n2 − 1) j=1 n(n2 − 1) j=1
(6.4)
Hoeffding (1948) provides a central limit theorem for the permutation
116 Bivariate Methods

distribution for both rP and rS , including under alternative distributions.

√ P-
values may be approximated by dividing the observed correlations by n − 1,
and comparing to a standard Gaussian distribution, but this approximation
has poor relative behavior for small test levels. Best and Roberts (1975) cor-
rect the Gaussian approximation using an Edgeworth approximation (Kolassa,
2006); that is, they determine constants κ1 , κ2 , κ3 , and κ4 such that
" # !
rs − κ1 κ3 h2 (r) κ23 h5 (r) κ4 h3 (r)
P p ≤ r ≈ Φ(r) − φ(r) 3/2 √
+ + , (6.5)
κ2 /n κ 6 n 72κ32 n 24κ22 n
2

with the maximum error in equation (6.5) bounded by a constant divided

by n3/2 . Here hj (r) are known polynomials called Hermite polynomials, and
constants κj are related to the moments of rs . When κ3 is calculated for a
symmetric distribution, its value is zero, and approximation (6.5) effectively
contains only its first and last terms. This is the case for many applications
of (6.5) to rank-based statistics, including the application to rs .

Example 6.3.1 Consider again the twin brain data of Example 5.2.1,
plotted in Figure 6.1 As before, the data set brainpairs has 10 records,
reflecting the results from 10 pairs of twins, and is plotted in Figure 6.1
via
attach(brainpairs); plot(v1,v2, xlab="Volume for Twin 1",
ylab="Volume for Twin 2",main="Twin Brain Volumes")
Ranks for twin brains are given in Table 6.1. The sum in the second factor
of (6.4) is

1×1+4×2+9×9+5×5+3×4+6×7+7×3+10×10+2×6+8×8 = 366,

and so the entire second factor is 366 − 10 × 112 /4 = 63.5, and the Spear-
man correlation is (12/(10 × 99) × 63.5 = 0.770. Observed correlations
may be calculated in R using

cat(’\n Permutation test for twin brain data \n’)

attach(brainpairs)
obsd<-c(cor(v1,v2),cor(v1,v2,method="spearman"))
to obtain the Pearson correlation (6.1) 0.914 and the Spearman corre-
lation (6.4) 0.770. Testing the null hypothesis of no association may be
performed using a permutation test with either of these measures.
out<-array(NA,c(2,20001))
dimnames(out)[[1]]<-c("Pearson","Spearman")
for(j in seq(dim(out)[2])){
newv1<-sample(v1)
Nonparametric Correlation 117

out[,j]<-c(cor(newv1,v2),cor(newv1,v2,method="spearman"))
}
cat("\n Monte Carlo One-Sided p value\n")
apply(apply(out,2,">=",obsd),1,"mean")
to obtain p-values 1.5 × 10−4 and 6.1 × 10−3 . The asymptotic critical
value, from (6.3), is given by

cat("\n Asymptotic Critical Value\n")

-qnorm(0.025)/sqrt(length(v1)-1)
which gives 0.6533. Permutation tests based on either correlation method
reject the null hypothesis.
c(cor.test(v1,v2)$p.value,
cor.test(v1,v2,method="spearman")$p.value)
detach(brainpairs)
giving p-value approximations 2.1 × 10−4 and 1.37 × 10−2 respectively.

FIGURE 6.1: Twin Brain Volumes

1400
◦

1300
◦

◦
1200
Volume ◦
for
Twin 2
1100 ◦
◦
◦ ◦

1000
◦

900
1000 1100 1200 1300 1400 1500
Volume for Twin 1
118 Bivariate Methods

TABLE 6.1: Twin brain volume ranks

Pair 1 2 3 4 5 5 7 8 9 10
Rank of First 1 4 9 5 3 6 7 10 2 8
Rank of Second 1 2 9 5 4 7 3 10 6 8

Note that
 
n
X n
X X
jRj = j 1 + I(Yj > Yi , Xj > Xi )
j=1 j=1 i6=j
n
X X
= n(n + 1)/2 + j I(Yj > Yi , Xj > Xi ).
j=1 i6=j

Some algebra shows that

n
6 X
rS = 1 − S for S = (j − Rj )2 . (6.6)
n(n2 − 1) j=1

This relationship will be exploited to give probabilities for rS in §6.4.1.3.

Pearson (1907) criticizes the Spearman correlation, in part on the grounds
that, as it is designed to reflect association even when the association is non-
linear, it also loses the interpretation as a regression parameter, even when
the underlying association is linear.

6.3.1.1 Alternative Expectation of the Spearman Correlation

Under the null hypothesis of independence between Xj and Yj , for all j,
E [rS ] = 0, and the variance is given by (6.2). Under the alternative hypothesis,
n
12 X X
E [rS ] = 2
(n(n + 1)/2 + j p1 − n(n + 1)2 /4)
n(n − 1) j=1 i6=j
12
= ((n − 1)n(n + 1)p1 /2 − n(n2 − 1)/4) = 3(2p1 − 1)
n(n2 − 1)
for
p1 = P [Yj > Yi , Xj > Xi ] . (6.7)

6.3.2 Kendall’s τ
Pairs of bivariate observations for which the X and Y values are in the
same order are called concordant; (6.7) refers to the probability that a pair
is concordant. Pairs that are not concordant are called discordant. Kendall
(1938) constructs a new measure based on counts of concordant and discor-
dant pairs. Consider the population quantity τ = 2p1 − 1, for p1 of (6.7),
Nonparametric Correlation 119

called Kendall’s τ ; to emphasize the parallelism between this measure and

other measures of association between two random variables, it will also be
referred to below as Kendall’s correlation measure. This quantity reflects the
probability of a concordant pair minus the probability of a discordant pair.
Denote the number of concordant pairs by
X
U= Zij ,
i<j

for Zij = I((Xj −Xi )(Yj −Yi ) > 0). Then U † = n(n−1)/2−U is the number of
discordant pairs; this number equals the number of rearrangements necessary
to make all pairs concordant. Estimate τ by the excess of concordant over
discordant pairs, divided by the maximum:
U − (n(n − 1)/2 − U ) 4U
rτ = = − 1. (6.8)
n(n − 1)/2 n(n − 1)
Note that E [U ] = n(n − 1)p1 /2, for p1 of (6.7), and E [rτ ] = 2p1 − 1. The null
value of p1 is half, recovering
E0 [U ] = n(n − 1)/4, E0 [rτ ] = 0. (6.9)
As with the Spearman correlation rS , Kendall’s correlation rτ may be used
to test the null hypothesis of independence, relying on its asymptotic Gaussian
distribution, but the test requires the variance of rτ . Note that
X ∗
X
Var [U ] = Var [Zij ] + Cov [Zij , Zkl ] .
i<j
P∗
Here the sum over three distinct indices i < j, k < l. This sum consists
of n2 (n − 1)2 /4 − n(n − 1)/2 − n(n − 1)(n − 2)(n − 3)/4 = n(n − 1)(n − 2)
terms. Hence
X ∗
X
Var [U ] = (p1 − p21 ) + (p3 − p21 )
i<j

= n(n − 1)(p1 − p21 )/2 + n(n − 1)(n − 2)(p3 − p21 ),

where
p3 = P [(X1 − X2 )(Y2 − Y1 ) ≥ 0, (X1 − X3 )(Y3 − Y1 ) ≥ 0] .
5
The null value of p3 is 18 , as can be seen by examining all 36 pairs of
permutations of {1, 2, 3}. Hence
Var0 [U ] = n(n − 1)/8 + n(n − 1)(n − 2)/36 = n(n − 1)(5 + 2n)/72, (6.10)
and
Var0 [rτ ] = 2(2n + 5)/(9n(n − 1)). (6.11)
The result of Hoeffding (1948) also proves that rτ is approximately Gaus-
sian, including under alternative distributions. El Maache and Lepage (2003)
discuss the multivariate distribution of both rτ and rS from collections of
variables.
120 Bivariate Methods

Example 6.3.2 Consider again the twin brain data of Example 5.2.1,
with ranks in Table 6.1. Discordant pairs are 2 – 5, 2 – 9, 4 – 7, 4 – 9,
5 – 7, 5 – 9, 6 – 7, and 7 – 9. Hence 8 of 45 pairs are discordant, and
the remainder, 37, are concordant. Hence rτ = 4 × 37/90 − 1 = 0.644,
from (6.8). This may be calculated using
cor(brainpairs$v1,brainpairs$v2,method="kendall")

in R. A test of the null p hypothesis of no correlation mayp be performed

by calculating z = rτ / 2(2n + 5)/(9n(n − 1)) = 0.644/ 2 × 25/810 =
0.644/0.248 = 2.60; reject the null hypothesis of no association for a two-
sided test of level 0.05. Function Kendall in package Kendall in R uses
a modification of the method of Best and Roberts (1975).

Figure 6.2 shows two artificial data sets with functional relationships be-
tween two variables. In the first relationship, the two variables are identical,
while in the second, the second variable is related to the arc tangent of the
first. Both variables relationships show perfect association. The first relation-
ship represents perfect linear association, while the second reflects perfect
nonlinear association. Hence the Spearman and Kendall association measures
are 1 for both relationships; the Pearson correlation is 1 for the first relation-
ship, and 0.937 for the second relationship.

FIGURE 6.2: Plot of Linear and Curved Relationships

10

♦ ♦ ♦ ♦
♦
♦ ♦
♦
♦
5
♦

y 0 ♦

Symbol Relationship
♦ y=x
-5
♦ ♦ y = 5 arctan(x)
♦
♦
♦ ♦ ♦
♦ ♦ ♦

-10

-10 -5 0 5 10
x
Bivariate Semi-Parametric Estimation via Correlation 121

6.4 Bivariate Semi-Parametric Estimation via Correla-

tion
Assume a linear model Yj = β1 + β2 Xi + ei , where ei has some distribution
with median 0 (making β1 identifiable). Setting the correlation between Yj −
β2 Xi and Xi to zero, for any of the preceding correlation measures, may be
used to estimate β2 . Furthermore, one might obtain a confidence interval by
inverting test of the null hypothesis that the residuals are independent of the
explanatory variables. This approach could be addressed using the Pearson,
Spearman, or Kendall correlation. Using the Pearson approach gives the
traditional least-squares estimate for β2 , and approximating the distribution
of the errors as Gaussian leads to standard Gaussian theory inference.

6.4.1 Inversion of the Test of Zero Correlation

One might invert the test using correlation between Xi and Yi − β2 Xi ,
as a function of β2 , using any of the three preceding correlation methods,
generically called r. That is, let T (β2 ) = r(X, Y − β2 X), and note that
this function is monotonic in β2 . The preceding paragraph discussed this
approach for the Pearson correlation; now consider applying this approach
to more general correlation measures. First, calculate t◦L and t◦U , such that
P [t◦L < T (β2 , data) < t◦U ] < 1−α, exactly as in (1.19). Then solve T (β2 ) = t◦L ,
r(β2 ) = t◦U .
When inverting correlation, inference on β1 is impossible, because it washes
out of ranks. One might estimate β1 by setting the median of the residuals to
zero, as is described below; some authors produce confidence intervals for β1
via the bootstrap of Chapter 10.

6.4.1.1 Inversion of the Pearson Correlation

As noted above, the fully-parametric analysis, assuming bivariate Gaussian
data, and using rP , reduces to the standard inference from ordinary least
squares regression. Inference under the permutation distribution avoids dis-
tributional assumptions on the data apart from independence.
Both determining critical values for rP , and inverting the test statistic as
a function of β2 to give confidence bounds, are tricky for Pearson correlation,
under the permutation distribution. Exact (non-Monte Carlo) probability cal-
culations for the permutation distribution of the Pearson correlation are as
difficult as enumerating all permutations. These calculations are often simpli-
fied by rounding data values to a lattice. Alternatively, since the null mean
and variance are known, and since the distributions of these measures are
approximately Gaussian, quantiles could be approximated using the Gaussian
distribution. Also, rP has steps of irregular size, and so translating qα/2 and
q1−α/2 into values of β2 is difficult.
122 Bivariate Methods

6.4.1.2 Inversion of Kendall’s τ

Calculations are easier for Kendall’s τ . Best and Gipps (1974), using the al-
gorithm of Kaarsemarker and van Wijngaarden (1953), present software for
calculating all of the probability atoms, and hence quantiles, exactly for sam-
ples as large as 8, and with an Edgeworth approximation for larger samples.
They note that, for c(n, u) the count of permutations of Y1 , . . . , Yn with u
concordant pairs, then
u
X
c(n, u) = c(n − 1, s), for n > 1, n(n − 1)/2 ≥ u ≥ 0, (6.12)
s=u−n+1
c(n, u) = 0 for u < 0, c(0, 0) = 1,
c(n, u) = 0 for u > n(n − 1)/2. (6.13)

Relation (6.12) is motivated by considering the possible effects of adding a final

observation for a data set with n − 1 observations. Furthermore, P [U = u] =
c(n, u)/n!. Derivation of the recursion is similar to that of (3.17).
A recursion for counts of permutations satisfying U ≤ u is identical, except
that
c(n, u) = 1 for u > n(n − 1)/2
replaces (6.13).
Quantile 1 − α of U can be calculated by inverting P [U ≤ u], for sample
sizes corresponding to feasible exact calculations, or via
p
E0 [U ] + zα Var0 [U ],

for moments given by (6.9) and (6.10). Jumps in rτ , as a function of β2 ,

occur at pairwise slopes Wij = (Yj − Yi )/(Xj − Xi ), since at these points the
pairs (Xi , Yi − β2 Xi ) and (Xj , Yj − β2 Xj ) change from being concordant to
discordant (Theil, 1950), and so an estimate β̂2 of β2 is given by the median of
these slopes, and the confidence interval endpoints are given as in §3.10.1. This
last operation is more easily performed using the number of concordant pairs
U , as in (6.8). Hence if t◦ satisfies P0 [U ≤ t◦ ] ≥ α/2 and P0 [U ≥ t◦ ] ≥ 1−α/2,
and if W̃1 , . . . , W̃n(n−1)/2 are the ordered values of the pairwise slopes Wij ,
then (W̃t◦ , W̃n(n−1)/2+1−t◦ ) is a 1 − α confidence interval for β2 . As noted
above, the intercept β1 may be estimated as that value β̂1 to give the residuals
Yj − β̂1 − β̂2 Xj zero median; that is,
h i
β̂1 = smed Y1 − β̂2 X1 , . . . , Yn − β̂2 Xn . (6.14)

Generally, β̂1 6= smed [Y1 , . . . , Yn ] − β̂2 smed [X1 , . . . , Xn ]. Sen (1968) investi-
gates this procedure further.
This approach to estimation of β2 only holds in contexts with no ties
among the explanatory variable values.
Bivariate Semi-Parametric Estimation via Correlation 123

Example 6.4.1 Sen applies this procedure to the data set

1 2 3 4 10 12 18
9 15 19 20 45 55 78
The following commands calculate a confidence interval for the slope pa-
rameter:
tt<-c(1,2,3,4,10,12,18); xx<-c(9,15,19,20,45,55,78)
out<-rep(NA,length(tt)*(length(tt)-1)/2)
count<-0
for(ii in seq(length(tt)-1)) for(jj in (ii+1):length(tt)){
count<-count+1
out[count]<-(xx[jj]-xx[ii])/(tt[jj]-tt[ii])
}
There are 7 × 6/2 = 21 pairwise slopes W̃j :
1.00 2.50 3.67 3.71 3.75 3.83 3.93 3.94 4.00 4.00
4.00 4.00 4.06 4.12 4.14 4.17 4.18 4.38 5.00 5.00 6.00

The estimate of β2 in the regression model med [Xj ] = β1 + β2 tj is the

median of these pairwise slopes, 4. Confidence intervals may be exhibited
using quantiles of the number of concordant pairs:
library(MultNonParam); qconcordant(0.025,7)

giving 4 as the quantile. Hence the 0.95 confidence interval is (W̃4 , W̃18 ) =
(3.71, 4.38). These calculations may be performed using theil(xx,tt). It
may also be performed using theilsen(xx~tt) from the package deming,
which also gives a confidence interval for the intercept term. The intercept
is estimated by (6.14). Plotting may be done via

library(deming); plot(tt,xx);abline(theilsen(xx~tt))
Results are in Figure 6.3.

6.4.1.3 Inversion of the Spearman Correlation

Calculation for the Spearman correlation rS is more complicated. van de
Wiel and Di Bucchianico (2001) provide tools for the exact calculation of
the null sampling distribution of the Spearman correlation rS ; this algo-
rithm is implemented in the R package pspearman, which gives probabili-
ties for the random variable S of (6.6). Closed-form inversion of the function
T (β2 ) = rS (X, Y − β2 X) is more difficult.
124 Bivariate Methods

FIGURE 6.3: Theil-Sen Estimator for Artificial Example

....
.......
80 .......
.......
...
........◦
.
.......
......
.......
70 ...
.
.......
.......
......
.......
.......
60 ......
...
.........
.
...
.......
..◦
.......
.......
50 .
......
..
........
.
.......◦
.......
.......
t 40 .....
.
...
........
.......
.......
......
.......
30 .
.......
..
...
.
......
.......
.......
....◦
.
20 .
......
◦
.......
..
..

◦.......
.......
..
........
10 ..
.......◦
.......
.
.
.......
.......
0
0 5 10 15 20
x

Example 6.4.2 Consider data on changes in systolic and diastolic blood

pressure after treatment by an angiotensin-converting enzyme inhibitor
(Cox and Snell, 1981, Example E), given at
http://www.stat.ucla.edu/ rgould/datasets/bloodpressure.dat .
We calculate the estimates and confidence intervals for the linear depen-
dence of change in diastolic blood pressure, Yi , on the change in systolic
blood pressure, Xi . Pearson, Spearman, and Kendall correlations between
change in systolic blood pressure, and the change in diastolic blood pres-
sure, are 0.105, 0.144, and 0.067 respectively. The three correlations
between Xi and Yi − β2 Xi , as a function of the slope, are plotted using
bp<-as.data.frame(scan(’bloodpressure.dat’,skip=1,
what=list(spb=0,spa=0,spd=0,dpb=0,dpa=0,dpd=0)))
library(NonparametricHeuristic)
ci<-invertcor(bp$spd,bp$dpd)
to generate Figure 6.4. The relationship for the Pearson correlation is
smooth; the relationships for the Spearman and Kendall correlations are
step functions. Horizontal lines corresponding to asymptotic Gaussian
critical values for the test of zero correlation are plotted; the permutation
variance (6.2) is used for the Pearson and Spearman correlation. The two
lines are slightly offset in the picture to display both lines. The Kendall
correlation uses variance (6.11). Vertical lines pass through the point at
Bivariate Semi-Parametric Estimation via Correlation 125

which the horizontal lines cross the correlation curve, and their intersec-
tions with the horizontal axis determine the end points of the regression
confidence interval. Recall that the Theil estimator is the inversion of the
Kendall correlation. Figure 6.5 displays the results of
attach(bp)
plot(spd,dpd,main="Blood Pressure Change", ylab="Diastolic",
xlab="Systolic")
library(deming)#For theilsen
tsout<-theilsen(dpd~spd)
abline(tsout); abline(lm(dpd~spd),lty=2)
legend(median(spd),median(dpd),lty=1:2,
legend=c("Inversion of tau","Least squares"))
detach(bp)

producing regression lines.

FIGURE 6.4: Construction of Regression Estimates for Blood Pressure Data

.... ..... . ... ...
... ... . .... ...
1 .. ....
... ...
. ... ...
. .... ...
... ... Pearson
. ...
. . .... .... ...........................................
......................... .... .... . .... ...
.
. . . . ... ... . . . . . . .
..
.
. . . ............
...
. .........
.......
.
.... .....
.
. .
. .... ....
Spearman
. .
.. . .......... .. ... . . .... ... ..................................
...............
.........
...........
. . ........
. .
..... . .. ..
.
. ...
.
. . . .... .... Kendall
..
...
..... .. ..
.. . . . ... ...
..............................................................................................................................................................................................................................................................................................................................
....................... .... . .... ....
0.5 ....... ..... ... .
.................... . . . ... ...
.
.....................................................................................................................................................................................................................................................
.
. .
. .
. .
.. ............. . ... ...
... ... .......... . . .... ...
... .... .............
... ... ... .. . ... ...
... ... ..... .
... . . .... ...
... ..... ......... . .... ....
... ... ...... .
... .. ......... . .... ....
... .... . ... ..
Statistic ... .... . .
... ...
... .... . .... ....
Value 0 ... ...
.... .....
.... .....
.... .. . .... ...
... .... .... ..... . .... ....
... ... ... ....
... .... ... ..... . ... ...
... ... ... ...... . ... ..
... .... .... ... . .
... ......... .... ....
... .... ..... ... .. ..
... .. . .... ... ... ..
. ....
. ... ...............
....................................................................................................................................................................................................................................................
. ... .. . ...........
.... .... . ....... ... ... ...
... .... ........... ... .....
-0.5 . . .
.............................................................................................................................................................................................................................................................................................................................................
. ............
.... ....
. . ..... ............ ..
... .... .
. .
.
. .. . ...... . ..
..
. . ...................
... ... . .... .... .......
........ . .
... ... ........ . . .
... .... . ... ... ..
..
......... . . .
... ... . .... ... ...........
............. . . . .
... .... . .... .... ................
... .... . ... ...
....................
....
... .. .
... .... . ... ...
... .... . .... ...
-1 ... ....
... ...
. . .. .
. .
.
. ... ..

-0.5 0 0.5 1 1.5 2

Regression
Parameter
Statistic Quantiles are approximated using Normal
126 Bivariate Methods

FIGURE 6.5: Blood Pressure Data

5 ◦ ...
......
....... .
.......
◦ ................
....
......
......
0 ...
..
......
...... ◦
.....
.....
. ......
. ......
◦ . .....
............
◦ . .....◦
..
. ......
-5 . .....
. ......
. ...........
◦
.
. ............
.
Diastolic ◦ . ......
. .....
. ......
. .....
Blood . ............
Pressure -10
. .....
.
. ........
. ........◦
Change ◦ . . . ..............
.............. Inversion of tau
. ..........
.
.
. ...........
. ......
.
. . . Least squares
-15 . ......
.
. .
........
.
◦. .
......
.
. ......
. ......
. ......
.
.
.
...
....... ◦
. .... ◦
. ......
. ......
-20 .
.
....
........
...... ◦
......
......
◦

-25
-35 -30 -25 -20 -15 -10 -5 0
Systolic Blood Pressure Change

6.5 Exercises
1. The data set

HTTP://ftp.uni-
bayreuth.de/math/statlib/datasets/federalistpapers.txt

represents the result from an analysis of a series of docu-

ments. The first column gives document number, the second
gives the name of a text file, the third gives a group to
which the text is assigned, the fourth represents a measure of
the use of first person in the text, the fifth presents a mea-
sure of inner thinking, the sixth presents a measure of posi-
tivity, and the seventh presents a measure of negativity. There
are other columns that you can ignore. (The version at Statlib,
above, has odd line breaks. A reformatted version can be found
at stat.rutgers.edu/home/kolassa/Data/federalistpapers.txt). Plot
use of positivity versus negativity. Calculate the three correlation
measures, and relate their relative values to the shape of the curve.
a. Test at α = .05 the null hypothesis of zero population Spearman
correlation for these two variables, versus the general alternative.
Exercises 127

b. Give an estimate and a 95% confidence interval for the slope of

Positivity on Negativity.
2. The data set

HTTP://stat.rutgers.edu/home/kolassa/Data/twinbrain.dat

reflects brain volumes of twins. Using the method of Theil, estimate

the slope parameter in the regression of second brain volume on the
first brain volume, and give a confidence interval.
3. The data set

HTTP://ftp.uni-bayreuth.de/math/statlib/datasets/schizo

reflects an an experiment using measurements used to detect

Suppose that one observes n subjects, indexed by i ∈ {1, . . . , n}, and, for
subject i, observes responses Xij , indexed by j ∈ {1, . . . , J}. Potentially, co-
variates are also observed for these subjects.
This chapter explores explaining the multivariate distribution of Xij in
terms of these covariates. Most simply, these covariates often indicate group
membership.

7.1 Standard Parametric Approaches

When data vectors may be treated as approximately multivariate Gaussian,
the following standard techniques may be applied.

7.1.1 Multivariate Estimation

Often one wants to learn about a the vector of population central values for
each of the j responses on the various subjects. In this section, assume that
the vectors are independent and identically distributed.
Standard parametric analyses presuppose distributions of data well-enough
behaved that location can be well-estimated using a sample mean. Denote
the meanPn by the vector X̄, where component j of this vector is given by
X̄j = i=1 Xij /n. Then X̄ is the method of moments estimator for µ =
E [Xi ]. Assumptions guaranteeing that X̄ has an asymptotically Gaussian
distribution generally include the existence of some moment of the distribution
greater than the second moment.

7.1.2 One-Sample Testing

In this section, consider testing the null hypothesis that the vector µ of
expectations takes on some value specified in advance; without loss of gen-
erality, take this value to be 0. Still assuming that the observations have
a multivariate Gaussian distribution, then X̄ is approximately multivariate
Gaussian. First consider the case in which one knows the variance matrix
Σ = Var [(Xi1 , . . . , XiJ )], and assume that Σ is nonsingular. Then one can

129
130 Multivariate Analysis

use as a test statistic

T 2 = X̄ > (Σ/n)−1 X̄, (7.1)
and its distribution under the null hypothesis is χ2J .
Dropping the multivariate Gaussian assumption, if Σ is unknown, and if
one can estimate it as Σ̂ using the usual sum of squares, then

T 2 = X̄ > (Σ̂/n)−1 X̄ (7.2)

has an F distribution, with J numerator degrees of freedom (Hotelling, 1931).

If the distribution
h of (Xi1 , i. . . , XiJ ) has a density and a nonsingular variance
matrix, then P Σ̂ singular = 0. If Σ unknown, and is best estimated by a
nonsingular Σ̃, which is other than the sum of squares estimator, then gener-
ally (7.2) is approximately χ2J . These techniques require that X̄ is approximate
multivariate Gaussian. This assumption is stronger than the assumption that
each margin is univariate Gaussian; a simulated example is given in Figure 7.1.

7.1.3 Two-Sample Testing

Suppose that observations may be divided into two groups of sizes M1 and M2 ,
with the group for observation i indicated by gi ∈ {1, 2}. Test the null hypoth-
esis that the multivariate distributions in the two groups are identical; note
that this implies identical variance matrices. Let X̄k be the Pvector of sample
means for observations in group k, with components X̄kj = i|gi =k Xij /Mk .
Let Σ̂k,j,j 0 be the sample covariance for the group k values between responses
j and j 0 : Σ̂k,j,j 0 = i|gi =k (Xij − X̄k )(Xij 0 − X̄j0 )/(Mk − 1). Let Σ̂j,j 0 be the
P

pooled sample covariance for the all observations: Σ̂j,j 0 = ((M1 − 1)Σ̂1,j,j 0 +
(M2 − 1)Σ̂2,j,j 0 )/(M1 + M2 − 2). Then the Hotelling two-sample statistic

T 2 = (X̄1 − X̄2 )> ((1/M1 + 1/M2 )Σ̂)−1 (X̄2 − X̄2 ) (7.3)

measures the difference between sample mean vectors, in a way that accounts
for sample variance, and combines the response variables. Furthermore, un-
der the null hypothesis of equality of distribution, and assuming that this
distribution is multivariate Gaussian,
M1 + M2 − J − 1 2
T ∼ FJ,M1 +M2 −J−1 .
(M1 + M2 − 2)J

7.2 Nonparametric Multivariate Estimation

In the absence of such parametric assumptions, one might instead measure
location using the multivariate median.
Nonparametric Multivariate Estimation 131

FIGURE 7.1: Univariate Normal Data that are Not Bivariate Normal

4 3 ◦
◦
◦
3 ◦
2 ◦◦
◦◦
◦◦
◦
◦ ◦◦◦
◦
2 ◦ 1 ◦
◦
◦◦◦◦◦
◦
◦ ◦◦
◦
◦
◦ ◦◦
◦
◦◦ ◦
◦
◦◦◦
◦◦◦ ◦◦◦
1 ◦
◦
◦◦◦◦◦ 0 ◦◦
◦◦◦◦◦◦
◦◦
◦◦
◦◦
◦
◦ ◦◦◦
◦
◦
◦ ◦◦
◦
◦◦◦◦ ◦◦
◦
◦◦
◦◦◦◦◦ ◦
◦
◦
◦
0 ◦◦◦◦◦◦◦◦ -1 ◦
◦
◦◦◦◦◦◦
◦◦◦ ◦◦
◦◦◦
◦◦
◦
◦◦
◦ ◦
◦◦
◦◦
◦ ◦◦
-1 ◦
◦◦◦◦ -2 ◦◦
◦◦ ◦◦
◦◦
◦◦
◦◦◦
-2 ◦◦ -3 ◦
◦

-3 -4
-3 -2 -1 0 1 2 3 -3 -2 -1 0 1 2 3
4 ◦
◦ ◦
4
◦ ◦

2 ◦◦
◦ ◦
◦◦
◦◦◦
◦
2 ◦◦
◦
◦ ◦
◦
◦◦◦◦ ◦
◦
◦◦ ◦
◦
◦
◦◦
◦
◦◦ ◦◦◦
0 ◦
◦
◦
◦
◦
◦
◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦
◦
◦
◦
◦
◦
◦
◦
◦◦
◦
◦◦
◦
◦
◦◦
◦ 0 ◦
◦
◦◦◦
◦
◦◦
◦
◦◦
◦
◦ ◦
◦
◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦◦
◦
◦
◦
◦
◦◦ ◦
◦◦◦ ◦◦◦
◦◦◦ ◦◦◦
◦
-2 -2 ◦
◦◦
◦◦ ◦◦
◦
◦◦
◦
-4 -4
-3 -2 -1 0 1 2 3 -3 -2 -1 0 1 2 3

Define the multivariate population median ν to be the vector of univariate

medians, as defined in §2.3.1. An estimator smed [X1 , . . . , Xn ] of the popula-
tion multivariate median may be constructed as the vector of whose compo-
nents are the separate marginal sample medians; that is, smed [X1 , . . . , Xn ] =
(smed [X11 , . . . , X1n ] , . . . , smed [XJ1 , . . . , XJn ]).
Alternatively, one might define smed [X1 , . . . , Xn ] so to minimize the sum
of distances from the median:
n X
X J
smed [X1 , . . . , Xn ] = argmin |Xij − ηj |; (7.4)
η
i=1 j=1

that is, the estimate minimizes the sum of distances from data vectors to
the parameter vector, with distance measured by the sum of absolute val-
132 Multivariate Analysis

ues of component-wise differences. Because one can interchange the order of

summation in (7.4), the minimizer in (7.4) is the vector of component-wise
minimizers. Furthermore, the minimizer for each component is the traditional
univariate median as above.
A summary of multivariate median concepts is given by Small (1990).

7.2.1 Equivariance Properties

In the univariate case (that is, J = 1), both the mean and the median are
equivariant with respect to affine transformations of the raw data, as seen in
§2.1.1 and §2.3.1. Equivariance to affine transformations in the multivariate
case holds for the mean: for a vector a and a matrix B with J columns, and
for Yj = a + BXj for all j, then Ȳ = a + B X̄. A similar equality fails to hold
for smed [X1 , . . . , Xn ] and smed [Y1 , . . . , Yn ], unless B is diagonal; hence the
multivariate median is not equivariant under affine transformations.

7.3 Nonparametric One-Sample Testing Approaches

Consider a null hypothesis stating that the marginal median vector ν takes
on a value specified in advance; without loss of generality, take this to be zero.
In the multivariate Gaussian context, the statistic (7.2) represents the com-
bination of separate location test statistics for the various components of the
random vectors, and its distribution depends on multivariate normality of the
underlying observations; an analogous statistic combining the various dimen-
sions of Xi that does not depend on parametric assumptions is constructed
in this section.
A nonparametric hypothesis test can be constructed by assembling
component-wise nonparametric statistics into a vector T , analogous to X̄,
and centered so that E0 [T ] = 0. One might combine sign test statistics, or
signed-rank statistics if one assumes symmetry, often in the context of paired
data. That is,
n
X
Tj (X1j , . . . , Xnj ) = s̃(Xij > 0), (7.5)
i=1
(
1 for u > 0
for s̃(u) = . Or, define Rij (X) to be the marginal rank of
−1 for u < 0
|Xij | among {|X1j |, . . . , |Xnj |}, and set
n
X
Tj (X1j , . . . , Xnj ) = Rij (X)s̃(Xij > 0). (7.6)
i=1
Nonparametric One-Sample Testing Approaches 133

A multivariate test statistic is constructed as a vector of univariate statistics,

T (X) = (T1 (X11 , . . . , Xn1 ), . . . , TJ (X1J , . . . , XnJ ))> .

Then combine components of T from (7.5) to give the multivariate sign test
statistic, or from (7.6) to give the multivariate sign rank test. In either case,
components are combined using

W = T > Υ−1 T (7.7)

for Υ = Var0 [T ]. As in §2.3, in the case that the null location value is 0, the
null distribution for the multivariate test statistic is generated by assigning
equal probabilities to all 2n modifications of the data set by multiplying the
rows (Xi1 , . . . , XiJ ) by +1 or −1. That is, the null hypothesis distribution of
T (X) is generated by placing probability 2−n on all of the 2n elements of

X = {X̃an n × J matrix|(X̃i1 , . . . , X̃iJ ) = ±(Xi1 , . . . , XiJ )}. (7.8)

Test statistics (7.1) and (7.2) arose as quadratic forms of independent and
identically distributed random vectors, and the variances included in their
definitions were scaled accordingly. Statistic (7.3) is built using a more com-
plicated variance; this pattern will repeat with nonparametric analogies to
parametric tests.
Combining univariate tests into a quadratic form raises two difficulties. In
previous applications of rank statistics, that is, in the case of univariate sign
and signed-rank one-sample tests, in the case of two-sample Mann-Whitney-
Wilcoxon tests, and in the case of of Kruskal-Wallis testing, all dependence
of the permutation distribution on the original data was removed through
ranking. This is not the case for T , since this distribution involves correla-
tions between ranks of the various response vectors. These correlations are not
specified by the null hypothesis. The separate tests are generally dependent,
and dependence structure depends on distribution of raw observations. The
asymptotic distribution of (7.7) relies on this dependence via the correlations
between components of T . The correlations must be estimated.
Furthermore, the distribution of W of (7.7) under the null hypothesis is
dependent on the coordinate system for the variables, but, intuitively, this
dependence on the coordinate system might be undesirable. For example,
suppose that (X1i , X2i ) has an approximate multivariate Gaussian distribu-
tion, with expectation µ, and variance Σ, with Σ known. Consider the null
hypothesis H0 : µ = 0. Then the canonical test is (7.1), and it is unchanged
if the test is based on (Ui , Vi ) for Ui = X1i + X2i and Vi = X1i − X2i , with
Σ modified accordingly. Hence the parametric analysis is independent of the
coordinate system.
The first of this difficulty is readily
√ addressed. Under H0 , the marginal
sign test statistic (7.5) satisfies Tj / n ≈ G(0, 1). Conditional on the relative
ranks of the absolute values of the observations, the permutation distribution
134 Multivariate Analysis

is entirely specified, and conditional joint moments are calculated. Under the
permutation distribution,
X
σ̂jj 0 = s̃(Xij )s̃(Xij 0 )/n, (7.9)
i

and so the variance estimate used in (7.7) has components υ̂jj 0 = σ̂jj 0 /n.
Here again, the covariance is defined under the distribution that consists of
2n random reassignment of signs to the data vectors, each equally weighted.
As before, variances do not depend on data values, but covariances do depend
on data values. The solution for the Wilcoxon signed-rank test is also available
(Bennett, 1965).
Using the data to redefine the coordinate system may be used to address
the second problem (Randles, 1989; Oja and Randles, 2004).
Combine the components of T (X) to construct the statistic

W = W (X) = T > Σ̂−1 T , (7.10)

using an estimate Σ̂ of Σ = Var [T ] as in (7.9), or similarly for the signed-rank

statistic.
The multivariate central limit theorem of Hájek (1960), and the quality
of approximation to Υ, justifies approximating the null distribution of W by
χ2J distribution. The test rejects the null hypothesis of zero component-wise
medians when
W (X) > G−1 J (1 − α, 0)

for G−1 2
J (1 − α, 0) the 1 − α quantile of the χJ distribution, with non-centrality
parameter 0. Bickel (1965) discusses these (and other tests) in generality.

Example 7.3.1 Consider the data of Example 6.4.2. We test the null
hypothesis that the joint distribution of systolic and diastolic blood pres-
sure changes is symmetric about (0, 0), using Hotelling’s T 2 and the two
asymptotic tests that substitutes signs and signed-ranks for data. This test
is performed in R using
# For Hotelling and multivariate rank tests resp:
library(Hotelling); library(ICSNP)
cat(’\n One-sample Hotelling Test\n’)
HotellingsT2(bp[,c("spd","dpd")])
cat(’\n Multivariate Sign Test\n’)
rank.ctest(bp[,c("spd","dpd")],scores="sign")
cat(’\n Multivariate Signed Rank Test\n’)
rank.ctest(bp[,c("spd","dpd")])
Nonparametric One-Sample Testing Approaches 135

P -values for Hotelling’s T 2 , the marginal sign rank test, and marginal
sign test, are 9.839 × 10−6 , 2.973 × 10−3 , and 5.531 × 10−4 .

Tables 7.1 and 7.2 contain attained levels and powers for one-sample multi-
variate tests with two manifest variables of nominal level 0.05, from various
distributions.

TABLE 7.1: Level of multivariate tests

Test Sample Size 20 Sample Size 40

Normal Cauchy Laplace Normal Cauchy Laplace
T2 0.04845 0.01550 0.04335 0.05025 0.01665 0.04605
Sign test 0.04215 0.04130 0.04365 0.04865 0.05055 0.04650
Sign rank test 0.04115 0.03895 0.04170 0.05065 0.04980 0.04745

TABLE 7.2: Power of multivariate tests

Test Sample Size 20 Sample Size 40

Normal Cauchy Laplace Normal Cauchy Laplace
T2 0.74375 0.08090 0.48460 0.97760 0.08845 0.79215
Sign test 0.70330 0.26290 0.55560 0.96955 0.52135 0.88700
Sign rank test 0.52615 0.32605 0.55025 0.87520 0.64720 0.89445

Tests compared are Hotelling’s T 2 tests, and test (7.10) applied to the
sign and signed-rank tests. Tests have close to their nominal levels, except
for Hotelling’s test with the Cauchy distribution; furthermore, the agreement
is closer for sample size 40 than for sample size 20. Furthermore, the sign
test power is close to that of Hotelling’s test for Gaussian variables, and the
signed-rank test has attenuated power. Both nonparametric tests have good
power for the Cauchy distribution, although Hotelling’s test performs poorly,
and both perform better than Hotelling’s test for Laplace variables.
Some rare data sets simulated to create Tables 7.1 and 7.2 include some
for which Υ is estimated as singular. Care must be taken to avoid difficulties;
in such cases, p-values are set to 1.

7.3.1 More General Permutation Solutions

One might address this problem using permutation testing. First, select an
existing parametric test statistic U (X), perhaps a Hotelling statistic, or a
rank-based statistic. Under the permutation null distribution, the sampling
distribution puts equal weight 2−n to all 2n values of the statistic evaluated
at each element of (7.8); these 2n values need not all be unique. For n large
enough to make exhaustive evaluation prohibitive, a random subset of ele-
ments of (7.8) may be selected. The p-value is reported as the proportion of
data sets with permuted signs having the test statistic value as large as, or
136 Multivariate Analysis

larger than, that observed. In this way, the analysis of the previous subsection
for the sign test, and by extension the signed-rank test, can be extended to
general rank tests, including tests with data as scores.

7.4 Confidence Regions for a Vector Shift Parameter

Proceed analogously to the one-dimensional confidence interval construction
as in §2.3.3. Introduce a shift parameter to move the data to an arbitrary
point null hypothesis. In this one-sample case, one may apply the one sample
test for the null hypothesis that marginal medians are 0 to the shifted data
X − 1n ⊗ µ, where 1n is a vector of ones of length n, and ⊗ is the outer
product, so 1n ⊗ µ is the matrix with entry µj in column j for all rows. That
is, calculate T (X − 1n ⊗ µ) from the data set X − 1n ⊗ µ. Calculate the
variance matrix ((7.9) for the multivariate sign test) from the shifted data
set. Calculate W (X − 1n ⊗ µ) from (7.10). Then the random set

I = {µ|W (X − 1n ⊗ µ) ≤ χ2J,α },

satisfies P [µ ∈ I] = 1 − α using the test inversion argument of (1.16). This

is the one-sample case of the region proposed by Kolassa and Seifu (2013).

Example 7.4.1 Recall again the blood pressure data set of Exam-
ple 6.4.2. Figure 7.2 is generated by
library(MultNonParam); shiftcr(bp[,c("dpd","spd")])
and exhibits the 0.05 contour of p-values for the multivariate test con-
structed from sign rank tests for each of systolic and diastolic blood pres-
sure, and forms a 95% confidence region. Note the lack of convexity.

7.5 Two-Sample Methods

Two-sample methods are generally of more interest than the preceding one-
sample methods. Consider a multivariate data set Xij for i ∈ {1, . . . , M1 +M2 }
and j ∈ {1, . . . , J}, with data from the first sample occupying the first M1
rows of this matrix, and data from the second sample occupying the last M2
rows. Assume that the vectors (Xi1 , . . . , XiJ ) and (Xi0 1 , . . . , Xi0 J ), for all i,
are independent if i 6= i0 . Assume further than the vectors (Xi1 , . . . , XiJ ) all
have the same distribution for i ≤ M1 , and that the vectors (Xi1 , . . . , XiJ )
Two-Sample Methods 137

FIGURE 7.2: Median Blood Pressure Change Confidence Region

-10
..............
.. ..
................................................. ...........................
.. ..
............... ...
.. ...
...................................... .............
.
....................................
.. ...
.. ..
.............. . ...
.. ...
......................... ...
... ...
-15 ..............
...
.
.............
...
.. ...
. .
.... ..
.
. .
.
.............
.. ...
...
................ ...
.. .
.
............. .
.. ...
Systolic .............. ..
.. ...
.............. ..............
.. .
.
.... .
...
.. ...............
... ..
.............. ...
-20 ..
..............
...
...
... ..
... .....................................
.............. .............
.. ...
... ....
... .
.
.... ..........................
.
.............. .........................
.. ..
.... ............. .
.
... ...
.. ..
. ...
.............. ..........................
.
.............. .
.
-25 ...
...
...................................
..............
.
.
.. ..
................................................................................................

-15 -10 -5 0
Diastolic
95% Confidence from inversion of Bivariate Sign Rank Test

all have the same distribution for i > M1 . Let g be a vector indicating group
membership; gi = 1 if i ≤ M1 , and gi = 2 if i > M1 . As in §7.1, consider
testing and confidence interval questions.

7.5.1 Hypothesis Testing

Combine the techniques for one-dimensional two-sample testing of Chapter 3
and for multi-dimensional one-sample testing of §7.1. Consider the null hy-
pothesis H0 that the distribution of (Xi1 , . . . , XiJ ) is the same for all i.

7.5.1.1 Permutation Testing

Under the null hypothesis of equality of distributions across the two groups,
all assignments of the observed vectors among the two groups that keep the
sizes of groups 1 and 2 at M1 and M2 respectively, are equally likely. Hence a
permutation test may be constructed by evaluating the Hotelling statistic, or
any other parametric statistic, at each of the M1M+M
1
2
such reassignments of
138 Multivariate Analysis

the observations to the groups,

X ∗ = {(X, g)|g has M1 entries that are 1 and M2 that are 2};

p-values are calculated by counting the number of such statistics with values
equal to or exceeding the observed value, and dividing the count by M1M+M

1
2
.
Other marginal statistics may be combined; for example, one might use
the Max t-statistic, defined by first calculating univariate t-statistics for each
manifest variable, and reporting the maximum. This statistic is inherently
one-sided, in that it presumes an alternative in which each marginal distri-
bution for the second group is systematically larger than that of the first
group. Alternatively, one might take the absolute value of the t-statistics be-
fore maximizing. One might do a parallel analysis with either the maximum of
Wilcoxon rank-sum statistics or the maximum of their absolute values, after
shifting to have a null expectation zero.
Finally, one might apply permutation testing to the statistic (7.3), calcu-
lated on ranks instead of data values, to make the statistic less sensitive to
extreme values.

Example 7.5.1 Consider the data on wheat yields, in metric tons per
hectare (Cox and Snell, 1981, Set 5), reflecting yields of six varieties of
wheat grown at ten different experimental stations, from

http://stat.rutgers.edu/home/kolassa/Data/set5.data .
Two of these varieties, Huntsman and Atou, are present at all ten sta-
tions, and so the analysis will include only these. Stations are included
from three geographical regions of England; compare those in the north
to those elsewhere. The standard Hotelling two-sample test may be per-
formed in R using
wheat<-as.data.frame(scan("set5.data",what=list(variety="",
y0=0,y1=0,y2=0,y3=0,y4=0,y5=0,y6=0,y7=0,y8=0,y9=0),
na.strings="-"))
# Observations are represented by columns rather than by
# rows. Swap this. New column names are in first column.
dimnames(wheat)[[1]]<-wheat[,1]
wheat<-as.data.frame(t(wheat[,-1]))
dimnames(wheat)[[1]]<-c("E1","E2","N3","N4","N5","N6","W7",
"E8","E9","N10")
wheat$region<-factor(c("North","Other")[1+(
substring(dimnames(wheat)[[1]],1,1)!="N")],
c("Other","North"))
attach(wheat)
plot(Huntsman,Atou,pch=(region=="North")+1,
main="Wheat Yields")
Two-Sample Methods 139

legend(6,5,legend=c("Other","North"),pch=1:2)

Data are plotted in Figure 7.3. The normal-theory p-value for testing
equality of the bivariate yield distributions in the two regions is given by
library(Hotelling)#for hotelling.test
print(hotelling.test(Huntsman+Atou~region))
The results of hotelling.test must be explicitly printed, because the
function codes the results as invisible, and so results won’t be printed
otherwise. The p-value is 0.0327. Comparing this to the univariate results
t.test(Huntsman~region);t.test(Atou~region)
gives two substantially smaller p-values; in this case, treatment as a multi-
variate distribution did not improve statistical power. On the other hand,
the normal quantile plot for Atou yields shows some lack of normality.
Outliers do not appear to be present in these data, but if they were,
they could be addressed by performing the analysis on ranks, either using
asymptotic normality:

cat(’Wheat rank test, normal theory p-values’)

print(hotelling.test(rank(Huntsman)+rank(Atou)~region))
or using permutation testing to avoid the assumption of multivariate nor-
mality:
#Brute-force way to get estimate of permutation p-value for
#both T2 and the max t statistic.
cat(’Permutation Tests for Wheat Data, Brute Force’)
obsh<-hotelling.test(Huntsman+Atou~region)$stats$statistic
obst<-max(c(t.test(Huntsman~region)$statistic,
t.test(Atou~region)$statistic))
out<-array(NA,c(1000,2))
dimnames(out)<-list(NULL,c("Hotelling","t test"))
for(j in seq(dim(out)[[1]])){
newr<-sample(region,length(region))
hto<-hotelling.test(Huntsman+Atou~newr)
out[j,1]<-hto$stats$statistic>=obsh
out[j,2]<-max(t.test(Huntsman~newr)$statistic,
t.test(Atou~newr)$statistic)>=obst
}
apply(out,2,mean)
giving permutation p-values for the Hotelling and max-t statistics of 0.023
and 0.003 respectively. The smaller max-t statistic reflects the strong
140 Multivariate Analysis

association between variety yields across stations. If one wants only the
Hotelling statistic significance via permutation, one could use

print(hotelling.test(Huntsman+Atou~region,perm=T,
progBar=FALSE))
The argument progBar will print a progress bar, if desired, and an addi-
tional argument controls the number of random permutations.

FIGURE 7.3: Wheat Yields

7.5

6.5 ♦

♦
5.5 Symbol Region
♦
Other
5 ♦ North
♦ ♦
4.5
4 4.5 5 5.5 6 6.5 7 7.5

7.5.1.2 Permutation Distribution Approximations

Let Tj be the Mann-Whitney-Wilcoxon statistic using manifest variable j, for
j ∈ {1, . . . , J}. Let T = (T1 , . . . , TJ ). Let Σ = Var [T ] be the variance matrix
of this statistic, under the permutation distribution. Let σjj 0 be the element
in row j and column j 0 . The diagonal elements σjj are independent of data
values (and equal M1 M2 (M2 + M1 + 1)/12, but that’s not important here).
The remaining entries of Σ depend on the data. For i = 1, . . . , M1 + M2 ,
let Fij be the number of observations in group 2 that beat observation i on
variable j if i is in group 1, and the number of observations in group 1 that i
beats on variable j, if i is in group 2. Then 4/(M2 + M1 ) times the covariance
matrix for F estimates the variance matrix of T . Kawaguchi et al. (2011)
provide details of these calculations. Superior performance can be obtained
using known diagonal values, and estimated correlations for the remaining
entries of the variance matrix (Chen and Kolassa, 2018).

Example 7.5.2 Consider again the wheat yield data of Example 7.5.1.
Asymptotic nonparametric testing is performed using
library(ICSNP)#For rank.ctest and HotelingsT2.
rank.ctest(cbind(Huntsman,Atou)~region)
Exercises 141

to obtain a p-value of 0.039, or

rank.ctest(cbind(Huntsman,Atou)~region,scores="sign")
detach(wheat)
to obtain a multivariate version of Mood’s median test.
Alternate syntax for rank.ctest consists of calling it with two argu-
ments corresponding to the two data matrices.

7.6 Exercises
1. The data set

HTTP://ftp.uni-
bayreuth.de/math/statlib/datasets/federalistpapers.txt

gives data from an analysis of a series of documents. The first col-

umn gives document number, the second gives the name of a text
file, the third gives a group to which the text is assigned, the fourth
represents a measure of the use of first person in the text, the fifth
presents a measure of inner thinking, the sixth presents a measure of
positivity, and the seventh presents a measure of negativity. There
are other columns that you can ignore. (The version at Statlib,
above, has odd line breaks. A reformatted version can be found at
stat.rutgers.edu/home/kolassa/Data/federalistpapers.txt).
a. Test the null hypothesis that the multivariate distribution of
first person, inner thinking, positivity, and negativity, are the same
between groups 1 and 2, using a permutation test. Test at α = .05.
b. Construct new variables, the excess of positivity over negativity,
and the excess of thinking ahead over thinking behind, by sub-
tracting variable six minus variable seven, and variable eight minus
variable nine. Test the null hypothesis that the multivariate dis-
tribution of these two new variables has median zero, versus the
general alternative, using the multivariate version of the sign test.
Test at α = .05.
2. The data at

HTTP://lib.stat.cmu.edu/datasets/cloud

contain data from a cloud seeding experiment. The first fifteen lines
contain comment and label information; ignore these. The second
field contains the character S for a seeded trial, and U for unseeded.
a. The fourth and fifth represent rainfalls in two target areas. Test
142 Multivariate Analysis

the null hypothesis that the bivariate distribution of observations

after seeding is the same as that without seeding. Use the marginal
rank sum test.
b. Repeat part (a) using the permutation version of Hotelling’s test.
8
Density Estimation

This chapter considers the task of estimating a density from a sample of in-
dependent and identical observations. Previous chapters began with a review
of parametric techniques. Parametric techniques for density estimation might
involve estimating distribution parameters, and reporting the parametric den-
sity with estimates plugged in; this technique will not be further reviewed in
this volume.

8.1 Histograms
The most elementary approach to this problem is the histogram, which rep-
resents the density as a bar chart.
The bar chart represents frequencies of the values of a set of categorical
variable in various categories as the heights of bars. When the categories are
ordered, one places the bars in the same order. To construct a histogram for
a continuous random variable, then, coarsen the continuous variable into a
categorical variable, whose categories are subsets of the range of the original
variable. Construct a bar plot for this categorical variable, again, with bars
ordered according to the order of the intervals.
Because the choice of intervals is somewhat arbitrary, the boundary be-
tween bars is deemphasized by making the bars butt up against their neigh-
bors. The most elementary version of the histogram has the height of the bar
as the number of observations in the interval. A more sophisticated analysis
makes the height of the bar represent the proportion of observations in the
bar, and a still more sophisticated representation makes the area of the bar
equal to the proportion of observation in the bar; this allows bars of unequal
width while keeping the area under a portion of the curve to approximate the
proportion of observation in that region. Unequally-sized bars are unusual.
Construction of a histogram, then, requires selection of bar width and bar
starting point. One generally chooses the end points of intervals generating the
bars to be round numbers. A deeper question is the number of such intervals.
An early argument (Sturges, 1926) involved determining the largest number
so that if the data were in proportion to a binomial distribution, every interval

143
144 Density Estimation

in the range would be non-empty. This gives a number of bars proportional

to the log of the sample size.
The histogram is determined once the bin width ∆n , and any bin sep-
aration point, is selected. Once ∆n is selected, many bin separation points
determine the same histogram; without loss of generality, choose the smallest
non-negative value and denote it by tn .
Scott (1979) calculates optimal sample size by minimizing mean square
error of the histogram as an estimator of the density. Construct the histogram
with bars of uniform width, and such that the bar area equals the proportion
of observations out of the entire sample falling in the particular bar. Let fˆ(x)
represents the height of the histogram bar containing the potential data value
x. The integrated mean squared error of the approximation is then
Z ∞ h i
E |f (x) − fˆ(x)| dx.
2
(8.1)
−∞

The quality of the histogram depends primarily on the width of the bin ∆n ;
the choice of tn is less important. Scott (1979) takes tn = 0, and, using Taylor
approximation techniques within the interval, shows that the integrated mean
squared error of the approximation is
Z ∞
1
1/(n∆n ) + ∆n 2
f 0 (x)2 dx + O(1/n + ∆n 3 ). (8.2)
12 −∞

The first term in (8.2) represents the variance of the estimator, and the second
term represents the square of the bias. Minimizing
R∞ the sum of the first two
terms gives the optimal bin size ∆∗n = [6/ −∞ f 0 (x)2 dx]1/2 n−1/3 . One might
R∞
approximate −∞ f 0 (x)2 dx by using the value of the Gaussian density with
variance matching that of the data, to obtain ∆∗n = 3.49sn−1/2 , for s the
standard deviation of the sample.

8.2 Kernel Density Estimates

Histograms are easy to construct, but their choppy shape generally does not
reflect our expectations about true density. A more sophisticated approach
is kernel density estimation (Rosenblatt, 1956). Estimate the density as the
average of densities centered at data points:
n
X
fˆ(x) = (∆n n)−1 w((Xi − x)/∆n ). (8.3)
i=1

Here w is a density, also called a kernel; that is, a non-negative function

integrating to 1.
Kernel Density Estimates 145

This estimator fˆ depends on the kernel w, and the smoothing parameter

∆n .
Some plausible choices for w are the (standard) Gaussian density, the
3
Epanechnikov kernel w(x) = a − 16a3 x2 /9 for |x| ≤ 4a ; this kernel is scaled to
3
have unit variance with a = 4√5 , and minimizes the integrated mean square
error (8.1) among symmetric kernels (Epanechnikov, 1969). √
Another
√ plausible kernel is the triangle kernel w(x) = 1/ 6 − |x/6|, for
|x| ≤ 6, and 0 otherwise. One might also consider the box kernel; consider
the simplest version, w(x)√= 1 if |x| ≤ 1/2,
√ rather than the one standardized
to unit variance, w(x) = 3/2 if |x| ≤ 3. This kernel will be considered for
pedagogical reasons below, although in practical terms its use abandons the
advantages of the smoothness of the result.

Example 8.2.1 Refer again to the nail arsenic data from Example 2.3.2.
Figure 8.1 displays kernel density estimates with a default bandwidth and
for a variety of kernels. This figure was constructed using
cat(’\n Density estimation \n’)
attach(arsenic)
#Save the density object at the same time it is plotted.
plot(a<-density(nails),lty=1,
main="Density for Arsenic in Nails for Various Kernels")
lines(density(nails,kernel="epanechnikov"),lty=2)
lines(density(nails,kernel="triangular"),lty=3)
lines(density(nails,kernel="rectangular"),lty=4)
legend(1,2, lty=rep(1,4), legend=c("Normal","Quadratic",
"Triangular","Rectangular"))

Note that the rectangular kernel is excessively choppy, and fits the poorest.

Generally, any other density, symmetric and with a finite variance, may be
used. The parameter ∆n is called the bandwidth. This bandwidth should
depend on spread of data, and n; the spread of the data might be described
using the standard deviation or interquartile range, or, less reliably, the sample
range.
The choice of bandwidth balances effects of variance and bias of the den-
sity estimator, just as the choice of bin width did for the histogram. If the
bandwidth is too high, the density estimate will be too smooth, and hide fea-
tures of data. If the bandwidth is too low, the density estimate will provide
too much clutter to make understanding the distribution possible.
We might consider minimizing the mean squared error for x fixed, rather
than after integration. That is, choose ∆n to minimize the mean square error
of the estimate,
h i h i 2
MSE[fˆ(x)] = Var fˆ(x) + E fˆ(x) − f (x) .
146 Density Estimation

FIGURE 8.1: Density for Arsenic in Nails for Various Kernels

2.5 ......
..... ......
......... ....
.... .... .....
....... ........... ....
.. ....... ..
...... ........ ..
.... . .. .......
... ... ... ..
...... ..... ..... ....
.. . .
..... .. ...
...... ... ........
...... ... ........
2 ........ ... ..........
..... ... .........
...... .. .........
.... .
... ..
....... ....
........
.......
............................................................. Normal
.... .. ....
.....
.
................
.
. ......
.......
. . . . . . . . . . Quadratic
........ .......
......
........
.. ...
......
........
.........
................................................. Triangular
1.5 .....
.........
......
.........
.......
......................................................... Rectangular
..... .....
Den- ...... ......
........ .......
.. ... .........
sity ...... .........
....... .........
... ......
............. ...............
......... ... ........ ..
... ..... ... .... ...
1 ..........
..........
... ....... ...
.......... ...
......... ......
......
......... .......
......... .....
.....
.......
...... ......
...... .........
.......
...... ......
..... .......
.... ........
..
0.5 .......
.....
.........
....
..... ......
...... ....... ...... ......
....... ...... .... .......
... ....
...... ..........
..... ...........................................
...... .... ..... ..... ... ...
.... ..
.. ..... . ..........
.....
.... ...... .............. ....... ........ . .....
....... ................... .......
....... ..................................................
.
.
....... .... . .
........
...... ..............
. ..........
........
. . . . .
. .
......... ..
.... ......... ......... ..........
.................... ..... ................................................................................................................................................................................................................... .....................
0
0 0.5 1 1.5 2 2.5
Nail Arsenic

Using hthe box ˆ

i kernel, the estimate f (x) has a rescaled binomial distribution,
and E fˆ(x) = p/∆n for p = F (x+∆n /2)−F (x−∆n /2). Expanding F (x) as
a Taylor series, noting that F 0 (x) = f (x), and canceling terms when possible,

p/∆n = f (x) + ∆n 2 f 00 (x∗ )/24 for some x∗ ∈ [x − ∆n /2, x + ∆n /2]. (8.4)

The bias of the estimator is approximately ∆n 2 f 00 (x∗ )/24. Hence, generally,

if ∆n does not converge to zero as n → ∞, then bias does not converge to
zero, and the estimate is inconsistent; hence consider only strategies for which
limn ∆n = 0. Furthermore, since limn ∆n = 0, then bias is approximated by

∆n 2 C1 (x) for C1 (x) = f 00 (x∗ )/24. (8.5)

h i
Furthermore, Var fˆ(x) = p(1 − p)/(n∆n 2 ), and applying (8.4),

1 3 00 ∗ 1 2 00 ∗

h i 1 − ∆n f (x) − 24 ∆n f (x ) f (x) + 24 ∆n f (x )
Var fˆ(x) = ,
∆n n
and using the convergence of ∆n to zero, one obtains
h i
Var fˆ(x) ≈ C2 (x)/(∆n n) for C2 (x) = f (x). (8.6)
Kernel Density Estimates 147

Minimizing the mean square requires minimizing

C2 (x)/(∆n n) + C1 (x)2 ∆n 4 . (8.7)
Differentiating and setting (8.7) to zero implies that C2 (x)∆n −2 /n =
4C1 (x)2 ∆n 3 , or
∆n = 2−2/5 C2 (x)1/5 C1 (x)−2/5 n−1/5 (8.8)
1/5 −2/5 00 −2/5 −1/5
= f (x) 2 (f (x)) n .
This gives a bandwidth dependent on x; a bandwidth independent of
x may be constructed by minimizing the integral
R ∞ of the mean squared
error (8.7), C2 /(∆n n) + C12 ∆n 4 , for C2 = −∞ C2 (x) dx and C1 =
qR
∞
−∞ 1
C 2 (x) dx.

Example 8.2.2 Refer again to the arsenic nail data of Examples 2.3.2
and 8.2.1. The kernel density estimate, using the suggested bandwidth,
and bandwidths substantially larger and smaller than the optimal, are
given in Figure 8.2. The excessively small bandwidth follows separate
data points closely, but obscures the way they cluster together. The ex-
cessively large bandwidth wipes out all detail from the data set. Note the
large probability assigned negative arsenic concentrations by the estimate
with the excessively large bandwidth. This plot was drawn in R using
plot(density(nails,bw=a$bw/10),xlab="Arsenic",type="l",
main="Density estimate with inopportune bandwidths")
lines(density(nails,bw=a$bw*5),lty=2)
legend(1,2,legend=paste("Band width default",c("/10","*5")),
lty=c(1,2))
detach(arsenic)

and the object a is the kernel density estimate with the default bandwidth,
constructed in Example 8.2.1.

Silverman (1986, §3.3) discusses a more general kernel w(t). Equations (8.5)
and (8.6) hold, with
Z ∞ Z ∞
C1 (x) = f (x) w(t)2 dt, C2 (x) = 1/2f 00 (x) t2 w(t) dt,
−∞ −∞

and (8.8) continues to hold. Sheather and Jones (1991) further discuss the
constants involved.
Epanechnikov (1969) demonstrates that (8.3) extends to multivariate dis-
tributions; estimate the multivariate density f (x) from independent and iden-
tically distributed observations X1 , . . . , Xn , where Xi = (Xi1 , . . . , Xid ), using
d
Y n
X
fˆ(x) = n−1 ( ∆nj )−1 w((Xi1 − x1 )/∆n1 , . . . , (Xid − xd )/∆nd ).
j=1 i=1
148 Density Estimation

FIGURE 8.2: Density Estimate with Inopportune Bandwidths

8 ...
......
......
........
.....
......
6 ... ....
..... ....
.
....
. .. ......
... ... ......
.... ....
. . Band width divided by 10
........
.....
...............
.. ... ......
... ..
Density 4 ........ ....
...... ..
Band width multiplied by 5
........
.....
. . .
.... ... ......
.... ..... ........
.. ... ... ...
... ... ... ... ..
... ... ... .... . ......
.... .. .. .
2 ...
... .. ........ .... ..... .....
... ... ......... ...... ..... ...... .. ....
.
.
....
.....
.. ... ... ........ ...... ....... ...... .. ... ....
... ... ....... ...... ..... ...... .. .... ......
..... .. . . . .. . . . .. . . .
. .
.
. . .... . ..... . .... . ......... ..... ......... . .... ... . . .
. .
.. ... .. ...
. . ..... ..... .. ..
. ... ... ..... ... ... ... ... ... .. ...
.. ........ .. ...... ........ .................................... .................................................................................................................................................................................. . . . .
..
0
0 0.5 1 1.5 2 2.5
Nail Arsenic

Here w is a density over <d , and ∆nj are dimension-dependent bandwidth

parameters. Epanechnikov (1969) uses multivariate densities that are products
of separate densities for each dimension, but surveys work using more general
kernels in two dimensions. The number of observations needed for precise
estimation grows dramatically as the dimension d increases.

8.3 Exercises
1. The data at

HTTP://lib.stat.cmu.edu/datasets/CPS 85 Wages

reflects wages from 1985. The first 27 lines of this file represent an
explanation of variables; delete these lines first, or skip them when
you read the file. The first six fields are numeric. The sixth is hourly
wage; you can skip everything else. Fit a kernel density estimate to
the distribution of wage, and plot the result. Comment on what you
see.
2. The data at

HTTP://lib.stat.cmu.edu/datasets/cloud

contain data from a cloud seeding experiment. The first fifteen lines
contain comment and label information; ignore these. The third field
indicates season, and the sixth field represents a rainfall measure
expected not to be impacted by the experimental intervention. On
Exercises 149

the same set of axes, plot kernel density estimates of summer and
winter rain fall, and comment on the difference.
9
Regression Function Estimates

Consider modeling a collection of responses Yj as a function of explanatory

variables Xj = (Xj1 , . . . , XjK ). That is, express

Yj = g(Xj ) + j , (9.1)

with errors j independent and typically standardized to have a common scale

measure such as standard deviation or interquartile range, and with various
constraints on g. In this chapter, assume further that these errors are identi-
cally distributed. The least restrictive constraint on g allows for an arbitrary
conditional expectation of Yj ; since in many cases each unique value of Xj
appears only a small number of times in the data set, fitting such a model
reliably is difficult.
As with the previous chapters (except for Chapter 8), this chapter begins
with a review of standard Gaussian-theory results. Unlike previous chapters,
it also includes, in the form of quantile and resistant regression techniques,
some methods that are partially parametric in nature.

9.1 Standard Regression Inference

The most restrictive constraint on g is to require an affine function of Xj ;
that is,
g(Xj ) = β > Xj ; (9.2)
here β and Xj are column vectors. Generally, the first component of each of
the Xj is 1, making the first component of β an intercept parameter.
The notation above, with Xi capitalized, implies that the explanatory vari-
ables are random. Analyses considered in this chapter are generally performed
conditionally on these response variables, and so their random nature might
be ignored.
The standard approach chooses the regression parameters to minimize the
sum of squares of residuals; that is,
n
X
β̂ = argmin (Yi − β > Xi )2 = (X > X)−1 X > Y , (9.3)
β i=1

151
152 Regression Function Estimates

for
X the n × K matrix with rows given by Xi> , (9.4)
and Y is the column vector with entries Yi . The estimator (9.3) is defined only
if X is of full rank, or, in other words, if the inverse in (9.3) exists. Otherwise,
the model is not identifiable, in that two different parameter vectors β and
β ∗ give the same fitted values, or Xβ = Xβ ∗ .
When the errors j have a Gaussian distribution, then the vector β̂ has a
multivariate Gaussian distribution, exactly, and
p
(β̂i − βi )/ s2 vi ∼ Tn−K , (9.5)
n
for s2 = i=1 (Yi − β > Xi )2 /(n − K), and vi the entry in row and column
P
i of (X X)−1 . When the errors j are not Gaussian, under certain circum-
>

stances, a central limit theorem justifies approximating the distribution of β̂

as multivariate Gaussian, and (9.5) holds, approximately.

9.2 Kernel and Local Regression Smoothing

An intermediate level of constraint has g(x) continuous and differen-
tiable, with curves that turn quickly discouraged. One method of fit-
ting under such a constraint is kernel smoothing, and specifically as
Nadaraya-Watson smoothing (Nadaraya, 1964; Watson, 1964). One obtains
an expression that is explicit rather than implicit; estimate g(x) as
n
X n
X
ĝ(x) = Yj w((x − Xj )/∆n )/ w((x − Xj )/∆n ). (9.6)
j=1 j=1

The weight function can be the same as was used for kernel density estimation.
This weight function is often a Gaussian density, or a uniform density centered
at 0. Fan (1992) discusses a local regression smoother
L
X
ĝ(x) = βˆ` x` , (9.7)
`=0

for L = 1 and

n L
!2 
X X
β̂ = argmin  Yj − βˆ` Xj` w((x − Xj )/∆n ) , (9.8)
j=1 `=0

and argues that this estimator has smaller bias than (9.6). Köhler et al. (2014)
present considerations for bandwidth selection.
Kernel and Local Regression Smoothing 153

Example 9.2.1 Example 2.3.2 presents nail arsenic levels from a sam-
ple; arsenic levels in drinking water were also recorded, and one can in-
vestigate the dependence of nail arsenic on arsenic in water. The data
may be plotted in R using
attach(arsenic)
plot(water,nails,main="Nail and Water Arsenic Levels",
xlab="Water",ylab="Nails",
sub="Smoother fits. Bandwidth chosen by inspection.")
Kernel smoothing may be performed in R using the function ksmooth.
Bandwidth was chosen by hand.
lines(ksmooth(water,nails,"normal",bandwidth=0.10),lty=1)
lines(ksmooth(water,nails,"box",bandwidth=0.10),lty=2)
legend(0.0,2.0,lty=1:3, legend=c("Smoother, Normal Kernel",
"Smoother, Box Kernel", "Local Polynomial"))
The function locpoly from library KernSmooth applies a local regression
smoother:

library(KernSmooth)
lines(locpoly(water,nails,bandwidth=.05,degree=1),lty=3)
detach(arsenic)
Results are given in Figure 9.1. The box kernel performs poorly; a sample
this small necessarily undermines smoothness of the box kernel fit. Per-
haps the normal smoother is under-smoothed, but not by much. Library
KernSmooth contains a tool dpill for automatically selecting bandwidth.
The documentation for this function indicates that it sometimes fails,
and, in fact, dpill failed in this case. This local regression smoother ig-
nored the point with the largest values for each variable, giving the curve
a concave rather than convex shape.
These data might have been jointly modeled on the square-root scale,
to avoid issues relating to the distance of the point with the largest values
for each variable from the rest of the data. An exercise suggests exploring
this; in this case, the automatic bandwidth selector dpill returns a value.
Figure 9.2 demonstrates the results of selecting a bandwidth too small.
In this case, the Gaussian kernel gives results approximately constant in
the neighborhood of each data point, and the box kernel result is not de-
fined for portions of the domain, because both numerator and denominator
in (9.6) are zero.

An alternative procedure uses as ĝ(x) the fitted value at x for low-degree

(viz., linear or quadratic) regression of points with Xj near x (Cleveland,
1979; Savitzky and Golay, 1964). Cleveland and Devlin (1988) refer to this
procedure as locally weighted regression (loess). One specifies the number of
154 Regression Function Estimates

FIGURE 9.1: Nail Arsenic and Water Arsenic

2.5
........................... Smoother, Normal Kernel
. . . .◦
. . . . . Smoother, Box Kernel .
.
2
...................... Local Polynomial .
.
.
. ...
. ........
. ........
........
..
.....
1.5 .....
..... .
.
..... .
Nail . . . . . . ......... . .
.
.
.
. ....
As .. . . ..
.
....
....
....
. ....
....
1 . .....
. ..
.....
...
◦ ◦ . .....
. .....
.....
..... ..
. ........
. .
...
..
..........................................................................
.
. ....... . . .
................ . . . ............
............. ......
........... . . . . . .......
.......... .......
0.5 ◦
..... ........... . . ...
....
........
... .......... ◦
........ . .............
◦ ....... .................
◦ ....... ..◦ ....................... .
◦
........................................◦.................................... . . . .
.....◦
◦
◦
◦
◦
◦
0
0 0.02 0.04 0.06 0.08 0.1 0.12 0.14
Water As
Smoother fits. Bandwidth chosen by inspection.

FIGURE 9.2: Nail Arsenic and Water Arsenic

2.5
........................... Normal
....................................................◦
..
. . . . . Box
2

1.5
Nail
As
1
◦ ...............◦
.....................
....
. ..... ...
. ... ...
. ..... ...
.
. ....... ...
.
...
.... ...
..
0.5 . ◦
........... . ....
.........
. ...... . ...........◦
......................................................
. ....
◦ . .
....
◦
◦ ...
.
... ◦◦
.
.....
...........
......◦
.◦
◦
◦
◦
◦
0
0 0.02 0.04 0.06 0.08 0.1 0.12 0.14
Water As
Smoother fits. Bandwidth chosen too small.
Kernel and Local Regression Smoothing 155

points k, and up-weights points near x and down-weights them away from
x. The weighting function is scaled to make the point in the neighborhood
farthest from x have a weight going down to zero.
Consider the case with an intercept and one regressor, so L = 2. The
estimate is now (9.7), for

L
!2 
X X
`
β̂ = argmin  Yj − β` Xj w((x − Xj )/∆n (x)) , (9.9)
j∈N (x) `=0

for N (x) the indices of the k closest points to x, and

∆n (x) = max{|Xj − x||j ∈ N (x)}. (9.10)
The local linear regression fit (9.8) with (9.6) differs from (9.9) with (9.6)
in the restriction of the consideration of points defining local regression pa-
rameter estimates to points near the point of interest, rather than using all
points with positive values for a weighting function. Note further that for
the loess procedure, bandwidth is determined implicitly, and depends on the
point at which the smoother is applied.
A common weight function is w(x) = (1 − |x|3 )3 .

Example 9.2.2 The solution using the default proportion of the data
0.75 appears to under-smooth the data, and raising this parameter above
1 reduces curvature, but not by much. Return again to the arsenic data
previously examined in Example 9.2.1. Again, plot the data in R using

attach(arsenic)
plot(water,nails,main="Nail and Water Arsenic Levels",
xlab="Water",ylab="Nails", sub="Loess fits")
Loess smoothing may be performed in R using the function loess. Band-
width was chosen by hand. Unlike in the case of ksmooth, loess does
not provide output that can be plotted directly. A set of points at which to
calculate the smoother must be specified; this is stored in the variable x
below. Instead of specifying a bandwidth for the loess procedure, one spec-
ifies the number of observations contributing to each neighborhood from
which the fit is calculated. In R, this is specified as the proportion of the
total sample, via the input parameter span. Hence the second call below
uses the entire data set.
x<-min(water)+(0:50)*diff(range(water))/50
lines(x,y=predict(loess(nails~water),x))
lines(x,y=predict(loess(nails~water,span=1),x),lty=2)

Values of span above 1 specify for the bandwidth to be expanded beyond

(9.10), thus increasing smoothing:
156 Regression Function Estimates

lines(x,y=predict(loess(nails~water,span=10),x),lty=3)
legend(0.0,2.0,legend=c("Default span .75",
"Span 1","Span 10"),lty=1:3)
detach(arsenic)
Results are given in Figure 9.3. Note that the solution using the default
proportion of the data 0.75 appears to under-smooth the data, and that
raising this parameter above 1 misses detail of the shape of the relation-
ship.

FIGURE 9.3: Nail Arsenic and Water Arsenic

2.5
........................... Default span .75 ◦..
. . . . . Span 1 ...
......
..........
.
2 ...................... Span 10 .......
.
......
.......
.
............
. .
..... ...
.... ...
..... ........
.... .
...... .........
1.5 .....
. ...
..... ..........
.
.
Nail .....
.....
.
......
.
..
.... .
..... .......
As .
...
.
.....
...... .............
.
......
. ... .
...... ........ .
1 ..... ........... . .
........................... ..................
..... ................................................................... . . .
◦ ....... ◦ ..
.. .
.. ..... . . .
. . . . .............. .
... . . . .
... . . ......
... .
. .............
.
. ◦
. ... .......
0.5 .
. .... ..............
........ ◦
◦ . . ....
◦ . ....................
◦ .............. ◦ ..◦
........◦
...◦
. .
............ ....
.
◦... .........
◦◦
◦
0
0 0.02 0.04 0.06 0.08 0.1 0.12 0.14
Water As
Loess fits
Cleveland and Devlin (1988) extend this loess technique to higher dimensions.

9.3 Isotonic Regression

Many contexts justify a non-decreasing or non-increasing nonparametric re-
lationship between variables. In this section consider non-decreasing relation-
ships; reversing this constraint is straight-forward. Pn
In the case in which K = 1, one might choose Ŷj to minimize j=1 (Yj −
Ŷj )2 subject to Ŷj ≥ Ŷi whenever Xj ≥ Xi . This model fitting is an example
of quadratic programming. When K > 1, then the space of possible covariates
does not have a natural ordering. One may construct a partial ordering; that is,
for some distinct covariate vectors x and y, neither x nor y is ordered higher.
Splines 157

For example, one may define the partial ordering (x1 , . . . , xK ) (y1 , . . . , yK )
if xi yi for all i; that is, consider one vector as greater than or equal to the
other if and only if each component of the first vector does not exceed the
same component of the second vector. Such an ordering lacks the property
that any two elements of the set may be compared, and so is called a partial
ordering. For example, neither (2, 1) (1, 2) nor (1, 2) (2, 1). In the case of
a single regressor this complication does not arise, since for any two distinct
real numbers, one can be determined to be the smaller and the other is the
larger.
Brunk (1955) introduces the notion of model fitting that respects the
partial ordering g(x) ≤ g(y) if x y. Such techniques are called
isotonic regression. Dykstra (1981) reviews an algorithm for fitting such a
model, called the pooled adjacent violators algorithm, and produces theoreti-
cal justification for this algorithm. Best and Chakravarti (1990) reviews more
general algorithmic considerations.

Example 9.3.1 Example 2.3.2 presents arsenic levels in water and

nails. Isotonic regression may be performed in R using the function
isoreg, and R provides a plotting method for isotonic regression results.
Hence the regression output may be plotted directly, using
attach(arsenic)
plot(isoreg(water,nails), xlab="Water As",ylab="Nail As",
main="Nail Arsenic and Water Arsenic")
detach(arsenic)
Results are given in Figure 9.4.

9.4 Splines
A spline is a smooth curve approximating the relationship between an ex-
planatory variable x and a response variable y, based on observed pairs of
points (X1 , Y1 ), . . . , (Xn , Yn ), constructed according to the following method,
in order to describe the dependence of y on x between two points x0 and xN .
One first picks N − 1 intermediate points x1 < x2 < · · · < xN −2 < xN −1 .
The intermediate points are called knots. One then determines a polynomial
of degree M between xj−1 and xj , constrained so that the derivatives of order
up to M − 1 match up at knots. Denote the fitted mean by ĝ(x).
Taken to an extreme, if all Xj are unique, then one can fit all n points with
a polynomial of degree n − 1; this, however, will yield a fit with unrealistically
158 Regression Function Estimates

FIGURE 9.4: Nail Arsenic and Water Arsenic

2.5
. . . . . . . . . . . . . . . . . . .◦
.
.
.
2 .
.
.
.
.
.
1.5 .
Nail .
.
As .
.
.
1 .
◦ .
◦ .
. . . . . . . . . . . . . . . . . .
.
.
.◦
0.5 .
.
◦
◦◦ .
◦ .. . . .◦◦
.
◦.◦ ◦
◦◦
◦
0
0 0.02 0.04 0.06 0.08 0.1 0.12 0.14
Water As
Isotonic Regression Fit.

extreme fluctuations. Instead, choose the polynomials to minimize

n
X Z X(n)
(Yj − ĝ(Xj ))2 + λ |ĝ 00 (x)| dx;
j=1 X(1)

here λ is a parameter that penalizes estimates with large curvature.

Example 9.4.1 Again revisit the arsenic data of Example 9.2.1. The
spline is fit using R as
attach(arsenic)
plot(water,nails, main="Nail Arsenic and Water Arsenic")
hgrid<-min(water)+(0:50)*diff(range(water))/50
lines(predict(spl<-smooth.spline(water,nails),hgrid))
lines(predict(smooth.spline(water,nails,spar=.1),hgrid),
lty=3)
lines(predict(smooth.spline(water,nails,spar=.5),hgrid),
lty=2)
legend(1250,1110,lty=1:3,col=1:3,
legend=paste("Smoothing",c(round(spl$spar,3),.5,.1),
c("Default","","")))
detach(arsenic)
Here hgrid is a set of points at which to calculate the spline fit. As with
most of the smoothing methods, R contains a lines method that will plot
Quantile Regression 159

predicted values directly. See Figure 9.5. Smoothing parameters smaller

than optimal yield over-fitted curves.

FIGURE 9.5: Nail Arsenic and Water Arsenic

2.5
........................... Smoothing 0.934 Default ..
.◦
...
. . . . . Smoothing 0.5 .
.
....
....
.
.
2 ...................... Smoothing 0.1 .....
.....
.......
.......
.
.
......
.......
......
........
1.5 ..
....
..
.... .......
..
.
..
.......
.
...
.. .......
Nail .. ......... ...
.
......
........
.... ....... ......... ....
As ... ...
.. ..
.......
.....
.
..
......
.......
..... ...... ........
... ..... ...........
1 .
.......
....
......
....
...
......
........
◦ ..... ...
◦ .........
.... ..... ........
....
....
......
...... .
.. .............
....... ........
. ......... ..........
... .......... ..........
◦... ............
0.5 ...
.
..
.... ......
..... .........
...... .......................◦
..................
.
..
..
.............
◦
◦ ..................◦
. ..
◦ ...
.... ..◦
.........
........◦
.◦ ...
◦
◦◦. ................
◦
0
0 0.02 0.04 0.06 0.08 0.1 0.12 0.14
Water As
Spline Fit

Schoenberg (1946) introduced this technique.

9.5 Quantile Regression

Least squares regression, as in (9.3), provides parameter estimates minimizing
the sum of squares of errors in fitting. Such an approach is often criticized
as allowing excessive influence to outliers, and, hence, is non-robust. As an
alternative, consider minimizing the sum of residuals, to obtain
 
Xn
β̂ = argmin  |Yj − β1 − β2 Xj | . (9.11)
β j=1

When comparing (9.11) with the standard least-squares criterion (9.3), note
the absence of ·2 in (9.11).
Equivalently, one might minimize
X
−
e+j + ej , (9.12)
j
160 Regression Function Estimates

for
− −
e+ +
j ≥ 0, ej ≥ 0, ej − ej = Yj − β1 − β2 Xj ∀j. (9.13)
The objective function is not differentiable; the optimum is where (9.12)
is either at a point, or flat, making the optimizer not unique.

FIGURE 9.6: Hypothetical L1 Fit

4.5
...
...
...
...
....
4 ....
....
....
....
....
....
....
....
....
....
....
....
Sum of 3.5 ....
....
....
....
Absolute ....
....
....
....
Residual ....
....
.....
Values 3 .....
.....
.....
..
...
..... ..
..... ....
.....
..... ...
.....
..... ...
..... ..
..... ....
..... ..
.....
2.5 ..... ....
.......

2
0 5 10 15 20
Regression Parameter
Intercept Fixed at Optimum

This solution is called L1 regression, after the power on |Yj − β1 − β2 Xj |.

This solution is also called quantile regression. To understand the motivation
behind this term, note that if β2 = 0, then the best β1 is the median. More
generally, if the linear model fits, and if the errors are identically distributed,
then the quantile regression line runs through the median of the distribution
of the responses, conditional on the explanatory variable. Note the parallelism
between the minimizations of (2.4) and (9.11). Just as median is not always
uniquely defined, these estimates are not necessarily uniquely defined.
Stigler (1984) attributes the suggestion to fit the linear model using (9.12)
to Roger Boscovich, and a partial solution to the minimization problem to
Thomas Simpson, both circa 1760. Stigler (1973) attributes the solution of
the problem when β1 = 0 to Laplace (1818). Koenker (2000) traces further
developments through the work of Edgeworth.
Table 9.1 compares coverage and length for the quantile and least squares
regression estimators, for a variety of distributions and sample sizes. Surpris-
ingly, both techniques maintain approximate 90% coverage for all putative
distributions and all sample sizes. Unsurprisingly, L2 intervals are shorter for
Gaussian variables, uniform variables, and for Laplace and exponential vari-
Quantile Regression 161

TABLE 9.1: Achieved coverage and average interval length for 90% confidence
intervals for L1 and L2 regression, with various error distributions

Interval Coverage, 10 observations

Gaussian Cauchy Laplace Uniform Exponential
L1 0.8854 0.8783 0.8811 0.8830 0.8826
L2 0.9051 0.9179 0.9042 0.8975 0.9047
Average Length, 10 observations
Gaussian Cauchy Laplace Uniform Exponential
L1 0.4766 3.5869 0.6092 0.1441 0.4035
L2 0.3981 10.8032 0.5466 0.1162 0.3773
Interval Coverage, 50 observations
Gaussian Cauchy Laplace Uniform Exponential
L1 0.8876 0.8832 0.8847 0.8876 0.8900
L2 0.9006 0.9121 0.9047 0.9017 0.9021
Average Length, 50 observations
Gaussian Cauchy Laplace Uniform Exponential
L1 0.0391 0.0568 0.0386 0.0145 0.0303
L2 0.0363 2.0033 0.0509 0.0105 0.0357

ables with the smaller sample size, and longer for Cauchy variables and for
Laplace and exponential variables with the larger sample size.
One can adapt this technique to fit quantiles of the errors other than the
median. One can replace the objective (9.12) by
X X
γ e+
j + (1 − γ) e−
j , (9.14)
j j

still subject to constraints (9.13). This change causes the regression line to
run through the 1 − γ quantile of Y |X.

9.5.1 Fitting the Quantile Regression Model

This minimization of (9.12) or (9.14), subject to (9.13), is an example of
linear programming. The solution to this minimization is more computation-
ally intensive than the minimization of (9.3), and, unlike the least squares
case, does not have a closed-form expression. This added difficulty arises be-
cause the objectives (9.12) and (9.14) are non-differentiable. At the minimizer
β̂, the objective function does not decrease as β moves away from β̂. Since
P n
j=1 |Yj − β1 − β2 Xj | is a piecewise-linear function of β1 and β2 , the function
S(β) giving a counterpart of the derivative is piecewise constant, with jumps.
Then β̂ should satisfy S(β̂) ≈ 0. The estimate either sets the function to
0, or is a point where it jumps across 0. This score function S(β) may be
expressed as a rank statistic, and the null hypothesis H0 : β = β ◦ may be
162 Regression Function Estimates

tested by comparing S(β ◦ ) to 0, either exactly or asymptotically. This test

can be inverted to give confidence sets for β.
A solution to the more general problem seems to have been an early ap-
plication of general linear programming methods.

Example 9.5.1 Consider again the arsenic data of Example 9.2.1. The
quantile regression fitter is found in R in library quantreg. The following
commands fit this model:
library(quantreg)#Need for rq
rq(arsenic$nails~arsenic$water)
to give the results
Call: rq(formula = nails ~ water)

tau: [1] 0.5

Coefficients:
coefficients lower bd upper bd
(Intercept) 0.11420 0.10254 0.14093
water 15.60440 3.63161 18.07802

Hence the best estimate of the slope of the linear relationship of ar-
senic in nails to arsenic in water is 15.604, with 90% confidence interval
(3.632,18.078). This confidence level is the default for rq. Figure 9.6 is
drawn using
attach(arsenic)
rqo<-summary(rq(nails~water))$coef
regp<-rqo[2,2]+diff(rqo[2,2:3])*(0:100)/100
m<-apply(abs(outer(nails,rep(1,length(regp)))
- outer(water,regp)-rqo[1,1]),2,sum)
plot(regp,m,type="l",xlab="Regression Parameter",
ylab="Sum of absolute values of residuals",
main="L1 fit for Arsenic Large Scale",
sub="Intercept fixed at optimum")
detach(arsenic)
and demonstrates the estimation of the slope; in this figure the intercept
is held at its optimal value.
Compare this to the Theil-Sen estimator:
attach(arsenic)
library(deming); theilsen(nails~water,conf=.9)$coef
yielding
Quantile Regression 163

(Intercept) water
0.1167158 14.2156504

A confidence interval for the slope parameter is calculated using

library(MultNonParam);theil(water,nails)
detach(arsenic)
giving the confidence interval (8.12,16.41); note the higher precision of
this approach.

If the linear model fits, and if the variance of the errors does not depend on
the explanatory variable, then the lines representing various quantiles of the
distribution of the response conditional on the explanatory variable will be
parallel, but this parallelism need not hold for the estimates in any particular
data set, as can be seen in the next example.

Example 9.5.2 Consider the blood pressure data set of Example (6.4.2).
Quantile regression can be used to model the median systolic blood pres-
sure after treatment in terms of systolic blood pressure before treatment.

#Expect warnings about nonunique estimators. This is OK.

rqout<-rq(bp$spa~bp$spb); summary(rqout)
yielding
coefficients lower bd upper bd
(Intercept) -11.41026 -34.76720 29.00312
bp$spb 0.94872 0.69282 1.11167
Hence blood pressure after treatment increases with blood pressure before
treatment; the best estimate is indicates that a one-for-one increase is
plausible, in that 1 sits inside the confidence interval. One can also fit
the .2 quantile:
rqoutt<-rq(bp$spa~bp$spb,tau=.2); summary(rqoutt)
to obtain
Coefficients:
coefficients lower bd upper bd
(Intercept) 17.25000 -68.48262 23.34294
bp$spb 0.75000 0.40129 1.20640
One can plot these relationships:
plot(bp$spb,bp$spa,main="Systolic Blood Pressure",
xlab="Before Treatment",ylab="After Treatment")
164 Regression Function Estimates

abline(rqout);abline(rqoutt,lty=2)
legend(150,200,legend=c("Median Regression",
".2 Quantile Regresson"),lty=1:2)
Results are plotted in Figure 9.7. Note the extreme lack of parallelism.

FIGURE 9.7: Systolic Blood Pressure

◦
200 ........................................... Median Regression
. . . . . . . .2 Quantile Regression ......
......
......
......
.
...
.......
.....
......
......
180 ◦.............
...
.
◦
.
...
...... . .
...
........ . .
..
......
. .
.
...... . .
◦ ............. . .
◦.
..
...◦
After ◦
..
........
. . . .
.
...... . .
......
Treat- 160 ......
...... . .
. .
◦ ......
ment .
......
......
..
...
. .
.
...... . .
...... ◦ . .
..
........
. .
.
...... . ◦.
......◦
...... . . . ◦
......
..
........ . .
.
.... . .
140 ..
.
. .
......
..
◦ ...... .
...... .
...... . .
.
...
........ .
.
...... . ◦
....... .
.......◦
......
.
...
.......
..
............
. ........
120 ......

140 150 160 170 180 190 200 210

Before Treatment

9.6 Resistant Regression

Again consider the linear model (9.1) and (9.2), with an estimator of the form
" n #
X
β̂ = argmin %((Yi − β > xi )/σ) . (9.15)
β i=1

This extends the location estimator (2.6). Least-squares estimates are given
by %(z) = z 2 /2, and the estimator of §9.5 is given by %(z) = |z|. These were
the same penalty functions used in §2.3.1 to give the sample mean and median
respectively. In both of these cases, the regression parameters minimizing the
penalty function did not depend on σ.
In parallel with that analysis, one might form an intermediate technique
between ordinary and quantile regression by solving (9.15) for an alterna-
tive choice of %. One might use (2.7); in such a case, one begins with an
Resistant Regression 165

initial estimate of σ, and minimization of (9.15) is typically alternated with

re-estimation of σ via the method of moments in a recursive manner. This
produces estimates that are resistant to the effects of outliers.

Example 9.6.1 Consider again the blood pressure data of Exam-

ple 6.4.2. Model the systolic blood pressure after treatment as linear on
the blood pressure before treatment, and fit the parameters using the least-
squares estimator (9.3), the quantile regression estimator (9.11), and gen-
eral estimator (9.15) using the resistant penalty (2.7).

library(MASS); rlmout<-rlm(spa~spb,data=bp)
library(quantreg); rqout<-rq(spa~spb,data=bp)
plot(bp$spb,bp$spa,
main="Blood Pressure After and Before Treatment",
xlab="Systolic Blood Pressure Before Treatment (mm HG)",
ylab="Systolic Blood Pressure After Treatment (mm HG)",
sub="Linear Fits to Original Data")
abline(rlmout,lty=2)
abline(rqout,lty=3)
abline(lm(spa~spb,data=bp),lty=4)
legend(min(bp$spb),max(bp$spa),
legend=c("Huber","Median","OLS"),lty=2:4)
Compare this with results for a contaminated data set. Change the re-
sponse variable for the first observation to an outlier, and refit.
bp$spa[1]<-120
rqout<-rq(bp$spa~bp$spb);rlmout<-rlm(bp$spa~bp$spb)
plot(bp$spb,bp$spa,
xlab="Systolic Blood Pressure Before Treatment (mm HG)",
ylab="Systolic Blood Pressure After Treatment (mm HG)",
sub="Linear Fits to Perturbed Data")
abline(rlmout,lty=2);abline(rqout,lty=3);
abline(lm(spa~spb,data=bp),lty=4)
legend(min(bp$spb),max(bp$spa),
legend=c("Huber","Median","OLS"),lty=2:4)
These fits are shown in Figure 9.8. The median regression line is slightly
different in the two plots, although if one does not look closely one might
miss this. The least squares fit is radically different as a result of the shift
in the data point. The Huber fit is moved noticeably, but not radically.
166 Regression Function Estimates

FIGURE 9.8: Blood Pressure After and Before Treatment, Original and Per-
turbed Data

◦
200 . . . . . . . Huber
.................................. Median ...
..............
.............
........................................ OLS ...
.
..
..
............
...............
......
........ .
....... ◦
180 ◦...............
.
....
..
..
....... .
....... .
Systolic .......
.........
...
..
.............
Blood .
◦ ..................... ◦
..........◦
◦ .....
Pressure . ..
.
............
................
..........
After 160 ◦
..............
.............
.............
Treatment ....................
... . ...
........ .....
(mm HG) .............. ◦
..................
.
...
.
. ...........◦. .. ◦
. ..
........ ...... ◦
........ .....
..
.....................
140 .
...
....... .......
....... ......
◦ ....... .....
..... .....
....... .....
...
......... .........
..... .... ◦
...... ......
...... .........◦
........ ..
.
.......... .........
... ..
......
......
120 ......

140 150 160 170 180 190 200 210

Systolic Blood Pressure Before Treatment (mm HG)
Linear Fits, Original Data
200 . . . . . . . Huber
.................................. Median
........................................ OLS ......
...
......
.....
.....
180 ◦
......
... ..
. .◦
.
. .
. .
...... . .
........ .
Systolic ......
......
.
. .
.
...... . .
Blood ◦ ...... ◦
........ . . . ...........
....
◦ ◦....... .
. ...........
Pressure .. . ...........
...... . ...........
.
.. ...... . .
. ..
.....
..............
After 160 ◦ .......
...
....... . ...........
...........
...........
.

Treatment ....... ...........

.
...
. ...... .....................
.
.... ..............
(mm HG) . ..................◦
..
. ............
..................
...
...
......................◦
. ◦
... . ......
........... . ◦
........... . . ..........
...........
........... . . . ...........
140 .............
... ..............
.
. . .
.....
....
........... ◦ . . ............
. . ......
. . .......
.
......
.. ◦
. . ...... ◦
......
......
.
.. .....
.
......
120 ◦

140 150 160 170 180 190 200 210

Systolic Blood Pressure Before Treatment (mm HG)
Linear Fits, Perturbed Data

9.7 Exercises
1. The data at
Exercises 167

HTTP://lib.stat.cmu.edu/datasets/CPS 85 Wages

reflects wages from 1985. The first 42 lines of this file contain a
description of the data set, and an explanation of variables; delete
these lines first, or skip them when you read the file. The first six
fields are numeric. The first is educational level, and the sixth is
hourly wage; you can skip everything else.
a. Fit a relationship of wage on educational level, in such a way as
to minimize the sum of absolute value of residuals. Plot this, and
compare with the regular least squares regression.
b. Fit a relationship of wage on educational level, in such a way as
to enforce monotonicity. Superimpose a plot of this relationship on
a plot of the data.
c. Fit a kernel smoother to estimate the dependence of wage on ed-
ucational level, and compare it to the results for the loess smoother.
d. Fit a smoothing spline to estimate the dependence of wage on ed-
ucational level, and compare it to the results for the loess smoother
and the kernel smoother.
2. Repeat the analysis of Example 9.2.1 for arsenic levels on the square
root scale. As an intermediate step, calculate the optimal bandwidth
for the local linear smoother. Plot your results. Compare with the
results in Figure 9.1.
10
Resampling Techniques

This chapter addresses general procedures for performing statistical inference

on a parameter using an estimator with minimal assumptions about its dis-
tribution. In contrast to fully nonparametric approaches used earlier, these
techniques use information contained in the sample to make inferences about
the sampling distribution of the estimator. Two approaches, the bootstrap
and the jackknife, are discussed in this chapter.
Suppose that independent and identically distributed vector observations
Z1 , . . . , Zn are drawn from a distribution with a continuous cumulative dis-
tribution function F , and suppose that inference on a parameter θ is required,
and an estimator θ̂(Z1 , . . . , Zn ) is used.
The parameter θ can be thought of as a function of the R distribution func-
tion; for example, the population expectation θ(F ) = z dF (z), and the
population median solves F (θ) = 1/2. Furthermore, the estimator θ̂ is a func-
tion of the empirical distribution function F̂ , and often is the same functional
that gives the parameter: θ̂ = θ(F̂ ). For example, an estimator of a population
expectation is the sample mean, which is the expectation of the population
formed by placing equal weights on each sample point. As a second exam-
ple, an estimator of a population median is the sample median, which is the
median of population formed by placing equal weights on each sample point.
The distribution of the estimator, and particularly how this distribution
depends on the quantity to be estimated, is also required for statistical in-
ference. In some cases, the structure of the model for a data set implies the
distribution of the estimator; for example, if observations are stochastically in-
dependent indicators of whether an event occurs, the resulting distribution is
known to be derivable from the distribution of Bernoulli trials. In other cases,
recourse is made to a central limit theorem, to produce traditional Gaussian-
theory inference. The current chapter concerns using the the observed data
set to provide this distributional information.

10.1 The Bootstrap Idea

The bootstrap is a suite of tools for inference on a model parameter, while
using the data to give information that might otherwise come from assump-

169
170 Resampling Techniques

tions about the parametric shape of its distribution. Let G(θ̂; F ) represent
the desired, but unobservable, distribution of θ̂ computed from independent
random vectors Z1 , . . . , Zn , with each random vector Zi having a distribution
function F . Assume that this distribution depends on F only via the param-
eter of interest, so that one might write G(θ̂; F ) = H(θ̂; θ) for some function
H.
Consider constructing confidence intervals using the argument of §1.2.2.1.
Using (1.18) with T = θ̂, a 1 − α confidence interval for θ satisfies H(θ̂; θL ) =
1 − α/2 and H(θ̂; θU ) = α/2. The function H, as a function of θ̂, is unknown,
and will be estimated from the data. Since observed data all arise from a
distribution governed by a single value of θ, the dependence of H(θ̂; θ) on
θ cannot be estimated. An assumption is necessary in order to produce a
confidence interval. Assume that

θ̂ − θ has a distribution that, approximately, does not depend on θ. (10.1)

10.1.1 The Bootstrap Sampling Scheme

Estimate H(θ̂, θ) by H † (θ̂), the distribution of the estimator evaluated on
the set of n random vectors, (Z1∗ , . . . , Zn∗ ), where Zi∗ are independent, and
selected from {Z1 , . . . , Zn } with probability 1/n for each. That is, if θ̂B,i are
the values of θ̂ evaluated at each of the nn samples, then
n
n
X
†
H (θ̂) = I(θ̂B,i ≤ θ̂)/nn .
i=1

Here again, the function I of a logical argument is 1 if the argument is true,

and 0 if it is false. Generally, the function θ̂ is symmetric in its arguments (for
example, θ̂(z1 , z2 , . . . , zn ) = θ̂(z2 , z1 , . . . , zn ), and similarly for other permu-
tations of the arguments), and so the distribution of θ̂(Y1 , . . . , Yn ) is a discrete
distribution supported on fewer values than nn , although any shortcuts that
exploit this symmetry still leave the exact enumeration of distribution of θ̂
under the resampling distribution intractable.
Because this exhaustive approach consumes excessive resources, one almost
always proceeds via random sampling. Choose a number of random samples B
to draw. One draws new random samples from the population represented by
the original sample, with replacement. This sampling with replacement dis-
tinguishes the bootstrap from previous permutation techniques. For random
sample i, evaluate the estimator on this sample, and call it θ̂B,i . The collection
of such values is called the bootstrap sample. Let
B
X
H ∗ (θ̂) = I(θ̂B,i ≤ θ̂)/B.
i=1
The Bootstrap Idea 171

Call this distribution the resampling distribution.

When approximating H by H ∗ , two sources of errors arise: the error in
approximating H by H † , governed by sample size, and the error in approx-
imating H † by H ∗ , which is governed by B. Generally speaking, moderate
values for B (for example, 999, or 9999) are sufficient to make the second
source of error ignorable, in the presence of the first source of data.
Techniques below will need quantiles of H ∗ , which are determined by or-
dered values of the bootstrap samples. Express the ordered values of θ̂B,i by
θ̂B,(i) . Order statistics from the bootstrap sample are used to estimate quan-
tiles of the bootstrap distribution. The most naive approach uses θ̂B(i) to
represent quantile i/B of the bootstrap distribution. By this logic, θ̂B(1) es-
timates the 1/B quantile, and θ̂B(B) represents the 1 quantile; that is, θ̂B(B)
approximates a value with all of the true sampling distribution for θ̂ at or
below it. Conceptually, the estimation problem ought to be symmetric if the
order of bootstrap observations is swapped, but these naive quantiles are not.
The upper quantile is wrong, since the population that θ̂B might be intended
to represent might not exist on a bounded interval. One may make this quan-
tile definition symmetric by taking θ̂B(i) to represent quantile i/(B + 1) of this
distribution.
Bootstrap techniques below will use the analogy

H is to θ as H ∗ is to θ̂. (10.2)

Since H ∗ is approximately centered about θ̂, bootstrap techniques will require

that θ is defined so that

H as a function of θ̂ is centered at θ. (10.3)

The statements of conditions (10.1) and (10.3) are purposely vague;

Abramovitch and Singh (1985) give an early set of specific conditions, and
Hall (1992) presents a manuscript-length set of tools for assessing the ap-
propriateness of bootstrapping in various contexts. Most results guaranteeing
bootstrap
√ accuracy rely on the existence of and Edgeworth approximation
to O(1/ n); that is, they require the existence of constants κ2 and κ3 such
that (6.5) holds for the distribution of θ̂, with the approximate expectation κ1
equal to θ, the terms involving κ4 and κ23 removed, and with error bounded
by a constant divided by n. Such results will not apply to bootstrap inference
using the sample mean for the Cauchy distribution, for example, since the
variance for the Cauchy distribution is not finite. The bootstrap sometimes
performs poorly when (6.5) fails to hold (Hall, 1988).
Alternatively, if one is willing to assume that F takes a parametric form,
one might sample from the distribution F (·, θ̂), and use these as above to
construct H ∗ . This technique is called the parametric bootstrap (Efron and
Tibshirani, 1993, §6.5).
172 Resampling Techniques

10.2 Univariate Bootstrap Techniques

Various strategies exist for using these samples in the most simple univariate
contexts. Terminology below is consistent with the R package boot.

10.2.1 The Normal Method

If one is willing to assume that, approximately, θ̂ has an approximately Gaus-
sian distribution centered at θ and with some unknown variance, one may use
the bootstrap sample to estimate the standard error of an estimator. Consider
the standard error estimate
B
X
ςˇ2 = (θB,i − θ)2 /B. (10.4)
i=1
PB
Using θ̄B = i=1 θB,i /B in place of θ in (10.4) results in an estimate system-
atically too small. Use B − 1 instead of B in denominator of ςˇ2 , or θ̂ in place
of θ, to respond to this undercoverage. The better estimate is
B
X
ςˆ2 = (θB,i − θ)2 /(B − 1). (10.5)
i=1

Then the standard deviation estimate ςˆ is the sample standard deviation of the
bootstrap samples (Efron, 1981), and a 1 − α confidence interval is θ̂ ± ςˆz1−α/2
for ςˆ the sample standard deviation of bootstrap samples (10.5).

10.2.2 Basic Interval

Often one is unwilling to assume that the estimator is approximately Gaussian.
Assume conditions (10.1) and (10.3). Use the distribution of θ̂B,i − θ̂ as a proxy
for that of θ̂ − θ. A confidence interval for θ may be constructed as

(θ̂ − vU ≤ θ ≤ θ̂ − vL ),

by determining vL and vU to satisfy

h i
P vL ≤ θ̂ − θ ≤ vU = 1 − α, (10.6)

if this distribution were known.

Let uL and uU be α/2 and 1 − α/2 quantiles of θ̂B,i respectively:

uL = θB,(α(B+1)/2) and uU = θB,((1−α/2)(B+1)) .

Then h i
P∗ uL − θ̂ ≤ θB,i − θ̂ ≤ uU − θ̂ = 1 − α, (10.7)
Univariate Bootstrap Techniques 173

where P∗ [·] is the probability function associated with H ∗ . Using analogy

(10.2), equate endpoints of (10.7) and (10.6), to estimate quantiles vU and vL
of θ̂B,i − θ̂ by v̂L = uL − θ̂, and v̂U = uU − θ̂. Then a confidence interval for
θ is

(θ̂ − vU , θ̂ − vL ) = (2θ̂ − θB,((1−α/2)(B+1)) , 2θ̂ − θB,(α(B+1)/2) ).

This is the basic bootstrap confidence interval (Davison and Hinkley, 1997, p.
29).

10.2.3 The Percentile Method

Suppose that θ̂ has distribution symmetric about θ. In this case, treat θ̂ − vU
and −θ̂ + vL as equivalent approximations to interchange, and so one can use
vU = θ̂ − uL , vL = θ̂ − uU . The confidence interval is now

(uL , uU ) = (θB,α(B+1)/2 , θB,(1−α/2)(B+1) )

(Efron, 1981).
This method is referred to as the percentile method (Efron, 1981), or as the
usual method (Shao and Tu, 1995, p. 132f). If the parameter θ is transformed
to a new parameter ϑ, using a monotonic transformation, then the bootstrap
samples transform in the same way, and so the percentile method holds if
there exists a transformation that can transform to symmetry, regardless of
whether one knows and can apply this transformation.

Example 10.2.1 Consider again the arsenic data of Example 2.3.2. We

calculate a confidence interval for the median.
meds<-rep(NA,999)
attach(arsenic)
for(j in seq(length(meds)))
meds[j]<-median(sample(nails,length(nails),replace=TRUE))
gives the bootstrap samples. The sample function draws a random sample
with replacement. Then
cat(ci<-quantile(meds,probs=c(.025,.975)),"\n")

gives the percentile confidence interval (0.119 0.310), and

cat(’\n Residual Bootstrap for Median\n’)
cat(ci<-2*median(nails)-rev(ci),"\n")
detach(arsenic)

gives the basic or residual confidence interval (0.040,0.231). Recall that

174 Resampling Techniques

the estimate of the density for the nail arsenic values was plotted in Fig-
ure 8.1. This distribution is markedly asymmetric, and so the percentile
bootstrap is not reliable; use the residual bootstrap.

10.2.4 BCa Method

This method was introduced by Efron (1987), who called it the BCa
method. The BCa method extends his BC method, which he terms “bias-
corrected”; Efron and Tibshirani (1993) refer to the method of this section as
bias corrected and accelerated. Again, suppose one desires a confidence inter-
val for θ, with estimator θ̂. As in §10.2.3, suppose θ̂ can be transformed to
symmetry using a transformation φ (which need not be known). Without loss
of generality, this symmetric distribution may be taken as Gaussian. Suppose
further that φ(θ̂) has a standard deviation that depends linearly on φ(θ), and
that φ(θ̂) has a bias that depends linearly on the standard deviation of φ(θ̂).
That is, assume that there exists a transformation φ, and constants a and ζ
such that (φ(θ̂) − φ(θ))/(1 + aφ(θ)) + ζ is approximately standard Gaussian.
That is, " #
φ(θ̂) − φ(θ)
P + ζ ≤ x ≈ Φ(x)
1 + aφ(θ)
and
h i y + ζ(1 + aφ(θ)) − φ(θ)
P φ(θ̂) ≤ y ≈ Φ . (10.8)
1 + aφ(θ)
Let θ∗ be the value of θ giving quantile 1 − α for θ̂. Substituting θ∗ for θ into
(10.8), and equating θ with θ̂, gives

−azα ζ + ζ(aζ − 2) + zα ζ − zα
Φ =Φ ζ+ . (10.9)
a(ζ − zα ) − 1 1 − a(ζ − zα )
Equating the bootstrap distribution function to this tail probability, the corre-
sponding quantile is defined by (10.9). The quantity a is called the acceleration
constant. The bias ζ may be estimated by the difference between the estimate
and the median of the bootstrap samples, and a may be estimated using the
skewness of the bootstrap sample.

Example 10.2.2 Return again to the nail arsenic values of the previous
example. We again generate a confidence interval for the median. The
BCa method, and the previous two methods, may be obtained using the
package boot.
library(boot)#gives boot and boot.ci.
#Define the function to be applied to data sets to get
#the parameter to be bootstrapped.
boot.ci(boot(arsenic$nails,function(x,index)
Univariate Bootstrap Techniques 175

return(median(x[index])),9999))

to give
Level Normal Basic
95% ( 0.0158, 0.2733 ) ( 0.0400, 0.2310 )

Level Percentile BCa

95% ( 0.119, 0.310 ) ( 0.118, 0.277 )
In this case, the normal and percentile intervals are suspect, because of
the asymmetry of the distribution. The more reliable interval is the bias
corrected and accelerated interval.
Recall that an exact confidence interval may be constructed using

library(MultNonParam); exactquantileci(arsenic$nails)
to obtain the interval (0.118, 0.354). Efron (1981) notes that this exact
interval will generally agree closely with the percentile bootstrap approach.

One can use the bootstrap to generate intervals for more complicated statis-
tics. The bootstrap techniques described so far, except for the percentile
method, presume that parameter values over the entire real line are possi-
ble. One can account for this through transformation.

Example 10.2.3 A similar approach may be taken to a confidence in-

terval for the standard deviation of nail arsenic values. In this case, first
change to the log scale.
logscale<-function(x,index) return(log(sd(x[index])))
sdbootsamp<-boot(arsenic$nails,logscale,9999)
sdoutput<-boot.ci(sdbootsamp)
Figure 10.1 shows the bootstrap samples; the plot produced by
plot(density(sdbootsamp$t))

shows a highly asymmetric distribution, and the BCa correction for asym-
metry is strong. (As noted before, the actual bootstrap distribution is
supported on a large but finite number of values, and is hence discrete
and does not have a density; the plot is heuristic only.) The output from
boot.ci contains some information not generally revealed using its de-
fault printing method. In particular, sdoutput$bca is a vector with five
numeric components. The first of these is the confidence level. The fourth
and fifth are the resulting confidence interval end points. The second and
third give quantiles resulting from (10.9). The upper quantile is very close
to the maximum value of 9999; boot.ci gives a warning, and serious
176 Resampling Techniques

application of BCa intervals would better be done using more bootstrap

samples. Rerunning with

boot.ci(boot(arsenic$nails,logscale,99999))$bca
gives the BCa 0.95 interval (-1.647,-0.096) for the log of standard devi-
ation of nail arsenic.

FIGURE 10.1: Boot Strap Samples for Log of Nail Arsenic Standard Deviation

2 .
.....
... ..
.. ..
... ....
.. ..
.. ..
.. ..
.. ...
.... ....
. .
.. ..
.. ..
.. ..
1.5 .. ..
... ....
.. ..
... ....
... ....
.... ...
.. ...
... ...
.... ...
...
... ...
... ...
Density 1
...
...
...
..
..
.. ...
.. ..
... ... .. ...
.... ... .... ....
.. ... ..
.. .. . ...
.. .. .. ..
... ..
.. .. ...
...
.. ..
... .. ..
... .. ..
.
. ..
............ ..
. ... ... ...
... ... .. ... .. ..
. .
... ..
..
. ... ..
. ..
..
.
. .. .
. ..
. . .
0.5 .
........
.
.. ..
....
...
.
.
.
.
.
. ...
..
..
.. .
. ..
. .
... .
... ... ..
. ..
. ..
... ... .
.
. ..
.. .. .
. ..
.. .. ..
. ..
..
.. .. .
. ...
.. .. ..
.. .
... .
. ..
... . .
. ..
.
.........
...
...
...
....... ... .
. ...
...
... . .
. ...
...
...
...
...
...
...... ..... ...
..
. ........
0 ...........................
...
... ...........

-3 -2.5 -2 -1.5 -1 -0.5 0

Log Standard Deviation
Kernel Density Estimate

10.2.5 Summary So Far, and More Examples

Table 10.1 contains the observed coverages for nominal 0.90 confidence inter-
vals for the medians of simulated data sets, for various distributions. Random
samples of size 20 were drawn 1000 times from each distribution, and in each
case, 9999 bootstrap replicates were constructed. For each random distribu-
tion, the interval was checked to see whether it contained the true population
median.
The percentile bootstrap performed remarkably well for the exponential
distribution, in light of the interval’s construction assuming symmetry. None
showed significant degradation when data came from a very heavy-tailed
Cauchy distribution.
One can apply many of these inferential techniques to the parametric boot-
strap.
Bootstrapping Multivariate Data Sets 177

TABLE 10.1: Observed coverage for nominal 0.90 Bootstrap intervals, 10 ob-
servations

Normal Basic Percentile BCa

Exponential 0.863 0.772 0.891 0.895
Uniform 0.833 0.754 0.901 0.900
Cauchy 0.917 0.835 0.875 0.862

Example 10.2.4 Consider again the brain-volume data of Exam-

ple 5.2.1. Treat the brain volume differences as having a Gaussian dis-
tribution. Function boot recognizes the parametric context through the
sim="parametric" argument; one specifies the specific parametric as-
sumption by providing a random number generator for bootstrap samples.

cat("\nParametric Bootstrap for median difference\n")

qmed<-function(x,indices) return(median(x[indices]))
ran.diff<-function(x,ests)
return(ests[1]+ests[2]*rnorm(length(x)))
bootout<-boot(brainpairs$diff,qmed,R=999,sim="parametric",
ran.gen=ran.diff,mle=c(mean(brainpairs$diff),
sd(brainpairs$diff)))
boot.ci(bootout,type=c("basic","norm"))
The basic interval is (-30.056, 52.720). The normal interval is almost
identical.

10.3 Bootstrapping Multivariate Data Sets

The bootstrapping idea of the previous section extends directly to more com-
plicated data contexts. The most immediate extension involves resampling
data vectors as a group. When applied in regression contexts, responses
per subject are often denoted by Yj , and explanatory vector by Xj . In
the notation of §10.1, Zj = (Xj , Yj ). Consider model (9.1) and (9.2). The
first approach samples the ensembles (Xj , Yj ) as a whole; that is, if a sub-
ject’s response is selected for the sample, the corresponding explanatory vari-
ables will be also selected, and with the same multiplicity. This is called a
random X bootstrap.

Example 10.3.1 Consider again the brain-volume data of Exam-

ple 5.2.1. We use a random X bootstrap to get confidence intervals for the
178 Resampling Techniques

Pearson correlation between first and second twin brain volumes. First,
define the correlation function

rho<-function(d,idx) return(cor(d[idx,1],d[idx,2]))
and calculate the bootstrap samples
bootout<-boot(brainpairs[,c("v1","v2")],rho,9999)

for pairs of data. Figure 10.2 presents a histogram of the sample, with
vertical lines marking the estimate and percentile confidence interval end
points:
hist(bootout$t,freq=FALSE)
legend(-.4,6,lty=1:2,
legend=c("Estimate","Confidence Bounds"))
abline(v=bootout$t0)
abline(v=sort(bootout$t)[(bootout$R+1)*c(.025,.975)],lty=2)
Confidence intervals may be calculated using
boot.ci(bootout)

to give
Level Normal Basic
95% ( 0.7463, 1.1092 ) ( 0.8516, 1.1571 )

Level Percentile BCa

95% ( 0.6718, 0.9773 ) ( 0.5609, 0.9719 )
To repeat, the percentile interval is represented in Figure 10.2 by broken
lines. The basic (that is, residual) interval is the percentile interval re-
flected about the estimate. This forces the upper end point above 1, and is
outside allowable values for a correlation coefficient. Similarly, the vari-
ance used for the normal interval is increased by the long tail to the left,
pushing the upper bound above 1. The BCa approach is tailored to the
asymmetry in the bootstrap replicates, and is more reliable.

10.3.1 Regression Models and the Studentized Bootstrap

Method
We apply random X bootstrap techniques to inference on a regression pa-
rameter. The ordinary least-squares regression estimator has a standard error
that depends on the covariate patterns in the data set, and hence regres-
sion parameters from random bootstrap samples from a data set will have
different precisions associated with them. Hence members of a collection of
bootstrapped regression parameters are not identically distributed.
Bootstrapping Multivariate Data Sets 179

FIGURE 10.2: Histogram of Bootstrap Samples of Brain Volume Correlations

. .. .
. .. .
. ...........................
. ...
. .
. .
.. ... . ....
. . ....
. .... ....
. ... ... . ....
. ... ... . ...
6 . ... ...
... ...
. ....
. ... ... . ....
. ... ... . ....
. ... ... . ...
. ... ... . ....
. .
...................... Estimate .
.
.
.... ....
... ...
. . ....
. ....
. . . ...
. . . . Confidence Bounds .
.
.
.... ....
. ..
.
. ....
. . ....
. .... ....
. ... ... . ...
. ... ... . ....
. ... ... . ....
4 .
. ..
.
.... .... . ....
. ... ... . ...
. ... ... . ....
Density . ... ...
. ..
.
. ....
. .... .... . ....
. ... ... . ...
. ... ... . ....
. ... ... . ....
.
. .
.
. ........................ ..
.. .. .. . ....
. ... ... ... . ...
. ... ... ... . ....
. ... ... ... . ....
. . ..
.
.
.... .... . ....
2 .
.
....
... ... ... . ...
. ... ... ... . ....
. ... ... ... . ....
.
. . ..
.
. .... .. ... . ...
. . ....
. ... .... ....
. ... ... ... . ....
. .
. .
. .
. . ....
. .... .... .... . ...
.
. . ..
.
. .... .... .... . ....
. ... ... ... . ....
. ......
...
...
...
...
...
... . .
. .
. . ....
.
...
...
...
...
...
...
....
. .... .... .... . ...
. . .
. .
. .
.
................................................................................................................................................................................................ . .. .. .. .. . ...
0
-0.4 -0.2 0 0.2 0.4 0.6 0.8 1
Correlation

Recall the Gaussian-theory approach to constructing confidence intervals

for a regression parameter θ̂ ± t1−α/2 × s.e.(θ̂), where t is a T quantile, with an
appropriate degree of freedom. The ratio (θ̂ − θ)/s.e.(θ̂) is called Studentized,
in that it is divided by an empirical estimator of its variance in order to make
it comparable to a reference distribution. This section applies this idea to
bootstrap sampling.
Suppose that bootstrap samples are still independent, but, as in the pre-
vious section, the variance of the estimator may depend on the unknown
parameter. Suppose furthermore that a practical estimate of the variability of
the estimator, on the scale of the original data, also exists. Then, re-sample
B data sets as above, and for each re-sampled data set, calculate the esti-
mate θ̂B,i , and an estimate of variability σ̂B,i . Let empirical distribution of
TB,i = (θ̂B,i − θ̂)/σ̂B,I stand in for distribution of T = (θ̂ − θ)/σ. This distri-
bution doesn’t involve unknown parameters; such a quantity is called pivotal.
One might apply this in conjunction with the percentile method, as in §10.2.3.
In order to construct a 1 − α confidence interval, let tL and tU be the α/2 and
1 − α/2 quantiles from bootstrap distribution.
Hall (1992, §1.3) notes that, unlike some of the other bootstrapping meth-
ods, the Studentized bootstrap method is not invariant to increasing non-affine
transformations of the parameter; that is, for example, Studentized bootstrap
180 Resampling Techniques

confidence intervals for exp(θ) are not the exponentiation of Studentized boot-
strap confidence intervals for θ, and, furthermore, the quality of the Studen-
tized bootstrap confidence intervals depends on the effectiveness of Studenti-
zation.

Example 10.3.2 Consider the Studentized interval applied to the brain-

volume data of Example 5.2.1. The biggest innovation here is to add
standard error information to the bootstrap samples. In R, this is done
by having the function to be run on the data samples return a vector with
two components. The second component is the measure of variability, and
should be on the squared scale.

#Second component is an estimate of its scale, squared.

regparam<-function(d,idx) return(summary(
lm(d[idx,2]~d[idx,1]))$coefficients[2,1:2]^(1:2))
Now the calculations may be run as before:

boot.ci(boot(brainpairs[,c("v1","v2")],regparam,9999))
to obtain
Level Normal Basic Studentized
95% (0.2836, 1.1388) ( 0.1746, 0.9239) (0.4733, 1.0067)

Level Percentile BCa

95% (0.6296, 1.3788) ( 0.5962, 1.2082)
Note that the warning about missing information for the Studentized in-
terval no longer appears. The appropriate interval is the Studentized in-
terval; use that one.

10.3.2 Fixed X Bootstrap

In certain contexts, resampling explanatory and response variables together
may be unrealistic. For example, one might be analyzing a designed experi-
ment, with a certain number of responses specified in advance for each of a set
number of explanatory variable patterns. In this case, one might want to keep
the observed pattern of explanatory variables, and associate with them re-
sampled responses. A similar idea was used in permutation testing earlier. The
present approach differs from the permutation approach in at least two ways.
Most trivially, the bootstrap approach involves resampling with replacement,
and the permutation approach involves resampling without replacement. More
fundamentally, the sampling distribution for permutation testing needs to hold
only for the null distribution; in contrast, the bootstrap approach should be
constructed to produce a pivotal distribution independent of the value of the
regression parameter. The next example exhibits resampling residuals; some
Bootstrapping Multivariate Data Sets 181

refinements and a second example follow that produce a technique more closely
following the regularity conditions for the bootstrap.

Example 10.3.3 Refer again to the brain size data of Example 5.2.1.
In order to apply the fixed-X bootstrap, first calculate fitted values and
residuals:
residuals<-resid(lm(brainpairs$v2~brainpairs$v1))
fittedv<-brainpairs$v2-residuals
If a function needs a variable that is not locally defined, R looks for it to
be defined globally. Using the function boot inside another function, and
manipulating the data on a level above the bootstrap function but below
the command line, may cause the code to fail, or give something other
than what was intended. In the example below, fittedv and residuals
are passed explicitly to boot, after the data, indices, and bootstrap sample
size, and must be referred to by name.

regparam<-function(data,indices,fittedv,residuals){
y<-fittedv+residuals[indices]
return(summary(lm(y~data[,1]))$coefficients[2,1:2]^(1:2))
}
library(boot)
bootout<-boot(brainpairs[,c("v1","v2")],regparam,9999,
fittedv=fittedv,residuals=residuals)
Again, the Studentized interval is appropriate:
boot.ci(bootout,type="stud")

gives the 0.95 interval (0.5100,1.0509).

Without a hypothesis of a zero regression parameter, residuals remain with

expectation zero, but have a variance that depends on the distance of the
explanatory vector from the center of the set of explanatory vectors. Hence
when performing inference on regression parameters by resampling residuals,
the quantities re-sampled are not independent and identically distributed.
The variance matrix of the residuals is σ 2 (I − H), for X is the matrix of
covariate vectors, as defined in (9.4), and for H = X(X > X)X > (Hoaglin and
Welsch, 1978). One might bring residuals closer to the identically-distributed
case by first dividing the residuals by 1 − hi , for hi the diagonal elements of
H, resampling these rescaled residuals, and then multiplying the re-sampled
residual by 1 − hi . The values of hi are specific to the covariate pattern, and
will generally be applied to residuals arising from different observations when
contributing to the divisor and the multiplier.
182 Resampling Techniques

Example 10.3.4 Refer again to the brain size data of Example 5.2.1.
Modify the approach of Example 10.3.3:

cat("\n Fixed X bootstrap adjusting for variance \n")

modelfit<-lm(brainpairs$v2~brainpairs$v1)
hat<-lm.influence(modelfit)$hat
adjresid<-resid(modelfit)/sqrt(1-hat)
newregparam<-function(data,indices,fittedv,adjresid,hat){
y<-fittedv+sqrt(1-hat)*adjresid[indices]
return(summary(lm(y~data[,1]))$coefficients[2,1:2]^(1:2))
}
bootout<-boot(brainpairs[,c("v1","v2")],newregparam,9999,
fittedv=fittedv,adjresid=adjresid,hat=hat)
boot.ci(bootout,type="student")

gives the 0.95 Studentized interval (0.5654,0.9951). Proper adjustment

for differences in variance results in a tighter confidence interval.
In this case, because the covariate pattern is fixed, the standard error
of the regression coefficient varies between bootstrap samples only because
of differences in the estimated residual standard deviation, and so the
case for the Studentized interval is less compelling. The BCa interval,
(0.6030,0.9416), obtainable by changing the argument to boot.ci, is ap-
propriate in this case.

The above example demonstrates how to adjust for unequal variances for
residuals; adjusting for a lack of independence is more difficult.
One might bootstrap the location difference between two data sets. This
two-sample location model is a nonparametric version of the classical two-
sample t confidence interval setting. It may be mimicked using a linear models
approach (9.1) and (9.2) by using a single covariate vector, taking the value
0 for one group and 1 for the other group, to give the pooled version of t
intervals; the slope parameter is then the difference in locations for the two
groups, and fitted values are group means. In this case, the fixed-X approach
compares the mean difference estimate to other samples with the same number
of observations in the first group as is observed in the data set, with a similar
statement about the second group, and the random-X approach fails to do
this. In this respect, the fixed-X approach is more intuitive.
The same approach may be applied to explore differences among a larger
number of groups. One might use the bootstrap to give confidence bounds for
a wide variety of possible ANOVA summaries.

Example 10.3.5 Refer again to the chicken weight gain data of Ex-
ample 5.4.2. Use fixed-X bootstrap techniques to bound the largest group
mean minus the smallest group mean, when data are split by protein level.
The Jackknife 183

meandiff<-function(data,indices){
fitv<-fitted(lm(data[,1]~as.factor(data[,2])))
residv<-data[,1]-fitv
y<-fitv+residv[indices]
return(diff(range(unlist(lapply(split(y,data[,2]),
"mean")))))
}
attach(chicken)
cat("Bootstrap CI for maximal difference in means\n")
boot.ci(boot(cbind(weight,lev),meandiff,9999))
Because the design is balanced, with an equal number of chickens at each
level of diet, no adjustment for different standard deviations of the resid-
uals is needed. Because the fixed-X bootstrap is employed, Studentization
of the resulting mean differences is not necessary. The resulting BCa
confidence interval is (85.6, 769.5).

A similar approach might have been employed for inference about quantities
like the R2 .

10.4 The Jackknife

Consider a related technique, the jackknife (Quenouille, 1956; Miller, 1974), a
technique to estimate the bias of an estimator, using values of the estimator
evaluated on subsets of the observed sample, under certain conditions on the
form of the bias. Suppose that the bias of estimator Tn of θ based on n
observations is
−3/2
E [Tn ] − θ = a/n + O(n ), (10.10)
for some unknown a. Here O(n−3/2 ) denotes a quantity which, after multipli-
cation by n3/2 , is bounded for all n,

10.4.1 Examples of Biases of the Proper Order

Quenouille (1956) suggested this technique in situations where (10.10) held
with at least one more term of form b/n2 , and with a correspondingly smaller
error. That is, (10.10) is replaced with
2 −5/2
E [Tn ] − θ = a/n + b/n + O(n ). (10.11)

For example, if X1 , . . . , Xn are independent, and identically distributed

with
Pn distribution G(µ, σ 2 ), the maximum likelihood estimator for σ 2 is σ̂ 2 =
2
j=1 (Xj − X̄) /n. Note that σ̃ 2 = nσ̂ 2 /(n − 1) is an unbiased estimator of
184 Resampling Techniques

σ 2 , since Eσ2 σ̂ 2 = (n − 1)σ 2 /n = σ 2 − σ 2 /n. Then (10.10) holds with a = σ 2
and b = 0.
Restriction (10.11) seems unnecessarily strict, and cases in which it holds
are less common. For example, let Wn be an unbiased estimate of a parameter
ω, and let Tn = g(Wn ), for some smooth function g. Let θ = g(ω), and assume
that Var [Wn ] ≈ a/n. Then

g(Wn ) ≈ g(ω) + g 0 (ω)(Wn − ω) + g 00 (ω)(Wn − ω)2 /2

+g 000 (ω)(Wn − ω)3 /6 + g 0000 (ω)(Wn − ω)4 /24,

and

E [g(Wn )] ≈ g(ω) + g 00 (ω)Var [Wn ] /2

+g 000 (ω)E (Wn − ω)3 /6 + g 0000 (ω)E (Wn − ω)4 /24.

√
Then (10.10) holds, and (10.11) holds if the skewness of Wn times n con-
verges to 0.

10.4.2 Bias Correction

∗
Let be the estimator based on the sample of size n − 1 with observation
Tn−1,i
Pn
i omitted. Let T̄n∗ = ∗ ∗
i=1 Tn−1,i /n. Then the bias of T̄n is approximately
∗
a/(n − 1). Under (10.10), B = (n − 1)(T̄n − Tn ) estimates the bias, since

E [B] = (n − 1)(θ + a/(n − 1) − θ − a/n) + O(n−3/2 )

= a/n + O(n−3/2 ).

Then the bias of Tn − B is O(n−3/2 ). Furthermore, under the more stringent

requirement (10.11), the bias is O(n−2 ).

10.4.2.1 Correcting the Bias in Mean Estimators

Sample means are always unbiased estimators of the population expectation.
Consider
Pn application of
Pthe jackknife to the sample mean. In this case, Tn =
∗ n
j=1 Xj /n, Tn−1,i = j=1,j6=i Xj /(n − 1), and

n
X n
X
∗
T̄n−1 = Xj /(n(n − 1)) = X̄,
i=1 j=1,j6=i

and hence the bias estimate is zero.

10.4.2.2 Correcting the Bias in Quantile Estimators

Consider the jackknife bias estimate for the median for continuous data, and,
∗
for the sake of defining the Tn−1,i , let i index the order statistic X(i) .
The Jackknife 185

When the sample size n is even, then Tn = (X(n/2) + X(n/2+1) )/2, and
(
∗ X(n/2+1) if i ≤ n/2
Tn−1,i =
X(n/2) if i ≥ n/2 + 1.
∗
Then T̄n−1 = (X(n/2) + X(n/2+1) )/2 = Tn , and the bias estimate is always 0.
For n odd, then Tn = X((n+1)/2) , and

(X((n−1)/2) + X((n+1)/2) )/2 if i > (n + 1)/2

∗
Tn−1,i = (X((n+3)/2) + X((n+1)/2) )/2 if i < (n + 1)/2

(X((n+3)/2) + X((n−1)/2) )/2 if i = (n + 1)/2,


and the average of results from the smaller sample is

∗ n+1 n+1 n−1
Tn−1 = 4n X((n−1)/2) + 4n X((n+3)/2) + 2n X((n+1)/2) .

Hence
∗ n+1 n+1 n−1

T̄n−1 − Tn = 4n X((n−1)/2) + 4n X((n+3)/2) + 2n X((n+1)/2) −
X((n+1)/2)
n+1
= X((n−1)/2) + X((n+3)/2) − 2X((n+1)/2) , (10.12)
4n
and the bias estimate is
∗ n2 − 1
B = (n − 1)(T̄n−1 − Tn ) = X((n−1)/2) + X((n+3)/2) − 2X((n+1)/2) .
4n

Example 10.4.1 Consider again the nail arsenic data of Example 2.3.2.
Calculate the Jackknife estimate of bias for the mean, median, and
trimmed mean for these data. Jackknifing is done using the bootstrap
library, using the function jackknife.
library(bootstrap)#gives jackknife
jackknife(arsenic$nails,median)

This function produces the 21 values, each with one observation omitted:
$jack.values
[1] 0.2220 0.2220 0.2220 0.2220 0.1665 0.1665 0.2220 0.2220
[9] 0.1665 0.2220 0.2220 0.1665 0.1665 0.1665 0.1665 0.1665
[17] 0.1665 0.2220 0.1665 0.2220 0.2135

Each of these values is the median of 20 observations. Ten of them, corre-

sponding to the omission of the lowest ten values, are the averages of X(10)
and X(11) . Ten of them, corresponding to the omission of the highest ten
values, are the averages of X(9) and X(10) . The last, corresponding to the
186 Resampling Techniques

omission of the middle value, is the average of X(9) and X(11) . The mean
of the jackknife observations is 0.1952. The sample median is 0.1750, and
the bias adjustment is 20 × (0.1952 − 0.1750) = 20 × 0.0202 = 0.404, as
is given by R:
$jack.bias
[1] 0.4033333
This bias estimate for the median seems remarkably large. From, (10.12)
the jackknife bias estimate for the median is governed by the difference
between the middle value and the average of its neighbors. This data set
features an unusually large gap between the middle observation and the
one above it.
Applying the jackknife to the mean via

jackknife(arsenic$nails,mean)
gives the bias correction 0:
$jack.bias
[1] 0

as predicted above. One can also jackknife the trimmed mean:

jackknife(arsenic$nails,mean,trim=0.25)
The above mean is defined as the mean of the middle half of the data, with
0.25 proportion trimmed from each end. Although the conventional mean
is unbiased, this unbiasedness does not extend to the trimmed mean:
$jack.bias
[1] -0.02450216
In contrast to the arsenic example with an odd number of data points,
consider applying the jackknife to the median of the ten brain volume
differences, from Example 5.2.1:
attach(brainpairs);jackknife(diff,median)
to give 0 as the bias correction.

The average effects of such corrections may be investigated via simulation. Ta-
ble 10.2 contains the results of a simulation based on 100,000 random samples
for data sets of size 11. In this exponential case, the jackknife bias correction
over-corrects the median, but appears to address the trimmed mean exactly.
Under some more restrictive conditions, one can also use this idea to esti-
mate the variance of T .
Exercises 187

TABLE 10.2: Expectations of statistic and Jackknife bias estimate

Distribution Statistic T E [T ] E [B] Parameter

Exponential mean 0.998 0.000 1
Exponential median 0.738 0.091 0.693
Exponential 0.25 trimmed mean 0.810 -0.060 0.738

10.5 Exercises
1. The data set

HTTP://ftp.uni-bayreuth.de/math/statlib/datasets/lupus

gives data on 87 lupus patients. The fourth column gives trans-

formed disease duration.
a. Give a 90% bootstrap confidence interval for the mean trans-
formed disease duration, using the basic, Studentized, and BCa ap-
proaches.
b. Give a jackknife estimate of the bias of the mean and of the 0.25
trimmed mean transformed disease duration (that is, the sample
average of the middle half of the transformed disease duration).
2. The data set

HTTP://ftp.uni-
bayreuth.de/math/statlib/datasets/federalistpapers.txt

gives data from an analysis of a series of documents. The first col-

umn gives document number, the second gives the name of a text
file, the third gives a group to which the text is assigned, the fourth
represents a measure of the use of first person in the text, and the
fifth presents a measure of inner thinking. There are other columns
that you can ignore. (The version at Statlib, above, has odd line
breaks. A reformatted version can be found at

stat.rutgers.edu/home/kolassa/Data/federalistpapers.txt ).

a. Calculate a bootstrap confidence interval, with confidence level

.95, for the regression coefficient of inner thinking regressed on first
person. Test at α = .05. Provide basic, Studentized, and BCa inter-
vals. Do the fixed-X bootstrap.
b. Calculate a bootstrap confidence interval, with confidence level
.95, for the regression coefficient of inner thinking regressed on first
188 Resampling Techniques

person. Provide basic, Studentized, and BCa intervals. Do not do

the fixed-X bootstrap; re-sample pairs of data.
c. Calculate a bootstrap confidence interval, with confidence level
.95, for the R2 statistic for inner thinking regressed on first person.
Provide basic and BCa intervals. Do not do the fixed-X bootstrap;
re-sample pairs of data.
A
Analysis Using the SAS System

The preceding chapters detailed statistical computations using R. This ap-

pendix describes parallel computations using SAS. Some of the computations
require the use of macros, available at
http://stat.rutgers.edu/home/kolassa/Data/common.sas .
Include this file before running any macros below, by first downloading into
the current working directory, and adding

%include "/folders/myfolders/common.sas";
to your SAS program before using the macros. Adjust the local folder
/folders/myfolders/ to reflect your local configuration. Furthermore, the
examples use data sets that need to be read into SAS before analysis; reading
the data set is given below the first time the data set is used.
Example 2.3.2: Perform the sign test to evaluate the null hypothesis that
the population median for nail arsenic levels is .26. This is done using proc
univariate:
/**************************************************************/
/* Data from http://lib.stat.cmu.edu/datasets/Arsenic */
/* reformatted into ASCII on the course home page. Data re- */
/* flect arsenic levels in toenail clippings; covariates in- */
/* clude age, sex (1=M), categorical measures of quantities */
/* used for drinking and cooking, arsenic in the water, and */
/* arsenic in the nails. To make arsenic.dat from Arsenic, do*/
/*antiword Arsenic|awk ’((NR>39)&&(NR<61)){print}’>arsenic.dat*/
/* Potential threshold for ill health effects for toenails is */
/* .26 http://www.efsa.europa.eu/de/scdocs/doc/1351.pdf */
/**************************************************************/
data arsenic; infile ’/folders/myfolders/arsenic.dat’;
input age sex drink cook water nails; run;
proc univariate data=arsenic mu0=.26 ciquantdf(alpha=.05);
var nails; run;
Material between /* and */ are ignored by SAS, and are presented
above so that information describing the data set may be included with
the analysis code. The option mu0=.26 specifies the null hypothesis, and

189
190 Analysis Using the SAS System

ciquantdf(alpha=.05) specifies that distribution free intervals for quantiles,

with confidence .95=1-.05, should be produced. The var statement indicates
which variable should be summarized.
The hypothesis test might also have been done using proc freq:
data arsenic; set arsenic;y=nails>.26; run;
proc freq data=arsenic; tables y/binomial(p=.5 level=1);
exact binomial; run;

Example 2.3.3: The proc univariate call above gives intervals for a variety
of quantiles, including quartiles, but not for arbitrary user-selected quantiles.
The following code gives arbitrary quantiles. Manually edit the code below to
replace 21 with the actual number of observations, 0.025 with half the desired
complement of the confidence, and .75 with the desired quantile.
proc sort data=arsenic; by nails; run;
data ci; set arsenic;
a=quantile("binomial",.025,.75,21);
b=21+1-quantile("binomial",.025,1-.75,21);
if _N_=a then output; if _N_=b then output; run;
title ’Upper quartile .95 CIs for nail arsenic’;
proc print data=ci; run;

Example 2.5.1: Calculations for the empirical cumulative distribution function

may be performed using
proc univariate data=arsenic; var nails; cdfplot nails; run;
and graphical output may be adjusted to your local SAS installation as nec-
essary.
Example 3.3.1: This example calculates Mood’s median test for the subset
of yarn strength data coming from bobbin 3. Note that the variable denot-
ing type is a character variable; this is designated by $. First, construct the
reduced data set. Then, use proc npar1way to run the test. This procedure
does the bulk of nonparametric group comparison calculations available in
SAS; two-group comparisons are a special case. Group membership is spec-
ified by the variable specified in the class statement. The median keyword
triggers Mood’s median test.
/********************************************************/
/* Yarn strength data from Example Q of Cox & Snell */
/* (1981). Variables represent strength of two types of*/
/* yarn collected from six different bobbins. */
/********************************************************/
data yarn; infile ’/folders/myfolders/yarn.dat’;
input strength bobbin type $; run;
data yarn1; set yarn;
191

if bobbin=3 then; else delete; run;

title ’Median test applied to bobbin 3 yarn data’;
proc npar1way data=yarn1 median ; class type;
exact; var strength; run;
Compare this with the t-test results.
proc ttest data=yarn1 ; class type; var strength; run;

Example 3.4.1: This example applied the Wilcoxon rank sum test to investi-
gating differences between the strengths of yarn types. At present, analysis is
restricted to bobbin 3, since this data set did not have ties. These score tests
are calculated using proc npar1way , with the option wilcoxon. The first has
continuity correction turned off, with the option correct=no. Correction is on
by default, and so no such option appears in the second run, with correction.
Exact values are given using the exact statement, followed by the test whose
exact p-values are to be calculated.
title ’Wilcoxon Rank Sum Test, yarn strength by type, bobbin 3’;
title2 ’Approximate values, no continuity correction’;
proc npar1way data=yarn1 wilcoxon correct=no ;
class type; var strength; run;
title2 ’Approximate values, continuity correction’;
proc npar1way data=yarn1 wilcoxon ; class type;
var strength; run;
title2 ’Exact values’;
proc npar1way data=yarn1 wilcoxon ; class type;
exact wilcoxon; var strength; run;

Example 3.4.2: Scores are requested as part of proc npar1way statement.

proc npar1way data=arsenic vw savage ;
class sex; var nails; run;

Example 3.4.3: Permutation testing may be done using proc npar1way and
scores=data:
title ’Permutation test for Arsenic Data’;
proc npar1way data=arsenic scores=data ;
class sex; exact scores=data; var nails; run;

Tables 3.4 and 3.5: Test sizes and powers may be calculated using the macro
test2 in the file common.sas, as above. First and second arguments are the
group sizes. The third argument is the Monte Carlo sample size. The last
argument is the offset between groups.
title ’Monte Carlo Assessment of Test Sizes’;
%include "/folders/myfolders/common.sas";
192 Analysis Using the SAS System

%test2(10,10,10000,0); proc print data=size noobs; run;

title ’Monte Carlo Assessment of Test Powers’;
%test2(10,10,10000,1); proc print data=size noobs; run;

Example 3.9.1: The npar1way procedure also performs the Siegel-Tukey and
Ansari-Bradley tests, using the keywords ab and st in the statement.
proc npar1way data=yarn ab st ; class type; var strength; run;

Example 3.10.1: Hodges-Lehmann estimation is performed by npar1way, using

the hl option to the proc npar1way statement. Adding the exact hl first
inverts the Mann-Whitney-Wilcoxon test to determine which order statistics
make up the interval; otherwise, a Gaussian approximation to the Mann-
Whitney-Wilcoxon test is used to get critical values, and confidence interval
end points are interpolated between order statistics.
proc npar1way data=arsenic hl wilcoxon plots=none;
class sex; exact hl ; var nails; run;

Example 3.11.1: Kolmogorov-Smirnov and Cramér-von Mises tests are also

performed by proc npar1way, by specifying edf in the statement:
proc npar1way data=yarn edf; class type; var strength; run;

Examples 4.3.1 and 4.4.1: The Kruskal-Wallis test, and the variants using
the Savage and the van der Waerden scores, can be performed using proc
npar1way.
data maize; infile ’/folders/myfolders/T58.1’ ;
input exno tabno lineno loc $ block $ plot
treat $ ears weight;
nitrogen=substr(treat,1,1); if weight<0 then weight=.; run;
data tean; set maize; if loc="TEAN" then; else delete;
if weight=. then delete; run;
title ’Kruskal Wallis H Test for Maize Data’;
proc npar1way wilcoxon savage vw data=tean plots=none;
class treat; var weight; /* exact;*/ run;
The option exact is commented out above; depending on the configura-
tion of the example, exact computation can be very slow, and resource-
intensive enough to make the calculations fail. The option wilcoxon triggers
the Kruskal-Wallis test, since it uses Wilcoxon scores. These calculations may
be compared with the standard Analysis of Variance approach.
proc anova data=tean; class treat; model weight=treat; run;

Example 4.4.2: Here are two calls to npar1way that give in principle the same
answer. The exact command causes exact p-values to be computed. If you
193

specify scores (in this case data) in the exact statement, you’ll get that exact
p-value. If you don’t specify scores, you’ll get the exact p-value for the scores
specified in the proc npar1way statement. If you don’t specify either, you’ll
get ranks without scoring. As we saw before, exact computations are hard
enough that SAS quits. The second proc npar1way presents a compromise:
random samples from the exact distribution. Give the sample size you want
after /mc . The two calls give different answers, since one is approximate.
Successive calls will give slightly different answers.

title ’Permutation test for Maize Data’;

proc npar1way data=tean scores=data plots=none; class treat;
exact ; var weight; run;
title1 ’Maize Permutation test, Monte Carlo Version’;
proc npar1way data=tean scores=data plots=none; class treat;
exact scores=data/mc n=100000 ; var weight; run;

Example 4.7.5: Here we test for an ordered effect of nitrogen, and compare
with the unordered version.
title ’Jonckheere-Terpstra Test for Maize’;
proc freq data=tean noprint; tables weight*nitrogen/jt;
output out=jttab jt; run;
proc print data=jttab noobs; run;
title ’K-W Test for Maize, to compare with JT’;
proc npar1way data=tean plots=none; class nitrogen;
var weight; run;

Example 5.2.1: Perform a paired test on brain volume, assuming symmetry.

First, read the data:

data twinbrain; infile ’twinbrain.dat’;

input CCMIDSA FIQ HC ORDER PAIR SEX TOTSA TOTVOL WEIGHT; run;
data fir; set twinbrain; v1=totvol; if order=1 then output; run;
data sec; set twinbrain; v2=totvol; if order=2 then output; run;
data brainpair; merge fir sec; by pair; diff=v2-v1; run;

title ’1-sample sign, T, and Wilcoxon tests for brain volume’;

proc univariate data=brainpair; var diff; run;

Examples 5.4.1 and 5.6.1: Perform a paired test on brain volume, assuming
symmetry. First, read the data:
data expensesd; infile ’/folders/myfolders/friedman.dat’;
input cat $ g1 g2 g3 g4 g5 g6 g7; run;
Convert to a set with one observation per line, with level l as well:
194 Analysis Using the SAS System

data friedman; set expensesd; cate=_n_;

l=1; sd=g1; output; l=2; sd=g2; output; l=3; sd=g3; output;
l=4; sd=g4; output; l=5; sd=g5; output; l=6; sd=g6; output;
l=7; sd=g7; output; run;
Now calculate the test statistics:
proc freq data=friedman;
table cate*l*sd/cmh2 noprint score=rank; run;
Friedman’s test is the six degree of freedom test. Page’s test, in the non-
replicate balanced case, coincides with the one-degree of freedom correlation
test above.
Example 6.3.1: This code performs the permutation test. All intermediate
values are saved, resulting in an inefficient use of computer memory.
data big; set brainpair;
do i=1 to 200000 ; u=rand("uniform"); output; end; run;
* Here we have to sort three times. When big is created,;
* set contains 200000 lines for twin 1, then 200000 for;
* twin 2, etc. We will need to rank observations by us, within;
* each simulated data set, and proc rank needs these sorted;
* by i. Then we need to associate the new ranks of second obs.;
* with the old first obs. by merging by subject, and so we sort;
* data by subject. Finally, we need to calculate correlations;
* separately for each simulated data set, and so we sort by i.;
* There has to be a more efficient way to do this, but I can;
* not figure it out.;
proc sort data=big; by i; run;
proc rank data=big out=big; var u; by i; ranks pair; run;
proc sort data=big; by pair; run;
data big; merge big (drop=v1) brainpair (drop=v2); by pair; run;
proc sort data=big; by i; run;
proc reg data=big outest=rgout noprint; model v2=v1; by i; run;
proc reg data=brainpair outest=oneout noprint; model v2=v1; run;
data oneout; set oneout ; v10=v1; keep v10 _type_; run;
data rgout ; merge oneout rgout; by _type_; a=1;
if v1<v10 then a=0; run;
proc means data=rgout; var a; run;
The permutation p-value is produced as the means in the last call above.
Example 6.3.1: Only the parametric analysis is shown below.
title ’One-sample Multivariate analyses’;
* The h in manova statement below is syntax and NOT a variable
* in the data set. Exact does NOT do a permutation test.;
proc glm data=bp; model spd dpd =;
manova h=INTERCEPT/mstat=exact; run;
195

Example 7.5.1: Calculate the parametric analysis for comparison purposes:

title ’One-sample Multivariate analyses’;

* The h in manova statement below is syntax and NOT a variable
* in the data set. Exact does NOT do a permutation test.;
proc glm data=bp; model spd dpd =;
manova h=INTERCEPT/mstat=exact; run;

Brute-force permutation testing of the wheat data may be performed using

proc sort data=big; by i; run;
proc rank data=big out=big; var u; by i; ranks item; run;
proc sort data=big; by item; run;
data big; merge big (drop=region) wheat; by item; run;
proc sort data=big; by i; run;
* Block nearly identical to the permutation test for;
* correlation ends.;
* a and h following the model statement are variable names;
* and are short for Atou and Huntsman. h following manova;
* is part of the manova syntax and will be the same regard-;
* less of variable names.;
proc glm data=big; class region; by i; model a h=region;
ODS output MultStat=a;
manova h=region/mstat=exact; run;
proc ttest data=big plots=none;
ODS output TTests=ttestout;
class region; by i; var a h; run;
data ttestout; set ttestout; keep i Variable tValue;
if method=’Satterthwaite’ then delete; run;
proc means data=ttestout max noprint; by i;var tValue;
output out=ttmax max=tValue; run;
* Data set a has four lines per permutation. We can keep any;
* of these. It does not matter which.;
data a; set a; by i; if first.i then; else delete; run;
data both; merge a ttmax; by i;
keep FValue tValue i Hypothesis; run;
data obss; set obss; TObs=tValue;FObs=FValue;
keep TObs FObs Hypothesis; run;
* Merge by any variable that is constant in testout. Proc glm;
* with manova creates a variable Hypothesis that does this.;
data both ; merge both obss; by Hypothesis;
countf=0; if FValue ge FObs then countf=1;
countt=0; if tValue ge TObs then countt=1; run;
proc means data=both mean; var countt countf; run;

Example 8.2.1: SAS does density estimation via proc univariate. Only graph-
196 Analysis Using the SAS System

ical output is required, and so noprint is specified. The histogram is plotted

by default.

proc univariate data=arsenic noprint;

histogram nails/kernel (k=normal )
kernel(k=triangular )
kernel(k=quadratic ) ; run;
title ’Kernel density est. with an excessively small bandwidth’;
proc univariate data=arsenic noprint;
histogram nails/kernel (k=normal color=blue c=.001) ; run;
The following is another way to draw density estimates. This method allows
only the normal kernel. Syntax has changed with versions of SAS, and so
this will run only with recent versions. There is no reason to see the text
output, but suppressing text output also suppresses graphics. There is no
overall noprint option.
title1 ’Kernel smoother for arsenic via kde’;
proc kde data=arsenic ; univar nails/plots=density; run;

Example 9.2.2: The default degree in loess in SAS is 1.

proc loess data=arsenic plots(only)=fit;

model nails=water/degree=2 all; run;
proc loess data=both plots(only)=fit; model v2=v1; run;

Example 9.3.1: Various sources suggest that isotonic regression can be done
with the macro at
http://www.bios.unc.edu/distrib/redmon/SASv8/redmon2.sas .

Include the macro using

%include "redmon2.sas"
I was not able to make this work.
Example 9.4.1: Fit the spline using
proc transreg data=arsenic ;
model identity(nails)=spline(water); run;
The option noprint suppresses graphical output, and so it is not used.
Example 9.5.1: Perform these calculations using
proc quantreg data=arsenic; model nails=water; run;

Example 9.5.2: Fit the median and 0.2 quantile regression estimates:
197

data bp; infile ’/folders/myfolders/bloodpressure.dat’

firstobs=2; input spb spa spd dpb dpa dpd; run;
title ’Regression for Median of Systolic After’;
proc quantreg data=bp; model spa=spb; run;
title ’Regression for .2 Quantile of Systolic After’;
proc quantreg data=bp; model spa=spb/quantile=.2;
Example 10.2.2: Bootstrapping is done via a macro:
*Documentation at http://support.sas.com/kb/24/982.html ,code;
*support.sas.com/kb/24/addl/fusion_24982_1_jackboot.sas.txt ;
*This works similarly to the R bootstrap. You need to tell the;
*macro what statistic to use. In R this is via a function, but;
*in SAS this is via a macro.;
%include "/folders/myfolders/fusion_24982_1_jackboot.sas.txt" ;
and to use it, analogously with the R function boot, one writes a macro to do
the calculations for each bootstrap samples. This macro has two arguments,
the input data set and the output data set. These are referred to in the code
within the macro by preceding their names by &:
%macro analyze(data=,out=);
proc univariate data=&data noprint; var nails;
output median=median qrange=iqr out=&out; run;
%mend;
where, unlike R, you need to name your macro analyze. Run this on the
observed data set to see what you get:
%analyze(data=arsenic,out=results);
proc print data=results; run;
and then apply the macro that calls it 999 times on bootstrap samples
%boot(data=arseniarsenic,samples=999);
and after creating the bootstrap samples, evaluate the confidence interval
using %bootci. This macro has at least two arguments. The first is a keyword.
Use Hybrid for the residual method, pctl for the percentile method, t for
the Studentized method, and BCa for BCa. Do not put quotes around the
keyword. Second is the name of the variable in the data set passed from the
sampler to the interval creator giving the name of the statistic on which to
do calculations. Third (if present, and using argument name student) is the
divisor for Studentization. This quantity must be a variable in the data set
created by %analyze.
* Suppress printing with arg. print=0.;
%bootci(pctl,stat=median);
%bootci(Hybrid,stat=median);
%bootci(BCa,stat=median);
%bootci(t,stat=median,student=iqr);
198 Analysis Using the SAS System

Example 10.4.1: Again we’ll use the SAS input file as above. This works sim-
ilarly to the R jackknife. You need to tell the macro what statistic to use. In
R this is via a function, but in SAS this is via a macro.
%macro analyze(data=,out=);
proc means noprint median data=&data vardef=n;
output out=&out(drop=_freq_ _type_) median=med ;
var nails ;
%bystmt;
run;
%mend;
%jack(data=arsenic,chart=0)
B
Construction of Heuristic Tables and
Figures Using R

Some of the preceding tables and figures were presented, not as tools in the
analysis of specific data sets, but as tools for comparing and describing various
statistical methods. Calculations producing these tables and figures in R are
given below. Some other of the preceding tables and figures were specific to
certain data sets, but are of primarily heuristic value and are not required
for the data set to which they apply. Commands to produce these tables
and figures are also given below. Before running these commands, load two
R packages, MultNonParam from CRAN, and NonparametricHeuristic from
github:
install.packages(c("MultNonParam","devtools"))
library(devtools)
install_github("kolassa-dev/NonparametricHeuristic")
library(MultNonParam); library(NonparametricHeuristic)

Figure 1.1:
fun.comparedensityplot()

Table 2.1:

library(VGAM)#vgam gives laplace distribution

library(BSDA)#Library gives z test
#Below the shift by -.5 for the uniform centers it at 0.
level<-fun.comparepower(samp1sz=c(10,17,40),nsamp=100000,
dist=list("rnorm","rcauchy","rlaplace","runif"),
hypoth=c(0,0,0,-.5),alternative=c("two.sided","greater"))
print(level)

Figure 2.1:
fun.studentizedcaucyplot(10,10000)

Table 2.2:
fun.achievable()

199
200 Construction of Heuristic Tables and Figures Using R

Figure 2.2:
drawccplot()

Table 2.3:
mypower<-array(NA,c(4,3,3))
nobsv<-c(10,17,40)
for(jj in seq(length(nobsv))){
temp<-fun.comparepower(samp1sz=nobsv[jj], nsamp=100000,
dist=list("rnorm","rcauchy","rlaplace"),
hypoth=(1.96+.85)*c(1,sqrt(2),1)/sqrt(nobsv[jj]))
if(jj==1) dimnames(mypower)<-list(dimnames(temp)[[1]],
dimnames(temp)[[2]],as.character(nobsv))
mypower[,,jj]<-temp[,,1,1,1]
}
cat("\nPower for T and Sign Tests \n")
print(mypower)

Table 2.4:
library(VGAM); testare(.5)

Table 3.4:
fun.comparepower(samp1sz=10,samp2sz=10,altvalue=0)

Table 3.5:
fun.comparepower(samp1sz=10,samp2sz=10,altvalue=1)

Figures 4.1 and 4.2:

showmultigroupscoretest(c(3,3,4))
showmultigroupscoretest(c(3,3,4),fun.givescore(1:10,sv="ns"))

Figure 4.3:
powerplot()

Figure 5.1:
hodgeslehmannexample()

Figure 6.2:
x<-(-10):10; y<-x ; z<-5*atan(x)
plot(range(x),range(c(y,z)),type="n")
points(x,y,pch=1); points(x,z,pch=2)
legend(0,-5,pch=1:2,legend=c("y=x","y=5 atan(x)"))
201

Figure 7.1:

y<-x<-rnorm(100)
coin<-rbinom(100,1,.5)
y[coin==0]<--x[coin==0]
par(oma=c(0,0,3,0))
par(mfrow=c(2,2))
p1<-qqnorm(x,main="Marginal Distribution for X")
p2<-qqnorm(y,main="Marginal Distribution for Y")
p3<-qqnorm((x+y)/sqrt(2),
main="Marginal Distribution for (X+Y)/sqrt(2)")
p4<-qqnorm((x-y)/sqrt(2),
main="Marginal Distribution for (X-Y)/sqrt(2)")

Tables 7.1 and 7.2:

cat(’\n Size of Multivariate Tests\n’)

library(ICSNP)
fun.comparepower(c(20,40),ndim=2,nsamp=20000,
dist=c("rnorm","rcauchy","rlaplace"))
cat(’\n Power of Multivariate Tests\n’)
fun.comparepower(c(20,40),ndim=2,nsamp=20000,
dist=c("rnorm","rcauchy","rlaplace"),hypoth=.5)

Table 9.1:
distv<-c("rnorm","rcauchy","rlaplace","runif","rexp")
t1<-fun.testreg(dists=distv)
t2<-fun.testreg(dists=distv,npergp=50)

Warnings in the calculation of t1 note non-uniqueness of the quantile regres-

sion solutions.
Table 10.1:
fun.testboot(function(x,index) return(median(x[index])),
alpha=.1,sampsize=20, mcsamp=1000,
dists=c("rexp","runif","rcauchy"),true=c(log(2),.5,0))

Table 10.2:

library(bootstrap)#To give the jackknife function.

testjack(list(mean,median,mean),dists=list(rexp,rnorm),
others=list(NULL,NULL,list(trim=0.25)),nsamp=100000)
Bibliography

Abramovitch, L. and K. Singh (1985, 03). Edgeworth corrected pivotal statis-

tics and the bootstrap. The Annals of Statistics 13 (1), 116–132.

Andrews, D. F. and A. M. Herzberg (1985). Data: A Collection of Prob-

lems from Many Fields for the Student and Research Worker. New York:
Springer–Verlag.
Arbuthnott, J. (1712). An argument for divine providence, taken from the con-
stant regularity observed in the births of both sexes. Philosophical Trans-
actions of the Royal Society of London 7, 186–190. Reprinted in Kendall
and Plackett (1977).
Benard, A. and P. V. Elteren (1953). A generalization of the method of
m rankings. Proceedings of the Koninklijke Nederlandse Akademie van
Weteschappen. Series A 56 (Indagiones Mathematicae) 15, 358–369.
Bennett, B. M. (1965, Dec). On multivariate signed rank tests. Annals of the
Institute of Statistical Mathematics 17 (1), 55–61.
Best, D. J. and P. G. Gipps (1974). Algorithm as 71: The upper tail proba-
bilities of Kendall’s tau. Journal of the Royal Statistical Society. Series C
(Applied Statistics) 23 (1), 98–100.
Best, D. J. and D. E. Roberts (1975). Algorithm as 89: The upper tail prob-
abilities of Spearman’s rho. Journal of the Royal Statistical Society. Series
C (Applied Statistics) 24 (3), 377–379.

Best, M. J. and N. Chakravarti (1990). Active set algorithms for isotonic

regression: A unifying framework. Mathematical Programming 47, 425–439.
Bickel, P. J. (1965, 02). On some asymptotically nonparametric competitors
of Hotelling’s T 2 . The Annals of Mathematical Statistics 36 (1), 160–173.
Brunk, H. D. (1955, 12). Maximum likelihood estimates of monotone param-
eters. The Annals of Mathematical Statistics 26 (4), 607–616.
Chen, X. and J. Kolassa (2018). Various improved approximations to distri-
butions of quadratic test statistics for dependent rank sums. Biomedical J.
of Scientific and Technical Research 9.

203
204 Bibliography

Cleveland, W. S. (1979). Robust locally weighted regression and smoothing

scatterplots. Journal of the American Statistical Association 74 (368), 829–
836.
Cleveland, W. S. and S. J. Devlin (1988). Locally weighted regression: An
approach to regression analysis by local fitting. Journal of the American
Statistical Association 83 (403), 596–610.

Clopper, C. J. and E. S. Pearson (1934, 12). The use of confidence or fiducial

limits illustrated in the case of the binomial. Biometrika 26 (4), 404–413.
Conover, W. J. and R. L. Iman (1979). On multiple-comparisons procedures.
Technical report, Los Alamos Scientific Laboratory, Los Alamos, NM.
Cox, D. R. and E. J. Snell (1981). Applied Statistics: Principles and Examples.
New York: Chapman and Hall.
Cramér, H. (1946). Mathematical Methods of Statistics. Princeton University
Press.
David, S. T., M. G. Kendall, and A. Stuart (1951). Some questions of distri-
bution in the theory of rank correlation. Biometrika 38 (1/2), 131–140.
Davison, A. C. and D. V. Hinkley (1997). Bootstrap Methods And Their
Application (first ed.). Cambridge Series in Statistical and Probabilistic
Mathematics. Cambridge University Press.
Dinneen, L. C. and B. C. Blakesley (1973). Algorithm as 62: A generator for
the sampling distribution of the Mann-Whitney U statistic. Journal of the
Royal Statistical Society. Series C (Applied Statistics) 22 (2), 269–273.
Dunn, O. J. (1964). Multiple comparisons using rank sums. Technomet-
rics 6 (3), 241–252.

Dwass, M. (1956, 06). The large-sample power of rank order tests in the two-
sample problem. The Annals of Mathematical Statistics 27 (2), 352–374.
Dwass, M. (1985). On the convolution of Cauchy distributions. The American
Mathematical Monthly 92 (1), 55–57.
Dykstra, R. L. (1981). An isotonic regression algorithm. Journal of Statistical
Planning and Inference 5, 355–363.
Edgeworth, F. Y. (1893). Viii. exercises in the calculation of errors. The
London, Edinburgh, and Dublin Philosophical Magazine and Journal of Sci-
ence 36 (218), 98–111.

Efron, B. (1981). Nonparametric standard errors and confidence intervals.

Canadian Journal of Statistics 9 (2), 139,158.
Bibliography 205

Efron, B. (1987). Better bootstrap confidence intervals. Journal of the Amer-

ican Statistical Association 82 (397), 171–185.
Efron, B. and R. J. Tibshirani (1993). An Introduction to the Bootstrap. New
York: Chapman and Hall.
El Maache, H. and Y. Lepage (2003). Spearman’s rho and Kendall’s tau for
multivariate data sets. In M. Moore, S. Froda, and C. Léger (Eds.), Math-
ematical Statistics and Applications: Festschrift for Constance van Eeden,
Volume 42 of Lecture Notes–Monograph Series, Beachwood, OH, pp. 113–
130. Institute of Mathematical Statistics.
Epanechnikov, V. (1969). Non-parametric estimation of a multivariate prob-
ability density. Theory of Probability & Its Applications 14 (1), 153–158.
Erdös, P. and A. Réyni (1959). On the central limit theorem for samples
from a finite population. Publications of the Mathematical Institute of the
Hungarian Academy of Sciences 4, 49–61.
Fan, J. (1992). Design-adaptive nonparametric regression. Journal of the
American Statistical Association 87 (420), 998–1004.
Festinger, L. (1946). The significance of difference between means without
reference to the frequency distribution function. Psychometrika 11 (2), 97–
105.
Fieller, E. C. (1954). Some problems in interval estimation. Journal of the
Royal Statistical Society. Series B (Methodological) 16 (2), 175–185.
Fisher, R. A. (1925). Statistical Methods for Research Workers. Edinburgh:
Oliver and Boyd.
Fisher, R. A. (1926). On a distribution yielding the error functions of sev-
eral well known statistics. In Proceedings of the International Mathematical
Congress, pp. 805–813. Congress ran 1924. Fisher, Statistical Methods for
Research Workers, dates it to 1924. Hald dates to 1928. Addendum in col-
lected papers is dated 1927.
Fisher, R. A. (1930). Statistical Methods for Research Workers (Third ed.).
Edinburgh: Oliver and Boyd.
Fisher, R. A. (1973). Statistical Methods for Research Workers (Fourteenth
(reprinted) ed.). Edinburgh: Oliver and Boyd.
Friedman, M. (1937). The use of ranks to avoid the assumption of normality
implicit in the analysis of variance. Journal of the American Statistical
Association 32 (200), 675–701.
Hájek, J. (1960). Limiting distributions in simple random sampling from a
finite population. Publications of the Mathematical Institute of the Hungar-
ian Academy of Sciences 5, 361–374.
206 Bibliography

Hald, A. (1998). A History of Mathematical Statistics from 1750 to 1930.

John Wiley and Sons, Inc.

Hall, P. (1988, 10). Rate of convergence in bootstrap approximations. The

Annals of Probability 16 (4), 1665–1684.
Hall, P. (1992). The Bootstrap and Edgeworth Expansion. Springer-Verlag.
Haynam, G. E., Z. Govindarajalu, F. C. Leone, and P. Siefert (1982). Ta-
bles of the cumulative non-central chi-square distribution - part 2. Series
Statistics 13 (4), 577–634.
Haynam, G. E., Z. Govindarajulu, F. C. Leone, and P. Siefert (1982). Ta-
bles of the cumulative non-central chi-square distribution - part l. Series
Statistics 13 (3), 413–443.

Hettmansperger, T. and J. McKean (2011). Robust Nonparametric Statistical

Methods. Boca Raton: CRC Press.
Hettmansperger, T. P. (1984). Statistical Inference Based on Ranks. Mel-
bourne, FL: Krieger.

Higgins, J. J. (2004). Introduction to Modern Nonparametric Statistics. Cen-

gage Learning.
Hoaglin, D. C. and R. E. Welsch (1978). The hat matrix in regression and
anova. The American Statistician 32 (1), 17–22.

Hodges, J. L. and E. L. Lehmann (1963). Estimates of location based on rank

tests. The Annals of Mathematical Statistics 34 (2), 598–611.
Hoeffding, W. (1948, 09). A class of statistics with asymptotically normal
distribution. The Annals of Mathematical Statistics 19 (3), 293–325.
Hotelling, H. (1931, 08). The generalization of student’s ratio. The Annals of
Mathematical Statistics 2 (3), 360–378.
Hotelling, H. and M. R. Pabst (1936, 03). Rank correlation and tests of signif-
icance involving no assumption of normality. The Annals of Mathematical
Statistics 7 (1), 29–43.

Huber, P. J. (1964, 03). Robust estimation of a location parameter. The

Annals of Mathematical Statistics 35 (1), 73–101.
Johnson, N. L., S. Kotz, and N. Balakrishnan (1995). Continuous Univariate
Distributions (Second ed.), Volume 2. Wiley-Interscience.
Jonckheere, A. R. (1954). A distribution-free k-sample test against ordered
alternatives. Biometrika 41 (1/2), 133–145.
Bibliography 207

Kaarsemarker, L. and A. van Wijngaarden (1953). Tables for use in rank

correlation. Statistica Neerlandica 7, 41–54.
Kawaguchi, A., G. G. Koch, and X. Wang (2011). Stratified multivariate
Mann–Whitney estimators for the comparison of two treatments with ran-
domization based covariance adjustment. Statistics in Biopharmaceutical
Research 3 (2), 217–231.
Kendall, M. G. (1938). A new measure of rank correlation. Biometrika 30 (1-
2), 81–93.
Koenker, R. (2000). Galton, Edgeworth, Frisch, and prospects for quantile
regression in econometrics. Journal of Econometrics 95 (2), 347 – 374.
Köhler, M., A. Schindler, and S. Sperlich (2014). A review and comparison of
bandwidth selection methods for kernel regression. International Statistical
Review 82 (2), 243–274.
Kolassa, J. E. (2006). Series Approximation Methods in Statistics, 3rd Edn.
New York: Springer – Verlag.
Kolassa, J. E. and Y. Seifu (2013). Nonparametric multivariate inference on
shift parameters. Academic Radiology 20 (7), 883 – 888.
Kramer, C. Y. (1956). Extension of multiple range tests to group means with
unequal numbers of replications. Biometrics 12 (3), 307–310.
Kramer, C. Y. (1957). Extension of multiple range tests to group correlated
adjusted means. Biometrics 13 (1), 13–18.
Kruskal, W. H. and W. A. Wallis (1952). Use of ranks in one-criterion variance
analysis. Journal of the American Statistical Association 47 (260), 583–621.
Laplace, P. S. D. (1818). Deuxiéme Supplḿent a la Théorie Analytique des
Probablilitiés. Paris: Courcier.
Lehmann, E. L. (1953). The power of rank tests. The Annals of Mathematical
Statistics 24 (1), 23–43.
Lehmann, E. L. (1993). The Fisher, Neyman-Pearson theories of testing hy-
potheses: One theory or two? Journal of the American Statistical Associa-
tion 88 (424), 1242–1249.
Lehmann, E. L. (2006). Nonparametrics: Statistical Methods Based on Ranks
(First edition revised ed.). Springer.
Mann, H. B. and D. R. Whitney (1947, 03). On a test of whether one of
two random variables is stochastically larger than the other. The Annals of
Mathematical Statistics 18 (1), 50–60.
Miller, R. G. (1974). The jackknife–a review. Biometrika 61 (1), 1–15.
208 Bibliography

Mood, A. M. (1950). Introduction to the Theory of Statistics (First ed.). New

York: McGraw Hill.
Nadaraya, E. A. (1964). On estimating regression. Theory of Probability &
Its Applications 9 (1), 141–142.
Nemenyi, P. (1963). Distribution-Free Multiple Comparisons. Ph. D. thesis,
Princeton University.
Neyman, J. and E. S. Pearson (1933). On the problem of the most efficient
tests of statistical hypotheses. Philosophical Transactions of the Royal Soci-
ety of London. Series A, Containing Papers of a Mathematical or Physical
Character 231, 289–337.
Noether, G. E. (1950, 06). Asymptotic properties of the Wald-Wolfowitz test
of randomness. The Annals of Mathematical Statistics 21 (2), 231–246.
Oja, H. and R. H. Randles (2004, 11). Multivariate nonparametric tests.
Statist. Sci. 19 (4), 598–605.
Page, E. B. (1963). Ordered hypotheses for multiple treatments: A signifi-
cance test for linear ranks. Journal of the American Statistical Associa-
tion 58 (301), 216–230.
Pearson, E. S. (1931). The analysis of variance in cases of non-normal varia-
tion. Biometrika 23 (1/2), 114–133.
Pearson, K. (1907). On Further Methods of Determining Correlation. Drapers’
company research memoirs. Biometric ser. IV. London: Dulau and Co.
Pitman, E. J. G. (1948). Lectures on nonparametric statistical inference.
Columbia University.
Prentice, M. J. (1979). On the problem of m incomplete rankings.
Biometrika 66 (1), 167–170.
Quenouille, M. H. (1956). Notes on bias in estimation. Biometrika 43 (3/4),
353–360.
R Core Team (2018). R: A Language and Environment for Statistical Com-
puting. Vienna, Austria: R Foundation for Statistical Computing.
Randles, R. H. (1989). A distribution-free multivariate sign test based on
interdirections. Journal of the American Statistical Association 84 (408),
1045–1050.
Rosenblatt, M. (1956, 09). Remarks on some nonparametric estimates of a
density function. The Annals of Mathematical Statistics 27 (3), 832–837.
Sankaran, M. (1963, 06). Approximations to the non-central chi-square dis-
tribution. Biometrika 50 (1-2), 199–204.
Bibliography 209

SAS Institute Inc. (2017). SAS/STAT 14.3 User’s Guide. Cary, NC: SAS
Institute Inc.

Savage, I. R. (1956, 09). Contributions to the theory of rank order statistics-

the two-sample case. The Annals of Mathematical Statistics 27 (3), 590–615.
Savitzky, A. and M. J. E. Golay (1964). Smoothing and differentiation of
data by simplified least squares procedures. Analytical Chemistry 36 (8),
1627–1639.
Schoenberg, I. J. (1946). Contributions to the problem of approximation of
equidistant data by analytic functions. Part A. On the problem of smoothing
or graduation. A first class of analytic approximation formulae. Quarterly
of Applied Mathematics 4, 46–99.

Scott, D. W. (1979, 12). On optimal and data-based histograms.

Biometrika 66 (3), 605–610.
Sen, P. K. (1968). Estimates of the regression coefficient based on Kendall’s
tau. Journal of the American Statistical Association 63 (324), 1379–1389.

Shao, J. and D. Tu (1995). The Jackknife and Bootstrap (first ed.). Springer
Series in Statistics. New York: Springer-Verlag.
Sheather, S. J. and M. C. Jones (1991). A reliable data-based bandwidth selec-
tion method for kernel density estimation. Journal of the Royal Statistical
Society. Series B (Methodological) 53 (3), 683–690.

Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis.

Monographs on Statistics and Applied Probability. London: Chapman and
Hall.
Skillings, J. H. and G. A. Mack (1981). On the use of a Friedman-type statistic
in balanced and unbalanced block designs. Technometrics 23 (2), 171–177.

Small, C. G. (1990). A survey of multidimensional medians. International

Statistical Review / Revue Internationale de Statistique 58 (3), 263–277.
Snedecor, G. W. (1934). Calculation and Interpretation of Analysis of Vari-
ance and Covariance. Collegiate Press, Inc.

Spearman, C. (1904). The proof and measurement of association between two

things. The American Journal of Psychology 15 (1), 72–101.
Stieltjes, T. J. (1894). Recherches sur les fractions continues. Annales de la
Faculté des Sciences de Toulouse 8, J1–J122.

Stigler, S. M. (1973). Studies in the history of probability and statistics.

xxxii: Laplace, Fisher and the discovery of the concept of sufficiency.
Biometrika 60 (3), 439–445.
210 Bibliography

Stigler, S. M. (1984, 12). Studies in the history of probability and statistics xl

boscovich, simpson and a 1760 manuscript note on fitting a linear relation.
Biometrika 71 (3), 615–620.
Stigler, S. M. (1986). The History of Statistics: The Measurement of Uncer-
tainty Before 1900. Cambridge: Harvard University Press.
Student (1908). Probable error of a correlation coefficient. Biometrika 6 (2/3),
302–310.
Sturges, H. A. (1926). The choice of a class interval. Journal of the American
Statistical Association 21 (153), 65–66.
Terpstra, T. J. (1952). The asymptotic normality and consistency of Kendall’s
test against trend, when ties are present in one ranking. Indagationes Math-
ematicae 14, 327–333.
Theil, H. (1950). A rank-invariant method of linear and polynomial regression
analysis. i. Indagationes Mathematicae 12, 85–91.
Tramo, M., W. Loftus, T. Stukel, R. Green, J. Weaver, and M. Gazzaniga
(1998, May). Brain size, head size, and intelligence quotient in monozygotic
twins. Neurology 50 (5), 1246–52.
Tukey, J. W. (1949). Comparing individual means in the analysis of variance.
Biometrics 5 (2), 99–114.
Tukey, J. W. (1953). The problem of multiple comparisons : introduction
and parts a, b, and c. Photocopy of typescript from Princeton University
eventually published in 1993.
Tukey, J. W. (1993). Reminder sheets for “allowances for various types of
error rates”. In H. I. Braun (Ed.), The Collected Works of John W. Tukey
Volume VIII: Multiple Comparisons, 1948-1983, pp. 335–339. New York,
NY: Chapman and Hall.
van de Wiel, M. and A. Di Bucchianico (2001). Fast computation of the exact
null distribution of Spearman’s rho and Page’s L statistic for samples with
and without ties. Journal Of Statistical Planning And Inference 92 (1-2),
133–145.
Waerden, B. L. V. D. (1952). Order tests for the two-sample problem and
their power. Indagationes Mathematicae 14, 453–458.
Watson, G. S. (1964). Smooth regression analysis. Sankhyā: The Indian
Journal of Statistics, Series A 26 (4), 359–372.
Westenberg, J. (1948). Significance test for median and interquartile range in
samples from continuous populations of any form. In Proceedings for Konin-
klijke Nederlandse Akademie van Wetenschappen, pp. 252–261. Koninklijke
Nederlandse Akademie van Wetenschappen.
Bibliography 211

Wilcoxon, F. (1945). Individual comparisons by ranking methods. Biometrics

Bulletin 1 (6), 80–83.

Yarnold, J. K. (1972). Asymptotic approximations for the probability that a

sum of lattice random vectors lies in a convex set. The Annals of Mathe-
matical Statistics 43 (5), 1566–1580.
Zhong, D. and J. Kolassa (2017, September). Moments and Cumulants of The
Two-Stage Mann-Whitney Statistic. ArXiv e-prints.
Index

alternative hypothesis, 7 Edgeworth approximation, 116, 122,

analysis of variance, 69–72, 74, 75, 171
77, 79, 81, 102, 182 efficacy, 31–33, 56–58, 71, 91
Ansari-Bradley test, 58, 59, 64 Epanechnikov kernel, 145
asymptotic relative efficiency, 29 exponential distribution, 4, 161, 176,
186
bandwidth, 145 exponential scores, 50
basic bootstrap confidence interval,
173 F distribution, 5, 6, 70, 130
bias corrected and accelerated, 174 family-wise error rate, 71, 72
binomial distribution, 5, 20, 24, 144,
146 Gaussian distribution, 1
binomial test, 20 generalized sign test, 27
Bonferroni procedure, 71, 82, 107
H test, see Kruskal-Wallis test
bootstrap, 169–183
Hermite polynomials, 116
bootstrap sample, 170
histogram, 143, 144, 178
Cauchy distribution, 3, 16, 23, 32, Hodges-Lehmann estimator, 62, 99
53, 135, 161, 171, 176 Honest Significant Difference, 72, 82,
central limit theorem, 15, 19, 42, 49, 107
70, 116, 134, 152, 169 isotonic regression, 157
chi-square distribution, 5, 70, 72–77,
86, 87, 92, 103–105, 130, jackknife, 183–186
134, 136 Jonckheere-Terpstra test, 82–85, 87,
concordant, 118, 120 91
contrast, 71, 83, 91
correction for continuity, 22, 29, 45, Kendall’s τ , 119–122, 124, 125
50, 55, 77 kernel, 144, 145, 147, 152, 153
Cramér-von Mises test, 64, 65 kernel density estimation, 144, 145,
critical region, 7, 9, 10, 77 147, 152
critical value, 7–10, 20, 21, 24, 28–30, kernel smoothing, 152, 153, 155, 156
49, 62, 63, 76, 77, 87, 90, Kolmogorov-Smirnov test, 64, 65
91, 100, 117 Kruskal-Wallis test, 76–78, 81, 82,
84, 86–89, 92, 103, 104, 133
degrees of freedom, 5
discordant, 118, 120 L1 regression, 160–164
double exponential distribution, 3 Laplace distribution, 3, 16, 22, 23,
32, 57, 58, 135, 161

213
214 Index

Least Significant Difference, 71 quantile regression, 160–164

Lehmann alternative, 50, 58
locally weighted regression, 153, 155, random X bootstrap, 177
156 relative efficiency, 28, 29, 31, 32,
location-scale family, 4 55–57, 84, 91, 102
logistic distribution, 4, 50, 56, 57, 85, resampling distribution, 171
89 resistant, 165

Mann-Whitney test, 48–50, 55–58, Savage scores, 50, 51, 53, 78, 79, 107
61–63, 81, 83, 86, 102, 103, Siegel-Tukey test, 58, 59
133 sign test, 20, 22–25, 27, 31–33, 43,
Mood’s median test, 43–46, 50, 53, 63, 96, 98, 105, 132–136
54, 61, 141 Spearman correlation, 116, 118–121,
multinomial distribution, 5, 74 123, 124
multivariate Gaussian distribution, Spearman Rank correlation, 115
113, 115, 121, 129 spline, 157, 159
multivariate median, 130, 132 standard Gaussian distribution, 2
standard normal distribution, 2
Nadaraya-Watson smoothing, 152, Student’s t distribution, 6, 16
155, 156 Studentized, 179
non-central Cauchy distribution, 3 Studentized range distribution, 72,
non-central chi-square distribution, 81
6, 86, 89, 92
non-centrality parameter, 6 test level, 7, 9–11, 16, 17, 20–23, 28,
normal distribution, 1 31, 116
normal scores, 50, 53, 79 test statistic, 7
null hypothesis, 6 tied observations, 54, 107
two-sample pooled t statistic, 40
one-sided hypothesis, 7 two-sample pooled t-test, 40, 41, 51,
order statistics, 26, 50–52, 57, 62, 63 52, 71, 72, 81
two-sided hypothesis, 9
p-value, 10 type I error rate, 7
Page’s test, 108, 109 type II error rate, 8
paired t-test, 95
parametric bootstrap, 171 U statistic, 48
Pearson correlation, 113–116, 120, uniform distribution, 2, 3, 16, 152
121, 124, 178 usual method, 173
percentile method, 173
permutation test, 52 van der Waerden scores, 50, 51, 78
pivot, 12, 179, 180
pivotal, 179 Walsh averages, 99, 100
pooled adjacent violators, 157 Wilcoxon rank-sum test, 47–50,
power, 8, 10, 11, 22, 23, 28–33, 44, 53–58, 61, 62, 81, 83, 86,
53–55, 58, 69, 71, 84–87, 99, 138
89–92, 102, 103, 135, 139 Wilcoxon signed-rank test, 96, 99,
Prentice test, 106 100, 132–136

CS601 - Machine Learning - Unit 1 - Regression
No ratings yet
CS601 - Machine Learning - Unit 1 - Regression
11 pages
STATA Propensity Score Matching Guide
No ratings yet
STATA Propensity Score Matching Guide
3 pages
Interpretation of SLRM
No ratings yet
Interpretation of SLRM
62 pages
Logistic Regression Analysis Guide
No ratings yet
Logistic Regression Analysis Guide
4 pages
Moderation & Mediation with Process
No ratings yet
Moderation & Mediation with Process
8 pages
Introduction To Regression Analysis
No ratings yet
Introduction To Regression Analysis
8 pages
Lesson 18 Basic Statistical Tool
100% (1)
Lesson 18 Basic Statistical Tool
36 pages
T-Test For Two Independent Samples
No ratings yet
T-Test For Two Independent Samples
44 pages
09 Power & Sample Size
No ratings yet
09 Power & Sample Size
16 pages
MATH& 146 Lesson 41: Multiple Regression
No ratings yet
MATH& 146 Lesson 41: Multiple Regression
28 pages
Quantitative Methods: Regression Models, Types of Errors
No ratings yet
Quantitative Methods: Regression Models, Types of Errors
42 pages
Regression Analysis: Slope & Intercept
No ratings yet
Regression Analysis: Slope & Intercept
4 pages
Consequences and Detection of Misspecified Nonlinear Regression Models
No ratings yet
Consequences and Detection of Misspecified Nonlinear Regression Models
16 pages
An Introduction To Business Statistics.
No ratings yet
An Introduction To Business Statistics.
19 pages
Syllabus Statistics&Methodology 2022 2023 Block3 Version1
No ratings yet
Syllabus Statistics&Methodology 2022 2023 Block3 Version1
13 pages
Statnews #78 What Is Survival Analysis?
No ratings yet
Statnews #78 What Is Survival Analysis?
3 pages
Linear Regression Assignment Overview
0% (2)
Linear Regression Assignment Overview
8 pages
Forecasting with Moving Averages
No ratings yet
Forecasting with Moving Averages
13 pages
Ids Mod2
No ratings yet
Ids Mod2
34 pages
ANOVA (Analysis of Variance)
No ratings yet
ANOVA (Analysis of Variance)
5 pages
Linear Regression on Heart Disease Data
No ratings yet
Linear Regression on Heart Disease Data
11 pages
Lecture Notes - Prob and Stat
No ratings yet
Lecture Notes - Prob and Stat
229 pages
Theory of Estimation in Statistics
100% (1)
Theory of Estimation in Statistics
30 pages
Homogeneity of Variance in ANOVA
No ratings yet
Homogeneity of Variance in ANOVA
14 pages
F-Test and ANOVA Overview
100% (1)
F-Test and ANOVA Overview
41 pages
4.1-Intro To Stat
No ratings yet
4.1-Intro To Stat
29 pages
Business Applications of Hypothesis Testing
No ratings yet
Business Applications of Hypothesis Testing
26 pages
DMDW 10 Prediction
No ratings yet
DMDW 10 Prediction
14 pages
BM2 Chapter 5 Forecasting
No ratings yet
BM2 Chapter 5 Forecasting
24 pages