Kolassa
An Introduction to
Nonparametric Statistics
Contents

Introduction

1 Background
  1.1 Probability Background
    1.1.1 Probability Distributions for Observations
      1.1.1.1 Gaussian Distribution
      1.1.1.2 Uniform Distribution
      1.1.1.3 Laplace Distribution
      1.1.1.4 Cauchy Distribution
      1.1.1.5 Logistic Distribution
      1.1.1.6 Exponential Distribution
    1.1.2 Location and Scale Families
    1.1.3 Sampling Distributions
      1.1.3.1 Binomial Distribution
    1.1.4 χ²-distribution
    1.1.5 T-distribution
    1.1.6 F-distribution
  1.2 Elementary Tasks in Frequentist Inference
    1.2.1 Hypothesis Testing
      1.2.1.1 One-Sided Hypothesis Tests
      1.2.1.2 Two-sided Hypothesis Tests
      1.2.1.3 P-values
    1.2.2 Confidence Intervals
      1.2.2.1 P-value Inversion
      1.2.2.2 Test Inversion with Pivotal Statistics
      1.2.2.3 A Problematic Example
  1.3 Exercises

3 Two-Sample Testing
  3.1 Two-Sample Approximately Gaussian Inference
    3.1.1 Inference on Expectations
    3.1.2 Inference on Dispersions
  3.2 General Two-Sample Rank Tests
    3.2.1 Null Distributions of General Rank Statistics
    3.2.2 Moments of Rank Statistics
  3.3 A First Distribution-Free Test
  3.4 The Mann-Whitney-Wilcoxon Test
    3.4.1 Exact and Approximate Mann-Whitney Probabilities
      3.4.1.1 Moments and Approximate Normality
    3.4.2 Other Scoring Schemes
    3.4.3 Using Data as Scores: the Permutation Test
  3.5 Empirical Levels and Powers of Two-Sample Tests
  3.6 Adaptation to the Presence of Tied Observations
  3.7 Mann-Whitney-Wilcoxon Null Hypotheses
  3.8 Efficiency and Power of Two-Sample Tests
    3.8.1 Efficacy of the Gaussian-Theory Test
    3.8.2 Efficacy of the Mann-Whitney-Wilcoxon Test
    3.8.3 Summarizing Asymptotic Relative Efficiency
    3.8.4 Power for Mann-Whitney-Wilcoxon Testing
  3.9 Testing Equality of Dispersion
  3.10 Two-Sample Estimation and Confidence Intervals
    3.10.1 Inversion of the Mann-Whitney-Wilcoxon Test
  3.11 Tests for Broad Alternatives
  3.12 Exercises

Bibliography

Index
Introduction
Preface
This book is intended to accompany a one-semester MS-level course in nonparametric statistics. Prerequisites for the course are calculus through multivariate Taylor series, elementary matrix algebra including matrix inversion, and a first course in frequentist statistical methods including some basic probability. Most of the techniques described in this book apply to data with only minimal restrictions placed on their probability distributions, but performance of these techniques, and the performance of analogous parametric procedures, depends on these probability distributions. The first chapter below reviews probability distributions. It also reviews some objectives of standard frequentist analyses. Chapters covering methods that have elementary parametric counterparts begin by reviewing those counterparts. These introductions are intended to give a common terminology for later comparisons with new methods, and are not intended to reflect the richness of standard statistical analysis, or to substitute for an intentional study of these techniques.
Because the NonparametricHeuristic package is so tightly tied to the presentation in this book, and hence of less general interest, it is hosted on a GitHub repository, and installed via
library(devtools)
install_github("kolassa-dev/NonparametricHeuristic")
and, once installed, loaded into R using the library command.
An appendix gives guidance on performing some of these calculations using
SAS.
Errata and other materials will be posted at
http://stat.rutgers.edu/home/kolassa/NonparametricBook
as they become available.
Acknowledgments
In addition to works referenced in the following chapters, I consulted Stigler
(1986) and Hald (1998) for early bibliographical references. Bibliographic trails
have been tracked through documentation of software packages R and SAS,
and bibliography resources from JSTOR, CiteULike.org, Project Euclid, and
various publishers' web sites have been used to construct the bibliography.
I am grateful to the Rutgers Physical Sciences librarian Melanie Miller, the
Rutgers Department of Statistics Administrative Assistant Lisa Curtin, and
the work study students that she supervised, and the Rutgers Interlibrary
Loan staff, for assistance in locating reference material.
I am grateful to my students at both Rutgers, and at the University of
Rochester, to whom I taught this material over the years. Halley Constantino,
Jianning Yang, and Peng Zhang used a preliminary version of this manuscript,
and were generous with their suggestions for improvements. My experience
teaching this material helped me to select material for this volume, and to de-
termine its level and scope. I consulted various text books during this time, in-
cluding those of Hettmansperger and McKean (2011), Hettmansperger (1984),
and Higgins (2004).
I thank my family for their patience with me during preparation of this
volume. I thank my editor and proofreader for their important contributions.
I dedicate this volume to my wife, Yodit.
1
Background
[Figure: densities of the Cauchy, Gaussian, and Uniform distributions, plotted against the ordinate from −4 to 4.]
The parameter µ is both the expectation and the median, and σ is the standard
deviation. The Gaussian cumulative distribution function is
FG(x) = ∫_{−∞}^{x} fG(y) dy.
Again, the expectation and median for this distribution are both θ, and the distribution is symmetric about θ. The standard deviation is λ/√12. A common canonical member of this family is the distribution uniform on [0, 1], with θ = 1/2 and λ = 1. The distribution in its generality will be denoted by U(θ, λ).
As before, the expectation and median of this distribution are both θ. The
standard deviation of this distribution is σ. The distribution is symmetric
about θ. A canonical member of this family is the one with θ = 0 and σ = 1.
The distribution in its generality will be denoted by La(θ, σ²).
The expectation is 1, and the median is log(2). The inequality of these values
is an indication of the asymmetry of the distribution. The standard deviation
is 1. The distribution will be denoted by E.
1.1.4 χ²-distribution
If X1, . . . , Xk are independent random variables, each with a standard Gaussian distribution (that is, Gaussian with expectation zero and variance one), then the distribution of the sum of their squares is called the chi-square distribution, and is denoted χ²k. Here the index k is called the degrees of freedom.
Distributions of quadratic forms of correlated summands sometimes have a χ² distribution as well. When Y has a multivariate Gaussian distribution with dimension k, expectation 0 and variance matrix Υ, and if Υ has an inverse, then
Y⊤ Υ⁻¹ Y ∼ χ²k.
One can see this by noting that Υ may be written as ΘΘ⊤. Then X = Θ⁻¹Y is multivariate Gaussian with expectation 0, and variance matrix Θ⁻¹ΘΘ⊤(Θ⁻¹)⊤ = I, where I is the identity matrix with k rows and columns. Then X is a vector of independent standard Gaussian variables, and Y⊤ Υ⁻¹ Y = X⊤ X.
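A brief simulation sketch (not the book's code) can make this quadratic-form result concrete; the variance matrix below is an arbitrary positive definite choice for illustration.
set.seed(1)
k <- 3
Upsilon <- crossprod(matrix(rnorm(k * k), k))  # an arbitrary positive definite variance matrix
Theta <- t(chol(Upsilon))                      # Upsilon = Theta %*% t(Theta)
Y <- Theta %*% matrix(rnorm(k * 10000), k)     # columns are draws of Y ~ G(0, Upsilon)
Q <- colSums(Y * solve(Upsilon, Y))            # Y' Upsilon^{-1} Y for each draw
c(mean(Q), k)                                  # a chi-square_k variable has expectation k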
Furthermore, still assuming
Xj ∼ G(0, 1), independent,    (1.2)
the distribution of
W = Σ_{i=1}^{k} (Xi − δi)²    (1.3)
is called the noncentral χ² distribution.
1.1.5 T-distribution
When U has a standard Gaussian distribution, V has a χ²k distribution, and U and V are independent, then T = U/√(V/k) has a distribution called Student's t distribution, denoted here by Tk, with k called the degrees of freedom.
1.1.6 F-distribution
When U and V are independent random variables, with χ²k and χ²m distributions respectively, then F = (U/k)/(V/m) is said to have an F distribution with k and m degrees of freedom; denote this distribution by Fk,m. If a variable T has a Tm distribution, then T² has an F1,m distribution.
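This relationship can be checked numerically in R; the degrees of freedom and cut point below are arbitrary illustrative values, not taken from the text.
m <- 7; x <- 1.3
pf(x^2, df1 = 1, df2 = m)   # P[T^2 <= x^2] using the F distribution with 1 and m degrees of freedom
2 * pt(x, df = m) - 1       # P[|T| <= x] using the T_m distribution; the two agree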
The other type of possible error occurs when the alternative hypothesis is
true, but the null hypothesis is not rejected. The probability of erroneously
failing to reject the null hypothesis is called the type II error rate, and is
denoted β. More commonly, the behavior of the test under an alternative
hypothesis is described in terms of the probability of a correct answer, rather
than of an error; this probability is called power; power is 1 − β.
One might attempt to control power as well, but, unfortunately, in the
common case in which an alternative hypothesis contains probability distri-
butions arbitrarily close to those in the null hypothesis, the type II error rate
will come arbitrarily close to one minus the test level, which is quite large.
Furthermore, for a fixed sample size and mechanism for generating data, once
a particular distribution in the alternative hypothesis is selected, the smallest
possible type II error rate is fixed, and cannot be independently controlled.
Hence generally, tests are constructed primarily to control level. Under this
paradigm, then, a test is constructed by specifying α, choosing a critical value
to give the test this type I error, and determining whether this test rejects the
null hypothesis or fails to reject the null hypothesis.
In the one-sided alternative hypothesis formulation {θ > θ0 }, the inves-
tigator is, at least in principle, interested in detecting departures from the
null hypothesis that vary in proximity to the null hypothesis. (The same
observation will hold for two-sided tests in the next subsection). For plan-
ning purposes, however, investigators often pick a particular value within the
alternative hypothesis. This particular value might be the minimal value of
practical interest, or a value that other investigators have estimated. They
then calculate the power at this alternative, to ensure that it is large enough
to meet their needs. A power that is too small indicates that there is a sub-
stantial chance that the investigator’s alternative hypothesis is correct, but
that they will fail to demonstrate it. Powers near 80% are typical targets.
Consider a test with a null hypothesis of form θ = θ0 and an alternative hypothesis of form θ = θA, using a statistic T such that under the null hypothesis T ∼ G(θ0, σ0²), and under the alternative hypothesis T ∼ G(θA, σA²). Test with a level α, and without loss of generality assume that θA > θ0. In this case, the critical value is approximately t◦ = θ0 + σ0 zα. Here zα is the number such that Φ(zα) = 1 − α. A common level for such a one-sided test is α = 0.025; z0.025 = 1.96. The power for such a one-sided test is
1 − Φ((t◦ − θA)/σA) = 1 − Φ((θ0 + σ0 zα − θA)/σA).    (1.8)
One might plan an experiment by substituting null hypothesis values θ0 and
σ0 and alternative hypothesis values θA and σA into (1.8), and verifying that
this power is high enough to meet the investigator’s needs; alternatively, one
might require power to be 1 − β, and solve for the effect size necessary to give
this power. This effect size is
θA = θ0 + σA zβ + σ0 zα . (1.9)
One can then ask whether this effect size is plausible. More commonly, σ0 and
{W ≥ w◦ } (1.13)
with the θA substituted for θ0 , and power substituted for α. Again, assume
that large values of θ make larger values of T more likely. Then, for alternatives
θA greater than θ0 , the first probability added in
1.2.1.3 P -values
Alternatively, one might calculate a test statistic, and determine the test level
at which one transitions from rejecting to not rejecting the null hypothesis.
This quantity is called a p-value. For a one-sided test with critical region of
form (1.5), the p-value is given by
P0 [T ≥ tobs ] , (1.14)
for tobs the observed value of the test statistic. For two-sided critical values of
form (1.10), with condition (1.12), the p-value is given by
Pθ [L < θ < U ] ≥ 1 − α.
There may be t such that the equation PθL [T ≥ t] = α/2 has no solution,
because Pθ [T ≥ t] > α/2 for all θ. In such cases, take θL to be the lower
bound on possible values for θ. For example, if π ∈ [0, 1], and T ∼ Bin(n, π),
then Pπ [T ≥ 0] = 1 > α/2 for all π, Pπ [T ≥ 0] = α/2 has no solution, and
πL = 0. Alternatively, if θ can take any real value, and T ∼ Bin(n, exp(θ)/(1+
exp(θ))), then Pθ [T ≥ 0] = α/2 has no solution, and θL = −∞. Similarly,
there may be t such that the equation PθU [T ≤ t] = α/2 has no solution,
because Pθ [T ≤ t] > α/2 for all θ. In such cases, take θU to be the upper
bound on possible values for θ.
Construction of intervals for the binomial proportion represents a simple
example in which p-values may be inverted (Clopper and Pearson, 1934).
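As a sketch (the data values are illustrative, not from the text), the Clopper-Pearson interval can be obtained in R either from binom.test(), which inverts the exact binomial test, or directly from beta quantiles:
n <- 21; x <- 11; alpha <- 0.05
binom.test(x, n, p = 0.5, conf.level = 1 - alpha)$conf.int
c(qbeta(alpha/2, x, n - x + 1),        # lower endpoint: solves P[T >= x] = alpha/2
  qbeta(1 - alpha/2, x + 1, n - x))    # upper endpoint: solves P[T <= x] = alpha/2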
Then
{θ|t◦L < T (θ, data) < t◦U } (1.20)
is a confidence interval, if it is really an interval. In the case when (1.20) is
an interval, and when T (θ, data) is continuous in θ, then the interval is of the
form (L, U ); that is, the interval does not include the endpoints.
for υ = σzα/2.
If X² + Y² < υ², then Q(ρ) in (1.21) has a negative coefficient for ρ², and the maximum value is at ρ = XY/(X² − υ²). The maximum is υ²(X² + Y² − υ²)/(υ² − X²) < 0, and so the inequality in (1.21) holds for all ρ, and the confidence interval is the entire real line.
If X² + Y² > υ² > X², then the quadratic form in (1.21) has a negative coefficient for ρ², and the maximum is positive. Hence values satisfying the inequality in (1.21) are very large and very small values of ρ; that is, the confidence interval is
(−∞, (−XY − υ√(X² + Y² − υ²))/(υ² − X²)) ∪ ((−XY + υ√(X² + Y² − υ²))/(υ² − X²), ∞).
If X² > υ², then the quadratic form in (1.21) has a positive coefficient for ρ², and the minimum is negative. Then the values of ρ satisfying the inequality in (1.21) are those near the minimizer XY/(X² − υ²). Hence the interval is
((XY − υ√(X² + Y² − υ²))/(X² − υ²), (XY + υ√(X² + Y² − υ²))/(X² − υ²)).
1.3 Exercises
1. Demonstrate that the moment generating function for the statistic (1.3), under (1.2), depends on δ1, . . . , δk only through Σ_{j=1}^{k} δj².
2
One-Sample Nonparametric Inference
common techniques designed for data with a Gaussian distribution require consequences of this distribution beyond the marginal distribution of the sample average.
TABLE 2.1: True levels for the T Test, Sign Test, and Exact Sign Test, nominal level 0.05
ation in the denominator of the Studentized statistic act more strongly than larger values of components of the average in the numerator.
In all cases above, the t-test succeeds in providing a test level not much larger than the target nominal level. On the other hand, in some cases the true level is significantly below that expected. This effect decreases as sample size increases.
[Figure: a density plotted against data value, from −4 to 4.]
Below, the term median refers to the population version, unless otherwise
specified.
that is, the estimate minimizes the sum of distances from data points to
the potential median value, with distance measured by the sum of absolute
values. This definition (2.4) exactly coincides with the earlier definition (2.3)
for n odd, shares in the earlier definition’s lack of uniqueness for even sample
sizes, and typically shares the opposite resolution (averaging the middle two
observations) of this non-uniqueness. In contrast, the sample mean X̄ of (2.1)
satisfies
X̄ = argmin_η Σ_{i=1}^{n} |Xi − η|².    (2.5)
for some convex function ϱ; the sample mean uses ϱ(z) = z²/2 and the sample median uses ϱ(z) = |z|. Huber (1964) suggests an alternative estimator combining the behavior of the mean and median, by taking ϱ quadratic for small values, and continuing linearly for larger values, thus balancing increased efficiency of the mean and the smaller dependence on outliers of the median; he suggests
ϱ(z) = z²/2 for |z| < k, and ϱ(z) = k|z| − k²/2 for |z| ≥ k,    (2.7)
and recommends a value of the tuning parameter k between 1 and 2.
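A minimal sketch (not the book's code) of computing such an M-estimate by direct minimization of the criterion, with k = 1.5 as an assumed value within the recommended range:
huberrho <- function(z, k = 1.5) ifelse(abs(z) < k, z^2 / 2, k * abs(z) - k^2 / 2)
huberest <- function(x, k = 1.5)
  optimize(function(eta) sum(huberrho(x - eta, k)), interval = range(x))$minimum
x <- c(rnorm(20), 50)   # twenty Gaussian observations plus one gross outlier
c(mean = mean(x), median = median(x), huber = huberest(x))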
That is, tl is that potential value for T such that not more than α/2 probability
sits below it. The largest such tl has probability at least 1 − α/2 equal to or
larger than it, and at least α/2 equal to or smaller than it; hence tl is the α/2
quantile of the Bin(n, 1/2) distribution. Generally, the inequality in (2.10) is
strict; that is, ≤ is actually <. For combinations of n and α for which this
inequality holds with equality, the quantile is not uniquely defined, and take
the quantile to be the lowest candidate. Symmetrically, one might choose the
smallest tu so that
Σ_{j=tu}^{n} (n choose j) (1/2)^n ≤ α/2;    (2.11)
n + 1 − tu is the α/2 quantile of the Bin(n, 1/2) distribution, with the opposite
convention used in case the quantile is not uniquely defined.
Then, reject the null hypothesis if T ≤ t◦L or T ≥ t◦U for t◦L = tl − 1 and t◦U = tu. This test is called the (exact) sign test, or the binomial test (Higgins, 2004). An approximate version of the sign test might be created by selecting critical values from the Gaussian approximation to the distribution of T(θ0).
Again, direct attention to Table 2.1. Both variants of the sign test succeed
in keeping the test level no larger than the nominal value. However, the sign
test variants, because of the discreteness of the binomial distribution, in some
cases achieve levels much smaller than the nominal target. Subtable (a), for
sample size 10, is the most extreme example of this; subtable (b), for sample
size 17, represents the smallest reduction in actual sample size, and subtable
(c), for sample size 40, is intermediate. Note further that, while the asymp-
totic sign test, based on the Gaussian approximation, is not identical to the
exact version, for subtables (a) and (c) the levels coincide exactly, since for
all simulated data sets, the p-values either exceed 0.05 or fail to exceed 0.05
for both tests. Subtable (b) exhibits a case in which for one data value, the
exact and approximate sign tests disagree on whether p-values exceed 0.05.
Table 2.2 presents characteristics of the exact two-sided binomial test of
the null hypothesis that the probability of success is half, with level α = 0.05,
applied to small samples. In this case, the two-sided p-value is obtained by
doubling the one-sided p value.
TABLE 2.2: Exact levels and exact and asymptotic lower critical values for
symmetric two-sided binomial tests of nominal level 0.05
For small samples (n < 6), the smallest one-sided p-value, 1/2^n, is greater than .025, and the null hypothesis is never rejected. Such small samples are omitted from Table 2.2. This table consists of two subtables side by side, for n ≤ 23, and for n > 23. The first column of each subtable is sample size. The second is tl − 1 from (2.10). The third is the value taken from performing the same operation on the Gaussian approximation; that is, it is the largest a such that
Φ((tl − 1 − n/2 + 0.5)/(0.5√n)) ≤ α/2.    (2.12)
The fourth is the observed test level; that is, it is double the right side of
(2.10). Observations here agree with those from Table 2.1; for sample size 10,
the level of the binomial test is severely too small, for sample size 17, the
binomial test has close to the optimal level, and for sample size 40, the level
for the binomial test is moderately too small.
A complication (or, to an optimist, an opportunity for improved approx-
imation) arises when approximating a discrete distribution by a continuous
distribution. Consider the case with n = 10, exhibited in Figure 2.2. Bar areas
represent the probability under the null hypothesis of observing the number of
successes. Table 2.2 indicates that the one-sided test of level 0.05 rejects the
null hypothesis for W ≤ 1. The actual test size is 0.0215, which is graphically
represented as the sum of the areas in the bar centered at 1, and the very small
area of the neighboring bar centered at 0. Expression (2.12) approximates the
sum of these two bar areas by the area under the dotted curve, representing the Gaussian density with the appropriate expectation n/2 = 5 and standard deviation √n/2 = 1.58. In order to align the areas of the bars most closely with the area under the curve, the Gaussian area should be taken to extend to the upper end of the bar containing 1; that is, evaluate the Gaussian distribution function at 1.5, explaining the 0.5 in (2.12). More generally, for a discrete distribution with potential values ∆ units apart, the ordinate is shifted by ∆/2 before applying a Gaussian approximation; this adjustment is called a correction for continuity.
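A quick numerical sketch (not from the text) of this correction for the Bin(10, 1/2) case:
n <- 10
pbinom(1, n, 0.5)                          # exact P[W <= 1]
2 * pbinom(1, n, 0.5)                      # doubled, giving the 0.0215 reported above
pnorm((1 + 0.5 - n/2) / (0.5 * sqrt(n)))   # Gaussian approximation with the continuity correction
pnorm((1 - n/2) / (0.5 * sqrt(n)))         # without the correction; noticeably less accurate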
[Figure 2.2: bar chart of the Bin(10, 1/2) null probabilities for 0 through 10 successes, with the approximating Gaussian density; sample size 10, target level 0.05, and from Table 2.2 a − 1 = 1.]
The power of the sign test is determined by PθA [Xj ≤ θ0] for values of θA ≠ θ0. Since θA > θ0 if PθA [Xj ≤ θ0] < 1/2, alternatives θA > θ0 correspond to one-sided alternatives P[Yj = 1] < 1/2.
If θ0 is the true population median of the Xj, and if there exists a set of form (θ0 − ε, θ0 + ε), with ε > 0, such that P[Xj ∈ (θ0 − ε, θ0 + ε)] = 0, then any other θ in this set is also a population median for Xj, and hence the test will have power against such alternatives no larger than the test level. Such occurrences are rare.
Table 2.3 presents powers for these various tests for various sample sizes.
The alternative is chosen to make the t-test have power approximately .80 for
the Gaussian and Laplace distributions, using (1.9). In this case both σ0 and
TABLE 2.3: Power for the T Test, Sign Test, and Exact Sign Test, nominal
level 0.05
σA for the Gaussian and Laplace distributions are 1/√n. Formula (1.9) is
inappropriate for the Cauchy distribution, since in this case X̄ does not have
a distribution that is approximately Gaussian. For the Cauchy distribution,
the same alternative as for the Gaussian and Laplace distributions is used.
Results in Table 2.3 show that for a sample size for which the sign test
level approximates the nominal level (n = 17), use of the sign test for Gaussian
data results in a moderate loss in power relative to the t-test, while use of the
sign test results in a moderate gain in power for Laplace observations, and in
a substantial gain in power for Cauchy observations.
Example 2.3.1 An early (and very simple) application of this test was
to test whether the proportion of boys born in a given year is the same as
the proportion of girls born that year (Arbuthnott, 1712). Number of births
was determined for a period of 82 years. Let Xj represent the number
of births of boys, minus the number of births of girls, in year j. The
parameter θ represents the median amount by which the number of girls
exceeds the number of boys; its null value is 0. Let Yj take the value 0 for
years in which more girls than boys are born, and 1 otherwise. Note that
in this case, (2.9) is violated, but P [Xj = 0] is small, and this violation
is not important. Test at level 0.05.
4.76 × 10−7 , 1.00 × 10−5 , 1.00 × 10−4 , 6.34 × 10−4 , 2.85 × 10−3 ,
9.70 × 10−3 , 2.59 × 10−2 , 5.54 × 10−2 , 9.70 × 10−2 , 1.40 × 10−1 ,
4.77 × 10−7 , 1.05 × 10−5 , 1.11 × 10−4 , 7.45 × 10−4 , 3.60 × 10−3 ,
1.33 × 10−2 , 3.91 × 10−2 , 9.46 × 10−2 , 1.92 × 10−1 , 3.32 × 10−1 .
The largest of these cumulative sums smaller than 0.025 is the sixth,
corresponding to T < 6. Hence tl = 6. Similarly, tu = 16. Reject the null
hypothesis that the median is 0.26 if T < 6 or if T ≥ 16. Since 11 of the
observations are greater than the null median 0.26, T = 11. Do not reject
the null hypothesis.
Alternatively, one might calculate a p-value. Using (1.15), the p-value
is 2 min(P0 [T ≥ 11] , P0 [T ≤ 11]) = 1.
Furthermore, the confidence interval for the median is (X(6) , X(16) ) =
(0.118, 0.358).
The values tl and tu may be calculated in R by
a<-qbinom(0.025,21,.5); b<-21+1-qbinom(0.025,21,.5)
and the ensemble of calculations might also have been performed in R
using
arsenic<-as.data.frame(scan(’arsenic.dat’,
what=list(age=0,sex=0,drink=0,cook=0,water=0,nails=0)))
library(BSDA)#Gives sign test.
SIGN.test(arsenic$nails,md=0.26)#Argument md gives null hyp.
Graphical construction of a confidence interval for the median is calcu-
lated by
library(NonparametricHeuristic)
invertsigntest(log(arsenic$nails),maint="Log Nail Arsenic")
and is given in Figure 2.3. Instructions for installing this last library are
given in the introduction, and in Appendix B.
not rejected are between the horizontal lines; log medians in the confidence
intervals are values of the test statistic within this region.
[Figure 2.3: horizontal axis from −3 to 2.]
In this construction, order statistics (that is, the ordered values) are first
plotted on the horizontal axis, with the place in the ordered data set on the
vertical axis. These points are represented by the points in Figure 2.3 where
the step function transitions from vertical to horizontal, as one moves from
lower left to upper right. Next, draw horizontal lines at the values tl and tu ,
given by (2.10) and (2.11) respectively. Finally, draw vertical lines through
the data points that these horizontal lines hit.
For this particular example, the exact one-sided binomial test of level 0.025
rejects the null hypothesis that the event probability is half if the sum of event
indicators is 0, 1, 2, 3, 4, or 5; tl = 6. For Yj of (2.8), the sum is less than
6 for all θ to the left of the point marked X(tl ) . Similarly, the one-sided level
0.025 test in the other direction rejects the null hypothesis if the sum of event
indicators is at least tu = 16. The sum of the Yj exceeds 15 for θ to the right
of the point marked X(tu ) .
By symmetry, one might expect tl = n − tu , but this is not the case. The
asymmetry in definitions (2.10) and (2.11) arises because construction of the
confidence interval requires counting not the data points, but the n − 1 spaces
between them, plus the regions below the minimum and above the maximum,
for a total of n + 1 ranges. Then tl = n + 1 − tu .
This interval is not of the usual form θ̂ ± 2σ̂, for σ̂ with a factor of 1/√n. Cramér (1946, pp. 368f.) shows that if X1, . . . , Xn is a set of independent random variables, each having density f, then Var[smed[X1, . . . , Xn]] ≈ 1/(4f(θ)²n). Chapter 8 investigates estimation of this density; this estimate
can be used to estimate the median variance, but density estimation is harder
than the earlier confidence interval rule.
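As a rough sketch (not the book's code), the plug-in version of this variance approximation, with f(θ) replaced by a kernel density estimate at the sample median, can be compared with a small simulation:
set.seed(6)
x <- rnorm(101)
dens <- density(x)
fhat <- approx(dens$x, dens$y, xout = median(x))$y
1 / (4 * fhat^2 * length(x))                 # plug-in approximation to Var[smed]
var(replicate(5000, median(rnorm(101))))     # simulated variance for comparison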
Example 2.3.3 Test the null hypothesis that the upper quartile (that is,
the 0.75 quantile) of the arsenic nail data from Example 2.3.2 is the ref-
erence value 0.26, and give a confidence interval for this quantile. The
analysis is the same as before, except that tl and tu are different. We de-
termine tl and tu in (2.13). Direct calculation, or using the R commands
a<-qbinom(0.025,21,.75);b<-21+1-qbinom(0.025,21,1-0.75)
sort(arsenic$nails)[c(a,b)]
Let ϖj,n(θA) represent the power for test Tj using n observations, under the alternative θA. Assume that
ϖj,n(θA) is continuous and increasing in θA for all j, n,
lim_{θA→∞} ϖj,n(θA) = 1,    (2.14)
lim_{n→∞} ϖj,n(θA) = 1 for all θA > θ0.
lim_{n1→∞} n1/n2,
2.4.1.1 Power
Consider test statistics satisfying
Tj ∼ G(µj(θ), ςj²(θ)), for ςj(θ) > 0, µj(θ) increasing in θ.    (2.15)
The Gaussian distribution in (2.15) does not need to hold exactly; holding approximately is sufficient. In this case, one can find the critical values for the two tests, t◦j,nj, such that P0[Tj ≥ t◦j,nj] = α. Since (Tj − µj(0))/ςj(0) is approximately standard Gaussian under the null hypothesis, then (t◦j,nj − µj(0))/ςj(0) ≈ zα. Hence
t◦j,nj = µj(0) + ςj(0)zα.    (2.16)
The power for test j is approximately
1 − Φ((t◦j,nj − µj(θA))/ςj(θA)).    (2.17)
Often the variance of the test statistic changes slowly as one moves away from
the null hypothesis; in this case, the power for test j is approximately
Then
ϖj,nj(θA) = 1 − Φ(√nj {µj(0) + σj(0)zα/√nj − µj(θA)}/σj(θA)).    (2.20)
As sample sizes increase, power increases for a fixed alternative, and calcula-
tions will consider a class of alternatives moving towards the null. Calculations
below will consider behavior of the expectation of the test statistic near the
null hypothesis (which is taken as θ = 0). Suppose that
ϖj,nj(θA) ≈ 1 − Φ(√nj {µj(0) − µj(θA)}/σj(0) + zα)
         = Φ(√nj {µj(θA) − µj(0)}/σj(0) − zα).    (2.22)
This expression for approximate power may be solved for sample size, by noting that if
ϖj,nj(θA) = 1 − β,    (2.23)
then ϖj,nj(θA) = Φ(zβ), and (2.22) holds if √nj {µj(θA) − µj(0)}/σj(0) − zα = zβ, or
nj = (zα + zβ)² σj(0)²/{µj(θA) − µj(0)}².    (2.24)
Common values for α and β are 0.025 and 0.2, giving upper Gaussian quantiles
of zα = 1.96 and zβ = 0.84. Recall that z with a subscript strictly between
0 and 1 indicates that value for which a standard Gaussian random variable
has that probability above it.
It may be of use in practice, and will be essential in the efficiency cal-
culations below, to approximate which member of the alternative hypothesis
corresponds with a test of a given power, with sample size held fixed. Solving
(2.24) exactly for θA is difficult, since the function µ is generally non-linear.
Approximating this function using a one-term Taylor approximation,
ϖj,nj(θA) ≈ Φ(√nj ej θA − zα),
for
ej = µ′j(0)/σj(0).
The quantity ej is called the efficacy of test j. Setting this power to 1 − β, zα − √nj ej θA = z1−β. Solving this equation for θA,
θA ≈ (zα + zβ)/(√nj ej),    (2.25)
verifying the requirement that θA gets close to zero. This expression can be used
to approximate an effect size needed to obtain a certain power with a certain
sample size and test level, and will be used in the context of asymptotic relative
efficiency.
TABLE 2.4: Empirical powers for one-sample location tests with sample size
ratios indicated by asymptotic relative efficiency
                    t-test        Sign test        Relative
Gaussian   µ′(0)    1/ρ           1/(√(2π)ρ)
           σ(0)     1             1/2
           e        1/ρ           √(2/π)/ρ         √(π/2)
Laplace    µ′(0)    1             1/√2
           σ(0)     1             1/2
           e        1             √2               1/√2
Cauchy     µ′(0)    1             π⁻¹
           σ(0)     ∞             1/2
           e        0             2/π              0
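A small sketch (not from the text) checking the Gaussian row numerically, using the sign-test efficacy 2f(0) and the t-test efficacy 1/ρ for standard deviation ρ:
rho <- 1
esign <- 2 * dnorm(0, sd = rho)                      # equals sqrt(2/pi)/rho
et <- 1 / rho
c(sign = esign, t = et, relative = et / esign, sqrt(pi / 2))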
Example 2.4.1 In this example I calculate power for a sign test applied to 49 observations from a Gaussian distribution with unit variance. Suppose X1, . . . , X49 ∼ G(θ, 1), with null hypothesis θ = 0 and alternative hypothesis θ = 1/2. The sign test statistic, divided by n, approximately satisfies (2.15) and (2.19), with µ and σ given by (2.27). Then µ1(0) = .5, σ1(0) = √(0.5 × 0.5) = .5, µ1(0.5) = 0.691, σ1(0.5) = √(0.691 × 0.309) = 0.462, and power for a one-sided test of level 0.025, or a two-sided test of level 0.05, is approximated by (2.17): 1 − Φ(7 × (0.5 + 0.5 × 1.96/7 − 0.691)/0.462) = 1 − Φ(−0.772) = 0.780. The null and alternative standard deviations are close enough to motivate the use of the simpler approximation (2.22), approximating power as
If, instead, a test of power 0.85 were desired for alternative expectation
1/2, with a one-sided test of level 0.025, zα = 1.96, and zβ = 1.036. From
(2.24), one needs at least
The above intervals will extend outside [0, 1], which is not reasonable; this can
be circumvented by transforming the probability scale.
Figure 2.4 represents the bounds from (2.28), without any rescaling, to
be discussed further in the next example. Confidence bounds in Figure 2.4
exhibit occurrences of larger estimates being associated with upper confi-
dence bounds that are smaller (ex., in Figure 2.4, the region between the
second-to-largest and the largest observations), and for the region with the
cumulative distribution function estimated at zero or one (that is, the region
below the smallest observed value, and the region above the largest observed
value), confidence limits lie on top of the estimates, indicating no uncertainty.
Both of these phenomena are unrealistic. The first phenomenon, that of non-
monotonic confidence bounds, cannot be reliably avoided through rescaling;
the second, with the upper confidence bounds outside the range of the data,
can never be repaired through rescaling. A preferred solution is to substitute
the intervals of Clopper and Pearson (1934), described in §1.2.2.1, to avoid
all three of these problems (viz., bounds outside (0, 1), bounds ordered differ-
ently than the estimate, and bounds with zero variability). Such intervals are
exhibited in Figure 2.5.
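A minimal sketch (the helper name is ours, not the book's) of such pointwise bounds: at each ordinate the count of observations at or below it is binomial, and binom.test() inverts the exact test to give an interval for F(x).
ecdfcp <- function(x, grid, conf = 0.95)
  t(sapply(grid, function(g) {
    ci <- binom.test(sum(x <= g), length(x), conf.level = conf)$conf.int
    c(lower = ci[1], estimate = mean(x <= g), upper = ci[2])
  }))
# ecdfcp(arsenic$nails, grid = sort(arsenic$nails)) would give bounds at each data value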
Finally, the confidence associated with these bounds is point-wise, and not
simultaneous. That is, if (L1, U1) and (L2, U2) are 1 − α confidence bounds
associated with two ordinates x1 and x2 , then P [L1 ≤ F (x1 ) ≤ U1 ] ≥ 1 − α
and P [L2 ≤ F (x2 ) ≤ U2 ] ≥ 1 − α, at least approximately, but the preced-
ing argument does not bound P [L1 ≤ F (x1 ) ≤ U1 and L2 ≤ F (x2 ) ≤ U2 ] any
higher than 1 − 2α.
Example 2.5.1 Consider the arsenic data of Example 2.3.2. For every
real x, one counts the number of data points less than this x. For any
x less than the smallest value 0.073, this estimate is F̂ (x) = 0. For x
greater than or equal to this smallest value and smaller than the next
smallest value 0.080, the estimate is F̂(x) = 1/21. This data set contains one duplicate value 0.118. For values below, but close to, 0.118 (for example, x = 0.1179), F̂(x) = 4/21, since 4 of the 21 observations are less than x. However, F̂(0.118) = 6/21; the jump here is twice what it is at other data values, since there are two observations here. This estimate is sketched in both Figures 2.4 and 2.5, and may be constructed in R using ecdf(arsenic$nails), presuming the data of Example 2.3.2 is still present to R. The command ecdf(arsenic$nails) does not produce confidence intervals; use
library(MultNonParam); ecdfcis(arsenic$nails,exact=FALSE)
to add confidence bounds; changing exact to TRUE forces exact intervals.
FIGURE 2.4: Empirical CDF and Confidence Bounds for Arsenic in Nails
[Plot of probability against arsenic in nails from −1 to 3; legend: Empirical CDF, Confidence Bounds; confidence level 0.95, approximate.]
2.6 Exercises
1. Calculate the asymptotic relative efficiency for the sign statistic
relative to the one-sample t-test (which you should approximate
using the one-sample z-test). Do this for observations from the
a. uniform distribution, on [−1/2, 1/2] with variance 1/12 and mean
under the null hypothesis of 0, and
b. logistic distribution, symmetric about 0, with variance π²/3 and density exp(x)/(1 + exp(x))².
FIGURE 2.5: Empirical CDF and Confidence Bounds for Arsenic in Nails
[Plot of probability against arsenic in nails from −1 to 3; legend: Empirical CDF, Confidence Bounds; confidence level 0.95, exact.]
HTTP://ftp.uni-bayreuth.de/math/statlib/datasets/lupus
HTTP://lib.stat.cmu.edu/datasets/bodyfat
gives data on body fat in 252 men. The second column gives propor-
tion of lean body tissue. Give a 95% confidence interval for upper
quartile proportion of lean body tissue. Note that the first 116 lines
and last 10 lines are data set description, and should be deleted.
(Line 117 is blank, and should also be deleted).
4. Suppose 49 observations are drawn from a Cauchy distribution, dis-
placed to have location parameter 1.
a. What is the power of the sign test at level 0.05 to test the null
hypothesis of expectation zero for these observations?
b. What size sample is needed to distinguish between a null hy-
pothesis of median 0 and an alternative hypothesis of median 1, for
independent Cauchy variables with a one-sided level 0.025 sign test
to give 80% power?
3
Two-Sample Testing
This chapter addresses the question of two-sample testing. Data will gener-
ally consist of observations X1 , . . . , XM1 from continuous distribution function
F , and observations Y1 , . . . , YM2 from a continuous distribution function G.
Model these observations as independent, and unless otherwise specified, treat
their distributions as identical, up to some shift θ; that is,
Techniques for testing a null hypothesis of form θ = θ0 in (3.1), vs. the alternative that (3.1) holds for some alternative θ ≠ θ0, and for estimating θ assuming (3.1), are presented. Techniques for tests in which (3.1) is the null hypothesis, for an unspecified θ, are also presented.
for Ij^{k} equal to 1 if the item ranked j in the combined sample comes from the group k, and 0 otherwise. The superscript in Ij^{k} refers to group, and does not represent power.
The statistic TG^{2} is designed to take on large values when items in group two are generally larger than the remainder of the observations (that is, the items in group one), and to take small values when items in group two are generally smaller than the remainder of the observations. The statistic TG^{1} is designed to take on large values when items in group one are generally larger than the remainder of the observations (that is, the items in group two), and to take small values when items in group one are generally smaller than the remainder of the observations. The statistic TG^{1} provides no information not also captured in TG^{2}, since TG^{2} = Σ_{j=1}^{N} aj − TG^{1}.
sampling without replacement, is used. There are some conditions on the set
of scores needed to ensure that they are approximately Gaussian. Critical
values for the two-sided test of level α of the null hypothesis of equality of
distribution, vs. the alternative, relying on the Gaussian approximation, are
given by
E0[TG^{2}] ± √(Var0[TG^{2}]) zα/2    (3.8)
Moments needed to evaluate (3.8) and (3.9) are given in the next section.
E0[TG^{k}] = Mk Σ_{j=1}^{N} aj/N = Mk ā;    (3.10)
                        Y        X         Total
Greater than Median     A      N/2 − A      N/2
Less than Median        B      N/2 − B      N/2
Total                  M2        M1          N
So
Var[TG^{k}] = Σ_{i=1}^{N} Σ_{j=1}^{N} ai aj Cov[Ii^{k}, Ij^{k}]
            = Σ_{i=1}^{N} ai² b1 + Σ_{i≠j} ai aj b2 = (b1 − b2)N â + N² b2 ā²
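A brief simulation sketch (scores and sample sizes chosen arbitrarily, not from the text) comparing these permutation-null moments with direct resampling without replacement:
set.seed(1)
N <- 12; M2 <- 5; a <- seq_len(N)                        # Wilcoxon-type scores 1, ..., N
sims <- replicate(20000, sum(sample(a, M2)))             # statistic under random group assignment
abar <- mean(a); ahat <- mean(a^2)
c(mean(sims), M2 * abar)                                 # compare with (3.10)
c(var(sims), M2 * (N - M2) * (ahat - abar^2) / (N - 1))  # standard without-replacement variance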
although, as originally formulated, this test was applied only in the case of
even sample sizes, and so the score 0 would not be used.
Mood’s test, and other rank tests, ignore the ordering of the observations
from the first group among themselves, and similarly ignore the orderings of
the second group among themselves. Represent the data set as a vector of N
symbols, M1 of them X and M2 of them Y . The letter X in position j indicates
that, after ordering a combined set of N observations, the observation ranked
j comes from the first group, and the letter Y in position j indicates that
the observation ranked j comes from the second group. The advantage of
                        Y             X             Total
Greater than Median     A       (N − 1)/2 − A     (N − 1)/2
Equal to Median         C           1 − C             1
Less than Median        B       (N − 1)/2 − B     (N − 1)/2
Total                  M2            M1               N
Mood’s test lies in its simplicity, and its disadvantage is its low power. To
see why its power is low, consider the test with M1 = M2 = 3, for a total
of six observations. A data set label X, Y, X, Y, X, Y indicates that the lowest
observation is from the first group, the second lowest is from the second group,
the third lowest is from the first group, the fourth lowest is from the second
group, the fifth lowest (that is, the second highest) is from the first group,
and the highest is from the second group. Mood’s test treats X, Y, X, Y, X, Y
and X, Y, X, X, Y, Y as having equal evidence against H0 , but the second
should be treated as having more evidence. Furthermore, Mood’s test takes a
value between 0 and min(M2, ⌊N/2⌋). This high degree of discreteness in the
statistic’s support undermines power.
Westenberg (1948) presented the equal-sample case of the statistic, and
Mood (1950) detailed the use in the case with an even combined sample size.
Mood’s test is of sufficiently low importance in applications that the early
references did not bother to present the slight complication that arises when
the combined sample size is odd.
When the total sample size is odd, one might represent the median test
as inference from a 2 × 3 contingency table with ordered categories, as in
Table 3.2. Then TM = A − B. Then the one-sided p-values may be calculated
as
P0[A − B ≥ t] = (M1/N) P0[A − B ≥ t | C = 0] + (M2/N) P0[A − B ≥ t | C = 1].
The probabilities P0 [A − B ≥ t|C = c] are calculated from the hypergeomet-
ric distribution.
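A sketch (the helper name and conditional parameterization are ours) of one way to carry out this calculation, mixing over whether the median observation comes from group Y, and using the hypergeometric distribution conditionally:
moodpvalueodd <- function(t, M1, M2) {
  N <- M1 + M2; stopifnot(N %% 2 == 1)
  half <- (N - 1) / 2
  pcond <- function(c) {
    a <- 0:min(half, M2 - c)                      # possible values of A given C = c
    pa <- dhyper(a, M2 - c, (N - 1) - (M2 - c), half)
    sum(pa[2 * a - (M2 - c) >= t])                # A - B = 2A - (M2 - c)
  }
  (M1 / N) * pcond(0) + (M2 / N) * pcond(1)
}
moodpvalueodd(3, M1 = 10, M2 = 11)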
The null expectation and variance of TM are given by (3.10) and (3.11)
respectively. Note ā = 0, and
â = (N − 1)/N if N is odd, and â = 1 if N is even.
Critical values and p-values for Mood’s test may be calculated from (3.8) and
(3.9) respectively.
library(MultNonParam)
attach(yarn)
mood.median.test(strength[type=="A"],strength[type=="B"])
or
genscorestat((strength>median(strength))*2-1,type,correct=1)
Compare this with the two-sample t-test:
t.test(strength[type=="A"],strength[type=="B"])
detach(yarn)
[Figure: normal quantile plots of yarn strength (sample quantiles against normal quantiles from −2 to 2) for yarn types A and B, with plotting symbols indicating bobbins 1 through 6.]
Mood’s test, using (3.7) in conjunction with (3.12), is almost never used,
and is presented here only as a point of departure for more powerful tests.
Mood’s test looses power primarily because of the discreteness of the scores
(3.12). The balance of this chapter explores tests with less discrete scores.
[Figure: yarn strength by type (A and B); horizontal axis Strength, from 12 to 20.]
This new statistic TU and the previous statistic TW can be shown to be equivalent. For j ∈ {1, . . . , M2} indexing a member of the second sample, define Rj as the rank of observation j among all of the observations from the combined samples. Note that
TW = Σ_{j=1}^{M2} Rj = Σ_{j=1}^{M2} #(sample entries less than or equal to Yj)
   = Σ_{j=1}^{M2} #(X values less than or equal to Yj) + Σ_{j=1}^{M2} #(Y values less than or equal to Yj)    (3.15)
The first sum in (3.15) is TU , and the second is M2 (M2 + 1)/2, and so
TW = TU + M2 (M2 + 1)/2.
The test based on TU is called the Mann-Whitney test (Mann and Whitney,
1947). This statistic is a U statistic; that is, a statistic formed by summing
over pairs of observations in a data set.
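A quick numerical sketch (simulated data, not from the text) of this relationship between the rank-sum and Mann-Whitney forms:
set.seed(2)
x <- rnorm(7); y <- rnorm(5) + 0.5
M2 <- length(y)
TW <- sum(rank(c(x, y))[-(1:length(x))])   # sum of ranks of the second sample
TU <- sum(outer(x, y, "<="))               # count of (X, Y) pairs with X <= Y
c(TW, TU + M2 * (M2 + 1) / 2)              # the two agree
wilcox.test(y, x)$statistic                # R reports the Mann-Whitney count, matching TU for untied data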
Equating quadratic terms above gives a = 1/3. Setting the linear term to zero
gives b = 1/2, and setting the constant term to zero gives c = 1/6. Then
Σ_{j=1}^{w} j² = g(w) = w(2w + 1)(w + 1)/6,    (3.20)
â = (2N + 1)(N + 1)/6,
â − ā² = (2N + 1)(N + 1)/6 − (N + 1)²/4 = (N² − 1)/12,    (3.21)
In conjunction with the central limit theorem argument described above, one
can test for equality of distributions, with critical values and p-values given
by (3.8) and (3.9) respectively.
Example 3.4.1 Refer again to the yarn data of Example 3.3.1. Consider
yarn strengths for bobbin 3.
wilcox.test(strength~type,data=yarn[yarn$bobbin==3,],
exact=FALSE, correct=FALSE)
The continuity-corrected p-value uses statistic (13 + 0.5 − 18)/√12, and is 0.194, and might be done by
wilcox.test(strength~type,data=yarn[yarn$bobbin==3,],
exact=FALSE)
Finally, p-values might be calculated exactly using (3.17), (3.18), and
(3.19), and in R by
wilcox.test(strength~type,data=yarn[yarn$bobbin==3,],
exact=TRUE)
Moments (3.22) apply to the statistic given by scores aj = j. By contrast,
the Mann-Whitney statistic TU is constructed using (3.7) from scores aj = j −
(N +1)/2. The variance of this statistic is still given by (3.22); the expectation
is E [TU ] = M2 (N + 1)/2 − M2 (M2 + 1)/2 = M2 M1 /2.
The Wilcoxon variance in (3.22) increases far more quickly than that of
Mood’s test as the sample size N increases; relative to this variance, the
continuity correction is quite small, and is of little importance.
Example 3.4.2 Consider the nail arsenic data of Example 2.3.2. One
might perform an analysis using these scoring methods.
TP = Σ_{j=1}^{N} Z(j) Ij^{2} = Σ_{j=1}^{M2} Yj = M2 Ȳ.    (3.23)
Ȳ − X̄ = Ȳ − (N Z̄ − M2 Ȳ)/M1 = (N Ȳ − N Z̄)/M1 = N(TP − M2 Z̄)/(M1 M2),
where Z̄ = Σ_{i=1}^{N} Z(i)/N. The pooled variance estimate for the two-sample t
statistic is
sp² = {Σ_{j=1}^{M1} (Xj − X̄)² + Σ_{j=1}^{M2} (Yj − Ȳ)²}/(N − 2)
    = {Σ_{j=1}^{M1} (Xj − Z̄)² − M1(X̄ − Z̄)² + Σ_{j=1}^{M2} (Yj − Z̄)² − M2(Ȳ − Z̄)²}/(N − 2)
    = {(N − 1)sZ² − M1(X̄ − Z̄)² − M2(Ȳ − Z̄)²}/(N − 2).
Some algebra shows this to be
sp² = {(N − 1)/(N − 2)} sZ² − (TP − M2 Z̄)²(1/M1 + 1/M2)/(N − 2).
Hence, conditional on (Z(1), . . . , Z(N)), the two-sample pooled t statistic is
√((N − 2)N) (TP − M2 Z̄)/√(sZ²(N − 1)M1 M2 − (TP − M2 Z̄)² N),
Example 3.4.3 Again consider the nail arsenic data of Example 2.3.2.
Recall that there are 21 subjects in this data set, of whom 8 are male.
The permutation test testing the null hypothesis of equality of distribution
across gender may be performed in R using
library(MultNonParam)
aov.P(dattab=arsenic$nails,treatment=arsenic$sex)
to give a two-sided p-value of 0.482. In this case, all (21 choose 8) = 203490 ways to reassign arsenic nail levels to the various groups were considered. The
TABLE 3.4: Levels for various Two-Sample Two-Sided Tests, Nominal level
0.05, from 100,000 random data sets each, sample size 10 each
TABLE 3.5: Powers for various Two-Sample Two-Sided Tests, Nominal level
0.05, from 100,000 random data sets each, sample size 10 each, samples offset
by one unit
statistic TP of (3.23) was calculated for each assignment, this value was
subtracted from the null expectation Z̄, and the difference was squared
to provide a two-sided statistic. The p-value reported is the proportion of
these for which the squared differences among the reassignments meets or
exceeds that seen in the original data.
Table 3.5 excludes the exact version of the Wilcoxon test and Mood’s test,
since for these sample sizes (Mj = 10 for j = 1, 2), they fail to achieve the
desired level for any data distribution. The approximate Wilcoxon test has
comparable power to that of the t-test under the conditions optimal for the
t-test, and also maintains high power throughout.
importance in case of rank tests. When average ranks replace the original
ranks, the continuity correction argument using Figure 2.2 no longer holds.
Potential values of the test statistic are in some cases closer together than 1
unit apart, and, in such cases, the continuity correction might be abandoned.
For example, suppose that Yj ∼ G(0, 1), and Xi ∼ G(θ, 1). The differences Xi − Yj ∼ G(θ, 2), and so
µ(θ) = Φ(θ/√2).    (3.26)
Hence µ′(0) = 1/(2√π). Also, (3.24) still holds, and
e = {1/(2√π)}√12 ζ = √(3/π) ζ = .977ζ.
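A small sketch (ours, using the efficacy-based power approximation from §2.4.1, with a hypothetical shift and total sample size) of approximate Mann-Whitney-Wilcoxon power under a Gaussian shift, with equal group sizes so that ζ = 1/2:
mwwpower <- function(theta, N, lambda = 0.5, alpha = 0.025) {
  zeta <- sqrt(lambda * (1 - lambda))
  e <- sqrt(3 / pi) * zeta                  # the efficacy just derived
  pnorm(sqrt(N) * e * theta - qnorm(1 - alpha))
}
mwwpower(theta = 1, N = 20)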
Logistic    µ′(0) = 1              µ′(0) = 1/6            π²/9 = 1.10
            σ(0) = πζ⁻¹/√3         σ(0) = ζ⁻¹/(2√3)
            e = ζ√3/π              e = ζ/√3
ζ = (λ(1 − λ))^{1/2}, ε = √Var[Xi].
and
µ′(0) = 1/6, e = (1/6)√12 ζ = (1/√3)ζ = .577ζ.
Efficacies for more general rank statistics may be obtained using calcula-
tions involving expectations of derivatives of underlying densities, with respect
to the model parameter, evaluated at order statistics under the null hypoth-
esis, without providing rank expectations away from the null (Dwass, 1956).
moment depends not only on the probability (3.25), but also on probabilities
involving two independent copies of X and one copy of Y , and of two inde-
pendent copies of Y and one copy of X. This additional calculation allows the
use of (2.17); calculations below involve the simpler formula.
In contrast with the shift alternative (3.1), one might consider the
Lehmann alternative
from the outside in, with extremes getting equal rank, and again summing the
ranks from one sample.
The Ansari-Bradley test has a disadvantage with respect to the Siegel-
Tukey test, in that one can’t use off-the-shelf Wilcoxon tail calculations. On
the other hand, the Ansari-Bradley test is exactly invariant to reflection.
Example 3.9.1 Consider again the yarn data of Example 3.3.1. Test
equality of dispersion between the two types of yarn. Ranks are given in
Table 3.7, and are calculated in package NonparametricHeuristic as
yarn$ab<-pmin(rank(yarn$strength),rank(-yarn$strength))
yarn$st<-round(siegel.tukey.ranks(yarn$strength),2)
yarnranks<-yarn[order(yarn$strength),
c("strength","type","ab","st")]
R functions may be used to perform the test.
library(DescTools)#For SiegelTukeyTest
SiegelTukeyTest(strength~type,data=yarn)
yarnsplit<-split(yarn$strength,yarn$type)
ansari.test(yarnsplit[[1]],yarnsplit[[2]])
to find the Siegel-Tukey p-value as 0.7179, and the Ansari-Bradley p-value
as 0.6786. There is no evidence of inequality of dispersion.
in close parallel with definitions of §2.3.2. Then, reject the null hypothesis if
TG^{2}(θ0) ≤ t◦L or TG^{2}(θ0) ≥ t◦U for t◦L = tl − 1 and t◦U = tu, and use as the confidence interval
{θ | tl ≤ TG^{2}(θ) < tu}.    (3.32)
Applying the Gaussian approximation to TG^{2}(θ),
tl, tu ≈ M2 ā ± zα/2 √(Var[TG^{2}(0)]).    (3.33)
The above analysis did not reflect the fact of a small number of ties among
these pairwise differences. The code
wilcox.test(nails~sex,data=arsenic,conf.int=TRUE)
gives intervals found by using approximation (3.33) to the test critical val-
ues, and interpolating between the appropriate order statistics, to obtain
an identical result to the same accuracy.
[Figure 3.3: horizontal axis Location Difference, from −2 to 1.]
Compare Figure 3.3 to Figure 2.3. The test statistic in Figure 2.3 is con-
structed to be non-decreasing in θ; this simplified the discussion of the interval
construction, at the cost of employing a slightly non-standard version of the
sign test statistic. Figure 3.3 uses a more standard version of the test statistic.
Example 3.11.1 Again consider the yarn data of Example 3.3.1.
Figure 3.4 shows the cumulative distribution functions of strengths for
the two types of yarn. This figure might be generated using
par(mfrow=c(1,1))
plot(range(yarn$strength),c(0,1),type="n",
main="Yarn Strength",xlab="Yarn Strength",
ylab="Probability")
yarnsplit<-split(yarn$strength,yarn$type)
lines(ecdf(yarnsplit[[1]]),col=1)
lines(ecdf(yarnsplit[[2]]),col=2)
legend(17,.2,lty=c(1,1),col=c(1,2),legend=c("A","B"))
[Figure 3.4: empirical cumulative distribution functions of yarn strength for Type A and Type B; horizontal axis Yarn Strength from 12 to 20, vertical axis Probability.]
3.12 Exercises
1. The data set
HTTP://ftp.uni-bayreuth.de/math/statlib/datasets/schizo
HTTP://lib.stat.cmu.edu/datasets/biomed.desc
4
Methods for Three or More Groups
This chapter develops nonparametric techniques for one-way analysis of variance.
Suppose Xki are samples from K potentially different populations. That is,
for fixed k, Xk1 , . . . , XkMk are independent and identically distributed, each
with cumulative distribution function Fk . Here k ∈ {1, . . . , K} indexes group,
and Mk represents the number of observations in each group. In order to
determine whether all populations are the same, test a null hypothesis
vs. the alternative hypothesis HA : there exists j, k, and x such that Fj (x) 6=
Fk (x). Most tests considered in this chapter, however, are most powerful
against alternatives of the form HA : Fk (x) ≤ Fj (x)∀x for some indices k, j,
with strict inequality at some k, j, and x. Of particular interest, particularly
for power calculations, are alternatives of the form
Fi (x − θi ) = Fj (x − θj ) (4.2)
When the data have a Gaussian distribution, and (4.1) holds, the numerator
and denominator of (4.3) have χ2 distributions, and are independent; hence
the ratio W has an F distribution.
When the data are not Gaussian, the central limit theorem implies that the
numerator is still approximately χ2K−1 , as long as the minimal Mk is large, and
as long as the distribution of the data is not too far from Gaussian. However,
neither the χ2 distribution for the denominator of (4.3), nor the independence
of numerator and denominator, are guaranteed in this case. Fortunately, again
for large sample sizes and data not too far from Gaussian, the strong law of
large numbers indicates that the denominator of (4.3) is close to the population
variance of the observations, and the denominator degree of freedom for the
F distribution is large enough to make the F distribution close to the χ2
distribution. Hence in this large-sample close-to-Gaussian case, the standard
analysis of variance results will not mislead.
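As a minimal sketch (with hypothetical data; the response y and group factor g below are illustrative, not from the text), the standard one-way analysis of variance referred to above might be computed in R as
set.seed(1)
y<-c(rnorm(10,0),rnorm(10,0.5),rnorm(10,1)) # hypothetical responses for K=3 groups
g<-factor(rep(c("a","b","c"),each=10))      # hypothetical group labels
anova(lm(y~g)) # F ratio compared to an F reference distribution, as in (4.3)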
4.1.1 Contrasts
Let µk = E [Xki ]. Continue considering the null hypothesis that µj = µk for
all j, k pairs, and consider alternative hypotheses in which Xki all have the
same finite variance σ 2 , but the means differ, in a more structured way than
for standard ANOVA. Consider alternatives such that µk+1 − µk are the same
for all k, and denote the common value by ∆ > 0. One might construct a
test particularly sensitive to this departure from the null hypothesis using the
estimate \(\hat\Delta\). If \(\hat\Delta\) is approximately Gaussian, then the associated test of the
null hypothesis (4.1) vs. the ordered and equidistant alternative is constructed as
\[ T = \left(\hat\Delta - E_0[\hat\Delta]\right)\Big/\sqrt{\operatorname{Var}_0[\hat\Delta]}; \]
this statistic is compared to the standard Gaussian distribution in the usual way.
An intuitive estimator \(\hat\Delta\) is the least squares estimator; for example, when
\(K = 3\) then \(\hat\Delta = (\bar X_{3\cdot} - \bar X_{1\cdot})/2\), and when \(K = 4\) then \(\hat\Delta = (3\bar X_{4\cdot} + \bar X_{3\cdot} - \bar X_{2\cdot} - 3\bar X_{1\cdot})/10\). Generally, the least squares estimator is a linear combination
of group means, of the form \(\sum_{k=1}^{K} c_k \bar X_{k\cdot}\) for a set of constants \(c_k\) such that
\[ \sum_{k=1}^{K} c_k = 0, \qquad (4.5) \]
with \(c_k\) evenly spaced. In this case, \(E_0[\hat\Delta] = 0\) and
\[ \operatorname{Var}_0\left[\hat\Delta\right] = \operatorname{Var}_0[X_k] \sum_{k=1}^{K} c_k^2 / M_k, \]
In the case when the shift parameters and the contrast coefficients are equally
spaced, and for groups of equal size (that is, θk = (k − 1)∆ for some ∆ > 0,
ck = 2k − (K + 1), and λ = 1/K),
\[ \sum_{k=1}^{K} c_k \bar X_{k\cdot} \sim G\!\left(K^2(K-1)\Delta/6,\ (\varepsilon^2/3)K^2(K^2-1)/N\right). \]
Then \(\mu'(\Delta) = (K^2-1)K/6\), and \(\sigma(0) = \varepsilon K\sqrt{(K^2-1)/3}\), and the efficacy,
as defined in §2.4.1.2, is
\[ e = \frac{K(K^2-1)/6}{\varepsilon K\sqrt{(K^2-1)/3}} = \frac{\sqrt{K^2-1}}{2\sqrt3\,\varepsilon}. \qquad (4.6) \]
nominal size. Tests in this second stage may be performed using the two-
sample pooled t-test (3.2), except that the standard deviation estimate of
(4.4) may be substituted for sp , with the corresponding increase in degrees of
freedom in (3.5).
Fisher’s LSD method fails to control family-wise error rate if K > 3. To
see this, suppose F1 (x) = F2 (x) = · · · = FK−1 (x) = FK (x − ∆) for ∆ 6= 0.
Then null hypotheses Fj (x) = Fi (x) are true, for i, j < K. One can make ∆
so large that the analysis of variance test rejects equality of all distributions
with probability close to 1. Then multiple true null hypotheses are tested,
without control for multiplicity. If K = 3, there is only one such test with a
true null hypothesis, and so there is no problem with multiple comparisons.
Contrast this with Tukey’s Honest Significant Difference (HSD) method
(Tukey, 1953, 1993). Suppose that \(Y_j \sim G(0, 1/M_j)\) for \(j \in \{1, \ldots, K\}\), \(U \sim \chi^2_m\), and that the \(Y_j\) and \(U\) are independent. Assume further that the \(M_j\) are all
equal. The distribution of \(\max_{1\le i,j\le K} |Y_j - Y_i|\big/(\sqrt{U/m}/\sqrt{M_j})\) is called the
Studentized range distribution with \(K\) and \(m\) degrees of freedom. If the \(M_j\) are
not all equal, use
\[ \sqrt2 \max_{1\le i,j\le K} \frac{Y_j - Y_i}{\sqrt{U/m}\,\sqrt{1/M_j + 1/M_i}} \]
then
\[ P\left[\mu_k - \mu_j \notin C_{jk} \text{ for some } j, k\right] \le \alpha. \qquad (4.10) \]
This method had been suggested before the Studentized range distribution
had been derived (Tukey, 1949).
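As a minimal sketch, Tukey's HSD may be applied with R's built-in TukeyHSD function; the response y and group factor g below are hypothetical, and the resulting intervals are simultaneous in the sense of (4.10).
set.seed(1)
y<-c(rnorm(10,0),rnorm(10,0.5),rnorm(10,1)) # hypothetical responses
g<-factor(rep(c("a","b","c"),each=10))      # hypothetical group labels
TukeyHSD(aov(y~g),conf.level=0.95) # simultaneous pairwise mean comparisons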
We now proceed to analogs of one-way analysis of variance that preserve
nominal test size for small samples and highly non-Gaussian data.
\[ u_k = \frac{N-1}{N^2(\hat a - \bar a^2)M_k}. \]
The remainder of this section considers the joint distribution of the \(T_G^{\{k\}}\),
calculates their moments, and confirms the asymptotic distribution for \(W_G\).
\[
\frac{(N - M_k - M_j)(M_k + M_j)}{N-1}(\hat a - \bar a^2) = \operatorname{Var}\left[T_G^{\{j,k\}}\right]
= \operatorname{Var}\left[T_G^{\{j\}}\right] + \operatorname{Var}\left[T_G^{\{k\}}\right] + 2\operatorname{Cov}\left[T_G^{\{j\}}, T_G^{\{k\}}\right]
= \frac{(N - M_j)M_j}{N-1}(\hat a - \bar a^2) + \frac{(N - M_k)M_k}{N-1}(\hat a - \bar a^2) + 2\operatorname{Cov}\left[T_G^{\{j\}}, T_G^{\{k\}}\right],
\]
and hence
\[ \operatorname{Cov}\left[T_G^{\{j\}}, T_G^{\{k\}}\right] = \frac{-M_j M_k}{N-1}(\hat a - \bar a^2). \]
\[ \operatorname{Var}[Y]\left(I + (N/M_K)\nu\nu^\top\right) = I + \left(-1 + (N/M_K)(1 - \nu^\top\nu)\right)\nu\nu^\top = I, \qquad (4.13) \]
since \(\nu^\top\nu = \sum_{j=1}^{K-1} M_j/N = 1 - M_K/N\). Then
\[ \operatorname{Var}[Y]^{-1} = I + (N/M_K)\nu\nu^\top. \qquad (4.14) \]
Hence
\[ Y^\top\left(I + (N/M_K)\nu\nu^\top\right)Y \sim \chi^2_{K-1}. \qquad (4.15) \]
Also,
\[
\begin{aligned}
Y^\top\left(I + \frac{N}{M_K}\nu\nu^\top\right)Y
&= \omega^2 \sum_{j=1}^{K-1} (T_G^{\{j\}} - M_j\bar a)^2/M_j
 + \omega^2 \left(\sum_{j=1}^{K-1} (T_G^{\{j\}} - M_j\bar a)\right)^{2}\Big/M_K \\
&= \omega^2\left(\sum_{j=1}^{K-1}\frac{(T_G^{\{j\}} - M_j\bar a)^2}{M_j} + \frac{(T_G^{\{K\}} - M_K\bar a)^2}{M_K}\right) \\
&= \frac{N-1}{(\hat a - \bar a^2)N}\sum_{j=1}^{K}\frac{(T_G^{\{j\}} - M_j\bar a)^2}{M_j}. \qquad (4.16)
\end{aligned}
\]
The above calculation required some notions from linear algebra. The cal-
culation (4.13) requires an understanding of the definition of matrix multipli-
cation, and the associative and distributive properties of matrices, and (4.14)
requires an understanding of the definition of a matrix inverse. Observation
(4.15) is deeper; it requires knowing that a symmetric non-negative definite
matrix may be decomposed as \(V = D^\top D\), for a square matrix \(D\), and an
understanding that variance matrices in the multivariate case transform as
do scalar variances in the one-dimensional case.
One might compare this procedure to the standard analysis of variance
procedure, which relies heavily on the distribution of responses. Alternatively, one might perform the ANOVA analysis on ranks; this procedure does
not depend on the distribution of responses.
This test is called the Kruskal-Wallis test, and is often referred to as the
H test. Here, again, \(R_{ki}\) is the rank of \(X_{ki}\) within the combined sample, \(R_{k\cdot} = \sum_{i=1}^{M_k} R_{ki}\), and (3.21) gives the first multiplicative factor. The test rejects the null hypothesis when the statistic exceeds
\[ G_{K-1}^{-1}(1 - \alpha; 0), \qquad (4.18) \]
the \(1-\alpha\) quantile of the \(\chi^2_{K-1}\) distribution.
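As a minimal sketch (assuming no ties, with hypothetical response y and group factor g), the H statistic may be computed directly from rank sums and compared to this quantile; the result should agree with R's built-in kruskal.test.
kwH<-function(y,g){
  N<-length(y); r<-rank(y)
  Rk<-tapply(r,g,sum); Mk<-tapply(r,g,length)
  H<-12/(N*(N+1))*sum((Rk-Mk*(N+1)/2)^2/Mk) # Kruskal-Wallis H from rank sums
  c(H=H,p=pchisq(H,df=length(Rk)-1,lower.tail=FALSE))
}
# kwH(y,g) should match kruskal.test(y~g) when there are no ties.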
Example 4.3.1 The data in
http://lib.stat.cmu.edu/datasets/Andrews/T58.1
represent corn (maize) yields resulting from various fertilizer treatments
(Andrews and Herzberg, 1985, Example 58). Test the null hypothesis
that corn weights associated with various fertilizer combinations have the
same distribution, vs. the alternative hypothesis that a measure of location
varies among these groups. Treatment is a three-digit string representing
three fertilizer components. Fields in this file are separated by space. The
first three fields are example, table, and observation number. The fourth
and following fields are location, block, plot, treatment, ears of corn, and
weight of corn. Location TEAN has no ties; restrict attention to that
location. The yield for one of the original observations 36 was missing
(denoted by -9999 in the file), and is omitted in this analysis. We calculate
maize<-as.data.frame(scan("T58.1",what=list(exno=0,tabno=0,
lineno=0,loc="",block="",plot=0,trt="",ears=0, wght=0)))
maize$wght[maize$wght==-9999]<-NA
maize$nitrogen<-as.numeric(substring(maize$trt,1,1))
#Location TEAN has no tied values. R treats ranks of
#missing values nonintuitively. Remove missing values.
tean<-maize[(maize$loc=="TEAN")&(!is.na(maize$wght)),]
cat('\n Kruskal Wallis H Test for Maize Data \n')
kruskal.test(split(tean$wght,tean$trt))
#Alternative R syntax:
#kruskal.test(tean$wght,tean$trt)
anova(lm(rank(wght,na.last=NA)~trt,data=tean))
These last two tests have p-values 0.705 and 0.655 respectively. Note the
difference between these Gaussian theory results and the Kruskal-Wallis
test.
Figure 4.1 shows the support of the normal scores statistic on the set of
possible group rank sums for a hypothetical very small data set; the contour
of the approximate critical region for the test of level 0.05 is superimposed.
As is the case for the chi-square test for contingency tables, points enter the
critical region as the level increases in an irregular way, and so constructing an
additive continuity correction to (4.17) is difficult. Yarnold (1972) constructs a
continuity correction that is additive on the probability, rather than on the statistic,
scale. Furthermore, even though group sizes are very small, the sample space
for the group-wise rank sums is quite rich. This richness of the sample space,
as manifest by the small ratio of the point separation (in this case, 1) to
the marginal standard deviations (the square roots of the variance in (3.22)),
implies that continuity correction will have only very limited utility (Chen
and Kolassa, 2018).
Example 4.4.1 Revisiting the TEAN subset of the maize data of Ex-
ample 4.3.1, one might perform the van der Waerden and Savage score
tests,
library(exactRankTests)#Gives savage, normal scores
library(NonparametricHeuristic)#Gives genmultscore
cat("Other scoring schemes: Normal Scores\n")
genmultscore(tean$wght,tean$trt,
cscores(tean$wght,type="Normal"))
cat("Other scoring schemes: Savage Scores\n")
genmultscore(tean$wght,tean$trt,
cscores(tean$wght,type="Savage"))
FIGURE 4.1: Asymptotic Critical Region for Kruskal Wallis Test, level 0.05.
[Figure: support of the statistic on the possible group rank sums; horizontal axis: Group 1 rank sum, vertical axis: Group 2 rank sum; group sizes 3, 3, 4.]
The p-values for normal and Savage scores are 0.9544 and 0.9786 respec-
tively.
One might also apply a permutation test in this context. In this case, the
original data are used in place of ranks. The reference distribution arises from
the random redistribution of group labels among the observed responses. A
Gaussian approximation to this sampling distribution leads to analysis similar
to that of an analysis of variance.
Example 4.4.2 Revisiting the TEAN subset of the maize data of Exam-
ple 4.3.1, the following syntax performs the exact permutation test, again
using package MultNonParam.
#date()
#aov.P(tean$wght[!is.na(tean$wght)],
# tean$trt[!is.na(tean$wght)])
#date()
giving an approximation to the p-value of 0.7254, or, more carefully,
\(0.7254 \pm 1.96\sqrt{0.7254 \times 0.2746/10000} = (0.717, 0.734)\).
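A generic Monte Carlo version of this permutation approach (a sketch only, not the exact enumeration performed by aov.P; y and g below denote a hypothetical response and group factor) randomly permutes group labels and recomputes the analysis of variance F statistic:
permanovap<-function(y,g,B=10000){
  fobs<-summary(aov(y~g))[[1]][1,"F value"]
  fperm<-replicate(B,summary(aov(y~sample(g)))[[1]][1,"F value"])
  mean(fperm>=fobs) # Monte Carlo permutation p-value
}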
An analog of Fisher's least significant difference procedure substitutes the
Kruskal-Wallis test (4.17) of §4.3 for the analysis of variance test (4.3) in the
first stage of the procedure, and substitutes the Mann-Whitney-Wilcoxon
test (3.14) for the two-sample t-test, with the same lack of Type I error control.
In the case of rank testing, when sample sizes are equal, the Studentized
range method may be applied to rank means for simultaneous population dif-
ferentiation (Dunn, 1964); Conover and Iman (1979) credits this to Nemenyi
(1963). This technique may be used to give corrected p-values and corrected
simultaneous confidence intervals for rank means. Since rank mean expecta-
tions are generally not of interest, the application of the Studentized range
distribution to rank means is typically of direct interest only for testing. Use
\(T_G^{\{k\}}/M_k\) in place of \(\bar X_{k\cdot}\). Note that for \(j \ne k\),
\[
\begin{aligned}
\operatorname{Var}\left[\frac{T_G^{\{k\}}}{M_k} - \frac{T_G^{\{j\}}}{M_j}\right]
&= \frac{\operatorname{Var}\left[T_G^{\{k\}}\right]}{M_k^2} + \frac{\operatorname{Var}\left[T_G^{\{j\}}\right]}{M_j^2}
  - 2\,\frac{\operatorname{Cov}\left[T_G^{\{j\}}, T_G^{\{k\}}\right]}{M_j M_k} \\
&= \left(\frac{N - M_j}{M_j} + \frac{N - M_k}{M_k} + 2\right)\frac{\hat a - \bar a^2}{N-1}
 = \left(\frac{1}{M_j} + \frac{1}{M_k}\right)\frac{N(\hat a - \bar a^2)}{N-1}.
\end{aligned}
\]
Hence \(S\) in (4.7) may be replaced with \(\sqrt{N(\hat a - \bar a^2)/(N-1)}\) to obtain simultaneous p-values satisfying (4.8). Also, take the denominator degrees of
freedom to be \(\infty\) as the second argument to \(q\). The same substitution may be
made in (4.9) to obtain (4.10), but the parameters bounded in these intervals
are differences in average rank, which are seldom of interest.
Example 4.5.1 Consider again the yarn data of Example 3.3.1. Con-
sider just type A, and explore pairwise bobbin differences. One might do
all pairwise Mann-Whitney-Wilcoxon tests.
yarna<-yarn[yarn$type=="A",]
cat('\nMultiple Comparisons for Yarn with No Correction\n')
pairwise.wilcox.test(yarna$strength,yarna$bobbin,exact=F,
p.adjust.method="none")
This gives pairwise p-values
1 2 3 4 5
2 0.112 - - - -
3 0.470 1.000 - - -
4 0.245 0.030 0.112 - -
5 0.885 0.312 0.772 0.470 -
6 0.661 0.146 0.470 0.042 1.000
In this case, the initial Kruskal-Wallis test fails to reject the null hypothesis
of equality of distribution, and no further exploration is performed. A rank
analog of Tukey's honest significant difference procedure may be applied using
library(MultNonParam)
tukey.kruskal.test(yarna$strength,yarna$bobbin)
indicating no significant differences.
for all indices \(i \in \{1, \ldots, K-1\}\), with strict inequality at some \(x\) and some \(i\).
This alternative reduces the parameter space to 1/2K of its former size. One might
use as the test statistic
\[ J = \sum_{i<j} U_{ij}, \qquad (4.19) \]
here \(U_i\) is the Mann-Whitney statistic for testing group \(i\) vs. all preceding
groups combined, and \(m_i = \sum_{j=1}^{i} M_j\). The second equality in (4.20) follows
from independence of the values \(U_i\) (Terpstra, 1952). A simpler expression for
this variance is
\[ \operatorname{Var}_0[J] = \frac{1}{72}\left[N(N+1)(2N+1) - \sum_{i=1}^{K} M_i(M_i+1)(2M_i+1)\right]. \qquad (4.21) \]
This test might be corrected for ties, and has certain other desirable properties
(Terpstra, 1952).
Jonckheere (1954), apparently independently, suggested a statistic that
is twice J, centered to have zero expectation, and calculated the vari-
ance, skewness, and kurtosis. The resulting test is generally called the
Jonckheere-Terpstra test.
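As a minimal sketch, J of (4.19) may be computed directly from pairwise counts and standardized with its null moments (the null expectation, \(\sum_{i<j}M_iM_j/2\), is a standard fact not shown in the surviving text; the variance is (4.21)); y and g below are a hypothetical response and ordered group label.
jtz<-function(y,g){
  lev<-sort(unique(g)); K<-length(lev)
  J<-0
  for(i in 1:(K-1)) for(j in (i+1):K)
    J<-J+sum(outer(y[g==lev[i]],y[g==lev[j]],"<")) # later-group obs larger
  N<-length(y); M<-table(g)
  EJ<-(N^2-sum(M^2))/4                              # null expectation of J
  VJ<-(N*(N+1)*(2*N+1)-sum(M*(M+1)*(2*M+1)))/72     # null variance, as in (4.21)
  2*pnorm(-abs((J-EJ)/sqrt(VJ)))                    # two-sided Gaussian approximation
}
# Compare, for untied data, with jonckheere.test from the clinfun package used below.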
Example 4.6.1 Consider again the Maize data from area TEAN in Ex-
ample 4.3.1. The treatment variable contains three digits; the first in-
dicates nitrogen level, with four levels, and is extracted in the code in
Example 4.3.1. Apply the Jonckheere-Terpstra test:
library(clinfun)# For the Jonckheere-Terpstra test
jonckheere.test(tean$wght,tean$nitrogen)
cat('\n K-W Test for Maize, to compare with JT \n')
kruskal.test(tean$wght,tean$nitrogen)
to perform this test, and the comparative three degree of freedom Kruskal-Wallis test.
Under the null hypothesis of equal populations, \(\kappa_{ij} = 1/2\) for all \(i \ne j\).
In the case of multidimensional alternative hypotheses, effect size and ef-
ficiency calculations are more difficult than in earlier one-dimensional cases.
In the case with K ordered categories, there are effectively K − 1 identifi-
able parameters, since, because the location of the underlying common null
distribution for the data is unspecified, group location parameters θj can all
be increased or decreased by the same constant amount while leaving the
underlying model unchanged. On the other hand, the notion of relative effi-
ciency requires calculating an alternative parameter value corresponding to,
at least approximately, the desired power, and as specified by (2.23). This
single equation can determine only a single parameter value, and so relative
efficiency calculations in this section will consider alternative hypotheses of
the form
θ A = ∆θ † (4.23)
for a fixed direction θ † . Arguments requiring solving for the alternative will
reduce to solving for ∆. The null hypothesis is still θ = 0.
with κij defined as in (4.22). Alternative values for κij under shift models (4.2)
are calculated as in (3.25). Without loss of generality, one may take θ1 = 0.
Consider parallels with the two-group setup of §3. The cumulative distri-
bution function F1 of (4.2) corresponds to F of (3.1), and F2 corresponds to G
of (3.1). Then µ(θ) of (3.25) corresponds to κ12 . Calculation of κkl , defined in
(4.22), and applied to particular pairs of distributions, such as the Gaussian
in (3.26) and the logistic in (3.27), and other calculations from the exercises
of §3, hold in this case as well. Each of the difference probabilities κkl , for
k 6= l, depends on the alternative distribution only through θl − θk .
Power may be calculated from (2.18).
Under the null hypothesis, the approximate
distribution for \(W_H\) is \(\chi^2_{K-1}\). One might attempt to act by analogy with (2.17),
and calculate power using alternative hypothesis expectation and variance
matrix. This is possible, but difficult, since WH (and the other approximately
χ2K−1 statistics in this chapter) are standardized using the null hypothesis
variance structure. Rescaling this under the alternative hypothesis gives a
statistic that may be represented approximately as the sum of squares of inde-
pendent Gaussian random variables; however, not only do these variables have
non-zero means, which is easily addressed using a non-central χ2 argument
as in (1.3), but they also have unequal variances; an analog to (1.3) would have unequal weights attached to the
squared summands.
A much easier path to power involves analogy with (2.22): Approximate
the alternative distribution using the null variance structure. This results in
the non-central chi-square approximation using (1.4).
Because the Mann-Whitney and Wilcoxon statistics differ only by an ad-
ditive constant, the Kruskal-Wallis test may be re-expressed as
\[ \left\{\sum_{k=1}^{K}\left(T_k - M_k(N - M_k)\kappa^\circ\right)^2/M_k\right\}\Big/\left[\psi^2(N+1)N\right]. \qquad (4.25) \]
Here \(T_k\) is the Mann-Whitney statistic for testing whether group \(k\) differs from all of the
other groups, with all of the other groups collapsed.
The variance matrix for rank sums comprising WH is singular (that is,
it does not have an inverse), and the argument justifying (1.4) relied on the
presence of an inverse. The argument of §4.2.2 calculated the appropriate
quadratic form, dropping one of the categories to obtain an invertible vari-
ance matrix, and then showed that this quadratic form is the same as that
generating t. The same argument shows that the appropriate non-centrality
parameter is
\[ \xi = \left\{\sum_{k=1}^{K}\left(E_A[T_k] - M_k(N - M_k)(1/2)\right)^2/M_k\right\}\Big/\left[\psi^2(N+1)N\right], \]
where \(E_A[T_k] = M_k\sum_{l=1,\,l\ne k}^{K} M_l\kappa_{kl}\). The non-centrality parameter is
\[
\xi = \frac{1}{\psi^2(N+1)N}\sum_{k=1}^{K}\frac{1}{M_k}\left(\sum_{l=1,\,l\ne k}^{K} M_k M_l(\kappa_{kl} - \kappa^\circ)\right)^{2}
    = \frac{1}{\psi^2(N+1)N}\sum_{k=1}^{K}\frac{1}{M_k}\left(\sum_{l=1}^{K} M_k M_l(\kappa_{kl} - \kappa^\circ)\right)^{2}. \qquad (4.27)
\]
\[ 1 - G_{K-1}\left(G_{K-1}^{-1}(1 - \alpha, 0); \xi\right), \qquad (4.28) \]
kwpower(rep(20,3),(0:2)/2,"normal")
\[ G_{K-1}^{-1}(\beta; \xi) = G_{K-1}^{-1}(1 - \alpha, 0). \qquad (4.29) \]
From (4.27),
\[ \xi \approx \sum_{k=1}^{K}\frac{1}{\lambda_k}\left(\sum_{l=1}^{K}\lambda_k\lambda_l(\kappa_{kl} - \kappa^\circ)\right)^{2} N/\psi^2, \qquad (4.30) \]
for \(\psi = 1/\sqrt{12}\), and
\[ N \approx \xi\psi^2\Big/\sum_{k=1}^{K}\frac{1}{\lambda_k}\left(\sum_{l=1}^{K}\lambda_k\lambda_l(\kappa_{kl} - \kappa^\circ)\right)^{2}, \qquad (4.31) \]
kwsamplesize((0:2)/2,"normal")
As in the one-dimensional case, one can solve for effect size by approximating
the probabilities as linear in the shift parameters. Express
\[ \kappa_{kl} - \kappa^\circ \approx \kappa'(\theta_k^A - \theta_l^A), \qquad (4.32) \]
and explore the multiplier \(\Delta\) from (4.23) giving the alternative hypothesis in
a given direction. Then
\[ N \approx \xi\psi^2\Big/\left((\kappa')^2\sum_{k=1}^{K}\sum_{j=1}^{K}\sum_{l=1}^{K}\lambda_k\lambda_j\lambda_l(\theta_k^A - \theta_l^A)^2\right) = \frac{\xi\psi^2}{\Delta^2(\kappa')^2\zeta^2}, \qquad (4.33) \]
for \(\zeta = \sqrt{\sum_{k=1}^{K}\lambda_k(\theta_k^\dagger)^2 - \left(\sum_{k=1}^{K}\lambda_k\theta_k^\dagger\right)^2}\); \(\zeta\) plays a role analogous to its
role in §3.8, except that here it incorporates the vector giving the direction of
departure from the null hypothesis.
Example 4.7.4 The sum with respect to \(j\) disappears from \(\zeta^2\), since the
sum of the proportions \(\lambda_j\) is 1. Under the same conditions as in Example
4.7.3, one might use (4.33) rather than (4.31). Take \(\theta^\dagger = \theta^A = (0, 1/2, 1)\)
and \(\Delta = 1\). The derivative \(\kappa'\) is tabulated, for the Gaussian and logistic
distributions, in Table 3.6 as \(\mu'(0)\). In this Gaussian case, \(\kappa' = \mu'(0) = (2\sqrt\pi)^{-1} = 0.282\). Also, \(\zeta^2 = 0^2/3 + (1/2)^2/3 + 1^2/3 - (0/3 + (1/2)/3 + 1/3)^2 = 5/12 - 1/4 = 1/6\). The non-centrality parameter, solving (4.29),
is \(\xi = 9.63\). The approximate sample size is \(9.63/(12 \times 1^2 \times 0.282^2/6) = 61\), or approximately 21 observations per group; compare this with an
approximation of 23 per group using a Gaussian approximation, without
expanding the \(\kappa_{ij}\).
This may be computed using the R package MultNonParam using
kwsamplesize((0:2)/2,"normal",taylor=TRUE)
In order to determine the effect size necessary to give a level α test power
1 − β to detect an alternative hypothesis ∆θ † , determine ξ from (4.28), and
then ∆ from (4.34).
Figure 4.3 reflects the accuracy of the various approximations in this sec-
tion. It reflects performance of approximations to the power of the Kruskal-
Wallis test. Tests with K = 3 groups, with the same number of observations
per group, with tests of level 0.05, were considered. Group sizes between 5 and
30 were considered (corresponding to total sample sizes between 15 and 90).
For each sample size, (4.34) was used to generate an alternative hypothesis
with power approximately 0.80, in the direction of equally-spaced alternatives.
The dashed line represents a Monte Carlo approximation to the power, based
on 50,000 observations. The dotted line represents the standard non-central
chi-square approximation (4.28). The solid line represents this same approxi-
mation, except also incorporating the linear approximation (4.32) representing
[Figure 4.3: Approximate power of the Kruskal-Wallis test as a function of the number of observations per group (5 to 30), for 3 groups, level 0.05, target power 0.8. Curves shown: approximation (4.28) with (4.27) and (4.32); approximation (4.28) with (4.27); and a Monte Carlo approximation based on 50,000 observations. Vertical axis: Approximate Power; horizontal axis: Number per Group.]
for \(\lambda_i = M_i/N\), where again \(\kappa^\circ\) is the common value of \(\kappa_{jk}\) under the null
hypothesis, and \(\kappa'\) is the derivative of the probability in (3.25), as a function
of the location shift between two groups, evaluated at the null hypothesis, and
calculated for various examples in §3.8.2. Hence
\[ \mu_J'(0) = \sum_{i=1}^{K-1}\sum_{j=i+1}^{K}\lambda_i\lambda_j\kappa'\left[\theta_j^\dagger - \theta_i^\dagger\right]. \]
Recall that \(\kappa_{ij}\) depended on two group indices \(i\) and \(j\) only because the locations were potentially shifted relative to one another; the value \(\kappa^\circ\) and
its derivative \(\kappa'\) are evaluated at the null hypothesis of equality of distributions, and hence do not depend on the indices. Furthermore, from (4.21),
\(\operatorname{Var}[T_J] \approx \frac{1}{36}\left[1 - \sum_{k=1}^{K}\lambda_k^3\right]/N\). Consider the simple case in which \(\lambda_k = 1/K\)
for all \(k\), and in which \(\theta_j^\dagger - \theta_i^\dagger = (j - i)\). Then \(\mu_J'(0) = \kappa'(K^2-1)/(6K)\),
\(\operatorname{Var}[T_J] \approx \frac{1}{36}\left(1 - 1/K^2\right)/N\), and
\[ e_J = \frac{\kappa'(K^2-1)/(6K)}{\sqrt{\tfrac{1}{36}\left[1 - 1/K^2\right]}} = \frac{\kappa'(K^2-1)}{\sqrt{K^2-1}} = \kappa'\sqrt{K^2-1}. \]
The efficacy of the Gaussian-theory contrast, from (4.6), is \(\sqrt{K^2-1}/(2\sqrt3\,\varepsilon)\).
Hence the asymptotic relative efficiency of the Jonckheere-Terpstra test to
the contrast of means is
\[ (\kappa')^2(12\varepsilon^2). \qquad (4.35) \]
This is the same as the asymptotic relative efficiency in the two-sample case
in Table 3.6.
The ratio of sample sizes needed for approximately the same power for the
same alternative from the F and Kruskal-Wallis tests is approximately
\[ N_A/N_H \approx (\psi_A^2/\psi_H^2)(\kappa_H')^2/(\kappa_A')^2 = 12\varepsilon^2(\kappa_H')^2, \]
which is the same as in the unordered case (4.35) and in the two-sample case.
4.9 Exercises
1. The data set
HTTP://ftp.uni-
bayreuth.de/math/statlib/datasets/federalistpapers.txt
HTTP://ftp.uni-
bayreuth.de/math/statlib/datasets/Plasma Retinol
HTTP://lib.stat.cmu.edu/datasets/CPS 85 Wages
reflects wages from 1985. The first 42 lines of this file contain a
description of the data set, and an explanation of variables; delete
these lines first, or skip them when you read the file. All fields are
numeric. The tenth field is sector, and the sixth field is hourly wage;
you can skip everything else. Test for a difference in wage between
various sector groups, using a Kruskal–Wallis test.
5
Group Differences with Blocking
Under the null hypothesis that the distribution of the Xi is the same as the
distribution of the Yi , and again, assuming symmetry of the differences, then
\((S_1, \ldots, S_n)\) and \((R_1, \ldots, R_n)\) are independent random vectors, because \(S_j\) and \(|X_j|\) are
pairwise independent under \(H_0\).
Components of the random vector (R1 , . . . , Rn ) are dependent, and hence
calculation of the variance from TSR via (5.2) requires calculation of the sum of
identically distributed but not independent random variables. An alternative
formulation, as the sum of independent but not identically distributed random
variables, will prove more tractable. Let
\[ V_j = \begin{cases} 1 & \text{if the item whose absolute value is ranked } j \text{ is positive} \\ 0 & \text{if the item whose absolute value is ranked } j \text{ is negative.} \end{cases} \]
Hence \(T_{SR} = \sum_j jV_j\), and the null expectation and variance are
\[ E_0[T_{SR}] = \sum_j j\,E_0[V_j] = n(n+1)/4 \qquad (5.3) \]
and
\[ \operatorname{Var}_0[T_{SR}] = \sum_j j^2\operatorname{Var}_0[V_j] = \sum_j j^2/4 = n(2n+1)(n+1)/24. \qquad (5.4) \]
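As a minimal sketch (assuming a hypothetical vector d of differences with no zeros or ties), the Gaussian approximation to the signed-rank test based on (5.3) and (5.4) may be computed directly:
signedrankz<-function(d){
  n<-length(d); r<-rank(abs(d))
  TSR<-sum(r[d>0])                              # signed-rank statistic
  z<-(TSR-n*(n+1)/4)/sqrt(n*(2*n+1)*(n+1)/24)   # standardize via (5.3) and (5.4)
  2*pnorm(-abs(z))                              # two-sided Gaussian approximation
}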
One can also calculate exact probabilities for TSR recursively, as one could
for the two-sample statistic. There are 2n ways to assign signs to ranks 1, . . . , n.
Let f (t, n) be the number of such assignments yielding TSR = t with n obser-
vations. Again, as in §3.4.1, summing the counts for shorter random vectors
with alternative final values,
\[ f(t, n) = \begin{cases} 0 & \text{for } t < 0 \text{ or } t > n(n+1)/2 \\ 1 & \text{if } n = 1 \text{ and } t \in \{0, 1\} \\ f(t, n-1) + f(t-n, n-1) & \text{otherwise.} \end{cases} \]
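As a minimal sketch, this recursion may be implemented directly and checked against R's exact signed-rank distribution, since the null probability of the value t is f(t, n)/2ⁿ:
f<-function(t,n){
  if(t<0||t>n*(n+1)/2) return(0)
  if(n==1) return(as.numeric(t %in% c(0,1)))
  f(t,n-1)+f(t-n,n-1)   # recursion on the number of observations
}
f(9,10)/2^10    # exact probability from the recursion
dsignrank(9,10) # built-in exact signed-rank probability; should agree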
Pair 1 2 3 4 5 6 7 8 9 10
First 1005 1035 1281 1051 1034 1079 1104 1439 1029 1160
Second 963 1027 1272 1079 1070 1173 1067 1347 1100 1204
Diff. -42 -8 -9 28 36 94 -37 -92 71 44
Rank 6 1 2 3 4 10 5 9 8 7
giving an exact p-value of 0.695. Compare these to the results of the sign
test and t-test:
library(BSDA)#Need for sign test.
SIGN.test(brainpairs$diff)
t.test(brainpairs$diff)
Example 5.2.2 Perform the asymptotic score tests on the brain volume
differences of Example 5.2.1.
library(MultNonParam)
cat("Asymptotic test using normal scores\n")
brainpairs$normalscores<-qqnorm(seq(length(
brainpairs$diff)),plot.it=F)$x
symscorestat(brainpairs$diff,brainpairs$normalscores)
cat("Asymptotic test using savage scores\n")
brainpairs$savagescores<-cumsum(
1/rev(seq(length(brainpairs$diff))))
symscorestat(brainpairs$diff,brainpairs$savagescores)
giving p-values of 0.863 and 0.730 respectively.
Permutation testing can also be done, using raw data values as scores.
This procedure uses the logic that Xj and Yj having the same marginal
distribution implies Yj − Xj has a symmetric distribution. Joint distributions
exist for which this is not true, but these examples tend to be contrived.
-92.0 -67.0 -64.5 -50.5 -50.0 -42.0 -39.5 -37.0 -32.0 -28.0
-25.5 -25.0 -24.0 -23.0 -22.5 -10.5 -9.0 -8.5 -8.0 -7.0
-4.5 -3.0 -0.5 1.0 1.0 3.5 9.5 10.0 13.5 14.0
14.5 17.0 17.5 18.0 26.0 28.0 28.5 31.0 31.5 32.0
36.0 36.0 40.0 42.5 43.0 44.0 49.5 53.5 57.5 61.0
65.0 69.0 71.0 82.5 94.0
Their median is observation 28, which is 10.0. Estimate the median dif-
ference as 10.0.
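As a minimal sketch (assuming the differences brainpairs$diff of Example 5.2.1 are available), the Walsh averages and this estimate may be computed as
d<-brainpairs$diff
w<-outer(d,d,"+")/2              # all pairwise averages of the differences
walsh<-w[lower.tri(w,diag=TRUE)] # keep each of the n(n+1)/2 Walsh averages once
median(walsh)                    # median of Walsh averages; 10.0 here
# Compare with wilcox.test(d,conf.int=TRUE)$estimate, the pseudomedian.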
Example 5.2.4 Refer again to the brain volume data of Example 5.2.1.
Find a 95% confidence interval for the difference in brain volume. The
0.025 quantile of the Wilcoxon signed-rank statistic with 10 observations
is t◦ = 9; this can be calculated from R using
qsignrank(0.025, 10)
and confidence interval endpoints are observations 9 and 55+1-9=47. As
tabulated above, Walsh averages 9 and 47 are -32.0 and 49.5 respectively.
Hence the confidence interval is (-32.0, 49.5). This might have been cal-
culated directly in R using
wilcox.test(brainpairs$diff,conf.int=TRUE)
Similar techniques to those of this section were used in §2.3.3 to give con-
fidence intervals for the median, and in §3.10 to give confidence intervals for
median differences. In each case, a statistic dependent on the unknown param-
eter was constructed, and estimates and confidence intervals were constructed
by finding values of the parameter equating the statistic to appropriate quan-
tiles. However, the treatment of this section and that of §2.3.3 differ, in that
[Figure: Hypothetical Data Set with 6 Observations; horizontal axis: Potential Point of Symmetry.]
to match the notation of (2.15) and (2.19). Note that \(\mu'(0)\) is now twice the
value from the Mann-Whitney-Wilcoxon statistic, for distributions symmetric
about 0. This will be used for asymptotic relative efficiency calculations in the
exercises.
and hence
\[ E[R_{k\cdot\cdot}] = \sum_{l=1}^{L} M_{kl}\left(\sum_{j=1}^{K} M_{jl} + 1\right)\Big/2. \qquad (5.6) \]
Variances of rank sums depend on the covariance structure of the ranks.
Ranks that make up each sum within a block are independent, but sums of
ranks across blocks are dependent. Within a treatment-block cell, Var [Rkl. ]
is the same as for the Mann-Whitney-Wilcoxon statistic:
\[ \operatorname{Var}[R_{kl\cdot}] = M_{kl}\left(\sum_{j\ne k} M_{jl}\right)\left(\sum_{j=1}^{K} M_{jl} + 1\right)\Big/12. \]
Since blocks are ranked independently, variances and covariances for rank
sums add across blocks:
\[ \operatorname{Cov}[R_{k\cdot\cdot}, R_{m\cdot\cdot}] = -\sum_{l=1}^{L} M_{kl}M_{ml}\left(\sum_{j=1}^{K} M_{jl} + 1\right)\Big/12 \qquad (5.8) \]
for \(k \ne m\). In this balanced case, the test statistic (5.9) can be shown to equal
the sum of squares of ranks away from their average value per treatment
group:
\[ W_F = 12L\sum_{k=1}^{K}\left[\bar R_{k\cdot\cdot} - (MK + 1)/2\right]^2\Big/\left[K(KM + 1)\right]. \]
expensesd<-as.data.frame(scan("friedman.dat",
what=list(cat="",g1=0,g2=0,g3=0,g4=0,g5=0,g6=0,g7=0)))
expenserank<-t(apply(as.matrix(expensesd[,-1]),1,rank))
rownames(expenserank)<-expensesd[[1]]
gives the table of ranks
g1 g2 g3 g4 g5 g6 g7
Housing 5 1 3 2 4 6 7
Operations 1 3 4 6 2 5 7
Food 1 2 7 3 5 4 6
Clothing 1 3 2 4 5 6 7
Furnishings 2 1 6 3 7 5 4
Transportation 1 2 3 6 5 4 7
Recreation 1 2 3 4 7 5 6
Personal 1 2 3 6 4 7 5
Medical 1 2 4 5 7 3 6
Education 1 2 4 5 3 6 7
Community 1 5 2 3 7 6 4
Vocation 1 5 2 4 3 6 7
Gifts 1 2 3 4 5 6 7
Other 5 4 7 2 6 1 3
Rank sums are 23, 36, 53, 57, 70, 70, 83, and group means, computed for example
via
apply(expenserank,2,mean)
are 1.643, 2.571, 3.786, 4.071, 5.000, 5.000, 5.929. Subtracting the overall
mean, 4, squaring, and adding, gives 13.367. Multiplying by 12L/(K(K +
1)) = 12×14/(7×8) = 3 gives the statistic value WF = 40.10. Comparing
to the χ26 distribution gives a p-value of 4.35×10−7 . This might also have
been done entirely in R using
friedman.test(as.matrix(expensesd[,-1]))
The development of Friedman’s test in the two-way layout remarkably pre-
ceded analogous testing in one-way layouts. When K = 2 and Mkl are all 1,
Friedman’s test is equivalent to the sign test applied to differences within a
block.
Prentice (1979) performs these calculations in somewhat more generality, and the
resulting unbalanced two-way test is generally called the Prentice test. Skillings
and Mack (1981) address this question using explicit numerical matrix inver-
sion.
Example 5.4.2 The data set
http://stat.rutgers.edu/home/kolassa/Data/chicken.dat
contains weight gain after 16 weeks, protein level, protein source, and fish soluble
level. The dependence of weight gain on protein source might be graphi-
cally depicted using box plots (Figure 5.2):
temp1<-temp<-as.data.frame(scan("chicken.dat",what=
list(source="", lev=0,fish=0,weight=0,w2=0)))
temp1$weight<-temp1$w2;temp$h<-0;temp1$h<-1
temp$w2<-NULL;temp1$w2<-NULL
chicken<-rbind(temp,temp1)
attach(chicken)
boxplot(split(weight,source),horizontal=TRUE,ylab="g.",
main="Weight gain for Chickens", xlab="Protein Source")
detach(chicken)
Hence blocking on source is justified. Test for dependence of weight gain
on protein level, blocking on source. Perform these calculations in R using
attach(chicken)
prentice.test(weight,source,blocks=lev)
detach(chicken)
to obtain the p-value 0.1336.
[Figure 5.2: Weight gain for Chickens — box plots of weight gain (g.) by protein source; source levels shown include S and G.]
#third argument.
chicken<-chicken[order(chicken$lev),]
aov.P(chicken$weight,as.numeric(as.factor(chicken$source)),
c(8,16,24))
The p-value is 0.182.
This test was proposed by Page (1963) in the balanced case with one replicate
per block-treatment pair, and is called Page’s test.
For more general replication patterns, the null expectation and variance
for the statistic may be calculated, using (5.6), (5.7), and (5.8), and
\[ E_0[T_L] = \sum_{k=1}^{K} k\,E_0[R_{k\cdot\cdot}] \]
\[ \operatorname{Var}_0[T_L] = \sum_{k=1}^{K} k^2\operatorname{Var}_0[R_{k\cdot\cdot}] + \sum_{i=1}^{K}\sum_{k=1,\,k\ne i}^{K} ik\operatorname{Cov}_0[R_{i\cdot\cdot}, R_{k\cdot\cdot}]. \]
using (3.20).
library(crank); page.trend.test(expensesd[,-1],FALSE)
Page (1963) presents only the case with Mkl all 1, and provides a table of
the distribution of TL for small values of K and L. For larger values of K and
L than appear in the table, Page (1963) suggests a Gaussian approximation.
The scores in (5.10) are equally spaced. This is often a reasonable choice
in practice. When the Mkl are not all the same, imbalance among numbers of
ranks summed may change the interpretation of these scores, and a preferred
statistic definition to replace (5.10) is
\[ T_L^{*} = \sum_{k=1}^{K} k\,\bar R_{k\cdot\cdot}. \]
Example 5.6.2 Refer again to the chicken weight gain data of Exam-
ple 5.4.2. This data set is balanced, but one could ignore the balance and
numerically invert the variance matrix of all rank sums except the last.
In this case, test for an ordered effect of protein level on weight gain,
blocking on protein source. Perform the test using
library(MultNonParam)
attach(chicken)
cat('\n Page test with replicates \n')
page.test.unbalanced(weight,lev,source)
detach(chicken)
In this balanced case, rank mean expectations are all 6.5, variances are
1.083, covariances are −0.542. The rank means by treatment level are
7.625, 7.250, 4.625, giving an overall statistic value of 36, compared with
a null expectation of 39 and null variance of 3.25; the p-value is 0.096.
Do not reject the null hypothesis.
5.7 Exercises
1. The data set
http://ftp.uni-bayreuth.de/math/statlib/datasets/schizo
http://ftp.uni-bayreuth.de/math/statlib/datasets/schizo
http://stat.rutgers.edu/home/kolassa/Data/yarn.dat
http://lib.stat.cmu.edu/datasets/CPS 85 Wages
reflects wages from 1985. The first 42 lines of this file contain a
description of the data set, and an explanation of variables; delete
these lines first, or skip them when you read the file. All fields are
numeric. The tenth field is sector, the fifth field is union member-
ship, and the sixth field is hourly wage; you can skip everything
else. Test for a difference in wage between union members and non-
members, blocking on sector.
6
Bivariate Methods
Suppose that independent random vectors (Xi , Yi ) all have the same joint
density fX,Y (x, y). This chapter investigates the relationship between Xi and
Yi for each i. In particular consider testing the null hypothesis fX,Y (x, y) =
fX (x)fY (y), without specifying an alternative hypothesis for fX,Y , or even
without knowledge of the null marginal densities.
This null hypothesis is most easily tested against the alternative hypothesis
that, vaguely, large values of X are associated with large values of Y (or vice
versa). Furthermore, if the null hypothesis is not true, the strength of the
association between Xi and Yi must be measured.
So
\[ \operatorname{Var}\left[\sum_j X_j Z_j\right] = \sum_j X_j^2\operatorname{Var}[Z_j] + \sum_{j\ne i} X_i X_j\operatorname{Cov}[Z_j, Z_i] = n^2\sigma_X^2\sigma_Y^2/(n-1), \]
where \(\sigma_X^2 = \frac1n\sum_i X_i^2\). So
\[ \operatorname{Var}[r_P] = 1/(n-1), \qquad (6.2) \]
under the permutation distribution (Hotelling and Pabst, 1936). This value
was determined by Student (1908), after fitting a parametric model to em-
pirical data, rounding parameter estimates to values more consistent with
intuition, and calculating the moments of this empirical distribution. Higher-
order moments were determined by David et al. (1951) using a method of
proof similar to that presented above for the variance.
This result suggests a test of the null hypothesis of independence versus
the two-sided alternative at level α using rP , that rejects the null hypothesis
if
\[ |r_P| > z_{\alpha/2}/\sqrt{n-1}. \qquad (6.3) \]
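As a minimal sketch (with hypothetical vectors x and y), a Monte Carlo permutation test of independence based on r_P may be compared with the asymptotic criterion (6.3):
set.seed(1)
x<-rnorm(20); y<-rnorm(20)               # hypothetical paired observations
robs<-cor(x,y)
rperm<-replicate(5000,cor(x,sample(y)))  # permute one coordinate
mean(abs(rperm)>=abs(robs))              # two-sided Monte Carlo p-value
abs(robs)>qnorm(0.975)/sqrt(length(x)-1) # asymptotic level 0.05 test from (6.3)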
that is, rS is the Pearson correlation on ranks. Exact agreement for ordering
of ranks results in a rank correlation of 1, exact agreement in the opposite
direction results in a rank correlation of -1, and the Cauchy-Schwarz theorem
indicates that these two values are the extreme possible values for the rank
correlation.
The sums of squares in the denominator have the same value for every
data set, and the numerator can be simplified. Note that
\[ \sum_{j=1}^{n}(j - (n+1)/2)^2 = \sum_{j=1}^{n} j^2 - n(n+1)^2/4. \]
Similarly, \(\sum_{j=1}^{n}(R_j - (n+1)/2)^2 = n(n^2-1)/12\). Furthermore, \(\sum_{j=1}^{n}(j - (n+1)/2)(n+1)/2 = 0\), and
\[ r_S = \frac{12}{n(n^2-1)}\sum_{j=1}^{n}(j - (n+1)/2)R_j = \frac{12}{n(n^2-1)}\left(\sum_{j=1}^{n} jR_j - n(n+1)^2/4\right). \qquad (6.4) \]
Hoeffding (1948) provides a central limit theorem for the permutation distribution of such rank statistics.
Example 6.3.1 Consider again the twin brain data of Example 5.2.1,
plotted in Figure 6.1 As before, the data set brainpairs has 10 records,
reflecting the results from 10 pairs of twins, and is plotted in Figure 6.1
via
attach(brainpairs); plot(v1,v2, xlab="Volume for Twin 1",
ylab="Volume for Twin 2",main="Twin Brain Volumes")
Ranks for twin brains are given in Table 6.1. The sum in the second factor
of (6.4) is
1×1+4×2+9×9+5×5+3×4+6×7+7×3+10×10+2×6+8×8 = 366,
and so the entire second factor is 366 − 10 × 11²/4 = 63.5, and the Spear-
man correlation is (12/(10 × 99)) × 63.5 = 0.770. Observed correlations
may be calculated in R, and compared with a Monte Carlo permutation distribution (here using 10000 random permutations), using
obsd<-c(cor(v1,v2),cor(v1,v2,method="spearman"))
out<-matrix(NA,2,10000)
for(j in seq(10000)){
   newv1<-sample(v1)
   out[,j]<-c(cor(newv1,v2),cor(newv1,v2,method="spearman"))
}
cat("\n Monte Carlo One-Sided p value\n")
apply(apply(out,2,">=",obsd),1,"mean")
to obtain p-values 1.5 × 10−4 and 6.1 × 10−3 . The asymptotic critical
value, from (6.3), is given by
[Figure 6.1: Twin Brain Volumes — Volume for Twin 2 plotted against Volume for Twin 1.]
Pair 1 2 3 4 5 6 7 8 9 10
Rank of First 1 4 9 5 3 6 7 10 2 8
Rank of Second 1 2 9 5 4 7 3 10 6 8
Note that
\[ \sum_{j=1}^{n} jR_j = \sum_{j=1}^{n} j\left(1 + \sum_{i\ne j} I(Y_j > Y_i, X_j > X_i)\right) = n(n+1)/2 + \sum_{j=1}^{n} j\sum_{i\ne j} I(Y_j > Y_i, X_j > X_i). \]
6.3.2 Kendall’s τ
Pairs of bivariate observations for which the X and Y values are in the
same order are called concordant; (6.7) refers to the probability that a pair
is concordant. Pairs that are not concordant are called discordant. Kendall
(1938) constructs a new measure based on counts of concordant and discor-
dant pairs. Consider the population quantity τ = 2p1 − 1, for p1 of (6.7),
for Zij = I((Xj −Xi )(Yj −Yi ) > 0). Then U † = n(n−1)/2−U is the number of
discordant pairs; this number equals the number of rearrangements necessary
to make all pairs concordant. Estimate τ by the excess of concordant over
discordant pairs, divided by the maximum:
\[ r_\tau = \frac{U - (n(n-1)/2 - U)}{n(n-1)/2} = \frac{4U}{n(n-1)} - 1. \qquad (6.8) \]
Note that E [U ] = n(n − 1)p1 /2, for p1 of (6.7), and E [rτ ] = 2p1 − 1. The null
value of p1 is half, recovering
E0 [U ] = n(n − 1)/4, E0 [rτ ] = 0. (6.9)
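As a minimal sketch (for hypothetical tie-free vectors x and y), r_τ of (6.8) may be computed by counting concordant pairs directly; the result should agree with cor(x, y, method = "kendall") when there are no ties.
rtau<-function(x,y){
  n<-length(x); U<-0
  for(i in 1:(n-1)) for(j in (i+1):n)
    U<-U+((x[j]-x[i])*(y[j]-y[i])>0) # count concordant pairs
  4*U/(n*(n-1))-1                    # Kendall correlation as in (6.8)
}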
As with the Spearman correlation rS , Kendall’s correlation rτ may be used
to test the null hypothesis of independence, relying on its asymptotic Gaussian
distribution, but the test requires the variance of rτ . Note that
\[ \operatorname{Var}[U] = \sum_{i<j}\operatorname{Var}[Z_{ij}] + \sum\nolimits^{*}\operatorname{Cov}[Z_{ij}, Z_{kl}]. \]
Here the sum \(\sum^{*}\) is over index pairs \(i<j\) and \(k<l\) involving three distinct indices. This sum consists
of \(n^2(n-1)^2/4 - n(n-1)/2 - n(n-1)(n-2)(n-3)/4 = n(n-1)(n-2)\)
terms. Hence
\[ \operatorname{Var}[U] = \sum_{i<j}(p_1 - p_1^2) + \sum\nolimits^{*}(p_3 - p_1^2) \]
Example 6.3.2 Consider again the twin brain data of Example 5.2.1,
with ranks in Table 6.1. Discordant pairs are 2 – 5, 2 – 9, 4 – 7, 4 – 9,
5 – 7, 5 – 9, 6 – 7, and 7 – 9. Hence 8 of 45 pairs are discordant, and
the remainder, 37, are concordant. Hence rτ = 4 × 37/90 − 1 = 0.644,
from (6.8). This may be calculated using
cor(brainpairs$v1,brainpairs$v2,method="kendall")
Figure 6.2 shows two artificial data sets with functional relationships be-
tween two variables. In the first relationship, the two variables are identical,
while in the second, the second variable is related to the arc tangent of the
first. Both relationships show perfect association. The first relation-
ship represents perfect linear association, while the second reflects perfect
nonlinear association. Hence the Spearman and Kendall association measures
are 1 for both relationships; the Pearson correlation is 1 for the first relation-
ship, and 0.937 for the second relationship.
[Figure 6.2: two artificial data sets plotted as y against x; one symbol shows the relationship y = x, and the other y = 5 arctan(x).]
Generally, \(\hat\beta_1 \ne \mathrm{smed}[Y_1, \ldots, Y_n] - \hat\beta_2\,\mathrm{smed}[X_1, \ldots, X_n]\). Sen (1968) investi-
gates this procedure further.
This approach to estimation of β2 only holds in contexts with no ties
among the explanatory variable values.
t: 1 2 3 4 10 12 18
x: 9 15 19 20 45 55 78
The following commands calculate a confidence interval for the slope pa-
rameter:
tt<-c(1,2,3,4,10,12,18); xx<-c(9,15,19,20,45,55,78)
out<-rep(NA,length(tt)*(length(tt)-1)/2)
count<-0
for(ii in seq(length(tt)-1)) for(jj in (ii+1):length(tt)){
count<-count+1
out[count]<-(xx[jj]-xx[ii])/(tt[jj]-tt[ii])
}
There are 7 × 6/2 = 21 pairwise slopes W̃j :
1.00 2.50 3.67 3.71 3.75 3.83 3.93 3.94 4.00 4.00
4.00 4.00 4.06 4.12 4.14 4.17 4.18 4.38 5.00 5.00 6.00
giving 4 as the quantile. Hence the 0.95 confidence interval is (W̃4 , W̃18 ) =
(3.71, 4.38). These calculations may be performed using theil(xx,tt). It
may also be performed using theilsen(xx~tt) from the package deming,
which also gives a confidence interval for the intercept term. The intercept
is estimated by (6.14). Plotting may be done via
library(deming); plot(tt,xx);abline(theilsen(xx~tt))
Results are in Figure 6.3.
[Figure 6.3: scatter plot of the data of this example with the fitted Theil–Sen line.]
which the horizontal lines cross the correlation curve, and their intersec-
tions with the horizontal axis determine the end points of the regression
confidence interval. Recall that the Theil estimator is the inversion of the
Kendall correlation. Figure 6.5 displays the results of
attach(bp)
plot(spd,dpd,main="Blood Pressure Change", ylab="Diastolic",
xlab="Systolic")
library(deming)#For theilsen
tsout<-theilsen(dpd~spd)
abline(tsout); abline(lm(dpd~spd),lty=2)
legend(median(spd),median(dpd),lty=1:2,
legend=c("Inversion of tau","Least squares"))
detach(bp)
[Figure 6.5: Blood Pressure Change — Diastolic Blood Pressure Change plotted against Systolic Blood Pressure Change, with fitted lines from inversion of tau and from least squares.]
6.5 Exercises
1. The data set
http://ftp.uni-bayreuth.de/math/statlib/datasets/federalistpapers.txt
http://stat.rutgers.edu/home/kolassa/Data/twinbrain.dat
http://ftp.uni-bayreuth.de/math/statlib/datasets/schizo
7
Multivariate Analysis
Suppose that one observes n subjects, indexed by i ∈ {1, . . . , n}, and, for
subject i, observes responses Xij , indexed by j ∈ {1, . . . , J}. Potentially, co-
variates are also observed for these subjects.
This chapter explores explaining the multivariate distribution of Xij in
terms of these covariates. Most simply, these covariates often indicate group
membership.
pooled sample covariance for all observations: \(\hat\Sigma_{j,j'} = ((M_1 - 1)\hat\Sigma_{1,j,j'} + (M_2 - 1)\hat\Sigma_{2,j,j'})/(M_1 + M_2 - 2)\). Then the Hotelling two-sample statistic
measures the difference between sample mean vectors, in a way that accounts
for sample variance, and combines the response variables. Furthermore, un-
der the null hypothesis of equality of distribution, and assuming that this
distribution is multivariate Gaussian,
\[ \frac{M_1 + M_2 - J - 1}{(M_1 + M_2 - 2)J}\,T^2 \sim F_{J,\,M_1 + M_2 - J - 1}. \]
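As a minimal sketch (x1 and x2 are hypothetical M₁×J and M₂×J data matrices), the two-sample statistic, taken here in its standard quadratic-form definition consistent with the pooled covariance and the F calibration above, may be computed as
hotelling2<-function(x1,x2){
  M1<-nrow(x1); M2<-nrow(x2); J<-ncol(x1)
  d<-colMeans(x1)-colMeans(x2)
  Sp<-((M1-1)*cov(x1)+(M2-1)*cov(x2))/(M1+M2-2) # pooled covariance
  T2<-drop(t(d)%*%solve(Sp*(1/M1+1/M2))%*%d)    # Hotelling two-sample statistic
  Fstat<-(M1+M2-J-1)*T2/((M1+M2-2)*J)
  c(T2=T2,p=pf(Fstat,J,M1+M2-J-1,lower.tail=FALSE))
}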
FIGURE 7.1: Univariate Normal Data that are Not Bivariate Normal
[Four scatter plot panels of bivariate samples whose univariate marginals are Gaussian but whose joint distributions are not bivariate Gaussian.]
that is, the estimate minimizes the sum of distances from data vectors to
the parameter vector, with distance measured by the sum of absolute values.
Then combine components of T from (7.5) to give the multivariate sign test
statistic, or from (7.6) to give the multivariate sign rank test. In either case,
components are combined using the quadratic form
\[ W = T^\top\,\Upsilon^{-1}\,T, \qquad (7.7) \]
for \(\Upsilon = \operatorname{Var}_0[T]\). As in §2.3, in the case that the null location value is 0, the
null distribution for the multivariate test statistic is generated by assigning
equal probabilities to all \(2^n\) modifications of the data set obtained by multiplying the
rows \((X_{i1}, \ldots, X_{iJ})\) by +1 or −1. That is, the null hypothesis distribution of
\(T(X)\) is generated by placing probability \(2^{-n}\) on each of these \(2^n\) modified data sets.
Test statistics (7.1) and (7.2) arose as quadratic forms of independent and
identically distributed random vectors, and the variances included in their
definitions were scaled accordingly. Statistic (7.3) is built using a more com-
plicated variance; this pattern will repeat with nonparametric analogies to
parametric tests.
Combining univariate tests into a quadratic form raises two difficulties. In
previous applications of rank statistics, that is, in the case of univariate sign
and signed-rank one-sample tests, in the case of two-sample Mann-Whitney-
Wilcoxon tests, and in the case of Kruskal-Wallis testing, all dependence
of the permutation distribution on the original data was removed through
ranking. This is not the case for T , since this distribution involves correla-
tions between ranks of the various response vectors. These correlations are not
specified by the null hypothesis. The separate tests are generally dependent,
and dependence structure depends on distribution of raw observations. The
asymptotic distribution of (7.7) relies on this dependence via the correlations
between components of T . The correlations must be estimated.
Furthermore, the distribution of W of (7.7) under the null hypothesis is
dependent on the coordinate system for the variables, but, intuitively, this
dependence on the coordinate system might be undesirable. For example,
suppose that (X1i , X2i ) has an approximate multivariate Gaussian distribu-
tion, with expectation µ, and variance Σ, with Σ known. Consider the null
hypothesis H0 : µ = 0. Then the canonical test is (7.1), and it is unchanged
if the test is based on (Ui , Vi ) for Ui = X1i + X2i and Vi = X1i − X2i , with
Σ modified accordingly. Hence the parametric analysis is independent of the
coordinate system.
The first of these difficulties is readily addressed. Under \(H_0\), the marginal
sign test statistic (7.5) satisfies \(T_j/\sqrt n \approx G(0, 1)\). Conditional on the relative
ranks of the absolute values of the observations, the permutation distribution
is entirely specified, and conditional joint moments are calculated. Under the
permutation distribution,
\[ \hat\sigma_{jj'} = \sum_i \tilde s(X_{ij})\tilde s(X_{ij'})/n, \qquad (7.9) \]
and so the variance estimate used in (7.7) has components \(\hat\upsilon_{jj'} = \hat\sigma_{jj'}/n\).
Here again, the covariance is defined under the distribution that consists of
the \(2^n\) random reassignments of signs to the data vectors, each equally weighted.
As before, variances do not depend on data values, but covariances do depend
on data values. The solution for the Wilcoxon signed-rank test is also available
(Bennett, 1965).
The second problem may be addressed by using the data to redefine the coordinate system (Randles, 1989; Oja and Randles, 2004).
Combine the components of \(T(X)\) to construct the statistic (7.10), and reject the null hypothesis when it exceeds
\(G_J^{-1}(1 - \alpha, 0)\), the \(1-\alpha\) quantile of the \(\chi^2_J\) distribution, with non-centrality
parameter 0. Bickel (1965) discusses these (and other tests) in generality.
Example 7.3.1 Consider the data of Example 6.4.2. We test the null
hypothesis that the joint distribution of systolic and diastolic blood pres-
sure changes is symmetric about (0, 0), using Hotelling’s T 2 and the two
asymptotic tests that substitute signs and signed-ranks for data. This test
is performed in R using
# For Hotelling and multivariate rank tests resp:
library(Hotelling); library(ICSNP)
cat('\n One-sample Hotelling Test\n')
HotellingsT2(bp[,c("spd","dpd")])
cat('\n Multivariate Sign Test\n')
rank.ctest(bp[,c("spd","dpd")],scores="sign")
cat('\n Multivariate Signed Rank Test\n')
rank.ctest(bp[,c("spd","dpd")])
P -values for Hotelling’s T 2 , the marginal sign rank test, and marginal
sign test, are 9.839 × 10−6 , 2.973 × 10−3 , and 5.531 × 10−4 .
Tables 7.1 and 7.2 contain attained levels and powers for one-sample multi-
variate tests with two manifest variables of nominal level 0.05, from various
distributions.
Tests compared are Hotelling’s T 2 tests, and test (7.10) applied to the
sign and signed-rank tests. Tests have close to their nominal levels, except
for Hotelling’s test with the Cauchy distribution; furthermore, the agreement
is closer for sample size 40 than for sample size 20. Furthermore, the sign
test power is close to that of Hotelling’s test for Gaussian variables, and the
signed-rank test has attenuated power. Both nonparametric tests have good
power for the Cauchy distribution, although Hotelling’s test performs poorly,
and both perform better than Hotelling’s test for Laplace variables.
Some rare data sets simulated to create Tables 7.1 and 7.2 include some
for which Υ is estimated as singular. Care must be taken to avoid difficulties;
in such cases, p-values are set to 1.
larger than, that observed. In this way, the analysis of the previous subsection
for the sign test, and by extension the signed-rank test, can be extended to
general rank tests, including tests with data as scores.
\[ I = \{\mu \mid W(X - 1_n \otimes \mu) \le \chi^2_{J,\alpha}\}, \]
Example 7.4.1 Recall again the blood pressure data set of Exam-
ple 6.4.2. Figure 7.2 is generated by
library(MultNonParam); shiftcr(bp[,c("dpd","spd")])
and exhibits the 0.05 contour of p-values for the multivariate test con-
structed from sign rank tests for each of systolic and diastolic blood pres-
sure, and forms a 95% confidence region. Note the lack of convexity.
[Figure 7.2: 95% confidence region from inversion of the bivariate sign rank test; horizontal axis: Diastolic, vertical axis: Systolic.]
all have the same distribution for i > M1 . Let g be a vector indicating group
membership; gi = 1 if i ≤ M1 , and gi = 2 if i > M1 . As in §7.1, consider
testing and confidence interval questions.
X ∗ = {(X, g)|g has M1 entries that are 1 and M2 that are 2};
p-values are calculated by counting the number of such statistics with values
equal to or exceeding the observed value, and dividing the count by \(\binom{M_1+M_2}{M_1}\).
Other marginal statistics may be combined; for example, one might use
the Max t-statistic, defined by first calculating univariate t-statistics for each
manifest variable, and reporting the maximum. This statistic is inherently
one-sided, in that it presumes an alternative in which each marginal distri-
bution for the second group is systematically larger than that of the first
group. Alternatively, one might take the absolute value of the t-statistics be-
fore maximizing. One might do a parallel analysis with either the maximum of
Wilcoxon rank-sum statistics or the maximum of their absolute values, after
shifting to have a null expectation zero.
Finally, one might apply permutation testing to the statistic (7.3), calcu-
lated on ranks instead of data values, to make the statistic less sensitive to
extreme values.
Example 7.5.1 Consider the data on wheat yields, in metric tons per
hectare (Cox and Snell, 1981, Set 5), reflecting yields of six varieties of
wheat grown at ten different experimental stations, from
http://stat.rutgers.edu/home/kolassa/Data/set5.data .
Two of these varieties, Huntsman and Atou, are present at all ten sta-
tions, and so the analysis will include only these. Stations are included
from three geographical regions of England; compare those in the north
to those elsewhere. The standard Hotelling two-sample test may be per-
formed in R using
wheat<-as.data.frame(scan("set5.data",what=list(variety="",
y0=0,y1=0,y2=0,y3=0,y4=0,y5=0,y6=0,y7=0,y8=0,y9=0),
na.strings="-"))
# Observations are represented by columns rather than by
# rows. Swap this. New column names are in first column.
dimnames(wheat)[[1]]<-wheat[,1]
wheat<-as.data.frame(t(wheat[,-1]))
dimnames(wheat)[[1]]<-c("E1","E2","N3","N4","N5","N6","W7",
"E8","E9","N10")
wheat$region<-factor(c("North","Other")[1+(
substring(dimnames(wheat)[[1]],1,1)!="N")],
c("Other","North"))
attach(wheat)
plot(Huntsman,Atou,pch=(region=="North")+1,
main="Wheat Yields")
legend(6,5,legend=c("Other","North"),pch=1:2)
Data are plotted in Figure 7.3. The normal-theory p-value for testing
equality of the bivariate yield distributions in the two regions is given by
library(Hotelling)#for hotelling.test
print(hotelling.test(Huntsman+Atou~region))
The results of hotelling.test must be explicitly printed, because the
function codes the results as invisible, and so results won’t be printed
otherwise. The p-value is 0.0327. Comparing this to the univariate results
t.test(Huntsman~region);t.test(Atou~region)
gives two substantially smaller p-values; in this case, treatment as a multi-
variate distribution did not improve statistical power. On the other hand,
the normal quantile plot for Atou yields shows some lack of normality.
Outliers do not appear to be present in these data, but if they were,
they could be addressed by performing the analysis on ranks, either using
asymptotic normality:
association between variety yields across stations. If one wants only the
Hotelling statistic significance via permutation, one could use
print(hotelling.test(Huntsman+Atou~region,perm=T,
progBar=FALSE))
The argument progBar controls whether a progress bar is displayed, and an additional argument controls the number of random permutations.
[Figure 7.3: Wheat yields (metric tons per hectare) of the Huntsman and Atou varieties at the ten stations; the plotting symbol distinguishes the North region from the other regions.]
Example 7.5.2 Consider again the wheat yield data of Example 7.5.1.
Asymptotic nonparametric testing is performed using
library(ICSNP)#For rank.ctest and HotellingsT2.
rank.ctest(cbind(Huntsman,Atou)~region)
rank.ctest(cbind(Huntsman,Atou)~region,scores="sign")
detach(wheat)
to obtain a multivariate version of Mood’s median test.
Alternate syntax for rank.ctest consists of calling it with two argu-
ments corresponding to the two data matrices.
7.6 Exercises
1. The data set
http://lib.stat.cmu.edu/datasets/cloud
contain data from a cloud seeding experiment. The first fifteen lines
contain comment and label information; ignore these. The second
field contains the character S for a seeded trial, and U for unseeded.
a. The fourth and fifth fields represent rainfalls in two target areas. Test
8
Density Estimation
This chapter considers the task of estimating a density from a sample of independent and identically distributed observations. Previous chapters began with a review
of parametric techniques. Parametric techniques for density estimation might
involve estimating distribution parameters, and reporting the parametric den-
sity with estimates plugged in; this technique will not be further reviewed in
this volume.
8.1 Histograms
The most elementary approach to this problem is the histogram, which rep-
resents the density as a bar chart.
The bar chart represents the frequencies of the values of a categorical variable in its various categories as the heights of bars. When the categories are
ordered, one places the bars in the same order. To construct a histogram for
a continuous random variable, then, coarsen the continuous variable into a
categorical variable, whose categories are subsets of the range of the original
variable. Construct a bar plot for this categorical variable, again, with bars
ordered according to the order of the intervals.
Because the choice of intervals is somewhat arbitrary, the boundary be-
tween bars is deemphasized by making the bars butt up against their neigh-
bors. The most elementary version of the histogram has the height of the bar
as the number of observations in the interval. A more sophisticated analysis
makes the height of the bar represent the proportion of observations in the
bar, and a still more sophisticated representation makes the area of the bar
equal to the proportion of observations in the bar; this allows bars of unequal width while keeping the area under a portion of the curve approximately equal to the proportion of observations in that region. Unequally sized bars are unusual.
Construction of a histogram, then, requires selection of bar width and bar
starting point. One generally chooses the end points of intervals generating the
bars to be round numbers. A deeper question is the number of such intervals.
An early argument (Sturges, 1926) involved determining the largest number
so that if the data were in proportion to a binomial distribution, every interval
The quality of the histogram depends primarily on the width of the bin ∆n ;
the choice of tn is less important. Scott (1979) takes tn = 0, and, using Taylor
approximation techniques within the interval, shows that the integrated mean
squared error of the approximation is
    1/(n ∆n) + (∆n²/12) ∫ f′(x)² dx + O(1/n + ∆n³),    (8.2)

with the integral taken over the whole real line. The first term in (8.2) represents the variance of the estimator, and the second term represents the square of the bias. Minimizing the sum of the first two terms gives the optimal bin size ∆n* = [6/∫ f′(x)² dx]^(1/3) n^(−1/3). One might approximate ∫ f′(x)² dx by using its value for the Gaussian density with variance matching that of the data, to obtain ∆n* = 3.49 s n^(−1/3), for s the standard deviation of the sample.
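A minimal sketch of this rule in R follows, applied to the nail arsenic data of Example 2.3.2 (assuming the data frame arsenic is loaded as in that example); hist() accepts breakpoints rather than a width, so breaks are built from the Scott width.

# Scott's rule: Gaussian-reference bin width 3.49 * s * n^(-1/3).
x <- arsenic$nails
w <- 3.49 * sd(x) * length(x)^(-1/3)
breaks <- seq(min(x), max(x) + w, by = w)       # breakpoints covering the data range
hist(x, breaks = breaks, freq = FALSE, xlab = "Nail Arsenic",
     main = "Histogram with Scott bin width")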
Example 8.2.1 Refer again to the nail arsenic data from Example 2.3.2.
Figure 8.1 displays kernel density estimates with a default bandwidth and
for a variety of kernels. This figure was constructed using
cat('\n Density estimation \n')
attach(arsenic)
#Save the density object at the same time it is plotted.
plot(a<-density(nails),lty=1,
main="Density for Arsenic in Nails for Various Kernels")
lines(density(nails,kernel="epanechnikov"),lty=2)
lines(density(nails,kernel="triangular"),lty=3)
lines(density(nails,kernel="rectangular"),lty=4)
legend(1,2, lty=rep(1,4), legend=c("Normal","Quadratic",
"Triangular","Rectangular"))
Note that the rectangular kernel is excessively choppy, and fits the poorest.
Generally, any other density, symmetric and with a finite variance, may be
used. The parameter ∆n is called the bandwidth. This bandwidth should
depend on spread of data, and n; the spread of the data might be described
using the standard deviation or interquartile range, or, less reliably, the sample
range.
The choice of bandwidth balances effects of variance and bias of the den-
sity estimator, just as the choice of bin width did for the histogram. If the
bandwidth is too high, the density estimate will be too smooth, and hide fea-
tures of data. If the bandwidth is too low, the density estimate will provide
too much clutter to make understanding the distribution possible.
We might consider minimizing the mean squared error for x fixed, rather
than after integration. That is, choose ∆n to minimize the mean square error
of the estimate,
    MSE[f̂(x)] = Var[f̂(x)] + (E[f̂(x)] − f(x))².
[Figure 8.1: Density estimates for arsenic in nails, using normal, quadratic (Epanechnikov), triangular, and rectangular kernels; horizontal axis Nail Arsenic, vertical axis Density.]
    Var[f̂(x)] = (1 − ∆n f(x) − (∆n³/24) f″(x*)) (f(x) + (∆n²/24) f″(x*)) / (∆n n),

and using the convergence of ∆n to zero, one obtains

    Var[f̂(x)] ≈ C2(x)/(∆n n) for C2(x) = f(x).    (8.6)
Example 8.2.2 Refer again to the arsenic nail data of Examples 2.3.2
and 8.2.1. Kernel density estimates, using the suggested bandwidth and bandwidths substantially larger and smaller than the optimum, are given in Figure 8.2. The excessively small bandwidth follows separate
data points closely, but obscures the way they cluster together. The ex-
cessively large bandwidth wipes out all detail from the data set. Note the
large probability assigned negative arsenic concentrations by the estimate
with the excessively large bandwidth. This plot was drawn in R using
plot(density(nails,bw=a$bw/10),xlab="Arsenic",type="l",
main="Density estimate with inopportune bandwidths")
lines(density(nails,bw=a$bw*5),lty=2)
legend(1,2,legend=paste("Band width default",c("/10","*5")),
lty=c(1,2))
detach(arsenic)
and the object a is the kernel density estimate with the default bandwidth,
constructed in Example 8.2.1.
Silverman (1986, §3.3) discusses a more general kernel w(t). Equations (8.5)
and (8.6) hold, with
    C1(x) = f(x) ∫ w(t)² dt,    C2(x) = (1/2) f″(x) ∫ t² w(t) dt,
and (8.8) continues to hold. Sheather and Jones (1991) further discuss the
constants involved.
Epanechnikov (1969) demonstrates that (8.3) extends to multivariate dis-
tributions; estimate the multivariate density f (x) from independent and iden-
tically distributed observations X1 , . . . , Xn , where Xi = (Xi1 , . . . , Xid ), using
    f̂(x) = n⁻¹ (∏_{j=1}^d ∆nj)⁻¹ ∑_{i=1}^n w((Xi1 − x1)/∆n1, . . . , (Xid − xd)/∆nd).
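The display above allows a general d-variate kernel; the special case of a product of univariate kernels is easy to code directly. The following minimal sketch uses univariate Gaussian kernels (an assumption, not dictated by the text) and evaluates the estimate at a single point x from an n × d data matrix X and a vector of bandwidths.

# Product-kernel density estimate at the point x (length d), from the
# n x d data matrix X, with bandwidths bw = (Delta_n1, ..., Delta_nd).
prodkde <- function(x, X, bw) {
  w <- rep(1, nrow(X))
  for (j in seq_len(ncol(X)))
    w <- w * dnorm((X[, j] - x[j]) / bw[j])   # univariate Gaussian kernel in coordinate j
  sum(w) / (nrow(X) * prod(bw))
}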
[Figure 8.2: Density estimates for nail arsenic with inopportune bandwidths (default bandwidth divided by 10, and multiplied by 5); horizontal axis Nail Arsenic, vertical axis Density.]
8.3 Exercises
1. The data at
http://lib.stat.cmu.edu/datasets/CPS 85 Wages
reflects wages from 1985. The first 27 lines of this file represent an
explanation of variables; delete these lines first, or skip them when
you read the file. The first six fields are numeric. The sixth is hourly
wage; you can skip everything else. Fit a kernel density estimate to
the distribution of wage, and plot the result. Comment on what you
see.
2. The data at
http://lib.stat.cmu.edu/datasets/cloud
contain data from a cloud seeding experiment. The first fifteen lines
contain comment and label information; ignore these. The third field
indicates season, and the sixth field represents a rainfall measure
expected not to be impacted by the experimental intervention. On
the same set of axes, plot kernel density estimates of summer and
winter rain fall, and comment on the difference.
9
Regression Function Estimates
    Yj = g(Xj) + εj,    (9.1)
for

    X the n × K matrix with rows given by Xi⊤,    (9.4)

and Y the column vector with entries Yi. The estimator (9.3) is defined only if X is of full rank, or, in other words, if the inverse in (9.3) exists. Otherwise, the model is not identifiable, in that two different parameter vectors β and β* give the same fitted values, or Xβ = Xβ*.
When the errors εj have a Gaussian distribution, the vector β̂ has a multivariate Gaussian distribution, exactly, and

    (β̂i − βi)/√(s² vi) ∼ Tn−K,    (9.5)

for s² = ∑_{i=1}^n (Yi − β̂⊤Xi)²/(n − K), and vi the entry in row and column i of (X⊤X)⁻¹. When the errors εj are not Gaussian, similar results hold approximately under certain circumstances.
The weight function can be the same as was used for kernel density estimation.
This weight function is often a Gaussian density, or a uniform density centered
at 0. Fan (1992) discusses a local regression smoother

    ĝ(x) = ∑_{ℓ=0}^L β̂ℓ x^ℓ,    (9.7)

for L = 1 and

    β̂ = argmin ∑_{j=1}^n (Yj − ∑_{ℓ=0}^L βℓ Xj^ℓ)² w((x − Xj)/∆n),    (9.8)

and argues that this estimator has smaller bias than (9.6). Köhler et al. (2014) present considerations for bandwidth selection.
Example 9.2.1 Example 2.3.2 presents nail arsenic levels from a sam-
ple; arsenic levels in drinking water were also recorded, and one can in-
vestigate the dependence of nail arsenic on arsenic in water. The data
may be plotted in R using
attach(arsenic)
plot(water,nails,main="Nail and Water Arsenic Levels",
xlab="Water",ylab="Nails",
sub="Smoother fits. Bandwidth chosen by inspection.")
Kernel smoothing may be performed in R using the function ksmooth.
Bandwidth was chosen by hand.
lines(ksmooth(water,nails,"normal",bandwidth=0.10),lty=1)
lines(ksmooth(water,nails,"box",bandwidth=0.10),lty=2)
legend(0.0,2.0,lty=1:3, legend=c("Smoother, Normal Kernel",
"Smoother, Box Kernel", "Local Polynomial"))
The function locpoly from library KernSmooth applies a local regression
smoother:
library(KernSmooth)
lines(locpoly(water,nails,bandwidth=.05,degree=1),lty=3)
detach(arsenic)
Results are given in Figure 9.1. The box kernel performs poorly; a sample
this small necessarily undermines smoothness of the box kernel fit. Per-
haps the normal smoother is under-smoothed, but not by much. Library
KernSmooth contains a tool dpill for automatically selecting bandwidth.
The documentation for this function indicates that it sometimes fails,
and, in fact, dpill failed in this case. This local regression smoother ig-
nored the point with the largest values for each variable, giving the curve
a concave rather than convex shape.
These data might have been jointly modeled on the square-root scale,
to avoid issues relating to the distance of the point with the largest values
for each variable from the rest of the data. An exercise suggests exploring
this; in this case, the automatic bandwidth selector dpill returns a value.
Figure 9.2 demonstrates the results of selecting a bandwidth too small.
In this case, the Gaussian kernel gives results approximately constant in
the neighborhood of each data point, and the box kernel result is not de-
fined for portions of the domain, because both numerator and denominator
in (9.6) are zero.
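The square-root reanalysis suggested above is left as an exercise; a minimal sketch of the bandwidth-selection step, assuming the arsenic data frame is attached as in the example, might be:

library(KernSmooth)                       # for dpill and locpoly
h <- dpill(sqrt(water), sqrt(nails))      # plug-in bandwidth on the square-root scale
fit <- locpoly(sqrt(water), sqrt(nails), bandwidth = h, degree = 1)
plot(sqrt(water), sqrt(nails)); lines(fit)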
[Figure 9.1: Nail and water arsenic levels with smoother fits (normal kernel, box kernel, local polynomial); bandwidth chosen by inspection. Horizontal axis Water As, vertical axis Nail As.]
[Figure 9.2: Nail and water arsenic levels with normal and box kernel smoother fits; bandwidth chosen too small. Horizontal axis Water As, vertical axis Nail As.]
points k, and up-weights points near x and down-weights them away from
x. The weighting function is scaled to make the point in the neighborhood
farthest from x have a weight going down to zero.
Consider the case with an intercept and one regressor, so L = 2. The
estimate is now (9.7), for

    β̂ = argmin ∑_{j∈N(x)} (Yj − ∑_{ℓ=0}^L βℓ Xj^ℓ)² w((x − Xj)/∆n(x)),    (9.9)
Example 9.2.2 Return again to the arsenic data previously examined in Example 9.2.1. Again, plot the data in R using
attach(arsenic)
plot(water,nails,main="Nail and Water Arsenic Levels",
xlab="Water",ylab="Nails", sub="Loess fits")
Loess smoothing may be performed in R using the function loess. Band-
width was chosen by hand. Unlike in the case of ksmooth, loess does
not provide output that can be plotted directly. A set of points at which to
calculate the smoother must be specified; this is stored in the variable x
below. Instead of specifying a bandwidth for the loess procedure, one spec-
ifies the number of observations contributing to each neighborhood from
which the fit is calculated. In R, this is specified as the proportion of the
total sample, via the input parameter span. Hence the second call below
uses the entire data set.
x<-min(water)+(0:50)*diff(range(water))/50
lines(x,y=predict(loess(nails~water),x))
lines(x,y=predict(loess(nails~water,span=1),x),lty=2)
lines(x,y=predict(loess(nails~water,span=10),x),lty=3)
legend(0.0,2.0,legend=c("Default span .75",
"Span 1","Span 10"),lty=1:3)
detach(arsenic)
Results are given in Figure 9.3. Note that the solution using the default
proportion of the data 0.75 appears to under-smooth the data, and that
raising this parameter above 1 misses detail of the shape of the relation-
ship.
[Figure 9.3: Nail and water arsenic levels with loess fits (default span 0.75, span 1, span 10). Horizontal axis Water As, vertical axis Nail As.]
Cleveland and Devlin (1988) extend this loess technique to higher dimensions.
For example, one may define the partial ordering (x1, . . . , xK) ≼ (y1, . . . , yK) if xi ≤ yi for all i; that is, consider one vector as less than or equal to the other if and only if each component of the first vector does not exceed the same component of the second vector. Such an ordering lacks the property that any two elements of the set may be compared, and so is called a partial ordering. For example, neither (2, 1) ≼ (1, 2) nor (1, 2) ≼ (2, 1). In the case of a single regressor this complication does not arise, since for any two distinct real numbers, one can be determined to be the smaller and the other the larger.
Brunk (1955) introduces the notion of model fitting that respects the partial ordering, requiring g(x) ≤ g(y) if x ≼ y. Such techniques are called isotonic regression. Dykstra (1981) reviews an algorithm for fitting such a model, called the pooled adjacent violators algorithm, and produces theoretical justification for this algorithm. Best and Chakravarti (1990) review more general algorithmic considerations.
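For a single regressor, base R provides the function isoreg, which fits a monotone nondecreasing step function by pooled adjacent violators; a minimal sketch on the arsenic data of Example 9.2.1 (assuming that data frame is loaded) is:

# Isotonic (monotone nondecreasing) fit of nail arsenic on water arsenic.
iso <- isoreg(arsenic$water, arsenic$nails)
plot(iso)   # plots the data points together with the step-function fit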
9.4 Splines
A spline is a smooth curve approximating the relationship between an ex-
planatory variable x and a response variable y, based on observed pairs of
points (X1 , Y1 ), . . . , (Xn , Yn ), constructed according to the following method,
in order to describe the dependence of y on x between two points x0 and xN .
One first picks N − 1 intermediate points x1 < x2 < · · · < xN −2 < xN −1 .
The intermediate points are called knots. One then determines a polynomial
of degree M between xj−1 and xj , constrained so that the derivatives of order
up to M − 1 match up at knots. Denote the fitted mean by ĝ(x).
Taken to an extreme, if all Xj are unique, then one can fit all n points with
a polynomial of degree n − 1; this, however, will yield a fit with unrealistically
[Figure: Isotonic regression fit of nail arsenic on water arsenic. Horizontal axis Water As, vertical axis Nail As.]
Example 9.4.1 Again revisit the arsenic data of Example 9.2.1. The
spline is fit using R as
attach(arsenic)
plot(water,nails, main="Nail Arsenic and Water Arsenic")
hgrid<-min(water)+(0:50)*diff(range(water))/50
lines(predict(spl<-smooth.spline(water,nails),hgrid))
lines(predict(smooth.spline(water,nails,spar=.1),hgrid),
lty=3)
lines(predict(smooth.spline(water,nails,spar=.5),hgrid),
lty=2)
legend(1250,1110,lty=1:3,col=1:3,
legend=paste("Smoothing",c(round(spl$spar,3),.5,.1),
c("Default","","")))
detach(arsenic)
Here hgrid is a set of points at which to calculate the spline fit. As with
most of the smoothing methods, R contains a lines method that will plot
[Figure: Nail arsenic and water arsenic with smoothing spline fits (smoothing parameter 0.934 default, 0.5, and 0.1). Horizontal axis Water As, vertical axis Nail As.]
When comparing (9.11) with the standard least-squares criterion (9.3), note the absence of squaring in (9.11).
Equivalently, one might minimize

    ∑_j (e_j⁺ + e_j⁻),    (9.12)
for

    e_j⁺ ≥ 0,  e_j⁻ ≥ 0,  e_j⁺ − e_j⁻ = Yj − β1 − β2 Xj  for all j.    (9.13)

The objective function is not differentiable; the minimum of (9.12) is attained either at a single point or along a flat region, and in the latter case the optimizer is not unique.
[Figure: Sum of absolute residual values as a function of the regression (slope) parameter, with the intercept fixed at its optimum.]
TABLE 9.1: Achieved coverage and average interval length for 90% confidence
intervals for L1 and L2 regression, with various error distributions
...variables with the smaller sample size, and longer for Cauchy variables and for
Laplace and exponential variables with the larger sample size.
One can adapt this technique to fit quantiles of the errors other than the
median. One can replace the objective (9.12) by
    γ ∑_j e_j⁺ + (1 − γ) ∑_j e_j⁻,    (9.14)
still subject to constraints (9.13). This change causes the regression line to
run through the 1 − γ quantile of Y |X.
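In the R function rq used in the next example, the quantile being fit is selected by the tau argument (tau = 0.5, the default, gives median regression); a minimal sketch of a fit through a lower quantile of the conditional distribution, using the arsenic data of the next example, is:

library(quantreg)
# tau selects the quantile of the conditional distribution being fit.
rq(nails ~ water, tau = 0.2, data = arsenic)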
Example 9.5.1 Consider again the arsenic data of Example 9.2.1. The
quantile regression fitter is found in R in library quantreg. The following
commands fit this model:
library(quantreg)#Need for rq
rq(arsenic$nails~arsenic$water)
to give the results
Call: rq(formula = nails ~ water)
Coefficients:
coefficients lower bd upper bd
(Intercept) 0.11420 0.10254 0.14093
water 15.60440 3.63161 18.07802
Hence the best estimate of the slope of the linear relationship of ar-
senic in nails to arsenic in water is 15.604, with 90% confidence interval
(3.632,18.078). This confidence level is the default for rq. Figure 9.6 is
drawn using
attach(arsenic)
rqo<-summary(rq(nails~water))$coef
regp<-rqo[2,2]+diff(rqo[2,2:3])*(0:100)/100
m<-apply(abs(outer(nails,rep(1,length(regp)))
- outer(water,regp)-rqo[1,1]),2,sum)
plot(regp,m,type="l",xlab="Regression Parameter",
ylab="Sum of absolute values of residuals",
main="L1 fit for Arsenic Large Scale",
sub="Intercept fixed at optimum")
detach(arsenic)
and demonstrates the estimation of the slope; in this figure the intercept
is held at its optimal value.
Compare this to the Theil-Sen estimator:
attach(arsenic)
library(deming); theilsen(nails~water,conf=.9)$coef
yielding
(Intercept) water
0.1167158 14.2156504
If the linear model fits, and if the variance of the errors does not depend on
the explanatory variable, then the lines representing various quantiles of the
distribution of the response conditional on the explanatory variable will be
parallel, but this parallelism need not hold for the estimates in any particular
data set, as can be seen in the next example.
Example 9.5.2 Consider the blood pressure data set of Example 6.4.2.
Quantile regression can be used to model the median systolic blood pres-
sure after treatment in terms of systolic blood pressure before treatment.
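The calls fitting the two quantile regressions plotted below do not appear at this point in the text; a plausible reconstruction, assuming the data frame bp with columns spa (after treatment) and spb (before treatment) used in §9.6, is:

library(quantreg)
rqout <- rq(spa ~ spb, data = bp)               # median (0.5 quantile) fit
rqoutt <- rq(spa ~ spb, tau = 0.2, data = bp)   # 0.2 quantile fit
plot(bp$spb, bp$spa,
  xlab = "Systolic Blood Pressure Before Treatment (mm HG)",
  ylab = "Systolic Blood Pressure After Treatment (mm HG)")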
abline(rqout);abline(rqoutt,lty=2)
legend(150,200,legend=c("Median Regression",
".2 Quantile Regression"),lty=1:2)
Results are plotted in Figure 9.7. Note the extreme lack of parallelism.
[Figure 9.7: Systolic blood pressure after treatment versus before treatment, with median regression and 0.2 quantile regression fits.]
This extends the location estimator (2.6). Least-squares estimates are given by ρ(z) = z²/2, and the estimator of §9.5 is given by ρ(z) = |z|. These were the same penalty functions used in §2.3.1 to give the sample mean and median respectively. In both of these cases, the regression parameters minimizing the penalty function did not depend on σ.
In parallel with that analysis, one might form an intermediate technique between ordinary and quantile regression by solving (9.15) for an alternative choice of ρ. One might use (2.7); in such a case, one begins with an
library(MASS); rlmout<-rlm(spa~spb,data=bp)
library(quantreg); rqout<-rq(spa~spb,data=bp)
plot(bp$spb,bp$spa,
main="Blood Pressure After and Before Treatment",
xlab="Systolic Blood Pressure Before Treatment (mm HG)",
ylab="Systolic Blood Pressure After Treatment (mm HG)",
sub="Linear Fits to Original Data")
abline(rlmout,lty=2)
abline(rqout,lty=3)
abline(lm(spa~spb,data=bp),lty=4)
legend(min(bp$spb),max(bp$spa),
legend=c("Huber","Median","OLS"),lty=2:4)
Compare this with results for a contaminated data set. Change the re-
sponse variable for the first observation to an outlier, and refit.
bp$spa[1]<-120
rqout<-rq(bp$spa~bp$spb);rlmout<-rlm(bp$spa~bp$spb)
plot(bp$spb,bp$spa,
xlab="Systolic Blood Pressure Before Treatment (mm HG)",
ylab="Systolic Blood Pressure After Treatment (mm HG)",
sub="Linear Fits to Perturbed Data")
abline(rlmout,lty=2);abline(rqout,lty=3);
abline(lm(spa~spb,data=bp),lty=4)
legend(min(bp$spb),max(bp$spa),
legend=c("Huber","Median","OLS"),lty=2:4)
These fits are shown in Figure 9.8. The median regression line is slightly
different in the two plots, although if one does not look closely one might
miss this. The least squares fit is radically different as a result of the shift
in the data point. The Huber fit is moved noticeably, but not radically.
FIGURE 9.8: Blood Pressure After and Before Treatment, Original and Per-
turbed Data
9.7 Exercises
1. The data at
http://lib.stat.cmu.edu/datasets/CPS 85 Wages
reflects wages from 1985. The first 42 lines of this file contain a
description of the data set, and an explanation of variables; delete
these lines first, or skip them when you read the file. The first six
fields are numeric. The first is educational level, and the sixth is
hourly wage; you can skip everything else.
a. Fit a relationship of wage on educational level, in such a way as
to minimize the sum of absolute value of residuals. Plot this, and
compare with the regular least squares regression.
b. Fit a relationship of wage on educational level, in such a way as
to enforce monotonicity. Superimpose a plot of this relationship on
a plot of the data.
c. Fit a kernel smoother to estimate the dependence of wage on ed-
ucational level, and compare it to the results for the loess smoother.
d. Fit a smoothing spline to estimate the dependence of wage on ed-
ucational level, and compare it to the results for the loess smoother
and the kernel smoother.
2. Repeat the analysis of Example 9.2.1 for arsenic levels on the square
root scale. As an intermediate step, calculate the optimal bandwidth
for the local linear smoother. Plot your results. Compare with the
results in Figure 9.1.
10
Resampling Techniques
tions about the parametric shape of its distribution. Let G(θ̂; F ) represent
the desired, but unobservable, distribution of θ̂ computed from independent
random vectors Z1 , . . . , Zn , with each random vector Zi having a distribution
function F . Assume that this distribution depends on F only via the param-
eter of interest, so that one might write G(θ̂; F ) = H(θ̂; θ) for some function
H.
Consider constructing confidence intervals using the argument of §1.2.2.1.
Using (1.18) with T = θ̂, a 1 − α confidence interval for θ satisfies H(θ̂; θL ) =
1 − α/2 and H(θ̂; θU ) = α/2. The function H, as a function of θ̂, is unknown,
and will be estimated from the data. Since observed data all arise from a
distribution governed by a single value of θ, the dependence of H(θ̂; θ) on
θ cannot be estimated. An assumption is necessary in order to produce a
confidence interval. Assume that
H is to θ as H ∗ is to θ̂. (10.2)
Then the standard deviation estimate ς̂ is the sample standard deviation of the bootstrap samples (10.5) (Efron, 1981), and a 1 − α confidence interval is θ̂ ± ς̂ z1−α/2.
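A minimal sketch of this normal-approximation interval, taking the median of the nail arsenic data as the statistic (and assuming the arsenic data frame of Example 2.3.2 is loaded), is:

x <- arsenic$nails
boot.med <- replicate(9999, median(sample(x, replace = TRUE)))  # bootstrap samples
median(x) + c(-1, 1) * qnorm(0.975) * sd(boot.med)              # 95% interval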
(θ̂ − vU ≤ θ ≤ θ̂ − vL ),
Then

    P*[uL − θ̂ ≤ θB,i − θ̂ ≤ uU − θ̂] = 1 − α,    (10.7)
This is the basic bootstrap confidence interval (Davison and Hinkley, 1997, p.
29).
(Efron, 1981).
This method is referred to as the percentile method (Efron, 1981), or as the
usual method (Shao and Tu, 1995, p. 132f). If the parameter θ is transformed
to a new parameter ϑ, using a monotonic transformation, then the bootstrap
samples transform in the same way, and so the percentile method holds if
there exists a transformation that can transform to symmetry, regardless of
whether one knows and can apply this transformation.
the estimate of the density for the nail arsenic values was plotted in Fig-
ure 8.1. This distribution is markedly asymmetric, and so the percentile
bootstrap is not reliable; use the residual bootstrap.
Example 10.2.2 Return again to the nail arsenic values of the previous
example. We again generate a confidence interval for the median. The
BCa method, and the previous two methods, may be obtained using the
package boot.
library(boot)#gives boot and boot.ci.
#Define the function to be applied to data sets to get
#the parameter to be bootstrapped.
boot.ci(boot(arsenic$nails,function(x,index)
return(median(x[index])),9999))
to give
Level Normal Basic
95% ( 0.0158, 0.2733 ) ( 0.0400, 0.2310 )
library(MultNonParam); exactquantileci(arsenic$nails)
to obtain the interval (0.118, 0.354). Efron (1981) notes that this exact
interval will generally agree closely with the percentile bootstrap approach.
One can use the bootstrap to generate intervals for more complicated statis-
tics. The bootstrap techniques described so far, except for the percentile
method, presume that parameter values over the entire real line are possi-
ble. One can account for this through transformation.
shows a highly asymmetric distribution, and the BCa correction for asym-
metry is strong. (As noted before, the actual bootstrap distribution is
supported on a large but finite number of values, and is hence discrete
and does not have a density; the plot is heuristic only.) The output from
boot.ci contains some information not generally revealed using its de-
fault printing method. In particular, sdoutput$bca is a vector with five
numeric components. The first of these is the confidence level. The fourth
and fifth are the resulting confidence interval end points. The second and
third give quantiles resulting from (10.9). The upper quantile is very close
to the maximum value of 9999; boot.ci gives a warning, and serious
boot.ci(boot(arsenic$nails,logscale,99999))$bca
gives the BCa 0.95 interval (-1.647,-0.096) for the log of standard devi-
ation of nail arsenic.
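The statistic function logscale used above is not reproduced at this point in the text; a plausible definition, returning the log of the standard deviation of the resampled values in the form that boot expects, is:

logscale <- function(x, index) log(sd(x[index]))   # log standard deviation of the resample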
FIGURE 10.1: Bootstrap Samples for Log of Nail Arsenic Standard Deviation
TABLE 10.1: Observed coverage for nominal 0.90 Bootstrap intervals, 10 ob-
servations
Pearson correlation between first and second twin brain volumes. First,
define the correlation function
rho<-function(d,idx) return(cor(d[idx,1],d[idx,2]))
and calculate the bootstrap samples
bootout<-boot(brainpairs[,c("v1","v2")],rho,9999)
for pairs of data. Figure 10.2 presents a histogram of the sample, with
vertical lines marking the estimate and percentile confidence interval end
points:
hist(bootout$t,freq=FALSE)
legend(-.4,6,lty=1:2,
legend=c("Estimate","Confidence Bounds"))
abline(v=bootout$t0)
abline(v=sort(bootout$t)[(bootout$R+1)*c(.025,.975)],lty=2)
Confidence intervals may be calculated using
boot.ci(bootout)
to give
Level Normal Basic
95% ( 0.7463, 1.1092 ) ( 0.8516, 1.1571 )
confidence intervals for exp(θ) are not the exponentiation of Studentized boot-
strap confidence intervals for θ, and, furthermore, the quality of the Studen-
tized bootstrap confidence intervals depends on the effectiveness of Studenti-
zation.
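The statistic function regparam used in the next call is not reproduced at this point; a plausible random-X version, patterned on the fixed-X version of Example 10.3.3 and returning the slope estimate together with its squared standard error (as the Studentized interval in boot.ci requires), is:

regparam <- function(data, indices) {
  fit <- summary(lm(data[indices, 2] ~ data[indices, 1]))
  # slope estimate, followed by its squared standard error
  return(fit$coefficients[2, 1:2]^(1:2))
}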
boot.ci(boot(brainpairs[,c("v1","v2")],regparam,9999))
to obtain
Level Normal Basic Studentized
95% (0.2836, 1.1388) ( 0.1746, 0.9239) (0.4733, 1.0067)
refinements and a second example follow that produce a technique more closely
following the regularity conditions for the bootstrap.
Example 10.3.3 Refer again to the brain size data of Example 5.2.1.
In order to apply the fixed-X bootstrap, first calculate fitted values and
residuals:
residuals<-resid(lm(brainpairs$v2~brainpairs$v1))
fittedv<-brainpairs$v2-residuals
If a function needs a variable that is not locally defined, R looks for it to
be defined globally. Using the function boot inside another function, and
manipulating the data on a level above the bootstrap function but below
the command line, may cause the code to fail, or give something other
than what was intended. In the example below, fittedv and residuals
are passed explicitly to boot, after the data, indices, and bootstrap sample
size, and must be referred to by name.
regparam<-function(data,indices,fittedv,residuals){
y<-fittedv+residuals[indices]
return(summary(lm(y~data[,1]))$coefficients[2,1:2]^(1:2))
}
library(boot)
bootout<-boot(brainpairs[,c("v1","v2")],regparam,9999,
fittedv=fittedv,residuals=residuals)
Again, the Studentized interval is appropriate:
boot.ci(bootout,type="stud")
Example 10.3.4 Refer again to the brain size data of Example 5.2.1.
Modify the approach of Example 10.3.3:
The above example demonstrates how to adjust for unequal variances for
residuals; adjusting for a lack of independence is more difficult.
One might bootstrap the location difference between two data sets. This
two-sample location model is a nonparametric version of the classical two-
sample t confidence interval setting. It may be mimicked using a linear models
approach (9.1) and (9.2) by using a single covariate vector, taking the value
0 for one group and 1 for the other group, to give the pooled version of t
intervals; the slope parameter is then the difference in locations for the two
groups, and fitted values are group means. In this case, the fixed-X approach
compares the mean difference estimate to other samples with the same number
of observations in the first group as is observed in the data set, with a similar
statement about the second group, and the random-X approach fails to do
this. In this respect, the fixed-X approach is more intuitive.
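A minimal sketch of this two-sample fixed-X bootstrap, patterned on the multi-group function of Example 10.3.5 below and assuming a response vector y and a 0/1 group indicator grp (both hypothetical names), is:

library(boot)
meandiff2 <- function(data, indices) {
  fitv <- fitted(lm(data[, 1] ~ data[, 2]))   # fitted values are the group means
  residv <- data[, 1] - fitv
  ystar <- fitv + residv[indices]             # resample residuals, keep the design fixed
  coef(lm(ystar ~ data[, 2]))[2]              # slope = difference in group locations
}
# boot.ci(boot(cbind(y, grp), meandiff2, 9999))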
The same approach may be applied to explore differences among a larger
number of groups. One might use the bootstrap to give confidence bounds for
a wide variety of possible ANOVA summaries.
Example 10.3.5 Refer again to the chicken weight gain data of Ex-
ample 5.4.2. Use fixed-X bootstrap techniques to bound the largest group
mean minus the smallest group mean, when data are split by protein level.
meandiff<-function(data,indices){
fitv<-fitted(lm(data[,1]~as.factor(data[,2])))
residv<-data[,1]-fitv
y<-fitv+residv[indices]
return(diff(range(unlist(lapply(split(y,data[,2]),
"mean")))))
}
attach(chicken)
cat("Bootstrap CI for maximal difference in means\n")
boot.ci(boot(cbind(weight,lev),meandiff,9999))
Because the design is balanced, with an equal number of chickens at each
level of diet, no adjustment for different standard deviations of the resid-
uals is needed. Because the fixed-X bootstrap is employed, Studentization
of the resulting mean differences is not necessary. The resulting BCa
confidence interval is (85.6, 769.5).
A similar approach might have been employed for inference about quantities
like the R2 .
and

Then (10.10) holds, and (10.11) holds if the skewness of Wn times √n converges to 0.

    T̄*_{n−1} = ∑_{i=1}^n ∑_{j=1, j≠i}^n Xj / (n(n − 1)) = X̄,
When the sample size n is even, then Tn = (X(n/2) + X(n/2+1))/2, and

    T*_{n−1,i} = X(n/2+1) if i ≤ n/2,   T*_{n−1,i} = X(n/2) if i ≥ n/2 + 1.

Then T̄*_{n−1} = (X(n/2) + X(n/2+1))/2 = Tn, and the bias estimate is always 0.
For n odd, then Tn = X((n+1)/2), and

    T*_{n−1,i} = (X((n−1)/2) + X((n+1)/2))/2 if i > (n + 1)/2,
    T*_{n−1,i} = (X((n+3)/2) + X((n+1)/2))/2 if i < (n + 1)/2,
    T*_{n−1,i} = (X((n+3)/2) + X((n−1)/2))/2 if i = (n + 1)/2.

Hence

    T̄*_{n−1} − Tn = ((n+1)/(4n)) X((n−1)/2) + ((n+1)/(4n)) X((n+3)/2) + ((n−1)/(2n)) X((n+1)/2) − X((n+1)/2)
                  = ((n+1)/(4n)) (X((n−1)/2) + X((n+3)/2) − 2X((n+1)/2)),    (10.12)

and the bias estimate is

    B = (n − 1)(T̄*_{n−1} − Tn) = ((n² − 1)/(4n)) (X((n−1)/2) + X((n+3)/2) − 2X((n+1)/2)).
Example 10.4.1 Consider again the nail arsenic data of Example 2.3.2.
Calculate the Jackknife estimate of bias for the mean, median, and
trimmed mean for these data. Jackknifing is done using the bootstrap
library, using the function jackknife.
library(bootstrap)#gives jackknife
jackknife(arsenic$nails,median)
This function produces the 21 values, each with one observation omitted:
$jack.values
[1] 0.2220 0.2220 0.2220 0.2220 0.1665 0.1665 0.2220 0.2220
[9] 0.1665 0.2220 0.2220 0.1665 0.1665 0.1665 0.1665 0.1665
[17] 0.1665 0.2220 0.1665 0.2220 0.2135
omission of the middle value, is the average of X(9) and X(11) . The mean
of the jackknife observations is 0.1952. The sample median is 0.1750, and
the bias adjustment is 20 × (0.1952 − 0.1750) = 20 × 0.0202 = 0.404, as
is given by R:
$jack.bias
[1] 0.4033333
This bias estimate for the median seems remarkably large. From (10.12),
the jackknife bias estimate for the median is governed by the difference
between the middle value and the average of its neighbors. This data set
features an unusually large gap between the middle observation and the
one above it.
Applying the jackknife to the mean via
jackknife(arsenic$nails,mean)
gives the bias correction 0:
$jack.bias
[1] 0
The average effects of such corrections may be investigated via simulation. Ta-
ble 10.2 contains the results of a simulation based on 100,000 random samples
for data sets of size 11. In this exponential case, the jackknife bias correction
over-corrects the median, but appears to address the trimmed mean exactly.
Under some more restrictive conditions, one can also use this idea to esti-
mate the variance of T .
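The usual form of that estimate, not reproduced in the text above, multiplies the sum of squared deviations of the leave-one-out values about their mean by (n − 1)/n; a minimal sketch, alongside the packaged alternative in library(bootstrap) used in Example 10.4.1, is:

# Jackknife variance estimate for a statistic theta applied to a sample x.
jackvar <- function(x, theta) {
  n <- length(x)
  loo <- sapply(seq_len(n), function(i) theta(x[-i]))  # leave-one-out values
  (n - 1) / n * sum((loo - mean(loo))^2)
}
# Example: jackvar(arsenic$nails, median)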
10.5 Exercises
1. The data set
http://ftp.uni-bayreuth.de/math/statlib/datasets/lupus
http://ftp.uni-bayreuth.de/math/statlib/datasets/federalistpapers.txt
stat.rutgers.edu/home/kolassa/Data/federalistpapers.txt ).
A
Analysis Using the SAS System

Before using the macros described below, add the line

%include "/folders/myfolders/common.sas";

to your SAS program. Adjust the local folder
/folders/myfolders/ to reflect your local configuration. Furthermore, the
examples use data sets that need to be read into SAS before analysis; reading
the data set is given below the first time the data set is used.
Example 2.3.2: Perform the sign test to evaluate the null hypothesis that
the population median for nail arsenic levels is .26. This is done using proc
univariate:
/**************************************************************/
/* Data from http://lib.stat.cmu.edu/datasets/Arsenic */
/* reformatted into ASCII on the course home page. Data re- */
/* flect arsenic levels in toenail clippings; covariates in- */
/* clude age, sex (1=M), categorical measures of quantities */
/* used for drinking and cooking, arsenic in the water, and */
/* arsenic in the nails. To make arsenic.dat from Arsenic, do*/
/*antiword Arsenic|awk ’((NR>39)&&(NR<61)){print}’>arsenic.dat*/
/* Potential threshold for ill health effects for toenails is */
/* .26 http://www.efsa.europa.eu/de/scdocs/doc/1351.pdf */
/**************************************************************/
data arsenic; infile ’/folders/myfolders/arsenic.dat’;
input age sex drink cook water nails; run;
proc univariate data=arsenic mu0=.26 ciquantdf(alpha=.05);
var nails; run;
Material between /* and */ is ignored by SAS, and is presented above so that information describing the data set may be included with the analysis code. The option mu0=.26 specifies the null hypothesis, and
Example 2.3.3: The proc univariate call above gives intervals for a variety
of quantiles, including quartiles, but not for arbitrary user-selected quantiles.
The following code gives arbitrary quantiles. Manually edit the code below to
replace 21 with the actual number of observations, 0.025 with half the desired
complement of the confidence, and .75 with the desired quantile.
proc sort data=arsenic; by nails; run;
data ci; set arsenic;
a=quantile("binomial",.025,.75,21);
b=21+1-quantile("binomial",.025,1-.75,21);
if _N_=a then output; if _N_=b then output; run;
title ’Upper quartile .95 CIs for nail arsenic’;
proc print data=ci; run;
Example 3.4.1: This example applied the Wilcoxon rank sum test to investi-
gating differences between the strengths of yarn types. At present, analysis is
restricted to bobbin 3, since this data set did not have ties. These score tests
are calculated using proc npar1way , with the option wilcoxon. The first has
continuity correction turned off, with the option correct=no. Correction is on
by default, and so no such option appears in the second run, with correction.
Exact values are given using the exact statement, followed by the test whose
exact p-values are to be calculated.
title ’Wilcoxon Rank Sum Test, yarn strength by type, bobbin 3’;
title2 ’Approximate values, no continuity correction’;
proc npar1way data=yarn1 wilcoxon correct=no ;
class type; var strength; run;
title2 ’Approximate values, continuity correction’;
proc npar1way data=yarn1 wilcoxon ; class type;
var strength; run;
title2 ’Exact values’;
proc npar1way data=yarn1 wilcoxon ; class type;
exact wilcoxon; var strength; run;
Example 3.4.3: Permutation testing may be done using proc npar1way and
scores=data:
title ’Permutation test for Arsenic Data’;
proc npar1way data=arsenic scores=data ;
class sex; exact scores=data; var nails; run;
Tables 3.4 and 3.5: Test sizes and powers may be calculated using the macro
test2 in the file common.sas, as above. First and second arguments are the
group sizes. The third argument is the Monte Carlo sample size. The last
argument is the offset between groups.
title ’Monte Carlo Assessment of Test Sizes’;
%include "/folders/myfolders/common.sas";
Example 3.9.1: The npar1way procedure also performs the Siegel-Tukey and
Ansari-Bradley tests, using the keywords ab and st in the statement.
proc npar1way data=yarn ab st ; class type; var strength; run;
Examples 4.3.1 and 4.4.1: The Kruskal-Wallis test, and the variants using
the Savage and the van der Waerden scores, can be performed using proc
npar1way.
data maize; infile ’/folders/myfolders/T58.1’ ;
input exno tabno lineno loc $ block $ plot
treat $ ears weight;
nitrogen=substr(treat,1,1); if weight<0 then weight=.; run;
data tean; set maize; if loc="TEAN" then; else delete;
if weight=. then delete; run;
title ’Kruskal Wallis H Test for Maize Data’;
proc npar1way wilcoxon savage vw data=tean plots=none;
class treat; var weight; /* exact;*/ run;
The option exact is commented out above; depending on the configura-
tion of the example, exact computation can be very slow, and resource-
intensive enough to make the calculations fail. The option wilcoxon triggers
the Kruskal-Wallis test, since it uses Wilcoxon scores. These calculations may
be compared with the standard Analysis of Variance approach.
proc anova data=tean; class treat; model weight=treat; run;
Example 4.4.2: Here are two calls to npar1way that give in principle the same
answer. The exact command causes exact p-values to be computed. If you
specify scores (in this case data) in the exact statement, you’ll get that exact
p-value. If you don’t specify scores, you’ll get the exact p-value for the scores
specified in the proc npar1way statement. If you don’t specify either, you’ll
get ranks without scoring. As we saw before, exact computations are hard
enough that SAS quits. The second proc npar1way presents a compromise:
random samples from the exact distribution. Give the sample size you want
after /mc . The two calls give different answers, since one is approximate.
Successive calls will give slightly different answers.
Example 4.7.5: Here we test for an ordered effect of nitrogen, and compare
with the unordered version.
title ’Jonckheere-Terpstra Test for Maize’;
proc freq data=tean noprint; tables weight*nitrogen/jt;
output out=jttab jt; run;
proc print data=jttab noobs; run;
title ’K-W Test for Maize, to compare with JT’;
proc npar1way data=tean plots=none; class nitrogen;
var weight; run;
Examples 5.4.1 and 5.6.1: Perform a paired test on brain volume, assuming
symmetry. First, read the data:
data expensesd; infile ’/folders/myfolders/friedman.dat’;
input cat $ g1 g2 g3 g4 g5 g6 g7; run;
Convert to a set with one observation per line, with level l as well:
Example 8.2.1: SAS does density estimation via proc univariate. Only graph-
Example 9.3.1: Various sources suggest that isotonic regression can be done
with the macro at
http://www.bios.unc.edu/distrib/redmon/SASv8/redmon2.sas .
Example 9.5.2: Fit the median and 0.2 quantile regression estimates:
Example 10.4.1: Again we’ll use the SAS input file as above. This works sim-
ilarly to the R jackknife. You need to tell the macro what statistic to use. In
R this is via a function, but in SAS this is via a macro.
%macro analyze(data=,out=);
proc means noprint median data=&data vardef=n;
output out=&out(drop=_freq_ _type_) median=med ;
var nails ;
%bystmt;
run;
%mend;
%jack(data=arsenic,chart=0)
B
Construction of Heuristic Tables and
Figures Using R
Some of the preceding tables and figures were presented, not as tools in the
analysis of specific data sets, but as tools for comparing and describing various
statistical methods. Calculations producing these tables and figures in R are
given below. Some other of the preceding tables and figures were specific to
certain data sets, but are of primarily heuristic value and are not required
for the data set to which they apply. Commands to produce these tables
and figures are also given below. Before running these commands, load two
R packages, MultNonParam from CRAN, and NonparametricHeuristic from
github:
install.packages(c("MultNonParam","devtools"))
library(devtools)
install_github("kolassa-dev/NonparametricHeuristic")
library(MultNonParam); library(NonparametricHeuristic)
Figure 1.1:
fun.comparedensityplot()
Table 2.1:
Figure 2.1:
fun.studentizedcaucyplot(10,10000)
Table 2.2:
fun.achievable()
Figure 2.2:
drawccplot()
Table 2.3:
mypower<-array(NA,c(4,3,3))
nobsv<-c(10,17,40)
for(jj in seq(length(nobsv))){
temp<-fun.comparepower(samp1sz=nobsv[jj], nsamp=100000,
dist=list("rnorm","rcauchy","rlaplace"),
hypoth=(1.96+.85)*c(1,sqrt(2),1)/sqrt(nobsv[jj]))
if(jj==1) dimnames(mypower)<-list(dimnames(temp)[[1]],
dimnames(temp)[[2]],as.character(nobsv))
mypower[,,jj]<-temp[,,1,1,1]
}
cat("\nPower for T and Sign Tests \n")
print(mypower)
Table 2.4:
library(VGAM); testare(.5)
Table 3.4:
fun.comparepower(samp1sz=10,samp2sz=10,altvalue=0)
Table 3.5:
fun.comparepower(samp1sz=10,samp2sz=10,altvalue=1)
Figure 4.3:
powerplot()
Figure 5.1:
hodgeslehmannexample()
Figure 6.2:
x<-(-10):10; y<-x ; z<-5*atan(x)
plot(range(x),range(c(y,z)),type="n")
points(x,y,pch=1); points(x,z,pch=2)
legend(0,-5,pch=1:2,legend=c("y=x","y=5 atan(x)"))
Figure 7.1:
y<-x<-rnorm(100)
coin<-rbinom(100,1,.5)
y[coin==0]<--x[coin==0]
par(oma=c(0,0,3,0))
par(mfrow=c(2,2))
p1<-qqnorm(x,main="Marginal Distribution for X")
p2<-qqnorm(y,main="Marginal Distribution for Y")
p3<-qqnorm((x+y)/sqrt(2),
main="Marginal Distribution for (X+Y)/sqrt(2)")
p4<-qqnorm((x-y)/sqrt(2),
main="Marginal Distribution for (X-Y)/sqrt(2)")
Table 9.1:
distv<-c("rnorm","rcauchy","rlaplace","runif","rexp")
t1<-fun.testreg(dists=distv)
t2<-fun.testreg(dists=distv,npergp=50)
Table 10.2:
Dwass, M. (1956). The large-sample power of rank order tests in the two-sample problem. The Annals of Mathematical Statistics 27(2), 352–374.
Dwass, M. (1985). On the convolution of Cauchy distributions. The American Mathematical Monthly 92(1), 55–57.
Dykstra, R. L. (1981). An isotonic regression algorithm. Journal of Statistical Planning and Inference 5, 355–363.
Edgeworth, F. Y. (1893). VIII. Exercises in the calculation of errors. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 36(218), 98–111.
SAS Institute Inc. (2017). SAS/STAT 14.3 User’s Guide. Cary, NC: SAS
Institute Inc.
Shao, J. and D. Tu (1995). The Jackknife and Bootstrap (first ed.). Springer
Series in Statistics. New York: Springer-Verlag.
Sheather, S. J. and M. C. Jones (1991). A reliable data-based bandwidth selec-
tion method for kernel density estimation. Journal of the Royal Statistical
Society. Series B (Methodological) 53 (3), 683–690.
Index
Mann-Whitney test, 48–50, 55–58, 61–63, 81, 83, 86, 102, 103, 133
Mood's median test, 43–46, 50, 53, 54, 61, 141
multinomial distribution, 5, 74
multivariate Gaussian distribution, 113, 115, 121, 129
multivariate median, 130, 132
Nadaraya-Watson smoothing, 152, 155, 156
non-central Cauchy distribution, 3
non-central chi-square distribution, 6, 86, 89, 92
non-centrality parameter, 6
normal distribution, 1
normal scores, 50, 53, 79
null hypothesis, 6
one-sided hypothesis, 7
order statistics, 26, 50–52, 57, 62, 63
p-value, 10
Page's test, 108, 109
paired t-test, 95
parametric bootstrap, 171
Pearson correlation, 113–116, 120, 121, 124, 178
percentile method, 173
permutation test, 52
pivot, 12, 179, 180
pivotal, 179
pooled adjacent violators, 157
power, 8, 10, 11, 22, 23, 28–33, 44, 53–55, 58, 69, 71, 84–87, 89–92, 99, 102, 103, 135, 139
Prentice test, 106
Savage scores, 50, 51, 53, 78, 79, 107
Siegel-Tukey test, 58, 59
sign test, 20, 22–25, 27, 31–33, 43, 63, 96, 98, 105, 132–136
Spearman correlation, 116, 118–121, 123, 124
Spearman Rank correlation, 115
spline, 157, 159
standard Gaussian distribution, 2
standard normal distribution, 2
Student's t distribution, 6, 16
Studentized, 179
Studentized range distribution, 72, 81
test level, 7, 9–11, 16, 17, 20–23, 28, 31, 116
test statistic, 7
tied observations, 54, 107
two-sample pooled t statistic, 40
two-sample pooled t-test, 40, 41, 51, 52, 71, 72, 81
two-sided hypothesis, 9
type I error rate, 7
type II error rate, 8
U statistic, 48
uniform distribution, 2, 3, 16, 152
usual method, 173
van der Waerden scores, 50, 51, 78
Walsh averages, 99, 100
Wilcoxon rank-sum test, 47–50, 53–58, 61, 62, 81, 83, 86, 138
Wilcoxon signed-rank test, 96, 99, 100, 132–136