APPENDIX A
Statistical analysis
A.1 Introduction
Design and statistical analysis are inextricably linked but in the
main part of the book we have aimed primarily to discuss design
with relatively minimal discussion of analysis. Use of results con-
nected with the linear model and analysis of variance is, however,
unavoidable and we have assumed some familiarity with these. In
this Appendix we describe the essential results required in a com-
pact, but so far as feasible, self-contained way.
To the extent that we concern ourselves with analysis, we rep-
resent the response recorded on a particular unit as the value of a
random variable and take the objective to be inference about aspects of
the underlying probability distributions, in particular parameters
describing differences between treatments. Such models are an es-
sential part of the more formal part of statistical analysis, i.e. that
part that goes beyond graphical and tabular display, important
though these latter are.
One of the themes of the earlier chapters of this book is an inter-
play between two different kinds of probabilistic model. One is the
usual one in discussions of statistical inference where such models
are idealized representations of physical random variability as it
arises when repeat observations are made under nominally similar
conditions. The second model is one in which the randomness en-
ters only via the randomization procedure used by the investigator
in allocating treatments to experimental units. This leads to the
notion that a standard set of assumptions plus consideration of
the design used implies a particular form of default analysis with-
out special assumptions about the physical form of the random
variability encountered. These considerations are intended to re-
move some of the arbitrariness that may seem to be involved in
constructing models and analyses for special designs.
A.2 Linear model
A.2.1 Formulation and assumptions
We write the linear model in the equivalent forms
E(Y) = Xθ, Y = Xθ + ε, (A.1)
where by definition E(ε) = 0. Here Y is an n × 1 vector of random
variables representing responses to be observed, one per experimen-
tal unit, θ is a q × 1 vector of unknown parameters representing
variously treatment contrasts, including main effects and interac-
tions, block and similar effects, effects of baseline variables, etc.,
and X is an n × q matrix of constants determined by the design and
other structure of the system. Typically some components of θ are
parameters of interest and others are nuisance parameters.
It is frequently helpful to write q = p+1 and to take the first col-
umn of X to consist of ones, concentrating then on the estimation
of the last p components of θ. Initially we suppose that X is of full
rank q < n. That is, there are fewer parameters than observations,
so that the model is not saturated with parameters, and moreover
there is not a redundant parameterization.
For the primary discussion we alternate between two possible
assumptions about the error vector ε: it is always clear from context
which is being used.
Second moment assumption. The components of ε are uncorrelated
and of constant variance σ^2, i.e.
E(εε^T) = σ^2 I, (A.2)
where I is the n × n identity matrix.
Normal theory assumption. The components of ε are independently
normally distributed with zero mean and variance σ^2.
Unless explicitly stated otherwise we regard σ 2 as unknown. The
normal theory assumption implies the second moment assumption.
The reasonableness of the assumptions needs consideration in each
applied context.
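As a concrete illustration (the specific design and numbers here are invented, not taken from the text), the following Python sketch builds such a matrix X for a small completely randomized design, with a first column of ones and indicator columns for treatment contrasts.

```python
import numpy as np

# Hypothetical completely randomized design: 3 treatments, 2 units each.
treatments = np.array([0, 0, 1, 1, 2, 2])
n = len(treatments)

# First column of ones (the constant), then indicators for treatments 1 and 2,
# so that X is of full rank q = p + 1 with p = 2 treatment contrasts
# measured relative to treatment 0.
X = np.column_stack([
    np.ones(n),
    (treatments == 1).astype(float),
    (treatments == 2).astype(float),
])
print(X)
print("rank of X:", np.linalg.matrix_rank(X))   # q = 3 < n = 6
```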
A.2.2 Key results
The strongest theoretical motivation of the following definitions is
provided under the normal theory assumption by examining the
likelihood function, checking that it is the likelihood for an ex-
ponential family with q + 1 parameters and a q + 1 dimensional
canonical statistic and that hence analysis under the model is to
be based on the statistics now to be introduced. We discuss opti-
mality further in Section A.2.4 but for the moment simply consider
the following statistics.
We define the least squares estimate of θ by the equation
X T X θ̂ = X T Y, (A.3)
the residual vector to be
Yres = Y − X θ̂ (A.4)
and the residual sum of squares and mean square as
SSres = Yres^T Yres, MSres = SSres/(n − q). (A.5)
Occasionally we use the extended notation Yres.X , or even Y.X ,
to show the vector and model involved in the definition of the
residual.
Because X has full rank, so too does X T X, enabling us to write
θ̂ = (X T X)−1 X T Y. (A.6)
Under the second moment assumption θ̂ is an unbiased estimate
of θ with covariance matrix (X T X)−1 σ 2 and MSres is an unbiased
estimate of σ 2 . Thus the covariance matrix of θ̂ can be estimated
and approximate confidence limits found for any parametric func-
tion of θ. One strong justification of the least squares estimates is
that they are functions of the sufficient statistics under the normal
theory assumption. Another is that among unbiased estimators lin-
ear in Y , θ̂ has the “smallest” covariance matrix, i.e. for any matrix
C for which E(CY ) = θ, cov(θ̂)−cov(CY ) is positive semi-definite.
Stronger results are available under the normal theory assumption;
for example θ̂ has smallest covariance among all unbiased estima-
tors of θ.
Under the second moment assumption, on substituting Y = Xθ + ε
into (A.6) we have
θ̂ = (X^T X)^{-1} X^T Xθ + (X^T X)^{-1} X^T ε
  = θ + (X^T X)^{-1} X^T ε. (A.7)
The unbiasedness follows immediately and the covariance matrix
of θ̂ is
E{(θ̂ − θ)(θ̂ − θ)^T} = (X^T X)^{-1} X^T E(εε^T) X (X^T X)^{-1}
  = (X^T X)^{-1} σ^2. (A.8)
Further
Yres = {I − X(X^T X)^{-1} X^T} ε. (A.9)
Now the residual sum of squares is Yres^T Yres, so that the expected
value of the residual sum of squares is
σ^2 tr({I − X(X^T X)^{-1} X^T}^T {I − X(X^T X)^{-1} X^T}).
Direct multiplication shows that
{I − X(X^T X)^{-1} X^T}^T {I − X(X^T X)^{-1} X^T}
= {I − X(X^T X)^{-1} X^T}
and its trace is
tr{I_n − X(X^T X)^{-1} X^T} = tr(I_n) − tr{(X^T X)^{-1} X^T X} = tr(I_n) − tr(I_q) = n − q, (A.10)
where temporarily we show explicitly the dimensions of the identity
matrices involved.
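A minimal numerical sketch of these quantities, using simulated data in place of a real experiment (the model and parameter values below are invented), is the following.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data from a hypothetical linear model Y = X theta + eps.
n, q = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, q - 1))])
theta = np.array([1.0, 2.0, -0.5])
Y = X @ theta + 0.7 * rng.normal(size=n)

# Least squares estimate from the normal equations (A.3)/(A.6).
XtX = X.T @ X
theta_hat = np.linalg.solve(XtX, X.T @ Y)

# Residual vector, residual sum of squares and residual mean square, (A.4)-(A.5).
Y_res = Y - X @ theta_hat
SS_res = Y_res @ Y_res
MS_res = SS_res / (n - q)            # unbiased estimate of sigma^2

# Estimated covariance matrix of theta_hat, (X^T X)^{-1} MS_res, from (A.8).
cov_hat = np.linalg.inv(XtX) * MS_res
print(theta_hat)
print(MS_res)
print(cov_hat)
```

In practice the least squares equations would usually be solved through a QR or similar decomposition rather than by forming (X^T X)^{-1} explicitly, but the quantities computed are the same.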
A.2.3 Some properties
There is a large literature associated with the results just sketched
and their generalizations. Here we give only a few points.
First under the normal theory assumption the log likelihood is,
except for a constant
−n log σ − (Y − Xθ)T (Y − Xθ)/(2σ 2 ). (A.11)
The identity
(Y − Xθ)T (Y − Xθ)
= {(Y − X θ̂) + X(θ̂ − θ)}T {(Y − X θ̂) + X(θ̂ − θ)}
= SSres + (θ̂ − θ)T (X T X)(θ̂ − θ), (A.12)
the cross-product term vanishing, justifies the statement about suf-
ficiency at the beginning of Section A.2.2.
Next we define the vector of fitted values
Ŷ = X θ̂, (A.13)
the values that would have arisen had the data exactly fitted the
model with the estimated parameter value. Then we have the anal-
ysis of variance, or more literally the analysis of sum of squares,
Y TY = (Y − X θ̂ + X θ̂)T (Y − X θ̂ + X θ̂)
= SSres + Ŷ T Ŷ . (A.14)
We call the second term the sum of squares for fitting X and
sometimes denote it by SSX . It follows on direct substitution that
E(SSX ) = (Xθ)T (Xθ) + qσ 2 . (A.15)
A property that is often useful in analysing simple designs is that
because X T X is of full rank, a component θ̂s of the least squares
estimate is the unique linear combination of X T Y , the right-hand
side of the least squares equations, that is unbiased for θs . For such
a linear combination ls^T X^T Y to be unbiased we need
E(ls^T X^T Y) = ls^T X^T Xθ = es^T θ, (A.16)
where es is a vector with one in row s and zero elsewhere. This
implies that ls is the sth column of (X^T X)^{-1}.
Finally, and most importantly, consider confidence limits for a
component parameter. Write C = X^T X and denote the elements
of C^{-1} by c^{rs}. Then
var(θ̂s) = c^{ss} σ^2
is estimated by
c^{ss} MSres,
suggesting the use of the pivot
(θ̂s − θs)/√(c^{ss} MSres) (A.17)
to calculate confidence limits for and test hypotheses about θs.
Under the normal theory assumption the pivot has a Student t
distribution with n − q degrees of freedom. Under the second moment
assumption it has, for large n, asymptotically a standard
normal distribution under the extra assumptions that
1. n − q is also large, which can be shown to imply that MSres
converges in probability to σ^2;
2. the matrix X and error structure are such that the central limit
theorem applies to θ̂s .
Regarding the second point, note that if we assumed the errors inde-
pendent, and not merely uncorrelated, it is a question of verifying
say the Lindeberg conditions. A simple sufficient but not necessary
condition for asymptotic normality of the least squares estimates
is then that in a notional series of problems in which the number
of parameters is fixed and the number of observations tends to in-
finity the squared norms of all columns of (X T X)−1 X T tend to
zero.
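For illustration only, confidence limits based on the pivot (A.17) might be computed as follows, assuming numpy and scipy and using simulated data in place of a real experiment.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical data; same set-up as in the earlier sketch.
n, q = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, q - 1))])
Y = X @ np.array([1.0, 2.0, -0.5]) + 0.7 * rng.normal(size=n)

C_inv = np.linalg.inv(X.T @ X)
theta_hat = C_inv @ X.T @ Y
Y_res = Y - X @ theta_hat
MS_res = Y_res @ Y_res / (n - q)

s = 1                                   # component of interest (0-based index)
se = np.sqrt(C_inv[s, s] * MS_res)      # estimated standard error of theta_hat[s]

# 95% confidence limits from the pivot (A.17), Student t with n - q degrees of freedom.
t_crit = stats.t.ppf(0.975, df=n - q)
lower, upper = theta_hat[s] - t_crit * se, theta_hat[s] + t_crit * se
print(theta_hat[s], (lower, upper))
```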
A.2.4 Geometry of least squares
For the study of special problems that are not entirely balanced
we need implicitly or explicitly either algebraic manipulation and
simplification of the above matrix equations or, perhaps more com-
monly, their numerical evaluation and solution. For some aspects of
the general theory, however, it is helpful to adopt a more abstract
approach and this we now sketch.
We shall regard the vector Y and the columns of X as elements
of a linear vector space V. That is, we can add vectors and multiply
them by scalars and there is a zero vector in V.
We equip the space with a norm and a scalar product and for
most purposes use the Euclidean norm, i.e. we define for a vector
Z ∈ V, specified momentarily in coordinate form,
‖Z‖^2 = Z^T Z = ΣZi^2, (A.18)
and the scalar product by
(Z1, Z2) = Z1^T Z2 = (Z2, Z1). (A.19)
Two vectors are orthogonal if their scalar product is zero.
Given a set of vectors, the collection of all linear combinations of
them defines a subspace, S, say. Its dimension dS is the maximal
number of linearly independent components in S, i.e. the maximal
number such that no nontrivial linear combination is identically zero. In
particular the q columns of the n × q matrix X define a subspace
CX called the column space of X. If and only if X is of full rank,
q = dCX.
In the following discussion we abandon the requirement that X
is of full rank.
Given a subspace S of dimension dS
1. the set of all vectors orthogonal to all vectors in S forms another
vector space S ⊥ called the orthogonal complement of S
2. S ⊥ has dimension n − dS
3. an arbitrary vector Z in V is uniquely resolved into two compo-
nents, its projection in S and its projection in the orthogonal
complement
Z = ZS + ZS⊥ (A.20)
4. the components are by construction orthogonal and
‖Z‖^2 = ‖ZS‖^2 + ‖ZS⊥‖^2. (A.21)
We now regard the linear model as specifying that E(Y ) lies in
the column space of X, CX . Resolve Y into a component in CX and
a component in its orthogonal complement. The first component,
in matrix notation X θ̃, say, is such that the second component
Y − X θ̃ is orthogonal to every column of X, i.e.
X T (Y − X θ̃) = 0 (A.22)
and these are the least squares equations (A.3) so that θ̃ = θ̂. Fur-
ther, the components are the vectors of fitted values and residuals
and the analysis of variance in (A.14) is the Pythagorean identity
for their squared norms.
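The orthogonal resolution and the Pythagorean identity can be checked numerically; the sketch below, with arbitrary simulated Y and X, forms the projection explicitly through the matrix X(X^T X)^{-1} X^T.

```python
import numpy as np

rng = np.random.default_rng(2)
n, q = 12, 3
X = rng.normal(size=(n, q))
Y = rng.normal(size=n)

# Orthogonal projection of Y onto the column space of X (hat matrix H)
# and onto its orthogonal complement (I - H).
H = X @ np.linalg.inv(X.T @ X) @ X.T
Y_fit = H @ Y              # component in C_X: the fitted values
Y_res = Y - Y_fit          # component in the orthogonal complement: the residuals

# The two components are orthogonal and their squared norms add, (A.14)/(A.21).
print(np.allclose(Y_fit @ Y_res, 0.0))
print(np.allclose(Y @ Y, Y_fit @ Y_fit + Y_res @ Y_res))
```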
From this representation we have the following results.
In a redundant parameterization, the vector of fitted values and
the residual vector are uniquely defined by the column space of X
even though some at least of the estimates of individual compo-
nents of θ are not uniquely defined.
The estimate of a component of θ based on a linear combination
of the components of Y is a scalar product (l, Y ) of an estimat-
ing vector l with Y. We can resolve l into components lCX and lCX⊥,
lying in and orthogonal to CX respectively. Every scalar product (l⊥, Y), in a slightly
condensed notation, has zero mean and, because of orthogonal-
ity, var{(l, Y)} is the sum of the variances of the two components. It
follows that for a given expectation the variance is minimized by
taking only estimating vectors in CX , i.e. by linear combinations
of X T Y , justifying under the second moment assumption the use
of least squares estimates. This property may be called the linear
sufficiency of X T Y .
We now sketch the distribution theory underlying confidence lim-
its and tests under the normal theory assumption. It is helpful,
although not essential, to set up in CX and its orthogonal com-
plement a set of orthogonal unit vectors as a basis for each space
in terms of which any vector may be expressed. By an orthogonal
transformation the scalar product of Y with any of these vectors
is normally distributed with variance σ 2 and scalar products with
different basis vectors are independent. It follows that
1. the residual sum of squares SSres has the distribution of σ 2 times
a chi-squared variable with degrees of freedom n − dCX
2. the residual sum of squares is independent of the least squares
estimates, and therefore of any function of them
3. the least squares estimates, when uniquely defined, are normally
distributed
4. the sum of squares of fitted values SSX has the form of σ 2 times
a noncentral chi-squared variable with dCX degrees of freedom,
reducing to central chi-squared if and only if θ = 0, i.e. the true
parameter value is at the origin of the vector space.
These cover the distribution theory for standard so-called exact
tests and confidence intervals.
In the next subsection we give a further development using the
coordinate-free vector space approach.
A.2.5 Stage by stage fitting
In virtually all the applications we consider in this book the pa-
rameters and therefore the columns of the matrix X are divided
into sections corresponding to parameters of different types; in par-
ticular the first parameter is usually associated with a column of
ones, i.e. is a constant for all observations.
Suppose then that
E(Y ) = X0 θ0.1 + X1 θ1.0 ; (A.23)
no essentially new ideas are involved with more than two sections.
We suppose that the column spaces of X1 and of X0 do not coincide
and that in general each new set of parameters genuinely constrains
the previous model.
It is then sometimes helpful to argue as follows.
Set θ1.0 = 0. Estimate θ0 , the coefficient of X0 in the model ig-
noring X1 , by least squares. Note that the notation specifies what
parameters are included in the model. We call the resulting esti-
mate θ̂0 the least squares estimate of θ0 ignoring X1 and the as-
sociated sum of squares of fitted values, SSX0 , the sum of squares
for X0 ignoring X1 .
Now project the whole problem into the orthogonal complement
of CX0 , the column space of X0 . That is, we replace Y by what we
now denote by Y.0 , the residual vector with respect to X0 and we
replace X1 by X1.0 and the linear model formally by
E(Y.0 ) = X1.0 θ1.0 , (A.24)
a linear model in a space of dimension n − d0 , where d0 is the
dimension of CX0 .
We again obtain a least squares estimate θ̂1.0 by orthogonal pro-
jection. We obtain also a residual vector Y.01 and a sum of squares
of fitted values which we call the sum of squares for X1 adjusted for X0,
SSX1.0.
We continue this process for as many terms as there are in the
original model.
For example if there were three sections of the matrix X, say X0,
X1, X2, the successive sums of squares generated would be for
fitting first X0 ignoring (X1, X2), then X1 adjusting for X0 but
ignoring X2, and finally X2 adjusting for (X0, X1), leaving a sum of
squares residual to the whole model. These four sums of squares,
being squared norms in orthogonal subspaces, are under the nor-
mal theory assumption independently distributed in chi-squared
distributions, central for the residual sum of squares and in gen-
eral noncentral for the others. Although it is an aspect we have not
emphasized in the book, if a test is required of consistency with
θ2.01 = 0 in the presence of arbitrary θ0 , θ1 this can be achieved
via the ratio of the mean square for X2 adjusting for (X0 , X1 )
to the residual mean square. The null distribution, corresponding
to the appropriately scaled ratio of independent chi-squared vari-
ables, is the standard variance ratio or F distribution with degrees
of freedom the dimensions of the corresponding spaces.
The simplest special case of this procedure is the fitting of the
linear regression
E(Yi ) = θ0.1 + θ1.0 zi , (A.25)
so that X0 , X1 are both n × 1. We estimate θ0 ignoring X1 by the
sample mean Ȳ. and projection orthogonal to the unit vector X0
leads to the formal model
E(Yi − Ȳ. ) = θ1.0 (zi − z̄. ) (A.26)
from which familiar formulae for a least squares slope, and associ-
ated sum of squares, follow immediately.
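A small numerical check of this two-stage calculation, using invented data, might run as follows.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 15
z = rng.normal(size=n)
Y = 2.0 + 1.5 * z + 0.5 * rng.normal(size=n)   # invented data for illustration

# Stage 1: fit the constant (X0 = column of ones) ignoring z; the estimate is Ybar.
Ybar = Y.mean()

# Stage 2: project into the orthogonal complement of X0, i.e. centre both Y and z,
# and fit the reduced model (A.26) by least squares.
Yc, zc = Y - Ybar, z - z.mean()
slope = (zc @ Yc) / (zc @ zc)          # the familiar least squares slope

# Agreement with the slope obtained by fitting both parameters at once.
X = np.column_stack([np.ones(n), z])
theta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
print(np.allclose(slope, theta_hat[1]))
```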
In balanced situations, such as a randomized block design, in
which the three sections correspond to terms representing general
mean, block and treatment effects, the spaces X1.0 , X2.0 are or-
thogonal and X2.0 is the same as X2.01 . Then and only then the
distinction between, say, a sum of squares for blocks ignoring treat-
ments and a sum of squares for blocks adjusting for treatments can
be ignored.
Because of the insight provided by fitting parameters in stages
and of its connection with the procedure known in the older lit-
erature as analysis of covariance, we now sketch an algebraic dis-
cussion, taking for simplicity a model in two stages and assuming
formulations of full rank.
That is we start again from
E(Y ) = X0 θ0.1 + X1 θ1.0 , (A.27)
where Xj is n × dj and θj.k is dj × 1. The matrix
R0 = I − X0 (X0^T X0)^{-1} X0^T (A.28)
is symmetric and idempotent, i.e. R0^T R0 = R0^2 = R0, and for any
vector u of length n, R0 u is the vector of residuals after regressing
u on X0. The associated residual sum of squares is (R0 u)^T (R0 u) =
u^T R0 u and the associated residual sum of products for two vectors
u and v is u^T R0 v. Further, for any vector u, we have X0^T R0 u = 0.
In the notation of (A.24) R0 Y = Y.0 and R0 X1 = X1.0 .
We can rewrite model (A.27) in the form
E(Y ) = X0 θ0 + R0 X1 θ1.0 (A.29)
where
θ0 = θ0.1 + (X0^T X0)^{-1} X0^T X1 θ1.0. (A.30)
The least squares equations for the model (A.29) have the form

    ( X0^T X0        0          ) ( θ̂0   )   ( X0^T Y    )
    (    0       X1^T R0 X1     ) ( θ̂1.0 ) = ( X1^T R0 Y ),     (A.31)

from which the following results can be deduced.
First the parameters θ0 and θ1.0 are orthogonal with respect to
the expected Fisher information in the model, and the associated
vectors X0 and X1.0 are orthogonal in the usual sense.
Secondly θ1.0 is estimated from a set of least squares equations
formed from the matrix of sums and squares of products of the
columns of X1 residual to their regression on X0 and the covariance
matrix of θ̂1.0 is similarly determined.
Thirdly the least squares estimate θ̂0.1 is obtained by adding to
θ̂0 a function uncorrelated with it.
Continuing with the analysis of (A.29) we see that the residual
sum of squares from the full model is obtained by reducing the
residual sum of squares from the model ignoring θ1 , i.e. Y T R0 Y ,
by the sum of squares of fitted values in the regression of R0 Y on
R0 X1 :
(Y − X0 θ̂0 − R0 X1 θ̂1.0)^T (Y − X0 θ̂0 − R0 X1 θ̂1.0)
  = Y^T R0 Y − θ̂1.0^T X1^T R0 X1 θ̂1.0. (A.32)
Under the normal theory assumption, an F test of the hypothesis
that θ1.0 = 0 is obtained via the calculation of residual sums of
squares from the full model and from one in which the term in θ1.0
is omitted.
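As an illustrative sketch (the design matrices and data here are simulated, not from the text), such an F test can be computed directly from the two residual sums of squares.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n = 30
X0 = np.column_stack([np.ones(n), rng.normal(size=n)])     # e.g. mean plus a baseline variable
X1 = rng.normal(size=(n, 2))                               # e.g. two further contrast columns
Y = X0 @ np.array([1.0, 0.5]) + 0.8 * rng.normal(size=n)   # data generated with theta_1.0 = 0

def rss(M, y):
    """Residual sum of squares from least squares regression of y on M."""
    fit = M @ np.linalg.lstsq(M, y, rcond=None)[0]
    return (y - fit) @ (y - fit)

rss_reduced = rss(X0, Y)
rss_full = rss(np.column_stack([X0, X1]), Y)

d1 = X1.shape[1]                      # degrees of freedom for the term being tested
d2 = n - X0.shape[1] - X1.shape[1]    # residual degrees of freedom for the full model
F = ((rss_reduced - rss_full) / d1) / (rss_full / d2)
p_value = stats.f.sf(F, d1, d2)
print(F, p_value)
```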
It is sometimes convenient to display the results from stage by
stage fitting in an analysis of variance table; an example with just
two stages is outlined in Table A.1. In the models used in this
book, the first section of the design matrix is always a column of
1s, corresponding to fitting an overall mean. This is so rarely of
interest that it is usually omitted from the analysis of variance table,
so that the total sum of squares is (Y − Ȳ 1)^T (Y − Ȳ 1) instead of
Y^T Y.
Table A.1 Analysis of variance table emphasizing the fitting of model
(A.27) in two stages.

Source                        D.f.           Sums of squares              Expected mean square
Regr. on X0                   rank X0 = q    θ̂0^T X0^T X0 θ̂0            σ^2 + θ0^T X0^T X0 θ0/q
Regr. on X1, adj. for X0      p − q          θ̂1.0^T X1.0^T X1.0 θ̂1.0    σ^2 + θ1.0^T X1.0^T X1.0 θ1.0/(p − q)
Residual                      n − p          (Y − Ŷ)^T (Y − Ŷ)           σ^2
Total                         n              Y^T Y

A.2.6 Redundant parameters
In the main text we often use formulations with redundant param-
eters, such as for the completely randomized and randomized block
designs
E(Yjs ) = µ + τj , (A.33)
E(Yjs ) = µ + τj + βs . (A.34)
With v treatments and b blocks these models have respectively v+1
and v + b + 1 parameters although the corresponding X matrices
in the standard formulation have ranks v and v + b − 1.
The least squares equations are algebraically consistent but do
not have a unique solution. The geometry of least squares shows,
however, that the vector of fitted values and the sum of squares
for residual and for fitting X are unique. In fact any parameter of
the form aT Xθ has a unique least squares estimate.
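The point can be seen numerically: in the sketch below (an invented completely randomized design in the redundant parameterization) two different solutions of the least squares equations give identical fitted values.

```python
import numpy as np

# Over-parameterized completely randomized design (A.33): a mean column plus one
# column per treatment, so X has v + 1 columns but rank v.  Data are invented.
treatments = np.array([0, 0, 1, 1, 2, 2])
n, v = len(treatments), 3
X = np.column_stack([np.ones(n)] + [(treatments == j).astype(float) for j in range(v)])
rng = np.random.default_rng(5)
Y = rng.normal(size=n) + treatments

# One least squares solution via the Moore-Penrose generalized inverse.
theta_a = np.linalg.pinv(X) @ Y

# Any vector in the null space of X can be added without changing the fit;
# here (1, -1, ..., -1) shifts mu up and every treatment parameter down.
null_dir = np.concatenate([[1.0], -np.ones(v)])
theta_b = theta_a + 2.5 * null_dir

print(np.allclose(X @ null_dir, 0.0))          # the direction lies in the null space
print(np.allclose(X @ theta_a, X @ theta_b))   # fitted values (and residuals) unchanged
```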
From a theoretical viewpoint the most satisfactory approach is
via the use of generalized inverses and the avoidance of explicit
specification of constraints. The issue can, however, always be by-
passed by redefining the model so that the column space of X
remains the same but the number of parameters is reduced to
eliminate the redundancy, i.e. to restore X to full rank. When
the parameters involved are parameters of interest this is indeed
desirable in that the primary objective of the analysis is the esti-
mation of, for example, treatment contrasts, so that formulation
in terms of them has appeal despite a commonly occurring loss of
symmetry.
While, as noted above, many aspects of the estimation problem
do not depend on the particular reparameterization chosen, some
care is needed. First and most importantly the constraint must
not introduce an additional rank reduction in X; thus in (A.33)
the constraint
Σ(µ + τj) = a (A.35)
for a given constant a would not serve for resolution of parameter
redundancy; the left-hand side is unchanged by the transformation
µ → µ + c, τj → τj − c that expresses the redundancy, so the constraint
does not determine the parameters uniquely.
In general suppose that the n × q matrix X is of rank q − r and
impose constraints Aθ = 0, where A is r × q chosen so that the
equations
Xθ = k, Aθ = 0
uniquely determine θ for all constant vectors k. This requires that
(X T , AT ) is of full rank. The least squares projection equations
supplemented by the constraints give
    ( X^T X   A^T ) ( θ̂ )   ( X^T Y )
    (   A      0  ) ( λ )  = (   0   ),     (A.36)
where λ is a vector of Lagrange multipliers used to introduce the
constraint.
The matrix on the left-hand side is singular but is converted into
a nonsingular matrix by changing X T X to X T X +AT A which does
not change the solution. Equations for the constrained estimates
and their covariance matrix follow.
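A short numerical sketch of this device, for the completely randomized design (A.33) with a sum-to-zero constraint on the treatment parameters (the data are invented), is the following.

```python
import numpy as np

# Completely randomized design (A.33) with the constraint tau_1 + tau_2 + tau_3 = 0,
# imposed through the augmentation X^T X -> X^T X + A^T A.
treatments = np.array([0, 0, 1, 1, 2, 2])
n, v = len(treatments), 3
X = np.column_stack([np.ones(n)] + [(treatments == j).astype(float) for j in range(v)])
A = np.array([[0.0, 1.0, 1.0, 1.0]])

rng = np.random.default_rng(6)
Y = rng.normal(size=n) + treatments

# The augmented matrix is nonsingular and its solution satisfies both the least
# squares equations X^T X theta = X^T Y and the constraint A theta = 0.
theta_hat = np.linalg.solve(X.T @ X + A.T @ A, X.T @ Y)
print(A @ theta_hat)                                 # approximately zero
print(np.allclose(X.T @ X @ theta_hat, X.T @ Y))     # least squares equations hold
```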
In many of the applications discussed in the book some compo-
nents of the parameter vector correspond to qualitative levels of a
factor, i.e. each level has a distinct component. As discussed above
constraints are needed if a model of full rank is to be achieved. De-
note the unconstrained set of relevant components by {ψ1 , . . . , ψk };
typical constraints are ψ1 = 0 and Σψj = 0, although others are
possible. The objective is typically the estimation with standard
errors of a set C of contrasts Σcj ψj with Σcj = 0. If C is the set
of simple contrasts with baseline, i.e. the estimation of ψj − ψ1
for j = 2, . . . , k, the resulting estimates of the constrained param-
eters and their standard errors are all that is needed. In general,
however, the full covariance matrix of the constrained estimates
will be required; for example if the constrained parameters are de-
noted by φ2 , . . . , φk , estimation of ψ3 − ψ2 via φ̂3 − φ̂2 is direct
but the standard error requires cov(φ̂3 , φ̂2 ) as well as the separate
variances.
In the presentation of conclusions and especially for moderately
large k the recording of a full covariance matrix may be inconve-
nient. This may often be avoided by the following device. We at-
tach pseudo-variances ω1 , . . . , ωk to estimates of so-called floating
parameters a + ψ1 , . . . , a + ψk , where a is an unspecified constant,
in such a way that exactly or approximately
var(Σcj ψ̂j) = var{Σcj(a + ψ̂j)} = Σωj cj^2 (A.37)
for all contrasts in the set C. We then treat the floating parame-
ters as if independently estimated. In simple cases this reduces to
specifying marginal means rather than contrasts.
A.2.7 Residual analysis and model adequacy
There is an extensive literature on the direct use of the residual
vector, Yres.X , for assessing model adequacy and detecting possible
data anomalies. The simplest versions of these methods hinge on
the notion that if the model is reasonably adequate the residual
vector is approximately a set of independent and identically dis-
tributed random variables, so that structure detected in them is
evidence of model inadequacy. This is broadly reasonable when the
number of parameters fitted is small compared with the number
of independent observations. In fact the covariance matrix of the
residual vector is
σ 2 {I − X(X T X)−1 X T } (A.38)
from which more refined calculations can be made.
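For illustration, standardized residuals allowing for the leverages, i.e. the diagonal elements of X(X^T X)^{-1} X^T, might be computed as in the following sketch with simulated data.

```python
import numpy as np

rng = np.random.default_rng(7)
n, q = 25, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, q - 1))])
Y = X @ np.array([1.0, 0.5, -1.0, 0.2]) + 0.6 * rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T      # hat matrix
h = np.diag(H)                            # leverages
Y_res = Y - H @ Y
MS_res = Y_res @ Y_res / (n - q)

# From (A.38) each raw residual has variance sigma^2 (1 - h_j); dividing by the
# estimated standard deviation puts the residuals on a common scale.
r = Y_res / np.sqrt(MS_res * (1.0 - h))
print(h.sum())        # equals q, the rank of X
print(r[:5])
```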
For many of the designs considered in this book, however, q is
not small compared with n and naive use of the residuals will be
potentially misleading.
Possible departures from the models can be classified roughly as
systematic changes in structure, on the whole best detected and
analysed by fitting extended models, and data anomalies, such as
defective observations, best studied via direct inspection of the
data, where these are not too extensive, and by appropriate func-
tions of residuals. See the Bibliographic notes and Further results
and exercises.
A.3 Analysis of variance
A.3.1 Preliminaries
In the previous section analysis of variance was introduced in the
context of the linear model as a schematic way first of calculating
the residual sum of squares as a basis for estimating residual vari-
ance and then as a device for testing a null hypothesis constraining
the parameter vector of the linear model to a subspace. There are,
however, other ways of thinking about analysis of variance.
The first corresponds in a sense to the most literal meaning of
analysis. Suppose that an observed random variable is in fact the
sum of two unobserved (latent) variables, so that in the simplest
case in which systematic effects are absent we can write
Y = µ + ξ + η, (A.39)
where µ is the unknown mean and ξ and η are uncorrelated random
variables of zero mean and variances respectively σξ2 , ση2 ; these are
called components of variance. Then
var(Y ) = σξ2 + ση2 , (A.40)
with obvious generalization to more than two components of vari-
ability.
If now we observe a number of different random variables with
this general structure it may be possible to estimate the com-
ponents of variance σξ2 , ση2 separately and in this sense to have
achieved a splitting up, i.e. analysis, of variance. The two simplest
cases are where we have repeat observations on a number of groups
of observations Yaj , say, with
Yaj = µ + ξa + ηaj (A.41)
for a = 1, . . . , A; j = 1, . . . , J, where it is convenient to depart
briefly from our general convention of restricting upper case letters
to random variables.
The formulation presupposes that the variation between groups,
as well as that within groups, is regarded as best represented by
random variables; it would thus not be appropriate if the groups
represented different treatments.
A second possibility is that the observations are cross-classified
by rows and columns in a rectangular array and where the obser-
vations Yab can be written
Yab = µ + ξa + ηb + εab, (A.42)
where the component random variables are now interpreted as ran-
dom row and column effects and as error, respectively.
It can be shown that in these and similar situations with ap-
propriate definitions the components of variance can be estimated
via a suitable analysis of variance table. Moreover we can then, via
the complementary technique of synthesis of variance, estimate the
properties of systems in which either the variance components have
been modified in some way, or in which the structure of the data
is different, the variance components having remained the same.
For example under model (A.41) the variance of the overall mean
of the data, considered as an estimate of µ, is easily seen to be
σξ^2/A + ση^2/(AJ).
We can thus estimate the effect on the precision of the estimate of
µ of, say, increasing or decreasing J or of improvements in mea-
surement technique leading to a reduction in ση2 .
A.3.2 Cross-classification and nesting
We now give some general remarks on the formal role of analysis of
variance to describe relatively complicated structures of data. For
this we consider data classified in general in a multi-dimensional
array. Later it will be crucial to distinguish classification based on a
treatment applied to the units from that arising from the intrinsic
structure of the units but for the moment we do not make that
distinction.
A fundamental distinction is between cross-classification and
nesting. Thus in the simplest case we may have an array of obser-
vations, which we shall now denote by Ya;j , in which the labelling
of the repeat observations for each a is essentially arbitrary, i.e.
j = 1 at a = 1 has no meaningful connection with j = 1 at a = 2.
We say that the second suffix is nested within the first. By contrast
we may have an arrangement Yab , which can be thought of as a
row by column A × B array in which, say, the column labelling
retains the same meaning for each row a and vice versa. We say
that rows are crossed with columns.
Corresponding to these structures we may decompose the data
vector into components. First, for the nested arrangement, we have
that
Ya;j = Ȳ.. + (Ȳa. − Ȳ.. ) + (Ya;j − Ȳa. ). (A.43)
This is to be contrasted with, for the cross-classification, the de-
composition
Yab = Ȳ.. + (Ȳa. − Ȳ.. ) + (Ȳ.b − Ȳ.. ) + (Yab − Ȳa. − Ȳ.b + Ȳ.. ). (A.44)
As usual averaging over a suffix is denoted by a full-stop.
Now in these decompositions the terms on the right-hand sides,
considered as defining vectors, are mutually orthogonal, leading to
familiar decompositions of the sums of squares. Moreover there is
a corresponding decomposition of the dimensionalities, or degrees
of freedom, of the spaces in which these vectors lie, namely for the
nested case
AJ = 1 + (A − 1) + A(J − 1)
and in the cross-classified case
AB = 1 + (A − 1) + (B − 1) + (A − 1)(B − 1).
Note that if we were mistakenly to treat the suffix j as if it were a
meaningful basis for cross-classification we would be decomposing
the third term in the analysis of the data vector, the sum of squares
and the degrees of freedom into the third and fourth components of
the crossed analysis. In general if a nested suffix is converted into a
crossed suffix, variation within nested levels is typically converted
into a main effect and one or more interaction terms.
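The orthogonality of the terms in decomposition (A.44), and the corresponding addition of sums of squares and degrees of freedom, are easily checked numerically; the sketch below does this for a small cross-classified array of invented data.

```python
import numpy as np

rng = np.random.default_rng(9)
A_lev, B_lev = 4, 5
Y = rng.normal(size=(A_lev, B_lev))      # an A x B cross-classified array

mean = Y.mean()
row = Y.mean(axis=1, keepdims=True) - mean
col = Y.mean(axis=0, keepdims=True) - mean
inter = Y - mean - row - col             # the four terms of decomposition (A.44)

# The squared norms (sums of squares) add up, mirroring the degrees of freedom
# identity AB = 1 + (A-1) + (B-1) + (A-1)(B-1).
ss = [A_lev * B_lev * mean ** 2,
      B_lev * (row ** 2).sum(),
      A_lev * (col ** 2).sum(),
      (inter ** 2).sum()]
print(np.allclose(sum(ss), (Y ** 2).sum()))
```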
The skeleton analysis of variance tables corresponding to these
structures, with degrees of freedom, but without sums of squares,
are given in Table A.2.
There are many possibilities for extension, still keeping to bal-
anced arrangements. For example Yabc;j denotes an arrangement in
which observations, perhaps corresponding to replicate determina-
tions, are nested within each cell of an A×B×C cross-classification,
whereas observations Y(a;j)bc have within each level of the first clas-
sification a number of sublevels which are all then crossed with the
levels of the other classifications. The skeleton analysis of variance
tables for these two settings are given in Tables A.3 and A.4. Note
that in the second analysis the final residual could be further de-
composed.
These decompositions may initially be regarded as concise de-
scriptions of the data structure. Note that no probabilistic consid-
erations have been explicitly involved.
Table A.2 Skeleton analysis of variance table for nested and crossed
structures.

Nested                              Crossed
Source              D.f.            Source              D.f.
Mean                1               Mean                1
A-class (groups)    A − 1           A-class (rows)      A − 1
Within groups       A(J − 1)        B-class (cols)      B − 1
                                    A × B               (A − 1)(B − 1)
Total               AJ              Total               AB
Table A.3 Skeleton analysis of variance for Yabc;j .
Source D.f.
Mean 1
A A−1
B B−1
C C −1
B×C (B − 1)(C − 1)
C ×A (C − 1)(A − 1)
A×B (A − 1)(B − 1)
A×B×C (A − 1)(B − 1)(C − 1)
Within cells ABC(J − 1)
Total ABCJ
Table A.4 Skeleton analysis of variance for Y(a;j)bc .
Source D.f.
Mean 1
A A−1
Within A A(J − 1)
B B−1
C C−1
B×C (B − 1)(C − 1)
C ×A (C − 1)(A − 1)
A×B (A − 1)(B − 1)
A×B×C (A − 1)(B − 1)(C − 1)
Residual A(BC − 1)(J − 1)
Total ABCJ
In thinking about relatively complex arrangements it is often essential
to establish which features are to be regarded as crossed and which as nested.
In terms of modelling it may then be useful to convert the data
decomposition into a model in which the parameters associated
with nested suffixes correspond to random effects, differing levels
of nesting corresponding to different variance components. Also
it will sometimes be necessary to regard the highest order inter-
actions as corresponding to random variables, in particular when
one of the interacting factors represents grouping of the units made
without specific subject-matter interpretation. Further all or most
purely cross-classified suffixes correspond to parameters describ-
ing the systematic structure, either parameters of direct interest
or characterizations of block and similar effects. For example, the
model corresponding to the skeleton analysis of variance in Table
A.4 is
Y(a;j)bc = τa^A + ηj(a) + τb^B + τc^C + τbc^BC + τab^AB + τac^AC + τabc^ABC + εj(abc).
For normal theory linear models with balanced data and a single
level of nesting the resulting model is a standard linear model and
the decomposition of the data and the sums of squares have a
direct use in terms of standard least squares estimation and testing.
With balanced data and several levels of nesting hierarchical error
structures are involved.
Although we have restricted attention to balanced arrangements
and normal error, the procedure outlined here suggests a system-
atic approach in more complex problems. We may have data un-
balanced because of missing combinations or unequal replication.
Further the simplest appropriate model may not be a normal theory
linear model but, for example, a linear logistic or linear probit
model for binary response data. The following procedure may nevertheless
be helpful.
First write down the formal analysis of variance table for the
nearest balanced structure.
Next consider the corresponding normal theory linear model. If
it has a single error term the corresponding linear model for unbal-
anced data can be analysed by least squares and the corresponding
generalized linear model, for example for binary data, analysed by
the method of maximum likelihood. Of course even in the normal
theory case the lack of balance will mean that the sums of squares
in the analysis of variance decomposition must be interpreted in
light of the other terms in the model. If the model has multiple error
terms the usual normal theory analysis uses the so-called residual
maximum likelihood, or REML, for inference on the variance com-
ponents. This involves constructing the marginal likelihood for the
residuals after estimating the parameters in the mean. The corre-
sponding analysis for generalized linear models will involve some
special approximations.
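As an illustration only, and assuming the Python package statsmodels (which is not discussed in the text), a randomized-block-type data set with blocks treated as random might be fitted by REML along the following lines; the data and variable names are invented.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(10)
v, b = 4, 6                                   # hypothetical treatments and blocks
treat_idx = np.tile(np.arange(v), b)
block_idx = np.repeat(np.arange(b), v)

block_eff = rng.normal(scale=1.0, size=b)
treat_eff = np.array([0.0, 0.5, 1.0, 1.5])
data = pd.DataFrame({
    "treatment": treat_idx.astype(str),
    "block": block_idx.astype(str),
    "y": treat_eff[treat_idx] + block_eff[block_idx]
         + rng.normal(scale=0.5, size=v * b),
})

# Mixed model with fixed treatment effects and random block effects, fitted by REML.
model = smf.mixedlm("y ~ treatment", data, groups=data["block"])
result = model.fit(reml=True)
print(result.summary())
```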
The importance of these remarks lies in the need to have a sys-
tematic approach to developing models for complex data and for
techniques of analysis when normal theory linear models are largely
inapplicable.
A.4 More general models; maximum likelihood
We have, partly for simplicity of exposition and partly because of
their immediate practical relevance, emphasized analyses for con-
tinuous responses based directly or indirectly on the normal theory
linear model. Other types of response, in particular binary, ordi-
nal or nominal, arise in many fields. Broadly speaking the use of
standard designs, for example of the factorial type, will usually be
sensible for such situations although formal optimality considera-
tions will depend on the particular model appropriate for analysis,
except perhaps locally near a null hypothesis; see Section 7.6.
For models other than the normal theory linear model, formal
methods of analysis based on the log likelihood function or some
modification thereof provide analyses serving essentially the same
role as those available in the simpler situation. Thus confidence
intervals based on the profile likelihood or one of its modifications
are the preferred analogue of intervals based on the Student t dis-
tribution and likelihood ratio tests the analogue of tests for sets
of parameters using the F distribution. We shall not review these
further here.
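For instance, a likelihood ratio test of a treatment effect for a binary response under a linear logistic model might be sketched as follows; the use of statsmodels and the simulated data are illustrative assumptions, not part of the text.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(11)
n = 200
treatment = rng.integers(0, 2, size=n)                 # hypothetical two-level factor
eta = -0.2 + 0.8 * treatment                           # linear predictor on the logit scale
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))

X_full = sm.add_constant(treatment.astype(float))
fit_full = sm.Logit(y, X_full).fit(disp=0)
fit_null = sm.Logit(y, np.ones((n, 1))).fit(disp=0)

# Likelihood ratio statistic; under the null hypothesis it is asymptotically
# chi-squared with 1 degree of freedom, playing the role the F test plays in
# the normal theory linear model.
lr = 2.0 * (fit_full.llf - fit_null.llf)
p_value = stats.chi2.sf(lr, df=1)
print(lr, p_value)
```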
A.5 Bibliographic notes
The method of least squares as applied to a linear model has a
long history and an enormous literature. For the history, see Stigler
(1986) and Hald (1998). Nearly but not quite all books on design
of experiments and virtually all those on regression and analysis of
variance and most books on general statistical theory have discus-
sions of least squares and the associated distribution theory, the
mathematical level and style of treatment varying substantially
between books. The geometrical treatment of the distribution the-
ory is perhaps implicit in comments of R. A. Fisher, and is cer-
tainly a natural development from his treatment of distributional
problems. It was explicitly formulated by Bartlett (1933) and by
Kruskal (1961) and in some detail in University of North Carolina
lecture notes, unpublished so far as we know, by R. C. Bose.
Analysis of variance, first introduced, in fact in the context of
a nonlinear model, by Fisher and Mackenzie (1923), is often pre-
sented as an outgrowth of linear model analysis and in particular
either essentially as an algorithm for computing residual sums of
squares or as a way of testing (usually uninteresting) null hypothe-
ses. This is only one aspect of analysis of variance. An older and
in our view more important role is in clarifying the structure of
sets of data, especially relatively complicated mixtures of crossed
and nested data. This indicates what contrasts can be estimated
and the relevant basis for estimating error. From this viewpoint
the analysis of variance table comes first, then the linear model,
not vice versa. See Nelder (1965a, b) for a systematic formulation
and from a very much more mathematical viewpoint Speed (1987).
Brien and Payne (1999) describe some further developments. The
main systematic treatment of the theory of analysis of variance
remains the book of Scheffé (1959).
The method of fitting in stages is implicit in Gauss’s treatment;
the discussion in Draper and Smith (1998, Chapter 2) is help-
ful. The connection with analysis of covariance goes back to the
introduction of that technique; for a broad review of analysis of
covariance, see Cox and McCullagh (1982). The relation between
least squares theory and asymptotic theory is discussed by van der
Vaart (1998).
The approach to fitting models of complex structure to general-
ized linear models is discussed in detail in McCullagh and Nelder
(1989).
For techniques for assessing the adequacy of linear models, see
Atkinson (1985) and Cook and Weisberg (1982).
For discussion of floating parameters and pseudo-variances, see
Ridout (1989), Easton et al. (1991), Reeves (1991) and Firth and
Menezes (2000).
A.6 Further results and exercises
1. Show that the matrices X(X T X)−1 X T and I − X(X T X)−1 X T
are idempotent, i.e. equal to their own squares and give the
geometrical interpretation. The first is sometimes called the hat
matrix because of its role in forming the vector of fitted values
Ŷ .
2. If the covariance matrix of Y is V σ 2 , where V is a known pos-
itive definite matrix, show that the geometrical interpretation
of Section A.2.3 is preserved when norm and scalar product are
defined by
‖Y‖^2 = Y^T V^{-1} Y, (Y1, Y2) = Y1^T V^{-1} Y2.
Show that this leads to the generalized least squares estimating
equation
X T V −1 (Y − X θ̂) = 0.
Explain why these are appropriate definitions statistically and
geometrically.
3. Verify by direct calculation that in the least squares analysis of
a completely randomized design essentially equivalent answers
are obtained whatever admissible constraint is imposed on the
treatment parameters.
4. Denote the jth diagonal element of X(X T X)−1 X T by hj , called
the leverage of the corresponding response value. Show that 0 <
hj < 1 and that Σhj = q, where q is the rank of X T X. Show
that the variance of the corresponding component of the residual
vector, Yres,j, is σ^2(1 − hj), leading to the definition of the jth
standardized residual as rj = Yres,j/{s√(1 − hj)}, where s^2 is
the residual mean square. Show that the difference between Yj
and the predicted mean for Yj after omitting the jth value from
the analysis, xj^T β̂(j), divided by the estimated standard error of
the difference (again omitting Yj from the analysis) is
rj(n − q − 1)^{1/2}/(n − q − rj^2)^{1/2}
and may be helpful for examining for possible outliers. For some
purposes it is more relevant to consider the influence of specific
observations on particular parameters of interest. See Atkinson
(1985) and Cook and Weisberg (1982, Ch. 2).
5. Suggest a procedure for detecting a single defective observation
in a single Latin square design. Test the procedure by simulation
on a 4 × 4 and an 8 × 8 square.
6. Develop the intra-block analysis of a balanced incomplete block
design via the method of fitting parameters in stages.
7. Consider a v × v Latin square design with a single baseline vari-
able z. It is required to fit the standard Latin square model
augmented by linear regression on z. Show by the method of fit-
ting parameters in stages that this is achieved by the following
construction. Write down the standard Latin square analysis of
variance table for the response Y and the analogous forms for
the sum of products of Y with z and the sum of squares of z.
Let RY Y , RzY , Rzz and TY Y , TzY , Tzz denote respectively the
residual and treatment lines for these three analysis of variance
tables. Then
(a) the estimated regression coefficient of Y on z is RzY /Rzz ;
(b) the residual mean square of Y in the full analysis is (RY Y −
RzY^2/Rzz)/(v^2 − 3v + 1);
(c) the treatment effects are estimated by applying an adjust-
ment proportional to RzY /Rzz to the simple treatment ef-
fects estimated from Y in an analysis ignoring z;
(d) the adjustment is uncorrelated with the simple unadjusted
effects so that the standard error of an adjusted estimate is
easily calculated;
(e) an F test for the nullity of all treatment effects is obtained
by comparing with the above residual mean square the sum
of squares with v − 1 degrees of freedom
TY Y + RY Y − (TzY + RzY)^2/(Tzz + Rzz).
How would treatment by z interaction be tested?
8. Recall the definition of a sum of squares with one degree of
freedom associated with a contrast l^T Y = Σlj Yj, with Σlj = 0,
as (l^T Y)^2/(l^T l). Show that in the context of the linear model
E(Y ) = Xθ, the contrast is a function only of the residual vector
if and only if lT X = 0. In this case show that under the normal
theory assumption l can be taken to be any function of the
fitted values Ŷ and the distribution of the contrast will remain
σ 2 times chi-squared with one degree of freedom.
For a randomized block design the contrast Σljs Yjs = Σljs (Yjs −
Ŷjs ) with ljs = (Ȳj. − Ȳ.. )(Ȳ.s − Ȳ.. ) was suggested by Tukey
(1949) as a means of checking deviations from additivity of the
form
E(Yjs ) = µ + τj + βs + γ(τj βs ).