Biometrika (2000), 87, 4, pp. 954–959
© 2000 Biometrika Trust
Printed in Great Britain
A new family of power transformations to improve normality
or symmetry
IN-KWON YEO
Department of Control and Instrumentation Engineering, Kangwon National University,
Chunchon, 200-701, Korea
[email protected]
RICHARD A. JOHNSON
Department of Statistics, University of Wisconsin-Madison, Wisconsin 53706, U.S.A.
[email protected]
Summary
We introduce a new power transformation family which is well defined on the whole real line
and which is appropriate for reducing skewness and approximating normality. It has properties
similar to those of the Box–Cox transformation for positive variables. The large-sample properties
of the transformation are investigated in the context of a single random sample.
Some key words: Kullback–Leibler information; Maximum likelihood inference; Power transformation;
Relative skewness.
1. Introduction
A major step towards an objective way of determining a transformation was made by Box &
Cox (1964). The Box–Cox transformation ψ_BC(λ, x) is given by

    ψ_BC(λ, x) = (x^λ − 1)/λ    (λ ≠ 0),
               = log(x)         (λ = 0),
for positive x. They considered selecting transformations for achieving approximate normality. The
Box–Cox transformation is, however, only valid for positive x. Although a shift parameter can be
introduced to handle situations where the response is negative but bounded below, the standard
asymptotic results of maximum likelihood theory may not apply since the range of the distribution
is determined by the unknown shift parameter; see Atkinson (1985, pp. 195–9). To circumvent
these problems, some statisticians consider the signed power transformation; see, for instance,
Bickel & Doksum (1981),

    ψ_SP(λ, x) = {sgn(x)|x|^λ − 1}/λ    (λ > 0),

which covers the whole real line. Since ψ_SP is, however, designed to handle kurtosis rather than
skewness, it has a serious drawback when it is applied to a skewed distribution. For instance,
suppose X has the mixture density

    f(x) = 0·3 φ(x) + 0·7 γ(x),    (1·1)

where φ(·) is the standard normal density and γ(·) is the gamma density

    γ(x) = (1/6)(x + 2)³ exp{−(x + 2)}    (x > −2).
We are interested in transforming the random variable X so that the transformed distribution is
approximately normal. Following Hernandez & Johnson (1980), we select ψ_SP to minimise the
Kullback–Leibler information number

    ∫ g_λ(u) log{g_λ(u)/φ_{μ,σ²}(u)} du,    (1·2)

where g_λ(·) is the probability density function of the transformed variable ψ_SP(λ, X) and
φ_{μ,σ²}(·) is the probability density function of a normal distribution with mean μ and variance σ².
The minimum is over a suitable range of λ, μ and σ², and then the best choice of λ produces a g_λ
that is closest, in the sense of Kullback–Leibler information, to a target normal density φ_{μ,σ²}.
Even though ψ_SP is increasing in x, its Jacobian changes from a decreasing function of x to an
increasing function as x changes sign. The cusp occurs at ψ_SP(λ, 0) = −1/λ, so the transformed
density is bimodal and looks far from normal, as shown in Fig. 1. Most extended transformations,
including the modulus transformation introduced by John & Draper (1980), lead to a bimodal
distribution in our example, where the support consists of the whole real line. A new family of
transformations is therefore needed.
Fig. 1. Plots of the mixture density f(·) (solid line), the transformed density g_λ(·) based on
ψ_SP (dotted line) and the target normal density φ_{μ,σ²}(·) (dashed line).
2. The new family of transformations
When searching for transformations that improve the symmetry of skewed data or distributions,
it is helpful to recall the concept of relative skewness introduced by van Zwet (1964, p. 3). To
motivate the definition, let X be a random variable having a continuous distribution function F
with inverse F⁻¹, and let I_F be the smallest interval for which pr(X ∈ I_F) = 1. Then the distribution
F is said to be symmetric about x₀ ∈ I_F if F(x₀ + x) + F(x₀ − x) = 1 for all real x. Let Y be another
random variable with a continuous distribution function G and inverse G⁻¹. Define ψ(x) =
G⁻¹{F(x)}. Then G is the distribution function of ψ(X), so the random variable ψ(X) has the same
distribution as the random variable Y. Van Zwet (1964) shows that G⁻¹F is convex, respectively
concave, if and only if G is the distribution function of a nondecreasing convex, respectively concave,
transformation ψ(X) of X. He then defines relative skewness as follows.

Definition. The distribution function G is more right-skewed, respectively more left-skewed, than
the distribution F if G⁻¹{F(·)} is a nondecreasing convex, respectively concave, function.

Since a nondecreasing convex, respectively concave, transformation of a random variable effects
a contraction of the lower, respectively upper, part of the support and an extension of the upper,
respectively lower, part, it decreases the skewness to the left, respectively right. The Box–Cox
transformation, for example, is concave in x for λ < 1 and convex in x for λ > 1. However, ψ_SP
changes from convex to concave as x changes sign, so it is not to be recommended when data that
can be positive or negative are skewed. John & Draper (1980) and Burbidge, Magee & Robb
(1988) studied specific cases of other convex-to-concave transformations.
To motivate our choice of power transformations, we first consider a modified modulus
transformation which has different transformation parameters on the positive and negative line. Let

    ψ(λ₊, λ₋, x) = {(x + 1)^{λ₊} − 1}/λ₊       (x ≥ 0, λ₊ ≠ 0),
                 = log(x + 1)                  (x ≥ 0, λ₊ = 0),
                 = −{(−x + 1)^{λ₋} − 1}/λ₋     (x < 0, λ₋ ≠ 0),
                 = −log(−x + 1)                (x < 0, λ₋ = 0).

Next, we impose the condition that the second derivative ∂²ψ(λ₊, λ₋, x)/∂x² be continuous at
x = 0. This forces the transformation to be smooth and implies that λ₊ + λ₋ = 2. Consequently,
we define the power transformation ψ(·, ·) : R × R → R, where

    ψ(λ, x) = {(x + 1)^λ − 1}/λ                 (x ≥ 0, λ ≠ 0),
            = log(x + 1)                        (x ≥ 0, λ = 0),
            = −{(−x + 1)^{2−λ} − 1}/(2 − λ)     (x < 0, λ ≠ 2),
            = −log(−x + 1)                      (x < 0, λ = 2).     (2·1)
Then, by Lemma 1 below, ψ(λ, x) is concave in x for λ < 1 and convex for λ > 1. Here, the constant 1
in parentheses makes the transformed value have the same sign as the original value, and allows
us to prove Lemma 1 by working separately with the positive and negative domain. It also reduces
ψ to the identity transformation for λ = 1.
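In code, definition (2·1) translates line for line. The following Python sketch is ours, not part of the original paper; the function name is an arbitrary choice:

```python
import math

def psi(lmbda, x):
    """The power transformation psi(lambda, x) of (2.1)."""
    if x >= 0:
        if lmbda != 0:
            return ((x + 1.0) ** lmbda - 1.0) / lmbda
        return math.log1p(x)            # lambda = 0 branch: log(x + 1)
    if lmbda != 2:
        return -(((-x + 1.0) ** (2.0 - lmbda)) - 1.0) / (2.0 - lmbda)
    return -math.log1p(-x)              # lambda = 2 branch: -log(-x + 1)
```

For λ = 1 the function returns x itself, and the output always carries the sign of the input, in line with Lemma 1 below.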
Figure 2 shows the differences between the Box–Cox transformations, (x^λ − 1)/λ, and the new
transformations. In fact, the new transformations on the positive line are equivalent to the
generalised Box–Cox transformations, {(x + 1)^λ − 1}/λ, for x > −1, where the shift constant 1 is
included. We also see from Fig. 2 that, if the sign of x is changed, so that a right-skewed distribution
becomes left-skewed, or the reverse, then the value of λ is replaced by 2 − λ. We first establish
properties of transformation (2·1).
Fig. 2. A comparison of (a) the Box–Cox and (b) the new transformations, with
λ = 0·0, 0·5, 1·0, 1·5 and 2·0.
Lemma 1. The transformation function ψ(·, ·) defined in (2·1) satisfies the following:
(i) ψ(λ, x) ≥ 0 for x ≥ 0, and ψ(λ, x) < 0 for x < 0;
(ii) ψ(λ, x) is convex in x for λ > 1 and concave in x for λ < 1;
(iii) ψ(λ, x) is a continuous function of (λ, x);
(iv) if ψ^(k) = ∂^k ψ(λ, x)/∂λ^k then, for k ≥ 1,

    ψ^(k) = [(x + 1)^λ {log(x + 1)}^k − kψ^(k−1)]/λ                 (λ ≠ 0, x ≥ 0),
          = {log(x + 1)}^{k+1}/(k + 1)                              (λ = 0, x ≥ 0),
          = −[(−x + 1)^{2−λ} {−log(−x + 1)}^k − kψ^(k−1)]/(2 − λ)   (λ ≠ 2, x < 0),
          = {−log(−x + 1)}^{k+1}/(k + 1)                            (λ = 2, x < 0),

is continuous in (λ, x);
(v) ψ(λ, x) is increasing in both λ and x;
(vi) ψ(λ, x) is convex in λ for x > 0 and concave in λ for x < 0.

Note that ψ^(0) ≡ ψ(λ, x).
The proofs are straightforward but tedious. Details are given in a University of Wisconsin
technical report by the authors.
The mixture density f(·) in (1·1) is skewed to the right so, according to van Zwet (1964), a good
transformation should be concave in order to lead to near symmetry. Following Hernandez &
Johnson (1980), we show in § 3 that it is reasonable to select the transformation ψ(λ, x) to minimise
the Kullback–Leibler information number (1·2). By numerical integration, we obtain λ = 0·555. As
expected, this transformation is concave; see (ii) of Lemma 1. Figure 3 shows that the normal
approximation is much improved, since the transform has pulled down the right tail and pushed
out the left tail.
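The numerical minimisation behind this value can be sketched as follows. For fixed λ, the minimising μ and σ² in (1·2) are the mean and variance of ψ(λ, X), and (1·2) then reduces, up to an additive constant, to ½ log σ²(λ) − E_f{log ψ′(λ, X)}, by the change-of-variables formula for differential entropy. The Python fragment below is our illustration, not the authors' code; the quadrature grid, its limits and the λ grid are our choices:

```python
import math

def mix_f(x):
    """Mixture density (1.1): 0.3 * N(0,1) + 0.7 * shifted gamma."""
    phi = math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)
    gam = (x + 2) ** 3 * math.exp(-(x + 2)) / 6 if x > -2 else 0.0
    return 0.3 * phi + 0.7 * gam

def psi(l, x):
    """The transformation (2.1)."""
    if x >= 0:
        return math.log1p(x) if l == 0 else ((x + 1) ** l - 1) / l
    return -math.log1p(-x) if l == 2 else -((1 - x) ** (2 - l) - 1) / (2 - l)

def kl_objective(l, lo=-8.0, hi=40.0, m=2000):
    """0.5*log var{psi(l,X)} - E log psi'(l,X): equals (1.2), up to a constant."""
    h = (hi - lo) / m
    xs = [lo + (i + 0.5) * h for i in range(m)]        # midpoint rule
    w = [mix_f(x) * h for x in xs]
    tot = sum(w)
    ys = [psi(l, x) for x in xs]
    mean = sum(wi * y for wi, y in zip(w, ys)) / tot
    var = sum(wi * (y - mean) ** 2 for wi, y in zip(w, ys)) / tot
    # log psi'(l, x) = (l - 1) * sgn(x) * log(|x| + 1) on both half-lines
    elogj = (l - 1) * sum(wi * math.copysign(1.0, x) * math.log1p(abs(x))
                          for wi, x in zip(w, xs)) / tot
    return 0.5 * math.log(var) - elogj

best = min((i / 100 for i in range(0, 151)), key=kl_objective)
```

For this right-skewed mixture the minimiser falls below 1, i.e. a concave member of the family, consistent with the discussion above.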
Fig. 3. Plots of the mixture density f(·) (solid line), the transformed density g_λ(·) based on
ψ (dotted line) and the target normal density φ_{μ,σ²}(·) (dashed line).
3. Transformation to normality
In this section we focus our attention on transforming a random sample from a parent distribution,
with probability density function f(·), to near normality. Let X_1, . . . , X_n be independent
and identically distributed random variables and denote the transformed variables by
ψ(λ, X_1), . . . , ψ(λ, X_n). We assume that, for some λ, the transformed observations can be treated as
normally distributed with some mean μ and variance σ². Under this assumption, the loglikelihood
function is

    l_n(θ | x) = −(n/2) log(2π) − (n/2) log(σ²) − {1/(2σ²)} Σ_{i=1}^n {ψ(λ, x_i) − μ}²
                 + (λ − 1) Σ_{i=1}^n sgn(x_i) log(|x_i| + 1),    (3·1)

where θ = (λ, μ, σ²)′ and x = (x_1, . . . , x_n)′.
Holding λ fixed, we initially maximise l_n(λ, ·, · | x), yielding

    μ̂(λ) = n^{−1} Σ_{i=1}^n ψ(λ, x_i),   σ̂²(λ) = n^{−1} Σ_{i=1}^n {ψ(λ, x_i) − μ̂(λ)}².    (3·2)

The maximum likelihood estimate λ̂ of λ is obtained by maximising the profile loglikelihood
function, and then θ̂ = (λ̂, μ̂(λ̂), σ̂²(λ̂))′ maximises the loglikelihood function (3·1).
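The two-stage maximisation can be sketched numerically. The Python fragment below is a grid-search illustration of ours, not the authors' code; it evaluates the profile loglikelihood, i.e. (3·1) with (3·2) substituted, and maximises it over a grid of λ values:

```python
import math

def psi(l, x):
    """The transformation (2.1)."""
    if x >= 0:
        return math.log1p(x) if l == 0 else ((x + 1) ** l - 1) / l
    return -math.log1p(-x) if l == 2 else -((1 - x) ** (2 - l) - 1) / (2 - l)

def profile_loglik(l, xs):
    """l_n(l, mu_hat(l), sigma2_hat(l) | x): (3.1) evaluated at the estimates (3.2)."""
    n = len(xs)
    ys = [psi(l, x) for x in xs]
    mu = sum(ys) / n
    s2 = sum((y - mu) ** 2 for y in ys) / n
    jac = (l - 1) * sum(math.copysign(1.0, x) * math.log1p(abs(x)) for x in xs)
    return -0.5 * n * (math.log(2 * math.pi) + math.log(s2) + 1.0) + jac

def lambda_hat(xs):
    """Grid-search maximiser of the profile loglikelihood over [-2, 4]."""
    grid = [i / 100 for i in range(-200, 401)]
    return max(grid, key=lambda l: profile_loglik(l, xs))
```

For data symmetric about zero the Jacobian term vanishes and the profile loglikelihood is symmetric about λ = 1, reflecting the λ ↔ 2 − λ relation noted in § 2.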
Under certain regularity conditions, the maximum likelihood estimator θ̂ is a strongly consistent
estimator of θ₀, which minimises the Kullback–Leibler information given by (1·2). Let

    ∇l_1(θ₀ | X) = {∂l_1(θ | X)/∂θ_i}|_{θ=θ₀}

be the gradient of the loglikelihood function of one observation for θ = (θ₁, θ₂, θ₃)′ = (λ, μ, σ²)′,
and let

    ∇²l_1(θ₀ | X) = {∂²l_1(θ | X)/∂θ_i∂θ_j}|_{θ=θ₀}

be the Hessian of the loglikelihood function. Then n^{1/2}(θ̂ − θ₀) is asymptotically normal with
mean 0 and covariance matrix Σ(θ₀) = V(θ₀)W(θ₀)V(θ₀)′, where

    V(θ₀) = E_f{∇²l_1(θ₀ | X)}^{−1},   W(θ₀) = E_f[∇l_1(θ₀ | X){∇l_1(θ₀ | X)}′].
The details of the regularity conditions are given in the authors’ technical report.
The assumption of homogeneity of the variance has been considered to be particularly important
in many applications. However, the variance is often represented by a simple function of the mean.
In practice, this relationship between mean and variance plays the major role in determining the
transformation. Bartlett (1947) claims that a variance stabilising transformation ‘often has the effect
of improving the closeness of the distribution to normality’; his justification is based on ‘correlation
of variability with mean level on the original scale often implying excessive skewness which tends
to be eliminated after the transformation’.
Let var(X) = σ²(μ) be a function of the mean. Without loss of generality, we assume that σ²(μ)
is an increasing function of μ. For the new transformation, we approximate the variance of the
transformed variable by var{ψ(λ, X)} ≈ σ²(μ)h(λ, μ), where

    h(λ, μ) = (μ + 1)^{2(λ−1)}     (μ ≥ 0),
            = (−μ + 1)^{2(1−λ)}    (μ < 0).

Since h(λ, μ) is decreasing in μ for λ < 1, we may choose a λ corresponding to variance stabilisation
so that the variance becomes nearly constant on the transformed scale.
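A minimal sketch of the factor h(λ, μ) (ours; the numerical examples are our choices):

```python
def h(lmbda, mu):
    """Variance factor: var{psi(lambda, X)} is approximately sigma2(mu) * h(lambda, mu)."""
    if mu >= 0:
        return (mu + 1.0) ** (2.0 * (lmbda - 1.0))
    return (-mu + 1.0) ** (2.0 * (1.0 - lmbda))
```

For λ = 1 the factor is identically 1. If, say, σ²(μ) grew like μ + 1 on the positive line, the choice λ = ½ would give h(λ, μ) = (μ + 1)^{−1} and hence an approximately constant variance σ²(μ)h(λ, μ) on the transformed scale.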
4. An example
Darwin (1876) studied the effect of cross- and self-fertilisation on the growth of plants. Fifteen
pairs of seedlings of the same age, one produced by cross-fertilisation and the other by self-
fertilisation, were grown together so that the members of each pair were reared under nearly
identical conditions. His aim was to demonstrate the greater vigour of the cross-fertilised plants.
The differences between the final heights of plants in each pair after a fixed period of time were
6·1, −8·4, 1·0, 2·0, 0·7, 2·9, 3·5, 5·1, 1·8, 3·6, 7·0, 3·0, 9·3, 7·5 and −6·0.
The paired t-statistic is 2·142 (p = 0·025), so the data support Darwin's claim at the 5% level of
significance. However, both the Q–Q plot and the sample skewness statistic, √b₁ = 4·713, cast
doubt on the normality assumption required for the paired t-test and indicate the data to be skewed
to the left.
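The quoted t-statistic can be checked directly from the differences; the following pure-Python sketch is ours:

```python
import math

# Darwin's 15 height differences, cross- minus self-fertilised
diffs = [6.1, -8.4, 1.0, 2.0, 0.7, 2.9, 3.5, 5.1, 1.8, 3.6,
         7.0, 3.0, 9.3, 7.5, -6.0]
n = len(diffs)
mean = sum(diffs) / n
s2 = sum((d - mean) ** 2 for d in diffs) / (n - 1)   # sample variance
t = mean / math.sqrt(s2 / n)                          # paired t-statistic, about 2.14
```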
When the new transformation is applied, the parameter estimates for θ and the corresponding
estimated covariance matrix are

    (λ̂, μ̂, σ̂²)′ = (1·305, 4·570, 29·786)′,

    Σ̂ = ( 0·434      3·270      26·282   )
        ( 3·270      54·451     198·227  )
        ( 26·282     198·227    3367·608 ).
The likelihood ratio χ² statistic for λ = 1 is 3·873 (p = 0·0499). The sample skewness statistic of the
transformed values is √b₁ = 0·093 and the Shapiro–Wilk statistic is W = 0·975 (p = 0·887), showing
that the normality of the transformed data is much improved.
Let t⁻¹(c, n − 1) be the cth quantile of the t distribution with (n − 1) degrees of freedom. Then
the cth quantile of the mean difference can be approximated by

    q̂_c = ψ⁻¹{λ̂, μ̂(λ̂) + t⁻¹(c, n − 1)σ̂(λ̂)/n^{1/2}};

see Carroll & Ruppert (1991). In this example, the estimated 0·01th quantile is q̂_{0·01} = 0·790. Since
this is not negative, we strongly conclude that the cross-fertilised plants grow with the greater
vigour, on average.
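The reported quantile can be reproduced from the estimates above. In the sketch below (ours), the t quantile t⁻¹(0·01, 14) ≈ −2·624 is hard-coded from standard tables, since the Python standard library has no t-distribution inverse:

```python
import math

def psi_inv(l, y):
    """Inverse of the transformation (2.1)."""
    if y >= 0:
        return math.expm1(y) if l == 0 else (l * y + 1.0) ** (1.0 / l) - 1.0
    return -math.expm1(-y) if l == 2 else 1.0 - (1.0 - (2.0 - l) * y) ** (1.0 / (2.0 - l))

# reported estimates for the Darwin differences
lam, mu, s2, n = 1.305, 4.570, 29.786, 15
t_quantile = -2.624        # t^{-1}(0.01, 14), from tables
q01 = psi_inv(lam, mu + t_quantile * math.sqrt(s2 / n))   # about 0.79
```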
References
Atkinson, A. C. (1985). Plots, Transformations and Regression. Oxford: Oxford University Press.
Bartlett, M. S. (1947). The use of transformations. Biometrics 3, 39–52.
Bickel, P. J. & Doksum, K. A. (1981). An analysis of transformations revisited. J. Am. Statist. Assoc. 76, 296–311.
Box, G. E. P. & Cox, D. R. (1964). An analysis of transformations (with Discussion). J. R. Statist. Soc. B 26, 211–52.
Burbidge, J. B., Magee, L. & Robb, A. L. (1988). Alternative transformations to handle extreme values of the dependent variable. J. Am. Statist. Assoc. 83, 123–7.
Carroll, R. J. & Ruppert, D. (1991). Prediction and tolerance intervals with transformation and/or weighting. Technometrics 33, 197–210.
Darwin, C. (1876). The Effect of Cross- and Self-fertilization in the Vegetable Kingdom, 2nd ed. London: John Murray.
Hernandez, F. & Johnson, R. A. (1980). The large-sample behavior of transformations to normality. J. Am. Statist. Assoc. 75, 855–61.
John, J. A. & Draper, N. R. (1980). An alternative family of transformations. Appl. Statist. 29, 190–7.
van Zwet, W. R. (1964). Convex Transformations of Random Variables. Amsterdam: Mathematisch Centrum.
[Received November 1998. Revised July 2000]