Applied multivariate data analysis:

PART II - Predictive Techniques

Course notes for STA3022F: Research and survey statistics

Dr. Ian Durbach


(R sections by Dr. Şebnem Er)

March 11, 2014


Chapter 6

Analysis of variance and covariance

One of the most common questions in research is to ask whether two or more groups differ in terms
of some attribute of interest. Do the wages of male and female employees in a company differ? Is the
life expectancy of rural South Africans lower than the life expectancy of urban South Africans? Are
some of a bank’s branches receiving more complaints than others? The first two of these questions can
be addressed using a t-test because there are only two groups in each case, but the third one cannot
use a t-test because there are more than two groups/branches. If we want to compare some attribute
between more than two groups, we often make use of a statistical method called analysis of variance
or ANOVA for short. You can think of ANOVA as an extension of the t-test to the case of more than
two groups (it also works when there are just two groups, by the way).

The two key features of an ANOVA are

1. The method tests whether the average or mean of all the groups is the same or not.

2. The method uses the statistical hypothesis test

H0 : The average value of attribute Y is the same in all the groups


H1 : The average value of attribute Y is different in at least one of the groups

The first point is important because it allows us to express a qualitative research question like “do
the wages of male and female employees in a company differ?” in a more precise form: “does the
mean wage of male employees differ from the mean wage of female employees?”. Note that there are
other ways that we might choose to express the qualitative question – for example, we might ask “does
the minimum wage of male employees differ from the minimum wage of female employees?”. But
we cannot use ANOVA to address this question. The ANOVA method deals only with the equality
of means between groups. It is important when interpreting results to bear this in mind – that the
ANOVA method only tells one whether two groups differ on average.

The second point is important because it allows us to answer the more precise research question “does
the mean wage of male employees differ from the mean wage of female employees?” using statistical
hypothesis testing. You will be familiar with some statistical hypothesis tests already, for example
the t-test, F -test and χ2 -test. These tests all (a) formulate a precise hypothesis, (b) compute a test
statistic which follows some statistical distribution, (c) use the test statistic to generate a conclusion
about the hypothesis. ANOVA works in the same way – in many ways it is just another hypothesis
test.

6.1 Partitioning the sums of squares


The general ANOVA method uses the concept of “explained variation” which should be familiar from
previous courses and previous chapters in these notes. That is, there is a certain amount of variability
or variation in one variable (the dependent variable) which we attempt to attribute to one or more


other variables (the independent variables). If a significant amount of variation can be attributed to a
particular independent variable, then we say that (a) that independent variable has a significant effect
(at some specified p-level), and (b) we reject the null hypothesis that the means in different groups
are the same and establish that at least one group’s mean differs significantly from the rest (also at
some specified p-level).

We’ll use the following case study to illustrate. The dataset comes from a study investigating whether
there is a genetic component to the success of foster parenting i.e. whether certain parts of a foster
mother or foster child’s genetic make-up make the pairing more likely to succeed. The study used
rats as participants, but comes from a time when it was popular for psychologists interested in human
behaviour to use rats in their investigations.

Data are collected from 36 foster mother-foster child pairs. For each pair, the following data were
recorded:
• Mother: the genotype (the genetic ‘type’) of the foster mother,

• Litter: the genotype of the foster child,

• AgeM: the age of the foster mother in weeks when adopting the child, and

• WeightC: the final weight of the foster child in grams.


A successful fostering relationship will result in a heavier child. There are three possible genotypes
for rats, type “A”, type “B” and type “J”. The data is given in Table 6.1 (in a slightly different format
from usual to save space).

Pair 1 2 3 4 5 6 7 8 9
Litter A A A A A B B B B
Mother A A A A A A A A A
AgeM 47 40 23 64 22 51 18 57 29
WeightC 61.5 68.2 64 65 59.7 60.3 51.7 49.3 48
Pair 10 11 12 13 14 15 16 17 18
Litter J J J J A A A B B
Mother A A A A B B B B B
AgeM 34 28 41 14 53 40 23 41 65
WeightC 59 57.4 54 47 55 42 60.2 50.8 64.7
Pair 19 20 21 22 23 24 25 26 27
Litter B B B J J J A A A
Mother B B B B B B J J J
AgeM 46 49 56 54 8 63 37 14 57
WeightC 61.7 64 62 59.5 52.8 56 42 54 61
Pair 28 29 30 31 32 33 34 35 36
Litter A A B B J J J J J
Mother J J J J J J J J J
AgeM 22 15 36 64 6 28 51 21 46
WeightC 48.2 39.6 51.3 40.5 44.8 51.5 53 42 54

Table 6.1: Data for study testing for the biological basis of mother-to-child bonding in rats

The data is saved as a txt file. Assuming that this file is in your working directory, you can read
the data into R with read.table():

> FosterRats=read.table("[Link]", header=T)


> attach(FosterRats)
> names(FosterRats)

[1] "Litter" "Mother" "AgeM" "WeightC"

Note that there is one dependent variable (WeightC) which is numeric, and three independent vari-
ables (Mother, Litter, AgeM). Two of the independent variables are categorical (Mother and Litter)
and are called grouping variables, and the other is numeric (AgeM) and is called a covariate.

> tapply(WeightC,list(Mother), mean)

A B J
57.31538 57.15455 48.49167

> tapply(WeightC,list(Litter), mean)

A B J
55.41538 54.93636 52.58333

> tapply(WeightC,list(Mother, Litter), mean)

A B J
A 63.68 52.325 54.35
B 52.40 60.640 56.10
J 48.96 45.900 49.06

Now, the weights of the child rats clearly vary – from a minimum of 39.6g to a maximum of 68.2g.
That is, we say that the weight variable shows variability. What is responsible for this variation?
What makes some child rats heavy and some light? The way in which the ANOVA method answers
this question is the following:

1. Let Yi be the weight of observation i, i = (1, 2, . . . , N ), where N is the total number of observa-
tions. Here, N = 36.

2. Calculate the overall average


\[ \bar{Y}_{..} = \frac{1}{N}\sum_{i=1}^{N} Y_i \]

This is also called the grand mean. In our example, the grand mean weight of the child rats is

Ȳ.. = (61.5 + 68.2 + · · · + 54)/36 = 54.325g

3. Quantify the total amount of variability in the dependent variable. This is called the total sum of
squares (SST ) and is calculated by subtracting the grand mean from each observation, squaring
it, and adding the result i.e.
\[ SST = \sum_{i=1}^{N} (Y_i - \bar{Y}_{..})^2 \qquad (6.1) \]

In our example,

SST = (61.5 − 54.325)² + (68.2 − 54.325)² + · · · + (54 − 54.325)² = 2123



This figure represents the total amount of variability in the dependent variable. Note that the
formula for the SST is very closely related to the formula for the variance you would have learned
in your first-year course (since both measure variability)
\[ \text{Sample variance} = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})^2 = \frac{SST}{N-1} \]

The quantity N − 1, by which the SST must be divided in order to turn it into a true variance
measure, is known as the “degrees of freedom” of the SST. (A short R sketch computing the
grand mean and the SST is given at the end of this section.)

4. Determine the amount of this total variability that is “accounted for” or explained by various
independent variables. This step leads directly to the different types of ANOVA that we will
consider in this chapter

• If there is only one categorical independent variable available to account for the SST, we
call the analysis a one-factor ANOVA. For example, if we use only the genotype of the
mother rat as an independent variable, we perform a one-factor ANOVA.
• If there are two categorical independent variables available to account for the SST, we call
the analysis a two-factor ANOVA. If we use the genotype of both the mother and child rats
as independent variables, we perform a two-factor ANOVA.
• If there are one or more categorical independent variables and one or more numeric in-
dependent variables available to account for the SST, we call the analysis an analysis of
covariance. If we use the genotype of both the mother and child rats as independent vari-
ables as well as the age of the mother, we perform an ANOCOVA.

Of course, we usually cannot expect to account for all the variability in a dependent variable with our
independent variables. The portion of the total variability that is left over after the effects of all our
independent variables have been accounted for is called the unexplained variation or residual variation
(labelled SSE).
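As a quick check on steps 2 and 3 above, the grand mean and the total sum of squares can be computed
directly in R (a minimal sketch, assuming the FosterRats data frame loaded earlier is still attached):

> grandMean = mean(WeightC)                 # grand mean, approximately 54.3g
> SST = sum((WeightC - grandMean)^2)        # total sum of squares, approximately 2123
> var(WeightC) * (length(WeightC) - 1)      # equivalently, the sample variance times (N - 1)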

6.2 One-factor analysis of variance


In this section we test whether the mean values of the dependent variable differ between groups formed
using just one categorical independent variable. In the ANOVA framework, this is the same as saying
that we test whether a significant amount of the total sums of squares can be attributed to the single
independent variable.

We call the independent variable that we use the treatment variable, and we refer to the specific
groups or levels of the treatment variable as treatments. The treatment levels are usually indexed
by j = 1, 2, . . . , t, so there are in general t different levels of the treatment variable. In our current
example, we’ll use the mother rat’s genotype as our treatment variable, and test whether the child’s
weight differs across groups determined by the mother rat’s genotype. We have t = 3 treatment levels
(type A, B, and J) so that the indexes are j = 1 for type A, j = 2 for type B, and j = 3 for type J.

6.2.1 Formulation of hypotheses


The null and alternative hypotheses are given by:

H0 : The average weight of rats is the same across all foster-mother genotype groups.
H1 : The average weight of rats differs in at least one of the foster-mother genotype groups.

We can also write this mathematically as

H0 : µ1 = µ2 = µ3
H1 : At least one µj differs

What we are saying in both sets of null hypotheses is that the average weight of rats with type A
foster-mothers is equal to the average weight of rats with type B foster-mothers, and that both are
equal to the average weight of rats with type J foster-mothers.

6.2.2 Evaluation of hypotheses


Previously we used the notation Yi to refer to observation i. The index i was simply used to refer to
observation 1 through N and we did not attempt to distinguish between different treatment groups.
Now that we have added a treatment variable (mother’s genotype), we need to extend the notation
that we use slightly. We refer to a treatment level using the index j (here, j = {1, 2, 3}), and we refer
to the weight of the ith respondent in treatment group j by Yij . Here are some examples (refer to the
dataset in Table 6.1):
• Treatment group 2 is mothers with genotype B. The 1st observation in treatment group 2 is
denoted Y12 . From Table 6.1, Y12 = 55 (pair number 14).

• Treatment group 1 is mothers with genotype A. The 3rd observation in treatment group 1 is
denoted Y31 . From Table 6.1, Y31 = 64 (pair number 3).

• Treatment group 3 is mothers with genotype J. The 7th observation in treatment group 3 is
denoted Y73 . From Table 6.1, Y73 = 40.5 (pair number 31).
We refer to the number of observations in treatment group j by nj . That is, n1 = 13 is the number of
rats with mother genotype A, n2 = 11 is the number of rats with mother genotype B, and n3 = 12 is
the number of rats with mother genotype J. Obviously, the sum of the treatment group sizes is equal
to the total sample size, N i.e. n1 + n2 + n3 = N . The average value of the dependent variable in
treatment group j is denoted by Ȳj , and is calculated by simply taking the average over all observations
in the treatment group. In our example, Ȳ1 = 57.31 is the average weight of rats with mother genotype
A, Ȳ2 = 57.15 is the average weight of rats with mother genotype B, and Ȳ3 = 48.49 is the average
weight of rats with mother genotype J. It therefore appears that mothers with genotype J are worse
foster mothers than mothers having the other two genotypes. It is exactly these differences that we
want to test for statistical significance using ANOVA.

We are now in a position to test for equal means in the treatment groups by partitioning the total sums
of squares calculated in the previous section (SST = 2123) into that part which can be attributed
to the treatment effect, called the treatment sum of squares (SST R) and the part that cannot be
explained by the treatment effect (the sum of squares for error, SSE).
\[ SSTR = \sum_{j=1}^{t} n_j (\bar{Y}_j - \bar{Y}_{..})^2 \qquad (6.2) \]
\[ SSE = \sum_{j=1}^{t}\sum_{i=1}^{n_j} (Y_{ij} - \bar{Y}_j)^2 = SST - SSTR \qquad (6.3) \]

In our example,

SSTR = 13(57.31 − 54.325)² + 11(57.15 − 54.325)² + 12(48.49 − 54.325)² = 612.65

SSE = (61.5 − 57.31)² + (68.2 − 57.31)² + · · · + (47 − 57.31)²
      + (55 − 57.15)² + (42 − 57.15)² + · · · + (56 − 57.15)²
      + (42 − 48.49)² + (54 − 48.49)² + · · · + (54 − 48.49)² = 1510.35

Note that we are able to explain SSTR/SST = 612.65/2123 = 29% of the variation in the rats’
weights using just the single variable of the foster mother’s genotype. Not bad! This quantity is also
known as the R². The remaining 71% (= SSE/SST) is left unexplained.

In order to test whether the treatment effect is significant, we first need to adjust both the treatment
and error sums of squares by their degrees of freedom. This gives us the mean square between treatments
(MSTR) and the mean square for error (MSE).

\[ MSTR = SSTR/(t-1) \qquad (6.4) \]
\[ MSE = SSE/(N-t) \qquad (6.5) \]

In our example,

MSTR = 612.65/2 = 306.33
MSE = 1510.35/33 = 45.768

In essence, all that the mean square calculations do is to turn the sums of squares into true variance
estimates by dividing them by their degrees of freedom. The importance of this is that we can then
calculate an F -statistic (that is, a test statistic following the F -distribution) by taking the ratio of
those two variance estimates
\[ F = \frac{MSTR}{MSE} = \frac{SSTR/(t-1)}{SSE/(N-t)} \qquad (6.6) \]
This test F -statistic has t − 1 and N − t degrees of freedom. In our example,

F = 306.33/45.76 = 6.69

This value can be compared to different critical values obtained from statistical tables in the usual
way. For example, the critical F -value with 2 and 33 degrees of freedom at a 5% level of significance
is 3.285 (check this). Since our test statistic exceeds this value, we’d reject the null hypothesis and
conclude that at least one genotype group mean differs from the rest. This appears to suggest that
some genetic types of rats do make better foster-mothers than others.
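The critical value and the exact p-value can be obtained in R with the qf() and pf() functions (a
minimal sketch):

> qf(0.95, df1=2, df2=33)          # critical F value at the 5% level, approximately 3.285
> 1 - pf(6.69, df1=2, df2=33)      # p-value of the observed F statistic, approximately 0.004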
To apply one-way analysis of variance to the data we can use the aov function in R and then
the summary method to give us the usual analysis of variance table. The model formula specifies a
one-way ANOVA, where the single factor is Mother:

> anovaMother=aov(WeightC~Mother,data=FosterRats)
> #aov() function does the analysis of variance
> summary(anovaMother)

Df Sum Sq Mean Sq F value Pr(>F)


Mother 2 612.7 306.33 6.694 0.00363 **
Residuals 33 1510.1 45.76
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

6.2.3 Post-hoc tests


In the previous section, we concluded that some genetic types of rats make statistically better foster-
mothers than others. Although this is an important finding, it should strike you as only part of the
answer we are looking for. A natural question to ask at this stage is: which genotypes are better
than others? Generally, we would be interested in finding out which treatment groups are significantly
different from one another. We use post hoc tests in order to answer this question.

There are many different types of post-hoc tests. The one we will use is Tukey’s honest significant
difference (HSD) method, but other post-hoc tests include Fisher’s least significant difference (LSD)

method and Scheffé’s method of multiple comparisons.

These post-hoc methods identify whether two group means differ from one another by comparing their
difference to a critical range; the version illustrated below is Scheffé’s. Suppose that we are comparing
two treatment groups, which we label as group 1 and group 2.
1. Compute a critical range according to the formula
\[ \text{Critical range} = \sqrt{(t-1)\,F^{\alpha}_{(t-1;\,N-t)}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)MSE} \qquad (6.7) \]

2. Compute the absolute value of the difference between the means of the two groups i.e. |Ȳ1 − Ȳ2 |.
This is called a contrast. If this contrast exceeds the critical range, then the corresponding group
means differ significantly from one another at the α × 100% significance level.
Note that the critical range depends on the number of observations in each group (n1 and n2 ) and so
we have to calculate a new critical range every time we want to compare two group means.
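The critical range in equation (6.7) is easy to compute in R. The small helper function below is a
sketch (the name critical.range is ours, not a built-in R function), shown here with the MSE value of
47.198 used in the hand calculations that follow:

> critical.range = function(alpha, t, N, n1, n2, MSE) {
+   sqrt((t - 1) * qf(1 - alpha, t - 1, N - t) * (1/n1 + 1/n2) * MSE)
+ }
> critical.range(0.05, t=3, N=36, n1=13, n2=11, MSE=47.198)   # approximately 7.2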

Returning to the foster-child example, we can apply Scheffé’s method to draw the following conclusions:

Comparison 1: genotype A vs. genotype B


• Critical range = √(2 × 3.29 × (1/13 + 1/11) × 47.198) = 7.22
• Contrast = |57.31 − 57.15| = 0.16
• Since the contrast between type A and B does not exceed the critical range, the group means
do not differ significantly (at the 5% level)
Comparison 2: genotype A vs. genotype J
• Critical range = √(2 × 3.29 × (1/13 + 1/12) × 47.198) = 7.05
• Contrast = |57.31 − 48.49| = 8.82
• Since the contrast between type A and J exceeds the critical range, the group means differ
significantly (at the 5% level)
Comparison 3: genotype B vs. genotype J
• Critical range = √(2 × 3.29 × (1/12 + 1/11) × 47.198) = 7.36
• Contrast = |57.15 − 48.49| = 8.66
• Since the contrast between type B and J exceeds the critical range, the group means differ
significantly (at the 5% level)
Simply put, we can conclude that rats with foster-mothers possessing genotype A or B grow to be
more healthy than rats with foster-mothers possessing genotype J.
To perform post-hoc comparisons in R, we can use the TukeyHSD() function (this applies Tukey’s HSD rather than Scheffé’s method, but it leads to the same conclusions here).
> TukeyHSD(anovaMother)

Tukey multiple comparisons of means


95% family-wise confidence level

Fit: aov(formula = WeightC ~ Mother, data = FosterRats)

$Mother
diff lwr upr p adj
B-A -0.1608392 -6.961114 6.639436 0.9981448
J-A -8.8237179 -15.468742 -2.178694 0.0071418
J-B -8.6628788 -15.591803 -1.733954 0.0116244
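The object returned by TukeyHSD() also has a plot method, which displays the confidence interval
for each pairwise difference and can be easier to read than the table:

> plot(TukeyHSD(anovaMother))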

6.3 Two-factor analysis of variance


Two-factor analysis of variance differs from one-factor analysis of variance principally in that
we now add a second categorical independent variable and so partition the total variability in
the dependent variable (the SST) into three components rather than the two used previously. The second
independent variable is often called a blocking variable, and we will use this label here. As with the
treatment variable, the blocking variable also has different levels, which we call blocks or blocking levels.
The blocking levels are indexed by k = 1, 2, . . . , b, so there are in general b different levels of the blocking
variable. The SST is then divided up into the portion accounted for by the treatment variable (the
treatment sum of squares, SSTR), the portion accounted for by the blocking variable (the blocking
sum of squares, SSB), and the residual sum of squares, SSE.

Continuing with our example, we’ll add the baby rat’s genotype as a blocking variable, and test
whether the child’s weight differs by its genotype. We have b = 3 blocking levels (again, type A, B,
and J, but this time for the child rat) so that the indexes are k = 1 for type A, k = 2 for type B,
and k = 3 for type J. As many of the modelling steps remain the same as in one-factor ANOVA, this
section is less detailed than the previous section.

6.3.1 Formulation of hypotheses


Since we have two independent variables, there are two sets of hypotheses. The hypothesis related to
the treatment variable is also sometimes referred to as the primary hypothesis and is stated as before:

H0 : The average weight of rats is the same across all foster-mother genotype groups.
H1 : The average weight of rats differs in at least one of the foster-mother genotype groups.

or mathematically as

H0 : µ1 = µ2 = µ3
H1 : At least one µj differs

The hypothesis related to the blocking variable is also sometimes referred to as the secondary hypothesis
and is stated in similar terms:

H0 : The average weight of rats is the same across all foster-child genotype groups.
H1 : The average weight of rats differs in at least one of the foster-child genotype groups.

or mathematically as

H0 : µ1 = µ2 = µ3
H1 : At least one µk differs

Each of these hypotheses is evaluated separately.

6.3.2 Evaluation of hypotheses


In the two-factor case, we have added a second independent variable (child’s genotype) and need to
extend the notation further. We refer to a blocking level using the index k (here, k = {1, 2, 3}), and
as before we refer to a treatment level using the index j (here, j = {1, 2, 3}). We refer to the number
of observations in treatment group j and blocking group k by njk . For example, the number of baby
rats with mother genotype B (treatment group 2) and own genotype A (blocking group 1) is n21 = 3.

The calculation of the sums of squares is complicated if the number of observations in each treatment-
block combination is not the same i.e. if njk is not the same for all j and k. Data where the sample
sizes in each treatment-block combination are equal is known as balanced. If any of the sample sizes
differ, the data is unbalanced. If the data is balanced, then the calculation of the sums of squares is

a relatively straightforward extension of the one-factor case (you may have these formulae from an
earlier statistics course). However, in nearly all observational studies such as those found in business
and social science, it would be unlikely that the data collected would be perfectly balanced and hence
these formulae cannot be used without modification. The extension of the one-factor ANOVA sums of
squares formulae to the case of unbalanced two-factor experiments is beyond the scope of this course,
and so we will not spend any more time calculating sums of squares. Instead, we will refer directly to
the ANOVA output of some statistical package such as R. Output for our present example is shown
below.
To apply two-way analysis of variance without interactions to the data we can use the aov
function in R and then the drop1() function to give us the analysis of variance table (with unbalanced
data, drop1() tests each term after adjusting for the other terms in the model). The model formula
specifies a two-way ANOVA, where the first factor is Mother and the second factor is Litter:

> anovaMotherLitter=aov(WeightC~Mother+Litter,data=FosterRats)
> drop1(anovaMotherLitter, ~., test="F")

Single term deletions

Model:
WeightC ~ Mother + Litter
Df Sum of Sq RSS AIC F value Pr(>F)
<none> 1461.3 143.33
Mother 2 605.48 2066.8 151.81 6.4221 0.004639 **
Litter 2 48.79 1510.1 140.51 0.5175 0.601059
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The R output shows, for each effect, the degrees of freedom (Df), the sum of squares attributable to
it, the residual sum of squares (RSS) and AIC of the model with that effect removed, the test
F-statistic, and the p-value (the <none> row simply describes the full fitted model). The interpretation of the test statistics
remains the same, except that we are now evaluating more than one research hypothesis. Our primary
hypothesis is that the average weight of rats is the same across all foster-mother genotype groups
(that is, that the ‘Mother’ variable has no significant effect on weight). The F -statistic of 6.422 and
associated p-value of 0.005 suggests that we should reject this hypothesis, even at the 1% level, and
conclude that foster-mother genotype does play a significant role in determining the weight of the
foster-child. This is the same conclusion as drawn in the one-factor case.

Our secondary hypothesis is that the average weight of rats is the same across all foster-child genotype
groups (that is, that the ‘Litter’ variable has no significant effect on weight). The F -statistic of 0.518
and associated p-value of 0.601 suggests that we cannot reject this hypothesis, even at the 20% level,
and that we should therefore conclude that foster-child genotype does not play a significant role in
determining the weight of the foster-child. Another way of saying this is that it does not appear that
the genetic make-up of a child makes it any more responsive to foster-care.

At this stage it is useful to ask what proportion of the variation in weight the full two-factor model is
explaining, bearing in mind that the one-factor model had an R² of 29%. Using the residual sum of
squares in the output above, R² = 1 − 1461.3/2123 ≈ 31%, so the addition of the non-significant second
variable has only very marginally raised the R².

6.3.3 Interaction effects


In the previous section we investigated whether there were mother-types and child-types that were
more likely to succeed in a fostering relationship. We found that rats possessing genotype A or B

were better foster-mothers than those possessing genotype J, but that the genotype of the foster-child
did not play a role in the relationship. In addition to this valuable information, one of the things
we might wonder about is whether certain combinations of mother-types and child-types are more
successful than others. For example, maybe if the mother’s genotype matches the child’s genotype,
the relationship is more likely to succeed because the mother and child are more “alike”. Whenever
we look at the effects of the combination of two or more effects, we are looking at what are called
interaction effects. Where there are only two effects (as in the current example), there is only one
possible interaction. Where there are three effects, say A, B, and C, there are three possible two-way
interactions (interactions between two variables): between A and B, between A and C, and between
B and C. There is also one three-way interaction: between A, B, and C. It becomes difficult to
interpret interactions between three or more variables, and so usually only the two-way interactions
are interpreted. Sometimes, however, it does become necessary to consider higher-order interactions.

Including each interaction adds a further hypothesis to those already specified. The so-called interac-
tion hypothesis states that:

H0 : There is no significant interaction between foster-child and foster-mother genotype groups.


H1 : There is a significant interaction between foster-child and foster-mother genotype groups.

As before, we test this hypothesis by calculating the sums of squares that can be ‘attributed’ to the
interaction effect and by comparing this to the total sums of squares. If a big enough portion of the
SST can be attributed to the interaction effect, then it is significant. Again, we will not actually
calculate the sums of squares due to the interaction effect, but will just consider it as part of the
ANOVA output. The addition of an interaction effect to our current example results in the ANOVA
output given below.
To apply two-way analysis of variance with the interactions to the data we can use the aov() func-
tion in R and then the drop1() function to give us the usual analysis of variance table. The model
formula specifies a two-way ANOVA, where the first factor is Mother and the second factor is Litter,
and an interaction term is included:

> options(contrasts=c("contr.sum", "contr.poly"))


> anovaInteract=aov(WeightC~Mother+Litter+Mother*Litter,data=FosterRats)
> drop1(anovaInteract,~., test="F" )

Single term deletions

Model:
WeightC ~ Mother + Litter + Mother * Litter
Df Sum of Sq RSS AIC F value Pr(>F)
<none> 1024.8 138.55
Mother 2 521.75 1546.5 149.37 6.8734 0.003865 **
Litter 2 29.39 1054.2 135.57 0.3872 0.682693
Mother:Litter 4 436.58 1461.3 143.33 2.8757 0.041793 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The interpretation of the primary and secondary hypotheses is the same as before, and so we will
not repeat them here (note, though, that the significance levels are slightly different because the part
of the variation that cannot be explained – the SSE – is lower after the addition of the new interaction
effect). The interaction null hypothesis can be rejected at the 5% level but not at the 1% level. That
is, there is significant (but not very significant) evidence that some combinations of mother-genotypes
and child-genotypes do better than others.

As before, an obvious question to ask is exactly which combinations are those that tend to do better?

To answer this question, we need to return to post-hoc testing. Here, we can use Scheffé’s method
as before to identify whether two group means differ from one another by setting up a critical range
and comparing the observed difference to the critical range. The only difference is that the ‘group’ we
now refer to is a combination of a mother-group and a child-group (so that there are 9 such groups).
Suppose that we want to compare two interaction groups: Group 1, which has mothers of type A and
children of type A (we refer to this as group “A-A”); and Group 2, which has mothers of type A and
children of type B (group “A-B”). Noting that the 5% critical value F(8,28) = 2.29, we have:
• Critical range = √(8 × 2.29 × (1/5 + 1/4) × 37.95) = 17.68

• Contrast = |63.68 − 52.40| = 11.28

• Since the contrast between combination A-A and A-B does not exceed the critical range, these
group means do not differ significantly (at the 5% level)
The results of performing Tukey HSD tests between all possible mother-child combinations are sum-
marised in the R output below. The table contains the estimated difference in means, a confidence
interval, and an adjusted p-value for every pairwise comparison. Hence, a low p-value indicates that
those two groups (which are, remember, mother-child combinations) have significantly different group
means. The actual group means were given earlier by tapply(WeightC, list(Mother, Litter), mean).

> TukeyHSD(anovaInteract)

Tukey multiple comparisons of means


95% family-wise confidence level

Fit: aov(formula = WeightC ~ Mother + Litter + Mother * Litter, data = FosterRats)

$Mother
diff lwr upr p adj
B-A -0.1608392 -6.418569 6.096891 0.9977637
J-A -8.8237179 -14.938584 -2.708852 0.0037020
J-B -8.6628788 -15.038994 -2.286763 0.0062611

$Litter
diff lwr upr p adj
B-A -2.2324543 -8.490184 4.025276 0.6544881
J-A -2.5461467 -8.661012 3.568719 0.5633171
J-B -0.3136924 -6.689808 6.062423 0.9918327

$Mother:Litter
diff lwr upr p adj
B:A-A:A -11.280 -26.418222 3.8582218 0.2723080
J:A-A:A -14.720 -27.830085 -1.6099154 0.0191903
A:B-A:A -11.355 -25.260345 2.5503446 0.1783016
B:B-A:A -3.040 -16.150085 10.0700846 0.9965549
J:B-A:A -17.780 -35.123012 -0.4369882 0.0413114
A:J-A:A -9.330 -23.235345 4.5753446 0.4000529
B:J-A:A -7.580 -22.718222 7.5582218 0.7501163
J:J-A:A -14.620 -27.730085 -1.5099154 0.0204036
J:A-B:A -3.440 -18.578222 11.6982218 0.9970003
A:B-B:A -0.075 -15.906931 15.7569313 1.0000000
B:B-B:A 8.240 -6.898222 23.3782218 0.6623527
J:B-B:A -6.500 -25.422777 12.4227772 0.9591247
A:J-B:A 1.950 -13.881931 17.7819313 0.9999662
B:J-B:A 3.700 -13.225046 20.6250464 0.9977038

J:J-B:A -3.340 -18.478222 11.7982218 0.9975524


A:B-J:A 3.365 -10.540345 17.2703446 0.9953999
B:B-J:A 11.680 -1.430085 24.7900846 0.1095530
J:B-J:A -3.060 -20.403012 14.2830118 0.9995017
A:J-J:A 5.390 -8.515345 19.2953446 0.9214391
B:J-J:A 7.140 -7.998222 22.2782218 0.8033869
J:J-J:A 0.100 -13.010085 13.2100846 1.0000000
B:B-A:B 8.315 -5.590345 22.2203446 0.5486250
J:B-A:B -6.425 -24.376723 11.5267227 0.9486111
A:J-A:B 2.025 -12.632520 16.6825202 0.9999192
B:J-A:B 3.775 -12.056931 19.6069313 0.9958363
J:J-A:B -3.265 -17.170345 10.6403446 0.9962486
J:B-B:B -14.740 -32.083012 2.6030118 0.1441032
A:J-B:B -6.290 -20.195345 7.6153446 0.8356294
B:J-B:B -4.540 -19.678222 10.5982218 0.9816545
J:J-B:B -11.580 -24.690085 1.5300846 0.1153934
A:J-J:B 8.450 -9.501723 26.4017227 0.8050310
B:J-J:B 10.200 -8.722777 29.1227772 0.6733463
J:J-J:B 3.160 -14.183012 20.5030118 0.9993704
B:J-A:J 1.750 -14.081931 17.5819313 0.9999853
J:J-A:J -5.290 -19.195345 8.6153446 0.9287146
J:J-B:J -7.040 -22.178222 8.0982218 0.8147247

None of the differences are significant at the 1% level, but several are significant at the 5% level: the
most significant differences are between mother-child combination A-A and J-A (p = 0.019),
between combination A-A and J-J (p = 0.0204), and between combination A-A and J-B (p = 0.0413).
Of more interest is that there is some evidence to suggest that having the mother and child possess
the same genotype is of some benefit. Mothers of type A do best with children of type A (compare
63.68 for A-A, 52.33 for A-B, and 54.35 for A-J). Mothers of type B do best with children of type B
(compare 52.40 for B-A, 60.64 for B-B, and 56.10 for B-J). And mothers of type J do best (but only
just!) with children of type J (compare 48.96 for J-A, 45.90 for J-B, and 49.06 for J-J). This is shown
more clearly by plotting the weights for each mother-child combination, for example with the side-by-side boxplots produced below.

> boxplot(WeightC~Mother+Litter+Mother*Litter)
[Figure: side-by-side boxplots of WeightC for each mother-child genotype combination
(A.A, B.A, J.A, A.B, B.B, J.B, A.J, B.J, J.J), with weights ranging from roughly 40g to 65g.]
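A more conventional plot of the group means can be produced with the base R function
interaction.plot(), which plots the mean weight for each mother-child combination and makes the
pattern described above easier to see (a minimal sketch):

> interaction.plot(x.factor=Litter, trace.factor=Mother, response=WeightC,
+                  fun=mean, xlab="Litter genotype", trace.label="Mother")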

Finally, we can ask what improvement in the model R² has been achieved by the addition of the
interaction effect. Using the residual sum of squares from the drop1() output above (RSS = 1024.8 for
the full model), R² = 1 − 1024.8/2123 ≈ 52%, a substantial improvement over the 31% of the two-factor
model without the interaction.

6.4 Analysis of covariance


In this final section, we look at what happens when we want to add a continuous numerical independent
variable to an analysis of variance. An analysis of variance, remember, examines the effect of categorical
independent variables on a continuous dependent variable. Sometimes, however, we may wish to
include numeric variables side-by-side the categorical ones. For example, in our current example we
might have reason to believe that the age of the mother rat may play a role in her ability to give foster
care. Older rats, for example, may have greater experience in mothering if they have had several
litters of their own and this may translate into better foster care-giving. The numerical independent
variables in this case are often called covariates, and the resulting analysis becomes an analysis of
covariance.

Of course, if we only have numerical independent variables and a single numeric dependent variable,
then the analysis could use a multiple regression framework (to be covered in the next chapter).
It is therefore common to think of analysis of covariance as a mixture of analysis of variance and
regression. The addition of each new numeric covariate results in a new research hypothesis stating
the null hypothesis that the covariate has no effect against the alternate hypothesis that it does exert
some effect. Since the covariates are numeric, these hypotheses are stated in the language of multiple
regression, where ‘no effect’ corresponds to a β coefficient of zero. If we wish to add the covariate of
age to our current example, we have the following hypothesis:

H0 : βage = 0
H1 : βage ≠ 0

The corresponding R output is given below.

> ancova1 = aov(WeightC~AgeM+Litter+Mother+Litter*Mother,data=FosterRats)


> drop1(ancova1,~.,test="F")

Single term deletions

Model:
WeightC ~ AgeM + Litter + Mother + Litter * Mother
Df Sum of Sq RSS AIC F value Pr(>F)
<none> 896.63 135.74
AgeM 1 128.13 1024.76 138.55 3.7155 0.064902 .
Litter 2 58.83 955.46 134.03 0.8530 0.437733
Mother 2 484.38 1381.01 147.29 7.0230 0.003642 **
Litter:Mother 4 370.53 1267.16 140.20 2.6861 0.053521 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

As you can see, the significance of the previously discussed effects stays relatively constant. The effect
of mother-genotype is still significant at the 0.5% level, and the effect of child-genotype remains non-
significant even at the 20% level. The interaction effect is no longer significant at the 5% level,
but it is very close. The new covariate, the mother’s age at the time of adoption, is also close to
significant at the 5% level, so there seems to be a weakly significant effect of the age of the
foster-mother on the weight of the foster-child. Once again, this is valuable but incomplete
knowledge: we would like to know the direction of the influence. Does foster-child weight
increase with the mother’s age, or decrease? For this, we need to examine the estimated coefficient β̂age,
which is not shown in the drop1() output. It turns out that β̂age = 0.125, with a standard error of 0.065.
This gives a test t-statistic of 0.125/0.065 = 1.923 (or equivalently, a test F-statistic of 1.923² = 3.715).
The p-value corresponding to both these statistics is, as shown in the relevant R output, 0.065. It
therefore seems that age has a slight positive effect on the weight of the foster child, and that it is the

older rats that make for better foster-mothers.
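The drop1() table does not show the coefficient estimates themselves, but they can be extracted from
the fitted aov object via its lm-style summary (a minimal sketch; the AgeM row should contain the
estimate and standard error quoted above):

> summary.lm(ancova1)$coefficients["AgeM", ]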

With the full ANOCOVA model complete, we can examine the proportion of the variation in the
weight variable that this final model can explain. Using the residual sum of squares from the output
above (RSS = 896.63), R² = 1 − 896.63/2123 ≈ 58%, so the addition of age has improved the R² from
the 52% of the two-factor ANOVA model with interaction. Thus, with just the genotype of the mother
and child, and the age of the mother, we are able to explain around 60% of the variability in the weight
of a child rat placed under foster care.

6.5 Assumptions underlying the use of ANOVA


Although we will not consider this topic in any detail, it is important to note that the use of ANOVA
is based on the following two assumptions:

1. That the values in each combination of treatment and blocking level (called a ‘cell’ of the design)
are normally distributed. The easiest way to test this assumption is to visually inspect plots
like histograms, stem-and-leaf plots, and normal probability plots. More formal methods for evaluating
normality are the Kolmogorov-Smirnov and Shapiro-Wilk tests.

2. That the variances in each combination of treatment and blocking level (‘cells’) are not signif-
icantly different from each other. This is also called the homogeneity of variances assumption.
It can be tested in R using Levene’s test or the Brown-Forsythe test (a brief sketch follows this list).
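As a rough illustration, these two assumptions can be checked in R along the following lines (a sketch;
bartlett.test() is a base R test of equal variances, while leveneTest() requires the car package):

> shapiro.test(residuals(anovaMother))                  # normality of the residuals
> bartlett.test(WeightC ~ Mother, data=FosterRats)      # homogeneity of variances
> # library(car); leveneTest(WeightC ~ Mother, data=FosterRats)   # Levene's test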

Even if these two assumptions are not met, it may still be viable to use ANOVA. Although there is not
really a clear consensus, the approach has been shown to be quite robust (insensitive) to violations of
normality, except where sample sizes are very small. It also seems quite robust to moderate violations
of the homogeneity of variances assumption, provided that sample sizes are roughly equal in the
different groups and, again, are not too small. In practice, it has been suggested that one can feel
reasonably safe in using ANOVA if the largest sample standard deviation is no larger than twice the
smallest sample standard deviation. However, when sample sizes are unequal and variances are not
equal in the groups, the ANOVA approach is not appropriate and non-parametric techniques should
be used.
Chapter 7

Multiple Linear Regression

In this chapter, we discuss the concepts and practice of multiple linear regression. Since this technique
has been covered in previous courses, this chapter is a brief revision of some of the key ideas underlying
the application of this important method.

The aim of multiple regression is to construct a function

Y = β0 + β1 X1 + β2 X2 + · · · + βp Xp

that is able to explain the maximum amount of variation in one variable Y (called the dependent
variable or outcome variable) using just the set of X variables (called the independent variables or
predictor variables). Simply put, regression aims to choose values for the β coefficients so that our
predictions of the outcome variable Y are as close to the true values as possible. These coefficients
can be estimated either using all the independent variables simultaneously or by using a stepwise
approach.

A simultaneous estimation approach will include all the independent variables – both significant and
insignificant – in the final regression model. In stepwise estimation, variables must be statistically
significant in order to enter into the regression model. The first variable to enter is the variable that
has the strongest association with Y (as measured by the correlation coefficient). This initial variable
is then paired with each of the remaining independent variables one at a time, and the variable that is
able to offer the best improvement to the amount of explained variation in Y is chosen, provided
that this improvement is statistically significant. The process continues until there are no more
independent variables left to consider (because they have all been included in the model) or until
the best improvement is no longer statistically significant.
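In R, a stepwise search of this general kind can be carried out with the step() function, although it
adds and drops variables using the AIC criterion rather than a fixed significance level. A sketch, using
hypothetical variable and data frame names:

> fullModel = lm(Y ~ X1 + X2 + X3, data=mydata)     # hypothetical model for illustration
> stepModel = step(fullModel, direction="both")     # stepwise selection based on AIC
> summary(stepModel)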

Regression requires that the dependent variable Y be numeric. The independent variables should also
be numeric, although in this chapter we will see that if we want to include a categorical independent
variable, we can do so by recoding it in a particular way. Thus, the requirement that the independent
variables be numeric is actually quite flexible, and we can include categorical predictors if need be.
The requirement that the outcome Y is numeric, on the other hand, cannot be broken. In fact, an
assumption of the regression model is not just that Y is numeric, but that it is normally distributed.
Applications of multiple regression should therefore check that Y is at least approximately normal,
either using a visual inspection of the histogram of Y or a more formal statistical test (e.g. the
Shapiro-Wilk test).

7.1 A worked example of a multiple regression


The following study is taken from an article entitled “Vague Theory and Model Uncertainty in Macrosociology”.
The research attempts to explain strike severity (days lost due to industrial disputes per
1000 wage and salary earners per year) as a function of the following independent variables:
• X1 : unemployment levels


• X2 : inflation levels

• X3 : a measure of the parliamentary representation of social democratic and labour parties,

• X4 : a measure of union centralisation, and

• X5 : the year of observation (also recorded as a decade, X6 : 1950’s, 1960’s, 1970’s, or 1980’s).


Data was gathered from 18 OECD (Organization for Economic Cooperation and Development) coun-
tries from 1951-1985, and there are some 625 observations in total. The first 20 data points are shown
in Table 7.1.

Y X1 X2 X3 X4 X5 X6
ID Volume Unempl Inflation LaborParl UnionCent Year Decade
1 296 1.3 19.8 43 0.375 1951 60’s
2 397 2.2 17.2 43 0.375 1952 60’s
3 360 2.5 4.3 43 0.375 1953 60’s
4 300 1.7 0.7 47 0.375 1954 60’s
5 326 1.4 2 38.5 0.375 1955 60’s
6 352 1.8 6.3 38.5 0.375 1956 60’s
7 195 2.3 2.5 38.5 0.375 1957 60’s
8 133 2.7 1.3 36.9 0.375 1958 60’s
9 109 2.6 1.8 36.9 0.375 1959 60’s
10 208 2.5 3.8 36.9 0.375 1960 70’s
11 173 2.5 2.5 49.2 0.375 1961 70’s
12 142 2.6 -0.3 49.2 0.375 1962 70’s
13 158 2.1 0.5 41 0.375 1963 70’s
14 243 1.6 2.4 41 0.375 1964 70’s
15 211 1.5 4 41 0.375 1965 70’s
16 183 1.7 3 33.1 0.375 1966 70’s
17 168 1.9 3.2 33.1 0.375 1967 70’s
18 249 1.8 2.7 33.1 0.375 1968 70’s
19 439 1.8 2.9 47.2 0.375 1969 70’s
20 515 1.6 3.9 47.2 0.375 1970 80’s

Table 7.1: Data for strike example

Assuming the data set is stored as a txt file in your working directory, it can be imported into R
using the following code:

> strikes_all=read.table("strikes_all.txt", header=T)


> attach(strikes_all)
> names(strikes_all)

[1] "Volume" "Unempl" "Inflation" "LaborParl" "UnionCent" "Year" "Decade"


[8] "Ind_60s" "Ind_70s" "Ind_80s" "Country" "Ind_C1" "Ind_C2" "Ind_C3"
[15] "Ind_C4"

To begin with, let us consider the regression of strike amount on the following four numeric independent
variables: unemployment levels, inflation levels, parliamentary representation of labour, and union
centralisation. The regression output obtained from R with the following code is given below:

> formulaAll=Volume~Unempl+Inflation+LaborParl+UnionCent
> modelAll=lm(formulaAll,data=strikes_all)
> summary(modelAll)

Call:
lm(formula = formulaAll, data = strikes_all)

Residuals:
Min 1Q Median 3Q Max
-610.9 -213.5 -96.5 23.6 6555.0

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 312.851 90.953 3.440 0.000621 ***
Unempl 17.490 7.361 2.376 0.017798 *
Inflation 19.420 4.701 4.131 4.10e-05 ***
LaborParl -0.141 1.640 -0.086 0.931522
UnionCent -400.568 70.836 -5.655 2.38e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 534.1 on 620 degrees of freedom


Multiple R-squared: 0.09833, Adjusted R-squared: 0.09252
F-statistic: 16.9 on 4 and 620 DF, p-value: 3.65e-13

When analysing the output from a regression model, it is useful to divide the analysis into two parts:

1. Evaluating the overall quality of the model

2. Evaluating the contribution of each independent variable (a short sketch showing how both parts can be extracted in R follows)
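Both parts can be read off the summary(modelAll) output above: the last few lines give the overall
fit (R², adjusted R², and the overall F-test), while the coefficient table gives the estimate and
significance of each individual predictor. They can also be extracted directly (a minimal sketch):

> summary(modelAll)$r.squared         # overall proportion of variation explained
> summary(modelAll)$adj.r.squared     # adjusted R-squared
> summary(modelAll)$coefficients      # estimate, std. error, t value and p-value per variable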

7.2 Including categorical independent variables

Suppose now that we would like to include Decade as an additional independent variable. The
problem, however, is that Decade is a categorical variable, with four groups or levels (50’s, 60’s,
70’s, and 80’s). The usual linear regression formulation can handle binary (0/1) variables, and
the manipulation that we use to include a categorical independent variable is to recode the
variable into a number of binary variables called dummy variables or indicator variables. We will
use the Decade variable to illustrate this process.

Recoding a categorical variable requires that one of the levels of the variable be chosen as a
so-called reference category. All other levels of the variable will then be ‘compared to’ this level
in the regression. In practice, it will sometimes make more sense to choose a particular level
as a basis of comparison, but technically it makes no difference which level is chosen. Here, we
arbitrarily choose the first level (the 1950’s) as the reference category.

The recoding process then involves creating a new binary independent variable for each of the
remaining levels. These are the indicator or dummy variables. That is, if there are L levels then
there will be L − 1 new indicator variables, each representing one of the levels of the categorical
independent variable. Then, if an observation has a value of level l for the categorical variable
then it receives a “1” for the indicator variable representing level l and a “0” for all the other
indicator variables. Where the observation belongs to the reference category, a “0” is placed in
all indicator variables. This process is illustrated for the first twenty cases of the strike example in
Table 7.2.

            X5       X7         X8         X9
ID          Decade   Ind(60’s)  Ind(70’s)  Ind(80’s)
1 1 0 0 0
2 1 0 0 0
3 1 0 0 0
4 1 0 0 0
5 1 0 0 0
6 1 0 0 0
7 1 0 0 0
8 1 0 0 0
9 1 0 0 0
10 2 1 0 0
11 2 1 0 0
12 2 1 0 0
13 2 1 0 0
14 2 1 0 0
15 2 1 0 0
16 2 1 0 0
17 2 1 0 0
18 2 1 0 0
19 2 1 0 0
20 3 0 1 0

Table 7.2: Recoding the categorical Decade into indicator variables

This can be achieved with the following code in R:

> Decade.f = factor(Decade)


> dummies = model.matrix(~Decade.f)
> dummies[1:20,]

(Intercept) Decade.f1 Decade.f2 Decade.f3


1 1 1 0 0
2 1 1 0 0
3 1 1 0 0
4 1 1 0 0
5 1 1 0 0
6 1 1 0 0
7 1 1 0 0
8 1 1 0 0
9 1 1 0 0
10 1 0 1 0
11 1 0 1 0
12 1 0 1 0
13 1 0 1 0
14 1 0 1 0
15 1 0 1 0
16 1 0 1 0
17 1 0 1 0
18 1 0 1 0
19 1 0 1 0
20 1 0 0 1

The three new indicator variables (X7 , X8 , and X9 ) can be thought of as indicating whether a
particular observation belongs to the 1960’s, 1970’s, and 1980’s respectively, where a 1 indicates
‘yes’ and a 0 indicates ‘no’. If an observation does not belong to any of those decades (i.e.
has a 0 or ‘no’ in all three columns), then it must belong to the reference category (1950’s).
These three indicator variables can then be included as independent variables along with all the
other previously included numeric independent variables. Note that the categorical Decade
variable itself is not included. The results are obtained with the following R code:

> modelCategorical= update(modelAll, .~.+dummies[,2:4])


> summary(modelCategorical)

Call:
lm(formula = Volume ~ Unempl + Inflation + LaborParl + UnionCent +
dummies[, 2:4], data = strikes_all)

Residuals:
Min 1Q Median 3Q Max
-669.5 -200.5 -91.9 30.3 6532.8

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 169.0156 98.7906 1.711 0.087612 .
Unempl 31.5556 8.2752 3.813 0.000151 ***
Inflation 25.6605 5.4179 4.736 2.70e-06 ***
LaborParl 0.4354 1.6331 0.267 0.789866
UnionCent -373.4723 70.6062 -5.290 1.71e-07 ***
dummies[, 2:4]Decade.f1 81.8555 38.8537 2.107 0.035541 *
dummies[, 2:4]Decade.f2 89.7700 40.6911 2.206 0.027743 *
dummies[, 2:4]Decade.f3 16.0070 38.3204 0.418 0.676301
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 529.2 on 617 degrees of freedom


Multiple R-squared: 0.119, Adjusted R-squared: 0.109
F-statistic: 11.91 on 7 and 617 DF, p-value: 2.783e-14

The regression coefficients and significance levels of the previous numeric variables have stayed
roughly the same as before, and so the same interpretations apply. Of the indicator variables,
the first two (representing the 1960’s and the 1970’s) are significant at the 5% level, while the
indicator for the 1980’s is not. The coefficients of the indicator variables must be interpreted
relative to the reference category. That is, the outcome variable is significantly higher in the
levels indicated by the first two dummies than in the reference category. In plain words, strike
volumes were significantly higher in the 1960’s and 1970’s than in the 1950’s: the unstandardised
coefficients of roughly 82 and 90 mean that, on average, about 82 and 90 more strike days were
lost per 1000 salary earners in those decades than in the 1950’s. The non-significance of the
indicator variable for the 1980’s implies that the average strike volume in that decade was roughly
the same as in the 1950’s. The addition of the indicator variables has increased the percentage of
explained variation in strike volume to 11.9% and the adjusted R² to 10.9%.
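As an aside, constructing the indicator variables by hand is instructive but not strictly necessary in
R: if the categorical variable is supplied as a factor in the model formula, lm() creates the dummy
variables internally, using the first factor level as the reference category. A minimal sketch, which
should give a fit equivalent to the dummy-variable model above:

> modelFactor = lm(Volume ~ Unempl + Inflation + LaborParl + UnionCent + factor(Decade),
+                  data=strikes_all)
> summary(modelFactor)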

7.3 Additional topics


In this chapter we have briefly revised some of the basic ideas behind multiple regression. The
topic of regression is an enormous area of research, and is almost certainly the most widely used

multivariate approach in practice. We will not discuss any more advanced regression topics in
this course, but the following aspects are important to be aware of, especially when using a
regression approach in your own research.

Linearity of relationships The basic regression equation assumes that all relationships be-
tween the independent variables and the dependent variable are linear in nature. Often,
relationships between variables are not linear but still follow some pattern. A famous ex-
ample from physics is that the gravitational force between two objects decreases according
to the square of the distance between them. Sometimes it is possible to include non-linear
relationships by appropriate transformations of the independent variables (e.g. by taking
the square of a variable and using that in the regression). Useful transformations are the
log transformation, squares, and square roots.
Interactions The basic regression equation assumes that the influence of each independent
variable does not depend on any of the other independent variables. Often, this is not the
case. As an example, exercise increases the life expectancy of most people, but can present
large risks to those who already have damaged hearts (i.e. can decrease life expectancy).
This is a clear case of one effect, amount of exercise, interacting with another effect, status
of heart. If you have two independent variables X1 and X2 that you believe may interact,
an appropriate regression model is

Y = β0 + β1 X1 + β2 X2 + β3 (X1 X2 )

The variable containing the interactions (X1 X2 ) can be created by simply multiplying
each observation’s value on X1 by its value on X2 . Many software packages also allow you to
specify interactions directly in the model formula (see the sketch after this list).
Outlier detection Outliers are observations that occupy an unusual or extreme location com-
pared to the other observations in the data set. Sometimes, these few outliers can have
a significant influence over the values of the regression coefficients, in which case they are
called influential observations. The danger here is that the regression coefficients do not
end up reflecting the majority of the observations in the data set but are rather skewed
by these one or two extreme observations. It is therefore important to identify whether
there are outliers in a dataset, and what influence these have. Some typical tests are to
examine Cook’s distances, standardised residuals, or deleted residuals. Sometimes outliers
are assumed to be errors and removed, but they may also be an important source of extra
information.
Multicollinearity Multicollinearity refers to cases where the independent variables are strongly
correlated with one another. Multicollinearity can cause problems in the estimation of the
regression coefficients and especially with testing these coefficients for statistical signif-
icance. In extreme cases where two independent variables are perfectly correlated, the
coefficients can not be estimated at all. Multicollinearity can sometimes be detected by ex-
amining correlations between independent variables – any high correlations are a warning
sign – but more subtle cases require special diagnostics such as variance inflation factors.
Multicollinearity can be remedied – examples of these fixes are principal components re-
gression and ridge regression.
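The first two of these points, transformations and interactions, are both specified through the model
formula in R, and a multicollinearity check is available through the car package. The sketch below uses
hypothetical variable names purely for illustration:

> model = lm(Y ~ log(X1) + X2*X3, data=mydata)   # log-transform X1; X2*X3 expands to X2 + X3 + X2:X3
> summary(model)
> # library(car); vif(model)                     # variance inflation factors for multicollinearity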
Chapter 8

Discriminant Analysis

In the previous chapter, we saw how a multiple regression model could be used to predict and
explain a numeric dependent variable, using either numeric independent variables or categorical
independent variables recoded as dummy variables. A question that arises is: what happens
if the variable that we are trying to predict is not numeric but categorical? As an example of
where this might occur, consider the following examples:

New product development Designers of new products often test their product out on sample
audiences, attempting to explain a respondent’s decision to buy or not buy as a function
of different product attributes like price, appearance, ease of use, and so on. Here, the
variable indicating whether a respondent will or will not buy the product is the categorical
dependent variable (with two groups: ‘buy’ and ‘not buy’).
Investments Investors often try to predict which firms will fail as a function of some financial
and economic indicators, like the price-earnings ratio. Here, the variable indicating whether
a firm failed or did not fail is the categorical dependent variable (with two groups: ‘failure’
and ‘no failure’).
Career guidance Career guidance counsellors may be interested in predicting which job an
applicant would be best suited for based on their responses to a questionnaire. Here, the
variable indicating the main job interest is the categorical dependent variable (with many
groups: e.g. doctor, lawyer, plumber, teacher)

Clearly, multiple regression is not appropriate in these circumstances. But this and the next
chapter provide two techniques that are appropriate for modelling whenever the dependent
variable is categorical. This chapter discusses discriminant analysis, which in some ways is an
extension of multiple regression to categorical dependent variables, while the following chapter
discusses classification trees, which work in a quite different way. Both of these techniques can
be used to achieve a research goal of understanding why certain objects belong to one group
and other objects belong to other groups; that is, they are concerned with the prediction and
explanation of group membership.

For discriminant analysis at least, it is easier to introduce the technique for cases in which there
are only two possible groups (like the first two of our examples above). Later in the chapter, we
will extend these results to cases where the dependent variable contains more than two groups.

8.1 Two-group discriminant analysis

We begin this section by looking at a very small example showing some of the reasoning behind
discriminant analysis. The example is taken from ?. Later on, we’ll introduce some real-world
applications of the technique. Suppose that an appliance manufacturer is pre-testing a new make
of fridge on a test market of 10 respondents. Each respondent is asked to rate the new


fridge on three attributes: durability (X1 ), performance (X2 ), and style (X3 ), all on a 1 (terrible)
to 10 (superb) scale. In addition, each respondent is asked whether or not they would buy the
fridge if it appeared on the market. The raw data is given in Table 8.1 below.

X1 X2 X3
ID Durability Performance Style Decision
1 8 9 6 Buy
2 6 7 5 Buy
3 10 6 3 Buy
4 9 4 4 Buy
5 4 8 2 Buy
6 5 4 7 No buy
7 3 7 2 No buy
8 4 5 5 No buy
9 2 4 3 No buy
10 2 2 2 No buy

Table 8.1: Data for two-group new product development example

The data that was saved in your working directory under the name “[Link]” is imported
into R with the following code:

> hypothetical=read.table("[Link]", header=T)


> attach(hypothetical)
> names(hypothetical)

[1] "ID" "X1" "X2" "X3" "Decision" "StX1" "StX2"


[8] "StX3"

In all discriminant analysis applications, we know which group each respondent in the sample
belongs to i.e. we know whether each respondent indicated that they would ‘buy’ or ‘not buy’.
Because this knowledge is obtained in the survey process, prior to any statistical modelling, it is
often called a priori group membership. For example, we say that respondent 1 belongs to the
a priori ‘buy’ group. Other terms for this a priori classification are ‘true’, ‘observed’ or ‘actual’.
Once we have built a discriminant analysis model, we can use the model to obtain predicted
group memberships for each respondent. Naturally for a good model we would hope that the
predicted group membership matches the a priori group membership for as many respondents as
possible. In any case, it is important to keep the distinction between a priori group membership
(which is obtained from survey data alone) and predicted group membership (which is obtained
from the discriminant analysis model) in mind.

Our aim here is to use the three product attributes (X1 , X2 , X3 ) to explain group membership
i.e. to explain a person’s decision to buy or not buy the fridge. The first thing we might try is
to look at the average attribute values in the ‘buy’ group and ‘no buy’ group, for each attribute.
These averages are obtained in R as follows:

> mean.X1=tapply(X1,Decision,mean)
> mean.X2=tapply(X2,Decision,mean)
> mean.X3=tapply(X3,Decision,mean)
> means=rbind(mean.X1,mean.X2,mean.X3)
> rownames(means)=c("X1","X2","X3")
> colnames(means)=c("Not Buy","Buy")
> t(means)

X1 X2 X3
Not Buy 3.2 4.4 3.8
Buy 7.4 6.8 4.0

Variables that are able to distinguish/discriminate between respondents who say ‘buy’ and those
who say ‘no buy’ will tend to have larger differences in means. For example, respondents who
bought the fridge on average gave it a fairly high durability rating of 7.4, while those who did
not buy the fridge on average gave it a low rating of 3.2. This difference in means of 4.2 suggests
that durability may be a good discriminating variable. The style attribute X3 , on the other
hand, has very similar mean values in the ‘buy’ and ‘no buy’ group. This indicates that those
who bought the fridge tended to give the same style rating as those who did not buy the fridge.
This suggests that style is not a good discriminating variable.
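A quick way to see these differences in R is to take the difference between the two group means for each attribute, reusing the means object created above:

> apply(means, 1, diff)   # 'Buy' mean minus 'Not Buy' mean for each attribute: 4.2, 2.4 and 0.2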
A further understanding of how discriminant analysis works can be obtained by plotting the
data graphically, as shown in Figure 8.1. Here, the attribute ratings on each attribute have been
plotted on an axis from 1 – 10. Respondents who belong to the ‘buy’ group are indicated by
circles, while those in the ‘not buy’ group are indicated by rectangles. The number inside a circle
or rectangle just refers to that respondent’s ID in Table 8.1.

[Figure: each respondent’s rating on Durability (X1 ), Performance (X2 ) and Style (X3 ), plotted on separate 1–10 axes; respondents in the ‘buy’ group are shown as circles and those in the ‘not buy’ group as rectangles.]

Figure 8.1: Graphical display of attribute ratings given by respondents in new product development
example

First consider the durability attribute X1 . Respondents who gave high ratings on durability

tend to indicate that they would buy the fridge, as indicated by the predominance of circles
on the right-hand side of the X1 axis. Similarly, those who gave low ratings on X1 tend to
indicate that they would not buy the fridge. In fact, we could use a predictive rule which says
“If someone gives an durability rating of 6 or more, then classify them as someone who would
buy the product”. The value of 6 here is called a cut-off value, and it indicates the value of X1
that divides the sample into the ‘buys’ and the ‘not buys’. Clearly, any value greater than 5 and
less than or equal to 6 will serve the same purpose.

Using this rule, we would correctly classify 9 of the 10 respondents. This gives us a hit rate or
percentage of correct classifications of 90%. The only respondent that we would “misclassify” is
respondent 5, who gave a durability rating of 4 (and so would be predicted as ‘not buy’) but who
nevertheless decided to buy the fridge (belongs to the a priori ‘buy’ group).

Now consider the performance attribute X2 . Here the distinction between the ‘buy’ and ‘not
buy’ group is less clear, but if we chose a cut-off of 5.5 then a discriminating rule might go “If
someone gives an performance rating of 5.5 or more, then classify them as someone who would
buy the product”. Using this rule, we would only misclassify two respondents. Firstly, we would
misclassify respondent 4, who gave a low performance rating of 4 but still decided to buy the
fridge. Then, we would also misclassify respondent 7, who gave a fairly high performance rating
of 7 but decided not to buy the fridge. The other 8 respondents are correctly classified, giving
a hit rate of 80% for this rule; lower than for the previous rule but still good. Importantly,
the one respondent who was incorrectly classified by the first rule (respondent 5) was correctly
classified using X2 . This suggests that some combination of X1 and X2 might give an even better
prediction of group membership than either X1 or X2 on their own.

The easiest way to combine X1 and X2 is to add them together to form a new variable Z =
X1 +X2 . Since X1 and X2 are both measured on 1 – 10 scales, the new variable Z has a minimum
possible value of 2 when X1 = X2 = 1 and a maximum possible value of 20 when X1 = X2 = 10.
The graphical plot of values for this new attribute Z is shown in Figure 8.2 below.

[Figure: each respondent’s score on the combined variable Z = X1 + X2 , plotted on a single axis running from 2 to 20; ‘buy’ respondents shown as circles, ‘not buy’ respondents as rectangles.]

Figure 8.2: Graphical display of scores for discriminant function Z = X1 + X2

As the plot shows, by using a cut-off of 11 we can achieve a perfect classification of respondents
into the two groups i.e. a 100% hit rate. We therefore do not need any further attributes, but
in any case Figure 8.1 shows that there is far more overlap between the two groups when only
X3 is considered, and that any cut-off would achieve a far lower hit rate than any of the rules
we have considered so far (check for yourself that even the ‘best’ cut-off choice of 4.5 still results
in a hit rate of only 50%). Thus the ‘style’ variable X3 is not of any real use at distinguishing
between the ‘buy’ and ‘no buy’ groups in this example.
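The hit rates of these informal rules can be checked directly in R. Recall that in the R data the Decision variable is coded 1 for ‘buy’ and 0 for ‘no buy’, and that the data frame has been attached:

> mean((X1 >= 6) == (Decision == 1))         # durability rule: 9 of the 10 respondents correct (0.9)
> mean((X2 >= 5.5) == (Decision == 1))       # performance rule: 8 of 10 correct (0.8)
> mean((X1 + X2 >= 11) == (Decision == 1))   # combined rule: all 10 correct (1.0)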

Discriminant analysis follows a similar procedure to the one outlined above. It is based on com-
bining the independent variables in a function called the discriminant function. The combination
is a linear one, so that the discriminant function takes the general form
Z = w1 X1 + w2 X2 + · · · + wp Xp
The exact combination of independent variables is determined by the values of the wi ’s, which
perform a similar role to the β coefficients in multiple regression. The wi are chosen in such a

way that the discriminant function is able to optimally discriminate between groups. Note that
in the above example, we have already constructed some possible discriminant functions. First,
we tried the discriminant function Z = X1 (i.e. we set w1 = 1 and w2 = w3 = 0). This gave
a hit rate of 90%. Next we tried the discriminant function Z = X2 (i.e. we set w2 = 1 and
w1 = w3 = 0), which gave a lower hit rate of 80%. Finally, we tried the discriminant function
Z = X1 + X2 (i.e. we set w1 = w2 = 1 and w3 = 0). This gave a perfect hit rate of 100%. These
functions, however, were really just informed guesses based on examining the graphical plots in
Figure 8.1.
The actual discriminant analysis estimation procedure in R is applied using the MASS package.
In a similar way to regression, the discriminant function can be estimated either using all the
variables simultaneously or using a stepwise approach. A simultaneous estimation approach
will include all the independent variables in the final discriminant function, whether they are
significant discriminators or not. In stepwise estimation, only variables that are significant
discriminators between the groups enter into the model. The first variable to enter is the variable
that best discriminates between the groups. This initial variable is then paired with each of the
remaining independent variables one at a time, and the variable that is able to offer the best
improvement to the discriminating ability of the model is chosen, provided that this improvement
is statistically significant. The process continues until there are no more independent variables
left to consider (because they have all been included in the model) or until the best improvement
in discrimination offered by a variable is not significant.
In order to see which variables are significant, we use the greedy.wilks() function in the klaR package,
which must be installed in your R environment. The package is loaded with the library(klaR)
function.

> library(klaR)
> greedy.wilks(Decision ~ X1+X2+X3,data=hypothetical)

Formula containing included variables:

Decision ~ X1 + X2
<environment: 0x00000000091d3ea0>

Values calculated in each step of the selection procedure:

  vars Wilks.lambda F.statistics.overall p.value.overall F.statistics.diff p.value.diff


1 X1 0.4048583 11.760000 0.008964284 11.760000 0.00896428
2 X2 0.2814346 8.936283 0.011825542 3.069865 0.11785507

According to the stepwise estimation we see that the X1 and X2 variables are found to be signifi-
cant. Therefore we will use these variables in order to apply discriminant analysis. Discriminant
analysis can be applied in R using the MASS package and the lda() function in this package. As
a result, you need to have this package installed in your R environment.

> library(MASS)
> formula=Decision~X1+X2
> data=hypothetical
> lda.fit=lda(Decision~X1+X2,method="moment")   # fit the discriminant function; the fitted object is called lda.fit here
> coefficients=function(formula, data, groups){
+ # this function calculates the constant term in the 2 group discriminant function.
+ # the only inputs are the formula and the name of the data.
+ # groups defines the number of categories to be estimated in the response variable
+ if(groups==2){
+ lda.fit=lda(formula,data=data,method="moment")
+ lda.pred=predict(lda.fit)
+ model.data=get_all_vars(formula,data)        # response variable plus explanatory variables
+ n.vars=dim(model.data)[2]
+ X.data=cbind(model.data[,2:n.vars])          # explanatory variables only
+ X.matrix=as.matrix(X.data)
+ coef.matrix=as.matrix(lda.fit$scaling)       # discriminant function weights
+ predictions=X.matrix%*%coef.matrix
+ constant=lda.pred$x[1]-predictions[1]        # back out the constant from the first observation
+ coeffs=rbind(constant,coef.matrix)
+ } else {
+ lda.fit=lda(formula,data,method="moment")
+ lda.pred=predict(lda.fit)
+ model.data=get_all_vars(formula,data)
+ n.vars=dim(model.data)[2]
+ X.data=cbind(model.data[,2:n.vars])
+ X.matrix=as.matrix(X.data)
+ coef.matrix=as.matrix(lda.fit$scaling)
+ predictions=X.matrix%*%coef.matrix
+ constant1=lda.pred$x[1,1]-predictions[1,1]
+ constant2=lda.pred$x[1,2]-predictions[1,2]
+ constants=cbind(constant1,constant2)
+ rownames(constants)=c("constant")
+ colnames(constants)=c("LD1","LD2")
+ coeffs=rbind(constants,coef.matrix)
+ }
+ print(coeffs)
+ }
> coefficients(formula,data,2)

LD1
constant -4.5295841
X1 0.4755469
X2 0.3587831

Here we obtain the constant and the coefficients for the explanatory variables. The function can
be written as follows:
Z = −4.53+0.476X1 +0.359X2 . Note that X1 receives a slightly larger weight than X2 , reflecting
our earlier statement that durability appeared to discriminate better between the groups than
performance did.

The discriminant function can be used to obtain a discriminant score for each respondent by
simply substituting their responses on the independent variables into the discriminant function.
For example, in the current example respondent number 1 gave a durability rating of X1 = 8
and a performance rating of X2 = 9. Substituting these into the discriminant function gives a
discriminant score Z = −4.53 + 0.476(8) + 0.359(9) = 2.51.
Discriminant scores can be found similarly for all respondents. These can be obtained in R with
the following code:

> lda.pred=predict(lda.fit)
> LD1=lda.pred$x

> LD1

LD1
1 2.5038393
2 0.8351792
3 2.3785836
4 1.1854705
5 0.2428686
6 -0.7167171
7 -0.5914614
8 -0.8334808
9 -2.1433578
10 -2.8609240

These are shown in Table 8.2. Note that all respondents in the ‘buy’ group have positive
discriminant scores, while all those in the ‘no buy’ group have negative scores. A cut-off value of
zero would therefore result in a perfect classification of respondents. We will consider a general
procedure for calculating appropriate cut-off values in the next section.

ID Decision Z = −4.53 + 0.476X1 + 0.359X2


1 Buy 2.51
2 Buy 0.84
3 Buy 2.38
4 Buy 1.19
5 Buy 0.25
Mean score for ‘buy’ group 1.43
6 No buy –0.71
7 No buy –0.59
8 No buy –0.83
9 No buy –2.14
10 No buy –2.86
Mean score for ‘no buy’ group -1.43

Table 8.2: Discriminant scores using the optimal discriminant function

Of special interest are the averages of the discriminant scores in each group. The average of the
discriminant scores of all members of a group is called the group centroid, often abbreviated to
just the centroid. The group centroids for the current example are also shown in Table 8.2. Once
the discriminant score of each observation has been calculated, the centroid of each group can
be found by taking the average of the discriminant scores of all members of that group.
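As a quick check, the same centroids can be obtained in one line by averaging the discriminant scores (the LD1 object created earlier) within each a priori group:

> tapply(as.vector(LD1), hypothetical$Decision, mean)   # group centroids: about -1.43 and 1.43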

Alternatively, the centroids can be obtained from the average value of each of the independent
variables in each of the groups. These averages have already been stored in R under the “means” object:

> means

Not Buy Buy


X1 3.2 7.4
X2 4.4 6.8
X3 3.8 4.0

Let X̄i(j) be the average value of independent variable Xi in group j (e.g. X̄2(1) is the average value
of X2 in group 1, the no-buy group). Then the centroids can be calculated by substituting the
average values into the discriminant function, i.e. by

Z̄(1) = −4.53 + 0.476X̄1(1) + 0.359X̄2(1)
Z̄(2) = −4.53 + 0.476X̄1(2) + 0.359X̄2(2)

where Z̄(1) denotes the centroid for group 1 and Z̄(2) denotes the centroid for group 2.

Z̄(1) = −4.53 + 0.476 ∗ 3.2 + 0.359 ∗ 4.4 = −1.4272
Z̄(2) = −4.53 + 0.476 ∗ 7.4 + 0.359 ∗ 6.8 = 1.4272

These can be obtained in R with the following code:

> centroid.nobuy=sum(LD1*(hypothetical$Decision==0))/sum(hypothetical$Decision==0)
> centroid.buy=sum(LD1*(hypothetical$Decision==1))/sum(hypothetical$Decision==1)
> centroids=rbind(centroid.nobuy,centroid.buy)
> colnames(centroids)=c("Mean score (centroid)")
> rownames(centroids)=c("No buy", "buy")
> centroids

Mean score (centroid)


No buy -1.429188
buy 1.429188

The centroid provides an indication of what the discriminant score of a ‘typical’ member of
the group might be. If a discriminant function is able to successfully discriminate between two
groups, their centroids should be quite different – since this is an indication that the two groups
differ on average. Therefore, another way of testing whether a discriminant function is significant
or not is to test whether there is a significant difference between the group centroids (the other
way we have already looked at is to examine the hit rate). We will consider this in more detail
in the next section.

8.2 Evaluating the discriminant analysis

In the previous section we estimated the discriminant function and obtained the group centroids.
However, we still need to assess whether the discriminant function is significant, which we do by
testing whether there is a significant difference between the group centroids and by evaluating the hit rate.

8.2.1 Estimation and assessment of the discriminant function

In this example, we used a stepwise estimation approach and decided to use only X1 and X2
variables. The discriminant function is estimated to be Z = −4.53 + 0.476X1 + 0.359X2 . Im-
portantly, one cannot directly interpret the coefficient signs without knowing the group centroids.
That is, it is incorrect to say that the variable X1 has a positive effect on buy/no buy decision.
The correct interpretation of the direction of the effect will be discussed in the next section.
The discriminant function coefficients can be individually tested for significance, but we will also
address this point in the next section. Here, we are only trying to establish whether the overall
discriminant function is able to adequately discriminate between the two groups.

The ability of the discriminant function to discriminate between the two groups is evaluated in
the following two ways.

• By testing whether the group centroids of the two groups are significantly different from
one another.
• By calculating and evaluating the overall hit rate (the proportion of correct classifications).

We will consider each of these in turn.

Testing the differences between centroids for significance

Group centroids are given by the centroids object in R:

> centroids

Mean score (centroid)


No buy -1.429188
buy 1.429188

In order to find out whether the discriminant function is able to significantly discriminate between
the two groups, we need to test whether the two group centroids are significantly different from
one another. The difference between centroids is usually measured by a type of distance called
the Mahalanobis distance. The Mahalanobis distance takes into account correlations between the
different dimensions that the distance is calculated over. Whenever dimensions are uncorrelated,
Mahalanobis distance is the same as Euclidean distance. However, since here differences are
only measured on one dimension (differences in the scores on the single discriminant function
Z), Euclidean distance is also equivalent to Mahalanobis distance and we may use either. The
difference between centroids can easily be calculated as d = Z̄(2) − Z̄(1) = 1.43 − (−1.43) = 2.86.
Usually squared distances are used, i.e. d² = 8.1703. In R:

> distance=centroid.buy-centroid.nobuy
> d.squared=distance^2
> d.squared

[1] 8.170316

This distance can be tested for statistical significance using an F -test. The hypotheses are

H0 : Z̄(1) = Z̄(2)
H1 : Z̄(1) ≠ Z̄(2)

The test statistic for this hypothesis test is calculated from the following formula, where n = 10,
n1 = 5 and n2 = 5, with p = 2 explanatory variables:

Fcalculated = [(n − 1 − p) n1 n2 / (p (n − 2)(n1 + n2 ))] × d²

Fcalculated = [(10 − 1 − 2) ∗ 5 ∗ 5 / (2 (10 − 2)(5 + 5))] × 2.86²

The F -statistic obtained from this calculation is 8.936, and has 2 and 7 degrees of freedom.
This can be compared against critical values of the F -distribution using F -tables, but the exact
p-value associated with the test is obtained in R with the following:

> pf(8.936, 2, 7, lower.tail = FALSE, log.p = FALSE)



[1] 0.01182648

The p-value associated with the calculated F value is 0.0118. Thus, there is evidence to suggest
that the group centroids are significantly different from one another and that the discriminant
function is therefore able to significantly discriminate between groups.
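As a check, the formula above can also be evaluated directly in R, reusing the squared distance between the centroids computed earlier (stored above as d.squared); n, n1, n2 and p are introduced here just for the calculation:

> n=10; n1=5; n2=5; p=2
> F.calc=((n - 1 - p)*n1*n2/(p*(n - 2)*(n1 + n2)))*d.squared
> F.calc   # approximately 8.936, as reported above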
Since there is only one discriminant function, the overall significance test of the model (a MANOVA
using Wilks’ lambda) gives the same F -value as the one obtained with the above formula. In R:

> independentVars<-as.matrix(hypothetical[,c(2,3)])
> manova1<-manova(independentVars ~ hypothetical$Decision)
> wilks.summary<-summary(manova1,test="Wilks")
> wilks.summary

Df Wilks approx F num Df den Df Pr(>F)


hypothetical$Decision 1 0.28143 8.9363 2 7 0.01183 *
Residuals 8
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Evaluating the hit rate

There is an additional way to assess the overall effectiveness of the discriminant function, and
that is to examine the proportion of correct classification that it makes (the “hit rate”). There
are four ways of classifying respondents – all four methods will provide the same classification.

(a) Discriminant functions


(b) Classification functions
(c) Mahalanobis distances
(d) Posterior probabilities

Here we will only discuss the first and the last methods.
Method 1: Using discriminant functions to classify objects
The first step involves calculating a cut-off value based on the group centroids; the second step
involves classifying objects depending on whether they lie below or above the cut-off value. The
calculation of the cut-off value is a weighted sum of the group centroids

Cut-off value = (n2 Z̄(1) + n1 Z̄(2)) / (n1 + n2 )          (8.1)
where n1 and n2 are the number of objects in group 1 and 2 respectively. It is clear that, for
cases where there are two equal-sized groups, the cut-off value is always equal to zero, and all
that is required is to see which of the group centroids is negative and which is positive. In this
example the cut-off value is calculated as follows:

Cut-off value = (5 ∗ (−1.43) + 5 ∗ 1.43) / (5 + 5) = 0          (8.2)
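A minimal sketch of the same calculation in R, using the centroids object created earlier (n1 and n2 are introduced here just for the calculation):

> n1=sum(hypothetical$Decision == 0)   # size of the 'no buy' group
> n2=sum(hypothetical$Decision == 1)   # size of the 'buy' group
> (n2*centroids[1] + n1*centroids[2])/(n1 + n2)   # cut-off value; 0 for these equal-sized groups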
Once we have an appropriate cut-off value, we can observe that the group centroid for the ‘no
buy’ group (Group1) (Z̄ (1) = −1.43) is below the cut-off 0, while the group centroid for the
‘buy’ group (Group2) (Z̄ (2) = 1.43) is above the cut-off 0. Therefore, any observation lying
below the cut-off of 0 should be regarded as ‘closer’ to the centroid of the ‘no-buy’ group, and

any observation lying above the cut-off of 0 should be regarded as ‘closer’ to the centroid of
the ‘buy’ group. This gives the following decision rule that can be used to classify individual
observations

Classification rule: If Z ≤ 0, then classify customer to the ‘no buy’ group. If Z > 0,
then classify the customer to the ‘buy’ group.

The case of Z = 0 is tricky because in that case the customer is equidistant from the two group
centroids. Strictly speaking, the customer could be classified into either group.

We can now use the decision rule above to classify each customer to a group depending on their
discriminant score. For example, customer 4’s discriminant score is Z = 1.1854705 (check this
for yourself with the R output provided with the LD1 object). Since this is above the cut-off of
0, the discriminant analysis model classifies customer 4 into the ‘buy’ group. Customer 6, on the
other hand, has a discriminant score of Z = −0.7167171 (again, check this). Since this is below
the cut-off of 0, the discriminant analysis model classifies customer 6 into the ‘no buy’ group.
The classifications made for each customer, together with the a priori group membership and
the result of the model classification, is obtained in R as follows:

> classes=lda.pred$class
> class.table=data.frame(priorClass=hypothetical$Decision, LD1,
+ posteriorClass=classes,decision=hypothetical$Decision==classes)
> class.table

priorClass LD1 posteriorClass decision


1 1 2.5038393 1 TRUE
2 1 0.8351792 1 TRUE
3 1 2.3785836 1 TRUE
4 1 1.1854705 1 TRUE
5 1 0.2428686 1 TRUE
6 0 -0.7167171 0 TRUE
7 0 -0.5914614 0 TRUE
8 0 -0.8334808 0 TRUE
9 0 -2.1433578 0 TRUE
10 0 -2.8609240 0 TRUE

The output shows that all 10 customers were correctly classified. This gives a correct
classification or hit rate of 10/10 or 100%. This would appear to be excellent, but is there any
kind of standard that we can compare the hit rate to? What represents a ‘good’ hit rate? Some
guidance is given by considering the proportion of correct classifications that could be achieved
by chance alone i.e. without any statistical model. We consider two such ‘chance’ models.

• Maximum chance criterion: Suppose we have two groups, with say 80 people in Group A and
20 people in Group B. Then by classifying all people into the larger Group A, we will be
correct for 80% of the sample i.e. a hit rate of 80%. In general, by classifying all respondents
into the larger of two groups, we can achieve a hit rate of

HMax = P

where P is the proportion of the sample belonging to that group. In the example above, we
can achieve a hit rate of 5/10 (50%) by classifying all customers into the ‘buy’ group. Thus,
any discriminant function must achieve a hit rate of more than 50% if it is to be considered
worthwhile. Clearly, our model does just this.

• Proportional chance criterion The maximum chance criterion is a good basis for evaluating
a discriminant function if the only aim is to maximise the overall percentage of correct clas-
sifications. However, it suffers from one weakness. Clearly, the maximum chance criterion
is unable to correctly predict any of the objects in the smaller group. In most situations,
the aim of the analysis is to be able to correctly predict members of both groups, and a re-
searcher would be happy to accept a small reduction in the overall accuracy if it meant that
both groups could be predicted equally well. In these cases, a more appropriate criterion
for comparison is the proportional chance criterion, which achieves a hit rate of

HProp = P² + (1 − P)²

where P is again the proportion of the sample in the larger group. In the current example, since the
number of observations in each group is the same, we get the same criterion: HProp =
0.50² + 0.50² = 50%. Again, our discriminant model achieves a hit rate far in excess of
what might be expected by chance alone.

Often, the classification results are presented in the form of a classification matrix, a crosstable
of actual group membership and predicted group membership (obtained from the discriminant
model under the classes object in R). The classification matrix for this example is obtained in
R as follows:

> classificationTable=table(hypothetical$Decision,classes)
> rownames(classificationTable)=c("No buy actual class","Buy actual class")
> colnames(classificationTable)=c("No buy predicted class","Buy predicted class")
> classificationTable

classes
No buy predicted class Buy predicted class
No buy actual class 5 0
Buy actual class 0 5

Entries along the diagonal of the classification matrix (the “5” and the “5”) represent correct
classifications (since the predicted group matches the actual group) and all off-diagonal entries
represent misclassifications. Using the classification, it is easy to calculate the proportion of cor-
rect classification, both overall and within each group. We have already considered the overall
hit rate, which is simply (5 + 5)/10 = 100%. But we can also note that 5 of the 5 members of
the a priori ‘No buy’ group were correctly classified, so that the hit rate in that particular group
is 5/5 = 100%. Similarly, the discriminant model correctly classified 5 of the 5 members of the
a priori ‘buy’ group, for a hit rate of 100%. The model is classifying each group with a 100%
hit rate. This will not always be the case with a discriminant analysis.

The correct and misclassification rates are calculated in R using the following codes:

> misclass.rate=function(tab){
+ num1=sum(diag(tab))
+ denom1=sum(tab)
+ signif(1-num1/denom1,3)
+ }
> misRate=misclass.rate(classificationTable)*100
> misRate

[1] 0

> hitRate=(100-misRate)
> hitRate

[1] 100

Method 4: Using posterior probabilities to classify objects


The posterior probabilities are perhaps the classification results that are the easiest to interpret
(and so to present to non-statistical users). They simply give the probability, obtained from the
discriminant model, that object i belongs to group j. Naturally, these must sum to one across
groups for each respondent. The probabilities are called ‘posterior’ probabilities because they
are based on certain outputs of the discriminant model i.e. they can only be calculated after
(“posterior to”) the discriminant model is constructed. The posterior probability of each object
belonging to each group is given in R with the following:

> posteriors=lda.pred$posterior
> posteriors

0 1
1 0.0007788482 0.9992211518
2 0.0841496176 0.9158503824
3 0.0011137731 0.9988862269
4 0.0326561487 0.9673438513
5 0.3330972240 0.6669027760
6 0.8858108627 0.1141891373
7 0.8443056087 0.1556943913
8 0.9154754887 0.0845245113
9 0.9978205989 0.0021794011
10 0.9997192031 0.0002807969

The posterior probabilities have one advantage over the other methods: they make it quite
easy to see which of the classifications are the most uncertain. As an example, the posterior
probability that customer 3 belongs to the ‘buy’ group is 0.9988862269. That is, in the eyes
of the model there is extremely little chance that customer 3 belongs to the ‘no buy’ group
(only 0.11%). In contrast, the posterior probability that customer 7 belongs to the ‘no buy’
group is 0.8443056087. That is, although we classify customer 7 to the ‘no buy’ group (because
that probability is highest), we should bear in mind that there is still a 15.57% chance that the
customer belongs to the ‘buy’ group.

A word on hold-out samples and sample size requirements

When evaluating the ability of the discriminant function to accurately discriminate between
members of different groups, we are to a certain extent cheating by testing the function on
the same set of respondents that we used to build the function in the first place. Ideally, we
should test the function on a group of respondents that were not used in its construction. To see
why this is so, consider the following analogy. Suppose you take a snap sample of five of your
classmates, and find that 2 of the 5 are left-handed women, while the other 3 are right-handed
men. Suppose that now you formulate the rule “If gender is female, then classify as left-handed”.
If you test this rule in your original sample you achieve a hit rate of 100%, but if you test this
rule in the general population you would certainly do much worse than that. The problem here
is that in creating a model you capture not only certain relationships that apply in the general
population, but also small quirks and idiosyncrasies of the group of people that happen to fall
into your sample.

For this reason, it is preferable to test a predictive model on a group of objects that were not
used to build the model. In the hypothetical example in the last paragraph, one would ideally
go out and sample another group of respondents and test the rule on them. The group of
respondents used to build the model is often called the training sample and the group used to
test the model is called the test sample or hold-out sample. Note that we do not necessarily need
to do the sampling in two stages. We can just divide our sample into two, one group serving
as the training sample and the other group serving as the hold-out sample. This approach to
testing a predictive model is known as cross-validation or split-sample validation.

Splitting a sample into two raises two important questions: (1) how big does the sample need
to be in order to do so? (2) what proportion of the sample should go into the training sample
and hold-out sample respectively? A general guideline for discriminant analysis provided in ? is
that there should be at least 5 respondents in the training sample for every independent variable
in the discriminant function, with a ratio of 20 respondents to every independent variable being
preferable. For this reason, we are not going to be able to apply this hold-out sample technique
in our example.

There are no definite guidelines for how to split the sample into training and hold-out samples
when it is large enough to do so. The most common split is to use one half the sample as a
training sample and the other half as a hold-out sample, but other frequently used splits are 60%
training, 40% hold-out and 75% training, 25% hold-out.
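A minimal sketch of a random split into training and hold-out samples in R, assuming a hypothetical data frame called mydata that is large enough to split:

> set.seed(123)   # for a reproducible split
> train.rows=sample(1:nrow(mydata), size=round(0.5*nrow(mydata)))
> train.sample=mydata[train.rows, ]     # used to estimate the discriminant function
> holdout.sample=mydata[-train.rows, ]  # used only to evaluate the hit rate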

8.3 Another worked example of two-group discriminant analysis

One popular application area for discriminant analysis is in medical research, where the technique
is used to determine which symptoms (independent variables measuring physical characteristics
like blood pressure, weight, red blood cell count, etc) are able to discriminate between those
patients who have a disease and those who do not, and to identify those factors which can dis-
criminate between those patients who get better and those who do not recover.

In this section we will consider a study investigating the effectiveness of a pain-relief treatment
on a sample of patients complaining of pain caused by the shingles virus. The study is covered
in ? (for those interested in details, it is about the “effect of iontophoretic treatment with the
nerve conduction-inhibiting chemical vinicristine on elderly patients complaining of post-herpetic
neuralgia”!). Essentially, 18 elderly patients took part in this study, all complaining of pain. Ten
of these patients received an experimental pain-relief treatment, and eight did not. Six weeks
later, all patients were interviewed again to see whether any improvement in their condition had
occurred. The data are shown in Table 8.3. The categorical dependent variable Pain relief takes the
value 1 if the patient reported some improvement, and 2 otherwise. The Treatment variable
takes the value 1 if the patient received the experimental treatment, and 0 otherwise. The Age
(in years), Gender (Male=1, Female=0), and the Duration of symptoms (the number of weeks
that the patient had been suffering from pain) were also recorded.

The data are stored in a txt file in your working directory with the name “[Link]” and are
imported as follows:

> PainReliefData=read.table("[Link]", header=T)


> attach(PainReliefData)

The following object(s) are masked from hypothetical:

ID

Pain X1 : X2 : X3 : X4 : Duration
ID relief Treatment Age Gender of symptoms
1 1 1 76 1 36
2 1 1 52 1 22
3 2 0 80 0 33
4 2 1 77 1 33
5 2 1 73 0 17
6 2 0 82 0 84
7 2 1 75 1 24
8 2 0 78 0 96
9 1 1 83 0 61
10 1 1 75 0 60
11 2 0 62 1 8
12 2 0 74 0 35
13 1 1 78 0 3
14 1 1 70 0 27
15 2 0 72 1 60
16 1 1 71 0 8
17 2 0 74 0 5
18 2 0 81 0 26

Table 8.3: Data for pain relief example

> names(PainReliefData)

[1] "ID" "PainRelief" "Treatment" "Age" "Gender" "Duration"

Our primary aim here is to answer the question “what are the variables that are responsible for
someone experiencing an improvement in their condition?”. Put another way, this asks which
of the four independent variables, if any, is able to discriminate between the group of patients
who did experience pain relief and the group who did not. Of particular interest is whether the
treatment variable is a significant discriminating variable i.e. whether the treatment is effective
at relieving pain. We will split the discriminant analysis procedure into two parts:

(a) Estimating the discriminant function and assessing its overall quality
(b) Interpreting the discriminant function

8.3.1 Estimation and assessment of the discriminant function

In this example, we will use the simultaneous estimation approach. The discriminant function
is estimated in R to be

Z = −4.72 − 2.95X1 + 0.08X2 + 1.38X3 − 0.01X4

These coefficients can be extracted with the coefficients() function defined previously for the
hypothetical example.

> formula=PainRelief ~ Treatment+Age +Gender+ Duration


> data=PainReliefData
> fit <- lda(formula, data, method="moment")
> coefficients(formula,data,2)

LD1
constant -4.72364953
Treatment -2.94760733
Age 0.08512076
Gender 1.38325090
Duration -0.01138514

Importantly, one cannot directly interpret the coefficient signs without knowing the group cen-
troids. That is, it is incorrect to say that the treatment variable X1 has a negative effect on
pain relief. The correct interpretation of the direction of the effect will be discussed in the next
section. The discriminant function coefficients can be individually tested for significance, but
we will also address this point in the next section. Here, we are only trying to establish whether
the overall discriminant function is able to adequately discriminate between the two groups.

The ability of the discriminant function to discriminate between the two groups is evaluated in
the following two ways.

• By testing whether the group centroids of the two groups are significantly different from
one another.
• By calculating and evaluating the overall hit rate (the proportion of correct classifications).

We will consider each of these in turn.

Testing the differences between centroids for significance

In order to test whether group centroids are significantly different from one another, we first
need to calculate the centroid of each group. This can be done in two ways.

(a) Calculate a discriminant score for each patient by substituting their Xi values into the
discriminant function. For example, patient 1’s discriminant score is given by Z = −4.72 −
2.95(1) + 0.08(76) + 1.38(1) − 0.01(36) = −0.23. Once the discriminant score of each patient
has been calculated, the centroid of each group can be found by taking the average of the
discriminant scores of all members of that group.
(b) Calculate the average value of each of the independent variables in each of the groups. For
example, let X̄i(j) be the average value of independent variable Xi in group j (e.g. X̄2(1) is
the average value of X2 (Age) in group 1, the ‘improved’ group). Then the centroids
can be calculated by substituting the average values into the discriminant function, i.e. by

Z̄(1) = −4.72 − 2.95X̄1(1) + 0.08X̄2(1) + 1.38X̄3(1) − 0.01X̄4(1)
Z̄(2) = −4.72 − 2.95X̄1(2) + 0.08X̄2(2) + 1.38X̄3(2) − 0.01X̄4(2)

where Z̄ (1) denotes the centroid for group 1 and Z̄ (2) denotes the centroid for group 2.
Since we used the first method in the previous example, let us use the other method to
calculate group centroids in this case. The group means of each of the independent variables
is obtained in R with the following:
> ratioTreatment=tapply(PainReliefData[,3], PainRelief, mean)
> meanAge=tapply(PainReliefData[,4], PainRelief, mean)
> ratioGender=tapply(PainReliefData[,5], PainRelief, mean)
> meanDuration=tapply(PainReliefData[,6], PainRelief, mean)
> means=round(rbind(ratioTreatment,meanAge,ratioGender,meanDuration),3)
> means

Improved Same
ratioTreatment 1.000 0.273
meanAge 72.143 75.273
ratioGender 0.286 0.364
meanDuration 31.000 38.273

Group centroids are thus given by

Z̄ (1) = −4.72 − 2.95(1) + 0.08(72) + 1.38(0.29) − 0.01(31) = −1.49


Z̄ (2) = −4.72 − 2.95(0.27) + 0.08(75) + 1.38(0.36) − 0.01(38) = 0.95

These centroids can be calculated in R after fitting the discriminant function:

> fit <- lda(formula, data, method="moment")


> fitPredict=predict(fit)
> LD1=fitPredict$x
> centroid1=sum(LD1*(PainReliefData$PainRelief=="Improved"))/sum(PainReliefData$PainRelief=="Improved")
> centroid2=sum(LD1*(PainReliefData$PainRelief=="Same"))/sum(PainReliefData$PainRelief=="Same")
> centroids=rbind(centroid1,centroid2)
> colnames(centroids)=c("Mean score (centroid)")
> rownames(centroids)=c("Improved", "Same")
> centroids

Mean score (centroid)


Improved -1.4881265
Same 0.9469896

In order to find out whether the discriminant function is able to significantly discriminate between
the two groups, we need to test whether the two group centroids are significantly different from
one another. The difference between centroids can easily be calculated as d = 0.95 − (−1.49) =
2.44. Usually squared distances are used, i.e. d² = 5.95. This distance can be tested for statistical
significance using an F -test. The hypotheses are

H0 : Z̄(1) = Z̄(2)
H1 : Z̄(1) ≠ Z̄(2)

The F -statistic obtained from R is 5.15 (check this yourself!), and has 4 and 13 degrees of free-
dom. This can be compared against critical values of the F -distribution using F -tables, but the
exact p-value associated with the test is also given by R as 0.01 (Check this with R). Thus, there
is evidence to suggest that the group centroids are significantly different from one another and
that the discriminant function is therefore able to significantly discriminate between groups.

Since there is only one discriminant function we can obtain the same F value using the overall
significance test of the model. In R:

> independentVars<-as.matrix(PainReliefData[,c(3,4,5,6)])
> manova1<-manova(independentVars ~ PainReliefData$PainRelief)
> wilks.summary<-summary(manova1,test="Wilks")
> wilks.summary

Df Wilks approx F num Df den Df Pr(>F)


PainReliefData$PainRelief 1 0.38679 5.1525 4 13 0.01038 *
Residuals 16
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

It is clear that the F -value obtained from the Wilks’ lambda test is the same as that obtained from
the formula above.
Evaluating the hit rate

There is an additional way to assess the overall effectiveness of the discriminant function, and
that is to examine the proportion of correct classification that it makes (the “hit rate”).
Method 1: Using discriminant functions to classify objects
We have already looked at an example of using a discriminant function to classify objects in
the earlier new product example. The first step involves calculating a cut-off value based on the
group centroids; the second step involves classifying objects depending on whether they lie below
or above the cut-off value. For cases where there are two equal-sized groups, the cut-off value is
always equal to zero, and all that is required is to see which of the group centroids is negative
and which is positive. For unequal groups, the calculation of the cut-off value is a weighted sum
of the group centroids
Cut-off value = (n2 Z̄(1) + n1 Z̄(2)) / (n1 + n2 )          (8.3)
where n1 and n2 are the number of objects in group 1 and 2 respectively. In the current
application, the cut-off value is given by

Cut-off value = (11(−1.49) + 7(0.95)) / 18 = −0.54
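The same cut-off can be computed in R from the centroid objects created earlier (n1 and n2 are introduced here just for the calculation):

> n1=sum(PainReliefData$PainRelief == "Improved")   # 7 patients in the 'improved' group
> n2=sum(PainReliefData$PainRelief == "Same")       # 11 patients in the 'no relief' group
> (n2*centroid1 + n1*centroid2)/(n1 + n2)           # approximately -0.54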
Once we have an appropriate cut-off value, we can observe that the group centroid for the
‘improved’ group is below the cut-off (Z̄ (1) = −1.49) while the group centroid for the ‘no relief’
group is above the cut-off (Z̄ (2) = 0.95). Therefore, any patient lying below the cut-off of -0.54
should be regarded as ‘closer’ to the centroid of the ‘improved’ group, and any patient lying
above the cut-off of -0.54 should be regarded as ‘closer’ to the centroid of the ‘no relief’ group.
This gives the following decision rule that can be used to classify individual patients

Classification rule: If Z ≤ −0.54, then classify patient to the ‘improved’ group. If


Z > −0.54, then classify patient to the ‘no relief’ group.

The case of Z = −0.54 is tricky because in that case the patient is equidistant from the two
group centroids. Strictly speaking, the patient could be classified into either group. Here, we have
arbitrarily classified the case of Z = −0.54 into the ‘improved’ group. In practice, this should
not cause any concern because it is unlikely that even one patient would have a discriminant
score of exactly –0.54.

We can now use the decision rule above to classify each patient to a group depending on their
discriminant score. For example, patient 4’s discriminant score is Z = −0.11 (check this for
yourself). Since this is above the cut-off of –0.54, the discriminant analysis model classifies
patient 4 into the ‘no relief’ group. In fact, Table 8.3 shows that patient 4 is in the ‘no relief’
group, and thus the model has made a correct classification. Patient 5, on the other hand, has
a discriminant score of Z = −1.65 (again, check this). Since this is below the cut-off of –0.54,
the discriminant analysis model classifies patient 5 into the ‘improved’ group. But, as Table 8.3
shows, patient 5 belongs to the ‘no relief’ group (that is, in reality patient 5 did not improve),
and thus the model has made an incorrect classification here. The classifications made for each

patient, together with the a priori group membership and the result of the model classification,
is obtained in R with the following:

> class.table=data.frame(priorClass=PainReliefData$PainRelief, fitPredict$x,
+ posteriorClass=fitPredict$class,decision=PainReliefData$PainRelief==fitPredict$class)
> class.table

priorClass LD1 posteriorClass decision


1 Improved -0.2286929 Same FALSE
2 Improved -2.1121993 Improved TRUE
3 Same 1.7103020 Same TRUE
4 Same -0.1094167 Same TRUE
5 Same -1.6509885 Improved FALSE
6 Same 1.2999016 Same TRUE
7 Same -0.1771921 Same TRUE
8 Same 0.8227969 Same TRUE
9 Improved -1.3007269 Improved TRUE
10 Improved -1.9703078 Improved TRUE
11 Same 1.8460075 Same TRUE
12 Same 1.1768071 Same TRUE
13 Improved -1.0659928 Improved TRUE
14 Improved -2.0202022 Improved TRUE
15 Same 2.1051881 Same TRUE
16 Improved -1.7187638 Improved TRUE
17 Same 1.5183612 Same TRUE
18 Same 1.8751187 Same TRUE

The table shows that of the 18 respondents, two were misclassified (patient 1 and patient 5).
This gives a correct classification or hit rate of 16/18 or 89%. This would appear to be excellent,
but is there any kind of standard that we can compare the hit rate to? What represents a
‘good’ hit rate? Some guidance is given by considering the proportion of correct classifications
that could be achieved by chance alone i.e. without any statistical model. We consider two such
‘chance’ models.
• Maximum chance criterion: Suppose we have two groups, with say 80 people in Group A and
20 people in Group B. Then by classifying all people into the larger Group A, we will be
correct for 80% of the sample i.e. a hit rate of 80%. In general, by classifying all respondents
into the larger of two groups, we can achieve a hit rate of
HMax = P
where P is the proportion of the sample belonging to that group. In the example above,
we can achieve a hit rate of 11/18 (61%) by classifying all patients as ‘no relief’. Thus, any
discriminant function must achieve a hit rate of more than 61% if it is to be considered
worthwhile. Clearly, our model does just this.
• Proportional chance criterion The maximum chance criterion is a good basis for evaluating
a discriminant function if the only aim is to maximise the overall percentage of correct clas-
sifications. However, it suffers from one weakness. Clearly, the maximum chance criterion
is unable to correctly predict any of the objects in the smaller group. In most situations,
the aim of the analysis is to be able to correctly predict members of both groups, and a re-
searcher would be happy to accept a small reduction in the overall accuracy if it meant that
both groups could be predicted equally well. In these cases, a more appropriate criterion
for comparison is the proportional chance criterion, which achieves a hit rate of
HProp = P² + (1 − P)²

where P is again the proportion of the sample in the larger group. In the current example, HProp =
0.61² + 0.39² = 52%, which sets a less strict standard for the discriminant function. Again,
our discriminant model achieves a hit rate far in excess of what might be expected by chance
alone (both chance criteria are computed in the sketch after this list).
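Both chance criteria can be computed in R as follows (P, H.max and H.prop are names introduced here for illustration):

> P=max(table(PainReliefData$PainRelief))/nrow(PainReliefData)   # proportion in the larger group, 11/18
> H.max=P                 # maximum chance criterion, about 0.61
> H.prop=P^2 + (1-P)^2    # proportional chance criterion, about 0.52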

Often, the classification results are presented in the form of a classification matrix, a crosstable
of actual group membership (obtained from the Pain relief variable in Table 8.3 or from the a
priorClass column in the previous R output) and predicted group membership (obtained from
the discriminant model i.e. the ‘posteriorClass’). The classification matrix for this example is
obtained in R.

> classificationTable=table(PainReliefData$PainRelief,fitPredict$class)
> rownames(classificationTable)=c("Improved actual class","Same actual class")
> colnames(classificationTable)=c("Improved predicted class","Same predicted class")
> classificationTable

Improved predicted class Same predicted class


Improved actual class 6 1
Same actual class 1 10

Entries along the diagonal of the classification matrix (the “6” and the “10”) represent correct
classifications (since the predicted group matches the actual group) and all off-diagonal entries
represent misclassifications. Using the classification, it is easy to calculate the proportion of
correct classification, both overall and within each group. We have already considered the
overall hit rate, which is simply (6 + 10)/18 = 89%. The overall hit rate is calculated in R as
follows:

> misRate=misclass.rate(classificationTable)*100
> misRate

[1] 11.1

> hitRate=100-misRate
> hitRate

[1] 88.9

But we can also note that 10 of the 11 members of the a priori ‘No relief’ group were correctly
classified, so that the hit rate in that particular group is 10/11 = 91%. Similarly, the discrimi-
nant model correctly classified 6 out of the 7 members of the a priori ‘Improved’ group, for a hit
rate of 86%. The model is therefore roughly equally capable of classifying those patients who
experience pain relief and those who do not. This will not always be the case with a discriminant
analysis. Often, the model is considerably better at classifying the members of one of the groups.

Method 4: Using posterior probabilities to classify objects


The posterior probabilities are perhaps the classification results that are the easiest to interpret
(and so to present to non-statistical users). They simply give the probability, obtained from the
discriminant model, that object i belongs to group j. Naturally, these must sum to one across
groups for each respondent. The probability that an object belongs to a particular group decreases
as its Mahalanobis distance to that group centroid increases, although the relationship is not exactly
proportional. The probabilities are called ‘posterior’ probabilities because they

are based on certain outputs of the discriminant model i.e. they can only be calculated after
(“posterior to”) the discriminant model is constructed. The posterior probability of each object
belonging to each group is obtained in R with the following lines of codes:

> posteriors=fitPredict$posterior
> posteriors

Improved Same
1 0.364946858 0.63505314
2 0.982580635 0.01741936
3 0.005088682 0.99491132
4 0.300607166 0.69939283
5 0.948313968 0.05168603
6 0.013704091 0.98629591
7 0.336402478 0.66359752
8 0.042514852 0.95748515
9 0.886609827 0.11339017
10 0.975566846 0.02443315
11 0.003661940 0.99633806
12 0.018405789 0.98159421
13 0.815321427 0.18467857
14 0.978301530 0.02169847
15 0.001951452 0.99804855
16 0.955830192 0.04416981
17 0.008096157 0.99190384
18 0.003412191 0.99658781

Finally, note that the classification and results columns are precisely the same as those obtained
using any and all of the previous classification methods.

The posterior probabilities have one advantage over the other methods: they make it quite easy
to see which of the classifications are the most uncertain. As an example, the posterior proba-
bility that patient 14 belongs to the ‘improved’ group is 0.98. That is, in the eyes of the model
there is extremely little chance that patient 14 belongs to the ‘no relief’ group (only about 2%). In
contrast, the posterior probability that patient 7 belongs to the ‘no relief’ group is 0.66. That
is, although we classify patient 7 to the ‘no relief’ group (because that probability is highest),
we should bear in mind that there is still a 34% chance that the patient belongs to the ‘improved’
group. This classification is therefore far more uncertain than the classification of patient 14.

Interpretation of the discriminant function

The analysis in the previous section established that the discriminant function that we have
constructed is (a) able to significantly discriminate between the two groups, and (b) is able to do
so very well, with a hit rate of 89%. Having done this, we would now like to be able to establish
two more things:
(a) which of the independent variables are significant and which are not significant,
(b) the nature/direction of the relationship between any significant independent variables and
the grouping variable.
We will consider each of these tasks in turn.

Assessing the significance of the independent variables



The significance of the independent variables is determined in a similar way to multiple re-
gression. That is, we examine the magnitude of the discriminant function coefficient assigned
to each of the independent variables, and test it for statistical significance using an F -test. If
independent variable Xi does not contribute significantly to the discriminatory ability of the
model, then the associated coefficient should be close to zero. The details of this test are beyond
the scope of the course, and we will consider only the interpretation of the output. This is applied
in R using the greedy.wilks() stepwise function with a 5% significance level:

> greedy.wilks(PainRelief ~ Treatment+Age+Gender+Duration,data=PainReliefData,0.05)

Formula containing included variables:

PainRelief ~ Treatment
<environment: 0x00000000090ff270>

Values calculated in each step of the selection procedure:

       vars Wilks.lambda F.statistics.overall p.value.overall F.statistics.diff
1 Treatment    0.4909091             16.59259    0.0008845647          16.59259
  p.value.diff
1 0.0008845647

The Treatment variable is highly significant, and we would strongly reject the null hypothesis
that w1 = 0 at even the 1% level of significance. The Gender and Age variables are not significant
at the 5% level but can be considered very weakly significant, as they do discriminate significantly
at the 28% level (check this yourself). Duration clearly plays no role in the discriminatory power
of the model.

Describing the relationships between the independent and grouping variables

Establishing that an independent variable is capable of significantly discriminating between two


groups is the same as saying that the two groups differ with respect to their means for that
variable. In this case, we must then ask how the group means differ. Is group 1 higher or lower
than group 2? In our example, it is critical to know not just that the treatment has a significant
effect, but to check that the ‘improved’ group has a higher mean treatment rate than the ‘no
relief’ group (it is possible to think of harmful treatments which may cause significantly less
pain relief; one would not want to mistake a significantly harmful treatment for a significant
benefit!). The simplest and most effective way to describe these relationships is to examine the
mean response to each significant independent variable in each group. We already considered
the group means before when calculating group centroids, but we repeat the means here for all
variables that are significant at the 20% level (it would be more common to use a significance
level of 5% or 10%, but we use 20% for illustratory purposes so that we have more than just one
variable to consider).

> means

Improved Same
ratioTreatment 1.000 0.273
meanAge 72.143 75.273
ratioGender 0.286 0.364
meanDuration 31.000 38.273

The Treatment and Gender variables are binary, so that the group means for those variables are
correctly interpreted as the proportion of patients who have responses of “1” on that variable.
Therefore, 100% of the patients in the ‘Improved’ group received treatment, compared to only
27% of those in the ‘No relief’ group. This indicates that the chances of pain relief are significantly
improved by the treatment, which was the key aim of the study. It also appears that there are
slightly fewer males (who were coded “1”) in the ‘Improved’ group (29%) than in the ‘No relief’
group (36%). This suggests that pain relief is more frequently achieved by females than males.
Finally, the average age in the ‘No relief’ group is marginally higher than in the ‘Improved’
group, indicating that pain relief is more frequently achieved by younger patients. It should be
noted, however, that the differences in the average age and gender distribution of the two groups
are only very marginally significant, and that the means are in fact quite similar.
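These group means can be reproduced directly from the data. A minimal sketch, assuming the variable names used in the earlier formula (PainRelief, Treatment, Age, Gender, Duration) and that Treatment and Gender are coded 0/1:

> aggregate(cbind(Treatment, Age, Gender, Duration) ~ PainRelief,
+           data = PainReliefData, FUN = mean)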

8.4 Discriminant analysis with more than two groups

When moving from a discriminant analysis with two groups to one with more than two groups,
the main difference is that for each additional group, we also get an additional discriminant
function. That is, the number of discriminant functions that we have is equal to the number
of groups minus one.1 When the dependent variable has three groups, discriminant analysis
estimates two discriminant functions, and uses them together to discriminate between members
of the three groups. The rest of the analysis is largely the same as for when there are two groups.

As a hypothetical illustration of a three-group discriminant analysis, let us consider the following


example taken from ?. Suppose that a service provider has been able to survey 15 of its customers,
asking them to evaluate the company on the price of its offering and the level of service given.
Both attributes are measured on a 1 (terrible) to 7 (superb) scale. Also, the customers were
asked to indicate whether they (1) intended to switch to a competitor in the next year, (2) were
undecided, (3) had no intentions of leaving in the next year. The aim of a discriminant analysis
here is, as before, to explain the differences in group membership using the two independent
variables at our disposal.

ID Group Price Service


1 Switch 2 2
2 Switch 1 2
3 Switch 3 2
4 Switch 2 1
5 Switch 2 3
6 Undecided 4 2
7 Undecided 4 3
8 Undecided 5 1
9 Undecided 5 2
10 Undecided 5 3
11 Stay 2 6
12 Stay 3 6
13 Stay 4 6
14 Stay 5 6
15 Stay 5 7

Table 8.4: Data for three-group switching intentions example


1
In fact, the number of discriminant functions is equal to the minimum of the number of groups minus one, and
the number of independent variables in the model i.e. ND = min(NG − 1, p) where ND is the number of discriminant
functions, NG is the number of groups and p is the number of independent variables. In nearly all real-world applications,
however, the number of independent variables is greater than the number of groups, so that the number of discriminant
functions will be NG − 1.

Because the dependent variable has three groups, discriminant analysis needs to fit two discrim-
inant functions. Each of these are linear combinations of the independent variables i.e.

Z1 = w11 X1 + w12 X2
Z2 = w21 X1 + w22 X2

That is, the discriminant analysis procedure must choose discriminant function coefficients w11 ,
w12 , w21 , and w22 . For illustration purposes, suppose that we choose the first discriminant
function to have the simplest possible form: Z1 = X1 , and the second discriminant function to
be Z2 = X2 . We can then represent each customer by their score on each of the two discrimi-
nant functions (this will just be their score on each of the independent variables). Because each
respondent has two discriminant scores, we can plot these scores on a two-dimensional graph.
This is shown in Figure 8.3 (circles indicate switching customers, rectangles indicate undecided
customers, and pentagons indicate staying customers).

[Figure: the 15 customers plotted at their (Z1 = X1, Z2 = X2) coordinates and labelled by ID; switching customers (circles) lie at low price and low service ratings, undecided customers (rectangles) at high price but low service ratings, and staying customers (pentagons) at high service ratings.]

Figure 8.3: Graphical display of discriminant scores in switching intentions example

As for two-group discriminant analysis, we can establish regions containing the members of each
group. The regions are clearly marked in Figure 8.3: the top region contains those who intend to
stay, the bottom left region those who intend to switch, and the bottom right region those who
are undecided. The only difference in moving to a three-group analysis is that the regions are
now two-dimensional. Again, we can specify cut-offs defining each region. In Figure 8.3, if
Z2 ≥ 4.2 the classification is “Intend to stay” regardless of the score on Z1 . If Z2 < 4.2 and
Z1 < 3.5, the classification is “Intend to switch”, and if Z2 < 4.2 and Z1 ≥ 3.5, the classification
is “Undecided”. Note how the cut-offs are also defined over two dimensions.
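These cut-offs translate directly into a simple classification rule. An illustrative sketch in R, using the cut-off values 4.2 and 3.5 read off Figure 8.3 (the function name is ours, not part of any package):

> classifyCustomer <- function(Z1, Z2) {
+   if (Z2 >= 4.2) "Stay"
+   else if (Z1 < 3.5) "Switch"
+   else "Undecided"
+ }
> classifyCustomer(Z1 = 5, Z2 = 2)   # customer 9, for example, is classified as "Undecided"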

The rest of the analysis remains largely the same as for when we have only two groups. Again,
we can calculate group centroids and compare these for statistical significance. Because there are
two discriminant functions, each centroid will have two co-ordinates: a Z1 component consisting
of the average Z1 values of all members of the group, and a Z2 component consisting of the
average Z2 values of all members of the group. The group centroids for the current example are
shown in Table 8.5. Clearly, since Z1 = X1 and Z2 = X2 , the group centroids will be the same
as the group means for the respective independent variables.

Group Z̄1 Z̄2


Switch 2.0 2.0
Undecided 4.6 2.2
Stay 3.8 6.2

Table 8.5: Group centroids for switching intentions example
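As a quick check, the centroids in Table 8.5 can be reproduced from the data in Table 8.4; with Z1 = X1 and Z2 = X2 they are simply the group means of Price and Service. A sketch, typing the data frame in from Table 8.4:

> switchData <- data.frame(
+   Group   = rep(c("Switch", "Undecided", "Stay"), each = 5),
+   Price   = c(2, 1, 3, 2, 2,  4, 4, 5, 5, 5,  2, 3, 4, 5, 5),
+   Service = c(2, 2, 2, 1, 3,  2, 3, 1, 2, 3,  6, 6, 6, 6, 7))
> aggregate(cbind(Price, Service) ~ Group, data = switchData, FUN = mean)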

The fitting of a second discriminant function provides greater flexibility in forming ways to
discriminate between groups. In this example, it is clear that it is service level (X2) that
discriminates between those intending to stay and the other two groups. That is, those rating
the company highly on service tend to stay regardless of what rating they give on price. For those
who rate the company poorly on service, the price rating (X1) becomes critical in distinguishing
between customers who intend to switch (who rate the company poorly on price as well) and
those who are undecided (who rate the company average or above on price).

8.5 A worked example of multi-group discriminant analysis

In this section we consider a real-world application of discriminant analysis to a case where


the dependent variable has three groups. Since the construction of a three-group discriminant
model is in many ways the same as for a two-group model, we use this example to introduce
some additional modelling complexities as well. These features are (1) the creation of a cate-
gorical dependent variable from a numeric one, (2) the use of training and hold-out samples,
(3) the inclusion of categorical independent variables using dummy variables, and (4) a stepwise
estimation approach.

The study is an investigation into which factors contribute to air pollution along a particular
stretch of road. The data are a sample of 500 observations collected by the Norwegian Public
Roads Administration at a particular point on a busy stretch of road between October 2001 and
August 2003. The variables on which data were collected are as follows:

• NO2: Concentration of nitrogen dioxide (NO2) particles,


• X1 : Number of cars passing per hour (Cars),
• X2 : Temperature 2 meters above the ground (degree C) (Temp),
• X3 : Wind speed (meters/second) (WindSp),
• X4 : Wind direction (degrees between 0 and 360) (WindDir),
• X5 : Time of day (morning/afternoon/evening) (Time).

The dependent variable indicating the concentration of NO2 is numeric, and so we could treat
this research problem using multiple regression. However, we can also always convert a numeric
variable into a categorical one by dividing it up into a specified number of categories. Sometimes,
it will be of more interest to analyse this converted categorical variable. For example, suppose
that legal requirements state that if the NO2 concentration is above a certain level, a clean-up
operation must be started; that if the concentration is below another level, a financial reward is
given to the municipality in charge of the road; and that between these two levels no action is taken.
In this case, it might make more sense to analyse the category an object belongs to rather than
the raw NO2 level. That is the approach we will take here i.e. to divide the observations into
three groups, which we will call Group A (NO2 < 30), Group B (30 ≤ NO2 < 60), and Group C
(NO2 ≥ 60).
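This kind of recoding is easy to do in R with the cut() function. A sketch, assuming the raw concentrations are stored in a numeric vector called NO2 (in the data file used below the resulting categories are coded 1, 2 and 3 rather than A, B and C):

> NO2Cat <- cut(NO2, breaks = c(-Inf, 30, 60, Inf),
+               labels = c("A", "B", "C"), right = FALSE)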

One of the independent variables that we want to include, indicating the time of day, is categor-
ical, measured only according to whether the observation was taken in the morning, afternoon,
or evening. To include this variable in the discriminant model, we must first choose a reference

category and then recode the variable into two new dummy variables. We will arbitrarily choose
“evening” as the reference category and create two dummy variables to compare the effect of the
other two categories relative to evening. These dummy variables are labelled X6 (IMorn) and
X7 (IAft). Data for the first five observations is shown in Table 8.6 after all recoding has been
done.

Obs NO2 NO2Group Cars Temp WindSp WindDir IMorn IAft


1 41.2 2 2189 9.2 4.8 74.4 0 0
2 22.2 1 2206 6.4 3.5 56 0 1
3 27.5 1 123 -3.7 0.9 281.3 0 0
4 80.5 3 1045 -7.2 1.7 74 0 0
5 77.2 3 1841 -1.3 2.6 65 1 0

Table 8.6: Data for air pollution example (first five observations only)
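The dummy coding itself amounts to two logical comparisons. A sketch, assuming the time-of-day variable is called Time and has the levels "morning", "afternoon" and "evening" (the level labels are an assumption for illustration):

> IMorn <- as.numeric(Time == "morning")
> IAft  <- as.numeric(Time == "afternoon")

Both dummies equal zero for an evening observation, so "evening" acts as the reference category.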

The next complexity that we introduce is the use of a training and hold-out sample. Because we
have a fairly large sample of 500 observations, we can afford to keep some of that sample aside
for use in validating the results of the discriminant model i.e. as a hold-out sample. We will use
a 50-50 split, using one half of the sample to fit the discriminant function and the other half to
validate the model. This leaves 250 observations in the training sample, which (because there
are 6 independent variables once the dummy variables are introduced) gives a ratio of just more
than 40 respondents per independent variable. This is well above even the guideline ratio of 20.
The data, stored in a txt file named [Link], is read into R, and the training sample is
then specified in order to run the discriminant analysis with the following code:

> NO2Pollution=[Link]("[Link]", header=T)


> training=[Link](NO2Pollution[NO2Pollution$Train=="1",])
> holdout=[Link](NO2Pollution[NO2Pollution$Train=="0",])
> attach(training)

The following object(s) are masked from PainReliefData:

ID
The following object(s) are masked from hypothetical:

ID
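The file used here already contains a Train indicator column. If such an indicator were not available, a random 50-50 split could be generated along the following lines (a sketch; set.seed() simply makes the split reproducible):

> set.seed(123)
> trainID <- sample(1:nrow(NO2Pollution), size = 250)
> training <- NO2Pollution[trainID, ]
> holdout  <- NO2Pollution[-trainID, ]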

8.5.1 Estimation and assessment of the discriminant functions

We use a stepwise approach to estimate the coefficients of the two discriminant functions, so
that only those variables that are significant at a pre-specified level (a level of 5% was used in
this example) will be included in the final model.
This is applied using the [Link]() function in R:

> [Link](NO2Cat~Cars+temp+windsp+winddir+IMorn+IAft,data=training, 0.05)

Formula containing included variables:

NO2Cat ~ Cars + windsp + temp


<environment: 0x0000000008fe8db8>

Values calculated in each step of the selection procedure:

vars [Link] [Link] [Link] [Link]


1 Cars 0.7687253 37.15557 7.811596e-15 37.155569
2 windsp 0.6229865 32.83528 2.778303e-24 28.774098
3 temp 0.5935398 24.33676 2.910303e-25 6.077472
[Link]
1 7.811596e-15
2 5.841883e-12
3 2.652914e-03

Here it is clear that we will choose to use Cars, windsp and temp variables for the discriminant
model. The coefficients for the two discriminant functions are obtained with the following R
code:

> formula=NO2Cat~Cars+temp+windsp
> data=training
> coefficients(formula,data,3)

LD1 LD2
constant -0.3942446139 1.8483824937
Cars 0.0009814663 -0.0004586329
temp -0.0579196467 -0.0506396721
windsp -0.3835792675 -0.3409784004

Therefore the two discriminant functions that were estimated are given by

Z1 = −0.394 + 0.001X1 − 0.058X2 − 0.384X3
Z2 = 1.848 − 0.0005X1 − 0.051X2 − 0.341X3

That is, only X1 (the number of cars passing by), X2 (temperature), and X3 (wind speed) are
significant discriminators. The other variables (X4 , wind direction; and the two dummy variables
indicating time of day) are not significant predictors. We can therefore immediately say that
the average wind direction and time distributions do not differ between the three air pollution
groups.
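As an arithmetic illustration, substituting the first observation from Table 8.6 (Cars = 2189, Temp = 9.2, WindSp = 4.8) into the first discriminant function with the rounded coefficients gives

Z1 = −0.394 + 0.001(2189) − 0.058(9.2) − 0.384(4.8) ≈ −0.58

(the score produced by R will differ slightly because it uses the unrounded coefficients).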

With the discriminant function coefficients estimated, we can move on to consider whether they
do an adequate job of discriminating between the three groups. Once again, after testing the
overall significance of the analysis, we use two approaches to evaluate this.
To obtain the overall significance of the model we perform a Wilks' lambda test in R:

> independentVars<-[Link](training[,c(4,5,11)])
> manova1<-manova(independentVars ~ training$NO2Cat)
> [Link]<-summary(manova1,test="Wilks")
> [Link]

Df Wilks approx F num Df den Df Pr(>F)


training$NO2Cat 1 0.59795 55.135 3 246 < 2.2e-16 ***
Residuals 248
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

According to the R output, it is clear that the discriminant analysis is significant overall.
Having established this, we first test the centroids for significant differences, and then we
construct a classification matrix and evaluate the hit rate. It is important to bear in mind that
the test of centroids is performed using the training sample, whereas the hit rate is evaluated
using the hold-out sample.

Testing the differences between centroids for significance

Because there are three groups, there are three pairs of centroids that can be compared for
significant differences. We can compare Group A to B, Group A to C, and Group B to C. In
each case, the hypothesis tested remains the same as for the two-group example e.g. to test for
a significant difference in the centroids of Group A and B, the hypotheses are

H0 : Z̄(A) = Z̄(B)
H1 : Z̄(A) ≠ Z̄(B)

The actual calculation of the group centroids is a little more complicated because there are now
two discriminant functions, but proceeds in essentially the same way as before. That is, we
find the group means on each of the independent variables and plug them into the discriminant
functions. The group means for all independent variables in the training sample are shown in
Table 8.7, together with the number of observations in each of the groups.

Group n Cars Temp WindSp WindDir IMorn IAft


A 83 932 1.75 3.86 131 0.28 0.17
B 85 1761 0.95 3.03 157 0.38 0.35
C 82 2269 0.01 2.29 158 0.35 0.40

Table 8.7: Group means for training sample

The group centroids can then be calculated by substituting these means into the discriminant
functions. For example, the group centroid for Group A is given by
Z̄1(A) = −0.394 + 0.001(932) − 0.058(1.75) − 0.384(3.86) ≈ −1.05
Z̄2(A) = 1.848 − 0.00046(932) − 0.051(1.75) − 0.341(3.86) ≈ 0.02

The centroid for Group A is therefore approximately the point (−1.05, 0.02); the more precise
values obtained from R below are (−1.060, 0.017), the small difference being due to rounding of
the coefficients and group means. Each centroid can be graphed in two dimensions and compared
to the other group centroids (the centroid of Group B is (0.116, −0.041); the centroid of Group C
is (0.952, 0.025)) to see whether the differences between them are significant. These centroids are
obtained with the following R code:

> fit <- lda(formula,data=training, method="moment")


> fitPredict=predict(fit)
> LDscores=fitPredict$x
> centroid1G1=sum(LDscores[,1]*(training$NO2Cat==1))/sum(training$NO2Cat==1)
> centroid2G1=sum(LDscores[,2]*(training$NO2Cat==1))/sum(training$NO2Cat==1)
> centroid1G2=sum(LDscores[,1]*(training$NO2Cat==2))/sum(training$NO2Cat==2)
> centroid2G2=sum(LDscores[,2]*(training$NO2Cat==2))/sum(training$NO2Cat==2)
> centroid1G3=sum(LDscores[,1]*(training$NO2Cat==3))/sum(training$NO2Cat==3)
> centroid2G3=sum(LDscores[,2]*(training$NO2Cat==3))/sum(training$NO2Cat==3)
> centroids=matrix(c(centroid1G1,centroid1G2,centroid1G3,centroid2G1,centroid2G2,
+ centroid2G3),nrow=3,ncol=2,dimnames=list(c("Group1","Group2","Group3"),
+ c("Centroid1","Centroid2")))
> centroids

Centroid1 Centroid2
Group1 -1.0599081 0.01748246
Group2 0.1161243 -0.04107599
Group3 0.9524611 0.02488312
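The same centroids can be obtained more compactly by averaging the discriminant scores within each group, for example with aggregate() (a sketch using the LDscores object created above):

> aggregate(as.data.frame(LDscores), by = list(Group = training$NO2Cat), FUN = mean)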

The centroids are plotted in R using:

> plot(fit)
> points(centroids[1,1],centroids[1,2], col="blue", pch=19)
> text(centroids[1,1]+0.1,centroids[1,2]+0.1, "C1", cex = 0.5, col="blue")
> points(centroids[2,1],centroids[2,2], col="red", pch=19)
> text(centroids[2,1]+0.1,centroids[2,2]+0.1, "C2", cex = 0.5, col="red")
> points(centroids[3,1],centroids[3,2], col="green", pch=19)
> text(centroids[3,1]+0.1,centroids[3,2]+0.1, "C3", cex = 0.5, col="green")
[Figure: output of plot(fit) — the training observations plotted on the two discriminant functions (LD1 on the horizontal axis, LD2 on the vertical axis) and labelled by group (1, 2 or 3), with the three group centroids added as the points C1, C2 and C3.]

Again, the ‘difference’ between any two group centroids is measured by the Mahalanobis distance
(which, since there are now two dimensions, is in general not the same as Euclidean distance).
The distances are calculated in R:

> distanceG1G2=(centroids[1,1]-centroids[2,1])^2+(centroids[1,2]-centroids[2,2])^2
> distanceG1G3=(centroids[1,1]-centroids[3,1])^2+(centroids[1,2]-centroids[3,2])^2
> distanceG2G3=(centroids[2,1]-centroids[3,1])^2+(centroids[2,2]-centroids[3,2])^2
> cbind(distanceG1G2,distanceG1G3,distanceG2G3)

distanceG1G2 distanceG1G3 distanceG2G3


[1,] 1.386481 4.049684 0.7038098

The Mahalanobis distances can be tested for statistical significance using an F-test as before.
These results are summarised in Table 8.8.

Group A vs. B A vs. C B vs. C


H0 : Z̄(A) = Z̄(B)  Z̄(A) = Z̄(C)  Z̄(B) = Z̄(C)
H1 : Z̄(A) ≠ Z̄(B)  Z̄(A) ≠ Z̄(C)  Z̄(B) ≠ Z̄(C)
Mahal. D2 1.386 4.050 0.704
F 19.25 55.23 9.71
p < 0.001 < 0.001 < 0.001

Table 8.8: Tests for significant difference between group centroids

All F-statistics are highly significant and we can therefore conclude that the two discriminant
functions are together able to significantly discriminate between the members of all three groups.

Evaluating the hit rate

In order to evaluate the hit rate of the discriminant model, we must first classify observations into
predicted groups. The classification of observations will be based on the posterior probabilities
obtained using R:

> posteriors=fitPredict$posterior
> [Link]=[Link](priorClass=training$NO2Cat,posteriorClass=fitPredict$class,posteriors,
+ decision=training$NO2Cat==fitPredict$class)
> head([Link])

priorClass posteriorClass X1 X2 X3 decision


1 2 1 0.44664642 0.4134561 0.1398975 FALSE
2 1 2 0.23724351 0.4709984 0.2917581 FALSE
3 1 2 0.40027405 0.4020813 0.1976446 FALSE
4 3 2 0.16340637 0.4339061 0.4026875 FALSE
5 3 2 0.14087959 0.4424074 0.4167130 FALSE
6 3 3 0.07237966 0.3894487 0.5381717 TRUE

The classification matrix that is obtained from applying these classifications to the training
sample is computed in R as follows:

> classificationTable=table(training$NO2Cat,fitPredict$class)
> rownames(classificationTable)=c("Group1 actual class",
+ "Group2 actual class","Group3 actual class")
> colnames(classificationTable)=c("Group1 predicted class",
+ "Group2 predicted class", "Group3 predicted class")
> classificationTable

Group1 predicted class Group2 predicted class


Group1 actual class 61 18
Group2 actual class 25 38
Group3 actual class 6 29

Group3 predicted class


Group1 actual class 4
Group2 actual class 22
Group3 actual class 47

The overall correct classification rate is calculated in R:

> misRate=[Link](classificationTable)*100
> misRate

[1] 41.6

> hitRate=100-misRate
> hitRate

[1] 58.4

The discriminant analysis model correctly classifies 58.4% of observations in the training sample.
This should also be tested on the hold-out sample. If the hit rates in the training sample are
very similar to those in the hold-out sample, we can be confident that our results will generalise
to the broader population and that we have not overfitted the model. In R:

> predictHoldout=predict(fit,holdout[,c(11,4,5)])
> [Link]=table(holdout$NO2Cat,predictHoldout$class)
> [Link]

1 2 3
1 60 18 4
2 23 37 21
3 8 27 52

> [Link]=[Link]([Link])*100
> [Link]

[1] 40.4

> [Link]=[Link]
> [Link]

[1] 59.6

The discriminant analysis model correctly classifies 58.4% of observations in the training sample
and 59.6% of the observations in the hold-out sample. Since these two hit rates are very similar,
there is no evidence of serious overfitting.
The model is able to classify observations from Group A better than observations from Groups
B and C (check why).
As a basis for comparison, we can again compare the hit rate in the training sample with
the hit rate that might be achieved by chance, using either the maximum chance criterion or
the proportional chance criterion. The maximum chance criterion remains P, the proportion
belonging to the largest group, regardless of the number of groups used, and is therefore equal
to HMax = 85/250 = 0.34 or 34% (the group sizes are those in Table 8.7). For more than two
groups, the proportional chance criterion is given by
HProp = ρ1² + ρ2² + · · · + ρq²
where ρi is the proportion belonging to group i and q is the number of groups. In this example,
HProp = (83/250)² + (85/250)² + (82/250)² = 0.333 or 33.3%. Therefore the model, with its hit
rate of around 60%, is doing substantially better than what could be expected by chance alone.
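A quick check of these two criteria in R, using the training-sample group sizes from Table 8.7:

> n <- c(83, 85, 82)    # group sizes in the training sample
> p <- n/sum(n)
> max(p)                # maximum chance criterion, about 0.34
> sum(p^2)              # proportional chance criterion, about 0.33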

Assessing the relative contribution of each discriminant function

Although there are two discriminant functions, they do not contribute equally to the discrimi-
natory power of the model. In a similar way to factor analysis and correspondence analysis, the
first discriminant function will always carry more explanatory power than the second function.
We can measure the contribution of each function by its eigenvalue, and use these eigenvalues
to express the relative contribution of each function as a percentage of the total. To obtain the
eigenvalues, the following code must be run in R:

> eigenvalues=fit$svd
> eigenvalues

[1] 9.1863421 0.3313268

> eigenvalues^2/sum(eigenvalues^2)

[1] 0.998700838 0.001299162

In the current example, R reports values of 9.186 and 0.331 for the two discriminant functions
(strictly, fit$svd returns singular values; their squares are proportional to the eigenvalues). The
relative contribution of the first function is thus 9.186²/(9.186² + 0.331²) = 0.9987, or roughly
99.9%, with the second function contributing the remaining 0.1%. The second function can
therefore for all practical purposes be ignored in any subsequent interpretation.

8.5.2 Interpretation of the discriminant functions

Once it has been established (a) that the discriminant functions together are able to significantly
discriminate between the groups and (b) what the relative importance of each discriminant
function is and whether any can be ignored, the remaining functions can be interpreted in
the usual way. Each discriminant function should be assessed separately. We have already
established that only the first discriminant function needs to be interpreted, and this
interpretation proceeds in precisely the same way as in the two-group example.
According to the standardized coefficients obtained in R:

> formulaSt=NO2Cat~scale(Cars)+scale(temp)+scale(windsp)
> fitst=lda(formulaSt,data=training, method="moment")
> fitst$scaling

LD1 LD2
scale(Cars) 1.1221210 -0.5243599
scale(temp) -0.3621544 -0.3166349
scale(windsp) -0.7212983 -0.6411898

The number of cars has the largest standardized coefficient and is therefore the most important
discriminator between the groups, i.e. it has the largest influence on pollution levels. Wind speed
is also a strong discriminator, while temperature, although still a significant contributor, is less
important than the other two variables.

In terms of the direction of the significant effects, observations in Group A (low levels of NO2 )
are taken when there are significantly fewer cars passing by, when temperatures are higher, and
when wind speeds are higher. The high-pollutant group C has the highest traffic levels, lowest
temperatures, and lowest wind speeds. The intermediate group B occupies intermediate levels
of all three attributes. Thus, it seems that the main factors contributing to high levels of air
pollution are high traffic volumes, low temperatures, and low wind speeds.
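These directions can be checked against the training-sample group means of the three significant variables (compare Table 8.7). A sketch:

> aggregate(cbind(Cars, temp, windsp) ~ NO2Cat, data = training, FUN = mean)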
Chapter 9

Classification Trees

In the previous chapter, we saw how the discriminant analysis technique allowed us to explain
group membership using a set of numeric predictor variables. Categorical independent variables
could also be included, although this required that they first be transformed into a set of binary
indicator variables. The key feature of discriminant analysis, therefore, is the fact that its
dependent variable is categorical, denoting some kind of group membership. In this chapter,
we consider another technique that can be used on exactly the same type of research problem
i.e. one with a single categorical dependent variable and numeric or categorical independent
variables. This technique is classification trees. Classification trees is actually the name given
to a collection of algorithms rather than a single technique, but these algorithms are so similar
in what they set out to achieve that they can be considered together in a single chapter.

If in discriminant analysis we already have a technique capable of predicting group membership


from any kind of independent variables, then why consider another technique? The first reason
is that classification trees offer a highly visual presentation of results that is easily understood by
non-statisticians, even when seen for the first time. It is therefore a popular practical technique
in industry. In contrast, the discriminant functions which contain most of the information about
a discriminant analysis, are relatively difficult to interpret, requiring at least some knowledge of
mathematical functions, significance testing, and so on. The second reason is that classification
trees are able to capture interactions between the independent variables in a simpler way than
discriminant analysis, particularly interactions between categorical independent variables. This
is a direct result of the way that classification trees are constructed, which we shall turn to
shortly.

There are several different ways to construct classification trees. The two that we will consider
in this course go by the names of the classification and regression trees (CART) algorithm and
the chi-squared automatic interaction detection (CHAID) algorithm. Importantly, these are just
different ways of constructing the trees. The basic interpretation of the trees remains the same
regardless of which algorithm is used to build the tree.

9.1 Classification tree basics

We begin by introducing some aspects shared by all classification tree algorithms, using the
following application of credit risk assessment as an example. Banks build predictive models in
order to classify loan applicants into those who are creditworthy and those who have a high risk
of not repaying and who should therefore be denied the loan. These models are usually based
on some demographic and socio-economic information – like age, number of children, education
level, income, and so on.

In this example, a bank has collected the following information from 100 customers who applied


for loans with the bank in the past:

• Volume: Volume of bank transactions per month (Frequent, Seldom)


• Value: Income per transaction (High, Low)
• Age: Age of customer (Younger than 25, 25 to 30, Older than 30)
• Status: Whether the loan was repaid (Paid/Not paid)

In this example, all variables are categorical. The dependent variable Status as well as the in-
dependent variables Volume and Value contain two levels, while Age contains three. From the
point of view of classification trees though, only the dependent variable need be categorical (and
even this requirement can be relaxed, although this is beyond the scope of this course) and all
categorical variables can have any number of levels, not just two or three. We will return to this
point later.

Classification tree algorithms begin by placing all objects together in what is known as a root
node. A “node” is simply a name for a grouping of objects that classification trees creates based
on certain combinations of the independent variables. The root node is the name given to the
group that is created before considering any independent variables i.e. by just grouping together
everyone in the sample.

In the root node, there are 80 customers who did repay their loans and 20 customers who did
not. We can describe this node in a graphical way in Figure 9.1(a). The node is represented by a
box, with the inset bar chart describing the relative proportion of the node’s members belonging
to each dependent variable group. The left-hand bar represents the number of customers who
repaid their loans while the right-hand bar represents the number who did not.

In the left-hand corner of the node box is a node index that can be used to refer to the node i.e.
Node 1. In the top-right corner of the root node is what is called the node classification (“Paid”).
This is the predicted group membership for all objects in that node. From the point of view of
the model, all objects in a particular node are treated identically – that is, there must be a single
classification for all objects in a node. The classification for any node is the dependent variable
group that has the largest number of members in that node. For example, since there are 80
loan-repayers and only 20 loan-defaulters, the classification is to the more frequent loan-repayers
group. Clearly, the 20 customers who defaulted on their loans would be incorrectly classified in
the root node.

The classification tree algorithm now attempts to split apart the group of all customers contained
in the root node, by using the independent variables. The aim when splitting a node is to get
as many objects as possible who exhibit one type of behaviour (e.g. paying their loans) into one
of the new nodes, with as many objects who display the opposite behaviour (e.g. not paying
their loans) into the other nodes. However, the catch is that we can only use the independent
variables to split apart the objects in a node into new nodes.

For example, one possible grouping is to place all high volume customers in one group and all
low volume customers in another group. Another is to place all those younger than 25 in one
group, those between 25 and 30 in another group, and those older than 30 in a third group. In
Figure 9.1(a), the algorithm has decided to divide the root node into low volume customers on
one hand, and high volume customers on the other hand. In later sections we will see exactly
how the different classification algorithms decide how this split should occur. Note that there
are 50 low volume customers who find their way to the left-hand node in Figure 9.1(a), and
that nearly all of them are in the a priori “Paid” group (shown by the dominance of the “Paid”
bar in the bar chart in that node). Most of those who did not pay their loan, therefore, are in
the other node containing high volume customers. From this, we can assert that most of those


Figure 9.1: Steps in the construction of a classification tree: (a) the tree splits according to the most
significant predictor variable, (b) the tree splits further according to the next most significant variable,
(c) the final classification tree is achieved

not repaying their loans were high-volume customers (note though, that the reverse – that most
high-volume customers do not repay their loans – is not true).

At this stage it is useful to state some terminology. The root node (Node 1) was split into two
(Node 2 and Node 3). Here, Node 1 is referred to as a parent node, and Nodes 2 and 3 as child
nodes, for obvious reasons. Note that the two child nodes are exhaustive and mutually exclusive:
that is, if you are in Node 1 then you must belong to one and only one of Node 2 or Node 3.

At the next step of the tree construction, the algorithm tries to split each of the child nodes
further. Figure 9.1(b) shows how Node 3, which contains high volume customers, can be further
split into two: (1) those customers who are both high volume and low value (Node 4), and (2)
those customers who are both high volume and high value (Node 5). Interestingly, in the group
of high volume, high value customers, non-payment is the most frequently observed behaviour
(as seen by the node classification of “Not paid”). This suggests that customers who transact
frequently and have high value transactions should not be lent any money. Note also that the
algorithm is unable to split the low volume customers in Node 2 any further. This is because
all customers in this node already behave nearly identically on the dependent variable i.e. they
(nearly) all pay back the loan. It is therefore not worth splitting this node any further. A node
that cannot be split any further is known as a terminal node.

In Figure 9.1(c), the algorithm again tries to split each of the new child nodes (that is, Node
4 and Node 5) even further. Node 4 (containing high volume, low value customers) cannot be
split any further and is therefore a terminal node, but Node 5 (containing high volume, high
value customers) is split further into those customers who are older than 30 (Node 6, containing
n = 15 customers) and 30 or younger (Node 7, containing n = 15 customers). None of these two
nodes can be split any further (in this case, because we have used all the independent variables)
and so the set of terminal nodes is:
• Node 2 (n = 50): contains those customers who seldom transact. Classified as “Paid”.
• Node 4 (n = 20): contains those customers who both transact frequently and have low
value transactions. Classified as “Paid”.
• Node 6 (n = 15): contains those customers who transact frequently and have high value
transactions and are older than 30. Classified as “Paid”.
• Node 7 (n = 15): contains those customers who transact frequently and have high value
transactions and are 30 or younger. Classified as “Not paid”.
These four terminal nodes are the focus of our attention. Note how they too are exhaustive and
mutually exclusive – every object must belong to one and only one of the terminal nodes. They
can also be thought of as classification rules i.e. as ways for classifying objects based on their
values on the independent variables. In this case, the classification rules can be stated as follows:
• If Volume = Seldom, then classify as “Paid” (Node 2).
• If Volume = Frequent and Value = Low, then classify as “Paid” (Node 4).
• If Volume = Frequent and Value = High and Age > 30, then classify as “Paid” (Node 6).
• If Volume = Frequent and Value = High and Age ≤ 30, then classify as “Not paid” (Node
7).
A classification rule is just a statement about (a) what combination of conditions leads one to land
in a particular terminal node, and (b) what the resulting node classification in that node is.
The rules are easy to understand and to describe to others, and are even easier to follow given
the graphical display of the tree in Figure 9.1(c). They also give clear policy guidelines for this
particular bank i.e. do not lend to younger customers (30 or younger) who transact frequently
and for large values, but lend to everyone else.
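The four rules translate directly into a small piece of conditional logic. An illustrative sketch in R (the function name is ours; the level labels are those used in the credit risk data later in this section):

> classifyApplicant <- function(Volume, Value, Age) {
+   if (Volume == "Seldom") return("Paid")    # rule for Node 2
+   if (Value == "Low")     return("Paid")    # rule for Node 4
+   if (Age == ">30")       return("Paid")    # rule for Node 6
+   "Notpaid"                                 # rule for Node 7
+ }
> classifyApplicant(Volume = "Frequent", Value = "High", Age = "<25")   # "Notpaid"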

Of course, as with a discriminant analysis model, our classification tree is bound to incorrectly

classify some objects. The exact number of incorrectly classified objects is not shown in the
classification tree (but can be found from other standard output), but one can see approximate
levels of incorrect classifications by considering the relative lengths of the two bars in the bar
charts contained in the terminal nodes. For example, objects in Node 2 are all classified as
“Paid”, and since the “Not paid” bar is not even visible, there are clearly not many objects in
this group. In fact, there are 2 out of 50 who did not pay, which gives a misclassification rate
of 4% in this node. Node 6, on the other hand, also has all objects classified as “Paid”, but
in that node the length of the “Not paid” bar is about half that of the “Paid” bar (the exact
number of non-payments is 5 out of 15). As a result, the misclassification rate in Node 6 is 33%,
considerably higher than that of Node 2. For completeness, the respective misclassification rates
in the other terminal nodes are 2/20 = 10% in Node 4 and 4/15 = 27% in Node 7. A standard
output of any classification tree model is a misclassification rate in each node, as well as an
overall classification matrix, which can be interpreted in exactly the same way as in discriminant
analysis. The classification matrix for the final classification tree in Figure 9.1(c) is shown in
Table 9.1.

A priori group
Paid Not paid Total
Predicted Paid 76 9 83
group Not paid 4 11 15
Total 80 20 100

Table 9.1: Classification matrix

The overall hit rate is therefore (76 + 11)/100 = 87%. It is worth reflecting for a minute on
the simplicity of the classification tree output, particularly compared with many of the other
statistical techniques that we have covered in this course. For the most part, users would need
to be presented with only the four decision rules, a misclassification rate for each of the rules,
and the overall hit rate.

The application of CART in R is done using the "rpart" package. This package should therefore
be installed (install.packages("rpart")) and then loaded with the library() function before working
on the classification trees:

> library(rpart)

The data we are working with is saved as a .txt file with the name [Link]. In order to
work with this data set, it must be read into R with the following code (assuming that the data
is saved in your working directory):

> CreditRisk=[Link]("[Link]", header=T)


> datatoWork=CreditRisk
> attach(datatoWork)
> names(datatoWork)

[1] "ID" "Volume" "Value" "Age" "Status"

In order to see a few lines of your data set, you can either open the data appearing in your
working directory or type in the following code:

> head(datatoWork)

ID Volume Value Age Status


1 1 Seldom Low >30 Paid
2 2 Seldom Low >30 Paid
3 3 Seldom Low >30 Paid
4 4 Seldom Low >30 Paid
5 5 Seldom Low <25 Paid
6 6 Seldom Low <25 Paid

The variable of interest is Status. To partition the customers into the "Paid" and "Notpaid"
categories using the other variables, we use the rpart() function available in the "rpart"
package that you have already installed:

> formula=Status ~ Volume+ Value +Age


> cartTree <- rpart(formula, data=datatoWork)

The output obtained using this function is not a tree plot but it is a textual version of the tree
plot. The output provides the total number of observations used in the analysis as well the
number of observations corresponding to each of the nodes. For example, the first node is the
root node with 100 observations in total. The first partition is created using the Volume variable
(Nodes 2 and 3). It is clear that Node 3 is not partitioned any further, which means that this
is a terminal node. The terminal nodes in the output are marked with a “*” at the end of the
node. Therefore, we see that Nodes with numbers 3, 5, 8 and 9 are terminal nodes.
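The textual tree being described here is displayed simply by printing the fitted object (its output is not reproduced in these notes):

> print(cartTree)   # or simply type cartTree at the prompt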

When we look at each individual node, we see that the partition criterion is written first, followed
by the total number of observations in that node, the number of observations falling into the
category that is not predicted by this node, and the name of the category that is predicted
by this node. In other words, the category printed for any node is the one containing the greater
number of observations in that node.

The root node is further partitioned into Node 2 and Node 3. When we look at Node 2, we
see that this node belongs to the “Volume=Frequent” group of customers. There are 50 cus-
tomers that frequently have bank transactions. The name of the category predicted with this
node is “Paid” and the number of “Unpaid” customers is 18. This means that the number of
customers predicted to “Paid” category is 32 (50-18=32). Since the CART tree algorithm will
predict the category with the higher number of observations (“Unpaid”=18; “Paid”=32), this
node is predicting the “Paid” category which has 32 customers. The following numbers in the
parentheses (0.36 0.64) provide the proportions of the customers in the categories in that node
(“%Unpaid”=18/50=0.36; “%Paid”=32/50=0.64). The higher proportion belongs to the pre-
dicted category which is “Paid” in this node.

Similarly, when we look at Node 3, we see that this node belongs to the “Volume=Seldom”
group of customers. There are 50 (100-50=50) customers that seldom have a transaction in
the bank. The name of the category predicted with this node is “Paid” and the number of “Un-
paid” customers is 2. This means that the number of customers predicted to “Paid” category is
48 (50-2=48). Since the CART tree algorithm will predict the category with the higher number
of observations (“Unpaid”=2; “Paid”=48), this node is predicting the “Paid” category which has
48 customers. The following numbers in the parentheses (0.04 0.96) provide the proportions of
the customers in the categories in that node (“%Unpaid”=2/50=0.04; “%Paid”=48/50=0.96).
The higher proportion belongs to the predicted category which is “Paid” in this node. Since
Node 3 has a “*” at the end and is not followed by any other nodes, it is clear that this node is
a terminal node, indicating that (RULE 1) any customer that has a seldom transaction
in the bank will be predicted as a customer who repays the loan.

Node 2 is further partitioned using the “Value” variable (Nodes 4 and 5). Node 4 belongs
to the “Frequent Volume” and “High Value” customers, whereas Node 5 belongs to “Frequent
Volume” and “Low Value” customers. There are 30 customers in Node 4 and the name of the
category predicted using this node is “Unpaid” and there are 14 “Paid” customers in this node.
This means that the number of customers predicted to “Unpaid” category is 16 (30-14=16) with
a proportion of 53% (16/30=0.53).

Node 5 belongs to the “Frequent Volume” and “Low Value” customers and there are 20 customers
in this node in total. This node is predicting the “Paid” category and there are 18 customers
in the “Paid” category (20-2=18). The proportion of customers in the “Paid” category is 90%
(18/20=0.90). And this node is not partitioned any further, indicating that this is a terminal
node (marked with a "*" at the end). Therefore (RULE 2) any customer with a frequent trans-
action volume in the bank and low value income per transaction will be predicted as a customer
who repays the loan.

Node 4 is further partitioned into Nodes 8 and 9 with the Age variable. Since CART can only
do binary splitting, the "Age<25" and "Age between 25-30" categories are grouped in Node 8. Node 8
has 15 customers in total and this node is predicting the "Unpaid" category. The number of
customers that did not pay their loan is 11 (15-4=11) and the proportion of the unpaid customers
is 73.3% (11/15=0.733). Node 8 is also a terminal node. We have used all the variables and there
are no variables left to partition any node. Therefore (RULE 3) any customer aged 30 years or
younger with a frequent transaction volume in the bank and high value income per transaction
will be predicted as a customer who does not repay the loan.

At Node 9, we see that this node belongs to the “Age>30” group of customers. There are
15 customers that are aged more than 30 years old with high value income and frequent vol-
ume transactions. The name of the category predicted with this node is “Paid” and there are
10 customers who repaid the loan (15-5=10). The proportion of the paid customers is 66.67%
(10/15=0.6667). Node 9 is also a terminal node. Therefore (RULE 4) any customer aged more
than 30 years old with a frequent transaction volume in the bank and high value income per
transaction will be predicted as a customer who repays the loan.

The general rules for classifying the customers are the same as explained previously. The infor-
mation that one can read from the textual tree output obtained using the rpart() function can
also be obtained using the summary(cartTree) command in R:

> summary(cartTree) # detailed summary of splits

Call:
rpart(formula = formula, data = datatoWork)
n= 100

CP nsplit rel error xerror xstd


1 0.1166667 0 1.00 1.0 0.200000
2 0.0100000 3 0.65 0.8 0.183303

Variable importance
Volume Value Age
41 40 19

Node number 1: 100 observations, complexity param=0.1166667


predicted class=Paid expected loss=0.2 P(node) =1
class counts: 20 80
probabilities: 0.200 0.800

left son=2 (50 obs) right son=3 (50 obs)


Primary splits:
Volume splits as LR, improve=5.12000, (0 missing)
Value splits as LR, improve=1.87013, (0 missing)
Age splits as RRL, improve=0.12500, (0 missing)
Surrogate splits:
Value splits as LR, agree=0.54, adj=0.08, (0 split)

Node number 2: 50 observations, complexity param=0.1166667


predicted class=Paid expected loss=0.36 P(node) =0.5
class counts: 18 32
probabilities: 0.360 0.640
left son=4 (30 obs) right son=5 (20 obs)
Primary splits:
Value splits as LR, improve=4.506667, (0 missing)
Age splits as LRL, improve=0.240000, (0 missing)

Node number 3: 50 observations


predicted class=Paid expected loss=0.04 P(node) =0.5
class counts: 2 48
probabilities: 0.040 0.960

Node number 4: 30 observations, complexity param=0.1166667


predicted class=Notpaid expected loss=0.4666667 P(node) =0.3
class counts: 16 14
probabilities: 0.533 0.467
left son=8 (15 obs) right son=9 (15 obs)
Primary splits:
Age splits as LRL, improve=2.4, (0 missing)

Node number 5: 20 observations


predicted class=Paid expected loss=0.1 P(node) =0.2
class counts: 2 18
probabilities: 0.100 0.900

Node number 8: 15 observations


predicted class=Notpaid expected loss=0.2666667 P(node) =0.15
class counts: 11 4
probabilities: 0.733 0.267

Node number 9: 15 observations


predicted class=Paid expected loss=0.3333333 P(node) =0.15
class counts: 5 10
probabilities: 0.333 0.667

A basic (and admittedly not very pretty) plot of the tree is produced with the following code in R:

> plot(cartTree, uniform=TRUE, main="Classification Tree for Credit Risk")


> text(cartTree, pretty=0,use.n=TRUE, all=TRUE, cex=1)

[Figure: "Classification Tree for Credit Risk" — the basic rpart plot, with the splits Volume=Frequent, Value=High and Age=<25,25−30 at the internal nodes, and each node labelled with its classification and its Notpaid/Paid counts (root: Paid 20/80; terminal nodes: Notpaid 11/4, Paid 5/10, Paid 2/18, Paid 2/48).]

Using the "partykit" package (installed in your R environment), it is possible to get the same output
with some additional information, such as the prediction error rates, and a prettier plot:

> library(partykit)
> partyfit=as.party(cartTree)
> print(partyfit)

Model formula:
Status ~ Volume + Value + Age

Fitted party:
[1] root
| [2] Volume in Frequent
| | [3] Value in High
| | | [4] Age <25, 25-30: Notpaid (n = 15, err = 26.7%)
| | | [5] Age >30: Paid (n = 15, err = 33.3%)
| | [6] Value in Low: Paid (n = 20, err = 10.0%)
| [7] Volume in Seldom: Paid (n = 50, err = 4.0%)

Number of inner nodes: 3


Number of terminal nodes: 4

> plot(partyfit, type = "extended")



[Figure: the partykit plot of the fitted tree — inner nodes 1 (Volume), 2 (Value) and 3 (Age), with terminal Nodes 4 (n = 15), 5 (n = 15), 6 (n = 20) and 7 (n = 50) each displayed as a bar chart of the proportions of Notpaid and Paid customers.]

The results of the classification analysis are best interpreted by looking at both the tree plot and
the textual version of the tree. The classification matrix is obtained using the following:

> p1=predict(partyfit, datatoWork)


> table(Status,p1)

p1
Status Notpaid Paid
Notpaid 11 9
Paid 4 76

Here the correct classification rate is 87% ((11+76)/100 = 0.87). This value should be compared
to the HMax and HProp chance criteria in order to evaluate it.
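As a quick check using the root-node group sizes (80 payers and 20 defaulters): HMax = 80/100 = 0.80 or 80%, and HProp = 0.8² + 0.2² = 0.68 or 68%, so the tree's hit rate of 87% exceeds both chance criteria.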

9.2 General principles of classification trees

“Classification trees” is a term given to a number of algorithms that have several features in
common

• Firstly, they attempt to predict group membership. That is, they try to predict or explain
a categorical dependent variable. It is possible to use some classification tree algorithms to
predict numeric dependent variables, but these go under the name of “regression trees” and
are generally considered separately to classification trees.

• Secondly, they represent the predictive process in a tree-like structure. We have seen an
example of such a tree-like structure in Figure 9.1(c). The tree structure results in an
easy-to-understand set of classification rules that can be used to predict every object into
a particular part of the tree (called a node), with a corresponding prediction for the value
of the dependent variable.
• Thirdly, they employ a method of recursive partitioning when creating the tree. This means
that all trees begin by considering all objects as belonging to the same group. They then
attempt to divide or partition the data into sub-groups that are as similar to one another as
possible with respect to the dependent variable. Another way of saying this is that objects
within nodes should be as homogeneous as possible with respect to the outcome variable.
The recursive aspect of the method is that it attempts to further partition each sub-group
it creates, until it can no longer do so any further.

Thus, all classification trees algorithms are partitioning algorithms that are recursive and re-
sult in a tree-like structure that can be used to generate classification rules that predict group
membership. These algorithms can differ from one another in the following two key ways:

Splitting rule Each algorithm must decide how it is going to decide how to partition the data.
That is, how is it going to find the independent variables that result in partitions that are
maximally homogenous within partitions and give good overall hit rates? We will consider
two splitting rules: (1) the reduction in diversity criterion that is used as a default by CART;
and (2) the chi-squared test of association that is used by default by CHAID. Splitting rules
are often divided into binary splitting rules, which always divide a parent node into two
child nodes; and non-binary splitting rules, which can divide a parent node into two or more
child nodes. CART is an example of an algorithm that uses binary splitting rules, while
CHAID uses non-binary rules.
Stopping rule The recursive aspect of classification trees means that they will always try to
continue to partition nodes further and further until they have some reason to stop. This
“reason to stop” is given the name of a stopping rule. The simplest possible stopping rule
is: “continue splitting nodes until each terminal node contains only one object”. This
is generally an unsatisfactory rule that is almost certain to overfit the data and create
classification rules that will not generalise to a broader population. We will consider a
number of more suitable stopping rules in section 9.5.
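In the rpart implementation used earlier, both choices appear as arguments: the splitting criterion can be set through the parms argument (e.g. split = "gini") and the stopping behaviour through rpart.control(). A hedged sketch, re-using the credit risk data from section 9.1 with illustrative (not recommended) control values:

> cartTree2 <- rpart(Status ~ Volume + Value + Age, data = datatoWork,
+                    control = rpart.control(minsplit = 20, cp = 0.01, maxdepth = 5))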

Where sample size allows, the dataset used to create a classification tree should be divided into
a training sample and a hold-out sample, for precisely the same reasons as were described in the
previous chapter on discriminant analysis. That is, there is always a danger that when we fit a
model to a dataset we are capturing certain random aspects that are unique to that dataset and
do not hold in the general population. This is known as overfitting. Usually the more complex
the model, the bigger the risk of overfitting. To avoid this danger, it is important to test the
final model on a set of data that was not used in the construction of the model itself. This data
is known as a hold-out sample to distinguish it from the data used to construct the model, which
is known as the training sample. If the hit rates are similar in the training and hold-out samples,
then the researcher can be quite confident that no serious overfitting has occured.

Once the classification tree has been constructed, the results are usually evaluated and presented
in the following three ways:

(a) The graphical classification tree, together with the decision rules that lead to each terminal
node and the classification for that node.
(b) A misclassification rate for each of the rules/terminal nodes, indicating the reliability of
each of the decision rules.
(c) The overall hit rate, indicating the overall quality of the predictive model. This is usually
given in a classification matrix.

In the remainder of the chapter, we consider in more detail two of the different algorithms that
can be used to build a classification tree: the classification and regression tree (CART), and the
chi-squared automatic interaction detection (CHAID) algorithms.

9.3 Classification and regression trees (CART)

The CART algorithm that we will discuss in this section has two features that distinguish it
from the CHAID algorithm covered in the next section:

• It is a binary splitting algorithm. That is, each parent node is partitioned into two and only
two child nodes. Every node is thus either a terminal node or has two child nodes.
• The partitioning is based on a measure of the diversity (i.e. heterogeneity with respect to
the outcome variable) in a node. This is known as the reduction in diversity criterion.

9.3.1 Partitioning the tree

We will illustrate the construction of a CART tree using the same credit risk example as before.
As with all classification trees, the algorithm begins by placing all objects together in a root
node. As mentioned, there are 80 members of the “Paid” group in this node and 20 “Not paid”
members. That is, there is some diversity of behaviour in this node because some objects belong
to one group (“Paid”) and other objects belong to another group (“Not paid”). The quantitative
measure of diversity that is used by the CART algorithm is the diversity index (also called the
Gini index), which is stated below.

Diversity index/Gini index The diversity index of a node is the probability that
any two objects chosen at random (with replacement) from all those in the node will
belong to different groups.

For cases where the dependent variable has only two groups, the diversity index of a node can
be computed as
DI = 1 − ρ1² − ρ2²
where ρ1 is the proportion of objects in that node who belong to Group 1 (e.g. to the “Paid”
group) and ρ2 is the proportion of objects in that node who belong to Group 2 (e.g. to the “Not
paid” group). To see why this is so, consider that there are two ways in which two objects chosen
at random will be the same. Either:

• They are both from Group 1. The probability of this happening, assuming independence
of observations1 , is ρ1 × ρ1 = ρ1²
• They are both from Group 2. The probability of this happening is ρ2²

Therefore, the probability that the two randomly-chosen objects are not the same is simply
1 − ρ1² − ρ2². In the credit risk example ρ1 = 0.8 and ρ2 = 0.2, so the diversity index of the root
node is given by DI = 1 − 0.64 − 0.04 = 0.32.

For the more general case where the dependent variable has q groups, the diversity index of a
node can be computed as
DI = 1 - Σ_{i=1}^{q} ρ_i^2

by the same reasoning as before. Note that the minimum value that the diversity index can take
on is always zero, which is achieved when all the objects in a node belong to the same group (that
¹Recall that if the probability of an event A is given by Pr(A) and the probability of an event B is given by Pr(B),
then, provided that the two events are independent, the probability of them both happening is Pr(A ∩ B) = Pr(A) Pr(B).

is, when the node is perfectly homogeneous with respect to the dependent variable). Supposing
that all the members in a node belong to Group 1, then it is certain that an object drawn at
random from the node will belong to Group 1 i.e. ρ1 = 1, all other ρi = 0 and DI = 0. The
maximum value the diversity index can take on is a little more complicated because it depends
on the number of groups there are. The case of maximum diversity is when there are an equal
number of objects in each of the groups. For a two-group case, this would be achieved when
50% of the objects in a node belong to Group 1 and the other 50% belong to Group 2. For a
three-group case, it would occur when 33% of objects are in each of the three groups. In general,
when there are q groups maximum diversity will occur when ρi = 1/q, ∀i (for all i). The diversity
index will then be DI = 1 - q(1/q^2) = 1 - 1/q.
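
The diversity index is easy to compute directly in R. The short sketch below is our own
illustration (the function name div_index is not part of the course code): it defines a small
helper and reproduces the root-node value of 0.32 as well as the two-group maximum of 0.5.

# Diversity/Gini index of a node, given the counts of objects in each group
div_index <- function(counts) {
  p <- counts / sum(counts)   # proportion of the node's objects in each group
  1 - sum(p^2)                # DI = 1 - sum of squared proportions
}

div_index(c(Paid = 80, NotPaid = 20))   # root node of the credit risk example: 0.32
div_index(c(50, 50))                    # two equal groups: 0.5, the maximum 1 - 1/q for q = 2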

At this stage, all the CART algorithm has done is to compute a measure of diversity for the
parent node. Now, it must decide how to partition this root node. This is done according to the
following procedure (adapted from ?):

(a) Consider each independent variable in turn.


(b) For each independent variable, consider every possible binary split of the values of that
independent variable (by creating two categories or intervals, A and not-A).
(c) Calculate the diversity measure obtained for each possible partitioning (i.e. the weighted
average of the diversity index obtained from each child node in the partition)
(d) The partition that leads to the greatest reduction in diversity from the parent node to the
child nodes over all possible partitions is selected as the one to use.
(e) The process is then repeated for each of the new child nodes (i.e. they become the parent
nodes of the next step).

Let us return to the credit risk example to illustrate this process. There are three independent
variables, and two of these are binary variables. Since these binary variables have only two
categories to begin with, they can only be partitioned in this way i.e.

• Child 1 – Volume = Frequent; Child 2 – Volume = Seldom.

• Child 1 – Value = High; Child 2 – Value = Low.

The Age variable, on the other hand, has three categories and can thus form three different types
of binary partitions (remember that CART must use only binary partitions) i.e.

• Child 1 – Age < 25; Child 2 – Age = [25, 30] and Age > 30.
• Child 1 – Age < 25 and Age = [25, 30]; Child 2 – Age > 30.
• Child 1 – Age < 25 and Age > 30; Child 2 – Age = [25, 30].

These are the five possible ways in which the root node can be partitioned. For each of these
possible partitions, we need to compute the reduction in diversity that would occur. To compute
the reduction in diversity, we need to perform three computations

(a) Calculate the diversity index in each of the child nodes. Refer to this as DIChild 1 and
DIChild 2
(b) Calculate the weighted average of the two diversity indices, given by

WADI = (n_{Child 1} DI_{Child 1} + n_{Child 2} DI_{Child 2}) / (n_{Child 1} + n_{Child 2})

(c) Subtract the weighted average diversity index from the diversity index of the parent node
i.e.
RiD = DI_{Parent} - WADI

We consider two of the possible partitions in order to illustrate the required calculations:

Child 1 – Volume = Frequent; Child 2 – Volume = Seldom

The cross-tabulation of the Volume variable with the dependent variable is shown in Table 9.2.
It can be used to calculate the diversity index in each of the child nodes under this particular
partitioning.

Volume
Frequent Seldom Total
Paid 32 48 80
Not paid 18 2 20
Total 50 50 100

Table 9.2: Loan repayment status partitioned by volume category, for the root node

The diversity index of each child is given by

• Child 1 (Volume = Frequent): ρPaid = 32/50, ρNot paid = 18/50, so

DI_{Child 1} = 1 - 0.64^2 - 0.36^2 = 0.4608

• Child 2 (Volume = Seldom): ρPaid = 48/50, ρNot paid = 2/50, so

DI_{Child 2} = 1 - 0.96^2 - 0.04^2 = 0.0768

Therefore, the weighted average diversity index arising from this partition is

WADI = (50(0.4608) + 50(0.0768)) / 100 = 0.2688
Since the diversity index of the parent node was found above to be 0.32, this partitioning gives
a reduction in diversity of
RiD = 0.32 − 0.2688 = 0.0512
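
Using the div_index() helper sketched above, this whole calculation can be reproduced in a few
lines of R (again our own illustration; the object names are not part of the course code):

DI_parent <- div_index(c(80, 20))      # root node: 0.32
DI_child1 <- div_index(c(32, 18))      # Volume = Frequent: 0.4608
DI_child2 <- div_index(c(48, 2))       # Volume = Seldom:   0.0768
WADI <- (50 * DI_child1 + 50 * DI_child2) / 100   # 0.2688
RiD  <- DI_parent - WADI                          # 0.0512
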
Child 1 – Age < 25; Child 2 – Age = [25, 30] and Age > 30

The cross-tabulation of the Age variable with the dependent variable is shown below.

Age
< 25 25 − 30 > 30 Total
Paid 30 15 35 80
Not paid 10 5 5 20
Total 40 20 40 100

Table 9.3: Loan repayment status partitioned by age category, for the root node

The diversity index of each child is given by

• Child 1 (Age < 25): ρPaid = 30/40, ρNot paid = 10/40, so

DI_{Child 1} = 1 - 0.75^2 - 0.25^2 = 0.375

• Child 2 (Age = [25, 30] and Age > 30 combined): ρPaid = (15 + 35)/(20 + 40), ρNot paid =
(5 + 5)/(20 + 40), so
DI_{Child 2} = 1 - 0.833^2 - 0.167^2 = 0.277

Therefore, the weighted average diversity index arising from this partition is
WADI = (40(0.375) + 60(0.277)) / 100 = 0.317
Since the diversity index of the parent node was found above to be 0.32, this partitioning gives
a reduction in diversity of
RiD = 0.32 − 0.317 = 0.0033
Clearly, the reduction in diversity offered by the Volume partition is considerably greater than
the reduction offered by this particular Age partition. In fact, it turns out that the Volume
partition offers the greatest reduction in diversity of all five of the possible partitions. On this
basis, the CART algorithm would select this partition and would divide the root node into two:
one child containing all frequently transacting customers (Volume = Frequent, n = 50) and the
other containing all seldom-transacting customers (Volume = Seldom, n = 50).

This procedure is now repeated separately in each of the newly-formed nodes. That is, the child
nodes now themselves become parent nodes, and we consider once again all possible ways that
we could partition each node.

In new parent node Volume = Frequent

Since we have already used the binary Volume partitioning, we cannot use it again. This leaves
the following four possible binary partitions:

• Child 1 – Value = High; Child 2 – Value = Low.


• Child 1 – Age < 25; Child 2 – Age = [25, 30] and Age > 30.
• Child 1 – Age < 25 and Age = [25, 30]; Child 2 – Age > 30.
• Child 1 – Age < 25 and Age > 30; Child 2 – Age = [25, 30].

Once again, we need to evaluate the reduction in diversity that would arise from each of these
partitions, and select the one that offers the largest reduction. The only difference is that we
no longer consider all objects in the sample, but only those that fall into this particular parent
node i.e. frequently transacting customers (n = 50). We consider one of the four possible partitions,
just to show that the calculations work in the same way as in the previous step.

Child 1 – Value = High; Child 2 – Value = Low

The cross-tabulation of the Value variable with the dependent variable for those objects in the
Volume = Frequent node is shown below.

Value
High Low Total
Paid 14 18 32
Not paid 16 2 18
Total 30 20 50

Table 9.4: Loan repayment status partitioned by value category, for the Volume = Frequent node

The diversity index of each prospective child is given by

• Child 1 (Value = High): ρPaid = 14/30, ρNot paid = 16/30, so

DI_{Child 1} = 1 - 0.47^2 - 0.53^2 = 0.498

• Child 2 (Value = Low): ρPaid = 18/20, ρNot paid = 2/20, so

DI_{Child 2} = 1 - 0.90^2 - 0.10^2 = 0.180



Therefore, the weighted average diversity index arising from this partition is

WADI = (30(0.498) + 20(0.180)) / 50 = 0.3708

Since the diversity index of the parent node was found in the previous step to be 1 - 0.64^2 - 0.36^2 =
0.4608, this partitioning gives a reduction in diversity of

RiD = 0.4608 − 0.3708 = 0.09

This reduction in diversity in fact turns out to be larger than any of the reductions offered by the
three other possible partitions. On this basis, the high volume customers making up this node
are further partitioned into those that have high average values for their (frequent) transactions
(new Child 1), and those that have low average values for their (also frequent) transactions
(new Child 2). In the next step, we would attempt to partition these new child nodes even
further, based on the remaining three binary splits that are available to us (all based on the Age
variable). But before we do this, we first turn to the other parent node at this second step – the
one containing the low volume customers.

In new parent node Volume = Seldom

Again, we have the following four possible binary partitions:

• Child 1 – Value = High; Child 2 – Value = Low.


• Child 1 – Age < 25; Child 2 – Age = [25, 30] and Age > 30.
• Child 1 – Age < 25 and Age = [25, 30]; Child 2 – Age > 30.
• Child 1 – Age < 25 and Age > 30; Child 2 – Age = [25, 30].

We now need to evaluate the reduction in diversity that would arise from each of these partitions,
among those that fall into this particular parent node i.e. seldom-transacting customers (n =
50). We again consider just one of the four possible partitions.

Child 1 – Value = High; Child 2 – Value = Low

The cross-tabulation of the Value variable with the dependent variable for those objects in the
Volume = Seldom node is shown below.

Value
High Low Total
Paid 38 10 48
Not paid 2 0 2
Total 40 10 50

Table 9.5: Loan repayment status partitioned by value category, for the Volume = Seldom node

The diversity index of each child is given by

• Child 1 (Value = High): ρPaid = 38/40, ρNot paid = 2/40, so

DI_{Child 1} = 1 - 0.95^2 - 0.05^2 = 0.095

• Child 2 (Value = Low): ρPaid = 10/10, ρNot paid = 0/10, so

DI_{Child 2} = 1 - 1^2 - 0^2 = 0



Therefore, the weighted average diversity index arising from this partition is
WADI = (40(0.095) + 10(0)) / 50 = 0.0760
Since the diversity index of the parent node was found in the previous step to be 1 - 0.96^2 - 0.04^2 =
0.0768, this partitioning gives a reduction in diversity of

RiD = 0.0768 − 0.0760 = 0.0008

A slightly better reduction in diversity turns out to be possible using the binary partition of
Age < 25 and Age ≥ 25, but even this reduction is only equal to 0.0048.

This raises the question of whether this tiny reduction in diversity is “worth” creating another
partition for. Ultimately, the decision about whether a particular reduction in diversity is suffi-
cient to justify creating a new split is made by the stopping rule used by the classification tree
algorithm. We will consider stopping rules in more detail in section 9.5. For now, it can be
noted that the tree in Figure 9.1(d), which was constructed using a standard CART algorithm,
did not consider the reduction in diversity to be sufficient to further split the low volume node
(Node 2 in the figure). It therefore becomes a terminal node.

Thus far, we have explained in detail how (1) the root node was partitioned into frequent volume
and seldom volume customers; (2) how the frequent volume customers were further partitioned
into those who have high average value transactions, and those who have low average value
transactions; (3) how the seldom volume customers could not be partitioned any further. This
results in the tree shown in Figure 9.1(c). Of course, the process can now be repeated in each of
the new child nodes (Node 4 and Node 5 in the figure, representing high-value, frequent trans-
actors and low-value, frequent transactors respectively). As Figure 9.1(d) shows, Node 4 cannot
be partitioned any further (because the reduction in diversity is not large enough to pass the
particular stopping rule that is used) but Node 5 can be further partitioned by a binary Age
split.
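
For reference, a tree like the one in Figure 9.1 can be grown in R with the rpart package, which
implements a CART-type algorithm using the Gini/diversity splitting criterion. The sketch below
is an illustration only (it assumes the credit risk data are stored in the data frame datatoWork,
with columns Status, Volume, Value and Age, as in the CHAID example later in this chapter);
it is not necessarily the code used to produce Figure 9.1.

library(rpart)
cartfit <- rpart(Status ~ Volume + Value + Age, data = datatoWork,
                 method = "class",                 # classification tree
                 parms = list(split = "gini"))     # reduction-in-diversity criterion
cartfit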

9.4 Chi-squared automatic interaction detection (CHAID)


The CHAID algorithm differs from the CART algorithm covered in the last section in two main
ways:
• It is not restricted to binary splitting. Each parent node can be partitioned into any number
of child nodes, up to a maximum of the number of distinct categories making up the
predictor variable.
• The partitioning is based on a chi-squared test of association between the dependent variable
and each predictor variable.

9.4.1 Partitioning the tree

For comparative purposes, we will illustrate the construction of a CHAID tree using the same
credit risk example as before. The algorithm begins by placing all objects together in a root
node, as before. Then, the CHAID algorithm decides how to partition this root node according
to the following procedure:
(a) Consider each independent variable in turn.
(b) For each independent variable, consider every possible split of the values of that independent
variable (this can consist of any possible combination of categories or intervals).
(c) Perform a chi-square test of association between the dependent variable and each possible
partitioning.

(d) The partition that leads to the most significant chi-square test statistic over all possible
partitions is selected as the one to use.
(e) The process is then repeated for each of the new child nodes (i.e. they become the parent
nodes of the next step).

The motivation for using a chi-squared test of association to choose partitions is that, if a
partitioning is successful, it will split apart objects such that similar objects are in the same
node i.e. objects within nodes are as homogeneous as possible. If each node contains objects of a
similar type, then there will be a strong association between nodes (represented by the partitions
of an independent variable) and the dependent variable.

The use of the chi-squared test of association demands that both variables be categorical. When
the independent variable is categorical, this obviously poses no problem. In fact, the CHAID
approach has an advantage over the CART approach in that it can consider any combination
of categories, not just binary splits. That is, in addition to the five possible initial partitions
mentioned in the CART section, the CHAID approach can also try the partition

• Child 1 – Age < 25; Child 2 – Age = [25, 30]; Child 3 – Age > 30.

A less obvious point is that CHAID can also handle numeric independent variables, in the same
way as CART. That is, numeric variables are transformed into categorical variables by grouping
together certain intervals. Importantly, this grouping also does not need to be binary. For
example, if an independent variable Height is measured on a scale from 0 – 200cm, a possible
transformation is 0 – 150, 150 – 170, 170 – 180, and 180 – 200. The CHAID algorithm searches
for the most effective split. However, in most cases it is wise to use only a small number of
partitions in order to keep a reasonable number of objects in each node and avoid overfitting the
tree. This is ultimately controlled by the user.
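
In R, a numeric variable can be turned into such a categorical variable with the cut() function.
The sketch below is an illustration only: Height is the hypothetical variable from the example
above, generated here as random data.

Height <- runif(200, min = 0, max = 200)    # illustrative data only
HeightCat <- cut(Height, breaks = c(0, 150, 170, 180, 200),
                 labels = c("0-150", "150-170", "170-180", "180-200"),
                 include.lowest = TRUE)
table(HeightCat)                            # number of objects in each interval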

We return now to evaluating which of the (now six) possible partitions is the best way to
split the root node. Again, we consider a few of the calculations for illustrative purposes. The same
cross-tabulations as used in the calculation of the diversity index can be used by the CHAID
algorithm. The cross-tabulation of payment status with transaction volume was shown in Table
9.2. The test statistic resulting from a chi-squared test of association on this cross-table is given
by χ2 = 16 (see Chapter 2 if you need reminding how to compute this statistic).
This chi-squared test can easily be applied in R using the table() function in conjunction with
the summary() function:

> summary(table(Status,Volume))

Number of cases in table: 100
Number of factors: 2
Test for independence of all factors:
Chisq = 16, df = 1, p-value = 6.334e-05

This can be tested against a critical statistic from a chi-squared distribution with 1 degree of
freedom, or the p-value obtained directly from a statistical package (it turns out to be p =
0.00006). This p-value indicates how useful this particular Volume partition is for classifying
objects into repayment groups.

Another possible partition would be to split the root node into the three possible age groups.
Note that this particular partitioning was not permitted by the CART algorithm, since it divides
the root node into three groups (i.e. it is not binary). The cross-tabulation of loan repayment
with age category can be obtained in R using the following code:

> summary(table(Status,Age))

Number of cases in table: 100
Number of factors: 2
Test for independence of all factors:
Chisq = 0.4687, df = 2, p-value = 0.7911
Chi-squared approximation may be incorrect

The test statistic resulting from a chi-squared test of association between loan status and age is
given by χ2 = 0.4687, which has a corresponding p-value of p = 0.7911.
Another possible partition would be to split the root node into the two possible value groups.
The cross-tabulation of loan repayment with value category is obtained in R as follows:

> summary(table(Status,Value))

Number of cases in table: 100
Number of factors: 2
Test for independence of all factors:
Chisq = 5.844, df = 1, p-value = 0.01563

The test statistic resulting from a chi-squared test of association between loan status and value is
given by χ2 = 5.844, which has a corresponding p-value of p = 0.01563.
Thus, the Volume variable is a better splitting variable than these two particular Age and Value
partitionings because it has the smallest (most significant) p-value. In fact, the split created by
the three-group Age partition does not even associate significantly with repayment status.

It turns out that the most significant test statistic is observed when the Volume partition is used.
As a result, this variable is chosen as the initial splitting variable, which is consistent with the
CART results. The CHAID procedure is then repeated in each of the new child nodes. We will
only consider one of the child nodes, containing frequently transacting customers.

The Volume = Frequent node can be further partitioned in one of five ways (the four Age
partitions, binary and non-binary, or the one Value partition). Again, which partitioning is
chosen can be ascertained by calculating a chi-squared test statistic for each of these possible partitions.
The cross-tabulation of loan repayment with value can be obtained in R using the following code:

> summary(table(Status[Volume=="Frequent"],Value[Volume=="Frequent"]))

Number of cases in table: 50
Number of factors: 2
Test for independence of all factors:
Chisq = 9.78, df = 1, p-value = 0.001764

The relevant chi-squared statistic is 9.78, which has a p-value of 0.0017. Therefore, there is a
significant association between this partition and repayment behaviour. The cross-tabulation
of the three-group Age partition with the loan repayment for those objects in the Volume =
Frequent node can be obtained in R using the following code:

> summary(table(Status[Volume=="Frequent"],Age[Volume=="Frequent"]))

Number of cases in table: 50
Number of factors: 2
Test for independence of all factors:
Chisq = 0.5208, df = 2, p-value = 0.7707
Chi-squared approximation may be incorrect

The chi-squared statistic for this test of association is χ2 = 0.5208, which has a corresponding
p-value of p = 0.7707. The other three possible age partitions are also non-significant, and so
the Value partition is the most significant of all possible partitions, and is chosen to split the
node containing frequently transacting customers. The process then repeats once again in all
new child nodes until no further splitting is possible, either because all the possible splits have
been used, or because there are no longer any significant chi-squared test statistics i.e. no more
significant ways of partitioning the data. The final tree produced by the CHAID algorithm is
obtained in R with the CHAID package using the following code:

> library(CHAID)
> chaidfit <- chaid(formula, data = datatoWork)
> print(chaidfit)

Model formula:
Status ~ Volume + Value + Age

Fitted party:
[1] root
| [2] Volume in Frequent
| | [3] Value in High: Notpaid (n = 30, err = 46.7%)
| | [4] Value in Low: Paid (n = 20, err = 10.0%)
| [5] Volume in Seldom: Paid (n = 50, err = 4.0%)

Number of inner nodes:    2
Number of terminal nodes: 3

> plot(chaidfit)

[Figure: CHAID classification tree produced by plot(chaidfit). The root node splits on Volume (Frequent/Seldom); the Frequent branch splits further on Value (High/Low). The terminal nodes are Node 3 (n = 30), Node 4 (n = 20) and Node 5 (n = 50), each displayed with a bar chart of the proportions of Paid and Notpaid customers.]

9.5 Stopping rules

As we have seen, classification trees are recursive algorithms. What this means is that they
begin with all objects in the root node and then split this root node into partitions. Each of
the partitions (or child nodes) may then be split further, into finer partitions, and then these
second-level partitions split again and again. That is, in the extreme case classification trees
can split and split until there is only one case in each node. Having one case in a node is the
ultimate stopping condition, because obviously the node can not be split any further.

However, the partitions that are created further down the tree are likely to reflect peculiarities
and random noise in the training data set and are unlikely to generalise to other datasets. This
is the common problem of “overfitting” the model, and there is therefore a need to limit the size
of the tree so that its findings are general findings that apply to the broader population and do
not just overfit the sample data. There are several approaches that have been designed with this
aim in mind. Some can be applied to any classification tree algorithm, while others are specific
to a particular algorithm. In this section we discuss some commonly used stopping rules.

Stopping rules are generally divided into two types of approaches:

Bonsai techniques attempt to limit the growth of the tree before it gets too large. This
involves, at each partitioning, asking whether the partitioning is “useful” in terms of some
criteria. If a split is deemed useful, it is allowed to go ahead. If it is deemed not useful,
the node is not split any further and becomes terminal. Bonsai techniques require that one
specify what is meant by “useful”.

Pruning techniques allow the tree to grow to its full size i.e. do not employ any stopping rules
to begin with, and then remove nodes and branches of the tree that look like they are the
consequences of overfitting and thus will not generalise. Pruning techniques require that
one specify what is meant by “consequences of overfitting”.

9.5.1 Bonsai techniques

We consider three stopping rules that fall under the heading of bonsai techniques: setting a
minimum node size, setting minimal levels of statistical significance, and setting a minimum
reduction in diversity.

Setting a minimum node size: Perhaps the simplest stopping rule is simply to prevent any
node that contains fewer than Nstop objects from splitting any further, where Nstop is a
parameter that is set by the user. Obviously, Nstop will depend on the sample size in the
training set, but a good starting point might be to set Nstop to 5% or 10% of the sample
size. The larger the minimum node size that is set, the smaller the classification tree will
be.
Setting a minimum level of statistical significance: The CHAID algorithm selects parti-
tions based on which one has the most significant p-value in a chi-squared test of associa-
tion. A possible stopping rule is to prevent any partitioning from being used if it possesses
a p-value of greater than αstop , even if it is the most significant of all possible partitions (re-
membering that a higher p-value means less significant). Suppose that αstop is set to 0.05.
Then if the most significant partition is not significant at the 5% level (that is, p > 0.05),
the node will not be partitioned and will become a terminal node. The lower (more signif-
icant) the value of αstop used, the smaller the tree will be. Note that this rule can only be
applied to CHAID trees.
Setting a minimum reduction in diversity: The CART algorithm selects partitions based
on which one offers the largest reduction in diversity. The larger the reduction in diversity,
the more useful the split is likely to be. Therefore, a possible stopping rule is to set a
minimum reduction in diversity that is acceptable to the user, RiDstop . If a partition offers
a reduction in diversity less than RiDstop , then it will not be used even if it is the largest
reduction of all the possible partitions. The larger the value of RiDstop , the smaller the
resulting tree.
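
In R, bonsai-type stopping parameters are usually supplied through a control object when the
tree is fitted. The sketch below shows how this is done with rpart.control() (which is also used
in the examples later in this chapter) and with the CHAID package's chaid_control(); the values
are purely illustrative, and the chaid_control() argument names (minsplit, alpha2) are our
assumption and should be checked against the package documentation.

library(rpart)
cart_ctrl <- rpart.control(minsplit = 25,   # minimum node size before a split is attempted
                           cp = 0.01)       # minimum relative improvement required for a split

library(CHAID)
chaid_ctrl <- chaid_control(minsplit = 25,  # minimum node size (assumed argument name)
                            alpha2 = 0.05)  # significance level for splitting (assumed argument name)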

9.5.2 Pruning techniques

The following two pruning methods are commonly used in practice: pruning based on the per-
formance of the tree in a hold-out sample, and pruning based on a measure of misclassification
error.

Pruning using a hold-out sample: This approach uses an independent hold-out sample and
measures the percentage of correct classifications or hit rate of the full tree. The full tree
is then slowly reduced in size by trimming off nodes and branches until the root node
is reached. The selected tree is the one that maximises the overall percentage of correct
classifications. This method requires that the sample size is large enough to split into
training and hold-out samples.
Pruning on misclassification error (CV cost): This pruning approach is useful when no
hold-out sample is available and the training sample is too small to have a hold-out sample
taken from it. The approach is based on a measure of misclassification error known as the
CV cost or cross-validation cost. Each tree has a complexity or size (measured by its number
of terminal nodes), and each tree of a particular size has a CV cost, which is calculated in
the following way:
(a) Divide the entire sample into V samples, as equal in size as possible.

(b) Specify a particular ‘size’ for the classification tree (starting with just a root node i.e.
n = 1, all the way up to the number of terminal nodes in the full tree with no stopping
rule)
(c) The classification tree of the specified size is computed V times, each time leaving out
one of the V sub-samples from the computations and using it as a hold-out sample to
calculate a misclassification rate.
(d) Compute the average misclassification rate (CV cost) for a tree of a specified size along
with its standard error.
The CV cost gives an estimate of a hold-out sample misclassification rate when it is not
possible to use an actual hold-out sample. In some circumstances, it will make sense to
choose the tree with the minimum CV cost. We will use a slightly more general stopping
rule that states:
Pruning on CV cost rule: choose the tree with the fewest terminal nodes that
has a CV cost less than or equal to
min(CV cost) + Ψ × Std error of tree with min(CV cost)
where min(CV cost) is the minimum CV cost among all the possible classification
trees and Ψ is a parameter specified by the user.
If Ψ = 0, then the tree with the minimum CV cost will always be chosen (because it is the
only tree that has a CV cost less than or equal to min(CV cost)). For larger values of Ψ, a
smaller tree may be chosen if it has a CV cost that is within Ψ×Std error of tree with min(CV cost)
of the minimum. The larger the value of Ψ, the smaller the selected tree.
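
The rpart package reports exactly these quantities in the cptable component of a fitted tree: the
xerror column estimates the CV cost of each tree size and xstd its standard error. The sketch
below is our own illustration of the rule, assuming an already-fitted rpart object called rpartfit
and Ψ = 1.

cp_tab <- rpartfit$cptable                     # columns include xerror (CV cost) and xstd
Psi <- 1
cutoff <- min(cp_tab[, "xerror"]) +
          Psi * cp_tab[which.min(cp_tab[, "xerror"]), "xstd"]
best <- min(which(cp_tab[, "xerror"] <= cutoff))  # smallest tree with CV cost below the cutoff
pruned <- prune(rpartfit, cp = cp_tab[best, "CP"])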

9.6 Further examples


9.6.1 Predicting purchase intentions for laundry detergent

In the following dataset, 350 respondents were surveyed on their choice of laundry detergents
and the reasons for that choice. The data was provided by Synovate South Africa’s Brand Lab.
Respondent were asked to indicate whether they were likely to purchase a particular brand of
laundry detergent (say, brand X) in the next month. These purchase intentions are recorded
on a three-level categorical scale (“Yes, I am almost certain to buy it”, “Maybe, I’m not sure”,
and “No, I will almost certainly not buy it”). Respondents were also asked to indicate which
attributes they thought brand X possessed. The list of possible attributes is given below.

Quality Is the product of a high quality?


Trusted Is the product performance dependable?
GoodValue Is the product good value for money?
WellPriced Is the product well priced?
Fashionable Is the product becoming fashionable to use?
GoodTalk Do people say good things about the product? (recorded as GoodWoM in the data set)
Packaging Is the packaging good?
CleansWell Does the product clean clothes well?
Expensive Is the product expensive?
SmellsGood Does the product leave clothes smelling nice?
Gentle Is the product gentle on fabrics?

The data set, which is saved in your working directory under the [Link] file name, is read
into your workspace as follows:

> PurchaseIntentions <- [Link]("[Link]")


> attach(PurchaseIntentions)

The following object(s) are masked from datatoWork:

ID

> head(PurchaseIntentions)

ID Training Quality Trusted GoodValue WellPriced Fashionable GoodWoM Packaging
1 1 1 Yes Yes Yes Yes No No No
2 2 1 Yes Yes Yes No No No No
3 3 1 No No No No No No No
4 4 1 Yes Yes No Yes No Yes No
5 5 1 Yes Yes Yes No Yes Yes No
6 6 1 No No No No No No No
CleansWell Expensive SmellsGood Gentle PurchaseInt
1 Yes No Yes Yes Yes
2 Yes No Yes Yes Yes
3 No No No No Yes
4 Yes No No Yes Yes
5 Yes Yes Yes Yes Yes
6 No No No No No

The first 250 observations are used as a training data set. As a result, the model and the data
to work with are defined in R as follows:

> formula2 = PurchaseInt ~ Training+Quality+Trusted+GoodValue+WellPriced+Fashionable+
+     GoodWoM+Packaging+CleansWell+Expensive+SmellsGood+Gentle
> datatoWork2 = PurchaseIntentions[1:250,]

The CART method with a minimum splitting criterion of 13 (the minimum number of observations
that must exist in a node in order for a split to be attempted) and with a complexity parameter
(cp) of 0.005 is applied using the following R code:

> rpartfit2 <- rpart(formula2, data=datatoWork2, control=rpart.control(minsplit=13, cp=0.005))
> rpartfit2

n= 250

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 250 151 Yes (0.29200000 0.31200000 0.39600000)
2) Gentle=No 136 64 No (0.33823529 0.52941176 0.13235294)
4) Quality=No 96 28 No (0.26041667 0.70833333 0.03125000)
8) Expensive=Yes 13 3 Maybe (0.76923077 0.15384615 0.07692308) *
9) Expensive=No 83 17 No (0.18072289 0.79518072 0.02409639)
18) SmellsGood=Yes 4 0 Maybe (1.00000000 0.00000000 0.00000000) *
19) SmellsGood=No 79 13 No (0.13924051 0.83544304 0.02531646)
38) Trusted=Yes 5 1 Maybe (0.80000000 0.20000000 0.00000000) *
39) Trusted=No 74 9 No (0.09459459 0.87837838 0.02702703) *
5) Quality=Yes 40 19 Maybe (0.52500000 0.10000000 0.37500000)
10) SmellsGood=No 27 8 Maybe (0.70370370 0.11111111 0.18518519) *
11) SmellsGood=Yes 13 3 Yes (0.15384615 0.07692308 0.76923077) *

3) Gentle=Yes 114 33 Yes (0.23684211 0.05263158 0.71052632)
6) Quality=No 29 15 Maybe (0.48275862 0.06896552 0.44827586)
12) Fashionable=Yes 8 2 Maybe (0.75000000 0.00000000 0.25000000) *
13) Fashionable=No 21 10 Yes (0.38095238 0.09523810 0.52380952)
26) GoodValue=No 15 8 Maybe (0.46666667 0.13333333 0.40000000) *
27) GoodValue=Yes 6 1 Yes (0.16666667 0.00000000 0.83333333) *
7) Quality=Yes 85 17 Yes (0.15294118 0.04705882 0.80000000) *

> partyfit2 = as.party(rpartfit2)
> plot(partyfit2, type = "extended")

[Figure: CART classification tree plotted with plot(partyfit2, type = "extended"). The root node splits on Gentle; the Gentle = No branch splits further on Quality, Expensive, SmellsGood and Trusted, while the Gentle = Yes branch splits on Quality, Fashionable and GoodValue. Each terminal node shows a bar chart of the proportions of the three purchase intention groups (Maybe/No/Yes).]

Try the CART method with a minimum splitting criterion of 20 (the minimum number of obser-
vations that must exist in a node in order for a split to be attempted) and with a complexity
parameter (cp) of 0.05, and compare the results; a sketch of the required R code is given below.
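
The following sketch shows one way to do this (rpartfit3 is our own name for the new fit; rpart
and partykit, which provides as.party(), are assumed to be loaded):

rpartfit3 <- rpart(formula2, data = datatoWork2,
                   control = rpart.control(minsplit = 20, cp = 0.05))
rpartfit3
plot(as.party(rpartfit3), type = "extended")
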
The aim of this research is to identify the attributes that distinguish between the three groups of
users, so that (1) product developers can work to improve those aspects that are causing people
to not use the product, (2) advertisers can accentuate those aspects that are leading people
to buy the product. In doing so, we also aim to develop a set of decision rules based on the
attributes that would allow a market researcher to classify a potential customer as someone who
is likely to buy the product or unlikely to buy the product. From our point of view, we also use
this example to illustrate the CV cost calculation (“pruning based on misclassification error”)
and to show how the choice of classification tree algorithm (CART or CHAID) can affect the
final tree that is chosen as well as the overall misclassification rate.

Since we have a fairly large sample of 350, we divide the sample into a training sample of 250 and
a hold-out sample of 100. It is important to bear in mind that these trees give the classification
for the training sample, and so the hit rate for the minsplit=13 criterion is 79.2% (check this
yourself!) and it may be upwardly biased because we are testing the model on the same data as
we used to create the model in the first place. In cases where we have kept aside a hold-out sample
in order to test the constructed classification tree, we should also report a classification matrix
obtained from applying the classification tree to the hold-out sample. The tree is applied to the
hold-out sample in R as follows, which gives a correct classification rate of about 60% (check this
from the classification matrix!):

> dataHoldOut=PurchaseIntentions[250:350,]
> p2=predict(partyfit2, dataHoldOut)
> table(dataHoldOut$PurchaseInt,p2)

p2
Maybe No Yes
Maybe 13 7 13
No 2 23 1
Yes 13 4 25
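
The hit rate quoted above can be computed directly from this classification matrix; a minimal
sketch:

cm <- table(dataHoldOut$PurchaseInt, p2)
sum(diag(cm)) / sum(cm)    # proportion of hold-out cases classified correctly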

Since the two hit rates are quite far apart, we should doubt whether this classification tree
provides stable classifications that will generalise beyond the training data.
Chapter 10

Structural Equation Modeling

Earlier in this course, we saw how the factor analysis technique could be used to identify cases
where responses on a subset of observed variables were in fact associated or correlated with
one another. In this case, we said that the subset of variables measured the same underlying
construct or factor. Using the factor analysis technique, we were able to build up a quite pow-
erful way of discussing underlying traits or factors that we do not measure directly but only
indirectly through the independent variables: traits that were usually vague and rather abstract
(e.g. “intelligence”, “charitableness”, “socioeconomic status”). However, a problem with using fac-
tor analysis is that no dependence relationships can be investigated. That is, we cannot answer
questions about whether two factors are related to one another, or whether a combination of two
factors can influence the value of a third. It was a purely descriptive or exploratory technique.

One way to investigate dependencies between factors is to generate factor scores, as we saw in
the factor analysis section, and then use these in a multiple linear regression. However, there
are two comments that we should note about this approach: (1) first, we should ask whether it
might not be possible to use a single technique rather than match together two quite different
multivariate approaches; (2) we should also note that a multiple regression framework is quite
restrictive in terms of the relationships we can look at. For example, it is often the case that
the dependent variable in one model is the independent variable in a second model: we will soon
consider an example where industrialization levels in a country affect the democracy levels in a
country in both 1960 and 1965 (this is a famous research problem in the literature - Bollen) and
the democracy levels in 1960 affect the democracy levels in 1965. In this case, the democracy
levels in a country in 1960 is both a dependent variable (influenced by industrialization levels in
1960) and an independent variable (influencing the democracy levels in 1965). The regression
techniques we have considered thus far cannot handle such complexity, but in this section we
turn to Structural equation modelling (SEM), a technique combining elements of both multiple
regression and factor analysis that can deal with very complex modelling tasks.

The fundamental characteristics of SEM are (1) the estimation of multiple and interrelated
dependence relationships, (2) the ability to represent unobserved concepts (factors) in these re-
lationships, and (3) the ability to account for measurement error in the estimation process. The
first characteristic in particular is not captured by conventional multivariate analysis techniques.
SEM thus gives the researcher more flexibility than other multivariate methods. But along with
this flexibility comes the potential for inappropriate use of the method. The overriding consid-
eration in any application of SEM should be a reliance on a theoretically based foundation for
the proposed model and (particularly) any modifications to this model. SEM is often advocated
in a confirmatory capacity, where if used correctly the model provides a strong test of causal re-
lationships. In an exploratory capacity, the researcher must take care not to fall prey to ‘fishing’
the data for relationships that have little generalisability beyond the sample data. Naturally the
continuum between strictly confirmatory and strictly exploratory analyses leaves ample room
for SEM-based exploration around a proposed theory, but the point is that theory plays a critical
role in assuring the usefulness of the results of the SEM process.

Structural equation modelling is composed of two parts: a measurement model and a structural
model. The measurement model is a sub-model that (1) specifies the indicators for each con-
struct, and (2) assesses the reliability of each construct for estimating the causal relationships.
The measurement model is similar in form to factor analysis; the major difference lies in the de-
gree of control provided to the researcher. In ordinary factor analysis, the researcher can specify
only the number of factors, but all variables have loadings for all factors (although to different
degrees – which is the basis for choosing which factor is ‘loaded on’ by a variable). In the mea-
surement model, the researcher specifies which variables are associated with each construct, with
variables having no loadings (i.e. a loading of zero) on any construct other than those constructs
it is associated with. This is discussed in more detail later. The structural model is a set of one
or more dependence relationships linking the hypothesised model’s constructs. These relation-
ships can be written in equation form, in much the same way as a multiple regression procedure.
The true value of SEM comes from the benefits of using the structural and measurement models
simultaneously, each playing distinct roles in the overall analysis.

10.1 Developing a theoretical model

Multivariate techniques, particularly SEM, should never be used by themselves as a means


of proving ‘causation’ without having some theoretical guiding perspective. SEM is based on
causal relationships, which can vary in form, strength and meaning. A generally established set
of criteria for causation is:

(a) Sufficient association between the two variables.


(b) Evidence that the cause occurs before the effect.
(c) Lack of alternative causal variables.
(d) Theoretical basis for the relationship.

The above criteria illustrate that significant association between two variables, i.e. satisfaction
of criterion (a), is not sufficient to validate or justify a theory. This emphasises the need for
theoretical justification for the specification of the dependence relationships in SEM. There is
little more that can be said about the development of a theory: it is a practical matter relating
to the researcher. However, one important aspect that does need to be taken into account
before and during model development is the role of a modelling strategy. Three broad modelling
strategies are suggested for use in SEM applications.

Confirmatory Strategy In a confirmatory modelling strategy the researcher specifies a single


model, and SEM is used to assess its statistical significance. The underlying logic is one
of “either it works or it doesn’t”. Although this is a simple and concise strategy, it suf-
fers relatively more than other strategies from the confirmatory bias of SEM. Under the
confirmatory bias, a significant or acceptable model is best interpreted as just one of the
possible models rather than the ‘correct’ one. The lack of comparisons due to a confirmatory
strategy also limits the extent of model interpretations.
Competing Models Strategy A competing models strategy is a means of evaluating the es-
timated model with other models i.e. to compare alternative formulations of the underlying
theory. It is important, however, that all the alternative models have a good theoretical
basis.
Model Development Strategy The previous models assume a good working theory is in
place. Often this is not the case, so that the researcher is aiming to develop the theory by
empirical observation. A model development strategy supports such an aim by attempting
to improve a model through modifications of the structural and/or measurement models.

The underlying theory is taken only as a starting point for building a model that is sup-
ported by the data. It is this strategy which comes closest to an ‘exploratory’ view, and as
such has the most danger for overfitting and unnecessary modelling.

Parsimony (the idea that model simplicity is an important advantage, so that if two models offer
the same predictive accuracy then one should choose the model with fewer parameters) is an
important modelling concept in the early stages of SEM – it is suggested that interpretation
(particularly of statistical significance) becomes difficult if the number of constructs exceeds 20,
although obviously this number alone is no justification for limiting a genuinely large model.
The benefits of interpretation, though, lie with parsimonious and concise models.

10.2 Constructing a path diagram of causal relationships

The key feature of SEM is the way that it represents the theoretical model specified by the
researcher in the previous step. It does this using (1) a path diagram which provides a graphical
representation of a series of causal relationships, using arrows to denote the direction of influence
between variables, and (2) a set of algebraic equations called structural equations which in effect
‘translate’ the path diagram into a precise mathematical format. The structural equations
themselves look very much like regression equations; the main difference is that with SEM there
is a whole system of these equations (rather than just one).

The construction of a path diagram can become complicated as the size of the problem increases.
There are two basic modelling aspects that are contained in a path diagram

(a) Dependent-independent variable relationships (causation)


(b) Associations between constructs or variables making up constructs (correlation)

These aspects are represented using three fundamental elements

Constructs A construct is a theoretically based concept used as a building block for the model
- also known as a factor. A construct may be classified along two dimensions: exoge-
nous/endogenous and latent/manifest.

Exogenous/Endogenous constructs
(a) Exogenous constructs: those that are not predicted by any other construct - also known
as source variables or independent variables.
(b) Endogenous constructs: those that are predicted by one or more constructs.
The great advantage of SEM is that endogenous constructs may predict other endogenous
constructs. These interrelationships - to form a ‘chain of models’ - are not captured by
other techniques such as multiple regression.

Latent/Manifest variables
(a) Latent constructs: those variables or constructs that are not directly measured in the
model (these are the equivalents of ‘factors’ in factor analysis).
(b) Manifest constructs: those variables or constructs that are directly measured in the
model.
Latent variables are usually denoted in path diagrams by circles/ovals and manifest vari-
ables usually represented by rectangles.
Causal Relationships These are denoted by straight arrows from independent (exogenous or
endogenous) to dependent (endogenous) variables i.e. in the direction of causation. The
arrow may be double-headed to represent a reciprocal relationship, although this is rare
and may cause some estimation problems.

Correlations Correlations between two constructs are indicated by a curved line between the
constructs involved.

At this stage, let us introduce a worked example in order to make the steps involved in building
a SEM clearer. Bollen, K.A. (1989) aimed to measure political democracy, analysing a data set
of 75 developing countries at two points in time, 1960 and 1965. Democracy is measured using
four variables: freedom of the press, freedom of political opposition, fairness of elections and
effectiveness of the elected legislature. Industrialization in 1960, which is believed to have an
impact on a country's democracy levels in both 1960 and 1965, is measured using three indicators:
GNP per capita in 1960, energy consumption per capita in 1960, and the percentage of the labour
force in industry in 1960. The goal in this particular study therefore is to analyse the determinants
of democracy levels in a country.

In examining how the SEM is built up, we first discuss the basic structural model, then the
measurement model specifying how each construct is to be measured, and finally put the two
together into a full SEM.

10.2.1 The structural model

As mentioned before, the structural model is a set of one or more dependence relationships link-
ing the hypothesised model’s constructs. Democracy level in 1965 is affected by both democracy
level in 1960 and the industrialization level in 1960, whereas democracy level in 1960 is only
affected by the industrialization level in the same year; these relationships are given in
the hypothesised structural model shown in Figure 10.1.

In SEM, it is common to differentiate the different variables according to latent endogenous and
latent exogenous variable definitions. Latent exogenous variables are denoted ξ (the Greek letter
“xi”) and latent endogenous variables are denoted η (the Greek letter “eta”). The indicators
that construct the latent endogenous variables are denoted “y” and the indicators that construct
the latent exogenous variables are denoted “x”. The error terms attached to latent endogenous
variables are denoted ζ (the Greek letter zeta), the residuals attached to the indicators of latent
endogenous variables are denoted ε (the Greek letter epsilon), and the residuals attached to the
indicators of latent exogenous variables are denoted δ (the Greek letter delta).

[Figure 10.1: Structural model component of the full SEM. ξ1 (Indust 1960) predicts both η1 (democracy 1960) and η2 (democracy 1965), and η1 (democracy 1960) in turn predicts η2 (democracy 1965).]

A number of points are immediately clear from the path model above. Firstly, the democracy
levels in 1965 is believed to depend on two variables: industrialization levels in the country
in 1960, and the democracy levels in the country in 1960. Second, the democracy levels in the
country in 1960 are also influenced by the industrialization levels in the country in 1960. It is easy

to think of situations in which the relationship might be reciprocal (e.g. where dem65 may
be able to influence dem60 as well). At this stage, we have a single exogenous latent variable,
industrialization levels in the country in 1960, and two endogenous latent variables, democracy
levels in the country in 1960 and democracy levels in the country in 1965.

10.2.2 The measurement model

As is frequently the case with models in the social and economic sciences, the constructs of in-
terest ‘dem60’, ‘dem65’ and ‘ind60’ are vague and difficult to measure directly. We will therefore
not try to measure the constructs directly, but will instead treat them as underlying factors to
be measured with a set of indicator variables (the manifest variables). Our structural equation
model therefore also needs to specify a measurement model identifying how each of the constructs
is going to be measured. The measurement model provides a way to actually test the structural
relationships by specifying exactly how the constructs should be measured. This task may be
particularly difficult but important when the underlying constructs are of an abstract nature
i.e. intelligence, commitment to a cause, etc. In this respect the measurement model can be
compared to factor analysis, with which it shares many similarities. However, whereas in factor
analysis each variable loads on each factor (although to different degrees which are expressed
as the magnitude of the loadings), SEM gives the researcher control to specify precisely which
factor is associated with each variable. By implication those variables not associated with a
factor (as specified by the researcher) will receive factor loadings of zero on that factor. Conven-
tional factor analysis is therefore more exploratory in nature since there are no constraints on
the variable loadings. The use of the measurement model of SEM in a factor analysis context is
often termed confirmatory factor analysis.

Returning to our example, we begin by specifying the measurement model (or ‘factor model’)
for the exogenous latent variable ind60: this construct is measured using three indicators or
manifest variables, GNP per capita, energy consumption per capita, and percentage of labour
force. This can be shown graphically by the path diagram in Figure 10.2.

[Figure 10.2: Measurement model component (for latent exogenous constructs). The latent construct ξ1 (Indust 1960) is measured by three indicators: x1 (GNP per capita), x2 (energy consumption per capita) and x3 (percentage of labour force).]

Note that in the notation of SEM, the underlying or latent construct is said to “cause” the ob-
served or manifest variables (e.g. industrialization “causes” the observed GNP per capita, energy
consumption and percentage of labour force). We now also need to specify the measure-
ment model for the endogenous latent variables dem60 and dem65: let us suppose that it is
agreed that freedom of the press, freedom of political opposition, fairness of elections and ef-
fectiveness of elected legislature give a reasonably good indication of the democracy levels in a

country, as shown in Figure 10.3.

[Figure 10.3: Measurement model component (for latent endogenous constructs). η1 (democracy 1960) is measured by y1–y4 (freedom of the press, freedom of political opposition, fairness of elections and effectiveness of elected legislature, all in 1960), and η2 (democracy 1965) is measured by y5–y8 (the same four indicators in 1965).]

An important point is that in order for the SEM to be estimated, it is necessary to fix the
coefficient of one of the indicators on each of the endogenous latent variables. If the researcher
has a preference, this has to be specified. The other parameters in the measurement models are
free parameters to be estimated by the SEM.

10.2.3 Putting the measurement and structural models together

Having specified the individual parts making up the model, we can put the two measurement
models and one structural model together to form the full model. The associated path diagram
is shown in Figure 10.4.

[Figure 10.4: Full SEM model (no residual variables shown). The measurement models for ξ1 (ind60, indicators x1–x3), η1 (dem60, indicators y1–y4) and η2 (dem65, indicators y5–y8) are combined with the structural paths from ind60 to dem60 and dem65, and from dem60 to dem65.]

This represents the full extent of our structural equation model. Many applications of SEM also
choose to display error terms in the graphical models. All hypothesised relationships, be they
part of the structural or measurement models, are subject to some kind of error, and so these
errors are often included explicitly. The way in which this is done is simply to attach a latent
‘error’ variable to the dependent variable of each relationship in the path diagram.

Figure 10.5 shows the full SEM with errors terms included.

As is clear, the full path diagram contains a wealth of information, specifying both hypothesised
causal relationships and a full measurement model indicating which observed variables load
onto which underlying latent constructs. Although we do not allow any of the variables to be
[Figure 10.5: Full SEM model (with residual variables shown). The model of Figure 10.4 with error terms attached: δ1–δ3 on the indicators x1–x3, ε1–ε8 on the indicators y1–y8, and ζ1 and ζ2 on the latent endogenous constructs η1 and η2.]

correlated, this extension can easily be included in the path model by joining the two variables
whose correlation one wants to estimate with a curved line (no arrowheads) between them.
This correlation would then be an extra parameter to be estimated by the model.

10.3 Converting the path diagram into structural and measurement models

This third stage specifies the model contained in the path diagram in more formal, specifically
algebraic, terms. As such it represents a bridge between the theoretical first and second stages
which are predominantly theory-building phases, and the remaining stages, where the model
‘takes over’. As mentioned earlier, a SEM comprises two sub-models: (1) a structural model
linking constructs in a series of dependence relationships, and (2) a measurement model specify-
ing which variables measure which constructs. It may again be useful to break the model down
into a structural model and a measurement model.

10.3.1 The structural model

The translation of the path diagram to a set of structural equations is a straightforward proce-
dure. Each of the endogenous variables becomes the dependent variable in its own equation –
with its respective independent variables. Each structural equation is therefore made up of

(a) An endogenous construct as a dependent variable


(b) One or more endogenous and/or exogenous constructs as independent variables
(c) Structural coefficients (beta coefficients) on the independent variables
(d) An error term

In our worked example, there are two equations making up the structural part of the model.
Firstly, democracy levels in 1965 (η2 ) depends on both democracy levels in 1960 (η1 ) and indus-
trialization levels in 1960 (ξ1 ). Secondly, democracy levels in 1960 depends on industrialization
levels in 1960. We can express these two structural equations as:

η1 = β1 ξ1 + ζ1
η2 = β2 ξ1 + β3 η1 + ζ2

10.3.2 The measurement model

The conversion of the measurement parts of the path diagram into algebraic terms works in
much the same way as for the structural model, remembering that (1) the underlying or latent
construct "causes" the observed or manifest variables, and (2) in order for the SEM to be
estimated, the coefficient of one indicator of each latent variable must be fixed (usually to 1):

y1 = β4 η1 + ε1
y2 = β5 η1 + ε2
y3 = β6 η1 + ε3
y4 = β7 η1 + ε4

y5 = β8 η2 + ε5
y6 = β9 η2 + ε6
y7 = β10 η2 + ε7
y8 = β11 η2 + ε8

x1 = β12 ξ1 + δ1
x2 = β13 ξ1 + δ2
x3 = β14 ξ1 + δ3

10.3.3 Putting the measurement and structural models together

Having specified the individual parts making up the model, we can put the two measurement
models and one structural model together to form the full model in algebraic form. Note that
all the information contained in the path diagram is contained in this set of 13 equations, and
that given either representation we can construct the other (i.e. we can construct the diagram
from the set of equations or the set of equations from the diagram). The two forms are
equivalent representations of the underlying theoretical model.

η1 = β1 ξ1 + ζ1
η2 = β2 ξ1 + β3 η1 + ζ2
y1 = β4 η1 + ε1
y2 = β5 η1 + ε2
y3 = β6 η1 + ε3
y4 = β7 η1 + ε4
y5 = β8 η2 + ε5
y6 = β9 η2 + ε6
y7 = β10 η2 + ε7
y8 = β11 η2 + ε8
x1 = β12 ξ1 + δ1
x2 = β13 ξ1 + δ2
x3 = β14 ξ1 + δ3

Another important capability of SEM is the ability to specify correlations between constructs.
Strictly speaking, the SEM does not 'specify correlations' but rather includes the correlation
between two constructs as an additional parameter to estimate; if no correlation is specified,
that parameter is simply not included in the model. Correlations between exogenous constructs
are common, representing some sort of 'shared' influence on the endogenous variables.

The definitions of the variables are as follows:

• ξ1 : Latent exogenous variable
• η1 : Latent endogenous variable
• η2 : Latent endogenous variable
• y1 − y8 : Manifest endogenous variables (indicators of η1 and η2)
• x1 − x3 : Manifest exogenous variables (indicators of ξ1)
• ζ1 − ζ2 : Errors for the latent endogenous variables
• ε1 − ε8 : Errors for the indicators of the latent endogenous variables
• δ1 − δ3 : Errors for the indicators of the latent exogenous variable

10.4 Estimating the proposed model

As mentioned before, the first three stages are largely concerned with the development of a
theoretical model in a form amenable to modelling by SEM. Having done this, we may move
on to actually estimating and interpreting the resultant model. There are two major aspects to
this fourth stage: (1) inputting the data in the appropriate form, and (2) selecting an estimation
procedure. Both of these have a large effect on the results of the SEM process.

Use of correlation or covariance matrix

Although the covariance matrix was used in original SEM applications, the use of correlations
has gained widespread acceptance. The use of covariances makes the interpretation of results
more difficult because the coefficients must be interpreted in terms of the units of measure for
the constructs (in comparison to the correlations, which are standardised and are thus easy to
compare). In this course, we make exclusive use of correlation matrices when estimating a SEM.
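
If only a covariance matrix is available, it can be rescaled to a correlation matrix before estimation using the cov2cor() function in base R. The following is a minimal sketch using a small, purely hypothetical 2 × 2 covariance matrix (the object name S is our own choice):

> # hypothetical 2 x 2 covariance matrix, for illustration only
> S <- matrix(c(4.0, 1.2,
+               1.2, 9.0), nrow = 2,
+             dimnames = list(c("x1","x2"), c("x1","x2")))
> cov2cor(S)   # divides each covariance by the two standard deviations

Here the off-diagonal entry becomes 1.2/(2 × 3) = 0.2, the correlation between x1 and x2.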

Sample size

The question of what size sample is ‘big enough’ pervades most of statistical modelling. It too
plays an important role in estimation and interpretation of SEM results, and provides a basis
for the estimation of sampling error. The general view is that although maximum likelihood
estimation (MLE) can produce valid results with a sample size as small as 50, the accepted
minimum sample size is 100-150. A sample size of 200 has been proposed as a ‘good’ or ‘critical’
sample size. The MLE method increases its sensitivity to differences in the data as the sample
size increases: once the sample size reaches 400-500 the MLE is too sensitive and any difference
is detected, leading inevitably to conclusions of poor fit. A sample size of 150-200 is therefore the
closest one can get to a definitive answer to the question of sample size. That answer, however,
should be modified and influenced by the consideration of four further factors:

Model Size The most obvious consideration is that as the model increases in size and com-
plexity, more observations are needed to ensure that its parameters are reliably estimated.
A ratio of 5 respondents for each estimated parameter is advised as a rule of thumb, with
a ratio of 10 being considered more appropriate (see the short sketch after this list).
Model Misspecification If the researcher has concerns about the impact of specification error
(the effect of omitting important variables), sample sizes should be increased.
Departures from Normality If normality is violated, the sample size requirement increases
to around 15 respondents per estimated parameter.
Estimation Procedure The rules of thumb given above are for the MLE procedure. Asymp-
totically distribution-free methods require substantially larger sample sizes to compensate
for the fact that they do not rely on the distributional assumptions that other methods make.
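
As a rough illustration of these ratios (a sketch only): the worked example later in this chapter has 31 free parameters, so the rules of thumb translate into the following minimum sample sizes.

> n.par <- 31    # number of free parameters in the worked example below
> 5 * n.par      # minimum ratio of 5 respondents per parameter
[1] 155
> 10 * n.par     # preferred ratio of 10 respondents per parameter
[1] 310
> 15 * n.par     # if normality is clearly violated
[1] 465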

10.4.1 Estimation procedure

The following estimation procedures have been used in SEM; a short sketch of how an estimation method is selected in R follows the list.


Ordinary Least Squares Used in early SEM applications, but no longer popular.
Maximum Likelihood Estimation The most common procedure. Efficient and unbiased un-
der multivariate normality, but sensitive to non-normality. We use only this estimation
method in the current course.
Generalised/Weighted Least Squares Less sensitive to violations of normality, but become
impractical as model size and complexity increase.
Asymptotically Distribution-Free Insensitive to non-normality, but much larger sample sizes
required.
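
As a sketch of how this choice is made in practice: in the lavaan R package used later in this chapter, the estimation procedure is selected through the estimator argument of the fitting functions. The objects myModel and myData below are hypothetical placeholders.

> library(lavaan)
> fit.ml  <- sem(myModel, data = myData)                     # maximum likelihood (the default)
> fit.gls <- sem(myModel, data = myData, estimator = "GLS")  # generalised least squares
> fit.adf <- sem(myModel, data = myData, estimator = "WLS")  # ADF / weighted least squares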

10.5 Evaluating the goodness-of-fit of a model


Once the parameters of the SEM have been estimated, we can move on to examine the results.
Before the performance of the model can be evaluated, it is necessary to inspect the model for
offending estimates, those estimates that are theoretically unacceptable. Offending estimates can
be identified by negative (or nonsignificant) error variances, standardised coefficients exceeding
1, or extremely large standard errors, and indicate that there is some problem in the estimation
of the model. Once acceptability is achieved, an evaluation may be made of the goodness-of-fit
of (1) the overall model, (2) the measurement model, and (3) the structural model.
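
A sketch of how such checks can be carried out in the lavaan R package (used later in this chapter), assuming a fitted model object called fit:

> pe <- parameterEstimates(fit)
> subset(pe, op == "~~" & lhs == rhs & est < 0)    # any negative error variances?
> std <- standardizedSolution(fit)
> subset(std, op == "=~" & abs(est.std) > 1)       # any standardised loadings above 1?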

10.5.1 Assessing overall model fit

Goodness-of-fit measures the correspondence of the observed covariance or correlation matrix


with that predicted from the proposed model. It is here that parsimony becomes especially
important, since it is often possible to ‘overfit’ the model to the extent that generalisability is
sacrificed for acceptable fit. There are two broad types of measures: (1) absolute fit measures
and (2) fit measures adjusting for model size. Generally, the researcher is encouraged to use one
or more measures from each type, to try and get a consensus of opinion across the measures
as to the acceptability of the proposed model. It is important to realise that an acceptable
level of goodness-of-fit does not guarantee that all constructs will meet the requirements of the
measurement model nor that the structural model will be fully supported. The researcher must
investigate these areas separately.

Absolute fit measures These measures assess only overall (structural and measurement model)
fit with no adjustments for parsimony. The basic measures are
• Likelihood ratio χ2 : Tests the null hypothesis that no difference exists between the
actual and predicted correlation/covariance matrices. Rejection of the null hypothesis
therefore implies a poor fit.
• Goodness-of-fit index (GFI): A model with a GFI greater than 0.90 is generally accepted
to have a good fit.
• Root mean square error of approximation (RMSEA): A model with an RMSEA less
than 0.08 is generally accepted to have a good fit. The rule of thumb further states
that 0.10 is the upper cut-off for acceptable fit.

Adjusted fit measures These measures adjust the fit measures to allow comparison of models
with different numbers of estimated coefficients – the purpose being to determine the
amount of fit achieved by each estimated coefficient. (A short R sketch of how these
indices can be requested follows this list.)
• Adjusted goodness-of-fit index (AGFI): A model with an AGFI greater than 0.90 is
generally accepted to have a good fit.
• Normed Fit Index (NFI): A model with an NFI greater than 0.90 is generally accepted
to have a good fit.
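
The following is a short sketch of how one or more measures from each type can be requested in lavaan and compared against the rule-of-thumb cut-offs, again assuming a fitted model object called fit:

> fm <- fitMeasures(fit, c("chisq", "df", "pvalue", "gfi", "agfi", "nfi", "rmsea"))
> fm["gfi"]   >= 0.90    # absolute fit
> fm["rmsea"] <= 0.08    # good fit (values up to 0.10 are still acceptable)
> fm["agfi"]  >= 0.90    # adjusted fit
> fm["nfi"]   >= 0.90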

Once the overall model fit has been evaluated, and provided that we have confirmed the accept-
ability or near acceptability of a model, the parameter estimates of both the measurement model
and the structural model can then be assessed. In contrast to exploratory factor analysis, since
the loadings of the measurement model are actually now estimated parameters, we can use a
t-test to test for significance. The assessment of the structural part of the model follows closely
what would happen in a regression analysis. Again, the most valuable information about the
direction and size of each effect can be gained by examining the estimated coefficients for signif-
icance. Due to the characteristics of the MLE (especially at small sample sizes), the researcher
is advised to be conservative in the specification of a significance level i.e. to use 1% rather than
5%.

Let us return again to our worked example, where the results that follow have been obtained by
estimating the structural equation model in R. The SEM is analysed in R using the "lavaan"
package, which must be installed and loaded before running the analysis. The democratization
example can then be analysed using the following R code, assuming that the Bollen data are
stored in your working directory as a plain text file (called Bollen.txt here – substitute your
own file name):

> Bollen=read.table("Bollen.txt", header=T)


> attach(Bollen)
> names(Bollen)

[1] "y1" "y2" "y3" "y4" "y5" "y6" "y7" "y8" "x1" "x2" "x3"

> library(lavaan)

The latent variables are defined using the "equal to and tilde" (=~) operator: the name of the
latent variable is written on the left-hand side and the indicators that make up that latent
variable are written on the right-hand side:

>
+ # latent variable definitions
+ ind60 =~ x1 + x2 + x3
+ dem60 =~ y1 + y2 + y3 + y4
+ dem65 =~ y5 + y6 + y7 + y8
+

The structural parts are defined using the "tilde" (~) operator: the endogenous variable is
written on the left-hand side, just as in multiple regression, and the variables that have an
impact on it are written on the right-hand side:

>
+ # regressions
+ dem60 ~ ind60
+ dem65 ~ ind60 + dem60
+

Finally, the residual covariances (if there are any) are defined using the "double tilde" (~~) operator:

>
+ # residual covariances
+ y1 ~~ y5
+ y2 ~~ y4 + y6
+ y3 ~~ y7
+ y4 ~~ y8
+ y6 ~~ y8
+

All of these specifications are combined into a single R object (called "modelBollen" in this
example), written between single quotes:

> modelBollen <- '
+ # latent variable definitions
+ ind60 =~ x1 + x2 + x3
+ dem60 =~ y1 + y2 + y3 + y4
+ dem65 =~ y5 + y6 + y7 + y8
+ # regressions
+ dem60 ~ ind60
+ dem65 ~ ind60 + dem60
+ # residual covariances
+ y1 ~~ y5
+ y2 ~~ y4 + y6
+ y3 ~~ y7
+ y4 ~~ y8
+ y6 ~~ y8
+ '

After specifying the model, the SEM is estimated using the lavaan() function:

> fitBollen <- lavaan(modelBollen, Bollen, auto.var=TRUE, auto.fix.first=TRUE,
+ auto.cov.lv.x=TRUE)
> fitMeasures(fitBollen, c("chisq", "df", "pvalue", "rmsea", "rmsea.ci.lower",
+ "rmsea.ci.upper", "cfi", "gfi", "agfi"))

         chisq             df         pvalue          rmsea rmsea.ci.lower
        38.125         35.000          0.329          0.035          0.000
rmsea.ci.upper            cfi            gfi           agfi
         0.092          0.995          0.923          0.854

As suggested in the previous discussion, we first look at overall model fit before considering
individual parameter estimates.

Overall model fit


Our main focus of interest is in evaluating whether the theoretical model we proposed in the path
diagram in Figure 10.5 is consistent with the observed patterns in the data. We can see whether
the data fit the model by considering some of the overall fit statistics discussed earlier:

Fit statistic   Estimate                           Cut-off for good fit

ML χ2           38.125 (35 DoF, p-value = 0.329)   Test against H0: no difference between the
                                                   predicted and observed covariance matrices
RMSEA           0.035                              good fit (RMSEA ≤ 0.08), acceptable (≤ 0.10)
GFI             0.923                              GFI ≥ 0.90
Adjusted GFI    0.854                              AGFI ≥ 0.90

Table 10.1: Summary of main fit indices

We can begin by noting that the maximum likelihood χ2 test statistic, which tests the null
hypothesis that the data fit the hypothesized model (specifically, that the observed covariance
matrix is equal to the covariance matrix implied by the model), is 38.125 on 35 degrees of
freedom, with a p-value of 0.329. We therefore fail to reject the null hypothesis of good fit at
any conventional level of significance. Bear in mind, however, that the χ2 statistic is sensitive
to sample size and to non-normal data, so we should be careful about basing all our conclusions
on this one fit statistic. Considering some of the other fit statistics discussed earlier, it becomes
clear that the data fit the hypothesized model rather well: the estimated RMSEA is 0.035 (with
a 90% confidence interval of 0.000–0.092), comfortably below the 0.08 cut-off for a good fit, and
even the upper confidence limit falls just below the 0.10 limit for acceptable fit; the GFI (0.923)
is above the cut-off of 0.90; and the adjusted GFI (0.854) is close to, though below, the 0.90
cut-off. The overall conclusion from the fit indices is therefore that the hypothesized model fits
the data well.

Error variances
Since variances must be positive by definition, we can check if the estimated error variances are
positive or not.

> summary(fitBollen, fit.measures=TRUE, rsquare=TRUE)

lavaan (0.5-11) converged normally after 70 iterations

Number of observations 75

Estimator ML
Minimum Function Test Statistic 38.125
Degrees of freedom 35
P-value (Chi-square) 0.329

Model test baseline model:

Minimum Function Test Statistic 730.654


Degrees of freedom 55
P-value 0.000

Full model versus baseline model:

Comparative Fit Index (CFI) 0.995


Tucker-Lewis Index (TLI) 0.993

Loglikelihood and Information Criteria:

Loglikelihood user model (H0) -1547.791


Loglikelihood unrestricted model (H1) -1528.728

Number of free parameters 31


Akaike (AIC) 3157.582
Bayesian (BIC) 3229.424
Sample-size adjusted Bayesian (BIC) 3131.720

Root Mean Square Error of Approximation:

RMSEA 0.035
90 Percent Confidence Interval 0.000 0.092
P-value RMSEA <= 0.05 0.611

Standardized Root Mean Square Residual:

SRMR 0.044

Parameter estimates:

Information Expected
Standard Errors Standard

Estimate  Std.err  Z-value  P(>|z|)


Latent variables:
ind60 =~
x1 1.000
x2 2.180 0.139 15.742 0.000
x3 1.819 0.152 11.967 0.000
dem60 =~
y1 1.000
y2 1.257 0.182 6.889 0.000
y3 1.058 0.151 6.987 0.000
y4 1.265 0.145 8.722 0.000
dem65 =~
y5 1.000
y6 1.186 0.169 7.024 0.000
y7 1.280 0.160 8.002 0.000
y8 1.266 0.158 8.007 0.000

Regressions:
dem60 ~
ind60 1.483 0.399 3.715 0.000
dem65 ~
ind60 0.572 0.221 2.586 0.010
dem60 0.837 0.098 8.514 0.000

Covariances:
y1 ~~
y5 0.624 0.358 1.741 0.082
y2 ~~
y4 1.313 0.702 1.871 0.061


y6 2.153 0.734 2.934 0.003
y3 ~~
y7 0.795 0.608 1.308 0.191
y4 ~~
y8 0.348 0.442 0.787 0.431
y6 ~~
y8 1.356 0.568 2.386 0.017

Variances:
x1 0.082 0.019
x2 0.120 0.070
x3 0.467 0.090
y1 1.891 0.444
y2 7.373 1.374
y3 5.067 0.952
y4 3.148 0.739
y5 2.351 0.480
y6 4.954 0.914
y7 3.431 0.713
y8 3.254 0.695
ind60 0.448 0.087
dem60 3.956 0.921
dem65 0.172 0.215

R-Square:

x1 0.846
x2 0.947
x3 0.761
y1 0.723
y2 0.514
y3 0.522
y4 0.715
y5 0.653
y6 0.557
y7 0.678
y8 0.685
dem60 0.200
dem65 0.961

The final part of this output shows the estimated variances: the error variances of the manifest
variables (the indicators x1–x3 and y1–y8), the residual variances of the endogenous latent
variables (dem60 and dem65), and the variance of the exogenous latent variable (ind60). According
to the unstandardized estimates, all of these variances are positive, as the definition requires.

Parameter estimates
Once we have established a reasonable fit of the model to the data, we can investigate a little
more deeply by viewing the parameters of each of the variables of interest. The unstandardized
parameter estimates are given after the fit indices section of the output. For the standardized
parameter estimates, the following code is run in R:

> standardizedSolution(fitBollen, type = "std.all")



lhs op rhs est.std se z pvalue


1 ind60 =~ x1 0.920 NA NA NA
2 ind60 =~ x2 0.973 NA NA NA
3 ind60 =~ x3 0.872 NA NA NA
4 dem60 =~ y1 0.850 NA NA NA
5 dem60 =~ y2 0.717 NA NA NA
6 dem60 =~ y3 0.722 NA NA NA
7 dem60 =~ y4 0.846 NA NA NA
8 dem65 =~ y5 0.808 NA NA NA
9 dem65 =~ y6 0.746 NA NA NA
10 dem65 =~ y7 0.824 NA NA NA
11 dem65 =~ y8 0.828 NA NA NA
12 dem60 ~ ind60 0.447 NA NA NA
13 dem65 ~ ind60 0.182 NA NA NA
14 dem65 ~ dem60 0.885 NA NA NA
15 y1 ~~ y5 0.296 NA NA NA
16 y2 ~~ y4 0.273 NA NA NA
17 y2 ~~ y6 0.356 NA NA NA
18 y3 ~~ y7 0.191 NA NA NA
19 y4 ~~ y8 0.109 NA NA NA
20 y6 ~~ y8 0.338 NA NA NA
21 x1 ~~ x1 0.154 NA NA NA
22 x2 ~~ x2 0.053 NA NA NA
23 x3 ~~ x3 0.239 NA NA NA
24 y1 ~~ y1 0.277 NA NA NA
25 y2 ~~ y2 0.486 NA NA NA
26 y3 ~~ y3 0.478 NA NA NA
27 y4 ~~ y4 0.285 NA NA NA
28 y5 ~~ y5 0.347 NA NA NA
29 y6 ~~ y6 0.443 NA NA NA
30 y7 ~~ y7 0.322 NA NA NA
31 y8 ~~ y8 0.315 NA NA NA
32 ind60 ~~ ind60 1.000 NA NA NA
33 dem60 ~~ dem60 0.800 NA NA NA
34 dem65 ~~ dem65 0.039 NA NA NA

Firstly, we can note that there are no offending estimates according to the standardized param-
eter estimates (none of them greater than 1). Beyond this, there are a few comments that are
worth making on interpreting the R output. The structural model and the measurement model
can be analysed separately.

Interpretation of structural model results


The unstandardized structural results show that democracy levels in 1960 have a positive effect
on democracy levels in 1965 (unstandardized coefficient 0.837), and that industrialization levels
in 1960 have a positive effect on both democracy levels in 1960 (coefficient 1.483) and 1965
(coefficient 0.572). All three coefficients are significant at the 1% level. The corresponding
standardized estimates are 0.885 (dem60 → dem65), 0.447 (ind60 → dem60) and 0.182
(ind60 → dem65). Because the standardized coefficients are all measured on a common scale,
their sizes can be compared directly, with values closer to one indicating stronger relationships.
We can therefore say that dem60 has a strong impact on dem65 (standardized coefficient 0.885),
whereas the effect of ind60 on dem65 is weak (standardized coefficient 0.182) and the effect of
ind60 on dem60 is moderate (standardized coefficient 0.447).

Interpretation of measurement model results


The results for the first measurement model show that the indicators load significantly on the
construct they were hypothesized to measure. For the industrialization construct 'ind60', the
loading of x1 is fixed at 1 for identification, and x2 and x3 load significantly (unstandardized
coefficients β13 = 2.180 and β14 = 1.819 respectively). The standardized loadings of x1–x3 are
all above the cut-off value of 0.70 (0.920, 0.973 and 0.872 respectively), indicating that these
indicators measure the industrialization construct strongly.

The second part of the measurement model belongs to the dem60 latent variable. The loading
of y1 is fixed at 1, and the freely estimated loadings of y2–y4 are all significant (p-values less
than 1%). The standardized loadings are well above the 0.70 cut-off (0.850, 0.717, 0.722 and
0.846), indicating that these indicators measure democracy levels in 1960 strongly. The same
picture emerges for the dem65 latent variable: the standardized loadings (0.808, 0.746, 0.824
and 0.828) are all well above 0.70, indicating that these indicators measure democracy levels in
1965 strongly.

10.6 Interpreting and modifying the model

The final stage returns to the non-statistical, theoretical foundations of the first three stages.
Once the model is deemed acceptable, the researcher should first examine the results for their
correspondence to the proposed theory. Questions to be asked include:

• Are the principal relationships in the theory supported and found to be statistically signif-
icant?
• Do competing models provide insight into alternative formulations of the theory that can be
supported?
• Are all the relationships in the hypothesised direction (positive or negative)?

10.6.1 Interpretation of results

The first and most important task involves interpreting the results in both empirical and prac-
tical terms. The measurement model gives the researcher a way of attaching meaning to the
constructs. The significance tests in the structural model are the basis for accepting or
rejecting the proposed relationships between exogenous and endogenous constructs.

10.6.2 Model respecification

Once model interpretation is complete, the researcher may want to look for ways to improve
the model's fit and/or its correspondence to the underlying theory. This is the process of model
respecification: adding or deleting estimated parameters from the original model. In line with
the earlier discussion, such modifications should be made with care and only with theoretical
justification. Changes based solely on the modification indices are contrary to the spirit of the
SEM techniques and should be avoided.
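
As a sketch, the modification indices for the worked example can be inspected in lavaan as follows; each index estimates how much the model χ2 would drop if the corresponding fixed parameter were freed. Any change suggested in this way still requires a theoretical justification before it is made.

> mi <- modindices(fitBollen)
> head(mi[order(mi$mi, decreasing = TRUE), ])   # largest potential improvements first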

The graphical plot of a SEM analysis can be obtained using the "qgraph" package and its
qgraph.lavaan() function in R as follows:

> library(qgraph)
> qgraph.lavaan(fitBollen,layout="tree",vsize.man=5,vsize.lat=10,
+ include=4,curve=-0.4,edge.label.cex=0.6,filename="lavaan1")

[qgraph output "Standardized model": the path diagram redrawn with standardized estimates – ind60 → dem60 = 0.45, ind60 → dem65 = 0.18, dem60 → dem65 = 0.89 – together with the standardized loadings and residual variances of the indicators x1–x3 and y1–y8.]

10.7 Further class examples

10.7.1 Confirmatory factor analysis of factors underlying performance on psychological tests
Application Nine psychological tests were administered to seventh and eighth grade students
in 1939. Researchers hypothesised that underlying performance on these 9 tests were three
basic skills: verbal skills, visual skills, and working speed.
Sample size 72
Type of analysis The aim of the analysis is to see whether the 9 tests align themselves with the
3 underlying factors in the hypothesised way (confirmatory factor analysis). There are no
structural relationships to examine, and therefore no structural model. Only a measurement
model is present in this example of SEM.

Variable Label Description Hypothesised underlying factor


X1 VIS_PERC Visual perception Visual
X2 CUBES Working with cube shapes Visual
X3 LOZENGES Working with lozenge shapes Visual
X4 PAR_COMP Paragraph comprehension Verbal
X5 SEN_COMP Sentence completion Verbal
X6 WRD_MNG Word meaning Verbal
X7 ADDITION Addition Speed
X8 CNT_DOT Counting dots Speed
X9 ST_CURVE Writing straight/curved letters Speed

Table 10.2: Description of variables in Class Example 2



X1 X2 X3 X4 X5 X6 X7 X8 X9
X1 47.801
X2 10.013 19.758
X3 25.798 15.417 69.172
X4 7.973 3.421 9.207 11.393
X5 9.936 3.296 11.092 11.277 21.616
X6 17.425 6.876 22.954 19.167 25.321 63.163
X7 17.132 7.015 14.763 16.766 28.069 33.768 565.593
X8 44.651 15.675 41.659 7.357 19.311 20.213 293.126 440.792
X9 124.657 40.803 114.763 39.309 61.230 79.993 368.436 410.823 1371.618

Table 10.3: Input covariance matrix - Lower

The lower triangle of the covariance matrix is entered in the same layout between single quotes
in R, and the full covariance matrix is then built from it using the variable labels (x1–x9 in this
example). Defining the covariance matrix directly in this way is useful when the raw data are not
available but the covariance matrix is reported, as is the case here. (The object names psycho.cov
and model.psycho used below are simply names we have chosen; any valid names will do.)

> lower <- '
+ 47.801
+ 10.013 19.758
+ 25.798 15.417 69.172
+ 7.973 3.421 9.207 11.393
+ 9.936 3.296 11.092 11.277 21.616
+ 17.425 6.876 22.954 19.167 25.321 63.163
+ 17.132 7.015 14.763 16.766 28.069 33.768 565.593
+ 44.651 15.675 41.659 7.357 19.311 20.213 293.126 440.792
+ 124.657 40.803 114.763 39.309 61.230 79.993 368.436 410.823 1371.618 '
> psycho.cov <-
+ getCov(lower, names = c("x1", "x2","x3","x4","x5","x6","x7","x8","x9"))

The measurement model and the factor covariances are defined in R between single quotes:

> model.psycho <- '
+ # latent variables
+ visual =~ x1+x2+x3
+ verbal =~ x4+x5+x6
+ speed =~ x7+x8+x9
+ # factor covariances
+ visual ~~ verbal
+ visual ~~ speed
+ verbal ~~ speed
+ '

After defining the model, it can be estimated from the covariance matrix using the following R
code:

> fitPsycho <- sem(model.psycho,
+                  sample.cov = psycho.cov,
+                  sample.nobs = 72)
>

After running the analysis, the fit measures can be obtained with the fitMeasures() function in
R:

> fitMeasures(fitPsycho, c("chisq", "df", "pvalue", "rmsea", "rmsea.ci.lower",
+ "rmsea.ci.upper", "cfi", "gfi", "agfi"))

         chisq             df         pvalue          rmsea rmsea.ci.lower
        25.593         24.000          0.374          0.030          0.000
rmsea.ci.upper            cfi            gfi           agfi
         0.102          0.993          0.930          0.869

Fit statistic Estimate


ML χ2 25.593 (24 DoF)
RMSEA 0.03
GFI 0.93
Adjusted GFI 0.869

Table 10.4: Summary of main fit indices for the psychological tests example

According to the model fit indices, the model fits the data well, so we can move on to the
parameter estimates. The first output below provides the unstandardized parameter estimates
with their p-values (together with standardized values in the final two columns), and the second
output provides the full standardized solution. The interpretation of the parameter estimates is
left to the student:

> summary(fitPsycho, standardized = TRUE)

lavaan (0.5-11) converged normally after 137 iterations

Number of observations 72

Estimator ML
Minimum Function Test Statistic 25.593
Degrees of freedom 24
P-value (Chi-square) 0.374

Parameter estimates:

Information Expected
Standard Errors Standard

Estimate  Std.err  Z-value  P(>|z|)   Std.lv  Std.all


Latent variables:
visual =~
x1 1.000 4.646 0.677
x2 0.491 0.146 3.354 0.001 2.280 0.517
x3 1.233 0.313 3.935 0.000 5.728 0.694
verbal =~
x4 1.000 2.901 0.866
x5 1.320 0.164 8.046 0.000 3.829 0.829
x6 2.248 0.280 8.017 0.000 6.522 0.826
speed =~
x7 1.000 15.566 0.659
x8 1.066 0.230 4.629 0.000 16.593 0.796
x9 1.656 0.366 4.530 0.000 25.775 0.701

Covariances:
visual ~~
verbal 7.287 2.490 2.927 0.003 0.541 0.541
speed 37.846 14.758 2.564 0.010 0.523 0.523
verbal ~~
speed 15.180 7.128 2.130 0.033 0.336 0.336

Variances:
x1 25.555 6.390 25.555 0.542
x2 14.285 2.762 14.285 0.733
x3 35.397 9.287 35.397 0.519
x4 2.817 0.821 2.817 0.251
x5 6.656 1.625 6.656 0.312
x6 19.754 4.768 19.754 0.317
x7 315.424 67.991 315.424 0.566
x8 159.341 53.226 159.341 0.367
x9 688.213 162.501 688.213 0.509
visual 21.582 8.141 1.000 1.000
verbal 8.417 1.934 1.000 1.000
speed 242.313 87.957 1.000 1.000

> standardizedSolution(fitPsycho, type = "std.all")

lhs op rhs est.std se z pvalue


1 visual =~ x1 0.677 NA NA NA
2 visual =~ x2 0.517 NA NA NA
3 visual =~ x3 0.694 NA NA NA
4 verbal =~ x4 0.866 NA NA NA
5 verbal =~ x5 0.829 NA NA NA
6 verbal =~ x6 0.826 NA NA NA
7 speed =~ x7 0.659 NA NA NA
8 speed =~ x8 0.796 NA NA NA
9 speed =~ x9 0.701 NA NA NA
10 visual ~~ verbal 0.541 NA NA NA
11 visual ~~ speed 0.523 NA NA NA
12 verbal ~~ speed 0.336 NA NA NA
13 x1 ~~ x1 0.542 NA NA NA
14 x2 ~~ x2 0.733 NA NA NA
15 x3 ~~ x3 0.519 NA NA NA
16 x4 ~~ x4 0.251 NA NA NA
17 x5 ~~ x5 0.312 NA NA NA
18 x6 ~~ x6 0.317 NA NA NA
19 x7 ~~ x7 0.566 NA NA NA
20 x8 ~~ x8 0.367 NA NA NA
21 x9 ~~ x9 0.509 NA NA NA
22 visual ~~ visual 1.000 NA NA NA
23 verbal ~~ verbal 1.000 NA NA NA
24 speed ~~ speed 1.000 NA NA NA

The graphical plot of this SEM analysis can again be obtained using the "qgraph" package and
its qgraph.lavaan() function in R as follows:

> library(qgraph)
> qgraph.lavaan(fitPsycho,layout="tree",vsize.man=5,vsize.lat=10,
+ include=4,curve=-0.4,edge.label.cex=0.6,filename="fitPsychoGraph")

[qgraph output "Standardized model": the three-factor CFA redrawn with standardized estimates – factor correlations visual–verbal = 0.54, visual–speed = 0.52, verbal–speed = 0.34 – together with the standardized loadings and residual variances of x1–x9.]
