Chapter Four
Regression on Dummy Variables
HABTAMU LEGESE
1.1 The nature of dummy variables
In regression analysis the dependent variable is
frequently influenced not only by variables that can be
readily quantified on some well-defined scale
(e.g., income, output, prices, costs), but also by variables
that are essentially qualitative in nature
(e.g., sex, race, colour, religion, nationality, wars,
earthquakes, strikes, political upheavals, and
changes in government economic policy).
For example, holding all other factors constant, female
daily wage workers are found to earn less than their
male counterparts, and nonwhites are found to earn
less than whites.
This pattern may result from sex or racial discrimination,
but whatever the reason, qualitative variables such as sex
and race do influence the dependent variable and clearly
should be included among the explanatory variables.
Qualitative variables usually indicate the presence or
absence of a “quality” or an attribute, such as male or
female, black or white, or Christian or Muslim.
One method of “quantifying” such attributes is by
constructing artificial variables that take on values of
1 or 0, 0 indicating the absence of an attribute and 1
indicating the presence (or possession) of that attribute.
For example, 1 may indicate that a person is a male, and 0
may designate a female; or 1 may indicate that a person is a
college graduate, and 0 that the person is not, and so on.
Variables that assume such 0 and 1 values are called dummy
variables.
Alternative names are indicator variables, binary variables,
categorical variables, and dichotomous variables.
Dummy variables can be used in regression models just as
easily as quantitative variables. As a matter of fact, a
regression model may contain explanatory variables that are
exclusively dummy, or qualitative, in nature.
Example: $Y_i = \alpha + \beta D_i + u_i$    (1.01)
where $Y_i$ = annual salary of a college professor
$D_i$ = 1 if male college professor
= 0 otherwise (i.e., female professor)
Model (1.01) may enable us to find out whether sex makes any
difference in a college professor’s salary, assuming, of course,
that all other variables such as age, degree attained, and years
of experience are held constant.
Assuming that the disturbance term satisfies the usual
assumptions of the classical linear regression model, we obtain
from (1.01):
Mean salary of female college professor: $E(Y_i \mid D_i = 0) = \alpha$    (1.02)
Mean salary of male college professor: $E(Y_i \mid D_i = 1) = \alpha + \beta$
Assumptions of the Classical Linear Regression Model:
1. The regression model is linear, correctly specified, and has an additive error
term.
2. The error term has a zero population mean.
3. All explanatory variables are uncorrelated with the error term
4. Observations of the error term are uncorrelated with each other (no serial
correlation).
5. The error term has a constant variance (no heteroskedasticity).
6. No perfect multicollinearity
7. The error term is normally distributed (not required).
That is, the intercept term $\alpha$ gives the mean salary of female college professors and the slope
coefficient $\beta$ tells by how much the mean salary of a male college professor differs from the
mean salary of his female counterpart, $\alpha + \beta$ reflecting the mean salary of the male college
professor.
A test of the null hypothesis that there is no sex discrimination ($H_0: \beta = 0$) can be easily made
by running regression (1.01) in the usual manner and finding out whether, on the basis of the t
test, the estimated $\beta$ is statistically significant.
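As a minimal sketch of this interpretation, the Python snippet below fits a regression of salary on a constant and a single sex dummy and compares the estimates with the two group means. The salary figures are hypothetical, and the names `salary`, `male`, `alpha_hat` and `beta_hat` are illustrative choices, not part of the original slides.

```python
import numpy as np

# Hypothetical salaries (in thousands); male = 1 for male, 0 for female.
salary = np.array([22.0, 19.0, 18.0, 21.7, 18.5, 21.0, 20.5, 17.0, 17.5, 21.2])
male = np.array([1, 0, 0, 1, 0, 1, 1, 0, 0, 1])

# Design matrix with a constant and the dummy: Y = alpha + beta * D + u.
X = np.column_stack([np.ones_like(salary), male])
(alpha_hat, beta_hat), *_ = np.linalg.lstsq(X, salary, rcond=None)

mean_female = salary[male == 0].mean()
mean_male = salary[male == 1].mean()

print(f"alpha_hat            = {alpha_hat:.3f}  (female mean = {mean_female:.3f})")
print(f"alpha_hat + beta_hat = {alpha_hat + beta_hat:.3f}  (male mean   = {mean_male:.3f})")
```

With only a constant and one dummy on the right-hand side, OLS reproduces the two sample means: the intercept equals the mean of the group coded 0, and the dummy coefficient equals the difference between the group means.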
A. Dummy Independent Variable Models
1.2 Regression on one quantitative variable and one qualitative
variable with two classes, or categories
Consider the model: $Y_i = \alpha_1 + \alpha_2 D_i + \beta X_i + u_i$    (1.03)
Where: $Y_i$ = annual salary of a college professor
$X_i$ = years of teaching experience
$D_i$ = 1 if male
= 0 otherwise
Model (1.03) contains one quantitative variable (years of
teaching experience) and one qualitative variable (sex)
that has two classes (or levels, classifications, or
categories), namely, male and female.
What is the meaning of this equation? Assuming, as usual, that $E(u_i) = 0$, we see that
Mean salary of female college professor: $E(Y_i \mid X_i, D_i = 0) = \alpha_1 + \beta X_i$    (1.04)
Mean salary of male college professor: $E(Y_i \mid X_i, D_i = 1) = (\alpha_1 + \alpha_2) + \beta X_i$    (1.05)
Geometrically, we have the situation shown in fig. 1.1 (for
illustration, it is assumed that $\alpha_1 > 0$). In words, model (1.03)
postulates that the male and female college professors’ salary
functions in relation to the years of teaching experience have
the same slope ($\beta$) but different intercepts.
In other words, it is assumed that the level of the male
professor’s mean salary is different from that of the female
professor’s mean salary (by $\alpha_2$), but the rate of change in the
mean annual salary by years of experience is the same for both
sexes.
If the assumption of common slopes is valid, a test of the
hypothesis that the two regressions (1.04) and (1.05) have the
same intercept (i.e., there is no sex discrimination) can be
made easily by running the regression (1.03) and noting the
statistical significance of the estimated $\alpha_2$ on the basis of the
traditional t test.
If the t test shows that $\hat{\alpha}_2$ is statistically significant, we reject
the null hypothesis that the male and female college professors’
levels of mean annual salary are the same.
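The test can be sketched in Python as follows. The data are simulated, the "true" values (30, 4, 1.2) are assumptions chosen only for illustration, and the model is written as salary = alpha_1 + alpha_2 * D + beta * X + u, so the second coefficient is the differential intercept.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
experience = rng.uniform(0, 30, size=n)   # X_i: years of teaching experience
male = rng.integers(0, 2, size=n)         # D_i: 1 if male, 0 otherwise

# Simulated salaries with assumed values alpha_1 = 30, alpha_2 = 4, beta = 1.2.
salary = 30 + 4 * male + 1.2 * experience + rng.normal(0, 3, size=n)

X = np.column_stack([np.ones(n), male, experience])
coef = np.linalg.solve(X.T @ X, X.T @ salary)      # OLS estimates
resid = salary - X @ coef
sigma2 = resid @ resid / (n - X.shape[1])          # estimated error variance
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
t_stats = coef / se

print(f"estimated differential intercept: {coef[1]:.3f}")
print(f"t statistic for H0: alpha_2 = 0 : {t_stats[1]:.2f}")
```

A t statistic well beyond the usual critical value (roughly 2 in absolute value for a large sample at the 5% level) would lead us to reject the hypothesis of equal intercepts.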
Before proceeding further, note the following features of the
dummy variable regression model considered previously:
1. To distinguish the two categories, male and female, we have
introduced only one dummy variable $D_i$. For if $D_i = 1$ always
denotes a male, when $D_i = 0$ we know that it is a female since
there are only two possible outcomes.
Hence, one dummy variable suffices to distinguish two
categories. The general rule is this: If a qualitative variable
has ‘m’ categories, introduce only ‘m-1’ dummy variables.
In our example, sex has two categories, and hence we
introduced only a single dummy variable. If this rule is not
followed, we shall fall into what might be called the dummy
variable trap, that is, the situation of perfect
multicollinearity.
2. The assignment of 1 and 0 values to two categories, such as
male and female, is arbitrary in the sense that in our example we
could have assigned D = 1 for female and D = 0 for male.
3. The group, category, or classification that is assigned the
value of 0 is often referred to as the base, benchmark, control,
comparison, reference, or omitted category. It is the base in
the sense that comparisons are made with that category.
4. The coefficient attached to the dummy variable D can be
called the differential intercept coefficient because it tells by
how much the value of the intercept term of the category that
receives the value of 1 differs from the intercept coefficient of
the base category.
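The dummy variable trap mentioned in point 1 can be checked numerically. The sketch below uses a small hypothetical sample and NumPy's `matrix_rank`; the variable names are illustrative.

```python
import numpy as np

male = np.array([1, 0, 0, 1, 0, 1, 1, 0])
female = 1 - male
const = np.ones(male.size)

# Two dummies for two categories plus an intercept: the trap.
X_trap = np.column_stack([const, male, female])
# One dummy (m - 1 = 1) plus an intercept: the rule stated above.
X_ok = np.column_stack([const, male])

rank_trap = np.linalg.matrix_rank(X_trap)
rank_ok = np.linalg.matrix_rank(X_ok)

print(f"trap design:    {X_trap.shape[1]} columns, rank {rank_trap}")
print(f"correct design: {X_ok.shape[1]} columns, rank {rank_ok}")
```

Because the male and female columns always sum to the constant column, the three-column design has fewer independent columns than regressors, so the OLS coefficients are not uniquely determined.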
What is a dummy variable?
In statistics and econometrics, particularly in regression analysis,
a dummy variable is one that takes only the value 0 or 1 to
indicate the absence or presence of some categorical effect that
may be expected to shift the outcome.
What is the purpose of dummy variables?
Dummy variables are useful because they enable us to use a
single regression equation to represent multiple groups.
This means that we don't need to write out separate equation
models for each subgroup.
The dummy variables act like 'switches' that turn various
parameters on and off in an equation.
How do you determine the number of dummy
variables?
The first step in this process is to decide the number of dummy
variables.
This is easy; it's simply k-1, where k is the number of levels of
the original variable.
You could also create dummy variables for all levels in the
original variable, and simply drop one from each analysis.
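A minimal sketch of the k-1 rule in Python is given below. The helper `make_dummies` is hypothetical (written for this illustration, not a library function), and the education labels are example data.

```python
import numpy as np

def make_dummies(values, base):
    """Return (names, matrix): one 0/1 column per level except `base`."""
    levels = sorted(set(values))
    if base not in levels:
        raise ValueError(f"base category {base!r} not found in the data")
    kept = [level for level in levels if level != base]
    matrix = np.array([[1 if v == level else 0 for level in kept] for v in values])
    return kept, matrix

education = ["high school", "college", "less than high school",
             "college", "high school"]
names, D = make_dummies(education, base="less than high school")

print(names)   # the k - 1 levels that receive a dummy column
print(D)       # rows of zeros identify the base category
```

With k = 3 levels the function returns k - 1 = 2 columns; an observation in the base category is identified by zeros in every column.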
Is 0 male or female?
In the case of gender, there is typically no natural reason to code
the variable female = 0, male = 1, versus male = 0, female = 1.
However, convention may suggest one coding is more familiar
to a reader; or choosing a coding that makes the regression
coefficient positive may ease interpretation.
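The effect of swapping the coding can be seen in a short Python sketch. The salary figures are hypothetical and `fit` is an illustrative helper.

```python
import numpy as np

salary = np.array([22.0, 19.0, 18.0, 21.7, 18.5, 21.0])
male = np.array([1, 0, 0, 1, 0, 1])
female = 1 - male

def fit(dummy):
    """OLS of salary on a constant and one dummy; returns [intercept, slope]."""
    X = np.column_stack([np.ones(dummy.size), dummy])
    coef, *_ = np.linalg.lstsq(X, salary, rcond=None)
    return coef

a_m, b_m = fit(male)      # base category: female
a_f, b_f = fit(female)    # base category: male

print(f"male = 1:   intercept {a_m:.3f}, dummy coefficient {b_m:.3f}")
print(f"female = 1: intercept {a_f:.3f}, dummy coefficient {b_f:.3f}")
```

The two codings describe the same fit: the dummy coefficient only changes sign, and the intercept moves to the mean of whichever group is coded 0.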
Can dummy variables be 1 and 2?
Technically, dummy variables are dichotomous,
quantitative variables.
Their range of values is small; they can take on only two
quantitative values.
As a practical matter, regression results are easiest to interpret
when dummy variables are limited to two specific
values, 1 or 0.
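To see why 0/1 coding is easier to read, the following Python sketch fits the same hypothetical data under a 0/1 coding and under a 1/2 coding (1 = female, 2 = male, an assumption made for this example).

```python
import numpy as np

salary = np.array([22.0, 19.0, 18.0, 21.7, 18.5, 21.0])
d01 = np.array([1, 0, 0, 1, 0, 1])   # 0 = female, 1 = male
d12 = d01 + 1                        # 1 = female, 2 = male

def fit(dummy):
    """OLS of salary on a constant and one coded variable."""
    X = np.column_stack([np.ones(dummy.size), dummy])
    coef, *_ = np.linalg.lstsq(X, salary, rcond=None)
    return coef

a01, b01 = fit(d01)
a12, b12 = fit(d12)

print(f"0/1 coding: intercept {a01:.3f}, slope {b01:.3f}")
print(f"1/2 coding: intercept {a12:.3f}, slope {b12:.3f}")
```

The slope (the difference between the group means) is the same under both codings, but under 1/2 coding the intercept equals the base-group mean minus the slope, which is not the mean of any group.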
Why do we drop one dummy variable?
1.3 Regression on one quantitative variable and
one qualitative variable with more than two classes
Suppose that, on the basis of the cross-sectional data, we
want to regress the annual expenditure on health care by
an individual on the income and education of the
individual.
Since the variable education is qualitative in nature,
suppose we consider three mutually exclusive levels of
education: less than high school, high school, and
college.
Now, unlike the previous case, we have more than two
categories of the qualitative variable education.
Therefore, following the rule that the number of dummies be
one less than the number of categories of the variable, we should
introduce two dummies to take care of the three levels of
education.
Assuming that the three educational groups have a
common slope but different intercepts in the regression
of annual expenditure on health care on annual income,
we can use the following model:
$Y_i = \alpha_1 + \alpha_2 D_{2i} + \alpha_3 D_{3i} + \beta X_i + u_i$    (1.06)
Where $Y_i$ = annual expenditure on health care
$X_i$ = annual income
$D_{2i}$ = 1 if high school education
= 0 otherwise
$D_{3i}$ = 1 if college education
= 0 otherwise
Note that in the preceding assignment of the dummy variables
we are arbitrarily treating the “less than high school
education” category as the base category. Therefore, the
intercept $\alpha_1$ will reflect the intercept for this category.
The differential intercepts $\alpha_2$ and $\alpha_3$ tell by how much the
intercepts of the other two categories differ from the intercept
of the base category, which can be readily checked as follows:
Assuming $E(u_i) = 0$, we obtain from (1.06)
$E(Y_i \mid D_{2i} = 0, D_{3i} = 0, X_i) = \alpha_1 + \beta X_i$
$E(Y_i \mid D_{2i} = 1, D_{3i} = 0, X_i) = (\alpha_1 + \alpha_2) + \beta X_i$
$E(Y_i \mid D_{2i} = 0, D_{3i} = 1, X_i) = (\alpha_1 + \alpha_3) + \beta X_i$
which are, respectively, the mean health care expenditure
functions for the three levels of education, namely, less
than high school, high school, and college.
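The three parallel expenditure functions can be illustrated with simulated data. In the Python sketch below the "true" values (2.0, 1.5, 3.0, 0.05) and all variable names are assumptions made for the example; the model is health = a1 + a2 * D2 + a3 * D3 + b * income + u with "less than high school" as the base category.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
income = rng.uniform(10, 100, size=n)   # X_i: annual income (hypothetical units)
level = rng.integers(0, 3, size=n)      # 0 = less than HS, 1 = high school, 2 = college
d2 = (level == 1).astype(float)         # high school dummy
d3 = (level == 2).astype(float)         # college dummy

a1, a2, a3, b = 2.0, 1.5, 3.0, 0.05     # assumed values used to simulate the data
health = a1 + a2 * d2 + a3 * d3 + b * income + rng.normal(0, 1, size=n)

X = np.column_stack([np.ones(n), d2, d3, income])
coef, *_ = np.linalg.lstsq(X, health, rcond=None)

# Intercept of each group's expenditure function; the slope is common to all.
intercepts = {
    "less than high school": coef[0],
    "high school": coef[0] + coef[1],
    "college": coef[0] + coef[2],
}
for group, value in intercepts.items():
    print(f"{group:>22}: intercept {value:.3f}, slope {coef[3]:.4f}")
```

Each group's intercept is obtained by adding its differential intercept to the base intercept, while the income slope is shared by all three groups.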
Geometrically, the situation is shown in fig 1.2 (for illustrative
purposes it is assumed that $\alpha_3 > \alpha_2$).
1.4 Regression on one quantitative variable and two
qualitative variables
❖ The technique of dummy variables can be easily extended to
handle more than one qualitative variable.
❖ Let us revert to the college professors’ salary regression (1.03),
but now assume that in addition to years of teaching
experience and sex, the skin colour of the teacher is also an
important determinant of salary.
❖ For simplicity, assume that colour has two categories: black
and white.
We can now write (1.03) as:
$Y_i = \alpha_1 + \alpha_2 D_{2i} + \alpha_3 D_{3i} + \beta X_i + u_i$    (1.07)
Where $Y_i$ = annual salary
$X_i$ = years of teaching experience
$D_{2i}$ = 1 if male
= 0 otherwise
$D_{3i}$ = 1 if white
= 0 otherwise
Notice that each of the two qualitative variables, sex and color,
has two categories and hence needs one dummy variable for
each. Note also that the omitted, or base, category now is
“black female professor”.
Assuming $E(u_i) = 0$, we can obtain the following regressions from (1.07):
Mean salary for black female professor:
$E(Y_i \mid D_{2i} = 0, D_{3i} = 0, X_i) = \alpha_1 + \beta X_i$
Mean salary for black male professor:
$E(Y_i \mid D_{2i} = 1, D_{3i} = 0, X_i) = (\alpha_1 + \alpha_2) + \beta X_i$
Mean salary for white female professor:
$E(Y_i \mid D_{2i} = 0, D_{3i} = 1, X_i) = (\alpha_1 + \alpha_3) + \beta X_i$
Mean salary for white male professor:
$E(Y_i \mid D_{2i} = 1, D_{3i} = 1, X_i) = (\alpha_1 + \alpha_2 + \alpha_3) + \beta X_i$
Once again, it is assumed that the preceding regressions differ
only in the intercept coefficient but not in the slope
coefficient.
An OLS estimation of (1.07) will enable us to test a variety
of hypotheses. Thus, if $\alpha_3$ is statistically significant, it will
mean that colour does affect a professor’s salary.
Similarly, if $\alpha_2$ is statistically significant, it will mean that
sex also affects a professor’s salary. If both these differential
intercepts are statistically significant, it would mean sex as
well as colour is an important determinant of professors’
salaries.
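As an illustration, the Python sketch below simulates data for a model with a sex dummy, a colour dummy and years of experience, estimates it by OLS, and tabulates the implied intercept of each of the four groups. The coding (first dummy = 1 if male, second dummy = 1 if white, so that the base category is black female) and the simulation values (25, 3, 2, 1.0) are assumptions of this example.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
n = 400
experience = rng.uniform(0, 30, size=n)
d2 = rng.integers(0, 2, size=n).astype(float)   # 1 if male, 0 if female
d3 = rng.integers(0, 2, size=n).astype(float)   # 1 if white, 0 if black

# Simulated salaries under assumed coefficients 25, 3, 2 and slope 1.0.
salary = 25 + 3 * d2 + 2 * d3 + 1.0 * experience + rng.normal(0, 2, size=n)

X = np.column_stack([np.ones(n), d2, d3, experience])
coef, *_ = np.linalg.lstsq(X, salary, rcond=None)
a1, a2, a3, beta = coef

# Intercept of each group's salary function: a1 + a2 * male + a3 * white.
group_intercept = {}
for male, white in itertools.product([0, 1], repeat=2):
    label = f"{'white' if white else 'black'} {'male' if male else 'female'}"
    group_intercept[label] = a1 + a2 * male + a3 * white

for label, value in group_intercept.items():
    print(f"{label:>13}: intercept {value:.3f}, common slope {beta:.3f}")
```

The four salary functions differ only in their intercepts, which are built additively from the base intercept and the two differential intercepts.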
From the preceding discussion it follows that we can extend
our model to include more than one quantitative variable and
more than two qualitative variables.
The only precaution to be taken is that the number of dummies
for each qualitative variable should be one less than the
number of categories of that variable.
1.5 Interaction effects
Consider the following model:
$Y_i = \alpha_1 + \alpha_2 D_{2i} + \alpha_3 D_{3i} + \beta X_i + u_i$    (1.08)
where $Y_i$ = annual expenditure on clothing
$X_i$ = income
$D_{2i}$ = 1 if female
= 0 if male
$D_{3i}$ = 1 if college graduate
= 0 otherwise
The implicit assumption in this model is that the differential
effect of the sex dummy is constant across the two levels of
education and the differential effect of the education dummy
is also constant across the two sexes.
That is, if, say, the mean expenditure on clothing is higher for
females than for males, this is so whether they are college
graduates or not. Likewise, if, say, college graduates on the
average spend more on clothing than non-college graduates,
this is so whether they are females or males.
In many applications, such an assumption may be untenable. A
female college graduate may spend more on clothing than a
male graduate.
In other words, there may be interaction between the two
qualitative variables and therefore their effect on mean Y
may not be simply additive as in (1.08) but multiplicative as
well, as in the following model:
$Y_i = \alpha_1 + \alpha_2 D_{2i} + \alpha_3 D_{3i} + \alpha_4 (D_{2i} D_{3i}) + \beta X_i + u_i$    (1.09)
From (1.09) we obtain
$E(Y_i \mid D_{2i} = 1, D_{3i} = 1, X_i) = (\alpha_1 + \alpha_2 + \alpha_3 + \alpha_4) + \beta X_i$    (1.10)
which is the mean clothing expenditure of graduate females.
Notice that
$\alpha_2$ = differential effect of being a female
$\alpha_3$ = differential effect of being a college graduate
$\alpha_4$ = differential effect of being a female graduate
If $\alpha_2$, $\alpha_3$, and $\alpha_4$ are all positive, the average clothing
expenditure of females is higher than the base category (which
here is male non-graduate), but it is much more so if the females
also happen to be graduates.
This shows how the interaction dummy modifies the effect of the
two attributes considered individually.
Whether the coefficient of the interaction dummy is statistically
significant can be tested by the usual t test. Omitting a significant
interaction term will lead to a specification bias.
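The role of the interaction dummy, and the bias from omitting it, can be sketched with simulated data. In the Python example below the data-generating values (1.0, 0.5, 0.8, 1.0, 0.1) are assumptions chosen for illustration; the fourth column of the second design matrix is the product of the female and graduate dummies.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 800
income = rng.uniform(10, 100, size=n)
d2 = rng.integers(0, 2, size=n).astype(float)   # 1 if female
d3 = rng.integers(0, 2, size=n).astype(float)   # 1 if college graduate

# Simulated clothing expenditure with a true interaction effect of 1.0.
clothing = (1.0 + 0.5 * d2 + 0.8 * d3 + 1.0 * d2 * d3
            + 0.1 * income + rng.normal(0, 0.5, size=n))

X_add = np.column_stack([np.ones(n), d2, d3, income])            # additive model
X_int = np.column_stack([np.ones(n), d2, d3, d2 * d3, income])   # with interaction
coef_add, *_ = np.linalg.lstsq(X_add, clothing, rcond=None)
coef_int, *_ = np.linalg.lstsq(X_int, clothing, rcond=None)

female_effect_nongrad = coef_int[1]             # female vs male, non-graduates
female_effect_grad = coef_int[1] + coef_int[3]  # female vs male, graduates

print(f"female effect, non-graduates: {female_effect_nongrad:.3f}")
print(f"female effect, graduates    : {female_effect_grad:.3f}")
print(f"female coefficient when the interaction is omitted: {coef_add[1]:.3f}")
```

In the interaction model the female differential depends on education. When the interaction is left out, the single female coefficient mixes the two effects and no longer estimates the differential for either group.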
Some Important Uses of Dummy Variables
Use of Dummy Variables in Seasonal Analysis
B. Dummy Dependent Variable Models
The dependent variable can also take the form of a dummy
variable, that is, a variable that takes only the values 1 and 0.
If it takes the value of 1, it can be interpreted as a success.
Examples might include home ownership or mortgage
approvals, where the dummy variable takes the value of 1 if
someone owns a home and 0 if they do not.
1. Linear Probability Model
2. Logit and probit model
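The slides only name these models. As a minimal sketch of the first one, assuming the linear probability model means applying OLS to a 0/1 dependent variable, the Python snippet below regresses a hypothetical home-ownership dummy on income. The data are invented for illustration.

```python
import numpy as np

income = np.arange(1, 21, dtype=float)     # hypothetical income levels 1..20
owns_home = (income > 10).astype(float)    # Y_i = 1 if the household owns a home

X = np.column_stack([np.ones(income.size), income])
(intercept, slope), *_ = np.linalg.lstsq(X, owns_home, rcond=None)

# Fitted values are read as estimated probabilities P(Y = 1 | income).
fitted = intercept + slope * income

print(f"intercept {intercept:.4f}, slope {slope:.4f}")
print(f"smallest fitted value {fitted.min():.3f}, largest {fitted.max():.3f}")
```

In this example some fitted "probabilities" fall below 0 or above 1, a well-known limitation of the linear probability model and one reason logit and probit models are used instead.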