STA03A3: Linear Models
GLM in R: Applications → Nominal and Ordinal Logistic Regression
Academic Year - 2024
Semester - I
Lecturer - Prof YA Shiferaw
Department of Statistics
Faculty of Science, University of Johannesburg.
APK Campus, PO Box 524, Auckland Park 2006
09 May 2024
STA03A3: Linear Models Nominal and Ordinal Logistic Regression 09 May 2024 Shiferaw
1 Nominal Logistic Regression
1.1 Example: Car preferences
In a study of motor vehicle safety, men and women driving small, medium and large cars were
interviewed about vehicle safety and their preferences for cars, and various measurements were
made of how close they sat to the steering wheel (McFadden et al. 2000). There were 50 subjects
in each of the six categories (two sexes and three car sizes). They were asked to rate how important
various features were to them when they were buying a car. Table 1 shows the ratings for air
conditioning and power steering, according to the sex and age of the subject (the categories “not
important” and “of little importance” have been combined).
See Table 8.1 in the textbook.
For these data the response, importance of air conditioning and power steering, is rated on an
ordinal scale but for the purpose of this example the order is ignored and the 3-point scale is
treated as nominal.
The category “no or little” importance is chosen as the reference category.
Age is also ordinal, but initially we will regard it as nominal.
Table 2 shows the results of fitting the nominal logistic regression model with reference categories
of “Women” and “18-23 years,” and
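\[
\log\left(\frac{\pi_j}{\pi_1}\right) = \beta_{0j} + \beta_{1j} x_1 + \beta_{2j} x_2 + \beta_{3j} x_3, \qquad j = 2, 3, \tag{1}
\]
where
\[
x_1 = \begin{cases} 1 & \text{for men} \\ 0 & \text{for women,} \end{cases}
\qquad
x_2 = \begin{cases} 1 & \text{for age 24-40 years} \\ 0 & \text{otherwise,} \end{cases}
\qquad
x_3 = \begin{cases} 1 & \text{for age} > 40 \text{ years} \\ 0 & \text{otherwise.} \end{cases}
\]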
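As a minimal sketch of how such indicator variables arise in R, the block below builds the design matrix for a small hypothetical data frame `toy` (it is not the CARS data). It assumes age is coded 1, 2, 3 and sex is coded 1 = women, 2 = men, and it relies on R's default treatment contrasts, under which the first level of each factor is the reference category.

```r
# Hypothetical toy data: one row per sex-by-age combination (not the CARS data).
toy <- expand.grid(age = 1:3, sex = 1:2)

# With the default treatment contrasts, level 1 of each factor is the reference,
# so the remaining columns play the roles of x2, x3 (age) and x1 (sex).
X <- model.matrix(~ factor(age) + factor(sex), data = toy)
print(X)
```

The column names produced here, `factor(age)2`, `factor(age)3` and `factor(sex)2`, are the same labels that appear in the fitted-model summaries.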
  Response Age  Sex1 Frequency Sex
1        1   1 Women        26   1
2        2   1 Women        12   1
3        3   1 Women         7   1
4        1   2 Women         9   1
5        2   2 Women        21   1
6        3   2 Women        15   1
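> library(nnet)  # provides multinom()
> response<-CARS$Response
> age<-CARS$Age
> sex<-CARS$Sex
> frequency<-CARS$Frequency
> # Nominal (baseline-category) logit model; response level 1 is the reference
> res.cars=multinom(response~factor(age)+factor(sex),weights=frequency,data=CARS)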
# weights: 15 (8 variable)
initial value 329.583687
iter 10 value 290.490920
final value 290.351098
converged
> #res.cars
> summary(res.cars)
Call:
multinom(formula = response ~ factor(age) + factor(sex), data = CARS,
weights = frequency)
Coefficients:
  (Intercept) factor(age)2 factor(age)3 factor(sex)2
2  -0.5907992     1.128268     1.587709   -0.3881301
3  -1.0390726     1.478104     2.916757   -0.8130202
Std. Errors:
  (Intercept) factor(age)2 factor(age)3 factor(sex)2
2   0.2839756    0.3416449    0.4028997    0.3005115
3   0.3305014    0.4009256    0.4229276    0.3210382
Residual Deviance: 580.7022
AIC: 596.7022
Results and Interpretation
Coefficients:
– Each row of the coefficient table is one logit equation of model (1): row 2 compares
response category 2 with the reference category 1 (“no or little” importance), and row 3
compares category 3 with category 1.
– (Intercept): This is the log odds of category j versus “no or little” importance for the
reference group (women aged 18-23 years). Both intercepts are negative (–0.591 and
–1.039), so in this group each of the two higher importance categories is less likely than
“no or little” importance.
– factor(age)2: This coefficient is the effect of the second age group (24-40 years) on the
log odds of category j versus category 1. It is positive in both equations (1.128 and
1.478), indicating that, compared to the reference age group (18-23 years) and for a given
sex, individuals aged 24-40 years are more likely to prioritize air conditioning and power
steering.
– factor(age)3: Similarly, this coefficient is the effect of the third age group (over 40
years). It is also positive in both equations (1.588 and 2.917) and larger than the
corresponding coefficient for 24-40 years, indicating that individuals over 40 years old
are the most likely to prioritize air conditioning and power steering.
– factor(sex)2: This coefficient is the effect of being male on the log odds of category j
versus category 1. It is negative in both equations (–0.388 and –0.813), suggesting that,
compared to females (the reference category) of the same age group, males are less
likely to prioritize air conditioning and power steering.
Std. Errors:
– These values represent the standard errors associated with each coefficient estimate. They
indicate the uncertainty or variability in the estimates. Lower standard errors suggest
more precise estimates.
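One common use of the standard errors is a Wald-type check of each coefficient. The sketch below copies the estimates and standard errors from the `summary(res.cars)` output into two matrices (the object names `est`, `se` and `term_names` are illustrative) and computes z statistics, two-sided p-values under a normal approximation, and odds ratios relative to category 1. It is a rough screening device, assuming the normal approximation is adequate, not a replacement for the likelihood ratio test.

```r
# Estimates and standard errors copied from the summary(res.cars) output.
term_names <- c("(Intercept)", "factor(age)2", "factor(age)3", "factor(sex)2")
est <- rbind("2" = c(-0.5907992, 1.128268, 1.587709, -0.3881301),
             "3" = c(-1.0390726, 1.478104, 2.916757, -0.8130202))
se  <- rbind("2" = c(0.2839756, 0.3416449, 0.4028997, 0.3005115),
             "3" = c(0.3305014, 0.4009256, 0.4229276, 0.3210382))
colnames(est) <- colnames(se) <- term_names

z <- est / se                # Wald z statistics
p <- 2 * pnorm(-abs(z))      # two-sided p-values (normal approximation)
odds_ratio <- exp(est)       # odds ratios versus category 1

print(round(z, 2))
print(round(p, 4))
print(round(odds_ratio, 2))
```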
Residual Deviance and AIC:
– Residual Deviance: For this fit it equals –2 times the maximized log-likelihood,
–2 × (–290.351) = 580.702. Among models fitted to the same data, a smaller residual
deviance indicates a closer fit.
– AIC (Akaike Information Criterion): This measures the model’s goodness of fit while
penalizing the number of parameters p. Lower AIC values indicate better-fitting models
relative to others fitted to the same data. Here p = 8, so
\[
\mathrm{AIC} = -2\ell(\hat{\pi}; y) + 2p = -2 \times (-290.35) + 16 = 596.70.
\]
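The deviance and AIC arithmetic can be reproduced in a few lines of R. This sketch assumes the maximized log-likelihood is the negative of the "final value" printed in the `multinom` fitting trace (290.351098), which is consistent with the reported residual deviance; the variable names are illustrative.

```r
loglik_full <- -290.351098   # negative of the "final value" in the multinom trace
p_full      <- 8             # number of estimated parameters ("8 variable")

deviance_full <- -2 * loglik_full            # residual deviance
aic_full      <- deviance_full + 2 * p_full  # AIC = -2 * logLik + 2p

print(round(c(deviance = deviance_full, AIC = aic_full), 4))
```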
The maximum value of the log-likelihood function for the minimal model (with only two
parameters, β02 and β03 ) is
> res.carsN=multinom(response~1,weights=frequency,data=CARS)
# weights: 6 (2 variable)
initial value 329.583687
final value 329.272024
converged
> logLik(res.carsN)
'log Lik.' -329.272 (df=2)
and for the fitted model (1) it is
> logLik(res.cars)
'log Lik.' -290.3511 (df=8)
Likelihood ratio chi-squared statistic: with ℓ(bmin) = –329.27 for the minimal model and
ℓ(b) = –290.35 for the fitted model, the likelihood ratio chi-squared statistic is
\[
C = 2\left[\ell(b) - \ell(b_{\min})\right] = 2 \times (-290.35 + 329.27) = 77.84.
\]
Pseudo R²:
\[
\text{pseudo } R^2 = \frac{\ell(b_{\min}) - \ell(b)}{\ell(b_{\min})} = \frac{-329.27 + 290.35}{-329.27} = 0.118.
\]
The first statistic, which has 6 degrees of freedom (8 parameters in the fitted model minus 2
for the minimal model), is very significant compared with the χ²(6) distribution, showing the
overall importance of the explanatory variables.
However, the second statistic suggests that only 11.8% of the “variation” is “explained” by
these factors.
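The two statistics can be checked numerically from the reported log-likelihoods. The sketch below uses the values printed by `logLik()` and base R's `pchisq()` for the upper-tail probability; the variable names are illustrative.

```r
loglik_min <- -329.272    # logLik(res.carsN), 2 parameters
loglik_fit <- -290.3511   # logLik(res.cars),  8 parameters

C_stat    <- 2 * (loglik_fit - loglik_min)                 # likelihood ratio statistic
df_diff   <- 8 - 2                                         # difference in parameters
p_value   <- pchisq(C_stat, df = df_diff, lower.tail = FALSE)
pseudo_r2 <- (loglik_min - loglik_fit) / loglik_min

print(c(C = C_stat, df = df_diff, p = p_value, pseudoR2 = pseudo_r2))
```

The p-value is the probability that a χ²(6) variable exceeds C, which is the comparison described in the text.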
To estimate the probabilities, first consider the preferences of women (x1 = 0) aged 18-23
(so x2 = 0 and x3 = 0).
For this group
\[
\log\left(\frac{\hat{\pi}_2}{\hat{\pi}_1}\right) = -0.591, \quad \text{so} \quad \frac{\hat{\pi}_2}{\hat{\pi}_1} = e^{-0.591} = 0.5539,
\]
\[
\log\left(\frac{\hat{\pi}_3}{\hat{\pi}_1}\right) = -1.039, \quad \text{so} \quad \frac{\hat{\pi}_3}{\hat{\pi}_1} = e^{-1.039} = 0.3538,
\]
but π̂1 + π̂2 + π̂3 = 1, so π̂1 (1 + 0.5539 + 0.3538) = 1; therefore, π̂1 = 1/1.9077 = 0.524, and
hence, π̂2 = 0.290 and π̂3 = 0.186.
Now consider men (x1 = 1) aged over 40 (so x2 = 0, but x3 = 1), so that
\[
\log\left(\frac{\hat{\pi}_2}{\hat{\pi}_1}\right) = -0.591 - 0.388 + 1.588 = 0.609, \qquad \log\left(\frac{\hat{\pi}_3}{\hat{\pi}_1}\right) = 1.065,
\]
and hence, π̂1 = 0.174, π̂2 = 0.320 and π̂3 = 0.505 (correct to 3 decimal places).
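The same calculation can be sketched in R for any covariate pattern. The helper `nominal_probs()` below is hypothetical (it is not part of nnet); it takes the coefficient matrix copied from `summary(res.cars)`, forms the two linear predictors, and normalizes the odds so the three probabilities sum to one.

```r
# Coefficients copied from summary(res.cars); rows are response categories 2 and 3,
# columns are intercept, factor(age)2, factor(age)3, factor(sex)2.
B <- rbind("2" = c(-0.5907992, 1.128268, 1.587709, -0.3881301),
           "3" = c(-1.0390726, 1.478104, 2.916757, -0.8130202))

# Hypothetical helper: category probabilities for covariates x1 (sex), x2, x3 (age).
nominal_probs <- function(B, x1, x2, x3) {
  eta  <- as.vector(B %*% c(1, x2, x3, x1))  # log odds of categories 2, 3 versus 1
  odds <- c(1, exp(eta))                     # odds versus category 1
  setNames(odds / sum(odds), c("pi1", "pi2", "pi3"))
}

women_18_23 <- nominal_probs(B, x1 = 0, x2 = 0, x3 = 0)
men_over_40 <- nominal_probs(B, x1 = 1, x2 = 0, x3 = 1)
print(round(women_18_23, 3))
print(round(men_over_40, 3))
```

Because the code uses the unrounded coefficients, the printed values may differ from the hand calculation in the third decimal place.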
2 Ordinal Logistic Regression
Analysis of proportional odds ordinal regression model in R
> ################ R code (proportional odds ordinal regression model)
> library(MASS)  # provides polr()
> res.polr=polr(factor(response)~factor(age)+factor(sex),weights=frequency,data=CARS)
> summary(res.polr)
Call:
polr(formula = factor(response) ~ factor(age) + factor(sex),
data = CARS, weights = frequency)
Coefficients:
               Value Std. Error t value
factor(age)2  1.1471     0.2776   4.132
factor(age)3  2.2325     0.2915   7.659
factor(sex)2 -0.5762     0.2262  -2.548
Intercepts:
    Value  Std. Error t value
1|2 0.0435 0.2323     0.1874
2|3 1.6550 0.2556     6.4744
Residual Deviance: 581.2956
AIC: 591.2956
> logLik(res.polr)
'log Lik.' -290.6478 (df=5)
> res.polrN=polr(factor(response)~1,weights=frequency,data=CARS)
> logLik(res.polrN)
'log Lik.' -329.272 (df=2)
Results and Interpretation
Coefficients:
– The coefficients represent the effects of age and sex on the cumulative log odds of the
importance rating. In the parameterization used by polr, logit P(Y ≤ j) = ζj – η, where ζj
is the j-th intercept and η is the linear predictor, so a positive coefficient shifts
probability towards the higher levels of importance.
– The coefficient for factor(age)2 (age 24-40 years) is 1.1471 with a standard error of
0.2776 and a t-value of 4.132. This indicates that individuals aged 24-40 years are more
likely to prioritize air conditioning and power steering compared to the reference group
(18-23 years).
– The coefficient for factor(age)3 (age over 40 years) is 2.2325 with a standard error of
0.2915 and a t-value of 7.659. This suggests that individuals over 40 years old are even
more likely to prioritize air conditioning and power steering compared to the reference
group.
– The coefficient for factor(sex)2 (male) is –0.5762 with a standard error of 0.2262 and
a t-value of –2.548. This indicates that males are less likely to prioritize air conditioning
and power steering compared to females of the same age group.
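To put these coefficients on the odds scale, the sketch below exponentiates the estimates copied from `summary(res.polr)` and attaches approximate 95% Wald intervals (estimate ± z × standard error). The object names are illustrative, and the Wald form is an assumption of this sketch; profile-likelihood intervals could give slightly different limits.

```r
# Estimates and standard errors copied from summary(res.polr).
coefs <- c("factor(age)2" = 1.1471, "factor(age)3" = 2.2325, "factor(sex)2" = -0.5762)
ses   <- c("factor(age)2" = 0.2776, "factor(age)3" = 0.2915, "factor(sex)2" = 0.2262)

z_crit <- qnorm(0.975)  # normal quantile for a two-sided 95% interval
or_table <- cbind(OR    = exp(coefs),
                  lower = exp(coefs - z_crit * ses),
                  upper = exp(coefs + z_crit * ses))
print(round(or_table, 2))
```

An interval that excludes 1 corresponds to a coefficient whose Wald interval excludes 0.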
Intercepts:
– The intercepts are the thresholds (cut-points) between adjacent levels of importance on
the cumulative logit scale. For example, the intercept 1|2 is 0.0435 with a standard error
of 0.2323 and a t-value of 0.1874: for the reference group (women aged 18-23 years), the
log odds of choosing level 1 rather than level 2 or 3 are 0.0435. The intercept 2|3 is
1.6550, the log odds of choosing level 1 or 2 rather than level 3 in the same group.
Residual Deviance and AIC:
– The residual deviance is 581.2956, which equals –2 times the maximized log-likelihood,
–2 × (–290.6478). Among models fitted to the same data, a smaller residual deviance
indicates a closer fit. The AIC value is 591.2956, which adds a penalty of 2 for each of
the 5 estimated parameters to the deviance. Lower AIC values indicate better-fitting
models.
Log-Likelihood:
– The log-likelihood values are -290.6478 for the model with predictors (df = 5) and -329.272
for the null model (df = 2). The difference in log-likelihood between these models is used
for model comparison and model selection.
For the proportional odds model, the maximum value of the log-likelihood function, ℓ(b), is
> logLik(res.polr)
'log Lik.' -290.6478 (df=5)
For the minimal model, with only β01 and β02, the maximum value is
> res.polrN=polr(factor(response)~1,weights=frequency,data=CARS)
> logLik(res.polrN)
'log Lik.' -329.272 (df=2)
Likelihood ratio chi-squared statistic: the likelihood ratio chi-squared statistic is
\[
C = 2\left[\ell(b) - \ell(b_{\min})\right] = 2 \times (-290.648 + 329.272) = 77.248.
\]
Pseudo R²:
\[
\text{pseudo } R^2 = \frac{\ell(b_{\min}) - \ell(b)}{\ell(b_{\min})} = \frac{-329.272 + 290.648}{-329.272} = 0.117,
\]
and Akaike information criterion
\[
\mathrm{AIC} = -2\ell(\hat{\pi}; y) + 2p = -2 \times (-290.648) + 2 \times 5 = 591.3.
\]
These last two statistics show that there is very little difference in how well the proportional odds
and the nominal logistic regression models describe the data.
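A side-by-side comparison makes this concrete. The sketch below defines a hypothetical helper `fit_stats()` that recomputes the likelihood ratio statistic, pseudo R² and AIC from the log-likelihoods and parameter counts reported by `logLik()` for the two models, then binds the results into one table.

```r
# Hypothetical helper: summary statistics from log-likelihoods and parameter counts.
fit_stats <- function(loglik, loglik_min, p, p_min) {
  c(logLik   = loglik,
    C        = 2 * (loglik - loglik_min),
    df       = p - p_min,
    pseudoR2 = (loglik_min - loglik) / loglik_min,
    AIC      = -2 * loglik + 2 * p)
}

# Log-likelihoods and parameter counts as reported by logLik() in the notes.
comparison <- rbind(
  nominal           = fit_stats(-290.3511, -329.272, p = 8, p_min = 2),
  proportional_odds = fit_stats(-290.6478, -329.272, p = 5, p_min = 2))
print(round(comparison, 3))
```

The proportional odds model uses three fewer parameters for almost the same log-likelihood, which is why its AIC is the lower of the two.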
The parameter estimates for the proportional odds model are all quite similar to those from
the nominal logistic regression model.
The estimated probabilities are also similar; for example, for females aged 18-23, x1 =
0, x2 = 0 and x3 = 0, so
\[
\log\left(\frac{\hat{\pi}_1}{\hat{\pi}_2 + \hat{\pi}_3}\right) = 0.0435 \quad \text{and} \quad \log\left(\frac{\hat{\pi}_1 + \hat{\pi}_2}{\hat{\pi}_3}\right) = 1.6550.
\]
If these equations are solved with π̂1 + π̂2 + π̂3 = 1, the estimates are π̂1 = 0.5109, π̂2 =
0.3287 and π̂3 = 0.1604.
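Solving the two cumulative logit equations amounts to applying the inverse logit and differencing. The sketch below does this with base R's `plogis()`. The helper `po_probs()` is hypothetical, and it assumes the parameterization logit P(Y ≤ j) = ζj – η documented for `MASS::polr`, with η = 0 for the reference group. The second call, for men aged over 40, is included only to illustrate a non-zero η.

```r
# Intercepts (cut-points) and coefficients copied from summary(res.polr).
zeta <- c("1|2" = 0.0435, "2|3" = 1.6550)
beta <- c(age2 = 1.1471, age3 = 2.2325, male = -0.5762)

# Hypothetical helper: category probabilities under logit P(Y <= j) = zeta_j - eta.
po_probs <- function(zeta, eta = 0) {
  cum <- c(plogis(zeta - eta), 1)   # P(Y <= 1), P(Y <= 2), P(Y <= 3) = 1
  setNames(diff(c(0, cum)), c("pi1", "pi2", "pi3"))
}

women_18_23 <- po_probs(zeta)                                         # eta = 0
men_over_40 <- po_probs(zeta, eta = beta[["age3"]] + beta[["male"]])  # illustration
print(round(women_18_23, 4))
print(round(men_over_40, 4))
```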