PH6205 R Tutorial 3
Haoran Xue
(Last Updated February 18, 2025)
CHD Example
Let’s reproduce the results for CHD example in our lecture.
Load Data into R
First, we need to load the CHD data into R. The name of the dataset is CHDData.csv. On my computer
the path to this dataset is “/Users/x/Library/CloudStorage/Dropbox/PH6205/Slides3”, you need
to change the path according to the location of the dataset on your own computer when you read the data.
We use the function read.csv().
CHD.data = read.csv("/Users/x/Library/CloudStorage/Dropbox/PH6205/Slides3/CHDData.csv")
We use the head() function to show the first 6 rows of CHD data:
head(CHD.data)
## ID AGE CHD Sex
## 1 1 20 0 Female
## 2 2 23 0 Male
## 3 3 24 0 Male
## 4 4 25 0 Female
## 5 5 25 1 Male
## 6 6 26 0 Male
Extract variables CHD and Age:
CHD = CHD.data$CHD
Age = CHD.data$AGE
Linear Regression Is Not Appropriate
Run the linear regression of CHD ∼ Age:
lm.CHDAge = lm(CHD~Age)
Now lets draw the scatter plot of CHD versus Age, using red color to denote samples with CHD, and blue to
denote samples without CHD. (Note: This part of code is not required.)
ind1 = which(CHD == 1)
ind2 = which(CHD == 0)
plot(Age[ind2],CHD[ind2],
ylim = c(0,1),xlim = range(Age),col = "blue",
xlab = "Age",ylab = "CHD")
points(Age[ind1],CHD[ind1],col = "red")
abline(lm.CHDAge)
1
1.0
0.8
0.6
CHD
0.4
0.2
0.0
20 30 40 50 60 70
Age
Show the details of linear regression results:
summary(lm.CHDAge)
##
## Call:
## lm(formula = CHD ~ Age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.85793 -0.33992 -0.07274 0.31656 0.99269
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.537960 0.168809 -3.187 0.00193 **
## Age 0.021811 0.003679 5.929 4.57e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.429 on 98 degrees of freedom
## Multiple R-squared: 0.264, Adjusted R-squared: 0.2565
## F-statistic: 35.15 on 1 and 98 DF, p-value: 4.575e-08
Simple Logistic Regression
Let’s fit the simple logistic regression of CHD ∼ Age:
logistic.CHDAge = glm(CHD ~ Age, family = binomial)
summary(logistic.CHDAge)
##
## Call:
## glm(formula = CHD ~ Age, family = binomial)
##
2
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.30945 1.13365 -4.683 2.82e-06 ***
## Age 0.11092 0.02406 4.610 4.02e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 136.66 on 99 degrees of freedom
## Residual deviance: 107.35 on 98 degrees of freedom
## AIC: 111.35
##
## Number of Fisher Scoring iterations: 4
Multiple Logistic Regression
Let’s fit the multiple logistic regression of CHD ∼ Age + Sex:
Sex = CHD.data$Sex
logistic.CHDAgeSex = glm(CHD ~ Age + Sex, family = binomial)
summary(logistic.CHDAgeSex)
##
## Call:
## glm(formula = CHD ~ Age + Sex, family = binomial)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.43517 1.19000 -4.567 4.94e-06 ***
## Age 0.11041 0.02407 4.588 4.48e-06 ***
## SexMale 0.20176 0.54014 0.374 0.709
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 136.66 on 99 degrees of freedom
## Residual deviance: 107.21 on 97 degrees of freedom
## AIC: 113.21
##
## Number of Fisher Scoring iterations: 4
Compare Nested Models Using LRT Test
We use function lrtest() from the R package lmtest to perform the LRT test. First, you should install the
package lmtest, which is similar to what we did to install package car in R Tutorial 2. After successfully
installing lmtest, we load the package using function library():
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
3
##
## as.Date, as.Date.numeric
Now we can use the function lrtest() to perform the LRT test:
lrtest(logistic.CHDAge, logistic.CHDAgeSex)
## Likelihood ratio test
##
## Model 1: CHD ~ Age
## Model 2: CHD ~ Age + Sex
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 2 -53.677
## 2 3 -53.607 1 0.14 0.7083
HL Test
We use function hoslem.test() from the R package ResourceSelection to perform the Hosmer-Lemeshow
test. First, you should install the package ResourceSelection, which is similar to what we did to install
package car and lmtest. After successfully installing ResourceSelection, we load the package using
function library():
library(ResourceSelection)
## ResourceSelection 0.3-6 2023-06-27
The HL test takes two sets of values, with the first set as the binary outcome and the second set as the
predicted probabilities p̂i ’s. The function fitted() can be used to get the predicted probabilities:
fitted(logistic.CHDAgeSex)
## 1 2 3 4 5 6 7
## 0.03816625 0.06333496 0.07020957 0.06447555 0.07776843 0.08606576 0.08606576
## 8 9 10 11 12 13 14
## 0.10509814 0.10509814 0.11594475 0.12775098 0.12775098 0.12775098 0.10690499
## 15 16 17 18 19 20 21
## 0.12775098 0.12775098 0.15444378 0.15444378 0.14288754 0.16941897 0.18552758
## 22 23 24 25 26 27 28
## 0.15695020 0.15695020 0.15695020 0.18552758 0.17211894 0.17211894 0.22123058
## 29 30 31 32 33 34 35
## 0.18842604 0.22123058 0.24083707 0.20589388 0.24083707 0.22453305 0.26159750
## 36 37 38 39 40 41 42
## 0.28347926 0.24434039 0.26529728 0.30643177 0.33038578 0.28736807 0.35525318
## 43 44 45 46 47 48 49
## 0.35525318 0.35525318 0.35525318 0.38092752 0.33461756 0.38092752 0.40728524
## 50 51 52 53 54 55 56
## 0.35963241 0.40728524 0.40728524 0.43418763 0.43418763 0.46148346 0.46148346
## 57 58 59 60 61 62 63
## 0.48901221 0.43887755 0.48901221 0.51660777 0.51660777 0.51660777 0.54410241
## 64 65 66 67 68 69 70
## 0.54410241 0.54410241 0.57133084 0.57133084 0.59813416 0.62436343 0.62436343
## 71 72 73 74 75 76 77
## 0.60270860 0.60270860 0.67457219 0.65420877 0.69832867 0.69832867 0.72106802
## 78 79 80 81 82 83 84
## 0.72106802 0.72106802 0.70233019 0.74272480 0.70233019 0.74272480 0.74272480
## 85 86 87 88 89 90 91
## 0.70233019 0.76325219 0.76325219 0.76325219 0.78262110 0.74635133 0.76668026
4
## 92 93 94 95 96 97 98
## 0.80081893 0.81784789 0.83372326 0.80384284 0.84847137 0.86212770 0.86212770
## 99 100
## 0.87473498 0.91568744
Now we perform the HL test, the first argument CHD is the set of binary outcomes, and the second argument
fitted(logistic.CHDAgeSex) is the set of predicted probabilities:
hoslem.test(CHD,fitted(logistic.CHDAgeSex))
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: CHD, fitted(logistic.CHDAgeSex)
## X-squared = 1.7167, df = 8, p-value = 0.9885