0% found this document useful (0 votes)
11 views15 pages

PH6205 RTutorial 2

This is the tutorial 2 for the course PH6205 at City University of Hong Kong.

Uploaded by

xuehrcityu
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
11 views15 pages

PH6205 RTutorial 2

This is the tutorial 2 for the course PH6205 at City University of Hong Kong.

Uploaded by

xuehrcityu
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

PH6205 R Tutorial 2

Haoran Xue

(Last Updated January 27, 2025)

FEV1 Example
Let’s reproduce the results for FEV1 example in our lecture.

Load Data into R


First, we need to load the FEV1 data into R. The name of the dataset is [Link]. On my computer the
path to this dataset is “/Users/x/Library/CloudStorage/Dropbox/PH6205/Slides2/[Link]”,
you need to change the path according to the location of the dataset on your own computer when you read
the data. We use the function [Link]().
[Link] = [Link]("/Users/x/Library/CloudStorage/Dropbox/PH6205/Slides2/[Link]")

The FEV1 dataset has been loaded into your current environment, as an object named [Link], as shown
below.

Figure 1: FEV1 data has been loaded as [Link]” in your environment. See the upper-right subpanel of
your R studio. It has 212 observations and 7 variables.

Now we can use the head() function to view the first 6 rows (observations) of [Link].
head([Link])

## id age ht fev1 smok sex occ


## 1 1 7.904175 112.2 1.41 Smoker Female Service
## 2 2 7.107460 113.6 1.36 Smoker Female Para-professional
## 3 3 6.989733 113.7 1.12 Smoker Female Service
## 4 4 6.097194 114.0 1.16 Smoker Female Service
## 5 5 8.468172 114.0 1.48 Smoker Female Service
## 6 6 7.310061 115.9 1.28 Smoker Female Other

1
The data [Link] in your environment is a Data Frame in R, which can be check using function class():
class([Link])

## [1] "[Link]"
We can use “$” to extract certain variables in a data frame. Now we extract the variables FEV1 and Height.
FEV1 = [Link]$fev1
Height = [Link]$ht

Let’s draw the scatter plot of FEV1 versus Height:


plot(Height,FEV1)
5
4
FEV1

3
2
1

110 120 130 140 150 160 170

Height
Both Heigh and FEV1 are numerical variables:
class(FEV1)

## [1] "numeric"
class(Height)

## [1] "numeric"

Simple Linear Regression


Now we fit the simple linear regression of FEV1 versus Height. We use the function lm():
lm.FEV1Height = lm(FEV1 ~ Height)

Now the object lm.FEV1Height, which is our fitted model, is in your R environment. We can use the
function summary() to display the details of the results:
summary(lm.FEV1Height)

##
## Call:
## lm(formula = FEV1 ~ Height)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.60804 -0.18521 -0.01861 0.14712 1.90951
##

2
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.52903 0.19800 -17.82 <2e-16 ***
## Height 0.03983 0.00140 28.44 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.282 on 210 degrees of freedom
## Multiple R-squared: 0.7939, Adjusted R-squared: 0.7929
## F-statistic: 808.9 on 1 and 210 DF, p-value: < 2.2e-16
These are the results we discussed in the lecture. Now we can use function abline() to add a red line to the
scatter plot, which indicates the fitted model:
plot(Height,FEV1)
abline(lm.FEV1Height,col="red")
5
4
FEV1

3
2
1

110 120 130 140 150 160 170

Height

Multiple Linear Regression


Now we fit the multiple linear regression of FEV1 versus Height and Age. We still use the lm() function,
just with one more variable Age.
Age = [Link]$age
class(Age)

## [1] "numeric"
lm.FEV1HeightAge = lm(FEV1 ~ Height + Age)
summary(lm.FEV1HeightAge)

##
## Call:
## lm(formula = FEV1 ~ Height + Age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.66422 -0.19869 -0.00666 0.15142 1.83564
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.123780 0.269745 -11.58 <2e-16 ***

3
## Height 0.033701 0.003124 10.79 <2e-16 ***
## Age 0.044572 0.020356 2.19 0.0297 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2795 on 209 degrees of freedom
## Multiple R-squared: 0.7985, Adjusted R-squared: 0.7966
## F-statistic: 414.2 on 2 and 209 DF, p-value: < 2.2e-16

Using Categorical Variable as Predictor


In the above simple and multiple linear regression, both predictors Height and Age are numerical. We can
also include the categorical variable Smoke in our model. The class of the Smoke variable is character,
which will be regarded as categorical variable.
Smoke = [Link]$smok
class(Smoke)

## [1] "character"
lm.FEV1Smoke = lm(FEV1 ~ Smoke)
summary(lm.FEV1Smoke)

##
## Call:
## lm(formula = FEV1 ~ Smoke)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.52361 -0.17471 -0.01657 0.14250 2.27639
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.96471 0.03798 51.728 <2e-16 ***
## SmokeNon-smoker 0.79890 0.05334 14.978 <2e-16 ***
## SmokeSmoker -0.48629 0.05371 -9.053 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3178 on 209 degrees of freedom
## Multiple R-squared: 0.7395, Adjusted R-squared: 0.737
## F-statistic: 296.6 on 2 and 209 DF, p-value: < 2.2e-16

Identify Effect Modifier by Including Interaction Term


Now we can include the interaction term between Height and Sex.
Sex = [Link]$sex
class(Sex)

## [1] "character"
lm.FEV1HeightSexInteract = lm(FEV1 ~ Height*Sex)
summary(lm.FEV1HeightSexInteract)

##
## Call:

4
## lm(formula = FEV1 ~ Height * Sex)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.67982 -0.13907 -0.00825 0.13124 1.83320
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.041137 0.348027 -5.865 1.75e-08 ***
## Height 0.028172 0.002610 10.792 < 2e-16 ***
## SexMale -1.494765 0.462646 -3.231 0.001435 **
## Height:SexMale 0.012156 0.003311 3.671 0.000307 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2633 on 208 degrees of freedom
## Multiple R-squared: 0.822, Adjusted R-squared: 0.8194
## F-statistic: 320.2 on 3 and 208 DF, p-value: < 2.2e-16
As we discussed, R is a very powerful tool to plot graphs. Let’s showcase its power with the following example.
(Note: This part of code is not required for you to understand, it just serves as an example.)
[Link] = which(Sex == "Female")
[Link] = which(Sex == "Male")

plot(Height[[Link]],FEV1[[Link]],
xlab = "Height",ylab = "FEV1",col = "red",pch = 1,
xlim = range(Height),ylim = range(FEV1))
points(Height[[Link]],FEV1[[Link]],
col="blue",pch = 2)
abline(-2.041137,0.028172,col = "red")
abline(-3.535902,0.040328,col = "blue")
5
4
FEV1

3
2
1

110 120 130 140 150 160 170

Height

Carotene Example
Now let’s load the Carotene data from the file [Link] as an data frame named [Link]
in R.

5
[Link] = [Link]("/Users/x/Library/CloudStorage/Dropbox/PH6205/Slides2/[Link]")
class([Link])

## [1] "[Link]"
head([Link])

## age gender bmi vitgr fat fiber alco carotene


## 1 64 1 21.48380 1 57.0 6.3 0.0 199.99653
## 2 76 1 23.87631 1 50.1 15.8 0.0 124.00229
## 3 38 1 20.01080 2 83.6 19.1 14.1 327.99554
## 4 40 1 25.14062 3 97.5 26.5 0.5 152.99420
## 5 72 1 20.98504 1 82.6 16.2 0.0 92.00105
## 6 40 1 27.52136 3 56.0 9.6 1.3 147.99818
Carotene = [Link]$carotene

Draw the histogram of Carotene and log(Carotene) with the function hist().
hist(Carotene)

Histogram of Carotene
200
Frequency

50 100
0

0 500 1000 1500

Carotene
hist(log(Carotene))

6
Histogram of log(Carotene)

80
60
Frequency

40
20
0

3 4 5 6 7

log(Carotene)
Define the new variable LogCarotene:
LogCarotene = log([Link]$carotene)

Extract other variables:


Vitamin = [Link]$vitgr
Age = [Link]$age
Gender = [Link]$gender
BMI = [Link]$bmi
Fat = [Link]$fat
Fiber = [Link]$fiber
Alcohol = [Link]$alco

Check linearity and outlier using scatter plots:


plot(Age,LogCarotene)
abline(lm(LogCarotene~Age),col = "red")
7
LogCarotene

6
5
4
3

20 30 40 50 60 70 80

Age
plot(BMI,LogCarotene)
abline(lm(LogCarotene~BMI),col = "red")

7
7
LogCarotene

6
5
4
3

15 20 25 30 35 40 45 50

BMI
plot(Fat,LogCarotene)
abline(lm(LogCarotene~Fat),col = "red")
7
LogCarotene

6
5
4
3

50 100 150 200

Fat
plot(Fiber,LogCarotene)
abline(lm(LogCarotene~Fiber),col = "red")
7
LogCarotene

6
5
4
3

5 10 15 20 25 30 35

Fiber
plot(Alcohol,LogCarotene)
abline(lm(LogCarotene~Alcohol),col = "red")

8
7
LogCarotene

6
5
4
3

0 50 100 150 200

Alcohol
We can see an outlier with extremely large Alcohole value. Let us identify which one it is with the
[Link]() function.
[Link](Alcohol)

## [1] 56
It is the 56th observation. Let’s remove this outlier in our dataset, and re-define all variables.
[Link] = [Link][-56,]
LogCarotene = log([Link]$carotene)
Vitamin = [Link]([Link]$vitgr)
Age = [Link]$age
Gender = [Link]$gender
Gender = [Link](Gender)
BMI = [Link]$bmi
Fat = [Link]$fat
Fiber = [Link]$fiber
Alcohol = [Link]$alco

Now the scatter plot of Carotene versus Alcohol looks better.


plot(Alcohol,LogCarotene)
abline(lm(LogCarotene~Alcohol),col = "red")
7
LogCarotene

6
5
4
3

0 5 10 15 20 25 30 35

Alcohol
Let us select important variables using simple linear regressions.

9
summary(lm(LogCarotene~Age))

##
## Call:
## lm(formula = LogCarotene ~ Age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.22245 -0.44895 -0.01865 0.47898 2.38531
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.572083 0.151240 30.231 < 2e-16 ***
## Age 0.008041 0.002896 2.777 0.00584 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7314 on 294 degrees of freedom
## Multiple R-squared: 0.02556, Adjusted R-squared: 0.02224
## F-statistic: 7.711 on 1 and 294 DF, p-value: 0.005842
summary(lm(LogCarotene~Gender))

##
## Call:
## lm(formula = LogCarotene ~ Gender)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.37480 -0.43920 -0.02322 0.43964 2.24100
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.7118 0.1191 39.577 <2e-16 ***
## Gender1 0.3021 0.1275 2.369 0.0185 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7339 on 294 degrees of freedom
## Multiple R-squared: 0.01873, Adjusted R-squared: 0.01539
## F-statistic: 5.611 on 1 and 294 DF, p-value: 0.0185
summary(lm(LogCarotene~BMI))

##
## Call:
## lm(formula = LogCarotene ~ BMI)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.16824 -0.41291 -0.05875 0.50891 2.15676
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.886288 0.182590 32.238 < 2e-16 ***

10
## BMI -0.035010 0.006834 -5.123 5.46e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7099 on 294 degrees of freedom
## Multiple R-squared: 0.08195, Adjusted R-squared: 0.07882
## F-statistic: 26.24 on 1 and 294 DF, p-value: 5.46e-07
summary(lm(LogCarotene~Fat))

##
## Call:
## lm(formula = LogCarotene ~ Fat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.40236 -0.44817 -0.02569 0.48633 2.25367
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.183535 0.110375 46.963 <2e-16 ***
## Fat -0.002775 0.001355 -2.048 0.0414 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7357 on 294 degrees of freedom
## Multiple R-squared: 0.01407, Adjusted R-squared: 0.01071
## F-statistic: 4.195 on 1 and 294 DF, p-value: 0.04143
summary(lm(LogCarotene~Fiber))

##
## Call:
## lm(formula = LogCarotene ~ Fiber)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.3646 -0.4526 -0.0272 0.4279 2.2487
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.559569 0.108844 41.891 < 2e-16 ***
## Fiber 0.032417 0.007838 4.136 4.62e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7202 on 294 degrees of freedom
## Multiple R-squared: 0.05499, Adjusted R-squared: 0.05177
## F-statistic: 17.11 on 1 and 294 DF, p-value: 4.616e-05
summary(lm(LogCarotene~Alcohol))

##
## Call:
## lm(formula = LogCarotene ~ Alcohol)
##

11
## Residuals:
## Min 1Q Median 3Q Max
## -2.3514 -0.4381 -0.0209 0.4746 2.2749
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.990539 0.049111 101.617 <2e-16 ***
## Alcohol -0.006151 0.009444 -0.651 0.515
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7403 on 294 degrees of freedom
## Multiple R-squared: 0.001441, Adjusted R-squared: -0.001956
## F-statistic: 0.4242 on 1 and 294 DF, p-value: 0.5154
As the P-value of Alcohol is 0.515, greater than the cutoff 0.25, we exclude it from our following multiple
linear regression.
[Link] = lm(LogCarotene~Vitamin+Age+Gender+BMI+Fat+Fiber)
summary([Link])

##
## Call:
## lm(formula = LogCarotene ~ Vitamin + Age + Gender + BMI + Fat +
## Fiber)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.95133 -0.36183 -0.03527 0.43318 1.89461
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.009390 0.317607 15.772 < 2e-16 ***
## Vitamin2 -0.102717 0.098097 -1.047 0.29593
## Vitamin3 -0.332953 0.090505 -3.679 0.00028 ***
## Age 0.008295 0.002861 2.899 0.00403 **
## Gender1 0.328172 0.124363 2.639 0.00877 **
## BMI -0.030412 0.006373 -4.772 2.91e-06 ***
## Fat -0.002436 0.001321 -1.845 0.06613 .
## Fiber 0.029720 0.007522 3.951 9.78e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6564 on 288 degrees of freedom
## Multiple R-squared: 0.231, Adjusted R-squared: 0.2123
## F-statistic: 12.36 on 7 and 288 DF, p-value: 7.845e-14
To perform analysis of variance (ANOVA), we need to use the function Anova() from the R package car.
We need to install this package use the function [Link]() first, as shown below:

12
Now we should load the car package into our envrionment using function library():
library(car)

## Loading required package: carData


Now let’s run the ANOVA for the fitted linear regression model:
Anova([Link])

## Anova Table (Type II tests)


##
## Response: LogCarotene
## Sum Sq Df F value Pr(>F)
## Vitamin 5.987 2 6.9469 0.001131 **
## Age 3.622 1 8.4062 0.004027 **
## Gender 3.001 1 6.9634 0.008772 **
## BMI 9.811 1 22.7681 2.910e-06 ***
## Fat 1.466 1 3.4024 0.066128 .
## Fiber 6.728 1 15.6132 9.782e-05 ***
## Residuals 124.100 288
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Linear Regression Diagnostics


Now let’s draw the plots for diagnostics.
plot([Link],1)

13
Residuals vs Fitted

2
1 33
Residuals

0
−1
−2

217 99

4.0 4.5 5.0 5.5 6.0

Fitted values
lm(LogCarotene ~ Vitamin + Age + Gender + BMI + Fat + Fiber)
plot([Link],2)

Normal Q−Q

33
3
Standardized residuals

2
1
0
−1
−2
−3

217 99

−3 −2 −1 0 1 2 3

Theoretical Quantiles
lm(LogCarotene ~ Vitamin + Age + Gender + BMI + Fat + Fiber)
plot([Link],3)

14
Scale−Location
217 33 99
1.5
Standardized residuals

1.0
0.5
0.0

4.0 4.5 5.0 5.5 6.0

Fitted values
lm(LogCarotene ~ Vitamin + Age + Gender + BMI + Fat + Fiber)

15

You might also like