Beginner-Friendly Guide to R Code: Regression
Analysis, Assumptions, and PCA
1 Setting Up the Environment
The following code sets and confirms the working directory where R will look for and
save files.
setwd("C:/Users/oralc/Desktop/MDA")
getwd()
Purpose:
• setwd(): Sets the working directory.
• getwd(): Confirms the current working directory.
2 Reading the Dataset
This section loads the dataset and makes its columns accessible by name.
rdata = read.csv("Data_lifestyle.csv", header = TRUE)
rdata
names(rdata)
attach(rdata)
Explanation:
• Loads the dataset and assigns it to rdata.
• header = TRUE: First row contains column names.
• attach(): Allows direct access to columns (e.g., Y instead of rdata$Y).
3 Loading Required Libraries
These libraries provide tools for regression diagnostics and transformations.
library(car)
library(lmtest)
Purpose:
• car: For VIF and transformations.
• lmtest: For diagnostics like Durbin-Watson and Breusch-Pagan tests.
4 Building Regression Models
The following code builds multiple linear regression models.
Reg1 = lm(Y ~ X1 + X2)
summary(Reg1)

Reg2 = lm(Y ~ X1 + X2 + ... + X21)   # "..." is shorthand: write out all predictors X1 through X21 in full
summary(Reg2)

Rega = lm(Y ~ ., data = rdata)       # "." uses every remaining column as a predictor
summary(Rega)
Theory: Multiple Linear Regression is defined as:
Y = β0 + β1 X1 + β2 X2 + · · · + βn Xn + ε
The summary() function reports R², the estimated coefficients (β), and their p-values; the sketch below shows how to extract these from the fitted model object.
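A minimal sketch of pulling these quantities out of the fitted object programmatically, assuming the Reg1 model from the listing above:

fit_summary = summary(Reg1)
coef(Reg1)                  # estimated coefficients (beta values)
fit_summary$r.squared       # R-squared
fit_summary$adj.r.squared   # adjusted R-squared
fit_summary$coefficients    # estimates, standard errors, t-values, p-values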
5 Checking for Multicollinearity
This section checks for multicollinearity among predictors.
cor(rdata)
View(cor(rdata))

vif(Reg2)
mean(vif(Reg2))
Theory:
• High correlation among predictors indicates multicollinearity.
• Variance Inflation Factor (VIF):
VIFj = 1 / (1 − Rj²)

where Rj² is the R² obtained by regressing predictor j on the remaining predictors. VIF > 4 suggests multicollinearity; the sketch below computes one VIF by hand to illustrate the formula.
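As an illustration (not part of the original script), the VIF for X1 can be reproduced with an auxiliary regression; the predictor names are assumed from the dataset above:

aux = lm(X1 ~ . - Y, data = rdata)   # regress X1 on the other predictors
r2  = summary(aux)$r.squared
1 / (1 - r2)                         # should match vif(Reg2)["X1"]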
6 Autocorrelation Check
The Durbin-Watson test checks for autocorrelation in residuals.
dwt(Reg2)
Durbin-Watson Test:
• DW ≈ 2: No autocorrelation.
• DW < 1.5: Indicates problematic positive autocorrelation.
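The dwt() result (car's alias for durbinWatsonTest()) can also be stored and inspected; a small sketch:

dw = dwt(Reg2)
dw$dw   # the DW statistic; values near 2 suggest no autocorrelation
dw$p    # p-value for the test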
7 Heteroscedasticity Test
This section tests for constant variance in residuals.
residuals(Reg2)
summary(residuals(Reg2))
plot(residuals(Reg2))

bptest(Reg2)
Breusch-Pagan Test:
• H0 : Constant variance (homoscedasticity).
• H1 : Non-constant variance (heteroscedasticity).
• p > 0.05: Assumption of homoscedasticity holds.
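A short sketch of reading the p-value off the bptest() result object:

bp = bptest(Reg2)
bp$p.value   # p > 0.05: no evidence against constant variance
if (bp$p.value > 0.05) {
  message("Homoscedasticity assumption appears to hold.")
} else {
  message("Evidence of heteroscedasticity; consider transforming Y (Section 10).")
}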
8 Normality of Residuals
The Shapiro-Wilk test checks if residuals are normally distributed.
shapiro.test(residuals(Reg2))
Shapiro-Wilk Test:
• H0 : Residuals are normally distributed.
• p > 0.05: No violation of normality.
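The p-value can be extracted in the same way, and a Q-Q plot gives a complementary visual check; a minimal sketch:

sw = shapiro.test(residuals(Reg2))
sw$p.value                # p > 0.05: no evidence against normality

qqnorm(residuals(Reg2))   # points should lie roughly on the reference line
qqline(residuals(Reg2))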
9 Outlier Detection: Cook's Distance
Cook's Distance identifies influential points in the dataset.
cook = cooks.distance(Reg2)
boxplot(cook)
hist(cook)
plot(cook)
which(cook > 0.01)

rdata1 = data.frame(rdata)                         # working copy of the data
rdata1$cooks.distance = cooks.distance(Reg2)
cleandata = subset(rdata1, cooks.distance < 0.01)
Theory:
• Cook's Distance identifies highly influential points.
• Threshold: 0.01 (or 4/n); the sketch below applies the 4/n rule.
• Remove flagged outliers to create cleandata.
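A minimal sketch of the 4/n rule, where n is the number of observations (cleandata2 is a hypothetical alternative to the cleandata object above):

cook      = cooks.distance(Reg2)
threshold = 4 / nrow(rdata)              # n = number of observations
which(cook > threshold)                  # rows flagged as influential
cleandata2 = rdata[cook <= threshold, ]  # hypothetical alternative to cleandata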
10 Response Variable Transformation
Transformations address non-linearity or heteroscedasticity.
rdata$logY = log(rdata$Y)
trReg = lm(logY ~ . - Y, data = rdata)    # "- Y" keeps the original response out of the predictors
summary(trReg)

pt = powerTransform(rdata$Y)              # Box-Cox: estimates the optimal lambda
rdata$newY = (rdata$Y^0.71)               # Y^lambda with lambda = 0.71, read from the powerTransform() output
boxReg = lm(newY ~ . - Y - logY, data = rdata)   # exclude Y and logY from the predictors
summary(boxReg)
When to Use Transformations:
• log(Y): For strongly skewed or exponential-looking data.
• sqrt(Y): For moderate skewness.
• 1/Y: Compresses very large values when extreme observations dominate.
• Box-Cox: Auto-selects the optimal transformation via powerTransform(); see the sketch below for reading the estimated lambda.
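A minimal sketch of reading the estimated lambda from powerTransform() instead of hard-coding it; bcY is a hypothetical column name used only for illustration:

pt = powerTransform(rdata$Y)
summary(pt)                           # estimated lambda and its confidence interval
lambda = coef(pt)                     # point estimate of lambda
rdata$bcY = bcPower(rdata$Y, lambda)  # car's bcPower() applies the Box-Cox transformation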
11 PCA Preparation (Optional for Multicollinearity)
This step prepares the dataset for Principal Component Analysis (PCA).
rdata1.df = data.frame(rdata)
rdata1 = rdata1.df[, 2:22]   # drop Y (column 1) and keep the 21 predictors
Prepares the data by excluding the dependent variable; a name-based alternative is sketched below.
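A sketch of dropping columns by name instead of by position, which is safer if the column order ever changes; this also drops the logY and newY columns added in Section 10:

rdata1 = rdata[, setdiff(names(rdata), c("Y", "logY", "newY"))]
names(rdata1)   # should list X1 through X21 only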
12 Principal Component Analysis (PCA)
PCA reduces dimensionality and resolves multicollinearity.
install.packages("psy")
install.packages("psych")
install.packages("GPArotation")

library(psy)
library(psych)
library(GPArotation)

scree.plot(rdata1)   # scree plot of the predictor-only data prepared in Section 11

model1 = pca(rdata1, nfactors = 15, rotate = "none")
model1$loadings

PCAmodel = pca(rdata1, nfactors = 4, rotate = "varimax", method = "regression", scores = TRUE)
PCAmodel$loadings
PCAmodel$scores

finalPCAdata = cbind(rdata, PCAmodel$scores)
write.csv(finalPCAdata, file = "finalPCAdata.csv")
Theory:
• PCA reduces dimensionality and resolves multicollinearity.
• Varimax rotation improves interpretability.
• scores = TRUE: Adds the component scores to the output (named RC1, RC2, . . . after varimax rotation); the sketch below uses them as regression predictors.
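As a follow-up sketch (not in the original script), the scores can replace the correlated predictors in a regression; the RC1–RC4 names are those produced by psych's varimax-rotated pca():

pcReg = lm(Y ~ RC1 + RC2 + RC3 + RC4, data = finalPCAdata)
summary(pcReg)
vif(pcReg)   # component scores are nearly uncorrelated, so VIF values are close to 1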
13 Optional GUI for Beginners
R Commander provides a GUI for statistical analysis.
install.packages("Rcmdr")
library(Rcmdr)
Opens R Commander for easier statistical analysis.
14 Summary: Code vs. Theory
Code / Concept              Theory Explanation
lm()                        Multiple Linear Regression
vif()                       Detects multicollinearity (VIF > 4 is problematic)
dwt()                       Durbin-Watson Test (autocorrelation)
bptest()                    Breusch-Pagan Test (homoscedasticity)
shapiro.test()              Normality of residuals (Shapiro-Wilk test)
cooks.distance()            Influential outliers (Cook's D > 0.01)
log(Y), powerTransform()    Fix non-linearity or heteroscedasticity
pca()                       Principal Component Analysis (PCA) for dimensionality reduction

Table 1: Summary of R Code and Corresponding Theory