Beginner-Friendly Guide to R Code: Regression Analysis, Assumptions, and PCA

1 Setting Up the Environment


The following code sets and confirms the working directory where R will look for and
save files.
setwd("C:/Users/oralc/Desktop/MDA")
getwd()

Purpose:
• setwd(): Sets the working directory.
• getwd(): Confirms the current working directory.
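
To confirm that the data file is visible from the new working directory, one can list its contents; a minimal base-R sketch:

list.files()   # files visible from the current working directory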

2 Reading the Dataset


This section loads the dataset and makes its columns accessible by name.
rdata = read.csv("Data_lifestyle.csv", header = TRUE)
rdata
names(rdata)
attach(rdata)

Explanation:
• Loads the dataset and assigns it to rdata.
• header = TRUE: First row contains column names.
• attach(): Allows direct access to columns (e.g., Y instead of rdata$Y).
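
A note of caution: attach() can silently mask other objects with the same name. A safer habit, shown here as a base-R sketch, is explicit column access:

rdata$Y                   # explicit access, always unambiguous
with(rdata, summary(Y))   # temporary scoped access without attaching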

3 Loading Required Libraries


These libraries provide tools for regression diagnostics and transformations.
library(car)
library(lmtest)

Purpose:
• car: For VIF and transformations.
• lmtest: For diagnostics like Durbin-Watson and Breusch-Pagan tests.

4 Building Regression Models
The following code builds multiple linear regression models.
Reg1 = lm(Y ~ X1 + X2)
summary(Reg1)

Reg2 = lm(Y ~ X1 + X2 + ... + X21)   # "..." stands in for the remaining predictors; write them out in full
summary(Reg2)

Rega = lm(Y ~ ., data = rdata)       # "." means: use every other column as a predictor
summary(Rega)

Theory: Multiple Linear Regression is defined as:

Y = β0 + β1X1 + β2X2 + ··· + βnXn + ε

The summary() function reports R², the coefficient estimates (β), and their p-values.
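
These pieces can also be pulled out programmatically; a minimal sketch using standard summary.lm components:

s = summary(Reg2)
s$r.squared        # R-squared
s$adj.r.squared    # adjusted R-squared
coef(s)            # estimates, std. errors, t-values, p-values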

5 Checking for Multicollinearity


This section checks for multicollinearity among predictors.
cor(rdata)
View(cor(rdata))

vif(Reg2)
mean(vif(Reg2))

Theory:

• High correlation among predictors indicates multicollinearity.

• Variance Inflation Factor (VIF): VIFj = 1 / (1 − Rj²), where Rj² is the R² obtained by regressing predictor j on all the other predictors. VIF > 4 suggests multicollinearity (applied in the sketch below).
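
Applying the rule of thumb in code (a sketch; vif() comes from the car package loaded in Section 3):

vif_vals = vif(Reg2)
vif_vals[vif_vals > 4]            # predictors breaching the VIF > 4 cutoff
sort(vif_vals, decreasing = TRUE) # rank predictors by VIF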

6 Autocorrelation Check
The Durbin-Watson test checks for autocorrelation in residuals.
dwt(Reg2)

Durbin-Watson Test:

• DW ≈ 2: No autocorrelation.

• DW < 1.5: Indicates problematic positive autocorrelation.
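
For intuition, the DW statistic can also be computed directly from the residuals; a minimal base-R sketch:

res = residuals(Reg2)
dw = sum(diff(res)^2) / sum(res^2)   # Durbin-Watson statistic, roughly 2(1 - lag-1 autocorrelation)
dw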

7 Heteroscedasticity Test
This section tests for constant variance in residuals.
residuals(Reg2)
summary(residuals(Reg2))
plot(residuals(Reg2))

bptest(Reg2)

Breusch-Pagan Test:

• H0 : Constant variance (homoscedasticity).

• H1 : Non-constant variance (heteroscedasticity).

• p > 0.05: Assumption of homoscedasticity holds.
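
The test result can be checked programmatically, since bptest() returns a standard htest object with a p.value component:

bp = bptest(Reg2)
bp$p.value          # p > 0.05: homoscedasticity assumption holds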

8 Normality of Residuals
The Shapiro-Wilk test checks if residuals are normally distributed.
shapiro.test(residuals(Reg2))

Shapiro-Wilk Test:

• H0 : Residuals are normally distributed.

• p > 0.05: No violation of normality.
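
A visual complement to the formal test is a normal Q-Q plot of the residuals (base R):

qqnorm(residuals(Reg2))   # points should fall near a straight line if residuals are normal
qqline(residuals(Reg2))   # reference line through the quartiles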

9 Outlier Detection: Cook's Distance

Cook's distance identifies influential points in the dataset.
cook = cooks.distance(Reg2)
boxplot(cook)
hist(cook)
plot(cook)
which(cook > 0.01)

rdata1 = rdata                                    # work on a copy so rdata stays unchanged
rdata1$cooks.distance = cooks.distance(Reg2)      # store each observation's Cook's distance
cleandata = subset(rdata1, cooks.distance < 0.01) # keep only non-influential observations

Theory:

• Cook's distance flags points with an outsized influence on the fitted coefficients.

• Threshold: a fixed cutoff such as 0.01, or the sample-size-dependent rule 4/n.

• Remove outliers to create cleandata.
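
The 4/n rule can be applied directly instead of the fixed 0.01 cutoff; a sketch:

cutoff = 4 / nrow(rdata)      # sample-size-dependent threshold
which(cook > cutoff)          # indices of influential observations
plot(cook)
abline(h = cutoff, lty = 2)   # dashed line marking the cutoff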

10 Response Variable Transformation
Transformations address non-linearity or heteroscedasticity.
rdata$logY = log(rdata$Y)
trReg = lm(logY ~ ., data = rdata)   # caution: "." still includes the original Y column as a predictor
summary(trReg)

pt = powerTransform(rdata$Y)
rdata$newY = (rdata$Y ^ 0.71)        # apply the power suggested by powerTransform() (0.71 here)
boxReg = lm(newY ~ ., data = rdata)
summary(boxReg)

When to Use Transformations:

• log(Y): For skewed or exponential data.

• sqrt(Y): For moderate skewness.

• 1/Y: Reciprocal transform; compresses very large values so they dominate less.

• Box-Cox: Auto-selects optimal transformation using powerTransform().
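
Rather than hard-coding the exponent 0.71, the estimated power can be read from the powerTransform object and applied with car's bcPower(); a sketch, assuming the car package from Section 3:

summary(pt)                             # reports the estimated lambda with a confidence interval
lambda = coef(pt)                       # extract the estimated power
rdata$newY = bcPower(rdata$Y, lambda)   # apply the Box-Cox transform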

11 PCA Preparation (Optional, for Multicollinearity)

This step prepares the dataset for Principal Component Analysis (PCA).
rdata1.df = data.frame(rdata)
rdata1 = rdata1.df[, 2:22]   # drop Y (column 1), keeping the 21 predictors

Prepares data by excluding the dependent variable.
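
Since Section 10 may have appended extra columns (logY, newY) to rdata, dropping by name is a more robust alternative to positional indexing; a base-R sketch:

drop_cols = c("Y", "logY", "newY")   # response and its transformed copies
rdata1 = rdata1.df[, setdiff(names(rdata1.df), drop_cols)]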

12 Principal Component Analysis (PCA)


PCA reduces dimensionality and resolves multicollinearity.
install.packages("psy")
install.packages("psych")
install.packages("GPArotation")

library(psy)
library(psych)
library(GPArotation)

scree.plot(rdata1)   # scree plot on the predictor-only data from Section 11

model1 = pca(rdata1, nfactors = 15, rotate = "none")
model1$loadings

PCAmodel = pca(rdata1, nfactors = 4, rotate = "varimax", method = "regression",
               scores = TRUE)
PCAmodel$loadings
PCAmodel$scores

finalPCAdata = cbind(rdata, PCAmodel$scores)
write.csv(finalPCAdata, file = "finalPCAdata.csv")

Theory:

• PCA reduces dimensionality and resolves multicollinearity.

• Varimax rotation improves interpretability.

• scores = TRUE: Adds new components (PC1, PC2, etc.) to the dataset.
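
A common follow-up is to refit the regression on the component scores, which are orthogonal by construction. Note the column names below are an assumption: psych typically labels rotated components RC1, RC2, ... (unrotated ones PC1, ...), so check the actual names first. A sketch:

names(finalPCAdata)                                          # confirm the score column names
pcReg = lm(Y ~ RC1 + RC2 + RC3 + RC4, data = finalPCAdata)   # assumes varimax names RC1-RC4
summary(pcReg)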

13 Optional GUI for Beginners


R Commander provides a GUI for statistical analysis.
install.packages("Rcmdr")
library(Rcmdr)

Opens R Commander for easier statistical analysis.

14 Summary: Code vs. Theory

Code / Concept               Theory Explanation
lm()                         Multiple Linear Regression
vif()                        Detects multicollinearity (VIF > 4 is problematic)
dwt()                        Durbin-Watson test (autocorrelation)
bptest()                     Breusch-Pagan test (homoscedasticity)
shapiro.test()               Normality of residuals (Shapiro-Wilk test)
cooks.distance()             Influential outliers (Cook's D > 0.01 or 4/n)
log(Y), powerTransform()     Fix non-linearity or heteroscedasticity
pca()                        Principal Component Analysis for dimensionality reduction

Table 1: Summary of R Code and Corresponding Theory
