Beginner-Friendly Guide to R Code: Regression Analysis, Assumptions, and PCA
1 Setting Up the Environment
The following code sets and confirms the working directory where R will look for and
save files.
setwd("C:/Users/oralc/Desktop/MDA")
getwd()
Purpose:
• setwd(): Sets the working directory.
• getwd(): Confirms the current working directory.
2 Reading the Dataset
This section loads the dataset and makes its columns accessible by name.
rdata = read.csv("Data_lifestyle.csv", header = TRUE)
rdata
names(rdata)
attach(rdata)
Explanation:
• Loads the dataset and assigns it to rdata.
• header = TRUE: First row contains column names.
• attach(): Allows direct access to columns (e.g., Y instead of rdata$Y).
3 Loading Required Libraries
These libraries provide tools for regression diagnostics and transformations.
library(car)
library(lmtest)
Purpose:
• car: For VIF and transformations.
• lmtest: For diagnostics like Durbin-Watson and Breusch-Pagan tests.
4 Building Regression Models
The following code builds multiple linear regression models.
Reg1 = lm(Y ~ X1 + X2)
summary(Reg1)

Reg2 = lm(Y ~ X1 + X2 + ... + X21)  # "..." stands for writing out every predictor X3 through X20
summary(Reg2)

Rega = lm(Y ~ ., data = rdata)      # "." is shorthand for all remaining columns
summary(Rega)
Theory: Multiple Linear Regression is defined as:
Y = β0 + β1X1 + β2X2 + ··· + βnXn + ε

The summary() function reports R², the coefficient estimates (β), and their p-values.
5 Checking for Multicollinearity
This section checks for multicollinearity among predictors.
cor(rdata)
View(cor(rdata))

vif(Reg2)
mean(vif(Reg2))
Theory:
• High correlation among predictors indicates multicollinearity.
• Variance Inflation Factor (VIF):

VIF_j = 1 / (1 − R_j²)

where R_j² is the R² from regressing predictor j on all the other predictors. VIF > 4 suggests multicollinearity.
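To see the formula in action, here is a small sketch on the built-in mtcars data; vif() from car should match the hand computation:

library(car)
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)
vif(fit)                                               # VIF for each predictor

# Hand-computed VIF for wt: regress wt on the other predictors
r2_wt <- summary(lm(wt ~ hp + disp, data = mtcars))$r.squared
1 / (1 - r2_wt)                                        # matches vif(fit)["wt"]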
6 Autocorrelation Check
The Durbin-Watson test checks for autocorrelation in residuals.
dwt(Reg2)
Durbin-Watson Test:
• DW ≈ 2: No autocorrelation.
• DW < 1.5: Indicates problematic positive autocorrelation.
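A minimal sketch on built-in data (not the lifestyle dataset):

library(car)  # dwt() is car's alias for durbinWatsonTest()
fit <- lm(mpg ~ wt + hp, data = mtcars)
dwt(fit)      # reports the DW statistic and a bootstrap p-value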
7 Heteroscedasticity Test
This section tests for constant variance in residuals.
residuals(Reg2)
summary(residuals(Reg2))
plot(residuals(Reg2))

bptest(Reg2)
Breusch-Pagan Test:
• H0 : Constant variance (homoscedasticity).
• H1 : Non-constant variance (heteroscedasticity).
• p > 0.05: Assumption of homoscedasticity holds.
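A small sketch on built-in data, pairing the visual check with the formal test:

library(lmtest)
fit <- lm(mpg ~ wt + hp, data = mtcars)
plot(fitted(fit), residuals(fit))  # a funnel shape hints at heteroscedasticity
abline(h = 0, lty = 2)
bptest(fit)                        # p > 0.05: homoscedasticity not rejected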
8 Normality of Residuals
The Shapiro-Wilk test checks if residuals are normally distributed.
shapiro.test(residuals(Reg2))
Shapiro-Wilk Test:
• H0 : Residuals are normally distributed.
• p > 0.05: No violation of normality.
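A sketch on built-in data, combining the test with the usual visual companion:

fit <- lm(mpg ~ wt + hp, data = mtcars)
shapiro.test(residuals(fit))                    # formal test of normality
qqnorm(residuals(fit)); qqline(residuals(fit))  # visual check of the same assumption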
9 Outlier Detection: Cook's Distance
Cook's Distance identifies influential points in the dataset.
cook = cooks.distance(Reg2)
boxplot(cook)
hist(cook)
plot(cook)
which(cook > 0.01)  # rows exceeding the 0.01 threshold

rdata$cooks.distance = cooks.distance(Reg2)  # note: rdata1 is not defined yet, so use rdata
cleandata = subset(rdata, cooks.distance < 0.01)
Theory:
• Cook's Distance flags points with outsized influence on the fitted model.
• Threshold: 0.01 here; a common alternative is 4/n, where n is the sample size.
• Remove outliers to create cleandata.
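A sketch of the 4/n rule on built-in data:

fit <- lm(mpg ~ wt + hp, data = mtcars)
cook <- cooks.distance(fit)
cutoff <- 4 / nrow(mtcars)          # the 4/n rule of thumb
which(cook > cutoff)                # row indices of influential points
clean <- mtcars[cook <= cutoff, ]   # analogue of cleandata above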
10 Response Variable Transformation
Transformations address non-linearity or heteroscedasticity.
rdata$logY = log(rdata$Y)
trReg = lm(logY ~ . - Y, data = rdata)  # exclude the original Y from the predictors
summary(trReg)

pt = powerTransform(rdata$Y)            # Box-Cox: estimates the optimal lambda
summary(pt)                             # inspect the estimated lambda
rdata$newY = rdata$Y^0.71               # 0.71 is the lambda estimated for this dataset
boxReg = lm(newY ~ . - Y - logY, data = rdata)
summary(boxReg)
When to Use Transformations:
• log(Y): For skewed or exponential data.
• sqrt(Y): For moderate skewness.
• 1/Y: For large values with low impact.
• Box-Cox: Auto-selects optimal transformation using powerTransform().
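Not from the original guide: a minimal Box-Cox sketch on the built-in mtcars data, assuming car is installed; coef(pt) extracts the estimated lambda (analogous to the 0.71 used above).

library(car)
pt <- powerTransform(mtcars$mpg)  # estimate the Box-Cox lambda
summary(pt)                       # lambda estimate with confidence interval
lam <- coef(pt)
mpg_tr <- mtcars$mpg^lam          # apply the estimated power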
11 PCA Preparation (Optional for Multicollinearity)
This step prepares the dataset for Principal Component Analysis (PCA).
rdata1.df = data.frame(rdata)
rdata1 = rdata1.df[, 2:22]  # drop Y (assumes Y is column 1 of the 22 columns)
Prepares the data by excluding the dependent variable (and any transformed responses such as logY or newY, if Section 10 was run).
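Selecting columns by position assumes a fixed layout; a more robust sketch drops the unwanted columns by name (logY and newY only exist if Section 10 was run):

drop_cols = intersect(c("Y", "logY", "newY"), names(rdata))
rdata1 = rdata[, setdiff(names(rdata), drop_cols)]  # predictors only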
12 Principal Component Analysis (PCA)
PCA reduces dimensionality and resolves multicollinearity.
install.packages("psy")
install.packages("psych")
install.packages("GPArotation")

library(psy)
library(psych)
library(GPArotation)

scree(rdata1)  # scree plot from psych; psy's scree.plot() is obsolete

# use rdata1 (predictors only) from Section 11, not the full rdata
model1 = pca(rdata1, nfactors = 15, rotate = "none")
model1$loadings

PCAmodel = pca(rdata1, nfactors = 4, rotate = "varimax", method = "regression", scores = TRUE)
PCAmodel$loadings
PCAmodel$scores

finalPCAdata = cbind(rdata, PCAmodel$scores)
write.csv(finalPCAdata, file = "finalPCAdata.csv")
Theory:
• PCA reduces dimensionality and resolves multicollinearity.
• Varimax rotation improves interpretability.
• scores = TRUE: Adds new components (PC1, PC2, etc.) to the dataset.
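A self-contained sketch on the built-in mtcars data, assuming psych is installed (with varimax rotation psych labels the components RC1, RC2, ...):

library(psych)
X <- scale(mtcars)  # standardize before PCA
pc <- pca(X, nfactors = 2, rotate = "varimax", scores = TRUE)
pc$loadings         # variable-component loadings
head(pc$scores)     # component scores per observation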
13 Optional GUI for Beginners
R Commander provides a GUI for statistical analysis.
install.packages("Rcmdr")
library(Rcmdr)
Opens R Commander for easier statistical analysis.
14 Summary: Code vs. Theory
Code / Concept Theory Explanation
lm() Multiple Linear Regression
vif() Detects multicollinearity (VIF > 4 is problematic)
dwt() Durbin-Watson Test (Autocorrelation)
bptest() Breusch-Pagan Test (Homoscedasticity)
shapiro.test() Normality of residuals (Shapiro-Wilk test)
cooks.distance() Influential Outliers (Cook’s D > 0.01)
log(Y), powerTransform() Fix non-linearity or heteroscedasticity
pca() Principal Component Analysis (PCA) for dimensionality
Table 1: Summary of R Code and Corresponding Theory
Corrected Beginner-Friendly Guide to PCA in R
July 13, 2025
1 Introduction
This guide provides a corrected and beginner-friendly explanation of performing Principal
Component Analysis (PCA) in R, including suitability tests, scree plots, and component
score generation. Corrections to common errors are highlighted, and best practices are
emphasized.
2 Set Working Directory
Sets and confirms the working directory where PCA data is stored.
setwd("C:/Users/oralc/Desktop/PCA")
getwd()
Purpose:
• setwd(): Sets the working directory for file operations.
• getwd(): Confirms the current working directory.
3 Read and Attach Data
Loads the dataset and checks column names.
PC = read.csv("pcadata.csv", header = TRUE)
names(PC)
attach(PC)
Note:
• header = TRUE: First row contains column names.
• attach() is not recommended due to potential variable conflicts. Use PC$VariableName
instead for safer access.
4 Correlation Matrix & Normality Tests
Computes the correlation matrix and visualizes it appropriately.
r = cor(PC)
View(r)

par(mar = c(1, 1, 1, 1))
hist(as.vector(r))    # histogram of correlation coefficients
qqnorm(as.vector(r))  # QQ plot for normality
Corrections:
• Original: hist(r) and qqnorm(r) used a matrix, which is incorrect.
• Fixed: Convert matrix to vector with as.vector(r) for proper visualization.
5 Install & Load Packages
Installs and loads required packages for PCA and diagnostics.
install.packages("psy")
install.packages("psych")
install.packages("GPArotation")
library(psy)
library(psych)
library(GPArotation)
Purpose:
• psy, psych: For Bartletts Test, KMO, and PCA functions.
• GPArotation: For rotated PCA solutions (e.g., Varimax).
6 Suitability Tests
Tests whether the dataset is suitable for PCA.
cortest.bartlett(PC)  # Bartlett's Test of Sphericity
KMO(PC)               # Kaiser-Meyer-Olkin (KMO) test
Explanation:
• Bartlett's Test: p < 0.05 indicates the variables are sufficiently correlated for PCA.
• KMO Test: KMO > 0.6 indicates acceptable sampling adequacy.
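A sketch of the same tests on built-in data; cortest.bartlett() also accepts a correlation matrix if the sample size is supplied:

library(psych)
r <- cor(mtcars)
cortest.bartlett(r, n = nrow(mtcars))  # correlation matrix plus sample size
KMO(r)                                 # overall MSA plus per-variable MSA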
7 Scree Plot for Factor Selection
Visualizes the number of components to retain.
scree(PC)        # scree plot from psych package
fa.parallel(PC)  # parallel analysis for the optimal number of components
Correction:
• Original: scree.plot(PC) from psy is obsolete.
• Fixed: Use scree() or fa.parallel() from psych for better visualization and
component selection.
8 Unrotated PCA with 15 Components
Performs PCA without rotation for comparison.
model1 = pca(PC, nfactors = 15, rotate = "none")
model1$loadings
Explanation:
• Extracts 15 principal components without rotation.
• Useful for initial exploration but less interpretable without rotation.
9 PCA with 4 Factors & Varimax Rotation
Performs PCA with Varimax rotation for interpretable results.
PCAmodel = pca(PC, nfactors = 4, rotate = "varimax", method = "regression", scores = TRUE)
PCAmodel
PCAmodel$loadings  # factor loadings
PCAmodel$scores    # component scores
Explanation:
• rotate = "varimax": Orthogonal rotation for clearer factor interpretation.
• method = "regression": Ensures proper factor score estimation.
• scores = TRUE: Generates component scores for each observation.
10 Save Final Dataset
Appends PCA scores to the original dataset and saves it.
finalPCAdata = cbind(PC, PCAmodel$scores)
write.csv(finalPCAdata, file = "finalPCAdata.csv")
Purpose:
• Combines original data with PCA scores (PC1, PC2, etc.).
• Saves the enhanced dataset as a CSV file.
11 Optional GUI for Beginners
Installs and loads R Commander for a GUI-based interface.
install.packages("Rcmdr")
library(Rcmdr)
Correction:
• Original: install.pckages("Rcmdr") contained a typo.
• Fixed: Corrected to install.packages("Rcmdr").
12 Summary of Issues and Fixes
Line Issue Fix
qqnorm(r) Matrix passed instead of vector Use qqnorm(as.vector(r))
hist(r) Matrix passed instead of vector Use hist(as.vector(r))
scree.plot() Obsolete function Use scree() or fa.parallel()
install.pckages() Typo in function name Use install.packages()
attach() Risky for variable conflicts Use PC$colname instead
Table 1: Summary of Issues and Fixes in PCA Script
13 Additional Notes
When interpreting PCA results (loadings, scree plots, component scores) or visualizing them (e.g., biplots or component score plots), note that the psych package also supports advanced visualizations such as biplot.psych() for plotting loadings and scores together.
Beginner-Friendly Guide to Linear Discriminant Analysis (LDA) in R on the Iris Dataset
July 13, 2025
1 Introduction
This guide provides a step-by-step explanation of performing Linear Discriminant Analysis (LDA) on the Iris dataset (renamed as lineardata.csv) in R. It covers data loading, assumption testing, exploratory data analysis (EDA), data splitting, model training, visualization, evaluation, and exporting results. The dataset includes four numeric features (SEPAL.LENGTH, SEPAL.WIDTH, PETAL.LENGTH, PETAL.WIDTH) and one categorical variable (SPECIES).
2 Setup & Data Loading
Sets the working directory and loads the dataset.
setwd("C:/Users/oralc/Desktop/Dis Analysis")
getwd()
ddata = read.csv("lineardata.csv", header = TRUE)
attach(ddata)
names(ddata)
Explanation:
• setwd(): Sets the working directory for file operations.
• read.csv(): Loads lineardata.csv with header = TRUE to recognize column
names.
• attach(): Allows direct access to column names (e.g., SPECIES instead of ddata$SPECIES).
Note: Use ddata$colname to avoid potential conflicts.
• names(): Lists column names (expected: SEPAL.LENGTH, SEPAL.WIDTH, PETAL.LENGTH,
PETAL.WIDTH, SPECIES).
3 Multivariate Assumptions Check
3.1 Normality
Tests multivariate normality of the numeric variables.
library(mvnormtest)
y = cbind(SEPAL.LENGTH, SEPAL.WIDTH, PETAL.LENGTH, PETAL.WIDTH)
mshapiro.test(t(y))
Explanation:
• cbind(): Combines the four numeric variables into a matrix y.
• mshapiro.test(): Performs Shapiro-Wilk test for multivariate normality (suitable
for n < 5000).
• t(y): Transposes the matrix since mshapiro.test() expects variables in rows.
3.2 Homogeneity of Variances (Levene's Test)
Tests whether variances are equal across species groups for each feature.
library(car)
# as.factor() avoids the coercion warning when SPECIES is read in as character
leveneTest(SEPAL.LENGTH ~ as.factor(SPECIES))
leveneTest(SEPAL.WIDTH ~ as.factor(SPECIES))
leveneTest(PETAL.LENGTH ~ as.factor(SPECIES))
leveneTest(PETAL.WIDTH ~ as.factor(SPECIES))
Explanation:
• leveneTest(): Checks if variances are equal across SPECIES groups for each vari-
able.
• p > 0.05: Indicates homogeneity of variances (assumption met).
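A compact sketch on the built-in iris data, looping over the four features instead of repeating the call:

library(car)
# Levene's test for each iris feature across species
for (v in names(iris)[1:4]) {
  print(leveneTest(iris[[v]] ~ iris$Species))
}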
3.3 Equality of Covariance Matrices (Box's M Test)
Tests whether covariance matrices are equal across groups.
library(heplots)
boxM(y ~ SPECIES, data = ddata)  # y is the matrix of the four features built above
Explanation:
• boxM(): Evaluates equality of covariance matrices across SPECIES groups.
• p > 0.05: Suggests equal covariance matrices (assumption met).
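The equivalent on built-in iris, building the response matrix inside the formula:

library(heplots)
boxM(cbind(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) ~ Species,
     data = iris)  # p > 0.05 would support equal covariance matrices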
4 Exploratory Data Analysis (EDA): Scatterplots
Visualizes pairwise relationships and correlations.
library(psych)
pairs.panels(ddata[1:4], gap = 0,
             bg = c("red", "yellow", "green")[as.factor(ddata$SPECIES)],
             pch = 21)
Explanation:
• pairs.panels(): Creates pairwise scatterplots, histograms, and correlation coeffi-
cients for the four numeric variables.
• bg: Colors points by SPECIES (red, yellow, green).
• pch = 21: Uses filled circles for scatterplot points.
5 Data Splitting
Splits the dataset into 75% training and 25% testing sets.
5.1 Method 1: Using caTools
install.packages("caTools")
library(caTools)
set.seed(101)
sample = sample.split(ddata[, 1], SplitRatio = 0.75)
train = subset(ddata, sample == TRUE)
test = subset(ddata, sample == FALSE)
5.2 Method 2: Manual Split
split = sample(2, nrow(ddata), replace = TRUE, prob = c(0.75, 0.25))
training = ddata[split == 1, ]
testing = ddata[split == 2, ]
Explanation:
• sample.split(): Randomly splits data based on the first column, ensuring 75%
training and 25% testing.
• sample(): Alternative manual method to assign rows to training (1) or testing (2)
with probabilities 0.75 and 0.25.
• set.seed(101): Ensures reproducibility of the split.
6 Linear Discriminant Analysis (LDA)
Fits the LDA model to the training data.
library(MASS)
linear = lda(SPECIES ~ SEPAL.LENGTH + PETAL.LENGTH + SEPAL.WIDTH + PETAL.WIDTH, data = training)
Explanation:
• lda(): Fits an LDA model to predict SPECIES using the four numeric features.
• Trained on the training dataset for robust evaluation.
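A self-contained sketch on built-in iris; the svd component gives each discriminant's share of between-group variance (the "Proportion of trace" in the printed output):

library(MASS)
fit <- lda(Species ~ ., data = iris)
fit                          # group means and LD1/LD2 coefficients
fit$svd^2 / sum(fit$svd^2)   # proportion of between-group variance per discriminant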
7 Visualizations
7.1 Discriminant Histograms
Visualizes the linear discriminants by species.
p1 = predict(linear, training)
ldahist(p1$x[, 1], g = training$SPECIES)  # first discriminant (LD1)
ldahist(p1$x[, 2], g = training$SPECIES)  # second discriminant (LD2)
Explanation:
• predict(): Generates discriminant scores for the training data.
• ldahist(): Plots histograms of the first (LD1) and second (LD2) linear discrimi-
nants, grouped by SPECIES.
7.2 Bi-Plot (Using ggord)
Creates a bi-plot to visualize class separation.
install.packages("devtools")
library(devtools)  # needed before install_github() can be called
install_github("fawda123/ggord")
library(ggord)
ggord(linear, training$SPECIES, ylim = c(-10, 10))
Explanation:
• ggord(): Generates an LDA bi-plot showing class separation and variable contri-
butions.
• ylim = c(-10, 10): Sets y-axis limits for better visualization.
8 Model Evaluation
8.1 Confusion Matrix: Training
Evaluates model performance on training data.
tab1 = table(Predicted = p1$class, Actual = training$SPECIES)
sum(diag(tab1)) / sum(tab1)  # training accuracy
Explanation:
• table(): Creates a confusion matrix comparing predicted and actual SPECIES.
• sum(diag(tab1)) / sum(tab1): Calculates training accuracy as the proportion of
correct predictions.
8.2 Confusion Matrix: Testing
Evaluates model performance on testing data.
p2 = predict(linear, testing)
tab2 = table(Predicted = p2$class, Actual = testing$SPECIES)
sum(diag(tab2)) / sum(tab2)  # testing accuracy
Explanation:
• predict(): Generates predictions for the testing data.
• table(): Creates a confusion matrix for the testing set.
9 Final Model on Full Dataset
Trains LDA on the entire dataset and exports results.
linearwhole = lda(SPECIES ~ ., data = ddata)
p = predict(linearwhole, ddata)  # use the full-data model (the original mistakenly reused the training-only model)
finallineardata = data.frame(Actual = ddata$SPECIES, Predicted = p$class)
write.csv(finallineardata, file = "finallineardata.csv")
Explanation:
• lda(SPECIES .): Fits LDA on the entire dataset using all features.
• predict(): Generates predictions for the full dataset.
• cbind(): Combines actual and predicted SPECIES into a data frame.
• write.csv(): Exports results to finallineardata.csv.
10 Summary Table of Steps
Step   Task                    R Function/Package
1      Load Data               read.csv(), attach()
2      Assumption Testing      mshapiro.test(), leveneTest(), boxM()
3      EDA                     pairs.panels()
4      Data Split              sample.split(), sample()
5      Model Training          lda() from MASS
6      Prediction              predict()
7      Visualization           ldahist(), ggord()
8      Evaluation              table(), accuracy calculation
9      Export Final Results    data.frame(), write.csv()
Table 1: Summary of LDA Workflow Steps