Exploratory Data Analysis
Module II: Leaves and Trees
Dr. Mark Williamson
DaCCoTA
University of North Dakota
Introduction
• Exploration of datasets to summarize main characteristics
• Last time:
• viewing data
• summary statistics
• basic graphs
• basic tests
• Coming up:
• Rationales and descriptions
• Step-by-step examples and assessments
• Caveats and real-world examples
Rationales
Why should we perform exploratory data analysis?
1. Get to know your data
• Catch issues
• Complex data
• Coding issues
SAS: model Temp = Weight → model Weight = Temp
R: Temp ~ Weight → Weight ~ Temp
2. Save time and effort in the long run
• Future vision
• Demo results
• Reduce mistakes
3. Defendable results
• Rationale
• Summaries
• Assumptions
Descriptions
• Statistical models – mathematical description of how data conceivably can be produced, ex. $y = \beta_0 + \beta_1 x + e$
• Parametric data – fits a normal distribution,
assumed for many statistical tests
• Paired data – two measurements that are not independent (ex. before/after)
• Repeated measures – two or more measurements that are not independent (ex. time intervals)
• Independent variable – does not depend on another variable; causative, predictor, X
• Dependent variable – variable of interest, depends on other variables; response, Y
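To make the idea of a statistical model concrete, the following is a minimal R sketch that produces data from the straight-line model in the first bullet and then estimates the coefficients with lm(). The seed, sample size, intercept, and slope are arbitrary values chosen for illustration; none of them come from this module.

```r
# Simulate data from y = b0 + b1*x + e (all numbers here are illustrative assumptions)
set.seed(42)                       # arbitrary seed so the run is repeatable
n  <- 200                          # hypothetical sample size
b0 <- 2                            # hypothetical intercept
b1 <- 0.5                          # hypothetical slope
x  <- runif(n, min = 0, max = 10)  # independent (predictor) variable, X
e  <- rnorm(n, mean = 0, sd = 1)   # random error term
y  <- b0 + b1 * x + e              # dependent (response) variable, Y

# Fit the same model form and inspect the estimated coefficients
fit <- lm(y ~ x)
est <- coef(fit)
print(est)
```

The estimates should land close to the chosen intercept and slope, which is the sense in which the model describes how the data could have been produced.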
Step-by-step Example 1
• Software used: R
• In the datasets package, I’ll use data set trees
• Contains the diameter, height, and volume for Black Cherry Trees
• Research Question:
• Can we use girth or height to accurately predict volume?
• Useful because measuring volume is difficult; girth and height are much easier to measure
Step-by-step Example 1
1) Look at data
> print(trees)
   Girth Height Volume
1    8.3     70   10.3
2    8.6     65   10.3
3    8.8     63   10.2
4   10.5     72   16.4
5   10.7     81   18.8
6   10.8     83   19.7
7   11.0     66   15.6
8   11.0     75   18.2
9   11.1     80   22.6
10  11.2     75   19.9
11  11.3     79   24.2
12  11.4     76   21.0
13  11.4     76   21.4
14  11.7     69   21.3
15  12.0     75   19.1
16  12.9     74   22.2
17  12.9     85   33.8
18  13.3     86   27.4
19  13.7     71   25.7
20  13.8     64   24.9
21  14.0     78   34.5
22  14.2     80   31.7
23  14.5     74   36.3
24  16.0     72   38.3
25  16.3     77   42.6
26  17.3     81   55.4
27  17.5     82   55.7
28  17.9     80   58.3
29  18.0     80   51.5
30  18.0     80   51.0
31  20.6     87   77.0
• 3 variables
• All numerical
• No missing data
2) Summary stats
> summary(trees)
     Girth           Height       Volume
 Min.   : 8.30   Min.   :63   Min.   :10.20
 1st Qu.:11.05   1st Qu.:72   1st Qu.:19.40
 Median :12.90   Median :76   Median :24.20
 Mean   :13.25   Mean   :76   Mean   :30.17
 3rd Qu.:15.25   3rd Qu.:80   3rd Qu.:37.30
 Max.   :20.60   Max.   :87   Max.   :77.00
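The three observations noted on this slide (3 variables, all numerical, no missing data) can also be checked in code rather than by eye. The sketch below uses only base R functions; the variable names n_rows, n_cols, all_num, and n_missing are introduced here for illustration.

```r
# 'trees' ships with base R in the datasets package
data(trees)

str(trees)                                    # variable names, types, first values
n_rows    <- nrow(trees)                      # number of observations
n_cols    <- ncol(trees)                      # number of variables
all_num   <- all(sapply(trees, is.numeric))   # TRUE if every column is numeric
n_missing <- colSums(is.na(trees))            # missing values per column

print(c(rows = n_rows, cols = n_cols))
print(all_num)
print(n_missing)
```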
Step-by-step Example 1
3) Graphing
> boxplot(trees)
> hist(trees$Girth)
> hist(trees$Height)
> hist(trees$Volume)
> trees$ln_Volume <-log(trees$Volume)
> hist(trees$ln_Volume)
> qqnorm(trees$Volume);qqline(trees$Volume)
> qqnorm(trees$ln_Volume);qqline(trees$ln_Volume)
> plot(trees$ln_Volume~trees$Girth)
> plot(trees$ln_Volume~trees$Height)
> plot(trees$Height~trees$Girth)
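The histograms and QQ plots above are the main tool for judging whether the log transform helps. As a rough numeric companion, the sketch below compares the sample skewness of Volume before and after taking the natural log. The skewness() helper is defined here for illustration (it is not a base R function), and it uses the simple moment-based definition, which is one of several in common use.

```r
data(trees)

# Sample skewness: third central moment over the cubed standard deviation
# (moment-based definition; an illustrative helper, not part of base R)
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^(3 / 2)
}

skew_raw <- skewness(trees$Volume)        # untransformed volume
skew_log <- skewness(log(trees$Volume))   # natural-log volume

print(c(raw = skew_raw, log = skew_log))
```

A value nearer zero indicates a more symmetric distribution, but the plots remain the better guide.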
Step-by-step Example 1
4) Simple Tests
> cor(trees$ln_Volume, trees$Girth)
[1] 0.9693838
> cor(trees$ln_Volume, trees$Height)
[1] 0.6482742
> cor(trees$Girth, trees$Height)
[1] 0.5192801
> lm1<-lm(trees$ln_Volume~trees$Girth)
> summary(lm1)
Call:
lm(formula = trees$ln_Volume ~ trees$Girth)
Residuals:
     Min       1Q   Median       3Q      Max
-0.22719 -0.11468  0.02889  0.07930  0.30436
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.118997   0.104021   10.76 1.23e-11 ***
trees$Girth 0.162566   0.007647   21.26  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1314 on 29 degrees of freedom
Multiple R-squared: 0.9397, Adjusted R-squared: 0.9376
F-statistic: 452 on 1 and 29 DF, p-value: < 2.2e-16
> lm2<-lm(trees$ln_Volume~trees$Height)
> summary(lm2)
Call:
lm(formula = trees$ln_Volume ~ trees$Height)
Residuals:
     Min       1Q   Median       3Q      Max
-0.66691 -0.26539 -0.06555  0.42608  0.58689
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.79652    0.89053  -0.894    0.378
trees$Height  0.05354    0.01168   4.585 8.03e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4076 on 29 degrees of freedom
Multiple R-squared: 0.4203, Adjusted R-squared: 0.4003
F-statistic: 21.02 on 1 and 29 DF, p-value: 8.026e-05
Conclusion: run regression
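The two fits can be compared side by side by pulling R-squared out of each summary object. The sketch below refits the same two models with the data= argument, which is a stylistic alternative to the trees$ prefix used on the slide, not a different method. The object names lm_girth, lm_height, r2, and r2_from_cor are introduced here for illustration.

```r
data(trees)
trees$ln_Volume <- log(trees$Volume)

# Same two models as on the slide, written with data= so term names stay short
lm_girth  <- lm(ln_Volume ~ Girth,  data = trees)
lm_height <- lm(ln_Volume ~ Height, data = trees)

# Collect R-squared from each model summary
r2 <- c(Girth  = summary(lm_girth)$r.squared,
        Height = summary(lm_height)$r.squared)
print(round(r2, 4))

# For a one-predictor regression, R-squared equals the squared correlation
r2_from_cor <- cor(trees$ln_Volume, trees$Girth)^2
print(round(r2_from_cor, 4))
```

This ties the correlation step to the regression step: squaring the correlation reported for Girth gives the Multiple R-squared of the Girth model.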
Assessment 1
1. What variable is the X variable in the following R equation? What variable is the Y?
> scatter(leaf_number ~ branch_number)
Answer: leaf_number is the Y (dependent) variable; branch_number is the X (independent) variable
2. Which variable (Fig, Chestnut, or Oak) has the strongest relationship to Apple?
> cor(Apple, Fig) → 0.56
> cor(Apple, Chestnut) → 0.24
> cor(Apple, Oak) → -0.82
Answer: Oak has the strongest relationship to Apple (largest absolute correlation)
3. Is there a relationship between the two variables in the graphs below? If so, what kind?
[Graphs A, B, and C not reproduced]
Answer: A) Yes, negative relationship; B) Yes, positive relationship; C) No
4. What are two graphs you can use to visualize if data is normally distributed?
Answer: Histogram, qq-plot
5. Is this data normally distributed?
Answer: Yes, looks to be so
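The reasoning behind question 2 can be sketched in a few lines of R: the strength of a correlation is its absolute value, while the sign only gives the direction. The three correlations are copied from the question; the names r and strongest are introduced here for illustration.

```r
# Correlations with Apple, copied from question 2
r <- c(Fig = 0.56, Chestnut = 0.24, Oak = -0.82)

# Strength is the absolute value; the sign only gives the direction
strongest <- names(which.max(abs(r)))
print(strongest)
```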
Step-by-step Example 2
• Software used: SAS
• In the sashelp library, I’ll use data set fish
• Contains the Weight, Length (3 measurements), Height, and Width of 7
species of fish caught in Finland
• Research Question:
• Is there a width difference between the species of fish?
Step-by-step Example 2
1) Look at data
PROC PRINT data=fish;
Obs Species Weight Length1 Length2 Length3  Height  Width
  1 Bream    242.0    23.2    25.4    30.0 11.5200 4.0200
  2 Bream    290.0    24.0    26.3    31.2 12.4800 4.3056
  3 Bream    340.0    23.9    26.5    31.1 12.3778 4.6961
  4 Bream    363.0    26.3    29.0    33.5 12.7300 4.4555
  5 Bream    430.0    26.5    29.0    34.0 12.4440 5.1340
  6 Bream    450.0    26.8    29.7    34.7 13.6024 4.9274
  7 Bream    500.0    26.8    29.7    34.5 14.1795 5.2785
  8 Bream    390.0    27.6    30.0    35.0 12.6700 4.6900
  9 Bream    450.0    27.6    30.0    35.1 14.0049 4.8438
 10 Bream    500.0    28.5    30.7    36.2 14.2266 4.9594
 11 Bream    475.0    28.4    31.0    36.2 14.2628 5.1042
 12 Bream    500.0    28.7    31.0    36.2 14.3714 4.8146
 13 Bream    500.0    29.1    31.5    36.4 13.7592 4.3680
 14 Bream        .    29.5    32.0    37.3 13.9129 5.0728
 15 Bream    600.0    29.4    32.0    37.2 14.9544 5.1708
 16 Bream    600.0    29.4    32.0    37.2 15.4380 5.5800
 17 Bream    700.0    30.4    33.0    38.3 14.8604 5.2854
 18 Bream    700.0    30.4    33.0    38.5 14.9380 5.1975
 19 Bream    610.0    30.9    33.5    38.6 15.6330 5.1338
 20 Bream    650.0    31.0    33.5    38.7 14.4738 5.7276
…
• 7 variables
• Species is categorical nominal, rest are numerical continuous
• Weight has a missing value (observation 14)
• We’ll only use Species and Width (ignore the rest)
2) Summary stats
PROC UNIVARIATE data=fish;
var Width;
Basic Statistical Measures
Location            Variability
Mean   4.417486     Std Deviation       1.68580
Median 4.248500     Variance            2.84193
Mode   3.525000     Range               7.09440
                    Interquartile Range 2.21340
PROC FREQ data=fish;
tables Species;
Species   Frequency Percent Cumulative Frequency Cumulative Percent
Bream            35   22.01                   35              22.01
Parkki           11    6.92                   46              28.93
Perch            56   35.22                  102              64.15
Pike             17   10.69                  119              74.84
Roach            20   12.58                  139              87.42
Smelt            14    8.81                  153              96.23
Whitefish         6    3.77                  159             100.00
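The Percent and Cumulative columns of the PROC FREQ table follow directly from the Frequency column. Since the first example used R, the sketch below rebuilds them in R from the seven counts printed above. It is not a SAS equivalent, and the data frame and column names are introduced here for illustration.

```r
# Species counts copied from the PROC FREQ output above
freq <- c(Bream = 35, Parkki = 11, Perch = 56, Pike = 17,
          Roach = 20, Smelt = 14, Whitefish = 6)

# Percent = count / total; cumulative columns are running sums
freq_table <- data.frame(
  Frequency    = freq,
  Percent      = round(100 * freq / sum(freq), 2),
  CumFrequency = cumsum(freq),
  CumPercent   = round(100 * cumsum(freq) / sum(freq), 2)
)
print(freq_table)
```

The uneven group sizes (6 Whitefish versus 56 Perch) are worth noting before running an ANOVA across species.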
Step-by-step Example 2
3) Graphing
PROC SGPLOT data=fish;
histogram Width;
PROC SGPLOT data=fish;
vbox Width / category=Species;
Step-by-step Example 2
4) Simple Tests
PROC SORT data=fish;
by Species;
PROC UNIVARIATE data=fish normal;
by Species;
var Width;
qqplot / normal (mu=est sigma=est);
histogram / normal;
PROC GLM data=fish;
class Species;
model Width=Species;
means Species / hovtest=levene(type=abs);
Source          DF Sum of Squares Mean Square F Value Pr > F
Model            6    215.9175870  35.9862645   23.47 <.0001
Error          152    233.1080937   1.5336059
Corrected Total 158   449.0256807
Levene's Test for Homogeneity of Width Variance
ANOVA of Absolute Deviations from Group Means
Source   DF Sum of Squares Mean Square F Value Pr > F
Species   6        38.6585      6.4431   17.04 <.0001
Error   152        57.4674      0.3781
Conclusion: run modified ANOVA
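To show where the F Value in the PROC GLM table comes from, the sketch below redoes the arithmetic in R using the sums of squares and degrees of freedom printed above. It only reproduces the table's numbers; it does not refit the model, and the variable names are introduced here for illustration.

```r
# Values copied from the PROC GLM ANOVA table above
ss_model <- 215.9175870; df_model <- 6
ss_error <- 233.1080937; df_error <- 152

ms_model <- ss_model / df_model   # mean square = sum of squares / DF
ms_error <- ss_error / df_error
f_value  <- ms_model / ms_error   # F = MS(model) / MS(error)

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_value, df_model, df_error, lower.tail = FALSE)

print(round(c(MS_model = ms_model, MS_error = ms_error, F = f_value), 4))
print(p_value < 0.0001)
```

The same arithmetic applies to the Levene table, where a small p-value indicates unequal variances across species.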
Assessment 2
1. What variable is the X variable in the following SAS equation? What variable is the Y?
model Length = Species
Answer: Length is the Y (dependent) variable; Species is the X (independent) variable
2. Based on the SAS output, is there equal variance?
Levene's Test for Homogeneity of Length Variance
ANOVA of Absolute Deviations from Group Means
Source DF Sum of Squares Mean Square F Value Pr > F
Make    2        63.5302     31.7651    0.61 0.5479
Error  42         2185.7     52.0408
Answer: Yes, because the p-value (0.5479) is greater than 0.05, so we fail to reject the hypothesis that the variances are equal
3. Based on the boxplot, would you expect the categories of cars to have equal variance? Why or why not?
Answer: No, the quartile lengths are very different.
4. How can the assumption of independent sampling be tested?
Answer: It can’t. Good sampling design ensures the assumption is met.
5. Suppose your data consists of fuel efficiency (miles per gallon) across four different car makes (Ford, Honda, Nissan, and Dodge). How should you test for normality to run an ANOVA (aka, is there a difference in fuel efficiency across makes)?
a) check normality over all makes, b) check normality for each make individually
Answer: b)
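Checking normality for each make individually, as in question 5, could look like the R sketch below. The data are simulated purely for illustration: the make names come from the question, but the mpg values, group sizes, and seed are assumptions. The per-group Shapiro-Wilk p-values are an optional supplement to the QQ plots.

```r
# Hypothetical data: makes come from question 5, mpg values are simulated
set.seed(1)
cars_mpg <- data.frame(
  make = rep(c("Ford", "Honda", "Nissan", "Dodge"), each = 25),
  mpg  = c(rnorm(25, 24, 3), rnorm(25, 32, 3),
           rnorm(25, 29, 3), rnorm(25, 21, 3))
)

# Split the response by make so each group is examined on its own
by_make <- split(cars_mpg$mpg, cars_mpg$make)

# One normal QQ plot per make
for (m in names(by_make)) {
  qqnorm(by_make[[m]], main = paste("QQ plot:", m))
  qqline(by_make[[m]])
}

# Optional numeric supplement: Shapiro-Wilk p-value per make
shapiro_p <- sapply(by_make, function(x) shapiro.test(x)$p.value)
print(round(shapiro_p, 3))
```

Pooling all makes first would mix groups with different means, which can make normally distributed groups look non-normal overall.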
Caveats and Concerns
• Normality tests are an art
• Suggest using histograms and qq-plots over tests for normality
• There is more than one way of doing things
• Code output can be confusing
• Data can be problematic by nature and design
• Uneven sample sizes
• Unequal variances
Real World Examples
Zuur, A. F., et al. (2016). "A protocol for conducting and presenting results of regression-type
analyses." Methods in Ecology and Evolution 7(6): 636-645.
Real World Examples
Ahmed, R., et al. (2020). "United States County-level COVID-19 Death Rates and Case Fatality Rates
Vary by Region and Urban Status." Healthcare (Basel) 8(3).
Real World Examples
Schwartz, G. G., et al. (2019). "An exploration of colorectal cancer incidence rates in North Dakota,
USA, via structural equation modeling." International Journal of Colorectal Disease 34(9): 1571-1576.
Summary and Conclusion
• Exploratory Data Analysis is a necessary first step in understanding
your data and determining how to analyze it
• Helps to:
• Get to know your data
• Save time and effort in the long run
• End with defendable results
• Many ways to get it done (R, SAS, SPSS, Excel, etc.)
• Tune in next time for a plunge into advanced topics of Exploratory
Data Analysis in Module III: Deep Dive