Exploratory Data Analysis
This is the first step in analysing data from an experiment.
Here, we carry out a descriptive statistical analysis of the data.
Some of the main reasons why we do EDA are:
To detect mistakes.
To determine the relationship between variables.
To determine the main characteristics or features of the data.
NB: Before we start EDA in R, we will look at some important concepts that will help us in
handling this unit.
Things to Know Before Start Learning R
Why use R
• R is an open source programming language and software environment for
statistical computing and graphics.
• R is an object oriented programming environment, much more than most other
statistical software packages.
• R is a comprehensive statistical platform, offering all manner of data-analytic
techniques; any type of data analysis can be done in R.
• R has state-of-the-art graphics capabilities for visualizing complex data.
• R is a powerful platform for interactive data analysis and exploration.
• R can import data from multiple sources and get it into a usable form.
• R functionality can be integrated into applications written in other languages,
including C++, Java, Python, PHP, SAS and SPSS.
• R runs on a wide array of platforms, including Windows, Unix and Mac OS X.
• R is extensible; it can be expanded by installing “packages”.
Applications of R Programming in Real World
i. Statistical computing
ii. Data Science
iii. Machine Learning
Downloading and Installing R
R Basics Sessions
R and R-Studio
R has a graphical user interface (GUI). RStudio is an Integrated Development Environment
(IDE) that provides features to make using and managing R much easier.
We will look at the R window and the RStudio window with simple examples.
1. Getting help in R
To get help on specific topics, we can use the help() function along with the topic we want to
search. We can also use the ? operator for this. Example:
help(Syntax)
?Syntax
2. Operations in R.
R uses the following operators:
1. +, -, *, /, %%, ^ - Arithmetic Operators
2. >, >=, <, <=, ==, != - Relational Operators
3. !, &, | - Logical Operators
4. ~ - Model Formulae
5. <-, = - Assignment Operators
6. : - Creating a Sequence
a. Arithmetic Operators
Operator Description
+ addition
- subtraction
* multiplication
/ division
^ or ** exponentiation
%% modulus (remainder after division)
b. Relational and Logical Operators include:
Operator Description
< less than
<= less than or equal to
> greater than
>= greater than or equal to
== exactly equal to
!= not equal to
!x not x
x | y x OR y
x & y x AND y
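The operators listed above can be tried directly at the console. The following is a minimal sketch; the variable names a, b and s are only illustrative.

```r
# Arithmetic operators
7 + 2        # addition
7 %% 2       # remainder after division (modulus)
2 ^ 3        # exponentiation

# Relational operators return TRUE or FALSE
a <- 7
b <- 2
a > b
a == b
a != b

# Logical operators combine or negate TRUE/FALSE values
(a > 5) & (b > 5)    # TRUE only if both conditions hold
(a > 5) | (b > 5)    # TRUE if at least one condition holds
!(a > b)             # negation

# The colon operator creates a sequence of integers
s <- 1:5
s
```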
EXPLORATORY DATA ANALYSIS
We will now start looking at exploratory data analysis in R.
1. Measures of Location.
i) Measures of Central Tendency
Measures that indicate the approximate center of a distribution are called
measures of central tendency.
Central tendency describes how the data cluster around the central value of the
distribution.
Here we will look at the:
Arithmetic Mean
Geometric Mean
Harmonic Mean
Median
Mode
Arithmetic Mean
The arithmetic mean, often simply called the average, represents the central value of the
data distribution. It is calculated by adding all the values and then dividing by the
total number of observations.
Formula
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
where $n$ is the number of observations and $x_i$ is the $i$-th observation.
In R, the arithmetic mean can be calculated with the mean() function.
Example:
# Define the vector
x <- c(3, 7, 5, 13, 20, 23, 39, 23, 40, 23, 14, 12, 56, 23)
# Print the mean (R is case-sensitive: the function is mean(), not Mean())
mean(x)
Or
# Store the mean in y
y = mean(x)
or
print(mean(x))
Press Ctrl + R (Ctrl + Enter in RStudio) to get the output, or click Run at the upper corner of the script window.
Output:
[1] 21.5
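NB: You can also calculate the mean using:
# Sum of the values divided by the number of values
W = sum(x)
MeanX = sum(x)/14
Or
# Use length() so that the count is not hard-coded
n = length(x)
XMean = sum(x)/n
Or
Xmean = W/n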
Given a Large Data set:
Let’s begin by looking at a simple example with a dataset that comes pre-loaded in R,
called cars (Ezekiel, 1930). These data give the speed of cars and the
distances taken to stop.
If we were to compute the mean for cars$speed (the variable speed in our dataset called cars)
we would simply sum the values in the column for speed and divide by 50.
(4 + 4 + 7 + 7 + 8 + 9 + 10 + 10 + 10 + 11 + 11 + 12 + 12 + 12 + 12 + 13 + 13 + 13 + 13 + 14
+ 14 + 14 + 14 + 15 + 15 + 15 + 16 + 16 + 17 + 17 + 17 + 18 + 18 + 18 + 18 + 19 + 19 + 19
+ 20 + 20 + 20 + 20 + 20 + 22 + 23 + 24 + 24 + 24 + 24 + 25) / 50
Or quite simply: 770/50 = 15.4.
Adding up the values by hand is tedious and error-prone for a dataset of this size. To work
with a large data set that is pre-loaded in R, we first view the data:
View(cars)
or
cars
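Besides viewing the whole table, it often helps to look at only the first rows and the structure of the data. The sketch below uses base R functions only; n_rows is an illustrative name.

```r
# cars ships with base R (datasets package), so nothing needs to be imported
head(cars)             # first six rows
str(cars)              # column names and types
n_rows <- nrow(cars)   # number of observations (rows)
n_rows
```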
In R, we can compute the mean in several ways:
sumofspeed <- sum(cars$speed)
sumofspeed / 50
## [1] 15.4
or
sum(cars$speed) / length(cars$speed)
## [1] 15.4
or simply using the mean( ) function
mean(cars$speed)
## [1] 15.4
NB:
Computing the mean for the cars data worked out nicely because there were no missing
values (NAs). If there were NAs, mean( ) would return NA unless we omit them from the
calculation with the na.rm argument. For example,
mean(cars$speed, na.rm=TRUE)
## [1] 15.4
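Because cars has no missing values, the effect of na.rm is easier to see on a copy with a missing value introduced on purpose. In this sketch, speed_na is a hypothetical vector created only for illustration.

```r
# speed_na is a hypothetical copy of cars$speed with one value set to missing
speed_na <- cars$speed
speed_na[3] <- NA

mean(speed_na)                            # NA, because one value is missing
m_clean <- mean(speed_na, na.rm = TRUE)   # drops the NA before averaging
m_clean
sum(is.na(speed_na))                      # number of missing values
```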
While the mean is not a new concept to you, there is some notation that is important for you
to understand.
n: the sample size, that is, the number of observations (rows) that we are
averaging. In the above example n = 50.
x: the sample elements (the observed values).
EXERCISE
1. Geometric Mean
2. Harmonic Mean
The median
The median is another measure of central tendency. It is the middle value in a set of
observations arranged in order. For cars$speed, we can sort the values in ascending order
using the sort( ) function. This can help us identify the median.
Example 1
# Defining vector
x <- c(3, 7, 5, 13, 20, 23, 39, 23, 40, 23, 14, 12, 56, 23)
# Print Median
median(x)
Output:
[1] 21.5
Example 2
sort(cars$speed)
## [1] 4 4 7 7 8 9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14 15 15
## [26] 15 16 16 17 17 17 18 18 18 18 19 19 19 20 20 20 20 20 22 23 24 24 24 24 25
With 50 observations there is no single middle value, so we look at positions 25 and 26.
Both values are 15, so the median is 15. If the value at position 25 were 14 and the value at
position 26 were 15, we would take the average of the two values and the median would be 14.5.
An easier way to compute the median is to use the median( ) function:
median(cars$speed)
## [1] 15
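The rule described above (take the middle value, or average the two middle values when the count is even) can be written out by hand. This is only a sketch to show the logic; in practice median( ) is the function to use, and the names s, n and med are illustrative.

```r
s <- sort(cars$speed)   # values in ascending order
n <- length(s)          # number of observations

if (n %% 2 == 0) {
  # even count: average the two middle values
  med <- (s[n / 2] + s[n / 2 + 1]) / 2
} else {
  # odd count: take the single middle value
  med <- s[(n + 1) / 2]
}
med
```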
The Mode
The mode of a set of observations is the value that occurs most frequently. There is no
standard function in R that computes the statistical mode (the built-in mode( ) function
returns the storage type of an object instead). However, you can create a simple frequency
table to tally the number of times each value occurs.
Example 1: Single-mode value
We build a frequency table with table( ) and pick the value with the highest count.
# Defining vector
x <- c(3, 7, 5, 13, 20, 23, 39, 23, 40, 23, 14, 12, 56, 23, 29, 56, 37, 45, 1, 25, 8)
# Generate frequency table
y <- table(x)
# Print frequency table
print(y)
# Mode of x
m <- names(y)[which(y == max(y))]
# Print mode
print(m)
Output:
1 3 5 7 8 12 13 14 20 23 25 29 37 39 40 45 56
1 1 1 1 1 1 1 1 1 4 1 1 1 1 1 1 2
[1] "23"
Example 2: Multiple Mode values
# Defining vector
x <- c(3, 7, 5, 13, 20, 23, 39, 23, 40, 23, 14, 12, 56, 23, 29, 56, 37, 45, 1, 25, 8, 56, 56)
# Generate frequency table
y <- table(x)
# Print frequency table
print(y)
# Mode of x
m <- names(y)[which(y == max(y))]
# Print mode
print(m)
Output:
1 3 5 7 8 12 13 14 20 23 25 29 37 39 40 45 56
1 1 1 1 1 1 1 1 1 4 1 1 1 1 1 1 4
[1] "23" "56"
The same approach works for cars$speed:
table(cars$speed)
##
## 4 7 8 9 10 11 12 13 14 15 16 17 18 19 20 22 23 24 25
## 2 2 1 1 3 2 4 4 4 3 2 3 4 3 5 1 1 4 1
Here we see that the value 20 occurs 5 times, more often than any other value, so 20 is the mode.
You can also compute the mode using the following code:
modeforcars <- table(as.vector(cars$speed))
names(modeforcars)[modeforcars == max(modeforcars)]
## [1] "20"
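The frequency-table steps above can be wrapped in a small reusable function. This is a sketch under the assumption that the input is a numeric vector; stat_mode is a hypothetical helper name, not a built-in R function.

```r
# stat_mode is a hypothetical helper, not part of base R
stat_mode <- function(v) {
  counts <- table(v)                              # frequency of each distinct value
  modes <- names(counts)[counts == max(counts)]   # every value with the top count
  as.numeric(modes)                               # convert the labels back to numbers
}

mode_speed <- stat_mode(cars$speed)         # single mode
mode_multi <- stat_mode(c(1, 2, 2, 3, 3))   # two modes are both returned
mode_speed
mode_multi
```

Returning all tied values, rather than only the first, keeps the behaviour consistent with the multiple-mode example.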
Exercise 2
1. Find the Mean, Median and Mode using mtcars dataset pre-loaded in R.
ii) Measures of Relative Positioning
The commonly used quantiles are quartiles, deciles and percentiles.
These three divide a sorted data set into four, ten and one hundred equal parts, respectively.
a) Quartiles
There are several quartiles of an observation variable. The first quartile, or lower
quartile, is the value that cuts off the first 25% of the data when it is sorted in
ascending order. The second quartile, or median, is the value that cuts off the first
50%. The third quartile, or upper quartile, is the value that cuts off the first 75%.
Example
Find the quartiles of the eruption durations in the data set faithful.
Solution
We apply the quantile function to compute the quartiles of eruptions.
duration = faithful$eruptions # the eruption durations
quantile(duration) # apply the quantile function
0% 25% 50% 75% 100%
1.6000 2.1627 4.0000 4.4543 5.1000
Answer
The first, second and third quartiles of the eruption duration are 2.1627, 4.0000 and
4.4543 minutes respectively.
b) Deciles
In statistics, deciles are numbers that split a dataset into ten groups of equal
frequency. The first decile is the point where 10% of all data values lie below it. The
second decile is the point where 20% of all data values lie below it, and so on. We can
use the following syntax to calculate the deciles for a dataset in R:
quantile(data, probs = seq(.1, .9, by = .1))
Example
Calculate Deciles in R
The following code shows how to create a dataset with 20 values and then calculate
the values for the deciles of the dataset:
#create dataset
data <- c(56, 58, 64, 67, 68, 73, 78, 83, 84, 88, 89, 90, 91, 92, 93, 93, 94, 95, 97, 99)
#calculate deciles of dataset
quantile(data, probs = seq(.1, .9, by = .1))
Output
10% 20% 30% 40% 50% 60% 70% 80% 90%
63.4 67.8 76.5 83.6 88.5 90.4 92.3 93.2 95.2
The way to interpret the deciles is as follows:
10% of all data values lie below 63.4.
20% of all data values lie below 67.8.
30% of all data values lie below 76.5.
40% of all data values lie below 83.6.
50% of all data values lie below 88.5.
60% of all data values lie below 90.4.
70% of all data values lie below 92.3.
80% of all data values lie below 93.2.
90% of all data values lie below 95.2.
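The interpretation above can be checked by counting what share of the values falls below each decile. This is a sketch for this particular dataset; with other sample sizes or many tied values the shares will usually be only approximately 10%, 20%, and so on. The names deciles and prop_below are illustrative.

```r
data <- c(56, 58, 64, 67, 68, 73, 78, 83, 84, 88,
          89, 90, 91, 92, 93, 93, 94, 95, 97, 99)
deciles <- quantile(data, probs = seq(.1, .9, by = .1))

# Proportion of observations strictly below each decile
prop_below <- sapply(deciles, function(q) mean(data < q))
prop_below
```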
c) Percentiles
The nth percentile of a dataset is the value that cuts off the first n percent of the data values
when all of the values are sorted from least to greatest.
For example, the 90th percentile of a dataset is the value that cuts off the bottom 90% of the
data values from the top 10% of data values.
One of the most commonly used percentiles is the 50th percentile, which represents the
median value of a dataset: this is the value below which 50% of all data values fall.
Percentiles can be used to answer questions such as:
What score does a student need to earn on a particular test to be in the top 10% of
scores? To answer this, we would find the 90th percentile of all scores, which is the
value that separates the bottom 90% of values from the top 10%.
What heights encompass the middle 50% of heights for students at a particular
school? To answer this, we would find the 75th percentile of heights and 25th
percentile of heights, which are the two values that determine the upper and lower
bounds for the middle 50% of heights.
To Calculate Percentiles in R
We can easily calculate percentiles in R using the quantile() function, which uses the
following syntax:
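quantile(x, probs = seq(0, 1, 0.25))
Where:
x: a numeric vector
probs: the probabilities (between 0 and 1) of the percentiles to compute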
Example
Find the 32nd, 57th and 98th percentiles of the eruption durations in the data set faithful.
Solution
We apply the quantile function to compute the percentiles of eruptions with the desired
percentage ratios.
duration = faithful$eruptions # the eruption durations
quantile(duration, c(.32, .57, .98))
32% 57% 98%
2.3952 4.1330 4.9330
Answer
The 32nd, 57th and 98th percentiles of the eruption duration are 2.3952, 4.1330 and 4.9330
minutes respectively.
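The two kinds of questions mentioned earlier (the cut-off for the top 10%, and the bounds of the middle 50%) can be answered the same way. The sketch below uses cars$speed instead of test scores or heights, purely as an assumption for illustration; middle and p90 are illustrative names.

```r
# Middle 50% of speeds: bounded by the 25th and 75th percentiles
middle <- quantile(cars$speed, probs = c(.25, .75))
middle

# Speed needed to be in the top 10%: the 90th percentile
p90 <- quantile(cars$speed, probs = .90)
p90
```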
Exercise
1. Find the 17th, 43rd, 67th and 85th percentiles of the eruption waiting periods in
faithful.
2.
2. Measures of Spread/Dispersion
Spread is the degree of scatter or variation of the variable about the central value.
Examples of these measures include:
i) The range
ii) Inter-Quartile range
iii) Quartile Deviation also called semi Inter-Quartile range
iv) Mean Absolute Deviation
v) Variance
vi) Standard deviation
In addition to computing measures of central tendency, another summary statistic we’d like to
compute is variability. How spread out are the data? How far from the mean and median do
the observed values tend to be?
Range
The range of a variable is the largest value minus the smallest value. We can compute the
largest value using the max( ) function and the smallest value using the min( ) function. In the
case with cars$speed, the range is 25 – 4 or 21.
min(cars$speed)
## [1] 4
max(cars$speed)
## [1] 25
R has an even better function, range( ) that outputs the minimum and maximum value in a
vector
range(cars$speed)
## [1] 4 25
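Since range( ) returns the minimum and maximum rather than their difference, one more step gives the range as a single number. A small sketch, with r and spread as illustrative names:

```r
r <- range(cars$speed)   # c(minimum, maximum)
spread <- diff(r)        # maximum minus minimum
spread
```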
Interquartile range
The interquartile range is similar to the range, but instead of calculating the difference
between the biggest and smallest value, you calculate the difference between the 75th
percentile (third quartile) and the 25th percentile (first quartile).
We can calculate the interquartile range (IQR) using IQR( ). This is the range spanned by the
middle half of the data.
IQR(cars$speed)
## [1] 7
We can see all quantiles by typing the following:
quantile(cars$speed)
## 0% 25% 50% 75% 100%
## 4 12 15 19 25
Or just to see the 25% and 75% we can type:
quantile(cars$speed, probs=c(.25, .75))
## 25% 75%
## 12 19
Therefore, you can see the IQR is simply 19 – 12.
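The quartile deviation and the mean absolute deviation from the list of measures have no dedicated base R function, but both follow directly from their definitions. The sketch below assumes the usual definitions (half the IQR, and the average absolute distance from the mean); quartile_dev and mean_abs_dev are illustrative names.

```r
x <- cars$speed

# Quartile deviation (semi-interquartile range): half of the IQR
quartile_dev <- IQR(x) / 2
quartile_dev

# Mean absolute deviation: average absolute distance from the mean
mean_abs_dev <- mean(abs(x - mean(x)))
mean_abs_dev
```

Note that R's built-in mad( ) computes the median absolute deviation (scaled by a constant), which is a different statistic.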
Variance
The variance is a numerical measure of how far the data values are spread out around the mean.
A variance of zero indicates that all the values are identical; any other variance is
positive. A small variance indicates that the data points tend to be very close to the mean.
A high variance indicates that the data points are very spread out from the mean and from
each other.
The variance of a dataset X is sometimes written as Var(X), but for a given sample it is more
commonly denoted $s^2$. The formula for the sample variance is:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$
where $\bar{x}$ is the sample mean and $n$ is the sample size.
To compute the sample variance in R we would type the following:
var(cars$speed)
## [1] 27.95918
Standard deviation
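The square root of the variance is the standard deviation. Below is the formula for the sample
standard deviation, with $\bar{x}$ the sample mean and $n$ the sample size.
$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$$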
To compute the sample standard deviation in R, type the following:
sqrt(var(cars$speed))
## [1] 5.287644
or you can use the sd( ) function
sd(cars$speed)
## [1] 5.287644
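To see what var( ) and sd( ) do internally, the sample variance can be computed step by step from its definition (squared deviations from the mean, divided by n - 1). This is a sketch for checking understanding; s2 and s are illustrative names.

```r
x <- cars$speed
n <- length(x)

s2 <- sum((x - mean(x))^2) / (n - 1)   # sample variance from the definition
s <- sqrt(s2)                          # sample standard deviation

all.equal(s2, var(x))   # agrees with var()
all.equal(s, sd(x))     # agrees with sd()
```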
Measures of Skew and kurtosis
Skew and kurtosis are two more descriptive statistics that you may encounter.
Skew
Skewness is a measure of asymmetry. If there are more extremely large values than extremely
small ones, the data can be described as positively skewed. If the data tend to have a lot of
extremely small values and not many extremely large values, then the data are considered
negatively skewed. As a rule of thumb, negative skewness indicates that the mean of the data
values is less than the median, and the data distribution is left-skewed. Positive skewness
indicates that the mean of the data values is larger than the median, and the data
distribution is right-skewed. See the figure below for an illustration.
Figure: From left to right: Positive skew, no skew, and negative skew
We can compute the skew by using a function called skew( ) from the psych package (install it once with install.packages("psych") if it is not yet installed).
library(psych)
skew(cars$speed)
## [1] -0.1105533
Kurtosis
Kurtosis is, intuitively, a measure of the peakedness (pointiness) of the data distribution.
It also reflects how fat or thin the tails of a distribution are relative to a normal
distribution. Negative kurtosis indicates a flat data distribution, which is said to be
platykurtic. Positive kurtosis indicates a peaked distribution, which is said to be
leptokurtic. Incidentally, the normal distribution has zero excess kurtosis, and is said to
be mesokurtic. See the figure for an illustration.
We can compute the kurtosis by using a function called kurtosi( ) from the psych package.
kurtosi(cars$speed)
## [1] -0.6730924
Where do you think cars$speed falls? Let’s plot it. See the figure below.
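One way to produce such a plot is a histogram with the mean and median marked. This is a sketch using base R graphics; the title, axis label and line styles are illustrative choices, not the settings used for the original figure.

```r
# Histogram of speeds; the returned object h holds the bin counts
h <- hist(cars$speed,
          main = "Distribution of car speeds",
          xlab = "Speed (mph)")

abline(v = mean(cars$speed), lty = 2)     # dashed line at the mean
abline(v = median(cars$speed), lty = 3)   # dotted line at the median
```

If the two lines sit close together and the bars look roughly symmetric, the skew is small.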
DESCRIBE AND SUMMARY FUNCTIONS.
There’s an easier way to compute some measures of central tendency and variability using
the summary( ) function. The summary function provides the min( ), max( ), median( ),
mean( ), the 75% and 25% quantiles. To compute all these measures for a single variable
type:
summary(cars$speed)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.0 12.0 15.0 15.4 19.0 25.0
To summarize a data frame, type:
summary(cars)
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
Describing a data frame
A similar function to the summary( ) function is the describe( ) function in the psych
package. This function is useful when your data are interval or ratio scale. Unlike the
summary( ) function, it calculates the same descriptive statistics for any type of variable you
give it. It also includes other measures such as the trimmed mean (default is 10%), skew,
kurtosis, and range, some of which we discussed earlier. n is the sample size (or the number
of non-missing values).
describe(cars)
There are more advanced functions to compute descriptive statistics by group using the psych
package. One such function is describeBy( ). You can specify a grouping variable. Let’s say
we wanted to obtain descriptive statistics separately for each grouping of data. For example,
we could group our data by the different speeds in cars. We could use speed as our grouping
variable as follows:
describeBy(cars, group=cars$speed)
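Grouped summaries can also be obtained without the psych package. The sketch below swaps in base R's tapply( ) and aggregate( ) and uses the pre-loaded mtcars dataset, grouped by number of cylinders, as an assumed example because it has a natural grouping variable.

```r
# Mean and standard deviation of mpg for each number of cylinders
group_means <- tapply(mtcars$mpg, mtcars$cyl, mean)
group_sds <- tapply(mtcars$mpg, mtcars$cyl, sd)
group_means
group_sds

# The same idea with a formula interface: median mpg by cylinders
aggregate(mpg ~ cyl, data = mtcars, FUN = median)
```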
Bivariate Data
So far we have confined our discussion to distributions involving only one variable.
Sometimes, in practical applications, we come across a set of data where each
item of the set comprises the values of two or more variables.
Bivariate data is a set of paired measurements of the form:
$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$.
For bivariate data we will look at:
1. Scatter Diagrams.
2. Correlation.
3. Regression.
1. Scatter Diagrams.
A scatter diagram is a tool for analysing relationships between two variables. One variable is
plotted on the horizontal axis and the other is plotted on the vertical axis. The pattern of the
plotted points can graphically show relationship patterns. Most often a scatter diagram is
used to explore a suspected cause-and-effect relationship; it can suggest such a relationship
but cannot prove it on its own.
There are many ways to create a scatterplot in R.
i) The basic function is plot(x, y), where x and y are numeric vectors denoting the
(x,y) points to plot.
# Simple Scatterplot
x<-c(1,2,3,4,5,6,7)
y<-c(2,4,6,8,10,12,14)
plot(x,y)
Plot 2
# mtcars is pre-loaded in R; refer to its columns directly instead of using attach()
plot(mtcars$wt, mtcars$mpg, main="Scatterplot Example",
xlab="Car Weight", ylab="Miles Per Gallon", pch=19)
Scatter diagrams will generally show one of six possible correlations between the
variables:
i) Strong Positive Correlation The value of Y clearly increases as the value of X
increases.
ii) Strong Negative Correlation The value of Y clearly decreases as the value of X
increases.
iii) Weak Positive Correlation The value of Y increases slightly as the value of X
increases.
iv) Weak Negative Correlation The value of Y decreases slightly as the value of X
increases.
v) Complex Correlation The value of Y seems to be related to the value of X, but the
relationship is not easily determined.
vi) No Correlation There is no demonstrated connection between the two variables.
2. Correlation
Correlation is a statistical method used to measure the relationship between two quantitative
variables.
The correlation coefficient (r) measures the strength and direction of the (linear) relationship
between the two quantitative variables. r can range from +1 (perfect positive correlation) to
-1 (perfect negative correlation).
Positive values of r indicate a positive relationship and negative values indicate a negative
relationship. The higher the absolute value of r, the stronger the correlation. If the value of
r is 0, it indicates that there is no linear relationship between the two variables.
Interpretation of correlation coefficient (r)
The table below suggests how to interpret r at different absolute values. These cut-offs
are arbitrary and should be used judiciously while interpreting the dataset.
Note: In interpretation, correlation can be positive or negative based on the sign of r
Types of correlation coefficients (r)
There are three main types of correlation coefficients:
i) Pearson’s product-moment correlation coefficient.
ii) Spearman’s rank-order (Spearman’s rho) correlation coefficient.
iii) Kendall’s Tau correlation coefficient.
Note: a) Unless otherwise specified, “correlation coefficient” usually refers to Pearson’s r.
b) The appropriate usage of different types of correlation coefficients largely depends
on underlying data types, sample size, linear or non-linear relationships between the
two variables, and their distributions.
i) Pearson’s product-moment correlation coefficient.
Pearson correlation (r) measures a linear dependence between two variables (x and y). It is
also known as a parametric correlation test because it depends on the distribution of the data.
It should be used only when x and y are from a normal distribution.
$$r = \frac{\sum (x - m_x)(y - m_y)}{\sqrt{\sum (x - m_x)^2 \sum (y - m_y)^2}}$$
where $m_x$ and $m_y$ are the means of the x and y variables.
The correlation coefficient can be computed in R using the functions cor() or cor.test():
cor() computes the correlation coefficient.
cor.test() tests for association/correlation between paired samples. It returns both the
correlation coefficient and the significance level (or p-value) of the correlation.
The simplified formats are:
cor(x, y, method = c("pearson", "kendall", "spearman"))
cor.test(x, y, method = c("pearson", "kendall", "spearman"))
Where:
x, y: numeric vectors with the same length
method: correlation method; the default is "pearson"
Example 1
# correlation of vectors in R
x <- c(0,1,1,2,3,5,8,13,21,34)
y <- log(x+1)
cor(x,y)
Example 2
x <- c(0,1,1,2,3,5,8,13,21,34)
y <- log(x+1)
cor(x,y,method="pearson")
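The sketch below extends the example: it computes Pearson's r by hand from the sum-of-products definition, compares it with Spearman's rank correlation, and runs cor.test( ) for a p-value. The names mx, my, r_manual, r_pearson, r_spearman and res are illustrative.

```r
x <- c(0, 1, 1, 2, 3, 5, 8, 13, 21, 34)
y <- log(x + 1)

# Pearson's r from its definition
mx <- mean(x)
my <- mean(y)
r_manual <- sum((x - mx) * (y - my)) /
  sqrt(sum((x - mx)^2) * sum((y - my)^2))

r_pearson <- cor(x, y)                         # same value as r_manual
r_spearman <- cor(x, y, method = "spearman")   # based on ranks only

# Significance test for the Pearson correlation
res <- cor.test(x, y)
res$estimate   # the correlation coefficient
res$p.value    # the p-value of the test
```

Because y always increases when x increases, the ranks agree perfectly and Spearman's coefficient equals 1, while Pearson's r is below 1 since the relationship is curved rather than a straight line.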
3. Regression
Regression analysis, in general sense, means the estimation or prediction of the unknown
value of one variable from the known value of the other variable.
Regression analysis can be thought of as being sort of like the flip side of correlation.
It has to do with finding the equation for the kind of straight lines you were just looking at.
Suppose we have a sample of size n and it has two sets of measures, denoted by x and y. We
can predict the values of y given the values of x by using the equation
$$y = a + bx$$
Not every problem can be solved with the same algorithm. In this case, linear regression
assumes that there exists a linear relationship between the response variable and the
explanatory variables. This means that you can fit a line between the two (or more variables).
In this particular example, you can calculate the height of a child if you know her age:
$$\text{height} = a + b \times \text{age}$$
In this case, “a” and “b” are called the intercept and the slope respectively. In the same
example, “a”, the intercept, is the value from which you start measuring. Newborn babies
(zero months old) are not necessarily zero centimeters long; this is the role of the intercept.
The slope measures the change of height with respect to the age in months. In general, for
every month older the child is, his or her height will increase by “b”.
Linear Regression in R
A linear regression can be calculated in R with the command lm().
Dependent Variable (Target) : Continuous
Independent Variable (Predictor(s)): Continuous/Discrete
Y = mX + c , where
m = slope of straight line
c = Y-intercept
R-Codes to load Data
require("datasets")
data("iris")
str(iris)
head(iris)
Linear Models
Since simple linear regression requires just one target and one predictor, let’s take the
“Sepal.Width” attribute as our target (Y) and the “Sepal.Length” attribute as the
predictor (X) to find out whether any relationship exists between them.
Example 1
y<-c(1,2,3,4,5,6,7)
x<-c(2,5,6,8,9,10,18)
# Regress the response y on the predictor x
M<-lm(y~x)
summary(M)
Example 2
Y<- iris[,"Sepal.Width"] # select Target attribute
X<- iris[,"Sepal.Length"] # select Predictor attribute
head(X)
# Fit the linear model
model1<- lm(Y~X)
model1 # provides regression line coefficients i.e. slope and y-intercept
The fitted line is:
Y = 3.41895 - 0.06188X
Interpretation
When X = 0, the predicted value of Y is 3.41895 (the intercept).
For each unit increase in X, Y decreases by 0.06188 on average.
Example 3
# Keep the fitted model object so that summary() reports R-squared
Model2<-lm(Petal.Width ~ Petal.Length, data=iris)
summary(Model2)
The results can be interpreted as follows:
lm(Petal.Width ~ Petal.Length)
Petal.Width = -0.363076 + 0.415755 Petal.Length
R-Squared: 0.9271*100 = 92.71%, implying that 92.71% of the variability of Y has been
explained by X, leaving 7.29% unexplained.
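The numbers quoted in the interpretation can also be pulled out of the fitted model programmatically, and the model can be used for prediction. This is a sketch; fit, coefs, r2, new_flower and pred are illustrative names, and the petal length of 4 is a hypothetical new observation.

```r
fit <- lm(Petal.Width ~ Petal.Length, data = iris)

coefs <- coef(fit)              # intercept and slope
r2 <- summary(fit)$r.squared    # proportion of variability explained
coefs
r2

# Predict the petal width of a hypothetical flower with petal length 4
new_flower <- data.frame(Petal.Length = 4)
pred <- predict(fit, newdata = new_flower)
pred

# Scatter plot with the fitted regression line
plot(iris$Petal.Length, iris$Petal.Width,
     xlab = "Petal Length", ylab = "Petal Width")
abline(fit)
```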
EXPLORATORY DATA ANALYSIS
PLOTS/GRAPHICS