Basic description and visualisation of multivariate data in R
Categorical variates
Contingency tables
Load the Titanic data set from a csv file (note that dots were used to represent missing data).
Titanic <- [Link]("[Link]",[Link]=".")
#
# Define Sex as factor (0 = Female, 1 = Male)
#
Titanic$Sex <- factor(Titanic$Sex,levels=c(0,1),labels=c("Female","Male"))
Joint distribution table (contingency table, cross tab, 2-way table): absolute frequencies.
table(Titanic$Sex,Titanic$Survived)
No Yes
Female 154 308
Male 708 142
Joint distribution table (contingency table): relative frequencies.
table(Titanic$Sex,Titanic$Survived)/nrow(Titanic)
No Yes
Female 0.1172887 0.2345773
Male 0.5392232 0.1081493
See also ftable function.
Marginal distributions
Given two variates x, which takes values x1 , x2 , . . . , xn , and y, which takes values y1 , y2 , . . . , yn , the marginal
frequency distributions can be obtained as:
n
X n
X
f r(xi ) = f r(xi , yj ) f r(yj ) = f r(xi , yj )
j=1 i=1
Option 1: using table separately for each variate.
# Relative frequencies
table(Titanic$Survived)/length(Titanic$Survived)
No Yes
0.6565118 0.3427266
table(Titanic$Sex)/length(Titanic$Sex)
Female Male
0.3518660 0.6473724
1
Option 2: using addmargin on a contingency table.
tableTitanic <- table(Titanic$Sex,Titanic$Survived)/nrow(Titanic)
addmargins(tableTitanic)
No Yes Sum
Female 0.1172887 0.2345773 0.3518660
Male 0.5392232 0.1081493 0.6473724
Sum 0.6565118 0.3427266 0.9992384
Conditional distributions
f r(xi , yj )
f r(xi |yj ) =
f r(yj )
Example: relative frequency of survival being a woman.
rel.f r(Survived = Y es, Sex = F emale)
rel.f r(Survived = Y es|Sex = F emale) = = 0.1671
rel.f r(Sex = F emale)
Conditional distributions of Survived given Sex:
[Link](tableTitanic,1)
No Yes
Female 0.3333333 0.6666667
Male 0.8329412 0.1670588
The second argument of [Link] sets the dimension of the table which the conditioning variate is located
at. In this case Sex is located in the first dimension (rows).
Conditional distributions of Sex given Survived:
[Link](tableTitanic,2)
No Yes
Female 0.1786543 0.6844444
Male 0.8213457 0.3155556
Relationship between pairs of categorical variates
The summary function can be used on a table class object in order to perform a Chi-square test of independence.
The null hypothesis is independence, no association.
summary(table(Titanic$Sex,Titanic$Survived))
Number of cases in table: 1312
Number of factors: 2
Test for independence of all factors:
Chisq = 331.5, df = 1, p-value = 4.444e-74
The p-value results to be extremely low and, hence, it provides support to conclude statistically significant
association between sex and survival.
2
Some graphical representations for categorical variates
Mosaics
plot(tableTitanic,main="Titanic",ylab="Survived",xlab="Sex",col=rainbow(2))
Titanic
Female Male
No
Survived
Yes
Sex
Using the HairEyeColor dataset included in R:
a <- [Link](HairEyeColor[,,"Male"]) # Extract table for Male category
a
Eye
Hair Brown Blue Hazel Green
Black 32 11 10 3
Brown 53 50 25 15
Red 10 10 7 7
Blond 3 30 5 8
plot(a,main="Hair and eye colour (Male)",ylab="Eye",xlab="Hair",col=rainbow(4))
3
Hair and eye colour (Male)
Black Brown Red Blond
Brown
Eye
Blue
Green Hazel
Hair
b <- [Link](HairEyeColor[,,"Female"]) # Extract table for Female category
b
Eye
Hair Brown Blue Hazel Green
Black 36 9 5 2
Brown 66 34 29 14
Red 16 7 7 7
Blond 4 64 5 8
plot(b,main="Hair and eye colour (Female)",ylab="Eye",xlab="Hair",col=rainbow(4))
Hair and eye colour (Female)
Black Brown Red Blond
Brown
Eye
BlueHazel
Green
Hair
4
Combined barplots
barplot(t(b),main="Hair and eye colour (Female)",col=rainbow(4),beside=T,xlab="Hair color",
legend=T,[Link]=list(x="topleft",bty="n"))
Hair and eye colour (Female)
Brown
60
Blue
Hazel
50
Green
40
30
20
10
0
Black Brown Red Blond
Hair color
Multi-way tables for categorical variates
The ftable function provides a neat format for contingency tables involving more than two variates. It
works on both raw data matrices and table objects. For example, using the HairEyeColor dataset, which is
a list of two tables (Male, Female), we obtain:
ftable(HairEyeColor)
Sex Male Female
Hair Eye
Black Brown 32 36
Blue 11 9
Hazel 10 5
Green 3 2
Brown Brown 53 66
Blue 50 34
Hazel 25 29
Green 15 14
Red Brown 10 16
Blue 10 7
Hazel 7 7
Green 7 7
Blond Brown 3 4
Blue 30 64
Hazel 5 5
5
Green 8 8
From raw data ([Link] file):
Personal <- [Link]("[Link]",sep=",",header=T)
head(Personal)
Sex Status Education City
1 Female Married University Glasgow
2 Female Married Primary Edinburgh
3 Female Married Secondary Dundee
4 Female Married Primary Glasgow
5 Female Married Secondary Glasgow
6 Female Married Secondary Glasgow
ftable(Personal)
City Dundee Edinburgh Glasgow
Sex Status Education
Female Couple Primary 0 0 0
Secondary 0 0 0
University 0 0 0
Divorced Primary 0 0 0
Secondary 0 0 0
University 0 0 0
Married Primary 0 3 2
Secondary 1 2 5
University 0 1 1
Single Primary 0 2 1
Secondary 0 1 1
University 0 2 0
Male Couple Primary 1 0 0
Secondary 1 0 0
University 1 0 0
Divorced Primary 0 0 0
Secondary 0 1 0
University 0 1 0
Married Primary 0 1 0
Secondary 0 0 0
University 0 0 0
Single Primary 4 0 0
Secondary 2 0 1
University 4 0 2
The variates can be arranged in different ways by the arguments [Link] and [Link]:
ftable(Personal,[Link]=c("Status","Education"))
Sex Female Male
City Dundee Edinburgh Glasgow Dundee Edinburgh Glasgow
Status Education
Couple Primary 0 0 0 1 0 0
Secondary 0 0 0 1 0 0
University 0 0 0 1 0 0
Divorced Primary 0 0 0 0 0 0
Secondary 0 0 0 0 1 0
University 0 0 0 0 1 0
6
Married Primary 0 3 2 0 1 0
Secondary 1 2 5 0 0 0
University 0 1 1 0 0 0
Single Primary 0 2 1 4 0 0
Secondary 0 1 1 2 0 1
University 0 2 0 4 0 2
ftable(Personal,[Link]=c("Status","Education"),[Link]="City")
City Dundee Edinburgh Glasgow
Status Education
Couple Primary 1 0 0
Secondary 1 0 0
University 1 0 0
Divorced Primary 0 0 0
Secondary 0 1 0
University 0 1 0
Married Primary 0 4 2
Secondary 1 2 5
University 0 1 1
Single Primary 4 2 1
Secondary 2 1 2
University 4 2 2
Numerical variates
Linear relationship measures: covariance and correlation
Sample covariance and correlation between two variates x and y:
Pn
i=1 (xi − x̄)(yi − ȳ) Sxy
Sxy = rxy =
n Sx Sy
Using the airquality dataset in R:
help(airquality)
air <- airquality[,1:4] # We focus on experimental measurements
Descriptive measures:
(tip: explore options to deal with missing data when using the following functions)
colMeans(air) # Mean vector
Ozone Solar.R Wind Temp
NA NA 9.957516 77.882353
colMeans(air,[Link]=T) # Mean vector omiting missing values (NA)
Ozone Solar.R Wind Temp
42.129310 185.931507 9.957516 77.882353
apply(air,2,var,[Link]=T) # Variances
Ozone Solar.R Wind Temp
1088.20052 8110.51941 12.41154 89.59133
7
apply(air,2,sd,[Link]=T) # Standard deviations
Ozone Solar.R Wind Temp
32.987885 90.058422 3.523001 9.465270
var(air,[Link]=T) # Covariance matrix
Ozone Solar.R Wind Temp
Ozone 1107.29009 1056.5835 -72.51124 221.52072
Solar.R 1056.58346 8308.7422 -41.24480 255.46765
Wind -72.51124 -41.2448 12.65732 -16.85717
Temp 221.52072 255.4676 -16.85717 90.82031
cov(air,use="[Link]") # Covariance matrix
Ozone Solar.R Wind Temp
Ozone 1107.29009 1056.5835 -72.51124 221.52072
Solar.R 1056.58346 8308.7422 -41.24480 255.46765
Wind -72.51124 -41.2448 12.65732 -16.85717
Temp 221.52072 255.4676 -16.85717 90.82031
cor(air,use="[Link]") # Correlation matrix
Ozone Solar.R Wind Temp
Ozone 1.0000000 0.3483417 -0.6124966 0.6985414
Solar.R 0.3483417 1.0000000 -0.1271835 0.2940876
Wind -0.6124966 -0.1271835 1.0000000 -0.4971897
Temp 0.6985414 0.2940876 -0.4971897 1.0000000
cov2cor(cov(air,use="[Link]")) # Convert covariance matrix into correlation matrix
Ozone Solar.R Wind Temp
Ozone 1.0000000 0.3483417 -0.6124966 0.6985414
Solar.R 0.3483417 1.0000000 -0.1271835 0.2940876
Wind -0.6124966 -0.1271835 1.0000000 -0.4971897
Temp 0.6985414 0.2940876 -0.4971897 1.0000000
Visualisation of relationships
Scatter plot
Reveals variables which are positively and negatively correlated, extreme observations, clusters, unusual
patterns, . . .
Plotting multivariate data is imperative before carrying out a formal statistical analysis.
plot(air$Temp,air$Wind,ylab="Wind",xlab="Temperature (ªF)")
8
20
15
Wind
10
5
60 70 80 90
Temperature (ªF)
# Adding smooth loess curve
[Link](air$Temp,air$Wind,ylab="Wind",xlab="Temperature (ªF)",lpars=list(col="red"))
20
15
Wind
10
5
60 70 80 90
Temperature (ªF)
Matrix of pairwise scatter plots:
plot(air) # (same as using pairs())
9
0 100 200 300 60 70 80 90
150
Ozone
0 50
300
Solar.R
150
0
15
Wind
5
80
Temp
60
0 50 100 150 5 10 15 20
3D scatter plot:
#[Link]("scatterplot3d")
library(scatterplot3d)
scatterplot3d(air$Temp,air$Wind,air$Solar.R,color="blue")
0 50 100 150 200 250 300 350
air$Solar.R
air$Wind
25
20
15
10
5
0
50 60 70 80 90 100
air$Temp
# Using the `iris` dataset in R
scatterplot3d(iris$[Link],iris$[Link],iris$[Link],color=[Link](iris$Species))
10
iris$[Link]
0.0 0.5 1.0 1.5 2.0 2.5
4
5
6
iris$[Link]
7
11
8
2.0
2.5
3.0
3.5
4.0
4.5
iris$[Link]