MH8131 Probability and Statistics
Chapter 2 Descriptive statistics
In this chapter we describe several techniques for visualizing data, as well as
for computing quantities that summarize it effectively. Such quantities are known
as descriptive statistics. As we will see in the following chapters, these statistics
can often be interpreted within a probabilistic framework, but they are also useful
when probabilistic assumptions are not warranted. Because of this, we present
them from a deterministic point of view.
1 Histogram
When we speak of the shape of a data set, we are usually referring to the shape
exhibited by an associated graphical display, such as a histogram. The shape can
tell us a lot about any underlying structure to the data, and can help us decide
which statistical procedure we should use to analyze them.
DEFINITION 1 A histogram is constructed by first deciding on a set of classes, or
bins, which partition the real line into a set of boxes into which the data values
fall. Then vertical bars are drawn over the bins with height proportional to the
number of observations that fell into the bin.
In other words, it is a plot of the frequency table with the bins on the x-axis
and the count (or proportion) on the y-axis. The width of the bins is a parameter
that can be adjusted to yield higher or lower resolution. If we interpret the data
as corresponding to samples from a random variable, then the histogram would
be a piecewise constant approximation to their pmf or pdf.
Figure 1 shows two histograms computed from temperature data gathered
at a weather station in Oxford over 150 years. Each data point represents the
maximum temperature recorded in January or August of a particular year. Figure
2 shows a histogram of the GDP per capita of all countries in the world in 2014
according to the United Nations.
There are many ways to plot histograms in R, and one of the easiest is with
the hist function. The following R code produces the two histograms above.
Figure 1: Histograms of temperature data taken in a weather station in Oxford
over 150 years. Each data point equals the maximum temperature recorded in a
certain month in a particular year.
Figure 2: Histogram of the GDP per capita of all countries in the world in 2014.
>temp=read.table("/Users/lym/Desktop/plott/temp.txt", fill = TRUE, header = TRUE)
>t1=temp[temp$mon==1,] # select the first month (January)
>t8=temp[temp$mon==8,] # select the eighth month (August)
>tt1=t1[!is.na(t1$degC), ] # remove the missing values
>tt8=t8[!is.na(t8$degC), ] # remove the missing values
>ss=c(tt1$degC,tt8$degC) # put the two months' data together
>h=hist(ss,breaks = 20) # plot the histogram
>cuts <- cut(h$breaks, c(-Inf,15,Inf)) # break the x axis into two parts
>plot(h, col=cuts, xlab = "Degrees", main = "") # add the colors
>legend('topright', legend=c("Jan", "Aug"), # add the legend
fill=c("black", "red"))
>install.packages("readxl")
>library("readxl")
>gdp=read_xlsx("/Users/lym/Desktop/plott/GDPPCconstant-USD-countries.xlsx", col_names = TRUE)
>extract=as.matrix(gdp[,which(gdp[2,]==2014)])[-c(1:2),]/1000
>h=hist(extract, breaks = 30, col = "red", xlim = c(0,200),
main = "GDP per capita of all countries", xlab = "Thousands of dollars")
Watch what happens when we change the bins (with the breaks argument to hist). To
produce a relative frequency histogram instead of a frequency histogram, add freq =
FALSE to the call.
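For instance, assuming the vector ss created in the temperature code above, the following sketch compares a coarser binning with a density-scaled version:

hist(ss, breaks = 10)                 # coarser bins, lower resolution
hist(ss, breaks = 40, freq = FALSE)   # finer bins; bar heights on the density scale, areas sum to one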
2 Measures of Center
DEFINITION 2 The sample mean is denoted by x̄ (read "x-bar") and is simply the
arithmetic average of the observations:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i.$$
It is appropriate for use with data sets that are not highly skewed and do not contain
extreme observations.
◃ Good: natural, easy to compute, has nice mathematical properties
◃ Bad: sensitive to extreme values
You can calculate the sample mean of a data vector x with the R command
mean(x).
DEFINITION 3 The sample median is another popular measure of center and is
denoted med(x) or x̃. To calculate its value, first sort the data into an increasing
sequence of numbers. If the data set has an odd number of observations then
med(x) is the value of the middle observation, which lies in position (n + 1)/2;
otherwise, there are two middle observations and med(x) is the average of those
middle values.
One desirable property of the sample median is that it is resistant to extreme
observations, in the sense that the value of med(x) depends only on the values of
the middle observations. It is quite a robust estimate of location since it is not
influenced by outliers (extreme cases) that could skew the results. The same
cannot be said for the sample mean.
◃ Good: resistant to extreme values, easy to describe
◃ Bad: not as mathematically tractable, need to sort the data to calculate
You can calculate the sample median of x with the R command median(x).
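As a quick illustration with made-up numbers, a single extreme value shifts the sample mean substantially but leaves the sample median almost unchanged:

x <- c(2, 3, 5, 7, 11)   # small made-up data set
mean(x)                  # 5.6
median(x)                # 5
x2 <- c(x, 1000)         # add one extreme observation
mean(x2)                 # about 171.3, pulled up by the outlier
median(x2)               # 6, barely affected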
3 Measures of Spread
DEFINITION 4 The sample variance of x1 , x2 , · · · , xn is denoted s2 and is calcu-
lated with the formula
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2.$$
The sample standard deviation is $s = \sqrt{s^2}$.
Intuitively, the sample variance is approximately the average squared distance of
the observations from the sample mean. The sample standard deviation is used to
scale the estimate back to the measurement units of the original data. Empirical
Rule: If data follow a bell-shaped curve, then approximately 68%, 95%, and 99.7%
of the data are within 1, 2, and 3 standard deviations of the mean, respectively.
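As a rough check of the Empirical Rule, here is a small sketch using simulated bell-shaped data (not the data sets above):

set.seed(1)
x <- rnorm(10000)              # simulated bell-shaped data
z <- (x - mean(x)) / sd(x)     # standardize
mean(abs(z) <= 1)              # close to 0.68
mean(abs(z) <= 2)              # close to 0.95
mean(abs(z) <= 3)              # close to 0.997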
You might be wondering why the normalizing constant is 1/(n − 1) instead of
1/n. The reason is that this ensures that the expectation of the sample variance
equals the true variance when the data are iid. In practice there is not much
difference between the two normalizations.
From the console we may compute the sample range with range(x) and the
sample variance with var(x), where x is a numeric vector. The sample standard
deviation is sqrt(var(x)) or just sd(x).
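The difference between the two normalizations can be checked directly; var and sd use the 1/(n − 1) version (a small sketch with made-up numbers):

x <- c(4, 8, 6, 5, 3)               # small made-up sample
n <- length(x)
sum((x - mean(x))^2) / (n - 1)      # 3.7, the same as var(x)
var(x)
sum((x - mean(x))^2) / n            # 2.96, the 1/n version, slightly smaller
sd(x)                               # square root of var(x)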
4 Outlier
DEFINITION 5 An outlier is any value that is very distant from the other values
in a data set. If a value falls beyond three standard deviations away from the
mean, that data point is identified as an outlier.
Being an outlier in itself does not make a data value invalid or erroneous. Still,
outliers are often the result of data errors such as mixing data of different units
(kilometers versus meters) or bad readings from a sensor. When outliers are the
result of bad data, the mean will be a poor estimate of location, while the
median will still be valid.
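As a minimal sketch of the three-standard-deviation rule in Definition 5 (the data vector here is made up):

x <- c(rnorm(100, mean = 50, sd = 2), 500)    # made-up measurements plus one bad reading
flag <- abs(x - mean(x)) > 3 * sd(x)          # values beyond 3 standard deviations of the mean
x[flag]                                       # the flagged outlier(s)
mean(x)                                       # distorted by the bad reading
median(x)                                     # still close to 50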
What do we do about outliers? They merit further investigation. The primary
goal is to determine why the observation is outlying, if possible. If the observation
is a typographical error, then it should be corrected before continuing. If the
observation is from a subject that does not belong to the population of interest,
then perhaps the datum should be removed. Otherwise, perhaps the value is
hinting at some hidden structure to the data.
DEFINITION 6 (ANOMALY DETECTION) In contrast to typical data analysis,
where outliers are sometimes informative and sometimes a nuisance, in
anomaly detection the points of interest are the outliers, and the greater mass
of data serves primarily to define the "normal" against which anomalies are
measured.
5 Order statistics
In some cases, a data set is well described by its mean and standard deviation.
For example, in January the temperature in Oxford is around 6.73°C, give or take
2°C. This is a pretty accurate account of the temperature data from the previous
section.
However, imagine that someone describes the GDP data set in Figure 2 as:
Countries typically have a GDP per capita of about $16,500 give or take $25,300.
This description is pretty terrible. The problem is that most countries have
very small GDPs per capita, whereas a few have really large ones and the sample
mean and standard deviation don’t really convey this information. Order statistics
provide an alternative description, which is usually more informative when there
are extreme values in the data.
DEFINITION 7 (Quantiles and percentiles). Let $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$ denote the or-
dered elements of a set of data $\{x_1, x_2, \ldots, x_n\}$. The $q$ quantile of the data for $0 < q
< 1$ is $x_{([q(n+1)])}$, where $[q(n+1)]$ is the result of rounding $q(n+1)$ to the closest
integer. The $q$ quantile is also known as the $100\,q$ percentile.
The 0.25 and 0.75 quantiles are known as the first and third quartiles, whereas
the 0.5 quantile is known as the sample median. A quarter of the data are smaller
than the 0.25 quantile, half are smaller (or larger) than the median and three
quarters are smaller than the 0.75 quantile. If $n$ is even, the sample median is
usually set to
$$\frac{x_{(n/2)} + x_{(n/2+1)}}{2}. \qquad (5.1)$$
The difference between the third and the first quartile is known as the interquar-
tile range (IQR).
For the GDP data set, the median is $6,350. This means that half of the
countries have a GDP of less than $6,350. In contrast, 71% of the countries
have a GDP per capita lower than the sample mean! The IQR of these data is
$18,200. To provide a more complete description of the data set, we can list a
five-number summary of order statistics: the minimum x(1) , the first quartile, the
sample median, the third quartile and the maximum x(n) . For the GDP data set
these are $130, $1,960, $6,350, $20,100, and $188,000 respectively.
You can calculate the sample quantiles of any order p where 0 < p < 1 for
a data set stored in a data vector x with the quantile function, for instance, the
command quantile(x, probs = c(0, 0.25, 0.37)) will return the smallest observation,
the first quartile q̃0.25, and the 0.37 quantile q̃0.37. For q̃p simply change
the values in the probs argument to the value p.
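For instance, assuming the vector extract of GDP values (in thousands of dollars) computed in the histogram code earlier, the five-number summary and the IQR can be obtained as follows; note that quantile interpolates between order statistics by default, so its output may differ slightly from the rounding rule in Definition 7:

quantile(extract, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)  # five-number summary
IQR(extract, na.rm = TRUE)                                         # interquartile range
summary(extract)                                                   # similar information, plus the mean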
6 Boxplot
We can visualize the main order statistics of a data set by using a box plot.
DEFINITION 8 Boxplots, introduced by Tukey [Tukey-1977], are based on per-
centiles and give a quick way to visualize the distribution of data.
The median is shown by the horizontal line in the box. The dashed lines,
referred to as whiskers, extend from the top and bottom to indicate the range
for the bulk of the data. The bottom and top of the box are the first and third
quartiles. The lower whisker is a line extending from the bottom of the box to
the smallest value within 1.5 IQR of the first quartile. The upper whisker extends
from the top of the box to the largest value within 1.5 IQR of the third quartile.
Values beyond the whiskers are considered outliers and are plotted separately.
Figure 3 applies box plots to visualize the temperature data set used in Figure 1.
Each box plot corresponds to the maximum temperature in a particular month
(January, April, August and November) over the last 150 years. The box plots
allow us to quickly compare the spread of temperatures in the different months.
Figure 4 shows a box plot of the GDP data from Figure 2. From the box plot
it is immediately apparent that most countries have very small GDPs per capita,
that the spread between countries increases for larger GDPs per capita and that
a small number of countries have very large GDPs per capita.
Figure 3: Box plots of the Oxford temperature data set used in Figure 1.
The R format is boxplot(x), where x is the data.
t1=temp[temp$mon==1,] # select the first month
t4=temp[temp$mon==4,]
t8=temp[temp$mon==8,]
t11=temp[temp$mon==11,]
tt1=t1[!is.na(t1$degC), ]$degC # remove the missing values and keep the temperature column
tt4=t4[!is.na(t4$degC), ]$degC
tt8=t8[!is.na(t8$degC), ]$degC
tt11=t11[!is.na(t11$degC), ]$degC
ma=list(tt1,tt4,tt8,tt11) # put the four months' temperatures together
boxplot(ma,names = c("Jan","Apr","Aug","Nov"),ylab="Degrees (Celsius)") # plot the boxplots
and, for the GDP data,
install.packages("readxl")
library("readxl")
gdp=read_xlsx("/Users/lym/Desktop/plott/GDPPCconstant-USD-countries.xlsx",col_names = TRUE)
gdp1=gdp[,which(gdp[2,]==2014)][-c(1:2),]/1000
boxplot(gdp1,ylim=c(0,60),ylab="Thousands of dollars")
You will need to change the path to point to the folder where the GDP data file is stored.
Figure 4: Box plot of the GDP per capita of all countries in the world in 2014.
Not all of the outliers are shown.
7 Sample Covariance
In the previous sections we mostly considered data sets consisting of one-
dimensional data. In machine learning lingo, there was only one feature per
data point. We now study a multidimensional scenario, where there are several
features associated with each data point.
The sample covariance quantifies whether the two features of a two-dimensional
data set tend to vary in a similar way on average, just as the covariance quantifies
the expected joint variation of two random variables.
Figure 5: Scatterplot of the temperature in January and in August (left) and of the
maximum and minimum monthly temperature (right) in Oxford over the last 150
years.
DEFINITION 9 (Sample covariance). Let $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ be a data set
where each example consists of a measurement of two different features. The
sample covariance is defined as
$$S_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}).$$
In order to take into account that each individual feature may vary on a dif-
ferent scale, a common preprocessing step is to normalize each feature, dividing
it by its sample standard deviation. If we normalize before computing the covari-
ance, we obtain the sample correlation coefficient of the two features. One of
the advantages of the correlation coefficient is that we don't need to worry about
the units in which the features are measured. In contrast, measuring a feature
representing distance in inches rather than in miles can severely distort the covariance
if we don't scale the other feature accordingly.
DEFINITION 10 (Sample correlation coefficient). Let $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ be
a data set where each example consists of a measurement of two different
features. The sample correlation coefficient is defined as
$$\rho_{xy} = \frac{S_{xy}}{s_x s_y},$$
where $s_x$ is the sample standard deviation of the $x$ values and $s_y$ is defined analogously.
By the Cauchy-Schwarz inequality the magnitude of the sample correlation
coefficient is bounded by one. If it is equal to 1 or -1, then the two centered data
sets are collinear.
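The quantities in Definitions 9 and 10 correspond to the built-in functions cov and cor; a quick check with two made-up vectors:

x <- c(1, 2, 4, 7)                                     # made-up feature 1
y <- c(2, 1, 5, 9)                                     # made-up feature 2
sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)   # sample covariance by hand
cov(x, y)                                              # the same value
cov(x, y) / (sd(x) * sd(y))                            # sample correlation coefficient
cor(x, y)                                              # the same value, always in [-1, 1]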
Figure 5 is annotated with the sample correlation coefficients corresponding
to the two plots. Maximum and minimum temperatures within the same month
are highly correlated, whereas the maximum temperatures in January and August
within the same year are only somewhat correlated.
The table below, called a correlation matrix, shows the correlation between the
daily returns for telecommunication stocks from July 2012 through June 2015.
From the table, you can see that Verizon (VZ) and AT&T (T) have the highest
correlation. Level Three (LVLT), which is an infrastructure company, has the
lowest correlation. Note the diagonal of 1s (the correlation of a stock with itself
is 1), and the redundancy of the information above and below the diagonal.
Correlation between telecommunication stock returns
T CTL FTR VZ LVLT
T 1.000 0.475 0.328 0.678 0.279
CTL 0.475 1.000 0.420 0.417 0.287
FTR 0.328 0.420 1.000 0.287 0.260
VZ 0.678 0.417 0.287 1.000 0.242
LVLT 0.279 0.287 0.260 0.242 1.000
You may evaluate the correlations and produce a scatter plot matrix in R as follows,
assuming the data frames sp500_px (daily returns) and sp500_sym (symbols and
sectors) are available:
etfs <- sp500_px[row.names(sp500_px) > "2012-07-01",
                 sp500_sym[sp500_sym$sector == "etf", "symbol"]]
cor(etfs)    # correlation matrix
plot(etfs)   # scatter plot matrix
You could also use the data set mtcars, which is available in the R environment, to illustrate this yourself.
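For instance, a small sketch with a few of the mtcars variables:

data(mtcars)                                    # built-in data set
vars <- mtcars[, c("mpg", "wt", "hp", "disp")]  # a few numeric features
round(cor(vars), 3)                             # correlation matrix
pairs(vars)                                     # scatter plot matrix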
8 Sample Covariance matrix and PCA
We now consider sets of multidimensional data. In particular, we are interested
in analyzing the variation in the data. The sample covariance matrix of a data set
contains the pairwise sample covariance between every pair of features.
DEFINITION 11 Let x1 , · · · , xn be a set of p dimensional real-valued data vectors.
The sample covariance matrix of these data is the p × p matrix
$$S_n = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x})^T,$$
where $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
The $(i,j)$ entry of the covariance matrix is given by
$$(S_n)_{ij} = \begin{cases} \text{the sample variance of the $i$th entries of } x_1, \ldots, x_n, & \text{when } i = j,\\ \text{the sample covariance of the $i$th and $j$th entries of } x_1, \ldots, x_n, & \text{when } i \neq j. \end{cases}$$
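In R, applying cov to a matrix or data frame with one observation per row returns exactly this $p \times p$ matrix; a small sketch with made-up data:

set.seed(2)
X <- matrix(rnorm(100 * 3), ncol = 3)   # n = 100 data points, p = 3 features (made up)
S <- cov(X)                             # 3 x 3 sample covariance matrix
diag(S)                                 # sample variances of the individual features
S[1, 2]; cov(X[, 1], X[, 2])            # off-diagonal entry = pairwise sample covariance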
In order to characterize the variation of a multidimensional data set around
its center, we consider its variation in different directions. The average variation
of the data in a certain direction is quantified by the sample variance of the
projections of the data onto that direction. Let v be a unit-norm vector aligned
with a direction of interest. The sample variance of the data set in the direction
of v is given by
$$\frac{1}{n-1}\sum_{i=1}^{n}\left(v^T x_i - \overline{v^T x}\right)^2
= \frac{1}{n-1}\sum_{i=1}^{n}\left(v^T (x_i - \bar{x})\right)^2
= v^T \left[\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x})^T\right] v
= v^T S_n v, \qquad (8.1)$$
where
$$\overline{v^T x} = \frac{1}{n}\sum_{i=1}^{n} v^T x_i.$$
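Equation (8.1) can be checked numerically: the sample variance of the projections onto a unit-norm vector $v$ equals $v^T S_n v$. A sketch with made-up two-dimensional data:

set.seed(3)
X <- cbind(rnorm(200), rnorm(200, sd = 2))   # made-up data, one point per row
v <- c(1, 1) / sqrt(2)                       # unit-norm direction
var(X %*% v)                                 # sample variance of the projections
t(v) %*% cov(X) %*% v                        # v^T S_n v, the same value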
Consider the eigendecomposition of the covariance matrix $S_n$:
$$S_n = U \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_p \end{pmatrix} U^T,$$
where the matrix of eigenvectors is $U = (u_1, \ldots, u_p)$ and the eigenvalues are ordered
so that $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p$.
THEOREM 1 For the above eigendecomposition of the covariance matrix $S_n$, we have
$$\lambda_1 = \max_{\|v\|=1} v^T S_n v, \qquad u_1 = \arg\max_{\|v\|=1} v^T S_n v,$$
and, for $k = 2, \ldots, p$,
$$\lambda_k = \max_{\|v\|=1,\; v \perp u_1, \ldots, u_{k-1}} v^T S_n v, \qquad u_k = \arg\max_{\|v\|=1,\; v \perp u_1, \ldots, u_{k-1}} v^T S_n v.$$
This means that u1 is the direction of maximum variation. The eigenvector u2
corresponding to the second largest eigenvalue λ2 is the direction of maximum
variation that is orthogonal to u1 .
In data analysis, the eigenvectors of the sample covariance matrix are usually
called principal directions. Computing these eigenvectors to quantify the variation
of a data set in different directions is called principal component analysis (PCA).
Figure 6 shows the principal directions for several 2D examples.
Figure 6: PCA of a set consisting of n = 200 two-dimensional data points with
different configurations.
Figure 7 illustrates the importance of centering before applying PCA. If the
data are concentrated around a point that is far from the origin, the first principal
direction tends to be aligned with that point. This makes sense, as projecting onto
that direction captures more energy. As a result, the principal directions do not
reflect the directions of maximum variation within the cloud of data. Centering
the data set before applying PCA solves the issue.
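As a small sketch with made-up data, the principal directions can be computed as the eigenvectors of the sample covariance matrix; prcomp centers the data by default and returns the same directions (possibly with flipped signs):

set.seed(4)
x1 <- rnorm(200)
X <- cbind(x1, x1 + 0.5 * rnorm(200))   # correlated, made-up 2D data
eigen(cov(X))$vectors                   # principal directions u1, u2
eigen(cov(X))$values                    # variances along those directions
prcomp(X)$rotation                      # same directions up to sign (data centered by prcomp)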
EXAMPLE 1 You can compute principal components in R using the princomp
function. The following performs a PCA on the stock price returns for Chevron
(CVX) and ExxonMobil (XOM):
oil_px <- sp500_px[, c('CVX', 'XOM')]
pca <- princomp(oil_px)
pca$loadings
Loadings:
Comp.1 Comp.2
CVX -0.747 0.665
XOM -0.665 -0.747
summary(pca)
Importance of components:
Comp.1 Comp.2
Standard deviation 0.8820924 0.3007512
Proportion of Variance 0.8958580 0.1041420
Cumulative Proportion 0.8958580 1.0000000
The weights for CVX and XOM for the first principal component are -0.747 and
-0.665, and for the second principal component they are 0.665 and -0.747. How do we
interpret this? The first principal component is essentially an average of CVX and
XOM, reflecting the correlation between the two energy companies. The second
principal component measures when the stock prices of CVX and XOM diverge.
Using the sample covariance matrix we can express the variation in every direction.
Figure 7: PCA applied on the left to the data without centering. As a result, the
dominant principal direction u1 lies in the direction of the mean of the data and
PCA does not reflect the actual structure. u1 becomes aligned with the direction
of maximal variation for the centered data.
9 QQ plot
DEFINITION 12 A QQ plot is a plot used to visualize how close the distribution of
a sample is to a normal distribution.
To compare data to a standard normal distribution, you subtract the mean and then
divide by the standard deviation; this is also called normalization or standardization.
Write
$$z = \frac{x - \bar{x}}{s}.$$
The resulting values are commonly referred to as z-scores.
A QQ-Plot is used to visually determine how close a sample is to the normal
distribution. The QQ-Plot orders the z-scores from low to high and plots each
value's z-score on the y-axis; the x-axis is the corresponding quantile of a normal
distribution for that value's rank. Since the data are normalized, the units corre-
spond to the number of standard deviations of the data away from the mean. If
the points roughly fall on the diagonal line, then the sample distribution can be
considered close to normal. Figure 8 shows a QQ-Plot for a sample of 100 values
randomly generated from a normal distribution; as expected, the points closely
follow the line. This figure can be produced in R with the qqnorm function:
norm_samp <- rnorm(100)
qqnorm(norm_samp)
abline(a=0, b=1, col='grey')
In contrast, in practice one tail of a distribution is often longer than the other,
particularly in finance. A good example to illustrate the long-tailed nature of data
is stock returns. Figure 9 shows the QQ-Plot for the daily stock returns for
Netflix (NFLX).
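We do not reproduce the NFLX returns here, but a heavy-tailed sample, for instance one drawn from a t distribution with few degrees of freedom, shows the same effect: the points bend away from the line in the tails. A sketch:

set.seed(5)
heavy <- rt(100, df = 2)                  # heavy-tailed sample, a stand-in for stock returns
z <- (heavy - mean(heavy)) / sd(heavy)    # z-scores
qqnorm(z)
abline(a = 0, b = 1, col = 'grey')        # the tails deviate from the line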
Figure 8: QQ-Plot of a sample of 100 values drawn from a normal distribution.
Figure 9: QQ-Plot of the returns for NFLX.