In a few words, the correlation measures how well the association between two variables can
be described by a straight line. Let us consider the variable x and the four variables y in Table
22. Figure 31 shows the corresponding scatter plots.

• The association between x and $y_1$ (top–left panel) is clearly linear and very strong; in fact,
we have a perfect linear association. This fact is adequately described by the correlation
coefficient: we get $r_{xy_1} = -1$. Note that the sign indicates a negative association.

• There is a perfect association between x and $y_2$ (top–right panel); in fact, we have that
$y_2 = \exp(x)$. However, the association is not linear and the correlation coefficient equals
$r_{xy_2} = 0.6914$, which, in loose words, can be interpreted as "the association between x
and $y_2$ can be described moderately well, but not perfectly, by a straight line".

• There is a perfect association between x and $y_3$ (bottom–left panel); in fact, we have
that $y_3 = (x - 6)^2$. However, the association is not linear and the correlation coefficient
equals $r_{xy_3} = 0$, which can be interpreted as "the association between x and $y_3$ cannot
be described by a straight line at all".

• The points in the bottom–right panel show no clear association of any type; in particular,
they show no linear association. This fact is reflected by the correlation coefficient, which
is $r_{xy_4} = -0.2545$, indicating that a straight line will be very poor at describing the
association between the two variables.

x y1 y2 y3 y4
1 11 3 25 9
2 10 7 16 7
3 9 20 9 6
4 8 55 4 1
5 7 148 1 10
6 6 403 0 5
7 5 1097 1 3
8 4 2981 4 11
9 3 8103 9 8
10 2 22026 16 4
11 1 59874 25 2
Table 22: Different types of association

In R, the function cor() can be used for computing the correlation coefficient.
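For instance, here is a minimal R sketch reproducing the four Pearson coefficients discussed above; the columns of Table 22 are reconstructed from their apparent generating formulas, and y4 is copied directly from the table:

```r
# Data from Table 22
x  <- 1:11
y1 <- 12 - x                                 # perfect linear (decreasing) association
y2 <- round(exp(x))                          # perfect but non-linear association
y3 <- (x - 6)^2                              # perfect but non-monotonic association
y4 <- c(9, 7, 6, 1, 10, 5, 3, 11, 8, 4, 2)   # no clear association

cor(x, y1)  # -1
cor(x, y2)  # 0.6914
cor(x, y3)  # 0
cor(x, y4)  # -0.2545
```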

Correlation matrix Let us say that we have J variables, $x_1, x_2, \cdots, x_J$. It is possible to
calculate $J(J-1)/2$ correlations between pairs of variables and present them in a matrix that
is called the correlation matrix:

$$\begin{pmatrix}
1 & r_{x_1 x_2,U} & \cdots & r_{x_1 x_J,U} \\
r_{x_2 x_1,U} & 1 & \cdots & r_{x_2 x_J,U} \\
\vdots & \vdots & \ddots & \vdots \\
r_{x_J x_1,U} & r_{x_J x_2,U} & \cdots & 1
\end{pmatrix}$$

The ones in the diagonal are obtained by taking into account that $r_{x_j x_j,U} = 1$. Alternatively,
taking into account that the correlation is symmetric, i.e. that $r_{x_j x_{j'},U} = r_{x_{j'} x_j,U}$, we can simply
write the correlation matrix as

$$\begin{pmatrix}
1 & & & \\
r_{x_2 x_1,U} & 1 & & \\
\vdots & \vdots & \ddots & \\
r_{x_J x_1,U} & r_{x_J x_2,U} & \cdots & 1
\end{pmatrix}$$

Figure 31: Scatter plots of the variables in Table 22.
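In R, the same cor() function returns the whole correlation matrix when applied to a data frame; a small sketch with the Table 22 variables:

```r
# cor() applied to a data frame gives the full J x J correlation
# matrix, symmetric with ones on the diagonal.
x <- 1:11
d <- data.frame(x  = x,
                y1 = 12 - x,
                y2 = round(exp(x)),
                y3 = (x - 6)^2,
                y4 = c(9, 7, 6, 1, 10, 5, 3, 11, 8, 4, 2))
round(cor(d), 4)  # e.g. the (x, y1) entry is -1
```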

5.2 Spearman’s correlation coefficient


In the previous subsection we discussed Pearson's correlation coefficient, a measure of the
linear association between two variables. In this subsection we will discuss another measure,
one that captures a more general class of associations, namely, monotonic associations.
In other words, this measure allows us to determine whether the association between two
variables can be adequately described by a monotonic function.
First, let us clarify what a monotonic function is. Let us take a look at the functions
plotted in Figure 32:

• The top–left panel shows an increasing function: as x increases, y also increases;

• The top–central panel shows a decreasing function: as x increases, y decreases;

• The top–right panel shows a constant function: as x increases, y does not change;

• The bottom–left panel shows a non–decreasing function: a function that is partly
increasing and partly constant;

• The bottom–central panel shows a non–increasing function: a function that is partly
decreasing and partly constant.

Any of the five types of functions described above is called a monotonic function. In loose
words, a monotonic function is a function that does not both increase and decrease. For instance,
the bottom–right panel shows a function that is decreasing first and then becomes increasing;
this is not a monotonic function.

Figure 32: Six functions.

Before defining Spearman's correlation coefficient, we need to define the ranks of the ob-
servations of a variable. The rank of the observation $x_i$ in the population U, denoted by $R(x_i)$
or simply $R_i$, is the position occupied by the observation when the values are sorted from
smallest to largest. This is one of the situations where things are simpler than they sound, as
illustrated by the following example.
Example 45. Consider the population of ten students. The ranks of the number of points in
the assignment (x) and the exam (y) are shown in Table 23.

i 1 2 3 4 5 6 7 8 9 10
xi 0 9 2 24 25 23 4 20 28 5
R(xi ) 1 5 2 8 9 7 3 6 10 4
yi 8 15 5 36 40 30 9 21 32 27
R(yi ) 2 4 1 9 10 7 3 5 8 6
Table 23: Ranks of the points of ten students in an exam in Statistics
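In R, the built-in rank() function returns exactly these positions; a quick check against Table 23:

```r
# rank() gives the position of each value when the values are
# sorted from smallest to largest.
x <- c(0, 9, 2, 24, 25, 23, 4, 20, 28, 5)    # assignment points
y <- c(8, 15, 5, 36, 40, 30, 9, 21, 32, 27)  # exam points
rank(x)  # 1 5 2 8 9 7 3 6 10 4, as in Table 23
rank(y)  # 2 4 1 9 10 7 3 5 8 6
```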

Spearman’s correlation coefficient is simply the correlation coefficient (11) calculated over
the ranks of x and y, instead of the actual values. More formally:
Definition 46. Let $x_i$ and $y_i$ be the values of two variables associated to the ith element
in U ($i = 1, 2, \cdots, N$). Let also $R(x_i)$ and $R(y_i)$ be their corresponding ranks. Spearman's
correlation coefficient between x and y is defined as

$$r^{s}_{xy,U} \equiv \frac{\sum_{U} (R(x_i) - \bar{R}_U)(R(y_i) - \bar{R}_U)}{\left[ \sum_{U} (R(x_i) - \bar{R}_U)^2 \sum_{U} (R(y_i) - \bar{R}_U)^2 \right]^{1/2}} \qquad (12)$$

where $\bar{R}_U = (N+1)/2$. □


Let us illustrate the steps needed to calculate Spearman’s coefficient.

Example 47. Let U be the population of N = 10 students taking a Master course in statistics.
Let $x_i$ and $y_i$ be, respectively, the scores in a home assignment and the final exam of the ith
student ($i = 1, 2, \cdots, N$). The ranks of x and y were shown in Table 23. Using them, let us
find Spearman's correlation coefficient. We have $\bar{R}_U = (N+1)/2 = (10+1)/2 = 5.5$.
Let us calculate the numerator first:

$$\sum_{U} (R(x_i) - \bar{R}_U)(R(y_i) - \bar{R}_U) = (1-5.5)(2-5.5) + (5-5.5)(4-5.5) + \cdots + (4-5.5)(6-5.5) = 15.75 + 0.75 + \cdots + (-0.75) = 75.5.$$

Now we calculate the first term in the denominator:

$$\sum_{U} (R(x_i) - \bar{R}_U)^2 = (1-5.5)^2 + (5-5.5)^2 + \cdots + (4-5.5)^2 = 20.25 + 0.25 + \cdots + 2.25 = 82.5.$$

And the second term in the denominator is

$$\sum_{U} (R(y_i) - \bar{R}_U)^2 = (2-5.5)^2 + (4-5.5)^2 + \cdots + (6-5.5)^2 = 12.25 + 2.25 + \cdots + 0.25 = 82.5.$$

This gives

$$r^{s}_{xy,U} = \frac{\sum_{U} (R(x_i) - \bar{R}_U)(R(y_i) - \bar{R}_U)}{\left[ \sum_{U} (R(x_i) - \bar{R}_U)^2 \sum_{U} (R(y_i) - \bar{R}_U)^2 \right]^{1/2}} = \frac{75.5}{(82.5 \cdot 82.5)^{1/2}} = 0.9152,$$

which means that there is a high monotonic association between the number of points that
students get in the assignment and the exam. □
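The same value can be obtained in R, either by correlating the ranks or by passing method = "spearman" to cor(); a minimal sketch with the data from Table 23:

```r
# Spearman's coefficient is Pearson's coefficient computed on the ranks.
x <- c(0, 9, 2, 24, 25, 23, 4, 20, 28, 5)
y <- c(8, 15, 5, 36, 40, 30, 9, 21, 32, 27)
cor(rank(x), rank(y))           # 0.9152
cor(x, y, method = "spearman")  # same value, computed directly
```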
We have insisted that Spearman's correlation allows us to measure monotonic associations
between two variables. Let us illustrate this with the variables in Table 22 and Figure 31:

• Before, we found that for the perfect linear association shown in the top–left panel,
Pearson's correlation coefficient is equal to -1. This is the same value obtained by
Spearman's correlation coefficient.

• We found that Pearson's correlation coefficient for the increasing association shown in
the top–right panel is 0.6914. Spearman's correlation coefficient is equal to 1, indicating
that there is a perfect monotonic association between x and y: higher values of x are
associated to higher values of y.

• Spearman's correlation for the variables shown in the bottom–left panel equals 0, the
same value as Pearson's correlation. This indicates that there is no monotonic association
between x and y.

• Due to the way the numbers were generated, Pearson's and Spearman's correlation
coefficients also coincide for the variables illustrated in the bottom–right panel: -0.2545,
indicating a weak monotonic association between the two variables.
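These values are easy to verify in R; as before, the Table 22 columns are reconstructed from their apparent generating formulas:

```r
# Spearman's coefficients for the Table 22 variables.
x <- 1:11
cor(x, 12 - x,        method = "spearman")  # -1: perfect decreasing association
cor(x, round(exp(x)), method = "spearman")  #  1: perfect increasing association
cor(x, (x - 6)^2,     method = "spearman")  #  0: no monotonic association
cor(x, c(9, 7, 6, 1, 10, 5, 3, 11, 8, 4, 2),
    method = "spearman")                    # -0.2545, same as Pearson here
```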

5.3 Kendall’s correlation coefficient


Another parameter that allows for measuring the monotonic association between two variables
is Kendall's correlation coefficient. In words, it works as follows: it compares all pairs of
observations and identifies each pair as either concordant or discordant. A pair $(x_i, y_i)$ and
$(x_j, y_j)$ is said to be concordant if the straight line that connects both points has a positive
slope, and discordant if the straight line that connects both points has a negative slope. This
is illustrated in Figure 33 for a few pairs: the three green segments represent three concordant
pairs and the three red segments represent three discordant pairs.

Figure 33: Three concordant (green) and three discordant (red) pairs in the scores of ten students.

Once we have identified the concordant and discordant pairs, we take the total number of
concordant pairs minus the total number of discordant pairs, and divide this difference by the
total number of pairs. More formally:

Definition 48. Let $x_i$ and $y_i$ be the values of two variables associated to the ith element in
U ($i = 1, 2, \cdots, N$). Kendall's correlation coefficient between x and y is defined as

$$r^{k}_{xy,U} \equiv \frac{2}{N(N-1)} \sum_{i<j} \mathrm{sgn}(x_j - x_i)\,\mathrm{sgn}(y_j - y_i) \qquad (13)$$

where

$$\mathrm{sgn}(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x = 0 \\ -1 & \text{if } x < 0 \end{cases}$$

Using R, we find that Kendall's correlation coefficient between the assignment and exam
scores of the ten students is 0.7778.
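A minimal sketch of that computation, together with a direct implementation of definition (13):

```r
# Kendall's coefficient via cor() and via definition (13).
x <- c(0, 9, 2, 24, 25, 23, 4, 20, 28, 5)
y <- c(8, 15, 5, 36, 40, 30, 9, 21, 32, 27)
cor(x, y, method = "kendall")  # 0.7778

# Definition (13) directly: sum the sign products over all pairs i < j.
N <- length(x)
s <- 0
for (i in 1:(N - 1))
  for (j in (i + 1):N)
    s <- s + sign(x[j] - x[i]) * sign(y[j] - y[i])
2 * s / (N * (N - 1))  # 0.7778, as above
```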

5.4 Time-series plots


We close this section by describing the time-series plot, but before doing so, we need to define
what a time-series is. We have already seen a time-series plot in Section 4.
Consider a set of N measurements of the type $(t, y_t)$, where t indicates the time at which $y_t$
occurs. As the variable y is observed over time, the order of the observations is important. As
an example, consider the number of new cases of Covid reported every day during a period of
interest: $y_1$ is the number of cases reported on the first day, $y_2$ is the number of cases reported
on the second day, and so on.
A time-series plot can be seen as a scatter plot of the N pairs $(t, y_t)$. In this case, however,
the dots are joined by a line.
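In R, such a plot can be produced with plot(); a minimal sketch using hypothetical daily counts, not the actual data of Figure 34:

```r
# A time-series plot: the pairs (t, y_t) joined by a line.
set.seed(1)
t  <- 1:60                                      # day index (hypothetical)
yt <- rpois(60, lambda = 50 + 20 * sin(t / 8))  # hypothetical daily counts
plot(t, yt, type = "o", xlab = "day", ylab = "new cases")
```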

Example 49. Figure 34 shows the time-series of the number of new cases of Covid-19 reported
in Sweden every day from January 1, 2021 to December 31, 2021. □

Figure 34: Number of new cases of Covid-19 reported in Sweden during 2021. Source: our-[Link]

Up to this point we have introduced a set of parameters which are useful for describing
different characteristics of one or two variables measured on the individuals of one population
of interest. Note that we have highlighted in italics three words that are crucial for understanding
the theory that we have covered: parameters, variables and population. We measure or observe
variables on the individuals of one population and these observations are used to compute the
parameters of interest. The notation we are using takes these three concepts into account:
we assign different symbols to the different parameters that have been defined, for instance,
we use a bar ( ¯ ) to denote the mean, a breve ( ˘ ) to denote the median, an r to denote
the correlation, and so on. Furthermore, we explicitly indicate which variable or variables are
involved in the calculation of the parameter and what population or set of individuals we are
talking about. For instance, $\bar{x}_U$ means that we are calculating the mean of the variable x for
the individuals of the set U, and $r_{xy,U}$ means that we are calculating the correlation between the
variables x and y for the individuals of the population U. Mastering this notation is a very
important and useful step for understanding the remainder of the course.

6 Simple linear regression: the descriptive approach


Consider the situation where we have measurements of K + 1 variables $x_1, x_2, \cdots, x_K$ and
y on a set s of n elements, i.e. the information available for the ith element is of the type
$(x_{1,i}, x_{2,i}, \cdots, x_{K,i}, y_i)$ for all $i = 1, 2, \cdots, n$. This information is typically stored in a rectangular
array as shown in Table 24.

x_1      x_2      ...   x_K      y
x_{1,1}  x_{2,1}  ...   x_{K,1}  y_1
x_{1,2}  x_{2,2}  ...   x_{K,2}  y_2
...      ...      ...   ...      ...
x_{1,n}  x_{2,n}  ...   x_{K,n}  y_n

Table 24: Array collecting the information in a regression problem

With this information we want to describe y as a function of the x-variables, i.e. we want
to express y as $y = f(x_1, x_2, \cdots, x_K)$. To begin with, in this section we will consider the case
with only one x-variable. Later, in Section 7, we will return to the more general case with K
x-variables.
Consider the situation where we have measurements of two variables x and y on a set s of n
elements, i.e. the information available is of the type $(x_1, y_1), (x_2, y_2), \cdots, (x_n, y_n)$. With this
information we want to describe y as a function of x, i.e. we want to express y as $y = f(x)$. To
begin with, we will consider the case where $f(x) = b_0 + b_1 x$. Therefore our task is to express
y as a linear function of x as well as possible. Needless to say, if we want to express y as a
linear function of x, it is because we have reasons to believe that there is a linear relationship
between the two variables.
We illustrate the idea with an example.

Example 50. Let us consider a set s of n = 10 companies producing tables; let $x_i$ be the number
of workers in the ith company and $y_i$ the number of tables produced during one particular day
by the ith company. Table 25 shows the values of x and y and Figure 35 shows a scatter plot
of both variables. The scatter plot reveals some association between the two variables: a higher
number of workers is associated with a higher number of tables. Furthermore, this association
can be adequately described by a straight line. □

i xi yi
1 12 20
2 14 21
3 15 27
4 18 30
5 19 32
6 24 50
7 26 54
8 27 57
9 28 61
10 30 60
Table 25: Number of tables y produced by x workers

Figure 35: Scatter plot of number of workers x and number of tables y.
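A minimal R sketch that reproduces a scatter plot like Figure 35 from the data in Table 25:

```r
# Scatter plot of the workers data (Table 25).
x <- c(12, 14, 15, 18, 19, 24, 26, 27, 28, 30)  # number of workers
y <- c(20, 21, 27, 30, 32, 50, 54, 57, 61, 60)  # number of tables
plot(x, y, xlab = "number of workers x", ylab = "number of tables y")
```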

We have said that we want to express y as a function of x. In the previous example we
want to express the number of tables as a function of the number of workers. It makes sense
to believe that the number of tables depends on the number of workers, not the other way
around. For this reason, we call y the dependent variable and x the independent variable. In
some contexts x and y are given different names: for instance, y is also called the output or
the response, and x is also called the input or the explanatory variable.
Once the problem has been identified, i.e. once we have decided that we have a pair of
variables x and y, one of which we would like to express as a linear function of the other
as well as possible, the next natural step is to fit this line. However, there are infinitely many
lines; how do we choose one that is appropriate for our purposes? In other words, how do
we choose the best possible straight line that relates x and y? Some lines are, evidently, not
adequate. For instance, in the top-left panel of Figure 36 we have fitted the line $\hat{y}_1 = 80 - 2x$
to the workers dataset. This line clearly does not describe the relation between the two variables
in an adequate way. In the top-right panel, we have fitted the line $\hat{y}_2 = 20 + 0.25x$, which
does not look too bad for some values but looks quite bad for some others. On the other hand,
in the bottom-left panel, we have fitted the line $\hat{y}_3 = -20 + 3x$ and, in the bottom-right panel,
we have fitted the line $\hat{y}_4 = -13 + 2.5x$. Both these lines seem to adequately describe the
relation between x and y. Which one should we prefer? Can we find any other line that can
be considered to be better?

Figure 36: Four lines fitted to the workers dataset.

First, let us formalize what we mean by a line $\hat{y}$ adequately describing y as a function
of x. Intuitively, we consider a line $\hat{y}$ to be adequate if the distances from it to the points
are small. The lines in the top panels of Figure 36 have large distances to the points; that is
why we immediately consider them to be inadequate. But the lines in the bottom panels have
small distances to the points. Which one should we prefer?
In order to answer this question we need to define a criterion for measuring the distance
from a line $\hat{y} = b_0 + b_1 x$ to the observed points. We will consider the sum of squares error
(SSE):

$$\mathrm{SSE} = \sum_{s} e_i^2 \quad \text{where} \quad e_i = y_i - \hat{y}_i \quad \text{and} \quad \hat{y}_i = b_0 + b_1 x_i.$$

The name sum of squares error can be understood as follows. The value $\hat{y}_i$ is
the approximation to $y_i$ made by the straight line. Therefore the difference $y_i - \hat{y}_i$ can be
interpreted as the error $e_i$ made by the straight line when approximating the ith observation.
Our criterion is simply the sum of the squares of these errors.

Example 51. Let us calculate the sum of squares error (SSE) for each of the four lines $\hat{y}_1$,
$\hat{y}_2$, $\hat{y}_3$ and $\hat{y}_4$ fitted to the workers dataset in Figure 36.
For $\hat{y}_1$ we have the intercept $b_0 = 80$ and the slope $b_1 = -2$, which yields the fitted values
$\hat{y}_{1,i}$ and errors $e_i = y_i - \hat{y}_{1,i}$ shown in the fourth and fifth columns of Table 26, respectively.
Therefore we have

$$\mathrm{SSE}_1 = \sum_{s} e_i^2 = (-36)^2 + (-31)^2 + \cdots + 40^2 = 8012.$$

i xi yi ŷ1,i ei
1 12 20 56 -36
2 14 21 52 -31
3 15 27 50 -23
4 18 30 44 -14
5 19 32 42 -10
6 24 50 32 18
7 26 54 28 26
8 27 57 26 31
9 28 61 24 37
10 30 60 20 40
Table 26: Fitted values and errors for the line $\hat{y}_1 = 80 - 2x$

The sums of squares error for the remaining lines $\hat{y}_2$, $\hat{y}_3$ and $\hat{y}_4$ are found in an analogous
way; we get $\mathrm{SSE}_2 = 10046$, $\mathrm{SSE}_3 = 207$ and $\mathrm{SSE}_4 = 66$. Therefore, according to the SSE
criterion, among these four lines, $\hat{y}_4 = -13 + 2.5x$ is the one that best fits the observations.
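As a quick sketch, the SSE of any candidate line can be computed in R from the data in Table 25; here for $\hat{y}_1$, $\hat{y}_3$ and $\hat{y}_4$, whose values match those reported above up to rounding:

```r
# Sum of squares error of the line b0 + b1 * x for the workers data.
x <- c(12, 14, 15, 18, 19, 24, 26, 27, 28, 30)
y <- c(20, 21, 27, 30, 32, 50, 54, 57, 61, 60)
sse <- function(b0, b1) sum((y - (b0 + b1 * x))^2)
sse(80, -2)    # 8012, the SSE of the line y-hat_1
sse(-20, 3)    # 207, the SSE of the line y-hat_3
sse(-13, 2.5)  # about 66, the SSE of the line y-hat_4
```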


In Example 51 we used the SSE for deciding which line, among four different options, fits the
observations best. The natural next step is to find the best line according to the SSE
criterion, in other words, to find the values of the intercept $b_0$ and the slope $b_1$ that minimize
the SSE. The solution is known as the least squares regression.

Definition 52. Let $(x_1, y_1), (x_2, y_2), \cdots, (x_n, y_n)$ be the observations of two variables x and
y on a set s of n elements. The line that minimizes the SSE is given by the least squares
regression,

$$\hat{y} = b_0 + b_1 x \quad \text{with} \quad b_1 = \frac{S_{xy,s}}{S^2_{x,s}} \quad \text{and} \quad b_0 = \bar{y}_s - b_1 \bar{x}_s \qquad (14)$$

where

$$S_{xy,s} = \frac{1}{n-1} \sum_{s} (x_i - \bar{x}_s)(y_i - \bar{y}_s) \quad \text{and} \quad S^2_{x,s} = \frac{1}{n-1} \sum_{s} (x_i - \bar{x}_s)^2. \; \square$$

Remark: The slope $b_1$ in (14) can alternatively be written as

$$b_1 = r_{xy,s} \frac{S_{y,s}}{S_{x,s}} \quad \text{or} \quad b_1 = \frac{\sum_{s} (x_i - \bar{x}_s)(y_i - \bar{y}_s)}{\sum_{s} (x_i - \bar{x}_s)^2} \quad \text{or} \quad b_1 = \frac{\sum_{s} x_i y_i - n \bar{x}_s \bar{y}_s}{\sum_{s} x_i^2 - n \bar{x}_s^2},$$

where $r_{xy,s}$ is the correlation coefficient between x and y in s, and $S_{x,s}$ and $S_{y,s}$ are the standard
deviations of x and y, respectively.
The Excel functions INTERCEPT() and SLOPE() can be used for obtaining the intercept
$b_0$ and the slope $b_1$ of a least squares regression.
The intercept $b_0$ and, in particular, the slope $b_1$ in a least squares regression have interesting
interpretations. The slope $b_1$ indicates that, on average, a unit change in the independent
variable x is associated to a change of $b_1$ in the dependent variable y. It is worth emphasizing
the need for the expression "on average" in the previous sentence: x and y do not follow a
straight line exactly, so the change is not deterministic; that is why we emphasize
that the slope measures the "average change". The intercept, on the other hand, indicates the
value of the dependent variable y that is expected when the independent variable x is equal
to zero.

Example 53. Let us find the least squares regression for the workers dataset from Examples
50 and 51. We have $\bar{x}_s = 21.3$, $\bar{y}_s = 41.2$, $S^2_{x,s} = 42.01$ and $S_{xy,s} = 106.9$. Therefore

$$b_1 = \frac{S_{xy,s}}{S^2_{x,s}} = \frac{106.9}{42.01} = 2.545 \quad \text{and} \quad b_0 = \bar{y}_s - b_1 \bar{x}_s = 41.2 - 2.545 \cdot 21.3 = -13.02.$$

The fitted regression line is

$$\hat{y} = -13.02 + 2.545\, x,$$

which is shown in Figure 37.

Figure 37: Least squares regression fitted to the workers dataset.

It can be verified that the sum of squares error for the least squares regression is SSE = 56,
which is, in fact, smaller than the SSE of any of the four lines fitted in Example 51.
The intercept $b_0$ is interpreted as: a company with no workers is expected to produce
around -13 tables. Admittedly, in this case, this interpretation makes no sense: it is simply
not possible to produce a negative number of tables. This undesired result is due to the fact
that the value x = 0 is not one of the observed values and is, in fact, quite far from all
of the observed values. One must be careful when extrapolating the results of a regression
to observations that do not belong to the dataset, especially if they are quite different from the
observed values.
The slope $b_1$ is interpreted as: in our set of 10 companies, on average, one extra worker is
associated to 2.5 more tables produced. □
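A minimal sketch of the same fit using R's built-in lm() function:

```r
# Least squares regression for the workers data with lm().
x <- c(12, 14, 15, 18, 19, 24, 26, 27, 28, 30)
y <- c(20, 21, 27, 30, 32, 50, 54, 57, 61, 60)
fit <- lm(y ~ x)
coef(fit)              # intercept about -13.02, slope about 2.545
sum(residuals(fit)^2)  # SSE of the least squares line, about 56
```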
