-
Notifications
You must be signed in to change notification settings - Fork 2k
Expand file tree
/
Copy pathcollinearity.Rmd
More file actions
253 lines (211 loc) · 5.36 KB
/
collinearity.Rmd
File metadata and controls
253 lines (211 loc) · 5.36 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
---
title: "Collinearity"
author: "Rafa"
date: "January 31, 2015"
output: html_document
layout: page
---
```{r options, echo=FALSE}
library(knitr)
opts_chunk$set(fig.path=paste0("figure/", sub("(.*).Rmd","\\1",basename(knitr:::knit_concord$get('infile'))), "-"))
```
## Collinearity
If an experiment is designed incorrectly we may not be able to estimate the parameters of interest. Similarly, when analyzing data we may incorrectly decide to use a model that can't be fit. If we are using linear models then we can detect these problems mathematically by looking for collinearity in the design matrix.
#### System of equations example
The following system of equations:
$$
\begin{align*}
a+c &=1\\
b-c &=1\\
a+b &=2
\end{align*}
$$
has more than one solution since there are an infinite number of triplets that satisfy $a=1-c, b=1+c$. Two examples are $a=1,b=1,c=0$ and $a=0,b=2,c=1$.
#### Matrix algebra approach
The system of equations above can be written like this:
$$
\,
\begin{pmatrix}
1&0&1\\
0&1&-1\\
1&1&0\\
\end{pmatrix}
\begin{pmatrix}
a\\
b\\
c
\end{pmatrix}
=
\begin{pmatrix}
1\\
1\\
2
\end{pmatrix}
$$
Note that the third column is a linear combination of the first two:
$$
\,
\begin{pmatrix}
1\\
0\\
1
\end{pmatrix}
+
-1 \begin{pmatrix}
0\\
1\\
1
\end{pmatrix}
=
\begin{pmatrix}
1\\
-1\\
0
\end{pmatrix}
$$
We say that the third column is collinear with the first 2. This implies that the system of equations can be written like this:
$$
\,
\begin{pmatrix}
1&0&1\\
0&1&-1\\
1&1&0
\end{pmatrix}
\begin{pmatrix}
a\\
b\\
c
\end{pmatrix}
=
a
\begin{pmatrix}
1\\
0\\
1
\end{pmatrix}
+
b \begin{pmatrix}
0\\
1\\
1
\end{pmatrix}
+
c
\begin{pmatrix}
1-0\\
0-1\\
1-1
\end{pmatrix}
$$
$$
=(a+c)
\begin{pmatrix}
1\\
0\\
1\\
\end{pmatrix}
+
(b-c)
\begin{pmatrix}
0\\
1\\
1\\
\end{pmatrix}
$$
The third column does not add a constraint and what we really have are three equations and two unknowns: $a+c$ and $b-c$. Once we have values for those two quantities, there are an infinity number of triplets that can be used.
#### Collinearity and least squares
Consider a design matrix $\mathbf{X}$ with two collinear columns. Here we create an extreme example in which one column is the opposite of another:
$$
\mathbf{X} = \begin{pmatrix}
\mathbf{1}&\mathbf{X}_1&\mathbf{X}_2&\mathbf{X}_3\\
\end{pmatrix}
\mbox{ with, say, }
\mathbf{X}_3 = - \mathbf{X}_2
$$
This means that we can rewrite the residuals like this:
$$
\mathbf{Y}- \left\{ \mathbf{1}\beta_0 + \mathbf{X}_1\beta_1 + \mathbf{X}_2\beta_2 + \mathbf{X}_3\beta_3\right\}\\
= \mathbf{Y}- \left\{ \mathbf{1}\beta_0 + \mathbf{X}_1\beta_1 + \mathbf{X}_2\beta_2 - \mathbf{X}_2\beta_3\right\}\\
= \mathbf{Y}- \left\{\mathbf{1}\beta_0 + \mathbf{X}_1 \beta_1 + \mathbf{X}_2(\beta_2 - \beta_3)\right\}
$$
and if $\hat{\beta}_1$, $\hat{\beta}_2$, $\hat{\beta}_3$ is a least squares solution, then, for example, $\hat{\beta}_1$, $\hat{\beta}_2+1$, $\hat{\beta}_3+1$ is also a solution.
#### Confounding as an example
Now we will demonstrate how collinearity helps us determine problems with our design using one of the most common errors made in current experimental design: confounding. To illustrate, let's use an imagined experiment in which we are interested in the effect of four treatments A, B, C and D. We assign two mice to each treatment. After starting the experiment by giving A and B to female mice, we realize there might be a sex effect.
We decide to give C and D to males with hopes of estimating this effect. But can we estimate the sex effect? The described design implies the following design matrix:
$$
\,
\begin{pmatrix}
Sex & A & B & C & D\\
0 & 1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 \\
1 & 0 & 0 & 1 & 0 \\
1 & 0 & 0 & 1 & 0 \\
1 & 0 & 0 & 0 & 1 \\
1 & 0 & 0 & 0 & 1\\
\end{pmatrix}
$$
Here we can see that sex and treatment are confounded. Specifically, the sex column can be written as a linear combination of the C and D matrices.
$$
\,
\begin{pmatrix}
Sex \\
0\\
0 \\
0 \\
0 \\
1\\
1\\
1 \\
1 \\
\end{pmatrix}
=
\begin{pmatrix}
C \\
0\\
0\\
0\\
0\\
1\\
1\\
0\\
0\\
\end{pmatrix}
+
\begin{pmatrix}
D \\
0\\
0\\
0\\
0\\
0\\
0\\
1\\
1\\
\end{pmatrix}
$$
This implies that a unique least squares estimate is not achievable.
## Rank
The _rank_ of a matrix columns is the number of columns that are independent of all the others. If the rank is smaller than the number of columns, then the LSE are not unique. In R, we can obtain the rank of matrix with the function `qr`, which we will describe in more detail in a following section.
```{r}
Sex <- c(0,0,0,0,1,1,1,1)
A <- c(1,1,0,0,0,0,0,0)
B <- c(0,0,1,1,0,0,0,0)
C <- c(0,0,0,0,1,1,0,0)
D <- c(0,0,0,0,0,0,1,1)
X <- model.matrix(~Sex+A+B+C+D-1)
cat("ncol=",ncol(X),"rank=", qr(X)$rank,"\n")
```
Here we will not be able to estimate the effect of sex.
## Removing Confounding
This particular experiment could have been designed better. Using the same number of male and female mice, we can easily design an experiment that allows us to compute the sex effect as well as all the treatment effects. Specifically, when we balance sex and treatments, the confounding is removed as demonstrated by the fact that the rank is now the same as the number of columns:
```{r}
Sex <- c(0,1,0,1,0,1,0,1)
A <- c(1,1,0,0,0,0,0,0)
B <- c(0,0,1,1,0,0,0,0)
C <- c(0,0,0,0,1,1,0,0)
D <- c(0,0,0,0,0,0,1,1)
X <- model.matrix(~Sex+A+B+C+D-1)
cat("ncol=",ncol(X),"rank=", qr(X)$rank,"\n")
```