-
Notifications
You must be signed in to change notification settings - Fork 2k
Expand file tree
/
Copy pathranktest.Rmd
More file actions
72 lines (56 loc) · 2.25 KB
/
ranktest.Rmd
File metadata and controls
72 lines (56 loc) · 2.25 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
---
layout: page
title: Rank tests
---
```{r options, echo=FALSE}
library(knitr)
opts_chunk$set(fig.path=paste0("figure/", sub("(.*).Rmd","\\1",basename(knitr:::knit_concord$get('infile'))), "-"))
```
## Wilcoxon Rank Sum Test
We learned how the sample mean and SD are susceptible to outliers. The
t-test is based on these measures and is susceptible as well. The
Wilcoxon rank test (equivalent to the Mann-Whitney test) provides an
alternative. In the code below, we perform a t-test on data for which
the null is true. However, we change one sum observation by mistake
in each sample and the values incorrectly entered are different. Here
we see that the t-test results in a small p-value, while the Wilcoxon
test does not:
```{r}
set.seed(779) ##779 picked for illustration purposes
N=25
x<- rnorm(N,0,1)
y<- rnorm(N,0,1)
```
Create outliers:
```{r}
x[1] <- 5
x[2] <- 7
cat("t-test pval:",t.test(x,y)$p.value)
cat("Wilcox test pval:",wilcox.test(x,y)$p.value)
```
The basic idea is to 1) combine all the data, 2) turn the values into ranks, 3) separate them back into their groups, and 4) compute the sum or average rank and perform a test.
```{r rank-test-illustration, fig.cap="Data from two populations with two outliers. The left plot shows the original data and the right plot shows their ranks. The numbers are the w values ",fig.width=10.5,fig.height=5.25}
library(rafalib)
mypar(1,2)
stripchart(list(x,y),vertical=TRUE,ylim=c(-7,7),ylab="Observations",pch=21,bg=1)
abline(h=0)
xrank<-rank(c(x,y))[seq(along=x)]
yrank<-rank(c(x,y))[-seq(along=x)]
stripchart(list(xrank,yrank),vertical=TRUE,ylab="Ranks",pch=21,bg=1,cex=1.25)
ws <- sapply(x,function(z) rank(c(z,y))[1]-1)
text( rep(1.05,length(ws)), xrank, ws, cex=0.8)
W <-sum(ws)
```
`W` is the sum of the ranks for the first group relative to the second
group. We can compute an exact p-value for $W$ based on
combinatorics. We can also use the CLT since
statistical theory tells us that this `W` is approximated by the
normal distribution. We can construct a z-score as follows:
```{r}
n1<-length(x);n2<-length(y)
Z <- (mean(ws)-n2/2)/ sqrt(n2*(n1+n2+1)/12/n1)
print(Z)
```
Here the `Z` is not large enough to give us a p-value less
than 0.05. These are part of the calculations performed by the R function
`wilcox.test`.