---
layout: page
title: Population, Samples, and Estimates
---
```{r options, echo=FALSE}
library(knitr)
opts_chunk$set(fig.path=paste0("figure/", sub("(.*).Rmd","\\1",basename(knitr:::knit_concord$get('infile'))), "-"))
```
## Populations, Samples and Estimates
Now that we have introduced the idea of a random variable, a null distribution, and a p-value, we are ready to describe the mathematical theory that permits us to compute p-values in practice. We will also learn about confidence intervals and power calculations.
#### Population parameters
A first step in statistical inference is to understand what population
you are interested in. In the mouse weight example, we have two
populations: female mice on control diets and female mice on high fat
diets, with weight being the outcome of interest. We consider these
populations to be fixed, and the randomness comes from the
sampling. One reason we have been using this dataset as an example is
that we happen to have the weights of all the mice of this
type. We download [this](https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/mice_pheno.csv) file to our working directory and read it into R:
```{r,message=FALSE,echo=FALSE}
library(downloader)
dir <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/"
filename <- "mice_pheno.csv"
url <- paste0(dir, filename)
if (!file.exists(filename)) download(url,destfile=filename)
```
```{r}
dat <- read.csv("mice_pheno.csv")
```
We can then access the population values and determine, for example, how many we have. Here we compute the size of the control population:
```{r,message=FALSE}
library(dplyr)
controlPopulation <- filter(dat,Sex == "F" & Diet == "chow") %>%
select(Bodyweight) %>% unlist
length(controlPopulation)
```
We usually denote these values as $x_1,\dots,x_m$. In this case, $m$ is the number computed above. We can do the same for the high fat diet population:
```{r}
hfPopulation <- filter(dat,Sex == "F" & Diet == "hf") %>%
select(Bodyweight) %>% unlist
length(hfPopulation)
```
and denote these values as $y_1,\dots,y_n$.
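As a rough illustration of what the subsetting above does, here is a base R sketch on a tiny made-up data frame. The column names `Sex`, `Diet` and `Bodyweight` match those used above, but the name `toy` and all of its values are invented for this example and are not the mouse data:

```r
# Toy data frame (hypothetical values, NOT the mouse data)
toy <- data.frame(
  Sex        = c("F", "F", "M", "F", "M"),
  Diet       = c("chow", "hf", "chow", "chow", "hf"),
  Bodyweight = c(21.5, 25.7, 30.1, 23.0, 34.0)
)

# Keep only the rows for females on each diet and pull out the
# weights as plain numeric vectors
toyControl <- toy$Bodyweight[toy$Sex == "F" & toy$Diet == "chow"]
toyHf      <- toy$Bodyweight[toy$Sex == "F" & toy$Diet == "hf"]

length(toyControl)  # plays the role of m
length(toyHf)       # plays the role of n
```

The `filter`, `select` and `unlist` pipeline used above should yield the same values; the only difference is that `unlist` attaches names to the elements of the resulting vector.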
We can then define summaries of interest for these populations, such as the mean and variance.
The mean:
$$\mu_X = \frac{1}{m}\sum_{i=1}^m x_i \mbox{ and } \mu_Y = \frac{1}{n} \sum_{i=1}^n y_i$$
The variance:
$$\sigma_X^2 = \frac{1}{m}\sum_{i=1}^m (x_i-\mu_X)^2 \mbox{ and } \sigma_Y^2 = \frac{1}{n} \sum_{i=1}^n (y_i-\mu_Y)^2$$
with the standard deviation being the square root of the variance. We refer to such quantities that can be obtained from the population as _population parameters_. The question we started out asking can now be written mathematically: is $\mu_Y - \mu_X = 0$ ?
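The formulas above can be sketched in R as follows. One caveat: R's built-in `var` and `sd` functions divide by the number of values minus one rather than by the number of values, so they do not compute the population variance as defined here. The helper names `popvar` and `popsd` and the vector `x` are assumptions introduced for this illustration, and the numbers in `x` are made up rather than taken from the mouse data:

```r
# Population variance as defined above: divide by m, not m - 1
popvar <- function(x) mean((x - mean(x))^2)

# Population standard deviation: square root of the population variance
popsd <- function(x) sqrt(popvar(x))

# A toy "population" of m = 8 values (hypothetical numbers)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
m <- length(x)

mean(x)    # population mean
popvar(x)  # population variance
popsd(x)   # population standard deviation

# var() divides by m - 1, so rescaling it recovers the population variance
var(x) * (m - 1) / m
```

With the full data loaded, the same helpers could be applied to `controlPopulation` and `hfPopulation`, and the difference of the two population means could then be compared with 0 directly.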
Although in our illustration we have all the values and can check if this is true, in practice we do not. For example, it would be prohibitively expensive to buy all the mice in a population. Here we learn how taking a _sample_ permits us to answer our questions. This is the essence of statistical inference.
#### Sample estimates
In the previous chapter, we obtained samples of 12 mice from each
population. We represent data from samples with capital letters to
indicate that they are random. This is common practice in statistics,
although it is not always followed. So the samples are $X_1,\dots,X_M$
and $Y_1,\dots,Y_N$ and, in this case, $N=M=12$. In contrast and as we
saw above, when we list out the values of the population, which are
set and not random, we use lower-case letters.
Since we want to know if $\mu_Y - \mu_X$ is 0, we consider the sample version: $\bar{Y}-\bar{X}$ with:
$$
\bar{X}=\frac{1}{M} \sum_{i=1}^M X_i
\mbox{ and }\bar{Y}=\frac{1}{N} \sum_{i=1}^N Y_i.
$$
Note that this difference of averages is also a random
variable. Previously, we learned about the behavior of random variables
with an exercise that involved repeatedly sampling from the original
distribution. Of course, this is not an exercise that we can execute
in practice. In this particular case it would involve buying 24 mice
over and over again. Here we describe the mathematical theory that
relates $\bar{X}$ to $\mu_X$ and $\bar{Y}$ to $\mu_Y$,
which will in turn help us understand the relationship between
$\bar{Y}-\bar{X}$ and $\mu_Y - \mu_X$. Specifically, we will describe
how the Central Limit Theorem permits us to use an approximation to
answer this question, as well as motivate the widely used t-distribution.
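To make the randomness of $\bar{Y}-\bar{X}$ concrete, here is a minimal simulation sketch. It does not use the mouse data: the two populations `x_pop` and `y_pop` are simulated stand-ins, and their sizes, means and standard deviations are arbitrary assumptions chosen only for illustration. The sample sizes $M=N=12$ follow the text above.

```r
set.seed(1)

# Simulated stand-in populations (made-up parameters, NOT the mouse data)
x_pop <- rnorm(200, mean = 24, sd = 3.5)
y_pop <- rnorm(200, mean = 26, sd = 5)

M <- 12
N <- 12

# One sample from each population and the resulting difference of averages
X <- sample(x_pop, M)
Y <- sample(y_pop, N)
mean(Y) - mean(X)

# Repeating the sampling shows that the difference changes from draw to draw
diffs <- replicate(1000, mean(sample(y_pop, N)) - mean(sample(x_pop, M)))

# The population difference is a fixed number ...
mean(y_pop) - mean(x_pop)

# ... while the sample differences scatter around it
mean(diffs)
sd(diffs)
```

In this sketch the repeated sampling is free because the populations live in memory. With real mice each repetition would mean buying another 24 animals, which is why we need theory to describe the behavior of `diffs` from a single sample.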