#r "nuget: FSharpAux, 1.1.0"
#r "nuget: Plotly.NET.Interactive, 4.0.0"
#r "nuget: FSharp.Stats, 0.4.11"
open Plotly.NET
open Plotly.NET.StyleParam
open Plotly.NET.LayoutObjects
module Chart =
let myAxis name = LinearAxis.init(Title=Title.init name,Mirror=StyleParam.Mirror.All,Ticks=StyleParam.TickOptions.Inside,ShowGrid=false,ShowLine=true)
let withAxisTitles x y chart =
chart
|> Chart.withTemplate ChartTemplates.lightMirrored
|> Chart.withXAxis (myAxis x)
|> Chart.withYAxis (myAxis y)
High throughput techniques like microarrays with its successor RNA-Seq and mass spectrometry proteomics lead to an huge data amount. Thousands of features (e.g. transcripts or proteins) are measured simultaneously. Differential expression analysis aims to identify features, that change significantly between two conditions. A common experimental setup is the analysis of which genes are over- or underexpressed between e.g. a wild type and a mutant.
Hypothesis tests aim to identify differences between two or more samples. The most common statistical test is the t test that tests a difference of means. Hypothesis tests report a p value, that correspond the probability of obtaining results at least as extreme as the observed results, assuming that the null hypothesis is correct. In other words:
If there is no effect (no mean difference), a p value of 0.05 indicates that in 5 % of the tests a false positive is reported.
Consider two population distributions that follow a normal distribution. Both have the same mean and standard deviation.
open FSharpAux
open FSharp.Stats
let distributionA = Distributions.ContinuousDistribution.normal 10.0 1.0
let distributionB = Distributions.ContinuousDistribution.normal 10.0 1.0
[
Chart.Area([5. .. 0.01 .. 15.] |> List.map (fun x -> x,distributionA.PDF x),Name = "distA")
Chart.Area([5. .. 0.01 .. 15.] |> List.map (fun x -> x,distributionB.PDF x),Name = "distB")
]
|> Chart.combine
|> Chart.withAxisTitles "variable X" "relative count"
|> Chart.withSize (900.,600.)
|> Chart.withTitle "null hypothesis"
Samples with sample size 5 are randomly drawn from both population distributions.
Both samples are tested if a mean difference exist using a two sample t test where equal variances of the underlying population distribution are assumed.
let getSample n (dist: Distributions.ContinuousDistribution<float,float>) =
Vector.init n (fun _ -> dist.Sample())
let sampleA = getSample 5 distributionA
let sampleB = getSample 5 distributionB
(Testing.TTest.twoSample true sampleA sampleB).PValue
0.760894895653307

Fig 1: p value distribution of the null hypothesis.
10,000 tests are performed, each with new randomly drawn samples. This corresponds to an experiment in which none of the features changed Note, that the mean intensities are arbitrary and must not be the same for all features! In the presented case all feature intensities are in average 10. The same simulation can be performed with pairwise comparisons from distributions that differ for each feature, but are the same within the feature. The resulting p values are uniformly distributed between 0 and 1.
Samples are called significantly different, if their p value is below a certain significance threshold ($\alpha$ level). While "the lower the better", a common threshold is a p value of 0.05 or 0.01. In the presented case in average $10,000 * 0.05 = 500$ tests are significant (red box), even though the populations do not differ. They are called false positives (FP). Now lets repeat the same experiment, but this time sample 70% of the time from null features (no difference) and add 30% samples of truly differing distributions. Therefore a third populations is generated, that differ in mean, but has an equal standard deviation:
let distributionC = Distributions.ContinuousDistribution.normal 11.5 1.0
Fig 2: p value distribution of the alternative hypothesis. Blue coloring indicate p values deriving from distribution A and B (null).
Orange coloring indicate p values deriving from distribution A and C (truly differing).
The pvalue distribution of the tests resulting from truly differing populations are right skewed, while the null tests again show a homogeneous distribution between 0 and 1. Many, but not all of the tests that come from the truly differing populations are below 0.05, and therefore would be reported as significant. In average 350 null features would be reported as significant even though they derive from null features (blue bars, 10,000 x 0.7 x 0.05 = 350).
The hypothesis testing framework with the p value definition given above was developed for performing just one test. If many tests are performed, like in modern high throughput studies, the probability to obtain a false positive result increases. The probability of at least one false positive is called Familywise error rate (FWER) and can be determined by $FWER=1-(1-\alpha)^m$ where $\alpha$ corresponds to the significance threshold (here 0.05) and $m$ is the number of performed tests.
let bonferroniLine =
Shape.init(
ShapeType = ShapeType.Line,
X0 = 0.,
X1 = 35.,
Y0 = 0.05,
Y1 = 0.05,
Line=Line.init(Dash=DrawingStyle.Dash)
)
[1..35]
|> List.map (fun x ->
x,(1. - (1. - 0.05)**(float x))
)
|> Chart.Point
|> Chart.withYAxisStyle("",MinMax=(0.,1.))
|> Chart.withAxisTitles "#tests" "p(at least one FP)"
|> Chart.withShape bonferroniLine
|> Chart.withTitle "FWER"
Fig 3: Familiy wise error rate depending on number of performed tests. The black dashed line indicates the Bonferroni corrected FWER by $p^* = \frac{\alpha}{m}$ .
When 10,000 null features are tested with a p value threshold of 0.05, in average 500 tests are reported significant even if there is not a single comparison in which the populations differ. If some of the features are in fact different, the number of false positives consequently decreases (remember, the p value is defined for tests of null features).
Why the interpretation of high throughput data based on p values is difficult: The more features are measured, the more false positives you can expect. If 100 differentially expressed genes are identified by p value thresholding, without further information about the magnitude of expected changes and the total number of measured transcripts, the data is useless.
The p value threshold has no straight-forward interpretation when many tests are performed. Of course, you could restrict the family wise error rate to 0.05, regardless how many tests are performed. This is realized by dividing the $\alpha$ significance threshold by the number of tests, which is known as Bonferroni correction: $p^* = \frac{\alpha}{m}$. This correction drastically limits the false positive rate, but in an experiment with a huge count of expected changes, it additionally would result in many false negatives. The FWER should be chosen if the costs of follow up studies to tests the candidates are dramatic or there is a huge waste of time to potentially study false positives.
A more reasonable measure of significance with a simple interpretation is the so-called false discovery rate (FDR). It describes the rate of expected false positives within the overall reported significant features. The goal is to identify as many sig. features as possible while incurring a relatively low proportion of false positives. Consequently, a set of reported significant features together with the FDR describes the confidence of this set, without the requirement to somehow incorporate the uncertainty that is introduced by the total number of tests performed. In the simulated case of 7,000 null tests and 3,000 tests resulting from truly differing distributions above, the FDR can be calculated exactly. Therefore at e.g. a p value of 0.05 the number of false positives (blue in red box) are divided by the number of significant reported tests (false positives + true positives).

Fig 4: p value distribution of the alternative hypothesis.
Given the conditions described in the first chapter, the FDR of this experiment with a p value threshold of 0.05 is 0.173. Out of the 2019 reported significant comparisons, in average 350 are expected to be false positives, which gives an straight forward interpretation of the data confidence. In real-world experiments the proportion of null tests and tests deriving from an actual difference is of course unknown. The proportion of null tests however tends to be distributed equally in the p value histogram. By identification of the average null frequency, a proportion of FP and TP can be determined, and the FDR can be defined. This frequency estimate is called $\pi_0$, which leads to an FDR definition of:

Fig 5: FDR calculation on simulated data.
Consequently, for each presented p value a corresponding FDR can be calculated. The minimum local FDR at each p value is called q value.
$$\hat q(p_i) = min_{t \geq p_i} \hat{FDR}(p_i)$$Since the q value is not monotonically increasing, it is smoothed by assigning the lowest FDR of all p values, that are equal or higher the current one.
By defining $\pi_0$, all other parameters can be calculated from the given p value distribution to determine the all q values. The most prominent FDR-controlling method is known as Benjamini-Hochberg correction. It sets $\pi_0$ as 1, assuming that all features are null. In studies with an expected high proportion of true positives, a $\pi_0$ of 1 is too conservative, since there definitely are true positives in the data.
A better estimation of $\pi_0$ is given in the following:
True positives are assumed to be right skewed while null tests are distributed equally between 0 and 1. Consequently, the right flat region of the p value histogram tends to correspond to the true frequency of null comparisons (Fig 5). As real world example 9856 genes were measured in triplicates under two conditions (control and treatment). The p value distribution of two sample t tests looks as follows:
let examplePVals =
System.IO.File.ReadAllLines(@"../../files/pvalExample.txt")
|> Array.tail
|> Array.map float
//number of tests
let m =
examplePVals
|> Array.length
|> float
let nullLine =
Shape.init(
ShapeType = ShapeType.Line,
X0 = 0.,
X1 = 1.,
Y0 = 1.,
Y1 = 1.,
Line=Line.init(Dash=DrawingStyle.Dash)
)
let empLine =
Shape.init(
ShapeType = ShapeType.Line,
X0 = 0.,
X1 = 1.,
Y0 = 0.4,
Y1 = 0.4,
Line=Line.init(Dash=DrawingStyle.DashDot,Color=Color.fromHex "#FC3E36")
)
[
[
examplePVals
|> Distributions.Frequency.create 0.025
|> Map.toArray
|> Array.map (fun (k,c) -> k,float c / (m * 0.025))
|> Chart.Column
|> Chart.withTraceInfo "density"
|> Chart.withAxisTitles "p value" "density"
|> Chart.withShapes [nullLine;empLine]
examplePVals
|> Distributions.Frequency.create 0.025
|> Map.toArray
|> Array.map (fun (k,c) -> k,float c)
|> Chart.Column
|> Chart.withTraceInfo "gene count"
|> Chart.withAxisTitles "p value" "gene count"
]
]
|> Chart.Grid()
|> Chart.withSize(1100.,550.)
Fig 6: p value distributions of real world example. The frequency is given on the right, its density on the left. The black dashed line indicates the distribution, if all features were null. The red dash-dotted line indicates the visual estimated pi0.
By performing t tests for all comparisons 3743 (38 %) of the genes lead to a pvalue lower than 0.05. By eye, you would estimate $\pi_0$ as 0.4, indicating, only a small fraction of the genes is unaltered (null). After q value calculations, you would filter for a specific FDR (e.g. 0.05) and end up with an p value threshold of 0.04613, indicating a FDR of max. 0.05 in the final reported 3642 genes.
pi0 = 0.4
m = 9856
D(p) = number of sig. tests at given p
FP(p) = p*0.4*9856
FDR(p) = FP(p) / D(p)
FDR(0.04613) = 0.4995
let pi0 = 0.4
let getD p =
examplePVals
|> Array.sumBy (fun x -> if x <= p then 1. else 0.)
let getFP p = p * pi0 * m
let getFDR p = (getFP p) / (getD p)
let qvaluesNotSmoothed =
examplePVals
|> Array.sort
|> Array.map (fun x ->
x, getFDR x)
|> Chart.Line
|> Chart.withTraceInfo "not smoothed"
let qvaluesSmoothed =
let pValsSorted =
examplePVals
|> Array.sortDescending
let rec loop i lowest acc =
if i = pValsSorted.Length then
acc |> List.rev
else
let p = pValsSorted.[i]
let q = getFDR p
if q > lowest then
loop (i+1) lowest ((p,lowest)::acc)
else loop (i+1) q ((p,q)::acc)
loop 0 1. []
|> Chart.Line
|> Chart.withTraceInfo "smoothed"
let eXpos = examplePVals |> Array.filter (fun x -> x <= 0.046135) |> Array.length
[qvaluesNotSmoothed;qvaluesSmoothed]
|> Chart.combine
|> Chart.withYAxisStyle("",MinMax=(0.,1.))
|> Chart.withAxisTitles "p value" "q value"
|> Chart.withShape empLine
|> Chart.withTitle (sprintf "#[genes with q value < 0.05] = %i" eXpos)