-
Notifications
You must be signed in to change notification settings - Fork 130
Description
Hi, I noticed some strange behavior when running HISAT2 (v 2.1.0) on many files at once. Some background: The data I'm working with are from 2 flow cells (fcA/B), 4 lanes each(L001/2/3/4), paired end, so we have 16 files per sample (e.g., sample1.L001.fcA.R1.fq, sample1.L001.fcA.R2.fq ... sample1.L004.fcB.R1.fq, sample1.L004.fcB.R2.fq). The issue: When setting up the HISAT2 pipeline, I noticed differences in number of aligned reads depending on the order that the fq files for a given sample were supplied to HISAT2 (meaning, when I set up the script manually, samples were entered by flowcell (L001.fcA, L002.fcA, etc). Using bash scripting provided the same samples, but ordered alphabetically - L001.fcA, L001.fcB, etc.). Here are the two summary files that were produced. By flowcell:
112154948 reads; of these:
112154948 (100.00%) were paired; of these:
6283267 (5.60%) aligned concordantly 0 times
99278753 (88.52%) aligned concordantly exactly 1 time
6592928 (5.88%) aligned concordantly >1 times
----
6283267 pairs aligned concordantly 0 times; of these:
624298 (9.94%) aligned discordantly 1 time
----
5658969 pairs aligned 0 times concordantly or discordantly; of these:
11317938 mates make up the pairs; of these:
6719600 (59.37%) aligned 0 times
3760871 (33.23%) aligned exactly 1 time
837467 (7.40%) aligned >1 times
97.00% overall alignment rate
By lane:
112154948 reads; of these:
112154948 (100.00%) were paired; of these:
6284631 (5.60%) aligned concordantly 0 times
99209992 (88.46%) aligned concordantly exactly 1 time
6660325 (5.94%) aligned concordantly >1 times
----
6284631 pairs aligned concordantly 0 times; of these:
624071 (9.93%) aligned discordantly 1 time
----
5660560 pairs aligned 0 times concordantly or discordantly; of these:
11321120 mates make up the pairs; of these:
6722040 (59.38%) aligned 0 times
3758172 (33.20%) aligned exactly 1 time
840908 (7.43%) aligned >1 times
I set up two scripts that explicitly gave the same sets of reads in different order and was able to reproduce the results. When I used featureCounts to get counts from each bam file, they were pretty close to one another (only 3191/60623 genes had different counts), but it seemed odd to me that specifying the same files in two different orders would lead to differences in alignment and counts. Has anyone else noticed anything like this? Am I doing something wrong, is this a bug, or is HISAT functioning properly? Does this post even make sense?
Here are the commands I ran. Samples ordered by flowcell:
hisat2 -p 20 --rna-strandness RF \
--dta -x path/to/indexes/HS2-ens98 \
--un-conc-gz data/testfiles/unaligned/sample1-unaligned.R%.fq.gz \
-1 data/testfiles/trimmed/sample1.L001.fcA.R1.trimmed.fq,\
data/testfiles/trimmed/sample1.L002.fcA.R1.trimmed.fq,\
data/testfiles/trimmed/sample1.L003.fcA.R1.trimmed.fq,\
data/testfiles/trimmed/sample1.L004.fcA.R1.trimmed.fq,\
data/testfiles/trimmed/sample1.L001.fcB.R1.trimmed.fq,\
data/testfiles/trimmed/sample1.L002.fcB.R1.trimmed.fq,\
data/testfiles/trimmed/sample1.L003.fcB.R1.trimmed.fq,\
data/testfiles/trimmed/sample1.L004.fcB.R1.trimmed.fq \
-2 data/testfiles/trimmed/sample1.L001.fcA.R2.trimmed.fq,\
data/testfiles/trimmed/sample1.L002.fcA.R2.trimmed.fq,\
data/testfiles/trimmed/sample1.L003.fcA.R2.trimmed.fq,\
data/testfiles/trimmed/sample1.L004.fcA.R2.trimmed.fq,\
data/testfiles/trimmed/sample1.L001.fcB.R2.trimmed.fq,\
data/testfiles/trimmed/sample1.L002.fcB.R2.trimmed.fq,\
data/testfiles/trimmed/sample1.L003.fcB.R2.trimmed.fq,\
data/testfiles/trimmed/sample1.L004.fcB.R2.trimmed.fq | \
samtools view -q 15 -Su | samtools sort -m 3G -@ 15 - \
-o data/testfiles/bams/sample1-byflowcell.sorted.bam
Samples ordered by lane:
hisat2 -p 20 --rna-strandness RF \
--dta -x path/to/indexes/HS2-ens98 \
--un-conc-gz data/testfiles/unaligned/sample1-unaligned.R%.fq.gz \
-1 data/testfiles/trimmed/sample1.L001.fcA.R1.trimmed.fq,\
data/testfiles/trimmed/sample1.L001.fcB.R1.trimmed.fq,\
data/testfiles/trimmed/sample1.L002.fcA.R1.trimmed.fq,\
data/testfiles/trimmed/sample1.L002.fcB.R1.trimmed.fq,\
data/testfiles/trimmed/sample1.L003.fcA.R1.trimmed.fq,\
data/testfiles/trimmed/sample1.L003.fcB.R1.trimmed.fq,\
data/testfiles/trimmed/sample1.L004.fcA.R1.trimmed.fq,\
data/testfiles/trimmed/sample1.L004.fcB.R1.trimmed.fq \
-2 data/testfiles/trimmed/sample1.L001.fcA.R2.trimmed.fq,\
data/testfiles/trimmed/sample1.L001.fcB.R2.trimmed.fq,\
data/testfiles/trimmed/sample1.L002.fcA.R2.trimmed.fq,\
data/testfiles/trimmed/sample1.L002.fcB.R2.trimmed.fq,\
data/testfiles/trimmed/sample1.L003.fcA.R2.trimmed.fq,\
data/testfiles/trimmed/sample1.L003.fcB.R2.trimmed.fq,\
data/testfiles/trimmed/sample1.L004.fcA.R2.trimmed.fq,\
data/testfiles/trimmed/sample1.L004.fcB.R2.trimmed.fq | \
samtools view -q 15 -Su | samtools sort -m 3G -@ 15 - \
-o data/testfiles/bams/sample1-bylane.sorted.bam