Skip to content

Sample order matters? #222

@louislamont

Description

@louislamont

Hi, I noticed some strange behavior when running HISAT2 (v 2.1.0) on many files at once. Some background: The data I'm working with are from 2 flow cells (fcA/B), 4 lanes each(L001/2/3/4), paired end, so we have 16 files per sample (e.g., sample1.L001.fcA.R1.fq, sample1.L001.fcA.R2.fq ... sample1.L004.fcB.R1.fq, sample1.L004.fcB.R2.fq). The issue: When setting up the HISAT2 pipeline, I noticed differences in number of aligned reads depending on the order that the fq files for a given sample were supplied to HISAT2 (meaning, when I set up the script manually, samples were entered by flowcell (L001.fcA, L002.fcA, etc). Using bash scripting provided the same samples, but ordered alphabetically - L001.fcA, L001.fcB, etc.). Here are the two summary files that were produced. By flowcell:

112154948 reads; of these:
  112154948 (100.00%) were paired; of these:
    6283267 (5.60%) aligned concordantly 0 times
    99278753 (88.52%) aligned concordantly exactly 1 time
    6592928 (5.88%) aligned concordantly >1 times
    ----
    6283267 pairs aligned concordantly 0 times; of these:
      624298 (9.94%) aligned discordantly 1 time
    ----
    5658969 pairs aligned 0 times concordantly or discordantly; of these:
      11317938 mates make up the pairs; of these:
        6719600 (59.37%) aligned 0 times
        3760871 (33.23%) aligned exactly 1 time
        837467 (7.40%) aligned >1 times
97.00% overall alignment rate

By lane:

112154948 reads; of these:
  112154948 (100.00%) were paired; of these:
    6284631 (5.60%) aligned concordantly 0 times
    99209992 (88.46%) aligned concordantly exactly 1 time
    6660325 (5.94%) aligned concordantly >1 times
    ----
    6284631 pairs aligned concordantly 0 times; of these:
      624071 (9.93%) aligned discordantly 1 time
    ----
    5660560 pairs aligned 0 times concordantly or discordantly; of these:
      11321120 mates make up the pairs; of these:
        6722040 (59.38%) aligned 0 times
        3758172 (33.20%) aligned exactly 1 time
        840908 (7.43%) aligned >1 times

I set up two scripts that explicitly gave the same sets of reads in different order and was able to reproduce the results. When I used featureCounts to get counts from each bam file, they were pretty close to one another (only 3191/60623 genes had different counts), but it seemed odd to me that specifying the same files in two different orders would lead to differences in alignment and counts. Has anyone else noticed anything like this? Am I doing something wrong, is this a bug, or is HISAT functioning properly? Does this post even make sense?

Here are the commands I ran. Samples ordered by flowcell:

hisat2 -p 20 --rna-strandness RF \
--dta -x path/to/indexes/HS2-ens98 \
--un-conc-gz data/testfiles/unaligned/sample1-unaligned.R%.fq.gz \
-1 data/testfiles/trimmed/sample1.L001.fcA.R1.trimmed.fq,\
data/testfiles/trimmed/sample1.L002.fcA.R1.trimmed.fq,\
data/testfiles/trimmed/sample1.L003.fcA.R1.trimmed.fq,\
data/testfiles/trimmed/sample1.L004.fcA.R1.trimmed.fq,\
data/testfiles/trimmed/sample1.L001.fcB.R1.trimmed.fq,\
data/testfiles/trimmed/sample1.L002.fcB.R1.trimmed.fq,\
data/testfiles/trimmed/sample1.L003.fcB.R1.trimmed.fq,\
data/testfiles/trimmed/sample1.L004.fcB.R1.trimmed.fq \
-2 data/testfiles/trimmed/sample1.L001.fcA.R2.trimmed.fq,\
data/testfiles/trimmed/sample1.L002.fcA.R2.trimmed.fq,\
data/testfiles/trimmed/sample1.L003.fcA.R2.trimmed.fq,\
data/testfiles/trimmed/sample1.L004.fcA.R2.trimmed.fq,\
data/testfiles/trimmed/sample1.L001.fcB.R2.trimmed.fq,\
data/testfiles/trimmed/sample1.L002.fcB.R2.trimmed.fq,\
data/testfiles/trimmed/sample1.L003.fcB.R2.trimmed.fq,\
data/testfiles/trimmed/sample1.L004.fcB.R2.trimmed.fq | \
samtools view -q 15 -Su | samtools sort -m 3G -@ 15 - \
-o data/testfiles/bams/sample1-byflowcell.sorted.bam

Samples ordered by lane:

hisat2 -p 20 --rna-strandness RF \
--dta -x path/to/indexes/HS2-ens98 \
--un-conc-gz data/testfiles/unaligned/sample1-unaligned.R%.fq.gz \
-1 data/testfiles/trimmed/sample1.L001.fcA.R1.trimmed.fq,\
data/testfiles/trimmed/sample1.L001.fcB.R1.trimmed.fq,\
data/testfiles/trimmed/sample1.L002.fcA.R1.trimmed.fq,\
data/testfiles/trimmed/sample1.L002.fcB.R1.trimmed.fq,\
data/testfiles/trimmed/sample1.L003.fcA.R1.trimmed.fq,\
data/testfiles/trimmed/sample1.L003.fcB.R1.trimmed.fq,\
data/testfiles/trimmed/sample1.L004.fcA.R1.trimmed.fq,\
data/testfiles/trimmed/sample1.L004.fcB.R1.trimmed.fq \
-2 data/testfiles/trimmed/sample1.L001.fcA.R2.trimmed.fq,\
data/testfiles/trimmed/sample1.L001.fcB.R2.trimmed.fq,\
data/testfiles/trimmed/sample1.L002.fcA.R2.trimmed.fq,\
data/testfiles/trimmed/sample1.L002.fcB.R2.trimmed.fq,\
data/testfiles/trimmed/sample1.L003.fcA.R2.trimmed.fq,\
data/testfiles/trimmed/sample1.L003.fcB.R2.trimmed.fq,\
data/testfiles/trimmed/sample1.L004.fcA.R2.trimmed.fq,\
data/testfiles/trimmed/sample1.L004.fcB.R2.trimmed.fq | \
samtools view -q 15 -Su | samtools sort -m 3G -@ 15 - \
-o data/testfiles/bams/sample1-bylane.sorted.bam

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions