NGS Unit 1

The document provides an overview of Next-Generation Sequencing (NGS) data analysis and high-performance computing (HPC), detailing its applications in various biological research areas such as cancer and microbiology. It discusses different NGS platforms, their costs, and sequencing technologies, including Illumina and Oxford Nanopore, along with essential file formats like FASTQ and SAM/BAM used for storing sequencing data. The content is aimed at equipping researchers with knowledge about computational tools and methods for effective NGS data analysis.

Uploaded by

Karan Palukuri

NGS DATA ANALYSIS & HPC

22BT741 – Unit-1

Dr. Prashantha Karunakar

Associate Professor
Department of Biotechnology
Dayananda Sagar College of Engineering
Bangalore
Computational Tools for NGS Data Analysis and basics of Linux
NGS

What is NGS used for?

NGS is used to sequence vast amounts of genetic material, enabling researchers
to take a broad, unbiased approach to scientific research in a variety of
applications and biological systems.
- Rather than profiling select markers, you can identify variants across
thousands of regions, down to single-base resolution, in a single experiment.
NGS expands the scope of your experimental studies to help find the
answers to your boldest research questions.

Using NGS in key research areas

NGS can play an important role in pursuing the answer to a variety of
biological questions using a wide array of published methods for diverse
sample types. NGS enables the unbiased investigation of multiple biological
“omes”, such as the proteome, transcriptome, epigenome, and genome. A
combinatorial approach interrogating multiple omes at once, called
multiomics, can also be achieved with NGS.
Popular NGS methods and applications include:
Cancer research
NGS is used to bulk-sequence tumors and identify genetic mutations in tumors,
aiding in the development of targeted therapies and monitoring cancer
progression through liquid biopsies.

Microbiology and infectious diseases.

NGS helps in pathogen identification, outbreak tracking, and studying
antimicrobial resistance by sequencing the genomes of bacteria, viruses, and
other microbes.
NGS platforms

Platform | Read Type | Instrument Cost (INR) | Read Length | Cost per Gb (INR) | Typical Applications
Illumina NovaSeq X Plus | Short-read (Paired-End) | ~₹10,50,00,000 | 2 × 150 bp | ~₹84–168 | Whole Genome Sequencing, large-scale projects
Illumina NextSeq 2000 | Short-read (Paired-End) | ~₹2,52,00,000 | 2 × 150 bp | ~₹1,260–2,520 | Targeted sequencing, RNA-seq
Illumina MiSeq | Short-read (Paired-End) | ~₹84,00,000 | 2 × 300 bp | ~₹8,400–12,600 | Amplicon sequencing, small genome sequencing
Ion Torrent Genexus | Short-read | ~₹2,52,00,000 | Up to 200 bp | ~₹840–4,200 | Clinical diagnostics, small panels
PacBio Sequel IIe | Long-read (HiFi) | ~₹4,20,00,000–6,30,00,000 | ~10–25 kb | ~₹672–1,260 | Structural variants, isoform sequencing
Oxford Nanopore PromethION 24/48 | Long-read (ultra-long) | ~₹2,10,00,000–4,20,00,000 | >100 kb | ~₹1,680–3,360 | De novo assembly, metagenomics
Oxford Nanopore MinION | Portable, long-read | ~₹84,000 (starter pack) | Up to 1 Mb | ~₹4,200–8,400 | Field-based sequencing, pilot studies, educational use
NGS platforms - DNA quality requirements

Some DNA left in the well

Sharp band of 20+kb

No sign of proteins

No smear of degraded DNA

No sign of RNA

NanoDrop:
260/280 = 1.8 – 2.0
260/230 = 2.0 – 2.2

Qubit or PicoGreen:
10 kb insert libraries: 3-5 µg
20 kb insert libraries: 10-20 µg
NGS platforms - Illumina Sequencing Technology
Illumina sequencing technology provides clonal array formation and proprietary
reversible terminator technology for rapid and accurate large-scale sequencing.
The innovative and flexible sequencing system enables a broad array of
applications in genomics, transcriptomics, and epigenomics.
Cluster Generation
Sequencing templates are immobilized on
a proprietary flow cell surface (Figure 1)
designed to present the DNA in a manner
that facilitates access to enzymes while
ensuring high stability of surface bound
template and low non-specific binding of
fluorescently labeled nucleotides.

Several samples can be loaded onto the eight-lane flow cell for simultaneous analysis on an Illumina Sequencing System.
Solid-phase amplification (Figures 2–7) creates up to 1,000 identical copies of each single
template molecule in close proximity (diameter of one micron or less).
Figure 2: Prepare Genomic DNA Sample. Randomly fragment genomic DNA and ligate adapters to both ends of the fragments.
Figure 3: Attach DNA to Surface. Bind single-stranded fragments randomly to the inside surface of the flow cell channels.
Figure 4: Bridge Amplification. Add unlabeled nucleotides and enzyme to initiate solid-phase bridge amplification.
Figure 5: Fragments Become Double Stranded. The enzyme incorporates nucleotides to build double-stranded bridges on the solid-phase substrate.
Figure 6: Denature the Double-Stranded Molecules. Denaturation leaves single-stranded templates anchored to the substrate.
Figure 7: Complete Amplification. Several million dense clusters of double-stranded DNA are generated in each channel of the flow cell.
Figure 8: Determine First Base. The first sequencing cycle begins by adding four labeled reversible terminators, primers and DNA polymerase.
Figure 9: Image First Base. After laser excitation, the emitted fluorescence from each cluster is captured and the first base is identified.
Figure 10: Determine Second Base. The next cycle repeats the incorporation of four labeled reversible terminators, primers and DNA polymerase.
Figure 11: Image Second Chemistry Cycle. After laser excitation, the image is captured as before and the identity of the second base is recorded.
Figure 12: Sequencing Over Multiple Chemistry Cycles. The sequencing cycles are repeated to determine the sequence of bases in a fragment, one base at a time.
Figure 13: Align Data. The data are aligned and compared to a reference, and sequencing differences are identified.
NGS platforms - Nanopore
Oxford Nanopore Technologies (ONT) offers a range of sequencing devices based on the
principle of passing nucleic acid molecules (DNA or RNA) through a nanoscale pore
embedded in a membrane. As the molecule passes, it creates characteristic disruptions in an
electrical current, which are measured and decoded to determine the sequence.

Image adapted from Oxford Nanopore Technologies. It shows a double-stranded piece of DNA being unzipped
and a single strand passing through a nanopore sensor. The pore sends an electrical signal to show how
much of the current running through the pore is blocked by individual nucleotides (the building blocks of
nucleic acid - DNA and RNA). Specialised software is used to decode the signal to read the sequence.

Alpha-Hemolysin (α-HL)
• A heptameric pore-forming toxin from Staphylococcus aureus.
• Self-assembles into a transmembrane channel with a ~1.5 nm diameter, ideal for ssDNA or
RNA translocation.
• Provides low-noise ionic current and high sensitivity, making it a gold standard for
nanopore-based detection.
• Used to detect DNA/RNA, peptides, and even protein variants by analyzing current
blockades as molecules pass through.
How nanopore sequencing works. The method uses electrophoresis to drive DNA strands or single nucleotides through a very small hole embedded in a membrane. An enzyme motor (not shown) controls the rate at which a DNA molecule passes through the nanopore. The sequence is determined in real-time based on the extent to which the nucleotides disrupt the current flowing through a nanopore sensor.
Credit: Daniel Power.
FASTQ format

FASTQ is a text-based file format used to store both biological sequences (usually nucleotide
sequences) and their corresponding quality scores. It is the standard output format for
Next-Generation Sequencing (NGS) platforms such as Illumina.

Structure of a FASTQ File

Each read in a FASTQ file is represented using four lines:
1. Line 1: Starts with @ followed by a sequence identifier and (optionally) a description.
2. Line 2: The nucleotide sequence (e.g., ATCG...).
3. Line 3: Starts with + and optionally repeats the sequence identifier.
4. Line 4: ASCII-encoded quality scores, one character for each nucleotide in the sequence.

@SEQ_ID
GATTTGGGGTTCAAA
+
!''*((((***+))%
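The four-line record structure can be read with a few lines of Python. This is a minimal sketch, assuming an uncompressed FASTQ stream with exactly four lines per read (no wrapped sequences); the names `FastqRecord` and `read_fastq` and the demo read are made up for illustration.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str  # text after the leading '@' (ID plus optional description)
    sequence: str    # nucleotide sequence
    quality: str     # ASCII-encoded quality string, one character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-read FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Hypothetical one-read file held in memory.
example = io.StringIO("@read_1 demo\nGATTACA\n+\nIIIIHHG\n")
records = list(read_fastq(example))
print(records[0].identifier, records[0].sequence)
```

The length check reflects the rule that line 4 carries exactly one quality character per base in line 2.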
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCC
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**

Each quality character is decoded as its ASCII code minus 33 (Phred+33 encoding):
! → 33 − 33 = 0
% → 37 − 33 = 4
' → 39 − 33 = 6
( → 40 − 33 = 7
) → 41 − 33 = 8
* → 42 − 33 = 9
+ → 43 − 33 = 10
- → 45 − 33 = 12
. → 46 − 33 = 13
1 → 49 − 33 = 16

Per-base quality scores in read order:
Bases 1-10: 0 6 6 9 7 7 7 7 9 9
Bases 11-20: 9 10 8 8 4 4 4 10 10 8
Bases 21-30: 7 4 4 4 4 8 13 16 9 9
Bases 31-40: 9 12 10 9 6 6 8 8 9 9

TOTAL = 306
AVERAGE = 306 / 40 = 7.65
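The same decoding can be checked programmatically. A minimal sketch, assuming Phred+33 encoding as in the calculation above; the function name `phred33_scores` is chosen here for illustration.

```python
def phred33_scores(quality: str) -> list[int]:
    """Convert a Phred+33 quality string to a list of integer scores."""
    return [ord(character) - 33 for character in quality]


quality_line = "!''*((((***+))%%%++)(%%%%).1***-+*''))**"
scores = phred33_scores(quality_line)
total = sum(scores)
average = total / len(scores)
print(total, average)
```

An average this low would normally mark a poor-quality read.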
SAM/BAM format

The SAM (Sequence Alignment/Map) and BAM (Binary Alignment/Map) formats are
foundational in NGS for storing sequence alignment data from high-throughput
sequencing.
A SAM file has two main parts:
1. Header Section (optional)
• Provides metadata like reference genome names and lengths
• Each line begins with @ and contains tab-separated fields
2. Alignment Section
• Each line describes one alignment record (typically one read)
• Has 11 mandatory fields + optional tags

• SAM (Sequence Alignment/Map) is a text-based format to store aligned sequence data.
• Contains metadata (header) and alignment information (mandatory fields).
• Output of aligners like BWA, Bowtie, HISAT2.
• Easily converted to BAM (binary version of SAM) for efficient storage.

• Two sections:
1. Header section (optional): starts with '@', contains metadata
2. Alignment section: one line per read, with 11 mandatory fields
• Optional fields provide extra info (e.g., tags like NM:i:1)
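As an illustration of the alignment section, the sketch below splits one tab-separated line into the 11 mandatory SAM fields plus optional tags. The field names follow the SAM specification; the function name `parse_sam_line` and the alignment line itself are hypothetical. In practice a library such as pysam or the samtools command line would be used instead.

```python
SAM_FIELDS = ("QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL")
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line: str) -> dict:
    """Split one alignment line into mandatory fields and optional tags."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(SAM_FIELDS):
        raise ValueError("an alignment line needs at least 11 fields")
    record = {}
    for name, value in zip(SAM_FIELDS, columns):
        record[name] = int(value) if name in INT_FIELDS else value
    # Anything after the 11th column is an optional TAG:TYPE:VALUE field.
    record["TAGS"] = columns[len(SAM_FIELDS):]
    return record


# Hypothetical alignment line, not taken from real data.
line = "read_1\t16\tchr1\t100\t60\t7M\t*\t0\t0\tGATTACA\tIIIIHHG\tNM:i:1"
record = parse_sam_line(line)
print(record["RNAME"], record["POS"], record["TAGS"])
```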

FLAG Field (Bitwise Explanation)

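• FLAG is an integer in which each bit encodes one read property; the value is the sum of the bits that are set:
1 – Read is paired (paired-end sequencing)
4 – Read is unmapped
16 – Read mapped to reverse strand
1024 – PCR or optical duplicate
A FLAG of 0 means none of the bits is set, i.e. an unpaired read mapped to the forward strand.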
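Because FLAG is a sum of bits, individual properties are recovered with a bitwise AND. The sketch below covers only the bits listed above (the SAM specification defines more); `FLAG_BITS` and `decode_flag` are illustrative names.

```python
# Only the bits discussed above; the SAM specification defines further bits.
FLAG_BITS = {
    1: "read is paired",
    4: "read is unmapped",
    16: "read mapped to reverse strand",
    1024: "PCR or optical duplicate",
}


def decode_flag(flag: int) -> list[str]:
    """Return the meaning of every known bit that is set in FLAG."""
    return [meaning for bit, meaning in FLAG_BITS.items() if flag & bit]


print(decode_flag(0))     # no bits set
print(decode_flag(17))    # 1 + 16
print(decode_flag(1040))  # 16 + 1024
```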

Applications of SAM Format

• Primary output from aligners (e.g., BWA, HISAT2)
• Input for variant callers, quantifiers, and genome browsers
• Used in workflows for:
– Variant calling (GATK)
– Transcriptomics (featureCounts)
– Visualization (IGV)
• Can be converted, sorted, indexed using samtools

A delins variant definition will appear in a CIGAR string as a deletion followed by an insertion, e.g. the 8D4I here:

>frg400 15M8D4I7M1X7M2X5M1X158M
CATTGGAACAGAAAGagatTTATCTGtTGTTTGCagTGAAGgAGTACAAAATG

The reverse-complement of this sequence, and its corresponding CIGAR, looks like
this:
>frg800 158M1X5M2X7M1X7M8D4I15M
CATTTTGTACTcCTTCActGCAAACAaCAGATAAatctCTTTCTGTTCCAATG
The delins sub-CIGAR 8D4I reads the same in both directions. The starting point
refers to the last nucleotide before the insertion, so a mapped starting point will
differ for all cases apart from 1DnI, where ‘n’ means any number.
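The CIGAR strings above can be parsed and reversed with a short script. This is a sketch under stated assumptions: the operation letters and which of them consume query or reference bases follow the SAM specification, while the rule of keeping a delins written as deletion-then-insertion after reversal is taken from the example above. The function names are illustrative.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar: str) -> list[tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    if "".join(f"{length}{op}" for length, op in ops) != cigar:
        raise ValueError(f"invalid CIGAR string: {cigar!r}")
    return ops


def consumed_lengths(ops: list[tuple[int, str]]) -> tuple[int, int]:
    """Return (query bases, reference bases) covered by the alignment."""
    query = sum(length for length, op in ops if op in "MIS=X")
    reference = sum(length for length, op in ops if op in "MDN=X")
    return query, reference


def reverse_cigar(cigar: str) -> str:
    """Reverse the operation order, keeping delins as deletion then insertion."""
    ops = parse_cigar(cigar)[::-1]
    for i in range(len(ops) - 1):
        if ops[i][1] == "I" and ops[i + 1][1] == "D":
            ops[i], ops[i + 1] = ops[i + 1], ops[i]
    return "".join(f"{length}{op}" for length, op in ops)


forward = "15M8D4I7M1X7M2X5M1X158M"
print(consumed_lengths(parse_cigar(forward)))
print(reverse_cigar(forward))
```

For the forward CIGAR this gives 200 query bases against 204 reference bases, the difference being the net 4-base loss from the 8D4I delins.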

Typical File Sizes for Human Whole-Genome Sequencing

File Format | Typical Size | Description
Raw FASTQ files | ~150–200 GB (paired-end) |
SAM | ~400–500 GB | Plain text, uncompressed, very large due to verbose format
BAM | ~100–150 GB | Binary format, compressed, commonly used in pipelines
VCF file format
• VCF is the standard file format for storing variation data.
• VCF is a preferred format because it is unambiguous, scalable, and flexible, allowing
extra information to be added to the INFO field. Many millions of variants can be stored
in a single VCF file.
• VCF files are tab-delimited text files. They typically contain a header section with
metadata, followed by rows where each row represents a genetic variant.
• The format is human-readable and can be opened in any text editor, but specialized tools
are usually used for efficient analysis.
• VCF is widely used in genomics projects such as the 1000 Genomes Project and clinical
variant reporting.

The data section includes the following columns:
CHROM: Chromosome or reference sequence ID.
POS: Position of the variant on the chromosome.
ID: Unique identifier for the variant.
REF: Reference allele.
ALT: Alternate allele(s).
QUAL: Phred-scaled quality score for the variant call; higher values indicate higher confidence (the scale is not capped at 100).
FILTER: Indicates whether the variant passed filtering criteria.
INFO: Key-value pairs providing additional information about the variant, such as allele frequency, depth, etc.
FORMAT: Specifies the format of the sample-specific data. The GT in the FORMAT column tells us to expect genotypes in the following columns.
Sample Columns: Contain genotype information for each sample, following the format specified in the FORMAT column.
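To make the column layout concrete, the sketch below splits one VCF data line into these columns. The variant line, sample names, and the function name `parse_vcf_line` are hypothetical, and the parser ignores header lines and missing-value conventions; real analyses would normally rely on tools such as bcftools or a dedicated VCF library.

```python
VCF_FIXED = ("CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO")


def parse_vcf_line(line: str, sample_names: list[str]) -> dict:
    """Split one tab-delimited VCF data line into named columns."""
    columns = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_FIXED, columns[:8]))
    record["POS"] = int(record["POS"])
    record["ALT"] = record["ALT"].split(",")  # several alternate alleles possible
    # INFO holds semicolon-separated key=value pairs; bare keys act as flags.
    info = {}
    for item in record["INFO"].split(";"):
        key, _, value = item.partition("=")
        info[key] = value if value else True
    record["INFO"] = info
    # FORMAT names the colon-separated sub-fields of each sample column.
    format_keys = columns[8].split(":") if len(columns) > 8 else []
    record["SAMPLES"] = {
        name: dict(zip(format_keys, field.split(":")))
        for name, field in zip(sample_names, columns[9:])
    }
    return record


# Hypothetical variant line with two samples.
line = "chr1\t12345\trs0000\tA\tG,T\t50\tPASS\tDP=100;AF=0.5,0.1\tGT:DP\t0/1:48\t1/2:52"
record = parse_vcf_line(line, ["sampleA", "sampleB"])
print(record["ALT"], record["SAMPLES"]["sampleA"]["GT"])
```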
Library preparation - Quality scores
Causes of DNA degradation
Mechanical damage during tissue homogenization.

Wrong pH and ionic strength of extraction buffer.

Incomplete removal / contamination with nucleases.

Phenol: too old, or inappropriately buffered (it should be at pH 7.8 – 8.0); incomplete removal.

Wrong pH of DNA solvent (acidic water).

Recommended: 1:10 TE for short-term storage, or 1xTE for long-term storage.

Vigorous pipetting (use wide-bore pipette tips instead).

Vortexing of DNA in high concentrations.

Too many freeze-thaw cycles (we tested 5, still Ok).

Debatable: sequence-dependent

What are the main contaminants?

Polysaccharides Chitin
Lipopolysaccharides Protein
Growth media residuals Secondary metabolites
Pigments
Growth media residuals

Chitin Polyphenols
Fats Polysaccharides
Proteins Secondary metabolites
Pigments Pigments
By Olga Vinnere Pettersson, Uppsala Genome Center, SciLifeLab
What do absorption ratios tell us?
Pure DNA 260/280: 1.8 – 2.0
< 1.8:
Too little DNA compared to other components of the solution; presence of organic
contaminants: proteins and phenol; glycogen - absorb at 280 nm.
> 2.0:
High share of RNA.

Pure DNA 260/230: 2.0 – 2.2

<2.0:
Salt contamination, humic acids, peptides, aromatic compounds, polyphenols, urea,
guanidine, thiocyanates (latter three are common kit components) – absorb at 230 nm.
>2.2:
High share of RNA, very high share of phenol, high turbidity, dirty instrument, wrong blank.

Photometrically active contaminants:
phenol, polyphenols, EDTA, thiocyanate, protein, RNA, nucleotides (fragments below 5 bp)
By Olga Vinnere Pettersson, Uppsala Genome Center, SciLifeLab
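The ratio thresholds above can be turned into a simple screening helper. This is only a sketch: the cut-offs are the ones given above, the messages are abbreviated, and `assess_dna_purity` is a made-up name. It flags samples for a closer look rather than diagnosing a specific contaminant.

```python
def assess_dna_purity(a260_280: float, a260_230: float) -> list[str]:
    """Return warnings for absorbance ratios outside the pure-DNA ranges."""
    notes = []
    if a260_280 < 1.8:
        notes.append("260/280 < 1.8: possible protein or phenol contamination")
    elif a260_280 > 2.0:
        notes.append("260/280 > 2.0: possible high share of RNA")
    if a260_230 < 2.0:
        notes.append("260/230 < 2.0: possible salt, guanidine or polyphenol carry-over")
    elif a260_230 > 2.2:
        notes.append("260/230 > 2.2: check for RNA, phenol, turbidity or a wrong blank")
    return notes


print(assess_dna_purity(1.9, 2.1))  # within both ranges
print(assess_dna_purity(1.5, 1.2))  # both ratios too low
```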
FastQC

FastQC is a modular quality control tool designed to evaluate raw sequence data from high-
throughput sequencing platforms. It provides a quick visual and statistical summary to
identify potential issues before downstream analysis.

Input Requirements
Accepts FASTQ files (compressed or uncompressed), as well as SAM and BAM files
Compatible with data from Illumina, Ion Torrent, Oxford Nanopore, PacBio, etc.

The main functions of FastQC are

• Import of data from BAM, SAM or FastQ files (any variant)
• Providing a quick overview to tell you in which areas there may be problems
• Summary graphs and tables to quickly assess your data
• Export of results to an HTML based permanent report
• Offline operation to allow automated generation of reports without running the
interactive application
Good Illumina Data
X-axis: Position in read (bp). Represents base positions from 1 to 40 bp in the sequencing reads.
Y-axis: Phred Quality Score. A logarithmic measure of base calling accuracy: Q = -10 log₁₀(P), where P is the probability of an incorrect base call.
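The Phred relationship can be evaluated in both directions. A minimal sketch of Q = -10 log₁₀(P) and its inverse; the function names are illustrative.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """P = 10^(-Q/10): probability that the base call is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Q = -10 * log10(P)."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    print(q, phred_to_error_probability(q))
```

Each step of 10 in Q corresponds to a tenfold drop in the error probability.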
Bad Illumina Data
• Each box represents the interquartile range (IQR): 25th to 75th percentile.
• The red line inside the box is the median quality score.
• Whiskers show the 10th and 90th percentiles.
• The blue line shows the mean quality score per base position.
Adapter trimming
Adapter trimming is a crucial step in preprocessing NGS data.
It removes unwanted adapter sequences that can interfere with alignment, assembly, and
downstream analyses.

Why do adapters contaminate my sequences?

• Adapter ligation is essential during library preparation, where each DNA molecule is tagged
with adapters—typically after fragmentation in Illumina short-read protocols.
• Adapters serve multiple roles: they carry barcodes, primers for paired-end sequencing,
and sequences needed for flowcell binding and bridge amplification.
When are adapter sequences observed in the reads?
• In Illumina short-read sequencing, the 5′ adapter lies upstream of the read primer and is
not sequenced; only the DNA insert is typically captured.
• If the insert is shorter than the read length, sequencing extends into the 3′ adapter -
making adapter sequences visible only at the 3′ end and only under this condition.
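The read-through condition is simple arithmetic, sketched below with hypothetical lengths: the number of adapter bases at the 3′ end is the read length minus the insert length, or zero when the insert is at least as long as the read.

```python
def adapter_bases_in_read(read_length: int, insert_length: int) -> int:
    """Bases of 3' adapter sequenced when the insert is shorter than the read."""
    return max(0, read_length - insert_length)


print(adapter_bases_in_read(150, 100))  # short insert: read runs into the adapter
print(adapter_bases_in_read(150, 300))  # long insert: no adapter in the read
```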
Adapter trimming - cutadapt
Input: raw_reads.fastq
@read1
ACGTACGTAGATCGGAAGAGC
+
IIIIIIIIIIIIIIIIIIIII

cutadapt -a AGATCGGAAGAGC -o trimmed.fastq ./raw_reads.fastq

Cutadapt searches for the adapter AGATCGGAAGAGC inside the read.

ACGTACGTAGATCGGAAGAGC
↑ adapter starts here
Output
Trimmed read: ACGTACGT
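The idea behind 3′ adapter trimming can be sketched in a few lines of Python. This is not cutadapt's algorithm: it only looks for exact matches (no mismatches or indels) and uses an assumed minimum overlap of 3 bases for partial adapters at the read end. The function name `trim_3prime_adapter` is illustrative, and in a real FASTQ record the quality string has to be cut to the same length.

```python
def trim_3prime_adapter(read: str, adapter: str, min_overlap: int = 3) -> str:
    """Remove an exact 3' adapter match and everything after it."""
    # Case 1: the full adapter occurs somewhere inside the read.
    index = read.find(adapter)
    if index != -1:
        return read[:index]
    # Case 2: only the start of the adapter fits before the read ends.
    longest = min(len(adapter), len(read)) - 1
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:len(read) - overlap]
    return read  # no adapter found


adapter = "AGATCGGAAGAGC"
print(trim_3prime_adapter("ACGTACGTAGATCGGAAGAGC", adapter))
print(trim_3prime_adapter("ACGTACGTAGATCG", adapter))
```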
Adapter trimming - Trimmomatic
Trimmomatic is a versatile tool designed for preprocessing Illumina NGS data by trimming
low-quality bases and removing adapter sequences.
Core Functions of Trimmomatic
1. Adapter Removal (ILLUMINACLIP)
- Detects and removes adapter sequences using either palindrome or
simple matching.
- Example:
ILLUMINACLIP:TruSeq3-PE.fa:2:30:10
2: seed mismatches
30: palindrome clip threshold
10: simple clip threshold
2. Quality Trimming
- LEADING: Removes low-quality bases from the start of the read.
- TRAILING: Removes low-quality bases from the end.
- SLIDINGWINDOW: Scans the read with a sliding window and trims when average
quality drops below a threshold.
SLIDINGWINDOW:4:15
- Window size: 4 bases
- Quality threshold: 15

3. Length Filtering
- MINLEN: Discards reads shorter than a specified length after trimming.
MINLEN:36

4. Cropping Options
- CROP: Cuts the read to a fixed length.
- HEADCROP: Removes a fixed number of bases from the beginning.

Modes of Operation
- Paired-End Mode (PE): Processes forward and reverse reads together, maintaining pairing.
- Single-End Mode (SE): Processes individual reads.

Example Command (Paired-End)

java -jar trimmomatic-0.39.jar PE -phred33 \
input_forward.fq.gz input_reverse.fq.gz \
output_forward_paired.fq.gz output_forward_unpaired.fq.gz \
output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz \
ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 \
LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
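To show what SLIDINGWINDOW:4:15 and MINLEN:36 do to a single read, here is a simplified sketch. It scans from the 5′ end and cuts at the start of the first window whose mean quality falls below the threshold; Trimmomatic's own implementation may differ in details, and the function names and quality values below are made up for illustration.

```python
def sliding_window_keep_length(qualities: list[int], window: int = 4,
                               threshold: float = 15) -> int:
    """Return how many bases to keep from the 5' end of the read."""
    if len(qualities) < window:
        return len(qualities)  # assumption: reads shorter than the window are kept
    for start in range(len(qualities) - window + 1):
        chunk = qualities[start:start + window]
        if sum(chunk) / window < threshold:
            return start  # cut where the first failing window begins
    return len(qualities)


def passes_minlen(kept_length: int, minimum: int = 36) -> bool:
    """MINLEN-style filter: keep the read only if it is long enough."""
    return kept_length >= minimum


good_read = [30] * 40 + [10] * 10  # hypothetical qualities, weak 3' tail
poor_read = [30] * 20 + [2] * 10   # hypothetical qualities, early drop
for read in (good_read, poor_read):
    kept = sliding_window_keep_length(read)
    print(kept, passes_minlen(kept))
```

The first read is trimmed to 40 bases and kept; the second is trimmed to 19 bases and would be discarded by MINLEN:36.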
Filtering based on base quality

When sequencing DNA, the sequencer not only gives us the sequence of nucleotides (A, T,
G, C) but also a quality score for each base — this is called the Phred quality score. It tells us
how confident the machine is that a base was read correctly. For example, a score of 30
(Q30) means the base has a 1 in 1000 chance of being wrong (99.9% accuracy).

In filtering based on base quality, we remove or trim bases (or even entire reads) if their
quality score is below a chosen threshold. This helps reduce errors in downstream analysis,
like mapping or variant calling.

Example: If a read has Q-scores like 40 38 35 15 10 8, the last three bases have low
confidence, so we might trim them off.
This is important because poor-quality bases can cause false alignments or incorrect variant
calls.
Think of it like removing blurry sections from a photograph before you try to identify faces
— you only want the sharp, clear parts for analysis.
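The trimming example above can be reproduced with a short function. This is a sketch that assumes a threshold of Q20 and a hypothetical six-base sequence; it removes bases from the 3′ end until it meets one at or above the threshold.

```python
def trim_low_quality_tail(sequence: str, qualities: list[int],
                          threshold: int = 20) -> tuple[str, list[int]]:
    """Drop 3' bases whose quality score is below the threshold."""
    keep = len(qualities)
    while keep > 0 and qualities[keep - 1] < threshold:
        keep -= 1
    return sequence[:keep], qualities[:keep]


# Q-scores from the example above; the sequence itself is made up.
trimmed_seq, trimmed_quals = trim_low_quality_tail("ACGTAC", [40, 38, 35, 15, 10, 8])
print(trimmed_seq, trimmed_quals)
```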
Linux Basics

Linux is an open-source, Unix-like operating system. Its main component is the Linux kernel.
It was developed to provide a low-cost or free operating system to personal computer users,
and a typical system includes components such as the X Window System, the Emacs editor,
TCP/IP networking, a GUI, etc.

Linux distribution
• A Linux system is packaged as a distribution; multiple Linux distributions are available
for different computing needs.
• A Linux distribution is built from a set of software chosen for compatibility with the Linux
kernel, so that Linux can run on different systems, such as personal systems, embedded
systems, etc.
• There are around 600 distributions available.

Each distribution has specialized packages installed to support specific tasks. This means you
can download software related to your field of work using a Linux distribution.

Some Linux distributions are: MX Linux, Manjaro, Linux Mint, elementary, Ubuntu, Debian,
Solus, Fedora, openSUSE, Deepin

No. | Command | Purpose | Example | Description
1 | pwd | Print current directory | pwd | Shows /mnt/c/Users/student/ngs_data to confirm you’re in your project folder.
2 | ls | List files/folders | ls *.fastq | Lists all FASTQ files from your RNA-seq dataset.
3 | cd | Change directory | cd <folder> | Moves to <folder>, e.g. the folder containing sequencing files.
4 | mkdir | Make a directory | mkdir trimmed_reads | Creates a new folder for storing cleaned reads.
5 | cp | Copy files | cp sample1.fastq sample1_backup.fastq | Makes a backup copy of your original sample file.
6 | mv | Move/rename files | mv sample1.fastq raw_data/ | Moves sample1.fastq into the raw_data folder.
7 | head / more | View first lines | head -n 4 sample1.fastq | Shows the first read (four lines) from your sequencing file.
8 | wc | Count lines | wc -l sample1.fastq | Counts lines; divide by 4 for number of reads.
9 | grep | Search pattern | grep "ATCG" sample1.fastq | Finds sequences containing motif “ATCG”.
10 | less | Scroll through file | less sample1.fastq | Lets you scroll through a huge FASTQ file without opening it in Notepad.
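The "count lines, divide by 4" rule from the wc entry can also be scripted, which helps when files are gzip-compressed. A minimal sketch assuming four lines per read; the function name `count_fastq_reads` and the demo file are made up for illustration.

```python
import gzip
import tempfile
from pathlib import Path


def count_fastq_reads(path) -> int:
    """Count reads as (number of lines) / 4, for plain or .gz FASTQ files."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        line_count = sum(1 for _ in handle)
    if line_count % 4:
        raise ValueError(f"{path}: {line_count} lines is not a multiple of 4")
    return line_count // 4


# Demo on a temporary two-read file (hypothetical content).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.fastq"
    demo_path.write_text("@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIHH\n")
    demo_count = count_fastq_reads(demo_path)
print(demo_count)
```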
Linux Architecture

Components of Linux
Like any operating system, Linux consists of software, computer programs, documentation,
and hardware.
The main components of the Linux operating system are: Application, Shell, Kernel, Hardware,
Utilities.

1. Kernel
The kernel is the core component; it lies between the shell and the hardware.
It controls the activity of the hardware components.
It virtualizes the common hardware resources and provides each process with the necessary
virtual resources.
The kernel is software that manages communication between the hardware and the rest of
the system. Users and applications do not access the hardware directly; instead, the kernel
handles the communication between the computer system and the hardware.

The kernel is responsible for:

Memory management: Manages and allocates
memory efficiently.
Resource allocation: Distributes system
resources to different processes.
Device management: Controls input/output
devices like printers and scanners.
Process management: Manages process
execution and scheduling.
Application interaction: Bridges applications
with system-level functions.
Security: Provides essential system-level
security.
2. System Library
System libraries are sets of predefined functions through which application programs or
system utilities access the kernel's features. These libraries are the foundation upon which
other software is built.
Some of the most common system libraries are:
GNU C library (glibc): The C library that provides the fundamental system interface for C
programs, including many built-in functions.
libpthread (POSIX Threads): Supports multithreading in Linux; it lets programs create and
manage multiple threads.
libdl (Dynamic Linker): Responsible for loading and linking libraries at runtime.
libm (Math Library): Provides mathematical functions.

Some other system libraries are: librt (Realtime Library), libcrypt (Cryptographic Library),
libnss (Name Service Switch Library), libstdc++ (C++ Standard Library)
3. Shell
The shell is software that acts as the interface to the kernel. It takes commands from the user and
interprets them, then passes them to the kernel, which performs the requested operations. Users simply
enter a command, and the task is carried out using the kernel's functions.
Common shells:
1. Bourne Shell (sh) – One of the earliest Unix shells, it’s simple, reliable, and still used for basic scripting tasks.
2. C Shell (csh) – A shell designed with C-like syntax, offering history recall and interactive features but less suited for complex scripting.
3. Korn Shell (ksh) – A powerful, backward-compatible shell that blends Bourne and C shell features, popular in enterprise environments.
4. Bash (Bourne Again Shell) – The most common Linux shell, combining ease of use, scripting power, and interactive features like auto-completion.
5. Z Shell (zsh) – A highly customizable shell with advanced features, themes, and plugins, now the default on macOS.
6. Fish (Friendly Interactive Shell) – A beginner-friendly shell with real-time suggestions, syntax highlighting, and easy configuration.
4. Hardware Layer
• The hardware layer is the lowest level of the operating system stack.
• It plays a vital role in managing all the hardware components.
• It includes device drivers, kernel functions, memory management, CPU control, and I/O
operations.
• This layer abstracts hardware complexity by providing an interface for software and
ensuring that all components function properly.

5. System utility
• System utilities are command-line tools that perform various tasks requested by the user
to make system management and administration easier.
• These utilities enable users to perform different tasks, such as file management, system
monitoring, network configuration, user management, etc.

