NGS Unit 1
22BT741 – Unit-1
Randomly fragment genomic DNA and ligate adapters to both ends of the fragments.
Bind single-stranded fragments randomly to the inside surface of the flow cell channels.
NGS DATA ANALYSIS & HPC
Computational Tools for NGS Data Analysis and basics of Linux
NGS platforms - Illumina Sequencing Technology
Figure 4: Bridge Amplification
Figure 5: Fragments Become Double Stranded
The sequencing cycles are repeated to determine the sequence of bases in a fragment, one base at a time.
The data are aligned and compared to a reference, and sequencing differences are identified.
NGS platforms - Nanopore
Oxford Nanopore Technologies (ONT) offers a range of sequencing devices based on the
principle of passing nucleic acid molecules (DNA or RNA) through a nanoscale pore
embedded in a membrane. As the molecule passes, it creates characteristic disruptions in an
electrical current, which are measured and decoded to determine the sequence.
Image adapted from Oxford Nanopore Technologies. It shows a double-stranded piece of DNA being unzipped
and a single strand passing through a nanopore sensor. The pore produces an electrical signal that reflects how
much of the current running through the pore is blocked by individual nucleotides (the building blocks of
nucleic acids, DNA and RNA). Specialised software decodes the signal to read the sequence.
Alpha-Hemolysin (α-HL)
• A heptameric pore-forming toxin from Staphylococcus aureus.
• Self-assembles into a transmembrane channel with a ~1.5 nm diameter, ideal for ssDNA or
RNA translocation.
• Provides low-noise ionic current and high sensitivity, making it a gold standard for
nanopore-based detection.
• Used to detect DNA/RNA, peptides, and even protein variants by analyzing current
blockades as molecules pass through.
How nanopore sequencing works.
The method uses electrophoresis to drive DNA strands or single nucleotides through a very small hole
embedded in a membrane. An enzyme motor (not shown) controls the rate at which a DNA molecule passes
through the nanopore. The sequence is determined in real time based on the extent to which the nucleotides
disrupt the current flowing through a nanopore sensor.
Credit: Daniel Power.
FASTQ format
FASTQ is a text-based file format used to store both biological sequences (usually nucleotide
sequences) and their corresponding quality scores. It is the standard output format for
Next-Generation Sequencing (NGS) platforms like Illumina.
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAA
+
!''*((((***+))%%%++)(%%%%).1**
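A FASTQ record spans four lines: an identifier starting with @, the sequence, a + separator, and a quality string with one character per base. The following is a minimal Python sketch of a reader under that assumption (no wrapped sequence lines). The names FastqRecord and read_fastq and the in-memory sample are illustrative and do not come from any library.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream with exactly four lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Illustrative record: 15 bases paired with 15 quality characters.
SAMPLE = "@SEQ_ID\nGATTTGGGGTTCAAA\n+\n!''*((((***+))%\n"
records = list(read_fastq(io.StringIO(SAMPLE)))
```

The length check reflects the rule that every base must have exactly one quality character.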
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**
Note: the sequence line shows only the first 34 bases; the quality string has 40 characters.
Each quality character is decoded as ASCII code − 33 (Phred+33), in read order:
! → 33 − 33 = 0     ' → 39 − 33 = 6     ' → 39 − 33 = 6     * → 42 − 33 = 9
( → 40 − 33 = 7     ( → 40 − 33 = 7     ( → 40 − 33 = 7     ( → 40 − 33 = 7
* → 42 − 33 = 9     * → 42 − 33 = 9     * → 42 − 33 = 9     + → 43 − 33 = 10
) → 41 − 33 = 8     ) → 41 − 33 = 8     % → 37 − 33 = 4     % → 37 − 33 = 4
% → 37 − 33 = 4     + → 43 − 33 = 10    + → 43 − 33 = 10    ) → 41 − 33 = 8
( → 40 − 33 = 7     % → 37 − 33 = 4     % → 37 − 33 = 4     % → 37 − 33 = 4
% → 37 − 33 = 4     ) → 41 − 33 = 8     . → 46 − 33 = 13    1 → 49 − 33 = 16
* → 42 − 33 = 9     * → 42 − 33 = 9     * → 42 − 33 = 9     - → 45 − 33 = 12
+ → 43 − 33 = 10    * → 42 − 33 = 9     ' → 39 − 33 = 6     ' → 39 − 33 = 6
) → 41 − 33 = 8     ) → 41 − 33 = 8     * → 42 − 33 = 9     * → 42 − 33 = 9
TOTAL = 306
AVERAGE = 306 / 40 = 7.65
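The arithmetic above can be reproduced in a few lines of Python. This is a small sketch assuming the Phred+33 offset used in the worked example; the function name phred_scores is illustrative.

```python
QUALITY = "!''*((((***+))%%%++)(%%%%).1***-+*''))**"
PHRED_OFFSET = 33  # offset used in the worked example (Phred+33)


def phred_scores(quality: str, offset: int = PHRED_OFFSET) -> list[int]:
    """Convert each quality character to its score: ASCII code minus the offset."""
    return [ord(char) - offset for char in quality]


scores = phred_scores(QUALITY)
total = sum(scores)
average = total / len(scores)
print(f"bases={len(scores)} total={total} average={average:.2f}")
```

Running it on the 40-character quality string gives the same total (306) and average (7.65) as the table.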
SAM/BAM format
The SAM (Sequence Alignment/Map) and BAM (Binary Alignment/Map) formats are
foundational in NGS for storing sequence alignment data from high-throughput
sequencing.
A SAM file has two main parts:
1. Header Section (optional, starts with @)
• Provides metadata like reference genome names and lengths
• Each line begins with @ and contains tab-separated fields
2. Alignment Section
• Each line corresponds to one read
• Has 11 mandatory fields + optional tags
• Two sections:
1. Header section (optional): starts with '@', contains metadata
2. Alignment section: one line per read, with 11 mandatory fields
• Optional fields provide extra info (e.g., tags like NM:i:1)
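To make the layout concrete, here is a minimal Python sketch that splits one alignment line into its 11 mandatory fields and its optional tags. The field names follow the SAM specification; the example line and the names SamRecord and parse_sam_line are made up for illustration. Real pipelines would normally use a dedicated library rather than hand-written parsing.

```python
from typing import NamedTuple

# The 11 mandatory SAM fields, in column order.
SAM_FIELDS = ("QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL")
INTEGER_FIELDS = ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN")


class SamRecord(NamedTuple):
    mandatory: dict
    tags: dict


def parse_sam_line(line: str) -> SamRecord:
    """Parse one tab-separated alignment line (not a header line starting with @)."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(SAM_FIELDS):
        raise ValueError("an alignment line needs at least 11 fields")
    mandatory = dict(zip(SAM_FIELDS, columns))
    for name in INTEGER_FIELDS:
        mandatory[name] = int(mandatory[name])
    tags = {}
    for field in columns[len(SAM_FIELDS):]:
        tag, tag_type, value = field.split(":", 2)  # e.g. NM:i:1
        tags[tag] = int(value) if tag_type == "i" else value
    return SamRecord(mandatory, tags)


# Hypothetical alignment line, for illustration only.
EXAMPLE_LINE = "read1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tGATTTGGGGT\t!''*((((**\tNM:i:1"
record = parse_sam_line(EXAMPLE_LINE)
```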
>frg400 15M8D4I7M1X7M2X5M1X158M
CATTGGAACAGAAAGagatTTATCTGtTGTTTGCagTGAAGgAGTACAAAATG
The reverse-complement of this sequence, and its corresponding CIGAR, looks like
this:
>frg800 158M1X5M2X7M1X7M8D4I15M
CATTTTGTACTcCTTCActGCAAACAaCAGATAAatctCTTTCTGTTCCAATG
The delins sub-CIGAR 8D4I reads the same in both directions. The starting point
refers to the last nucleotide before the insertion, so a mapped starting point will
differ for all cases apart from 1DnI, where ‘n’ means any number.
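The relationship between the two records can be checked with a short Python sketch. It reverse-complements the sequence (preserving the lower-case marking) and reverses the order of the CIGAR operations. A plain reversal would turn 8D4I into 4I8D, so the sketch puts adjacent insertion/deletion pairs back into deletion-first order to match the delins notation used above. That reordering is an assumption drawn from this example, not a rule of the SAM specification, and the function names are illustrative.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence: str) -> str:
    """Complement each base (keeping its case) and reverse the result."""
    return sequence.translate(COMPLEMENT)[::-1]


def reverse_cigar(cigar: str) -> str:
    """Reverse the CIGAR operations, writing delins blocks deletion-first."""
    ops = CIGAR_OP.findall(cigar)[::-1]
    for i in range(len(ops) - 1):
        # After reversal a delins block reads I then D; swap it back to D then I.
        if ops[i][1] == "I" and ops[i + 1][1] == "D":
            ops[i], ops[i + 1] = ops[i + 1], ops[i]
    return "".join(f"{count}{op}" for count, op in ops)


FORWARD_SEQ = "CATTGGAACAGAAAGagatTTATCTGtTGTTTGCagTGAAGgAGTACAAAATG"
FORWARD_CIGAR = "15M8D4I7M1X7M2X5M1X158M"

print(reverse_complement(FORWARD_SEQ))
print(reverse_cigar(FORWARD_CIGAR))
```

Both printed values match the frg800 record shown above.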
VCF format
The data section of a VCF file includes the following columns:
CHROM: Chromosome or reference sequence ID.
POS: Position of the variant on the chromosome.
ID: Unique identifier for the variant.
REF: Reference allele.
ALT: Alternate allele(s).
QUAL: Phred-scaled quality score for the variant call (higher means more confident).
FILTER: Indicates whether the variant passed filtering criteria.
INFO: Key-value pairs providing additional information about the variant, such as allele frequency, depth, etc.
FORMAT: Specifies the format of the sample-specific data. The GT in the FORMAT column tells us to expect genotypes in the following columns.
Sample Columns: Contain genotype information for each sample, following the format specified in the FORMAT column.
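These columns are those of a VCF (Variant Call Format) data line. As a rough illustration, the Python sketch below splits one such line into the fixed columns, the INFO key-value pairs, and the per-sample values keyed by the FORMAT entries. The example line, the sample name, and the function name parse_vcf_line are invented for illustration, and the sketch ignores header lines and edge cases such as a missing INFO field.

```python
VCF_COLUMNS = ("CHROM", "POS", "ID", "REF", "ALT",
               "QUAL", "FILTER", "INFO", "FORMAT")


def parse_vcf_line(line: str, sample_names: list[str]) -> dict:
    """Parse one tab-separated VCF data line into a dictionary."""
    columns = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, columns))
    record["POS"] = int(record["POS"])
    record["ALT"] = record["ALT"].split(",")  # several alternate alleles are allowed
    record["QUAL"] = None if record["QUAL"] == "." else float(record["QUAL"])

    # INFO holds semicolon-separated key=value pairs; a bare key is a flag.
    info = {}
    for entry in record["INFO"].split(";"):
        key, _, value = entry.partition("=")
        info[key] = value if value else True
    record["INFO"] = info

    # Each sample column is read using the keys listed in FORMAT.
    format_keys = record["FORMAT"].split(":")
    record["samples"] = {
        name: dict(zip(format_keys, column.split(":")))
        for name, column in zip(sample_names, columns[len(VCF_COLUMNS):])
    }
    return record


# Hypothetical data line with one sample, for illustration only.
EXAMPLE = "chr1\t12345\trs0001\tA\tG\t50\tPASS\tDP=30;AF=0.5\tGT:DP\t0/1:30"
variant = parse_vcf_line(EXAMPLE, ["sample1"])
```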
Library preparation - Quality scores
Causes of DNA degradation
Mechanical damage during tissue homogenization.
Phenol: too old, or not properly buffered (should be pH 7.8 – 8.0); incomplete removal.
Debatable: sequence-dependent
Figure (labels only): polysaccharides, lipopolysaccharides, chitin, proteins, fats, polyphenols, pigments, secondary metabolites, growth media residuals.
By Olga Vinnere Pettersson, Uppsala Genome Center, SciLifeLab
What do absorption ratios tell us?
Pure DNA 260/280: 1.8 – 2.0
< 1.8:
Too little DNA compared to other components of the solution; presence of organic
contaminants: proteins and phenol; glycogen - absorb at 280 nm.
> 2.0:
High share of RNA.
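The rule of thumb above can be written as a small helper. This Python sketch only encodes the thresholds stated on the slide (1.8 and 2.0); the function name and the wording of the returned messages are illustrative.

```python
def interpret_a260_a280(a260: float, a280: float) -> str:
    """Classify a sample by its A260/A280 ratio using the 1.8 to 2.0 window."""
    if a280 <= 0:
        raise ValueError("A280 must be positive")
    ratio = a260 / a280
    if ratio < 1.8:
        return "below 1.8: possible protein, phenol or other contamination"
    if ratio > 2.0:
        return "above 2.0: possible high share of RNA"
    return "1.8-2.0: consistent with pure DNA"


print(interpret_a260_a280(1.9, 1.0))
```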
FastQC is a modular quality control tool designed to evaluate raw sequence data from
high-throughput sequencing platforms. It provides a quick visual and statistical summary to
identify potential issues before downstream analysis.
Input Requirements
Accepts FASTQ files (compressed or uncompressed)
Compatible with data from Illumina, Ion Torrent, Oxford Nanopore, PacBio, etc.
Y-axis: Phred Quality Score, a logarithmic measure of base calling accuracy.
Q = -10 log₁₀(P), where P is the probability of an incorrect base call.
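The formula can be applied in both directions. The Python sketch below converts a Phred score to an error probability and back; the function names are illustrative.

```python
import math


def error_probability(q: float) -> float:
    """Probability of an incorrect base call for Phred score q: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def phred_quality(p: float) -> float:
    """Phred score for error probability p: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    print(f"Q{q}: error probability {error_probability(q):.4f}")
```

Each step of 10 in Q corresponds to a tenfold drop in the error probability.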
FastQC
Bad Illumina Data
• Each box represents the interquartile range (IQR): 25th to 75th percentile.
• Whiskers show the 10th and 90th percentiles.
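The five values drawn for each read position can be computed as sketched below in Python. The nearest-rank percentile method used here is an assumption, since the slide does not say how the percentiles are calculated, and the scores are made up for illustration.

```python
import math


def percentile(values: list[int], percent: int) -> int:
    """Nearest-rank percentile: the smallest value covering `percent` of the data."""
    ordered = sorted(values)
    rank = max(1, math.ceil(percent * len(ordered) / 100))
    return ordered[rank - 1]


def box_summary(scores: list[int]) -> dict:
    """Whisker, box and median values for the quality scores at one read position."""
    return {
        "whisker_low": percentile(scores, 10),
        "box_low": percentile(scores, 25),
        "median": percentile(scores, 50),
        "box_high": percentile(scores, 75),
        "whisker_high": percentile(scores, 90),
    }


# Made-up quality scores observed at a single read position.
COLUMN_SCORES = [2, 8, 12, 15, 20, 28, 30, 34, 36, 38]
summary = box_summary(COLUMN_SCORES)
```

A wide box that reaches down into low scores, as in this made-up column, is the pattern that marks poor-quality positions in such a plot.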
3. Length Filtering
- MINLEN: Discards reads shorter than a specified length after trimming.
MINLEN:36
4. Cropping Options
- CROP: Cuts the read to a fixed length.
- HEADCROP: Removes a fixed number of bases from the beginning.
Modes of Operation
- Paired-End Mode (PE): Processes forward and reverse reads together, maintaining pairing.
- Single-End Mode (SE): Processes individual reads.
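The effect of the length and cropping options on a single read can be sketched in Python as below. The option names match steps of the Trimmomatic read trimmer, although the source does not name the tool. This sketch is not its implementation: it applies the steps in one fixed order (HEADCROP, then CROP, then MINLEN) as an assumption and handles only the single-end case.

```python
from typing import Optional


def apply_steps(sequence: str, quality: str, headcrop: int = 0,
                crop: Optional[int] = None,
                minlen: int = 0) -> Optional[tuple[str, str]]:
    """Apply HEADCROP, CROP and MINLEN to one read; return None if it is discarded."""
    # HEADCROP: remove a fixed number of bases from the beginning.
    sequence, quality = sequence[headcrop:], quality[headcrop:]
    # CROP: cut the read to a fixed maximum length.
    if crop is not None:
        sequence, quality = sequence[:crop], quality[:crop]
    # MINLEN: discard reads shorter than the specified length.
    if len(sequence) < minlen:
        return None
    return sequence, quality


# Hypothetical 10-base read with uniform quality characters.
trimmed = apply_steps("ACGTACGTAC", "IIIIIIIIII", headcrop=2, crop=6)
discarded = apply_steps("ACGTACGTAC", "IIIIIIIIII", minlen=36)
```

In paired-end mode the same decisions would have to be made for both mates together so that pairing is preserved.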
When sequencing DNA, the sequencer not only gives us the sequence of nucleotides (A, T,
G, C) but also a quality score for each base — this is called the Phred quality score. It tells us
how confident the machine is that a base was read correctly. For example, a score of 30
(Q30) means the base has a 1 in 1000 chance of being wrong (99.9% accuracy).
In filtering based on base quality, we remove or trim bases (or even entire reads) if their
quality score is below a chosen threshold. This helps reduce errors in downstream analysis,
like mapping or variant calling.
Example: If a read has Q-scores like 40 38 35 15 10 8, the last three bases have low
confidence, so we might trim them off.
This is important because poor-quality bases can cause false alignments or incorrect variant
calls.
Think of it like removing blurry sections from a photograph before you try to identify faces
— you only want the sharp, clear parts for analysis.
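The trimming example above can be sketched in Python as follows. The threshold of 20 is an assumed value chosen for illustration, and the function trims only from the end of the read, which is one of several possible strategies.

```python
def trim_low_quality_tail(scores: list[int], threshold: int = 20) -> int:
    """Return how many bases to keep after dropping low-quality bases from the read end."""
    keep = len(scores)
    while keep > 0 and scores[keep - 1] < threshold:
        keep -= 1
    return keep


READ_SCORES = [40, 38, 35, 15, 10, 8]
kept = trim_low_quality_tail(READ_SCORES)
print(f"keep the first {kept} bases: {READ_SCORES[:kept]}")
```

With these scores the last three bases fall below the threshold, so the first three are kept.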
Linux Basics
Linux is an open-source, Unix-like operating system. Its main component is the Linux
kernel. It was developed to provide a low-cost or free operating system for personal
computer users, and a typical system includes the X Window System, the Emacs editor,
TCP/IP networking, a GUI, etc.
Linux distribution
• A Linux distribution is a packaged Linux system; many distributions are available for
different computing needs.
• Each distribution bundles a set of software chosen for compatibility with the Linux
kernel, so that Linux can run on different kinds of systems, such as personal computers
and embedded systems.
• There are around 600 distributions available.
Each distribution comes with packages suited to specific tasks, so you can choose a
distribution, and install software through it, that matches your field of work.
Some Linux distributions are: MX Linux, Manjaro, Linux Mint, elementary OS, Ubuntu, Debian,
Solus, Fedora, openSUSE, Deepin.
Components of Linux
Like any operating system, Linux consists of software, computer programs, documentation,
and hardware.
The main components of Linux operating system are: Application, Shell, Kernel, Hardware,
Utilities.
1. Kernel
The kernel is the core component; it lies between the shell and the hardware.
It controls the activity of the hardware components.
It virtualizes the common hardware resources and provides each process with the
virtual resources it needs.
The kernel is software that manages communication between the hardware and the rest
of the system. Users do not interact with it directly; their requests reach it through
the shell and other programs.
Linux Architecture
Some other system libraries are: librt (Realtime Library), libcrypt (Cryptographic Library),
libnss (Name Service Switch Library), libstdc++ (C++ Standard Library)
3. Shell
The shell is software that serves as the interface to the kernel. It takes commands from
the user, interprets them, and passes them to the kernel, which then performs the requested
operations. Users simply enter a command, and the task is carried out using the kernel's functions.
1. Bourne Shell (sh) – One of the earliest Unix shells, it’s simple, reliable, and still used for basic scripting tasks.
2. C Shell (csh) – A shell designed with C-like syntax, offering history recall and interactive features but less suited for complex scripting.
3. Korn Shell (ksh) – A powerful, backward-compatible shell that blends Bourne and C shell features, popular in enterprise environments.
4. Bash (Bourne Again Shell) – The most common Linux shell, combining ease of use, scripting power, and interactive features like auto-completion.
5. Z Shell (zsh) – A highly customizable shell with advanced features, themes, and plugins, now the default on macOS.
6. Fish (Friendly Interactive Shell) – A beginner-friendly shell with real-time suggestions, syntax highlighting, and easy configuration.
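To see which kernel and shell a given system uses, a program can ask the operating system directly. The Python sketch below uses only the standard library. The SHELL environment variable is a common convention on Unix-like systems but may be unset, so the sketch falls back to "unknown"; the function name is illustrative.

```python
import os
import platform


def describe_environment() -> dict:
    """Collect the kernel name, kernel release and login shell of the current system."""
    return {
        "kernel_name": platform.system(),        # e.g. "Linux"
        "kernel_release": platform.release(),    # kernel version string
        "login_shell": os.environ.get("SHELL", "unknown"),
    }


info = describe_environment()
for key, value in info.items():
    print(f"{key}: {value}")
```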
4. Hardware Layer
• The hardware layer is the lowest level of the operating system stack.
• It plays a vital role in managing all the hardware components.
• It includes device drivers, kernel functions, memory management, CPU control, and I/O
operations.
• This layer abstracts hardware complexity by providing an interface for software and
ensuring that all components function properly.
5. System utility
• System utilities are command-line tools that perform tasks requested by the user,
making system management and administration easier.
• These utilities enable users to perform different tasks, such as file management,
system monitoring, network configuration, and user management.