
NGS Unit 1

The document provides an overview of Next-Generation Sequencing (NGS) data analysis and high-performance computing (HPC), detailing its applications in various biological research areas such as cancer and microbiology. It discusses different NGS platforms, their costs, and sequencing technologies, including Illumina and Oxford Nanopore, along with essential file formats like FASTQ and SAM/BAM used for storing sequencing data. The content is aimed at equipping researchers with knowledge about computational tools and methods for effective NGS data analysis.

Uploaded by

Karan Palukuri

NGS DATA ANALYSIS & HPC

22BT741 – Unit-1

Dr. Prashantha Karunakar


Associate Professor
Department of Biotechnology
Dayananda Sagar College of Engineering
Bangalore
Computational Tools for NGS Data Analysis and basics of Linux
NGS

What is NGS used for?


NGS is used to sequence vast amounts of genetic material, enabling researchers
to take a broad, unbiased approach to scientific research in a variety of
applications and biological systems.
- Rather than profiling select markers, you can identify variants across
thousands of regions, down to single-base resolution, in a single experiment.
NGS expands the scope of your experimental studies to help find the
answers to your boldest research questions.

Using NGS in key research areas


NGS can play an important role in pursuing the answer to a variety of
biological questions using a wide array of published methods for diverse
sample types. NGS enables the unbiased investigation of multiple biological
“omes”, such as the proteome, transcriptome, epigenome, and genome. A
combinatorial approach interrogating multiple omes at once, called
multiomics, can also be achieved with NGS.
Popular NGS methods and applications include:
Cancer research
NGS is used to bulk-sequence tumors and identify genetic mutations in tumors,
aiding in the development of targeted therapies and monitoring cancer
progression through liquid biopsies.

Microbiology and infectious diseases.


NGS helps in pathogen identification, outbreak tracking, and studying
antimicrobial resistance by sequencing the genomes of bacteria, viruses, and
other microbes.
NGS platforms

Platform | Read Type | Instrument Cost (INR) | Read Length | Cost per Gb (INR) | Typical Applications
Illumina NovaSeq X Plus | Short-read (Paired-End) | ~₹10,50,00,000 | 2 × 150 bp | ~₹84–168 | Whole-genome sequencing, large-scale projects
Illumina NextSeq 2000 | Short-read (Paired-End) | ~₹2,52,00,000 | 2 × 150 bp | ~₹1,260–2,520 | Targeted sequencing, RNA-seq
Illumina MiSeq | Short-read (Paired-End) | ~₹84,00,000 | 2 × 300 bp | ~₹8,400–12,600 | Amplicon sequencing, small genome sequencing
Ion Torrent Genexus | Short-read | ~₹2,52,00,000 | Up to 200 bp | ~₹840–4,200 | Clinical diagnostics, small panels
PacBio Sequel IIe | Long-read (HiFi) | ~₹4,20,00,000–6,30,00,000 | ~10–25 kb | ~₹672–1,260 | Structural variants, isoform sequencing
Oxford Nanopore PromethION 24/48 | Long-read (ultra-long) | ~₹2,10,00,000–4,20,00,000 | >100 kb | ~₹1,680–3,360 | De novo assembly, metagenomics
Oxford Nanopore MinION | Portable, long-read | ~₹84,000 (starter pack) | Up to 1 Mb | ~₹4,200–8,400 | Field-based sequencing, pilot studies, educational use
NGS platforms - DNA quality requirements

Some DNA left in the well
Sharp band of 20+ kb
No sign of proteins
No smear of degraded DNA
No sign of RNA

NanoDrop:
260/280 = 1.8 – 2.0
260/230 = 2.0 – 2.2

Qubit or PicoGreen:
10 kb insert libraries: 3–5 µg
20 kb insert libraries: 10–20 µg
NGS platforms - Illumina Sequencing Technology
Illumina sequencing technology provides clonal array formation and proprietary
reversible terminator technology for rapid and accurate large-scale sequencing.
The innovative and flexible sequencing system enables a broad array of
applications in genomics, transcriptomics, and epigenomics.
Cluster Generation
Sequencing templates are immobilized on
a proprietary flow cell surface (Figure 1)
designed to present the DNA in a manner
that facilitates access to enzymes while
ensuring high stability of surface bound
template and low non-specific binding of
fluorescently labeled nucleotides.

Several samples can be loaded onto the eight-lane flow cell for simultaneous analysis on an Illumina Sequencing System.
Solid-phase amplification (Figures 2–7) creates up to 1,000 identical copies of each single
template molecule in close proximity (diameter of one micron or less).
Figure 2: Prepare Genomic DNA Sample. Randomly fragment genomic DNA and ligate adapters to both ends of the fragments.
Figure 3: Attach DNA to Surface. Bind single-stranded fragments randomly to the inside surface of the flow cell channels.
Figure 4: Bridge Amplification. Add unlabeled nucleotides and enzyme to initiate solid-phase bridge amplification.
Figure 5: Fragments Become Double Stranded. The enzyme incorporates nucleotides to build double-stranded bridges on the solid-phase substrate.
Figure 6: Denature the Double-Stranded Molecules. Denaturation leaves single-stranded templates anchored to the substrate.
Figure 7: Complete Amplification. Several million dense clusters of double-stranded DNA are generated in each channel of the flow cell.
Figure 8: Determine First Base. The first sequencing cycle begins by adding four labeled reversible terminators, primers and DNA polymerase.
Figure 9: Image First Base. After laser excitation, the emitted fluorescence from each cluster is captured and the first base is identified.
Figure 10: Determine Second Base. The next cycle repeats the incorporation of four labeled reversible terminators, primers and DNA polymerase.
Figure 11: Image Second Chemistry Cycle. After laser excitation, the image is captured as before and the identity of the second base is recorded.
Figure 12: Sequencing Over Multiple Chemistry Cycles. The sequencing cycles are repeated to determine the sequence of bases in a fragment, one base at a time.
Figure 13: Align Data. The data are aligned and compared to a reference, and sequencing differences are identified.
NGS platforms - Nanopore
Oxford Nanopore Technologies (ONT) offers a range of sequencing devices based on the
principle of passing nucleic acid molecules (DNA or RNA) through a nanoscale pore
embedded in a membrane. As the molecule passes, it creates characteristic disruptions in an
electrical current, which are measured and decoded to determine the sequence.

Image adapted from Oxford Nanopore Technology. It shows a double-strand piece of DNA being unzipped
and a single strand passing through a nanopore sensor. The pore sends an electrical signal to show how
much of the current running through the pore is blocked by individual nucleotides (the building blocks of
nucleic acid - DNA and RNA). Specialised software is used to decode the signal to read the sequence.

Alpha-Hemolysin (α-HL)
• A heptameric pore-forming toxin from Staphylococcus aureus.
• Self-assembles into a transmembrane channel with a ~1.5 nm diameter, ideal for ssDNA or
RNA translocation.
• Provides low-noise ionic current and high sensitivity, making it a gold standard for
nanopore-based detection.
• Used to detect DNA/RNA, peptides, and even protein variants by analyzing current
blockades as molecules pass through.
How nanopore sequencing works. The method uses electrophoresis to drive DNA strands or single nucleotides through a very small hole embedded in a membrane. An enzyme motor (not shown) controls the rate at which a DNA molecule passes through the nanopore. The sequence is determined in real time based on the extent to which the nucleotides disrupt the current flowing through a nanopore sensor.
Credit: Daniel Power.
FASTQ format

FASTQ is a text-based file format used to store both biological sequences (usually nucleotide
sequences) and their corresponding quality scores. It is the standard output format for
Next-Generation Sequencing (NGS) platforms like Illumina.

Structure of a FASTQ File


Each read in a FASTQ file is represented using four lines:
1. Line 1: Starts with @ followed by a sequence identifier and (optionally) a description.
2. Line 2: The nucleotide sequence (e.g., ATCG...).
3. Line 3: Starts with + and optionally repeats the sequence identifier.
4. Line 4: ASCII-encoded quality scores, one character for each nucleotide in the sequence.

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAA
+
!''*((((***+))%%%++)(%%%%).1**
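The four-line layout described above can be read with a few lines of Python. This is a minimal sketch rather than a replacement for an established parser: the names `FastqRecord` and `read_fastq` are illustrative, and it assumes unwrapped records (exactly four lines per read). It also checks that the quality line has one character per base, which every valid record must satisfy.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream with exactly four lines per read."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or a blank line) ends the stream
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Illustrative in-memory record; a real file would be opened with open(path).
example = (
    "@SEQ_ID\n"
    "GATTTGGGGTTCAAAGCAGTATCGATCAAA\n"
    "+\n"
    "!''*((((***+))%%%++)(%%%%).1**\n"
)
records = list(read_fastq(io.StringIO(example)))
print(records[0].identifier, len(records[0].sequence))
```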
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCC
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**
Decoding the quality string !''*((((***+))%%%++)(%%%%).1***-+*''))** (Phred+33 encoding: Q = ASCII code − 33):

! → 33 − 33 = 0
% → 37 − 33 = 4
' → 39 − 33 = 6
( → 40 − 33 = 7
) → 41 − 33 = 8
* → 42 − 33 = 9
+ → 43 − 33 = 10
- → 45 − 33 = 12
. → 46 − 33 = 13
1 → 49 − 33 = 16

Q values in read order (40 positions):
0 6 6 9 7 7 7 7 9 9 9 10 8 8 4 4 4 10 10 8 7 4 4 4 4 8 13 16 9 9 9 12 10 9 6 6 8 8 9 9

TOTAL = 306
AVERAGE = 306 / 40 = 7.65
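The same arithmetic can be checked in Python. The sketch below assumes Phred+33 encoding (quality = ASCII code minus 33), as in the calculation above; `phred33_scores` is an illustrative name.

```python
def phred33_scores(quality: str) -> list[int]:
    """Convert a Phred+33 quality string into a list of integer Q scores."""
    return [ord(ch) - 33 for ch in quality]


quality = "!''*((((***+))%%%++)(%%%%).1***-+*''))**"
scores = phred33_scores(quality)
total = sum(scores)
mean = total / len(scores)
print(total, mean)  # 306 7.65
```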
SAM/BAM format

The SAM (Sequence Alignment/Map) and BAM (Binary Alignment/Map) formats are
foundational in NGS for storing sequence alignment data from high-throughput
sequencing.
A SAM file has two main parts:
1. Header Section (optional, starts with @)
• Provides metadata like reference genome names and lengths
• Each line begins with @ and contains tab-separated fields
2. Alignment Section
• Each line corresponds to one read
• Has 11 mandatory fields + optional tags

• SAM (Sequence Alignment/Map) is a text-based format to store aligned sequence data.
• Contains metadata (header) and alignment information (mandatory fields).
• Output of aligners like BWA, Bowtie, HISAT2.
• Easily converted to BAM (binary version of SAM) for efficient storage.

• Two sections:
1. Header section (optional): starts with '@', contains metadata
2. Alignment section: one line per read, with 11 mandatory fields
• Optional fields provide extra info (e.g., tags like NM:i:1)

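FLAG Field (Bitwise Explanation)

• FLAG is the sum of bit values, each encoding one read property:
1 – Read is paired (paired-end read)
4 – Read is unmapped
16 – Read mapped to reverse strand
1024 – PCR or optical duplicate
• A FLAG of 0 means that no bit is set: an unpaired read mapped to the forward strand.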
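Because FLAG is a sum of bit values, a single number can carry several properties at once. The sketch below decodes a FLAG with bitwise AND. The bit meanings follow the SAM specification; `FLAG_BITS` and `decode_flag` are names chosen for this example.

```python
# Bit values defined by the SAM specification, with short descriptions.
FLAG_BITS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read on reverse strand",
    0x20: "mate on reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary alignment",
    0x200: "fails quality checks",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def decode_flag(flag: int) -> list[str]:
    """Return the description of every bit that is set in a SAM FLAG."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


print(decode_flag(0))     # no bits set: unpaired, mapped, forward strand
print(decode_flag(99))    # 1 + 2 + 32 + 64
print(decode_flag(1040))  # 16 + 1024
```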

Applications of SAM Format


• Primary output from aligners (e.g., BWA, HISAT2)
• Input for variant callers, quantifiers, and genome browsers
• Used in workflows for:
– Variant calling (GATK)
– Transcriptomics (featureCounts)
– Visualization (IGV)
• Can be converted, sorted, indexed using samtools

A delins variant definition will appear in a CIGAR string as a deletion followed by
an insertion, e.g., the 8D4I here:

>frg400 15M8D4I7M1X7M2X5M1X158M
CATTGGAACAGAAAGagatTTATCTGtTGTTTGCagTGAAGgAGTACAAAATG

The reverse-complement of this sequence, and its corresponding CIGAR, looks like
this:
>frg800 158M1X5M2X7M1X7M8D4I15M
CATTTTGTACTcCTTCActGCAAACAaCAGATAAatctCTTTCTGTTCCAATG
The delins sub-CIGAR 8D4I reads the same in both directions. The starting point
refers to the last nucleotide before the insertion, so a mapped starting point will
differ for all cases apart from 1DnI, where ‘n’ means any number.
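A CIGAR string can be split into (length, operation) pairs to see how many read bases and reference bases it covers. The following is a minimal sketch that assumes the standard SAM operation letters; `parse_cigar` and `cigar_lengths` are illustrative names. Applied to the two CIGAR strings above, it shows that both orientations cover the same number of bases.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
QUERY_OPS = set("MIS=X")  # operations that consume bases of the read
REF_OPS = set("MDN=X")    # operations that consume bases of the reference


def parse_cigar(cigar: str) -> list[tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Rebuilding the string catches characters the pattern skipped over.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"invalid CIGAR string: {cigar!r}")
    return ops


def cigar_lengths(cigar: str) -> tuple[int, int]:
    """Return (read bases consumed, reference bases consumed)."""
    ops = parse_cigar(cigar)
    query = sum(n for n, op in ops if op in QUERY_OPS)
    ref = sum(n for n, op in ops if op in REF_OPS)
    return query, ref


forward = "15M8D4I7M1X7M2X5M1X158M"
reverse = "158M1X5M2X7M1X7M8D4I15M"
print(cigar_lengths(forward), cigar_lengths(reverse))
```

The reference span is 4 bases longer than the read because the delins removes 8 reference bases and inserts 4.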

Typical File Sizes for Human Whole-Genome Sequencing

File Format | Typical Size | Description
Raw FASTQ files | ~150–200 GB (paired-end) |
SAM | ~400–500 GB | Plain text, uncompressed, very large due to verbose format
BAM | ~100–150 GB | Binary format, compressed, commonly used in pipelines
VCF file format
• VCF is the standard file format for storing variation data.
• VCF is a preferred format because it is unambiguous, scalable, and flexible, allowing
extra information to be added to the INFO field. Many millions of variants can be stored
in a single VCF file.
• VCF files are tab-delimited text files. They typically contain a header section with
metadata, followed by rows where each row represents a genetic variant.
• The format is human-readable and can be opened in any text editor, but specialized tools
are usually used for efficient analysis.
• VCF is widely used in genomics projects such as the 1000 Genomes Project and clinical
variant reporting.

The data section includes the following columns:
CHROM: Chromosome or reference sequence ID.
POS: Position of the variant on the chromosome (1-based).
ID: Unique identifier for the variant.
REF: Reference allele.
ALT: Alternate allele(s).
QUAL: Phred-scaled quality score for the variant call. Higher values mean more confidence; the score is not capped at 100.
FILTER: Indicates whether the variant passed filtering criteria.
INFO: Key-value pairs providing additional information about the variant, such as allele frequency, depth, etc.
FORMAT: Specifies the format of the sample-specific data. The GT in the FORMAT column tells us to expect genotypes in the following columns.
Sample Columns: Contain genotype information for each sample, following the format specified in the FORMAT column.
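To make the column layout concrete, the sketch below splits one tab-delimited VCF data line into the fields listed above. The record, the sample name `sample1`, and the function name `parse_vcf_line` are all made up for illustration; for real analyses a dedicated VCF library is the safer choice.

```python
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]


def parse_vcf_line(line: str, sample_names: list[str]) -> dict:
    """Parse one VCF data line into a dictionary keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, fields[:9]))
    record["POS"] = int(record["POS"])
    record["ALT"] = record["ALT"].split(",")  # several ALT alleles are comma-separated
    info = {}
    for item in record["INFO"].split(";"):
        key, _, value = item.partition("=")
        info[key] = value if value else True  # keys without a value act as flags
    record["INFO"] = info
    keys = record["FORMAT"].split(":")
    record["samples"] = {
        name: dict(zip(keys, column.split(":")))
        for name, column in zip(sample_names, fields[9:])
    }
    return record


# Hypothetical record, not taken from any real dataset.
line = "chr1\t12345\t.\tA\tG\t50\tPASS\tDP=30;AF=0.5\tGT:DP\t0/1:30"
rec = parse_vcf_line(line, ["sample1"])
print(rec["CHROM"], rec["POS"], rec["samples"]["sample1"]["GT"])
```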
Library preparation - Quality scores
Causes of DNA degradation
Mechanical damage during tissue homogenization.
Wrong pH and ionic strength of extraction buffer.
Incomplete removal of, or contamination with, nucleases.
Phenol: too old, or inappropriately buffered (should be pH 7.8 – 8.0); incomplete removal.
Wrong pH of DNA solvent (acidic water). Recommended: 1:10 TE for short-term storage, or 1x TE for long-term storage.
Vigorous pipetting (use wide-bore pipette tips).
Vortexing of DNA in high concentrations.
Too many freeze-thaw cycles (we tested 5, still OK).
Debatable: sequence-dependent degradation.

What are the main contaminants?

Polysaccharides, Lipopolysaccharides, Growth media residuals
Chitin, Protein, Secondary metabolites, Pigments, Growth media residuals
Chitin, Fats, Proteins, Pigments
Polyphenols, Polysaccharides, Secondary metabolites, Pigments
By Olga Vinnere Pettersson, Uppsala Genome Center, SciLifeLab
What do absorption ratios tell us?
Pure DNA 260/280: 1.8 – 2.0
< 1.8:
Too little DNA compared to other components of the solution; presence of organic
contaminants: proteins and phenol; glycogen - absorb at 280 nm.
> 2.0:
High share of RNA.

Pure DNA 260/230: 2.0 – 2.2
< 2.0:
Salt contamination, humic acids, peptides, aromatic compounds, polyphenols, urea,
guanidine, thiocyanates (latter three are common kit components) – absorb at 230 nm.
> 2.2:
High share of RNA, very high share of phenol, high turbidity, dirty instrument, wrong blank.

Photometrically active contaminants:
phenol, polyphenols, EDTA, thiocyanate, protein, RNA, nucleotides (fragments below 5 bp)
By Olga Vinnere Pettersson, Uppsala Genome Center, SciLifeLab
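The ratio ranges above translate directly into a small check. This is only a sketch: the function name `assess_purity` and the absorbance readings are hypothetical, and the thresholds are the ones quoted on the slide (1.8 – 2.0 for 260/280, 2.0 – 2.2 for 260/230).

```python
def assess_purity(a260: float, a280: float, a230: float) -> tuple[float, float, list[str]]:
    """Compute the 260/280 and 260/230 ratios and flag values outside the pure-DNA ranges."""
    ratio_280 = a260 / a280
    ratio_230 = a260 / a230
    notes = []
    if ratio_280 < 1.8:
        notes.append("260/280 below 1.8: possible protein or phenol contamination")
    elif ratio_280 > 2.0:
        notes.append("260/280 above 2.0: possible high share of RNA")
    if ratio_230 < 2.0:
        notes.append("260/230 below 2.0: possible salt or kit-component carry-over")
    elif ratio_230 > 2.2:
        notes.append("260/230 above 2.2: check for RNA, phenol, turbidity, or a wrong blank")
    return ratio_280, ratio_230, notes


# Hypothetical absorbance readings for two samples.
clean = assess_purity(a260=1.0, a280=0.55, a230=0.48)
dirty = assess_purity(a260=1.0, a280=0.65, a230=0.70)
print(clean)
print(dirty)
```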
FastQC

FastQC is a modular quality control tool designed to evaluate raw sequence data from high-
throughput sequencing platforms. It provides a quick visual and statistical summary to
identify potential issues before downstream analysis.

Input Requirements
Accepts FASTQ files (compressed or uncompressed)
Compatible with data from Illumina, Ion Torrent, Oxford Nanopore, PacBio, etc.

The main functions of FastQC are:
• Import of data from BAM, SAM or FastQ files (any variant)
• Providing a quick overview to tell you in which areas there may be problems
• Summary graphs and tables to quickly assess your data
• Export of results to an HTML based permanent report
• Offline operation to allow automated generation of reports without running the
interactive application
Good Illumina Data
X-axis: Position in read (bp). Represents base positions from 1 to 40 bp in the sequencing reads.
Y-axis: Phred Quality Score. A logarithmic measure of base calling accuracy: Q = -10 log₁₀(P), where P is the probability of an incorrect base call.
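The relationship Q = -10 log₁₀(P) can be applied in both directions. The sketch below uses illustrative function names and only the standard `math` module.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Phred score for an error probability p (requires 0 < p <= 1)."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    p = phred_to_error_probability(q)
    print(f"Q{q}: error probability {p:.4f}, accuracy {100 * (1 - p):.2f}%")
```

Every increase of 10 in Q corresponds to a tenfold drop in error probability, which is why Q20 (1 in 100) and Q30 (1 in 1000) are common thresholds.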
Bad Illumina Data
• Each box represents the interquartile range (IQR): 25th to 75th percentile.
• The red line inside the box is the median quality score.
• Whiskers show the 10th and 90th percentiles.
• The blue line shows the mean quality score per base position.
Adapter trimming
Adapter trimming is a crucial step in preprocessing NGS data.
It removes unwanted adapter sequences that can interfere with alignment, assembly, and
downstream analyses.

Why do adapters contaminate my sequences?


• Adapter ligation is essential during library preparation, where each DNA molecule is tagged
with adapters—typically after fragmentation in Illumina short-read protocols.
• Adapters serve multiple roles: they carry barcodes, primers for paired-end sequencing,
and sequences needed for flowcell binding and bridge amplification.
When are adapter sequences observed in the reads?
• In Illumina short-read sequencing, the 5′ adapter lies upstream of the read primer and is
not sequenced; only the DNA insert is typically captured.
• If the insert is shorter than the read length, sequencing extends into the 3′ adapter -
making adapter sequences visible only at the 3′ end and only under this condition.
Adapter trimming - cutadapt
Input: raw_reads.fastq
@read1
ACGTACGTAGATCGGAAGAGC
+
IIIIIIIIIIIIIIIIIIIII

cutadapt -a AGATCGGAAGAGC -o trimmed.fastq ./raw_reads.fastq

Cutadapt searches for the adapter AGATCGGAAGAGC inside the read.

ACGTACGTAGATCGGAAGAGC
        ↑ adapter starts here
Output
Trimmed read: ACGTACGT
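The idea behind 3′ adapter removal can be sketched in Python. This is a simplified illustration, not cutadapt's algorithm: it uses exact string matching only, whereas cutadapt also tolerates mismatches. The function name `trim_3prime_adapter` and the `min_overlap` value of 3 are assumptions made for this example.

```python
def trim_3prime_adapter(sequence: str, quality: str, adapter: str,
                        min_overlap: int = 3) -> tuple[str, str]:
    """Remove a 3' adapter (and everything after it) using exact matching."""
    cut = sequence.find(adapter)  # full adapter anywhere in the read
    if cut == -1:
        # Otherwise look for an adapter prefix running off the 3' end of the read.
        longest = min(len(adapter) - 1, len(sequence))
        for k in range(longest, min_overlap - 1, -1):
            if sequence.endswith(adapter[:k]):
                cut = len(sequence) - k
                break
    if cut == -1:
        return sequence, quality  # no adapter found, read unchanged
    return sequence[:cut], quality[:cut]  # trim bases and qualities together


adapter = "AGATCGGAAGAGC"
full = trim_3prime_adapter("ACGTACGTAGATCGGAAGAGC", "I" * 21, adapter)
partial = trim_3prime_adapter("ACGTACGTAGATC", "I" * 13, adapter)
print(full)     # ('ACGTACGT', 'IIIIIIII')
print(partial)  # ('ACGTACGT', 'IIIIIIII')
```

The quality string is cut at the same position as the sequence so that the FASTQ record stays valid.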
Adapter trimming - Trimmomatic
Trimmomatic is a versatile tool designed for preprocessing Illumina NGS data by trimming
low-quality bases and removing adapter sequences.
Core Functions of Trimmomatic
1. Adapter Removal (ILLUMINACLIP)
- Detects and removes adapter sequences using either palindrome or
simple matching.
- Example:
ILLUMINACLIP:TruSeq3-PE.fa:2:30:10
2: seed mismatches
30: palindrome clip threshold
10: simple clip threshold
2. Quality Trimming
- LEADING: Removes low-quality bases from the start of the read.
- TRAILING: Removes low-quality bases from the end.
- SLIDINGWINDOW: Scans the read with a sliding window and trims when average
quality drops below a threshold.
SLIDINGWINDOW:4:15
- Window size: 4 bases
- Quality threshold: 15

3. Length Filtering
- MINLEN: Discards reads shorter than a specified length after trimming.
MINLEN:36

4. Cropping Options
- CROP: Cuts the read to a fixed length.
- HEADCROP: Removes a fixed number of bases from the beginning.

Modes of Operation
- Paired-End Mode (PE): Processes forward and reverse reads together, maintaining pairing.
- Single-End Mode (SE): Processes individual reads.

Example Command (Paired-End)


java -jar trimmomatic-0.39.jar PE -phred33 \
input_forward.fq.gz input_reverse.fq.gz \
output_forward_paired.fq.gz output_forward_unpaired.fq.gz \
output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz \
ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 \
LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
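The SLIDINGWINDOW:4:15 and MINLEN:36 steps in the command above can be illustrated with a short Python sketch. It is a simplified model of the idea (cut the read at the first window whose mean quality falls below the threshold), not Trimmomatic's actual implementation; the function names and the example scores are made up.

```python
def sliding_window_keep(scores: list[int], window: int = 4, threshold: float = 15) -> int:
    """Return how many leading bases to keep.

    Scans from the 5' end and cuts at the start of the first window whose
    mean quality is below the threshold. Reads shorter than one window are
    left untouched (a simplification chosen for this sketch).
    """
    for start in range(0, len(scores) - window + 1):
        if sum(scores[start:start + window]) / window < threshold:
            return start
    return len(scores)


def passes_minlen(kept_length: int, min_len: int = 36) -> bool:
    """MINLEN-style filter: keep the read only if enough bases remain."""
    return kept_length >= min_len


# Hypothetical per-base qualities that decay towards the 3' end.
scores = [38, 37, 36, 35, 30, 20, 12, 8, 5, 2]
keep = sliding_window_keep(scores)
print(keep, scores[:keep], passes_minlen(keep))
```

Here the first window with a mean below 15 starts at index 5 (20, 12, 8, 5 averages 11.25), so five bases are kept, and the read would then be discarded by a MINLEN of 36.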
Filtering based on base quality

When sequencing DNA, the sequencer not only gives us the sequence of nucleotides (A, T,
G, C) but also a quality score for each base — this is called the Phred quality score. It tells us
how confident the machine is that a base was read correctly. For example, a score of 30
(Q30) means the base has a 1 in 1000 chance of being wrong (99.9% accuracy).

In filtering based on base quality, we remove or trim bases (or even entire reads) if their
quality score is below a chosen threshold. This helps reduce errors in downstream analysis,
like mapping or variant calling.

Example: If a read has Q-scores like 40 38 35 15 10 8, the last three bases have low
confidence, so we might trim them off.
This is important because poor-quality bases can cause false alignments or incorrect variant
calls.
Think of it like removing blurry sections from a photograph before you try to identify faces
— you only want the sharp, clear parts for analysis.
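The trimming example above can be written as a few lines of Python. This is a minimal sketch: the threshold of 20 is an assumed value chosen for illustration, and `trim_low_quality_tail` is not a function from any particular tool.

```python
def trim_low_quality_tail(scores: list[int], threshold: int = 20) -> list[int]:
    """Drop bases from the 3' end until one meets the quality threshold."""
    keep = len(scores)
    while keep > 0 and scores[keep - 1] < threshold:
        keep -= 1
    return scores[:keep]


read_scores = [40, 38, 35, 15, 10, 8]
trimmed = trim_low_quality_tail(read_scores)
print(trimmed)  # [40, 38, 35]
```

In a real pipeline the same cut position would be applied to the base sequence and the quality string together.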
Linux Basics

Linux is an open-source, Unix-like operating system. Its main component is the Linux
kernel. It was developed to provide a low-cost or free operating system for personal
computer users, and a typical system also includes the X Window System, the Emacs editor,
TCP/IP networking, a GUI, etc.

Linux distribution
• A Linux distribution is a packaged Linux system: the kernel bundled with a set of software.
Multiple distributions are available for different computing needs.
• The software in a distribution is selected for compatibility with the Linux kernel, so that
Linux can run on different kinds of systems, such as personal systems, embedded systems, etc.
• There are around 600 distributions available.

Each distribution has specialized packages installed to support specific tasks. This means you
can download software related to your field of work using a Linux distribution.

Some Linux distributions are: MX Linux, Manjaro, Linux Mint, elementary, Ubuntu, Debian,
Solus, Fedora, openSUSE, Deepin

No. | Command | Purpose | Example | Description
1 | pwd | Print current directory | pwd | Shows /mnt/c/Users/student/ngs_data to confirm you’re in your project folder.
2 | ls | List files/folders | ls *.fastq | Lists all FASTQ files from your RNA-seq dataset.
3 | cd | Change directory | cd | Moves to the folder containing sequencing files.
4 | mkdir | Make a directory | mkdir trimmed_reads | Creates a new folder for storing cleaned reads.
5 | cp | Copy files | cp sample1.fastq sample1_backup.fastq | Makes a backup copy of your original sample file.
6 | mv | Move/rename files | mv sample1.fastq raw_data/ | Moves sample1.fastq into the raw_data folder.
7 | head / more | View first lines | head -n 4 sample1.fastq | Shows the first read (four lines) from your sequencing file.
8 | wc | Count lines | wc -l sample1.fastq | Counts lines; divide by 4 for number of reads.
9 | grep | Search pattern | grep "ATCG" sample1.fastq | Finds sequences containing motif “ATCG”.
10 | less | Scroll through file | less sample1.fastq | Lets you scroll through a huge FASTQ file without opening it in Notepad.
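Two of the commands above, `wc -l` (divide by 4) and `grep`, have simple Python equivalents, which can be handy inside analysis scripts. The sketch below writes a tiny made-up FASTQ file to a temporary folder so that it runs anywhere; the function names and read contents are illustrative only.

```python
import tempfile
from pathlib import Path


def count_reads(path: Path) -> int:
    """Count reads the same way as `wc -l` divided by 4."""
    with open(path) as handle:
        return sum(1 for _ in handle) // 4


def sequences_with_motif(path: Path, motif: str) -> list[str]:
    """Return sequence lines (2nd line of each record) that contain the motif."""
    with open(path) as handle:
        return [line.strip() for i, line in enumerate(handle)
                if i % 4 == 1 and motif in line]


with tempfile.TemporaryDirectory() as tmp:
    fastq = Path(tmp) / "sample1.fastq"
    # Two hypothetical reads, four lines each.
    fastq.write_text("@read1\nGGATCGTT\n+\nIIIIIIII\n@read2\nTTTTAAAA\n+\nIIIIIIII\n")
    n_reads = count_reads(fastq)
    hits = sequences_with_motif(fastq, "ATCG")

print(n_reads, hits)
```

Unlike plain `grep`, which searches every line, this version looks only at sequence lines, so a motif-like string in a header or quality line is not reported.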
Linux Architecture

Components of Linux
Like any operating system, Linux consists of software, computer programs, documentation,
and hardware.
The main components of the Linux operating system are: Application, Shell, Kernel, Hardware,
Utilities.

1. Kernel
The kernel is the core component; it lies between the shell and the hardware and controls
the activity of the hardware components.
It virtualizes the common hardware resources and provides each process with the necessary
virtual resources.
The kernel is software that manages communication between the hardware and the rest of the
system. Users and applications do not access the hardware directly; instead, the kernel
handles that communication on their behalf.

The kernel is responsible for:
Memory management: Manages and allocates memory efficiently.
Resource allocation: Distributes system resources to different processes.
Device management: Controls input/output devices like printers and scanners.
Process management: Manages process execution and scheduling.
Application interaction: Bridges applications with system-level functions.
Security: Provides essential system-level security.
2. System Library
System libraries are sets of predefined functions through which application programs and
system utilities access the kernel's features. These libraries are the foundation upon which
other software is built.
Some of the most common system libraries are:
GNU C library (glibc): The C library that provides the fundamental interface for running C
programs, including many built-in functions.
libpthread (POSIX Threads): Supports multithreading in Linux; it allows programs to create
and manage multiple threads.
libdl (Dynamic Linker): Responsible for loading and linking shared libraries at runtime.
libm (Math Library): Provides mathematical functions.

Some other system libraries are: librt (Realtime Library), libcrypt (Cryptographic Library),
libnss (Name Service Switch Library), libstdc++ (C++ Standard Library)
3. Shell
The shell is also software; it is the interface to the kernel. It takes commands from the user and
interprets them, then passes the requests to the kernel, which performs the requested operations.
The user just enters a command, and the task is carried out using the kernel's functions.
1. Bourne Shell (sh) – One of the earliest Unix shells, it’s simple, reliable, and still used for basic scripting tasks.
2. C Shell (csh) – A shell designed with C-like syntax, offering history recall and interactive features but less suited for complex scripting.
3. Korn Shell (ksh) – A powerful, backward-compatible shell that blends Bourne and C shell features, popular in enterprise environments.
4. Bash (Bourne Again Shell) – The most common Linux shell, combining ease of use, scripting power, and interactive features like auto-completion.
5. Z Shell (zsh) – A highly customizable shell with advanced features, themes, and plugins, now the default on macOS.
6. Fish (Friendly Interactive Shell) – A beginner-friendly shell with real-time suggestions, syntax highlighting, and easy configuration.
4. Hardware Layer
• The hardware layer is the lowest level of the operating system stack.
• It plays a vital role in managing all the hardware components.
• It includes device drivers, kernel functions, memory management, CPU control, and I/O operations.
• This layer abstracts hardware complexity by providing an interface for software and ensuring that all components function properly.

5. System utility
• System utilities are command-line tools that perform tasks requested by the user to make system management and administration easier.
• These utilities enable the user to perform different tasks, such as file management, system monitoring, network configuration, user management, etc.
