INTRODUCTION TO BIOINFORMATICS
LECTURE NO 2
SEQUENCE ALIGNMENT
By:
Laila Sehar
National Center for Bioinformatics
Quaid-e-Azam university, Islamabad
Introduction
Sequence Alignment
Definition: “Procedure for comparing two or more
sequences by searching for a series of individual
characters or character patterns that are in the same
order in the sequences”
Pair-wise alignment: compare two sequences
Multiple sequence alignment: compare more
than two sequences
Example sequence alignment
Align “abcdef” with “abdgf”
• Write second sequence below the first
abcdef
abdgf
• Move sequences to give maximum match
between them
• Show characters that match using vertical
bar
Example sequence alignment
abcdef
||
abdgf
Insert gap between b and d on lower sequence to allow d
and f to align
abcdef
| | | |
ab-dgf
Note e and g don’t match
Why do sequence alignments?
To find whether two (or more) genes or
proteins are evolutionarily related to each
other
To find structurally or functionally similar
regions within proteins
Pairwise Sequence Alignment
Pairwise sequence alignment methods are used
to find the best-matching piecewise (local) or
global alignments of two query sequences.
Pairwise Sequence Alignment is used to
identify regions of similarity that may indicate
functional, structural and/or evolutionary
relationships between two biological
sequences (protein or nucleic acid).
Methods for Pair wise Alignment
Dynamic Programming:
Needlemanwunch algorithm
Smithwaterman algorithm
Dynamic Programming
Two method for sequence alignment.
Local sequence alignment
Global sequence alignment
Global vs. Local Alignment
We distinguish
Global alignment algorithms which optimize
overall alignment between two sequences .
We assume that the two proteins are basically
similar over the entire length of one another.
Local alignment algorithms which seek only
relatively conserved pieces of sequence.
Allign those parts of sequences that appear to have
good similarity.
Global vs. Local Alignment
Global
LGPSSKQTGKGS-SRIWDN
| | | | | | |
LN-ITKSAGKGAIMRLGDA
Local
--------GKG--------
| | |
--------GKG--------
Elements of Global Sequence Alignment
Alignment scores. The score for an alignment
is taken to be the sum of scores for aligned pairs
of letters, and scores for letters aligned with
nulls.
Substitution scores. Scores for aligned pairs
of letters are called substitution scores, whether
the letter aligned are identical or not.
Gap scores. The score for a letter aligned with
a null is called a gap score.
Significance
It is commonly used in bioinformatics to
align protein or nucleotide sequences.
Alignments try to tell the evolutionary story of the
proteins.
Sequence Alignment is used to identify regions of
similarity that may indicate functional, structural and/or
evolutionary relationships between two biological
sequences (protein or nucleic acid).
One motivation for database searching: given a query
sequence q, find the sequences in a database D which make
‘good’ alignments to q.
Pairwise Global Alignment
Algorithm
The Needleman-Wunsch
algorithm
The Needleman-Wunsch algorithm is a dynamic
programming algorithm for optimal sequence
alignment (Needleman and Wunsch, 1970).
The optimal path can be determined by
incremental extension of the optimal sub-paths.
In a Needleman-Wunsch alignment, the optimal
path must stretch from beginning to end in both
sequences (hence the term ‘global alignment’).
The key for understanding this approach is to
observe how the alignment problem can be
divided into subproblems.
Steps:
Initialization
Matrix fill or scoring
Trace back and alignment
Pairwise Global Alignment of sequences
Place each sequence along one axis
Place score 0 at the up-left corner
Fill in 1st row & column with gap penalty multiples
Fill in the matrix with max value of 3 possible moves:
Vertical move: Score + gap penalty
Horizontal move: Score + gap penalty
Diagonal move: Score + match/mismatch score
The optimal alignment score is in the lower-right corner
To reconstruct the optimal alignment, trace back where the max
at each step came from, stop when hit the origin.
Example of Global Alignment
Let gap = -2 Vertical move: Score + gap penalty
Horizontal move: Score + gap penalty
match = 1 Diagonal move: Score + match/mismatch score
mismatch = -1.
empty A A A C
empty 0 -2 -4 -6 -8
A -2 1 -1 -3 -5
G -4 -1 0 -2 -4
C -6 -3 -2 -1 -1
AAAC AAAC
A-GC -AGC
Finding the alignments that give
the highest score
The arrows constitutes paths in the matrix, and for
finding the highest-scoring alignments, we can follow
the paths from H(m,n) backwards to H(0,0).
if the arrow comes from H(i−1,j ), the column is
(qi ,−);
if the arrow comes from H(i,j−1), the column is
(−,dj );
if the arrow comes from H(i−1,j−1) , the column is
(qi ,dj ).
Example of Global Alignment
Let gap = -2
match = 1
mismatch = -1.
empty A A A C
empty 0 -2 -4 -6 -8
A -2 1 -1 -3 -5
G -4 -1 0 -2 -4
C -6 -3 -2 -1 -1
AAAC AAAC
A-GC -AGC
Local Alignment Algorithm
The Smith-Waterman algorithm
In 1981, T. F. Smith and M. S. Waterman
developed a new algorithm capable of
detecting also local similarity
Let us suppose to have a long DNA sequence
and want to isolate each subsequence similar to
each part of the yeast genome
Smith & Waterman
Place each sequence along one axis
Place score 0 at the up-left corner
Fill in 1st row & column with 0s
Fill in the matrix with max value of 4 possible values:
0
Vertical move: Score + gap penalty
Horizontal move: Score + gap penalty
Diagonal move: Score + match/mismatch score
The optimal alignment score is the max in the matrix
To reconstruct the optimal alignment, trace back where the MAX
at each step came from, stop when a zero is hit
Local Alignment
Let gap = -2
GATCACCT GATCACCT
match = 1 GATACCC GAT _ ACCC
mismatch = -1.
empty G A T C A C C T
empty
0 0 0 0 0 0 0 0 0
G 0 1 0 0 0 0 0 0 0
A 0 0 2 0 0 1 0 0 0
T 0 0 0 3 1 0 0 0 1
A 0 0 1 1 2 2 0 0 0
C 0 0 0 0 2 1 3 1 0
C 0 0 0 0 1 1 2 4 2
C 0 0 0 0 1 0 2 3 3
Vertical move: Score + gap penalty
Horizontal move: Score + gap penalty
Diagonal move: Score + match/mismatch score
Practice
Match = 3
CTTCAGC Mismatch= -2
Gap = -3
CTCAGC
Smith-Waterman Algorithm
Conclusion
The Needleman-Wunsch algorithm, for realizing global
alignments, and the technique by Smith and Waterman, for
local alignments, constitute the fundamental basis on
which numerous database search algorithms were built
BLAST
FASTA
Clustal
These algorithms use indexing techniques, heuristics, and
fast comparative methods to get a quick comparison
between a query sequence and an entire database