Sequence comparison: Motivation
Finding similarity between sequences is important
for many biological questions.
• Find homologous proteins
– Allows to predict structure and function
• Locate similar subsequences in DNA
– e.g: allows to identify regulatory elements
• Locate DNA sequences that might overlap
– Helps in sequence assembly
Sequence Alignment
• Input: two sequences over the same alphabet
• Output: an alignment of the two sequences
• Two basic variants of sequence alignment:
Global – all characters in both sequences participate
• Needleman-Wunsch, 1970
Local – find related regions within sequences
• Smith-Waterman, 1981
Sequence Alignment - Example
Input:
GCGCATGGATTGAGCGA and TGCGCCATTGATGACCA
Possible output:
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
• Three elements:
– Perfect matches
– Mismatches
– Insertions & deletions (indel)
For example, the two hypothetical sequences
abcdefghajklm
abbdhijk
Given the payoff matrix
payoff = { match => 4,
could be aligned like this
mismatch => -3,
abcdefghajklm
gap_open => -2,
|| | | ||
gap_extend => -1 };
abbd...hijk
As shown, there are 6 matches,
2 mismatches, and one gap of length 3.
The sequences
abcdefghajklm
abbdhijk
are aligned and scored like this
a b c d e f g h a j k l m
| | | | | |
a b b d . . . h i j k
match 4 4 4 4 4 4
mismatch -3 -3
gap_open -2
gap_extend -1-1-1
for a total score of 24-6-2-3 = 13.
Scoring Function
• Score each position independently:
– Match: +1
– Mismatch: -1
– Indel: -2
• Score of an alignment is sum of position scores
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
Score: (+1x13) + (-1x2) + (-2x4) = 3
------GCGCATGGATTGAGCGA
TGCGCC----ATTGATGACCA--
Score: (+1x5) + (-1x6) + (-2x11) = -23
Needleman-Wunsch Method
The algorithm guarantees that no other
alignment of these two sequences has a
higher score under this payoff matrix.
Output:
An alignment of two sequences is represented by three lines
The first line shows the first sequence
The third line shows the second sequence.
The second line has a row of symbols.
The symbol is a vertical bar wherever characters in the two
sequences match, and a space where ever they do not.
Dots may be inserted in either sequence to represent gaps.
Needleman-Wunsch Method
Typical output file
Global: HBA_HUMAN vs HBB_HUMAN
Score: 290.50
HBA_HUMAN 1 VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFP 44
|:| :|: | | |||| : | | ||| |: : :| |: :|
HBB_HUMAN 1 VHLTPEEKSAVTALWGKV..NVDEVGGEALGRLLVVYPWTQRFFE 43
HBA_HUMAN 45 HF.DLS.....HGSAQVKGHGKKVADALTNAVAHVDDMPNALSAL 83
| ||| |: :|| ||||| | :: :||:|:: : |
HBB_HUMAN 44 SFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATL 88
HBA_HUMAN 84 SDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKF 128
|:|| || ||| ||:|| : |: || | |||| | |: |
HBB_HUMAN 89 SELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKV 133
HBA_HUMAN 129 LASVSTVLTSKYR 141
:| |: | ||
HBB_HUMAN 134 VAGVANALAHKYH 146
%id = 45.32 %similarity = 63.31
Overall %id = 43.15; Overall %similarity = 60.27
Needleman-Wunsch Method
Dynamic Programming
Potential difficulty. How does one come up with the optimal alignment in
the first place? We now introduce the concept of dynamic programming
(DP).
DP can be applied to a large search space that can be structured into a
succession of stages such that:
1) the initial stage contains trivial solutions to sub-problems
2) each partial solution in a later stage can be calculated by
recurring on only a fixed number of partial solutions in an
earlier stage.
3) the final stage contains the overall solution.
Three steps in Dynamic
Programming
1. Initialization
2 Matrix fill or scoring
3. Traceback and alignment
Two sequences will be aligned.
GAATTCAGTTA (sequence #1)
GGATCGA (sequence #2)
A simple scoring scheme will be used
Si,j = 1 if the residue at position I of sequence #1 is the same as
the residue at position j of the sequence #2 (called match score)
Si,j = 0 for mismatch score
w = gap penalty
Initialization step: Create Matrix with M + 1 columns
and N + 1 rows. First row and column filled with 0.
Matrix fill step: Each position Mi,j is defined to be the
MAXIMUM score at position i,j
Mi,j = MAXIMUM [
Mi-1, j-1 + si,,j (match or mismatch in the diagonal)
Mi, j-1 + w (gap in sequence #1)
Mi-1, j + w (gap in sequence #2)]
Fill in rest of row 1 and column 1
Fill in column 2
Fill in column 3
Column 3 with answers
Traceback step:
Position at current cell and look at direct predecessors
Seq#1 A
|
Seq#2 A
Traceback step:
Position at current cell and look at direct predecessors
Seq#1 G A A T T C A G T T A
| | | | | |
Seq#2 G G A T - C - G - - A
Over a decade after the initial publication of the Needleman-Wunsch algorithm, a
modification was made to allow for local alignments (Smith and Waterman, 1981).
In this adaptation, the alignment path does not need to reach the edges of the
search graph, but may begin and end internally. In order to accomplish this, 0 was
added as a term in the score calculation described by Needleman and Wunsch.
Smith-Waterman Algorithm (cont.)
• Only works effectively when gap C A G C C U C G C U U A G
penalties are used A 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
• In example shown A 0.0 1.0 0.7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.7
– match = +1 U 0.0 0.0 0.8 0.3 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.7
G 0.0 0.0 1.0 0.3 0.0 0.0 0.7 1.0 0.0 0.0 0.7 0.7 1.0
– mismatch = -1/3
C 1.0 0.0 0.0 2.0 1.3 0.3 1.0 0.3 2.0 0.7 0.3 0.3 0.3
– gap = -1+1/3k (k=extent of C 1.0 0.7 0.0 1.0 3.0 1.7 ?
gap) A
• Start with all cell values = 0 U
• Looks in subcolumn and subrow U
shown and in direct diagonal for G
a score that is the highest when A
you take alignment score or gap C
penalty into account G
G
Hij=max{Hi-1, j-1 +s(ai,bj), max{Hi-k,j -Wk}, max{Hi, j-l -Wl}, 0}
Smith-Waterman Algorithm (cont.)
• Four possible ways of forming a path
For every residue in the query sequence
1. Align with next residue of db sequence … score is previous
score plus similarity score for the two residues
2. Deletion (i.e. match residue of query with a gap) … score is
previous score minus gap penalty dependent on size of gap
3. Insertion (i.e. match residue of db sequence with a gap) …
score is previous score minus gap penalty dependent on size
of gap
4. Stop … score is zero
• Choose whichever of these is the highest
Smith-Waterman Algorithm (cont.)
C A G C C U C G C U U A G
Construct Alignment A 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
• The score in each cell is the A 0.0 1.0 0.7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.7
maximum possible score U 0.0 0.0 0.8 0.3 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.7
G 0.0 0.0 1.0 0.3 0.0 0.0 0.7 1.0 0.0 0.0 0.7 0.7 1.0
for an alignment of ANY
C 1.0 0.0 0.0 2.0 1.3 0.3 1.0 0.3 2.0 0.7 0.3 0.3 0.3
LENGTH ending at those
C 1.0 0.7 0.0 1.0 3.0 1.7 1.3 1.0 1.3 1.7 0.3 0.0 0.0
coordinates A 0.0 2.0 0.7 0.3 1.7 2.7 1.3 1.0 0.7 1.0 1.3 1.3 0.0
• Trace pathway back from U 0.0 0.7 1.7 0.3 1.3 2.7 2.3 1.0 0.7 1.7 2.0 1.0 1.0
highest scoring cell U 0.0 0.3 0.3 1.3 1.0 2.3 2.3 2.0 0.7 1.7 2.7 1.7 1.0
• This cell can be anywhere G 0.0 0.0 1.3 0.0 1.0 1.0 2.0 3.3 2.0 1.7 1.3 2.3 2.7
in the array A 0.0 1.0 0.0 1.0 0.3 0.7 0.7 2.0 3.0 1.7 1.3 2.3 2.0
C 1.0 0.0 0.7 1.0 2.0 0.7 1.7 1.7 3.0 2.7 1.3 1.0 2.0
• Align highest scoring G 0.0 0.7 1.0 0.3 0.7 1.7 0.3 2.7 1.7 2.7 2.3 1.0 2.0
segment G 0.0 0.0 1.7 0.7 0.3 0.3 1.3 1.3 2.3 1.3 2.3 2.0 2.0
GCC-UCG
GCCAUUG
Differences
• Needleman-Wunsch • Smith-Waterman
1. Global alignments 1. Local alignments
2. Requires alignment score for a pair of 2. Residue alignment score may be
residues to be >=0 positive or negative
3. No gap penalty required 3. Requires a gap penalty to work
effectively
4. Score cannot decrease between two 4. Score can increase, decrease or stay
cells of a pathway level between two cells of a pathway